When it comes to predicting continuous outcomes, Linear Regression is the most popular algorithm. However, many of us do not know that there may be other, more efficient algorithms for this task.

Algorithms such as KNN, Decision Trees, and Gradient Boosting are well known for their use in classification problems. It turns out that they can also do a very good job on regression problems. In this notebook we will showcase these algorithms and evaluate their performance.

<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive">Outline</h2>

1. [Package imports](#imports)
2. [Quick quality check](#check)
3. [Building a pipeline function](#pipe)
4. [Comparing the performance of each model](#compare)

<a id = "imports"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive;">Package imports</h2>

In [None]:
#Data Wrangling
import numpy as np
import pandas as pd

#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#Model selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.pipeline import make_pipeline

#Model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

<a id = "check"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive;">Quick quality check</h2>

In [None]:
df = pd.read_csv("../input/rock-density-xray/rock_density_xray.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().transpose()

In [None]:
df.isna().sum()

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.scatterplot(data = df, x = df.columns[0], y = df.columns[1])
plt.show()

### **Results**:
1. The data is clean
2. All features are in the right type and format
3. No missing values
4. No outliers
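
One common way to back up a "no outliers" claim is the 1.5 × IQR rule. The sketch below is only an illustration: the `demo` DataFrame, its column names, and the helper `count_iqr_outliers` are hypothetical stand-ins, not part of this notebook's data or code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's DataFrame (hypothetical column names).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.uniform(0, 100, 200),
    "density": rng.normal(2.2, 0.3, 200),
})

def count_iqr_outliers(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the columns.
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()

print(count_iqr_outliers(demo))
```

Calling `count_iqr_outliers(df)` on the real data would give a per-column count that can be compared with what the scatter plot suggests.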

<a id = "pipe"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive;">Building a pipeline function</h2>

We will build a function to make the comparison between models easier. The function takes in the model along with the training and test data. It prints the evaluation metric, Root Mean Square Error (RMSE), together with the mean of the target for reference, and draws a scatter plot of the data with the fitted regression curve.
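
As a quick illustration of the metric itself, RMSE is the square root of the mean squared difference between true and predicted values. The numbers below are made up purely to show the arithmetic and are not taken from the rock density data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values, used only to illustrate the arithmetic.
y_true = np.array([2.0, 2.5, 1.8, 2.2])
y_pred = np.array([2.1, 2.3, 2.0, 2.2])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean_squared_error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)  # both approximately 0.15
```

Because RMSE is in the same units as the target, it is natural to read it next to the mean of the target, which is what the function prints.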

In [None]:
# Train Test Split
X = df['Rebound Signal Strength nHz'].values.reshape(-1,1)  
y = df['Rock Density kg/m3']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)

In [None]:
def run_model(model,X_train,y_train,X_test,y_test):
    
    # Fit Model
    model.fit(X_train,y_train)
    
    # Get Metrics
    
    preds = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test,preds))
    mean = df[df.columns[1]].mean()
    print(f'RMSE : {rmse}')
    print(f'MEAN Y : {mean}')
    
    
    # Plot results
    signal_range = np.arange(0,100)
    output = model.predict(signal_range.reshape(-1,1))
    
    
    plt.figure(figsize=(12,6),dpi=150)
    sns.scatterplot(x=df.columns[0],y=df.columns[1],data=df,color='black')
    plt.plot(signal_range,output)

<a id = "compare"></a>
<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive;">Comparing the performance of each model</h2>

#### **1. Linear Regression**

In [None]:
lr = LinearRegression()
run_model(lr,X_train,y_train,X_test,y_test)

The Root Mean Square Error (RMSE) is 0.257. Compared to a mean of 2.225, it seems that the model is doing a very good job. However, when we look at the graph we can see that the model is not picking up any signal from the data, i.e. it did a very poor job.
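
One way to make this point without a plot is to compare against a baseline that always predicts the mean. The sketch below uses synthetic periodic data (an assumption, chosen to resemble a signal a straight line cannot follow) and scores both models on the data they were fitted on, for simplicity.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic periodic data standing in for the real measurements (assumption).
rng = np.random.default_rng(42)
X_demo = np.linspace(0, 100, 300).reshape(-1, 1)
y_demo = 2.2 + 0.4 * np.sin(X_demo.ravel() / 5) + rng.normal(0, 0.05, 300)

def rmse_of(model):
    """Fit on the demo data and return the RMSE on that same data."""
    model.fit(X_demo, y_demo)
    return np.sqrt(mean_squared_error(y_demo, model.predict(X_demo)))

baseline_rmse = rmse_of(DummyRegressor(strategy="mean"))  # always predicts mean(y)
linear_rmse = rmse_of(LinearRegression())

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"linear RMSE  : {linear_rmse:.3f}")
```

If the linear model's RMSE is barely below the baseline's, a small RMSE relative to the mean says more about the narrow spread of the target than about the quality of the fit.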

#### **2. Polynomial Regression**

In [None]:
#Third degree polynomial: play around with the degree yourself
pipe = make_pipeline(PolynomialFeatures(3),LinearRegression())
run_model(pipe,X_train,y_train,X_test,y_test)

In [None]:
#Seventh degree polynomial: play around with the degree yourself
pipe = make_pipeline(PolynomialFeatures(7),LinearRegression())
run_model(pipe,X_train,y_train,X_test,y_test)

For the higher-degree polynomial, the Root Mean Square Error (RMSE) is 0.136, so the model is doing a very good job. The graph also shows that the model fits the data pretty well.
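
To see what the first step of the pipeline does, the small sketch below expands two made-up input values with `PolynomialFeatures(degree=3)`. Each value `x` becomes the row `[1, x, x^2, x^3]`, and `LinearRegression` then fits one coefficient per column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up inputs, shaped (n_samples, 1) like the notebook's X.
x = np.array([[2.0], [3.0]])

# Degree 3 with the default bias column: [1, x, x^2, x^3]
expanded = PolynomialFeatures(degree=3).fit_transform(x)
print(expanded)
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```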

#### **3. KNN Regression**

In [None]:
k_values = [1,5,10]
for n in k_values:
    model = KNeighborsRegressor(n_neighbors=n)
    run_model(model,X_train,y_train,X_test,y_test)

With the number of neighbors set to 1, the model picks up a lot of noise. As we increase it to 10, we allow more bias, and the model starts to work better.
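
The noise-fitting behaviour at `n_neighbors=1` can be checked numerically: every training point is its own nearest neighbor, so the training error is exactly zero. The sketch below uses synthetic data (an assumption) and measures RMSE on the training set only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic noisy periodic data (assumption), with unique X values.
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 100, 200).reshape(-1, 1)
y_demo = 2.2 + 0.4 * np.sin(X_demo.ravel() / 5) + rng.normal(0, 0.1, 200)

train_rmse = {}
for k in [1, 5, 10]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_demo, y_demo)
    # Error on the same data the model was fitted on.
    train_rmse[k] = np.sqrt(mean_squared_error(y_demo, knn.predict(X_demo)))

print(train_rmse)
```

A training RMSE of zero at k = 1 means the model memorizes the noise. Larger k averages over neighbors, which raises the training error but smooths the curve.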

#### **4. Decision Tree Regression**

In [None]:
model = DecisionTreeRegressor()
run_model(model,X_train,y_train,X_test,y_test)

The model has only one independent variable (feature), so there is little room for hyperparameter tuning. Although the RMSE is low, the model is picking up a lot of noise, so it is unlikely to generalize well to new data at a larger scale.
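
The noise fitting comes from the default settings: an unrestricted tree keeps splitting until its leaves are pure. A minimal sketch on synthetic data (an assumption) compares the default tree with one capped by `max_depth=3`, which is one common way to limit this behaviour.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy periodic data (assumption).
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 100, 200).reshape(-1, 1)
y_demo = 2.2 + 0.4 * np.sin(X_demo.ravel() / 5) + rng.normal(0, 0.1, 200)

# Default tree: grows until leaves are pure.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Depth-limited tree: at most 2**3 = 8 leaves, so a much coarser step function.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

print("leaves (default)    :", full_tree.get_n_leaves())
print("leaves (max_depth=3):", shallow_tree.get_n_leaves())
```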

#### **5. Random Forest Regression**

In [None]:
trees = [10,50,100]
for n in trees:
    model = RandomForestRegressor(n_estimators=n)
    run_model(model,X_train,y_train,X_test,y_test)

10 trees is very low, and 100 is very large. 50 trees seems to be a good choice.
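
Rather than judging the number of trees by eye, the choice could be cross-validated with `GridSearchCV`, which is imported at the top of the notebook. The following is a sketch on synthetic data (an assumption), not a result for the rock density data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data in random order (assumption), so plain K-fold splits are fair.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, 200).reshape(-1, 1)
y_demo = 2.2 + 0.4 * np.sin(X_demo.ravel() / 5) + rng.normal(0, 0.1, 200)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 50, 100]},
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=5,
)
search.fit(X_demo, y_demo)

print("best n_estimators:", search.best_params_["n_estimators"])
print("cross-validated RMSE:", -search.best_score_)
```

The same pattern works on `X_train` and `y_train` from the split above, keeping the test set untouched until the final evaluation.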

#### **6. Gradient Boosting**

In [None]:
model = GradientBoostingRegressor()
run_model(model,X_train,y_train,X_test,y_test)

#### **7. Adaboost**

In [None]:
model = AdaBoostRegressor()
run_model(model,X_train,y_train,X_test,y_test)

In this notebook we tried different regression techniques on a very simple data set in order to illustrate how each regression model differs from the rest. All of them perform very similarly, but ensemble methods seem to be the best, specifically Gradient Boosting. We did not do any hyperparameter tuning; this might be done in following notebooks.
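
A side-by-side summary makes this kind of comparison easier to read. The sketch below collects the test RMSE of several models into one sorted table. It runs on synthetic data (an assumption), so its numbers say nothing about the rock density results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy periodic data (assumption).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, 300).reshape(-1, 1)
y_demo = 2.2 + 0.4 * np.sin(X_demo.ravel() / 5) + rng.normal(0, 0.1, 300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=101
)

models = {
    "Linear Regression": LinearRegression(),
    "KNN (k=10)": KNeighborsRegressor(n_neighbors=10),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

# Lowest test RMSE first.
summary = pd.Series(scores, name="test RMSE").sort_values()
print(summary)
```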