### 2. Choosing the right estimator/algorithm for our problem
Scikit-Learn uses "estimator" as another term for a machine learning model or algorithm.
* Classification: predicting whether a sample is one thing or another
* Regression: predicting a number
#### Consult the ML map https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html and choose an estimator according to the number of samples in the data
#### 2.1 Picking a machine learning model for a regression problem

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
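# Import the Boston Housing dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written.
from sklearn.datasets import load_boston
boston=load_boston()
boston;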

In [4]:
boston_df=pd.DataFrame(boston['data'],columns=boston['feature_names'])
boston_df['target']=pd.Series(boston['target'])
boston_df.head()

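| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |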


In [5]:
# Features
X=boston_df.drop('target',axis=1)
# Label
y=boston_df['target']
len(boston_df)

506

In [6]:
# Let's try the Ridge Regression model
from sklearn.linear_model import Ridge
np.random.seed(42)
ridge=Ridge()

In [7]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,
                                              y,
                                              test_size=0.2)
ridge.fit(X_train,y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [8]:
ridge.score(X_test,y_test)

0.6662221670168521
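
For scikit-learn regressors, `score()` returns the coefficient of determination (R^2) of the predictions, so the value above is an R^2 rather than an accuracy. A minimal sketch of that equivalence, assuming synthetic data from `make_regression` as a stand-in for the housing data (the `_demo` names are illustrative and not from the notebook):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

ridge_demo = Ridge().fit(X_tr, y_tr)

# score() on a regressor and r2_score() on its predictions give the same number
score_value = ridge_demo.score(X_te, y_te)
r2_value = r2_score(y_te, ridge_demo.predict(X_te))
print(score_value, r2_value)
```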

## Let's try another model to improve the score
Consult the ML map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [9]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)

rf=RandomForestRegressor(n_estimators=100)
rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [10]:
rf.score(X_test,y_test)

0.8896648705127477

In [11]:
ridge.score(X_test,y_test)

0.6662221670168521
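
A comparison like this is fairest when every estimator sees the same split and the randomness is pinned down. The sketch below shows one way to do that, again on synthetic `make_regression` data rather than the housing data. Passing `random_state` directly is an alternative to calling `np.random.seed`, and the `models` dictionary is an illustrative helper, not something from the notebook:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# Hypothetical lookup of candidate estimators, keyed by a display name
models = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100,
                                                   random_state=42),
}

# Fit every candidate on the same training split, score on the same test split
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
          for name, model in models.items()}
for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

Which model comes out ahead depends on the data, so the ranking on this synthetic set need not match the housing result above.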

# 4 Regression model evaluation metrics

Model evaluation metrics documentation - https://scikit-learn.org/stable/modules/model_evaluation.html

* R^2 (pronounced r-squared), or coefficient of determination (closer to 1 is better)
* Mean absolute error (MAE) (lower is better)
* Mean squared error (MSE) (lower is better)

**R^2**

What R^2 does: it compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
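
The three cases can be checked by hand with the usual definition, R^2 = 1 - SS_res / SS_tot, where SS_res is the squared error of the model and SS_tot is the squared error of always predicting the mean. This is a small sketch on made-up numbers; `r2_by_hand` and `y_toy` are illustrative names, not part of scikit-learn or the notebook:

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (illustrative helper, not a scikit-learn function)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1 - ss_res / ss_tot

y_toy = [3, 5, 7, 9]                     # made-up targets, mean is 6
print(r2_by_hand(y_toy, [6, 6, 6, 6]))   # always predicting the mean -> 0.0
print(r2_by_hand(y_toy, y_toy))          # perfect predictions -> 1.0
print(r2_by_hand(y_toy, [9, 7, 5, 3]))   # worse than the mean -> -3.0
```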

In [16]:
from sklearn.metrics import r2_score

# Fill an array with y_test mean
y_test_mean=np.full(len(y_test),y_test.mean())
y_test_mean[:10]

array([21.48823529, 21.48823529, 21.48823529, 21.48823529, 21.48823529,
       21.48823529, 21.48823529, 21.48823529, 21.48823529, 21.48823529])

In [17]:
y_test.mean()

21.488235294117654

In [20]:
r2_score(y_test,y_test_mean)

2.220446049250313e-16

In [19]:
r2_score(y_test,y_test)

1.0

**Mean absolute error (MAE)**

MAE is the average of the absolute differences between predictions and actual values. It gives us an idea of how wrong our model's predictions are, in the same units as the target.

In [12]:
y_preds=rf.predict(X_test)
y_preds[:10]

array([22.877, 30.517, 16.437, 23.531, 16.918, 21.438, 19.274, 15.797,
       21.101, 20.942])

In [13]:
np.array([y_test[:10]])

array([[23.6, 32.4, 13.6, 22.8, 16.1, 20. , 17.8, 14. , 19.6, 16.8]])

In [21]:
# Compare the predictions to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_preds)

2.04936274509804
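
To see what `mean_absolute_error` computes, the same calculation can be done by hand on the first ten prediction and actual pairs printed above. This is only a sketch: `preds_10`, `actual_10` and `mae_10` are illustrative names, and the values are copied from the outputs rather than recomputed.

```python
import numpy as np

# First ten predictions and true values, copied from the outputs above
preds_10 = np.array([22.877, 30.517, 16.437, 23.531, 16.918,
                     21.438, 19.274, 15.797, 21.101, 20.942])
actual_10 = np.array([23.6, 32.4, 13.6, 22.8, 16.1,
                      20.0, 17.8, 14.0, 19.6, 16.8])

# MAE: take the absolute value of each error, then average
mae_10 = np.abs(preds_10 - actual_10).mean()
print(mae_10)  # about 1.73 for these ten samples
```

The result differs from the 2.05 reported for the full test set because it covers only the first ten test samples.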

In [23]:
df=pd.DataFrame(data={'actual values':y_test,
                     'predicted values':y_preds})
df['differences']=df['predicted values']-df['actual values']
df

| | actual values | predicted values | differences |
|---|---|---|---|
| 173 | 23.6 | 22.877 | -0.723 |
| 274 | 32.4 | 30.517 | -1.883 |
| 491 | 13.6 | 16.437 | 2.837 |
| 72 | 22.8 | 23.531 | 0.731 |
| 452 | 16.1 | 16.918 | 0.818 |
| ... | ... | ... | ... |
| 412 | 17.9 | 12.909 | -4.991 |
| 436 | 9.6 | 12.849 | 3.249 |
| 411 | 17.2 | 13.211 | -3.989 |
| 86 | 22.5 | 20.514 | -1.986 |


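**Mean squared error (MSE)**

MSE is the average of the squared differences between predictions and actual values. Squaring makes large errors count more than they do in MAE.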

In [24]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_preds)

8.091292460784315

In [25]:
# calculate mse by hand
mse=np.square(df['differences']).mean()
mse

8.091292460784317
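
Because MSE is expressed in squared target units, a common follow-up (not part of the notebook above) is the root mean squared error, RMSE, which brings the value back to the units of the target. A minimal sketch using the MSE value printed above; `mse_value` and `rmse_value` are illustrative names:

```python
import math

mse_value = 8.091292460784315      # MSE reported by mean_squared_error above
rmse_value = math.sqrt(mse_value)  # RMSE: square root of the MSE
print(rmse_value)
```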