## 03. Choosing the Right Estimator/Algorithm for Our Problem
Scikit-Learn uses "estimator" as another term for a machine learning model or algorithm.
 * Classification: predicting whether a sample is one thing or another
 * Regression: predicting a number

Check the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) for guidance on which estimator to try.
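To make the two problem types concrete, here is a minimal sketch of the shared estimator interface (`fit` then `predict`). The synthetic datasets and the choice of `LogisticRegression` and `Ridge` are assumptions for illustration only; they are not the datasets used later in this notebook.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge

# Synthetic data stands in for a real dataset (assumption for illustration)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Classification: the estimator predicts a discrete class label
clf_demo = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
class_predictions = clf_demo.predict(X_clf[:5])

# Regression: the estimator predicts a continuous number
reg_demo = Ridge().fit(X_reg, y_reg)
number_predictions = reg_demo.predict(X_reg[:5])

print(class_predictions)   # class labels (0 or 1 here)
print(number_predictions)  # floating-point values
```

Both estimators are trained and queried the same way; only the kind of value returned by `predict` differs.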

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [24]:
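# 3.1 Picking a machine learning model for a regression problem
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release.
from sklearn.datasets import load_boston
boston = load_boston()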

In [25]:
boston_df = pd.DataFrame(data=boston["data"], columns=boston["feature_names"])
boston_df["TARGET"] = boston["target"]
boston_df

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,TARGET
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48,22.0


In [26]:
boston_df.shape

(506, 14)

In [28]:
boston_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  TARGET   506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB


In [30]:
boston_df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,TARGET
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [33]:
X = boston_df.drop("TARGET", axis=1)
y = boston_df.TARGET

In [38]:
# selecting model 
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [39]:
model = Ridge()
model.fit(X_train, y_train);

In [56]:
print(f"{model.score(X_test, y_test) * 100:.2f}%")

66.62%
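For regressors, `score` returns the coefficient of determination (R²), so the 66.62% above is R² × 100 rather than an accuracy. The sketch below, on synthetic data (an assumption for illustration), shows that `score` matches both `r2_score` and a manual calculation of `1 - SS_res / SS_tot`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption: stands in for the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

ridge_demo = Ridge().fit(X_tr, y_tr)
y_pred = ridge_demo.predict(X_te)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_metric = r2_score(y_te, y_pred)
r2_from_score = ridge_demo.score(X_te, y_te)
print(r2_manual, r2_metric, r2_from_score)
```

An R² of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean of the targets.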


How do we improve this score?

What if Ridge (a linear model) wasn't working?

In [58]:
# Attempting to find patterns using the RandomForestRegressor
# since it is a regression based problem
from sklearn.ensemble import RandomForestRegressor

# splitting the data into features and labels
np.random.seed(42)
X = boston_df.drop("TARGET", axis=1)
y = boston_df.TARGET

# splitting the data into TRAIN & TEST samples with 20% for the test size
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# fit the model
model = RandomForestRegressor()
model.fit(X_train, y_train);

# score the model
model.score(X_test, y_test)

0.8654448653350507
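A single train/test split can flatter or penalize a model by chance. One way to compare candidates more fairly is cross-validation. The sketch below is not part of the original workflow; it assumes a synthetic nonlinear dataset (`make_friedman1`) and a reduced `n_estimators=50` to keep it quick.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic nonlinear regression data (assumption for illustration)
X_demo, y_demo = make_friedman1(n_samples=300, noise=1.0, random_state=42)

# Candidate estimators to compare on identical folds
candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, estimator in candidates.items():
    # The default scoring for regressors is R^2, the same metric as .score()
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Looking at the mean and spread across five folds gives a steadier basis for choosing between estimators than one score from one split.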

### 3.1 Choosing an Estimator for Classification

In [59]:
heart_disease = pd.read_csv("./data/heart-disease.csv")
heart_disease.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Following the Scikit-Learn map, the recommended starting point for this problem is `LinearSVC`.

In [64]:
# import the LinearSVC estimator class
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split as splitter

# set up the random seed
np.random.seed(42)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# splitting the data for TRAIN & TEST samples
X_train, X_test, y_train, y_test = splitter(X, y, test_size=0.2)

lin_svc_clf = LinearSVC(max_iter=100000000)

lin_svc_clf.fit(X_train, y_train)
lin_svc_clf.score(X_test, y_test)

0.8688524590163934

In [73]:
clf = LinearSVC(max_iter=1000, dual=False)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8688524590163934
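Needing a very large `max_iter` is often a sign that the solver is struggling with features on very different scales. A common remedy, not used in the cells above, is to standardize the features first. The sketch below assumes synthetic data with one deliberately inflated feature; it illustrates the pattern rather than reproducing the heart-disease result.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary classification data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_demo[:, 0] *= 1000  # exaggerate the scale of one feature

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling is fitted on the training data only, then applied to the test data
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(dual=False, max_iter=1000))
scaled_svc.fit(X_tr, y_tr)

pipeline_score = scaled_svc.score(X_te, y_te)
print(f"Accuracy with scaling: {pipeline_score:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the test set out of the scaling statistics, which avoids leaking information from test to train.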

#### 3.1.1 Attempting with RandomForestClassifier
`LinearSVC` may not be the best fit for this classification problem, so we also try an ensemble classifier, `RandomForestClassifier`, and compare the scores.

In [76]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8360655737704918

In [75]:
heart_disease["target"].value_counts()

1    165
0    138
Name: target, dtype: int64
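The counts above show the classes are roughly balanced, which is why plain accuracy is a reasonable first metric here. The sketch below rebuilds a label column with the same counts (the single placeholder feature is hypothetical) to show the class proportions and how `stratify` could keep them similar in the test split. The cells above do not use stratification; this is one optional refinement.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the output above
y_demo = pd.Series([1] * 165 + [0] * 138, name="target")
# Hypothetical placeholder feature so the split has something to return
X_demo = pd.DataFrame({"feature": range(len(y_demo))})

# Class proportions rather than raw counts
proportions = y_demo.value_counts(normalize=True)
print(proportions)

# stratify=y_demo keeps the class ratio close to the original in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print(y_te.value_counts(normalize=True))
```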