## Choosing the Right Estimator or Algorithm for Your Problem

### Scikit-Learn uses "estimator" as another name for a machine learning model or algorithm
Once you've got your data ready, the next step is to choose an appropriate machine learning algorithm or model to find patterns in your data.

Some things to note:
* Scikit-Learn (often shortened to sklearn) refers to machine learning models and algorithms as estimators.
* **Classification problem** - predicting whether a sample is one thing or another (e.g. heart disease or no heart disease).
    * Sometimes you'll see `clf` (short for classifier) used as the variable name for a classification estimator instance.
* **Regression problem** - predicting a number (e.g. the selling price of a car).
* **Unsupervised problem** - clustering (grouping unlabelled samples with other similar unlabelled samples).
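
As a rough illustration of the three problem types above, the sketch below fits one estimator of each kind on small synthetic datasets. The data generators (`make_classification`, `make_regression`, `make_blobs`) and the particular estimators are illustrative choices, not ones prescribed by this section.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a discrete label (0 or 1) for each sample
X_clf, y_clf = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
class_predictions = clf.predict(X_clf[:3])

# Regression: predict a continuous number for each sample
X_reg, y_reg = make_regression(n_samples=100, n_features=5, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
number_predictions = reg.predict(X_reg[:3])

# Unsupervised (clustering): group unlabelled samples, so no y is passed to fit
X_blobs, _ = make_blobs(n_samples=100, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)
cluster_ids = km.predict(X_blobs[:3])

print(class_predictions, number_predictions, cluster_ids)
```

All three follow the same pattern: create the estimator, call `fit`, then call `predict`.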

If you know what kind of problem you're working with, one of the next places you should look at is the [Scikit-Learn algorithm cheatsheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

This cheatsheet gives you a bit of an insight into the algorithm you might want to use for the problem you're working on.

It's important to remember, you don't have to explicitly know what each algorithm is doing on the inside to start using them. If you do start to apply different algorithms but they don't seem to be working, that's when you'd start to look deeper into each one.
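
Because every estimator exposes the same `fit` and `score` methods, trying several candidates without knowing their internals can be as simple as a loop. A minimal sketch on synthetic data follows; the dataset and the three candidate estimators are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for a real problem
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate estimators, all used through the same interface
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = estimator.score(X_test, y_test)  # mean accuracy

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```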

### Classification
**1.1 Choosing an Estimator for a Classification Problem**

Let's check out the choosing process for a classification problem.

Say you were trying to predict whether or not a patient had heart disease based on their medical records.

Referring to the map, it suggests trying `LinearSVC`.

In [3]:
import pandas as pd
import numpy as np
%matplotlib inline

In [4]:
heart_disease = pd.read_csv('Dataset/heart-disease.csv')
heart_disease.head(5)

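|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |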


In [6]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Set up the random seed
np.random.seed(42)

# Prepare the data: features (x) and labels (y)
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate and fit LinearSVC
clf = LinearSVC(max_iter=1000)
clf.fit(x_train, y_train)

# Evaluate LinearSVC (mean accuracy on the test set)
clf.score(x_test, y_test)



0.4918032786885246
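
A score close to 0.5 on a two-class problem suggests the model learned little. One common cause with linear models such as `LinearSVC` is unscaled features, which can also keep the solver from converging within `max_iter`. Whether that is what happened here has not been verified. The sketch below only shows the general remedy of putting a `StandardScaler` in front of the estimator, using synthetic data as a stand-in for the heart disease table; the column multipliers and `max_iter=10000` are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data; columns are stretched to very different ranges
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied before LinearSVC
scaled_clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_clf.fit(X_train, y_train)

scaled_score = scaled_clf.score(X_test, y_test)
print(f"Accuracy with scaling: {scaled_score:.3f}")
```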

In [7]:
heart_disease["target"].value_counts()

1    165
0    138
Name: target, dtype: int64
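
These counts give a useful reference point: a model that always predicts the majority class would already be right a little over half the time. The arithmetic below uses the counts printed above. It is computed over the full dataset rather than the 20% test split, so treat it as an approximate baseline only.

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 165, 0: 138}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier worth keeping should score comfortably above this baseline.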

In [8]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Set up the random seed
np.random.seed(42)

# Prepare the data: features (x) and labels (y)
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate and fit RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(x_train, y_train)

# Evaluate RandomForestClassifier (mean accuracy on the test set)
clf.score(x_test, y_test)

0.8524590163934426
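
A single train/test split can flatter or penalise a model by chance. One hedged way to check a score like the one above is k-fold cross-validation with `cross_val_score`. The sketch uses synthetic data rather than `heart-disease.csv`, and the 5-fold setting is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data with 13 features
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit and score the model on 5 different train/test partitions
cv_scores = cross_val_score(clf, X, y, cv=5)

print("Fold accuracies:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```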

### Regression

**2.1 Choosing a Machine Learning Model for a Regression Problem**


In [9]:
# Import the Boston Housing dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston = load_boston()
boston

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 3

In [10]:
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston['target'])
boston_df.head()

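|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|--------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |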


In [11]:
len(boston_df)

506
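
On scikit-learn 1.2 and later, where `load_boston` is no longer available, the same DataFrame-building pattern works with any bundled dataset. The sketch below uses `load_diabetes` purely as a stand-in. It is a different dataset, so scores obtained from it would not match the ones reported in this section.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Load a bundled regression dataset (no download required)
diabetes = load_diabetes()

# Build a DataFrame from the feature matrix, then append the target column
diabetes_df = pd.DataFrame(diabetes["data"], columns=diabetes["feature_names"])
diabetes_df["target"] = diabetes["target"]

print(diabetes_df.shape)
```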

In [12]:
# Let's try the Ridge regression model
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Set up the random seed
np.random.seed(42)

# Create the data: features (x) and target (y)
x = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate and fit the Ridge model
model = Ridge()
model.fit(x_train, y_train)

# Check the score (R^2) of the Ridge model on the test data
model.score(x_test, y_test)

0.6662221670168521
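
For regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, not an accuracy. The sketch below recomputes it by hand to show what the number above measures. It uses synthetic data from `make_regression` as a stand-in for the housing data, which is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Ridge().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residual sum of squares: error left over after the model's predictions
ss_res = np.sum((y_test - y_pred) ** 2)
# Total sum of squares: error of always predicting the test-set mean
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(f"Manual R^2:    {manual_r2:.6f}")
print(f"model.score(): {model.score(X_test, y_test):.6f}")
```

A value of 1.0 means perfect predictions, and 0.0 means the model does no better than predicting the mean of the test targets.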

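How do we improve the score? What if Ridge isn't working well enough?

Refer back to the [Scikit-Learn algorithm cheatsheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) and try another estimator, such as an ensemble regressor.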

In [13]:
# Let's try the Random Forest regressor
from sklearn.ensemble import RandomForestRegressor

# Set up the random seed
np.random.seed(42)

# Create the data: features (x) and target (y)
x = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate and fit RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(x_train, y_train)

# Evaluate the Random Forest regressor (R^2 on the test set)
rf.score(x_test, y_test)

0.873969014117403

In [14]:
# Check the Score of Ridge Model Again
model.score(x_test, y_test)

0.6662221670168521
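
The Ridge score is unchanged because `np.random.seed(42)` was called again before the second `train_test_split`, so both cells produced the same split. When `random_state` is not passed, scikit-learn falls back to NumPy's global random state, which this small sketch on made-up arrays illustrates:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features each
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# First split after seeding the global NumPy random state
np.random.seed(42)
_, X_test_first, _, y_test_first = train_test_split(X, y, test_size=0.2)

# Reseed with the same value and split again
np.random.seed(42)
_, X_test_second, _, y_test_second = train_test_split(X, y, test_size=0.2)

same_split = np.array_equal(X_test_first, X_test_second) and np.array_equal(
    y_test_first, y_test_second
)
print("Identical test sets:", same_split)
```

Passing `random_state=42` directly to `train_test_split` is an alternative that does not depend on global state.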

Switching to `RandomForestRegressor` improves the test score (R², the default `score` metric for regressors) from about 0.67 to about 0.87.

**Tidbit:**

1. If you have structured (tabular) data, ensemble methods are usually a good first choice.
2. If you have unstructured data (images, audio, text), deep learning or transfer learning is usually a better fit.