# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most important functions of the beautiful scikit-learn library.

What are we going to cover:
    
0. An end-to-end scikit-learn workflow
1. Getting the data ready
2. Choose the right estimator/model/algorithm for our problems
3. Fit the model/algorithm/estimator and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together

## 0. An end-to-end scikit-learn workflow

In [1]:
# 1. Get the data ready
import pandas as pd

In [2]:
import numpy as np

In [6]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [7]:
# Separate the features and the target variable from the dataset
# Create a feature matrix (X)
X = heart_disease.drop("target", axis=1)

# Create a label vector (y)
y = heart_disease["target"]

In [8]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# We'll keep the default hyperparameters, which we can inspect with get_params()
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [9]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [10]:
clf.fit(X_train,y_train);

In [11]:
# Make a prediction 
y_preds = clf.predict(X_test)
y_preds

array([1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1])

In [12]:
y_test

9      1
193    0
6      1
189    0
202    0
      ..
265    0
14     1
61     1
280    0
67     1
Name: target, Length: 61, dtype: int64

In [13]:
# 4. Evaluate our model on the training set and the test set
clf.score(X_train,y_train)

1.0

In [14]:
clf.score(X_test,y_test)

0.819672131147541

In [15]:
# Evaluating further with some other evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [16]:
print(classification_report(y_test,y_preds))

              precision    recall  f1-score   support

           0       0.81      0.84      0.83        31
           1       0.83      0.80      0.81        30

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61



In [17]:
confusion_matrix(y_test,y_preds)

array([[26,  5],
       [ 6, 24]])

In [18]:
accuracy_score(y_test,y_preds)

0.819672131147541
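
The accuracy and the per-class precision and recall reported above can be recomputed by hand from the confusion matrix. A minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative:

```python
import numpy as np

# Confusion matrix from the output above (rows = true class, columns = predicted class)
cm = np.array([[26, 5],
               [6, 24]])

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(cm) / cm.sum()

# Precision per class: correct predictions for a class / everything predicted as that class
precision = np.diag(cm) / cm.sum(axis=0)

# Recall per class: correct predictions for a class / everything that truly is that class
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {np.round(precision, 2)}")
print(f"recall:    {np.round(recall, 2)}")
```

The values should agree with the `classification_report` and `accuracy_score` outputs shown earlier in this section.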

In [19]:
# 5. Improve the model
# (Usually we try changing some hyperparameters; here we experiment by tuning the n_estimators hyperparameter)
np.random.seed(42)
for i in range(10,110,10):
    print(f"Fitting the model with the number of estimators = {i}")
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train,y_train)
    y_preds = clf.predict(X_test)
    print(f"Accuracy score for the model with {i} estimators = {accuracy_score(y_test,y_preds) * 100:.2f}%")
    print("")

Fitting the model with the number of estimators = 10
Accuracy score for the model with 10 estimators = 85.25%

Fitting the model with the number of estimators = 20
Accuracy score for the model with 20 estimators = 80.33%

Fitting the model with the number of estimators = 30
Accuracy score for the model with 30 estimators = 80.33%

Fitting the model with the number of estimators = 40
Accuracy score for the model with 40 estimators = 80.33%

Fitting the model with the number of estimators = 50
Accuracy score for the model with 50 estimators = 83.61%

Fitting the model with the number of estimators = 60
Accuracy score for the model with 60 estimators = 81.97%

Fitting the model with the number of estimators = 70
Accuracy score for the model with 70 estimators = 81.97%

Fitting the model with the number of estimators = 80
Accuracy score for the model with 80 estimators = 83.61%

Fitting the model with the number of estimators = 90
Accuracy score for the model with 90 estimators = 80.33%

F

In [22]:
# 6. Save a model and load it.
# Here the last model state, where n_estimators is 100, will be saved.
import pickle

# Use a context manager so the file handle is closed after writing
with open("../random_forest_model_classifier_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [24]:
# Load the saved model
with open("../random_forest_model_classifier_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.819672131147541

# Now we will explore each step one by one in depth 

Let's begin ....

## 1. Getting our data ready to be used with machine learning

Three main things we need to do:
1. Split the data into features and labels (usually `X` and `y`).
2. Fill (also called imputing) or disregard missing values.
3. Convert non-numerical values to numerical values (also called feature encoding).

In [25]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [28]:
X = heart_disease.drop("target", axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [29]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [30]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [31]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [33]:
X.shape[0] * 0.2 # Approximate number of rows in the test set (train_test_split rounds this up to 61)

60.6
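
The product above is only approximate because a split cannot contain a fraction of a row. A small sketch on synthetic data of the same length (303 rows; the array contents are placeholders) suggests that `train_test_split` rounds the test size up:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the heart disease dataset
n_rows = 303
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(n_rows * 0.2)             # fractional size requested
print(math.ceil(n_rows * 0.2))  # size after rounding up
print(len(X_te), len(X_tr))     # actual test and train sizes
```

The resulting sizes should match the shapes printed earlier for the real dataset.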

## 1.1 Make sure the data is numerical.

Machine learning models need numerical input, so any non-numerical columns must be converted to numbers first.

In [36]:
car_sales = pd.read_csv("../scikit-learn-data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [37]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [39]:
# Split the data into X,y
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

In [40]:
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [41]:
# Build a machine learning model (this fails because X still contains string columns)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

ValueError: could not convert string to float: 'Toyota'

In [44]:
# Although Doors holds numerical values, it is a categorical variable because it divides cars into categories based on the number of doors.
car_sales["Doors"].value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

In [46]:
# We get an error like the one above if the data is non-numerical, as our model can't make sense of it.
# Here we have the categorical variables Make (Honda, BMW, etc.) and Colour (White, Blue, etc.); Doors is also a categorical feature.

# Hence we will turn the categories into numbers using OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                one_hot,
                                categorical_features)],
                               remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [48]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [47]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0
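
To see how the one-hot columns line up with the original categories, it can help to run the same kind of transformer on a tiny hand-made table. The sketch below uses a made-up four-row DataFrame (not the car sales file); the `toarray()` guard is there because `ColumnTransformer` may return a sparse matrix:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up miniature version of the car sales features
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota"],
    "Doors": [4, 5, 4, 3],
    "Odometer (KM)": [35431, 192714, 84714, 154365],
})

toy_transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Make", "Doors"])],
    remainder="passthrough",
)
encoded = toy_transformer.fit_transform(toy)

# Convert to a dense array if a sparse matrix was returned
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# 3 makes + 3 door counts + 1 passthrough column = 7 columns
toy_encoded = pd.DataFrame(encoded)
print(toy_encoded)
```

Each row has exactly one 1.0 among the Make columns and one among the Doors columns, followed by the untouched odometer reading.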


In [59]:
# Another way to encode a pandas DataFrame is with pandas' get_dummies().
dummies = pd.get_dummies(X, columns = ["Make","Colour","Doors"])
dummies

Unnamed: 0,Odometer (KM),Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Doors_3,Doors_4,Doors_5
0,35431,0,1,0,0,0,0,0,0,1,0,1,0
1,192714,1,0,0,0,0,1,0,0,0,0,0,1
2,84714,0,1,0,0,0,0,0,0,1,0,1,0
3,154365,0,0,0,1,0,0,0,0,1,0,1,0
4,181577,0,0,1,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,35820,0,0,0,1,1,0,0,0,0,0,1,0
996,155144,0,0,1,0,0,0,0,0,1,1,0,0
997,66604,0,0,1,0,0,1,0,0,0,0,1,0
998,215883,0,1,0,0,0,0,0,0,1,0,1,0


In [62]:
# Let's refit the model as all our data is in numbers.
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train,y_train);

In [63]:
model.score(X_train,y_train)

0.891612713353635

In [64]:
model.score(X_test,y_test)

0.3235867221569877

## What if we had missing values?

1. Fill out the missing values (also known as imputing).
2. Remove the missing data altogether.

In [65]:
car_sales_missing = pd.read_csv("../scikit-learn-data/car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [68]:
car_sales_missing.info(), car_sales_missing.shape[0]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           951 non-null    object 
 1   Colour         950 non-null    object 
 2   Odometer (KM)  950 non-null    float64
 3   Doors          950 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 39.2+ KB


(None, 1000)

In [70]:
# It will give the number of missing values in all columns
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [72]:
# Let's split the data into X and y.
X = car_sales_missing.drop("Price",axis=1)
y = car_sales_missing["Price"]

In [73]:
# If we try to apply the OneHotEncoder it will throw an error as we have missing values in our dataset.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

ValueError: Input contains NaN

## Let's fill the missing Data.

### Option 1: Fill the missing data with Pandas

* Generally, we fill the missing values of a string column by replacing NaN with "missing" or another appropriate string.
* We can fill the missing values of a numerical column with something like the mean of all values in that column.
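
Both rules can be tried on a small made-up DataFrame first. This is a sketch with hypothetical values, not the car sales data:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing string and one missing number
demo = pd.DataFrame({
    "Colour": ["White", None, "Blue"],
    "Odometer (KM)": [10000.0, np.nan, 30000.0],
})

# String column: replace missing entries with the placeholder "missing"
demo["Colour"] = demo["Colour"].fillna("missing")

# Numerical column: replace missing entries with the column mean
# (mean() skips NaN values by default)
demo["Odometer (KM)"] = demo["Odometer (KM)"].fillna(demo["Odometer (KM)"].mean())

print(demo)
```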

In [77]:
int(car_sales_missing["Doors"].median())

4

In [78]:
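# Assign each filled column back to the DataFrame
# (calling fillna with inplace=True on a selected column is unreliable in recent pandas versions)

# Fill the "Make" column.
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column.
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column.
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column.
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(int(car_sales_missing["Doors"].median()))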

In [79]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [84]:
# Price is the target/label column, so we remove the rows with missing prices rather than imputing them.
# We lose some data by doing this, but that is acceptable here.
car_sales_missing.dropna(inplace=True)

In [85]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [87]:
car_sales_missing.shape # This shows that we lost 50 rows which had missing Price values

(950, 5)

In [98]:
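# Let's try to apply OneHotEncoder now.
# Note: car_sales_missing still includes the Price column, so it is passed through as the last output column.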
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales_missing)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [99]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,35431.0,15323.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0,19943.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,84714.0,28343.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,154365.0,13434.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0,14043.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
945,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0,32042.0
946,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,155144.0,5716.0
947,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0,31570.0
948,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,215883.0,4001.0


In [100]:
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,missing,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


### Option 2: Fill the missing data with scikit-learn

In [3]:
car_sales_missing = pd.read_csv("../scikit-learn-data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [4]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [5]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [6]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [8]:
# Fill missing values with scikit-learn
from sklearn.impute import SimpleImputer  # To fill missing values
from sklearn.compose import ColumnTransformer  # To apply a transformation to a chosen list of columns

# Fill categorical values with 'missing', Doors with 4, and numerical values with the mean
cat_imputer = SimpleImputer(strategy='constant',fill_value='missing')
door_imputer = SimpleImputer(strategy='constant', fill_value=4)
num_imputer = SimpleImputer(strategy='mean')

# Define columns 
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)
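
What each `SimpleImputer` strategy does is easiest to see on a few made-up values. A minimal sketch, separate from the car sales data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numerical column with a missing value in the middle
values = np.array([[10000.0], [np.nan], [30000.0]])

# strategy="mean" learns the column mean during fit and uses it to fill gaps
mean_imputer = SimpleImputer(strategy="mean")
filled_mean = mean_imputer.fit_transform(values)

# strategy="constant" ignores the data and always inserts fill_value
const_imputer = SimpleImputer(strategy="constant", fill_value=4)
filled_const = const_imputer.fit_transform(values)

print(mean_imputer.statistics_)  # the learned fill value per column
print(filled_mean.ravel())
print(filled_const.ravel())
```

Because the mean is learned in `fit`, the same fitted imputer can later be applied to new data with `transform`.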

In [17]:
# Let's check if there are any missing values left
car_sales_filled = pd.DataFrame(filled_X,
                               columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4,35431
1,BMW,Blue,5,192714
2,Honda,White,4,84714
3,Toyota,White,4,154365
4,Nissan,Blue,3,181577


In [18]:
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [19]:
# Now let's turn the data into numerical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
transformed_car_sales_filled = transformer.fit_transform(car_sales_filled)
transformed_car_sales_filled

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [22]:
pd.DataFrame(transformed_car_sales_filled.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
945,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
946,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,155144.0
947,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
948,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,215883.0


In [29]:
# Now we have got the data so let's try to fit our model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_car_sales_filled, y, test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_train, y_train)

0.8845872827949007

In [30]:
model.score(X_test, y_test)

0.21990196728583944

# 2. Choosing the right estimator/algorithm for our problem.

Scikit-learn uses the term estimator for a machine learning model or algorithm.

* Classification - predicting whether a particular thing is one thing or another.
* Regression - predicting a number.
* Clustering - predicting which category a particular thing belongs to when the categories are not known.

Step 1: Check the scikit-learn machine learning map:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

### 2.1 Picking a machine learning model for a regression problem

In [33]:
# Import Boston housing dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
# boston

In [35]:
boston_df = pd.DataFrame(boston["data"], columns=boston.feature_names)
boston_df["target"] = pd.Series(boston["target"])
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [38]:
# How many samples are there?
len(boston_df)

506

In [39]:
# Let's try the Ridge Regression Model.
from sklearn.linear_model import Ridge

# Setup a random seed
np.random.seed(42)

# Create the data
X = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Ridge model
regressor = Ridge(alpha=0.5)
regressor.fit(X_train, y_train)

# Check the score of the ridge model on test data
regressor.score(X_test, y_test)

0.6675800871276227

In [40]:
regressor.score(X_train, y_train)

0.7500178709433354

In [44]:
# Let's try the support vector regression model on the same data
from sklearn.svm import SVR
svr_regressor = SVR()
svr_regressor.fit(X_train, y_train)
svr_regressor.score(X_test, y_test)

0.27948125010200275

##### How do we improve the score?

With a test score of about 0.28, the support vector regressor is clearly not performing as well as we want.

Let's refer back to the map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

<img alt="scikit-learn-ml-map" />

In [49]:
# Let's try ensemble regressors particularly the random forest regressor
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X = boston_df.drop("target", axis=1)
y = boston_df['target']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initialize the model
ensemble_model = RandomForestRegressor()

# Train the model on the training set
ensemble_model.fit(X_train, y_train)

# Check the model's performance on the test set
ensemble_model.score(X_test, y_test)

0.8654448653350507

In [50]:
ensemble_model.score(X_train,y_train)

0.9763520974033731

### 2.2 Choosing an estimator for a classification problem

Let's go to the map... https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [51]:
heart_disease = pd.read_csv("../scikit-learn-data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [52]:
# How many samples ?
len(heart_disease)

303

Consulting the map, it tells us to try `LinearSVC`.

In [58]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# Set up random seed
np.random.seed(42)

# Get the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the LinearSVC
linsvc_clf = LinearSVC()

# Train the model on the training set.
linsvc_clf.fit(X_train, y_train)

# Evaluate the model on test set.
linsvc_clf.score(X_test, y_test)



0.8688524590163934

In [59]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
132,42,1,1,120,295,0,1,162,0,0.0,2,0,2
202,58,1,0,150,270,0,0,111,1,0.8,2,0,3
196,46,1,2,150,231,0,1,147,0,3.6,1,0,2
75,55,0,1,135,250,0,0,161,0,1.4,1,0,2
176,60,1,0,117,230,1,1,160,1,1.4,2,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,50,1,2,140,233,0,1,163,0,0.6,1,1,3
71,51,1,2,94,227,0,1,154,1,0.0,2,1,3
106,69,1,3,160,234,1,0,131,0,0.1,1,1,2
270,46,1,0,120,249,0,0,144,0,0.8,2,0,3


In [62]:
# Let's use an ensemble estimator
from sklearn.ensemble import RandomForestClassifier

# Set up random seed
np.random.seed(42)

# Get the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model
rfc_model = RandomForestClassifier(n_estimators=20000)

# Train the model on the training set.
rfc_model.fit(X_train, y_train)

# Evaluate the model on test set.
rfc_model.score(X_test, y_test)

0.8688524590163934

#### Tidbit :

1. If you have structured data, use ensemble methods.
2. If you have unstructured data, use deep learning or transfer learning.

# 3. Fit the model/algorithm/estimator on our data and use it to make predictions

### 3.1 Fitting the model to the data

Different names for:
* `X` = features, feature variables, data
* `y` = labels, targets, target variables

In [63]:
# Let's use an ensemble estimator
from sklearn.ensemble import RandomForestClassifier

# Set up random seed
np.random.seed(42)

# Get the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model
rfc_model = RandomForestClassifier(n_estimators=20000)

# Fit the model on the training set so that it can learn patterns from the data provided.
rfc_model.fit(X_train, y_train)

# Evaluate the model on test set. (use the patterns that the machine has learnt)
rfc_model.score(X_test, y_test)

0.8688524590163934

### 3.2 Make predictions using a Machine Learning model

Two main ways to make predictions:
1. `predict()`
2. `predict_proba()`

In [64]:
# Use a trained model to make predictions
rfc_model.predict(X_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [66]:
np.array(y_test)

array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [67]:
# Compare predictions to truth labels to evaluate the model
y_preds = rfc_model.predict(X_test)
y_preds

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [68]:
np.mean(y_preds == y_test)

0.8688524590163934

In [69]:
rfc_model.score(X_test, y_test)

0.8688524590163934

In [72]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8688524590163934

Make predictions with `predict_proba()`

In [75]:
# predict_proba() returns the probability of each class label for every sample
rfc_model.predict_proba(X_test[:5])

array([[0.90695, 0.09305],
       [0.4193 , 0.5807 ],
       [0.4638 , 0.5362 ],
       [0.8731 , 0.1269 ],
       [0.20495, 0.79505]])

In [74]:
y_preds

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
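
The two methods are closely related: for a random forest, the label returned by `predict()` is the class with the highest probability from `predict_proba()`. The sketch below checks this on synthetic data generated with `make_classification` (an assumption made to keep the example self-contained):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=42)
demo_clf.fit(X_demo, y_demo)

proba = demo_clf.predict_proba(X_demo[:5])  # shape (5, 2): one column per class
labels = demo_clf.predict(X_demo[:5])

# Pick the most probable class in each row and map the column index to a label
labels_from_proba = demo_clf.classes_[np.argmax(proba, axis=1)]

print(proba)
print(labels, labels_from_proba)
```

Each row of `proba` sums to 1, and the column order follows `demo_clf.classes_`.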

`predict()` can also be used with regression models.

In [78]:
from sklearn.ensemble import RandomForestRegressor

# Set up random seed
np.random.seed(42)

# Get the data
X = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model
rfr_model = RandomForestRegressor()

# Fit the model
rfr_model.fit(X_train,y_train)

# Evaluate the model
rfr_model.score(X_test,y_test)

0.8654448653350507

In [80]:
rfr_model.predict(X_test[:5])

array([23.081, 30.574, 16.759, 23.46 , 16.893])

In [82]:
y_test[:5]

173    23.6
274    32.4
491    13.6
72     22.8
452    16.1
Name: target, dtype: float64

In [86]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred=rfr_model.predict(X_test))

2.136382352941176
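
Mean absolute error is the average of the absolute differences between predictions and true values. A small worked sketch with made-up numbers (not the Boston data) compares a manual calculation with `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Absolute errors are 0.5, 0.5, 0.0 and 1.0, so their mean is 2.0 / 4
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Read this way, the value of about 2.14 above means the model's predictions are off by roughly 2.14 target units on average.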

# 4. Evaluating a ML model

Three ways to evaluate scikit-learn models/estimators:
1. The estimator's `score` method.
2. The `scoring` parameter.
3. Problem-specific metric functions.

### 4.1 Evaluating a model with the `score` method

In [89]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfc = RandomForestClassifier()

rfc.fit(X_train, y_train);

In [90]:
rfc.score(X_test, y_test)

0.8524590163934426

💡 Note: Every estimator's `score` method has its own default metric that it uses for evaluation (mean accuracy for classifiers, R^2 for regressors).
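
A quick way to check which metric `score` uses is to compare it with the explicit metric function. The sketch below does this on synthetic data; the datasets and the choice of `LogisticRegression` and `Ridge` are assumptions made purely for illustration:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, r2_score

# Classifier: score() should match accuracy_score
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf_demo = LogisticRegression(max_iter=1000).fit(Xc, yc)
clf_score = clf_demo.score(Xc, yc)
clf_accuracy = accuracy_score(yc, clf_demo.predict(Xc))

# Regressor: score() should match r2_score
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg_demo = Ridge(alpha=0.5).fit(Xr, yr)
reg_score = reg_demo.score(Xr, yr)
reg_r2 = r2_score(yr, reg_demo.predict(Xr))

print(clf_score, clf_accuracy)
print(reg_score, reg_r2)
```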

### 4.2 Evaluating a model using the `scoring` parameter

In [92]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier

# Set up random seed
np.random.seed(42)

# Get the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the model
rfc_model = RandomForestClassifier(n_estimators=100)

# Fit the model on the training set so that it can learn patterns from the data provided.
rfc_model.fit(X_train, y_train);

In [93]:
rfc_model.score(X_test, y_test)

0.8524590163934426

In [96]:
cross_val_score(rfc_model, X, y, cv=5) # k-fold cv, here we took k=5 but we can take whatever we want

array([0.83606557, 0.90163934, 0.81967213, 0.83333333, 0.78333333])

A brief overview of what cross-validation does:

<img src="../images/cross_val.png" />
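
In code, k-fold cross-validation amounts to splitting the row indices into k folds and letting each fold serve as the test set once. A rough sketch using `KFold` on ten placeholder samples; note that for classifiers `cross_val_score` uses a stratified variant by default, so this illustrates the idea rather than the exact splits used above:

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # ten placeholder samples, identified by index
kfold = KFold(n_splits=5)

test_indices_seen = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    # Each fold trains on 8 samples and tests on the remaining 2
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_indices_seen.extend(test_idx.tolist())
```

Every sample ends up in a test fold exactly once, which is why the mean of the fold scores is usually a steadier estimate than a single train/test split.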

In [97]:
np.random.seed(42)

# Single training and test split score
rfc_single_score = rfc_model.score(X_test, y_test)

# Take the mean of the 5-fold cross-validation scores (cv defaults to 5)
rfc_cross_val_score = np.mean(cross_val_score(rfc_model, X, y))

# Compare the scores
rfc_single_score, rfc_cross_val_score

(0.8524590163934426, 0.8248087431693989)

In [None]:
# Default scoring metric of a classifier = mean accuracy
rfc_model.score(X_test, y_test) # returns mean accuracy

In [98]:
# scoring is None by default, i.e. cross_val_score falls back to the estimator's default metric (its score method)
cross_val_score(rfc_model, X, y, scoring=None)

array([0.78688525, 0.86885246, 0.80327869, 0.78333333, 0.76666667])
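
To make the effect of `scoring` concrete, the sketch below compares `scoring=None` with an explicit `scoring="accuracy"` and then switches to `"precision"`. It uses synthetic data and a fixed `random_state` on the estimator (both assumptions for reproducibility), so the numbers will differ from the heart disease results:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=50, random_state=42)

# scoring=None falls back to the classifier's default metric (mean accuracy)
default_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring=None)
accuracy_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="accuracy")

# Any other supported metric name can be passed instead
precision_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="precision")

print(np.mean(default_scores), np.mean(accuracy_scores))
print(np.mean(precision_scores))
```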