## What to cover?


* an end-to-end Scikit-Learn workflow
* getting data ready (to be used with machine learning models)
* choosing a machine learning model
* fitting a model to the data (learning patterns)
* making predictions with a model (using patterns)
* evaluating model predictions
* improving model predictions
* saving and loading models

## An end-to-end Scikit-Learn workflow

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# getting data ready
import pandas as pd
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
# create X (features matrix)
X = heart_disease.drop('target', axis=1)

# create y (labels)
y = heart_disease['target']

In [4]:
# choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [5]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [6]:
# fit the model to the training data
clf.fit(X_train, y_train);

In [7]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
200,44,1,0,110,197,0,0,177,0,0.0,2,1,2
254,59,1,3,160,273,0,0,125,0,0.0,2,0,2
24,40,1,3,140,199,0,1,178,1,1.4,2,0,3
90,48,1,2,124,255,1,1,175,0,0.0,2,2,2
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,43,1,0,115,303,0,1,181,0,1.2,1,0,2
75,55,0,1,135,250,0,0,161,0,1.4,1,0,2
241,59,0,0,174,249,0,1,143,1,0.0,1,0,2
198,62,1,0,120,267,0,1,99,1,1.8,1,2,3


In [8]:
# make prediction
y_preds = clf.predict(X_test)
y_preds

array([0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0], dtype=int64)

In [9]:
y_test

242    0
112    1
225    0
267    0
187    0
      ..
73     1
48     1
191    0
222    0
235    0
Name: target, Length: 61, dtype: int64

In [10]:
# evaluate the model on the training data and test data
clf.score(X_train, y_train)

1.0

In [11]:
clf.score(X_test, y_test)

0.9016393442622951

In [12]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [13]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.93      0.86      0.89        29
           1       0.88      0.94      0.91        32

    accuracy                           0.90        61
   macro avg       0.90      0.90      0.90        61
weighted avg       0.90      0.90      0.90        61



In [14]:
confusion_matrix(y_test, y_preds)

array([[25,  4],
       [ 2, 30]], dtype=int64)

In [15]:
accuracy_score(y_test, y_preds)

0.9016393442622951
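
The numbers in the classification report and the accuracy score can all be recomputed from the confusion matrix. The sketch below copies the matrix from the output above and does the arithmetic by hand; the variable names are illustrative only, and it assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true labels 0 and 1, columns: predicted labels 0 and 1)
cm = np.array([[25, 4],
               [2, 30]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of everything predicted 1, how much was really 1
recall_1 = tp / (tp + fn)         # of everything really 1, how much was found
precision_0 = tn / (tn + fn)      # same idea for class 0
recall_0 = tn / (tn + fp)

print(f"accuracy:    {accuracy:.4f}")
print(f"class 0: precision {precision_0:.2f}, recall {recall_0:.2f}")
print(f"class 1: precision {precision_1:.2f}, recall {recall_1:.2f}")
```

The rounded values agree with the report: 55 correct predictions out of 61 gives the 0.9016 accuracy.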

In [16]:
# improving the model

# try different numbers of estimators (n_estimators)
np.random.seed(22)
for i in range(10,100,10):
    print(f'Trying model with {i} estimators')
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f'Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%')
    print('')

Trying model with 10 estimators
Model accuracy on test set: 80.33%

Trying model with 20 estimators
Model accuracy on test set: 83.61%

Trying model with 30 estimators
Model accuracy on test set: 85.25%

Trying model with 40 estimators
Model accuracy on test set: 85.25%

Trying model with 50 estimators
Model accuracy on test set: 88.52%

Trying model with 60 estimators
Model accuracy on test set: 83.61%

Trying model with 70 estimators
Model accuracy on test set: 85.25%

Trying model with 80 estimators
Model accuracy on test set: 90.16%

Trying model with 90 estimators
Model accuracy on test set: 86.89%



In [17]:
# save the model to disk (the with-block closes the file for us)
import pickle
with open('random_forest_model_1.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [18]:
# load the model back and score it
with open('random_forest_model_1.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
# clf is the last model fitted in the loop above (90 estimators), so the score matches that run
loaded_model.score(X_test, y_test)

0.8688524590163934
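
A quick way to check that a reloaded model behaves like the original is to compare their predictions. This is a minimal sketch on synthetic data, not the heart-disease model: `X_demo`, `y_demo`, `demo_clf` and the file name `demo_model.pkl` are made up for illustration, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset and model (stand-ins for the real ones)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)

    # Load it back
    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should predict exactly what the original predicts
same_predictions = bool((restored_clf.predict(X_demo) == demo_clf.predict(X_demo)).all())
print(same_predictions)
```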

## In Depth

## 1. Getting our data ready

Three main things we need to do:
1. Split the data into features and labels (usually 'X' and 'y')
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
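
The three steps above can be sketched on a tiny made-up table before applying them to the real datasets. Everything in this example (the `toy` frame, its column names and values) is invented for illustration.

```python
import numpy as np
import pandas as pd

# A made-up table with one missing value in each feature column
toy = pd.DataFrame({
    "colour": ["red", "blue", None, "red"],
    "odometer_km": [10_000.0, np.nan, 30_000.0, 50_000.0],
    "price": [5_000, 4_000, 3_500, 2_000],
})

# 1. Split into features (X) and labels (y)
X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]

# 2. Fill missing values: a placeholder for text, the column mean for numbers
X_toy = X_toy.assign(
    colour=X_toy["colour"].fillna("missing"),
    odometer_km=X_toy["odometer_km"].fillna(X_toy["odometer_km"].mean()),
)

# 3. Turn the text column into numeric indicator columns
X_toy = pd.get_dummies(X_toy, columns=["colour"])

print(X_toy)
```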

In [19]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [20]:
X = heart_disease.drop('target', axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [21]:
y = heart_disease['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [22]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [23]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [24]:
len(heart_disease)

303

In [25]:
242 + 61

303
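
Note that 20% of 303 rows is 60.6, which is not a whole number of rows. The shapes above show that the test set received 61 rows and the training set the remaining 242. The sketch below reproduces those sizes on a plain array of 303 indices; `indices` and `random_state=0` are illustrative choices, not part of the notebook above.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 303
indices = np.arange(n_samples)

# Same test_size as above; only the split sizes matter here, not which rows land where
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)

print(len(train_idx), len(test_idx))
print(math.ceil(0.2 * n_samples))  # the test size rounded up to a whole row
```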

## Making sure all data is numerical

In [26]:
car_sales = pd.read_csv('data/car-sales-extended.csv')
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [27]:
len(car_sales)

1000

In [28]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [29]:
# this example is not realistic: a car's price cannot be predicted well from these few columns,
# but we use it anyway for the sake of learning.

In [30]:
# split the data into X and y
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

# split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [31]:
# build machine learning model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Honda'

### Converting categorical data into numbers

In [None]:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']    # Doors is numeric but takes only a few distinct values, so we treat it as a category
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, categorical_features)], remainder='passthrough')
transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
pd.DataFrame(transformed_X)

In [None]:
# another way of one-hot encoding, this time using pandas
dummies = pd.get_dummies(car_sales[['Make', 'Colour', 'Doors']])
dummies
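
One thing to watch with `pd.get_dummies`: by default it only encodes text-like columns, so a numeric column such as `Doors` is passed through unchanged unless it is named in the `columns` argument. A small sketch on a made-up three-row frame (`cars_demo` is illustrative, not the real dataset):

```python
import pandas as pd

cars_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 4],
})

# Default behaviour: only the text columns are encoded, Doors stays numeric
default_dummies = pd.get_dummies(cars_demo)

# Naming the columns explicitly also encodes Doors as a category
all_dummies = pd.get_dummies(cars_demo, columns=["Make", "Colour", "Doors"])

print(list(default_dummies.columns))
print(list(all_dummies.columns))
```

With the explicit `columns` list the result is closer to what the `OneHotEncoder` cell above produces, where `Doors` is treated as categorical.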

In [None]:
# let's refit the model
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
# NOTE: the score may be low, but as already discussed, predicting the price from doors, odometer,
# colour and make alone is not realistic, so a low score is expected

### Handling Missing Values

Two main ways:
1. Fill them with values (also known as imputation)
2. Remove the samples with missing data altogether
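
The trade-off between the two options is how many samples survive. A minimal sketch on made-up rows (`demo` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Make": ["Honda", None, "Toyota", "BMW"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 154365.0],
})

# Option 1: imputation keeps every row and replaces the gaps
filled = demo.fillna({
    "Make": "missing",
    "Odometer (KM)": demo["Odometer (KM)"].mean(),
})

# Option 2: dropping removes every row that has at least one gap
dropped = demo.dropna()

print(len(demo), len(filled), len(dropped))
```

Imputation keeps all the rows at the cost of introducing made-up values; dropping keeps only real values at the cost of having less data to learn from.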

In [None]:
# import missing data file
car_sales_missing = pd.read_csv('data/car-sales-extended-missing-data.csv')
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()

In [None]:
# create X and y
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [None]:
# let's try and convert our data to numbers 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']    # Doors is numeric but takes only a few distinct values, so we treat it as a category
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, categorical_features)], remainder='passthrough')
transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
car_sales_missing

### Option 1: filling missing data with Pandas

In [None]:
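# assign the filled column back rather than calling fillna(inplace=True) on a single column,
# which newer pandas versions warn about

# fill the 'Make' column
car_sales_missing['Make'] = car_sales_missing['Make'].fillna('missing')

# fill the 'Colour' column
car_sales_missing['Colour'] = car_sales_missing['Colour'].fillna('missing')

# fill the 'Odometer (KM)' column with its mean
car_sales_missing['Odometer (KM)'] = car_sales_missing['Odometer (KM)'].fillna(car_sales_missing['Odometer (KM)'].mean())

# fill the 'Doors' column with 4
car_sales_missing['Doors'] = car_sales_missing['Doors'].fillna(4)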

In [None]:
# recheck and verify our dataframe
car_sales_missing.isna().sum()

In [None]:
# removing rows with missing price values
car_sales_missing.dropna(inplace=True)

In [None]:
car_sales_missing.isna().sum()

In [None]:
len(car_sales_missing)

In [None]:
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [None]:
# now let's turn our categorical data into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']    # Doors is numeric but takes only a few distinct values, so we treat it as a category
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, categorical_features)], remainder='passthrough')
# transform X only: passing the whole dataframe would leak the Price label into the features
transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
pd.DataFrame(transformed_X)

In [None]:
car_sales_missing

### Option 2: filling missing values with Scikit-Learn

In [None]:
car_sales_missing = pd.read_csv('data/car-sales-extended-missing-data.csv')
car_sales_missing

In [None]:
car_sales_missing.isna().sum()

In [None]:
# dropping rows with no price label
car_sales_missing.dropna(subset=['Price'], inplace=True)
car_sales_missing.isna().sum()

In [None]:
# split into X and y
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [None]:
# fill missing values with scikit learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# fill categorical values with 'missing', Doors with 4 and numerical values with their mean
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
door_imputer = SimpleImputer(strategy='constant', fill_value=4)
num_imputer = SimpleImputer(strategy='mean')

# define columns
cat_features = ['Make', 'Colour']
door_feature = ['Doors']
num_features = ['Odometer (KM)']

# create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ('cat_imputer', cat_imputer, cat_features),
    ('door_imputer', door_imputer, door_feature),
    ('num_imputer', num_imputer, num_features)
])

# transform the data
filled_X = imputer.fit_transform(X)
filled_X

In [None]:
car_sales_filled = pd.DataFrame(filled_X, 
                                columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])
car_sales_filled.head()

In [None]:
car_sales_filled.isna().sum()

In [None]:
# now again, let's turn our categorical data into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']    # Doors is numeric but takes only a few distinct values, so we treat it as a category
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, categorical_features)], remainder='passthrough')
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

In [None]:
# now we've got our data as numbers and filled it (no missing values now)
# let's fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
len(car_sales_filled)

## 2. Choosing the right model/algorithm/estimator for our problem

### 2.1 Picking a machine learning model for a regression problem

In [33]:
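# import boston housing dataset
# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version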
from sklearn.datasets import load_boston
boston = load_boston()
boston

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 3

In [34]:
boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
boston_df['target'] = pd.Series(boston['target'])
boston_df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48,22.0


In [35]:
# number of samples
len(boston_df)

506

In [37]:
# trying ridge regression model
from sklearn.linear_model import Ridge

# setup random seed
np.random.seed(42)

# create the data 
X = boston_df.drop('target', axis=1)
y = boston_df['target']

# splitting the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Ridge model
model = Ridge()
model.fit(X_train, y_train)

# check the score of Ridge model on test data
model.score(X_test, y_test)

0.6662221670168522

How do we improve this score?

What if the Ridge model isn't working well enough? Then we try a different estimator.

Refer to the Scikit-Learn machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
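
When several estimators from the map look plausible, one simple approach is to fit each on the same split and compare their test scores. The sketch below does this on synthetic data from `make_friedman1` rather than on the housing data, so its scores say nothing about the results in this notebook; the names `candidates`, `scores`, `X_syn` and `y_syn` are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data with a non-linear target (a stand-in for a real dataset)
X_syn, y_syn = make_friedman1(n_samples=500, noise=1.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Candidate estimators to compare on the same split
candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    # For regressors, score() returns R^2 on the given data
    scores[name] = estimator.score(X_te, y_te)

for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```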

In [38]:
# let's try random forest
from sklearn.ensemble import RandomForestRegressor

# setup random seed
np.random.seed(42)

# create the data 
X = boston_df.drop('target', axis=1)
y = boston_df['target']

# splitting the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Regressor model
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# check the score of the RandomForestRegressor model on test data
rf.score(X_test, y_test)

0.873969014117403

### 2.2 Selecting an estimator for a classification problem
Check the Scikit-Learn machine learning map again.

In [40]:
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


After consulting the map, we start with `LinearSVC`.

In [49]:
# Import linear SVC estimator class
from sklearn.svm import LinearSVC

# setup random seed
np.random.seed(42)

#make data ready
X = heart_disease.drop('target', axis = 1)
y = heart_disease['target']

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# instantiate linear SVC
clf = LinearSVC()
clf.fit(X_train, y_train)

# evaluate LinearSVC
clf.score(X_test, y_test)



0.4918032786885246
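
A score near 0.5 on a two-class problem is about what guessing would give. One possible reason, not verified here for the heart-disease data, is that linear SVMs are sensitive to feature scale, and the columns above range from 0/1 flags to values in the hundreds. The sketch below shows how one could test that idea on synthetic data by comparing a plain `LinearSVC` with one that standardises the features first; the data, the deliberately inflated column and all names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic two-class data; one column is inflated to mimic mixed feature scales
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)
X_syn = X_syn.copy()
X_syn[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Plain LinearSVC on the raw features (may emit a ConvergenceWarning)
raw_svc = LinearSVC(random_state=42)
raw_score = raw_svc.fit(X_tr, y_tr).score(X_te, y_te)

# Same estimator, but with features standardised inside a pipeline
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(random_state=42))
scaled_score = scaled_svc.fit(X_tr, y_tr).score(X_te, y_te)

print(f"raw features:    {raw_score:.3f}")
print(f"scaled features: {scaled_score:.3f}")
```

Whether scaling helps on a given dataset has to be checked empirically; trying a different estimator, as done next, is the other route.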

In [50]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# setup random seed
np.random.seed(42)

#make data ready
X = heart_disease.drop('target', axis = 1)
y = heart_disease['target']

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# instantiate RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# evaluate RandomForestClassifier
clf.score(X_test, y_test)

0.8524590163934426

QUICK NOTE (rule of thumb):
    1. Structured data - use ensemble methods
    2. Unstructured data - use deep learning or transfer learning