# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

What we're going to cover:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator (Scikit-Learn's name for an algorithm/model) for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together!

In [3]:
# 0.1 Standard imports that we use in every ML notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# 0. An end-to-end Scikit-Learn workflow

In [4]:
import numpy as np

In [5]:
# 1. Get the data ready

import pandas as pd
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


# We want to predict heart disease, Y (the target column), from X (the feature columns, age through thal)

In [6]:
# Create X (features matrix)

X = heart_disease.drop("target", axis = 1)
# Remember: in pandas, axis=1 refers to columns and axis=0 refers to rows.
# We want every column except target.

# Create Y (labels)

Y = heart_disease["target"]
# only target
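
To make the `axis` argument concrete, here is a minimal sketch on a tiny made-up DataFrame (the values and the names `df`, `features` and `first_row_removed` are illustrative assumptions, not part of the heart disease data). It shows `drop` removing a column with `axis=1` and a row with `axis=0`.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values, not the real dataset)
df = pd.DataFrame({"age": [63, 37], "chol": [233, 250], "target": [1, 0]})

# axis=1: the label "target" is looked up among the columns
features = df.drop("target", axis=1)

# axis=0: the label 0 is looked up among the row index labels
first_row_removed = df.drop(0, axis=0)

print(list(features.columns))         # ['age', 'chol']
print(list(first_row_removed.index))  # [1]
```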

## Our problem is a classification problem

In [7]:
# 2. Choose the right model and hyperparameters
# (hyperparameters are the dials you turn to make your model better or worse)

from sklearn.ensemble import RandomForestClassifier
# RandomForestClassifier is a classification model: it learns patterns in the data
# and then classifies whether a row is one thing or another.
clf = RandomForestClassifier()
# Instantiate the RandomForestClassifier as clf (a common naming convention)

# We'll keep the default hyperparameters
clf.get_params()
# Shows the default hyperparameters the model uses

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [8]:
# 3. Fit the model to the training data

# -> First we train our model on a training set, then we test it on a test set

# -> We need to split our data into training and test sets, which we can do with sklearn's train_test_split

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# 80% of the rows are used for training and 20% for testing
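
Because `train_test_split` shuffles the rows, the split changes on every run unless the randomness is pinned. Below is a small sketch on synthetic arrays (the names `X_demo` and `y_demo` and the seed value are assumptions for illustration) showing the 80/20 shapes and how `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# test_size=0.2 keeps 20% of the rows for testing; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Same random_state, same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```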

In [9]:
clf.fit(X_train, Y_train);
# Ask the classifier (our RandomForestClassifier) to find the patterns in the training data

In [10]:
# Make a prediction (this fails on purpose: the input does not have the shape of the training data)
Y_label = clf.predict(np.array([0,2,3,4]))
# You can only predict on data that looks like the data the model was trained on.

ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
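
The error message hints at the fix for the shape problem: a single sample has to be passed as a 2D array with one row. Here is a minimal sketch with a small throwaway model trained on 4 random features (`clf_demo`, `X_demo` and `sample` are hypothetical names for this example). Note that reshaping alone would not rescue the call above, since `[0, 2, 3, 4]` has 4 values while the heart disease model was trained on 13 features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Throwaway model trained on 4 random features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.random((20, 4))
y_demo = np.array([0, 1] * 10)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# One sample as a 1D array: predict() would raise the same ValueError
sample = np.array([0.1, 0.2, 0.3, 0.4])

# reshape(1, -1) turns it into 1 row with 4 columns, matching the training layout
pred = clf_demo.predict(sample.reshape(1, -1))
print(pred.shape)  # (1,)
```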

In [None]:
# The data that does look like the training data is X_test
X_test

In [None]:
# y_preds is the conventional name for predictions made on the test data.

In [None]:
y_preds = clf.predict(X_test)
y_preds

In [None]:
Y_test

In [None]:
# 4. Evaluate the model on the training data and test data
clf.score(X_train, Y_train)

In [None]:
# 1.0 is a 100% score: the model has found the patterns in the training data very well.

# Let's check that on the test data.
clf.score(X_test, Y_test)

In [None]:
# 0.83 means the model was about 83% correct on the test data, because it never saw that data or its labels before.

# Other metrics for evaluating our model

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(Y_test, y_preds))
# Compare the test labels with the predictions our model made.

In [None]:
confusion_matrix(Y_test, y_preds)

In [None]:
accuracy_score(Y_test, y_preds)
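
The confusion matrix and the accuracy score are two views of the same counts: accuracy is the sum of the diagonal (the correct predictions) divided by the total. A small sketch with hand-made labels (`y_true` and `y_pred` are invented for illustration) makes the link explicit.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 3]]

# Correct predictions sit on the diagonal
acc_from_cm = cm.trace() / cm.sum()
acc = accuracy_score(y_true, y_pred)
print(acc_from_cm == acc)  # True (both are 4/6)
```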

In [None]:
# We are not happy with the accuracy score.

# 5. Improve our model
# Try different values of n_estimators, one of the hyperparameter dials, to tune our model.
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators = i).fit(X_train, Y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, Y_test) * 100:.2f}%")
    print(" ")

# 6. Save a model and load it

In [None]:
# Python's pickle library lets us save models
import pickle

# Pass the classifier (our model) and a file opened in write-binary ("wb") mode under the name we want to give it.
with open("random_forest_model_1.pk1", "wb") as f:
    pickle.dump(clf, f)

In [None]:
# Load the saved model and check its score.
# The file is opened in read-binary ("rb") mode.
with open("random_forest_model_1.pk1", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, Y_test)
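
As an alternative to `pickle`, the `joblib` library (installed alongside Scikit-Learn) offers `dump` and `load` functions that are commonly used for models. The sketch below is a swapped-in technique, not what this notebook does: the model, the temporary directory and the file name `demo_model.joblib` are all assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small throwaway model on random data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.random((30, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary location (hypothetical file name), then load it back
path = os.path.join(tempfile.mkdtemp(), "demo_model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model should make exactly the same predictions
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

Whichever library is used, only load model files from sources you trust, because unpickling can execute arbitrary code.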

# ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## 1.1 Getting our data ready to be used with machine learning

Three main things we have to do:
   * 1. Split the data into features and labels (usually 'X' and 'y')
   * 2. Filling (also called imputing) or disregarding missing values (that is, either remove the rows where a field has no data, or fill those fields in)
   * 3. Converting non-numerical values to numerical values (also called feature encoding), for example turning Honda / Toyota into numbers (how to do this is covered later on)

In [11]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [12]:
X = heart_disease.drop("target",axis = 1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [13]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [14]:
#Split the data into training and test sets.
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X , y , test_size=0.2)

In [15]:
X_train.shape , X_test.shape , y_train.shape , y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [16]:
X.shape

(303, 13)

In [17]:
X.shape[0] * 0.8

242.4

In [18]:
242 + 61

303

In [19]:
len(heart_disease)

303
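
The shapes above follow from simple arithmetic. As far as I understand `train_test_split`, when only `test_size` is given as a fraction it rounds the test count up and gives the remaining rows to the training set, which matches the 242 and 61 seen here. A worked sketch (the variable names are illustrative):

```python
import math

n_samples = 303
test_size = 0.2

# 0.2 * 303 = 60.6, rounded up to 61 test rows
n_test = math.ceil(test_size * n_samples)

# Whatever is left goes to training
n_train = n_samples - n_test

print(n_train, n_test)  # 242 61
```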

# Getting your data ready: Converting data to numbers

## 1.1.2 Make sure it's all numerical

In [20]:
car_sales = pd.read_csv("data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [21]:
len(car_sales)

1000

In [22]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

### You can clearly see that some of the columns are not numbers (Make and Colour hold strings such as Honda)

In [23]:
# Let's show that the model needs all of its inputs to be numbers.
# Split into X/y
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# Split into training and test
X_train, X_test , y_train , y_test = train_test_split(X, y, test_size=0.2)

In [24]:
# Build machine learning model
from sklearn.ensemble import RandomForestRegressor
# RandomForestRegressor works like the classifier, but this time it predicts a number (Price)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

# As the error shows, our machine learning model can't deal with data types other than numbers (here, strings), as we expected.


# So let's see how we can convert them into numbers

In [25]:
# Turn categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make" , "Colour" , "Doors"]

In [26]:
# You might be wondering why Doors is included.
# We treat Doors as a category because it takes only a few distinct values (for example, 856 cars have 4 doors).
car_sales['Doors'].value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

In [27]:
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot" ,one_hot, categorical_features)], remainder = "passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [28]:
# Let's put this array in a DataFrame
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [29]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [30]:
# Another way of converting the data into numbers is pandas' get_dummies function
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1
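
Notice in the output above that Doors stayed a single numeric column: by default `get_dummies` only encodes string (object or category) columns. If Doors should be treated as a category here too, one option is the `columns` argument. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Made-up miniature of the car sales data
cars = pd.DataFrame({"Make": ["Honda", "BMW"], "Doors": [4, 5]})

# Default behaviour: only the string column Make is encoded
default = pd.get_dummies(cars)

# Listing the columns explicitly encodes the integer column Doors as well
as_category = pd.get_dummies(cars, columns=["Make", "Doors"])

print(default.columns.tolist())
print(as_category.columns.tolist())
```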


In [31]:
# Now let's refit the model

np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformed_X , y , test_size=0.2)

model.fit(X_train, y_train)

RandomForestRegressor()

In [32]:
model.score(X_test, y_test)

0.3235867221569877
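
For a regressor, `score` returns the coefficient of determination (R²) rather than accuracy: 1.0 is a perfect fit, and 0.0 is no better than always predicting the mean. The sketch below computes R² by hand on invented numbers and compares it with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: how far the predictions are from the truth
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)  # 0.925
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # True
```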

# Getting your data ready: Handling missing values with pandas

### What if there are missing values?

1. Fill them with some value (also known as imputation).
2. Remove the samples with missing data altogether.

One thing to note here is that there is no perfect way of dealing with missing data: if you fill in a value, you are adding data that is not real, and if you remove the samples with missing values, you have less data to work with.


In [33]:
# Import car sales missing data
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")

car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [34]:
# The first 5 rows show no missing data, but there could be missing values further down, so let's count them
car_sales_missing.isna().sum()

# When you import data with missing values, pandas fills those places with NaN
# isna().sum() counts the NaN values in each column, showing how much data is missing.

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [35]:
# Create X & y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [36]:
# Let's try and convert our data to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make" , "Colour" , "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot" ,one_hot, categorical_features)], remainder = "passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

<1000x16 sparse matrix of type '<class 'numpy.float64'>'
	with 4000 stored elements in Compressed Sparse Row format>

In [37]:
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


### Option 1: Fill missing data with pandas
Note: filling gaps with a placeholder such as "missing" or with the column mean is a compromise rather than best practice, because the filled values are not real data.

In [48]:
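# Fill the "Make" column
# (assigning the result back avoids relying on inplace=True on a column selection,
# which newer pandas versions warn about)
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())


# Fill the "Doors" column with 4 (by far the most common value in car_sales)
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)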

In [49]:
# Check our dataframe again
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [50]:
# We don't fill the missing Price values, because Price is what we want to predict.
# Remove rows with a missing Price value: we lose some data, but we need real labels to predict the right value.
car_sales_missing.dropna(inplace=True)

In [51]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [52]:
len(car_sales_missing)

950
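
Calling `dropna()` with no arguments removes a row if any column is missing. That is fine here because the other columns were filled first, but when only the label matters, `subset` narrows the check to that column. A minimal sketch on a made-up three-row frame:

```python
import numpy as np
import pandas as pd

# Made-up frame: one row is missing Make, another is missing Price
cars = pd.DataFrame({
    "Make": ["Honda", None, "BMW"],
    "Price": [15323.0, 19943.0, np.nan],
})

# Only rows without a Price are dropped; the row with a missing Make is kept
labelled = cars.dropna(subset=["Price"])

print(len(labelled))                   # 2
print(labelled["Make"].isna().sum())   # 1
```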

In [53]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [54]:
# Let's try and convert our data to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make" , "Colour" , "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot" ,one_hot, categorical_features)], remainder = "passthrough")
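# Note: this transforms the whole DataFrame, so Price passes through into the output below;
# pass X instead to keep the target out of the features.
transformed_X = transformer.fit_transform(car_sales_missing)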
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

### Option 2: Fill missing values with Scikit-Learn