INTRODUCTION TO SCIKIT-LEARN

What We Are Going to Cover:
1. An end-to-end Scikit-Learn workflow.
2. Choosing the right estimator (algorithm) for our problem.
3. Fitting the model/algorithm and using it to make predictions on our data.
4. Evaluating a model.
5. Improving a model.
6. Saving and loading a trained model.
7. Putting it all together.

### An End-to-End Scikit-Learn Workflow

In [1]:
#get the data
import pandas as pd
heart_disease = pd.read_csv("Data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [2]:
# Create X (feature matrix)

x = heart_disease.drop("target", axis=1)  # all the columns except target

# Create y (labels)
y = heart_disease["target"]

In [35]:
#import numpy as np

### Choose the right algorithm for your problem

In [36]:
#Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier 
clf = RandomForestClassifier()

# we keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### Fit the Model/Algorithm and use it to make predictions on our data.

In [5]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

In [6]:
clf.fit(x_train,y_train)  # fit the model so it learns patterns in the training data

In [7]:
# Make predictions
# The input must have the same feature columns (same shape) as the data the model was trained on.

y_pred = clf.predict(x_test)
y_pred

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0], dtype=int64)
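
To illustrate the requirement that prediction data match the training data's shape, here is a minimal sketch on synthetic data. The arrays and names such as `X_demo` and `demo_clf` are made up for illustration and are not the heart-disease data. A fitted classifier accepts new rows with the same number of feature columns it was trained on and raises a `ValueError` otherwise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))          # 50 rows, 4 features (synthetic)
y_demo = (X_demo[:, 0] > 0).astype(int)    # hypothetical binary label

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)
demo_clf.fit(X_demo, y_demo)

# Same number of columns as the training data: this works
ok_preds = demo_clf.predict(rng.normal(size=(3, 4)))

# Different number of columns: scikit-learn raises a ValueError
try:
    demo_clf.predict(rng.normal(size=(3, 5)))
    shape_error = None
except ValueError as err:
    shape_error = err

print(ok_preds.shape)
print(type(shape_error).__name__)
```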

### Evaluating a Model.

In [8]:
clf.score(x_train,y_train) #Train score

1.0

In [9]:
clf.score(x_test,y_test) # Test score: lower than the training score of 1.0, which suggests the model overfits the training data

0.7377049180327869

In [10]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [11]:
print(classification_report(y_test,y_pred)) #Compare the y_test and the prediction labels.

              precision    recall  f1-score   support

           0       0.79      0.63      0.70        30
           1       0.70      0.84      0.76        31

    accuracy                           0.74        61
   macro avg       0.75      0.74      0.73        61
weighted avg       0.75      0.74      0.73        61



In [12]:
confusion_matrix(y_test,y_pred) # Compare y_test with the predicted labels (rows: true class, columns: predicted class).

array([[19, 11],
       [ 5, 26]], dtype=int64)

In [13]:
accuracy_score(y_test,y_pred) #Compare the y_test and the prediction labels.

0.7377049180327869

In [14]:
## All of the above report the same accuracy (about 0.74).
## So let's improve the model.
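
The agreement between these metrics is expected: for classifiers, `score()` returns mean accuracy, which is what `accuracy_score` computes, and it also equals the diagonal of the confusion matrix divided by the total count. A small sketch with made-up labels (not the notebook's data) shows the relationship:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 1, 0, 1, 0])

acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc_from_cm = np.trace(cm) / cm.sum()   # correct predictions / all predictions

print(cm)
print(acc, acc_from_cm)
```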

In [15]:
import numpy as np

### Improve a Model.

In [16]:
### Try different values of n_estimators

np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying Model with {i} estimator..")
    clf = RandomForestClassifier(n_estimators=i).fit(x_train,y_train)
    print(f"Model Accuracy on test set:{clf.score(x_test,y_test)*100:.2f}%")
    print(" ")

Trying Model with 10 estimator..
Model Accuracy on test set:72.13%
 
Trying Model with 20 estimator..
Model Accuracy on test set:77.05%
 
Trying Model with 30 estimator..
Model Accuracy on test set:77.05%
 
Trying Model with 40 estimator..
Model Accuracy on test set:78.69%
 
Trying Model with 50 estimator..
Model Accuracy on test set:77.05%
 
Trying Model with 60 estimator..
Model Accuracy on test set:77.05%
 
Trying Model with 70 estimator..
Model Accuracy on test set:73.77%
 
Trying Model with 80 estimator..
Model Accuracy on test set:78.69%
 
Trying Model with 90 estimator..
Model Accuracy on test set:77.05%
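
The scores above move by a few points with no clear trend, so a single test split may be too noisy to choose `n_estimators` from. One option, sketched here under the assumption that cross-validation is acceptable for this comparison, is to average over several splits with `cross_val_score`. The data comes from `make_classification` as a synthetic stand-in, and the names `X_demo`, `demo_clf` and `cv_results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart-disease data (assumption, for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

cv_results = {}
for n in range(10, 100, 30):
    demo_clf = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: five train/test splits instead of one
    scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5)
    cv_results[n] = scores.mean()
    print(f"{n} estimators: mean accuracy {scores.mean():.3f}")

# Pick the setting with the highest mean accuracy
best_n = max(cv_results, key=cv_results.get)
print("Best n_estimators in this sketch:", best_n)
```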
 


### Save and Load a Trained Model.

In [17]:
import pickle

pickle.dump(clf,open("Model/random_forest_model_1.pkl","wb")) # "wb" = write binary

In [18]:
load_model =  pickle.load(open("Model/random_forest_model_1.pkl", "rb")) # "rb" = read binary
load_model.score(x_test,y_test)

0.7704918032786885
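
As a variation on the save/load cells, the sketch below uses `with` blocks so the files are closed automatically, and checks that the reloaded model predicts the same labels as the original. The model, data and temporary path are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))          # synthetic features
y_demo = (X_demo[:, 0] > 0).astype(int)    # hypothetical labels
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Temporary path used only for this sketch
model_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")

with open(model_path, "wb") as f:   # "wb" = write binary
    pickle.dump(demo_clf, f)

with open(model_path, "rb") as f:   # "rb" = read binary
    loaded_clf = pickle.load(f)

# The reloaded model should behave exactly like the one that was saved
same_predictions = np.array_equal(demo_clf.predict(X_demo), loaded_clf.predict(X_demo))
print(same_predictions)
```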

BREAKING DOWN THE STEPS ABOVE.

## Getting Our Data Ready to Be Used with Machine Learning

 1. Split the data into features and labels (usually `X` and `y`).
 2. Fill (also called imputing) or discard missing values.
 3. Convert non-numerical values to numerical values (also called feature encoding).
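
The three steps can be sketched on a tiny, made-up DataFrame before applying them to real data. The column names and values here are hypothetical, and pandas alone is used for brevity.

```python
import pandas as pd

# Hypothetical toy data with missing values and a text column
toy = pd.DataFrame({
    "colour": ["red", "blue", None, "red"],
    "km": [10.0, None, 30.0, 50.0],
    "price": [5, 7, 6, 9],
})

# 1. Split into features (X) and labels (y)
X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]

# 2. Fill missing values: a placeholder for text, the column mean for numbers
X_toy = X_toy.assign(
    colour=X_toy["colour"].fillna("missing"),
    km=X_toy["km"].fillna(X_toy["km"].mean()),
)

# 3. Convert the non-numerical column to numbers (one column per category)
X_toy = pd.get_dummies(X_toy, columns=["colour"], dtype=int)

print(X_toy)
```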

In [19]:
heart_disease = pd.read_csv("Data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [20]:
#features
x = heart_disease.drop(["target"], axis=1)
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [21]:
#labels
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [22]:
## Splitting the data into a training set and a test set

from sklearn.model_selection import train_test_split

x_train, x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

In [23]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((212, 13), (91, 13), (212,), (91,))

### Converting non-numerical values to numerical values
 The dataset above is already numerical, so let's import another one.

In [24]:
car_sales = pd.read_csv("Data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [25]:
# split the data
x = car_sales.drop(["Price"],axis=1)
y = car_sales["Price"]

#splitting into train set and test set

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

In [26]:
## Build a machine learning model to make predictions
## A regressor predicts numbers (here, the price)

from sklearn.ensemble import RandomForestRegressor

model =  RandomForestRegressor()

model.fit(x_train,y_train)
model.score(x_test,y_test)

# This code won't run because the model only understands numbers; see the error below.

ValueError: could not convert string to float: 'Toyota'

### Turn categories into numbers
The `Doors` column is included because its values (for example 3, 4 and 5) are discrete categories rather than continuous quantities.

In [30]:
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
tranformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],remainder="passthrough")


tranformer_x = tranformer.fit_transform(x)

tranformer_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [31]:
## Put it into a DataFrame
pd.DataFrame(tranformer_x).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0


### Another way to turn non-numerical values into numerical values

In [32]:
dummies = pd.get_dummies(car_sales[["Make","Colour","Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1
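
In the output above, `Doors` remains a single numeric column, because `pd.get_dummies` only encodes text and categorical columns by default. A sketch of one way to encode it as well is to list the columns explicitly with `columns=`. The small DataFrame is made up, and `dtype=int` is passed so the result shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical miniature of the car sales data
cars_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Doors": [4, 5, 3],
})

# Default: only the text column is encoded, Doors stays numeric
default_dummies = pd.get_dummies(cars_demo, dtype=int)

# Listing the columns explicitly encodes Doors as well
all_dummies = pd.get_dummies(cars_demo, columns=["Make", "Doors"], dtype=int)

print(default_dummies.columns.tolist())
print(all_dummies.columns.tolist())
```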


In [33]:
## Now let's refit the model above with the numerical data (tranformer_x)
np.random.seed(42)
x_train,x_test,y_train,y_test = train_test_split(tranformer_x,y,test_size=0.2)

model.fit(x_train,y_train)

In [34]:
model.score(x_test,y_test)

0.3235867221569877

## Handling missing values

 1. Fill in the missing values (imputation).
 2. Remove the rows with missing values altogether.
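
The two options can be compared on a small, made-up DataFrame. This sketch passes a dictionary to `DataFrame.fillna` to set one fill value per column, then uses `dropna` to remove any row that still has a gap. The column names mirror the car sales data, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15000.0, 20000.0, 18000.0, np.nan],
})

# Option 1: fill in missing values, one fill value per column
filled = demo.fillna({"Make": "missing", "Doors": 4})

# Option 2: remove every row that has any missing value
dropped = demo.dropna()

print(filled.isna().sum())       # Price still has one gap, it was not filled
print(len(demo), len(dropped))   # rows before and after dropping
```

Filling keeps every row at the cost of introducing placeholder values, while dropping keeps only complete rows and can shrink a small dataset quickly.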

In [None]:
car_sales_missing = pd.read_csv("Data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()

In [None]:
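#### Fill missing values
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not modify the original DataFrame in recent pandas versions.

car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing") # make
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing") # colour
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)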

In [None]:
car_sales_missing.isna().sum()

In [None]:
# Remove the rows with remaining missing values (in Price)
car_sales_missing.dropna(inplace=True)

In [None]:
car_sales_missing.isna().sum()

In [None]:
## Split the data

x  = car_sales_missing.drop(["Price"],axis=1)
y  = car_sales_missing["Price"]

In [None]:
## NB: newer versions of OneHotEncoder can handle None and NaN values,
## so the code below would run even without the cleaning above.
## For the sake of practice, the NaN values were filled or removed first.

from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
tranformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],remainder="passthrough")

## Transform the features to numerical values
## (use x, not car_sales_missing, so the Price label is not passed through as a feature)
tranformer_x = tranformer.fit_transform(x)

tranformer_x

## Option 2: Removing and filling missing values with sklearn.

In [37]:
car_sales_missing = pd.read_csv("Data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [38]:
# Check missing values
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [39]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

The data is split into train and test sets before any missing values are filled or any transformations are applied.

In [40]:
from sklearn.model_selection import train_test_split

# Split into X & y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Split data into train and test
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

In [41]:
# Check missing values
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

Let's fill the missing values. We'll fill the training and test values separately to ensure training data stays with the training data and test data stays with the test data.

Note: We use fit_transform() on the training data and transform() on the testing data. In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). Then we take those same patterns and fill the test set (transform only).
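
A small sketch of the `fit_transform()` / `transform()` distinction, using made-up odometer readings: the imputer learns the mean from the training data only, stores it in its `statistics_` attribute, and reuses that value when filling the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical odometer readings with gaps
train_km = np.array([[10.0], [20.0], [np.nan], [30.0]])
test_km = np.array([[np.nan], [100.0]])

demo_imputer = SimpleImputer(strategy="mean")

# fit_transform: learn the mean from the training data, then fill the training gaps
filled_train = demo_imputer.fit_transform(train_km)

# transform: reuse the training mean; the test data is never used for fitting
filled_test = demo_imputer.transform(test_km)

print(demo_imputer.statistics_)  # mean learned from the training data
print(filled_test)
```

The test gap is filled with the training mean rather than the test mean, which is what keeps information from the test set out of the preprocessing step.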

In [42]:
# Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' & numerical values with mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

# Fill train and test values separately
filled_X_train = imputer.fit_transform(X_train) # fit_transform learns the fill values (e.g. the mean) from the training set and fills the training set with them
filled_X_test = imputer.transform(X_test) # transform reuses the fill values learned from the training set to fill the test set

# Check filled X_train
filled_X_train

array([['Honda', 'White', 4.0, 71934.0],
       ['Toyota', 'Red', 4.0, 162665.0],
       ['Honda', 'White', 4.0, 42844.0],
       ...,
       ['Toyota', 'White', 4.0, 196225.0],
       ['Honda', 'Blue', 4.0, 133117.0],
       ['Honda', 'missing', 4.0, 150582.0]], dtype=object)

Now we've filled our missing values, let's check how many are missing from each set.

In [43]:
# Get our transformed data arrays back into DataFrames
car_sales_filled_train = pd.DataFrame(filled_X_train, 
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_test = pd.DataFrame(filled_X_test, 
                                     columns=["Make", "Colour", "Doors", "Odometer (KM)"])

# Check missing data in training set
car_sales_filled_train.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [44]:
# Check missing data in test set
car_sales_filled_test.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

No more missing values!

Okay, no missing values but we've still got to turn our data into numbers. Let's do that using one hot encoding.

Again, keeping our training and test data separate.

In [47]:
# Import OneHotEncoder class from sklearn
from sklearn.preprocessing import OneHotEncoder

# Now let's one hot encode the features with the same code as before 
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")

# Fill train and test values separately
transformed_X_train = transformer.fit_transform(car_sales_filled_train) # fit and transform the training data
transformed_X_test = transformer.transform(car_sales_filled_test) # transform the test data

# Check transformed and filled X_train
transformed_X_train.toarray()

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 7.19340e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.62665e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 4.28440e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.96225e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.33117e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.50582e+05]])

### Fit a model

Wonderful! Now we've filled and transformed our data, ensuring the training and test sets have been kept separate. Let's fit a model to the training set and evaluate it on the test set.

In [48]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

# Setup model
model = RandomForestRegressor()

# Make sure to use transformed (filled and one-hot encoded X data)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.21229043336119102

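The model above scored lower than the previous one (about 0.21 versus 0.32). One likely reason is that it was trained on less data, since the rows with missing prices were dropped. Let's compare the dataset sizes below.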

In [55]:
# Compare the lengths of the two datasets used.
len(car_sales_missing),len(car_sales)

(950, 1000)
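
The separate imputation and encoding steps from Option 2 can also be chained into a single scikit-learn `Pipeline`, so that `fit` only ever sees training data and `score` applies the same learned steps to the test data. This is a sketch on synthetic data, not the notebook's CSV files: the DataFrame, the price formula and all `*_demo` names are assumptions made for illustration, and `handle_unknown="ignore"` is an added choice so unseen categories in the test set do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the car sales data (assumption, for illustration)
rng = np.random.default_rng(42)
n = 200
makes = rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n)
cars_demo = pd.DataFrame({
    "Make": pd.Series(makes, dtype=object),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
    "Odometer (KM)": rng.uniform(10_000, 250_000, size=n),
})
price_demo = 30_000 - 0.08 * cars_demo["Odometer (KM)"] + rng.normal(0, 1_000, size=n)

# Knock out some values to mimic missing data
cars_demo.loc[::15, "Make"] = np.nan
cars_demo.loc[::20, "Odometer (KM)"] = np.nan

# Impute, then one-hot encode, the categorical columns
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
door_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("cat", cat_steps, ["Make"]),
    ("doors", door_steps, ["Doors"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

demo_model = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    cars_demo, price_demo, test_size=0.2, random_state=42
)
demo_model.fit(X_tr, y_tr)        # imputers and encoders are fitted on training data only
r2 = demo_model.score(X_te, y_te) # R^2 on the held-out test data
print(f"R^2 on synthetic test data: {r2:.2f}")
```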