### Introduction to scikit-learn (sklearn)

- An end-to-end scikit-learn workflow
- Getting the data ready
- Choosing the right estimator/algorithm for our problem
- Fitting the model/algorithm and using it to make predictions on our data
- Evaluating a model
- Improving a model
- Saving and loading a trained model
- Putting it all together

In [1]:
import numpy as np

In [2]:
import pandas as pd
heart_disease = pd.read_csv('heart-disease.csv')
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
# Create X (feature matrix)
X = heart_disease.drop('target', axis=1)
# Create y (label vector)
y = heart_disease['target']

In [4]:
# 2. Choose the right estimator/algorithm for our problem
# This is a classification problem: we want to predict whether a person has heart disease or not
# RandomForestClassifier learns patterns in the data and classifies each sample (i.e. each row) as one class or another
from sklearn.ensemble import RandomForestClassifier
# Instantiate the classifier; `clf` (short for classifier) is a common name for the model
clf = RandomForestClassifier()

# We will keep the default parameters
clf.get_params()



{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
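
The dictionary above lists the defaults. As a small sketch of how a default could be overridden (the values 50 and 5 are arbitrary and chosen only for illustration; the notebook itself keeps the defaults), a hyperparameter can be passed to the constructor or changed later with `set_params`:

```python
from sklearn.ensemble import RandomForestClassifier

# Override a default when instantiating (50 is an arbitrary illustrative value)
clf_custom = RandomForestClassifier(n_estimators=50)
n_after_init = clf_custom.get_params()["n_estimators"]

# Or change a hyperparameter on an existing estimator before fitting
clf_custom.set_params(max_depth=5)
depth_after_set = clf_custom.get_params()["max_depth"]

print(n_after_init, depth_after_set)
```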

In [5]:
# 3. Fit the model to the data
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
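
Because `train_test_split` shuffles the rows, each run of the cell above produces a different split. The sketch below shows how `test_size` sets the share of held-out rows and how `random_state` makes the split repeatable. It uses synthetic data from `make_classification` as a stand-in for `heart-disease.csv`; the sample count and the seed values are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease features and labels
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)
# The same random_state gives the same test rows on both calls
print(np.array_equal(X_te, X_te2))
```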

In [6]:
clf.fit(X_train, y_train)  # The random forest classifier finds the patterns in the training data

RandomForestClassifier()

In [7]:
y_preds = clf.predict(X_test)
y_preds

array([0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 1], dtype=int64)

In [8]:
# 4. Evaluate the model on the training data and the test data
clf.score(X_train, y_train) * 100

100.0

In [9]:
clf.score(X_test,y_test)

0.7912087912087912
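
A perfect score on the training data next to roughly 79% on the test data suggests the forest has largely memorised the training rows, and a single test split can be a noisy estimate. One common way to get a steadier number is cross-validation. The following is only a sketch on synthetic data (`make_classification` stands in for the real dataset, and the fold count of 5 is an assumption), not part of the original workflow:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart disease data
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)

# Fit and score the model on 5 different train/validation splits
scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```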

In [10]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

print(classification_report(y_test,y_preds))

              precision    recall  f1-score   support

           0       0.76      0.81      0.78        42
           1       0.83      0.78      0.80        49

    accuracy                           0.79        91
   macro avg       0.79      0.79      0.79        91
weighted avg       0.79      0.79      0.79        91



In [11]:
confusion_matrix(y_test,y_preds)

array([[34,  8],
       [11, 38]], dtype=int64)
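
To see how the confusion matrix relates to the classification report, the numbers can be recomputed by hand. This sketch copies the matrix printed above; in scikit-learn's layout the rows are the true labels and the columns are the predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[34, 8],
               [11, 38]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 72 / 91
precision_1 = tp / (tp + fp)      # 38 / 46, precision for class 1
recall_1 = tp / (tp + fn)         # 38 / 49, recall for class 1

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} recall={recall_1:.2f}")
# → accuracy=0.79 precision=0.83 recall=0.78
```

These values match the class 1 row and the accuracy row of the classification report.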

In [12]:
accuracy_score(y_test,y_preds)

0.7912087912087912
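
The value above is identical to `clf.score(X_test, y_test)` because, for classifiers, `score` reports mean accuracy. A minimal sketch of that equivalence, again on synthetic data standing in for the real dataset (sample count and seeds are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
preds = model.predict(X_te)

# Three ways of computing the same number
manual = np.mean(preds == y_te)            # fraction of correct predictions
via_metric = accuracy_score(y_te, preds)   # metrics function
via_score = model.score(X_te, y_te)        # estimator method

print(manual, via_metric, via_score)
```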

In [13]:
# 5. Improve the model: try different numbers of estimators
np.random.seed(2)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train,y_train)
    print(f"Model accuracy on test set: {clf.score(X_test,y_test) *100:.2f}%")
    print('')

Trying model with 10 estimators...
Model accuracy on test set: 78.02%

Trying model with 20 estimators...
Model accuracy on test set: 74.73%

Trying model with 30 estimators...
Model accuracy on test set: 81.32%

Trying model with 40 estimators...
Model accuracy on test set: 81.32%

Trying model with 50 estimators...
Model accuracy on test set: 79.12%

Trying model with 60 estimators...
Model accuracy on test set: 83.52%

Trying model with 70 estimators...
Model accuracy on test set: 81.32%

Trying model with 80 estimators...
Model accuracy on test set: 79.12%

Trying model with 90 estimators...
Model accuracy on test set: 80.22%
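
Note that the loop rebinds `clf` on every iteration, so once it finishes `clf` is the last model trained (90 estimators), not the best-scoring one. A sketch of how the best model could be kept instead is shown below. It runs on synthetic data, and the names `best_model`, `best_n` and `best_score` are hypothetical. Picking a model by its test score also makes that score optimistic, so a separate validation set would be a cleaner choice in practice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

best_score, best_n, best_model = -1.0, None, None
for n in range(10, 100, 10):
    model = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    score = model.score(X_te, y_te)
    # Remember the model only if it beats the best score seen so far
    if score > best_score:
        best_score, best_n, best_model = score, n, model

print(f"Best: {best_n} estimators, accuracy {best_score * 100:.2f}%")
```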



In [14]:
# 6. Save a model and load it
import pickle
# Use a context manager so the file is closed after writing
with open('random_forest_model_1.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [15]:
with open('random_forest_model_1.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.8021978021978022
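
The loaded model scores 80.22%, the same as the 90-estimator model from the last loop iteration, which confirms that this is the model that was saved. As an alternative to `pickle`, `joblib` (already listed among the installed dependencies) offers `dump` and `load` with a similar interface. The sketch below uses synthetic data, a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart disease data
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, written to a throwaway directory
    model_path = os.path.join(tmp_dir, "random_forest_demo.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

As with `pickle`, files should only be loaded from sources we trust, and ideally with the same scikit-learn version that saved them.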

In [16]:
import sklearn
sklearn.show_versions()


System:
    python: 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\franc\anaconda3\python.exe
   machine: Windows-10-10.0.22538-SP0

Python dependencies:
          pip: 21.3.1
   setuptools: 52.0.0.post20210125
      sklearn: 1.0.2
        numpy: 1.19.5
        scipy: 1.6.2
       Cython: 0.29.23
       pandas: 1.3.0
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True


## 1. Getting our data ready to be used with machine learning

Three main things we have to do:
1. Split the data into features and labels (`X` and `y`)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
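
The three steps above can be sketched on a toy table. The values below are made up for illustration (loosely modelled on a car sales table, with one missing odometer reading added on purpose), and using the column mean and `pd.get_dummies` is just one simple choice for steps 2 and 3:

```python
import numpy as np
import pandas as pd

# Toy table with a missing value and a text column (made-up values)
df = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota"],
    "Odometer (KM)": [35000.0, np.nan, 85000.0, 150000.0],
    "Price": [15000, 20000, 28000, 13000],
})

# 1. Split into features (X) and labels (y)
X_small = df.drop("Price", axis=1)
y_small = df["Price"]

# 2. Fill the missing odometer reading with the column mean
X_small = X_small.fillna({"Odometer (KM)": X_small["Odometer (KM)"].mean()})

# 3. Encode the text column as one-hot indicator columns
X_small = pd.get_dummies(X_small, columns=["Make"])

print(X_small)
```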


In [17]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [18]:
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

In [19]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## 1.1 Make sure it's all numerical

In [23]:
car_sales = pd.read_csv('car-sales-extended.csv')
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [24]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object
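
`Make` and `Colour` are `object` (text) columns, which is why the data needs converting before modelling. As a sketch of what is expected to happen otherwise, the snippet below rebuilds the five preview rows and tries to fit a model on them directly. `RandomForestRegressor` is an assumption here (the notebook has not chosen a model for this dataset yet); it is used because `Price` is a number.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# The five rows shown in the car_sales preview above
car_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota", "Nissan"],
    "Colour": ["White", "Blue", "White", "White", "Blue"],
    "Odometer (KM)": [35431, 192714, 84714, 154365, 181577],
    "Doors": [4, 5, 4, 4, 3],
    "Price": [15323, 19943, 28343, 13434, 14043],
})
X_car = car_demo.drop("Price", axis=1)
y_car = car_demo["Price"]

fit_failed = False
try:
    # Text columns cannot be converted to floats, so fitting should fail
    RandomForestRegressor(n_estimators=10).fit(X_car, y_car)
except (ValueError, TypeError) as err:
    fit_failed = True
    print(f"Fit failed: {err}")
```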

In [27]:
car_sales.columns

Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

In [25]:
#Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
cat_col= ['Make', 'Colour', 'Doors']
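
The notebook cell stops at the list of categorical columns. One way the encoding might continue is sketched below, using the five preview rows rather than the full CSV. The transformer name `"one_hot"` and the `sparse_threshold=0` setting are choices made for this example, not taken from the source. `Doors` is numeric but is treated as a category here, following `cat_col`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The five rows shown in the car_sales preview above
car_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota", "Nissan"],
    "Colour": ["White", "Blue", "White", "White", "Blue"],
    "Odometer (KM)": [35431, 192714, 84714, 154365, 181577],
    "Doors": [4, 5, 4, 4, 3],
    "Price": [15323, 19943, 28343, 13434, 14043],
})
X_car = car_demo.drop("Price", axis=1)

cat_col = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), cat_col)],
    remainder="passthrough",  # keep Odometer (KM) unchanged
    sparse_threshold=0,       # always return a dense array
)
transformed_X = transformer.fit_transform(X_car)

# 4 makes + 2 colours + 3 door counts + 1 passthrough column
print(transformed_X.shape)
# → (5, 10)
```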