<a href="https://colab.research.google.com/github/xtructt/z2m-ML/blob/master/Scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Scikit-Learn
This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.
What we're going to cover:
0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together

In [1]:
what_were_covering = [
    '0. An end-to-end Scikit-Learn workflow',
    '1. Getting the data ready',
    '2. Choose the right estimator/algorithm for our problems',
    '3. Fit the model/algorithm and use it to make predictions on our data',
    '4. Evaluating a model',
    '5. Improve a model',
    '6. Save and load a trained model',
    '7. Putting it all together'
]

## 0. An end-to-end Scikit-Learn workflow

In [2]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



In [3]:
# 1. Get the data ready

file = "https://raw.githubusercontent.com/xtructt/z2m-ML/master/Resources/zero-to-mastery-ml-master/data/heart-disease.csv"
heart_disease = pd.read_csv(file)
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [4]:
#Create X (Features matrix)
x = heart_disease.drop("target", axis=1)
#Create Y (Labels)
y = heart_disease["target"]

In [5]:
#Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
# We'll keep the default parameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [6]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

In [7]:
clf.fit(x_train,y_train);

In [8]:
y_preds = clf.predict(x_test)
y_preds

array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0])

In [9]:
# 4. Evaluate the model
clf.score(x_train,y_train)

1.0

In [10]:
clf.score(x_test,y_test)

0.8688524590163934

In [11]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test,y_preds))

              precision    recall  f1-score   support

           0       0.88      0.82      0.85        28
           1       0.86      0.91      0.88        33

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61



In [12]:
confusion_matrix(y_test,y_preds)

array([[23,  5],
       [ 3, 30]])

In [13]:
accuracy_score(y_test,y_preds)

0.8688524590163934

In [14]:
# 5. Improve a model
# Try different values of n_estimators
for i in range (10,100,10):
  print(f"Trying model with {i} estimators...")
  clf = RandomForestClassifier(n_estimators=i).fit(x_train,y_train)
  print(f"Model accuracy on test set: {clf.score(x_test,y_test)*100:2f}%")
  print(" ")

Trying model with 10 estimators...
Model accuracy on test set: 85.245902%
 
Trying model with 20 estimators...
Model accuracy on test set: 83.606557%
 
Trying model with 30 estimators...
Model accuracy on test set: 81.967213%
 
Trying model with 40 estimators...
Model accuracy on test set: 86.885246%
 
Trying model with 50 estimators...
Model accuracy on test set: 85.245902%
 
Trying model with 60 estimators...
Model accuracy on test set: 81.967213%
 
Trying model with 70 estimators...
Model accuracy on test set: 86.885246%
 
Trying model with 80 estimators...
Model accuracy on test set: 85.245902%
 
Trying model with 90 estimators...
Model accuracy on test set: 86.885246%
 


In [15]:
# 6. Save the model and load it
import pickle
# Use a context manager so the file is closed after writing
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [16]:
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test,y_test)

0.8688524590163934

## 1. Getting our data ready to be used with machine learning
Three main things we have to do:
1. Split the data into features and labels (usually 'X' & 'y')
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
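
The three steps above can be sketched on a tiny made-up table. The column names and values below are hypothetical and are not taken from the datasets used in this notebook; `pd.get_dummies` is used here only as a compact stand-in for the encoding step.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature, one numerical feature, one label
toy = pd.DataFrame({
    "colour": ["red", "blue", None, "red"],
    "size": [1.0, None, 3.0, 4.0],
    "label": [0, 1, 0, 1],
})

# 1. Split the data into features (X) and labels (y)
toy_x = toy.drop("label", axis=1)
toy_y = toy["label"]

# 2. Fill (impute) the missing values
toy_x["colour"] = toy_x["colour"].fillna("missing")
toy_x["size"] = toy_x["size"].fillna(toy_x["size"].mean())

# 3. Convert non-numerical values to numerical ones (one-hot encoding)
toy_x_encoded = pd.get_dummies(toy_x, columns=["colour"])
print(toy_x_encoded)
```

After these steps every column is numerical and no value is missing, which is what most Scikit-Learn estimators expect.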

In [17]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [18]:
x = heart_disease.drop("target", axis=1)
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [19]:
y =  heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [20]:
#Split the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test,y_train, y_test = train_test_split(x,y, test_size = 0.2)

In [21]:
x_train.shape, x_test.shape,y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

### 1.1 Make sure it's all numerical

In [22]:
car_sales_url = "https://raw.githubusercontent.com/xtructt/z2m-ML/master/Resources/zero-to-mastery-ml-master/data/car-sales-extended.csv"
car_sales = pd.read_csv(car_sales_url)
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [23]:
len(car_sales)

1000

In [24]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [25]:
#Split into X/Y
x = car_sales.drop(["Price"], axis=1 )
y=car_sales["Price"]
#Split into training and test
x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size = 0.2)

In [26]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [27]:
categorical_features = ["Make", "Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder='passthrough')
transformed_x = transformer.fit_transform(x)
transformed_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [28]:
pd.DataFrame(transformed_x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [29]:
x_train, x_test, y_train, y_test = train_test_split(transformed_x,
                                                    y,
                                                    test_size = 0.2)

In [30]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.417656912606981

### 1.2 What if there were missing values?
1. Fill them with some value (also known as imputation).
2. Remove the samples with missing values altogether.
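
As a rough illustration of the trade-off between the two options, here is a minimal sketch on a hypothetical three-row table (the values are invented). Imputation keeps every row, while removal shrinks the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
toy_missing = pd.DataFrame({
    "doors": [4.0, np.nan, 3.0],
    "price": [15000.0, 20000.0, np.nan],
})

# Option 1: imputation, here filling only the "doors" column with a constant
imputed = toy_missing.fillna({"doors": 4.0})

# Option 2: remove every row that contains a missing value
dropped = toy_missing.dropna()

print(len(toy_missing), len(imputed), len(dropped))
```

Which option is better depends on how much data can be spared and how plausible the filled values are.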

In [31]:
#Import car sales missing data
url = "https://raw.githubusercontent.com/xtructt/z2m-ML/master/Resources/zero-to-mastery-ml-master/data/car-sales-extended-missing-data.csv"
car_sales_missing = pd.read_csv(url)
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [32]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [33]:
#Create X and Y
x = car_sales_missing.drop("Price", axis=1)
y= car_sales_missing["Price"]

### Option 1: Fill missing data with pandas


In [34]:
# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")
# Fill the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
# Fill the "Doors" column with the number 4 (not the string "4") so the column stays numeric
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [35]:
#Remove rows with missing price value
car_sales_missing.dropna(inplace=True)

In [36]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [37]:
# Split into X/y, using the filled car_sales_missing data rather than the original car_sales
x = car_sales_missing.drop(["Price"], axis=1)
y = car_sales_missing["Price"]
#Split into training and test
x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size = 0.2)


In [38]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Let's try to convert the data to numbers
categorical_features = ["Make", "Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder='passthrough')
transformed_x = transformer.fit_transform(x)
transformed_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

### Option 2: Fill missing values with Scikit-Learn

In [39]:
#Import car sales missing data
url = "https://raw.githubusercontent.com/xtructt/z2m-ML/master/Resources/zero-to-mastery-ml-master/data/car-sales-extended-missing-data.csv"
car_sales_missing = pd.read_csv(url)
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [40]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [41]:
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [42]:
#split into x and y
x= car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [43]:
# Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


# Fill categorical values with 'missing' & numerical values with the mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")
# Define columns
cat_feature = ["Make", "Colour"]
door_feature = ["Doors"]
num_feature = ["Odometer (KM)"]
# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([("cat_imputer",cat_imputer, cat_feature),
                             ("door_imputer",door_imputer, door_feature),
                             ("num_imputer",num_imputer, num_feature)])
# Transform the data
filled_x = imputer.fit_transform(x)
filled_x

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [44]:
car_sales_filled = pd.DataFrame(filled_x,
                                columns=["Make", "Colour","Doors", "Odometer (KM)" ])
car_sales_filled

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4,35431
1,BMW,Blue,5,192714
2,Honda,White,4,84714
3,Toyota,White,4,154365
4,Nissan,Blue,3,181577
...,...,...,...,...
945,Toyota,Black,4,35820
946,missing,White,3,155144
947,Nissan,Blue,4,66604
948,Honda,White,4,215883


In [45]:
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [46]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder='passthrough')
transformed_x = transformer.fit_transform(car_sales_filled)
transformed_x.ndim



2

In [47]:
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(transformed_x, y, test_size=0.2)
model = RandomForestRegressor(n_estimators=100)
model.fit(x_train, y_train)
model.score(x_test, y_test)


0.21990196728583944

## 2. Choosing the right estimator/algorithm for our problem
Scikit-Learn uses "estimator" as another term for a machine learning model or algorithm.

* Classification - predicting whether a sample is one thing or another
* Regression - predicting a number
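
To make the difference concrete, here is a minimal sketch on synthetic data. `make_classification` and `make_regression` generate made-up samples, and the names `clf_demo` and `reg_demo` are introduced only for this illustration. The classifier returns class labels, while the regressor returns continuous numbers.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data, for illustration only
x_cls, y_cls = make_classification(n_samples=100, n_features=5, random_state=42)
x_reg, y_reg = make_regression(n_samples=100, n_features=5, random_state=42)

# Same estimator family, two different problem types
clf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(x_cls, y_cls)
reg_demo = RandomForestRegressor(n_estimators=10, random_state=42).fit(x_reg, y_reg)

print(clf_demo.predict(x_cls[:3]))  # class labels (0 or 1)
print(reg_demo.predict(x_reg[:3]))  # continuous values
```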

### 2.1 Picking a machine learning model for a regression problem

In [48]:
# Import Boston housing dataset
from sklearn.datasets import load_boston
boston = load_boston()
boston;
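
Note that `load_boston` was removed in scikit-learn 1.2, so the import above only works on older releases. As a minimal sketch of the same loading pattern on a newer version, the block below swaps in `load_diabetes`, another built-in regression dataset. This substitution is an assumption made for illustration, and any scores obtained with it will differ from the Boston results shown in this section.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Built-in regression dataset, used here as a stand-in for the Boston data
diabetes = load_diabetes()
diabetes_df = pd.DataFrame(diabetes["data"], columns=diabetes["feature_names"])
diabetes_df["target"] = pd.Series(diabetes["target"])
print(diabetes_df.shape)
```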

In [49]:
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [50]:
# How many samples?
len(boston_df)


506

In [51]:
# Let's try the Ridge regression model
from sklearn.linear_model import Ridge
# Set up random seed
np.random.seed(42)
# Create the data
x = boston_df.drop("target", axis=1)
y = boston_df["target"]
# Split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
# Instantiate Ridge model
model = Ridge()
model.fit(x_train,y_train)
# Score the model on the test data
model.score(x_test,y_test)




0.6662221670168522

How do we improve this score? What if Ridge doesn't work well enough? Let's try a different estimator.
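
One way to explore this question is to fit several candidate estimators on the same split and compare their test scores. The sketch below uses synthetic data from `make_regression`, and the `candidates` and `results` names are hypothetical. Which estimator wins depends entirely on the data, so the ranking here says nothing about the Boston dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only
x_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate estimators to compare on the same train/test split
candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(xd_train, yd_train)
    # score() returns R^2 for regressors
    results[name] = estimator.score(xd_test, yd_test)

print(results)
```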

In [52]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
x = boston_df.drop("target" ,axis=1)
y = boston_df["target"]
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
rf = RandomForestRegressor()
rf.fit(x_train,y_train)
rf.score(x_test, y_test)

0.873969014117403

### 2.2 Choosing the right estimator for a classification problem


In [53]:
url = "https://raw.githubusercontent.com/xtructt/z2m-ML/master/Resources/zero-to-mastery-ml-master/data/heart-disease.csv"
heart_disease = pd.read_csv(url)
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [54]:
len(heart_disease)

303

Consulting the Scikit-Learn machine learning map, it says to try `LinearSVC`.

In [55]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

#Setup random seed
np.random.seed(42)
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
#Split the data
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)

#instantiate LinearSVC
clf = LinearSVC(max_iter=10000)
clf.fit(x_train, y_train)
clf.score(x_test, y_test)



0.8688524590163934

In [56]:
#import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

#Setup random seed
np.random.seed(42)
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
#Split the data
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)

#instantiate RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
clf.score(x_test, y_test)

0.8524590163934426

## 3. Fit the model/algorithm on our data

In [57]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

#Setup random seed
np.random.seed(42)
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
#Split the data
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)

#instantiate LinearSVC
clf = LinearSVC(max_iter=10000)
clf.fit(x_train, y_train)
clf.score(x_test, y_test)



0.8688524590163934

### 3.1 Making predictions using our model



In [58]:
clf_pred = clf.predict(x_test)
(clf_pred == y_test).mean()

0.8688524590163934

In [59]:
# Predictions with a regression model
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
x = boston_df.drop("target" ,axis=1)
y = boston_df["target"]
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
rf = RandomForestRegressor()
rf.fit(x_train,y_train)
res_pre = rf.predict(x_test)
res_pre[:10]

array([23.002, 30.826, 16.734, 23.467, 16.853, 21.725, 19.232, 15.239,
       21.067, 20.738])

In [60]:
#Compare the prediction with the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, res_pre)

2.1226372549019623

## 4. Evaluating a machine learning model

In [61]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [62]:
from sklearn.svm import LinearSVC
x =  heart_disease.drop("target", axis=1)
y = heart_disease["target"]
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.2)
model = LinearSVC(max_iter=10000)
model.fit(x_train,y_train)
model.score(x_test,y_test)



0.6229508196721312

### Evaluating a model using the `score` method and `cross_val_score`



In [63]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
x = boston_df.drop("target" ,axis=1)
y = boston_df["target"]
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
rf = RandomForestRegressor()
rf.fit(x_train,y_train);

In [64]:
rf.score(x_test,y_test)

0.873969014117403

In [65]:
cross_val_score(rf,x,y,cv=5)

array([0.75909537, 0.84959941, 0.75551512, 0.45660835, 0.23564758])

### 4.2.1 Classification model evaluation metrics
1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification report

In [66]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]
clf = RandomForestClassifier(n_estimators=100)
# Store the scores under a new name so the cross_val_score function isn't overwritten
cv_scores = cross_val_score(clf,x,y,cv=5)

In [67]:
np.mean(cv_scores)

0.8150819672131148

In [68]:
print(f"Heart disease Classifier Cross-Validated Accuracy: {np.mean(cv_scores) * 100:.2f}%")

Heart disease Classifier Cross-Validated Accuracy: 81.51%


**Area under the receiver operating characteristic curve (AUC/ROC)**
* Area under curve (AUC)
* ROC curve

ROC curves are a comparison of a model's true positive rate (tpr) versus a model's false positive rate (fpr).
* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1
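
The four outcomes above determine both rates: tpr = TP / (TP + FN) and fpr = FP / (FP + TN). Here is a minimal sketch with hypothetical labels and predictions (the arrays are invented for illustration), using `confusion_matrix` to count each outcome.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

tpr_demo = tp / (tp + fn)  # true positive rate
fpr_demo = fp / (fp + tn)  # false positive rate

print(tn, fp, fn, tp)      # 3 1 1 3
print(tpr_demo, fpr_demo)  # 0.75 0.25
```

An ROC curve repeats this calculation at many decision thresholds and plots the resulting (fpr, tpr) pairs.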

In [70]:
# Create x_train, x_test, etc.
x_train, x_test, y_train, y_test =  train_test_split(x,y, test_size = 0.2)

In [83]:
from sklearn.metrics import roc_curve
# Make predictions with probabilities
clf.fit(x_train, y_train)
y_probs = clf.predict_proba(x_test)
y_probs[:10]

array([[0.52, 0.48],
       [0.76, 0.24],
       [0.34, 0.66],
       [0.19, 0.81],
       [0.53, 0.47],
       [0.34, 0.66],
       [0.21, 0.79],
       [0.04, 0.96],
       [0.7 , 0.3 ],
       [0.86, 0.14]])

In [84]:
y_probs_positive = y_probs[:,1]
y_probs_positive

array([0.48, 0.24, 0.66, 0.81, 0.47, 0.66, 0.79, 0.96, 0.3 , 0.14, 0.8 ,
       0.93, 0.05, 0.22, 0.59, 0.06, 0.16, 0.74, 0.24, 0.5 , 0.96, 0.54,
       0.21, 0.19, 0.41, 0.45, 0.17, 0.4 , 0.3 , 0.11, 0.21, 0.68, 0.4 ,
       0.81, 0.01, 0.28, 0.72, 0.62, 0.52, 0.36, 0.28, 0.46, 0.05, 0.96,
       0.85, 0.89, 0.14, 0.71, 0.78, 0.21, 0.34, 0.19, 0.87, 0.64, 0.1 ,
       0.83, 0.82, 0.76, 0.96, 0.46, 0.82])

In [86]:
# Calculate fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_probs_positive)
# Check the false positive rates
fpr

array([0.        , 0.        , 0.        , 0.03333333, 0.03333333,
       0.03333333, 0.03333333, 0.06666667, 0.06666667, 0.1       ,
       0.1       , 0.13333333, 0.13333333, 0.2       , 0.2       ,
       0.23333333, 0.23333333, 0.26666667, 0.33333333, 0.36666667,
       0.36666667, 0.43333333, 0.5       , 0.53333333, 0.6       ,
       0.66666667, 0.73333333, 0.8       , 0.9       , 0.96666667,
       1.        ])