# scikit-learn for Training and Evaluating ML Models

**Project Life Cycle**

*1. Import the dataset*

*2. Prepare the train and test sets for training and testing the model*

*3. Choose the ML model*

*4. Fit the model to the data*

*5. Let the model make predictions*

*6. Evaluate the predictions (check accuracy and other metrics)*

*7. Tune hyperparameters for the best results*
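
The seven steps above can be sketched end to end in a few lines. This is only a minimal sketch: it assumes a synthetic dataset from `make_classification` as a stand-in for `heart.csv`, and `random_state=0` is an assumption added so the run is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. "Import" a dataset (synthetic stand-in with 13 feature columns)
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# 2. Prepare the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 3. Choose the model
model = RandomForestClassifier(random_state=0)

# 4. Fit the model to the training data
model.fit(X_train, y_train)

# 5. Let the model make predictions on unseen data
y_pred = model.predict(X_test)

# 6. Evaluate the predictions
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")

# 7. Try a different hyperparameter value and compare
tuned = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
tuned_accuracy = tuned.score(X_test, y_test)
print(f"accuracy with 50 trees: {tuned_accuracy:.3f}")
```

The rest of the notebook walks through the same steps on the real heart-disease data.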

**Import dataset**

In [1]:
import pandas as pd
# importing the dataset
patients_record = pd.read_csv('../heart.csv')
patients_record

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


**Preparing the dataset**

In [3]:
# separate the features (X) from the labels (Y); the train/test split happens in a later cell
X = patients_record.drop('target', axis=1) # every column except the result (target)
Y = patients_record['target'] # only the result (target)

**Choosing the right model**

In [6]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3) 
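
Because `train_test_split` shuffles the rows randomly, the scores in this notebook change from run to run. The sketch below, which assumes a small made-up DataFrame in place of `heart.csv`, shows how `random_state` pins the shuffle and how `stratify` keeps the class ratio the same in both splits. The value `42` is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows with a balanced binary target
toy = pd.DataFrame({
    "age": range(40, 60),
    "chol": range(200, 220),
    "target": [0, 1] * 10,
})
X = toy.drop("target", axis=1)
Y = toy["target"]

# random_state makes the split repeatable; stratify preserves the 50/50 class ratio
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

# Splitting again with the same random_state selects the same rows
X_train_again, X_test_again, _, _ = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

print(X_train.shape, X_test.shape)
print(Y_test.value_counts().to_dict())
print(X_test.index.equals(X_test_again.index))
```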

In [8]:
rfc.fit(X_train,Y_train)

RandomForestClassifier parameters:
parameter,value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [10]:
# Evaluation of the model, the score (accuracy)
rfc.score(X_test,Y_test)

0.8351648351648352
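
For a classifier, `score` returns plain accuracy, which hides which class the mistakes fall on. As a sketch of the "other metrics" mentioned in step 6, the block below prints a confusion matrix and a per-class report. It assumes synthetic data from `make_classification` rather than the heart-disease data, so the numbers it prints are not comparable to the score above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease data
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_test, y_pred))

# The diagonal of the confusion matrix holds the correct predictions,
# so diagonal / total is the same accuracy that score() reports
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm, clf.score(X_test, y_test))
```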

In [18]:
# Tuning the model with just n_estimators
# To compare several values of n_estimators, a for loop can be used:
# for i in range(10,100,10):
#     print(f"the accuracy score for {i} n_estimators is: ")
#     rfc = RandomForestClassifier(n_estimators = i).fit(X_train,Y_train)
#     print(rfc.score(X_test,Y_test))
print("The Maximum score is: ")
rfc = RandomForestClassifier(n_estimators = 50).fit(X_train,Y_train)
print(rfc.score(X_test,Y_test))

The Maximum score is: 
0.8791208791208791
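
Choosing `n_estimators` by looking at the test score tends to make that score optimistic, because the test set has then influenced the model. One common alternative is to tune with cross-validation on the training data only and touch the test set once at the end. The following is a minimal sketch with `GridSearchCV`, assuming synthetic data and an arbitrary three-value grid; it is not the procedure used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heart-disease data
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Try each n_estimators value with 5-fold cross-validation on the training set
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50, 90]},
    cv=5,
)
search.fit(X_train, y_train)

best_n = search.best_params_["n_estimators"]
print("best n_estimators:", best_n)
print("mean cross-validated accuracy:", search.best_score_)

# The test set is used only once, on the refitted best model
test_accuracy = search.score(X_test, y_test)
print("held-out test accuracy:", test_accuracy)
```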


In [19]:
# saving the model
import pickle

In [21]:
# the with-block closes the file once the model has been written
with open("Heart Disease Prediction", "wb") as f:
    pickle.dump(rfc, f)

In [22]:
# load the model
with open("Heart Disease Prediction", "rb") as f:
    model = pickle.load(f)
model.score(X_test,Y_test)

0.8791208791208791
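
The identical score suggests the model survived the round trip through `pickle`. A more direct check is to compare predictions before and after loading. The sketch below assumes synthetic data and a hypothetical file name, `model.pkl`, written to a temporary directory. Note that `pickle.load` can execute arbitrary code, so only load files from a trusted source.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart-disease data
X, y = make_classification(n_samples=100, n_features=13, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Hypothetical file name inside a throwaway directory
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save, then load, closing the file each time
with open(path, "wb") as f:
    pickle.dump(clf, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should predict exactly what the original predicts
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)
```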

In [4]:
sklearn_steps = [
"1. Import the dataset",
"2. Prepare the train and test sets for training and testing the model",
"3. Choose the ML model",
"4. Fit the model to the data",
"5. Let the model make predictions",
"6. Evaluate the predictions (check accuracy and other metrics)",
"7. Tune hyperparameters for the best results",
]

# let's dive into the details of the sklearn model development life cycle

In [5]:
sklearn_steps

['1. Import the dataset',
 '2. Prepare the train and test sets for training and testing the model',
 '3. Choose the ML model',
 '4. Fit the model to the data',
 '5. Let the model make predictions',
 '6. Evaluate the predictions (check accuracy and other metrics)',
 '7. Tune hyperparameters for the best results']

# 1. Import and prepare the dataset
## a. Split the dataset into independent variables (features) and the dependent variable (target); let's call them X and Y
## b. Fill the missing values
## c. Convert data types (to integers)
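
Sub-steps a to c can be sketched with pandas alone. The rows below are made up for illustration (only the column names are borrowed from `heart.csv`), and filling with the column median is just one possible strategy. In a real workflow the fill values would normally be computed from the training split only, so that nothing from the test set leaks into preparation.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps; column names borrowed from heart.csv
df = pd.DataFrame({
    "age": [63.0, np.nan, 41.0, 56.0],
    "chol": [233.0, 250.0, np.nan, 236.0],
    "target": [1, 1, 1, 0],
})

# a. split into independent variables (features) and the dependent variable
features = df.drop("target", axis=1)
labels = df["target"]

# b. fill each missing value with the median of its column
features = features.fillna(features.median())

# c. convert the now-complete columns to integers
features = features.astype(int)

print(features)
print(features.dtypes)
```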

In [6]:
# necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [7]:
dataset = pd.read_csv('../heart.csv')
dataset

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [13]:
# now split into x and y samples
x = dataset.drop('target',axis = 1)
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [15]:
y = dataset['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [17]:
# split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x,y,test_size = 0.25)