# Scikit-learn
- Scikit-learn is a free machine learning library for the Python programming language.
- Installation: https://scikit-learn.org/stable/install.html

* Simple and efficient tools for predictive data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license

In [42]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Random Forest Classifier Workflow for Classifying Heart Disease

### 1. Get the cleaned data.

In [21]:
# Import Training Data 
training = pd.read_csv('data/training_data.csv')
training.head()

(242, 14)

In [22]:
# Import Test Data 
test = pd.read_csv('data/testing_data.csv')
test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,58,1,1,120,284,0,0,160,0,1.8,1,0,2,0
1,52,1,0,112,230,0,1,160,0,0.0,2,1,2,0
2,42,0,2,120,209,0,1,173,0,0.0,1,0,2,1
3,55,1,1,130,262,0,1,155,0,0.0,2,0,2,1
4,53,0,0,130,264,0,0,143,0,0.4,1,0,2,1


In [23]:
# Create X (all the feature columns)
X_train = training.drop("target", axis=1)
X_test = test.drop("target", axis=1)

# Create y (the target column)
y_train = training["target"]
y_test = test["target"]

In [24]:
# Check For Balance
y_train.value_counts()

1    132
0    110
Name: target, dtype: int64

In [25]:
y_test.value_counts()

1    33
0    28
Name: target, dtype: int64

In [26]:
X_train.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,67,1,2,152,212,0,0,150,0,0.8,1,0,3
1,53,1,2,130,246,1,0,173,0,0.0,2,3,2
2,61,1,3,134,234,0,1,145,0,2.6,1,2,2
3,45,1,1,128,308,0,0,170,0,0.0,2,0,2
4,50,1,0,144,200,0,0,126,1,0.9,1,0,3
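
The `Unnamed: 0` column in the tables above looks like a row index that was written out when the CSV was saved. If that reading is right, it carries no information about the patient, yet `drop("target", axis=1)` still leaves it in as a feature. Below is a minimal sketch of two ways to keep it out. The inline CSV is only a stand-in for the real file, built from a few columns of the first test rows shown earlier.

```python
import io

import pandas as pd

# Stand-in for the real CSV: a saved index appears as an unnamed first column.
csv_text = (
    ",age,sex,cp,target\n"
    "0,58,1,1,0\n"
    "1,52,1,0,0\n"
    "2,42,0,2,1\n"
)

# Option 1: tell pandas that the first column is the index while reading.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read the file as-is, then drop the leftover column explicitly.
raw = pd.read_csv(io.StringIO(csv_text))
df_dropped = raw.drop(columns=["Unnamed: 0"])

print(list(raw.columns))
print(list(df.columns))
print(list(df_dropped.columns))
```

Either option would change the feature set, so the scores reported in the rest of this notebook would not be reproduced exactly.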


### 2. Choose the model and hyperparameters

In [32]:
# We'll use a Random Forest
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

In [33]:
# We'll leave the hyperparameters as default to begin with...
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### 3. Fit the model to the data
* Fitting the model involves passing it the data so that the ML algorithm can learn the patterns in it.

* If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

* If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.
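
The difference between the two cases shows up in what `fit()` receives. The sketch below uses synthetic data from `make_classification` rather than the heart disease dataset, and `KMeans` is just one illustrative choice of unsupervised estimator, not something used elsewhere in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the heart disease dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Supervised: fit() receives both the features and the labels.
supervised = RandomForestClassifier(n_estimators=20, random_state=0)
supervised.fit(X_demo, y_demo)

# Unsupervised: fit() receives only the features and groups similar samples.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0)
unsupervised.fit(X_demo)

print(supervised.predict(X_demo[:5]))  # predicted class labels
print(unsupervised.labels_[:5])        # cluster assignments, not class labels
```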

In [34]:
clf.fit(X_train, y_train)

RandomForestClassifier()

### 4. Use the model to make a prediction
Once the model is trained, you can use the predict() method to predict a target value from a set of features. In other words, use the model along with some unlabelled data to predict the labels.

Note that the data you predict on must have the same feature columns, in the same order, as the data you trained on.

In [35]:
# To predict labels, the data must have the same feature columns as X_train
X_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,58,1,1,120,284,0,0,160,0,1.8,1,0,2
1,52,1,0,112,230,0,1,160,0,0.0,2,1,2
2,42,0,2,120,209,0,1,173,0,0.0,1,0,2
3,55,1,1,130,262,0,1,155,0,0.0,2,0,2
4,53,0,0,130,264,0,0,143,0,0.4,1,0,2
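
When new data arrives from a different source, its columns may be in a different order or include extras. A minimal sketch of aligning it to the training layout before calling predict() follows. The column list is a shortened subset, and `patient_id` is a hypothetical extra column invented for the example.

```python
import pandas as pd

# Columns the model was trained on (a shortened, illustrative subset).
train_columns = ["age", "sex", "cp", "trestbps"]

# Hypothetical new data: same features, different order, plus an extra column.
new_data = pd.DataFrame(
    {"cp": [1], "age": [58], "patient_id": [101], "trestbps": [120], "sex": [1]}
)

# Fail early if a feature the model expects is missing.
missing = set(train_columns) - set(new_data.columns)
if missing:
    raise ValueError(f"Missing feature columns: {sorted(missing)}")

# Select and reorder to match the training layout; extra columns are dropped.
aligned = new_data[train_columns]
print(aligned)
```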


In [36]:
y_preds = clf.predict(X_test)
print(y_preds)

[1 0 1 1 1 1 0 1 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0
 0 1 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1 0 0 0 1 0]


### 5. Evaluate the model
Each model or estimator has a built-in score() method. It measures how well the model's predictions match the true labels for the data you pass in. For classifiers, it returns the accuracy.

##### Score Method

- Accuracy is the default metric for the score() method of each of Scikit-Learn's classifier models.
- It is the ratio of the number of correct predictions to the total number of input samples.
- It is a reliable summary only when the classes are roughly balanced; with heavily imbalanced classes, a model can reach high accuracy just by always predicting the majority class.

In [37]:
# Evaluate the model on the training set
clf.score(X_train, y_train)

1.0

In [38]:
# Evaluate the model on the test set
clf.score(X_test, y_test)

0.8032786885245902
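
The classes in this dataset are fairly balanced (132 vs. 110 in training), so accuracy is a reasonable summary here. To see why it can mislead on imbalanced data, consider this small sketch with made-up labels and a "model" that always predicts the majority class:

```python
# Hypothetical, deliberately imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for class 1: share of actual positives that were found.
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_pos = positives_found / sum(y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"recall (class 1): {recall_pos:.2f}")
```

The accuracy comes out at 0.95 even though not a single positive case is detected, which is why per-class metrics are worth checking as well.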

#### classification_report
* Precision: of the samples predicted as a class, the proportion that actually belong to it.
* Recall: of the samples that actually belong to a class, the proportion the model predicted correctly.
* F1 score: the harmonic mean of precision and recall.
* Support is the number of actual occurrences of the class in the specified dataset.
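
These definitions can be checked on a tiny example. The labels below are made up for illustration; `precision_score`, `recall_score`, and `f1_score` are the scikit-learn functions for the individual metrics, and by default they report on the positive class (1).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 2 true positives, 2 false negatives, 1 false positive, 1 true negative.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall.
f1_by_hand = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
```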

In [39]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.77      0.82      0.79        28
           1       0.84      0.79      0.81        33

    accuracy                           0.80        61
   macro avg       0.80      0.80      0.80        61
weighted avg       0.81      0.80      0.80        61



#### Confusion Matrix
* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1

In [40]:
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[23,  5],
       [ 7, 26]], dtype=int64)
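
The numbers in the classification report can be recovered by hand from this matrix. In scikit-learn's layout, rows are the true labels and columns are the predicted labels, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. A short sketch of the arithmetic for class 1:

```python
# Confusion matrix from the output above: rows = true labels, columns = predictions.
conf = [[23, 5],
        [7, 26]]

tn, fp = conf[0]
fn, tp = conf[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1:        {f1:.2f}")
```

Rounded to two decimals, these agree with the row for class 1 and the accuracy line in the report.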

### 6. Experiment to Improve the Model
* The first model you build is often referred to as a <b>baseline.</b>

* The next step in the workflow is to try and improve upon your baseline model.

* Experiment with different hyperparameters

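* Evaluate each hyperparameter setting with cross-validation rather than a single train/test split, so the comparison does not depend on one particular split.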

* Different models you use will have different hyperparameters you can tune. For the case of our model, the RandomForestClassifier(), we'll start trying different values for n_estimators.

In [43]:
np.random.seed(42)
highest_score = 0
best_est = 0
for i in range(10, 151, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    score = model.score(X_test, y_test) * 100
    if score > highest_score:
        highest_score = score
        best_est = i
    print(f"Model accuracy on test set: {score}%")
    print("")
    
print(f"High Score: {highest_score}, Estimators: {best_est}")

Trying model with 10 estimators...
Model accuracy on test set: 78.68852459016394%

Trying model with 20 estimators...
Model accuracy on test set: 81.9672131147541%

Trying model with 30 estimators...
Model accuracy on test set: 77.04918032786885%

Trying model with 40 estimators...
Model accuracy on test set: 81.9672131147541%

Trying model with 50 estimators...
Model accuracy on test set: 80.32786885245902%

Trying model with 60 estimators...
Model accuracy on test set: 78.68852459016394%

Trying model with 70 estimators...
Model accuracy on test set: 78.68852459016394%

Trying model with 80 estimators...
Model accuracy on test set: 80.32786885245902%

Trying model with 90 estimators...
Model accuracy on test set: 78.68852459016394%

Trying model with 100 estimators...
Model accuracy on test set: 80.32786885245902%

Trying model with 110 estimators...
Model accuracy on test set: 81.9672131147541%

Trying model with 120 estimators...
Model accuracy on test set: 83.60655737704919%

[output truncated]

In [44]:
# Merge the Datasets for cross validation.
combined = pd.concat([training, test], axis=0)
X = combined.drop("target", axis=1)
y = combined["target"]

In [45]:
from sklearn.model_selection import cross_val_score

# With cross-validation
np.random.seed(42)
for i in range(10, 101, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
    print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 78.68852459016394%
Cross-validation score: 79.53551912568307%

Trying model with 20 estimators...
Model accuracy on test set: 81.9672131147541%
Cross-validation score: 80.21311475409837%

Trying model with 30 estimators...
Model accuracy on test set: 80.32786885245902%
Cross-validation score: 81.20218579234972%

Trying model with 40 estimators...
Model accuracy on test set: 83.60655737704919%
Cross-validation score: 80.87431693989072%

Trying model with 50 estimators...
Model accuracy on test set: 81.9672131147541%
Cross-validation score: 82.51366120218577%

Trying model with 60 estimators...
Model accuracy on test set: 78.68852459016394%
Cross-validation score: 81.21311475409836%

Trying model with 70 estimators...
Model accuracy on test set: 77.04918032786885%
Cross-validation score: 82.5136612021858%

Trying model with 80 estimators...
Model accuracy on test set: 81.9672131147541%
Cross-validation score: 82.1912568306011

##### GridSearchCV
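GridSearchCV takes an estimator and a grid of hyperparameter values, fits the estimator with every combination in the grid, and scores each combination using cross-validation. The best-scoring combination is then available through best_params_ and best_estimator_.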

In [46]:
from sklearn.model_selection import GridSearchCV

np.random.seed(42)

# Define the parameters to search over
# (max_depth must be a positive integer or None; None lets the trees grow fully)
param_grid = {'n_estimators': list(range(10, 101, 10)), 'max_depth': [None, 5, 10]}

# Setup the grid search
grid = GridSearchCV(RandomForestClassifier(),
                    param_grid,
                    cv=5,
                    scoring='recall'
                   )

# Fit the grid search to the data
grid.fit(X, y)

# Find the best parameters
grid.best_params_

{'max_depth': 5, 'n_estimators': 100}

In [47]:
# Set the model to be the best estimator
clf = grid.best_estimator_
clf

RandomForestClassifier(max_depth=5)

In [48]:
# Fit the best model
clf = clf.fit(X_train, y_train)

In [49]:
# Find the best model scores
clf.score(X_test, y_test)

0.819672131147541
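
The search above was fitted on the combined data, so the test rows also influenced which hyperparameters were chosen. A common alternative is to run the search on the training split only and keep the test split for a final check. The sketch below illustrates that variant on synthetic data. The dataset, grid values, and `scoring="accuracy"` are assumptions made for the example, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the heart disease dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# A small illustrative grid; max_depth=None means "grow the trees fully".
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)

# Only the training split is used for the search. With refit=True (the default),
# the best combination is retrained on all of X_tr afterwards.
search.fit(X_tr, y_tr)

held_out = search.score(X_te, y_te)
print(search.best_params_)
print(f"Held-out accuracy: {held_out:.3f}")
```

Because the best estimator is refitted automatically, `search.best_estimator_` can be used for prediction without a separate call to fit().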

### 7. Save a model for someone else to use
* When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

* This may come in the form of a teammate or colleague trying to replicate and validate your results or through a customer using your model as part of a service or application you offer.

* Saving a model also allows you to reuse it later without retraining, which is especially helpful as training times increase.

* You can save a scikit-learn model using Python's built-in pickle module or the joblib library.

In [26]:
import pickle

# Save the tuned model (clf) to file; the with-block closes the file afterwards
with open("myModel.pkl", "wb") as f:
    pickle.dump(clf, f)

In [27]:
# Load a saved model and make a prediction
with open("myModel.pkl", "rb") as f:
    loaded_model = pickle.load(f)
y_test_pred = loaded_model.predict(X_test)
y_test_pred

array([1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0], dtype=int64)

In [28]:
import joblib

# joblib is an alternative to pickle; save the tuned model (clf) with it
filename = 'model_2'
joblib.dump(clf, filename)

['model_2']

In [29]:
mymodel = joblib.load(filename)

In [30]:
mymodel.score(X_test, y_test)

0.819672131147541
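
A quick way to confirm that a saved model survived the round trip is to compare its predictions with those of the original. The sketch below does this for both pickle and joblib on a small synthetic model. The file names and the temporary directory are placeholders chosen for the example.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model to save (not the heart disease model).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
fitted = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib_path = os.path.join(tmp_dir, "demo_model.joblib")

    # pickle: the with-blocks close the file handles even if an error occurs.
    with open(pkl_path, "wb") as f:
        pickle.dump(fitted, f)
    with open(pkl_path, "rb") as f:
        from_pickle = pickle.load(f)

    # joblib: dump() and load() accept a file path directly.
    joblib.dump(fitted, joblib_path)
    from_joblib = joblib.load(joblib_path)

# The reloaded models should predict exactly what the original predicts.
expected = fitted.predict(X_demo)
same_pickle = bool((from_pickle.predict(X_demo) == expected).all())
same_joblib = bool((from_joblib.predict(X_demo) == expected).all())
print(same_pickle, same_joblib)
```

Both formats can execute arbitrary code when loaded, so only load model files from sources you trust, and load them with the same scikit-learn version that saved them where possible.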

# END