<p style="font-size:32px;text-align:center"> <b>Stroke Prediction</b> </p>

# Introduction

- A stroke occurs when a blood vessel in the brain ruptures and bleeds, or when there’s a blockage in the blood supply to the brain. The rupture or blockage prevents blood and oxygen from reaching the brain’s tissues.

- Risk factors for stroke

Certain risk factors make you more susceptible to stroke. According to the National Heart, Lung, and Blood Institute, the more risk factors you have, the more likely you are to have a stroke. Risk factors for stroke include:

**Diet**

An unhealthy diet that increases your risk of stroke is one that’s high in:

- salt
- saturated fats
- trans fats
- cholesterol

**Inactivity**

Inactivity, or lack of exercise, can also raise your risk for stroke.

Regular exercise has a number of health benefits. The CDC recommends that adults get at least 2.5 hours of aerobic exercise every week. This can mean simply a brisk walk a few times a week.

**Alcohol consumption**

Your risk for stroke also increases if you drink too much alcohol. Alcohol consumption should be done in moderation. This means no more than one drink per day for women, and no more than two for men. More than that may raise blood pressure levels as well as triglyceride levels, which can cause atherosclerosis.

**Tobacco use**

Using tobacco in any form also raises your risk for stroke, since it can damage your blood vessels and heart. Smoking raises the risk further, because your blood pressure rises when you use nicotine.

**Personal background**

There are certain personal risk factors for stroke that you can’t control. Stroke risk can be linked to your:

- Family history: Stroke risk is higher in some families because of genetic health issues, such as high blood pressure.
- Sex: While both women and men can have strokes, they’re more common in women than in men in all age groups.
- Age: The older you are, the more likely you are to have a stroke.
- Race and ethnicity: Caucasians, Asian Americans, and Hispanics are less likely to have a stroke than African-Americans, Alaska Natives, and American Indians.

**Health history**

Certain medical conditions are linked to stroke risk. These include:

- a previous stroke or TIA
- high blood pressure
- high cholesterol
- heart disorders, such as coronary artery disease
- heart valve defects
- enlarged heart chambers and irregular heartbeats
- sickle cell disease
- diabetes

![stroke-image](https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2013/11/15/17/44/ds00150_ds01030_my00077_im00074_r7_ischemicstrokethu_jpg.jpg)


# Data 

1. id: unique identifier

2. gender: "Male", "Female" or "Other"

3. age: age of the patient

4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6. ever_married: "No" or "Yes"

7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8. Residence_type: "Rural" or "Urban"

9. avg_glucose_level: average glucose level in blood

10. bmi: body mass index

11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12. stroke: 1 if the patient had a stroke or 0 if not

**Note: "Unknown" in smoking_status means that the information is unavailable for this patient**

# Importing Libraries and Data

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import hstack
import seaborn as sns
sns.set()
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


from imblearn.over_sampling import RandomOverSampler

from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier as rgb
from xgboost import XGBRFClassifier as xgb
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import log_loss, confusion_matrix, accuracy_score

from sklearn.ensemble import StackingClassifier
from prettytable import PrettyTable, MSWORD_FRIENDLY, DEFAULT

In [None]:
stroke = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

In [None]:
stroke.head()

In [None]:
stroke.info()

In [None]:
stroke.describe()

In [None]:
stroke.shape

# Cleaning of Data

In [None]:
stroke.isnull().sum()

In [None]:
# Fill missing BMI values with the column mean (assign back instead of relying on inplace)
stroke['bmi'] = stroke['bmi'].fillna(stroke['bmi'].mean())

In [None]:
stroke.isnull().sum()

# Exploratory Data Analysis

In [None]:
stroke.nunique()

In [None]:
stroke.stroke.value_counts()

**Our data is imbalanced.**
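
To make the imbalance concrete, it can help to look at class shares and at the accuracy of a model that always predicts the majority class. The sketch below uses made-up labels (the `labels` list is an assumption, not the real `stroke` column) and only illustrates the calculation.

```python
from collections import Counter

# Hypothetical labels standing in for the `stroke` column (not the real data).
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
total = len(labels)

# Share of each class in the data.
shares = {cls: n / total for cls, n in counts.items()}

# Accuracy of a model that always predicts the most frequent class.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(shares)
print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.2f}")
```

On imbalanced data this baseline is a useful yardstick: an accuracy figure only says something if it is clearly above it.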

In [None]:
stroke.gender.unique()

In [None]:
stroke.hypertension.unique()

In [None]:
stroke.heart_disease.unique()

In [None]:
stroke.ever_married.unique()

In [None]:
stroke.work_type.unique()

In [None]:
stroke.Residence_type.unique()

In [None]:
stroke.smoking_status.unique()

## Univariate  Analysis

### Boxplot of bmi

In [None]:
sns.boxplot(data=stroke, y='bmi')
plt.title('Boxplot of bmi')
plt.show()

In [None]:
for i in range(0, 110, 10):
    print(f'The {i}th percentile of BMI is: {np.percentile(stroke.bmi, i)}')

In [None]:
for i in range(90, 101, 1):
    print(f'The {i}th percentile of BMI is: {np.percentile(stroke.bmi, i)}')

In [None]:
for i in np.arange(0, 1.1, 0.1):
    print(f'The {99+i:.1f}th percentile of BMI is: {np.percentile(stroke.bmi, 99+i)}')

**From the above calculations we can see that 99.9% of people have BMI less than 65.**
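
The same statement can be checked directly by counting the share of rows below a threshold. This is a minimal sketch on synthetic values (the `bmi` array and its single extreme value are assumptions for illustration), not a result from the dataset.

```python
import numpy as np

# Hypothetical BMI-like values (not the real data): 999 ordinary values and one extreme one.
rng = np.random.default_rng(0)
bmi = np.append(rng.uniform(15, 45, size=999), 92.0)

threshold = 65
share_below = np.mean(bmi < threshold)   # fraction of rows under the threshold
p999 = np.percentile(bmi, 99.9)          # value below which 99.9% of rows fall

print(f"{share_below:.1%} of values are below {threshold}")
print(f"99.9th percentile: {p999:.2f}")
```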

### Boxplot of Average Glucose Level

In [None]:
sns.boxplot(data=stroke, y='avg_glucose_level')
plt.title('Boxplot of avg_glucose_level')
plt.show()

In [None]:
for i in range(0, 110, 10):
    print(f'The {i}th percentile of Average Glucose Level is: {np.percentile(stroke.avg_glucose_level, i)}')

In [None]:
for i in range(90, 101):
    print(f'The {i}th percentile of Average Glucose Level is: {np.percentile(stroke.avg_glucose_level, i)}')

In [None]:
for i in np.arange(0, 1.1, 0.1):
    print(f'The {99+i:.1f}th percentile of Average Glucose Level is: {np.percentile(stroke.avg_glucose_level, 99+i)}')

### Boxplot of age

In [None]:
sns.boxplot(data=stroke, y='age')
plt.title('Boxplot of age')
plt.show()

## Bivariate Analysis

### Hypertension vs Stroke

In [None]:
plots = sns.countplot(x='hypertension', hue='stroke', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

    
plt.title('Effect of Hypertension on Stroke')

### Gender vs Stroke

In [None]:
plots = sns.countplot(x='gender', hue='stroke', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

plt.title('Stroke based on Gender')

### Heart Disease vs Stroke

In [None]:
plots = sns.countplot(x='heart_disease', hue='stroke', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

plt.title('Effect of Heart Disease on Stroke')

### Marital Status vs Stroke

In [None]:
plots = sns.countplot(x='ever_married', hue='stroke', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

plt.title('Effect of Marital Status on Stroke')

### Residence Type vs Stroke

In [None]:
plots = sns.countplot(x='Residence_type', hue='stroke', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

plt.title('Effect of Residence Type on Stroke')

### Work Type vs Stroke

In [None]:
plots = sns.countplot(x='work_type', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

plt.title('Work Type of People')

In [None]:
plots = sns.countplot(x='work_type', hue='stroke', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

plt.title('Effect of Work Type on Stroke')

### Smoking vs Stroke

In [None]:
plots = sns.countplot(x='smoking_status', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

plt.title('Number of Smokers/Non Smokers')

In [None]:
plots = sns.countplot(x='smoking_status', hue='stroke', data=stroke)

for bar in plots.patches:
    plots.annotate(f'{round(bar.get_height()/len(stroke)*100,2)} %', xy=(bar.get_x() + bar.get_width() / 2,  
                   bar.get_height()), ha='center', va='center', size=13, xytext=(0, 8), textcoords='offset points')

plt.title('Effect of Smoking on Stroke')

In [None]:
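# Restrict to numeric columns; recent pandas versions raise an error on string columns
correlation = stroke.corr(numeric_only=True)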

## Correlation Heatmap

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(correlation, annot=True)

# Removing Outliers and Redundant Columns

In [None]:
stroke.drop(labels='id', axis=1, inplace=True)

In [None]:
stroke.head()

In [None]:
stroke.drop(stroke[stroke.bmi>65].index, inplace=True)
stroke.shape

In [None]:
stroke.head()

# Machine Learning Models

## Splitting The Dataset

The dataset is split 80:20 into train and test sets; 20% of the training portion is then held out as a cross-validation set.

In [None]:
x = stroke.iloc[:,:-1]
y = stroke.iloc[:,-1]

In [None]:
# Fixed seeds make the split reproducible between runs
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=42)
x_train, x_cv, y_train, y_cv = train_test_split(x_train, y_train, test_size=0.2, stratify=y_train, random_state=42)
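
A small sketch of what the two stratified splits do to the row counts and class shares. The labels here are synthetic (1,000 rows with 5% positives is an assumption chosen so the numbers divide evenly), and `random_state=0` is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (5% positives), not the real data.
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)

# First split: 80% train+cv, 20% test. Second split: 20% of the remainder for cv.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0)

for name, part in [("train", y_train), ("cv", y_cv), ("test", y_test)]:
    print(f"{name}: {len(part)} rows, positive share {part.mean():.3f}")
```

The effective proportions are 64% train, 16% cross-validation and 20% test, and `stratify` keeps the positive share the same in each part.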

In [None]:
x_train.shape

In [None]:
x_test.shape

In [None]:
y_train.value_counts()

In [None]:
rom = RandomOverSampler(random_state=42)
x_train, y_train = rom.fit_resample(x_train, y_train)

In [None]:
x_train.shape

In [None]:
y_train.value_counts()
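
Random oversampling duplicates minority-class rows until the classes are balanced. The sketch below reproduces the idea with plain NumPy on synthetic labels; it is an illustration of the technique, not the `imblearn` implementation, and the data is made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical training labels (not the real data): 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

classes, counts = np.unique(y, return_counts=True)
target = counts.max()

# For each class, draw extra row indices with replacement until it reaches the majority size.
indices = []
for cls, n in zip(classes, counts):
    cls_idx = np.flatnonzero(y == cls)
    extra = rng.choice(cls_idx, size=target - n, replace=True)
    indices.append(np.concatenate([cls_idx, extra]))
indices = np.concatenate(indices)

X_res, y_res = X[indices], y[indices]
print(np.bincount(y_res))  # both classes now have the same number of rows
```

Resampling only the training split, as done above, keeps the cross-validation and test sets at their original class ratio, so the reported metrics still reflect the real distribution.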

## One-Hot Encoding of Categorical Data

In [None]:
ohe = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown='ignore'), ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'])], remainder='passthrough')
ohe.fit(x_train)
print(x_train.shape, y_train.shape)

x_train_ohe = ohe.transform(x_train)
x_cv_ohe = ohe.transform(x_cv)
x_test_ohe = ohe.transform(x_test)

print('After Vectorization......')
print(x_train_ohe.shape, y_train.shape)
print(x_cv_ohe.shape, y_cv.shape)
print(x_test_ohe.shape, y_test.shape)

In [None]:
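# Keep only the one-hot columns; the passthrough columns are rebuilt separately below.
# Counting the learned categories avoids hard-coding the number of one-hot columns.
n_ohe = sum(len(cats) for cats in ohe.named_transformers_['ohe'].categories_)
x_train_ohe = x_train_ohe[:, :n_ohe]
x_cv_ohe = x_cv_ohe[:, :n_ohe]
x_test_ohe = x_test_ohe[:, :n_ohe]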

print('After Vectorization......')
print(x_train_ohe.shape, y_train.shape)
print(x_cv_ohe.shape, y_cv.shape)
print(x_test_ohe.shape, y_test.shape)
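
The `handle_unknown='ignore'` option matters when a category shows up in the cross-validation or test data but not in the training data. A minimal sketch on a made-up single column (the values are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test values for a single categorical column.
train = np.array([["Male"], ["Female"], ["Female"]])
test = np.array([["Female"], ["Other"]])   # "Other" never appears in training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

encoded = enc.transform(test).toarray()
print(enc.categories_)   # categories learned from the training data
print(encoded)           # the unseen "Other" row is all zeros
```

This also means the number of one-hot columns depends on which categories the training split happens to contain.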

In [None]:
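# Feature names in the same column order used when the blocks are stacked below:
# one-hot columns first, then hypertension, heart_disease and the three scaled columns.
ohe_names = list(ohe.named_transformers_['ohe'].get_feature_names_out())
features = ohe_names + ['hypertension', 'heart_disease', 'age', 'avg_glucose_level', 'bmi']
features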

In [None]:
x_train_hyp = np.array(x_train['hypertension']).reshape((-1,1))
x_cv_hyp = np.array(x_cv['hypertension']).reshape((-1,1))
x_test_hyp = np.array(x_test['hypertension']).reshape((-1,1))

print('After Vectorization......')
print(x_train_hyp.shape, y_train.shape)
print(x_cv_hyp.shape, y_cv.shape)
print(x_test_hyp.shape, y_test.shape)

In [None]:
x_train_hd = np.array(x_train['heart_disease']).reshape((-1,1))
x_cv_hd = np.array(x_cv['heart_disease']).reshape((-1,1))
x_test_hd = np.array(x_test['heart_disease']).reshape((-1,1))

print('After Vectorization......')
print(x_train_hd.shape, y_train.shape)
print(x_cv_hd.shape, y_cv.shape)
print(x_test_hd.shape, y_test.shape)

In [None]:
std = ColumnTransformer([('norm', MinMaxScaler(), ['age', 'avg_glucose_level', 'bmi'])], remainder='drop')
std.fit(x_train)
print(x_train.shape, y_train.shape)

x_train_std = std.transform(x_train)
x_cv_std = std.transform(x_cv)
x_test_std = std.transform(x_test)

print('After Vectorization......')
print(x_train_std.shape, y_train.shape)
print(x_cv_std.shape, y_cv.shape)
print(x_test_std.shape, y_test.shape)
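
Min-max scaling learns the minimum and maximum from the training data and reuses them for the other splits. The sketch below does the arithmetic by hand on made-up ages (the arrays and the helper `min_max` are hypothetical) to show what that implies.

```python
import numpy as np

# Hypothetical ages (not the real data).
train_age = np.array([20.0, 40.0, 60.0, 80.0])
test_age = np.array([30.0, 90.0])

# Learn the range from the training data only.
lo, hi = train_age.min(), train_age.max()

def min_max(values):
    """Scale values with the training minimum and maximum."""
    return (values - lo) / (hi - lo)

print(min_max(train_age))  # training values land in [0, 1]
print(min_max(test_age))   # 90 is above the training maximum, so it maps above 1
```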

### Combining all encoded columns

In [None]:
# np.float was removed from NumPy; the builtin float is equivalent here
x_tr = np.hstack((x_train_ohe.astype(float), x_train_hyp.astype(float), x_train_hd.astype(float), x_train_std.astype(float)))
x_cv = np.hstack((x_cv_ohe.astype(float), x_cv_hyp.astype(float), x_cv_hd.astype(float), x_cv_std.astype(float)))
x_te = np.hstack((x_test_ohe.astype(float), x_test_hyp.astype(float), x_test_hd.astype(float), x_test_std.astype(float)))

print("Final Data Matrix Shape is........")
print(x_tr.shape,y_train.shape)
print(x_cv.shape,y_cv.shape)
print(x_te.shape,y_test.shape)

In [None]:
def cnf_matrix(true_y, pred_y):
    """Plot the confusion, precision and recall matrices for the given labels."""
    # Use the function arguments (not globals) so each model gets its own matrices
    cf_matrix = confusion_matrix(true_y, pred_y)
    print('-'*40, 'Confusion Matrix', '-'*40)
    group_counts = ['{0:0.0f}'.format(value) for value in cf_matrix.flatten()]
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}\n' for v1, v2 in zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    sns.heatmap(cf_matrix, annot=labels, fmt='', cmap="YlGnBu")
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()

    # Precision Matrix: divide each column by its sum
    pc_matrix =(cf_matrix/cf_matrix.sum(axis=0))
    print("-"*40, "Precision matrix (Column sum=1)", "-"*40)
    sns.heatmap(pc_matrix, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=[0,1], yticklabels=[0,1])
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()

    # Recall Matrix: divide each row by its sum
    rl_matrix =(((cf_matrix.T)/(cf_matrix.sum(axis=1))).T)
    print("-"*40, "Recall matrix (Row sum=1)", "-"*40)
    sns.heatmap(rl_matrix, annot=True, cmap="YlGnBu", fmt=".3f", xticklabels=[0,1], yticklabels=[0,1])
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
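
The precision and recall matrices are just the confusion matrix normalised by column and by row. A small numeric sketch, using a made-up confusion matrix (the counts are assumptions for illustration):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows are true classes, columns are predicted classes.
cm = np.array([[90, 10],
               [ 4,  6]])

precision_matrix = cm / cm.sum(axis=0)                 # each column sums to 1
recall_matrix = cm / cm.sum(axis=1, keepdims=True)     # each row sums to 1

# Diagonal entries are the per-class precision and recall.
print("precision for class 1:", precision_matrix[1, 1])   # 6 / (10 + 6)
print("recall for class 1:   ", recall_matrix[1, 1])      # 6 / (4 + 6)
```

In this example overall accuracy is 96/110, yet only 60% of the true positives are found, which is why the recall matrix is worth reading alongside accuracy on imbalanced data.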

## Random Model and its Performance

In [None]:
test_data_len = x_test.shape[0]
cv_data_len = x_cv.shape[0]

# we create a output array that has exactly same size as the CV data
cv_predicted_y = np.zeros((cv_data_len,2))
for i in range(cv_data_len):
    rand_probs = np.random.rand(1,2)
    cv_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
    
# Test-Set error.
# We create a output array that has exactly same as the test data
test_predicted_y = np.zeros((test_data_len,2))
for i in range(test_data_len):
    rand_probs = np.random.rand(1,2)
    test_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])

predicted_y = np.argmax(test_predicted_y, axis=1)
predicted_cv = np.argmax(cv_predicted_y, axis=1)


# The eps argument is omitted: it is deprecated in recent scikit-learn releases
ll_rm_cv = log_loss(y_cv, cv_predicted_y)
ac_rm_cv = accuracy_score(y_cv, predicted_cv)
ll_rm_te = log_loss(y_test, test_predicted_y)
ac_rm_te = accuracy_score(y_test, predicted_y)

print("Log loss on Cross Validation Data using Random Model",ll_rm_cv)
print("Log loss on Test Data using Random Model",ll_rm_te)
print('Accuracy on Cross Validation using Random Model', ac_rm_cv)
print('Accuracy on Test Data using Random Model', ac_rm_te)

cnf_matrix(y_test, predicted_y)
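
Log loss is the mean negative log of the probability assigned to the true class, so it rewards confident correct predictions and heavily penalises confident wrong ones. The sketch below computes it by hand on made-up labels and probabilities (`binary_log_loss` is a hypothetical helper, not a scikit-learn function).

```python
import math

def binary_log_loss(y_true, p_positive):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(y_true)

y_true = [0, 0, 0, 1]                     # hypothetical labels

uninformed = [0.5, 0.5, 0.5, 0.5]         # always "50/50"
confident_right = [0.1, 0.1, 0.1, 0.9]
confident_wrong = [0.9, 0.9, 0.9, 0.1]

print(binary_log_loss(y_true, uninformed))        # ln(2), about 0.693
print(binary_log_loss(y_true, confident_right))   # -ln(0.9), about 0.105
print(binary_log_loss(y_true, confident_wrong))   # -ln(0.1), about 2.303
```

A model is only useful on this metric if its log loss is below that of an uninformed baseline.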

## Logistic Regression

### Hyperparameter Tuning

In [None]:
alpha = [10 ** x for x in range(-6, 4)]
params = {'alpha':alpha}
# 'log_loss' is the logistic-regression loss (it was called 'log' before scikit-learn 1.1)
clf1 = SGDClassifier(loss='log_loss', n_jobs=-1, random_state=42)
r_search = RandomizedSearchCV(clf1, param_distributions=params, return_train_score=True, random_state=42)
r_search.fit(x_tr, y_train)

In [None]:
print(f'The best hyperparameter value is {r_search.best_params_} at which the score is {r_search.best_score_}')
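
A note on the search itself: the `alpha` grid has ten values and `RandomizedSearchCV` draws `n_iter=10` candidates by default, so, as far as the sampling rules go, every value in the grid should be tried once. The sketch below checks this with `ParameterSampler`, which is the sampler scikit-learn uses for randomized search; the seed is arbitrary.

```python
from sklearn.model_selection import ParameterSampler

alpha = [10 ** x for x in range(-6, 4)]   # same grid as above: 1e-06 ... 1000
params = {'alpha': alpha}

# n_iter=10 is the RandomizedSearchCV default.
sampled = [p['alpha'] for p in ParameterSampler(params, n_iter=10, random_state=42)]

print(len(alpha), "candidates in the grid")
print(sorted(sampled) == sorted(alpha))   # every candidate is drawn exactly once
```

For larger grids, such as the random forest grid later on, ten draws cover only a small part of the space, so the reported best parameters depend on the seed.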

### Training the model

In [None]:
clf1 = SGDClassifier(loss='log_loss', n_jobs=-1, random_state=42, **r_search.best_params_)
clf1.fit(x_tr, y_train)
cal_clf1 = CalibratedClassifierCV(clf1, cv='prefit')
cal_clf1.fit(x_tr, y_train)

In [None]:
y_pred_cv = cal_clf1.predict(x_cv)
y_prob_cv = cal_clf1.predict_proba(x_cv)
y_pred = cal_clf1.predict(x_te)
y_prob = cal_clf1.predict_proba(x_te)

### Performance of the model

In [None]:
ll_lg_cv = log_loss(y_cv, y_prob_cv)
ac_lg_cv = accuracy_score(y_cv, y_pred_cv)
ll_lg_te = log_loss(y_test, y_prob)
ac_lg_te = accuracy_score(y_test, y_pred)


print("Log loss on Cross Validation Data using Logistic Regression",ll_lg_cv)
print("Log loss on Test Data using Logistic Regression",ll_lg_te)
print('Accuracy on Cross Validation using Logistic Regression', ac_lg_cv)
print('Accuracy on Test Data using Logistic Regression', ac_lg_te)

cnf_matrix(y_test, y_pred)

### Feature Importance

In [None]:
importance = clf1.coef_
# summarize feature importance
for i,v in enumerate(importance[0]):
    print(f'Feature: {i}, Score: {v}')

plt.figure(figsize=(20,7))    
sns.barplot(x=[x for x in range(importance.shape[1])], y=importance[0]).set_xticklabels(features, rotation=90)
plt.show()

## Support Vector Machines

### Hyperparameter Tuning

In [None]:
alpha = [10 ** x for x in range(-6, 4)]
params = {'alpha':alpha}
clf2 = SGDClassifier(loss='hinge', n_jobs=-1, random_state=42)
r_search = RandomizedSearchCV(clf2, param_distributions=params, return_train_score=True, random_state=42)
r_search.fit(x_tr, y_train)

In [None]:
print(f'The best hyperparameter value is {r_search.best_params_} at which the score is {r_search.best_score_}')

### Training the model

In [None]:
clf2 = SGDClassifier(loss='hinge', n_jobs=-1, random_state=42, **r_search.best_params_)
clf2.fit(x_tr, y_train)
cal_clf2 = CalibratedClassifierCV(clf2, cv='prefit')
cal_clf2.fit(x_tr, y_train)

In [None]:
y_pred_cv = cal_clf2.predict(x_cv)
y_prob_cv = cal_clf2.predict_proba(x_cv)
y_pred = cal_clf2.predict(x_te)
y_prob = cal_clf2.predict_proba(x_te)

### Performance of the model

In [None]:
ll_svm_cv = log_loss(y_cv, y_prob_cv)
ac_svm_cv = accuracy_score(y_cv, y_pred_cv)
ll_svm_te = log_loss(y_test, y_prob)
ac_svm_te = accuracy_score(y_test, y_pred)


print("Log loss on Cross Validation Data using SVM",ll_svm_cv)
print("Log loss on Test Data using SVM",ll_svm_te)
print('Accuracy on Cross Validation using SVM', ac_svm_cv)
print('Accuracy on Test Data using SVM', ac_svm_te)

cnf_matrix(y_test, y_pred)

### Feature Importance

In [None]:
importance = clf2.coef_
# summarize feature importance
for i,v in enumerate(importance[0]):
    print(f'Feature: {i}, Score: {v}')

plt.figure(figsize=(20,7))    
sns.barplot(x=[x for x in range(importance.shape[1])], y=importance[0]).set_xticklabels(features, rotation=90)
plt.show()

## Naive Bayes

### Hyperparameter Tuning

In [None]:
alpha = [0.0000001,0.000001,0.00001,0.0001,0.001,0.01,0.1,1,10,50,100]
params = {'alpha':alpha}
clf3 = MultinomialNB()
r_search = RandomizedSearchCV(clf3, param_distributions=params, return_train_score=True, random_state=42)
r_search.fit(x_tr, y_train)

In [None]:
print(f'The best hyperparameter value is {r_search.best_params_} at which the score is {r_search.best_score_}')

### Training the model

In [None]:
clf3 = MultinomialNB(**r_search.best_params_)
clf3.fit(x_tr, y_train)
cal_clf3 = CalibratedClassifierCV(clf3, cv='prefit')
cal_clf3.fit(x_tr, y_train)

In [None]:
y_pred_cv = cal_clf3.predict(x_cv)
y_prob_cv = cal_clf3.predict_proba(x_cv)
y_pred = cal_clf3.predict(x_te)
y_prob = cal_clf3.predict_proba(x_te)

### Performance of the model

In [None]:
ll_nb_cv = log_loss(y_cv, y_prob_cv)
ac_nb_cv = accuracy_score(y_cv, y_pred_cv)
ll_nb_te = log_loss(y_test, y_prob)
ac_nb_te = accuracy_score(y_test, y_pred)


print("Log loss on Cross Validation Data using Naive Bayes",ll_nb_cv)
print("Log loss on Test Data using Naive Bayes",ll_nb_te)
print('Accuracy on Cross Validation using Naive Bayes', ac_nb_cv)
print('Accuracy on Test Data using Naive Bayes', ac_nb_te)

cnf_matrix(y_test, y_pred)

## Random Forest using Sklearn

### Hyperparameter Tuning

In [None]:
# Number of trees in the forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
# ('sqrt' is what 'auto' meant for classifiers; 'auto' was removed from scikit-learn)
max_features = ['sqrt', 'log2', None]

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3, 4, 5]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
params = {'n_estimators': n_estimators, 
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
clf4 = rgb(n_jobs=-1, random_state=42)
r_search = RandomizedSearchCV(clf4, param_distributions=params, return_train_score=True, random_state=42)
r_search.fit(x_tr, y_train)

In [None]:
print(f'The best hyperparameter values are {r_search.best_params_} at which the score is {r_search.best_score_}')

### Training the model

In [None]:
clf4 = rgb(**r_search.best_params_, n_jobs=-1, random_state=42)
clf4.fit(x_tr, y_train)
cal_clf4 = CalibratedClassifierCV(clf4, cv='prefit')
cal_clf4.fit(x_tr, y_train)

In [None]:
y_pred_cv = cal_clf4.predict(x_cv)
y_prob_cv = cal_clf4.predict_proba(x_cv)
y_pred = cal_clf4.predict(x_te)
y_prob = cal_clf4.predict_proba(x_te)

### Performance of the model

In [None]:
ll_rf_cv = log_loss(y_cv, y_prob_cv)
ac_rf_cv = accuracy_score(y_cv, y_pred_cv)
ll_rf_te = log_loss(y_test, y_prob)
ac_rf_te = accuracy_score(y_test, y_pred)


print("Log loss on Cross Validation Data using RF",ll_rf_cv)
print("Log loss on Test Data using RF",ll_rf_te)
print('Accuracy on Cross Validation using RF', ac_rf_cv)
print('Accuracy on Test Data using RF', ac_rf_te)

cnf_matrix(y_test, y_pred)

### Feature Importance

In [None]:
importance = clf4.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
    print(f'Feature: {i}, Score: {v}')

plt.figure(figsize=(20,7))    
sns.barplot(x=[x for x in range(len(importance))], y=importance).set_xticklabels(features, rotation=90)
plt.show()

## Random Forest using Xgboost

### Hyperparameter Tuning

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 2000, num = 10)]

learning_rate = [x for x in np.linspace(start=0.01, stop=0.2, num=10)]

min_child_weight = [1, 3, 5, 7]

max_depth = [3, 5, 7, 9]

subsample = [0.5, 0.6, 0.7, 0.8, 0.9, 1]

colsample_bytree = [0.5, 0.6, 0.7, 0.8, 0.9, 1]

# Create the random grid
params = {'n_estimators': n_estimators, 
          'learning_rate':learning_rate,
          'min_child_weight': min_child_weight,
          'max_depth': max_depth,
          'subsample': subsample,
          'colsample_bytree': colsample_bytree}
clf5 = xgb(n_jobs=-1, random_state=42)
r_search = RandomizedSearchCV(clf5, param_distributions=params, return_train_score=True, random_state=42)
r_search.fit(x_tr, y_train)

In [None]:
print(f'The best hyperparameter values are {r_search.best_params_} at which the score is {r_search.best_score_}')

### Training the model

In [None]:
clf5 = xgb(**r_search.best_params_, n_jobs=-1, random_state=42)
clf5.fit(x_tr, y_train)
cal_clf5 = CalibratedClassifierCV(clf5, cv='prefit')
cal_clf5.fit(x_tr, y_train)

In [None]:
y_pred_cv = cal_clf5.predict(x_cv)
y_prob_cv = cal_clf5.predict_proba(x_cv)
y_pred = cal_clf5.predict(x_te)
y_prob = cal_clf5.predict_proba(x_te)

### Performance of the model

In [None]:
ll_xg_cv = log_loss(y_cv, y_prob_cv)
ac_xg_cv = accuracy_score(y_cv, y_pred_cv)
ll_xg_te = log_loss(y_test, y_prob)
ac_xg_te = accuracy_score(y_test, y_pred)


print("Log loss on Cross Validation Data using XGBoost RF",ll_xg_cv)
print("Log loss on Test Data using XGBoost RF",ll_xg_te)
print('Accuracy on Cross Validation using XGBoost RF', ac_xg_cv)
print('Accuracy on Test Data using XGBoost RF', ac_xg_te)

cnf_matrix(y_test, y_pred)

### Feature Importance

In [None]:
importance = clf5.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
    print(f'Feature: {i}, Score: {v}')

plt.figure(figsize=(20,7))    
sns.barplot(x=[x for x in range(len(importance))], y=importance).set_xticklabels(features, rotation=90)
plt.show()

## Stacking Classifier

In [None]:
estimators = [('svc', clf2), ('nb', clf3), ('rf',  clf4)]

scl = StackingClassifier(estimators=estimators, final_estimator=clf1, n_jobs=-1)
scl.fit(x_tr, y_train)
cal_clf = CalibratedClassifierCV(scl, cv='prefit')
cal_clf.fit(x_tr, y_train)

In [None]:
y_pred_cv = cal_clf.predict(x_cv)
y_prob_cv = cal_clf.predict_proba(x_cv)
y_pred = cal_clf.predict(x_te)
y_prob = cal_clf.predict_proba(x_te)

### Performance of the model

In [None]:
ll_sc_cv = log_loss(y_cv, y_prob_cv)
ac_sc_cv = accuracy_score(y_cv, y_pred_cv)
ll_sc_te = log_loss(y_test, y_prob)
ac_sc_te = accuracy_score(y_test, y_pred)


print("Log loss on Cross Validation Data using Stacking Classifier",ll_sc_cv)
print("Log loss on Test Data using Stacking Classifier",ll_sc_te)
print('Accuracy on Cross Validation using Stacking Classifier', ac_sc_cv)
print('Accuracy on Test Data using Stacking Classifier', ac_sc_te)

cnf_matrix(y_test, y_pred)

# Summary of Performance

In [None]:
table = PrettyTable()
table.field_names = ['Model', 'CV Log Loss', 'Test Log Loss', 'CV Accuracy', 'Test Accuracy']
table.add_rows([['Random Model', round(ll_rm_cv,3), round(ll_rm_te,3), round(ac_rm_cv,3), round(ac_rm_te,3)],
                ['Logistic Regression', round(ll_lg_cv,3), round(ll_lg_te,3), round(ac_lg_cv,3), round(ac_lg_te,3)],
                ['Naive Bayes', round(ll_nb_cv,3), round(ll_nb_te,3), round(ac_nb_cv,3), round(ac_nb_te,3)],
                ['SVM', round(ll_svm_cv,3), round(ll_svm_te,3), round(ac_svm_cv,3), round(ac_svm_te,3)],
                ['Random Forest(Scikit)', round(ll_rf_cv,3), round(ll_rf_te,3), round(ac_rf_cv,3), round(ac_rf_te,3)],
                ['Random Forest(Xgboost)', round(ll_xg_cv,3), round(ll_xg_te,3), round(ac_xg_cv,3), round(ac_xg_te,3)],
                ['Stacking Classifier', round(ll_sc_cv,3), round(ll_sc_te,3), round(ac_sc_cv,3), round(ac_sc_te,3)]])

In [None]:
table.set_style(DEFAULT)
print(table)

Our best model is the scikit-learn Random Forest, with **an accuracy of 94.6% and a log loss of 0.322**.