<html>
    <a id="toc"></a>
    <h1 style='color:#FCF6F5FF;background-color:#89ABE3FF;font-size:40px;font-style:italic;padding:10px;'><center><b>TABLE OF CONTENTS</b></center></h1>
    
</html>

* [1. MOTIVATION](#1)

* [2. DATASET](#2)

* [3. OVERVIEW](#3)

* [4. VISUALIZATIONS](#4)
  
  * [4.1 UNIVARIATE ANALYSIS](#4.1)
    * [4.1.1. CATEGORICAL FEATURES](#4.1.1)
    * [4.1.2. CONTINUOUS FEATURES](#4.1.2)
    
* [5. MODEL & PREDICTION](#5)
  * [5.1 OPTUNA+ONE HOT ENCODING+ENSEMBLING](#5.1)
    * [5.1.1 RANDOM FOREST CLASSIFIER](#5.1.1)
    * [5.1.2 LGBM CLASSIFIER](#5.1.2)
    
  * [5.2 OPTUNA+LABEL ENCODING+ENSEMBLING](#5.2)
    * [5.2.1 RANDOM FOREST CLASSIFIER](#5.2.1)
    * [5.2.2 LGBM CLASSIFIER](#5.2.2)



[Slide to top](#toc)
<html>
    <a id="1"></a>
    <h1 style='color:#FCF6F5FF;background-color:#89ABE3FF;font-size:40px;font-style:italic;padding:10px;'><center><b>1. MOTIVATION</b></center></h1>
    
</html>

![STROKE](https://knoxvillecpr.com/wp-content/uploads/2014/04/stroke.jpg)



1. **A stroke occurs when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot or bursts (ruptures).**

2. **When that happens, part of the brain cannot get the blood (and oxygen) it needs, so brain cells in that area die.**

3. **According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.**

4. **80 percent of strokes are preventable, so predicting stroke risk early could save many lives.**

[Slide to top](#toc)
<html>
    <a id="2"></a>
    <h1 style='color:#FCF6F5FF;background-color:#89ABE3FF;font-size:40px;font-style:italic;padding:10px;'><center><b>2. DATASET</b></center></h1>
    
</html>

1. **This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides relevant information about one patient.**

**Features are:**

1. gender: "Male", "Female" or "Other"
2. age: age of the patient
3. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
4. heart_disease: 0 if the patient doesn't have any heart disease, 1 if the patient has a heart disease
5. ever_married: "No" or "Yes"
6. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
7. Residence_type: "Rural" or "Urban"
8. avg_glucose_level: average glucose level in blood
9. bmi: body mass index
10. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"
11. stroke: 1 if the patient had a stroke or 0 if not


[Slide to top](#toc)
<html>
    <a id="3"></a>
    <h1 style='color:#FCF6F5FF;background-color:#89ABE3FF;font-size:40px;font-style:italic;padding:10px;'><center><b>3. OVERVIEW</b></center></h1>
    
</html>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd

In [None]:
df=pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
print("***** First Five Rows of Dataset *****")
df.head()

In [None]:
print("***** Shape of dataset *****")
df.shape

In [None]:
L=list(df.columns)
print("***** Column names of dataset *****")
print()
print(L)

In [None]:
print('***** Description of dataset *****')
print()
df.describe()

In [None]:
print("***** Basic Information about dataset *****")
print()
print(df.info())

**Checking null values**

In [None]:
df.isnull().sum()

**Fill all null values in bmi with the column mean**

In [None]:
df['bmi']=df['bmi'].fillna(df['bmi'].mean())
print("Mean value of bmi is : ",df['bmi'].mean())

In [None]:
cat_columns=['gender' , 'hypertension' , 'ever_married' , 'work_type' , 'heart_disease' , 'Residence_type' , 'smoking_status']
print("***** Value counts in categorical features *****")
print()

for i in cat_columns:
    print("Value counts of",i,'feature are : ')
    print(df[i].value_counts())
    print()

In [None]:
#Dropping 'id' column
df=df.drop('id',axis=1)


[Slide to top](#toc)
<html>
    <a id="4"></a>
    <h1 style='color:#FCF6F5FF;background-color:#89ABE3FF;font-size:40px;font-style:italic;padding:10px;'><center><b>4. VISUALIZATIONS</b></center></h1>
    
</html>

In [None]:
print("***** Value Count of 'stroke' column *****")
df['stroke'].value_counts()

**DATA IS IMBALANCED**
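To make the imbalance concrete, here is a minimal sketch on a hypothetical toy target (the 95/5 split is an illustrative assumption, not the real counts). It shows why accuracy alone is misleading here: a model that always predicts the majority class scores high accuracy while finding no stroke cases.

```python
import pandas as pd

# Hypothetical toy target: 95 non-stroke rows and 5 stroke rows (illustrative only).
stroke = pd.Series([0] * 95 + [1] * 5, name="stroke")

# Share of each class in the target.
class_share = stroke.value_counts(normalize=True)

# Baseline that always predicts the majority class.
majority_class = class_share.idxmax()
baseline_pred = pd.Series(majority_class, index=stroke.index)

# Accuracy looks good, but recall on the stroke class is zero.
baseline_accuracy = (baseline_pred == stroke).mean()
baseline_recall = ((baseline_pred == 1) & (stroke == 1)).sum() / (stroke == 1).sum()

print(class_share)
print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall on stroke:", baseline_recall)
```

This is why the modelling section scores with F1 rather than relying on accuracy alone.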

In [None]:
def with_hue(data,feature,ax):
    
    #Number of categories
    num_of_cat=len([x for x in data[feature].unique() if x==x])
    
    bars=ax.patches
    
    for ind in range(num_of_cat):
        ##     Get every hue bar
        ##     ex. 8 X categories, 4 hues =>
        ##    [0, 8, 16, 24] are hue bars for 1st X category
        hueBars=bars[ind:][::num_of_cat] 
        # Get the total height (for percentages)
        total=sum([x.get_height() for x in hueBars])
        #Printing percentages on bar
        for bar in hueBars:
            percentage='{:.1f}%'.format(100 * bar.get_height()/total)
            ax.text(bar.get_x()+bar.get_width()/2.0,
                   bar.get_height(),
                   percentage,
                    ha="center",va="bottom",fontweight='bold',fontsize=10)
    

    
def without_hue(data,feature,ax):
    
    total=float(len(data))
    bars_plot=ax.patches
    
    for bars in bars_plot:
        percentage = '{:.1f}%'.format(100 * bars.get_height()/total)
        x = bars.get_x() + bars.get_width()/2.0
        y = bars.get_height()
        #Pass a single string (not a tuple) as the label: percentage plus raw count
        ax.text(x, y,'{} ({})'.format(percentage,int(bars.get_height())),ha='center',fontweight='bold',fontsize=10)

In [None]:
fig=plt.figure(figsize=(10,5))
#Setting Colour
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

#Dealing with spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')

#count plot
x_stroke=sns.countplot(data=df,x='stroke',palette="Set1")

#with percentages
without_hue(df,'stroke',x_stroke)

[Slide to top](#toc)
<html>
    <a id="4.1"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>4.1. UNIVARIATE ANALYSIS</b></center></h1>
    
</html>

[Slide to top](#toc)
<html>
    <a id="4.1.1"></a>
    <h1 style='color:#B1624EFF;background-color:#5CC8D7FF;font-size:20px;padding:10px;'><center><b>4.1.1. CATEGORICAL FEATURES</b></center></h1>
    
</html>

**1. CATEGORICAL VALUES : ['gender' , 'hypertension' , 'ever_married' , 'work_type' , 'heart_disease' , 'Residence_type' , 'smoking_status']**

In [None]:
def plotting_cat_features(nrows,ncols,cat_columns):
    
    f,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(15,19))
    f.patch.set_facecolor('#F2EDD7FF')

    #Setting background and foreground color
    for i in range(0,nrows):
        for j in range(0,ncols):
            ax[i][j].set_facecolor('#F2EDD7FF')

    #Plotting count plot 
    for i in range(0,nrows):
        for j in range(0,ncols):
            if(i==0): #For [0,0] sub plot
                if(j==0):
                    ax[i][j].spines['bottom'].set_visible(False)
                    ax[i][j].spines['left'].set_visible(False)
                    ax[i][j].spines['top'].set_visible(False)
                    ax[i][j].spines['right'].set_visible(False)
                    
                    ax[i][j].tick_params(left=False,bottom=False)
                    ax[i][j].set_xticklabels([])
                    ax[i][j].set_yticklabels([])
                    ax[i][j].text(0.5,0.5,"Count plot of\ncategorical features",
                                    horizontalalignment="center",
                                    verticalalignment='center',
                                    fontweight='bold',fontsize=15,fontstyle='italic')
                elif(j==1): #For [0,1] subplot
                    ax[i][j].spines['bottom'].set_visible(False)
                    ax[i][j].spines['left'].set_visible(False)
                    ax[i][j].spines['top'].set_visible(False)
                    ax[i][j].spines['right'].set_visible(False)
                    
                    ax[i][j].tick_params(left=False,bottom=False)
                    ax[i][j].set_xticklabels([])
                    ax[i][j].set_yticklabels([])
                    ax[i][j].text(0.5,0.5,"Count plot with respect to\ntarget",
                                    horizontalalignment="center",
                                    verticalalignment='center',
                                    fontweight='bold',fontsize=15,fontstyle='italic')

            else:
                #Without hueness
                if(j==0):
                    a1=sns.countplot(data=df,x=cat_columns[i-1],palette='rocket',ax=ax[i][j])
                    without_hue(df,cat_columns[i-1],a1)
                #With hueness
                elif(j==1):
                    a2=sns.countplot(data=df,x=cat_columns[i-1],hue='stroke',ax=ax[i][j],palette='rocket')
                    with_hue(df,cat_columns[i-1],a2)
                
                #Dealing with spines
                ax[i][j].spines['top'].set_visible(False)
                ax[i][j].spines['right'].set_visible(False)
                ax[i][j].spines['left'].set_visible(False)
                ax[i][j].grid(linestyle="--",axis='y',color='gray')
        
        
    

In [None]:
#First four columns
cat_columns= ['gender' , 'hypertension' , 'ever_married' , 'work_type']       
plotting_cat_features(5,2,cat_columns) 

In [None]:
#Last three columns
cat_columns= ['heart_disease' , 'Residence_type' , 'smoking_status']       
plotting_cat_features(4,2,cat_columns) 

[Slide to top](#toc)
<html>
    <h1 style='color:#B1624EFF;background-color:#5CC8D7FF;font-size:20px;padding:20px;'><center><b>OBSERVATIONS FROM PLOTS OF CATEGORICAL FEATURES</b></center></h1>
    
</html>

* **4.7% of females and 5.1% of males had a stroke**

* **Only 13% of people with hypertension had a stroke, i.e. 498 x 0.132530 ≈ 66 people**

* **17% of people with heart disease had a stroke, compared with 4.1% of people without heart disease**

* **6.5% of people who have been married had a stroke, compared with 1.6% of people who have never been married**

* **5% of people in government or private jobs had a stroke**

* **7.9% of self-employed people had a stroke**

* **4.5% of people who live in rural areas had a stroke**

* **5.2% of people who live in urban areas had a stroke**
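The percentages above are per-category stroke rates. A minimal sketch of how such rates can be computed directly, using a hypothetical toy frame and a helper name (`stroke_rate_by`) that is not part of the notebook:

```python
import pandas as pd

# Hypothetical toy data (not the real dataset).
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 0, 1, 1],
})

def stroke_rate_by(data, feature, target="stroke"):
    """Return patient count, stroke count and stroke rate (%) per category."""
    summary = data.groupby(feature)[target].agg(patients="count", strokes="sum")
    summary["stroke_rate_pct"] = 100 * summary["strokes"] / summary["patients"]
    return summary

rates = stroke_rate_by(toy, "hypertension")
print(rates)
```

On the real data, the same call with `df` and any column from `cat_columns` should give a table comparable to the percentages printed on the hue plots.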



[Slide to top](#toc)
<html>
    <a id="4.1.2"></a>
    <h1 style='color:#B1624EFF;background-color:#5CC8D7FF;font-size:20px;padding:10px;'><center><b>4.1.2. CONTINUOUS FEATURES</b></center></h1>
    
</html>

### **AGE**

**As the plots below show, people above the age of 40 are more likely to have a stroke**

In [None]:
nrows=1
ncols=2
f,ax=plt.subplots(nrows=1,ncols=2,figsize=(15,6))
f.patch.set_facecolor('#F2EDD7FF')

#Setting background and foreground color
for j in range(0,ncols):
    ax[j].set_facecolor('#F2EDD7FF')
    ax[j].spines['top'].set_visible(False)
    ax[j].spines['right'].set_visible(False)
    ax[j].spines['left'].set_visible(False)
    ax[j].grid(linestyle="--",axis='y',color='gray')
        
        
sns.histplot(data=df,x='age',ax=ax[0],palette="Set1",kde=True)
sns.histplot(data=df,x='age',hue='stroke',multiple='stack',ax=ax[1],palette="gnuplot",kde=True)

In [None]:
fig=plt.figure(figsize=(15,5))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')
        
plt.text(-0.2,-0.2,"No outliers",fontweight='bold',fontsize=15)
plt.title("Boxen plot of age column",fontweight='bold',fontsize=20)
ax=sns.boxenplot(data=df,x='age',palette="gnuplot")


**Log distribution of age**
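Skewness can also be checked numerically instead of by eye. A minimal sketch, assuming a hypothetical evenly spread set of ages (not the real column): the raw values are symmetric, while the log transform pulls the distribution into a long left tail, which shows up as negative skewness.

```python
import numpy as np
import pandas as pd

# Hypothetical ages 1..80, evenly spread (illustrative only).
age = pd.Series(np.arange(1, 81), dtype=float)

# Sample skewness: about 0 for symmetric data, negative for a left-skewed one.
raw_skew = age.skew()
log_skew = np.log(age).skew()

print("Skewness of age:     ", raw_skew)
print("Skewness of log(age):", log_skew)
```

On the real data, `df['age'].skew()` and `np.log(df['age']).skew()` give the same comparison.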

In [None]:
fig=plt.figure(figsize=(15,5))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')

plt.text(-2,350,"Log distribution is left skewed\nso we won't change anything",fontweight='bold',fontsize=12)
plt.title("Log distribution of age",fontweight='bold',fontsize=20)
sns.histplot(np.log(df['age']),kde=True,palette="coolwarm")

### **BMI**

In [None]:
f,ax=plt.subplots(nrows=1,ncols=2,figsize=(15,6))
f.patch.set_facecolor('#F2EDD7FF')

#Setting background and foreground color
for j in range(0,ncols):
    ax[j].set_facecolor('#F2EDD7FF')
    ax[j].spines['top'].set_visible(False)
    ax[j].spines['right'].set_visible(False)
    ax[j].spines['left'].set_visible(False)
    ax[j].grid(linestyle="--",axis='y',color='gray')

ax[0].text(50,500,"Distribution of\nbmi without stroke\nis Right Skewed",fontweight='bold',fontsize=12)
sns.histplot(data=df,x='bmi',ax=ax[0],palette="coolwarm",kde=True,bins=40)
ax[1].text(50,500,"Distribution of\nbmi with stroke",fontweight='bold',fontsize=12)
sns.histplot(data=df,x='bmi',hue='stroke',multiple='stack',ax=ax[1],palette="gnuplot",kde=True,bins=40)

**BMI has many outliers**
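The outlier check used in this notebook is the 1.5 x IQR rule. A minimal sketch of it as a reusable helper, with a hypothetical function name (`iqr_bounds`) and toy values that are not from the dataset:

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy BMI values (illustrative only).
toy_bmi = pd.Series([18.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 60.0])

lower, upper = iqr_bounds(toy_bmi)
outliers = toy_bmi[(toy_bmi < lower) | (toy_bmi > upper)]
outlier_pct = 100 * len(outliers) / len(toy_bmi)

print("Fences:", lower, upper)
print("Outliers:", outliers.tolist(), "->", outlier_pct, "%")
```

Computing the percentage from `len(...)` avoids hard-coding row counts that change if the data or the imputation changes.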

In [None]:
#Outliers in bmi 
df_bmi=sorted(df['bmi'])
Q1,Q3=np.percentile(df_bmi,[25,75])
IQR= Q3-Q1
lower_range= Q1-(1.5*IQR)
upper_range=Q3+(1.5*IQR)

print("Lower range of outliers : ",lower_range)
print("Upper range of outliers : ",upper_range)
df_lower_outliers=df[df.bmi<lower_range]
df_upper_outliers=df[df.bmi>upper_range]

In [None]:
print("***** Lower outliers of bmi *****")
print()
df_lower_outliers

In [None]:
print(df_upper_outliers.shape)
print()
#Compute the percentage from the data instead of hard-coding the counts
print("Percentage of upper outliers in bmi:", (len(df_upper_outliers)/len(df))*100 )
print()
print("***** Upper outliers of bmi *****")
print()
df_upper_outliers


In [None]:
fig=plt.figure(figsize=(15,5))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')

plt.text(60,-0.3,'Percentage of upper outliers in bmi are: 2.446%',fontweight='bold',fontsize=15)
plt.text(60,-0.4,'Percentage of lower outliers in bmi are: 0.0002%',fontweight='bold',fontsize=15)
plt.title("Box Plot of body mass index",fontweight='bold',fontsize=20)
sns.boxplot(data=df,x='bmi',palette='gnuplot')


In [None]:
#Dropping outliers
df1_without_outliers=df.drop(df[df.bmi>upper_range].index)

In [None]:
df1_without_outliers

**DISTRIBUTION OF BMI WITHOUT OUTLIERS**

**The distribution of bmi becomes closer to normal after removing outliers, so we will use this data for further modelling and prediction**

In [None]:
f,ax=plt.subplots(nrows=1,ncols=2,figsize=(15,5))
f.patch.set_facecolor('#F2EDD7FF')

#Setting background and foreground color
for j in range(0,ncols):
    ax[j].set_facecolor('#F2EDD7FF')
    ax[j].spines['top'].set_visible(False)
    ax[j].spines['right'].set_visible(False)
    ax[j].spines['left'].set_visible(False)
    ax[j].grid(linestyle="--",axis='y',color='gray')

ax[0].text(35,300,"Distribution of\nbmi without stroke",fontweight='bold',fontsize=15)
sns.histplot(data=df1_without_outliers,x='bmi',ax=ax[0],palette="gnuplot",kde=True,bins=40)
ax[1].text(35,300,"Distribution of\nbmi with stroke",fontweight='bold',fontsize=15)
sns.histplot(data=df1_without_outliers,x='bmi',hue='stroke',multiple='stack',ax=ax[1],palette="gnuplot",kde=True,bins=40)

### **AVERAGE GLUCOSE LEVEL**

**Strokes occur at glucose levels both below and above 150**

In [None]:
f,ax=plt.subplots(nrows=1,ncols=2,figsize=(15,5))
f.patch.set_facecolor('#F2EDD7FF')

#Setting background and foreground color
for j in range(0,ncols):
    ax[j].set_facecolor('#F2EDD7FF')
    ax[j].spines['top'].set_visible(False)
    ax[j].spines['right'].set_visible(False)
    ax[j].spines['left'].set_visible(False)
    ax[j].grid(linestyle="--",axis='y',color='gray')
    
ax[0].text(150,300,"Distribution of\nglucose_level without stroke",fontweight='bold',fontsize=15)
sns.histplot(data=df,x='avg_glucose_level',ax=ax[0],palette="Set1",kde=True,bins=40)
ax[1].text(150,300,"Distribution of\nglucose_level with stroke",fontweight='bold',fontsize=15)
sns.histplot(data=df,x='avg_glucose_level',hue='stroke',multiple='stack',ax=ax[1],palette="gnuplot",kde=True,bins=40)

In [None]:
df_glucose=sorted(df['avg_glucose_level'])
Q1,Q3=np.percentile(df_glucose,[25,75])
IQR= Q3-Q1
lower_range= Q1-(1.5*IQR)
upper_range=Q3+(1.5*IQR)

print("Lower range of outliers in avg_glucose_level : ",lower_range)
print("Upper range of outliers in avg_glucose_level : ",upper_range)
df_lower_outliers=df[df.avg_glucose_level<lower_range]
df_upper_outliers=df[df.avg_glucose_level>upper_range]

In [None]:
#No lower outliers
print("Number of lower outliers in avg_glucose_level:", len(df_lower_outliers))
print("Percentage of upper outliers in avg_glucose_level:", (len(df_upper_outliers)/len(df))*100 )
print()
print("***** Upper outliers of avg_glucose_level *****")
print()
df_upper_outliers #627 upper outliers for avg_glucose_level column

In [None]:
fig=plt.figure(figsize=(15,5))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')

plt.text(180,-0.3,'Percentage of upper outliers in avg_glucose_level: 12.27%',fontweight='bold',fontsize=15)
plt.text(180,-0.4,'Percentage of lower outliers in avg_glucose_level: 0%',fontweight='bold',fontsize=15)
sns.boxplot(data=df,x='avg_glucose_level',palette='gnuplot')

**DISTRIBUTION OF GLUCOSE_LEVEL WITHOUT OUTLIERS**

In [None]:
df1_outliers_glucose=df.drop(df[df.avg_glucose_level>upper_range].index)

**Removing outliers from avg_glucose_level has the same effect as for bmi, but dropping 12.27% of the rows would lose too much information, so we leave this column unchanged**

In [None]:
f,ax=plt.subplots(nrows=1,ncols=2,figsize=(15,5))
f.patch.set_facecolor('#F2EDD7FF')

#Setting background and foreground color
for j in range(0,ncols):
    ax[j].set_facecolor('#F2EDD7FF')
    ax[j].spines['top'].set_visible(False)
    ax[j].spines['right'].set_visible(False)
    ax[j].spines['left'].set_visible(False)
    ax[j].grid(linestyle="--",axis='y',color='gray')
    
ax[0].text(65,300,"Distribution of\navg glucose without stroke",fontweight='bold',fontsize=15)
sns.histplot(data=df1_outliers_glucose,x='avg_glucose_level',ax=ax[0],palette="gnuplot",kde=True,bins=40)
ax[1].text(65,300,"Distribution of\navg glucose with stroke",fontweight='bold',fontsize=15)
sns.histplot(data=df1_outliers_glucose,x='avg_glucose_level',hue='stroke',multiple='stack',ax=ax[1],palette="gnuplot",kde=True,bins=40)

[Slide to top](#toc)
<html>
    <a id="4.1"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>4.2. BIVARIATE ANALYSIS</b></center></h1>
    
</html>

**ENCODING: ONE HOT ENCODING**
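As a small illustration of what one-hot encoding does, here is a sketch on a hypothetical three-row frame (not the real data): each category of the encoded column becomes its own indicator column, and the other columns are left as they are.

```python
import pandas as pd

# Hypothetical toy rows (illustrative only).
toy = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Private"],
    "age": [40, 61, 33],
})

# One indicator column per category of work_type; "age" is left untouched.
encoded = pd.get_dummies(toy, columns=["work_type"])

print(encoded)
```

The new columns are named `<column>_<category>`, which is where names such as `work_type_children` in the correlation heatmap come from.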

In [None]:
df1=pd.get_dummies(df1_without_outliers,columns=["work_type",'smoking_status'])
df1.head()

In [None]:
print("Number of Features : ",len(list(df1.columns)))

In [None]:
df1['gender']=df1["gender"].map({"Male":0,"Female":1,"Other":2}).astype(int)
df1['ever_married']=df1["ever_married"].map({"Yes":1,"No":0}).astype(int)
df1['Residence_type']=df1["Residence_type"].map({"Urban":1,"Rural":0}).astype(int)


In [None]:
fig=plt.figure(figsize=(15,12))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")


sns.heatmap(df1.corr(),annot=True,linewidth=2)

In [None]:
xvars=['bmi','avg_glucose_level','age','stroke']
yvars=['bmi','avg_glucose_level','age','stroke']
sns.pairplot(df1,x_vars=xvars,y_vars=yvars,hue="stroke",palette="gnuplot")

1. **Most pairs of features show no strong positive or negative correlation**

2. **age and ever_married have a correlation of 0.68, which is expected; work_type_children and age are negatively correlated, which is also expected**

3. **People above the age of 40 are more likely to have a stroke**

4. **Increasing age goes along with increasing hypertension, heart_disease, bmi, avg_glucose_level and stroke (see the heatmap)**
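Reading single values off a large heatmap is error-prone. A minimal sketch of ranking features by their correlation with the target, using a hypothetical toy frame (not the real data):

```python
import pandas as pd

# Hypothetical toy data (illustrative only).
toy = pd.DataFrame({
    "age":    [20, 35, 50, 65, 80],
    "bmi":    [30, 22, 27, 24, 29],
    "stroke": [0, 0, 0, 1, 1],
})

# Correlation of every feature with the target, strongest first (by absolute value).
corr_with_target = (
    toy.corr()["stroke"]
    .drop("stroke")
    .sort_values(key=abs, ascending=False)
)

print(corr_with_target)
```

Applied to `df1`, the same expression lists the features in the order of their linear association with `stroke`.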


In [None]:
fig=plt.figure(figsize=(15,5))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')
plt.title("bmi vs avg_glucose_level",fontweight='bold',fontsize=20)
sns.scatterplot(data=df1,x=df1['avg_glucose_level'],y=df1['bmi'],hue='stroke',style='stroke',palette='cool')

In [None]:
fig=plt.figure(figsize=(15,5))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')
plt.title("age vs avg_glucose_level",fontweight='bold',fontsize=20)

sns.scatterplot(data=df1,x='avg_glucose_level',y='age',hue='stroke',style='stroke',palette='cool')

In [None]:
fig=plt.figure(figsize=(15,5))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(linestyle="--",axis='y',color='gray')
plt.title("bmi vs age",fontweight='bold',fontsize=20)

sns.scatterplot(data=df1,x='bmi',y='age',hue='stroke',style='stroke',palette='cool')

[Slide to top](#toc)
<html>
    <a id="5"></a>
    <h1 style='color:#FCF6F5FF;background-color:#89ABE3FF;font-size:40px;font-style:italic;padding:10px;'><center><b>5. MODEL & PREDICTION</b></center></h1>
    
</html>


[Slide to top](#toc)
<html>
    <a id="5.1"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>5.1. OPTUNA + ONE HOT ENCODING + ENSEMBLING METHODS</b></center></h1>
    
</html>

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import optuna
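from sklearn.metrics import accuracy_score,classification_report,roc_auc_score,f1_score,roc_curve,auc
try:
    #These helpers were removed in scikit-learn 1.2; import them only where available
    from sklearn.metrics import plot_confusion_matrix,plot_roc_curve
except ImportError:
    pass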
from sklearn.preprocessing import StandardScaler

In [None]:
Y=df1['stroke']
X=df1.drop('stroke',axis=1)
X

In [None]:
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2,random_state=42)

In [None]:
#BEFORE RESAMPLING
print("'stroke' value counts before oversampling")
print()
y_train.value_counts()

In [None]:
#Oversampling
smt=SMOTE()
x_train_sampling,y_train_sampling=smt.fit_resample(x_train,y_train)

In [None]:
#AFTER RESAMPLING
print("'stroke' value counts after oversampling")
print()

y_train_sampling.value_counts()

In [None]:
print("'stroke' value counts in test dataset")
print()

y_test.value_counts()

In [None]:
smt_test=SMOTE()
x_test_sampling,y_test_sampling=smt_test.fit_resample(x_test,y_test)

In [None]:
print("'stroke' value counts in test dataset after sampling")
print()

y_test_sampling.value_counts()

[Slide to top](#toc)
<html>
    <a id="5.1.1"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>5.1.1. RANDOM FOREST CLASSIFIER</b></center></h1>
    
</html>

In [None]:
def objective(trial):
    
    n_estimators = trial.suggest_int('n_estimators', 2, 200)
    max_depth = int(trial.suggest_int('max_depth', 1, 40))
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    return cross_val_score(clf, x_train_sampling, y_train_sampling, 
           n_jobs=-1, cv=5,scoring='f1').mean()
    



In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)

In [None]:
trial = study.best_trial
print("***** Best parameters *****")
print(trial.value)
print(trial.params)

In [None]:
clf=RandomForestClassifier(n_estimators=129,max_depth=36)
clf.fit(x_train_sampling,y_train_sampling)


In [None]:
pred_rf=clf.predict(x_test_sampling)
print("***** Accuracy of random forest classifier *****")
print()
print(accuracy_score(y_test_sampling,pred_rf))

In [None]:
print("***** Classification report of random forest classifier *****")
print()
print(classification_report(y_test_sampling,pred_rf))

In [None]:
fig=plt.figure(figsize=(12,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

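#Use predicted probabilities so the ROC curve covers many thresholds
proba_rf=clf.predict_proba(x_test_sampling)[:,1]
fpr,tpr,_=roc_curve(y_test_sampling,proba_rf)

plt.title('Random Forest ROC curve: Stroke')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')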

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))


[Slide to top](#toc)
<html>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b> FEATURE IMPORTANCE OF RF CLASSIFIER</b></center></h1>
    
</html>

In [None]:
feature_importance = np.array(clf.feature_importances_)
feature_names = np.array(x_train_sampling.columns)
data={'feature_names':feature_names,'feature_importance':feature_importance}
df_plt = pd.DataFrame(data)
df_plt.sort_values(by=['feature_importance'], ascending=False,inplace=True)
fig=plt.figure(figsize=(12,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

sns.barplot(x=df_plt['feature_importance'], y=df_plt['feature_names'])
#plt.style.use("ggplot")
plt.xlabel('FEATURE IMPORTANCE')
plt.ylabel('FEATURE NAMES')
plt.title("Important features for RandomForest Classifier")
plt.show()


[Slide to top](#toc)
<html>
    <a id="5.1.2"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>5.1.2. LGBM CLASSIFIER</b></center></h1>
    
</html>

In [None]:
import lightgbm as lgb

In [None]:
def objective_lgbm(trial):
    
    n_estimators = trial.suggest_int('n_estimators', 2, 300)
    max_depth = int(trial.suggest_int('max_depth', 2, 50))
    #suggest_loguniform is deprecated in recent Optuna; suggest_float(..., log=True) is the equivalent
    learning_rate=trial.suggest_float('learning_rate',0.001,1,log=True)
    colsample_bytree=trial.suggest_float("colsample_bytree",0.1, 1,log=True)
    num_leaves=trial.suggest_int('num_leaves',10,300)
    reg_alpha= trial.suggest_float('reg_alpha',0.1,1,log=True)
    reg_lambda= trial.suggest_float('reg_lambda',0.1,1,log=True)
    min_split_gain=trial.suggest_float('min_split_gain',0.1,1,log=True)
    subsample=trial.suggest_float('subsample',0.1,1,log=True)
    clf = lgb.LGBMClassifier(n_estimators=n_estimators, max_depth=max_depth,
                            learning_rate=learning_rate,colsample_bytree=colsample_bytree,
                            num_leaves=num_leaves,reg_alpha=reg_alpha,reg_lambda=reg_lambda,
                            min_split_gain=min_split_gain,subsample=subsample)
    return cross_val_score(clf, x_train_sampling, y_train_sampling, 
           n_jobs=-1, cv=5,scoring='f1').mean()


In [None]:
study_lgbm= optuna.create_study(direction='maximize')
study_lgbm.optimize(objective_lgbm, n_trials=40)

In [None]:
trial_lgbm= study_lgbm.best_trial
print("***** Best parameters *****")
print(trial_lgbm.value)
print(trial_lgbm.params)

In [None]:
model_lgbm=lgb.LGBMClassifier(n_estimators=139, max_depth=21, learning_rate=0.045552659197751554, 
                              colsample_bytree=0.5296024837571571, num_leaves=161, reg_alpha=0.13175537618486874, 
                              reg_lambda=0.31574328097598714, 
                              min_split_gain=0.18022256039561763,subsample=0.35631707963483955)

In [None]:
model_lgbm.fit(x_train_sampling,y_train_sampling)

In [None]:
pred_lgbm=model_lgbm.predict(x_test_sampling)
print("***** Accuracy of LGBM classifier *****")
print()
print(accuracy_score(pred_lgbm,y_test_sampling))

In [None]:
print(classification_report(y_test_sampling,pred_lgbm))

In [None]:
fig=plt.figure(figsize=(12,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

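#Use predicted probabilities so the ROC curve covers many thresholds
proba_lgbm=model_lgbm.predict_proba(x_test_sampling)[:,1]
fpr,tpr,_=roc_curve(y_test_sampling,proba_lgbm)

plt.title('LGBM ROC curve: Stroke')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')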

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))


[Slide to top](#toc)
<html>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b> FEATURE IMPORTANCE OF LGBM CLASSIFIER</b></center></h1>
    
</html>

In [None]:
feature_importance = np.array(model_lgbm.feature_importances_)
feature_names = np.array(x_train_sampling.columns)
data={'feature_names':feature_names,'feature_importance':feature_importance}
df_plt = pd.DataFrame(data)
df_plt.sort_values(by=['feature_importance'], ascending=False,inplace=True)
fig=plt.figure(figsize=(12,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

sns.barplot(x=df_plt['feature_importance'], y=df_plt['feature_names'])
#plt.style.use("ggplot")
plt.xlabel('FEATURE IMPORTANCE')
plt.ylabel('FEATURE NAMES')
plt.title("Important features for LGBM Classifier")
plt.show()


[Slide to top](#toc)
<html>
    <a id="5.2"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>5.2. OPTUNA + LABEL ENCODING + ENSEMBLING METHODS</b></center></h1>
    
</html>

In [None]:
df2=df1_without_outliers.copy()
df2

### **LABEL ENCODING**
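Mapping categories to integers with a dictionary silently produces NaN for any value missing from the dictionary. A minimal sketch of a check for that, on hypothetical toy values (the category "vapes" is invented here to trigger the case):

```python
import pandas as pd

# Mapping for smoking_status (same codes as used in this section).
smoking_codes = {"never smoked": 0, "Unknown": 1, "formerly smoked": 2, "smokes": 3}

# Hypothetical toy column; "vapes" is an invented value that is not in the mapping.
toy = pd.Series(["smokes", "Unknown", "never smoked", "vapes"])

encoded = toy.map(smoking_codes)

# Any value that did not match a key ends up as NaN after map().
unmapped = toy[encoded.isna()].unique().tolist()

print(encoded.tolist())
print("Unmapped categories:", unmapped)
```

An empty `unmapped` list on the real column confirms that every category received a code.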

In [None]:
df2['gender']=df2["gender"].map({"Male":0,"Female":1,"Other":2}).astype(int)
df2['ever_married']=df2["ever_married"].map({"Yes":1,"No":0}).astype(int)
df2['Residence_type']=df2["Residence_type"].map({"Urban":1,"Rural":0}).astype(int)
df2['work_type']=df2['work_type'].map({"Private":0,'Self-employed':1,'children':2,'Govt_job':3,
                                      "Never_worked":4})
df2['smoking_status']=df2['smoking_status'].map({'never smoked':0,'Unknown':1,'formerly smoked':2,
                                                "smokes":3})

df2.head()

In [None]:
fig=plt.figure(figsize=(15,10))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

sns.heatmap(df2.corr(),annot=True,linewidths=2)

In [None]:
Y_new=df2['stroke']
X_new=df2.drop('stroke',axis=1)


In [None]:
x_train_new,x_test_new,y_train_new,y_test_new=train_test_split(X_new,Y_new,test_size=0.2,random_state=42)

In [None]:
smt=SMOTE()
x_train_sampling_new,y_train_sampling_new=smt.fit_resample(x_train_new,y_train_new)

In [None]:
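# Caution: resampling the test set adds synthetic minority samples, so the scores
# below describe a balanced test set rather than the original class distribution.
smt_test=SMOTE()
x_test_sampling_new,y_test_sampling_new=smt_test.fit_resample(x_test_new,y_test_new)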
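Balancing the test set changes the share of positive cases, and precision (and therefore F1) depends on that share even when the classifier itself is unchanged. A small worked example, using illustrative numbers that are assumptions and not results from this notebook (TPR 0.80, FPR 0.10, and a 5% versus 50% positive rate):

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by a fixed TPR and FPR at a given positive rate."""
    true_pos = tpr * prevalence            # share of all samples that are true positives
    false_pos = fpr * (1.0 - prevalence)   # share of all samples that are false positives
    return true_pos / (true_pos + false_pos)

# Illustrative operating point (assumed values, not measured in this notebook).
tpr, fpr = 0.80, 0.10

balanced = precision_at(tpr, fpr, 0.50)    # test set balanced by oversampling
imbalanced = precision_at(tpr, fpr, 0.05)  # test set with few positive cases
print(round(balanced, 3), round(imbalanced, 3))
```

With the same TPR and FPR, precision is far lower on the imbalanced set, so scores measured on a resampled test set can look better than they would on the original data.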

[Slide to top](#toc)
<html>
    <a id="5.2.1"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>5.2.1. RANDOM FOREST CLASSIFIER</b></center></h1>
    
</html>

In [None]:
def objective(trial):
    
    n_estimators = trial.suggest_int('n_estimators', 2, 200)
    max_depth = trial.suggest_int('max_depth', 1, 40)
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    return cross_val_score(clf, x_train_sampling_new, y_train_sampling_new, 
           n_jobs=-1, cv=5,scoring='f1').mean()

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)


In [None]:
trial = study.best_trial
print("***** Best Parameters *****")
print(trial.value)
print(trial.params)

In [None]:
clf=RandomForestClassifier(n_estimators=88,max_depth=28)
clf.fit(x_train_sampling_new,y_train_sampling_new)


In [None]:
pred_new=clf.predict(x_test_sampling_new)
print("***** Accuracy of random forest classifier *****")
print(accuracy_score(y_test_sampling_new,pred_new))

In [None]:
print("***** Classification report of random forest *****")
print()
print(classification_report(y_test_sampling_new,pred_new))

In [None]:
fig=plt.figure(figsize=(12,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

fpr,tpr,_=roc_curve(y_test_sampling_new,pred_new)

plt.title('Random Forest ROC curve: Stroke prediction')
plt.xlabel('FPR (1 - Specificity)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))
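`roc_curve` accepts either hard class labels or continuous scores. With hard labels, as returned by `predict`, there is only one real operating point, so the curve is built from at most three points. Passing the positive-class probability from `predict_proba` traces the full curve. A sketch on synthetic data, where `X_demo`, `y_demo` and `demo_clf` are hypothetical names and not this notebook's variables:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the stroke dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
demo_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# ROC from hard 0/1 predictions: a single operating point plus the two corners.
fpr_hard, tpr_hard, _ = roc_curve(y_te, demo_clf.predict(X_te))

# ROC from predicted probabilities of the positive class: many thresholds.
scores = demo_clf.predict_proba(X_te)[:, 1]
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)

print(len(fpr_hard), len(fpr_soft))
print(auc(fpr_hard, tpr_hard), auc(fpr_soft, tpr_soft))
```

The two AUC values generally differ, so it helps to state which variant a reported AUC refers to.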


[Slide to top](#toc)
<html>
    <a id="5.2.2"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>5.2.2. LGBM CLASSIFIER</b></center></h1>
    
</html>

In [None]:
def objective_lgbm(trial):
    
    n_estimators = trial.suggest_int('n_estimators', 2, 300)
    # suggest_loguniform is deprecated in recent Optuna releases;
    # suggest_float(..., log=True) samples from the same log-uniform range.
    max_depth = int(trial.suggest_float('max_depth', 2, 50, log=True))
    learning_rate=trial.suggest_float('learning_rate',0.001,1,log=True)
    colsample_bytree=trial.suggest_float("colsample_bytree",0.1, 1,log=True)
    num_leaves=trial.suggest_int('num_leaves',10,300)
    reg_alpha= trial.suggest_float('reg_alpha',0.1,1,log=True)
    reg_lambda= trial.suggest_float('reg_lambda',0.1,1,log=True)
    min_split_gain=trial.suggest_float('min_split_gain',0.1,1,log=True)
    subsample=trial.suggest_float('subsample',0.1,1,log=True)
    clf = lgb.LGBMClassifier(n_estimators=n_estimators, max_depth=max_depth,
                            learning_rate=learning_rate,colsample_bytree=colsample_bytree,
                            num_leaves=num_leaves,reg_alpha=reg_alpha,reg_lambda=reg_lambda,
                            min_split_gain=min_split_gain,subsample=subsample)
    return cross_val_score(clf, x_train_sampling_new, y_train_sampling_new, 
           n_jobs=-1, cv=5,scoring='f1').mean()


In [None]:
study_lgbm= optuna.create_study(direction="maximize")
study_lgbm.optimize(objective_lgbm, n_trials=40)

In [None]:
trial_lgbm= study_lgbm.best_trial
print("***** Best parameters *****")
print(trial_lgbm.value)
print(trial_lgbm.params)

In [None]:
model_lgbm=lgb.LGBMClassifier(n_estimators=228, max_depth=49, learning_rate=0.07246416747184325, 
                              colsample_bytree=0.659803224139728, num_leaves=182, reg_alpha=0.2647777683795973, 
                              reg_lambda=0.5432085589936458, 
                              min_split_gain=0.10089495584597963,subsample=0.17472830028174025)

In [None]:
model_lgbm.fit(x_train_sampling_new,y_train_sampling_new)
pred=model_lgbm.predict(x_test_sampling_new)
print(accuracy_score(y_test_sampling_new,pred))

In [None]:
print("***** Classification Report of LGBM *****")
print(classification_report(y_test_sampling_new,pred))

In [None]:
fig=plt.figure(figsize=(12,8))
ax = plt.axes() 
ax.set_facecolor("#F2EDD7FF") 
fig.patch.set_facecolor("#F2EDD7FF")

fpr,tpr,_=roc_curve(y_test_sampling_new,pred)

plt.title('LGBM ROC curve: Stroke prediction')
plt.xlabel('FPR (1 - Specificity)')
plt.ylabel('TPR (Recall)')

plt.plot(fpr,tpr)
plt.plot((0,1), ls='dashed',color='black')
plt.show()
print ('Area under curve (AUC): ', auc(fpr,tpr))


[Slide to top](#toc)
<html>
    <a id="5.2.2"></a>
    <h1 style='color:#0063B2FF;background-color:#9CC3D5FF;font-size:30px;padding:10px;'><center><b>ANY SUGGESTIONS ARE MOST WELCOME, PLEASE GIVE IT AN UPVOTE</b></center></h1>
    
</html>

**I GOT THE HIGHEST AUC (AREA UNDER CURVE) OF 0.90 WITH ONE-HOT ENCODING AND THE LGBM CLASSIFIER**

**IF YOU THINK I SHOULD DO SOMETHING MORE, OR KNOW ANY STEP THAT COULD INCREASE MY AUC, TELL ME IN THE COMMENTS AND I WILL EDIT THIS NOTEBOOK ACCORDING TO THE SUGGESTIONS**

**PLEASE GIVE IT AN UPVOTE; IT MAY HELP ME GET A JOB/INTERNSHIP**

