# <p style="text-align:center;"> <u>STROKE RISK PREDICTION</u> </p>

### AIM: OUR AIM IS TO PREDICT WHETHER A PERSON HAD A STROKE OR NOT BASED ON THE FOLLOWING FEATURES:

- id: unique identifier
- gender: "Male", "Female" or "Other"
- age: age of the patient
- hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- ever_married: "No" or "Yes"
- work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- Residence_type: "Rural" or "Urban"
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"

-----------

# <p style="text-align:center;"> <u>STEPS ONE SHOULD FOLLOW WHILE WORKING ON ML DATASET</u></p>


* <span style="font-size:20px;">In this kernel, I work through the Exploratory Data Analysis (EDA) of the stroke risk dataset. EDA is the first step in analysing a new dataset. Its primary objective is to examine the distribution of the data, its main characteristics and patterns, and any outliers or anomalies, mostly through visualization. Seeing the data graphically also helps to generate hypotheses.</span>

* <span style="font-size:20px;">Feature Engineering is the second step in any Machine Learning project. In this step we do the following:</span>
          
          Handle the missing values, if any.
          Handle the outliers.
          Handle the categorical data.
          Normalize the data for further model building.
    
    
* <span style="font-size:20px;">The third Step is Feature Selection. In this step we use some techniques to identify the important and unnecessary features to feed to our model.</span>
    
* <span style="font-size:20px;">The final step is building the Machine Learning models: Random Forest and XGBoost.</span>

* <span style="font-size:20px;">If the performance is not satisfactory, we perform hyper-parameter tuning to improve the model.</span>
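The tuning step mentioned in the last bullet is not carried out later in this kernel, so here is a minimal sketch of what it could look like with scikit-learn's `GridSearchCV`. The synthetic data, the small parameter grid and the choice of F1 as the score are all assumptions for illustration, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the stroke data (assumption: not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Hypothetical, deliberately small grid so the search runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # F1 rather than accuracy, because the classes are imbalanced
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

The same pattern would apply to `XGBClassifier`; only the estimator and the grid keys change.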



------


## IMPORT REQUIRED LIBRARIES:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams['figure.figsize'] = (5, 5)

In [None]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
#GET THE FIRST 5 ROWS:
df.head()

In [None]:
#GET THE LIST OF COLUMNS IN DATASET:
df.columns

In [None]:
#GET THE STATISTICS OF DATA:
df.describe()

* HERE WE CAN SEE THAT THE COUNT OF "BMI" IS LESS COMPARED TO OTHER FEATURES. THAT MEANS BMI HAS SOME NULL VALUES.

In [None]:
df.info()

* HERE THE CATEGORICAL FEATURES ARE - "GENDER", "EVER_MARRIED", "WORK_TYPE", "RESIDENCE_TYPE", "SMOKING_STATUS".
* WE NEED TO HANDLE THESE COLUMNS. (CONVERT TO NUMERIC DATA)
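Instead of reading the categorical columns off the `df.info()` output by eye, they can also be listed programmatically. The sketch below uses a small made-up frame that borrows a few of the dataset's column names; the values are not real records.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names (values are made up).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0],
    "ever_married": ["Yes", "Yes", "No"],
    "stroke": [1, 1, 0],
})

# Every non-numeric column holds strings and therefore needs encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

Running the same `select_dtypes` call on `df` should return the five columns named above.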

-----

## EDA & FEATURE ENGINEERING

-----------

## EDA

* <span style="font-size:20px;">Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. (source: Wikipedia)</span>

* <span style="font-size:20px;">In summary, EDA can show us hidden relationships and attributes present in our data even before we throw it at a machine learning model.</span>

## FEATURE ENGINEERING
    
*  <span style="font-size:20px;">Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning and is both difficult and expensive. (source: Wikipedia)</span>
    
*  <span style="font-size:20px;">In summary, FE is simply using your existing knowledge of the dataset to create new features that can help a machine learning model perform better.</span>

In [None]:
# Plot histograms of each parameter 

df.hist(figsize = (20, 20))
plt.show()

In [None]:
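#visualize the correlation between the numeric columns
#(the text columns are excluded, otherwise newer pandas versions raise an error)
plt.figure(figsize=(15,10))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap = 'Wistia')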
plt.show()

In [None]:
plt.subplots(figsize=(15,5))
# histplot replaces the deprecated sns.distplot
sns.histplot(df['bmi'], kde=True, color = 'cyan')
plt.title('Distribution of bmi', fontsize = 20)
plt.show()

In [None]:
plt.subplots(figsize=(15,5))
# histplot replaces the deprecated sns.distplot
sns.histplot(df['avg_glucose_level'], kde=True, color = 'cyan')
plt.title('Distribution of avg_glucose_level', fontsize = 20)
plt.show()

In [None]:
sns.pairplot(df, hue='stroke')

In [None]:
df['stroke'].value_counts(dropna = False).plot.bar(color = 'cyan')
plt.title('Comparison of stroke feature')
plt.xlabel('zero & one')
plt.ylabel('count')
plt.show()

* HERE WE CAN SEE THAT THE DATA IS HIGHLY IMBALANCED. WE'LL HANDLE THIS BELOW.
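The bar chart shows the imbalance visually; it can also be put into numbers. A small sketch with made-up labels (95 zeros and 5 ones, not the real counts of this dataset):

```python
import pandas as pd

# Made-up labels: 95 non-stroke rows and 5 stroke rows (not the real counts).
labels = pd.Series([0] * 95 + [1] * 5, name="stroke")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()  # majority rows per minority row

print(counts.to_dict())
print(shares.round(2).to_dict())
print("Majority-to-minority ratio:", imbalance_ratio)
```

On the real data the same three lines would be applied to `df['stroke']`.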

------

## 1. HANDLING MISSING VALUES:

### WHAT ARE MISSING VALUES ?

* <span style="font-size:20px;">Many real-world datasets may contain missing values for various reasons. They are often encoded as NaNs, blanks or any other placeholders. Training a model with a dataset that has a lot of missing values can drastically impact the machine learning model's quality.</span>

### WAYS TO HANDLE MISSING VALUES:

* 1. Deletion
* 2. Impute missing values with Mean/Median
* 3. Prediction Model
* 4. KNN Imputer
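To make the options above concrete, here is a minimal sketch of deletion, median imputation and KNN imputation on a tiny made-up frame. The prediction-model option is left out for brevity, and the numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data with one missing bmi value (values are made up).
toy = pd.DataFrame({
    "age": [25.0, 40.0, 60.0, 62.0],
    "bmi": [22.0, 27.0, np.nan, 31.0],
})

# Option 1: deletion drops the incomplete row.
dropped = toy.dropna()

# Option 2: median imputation fills the gap with the column median.
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

# Option 4: KNN imputation averages bmi over the 2 rows closest in age.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

print("rows left after deletion:", len(dropped))
print("median-imputed bmi:", median_filled[2, 1])
print("KNN-imputed bmi:", knn_filled[2, 1])
```

The KNN value differs from the median because it only looks at the patients whose age is closest to that of the incomplete row.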



In [None]:
df.isnull().sum()

In [None]:
#WE ARE REPLACING THE NULL VALUES WITH MEAN OF THAT FEATURE.
#(ASSIGNING BACK INSTEAD OF inplace=True, WHICH IS UNRELIABLE ON A SELECTED COLUMN IN NEWER PANDAS)
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())

In [None]:
df.isnull().sum()

* HERE WE CAN SEE THAT NOW OUR DATA HAVE ZERO NULL VALUES.

---

## 2. CHECK FOR OUTLIERS IN OUR DATA


### What is an Outlier ?

* <span style="font-size:20px;">Simply speaking, an outlier is an observation that lies far away from the rest of the sample and diverges from its overall pattern. Outliers need close attention, because otherwise they can result in wildly wrong estimations. (Source: Analytics Vidhya)</span>

![](https://www.analyticsvidhya.com/wp-content/uploads/2015/02/Outlier.png)

### What is the impact of Outliers on a dataset?

<span style="font-size:20px;">Outliers can drastically change the results of data analysis and statistical modeling. They have several unfavourable effects on a dataset:</span>

* They increase the error variance and reduce the power of statistical tests
* If the outliers are non-randomly distributed, they can decrease normality
* They can bias or influence estimates that may be of substantive interest
* They can also violate the basic assumptions of Regression, ANOVA and other statistical models.
    
    

In [None]:
#BMI FEATURE:
df.boxplot(column='bmi')

In [None]:
#AVG_GLUCOSE_LEVEL:
df.boxplot(column='avg_glucose_level')
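The points drawn beyond the whiskers in the boxplots above follow the usual 1.5 x IQR rule. The sketch below applies that rule by hand to a few made-up glucose readings, so the flagged values can be counted rather than only seen.

```python
import pandas as pd

# Made-up glucose readings; 250 is deliberately extreme.
glucose = pd.Series([80, 85, 90, 95, 100, 105, 110, 250])

# First and third quartile, and the interquartile range between them.
q1, q3 = glucose.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside these fences are what a boxplot draws as individual points.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = glucose[(glucose < lower) | (glucose > upper)]

print(f"fences: [{lower}, {upper}]")
print("flagged values:", outliers.tolist())
```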

## WE LEAVE THESE OUTLIERS AS THEY ARE. TREE-BASED MODELS SUCH AS RANDOM FOREST AND XGBOOST SPLIT ON THRESHOLDS, SO THEY ARE FAIRLY ROBUST TO EXTREME FEATURE VALUES.

---

## 3. HANDLE THE CATEGORICAL VARIABLES USING LABEL ENCODER

In [None]:
from sklearn.preprocessing import LabelEncoder
enc=LabelEncoder()

In [None]:
gender=enc.fit_transform(df['gender'])
smoking_status=enc.fit_transform(df['smoking_status'])
work_type=enc.fit_transform(df['work_type'])
Residence_type=enc.fit_transform(df['Residence_type'])
ever_married=enc.fit_transform(df['ever_married'])

In [None]:
df['ever_married']=ever_married
df['Residence_type']=Residence_type
df['smoking_status']=smoking_status
df['gender']=gender
df['work_type']=work_type

In [None]:
df[['ever_married', 'Residence_type', 'smoking_status', 'gender', 'work_type']].head()

In [None]:
df.info()

#### THE CATEGORICAL FEATURES ARE HANDLED.
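It helps to know which integer each category received. `LabelEncoder` sorts the labels and numbers them in that order, which the following sketch shows on a few `smoking_status` values (the sample list is made up, and `demo_enc` is a separate encoder used only for this illustration).

```python
from sklearn.preprocessing import LabelEncoder

# Sample values taken from the smoking_status categories listed at the top.
smoking = ["never smoked", "smokes", "Unknown", "formerly smoked", "smokes"]

demo_enc = LabelEncoder()
codes = demo_enc.fit_transform(smoking)

# classes_ is sorted, and each label's code is its position in that array.
mapping = {label: int(code) for code, label in enumerate(demo_enc.classes_)}

print(mapping)
print(codes.tolist())
print(demo_enc.inverse_transform([0]).tolist())
```

Because the single `enc` object above is refitted for every column, it only remembers the mapping of the last column it saw. If the original labels are needed later, one option is to keep one encoder per column. Note also that the integer codes imply an order that the categories do not really have; tree-based models usually cope with this, but one-hot encoding is a common alternative.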

-------

## 4. REMOVE UNNECESSARY COLUMNS IF ANY

In [None]:
#ID COLUMN IS NOT REQUIRED.
df = df.drop('id', axis=1)

In [None]:
df.head()
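Beyond dropping `id`, one way to judge which of the remaining columns matter is to rank them by the feature importances of a fitted Random Forest. This is only a sketch on synthetic data; the column names `feature_0` ... `feature_5` are hypothetical and the ranking says nothing about the real stroke features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first two columns carry the signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
cols = [f"feature_{i}" for i in range(6)]  # hypothetical column names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# Importances sum to 1; larger values mean the column was more useful for splits.
importances = pd.Series(model.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

Columns with importance close to zero are candidates for removal, although that decision should be checked against a validation score.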

----

## SPLIT THE DATASET INTO X & Y

In [None]:
X = df.drop('stroke', axis=1)
y = df['stroke']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
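With a rare positive class, a purely random split can leave the train and test sets with slightly different stroke rates. Passing `stratify` to `train_test_split` keeps the class proportions equal in both parts. The sketch below uses made-up data with 5% positives; it is an optional variation, not the split used in this kernel.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 5% positives (not the real stroke counts).
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 50 + [0] * 950)

# stratify=y_demo keeps the 5% positive share in both parts of the split.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```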

-----

## 5. HANDLING IMBALANCED DATA

* WE WILL BE USING SMOTE TECHNIQUE TO HANDLE THE IMBALANCED DATA.

In [None]:
from imblearn.over_sampling import SMOTE

print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = SMOTE(random_state=2)
# fit_resample accepts the pandas Series directly, so .ravel() is not needed
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

## THE TRAINING DATA IS NOW BALANCED. THE TEST DATA IS LEFT UNTOUCHED.
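To see what SMOTE does under the hood: it picks a minority-class point, picks one of its nearest minority neighbours, and creates a new point somewhere on the line between them. The NumPy sketch below illustrates only this core idea on made-up points; `smote_like` is a hypothetical helper, not the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in two features.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def smote_like(points, n_new, k=2):
    """Create n_new synthetic rows by interpolating towards near neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from the chosen point to every minority point.
        dists = np.linalg.norm(points - points[i], axis=1)
        # Position 0 after sorting is the point itself, so skip it.
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # how far to move along the connecting segment
        synthetic.append(points[i] + lam * (points[j] - points[i]))
    return np.array(synthetic)

new_rows = smote_like(minority, n_new=3)
print(new_rows.shape)
```

Since the new rows are interpolated, label-encoded columns can end up with fractional values that correspond to no real category. `imblearn` also offers `SMOTENC` for data that mixes numeric and categorical features.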

-----------

## 6. MODEL BUILDING USING THE PRE-PROCESSED / BALANCED DATASET.

In [None]:
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, classification_report

In [None]:
import numpy as np
rdf_model = RandomForestClassifier()
rdf_model.fit(X_train_res, y_train_res)
print('Training Score: {}'.format(rdf_model.score(X_train_res, y_train_res)))
print('Test Score: {}'.format(rdf_model.score(X_test, y_test)))

In [None]:
xgb_model = XGBClassifier()
xgb_model.fit(X_train_res, y_train_res)
print('Training Score: {}'.format(xgb_model.score(X_train_res, y_train_res)))

print('Test Score: {}'.format(xgb_model.score(X_test, y_test)))

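## HERE, WE HAVE TRAINED TWO MODELS - RANDOM FOREST & XGBOOST. THEY GAVE A VERY GOOD ACCURACY SCORE WITHOUT ANY HYPER-PARAMETER TUNING. NOTE THAT THE TEST SET IS STILL IMBALANCED, SO ACCURACY ALONE CAN OVERSTATE HOW WELL STROKES ARE DETECTED.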
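The `score` method used above reports accuracy, and on an imbalanced test set a high accuracy can hide poor detection of the rare class. The sketch below makes this concrete with made-up labels and a "model" that always predicts no stroke.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up test labels: 95 non-stroke, 5 stroke.
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)     # share of real strokes that were detected
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print("accuracy:", acc)
print("recall (stroke):", rec)
print(cm)
```

For the models in this kernel, the already imported `classification_report` can give the same per-class view, for example `print(classification_report(y_test, xgb_model.predict(X_test)))`.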

-----

## CONCLUSION

-----

### THIS IS HOW YOU HAVE TO APPROACH A MACHINE LEARNING REGRESSION OR CLASSIFICATION PROBLEM. EACH THING SHOULD BE FOLLOWED STEP-WISE.

* ## BEFORE STARTING TO CODE - FIRST UNDERSTAND THE PROBLEM, THEN MAKE A MIND MAP OF HOW TO START AND APPROACH THIS.

----

### I HOPE THIS HELPS YOU TO START YOUR JOURNEY IN THIS FIELD.

### IF YOU LIKE THIS, PLEASE GIVE ME AN UPVOTE.