**Problem Statement:**

* A stroke occurs when the blood supply to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die in minutes.

* A stroke is a medical emergency, and prompt treatment is crucial. Early action can reduce brain damage and other complications.

**Goal:**

* To build a system that predicts a stroke in its early stages.

**Approach:**

* Three classifiers, an SVM, a gradient boosting model (scikit-learn's GradientBoostingClassifier) and an MLPClassifier, are tried in turn to achieve maximum accuracy.

**Import Necessary Libraries:**

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

**Read Data:**

In [None]:
data = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

**Analysing Data Read From CSV File:**

In [None]:
data.dtypes

**Data PreProcessing:**

1. As the list of datatypes above shows, the columns gender, ever_married, work_type, Residence_type and smoking_status are of object data type and need to be converted to numerical values before they can be used to train a model. Here gender, ever_married and Residence_type are mapped to 1 or 0, while smoking_status and work_type are one-hot encoded with pd.get_dummies.

In [None]:
data["Residence_type"] = data["Residence_type"].apply(lambda x: 1 if x == "Urban" else 0)
data["ever_married"] = data["ever_married"].apply(lambda x: 1 if x == "Yes" else 0)
data["gender"] = data["gender"].apply(lambda x: 1 if x == "Male" else 0)
data = pd.get_dummies(data=data, columns=['smoking_status', 'work_type'])
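
The two encoding styles used above can be seen side by side on a tiny frame. This is only an illustrative sketch: the rows are made up, and the "Other" gender value is included as an assumption to show that any value other than "Male" ends up as 0.

```python
import pandas as pd

# Toy frame with made-up rows; column names match the dataset used above.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "ever_married": ["Yes", "No", "Yes"],
    "work_type": ["Private", "children", "Private"],
})

# Binary mapping: anything that is not the named value becomes 0,
# so "Other" lands in the same bucket as "Female".
toy["gender"] = (toy["gender"] == "Male").astype(int)
toy["ever_married"] = (toy["ever_married"] == "Yes").astype(int)

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["work_type"], dtype=int)
print(encoded)
```

The one-hot columns are named `<column>_<category>`, which is where names such as `work_type_Private` in the feature list further down come from.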

In [None]:
data.isnull().sum()

2. Before providing data for modelling we need to ensure that the dataset does not contain any null values. As the output above shows, the "bmi" column has null values, which are filled here with the mean of the column.

In [None]:
data['bmi'] = data['bmi'].fillna(data['bmi'].mean())
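
As a small illustration of what mean imputation does, here is the same call on a three-value series. The numbers are made up and only serve to show that `mean()` ignores the missing entry before it is filled.

```python
import numpy as np
import pandas as pd

bmi = pd.Series([20.0, np.nan, 30.0], name="bmi")  # made-up values

# mean() skips NaN, so the fill value is (20 + 30) / 2 = 25.
filled = bmi.fillna(bmi.mean())
print(filled.tolist())
```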

3. The last step of data preprocessing is to put the numeric features on a common scale. Here StandardScaler is used to standardize the avg_glucose_level, bmi and age columns to zero mean and unit variance. (Scaling does not address class imbalance; that is handled later with SMOTE.)

In [None]:
std = StandardScaler()
columns = ['avg_glucose_level', 'bmi', 'age']
data[columns] = std.fit_transform(data[columns])
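
To make the effect of StandardScaler concrete, the sketch below scales one made-up column and checks the result. The values are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[50.0], [100.0], [150.0]])  # made-up glucose-like values

scaled = StandardScaler().fit_transform(values)

# Each column now has mean 0 and (population) standard deviation 1.
print(scaled.ravel())
print("mean:", scaled.mean(), "std:", scaled.std())
```

The number of rows and the class labels are untouched; only the scale of the feature values changes.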

**Exploring the "stroke" column, which shows a high level of class imbalance that will be dealt with later.**

In [None]:
print("Data shape : ", data.shape)
print("stroke cases     : ", sum(data.stroke == 1))
print("non-stroke cases : ", sum(data.stroke == 0))
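
Another way to look at the same imbalance is as a share of rows rather than raw counts. The sketch below uses made-up labels (5% positive) purely to show the pandas calls; the real proportions come from the cell above.

```python
import pandas as pd

stroke = pd.Series([0] * 19 + [1], name="stroke")  # made-up labels, 5% positive

counts = stroke.value_counts()               # rows per class
share = stroke.value_counts(normalize=True)  # fraction of rows per class

print(counts.to_dict())
print(share.round(2).to_dict())
```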

**Here, column "id" does not need to be a part of model training so it is removed.**

In [None]:
data.drop(columns='id', inplace=True)

**Dataset is divided into Target And Features. Here X contains the features and y contains the target.**

In [None]:
X = data[['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
          'Residence_type', 'avg_glucose_level', 'bmi',
          'smoking_status_Unknown', 'smoking_status_formerly smoked',
          'smoking_status_never smoked', 'smoking_status_smokes',
          'work_type_Govt_job', 'work_type_Never_worked', 'work_type_Private',
          'work_type_Self-employed', 'work_type_children']].values
y = data['stroke'].values

**As explored above, the classes are highly imbalanced, so SMOTE is used: an oversampling technique that generates synthetic samples for the minority class.**

In [None]:
smote = SMOTE()
x_smote, y_smote = smote.fit_resample(X, y)
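
The idea behind the synthetic samples can be sketched in a few lines of NumPy. This is a simplified, hand-rolled illustration and not imbalanced-learn's implementation: it always interpolates towards the single nearest minority neighbour, whereas SMOTE picks among several nearest neighbours. The points and the `synthetic_point` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])

def synthetic_point(points, rng):
    """Interpolate between a random point and its nearest minority neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    # Distance from the base point to every minority point.
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf  # exclude the base point itself
    neighbour = points[np.argmin(dists)]
    gap = rng.random()  # position along the segment, in [0, 1)
    return base + gap * (neighbour - base)

new_points = np.array([synthetic_point(minority, rng) for _ in range(5)])
print(new_points)
```

Every new point lies on a line segment between two existing minority points, so the synthetic data stays inside the region the minority class already occupies.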

**Here the dataset is split into training and testing sets using train_test_split.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_smote, y_smote, test_size=0.1, stratify = y_smote)

In [None]:
print("stroke Data in Train : ", sum(y_train))
print("stroke Data in Test : ", sum(y_test))
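
One thing to be aware of: because the split happens after oversampling, the test set also contains synthetic rows, which can make the scores look better than they would on untouched data. A common alternative is to split first and oversample only the training part. The sketch below shows that ordering on made-up data; it uses scikit-learn's `resample` (plain duplication) as a stand-in for SMOTE so that it runs without extra packages, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # made-up features
y = np.array([1] * 20 + [0] * 180)   # made-up labels, 10% positive

# 1. Split the original rows first; stratify keeps the 10% ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# 2. Oversample the minority class in the training part only.
#    resample() duplicates rows; smote.fit_resample(X_tr, y_tr) would go here instead.
minority = X_tr[y_tr == 1]
n_majority = int((y_tr == 0).sum())
extra = resample(minority, replace=True, n_samples=n_majority, random_state=0)
X_bal = np.vstack([X_tr[y_tr == 0], extra])
y_bal = np.array([0] * n_majority + [1] * n_majority)

print("train:", np.bincount(y_bal), "test:", np.bincount(y_te))
```

With this ordering the test set keeps the original class ratio, so the reported metrics reflect how the model would behave on real, imbalanced data.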

**Now the training data is used to fit an SVM, and the testing data is used to evaluate it.**

In [None]:
print("\n\n")
print("=" * 80)
print("=" * 35, "SVM", "=" * 35)
print("=" * 80)
# gamma is ignored by the linear kernel, so it is left out here.
svc = SVC(random_state=0, kernel='linear', C=1)
svc.fit(X_train, y_train)

In [None]:
svc_score = svc.score(X_train, y_train)
svc_test = svc.score(X_test, y_test)

In [None]:
y_pred = svc.predict(X_test)

In [None]:
print('\nTraining Score', svc_score)
print('Testing Score ', svc_test)
print("confusion_matrix  : \n", confusion_matrix(y_test, y_pred))
print("classification_report : \n", classification_report(y_test, y_pred, zero_division=1))

In [None]:
auc = roc_auc_score(y_test,y_pred)
auc
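
The AUC above is computed from hard 0/1 predictions, which evaluates a single threshold. `roc_auc_score` can also take continuous scores, for example from `SVC.decision_function`, and then summarizes all thresholds. The sketch below compares the two on synthetic stand-in data generated with `make_classification`; it is not the stroke dataset and the numbers it prints say nothing about the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; not the stroke dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC(kernel="linear", C=1, random_state=0).fit(X_tr, y_tr)

# AUC from hard labels: only one operating point is considered.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from signed distances to the hyperplane: all thresholds are considered.
auc_from_scores = roc_auc_score(y_te, clf.decision_function(X_te))

print(auc_from_labels, auc_from_scores)
```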

**The scores above show the accuracy achieved with the SVM, which is still modest. To improve on it, "GradientBoostingClassifier" is used next.**

In [None]:
print("\n\n")
print("=" * 80)
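print("=" * 28, "GradientBoostingClassifier", "=" * 28)
print("=" * 80)
# Note: this is scikit-learn's gradient boosting, not the separate xgboost library;
# the variable name is kept for consistency with the cells below.
xgboost = GradientBoostingClassifier(random_state=0)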
xgboost.fit(X_train, y_train)

In [None]:
xgboost_score = xgboost.score(X_train, y_train)
xgboost_test = xgboost.score(X_test, y_test)


In [None]:
y_pred = xgboost.predict(X_test)

In [None]:
print('\nTraining Score', xgboost_score)
print('Testing Score ', xgboost_test)
print("confusion_matrix  : \n", confusion_matrix(y_test, y_pred))
print("classification_report : \n", classification_report(y_test, y_pred))

In [None]:
auc = roc_auc_score(y_test,y_pred)
auc
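
The fit, score, predict and report steps are the same for every model in this notebook, so they could be wrapped in a small helper. The sketch below is one possible shape for it; the `evaluate` function is a hypothetical name, and the data is a synthetic stand-in from `make_classification` rather than the stroke dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return the metrics printed in the cells above."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_pred),
    }

# Synthetic stand-in data; not the stroke dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

result = evaluate(GradientBoostingClassifier(random_state=0), X_tr, y_tr, X_te, y_te)
print(result)
```

Calling the same helper with an `SVC` or `MLPClassifier` instance would return the same set of metrics, which makes the three models easier to compare.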

**Here the accuracy is about 91%, a significant increase achieved by using GradientBoostingClassifier. To try to improve on this further, "MLPClassifier" is used next.**

In [None]:
print("\n\n")
print("=" * 80)
print("=" * 35, "MLPClassifier", "=" * 35)
print("=" * 80)
mlp = MLPClassifier(hidden_layer_sizes=(300, 300, 300), max_iter=2000, alpha=0.00001,
                    solver='adam', verbose=1, random_state=21)
mlp.fit(X_train, y_train)

mlp_score = mlp.score(X_train, y_train)
mlp_test = mlp.score(X_test, y_test)

y_pred = mlp.predict(X_test)

print('\nTraining Score', mlp_score)
print('Testing Score ', mlp_test)
print("confusion_matrix  : \n", confusion_matrix(y_test, y_pred))
print("classification_report : \n", classification_report(y_test, y_pred))

In [None]:
auc = roc_auc_score(y_test,y_pred)
auc

**As we can see, the last recorded accuracy is about 96%, obtained with the MLPClassifier after trying the different types of algorithms in turn.**
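
In this notebook the three models are trained and scored separately. If the aim is to actually combine them into one predictor, scikit-learn's `VotingClassifier` is one option. The sketch below is an assumption about how that could look, not something the notebook does: it uses synthetic stand-in data and a much smaller MLP than the one above so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in data; not the stroke dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("svc", SVC(kernel="linear", C=1, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        # Smaller network than in the notebook, chosen only to keep the sketch fast.
        ("mlp", MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=21)),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)
ensemble.fit(X_tr, y_tr)
print("ensemble test accuracy:", ensemble.score(X_te, y_te))
```

Hard voting only needs predicted labels, so the SVC does not have to be fitted with `probability=True`; soft voting would require that.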

**Conclusion:**

Hello fellow coders, this is my first submission and the beginning of my AI/ML journey. I recently started learning, and with help from friends and from submissions by other coders I kept going by trial and error until the task was complete.

So if you find it helpful, give it an upvote. Most importantly, any suggestions, pointers to mistakes in my code, and lessons learned are most welcome. I would like to gain knowledge and move further in my journey of learning AI/ML.

In [None]:
print("Thank-You")