# **Stroke Pre-Processing: MICE & Encoding**

**Hello and welcome**.  

**This is part 1 of a 3-kernel project on Stroke Prediction.**

  
**Part 1 (which is this one) is Preprocessing: Data Cleaning, Encoding and MICE for missing values**  
  
**Part 2 is EDA (including UMAP and PCA) and Random Oversampling**  
Link: **https://www.kaggle.com/mahmoudlimam/stroke-eda-umap-resampling**

  
**Part 3 is Detailed Feature extraction and Selection, and model evaluation**  
Link: **https://www.kaggle.com/mahmoudlimam/stroke-pca-ica-lda-kmeans-dbscan-prediction** 

I didn't include a hyperparameter tuning section, as feature engineering resulted in an F1 score of 1 with a somewhat deep Random Forest.

In the name of God.

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore')

In [None]:
data=pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

# A bit of Exploration

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.drop("id",axis=1,inplace=True)

In [None]:
data.describe()

In [None]:
print("Unique Values per Variable")
for col in data.columns:
    un=data[col].unique()
    print("\n\nUnique Values in {}:\n{}".format(col,un))

# Pre-processing

If very few people have a gender value of "Other" then it might be better to drop them or turn them into NaN and impute them.  
Same for people with an "Unknown" smoking status, as unknown is the very definition of a missing value.

In [None]:
(data["gender"]=="Other").sum()

I'll just drop that one.

In [None]:
data[data["gender"]=="Other"]

In [None]:
data=data.drop(3116,axis=0)

Now the index has a gap at label 3116:

In [None]:
data.iloc[3114:3118,:]

In [None]:
# Renumber the index 0..n-1 so the gap left by the dropped row disappears
data=data.reset_index(drop=True)
data.iloc[3114:3118,:]
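
The drop above relies on knowing the row label (3116) in advance. A minimal sketch of a label-free alternative, on a made-up toy frame (the `toy` and `cleaned` names and all values are illustrative assumptions, not the stroke data): filter with a boolean mask, then reset the index in one go.

```python
import pandas as pd

# Toy frame standing in for the stroke data (values are made up).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "age": [67, 61, 26, 49],
})

# Keep every row whose gender is not "Other", then renumber the index 0..n-1.
cleaned = toy[toy["gender"] != "Other"].reset_index(drop=True)
print(cleaned)
```

This way the code keeps working if the dataset is updated and the "Other" row moves or more such rows appear.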

### Encoding

In [None]:
from category_encoders.target_encoder import TargetEncoder

In [None]:
enc=TargetEncoder()
to_encode="work_type"
enc.fit(X=data[to_encode],y=data["stroke"])
encoded = enc.transform(data[to_encode])

In [None]:
data["work_type"] = encoded["work_type"]
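
To make it clearer what target encoding does, here is a minimal sketch on made-up data: each category is replaced by the mean of the target within that category. This is the plain, unsmoothed version. `TargetEncoder` from `category_encoders` additionally blends each category mean with the global mean (see its `smoothing` and `min_samples_leaf` parameters), so its output will generally differ from these raw means. The `toy` frame and the `work_type_encoded` column are illustrative assumptions.

```python
import pandas as pd

# Made-up sample: three work types and a binary target.
toy = pd.DataFrame({
    "work_type": ["Private", "Private", "Govt_job", "Govt_job", "children", "Private"],
    "stroke":    [1, 0, 0, 0, 0, 1],
})

# Unsmoothed target encoding: each category becomes the mean of the target within it.
category_means = toy.groupby("work_type")["stroke"].mean()
toy["work_type_encoded"] = toy["work_type"].map(category_means)
print(toy)
```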

In [None]:
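# get_dummies returns renamed columns (e.g. gender_Male); they are assigned back by position. dtype=int keeps them 0/1.
data[["ever_married","Residence_type","gender"]]=pd.get_dummies(data[["ever_married","Residence_type","gender"]],drop_first=True,dtype=int)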

In [None]:
data.head()

### Dealing with Missing Values

In [None]:
print("Proportions of 'smoking' categories:")
data["smoking_status"].value_counts(normalize=True)

"Unknown" accounts for about 30% of the rows.  
Quite a lot.  
"Unknown" is the very definition of "missing value"/NaN.  
Thus, I'll turn it into NaNs and impute it.  

Since people who've never smoked are probably less likely (on average) to have a stroke than those who smoked in the past, who in turn are less likely to have a stroke than those who currently smoke, we can say there is some inherent order to these three categories.  
Thus, it would be meaningful to encode them with 0, 1 & 2.  

In [None]:
smoking_mapper={"never smoked":0,"formerly smoked":1,"smokes":2,"Unknown":np.nan}

In [None]:
# Vectorized mapping; avoids chained assignment (data[col][i]=...), which may silently fail to update the frame
data["smoking_status"]=data["smoking_status"].map(smoking_mapper)

In [None]:
data["smoking_status"].unique()

#### Multiple Imputation by Chained Equations (or simply MICE)

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

In [None]:
estimator=RandomForestRegressor(max_depth=8)
mice = IterativeImputer(estimator=estimator,random_state=11,skip_complete=True)

In [None]:
impdata=mice.fit_transform(data)

In [None]:
impdata=pd.DataFrame(impdata,columns=data.columns)

In [None]:
impdata.isnull().sum()

In [None]:
impdata.head()
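
For intuition about what the imputer is doing, here is a minimal sketch on a made-up two-column matrix. Each column with missing values is modeled as a function of the other columns, and the predictions fill the gaps while observed values stay untouched. The sketch uses scikit-learn's default estimator (`BayesianRidge`) rather than the random forest used above, and the toy matrix `X` is an assumption for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy matrix: the second column is roughly twice the first; one entry is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 7.9],
    [5.0, 10.2],
])

imputer = IterativeImputer(random_state=0)  # default estimator is BayesianRidge
X_filled = imputer.fit_transform(X)

# The missing entry is replaced by a regression-based estimate.
print(X_filled[2, 1])
```

Note that `fit_transform` returns a NumPy array, which is why the result is wrapped back into a DataFrame with the original column names above.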

In [None]:
# Snap imputed smoking_status back to the three codes: <0.5 -> 0, [0.5,1.5) -> 1, >=1.5 -> 2
impdata["smoking_status"]=np.digitize(impdata["smoking_status"],bins=[0.5,1.5])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# The 'seaborn-darkgrid' matplotlib style name was dropped in recent matplotlib releases; set the style through seaborn instead
sns.set_style("darkgrid")

In [None]:
fig,axes=plt.subplots(nrows=2,ncols=2,figsize=(16,10))
fig.suptitle("Effect of MICE on Distributions\n",fontsize=25)
sns.histplot(x=data["bmi"],ax=axes[0,0],color="mediumspringgreen")
axes[0,0].set_title("BMI before MICE")
axes[0,0].set_xlabel(None)
sns.histplot(x=impdata["bmi"],ax=axes[0,1],color="mediumspringgreen")
axes[0,1].set_title("BMI after MICE")
axes[0,1].set_xlabel(None)
sns.countplot(x=data["smoking_status"],ax=axes[1,0],palette="cool")
axes[1,0].set_title("Smoking Status before MICE")
axes[1,0].set_xlabel(None)
sns.countplot(x=impdata["smoking_status"],ax=axes[1,1],palette="cool")
axes[1,1].set_title("Smoking Status after MICE")
axes[1,1].set_xlabel(None)
plt.show()

# Baseline Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
rf=RandomForestClassifier(n_jobs=-1,max_depth=7)
x=impdata.drop('stroke',axis=1)
y=impdata["stroke"]

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.2, random_state=2)
rf.fit(xtrain,ytrain)
y_pred_tr=rf.predict(xtrain)
y_pred_ts=rf.predict(xtest)
train_mat=classification_report(ytrain,y_pred_tr)
test_mat=classification_report(ytest,y_pred_ts)
print("Baseline Random Forest Results:")
print("Training Classification_Report:\n{}".format(train_mat))
print("Testing Classification_Report:\n{}".format(test_mat))

##### Notes:
On the training data, the model scored a precision of 1 but a very low recall for the stroke class.  
This is a typical symptom of a seriously imbalanced dataset.  
The results on the testing data are even worse: the model classifies everything as "no stroke".  
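
A minimal sketch of why this behaviour is so tempting for a model on imbalanced data: a "classifier" that always predicts the majority class still gets a high accuracy while catching no positive case at all. The labels below are made up, and the 5% positive rate is an assumption for illustration, not a figure from this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 5 positives (stroke) and 95 negatives.
y_true = np.array([1] * 5 + [0] * 95)

# A "model" that predicts the majority class (no stroke) for everyone.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # recall for the positive (stroke) class
print("accuracy:", accuracy)
print("recall (stroke class):", recall)
```

This is why the notes above focus on recall for the stroke class rather than on overall accuracy.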

**What now?**  
**Resampling.**  
But some EDA first.  
Then resampling. Random oversampling, to be exact.  
Make sure you check it out in part 2 here: https://www.kaggle.com/mahmoudlimam/stroke-eda-random-sampling  

Praise be to God, by whose grace good deeds are accomplished.