# Stroke Prediction Part 3: Detailed Feature Extraction and Selection
## (and Prediction, eventually)

**Hello and welcome**.  

**This is Part 3 of a 3-kernel project on Stroke Prediction.**

  
**Part 1 is Preprocessing: Data Cleaning, Target Encoding and MICE for missing values**  
Link: **https://www.kaggle.com/mahmoudlimam/stroke-pre-processing-mice-target-encoding**

  
**Part 2 is EDA (including UMAP and PCA) and Random Oversampling**  
Link: **https://www.kaggle.com/mahmoudlimam/stroke-eda-umap-resampling**

  
**Part 3 (which is this one) is Detailed Feature extraction and Selection, and model evaluation**  
I didn't include a hyperparameter tuning section, as feature engineering alone resulted in an F1 score of 1 with a somewhat deep Random Forest.

### **Summary**
I'll be using a pre-processed version of the original stroke prediction dataset.  
It's basically the end result of the first notebook (Part 1 linked above).  
You can find it here: https://www.kaggle.com/mahmoudlimam/imputed-stroke-dataset  
I will use random oversampling with `sampling_strategy=0.25` (a minority-to-majority ratio of 0.25), since the original data is extremely imbalanced.  

In the name of God

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
rb=RobustScaler()
from imblearn.over_sampling import RandomOverSampler
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.style as stl
stl.use("ggplot")
import warnings
warnings.filterwarnings("ignore")

In [None]:
data=pd.read_csv("../input/imputed-stroke-dataset/impstroke.csv")
data.drop('Unnamed: 0', axis=1, inplace=True)
x = data.drop("stroke",axis=1)
y = data["stroke"]

# Feature Engineering

We'll use the randomly oversampled dataset to experiment with feature engineering.  
The reason is that the original dataset is so imbalanced that perhaps no feature engineering could improve it.  
I did experiment with it a bit, and it wasn't informative at all: f1_score, recall and precision all remained 0.  
Experimenting with the randomly oversampled data should help us tell which features improve results, and then we can try other oversampling techniques.  
Although one may ask: "What if the feature engineering results are biased towards random oversampling?"  
Well, we're trying to construct features that separate the classes better, whatever the oversampling method.  
Let's take a look at what we have.
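
As an aside on the scores of 0 mentioned above: one way this can happen (an assumption on my part, not something checked against the raw data here) is a model that never predicts the minority class. A minimal sketch with toy labels:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 5 positives out of 100 (made-up numbers, not the stroke data).
y_true = np.array([1] * 5 + [0] * 95)
# A model that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(accuracy, precision, recall, f1)
```

Accuracy looks fine while precision, recall and F1 are all 0, which is why F1 is the score used later on.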

## 1 - Combining Features

In [None]:
xfe, yfe = RandomOverSampler(sampling_strategy=0.25, random_state=11).fit_resample(x, y)
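
For reference, a float `sampling_strategy` in imbalanced-learn is the ratio of minority to majority samples after resampling, so 0.25 leaves the minority class at 20% of all rows rather than 25%. A small sketch of that arithmetic, with made-up class counts (not the real dataset's):

```python
# Hypothetical class counts, for illustration only.
n_majority = 4000
n_minority = 200
ratio = 0.25  # same value as sampling_strategy above

# Random oversampling leaves the majority class untouched and duplicates
# minority rows until n_minority_after / n_majority == ratio.
n_minority_after = int(ratio * n_majority)
n_added = n_minority_after - n_minority
share = n_minority_after / (n_minority_after + n_majority)

print(n_minority_after, n_added, round(share, 2))  # 1000 800 0.2
```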

In [None]:
xref = xfe.copy(deep=True)
xtest = x.copy(deep=True)

In [None]:
xfe["Blood&Heart"]=xfe["hypertension"]*xfe["heart_disease"]
xtest["Blood&Heart"]=xtest["hypertension"]*xtest["heart_disease"]

In [None]:
xfe["Effort&Duration"] = xfe["work_type"]*(xfe["age"])
xtest["Effort&Duration"] = xtest["work_type"]*(xtest["age"])

In [None]:
xfe["Obesity"] = xfe["bmi"]*xfe["avg_glucose_level"]/1000
xtest["Obesity"] = xtest["bmi"]*xtest["avg_glucose_level"]/1000

In [None]:
xfe["AwfulCondition"] = xfe["Obesity"] * xfe["Blood&Heart"] * xfe["smoking_status"]
xtest["AwfulCondition"] = xtest["Obesity"] * xtest["Blood&Heart"] * xtest["smoking_status"]

In [None]:
xfe["AwfulCondition"].unique()

In [None]:
#effect of residence type on Effort&Duration

In [None]:
xfe.head()
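
The cells above apply the same four formulas twice, once to `xfe` and once to `xtest`. One way to keep the two in sync is a small helper; `add_combined_features` is a hypothetical name, and the toy values below are made up just to show the call:

```python
import pandas as pd

def add_combined_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four combined columns from this section."""
    out = df.copy()
    out["Blood&Heart"] = out["hypertension"] * out["heart_disease"]
    out["Effort&Duration"] = out["work_type"] * out["age"]
    out["Obesity"] = out["bmi"] * out["avg_glucose_level"] / 1000
    out["AwfulCondition"] = out["Obesity"] * out["Blood&Heart"] * out["smoking_status"]
    return out

# Toy rows with made-up values (not taken from the dataset).
toy = pd.DataFrame({
    "hypertension": [0.0, 1.0],
    "heart_disease": [1.0, 1.0],
    "work_type": [0.5, 0.2],
    "age": [40.0, 70.0],
    "bmi": [25.0, 30.0],
    "avg_glucose_level": [100.0, 200.0],
    "smoking_status": [0.1, 0.3],
})
toy_fe = add_combined_features(toy)
print(toy_fe[["Blood&Heart", "Effort&Duration", "Obesity", "AwfulCondition"]])
```

With it, the two frames would be built as `xfe = add_combined_features(xfe)` and `xtest = add_combined_features(xtest)`.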

## 2 - Principal Component Analysis

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA()

In [None]:
pca_feats = pca.fit_transform(rb.fit_transform(xref))

In [None]:
pca.explained_variance_ratio_

In [None]:
list(range(1,11))

In [None]:
fig=plt.figure(figsize=(20,9))
sns.barplot(x=list(range(1,11)),y=pca.explained_variance_,palette = 'Reds_r')
plt.ylabel('Variation',fontsize=15)
plt.xlabel('PCA Components',fontsize=15)
plt.title("PCA Components\nRanked by Variation",fontsize=25)
plt.show()

In [None]:
xfe["PC1"], xfe["PC2"] = pca_feats[:,0], pca_feats[:,1]

In [None]:
xtestpca = pca.transform(rb.transform(x))

In [None]:
xtest["PC1"], xtest["PC2"] = xtestpca[:,0], xtestpca[:,1]

In [None]:
xfe.head()
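
I kept the first two components here. If you'd rather pick the number from the cumulative explained variance, a sketch could look like this; the synthetic data and the 90% threshold are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the scaled feature matrix.
X_demo, _ = make_classification(n_samples=300, n_features=10,
                                n_informative=4, random_state=11)

pca_demo = PCA().fit(RobustScaler().fit_transform(X_demo))
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%.
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(np.round(cumulative, 3), n_keep)
```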

## 3 - Independent Component Analysis

In [None]:
from sklearn.decomposition import FastICA as ICA

In [None]:
ica = ICA(random_state=11)

In [None]:
xica = ica.fit_transform(X=rb.fit_transform(xref))

In [None]:
ncomp = ica.components_.shape[0]

In [None]:
fig,axes=plt.subplots(ncols=1,nrows=ncomp,figsize=(20,10*ncomp))
fig.suptitle("Target Distributions\nAcross ICA Components",fontsize=40)
for i in range(ncomp):
    sns.boxenplot(y=xica[:,i], x=yfe, palette="seismic",showfliers=True,ax=axes[i])
    axes[i].set_xlabel("Stroke",fontsize=15)
    axes[i].set_ylabel(f"IC{i+1}",fontsize=25)
plt.show()

In [None]:
xfe["ICA"] = xica[:,3]

In [None]:
xtest["ICA"] = ica.transform(rb.transform(x))[:,3]
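
The component at index 3 was picked by eye from the plots above. As a numeric cross-check, one could rank the components by their mutual information with the target. This is a different technique from the visual inspection I used, sketched here on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import FastICA
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the oversampled features and target.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=11)

ica_demo = FastICA(random_state=11, max_iter=1000)
components = ica_demo.fit_transform(RobustScaler().fit_transform(X_demo))

# Estimated mutual information between each component and the target.
mi = mutual_info_classif(components, y_demo, random_state=11)
best = int(np.argmax(mi))
print(np.round(mi, 3), best)
```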

## 4 - Linear Discriminant Analysis

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()

In [None]:
xlda = lda.fit_transform(rb.fit_transform(xref),yfe)
xlda = xlda.reshape((xlda.shape[0],))

In [None]:
plt.figure(figsize=(20,8))
sns.boxenplot(y=xlda, x=yfe, color='crimson',showfliers=True)
plt.title("Separation of Classes with LDA",fontsize=30)
plt.xlabel("Stroke",fontsize=20)
plt.show()

In [None]:
xfe["LDA"] = xlda

In [None]:
xtest["LDA"] = lda.transform(rb.transform(x)).reshape((x.shape[0],))
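
The `reshape` calls above work because LDA yields at most `n_classes - 1` discriminant axes, so a binary target gives a single column. A quick sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic binary problem, only to show the output shape.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=11)

lda_demo = LinearDiscriminantAnalysis()
projected = lda_demo.fit_transform(X_demo, y_demo)
print(projected.shape)  # (300, 1)

# Flatten to a 1-D array so it can be stored as a single column.
flat = projected.ravel()
```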

### The 2 following sections are an attempt at using cluster labels as features.  
KMeans did a decent job.  
DBSCAN didn't.

## 5 - K-Means Clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
inertias = []

ks=list(range(1,10))

#xkm = rb.fit_transform(xref)

xfesc = rb.fit_transform(xfe)
xtsc = rb.transform(xtest)

for k in ks:
    
    model=KMeans(n_clusters=k)
    
    model.fit(xfesc)
    
    inertias.append(model.inertia_)

In [None]:
plt.figure(figsize=(20,7))
sns.barplot(x = ks, y = inertias, palette='mako')
plt.xlabel('Number of Clusters',fontsize=20)
plt.ylabel('Inertia',fontsize=20)
plt.xticks(ks)
plt.title("Inertia per Number of KMeans Clusters",fontsize=30)
plt.show()

In [None]:
figure, axes = plt.subplots(nrows=2, ncols=2,figsize=(20, 30))
figure.suptitle('\n\nTarget Proportions per Cluster', fontsize=40)

for index in range(4):
    
    i,j = (index // 2), (index % 2)
    
    model=KMeans(n_clusters=index+2)
    
    model.fit(xfesc)
    
    cluster_labels=model.predict(xfesc)
    
    sns.heatmap(pd.crosstab(cluster_labels,yfe,normalize="index"),
                ax=axes[i,j],
                cmap='Blues',
                square=True,
                cbar=False,
                annot=True,
                annot_kws={'fontsize':30})
    
    axes[i,j].set_title(f"{index+2} Clusters",fontsize=30)
    
    axes[i,j].set_xlabel("Stroke",fontsize=20)

    axes[i,j].set_ylabel("Cluster Labels",fontsize=20)
    
    axes[i,j].set_yticklabels(axes[i,j].get_yticklabels(),fontsize=20)

plt.show()

In [None]:
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.metrics import normalized_mutual_info_score

In [None]:
ami = []
nmi = []
for k in ks:
    model = KMeans(n_clusters = k)
    cluster_labels=model.fit_predict(xfesc)
    ami.append(adjusted_mutual_info_score(yfe,cluster_labels))
    nmi.append(normalized_mutual_info_score(yfe,cluster_labels))

In [None]:
plt.figure(figsize=(20,7))
sns.barplot(x = ks, y = ami, palette='summer_r')
plt.xlabel('Number of Clusters',fontsize=20)
plt.ylabel('AMI',fontsize=20)
plt.xticks(ks)
plt.title("Adjusted Mutual Information per Number of Clusters",fontsize=30)
plt.show()

In [None]:
plt.figure(figsize=(20,7))
sns.barplot(x = ks, y = nmi, palette='summer_r')
plt.xlabel('Number of Clusters',fontsize=20)
plt.ylabel('NMI',fontsize=20)
plt.xticks(ks)
plt.title("Normalized Mutual Information per Number of KMeans Clusters",fontsize=30)
plt.show()
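
Why look at both scores? NMI is not corrected for chance, while AMI is, so AMI is the safer one to read when the number of clusters grows. A small sketch with cluster labels that are pure noise (toy data, unrelated to the stroke set):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(11)
y_demo = rng.integers(0, 2, size=200)            # toy binary target
random_clusters = rng.integers(0, 20, size=200)  # 20 clusters unrelated to the target

nmi_random = normalized_mutual_info_score(y_demo, random_clusters)
ami_random = adjusted_mutual_info_score(y_demo, random_clusters)
print(round(nmi_random, 3), round(ami_random, 3))
```

NMI comes out above zero even though the clusters carry no information, while AMI stays close to zero.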

In [None]:
model=KMeans(n_clusters=4)
    
model.fit(xfesc)
    
cluster_labels=model.predict(xfesc)

#### Target Encoding on Cluster Labels

In [None]:
cluster_labels = pd.Series(cluster_labels).astype("object")

In [None]:
from category_encoders.target_encoder import TargetEncoder

In [None]:
kmeans_enc=TargetEncoder()
enc_clus = kmeans_enc.fit_transform(cluster_labels, y=yfe)

In [None]:
xfe["KMeans"] = enc_clus

In [None]:
cluster_labels_test = pd.Series(model.predict(xtsc)).astype("object")
# Encode with the means learned on the training labels; no target is passed at transform time
xtest["KMeans"] = kmeans_enc.transform(cluster_labels_test)
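
To make the idea concrete: target encoding replaces each cluster label with the mean of the target inside that cluster. Below is a plain pandas sketch of the unsmoothed version; `TargetEncoder` from category_encoders also smooths towards the global mean, so its values will differ somewhat. The labels are made up:

```python
import pandas as pd

# Toy cluster labels and target values.
clusters = pd.Series([0, 0, 0, 1, 1, 2, 2, 2, 2], name="cluster")
target = pd.Series([1, 0, 0, 1, 1, 0, 0, 0, 1], name="stroke")

# Mean target per cluster, learned on the "training" labels.
cluster_means = target.groupby(clusters).mean()
encoded_train = clusters.map(cluster_means)

# New points reuse the training means; no target is needed at this stage.
new_clusters = pd.Series([2, 0, 1])
encoded_new = new_clusters.map(cluster_means)
print(encoded_new.tolist())
```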

## 6 - DBSCAN Clustering

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
eps = [1.01,1.02,1.05,1.1,1.15,1.2,1.25,1.3,1.35]
min_samples = [7,8,9,10,11,12,13,14,15]

In [None]:
inds = [f"eps={e}" for e in eps]
cols = [f"min_samples={m}" for m in min_samples]
dbdata = pd.DataFrame(np.zeros((9,9)),columns=cols,index=inds)

for i in range(len(eps)):
    for j in range(len(min_samples)):
        dbscan = DBSCAN(eps=eps[i], min_samples=min_samples[j])
        dbscan.fit(xfesc)
        # Label -1 marks noise points, so leave it out of the cluster count
        dbdata.iloc[i,j] = np.unique(dbscan.labels_[dbscan.labels_ != -1]).size

In [None]:
plt.figure(figsize=(20,9))
sns.heatmap(dbdata, cmap='Blues', annot=True, annot_kws={'fontsize':18},cbar=False)
plt.title("Number Of DBSCAN Clusters for Different Values\nof Epsilon and Minimum_Samples", fontsize=35)
plt.xlabel("Minimum Number of Points per Cluster",fontsize=20)
plt.ylabel("Epsilon",fontsize=20)
plt.show()

In [None]:
inds = [f"  eps={e}" for e in eps]
cols = [f"min_samples={m}" for m in min_samples]
ami = pd.DataFrame(np.zeros((9,9)),columns=cols,index=inds)
nmi = pd.DataFrame(np.zeros((9,9)),columns=cols,index=inds)

for i in range(len(eps)):
    for j in range(len(min_samples)):
        dbscan = DBSCAN(eps=eps[i], min_samples=min_samples[j])
        labels=dbscan.fit_predict(xfesc)
        nmi.iloc[i,j] = normalized_mutual_info_score(labels,yfe)
        ami.iloc[i,j] = adjusted_mutual_info_score(labels,yfe)

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(ami, cmap='mako_r', annot=True, annot_kws={'fontsize':20},cbar=False)
plt.title("DBSCAN Clusters:\nAdjusted Mutual Information Scores", fontsize=35)
plt.xlabel("Minimum Number of Points per Cluster",fontsize=20)
plt.ylabel("Epsilon",fontsize=20)
plt.show()

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(nmi, cmap='mako_r', annot=True, annot_kws={'fontsize':20},cbar=False)
plt.title("DBSCAN Clusters:\nNormalized Mutual Information Scores", fontsize=35)
plt.xlabel("Minimum Number of Points per Cluster",fontsize=20)
plt.ylabel("Epsilon",fontsize=20)
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=len(eps), ncols=len(min_samples),figsize=(20,15))
fig.suptitle("DBSCAN Cluster Sizes", fontsize=40)

for i in range(len(eps)):
    for j in range(len(min_samples)):
        dbscan = DBSCAN(eps=eps[i], min_samples=min_samples[j])
        labels=dbscan.fit_predict(xfesc)
        sns.countplot(x=labels, palette='Set2',ax=axes[i,j])
        axes[i,j].set_xticklabels([])
        axes[i,j].set_ylabel(None)
        axes[i,j].set_yticklabels([])
plt.show()

### Results:
DBSCAN couldn't find any densely packed clusters. It tends to lump most of the data points into one (or a few) big clusters, plus several tiny ones.  
I can't use the cluster labels as a feature for the following reasons:
1. The tiny clusters would cause the predictive model to overfit.  
2. Most points would be in the bigger lump (I don't think the word "cluster" fits it), which isn't informative as it contains pretty much all points.
3. Sklearn doesn't have a predict method for the DBSCAN class XD. They could make one that assigns points to clusters by looking at nearest neighbours, but oh well. One could try coding this from scratch (which might be a little tedious), but for now there's no need to, considering the 2 points above.
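
For the curious, here is roughly what such a nearest-neighbour predict could look like. It's a minimal sketch on synthetic blobs, and `dbscan_predict` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def dbscan_predict(fitted, X_new):
    """Give each new point the label of its nearest core sample,
    or -1 (noise) if that core sample is farther away than eps."""
    core_points = fitted.components_
    core_labels = fitted.labels_[fitted.core_sample_indices_]
    nn = NearestNeighbors(n_neighbors=1).fit(core_points)
    dist, idx = nn.kneighbors(X_new)
    labels = core_labels[idx[:, 0]]
    labels[dist[:, 0] > fitted.eps] = -1
    return labels

# Synthetic blobs standing in for the scaled features.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=11)
db = DBSCAN(eps=0.5, min_samples=5).fit(X_demo)

# One point next to a training row, one far away from everything.
new_points = np.array([X_demo[0] + 0.01, [100.0, 100.0]])
print(dbscan_predict(db, new_points))
```

The far-away point comes back as noise (-1). On this dataset it wouldn't help much anyway, for the first two reasons above.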

## 7 - Forward Feature Selection

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [None]:
models = [SVC(kernel='linear'),
          SVC(kernel='rbf'),
          SVC(kernel='poly',degree=2),
          RandomForestClassifier(n_jobs=-1,max_depth=10),
          RandomForestClassifier(n_jobs=-1,max_depth=30),
          KNeighborsClassifier(n_neighbors=4),
          KNeighborsClassifier(n_neighbors=8),
          LogisticRegression(),
          GaussianNB()]

In [None]:
names = ["SVM_Linear","SVM_RBF","SVM_Poly2","ShallowForest","DeepForest","4NN","8NN","LogReg","GaussianNB"]

In [None]:
xfesc = rb.fit_transform(xfe)
xtsc = rb.transform(xtest)

In [None]:
ffs_scores = pd.DataFrame(np.zeros((len(names),len(names))),columns=names,index=names)
ffsdata= dict()

In [None]:
from sklearn.metrics import f1_score

In [None]:
for i in range(len(models)):
    sel_name = names[i]
    ffs=SequentialFeatureSelector(direction='forward', n_jobs = -1, estimator=models[i])
    xffs = ffs.fit_transform(xfesc,yfe)
    xtfs = ffs.transform(xtsc)
    ffsdata[sel_name] = [xffs,xtfs]
    print(f"Finished Selection with {sel_name}\n")
    print(f"{ffs.n_features_to_select_} Features are Selected:\n{list(xfe.columns[ffs.support_])}\n")
    for j in range(len(models)):
        pred_name = names[j]
        model = models[j]
        model.fit(xffs,yfe)
        ypred = model.predict(xtfs)
        score = f1_score(y, ypred)  # scikit-learn expects (y_true, y_pred)
        ffs_scores.loc[sel_name,pred_name] = score
        print(f"F1_score with {pred_name}: {score}")
    print("\n\n\n")
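
The loop above leaves `n_features_to_select`, `scoring` and `cv` at their defaults. With `scoring=None`, the selector ranks features by the estimator's own `score` method (accuracy for these classifiers), while the comparison afterwards uses F1. A sketch of making those choices explicit, on synthetic data; the values 3 and 5 are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled, engineered features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=11)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,  # explicit instead of the default
    direction="forward",
    scoring="f1",            # select on the same metric used for evaluation
    cv=5,
)
selector.fit(X_demo, y_demo)
selected = selector.get_support(indices=True)
print(selected)
```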

In [None]:
plt.figure(figsize=(20,8))
sns.heatmap(ffs_scores,cmap="BuGn",annot=True, annot_kws={'fontsize':20},cbar=False)
plt.title("F1 Scores with Forward Feature Selection\n",fontsize=35)
plt.xlabel("Predictive Model",fontsize=20)
plt.ylabel("Model Used for Selection",fontsize=25)
plt.show()

Well, there you go.  
Honestly, I didn't expect results to be this good.  
My plan was to do feature engineering to separate the classes a bit, so that I could then use SMOTE-based oversampling methods such as Borderline-SMOTE, DBSMOTE, etc.  
I also planned to use a genetic algorithm for hyperparameter tuning, and wasn't sure that I would eventually get very good results.  
Praise be to God.
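
One thing to keep in mind about these scores: `xtest` is the original data that `xfe` was oversampled from, so every test row was also seen during training, and a deep forest can largely memorize it. Below is a minimal sketch of splitting before oversampling, so that scores on seen and unseen rows can be compared. The synthetic data and the NumPy stand-in for `RandomOverSampler` (`oversample_minority`, a hypothetical helper) are assumptions for illustration, not what this notebook ran:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def oversample_minority(X, y, ratio, seed):
    """Duplicate random minority rows until n_minority / n_majority == ratio.
    A plain NumPy stand-in for RandomOverSampler, for this sketch only."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    n_target = int(ratio * np.sum(y == 0))
    extra = rng.choice(minority, size=max(n_target - minority.size, 0), replace=True)
    keep = np.concatenate([np.arange(y.size), extra])
    return X[keep], y[keep]

# Synthetic imbalanced data (about 5% positives before label noise).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                                     flip_y=0.02, random_state=11)

# Split first, so no held-out row (or copy of one) is seen during training.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=11)
X_tr_os, y_tr_os = oversample_minority(X_tr, y_tr, ratio=0.25, seed=11)

forest = RandomForestClassifier(max_depth=30, random_state=11).fit(X_tr_os, y_tr_os)
f1_seen = f1_score(y_tr, forest.predict(X_tr))      # rows the forest was trained on
f1_held_out = f1_score(y_te, forest.predict(X_te))  # rows it has never seen
print(round(f1_seen, 3), round(f1_held_out, 3))
```

The gap between the two numbers gives an idea of how much of a score measured on seen rows comes from memorization.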

Anyways, thank you for reading,  
I hope you've enjoyed and benefitted.

Praise be to God, by whose grace good deeds are completed.