Authored by: *Swaviman Kumar*<br>
*Indraprastha Institute of Information Technology, Delhi* <br>

1. You need to download ‘Stroke Prediction Dataset’ data using the library Scikit learn; ref is given
below. [5]<br>  <br>
2. Divide the data randomly in training and testing with a 7:3 ratio 100 times, perform the following
tasks with training data and test the performance on testing data. Testing data should remain
unseen for all steps.
a. Apply one of the best-known imputation methods to handle the missing/infinite values
and state the significance of the used method if required. [5]<br>
b. Visualize the data in 3-D scatter plot and write the inferences, How the data look like. [5]<br>
c. Make a boxplot for each feature and highlight the outlier, if any, then remove the outlier,
again visualize the data in 3-D scatter plot to show the outlier effect and write the
inferences. [5]<br>
d. Normalized the data if required, and write a note for what, why and how you performed
normalization.[5]<br>
e. Balance the data if required; you may increase the sample using upsampling if needed.[5]<br>
f. Perform at least three clustering methods with varying cluster sizes. Perform any three
best-known methods to find out correct cluster numbers for each method; how you
finalized this cluster number.[10]<br>
g. Perform at least three supervised methods for classification, and report at least three
performance metrics out of (accuracy, precision, Cohen's kappa, F1-score, MCC,
sensitivity and specificity) with proper reason. [10]<br>

Ref:
1. https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

In [None]:
pip install pyforest

In [None]:
# importing necessary libraries

import pyforest  # lazily provides common aliases; pd, np and metrics are also imported explicitly below
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")

### 1. You need to download ‘Stroke Prediction Dataset’ data using the library Scikit learn; ref is given below. [5]

The CSV file was downloaded from the given link (https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) <br>
and is hosted on Google Drive so that the code is reproducible without a local copy of the file.

In [None]:
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1AaUsEwFoAbHy1m1AFECwYsZmcdcIDPg3")
df.head(2)

Basic analysis of the dataframe:

In [None]:
df.info()

1. Most of the columns are of int or float type. A few are of object type and will have to be ordinal-encoded so that all features can be used by the algorithms.<br>
2. The bmi column has missing values.

In [None]:
df.isnull().sum()

Only the bmi column has missing values (201 of them).

In [None]:
df["bmi"].describe()

The bmi column has 201 missing values. The describe output above shows a maximum of 97.6, which is an implausibly high BMI and is most likely an outlier. Because the mean is sensitive to such outliers, we will replace the missing values with the median bmi rather than the mean.

### 2 Divide the data randomly in training and testing with a 7:3 ratio 100 times, perform the following tasks with training data and test the performance on testing data. Testing data should remain unseen for all steps.  <br>


#### a. Apply one of the best-known imputation methods to handle the missing/infinite values and state the significance of the used method if required. [5]

Please note that we perform EDA and outlier removal on the entire dataset before the train-test split, so the split comes after those steps. Skipping them would leave anomalous and unexpected values in the train or test data.

#### Treating missing values from the data

In [None]:
missing_pct = df['bmi'].isna().mean() * 100  # share of missing bmi values, in percent
print("missing value count is "+str(round(missing_pct, 2))+" % which is not significant\nwith the median at " +str(df['bmi'].median()))

#### Significance: 
Missing bmi values are replaced with the median. As observed above, the bmi column contains outliers, so the mean (which is pulled towards outliers) should be avoided.
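To illustrate why the median is the safer fill value here, the following minimal sketch compares the mean and the median on a small list before and after adding one extreme entry. The first five values are hypothetical; 97.6 is the maximum bmi reported above.

```python
from statistics import mean, median

# Hypothetical bmi values in a typical range (not taken from the dataset)
bmi_values = [22.0, 24.5, 27.1, 28.4, 31.0]
# Same values plus the extreme maximum seen in df["bmi"].describe()
with_outlier = bmi_values + [97.6]

mean_shift = mean(with_outlier) - mean(bmi_values)
median_shift = median(with_outlier) - median(bmi_values)

print("mean before/after:  ", mean(bmi_values), mean(with_outlier))
print("median before/after:", median(bmi_values), median(with_outlier))
print("shift in mean:", mean_shift, "| shift in median:", median_shift)
```

A single extreme value moves the mean by more than ten units in this toy list, while the median moves by less than one.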

In [None]:
df['bmi'] = df['bmi'].fillna(df['bmi'].median())
df["bmi"].describe()

In [None]:
df.isna().sum()

Missing values have been dealt with.
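The assignment asks for the test data to remain unseen at every step. One way to honour that for imputation, sketched here under assumptions and not what the cells above do, is to split first and learn the median from the training part only. The toy column below is hypothetical and merely stands in for bmi.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy column standing in for "bmi"; np.nan marks missing entries
bmi = np.array([[22.0], [np.nan], [27.1], [28.4], [31.0],
                [np.nan], [24.5], [97.6], [26.0], [29.3]])

# 7:3 split performed BEFORE any statistic is computed
bmi_train, bmi_test = train_test_split(bmi, test_size=0.3, random_state=42)

imputer = SimpleImputer(strategy="median")
bmi_train_filled = imputer.fit_transform(bmi_train)  # median learned from train only
bmi_test_filled = imputer.transform(bmi_test)        # the same median is reused on test

print("training median used for imputation:", imputer.statistics_[0])
```

With this ordering the test rows never influence the fill value.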

Plotting the distribution of bmi column to check skewness.

In [None]:
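# FacetGrid takes `height` (formerly `size`); histplot with kde=True replaces the deprecated distplot
sns.FacetGrid(df, height=5) \
   .map(sns.histplot, "bmi", kde=True) \
   .add_legend();
sns.FacetGrid(df, hue="stroke", height=5) \
   .map(sns.histplot, "bmi", kde=True) \
   .add_legend();
plt.show();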

The bmi column appears very slightly right-skewed.

In [None]:
# Encoding categorical values using Ordinal Encoder from sklearn

ord_enc = OrdinalEncoder()
df["Gender_code"] = ord_enc.fit_transform(df[["gender"]])
df["Ever_married_code"] = ord_enc.fit_transform(df[["ever_married"]])
df["Work_Type_code"] = ord_enc.fit_transform(df[["work_type"]])
df["Residence_type_code"] = ord_enc.fit_transform(df[["Residence_type"]])
df["Smoking_Status_code"] = ord_enc.fit_transform(df[["smoking_status"]])


df.head(11)

The dataframe now has five additional columns that hold the encoded categorical features. We haven't used pd.get_dummies because dummy variables would produce a higher-dimensional, sparse dataframe. Decision trees tend to overfit on data with a large number of features and would suffer from the curse of dimensionality if we introduced dummy variables. Hence we used OrdinalEncoder from sklearn instead.
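As a small illustration of the trade-off described above, the sketch below encodes one toy categorical column both ways. The four category strings are illustrative values, not read from the dataset. OrdinalEncoder yields a single numeric column with codes assigned in sorted category order, while pd.get_dummies yields one column per category.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with illustrative category names
toy = pd.DataFrame({"smoking_status": ["never smoked", "smokes",
                                       "formerly smoked", "Unknown"]})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[["smoking_status"]])   # shape (4, 1)
dummies = pd.get_dummies(toy["smoking_status"])          # shape (4, 4)

print("categories in code order:", list(encoder.categories_[0]))
print("ordinal codes:", codes.ravel())
print("ordinal shape:", codes.shape, "| dummy shape:", dummies.shape)
```

Note that the integer codes imply an order between categories that has no real meaning, which can matter for distance-based or linear models.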

In [None]:
df.columns

In [None]:
df_new = df[['id',
       'Gender_code', 'age', 'hypertension', 'heart_disease',
       'Ever_married_code', 'Work_Type_code', 'Residence_type_code', 'avg_glucose_level',
       'bmi', 'Smoking_Status_code', 'stroke']]

df_new.info()

In [None]:
# perfectly numeric dataframe
df_new.head()

### 2. b. Visualize the data in 3-D scatter plot and write the inferences, How the data look like. [5]

3d Scatter plot:

#### Age vs BMI vs Avg Glucose for stroke

In [None]:
sns.set(style = "darkgrid")

fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection = '3d')

x = df_new['age']
y = df_new['bmi']
z = df_new['avg_glucose_level']

ax.set_xlabel("Age")
ax.set_ylabel("BMI")
ax.set_zlabel("Avg Glucose Level")

for s in df_new.stroke.unique():
  ax.scatter(x[df_new.stroke==s],y[df_new.stroke==s],z[df_new.stroke==s],label=s)

#ax.scatter(x, y, z)
ax.legend()
plt.xticks(np.arange(0, 100, 10))
plt.yticks(np.arange(0, 200, 10))
plt.show()

#### Inference:
The 3D scatter plot above suggests that the average blood glucose level tends to rise with BMI, and that individuals with higher values of both appear more vulnerable to strokes.

### 2. c. Make a boxplot for each feature and highlight the outlier, if any, then remove the outlier, again visualize the data in 3-D scatter plot to show the outlier effect and write the inferences. [5]

Now let's find the outliers. Looking at the dataframe, only two columns, "bmi" and "avg_glucose_level", are vulnerable to outliers. The remaining columns are either encoded categorical values, which do not contain outliers, or numeric columns with discrete entries, also without outliers.

Boxplots for all the columns before outlier removal:

In [None]:
new_df = df_new.copy() # Making a copy of the original dataframe for ease

df_0 = new_df[new_df['stroke'] == 0]
df_1 = new_df[new_df['stroke'] == 1]
fig = plt.figure(figsize=(20,20))

# one subplot per column: left box is stroke = 0, right box is stroke = 1
for i,b in enumerate(list(new_df.columns), start=1):
    ax = fig.add_subplot(4,3,i)
    ax.boxplot([df_0[b], df_1[b]])
    ax.set_xticklabels(["stroke = 0", "stroke = 1"])
    ax.set_title(b)

sns.set_style("whitegrid")
plt.tight_layout()
plt.show()

From the box plots we are sure that only two of the mentioned columns contain outliers

In [None]:
fig = plt.figure(figsize=(5,5))

plt.boxplot([df_new["avg_glucose_level"], df_new["bmi"]])
# label each box so that the two features can be told apart
plt.xticks([1, 2], ["avg_glucose_level", "bmi"])

sns.set_style("whitegrid")
plt.tight_layout()
plt.show()

In [None]:
df_new["bmi"].value_counts().sort_index(ascending=False)

Treating Outliers with IQR method:

In [None]:
def remove_outlier(df, col):
  q1 = df[col].quantile(0.25)
  q3 = df[col].quantile(0.75)

  iqr = q3 - q1
  lower_bound  = q1 - (1.5  * iqr)
  upper_bound = q3 + (1.5 * iqr)

  out_df = df.loc[(df[col] > lower_bound) & (df[col] < upper_bound)]
  return out_df
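As a worked example of the IQR rule used in remove_outlier, the sketch below applies the same bounds to a short list of values. The numbers are hypothetical apart from 97.6, the maximum bmi reported earlier. np.percentile is used here with its default linear interpolation, which matches the default of pandas' quantile.

```python
import numpy as np

# Hypothetical bmi-like values plus the extreme maximum from the dataset
values = np.array([18.0, 22.0, 24.0, 26.0, 28.0, 30.0, 33.0, 97.6])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values strictly inside the two bounds, as remove_outlier does
kept = values[(values > lower_bound) & (values < upper_bound)]

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("bounds:", lower_bound, upper_bound)
print("kept values:", kept)
```

Here only the extreme entry falls outside the bounds and is dropped.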

In [None]:
df1234 = remove_outlier(df_new, "bmi")
df1234.info()

We removed the rows where the respective bmi feature contained outliers. Now we end up with 4984 rows in total.

In [None]:
df12345 = remove_outlier(df1234, "avg_glucose_level")
df12345.info()

We removed the rows where the respective avg_glucose_level feature contained outliers. Now we end up with 4390 rows in total.

In [None]:
df12345["bmi"].value_counts().sort_index(ascending=False)

The bmi values now range from 11.3 to 46.2, compared with 10.3 to 97.6 before outlier removal.

Make a box plot post outlier removal:

In [None]:
df_copy = df12345.copy() # making a copy of the dataframe for easier handling

In [None]:


df_0 = df_copy[df_copy['stroke'] == 0]
df_1 = df_copy[df_copy['stroke'] == 1]
fig = plt.figure(figsize=(20,20))

# one subplot per column: left box is stroke = 0, right box is stroke = 1
for i,b in enumerate(list(df_copy.columns), start=1):
    ax = fig.add_subplot(4,3,i)
    ax.boxplot([df_0[b], df_1[b]])
    ax.set_xticklabels(["stroke = 0", "stroke = 1"])
    ax.set_title(b)

sns.set_style("whitegrid")
plt.tight_layout()
plt.show()

Clearly the bmi & avg glucose level columns (and the rest of the columns as well) are now free from outliers.

### Make 3D scatter plot after outlier removal:

In [None]:
sns.set(style = "darkgrid")

fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection = '3d')

x = df12345['age']
y = df12345['bmi']
z = df12345['avg_glucose_level']

ax.set_xlabel("Age")
ax.set_ylabel("BMI")
ax.set_zlabel("Avg Glucose Level")

for s in df12345.stroke.unique():
  ax.scatter(x[df12345.stroke==s],y[df12345.stroke==s],z[df12345.stroke==s],label=s)

#ax.scatter(x, y, z)
ax.legend()
plt.xticks(np.arange(0, 100, 10))
plt.yticks(np.arange(0, 200, 10))
plt.show()

#### Inference:
We removed the outliers using the interquartile range (IQR) method, as asked. 
We also plotted 3D scatter plots before and after outlier removal, as shown above.
The two plots differ noticeably. However, the two classes still cannot be separated clearly on these three features alone. We will apply clustering and classification algorithms to understand the pattern.

### 2. d. Normalized the data if required, and write a note for what, why and how you performed normalization.[5]

#### Reason for normalization & Significance:
The columns are not on a common, comparable scale, which makes the analysis difficult. Moreover, the machine learning algorithms we are about to apply, in particular the distance-based ones such as K-Means and K-nearest neighbours, work better with normalized data. Hence we apply MinMaxScaler, which rescales each column to the range [0, 1], to normalize the dataframe.
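Min-max scaling maps each value x of a column to (x - min) / (max - min). The short sketch below applies that formula by hand to a hypothetical age column, so the "how" of the normalization is visible without the scaler object.

```python
import numpy as np

# Hypothetical ages, chosen only to make the arithmetic easy to follow
age = np.array([20.0, 40.0, 60.0, 80.0])

# Min-max formula: the smallest value maps to 0 and the largest to 1
scaled = (age - age.min()) / (age.max() - age.min())

print("original:", age)
print("scaled:  ", scaled)
```

The relative spacing between values is preserved; only the range changes.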

In [None]:
df12345.head()

In [None]:
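# Using MinMaxScaler to perform normalization
scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df_copy) 
# rebuild the dataframe so that integer columns can hold the scaled float values
df_copy = pd.DataFrame(scaled_values, columns=df_copy.columns, index=df_copy.index)
# note: the cells below continue from df12345, so this scaled copy only feeds the boxplot

sns.set(rc={'figure.figsize':(16,8)}, font_scale=0.9, style='whitegrid')
df_copy.boxplot(widths = 0.9)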

### 2. e. Balance the data if required; you may increase the sample using upsampling if needed.[5]

Check whether the dataset is balanced.

In [None]:
df12345["stroke"].value_counts()

It looks like we have a very imbalanced dataset: only about 4% of the individuals had a stroke and the remaining 96% did not. <br><br>
We need to upsample the minority class to balance the data. Otherwise, even a very bad model that predicts "no stroke" 100% of the time, irrespective of the input data, would still reach about 96% accuracy, which is high but clearly misleading.
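The accuracy argument above can be checked with a few lines of arithmetic. The sketch assumes 4225 non-stroke rows (the n_samples value used in the upsampling cell) out of the 4390 rows left after outlier removal. Treat these counts as assumptions if the data changes.

```python
# Assumed class counts after outlier removal (see the lead-in)
n_total = 4390
n_negative = 4225
n_positive = n_total - n_negative

y_true = [0] * n_negative + [1] * n_positive
y_always_no_stroke = [0] * n_total  # a "model" that never predicts a stroke

correct = sum(t == p for t, p in zip(y_true, y_always_no_stroke))
accuracy = correct / n_total

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_always_no_stroke))
recall = true_positives / n_positive

print("accuracy:", round(accuracy, 4))
print("recall on the stroke class:", recall)
```

The accuracy is about 0.96 even though not a single stroke case is detected, which is why accuracy alone is not a suitable metric for this data.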

In [None]:
from sklearn.utils import resample

In [None]:
df_minority = df12345[df12345['stroke']==1]
df_majority = df12345[df12345['stroke']==0]

In [None]:
min_class = resample(df_minority, 
                             replace=True,     # sample with replacement
                             n_samples=len(df_majority),    # match the majority class size
                             random_state=10) 
df_upsampled = pd.concat([min_class,df_majority])

In [None]:
df_upsampled["stroke"].value_counts()

In [None]:
df_upsampled.head(2)

The imbalance issue has been dealt with.

### Train Test Split : 

Now that we have a clean dataframe devoid of any outliers or missing values, we can go ahead and perform the train-test split, as we mentioned earlier.

In [None]:
list34 = list(df_upsampled.columns)
list34.remove('stroke')
x = df_upsampled[list34]
y = df_upsampled[["stroke"]]

X_train, X_test, Y_train, Y_test= train_test_split(x, y, test_size=0.3,random_state=42)
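The assignment asks for the 7:3 split to be repeated 100 times with the test data kept unseen. The following is a minimal sketch of such a loop on synthetic data, not the procedure used in this notebook. The synthetic features, the stratified split and the choice of LogisticRegression are all assumptions made for illustration. Upsampling is applied to the training part only, so no duplicated minority row can end up in the test part.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced toy data (hypothetical; stands in for the stroke features)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 1.2).astype(int)

scores = []
for seed in range(100):  # 100 random 7:3 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)

    # Upsample the minority class inside the training part only
    n_major = int((y_tr == 0).sum())
    minority_up = resample(X_tr[y_tr == 1], replace=True,
                           n_samples=n_major, random_state=seed)
    X_bal = np.vstack([X_tr[y_tr == 0], minority_up])
    y_bal = np.concatenate([np.zeros(n_major, dtype=int),
                            np.ones(n_major, dtype=int)])

    model = LogisticRegression().fit(X_bal, y_bal)
    scores.append(f1_score(y_te, model.predict(X_te)))

print("splits evaluated:", len(scores))
print("mean F1:", np.mean(scores), "| std of F1:", np.std(scores))
```

Reporting the mean and spread over the 100 splits gives a more stable estimate than a single split with one random_state.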

### 2. f. Perform at least three clustering methods with varying cluster sizes. Perform any three best-known methods to find out correct cluster numbers for each method; how you finalized this cluster number.[10]

#### Clustering Method 1:

#### K Means Clustering :
We will perform KMeans clustering at first. We will use the elbow method to find the elbow value, i.e. the optimum value of K for which we get the best clustering.<br>We are using varying cluster sizes as asked in the question.

In [None]:
n_clusters = [2,3,4,5,6,7,8,9,10] # number of clusters
clusters_inertia = [] # inertia of clusters
s_scores = [] # silhouette scores

for n in n_clusters:
    KM_est = KMeans(n_clusters=n, init='k-means++').fit(X_train)
    clusters_inertia.append(KM_est.inertia_)    # data for the elbow method
    silhouette_avg = silhouette_score(X_train, KM_est.labels_)
    s_scores.append(silhouette_avg) # data for the silhouette score method

In [None]:
fig, ax = plt.subplots(figsize=(12,5))
ax = sns.lineplot(x=n_clusters, y=clusters_inertia, marker='o', ax=ax)
ax.set_title("Elbow method")
ax.set_xlabel("number of clusters")
ax.set_ylabel("clusters inertia")
ax.axvline(3, ls="--", c="red")
ax.axvline(4, ls="--", c="red")
plt.grid()
plt.show()

From this plot the elbow appears at 3, although either 3 or 4 clusters seems a fair choice. Let's look at the silhouette score to confirm the elbow point.

In [None]:
# Plot for Silhouette score to find the optimum K

fig, ax = plt.subplots(figsize=(12,5))
ax = sns.lineplot(x=n_clusters, y=s_scores, marker='o', ax=ax)
ax.set_title("Silhouette score method")
ax.set_xlabel("number of clusters")
ax.set_ylabel("Silhouette score")
ax.axvline(3, ls="--", c="red")
plt.grid()
plt.show()

Now we can say the best option would be 3. Hence we will go with K=3
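To show how the silhouette criterion picks a cluster number, here is a small self-contained sketch on synthetic data with three well-separated groups. The blob centres, sample size and random_state values are assumptions chosen for illustration and are unrelated to the stroke data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups (hypothetical centres)
X_toy, _ = make_blobs(n_samples=300,
                      centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)  # higher means better-separated clusters

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("cluster number with the highest silhouette score:", best_k)
```

The silhouette score lies between -1 and 1, and the cluster number with the highest average score is taken as the best fit.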

#### K Means Clustering with Cluster size = 3 :

In [None]:
# To initialize and fit K-Means model
KM_3_clusters = KMeans(n_clusters=3 , init='k-means++').fit(X_train)
KM_3_clusters.labels_

In [None]:
KM_3_clusters.cluster_centers_

In [None]:
KM_3_clusters.predict(X_test)

#### Clustering Method 2:

#### MiniBatchKMeans Clustering:
We are applying MiniBatchKMeans Clustering algorithm here.

In [None]:

n_clusters = [2,3,4,5,6,7,8,9,10] # number of clusters
clusters_inertia = [] # inertia of clusters
s_scores = [] # silhouette scores


from sklearn.cluster import MiniBatchKMeans
for n in n_clusters:
    KM_est = MiniBatchKMeans(n_clusters=n, init='k-means++').fit(X_train)
    clusters_inertia.append(KM_est.inertia_)    # data for the elbow method
    silhouette_avg = silhouette_score(X_train, KM_est.labels_)
    s_scores.append(silhouette_avg) # data for the silhouette score method

In [None]:
fig, ax = plt.subplots(figsize=(12,5))
ax = sns.lineplot(x=n_clusters, y=clusters_inertia, marker='o', ax=ax)
ax.set_title("Elbow method")
ax.set_xlabel("number of clusters")
ax.set_ylabel("clusters inertia")
ax.axvline(3, ls="--", c="red")
ax.axvline(4, ls="--", c="red")
plt.grid()
plt.show()

In [None]:
# Plot for Silhouette score to find the optimum K

fig, ax = plt.subplots(figsize=(12,5))
ax = sns.lineplot(x=n_clusters, y=s_scores, marker='o', ax=ax)
ax.set_title("Silhouette score method")
ax.set_xlabel("number of clusters")
ax.set_ylabel("Silhouette score")
ax.axvline(3, ls="--", c="red")
plt.grid()
plt.show()

In [None]:
# To initialize and fit Mini batch K-Means model
MKM_3_clusters = MiniBatchKMeans(n_clusters=3 , init='k-means++').fit(X_train)

MKM_3_clusters.labels_

In [None]:
MKM_3_clusters.cluster_centers_

In [None]:
MKM_3_clusters.predict(X_test)

#### Clustering Method 3:

#### Agglomerative Clustering:
We are applying Agglomerative Clustering algorithm here.

In [None]:
# Importing AgglomerativeClustering from Sklearn
from sklearn.cluster import AgglomerativeClustering

# Running Agglomerative Clustering
no_of_clusters = []
n_clusters = list(range(2, 10)) # Range is arbitrarily chosen
ag_sil_score = [] # silhouette scores

for p in n_clusters:
    AG = AgglomerativeClustering(n_clusters=p).fit(X_train)
    no_of_clusters.append((len(np.unique(AG.labels_))))
    ag_sil_score.append(silhouette_score(X_train, AG.labels_))
    
results = pd.DataFrame([n_clusters, no_of_clusters, ag_sil_score], index=['n_clusters','clusters', 'sil_score']).T
results.sort_values(by='sil_score', ascending=False).head() # display only 5 best scores

In [None]:
# For plotting silhouette score

fig, ax = plt.subplots(figsize=(12,5))
ax = sns.lineplot(x=list(n_clusters), y=ag_sil_score, marker='o', ax=ax)
ax.set_title("Silhouette score method")
ax.set_xlabel("number of clusters")
ax.set_ylabel("Silhouette score")
ax.axvline(3, ls="--", c="red")
plt.grid()
plt.show()

From the silhouette score plot we observe that 3 is the optimum number of clusters.

#### Agglomerative Clustering with Cluster size = 3 :

In [None]:
# To initialize and fit agglomerative model
AG = AgglomerativeClustering(n_clusters=3).fit(X_train)
AG.labels_

In [None]:
AG.n_leaves_ # Shows Number of leaves in the hierarchical tree.

### 2. g. Perform at least three supervised methods for classification, and report at least three performance metrics out of (accuracy, precision, Cohen's kappa, F1-score, MCC, sensitivity and specificity) with proper reason. [10]

#### Classification method 1:

#### Naive bayes classification:

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, Y_train)

y_pred  =  classifier.predict(X_test)

from sklearn import metrics  # needed for precision_score and recall_score below
from sklearn.metrics import confusion_matrix,accuracy_score
cm = confusion_matrix(Y_test, y_pred)
ac = accuracy_score(Y_test,y_pred)

print("Accuracy:",ac)
print("Precision:",metrics.precision_score(Y_test, y_pred))
print("Recall:",metrics.recall_score(Y_test, y_pred))

#### Classification method 2:

#### Logistic Regression:

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', random_state=0)
model.fit(X_train, Y_train)
Y_Pred2 = model.predict(X_test)

from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(Y_test, Y_Pred2)
cnf_matrix

In [None]:
print("Accuracy:",metrics.accuracy_score(Y_test, Y_Pred2))
print("Precision:",metrics.precision_score(Y_test, Y_Pred2))
print("Recall:",metrics.recall_score(Y_test, Y_Pred2))

#### Classification method 3:

#### K-Nearest neighbour:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, Y_train)

y_pred2 = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix,accuracy_score
cm1 = confusion_matrix(Y_test, y_pred2)
ac1 = accuracy_score(Y_test,y_pred2)

print("Accuracy:",ac1)
print("Precision:",metrics.precision_score(Y_test, y_pred2))
print("Recall:",metrics.recall_score(Y_test, y_pred2))
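The assignment also lists Cohen's kappa, MCC, sensitivity and specificity, which are more informative than accuracy when classes are imbalanced. The sketch below computes them on a small hypothetical set of labels and predictions. The two lists are made up for illustration and are not outputs of the classifiers above.

```python
from sklearn.metrics import (cohen_kappa_score, confusion_matrix,
                             matthews_corrcoef)

# Hypothetical true labels and predictions (1 = stroke, 0 = no stroke)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)   # share of actual strokes that were detected
specificity = tn / (tn + fp)   # share of non-strokes correctly identified
kappa = cohen_kappa_score(y_true, y_hat)   # agreement beyond chance
mcc = matthews_corrcoef(y_true, y_hat)     # correlation of truth and prediction

print("sensitivity:", sensitivity, "| specificity:", specificity)
print("Cohen's kappa:", kappa, "| MCC:", mcc)
```

The same calls can be applied to Y_test and any of the prediction arrays produced above.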

In [None]:
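ml_names = ['Gaussian Naive Bayes', 'Logistic Regression', 'K-Nearest Neighbour']

# predictions of the three classifiers fitted above, in the same order as ml_names
preds_all = [y_pred, Y_Pred2, y_pred2]
acc_all = [metrics.accuracy_score(Y_test, p) for p in preds_all]
prec_all = [metrics.precision_score(Y_test, p) for p in preds_all]
f1_all = [metrics.f1_score(Y_test, p) for p in preds_all]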

def autolabel(bars):
    """Attach a text label above each bar in displaying its height."""
    for bar in bars:
        height = bar.get_height()
        ax.annotate('{:.2f}'.format(height),
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 5),  # 5 points vertical offset
                    textcoords="offset points",
                    fontsize=12,
                    rotation=90,
                    ha='center', va='bottom')

width = 0.25  # the width of the bars
r1 = np.arange(len(ml_names))  # the label locations
r2 = [x + width for x in r1]
r3 = [x + width for x in r2]

# plot accuracy, precision, and F1
fig, ax = plt.subplots(figsize=(8,6))
bar1 = ax.bar(r1, acc_all, width, label='Accuracy')
bar2 = ax.bar(r2, prec_all, width, label='Precision')
bar3 = ax.bar(r3, f1_all, width, label='F1')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylim([0,1.13])
ax.set_ylabel('Scores',fontsize=14)

#ax.set_title('Performance benchmark across ML models')
ax.set_xticks(r2)
ax.set_xticklabels(ml_names)
ax.tick_params(axis='both', which='major', labelsize=12)
ax.set_xlabel("Machine Learning Model",fontsize=14)
ax.legend(loc='lower left',ncol=3,bbox_to_anchor=(0.25,1),fontsize=12)
autolabel(bar1)
autolabel(bar2)
autolabel(bar3)
fig.tight_layout()
fig.savefig("ml_benchmark_f1.pdf", bbox_inches='tight')
plt.show()    