# DSA5102 Project: Are You Likely to Suffer a Heart Attack?<a id='top'></a>
## Contents
1. <a href=#intro>Introduction</a>
    1. <a href=#back>Background</a>
    1. <a href=#object>Objective</a>
    2. <a href=#data>Data description</a>
1. <a href=#model>Modeling</a>
    1. <a href=#sl>Supervised Learning</a>
    1. <a href=#ul>Unsupervised Learning</a>
1. <a href=#conclusion>Conclusion</a>
1. <a href=#ref>Links</a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import tensorflow as tf
from sklearn.metrics import confusion_matrix,classification_report
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.metrics import silhouette_score
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from scipy.cluster import hierarchy
import plotly.express as px

<a id='intro'></a>
# 1. Introduction
<a href=#top>(back to top)</a>

<a id='back'></a>
### 1.1 Background
With the development of remote devices and communication technologies, the pace of life has accelerated: people work longer hours and work-related pressure keeps rising. In such an environment, people often neglect their health, which contributes to the onset of various diseases. Heart disease is one of the most common diseases of the circulatory system. Because its onset is sudden and hard to predict, it often cannot be prevented in time and can be fatal, which makes it a major threat to health.

<a id='object'></a>
### 1.2 Objective
Therefore, in this project, my goal is to classify people into high-risk and low-risk groups using machine learning (supervised and unsupervised learning), based on the 13 attributes in the dataset that may be associated with heart disease, such as the patient's age, gender, type of chest pain, resting blood pressure, and fasting blood sugar. By comparing the models, we want to select the most accurate and most robust one.

In [None]:
heart_disease=pd.read_csv('../input/health-care-data-set-on-heart-attack-possibility/heart.csv')  # load the data as a DataFrame
heart_disease.columns          # check the column names

### Data Visualization

In [None]:
heart_disease.head()

In [None]:
heart_disease.info()

In [None]:
fig, ax=plt.subplots(5,3,figsize=(20,28))
sns.distplot(heart_disease['age'],bins=10,ax=ax[0,0],axlabel='Age Distribution')
sns.countplot(x="sex", data=heart_disease,ax=ax[0,1])
sns.countplot(x="cp", data=heart_disease,ax=ax[0,2])
sns.distplot(heart_disease['trestbps'],bins=10,ax=ax[1,0],axlabel='resting blood pressure')
sns.distplot(heart_disease['chol'],bins=10,ax=ax[1,1],axlabel='serum cholestoral in mg/dl')
sns.countplot(x="fbs", data=heart_disease,ax=ax[1,2])
sns.countplot(x="restecg", data=heart_disease,ax=ax[2,0])
sns.distplot(heart_disease['thalach'],bins=10,ax=ax[2,1],axlabel='maximum heart rate achieved')
sns.countplot(x="exang", data=heart_disease,ax=ax[2,2])
sns.distplot(heart_disease['oldpeak'],bins=10,ax=ax[3,0],axlabel='ST depression induced by exercise relative to rest')
sns.countplot(x='slope',data=heart_disease,ax=ax[3,1])
sns.countplot(x='ca',data=heart_disease,ax=ax[3,2])
sns.countplot(x='thal',data=heart_disease,ax=ax[4,0])
sns.countplot(x='target',data=heart_disease,ax=ax[4,1])
ax[4,1].set_title('target: 0 = less chance of heart attack; 1 = more chance of heart attack')
ax[4,0].set_title('thal: 0 = normal; 1 = fixed defect; 2 = reversible defect')
ax[3,2].set_title('number of major vessels (0-3) colored by fluoroscopy')
ax[3,1].set_title('the slope of the peak exercise ST segment')
ax[2,2].set_title('exercise induced angina')
ax[1,2].set_title("fasting blood sugar > 120 mg/dl")
ax[0,2].set_title("chest pain type")
ax[2,0].set_title('resting electrocardiographic results')

In [None]:
plt.figure(figsize= (10, 6))
corrMatrix = heart_disease.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

<a id='model'></a>
# 2. Modeling
<a href=#top>(back to top)</a>

### Data cleaning
Check for missing values.

A first look at the table shows no `null/NaN` values in the DataFrame, which is good.

In [None]:
heart_disease.replace(to_replace= r'^\s*$', value=np.nan,regex=True, inplace=True ) # replace cells that are empty or contain only whitespace with NaN
heart_disease.isnull().any() # check whether each column contains a missing value

Any cell that is empty or contains only whitespace is replaced by `NaN`, and we then check again for `null` values. The check finds none, so the data are clean.
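
To make the whitespace check concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not part of the heart dataset). It uses the same regular expression and then counts missing values per column rather than only flagging them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one genuine NaN and one whitespace-only cell.
toy = pd.DataFrame({
    "age": [63, 37, np.nan],
    "thal": ["2", "  ", "1"],
})

# Turn empty or whitespace-only strings into NaN, using the same pattern as above.
toy_clean = toy.replace(to_replace=r"^\s*$", value=np.nan, regex=True)

# Count missing values per column instead of only reporting True/False.
missing_per_column = toy_clean.isnull().sum()
print(missing_per_column)
```

Using `.sum()` instead of `.any()` shows how many cells would need attention if the data were not clean.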

## 2.1 Supervised Learning <a id='sl'></a>
### 2.1.1 Deep Neural Network Model

In [None]:
# cut the Dataframe into two parts, one for features, another for target
X=heart_disease.drop(['target'],axis=1)
y=heart_disease['target']

For the `DNN`, we need to convert the *categorical feature* columns (sex, cp, fbs, restecg, exang, slope, ca, and thal) into `one-hot` encoding. To do that, we first change the type of those columns from integer to object, as follows:

In [None]:
X[['sex','cp','fbs','restecg','exang','slope','ca','thal']]=(X[['sex','cp','fbs','restecg','exang','slope','ca','thal']].astype(object))
X.info()

In [None]:
# convert all the categorical columns into one-hot encoding,
# and drop the first one-hot column from each conversion.
X_encode=pd.get_dummies(X,drop_first=True)
from keras.utils import to_categorical  # newer Keras releases no longer provide keras.utils.np_utils
y_encode=to_categorical(y)
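
As an illustration of what `drop_first=True` does, the following sketch encodes a small made-up frame (the `toy` data are an assumption for illustration). A categorical column with three levels becomes two indicator columns, because the first level is dropped and serves as the baseline.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one integer-coded categorical column.
toy = pd.DataFrame({"age": [63, 37, 41], "cp": [0, 1, 2]})

# Cast the categorical column to object so that get_dummies encodes it.
toy["cp"] = toy["cp"].astype(object)

# One indicator column per level, minus the first level (cp = 0).
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

A row with `cp = 0` is then represented by zeros in every `cp_*` column.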

In [None]:
from keras.layers import Dense
from keras.layers import Dropout
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

In [None]:
n_cols=X_encode.shape[1] # number of encoded features = number of nodes in the input layer
n_cols

#### Normalize all the data (train and test) 

In [None]:
X_encode_scaled=scale(X_encode)
pd.DataFrame(X_encode_scaled)
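
The cell above standardizes the full dataset before it is split. A common alternative, sketched here on synthetic data, is to fit the scaler on the training split only and reuse it on the test split, so that no test-set statistics leak into training. The arrays `X_toy` and `y_toy` are assumptions for illustration; this is not the procedure used in the rest of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and the binary target.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Learn mean and standard deviation from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# Apply the same training statistics to the test split.
X_te_scaled = scaler.transform(X_te)
```

With this approach the training columns have mean 0 and unit variance, while the test columns are only approximately standardized.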

#### Split data into train set and test set

In [None]:
features_train1,features_test1,target_train1,target_test1 = train_test_split(X_encode_scaled,y_encode,test_size=0.2,random_state=42,stratify=y)

#### DNN Model construction:

Sequential model; Adam optimizer; loss: categorical cross-entropy; metric: categorical accuracy.

When building the `DNN`, we first build an overfitted model to make sure it has enough capacity to represent the data. Then we use dropout or regularization to reduce overfitting, and finally we tune the hyperparameters (the number of nodes). The model below is the final version after going through this process.

In [None]:
model1=Sequential()
model1.add(Dense(40,activation='relu',input_shape=(n_cols,)))
model1.add(Dropout(0.25))
model1.add(Dense(40,activation='relu'))
model1.add(Dropout(0.25))
model1.add(Dense(2,activation='softmax'))                                             # softmax output layer (for classification)
model1.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['categorical_accuracy']) # categorical cross-entropy as the loss function
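
To show what the softmax output layer and the categorical cross-entropy loss compute, here is a small NumPy sketch. The helper functions and the example `logits` are assumptions written for illustration; they mirror the textbook formulas and are not Keras's internal implementation.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    # Mean over samples of -sum_k y_k * log(p_k); eps guards against log(0).
    eps = 1e-12
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1)))

# Two hypothetical samples with raw scores for (low risk, high risk).
logits = np.array([[2.0, 0.0], [0.0, 0.0]])
# One-hot labels in the same format that to_categorical produces.
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs, loss)
```

Each row of `probs` sums to 1, and the second column can be read as the predicted probability of the high-risk class.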

We use early stopping to halt training once the validation loss stops improving, which helps prevent overfitting from training for too many epochs.

In [None]:
early_stopping_monitor=EarlyStopping(patience=2)        # stop early to prevent overfitting
record=model1.fit(features_train1,target_train1,validation_split=0.2,epochs=50,callbacks=[early_stopping_monitor])   # 20% of the training data is held out for validation
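
The patience rule can be sketched in plain Python. This is a simplified illustration of the idea, not the actual Keras code; the function name `early_stopping_epoch` and the loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value is reached at epoch 2,
# followed by two epochs without improvement.
stopped_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.62, 0.5], patience=2)
print(stopped_at)
```

Note that in this made-up sequence the loss would have improved again at the last epoch, which illustrates the trade-off of a small `patience`.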

#### Evaluate model

In [None]:
loss,accuracy=model1.evaluate(features_test1,target_test1)
print('loss is ', loss, '\nDNN accuracy on test data is ' ,accuracy)

In [None]:
# Plot training accuracy and validation accuracy by epoch
plt.plot(record.epoch, record.history.get('categorical_accuracy'),color='orange',label='training accuracy')
plt.plot(record.epoch, record.history.get('val_categorical_accuracy'),color='blue',label='validation accuracy')
plt.legend()

In [None]:
# Save the trained model
from keras.models import load_model
model1.save('model1.h5')

Confusion Matrix and classification report

In [None]:
target_pred_label1=np.argmax(model1.predict(features_test1),axis=1) # use DNN to predict labels on test data
C_dnn=confusion_matrix(
    target_test1[:,1],   # array, Ground truth (correct) target values
    target_pred_label1,  # array, Estimated targets as returned by a classifier
    labels=[0,1],        # array, List of labels to index the matrix.
    sample_weight=None  # array-like of shape = [n_samples], Optional sample weights
)

col_name=pd.MultiIndex.from_product([['Predicted label'], ['low risk','high risk']])
row_name=pd.MultiIndex.from_product([['True label'], ['low risk','high risk']])
pd.DataFrame(C_dnn,columns=col_name,index=row_name)

In [None]:
def plot_confusion_matrix(cm, labels_name, title):
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]    # normalization
    plt.imshow(cm, interpolation='nearest')    
    plt.title(title)    
    plt.colorbar()
    num_local = np.array(range(len(labels_name)))    
    plt.xticks(num_local, labels_name, rotation=90)    
    plt.yticks(num_local, labels_name)    
    plt.ylabel('True label')    
    plt.xlabel('Predicted label')

plot_confusion_matrix(C_dnn,['low risk','high risk'], "DNN_pred Confusion Matrix")
plt.show()

In [None]:
print(classification_report(target_test1[:,1],target_pred_label1))

From the above, the *model accuracy* is 87%, and the confusion matrix also looks good. Moreover, the accuracy plot shows no obvious overfitting, since the gap between the training and validation curves is small.
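
The figures in the classification report can be derived directly from a confusion matrix. The sketch below uses made-up counts (not the results of this notebook) laid out as in `sklearn.metrics.confusion_matrix`, with rows as true labels and columns as predicted labels.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true (low, high), columns = predicted (low, high).
cm = np.array([[20, 5],
               [3, 30]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted high-risk cases that are truly high risk
recall = tp / (tp + fn)           # share of true high-risk cases that are detected
print(accuracy, precision, recall)
```

For a screening task, recall on the high-risk class is often the figure to watch, since a missed high-risk patient is usually the costlier error.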

#### Probabilistic Calculation from DNN model:

Instead of only a binary result indicating whether a patient with the given features is likely to have heart disease, we also want the specific predicted probability of heart disease for those features.

In [None]:
def Probability1(features,model):
    # return the predicted probability of heart disease (class 1) for the given features
    prediction_array=model.predict(features)
    probability_array =prediction_array[:,1]
    probability_table=pd.DataFrame(probability_array)
    return probability_table

In [None]:
Probability1(features_test1,model1)

In this way we obtain not only the binary result but also the predicted probability of heart disease for the given features.

### 2.1.2 Logistic Regression

Apart from the DNN, a `logistic regression` model is also suitable for binary classification problems.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

In [None]:
features_train2,features_test2,target_train2,target_test2 = train_test_split(X_encode_scaled,y,test_size=0.2,random_state=42,stratify=y)

#### Model construction

In [None]:
logreg=LogisticRegression(max_iter=3000) # raise max_iter to 3000 so the solver has enough iterations to converge
logreg.fit(features_train2,target_train2)
target_pred2=logreg.predict(features_test2)

#### Evaluate model
Use the test dataset to plot the ROC curve.

In [None]:
from sklearn.metrics import roc_curve
y_pred_prob = logreg.predict_proba(features_test2)[:,1]
fpr, tpr, thresholds = roc_curve(target_test2,y_pred_prob)
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr,label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()

In [None]:
cv_scores=cross_val_score(logreg,X_encode_scaled,y,cv=5,scoring='roc_auc')
print('Mean 5-fold cross-validated AUC of logistic model is ',cv_scores.mean())

Confusion Matrix and classification report

In [None]:
C_logistic=confusion_matrix(
    target_test2,   # array, Ground truth (correct) target values
    target_pred2,  # array, Estimated targets as returned by a classifier
    labels=[0,1],        # array, List of labels to index the matrix.
    sample_weight=None  # array-like of shape = [n_samples], Optional sample weights
)
pd.DataFrame(C_logistic,columns=col_name,index=row_name)

In [None]:
plot_confusion_matrix(C_logistic,['low risk','high risk'], "Logistic Regression Confusion Matrix")
plt.show()

In [None]:
print(classification_report(target_test2,target_pred2))

From the above, the logistic regression model reaches a mean cross-validated AUC of about 91.5% with an *accuracy* of 0.87, and the `confusion matrix` also looks good. The ROC curve supports this: the model achieves a high true positive rate while keeping the false positive rate low, which indicates that it performs well for classification and filtering.
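
To make the meaning of AUC concrete, the sketch below computes it on four made-up predictions (the labels and scores are assumptions for illustration). The AUC equals the share of (positive, negative) pairs in which the positive case receives the higher score, which the pairwise calculation checks by hand.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Manual check: fraction of (positive, negative) pairs ranked correctly.
# This simple version assumes there are no tied scores.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = float(np.mean(pos[:, None] > neg[None, :]))
print(auc, pairwise)
```

Here three of the four pairs are ranked correctly, so both calculations give 0.75.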

#### Probability calculation from logistic regression model

In [None]:
def Probability2(features):
    y_pred_prob = logreg.predict_proba(features)[:,1]
    return pd.DataFrame(y_pred_prob)

In [None]:
Probability2(features_test2)

In this way we obtain not only the binary result but also the predicted probability of heart disease for the given features.

### 2.1.3 KNN model

We can also use k nearest neighbors model to classify the data points into two categories.
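
As a rough sketch of how a k-nearest-neighbors classifier decides, the function below finds the `k` closest training points by Euclidean distance and takes a majority vote. The function `knn_predict` and the toy points are assumptions for illustration; scikit-learn's `KNeighborsClassifier` is more general and more efficient.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical 2D training data: three low-risk points and two high-risk points.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 0, 1, 1])

pred_low = knn_predict(X_train, y_train, np.array([0.1, 0.1]), k=3)
pred_high = knn_predict(X_train, y_train, np.array([3.0, 3.1]), k=3)
print(pred_low, pred_high)
```

Because the vote depends on distances, the features should be on comparable scales, which is why the scaled features are used for this model.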

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
features_train3,features_test3,target_train3,target_test3 = train_test_split(X_encode_scaled,y,test_size=0.2,random_state=42,stratify=y)

#### Model Construction

Hyperparameter Tuning:

`n_neighbors` is a hyperparameter, which means we must choose its value before training the model. To find the best value, I use *hyperparameter tuning* to search for the optimal `n_neighbors` between 1 and 49. During the search, 5-fold cross-validation is used to score each candidate and select the best one.

In [None]:
param_grid = {'n_neighbors': np.arange(1,50)}
knn=KNeighborsClassifier()
knn_cv=GridSearchCV(knn,param_grid,cv=5)
knn_cv.fit(features_train3,target_train3)
#find best parameter
n_neighbor=knn_cv.best_params_
n_neighbor

In [None]:
# find the cross-validation score of the best parameter
print('cross-validation score for best parameter', knn_cv.best_params_, 'is', knn_cv.best_score_)

The best `KNN` cross-validation score is 0.85, which is quite good.

#### Evaluate the model

In [None]:
target_pred3=knn_cv.predict(features_test3)  # predict labels on the test data
print('KNN accuracy on test data is ',knn_cv.score(features_test3,target_test3) )

Confusion Matrix and classification report

In [None]:
C_KNN=confusion_matrix(
    target_test3,   # array, Ground truth (correct) target values
    target_pred3,  # array, Estimated targets as returned by a classifier
    labels=[0,1],        # array, List of labels to index the matrix.
    sample_weight=None  # array-like of shape = [n_samples], Optional sample weights
)
pd.DataFrame(C_KNN,columns=col_name,index=row_name)

In [None]:
plot_confusion_matrix(C_KNN,['low risk','high risk'], "KNN Confusion Matrix")
plt.show()

In [None]:
print(classification_report(target_test3,target_pred3))

### 2.1.4 Decision Tree model

We can also use a `decision tree` as the classifier.

In [None]:
features_train4,features_test4,target_train4,target_test4 = train_test_split(X_encode_scaled,y,test_size=0.2,random_state=42,stratify=y)

#### Model Construction

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train4,target_train4)
import graphviz 
dot_data = tree.export_graphviz(clf, out_file=None,
                     filled=True, rounded=True,  
                     special_characters=True) 
graph = graphviz.Source(dot_data)  
graph 

#### Evaluate the model

In [None]:
clf_accuracy=clf.score(features_test4,target_test4)
target_pred4=clf.predict(features_test4)
print('Decision tree model accuracy on test data is ',clf_accuracy)

In [None]:
C_decisiontree=confusion_matrix(
    target_test4,   # array, Ground truth (correct) target values
    target_pred4,  # array, Estimated targets as returned by a classifier
    labels=[0,1],        # array, List of labels to index the matrix.
    sample_weight=None  # array-like of shape = [n_samples], Optional sample weights
)
pd.DataFrame(C_decisiontree,columns=col_name,index=row_name)

In [None]:
plot_confusion_matrix(C_decisiontree,['low risk','high risk'], "Decision Tree Confusion Matrix")
plt.show()

In [None]:
print(classification_report(target_test4,target_pred4))

### 2.1.5 Random forest model

In [None]:
features_train5,features_test5,target_train5,target_test5 = train_test_split(X_encode_scaled,y,test_size=0.2,random_state=42,stratify=y)

#### Model Construction
Hyperparameter Tuning:

In [None]:
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators':np.arange(10,101,10)}        # set the range of n_estimators that we want to search for
rfc = RandomForestClassifier(n_jobs=4)                  # build up a model
rfc_cv = GridSearchCV(rfc,param_grid,scoring='accuracy', cv=3) 
rfc_cv.fit(features_train5,target_train5) 

In [None]:
# find the best parameter
rfc_cv.best_params_

In [None]:
# find the best score of best parameter
rfc_cv.best_score_

#### Evaluate the model

In [None]:
target_pred5=rfc_cv.predict(features_test5)

# find the accuracy on test data
rfc_accuracy=rfc_cv.score(features_test5,target_test5)
print('Random forest model accuracy on test data is',rfc_accuracy)

confusion matrix and classification report

In [None]:
C_rfc=confusion_matrix(
    target_test5,   # array, Ground truth (correct) target values
    target_pred5,  # array, Estimated targets as returned by a classifier
    labels=[0,1],        # array, List of labels to index the matrix.
    sample_weight=None  # array-like of shape = [n_samples], Optional sample weights
)
pd.DataFrame(C_rfc,columns=col_name,index=row_name)

In [None]:
plot_confusion_matrix(C_rfc,['low risk','high risk'], "Random Forest Confusion Matrix")
plt.show()

In [None]:
print(classification_report(target_test5,target_pred5))

### 2.1.6 SVM model
#### Model 1: SVM.SVC(kernel='linear')

In [None]:
features_train6,features_test6,target_train6,target_test6 = train_test_split(X_encode_scaled,y,test_size=0.2,random_state=42,stratify=y)

#### Model Construction
Hyperparameter Tuning:

In [None]:
param_grid={'C':np.arange(0.1,10,0.1)}
model_svm1=svm.SVC(kernel='linear')
model_svm1_cv=GridSearchCV(model_svm1,param_grid,scoring='accuracy',cv=3)
model_svm1_cv.fit(features_train6,target_train6)

In [None]:
print(model_svm1_cv.best_params_) # find best hyperparameter.
print(model_svm1_cv.best_score_)  # find the score of the best hyperparameter.

#### Evaluate the model

In [None]:
target_pred6=model_svm1_cv.predict(features_test6)
# find the accuracy on test data
svm1_accuracy=model_svm1_cv.score(features_test6,target_test6)
print('svm with linear kernel model accuracy on test data is',svm1_accuracy)

confusion matrix and classification report

In [None]:
C_svm1=confusion_matrix(
    target_test6,   # array, Ground truth (correct) target values
    target_pred6,  # array, Estimated targets as returned by a classifier
    labels=[0,1],        # array, List of labels to index the matrix.
    sample_weight=None  # array-like of shape = [n_samples], Optional sample weights
)
pd.DataFrame(C_svm1,columns=col_name,index=row_name)

In [None]:
plot_confusion_matrix(C_svm1,['low risk','high risk'], "SVM(kernel=linear) Confusion Matrix")
plt.show()

In [None]:
print(classification_report(target_test6,target_pred6))

#### Model 2: SVM.SVC(kernel='rbf')

In [None]:
features_train8,features_test8,target_train8,target_test8 = train_test_split(X_encode_scaled,y,test_size=0.2,random_state=42,stratify=y)

#### Model Construction
Hyperparameter Tuning:

In [None]:
param_grid={'C':np.arange(0.1,10,0.1)}
model_svm3=svm.SVC(kernel='rbf',gamma='scale')
model_svm3_cv=GridSearchCV(model_svm3,param_grid,scoring='accuracy',cv=3)
model_svm3_cv.fit(features_train8,target_train8)

In [None]:
print(model_svm3_cv.best_params_) # find best hyperparameter.
print(model_svm3_cv.best_score_)  # find the score of the best hyperparameter.


#### Evaluate the model

In [None]:
target_pred8=model_svm3_cv.predict(features_test8)
# find the accuracy on test data
svm3_accuracy=model_svm3_cv.score(features_test8,target_test8)
print('svm with rbf kernel model accuracy on test data is',svm3_accuracy)

confusion matrix and classification report

In [None]:
C_svm3=confusion_matrix(
    target_test8,   # array, Ground truth (correct) target values
    target_pred8,  # array, Estimated targets as returned by a classifier
    labels=[0,1],        # array, List of labels to index the matrix.
    sample_weight=None  # array-like of shape = [n_samples], Optional sample weights
)
pd.DataFrame(C_svm3,columns=col_name,index=row_name)

In [None]:
plot_confusion_matrix(C_svm3,['low risk','high risk'], "SVM.SVC(kernel='rbf') Confusion Matrix")
plt.show()

In [None]:
print(classification_report(target_test8,target_pred8))

From the above, we can conclude that the `logistic regression` model and the `SVM` with a linear kernel perform best among these models.
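
Since all models above are compared on a single test split, one way to check how stable such a ranking is would be to score the candidates on the same cross-validation folds. The sketch below does this on synthetic data from `make_classification`, which is an assumption for illustration; the numbers it produces say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data as a stand-in for the encoded features.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Use the same stratified folds for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "logistic regression": LogisticRegression(max_iter=3000),
    "linear SVM": SVC(kernel="linear"),
}

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_toy, y_toy, cv=cv, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(name, scores.mean(), scores.std())
```

Reporting the standard deviation alongside the mean helps judge whether a small difference between two models is meaningful.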

## 2.2 Unsupervised Learning <a id='ul'></a>
### 2.2.1 Principal Component Analysis (PCA)

Problem statement:
The dataset has 13 features in total. After using supervised machine learning to build predictive models and measure their performance, I next want to use `principal component analysis` for feature extraction and dimension reduction across the 13 features, and to see which features have a large influence.

#### t-SNE for 2-dimensional maps 

`t-SNE` (t-distributed stochastic neighbor embedding) is a dimension-reduction technique used for data visualization. Its axes have no intrinsic meaning.

In [None]:
from sklearn.manifold import TSNE

In [None]:
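samples= heart_disease.copy()  # copy so that heart_disease itself is not modified
# note: samples still contains the target column
samples[['sex','cp','fbs','restecg','exang','slope','ca','thal']]=(samples[['sex','cp','fbs','restecg','exang','slope','ca','thal']].astype(object))
samples.info()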

In [None]:
samples_encode=pd.get_dummies(samples,drop_first=True)
samples_encode_scaled=scale(samples_encode)

In [None]:
model_TSNE = TSNE (learning_rate=100)
transformed = model_TSNE.fit_transform(samples_encode_scaled)
xs=transformed[:,0]
ys=transformed[:,1]
plt.scatter(xs,ys,c=y)

As the `t-SNE` graph shows, there is a clear boundary between the two categories (target=0, target=1). Keep in mind that `samples` includes the `target` column itself, which contributes to this separation. The embedding lets us visualize the samples in 2D.

#### Construct PCA model

In [None]:
# set number of components = 3.
X_scale=scale(X)
pca = PCA(n_components=3)
X_PCAtransform=pd.DataFrame(pca.fit_transform(X_scale))
X_PCAtransform['target'] = y
fig = px.scatter_3d(
    X_PCAtransform,
    x=0,
    y=1,
    z=2,
    color=y,
    title='3d scatter for PCA',
    width=700,
    height=700
)
fig.show()

In [None]:
# set number of components = 7.
X_scale=scale(X)
pca = PCA(n_components=7)
pca.fit(X_scale)
# plot the eigenvalue of 7 components
plt.plot(pca.explained_variance_)
plt.xlabel(r'$j$')
plt.ylabel(r'$\lambda_j$');

From this eigenvalue graph, we can see that we need to keep all 7 principal components in order to capture enough explained variance.

In [None]:
# plot scree plot to show the percentage of each eigenvalue which indicates the explained variance.
per_var=np.round(pca.explained_variance_ratio_*100, decimals=1)
labels= ['PC'+str(x) for x in range (1, len(per_var)+1)]
plt.figure(figsize=(14,4))
plt.bar(x=range(1,len(per_var)+1),height=per_var, tick_label = labels)
plt.ylabel('percentage of explained variance')
plt.xlabel('principal component')
plt.title('scree plot')

From the graph we can see that every principal component contributes a non-negligible share of the explained variance. This suggests that the variance is spread across many of the original features, so it is relatively hard to do feature extraction and compression using `PCA`.
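
The link between the eigenvalue plot and the scree plot can be sketched as follows. On synthetic standardized data (an assumption for illustration), the values in `explained_variance_` match the eigenvalues of the sample covariance matrix, and the cumulative explained-variance ratio gives one way to pick the number of components. The 90% threshold is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic standardized data as a stand-in for the scaled features.
rng = np.random.default_rng(0)
X_toy = scale(rng.normal(size=(200, 5)))

pca_toy = PCA().fit(X_toy)

# Eigenvalues of the sample covariance matrix, sorted in decreasing order.
cov = np.cov(X_toy, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Cumulative share of variance explained by the first j components.
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
# Smallest number of components that reaches the (arbitrary) 90% threshold.
n_needed = int(np.argmax(cumulative >= 0.90) + 1)
print(n_needed, cumulative)
```

When the cumulative curve rises slowly, as with nearly uncorrelated features, most components are needed and PCA offers little compression.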

In [None]:
# show the eigenvectors for each component (n_component = 7)
columns_name=heart_disease.columns[:-1]
index_name=['pc1','pc2','pc3','pc4','pc5','pc6','pc7']
pd.DataFrame(pca.components_,columns=columns_name,index=index_name)

In [None]:
# show pca score (n_component=7) for each observations
scores = pca.transform(X_scale)
pd.DataFrame(scores,columns=index_name)

This table shows the PCA scores on each principal component for the 303 observations.

### 2.2.3 K-means Clustering Model

For **unsupervised learning**, we can also use a `K-means` clustering model to group the observations into clusters based on the given features. We also apply the `silhouette score` to evaluate the number of clusters.
#### Model Construction

In [None]:
# we set the hyperparameter n_clusters = 2
from sklearn.cluster import KMeans
model6= KMeans(n_clusters=2)
model6.fit(X_encode_scaled)
kmeans_labels=model6.predict(X_encode_scaled)

#### Crosstable labels vs actual target

In [None]:
grouping1=pd.DataFrame({'kmeans_labels':kmeans_labels,'target':y})
grouping1

In [None]:
ct1=pd.crosstab(grouping1['kmeans_labels'],grouping1['target'])
ct1

From the cross table we can see that the `K-means` clusters agree with the true labels for most observations, but the accuracy is not good enough.
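
Cluster labels are arbitrary, so cluster 0 does not necessarily correspond to target 0. One simple way to turn a cross table into a single agreement figure is to map each cluster to its majority class, as sketched below on made-up labels (the arrays `clusters` and `target` are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical cluster assignments and true labels for eight observations.
clusters = np.array([0, 0, 0, 1, 1, 1, 1, 0])
target = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# Rows are clusters, columns are true classes.
ct = pd.crosstab(clusters, target)

# Map each cluster to the class that occurs most often in it.
majority = ct.idxmax(axis=1)
aligned = pd.Series(clusters).map(majority).to_numpy()

# Share of observations whose aligned cluster label equals the true label.
agreement = float((aligned == target).mean())
print(ct)
print(agreement)
```

In this toy example cluster 0 is mostly target 1 and cluster 1 is mostly target 0, so the raw labels would look almost entirely wrong without the alignment step.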

#### Draw inertia plot

Inertia measures clustering quality: it is the sum of squared distances from each sample to the center of its assigned cluster. The elbow of the inertia graph indicates a good number of clusters.
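
The definition of inertia can be verified by hand. The sketch below fits `KMeans` on two synthetic blobs (an assumption for illustration) and recomputes the sum of squared distances from each point to its assigned cluster center; `n_init` and `random_state` are set explicitly only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs in 2D.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(5.0, 0.5, size=(30, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Look up the center assigned to each sample, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_toy - assigned_centers) ** 2).sum())
print(manual_inertia, km.inertia_)
```

Inertia always decreases as the number of clusters grows, which is why the elbow of the curve, rather than its minimum, is used to choose the number of clusters.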

In [None]:
inertia=[]
for k in range(1,11):
    model6=KMeans(n_clusters=k)
    model6.fit(X_encode_scaled)
    inertia.append(model6.inertia_)

In [None]:
plt.plot(range(1,11),inertia)
plt.title('Kmeans inertia')
plt.xlabel('number of clusters')
plt.ylabel('inertia value')

#### Evaluate model with silhouette score

In [None]:
sc_scores = []
clusters = range(2,11)
for i in clusters:  
    model=KMeans(n_clusters=i)
    model.fit(X_encode_scaled)
    labels = model.predict(X_encode_scaled).ravel()
    sc_scores.append(silhouette_score(X_encode_scaled, labels))
plt.plot(clusters, sc_scores)
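
To show how the silhouette curve is read, the sketch below applies the same loop to synthetic data with two clearly separated groups (generated with `make_blobs`, an assumption for illustration). The number of clusters with the highest silhouette score is taken as the best choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two tight, well-separated synthetic groups.
X_toy, _ = make_blobs(
    n_samples=200, centers=[[-5, -5], [5, 5]], cluster_std=0.5, random_state=0
)

scores_by_k = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
    scores_by_k[k] = silhouette_score(X_toy, labels)

best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k, best_k)
```

Splitting a compact group into more clusters lowers the score, so on this toy data the peak is at two clusters.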

### 2.2.4 Hierarchical Clustering model
Apart from `K-means` clustering, we also apply `hierarchical clustering` to cluster the dataset without the target. We again use the `silhouette score` to evaluate the number of clusters.

#### Model Construction

In [None]:
lm1 = linkage(X_encode_scaled, method='ward')
plt.figure(figsize=(12,4))
dendrogram(lm1, p=2,truncate_mode='level');

In [None]:
# cut the tree by 2 clusters
out=hierarchy.cut_tree(lm1,n_clusters=2).ravel()
grouping2=pd.DataFrame({'hierachical_labels':out,'target':y})
grouping2

In [None]:
ct2=pd.crosstab(grouping2['hierachical_labels'],grouping2['target'])
ct2
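
The two steps used above, building the linkage matrix and cutting the tree, can be seen on a tiny example. The five 2D points below are made up for illustration; with Ward linkage the three points near the origin and the two points near (5, 5) end up in separate clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage

# Hypothetical points: three near the origin and two near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])

# Each row of Z records one merge: the two merged clusters, their distance,
# and the size of the new cluster. n points give n - 1 merges.
Z = linkage(X_toy, method="ward")

# Cut the tree so that exactly two clusters remain.
labels = cut_tree(Z, n_clusters=2).ravel()
print(Z.shape, labels)
```

Unlike `K-means`, the tree is built once and can then be cut at any number of clusters without refitting.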

#### Evaluate model with silhouette score

In [None]:
sc_scores = []
clusters = range(2,11)
for i in clusters:  
    labels = hierarchy.cut_tree(lm1, n_clusters=i).ravel()
    sc_scores.append(silhouette_score(X_encode_scaled, labels))
plt.plot(clusters, sc_scores)