# Diagnosing Heart Disease



## Contents:

- <b>1. Read Dataset </b>
    - Observing dataset
-  <b> 2. Data Cleaning </b>
    - 2.1 Outliers
    - 2.2 Uncorrelated Columns
    - 2.3 Filling null values
- <b> 3. Data Visualization </b>
- <b>4. Model Preparation </b>
    - 4.1 Encoding categorical features
    - 4.2 Normalization
    - 4.3 Split training and testing set
    
- <b>5. Models and tuning </b>
    - 5.1 Logistic Regression
    - 5.2 Naive Bayes
    - 5.3 K-Nearest Neighbor
    - 5.4 Decision Tree
    - 5.5 Random Forest
    - 5.6 SVM
  

We will follow the outline given above.

# 1. Read Dataset

In [None]:
import os
import pandas as pd

In [None]:
path = "../input/heart-disease-uci/"

In [None]:
df = pd.read_csv(os.path.join(path,'heart.csv'))

In [None]:
df.head()

### Observing Dataset

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

This dataset uses abbreviated column names throughout. Let's research the features and their importance for the dataset.

In [None]:
## Plotting a heatmap to observe correlation 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
corr = df.corr()

In [None]:
fig, ax = plt.subplots(figsize = (8,8))
sns.heatmap(corr, ax = ax, linewidths = 0.1);

As we can see from the above diagram:
- <b> cp </b>
- <b> thalach </b>
- <b> slope </b>
- <b> restecg </b>

are the features that stand out as correlated with <b> target </b>. Anyway, let's research more about the features and their roles in heart disease.
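Reading correlations off a heatmap by eye is error-prone, so here is a minimal sketch of ranking features by their correlation with the target instead. The `toy` frame and its values are made up for illustration; in the notebook the same two lines would be applied to `corr`.

```python
import pandas as pd

# Toy data standing in for df; the values are made up for illustration only.
toy = pd.DataFrame({
    "cp":      [0, 1, 2, 3, 0, 2, 1, 0],
    "thalach": [120, 150, 170, 165, 110, 160, 155, 125],
    "exang":   [1, 0, 0, 0, 1, 0, 0, 1],
    "target":  [0, 1, 1, 1, 0, 1, 1, 0],
})

# Correlation of every feature with the target, without the target itself.
target_corr = toy.corr()["target"].drop("target")

# Order by absolute value so strong negative correlations are not overlooked.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value is a design choice: a feature with a correlation close to -1 is as informative for a linear model as one close to +1.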

### CP (Chest Pain): 

There are four types of chest pain:
   - Typical Angina (1)
   - Atypical angina (2)
   - Non-anginal Pain (3)
   - Asymptomatic (0) 

In [None]:
df['cp'].value_counts()

### trestbps: 
Resting blood pressure (in mm Hg on admission to the hospital)

![fromheartorg.PNG](attachment:fromheartorg.PNG)

In [None]:
df['trestbps'].describe()

### chol (Cholesterol)

Cholesterol is a type of fat found in your blood. Your liver makes cholesterol for your body. You can also get cholesterol from the foods you eat.
When there is too much <b> cholesterol </b> in your blood, it builds up in the walls of your arteries, causing a process called <b> atherosclerosis, </b> a form of heart disease. This column holds the person's cholesterol measurement in mg/dl.

![fromverywellhealthcom.PNG](attachment:fromverywellhealthcom.PNG)

In [None]:
df['chol'].describe()

### fbs (fasting blood sugar)

A fasting blood sugar level less than <b> 100 mg/dL (5.6 mmol/L) </b> is  <b> normal </b>. A fasting blood sugar level from <b> 100 to 125 mg/dL (5.6 to 6.9 mmol/L) </b> is considered prediabetes. If it's <b> 126 mg/dL (7 mmol/L) </b> or <b> higher </b> on two separate tests, you have diabetes.
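The thresholds above can be sketched as a small helper. `classify_fasting_glucose` is a hypothetical function written for this illustration and is not part of the dataset or of any library. Note that the `fbs` column itself is only a binary flag (fasting blood sugar > 120 mg/dl), so this helper applies to raw readings, not to the column.

```python
def classify_fasting_glucose(mg_dl: float) -> str:
    """Map a fasting blood sugar reading in mg/dL to the category described above."""
    if mg_dl < 100:
        return "normal"
    if mg_dl < 126:
        return "prediabetes"
    # 126 mg/dL or higher; a diagnosis needs two separate tests.
    return "diabetes range"


for reading in (92, 110, 126):
    print(reading, classify_fasting_glucose(reading))
```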

In [None]:
df['fbs'].describe()

### restecg (resting electrocardiogram)

The resting electrocardiogram is a test that measures the electrical activity of the heart. The heart is a muscular organ which pumps blood through rhythmic contractions induced by electric impulses generated by the sinus node, the heart's natural pacemaker.

An electrocardiogram (ECG) is used to assess the heart rate and rhythm. This test can often detect heart disease, heart attack, an enlarged heart, or abnormal heart rhythms that may cause heart failure.

<b> ( 0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria) </b>

In [None]:
df['restecg'].describe()

### thalach (Maximum Heart Rate Achieved)

In [None]:
df['thalach'].describe()

### exang (Exercise induced angina) 

Angina is a type of pain that occurs when not enough blood flows to the heart muscle. Angina may feel like pressure in the chest, jaw or arm. It frequently may occur with exercise or stress.
<b> (1 = Yes; 0 = no) </b>

In [None]:
df['exang'].describe()

### oldpeak (ST depression)

<b> ST-segment depression </b> is associated with a 100% increase in the occurrence of three-vessel/left main <b> disease </b> and to an increased risk of subsequent cardiac events.

In [None]:
df['oldpeak'].describe()

### slope 

The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)

In [None]:
df['slope'].describe()

### ca 

The number of major vessels (0-3)

In [None]:
df['ca'].describe()

### thal ( Thalassemia)

Thalassemia is an inherited blood disorder that causes your body to have less hemoglobin than normal. Hemoglobin enables red blood cells to carry oxygen.

Heart problems such as congestive heart failure and abnormal heart rhythms can be associated with severe thalassemia.

- 0 (dataset): null
- 1 (dataset): 6 = fixed defect
- 2 (dataset): 3 = normal
- 3 (dataset): 7 = reversible defect

In [None]:
df['thal'].describe()

### target 

Heart disease (0 = no, 1 = yes)

In [None]:
df['target'].describe()

# 2. Data Cleaning
- 2.1 Outliers
- 2.2 Uncorrelated Columns
- 2.3 Filling null values

### 2.1 Treating Outliers

In [None]:
df.describe()

In [None]:
## Quickly checking on 'Chol'

In [None]:
plt.figure(figsize = (10,5))
sns.boxplot(x = 'chol', data = df);

As we can see, there are one or more outliers, but they are not extreme. Since our dataset is so small, removing them should not make a big difference.
Even so, to satisfy my curiosity, I am going to train the models on both versions of the dataset, with and without outliers.

In [None]:
## let's clone our dataset

xdf = df.copy()

- <b> xdf : </b> The copy from which the outliers will be removed.
- <b> df : </b> The original dataset, with the outliers.

In [None]:
## Lets clean the dataset

In [None]:
Q1 = xdf['chol'].quantile(0.25)
Q3 = xdf['chol'].quantile(0.75)

IQR = Q3 - Q1
# Tukey fences; named so that they do not shadow the built-ins min() and max()
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

In [None]:
# Keep only the rows whose cholesterol lies within the fences
mask = (xdf['chol'] >= lower_bound) & (xdf['chol'] <= upper_bound)
init_size = xdf.count()['chol']
xdf = xdf.loc[mask]
filtered_size = xdf.count()['chol']
print('{} ({:.2f}%) outliers removed from dataset'.format(init_size - filtered_size, 100 * (init_size - filtered_size) / init_size))

In [None]:
plt.figure(figsize = (10,5))
sns.boxplot(x = 'chol', data = xdf);
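The same 1.5 × IQR rule can be wrapped in a reusable helper, should other columns need the same treatment. This is only a sketch: `iqr_bounds` and the `toy_chol` values are made up for illustration and are not part of the notebook.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up cholesterol-like values; 564 plays the role of the outlier.
toy_chol = pd.Series([200, 210, 220, 230, 240, 250, 260, 564])
low, high = iqr_bounds(toy_chol)

# between() is inclusive on both ends by default.
kept = toy_chol[toy_chol.between(low, high)]
print(low, high, len(toy_chol) - len(kept))
```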

### 2.2 Uncorrelated Columns

Since there are only a few columns, we will not be dropping any of them.

In [None]:
fig, ax = plt.subplots(figsize = (12,12))
sns.heatmap(corr, ax = ax, annot = True, linewidths = 0.1);

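From the above figure, we can see that <b> exang </b>, <b> oldpeak </b>, <b> ca </b> and <b> thal </b> have the lowest correlation values with <b> target i.e. heart disease </b>. Keep in mind that a strongly negative correlation is still predictive; only values close to zero indicate a weak linear relationship.

These features were candidates for dropping from the new dataset <b> xdf </b>, but because the dataset has so few features we keep them all and compare the accuracy on <b> xdf </b> with the accuracy on the default dataset <b> df </b>.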

In [None]:
df.shape

In [None]:
## Since it has few features we will not drop any of them

In [None]:
xdf.shape

### 2.3 Filling missing values

In [None]:
df.isnull().sum()

*Nice!* There are no NaN values.

# 3. Data Visualization

Since visualizing raw numeric codes is difficult, we will replace the codes with labels.

In [None]:
dfvis = df.copy()

In [None]:
## Sex

# Replace the numeric codes with labels in one vectorized call
dfvis['sex'] = dfvis['sex'].replace({0: "female", 1: "male"})


In [None]:
## Cp

# Replace the numeric codes with labels in one vectorized call
dfvis['cp'] = dfvis['cp'].replace({
    0: "asymptomatic",
    1: "typical-angina",
    2: "atypical-angina",
    3: "non-anginal_pain",
})

In [None]:
## fbs

# Replace the numeric codes with labels in one vectorized call
dfvis['fbs'] = dfvis['fbs'].replace({0: "false", 1: "true"})

In [None]:
## restecg

# restecg has three codes (0, 1, 2), so each one gets its own label
dfvis['restecg'] = dfvis['restecg'].replace({
    0: "normal",
    1: "st-t_abnormality",
    2: "lv_hypertrophy",
})

In [None]:
## exang

# Replace the numeric codes with labels in one vectorized call
dfvis['exang'] = dfvis['exang'].replace({0: "false", 1: "true"})

In [None]:
## slope

# Replace the numeric codes with labels in one vectorized call
dfvis['slope'] = dfvis['slope'].replace({0: "down", 1: "flat", 2: "up"})

In [None]:
## thal

# Code 0 marks a null entry; as before, it is folded into "normal" (code 2)
dfvis['thal'] = dfvis['thal'].replace({
    0: "normal",
    1: "fixed_defect",
    2: "normal",
    3: "reversible_defect",
})

In [None]:
## target

# Replace the numeric codes with labels in one vectorized call
dfvis['target'] = dfvis['target'].replace({0: "no", 1: "yes"})

In [None]:
dfvis.head()

### Age

In [None]:
sns.displot(x = 'age', data = dfvis,aspect = 1, height = 7);

Let's see its relationship with <b> Target i.e. Heart Disease </b>

In [None]:
sns.displot(x = 'age', data = dfvis,hue = 'target', multiple="stack", aspect = 1, height = 7);

In [None]:
plt.figure(figsize = (8,8))
sns.boxplot(y = 'age',x='target',  data = dfvis);

In [None]:
plt.figure(figsize = (8,8))
sns.boxplot(y = 'age',x='target',hue = 'sex',  data = dfvis);

### Sex

(0 = Female, 1 = Male)

In [None]:
plt.figure(figsize = (8,8))
sns.countplot(x = 'sex', data = dfvis);

In [None]:
plt.figure(figsize = (8,8))
sns.countplot(x = 'sex', hue = 'target',data = dfvis);

target: yes = heart disease and no = no heart disease

The plots indicate that there are more <b> males </b> than <b> females </b> in the dataset. Also, in absolute numbers, more <b> males </b> than <b> females </b> have heart disease.

Compared to the total number of <b> males </b>, however, the <b> males </b> with <b> heart disease </b> are outnumbered by those without.

Among the <b> females </b>, on the other hand, most have heart disease relative to the total number of <b> females. </b>
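To back such statements with numbers instead of reading bar heights, the share of each target value within each sex can be computed directly. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration; in the notebook the same call would use `dfvis`):

```python
import pandas as pd

# Made-up rows for illustration; these are not the notebook's data.
toy = pd.DataFrame({
    "sex":    ["male", "male", "male", "male", "female", "female"],
    "target": ["yes",  "no",   "no",   "yes",  "yes",    "yes"],
})

# normalize="index" turns each row into shares that sum to 1 per sex.
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)
```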


In [None]:
dfvis['sex'].shape

In [None]:
# Count rows with a boolean mask instead of a Python loop
malecount = (dfvis['sex'] == "male").sum()

In [None]:
# len(dfvis) replaces the hard-coded row count
print("There are: {} males, which is {:.2f}% of the dataset".format(malecount, malecount / len(dfvis) * 100))

In [None]:
femalecount = (dfvis['sex'] == "female").sum()

In [None]:
print("There are: {} females, which is {:.2f}% of the dataset".format(femalecount, femalecount / len(dfvis) * 100))

### cp (chest pain)

In [None]:
dfvis['cp'].value_counts()

In [None]:
plt.figure(figsize = (8,8))
sns.countplot(x = 'cp', data = dfvis);

From the research we know, there are 4 types of chest pain:
   - <b> Typical Angina </b> (substernal chest pain precipitated by physical exertion or emotional stress and relieved with rest or nitroglycerin)
   - <b> Atypical angina </b> (When one experiences chest pain that doesn't meet the criteria for angina, it's known as atypical chest pain)
   - <b> Non-anginal Pain </b> (The pain typically is felt behind the breast bone (sternum) and is described as oppressive, squeezing or pressure-like.)
   - <b> Asymptomatic </b> (Silent Heart Attack)
   
If this is wrong please feel free to correct me. Thanks

Anyway, in the above dataset <b> asymptomatic </b> and <b> atypical-angina </b> are the most common.

Let's see its relation with target (Heart Disease)

In [None]:
plt.figure(figsize = (8,8))
sns.countplot(x = 'cp', hue = 'target',data = dfvis);

In [None]:
sns.catplot(y = 'age', x='cp',hue = 'target',  data = dfvis, aspect = 1, height = 7);

As we can see, <b> atypical-angina </b> and <b> typical-angina </b> are more often associated with heart disease.

### trestbps (Resting blood pressure)

In [None]:
sns.displot(x = 'trestbps', data = dfvis, aspect = 1, height = 7);

In [None]:
plt.figure(figsize = (8,8))
sns.scatterplot(x = 'age', y = 'trestbps',hue = 'target',data = dfvis);

<b> Checking its relation with Target </b>

In [None]:
sns.displot(x = 'trestbps', data = dfvis, hue= 'target', kind = 'kde',aspect = 1, height = 7);

![fromheartorg.PNG](attachment:fromheartorg.PNG)

In [None]:
plt.figure(figsize = (8,8))
sns.boxplot(y = 'trestbps',x='target', hue = 'sex', data = dfvis);

In [None]:
sns.catplot(y = 'trestbps', x='target',hue = 'sex',  data = dfvis, aspect = 1, height = 7);

### chol (cholesterol)

In [None]:
sns.displot(x = 'chol', data = dfvis, hue= 'target', kind = 'kde',aspect = 1, height = 7);

In [None]:
plt.figure(figsize = (8,8))
sns.scatterplot(x = 'age', y = 'chol',hue = 'target',data = dfvis);

In [None]:
plt.figure(figsize = (8,8))
sns.boxplot(y = 'chol',x='target',hue = 'sex',  data = dfvis);

In [None]:
sns.catplot(y = 'chol', x='target',hue = 'sex',  data = dfvis, aspect = 1, height = 7);

### fbs (fasting blood sugar)

In [None]:
dfvis['fbs'].value_counts()

In [None]:
plt.figure(figsize = (8,8))
sns.boxplot(y = 'age',x='target',hue = 'fbs',  data = dfvis);

In [None]:
sns.catplot(y = 'age', x='fbs',hue = 'target',  data = dfvis, aspect = 1, height = 7);

### thalach (Maximum Heart Rate Achieved)

In [None]:
sns.displot(x = 'thalach', data = dfvis, hue= 'target', kind = 'kde',aspect = 1, height = 7);

In [None]:
plt.figure(figsize = (8,8))
sns.boxplot(y = 'thalach', x='target',hue = 'sex',  data = dfvis);

In [None]:
sns.catplot(y = 'thalach', x='target',hue = 'sex',  data = dfvis, aspect = 1, height = 7);

### slope
The slope of the peak exercise ST segment

In [None]:
dfvis['slope'].value_counts()

In [None]:
sns.catplot(y = 'age', x='slope',hue = 'target',  data = dfvis, aspect = 1, height = 7);

# 4. Model Preparation
- 4.1 Encoding Categorical Features
- 4.2 Normalization
- 4.3 Split training and testing set

### 4.3 Split training and testing set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
## Splitting original dataset (df) without normalization

X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis = 1), df['target'], test_size = 0.30, random_state = 42)

In [None]:
## Original Data (df) with normalizing
X_train2, X_test2, y_train2, y_test2 = train_test_split(df.drop('target', axis = 1), df['target'], test_size = 0.30, random_state = 42)

In [None]:
## Outliers removed dataset (xdf) with normalization
X_train3, X_test3, y_train3, y_test3 = train_test_split(xdf.drop('target', axis = 1), xdf['target'], test_size = 0.30, random_state = 42)

If you are confused:
- <b> X_train </b> = original dataset (no normalization and no outlier removal)
- <b> X_train2 </b> = original dataset (df: no outlier removal, normalization only)
- <b> X_train3 </b> = outlier-removed dataset (xdf: with normalization)

### 4.2 Normalization

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()

In [None]:
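# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train2 = sc.fit_transform(X_train2)
X_test2 = sc.transform(X_test2)

# Refit on the outlier-removed training set before scaling its test set
X_train3 = sc.fit_transform(X_train3)
X_test3 = sc.transform(X_test3)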


If you are confused, see this:
- <b> df </b> = original dataset
- <b> df_norm </b> = original dataset with normalization
- <b> xdf_norm </b> = outliers removed dataset with normalization
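Standardization rescales every feature to zero mean and unit variance. Here is a minimal NumPy sketch of that computation, with the mean and standard deviation taken from the training rows only and then reused for the test rows. The arrays are made up for illustration, and this is a hand-written stand-in for the idea, not scikit-learn's own code.

```python
import numpy as np

# Made-up feature matrices; rows are patients, columns are features.
train = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
test = np.array([[60.0, 260.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std
print(test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step, which is why the test rows are not standardized on their own.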

# 5. Models and tuning

Since it is a classification problem, we will test the following algorithms:
- <b> Logistic Regression </b>
- <b> Naive Bayes </b>
- <b> KNN </b>
- <b> Decision Trees </b>
- <b> Random Forest </b>
- <b> SVM </b>
  
    
We use <b> accuracy, the confusion matrix </b> and the <b> precision, recall and F1 score </b> from the classification report as the way to evaluate our models.
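For reference, precision, recall and F1 can all be derived from the four counts of a binary confusion matrix. The sketch below uses made-up counts, and `precision_recall_f1` is a hypothetical helper written for this illustration; it does not guard against a zero denominator.

```python
def precision_recall_f1(tn: int, fp: int, fn: int, tp: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, laid out as scikit-learn's confusion_matrix does for labels 0/1:
# [[tn, fp],
#  [fn, tp]]
cm = [[30, 10], [5, 45]]
(tn, fp), (fn, tp) = cm
precision, recall, f1 = precision_recall_f1(tn, fp, fn, tp)
print(precision, recall, f1)
```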

In [None]:
model_score = pd.DataFrame(columns = ("df","df_norm","xdf_norm"))

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

#### Original Dataset
(Without removing outliers and without normalizing)

In [None]:
lrmodel = LogisticRegression(max_iter = 1000)

In [None]:
lrmodel.fit(X_train, y_train)

In [None]:
lrmodel.score(X_test, y_test)

In [None]:
lrscore1 = lrmodel.score(X_test, y_test)

In [None]:
y_pred = lrmodel.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)

In [None]:
print("Confusion Matrix: \n", cm)

In [None]:
print(classification_report(y_test, y_pred))

#### Original Data 

Original Data (df) with normalizing i.e X_train2

In [None]:
lrmodel2 = LogisticRegression(max_iter = 1000)

In [None]:
lrmodel2.fit(X_train2, y_train2)

In [None]:
lrmodel2.score(X_test2, y_test2)

In [None]:
scorelr2 =  lrmodel2.score(X_test2, y_test2)

In [None]:
#### Confusion matrix for Dataset

In [None]:
y_pred2 = lrmodel2.predict(X_test2)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test2, y_pred2)

In [None]:
print("Confusion Matrix: \n", cm)

In [None]:
print(classification_report(y_test2, y_pred2))

#### Outlier removed data (xdf)

Outlier removed data (xdf) with normalization (X_train3)

In [None]:
lrmodel3 = LogisticRegression(max_iter = 1000)

In [None]:
lrmodel3.fit(X_train3, y_train3)

In [None]:
lrmodel3.score(X_test3, y_test3)

In [None]:
y_pred3 = lrmodel3.predict(X_test3)

In [None]:
cm = confusion_matrix(y_test3, y_pred3)

In [None]:
print("Confusion Matrix: \n", cm)

In [None]:
print(classification_report(y_test3, y_pred3))

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
lr_row = pd.DataFrame({'df':[lrmodel.score(X_test,y_test)],'df_norm':[lrmodel2.score(X_test2,y_test2)], 'xdf_norm':[lrmodel3.score(X_test3, y_test3)]}, index = ['Logistic Regression'])
model_score = pd.concat([model_score, lr_row])

## Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, confusion_matrix

In [None]:
naiveb = GaussianNB()

#### Original Data

Original Data (df) without normalizing, i.e. X_train

In [None]:
naiveb.fit(X_train, y_train)

In [None]:
naiveb.score(X_test, y_test)

#### Original dataset (after norm) (X_train2)

In [None]:
naiveb2 = GaussianNB()

In [None]:
naiveb2.fit(X_train2, y_train2)

In [None]:
naiveb2.score(X_test2, y_test2)

#### Outliers removed dataset (xdf) (after norm)

In [None]:
naiveb3 = GaussianNB()

In [None]:
naiveb3.fit(X_train3, y_train3)

In [None]:
naiveb3.score(X_test3, y_test3)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
nb_row = pd.DataFrame({'df':[naiveb.score(X_test,y_test)],'df_norm':[naiveb2.score(X_test2,y_test2)], 'xdf_norm':[naiveb3.score(X_test3, y_test3)]}, index = ['Naive Bayes'])
model_score = pd.concat([model_score, nb_row])

## K-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

In [None]:
knn = KNeighborsClassifier(n_neighbors = 7)

#### Original Data
Original Data (df) without normalizing, i.e. X_train

In [None]:
knn.fit(X_train, y_train)

In [None]:
knn.score(X_test, y_test)

In [None]:
knn_pred = knn.predict(X_test)

In [None]:
print("Confusion Matrix: \n",confusion_matrix(y_test, knn_pred))

In [None]:
error_rate = []

for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train,y_train)
    pred_knn = knn.predict(X_test)
    error_rate.append(np.mean(pred_knn != y_test))

In [None]:
plt.figure(figsize = (8,8))
plt.plot(range(1,40), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o', markerfacecolor = 'red', markersize = 10);
plt.title('Error Rate vs K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
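Instead of reading the best k off the plot, it can also be picked programmatically. A minimal sketch, assuming a list of error rates indexed like the one built above (the `toy_error_rate` values are made up for illustration):

```python
# Hypothetical error rates for k = 1..5; the notebook's list covers k = 1..39.
toy_error_rate = [0.30, 0.25, 0.22, 0.22, 0.27]
k_values = range(1, len(toy_error_rate) + 1)

# min() returns the first k with the lowest error, so ties go to the smaller k.
best_k = min(k_values, key=lambda k: toy_error_rate[k - 1])
print(best_k, toy_error_rate[best_k - 1])
```

Keep in mind that a k chosen on the test set gives an optimistic accuracy estimate; a separate validation split or cross-validation would be a cleaner way to choose it.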

In [None]:
## let's choose the k-value

knn = KNeighborsClassifier(n_neighbors = 21)
knn.fit(X_train, y_train)
pred_knn = knn.predict(X_test)
print(knn.score(X_test, y_test))

print("\n With k = 21")
print('\n')
print("Confusion Matrix: \n", confusion_matrix(y_test, pred_knn))
print('\n')
print(classification_report(y_test,pred_knn))


#### Original Dataset after normalization

In [None]:
knn2 = KNeighborsClassifier(n_neighbors = 7)

In [None]:
knn2.fit(X_train2, y_train2)

In [None]:
knn2.score(X_test2, y_test2)

In [None]:
knn_pred2 = knn2.predict(X_test2)

In [None]:
print("Confusion Matrix: \n",confusion_matrix(y_test2, knn_pred2))

In [None]:
error_rate = []

for i in range(1,40):
    knn2 = KNeighborsClassifier(n_neighbors = i)
    knn2.fit(X_train2,y_train2)
    pred_knn2 = knn2.predict(X_test2)
    error_rate.append(np.mean(pred_knn2 != y_test2))

In [None]:
plt.figure(figsize = (8,8))
plt.plot(range(1,40), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o', markerfacecolor = 'red', markersize = 10);
plt.title('Error Rate vs K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
## let's choose the k-value

knn2 = KNeighborsClassifier(n_neighbors = 21)
knn2.fit(X_train2, y_train2)
pred_knn2 = knn2.predict(X_test2)
print(knn2.score(X_test2, y_test2))

print("\n With k = 21")
print('\n')
print("Confusion Matrix: \n", confusion_matrix(y_test2, pred_knn2))
print('\n')
print(classification_report(y_test2,pred_knn2))


#### Outliers removed dataset after normalization

In [None]:
knn3 = KNeighborsClassifier(n_neighbors = 7)

In [None]:
knn3.fit(X_train3, y_train3)

In [None]:
knn3.score(X_test3, y_test3)

In [None]:
# Predict with knn3, the model fitted on X_train3 (not knn)
knn_pred3 = knn3.predict(X_test3)

In [None]:
print("Confusion Matrix: \n",confusion_matrix(y_test3, knn_pred3))

In [None]:
error_rate = []

for i in range(1,40):
    knn3 = KNeighborsClassifier(n_neighbors = i)
    knn3.fit(X_train3,y_train3)
    pred_knn3 = knn3.predict(X_test3)
    error_rate.append(np.mean(pred_knn3 != y_test3))

In [None]:
plt.figure(figsize = (8,8))
plt.plot(range(1,40), error_rate, color = 'blue', linestyle = 'dashed', marker = 'o', markerfacecolor = 'red', markersize = 10);
plt.title('Error Rate vs K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
## let's choose the k-value

knn3 = KNeighborsClassifier(n_neighbors = 28)
knn3.fit(X_train3, y_train3)
# Predict with knn3, the model fitted on X_train3 (not knn)
pred_knn3 = knn3.predict(X_test3)
print(knn3.score(X_test3, y_test3))

print("\n With k = 28")
print('\n')
print("Confusion Matrix: \n", confusion_matrix(y_test3, pred_knn3))
print('\n')
print(classification_report(y_test3,pred_knn3))


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
knn_row = pd.DataFrame({'df':[knn.score(X_test,y_test)],'df_norm':[knn2.score(X_test2,y_test2)], 'xdf_norm':[knn3.score(X_test3, y_test3)]}, index = ['KNN'])
model_score = pd.concat([model_score, knn_row])

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier()

#### original data

In [None]:
dt = dt.fit(X_train, y_train)

In [None]:
dt_pred = dt.predict(X_test)

In [None]:
print("Accuracy:", metrics.accuracy_score(y_test, dt_pred))

In [None]:
dtcm = confusion_matrix(y_test, dt_pred)
print("Confusion Matrix: \n", dtcm)

#### original data (after norm)

In [None]:
dt2 = DecisionTreeClassifier()

In [None]:
# Fit dt2 itself; fitting dt here would overwrite the model trained on X_train
dt2 = dt2.fit(X_train2, y_train2)

In [None]:
dt2.score(X_test2, y_test2)

In [None]:
dt_pred2 = dt2.predict(X_test2)

In [None]:
dtcm = confusion_matrix(y_test2, dt_pred2)
print("Confusion Matrix: \n", dtcm)

#### outliers removed data after norm

In [None]:
dt3 = DecisionTreeClassifier()

In [None]:
dt3 = dt3.fit(X_train3, y_train3)

In [None]:
dt3.score(X_test3, y_test3)

In [None]:
# Predict with dt3, the tree fitted on X_train3 (not dt)
dt_pred3 = dt3.predict(X_test3)

In [None]:
dtcm = confusion_matrix(y_test3, dt_pred3)
print("Confusion Matrix: \n", dtcm)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
dt_row = pd.DataFrame({'df':[dt.score(X_test,y_test)],'df_norm':[dt2.score(X_test2,y_test2)], 'xdf_norm':[dt3.score(X_test3, y_test3)]}, index = ['Decision Trees'])
model_score = pd.concat([model_score, dt_row])

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfclf = RandomForestClassifier(n_estimators = 100)

#### With original Dataset

In [None]:
rfclf.fit(X_train, y_train)

In [None]:
rfclf.score(X_test, y_test)

In [None]:
rf_pred = rfclf.predict(X_test)

In [None]:
print("Confusion Matrix:\n",confusion_matrix(y_test, rf_pred))

#### With Original dataset after normalizing

In [None]:
rfclf2 = RandomForestClassifier(n_estimators = 100)

In [None]:
rfclf2.fit(X_train2, y_train2)

In [None]:
rfclf2.score(X_test2, y_test2)

In [None]:
rf_pred = rfclf2.predict(X_test2)

In [None]:
print("Confusion Matrix:\n",confusion_matrix(y_test2, rf_pred))

#### With Outlier removed dataset after normalizing

In [None]:
rfclf3 = RandomForestClassifier(n_estimators = 100)

In [None]:
rfclf3.fit(X_train3, y_train3)

In [None]:
rfclf3.score(X_test3, y_test3)

In [None]:
rf_pred = rfclf3.predict(X_test3)

In [None]:
print("Confusion Matrix:\n",confusion_matrix(y_test3, rf_pred))

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
rf_row = pd.DataFrame({'df':[rfclf.score(X_test,y_test)],'df_norm':[rfclf2.score(X_test2,y_test2)], 'xdf_norm':[rfclf3.score(X_test3, y_test3)]}, index = ['Random Forest'])
model_score = pd.concat([model_score, rf_row])

## Support Vector Machine

In [None]:
from sklearn import svm

#### With original dataset before norm(df)

In [None]:
clf_svm = svm.SVC(kernel='linear')

In [None]:
clf_svm.fit(X_train, y_train)

In [None]:
clf_svm.score(X_test, y_test)

In [None]:
svm_pred = clf_svm.predict(X_test)

In [None]:
print("Confusion Matrix:\n",confusion_matrix(y_test, svm_pred))

#### With original dataset after norm 

In [None]:
clf_svm2 = svm.SVC(kernel='linear')

In [None]:
clf_svm2.fit(X_train2, y_train2)

In [None]:
clf_svm2.score(X_test2, y_test2)

In [None]:
svm_pred2 = clf_svm2.predict(X_test2)

In [None]:
print("Confusion Matrix:\n",confusion_matrix(y_test2, svm_pred2))

#### Outliers removed Dataset after Norm

In [None]:
clf_svm3 = svm.SVC(kernel='linear')

In [None]:
clf_svm3.fit(X_train3, y_train3)

In [None]:
clf_svm3.score(X_test3, y_test3)

In [None]:
svm_pred3 = clf_svm3.predict(X_test3)

In [None]:
print("Confusion Matrix:\n",confusion_matrix(y_test3, svm_pred3))

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
svm_row = pd.DataFrame({'df':[clf_svm.score(X_test,y_test)],'df_norm':[clf_svm2.score(X_test2,y_test2)], 'xdf_norm':[clf_svm3.score(X_test3, y_test3)]}, index = ['SVM'])
model_score = pd.concat([model_score, svm_row])

### We know

- <b> df </b> = original dataset (without normalization and without removing outliers)
- <b> df_norm </b> = original dataset with normalization, without removing outliers
- <b> xdf_norm </b> = dataset after removing outliers and normalization


In [None]:
model_score

We can clearly see:
- <b> Logistic Regression </b> : has higher accuracy after normalization (*xdf_norm*)
- <b> Naive Bayes </b> : has higher accuracy before normalization (*df*)
- <b> KNN </b> : has higher accuracy after normalization (*xdf_norm*)
- <b> Decision Trees </b> : has higher accuracy after normalization
- <b> Random Forest </b> : has higher accuracy after normalization
- <b> SVM </b> : has higher accuracy before normalization
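A comparison like this can also be read straight from the score table rather than by eye. A minimal sketch, using placeholder numbers that are NOT results from this notebook (`toy_scores` and the model names are assumptions for illustration):

```python
import pandas as pd

# Placeholder accuracies, NOT results from this notebook.
toy_scores = pd.DataFrame(
    {"df": [0.80, 0.85], "df_norm": [0.82, 0.83], "xdf_norm": [0.84, 0.81]},
    index=["Model A", "Model B"],
)

# idxmax(axis=1) returns, for each row, the column holding the largest value.
best_variant = toy_scores.idxmax(axis=1)
print(best_variant)
```

On the notebook's own `model_score` table the columns may end up with object dtype, so casting with `model_score.astype(float)` before calling `idxmax` may be needed.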

We couldn't reach a higher accuracy, most likely because the dataset is quite small.