# Classification Project

In this project, I have used the [Cardiovascular Disease dataset](https://www.kaggle.com/sulianova/cardiovascular-disease-dataset) from Kaggle. Based on some health conditions of an individual, my model will predict whether that person has a cardiovascular disease or not.

Features:

| Feature | Feature type | Column | Data type / values |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

In [None]:
#importing basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')

raw_data = pd.read_csv('../input/cardiovascular-disease-dataset/cardio_train.csv',sep=';')
# Check the data
raw_data.info()

In [None]:
raw_data.head(3)

In [None]:
print(f"Missing values are present: {raw_data.isnull().sum().any()}")

* There are no missing values in the data.
* I will drop the column 'id' as it is irrelevant to the target variable.
* I will transform the age column into years instead of days.
* Gender is coded as 1 and 2. These codes are labels rather than magnitudes, so I will recode the feature as a binary 0/1 indicator to keep the model from treating one gender as numerically larger than the other.
* I will check for duplicates and drop them.

In [None]:
raw_data.drop('id',axis=1,inplace=True)
raw_data.age = np.round(raw_data.age/365.25,decimals=1)
raw_data.gender = raw_data.gender.replace(2,0)
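
As a quick check of the two transformations in the cell above, here is a minimal sketch on a made-up two-row frame. The name `toy` and its values are hypothetical and are not taken from the dataset. It shows the day-to-year conversion and the recoding of gender code 2 to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, only to illustrate the transformations
toy = pd.DataFrame({"age": [18393, 20228], "gender": [2, 1]})

# Days -> years, rounded to one decimal (365.25 accounts for leap years)
toy["age"] = np.round(toy["age"] / 365.25, decimals=1)

# Codes {1, 2} -> {1, 0}: only code 2 is replaced, code 1 stays as it is
toy["gender"] = toy["gender"].replace(2, 0)

print(toy)
```

After the recode, the gender column holds only 0 and 1, so it can be read as a plain indicator variable.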

In [None]:
raw_data.duplicated().sum()

In [None]:
raw_data.drop_duplicates(inplace=True)

# Exploratory Data Analysis and Data Preprocessing

In [None]:
sns.set_style('darkgrid')
# pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='cardio',data=raw_data,palette='summer')
plt.xlabel('Presence of cardiovascular disease',fontdict={'fontsize': 15,'color':'Green'},labelpad=3);

So, the data is almost balanced. Let's see which gender has more cases of disease. The dataset does not say which gender is denoted by which number, so I will use the simple fact that women's average height is less than men's.

In [None]:
a = raw_data[raw_data["gender"]==0]["height"].mean()
b = raw_data[raw_data["gender"]==1]["height"].mean()
if a > b:
    gender = "male"
    gender1 = "female"
else:
    gender = "female"
    gender1 = "male"
print("Gender:0 is "+ gender +" & Gender:1 is " + gender1)

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='gender',hue='cardio',data=raw_data,palette="Set2");

In [None]:
sns.set_style('dark')
sns.boxplot(x=raw_data.height,palette='pink')
plt.title('Distribution of height');

In [None]:
sns.set_style('white')
sns.boxplot(x=raw_data.weight,palette='terrain')
plt.title('Distribution of weight');

I will remove extremely rare cases of height and weight. As the dataset is quite big, dropping these rows will cause no problem for modelling.

In [None]:
raw_data = raw_data[(raw_data['height']<250) & (raw_data['weight']>20.0)]

There are many outliers in the height and weight features. I combine both of these into a new feature, BMI (body mass index).

In [None]:
raw_data["bmi"] = (raw_data["weight"]/ (raw_data["height"]/100)**2).round(1)

In [None]:
raw_data[raw_data['bmi']<10]

In [None]:
raw_data[raw_data['bmi']>100].sort_values(by='weight',ascending=False).head(5)

Further, I will remove records with an extremely low or extremely high BMI because such cases seem impossible. For example, there are observations with a height of 80 cm and a weight of 165 kg, which is implausible. They may be fake observations or typing mistakes. Also, the health conditions of people with dwarfism and of abnormally tall people are quite different, so I don't want to include them.

In [None]:
data= raw_data[(raw_data['bmi']>10) & (raw_data['bmi']<100)].copy()
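
To make the BMI bounds concrete, here is a small worked sketch of the formula used for the `bmi` column (weight in kg divided by the square of height in metres). The helper name `bmi` and the 70 kg / 170 cm example are illustrative assumptions. The second call uses the 80 cm / 165 kg record described above.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    return round(weight_kg / (height_cm / 100) ** 2, 1)

typical = bmi(70, 170)      # a plausible adult (hypothetical values)
impossible = bmi(165, 80)   # the 80 cm / 165 kg record mentioned in the text

print(typical, impossible)

# Only values strictly between 10 and 100 survive the filter
print(10 < typical < 100, 10 < impossible < 100)
```

The implausible record ends up far above the upper bound of 100, so the filter removes it.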

In [None]:
sns.boxplot(x=data.bmi,color='Green')
plt.title('Distribution of BMI');

In [None]:
data.drop(['weight','height'],axis=1,inplace=True)

In [None]:
sns.violinplot(x=data.age,color='orange')
print("Observations have been recorded mostly for people with age between 40 and 65");

Now, I will remove outliers and implausible blood pressure values, including records where the systolic pressure is not higher than the diastolic pressure.

In [None]:
(data['ap_lo']>360).sum()

In [None]:
(data['ap_hi']>360).sum()

In [None]:
data= data[(data['ap_lo']<360) & (data['ap_hi']<360)].copy()
data= data[(data['ap_lo']>20) & (data['ap_hi']>20)].copy()
data=data[data['ap_hi']>data['ap_lo']]
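
The three blood pressure rules in the cell above can be checked on a tiny made-up frame. This is only a sketch: `toy_bp` and its four rows are hypothetical and chosen so that each rule rejects one row.

```python
import pandas as pd

# Hypothetical readings: one valid row and three that each break one rule
toy_bp = pd.DataFrame({
    "ap_hi": [120, 80, 14000, 120],
    "ap_lo": [80, 120, 90, 0],
})

# Rules 1 and 2: both pressures must lie strictly between 20 and 360
in_range = (
    (toy_bp["ap_hi"] > 20) & (toy_bp["ap_hi"] < 360)
    & (toy_bp["ap_lo"] > 20) & (toy_bp["ap_lo"] < 360)
)

# Rule 3: systolic pressure must exceed diastolic pressure
ordered = toy_bp["ap_hi"] > toy_bp["ap_lo"]

clean = toy_bp[in_range & ordered].copy()
print(clean)
```

Only the 120/80 row remains. The swapped reading, the out-of-range systolic value, and the zero diastolic value are all dropped.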

In [None]:
sns.violinplot(x=data.ap_hi,color='orange');

In [None]:
sns.violinplot(x=data.ap_lo,color='orange');

In [None]:
#creating dummy variables for categorical column
data['cholesterol']=data['cholesterol'].map({ 1: 'normal', 2: 'above normal', 3: 'well above normal'})
data['gluc']=data['gluc'].map({ 1: 'normal', 2: 'above normal', 3: 'well above normal'})
dummies = pd.get_dummies(data[['cholesterol','gluc']],drop_first=True)
final_data = pd.concat([data,dummies],axis=1)
final_data.drop(['cholesterol','gluc'],axis=1,inplace=True)
final_data.head()
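
It may not be obvious which level `drop_first=True` removes. The sketch below, on a hypothetical three-row frame named `toy_levels`, suggests that `pd.get_dummies` orders the levels alphabetically, so the dropped reference level is 'above normal' rather than 'normal'.

```python
import pandas as pd

# One row per level, using the same labels as the mapping above
toy_levels = pd.DataFrame(
    {"cholesterol": ["normal", "above normal", "well above normal"]}
)

# drop_first=True removes the first level in sorted order
toy_dummies = pd.get_dummies(toy_levels, drop_first=True)
print(list(toy_dummies.columns))
```

This matches the dummy column names that appear later in the notebook, such as `cholesterol_normal` and `cholesterol_well above normal`.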

In [None]:
#plotting using plotly
import cufflinks as cf
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
cf.go_offline()

print('Correlation of features with target variable')
final_data.corr()['cardio'].sort_values()[:-1].iplot(kind='barh');

### Splitting and Standardizing data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(final_data.drop('cardio',axis=1),final_data.cardio,test_size=0.30)

to_be_scaled_feat = ['age', 'ap_hi', 'ap_lo','bmi']
other_feat = ['gender', 'cholesterol_normal', 'cholesterol_well above normal',
       'gluc_normal', 'gluc_well above normal', 'smoke', 'alco', 'active']
scaler=StandardScaler()
scaler.fit(X_train[to_be_scaled_feat])
X_train[to_be_scaled_feat] = scaler.transform(X_train[to_be_scaled_feat])
X_test[to_be_scaled_feat] = scaler.transform(X_test[to_be_scaled_feat])
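
The scaler above is fitted on the training split only and then applied to both splits, which keeps test-set statistics out of the preprocessing. One alternative way to express the same idea is scikit-learn's `ColumnTransformer`. The sketch below is an illustration on synthetic data, not the notebook's own method; the frame `toy` and its columns are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.normal(53, 7, 200),
    "bmi": rng.normal(27, 5, 200),
    "smoke": rng.integers(0, 2, 200),
})
toy_train, toy_test = train_test_split(toy, test_size=0.30, random_state=0)

# Scale only the continuous columns; pass the binary column through unchanged
ct = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "bmi"])],
    remainder="passthrough",
)
train_scaled = ct.fit_transform(toy_train)  # statistics learned from train only
test_scaled = ct.transform(toy_test)        # same statistics reused on test

print(train_scaled[:, :2].mean(axis=0))
```

The scaled training columns have a mean of approximately zero, while the test columns generally do not, because they reuse the training mean and standard deviation.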

### Modelling

In [None]:
# importing classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import f1_score,accuracy_score,classification_report

classifiers = {
    'Logistic Regression' : LogisticRegression(),
    'Decision Tree' : DecisionTreeClassifier(),
    'Random Forest' : RandomForestClassifier(),
    'Support Vector Machines' : SVC(),
    'K-nearest Neighbors' : KNeighborsClassifier(),
    'XGBoost' : XGBClassifier()
}
results=pd.DataFrame(columns=['Accuracy in %','F1-score'])
for method,func in classifiers.items():
    func.fit(X_train,y_train)
    pred = func.predict(X_test)
    results.loc[method]= [100*np.round(accuracy_score(y_test,pred),decimals=4),
                         round(f1_score(y_test,pred),2)]
results
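
The table reports accuracy and F1-score. As a reminder of what the two numbers measure, here is a minimal sketch that computes both by hand from confusion counts and compares them with scikit-learn. The label vectors `y_true` and `y_hat` are made up for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```

Accuracy counts all correct predictions, while F1 is the harmonic mean of precision and recall for the positive class, which here is the presence of disease.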

# Improving Accuracy by Hyperparameter Tuning

## K-Nearest Neighbors (by elbow method)

In [None]:
error_rate = []

for i in range(1,15):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10,6))
plt.plot(range(1,15),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate');

In [None]:
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train,y_train)
knn_pred = knn.predict(X_test)
print(classification_report(y_test,knn_pred))

In [None]:
results.loc['K-nearest Neighbors(Improved)']= [100*np.round(accuracy_score(y_test,knn_pred),decimals=4),
                         round(f1_score(y_test,knn_pred),2)]

By using the **elbow method**, we have increased the accuracy of this model from 69.4% to 72%.
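
One possible refinement, offered as a sketch rather than as what this notebook does: the error curve can be computed with cross-validation on the training data, so that the test set is not used for choosing K. The example below runs on synthetic data from `make_classification`; the names `X_toy`, `y_toy`, and `cv_error` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

k_values = list(range(1, 15))
cv_error = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold accuracy on training data only; error rate = 1 - accuracy
    scores = cross_val_score(knn_k, X_toy, y_toy, cv=5)
    cv_error.append(1 - scores.mean())

best_k = k_values[int(np.argmin(cv_error))]
print(best_k)
```

The resulting `cv_error` list can be plotted against `k_values` in the same way as the error-rate curve above.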

## Random Forest (by GridSearchCV)

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [80, 90],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4],
    'min_samples_split': [8, 10],
    'n_estimators': [100, 200]}
grid=GridSearchCV(RandomForestClassifier(),param_grid,verbose=1)
grid.fit(X_train,y_train)
grid.best_params_

In [None]:
grid_pred = grid.predict(X_test)
print(classification_report(y_test,grid_pred))

By using **GridSearchCV**, we have increased the accuracy of this model from 69% to 73%.
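
To get a feel for the cost of this search, the number of candidate settings can be counted with `ParameterGrid`. This is a rough sketch: it assumes the default of 5 cross-validation folds used by recent scikit-learn versions when `cv` is not given, and the name `rf_grid` is a copy of the grid introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above, copied under a separate name
rf_grid = {
    "max_depth": [80, 90],
    "max_features": [2, 3],
    "min_samples_leaf": [3, 4],
    "min_samples_split": [8, 10],
    "n_estimators": [100, 200],
}

n_candidates = len(ParameterGrid(rf_grid))  # 2 * 2 * 2 * 2 * 2 combinations
n_folds = 5                                 # assumed default for cv
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)
```

Each additional value in any list multiplies the number of forests that have to be trained, so the grid is best kept small.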

# Deep Neural Networks

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
#splitting further into validation set
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size=0.20)

model = Sequential()
model.add(Dense(12,activation='relu'))
model.add(Dense(50,activation='relu',kernel_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1),
    bias_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1)))
model.add(Dropout(0.2))
model.add(Dense(50,activation='relu',kernel_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1),
    bias_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1)))
model.add(Dropout(0.2))
model.add(Dense(50,activation='relu',kernel_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1),
    bias_initializer=tf.random_uniform_initializer(minval=-0.1, maxval=0.1)))
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam')

early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

model.fit(x=X_train.values,y=y_train.values,
          validation_data=(X_val.values,y_val.values),
          batch_size=100,epochs=150,callbacks=[early_stop])

In [None]:
losses = pd.DataFrame(model.history.history)
losses[['loss','val_loss']].plot();

In [None]:
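# predict_classes is not available in recent TensorFlow releases; threshold the sigmoid output at 0.5 instead
dnn_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()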
print(classification_report(y_test,dnn_pred))
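
For a network with a single sigmoid output, class labels come from thresholding the predicted probability at 0.5. Here is a minimal NumPy sketch with made-up probabilities; the array `probs` is hypothetical and no model is involved.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as Keras returns them
probs = np.array([[0.10], [0.50], [0.73]])

# Strictly greater than 0.5 -> class 1, otherwise class 0; flatten to 1-D
labels = (probs > 0.5).astype("int32").ravel()
print(labels)
```

Flattening to one dimension keeps the predictions in the same shape as `y_test`, which the scikit-learn metric functions expect.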

# Results

In [None]:
results.loc['Random Forest(Improved)']= [100*np.round(accuracy_score(y_test,grid_pred),decimals=4),
                         round(f1_score(y_test,grid_pred),2)]
results.loc['Deep Neural Network']= [100*np.round(accuracy_score(y_test,dnn_pred),decimals=4),
                         round(f1_score(y_test,dnn_pred),2)]
results.sort_values(by='Accuracy in %',ascending=False).style.highlight_max()