# Introduction

***I have made a simple notebook to help you better understand how libraries like scikit-learn are applied in practice.
If you have any doubts, feel free to reach out to me by commenting below.***

***Do upvote my notebook so that it can reach a wider audience***

**Happy learning!!**

# Workflow of this Tutorial

1. Loading Libraries 
2. Exploratory Data Analysis
3. Basic Data Visualization
4. Scaling of the Data
5. Supervised Machine Learning Algorithms
6. Neural Networks
7. Classification reports
8. Unsupervised Machine Learning Algorithms

# Understanding the Data

* age - age in years
* sex - (1 = male; 0 = female)
* cp - chest pain type
* trestbps - resting blood pressure (in mm Hg on admission to the hospital)
* chol - serum cholesterol in mg/dl
* fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg - resting electrocardiographic results
* thalach - maximum heart rate achieved
* exang - exercise induced angina (1 = yes; 0 = no)
* oldpeak - ST depression induced by exercise relative to rest
* slope - the slope of the peak exercise ST segment
* ca - number of major vessels (0-3) colored by fluoroscopy
* thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
* target - have disease or not (1=yes, 0=no)

# Libraries

In [None]:
# Core data handling
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Data splitting, preprocessing, and metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Classifiers
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Neural networks
import tensorflow as tf

# Loading the Data

In [None]:
data_df = pd.read_csv('../input/heart-disease-uci/heart.csv')
data_df.head(5)

# Checking Missing Values

In [None]:
missing_vals = data_df.isnull().sum().sum()
missing_vals

***So our dataset is free of missing values, which makes our task simpler***
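The single total above does not show which columns would be affected if there were gaps. Here is a minimal sketch of a per-column check, using a small made-up DataFrame (`toy_df` and its values are illustrative and not taken from `heart.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two deliberately missing entries
toy_df = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, np.nan, 204, 236],
    "target": [1, 1, 0, 0],
})

per_column = toy_df.isnull().sum()      # missing count for each column
total_missing = int(per_column.sum())   # same number as .isnull().sum().sum()

print(per_column)
print("Total missing:", total_missing)
```

On the real data, the same per-column view would come from `data_df.isnull().sum()`.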

In [None]:
data_df.describe()

In [None]:
data_df.mean()

***The target column indicates whether the person has heart disease: 0 = no disease, 1 = disease***

In [None]:
data_df.max()  # Maximum value per column

In [None]:
data_df.min()  # Minimum value per column

# Let's Visualize the Data

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(y=data_df['target'])
plt.title('Estimates of Diseased and Unaffected individuals')
plt.legend(["Safe", "Have Disease"])

In [None]:
plt.figure(figsize=(15,15))
sns.countplot(y=data_df['age'], hue=data_df['target'])
plt.title('Age wise data distribution with Disease estimates')

In [None]:
#Patients with Heart disease percentage
no_disease = len(data_df[data_df.target==0])
disease = len(data_df[data_df.target==1])

perc_safe = no_disease/len(data_df.target)*100
perc_diseased = disease/len(data_df.target)*100
print('Distribution of unaffected is {:.2f}%, while for diseased, it is {:.2f}%'.format(perc_safe, perc_diseased))

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2,2,1)
sns.countplot(y=data_df['sex'])
plt.ylabel('0-female, 1-Male')

plt.subplot(2,2,2)
sns.countplot(y=data_df['sex'], hue=data_df['target'])

plt.title('Gender wise data distribution')

In [None]:
#estimating male and female patients
female = len(data_df[data_df.sex==0])
male = len(data_df[data_df.sex==1])

perc_female = female/len(data_df.sex)*100
perc_male = male/len(data_df.sex)*100
print('Distribution of females is {:.2f}%, while for males, it is {:.2f}%'.format(perc_female, perc_male))

# Correlation in Dataset

***We examine correlations so that features more strongly correlated with the target can be given priority during model development***
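One way to act on this is to rank features by the absolute value of their correlation with the target. The sketch below uses synthetic stand-in data (the column names `strong_feature` and `noise_feature` are hypothetical), so the numbers say nothing about the heart dataset itself:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: one feature tied to the target, one pure noise
target = rng.integers(0, 2, size=n)
toy_df = pd.DataFrame({
    "strong_feature": target + rng.normal(0, 0.3, size=n),
    "noise_feature": rng.normal(0, 1, size=n),
    "target": target,
})

# Absolute correlation of every feature with the target, strongest first
ranked = (
    toy_df.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Keep in mind that this measures linear association only, so a low value does not prove a feature is useless to a non-linear model.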

In [None]:
corr = data_df.corr()
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot=True, linewidth = 0.2)
plt.title('Correlation in Dataset')

In [None]:
corr

In [None]:
plt.figure(figsize=(15,10))
sns.set(color_codes=True)
sns.boxplot(data=data_df, orient='h', palette = 'Set2',linewidth=2.5)

In [None]:
# Horizontal bars: sex is on the y-axis and the counts are on the x-axis
pd.crosstab(data_df.sex,data_df.target).plot(kind="barh",figsize=(15,6))
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('Frequency')
plt.xticks(rotation=0)
plt.legend(["Safe", "Have Disease"])
plt.ylabel('Sex (0 = Female, 1 = Male)')
plt.show()

In [None]:
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=data_df.age[data_df.target==1], y=data_df.thalach[data_df.target==1], color='red')
sns.scatterplot(x=data_df.age[data_df.target==0], y=data_df.thalach[data_df.target==0], color='green')
plt.legend(["Disease", "Not Disease"])

# Developing Data for Model

In [None]:
x = data_df.drop('target',axis=1)
y = data_df['target']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=5)

In [None]:
print('Training data : {},{} '.format(x_train.shape, y_train.shape))
print('Testing data : {},{} '.format(x_test.shape, y_test.shape))

# Scaling
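`StandardScaler` rescales each column to zero mean and unit variance via z = (x - mean) / std. The NumPy sketch below, on made-up numbers, mirrors the pattern used in this section: the statistics are computed from the training split only and then reused on the test split.

```python
import numpy as np

# Made-up train/test matrices (rows = samples, columns = features)
train = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
test = np.array([[65.0, 220.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

Fitting the scaler on the test split as well would leak information about the test data into preprocessing, which is why only `transform` is applied to it.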

In [None]:
scale = StandardScaler()
x_train = scale.fit_transform(x_train)
x_test = scale.transform(x_test)

In [None]:
x_train

In [None]:
score=[]

# Supervised Machine Learning Algorithms

# Logistic Regression

In [None]:
clf1=LogisticRegression()
clf1.fit(x_train,y_train)
pred1=clf1.predict(x_test)
s1=accuracy_score(y_test,pred1)
score.append(s1*100)
print(s1)

# KNN

In [None]:
knn = KNeighborsClassifier()
knn.fit(x_train,y_train)

y_true0 = knn.predict(x_test)
s2 = accuracy_score(y_test,y_true0)
score.append(s2*100)
print(s2)

# XGB

In [None]:
xgb = XGBClassifier()
xgb.fit(x_train,y_train)

y_true = xgb.predict(x_test)
s3 = accuracy_score(y_test,y_true)
score.append(s3*100)
print(s3)

# Random Forest

In [None]:
rf = RandomForestClassifier()
rf.fit(x_train,y_train)

y_true1 = rf.predict(x_test)
s4 = accuracy_score(y_test,y_true1)
score.append(s4*100)
print(s4)

# SVM

In [None]:
svc = svm.SVC()
svc.fit(x_train,y_train)

y_true2 = svc.predict(x_test)
s5 = accuracy_score(y_test,y_true2)
score.append(s5*100)
print(s5)

# Decision Tree Classifier

In [None]:
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)

y_true3 = dtc.predict(x_test)
s6 = accuracy_score(y_test,y_true3)
score.append(s6*100)
print(s6)

In [None]:
print(score)

In [None]:
label = ['LogisticRegression', 'KNN', 'XGB', 'RandomForest', 'SVM', 'DecisionTreeClassifier']
scores = pd.Series(data = score, index = label)
print(scores)

In [None]:
sc = scores.sort_index()
plt.figure(figsize=(15,10))
sc.plot(kind='barh')
plt.title('Models Accuracy Scores')

# Neural Networks

In [None]:
ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(6, activation='relu'))
ann.add(tf.keras.layers.Dense(6, activation='relu'))
ann.add(tf.keras.layers.Dense(1, activation='sigmoid'))

ann.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [None]:
history = ann.fit(x_train, y_train, batch_size = 10, epochs=100)
print(history)

In [None]:
preds = ann.evaluate(x_test,y_test, batch_size=10,verbose=2)
print('Accuracy score : {}'.format(preds[1]))

***So we got a loss of 0.377 and an accuracy of 0.88 on the test data using an ANN with two hidden layers***

# Predictions

In [None]:
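pred_ann = ann.predict(x_test)  # sigmoid probabilities, shape (n_samples, 1)
# With a single output column, argmax over axis 1 would always return 0,
# so threshold the probabilities at 0.5 to get class labels instead
pred_ann1 = (pred_ann > 0.5).astype(int).ravel()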

In [None]:
pred_ann1[:5]

In [None]:
y_test[:5]

***So our ANN correctly predicts the first 4***
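Because the output layer is a single sigmoid unit, `ann.predict` returns one probability per sample in an array of shape `(n_samples, 1)`. Taking `argmax` along axis 1 of such an array always yields 0, so class labels are usually obtained by thresholding. A small NumPy sketch with made-up probabilities (the 0.5 cut-off is a common default, not something tuned in this notebook):

```python
import numpy as np

# Made-up sigmoid outputs shaped like the result of predict(): (n_samples, 1)
probs = np.array([[0.91], [0.12], [0.55], [0.40]])

argmax_labels = np.argmax(probs, axis=1)              # only one column, so always 0
threshold_labels = (probs > 0.5).astype(int).ravel()  # 1 where probability > 0.5

print(argmax_labels)     # [0 0 0 0]
print(threshold_labels)  # [1 0 1 0]
```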

# Predictions from Random Forest

In [None]:
pred = rf.predict(x_test)
pred[:5]

In [None]:
y_test[:5]

***Random Forest, on the other hand, despite its higher accuracy, predicts only 3 of these 5 correctly***

# Classification Report
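The confusion matrix and the classification report are two views of the same counts. The sketch below derives precision, recall, and F1 for the positive class by hand from made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical). It follows scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Made-up labels purely for illustration
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # true positives
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))  # true negatives
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # false positives
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # false negatives

# Same layout as sklearn's confusion_matrix for binary labels
cm = np.array([[tn, fp], [fn, tp]])

precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(cm)
print(precision, recall, f1, accuracy)
```

For a medical screening task, recall on the disease class is often the number to watch, since a false negative means a missed patient.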

In [None]:
# for RandomForest
cf1 = confusion_matrix(y_test,y_true1)
cf1

In [None]:
rfc = classification_report(y_test,y_true1)
print(rfc)

In [None]:
#SVM
cf3 = confusion_matrix(y_test,y_true2)
cf3

In [None]:
rfc = classification_report(y_test,y_true2)
print(rfc)

In [None]:
#KNN
cf4 = confusion_matrix(y_test,y_true0)
cf4

In [None]:
rfc = classification_report(y_test,y_true0)
print(rfc)

In [None]:
# ANN
cf2 = confusion_matrix(y_test,pred_ann1)
print(cf2)

In [None]:
ann_c = classification_report(y_test,pred_ann1)
print(ann_c)

# Unsupervised Machine Learning Algorithms

***Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses. The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data.***

**Types include**
* K-Means Clustering 
* PCA
* ICA

***So let's check out PCA first***
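PCA finds orthogonal directions of maximal variance, and the explained variance ratio says what share of the total variance each direction carries. Here is a from-scratch sketch using NumPy's SVD on synthetic data. It illustrates the idea only; the notebook itself relies on `sklearn.decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 100 samples, 3 features with very different spreads
X = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.1])

X_centered = X - X.mean(axis=0)  # PCA works on centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance = S**2 / (X.shape[0] - 1)  # variance along each component
explained_ratio = explained_variance / explained_variance.sum()

projected = X_centered @ Vt[:2].T  # coordinates on the first 2 components

print(explained_ratio)
print(projected.shape)
```

When all components are kept, the ratios sum to 1. Keeping only a few of them, as with `n_components=5` here, retains the share given by the sum of their ratios.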

# PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca.fit(x_train)

In [None]:
pca_samples = pca.transform(x_train)
pca_samples[:5]

In [None]:
print(pca.explained_variance_ratio_)

In [None]:
print(pca.singular_values_)

In [None]:
ps = pd.DataFrame(pca_samples)
ps.head()

In [None]:
tocluster = pd.DataFrame(ps[[4,1]])

# K-Means
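K-Means alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. The sketch below is a plain NumPy version of those Lloyd iterations on two synthetic blobs. The helper `lloyd_kmeans` is hypothetical, and its farthest-first initialization is a simple deterministic choice made for this example; scikit-learn's `KMeans` uses k-means++ initialization by default.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=10):
    """Plain Lloyd iterations with a deterministic farthest-first start."""
    # Farthest-first init: begin with the first point, then repeatedly add
    # the point that is farthest from all centers chosen so far
    chosen = [points[0]]
    for _ in range(1, k):
        diffs = points[:, None, :] - np.array(chosen)[None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        chosen.append(points[nearest.argmax()])
    centers = np.array(chosen, dtype=float)

    for _ in range(n_iter):
        # Distance from every point to every center: shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers, labels

rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))   # around (0, 0)
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))  # around (10, 10)
points = np.vstack([blob_a, blob_b])

centers, labels = lloyd_kmeans(points, k=2)
print(centers)
print(labels[:5], labels[-5:])
```

The cluster ids themselves are arbitrary; only the grouping matters, which is worth remembering when comparing cluster labels with the `target` column.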

In [None]:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5,random_state=35).fit(tocluster)
clusters = km.cluster_centers_
k_preds = km.predict(tocluster)
k_preds[:5]