# INTRODUCTION
<font color = 'blue'>
Content

1. [Load and Check Data](#1)
2. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable Analysis](#4)
        * [Numerical Variable Analysis](#5)
3. [Basic Data Analysis](#6)
4. [Outlier Detection](#7)
5. [Missing Value](#8)
6. [Visualization](#9)
    * [Correlation Map](#10)
    * [Target -- cp](#11)
    * [Target -- slope](#12)
    * [Target -- thalach](#13)
    * [Target -- sex](#14)
    * [Target -- exang](#15)
    * [Target -- oldpeak](#16)
    * [Target -- ca](#17)
    * [Target -- thal](#18)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
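try:
    plt.style.use("seaborn-v0_8-whitegrid")  # style name in matplotlib >= 3.6
except OSError:
    plt.style.use("seaborn-whitegrid")  # name used by older matplotlib releases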

import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

<a id = '1'></a><br>
# Load and Check Data

In [None]:
df = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.describe()

<a id = '2'></a><br>
# Variable Description

1. age : age of patient
2. sex : sex of patient (1 = male, 0 = female)
3. cp : chest pain type
4. trestbps : resting blood pressure
5. chol : serum cholesterol in mg/dl
6. fbs : fasting blood sugar > 120 mg/dl
7. restecg : resting electrocardiographic results (values 0,1,2)
8. thalach : maximum heart rate achieved
9. exang : exercise-induced angina
10. oldpeak : ST depression induced by exercise relative to rest
11. slope : the slope of the peak exercise ST segment
12. ca : number of major vessels (0-3) colored by fluoroscopy
13. thal : 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target : heart disease present (1) or absent (0)


In [None]:
df.info()

We do not have any null entries.

* float(1) : oldpeak
* int(13) : age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, slope, ca, thal, target

<a id = '3'></a><br>
# Univariate Variable Analysis
* Categorical Variable Analysis : sex, cp, fbs, restecg, exang, slope, ca, thal, target
* Numerical Variable Analysis : age, trestbps, chol, thalach, oldpeak
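
The split above was made by hand. As a rough cross-check, a column can be treated as categorical when it has only a few distinct values. The sketch below assumes a hypothetical helper `split_by_cardinality` and an arbitrary threshold `max_unique`; neither comes from the original notebook, and the demo frame is made up.

```python
import pandas as pd


def split_by_cardinality(frame, max_unique=10):
    """Split column names into (categorical, numerical) by distinct-value count.

    max_unique is an assumed threshold, not a value taken from the notebook.
    """
    categorical, numerical = [], []
    for column in frame.columns:
        if frame[column].nunique() <= max_unique:
            categorical.append(column)
        else:
            numerical.append(column)
    return categorical, numerical


# Tiny made-up frame, only to show the call; in the notebook pass df instead.
demo = pd.DataFrame({
    "sex": [1, 0, 1, 1, 0, 1],
    "age": [63, 37, 41, 56, 57, 44],
})
categorical, numerical = split_by_cardinality(demo, max_unique=3)
print(categorical, numerical)
```

The threshold is a judgment call, so the result is best compared against the variable descriptions rather than trusted blindly.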

<a id = '4'></a><br>
## Categorical Variable

In [None]:
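def bar_plot(variable):
    """
    Plot the frequency of each category of a categorical column of df.

    input: variable, e.g. "sex"
    output: bar plot and value counts
    """
    # count how many rows fall into each category
    var = df[variable]
    var_value = var.value_counts()

    # visualize
    plt.figure(figsize = (10,3))
    plt.bar(var_value.index, var_value)
    plt.xticks(var_value.index)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()

    print("{} \n {}".format(variable,var_value))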

In [None]:
categorical_cols = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']
for i in categorical_cols:
    bar_plot(i)

<a id = '5'></a><br>
## Numerical Variable

In [None]:
def plot_hist(variable):
    """
    Plot a histogram of a numerical column of df.

    input: variable, one of age, trestbps, chol, thalach, oldpeak
    """
    
    var = df[variable]
    
    #visualize
    plt.figure(figsize = (10,3))
    plt.hist(var,bins = 50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} Distribution with histogram".format(variable))
    plt.show()

In [None]:
numerical = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
for i in numerical:
    plot_hist(i)

<a id = '6'></a><br>
# Basic Data Analysis
* sex vs target
* cp vs target
* fbs vs target
* restecg vs target
* exang vs target
* slope vs target
* ca vs target
* thal vs target

In [None]:
df[["sex","target"]].groupby(["sex"], as_index = False).mean().sort_values(by = 'target', ascending = False)

In [None]:
df[["cp","target"]].groupby(["cp"], as_index = False).mean().sort_values(by = 'target', ascending = False)

In [None]:
df[["fbs","target"]].groupby(["fbs"], as_index = False).mean().sort_values(by = 'target', ascending = False)

In [None]:
df[["restecg","target"]].groupby(["restecg"], as_index = False).mean().sort_values(by = 'target', ascending = False)

In [None]:
df[["exang","target"]].groupby(["exang"], as_index = False).mean().sort_values(by = 'target', ascending = False)

In [None]:
df[["slope","target"]].groupby(["slope"], as_index = False).mean().sort_values(by = 'target', ascending = False)

In [None]:
df[["ca","target"]].groupby(["ca"], as_index = False).mean().sort_values(by = 'target', ascending = False)

In [None]:
df[["thal","target"]].groupby(["thal"], as_index = False).mean().sort_values(by = 'target', ascending = False)

In [None]:
df[["sex","target","cp"]].groupby(["sex","cp"], as_index = False).mean().sort_values(by = 'target', ascending = False)
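
Because target is coded 0/1, the mean of target within a group is the share of patients with heart disease in that group. The cells above repeat the same groupby for each feature; a loop can produce the same tables and also report the group size, which helps judge how reliable each rate is. The helper name `target_rate_by` and the demo rows are assumptions made for this sketch.

```python
import pandas as pd


def target_rate_by(frame, features, target="target"):
    """Return {feature: table} with the mean of target and the row count per level."""
    tables = {}
    for feature in features:
        tables[feature] = (
            frame.groupby(feature)[target]
            .agg(rate="mean", n="count")
            .sort_values("rate", ascending=False)
            .reset_index()
        )
    return tables


# Made-up rows for illustration only; in the notebook pass df and the
# feature names listed at the top of this section.
demo = pd.DataFrame({
    "sex": [0, 0, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 0, 0],
})
sex_table = target_rate_by(demo, ["sex"])["sex"]
print(sex_table)
```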

<a id = '7'></a><br>
# Outlier Detection

In [None]:
def detect_outliers(data,features):
    """Return indices of rows that are IQR outliers in more than two features."""
    outlier_indices = []
    for i in features:
        # 1st quartile
        Q1 = np.percentile(data[i],25)
        # 3rd quartile
        Q3 = np.percentile(data[i],75)
        # interquartile range
        IQR = Q3 - Q1
        # outlier step: 1.5 times the IQR
        outlier_step = IQR * 1.5
        # indices of rows outside [Q1 - step, Q3 + step] for this feature
        outlier_list_cols = data[(data[i] <  Q1 - outlier_step) | (data[i] >  Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_cols)

    # count in how many features each row was flagged
    outlier_indices = Counter(outlier_indices)
    # keep only rows flagged in more than two features
    multiple_outliers = [c for c,k in outlier_indices.items() if k>2]

    return multiple_outliers

In [None]:
df.loc[detect_outliers(df,["age","trestbps","chol","thalach","oldpeak"])]

No row is an outlier in more than two of these features, so no entries are flagged.
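
Note that `detect_outliers` only returns rows that fall outside the IQR fences in more than two features, so a row that is extreme in one or two features is not reported. A minimal sketch of a per-row count makes those cases visible. The function `count_outlier_features` and the demo values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def count_outlier_features(frame, features, k=1.5):
    """Count, per row, in how many features the value lies outside the IQR fences."""
    counts = pd.Series(0, index=frame.index)
    for feature in features:
        q1, q3 = np.percentile(frame[feature], [25, 75])
        step = k * (q3 - q1)
        outside = (frame[feature] < q1 - step) | (frame[feature] > q3 + step)
        counts += outside.astype(int)
    return counts


# Made-up values: the last row is extreme in "a" only.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "b": [5, 6, 5, 6, 5, 6, 5, 6, 5],
})
counts = count_outlier_features(demo, ["a", "b"])
print(counts[counts > 0])
```

In the notebook, `count_outlier_features(df, numerical).value_counts()` would show how many rows are flagged in zero, one, two or more features.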

<a id = '8'></a><br>
# Missing Value

In [None]:
df.columns[df.isnull().any()]

This dataset has no missing values.
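
`df.columns[df.isnull().any()]` only names the affected columns. When a dataset does contain gaps, a count and percentage per column is usually more informative. The sketch below uses a hypothetical helper `missing_summary` and a made-up frame with one missing value.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return the missing-value count and percentage for every column."""
    missing = frame.isnull().sum()
    return pd.DataFrame({
        "missing": missing,
        "percent": 100 * missing / len(frame),
    })


# Made-up rows for illustration; in the notebook pass df instead.
demo = pd.DataFrame({
    "chol": [233.0, np.nan, 204.0, 236.0],
    "age": [63, 37, 41, 56],
})
summary = missing_summary(demo)
print(summary)
```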

<a id = '9'></a><br>
# Visualization

<a id = '10'></a><br>
## Correlation Map

In [None]:
plt.subplots(figsize = (15,15))
sns.heatmap(df.corr(), annot=True, fmt='.2f')
plt.show()

* Negatively correlated features:
    * target--age, target--sex, target--exang, target--oldpeak, target--ca, target--thal


* Positively correlated features:
    * target--cp, target--thalach, target--slope
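
Reading signs off a 14 x 14 heatmap is error-prone. One way to double-check the lists is to pull the target column out of the correlation matrix and sort it. This is a sketch with a hypothetical helper `correlation_with_target` and made-up data; it uses Pearson correlation, the default of `DataFrame.corr`.

```python
import pandas as pd


def correlation_with_target(frame, target="target"):
    """Return Pearson correlations of every other column with target, sorted ascending."""
    return frame.corr()[target].drop(target).sort_values()


# Made-up columns: "up" rises with target, "down" falls with it.
demo = pd.DataFrame({
    "up": [1, 2, 3, 4],
    "down": [4, 3, 2, 1],
    "target": [0, 0, 1, 1],
})
ranked = correlation_with_target(demo)
print(ranked)
```

Negative values at the top of the result correspond to the first list, positive values at the bottom to the second.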

<a id = '11'></a><br>
## Target -- cp

In [None]:
f,ax = plt.subplots(figsize = (10,10))
ax = sns.barplot(x='cp',y='target',data=df)
plt.show()

* Patients with cp type 0 are less likely to have heart disease
* Patients with cp type 1, 2 or 3 are more likely to have heart disease
* The cp feature is a good candidate for model training

<a id = '12'></a><br>
## Target -- slope

In [None]:
f,ax = plt.subplots(figsize = (10,10))
ax = sns.barplot(x='slope',y='target',data=df)
plt.show()

* Patients with a slope value of 2 are more likely to have heart disease

<a id = '13'></a><br>
## Target -- thalach

In [None]:
f,ax = plt.subplots(figsize = (10,10))
ax = sns.boxplot(data=df,x='target',y='thalach')
plt.show()

* Patients with heart disease tend to have higher thalach (maximum heart rate) values
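
The box plot shows the difference visually; group statistics put numbers on it. A minimal sketch, assuming a hypothetical helper `summarize_by_target` and made-up rows:

```python
import pandas as pd


def summarize_by_target(frame, column, target="target"):
    """Return count, mean and median of column for each target value."""
    return frame.groupby(target)[column].agg(["count", "mean", "median"])


# Made-up rows for illustration; in the notebook pass df instead.
demo = pd.DataFrame({
    "thalach": [120, 130, 140, 160, 170, 180],
    "target": [0, 0, 0, 1, 1, 1],
})
stats = summarize_by_target(demo, "thalach")
print(stats)
```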

<a id = '14'></a><br>
## Target -- sex

In [None]:
f,ax = plt.subplots(figsize = (10,10))
ax = sns.barplot(x='sex',y='target',data=df)
plt.show()

* In this dataset, females have a higher rate of heart disease than males

<a id = '15'></a><br>
## Target -- exang

In [None]:
f,ax = plt.subplots(figsize = (10,10))
ax = sns.barplot(x='exang',y='target',data=df)
plt.show()

* Patients without exercise-induced angina (exang=0) are more likely to have heart disease

<a id = '16'></a><br>
## Target -- oldpeak

In [None]:
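g = sns.FacetGrid(df, col="target", height=5)  # "size" was renamed to "height" in seaborn 0.9
g.map(sns.histplot, 'oldpeak', bins=10, kde=True, stat="density")  # histplot (seaborn >= 0.11) replaces the deprecated distplot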
plt.show()

* Patients with oldpeak <= 1.2 have a high rate of heart disease
* Patients with high oldpeak values (2 and above) mostly do not have heart disease
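
The thresholds in these bullets can be checked by binning oldpeak and computing the share of target = 1 in each bin. The sketch below assumes a hypothetical helper `target_rate_by_bin`; the edges 1.2 and 2.0 come from the bullets above, while the open upper bin and the demo rows are assumptions.

```python
import numpy as np
import pandas as pd


def target_rate_by_bin(frame, column, edges, target="target"):
    """Bin column at the given edges and return the target rate and row count per bin."""
    bins = pd.cut(frame[column], bins=edges, include_lowest=True)
    return frame.groupby(bins, observed=False)[target].agg(rate="mean", n="count")


# Made-up rows for illustration; in the notebook pass df instead.
demo = pd.DataFrame({
    "oldpeak": [0.0, 0.5, 1.0, 1.2, 2.0, 3.0],
    "target": [1, 1, 1, 0, 0, 0],
})
binned = target_rate_by_bin(demo, "oldpeak", [0, 1.2, 2.0, np.inf])
print(binned)
```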

<a id = '17'></a><br>
## Target -- ca

In [None]:
f,ax = plt.subplots(figsize = (10,10))
ax = sns.barplot(x='ca',y='target',data=df)
plt.show()

* Patients with a ca value of 0 or 4 are more likely to have heart disease
* Patients with a ca value of 1, 2 or 3 are less likely to have heart disease
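
A bar height alone does not show how many patients stand behind it, and a level with very few rows can produce an extreme rate by chance. A cross-tabulation puts the counts next to the rate. This is a sketch with a hypothetical helper `counts_and_rates`; it assumes target is coded 0/1, and the demo rows are made up.

```python
import pandas as pd


def counts_and_rates(frame, feature, target="target"):
    """Cross-tabulate feature against target and add the share of target == 1."""
    table = pd.crosstab(frame[feature], frame[target])
    # The right-hand side is evaluated before the new column is added,
    # so the row sum covers only the two count columns.
    table["rate"] = table[1] / table.sum(axis=1)
    return table


# Made-up rows for illustration; in the notebook pass df instead.
demo = pd.DataFrame({
    "ca": [0, 0, 0, 1, 1, 4],
    "target": [1, 1, 0, 0, 0, 1],
})
table = counts_and_rates(demo, "ca")
print(table)
```

In the demo, the level 4 has a rate of 1.0 but rests on a single row, which is the kind of case worth checking before drawing conclusions from the plot.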

<a id = '18'></a><br>
## Target -- thal

In [None]:
f,ax = plt.subplots(figsize = (10,10))
ax = sns.barplot(x='thal',y='target',data=df)
plt.show()

* Patients with a thal value of 2 are more likely to have heart disease

# **TO BE CONTINUED**