# Heart Disease

### Context
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
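The presence/absence reduction described above can be sketched in a few lines. The `goal` values here are made up for illustration; only the rule (0 is absence, 1 to 4 is presence) comes from the description.

```python
import pandas as pd

# Hypothetical sample of the original 0-4 "goal" field (not real patient data)
goal = pd.Series([0, 2, 1, 0, 4, 3], name="goal")

# Any value from 1 to 4 counts as presence of heart disease
target = (goal > 0).astype(int)
print(target.tolist())
```

The `heart.csv` file used in this notebook already ships a binary `target` column, so this step is only needed when starting from the raw UCI files.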

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

One file has been "processed", that one containing the Cleveland database. All four unprocessed files also exist in this directory.

To see Test Costs (donated by Peter Turney), please see the folder "Costs".

### Content
* age: The person's age in years
* sex: The person's sex (1 = male, 0 = female)
* cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
* trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
* chol: The person's cholesterol measurement in mg/dl
* fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
* restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
* thalach: The person's maximum heart rate achieved
* exang: Exercise induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
* slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
* ca: The number of major vessels (0-3)
* thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
* target: Heart disease (0 = no, 1 = yes)

### Acknowledgements
The original dataset is provided by UCI (https://archive.ics.uci.edu/ml/datasets/Heart+Disease).

### Inspiration
The objective is to explore the dataset to better understand how heart disease shows up in the exam results.

### Imports

In [None]:
!pip install joblib

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer
from joblib import dump, load
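from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator takes the same (estimator, X, y) arguments
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator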
import warnings
warnings.filterwarnings(action='ignore')
%matplotlib inline

## Read dataset

In [None]:
dataset = pd.read_csv('../input/heart-disease-uci/heart.csv')

In [None]:
dataset.head()

In [None]:
dataset.info()

In [None]:
dataset.describe()

## Data Exploration

In [None]:
dataset.target.value_counts()

In [None]:
sns.countplot(x="target", data=dataset, palette="bwr")
plt.show()

In [None]:
countNoDisease = len(dataset[dataset.target == 0])
countHaveDisease = len(dataset[dataset.target == 1])
print("Patients without heart disease: {:.2f}%".format((countNoDisease / (len(dataset.target))*100)))
print("Patients with heart disease: {:.2f}%".format((countHaveDisease / (len(dataset.target))*100)))

In [None]:
sns.countplot(x='sex', data=dataset, palette="mako_r")
plt.xlabel("Sex (0 = female, 1= male)")
plt.show()

In [None]:
countFemale = len(dataset[dataset.sex == 0])
countMale = len(dataset[dataset.sex == 1])
print("Female Patients: {:.2f}%".format((countFemale / (len(dataset.sex))*100)))
print("Male Patients: {:.2f}%".format((countMale / (len(dataset.sex))*100)))

In [None]:
pd.crosstab(dataset.age,dataset.target).plot(kind="bar",figsize=(20,6))
plt.title('Heart Disease Frequency for Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

#### Which age band tends to have the most heart problems?

In [None]:
dataset['age'].describe()
# Replace age with a band label: 1 = up to 40, 2 = 41-45, 3 = 46-50, ..., 7 = 66-70, 8 = over 70
dataset['age'] = pd.cut(dataset['age'], bins=[0,40,45,50,55,60,65,70,300], labels=[1,2,3,4,5,6,7,8])

In [None]:
dataset.groupby(['age']).target.value_counts()

In [None]:
dataset.groupby(['age']).target.apply(lambda g: g.value_counts()/len(g))

In [None]:
dataset.groupby(['age']).target.apply(lambda g: g.value_counts()/dataset.target.value_counts())

A.:

Looking only at each age band on its own, 77% of the people in the 40 to 45 band
had heart problems.

However, looking at the data as a whole, that same band accounts for 21% of the
people in the survey who had a heart problem, while the 50 to 55 band accounts
for 22% of the heart problem cases reported in the survey.
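The two kinds of percentage in the answer above come from normalizing the same counts in two directions. A minimal sketch with made-up toy data (the band names and values are illustrative, not taken from the dataset) using `pd.crosstab`:

```python
import pandas as pd

# Hypothetical toy data: an age band label and the target for eight patients
toy = pd.DataFrame({
    "age_band": ["40-45", "40-45", "40-45", "50-55", "50-55", "50-55", "50-55", "50-55"],
    "target":   [1, 1, 0, 1, 1, 0, 0, 0],
})

# Share of each outcome inside a band (each row sums to 1)
within_band = pd.crosstab(toy["age_band"], toy["target"], normalize="index")

# Share of each band inside an outcome (each column sums to 1)
within_outcome = pd.crosstab(toy["age_band"], toy["target"], normalize="columns")

print(within_band)
print(within_outcome)
```

In this toy example the 40-45 band has the higher within-band rate (2 of 3), yet both bands contribute the same share of all positive cases (2 of 4 each), which is the same kind of contrast the answer describes.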


#### How many female patients between 40 and 45 years old have a heart problem?

In [None]:
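# sex (1 = male; 0 = female); age band 2 is the (40, 45] bin
# Combine the conditions into one mask instead of chaining boolean filters
mask = (dataset['age'] == 2) & (dataset['target'] == 1) & (dataset['sex'] == 0)
dataset[mask].target.count()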


A.:
12 people

## Data Analysis

In [None]:
col = dataset.columns       # .columns gives columns names in data
print(col)

In [None]:
target = 'target'
features = col[:-1]

#### Missing data

In [None]:
total = dataset[features].isnull().sum().sort_values(ascending = False)
percent = (dataset[features].isnull().sum()/dataset[features].isnull().count()*100).sort_values(ascending = False)
missing  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing

#### Heatmap features

In [None]:
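# Put the target first so its correlations appear in the first row and column
data_map = dataset[[target, *features]]
plt.figure(figsize=(15,15))
# numeric_only=True skips the binned (categorical) age column; needs pandas >= 1.5
sns.heatmap(data_map.corr(numeric_only=True), annot=True, square=True, cmap='coolwarm')
plt.show()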

In [None]:
for column in features:
    # Create the axes first so the scatter plot is drawn on the wide figure
    fig, ax = plt.subplots(figsize = (20, 3))
    dataset.plot(kind='scatter', x=column, y=target, ax=ax)

## Clean Dataset

In [None]:
duplicated_data = dataset.duplicated()
dataset[duplicated_data]

In [None]:
# keep=False drops every copy of a duplicated row, including the first occurrence
dataset.drop_duplicates(keep = False, inplace = True)

In [None]:
duplicated_data = dataset.duplicated()
dataset[duplicated_data]
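The `keep` argument decides how many copies of a repeated row survive. A small sketch on made-up rows, comparing `keep=False` with the pandas default `keep="first"`:

```python
import pandas as pd

# Hypothetical rows: the first two are exact duplicates of each other
toy = pd.DataFrame({"age": [63, 63, 41], "chol": [233, 233, 204]})

dropped_all = toy.drop_duplicates(keep=False)    # removes both copies of the repeated row
kept_first = toy.drop_duplicates(keep="first")   # keeps one copy of the repeated row

print(len(dropped_all), len(kept_first))
```

Which behavior is appropriate depends on whether a repeated row is a data-entry error or a genuine second patient with identical measurements; the notebook does not say which applies here.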

## Data Preprocessing

In [None]:
y = dataset[target]
X = dataset.drop([target], axis=1)

In [None]:
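# select_dtypes is the public way to pick numeric columns; the list comprehension keeps column order stable
numerical_columns = list(X.select_dtypes(include='number').columns)
categorical_columns = [column for column in X.columns if column not in numerical_columns]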

In [None]:
numerical_pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])

In [None]:
categorical_pipeline = Pipeline([
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ])

In [None]:
transformer = ColumnTransformer([
    ("numerical", numerical_pipeline, numerical_columns),
    ("categorical", categorical_pipeline, categorical_columns)
])

In [None]:
X, X_validation, y, y_validation = train_test_split(X, y, test_size = 0.3, random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

In [None]:
y_validation.value_counts()
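Splitting 70/30 twice leaves roughly 49% of the rows for training, 21% for testing, and 30% for validation. The sketch below reproduces those proportions on synthetic arrays. The names `X_all`, `X_rest`, and so on are hypothetical, and `stratify` is an addition that the notebook does not use; it keeps the class balance equal across the three parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, perfectly balanced classes
X_all = np.arange(200).reshape(100, 2)
y_all = np.array([0, 1] * 50)

# First split: hold out 30% for validation
X_rest, X_val, y_rest, y_val = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=0)

# Second split: 30% of the remainder becomes the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_rest, y_rest, test_size=0.3, stratify=y_rest, random_state=42)

print(len(X_tr), len(X_te), len(X_val))
```

Using distinct names for each stage also avoids overwriting the full `X` and `y` with the 70% remainder.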

## Models

### Choose the best K for KNN

In [None]:
def train_model(k_value,metric='euclidean'):
    knn = KNeighborsClassifier(n_neighbors=k_value,metric=metric)
    X_train_transformer = transformer.fit_transform(X_train)
    X_test_transformer = transformer.transform(X_test)
    knn.fit(X_train_transformer,y_train.values.ravel())
    y_pred = knn.predict(X_test_transformer)
    return np.mean(y_pred != y_test.values.ravel())

In [None]:
def train_model_without_scaler(k_value,metric='euclidean'):
    knn = KNeighborsClassifier(n_neighbors=k_value,metric=metric)
    knn.fit(X_train,y_train.values.ravel())
    y_pred = knn.predict(X_test)
    return np.mean(y_pred != y_test.values.ravel())

In [None]:
def plot_error_rate(error_rate):
    plt.figure(figsize=(10,6))
    plt.plot(range(1,50),error_rate,color='blue', linestyle='dashed', marker='o',
             markerfacecolor='red', markersize=10)
    plt.title('Error Rate vs. K Value')
    plt.xlabel('K')
    plt.ylabel('Error Rate')

####  Euclidean

In [None]:
error_rate = [ train_model(k_value) for k_value  in range(1,50) ]
plot_error_rate(error_rate)

#### Cosine

In [None]:
error_rate = [ train_model(k_value,metric='cosine') for k_value  in range(1,50) ]
plot_error_rate(error_rate)

#### Correlation

In [None]:
error_rate = [ train_model(k_value,metric='correlation') for k_value  in range(1,50) ]
plot_error_rate(error_rate)
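The error-rate curves above pick K by scoring on the test split. An alternative sketch, assuming cross-validation on the training data is acceptable, lets `GridSearchCV` search K and the metric together. The data comes from `make_classification` and is only a stand-in for the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

search = GridSearchCV(
    knn_pipe,
    param_grid={
        "knn__n_neighbors": list(range(1, 30)),
        "knn__metric": ["euclidean", "cosine"],
    },
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

This keeps the test split untouched until the final evaluation, at the cost of a longer search.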


In [None]:
def train_ensemble_models(X, y):
    clf1 = KNeighborsClassifier(n_neighbors=12, metric='euclidean')
    clf2 = GaussianNB()
    clf3 = DecisionTreeClassifier()
    clf4 = RandomForestClassifier()

    for clf, label in zip([clf1, clf2, clf3, clf4], ['KNeighborsClassifier', 'GaussianNB', 'DecisionTreeClassifier','RandomForestClassifier']):
        execute_pipeline(clf, X, y, label)

In [None]:
def execute_pipeline(clf, X, y, title):
    pipe = Pipeline([
        ('transformer', transformer),
        ('reduce_dim', 'passthrough'),
        ('classify', clf)
    ])

    N_FEATURES_OPTIONS = [2, 4, 8, 12]

    param_grid = [
        {
            'reduce_dim': [PCA()],
            'reduce_dim__n_components': N_FEATURES_OPTIONS
        },
        {
            'reduce_dim': [SelectKBest()],
            'reduce_dim__k': N_FEATURES_OPTIONS
        },
    ]
    reducer_labels = ['PCA', 'KBest']

    grid = GridSearchCV(pipe,  param_grid=param_grid, scoring='accuracy', cv=10, verbose=1, n_jobs=-1, return_train_score=True)
    grid.fit(X, y)

    # cv_results_ lists the PCA settings first, then the SelectKBest settings,
    # so both score arrays reshape to (reducer, number of features)
    mean_train_scores = np.array(grid.cv_results_['mean_train_score']).reshape(len(reducer_labels), -1)
    mean_scores = np.array(grid.cv_results_['mean_test_score']).reshape(len(reducer_labels), -1)
    bar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) * (len(reducer_labels) + 1) + .5)

    plt.figure()
    COLORS = 'bgrcmyk'
    for i, (label, train_scores, reducer_scores) in enumerate(zip(reducer_labels, mean_train_scores, mean_scores)):
        plt.bar(bar_offsets + i, train_scores, label='{} train'.format(label),alpha=.7)
        plt.bar(bar_offsets + i, reducer_scores, label='{} test'.format(label), color=COLORS[i])

    plt.title(title)
    plt.xlabel('Number of features')
    plt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS)
    plt.ylabel('Classification accuracy')
    plt.ylim((0, 1))
    plt.legend(bbox_to_anchor=(0,1), loc="upper right", bbox_transform=plt.gcf().transFigure)
    plt.show()



## Conclusion

In [None]:
# Plots one chart per classifier; the function itself returns nothing
train_ensemble_models(X_train, y_train)

### Validated Model

#### Explain the choice of model.
A.: Naive Bayes was chosen because it gave the best results in the previous step.
The tree-based models showed overfitting behavior.
KNN and Naive Bayes differed little, but Naive Bayes scored slightly better than KNN.
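One way to make the overfitting observation concrete is to compare train and cross-validation accuracy for each classifier. The sketch below uses noisy synthetic data from `make_classification` (an assumption; it is not the heart dataset) and `cross_validate` with `return_train_score=True`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped, so memorizing the training set does not generalize
X_demo, y_demo = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)

results = {}
for name, clf in [("tree", DecisionTreeClassifier(random_state=0)), ("nb", GaussianNB())]:
    scores = cross_validate(clf, X_demo, y_demo, cv=5, return_train_score=True)
    results[name] = {
        "train": scores["train_score"].mean(),
        "test": scores["test_score"].mean(),
    }
    # A large train/test gap is the usual sign of overfitting
    print(name, results[name])
```

An unpruned decision tree fits its training folds perfectly, so its gap to the held-out score is the number to watch; limiting `max_depth` is one common way to shrink it.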

In [None]:
def train_best_model(X_train, y_train, X_validation, y_validation):
    reduction = SelectKBest(k=8)
    model = GaussianNB()

    X_train_transformer = transformer.fit_transform(X_train)
    X_validation_transformer = transformer.transform(X_validation)

    X_train_reduction_transformer = reduction.fit_transform(X_train_transformer, y_train)
    X_validation_reduction_transformer = reduction.transform(X_validation_transformer)

    model.fit(X_train_reduction_transformer, y_train)
    y_predict = model.predict(X_validation_reduction_transformer)

    # Accuracy, precision, and recall (classification_report expects y_true first, then y_pred)

    print(classification_report(y_validation, y_predict))
    plot_confusion_matrix(model, X_validation_reduction_transformer, y_validation)

    return reduction, X_train_reduction_transformer, model

In [None]:
reduction, X_train_reduction_transformer, model = train_best_model(X_train, y_train, X_validation, y_validation)

#### Which variables influence the prediction the most?

In [None]:
# get_support returns indices into the transformed (scaled + one-hot) matrix,
# so look the names up on the fitted transformer (scikit-learn >= 1.0)
cols = reduction.get_support(indices=True)
new_features = transformer.get_feature_names_out()[cols]
dataframe = pd.DataFrame(X_train_reduction_transformer, columns=new_features)
dataframe

In [None]:
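# .values avoids index alignment: dataframe has a fresh 0..n-1 index, y_train keeps the original one
dataframe['target'] = y_train.values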

In [None]:
dataframe.describe()

In [None]:
dataframe.tail()


In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(dataframe.corr(), annot=True, square=True, cmap='coolwarm')
plt.show()

#### If I train one model only for male patients and another only for female patients, does that improve the prediction for each sex?

In [None]:
# Work on a copy so adding the target column does not modify X
new_data = X.copy()
new_data[target] = y

data_male = new_data[new_data['sex'] == 1]
data_female = new_data[new_data['sex'] == 0]


X_train_male, X_test_male, y_train_male, y_test_male = train_test_split(data_male.drop([target], axis=1), data_male[target], test_size = 0.2, random_state = 42)
X_train_female, X_test_female, y_train_female, y_test_female = train_test_split(data_female.drop([target], axis=1), data_female[target], test_size = 0.2, random_state = 42)

#### Model Male

In [None]:
_ ,_ ,_ = train_best_model(X_train_male, y_train_male, X_test_male, y_test_male)


#### Model Female

In [None]:
_ ,_ ,_ = train_best_model(X_train_female, y_train_female, X_test_female, y_test_female)

A.:
No significant improvement was observed in the model for men; the model for women, however, improved by 10%.
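A per-sex comparison is easiest to read when the single shared model is also scored separately on each group. A minimal sketch with made-up labels and predictions (the `results` frame is hypothetical):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical predictions from one shared model, tagged with each patient's sex
results = pd.DataFrame({
    "sex":    [1, 1, 1, 0, 0, 0],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 0],
})

# Accuracy of the shared model within each group
per_group = {
    sex: accuracy_score(rows["y_true"], rows["y_pred"])
    for sex, rows in results.groupby("sex")
}
print(per_group)
```

Scoring the shared model and the group-specific model on the same held-out rows gives a like-for-like comparison; with test sets this small, a difference of a few rows can move accuracy by several points.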


## Save models and results

In [None]:
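persistence = {}
# transformer is a shared global that the male/female runs refit;
# fit it on X_train again so it matches the reduction and model saved with it
persistence['transformer'] = transformer.fit(X_train)
persistence['reduction'] = reduction
persistence['model']  = model
dump(persistence, 'persist.joblib')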

In [None]:
persistence = load('persist.joblib')

transformer = persistence['transformer']
reduction = persistence['reduction']
model = persistence['model']

dataset_test_transformer = transformer.transform(X_validation)
dataset_test_reduction_transformer = reduction.transform(dataset_test_transformer)

predictions = model.predict(dataset_test_reduction_transformer)

In [None]:
output = X_validation.copy()
output['target'] = predictions

In [None]:
output.to_csv('./answer.csv', header=col, index=False)
print("Your submission was successfully saved!")
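An alternative to saving the three parts in a dictionary is to persist one fitted `Pipeline`, so the preprocessing, feature selection, and classifier always travel together. This is a sketch under assumptions: it uses synthetic data and a plain `StandardScaler` in place of the notebook's `ColumnTransformer`, and the file name `pipeline.joblib` is hypothetical.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)

# One object holds every fitted step
full_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", SelectKBest(k=8)),
    ("classify", GaussianNB()),
]).fit(X_demo, y_demo)

# Round-trip through a temporary file and check the predictions are unchanged
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")
    dump(full_pipeline, path)
    restored = load(path)

same = bool((restored.predict(X_demo) == full_pipeline.predict(X_demo)).all())
print(same)
```

With a single pipeline there is no shared transformer to be refit by accident between training and saving.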