# Introduction
 
Hello! I recently started studying Machine Learning. I have always found it quite interesting, but only now have I actually started to study the methods, the models, and the general idea. It was quite interesting to see that so much of it comes down to linear regression and matrix multiplication hahaha
Anyway, I've been following both [Andrew Ng's Machine Learning course](https://www.coursera.org/learn/machine-learning) and this [Udemy Machine Learning course](https://www.udemy.com/course/machinelearning/), which are made by different people. Both are quite great, and even though I haven't finished them yet, I decided to start practicing what they teach.

This notebook isn't the best place to learn new things, but feedback is very welcome!

# Importing basic libraries

In [None]:
import numpy as np # Linear algebra
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)

import os # Will help us to open the dataset
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt # Plotting
import seaborn as sns

# plot_confusion_matrix was removed in scikit-learn 1.2 and is not used below, so it is not imported
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
#Import dataset
data = pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
data.head()

In [None]:
#Checking the dataset size
data.shape

## Test #1: Let's use all the features

This is evidently a classification problem: we want to predict the risk of death from heart failure. All the columns are plausible candidates for our features, so at first let's use them all with a linear Support Vector Classifier (LinearSVC). This will quite likely lead to overfitting, but let's see what we end up with anyway.

In [None]:
#Split the dataset into the features and the predicted variable, then into a train and a test set
X = data[list(data.columns.drop(["DEATH_EVENT"]))]
y = data["DEATH_EVENT"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

#Train our first model
from sklearn.svm import LinearSVC
svcModel = LinearSVC()
svcModel.fit(X_train, y_train)

#Create some predictions, so that we can check our accuracy
y_predict = svcModel.predict(X_test)

In [None]:
print(classification_report(y_test, y_predict))

In [None]:
confusion_matrix(y_test, y_predict)

In [None]:
accuracy_score(y_train, svcModel.predict(X_train))

In [None]:
accuracy_score(y_test, y_predict)

For now, we can see that this model did quite badly on both the training set and the test set, so we probably have a high-bias problem right now (that is, the model is underfitting).

This makes sense given the *ConvergenceWarning* we got while fitting: the solver hit its iteration limit before settling on a solution.
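
One common cause of that warning is features sitting on very different scales. As a minimal sketch (not what this notebook actually ran), the cell below standardizes the features before LinearSVC with a pipeline. The data is a synthetic stand-in made with `make_classification`, the names `X_demo` and `scaled_svc` are only illustrative, and the inflated column and `max_iter=10_000` are assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: sizes are arbitrary, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)
# Inflate one column to mimic a feature measured in much larger units
X_demo[:, 0] *= 100_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

# Standardize every feature, then fit the linear SVC on the scaled values
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10_000))
scaled_svc.fit(X_tr, y_tr)

train_acc = scaled_svc.score(X_tr, y_tr)
test_acc = scaled_svc.score(X_te, y_te)
print(f"train={train_acc:.3f} test={test_acc:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so nothing from the test split leaks into the preprocessing.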

## Reducing the number of features

We might be able to do something better by choosing fewer features. Two good ways to look for the best features are to inspect the correlations between the variables and to use the feature importances from an ExtraTreesClassifier.
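
To make the correlation idea concrete, here is a small sketch that ranks features by the absolute value of their correlation with the target. It runs on a made-up DataFrame (the columns `strong`, `weak`, `noise`, and `target` are hypothetical), so it only illustrates the mechanics; on this notebook's data the same one-liner would use `data` and `"DEATH_EVENT"` instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: one strong predictor, one weak predictor, one pure-noise column
demo = pd.DataFrame({
    "strong": rng.normal(size=200),
    "weak": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
signal = demo["strong"] + 0.3 * demo["weak"] + rng.normal(scale=0.5, size=200)
demo["target"] = (signal > 0).astype(int)

# Correlation of every feature with the target, sorted by magnitude
ranking = (
    demo.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Correlation only captures linear relationships, which is one reason to look at tree-based importances as well.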

In [None]:
#Heatmap
plt.figure(figsize=(10,10))
sns.heatmap(data.corr(), vmin=-1, cmap='coolwarm', annot=True);

In [None]:
#Extratrees
from sklearn.ensemble import ExtraTreesClassifier
plt.rcParams['figure.figsize']=16,9
sns.set_style("darkgrid")

XExtraTrees = data[list(data.columns.drop(["DEATH_EVENT"]))]
yExtraTrees = data["DEATH_EVENT"]

treeModel = ExtraTreesClassifier()
treeModel.fit(XExtraTrees,yExtraTrees)
feat_importances = pd.Series(treeModel.feature_importances_, index=XExtraTrees.columns)
feat_importances.nlargest(12).plot(kind='barh')
plt.show()

As we can see, time, ejection_fraction, and serum_creatinine are our best bets for a better model. So, let's do everything the same way, but with fewer features.

The info [here](https://towardsdatascience.com/feature-selection-in-python-recursive-feature-elimination-19f1c39b8d15) on recursive feature elimination might be interesting as well. I should use it in the future.
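
For future reference, this is roughly what recursive feature elimination looks like with scikit-learn's `RFE`. It is only a sketch on synthetic data: the logistic regression estimator, the choice of keeping 3 features, and the names `X_demo` and `selector` are my assumptions, not something taken from the linked article or from this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 12 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=3, n_redundant=0, random_state=0
)

# Repeatedly fit the estimator and drop the weakest feature until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of the features that were kept
print(selector.ranking_)   # 1 = kept, larger numbers were eliminated earlier
```

With a DataFrame, `X.columns[selector.support_]` would give the names of the surviving columns.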

## Test #2: Let's use some features only

In [None]:
X=data[['time','ejection_fraction','serum_creatinine']]
y=data["DEATH_EVENT"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
svcModelRed = LinearSVC()
svcModelRed.fit(X_train, y_train)
y_predict=svcModelRed.predict(X_test)

In [None]:
print(classification_report(y_test, y_predict))

In [None]:
confusion_matrix(y_test, y_predict)

In [None]:
accuracy_score(y_train, svcModelRed.predict(X_train))

In [None]:
accuracy_score(y_test, y_predict)

In [None]:
#Strangely, this model did only slightly better than the previous one.
#It also looks like we are still underfitting.
a=accuracy_score(y_test,svcModelRed.predict(X_test))
b=accuracy_score(y_train,svcModelRed.predict(X_train))
abs(a-b)

Since the difference between the two is quite small (3%) while both accuracies are still low, we actually have a high-bias problem, that is, we are underfitting our data. Now, we could tinker more with which features to use, or we could change our model. I'm going with the latter, since sklearn warned that LinearSVC did not converge, and sklearn gives us other models to try when one fails.
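
Since this train/test comparison gets repeated for every model, it could be wrapped in a small helper. The sketch below is one way to do it; `accuracy_gap` is a hypothetical name, and the synthetic data and GaussianNB model are only there to make the cell run on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def accuracy_gap(model, X_train, X_test, y_train, y_test):
    """Return (train accuracy, test accuracy, absolute gap) for a fitted model."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, abs(train_acc - test_acc)


# Synthetic stand-in data (assumption: not the heart-failure dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

demo_model = GaussianNB().fit(X_tr, y_tr)
train_acc, test_acc, gap = accuracy_gap(demo_model, X_tr, X_te, y_tr, y_te)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A small gap with low accuracies points to high bias, while a high training accuracy with a large gap points to high variance.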

Let's try nearest neighbors then!

## Test #3: Testing another model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)
y_predict=neigh.predict(X_test)

In [None]:
print(classification_report(y_test, y_predict))

In [None]:
confusion_matrix(y_test, y_predict)

In [None]:
accuracy_score(y_train,neigh.predict(X_train))

In [None]:
accuracy_score(y_test, y_predict)

In [None]:
a=accuracy_score(y_test,neigh.predict(X_test))
b=accuracy_score(y_train,neigh.predict(X_train))
abs(a-b)

Our model did quite a bit better now! The accuracy on both the training and test sets looks great, so we might be fitting the data nicely now. Maybe we could try using every feature again and see what happens, or even try another model?

## Test #4: Testing yet another model

In [None]:
from sklearn.naive_bayes import GaussianNB
bayesModel = GaussianNB()
bayesModel.fit(X_train, y_train)
y_predict=bayesModel.predict(X_test)

In [None]:
print(classification_report(y_test, y_predict))

In [None]:
confusion_matrix(y_test, y_predict)

In [None]:
accuracy_score(y_train,bayesModel.predict(X_train))

In [None]:
accuracy_score(y_test, y_predict)

In [None]:
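#Compare train and test accuracy for the Naive Bayes model (not the KNN model from Test #3)
a=accuracy_score(y_test,bayesModel.predict(X_test))
b=accuracy_score(y_train,bayesModel.predict(X_train))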
abs(a-b)

## Test #5: Let's use all the features again

In [None]:
X=data[list(data.columns.drop(["DEATH_EVENT"]))]
y=data["DEATH_EVENT"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)
y_predict=neigh.predict(X_test)

In [None]:
print(classification_report(y_test, y_predict))

In [None]:
confusion_matrix(y_test, y_predict)

In [None]:
accuracy_score(y_test, y_predict)

In [None]:
accuracy_score(y_train,neigh.predict(X_train))

In [None]:
accuracy_score(y_test,neigh.predict(X_test))

In [None]:
a=accuracy_score(y_test,neigh.predict(X_test))
b=accuracy_score(y_train,neigh.predict(X_train))
abs(a-b)

## Visualizations

Some visualizations might be nice. After all, some features might be better described by polynomials... I'm not sure if anything interesting will come up, but doing some plots should be nice practice.

Also, in the future I should put them before fitting the models... oh well.

In [None]:
sns.pairplot(data,hue='DEATH_EVENT')

# Last comments

I think it is about time I wrapped up this notebook and went to work on something else. In the end:

- KNN did the best, followed by Naive Bayes;
- Using fewer features was better than using all of them;
- The features chosen as the most important kind of show different behavior for each class in the diagonal plots of the pair plot.

So, nothing new, I think haha. Next time I should implement cross-validation, I think, and also make things more organized.
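
As a note to self, a cross-validation test could look something like the sketch below. It uses scikit-learn's `cross_val_score` with a stratified 5-fold split on synthetic data; the number of folds, the default KNN settings, and the names `X_demo` and `scores` are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 3 features, like the reduced feature set
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Stratified folds keep the class balance roughly the same in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    KNeighborsClassifier(), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print(scores)
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

Averaging over several folds should make the comparison between models depend less on one lucky or unlucky `train_test_split`.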