# Introduction

#### If you are new to data science, this notebook should help you. After going through it, you should be more comfortable with handling data, visualizing it, creating models, and analyzing them.

#### I have divided the notebook into 3 major parts -

#### 1. Data handling and visualization
#### 2. Creating models
#### 3. Analysing the models for better understanding

#### So, let's get started!

**PLEASE UPVOTE IF YOU FIND THIS HELPFUL, AND RECOMMEND ANYTHING I SHOULD IMPLEMENT**

In [None]:
# Importing the required libraries

# Libraries for reading and handling the data
import numpy as np, pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Libraries for creating ML model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Library for Analysing the ML model
from sklearn import metrics

%matplotlib inline

## 1. Data handling and visualizing

#### Believe it or not, these steps generally take 70% to 80% of the total time on a project. So, if you become good at them, you can save a lot of time and effort. I discuss several methods which you can apply directly to your own dataset and get an edge over others.

#### This step includes handling the missing values, correcting the data types, removing the outliers, visualizing the data, and preprocessing the data to feed it into the model. I know that many of you want to build your Machine Learning model directly without doing all of this work, but please be a little patient, as these steps are also important. If you feed trash into your ML model, it will produce nothing but trash! So, almost all of the time, you have to perform these steps to clean your data before building your machine learning model. But don't worry, these steps are simple and easy to understand, and once you practice them, you will become very good at them.

In [None]:
# Reading the data
df=pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df.head()

In [None]:
# df.info() shows basic information about the data, such as column names, data types, number of non-null values, memory usage, etc.
df.info()

In [None]:
# Checking for missing values
df.isnull().sum()

#### As there are no missing values in our data, we shall proceed with further steps. If your data has missing values, you can either remove them using [df.dropna](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) or substitute them with the mean or mode using [df.fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html). Some models raise an error if missing values are present and others produce incorrect results, so it is important to handle them.
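
#### A minimal sketch of both options on a tiny made-up frame (the `toy` data below is hypothetical, not the heart dataset). Filling numeric columns with the mean and categorical-like columns with the mode is one common choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column (not the heart dataset)
toy = pd.DataFrame({
    "age": [63.0, np.nan, 41.0, 56.0],
    "cp": [3, 2, np.nan, 2],
})

# Option 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: fill the numeric column with its mean and the categorical-like column with its mode
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["cp"] = filled["cp"].fillna(filled["cp"].mode()[0])

print(dropped.shape)
print(filled)
```

#### Dropping rows is simpler but throws data away, so filling is usually preferable when the dataset is small.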

In [None]:
# Changing the data type for the categorical variables
categorical_var=['sex','cp','fbs','restecg','exng','slp','caa','thall']
df[categorical_var]=df[categorical_var].astype('category')

numeric_var=[i for i in df.columns if i not in categorical_var][:-1] # Storing all the numeric columns in one list; [:-1] leaves out the last column, 'output' (the target)

#### The data type decides which operations can be performed on the data. To change a data type, use [df.astype](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)(data_type_to_convert_to) with a type such as 'int', 'float', 'category', etc. In our case, some columns like 'sex', 'cp', 'fbs', etc. are categorical variables, as they take only a few discrete values (such as 0 and 1 for 'sex'), but they are stored as 'int', so we will change their data type.
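
#### As a small illustration of the conversion, here is a sketch on a made-up two-column frame (the `toy` data is hypothetical). It shows that a column converted to 'category' is no longer treated as numeric.

```python
import pandas as pd

# Hypothetical toy frame: 'sex' is stored as int but only takes the values 0 and 1
toy = pd.DataFrame({"sex": [1, 0, 1, 1], "age": [63, 37, 41, 56]})
print(toy.dtypes)  # both columns start out as integers

toy["sex"] = toy["sex"].astype("category")
print(toy.dtypes)  # 'sex' is now a category, 'age' is unchanged

# Selecting numeric columns now skips 'sex'
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```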

In [None]:
# Visualizing the data
sns.pairplot(df)

In [None]:
# Visualization of categorical columns
fig, ax=plt.subplots(2,4, figsize=(10,5)) # Creates a grid of 2 rows and 4 columns as we have 8 categorical columns.
for axis, cat_var in zip(ax.ravel(), categorical_var): # ax.ravel() flattens the 2D grid of axes into a 1D array for iteration
    sns.countplot(x=cat_var,data=df, hue='output', ax=axis) # plots the count of each category, split by 'output'

plt.tight_layout() # adjusts the spacing so that the subplots do not overlap

#### To identify outliers in the numeric columns and understand the spread of the data, we will use a box plot, which is commonly used for this purpose.
![Box plot](http://www.simplypsychology.org/boxplot.jpg)

In [None]:
# Visualization of numeric columns

fig, ax=plt.subplots(1,5, figsize=(15,5))
for axis, num_var in zip(ax, numeric_var):
    sns.boxplot(y=num_var,data=df, x='output', ax=axis)

plt.tight_layout()

#### I have only scratched the surface of data visualization. I highly recommend going through the [seaborn](https://seaborn.pydata.org/) and [matplotlib](https://matplotlib.org/) documentation for more details.

#### Tip: 
#### Use 'pairplot' for getting a detailed view of the data. 
#### If both columns are numeric => scatterplot, relplot, regplot, lmplot. 
#### If both columns are categorical or one categorical and one numeric => catplot, barplot, countplot.

In [None]:
# Removing outliers

# I have treated values above the 95th percentile (or, for 'thalachh', below the 5th percentile) as outliers, but the right cut-off depends entirely on your data and your judgment.
df=df[df['trtbps']<df['trtbps'].quantile(0.95)]
df=df[df['chol']<df['chol'].quantile(0.95)]
df=df[df['thalachh']>df['thalachh'].quantile(0.05)]
df=df[df['oldpeak']<df['oldpeak'].quantile(0.95)]
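
#### The percentile cut-offs above are one choice. A different rule, and the one a box plot conventionally uses for its whiskers, flags values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below is an alternative to, not a restatement of, the code above; `iqr_bounds` and the `toy` values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits lying k * IQR beyond the first and third quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical cholesterol-like values with one obvious extreme
toy = pd.DataFrame({"chol": [200, 210, 220, 230, 240, 250, 260, 600]})
low, high = iqr_bounds(toy["chol"])

# Keep only the rows that fall inside the limits
cleaned = toy[toy["chol"].between(low, high)]
print(low, high, len(cleaned))
```

#### Unlike a fixed percentile, this rule removes nothing when a column has no extreme values.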

In [None]:
# y is the target column and X contains the features using which we have to predict y.
y=df['output']
X=df.iloc[:,:-1]

In [None]:
X.head()

#### The categorical features have to be converted into something our ML model can understand, so we use one-hot encoding. E.g., if we have a categorical feature, say 'color', with 3 values, say 'red', 'yellow', and 'green', then it will create 3 columns, one for each value. Each column holds 1 where the original column had that value and 0 otherwise. Note that the code below passes drop_first=True, which drops the first of these columns because it can be inferred from the others. You can check this out for more details: [One hot encoding](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

![One hot encoding](https://i.imgur.com/mtimFxh.png)
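
#### The 'color' example from the text can be sketched as follows (the `toy` frame is made up for illustration). `dtype=int` is passed so that the columns show 0/1 rather than True/False.

```python
import pandas as pd

# The 'color' example from the text (hypothetical data)
toy = pd.DataFrame({"color": ["red", "yellow", "green", "red"]})

# One column per value; dtype=int gives 0/1 instead of True/False
full = pd.get_dummies(toy, columns=["color"], dtype=int)
print(full)

# drop_first=True drops the first level ('green', alphabetically), leaving 2 columns;
# a row of all zeros then stands for 'green'
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)
print(reduced)
```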

In [None]:
# One hot encoding
temp = pd.get_dummies(X[categorical_var], drop_first=True)

# Concatenating the DataFrames
X_modified = pd.concat([X, temp], axis=1)

# Removing the old columns
X_modified.drop(categorical_var, axis=1, inplace=True)

In [None]:
X_modified.head()

#### Now we will split the data into train and test sets, for training and evaluation purposes respectively. We will be dividing the data into 80% training and 20% testing. We are not using a validation set for this dataset, but you can very well use it.

![train_test_split](https://upload.wikimedia.org/wikipedia/commons/b/bb/ML_dataset_training_validation_test_sets.png)
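
#### If you do want a validation set, one possible way is to call train_test_split twice. The sketch below uses random made-up data and a 60/20/20 split; those proportions, the `stratify` argument, and the fixed `random_state` are my assumptions, not something this notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, balanced binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0, 1] * 50)

# First hold out 20% for testing...
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
# ...then carve 25% of the remainder (20% of the total) out for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)
print(len(X_tr), len(X_val), len(X_te))
```

#### `stratify` keeps the class proportions the same in every part, and `random_state` makes the split reproducible.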

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_modified, y, train_size=0.8) # 80% training and 20% testing

#### Scaling your data is a good practice. It reduces the training time, as gradient descent converges faster on scaled data than on unscaled data. You can use [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) (squeezes each feature into the range 0 to 1) or [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) (shifts and scales each feature so that its mean is 0 and its standard deviation is 1).

![scaling](https://miro.medium.com/max/1200/1*yi0VULDJmBfb1NaEikEciA.png)
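
#### A small sketch of the two scalers on a single made-up feature (the `train_col` and `test_col` values are hypothetical). It also shows why the scaler is fitted on the training data only: the test data is transformed with the statistics learned from training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature, split into train and test parts
train_col = np.array([[100.0], [150.0], [200.0], [250.0]])
test_col = np.array([[300.0]])

# Fit on the training data only, then reuse the learned statistics on the test data
minmax = MinMaxScaler().fit(train_col)
standard = StandardScaler().fit(train_col)

train_mm = minmax.transform(train_col)     # training values now span 0 to 1
train_std = standard.transform(train_col)  # training values now have mean 0 and std 1

# The test value is scaled with the *training* min/max, so it can fall outside [0, 1]
test_mm = minmax.transform(test_col)
print(train_mm.ravel(), test_mm.ravel())
print(train_std.mean(), train_std.std())
```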

In [None]:
scaler=StandardScaler()
X_train[numeric_var] = scaler.fit_transform(X_train[numeric_var]) # Use fit_transform on training set
X_test[numeric_var] = scaler.transform(X_test[numeric_var]) # Use transform on test set

In [None]:
X_train.head()

## 2. Creating the Machine Learning Model

### Now we will finally create the ML model, and this is the quickest step! Just one line of code!

#### We will use logistic regression and decision trees for this classification problem, as they take very little time to train and are light to deploy online. Once you understand these, you can also try [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), [Gradient Boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html), [CatBoost](https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html), etc.
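
#### Trying another model is mostly a matter of swapping the estimator, since scikit-learn classifiers share the same fit and score methods. The sketch below is only an illustration on synthetic data from make_classification (not the heart dataset); the `candidates` dictionary and the parameter values are my assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

# Every scikit-learn classifier exposes the same fit / score methods
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in candidates.items():
    scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)  # test accuracy
    print(name, round(scores[name], 3))
```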

## Logistic Regression

![log_reg](https://www.equiskill.com/wp-content/uploads/2018/07/WhatsApp-Image-2020-02-11-at-8.30.11-PM.jpeg)

In [None]:
log_reg=LogisticRegression().fit(X_train, y_train) # Just one line of code to create the model!

In [None]:
print('Train accuracy score is', log_reg.score(X_train, y_train))
print('Test accuracy score is', log_reg.score(X_test, y_test))

## Decision Tree

![decision tree](https://www.explorium.ai/wp-content/uploads/2019/12/Decision-Trees-2.png)

In [None]:
# max_depth - maximum depth to which the tree can grow; if we increase it too much, the model can overfit
# min_samples_leaf - minimum number of samples a leaf can have
# min_samples_split - minimum number of samples a node should have to further split

tree=DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, min_samples_split=40).fit(X_train, y_train)

In [None]:
print('Train accuracy score is', tree.score(X_train, y_train))
print('Test accuracy score is', tree.score(X_test, y_test))
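
#### To see how max_depth relates to overfitting, the sketch below trains trees of different depths on synthetic, deliberately noisy data (made with make_classification, not the heart dataset; all parameter values are assumptions). An unrestricted tree memorizes the training set, while its test accuracy typically does not improve to match.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y) so that overfitting is visible
X_toy, y_toy = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

results = {}
for depth in [2, 5, None]:  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for this depth
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(depth, results[depth])
```

#### A large gap between the train and test accuracy is the usual sign that the tree is too deep.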

## 3. Analysing the Model

#### We can use a confusion matrix, the AUC ROC curve, the F1 score, MSE (for regression problems), etc. to analyze a model. Each of them is suited to a specific purpose, so we have to decide which one to use. For binary classification, the AUC ROC curve is a reasonable default unless the problem calls for a specific metric.

#### For now, it is enough to know that the larger the area under the ROC curve (AUC), the better the model. You can read more about that [here](https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/).
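
#### As a quick illustration of the confusion matrix and F1 score, here is a sketch on eight made-up labels and predictions (the `y_true` and `y_pred` values are hypothetical, not model output from this notebook).

```python
from sklearn import metrics

# Hypothetical true labels and predictions for eight patients
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are the true class, columns the predicted class: [[TN, FP], [FN, TP]]
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = metrics.f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(cm)
print(precision, recall, f1)
```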

![AUC ROC curve](https://glassboxmedicine.files.wordpress.com/2019/02/roc-curve-v2.png?w=576)
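
#### One way to build intuition for the AUC: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below checks this on four made-up scores (hypothetical values) against metrics.roc_auc_score.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = metrics.roc_auc_score(y_true, y_score)

# Equivalent view: the fraction of (positive, negative) pairs ranked correctly
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)
```

#### Both numbers agree here because there are no tied scores; an AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.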


In [None]:
def plot_auc_roc(model):
    # Predicted probability of the positive class (column 1) for each test sample
    probs = model.predict_proba(X_test)
    preds = probs[:,1]
    # False positive rate and true positive rate at every decision threshold
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)

    # Plotting the ROC curve with matplotlib (plt is already imported at the top)
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--') # diagonal = a model that guesses at random
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

In [None]:
# for Logistic Regression
plot_auc_roc(log_reg)

In [None]:
# for Decision Tree
plot_auc_roc(tree)
