# Logistic Regression

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Heart Dataset 

In this project we will work with a dataset of patients. 
We have access to 303 patients' data. The features are listed below. 

**Age:** The person’s age in years

**Sex:** The person’s sex (1 = male, 0 = female)

**ChestPain:** chest pain type

* Value 0: asymptomatic
* Value 1: atypical angina
* Value 2: non-anginal pain
* Value 3: typical angina

**RestBP:** The person’s resting blood pressure (mm Hg on admission to the hospital)

**Chol:** The person’s cholesterol measurement in mg/dl

**Fbs:** Whether the person’s fasting blood sugar exceeds 120 mg/dl (1 = true; 0 = false)

**RestECG:** Resting electrocardiographic results

* Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
* Value 1: normal
* Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

**MaxHR:** The person’s maximum heart rate achieved

**ExAng:** Exercise induced angina (1 = yes; 0 = no)

**Oldpeak:** ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot)

**Slope:** The slope of the peak exercise ST segment

* 0: downsloping
* 1: flat
* 2: upsloping

**Ca:** The number of major vessels (0–3)

**Thal:** A blood disorder called thalassemia

* Value 0: NULL (dropped from the dataset previously)
* Value 1: fixed defect (no blood flow in some part of the heart)
* Value 2: normal blood flow
* Value 3: reversible defect (a blood flow is observed but it is not normal)

**Target:** Whether the patient has heart disease (Yes or No)

Let's take a look at the data through a Pandas dataframe:

In [2]:
heart_df = pd.read_csv("Heart.csv")
heart_df

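| Unnamed: 0 | Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | Thal | Target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 63 | 1 | typical | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | fixed | No |
| 1 | 67 | 1 | asymptomatic | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | normal | Yes |
| 2 | 67 | 1 | asymptomatic | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | reversable | Yes |
| 3 | 37 | 1 | nonanginal | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | normal | No |
| 4 | 41 | 0 | nontypical | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | normal | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45 | 1 | typical | 110 | 264 | 0 | 0 | 132 | 0 | 1.2 | 2 | 0.0 | reversable | Yes |
| 299 | 68 | 1 | asymptomatic | 144 | 193 | 1 | 0 | 141 | 0 | 3.4 | 2 | 2.0 | reversable | Yes |
| 300 | 57 | 1 | asymptomatic | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1.0 | reversable | Yes |
| 301 | 57 | 0 | nontypical | 130 | 236 | 0 | 2 | 174 | 0 | 0.0 | 2 | 1.0 | normal | Yes |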
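
In the rows shown above, ChestPain, Thal and Target hold text labels rather than the numeric codes from the feature list, so it can be worth checking column types and missing values before modelling. The sketch below does this on a handful of rows copied from the output; `sample_df` is a stand-in name used only for illustration, and in the notebook the same checks would run on `heart_df` directly.

```python
import pandas as pd

# A few rows and columns copied from the output above.
# sample_df is only a stand-in; in the notebook you would inspect heart_df itself.
sample_df = pd.DataFrame({
    "Age": [63, 67, 67, 37, 41],
    "Sex": [1, 1, 1, 1, 0],
    "ChestPain": ["typical", "asymptomatic", "asymptomatic", "nonanginal", "nontypical"],
    "Chol": [233, 286, 229, 250, 204],
    "Thal": ["fixed", "normal", "reversable", "normal", "normal"],
    "Target": ["No", "Yes", "Yes", "No", "No"],
})

# Columns that hold text rather than numbers
text_columns = [
    col for col in sample_df.columns
    if not pd.api.types.is_numeric_dtype(sample_df[col])
]

# Number of missing values in each column
missing_per_column = sample_df.isna().sum()

# How many patients fall into each class of the target
target_counts = sample_df["Target"].value_counts()

print("Text columns:", text_columns)
print(missing_per_column)
print(target_counts)
```

Text columns would need encoding before they could be used as model inputs; the three features chosen below (Age, Sex, Chol) are already numeric.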



# Goal
We want to use logistic regression to predict whether a patient has heart disease. The "Target" column in our dataset records this: if the patient has heart disease, "Target" equals "Yes"; otherwise, "Target" equals "No".

Let's choose a few features to use to try to predict whether a patient has heart disease. We will start with the following 3 features:

* Age of the patient (Column **"Age"**)
* Gender of the patient (male or female - Column **"Sex"**)
* Cholesterol level of the patient (Column **"Chol"**)


# Data Preparation

First, we need to split our data into train and test sets. Before we do that, we need to get our data set up into arrays. We will put the independent features into an "X" array, and the target variable into a "y" array.

In [9]:
X = np.array(heart_df[['Age', 'Sex', 'Chol']])
y = np.array(heart_df['Target'])

Now we can split the data into training and testing sets, with 80% of the data used for training and 20% for testing. Setting the random_state parameter to 1 makes the split reproducible: the same rows land in the training and testing sets on every run.

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
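
To make the 80/20 split concrete, here is a small sketch on placeholder arrays with the same number of rows as our dataset (303). The values in `X_demo` and `y_demo` are synthetic, and the `stratify` argument is an optional addition that the split above does not use; it asks scikit-learn to keep the share of "Yes" and "No" labels roughly equal in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with 303 rows and 3 features (not the real patients)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 3))
y_demo = np.where(rng.random(303) < 0.45, "Yes", "No")

# Same settings as above, plus stratify (an optional extra, not used in the notebook)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Share of "Yes" labels overall and in the test set
share_all = (y_demo == "Yes").mean()
share_test = (y_te == "Yes").mean()

print("Train shape:", X_tr.shape)
print("Test shape:", X_te.shape)
print("Share of 'Yes' overall:", round(share_all, 3))
print("Share of 'Yes' in test set:", round(share_test, 3))
```

With 303 rows and test_size=0.2, scikit-learn rounds the test set up to 61 rows, which leaves 242 rows for training.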


# Implementing Logistic Regression

Next, we will fit a logistic regression model on our training data to see how well these features predict whether a patient has heart disease.

In [13]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()
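
As a rough illustration of what the fitted model computes: binary logistic regression passes a weighted sum of the features through the sigmoid function, 1 / (1 + e^(-z)), to get a probability, and predicts the positive class when that probability is at least 0.5. The sketch below checks this against scikit-learn on synthetic data; `X_demo`, `y_demo`, `true_w` and `demo_model` are made-up names and values, not part of the heart dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 samples, 3 features, labels driven by a made-up weight vector
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_demo = (X_demo @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(int)

demo_model = LogisticRegression()
demo_model.fit(X_demo, y_demo)

# Linear score: weighted sum of the features plus the intercept
z = X_demo @ demo_model.coef_[0] + demo_model.intercept_[0]

# Sigmoid turns the score into a probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]

# Thresholding the probability at 0.5 gives the predicted label
manual_labels = (manual_proba >= 0.5).astype(int)

print("Probabilities match:", np.allclose(manual_proba, sklearn_proba))
print("Labels match:", bool((manual_labels == demo_model.predict(X_demo)).all()))
```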

With our logistic regression model fitted on the training data, we can predict labels for the testing data and calculate the model's accuracy score.

In [15]:
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
accuracy

0.6721311475409836

Ouch! Our model achieved only about 67.2% accuracy when predicting whether a patient has heart disease from their age, sex, and cholesterol level. The remaining 32.8% of test patients were misclassified. That share includes both patients with heart disease whom the model missed and healthy patients it flagged incorrectly.
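
To put 67.2% in context, the sketch below converts the accuracy back into patient counts and compares it with a naive baseline that always predicts the more common class. The test-set size of 61 and the per-class counts are taken from the support column of the classification report in the next section; the variable names are only for illustration.

```python
# Test-set size: 20% of 303 patients, rounded up by train_test_split
n_test = 61
reported_accuracy = 0.6721311475409836

# Convert the accuracy into whole patients
n_correct = round(reported_accuracy * n_test)
n_wrong = n_test - n_correct

# Class counts in the test set ("support" column of the classification report)
support = {"No": 34, "Yes": 27}

# Accuracy of a baseline that always predicts the more common class
baseline_accuracy = max(support.values()) / n_test

print(f"Correctly classified: {n_correct} of {n_test}")
print(f"Misclassified: {n_wrong} of {n_test}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

So the model classifies 41 of 61 test patients correctly, while always guessing "No" would already get 34 of 61 right.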


# Classification Report
Let's analyze our classification report.


In [16]:
from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

          No       0.73      0.65      0.69        34
         Yes       0.61      0.70      0.66        27

    accuracy                           0.67        61
   macro avg       0.67      0.68      0.67        61
weighted avg       0.68      0.67      0.67        61



Our classification report gives a clearer picture of how our model is performing. Looking at the precision for the "Yes" class, out of all the patients the model predicted would have heart disease, only 61% actually had it. Similarly, the recall for "Yes" tells us that out of all the patients who actually had heart disease, our model identified only 70%.
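
The arithmetic behind these scores can be sketched from the four outcome counts. The counts below are not printed anywhere in the notebook; they are inferred from the rounded recall values and the support column of the report (an assumption, though they are the only whole numbers consistent with it).

```python
# Outcome counts for the 61 test patients, inferred from the report (assumption)
tp = 19  # have heart disease, predicted "Yes"
fn = 8   # have heart disease, predicted "No"
fp = 12  # no heart disease, predicted "Yes"
tn = 22  # no heart disease, predicted "No"

# Scores for the "Yes" class
precision_yes = tp / (tp + fp)  # of those predicted "Yes", how many really are
recall_yes = tp / (tp + fn)     # of those who really are "Yes", how many were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Scores for the "No" class use the same formulas with the roles swapped
precision_no = tn / (tn + fn)
recall_no = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Yes: precision={precision_yes:.2f} recall={recall_yes:.2f} f1={f1_yes:.2f}")
print(f"No:  precision={precision_no:.2f} recall={recall_no:.2f}")
print(f"Accuracy: {accuracy:.2f}")
```

Rounded to two decimals, these reproduce the values in the report above. In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give the actual counts.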

So, not too great of a model, right? How can we make our model more accurate?

We can experiment with other features. Maybe it will turn out that age, gender, and cholesterol levels aren't the factors that most accurately predict heart disease, but rather something like fasting blood sugar levels and age are better indicators.
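
One possible way to run such an experiment is to score several feature subsets with cross-validation and compare the averages. The sketch below is only an illustration: `score_feature_set` is a hypothetical helper, `demo_df` is synthetic data in which MaxHR was arbitrarily chosen to drive the label, and the feature scaling step is an added choice that the model above does not use. It says nothing about which features matter in the real heart dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def score_feature_set(df, features, target="Target", cv=5):
    """Mean cross-validated accuracy of logistic regression on the given columns.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipeline, df[features], df[target], cv=cv, scoring="accuracy")
    return scores.mean()


# Synthetic stand-in for heart_df (values are made up)
rng = np.random.default_rng(0)
n = 300
demo_df = pd.DataFrame({
    "Age": rng.integers(30, 78, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Chol": rng.normal(245, 50, size=n),
    "MaxHR": rng.normal(150, 22, size=n),
})

# In this toy data, only MaxHR influences the label (an arbitrary choice)
risk = -(demo_df["MaxHR"] - 150) / 22 + rng.normal(scale=0.5, size=n)
demo_df["Target"] = np.where(risk > 0, "Yes", "No")

# Feature subsets to compare
candidates = {
    "Age, Sex, Chol": ["Age", "Sex", "Chol"],
    "Age, Sex, Chol, MaxHR": ["Age", "Sex", "Chol", "MaxHR"],
}

results = {name: score_feature_set(demo_df, cols) for name, cols in candidates.items()}

for name, mean_accuracy in results.items():
    print(f"{name}: {mean_accuracy:.3f}")
```

On the real data, the same loop could be pointed at `heart_df` with any numeric columns of interest. Cross-validation averages over several splits, so the comparison depends less on one particular train/test split.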


