<a href="https://colab.research.google.com/github/twisha-k/Python_notes/blob/main/72_coding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 72: Logistic Regression - Univariate Classification I

### Teacher-Student Activities

In the previous classes, you worked on the **heart disease dataset** and calculated the probability of a person having heart disease by examining the `chol` (cholesterol) values. The results roughly suggested that, in this dataset, people with lower cholesterol levels are more likely to have heart disease than people with higher cholesterol levels.


Here's the box plot that we created in the previous classes to visualise the distribution of the `chol` values.

<img src = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/heart-disease-boxplot.png' width = 1000>

From the box plot, you can observe that the first, second and third quartiles of the cholesterol values for the patients **having** heart disease (shown with the orange colour) are lower as compared to the ones for the patients **not having** it (shown with the blue colour).

Hence, we concluded that people having lower cholesterol levels are more likely to have heart disease.

In today's class, you will learn to classify, or predict, whether a person has heart disease based on cholesterol values alone, using two commonly used classification algorithms:

1. Random Forest Classifier
2. Logistic Regression

Before we start, let's first recall the attributes or columns of the dataset.

**Data Description**

The Heart Disease UCI dataset contains 14 attributes recorded for 303 patients. The `target` attribute labels patients having heart disease as 1 and patients not having heart disease as 0. The 14 attributes (or columns) are as follows:

|Columns|Description|
|-|-|
|age|age in years|
|sex|sex (1 = male; 0 = female)|
|cp|chest pain type (4 values)|
|trestbps|resting blood pressure (in mm Hg on admission to the hospital)|
|chol|serum cholesterol in $\frac{mg}{dl}$|
|fbs|fasting blood sugar > 120 $\frac{mg}{dl}$|
|restecg|resting electrocardiographic results (values 0, 1, 2)|
|thalach|maximum heart rate achieved|
|exang|exercise induced angina (1 = yes; 0 = no)|
|oldpeak|ST depression induced by exercise relative to rest|
|slope|the slope of the peak exercise ST segment|
|ca|number of major vessels (0-3) colored by fluoroscopy|
|thal|A blood disorder called thalassemia|
|target|1 = presence of heart disease; 0 = absence of heart disease|

**Source:** https://archive.ics.uci.edu/ml/datasets/Heart+Disease




---

#### Activity 1: Loading Data

Load the heart disease dataset. Here's the dataset link:

https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/uci-heart-disease/heart.csv


In [None]:
# S1.1: Import the required modules and load the heart disease dataset. Also, display the first five rows.
import pandas as pd

heart_disease = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/uci-heart-disease/heart.csv')
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Let's first look at the complete information on the `heart_disease` DataFrame.

In [None]:
# S1.2: Apply the 'info()' function on the 'heart_disease' DataFrame.
heart_disease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


You can see there are 303 entries for each column and no missing values.


---

#### Activity 2: Imbalanced Data^

The target variable `target` has two values: `0` and `1`. This means that our dataset is composed of two classes or labels:

 - Class `0` - Patients NOT having heart disease
 - Class `1` - Patients having heart disease

Such problems are known as **binary classification** problems, where the target attribute can take only two possible values (e.g. `0` and `1`).

Before we start building the model, let us find out whether our dataset is balanced or not, i.e. whether the observations are distributed roughly evenly among the classes. An imbalanced dataset is one in which the number of observations belonging to one class is significantly lower than that of the other class. Such datasets can result in a classifier that is biased towards the majority class, which hampers the results.

As our dataset has two classes, perfectly balanced data would mean 50% of the observations in each class. Let us calculate the percentage of observations in each class.

In [None]:
# S2.1: Print the percentage of records for each label in the 'target' column,
# i.e. the share of patients with (1) and without (0) heart disease.
(heart_disease['target'].value_counts() * 100) / heart_disease.shape[0]

1    54.455446
0    45.544554
Name: target, dtype: float64
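
The percentages above are close to the ideal 50% each, so the dataset can be treated as roughly balanced. As a minimal sketch of the same check in plain Python, the snippet below recomputes the class shares from label counts. The counts (165 and 138) are back-computed from the percentages shown above and the 303 rows, and the helper name `class_shares` is made up for illustration.

```python
from collections import Counter

def class_shares(labels):
    """Return the percentage of observations per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

# Counts back-computed from the output above: 165 patients in class 1, 138 in class 0.
labels = [1] * 165 + [0] * 138
shares = class_shares(labels)

for label, share in sorted(shares.items()):
    print(label, round(share, 2))
```

Neither class falls far below 50%, which is why no rebalancing step is applied in this lesson.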

---

#### Activity 3: Train-Test Split

We will first predict whether a person is a heart patient or not by analysing only their cholesterol value. Thus, the model will use only one feature (independent variable), `chol`, to predict the target variable `target`.

Before deploying our model, let's split the `heart_disease` DataFrame into a train set and a test set.

In [None]:
# S3.1: Split the DataFrame into the train and test sets.
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; 'random_state' keeps the split reproducible.
train_df, test_df = train_test_split(heart_disease, test_size = 0.3, random_state = 42)

X_train = train_df['chol']    # feature, train set
X_test = test_df['chol']      # feature, test set
y_train = train_df['target']  # target, train set
y_test = test_df['target']    # target, test set

# Reshape the 1D arrays into single-column 2D arrays.
X_train_reshaped = X_train.values.reshape(-1, 1)
X_test_reshaped = X_test.values.reshape(-1, 1)
y_train_reshaped = y_train.values.reshape(-1, 1)
y_test_reshaped = y_test.values.reshape(-1, 1)
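
The `reshape(-1, 1)` calls above matter mainly for the feature: scikit-learn estimators expect the feature matrix to be two-dimensional (rows by columns), even when there is only one column. The target, on the other hand, is expected as a one-dimensional array. Here is a small NumPy-only sketch of what the reshaping does, using the first five `chol` and `target` values from the table shown earlier; the variable names are illustrative only.

```python
import numpy as np

# First five 'chol' and 'target' values from the head() output above.
chol = np.array([233, 250, 204, 236, 354])
target = np.array([1, 1, 1, 1, 1])

# -1 lets NumPy infer the number of rows; 1 fixes a single column.
chol_2d = chol.reshape(-1, 1)
print(chol.shape, chol_2d.shape)   # (5,) (5, 1)

# A column-shaped target can be flattened back to 1D with ravel().
target_col = target.reshape(-1, 1)
target_flat = target_col.ravel()
print(target_col.shape, target_flat.shape)   # (5, 1) (5,)
```

Passing a column-shaped target to `fit()` still works, but scikit-learn emits a warning asking for a 1D array, so flattening it with `ravel()` is the tidier option.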

---

#### Activity 4: Applying Random Forest Classifier^^

We have already explored the **Random Forest Classifier** algorithm in one of our previous classes. Let's use it to find out whether it can accurately detect the patients having heart disease.

In [None]:
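from sklearn.ensemble import RandomForestClassifier
# S4.1: Build the Random Forest Classifier prediction model.
sklearn_rfc = RandomForestClassifier()
# 'ravel()' flattens the column-shaped target into the 1D array that 'fit()' expects.
sklearn_rfc.fit(X_train_reshaped, y_train_reshaped.ravel())
# Accuracy on the train set.
sklearn_rfc.score(X_train_reshaped, y_train_reshaped)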


0.8537735849056604

The model scores about 0.85 (85%) on the train set, i.e. on the same data it was trained on. Let's perform predictions on the test set using the above model.
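
A score computed on the train set can look much better than the score on unseen data, because a random forest can partly memorise its training rows. The following is a rough sketch of that effect on purely synthetic data, not on the heart disease dataset: the feature values and labels are random and unrelated by construction, and all names (`X`, `y`, `model`, and so on) are made up for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic single feature and labels with no real relationship between them.
X = rng.uniform(120, 560, size=(300, 1))
y = rng.integers(0, 2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=0)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on rows the model has seen
test_acc = model.score(X_te, y_te)   # accuracy on held-out rows
print("train accuracy:", round(train_acc, 2))
print("test accuracy:", round(test_acc, 2))
```

Since the labels here are pure noise, any gap between the two numbers comes from memorisation alone, which is why the test set is the one to trust when judging a model.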

In [None]:
# S4.2: Make predictions on the test dataset using the 'predict()' function.
rfc_y_test_pred=sklearn_rfc.predict(X_test_reshaped)
rfc_y_test_pred

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 1])

In [None]:
y_test_reshaped.shape,rfc_y_test_pred.shape

((91, 1), (91,))

Let's compute the confusion matrix to evaluate the accuracy of our classifier `sklearn_rfc`.

In [None]:
# S4.3: Display the results of 'confusion_matrix'
from sklearn.metrics import confusion_matrix,classification_report
confusion_matrix(y_test_reshaped,rfc_y_test_pred)


array([[25, 16],
       [23, 27]])

So we got the confusion matrix for our Random Forest model. Let's recall what a confusion matrix returns as output.

In this case,
 - positive outcome $\Rightarrow$ class `1` (patients having heart disease)
 - negative outcome $\Rightarrow$ class `0` (patients NOT having heart disease)

The confusion matrix reflects the following values:

1. **True Negatives (TN)** - class `0` values **correctly** predicted as class `0`.

2. **True Positives (TP)** - class `1` values **correctly** predicted as class `1`.

3. **False Positives (FP)** - class `0` values **incorrectly**  predicted as class `1`.

4. **False Negatives (FN)** - class `1` values **incorrectly**  predicted as class `0`.


||Predicted Class `0`|Predicted Class `1`|    
|-|-|-|
|Actual Class `0`|**TN = 25**|**FP = 16**|
|Actual Class `1`|**FN = 23**|**TP = 27**|


**Note:** `RandomForestClassifier` involves randomness, so unless you fix its `random_state` parameter, the predictions might be slightly different each time you build the model. Hence, the confusion matrix might have slightly different values every time.

These values of confusion matrix are used for calculating precision, recall and f1-score with the below formulae:

1. **Precision** - It is the ratio of the correctly predicted positive values (TP) to the total predicted positive values (TP + FP) i.e.

$$\text{precision} = \frac{\text{TP}}{\text{TP + FP}}$$


2. **Recall** - It is the ratio of the correctly predicted positive values (TP) to the total actual positive values (TP + FN) i.e.

$$\text{recall} = \frac{\text{TP}}{\text{TP + FN}}$$


3. **f1-score** - It is a harmonic mean of the precision and recall values, i.e.

$$\text{f1-score} = 2 \left( \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \right)$$
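
As a worked example of the three formulae, the sketch below plugs in the values from the confusion matrix output shown above (`[[25, 16], [23, 27]]`) for class `1`. It uses plain Python arithmetic only; the variable names are illustrative.

```python
# Values read from the confusion matrix output above:
# rows are actual classes, columns are predicted classes.
tn, fp = 25, 16
fn, tp = 23, 27

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1_score = 2 * (precision * recall) / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision:", round(precision, 2))
print("recall:", round(recall, 2))
print("f1-score:", round(f1_score, 2))
print("accuracy:", round(accuracy, 2))
```

Swapping the roles of the two classes (treating class `0` as the positive outcome) gives the corresponding values for class `0`.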

Let's take a look at the precision, recall and f1-score of our model using the `classification_report()` function.

In [None]:
# S4.4: Display the precision, recall and f1-score values.
print(classification_report(y_test_reshaped,rfc_y_test_pred))

              precision    recall  f1-score   support

           0       0.52      0.61      0.56        41
           1       0.63      0.54      0.58        50

    accuracy                           0.57        91
   macro avg       0.57      0.57      0.57        91
weighted avg       0.58      0.57      0.57        91



We can see that the **f1-scores** for both the labels `0` and `1` are not close to 1. Thus, the quality of the predictions is not satisfactory.

Let's check whether another classification-based machine learning model, **Logistic Regression**, performs better.

-----

#### Activity 5: Logistic Regression^^^

Logistic Regression is a type of **classification** algorithm which classifies or categorises a given set of data into different class labels. In the context of the heart disease dataset, logistic regression will classify the patients either as `1` (having heart disease) or as `0` (not having heart disease).

Logistic Regression is used to predict the probability of an outcome for an event. The predicted probability is compared with a threshold probability value (0.5 by default when you call `predict()` in scikit-learn). If the probability of an outcome is less than the threshold probability, then logistic regression classifies that outcome as `0`, otherwise as `1`. You will learn the technical details in the subsequent classes, but for the time being, let's build a Logistic Regression model on the train set by following the steps listed below:

1. Import the `LogisticRegression` class from the `sklearn.linear_model` module.
2. Create an object of the `LogisticRegression` class, say `log_reg`, and pass `n_jobs = -1` as input to its constructor.
3. Call the `fit()` function of the `LogisticRegression` class on the object created and pass the train features `X_train_reshaped` and the train labels `y_train_reshaped` as inputs to the function.

In [None]:
# T5.1: Deploy the 'LogisticRegression' model using the 'fit()' function.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(n_jobs = -1)
# 'ravel()' flattens the column-shaped target into the 1D array that 'fit()' expects.
log_reg.fit(X_train_reshaped, y_train_reshaped.ravel())
# Accuracy on the train set.
log_reg.score(X_train_reshaped, y_train_reshaped)


0.5283018867924528

This train-set accuracy (about 0.53) is lower than the one obtained with `RandomForestClassifier` (about 0.85). However, let's make the predictions on the test set and compare them with the actual labels.

In [None]:
# S5.1: Make predictions on the test dataset by using the 'predict()' function.
log_y_test_pred=log_reg.predict(X_test_reshaped)
log_y_test_pred

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1])

Let's compute the confusion matrix to calculate recall, precision and f1-scores to evaluate the logistic regression model.

In [None]:
# S5.2: Display the confusion_matrix.
confusion_matrix(y_test_reshaped,log_y_test_pred)

array([[ 2, 39],
       [ 0, 50]])

In [None]:
# S5.3: Display recall, precision and f1-score values.
print(classification_report(y_test_reshaped,log_y_test_pred))

              precision    recall  f1-score   support

           0       1.00      0.05      0.09        41
           1       0.56      1.00      0.72        50

    accuracy                           0.57        91
   macro avg       0.78      0.52      0.41        91
weighted avg       0.76      0.57      0.44        91



For class `1`, the f1-score of the Logistic Regression model is fairly high (0.72) because the model predicts class `1` for almost every test sample: all 50 actual positives are identified correctly (true positives), but only 2 of the 41 actual negatives are (true negatives). The Random Forest Classifier finds more true negatives (25) but fewer true positives (27). So neither model achieves both a high number of true positives and a high number of true negatives.

You will soon get to learn how both these models work behind the scenes and then you will develop a sense of which classification model to use for different kinds of problem statements.

Let us predict the labels with both classifiers on some arbitrary cholesterol values, say 180 and 260.

In [None]:
# S5.4: Predict labels with cholesterol levels 180 and 260 for both models
import numpy as np
print(sklearn_rfc.predict(np.array(180).reshape(-1,1)))
print(sklearn_rfc.predict(np.array(260).reshape(-1,1)))
print(log_reg.predict(np.array(180).reshape(-1,1)))
print(log_reg.predict(np.array(260).reshape(-1,1)))

[1]
[0]
[1]
[1]
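
The thresholding rule described earlier for logistic regression can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `coef` and `intercept` values are hypothetical (they are not the parameters learned by `log_reg`), and the helper names `sigmoid` and `predict_label` are made up here. The sigmoid function itself is covered properly in the next class.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(chol, coef, intercept, threshold=0.5):
    """Return (label, probability) for a single cholesterol value."""
    prob = sigmoid(coef * chol + intercept)
    label = 1 if prob >= threshold else 0
    return label, prob

# Hypothetical parameters chosen only for illustration.
coef, intercept = -0.003, 1.0

for chol in (180, 260, 500):
    label, prob = predict_label(chol, coef, intercept)
    print(chol, label, round(prob, 3))
```

With these made-up parameters, both 180 and 260 land above the 0.5 threshold and are labelled `1`, while a much higher value such as 500 falls below it. A model whose probabilities stay above the threshold over most of the observed cholesterol range labels nearly every patient as `1`.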


Let's stop here. In the next class, we will understand the working of the Logistic Regression algorithm using the sigmoid function and will also try to improve the model.

-----