<a href="https://colab.research.google.com/github/sabinedaher20-spec/DataScience-GenAI-Submissions-/blob/main/4_02_Logistics_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4.02 Logistic Regression

Following up on our regression example, we will also run an equivalent for classification using _logistic regression_.

We'll begin by getting a dataset together, this time using one of the built-in datasets from scikit-learn. The task is to predict whether a breast tumour is malignant or benign. You can get more details here: [https://scikit-learn.org/1.5/datasets/toy_dataset.html](https://scikit-learn.org/1.5/datasets/toy_dataset.html).

In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer

# import the data
data = load_breast_cancer()

# show the dataset
print(data)

# print a blank line to separate the outputs
print('\n')

# Our dataset is a dictionary-like object. We can print the keys.
print("Dataset keys:")
print(data.keys())


{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
 

The 'data' key gives us 30 continuous features for each of 569 tumours. These will be our $x$ values. Our $Y$ value is given by the 'target' key, and is either a 0 (malignant) or a 1 (benign), which makes this a binary/two-class classification problem.
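To check which label means what, and how many rows each class has, a short sketch like the one below can help. The variable names `label_names` and `class_counts` are our own and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ordered by label: position 0 names class 0, position 1 names class 1
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_names)

# count how many rows fall into each class
labels, counts = np.unique(data.target, return_counts=True)
class_counts = {label_names[int(l)]: int(c) for l, c in zip(labels, counts)}
print(class_counts)
```

The classes are not perfectly balanced, which is worth keeping in mind when we split the data later on.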

In other words, the data is already in a form we can fit a logistic regression to. However, we will set up the $x$ values as a dataframe, as before:

In [2]:
# create a DataFrame of features
x_values = pd.DataFrame(data.data, columns=data.feature_names)
x_values.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


We will skip most of the feature engineering, but this time we will do normalisation. Normalisation puts all features on a comparable scale, which stops features with large magnitudes (like income) from dominating the learning process over features with small magnitudes (like age). For regularised linear models (like the one we are building here) it matters even more: the penalty term treats the coefficients of all features equally, so the features themselves need to be on a comparable scale for the penalty to be fair. Note that we excluded this step from the last notebook in the interests of simplicity, but it is something we should have performed there too.

There are multiple ways of doing normalisation, but our approach will be using min-max normalisation. The calculation is as follows:

$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$

Let's break this down. We replace each value $x_i$ of a feature with a rescaled value $x_i'$. For the top of the fraction (the numerator), we subtract the feature's minimum value from the current value $x_i$. The bottom of the fraction (the denominator) is the feature's maximum value minus its minimum value, i.e. its range. In other words, we end up with a value between 0 and 1, where the largest value (the max) becomes 1 and the lowest value (the min) becomes 0.

To illustrate this, let's assume that $x$ ranges from 1 (the min) to 11 (the max).
* For $x_i$ = 11 ... $\frac{x_i - \min(x)}{\max(x) - \min(x)} = \frac{11 - 1}{11 - 1} = \frac{10}{10} = 1.0$
* For $x_i$ = 1 ... $\frac{x_i - \min(x)}{\max(x) - \min(x)} = \frac{1 - 1}{11 - 1} = \frac{0}{10} = 0.0$
* For $x_i$ = 6 ... $\frac{x_i - \min(x)}{\max(x) - \min(x)} = \frac{6 - 1}{11 - 1} = \frac{5}{10} = 0.5$
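The three worked examples above can be reproduced with a few lines of NumPy. This is a minimal sketch of the formula only; the helper name `min_max_scale` is our own, and it assumes the feature is not constant (otherwise the denominator would be zero).

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to the range [0, 1] using min-max normalisation."""
    x = np.asarray(x, dtype=float)
    # numerator: distance from the minimum; denominator: the feature's range
    return (x - x.min()) / (x.max() - x.min())

# the same three values used in the worked examples (max, min and midpoint)
x = np.array([11, 1, 6])
scaled = min_max_scale(x)
print(scaled)
```

The values 11, 1 and 6 map to 1.0, 0.0 and 0.5 respectively, matching the calculations by hand.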

Rather than writing out the full maths ourselves, we'll take a shortcut and use a pre-defined scaler from scikit-learn:

In [3]:
from sklearn.preprocessing import MinMaxScaler

# create a MinMaxScaler object
scaler = MinMaxScaler()

# fit and transform the data
normal_data = scaler.fit_transform(x_values)

# recreate x_values using the scaled data and original feature names
x_values = pd.DataFrame(normal_data, columns=data.feature_names)
x_values.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,0.521037,0.022658,0.545989,0.363733,0.593753,0.792037,0.70314,0.731113,0.686364,0.605518,...,0.620776,0.141525,0.66831,0.450698,0.601136,0.619292,0.56861,0.912027,0.598462,0.418864
1,0.643144,0.272574,0.615783,0.501591,0.28988,0.181768,0.203608,0.348757,0.379798,0.141323,...,0.606901,0.303571,0.539818,0.435214,0.347553,0.154563,0.192971,0.639175,0.23359,0.222878
2,0.601496,0.39026,0.595743,0.449417,0.514309,0.431017,0.462512,0.635686,0.509596,0.211247,...,0.556386,0.360075,0.508442,0.374508,0.48359,0.385375,0.359744,0.835052,0.403706,0.213433
3,0.21009,0.360839,0.233501,0.102906,0.811321,0.811361,0.565604,0.522863,0.776263,1.0,...,0.24831,0.385928,0.241347,0.094008,0.915472,0.814012,0.548642,0.88488,1.0,0.773711
4,0.629893,0.156578,0.630986,0.48929,0.430351,0.347893,0.463918,0.51839,0.378283,0.186816,...,0.519744,0.123934,0.506948,0.341575,0.437364,0.172415,0.319489,0.558419,0.1575,0.142595


Again, let's set up the $Y$ value:

In [4]:
y_value = pd.DataFrame(data.target, columns=['class'])
y_value.head()

Unnamed: 0,class
0,0
1,0
2,0
3,0
4,0


However, our algorithm expects y as a single one-dimensional vector/list rather than a dataframe. We can fix this by flattening it:

In [5]:
y_value = np.ravel(y_value)
y_value

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
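To see what the flattening step changes, here is a small sketch on made-up labels (the five values in `y_frame` are invented for illustration). It compares the shape before and after `np.ravel`.

```python
import numpy as np
import pandas as pd

# a tiny single-column dataframe of made-up labels
y_frame = pd.DataFrame([0, 0, 1, 1, 0], columns=['class'])
print(y_frame.shape)  # two dimensions: 5 rows and 1 column

# np.ravel flattens this into a one-dimensional array
y_flat = np.ravel(y_frame)
print(y_flat.shape)   # one dimension: 5 values
```

The values themselves are untouched; only the shape changes from `(5, 1)` to `(5,)`.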

## Train-Test split
As before, our next step will be to split the data:

In [6]:
# split data into training and test
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x_values, y_value, test_size=0.2, random_state=4567, stratify=y_value)

# print the shapes to check everything is OK
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(455, 30)
(114, 30)
(455,)
(114,)


Our code is basically the same as before except for an extra parameter ("stratify=y_value"). This makes sure each class appears in the training and test data in the same proportion as in the full dataset. Without it the split is fully random, so we could end up with very few examples of one of the classes to learn from.
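One way to see the effect of stratification is to compare the share of class 1 in the full data, the training data and the test data. The sketch below reloads the raw dataset so that it runs on its own; scaling does not affect the labels, so it is left out here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# same split settings as in the notebook cell
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=4567, stratify=y
)

# the mean of a 0/1 array is the proportion of rows labelled 1
for name, labels in [("full", y), ("train", Y_train), ("test", Y_test)]:
    print(f"{name}: {labels.mean():.3f}")
```

The three proportions should be almost identical, differing only because the rows cannot be divided perfectly evenly.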

We can confirm the split has worked by looking at the size of our different datasets:


*   `X_train` (the $x$ values we use for training) is 455 rows and 30 columns;
*   `X_test` (the $x$ values we use for testing) is 114 rows and 30 columns;
*   `Y_train` (the $Y$ values we use for training) is a one-dimensional array of 455 values;
*   `Y_test` (the $Y$ values we use for testing) is a one-dimensional array of 114 values. All seems to be correct!
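A possible refinement, which this notebook does not do, is to fit the scaler on the training rows only and then reuse it on the test rows, so that no information about the test data reaches the model. The sketch below is one way to do that under the same split settings; the names `X_train_raw`, `X_test_raw`, `X_train_scaled` and `X_test_scaled` are our own, and the results in the rest of this notebook come from scaling before the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()

# split the raw (unscaled) features first
X_train_raw, X_test_raw, Y_train, Y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=4567, stratify=data.target
)

scaler = MinMaxScaler()
# learn each feature's min and max from the training rows only
X_train_scaled = scaler.fit_transform(X_train_raw)
# apply those same min and max values to the test rows
X_test_scaled = scaler.transform(X_test_raw)

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```

With this approach the training values sit between 0 and 1 as before, but test values can fall slightly outside that range if a test row is smaller or larger than anything seen in training.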

## Logistic Regression
We'll begin with a standard logistic regression model, but this time using L2/Ridge regularisation. L2 is very similar to the L1 penalty we saw in the last notebook, but instead of summing the absolute values of the coefficients ($|\beta|$) we sum their squared values ($\beta^2$). Squaring achieves a similar thing to taking absolute values: every coefficient contributes a non-negative amount to the penalty, whether it is positive or negative. Because this is a classification model, the loss being penalised is the log loss (cross-entropy) rather than OLS. This gives us an objective of:
<br><br>
$\text{minimise} \; \text{LogLoss} + \alpha \cdot \sum_j \beta_j^2$  
<br><br>
Whilst this looks very similar to the L1 objective, it can give very different results: L1 can shrink some coefficients to exactly zero, whereas L2 shrinks them towards zero but rarely all the way. Which penalty to use is effectively just another hyperparameter.
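To make the difference between the two penalty terms concrete, here is a small sketch that computes both for the same set of coefficients. The coefficient values in `beta` and the strength `alpha` are made up purely for illustration.

```python
import numpy as np

beta = np.array([0.5, -2.0, 3.0])  # hypothetical model coefficients
alpha = 0.1                        # hypothetical regularisation strength

# L1 penalty: alpha times the sum of absolute coefficient values
l1_penalty = alpha * np.sum(np.abs(beta))

# L2 penalty: alpha times the sum of squared coefficient values
l2_penalty = alpha * np.sum(beta ** 2)

print(f"L1 penalty: {l1_penalty:.3f}")
print(f"L2 penalty: {l2_penalty:.3f}")
```

Squaring makes large coefficients far more costly than small ones, so L2 pushes hardest on the biggest coefficients. Note that scikit-learn's `LogisticRegression` expresses the strength through its `C` parameter, which is the inverse of the regularisation strength, so a smaller `C` means a stronger penalty.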

With this in mind we first need to specify the model:

In [7]:
from sklearn.linear_model import LogisticRegression as LogR

# create the model
logR_algo = LogR(penalty='l2')
logR_algo

This is the unfitted/untrained algorithm. Let's make a model by training it on the data:

In [8]:
logR_model = logR_algo.fit(X_train, Y_train)
logR_model

But how does it perform? We cannot use $R^2$ as we did before because this is a classification problem. There are no gaps between $y_i$ and $\hat{y_i}$ to measure in the way there were before: each $y$-value is either 0 (malignant) or 1 (benign), so a prediction is simply right or wrong. Instead we can use _accuracy_ as a measure of performance, which simply measures what proportion of predictions were correct:
<br><br>
$accuracy = \frac{correct\_predictions}{total\_predictions}$
<br><br>
We can measure this on our _unseen_ data from `X_test`:

In [9]:
from sklearn.metrics import accuracy_score

# predict the test data
predict = logR_model.predict(X_test)

# print the first five predictions alongside the first five real values in Y_test
for i in range(5):
  print(f'Predicted: {predict[i]}')
  print(f'Real: {Y_test[i]}')
  print("\n")

print("\n")

print(f'Accuracy: {round(accuracy_score(Y_test, predict),2)}')

Predicted: 1
Real: 1


Predicted: 1
Real: 0


Predicted: 0
Real: 0


Predicted: 1
Real: 1


Predicted: 1
Real: 1




Accuracy: 0.96


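We have very strong performance: the model classifies 96% of the 114 unseen test rows correctly, although the second of the five examples printed above shows that it still makes some mistakes.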
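The accuracy formula is simple enough to compute by hand. As a sketch, the block below applies it to just the five predictions printed above (so the result describes those five rows only, not the whole test set) and checks it against `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# the five predicted and real values printed in the output above
predicted = np.array([1, 1, 0, 1, 1])
real = np.array([1, 0, 0, 1, 1])

# accuracy = correct predictions / total predictions
correct_predictions = np.sum(predicted == real)
manual_accuracy = correct_predictions / len(real)

print(f"Correct: {correct_predictions} out of {len(real)}")
print(f"Manual accuracy: {manual_accuracy}")
print(f"accuracy_score:  {accuracy_score(real, predicted)}")
```

Four of the five predictions match, so both calculations agree on 4/5 for this small sample.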