# <u>Tut_7.1</u>

### Learning outcomes
* ChatGPT usage
* Logistic regression (continued)
* CNN glossary (for CW preparation)

---

### ChatGPT prompting
1. Context
2. Question
3. Restrictions (3-4 is just right)

---

## Logistic regression (continued)

### Import libraries and modules

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

#### Configure some settings

In [None]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

### Load data

In [None]:
data_path = r'https://raw.githubusercontent.com/DrSYakovlev/m32895-public/refs/heads/main/raw_datasets/logistic_regr/weatherAUS.csv'
raw_df = pd.read_csv(data_path)
raw_df.info()

In [None]:
raw_df.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)

### Splitting our dataset into train, validation and test by the year

In [None]:
year = pd.to_datetime(raw_df['Date']).dt.year

train_df = raw_df[year < 2015]
val_df = raw_df[year == 2015]
test_df = raw_df[year > 2015]

In [None]:
print('train_df.shape :', train_df.shape)
print('val_df.shape :', val_df.shape)
print('test_df.shape :', test_df.shape)
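
A chronological split means the model is validated and tested on years that come after the training period, which resembles how it would be used for forecasting. A minimal sketch of the same masking technique on a toy frame (the `toy_*` names and all values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the weather data; rows and values are invented.
toy_df = pd.DataFrame({
    'Date': ['2013-05-01', '2014-12-31', '2015-01-01',
             '2015-07-15', '2016-03-09', '2017-06-25'],
    'Rainfall': [0.0, 1.2, 0.0, 5.4, 0.2, 0.0],
})

# Extract the calendar year from the date strings.
toy_year = pd.to_datetime(toy_df['Date']).dt.year

# Boolean masks select disjoint, time-ordered subsets.
toy_train = toy_df[toy_year < 2015]
toy_val = toy_df[toy_year == 2015]
toy_test = toy_df[toy_year > 2015]

print(len(toy_train), len(toy_val), len(toy_test))  # 2 2 2
```

Because the three masks are mutually exclusive and cover every year, each row ends up in exactly one subset.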

In [None]:
input_cols = list(train_df.columns)[1:-1]
target_col = 'RainTomorrow'

* Create inputs and targets for the training, validation and test sets for further processing and model training

In [None]:
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()

val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_col].copy()

test_inputs = test_df[input_cols].copy()
test_targets = test_df[target_col].copy()

* Identify which of the columns are numerical and which ones are categorical. This will be useful later, as we'll need to convert the categorical data to numbers for training a logistic regression model

In [None]:
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes('object').columns.tolist()

<u>The steps above are required for the notebook to function properly. They were carried over from the previous tutorial.</u>

---

## Imputing Missing Numeric Data
* Machine learning models can't work with missing numerical data. The process of filling missing values is called imputation.

<img src="https://i.imgur.com/W7cfyOp.png" width="480">

* There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the average value in the column using the `SimpleImputer` class from `sklearn.impute`.

In [None]:
imputer = SimpleImputer(strategy = 'mean')

* Before we perform imputation, let's check the no. of missing values in each numeric column

In [None]:
raw_df[numeric_cols].isna().sum()

* These values are spread across the training, test and validation sets. You can also check the no. of missing values individually for `train_inputs`, `val_inputs` and `test_inputs`

In [None]:
train_inputs[numeric_cols].isna().sum()

* The first step in imputation is to `fit` the imputer to the data i.e. compute the chosen statistic (e.g. mean) for each column in the dataset

In [None]:
imputer.fit(raw_df[numeric_cols])

* After calling `fit`, the computed statistic for each column is stored in the `statistics_` property of `imputer`.

In [None]:
list(imputer.statistics_)

* The missing values in the training, test and validation sets can now be filled in using the `transform` method of `imputer`

In [None]:
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])

* The missing values are now filled in with the mean of each column

In [None]:
train_inputs[numeric_cols].isna().sum()

In [None]:
print(train_targets.isna().sum())
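
To see what `fit` and `transform` do in isolation, here is a small sketch on a made-up two-column frame (the `toy_*` names and values are assumptions for illustration, not part of the weather dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented data with one missing value per column.
toy_numeric = pd.DataFrame({
    'MinTemp': [10.0, np.nan, 14.0],
    'Rainfall': [0.0, 3.0, np.nan],
})

toy_imputer = SimpleImputer(strategy='mean')
toy_imputer.fit(toy_numeric)       # learns one mean per column, ignoring NaN
print(toy_imputer.statistics_)     # [12.   1.5]

# transform returns a NumPy array with each NaN replaced by its column mean.
toy_filled = toy_imputer.transform(toy_numeric)
print(toy_filled)
```

The mean of `MinTemp` is computed from the two observed values (10 and 14), so the missing entry becomes 12.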

### Scaling Numeric Features
* Another good practice is to scale numeric features to a small range of values e.g. $(0,1)$ or $(-1,1)$
* Scaling numeric features ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also work better in practice with smaller numbers
* The numeric columns in our dataset have varying ranges.

In [None]:
raw_df[numeric_cols].describe()

* Let's use `MinMaxScaler` from `sklearn.preprocessing` to scale values to the $(0,1)$ range

In [None]:
scaler = MinMaxScaler()

* First, we `fit` the scaler to the data i.e. compute the range of values for each numeric column

In [None]:
scaler.fit(raw_df[numeric_cols])

* We can now inspect the minimum and maximum values in each column

In [None]:
print('Minimum:')
list(scaler.data_min_)

In [None]:
print('Maximum:')
list(scaler.data_max_)

* We can now separately scale the training, validation and test sets using the `transform` method of `scaler`

In [None]:
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

* We can now verify that values in each column lie in the range $(0,1)$

In [None]:
train_inputs[numeric_cols].describe()
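
With default settings, `MinMaxScaler` maps each column through $(x - x_{min}) / (x_{max} - x_{min})$, where the minimum and maximum are the ones found during `fit`. A short sketch on made-up values (the `toy_*` names are assumptions) checks the scaler against the formula:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values; the two columns have very different ranges.
toy_values = np.array([[10.0, 200.0],
                       [20.0, 400.0],
                       [40.0, 800.0]])

toy_scaler = MinMaxScaler()
toy_scaled = toy_scaler.fit_transform(toy_values)

# Same result from the formula, applied column by column.
col_min = toy_values.min(axis=0)
col_max = toy_values.max(axis=0)
manual_scaled = (toy_values - col_min) / (col_max - col_min)
print(np.allclose(toy_scaled, manual_scaled))  # True

# Values outside the fitted range fall outside [0, 1] after scaling.
outside = toy_scaler.transform(np.array([[50.0, 100.0]]))
print(outside)
```

The last call shows why the data used for `fit` matters: a value larger than the fitted maximum scales to more than 1, and one smaller than the fitted minimum scales to less than 0.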

## Encoding categorical data

Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A common technique is to use one-hot encoding for categorical columns.

<img src="https://i.imgur.com/n8GuiOO.png" width="640">

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.

In [None]:
raw_df[categorical_cols].nunique()

We can perform one hot encoding using the `OneHotEncoder` class from `sklearn.preprocessing`.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

* First, we `fit` the encoder to the data i.e. identify the full list of categories across all categorical columns.

In [None]:
encoder.fit(raw_df[categorical_cols])

In [None]:
encoder.categories_

The encoder has created a list of categories for each of the categorical columns in the dataset. 

We can generate column names for each individual category using `get_feature_names_out`.

In [None]:
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
print(encoded_cols)

All of the above columns will be added to `train_inputs`, `val_inputs` and `test_inputs`.

To perform the encoding, we use the `transform` method of `encoder`.

In [None]:
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

* We can verify that these new columns have been added to our training, test and validation sets.

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
test_inputs.drop(columns=['Location'], inplace=True)
val_inputs.drop(columns=['Location'], inplace=True)
train_inputs.drop(columns=['Location'], inplace=True)

---

## Training a Logistic Regression Model

Logistic regression is a commonly used technique for solving binary classification problems. In a logistic regression model: 

- we take a linear combination (a weighted sum) of the input features
- we apply the sigmoid function to the result to obtain a number between 0 and 1
- this number represents the probability of the input being classified as "Yes"
- instead of RMSE, the cross entropy loss function is used to evaluate the results


Here's a visual summary of how a logistic regression model is structured ([source](http://datahacker.rs/005-pytorch-logistic-regression-in-pytorch/)):


<img src="https://i.imgur.com/YMaMo5D.png" width="480">

The sigmoid function applied to the linear combination of inputs has the following formula:

<img src="https://i.imgur.com/sAVwvZP.png" width="400">
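
As a rough sketch of the steps listed above, the following NumPy code computes a weighted sum, applies the standard sigmoid $\sigma(z) = 1 / (1 + e^{-z})$, and evaluates the binary cross-entropy loss. The weights, bias and data are hypothetical values chosen only for illustration; a trained model would learn them.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p_pred):
    """Mean cross-entropy loss for 0/1 targets and predicted probabilities."""
    return -np.mean(y_true * np.log(p_pred)
                    + (1 - y_true) * np.log(1 - p_pred))

# Hypothetical weights, bias and inputs (not learned from data).
w = np.array([0.8, -0.4])
b = 0.1
X = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [0.0, 0.0]])
y = np.array([0, 1, 1])   # 1 stands for "Yes", 0 for "No"

z = X @ w + b             # linear combination: weighted sum plus bias
p = sigmoid(z)            # probability of the "Yes" class
loss = binary_cross_entropy(y, p)

print(z)     # [0.1 2.3 0.1]
print(p)
print(loss)
```

The loss is small when the predicted probability is close to the true label and grows quickly when the model is confidently wrong.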

To train a logistic regression model, we can use the `LogisticRegression` class from Scikit-learn.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(solver='liblinear')

We can train the model using `model.fit`.

In [None]:
model.fit(train_inputs[numeric_cols], train_targets)

## Making predictions and evaluating the model

* We can now use the trained model to make predictions on the training, validation and test sets

In [None]:
train_preds = model.predict(train_inputs[numeric_cols])

In [None]:
train_preds

In [None]:
train_targets

* We can output a probabilistic prediction using `predict_proba`.

In [None]:
train_probs = model.predict_proba(train_inputs[numeric_cols])
train_probs

In [None]:
model.classes_
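
The columns of the `predict_proba` output follow the order of `model.classes_`. A small sketch on made-up one-feature data (the `toy_*` names and values are assumptions) illustrates the link between probabilities and predicted labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: larger feature values are labelled 'Yes'.
toy_X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
toy_y = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes'])

toy_model = LogisticRegression(solver='liblinear')
toy_model.fit(toy_X, toy_y)

print(toy_model.classes_)                    # ['No' 'Yes']

toy_probs = toy_model.predict_proba(toy_X)   # shape (n_samples, n_classes)
toy_preds = toy_model.predict(toy_X)

# Column j of toy_probs holds the probability of toy_model.classes_[j],
# so the predicted label is the class with the larger probability.
picked = toy_model.classes_[toy_probs.argmax(axis=1)]
print((picked == toy_preds).all())           # True
```

Each row of `toy_probs` sums to 1, and `predict` returns the class whose column holds the larger value.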

We can test the accuracy of the model's predictions by computing the percentage of matching values in `train_preds` and `train_targets`.

This can be done using the `accuracy_score` function from `sklearn.metrics`.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(train_targets, train_preds)
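
Accuracy is simply the fraction of predictions that match the targets. A minimal sketch with made-up labels (the `toy_*` names are assumptions) compares a manual calculation with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: three of the five predictions match the targets.
toy_targets = np.array(['No', 'No', 'Yes', 'Yes', 'No'])
toy_preds = np.array(['No', 'Yes', 'Yes', 'No', 'No'])

manual_accuracy = (toy_targets == toy_preds).mean()
library_accuracy = accuracy_score(toy_targets, toy_preds)

print(manual_accuracy, library_accuracy)  # 0.6 0.6
```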

<u>Exercise</u>: apply the model to the validation and test datasets and see what happens

* To predict for a single input, apply the same preprocessing steps that we applied to the entire dataset, then supply the result to the model:
	* One-hot encoding
	* Scaling

* **A single input must be in the same format as the train/validation/test datasets** (see the linear regression tutorial)

---
---

## Homework
### Glossary and topics for self-study
#### Read at home, gain understanding of the following terms
* Activation function
* Augmentation
* Backpropagation
* Bias
* Convolution layer
* Dense layer
* Difference between a neural network and a convolutional neural network (CNN)
* Epoch
* Flatten layer
* Hyperparameter optimisation
* Null hypothesis
* Pooling layer
* Training curve
* Underfitting and overfitting