
# **Lab 4: Binary Classification**



## Exercise 1: Logistic Regression

In this exercise, you will train a binary classifier using logistic regression and fix some of the most common data quality issues along the way.

We will be using a dataset from the Australian Bureau of Meteorology that contains daily weather data for Sydney between May 2019 and June 2020 (http://reg.bom.gov.au/climate/dwo/IDCJDW2124.latest.shtml).

The goal of this exercise is to predict if it will rain tomorrow by training a logistic regression model on the target "**rain_tomorrow**".

You will have to import the dataset from the following link:
https://raw.githubusercontent.com/aso-uts/mlaa/main/datasets/lab4/ex1/Sydney_Weather_BOM.csv

The steps are:
1.   Load and explore dataset
2.   Data Cleaning
3.   Data Splitting
4.   Assess Baseline model
5.   Train Logistic Regression Classifier

## 1. Load and explore dataset

**[1.1]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code

In [None]:
# Solution
import pandas as pd
import numpy as np

**[1.2]** Create a variable called file_url containing the link to the CSV file and load the dataset into a dataframe called df

In [None]:
# Placeholder for student's code

In [None]:
# Solution
file_url = 'https://raw.githubusercontent.com/aso-uts/mlaa/main/datasets/lab4/ex1/Sydney_Weather_BOM.csv'
df = pd.read_csv(file_url)

In [None]:
# Unit Tests
assert isinstance(file_url, str)
assert isinstance(df, pd.DataFrame)

**[1.3]** Display the first 5 rows of df


In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.head()

Unnamed: 0,date,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa,rain_today,rain_tomorrow
0,2019-05-01,17.9,22.5,0.0,5.4,1.9,35.0,1022.6,NO,NO
1,2019-05-02,19.5,24.1,0.0,3.4,1.7,33.0,1025.8,NO,NO
2,2019-05-03,19.2,24.1,0.0,3.4,0.7,31.0,1019.1,NO,YES
3,2019-05-04,17.3,23.1,10.8,2.4,5.8,39.0,1015.9,YES,NO
4,2019-05-05,12.0,19.1,0.0,4.8,5.5,76.0,1017.6,NO,YES


**[1.4]** Display the dimensions of df

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.shape

(432, 10)

**[1.5]** Display the summary of df

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432 entries, 0 to 431
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date                432 non-null    object 
 1   min_temperature_c   432 non-null    float64
 2   max_temperature_c   432 non-null    float64
 3   rainfall_mm         432 non-null    float64
 4   evaporation_mm      419 non-null    float64
 5   sunshine_hours      430 non-null    float64
 6   max_wind_speed_kmh  422 non-null    float64
 7   max_pressure_hpa    432 non-null    float64
 8   rain_today          432 non-null    object 
 9   rain_tomorrow       432 non-null    object 
dtypes: float64(7), object(3)
memory usage: 33.9+ KB


It seems that a few columns have missing data (**evaporation_mm, sunshine_hours, max_wind_speed_kmh**): their non-null counts are below the 432 rows of the dataframe.
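One way to confirm which columns are affected is to count the missing values directly. The following is a minimal sketch on a small made-up frame (the `toy` dataframe and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up
toy = pd.DataFrame({
    "evaporation_mm": [5.4, np.nan, 3.4],
    "sunshine_hours": [1.9, 1.7, np.nan],
    "max_pressure_hpa": [1022.6, 1025.8, 1019.1],
})

# Count missing values per column, then keep only the columns that have any
missing_counts = toy.isna().sum()
print(missing_counts[missing_counts > 0])
```

The same two lines should work unchanged on `df`.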

**[1.6]** Display the descriptive statistics of df


In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.describe()

Unnamed: 0,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa
count,432.0,432.0,432.0,419.0,430.0,422.0,432.0
mean,14.324074,23.090278,3.314352,5.371838,7.176512,42.308057,1019.272222
std,4.47608,4.606265,12.58173,3.100628,5.823038,14.937236,7.416347
min,6.1,14.1,0.0,0.0,0.0,2.0,994.5
25%,10.6,19.7,0.0,3.0,4.2,31.0,1014.475
50%,13.9,22.8,0.0,4.6,7.9,39.0,1019.4
75%,18.4,25.9,0.4,7.2,10.1,50.0,1024.4
max,24.8,41.3,176.0,17.0,100.0,104.0,1038.4


The column "**sunshine_hours**" has an issue: its maximum value is 100 hours, but a day contains only 24 hours.



## 2. Data Cleaning

**[2.1]** Create a copy of the dataframe

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned = df.copy()

In [None]:
# Unit Tests
assert isinstance(df_cleaned, pd.DataFrame)
assert df_cleaned.shape == df.shape

**[2.2]** Create a filtering mask that will find the observations with less than or equal to 24 hours of sunshine

In [None]:
correct_sunshine = df_cleaned['sunshine_hours'] <= 24
correct_sunshine

0      True
1      True
2      True
3      True
4      True
       ... 
427    True
428    True
429    True
430    True
431    True
Name: sunshine_hours, Length: 432, dtype: bool

In [None]:
# Unit Tests
assert correct_sunshine.sum() == 429

**[2.3]** Filter out the observations with over 24 hours of sunshine

In [None]:
df_cleaned = df_cleaned[correct_sunshine]
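
Note that a comparison against a missing value evaluates to False, so a mask of this kind also drops rows where the value is missing, not only rows above 24. The sketch below illustrates this on a made-up series (the `toy` frame and the `mask_keep_nan` variant are assumptions for illustration, not part of the exercise):

```python
import numpy as np
import pandas as pd

# Hypothetical values: one valid, one impossible, one missing, one valid
toy = pd.DataFrame({"sunshine_hours": [5.5, 100.0, np.nan, 12.2]})

# NaN <= 24 is False, so the missing row is filtered out as well
mask = toy["sunshine_hours"] <= 24
kept = toy[mask]
print(len(kept))

# Variant that keeps missing values so they can be handled in a later step
mask_keep_nan = mask | toy["sunshine_hours"].isna()
print(len(toy[mask_keep_nan]))
```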

**[2.4]** Print the list of distinct values of rain_today and rain_tomorrow

In [None]:
# Placeholder for student's code

In [None]:
# Solution
print(df_cleaned.rain_today.unique())
print(df_cleaned.rain_tomorrow.unique())

['NO' 'YES' 'no' 'yes' 'Yes']
['NO' 'YES' 'yes' 'no' 'Yes']


Both "**rain_today**" and "**rain_tomorrow**" are binary and should only have 2 values. Instead, each contains several spellings of the same two values, "**yes**" and "**no**".

**[2.5]** Re-map all values of rain_today and rain_tomorrow to binary outcome (either 0 or 1):

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned.rain_today = df_cleaned.rain_today.map( {'YES':1 ,'yes':1 ,'Yes':1 ,'NO':0,'no':0} )
df_cleaned.rain_tomorrow = df_cleaned.rain_tomorrow.map( {'YES':1 ,'yes':1 ,'Yes':1 ,'NO':0,'no':0} )
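
An alternative to listing every spelling is to normalise the case first and then map only two keys. This is a sketch on a made-up series (the `raw` values mirror the spellings printed above; stripping whitespace is an extra precaution, not something the dataset is known to need):

```python
import pandas as pd

# The five spellings seen in the output of the previous step
raw = pd.Series(["NO", "YES", "no", "yes", "Yes"])

# Strip stray whitespace, lower-case, then map the two remaining values
binary = raw.str.strip().str.lower().map({"yes": 1, "no": 0})
print(binary.tolist())
```

Any spelling not covered by the dictionary becomes NaN, which makes unexpected values easy to spot afterwards.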

**[2.6]** Print the distinct values of rain_today and rain_tomorrow

In [None]:
# Placeholder for student's code

In [None]:
# Solution
print(df_cleaned.rain_today.unique())
print(df_cleaned.rain_tomorrow.unique())

[0 1]
[0 1]


In [None]:
# Unit Tests
assert list(df_cleaned.rain_today.unique()) == [0, 1]
assert list(df_cleaned.rain_tomorrow.unique()) == [0, 1]

**[2.7]** Find all the duplicated rows in the dataframe

In [None]:
# Placeholder for student's code

In [None]:
# Solution
dup = df_cleaned.duplicated()
df_cleaned[dup]

Unnamed: 0,date,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa,rain_today,rain_tomorrow
103,2019-08-11,8.8,16.1,0.0,5.6,6.6,56.0,1003.7,0,1
142,2019-09-18,10.6,17.9,65.6,,0.1,57.0,1025.4,1,1
184,2019-10-29,13.9,24.5,0.0,5.4,12.2,46.0,1024.2,0,0
252,2020-01-04,21.3,35.9,0.0,15.4,10.5,81.0,1010.4,0,0
261,2020-01-12,19.2,22.2,0.0,6.4,0.0,37.0,1018.5,0,1


**[2.8]** Remove all duplicated rows from the dataframe

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned.drop_duplicates(inplace=True)

In [None]:
# Unit Tests
assert df_cleaned.duplicated().sum() == 0

**[2.9]** Print the range of values for the date column using the "**.min()**" and "**.max()**" functions

In [None]:
# Placeholder for student's code

In [None]:
# Solution
print(df_cleaned.date.min())
print(df_cleaned.date.max())

2019-05-01
2120-03-16


In [None]:
# Unit Tests
assert str(df_cleaned.date.min()) == "2019-05-01"
assert str(df_cleaned.date.max()) == "2120-03-16"

Our dates should fall between May 2019 and June 2020, so at least one date is clearly wrong.

**[2.10]** Print all the rows with dates after June 2020

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned[df_cleaned['date'] > '2020-06-30']

Unnamed: 0,date,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa,rain_today,rain_tomorrow
325,2120-03-16,15.9,22.3,12.6,3.4,3.3,48.0,1025.4,1,1
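
The string comparison in this filter works because dates written as `YYYY-MM-DD` sort in chronological order. If the format were less regular, parsing the column first would be safer. A minimal sketch, assuming a made-up `dates` series and the expected range of May 2019 to June 2020:

```python
import pandas as pd

# Hypothetical sample: two plausible dates and one with a wrong year
dates = pd.Series(["2019-05-01", "2120-03-16", "2020-06-30"])

# Parse the strings into timestamps, then flag anything outside the range
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
start, end = pd.Timestamp("2019-05-01"), pd.Timestamp("2020-06-30")
out_of_range = ~parsed.between(start, end)
print(dates[out_of_range].tolist())
```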


The data seem correct except for the date. Maybe the year was entered incorrectly and the row should refer to "2020-03-16". Let's check whether that date already exists in the dataframe.

**[2.11]** Print all the rows with dates equal to '2020-03-16'

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned[df_cleaned['date'] == '2020-03-16']

Unnamed: 0,date,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa,rain_today,rain_tomorrow


There is no data for "2020-03-16", so we can assume that "2120-03-16" actually refers to this date.

**[2.12]** Replace '2120-03-16' by '2020-03-16' in the dataframe

In [None]:
# Placeholder for student's code

In [None]:
# Solution
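# Assign the result back: an in-place replace on a selected column may not update the dataframe in recent pandas versions
df_cleaned['date'] = df_cleaned['date'].replace({'2120-03-16': '2020-03-16'})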

In [None]:
# Unit Tests
assert str(df_cleaned.date.max()) == "2020-06-30"

**[2.13]** Remove all observations with missing values:

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned.dropna(how='any', inplace=True)

In [None]:
# Unit Tests
# Total number of missing values across all columns must be zero
assert df_cleaned.isna().sum().sum() == 0

## 3. Data Splitting

**[3.1]** Extract the target variable into a variable called y

In [None]:
# Placeholder for student's code

In [None]:
# Solution
y = df_cleaned.pop('rain_tomorrow')

In [None]:
# Unit Tests
assert isinstance(y, pd.Series)
assert y.shape == (402, )

**[3.2]** Create a variable called X that contains all the remaining variables (the features)

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X = df_cleaned

In [None]:
# Unit Tests
assert isinstance(X, pd.DataFrame)
assert X.shape == (402, 9)

**[3.3]** Import train_test_split from sklearn.model_selection

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.model_selection import train_test_split

**[3.4]** Split the features and target variable into 2 different sets (data and test) with a 90-10 ratio

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X_data, X_test, y_data, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
# Unit Tests
assert isinstance(X_data, pd.DataFrame)
assert X_data.shape == (361, 9)
assert isinstance(X_test, pd.DataFrame)
assert X_test.shape == (41, 9)
assert isinstance(y_data, pd.Series)
assert y_data.shape == (361, )
assert isinstance(y_test, pd.Series)
assert y_test.shape == (41, )

**[3.5]** Split the data set from the previous step into 2 different sets (training and validation) with a 90-10 ratio

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.1, random_state=42)

In [None]:
# Unit Tests
assert isinstance(X_train, pd.DataFrame)
assert X_train.shape == (324, 9)
assert isinstance(X_val, pd.DataFrame)
assert X_val.shape == (37, 9)
assert isinstance(y_train, pd.Series)
assert y_train.shape == (324, )
assert isinstance(y_val, pd.Series)
assert y_val.shape == (37, )
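
Applying a 90-10 split twice does not give a 90-10 split overall: the training set ends up with roughly 81% of the rows. The arithmetic can be sketched as follows, assuming the test portion is rounded up at each split, which is consistent with the shapes checked in the unit tests:

```python
import math

n_total = 402  # rows left after cleaning, as checked in the unit tests

# First split: 10% held out as the test set (rounded up)
n_test = math.ceil(0.1 * n_total)
n_data = n_total - n_test

# Second split: 10% of the remainder held out as the validation set
n_val = math.ceil(0.1 * n_data)
n_train = n_data - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```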

**[3.6]** Drop the `date` column from the different sets

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X_train.drop(['date'], axis=1, inplace=True)
X_val.drop(['date'], axis=1, inplace=True)
X_test.drop(['date'], axis=1, inplace=True)

In [None]:
# Unit Tests
assert X_train.shape == (324, 8)
assert X_val.shape == (37, 8)
assert X_test.shape == (41, 8)

## 4. Assess Baseline Model

**[4.1]** Find the mode of the target variable and save it into a variable called y_mode

In [None]:
# Placeholder for student's code

In [None]:
# Solution
# .mode() returns a Series (there can be ties), so take its first entry to get a single value
y_mode = y.mode()[0]

**[4.2]** Create a numpy array called y_base filled with this value, with the same length as y_train

In [None]:
# Placeholder for student's code

In [None]:
# Solution
y_base = np.full(y_train.shape, y_mode)

In [None]:
# Unit Tests
assert isinstance(y_base, np.ndarray)
assert y_base.shape == y_train.shape

**[4.3]** Import the accuracy score from sklearn

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.metrics import accuracy_score

**[4.4]** Display the accuracy score of this baseline model using the training set

In [None]:
# Placeholder for student's code

In [None]:
# Solution
accuracy_score(y_train, y_base)

0.6604938271604939
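
The same kind of baseline can also be built with scikit-learn's `DummyClassifier`, which always predicts the most frequent class seen during fitting. This is an alternative sketch on made-up labels (`X_toy` and `y_toy` are assumptions for illustration), not the method used in this lab:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: four "no rain" days and two "rain" days
y_toy = np.array([0, 0, 0, 1, 1, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the majority class learned from y_toy
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_accuracy)
```

The accuracy of such a baseline equals the share of the majority class, which is the score any trained model needs to beat.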

## 5. Train Logistic Regression Classifier

**[5.1]** Import the LogisticRegression class from sklearn

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.linear_model import LogisticRegression

**[5.2]** Instantiate our model



In [None]:
# Placeholder for student's code

In [None]:
# Solution
log_reg = LogisticRegression()

**[5.3]** Fit our model with the training data

In [None]:
# Placeholder for student's code

In [None]:
# Solution
log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
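
The warning above means the solver stopped before converging. As the message suggests, one option is to increase `max_iter` and another is to scale the features. A minimal sketch of the scaling option, using made-up synthetic data with features on very different scales (the data, the labelling rule and the name `scaled_log_reg` are all assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two features on very different scales
rng = np.random.default_rng(0)
temperature = rng.normal(15, 5, size=200)
pressure = rng.normal(1019, 7, size=200)
X_toy = np.column_stack([temperature, pressure])
y_toy = (temperature > 15).astype(int)  # made-up rule, for illustration only

# Standardise each feature to zero mean and unit variance, then fit the classifier
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_log_reg.fit(X_toy, y_toy)
train_accuracy = scaled_log_reg.score(X_toy, y_toy)
print(train_accuracy)
```

Putting the scaler in a pipeline means it is fitted on the training data only and then reused as-is when predicting on the validation or test set.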


**[5.4]** Use the trained model to predict the outcome on X_train and save the predictions into y_train_preds

In [None]:
# Placeholder for student's code

In [None]:
# Solution
y_train_preds = log_reg.predict(X_train)

In [None]:
# Unit Tests
assert isinstance(y_train_preds, np.ndarray)
assert y_train_preds.shape == y_train.shape

**[5.5]** Display the accuracy score for the training set

In [None]:
# Placeholder for student's code

In [None]:
# Solution
accuracy_score(y_train, y_train_preds)

0.8364197530864198

**[5.6]** Display the accuracy score for the validation set

In [None]:
# Placeholder for student's code

In [None]:
# Solution
y_val_preds = log_reg.predict(X_val)
accuracy_score(y_val, y_val_preds)

0.7027027027027027

In [None]:
# Unit Tests
assert isinstance(y_val_preds, np.ndarray)
assert y_val_preds.shape == y_val.shape

**[5.7]** Retrieve the probabilities for the first 10 predictions on the training set

In [None]:
log_reg.predict_proba(X_train[0:10])

array([[0.92014881, 0.07985119],
       [0.3973158 , 0.6026842 ],
       [0.15826048, 0.84173952],
       [0.56923713, 0.43076287],
       [0.89911403, 0.10088597],
       [0.95570576, 0.04429424],
       [0.97683008, 0.02316992],
       [0.92857175, 0.07142825],
       [0.76529832, 0.23470168],
       [0.59353172, 0.40646828]])

In [None]:
log_reg.predict(X_train[0:10])

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0])
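
The two outputs are linked: each row of probabilities sums to 1, and the predicted class is the one with the higher probability, which for two classes amounts to a 0.5 threshold on the second column. The sketch below reproduces the predictions from the probabilities printed above (the array is copied from that output; the variable names are illustrative):

```python
import numpy as np

# Probabilities copied from the predict_proba output above: [P(class 0), P(class 1)]
proba = np.array([
    [0.92014881, 0.07985119],
    [0.3973158, 0.6026842],
    [0.15826048, 0.84173952],
    [0.56923713, 0.43076287],
    [0.89911403, 0.10088597],
    [0.95570576, 0.04429424],
    [0.97683008, 0.02316992],
    [0.92857175, 0.07142825],
    [0.76529832, 0.23470168],
    [0.59353172, 0.40646828],
])

# Predict class 1 whenever its probability exceeds 0.5
labels = (proba[:, 1] > 0.5).astype(int)
print(labels)
```

Choosing a threshold other than 0.5 would trade false positives against false negatives without retraining the model.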