# **Lab 4: Binary Classification**



## Exercise 1: SVM

In this exercise, you will train a binary classifier using a Support Vector Machine (SVM) and work through some of the most common data quality issues.

We will be using a dataset from the Australian Bureau of Meteorology which contains the daily weather data for Sydney between May 2019 and June 2020 (http://reg.bom.gov.au/climate/dwo/IDCJDW2124.latest.shtml).

The goal of this exercise is to predict if it will rain tomorrow by training a Support Vector Machine model on the target "**rain_tomorrow**".

You will have to import the dataset from the following link:
https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab04/ex1/Sydney_Weather_BOM.csv

The steps are:
1.   Load and explore dataset
2.   Data Cleaning
3.   Data Splitting
4.   Assess Baseline model
5.   Train Support Vector Machine Classifier

---
## 0. Setup Environment

In [1]:
# Do not modify this code
!pip install -q utstd

from utstd.folders import *
from utstd.ipyrenders import *

lab = LabExFolder(
  course_code="36106",
  lab="lab04",
  exercise="ex01"
)
lab.run()

Mounted at /content/gdrive

You can now save your data files in: /content/36106/labs/lab04/ex01/data


In [2]:
import warnings
warnings.simplefilter(action='ignore')

## 1. Load and explore dataset

**[1.1]** Import the pandas and numpy packages

In [3]:
# Placeholder for student's code

In [4]:
# Solution
import pandas as pd
import numpy as np

**[1.2]** Create a variable called `file_url` containing the link to the CSV file and load the dataset into a dataframe called `df`

In [5]:
# Placeholder for student's code

In [6]:
# Solution
file_url = 'https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab04/ex1/Sydney_Weather_BOM.csv'
df = pd.read_csv(file_url)

**[1.3]** Display the first 5 rows of `df`


In [7]:
# Placeholder for student's code

In [8]:
# Solution
df.head()

Unnamed: 0,date,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa,rain_today,rain_tomorrow
0,2019-05-01,17.9,22.5,0.0,5.4,1.9,35.0,1022.6,NO,NO
1,2019-05-02,19.5,24.1,0.0,3.4,1.7,33.0,1025.8,NO,NO
2,2019-05-03,19.2,24.1,0.0,3.4,0.7,31.0,1019.1,NO,YES
3,2019-05-04,17.3,23.1,10.8,2.4,5.8,39.0,1015.9,YES,NO
4,2019-05-05,12.0,19.1,0.0,4.8,5.5,76.0,1017.6,NO,YES


**[1.4]** Display the dimensions of `df`

In [9]:
# Placeholder for student's code

In [10]:
# Solution
df.shape

(432, 10)

**[1.5]** Display the summary of `df`

In [11]:
# Placeholder for student's code

In [12]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432 entries, 0 to 431
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date                432 non-null    object 
 1   min_temperature_c   432 non-null    float64
 2   max_temperature_c   432 non-null    float64
 3   rainfall_mm         432 non-null    float64
 4   evaporation_mm      419 non-null    float64
 5   sunshine_hours      430 non-null    float64
 6   max_wind_speed_kmh  422 non-null    float64
 7   max_pressure_hpa    432 non-null    float64
 8   rain_today          432 non-null    object 
 9   rain_tomorrow       432 non-null    object 
dtypes: float64(7), object(3)
memory usage: 33.9+ KB


It seems that we have a few columns with missing data (**evaporation_mm, sunshine_hours, max_wind_speed_kmh**).
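
The observation above can be checked directly by counting missing values per column. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `df`, and its values are invented), so it runs without downloading the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    "evaporation_mm": [5.4, np.nan, 3.4, 2.4],
    "sunshine_hours": [1.9, 1.7, np.nan, np.nan],
    "rainfall_mm": [0.0, 0.0, 0.0, 10.8],
})

# Count missing values per column and keep only the columns that have some.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Given the non-null counts in the summary above, the same call on `df` should report 13, 2 and 10 missing values for the three columns.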

**[1.6]** Display the descriptive statistics of `df`


In [13]:
# Placeholder for student's code

In [14]:
# Solution
df.describe()

Unnamed: 0,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa
count,432.0,432.0,432.0,419.0,430.0,422.0,432.0
mean,14.324074,23.090278,3.314352,5.371838,7.176512,42.308057,1019.272222
std,4.47608,4.606265,12.58173,3.100628,5.823038,14.937236,7.416347
min,6.1,14.1,0.0,0.0,0.0,2.0,994.5
25%,10.6,19.7,0.0,3.0,4.2,31.0,1014.475
50%,13.9,22.8,0.0,4.6,7.9,39.0,1019.4
75%,18.4,25.9,0.4,7.2,10.1,50.0,1024.4
max,24.8,41.3,176.0,17.0,100.0,104.0,1038.4


The column "**sunshine_hours**" has an issue: its maximum value is 100 hours, but a day contains only 24 hours.



## 2. Data Cleaning

**[2.1]** Create a copy of the dataframe called `df_cleaned`

In [15]:
# Placeholder for student's code

In [16]:
# Solution
df_cleaned = df.copy()

**[2.2]** Create a filtering mask that will find the observations with less than or equal to 24 hours of sunshine

In [17]:
correct_sunshine = df_cleaned['sunshine_hours'] <= 24
correct_sunshine

Unnamed: 0,sunshine_hours
0,True
1,True
2,True
3,True
4,True
...,...
427,True
428,True
429,True
430,True


**[2.3]** Filter out the observations with over 24 hours of sunshine

In [18]:
df_cleaned = df_cleaned[correct_sunshine]
df_cleaned.shape

(429, 10)
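
The filter removes three rows (432 to 429), not just the out-of-range one. The summary in step 1.5 shows two missing `sunshine_hours` values, and a comparison with `NaN` evaluates to `False`, so those rows fail the mask as well. A minimal sketch on made-up values (the series and the `mask_keep_missing` variant are illustrative assumptions, not part of the lab):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values: one valid reading, one impossible reading, one missing.
sunshine = pd.Series([5.8, 100.0, np.nan], name="sunshine_hours")

# A comparison with NaN evaluates to False, so the missing row fails the mask too.
mask_strict = sunshine <= 24

# Variant that keeps missing readings so they can be handled in a later step.
mask_keep_missing = (sunshine <= 24) | sunshine.isna()

print(mask_strict.tolist())
print(mask_keep_missing.tolist())
```

Since step 2.13 drops rows with missing values anyway, the final dataframe is the same here; the variant only matters if missing values are to be imputed instead.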

**[2.4]** Print the list of distinct values of `rain_today` and `rain_tomorrow`

In [19]:
# Placeholder for student's code

In [20]:
# Solution
print(df_cleaned.rain_today.unique())
print(df_cleaned.rain_tomorrow.unique())

['NO' 'YES' 'no' 'yes' 'Yes']
['NO' 'YES' 'yes' 'no' 'Yes']


Both "**rain_today**" and "**rain_tomorrow**" are binary and should only have 2 values. We have multiple variants of the same values "**yes**" or "**no**".

**[2.5]** Re-map all values of `rain_today` and `rain_tomorrow` to binary outcome (either 0 or 1):

In [21]:
# Placeholder for student's code

In [22]:
# Solution
df_cleaned.rain_today = df_cleaned.rain_today.map( {'YES':1 ,'yes':1 ,'Yes':1 ,'NO':0,'no':0} )
df_cleaned.rain_tomorrow = df_cleaned.rain_tomorrow.map( {'YES':1 ,'yes':1 ,'Yes':1 ,'NO':0,'no':0} )
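
The explicit mapping above has to list every spelling variant. An alternative sketch, assuming the variants differ only by letter case or surrounding whitespace, normalises the text first and then maps the two remaining labels (the `rain` series is made-up toy data):

```python
import pandas as pd

# Hypothetical toy column with the same spelling variants seen above.
rain = pd.Series(["NO", "YES", "no", "yes", "Yes"])

# Normalise case (and stray whitespace) first, then map the two remaining labels.
rain_binary = rain.str.strip().str.upper().map({"YES": 1, "NO": 0})

print(rain_binary.tolist())
```

With either approach, any value that is not covered by the mapping becomes `NaN`, so it is worth checking the distinct values afterwards.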

**[2.6]** Print the distinct values of `rain_today` and `rain_tomorrow`

In [23]:
# Placeholder for student's code

In [24]:
# Solution
print(df_cleaned.rain_today.unique())
print(df_cleaned.rain_tomorrow.unique())

[0 1]
[0 1]


**[2.7]** Find all the duplicated rows in the dataframe

In [25]:
# Placeholder for student's code

In [26]:
# Solution
dup = df_cleaned.duplicated()
df_cleaned[dup]

Unnamed: 0,date,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa,rain_today,rain_tomorrow
103,2019-08-11,8.8,16.1,0.0,5.6,6.6,56.0,1003.7,0,1
142,2019-09-18,10.6,17.9,65.6,,0.1,57.0,1025.4,1,1
184,2019-10-29,13.9,24.5,0.0,5.4,12.2,46.0,1024.2,0,0
252,2020-01-04,21.3,35.9,0.0,15.4,10.5,81.0,1010.4,0,0
261,2020-01-12,19.2,22.2,0.0,6.4,0.0,37.0,1018.5,0,1
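
By default `duplicated()` uses `keep='first'`, so the rows listed above are the later copies only; the first occurrence of each is not flagged. To inspect every copy side by side, `keep=False` can be passed instead. A small sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are exact copies.
toy = pd.DataFrame({
    "date": ["2019-08-10", "2019-08-11", "2019-08-11"],
    "rainfall_mm": [1.2, 0.0, 0.0],
})

# Default keep='first' flags only the later copies.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicated group.
all_copies = toy[toy.duplicated(keep=False)]

print(len(later_copies), len(all_copies))
```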


**[2.8]** Remove all duplicated rows from the dataframe

In [27]:
# Placeholder for student's code

In [28]:
# Solution
df_cleaned.drop_duplicates(inplace=True)

**[2.9]** Print the range of values for the `date` column using the "**.min()**" and "**.max()**" methods

In [29]:
# Placeholder for student's code

In [30]:
# Solution
print(df_cleaned.date.min())
print(df_cleaned.date.max())

2019-05-01
2120-03-16


Our dates should fall between May 2019 and June 2020, so clearly we have at least one wrong date.
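
Note that `date` is stored as text (dtype `object`), so `.min()` and `.max()` compare strings. This matches chronological order only because the dates use the zero-padded `YYYY-MM-DD` format. A sketch of a more explicit check, assuming that format holds for every row, parses the strings and flags anything outside the expected period (the `dates` series is made-up toy data):

```python
import pandas as pd

# Hypothetical toy dates; the last one mimics the mistyped year.
dates = pd.Series(["2019-05-01", "2020-06-30", "2120-03-16"])

# Parse the strings so comparisons are made on real timestamps.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Flag anything outside the period the dataset is supposed to cover.
start, end = pd.Timestamp("2019-05-01"), pd.Timestamp("2020-06-30")
out_of_range = dates[(parsed < start) | (parsed > end)]

print(out_of_range.tolist())
```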

**[2.10]** Print all the rows with dates after June 2020

In [31]:
# Placeholder for student's code

In [32]:
# Solution
df_cleaned[df_cleaned['date'] > '2020-06-30']

Unnamed: 0,date,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa,rain_today,rain_tomorrow
325,2120-03-16,15.9,22.3,12.6,3.4,3.3,48.0,1025.4,1,1


The data seem correct except for the date. Maybe the year was entered incorrectly and it should refer to "2020-03-16". Let's see if this date already exists.

**[2.11]** Print all the rows with dates equal to '2020-03-16'

In [33]:
# Placeholder for student's code

In [34]:
# Solution
df_cleaned[df_cleaned['date'] == '2020-03-16']

Unnamed: 0,date,min_temperature_c,max_temperature_c,rainfall_mm,evaporation_mm,sunshine_hours,max_wind_speed_kmh,max_pressure_hpa,rain_today,rain_tomorrow


There is no data for "2020-03-16", so we can assume "2120-03-16" is actually referring to this date.

**[2.12]** Replace '2120-03-16' by '2020-03-16' in the dataframe

In [35]:
# Placeholder for student's code

In [36]:
# Solution
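# Assign the result back; replace(..., inplace=True) on a selected column may not update df_cleaned in recent pandas versions
df_cleaned['date'] = df_cleaned['date'].replace({'2120-03-16': '2020-03-16'})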

**[2.13]** Remove all observations with missing values:

In [37]:
# Placeholder for student's code

In [38]:
# Solution
df_cleaned.dropna(how='any', inplace=True)

## 3. Data Splitting

Note: If you are stuck in previous steps, you can download the content of `df_cleaned` here: https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab04/ex1/df_cleaned.csv


**[3.1]** Sort the dataframe in ascending order using the `date` column

In [39]:
# Placeholder for student's code

In [40]:
# Solution
df_cleaned.sort_values(['date'], inplace=True)

**[3.2]** Extract the target variable into a variable called `y`

In [41]:
# Placeholder for student's code

In [42]:
# Solution
y = df_cleaned.pop('rain_tomorrow')

**[3.3]** Create a variable called `X` that contains all the remaining variables (the features)

In [43]:
# Placeholder for student's code

In [44]:
# Solution
X = df_cleaned

**[3.4]** Create a variable called `one_fifth` that will contain the number of rows that corresponds to 20% of the dataframe. Round it to the closest integer.

In [45]:
# Placeholder for student's code

In [46]:
# Solution
one_fifth = round(len(X) / 5)
one_fifth

80

**[3.5]** Create `X_train` and `y_train` that will contain the first 60% of the original dataframe

In [47]:
# Placeholder for student's code

In [48]:
# Solution
X_train = X[: one_fifth * 3]
y_train = y[: one_fifth * 3]

**[3.6]** Create `X_val`, `y_val`, `X_test` and `y_test` that will respectively contain the next 20% and the remaining 20% of the original dataframe

In [49]:
# Placeholder for student's code

In [50]:
# Solution
X_val = X[one_fifth * 3: one_fifth * 4]
y_val = y[one_fifth * 3: one_fifth * 4]

X_test = X[one_fifth * 4:]
y_test = y[one_fifth * 4:]
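
Because the data is sorted by date, the three sets are taken by position so that the model is trained on earlier days and evaluated on later ones. The same idea can be wrapped in a small helper. `chronological_split` below is a hypothetical function written for illustration (it is not part of pandas or sklearn), it computes the cut points from fractions rather than from `one_fifth`, and it is shown on made-up data:

```python
import pandas as pd

def chronological_split(X, y, train_frac=0.6, val_frac=0.2):
    """Hypothetical helper: split already-sorted data into train/val/test by position."""
    n = len(X)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return (
        (X.iloc[:train_end], y.iloc[:train_end]),
        (X.iloc[train_end:val_end], y.iloc[train_end:val_end]),
        (X.iloc[val_end:], y.iloc[val_end:]),
    )

# Toy data: 10 rows, assumed to be already in date order.
X_toy = pd.DataFrame({"feature": range(10)})
y_toy = pd.Series([0, 1] * 5)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = chronological_split(X_toy, y_toy)
print(len(train_X), len(val_X), len(test_X))
```

Using `.iloc` makes it explicit that the slices are positional, which matters here because the index labels are no longer in order after sorting by `date`.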

**[3.7]** Drop the `date` column from the different sets

In [51]:
# Placeholder for student's code

In [52]:
# Solution
# Reassign instead of dropping in place, since the three sets are slices of X
X_train = X_train.drop(columns=['date'])
X_val = X_val.drop(columns=['date'])
X_test = X_test.drop(columns=['date'])

## 4. Assess Baseline Model

**[4.1]** Import the DummyClassifier class from sklearn

In [53]:
# Placeholder for student's code

In [54]:
# Solution
from sklearn.dummy import DummyClassifier

**[4.2]** Instantiate the DummyClassifier class into a variable called `base_clf` and fit it on the training set

In [55]:
# Placeholder for student's code

In [56]:
# Solution
base_clf = DummyClassifier(strategy='most_frequent')
base_clf.fit(X_train, y_train)

**[4.3]** Import the accuracy score from sklearn

In [57]:
# Placeholder for student's code

In [58]:
# Solution
from sklearn.metrics import accuracy_score

**[4.4]** Display the accuracy score of this baseline model using the training set

In [59]:
# Placeholder for student's code

In [60]:
# Solution
y_preds = base_clf.predict(X_train)
accuracy_score(y_train, y_preds)

0.7583333333333333
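
With `strategy='most_frequent'`, the baseline always predicts the most common class in the training target, so its training accuracy is simply the share of that class. A minimal sketch of this reasoning using pandas only, on a made-up target (`y_toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy target: 3 dry days (0) and 1 rainy day (1).
y_toy = pd.Series([0, 0, 1, 0])

# A 'most_frequent' baseline always predicts the mode of the training target...
majority_class = y_toy.mode()[0]
baseline_preds = pd.Series(majority_class, index=y_toy.index)

# ...so its training accuracy equals the share of that class.
baseline_accuracy = (baseline_preds == y_toy).mean()
majority_share = y_toy.value_counts(normalize=True).max()

print(majority_class, baseline_accuracy, majority_share)
```

On the lab data, the score of 0.7583 therefore means that about 76% of the training days share the same `rain_tomorrow` label.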

## 5. Train SVM Classifier

**[5.1]** Import SVC from sklearn.svm

In [61]:
# Placeholder for student's code

In [62]:
# Solution
from sklearn.svm import SVC

**[5.2]** Instantiate our model



In [63]:
# Placeholder for student's code

In [64]:
# Solution
svc = SVC()

**[5.3]** Fit our model with the training data

In [65]:
# Placeholder for student's code

In [66]:
# Solution
svc.fit(X_train, y_train)

**[5.4]** Use the trained model to predict the outcome on `X_train` and save the predictions into `y_train_preds`

In [67]:
# Placeholder for student's code

In [68]:
# Solution
y_train_preds = svc.predict(X_train)

**[5.5]** Display the accuracy score for the training set

In [69]:
# Placeholder for student's code

In [70]:
# Solution
accuracy_score(y_train, y_train_preds)

0.7583333333333333

**[5.6]** Display the accuracy score for the validation set

In [71]:
# Placeholder for student's code

In [72]:
# Solution
y_val_preds = svc.predict(X_val)
accuracy_score(y_val, y_val_preds)

0.4125
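
The validation accuracy (0.4125) is well below the training accuracy, and the training accuracy is identical to the baseline score. This may indicate that the model predicts a single class for most rows, although accuracy alone cannot confirm it. One way to check is a confusion matrix. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical) in which every prediction is class 0:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy labels: the model predicts "no rain" (0) for every day.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(cm)
print(accuracy_score(y_true, y_pred))
```

An empty second column, as in this toy case, would show that the positive class is never predicted. On the lab data, the equivalent call would be `confusion_matrix(y_val, y_val_preds)`.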