# Machine Learning Problem

The dataset is weather data from the Edmonton city centre from 1961 to 1994. The goal is to predict weather "descriptions" (e.g. cloudy, snow, rain) from three features: temperature, relative humidity, and pressure.


## Classifiers
I chose 3 classifiers plus a baseline to explore the performance of different algorithms, all using the
scikit-learn implementations:
\begin{itemize}
    \item The obvious baseline of a $\frac{1}{k}$ zero-classifier (where $k$ is the number of classes), which we will hopefully outperform
    \item Logistic regression as a basic model
    \item SVM
    \item A relatively simple multi-layer perceptron neural net from scikit-learn
\end{itemize}

In [115]:
from sklearn.dummy import DummyClassifier
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegressionCV
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, balanced_accuracy_score, get_scorer

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# tqdm loading bars
from tqdm.notebook import tqdm

In [124]:
# GLOBALS
# the size of the test and validation splits
TEST_SIZE = 0.2
# our random seed to use
RANDOM_STATE = 69

# Data

We're trying to predict the Weather column of this data, which has 178 different labels. That is too many to
reasonably classify in this project, so I filter the dataset down to the 10 most frequent classes, which make up ~95% of the data anyway.

I also remove any rows containing NaN.
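
The filtering step described above can be sketched on a toy frame. This is only an illustration: the `toy` data, `TOP_K`, and the `coverage` variable are made up here, and it uses `value_counts` rather than the `np.unique` approach in the actual cell.

```python
import pandas as pd

# Toy stand-in for the weather data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "TempC": [1.0, 2.0, None, 4.0, 5.0, 6.0],
    "Weather": ["Clear", "Clear", "Clear", "Snow", "Snow", "Fog"],
})

TOP_K = 2  # the notebook keeps the top 10; 2 is enough for this toy frame

# value_counts() sorts by frequency, so nlargest keeps the most common labels.
top_classes = toy["Weather"].value_counts().nlargest(TOP_K).index.tolist()

# Fraction of all rows whose label is one of the kept classes.
coverage = toy["Weather"].isin(top_classes).mean()

# Keep only rows in the top classes, then drop rows with any missing value.
filtered = toy[toy["Weather"].isin(top_classes)].dropna()

print(top_classes, round(coverage, 3), len(filtered))
```

The same `coverage` calculation on the real data is one way to check the ~95% figure quoted above.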

In [100]:
# load in data
data = pd.read_csv("weather_data.csv")

# extract classes and counts
classes, counts = np.unique(data[['Weather']].values, return_counts=True)
classes = sorted(list(zip(classes, counts)), key=lambda x: -x[1])[:10]
print("Classes with counts are : {}".format(classes))
classes = list(map(lambda x: x[0], classes))

data = data[['TempC','RelHumPct', 'PkPa', 'Weather']].loc[data['Weather'].isin(classes)].dropna()
data

Classes with counts are : [('Mostly Cloudy', 121373), ('Mainly Clear', 99069), ('Cloudy', 45023), ('Clear', 31173), ('Snow', 25218), ('Rain Showers', 6916), ('Rain', 6347), ('Fog', 5068), ('Ice Crystals', 3636), ('Smoke', 1749)]


| | TempC | RelHumPct | PkPa | Weather |
|---|---|---|---|---|
| 0 | -8.3 | 96.0 | 92.59 | Clear |
| 1 | -8.9 | 90.0 | 92.64 | Clear |
| 2 | -8.9 | 90.0 | 92.71 | Clear |
| 3 | -10.0 | 93.0 | 92.79 | Clear |
| 4 | -11.1 | 92.0 | 92.84 | Clear |
| ... | ... | ... | ... | ... |
| 361473 | 6.0 | 42.0 | 94.63 | Clear |
| 361474 | 7.1 | 39.0 | 94.60 | Clear |
| 361475 | 7.4 | 37.0 | 94.60 | Clear |
| 361476 | 7.2 | 38.0 | 94.63 | Clear |


In [101]:
X = data[['TempC', 'RelHumPct', "PkPa"]]
y = data[['Weather']]

### Train, validation, test
I use scikit-learn's built-in train_test_split to split the data.
In this step I only split into train and test datasets, since I will be using k-fold validation on the training set.
All 3 models will be trained and tested on the same splits so they can be compared fairly.

I extract both an unscaled X_train... and a
scaled X_train_scaled... (using a min-max scaler).

In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state = RANDOM_STATE)

# scale a copy of the data (y is just classes so doesn't need to be scaled)
min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X)
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y, test_size=TEST_SIZE, random_state = RANDOM_STATE)
#X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=TEST_SIZE, random_state = RANDOM_STATE)
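
The cell above fits the min-max scaler on all of X before splitting, so the test rows influence the scaling range. A common alternative is to fit the scaler on the training split only and reuse it for the test split. The sketch below shows that variant on synthetic data; `X_demo`, `y_demo`, and the other names are placeholders, not the notebook's variables, and the effect on the reported scores is likely small.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the three weather features (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Learn the per-feature min and max from the training rows only...
scaler = MinMaxScaler().fit(X_tr)

# ...then apply the same transform to both splits.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(), X_tr_scaled.max())
```

With this variant the training features land exactly in [0, 1], while test features can fall slightly outside that range.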

# Classifiers

Here I train, validate, and test all 4 classifiers, reporting both accuracy and balanced accuracy.
I use accuracy because it is simple to understand intuitively, and balanced accuracy because it averages recall per class, which handles the imbalanced classes a little better, since performance can differ between labels.
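
To make the difference between the two metrics concrete, here is a small sketch on made-up labels (the `y_true` and `y_pred` arrays are invented for illustration). It computes balanced accuracy both with scikit-learn and by hand as the mean of per-class recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up imbalanced labels: 8 "Cloudy" rows and 2 "Snow" rows.
y_true = np.array(["Cloudy"] * 8 + ["Snow"] * 2)
# The predictions get every "Cloudy" row right but only one of the two "Snow" rows.
y_pred = np.array(["Cloudy"] * 8 + ["Cloudy", "Snow"])

acc = accuracy_score(y_true, y_pred)                            # 9 of 10 correct
per_class_recall = recall_score(y_true, y_pred, average=None)   # one value per class
bal = balanced_accuracy_score(y_true, y_pred)                   # mean of those recalls

print(acc, per_class_recall, bal)
```

Accuracy is dominated by the large class, while balanced accuracy gives the small class equal weight.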

## Baseline Classifier

As a baseline, I use the scikit-learn DummyClassifier with two strategies: "most_frequent",
which always predicts the largest class, and "uniform", which generates predictions
uniformly at random across the classes.

I expect these to perform the worst of the 4 classifiers.

In [89]:
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
y_hat = dummy.predict(X_test)
print("Dummy classifier (most frequent) has accuracy {:f}".format(accuracy_score(y_test, y_hat)))
print("Dummy classifier (most frequent) has balanced accuracy score {:f}\n".format(balanced_accuracy_score(y_test, y_hat)))

dummy = DummyClassifier(strategy="uniform", random_state=RANDOM_STATE)
dummy.fit(X_train, y_train)
y_hat = dummy.predict(X_test)
print("Dummy classifier (uniform) has accuracy {:f}".format(accuracy_score(y_test, y_hat)))
print("Dummy classifier (uniform) has balanced accuracy score {:f}".format(balanced_accuracy_score(y_test, y_hat)))

Dummy classifier (most frequent) has accuracy 0.352062
Dummy classifier (most frequent) has balanced accuracy score 0.100000

Dummy classifier (uniform) has accuracy 0.100955
Dummy classifier (uniform) has balanced accuracy score 0.101505


### Results

Both dummy classifiers perform exactly as expected. For "most_frequent", ~35% of the data is "Mostly Cloudy", so accuracy is ~35%, and with 10 classes the balanced accuracy is exactly 10% (recall is 1 for the predicted class and 0 for the other nine). For the uniform classifier, both values are around 10%.
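
The most-frequent numbers can be reproduced on synthetic labels. In the sketch below the class counts are invented so that one class holds 35% of the rows. The features are all zeros because DummyClassifier ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

k = 10
# Invented class sizes: class 0 has 35 of 100 rows, the rest are spread over 9 classes.
class_counts = [35] + [8] * 8 + [1]
y_demo = np.repeat(np.arange(k), class_counts)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

clf = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat_demo = clf.predict(X_demo)

acc = accuracy_score(y_demo, y_hat_demo)            # share of the largest class
bal = balanced_accuracy_score(y_demo, y_hat_demo)   # (1 + 0 * 9) / 10

print(acc, bal)
```

Accuracy equals the share of the largest class, and balanced accuracy is 1/k however the remaining rows are distributed.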

## Logistic Regression

My first model is a logistic regression classifier.
I use scikit-learn's LogisticRegressionCV, which performs k-fold cross-validation internally. I just use the built-in default of
stratified k-folds, which should be good enough. NUM_FOLDS can be tuned as needed, as can MAX_ITER.

I use the "saga" solver since it handles large datasets quite well, and I use balanced accuracy
for scoring, since that's my favoured measure of quality in this experiment.
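
As a rough sketch of what the cross-validation does, the example below passes an explicit StratifiedKFold object and then inspects the candidate and selected regularisation strengths. The synthetic data from `make_classification`, the fold seed, and `Cs=5` are assumptions chosen to keep the example small; they are not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class stand-in for the scaled weather features (illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)

# Explicit stratified folds; passing an integer for cv also uses stratified k-folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

clf = LogisticRegressionCV(
    Cs=5,                         # number of candidate regularisation strengths
    cv=folds,
    solver="saga",
    scoring="balanced_accuracy",  # pick C by balanced accuracy, as in the text
    max_iter=1000,
    random_state=0,               # saga shuffles the data, so seed it
)
clf.fit(X_demo, y_demo)

print("candidate C values:", clf.Cs_)
print("selected C:", np.unique(clf.C_))
```

Stratified folds keep the class proportions roughly equal in every fold, which matters here because the rarest weather classes are a small fraction of the data.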

In [127]:
# number of validation folds for k-fold cross validation
NUM_FOLDS = 5
# maximum iterations when training our classifier
MAX_ITER = 1000

logit = LogisticRegressionCV(cv=NUM_FOLDS, max_iter=MAX_ITER, solver="saga", scoring=get_scorer("balanced_accuracy"))
logit.fit(X_train_scaled, np.ravel(y_train))

LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=1000, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='saga', tol=0.0001, verbose=0)

In [128]:
y_hat = logit.predict(X_test_scaled)
print("Logistic Regression classifier has accuracy {:f}".format(accuracy_score(y_test, y_hat)))
print("Logistic Regression has balanced accuracy score {:f}".format(balanced_accuracy_score(y_test, y_hat)))

Logistic Regression classifier has accuracy 0.392881
Logistic Regression has balanced accuracy score 0.200750


### Results

As can be seen, there's a marginal increase in accuracy (from ~35% to ~39%) and a marked increase in balanced accuracy (from 10% to ~20%) over the most-frequent baseline.
Still, ~20% balanced accuracy is not yet particularly impressive.

## SVM

Next, I use an SVM classifier.