## Machine Learning Problem

The dataset is weather data from the edmonton city-center from 1961 to 1994, with the goal to predict weather "descriptions" (i.e. cloudy, snow, rain etc) from features (temperature, relhumidity, pressure)


### Classifiers
I choose 3 classifiers + the baseline to explore the performance of different algorithms, all using the
scikit-learn implementations
\begin{itemize}
    \item There's the obvious baseline of a $\frac{1}{k}$ zero-classifier, which we will hopefully outperform
    \item Linear regression as a basic model
    \item SVM
    \item and a relatively simple multi-layer perceptron neural net from SKLearn
\end{itemize}

In [98]:
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# tqdm loading bars
from tqdm.notebook import tqdm

In [2]:
# GLOBALS
# the size of the test and validation splits
TEST_SIZE = 0.2
NUM_FOLDS = 2

# Data

We're trying to predict the weather column of this data, which has 178 different labels, which is too much to
reasonably classify in this project, so I filter the dataset to only use the top 10 classes, which make up ~95% of the data anyways

In [75]:
# load in data
data = pd.read_csv("weather_data.csv")

# extract classes and counts
classes, counts = np.unique(data[['Weather']].values, return_counts=True)
classes = sorted(list(zip(classes, counts)), key=lambda x: -x[1])[:10]
classes = list(map(lambda x: x[0], classes))
print("Classes are : {}".format(classes))

data[['TempC','RelHumPct', 'PkPa', 'Weather']].loc[data['Weather'].isin(classes)]

Classes are : ['Mostly Cloudy', 'Mainly Clear', 'Cloudy', 'Clear', 'Snow', 'Rain Showers', 'Rain', 'Fog', 'Ice Crystals', 'Smoke']


Unnamed: 0,TempC,RelHumPct,PkPa,Weather
0,-8.3,96.0,92.59,Clear
1,-8.9,90.0,92.64,Clear
2,-8.9,90.0,92.71,Clear
3,-10.0,93.0,92.79,Clear
4,-11.1,92.0,92.84,Clear
...,...,...,...,...
361473,6.0,42.0,94.63,Clear
361474,7.1,39.0,94.60,Clear
361475,7.4,37.0,94.60,Clear
361476,7.2,38.0,94.63,Clear


In [76]:
X = data[['TempC', 'RelHumPct', "PkPa"]]
y = data[['Weather']]

### Train, validation, test
I use built-in scikit-learn train_test_split to split the data into train, validation, and test datasets 

All 3 models will be trained on the same train, test, val datasets to compare effectively

In [80]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state = 69)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=TEST_SIZE, random_state = 69)

# Classifiers

Here I train, validate and test all 4 classifiers, reporting the accuracy and weighted f1-score.
I use accuracy here because it's a simple metric to intuitively understand, while it is somewhat flawed
and weighted f1 since there is an imbalance of labels and so there's some value to scaling weighting the score
differently for different labels

## Baseline Classifier

As a baseline, I use the scikit-learn DummyClassifier using the "most_frequent" strategy
which is simply a max class classifier. In theory this should perform the worst out of the 4 classifiers

In [103]:
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
y_hat = dummy.predict(X_test)
print("Dummmy classifier has accuracy {:f}".format(accuracy_score(y_test, y_hat)))
print("Dummmy classifier has weighted f1-score {:f}".format(f1_score(y_test, y_hat, average="weighted")))

Dummmy classifier has accuracy 0.334500
Dummmy classifier has weighted f1-score 0.167689


## Logistic Regression

My first model is a Logistic Regression classifier,

In [79]:
logit = LogisticRegressionCV()