## Introduction

I will use historical daily weather observations from numerous Australian weather stations to predict whether or not it will rain tomorrow. This data was sourced from the [Bureau of Meteorology](http://www.bom.gov.au/climate/data/).

A binary classification model will be trained on the target attribute `RainTomorrow` (Did it rain the next day? Yes or no).

This is a typical supervised learning task, as we are given *labelled* training examples (`RainTomorrow`). It is also a classification task as the target attribute `RainTomorrow` is binary with two possible values `Yes` or `No`. Additionally, this is a univariate problem because I have only one outcome of interest.

## Get the Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### Take a Quick Look at the Data Structure

In [None]:
weather = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
weather.head()

Each instance represents weather information from a weather station on a particular day.

In [None]:
weather.info()

In [None]:
weather.select_dtypes('object').columns

In [None]:
weather.select_dtypes('float64').columns

In [None]:
print(len(weather.select_dtypes('object').columns))
print(len(weather.select_dtypes('float64').columns))

There are 142,193 instances in the dataset. Some attributes, such as `Sunshine` and `Evaporation`, have a high proportion of missing values. In fact, most attributes appear to have at least some missing values.

In total, there are 24 attributes. 7 of these have the `object` type, while the remaining 17 are `float64`. The `object`s appear to be categorical attributes (`Date` can be treated as numerical or categorical depending on the analysis at hand).

Before proceeding further, the dataset description advises dropping `RISK_MM`, as it contains the amount of rainfall for the next day. Using it for training would leak future information to the model and inflate the accuracy measured during evaluation.

In [None]:
weather = weather.drop(columns='RISK_MM')

## Explore the Data

### Numerical Attributes

In [None]:
weather.describe()

In [None]:
weather.hist(bins=50, figsize=(20,15));

* Many of these attributes, such as `MinTemp`, `MaxTemp`, `Pressure9am`, and `Pressure3pm`, have bell-shaped distributions.
* `Rainfall` and `Evaporation` are heavily skewed to the right. How likely is 371 mm of rainfall in a day?
* `WindGustSpeed`, `WindSpeed9am`, and `WindSpeed3pm` are also skewed to the right, but less so than `Rainfall` and `Evaporation`.
* `Humidity9am` and `Humidity3pm` are slightly skewed to the left.
* These attributes have very different scales. For example, compare `MinTemp` and `Pressure9am`.
* `Cloud9am` and `Cloud3pm` are discrete attributes (cloud cover is measured in [oktas](https://en.wikipedia.org/wiki/Okta)).
* The mode for `Sunshine` is 0. `Sunshine` is described as the "number of hours of bright sunshine in the day". It seems unlikely there would be so many days without any sunshine, but it depends on what "bright" means. Maybe during the winter months there are very few days with "bright" sunshine.
* The mode for `Humidity9am` is 100%. Also, `Humidity3pm` has an unusually high number of 100% days given the bell-shaped distribution.

The above highlights the need for feature scaling and the transformation of attributes so they approximate a normal distribution. Additionally, extreme values for `Rainfall`, `Evaporation`, `Sunshine`, `Humidity9am`, and `Humidity3pm` can be investigated.
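As a rough illustration of the two steps mentioned above, the sketch below log-transforms a right-skewed attribute and then min-max scales it. The rainfall values are made up for illustration and are not taken from the dataset, and `log1p` is only one common choice for right-skewed, non-negative data, not necessarily the transform I will end up using.

```python
import math

# Hypothetical daily rainfall values in mm (illustrative only, not from the dataset)
rainfall = [0.0, 0.0, 0.2, 1.4, 5.0, 12.6, 371.0]

# log1p(x) = log(1 + x) compresses the long right tail and is defined at zero
rainfall_log = [math.log1p(x) for x in rainfall]

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

rainfall_scaled = min_max_scale(rainfall_log)
print([round(v, 3) for v in rainfall_scaled])
```

After the log transform, the extreme value no longer squeezes every other observation towards zero when the attribute is rescaled.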

### Categorical Attributes

In [None]:
cat_attribs = weather.select_dtypes('object')
cat_attribs = cat_attribs.drop(columns='Date')
for i in cat_attribs:
    print(cat_attribs[i].value_counts())
    if i != 'RainTomorrow':
        print('\n')

In [None]:
len(cat_attribs['Location'].value_counts())

In [None]:
fig, axarr = plt.subplots(3, 2, figsize=(12,10))
cat_attribs['WindGustDir'].value_counts(ascending=True).plot.barh(ax=axarr[0,0], title='WindGustDir')
cat_attribs['WindDir9am'].value_counts(ascending=True).plot.barh(ax=axarr[1,0], title='WindDir9am')
cat_attribs['WindDir3pm'].value_counts(ascending=True).plot.barh(ax=axarr[2,0], title='WindDir3pm')
cat_attribs['RainToday'].value_counts(ascending=True).plot.barh(ax=axarr[0,1], title='RainToday')
cat_attribs['RainTomorrow'].value_counts(ascending=True).plot.barh(ax=axarr[1,1], title='RainTomorrow')
fig.delaxes(axarr[2,1]) # deletes empty plot
plt.tight_layout();

* There are 49 different locations.
* `WindGustDir`, `WindDir9am`, and `WindDir3pm` are nominal attributes. Therefore, they should be converted to numbers using one-hot encoding.
* `RainToday` should be converted to binary (`0`, `1`).
* `RainTomorrow` is the target attribute.
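To make the two encodings in the list above concrete, here is a minimal sketch in plain Python. It assumes the wind-direction attributes use the 16-point compass; the helper names `one_hot` and `yes_no_to_binary` are hypothetical and exist only for this illustration.

```python
# Assumed category list: the 16 compass points
DIRECTIONS = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
              'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']

def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

def yes_no_to_binary(value):
    """Map 'No' -> 0 and 'Yes' -> 1."""
    return {'No': 0, 'Yes': 1}[value]

print(one_hot('NE', DIRECTIONS))
print(yes_no_to_binary('Yes'))
```

One-hot encoding avoids implying an order between directions, which a single integer code (e.g., `N` = 0, `NNE` = 1, ...) would do.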

### Missing values

In [None]:
print('Total number of missing values: ')
print(weather.isnull().sum().sort_values(ascending=False))

In [None]:
print('Percentage of missing values: ')
print((weather.isnull().sum().sort_values(ascending=False) / len(weather)) * 100)

In [None]:
missing_counts = weather.isnull().sum().sort_values(ascending=True)
missing_counts.plot.barh(figsize=(10,8), title = 'Total number of missing values by attribute');

* There are a large number of missing values, particularly for `Sunshine`, `Evaporation`, `Cloud3pm`, and `Cloud9am` (missingness ranges from 37.7% to 47.7%).
* I will need to impute these values with something else (e.g., zero, the mean, or the median).
* The target attribute `RainTomorrow` has no missing values.
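The imputation options listed above can be sketched with the standard library. The `Sunshine` readings below are invented for illustration, with `None` standing in for a missing value.

```python
import statistics

# Hypothetical Sunshine readings in hours; None marks a missing value
sunshine = [0.0, 7.5, None, 10.2, None, 8.1, 11.0]

observed = [v for v in sunshine if v is not None]
median = statistics.median(observed)   # less sensitive to extreme values
mean = statistics.mean(observed)       # pulled down here by the 0.0 reading

# Replace each missing value with the median of the observed values
imputed = [median if v is None else v for v in sunshine]
print(median, round(mean, 2))
print(imputed)
```

For skewed attributes the median is usually the safer default, because a few extreme observations can shift the mean noticeably.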

### Create a Test Set

Before exploring the data further, I will create a test set.

In [None]:
# Random sampling method
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(weather, test_size=0.2, random_state=42)

I think `RainToday` will be an important attribute to predict `RainTomorrow`. I want to make sure the test set is representative of `RainToday` in the whole dataset.

In [None]:
(test_set['RainToday'].value_counts() / len(test_set)) * 100

In [None]:
(weather['RainToday'].value_counts() / len(weather)) * 100
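The split above relies on purely random sampling and then checks the proportions afterwards. An alternative is stratified sampling, which fixes the proportions by construction (scikit-learn's `train_test_split` accepts a `stratify` argument for this). The sketch below shows the idea in plain Python; `stratified_split` is a hypothetical helper and the labels are made up with a 78/22 split.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each label sampled at the same rate."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical labels (illustrative only, not from the dataset)
labels = ['No'] * 780 + ['Yes'] * 220
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

With a dataset this large, random sampling is likely to give nearly the same proportions anyway, which is what the two cells above check.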

### Correlations

In [None]:
weather = train_set.copy()

In [None]:
corr_matrix = weather.corr()
corr_matrix

In [None]:
temp = weather[['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm']]
temp.corr()

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(temp, figsize=(15,12), alpha=0.05, s=5);

In [None]:
wind = weather[['WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm']]
wind.corr()

In [None]:
scatter_matrix(wind, figsize=(15,10), alpha=0.1);

In [None]:
humidity = weather[['Humidity9am', 'Humidity3pm']]
humidity.corr()

In [None]:
scatter_matrix(humidity, figsize=(12,8), alpha=0.02);

* Every pair among `MinTemp`, `MaxTemp`, `Temp9am`, and `Temp3pm` is moderately (0.5 < r < 0.75) to strongly (r > 0.75) correlated.
* The strongest association is between `MaxTemp` and `Temp3pm` (r = 0.98), followed by `MinTemp` and `Temp9am` (r = 0.90).
* `WindGustSpeed`, `WindSpeed9am`, and `WindSpeed3pm` are moderately associated with each other (0.5 < r < 0.75).
* There is a moderate relationship between `Humidity9am` and `Humidity3pm` (0.5 < r < 0.75).
* The relationship between `Humidity9am` and `Humidity3pm` is not strictly linear.
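The values of r quoted above are Pearson correlation coefficients, which only capture *linear* association. The sketch below computes r by hand on made-up numbers to show why a relationship that is not strictly linear can have a modest r even when the attributes are clearly related; `pearson_r` is a hypothetical helper written for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: a linear relationship and a symmetric curved one
x = [1, 2, 3, 4, 5, 6]
linear = [2 * v + 1 for v in x]
curved = [(v - 3.5) ** 2 for v in x]

print(round(pearson_r(x, linear), 3))   # perfectly linear relationship
print(round(pearson_r(x, curved), 3))   # strong dependence, but no linear trend
```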

In [None]:
weather['RainTomorrow'].value_counts()

In [None]:
# RainTomorrow and RainToday values must be transformed from text (Yes, No) to numbers (0, 1) before correlations can be computed
make_binary = {'RainTomorrow': {'No': 0, 'Yes': 1},
               'RainToday': {'No': 0, 'Yes': 1}
              }
weather.replace(make_binary, inplace=True)

In [None]:
corr_matrix = weather.corr()
corr_matrix['RainTomorrow'].sort_values(ascending=False)

The most promising attribute to predict whether or not it will rain tomorrow is `Sunshine`, followed by `Humidity3pm` and `Cloud3pm`. Interestingly, `MinTemp` and `Temp9am` have almost no linear relationship with `RainTomorrow`, and `RainToday` had a weaker association than I expected.

## Prepare the Data

Separate the predictors and the label:

In [None]:
weather = train_set.drop('RainTomorrow', axis=1)
weather_labels = train_set['RainTomorrow'].copy()

Create the transformation pipelines and apply them to the attributes:

In [None]:
# Transformation pipeline for numerical attributes
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')), # impute missing values with median
    ('minmax_scaler', MinMaxScaler()),             # scale features
])

In [None]:
# Transformation pipeline for categorical attributes
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')), # impute missing values with mode
    ('cat_encoder', OneHotEncoder())                      # convert text to numbers
])

In [None]:
# Apply transformations
from sklearn.compose import ColumnTransformer

num_attribs = weather.select_dtypes('float64').columns
cat_attribs = weather.select_dtypes('object').columns

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

train_set_prepared = full_pipeline.fit_transform(weather)

In [None]:
train_set_prepared

## SGD Classifier

I will use a stochastic gradient descent (SGD) classifier first, followed by a random forest classifier. After evaluating the performance of the SGD classifier using a confusion matrix and precision and recall, I will compare the performance of the two classifiers using a ROC curve.

In [None]:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(train_set_prepared, weather_labels) # fit model to transformed training set

In [None]:
some_data = weather.iloc[:5] # first five training instances, without the labels
some_labels = weather_labels.iloc[:5] # labels for the same five instances
some_data_prepared = full_pipeline.transform(some_data) # transform first five instances

In [None]:
print("Predictions:", sgd_clf.predict(some_data_prepared)) # predictions made by model
print("Labels:", list(some_labels)) # labels from training set

In [None]:
sgd_clf.predict(train_set_prepared)

In [None]:
len(sgd_clf.predict(train_set_prepared))

## Evaluate

### Performance Measures

Evaluating a classifier is often trickier than evaluating a regressor. Here are some options for evaluating the performance of my binary classifier:
1. Cross-validation
1. Confusion matrix
1. Precision and recall
1. The ROC curve

### Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, train_set_prepared, weather_labels, cv=3, scoring='accuracy')

Accuracy is 85% (ratio of correct predictions) on all cross-validation folds.

How does this compare to a dumb classifier that predicts no rain every day? It rains about 22% of the time, so if I always guessed no rain, I would be correct approximately 78% of the time.

### Confusion Matrix

A confusion matrix is a better way to evaluate the performance of a classifier.

In [None]:
from sklearn.model_selection import cross_val_predict
weather_labels_pred = cross_val_predict(sgd_clf, train_set_prepared, weather_labels, cv=3)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(weather_labels, weather_labels_pred)

The first row in the output considers instances in the training set where `RainTomorrow` is equal to `No`:
* 84,793 instances were correctly classified as `No`.
* 3,425 instances were wrongly classified as `Yes`.

Of the instances where `RainTomorrow` is equal to `Yes` (the second row):
* 13,569 were wrongly classified as `No`.
* 11,967 were correctly classified as `Yes`.
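As a quick check, the four counts above reproduce both the cross-validated accuracy and the always-`No` baseline discussed earlier. This is plain arithmetic on the reported numbers, not a new result.

```python
# Counts reported in the confusion matrix above
tn, fp = 84793, 3425    # actual No:  predicted No / predicted Yes
fn, tp = 13569, 11967   # actual Yes: predicted No / predicted Yes

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of correct predictions
baseline = (tn + fp) / total   # always predicting No is right for every actual No

print(f"accuracy: {accuracy:.3f}")
print(f"always-No baseline: {baseline:.3f}")
```

The gap between the two figures is what the model actually adds over guessing `No` every day.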

### Precision and Recall

Precision and recall summarise the confusion matrix more concisely.

In [None]:
# precision = TP / (TP + FP) (the accuracy of positive predictions)
# TP is the number of true positives, and FP is the number of false positives
from sklearn.metrics import precision_score, recall_score
precision_score(weather_labels, weather_labels_pred, average='binary', pos_label='Yes') # == 11967 / (11967 + 3425)

In [None]:
# recall = TP / (TP + FN)
recall_score(weather_labels, weather_labels_pred, average='binary', pos_label='Yes') # == 11967 / (11967 + 13569)

When the model predicts it will rain tomorrow, it is correct about 78% of the time. Moreover, it correctly classifies 47% of instances in the training set where `RainTomorrow` is equal to `Yes` (i.e., by predicting `Yes`). Or, stated alternatively, 53% of the time the model predicts `No` for `RainTomorrow` when it should be `Yes`.

These numbers may seem disappointing, but at least the model is correctly predicting nearly 50% of `Yes` instances. Recall the dumb classifier predicts `No` for `RainTomorrow` for *all* instances. This means it correctly identifies 0% of days where it rains the following day in the training set. In comparison to this, the numbers above don't seem so bad!

The F₁ score is the harmonic mean of precision and recall and can be used to compare two classifiers. An F₁ score is only high when both precision *and* recall are high, which is not always what we want, because some tasks favour one over the other.
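A small sketch of the harmonic mean, using the counts from the confusion matrix reported earlier. The helper `f1` is hypothetical and written only for this illustration; the second `print` uses made-up values to show how the score is pulled towards the smaller of the two inputs.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall implied by the confusion matrix counts reported earlier
tp, fp, fn = 11967, 3425, 13569
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(round(f1(precision, recall), 3))
# Made-up values: harmonic mean versus arithmetic mean for an unbalanced pair
print(round(f1(0.9, 0.1), 3), round((0.9 + 0.1) / 2, 3))
```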

In [None]:
from sklearn.metrics import f1_score
f1_score(weather_labels, weather_labels_pred, average='binary', pos_label='Yes')

In this context, it is more important to predict days it is going to rain rather than days it *isn't* going to rain. So, I don't mind sacrificing precision in order to increase recall (I want to correctly detect more instances in the training set where `RainTomorrow` is equal to `Yes`).

In [None]:
weather_labels_scores = cross_val_predict(sgd_clf, train_set_prepared, weather_labels, cv=3,
                                          method='decision_function') # returns decision scores instead of predictions

In [None]:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(weather_labels, weather_labels_scores,
                                                         pos_label='Yes')

In [None]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.title('Precision and recall by the decision threshold')
    plt.legend()
    plt.xlabel("Threshold")
    plt.ylabel("Proportion")
    plt.axis([-4, 3, 0, 1])
    
plt.figure(figsize=(8,6))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds);

In [None]:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-")
    plt.title('Precision by recall')
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8,6))
plot_precision_vs_recall(precisions, recalls)
plt.plot([0.8, 0.8], [0., 0.525], "r:")
plt.plot([0.0, 0.8], [0.525, 0.525], "r:")
plt.plot([0.8], [0.525], "ro");

Precision starts to fall sharply around 80% recall. This is where I will set the threshold.

In [None]:
threshold_80_recall = thresholds[np.argmin(recalls >= 0.80)]
threshold_80_recall

In [None]:
weather_labels_pred_80 = (weather_labels_scores >= threshold_80_recall)
weather_labels_pred_80

In [None]:
# New predictions using threshold of -1.06 are boolean (False = No; True = Yes)
# Convert labels on training set to boolean to allow for calculation of precision and recall
weather_labels_arr = weather_labels.to_numpy()
weather_labels_bool = weather_labels_arr == 'Yes'

In [None]:
precision_score(weather_labels_bool, weather_labels_pred_80)

In [None]:
recall_score(weather_labels_bool, weather_labels_pred_80)

As expected, with the new threshold the model correctly detects 80% of the instances in the training set where `RainTomorrow` is equal to `Yes`. Therefore, 20% of the time the model predicts `No` when `RainTomorrow` is `Yes`. Additionally, because the threshold has been lowered, when the model predicts rain the following day, it is now correct only 52% of the time. So, it is more cautious about missing rain than the first model.
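The trade-off can be reproduced on a toy example. The sketch below uses made-up decision scores and labels, and picks the largest threshold that still reaches a target recall. It is a simplified stand-in for the `precision_recall_curve` approach used above, and both helper functions are hypothetical.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

def highest_threshold_for_recall(target, scores, labels):
    """Largest candidate threshold whose recall is still at least `target`."""
    for t in sorted(set(scores), reverse=True):
        if precision_recall_at(t, scores, labels)[1] >= target:
            return t
    return min(scores)

# Made-up decision scores and true labels (True = rain the next day)
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [False, False, True, False, True, False, True, True, True, True]

t = highest_threshold_for_recall(0.80, scores, labels)
print(t, precision_recall_at(t, scores, labels))
print(1.0, precision_recall_at(1.0, scores, labels))
```

Lowering the threshold lets more true rain days through (higher recall) at the cost of more false alarms (lower precision).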

### The ROC Curve

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(weather_labels, weather_labels_scores, pos_label='Yes')

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Recall)')
    plt.grid(True)

plt.figure(figsize=(8,6))
plot_roc_curve(fpr, tpr);

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(weather_labels, weather_labels_scores)

Based on this, if recall is 80%, the false positive rate (the proportion of `No`s that are incorrectly classified as `Yes`) is just over 20%. I'm satisfied with this.
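The false positive rate can also be cross-checked from figures already reported: the class counts in the confusion matrix and the precision and recall at the adjusted threshold. Because the sketch uses the rounded values quoted in the text (80% and 52%), the result is only approximate.

```python
# Class counts in the training set, from the confusion matrix reported earlier
n_no = 84793 + 3425      # actual No
n_yes = 13569 + 11967    # actual Yes

# Rounded figures quoted in the text for the adjusted threshold
recall, precision = 0.80, 0.52

tp = recall * n_yes                     # true positives implied by the recall
fp = tp * (1 - precision) / precision   # from precision = tp / (tp + fp)
fpr = fp / n_no                         # false positives as a share of actual No

print(f"approximate FPR: {fpr:.3f}")
```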

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, train_set_prepared, weather_labels, cv=3, 
                                    method="predict_proba")

In [None]:
y_scores_forest = y_probas_forest[:, 1]   # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(weather_labels, y_scores_forest, pos_label='Yes')

In [None]:
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show();

In [None]:
roc_auc_score(weather_labels, y_scores_forest)

In [None]:
# Need predictions for calculating precision and recall
y_train_pred_forest = cross_val_predict(forest_clf, train_set_prepared, weather_labels, cv=3)

In [None]:
precision_score(weather_labels, y_train_pred_forest, pos_label='Yes')

In [None]:
recall_score(weather_labels, y_train_pred_forest, pos_label='Yes')

The random forest classifier performs slightly better than the SGD classifier. If I wanted precision and recall similar to those of the SGD classifier, I would have to adjust the threshold.

In [None]:
precisions_forest, recalls_forest, thresholds_forest = precision_recall_curve(weather_labels, y_scores_forest,
                                                                              pos_label='Yes')

In [None]:
plt.figure(figsize=(8,6))
plot_precision_vs_recall(precisions_forest, recalls_forest)
plt.plot([0.8, 0.8], [0., 0.54], "r:")
plt.plot([0.0, 0.8], [0.54, 0.54], "r:")
plt.plot([0.8], [0.54], "ro");

In [None]:
threshold_80_recall = thresholds_forest[np.argmin(recalls_forest >= 0.80)]
threshold_80_recall

In [None]:
y_scores_forest_pred_80 = (y_scores_forest >= threshold_80_recall)
precision_score(weather_labels_bool, y_scores_forest_pred_80)

In [None]:
recall_score(weather_labels_bool, y_scores_forest_pred_80)

Precision and recall are similar to those of the SGD classifier (with the adjusted threshold). However, at about the same recall, precision is about 2 percentage points higher with the random forest classifier.

In [None]:
cross_val_score(forest_clf, train_set_prepared, weather_labels, cv=3, scoring='accuracy')

Accuracy is 85% (ratio of correct predictions) on all cross-validation folds. Compared with the SGD classifier, it is about 0.001% better.

## Fine-Tune the Model

I will use `GridSearchCV` to experiment with different combinations of hyperparameters.

### Grid Search

#### SGD Classifier

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
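    # Note: 'verbose' only controls how much is logged during fitting; it does not change the fitted model
    {'alpha': [0.0001, 0.001, 0.01], 'verbose': [0, 1, 10, 100], 'shuffle': [True, False]}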
]

sgd_clf = SGDClassifier(random_state=42)

grid_search = GridSearchCV(sgd_clf, param_grid, cv=3, scoring='accuracy', return_train_score=True)

grid_search.fit(train_set_prepared, weather_labels)

In [None]:
grid_search.best_params_

These are the default values for these parameters.

In [None]:
cvres = grid_search.cv_results_
cvres

In [None]:
cvres['mean_test_score']

#### Random Forest Classifier

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30, 100], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [True, False], 'n_estimators': [3, 10, 100], 'max_features': [2, 3, 4]} 
]

forest_clf = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(forest_clf, param_grid, cv=3, scoring='accuracy', return_train_score=True)

grid_search.fit(train_set_prepared, weather_labels)

In [None]:
grid_search.best_params_

The default values for the hyperparameters are satisfactory.
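For a sense of the cost of these searches, the sketch below counts the combinations in the two grids used above. `GridSearchCV` fits one model per combination per fold, plus a final refit on the whole training set by default. `n_combinations` is a hypothetical helper written for this illustration.

```python
from itertools import product

def n_combinations(param_grid):
    """Count the hyperparameter combinations across a list of grid dicts."""
    return sum(len(list(product(*grid.values()))) for grid in param_grid)

sgd_grid = [
    {'alpha': [0.0001, 0.001, 0.01], 'verbose': [0, 1, 10, 100], 'shuffle': [True, False]}
]
forest_grid = [
    {'n_estimators': [3, 10, 30, 100], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [True, False], 'n_estimators': [3, 10, 100], 'max_features': [2, 3, 4]}
]

cv = 3
for name, grid in [('SGD', sgd_grid), ('Random forest', forest_grid)]:
    combos = n_combinations(grid)
    print(f"{name}: {combos} combinations, {combos * cv} fits with cv={cv}")
```

Since `verbose` only affects logging, the four values in the SGD grid repeat the same fit four times; dropping it would cut that search to a quarter of the work.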

## Summary and Key Findings

* I wanted to predict whether or not it will rain tomorrow.
* I performed an EDA on weather data from 49 different weather stations across Australia for the past 10 years.
* I selected and trained two models: an SGD classifier and a random forest classifier.
* Both models had an accuracy of about 85% on all cross-validation folds of the training set.
* I lowered the decision threshold to reach 80% recall, which resulted in a precision of 54% for the random forest classifier.
* I tried different combinations of hyperparameters for both models but the best combinations were the default values.
* After further experimentation with feature engineering, I will be ready to measure performance on the test set to estimate the generalisation error.