# Using ML to reliably predict next-day rainfall
**A time series analysis to uncover possible determinants and successfully predict next-day rainfall**

![image.png](https://images.unsplash.com/photo-1485797460056-2310c82d1213?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1350&q=80)

## Table of contents
1. [Introduction](#Introduction)
2. [Objectives](#Objectives)
3. [Setup](#Setup)
4. [Data Understanding](#Data-Understanding)
5. [Data Preparation](#Data-Preparation)
6. [Data Exploration](#Data-Exploration)
7. [Data Analysis](#Data-Analysis)
8. [Conclusion](#Conclusion)

## Introduction

Today's weather forecasting methods rely on observing the current weather situation and collecting vast amounts of data to derive a forecast. In short-range forecasting, the evolution of current weather systems is tracked to predict their future movement based on the dynamics of the atmosphere. [[1]]
Fields that could benefit from reliable next-day rain forecasts include aerospace and agriculture, where tasks need to be scheduled properly.



[1]:https://www.weather.gov/car/weatherforecasting

## Objectives

This notebook's goals are defined as follows:
1. Visualize multivariate relationships that determine next-day rainfall
2. Build a statistical classification model across all locations and compare results
  1. Note that the target variable is imbalanced
  2. Try different strategies for missing values
  3. Implement feature engineering such as days without rain and season
  4. Test oversampling
3. Draw a conclusion

## Setup
We do a basic setup of the notebook to get the source data from Kaggle using the Kaggle API. A Google Drive account with a kaggle.json credentials file in the /kaggle directory of the drive is required. We then load the required libraries, create a pandas dataframe from the source and generate a profile report [[2]]

[2]: https://github.com/pandas-profiling/pandas-profiling

In [None]:
%matplotlib inline
#%load_ext rpy2.ipython #loads ipython R Kernel
#!pip install pandas-profiling[notebook] --upgrade #upgrade pandas profiling to stable version
#from google.colab import drive
#DRIVE_DIR = ''
#drive.mount('/content/drive')

We create a new subdirectory for the project, use it as working directory, download the dataset and finally unzip it.  

In [None]:
#create subdirectories and load the dataset from kaggle 
#import os

project = 'RainAustralia/'
kaggle = '/content/drive/MyDrive/kaggle/'
#os.environ['KAGGLE_CONFIG_DIR'] = kaggle
#!kaggle datasets download -d jsphyg/weather-dataset-rattle-package 


For the last setup step, the data science libraries are imported and a pandas dataframe is created from the CSV source file.

In [None]:
#import necessary libraries and create the dataframe from source
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
#from pandas_profiling import ProfileReport
from sklearn import tree
#use liblinear svm for linear scaling complexity
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import StackingClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import RandomizedSearchCV
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
#ignore SVC Convergence Warnings
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
#profile = ProfileReport(df, title="Rain in Australia", explorative=True)
#profile.to_file('report.html')

## Data Understanding

We get a first glance using the generated profile report 'report.html' in the working directory.

Our dataset contains 145,460 observations and represents a time series of a wide range of meteorological data for each of the 49 locations, most of them ranging from 2008-12-01 to 2017-07-01.

The target variable RainTomorrow is heavily imbalanced with only ~22.5% being true and ~77.5% being false. A useful model therefore has to score above the null accuracy of 0.775.
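
To make the null accuracy concrete, here is a minimal sketch on a synthetic label vector. The vector `y` is made up to mirror the class shares quoted above and is not the real data. The null accuracy is simply the share of the majority class, which is the score of a model that always predicts "no rain".

```python
import pandas as pd

# Synthetic target with the class shares quoted above (assumption, not the real data)
y = pd.Series([1] * 225 + [0] * 775, name="RainTomorrow")

class_shares = y.value_counts(normalize=True)   # share of each class
null_accuracy = class_shares.max()              # always predict the majority class

print(class_shares)
print(f"Null accuracy: {null_accuracy:.3f}")
```

Any classifier in this notebook should beat this value; otherwise it adds nothing over the trivial rule.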

If we exclude Date, Location and the dependent target variable RainTomorrow, there are 20 features to consider in our model. Of these 20 features, 16 are numerical, 3 categorical and 1 boolean.

Given that 10% of the cells in the dataset are missing, a strategy to handle those NaN values needs to be implemented.
Special attention has to be paid to Evaporation, Sunshine, Cloud9am and Cloud3pm, as 38% to 48% of their values are missing.

The features Rainfall, Sunshine, WindSpeed9am, Cloud9am and Cloud3pm contain zero values. Highly correlated feature pairs are MinTemp and Temp9am, MaxTemp and Temp3pm, and Pressure9am and Pressure3pm.



In [None]:
df.head()

## Data Preparation
To use proper datetime slicing, the Date feature has to be converted to datetime first. Furthermore, we replace the Yes and No values of RainToday and RainTomorrow with 1 and 0 respectively, drop the rows without a target value and cast RainTomorrow to integer.

In [None]:
#Convert columns to more suitable data types
df['Date'] = pd.to_datetime(df['Date'])
df = df.replace('Yes', 1)
df = df.replace('No', 0)
df = df[df['RainTomorrow'].notna()]
df['RainTomorrow'] = df['RainTomorrow'].astype('int')

In [None]:
df.isna().sum()

For exploration we focus on a smaller subset of the dataset. Therefore we create a dictionary of dataframes with one entry per location, using the location name as key.

In [None]:
#Get a dataset for every location and use the date as unique index
dfs = {}
for line in set(df['Location'].values):
   dfs[line] = df[df['Location'] == line]
   dfs[line] = dfs[line].set_index('Date')
   dfs[line] = dfs[line].drop('Location', axis=1)
keys = dfs.keys()
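
The same per-location dictionary can also be built with `groupby`. The following is a minimal sketch on a toy frame; the values are hypothetical and only the column names follow the dataset. It is an alternative formulation, not the code used for the results in this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names follow the dataset
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-01"]),
    "Location": ["Albury", "Albury", "Sydney"],
    "Rainfall": [0.0, 1.2, 3.4],
})

# One DataFrame per location, indexed by date, without the Location column
toy_dfs = {
    name: group.set_index("Date").drop(columns="Location")
    for name, group in toy.groupby("Location")
}

print(toy_dfs["Albury"])
```

Grouping once avoids filtering the full dataframe separately for every location.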

## Data Exploration
In the first plot we take a look at the balance of our target variable RainTomorrow. The graph depicts location-based differences in the distribution, with Portland as the location with the most rain at a 37% ratio and Woomera as the location with the least rain at a 7% ratio. The median rain ratio is 22%. We can tell that our target variable is imbalanced and heavily location dependent.

In [None]:
%matplotlib inline
#set style and figure size in one call (a later sns.set would reset the style)
sns.set_theme(style="whitegrid", rc={'figure.figsize':(16,9)})

#plot share of rainy days (target RainTomorrow) by location
rain_ratio = [dfs[line]['RainTomorrow'].mean() for line in keys]
data = pd.DataFrame({'Percentage': rain_ratio, 'Location': list(keys)})
data = data.sort_values(by=['Percentage'], ascending=False)

sns.barplot(x='Percentage',y='Location', data=data).set_title('Rain to no rain ratio by location')

Next we take a look at the missing cells by dividing a location's number of missing cells by its total number of cells. We can see that airports are among the locations with the fewest missing cells. Given that Melbourne Airport has the fewest missing cells and a rain ratio close to the median, we will use it as a sample for the explorative analysis.

In [None]:
#Plot missing values by location
na_values = [dfs[line].isna().sum().sum() for line in keys]
#total number of cells per location
values = [dfs[line].shape[0] * dfs[line].shape[1] for line in keys]
data = pd.DataFrame({'Location': list(keys), 'Count_NA' : na_values, 'Count' : values })
#share of missing cells in percent
data['Percentage'] = data['Count_NA'] / data['Count'] * 100
data = data.sort_values(by=['Percentage'])


plt.hlines(y=data['Location'], xmin = 0, xmax=data['Percentage'], color='skyblue')
plt.plot(data['Percentage'], data['Location'], "D")
plt.yticks(data['Location'])
plt.title('Percentage of missing values by location')
plt.show()


To reduce complexity we limit the sample timespan to 2 years.

In [None]:
sample = 'MelbourneAirport'
start = pd.Timestamp(2015,7,1)
end = pd.Timestamp(2017,7,1)

We start with our continuous features by examining several graphs to recognize any seasonal effects influencing the target variable. We focus on the features with noticeable effects. Note that we don't show every possible graph, as some features don't show seasonality or their seasonality is somewhat redundant (like Temp9am and Temp3pm).
It is to be expected that the -3pm features are better at discriminating next-day rainfall, as they are closer in time to the next day than their -9am counterparts. It turns out that this effect is nevertheless smaller than expected.

- Starting with Temp3pm (top left), we can clearly make out the seasonal effect on temperature, with peaks in January to February of each year. Some outliers are noticeable: if Temp3pm is significantly lower during a high-temperature period, it tends to rain the next day.
- Moving to Pressure3pm (top right), the graph is a little more scattered than Temp3pm, but we can see some inverse behaviour to Temp3pm. It is noticeable that it tends to rain the next day when Pressure3pm is lower than average.
- Inspecting Sunshine (bottom left), we notice the same seasonal behaviour as for Temp3pm on days with high sunshine, although this does not transfer to days with low sunshine, which seem to be much more independent of seasonality. Nevertheless, RainTomorrow is noticeably more frequent on days with very low (0-2) sunshine levels.
- Finally, Humidity3pm (bottom right) shows some seasonality, much like Pressure3pm. We can see that observations with high humidity are more likely to be followed by rain the next day.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(20, 10))
sns.scatterplot(data = dfs[sample].loc[dfs[sample].index > start] , x='Date', y='Temp3pm', hue='RainTomorrow', ax=axes[0,0])
sns.scatterplot(data = dfs[sample].loc[dfs[sample].index > start] , x='Date', y='Pressure3pm', hue='RainTomorrow', ax=axes[0,1])
sns.scatterplot(data = dfs[sample].loc[dfs[sample].index > start] , x='Date', y='Sunshine', hue='RainTomorrow', ax=axes[1,0])
sns.scatterplot(data = dfs[sample].loc[dfs[sample].index > start] , x='Date', y='Humidity3pm', hue='RainTomorrow', ax=axes[1,1])


The distribution and characteristics of the Cloud3pm feature become much more visible if we remove the time axis and display them using a boxplot. We can thereby identify that RainTomorrow is most likely at a cloud level greater than or equal to 6. We use the same technique to depict that RainTomorrow is more likely on days with lower sunshine levels.

In [None]:
fig, axes = plt.subplots(1,2, figsize=(20, 10))
sns.boxplot(data = dfs[sample].loc[dfs[sample].index > start] , y='Sunshine', x='RainTomorrow', ax=axes[0])
sns.boxplot(data = dfs[sample].loc[dfs[sample].index > start] , y='Cloud3pm', x='RainTomorrow', ax=axes[1])

Next we can also conclude that high humidity and low pressure tend to be followed by rain the next day.

In [None]:
fig, axes = plt.subplots(1,2, figsize=(20, 10))
sns.boxplot(data = dfs[sample].loc[dfs[sample].index > start] , y='Humidity3pm', x='RainTomorrow', ax=axes[0])
sns.boxplot(data = dfs[sample].loc[dfs[sample].index > start] , y='Pressure3pm', x='RainTomorrow', ax=axes[1])

Moving to categorical features we can point out that:
1. South (S) is the most common wind direction, with a low RainTomorrow rate.
2. Also common are SE and SSE, with an even lower RainTomorrow rate.
3. N is the 2nd most common direction, with a much higher rain ratio.
4. If it has rained today, RainTomorrow is much more common.

In [None]:
order = dfs[sample]['WindGustDir'].value_counts()
order = order.keys()



fig, axes = plt.subplots(1, 2, figsize=(16, 9))
sns.countplot(data = dfs[sample].loc[dfs[sample].index > start] , x='WindDir3pm', ax=axes[0], hue='RainTomorrow', order=order)
sns.countplot(data = dfs[sample].loc[dfs[sample].index > start] , x='RainToday', hue='RainTomorrow', ax=axes[1])

After taking a look at univariate behaviour, we conclude the explorative analysis by examining bivariate relationships with the help of a pairplot of some chosen features. A first interesting behaviour to point out is the correlation of MaxTemp with Temp9am (column 2, row 1), with RainTomorrow being mostly true below what would be a linear regression line.

In [None]:
sns.pairplot(data = dfs[sample].loc[dfs[sample].index > start][['Temp9am', 'MaxTemp', 'Pressure3pm', 'Humidity3pm', 'Rainfall','RainTomorrow']], hue='RainTomorrow')

## Data Analysis

### Base Models

First we test the performance of label encoding against one-hot encoding on the baseline classifiers decision tree, support vector machine and kNN. We use a 5-fold cross-validation accuracy score and compare different hyperparameters at the same time.

In [None]:
#function to onehot encode every category
def get_Xy_oh(_df):
  data = _df.dropna().copy()
  data['RainToday'] = data['RainToday'].astype('int')
  X = data.drop(['RainTomorrow','Date'], axis=1)
  y = data['RainTomorrow']

  for item in ['Location','WindGustDir','WindDir9am','WindDir3pm']:
    temp = pd.get_dummies(X[item],prefix=item[4:])
    X = X.join(temp)
    X = X.drop(item, axis=1)

  return train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
#function to label encode every category
def get_Xy_le(_df):
  le = LabelEncoder()
  data = _df.dropna().copy()
  data['RainToday'] = data['RainToday'].astype('int')
  X = data.drop(['RainTomorrow','Date'], axis=1)
  y = data['RainTomorrow']

  for item in ['Location','WindGustDir','WindDir9am','WindDir3pm']:
    X[item] = le.fit_transform(X[item])

  return train_test_split(X, y, test_size=0.2, random_state=42)
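
To illustrate the difference between the two encodings used by `get_Xy_oh` and `get_Xy_le`, here is a small sketch on a made-up series of wind directions (the values are hypothetical, not taken from the dataset).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical wind directions, not taken from the dataset
wind = pd.Series(["N", "S", "SE", "N"], name="WindDir3pm")

# Label encoding: one integer per category (alphabetical order of the classes)
labels = LabelEncoder().fit_transform(wind)

# One-hot encoding: one indicator column per category
onehot = pd.get_dummies(wind, prefix="WindDir3pm")

print(labels)
print(onehot)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, which distance-based and linear models interpret as a magnitude. One-hot encoding avoids that at the cost of one column per category.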

In [None]:
%%time
#Onehot Encoding
X_train, X_test, y_train, y_test = get_Xy_oh(df)
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
score_oh_dct = [cross_val_score(tree.DecisionTreeClassifier(max_depth=i), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for i in range(1,13)]
score_oh_svc = [cross_val_score(LinearSVC(C = C,), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for C in [1,10,100] ]
score_oh_knn = [cross_val_score(KNeighborsClassifier(n_neighbors=n), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for n in range(1,9,2) ]

#Label Encoding
X_train, X_test, y_train, y_test = get_Xy_le(df)
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
score_le_dct = [cross_val_score(tree.DecisionTreeClassifier(max_depth=i), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for i in range(1,13)]
score_le_svc = [cross_val_score(LinearSVC(C = C,), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for C in [1,10,100] ]
score_le_knn = [cross_val_score(KNeighborsClassifier(n_neighbors=n), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for n in range(1,9,2) ]

In [None]:
#plot everything
fig, axes = plt.subplots(1,3)

axes[0].plot(range(1,13), score_oh_dct, label="DCT OneHot")
axes[0].plot(range(1,13), score_le_dct, label="DCT LabelEnc")
axes[0].set_title('DCT Depth / Train Accuracy')
axes[0].legend()
axes[1].plot([1,10,100], score_oh_svc, label="SVC OneHot")
axes[1].plot([1,10,100], score_le_svc, label="SVC LabelEnc")
axes[1].set_title('SVC C Value / Train Accuracy')
axes[1].legend()
axes[2].plot(range(1,9,2),score_oh_knn, label="KNN OneHot")
axes[2].plot(range(1,9,2),score_le_knn, label="KNN LabelEnc")
axes[2].set_title('KNNC Neighbors / Train Accuracy')
axes[2].legend()
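
The cells above fit the `StandardScaler` on the full training set before cross-validation. A possible alternative, sketched here on synthetic data and not the setup used for the scores reported in this notebook, is to wrap scaler and classifier in a pipeline so that the scaling parameters are estimated inside each fold. The data from `make_classification` is only a stand-in for the weather features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the weather features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.78, 0.22], random_state=42
)

# The scaler is refitted on the training part of every fold,
# so the validation part never influences the scaling parameters
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.mean())
```

For standard scaling the difference is usually small, but the pipeline keeps the validation folds strictly unseen.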

Using Graphviz we can plot the decision Tree.

In [None]:
#import graphviz 
#clf = tree.DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
#dot_data = tree.export_graphviz(clf, out_file=None,filled=True, feature_names=X_train.columns, class_names=['NoRain','Rain']) 
#graph = graphviz.Source(dot_data, format='png') 
#graph.render("RainTomorrow") 

### Ensembles

In [None]:
%%time
#Onehot encoding
X_train, X_test, y_train, y_test = get_Xy_oh(df)

scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
n_estimators = [10,100,500]

score_oh_rf = [cross_val_score(RandomForestClassifier(class_weight='balanced', n_estimators=i), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for i in n_estimators]
score_oh_ada = [cross_val_score(AdaBoostClassifier(n_estimators = i), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for i in n_estimators ]
#if one has a gpu, the parameter tree_method='gpu_hist' can reduce computation time significantly
score_oh_xgb = [cross_val_score(XGBClassifier(use_label_encoder=False, tree_method='gpu_hist', n_estimators=i, n_jobs=-1, verbosity = 0), X_train, y_train, cv=5, scoring='accuracy').mean() for i in n_estimators]

#Label encoding
X_train, X_test, y_train, y_test = get_Xy_le(df)

scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)

score_le_rf = [cross_val_score(RandomForestClassifier(class_weight='balanced', n_estimators=i), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for i in n_estimators]
score_le_ada = [cross_val_score(AdaBoostClassifier(n_estimators = i), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean() for i in n_estimators ]
score_le_xgb = [cross_val_score(XGBClassifier(use_label_encoder=False, tree_method='gpu_hist', n_estimators=i, n_jobs=-1, verbosity = 0), X_train, y_train, cv=5, scoring='accuracy').mean() for i in n_estimators]

In [None]:
fig, axes = plt.subplots(1,3)

axes[0].plot(n_estimators, score_oh_rf, label="RF OneHot")
axes[0].plot(n_estimators, score_le_rf, label="RF LabelEnc")
axes[0].set_title('RF Estimators / Accuracy')
axes[0].legend()
axes[1].plot(n_estimators, score_oh_ada, label="ADA OneHot")
axes[1].plot(n_estimators, score_le_ada, label="ADA LabelEnc")
axes[1].set_title('ADA Estimators / Accuracy')
axes[1].legend()
axes[2].plot(n_estimators,score_oh_xgb, label="XGB OneHot")
axes[2].plot(n_estimators,score_le_xgb, label="XGB LabelEnc")
axes[2].set_title('XGB Estimators / Accuracy')
axes[2].legend()

For almost every model label encoding works better, although we achieved the best result using one-hot encoding with a larger boosted tree ensemble. We will use XGBoost as baseline model and try different strategies against this baseline.

In [None]:
#Onehot encoding
X_train, X_test, y_train, y_test = get_Xy_oh(df)

scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)

base = cross_val_score(XGBClassifier(use_label_encoder=False, tree_method='gpu_hist', n_estimators=500), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean()

In [None]:
base

An accuracy score of 86.5% is the base performance which we try to improve further.

### Missing Values

A first attempt to improve the accuracy score is to tackle the vast amount of missing values in the dataframe. The baseline model was trained with NaN rows dropped, which caused the sample size to go down from ~150k to ~53k. We explore the outcome of handling those missing values with two different strategies:
* Missing values are filled using the mean for every numerical feature and the mode for every categorical feature
* Missing values are imputed from their nearest neighbours using kNN imputation.

#### Mean/Modal Filling

In [None]:
df_missing = df.copy()
cols = df.columns
#select numeric columns via the public API instead of the private _get_numeric_data
numerical = df_missing.select_dtypes(include='number').columns
cat = list(set(cols) - set(numerical))

for var in numerical:
   df_missing[var]= df_missing[var].fillna(df_missing[var].mean()) 

for var in cat:
   df_missing[var]=df_missing[var].fillna(df_missing[var].mode()[0]) 

In [None]:
X_train, X_test, y_train, y_test = get_Xy_oh(df_missing)
print(base, cross_val_score(XGBClassifier(tree_method='gpu_hist', n_estimators=500, use_label_encoder=False, verbosity = 0), X_train, y_train, cv=5, scoring='accuracy').mean())

#### KNN Imputation

In [None]:
le = LabelEncoder()
data = df.copy()
data = data.drop('Date', axis=1)

for item in ['Location','WindGustDir','WindDir9am','WindDir3pm']:
  data[item] = data[item].fillna('NaN')
  data[item] = le.fit_transform(data[item])

sample = data.sample(frac=0.1,random_state=42)

imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')
imputer.fit(sample)
data = imputer.transform(data)

In [None]:
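#rebuild the dataframe with the original column names and index so that Date aligns row by row
data = pd.DataFrame(data, columns=df.drop('Date',axis=1).columns, index=df.index)
data['Date'] = df['Date']
data.head()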

In [None]:
X_train, X_test, y_train, y_test = get_Xy_oh(data)
print(base, cross_val_score(XGBClassifier(tree_method='gpu_hist', n_estimators=500, use_label_encoder=False, verbosity = 0), X_train, y_train, cv=5, scoring='accuracy').mean())
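
To show what `KNNImputer` does with the `nan_euclidean` metric, here is a minimal sketch on a tiny made-up matrix (the numbers are illustrative only). Each missing entry is replaced by the mean of that column over the nearest rows, where the distance is computed only on coordinates that both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny made-up matrix with two missing entries
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the mean of that column over the 2 nearest rows,
# where distance is computed on the coordinates both rows have observed
imputer_toy = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
X_filled = imputer_toy.fit_transform(X_toy)

print(X_filled)
```

In the first row the two nearest rows with an observed third column hold 3 and 5, so the gap becomes 4. In the third row the neighbours hold 3 and 8 in the first column, so the gap becomes 5.5.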

Both strategies for handling missing values slightly dragged down model performance. This suggests that our model can't be improved further by adding more, partly imputed, training data to our dataframe.

### Feature Engineering

For feature engineering we try two approaches to improve model performance:
* We use the Date column to derive the season of each sample. We try this because we detected seasonality in our exploratory analysis
* We create a feature which counts the days since the last rainfall occurred. We try this to help the model detect rainy or dry intervals

#### Season

We adapted a baseline function from w3resource to get the season of a date and reversed it for our Australian use case. [[3]]

[3]:https://www.w3resource.com/python-exercises/python-conditional-exercise-37.php

In [None]:
def get_season(date):
  month=date.month
  day=date.day
  if month in (1, 2, 3):
    season = 2
  elif month in (4, 5, 6):
    season = 3
  elif month in (7, 8, 9):
    season = 4
  else:
    season = 1
  if (month == 3) and (day > 19):
    season = 3
  elif (month == 6) and (day > 20):
    season = 4
  elif (month == 9) and (day > 21):
    season = 1
  elif (month == 12) and (day > 20):
    season = 2
  return(season)

In [None]:
df_season = df.dropna().copy()
df_season['Season'] = df_season['Date'].apply(get_season)
X_train, X_test, y_train, y_test = get_Xy_oh(df_season)
print(base, cross_val_score(XGBClassifier(n_estimators=500, tree_method='gpu_hist', use_label_encoder=False, verbosity = 0), X_train, y_train, cv=5, scoring='accuracy').mean())
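
The `get_season` function above switches seasons on specific days near the solstices and equinoxes. A simpler month-based variant for the southern hemisphere can be sketched as follows. This is an alternative illustration and not the function used for the score above; the names `SEASON_NAMES` and `month_to_season` are hypothetical.

```python
import pandas as pd

# Southern-hemisphere seasons by calendar month (meteorological convention):
# Dec-Feb summer, Mar-May autumn, Jun-Aug winter, Sep-Nov spring
SEASON_NAMES = ["summer", "autumn", "winter", "spring"]  # hypothetical labels

def month_to_season(dates: pd.Series) -> pd.Series:
    """Map a datetime Series to a season name using only the month."""
    season_idx = (dates.dt.month % 12) // 3   # Dec,Jan,Feb -> 0, Mar-May -> 1, ...
    return season_idx.map(lambda i: SEASON_NAMES[i])

demo_dates = pd.Series(pd.to_datetime(["2016-01-15", "2016-04-01", "2016-07-20", "2016-12-25"]))
print(month_to_season(demo_dates).tolist())
```

Because the mapping is vectorized over the whole column, it avoids calling a Python function once per row.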

#### N-Days since last rain 

We build on [[4]] to derive the days-since-last-rain feature. This leads to a cold start problem: every time series starts with a run of NaN values for this feature until the first day on which RainToday is true. We simply fill those NaNs with zeroes.


[4]:https://stackoverflow.com/questions/49910451/create-a-new-column-containing-time-since-last-event-in-pandas

In [None]:
df_norain = df.dropna().copy()
df_norain["NoRainFor"] = df_norain["Date"] - df_norain["Date"].where(df_norain["RainToday"].astype('boolean')).groupby(df_norain["Location"]).ffill()
df_norain["NoRainFor"] = df_norain["NoRainFor"].fillna(pd.Timedelta(seconds=0))
df_norain["NoRainFor"] = df_norain["NoRainFor"].dt.days.astype("int")

In [None]:
X_train, X_test, y_train, y_test = get_Xy_oh(df_norain)
print(base , cross_val_score(XGBClassifier(n_estimators=500, tree_method='gpu_hist', use_label_encoder=False,verbosity = 0), X_train, y_train, cv=5, scoring='accuracy').mean())
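
The days-since-last-rain computation is easier to follow on a toy example. The sketch below uses made-up observations for two hypothetical locations and mirrors the `where` / `groupby` / `ffill` steps from the cell above.

```python
import pandas as pd

# Made-up observations for two locations
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04",
                            "2017-01-01", "2017-01-02"]),
    "Location": ["A", "A", "A", "A", "B", "B"],
    "RainToday": [0, 1, 0, 0, 1, 0],
})

# Keep the date only on rainy days, then carry it forward within each location
last_rain = (
    toy["Date"].where(toy["RainToday"].astype(bool))
    .groupby(toy["Location"])
    .ffill()
)

# Days elapsed since the last rainy day; 0 before the first recorded rain
toy["NoRainFor"] = (toy["Date"] - last_rain).fillna(pd.Timedelta(0)).dt.days

print(toy)
```

Note that with the zero fill, a rainy day and a day before the first recorded rain both receive the value 0, so the feature cannot distinguish these two cases.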

We used feature engineering to create and evaluate the use of two new features. Neither of the two features had a positive impact on model performance. The model seems to already recognize seasonality effects considering the state of other features. 

### Feature Selection

Looking at the model's feature importances, the Location feature appears to rank above all other features. This suggests that fitting a separate model for every Location in our dataframe could score better accuracy. This is also shown in [[5]].

[5]:https://www.kaggle.com/amiromidvar/prediction-all-location-avg-score-0-96-smote

In [None]:
data = df.dropna().copy()
cnames = data.columns  
X_train, X_test, y_train, y_test = get_Xy_le(data)
clf = XGBClassifier(n_estimators=500, tree_method='gpu_hist', use_label_encoder=False, verbosity = 0).fit(X_train, y_train)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = X_train.columns
#print features from most to least important, keeping each name paired with its own importance
for i in indices:
    print(f"{feature_names[i]} - {importances[i]}")
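
When ranking importances it is easy to pair a sorted value with the wrong column name. One way to avoid this, sketched here with a random forest on synthetic data instead of XGBoost (a swapped-in model so that the example needs only scikit-learn, with hypothetical feature names), is to index the importances by column name before sorting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (stand-in for the weather frame)
X_arr, y_demo = make_classification(n_samples=300, n_features=5, n_informative=2,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Indexing the importances by column name keeps each value attached to its feature
ranking = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranking = ranking.sort_values(ascending=False)

print(ranking)
```

The same pattern works for any fitted estimator that exposes `feature_importances_`.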

Furthermore we attempted removing the highly correlated features Temp9am, Temp3pm and Pressure9am as suggested by [[6]]. Nevertheless our model performance didn't change significantly.

[6]:https://www.kaggle.com/purvitsharma/rain-in-australia-90-9-accuracy

In [None]:
data = data.drop(['Temp9am','Temp3pm','Pressure9am'],axis=1)

In [None]:
X_train, X_test, y_train, y_test = get_Xy_oh(data)
print(base, cross_val_score(XGBClassifier(n_estimators=500, tree_method='gpu_hist', use_label_encoder=False, verbosity = 0), X_train, y_train, cv=5, scoring='accuracy').mean())

### Oversampling

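Next we attempt oversampling using imblearn's oversampling tool SMOTE, also suggested in [[5]] and [[6]], and see a significant improvement in our model's training performance. Part of this gain is likely an artifact: the synthetic samples are generated before the cross-validation folds are split, so the validation folds contain points interpolated from the training folds. Accordingly, test accuracy did not improve, which again suggests that more training data will not improve model performance any further.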

[5]:https://www.kaggle.com/amiromidvar/prediction-all-location-avg-score-0-96-smote
[6]:https://www.kaggle.com/purvitsharma/rain-in-australia-90-9-accuracy

In [None]:
df_os = df.dropna().copy()
X_train, X_test, y_train, y_test = get_Xy_oh(df_os)
#oversample the minority class in the training split only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
count = Counter(y_train)
print(count)


In [None]:
a = cross_val_score(XGBClassifier(use_label_encoder=False, tree_method='gpu_hist', n_estimators=500, verbosity = 0), X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1).mean()

In [None]:
print(a)
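
A cross-validation score on oversampled data is only comparable to the baseline if the oversampling happens inside each fold. The sketch below shows this on synthetic data. It uses plain random oversampling with `sklearn.utils.resample` in place of SMOTE and a logistic regression in place of XGBoost, both swapped in so that the example needs only scikit-learn. It is an illustration of the idea, not the procedure used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: roughly 78% / 22% like RainTomorrow)
X_demo, y_demo = make_classification(n_samples=600, n_features=6,
                                     weights=[0.78, 0.22], random_state=1)

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, val_idx in cv.split(X_demo, y_demo):
    X_tr, y_tr = X_demo[train_idx], y_demo[train_idx]

    # Oversample the minority class in the training part only
    minority = X_tr[y_tr == 1]
    n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
    extra = resample(minority, replace=True, n_samples=n_extra, random_state=1)
    X_bal = np.vstack([X_tr, extra])
    y_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=y_tr.dtype)])

    # The validation part keeps its original class balance
    clf_demo = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    fold_scores.append(accuracy_score(y_demo[val_idx], clf_demo.predict(X_demo[val_idx])))

print(np.mean(fold_scores))
```

Scores obtained this way are measured on untouched validation data and can be compared directly with the baseline.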

### Parameter Tuning

Next we use a randomized search to improve accuracy by hyperparameter tuning.

In [None]:
%%time

params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [4, 5, 6, 7, 8],
        'n_estimators': [100, 300,500,700,900]
        }

df_os = df.dropna().copy()
X_train, X_test, y_train, y_test = get_Xy_oh(df_os)
xgb = XGBClassifier(use_label_encoder=False, verbosity = 0, tree_method='gpu_hist')
xgb_random = RandomizedSearchCV(estimator = xgb, param_distributions = params, n_iter = 125, cv = 2, verbose=2, random_state=42, n_jobs = -1, scoring='accuracy')
# Fit the random search model
xgb_random.fit(X_train, y_train)

In [None]:
print(xgb_random.best_params_)

In [None]:
f = cross_val_score(XGBClassifier(n_estimators=700, tree_method='gpu_hist', use_label_encoder=False, verbosity = 0 , colsample_bytree=1, gamma=5, max_depth=5, min_child_weight=1, subsample=1), X_train, y_train, cv=5, scoring='accuracy').mean()
print(base, f)
base = f

Using randomized search to tune hyperparameters, we improved accuracy only at the 4th decimal place.
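
To put the search budget into perspective, we can count how many combinations the parameter grid contains and which share of them 125 random draws cover. The sketch repeats the grid from the cell above and uses scikit-learn's `ParameterGrid` only for counting.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
params = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0.5, 1, 1.5, 2, 5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [4, 5, 6, 7, 8],
    "n_estimators": [100, 300, 500, 700, 900],
}

n_combinations = len(ParameterGrid(params))   # 3 * 5 * 3 * 3 * 5 * 5
n_iter = 125                                  # candidates drawn by the random search
coverage = n_iter / n_combinations

print(n_combinations, f"{coverage:.1%}")
```

The grid holds 3375 combinations, so the search evaluates about 3.7% of them, each with 2 folds.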

### Meta Learner

Finally we attempt to stack 3 of the best performing baseline models. As meta learner we choose a logistic regression. For the baseline models it is suggested that they all take different approaches [[7]]. Therefore we consider a support vector classifier, a random forest and our gradient boosting model XGBoost.

[7]:https://www.statsoft.de/glossary/M/MetaLearning.htm

In [None]:
base_models = [
    ('SVC'       , LinearSVC(C = 1)),
    ('XGB'       , XGBClassifier(n_estimators=700, tree_method='gpu_hist', use_label_encoder=False, verbosity = 0 , colsample_bytree=1, gamma=5, max_depth=5, min_child_weight=1, subsample=1)),
    ('RF'        , RandomForestClassifier(n_estimators=500))
]

In [None]:
%%time
X_train, X_test, y_train, y_test = get_Xy_oh(df)

scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)

meta_model = LogisticRegressionCV()
stacking_model = StackingClassifier(estimators=base_models, 
                                    final_estimator=meta_model, 
                                    passthrough=True, 
                                    cv=3)

print(base, cross_val_score(stacking_model, X_train, y_train, cv=3, scoring='accuracy', n_jobs=-1).mean())

Fitting the meta learner was computationally expensive, although we achieved a 0.03 higher accuracy score.

## Conclusion

Finally we check whether our model actually performs on the test data.

In [None]:
%%time
X_train, X_test, y_train, y_test = get_Xy_oh(df)
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
clf = stacking_model.fit(X_train, y_train)
X_test = scaler.transform(X_test)

yhat = clf.predict(X_test)
print("Accuracy :", accuracy_score(y_test, yhat))
cm = confusion_matrix(y_test, yhat)
print(classification_report(y_test, yhat))
#rows of the confusion matrix are actual classes, columns are predicted classes (label order 0, 1)
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], 
                                 index=['Actual Negative:0', 'Actual Positive:1'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
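
The orientation of scikit-learn's confusion matrix is easy to mix up, so here is a small sketch with made-up labels. Rows correspond to the actual classes and columns to the predicted classes, both in sorted label order (0, then 1).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()   # row-major order for the binary case

cm_frame = pd.DataFrame(cm_demo,
                        index=["Actual 0", "Actual 1"],
                        columns=["Predicted 0", "Predicted 1"])
print(cm_frame)
```

In this toy case there is one true negative, one false positive, one false negative and two true positives.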

Our model shows no sign of overfitting, because we score even slightly higher accuracy on the test data than in cross-validation on the training data.

In [None]:
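#use predicted probabilities so the curve is traced over all thresholds
y_score = clf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)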
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, linewidth=2)
plt.plot([0,1], [0,1], 'k--' )
plt.rcParams['font.size'] = 12
plt.title('ROC curve for RainTomorrow Meta learner')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
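
A ROC curve is built by sweeping a decision threshold over continuous scores. The following sketch, using made-up labels and probabilities, compares `roc_curve` on scores with `roc_curve` on labels that were already thresholded at 0.5.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]
y_hard_demo = [int(s >= 0.5) for s in y_score_demo]   # thresholded at 0.5

fpr_score, tpr_score, _ = roc_curve(y_true_demo, y_score_demo)
fpr_hard, tpr_hard, _ = roc_curve(y_true_demo, y_hard_demo)

# Scores trace the curve over several thresholds; hard labels give one operating point
print(len(fpr_score), len(fpr_hard))
print(roc_auc_score(y_true_demo, y_score_demo))
```

With hard labels the curve collapses to the two corners plus a single operating point, so it says little about how the classifier ranks observations.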

In [None]:
#from sklearn.preprocessing import binarize
#yhat = clf.predict_proba(X_test)[:,1]
#yhat = binarize(yhat.reshape(-1,1), 0.5)
#cm = confusion_matrix(y_test, yhat)
#print("Accuracy :", accuracy_score(y_test, yhat))
#print('Confusion matrix\n\n', cm)
#print(classification_report(y_test, yhat))
#sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

In conclusion, we accomplished a test accuracy of 86.7%, which is higher than the null accuracy of 77.5%. Using XGBoost to tweak model performance, we found that the optimizations had no big impact: we already scored ~86.5% using an untuned XGBoost. Contrary to our assumption, handling the vast amount of missing values made performance worse. Our oversampling approach had a great impact on train accuracy but fell short on the test data, making test accuracy worse. Only hyperparameter tuning and stacking multiple models enabled a small increase in accuracy. Future work could focus on further tuning the random forest and SVC and on a grid search over hyperparameters for the meta learner, which is computationally expensive, or try deep learning methods such as LSTMs.