# Neural Network Model for Rain Forecasting

This notebook uses real-world weather data to predict whether or not it will rain tomorrow in Australia, using a fully connected feed-forward neural network built with PyTorch. The workflow is: preprocess the data, convert it to tensors, build the neural network model, train it with a loss function and an optimiser, and finally evaluate it. The dataset contains daily weather observations from numerous Australian weather stations.

The first step is to import the necessary libraries.

In [None]:
import os
import torch
import numpy as np
import pandas as pd
import seaborn as sns
from torch import nn, optim
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn import preprocessing
import torch.nn.functional as func
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
%matplotlib inline

In [None]:
sns.set(style='darkgrid')
sns.set_palette('deep')

In [None]:
# load the dataset
df = pd.read_csv('/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv')

In [None]:
# show first few records
df.head()

In [None]:
# show dataset dimensions
df.shape

In [None]:
# show dataset summary
df.info()

In [None]:
# show the frequency distribution of RainTomorrow
df['RainTomorrow'].value_counts()

In [None]:
# show percentage
df['RainTomorrow'].value_counts()/len(df)

We can see that 'No' accounts for 77.58% of the RainTomorrow values and 'Yes' for only 22.42%, so the classes are imbalanced. Now let's check for missing data.

In [None]:
df.isnull().sum()

Many columns have missing data.

# Data Preprocessing

There are two ways to deal with missing values: drop a variable entirely if too much of its data is missing, or replace the missing values with estimates based on the other information available. As a rule, any column with more than 2,000 missing values will be excluded, since such columns have far more missing values than the rest of the variables in the dataset. Before replacing the missing values in the remaining columns with the mean, it is wise to check for outliers: the mean is strongly affected by outliers and works best when the data is roughly normally distributed, whereas median imputation is preferable for skewed distributions.
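The exclusion rule and the mean-versus-median choice can be sketched on a toy frame. This is only an illustration under assumptions: the frame `toy`, its column names and the threshold of 2 (standing in for 2,000) are made up for the example and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the notebook itself works on weatherAUS.csv.
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, np.nan],
    'skewed': [1.0, 2.0, np.nan, 3.0, 100.0],
})

# Keep only columns whose missing-value count is within the threshold
# (2 here stands in for the 2,000 used on the full dataset).
max_missing = 2
missing_counts = toy.isnull().sum()
kept = toy.loc[:, missing_counts <= max_missing].copy()

# The outlier (100.0) pulls the mean far above the median.
col_mean = kept['skewed'].mean()
col_median = kept['skewed'].median()

# Median imputation is the safer choice for this skewed column.
kept['skewed'] = kept['skewed'].fillna(col_median)
print(list(kept.columns), col_mean, col_median)
```

On this toy column the single large value dominates the mean, while the median stays close to the typical values, which is why the median is preferred for skewed data.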

In [None]:
numerical = ['Temp9am', 'MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'WindSpeed9am']
df[numerical].hist()

Next, show the summary statistics of the numerical variables to check for skew.

In [None]:
df[numerical].describe()

On closer inspection, the Temp9am, MinTemp, MaxTemp and Rainfall columns appear to be roughly normally distributed, whilst the Humidity9am and WindSpeed9am columns contain outliers.

In [None]:
# fill missing values of normally-distributed columns with mean and skewed distribution with median
df['Temp9am'] = df['Temp9am'].fillna(value = df['Temp9am'].mean())
df['MinTemp'] = df['MinTemp'].fillna(value = df['MinTemp'].mean())
df['MaxTemp'] = df['MaxTemp'].fillna(value = df['MaxTemp'].mean())
df['Rainfall'] = df['Rainfall'].fillna(value = df['Rainfall'].mean())
df['Humidity9am'] = df['Humidity9am'].fillna(value = df['Humidity9am'].median())
df['WindSpeed9am'] = df['WindSpeed9am'].fillna(value = df['WindSpeed9am'].median())

The next step is to impute missing values in the categorical variables with the most frequent value (the mode).

In [None]:
df['RainToday'] = df['RainToday'].fillna(value = df['RainToday'].mode()[0])

Neural networks, like most machine learning algorithms, require numerical input, so we will encode the 'Date', 'Location', 'RainToday' and 'RainTomorrow' columns before predicting whether or not it is going to rain tomorrow.

In [None]:
# convert the Date variable into datetime type
df['Date'] = df['Date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

In [None]:
# extract year from the date
df['Year'] = df['Date'].dt.year

In [None]:
# extract month from the date
df['Month'] = df['Date'].dt.month

In [None]:
# extract day from the date
df['Day'] = df['Date'].dt.day
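As an aside, pandas can parse and split the dates in a vectorised way. The sketch below uses a made-up three-row series named `dates`; `pd.to_datetime` with an explicit `format` is an alternative to the row-wise `strptime` call, not what the notebook itself runs.

```python
import pandas as pd

# Hypothetical sample of date strings in the dataset's YYYY-MM-DD layout.
dates = pd.Series(['2008-12-01', '2009-01-15', '2017-06-25'])

# Parse the whole column at once instead of calling strptime per row.
parsed = pd.to_datetime(dates, format='%Y-%m-%d')

# The .dt accessor exposes the calendar components as integer columns.
parts = pd.DataFrame({
    'Year': parsed.dt.year,
    'Month': parsed.dt.month,
    'Day': parsed.dt.day,
})
print(parts)
```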

In [None]:
# encode location
le = preprocessing.LabelEncoder()
df['Location'] = le.fit_transform(df['Location'])

In [None]:
# encode RainToday & RainTomorrow
df['RainToday'] = df['RainToday'].map({'No': 0, 'Yes': 1})
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})
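To make the two encodings concrete, here is a small sketch on a made-up frame (`toy` and its four rows are assumptions for illustration). It shows that `LabelEncoder` numbers the classes in sorted order, and that an explicit dictionary gives full control over which label becomes 0 and which becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with the same kinds of columns as the dataset.
toy = pd.DataFrame({
    'Location': ['Sydney', 'Albury', 'Perth', 'Albury'],
    'RainToday': ['No', 'Yes', 'No', 'No'],
})

# LabelEncoder assigns integers following the sorted order of the class names.
le = LabelEncoder()
toy['Location'] = le.fit_transform(toy['Location'])

# An explicit mapping keeps the meaning of 0 and 1 obvious.
toy['RainToday'] = toy['RainToday'].map({'No': 0, 'Yes': 1})

print(list(le.classes_))
print(toy)
```

The fitted encoder keeps the original names in `le.classes_`, so the integer codes can be translated back with `le.inverse_transform` if needed.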

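After excluding the variables with many missing values, we are left with 9 of the original 24 columns as predictors. With Date split into Year, Month and Day, this gives 11 input features for predicting whether or not it is going to rain tomorrow.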

In [None]:
X = df[['Temp9am', 'MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'WindSpeed9am', 'RainToday', 'Location', 'Year', 'Month', 'Day']]
y = df[['RainTomorrow']]

The final step is to split the data into train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
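Because the classes are imbalanced, it can help to keep the same class ratio in both splits. The sketch below is an optional variation and not what the notebook does: it passes `stratify` to `train_test_split` on made-up arrays `X_toy` and `y_toy`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 'no rain' (0) and 20 'rain' (1) rows.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```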

Then convert all of it to Tensors (so we can use it with PyTorch).

In [None]:
X_train = torch.from_numpy(X_train.to_numpy()).float()
y_train = torch.squeeze(torch.from_numpy(y_train.to_numpy()).float())

In [None]:
X_test = torch.from_numpy(X_test.to_numpy()).float()
y_test = torch.squeeze(torch.from_numpy(y_test.to_numpy()).float())

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
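The effect of the conversion on shapes and dtypes can be checked on a miniature example. The frames `X_toy` and `y_toy` below are made up; the calls are the same ones used above.

```python
import pandas as pd
import torch

# Hypothetical miniature feature and target frames.
X_toy = pd.DataFrame({'Temp9am': [16.9, 17.2, 21.0], 'RainToday': [0, 1, 0]})
y_toy = pd.DataFrame({'RainTomorrow': [0, 0, 1]})

# to_numpy() gives a NumPy array, from_numpy() wraps it as a tensor,
# and float() casts to float32, the default dtype of nn.Linear weights.
X_t = torch.from_numpy(X_toy.to_numpy()).float()

# The single-column target has shape (3, 1); squeeze drops the extra axis.
y_t = torch.squeeze(torch.from_numpy(y_toy.to_numpy()).float())

print(X_t.shape, X_t.dtype)
print(y_t.shape, y_t.dtype)
```

Squeezing the target matters later on, because the loss function expects the predictions and the targets to have the same shape.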

# Building the Neural Network

We will create an input layer from our 11 columns: 'Temp9am', 'MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'WindSpeed9am', 'RainToday', 'Location', 'Year', 'Month' and 'Day'. The output will be a number between 0 and 1, representing how likely the model thinks it is to rain tomorrow. The prediction is produced by the final layer of the network. We will add 4 hidden layers between the input and output layers; the parameters of those layers determine the final output. All layers are fully connected. One easy way to build the neural network is to create a class that inherits from torch.nn.Module.

In [None]:
# create the model
class Model(nn.Module):
  def __init__(self, n_features):
    super(Model, self).__init__()
    self.fc1 = nn.Linear(n_features, 11)
    self.fc2 = nn.Linear(11, 8)
    self.fc3 = nn.Linear(8, 5)
    self.fc4 = nn.Linear(5, 3)
    self.fc5 = nn.Linear(3, 1)
  def forward(self, x):
    x = func.relu(self.fc1(x))
    x = func.relu(self.fc2(x))
    x = func.relu(self.fc3(x))
    x = func.relu(self.fc4(x))
    return torch.sigmoid(self.fc5(x))

In [None]:
model = Model(X_train.shape[1])

We start by creating the layers of our model in the constructor. The forward() method is where the magic happens: it accepts the input x and passes it through each layer in turn. There is a corresponding backward pass (provided by PyTorch's automatic differentiation) that allows the model to learn from the errors it is currently making.
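A quick sanity check is to count the trainable parameters and push a small batch through the network. The sketch below repeats the same layer sizes as the model above so that it runs on its own; the random batch of 4 rows is made up.

```python
import torch
from torch import nn
import torch.nn.functional as func

class Model(nn.Module):
    """Same layer sizes as the network defined above."""
    def __init__(self, n_features):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 11)
        self.fc2 = nn.Linear(11, 8)
        self.fc3 = nn.Linear(8, 5)
        self.fc4 = nn.Linear(5, 3)
        self.fc5 = nn.Linear(3, 1)

    def forward(self, x):
        x = func.relu(self.fc1(x))
        x = func.relu(self.fc2(x))
        x = func.relu(self.fc3(x))
        x = func.relu(self.fc4(x))
        return torch.sigmoid(self.fc5(x))

model = Model(n_features=11)

# Each Linear layer holds in * out weights plus out biases.
n_params = sum(p.numel() for p in model.parameters())

# Forward pass on a made-up batch, without tracking gradients.
batch = torch.rand(4, 11)
with torch.no_grad():
    out = model(batch)

print(n_params)
print(out.shape)
```

With 11 input features the layer-by-layer count is 132 + 96 + 45 + 18 + 4 = 295 parameters, and the output has one probability per input row.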

# Training

With the model in place, we need to find parameters that predict whether it will rain tomorrow. First, we need something to tell us how well we are currently doing:

In [None]:
criterion = nn.BCELoss()

BCELoss (binary cross-entropy loss) measures the difference between predicted probabilities and binary target values; in our case, between the predictions of our model and the real values. It expects its inputs to be probabilities between 0 and 1, such as those output by the sigmoid function. The closer the loss gets to 0, the better the model performs.
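To see what the loss computes, the sketch below compares `nn.BCELoss` with the binary cross-entropy formula written out by hand. The probabilities and labels are made-up values chosen for illustration.

```python
import torch
from torch import nn

# Made-up predicted probabilities and true labels.
probs = torch.tensor([0.9, 0.2, 0.6])
targets = torch.tensor([1.0, 0.0, 1.0])

# Built-in loss (averages over the elements by default).
builtin = nn.BCELoss()(probs, targets)

# Same quantity written out: -mean(y * log(p) + (1 - y) * log(1 - p)).
manual = -(targets * torch.log(probs)
           + (1 - targets) * torch.log(1 - probs)).mean()

print(builtin.item(), manual.item())
```

Confident and correct predictions (such as 0.9 for a rainy day) contribute little to the loss, while confident and wrong ones are penalised heavily.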

But how do we find parameters that minimize the loss function?

# Optimisation

Optimisers update the parameters of the neural network, such as its weights and biases, in order to reduce the loss. We will use the Adam optimiser.

In [None]:
optimiser = optim.Adam(model.parameters(), lr = 0.001)

Naturally, the optimiser requires the parameters it should update. The second argument, lr, is the learning rate. It is a trade-off between how good the parameters we find will be and how fast we get there. Finding good values for it can be black magic.
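The learning-rate trade-off is easiest to see on a one-parameter problem. The sketch below uses plain gradient descent rather than Adam, on a made-up function f(w) = (w - 3)^2, purely to illustrate the effect of the step size; the function `gradient_descent` is a hypothetical helper.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimise f(w) = (w - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w = w - lr * grad    # step against the gradient
    return w

w_small = gradient_descent(lr=0.001)  # tiny steps: still far from 3
w_good = gradient_descent(lr=0.1)     # moderate steps: close to 3
w_large = gradient_descent(lr=1.1)    # overshoots further each step

print(w_small, w_good, w_large)
```

Too small a learning rate makes little progress in 50 steps, a moderate one converges to the minimum at 3, and too large a one diverges.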

# Check for GPU

Doing massively parallel computations on GPUs is one of the enablers of modern deep learning. An NVIDIA GPU is needed to run the computation with CUDA; without one, everything runs on the CPU. First we check whether CUDA is available. Then we transfer the training and test sets to either the GPU or the CPU. Finally, we move our model and loss function.

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
X_train = X_train.to(device)
y_train = y_train.to(device)

X_test = X_test.to(device)
y_test = y_test.to(device)

model = model.to(device)

In [None]:
# move the loss function to the same device as the model and data
criterion = criterion.to(device)

# Rain Forecasting

Having a loss function is great, but tracking the accuracy of our model is something easier to understand, for us mere mortals. Here is the definition for our accuracy:

In [None]:
def calculate_accuracy(y_true, y_pred):
  predicted = y_pred.ge(.5).view(-1)
  return (y_true == predicted).sum().float() / len(y_true)
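As a quick check of this function, here is a run on four made-up probabilities and labels. The function is repeated so that the sketch is self-contained.

```python
import torch

def calculate_accuracy(y_true, y_pred):
    # Threshold the probabilities at 0.5, then compare with the labels.
    predicted = y_pred.ge(.5).view(-1)
    return (y_true == predicted).sum().float() / len(y_true)

# Made-up probabilities and labels for four days.
y_pred = torch.tensor([0.2, 0.7, 0.5, 0.4])
y_true = torch.tensor([0.0, 1.0, 1.0, 1.0])

# Thresholding gives [0, 1, 1, 0], so three of the four predictions match.
acc = calculate_accuracy(y_true, y_pred)
print(acc.item())
```

Note that a probability of exactly 0.5 counts as rain, because `ge` means "greater than or equal to".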

We convert every value below 0.5 to 0. Otherwise, we set it to 1. Finally, we calculate the percentage of correct values. With all the pieces of the puzzle in place, we can start training our model:

In [None]:
def round_tensor(t, decimal_places = 3):
  return round(t.item(), decimal_places)

In [None]:
# run the model
for epoch in range(1000):
    y_pred = model(X_train)
    y_pred = torch.squeeze(y_pred)
    train_loss = criterion(y_pred, y_train)
    if epoch % 100 == 0:
        train_acc = calculate_accuracy(y_train, y_pred)
        # evaluate on the test set without tracking gradients
        with torch.no_grad():
            y_test_pred = torch.squeeze(model(X_test))
            test_loss = criterion(y_test_pred, y_test)
        test_acc = calculate_accuracy(y_test, y_test_pred)
        print(f'epoch {epoch} '
              f'Train set: loss: {round_tensor(train_loss)}, accuracy: {round_tensor(train_acc)} '
              f'Test  set: loss: {round_tensor(test_loss)}, accuracy: {round_tensor(test_acc)}')
    # reset gradients, backpropagate the training loss, update the parameters
    optimiser.zero_grad()
    train_loss.backward()
    optimiser.step()

# Evaluation

During training, we show our model the data 1,000 times. Each time, we measure the loss, propagate the errors back through the model and ask the optimiser to find better parameters.

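The zero_grad() method clears the gradients accumulated from the previous step, which the optimiser uses to find better parameters. PyTorch adds new gradients to the stored ones on every backward() call, so they must be reset before each update.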
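The accumulation behaviour can be seen on a single made-up parameter. The sketch below uses a plain SGD optimiser only as a stand-in to call `zero_grad()`; the names `w` and `opt` are assumptions for the example.

```python
import torch

# A single made-up parameter and a plain SGD optimiser for illustration.
w = torch.tensor(2.0, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

# The derivative of w^2 is 2w, which is 4 at w = 2.
(w ** 2).backward()
first = w.grad.item()

# Without zero_grad(), a second backward pass adds to the stored gradient.
(w ** 2).backward()
accumulated = w.grad.item()

# After zero_grad(), the next backward pass starts from a clean slate.
opt.zero_grad()
(w ** 2).backward()
fresh = w.grad.item()

print(first, accumulated, fresh)
```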

However, accuracy alone is not a good way to evaluate the model. Recall that our data contains mostly no-rain examples! A way to delve a bit deeper into the model's performance is to assess the precision and recall for each class.
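To see why accuracy can mislead on imbalanced data, consider a deliberately useless predictor that always says "no rain". The labels below are made up, with 8 dry days and 2 rainy ones.

```python
# Made-up labels with an imbalance similar in spirit to the dataset.
y_true = [0] * 8 + [1] * 2

# A useless 'model' that always predicts no rain.
y_pred = [0] * 10

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # rain predicted and observed
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # rain missed
correct = sum(1 for t, p in pairs if t == p)

accuracy = correct / len(y_true)
recall_rain = tp / (tp + fn)

print(accuracy, recall_rain)
```

The accuracy looks respectable even though the recall for the rain class is zero: not a single rainy day is caught.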

In [None]:
classes = ['No rain', 'Raining']

y_pred = model(X_test)
y_pred = y_pred.ge(.5).view(-1).cpu()
y_test = y_test.cpu()
print(classification_report(y_test, y_pred, target_names=classes))

You can see that our model does well on the No rain class, for which we have so many examples. Unfortunately, we can't really trust its predictions for the Raining class. One of the best things about binary classification is that we can take a good look at a simple confusion matrix:

In [None]:
conf_mat = confusion_matrix(y_test, y_pred)
df_conf_mat = pd.DataFrame(conf_mat, index = classes, columns = classes)
heat_map = sns.heatmap(df_conf_mat, annot = True, fmt = 'd')
heat_map.yaxis.set_ticklabels(heat_map.yaxis.get_ticklabels(), ha = 'right')
heat_map.xaxis.set_ticklabels(heat_map.xaxis.get_ticklabels(), ha = 'right')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
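The four cells of the confusion matrix can also be read off numerically. The sketch below uses six made-up labels, not the notebook's results, and derives the precision and recall of the Raining class from the cells.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = 'No rain', 1 = 'Raining'.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() unpacks the cells in this order.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the days predicted rainy, how many were
recall = tp / (tp + fn)     # of the rainy days, how many were predicted

print(cm)
print(precision, recall)
```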