<a class="anchor" id="0"></a>
# **Rainfall in Australia** 


<img src=https://thumbs.dreamstime.com/b/trees-whatipu-point-huia-bay-auckland-new-zealand-march-two-tall-green-windswept-shoreline-under-heavy-cloudy-sky-91689726.jpg> 






<a class="anchor" id="0.1"></a>
# **Table of Contents** 

1. [Background](#1)
2. [The Data](#2)
3. [Data Preprocessing](#3)
4. [Feature Engineering](#4)
5. [Training the Model](#5)
6. [Evaluating the Model](#6)
7. [Dealing with Class Imbalance](#7)
8. [Tuning Hyperparameters](#8)
9. [Conclusion](#9)







# **1. Background** <a class="anchor" id="1"></a>

[Table of Contents](#0.1)

Our objective is to predict whether or not rain will fall the next day in Australia. This knowledge might be relevant for several reasons, and some of them are listed below:

- To help decide if you should head out with your umbrella or not
- To know what kind of clothes would be suitable
- To know if additional preparations are needed to ensure an outdoor date or event goes smoothly

Whatever the case may be, we will try to make sense of the data we have to inform our predictions.

Here we go.



<a class = "anchor" id="2"></a>
# 2. **The Data**

[Table of Contents](#0.1)

Let's take a look at our data and get to work.


In [None]:
# importing relevant libraries
import pandas as pd
import numpy as np

In [None]:
# loading the dataset
rain_data = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
print('The dataset has {} rows and {} columns'.format(rain_data.shape[0],rain_data.shape[1]))

In [None]:
rain_data.head()

A cursory look at the first five rows of our data reveals that some columns have missing values. We will resolve this later, but first let's inspect our dataset some more.

In [None]:
rain_data.describe()

In [None]:
# printing out the column names
print(rain_data.columns)

Let's see the categorical and numerical columns we have in our data.

In [None]:
# identifying the categorical (object) and numerical (float64) columns
cat = rain_data.dtypes=='object'
num = rain_data.dtypes=='float64'
cat_columns = list(cat[cat].index)
num_columns = list(num[num].index)
print("Categorical variables are:",cat_columns)

print("Numerical variables are:",num_columns)

We see that there are 7 categorical variables while the rest are numerical. Great.
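
As a small aside, the same split into categorical and numerical columns can be sketched with `select_dtypes`, which selects by dtype family rather than by comparing against a single dtype name. The toy frame and its values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the weather data (values are made up)
toy = pd.DataFrame({
    "Location": ["Albury", "Sydney", "Perth"],
    "MinTemp": [13.4, 7.4, 12.9],
    "RainToday": ["No", "No", "Yes"],
    "Humidity9am": [71.0, 44.0, 38.0],
})

# Everything that is not numeric is treated as categorical here
toy_cat_columns = toy.select_dtypes(exclude="number").columns.tolist()
toy_num_columns = toy.select_dtypes(include="number").columns.tolist()

print("Categorical variables are:", toy_cat_columns)
print("Numerical variables are:", toy_num_columns)
```

Selecting by `"number"` would also pick up integer columns, which a check against `'float64'` alone would miss.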

Now let's preprocess our data and clean it up a bit.

<a class="anchor" id="3"></a>
# 3. **Data Preprocessing** 

[Table of Contents](#0.1)

### Missing Values

Let's examine our missing value problem more closely.

In [None]:
# checking the number of missing values per column
rain_data.isnull().sum()

Some visualization might be useful here. Let's see.

In [None]:
# visualizing missing data
import missingno as msno
msno.matrix(rain_data)

In [None]:
# plotting the number of rows with entries per column
msno.bar(rain_data)

In [None]:
msno.heatmap(rain_data)

From the plots above and the counts printed out, several columns have missing values. "Sunshine", "Evaporation", "Cloud9am", and "Cloud3pm" in particular have a significant number of missing values. Judging by what these columns measure, they seem to be important features.

You will observe that the target column (RainTomorrow) contains some missing values. We will drop rows without targets. You will also observe from the heatmap that missing entries in RainToday and Rainfall are highly correlated, and that both columns even have the same number of missing values. I'll keep both columns (no pressure), but drop the rows where either is missing. Later on, the missing values in other columns will be replaced.

In [None]:
# drop rows without targets, raintoday, and rainfall entries
rain_data_clean = rain_data.dropna(axis=0,how='any',subset=["RainTomorrow", "Rainfall", "RainToday"])


In [None]:
# checking the number of missing values we now have;
rain_data_clean.isnull().sum()
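
To make the effect of `dropna` with `subset` concrete, here is a minimal sketch on a made-up four-row frame (the values are invented, only the column names mirror the dataset). It also shows `isnull().mean()`, which gives the share of missing entries per column rather than the raw count.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    "Sunshine": [np.nan, 7.5, np.nan, 9.1],
    "Rainfall": [0.0, np.nan, 1.2, 0.4],
    "RainToday": ["No", None, "Yes", "No"],
    "RainTomorrow": ["No", "Yes", None, "Yes"],
})

# Share of missing entries per column (0.5 means half the rows are missing)
missing_share = toy.isnull().mean()

# A row is dropped if ANY of the listed columns is missing;
# missing values in other columns (e.g. Sunshine) are left alone
toy_clean = toy.dropna(axis=0, how="any",
                       subset=["RainTomorrow", "Rainfall", "RainToday"])

print(missing_share)
print(toy_clean)
```

Rows 1 and 2 are removed, while row 0 survives even though its Sunshine value is still missing. That is why imputation is still needed afterwards.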

### Splitting the data into training and test sets

Lest we fall victim to the silent killer called data leakage, let's split our data into training and test sets before doing any further processing.

In [None]:
# separating the target variables from the features
X = rain_data_clean.drop(columns = "RainTomorrow")
y = rain_data_clean.loc[:,"RainTomorrow"]
print ("The size of X is {}".format(X.shape))
print ("The size of y is {}".format(y.shape))

In [None]:
# importing train_test_split
from sklearn.model_selection import train_test_split

In [None]:
# splitting the dataset, and using the "stratify" argument to preserve the class ratio in the train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify=y)
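
The effect of `stratify` can be checked on a small synthetic example. The sketch below assumes 100 made-up labels with a 80/20 class ratio and confirms that both splits keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up labels: 80 "No" and 20 "Yes"
toy_y = np.array(["No"] * 80 + ["Yes"] * 20)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y)

# Share of the minority class in each split
train_share = np.mean(toy_y_train == "Yes")
test_share = np.mean(toy_y_test == "Yes")
print("Share of 'Yes' in train:", train_share)
print("Share of 'Yes' in test:", test_share)
```

Without `stratify`, the minority share in a small test set can drift noticeably from the overall ratio, which makes metrics such as recall harder to compare between splits.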

<a class="anchor" id="4"></a>
# 4. **Feature Engineering**

[Table of Contents](#0.1)

### Replacing Dates with Seasons

According to [TripSavvy](https://www.tripsavvy.com/australian-seasons-1462601#:~:text=To%20break%20things%20down%20for%20you%2C%20each%20of,to%20August%2C%20and%20spring%20from%20September%20to%20November), Australia has four seasons categorised into months as follows:
<ol>
 <li> Summer : December - February </li>
 <li> Autumn : March - May </li>
 <li> Winter : June - August </li>
 <li> Spring : September - November </li>
</ol>
 
As such, we will replace the entries in the Date column with the corresponding season. This might help us get some insight as rainfall tends to be seasonal.

In [None]:
def season_replace(df):
    import datetime as dt
#     initialize an empty list of months
    month = []
    for num in df['Date']:
#         parse the date string (year-month-day) for each entry
        date_obj = dt.datetime.strptime(num,"%Y-%m-%d")
#         keep the month only
        date_mon = date_obj.month
#         add the month to the list of months
        month.append(date_mon)
#     list the seasons so that indexes 0-11 correspond to January-December
#     i.e. Jan, Feb, Mar correspond to index 0, 1, 2 which are Summer, Summer, Autumn
    season_options = ['Summer', 'Summer', 'Autumn', 'Autumn', 'Autumn', 'Winter', 'Winter', 'Winter', 'Spring', 'Spring', 'Spring','Summer']
#     initialize an empty list of seasons
    seasons= []
    for i in month:
#         add the season for each date entry to the seasons list
        seasons.append(season_options[i-1])
#     drop the Date column by name; this returns a new dataframe,
#     so the dataframe passed in by the caller is not modified
    df = df.drop(columns='Date')
#     add seasons to the dataframe
    df['Seasons'] = seasons
#     re-order the dataframe to start with the seasons column
    df = df[['Seasons'] +  [col for col in df.columns if col != 'Seasons']]
    return df

In [None]:
# replacing with the corresponding season
X_train = season_replace(X_train)

In [None]:
X_train.head()
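
For comparison, the month-to-season mapping can also be sketched in vectorized form with pandas. The helper name `dates_to_seasons` and the sample dates are hypothetical; the season boundaries follow the list given earlier in this section.

```python
import pandas as pd

# Month number -> Australian season, following the season list above
MONTH_TO_SEASON = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}

def dates_to_seasons(dates):
    """Map a Series of 'YYYY-MM-DD' strings to season names."""
    months = pd.to_datetime(dates, format="%Y-%m-%d").dt.month
    return months.map(MONTH_TO_SEASON)

# Made-up dates, one per season
toy_dates = pd.Series(["2008-12-01", "2009-03-15", "2010-07-04", "2011-10-31"])
toy_seasons = dates_to_seasons(toy_dates).tolist()
print(toy_seasons)
```

An explicit dictionary keeps the December wrap-around visible, which is easy to get wrong with index arithmetic.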

Next, we will encode the data using one-hot encoding. First, we create a list of the categorical variables to encode and a list of the numerical variables to scale.

In [None]:
features = X_train.columns
features_to_encode = X_train.select_dtypes(include=['object', 'bool']).columns
features_to_scale = X_train.select_dtypes(include=['int64', 'float64']).columns

Next, we will create a transformer object that applies the encoder to the categorical columns and the imputer and scaler to the numerical columns.

In [None]:
# importing relevant packages

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

from sklearn.impute import SimpleImputer

In [None]:
# instantiate one hot encoder to use (its output is a sparse matrix by default)
encoder = OneHotEncoder(handle_unknown='error', drop='first')
# setting up categorical pipeline
cat_transformer = Pipeline(steps=[('onehot', encoder)])

In [None]:
# instantiate imputer and scaler for numeric variables
imputer = SimpleImputer(missing_values = np.nan, strategy="median")
scaler = RobustScaler()
# setting up the numerical pipeline
num_transformer = Pipeline(steps = [
    ('imputer', imputer),
    ('scaler', scaler)
])

In [None]:
# combining both the numerical and categorical pipeline into a ColumnTransformer instance
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, features_to_scale),
        ('cat', cat_transformer, features_to_encode)
    ],remainder='passthrough')
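
To see what such a preprocessor produces, here is a minimal sketch on a made-up two-column frame. The `toy_` names and values are illustrative assumptions; the steps mirror the numerical pipeline (median imputation, then robust scaling) and the categorical pipeline (one-hot encoding with the first category dropped).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Made-up data: one numeric column with a gap, one categorical column
toy = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 14.0, 12.0],
    "Seasons": ["Summer", "Winter", "Summer", "Autumn"],
})

toy_num = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
toy_cat = Pipeline(steps=[("onehot", OneHotEncoder(drop="first"))])

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", toy_num, ["MinTemp"]),
    ("cat", toy_cat, ["Seasons"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# Densify only if the result comes back as a sparse matrix
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else toy_out
print(toy_out.shape)
print(toy_out)
```

The output has one scaled numeric column plus two indicator columns (three seasons minus the dropped first category). The imputed row lands at 0 in the numeric column because the robust scaler centres on the median, which is also the value used for imputation.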

<a class="anchor" id="5"></a>
# 5. **Training the Model** 

[Table of Contents](#0.1)

We will be using a Random Forest Classifier model for this problem.

In [None]:
# importing Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# instantiating the classifier
rf_classifier = RandomForestClassifier(
                      min_samples_leaf=50,
                      n_estimators=150,
                      bootstrap=True,
                      oob_score=True,
                      n_jobs=-1,
                      random_state=42,
                      max_features='sqrt')  # 'sqrt' is what 'auto' meant for classifiers; 'auto' is no longer accepted by recent scikit-learn

Next, we label encode the target variable, as it is currently a categorical data type.

In [None]:
# Encoding the dependent variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
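
As a quick illustration of what the label encoder does, the sketch below fits one on a handful of made-up "Yes"/"No" labels. The `toy_` names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(["No", "Yes", "No", "No", "Yes"])

# classes_ are stored in sorted order, so "No" maps to 0 and "Yes" maps to 1
print(list(toy_le.classes_))
print(list(toy_encoded))

# inverse_transform recovers the original labels from the integer codes
toy_decoded = list(toy_le.inverse_transform([0, 1]))
print(toy_decoded)
```

Because the mapping is learned once during fitting, the same fitted encoder should be reused (via `transform`) on any other target array so that the codes stay consistent.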

Next, we train the model using the training dataset. The pipe object makes it easy to pass data through a series of steps that happen one after the other. Remember that the preprocessor object was defined to impute missing values, scale the numerical columns, and encode the categorical columns of our data, while rf_classifier is our chosen model.

In [None]:
pipe = make_pipeline(preprocessor, rf_classifier)
pipe.fit(X_train, y_train)

<a class="anchor" id="6"></a>
# 6. **Evaluating the Model**

[Table of Contents](#0.1)

Let's see how well our model does at predicting the target class for our test dataset.

Remember that the test data must go through the same preprocessing and feature engineering steps that our training set went through.

## Preprocessing the test data
First, we replace the dates with their corresponding season.

In [None]:
# Replace Dates with season in test data
X_test = season_replace(X_test)
X_test.head()

Next, we label encode the test target using the encoder fitted on the training target.

In [None]:
# Label encode y_test
y_test = le.transform(y_test)

Next, we predict our target classes.

In [None]:
y_pred = pipe.predict(X_test)

## Evaluating the Classifier

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve, f1_score

import matplotlib.pyplot as plt

In [None]:
acc = accuracy_score(y_test, y_pred)

In [None]:
print("The accuracy of the model is {}%".format(round(acc * 100,3)))

#### Probability Predictions

Next, let's look at the ROC curve and the AUC score, which are computed from predicted probabilities rather than predicted classes.

In [None]:
train_probs = pipe.predict_proba(X_train)[:, 1]
test_probs = pipe.predict_proba(X_test)[:, 1]
train_pred = pipe.predict(X_train)

In [None]:
print("Train ROC AUC Score: {}".format(roc_auc_score(y_train,train_probs)))
print("Test ROC AUC Score: {}".format(roc_auc_score(y_test,test_probs)))
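
For intuition about what the ROC AUC score measures, here is a minimal sketch on four made-up predictions. It compares scikit-learn's result with a manual count of how often a positive example receives a higher score than a negative one. The `toy_` names and values are illustrative only.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class
toy_true = [0, 0, 1, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8]

toy_auc = roc_auc_score(toy_true, toy_probs)

# Manual version: fraction of (positive, negative) pairs ranked correctly,
# counting ties as half
pos = [p for t, p in zip(toy_true, toy_probs) if t == 1]
neg = [p for t, p in zip(toy_true, toy_probs) if t == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
manual_auc = sum(pairs) / len(pairs)

print(toy_auc, manual_auc)
```

Three of the four positive/negative pairs are ordered correctly, so both values come out at 0.75. A score of 0.5 corresponds to random ranking, which is the baseline used in the evaluation function.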

To plot the ROC curve, let's define a function that takes in all the necessary arguments, prints the recall, precision, and ROC AUC metrics, and plots the ROC curve. Note that it reads y_test from the notebook's global scope.

In [None]:
def evaluate_model(y_pred, test_probs, train_pred, train_probs, y_train):
    
    baseline = {}
    baseline['recall'] = recall_score(y_test,
                        [1 for _ in range(len(y_test))])
    baseline['precision'] = precision_score(y_test,
                            [1 for _ in range(len(y_test))])
    baseline['roc'] = 0.5
    
    results = {}
    results['recall'] = recall_score(y_test, y_pred)
    results['precision'] = precision_score(y_test, y_pred)
    results['roc'] = roc_auc_score(y_test, test_probs)
    
    train_results = {}
    train_results["recall"] = recall_score(y_train, train_pred)
    train_results['precision'] = precision_score(y_train, train_pred)
    train_results['roc'] = roc_auc_score(y_train, train_probs)

    for metric in ['recall', 'precision', 'roc']:
        print('{} \n Baseline: {} \n Test: {} \n Train: {}'.format(metric.capitalize(),round(baseline[metric], 2),round(results[metric], 2),round(train_results[metric], 2)))
              
#     calculate the  FPR and TPR
    base_fpr, base_tpr, _ = roc_curve(y_test, [1 for _ in range(len(y_test))])
    model_fpr, model_tpr, _ = roc_curve(y_test, test_probs)
              
    plt.figure(figsize = (8,6))
    plt.rcParams['font.size'] = 16
    
#     Plot both curves
    plt.plot(base_fpr, base_tpr, 'b', label='baseline')
    plt.plot(model_fpr, model_tpr, 'r', label='model')
    plt.legend();
    
    plt.xlabel('False Positive Rate');
    plt.ylabel('True Positive Rate');
    plt.title('ROC Curves');
    plt.show()

In [None]:
evaluate_model(y_pred, test_probs, train_pred, train_probs, y_train)

We can see that the recall, precision, and AUC scores for the train and test sets are pretty close to each other. This suggests that our model is unlikely to be overfitting.

#### Confusion Matrix

Next, let's plot a pretty confusion matrix for some more insight into our model's performance.

In [None]:
import itertools

def plot_confusion_matrix (cm, classes, normalize=False, title='Confusion Matrix', cmap = plt.cm.Blues):
    
    plt.figure(figsize = (10, 10))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size = 24)
    plt.colorbar (aspect = 4)
    
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, size=14)
    plt.yticks(tick_marks, classes, size = 14)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    
#     Label each cell with its count
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                fontsize=20,
                horizontalalignment='center',
                color='white' if cm[i, j] > thresh else "black")

#     Grid, layout, and axis labels only need to be set once, after the loop
    plt.grid(False)
    plt.tight_layout()
    plt.ylabel ('True label', size = 18)
    plt.xlabel ('Predicted label', size = 18)

In [None]:
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, classes=['0 - No Rain', '1 - Rain'],
                     title = 'Rainfall Confusion Matrix')

From this, we can see that the model does not do well at correctly predicting that it will rain the next day. In fact, you can infer this from the low recall score (0.44 and 0.43 for the test and training sets respectively). The precision score, on the other hand, was fairly high (0.78 and 0.81 for the test and training sets respectively).

This means that when our model predicts rainfall, it is more likely to rain than not. However, we also run into issues, because there are a good number of cases where it predicts an absence of rainfall and it actually rains. A low recall score, i.e. a high number of false negatives, is not desirable in this case.
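
The link between the confusion matrix and these two scores can be worked through on a small made-up example. The labels and predictions below are invented and the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 dry days (0) and 4 rainy days (1)
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

# Precision: of the days predicted rainy, how many were rainy?
toy_precision = tp / (tp + fp)
# Recall: of the days that were rainy, how many did we catch?
toy_recall = tp / (tp + fn)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", toy_precision, "recall:", toy_recall)
```

Here two of the four rainy days are missed (false negatives), which pulls recall down to 0.5 even though only one dry day is wrongly flagged. The values agree with `precision_score` and `recall_score` on the same arrays.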

One reason why this is the case could be that the dataset is imbalanced, i.e. there are way more instances of the "No rain" class than the "Rain" class.

Let's attempt to use SMOTE (Synthetic Minority Over-sampling Technique) to resample the training data and improve our model's predictive performance.

<a class="anchor" id="7"></a>
# 7. **Dealing with Class Imbalance**

[Table of Contents](#0.1)

We use SMOTE to remedy the class imbalance. It is applied to the training data only.

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 42)
# pipe = make_pipeline(preprocessor, rf_classifier)
# first we fit and transform the training data using the preprocessor transformer instance
# this ensures that the categorical variables are encoded before the sampling takes place
X_train_new = preprocessor.fit_transform(X_train)
X_train_new, y_train_new = sm.fit_resample(X_train_new, y_train)
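
The core idea behind SMOTE is to create new minority-class samples by interpolating between existing minority samples and their neighbours. The sketch below is a deliberately simplified illustration in plain NumPy, not the imblearn implementation: it uses only the single nearest neighbour, and the function name `synthetic_point` and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def synthetic_point(points, rng):
    """Create one synthetic sample between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    # Distances from the base point to every point; exclude the base itself
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf
    neighbour = points[np.argmin(dists)]
    # lam in [0, 1) places the new sample on the segment joining the two points
    lam = rng.random()
    return base + lam * (neighbour - base)

new_points = np.array([synthetic_point(minority, rng) for _ in range(5)])
print(new_points)
```

Because every synthetic sample lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class. This also explains why categorical columns have to be encoded as numbers before resampling.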

Next, we instantiate a new pipeline and train it on the resampled dataset. The preprocessor is left out because the data has already been transformed.

In [None]:
pipe_smote = make_pipeline(rf_classifier)
pipe_smote.fit(X_train_new, y_train_new)

Next, we prepare the test dataset for prediction

In [None]:
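# reuse the preprocessor already fitted on the training data;
# refitting it on the test set would leak test statistics into preprocessing
X_test_new = preprocessor.transform(X_test)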
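
The difference between `transform` and `fit_transform` on held-out data matters here, so a small sketch may help. It uses a median imputer on made-up numbers; the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and test columns, each with one missing value
toy_train = np.array([[1.0], [2.0], [3.0], [np.nan]])
toy_test = np.array([[10.0], [np.nan], [30.0]])

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_train)  # learns the training median (2.0)

# transform reuses the training median for the missing test value
reused = toy_imputer.transform(toy_test)

# fit_transform on the test rows re-learns the median from the test rows (20.0)
refit = SimpleImputer(strategy="median").fit_transform(toy_test)

print(reused.ravel())
print(refit.ravel())
```

Only the first variant keeps the test set untouched by anything learned from it. Refitting can also change which columns an encoder produces, so the model may receive features in a layout it was not trained on.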

Next, we make predictions using the new model

In [None]:
y_pred_new = pipe_smote.predict(X_test_new)

Let's see our model accuracy.

In [None]:
acc_smote = accuracy_score(y_test, y_pred_new)
print("The accuracy of the smote_model is {}%".format(round(acc_smote * 100,3)))

Next, let's evaluate our model's recall, precision, and roc_auc_score.

In [None]:
train_probs_new = pipe_smote.predict_proba(X_train_new)[:, 1]
test_probs_new = pipe_smote.predict_proba(X_test_new)[:, 1]
train_pred_new = pipe_smote.predict(X_train_new)

In [None]:
evaluate_model(y_pred_new, test_probs_new, train_pred_new, train_probs_new, y_train_new)

Next, let's see what our confusion matrix looks like

In [None]:
cm_smote = confusion_matrix(y_test, y_pred_new)
plot_confusion_matrix(cm_smote, classes=['0 - No Rain', '1 - Rain'],
                     title = 'Rainfall Confusion Matrix [Smote]')

We can observe some improvements to our model. The number of false negatives has decreased (so we can expect our recall score to improve). However, the number of false positives has increased (so our precision has dropped). But that's fine. It is much better to be wrong about rain falling than about rain not falling.

<a class="anchor" id="8"></a>
# **8. Tuning Hyperparameters**

[Table of Contents](#0.1)

In [None]:
# this package prints out data in a pretty format
from pprint import pprint

# let's see the current parameters in use
print('Parameters currently in use:\n')
pprint(rf_classifier.get_params())

Next, we create a grid of parameter values from which the search will randomly sample combinations to train.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int (x) for x in np.linspace(start=100, stop=700,num=50)]

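# number of features to consider at every split
# ('sqrt' is what 'auto' meant for classifiers; 'auto' is no longer accepted by recent scikit-learn)
max_features = ['sqrt', 'log2']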

# maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]

# include None in max_depth
max_depth.append(None)

# minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# minimum number of samples required at each leaf node
min_samples_leaf = [1, 4, 10]

# method of selecting samples for training each tree
bootstrap = [True]

max_leaf_nodes = [None] + list(np.linspace(10, 50, 500).astype(int))

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'max_leaf_nodes': max_leaf_nodes,
               'bootstrap': bootstrap}
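
To show how a randomized search draws from such a grid, here is a minimal sketch using scikit-learn's `ParameterGrid` and `ParameterSampler` on a much smaller, made-up grid. The `toy_grid` values are illustrative and not the ones used in this notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# A deliberately small grid so the numbers are easy to follow
toy_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [10, 50, None],
    "min_samples_leaf": [1, 4, 10],
}

# An exhaustive grid search would try every combination
n_combinations = len(ParameterGrid(toy_grid))

# A randomized search only tries n_iter sampled combinations
sampled = list(ParameterSampler(toy_grid, n_iter=5, random_state=42))

print("Total combinations:", n_combinations)
for params in sampled:
    print(params)
```

Even this small grid has 27 combinations, and the full grid above has far more, so sampling a handful of candidates trades coverage for a much shorter run time.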

In [None]:
rf = RandomForestClassifier(oob_score=True, n_jobs=-1)

# creating a grid of hyperparameters
rf_random = RandomizedSearchCV(
                estimator = rf,
                param_distributions = random_grid,
                n_iter = 5, cv = 3,
                verbose=1, random_state=42,
                scoring='roc_auc')

# next, we define a pipeline instance that runs the randomized search,
# fitting each sampled model on the resampled training data
pipe_random = make_pipeline(rf_random)
pipe_random.fit(X_train_new, y_train_new)

# return the hyperparameters of the best model
rf_random.best_params_

Next, we check the average number of nodes and the average maximum depth of the trees in our best random forest classifier.

In [None]:
best_model = rf_random.best_estimator_
n_nodes = []
max_depths = []

for ind_tree in best_model.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)
    
print ('Average number of nodes: {}'.format(int(np.mean(n_nodes))))
print ('Average maximum depth: {}'.format(int(np.mean(max_depths))))

Next, we evaluate the best model on the test set.

In [None]:
pipe_best = make_pipeline(best_model)
pipe_best.fit(X_train_new, y_train_new)
y_pred_best = pipe_best.predict(X_test_new)

In [None]:
train_rf_probs_best = pipe_best.predict_proba(X_train_new)[:, 1]
test_rf_probs_best = pipe_best.predict_proba(X_test_new)[:, 1]
train_rf_pred_best = pipe_best.predict(X_train_new)

In [None]:
acc_best = accuracy_score(y_test, y_pred_best)
print("The accuracy of the tuned model is {}%".format(round(acc_best * 100,3)))

In [None]:
evaluate_model(y_pred_best, test_rf_probs_best, train_rf_pred_best, train_rf_probs_best, y_train_new)

And of course, the confusion matrix:

In [None]:
cm_best_model = confusion_matrix(y_test, y_pred_best)
plot_confusion_matrix(cm_best_model, classes=['0 - No Rain', '1 - Rain'],
                     title = 'Rainfall Confusion Matrix')

<a class="anchor" id="9"></a>
# **9. Conclusion** 

[Table of Contents](#0.1)

We have been able to predict whether rain will fall in Australia the next day with an accuracy of 79%. We prioritized recall over precision because we'd rather be wrong when predicting rainfall than be wrong when predicting no rainfall.

Thank you.
