# Import Data

Can we predict whether it is going to rain tomorrow from weather data obtained today? This dataset contains daily weather observations from numerous Australian weather stations. The target variable RainTomorrow means: Did it rain the next day? Yes or No.

We are going to train different binary classification algorithms on the labeled dataset provided and determine how accurately we can predict whether it is going to rain tomorrow given today's weather conditions. Let's begin.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for plotting

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv('../input/weatherAUS.csv')
print('Dataset dimensions: ', df.shape)

## Data Fields

* Date: The date of observation
* Location: The common name of the location of the weather station
* MinTemp: The minimum temperature in degrees celsius
* MaxTemp: The maximum temperature in degrees celsius
* Rainfall: The amount of rainfall recorded for the day in mm
* Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am
* Sunshine: The number of hours of bright sunshine in the day.
* WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight
* WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight
* WindDir9am: Direction of the wind at 9am
* WindDir3pm: Direction of the wind at 3pm
* WindSpeed9am: Wind speed (km/hr) averaged over 10 minutes prior to 9am
* WindSpeed3pm: Wind speed (km/hr) averaged over 10 minutes prior to 3pm
* Humidity9am: Humidity (percent) at 9am
* Humidity3pm: Humidity (percent) at 3pm
* Pressure9am: Atmospheric pressure (hpa) reduced to mean sea level at 9am
* Pressure3pm: Atmospheric pressure (hpa) reduced to mean sea level at 3pm
* Cloud9am: Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates completely clear sky whilst an 8 indicates that it is completely overcast.
* Cloud3pm: Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values
* Temp9am: Temperature (degrees C) at 9am
* Temp3pm: Temperature (degrees C) at 3pm
* RainToday: Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
* RISK_MM: The amount of next day rain in mm. Used to create response variable RainTomorrow. A kind of measure of the "risk".
* RainTomorrow: The target variable. Did it rain tomorrow?

# Explore the Data

In [None]:
df.info()

In [None]:
#check the counts for each column to check if the dataset is complete
df.count().sort_values()

From the counts above we can see that the Sunshine, Evaporation, Cloud3pm, and Cloud9am columns have less than 60% of their rows populated, so let's drop these columns. We also drop RISK_MM, which is the amount of rainfall in millimeters for the next day. This value is used to determine the target variable "RainTomorrow", so keeping it would leak the answer into the features and give the model a falsely high accuracy. The code below also drops Location and Date.
We will fill the missing values of the columns that we keep later.
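The 60% rule above can be sketched on a small, made-up frame. The toy values and the names `toy`, `populated`, and `sparse_columns` are assumptions for illustration only; they are not taken from weatherAUS.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather data
toy = pd.DataFrame({
    "MinTemp": [10.1, 12.3, np.nan, 9.8, 11.0],
    "Sunshine": [np.nan, np.nan, 7.2, np.nan, np.nan],
    "Rainfall": [0.0, 1.2, 0.0, np.nan, 3.4],
})

# Fraction of non-null rows in each column
populated = toy.notna().mean()

# Columns with fewer than 60% of their rows populated are candidates for dropping
sparse_columns = populated[populated < 0.6].index.tolist()

print(populated.sort_values())
print(sparse_columns)
```

Working with fractions instead of raw counts makes the threshold independent of the number of rows.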

In [None]:
print('Prior to dropping the columns :',df.shape)
df = df.drop(columns=['Sunshine','Evaporation','Cloud3pm','Cloud9am','Location','RISK_MM','Date'])
print('After dropping the columns :',df.shape)

In [None]:
# split the train and test sets
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size = 0.2, random_state = 123)

In [None]:
# Explore the train_set
rain = train_set.copy()
rain.head()

In [None]:
rain.describe()

All the numerical values fall within reasonable ranges, so there is no need to alter any of them. Now let's investigate the categorical data for any abnormalities.

In [None]:
rain['WindGustDir'].value_counts()

In [None]:
rain['WindDir9am'].value_counts()

In [None]:
rain['WindDir3pm'].value_counts()

In [None]:
rain['RainTomorrow'].value_counts()

There aren't any abnormalities or unnecessary categories in the categorical columns, so there is no need to alter any categorical variables.
Let's now investigate the null values in each column. We need to remove or replace these null values before training models.

In [None]:
rain.isnull().sum(axis = 0)

It seems that all the columns have null values. Since we don't want to drop a lot of data points, we need to replace these null values with appropriate values.

In [None]:
# Association matrix based on bias-corrected Cramér's V
import scipy.stats as ss
import seaborn as sns

def cramers_v(x, y):
    # Build the contingency table of the two columns and compute its chi-squared statistic
    contingency = pd.crosstab(x,y)
    chi2 = ss.chi2_contingency(contingency)[0]
    n = contingency.sum().sum()
    phi2 = chi2/n
    r,k = contingency.shape
    # Bias correction for phi^2 and for the table dimensions
    phi2corr = max(0, phi2-((k-1)*(r-1))/(n-1))
    rcorr = r-((r-1)**2)/(n-1)
    kcorr = k-((k-1)**2)/(n-1)
    return np.sqrt(phi2corr/min((kcorr-1),(rcorr-1)))

headers = list(rain)
print(headers)
coeff_matrix = []
for header in headers:
    coeff_list = []
    for item in headers:
        coeff = cramers_v(rain[header], rain[item])
        coeff_list.append(coeff)
    coeff_matrix.append(coeff_list)
    
np_arr = np.array(coeff_matrix)
plt.figure(figsize=(20,12))
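# Label both axes with the column names so the cells can be read directly
ax = sns.heatmap(np_arr, annot=True, xticklabels=headers, yticklabels=headers)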

From the correlation matrix above, RainToday seems to have a very high correlation with Rainfall, which makes sense because Rainfall records the amount of rain that fell today and RainToday is derived from it. RainToday is also highly correlated with many other columns in this dataset. So let's drop the RainToday column.
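To illustrate why RainToday adds little beyond Rainfall, here is a small sketch that rebuilds a RainToday-style flag from the rule given in the data fields (more than 1 mm of precipitation). The toy values and the Yes/No encoding are assumptions made for this example.

```python
import pandas as pd

# Hypothetical rainfall amounts in mm
toy = pd.DataFrame({"Rainfall": [0.0, 0.4, 1.0, 1.2, 8.6]})

# Rule from the data fields: it rained today if precipitation exceeds 1 mm
toy["RainToday"] = (toy["Rainfall"] > 1.0).map({True: "Yes", False: "No"})

# Within each RainToday group the Rainfall values fall on one side of the threshold,
# so the flag carries no information that Rainfall does not already contain
summary = toy.groupby("RainToday")["Rainfall"].agg(["min", "max"])
print(toy)
print(summary)
```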

In [None]:
train_set = train_set.drop("RainToday", axis = 1)

# Prepare the Data

In [None]:
# Custom transformer that selects DataFrame columns by name. Can be replaced later
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names=attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

In [None]:
# Create the feature set and target for training
X = train_set.drop("RainTomorrow", axis = 1)
y = train_set["RainTomorrow"].copy()

First let's build our categorical variable pipeline. For categorical variables we decided to replace the null values with the most frequent value in the column. We also use OneHotEncoder to encode the categorical data.
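As a minimal sketch of what these two steps do, the example below applies them to a tiny, made-up column of wind directions. The array `wind` is an assumption for illustration. The dense result is obtained with `.toarray()` so that the example does not depend on the encoder's sparse-output argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column of wind directions with one missing value
wind = np.array([["N"], ["S"], ["N"], [np.nan]], dtype=object)

# Replace the missing entry with the most frequent category ("N" here)
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
wind_filled = imputer.fit_transform(wind)

# One column per category, with a single 1 in each row
encoder = OneHotEncoder()
wind_encoded = encoder.fit_transform(wind_filled).toarray()

print(wind_filled.ravel())
print(encoder.categories_[0])
print(wind_encoded)
```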

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

cat_pipeline = Pipeline([
        ("select_cat", DataFrameSelector(["WindGustDir", "WindDir9am", "WindDir3pm"])),
        ("imp", SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ("cat_encoder", OneHotEncoder(sparse=False)),  # scikit-learn >= 1.2 renames this argument to sparse_output
    ])

For numerical variables we decided to replace the null values with the mean of the column. We also use StandardScaler to standardize the numerical values.
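The sketch below shows the same two steps on made-up numbers, and also how statistics learned on training data are reused on unseen data. The arrays `train_toy` and `test_toy` and the name `numeric_steps` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column data with missing values
train_toy = np.array([[1.0], [3.0], [np.nan]])
test_toy = np.array([[np.nan], [5.0]])

numeric_steps = Pipeline([
    ("imp", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
])

# fit_transform learns the column mean (for imputing) and the mean/std (for scaling)
train_scaled = numeric_steps.fit_transform(train_toy)

# transform reuses those learned statistics instead of recomputing them
test_scaled = numeric_steps.transform(test_toy)

print(numeric_steps.named_steps["imp"].statistics_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The missing test value is filled with the training mean, so it maps to 0 after scaling.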

In [None]:
num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(["MinTemp", "MaxTemp", "Rainfall", "WindGustSpeed", "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm",
                                              "Pressure9am", "Pressure3pm", "Temp9am", "Temp3pm"])),
        ("imp", SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('scaler', StandardScaler()),
        ])

In [None]:
preprocess_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

In [None]:
X_train_prepared = preprocess_pipeline.fit_transform(X)
y_train_prepared = y.map({'Yes':1, 'No':0})

In [None]:
#Test set
X_test = test_set.drop("RainTomorrow", axis = 1)
y_test = test_set["RainTomorrow"].copy()

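# Only transform the test set: the imputers, scaler, and encoder were already fitted on the training set
X_test_prepared = preprocess_pipeline.transform(X_test)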
y_test_prepared = y_test.map({'Yes':1, 'No':0})



# Short-List Promising Models and Fine-Tune the System

In [None]:
# Plot ROC curve
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import roc_curve, classification_report, roc_auc_score

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

We are choosing roc_auc, the area under the ROC curve, as the metric to measure the performance of each model. AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting roc_auc is as the probability that the model ranks a random positive example more highly than a random negative example. This metric is chosen because we want the model to identify as many positive cases (rain tomorrow) as possible. The positive case here is that it will rain tomorrow and the negative case is that it will not.

roc_auc is desirable for the following two reasons:

*   roc_auc is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
*   roc_auc is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
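The ranking interpretation and the scale invariance can be checked with a small sketch. The labels and scores below are made up for illustration. The pairwise estimate counts, over all positive/negative pairs, how often the positive example gets the higher score (ties count as one half).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.80, 0.20, 0.30])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

auc = roc_auc_score(y_true, y_score)

# Rescaling the scores keeps their order, so the AUC does not change
auc_rescaled = roc_auc_score(y_true, 100 * y_score + 5)

print(pairwise_auc, auc, auc_rescaled)
```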

### LogisticRegression

The first classifier to test is Logistic Regression. Logistic regression is appropriate when the dependent variable is dichotomous (binary). This dataset is not very large and we are testing both l1 and l2 penalties, so 'liblinear' is used as the solver. We are tuning the parameters C and penalty of the LogisticRegression classifier.
* C: Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
* penalty: Specifies the norm used in the penalization ('l1' or 'l2').
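To make the effect of C concrete, the sketch below fits l1-penalized models on synthetic data with several values of C and counts the non-zero coefficients. The data comes from `make_classification`, not from the weather set, and the chosen C values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 3 of them informative
X_toy, y_toy = make_classification(n_samples=400, n_features=20, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

nonzero_counts = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, penalty="l1", solver="liblinear")
    clf.fit(X_toy, y_toy)
    # Smaller C means stronger regularization, which pushes more coefficients to exactly zero
    nonzero_counts[C] = int(np.count_nonzero(clf.coef_))

print(nonzero_counts)
```

With an l2 penalty the coefficients shrink toward zero but generally do not become exactly zero, which is one reason to include both penalties in the grid.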

Model tuning is commented out because it takes a long time to run. Uncomment and run it if tuning is necessary.

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression

# #tuning
# param_grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"] }

# #training
# log = LogisticRegression(solver = 'liblinear')
# logreg_cv=GridSearchCV(log,param_grid,cv=5,scoring= 'roc_auc')
# logreg_cv.fit(X_train_prepared,y_train_prepared)
# print(logreg_cv.best_params_)
# print(logreg_cv.best_score_)


The optimal parameters obtained from tuning are {'C': 1.0, 'penalty': 'l1'} with a roc_auc value of 0.85495.

We train the LogisticRegression model with these parameters on the training set and then evaluate it on the test set.

In [None]:
#training
log_best_model = LogisticRegression(C = 1.0, penalty = 'l1', solver = 'liblinear')
log_best_model.fit(X_train_prepared,y_train_prepared)

#testing
y_pred_prob = log_best_model.predict_proba(X_test_prepared)[:,1]
fpr_log, tpr_log, thresholds = roc_curve(y_test_prepared, y_pred_prob, pos_label= 1)
print("Roc_auc_score {}".format(roc_auc_score(y_test_prepared, y_pred_prob)))

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr_log, tpr_log, "LogisticRegression")
plt.legend(loc="lower right", fontsize=16)
plt.show()

LogisticRegression classifier gives a roc_auc (area under the curve) value of 0.85575.

### RandomForestClassifier

A random forest is a classifier that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is the same as the original input sample size, but the samples are drawn with replacement when bootstrap=True (default). We are tuning the parameters max_depth and n_estimators of the RandomForestClassifier.

* max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
* n_estimators: The number of trees in the forest.
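The averaging step can be verified directly. The sketch below trains a small forest on synthetic data and compares its predicted probabilities with the mean of the probabilities predicted by its individual trees. The data and the small parameter values are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=123)
forest.fit(X_toy, y_toy)

# Each fitted tree is available in estimators_; average their class probabilities by hand
per_tree = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_proba = forest.predict_proba(X_toy)
print(np.abs(manual_average - forest_proba).max())
```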

In [None]:
from sklearn.ensemble import RandomForestClassifier
# #tuning
# param_grid = {
#     'bootstrap': [True],
#     'max_depth': [50, 60 ,70 ,80, 90],
#     'n_estimators': [20, 50, 100, 150, 200, 300]
# }

# rf = RandomForestClassifier(random_state=123)
# rf_cv=GridSearchCV(rf,param_grid,cv=5,scoring= 'roc_auc')
# rf_cv.fit(X_train_prepared,y_train_prepared)
# print(rf_cv.best_params_)
# print(rf_cv.best_score_)

The optimal parameters obtained from tuning are {'bootstrap': True, 'max_depth': 60, 'n_estimators': 300} with a roc_auc value of 0.87432.

We train the RandomForestClassifier with these parameters on the training set and then evaluate it on the test set.

In [None]:
#training
rf_best_model = RandomForestClassifier(bootstrap = True, max_depth = 60, n_estimators = 300, random_state=123)
rf_best_model.fit(X_train_prepared,y_train_prepared)

#testing
y_pred_prob = rf_best_model.predict_proba(X_test_prepared)[:,1]
fpr_rf, tpr_rf, thresholds = roc_curve(y_test_prepared, y_pred_prob, pos_label= 1)
print("Roc_auc_score {}".format(roc_auc_score(y_test_prepared, y_pred_prob)))

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr_rf, tpr_rf, "Random Forest")
plt.legend(loc="lower right", fontsize=16)
plt.show()

RandomForestClassifier gives a roc_auc value of 0.87900, which is higher than that of LogisticRegression.

In addition to classifying, RandomForestClassifier can also be used to determine feature importance. Here we plot the importance of each feature for separating the positive and negative classes.

In [None]:
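# The model was trained on the transformed features (numeric columns first, then one-hot columns),
# so rebuild their names instead of pairing the importances with X.columns
num_cols = num_pipeline.named_steps["select_numeric"].attribute_names
cat_cols = cat_pipeline.named_steps["select_cat"].attribute_names
encoder = dict(preprocess_pipeline.transformer_list)["cat_pipeline"].named_steps["cat_encoder"]
cat_names = [f"{col}_{cat}" for col, cats in zip(cat_cols, encoder.categories_) for cat in cats]
feature_names = list(num_cols) + cat_names

importances = pd.DataFrame({'Gini-importance': rf_best_model.feature_importances_}, index=feature_names)
importances.sort_values(by='Gini-importance').plot(kind='bar', rot=90, figsize=(20, 6))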

### LGBMClassifier

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It grows trees leaf-wise (vertically), whereas many other algorithms grow trees level-wise (horizontally). At each step it chooses the leaf with the maximum delta loss to grow. For the same number of leaves, a leaf-wise algorithm can reduce the loss more than a level-wise algorithm. The parameters assigned are:

* objective: Specifies the application of your model, whether it is a regression problem or a classification problem. This is a binary classification problem, so 'binary' is assigned. The code passes it through its alias, application.
* metric: Specifies the loss for model building. 'binary_logloss' is an appropriate loss for binary classification.
* boosting: Defines the type of algorithm you want to run. 'dart' is used for better accuracy.

Parameters tuned are,

* min_data_in_leaf: Setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large dataset.
* max_depth: The maximum depth of the tree.
* learning_rate: This determines the impact of each tree on the final outcome. GBM works by starting with an initial estimate which is updated using the output of each tree. The learning rate controls the magnitude of this change in the estimates.
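The role of the learning rate can be sketched with a hand-written boosting loop. This is not LightGBM's implementation: it assumes a squared-error loss on a synthetic regression problem and uses scikit-learn's DecisionTreeRegressor as the weak learner, purely to show how each tree's output is scaled before it updates the current estimate. The function name `boost` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

def boost(learning_rate, n_trees=20):
    # Start from a constant initial estimate
    prediction = np.full_like(y_toy, y_toy.mean())
    losses = []
    for _ in range(n_trees):
        # For squared error, each new tree is fitted to the current residuals
        residual = y_toy - prediction
        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residual)
        # The learning rate scales how far the estimate moves toward the tree's output
        prediction = prediction + learning_rate * tree.predict(X_toy)
        losses.append(float(np.mean((y_toy - prediction) ** 2)))
    return losses

slow = boost(0.05)
fast = boost(0.4)
print(slow[0], slow[-1])
print(fast[0], fast[-1])
```

A larger learning rate reduces the training loss faster per tree, which is why it is usually tuned together with the number of trees and the tree size.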





In [None]:
from lightgbm import LGBMClassifier
# #tuning
# param_grid = {
#     "min_data_in_leaf":[50,100,200, 300, 400],
#     "max_depth":[8, 10, 20, 50],
#     "learning_rate": [0.05, 0.1, 0.2, 0.3, 0.4]
# }

# lgbm = LGBMClassifier(application = 'binary', metric = 'binary_logloss', boosting = 'dart')
# lgbm_cv=GridSearchCV(lgbm, param_grid,cv=5,scoring= 'roc_auc')
# lgbm_cv.fit(X_train_prepared,y_train_prepared)
# print(lgbm_cv.best_params_)
# print(lgbm_cv.best_score_)

The optimal parameters obtained from tuning are {'learning_rate': 0.4, 'max_depth': 10, 'min_data_in_leaf': 300} with a roc_auc value of 0.87731.

We train the LGBMClassifier with these parameters on the training set and then evaluate it on the test set.

In [None]:
#training
lgbm_best_model = LGBMClassifier(application = 'binary', metric = 'binary_logloss', boosting = 'dart', min_data_in_leaf = 300, max_depth = 10, learning_rate = 0.4)
lgbm_best_model.fit(X_train_prepared,y_train_prepared)

#testing

y_pred_prob = lgbm_best_model.predict_proba(X_test_prepared)[:,1]
fpr_lgbm, tpr_lgbm, thresholds = roc_curve(y_test_prepared, y_pred_prob, pos_label= 1)
print("Roc_auc_score {}".format(roc_auc_score(y_test_prepared, y_pred_prob)))

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr_lgbm, tpr_lgbm, "LGBMClassifier")
plt.legend(loc="lower right", fontsize=16)
plt.show()

LGBMClassifier gives a roc_auc value of 0.87829. This is slightly less than RandomForestClassifier. However, LGBMClassifier runs faster than RandomForestClassifier.

### Neural Network

Lastly we try a simple neural network model to predict whether it is going to rain tomorrow. Deep learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks. We are using a Sequential neural network built from layers of nodes. The best network is found by changing the number of nodes and layers in the model. These numbers of nodes and layers are then used to train the neural network on the training data.

First we need to figure out what optimizer to use. Our choices are,

* adam optimizer
* Stochastic gradient descent optimizer with different learning rates.

In [None]:
from keras.utils import to_categorical
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD

n_cols =  X_train_prepared.shape[1]
target =  to_categorical(y_train_prepared)

def get_new_model():
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    return model

lr_to_test = [.000001, 0.01, 0.1, 0.2, 0.3]

for lr in lr_to_test:
    print('\n\nTesting model with learning rate: %f\n'%lr )
    # Build new model to test, unaffected by previous models
    model = get_new_model()
    # Create SGD optimizer with specified learning rate: my_optimizer
    my_optimizer = SGD(lr=lr)  # newer Keras versions name this argument learning_rate
    # Compile the model
    model.compile(optimizer = my_optimizer, loss = 'categorical_crossentropy') 
    #model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['binary_accuracy'])
    model.fit(X_train_prepared, target)


# With adam optimizer
print("Testing model with adam optimizer")
model = get_new_model()
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy')
model.fit(X_train_prepared, target)

The adam optimizer gives the lowest loss. Therefore we are using the 'adam' optimizer in our deep learning model. Now we need to train the model and validate it. We will increase the number of nodes and layers to get the best possible validation score.

In [None]:
from keras.callbacks import EarlyStopping

early_stopping_monitor = EarlyStopping(patience=2) 

# Without adding any nodes or layers
model = get_new_model()
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_prepared, target, validation_split=0.3, epochs=20, callbacks = [early_stopping_monitor])


In [None]:
# Increasing the number of nodes
model = Sequential()
model.add(Dense(120, activation='relu', input_shape = (n_cols,)))
model.add(Dense(120, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_prepared, target, validation_split=0.3, epochs=20, callbacks = [early_stopping_monitor])

Increasing the number of nodes to 120 decreased the loss of the model. Therefore we are going to use 120 as the number of nodes. Next we are going to increase the number of layers.

In [None]:
# Increasing number of layers
model = Sequential()
model.add(Dense(120, activation='relu', input_shape = (n_cols,)))
model.add(Dense(120, activation='relu'))
model.add(Dense(120, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_prepared, target, validation_split=0.3, epochs=20, callbacks = [early_stopping_monitor])

Increasing the number of layers did not decrease the loss of the model. Therefore we are going to use the same number of layers as the base model. Now let's train and test our neural network.

In [None]:
#training
model = Sequential()
model.add(Dense(120, activation='relu', input_shape = (n_cols,)))
model.add(Dense(120, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train_prepared, target, validation_split=0.3, epochs=20, callbacks = [early_stopping_monitor])

#testing
# predict returns the softmax class probabilities; column 1 is the probability of rain tomorrow
y_pred_prob = model.predict(X_test_prepared)[:,1]
fpr_nn, tpr_nn, thresholds = roc_curve(y_test_prepared, y_pred_prob, pos_label= 1)
print("Roc_auc_score {}".format(roc_auc_score(y_test_prepared, y_pred_prob)))

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr_nn, tpr_nn, "Neural Network")
plt.legend(loc="lower right", fontsize=16)
plt.show()

The neural network gives a roc_auc value of 0.87594. This is slightly less than both LGBMClassifier and RandomForestClassifier.

# Conclusion

The best classifier for predicting whether it is going to rain tomorrow, given the weather dataset provided, is the RandomForestClassifier with parameters {'bootstrap': True, 'max_depth': 60, 'n_estimators': 300}. It gives the best roc_auc value, meaning it is the most likely to rank a day followed by rain above a day followed by no rain.

In [None]:
from sklearn.metrics import classification_report

# Best model to predict
rf_best_model = RandomForestClassifier(bootstrap = True, max_depth = 60, n_estimators = 300, random_state=123)
rf_best_model.fit(X_train_prepared,y_train_prepared)

y_pred_prob = rf_best_model.predict_proba(X_test_prepared)[:,1]
fpr_rf, tpr_rf, thresholds = roc_curve(y_test_prepared, y_pred_prob, pos_label= 1)
print("Roc_auc_score {}".format(roc_auc_score(y_test_prepared, y_pred_prob)))

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr_rf, tpr_rf, "Random Forest")
plt.legend(loc="lower right", fontsize=16)
plt.show()

print(classification_report(y_test_prepared, rf_best_model.predict(X_test_prepared)))