# Rain prediction in Australia

**Task type:** Classification

**ML algorithm used:** Random Forest Classifier

**Metrics:** Accuracy, ROC-AUC

**Other methods used:** GridSearchCV, precision/recall balancing

<img src="https://www.abc.net.au/cm/rimage/11665138-3x2-xlarge.jpg?v=3" height=500 width=500>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 
import matplotlib.pyplot as plt
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns 

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline


# 1. Load data

In [None]:
df = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')

In [None]:
df.describe()

**Let's see how many missing values there are in each column.**

# 2. Data preprocessing

In [None]:
# Count missing (NaN) values per column, then express them as a fraction of all rows
missing_cnt = df.isnull().sum().sort_values(ascending=False)
missing_share = df.isnull().mean().sort_values(ascending=False)

# 'Percent' holds a fraction between 0 and 1, not a percentage
missing_data = pd.concat([missing_cnt, missing_share], axis=1, keys=['Total', 'Percent'])
missing_data
#missing_data.T

**Let's drop the features in which more than 15% of the values are missing.**

In [None]:
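# Columns whose share of missing values exceeds 15%
dropList = list(missing_data[missing_data['Percent'] > 0.15].index)
print(dropList)  # a bare expression is only displayed when it is the last line of a cell
df.drop(dropList, axis=1, inplace=True)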
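The same filtering step can be tried on a small frame first. This is a minimal sketch on made-up data; the `toy` frame, its column names, and the 0.3 cutoff are illustrative assumptions, not values from the weather dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 'b' is 50% missing, 'c' is 25% missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': [np.nan, 5.0, 6.0, 7.0],
})

# isnull().mean() gives the share of missing values per column in one step
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Drop every column whose missing share is above the cutoff
cutoff = 0.3  # illustrative value, not the 15% used in this notebook
to_drop = list(missing_share[missing_share > cutoff].index)
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Only `b` exceeds the cutoff here, so `a` and `c` survive; the rows that still contain `NaN` have to be handled separately, as the notebook does later with `dropna()`.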

In [None]:
df['Location'].unique()

In [None]:
#df.head()
df.shape

**A pairplot helps visualize dependencies and correlation between features. Some of them have quite obvious links.**

In [None]:
sns.pairplot(df[:1000])

**Several feature pairs show a clear, almost linear relationship.**

In [None]:
# Date and Location are not used as features in this model
df.drop(['Date', 'Location'], axis=1, inplace=True)
df.head()

In [None]:
df.info()

**Let's encode categorical features using one-hot-encoding.**

In [None]:
ohe = pd.get_dummies(data=df, columns=['WindGustDir','WindDir9am','WindDir3pm'])
ohe.info()
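To make the effect of `pd.get_dummies` concrete, here is a small sketch on a hypothetical two-column frame. The values and the `Pressure` column are invented for illustration; only the `WindGustDir` name is borrowed from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numeric column
toy = pd.DataFrame({
    'WindGustDir': ['N', 'S', 'N', 'W'],
    'Pressure': [1010.0, 1005.5, 1012.3, 1008.1],
})

# Each category becomes its own indicator column; numeric columns pass through
encoded = pd.get_dummies(data=toy, columns=['WindGustDir'])

indicator_cols = [c for c in encoded.columns if c.startswith('WindGustDir_')]
print(encoded.columns.tolist())
print(encoded[indicator_cols].sum(axis=1).tolist())
```

Every row has exactly one active indicator, which is what makes the encoding "one-hot". With three wind-direction features of up to 16 compass points each, the real frame grows by several dozen columns.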

In [None]:
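from sklearn import preprocessing

# Drop rows with a missing label first: casting NaN to str would create
# a third 'nan' class and LabelBinarizer would return three columns
ohe = ohe.dropna(subset=['RainToday', 'RainTomorrow']).copy()

lb = preprocessing.LabelBinarizer()

# With two classes ('No', 'Yes') the result is a single 0/1 column; ravel() flattens it
ohe['RainToday'] = lb.fit_transform(ohe['RainToday']).ravel()
ohe['RainTomorrow'] = lb.fit_transform(ohe['RainTomorrow']).ravel()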

In [None]:
ohe = ohe.dropna()
#ohe.drop('Location', axis=1, inplace=True)
y = ohe['RainTomorrow']
X = ohe.drop(['RainTomorrow'], axis=1)

# 3. Model building

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
X_train.info()

**Uncomment the code below to run a grid search for hyperparameter tuning. The model further down already uses the best parameters found by that search.**

In [None]:
#param_grid = { 
#    'n_estimators': [100, 200],
#    'max_features': ['sqrt'],  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
#    'max_depth' : [4,5,8,10],
#    'criterion' :['gini', 'entropy']
#}
#RFC = RandomForestClassifier()

#cv_RFC = GridSearchCV(estimator=RFC, param_grid=param_grid, cv=2)
#cv_RFC.fit(X_train, y_train)

In [None]:
#cv_RFC.best_params_
#sorted(zip(cv_RFC.best_estimator_.feature_importances_, X_train.columns))  # X_train.columns, not ohe.columns, which still contains the target
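Because the grid search above is commented out, a small self-contained sketch may help show what it does. It runs on synthetic data from `make_classification` with a deliberately tiny grid; the data, the grid values, and the `roc_auc` scoring choice are assumptions made for illustration, not the settings used for this notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the weather features (assumption: 300 rows, 8 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# A deliberately small grid so the search finishes quickly
param_grid = {'n_estimators': [20, 40], 'max_depth': [3, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring='roc_auc',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))

# Importances are ordered like the columns the model was fitted on
importances = search.best_estimator_.feature_importances_
print(importances.round(3))
```

`best_score_` is the mean cross-validated score of the winning combination, so it is a fairer estimate than the training accuracy of the refitted model.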

In [None]:
# max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('RFC', RandomForestClassifier(criterion='gini',
                                   max_depth=10,
                                   max_features='sqrt',
                                   n_estimators=200)),
])

In [None]:
pipe.fit(X_train, y_train)

# 4. Model evaluation

In [None]:
pipe.score(X_train, y_train)

**Cross validation scores on the whole dataset:**

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=3)
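`cross_val_score` reports accuracy by default for classifiers. The sketch below, on synthetic imbalanced data, shows how a stratified splitter and a second metric could be passed in; the data, the 80/20 class weights, and the model settings are assumptions chosen only for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumption: roughly 80% negatives, 20% positives)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

acc_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
auc_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='roc_auc')

print(acc_scores.round(3), acc_scores.mean().round(3))
print(auc_scores.round(3), auc_scores.mean().round(3))
```

Reporting the mean together with the per-fold values makes it easier to see whether one fold is an outlier.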

In [None]:
y_pred = pipe.predict(X_test)

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

#confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

#recall_score(y_test, y_pred)
#precision_score(y_test, y_pred)
f1_score(y_test, y_pred)
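To see how these scores relate to each other, the following sketch derives them by hand from a confusion matrix on eight made-up labels and checks the result against scikit-learn. The label vectors are hypothetical and unrelated to the weather data.

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical labels: 4 rainy days (1) and 4 dry days (0)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)                        # share of predicted rain that was real
recall = tp / (tp + fn)                           # share of real rain that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3), round(f1, 3), accuracy)

# The library functions should agree with the manual formulas
assert abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(f1 - f1_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(accuracy - accuracy_score(y_true_demo, y_pred_demo)) < 1e-12
```

In this toy case the accuracy is 0.625 while recall is only 0.5, which illustrates why accuracy alone can hide missed positives.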

# 5. Plotting precision-recall & ROC curves.

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

ns_probs = [0 for _ in range(len(y_test))]  # no-skill baseline: the same score for every sample
rf_probs = pipe.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

ns_auc = roc_auc_score(y_test, ns_probs)
rf_auc = roc_auc_score(y_test, rf_probs)

print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('RFC: ROC AUC=%.3f' % (rf_auc))

# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='Dummy Classifier')
plt.plot(rf_fpr, rf_tpr, marker='.', label='RFC')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
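One way to read the AUC value is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks this on four hypothetical scores and also shows why the no-skill baseline lands at 0.5; the numbers are illustrative, not model output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true_demo, scores_demo)

# Pairwise view: fraction of (positive, negative) pairs ranked correctly
# (ties would count as half, but there are none in this toy example)
pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

# A constant score carries no ranking information, so its AUC is 0.5
constant_auc = roc_auc_score(y_true_demo, np.zeros(len(y_true_demo)))

print(auc, pairwise, constant_auc)
```

Three of the four positive-negative pairs are ordered correctly, so both computations give 0.75.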

**Let's plot precision and recall against the decision threshold to see how the threshold affects each score.**

In [None]:
from sklearn.metrics import precision_recall_curve
y_scores = pipe.predict_proba(X_train)[:,1]
#y_scores

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

def plot_prc (precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
    plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
    plt.xlabel('Thresholds')
    plt.legend(loc='center left')
    plt.ylim([0,1])

plot_prc(precisions, recalls, thresholds)

In [None]:
# pipe.predict(X_train) would use the default threshold of 0.5
y_pred1 = (pipe.predict_proba(X_train)[:,1] >= 0.8).astype(int) # set threshold to 0.8
precision_score(y_train, y_pred1)

**Here we can clearly see the trade-off between precision and recall. 
Raising the threshold generally increases precision at the cost of recall, so if we want a higher recall, we should shift the threshold to a lower value.**

**However, the threshold should be chosen after a thorough analysis, ideally on data the model was not trained on, so that the model does not underperform later.**
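The direction of the trade-off is easy to verify on a handful of scores. This is a minimal sketch with ten hypothetical probabilities and labels; the three thresholds are arbitrary examples.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities and true labels (6 positives, 4 negatives)
scores_demo = np.array([0.05, 0.2, 0.35, 0.45, 0.55, 0.6, 0.7, 0.85, 0.9, 0.95])
y_true_demo = np.array([0, 0, 1, 0, 1, 0, 1, 1, 1, 1])

results = {}
for threshold in (0.3, 0.5, 0.8):
    # Predict positive whenever the score reaches the threshold
    y_pred_demo = (scores_demo >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true_demo, y_pred_demo),
        recall_score(y_true_demo, y_pred_demo),
    )

for threshold, (precision, recall) in results.items():
    print(f'threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}')
```

On this toy data, precision climbs from 0.75 to 1.0 as the threshold goes from 0.3 to 0.8, while recall drops from 1.0 to 0.5.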

# 6. Conclusion

**So, we have built a fairly simple Random Forest Classifier on the features from the dataset, applying one-hot encoding to the categorical features.**

**The accuracy of the model is around 85%, which is not bad. The ROC AUC score is 0.862.**

**We have also experimented with shifting the decision threshold of the model, which produced a sharp rise in precision. You can use this technique to set the threshold of a trained classifier manually.**