<h1>Predicting rain in Australia</h1>

<h2>List of contents</h2>
1. EDA
2. Data cleaning & feature extraction
3. Comparison of preprocessed data against original data
4. Model training
5. Model validation 


<h2>1. EDA - Exploratory Data Analysis </h2>
<h3> Preliminary data insight </h3>
Import libraries and load dataset:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# load data
df = pd.read_csv('../input/weatheraus/weatherAUS.csv')

In [None]:
# check the head of dataset
df.head()

In [None]:
# Check shape of dataset:
df.shape

In [None]:
# Check the datatypes
df.dtypes

The dataset has 24 variables, including the target variable (RainTomorrow) and one variable that we should skip according to the data description on Kaggle (RISK_MM).
Besides these there are 5 categorical variables: Location, WindGustDir, WindDir9am, WindDir3pm and RainToday (which can also be treated as binary).
There is also Date. We decide how to treat this variable after the analysis; we cannot use it as it is, because the model would overfit to the rain history.


<h3> Target value insights </h3>

In [None]:
# First - we check the distribution of the target value
counts = df['RainTomorrow'].value_counts()
print(counts)

In [None]:
# We check the exact ratio of 'Yes' samples
# (index by label: positional access such as counts[1] is deprecated for label-indexed Series)
print(counts.sum())
print(counts['Yes'] / counts.sum())

About 22% of the samples have the 'Yes' output, so the dataset is imbalanced. Now let's check the distributions of the other variables against the target value.
First the numeric data.
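
Before moving on, the imbalance can be turned into a reference number: a model that always answers 'No' is already right for the majority share of the samples. The sketch below shows this on a toy series; the 78/22 split is a hypothetical stand-in for the real `df['RainTomorrow']` column.

```python
import pandas as pd

# Toy stand-in for df['RainTomorrow'] (hypothetical values, not the real data)
target = pd.Series(['No'] * 78 + ['Yes'] * 22, name='RainTomorrow')

# normalize=True returns class shares instead of raw counts
class_share = target.value_counts(normalize=True)
yes_share = class_share['Yes']

# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_share.max()

print(f"Share of 'Yes': {yes_share:.2f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.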

<h3> Exploration of numeric variables </h3>

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
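# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
tmp = df.select_dtypes(include=numerics).copy()
tmp["RainTomorrow"] = df["RainTomorrow"]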

In [None]:
# check distributions of first 4 numerical values against target:
sns.pairplot(tmp, vars = tmp.columns[:4],hue="RainTomorrow")
plt.show()

Min and Max temperature show slight differences in distribution between the target classes. These two variables are also correlated.
Rainfall and Evaporation are skewed and probably have some outliers (long 'tail' of the distribution plot).

In [None]:
# check distributions of the next numerical values against target (cols 4-8):
sns.pairplot(tmp, vars = tmp.columns[4:8],hue="RainTomorrow")
plt.show()

From the figure above we see that 'Sunshine' may be a good feature (the peaks of the 'No' and 'Yes' distributions are clearly separated).

In [None]:
# check distributions of the next numerical values against target (cols 8-12):
sns.pairplot(tmp, vars = tmp.columns[8:12],hue="RainTomorrow")
plt.show()

Both humidities show differences in distribution between the target classes, and the pressure distributions have slightly different peaks. This probably makes them good features for distinguishing the target value. The pressures are also correlated with each other.

In [None]:
# check distributions of the next numerical values against target (cols 12-16):
sns.pairplot(tmp, vars = tmp.columns[12:16],hue="RainTomorrow")
plt.show()

We see that the cloud features separate the two target classes well. The temperatures are correlated with each other.

In [None]:
# Just out of curiosity we check RISK_MM - but according to the data description we should drop this column to avoid leaking the target
# Below note from dataset description:
# "Note: You should exclude the variable Risk-MM when training a binary classification model.
# Not excluding it will leak the answers to your model and reduce its predictability."
sns.pairplot(tmp, vars = tmp.columns[16:17],hue="RainTomorrow")
plt.show()

<h3> Exploration of non-numerical variables </h3>

In [None]:
# We should not use the exact date in our model - instead we engineer a feature by extracting the month.
# We assume that rain is more likely in some months than in others
df['Month'] = pd.to_datetime(df['Date']).dt.month

# We check the target distribution across our new feature
# (no orient argument: x = 'Month' already defines a vertical plot)
sns.countplot(x = 'Month', hue = 'RainTomorrow', data = df)

We can see that in months 6 and 7 it rained more often than in other months.
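
A count plot shows absolute counts, so months with more records can look rainier than they are. One way to compare months directly is the share of 'Yes' days per month. The sketch below uses a small made-up frame (the dates and labels are hypothetical); on the real data the same two lines would run on `df`.

```python
import pandas as pd

# Toy frame with hypothetical values; the notebook itself would use df
toy = pd.DataFrame({
    'Date': ['2010-01-05', '2010-01-20', '2010-06-03',
             '2010-06-18', '2010-07-02', '2010-07-09'],
    'RainTomorrow': ['No', 'No', 'Yes', 'No', 'Yes', 'Yes'],
})
toy['Month'] = pd.to_datetime(toy['Date']).dt.month

# Share of 'Yes' days per month: the mean of a boolean mask within each group
rain_rate = toy['RainTomorrow'].eq('Yes').groupby(toy['Month']).mean()
print(rain_rate)
```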

In [None]:
# Now check  the location
# Set the plot size to make it more readable
plt.figure(figsize=(20, 10))
sns.countplot(y = 'Location', hue =  'RainTomorrow', orient = 'h', data = df)

We see that certain locations (like Portland) have a higher chance of rain than others. It seems we can keep Location as a feature.
We just check its cardinality and value counts:

In [None]:
len(df['Location'].unique())

In [None]:
df['Location'].value_counts()

So we have 49 unique values and the distribution is quite even. We can treat this column as categorical.
We could also engineer additional features, such as geographic coordinates for these locations, but we will rely on the original dataset content only.

We check the rest of the categorical variables: 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday'

In [None]:
sns.countplot(y = 'WindGustDir', hue =  'RainTomorrow', orient = 'h', data = df)

Some directions, like NW, seem to be more strongly associated with the 'Yes' output of our target value.

In [None]:
sns.countplot(y = 'WindDir9am', hue =  'RainTomorrow', orient = 'h', data = df)

The picture is similar here, for example for 'N'.

In [None]:
sns.countplot(y = 'WindDir3pm', hue =  'RainTomorrow', orient = 'h', data = df)

Here we also see differences between directions.

In [None]:
sns.countplot(y = 'RainToday', hue =  'RainTomorrow', orient = 'h', data = df)

Here we see that rain tomorrow is much more frequent when it also rained today.
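
To put a number on this relationship, the counts can be normalised per row, which gives the share of rainy next days conditional on RainToday. This is a minimal sketch on made-up labels (hypothetical values); the real notebook would pass the two columns of `df`.

```python
import pandas as pd

# Toy labels (hypothetical), standing in for the two columns of df
toy = pd.DataFrame({
    'RainToday':    ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'RainTomorrow': ['No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes'],
})

# normalize='index' makes every row sum to 1: the share of each
# RainTomorrow value given the RainToday value of that row
cond = pd.crosstab(toy['RainToday'], toy['RainTomorrow'], normalize='index')
print(cond)
```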

<h3> Data cleaning </h3>

In [None]:
# We drop the Date to not overfit the model to a particular date and place:
df.drop(['Date'], axis=1, inplace = True)

# And RISK_MM according to the data description:
# "Note: You should exclude the variable Risk-MM when training a binary classification model.
# Not excluding it will leak the answers to your model and reduce its predictability."
df.drop(['RISK_MM'], axis=1, inplace = True)

In [None]:
# check % of missing data in columns
df.isnull().sum()/df.shape[0]*100

In [None]:
# Evaporation, Sunshine, Cloud9am and Cloud3pm have a lot of missing data (above 30%) - we remove them:
df.drop(['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm'], axis=1, inplace = True)
df.isnull().sum()/df.shape[0]*100

In [None]:
df.shape

In [None]:
# replace NaN with the column mean in numerical columns whose NaN ratio is higher than 3%
# (assign the result back: calling fillna(inplace=True) on a selected column may not update df in recent pandas)
for col in ['WindGustSpeed', 'Pressure9am', 'Pressure3pm']:
    df[col] = df[col].fillna(df[col].mean())

In [None]:
# replace missing categorical values with 'Unknown' in columns whose NaN ratio is higher than 3%:
df['WindGustDir']= df['WindGustDir'].fillna('Unknown')
df['WindDir9am']= df['WindDir9am'].fillna('Unknown')
df.isnull().sum()/df.shape[0]*100

Drop the remaining NaN values from the dataset (we assume that we can delete rows for columns where the NaN ratio is below 3%):

In [None]:
df.dropna(inplace = True)
df.isnull().sum()/df.shape[0]*100
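
The cleaning rule used in this section (drop columns with more than 30% missing values, impute those above 3%, drop rows for the rest) can be sketched as a small helper. The function name `split_by_missing_ratio` and the toy frame are hypothetical and only illustrate the thresholds; the notebook applies the same rule by hand.

```python
import numpy as np
import pandas as pd

def split_by_missing_ratio(frame, drop_above=30.0, impute_above=3.0):
    """Group column names by their percentage of missing values."""
    pct = frame.isnull().mean() * 100
    to_drop = pct[pct > drop_above].index.tolist()
    to_impute = pct[(pct > impute_above) & (pct <= drop_above)].index.tolist()
    drop_rows = pct[(pct > 0) & (pct <= impute_above)].index.tolist()
    return to_drop, to_impute, drop_rows

# Toy frame with 100 rows (hypothetical values)
toy = pd.DataFrame({
    'A': [np.nan] * 50 + [1.0] * 50,   # 50% missing -> drop the column
    'B': [np.nan] * 10 + [1.0] * 90,   # 10% missing -> impute
    'C': [np.nan] * 2 + [1.0] * 98,    # 2% missing  -> drop the affected rows
    'D': [1.0] * 100,                  # complete    -> nothing to do
})
to_drop, to_impute, drop_rows = split_by_missing_ratio(toy)
print(to_drop, to_impute, drop_rows)
```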

<h3> Check distributions of 'cleaned' data </h3>

First the target variable:

In [None]:
# First - we check the distribution of the target value
counts = df['RainTomorrow'].value_counts()
print(counts)

In [None]:
# We check the exact ratio of 'Yes' samples
# (index by label: positional access such as counts[1] is deprecated for label-indexed Series)
print(counts.sum())
print(counts['Yes'] / counts.sum())

The overall ratio of the target value after data cleaning is close to the ratio before it.

Now check numeric variables after data cleaning:

In [None]:
# build temporary dataset:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
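# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
tmp2 = df.select_dtypes(include=numerics).copy()
tmp2["RainTomorrow"] = df["RainTomorrow"]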
# check columns:
tmp2.columns

In [None]:
# check distributions of the first 4 remaining numerical columns against target, to compare with the previous plots on original data
# (Evaporation was removed because of its many NaN values, so the set of columns differs from the earlier plot):
sns.pairplot(tmp2, vars = tmp2.columns[:4],hue="RainTomorrow")
plt.show()

Distributions are similar to the ones before data cleaning. We have to remember to transform Rainfall because of its outliers and skewed distribution.

In [None]:
sns.pairplot(tmp2, vars = tmp2.columns[4:8],hue="RainTomorrow")
plt.show()

Distributions are similar to the ones before data cleaning. We have to remember to transform the humidities because of their skewed distributions.
The wind speeds have outliers.

In [None]:
sns.pairplot(tmp2, vars = tmp2.columns[8:12],hue="RainTomorrow")
plt.show()

We see that the distributions of the pressures have 'spikes' caused by our imputation of the mean value.
To overcome this we would need a more sophisticated imputation method. We leave it as it is and check how our model performs.
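
As one possible illustration of a less coarse imputation (not the method used in this notebook), missing pressures could be filled with the median of the same Location instead of the global mean, which spreads the imputed values over many points instead of one spike. The frame below is made up; the city names and readings are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical readings
toy = pd.DataFrame({
    'Location': ['Sydney', 'Sydney', 'Sydney', 'Perth', 'Perth', 'Perth'],
    'Pressure9am': [1010.0, np.nan, 1014.0, 1020.0, 1022.0, np.nan],
})

# Median pressure per location, broadcast back to the original row order
group_median = toy.groupby('Location')['Pressure9am'].transform('median')

# Fill each gap with the median of its own location
toy['Pressure9am'] = toy['Pressure9am'].fillna(group_median)
print(toy)
```

This would have to run before Location is one-hot encoded, since it relies on the original column.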

We don't need to check the Month variable: it was derived from Date and had no NaN values to impute.


Now we check the categorical data again.

In [None]:
# check the types after data removal:
df.dtypes

In [None]:
plt.figure(figsize=(20, 10))
# y = 'Location' defines a horizontal plot, so orient must be 'h' to match
sns.countplot(y = 'Location', hue = 'RainTomorrow', orient = 'h', data = df)

In [None]:
# no orient argument: x = ... already defines a vertical plot
sns.countplot(x = 'WindGustDir', hue = 'RainTomorrow', data = df)

In [None]:
sns.countplot(x = 'WindDir9am', hue = 'RainTomorrow', data = df)

In [None]:
sns.countplot(x = 'WindDir3pm', hue = 'RainTomorrow', data = df)

In [None]:
sns.countplot(x = 'RainToday', hue = 'RainTomorrow', data = df)

In general the distributions of the variables are similar to the distributions before data cleaning.
Now we can do the encoding and transformations.

<h3> Encoding the categorical data </h3>

In [None]:
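# replace the string labels with 0 and 1 numbers
# (assign the result back: replace(inplace=True) on a selected column may not update df in recent pandas)
df['RainToday'] = df['RainToday'].map({'No':0,'Yes':1})
df['RainTomorrow'] = df['RainTomorrow'].map({'No':0,'Yes':1})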

# encode categorical values
categorical = ['WindGustDir','WindDir9am','WindDir3pm','Location']
df = pd.get_dummies(df,columns = categorical,drop_first=True)


In [None]:
df.shape

Now we have to deal with the skewed distributions in the dataset.

In [None]:
df.select_dtypes(include=numerics).describe()

In [None]:
from scipy import stats

skew_var = ['Humidity3pm', 'Humidity9am', 'Rainfall', 'WindSpeed3pm', 'WindSpeed9am']
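# work on a copy so the assignments below do not write into a view of df
tmp3 = df[skew_var].copy()

for c in tmp3.columns:
    # Box-Cox needs strictly positive input, so we shift by 1 (these columns contain zeros)
    transformed, _ = stats.boxcox(df[c] + 1)
    tmp3[c] = transformed

sns.pairplot(tmp3)
plt.show()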

In [None]:
df[skew_var] = tmp3
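
To see what the transformation above does, the sketch below applies `scipy.stats.boxcox` to a synthetic right-skewed sample and compares the skewness before and after. The exponential sample is an assumption standing in for a column such as Rainfall. It also checks the closed form of the transform for a non-zero lambda, (x^lambda - 1) / lambda.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed sample (hypothetical stand-in for a skewed column)
sample = rng.exponential(scale=5.0, size=2000)

# boxcox requires strictly positive input, hence the +1 shift;
# with no lambda given it returns the transformed data and the fitted lambda
transformed, lmbda = stats.boxcox(sample + 1)

# Closed form of the transform for lambda != 0
manual = ((sample + 1) ** lmbda - 1) / lmbda

skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)
print(f"lambda={lmbda:.3f}, skew before={skew_before:.2f}, after={skew_after:.2f}")
```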


In [None]:
df.shape

<h3> Build the model </h3>

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [None]:
x = df.drop(labels = ['RainTomorrow'],axis = 1)
x.columns

In [None]:
y = df['RainTomorrow']

In [None]:
x = sc.fit_transform(x)

In [None]:
x.shape

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.4,random_state = 40)
x_test,x_validation,y_test,y_validation = train_test_split(x_test,y_test,test_size = 0.5,random_state = 40)
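
In the cells above the scaler is fitted on the whole dataset before the split, so the test and validation rows contribute to the mean and variance used for scaling. A possible alternative, sketched here on random data with hypothetical variable names, is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(40)
# Random stand-ins for the feature matrix and the target (hypothetical data)
features = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
labels = rng.integers(0, 2, size=200)

raw_train, raw_test, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.4, random_state=40)

scaler = StandardScaler()
# Mean and variance are estimated from the training rows only
scaled_train = scaler.fit_transform(raw_train)
# The held-out rows reuse those statistics; the scaler is not refitted
scaled_test = scaler.transform(raw_test)

print(scaled_train.mean(axis=0).round(6), scaled_test.mean(axis=0).round(3))
```

The held-out means are close to zero but not exactly zero, which is expected: those rows are scaled with statistics they did not influence.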

## ANN - Artificial Neural Network

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense

In [None]:
classifier = Sequential()

In [None]:
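# input_dim is read from the data (109 columns after encoding) instead of being hard-coded
classifier.add(Dense(units = 30,kernel_initializer='uniform',activation = 'relu',input_dim = x_train.shape[1]))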
classifier.add(Dense(units = 30,kernel_initializer='uniform',activation = 'relu'))
classifier.add(Dense(units = 30,kernel_initializer='uniform',activation = 'relu'))
classifier.add(Dense(units = 1,activation='sigmoid',kernel_initializer='uniform'))



In [None]:
from keras.utils import plot_model
plot_model(classifier, show_shapes=True, to_file='model.png')

In [None]:
classifier.compile(optimizer = 'adam',loss = 'binary_crossentropy',metrics = ['accuracy'])

In [None]:
classifier.fit(x_train,y_train,epochs = 100,batch_size=10)

In [None]:
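# predict_classes was removed from recent Keras releases;
# thresholding the sigmoid output at 0.5 gives the same class labels
y_pred = (classifier.predict(x_test) > 0.5).astype('int32').ravel()
y_train_pred = (classifier.predict(x_train) > 0.5).astype('int32').ravel()
y_validation_pred = (classifier.predict(x_validation) > 0.5).astype('int32').ravel()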

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
print('Training Accuracy ---->',accuracy_score(y_train,y_train_pred))
print('Testing Accuracy  ---->',accuracy_score(y_test,y_pred))
print('Validation Accuracy  ---->',accuracy_score(y_validation,y_validation_pred))

In [None]:
print(classification_report(y_train,y_train_pred))

In [None]:
print(confusion_matrix(y_train,y_train_pred))

In [None]:
print(classification_report(y_validation,y_validation_pred))

In [None]:
print(confusion_matrix(y_validation,y_validation_pred))
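
Because only about 22% of the samples are 'Yes', accuracy alone says little about how well rainy days are detected. The sketch below derives precision, recall and F1 for the positive class from a confusion matrix in scikit-learn's layout (rows are true classes, columns are predicted classes). The matrix values are hypothetical, not results of this model.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[70, 8],
               [12, 10]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted rainy days that were rainy
recall = tp / (tp + fn)      # share of rainy days that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this made-up case the accuracy is 0.80 while fewer than half of the rainy days are found, which is the kind of gap the classification reports above make visible.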