<a id="0"></a>
![Rains](https://media.giphy.com/media/xUPGcCYOXmcQqhVQli/giphy.gif)

# Will it rain tomorrow? 

- This dataset contains daily weather observations from numerous Australian weather stations.

- Classification problem: Did it rain the next day? Yes (1) or No (0).

- For this problem I will be using **Logistic Regression** and **XGBoost**.

## Contents
1. [ Importing Libraries and Data](#1)
1. [ Exploratory Data Analysis](#2)
1. [ Data Cleaning](#3)
    - [ Handling Numericals](#3.1)
    - [ Handling Categoricals](#3.2)
1. [ Splitting data](#4)
1. [ Feature Scaling](#5)
1. [ Model Training](#6)
    - [ Logistic Regression](#6.1)
    - [ XGBoost](#6.2)
1. [ Comparing Models](#7)
1. [ Conclusion](#8)

## 1. Importing Libraries and Data <a id="1"></a>

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
%matplotlib inline

# Listing the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv')

## 2. Exploratory Data Analysis <a id="2"></a>

#### Viewing shape of the data set

In [None]:
print('Dataframe shape:', df.shape)

#### Viewing information (Variable data types and counts)

In [None]:
df.info()

#### Previewing first 5 rows of data

In [None]:
df.head()

#### Finding percentage of missing data in each column

In [None]:
def null_percentage(data):
    # Count and percentage of missing values per column, most-missing first
    total = data.isnull().sum().sort_values(ascending=False)
    percent = round(total / len(data) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent Missing'])

print('''Null values:\n
{}'''.format(null_percentage(df)))

#### Visualising missing data

In [None]:
cols = df.columns
colours = ['#000099', '#ffff00']
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colours))

#### Dropping columns with over 30% missing data and un-needed RISK_MM column

In [None]:
df.drop(columns=['Sunshine', 'Evaporation', 'Cloud3pm', 'Cloud9am', 'RISK_MM'], inplace=True)
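
The 30% cutoff can also be applied programmatically instead of listing the column names by hand. Below is a minimal sketch on a made-up toy frame; the `toy` data and the `threshold` name are assumptions for illustration only and are not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values are made up)
toy = pd.DataFrame({
    'MinTemp':  [13.4, 7.4, np.nan, 9.2, 17.5],
    'Sunshine': [np.nan, np.nan, np.nan, 8.1, np.nan],
    'Rainfall': [0.6, 0.0, 0.0, 0.0, 1.0],
})

threshold = 0.30  # drop columns with more than 30% missing values

# isnull().mean() gives the fraction of missing values in each column
missing_fraction = toy.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

toy_reduced = toy.drop(columns=cols_to_drop)
print('Dropped:', cols_to_drop)
print('Remaining:', toy_reduced.columns.tolist())
```

Selecting by threshold keeps the drop list in sync with the data if the missing-value rates change between dataset versions.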

#### Separating numerical and categorical columns

In [None]:
numerical_cols = [var for var in df.columns if df[var].dtype=='f8']
categorical_cols = [var for var in df.columns if df[var].dtype=='O']

In [None]:
print('Numerical Columns: \n{}\n'.format(numerical_cols))
print('Categorical Columns: \n{}\n'.format(categorical_cols))

#### Viewing statistical properties of numericals

In [None]:
df.describe()

## 3. Data Cleaning <a id="3"></a>

#### Viewing cardinality of categoricals

In [None]:
for var in categorical_cols:
    print(var, ' has {} unique values'.format(len(df[var].unique())))

#### Handling of Date column

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

# Extracting Year, Month and Day from Date Column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Dropping original Date column
df.drop('Date', inplace=True, axis=1)

In [None]:
# Reviewing date changes
df.head()

#### Reviewing categoricals

In [None]:
categorical_cols = [var for var in df.columns if df[var].dtype=='O']

categorical_nulls = df[categorical_cols].isnull().sum()

for var in categorical_cols:
    print(var, ' has: \n{} unique values\n {} null values\n'.format(len(df[var].unique()), categorical_nulls[var]))

#### Location variable
* 49 unique values
* 0 null values

In [None]:
def location_percentage(data):
    # Count and percentage of rows per location
    total = data['Location'].value_counts()
    percent = round(total / len(data) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', '%'])

print('''Location Values:
{}'''.format(location_percentage(df)))

In [None]:
# Location dummies
location_dummies = pd.get_dummies(df.Location, drop_first=True)
location_dummies.head()

#### WindGustDir variable
* 17 unique values
* 9330 null values

In [None]:
def WindGustDir_percentage(data):
    # Count and percentage of rows per wind gust direction
    total = data['WindGustDir'].value_counts()
    percent = round(total / len(data) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', '%'])

WindGustDir_null = df['WindGustDir'].isnull().sum() / len(df['WindGustDir'])

print('''WindGustDir Values:
{}
Null percentage = {}%'''.format(WindGustDir_percentage(df), round(WindGustDir_null * 100, 2)))

In [None]:
# WindGustDir Dummies
WindGustDir_dummies = pd.get_dummies(df.WindGustDir, drop_first=True, dummy_na=True)
WindGustDir_dummies.head()

#### WindDir9am variable
* 17 unique values
* 10013 null values

In [None]:
def WindDir9am_percentage(data):
    # Count and percentage of rows per 9am wind direction
    total = data['WindDir9am'].value_counts()
    percent = round(total / len(data) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', '%'])

WindDir9am_null = df['WindDir9am'].isnull().sum() / len(df['WindDir9am'])

print('''WindDir9am Values:
{}
Null percentage = {}%'''.format(WindDir9am_percentage(df), round(WindDir9am_null * 100, 2)))

In [None]:
# WindDir9am Dummies
WindDir9am_dummies = pd.get_dummies(df.WindDir9am, drop_first=True, dummy_na=True)
WindDir9am_dummies.head()

#### WindDir3pm variable
* 17 unique values
* 3778 null values

In [None]:
def WindDir3pm_percentage(data):
    # Count and percentage of rows per 3pm wind direction
    total = data['WindDir3pm'].value_counts()
    percent = round(total / len(data) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', '%'])

WindDir3pm_null = df['WindDir3pm'].isnull().sum() / len(df['WindDir3pm'])

print('''WindDir3pm Values:
{}
Null percentage = {}%'''.format(WindDir3pm_percentage(df), round(WindDir3pm_null * 100, 2)))

In [None]:
# WindDir3pm Dummies
WindDir3pm_dummies = pd.get_dummies(df.WindDir3pm, drop_first=True, dummy_na=True)
WindDir3pm_dummies.head()

#### RainToday variable
* 3 unique values
* 1406 null values

In [None]:
def RainToday_percentage(data):
    # Count and percentage of rows per RainToday value
    total = data['RainToday'].value_counts()
    percent = round(total / len(data) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', '%'])

RainToday_null = df['RainToday'].isnull().sum() / len(df['RainToday'])

print('''RainToday Values:
{}
Null percentage = {}%'''.format(RainToday_percentage(df), round(RainToday_null * 100, 2)))

In [None]:
# RainToday Dummies
RainToday_dummies = pd.get_dummies(df.RainToday, drop_first=True, dummy_na=True)
RainToday_dummies.head()

#### RainTomorrow variable
* 2 unique values
* 0 null values

In [None]:
# Exploring 'RainTomorrow' values (labels)
def rain_tomorrow_percentage(data):
    # Count and percentage of rows per label value
    total = data['RainTomorrow'].value_counts()
    percent = round(total / len(data) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', '%'])

print('''RainTomorrow Values:
{}'''.format(rain_tomorrow_percentage(df)))

In [None]:
f, ax2 = plt.subplots(figsize=(5, 5))
ax2 = sns.countplot(x="RainTomorrow", data=df, palette='Blues')
plt.show()

### 3.1. Handling Numericals<a id="3.1"></a>

In [None]:
df[numerical_cols].head()

In [None]:
round(df[numerical_cols].describe(), 2)

In [None]:
# As we can see from the difference in 75% percentile and max values, it is likely we have outliers in:
# 'Rainfall', 'WindSpeed9am' and 'WindSpeed3pm'

# Plotting suspected outliers side by side
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
ax = sns.boxplot(x='Rainfall', data=df)
ax.set_title('Rainfall')

plt.subplot(1, 3, 2)
ax = sns.boxplot(x='WindSpeed9am', data=df)
ax.set_title('WindSpeed9am')

plt.subplot(1, 3, 3)
ax = sns.boxplot(x='WindSpeed3pm', data=df)
ax.set_title('WindSpeed3pm')

plt.tight_layout()
plt.show()

* ##### As we can see from the boxplots, all suspected variables contain outliers

In [None]:
# Plotting histograms to check skew
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
ax = df['Rainfall'].hist(bins=20)
ax.set_xlabel('Rainfall')

plt.subplot(1, 3, 2)
ax = df['WindSpeed9am'].hist(bins=20)
ax.set_xlabel('WindSpeed9am')

plt.subplot(1, 3, 3)
ax = df['WindSpeed3pm'].hist(bins=20)
ax.set_xlabel('WindSpeed3pm')

plt.tight_layout()
plt.show()

#### Removing outliers using IQR

In [None]:
#IQR for Rainfall
Q1 = df['Rainfall'].quantile(0.25)
Q3 = df['Rainfall'].quantile(0.75)
IQR = Q3 - Q1
Lower_bound = Q1 - (IQR * 1.5)
Upper_bound = Q3 + (IQR * 1.5)
print('Rainfall has outliers: < {} or > {}'.format(Lower_bound, Upper_bound))

In [None]:
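# Removing Rainfall outliers using the bounds computed above (-1.2 and 2.0 in the original run)
# Rows with a missing Rainfall value are kept, since comparisons with NaN are False
df = df[~((df['Rainfall'] < Lower_bound) | (df['Rainfall'] > Upper_bound))]
print(df.shape)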

In [None]:
#IQR for WindSpeed9am
Q1 = df['WindSpeed9am'].quantile(0.25)
Q3 = df['WindSpeed9am'].quantile(0.75)
IQR = Q3 - Q1
Lower_bound = Q1 - (IQR * 1.5)
Upper_bound = Q3 + (IQR * 1.5)
print('WindSpeed9am has outliers: < {} or > {}'.format(Lower_bound, Upper_bound))

In [None]:
# Removing WindSpeed9am outliers using the bounds computed above (-11.0 and 37.0 in the original run)
df = df[~((df['WindSpeed9am'] < Lower_bound) | (df['WindSpeed9am'] > Upper_bound))]
print(df.shape)

In [None]:
#IQR for WindSpeed3pm
Q1 = df['WindSpeed3pm'].quantile(0.25)
Q3 = df['WindSpeed3pm'].quantile(0.75)
IQR = Q3 - Q1
Lower_bound = Q1 - (IQR * 1.5)
Upper_bound = Q3 + (IQR * 1.5)
print('WindSpeed3pm has outliers: < {} or > {}'.format(Lower_bound, Upper_bound))

In [None]:
# Removing WindSpeed3pm outliers using the bounds computed above (-3.5 and 40.5 in the original run)
df = df[~((df['WindSpeed3pm'] < Lower_bound) | (df['WindSpeed3pm'] > Upper_bound))]
print(df.shape)
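
The three IQR cells above repeat the same steps, so they could be folded into a small helper. This is only a sketch: `iqr_bounds`, `drop_iqr_outliers` and the toy data are hypothetical names and values introduced here, not something the notebook defines.

```python
import numpy as np
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(data, col, k=1.5):
    """Drop rows whose value in `col` lies outside the IQR fences.

    Rows where `col` is NaN are kept, because comparisons with NaN are False.
    """
    lower, upper = iqr_bounds(data[col], k)
    outside = (data[col] < lower) | (data[col] > upper)
    return data[~outside]

# Toy column with one obvious outlier and one missing value (made-up numbers)
toy = pd.DataFrame({'WindSpeed9am': [4.0, 7.0, 9.0, 13.0, 19.0, 130.0, np.nan]})

lower, upper = iqr_bounds(toy['WindSpeed9am'])
cleaned = drop_iqr_outliers(toy, 'WindSpeed9am')
print('Bounds:', lower, upper)
print(cleaned.shape)
```

Computing the fences inside the helper avoids copying printed bounds into the next cell by hand.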

In [None]:
# Reviewing histograms after outlier removal
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
ax = df['Rainfall'].hist(bins=20)
ax.set_xlabel('Rainfall')

plt.subplot(1, 3, 2)
ax = df['WindSpeed9am'].hist(bins=20)
ax.set_xlabel('WindSpeed9am')

plt.subplot(1, 3, 3)
ax = df['WindSpeed3pm'].hist(bins=20)
ax.set_xlabel('WindSpeed3pm')

plt.tight_layout()
plt.show()

* ##### As we can see, our columns are generally less skewed with the outliers removed. Rainfall is still skewed because most of its values are 0.

#### Taking care of nulls

In [None]:
# Viewing number of nulls 
pd.DataFrame(df[numerical_cols].isnull().sum().sort_values(ascending=False)).head(12)

In [None]:
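# Filling null numericals with the column mean
# (assigning the result back avoids chained in-place modification, which newer pandas versions warn about)
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].mean())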
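
The cell above fills nulls with means computed over the whole data set. A common alternative, shown here only as a sketch and not as what this notebook does, is to learn the means from the training rows alone so that no information from the test rows leaks into training. It uses `SimpleImputer`, which is already imported at the top; the toy values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy numerical frame with a few missing values (made-up numbers)
toy = pd.DataFrame({
    'Humidity3pm': [22.0, 25.0, np.nan, 16.0, 33.0, 23.0, np.nan, 9.0],
    'Pressure9am': [1007.7, 1010.6, 1007.6, np.nan, 1010.8, 1009.2, 1009.6, 1013.4],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

# Learn the column means from the training rows only
imputer = SimpleImputer(strategy='mean')
train_filled = pd.DataFrame(imputer.fit_transform(toy_train),
                            columns=toy.columns, index=toy_train.index)

# Reuse the training means to fill the test rows
test_filled = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy.columns, index=toy_test.index)

print(imputer.statistics_)
```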

#### Viewing correlations

In [None]:
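# Correlations are only defined for numeric columns; recent pandas versions raise an error on object columns
corr_matrix = df.select_dtypes(include='number').corr()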

plt.figure(figsize=(16,12))
plt.title('Correlation Heatmap of Rain in Australia Dataset')
ax = sns.heatmap(corr_matrix, square=True, annot=True, fmt='.2f', linecolor='white')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)           
plt.show()

### 3.2. Handling Categoricals<a id="3.2"></a>

In [None]:
categorical_cols

#### Replacing categoricals with dummies of each column

In [None]:
# One-hot encoding each categorical feature (with an extra NaN indicator column) and dropping the original
for col in ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']:
    dummies = pd.get_dummies(df[col], dummy_na=True, prefix=col)
    df = pd.concat([df, dummies], axis=1).drop(columns=col)


#### Viewing data with dummies

In [None]:
df.head()
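
To make the effect of `dummy_na=True` concrete, here is a small sketch on a made-up four-row column. The `toy` frame is an assumption for illustration only. Each row gets exactly one active indicator, and the missing value gets its own column instead of becoming a row of zeros.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value (made-up data)
toy = pd.DataFrame({'RainToday': ['No', 'Yes', np.nan, 'No']})

# dummy_na=True adds a separate indicator column for missing values
dummies = pd.get_dummies(toy['RainToday'], dummy_na=True, prefix='RainToday')

print(dummies.columns.tolist())
print(dummies.astype(int))
```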

## 4. Splitting data<a id="4"></a>

#### Splitting data into X and y variables

In [None]:
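X = df.drop('RainTomorrow', axis=1)
# Encoding the label as stated in the introduction: Yes -> 1, No -> 0
# (numeric labels are also what recent XGBoost versions expect)
y = df['RainTomorrow'].map({'No': 0, 'Yes': 1})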

print('''X Shape: {}
y Shape: {}'''.format(X.shape, pd.DataFrame(y).shape))

#### Splitting data into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Checking shapes of each set
print('''X train: {}
X test: {}
y train: {}
y test: {}'''.format(X_train.shape, X_test.shape, pd.DataFrame(y_train).shape, pd.DataFrame(y_test).shape))
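
Since the label is imbalanced (see the RainTomorrow counts above), one option is to pass `stratify` to `train_test_split` so both sets keep the same class proportions. This is a sketch on synthetic data, not the split used in this notebook; `X_toy`, `y_toy` and the 20% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 20% positive labels (made-up proportions)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 200 + [0] * 800)

# stratify=y_toy preserves the class ratio in both the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print('Train positive rate:', y_tr.mean())
print('Test positive rate:', y_te.mean())
```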

## 5. Feature Scaling<a id="5"></a>

In [None]:
scaler = StandardScaler()
cols = pd.DataFrame(X_train).columns

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns= cols)
X_test = pd.DataFrame(scaler.transform(X_test), columns=cols)

In [None]:
# Viewing scaled training set
X_train.head()

In [None]:
# Viewing scaled test set
X_test.head()

## 6. Model Training<a id="6"></a>

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

random_state = 42

### 6.1. Logistic Regression<a id="6.1"></a>

In [None]:
log_reg = LogisticRegression(random_state=random_state)
log_reg.fit(X_train, y_train)

log_reg_pred = log_reg.predict(X_test)

log_reg_cm = confusion_matrix(y_test, log_reg_pred)

print('Model accuracy score:\n{}\nConfusion Matrix:\n{}'.format(round(accuracy_score(y_test, log_reg_pred), 4), log_reg_cm))
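
The confusion matrix holds more than the accuracy score: precision and recall for the rainy class can be read straight from its four cells. Below is a minimal sketch with made-up labels; `y_true` and `y_pred` are illustrative and are not this notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = rain tomorrow, 0 = no rain
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, scikit-learn orders the matrix as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the days predicted rainy, how many were rainy
recall = tp / (tp + fn)      # of the rainy days, how many were predicted

print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}'.format(accuracy, precision, recall))
```

On imbalanced labels, recall for the rainy class is often much lower than accuracy, which is why it is worth checking both.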

In [None]:
# Checking for over/under fitting
print('Training set score: \n{}\nTest set score: \n{}'.format(round(log_reg.score(X_train, y_train), 4), round(log_reg.score(X_test, y_test), 4)))

#### Using grid-search to find better parameters for Logistic Regression

In [None]:
param_grid = {'C' : [1, 25, 50, 75, 100]}

log_reg_2 = LogisticRegression(random_state=random_state, solver='lbfgs')

grid_search_log = GridSearchCV(log_reg_2, param_grid, scoring="roc_auc", cv=5)

grid_search_log.fit(X_train, y_train)

In [None]:
print('Best Parameters:\n{}'.format(grid_search_log.best_params_))

In [None]:
# Using the best parameters found by grid search (C=50 in the original run)
log_reg_3 = LogisticRegression(random_state=random_state, **grid_search_log.best_params_)
log_reg_3.fit(X_train, y_train)

log_reg_3_pred = log_reg_3.predict(X_test)

log_reg_3_cm = confusion_matrix(y_test, log_reg_3_pred)

print('Model accuracy score:\n{}\nConfusion Matrix:\n{}'.format(round(accuracy_score(y_test, log_reg_3_pred), 4), log_reg_3_cm))

### 6.2. XGBoost<a id="6.2"></a>

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

In [None]:
xgb_pred = xgb.predict(X_test)

xgb_cm = confusion_matrix(y_test, xgb_pred)

print('Model accuracy score:\n{}\nConfusion Matrix:\n{}'.format(round(accuracy_score(y_test, xgb_pred), 4), xgb_cm))

In [None]:
# Checking for over/under fitting
print('Training set score: \n{}\nTest set score: \n{}'.format(round(xgb.score(X_train, y_train), 4), round(xgb.score(X_test, y_test), 4)))

#### Using grid-search to find better parameters for XGBoost

In [None]:
param_grid = {
     'eta'    : [0.01, 0.15, 0.30 ] ,
     'max_depth'        : [ 3, 6, 9],
     'min_child_weight' : [ 1, 3, 5],
     'gamma'            : [ 0.0, 0.2, 0.4 ]
     }

xgb_2 = XGBClassifier(random_state=random_state)

grid_search = GridSearchCV(xgb_2, param_grid, n_jobs=4, scoring="roc_auc", cv=5)

grid_search.fit(X_train, y_train)

In [None]:
print('Best Parameters:\n{}'.format(grid_search.best_params_))
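
Besides `best_params_`, a fitted `GridSearchCV` exposes `cv_results_` (per-combination scores) and `best_estimator_` (the model refit on the full training data with the best parameters). The sketch below uses logistic regression on synthetic data so that it runs quickly; the data and the small `C` grid are assumptions for illustration, not the search run above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (made up for illustration)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.1, 1, 10]},
    scoring='roc_auc',
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per parameter combination, with mean and spread of the CV score
results = pd.DataFrame(search.cv_results_)[
    ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))

# With the default refit=True, the best model is already fitted and ready to use
best_model = search.best_estimator_
print(best_model.predict(X_toy[:5]))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top-ranked combination is meaningfully better than the runners-up.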

In [None]:
# Using the best parameters found by grid search
# (eta=0.01, gamma=0.4, max_depth=9, min_child_weight=3 in the original run)
xgb_3 = XGBClassifier(random_state=random_state, **grid_search.best_params_)
xgb_3.fit(X_train, y_train)

xgb_3_pred = xgb_3.predict(X_test)

xgb_3_cm = confusion_matrix(y_test, xgb_3_pred)

print('Model accuracy score:\n{}\nConfusion Matrix:\n{}'.format(round(accuracy_score(y_test, xgb_3_pred), 4), xgb_3_cm))

## 7. Comparing Models<a id="7"></a>

In [None]:
# Accuracy scores of each model
log_reg_acc = round(accuracy_score(y_test, log_reg_pred), 4)
log_reg_3_acc = round(accuracy_score(y_test, log_reg_3_pred), 4)
xgb_acc = round(accuracy_score(y_test, xgb_pred), 4)
xgb_3_acc = round(accuracy_score(y_test, xgb_3_pred), 4)

In [None]:
# Creating dataframe showing accuracy scores of each model
compare = {'Model': ['Logistic Regression Original', 'Logistic Regression Tuned', 'XGBoost Original', 'XGBoost Tuned'],
          'Accuracy score': [log_reg_acc, log_reg_3_acc, xgb_acc, xgb_3_acc]}

pd.DataFrame(data=compare)
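
Accuracy scores are easier to judge next to a trivial baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with an 80/20 class split; the data and the split are assumptions for illustration, not this notebook's results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with an 80/20 class split; features are irrelevant to the baseline
y_tr = np.array([0] * 80 + [1] * 20)
y_te = np.array([0] * 16 + [1] * 4)
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
print('Baseline accuracy:', baseline_acc)
```

A model is only adding value to the extent that it beats this number.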
 

## 8. Conclusion<a id="8"></a>

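As we can see from the comparison table, our best model is the XGBoost model trained with the best parameters from grid search, which reaches a test-set accuracy of 87.85%. Because most days in the data set are not followed by rain, accuracy alone can overstate how well rainy days are caught, so this figure is best read alongside the confusion matrix. With that in mind, the results suggest the model is a reasonable predictor of next-day rain given today's observations.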

[Back to top](#0)