# EDA, Data Cleaning, Data Preprocessing and Modelling

Australia has recently seen horrific droughts and bush fires. These have led to the <font color="black" size="4px"><b>destruction</b></font> of Australian wildlife habitats, the loss of livestock and the ruin of local communities.
 
 The World Meteorological Organisation says weather forecasting is a <font color="black" size="4px"><b> vital element</b></font> needed “in order to meet the food, fodder, fibre and renewable agri-energy needs of rapidly growing populations”.
 
 In the US, improved climate forecasting in the corn belt is expected to bring in an extra \$1.2 billion to \$2.9 billion over a ten-year period. The World Bank Group estimates that improved global weather forecasting would result in increases in productivity worth \$30 billion per year, as well as reducing asset losses by \$2 billion per year.

 https://www.vaisala.com/en/blog/2019-07/day-day-benefits-weather-detection
 
 If we can help predict rain, it may help farmers plan their harvests so that they can take advantage of the rain.

![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRwF1wKjwoPgBqBjgn3r0E8MQj85mA6BfDm8SlJWQK9Lszo_MfF)

This kernel has been inspired by several kernels. Notably https://www.kaggle.com/prashant111/extensive-analysis-eda-fe-modelling by Prashant11.
If you have any questions, please comment, and if you like it, please <font color='red' size="4px" ><b>upvote</b></font>

 Table of Contents
 
 * <a href="#1"><font size="5px">Elementary Data Analysis </font></a>
     * <a href="#1.1"><font size="4px">Hypothesis</font></a>
     * <a href="#1.2"><font size="4px">Import Dependencies</font></a>   
     * <a href="#1.3"><font size="4px">Import Rain Data</font></a>
     * <a href="#1.3"><font size="4px"> Category and Numerical Features</font></a>
         * <a href="#1.31"><font size="4px"> Category Data </font></a>
             * <a href="#1.311"><font size="4px"> Target </font></a>
             * <a href="#1.312"><font size="4px"> Categorical Features  </font></a>
                 * <a href="#1.3121"> <font size="3px">Location   </font></a>
                 * <a href="#1.3122"> <font size="3px">Dates </font></a>
                 * <a href="#1.3123"> <font size="3px">Rain Today  </font></a>      
                 * <a href="#1.3124"> <font size="3px">WindGustDir, WindDir9am and WindDir3pm </font></a>                
         * <a href="#1.4"><font size="4px"> Numerical Features</font> </a>
             * <a href="#1.41">  <font size="4px"> Outliers</font></a>
             * <a href="#1.42">  <font size="4px"> Correlation</font></a>
             * <a href="#1.43">  <font size="4px"> Exploration of Humidity </font></a>
 * <a href="#2"> <font size="5px">  Data Cleaning</font></a>
     * <a href="#2.1"> <font size="4px">Treatment of Outliers</font></a>
     * <a href="#2.2"> <font size="4px">Treatment of Missing Data</font></a>
         * <a href="#2.21"><font size="4px">  Numerical Features</font></a>
         * <a href="#2.22"><font size="4px"> Categorical Features</font></a>
 * <a href="#3"><font size="5px"> Data Preprocessing</font></a>
     * <a href="#3.1"> <font size="4px"> Target and Features</font></a>
     * <a href="#3.2"> <font size="4px"> Imbalance of the Target</font></a>
     * <a href="#3.3"><font size="4px">  Split the data to training and testing sets</font></a>
     * <a href="#3.4"><font size="4px"> Encode categorical features</font></a>
         * <a href="#3.41"><font size="4px"> Binary Categorical Features</font></a>
         * <a href="#3.42"><font size="4px"> Dates</font></a>
         * <a href="#3.43"><font size="4px"> Other Categorical Features</font></a>
     * <a href="#3.5"><font size="4px">Over sampling</font></a>
     * <a href="#3.6"><font size="4px">Feature Scaling</font></a>
 * <a href="#4"> <font size="5px">Training the Model</font></a>
     * <a href="#4.1"> <font size="4px">Performance Metrics</font></a>
     * <a href="#4.2"> <font size="4px">Grid Search</font></a>
         * <a href="#4.21"><font size="4px">Logistic Regression</font></a>
         * <a href="#4.22"><font size="4px"> Random Forest Classifier</font></a>
 * <a href="#5"> <font size="5px">Conclusion</font></a>
 


<a id="1"></a>
## Elementary Data Analysis


<a id="1.1"></a>
### Hypothesis

According to NASA, 
"Precipitation, evaporation, freezing and melting and condensation are all part of the hydrological cycle - a never-ending global process of water circulation from clouds to land, to the ocean, and back to the clouds."
My initial hypothesis for the elementary data analysis is that evaporation is the main factor in the next day's rainfall.


<a id="1.2"></a>
### Import dependencies


In [None]:
import warnings
warnings.filterwarnings('ignore')
# This Python 3 environment is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# and comes with many helpful analytics libraries installed, such as the ones imported below

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os


# Any results you write to the current directory are saved as output.
# import libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium import plugins
%matplotlib inline

<a id="1.3"></a>
### Import Data

In [None]:
data = pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv")
locationData = pd.read_csv("../input/loccsv/location.csv")

data.describe()

There are 142193 data observations and 23 features.

<a id="1.4"></a>
### Category and Numerical Features

We will split the features into categorical and numerical.
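
A minimal sketch of one way to make this split, using `select_dtypes` on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only, not the weather data.

```python
import pandas as pd

# Toy frame standing in for the weather data (values are made up)
toy = pd.DataFrame({
    "Location": ["Albury", "Sydney", "Perth"],
    "MinTemp": [13.4, 7.4, 12.9],
    "Rainfall": [0.6, 0.0, 0.0],
    "RainToday": ["No", "No", "No"],
})

# Numeric columns are treated as numerical, everything else as categorical
toy_numerical_columns = toy.select_dtypes(include="number").columns.tolist()
toy_category_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(toy_category_columns)
print(toy_numerical_columns)
```

Selecting by "number" rather than by a dtype string keeps the two lists complementary, so no column can be missed.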


In [None]:
category_columns = [col for col in data.columns if data[col].dtype=="object" ]
numerical_columns = [col for col in data.columns if data[col].dtype!="object" ]
# Check that all columns are accounted for
#print(len(category_columns)+len(numerical_columns)==len(data.columns))
print("There are {} columns. {} are category and {} are numerical".format(len(data.columns),len(category_columns),len(numerical_columns)))

<a id="1.41"></a>
#### Category Data

In [None]:
print("The category columns are {}".format(category_columns))

<a id="1.411"></a>
##### Target Column: RainTomorrow

In [None]:
print("There are {} missing values for the target".format(data['RainTomorrow'].isnull().sum()))

In [None]:
def balanceTarget (target):
    rainTodayAnalysis = target.value_counts()
    f,  ax = plt.subplots(nrows=1,ncols=2) 
    # x and y are passed by keyword; recent seaborn releases do not accept them positionally
    sns.barplot(x=rainTodayAnalysis.index, y=rainTodayAnalysis.values, ax=ax[0])
    sns.barplot(x=rainTodayAnalysis.index, y=rainTodayAnalysis.values/len(data), ax=ax[1])
    ax[0].set(xlabel='Rain Tomorrow', ylabel='Number of Occurrences')
    ax[1].set(xlabel='Rain Tomorrow', ylabel='Percentage of Occurrences')
    plt.tight_layout()

balanceTarget(data["RainTomorrow"])

There is a clear class imbalance: No accounts for over 70 percent of the observations. This matters when training a machine learning model, because a model can reach high accuracy simply by always predicting No.
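
To make the effect of the imbalance concrete, here is a small sketch that computes the class shares and the accuracy of always predicting the majority class. The `toy_target` series is made up (an assumption), with a skew similar to the one described above.

```python
import pandas as pd

# Made-up target with a skew similar to the one described above (assumption)
toy_target = pd.Series(["No"] * 78 + ["Yes"] * 22)

class_share = toy_target.value_counts(normalize=True)
majority_class = class_share.idxmax()
# Accuracy obtained by always predicting the majority class
baseline_accuracy = class_share.max()

print(class_share)
print("Always predicting '{}' gives {:.0%} accuracy".format(majority_class, baseline_accuracy))
```

Any model trained later should be judged against this baseline rather than against 50%.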

<a id="1.412"></a>
##### Categorical Features

In [None]:
cat_features = list(filter(lambda x: x!="RainTomorrow", category_columns))
cat_features_miss = data[cat_features].isnull().sum()
f,  ax = plt.subplots(nrows=1,ncols=2) 
sns.barplot(x=cat_features_miss.index, y=cat_features_miss.values, ax=ax[0])
sns.barplot(x=cat_features_miss.index, y=cat_features_miss.values/len(data), ax=ax[1])
ax[0].set(ylabel='Number of Occurrences')
ax[0].set_xticklabels(ax[0].get_xticklabels(),rotation=75)
ax[1].set( ylabel='Percentage of Occurrences')
ax[1].set_xticklabels(ax[1].get_xticklabels(),rotation=75)
ax[1].set_ylim(0,1)                      
plt.suptitle("Missing Values for Categorical Data")
plt.tight_layout(rect=[0, 0.03, 1, 0.90])

The missing values are less than 10% of the total data points for WindGustDir, WindDir9am, WindDir3pm and RainToday.

In [None]:
cat_features = [col for col in category_columns if col != "RainTomorrow"]
# nunique() ignores missing values by default
cat_features_dict = {feature: data[feature].nunique() for feature in cat_features}
uniqueCat = pd.DataFrame(cat_features_dict,index=["Number of Unique Values"])
uniqueCat

Ignoring the missing values, RainToday is binary, like RainTomorrow. A high number of unique values in a categorical feature can also cause problems for machine learning models, for example by producing many sparse columns when one-hot encoded. Dates can be encoded before running the machine learning models.

<a id="1.4121"></a>
###### Location

In [None]:
locationData = locationData.dropna()

m=folium.Map([-25.2744,133.7751],zoom_start=4,width="70%",height="70%",left="10%")
for lat,lon,area in zip(locationData['Latitude'],locationData['Longitude'],locationData['Location']):
     folium.CircleMarker([lat, lon],
                            popup=area,
                            radius=3,
                            color='b',
                            fill=True,
                            fill_opacity=0.7,
                            fill_color="green",
                           ).add_to(m)
m.save('Australia.html')
m

These are the areas in which observations have been recorded. As we can see, the stations are denser in the south-eastern region of Australia, while the north-west region doesn't have any observations. This could be due to the sparse population in that area, which means the need for an observation post is minimal. In addition, there is one observation station east of Australia (Norfolk Island).

<a id="1.4122"></a>
###### Dates

In [None]:

# Dates are stored as YYYY-MM-DD strings, so min/max give the earliest and latest date
print("The date ranges from {} to {}".format(data["Date"].min(),data["Date"].max()))

In [None]:
dateAnalysis = data.Date.value_counts().value_counts()
dateDict = {}
for i in range(1,max(dateAnalysis.index)+1):
    if i in dateAnalysis.index:
        dateDict[i]=dateAnalysis[i]
    else:
        dateDict[i]=0
dateAnalysis=pd.DataFrame.from_dict(dateDict, orient='index',columns=["count"])

In [None]:
f,  ax = plt.subplots(1,1,figsize=(14,4)) 
sns.barplot(x=dateAnalysis.index, y=dateAnalysis["count"], ax=ax, color="blue")
ax.set(ylabel='Number of Occurrences')
ax.set_xticklabels(ax.get_xticklabels(),rotation=75)                  
plt.suptitle("Count of locations per date")
plt.tight_layout(rect=[0, 0.03, 1, 0.90])

We can see that the majority of the dates have over 42 locations.
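
The count above relies on applying `value_counts` twice. A small sketch on made-up dates (an assumption, not the real data) shows what each step returns:

```python
import pandas as pd

# Made-up dates: one row per (date, location) observation
toy_dates = pd.Series(["2017-01-01", "2017-01-01", "2017-01-01",
                       "2017-01-02", "2017-01-02",
                       "2017-01-03", "2017-01-03"])

# First pass: date -> number of rows (locations) recorded on that date
rows_per_date = toy_dates.value_counts()
# Second pass: number of locations -> how many dates have exactly that many
dates_per_count = rows_per_date.value_counts()

print(dates_per_count.sort_index())
```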

<a id="1.4123"></a>
###### Rain Today

In [None]:
import branca.colormap as cm

countRainToday = data[["Location","RainToday"]]
countRainToday=countRainToday.groupby("Location")['RainToday'].apply(lambda x: (x=='Yes').sum()).reset_index(name='count')
countRainToday=countRainToday.set_index("Location").join(locationData.set_index("Location")).reset_index("Location")
countRainToday['colour']=countRainToday['count'].apply(lambda count:"darkblue" if count>=1000 else
                                         "blue" if count>=800 and count<1000 else
                                         "green" if count>=600 and count<800 else
                                         "orange" if count>=400 and count<600 else
                                         "tan" if count>=200 and count<400 else
                                         "red")

In [None]:
m=folium.Map([-25.2744,133.7751],zoom_start=4,width="70%",height="70%",left="10%")
for lat,lon,area,radius,colour in zip(countRainToday['Latitude'],countRainToday['Longitude'],countRainToday['Location'],countRainToday["count"],countRainToday["colour"]):
     folium.CircleMarker([lat, lon],
                            popup=area,
                            radius=7,
                            color='b',
                            fill=True,
                            fill_opacity=0.9,
                            fill_color=colour,
                           ).add_to(m)
from branca.element import Template, MacroElement

template = """
{% macro html(this, kwargs) %}

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>jQuery UI Draggable - Default functionality</title>
  <link rel="stylesheet" href="//code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">

  <script src="https://code.jquery.com/jquery-1.12.4.js"></script>
  <script src="https://code.jquery.com/ui/1.12.1/jquery-ui.js"></script>
  
  <script>
  $( function() {
    $( "#maplegend" ).draggable({
                    start: function (event, ui) {
                        $(this).css({
                            right: "auto",
                            top: "auto",
                            bottom: "auto"
                        });
                    }
                });
});

  </script>
</head>
<body>

 
<div id='maplegend' class='maplegend' 
    style='position: absolute; z-index:9999; border:2px solid grey; background-color:rgba(255, 255, 255, 0.8);
     border-radius:6px; padding: 10px; font-size:14px; right: 20px; bottom: 20px;'>
     
<div class='legend-title'>Legend</div>
<div class='legend-scale'>
  <ul class='legend-labels'>
    <li><span style='background:darkblue;opacity:0.7;'></span>Over 1000 days</li>
    <li><span style='background:blue;opacity:0.7;'></span>800-1000 days</li>
    <li><span style='background:green;opacity:0.7;'></span>600-800 days</li>
    <li><span style='background:orange;opacity:0.7;'></span>400-600 days</li>
    <li><span style='background:tan;opacity:0.7;'></span>200-400 days</li>
    <li><span style='background:red;opacity:0.7;'></span>0-200 days</li>

  </ul>
</div>
</div>
 
</body>
</html>

<style type='text/css'>
  .maplegend .legend-title {
    text-align: left;
    margin-bottom: 5px;
    font-weight: bold;
    font-size: 90%;
    }
  .maplegend .legend-scale ul {
    margin: 0;
    margin-bottom: 5px;
    padding: 0;
    float: left;
    list-style: none;
    }
  .maplegend .legend-scale ul li {
    font-size: 80%;
    list-style: none;
    margin-left: 0;
    line-height: 18px;
    margin-bottom: 2px;
    }
  .maplegend ul.legend-labels li span {
    display: block;
    float: left;
    height: 16px;
    width: 30px;
    margin-right: 5px;
    margin-left: 0;
    border: 1px solid #999;
    }
  .maplegend .legend-source {
    font-size: 80%;
    color: #777;
    clear: both;
    }
  .maplegend a {
    color: #777;
    }
</style>
{% endmacro %}"""

# Code for the legend from https://nbviewer.jupyter.org/gist/talbertc-usgs/18f8901fc98f109f2b71156cf3ac81cd

macro = MacroElement()
macro._template = Template(template)

m.get_root().add_child(macro)
m.save('RainToday.html')

m

From the analysis, we can see that most of the rain falls in the coastal regions of Australia, while regions in the centre barely get any rain.

In [None]:
dateRainToday = data[["Date","RainToday"]].copy()  # copy so the new columns are not written to a view of data
dateRainToday['Date'] = pd.to_datetime(dateRainToday['Date'])
dateRainToday['Year'] = dateRainToday['Date'].dt.year
dateRainToday['Month'] = dateRainToday['Date'].dt.month
dateRainToday.drop("Date", axis=1, inplace = True)
years = dateRainToday['Year'].unique().tolist()
dateRainToday["Period"] = dateRainToday['Year'].apply(str) +"-"+ dateRainToday['Month'].apply(str)
dateRainToday =dateRainToday.groupby(["Year","Month","Period"])['RainToday'].apply(lambda x: (x=='Yes').sum()).reset_index(name='count')
dateRainToday.drop(["Month"], axis=1, inplace = True)
years = sorted(years, key=lambda x: int(x))
dateRainToday[dateRainToday["Year"]==2012]

In [None]:
g = sns.FacetGrid(dateRainToday, col="Year", col_wrap=4, height=4, ylim=(0, 500),margin_titles=True,sharey=True,sharex=False)
g.map(sns.barplot, "Period", "count", ci=None,order=None);
for ax in g.axes.ravel():
    ax.set_xticklabels(ax.get_xticklabels(), rotation=75)
plt.subplots_adjust(hspace=0.4, wspace=0.4)

If we discard the years with insignificant levels of data, we can observe that the majority of the rainfall occurs in the winter months (June, July and August). The only exception is 2010, while 2011 seems relatively flat. According to the Australian Government Bureau of Meteorology, "In 2010, Australia experienced its third-wettest year since national rainfall records began in 1900, with second place taken by 2011" - http://www.bom.gov.au/climate/enso/history/ln-2010-12/rainfall-flooding.shtml

<a id="1.4124"></a>
###### WindGustDir, WindDir9am and WindDir3pm

In [None]:
dirGust =  data[["WindGustDir"]]
dirGust =dirGust["WindGustDir"].value_counts().rename_axis('direction').reset_index(name='count')
dirGust["name"]="WindGustDir"
dir9am =  data[["WindDir9am"]]
dir9am=dir9am["WindDir9am"].value_counts().rename_axis('direction').reset_index(name='count')
dir9am["name"]="WindDir9am"
dir3pm =  data[["WindDir3pm"]]
dir3pm=dir3pm["WindDir3pm"].value_counts().rename_axis('direction').reset_index(name='count')
dir3pm["name"]="WindDir3pm"
direction = pd.concat([dirGust,dir9am,dir3pm])
#Graph the number of directions
g = sns.FacetGrid(direction, col="name", col_wrap=3, height=4, ylim=(0, direction["count"].max()*1.1),margin_titles=True,sharey=True,sharex=False)
g.map(sns.barplot, "direction", "count", ci=None,order=["N","NNE","NE","ENE","E","ESE","SE","SSE","S","SSW","SW","WSW","W","WNW","NW","NNW"]);
for ax in g.axes.ravel():
    ax.set_xticklabels(ax.get_xticklabels(), rotation=75)
plt.subplots_adjust(hspace=0.4, wspace=0.4)

WindDir9am and WindDir3pm are almost uniform apart from a few spikes, and WindGustDir is almost normally distributed.

 <a id="1.413"></a>
##### Numerical Features

In [None]:
print("The numerical columns are {}".format(numerical_columns))

In [None]:
num_features = list(numerical_columns)
num_features_miss = data[num_features].isnull().sum()
f,  ax = plt.subplots(nrows=1,ncols=2,figsize=(15,8)) 
sns.barplot(x=num_features_miss.index, y=num_features_miss.values, ax=ax[0])
sns.barplot(x=num_features_miss.index, y=num_features_miss.values/len(data), ax=ax[1])
ax[0].set(ylabel='Number of Occurrences')
ax[0].set_xticklabels(ax[0].get_xticklabels(),rotation=75)
ax[1].set( ylabel='Percentage of Occurrences')
ax[1].set_xticklabels(ax[1].get_xticklabels(),rotation=75)
ax[1].set_ylim(0,1)                      
plt.suptitle("Missing Values for Numerical Data")
plt.tight_layout(rect=[0, 0.03, 1, 0.90])

We can see that Evaporation, Sunshine, Cloud9am and Cloud3pm are missing over 40% of their data. 

 <a id="1.4131"></a>
###### Outliers
When using a machine learning model, we need to look at the outliers.

In [None]:

appended_data = []
for feature in numerical_columns:
    name = pd.DataFrame()
    data["binned"]=pd.cut(data[feature], 10)
    name[feature]=data["binned"]
    data.drop(["binned"], axis=1, inplace=True)
    name=name[feature].value_counts().rename_axis('numerical').reset_index(name='count')
    name["name"]=feature
    appended_data.append(name)
num_data = pd.concat(appended_data)

g = sns.FacetGrid(num_data, col="name", col_wrap=4, height=4,margin_titles=True,sharey=False,sharex=True)
g.map(sns.barplot, "numerical", "count", ci=None,order=None);
for ax in g.axes.ravel():
    ax.set_xticklabels(labels="")
    ax.set(xlabel='', ylabel='')
plt.subplots_adjust(hspace=0.4, wspace=0.4)

* After grouping each numerical feature into 10 bins, we can see that all the features are skewed, so we shouldn't use the z-score to find the outliers; we will use the IQR score instead.
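
The difference between the two rules can be sketched on a made-up right-skewed sample. The exponential sample below is an assumption chosen only to show the skew, and the cut-offs (3 standard deviations, 1.5 IQR) are the conventional ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up, right-skewed sample (assumption): mostly small values with a long tail
skewed = pd.Series(rng.exponential(scale=2.0, size=10_000))

# z-score rule: more than 3 standard deviations from the mean
z_scores = (skewed - skewed.mean()) / skewed.std()
z_outliers = int((z_scores.abs() > 3).sum())

# IQR rule: more than 1.5 * IQR outside the quartiles
q1, q3 = skewed.quantile(0.25), skewed.quantile(0.75)
iqr = q3 - q1
iqr_outliers = int(((skewed < q1 - 1.5 * iqr) | (skewed > q3 + 1.5 * iqr)).sum())

print("skewness:", round(skewed.skew(), 2))
print("z-score outliers:", z_outliers, "IQR outliers:", iqr_outliers)
```

The mean and standard deviation that the z-score depends on are themselves pulled by the long tail, while the quartiles are not, which is why the two rules disagree on skewed data.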

In [None]:
data[numerical_columns].describe()

Features such as Rainfall, Evaporation, WindGustSpeed, WindSpeed9am, WindSpeed3pm and RISK_MM have extremely high maximum values relative to their 75th percentile (ratios of 463, 19.59, 2.81, 6.84, 3.65 and 463 respectively). We shall see whether these upper values lead to a higher chance of rainfall tomorrow. If they do, we should keep them; if they don't, we will remove them, as this will help the machine learning model.

In [None]:
num_col_noRiskM =list(filter(lambda x: x!="RISK_MM", numerical_columns))
Q1 = data[num_col_noRiskM].quantile(0.25)
Q3 = data[num_col_noRiskM].quantile(0.75)
IQR = Q3 - Q1
appended_data = []
for col in num_col_noRiskM:
    aboveOutlier=data[data[col]>(Q3[col]+1.5*IQR[col])]["RainTomorrow"]
    aboveOutlier=aboveOutlier.value_counts().rename_axis('Rain').reset_index(name='counts')
    aboveOutlier["name"]=col
    aboveOutlier["total"]=aboveOutlier["counts"].sum() 
    aboveOutlier["percentageRain"]=aboveOutlier["counts"]/aboveOutlier["total"]
    aboveOutlier["percentageTotal"]=aboveOutlier["total"]/data.shape[0]
    appended_data.append(aboveOutlier)
ResultAbove = pd.concat(appended_data)
appended_data = []
for col in num_col_noRiskM:
    belowOutlier=data[data[col]<(Q1[col]-1.5*IQR[col])]["RainTomorrow"]
    belowOutlier=belowOutlier.value_counts().rename_axis('Rain').reset_index(name='counts')
    belowOutlier["name"]=col
    belowOutlier["total"]=belowOutlier["counts"].sum() 
    belowOutlier["percentageRain"]=belowOutlier["counts"]/belowOutlier["total"]
    belowOutlier["percentageTotal"]=belowOutlier["total"]/data.shape[0]
    appended_data.append(belowOutlier)
ResultBelow = pd.concat(appended_data)

In [None]:
g = sns.FacetGrid(ResultAbove, col="name", col_wrap=4, height=4,margin_titles=True,sharey=False,sharex=False)
g.map(sns.barplot, "Rain", "percentageRain", ci=None,order=["Yes","No"]);
for ax in g.axes.ravel():
    ax.set_xticklabels(labels="")
    ax.set(xlabel='', ylabel='')
for ax in g.axes.ravel():
    ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
g.fig.suptitle('Percentage of RainTomorrow for the Higher IQR ')
g.fig.subplots_adjust(top=0.9)
plt.subplots_adjust(hspace=0.4, wspace=0.4)

These graphs show whether outliers (both upper and lower) lead to rain. I have noticed that after an extremely hot day, there is rainfall the next day. The majority of the outliers in the numerical features have led to no rainfall, except for the upper outliers of wind gust speed. Thus, we will keep the outliers of wind gust speed and remove the outliers for Rainfall, Evaporation, WindSpeed9am and WindSpeed3pm. 
I have removed RISK_MM because this feature measures the rainfall of the next day; if the rainfall the next day is greater than 0, RainTomorrow will be Yes. 


In [None]:
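# Compare only the columns that Q3 and IQR were computed for, so the labels line up
upperMask = data[num_col_noRiskM] > (Q3 + 1.5*IQR)
dataUpper = data[num_col_noRiskM].where(upperMask).describe()
dataUpper[["Rainfall", "Evaporation", "WindSpeed9am", "WindSpeed3pm"]]
# Description of the values above the upper fence (Q3 + 1.5 IQR) for these features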

We can see that Evaporation, WindSpeed9am and WindSpeed3pm have relatively few values above the upper fence (Q3 + 1.5 IQR) compared to Rainfall. We can't afford to remove 25228 of 142193 rows (about 17.7% of our data). These features will be treated later in the report, in Data Cleaning.

In [None]:
ResultBelow
g = sns.FacetGrid(ResultBelow, col="name", col_wrap=4, height=4,margin_titles=True,sharey=False,sharex=False)
g.map(sns.barplot, "Rain", "percentageRain", ci=None,order=["Yes","No"]);
for ax in g.axes.ravel():
    ax.set_xticklabels(labels="")
    ax.set(xlabel='', ylabel='')
for ax in g.axes.ravel():
    ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
g.fig.suptitle('Percentage of RainTomorrow for the Lower IQR ')
g.fig.subplots_adjust(top=0.9)
plt.subplots_adjust(hspace=0.4, wspace=0.4)

For the lower outliers, maximum temperature, pressure at 9am, pressure at 3pm and temperature at 3pm have led to a higher chance of rain tomorrow. These features should be a focus when building the machine learning models.

In [None]:
Q1 = data[numerical_columns].quantile(0.25)
Q3 = data[numerical_columns].quantile(0.75)
IQR = Q3 - Q1

 <a id="1.4132"></a>
###### Correlation

In [None]:
correlation = data[numerical_columns].corr()  # restrict to numeric columns; newer pandas raises on text columns
plt.figure(figsize=(16,12))
plt.suptitle('Correlation Heatmap of Rain in Australia Dataset', size=16, y=0.93);     
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white',linewidths=.5, center=0)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)      
plt.show()

In [None]:
corrTable = data[numerical_columns].corr().unstack().sort_values(ascending = False)
corrTable=corrTable.rename_axis(['Feature 1',"Feature 2"]).reset_index(name='Correlation')
corrTable=corrTable[corrTable["Feature 1"]!=corrTable["Feature 2"]]
TopCorr=corrTable[(corrTable["Correlation"]>0.7)|(corrTable["Correlation"]<-0.7)]
TopCorr.drop_duplicates(subset='Correlation')

The highly correlated features show that the weather is very consistent throughout the day. For example, if the pressure is high at 9am, then it tends to remain high at 3pm.
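
As an alternative to dropping duplicate correlation values, each pair can be listed once by keeping only the upper triangle of the correlation matrix. This is a sketch on made-up columns (an assumption): two pressure readings that move together and one unrelated wind speed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=500)
# Made-up columns (assumption): two readings that move together plus an unrelated one
toy = pd.DataFrame({
    "Pressure9am": base,
    "Pressure3pm": base + rng.normal(scale=0.2, size=500),
    "WindSpeed9am": rng.normal(size=500),
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = (upper.stack().dropna()
         .rename_axis(["Feature 1", "Feature 2"])
         .reset_index(name="Correlation"))
top_pairs = pairs[pairs["Correlation"].abs() > 0.7]

print(top_pairs)
```

Masking by position avoids the risk that two different pairs with the same rounded correlation are treated as duplicates.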

In [None]:
corrTable[corrTable["Feature 1"]=="RISK_MM"]

An interesting note is that there is very low correlation between RISK_MM, which is the rainfall tomorrow, and the other numerical features. The highest absolute correlations are for Humidity, Rainfall (rainfall today) and Sunshine, but they are only around 30%. So, in answering the hypothesis stated initially (evaporation plays a major role in rainfall), we find that it is not supported: Humidity at 3pm has the highest correlation at 30%, whereas Evaporation is at -4%. 
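
Ranking features by the absolute value of their correlation with the target can be sketched as follows. The data is made up (an assumption): the rain amount is loosely tied to humidity and unrelated to evaporation, only to mirror the pattern described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
humidity = rng.uniform(0, 100, n)
evaporation = rng.uniform(0, 15, n)
# Made-up target (assumption): loosely tied to humidity, unrelated to evaporation
rain_mm = np.clip(0.05 * humidity + rng.normal(scale=5, size=n), 0, None)

toy = pd.DataFrame({"Humidity3pm": humidity, "Evaporation": evaporation, "RISK_MM": rain_mm})

# Correlation of each feature with the target, ordered by absolute value
target_corr = toy.corr()["RISK_MM"].drop("RISK_MM")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)

print(ranked)
```

Sorting on the absolute value keeps strong negative correlations near the top instead of at the bottom of the list.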

Humidity is the concentration of water vapour present in the air and indicates the likelihood of precipitation, dew or fog. Humidity is usually high in the morning due to the change from overnight temperatures to daytime temperatures. However, high humidity in the afternoon could lead to rainfall the next day.

 <a id="1.4133"></a>
###### Exploration of Humidity

Let's look at the higher levels of humidity to see if we can find a stronger relationship with rainfall tomorrow.

In [None]:
f = plt.figure(figsize=(18,15))
gs = f.add_gridspec(4,2, hspace=0.2, wspace=0.2)

ax1 = f.add_subplot(gs[0, 0])
ax2 = f.add_subplot(gs[0, 1])
ax11 = f.add_subplot(gs[1, 0])
ax22 = f.add_subplot(gs[1, 1])
ax3 = f.add_subplot(gs[2, 0])
ax4 = f.add_subplot(gs[2, 1])
ax5 =  f.add_subplot(gs[3, :])

def plotRainShare(subset, ax, title):
    # Plot the share of RainTomorrow outcomes within a subset of the data
    humidity = subset["RainTomorrow"].value_counts().rename_axis('Rain').reset_index(name='counts')
    humidity["total"] = humidity["counts"].sum()
    humidity["percentage"] = humidity["counts"]/humidity["total"]
    # x and y are passed by keyword; recent seaborn releases do not accept them positionally
    sns.barplot(x=humidity["Rain"], y=humidity["percentage"], ax=ax, order=["Yes","No"])
    ax.set_title('{} - Data size - {}'.format(title, humidity["counts"].sum()), size=12, y=1.05)

# Thresholds used below: Q3 + 0.5 IQR for each humidity reading
upper3pm = Q3["Humidity3pm"]+0.5*IQR["Humidity3pm"]
upper9am = Q3["Humidity9am"]+0.5*IQR["Humidity9am"]

plotRainShare(data[data["Humidity3pm"]>upper3pm], ax1,
              'Rainfall Tomorrow when Humidity 3pm > Q3 + 0.5 IQR = {}'.format(upper3pm))
plotRainShare(data[data["Humidity9am"]>upper9am], ax2,
              'Rainfall Tomorrow when Humidity 9am > Q3 + 0.5 IQR = {}'.format(upper9am))
plotRainShare(data[data["Humidity3pm"]==100], ax11, 'Rainfall Tomorrow when Humidity 3pm = 100')
plotRainShare(data[data["Humidity9am"]==100], ax22, 'Rainfall Tomorrow when Humidity 9am = 100')
plotRainShare(data[data["Humidity9am"]<data["Humidity3pm"]], ax3,
              'Rainfall Tomorrow when Humidity increases during the day')
plotRainShare(data[data["Humidity9am"]>data["Humidity3pm"]], ax4,
              'Rainfall Tomorrow when Humidity decreases during the day')
humidRising = data[(data["Humidity3pm"]>upper3pm) & (data["Humidity9am"]<data["Humidity3pm"])]
plotRainShare(humidRising, ax5,
              'Rainfall Tomorrow when Humidity 3pm > Q3 + 0.5 IQR = {} and Humidity increases during the day'.format(upper3pm))

for ax in (ax1, ax2, ax3, ax4, ax5):
    ax.set(xlabel='', ylabel='Percentage of Occurrences')

gs.update( wspace=0.2, hspace=0.4)
f.tight_layout(pad=4.0)

If we filter the data by certain humidity criteria, we find that when Humidity3pm is above 80.5, over 70% of the 12434 days (12434/142193 = 8.7% of all observations) lead to rain the next day. Even a humidity of 100 doesn't guarantee rain the next day, and in fact a humidity of 100 at 9am has a lower chance of rain the next day. A decrease in humidity during the day (humidity at 9am greater than humidity at 3pm) leads to no rain the next day in almost 80% of 118880 observations (118880/142193 = 83.6% of all observations). 

Our model should try to take into account the relationship between high Humidity3pm and rainfall the next day.
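
The comparison behind these percentages is a conditional rate: the share of rainy next days within a filtered subset versus the share overall. A minimal sketch on made-up rows (an assumption), reusing the 80.5 threshold reported above:

```python
import pandas as pd

# Made-up observations (assumption); only the 80.5 threshold comes from the analysis above
toy = pd.DataFrame({
    "Humidity3pm": [30, 45, 55, 60, 72, 85, 90, 95, 88, 40],
    "RainTomorrow": ["No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No"],
})

threshold = 80.5
is_rain = toy["RainTomorrow"] == "Yes"

# Share of rainy next days overall and among the high-humidity days only
overall_rate = is_rain.mean()
humid_rate = is_rain[toy["Humidity3pm"] > threshold].mean()

print("overall: {:.0%}, high humidity: {:.0%}".format(overall_rate, humid_rate))
```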

<a id="2"></a>
## Data Cleaning


As stated in the data description, we need to remove the RISK_MM column. According to the column description, RISK_MM is 'the amount of next day rain in mm. Used to create response variable RainTomorrow. A kind of measure of the "risk"'. As a result, it contains information about our target (RainTomorrow) and would bias the model.
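
Why this column leaks the target can be sketched with made-up rows. The 1 mm cut-off used here is an assumption for illustration; the point is only that the target is a direct function of RISK_MM.

```python
import pandas as pd

# Made-up rows (assumption); the 1 mm cut-off is also an assumption
toy = pd.DataFrame({"RISK_MM": [0.0, 0.2, 1.4, 0.0, 12.6, 3.0]})
toy["RainTomorrow"] = (toy["RISK_MM"] >= 1.0).map({True: "Yes", False: "No"})

# A rule that only looks at RISK_MM reproduces the target exactly
leaky_prediction = (toy["RISK_MM"] >= 1.0).map({True: "Yes", False: "No"})
leaky_accuracy = (leaky_prediction == toy["RainTomorrow"]).mean()

print(leaky_accuracy)
```

A model given this column would learn the rule instead of the weather, and the score would not carry over to real forecasting, where tomorrow's rainfall is unknown.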

In [None]:
data.drop(['RISK_MM'], axis=1, inplace=True)
numerical_columns.remove("RISK_MM")

<a id="2.1"></a>
### Treatment of Outliers

I will keep most of the outliers because of the relationship between the outliers in Humidity3pm and RainTomorrow, and because removing all outliers would remove over 32% of the data. 
We will, however, remove the upper outliers from Evaporation, WindSpeed9am and WindSpeed3pm, since their maximum values are significantly greater than the 75th percentile and values above the upper fence (Q3 + 1.5 IQR) do not lead to a higher chance of rain the next day. We will not drop the lower outliers because the distributions of the numerical features are positively skewed.


In [None]:
data = data[~(data["Evaporation"] > (Q3["Evaporation"] + 1.5 * IQR["Evaporation"]))]
data = data[~(data["WindSpeed9am"] > (Q3["WindSpeed9am"] + 1.5 * IQR["WindSpeed9am"]))]
data = data[~(data["WindSpeed3pm"] > (Q3["WindSpeed3pm"] + 1.5 * IQR["WindSpeed3pm"]))]

In [None]:
data.describe()

As we can see, the ratios of the maximum value to the 75th percentile for Evaporation, WindSpeed9am and WindSpeed3pm have dropped from 19.59, 6.84 and 3.65 to 2.05, 1.94 and 1.625 respectively. We have removed around 4% of the data (1 - 136735/142193) due to outliers. 
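
One detail of the filter above is worth spelling out: negating a "greater than" comparison keeps rows where the value is missing, whereas filtering with "less than or equal" would drop them. A sketch on made-up values (an assumption):

```python
import numpy as np
import pandas as pd

# Made-up values (assumption) with one extreme reading and one missing reading
toy = pd.DataFrame({"Evaporation": [2.0, 4.0, 5.0, 6.0, np.nan, 60.0]})

q1 = toy["Evaporation"].quantile(0.25)
q3 = toy["Evaporation"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# NaN > fence evaluates to False, so negating the comparison keeps the missing row
kept_negated = toy[~(toy["Evaporation"] > upper_fence)]
# Filtering with <= instead would silently drop the missing row as well
kept_leq = toy[toy["Evaporation"] <= upper_fence]

print(len(kept_negated), len(kept_leq))
```

This matters here because Evaporation has many missing values, which are meant to be handled by imputation rather than removed with the outliers.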
#

In [None]:
appended_data = []
for feature in numerical_columns:
    name = pd.DataFrame()
    data["binned"]=pd.cut(data[feature], 10)
    name[feature]=data["binned"]
    data.drop(["binned"], axis=1, inplace=True)
    name=name[feature].value_counts().rename_axis('numerical').reset_index(name='count')
    name["name"]=feature
    appended_data.append(name)
num_data = pd.concat(appended_data)
g = sns.FacetGrid(num_data, col="name", col_wrap=4, height=4,margin_titles=True,sharey=False,sharex=True)
g.map(sns.barplot, "numerical", "count", ci=None,order=None);
for ax in g.axes.ravel():
    ax.set_xticklabels(labels="")
    ax.set(xlabel='', ylabel='')
plt.subplots_adjust(hspace=0.4, wspace=0.4)

<a id="2.2"></a>
### Treatment of Missing Data
 
We will assume the data is MCAR (Missing Completely at Random), which means that whether a value is missing has nothing to do with its hypothetical value or with the values of other variables.


<a id="2.21"></a>
#### Numerical Features

In [None]:
def missingValues (data):
    numerical_columns = [col for col in data.columns if data[col].dtype!="object" ]
    num_features = list(numerical_columns)
    num_features_miss = data[num_features].isnull().sum()
    f,  ax = plt.subplots(nrows=1,ncols=2,figsize=(15,8)) 
    sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='viridis', ax=ax[0])
    # keyword arguments: newer seaborn versions no longer accept x and y positionally
    sns.barplot(x=data.isnull().sum().index, y=data.isnull().sum().values/len(data), ax=ax[1])
    ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=70)
    ax[1].set( ylabel='Percentage of Occurrences')
    ax[1].set_xticklabels(ax[1].get_xticklabels(),rotation=75)
    ax[1].set_ylim(0,1)             
    plt.suptitle('Missing Values in the Data', size=16, y=0.93)
    plt.tight_layout(rect=[0, 0.03, 1, 0.90])

In [None]:
missingValues(data)

The features with the most missing values are Evaporation, Sunshine, Cloud9am and Cloud3pm. These features are missing over 40% of their values, so we can neither drop the columns nor drop the affected rows without losing too much data. We will use imputation to solve the missing data issue.
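The share of missing values per column, which the bar chart above displays, can also be read off directly. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "Sunshine": [np.nan, 7.5, np.nan, 9.1],
    "MaxTemp": [22.0, 25.1, np.nan, 19.4],
    "Rainfall": [0.0, 1.2, 0.0, 0.4],
})

# isnull() gives a boolean frame; the column mean is the fraction missing.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```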

In [None]:

def position (length,nLargest):
    numFeatures = int(length/nLargest)
    pos = np.arange(1,length/numFeatures+1) *0.4
    res = np.empty(shape=[0,0])
    for i in range(0,int(length/nLargest)*2,2):
        res=np.append(res,pos+i)
    return res
def corrT (arr):
    corrTable = data.corr().unstack().sort_values(ascending = False)
    corrTable=corrTable.rename_axis(['Feature 1',"Feature 2"]).reset_index(name='Correlation')
    corrTable["CorrelationAbs"]=abs(corrTable["Correlation"])
    corrTable=corrTable[corrTable["Feature 1"]!=corrTable["Feature 2"]]
    corrTable=corrTable[corrTable["Feature 1"].isin(arr)]
    corrTable=corrTable[~corrTable["Feature 2"].isin(arr)]
    corrTable=corrTable.loc[corrTable.groupby('Feature 1')['CorrelationAbs'].nlargest(3).index.get_level_values(1)]
    length = len(corrTable)
    pos = position(length,3)
    fig, ax=plt.subplots(figsize=(16,5))
    uelec, uind = np.unique(corrTable["Feature 2"], return_inverse=1)
    cmap = plt.cm.get_cmap("Set1")
    ax.bar(pos, corrTable["CorrelationAbs"], width=0.4, align="edge", ec="k", color=cmap(uind)  )
    handles=[plt.Rectangle((0,0),1,1, color=cmap(i), ec="k") for i in range(len(uelec))]
    ax.legend(handles=handles, labels=list(uelec),
               prop ={'size':10}, loc=9, ncol=8, 
                title=r'Feature 2')
    ax.set_xticks([x*2+1 for x in range(length//3)])
    ax.set_xticklabels(corrTable["Feature 1"].unique())
    ax.set_ylim(0, 1)
    plt.show()

In [None]:
arr = ['Evaporation',"Sunshine","Cloud9am","Cloud3pm"]
corrT(arr)

We look for the features that correlate with the 4 features that have the most missing data, and plot the top 3 correlated features for each of them. For each feature with missing data, we bin (split into categories) both the feature and its most correlated feature, in order to determine the median value of the feature with missing data given the value of the correlated feature.
For example, Cloud3pm (the feature with missing data) is most correlated with Humidity3pm. If Humidity3pm is in the range 80 to 100, we want to know the median bin of Cloud3pm, so that we can replace any missing Cloud3pm value whose Humidity3pm lies in the range 80 to 100.
We will apply this method to the 4 features.
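The idea can be sketched in a few lines. This is a simplified variant, not the implementation used in this notebook: it fills each gap with the exact median of the feature within the bin of the correlated feature, rather than with the midpoint of the median bin. The helper name `impute_by_binned_median`, the bin edges and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd

def impute_by_binned_median(df, missing_col, corr_col, bins):
    """Fill NaNs in `missing_col` with the median observed in the same bin of `corr_col`."""
    corr_bin = pd.cut(df[corr_col], bins)
    # transform() broadcasts each bin's median back onto the rows of that bin.
    bin_median = df.groupby(corr_bin, observed=True)[missing_col].transform("median")
    return df[missing_col].fillna(bin_median)

toy = pd.DataFrame({
    "Humidity3pm": [85.0, 90.0, 95.0, 20.0, 25.0, 30.0],
    "Cloud3pm": [7.0, 8.0, np.nan, 1.0, np.nan, 3.0],
})
filled = impute_by_binned_median(toy, "Cloud3pm", "Humidity3pm", bins=[0, 50, 100])
print(filled.tolist())
```

Values that are already present are left untouched; only the gaps take the median of their humidity bin.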

In [None]:
class medianBins():
    def __init__(self,data, missingFeature, correlatedFeature,binSize):
        self.data = data
        self.missingFeature = missingFeature
        self.correlatedFeature = correlatedFeature
        self.binSize = binSize
        print (self)
        
    def binD (self):
        binData = pd.DataFrame()
        binData[self.correlatedFeature]= self.data[self.correlatedFeature]
        # use the frame passed to the constructor, not the global `data`
        binData[self.missingFeature]= self.data[self.missingFeature]
        binData["HCorr"]= pd.cut(binData[self.correlatedFeature], self.binSize)
        binData["HMissing"]= pd.cut(binData[self.missingFeature], self.binSize)
        binData=binData.dropna(subset=["HCorr","HMissing"])
        binData = binData.groupby(["HCorr","HMissing"])[self.missingFeature].count().rename_axis(["HCorr","HMissing"]).reset_index(name='Count')
        # running total of the counts within each bin of the correlated feature
        binData["cummulative"]=binData.groupby(["HCorr"])["Count"].cumsum()
        binData["bin_centres"] = binData["HMissing"].apply(lambda x: x.mid)
        return binData
    
    def median(self):
        binData= self.binD()
        median=binData.groupby(["HCorr"])["Count"].agg("sum")/2
        median=median.rename_axis(["HCorr"]).reset_index(name='median')
        binData=pd.merge(binData,median,on="HCorr")
        median=binData[binData["cummulative"]>binData["median"]]
        median =  median.loc[median.groupby(["HCorr"])['cummulative'].idxmin()]
        median=median.drop(columns=["Count","cummulative","median"])
        median["bin_centres"] = median["HMissing"].apply(lambda x: x.mid)
        return median
    
    def graph(self):
        binData = self.binD()
        pivot = binData.pivot(index="HCorr",columns="HMissing",values="Count")
        ax = pivot.plot(kind='bar', stacked=True, figsize=(18.5, 7))
        ax.set( ylabel='Count' , xlabel=self.correlatedFeature)
        ax.legend(title=self.missingFeature)
        ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
        plt.title("Cumulative Count of the Combination {} and {} with median lines".format(self.missingFeature,self.correlatedFeature))
        return ax

impCloud3pm=medianBins(data,"Cloud3pm","Humidity3pm",10)
impCloud9am=medianBins(data,"Cloud9am","Humidity3pm",10)
impEvaporation=medianBins(data,"Evaporation","MaxTemp",10)
impSunshine=medianBins(data,"Sunshine","Humidity3pm",10)

In [None]:
impCloud3pm.graph()

In [None]:
impCloud9am.graph()

In [None]:
impEvaporation.graph()

When MaxTemp is below 5.62, we do not have any Evaporation data. As a result, the imputation algorithm will return NaN for Evaporation when MaxTemp is below 5.62.

In [None]:
impSunshine.graph()

These graphs show the combination of the missing feature and its most correlated feature in bins. The imputation algorithm replaces a "nan" in the missing feature by looking at the value of its most correlated feature and selecting the mid value of the bin that contains the median. If both the correlated and the missing feature are "nan", the algorithm returns "nan"; if the missing feature already has a value, that value is returned unchanged.
For example, Sunshine is the missing feature and Humidity3pm is its most correlated feature. If Humidity3pm has a value of 54 and Sunshine is "nan", the algorithm selects the Sunshine bin 7.25-8.7, because the median count for the Humidity3pm bin 50-60 (which contains 54) is about 7000 and falls in that bin. As a result, the algorithm replaces the "nan" with 7.975 ((7.25+8.7)/2).
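The median-bin lookup itself is simple arithmetic. The sketch below uses invented bin edges and counts for a single Humidity3pm bin; it walks the cumulative counts until they exceed half of the total, then returns the midpoint of that bin. The function name `median_bin_midpoint` is hypothetical.

```python
def median_bin_midpoint(bin_edges, counts):
    """Return the midpoint of the first bin whose cumulative count exceeds half the total."""
    half = sum(counts) / 2
    cumulative = 0
    for (left, right), count in zip(bin_edges, counts):
        cumulative += count
        if cumulative > half:
            return (left + right) / 2
    return float("nan")  # no observations in any bin

# Hypothetical Sunshine bins and counts for days with Humidity3pm in (50, 60].
edges = [(4.35, 5.8), (5.8, 7.25), (7.25, 8.7), (8.7, 10.15)]
counts = [2000, 3500, 4500, 4000]
midpoint = median_bin_midpoint(edges, counts)
print(midpoint)
```

With these counts the total is 14000, so the median sits at 7000 and is first exceeded in the bin 7.25-8.7.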

In [None]:
def imputation(cols,binD):
    """Return the value for the missing feature, imputing it from the correlated feature's bin."""
    # cols holds one row: the missing feature first, its correlated feature second
    missingFeature = cols.iloc[0]
    corrFeature = cols.iloc[1]
    if pd.isnull(missingFeature) and pd.notna(corrFeature):
        # keep only the rows of the bin table whose correlated-feature bin contains the value
        temp=binD[binD["HCorr"].apply(lambda x:x.left<corrFeature<=x.right)].copy()
        temp["median"]=temp["Count"].sum()/2
        temp=temp[temp["cummulative"]>temp["median"]]
        try:
            # first bin whose cumulative count passes the median
            result=temp.iloc[0, temp.columns.get_loc('bin_centres')]
        except IndexError:
            # no bin matched, so there is nothing to impute from
            return float('nan')
        return result
    elif pd.isnull(missingFeature) and pd.isnull(corrFeature):
        return float('nan')
    else:
        # the feature already has a value: keep it
        return missingFeature

In [None]:



"""
data["Evaporation"]=data[['Evaporation','MaxTemp']].apply(imputation,binD=impEvaporation.binD(),axis=1)
data["Cloud3pm"]=data[['Cloud3pm','Humidity3pm']].apply(imputation,binD=impCloud3pm.binD(),axis=1)
data["Cloud9am"]=data[['Cloud9am','Humidity3pm']].apply(imputation,binD=impCloud9am.binD(),axis=1)
data["Sunshine"]=data[['Sunshine','Humidity3pm']].apply(imputation,binD=impSunshine.binD(),axis=1)
data.to_csv('dataImputation.csv',index=False)
print("done")
"""

#saved to datacleanv1

In [None]:
data1 = pd.read_csv(r"../input/datacleanv1/dataImputation.csv")

In [None]:
missingValues(data1)

We have drastically reduced the missing values in Evaporation, Sunshine, Cloud9am and Cloud3pm. We will now look at WindGustSpeed, Pressure9am and Pressure3pm and see if we can use the same imputation algorithm to treat their missing values.

In [None]:
corrT(["WindGustSpeed", "Pressure9am", "Pressure3pm"])

In [None]:
impWindGustSpeed=medianBins(data1,"WindGustSpeed","WindSpeed3pm",10)
impPressure9am=medianBins(data1,"Pressure9am","MinTemp",10)
impPressure3pm=medianBins(data1,"Pressure3pm","Temp9am",10)

In [None]:

"""
data1["WindGustSpeed"]=data1[['WindGustSpeed','WindSpeed3pm']].apply(imputation,binD=impWindGustSpeed.binD(),axis=1)
data1["Pressure9am"]=data1[['Pressure9am','MinTemp']].apply(imputation,binD=impPressure9am.binD(),axis=1)
data1["Pressure3pm"]=data1[['Pressure3pm','Temp9am']].apply(imputation,binD=impPressure3pm.binD(),axis=1)
data1.to_csv('dataImputation.csv',index=False)
print("done")
"""

#saved to datacleanv2

In [None]:
data2 = pd.read_csv(r"../input/datacleanv2/dataImputation.csv")

In [None]:
missingValues(data2)

After rerunning the imputation algorithm on WindGustSpeed, Pressure9am and Pressure3pm, the missing values of the numerical features have dropped to under 5%. We could rerun the algorithm until there are no missing values, but it is computationally expensive.

We will simply replace the remaining missing values with the median of their feature.

In [None]:
for col in numerical_columns:
    # assign back instead of using inplace on a column selection
    data2[col] = data2[col].fillna(data2[col].median())

<a id="2.22"></a>
#### Categorical Features

Among the categorical features, WindGustDir, WindDir9am and WindDir3pm have over 5% missing data. We will replace all missing values with the mode of their column.

In [None]:

for col in category_columns:
    # assign back instead of using inplace on a column selection
    data2[col] = data2[col].fillna(data2[col].mode()[0])

In [None]:
missingValues(data2)

As a result, we have no missing data

<a id="3"></a>
## Data Preprocessing

<a id="3.1"></a>
### Target and Features

In [None]:
X = data2.drop(['RainTomorrow'], axis=1)
y = data2['RainTomorrow']
y=y.map(dict(Yes=1, No=0))

<a id="3.2"></a>
### Imbalance of the Target

In [None]:
balanceTarget(data2["RainTomorrow"])

In [None]:
data2["RainTomorrow"].value_counts()

As stated above, there is a big imbalance in the target, which can cause inaccurate and decreased predictive performance in many classification algorithms. Most classification algorithms, such as logistic regression, Naive Bayes and decision trees, output a probability for an instance belonging to the positive class: Pr(y=1|x). Thus, if we have over 70% of the data in one class (106512/(106512+30223) ≈ 0.78), it can cause problems.

We will stratify our training and validation data. Stratification is the technique of allocating the samples evenly based on their classes, so that the training set and the validation set have a similar ratio of classes. It is essential to ensure your training and validation sets share approximately the same ratio of examples from each class, so that you can achieve consistent predictive performance scores in both sets.
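A small sketch shows what stratification buys us. The toy target below is made up (an 80/20 split chosen for round numbers); `train_test_split` with `stratify` is the same call used later in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 160 "no rain" (0) and 40 "rain" (1) samples.
y_toy = [0] * 160 + [1] * 40
X_toy = [[i] for i in range(len(y_toy))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Both splits keep the original 80/20 class ratio.
print(Counter(y_tr), Counter(y_te))
```

Without `stratify`, the share of rainy days in the test set would vary from one random split to the next.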

We will also resample to get more balanced data. There are 3 techniques:
*     Down-sampling (Under sampling) the majority class
*     Up-sampling (Over sampling) the minority class
*     Advanced sampling techniques, such as Synthetic Minority Over-sampling Technique (SMOTE)

Down-sampling means we lose data, while up-sampling may increase the likelihood of overfitting since it replicates minority class events.
SMOTE is an oversampling method which creates “synthetic” examples rather than oversampling with replacement. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. - https://www.datacamp.com/community/tutorials/diving-deep-imbalanced-data
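The interpolation step can be illustrated with NumPy. This is a simplified sketch of the idea, not the `imblearn` implementation: the function name `smote_like_samples`, the brute-force neighbour search and the toy points are all assumptions made for illustration.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create synthetic points on segments between minority points and their k nearest neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise distances; a point is never its own neighbour.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]]
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points.shape)
```

Because every synthetic point is a convex combination of two real minority points, it always lies between them rather than duplicating either one.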
 

<a id="3.3"></a>
### Split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0,stratify = y)

<a id="3.4"></a>
### Encode categorical features

Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. This is required for both input and output variables that are categorical.

A Label Encoder converts each string value to a whole number. The problem is that, since there are different numbers in the same column, the model may misinterpret the data as being in some kind of order, 0 < 1 < 2.

For example, if countries were label encoded, the model might derive a correlation such as "as the country number increases, the population increases", which clearly may not hold in other data or in the prediction set. To overcome this problem, we use a One Hot Encoder, which represents each categorical variable as binary vectors, one per category, with no implied order.

https://towardsdatascience.com/choosing-the-right-encoding-method-label-vs-onehot-encoder-a4434493149b

The one hot encoding needs to be applied to both the training and testing sets.
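The difference between the two encodings, and the need to treat the training and testing sets consistently, can be seen on a toy column. The values are invented; aligning the test columns with `reindex` is one possible approach, shown here as an assumption rather than as the method this notebook relies on.

```python
import pandas as pd

train = pd.DataFrame({"WindGustDir": ["N", "E", "S", "N"]})
test = pd.DataFrame({"WindGustDir": ["E", "W"]})  # "W" never appears in train

# Label encoding: one integer per category (alphabetical here), which implies an order.
label_codes = train["WindGustDir"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, no implied order.
train_ohe = pd.get_dummies(train, columns=["WindGustDir"])
test_ohe = pd.get_dummies(test, columns=["WindGustDir"])

# Align test to train: unseen categories are dropped, absent ones are filled with 0.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(label_codes.tolist())
print(list(train_ohe.columns))
```

After the alignment both frames expose exactly the same columns, so a model fitted on the training set can be applied to the test set.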

<a id="3.41"></a>
#### Binary Categorical Features

RainToday is binary since it has results of Yes and No.

In [None]:
import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['RainToday'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

In [None]:
X_train.columns

<a id="3.42"></a>
#### Date
 
We will extract the year, month and day from the dates

In [None]:
X_train['Date']= pd.to_datetime(X_train['Date']) 
X_train['Year'] = X_train['Date'].dt.year
X_train['Month'] = X_train['Date'].dt.month
X_train['Day'] = X_train['Date'].dt.day

X_test['Date']= pd.to_datetime(X_test['Date']) 
X_test['Year'] = X_test['Date'].dt.year
X_test['Month'] = X_test['Date'].dt.month
X_test['Day'] = X_test['Date'].dt.day

#Dropping the date column
X_train.drop('Date', axis=1, inplace = True)
X_test.drop('Date', axis=1, inplace = True)

In [None]:
X_train.columns

<a id="3.43"></a>
#### Other Categorical Features
We will encode the rest of the categorical features

In [None]:
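X_train = pd.get_dummies(X_train, columns=['Location','WindGustDir','WindDir9am','WindDir3pm'])

X_test = pd.get_dummies(X_test, columns=['Location','WindGustDir','WindDir9am','WindDir3pm'])
# make sure the test set has exactly the same dummy columns, in the same order, as the training set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)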

In [None]:
X_train.columns

In [None]:
X_train.head()

<a id="3.5"></a>
### Over sampling

In [0]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())

<a id="3.6"></a>
### Feature Scaling
 
Feature scaling normalises the data within a particular range. While many algorithms (such as SVM, K-nearest neighbors, and logistic regression) require features to be normalized, intuitively we can think of Principal Component Analysis (PCA) as being a prime example of when normalization is important. In PCA we are interested in the components that maximize the variance, so features on larger scales would dominate if left unscaled.
Sometimes, scaling also helps to speed up the calculations in an algorithm.

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html

**Min-Max Scaler**

 MinMaxScaler is probably the most famous scaling algorithm, and follows this formula for each feature:

![](https://miro.medium.com/max/521/0*K2QwZ16bEAxA4hUe.jpg)

It essentially shrinks the range such that it is now between 0 and 1 by default (another range, such as -1 to 1, can be set through `feature_range`).
This scaler works better for cases in which the standard scaler might not work so well. If the distribution is not Gaussian or the standard deviation is very small, the min-max scaler works better.
However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider the Robust Scaler.

**RobustScaler**

The RobustScaler uses a similar method to the Min-Max scaler, but it uses the interquartile range rather than the min-max, so that it is robust to outliers. The cited article gives the following formula for each feature:

$$\frac{x_i - Q_1(x)}{Q_3(x) - Q_1(x)}$$

Note that scikit-learn's RobustScaler by default subtracts the median rather than Q1 before dividing by the interquartile range.
Of course, this means it uses less of the data for scaling, so it’s more suitable when there are outliers in the data.

https://medium.com/analytics-vidhya/feature-scaling-in-scikit-learn-b11209d949e7

Since we do have outliers, we will use the RobustScaler.
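The effect of an outlier on the two scalers can be checked by hand with NumPy. The sketch below uses made-up values and follows scikit-learn's default behaviour for `RobustScaler`, which centres on the median and divides by the interquartile range.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # the last value is an outlier

# Min-max scaling: (x - min) / (max - min); the outlier squeezes the rest near 0.
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: (x - median) / (Q3 - Q1); the outlier barely affects the other values.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(min_max.round(3))
print(robust)
```

With min-max scaling the four ordinary values are all compressed below 0.05, while robust scaling keeps them evenly spread between -1 and 0.5.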

In [None]:
allColumns = X_train.columns

from sklearn import preprocessing
scaler = preprocessing.RobustScaler()

# Fit the scaler once on the original training data and reuse it everywhere,
# so that the resampled training set and the test set share the same scaling.
scaler.fit(X_train)

# Pass the column Index directly; wrapping it in a list would create a MultiIndex.
X_train_res = pd.DataFrame(scaler.transform(X_train_res), columns=allColumns)
X_train = pd.DataFrame(scaler.transform(X_train), columns=allColumns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=allColumns)

In [None]:
X_train_res = pd.read_csv(r"../input/traintest/X_train_res.csv", index_col=0)
X_train = pd.read_csv(r"../input/traintest/X_train.csv", index_col=0)
X_test = pd.read_csv(r"../input/traintest/X_test.csv", index_col=0)
y_train_res = pd.read_csv(r"../input/traintest/y_train_res.csv")
y_train = pd.read_csv(r"../input/traintest/y_train.csv", index_col=0)
y_test = pd.read_csv(r"../input/traintest/y_test.csv", index_col=0)

y_train_res = np.array(y_train_res["0"])




<a id="4"></a>
## Training the Model

<a id="4.1"></a>
### Performance Metrics

Because of the class imbalance, we will not focus on accuracy as a measure of performance: if our majority class was 85% of the data, a model that always predicted the majority class would have an accuracy of 85%. This is known as the Accuracy Paradox. We need to look at the True Positives, True Negatives, False Positives and False Negatives of the prediction.

 *     True Positive (TP) – An instance that is positive and is classified correctly as positive
 *     True Negative (TN) – An instance that is negative and is classified correctly as negative
 *     False Positive (FP) – An instance that is negative but is classified wrongly as positive
 *     False Negative (FN) – An instance that is positive but is classified incorrectly as negative
 
A confusion matrix represents these concepts visually.
 
A classification report introduces Precision and Recall
 
Precision can be thought of as a measure of a classifier's exactness. A low precision can also indicate a large number of False Positives.
Recall can be thought of as a measure of a classifier's completeness. A low recall indicates many False Negatives.
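These definitions, and the Accuracy Paradox, can be made concrete with plain Python. The labels below are invented: 15 rainy days out of 100, scored against a "model" that always predicts no rain and a hypothetical second model that catches 10 of the rainy days at the cost of 5 false alarms.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1] * 15 + [0] * 85

# Always predicting "no rain": high accuracy, but no rainy day is ever found.
always_no = [0] * 100
tp, tn, fp, fn = confusion_counts(y_true, always_no)
accuracy_no = (tp + tn) / len(y_true)
recall_no = tp / (tp + fn)

# Hypothetical model: finds 10 of 15 rainy days, raises 5 false alarms.
some_model = [1] * 10 + [0] * 5 + [1] * 5 + [0] * 80
tp2, tn2, fp2, fn2 = confusion_counts(y_true, some_model)
precision2 = tp2 / (tp2 + fp2)
recall2 = tp2 / (tp2 + fn2)

print(accuracy_no, recall_no)
print(precision2, recall2)
```

The always-no model reaches 85% accuracy with a recall of 0 for rain, which is why accuracy alone is not a useful yardstick here.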
 
Finally, we will also look at the AUC ROC (area under the receiver operating characteristic curve).
 
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with disease and no disease.
 
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
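One way to build intuition for the AUC is its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force on four made-up scores; the function name `auc_by_ranking` is hypothetical, and in practice `roc_auc_score` does this work.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = auc_by_ranking(y_true_demo, scores_demo)
print(auc_demo)
```

Three of the four positive-negative pairs are ranked correctly here, so the AUC is 0.75; a model that ranks at random would score about 0.5.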

Our focus in finding a suitable model will be identifying which model has the highest F1 for Rain Tomorrow. The F1 score combines two questions: if the model predicts rain, what is the chance of it really raining tomorrow (precision), and if it does rain, what is the chance that the model predicted rain (recall)?

We will select the model that has the highest F1 for Rain Tomorrow.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score

f2_score = make_scorer(fbeta_score, beta=2, pos_label=1)

f2_score is a custom scorer for the grid search: the F-beta score with beta=2 for label 1 (rain tomorrow). Unlike the F1 score, it weights recall more heavily than precision.
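The scorer defined above uses `fbeta_score` with `beta=2`. The F-beta score is defined as (1 + beta²) · precision · recall / (beta² · precision + recall), so beta=1 gives the familiar F1 and beta=2 puts more weight on recall. A minimal sketch with made-up precision and recall values:

```python
def f_beta(precision, recall, beta):
    """F-beta score from precision and recall; returns 0.0 when both are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with modest precision but high recall.
p, r = 0.5, 0.8
f1 = f_beta(p, r, beta=1)
f2 = f_beta(p, r, beta=2)
print(round(f1, 3), round(f2, 3))
```

For a model whose recall exceeds its precision, F2 comes out higher than F1, so a grid search scored on F2 will tend to favour parameter settings with higher recall.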

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import sklearn.metrics as metrics
from matplotlib import pyplot
from sklearn.metrics import roc_curve,roc_auc_score

def results(y_test, y_pred_test,y_pred_proba,model,ROC):

    print("Classification Report")
    print("\n")
    print("1- Rain Tomorrow, 0 - No Rain Tomorrow")
    print("\n")
    print(classification_report(y_test, y_pred_test))
    
    # confusion_matrix returns actual classes as rows and predicted classes as columns
    cm = confusion_matrix(y_test, y_pred_test)
    cm_df = pd.DataFrame(data=cm, columns=['Predict No Rain', 'Predict Rain'], 
                                     index=['Actual No Rain', 'Actual Rain'])
    plt.figure(figsize=(6,6))
    plt.suptitle('Confusion matrix', size=16, y=1.0);     
    ax=sns.heatmap(cm_df, square=True, annot=True, fmt='d', cbar=True)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
    ax.set_yticklabels(ax.get_yticklabels(), rotation=0)      
    plt.show()
    
    if ROC:
        # the AUC is a value between 0 and 1, not a percentage
        auc = roc_auc_score(y_test, y_pred_proba)
        print("{} : ROC AUC = {}".format(model,round(auc, 3)))
        fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
        pyplot.plot(fpr, tpr, marker='.', label=model)
        pyplot.plot([0,1], [0,1], 'k--' )

        pyplot.xlabel('False Positive Rate')
        pyplot.ylabel('True Positive Rate')
        pyplot.legend()
        pyplot.show()
    print("\n")
    print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred_test)))

<a id="4.2"></a>
### Grid Search

We will perform a grid search on both the logistic regression and the random forest. I ran both grid searches and saved the results to a pickle, which saves computation time when committing.
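The fit-once, pickle, reload pattern can be sketched on a small synthetic problem. Everything below is illustrative: the data comes from `make_classification`, the parameter grid is deliberately tiny, and the built-in `"f1"` scorer stands in for the custom scorer defined above.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic problem standing in for the rain data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="f1",
)
search.fit(X_demo, y_demo)

# Serialise the fitted search and restore it, as done with the .sav files.
restored = pickle.loads(pickle.dumps(search))
print(restored.best_params_)
```

The restored object predicts exactly like the original one. Pickled estimators are generally only safe to reload with the same scikit-learn version that produced them.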

<a id="4.21"></a>
#### Logistic Regression

In [None]:
"""
# Ran the modelling and then saved it to a pickle

import pickle
from sklearn.linear_model import LogisticRegression
parameters = {'solver':['liblinear'], 'C':[100,10000],"penalty":["l1","l2"],
              "class_weight":[None,"balanced"]}
gsc=GridSearchCV(estimator=LogisticRegression(),
             param_grid=parameters,cv=5, scoring=f2_score, verbose=0, n_jobs=-1)
gr_log_bal = gsc.fit(X_train_res, y_train_res)

filename = 'logReg.sav'
pickle.dump(gr_log_bal, open(filename, 'wb'))
# Save to pickle
"""

In [None]:
import pickle
logReg = pickle.load(open("../input/results/logReg.sav", 'rb'))
print(logReg.param_grid)
print("\n")
y_pred_test = logReg.predict(X_test)
y_pred1 = logReg.predict_proba(X_test)[:, 1]
results(y_test, y_pred_test, y_pred1,"Logistic Regression Balance",True)


<a id="4.22"></a>
#### Random Forest Classifier

In [None]:
"""
from sklearn.ensemble import RandomForestClassifier

parameters = { 
    'n_estimators': [200, 500],
    'max_features': ['auto'],
    'max_depth' : [4,8],
    'criterion' :['gini', 'entropy'],
    "class_weight":[None,"balanced"],
    "oob_score":[True]
}
gsc=GridSearchCV(estimator=RandomForestClassifier(),
             param_grid=parameters,cv=5, scoring=f2_score, verbose=0, n_jobs=-1)

grid_result_Rand_bal = gsc.fit(X_train_res, y_train_res)

filename = 'RandForest.sav'
pickle.dump(grid_result_Rand_bal, open(filename, 'wb'))
# Save to pickle
"""

In [None]:
import pickle
randForest = pickle.load(open("../input/results/RandForest.sav", 'rb'))
print(randForest.param_grid)
print("\n")
y_pred_test = randForest.predict(X_test)
y_pred1 = randForest.predict_proba(X_test)[:, 1]
results(y_test, y_pred_test, y_pred1,"Random Forest Balance",True)

<a id="5"></a>
## Conclusion

In [None]:
y.value_counts()/y.value_counts().sum()*100

The random forest using the resampled data has lower F1 scores than the logistic regression. However, it is better than the logistic regression at predicting rain when it actually rains (higher recall for rain).

The original goal of the model, as stated above, was to predict rain so that farmers could better plan their harvest and take advantage of the rain. A model with a high recall for rain, such as the random forest, is helpful in that it misses few rainy days, but farmers cannot rely heavily on each prediction of rain, since it predicts rain more often than it actually rains (low precision).

A model with a higher F1 score (the harmonic mean of precision and recall) is preferable, since the farmer can both rely on the prediction and prepare for most of the rain.

As a result, the logistic regression is the better model, with an overall F1 of 64%. In addition, its accuracy of 82.74% is better than the roughly 77.9% a model would reach by simply always predicting no rain. However, it should be noted that the accuracy is NOT as important as the F1 score due to the Accuracy Paradox.
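The baseline used in this comparison follows directly from the class counts reported in the section on the imbalance of the target. A quick check, using only figures quoted in this notebook:

```python
# Class counts reported for the cleaned data.
no_rain, rain = 106512, 30223

# Accuracy of a model that always predicts "No".
baseline_accuracy = no_rain / (no_rain + rain)
model_accuracy = 0.8274  # logistic regression accuracy quoted above

print(round(baseline_accuracy * 100, 1))
print(model_accuracy > baseline_accuracy)
```

The logistic regression beats the always-no baseline by about five percentage points of accuracy, while also finding rainy days that the baseline never would.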