![Australia](https://theplanningmaster.com/wp-content/uploads/2021/03/Australia-map.png)

Predicting rain has always been an important problem. The Australian Bureau of Meteorology collects extensive weather data, which is published on [Kaggle](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package).

 The goal is to **explore today's data and predict if tomorrow there will be rain or not.**


In this notebook, we enriched the data and added many new features based on geography and climate. In the EDA, we explored many features and their effect on rain the next day.

This notebook was created by:


*   Gal Merom
*   Hanoch Gendelman




#  Notice: we don't use 'RISK_MM' for predictions.
**'RISK_MM' is the amount of rain that will fall tomorrow, so we treat it as part of the target and exclude it from the features.**

# Imports

Import the data science tools, which are used for creating charts.

In [None]:
import sys
import shutil

shutil.rmtree('Data-Science-Tools',ignore_errors=True)

!git clone https://github.com/galmerom/Data-Science-Tools.git
SourceCodePath2 = 'Data-Science-Tools'
sys.path.insert(2, SourceCodePath2 )


In [None]:
!pip install openpyxl

In [None]:
import sys
import calendar
from collections import defaultdict
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

import charts
import transformers as TR
import mega_classifier as MC

## Read data - Original data + data enrichment file.

In [None]:
# `squeeze` has no effect on multi-column data and was removed in pandas 2.0, so it is omitted
data = pd.read_csv('../input/australiaweather/weatherAUS.csv', parse_dates=[0])
Locations = pd.read_excel('../input/locationsinaustralia/LocationsAndClimate.xlsx', usecols="A:J")

In [None]:
data.head(5)


Number of weather stations

In [None]:
len(data.Location.unique())

In [None]:
Locations.head()

In [None]:
data.info()

In [None]:
Locations.info()

## Exploring NULLs

Almost every feature contains a large number of NULLs.

In [None]:
data.isnull().sum()

# Split to train and test - for EDA

Split the data to training set and test set by date.

In [None]:
data.sort_values(by='Date',inplace=True)
data_train=data.iloc[0:113747,:] # 80% train 20% test
data_test=data.iloc[113747:,:] # start where the train slice ends so that no row is skipped
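
The date-based split above can also be written without a hard-coded row index. The following is a minimal sketch on toy data; the helper name `chronological_split` and its `train_frac` parameter are hypothetical and not part of the notebook.

```python
import pandas as pd

def chronological_split(df, date_col="Date", train_frac=0.8):
    """Sort by date and cut at the given fraction (hypothetical helper)."""
    ordered = df.sort_values(by=date_col)
    cut = int(len(ordered) * train_frac)
    # iloc slices are half-open, so [:cut] and [cut:] cover every row exactly once
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy frame with dates in reverse order to show that sorting happens first
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D")[::-1],
    "RainTomorrow": ["No", "Yes"] * 5,
})
toy_train, toy_test = chronological_split(toy)
print(len(toy_train), len(toy_test))
print(toy_train["Date"].max() < toy_test["Date"].min())
```

Because the cut point is derived from the row count, the same helper keeps working if the dataset grows.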

Check if, after the split, we have the same class ratio.

In [None]:
TestClass=data_test['RainTomorrow'].value_counts()
TrainClass=data_train['RainTomorrow'].value_counts()
charts.BarCharts([TestClass,TrainClass],['Test set classes','Train set classes'],1,2,txt2show=[('22% for YES label in\n TEST dataset', 13,0.6,-0.15),('22% for YES label in\n TRAIN dataset', 13,0.6,-0.15)]) 

Copy the training data to a new DataFrame to avoid accidentally changing the original data.

In [None]:
data_train4EDA = data_train.copy()

# Transformers

## Change "wind Direction" columns to integer

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
class DirectionTransformer(BaseEstimator, TransformerMixin):
  def __init__(self):
    # 16-point compass, clockwise from north (WSW comes before W)
    self.DirecDict = {'N':0,'NNE':1,'NE':2,'ENE':3,'E':4,'ESE':5,'SE':6,'SSE':7,'S':8,'SSW':9,'SW':10,'WSW':11,'W':12,'WNW':13,'NW':14,'NNW':15}
  def fit(self,X,y=None):
    return self
  def transform(self, X):
    X_new = X.copy()
    X_new['WindGustDir']=X_new['WindGustDir'].apply(self.ChangeWindDirec2Int)
    X_new['WindDir3pm']=X_new['WindDir3pm'].apply(self.ChangeWindDirec2Int)
    X_new['WindDir9am']=X_new['WindDir9am'].apply(self.ChangeWindDirec2Int)
    return X_new
  def ChangeWindDirec2Int(self,x):
    # Unknown values (e.g. NaN) are returned unchanged so they can be imputed later
    return self.DirecDict.get(x, x)

In [None]:
DirectTransformer = DirectionTransformer()
data_train4EDA=DirectTransformer.transform(data_train4EDA)

## Merge with the locations data

Merge the original data and the location data (the data enrichment)

In [None]:
class MergeLocationTransformer(BaseEstimator, TransformerMixin):
  def __init__(self,Locations):
    self.Location = Locations
  def fit(self,X,y=None):
    return self
  def transform(self, X):
    X_new = X.copy()
    X_new= pd.merge(X_new,self.Location, on='Location',how ='left')
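    # Daily averages per grouping; numeric_only=True skips text columns (required in pandas >= 2.0)
    Sameasdf = X_new.groupby(['Date','Same as']).mean(numeric_only=True)
    Neardf = X_new.groupby(['Date','Nearest location']).mean(numeric_only=True)
    Areadf = X_new.groupby(['Date','Area']).mean(numeric_only=True)
    SameDaydf = X_new.groupby(['Date']).mean(numeric_only=True)
    Climatedf=X_new.groupby(['Date','Climate']).mean(numeric_only=True)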
    X_new= pd.merge(X_new,Sameasdf, on=('Date','Same as'),suffixes=('', '_SameAS'),how ='left')
    X_new= pd.merge(X_new,Neardf, on=('Date','Nearest location'),suffixes=('', '_Nearest'),how ='left')
    X_new= pd.merge(X_new,Climatedf, on=('Date','Climate'),suffixes=('', '_Climate'),how ='left')
    X_new= pd.merge(X_new,Areadf, on=('Date','Area'),suffixes=('', '_Area'),how ='left')
    X_new= pd.merge(X_new,SameDaydf, on=('Date'),suffixes=('', '_All'),how ='left')
    X_new.index = X.index
    return X_new


In [None]:
MergeLocation = MergeLocationTransformer(Locations)
data_train4EDA=MergeLocation.transform(data_train4EDA)

## Impute NULLs

We have many features with many NULLs. The worst have 47% NULLs.

To fill the missing values, we take a replacement from the following sources, in this order:

(stop searching as soon as a non-NULL value is found)
 

1.   A very close station
2.   The nearest station
3.   Stations that have the same climate (average)
4.   Stations that have the same 'Area' (average)
5.   All stations on the same day (average)
6.   If all of the above are NULL, then put zero
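
The priority order above can be sketched on a toy frame. This is only an illustration under assumed data: the helper `cascade_fill` is hypothetical, and only two of the fallback sources are shown.

```python
import numpy as np
import pandas as pd

# Toy data: one original column plus two fallback columns (values are made up)
toy = pd.DataFrame({
    "MinTemp":         [10.0, np.nan, np.nan, np.nan],
    "MinTemp_SameAS":  [11.0, 12.0,   np.nan, np.nan],
    "MinTemp_Nearest": [13.0, 14.0,   15.0,   np.nan],
})

def cascade_fill(df, col, suffixes, default=0):
    """Fill nulls in `col` from col+suffix columns, in priority order (hypothetical helper)."""
    filled = df[col]
    for suffix in suffixes:
        # fillna only touches rows that are still null, so earlier sources win
        filled = filled.fillna(df[col + suffix])
    # Anything still missing after all sources gets the default
    return filled.fillna(default)

toy["MinTemp"] = cascade_fill(toy, "MinTemp", ["_SameAS", "_Nearest"])
print(toy["MinTemp"].tolist())  # [10.0, 12.0, 15.0, 0.0]
```

Each row takes the first non-null source: the original value, then the close station, then the nearest station, then zero.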



In [None]:
class impute_nullsTransformer(BaseEstimator, TransformerMixin):
  def __init__(self,WithTargetCol=True):
    self.df=pd.DataFrame()
    self.IncldTrgtcol=WithTargetCol
  def fit(self,X,y=None):
    return self
  def transform(self, X):
    X_new = X.copy()
    self.df = X_new
    AllCol = X_new.columns.to_list()
    ColList = [AllCol[i] for i in range(2,21)]
    list(map(self.impute_nulls, ColList))
    # Recompute 'RainToday' from the imputed rainfall: more than 1 mm counts as rain
    X_new.loc[:,'RainToday'] = np.where(X_new.loc[:,'Rainfall']>1,'Yes','No')
    
    # remove extra columns
    if self.IncldTrgtcol:
      X_new=X_new.iloc[:,0:33]
    else:
      X_new=X_new.iloc[:,0:31]

    return X_new

  def impute_nulls(self,col_name):
    df=self.df
    prefixes = ['_SameAS','_Nearest','_Climate','_Area','_All']
    # Walk the suffixes in priority order. Wherever col_name is still null, take the value
    # from col_name+suffix. If that is null too, the next suffix gets a chance in the
    # following iteration.
    for pf in prefixes:
      df.loc[:,col_name] = np.where(df.loc[:,col_name].isnull(),
                                    df[col_name+pf],
                                    df[col_name])
    # In case there are still nulls, then give them the value of zero
    df.loc[:,col_name] = np.where(df.loc[:,col_name].isnull(),
                                  0,
                                  df[col_name])

In [None]:
ImputeNulls = impute_nullsTransformer()
data_train4EDA=ImputeNulls.transform(data_train4EDA)

Make sure there are no NULLs.

In [None]:
data_train4EDA.loc[data_train4EDA.isnull().any(axis=1),:]

## Checkpoint

We save a checkpoint to back up all the data manipulation done so far.

In [None]:
data_train4EDA.to_csv("No_nulls.csv")

In [None]:
df2= pd.read_csv("No_nulls.csv",index_col=0)#,parse_dates =[1])


# EDA

In [None]:
df2.describe()

## Exploring the differences between the weather stations by locations

### Exploring by feature 'area'

*   The chance of rain in the central area of Australia is only 7%
*   The chance of rain on the island east of Australia is about a third higher than in the other areas

In [None]:
dfArea=pd.crosstab( df2['Area'],df2['RainTomorrow'], normalize='index')
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
dfArea.style.format(precision=2).background_gradient(cmap='Blues')

### Exploring by feature 'rain_district'

![Rain districts](https://theplanningmaster.com/wp-content/uploads/2021/03/Australia-rain-district.png)

Find the number of stations in each rain district.

In [None]:
tmp=df2[['rain_district','Location','Height']].groupby(['rain_district','Location']).max().reset_index()
tmp.groupby('rain_district').count()['Location'].to_frame().transpose()

It seems that the rain district adds little information beyond the area column.

In [None]:
dfRD=pd.crosstab( df2['RainTomorrow'],df2['rain_district'], normalize='columns')
dfRD.style.format(precision=2).background_gradient(cmap='Blues')

### Exploring by feature 'Climate'

![Australia climate](https://theplanningmaster.com/wp-content/uploads/2021/03/Australia-climate.png)

Number of stations in each climate

In [None]:
tmp2 = df2[['Climate','Location','Height']].groupby(['Climate','Location']).max().reset_index()
tmp2.groupby('Climate').count()['Location'].to_frame().transpose()

Climate appears to be a very good indicator for predicting the target.

In [None]:
dfRD=pd.crosstab( df2['RainTomorrow'],df2['Climate'], normalize='columns')
dfRD.style.format(precision=2).background_gradient(cmap='Blues')

In [None]:
ClimateDF=df2.groupby(['RainTomorrow','Climate']).count()['Date'].reset_index()
charts.StackBarCharts([(ClimateDF,'Climate','RainTomorrow','Date')],['Number of records per climate and the chances of rain'],ChartSize=(18, 6))

### Exploring by feature 'Height'

In [None]:
df2.Height.hist(bins= 200,figsize=(10,5))

Using bins to aggregate 

In [None]:
def HeightCatg(x):
  if x<=40:
    return '0000-0040'
  elif x<= 260:
    return '0040-0260'
  elif x<= 1000:
    return '0260-1000'
  else:
     return '1000+'

df2['HeightCategory']=  df2['Height'].apply(lambda x: HeightCatg(x))
dfHeight = df2.groupby(['HeightCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()
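
As an alternative to a hand-written `if`/`elif` chain, the same kind of binning can be expressed with `pd.cut`. This is a sketch on made-up heights that reuses the bin edges and labels from the cell above. It is not the notebook's own method.

```python
import pandas as pd

heights = pd.Series([5, 40, 41, 260, 700, 1500], name="Height")

# right=True (the default) makes each bin closed on the right: (lo, hi],
# which matches the `x <= edge` comparisons in HeightCatg
height_cat = pd.cut(
    heights,
    bins=[-float("inf"), 40, 260, 1000, float("inf")],
    labels=["0000-0040", "0040-0260", "0260-1000", "1000+"],
)
print(height_cat.tolist())
```

Keeping the edges and labels in two parallel lists makes it easier to see when a label does not match its threshold.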

In [None]:
charts.StackBarCharts([(dfHeight,'HeightCategory','RainTomorrow','Location')],['Num. of records in each Height category vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ColorInt=3,ChartSize=(15,7),TitleSize =22,txt2show=[('We can see that height does not\n help predict the target.',12, 0.4,-0.09)])

### Exploring by feature 'distance from the sea'

In [None]:
df2.Distance_from_Sea.hist(bins= 200,figsize=(10,5))

Using bins to aggregate 

In [None]:
def DisFromSeaCatg(x):
  if x<=0:
    return '000-000'
  elif x<= 40:
    return '000-040'
  elif x<= 100:
    return '040-100'
  elif x<= 200:
    return '100-200'  
  else:
     return '200+'

df2['DisFromSeaCategory']=  df2['Distance_from_Sea'].apply(lambda x: DisFromSeaCatg(x))
dfDisFromSea = df2.groupby(['DisFromSeaCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfDisFromSea,'DisFromSeaCategory','RainTomorrow','Location')],['Num. of records in each \"distance in KM from the sea\" category vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),TitleSize =20,txt2show=[('The distance from the sea correlates\n with the rain tomorrow. The longer the distance,\n the lower the chance for rain',12, 0.5,-0.15)]) 

## Correlations

Exploring the correlations between the different parameters:


*   As expected, the temperature features correlate with one another
*   We'll explore the features that correlate with the numeric target RISK_MM:


1.   The most correlated feature is Rainfall: today's rainfall has a correlation of about 0.30 with tomorrow's rain amount.
2.   The next most correlated feature is Sunshine

Keep in mind that correlation measures the linear relationship between two sets of numbers, whereas we are looking for a binary result: was there rain the next day or not? Look at the chart of the Evaporation feature: although the correlation is only 0.04, an increase in evaporation reduces the chance of rain.

Adding a binary column for RainTomorrow will help us find the correlation.

In [None]:
df2['Month'] = pd.DatetimeIndex(df2['Date']).month
df2['TommrRainBin']=np.where(df2['RainTomorrow']=='Yes',1,0)
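
Once the target is a 0/1 column, two things become easy to compute: the Pearson correlation with a numeric feature, and the share of rainy next days inside each bin (the mean of the 0/1 column). The sketch below uses made-up values. The column name `rain_bin` and the bin edges are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (made up), not taken from the weather dataset
toy = pd.DataFrame({
    "Humidity3pm":  [15, 30, 35, 55, 70, 85, 90, 95],
    "RainTomorrow": ["No", "No", "No", "No", "Yes", "Yes", "No", "Yes"],
})
toy["rain_bin"] = np.where(toy["RainTomorrow"] == "Yes", 1, 0)

# Pearson correlation between a numeric feature and the 0/1 target
r = toy["Humidity3pm"].corr(toy["rain_bin"])

# The mean of a 0/1 column within a bin is the share of rainy next days in that bin
bins = pd.cut(toy["Humidity3pm"], bins=[0, 40, 80, 100])
rain_rate = toy.groupby(bins, observed=True)["rain_bin"].mean()
print(round(r, 2))
print(rain_rate)
```

The per-bin rate is what the stacked bar charts in the EDA visualize, and it can reveal a trend even when the single correlation number is small.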

In [None]:
num_corr = df2.select_dtypes(include=np.number).corr()
fig, ax = plt.subplots(figsize=(15, 15))
title = 'Pearson correlation coefficients (PCC) for Australia Rain DS'
plt.title(title, fontsize=18)

mask = np.zeros_like(num_corr)
mask[np.triu_indices_from(mask)] = True

heat_map = sns.heatmap(num_corr, ax=ax, annot=True, linewidths=0.7, fmt='.2f', vmin=-0.80, vmax=0.8,
                       cmap='magma_r', mask=mask, center=0.35, cbar_kws={"shrink": 0.75}, square=True)
plt.show()

## Exploring original features with high correlation

### Sunshine

In [None]:
df2['Sunshine'].hist(bins= 100,figsize=(10,5))

Using bins to aggregate 

In [None]:
def SunshineCatg(x):
  if x<=0.0:
    return '00-00'
  elif x<= 4:
    return '00-04'
  elif x<= 8.0:
    return '04-08'
  elif x<= 11.0:
    return '08-11'
  else:
     return '11+'

df2['SunshineCategory']=  df2['Sunshine'].apply(lambda x: SunshineCatg(x))
dfSun = df2.groupby(['SunshineCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfSun,'SunshineCategory','RainTomorrow','Location')],['Num. of records in each Sunshine category vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),TitleSize =22,txt2show=[('The more sunshine we get,\n the more dry days we get',12, 0.02,-0.09)])

### Evaporation

In [None]:
df2[(df2['Evaporation']<30)&(df2['Evaporation']>15)].hist(column='Evaporation',bins= 100,figsize=(10,5))

Using bins to aggregate 

In [None]:
def EvapCatg(x):
  if x<=0.0:
    return '00-00'
  elif x<= 2:
    return '00-02'
  elif x<= 5.0:
    return '02-05'
  elif x<= 10.0:
    return '05-10'
  elif x<=30.0:
    return '10-30'
  else:
     return '30+'

df2['EvaporationCategory']=  df2['Evaporation'].apply(lambda x: EvapCatg(x))
dfEvap = df2.groupby(['EvaporationCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()

**As the Evaporation value increases, the chance of rain decreases**

In [None]:
charts.StackBarCharts([(dfEvap,'EvaporationCategory','RainTomorrow','Location')],['Num. of records in each Evap. category vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),ColorInt=3,TitleSize =25,txt2show=[('As the Evaporation value increases, we get\n fewer chances of raining',12, 0.02,-0.09)])

### Rainfall

In [None]:
df2.Rainfall.hist(bins= 100,figsize=(10,5))

Removing the tail

In [None]:
df2[(df2['Rainfall']<100)&(df2['Rainfall']>10)].hist(column='Rainfall',bins= 100,figsize=(10,5))

Using bins to aggregate 

In [None]:
def RainfallCatg(x):
  if x<=1:
    return '00-01'
  elif x<= 15:
    return '01-15'
  elif x<= 60:
    return '15-60'
  else:
     return '60+'

df2['RainfallCategory'] =  df2['Rainfall'].apply(lambda x: RainfallCatg(x))
dfRF = df2.groupby(['RainfallCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfRF,'RainfallCategory','RainTomorrow','Location')],['Num. of records in each rainfall category vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),TitleSize =25,txt2show=[('We can see that 85% of days that did not have\n rain will not have rain tomorrow.\n As the amount of rain increases, the\n chances for rain increases.',12, 0.55,-0.20)])

#### Exploring 'Rainfall' with 'distance from the sea'

In [None]:
dfRFDFTS = df2[df2['Rainfall']>15].groupby(['DisFromSeaCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfRFDFTS,'DisFromSeaCategory','RainTomorrow','Location')],['Num. of rain>15 as a distance from the sea vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(20,7),TitleSize =25,txt2show=[('We can see that 59% of days with rain over 15 and\n is near the sea will have rain tomorrow.',12, 0.55,-0.20)]) 

#### Exploring 'Rainfall' with 'Area'

In [None]:
dfAreaRF = df2[df2['Rainfall']>15].groupby(['Area','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfAreaRF,'Area','RainTomorrow','Location')],['Num. of records that rain>15 aggregate by area vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(20,7),TitleSize =25,txt2show=[('We can see that using area and rainfall over 15 \nwe can get better prediction.\nUnfortunately, this is only 4.7% of the data,\n and about 18% of the target class',12, 0.05,-0.2)])

### Humidity3pm

In [None]:
df2.Humidity3pm.hist(bins= 100,figsize=(10,5))

Using bins to aggregate 

In [None]:
def HumidityCatg(x):
  if x<=20:
    return '00-20'
  elif x<= 40:
    return '20-40'
  elif x<= 60:
    return '40-60'
  elif x<= 80:
    return '60-80'
  else:
     return '80+'

df2['HumidityCategory'] =  df2['Humidity3pm'].apply(lambda x: HumidityCatg(x))
dfHmdty = df2.groupby(['HumidityCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfHmdty,'HumidityCategory','RainTomorrow','Location')],['Num. of records in each humidity category vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),TitleSize =25,txt2show=[('We can see that humidity under 20 almost always\n means no rain. Humidity over 80 gives\n a strong prediction of rain',12, 0.05,-0.15)])

#### Exploring humidity under 40 with area

In [None]:
dfAreHum=df2[df2['Humidity3pm']<40].groupby(['Area','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfAreHum,'Area','RainTomorrow','Location')],['Num. of records humidity<40 in each area vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),TitleSize =25,txt2show=[('In the north and northeast, we can assume\n that if the humidity is under 40,\n next day will be dry',12, 0.05,-0.15)])

Let's explore humidity under 40 by climate

In [None]:
dfHumlimate=df2[(df2['Humidity3pm']<40)].groupby(['Climate','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfHumlimate,'Climate','RainTomorrow','Location')],['Num. of records humidity<40 in each climate vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),TitleSize =25,txt2show=[('We can see that in climates:\n Am and Aw\n if humidity <40 there is almost no chance for rain.',12, 0.05,-0.15)])

**In the training data**, days with Humidity3pm < 25 in the following climates have no rain the next day (the count below checks this):


*   Am
*   Aw



In [None]:
df2[(df2['Humidity3pm']<25)&((df2['Climate']=='Aw')|(df2['Climate']=='Am'))].groupby('RainTomorrow').count()['Location']

### Cloud3pm - measured in oktas. Zero means the sky has no clouds; 8 means the sky is completely covered

In [None]:
df2.Cloud3pm.hist(bins= 100,figsize=(10,5))

Using bins to aggregate 

In [None]:
def CloudCatg(x):
  if x<=0:
    return '0'
  elif x<= 1:
    return '0-1'
  elif x<= 2:
    return '1-2'
  elif x<= 6:
    return '2-6'
  elif x<= 7:
    return '6-7'
  else:
     return '7+'

df2['CloudCategory'] =  df2['Cloud3pm'].apply(lambda x: CloudCatg(x))
dfCloud = df2.groupby(['CloudCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfCloud,'CloudCategory','RainTomorrow','Location')],['Num. of records in each Cloud category vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),TitleSize =25,txt2show=[('We can see that if there is no cloud\nin the sky, the chances for rain are\nalmost zero. Also, very few clouds\nrarely followed by rain. \nWhen we get more than 7 we have\nmore than 50% chance of raining.',12, 0.02,-0.3)])

### 'Month' -  defining [seasons](http://www.bom.gov.au/climate/glossary/seasons.shtml)

In [None]:
def Seasons(x):
  if x in [9,10,11]:
    return 'Spring'
  elif x in [12,1,2]:
    return 'Summer'
  elif x in [3,4,5]:
    return 'Autumn'
  else:
     return 'Winter'

df2['Seasons'] =  df2['Month'].apply(lambda x: Seasons(x))
dfseason = df2.groupby(['Seasons','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfseason,'Seasons','RainTomorrow','Location')],['Seasons vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),TitleSize =25,txt2show=[('Seasons as a whole does not help us predict rain.',12, 0.40,-0.05)])

Let's explore seasons by area

In [None]:
df2['SeasonArea'] = 'S: '+ df2['Seasons'] + " C: "+df2['Area']
dfseasonArea = df2[df2['Seasons']=='Winter'].groupby(['SeasonArea','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfseasonArea,'SeasonArea','RainTomorrow','Location')],['Winter vs. will it rain tomorrow? per Area'],NumRows=1,NumCol=1,ColorInt=0,ChartSize=(20,7),TitleSize =25,txt2show=[('We can see that there is a significant difference in the probability between areas',14, 0.15,-0.1)])

We can assume that in the **winter** there is no rain in the North area. The counts below check this:

In [None]:
df2[(df2['Seasons']=='Winter')&(df2['Area']=='North')].groupby('RainTomorrow').count()['Location']

In [None]:
df2['SeasonClimate'] = 'S: '+ df2['Seasons'] + " C: "+df2['Climate']
dfWinterClimate = df2[df2['Seasons']=='Winter'].groupby(['SeasonClimate','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfWinterClimate,'SeasonClimate','RainTomorrow','Location')],['Winter & Climate vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ColorInt=1,ChartSize=(20,7),TitleSize =25,txt2show=[('When we zoom in, we can see that the probability for rain in winter is different from climate to climate.\nIn the winter, there is a high probability for rainfall in Csb climate and\n relatively high probability in Csa Climate. Average probability in Cfa and Cfa climates and\n low likelihood for the other Climates',14, 0.01,-0.2)])

In [None]:
dfAutumnClimate = df2[df2['Seasons']=='Autumn'].groupby(['SeasonClimate','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfAutumnClimate,'SeasonClimate','RainTomorrow','Location')],['Autumn & Climate vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ColorInt=2,ChartSize=(20,7),TitleSize =25,txt2show=[('In the Autumn, there is a high probability of rain in the Am Climate.\nRelative high probability in Cfa & Csb climates.\nAverage probability in Aw & Cfb climates and low probability in the other Climates',14, 0.02,-0.15)])

In [None]:
dfSpringClimate = df2[df2['Seasons']=='Spring'].groupby(['SeasonClimate','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfSpringClimate,'SeasonClimate','RainTomorrow','Location')],['Spring & Climate vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ColorInt=3,ChartSize=(20,7),TitleSize =25,txt2show=[('When we zoom in, we can see that the probability\nfor rain in Spring is different from climate to climate.\nHowever, In the Spring, the differences between climates\n are lower than the other seasons',14, 0.02,-0.18)])

In [None]:
dfSummerClimate = df2[df2['Seasons']=='Summer'].groupby(['SeasonClimate','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfSummerClimate,'SeasonClimate','RainTomorrow','Location')],['Summer & Climate vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ColorInt=4,ChartSize=(20,7),TitleSize =25,txt2show=[('In the Summer, there is a very high probability for rain\nin Am & Aw Climates. Average chance for rain\n in Cfa & Cfb climates, and low likelihood for rain\n in Csa & Csb climates',14, 0.03,-0.2)])

### Pressure9am

In [None]:
df2.Pressure9am.hist(bins= 100,figsize=(10,5))

Using bins to aggregate 

In [None]:
def Pressure9am(x):
  if x<=1000:
    return '0000-1000'
  elif x<= 1010:
    return '1000-1010'
  elif x<= 1020:
    return '1010-1020'
  elif x<= 1030:
    return '1020-1030'
  else:
     return '1030+'

df2['PressureCategory'] =  df2['Pressure9am'].apply(lambda x: Pressure9am(x))
dfPressure = df2.groupby(['PressureCategory','RainTomorrow']).count()['Location'].to_frame().reset_index()

In [None]:
charts.StackBarCharts([(dfPressure,'PressureCategory','RainTomorrow','Location')],['Num. of records in each Pressure category vs. will it rain tomorrow?'],NumRows=1,NumCol=1,ChartSize=(15,7),ColorInt=1,TitleSize =25,txt2show=[('We can see that the lower the pressure,\nthe higher the probability of raining.\nIf the pressure <= 1,000, there is a 64% chance\n of raining.\nAbove 1,030 we have ONLY a 10%\n chance of raining.',12, 0.03,-0.25)])

### Location

No location has an excess of records, but some have about 50% fewer than the others.

In [None]:
df2.groupby('Location').count()['Date']

# Other Transformations - After EDA

The following transformer adds month and year columns and caps the rainfall at a maximum of 100

In [None]:
class OtherChangesTransformer(BaseEstimator, TransformerMixin):
  def fit(self,X,y=None):
    return self
  def transform(self, X):
    X_new = X.copy()
    X_new['Month'] = pd.DatetimeIndex(X_new['Date']).month.astype(str)
    X_new['Year'] = pd.DatetimeIndex(X_new['Date']).year
    X_new['Rainfall']=np.where(X_new['Rainfall']>100,100,X_new['Rainfall'])
    return X_new

The following transformer creates all the category features we used in the EDA. We can run the transformer without running the EDA.

In [None]:
class CatgTransformer(BaseEstimator, TransformerMixin):
  def fit(self,X,y=None):
    return self
  def transform(self, X):
    X_new = X.copy()
    X_new['HeightCategory']=  X_new['Height'].apply(lambda x: self.HeightCatg(x))
    X_new['DisFromSeaCategory']=  X_new['Distance_from_Sea'].apply(lambda x: self.DisFromSeaCatg(x))
    X_new['SunshineCategory']=  X_new['Sunshine'].apply(lambda x: self.SunshineCatg(x))
    X_new['EvaporationCategory']=  X_new['Evaporation'].apply(lambda x: self.EvapCatg(x))
    X_new['RainfallCategory'] =  X_new['Rainfall'].apply(lambda x: self.RainfallCatg(x))
    X_new['HumidityCategory'] =  X_new['Humidity3pm'].apply(lambda x: self.HumidityCatg(x))
    X_new['CloudCategory'] =  X_new['Cloud3pm'].apply(lambda x: self.CloudCatg(x))
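    # 'Month' is stored as a string by OtherChangesTransformer, so cast it to int for the season lookup
    X_new['Seasons'] =  X_new['Month'].apply(lambda x: self.Seasons(int(x)))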
    X_new['PressureCategory'] =  X_new['Pressure9am'].apply(lambda x: self.Pressure9am(x))

    return X_new
  def Pressure9am(self,x):
    if x<=1000:
      return '0000-1000'
    elif x<= 1010:
      return '1000-1010'
    elif x<= 1020:
      return '1010-1020'
    elif x<= 1030:
      return '1020-1030'
    else:
      return '1030+'    
  def Seasons(self,x):
    if x in [9,10,11]:
      return 'Spring'
    elif x in [12,1,2]:
      return 'Summer'
    elif x in [3,4,5]:
      return 'Autumn'
    else:
      return 'Winter'
  def CloudCatg(self,x):
    if x<=0:
      return '0'
    elif x<= 1:
      return '0-1'
    elif x<= 2:
      return '1-2'
    elif x<= 6:
      return '2-6'
    else:
      return '6+'
  def HumidityCatg(self,x):
    if x<=20:
      return '00-20'
    elif x<= 40:
      return '20-40'
    elif x<= 60:
      return '40-60'
    elif x<= 80:
      return '60-80'
    else:
      return '80+'
  def RainfallCatg(self,x):
    if x<=1:
      return '00-01'
    elif x<= 15:
      return '01-15'
    elif x<= 60:
      return '15-60'
    else:
      return '60+'
  def EvapCatg(self,x):
    if x<=0.0:
      return '00-00'
    elif x<= 2:
      return '00-02'
    elif x<= 5.0:
      return '02-05'
    elif x<= 10.0:
      return '05-10'
    elif x<=30.0:
      return '10-30'
    else:
      return '30+'
  def SunshineCatg(self,x):
    if x<=0.0:
      return '00-00'
    elif x<= 4:
      return '00-04'
    elif x<= 8.0:
      return '04-08'
    elif x<= 11.0:
      return '08-11'
    else:
      return '11+'
  def DisFromSeaCatg(self,x):
    if x<=0:
      return '000-000'
    elif x<= 40:
      return '000-040'
    elif x<= 100:
      return '040-100'
    elif x<= 200:
      return '100-200'  
    else:
      return '200+'
  def HeightCatg(self,x):
    if x<=40:
      return '0000-0040'
    elif x<= 260:
      return '0040-0260'
    elif x<= 1000:
      return '0260-1000'
    else:
      return '1000+'

# Running all using transformers and Splitting to X and Y

The EDA uses data that includes the target columns. So from this section on, we split the data again and run all the transformers on the train and test data separately, to avoid any data leakage.

In [None]:
data.sort_values(by='Date',inplace=True)
data_train=data.iloc[0:113747,:] # 80% train 20% test
data_test=data.iloc[113747:,:] # start where the train slice ends so that no row is skipped

In [None]:
# Train dataset

X_train = data_train.drop(['RISK_MM','RainTomorrow'],axis=1)
y_train = data_train[['RainTomorrow']]

# Test dataset

X_test = data_test.drop(['RISK_MM','RainTomorrow'],axis=1)
y_test = data_test[['RainTomorrow']]

New instance of each transformer

In [None]:
DirectTransformer = DirectionTransformer()
MergeLocation = MergeLocationTransformer(Locations)
ImputeNulls = impute_nullsTransformer(False)
OtherChangesTrans = OtherChangesTransformer()
CatgTrans = CatgTransformer()

Run all transformers on **train** data

In [None]:
X_train = DirectTransformer.transform(X_train)
X_train = MergeLocation.transform(X_train)
X_train = ImputeNulls.transform(X_train)
X_train = OtherChangesTrans.transform(X_train)
X_train = CatgTrans.transform(X_train)

Run all transformers on **test** data

In [None]:
X_test = DirectTransformer.transform(X_test)
X_test = MergeLocation.transform(X_test)
X_test = ImputeNulls.transform(X_test)
X_test = OtherChangesTrans.transform(X_test)
X_test = CatgTrans.transform(X_test)
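
Since every transformer follows the scikit-learn `fit`/`transform` interface, the same sequence could also be chained with `Pipeline`, so that train and test go through identical steps in one call. The sketch below uses two toy stand-in transformers (`AddMonth` and `CapRainfall` are hypothetical names, not the notebook's classes) on made-up data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class AddMonth(BaseEstimator, TransformerMixin):
    """Toy stand-in for a stateless transformer that adds a Month column."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_new = X.copy()
        X_new["Month"] = pd.DatetimeIndex(X_new["Date"]).month
        return X_new

class CapRainfall(BaseEstimator, TransformerMixin):
    """Toy stand-in that clips Rainfall at a maximum value."""
    def __init__(self, cap=100):
        self.cap = cap
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_new = X.copy()
        X_new["Rainfall"] = X_new["Rainfall"].clip(upper=self.cap)
        return X_new

# Steps run in the listed order, each receiving the previous step's output
prep = Pipeline([("month", AddMonth()), ("cap", CapRainfall(cap=100))])
toy = pd.DataFrame({"Date": ["2015-03-01", "2015-12-24"], "Rainfall": [3.0, 250.0]})
out = prep.fit_transform(toy)
print(out)
```

With a pipeline, the test set would be processed with `prep.transform(...)`, which removes the risk of applying the steps in a different order on train and test.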

# Models

## **Scoring**  
We will look at the **accuracy score** since our goal is to tell people:

Take an umbrella tomorrow or not.

By saying "no rain tomorrow", *all the time*, we will have a **78% accuracy**, so we will look for a better result than 78%.
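
The majority-class baseline can be made explicit with scikit-learn's `DummyClassifier`. The sketch below uses toy labels with roughly the class balance quoted above. The data is an assumption for illustration, not the real dataset.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels with a 78/22 split (assumed for illustration)
y_toy = pd.Series(["No"] * 78 + ["Yes"] * 22)
X_toy = pd.DataFrame({"unused": range(len(y_toy))})  # features are ignored by the baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)  # 0.78
```

Any model whose accuracy does not clearly exceed this number has learned little beyond the class balance.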

## Decision tree Classifier - 80% accuracy

### First run - 80% accuracy

#### Picking columns **manually**

Columns are picked by their position in the DataFrame.

In [None]:
df3=X_train.iloc[:,[22,3,4,5,6,7,8,9,11,14,16,18,19,21,24,27,30,31,32,33,34,35,36,37,38,39,40,41]]

In [None]:
df4=pd.get_dummies(df3)

#### Running a basic decision tree

In [None]:
X = df4
y = y_train

In [None]:
Aus_dt1 = DecisionTreeClassifier(min_samples_leaf=5)
Aus_dt1.fit(X, y)

#### feature_importances

In [None]:
charts.PlotFeatureImportance(X,Aus_dt1)

#### Predicting over the **training** set

In [None]:
X['TommorowRain_pred'] = Aus_dt1.predict(X)

In [None]:
charts.ClassicGraphicCM(X['TommorowRain_pred'],y,Aus_dt1.classes_,normalize=True)

#### **Test** dataset prediction

In [None]:
X_test_colPick = pd.get_dummies(X_test.iloc[:,[22,3,4,5,6,7,8,9,11,14,16,18,19,21,24,27,30,31,32,33,34,35,36,37,38,39,40,41]])
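
One caveat with calling `pd.get_dummies` separately on train and test: if a category appears in only one of the two sets, the resulting columns differ and `predict` fails or misaligns. A common remedy, sketched here on toy data with assumed category values, is to reindex the test dummies to the training columns.

```python
import pandas as pd

# Toy frames: 'Cfa' and 'Aw' appear only in train, 'BSh' appears only in test
train_toy = pd.DataFrame({"Climate": ["Am", "Aw", "Cfa"], "Rainfall": [0.0, 2.0, 5.0]})
test_toy = pd.DataFrame({"Climate": ["Am", "BSh"], "Rainfall": [1.0, 0.0]})

train_d = pd.get_dummies(train_toy)
# Reindex to the training columns: categories unseen in training are dropped,
# and categories missing from the test set are added as all-zero columns
test_d = pd.get_dummies(test_toy).reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns) == list(train_d.columns))
```

After the reindex, both frames have the same columns in the same order, which is what the fitted model expects.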

In [None]:
X_test_colPick['TommorowRain_pred'] = Aus_dt1.predict(X_test_colPick)

In [None]:
charts.ClassicGraphicCM(X_test_colPick['TommorowRain_pred'],y_test,Aus_dt1.classes_,normalize=True)

### Second run - Using **SelectKBest**  - 78% accuracy

#### Transformer - shift negative values so that SelectKBest with chi2 can work

In [None]:
class PrepareForSelectKbestTransformer(BaseEstimator, TransformerMixin):
  def fit(self,X,y=None):
    return self
  def transform(self, X):
    X_new = X.copy()
    X_new=pd.get_dummies(X_new.drop(['Date','Location','Long','Nearest location'],axis=1))
    X_new.MaxTemp = X_new.MaxTemp + 10
    X_new.MinTemp = X_new.MinTemp + 10
    X_new.Latitude = X_new.Latitude + 50
    X_new.Temp9am = X_new.Temp9am + 10
    X_new.Temp3pm = X_new.Temp3pm + 10
    return X_new
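
The reason for the shifts is that scikit-learn's `chi2` score function only accepts non-negative features. A small sketch on made-up numbers shows the error and the effect of adding a constant. The offset of 10 here is just an example value.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: the first column contains negative values
X_toy = np.array([[-3.0, 1.0], [2.0, 0.0], [5.0, 1.0], [-1.0, 0.0]])
y_toy = np.array([1, 0, 1, 0])

try:
    SelectKBest(chi2, k=1).fit(X_toy, y_toy)
    raised = False
except ValueError:
    # chi2 rejects inputs that contain negative values
    raised = True

# Shift the column by a constant so that its minimum is non-negative
X_shifted = X_toy.copy()
X_shifted[:, 0] += 10
selector = SelectKBest(chi2, k=1).fit(X_shifted, y_toy)
print(raised, selector.get_support())
```

A fixed offset only works if no value falls below it. Scaling each column to a non-negative range (for example with `MinMaxScaler`) is another option that does not depend on a hand-picked constant.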

In [None]:
PreSelectKbest = PrepareForSelectKbestTransformer()

#### Running the transformation and the model

In [None]:
KBestModel=SelectKBest(chi2, k=100)
tmpdf=PreSelectKbest.transform(X_train)
X_np =KBestModel.fit_transform(tmpdf, y_train)

# X_np is a NumPy array; convert it back to a DataFrame called X_new

mask = KBestModel.get_support() #list of booleans
new_features = [] # The list of  K best features

for bool, feature in zip(mask, tmpdf.columns):
    if bool:
        new_features.append(feature)

X_new=pd.DataFrame(X_np,columns=new_features,index=tmpdf.index)
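
The mask-to-column-names step can also be written without an explicit loop by indexing the columns with the boolean mask. This is a sketch on a tiny made-up frame (`demo_df`, `demo_y`, and `selector` are illustrative names), not a change to the cell above:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Tiny illustrative frame: two columns track the label, one does not
demo_df = pd.DataFrame({
    "informative": [0, 0, 0, 5, 5, 5],
    "noise": [1, 2, 1, 2, 1, 2],
    "also_informative": [9, 8, 9, 1, 0, 1],
})
demo_y = np.array([0, 0, 0, 1, 1, 1])

selector = SelectKBest(chi2, k=2).fit(demo_df, demo_y)

# get_support() returns a boolean mask aligned with the input columns
selected_cols = demo_df.columns[selector.get_support()]

# Rebuild a DataFrame that keeps both the column names and the original index
demo_selected = pd.DataFrame(
    selector.transform(demo_df), columns=selected_cols, index=demo_df.index
)
print(list(selected_cols))
```

Boolean indexing also avoids naming a loop variable `bool`, which shadows the Python built-in.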

In [None]:
Aus_dt2 = DecisionTreeClassifier(min_samples_leaf=5,class_weight='balanced')
Aus_dt2.fit(X_new, y_train)


#### feature_importances

In [None]:
charts.PlotFeatureImportance(X_new,Aus_dt2)

Predicting over the **TRAIN** dataset

In [None]:
X_new['TommorowRain_pred'] = Aus_dt2.predict(X_new)

In [None]:
charts.ClassicGraphicCM(X_new['TommorowRain_pred'],y_train,Aus_dt2.classes_,normalize=True)

#### Test dataset

In [None]:
X_test2=X_test.copy()

In [None]:
tmpdf_test=PreSelectKbest.transform(X_test2)

In [None]:
X_np =KBestModel.transform(tmpdf_test)

mask = KBestModel.get_support() #list of booleans
new_features = [] # The list of  K best features

for bool, feature in zip(mask, tmpdf_test.columns):
    if bool:
        new_features.append(feature)

X_new_test=pd.DataFrame(X_np,columns=new_features,index=tmpdf_test.index)

In [None]:
X_new_test['TommorowRain_pred'] = Aus_dt2.predict(X_new_test)

In [None]:
charts.ClassicGraphicCM(X_new_test['TommorowRain_pred'],y_test,Aus_dt2.classes_,normalize=True)

## Logistic regression - 85% accuracy

In [None]:
df4=pd.get_dummies(X_train.iloc[:,[22,3,4,5,6,7,8,9,11,14,16,18,19,21,24,27,30,31,32,33,34,35,36,37,38,39,40,41]])

In [None]:
df44 = pd.get_dummies(X_test.iloc[:,[22,3,4,5,6,7,8,9,11,14,16,18,19,21,24,27,30,31,32,33,34,35,36,37,38,39,40,41]])

In [None]:
scaler = MinMaxScaler()

X_train2 = scaler.fit_transform(df4)
X_test2 = scaler.transform(df44)
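
`scaler.transform` expects the test frame to have exactly the same columns as the training frame, but calling `pd.get_dummies` separately on train and test can yield different columns when a category appears in only one of them. The following sketch shows one way to align them; the column names (`WindDir`, `Humidity`) and values are hypothetical.

```python
import pandas as pd

# Hypothetical frames: "S" and "E" appear only in train, "W" only in test
train_demo = pd.DataFrame({"WindDir": ["N", "S", "E"], "Humidity": [40, 55, 70]})
test_demo = pd.DataFrame({"WindDir": ["N", "W"], "Humidity": [45, 60]})

train_dummies = pd.get_dummies(train_demo)
test_dummies = pd.get_dummies(test_demo)

# Reindex test to the training columns: dummy columns unseen in training
# are dropped, and columns missing from test are added and filled with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

After alignment, both frames can go through the same fitted scaler and model.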

In [None]:
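# Import here: the imports cell at the top does not include LogisticRegression
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='liblinear', random_state=0)
logreg.fit(X_train2, y_train)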

Predicting over **Training** set

In [None]:
y_pred_train = logreg.predict(X_train2)

In [None]:
charts.ClassicGraphicCM(y_pred_train,y_train,logreg.classes_,normalize=True)

Predicting over **Test** set

In [None]:
y_pred_test = logreg.predict(X_test2)

In [None]:
charts.ClassicGraphicCM(y_pred_test,y_test,logreg.classes_,normalize=True)

## Random forest - 84% accuracy

### Training the model

In [None]:
PreSelectKbest = PrepareForSelectKbestTransformer()

In [None]:
rf_model1 = RandomForestClassifier(n_estimators= 200,max_depth=50,min_samples_leaf=5, class_weight= 'balanced',random_state=1234)

In [None]:
KBestModel=SelectKBest(chi2, k=100)
X_df=PreSelectKbest.transform(X_train)
X_np =KBestModel.fit_transform(X_df, y_train)

mask = KBestModel.get_support() #list of booleans
new_features = [] # The list of  K best features

for bool, feature in zip(mask, X_df.columns):
    if bool:
        new_features.append(feature)

X_new=pd.DataFrame(X_np,columns=new_features,index=X_df.index)

Fitting the model and predicting over the **Training** set

In [None]:
rf_model1.fit(X_new, y_train)

#### feature_importances

In [None]:
charts.PlotFeatureImportance(X_new,rf_model1)

In [None]:
charts.ClassicGraphicCM(rf_model1.predict(X_new),y_train,rf_model1.classes_,normalize=True)

### Predicting over **Test** set

In [None]:
x_tsetDF=PreSelectKbest.transform(X_test)
X_np =KBestModel.transform(x_tsetDF)

mask = KBestModel.get_support() #list of booleans
new_features = [] # The list of  K best features

for bool, feature in zip(mask, x_tsetDF.columns):
    if bool:
        new_features.append(feature)

X_new_test=pd.DataFrame(X_np,columns=new_features,index=x_tsetDF.index)

In [None]:
charts.ClassicGraphicCM(rf_model1.predict(X_new_test),y_test,rf_model1.classes_,normalize=True)

## XGBOOST - 86% accuracy

In [None]:
xgb_model = XGBClassifier( random_state=1234,n_estimators= 500,max_depth=50,)
xgb_model.fit(X_new, y_train)

Predicting over **Training** set

In [None]:
y_X_train=xgb_model.predict(X_new)

In [None]:
charts.ClassicGraphicCM(y_X_train,y_train,xgb_model.classes_,normalize=True)

### Predicting over **Test** set

In [None]:
y_tst_prd=xgb_model.predict(X_new_test)

In [None]:
charts.ClassicGraphicCM(y_tst_prd,y_test,xgb_model.classes_,normalize=True)


# Running combined models - 90% accuracy

### Running combined models sliced by each 'climate' value - 90% accuracy

Let's start by preparing the data again.

In [None]:
class PrepareForSelectKbestTransformer(BaseEstimator, TransformerMixin):
  def fit(self,X,y=None):
    return self
  def transform(self, X):
    X_new = X.copy()
    X_new=pd.get_dummies(X_new.drop(['Date','Location','Long','Nearest location'],axis=1))
    X_new.MaxTemp = X_new.MaxTemp + 10
    X_new.MinTemp = X_new.MinTemp + 10
    X_new.Latitude = X_new.Latitude + 50
    X_new.Temp9am = X_new.Temp9am + 10
    X_new.Temp3pm = X_new.Temp3pm + 10
    return X_new

In [None]:
PreSelectKbest = PrepareForSelectKbestTransformer()

In [None]:
KBestModel=SelectKBest(chi2, k=100)
X_df=PreSelectKbest.transform(X_train)
X_np =KBestModel.fit_transform(X_df, y_train)

mask = KBestModel.get_support() #list of booleans
new_features = [] # The list of  K best features

for bool, feature in zip(mask, X_df.columns):
    if bool:
        new_features.append(feature)

X_new=pd.DataFrame(X_np,columns=new_features,index=X_df.index)

In [None]:
x_tsetDF=PreSelectKbest.transform(X_test)
X_np =KBestModel.transform(x_tsetDF)

mask = KBestModel.get_support() #list of booleans
new_features = [] # The list of  K best features

for bool, feature in zip(mask, x_tsetDF.columns):
    if bool:
        new_features.append(feature)

X_new_test=pd.DataFrame(X_np,columns=new_features,index=x_tsetDF.index)

In [None]:
X_new['Climate']=X_train['Climate']
X_new_test['Climate']=X_test['Climate']

In [None]:
climates = X_new['Climate'].unique()

**Create a model for each climate value**

In [None]:
PredPerClimateDir = {}
for clm in climates:
    print('Start:' + str(clm))
    X_new2 = X_new[X_new['Climate']==clm]
    X_new_test2 = X_new_test[X_new_test['Climate']==clm]
    y_train2 = y_train[X_new['Climate']==clm]
    y_test2 = y_test[X_new_test['Climate']==clm]
    X_new2=X_new2.drop(['Climate'],axis=1)
    X_new_test2 = X_new_test2.drop(['Climate'],axis=1)
    xgb_model2 = XGBClassifier( random_state=1234,n_estimators= 500,max_depth=50,min_child_weight=3)
    xgb_model2.fit(X_new2, y_train2)
    y_tst_prd = xgb_model2.predict(X_new_test2)
    PredPerClimateDir[clm] = {'y_pred':pd.Series(y_tst_prd,name='y_pred',index=X_new_test2.index),'model':xgb_model2}

Combine all models results - running on **Test** set

In [None]:
# Collect the per-climate predictions into one DataFrame
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
frames = []
for clm in PredPerClimateDir.keys():
    CurrDF = PredPerClimateDir[clm]['y_pred'].to_frame()
    CurrDF['Climate'] = clm
    frames.append(CurrDF)
DF = pd.concat(frames)

# Restore the row order of the full test set. X_new_test2 and y_test2 only hold
# the last climate slice from the loop above, so they are not used here.
DF = DF.reindex(X_new_test.index)

In [None]:
charts.ClassicGraphicCM(DF['y_pred'],y_test,xgb_model2.classes_,normalize=True)
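
The train-per-slice-then-combine pattern can be sketched on synthetic data as follows. This is an illustration under stated assumptions: it swaps XGBoost for a small `DecisionTreeClassifier` to keep it light, and the frame `demo`, its columns, and the group names are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "feature": rng.normal(size=n),
    "group": np.tile(["tropical", "desert"], n // 2),
})
# The label rule differs per group, the situation where per-group models help
demo["target"] = np.where(
    demo["group"] == "tropical", demo["feature"] > 0, demo["feature"] < 0
).astype(int)

train_demo, test_demo = demo.iloc[:150], demo.iloc[150:]

group_preds = []
for group_name, train_part in train_demo.groupby("group"):
    test_part = test_demo[test_demo["group"] == group_name]
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(train_part[["feature"]], train_part["target"])
    # Keep the test index so the predictions can be re-ordered later
    group_preds.append(
        pd.Series(model.predict(test_part[["feature"]]), index=test_part.index, name="y_pred")
    )

# Concatenate all slices, restore the test order, and score on the FULL test set
combined_pred = pd.concat(group_preds).reindex(test_demo.index)
combined_acc = accuracy_score(test_demo["target"], combined_pred)
print(f"Combined accuracy: {combined_acc:.2f}")
```

Scoring against the full test labels, rather than the labels of the last slice, is what makes the combined figure comparable with the single-model results.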

### Running combined models sliced by each **rain district** value - **90%** accuracy with a **much better score on the YES class**

Let's start by preparing the data again.

In [None]:
class PrepareForSelectKbestTransformer(BaseEstimator, TransformerMixin):
  def fit(self,X,y=None):
    return self
  def transform(self, X):
    X_new = X.copy()
    X_new=pd.get_dummies(X_new.drop(['Date','Location','Long','Nearest location'],axis=1))
    X_new.MaxTemp = X_new.MaxTemp + 10
    X_new.MinTemp = X_new.MinTemp + 10
    X_new.Latitude = X_new.Latitude + 50
    X_new.Temp9am = X_new.Temp9am + 10
    X_new.Temp3pm = X_new.Temp3pm + 10
    return X_new

In [None]:
PreSelectKbest = PrepareForSelectKbestTransformer()

In [None]:
KBestModel=SelectKBest(chi2, k=100)
X_df=PreSelectKbest.transform(X_train)
X_np =KBestModel.fit_transform(X_df, y_train)

mask = KBestModel.get_support() #list of booleans
new_features = [] # The list of  K best features

for bool, feature in zip(mask, X_df.columns):
    if bool:
        new_features.append(feature)

X_new=pd.DataFrame(X_np,columns=new_features,index=X_df.index)

In [None]:
x_tsetDF=PreSelectKbest.transform(X_test)
X_np =KBestModel.transform(x_tsetDF)

mask = KBestModel.get_support() #list of booleans
new_features = [] # The list of  K best features

for bool, feature in zip(mask, x_tsetDF.columns):
    if bool:
        new_features.append(feature)

X_new_test=pd.DataFrame(X_np,columns=new_features,index=x_tsetDF.index)

In [None]:
X_new['rain_district']=X_train['rain_district']
X_new_test['rain_district']=X_test['rain_district']

In [None]:
RainDistricts = X_new['rain_district'].unique()

**Create a model for each rain_district value**

In [None]:
PredPerClimateDir = {}
for RD in RainDistricts:
    print('Start:' + str(RD))
    X_new2 = X_new[X_new['rain_district']==RD]
    X_new_test2 = X_new_test[X_new_test['rain_district']==RD]
    y_train2 = y_train[X_new['rain_district']==RD]
    y_test2 = y_test[X_new_test['rain_district']==RD]
    X_new2=X_new2.drop(['rain_district'],axis=1)
    X_new_test2 = X_new_test2.drop(['rain_district'],axis=1)
    xgb_model2 = XGBClassifier( random_state=1234,n_estimators= 500,max_depth=50,min_child_weight=3)
    xgb_model2.fit(X_new2, y_train2)
    y_tst_prd = xgb_model2.predict(X_new_test2)
    PredPerClimateDir[RD] = {'y_pred':pd.Series(y_tst_prd,name='y_pred',index=X_new_test2.index),'model':xgb_model2}

Combine all models results - running on **Test** set

In [None]:
# Collect the per-district predictions into one DataFrame
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
frames = []
for RD in PredPerClimateDir.keys():
    CurrDF = PredPerClimateDir[RD]['y_pred'].to_frame()
    CurrDF['rain_district'] = RD
    frames.append(CurrDF)
DF = pd.concat(frames)

# Restore the row order of the full test set. X_new_test2 and y_test2 only hold
# the last rain district slice from the loop above, so they are not used here.
DF = DF.reindex(X_new_test.index)

In [None]:
charts.ClassicGraphicCM(DF['y_pred'],y_test,xgb_model2.classes_,normalize=True)

# Conclusion

We enriched the data by using Australia's geography, climate, and rain districts.

We then used the enriched data to fill in all the NULL values.

We carried out an extensive EDA to understand the data, and then we ran the models.

**Results:**

The dataset consists of 78% "no rain" versus 22% "rain tomorrow".
We looked at simple models such as decision tree (80% accuracy), logistic regression (85% accuracy), random forest (84% accuracy), and XGBOOST (86% accuracy).

We wanted to get to 90%, so we used a different approach:

**By running a separate model for each climate or for each rain district, we reached our goal of 90% accuracy across all stations, climates, and rain districts.**

Running XGBOOST separately on each rain district in Australia gave 90% accuracy, with 73% precision for predicting "It will rain tomorrow."
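
Accuracy and precision for the "rain" class measure different things, which is why both are reported. The sketch below derives them from a confusion matrix; the label arrays are invented for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Illustrative labels: 1 = "rain tomorrow", 0 = "no rain" (78/22 split)
y_true_demo = np.array([0] * 78 + [1] * 22)
# 74 true negatives, 4 false positives, 12 true positives, 10 false negatives
y_pred_demo = np.array([0] * 74 + [1] * 4 + [1] * 12 + [0] * 10)

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Accuracy: share of all days predicted correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision for the rain class: share of "rain" predictions that were right
precision_yes = tp / (tp + fp)

print(f"accuracy={accuracy:.2f}, precision(rain)={precision_yes:.2f}")
```

With an imbalanced target, accuracy is dominated by the "no rain" days, so the precision on the "rain" class shows how trustworthy a rain forecast actually is.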

Notice: **We don't use RISK_MM** as a feature, since RISK_MM is the amount of rain that falls on the following day and is therefore part of the target. The only place we look at it is when examining correlations.