In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In this project, the objective is to predict the occurrence of rain in Australia. We are given 10 years' worth of historical data collected from various parts of Australia. We will use this data to build a supervised Machine Learning model for classification.

For this project, we will first explore the dataset to gain insights that help with the modelling, then process the data, and finally use it to create our model.

To start our project, we will load the relevant libraries and import the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

In [None]:
data = '/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv'

df = pd.read_csv(data)
df.head()

# Exploratory Data Analysis

First we need to understand what kind of data we have. We will start by doing some basic analysis of the dataset.

## Data Types

In [None]:
df.shape

In [None]:
df.info()

It appears that we have **145,460** rows of data. We have a total of **23 columns**: **16 numerical columns** and **7 object columns**.

Note that the **Date** column is still an object type, which means that we will need to convert it to a date format later.

For our Target Feature, **RainTomorrow**, the data type is string. 

In [None]:
#Group the Categorical And Numerical column
categorical = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
numerical = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']

Next, to make the analysis of our **columns, or features,** more systematic, we group them into Numerical Features and Categorical Features.

## Categorical Data

For our Categorical Features, we have Location, WindGustDir, WindDir9am, WindDir3pm, RainToday, and RainTomorrow. We will begin by describing the features.

In [None]:
df[categorical].describe()

From the description, we can take 2 important notes. First, from the count of data in each feature, we can see that **there is missing data in all of our Categorical Features, except for Location**. Second, Location has 49 unique values, WindGustDir, WindDir9am, and WindDir3pm each have 16 unique values, and RainToday and RainTomorrow each have 2 unique values.

In [None]:
print(df['Location'].value_counts())

We can see all of the **49 unique values** that the **Location feature** has, and how many rows each one contains.

In [None]:
print(df['WindGustDir'].value_counts())

In [None]:
print(df['WindDir9am'].value_counts())

In [None]:
print(df['WindDir3pm'].value_counts())

Next are **WindGustDir, WindDir9am, and WindDir3pm**, which each have **16 unique values**, as seen above. **All 3 of them share the same 16 values**, since all of them represent a wind direction.

In [None]:
print(df['RainToday'].value_counts())

In [None]:
print(df['RainTomorrow'].value_counts())

Lastly, we have the **RainToday and RainTomorrow** features, which both have the same **2 unique values**, Yes and No. This makes both of these columns **Boolean type** features.

Boolean data is data that contains only two different values, such as Yes or No, True or False, or Correct or Wrong.

## Numerical Data

Since we have done the basic analysis of the Categorical Features, we can now continue to explore our Numerical Features.

In [None]:
df[numerical].describe()

From the description of our Numerical Features, we can see the **Count** (how many rows of data), **Mean** (average value of the data), **Standard Deviation or STD** (variation or dispersion of a set of values), **Min** (smallest value), **25%** (Q1), **50%** (Q2), **75%** (Q3), and **Max** (largest value).

**Evaporation, Rainfall, WindGustSpeed, and WindSpeed9am** caught my eye, since the gap between their Q3 (75%) and Max values is large. This suggests that these four features contain **a lot of outliers**.
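The "large gap between Q3 and Max" observation can be made concrete with Tukey's 1.5 × IQR rule. The sketch below is illustrative only: the helper name `count_iqr_outliers` and the toy values are assumptions, not part of the notebook.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy rainfall-like values: mostly small, with one extreme reading
toy_rainfall = pd.Series([0.0, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 50.0])
n_outliers = count_iqr_outliers(toy_rainfall)
print(n_outliers)
```

On the real data, something like `df[numerical].apply(count_iqr_outliers)` should give one count per feature, which can then be compared against the box plots.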

## Data Visualization

From our initial analysis, we have gained a number of insights from our dataset. But with those insights come a **few questions that we want to answer before proceeding with our analysis**. Some of the questions are:

1. Since the number of rows differs between locations, does it affect the distribution of rain in each city?
2. Since we have identified 4 features with possible outliers, is it possible that the rest of the features also have outliers?
3. How do we check the distribution of data in each of the features?
4. How about Null Values? How many does each feature have?
5. Can we check the correlation of the Independent Features with the Target Feature?

We will try to answer these questions using visualization.

### Rain Distribution in Each City

In [None]:
rain_chance = df.loc[(df.RainToday == 'Yes')]

fig, ax= plt.subplots(figsize=(10,15))

sns.histplot(y='Location' ,hue='RainTomorrow', data=rain_chance, ax=ax, bins=49)
ax.set_title('Number of Rain for Each City', size=17, pad=17)
plt.tight_layout()
ax.set_xlabel('Count', size=13, labelpad=11)
ax.set_ylabel('Location', size=13)

We create this visualization to look at the distribution of rain in each city. From the visualization, we can see more clearly how the amount of data differs between cities. **Not every city has the same amount of data**.

Also, if we aggregate the distribution of rain for each city, **36 cities have a higher count for one outcome than the other**. This means that we are dealing with an **Imbalanced Dataset**.

An Imbalanced Dataset is one where one value of the Target Feature has a much higher count than the other. This will need to be addressed later if we want to create a good model.
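A quick numeric check of class imbalance is to look at the share of each target value. This is a minimal sketch on a made-up column standing in for `df['RainTomorrow']`; the values and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy target column standing in for df['RainTomorrow'] (values are made up)
toy_target = pd.Series(['No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No'])

# Share of each class; a large gap between the shares indicates imbalance
class_share = toy_target.value_counts(normalize=True)
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(imbalance_ratio)
```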

### Outliers and Distribution

In [None]:
# Visualize each numerical feature to inspect its outliers and distribution
fig, axis = plt.subplots(16, 2, figsize = (20, 30))

for x, col in enumerate(numerical):
    # Box plot (left column) is used to check for outliers
    sns.boxplot(y = col, data = df, ax = axis[x][0], color = 'gold')
    # Histogram (right column) is used to see the distribution of the data
    sns.histplot(data=df, x=col, color="skyblue", ax=axis[x][1])

fig.suptitle('Rain in Australia Numerical Feature Visualization', fontsize = 16, y=1)
plt.tight_layout()

From this visualization, we can see the outliers of each feature in the left column using a Box Plot, and the distribution of the data in the right column using a Histogram. After careful study of the outliers and distributions, **the features can be divided into 3 categories**.

First are the features which have a **normal distribution** and **no outliers**. These are **MinTemp, MaxTemp, Temp9am, Temp3pm, Cloud9am, Cloud3pm, and Humidity3pm**.

Second are the features with a **normal distribution** but **some outliers**. These are **Humidity9am, Pressure9am, and Pressure3pm**.

Third are the features with a **skewed distribution and a lot of outliers**. These are **WindSpeed9am, WindSpeed3pm, WindGustSpeed, Sunshine, Rainfall, and Evaporation**.

### Null Values

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent_1 = df.isnull().sum()/df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data

Next, we are going to inspect the **Null Values of each feature**. This is important because later we need to impute these missing values.

At the top we have **Sunshine, Evaporation, Cloud3pm, and Cloud9am with more than 35% missing values**. Since these features have such a high share of missing values, it will be best to drop them, to avoid introducing bias or noise when we finally create our model.

Next we have **some features with around 0-10% missing values**. These missing values can be imputed, and we will determine the method of filling them based on their distribution and outliers.

We also need to check our **Target Feature, RainTomorrow, which has 3267 missing values**, or around 2.2%, which we **need to remove**. As the target feature, it can't have any missing values, and rather than filling them in it is best to simply remove those rows.

### Correlation of Features

Lastly, we want to check the **correlation between features**. Since we have a lot of features, we will **focus on the correlation of our Independent Features with the Target Feature**.

In [None]:
data = df.apply(lambda x: x.factorize()[0]).corr(method='pearson')
plt.figure(figsize=(15,11))
sns.heatmap(data, linecolor='white',linewidths=1, cmap="YlGnBu", annot=True)
plt.title('Rain in Australia Feature Correlation', size=30)
figure = plt.gcf()
figure.set_size_inches(20, 20)
plt.show()

We can see that there are **3 features that are highly correlated** with our Target Feature: **Rainfall, Humidity, and RainToday**. Out of curiosity, we will also visualize the Temperature and Pressure features.

Out of the categorical features, it appears that the **Wind Direction features** have the **lowest correlation** with our target feature.

In [None]:
fig, ax= plt.subplots(2, 2, figsize=(20,16))

sns.histplot(x='Rainfall', hue='RainTomorrow', data=df, ax=ax[0][0], bins=10)
ax[0][0].set_title('Rainfall', size=17, pad=17)
ax[0][0].set_xlabel('Rainfall (mm)', size=13, labelpad=11)
ax[0][0].set_ylabel('Days', size=13)

sns.scatterplot(x='Humidity9am', y='Humidity3pm', hue='RainTomorrow',data=df, ax=ax[0][1])
ax[0][1].set_title('Humidity', size=17, pad=17)
ax[0][1].set_xlabel('Humidity9am', size=13, labelpad=11)
ax[0][1].set_ylabel('Humidity3pm', size=13)

sns.scatterplot(x='Temp9am', y='Temp3pm', hue='RainTomorrow', data=df, ax=ax[1][0])
ax[1][0].set_title('Temperature', size=17, pad=17)
ax[1][0].set_xlabel('Temp9am', size=13, labelpad=11)
ax[1][0].set_ylabel('Temp3pm', size=13)

sns.scatterplot(x='Pressure9am', y='Pressure3pm', hue='RainTomorrow', data=df, ax=ax[1][1])
ax[1][1].set_title('Pressure', size=17, pad=17)
ax[1][1].set_xlabel('Pressure9am', size=13, labelpad=11)
ax[1][1].set_ylabel('Pressure3pm', size=13)

From the visualization, we can see that **each feature shows a characteristic range of values for each value of our Target Feature**. This is important because, without a clear boundary separating the Target Feature values, we would need to process the data more carefully.

This visualization also further confirms the **outlier-heavy distribution of the Rainfall feature**.

# Data Processing

## Missing Values

Now that we have a clearer picture of our dataset, we can start to process the data for the Machine Learning Model. We will **begin with processing the Null Values**.

In [None]:
# Remove the features that have more than 35% null values
nullfeature=['Sunshine', 'Evaporation', 'Cloud3pm', 'Cloud9am']
df.drop(nullfeature, axis=1, inplace=True)
df.head()

To start, we **eliminate the features that have more than 35% missing values**, since synthesizing that many values could create bias and noise.

In [None]:
nullfeature=['WindGustDir', 'WindDir9am', 'WindDir3pm']
df.drop(nullfeature, axis=1, inplace=True)
df.head()

Next, we also **remove WindGustDir, WindDir9am, and WindDir3pm**, since they are similar features with identical values, and have a **low correlation** with our target feature.

Although we have eliminated the features with a high share of Null Values, we still have to deal with the missing values in our remaining features. To impute the missing values, we will use 3 different methods: Mean, Median, and Mode. The Mean is the average value of the column, the Median is the middle value of the column, and the Mode is the most frequently occurring value in the column.

To determine which method to use, we will look at the distribution and the outliers of each column.

For **MinTemp, MaxTemp, Temp9am, Temp3pm, & Humidity3pm, we will impute using the Mean**, since the distribution is fairly normal, and there are no outliers.

For **Humidity9am, Pressure9am, & Pressure3pm, we will use the Median**, since although the distribution is fairly normal, there are several outliers.

Lastly, for **WindSpeed9am, WindSpeed3pm, WindGustSpeed, & Rainfall, we will use the Mode**, since the distribution is skewed, and there are many outliers.
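To see why the choice of statistic matters, here is a small sketch on a made-up skewed column. The values are assumptions chosen so that one extreme reading pulls the mean far away from the median and mode.

```python
import pandas as pd

# Toy skewed feature with one extreme value and two missing entries
toy = pd.Series([0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 94.0, None, None])

mean_value = toy.mean()        # pulled upward by the extreme value
median_value = toy.median()    # middle of the sorted non-missing values
mode_value = toy.mode()[0]     # most frequent value; mode() returns a Series

# Fill the gaps with the mode, as the notebook does for skewed features
filled_with_mode = toy.fillna(mode_value)

print(mean_value, median_value, mode_value)
```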

In [None]:
# Mean for roughly normal features without outliers
for col in ['MaxTemp', 'MinTemp', 'Temp9am', 'Temp3pm', 'Humidity3pm']:
    df[col] = df[col].fillna(df[col].mean())

# Median for roughly normal features with some outliers
for col in ['Humidity9am', 'Pressure9am', 'Pressure3pm']:
    df[col] = df[col].fillna(df[col].median())

# Mode for skewed features with many outliers
# mode() returns a Series, so take its first entry
for col in ['WindSpeed9am', 'Rainfall', 'WindSpeed3pm', 'WindGustSpeed']:
    df[col] = df[col].fillna(df[col].mode()[0])

Now that we have filled the missing numerical values with the methods that we determined before, we will continue by **filling the categorical values**.

Since Location has no missing values, and we have deleted 3 categorical features (WindGustDir, WindDir9am, & WindDir3pm), **we only have to deal with the missing values in RainToday and RainTomorrow**.

For **RainTomorrow, we will delete the rows with missing values**, since as the target feature we don't want to synthesize its values, which could create bias in our analysis.

To impute the **missing RainToday values, we will use backward fill (`bfill`)**, which fills each gap with the next valid value in the column. Because our data is sorted by date, we assume that neighbouring days have similar weather, which should be more accurate than using the Mean, Median, or Mode.
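To make the fill direction explicit: in pandas, `bfill` (backward fill) copies the next valid observation backward into a gap, while `ffill` (forward fill) copies the previous one forward. A toy sketch, with made-up values:

```python
import pandas as pd

# Toy RainToday-like column ordered by date, with one gap
toy = pd.Series(['No', None, 'Yes', 'Yes'])

backward = toy.bfill()  # the gap takes the NEXT valid value
forward = toy.ffill()   # the gap takes the PREVIOUS valid value

print(backward.tolist())
print(forward.tolist())
```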

In [None]:
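# Backward fill: each missing RainToday takes the next valid value
df['RainToday'] = df['RainToday'].bfill()
# Rows without a target value cannot be used for training
df.dropna(subset = ["RainTomorrow"], inplace=True)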

In [None]:
df.isnull().sum()

Now that we have dealt with the missing values in our dataset, we can continue to process the features for the Machine Learning Model.

## Data Scaling and Binning

### Categorical Features Processing

Now we will begin to process our features. **Location is an important feature**, but most Machine Learning models can't take categorical data as input directly. That is why we need to convert the Location feature into a numerical representation.

In this case, we will **create an identifier for our Location feature**. An identifier here means a set of numerical values that characterise each Location. We will create this identifier using the data that we have about each Location.

Specifically, **we will aggregate the Temperature, Humidity, Pressure, and WindSpeed of each city**. This will create 4 new features, and each city will have its own values. This will allow our model to make use of the Location feature.

Once we create the identifier, we will merge the new features into a new dataset called **df_new**.
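The idea of a per-city identifier can be sketched on a toy frame: average a reading per city, then attach the average back to every row of that city. This is a simplified illustration that uses a plain per-city mean rather than the notebook's two-step grouping; the readings are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Location': ['Albury', 'Albury', 'Cobar', 'Cobar'],
    'Temp3pm': [20.0, 24.0, 30.0, 34.0],
    'Humidity3pm': [50.0, 70.0, 20.0, 30.0],
})

# One row per city holding that city's average readings
city_profile = (toy.groupby('Location', as_index=False)[['Temp3pm', 'Humidity3pm']]
                   .mean()
                   .rename(columns={'Temp3pm': 'Temp', 'Humidity3pm': 'Humidity'}))

# Attach the city-level averages back onto every row of that city
toy_with_profile = toy.merge(city_profile, on='Location', how='left')
print(toy_with_profile)
```

After the merge, rows from the same city share the same `Temp` and `Humidity` values, so the text column `Location` can be dropped without losing the city-level signal.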

In [None]:
# For each city: average within each (Location, 9am value) group, then average
# the 3pm reading over those groups. numeric_only=True skips the text columns.
data = (df.groupby(['Location', 'WindSpeed9am'], as_index=False).mean(numeric_only=True)
            .groupby('Location')['WindSpeed3pm'].mean())
total_wind = pd.DataFrame(data).reset_index()

data2 = (df.groupby(['Location', 'Humidity9am'], as_index=False).mean(numeric_only=True)
            .groupby('Location')['Humidity3pm'].mean())
total_humidity = pd.DataFrame(data2).reset_index()

data3 = (df.groupby(['Location', 'Pressure9am'], as_index=False).mean(numeric_only=True)
            .groupby('Location')['Pressure3pm'].mean())
total_pressure = pd.DataFrame(data3).reset_index()

data4 = (df.groupby(['Location', 'Temp9am'], as_index=False).mean(numeric_only=True)
            .groupby('Location')['Temp3pm'].mean())
total_temp = pd.DataFrame(data4).reset_index()

# Combine the four per-city tables into one, keyed by Location
total1 = pd.merge(total_wind, total_humidity, on='Location')
total2 = pd.merge(total_pressure, total_temp, on='Location')
city_total = pd.merge(total1, total2, on='Location')
city_total.columns = ['Location', 'WindSpeed', 'Humidity', 'Pressure', 'Temp']

In [None]:
df_new = pd.merge(df, city_total, on='Location')
df_new.head()

Now we have 4 new features that act as an identifier for our Location. Since we no longer need the Location feature, **we will delete it**, and continue to process the RainToday and RainTomorrow features.

In [None]:
df_new.drop(['Location'], axis=1, inplace=True)

In [None]:
# Encode Yes/No as integer codes; categories are sorted, so No -> 0 and Yes -> 1
df_new['RainToday'] = df_new['RainToday'].astype('category').cat.codes
df_new['RainTomorrow'] = df_new['RainTomorrow'].astype('category').cat.codes

In [None]:
df_new.head()

Since **RainToday and RainTomorrow are Boolean data**, they are simple to process because they can easily be **converted into 1 and 0**, 1 for Yes and 0 for No. Now that all our Categorical Features have been processed, we can continue to process our Numerical Features.

### Numerical Features Processing

Now we will begin to **scale our Numerical Features**. Scaling is important to bring features with different magnitudes, units, and ranges onto a comparable scale. We will process them using StandardScaler, MinMaxScaler, and Binning.

**StandardScaler** converts the values of a feature so that it has a Mean equal to 0 and a Standard Deviation equal to 1.

**MinMaxScaler** converts the values into the range between 0 and 1.

**Binning** changes the values of a feature into an ordinal range, which we will have to define manually.
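The two scalers apply simple per-column formulas. As a rough sketch, the NumPy expressions below mirror what `StandardScaler` and `MinMaxScaler` compute with their default settings; the input values are made up.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Standardization: subtract the mean, divide by the (population) standard deviation
standardized = (values - values.mean()) / values.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1
min_max = (values - values.min()) / (values.max() - values.min())

print(standardized.mean(), standardized.std())
print(min_max)
```

Note that min-max scaling is sensitive to extreme values: a single large outlier compresses all the other values toward 0.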

First, let's describe our dataset to give a broader context.

In [None]:
df_new.describe()

First we will process the **features with a high number of outliers**. These are WindGustSpeed, WindSpeed9am, WindSpeed3pm, and Rainfall, which we will scale using **MinMaxScaler**.

For the identifier features for Location, which are Temp, WindSpeed, Humidity, and Pressure, we will use **StandardScaler**.

In [None]:
# Both scalers work column by column, so each group can be transformed in one call
minmax_cols = ['Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm']
standard_cols = ['Temp', 'WindSpeed', 'Humidity', 'Pressure']

df_new[minmax_cols] = MinMaxScaler().fit_transform(df_new[minmax_cols])
df_new[standard_cols] = StandardScaler().fit_transform(df_new[standard_cols])

In [None]:
df_new.describe()

Now that we have scaled some of our Numerical Features, it's time to scale the rest. **For the rest of the numerical features, we will use the Binning Method**.

The Binning Method changes the values of each feature into an ordinal range, by defining value ranges for the feature. This method is good for simplifying our data and making it easier for our model to learn, but it carries a risk of over-generalizing the data.

To limit that risk, we will define the ranges for the binning using the description of each feature that we ran before.
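One way to express this kind of binning in pandas is `pd.cut`, which maps each value to the index of the interval it falls into. A minimal sketch with made-up temperatures and edges (the edge values here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy_temp = pd.Series([-2.0, 3.5, 7.0, 12.9, 31.0])

# Right-closed bins: (-inf, 0], (0, 5], (5, 10], (10, 15], (15, inf)
edges = [-np.inf, 0, 5, 10, 15, np.inf]

# labels=False returns the integer index of the bin instead of an interval label
binned = pd.cut(toy_temp, bins=edges, labels=False)
print(binned.tolist())
```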

In [None]:
# Bin edges per feature. Each value is truncated to an integer first, then mapped
# to the index of the right-closed interval (edge[i], edge[i+1]] it falls into.
inf = np.inf
bin_edges = {
    'MinTemp':     [-inf, 0, 5, 10, 15, 20, 25, 30, inf],
    'MaxTemp':     [-inf, 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, inf],
    'Humidity9am': [-inf, 10, 20, 30, 40, 50, 60, 70, 80, 90, inf],
    'Humidity3pm': [-inf, 10, 20, 30, 40, 50, 60, 70, 80, 90, inf],
    'Pressure9am': [-inf, 1000, 1005, 1010, 1013, 1015, 1017, 1019, 1021, 1031, inf],
    'Pressure3pm': [-inf, 980, 990, 1005, 1011, 1013, 1015, 1017, 1019, 1024, 1029, 1035, inf],
    'Temp9am':     [-inf, 0, 5, 10, 15, 20, 25, 30, 35, inf],
    'Temp3pm':     [-inf, 0, 5, 10, 15, 20, 25, 30, 35, 40, inf],
}

for col, edges in bin_edges.items():
    df_new[col] = pd.cut(df_new[col].astype(int), bins=edges, labels=False)

In [None]:
df_new.describe()

Now that we have finished Binning our data, our last processing step is to **break down the Date feature**. Although Date carries useful information, a Machine Learning Model can't use date-typed data directly.

Therefore, we will **break our Date feature into 3 different features: Day, Month, and Year**.

In [None]:
df_new['Date'] = pd.to_datetime(df_new['Date'])

df_new['Year'] = df_new['Date'].dt.year
df_new['Month'] = df_new['Date'].dt.month
df_new['Day'] = df_new['Date'].dt.day

df_new.drop(['Date'], axis=1, inplace=True)
df_new.head()

# Feature Selection

Before we create a Machine Learning Model with our data, we need to **reduce the dimensionality of our data** by removing some of its features. We need to do this because not all of the features that we have are useful for making predictions.

But we can't just delete features without any analysis. This is where a Feature Selection Method becomes useful. We will remove features using the **Embedded Method**.

The Embedded Method is a feature selection technique in which the selection happens as part of training a Machine Learning model: the model assigns an importance score to each feature, and features with low scores are removed. We chose this method because it is fast to compute, accounts for interactions between features, gives good accuracy, and reduces the chance of over-fitting.
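As a rough illustration of embedded selection, the sketch below trains a random forest on synthetic data where only the first column drives the label, then keeps the features whose importance reaches a threshold. The data, the `0.04` threshold, and the variable names are assumptions for this toy example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)   # only column 0 drives the label

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# prefit=True reuses the already-trained forest instead of fitting a new one
selector = SelectFromModel(forest, threshold=0.04, prefit=True)

importances = forest.feature_importances_
kept = selector.get_support()            # boolean mask of features at or above the threshold
X_selected = selector.transform(X_toy)

print(importances, kept, X_selected.shape)
```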

For this method, we first need to **split our data into two parts**, one for the Independent Features and the other for the Target Feature.

In [None]:
X = df_new.drop('RainTomorrow', axis=1)
Y = df_new['RainTomorrow']

In [None]:
clf = RandomForestClassifier()
clf = clf.fit(X,Y)

feat_importance = pd.Series(clf.feature_importances_, index=X.columns)
feat_importance.plot(kind='barh')
plt.title('Feature Importances')
plt.show()
print(feat_importance)

From this method, we can see the **importance of each feature for predicting our Target Feature**. Now that we know the importances, we can set a lower limit. **We will set the lower limit to 0.04** so that we can remove some of the less useful features.

In [None]:
embed = SelectFromModel(clf,threshold = 0.04, prefit=True)
X_new = embed.transform(X)

print('Before Embed', X.shape)
print('After Embed', X_new.shape)

Now that we have removed some of the less important features, we can continue to create our Machine Learning Model.



# Machine Learning Model

For our Machine Learning Model, there are two points to highlight. **First, since we are dealing with an Imbalanced Dataset**, we will need to do either Over-Sampling or Under-Sampling.

Over-Sampling carries a risk of over-fitting our model, while Under-Sampling carries a risk of losing valuable information. With this in mind, we will use Over-Sampling for our model, using SMOTE.
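SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation step in NumPy. It is a simplified illustration, not imblearn's implementation: the helper name `smote_like_samples` is hypothetical, and it uses the single nearest neighbour instead of choosing among several.

```python
import numpy as np

def smote_like_samples(minority: np.ndarray, n_new: int, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points on the segments between minority points
    and their nearest minority neighbour (simplified, hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority points; ignore the self-distance
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    base_idx = rng.integers(0, len(minority), size=n_new)
    gaps = rng.random((n_new, 1))          # interpolation factor in [0, 1)
    base = minority[base_idx]
    neighbour = minority[nearest[base_idx]]
    return base + gaps * (neighbour - base)

minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority_points, n_new=6)
print(synthetic.shape)
```

Because every synthetic point lies between two existing minority points, over-sampling adds no genuinely new information, which is where the over-fitting risk comes from.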

Second, since we don't know the best model for our analysis, based on our Exploratory Data Analysis and Data Processing, we c**onclude that two possible best model** for our analysis would be **K Nearest Neighbor (KNN), or Random Forest Classifier**. We will try both of these model. 

We chose the **KNN model** because it classifies a new observation by measuring its distance to the training observations and taking the majority class among the nearest ones. **Random Forest** was chosen because it builds many decision trees, each on a random subset of the rows and features, and combines their votes to classify a new observation. 
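
To make the distance-based idea behind KNN concrete, here is a minimal NumPy sketch of a nearest-neighbour majority vote on a toy dataset. The function `knn_predict` and the toy arrays are hypothetical and only illustrate the principle; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest rows."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every row
    nearest = np.argsort(dists)[:k]                   # indices of the k closest rows
    votes = np.bincount(y_train[nearest])             # count labels among the neighbours
    return int(np.argmax(votes))

# Two well-separated toy clusters: class 0 near the origin, class 1 near (3, 3).
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.1, 0.1])))  # prints 0
print(knn_predict(X_toy, y_toy, np.array([3.0, 3.1])))  # prints 1
```

Since the vote depends entirely on distances, features on larger scales dominate unless the data is scaled first.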

To score our models, there are **4 metrics to consider**. **Accuracy** is the ratio of correct predictions to all predictions. **Precision** is the ratio of correctly predicted positive observations to all predicted positive observations. **Recall** is the ratio of correctly predicted positive observations to all observations that are actually positive (RainTomorrow = yes). The **F1 Score** is the harmonic mean of Precision and Recall.
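
As a small worked example, the four metrics can be computed by hand from the counts of a confusion matrix. The counts below are made up for illustration and are not results from this dataset.

```python
# Hypothetical confusion-matrix counts for 500 predictions (not real results).
tp = 60    # rain predicted, rain observed
fp = 20    # rain predicted, no rain observed
fn = 40    # no rain predicted, rain observed
tn = 380   # no rain predicted, no rain observed

accuracy = (tp + tn) / (tp + fp + fn + tn)            # 440 / 500
precision = tp / (tp + fp)                            # 60 / 80
recall = tp / (tp + fn)                               # 60 / 100
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.880 precision=0.750 recall=0.600 f1=0.667
```

Note how accuracy looks strong at 0.88 even though the model misses 40% of the rainy days, which is what Recall and F1 reveal.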

In [None]:
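X_train, X_test, Y_train, Y_test = train_test_split(X_new, Y, test_size=0.3, random_state=42)

# Oversample only the training split, so that no test rows (or synthetic
# points derived from them) leak into the data the models are fitted on
oversample = SMOTE(random_state=42)
X_over, Y_over = oversample.fit_resample(X_train, Y_train)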

## K Nearest Neighbor

In [None]:
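# Imports used from here on, in case they are not already loaded at the top of the notebook
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score, RandomizedSearchCV

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_over, Y_over)
knn_pred = knn.predict(X_test)
knn_test_score = accuracy_score(Y_test, knn_pred)
acc_knn = cross_val_score(knn, X_over, Y_over, cv=5)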

## Random Forest Classifier

In [None]:
random_forest = RandomForestClassifier(max_depth=5)
random_forest.fit(X_over, Y_over)
random_pred = random_forest.predict(X_test)
random_test_score = accuracy_score(Y_test, random_pred)
acc_random = cross_val_score(random_forest, X_over, Y_over, cv=5)

## Model Results

In [None]:
results = pd.DataFrame({
    'Model': ['Random Forest', 'KNN Model'],
    'Train Score': [acc_random.mean(), acc_knn.mean()],
    'Test Score': [random_test_score, knn_test_score]          
              })
result_df = results.sort_values(by='Train Score', ascending=False)
result_df = result_df.set_index('Model')
result_df

From the result, we can see that **KNN performs better than Random Forest**. Normally models are scored by 4 different metrics: Accuracy, Precision, Recall, and F1. But for our **initial analysis**, we simply report accuracy as a Train Score (the mean 5-fold cross-validation accuracy on the oversampled data) and a Test Score (the accuracy on the test data). 

# HyperParameter Tuning

Both of our models already perform well, but we can still try to improve them with **hyperparameter tuning**. A hyperparameter is a model parameter whose value is set by the user rather than learned from the data. If we choose these values well, we can increase the score of our model. 

To find good values for our hyperparameters, we first need to search for them. We will use **RandomizedSearchCV**, a hyperparameter search method that evaluates a fixed number (`n_iter`) of randomly sampled combinations from the values we define, scoring each one with cross-validation. 
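
As a rough illustration of what a randomized search covers, the sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` to compare the size of the full grid with the 25 combinations that a randomized search with `n_iter=25` would draw. The dictionary `param_demo` is a stand-in written for this example, mirroring a Random Forest search space.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Stand-in search space for a Random Forest (4 hyperparameters, 3 values each).
param_demo = {'n_estimators': [100, 300, 500],
              'max_depth': [4, 5, 6],
              'min_samples_split': [2, 4, 6],
              'min_samples_leaf': [1, 3, 5]}

grid_size = len(ParameterGrid(param_demo))   # every possible combination
sampled = list(ParameterSampler(param_demo, n_iter=25, random_state=1))

print(grid_size)      # prints 81
print(len(sampled))   # prints 25
print(sampled[0])     # one of the randomly drawn combinations
```

A randomized search therefore fits far fewer models than an exhaustive grid search, at the cost of possibly missing the single best combination.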

For hyperparameter tuning, we will **search for the values that maximize our F1 Score**. Even though we used accuracy for our initial test, for our **final result we will choose the best model by F1 Score, since we are dealing with an imbalanced dataset**. F1 is the more informative metric here because it is the harmonic mean of Precision and Recall, so it is only high when both are high. We will still show all 4 metrics in the final result, to give a broader picture of how our models score. 
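
To see why accuracy alone can mislead on imbalanced data, consider a baseline that always predicts "no rain". The sketch below uses a made-up 78/22 class split, chosen only for illustration, and scikit-learn's `accuracy_score` and `f1_score`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 78 dry days (0) and 22 rainy days (1).
y_true = [0] * 78 + [1] * 22
# A useless baseline that never predicts rain.
y_pred = [0] * 100

baseline_acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when no positives are predicted.
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(baseline_acc)  # prints 0.78
print(baseline_f1)   # prints 0.0
```

The baseline reaches 78% accuracy without ever detecting rain, while its F1 Score of 0 exposes the problem.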

## RandomizedSearchCV

In [None]:
param = {'n_neighbors' : [3,5,7],
         'p' : [1, 2],
         'algorithm' : ['ball_tree', 'kd_tree']}

# The grid above has only 3 * 2 * 2 = 12 combinations, so n_iter is capped at 12
search_knn = RandomizedSearchCV(knn, param_distributions = param, n_iter=12, scoring='f1', n_jobs=-1, cv=3, random_state=1)
result_knn = search_knn.fit(X_over, Y_over)

print('Best Score: %s' % result_knn.best_score_)
print('Best Hyperparameters: %s' % result_knn.best_params_)

In [None]:
param = {'n_estimators': [100, 300, 500],
         'max_depth': [4, 5, 6],
         'min_samples_split':[2, 4, 6],
         'min_samples_leaf': [1, 3, 5]}

random_forest = RandomForestClassifier()
search = RandomizedSearchCV(random_forest, param_distributions = param, n_iter=25, scoring='f1', n_jobs=-1, cv=3, random_state=1)
result = search.fit(X_over, Y_over)

print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Now that we have the **values from our hyperparameter search**, the next step is to enter them into our Machine Learning models. 

We will input the values into both of our models, K Nearest Neighbor and Random Forest Classifier. 

## K Nearest Neighbor

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3, algorithm = 'kd_tree', p = 1)
knn.fit(X_over,Y_over)
knn_pred = knn.predict(X_test)
acc_knn = cross_val_score(knn, X_over, Y_over, cv=5)
knn_acc_score = accuracy_score(Y_test, knn_pred)
knn_prec_score = precision_score(Y_test, knn_pred)
knn_rec_score = recall_score(Y_test, knn_pred)
knn_f1_score = f1_score(Y_test, knn_pred)

## Random Forest

In [None]:
random_forest = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, min_samples_leaf = 1,  max_depth = 6)
random_forest.fit(X_over, Y_over)
random_pred = random_forest.predict(X_test)
acc_random = cross_val_score(random_forest, X_over, Y_over, cv=5)
random_acc_score = accuracy_score(Y_test, random_pred)
random_prec_score = precision_score(Y_test, random_pred)
random_rec_score = recall_score(Y_test, random_pred)
random_f1_score = f1_score(Y_test, random_pred)

## Model Results

In [None]:
results = pd.DataFrame({
    'Score Index': ['Random Forest Classifier', 'K Nearest Neighbor'],
    'Train Score': [acc_random.mean(), acc_knn.mean()],
    'Accuracy Test' : [random_acc_score, knn_acc_score],
    'Precision Test' : [random_prec_score, knn_prec_score],
    'Recall Test' : [random_rec_score, knn_rec_score],
    'F1 Test' : [random_f1_score, knn_f1_score]
    })
result_forest = results
result_forest

After hyperparameter tuning, the results of both models improved. The **KNN model's training score increased** from 85.7% to 88.3%, and the model also has high accuracy and recall scores. **The Random Forest model's training score also increased** after tuning, from 80.1% to 81.1%. From this, we conclude that **hyperparameter tuning successfully increased our model scores**. 

Since we are dealing with an imbalanced dataset, the **best metric for choosing the final model is the F1 Score**, which is the harmonic mean of Precision and Recall.

Therefore, we conclude that the **best model for this analysis is the K Nearest Neighbor model**, with an F1 Score of 86.2%. 