In [None]:
#Libraries required
# Import Dependencies
%matplotlib inline

# Start Python Imports
import math, time, random, datetime

# Data Manipulation
import numpy as np
import pandas as pd

# Visualization 
import matplotlib.pyplot as plt
import missingno
import seaborn as sns
sns.set_style('whitegrid')  # the 'seaborn-whitegrid' style name is not available in newer matplotlib releases

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn import model_selection, tree, preprocessing, metrics, linear_model
from sklearn.linear_model import LinearRegression, LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier

#ignore warnings for now
import warnings
warnings.filterwarnings('ignore')

import os
print(os.listdir("../input"))

**Download the data**  
The data was downloaded from Kaggle Datasets: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

**Load the data**  
Load the data into the notebook (file : weatherAUS.csv)

In [None]:
data = pd.read_csv('../input/weatherAUS.csv')

In [None]:
# Let's see what our data looks like
print('Weather dataframe dimension: ',data.shape)
data.describe()

In [None]:
# We can see that there are a lot of NaN values in the dataframe.
# Let's check which columns have the most NaN values (the lowest non-null counts come first)
print(data.count().sort_values())

#Graph to find missing values in the dataframe
import missingno
missingno.matrix(data, figsize = (30,10))

**Feature Selection**
1. From the above result we can see that the columns **Sunshine, Evaporation, Cloud3pm, Cloud9am** have the most NaN (null) values: fewer than 60% of their rows are populated, so we do not include these columns.  
2. We also do not need the **Location** column, because we are trying to predict whether or not it will rain tomorrow and this analysis is not based on location.  
3. The **Date** column can also be removed, since it is not required for our prediction model.
4. We must remove the **RISK_MM** feature, since we are trying to predict 'RainTomorrow'. RISK_MM is the amount of rainfall in millimeters for the next day. It includes all forms of precipitation that reach the ground, such as rain, drizzle, hail and snow. Because it contains information about the future, and directly about the target variable, including it would leak future information to the model. In fact, the variable itself can be used to create the binary target: for example, if RISK_MM was greater than 0, then the RainTomorrow target variable is equal to Yes. Using it as a predictor and then testing on this dataset would therefore give a false appearance of high accuracy.
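
The 60% cut-off in point 1 can be checked programmatically. The following is a minimal sketch on a made-up toy frame; the helper name `low_coverage_columns` and its `min_fraction` parameter are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def low_coverage_columns(df, min_fraction=0.6):
    """Return the columns whose share of non-null values is below min_fraction."""
    coverage = df.notna().mean()  # fraction of non-null values per column
    return coverage[coverage < min_fraction].index.tolist()

# Toy frame standing in for the weather data (values are made up)
toy = pd.DataFrame({
    "Sunshine": [np.nan, np.nan, np.nan, 7.1, 8.0],  # 40% populated
    "MinTemp": [12.0, 9.5, np.nan, 14.2, 11.0],       # 80% populated
    "Rainfall": [0.0, 1.2, 0.0, 0.0, 3.4],            # 100% populated
})
sparse_columns = low_coverage_columns(toy)
print(sparse_columns)
```

On the real dataframe the same call would be `low_coverage_columns(data)`, assuming `data` is the frame loaded from weatherAUS.csv.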

In [None]:
data = data.drop(columns = ['Sunshine','Evaporation','Cloud3pm','Cloud9am','Location','Date','RISK_MM'])

In [None]:
print(data.shape)
data.head()

We can also write a function to track the missing values in each of the columns as below:

In [None]:
def find_missing_values(df,columns):
    missing_vals = {}
    df_length = len(df)
    for column in columns:
        total_column_values = df[column].value_counts().sum()
        missing_vals[column] = df_length - total_column_values
    return missing_vals

missing_values = find_missing_values(data,data.columns)
missing_values
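
As a cross-check, pandas can count missing values directly with `isna().sum()`. This is a small sketch on made-up data, showing that it agrees with the `value_counts()` approach used in the function above (which relies on `value_counts()` skipping NaN by default).

```python
import numpy as np
import pandas as pd

# Toy frame with a numeric and a categorical column (values are made up)
toy = pd.DataFrame({
    "MinTemp": [12.0, np.nan, 14.2, 11.0],
    "WindGustDir": ["N", None, None, "SE"],
})

# Number of missing entries per column
missing_counts = toy.isna().sum().to_dict()
print(missing_counts)

# Same result via the value_counts() route
via_value_counts = {c: len(toy) - toy[c].value_counts().sum() for c in toy.columns}
print(via_value_counts)
```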

Now let's deal with the missing (NaN) values by dropping every row that contains at least one of them

In [None]:
data = data.dropna(axis = 'index',how='any')
print(data.shape)

missing_values = find_missing_values(data,data.columns)
missing_values

In [None]:
final = pd.DataFrame()

**DATA TRANSFORMATION**

In [None]:
#Data transformation
#For the binary columns, we will change the values 'Yes' and 'No' to 1 and 0 respectively
#(assigning the result back avoids relying on an in-place replace on a column selection)
data['RainTomorrow'] = data['RainTomorrow'].map({'No': 0, 'Yes': 1})
data['RainToday'] = data['RainToday'].map({'No': 0, 'Yes': 1})

#See the unique values and convert them to indicator columns using pd.get_dummies()
categorical_columns = ['WindGustDir', 'WindDir3pm', 'WindDir9am']
for col in categorical_columns:
    print(np.unique(data[col]))
# transform the categorical columns
final = pd.get_dummies(data, columns=categorical_columns)


In [None]:
final.head()

Now let's rescale the data to the [0, 1] range (min-max normalisation)

In [None]:
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
scaler.fit(final)
final = pd.DataFrame(scaler.transform(final), index=final.index, columns=final.columns)
final.head()
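
To make the scaling step concrete: with its default settings, `MinMaxScaler` maps each column to [0, 1] using (x - min) / (max - min). The sketch below reproduces that formula in plain pandas on made-up numbers; the helper `min_max_scale` is hypothetical, and constant columns (where max equals min) are not handled here.

```python
import pandas as pd

def min_max_scale(df):
    """Rescale every column to [0, 1] as (x - min) / (max - min)."""
    col_min = df.min()
    col_range = df.max() - col_min
    return (df - col_min) / col_range

# Toy values, not taken from the dataset
toy = pd.DataFrame({
    "MinTemp": [5.0, 10.0, 15.0],
    "Humidity9am": [40.0, 70.0, 100.0],
})
scaled = min_max_scale(toy)
print(scaled)
```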

**FEATURE EXPLORATION**

**Feature 1 : 'RainTomorrow' (Target variable)**  

Description : Did it rain the next day?  
Values : 'Yes' , 'No'  
This is the dependent variable we want our machine learning model to predict, based on other independent variables  

In [None]:
#Now we will just see how many times it rained the next day?
fig = plt.figure(figsize = (20,3))
sns.countplot(y='RainTomorrow', data=final);
print(final.RainTomorrow.value_counts())

From the graph we can see how often the next day was rainy versus dry. This is a useful reference later: if our machine learning model predicts RainTomorrow = 'No' for a noticeably larger share of the test data than the share seen here, we should take a closer look at the model. In short, the plot tells us about the bias or balance of the target classes in our data.
The printed output above the graph gives the frequency counts for the binary variable.
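
The class balance can also be expressed as proportions, which gives a simple reference point: a model that always predicts the majority class already reaches an accuracy equal to the majority share. A minimal sketch with made-up labels (the variable names are illustrative only):

```python
import pandas as pd

# Made-up target labels: 1 = rain tomorrow, 0 = no rain tomorrow
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="RainTomorrow")

# Share of each class instead of raw counts
class_share = target.value_counts(normalize=True)
majority_baseline = class_share.max()

print(class_share)
print("Accuracy of always predicting the majority class:", majority_baseline)
```

On the notebook's data the equivalent call would be `final.RainTomorrow.value_counts(normalize=True)`.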

**Feature 2: MinTemp**

Description : Minimum temperature in degree celsius

In [None]:
missing_values['MinTemp']

In [None]:
data.MinTemp.value_counts()

Since it is a continuous value, let's bin it and put the result in a separate dataframe, final_bin, for visualization purposes

In [None]:
final_bin = pd.DataFrame()
final_bin['RainTomorrow'] = final['RainTomorrow']

In [None]:
final_bin['MinTemp'] = pd.cut(data['MinTemp'],bins = 5) #discretising the float numbers into categorical

In [None]:
final_bin.MinTemp.value_counts()
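
For reference, `pd.cut` with an integer `bins` argument splits the observed range into that many equal-width intervals, and nudges the lowest edge slightly below the minimum so that the minimum value falls inside the first bin. A small sketch on made-up temperatures:

```python
import pandas as pd

# Made-up values spanning 0 to 10
temps = pd.Series([0.0, 2.5, 5.0, 7.5, 10.0], name="MinTemp")

# Five equal-width bins over the observed range
binned = pd.cut(temps, bins=5)
bin_counts = binned.value_counts(sort=False)

print(binned.cat.categories)
print(bin_counts)
```

This is also why a binned column can show a lowest interval that starts just below the smallest observed value.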

In [None]:
final.head()

Now let's write a function that plots the counts and the distribution for any variable we want

In [None]:
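def plot_count_dist(df,label_column,target_column,figsize=(20,5)):
    # Left panel: counts of target_column in df (binned or scaled values).
    # Right panel: distribution of target_column taken from the global `data`
    # dataframe, split by label_column (1 = 'Yes', 0 = 'No').
    fig = plt.figure(figsize=figsize)
    plt.subplot(1,2,1)
    sns.countplot(y=target_column, data = df);
    plt.subplot(1,2,2)
    sns.distplot(data.loc[data[label_column] == 1][target_column],
                 kde_kws={"label" : "Yes"});
    sns.distplot(data.loc[data[label_column] == 0][target_column],
                 kde_kws={"label" : "No"});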

In [None]:
#Calling the function above we will visualise the MinTemp bin counts as well as the MinTemp distribution versus RainTomorrow
plot_count_dist(df= final_bin, label_column = 'RainTomorrow', target_column = 'MinTemp', figsize = (20,10))

The left plot shows the number of observations in each MinTemp bin. The right plot suggests that when MinTemp is between 0.0 and 0.3, RainTomorrow = 'No' is more frequent, and when MinTemp is greater than 0.6, RainTomorrow = 'Yes' is more frequent.

**Feature 3 : MaxTemp**

Description : The maximum temperature in degrees celsius

In [None]:
# Let's cross check the missing values
missing_values['MaxTemp']

In [None]:
data['MaxTemp'].value_counts()

Since it is a continuous value, let's bin it and add it to our final_bin dataframe.

In [None]:
final_bin['MaxTemp'] = pd.cut(data['MaxTemp'],bins = 5) #discretising the float numbers into categorical

In [None]:
final_bin['MaxTemp'].value_counts()

In [None]:
final.head()

In [None]:
#Calling the function above we will visualise the MaxTemp bin counts as well as the MaxTemp distribution versus RainTomorrow
plot_count_dist(df= final_bin, label_column = 'RainTomorrow', target_column = 'MaxTemp', figsize = (20,10))

The left plot shows the number of observations in each MaxTemp bin. The right plot suggests that when MaxTemp is between 0.0 and 0.4, RainTomorrow = 'Yes' is more frequent, and when MaxTemp is greater than 0.4, RainTomorrow = 'No' is more frequent.

**Feature 4 : Rainfall**

Description : The amount of Rainfall in mm recorded for the day

In [None]:
# Let's cross check the missing values
missing_values['Rainfall']

In [None]:
data['Rainfall'].value_counts()

In [None]:
print("There are {} unique rainfall values.".format(len(data.Rainfall.unique())))

Since it is a continuous value, let's bin it and add it to our final_bin dataframe

In [None]:
final_bin['Rainfall'] = pd.cut(data['Rainfall'],bins = 5) #discretising the float numbers into categorical

In [None]:
final_bin['Rainfall'].value_counts()

In [None]:
final.head()

In [None]:
#Calling the function above we will visualise the Rainfall bin counts as well as the Rainfall distribution versus RainTomorrow
plot_count_dist(df= final_bin, label_column = 'RainTomorrow', target_column = 'Rainfall', figsize = (20,10))

The left plot shows that most rainfall values (in mm) lie between -0.001 and 0.2, and the right plot shows that a current day's rainfall of nearly 0 mm mostly indicates no rainfall the next day.

**Feature 5 : WindGustDir**

Description : The direction of the strongest wind gust in the 24 hours to midnight

In [None]:
missing_values['WindGustDir']

We have already transformed this categorical variable using dummies, for the system to understand. So let's go ahead and visualise the data

In [None]:
WindGustDir_table = pd.crosstab(index=data["WindGustDir"], columns=data["RainTomorrow"])
WindGustDir_table

In [None]:
WindGustDir_table.plot(kind="bar", figsize=(15,8),stacked=False)

The plot suggests that RainTomorrow = 'No' is most likely when the strongest gust blows in the direction of East, Southeast or South-Southeast, and that RainTomorrow = 'Yes' is most likely when it is North, West or Northwest.
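
Raw counts in a crosstab depend on how often each wind direction occurs, so it can help to look at row-wise proportions as well. A minimal sketch on made-up data, using the `normalize="index"` option of `pd.crosstab`:

```python
import pandas as pd

# Made-up observations: wind direction and whether it rained the next day
toy = pd.DataFrame({
    "WindGustDir": ["N", "N", "N", "N", "E", "E", "E", "E"],
    "RainTomorrow": [1, 1, 0, 0, 0, 0, 0, 1],
})

# Each row sums to 1: share of 'No' (0) and 'Yes' (1) per wind direction
rain_rate = pd.crosstab(toy["WindGustDir"], toy["RainTomorrow"], normalize="index")
print(rain_rate)
```

Here column `1` holds the share of days followed by rain for each direction, which makes directions with very different frequencies comparable.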

**Feature 5 : WindGustSpeed**

Description : The speed (km/h) of the strongest wind gust in the 24 hours to midnight

In [None]:
missing_values['WindGustSpeed']

In [None]:
plot_count_dist(df= final, label_column = 'RainTomorrow', target_column = 'WindGustSpeed', figsize = (20,10))

The left plot shows the unique values of WindGustSpeed and their counts, whereas the right plot suggests that if WindGustSpeed is 0-50 on the current day, it will mostly not rain the next day. Otherwise, when WindGustSpeed is greater than 50, the next day is more likely to see rain.

**Feature 6 : WindDir9am**

Description : Direction of the wind at 9am

In [None]:
missing_values['WindDir9am']

We have already transformed this categorical variable using dummies, for the system to understand. So let's go ahead and visualise the data

In [None]:
WindDir9am_table = pd.crosstab(index=data["WindDir9am"], columns=data["RainTomorrow"])
WindDir9am_table

In [None]:
WindDir9am_table.plot(kind="bar", figsize=(15,8),stacked=False)

For RainTomorrow = ‘No’, the 9am wind mostly blows in the direction of East, Southeast and South-Southeast; for RainTomorrow = ‘Yes’, it mostly blows in the direction of North-Northwest, North and North-Northeast.

**Feature 7 : WindDir3pm**

Description : Direction of the wind at 3pm

In [None]:
missing_values['WindDir3pm']

We have already transformed this categorical variable using dummies, for the system to understand. So let's go ahead and visualise the data

In [None]:
WindDir3pm_table = pd.crosstab(index=data["WindDir3pm"], columns=data["RainTomorrow"])
WindDir3pm_table

In [None]:
WindDir3pm_table.plot(kind="bar", figsize=(15,8),stacked=False)

For RainTomorrow = ‘Yes’, the 3pm wind mostly blows in the direction of North, West and West-Northwest; for RainTomorrow = ‘No’, it mostly blows in the direction of South, Southeast and West-Southwest.

**Feature 8 : WindSpeed9am**

Description : Wind speed (km/hr) averaged over 10 minutes prior to 9am

In [None]:
missing_values['WindSpeed9am']

In [None]:
plot_count_dist(df= final, label_column = 'RainTomorrow', target_column = 'WindSpeed9am', figsize = (20,10))

The left plot shows the unique values of WindSpeed9am and their frequency. The right plot suggests that for a WindSpeed9am of 0-20 km/hr, RainTomorrow = 'No' is more frequent, and above 20 km/hr, RainTomorrow = 'Yes' is more frequent.

**Feature 9 : WindSpeed3pm**

Description : Wind speed (km/hr) averaged over 10 minutes prior to 3pm

In [None]:
missing_values['WindSpeed3pm']

In [None]:
plot_count_dist(df= final, label_column = 'RainTomorrow', target_column = 'WindSpeed3pm', figsize = (20,10))

The left graph shows the frequencies of the unique values in WindSpeed3pm, and the right graph suggests that when WindSpeed3pm reaches 20 km/hr, RainTomorrow = 'Yes' is more frequent, while above 25 km/hr RainTomorrow = ‘No’ becomes more likely.

**Feature 10 : Humidity9am**

Description : Humidity at 9am in %

In [None]:
missing_values['Humidity9am']

In [None]:
plot_count_dist(df= final, label_column = 'RainTomorrow', target_column = 'Humidity9am', figsize = (20,10))

The left plot shows the frequencies of the unique humidity values, and the right graph suggests that if Humidity9am is between 0 and 70%, RainTomorrow is mostly 'No', and if Humidity9am is above 70% on the current day, RainTomorrow = 'Yes' is more frequent. We can also see that at 100% humidity, RainTomorrow = 'Yes' occurs twice as often as RainTomorrow = 'No'.

**Feature 11 : Humidity3pm**

Description : Humidity at 3pm in %

In [None]:
missing_values['Humidity3pm']

In [None]:
plot_count_dist(df= final, label_column = 'RainTomorrow', target_column = 'Humidity3pm', figsize = (20,10))

The left graph gives the frequencies of the unique values of Humidity3pm, and the right graph clearly separates Humidity3pm into two ranges: (1) 0-60%, where RainTomorrow = 'No' dominates, and (2) greater than 60%, where RainTomorrow = 'Yes' dominates.

**Feature 12 : Pressure9am**

Description : Atmospheric pressure reduced to mean sea level at 9am, measured in hPa

In [None]:
missing_values['Pressure9am']

In [None]:
final['Pressure9am'].value_counts()

Since it is a floating-point value, let's bin it and add it to our final_bin dataframe

In [None]:
final_bin['Pressure9am'] = pd.cut(data['Pressure9am'],bins = 5) #discretising the float numbers into categorical

In [None]:
final_bin['Pressure9am'].value_counts()

In [None]:
final.head()

In [None]:
plot_count_dist(df= final_bin, label_column = 'RainTomorrow', target_column = 'Pressure9am', figsize = (20,10))

The graph on the left is a binned histogram of Pressure9am, and the plot on the right suggests that a pressure of 0-1015 hPa gives a higher chance of RainTomorrow = 'Yes', while pressure above that range mostly means RainTomorrow = 'No'.

**Feature 13 : Pressure3pm**

Description : Atmospheric pressure reduced to mean sea level at 3pm, measured in hPa

In [None]:
missing_values['Pressure3pm']

In [None]:
final['Pressure3pm'].value_counts()

Since it is a floating-point value, let's bin it and add it to our final_bin dataframe

In [None]:
final_bin['Pressure3pm'] = pd.cut(data['Pressure3pm'],bins = 5) #discretising the float numbers into categorical


In [None]:
final_bin['Pressure3pm'].value_counts()

In [None]:
final.head()

In [None]:
plot_count_dist(df= final_bin, label_column = 'RainTomorrow', target_column = 'Pressure3pm', figsize = (20,10))

The graph on the left is a binned histogram of Pressure3pm, and the plot on the right suggests that a pressure of 0-1012 hPa gives a higher chance of RainTomorrow = 'Yes', while pressure above that range mostly means RainTomorrow = 'No'.

**Feature 14 : Temp9am**

Description : Temperature at 9am, measured in degrees Celsius

In [None]:
missing_values['Temp9am']

In [None]:
final['Temp9am'].value_counts()

Since it is a continuous value, let's bin it and add it to our final_bin dataframe

In [None]:
final_bin['Temp9am'] = pd.cut(data['Temp9am'],bins = 5) #discretising the float numbers into categorical

In [None]:
final_bin['Temp9am'].value_counts()

In [None]:
final.head()

In [None]:
plot_count_dist(df= final_bin, label_column = 'RainTomorrow', target_column = 'Temp9am', figsize = (20,10))

The graph on the left shows the binned histogram for Temp9am and we can see the most frequent temperature being in [14.22,22.88]. The right graph conveys that temperature between 7 and 15 degree Celsius mostly results in RainTomorrow being 'Yes', otherwise results in RainTomorrow being ‘No’.

**Feature 15 : Temp3pm**

Description : Temperature at 3pm, measured in degree Celsius

In [None]:
missing_values['Temp3pm']

In [None]:
final['Temp3pm'].value_counts()

Since it is a continuous value, let's bin it and add it to our final_bin dataframe

In [None]:
final_bin['Temp3pm'] = pd.cut(data['Temp3pm'],bins = 5) #discretising the float numbers into categorical


In [None]:
final_bin['Temp3pm'].value_counts()

In [None]:
final.head()

In [None]:
plot_count_dist(df= final_bin, label_column = 'RainTomorrow', target_column = 'Temp3pm', figsize = (20,10))

The graph on the left shows the binned histogram for Temp3pm, and we can see that the most frequent temperatures lie in [19.7,28.7]. The right graph suggests that a temperature between 0 and 19 degrees Celsius mostly results in RainTomorrow being 'Yes', while a temperature greater than 20 degrees Celsius mostly results in RainTomorrow being 'No'.

**Feature 16 : RainToday**

Description : Precipitation of current day. Boolean: 1 if precipitation (in mm) in the 24 hours to 9am exceeds 1mm, otherwise 0

In [None]:
#Now we will just see how many times it rained the current day?
fig = plt.figure(figsize = (20,3))
sns.countplot(y='RainToday', data=final);
print(final.RainToday.value_counts())

From the graph we can see how often it rained on the current day versus how often it did not. The printed output above the graph gives the frequency counts for the binary RainToday values 'Yes' and 'No' (encoded as 1 and 0).

In [None]:
final.head()

**CORRELATION MATRIX**

In [None]:
f, ax = plt.subplots(figsize=(18, 18))
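# Only numeric columns can be correlated; the wind direction columns in `data` are still strings
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)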
plt.xticks(rotation=90)

Below are the insights drawn from correlation matrix:
*	MinTemp and MaxTemp are highly correlated with r=0.7  
*	Humidity3pm and Humidity9am are highly correlated with r=0.6  
*	Temp9am and MinTemp are highly correlated with r=0.9  
*	Temp9am and MaxTemp are highly correlated with r=0.9  
*	Temp3pm and MinTemp are highly correlated with r=0.6  
*	Temp3pm and MaxTemp are highly correlated with r=0.9  
*	Also, pressure and temperature, as well as humidity and temperature, are negatively correlated.  
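
Pairs like the ones listed above can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The following is a sketch on made-up numbers; the helper `strongly_correlated_pairs` and its `threshold` parameter are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(df, threshold=0.6):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

# Toy frame (values are made up): Temp9am tracks MinTemp closely, Humidity9am does not
toy = pd.DataFrame({
    "MinTemp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Temp9am": [2.0, 3.0, 4.0, 5.0, 7.0],
    "Humidity9am": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = strongly_correlated_pairs(toy)
print(pairs)
```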

Now we are done with pre-processing and have a basic idea of what each feature is and how it relates to the target variable RainTomorrow.
Let's see which features are most important for predicting RainTomorrow

In [None]:
final_bin.head()

In [None]:
final.shape

In [None]:
#Let's get hold of the independent variables and assign them as X

X = final.loc[:, final.columns != 'RainTomorrow']
y = final['RainTomorrow']
X.shape

In [None]:
# PCA to find how many components are needed, based on the cumulative explained variance
#Fitting the PCA algorithm with our data
from sklearn.decomposition import PCA
pca = PCA().fit(X)
#Plotting the Cumulative Summation of the Explained Variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (fraction)')
plt.title('WeatherAUS Dataset Explained Variance')
plt.show()
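
Instead of reading the number of components off the curve, it can be computed from the cumulative sum. A minimal sketch, assuming a made-up array of explained-variance ratios in place of `pca.explained_variance_ratio_`; the helper `components_for_variance` is hypothetical.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target=0.90):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= target
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(cumulative))

# Made-up ratios (they sum to 1), standing in for pca.explained_variance_ratio_
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
n_components = components_for_variance(ratios, target=0.90)
print(n_components)
```

With a fitted PCA object the call would be `components_for_variance(pca.explained_variance_ratio_)`.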

In [None]:
#Using SelectKBest to get the top features!
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=40)
selector.fit(X, y)
X_new = selector.transform(X)
print(X.columns[selector.get_support(indices=True)]) #top 40 columns

In [None]:
X = final[['MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
       'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp3pm',
       'RainToday', 'WindGustDir_E', 'WindGustDir_ENE', 'WindGustDir_ESE',
       'WindGustDir_N', 'WindGustDir_NNW', 'WindGustDir_NW', 'WindGustDir_W',
       'WindGustDir_WNW', 'WindDir3pm_E', 'WindDir3pm_ENE', 'WindDir3pm_ESE',
       'WindDir3pm_N', 'WindDir3pm_NNW', 'WindDir3pm_NW', 'WindDir3pm_SE',
       'WindDir3pm_SW', 'WindDir3pm_W', 'WindDir3pm_WNW', 'WindDir9am_E',
       'WindDir9am_ENE', 'WindDir9am_ESE', 'WindDir9am_N', 'WindDir9am_NNE',
       'WindDir9am_NNW', 'WindDir9am_NW', 'WindDir9am_SE', 'WindDir9am_SSE',
       'WindDir9am_W', 'WindDir9am_WNW']] # let's use all 40 features
y = final['RainTomorrow'] # a 1-D Series, which is the shape scikit-learn expects for the target

In [None]:
#Split the data into train and test data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25)
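
Because the target classes are imbalanced, one option is a stratified split, which keeps the share of rainy days the same in the training and test sets. The sketch below uses made-up data; the `stratify` and `random_state` arguments are not used in the cell above and are shown here only as an assumption about how the split could be made reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 16 rows, 4 of them positive (25% 'Yes')
X_toy = np.arange(32).reshape(16, 2)
y_toy = np.array([0] * 12 + [1] * 4)

# stratify keeps the class proportions equal in both parts; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```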

Let us start building our predictive models

**Model 1 : Logistic Regression**

In [None]:
from sklearn.metrics import accuracy_score
import time
t0=time.time()
logreg = LogisticRegression(random_state=0, class_weight={0:0.3,1:0.7})
logreg = logreg.fit(X_train,y_train)
y_predLR = logreg.predict(X_test)
score = accuracy_score(y_test,y_predLR)
print('Accuracy :',score)
print('Time taken :' , time.time()-t0)





**Model 2 : Decision Tree**

In [None]:
t0=time.time()
#X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25)
dt = DecisionTreeClassifier(random_state=0,class_weight={0:0.3,1:0.7})
dt.fit(X_train,y_train)
y_predDT = dt.predict(X_test)
score = accuracy_score(y_test,y_predDT)
print('Accuracy :',score)
print('Time taken :' , time.time()-t0)

**Model 3 : RandomForest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
t0=time.time()
rf = RandomForestClassifier(n_estimators=100, max_depth=4,random_state=0,class_weight={0:0.3,1:0.7})
rf.fit(X_train,y_train)
y_predRF = rf.predict(X_test)
score = accuracy_score(y_test,y_predRF)
print('Accuracy :',score)
print('Time taken :' , time.time()-t0)

**Model 4 : BalancedBagging Classifier**

In [None]:
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

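#Creating an object of the classifier.
#The base estimator is passed positionally because its keyword was renamed
#from base_estimator to estimator in newer imbalanced-learn releases.
t0=time.time() #restart the timer so that only this model is timed
bbc = BalancedBaggingClassifier(DecisionTreeClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=0)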

#Training the classifier.
bbc.fit(X_train, y_train)
y_predBBC = bbc.predict(X_test)
score = accuracy_score(y_test,y_predBBC)
print('Accuracy :',score)
print('Time taken :' , time.time()-t0)

> **CONFUSION MATRICES**

In [None]:
from sklearn.metrics import confusion_matrix
pred_models = []
pred_models.append(('LogisticRegression', y_predLR))
pred_models.append(('DecisionTree', y_predDT))
pred_models.append(('RandomForest', y_predRF))
pred_models.append(('BalancedBaggingClassifier', y_predBBC))


for name, pred_model in pred_models:
    cm = confusion_matrix(y_test, pred_model)
    #print(cm)
    plt.figure(figsize = (3,3))
    sns.heatmap(cm,fmt="d",annot=True,xticklabels=["No","Yes"],yticklabels=["No","Yes"],cbar=False)
    plt.title(name+" "+"Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actuals")
    plt.show()
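
Since recall is the metric of interest here, it helps to see how it follows from a confusion matrix. In scikit-learn's layout, rows are the actual classes and columns the predicted classes, so for labels 0 and 1 the matrix reads [[TN, FP], [FN, TP]]. The counts below are made up, and the helper name is illustrative only.

```python
import numpy as np

def recall_and_precision(cm):
    """Compute recall and precision for the positive class from a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    recall = tp / (tp + fn)      # of the days it actually rained, how many were caught
    precision = tp / (tp + fp)   # of the days predicted rainy, how many actually were
    return recall, precision

# Made-up counts in scikit-learn's layout: [[TN, FP], [FN, TP]]
cm_toy = np.array([[80, 10],
                   [5, 15]])
recall, precision = recall_and_precision(cm_toy)
accuracy = np.trace(cm_toy) / cm_toy.sum()
print("Recall:", recall, "Precision:", precision, "Accuracy:", accuracy)
```

In this toy case accuracy is high even though a quarter of the rainy days are missed, which is why accuracy alone can be misleading on imbalanced data.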

RESULTS ACHIEVED:

•	As discussed under the gaps in previous work, earlier kernels measured the performance of their machine learning models by accuracy, and most of those models had a recall of about 50%. This means that when it actually did rain the next day, the model was right only 50% of the time.
In this work, a recall as high as 68.26% has been achieved, which means that when it actually did rain the next day, our predictions are correct about 68% of the time. This is an increase in recall of **18.26 percentage points**.

•	The average accuracy obtained in the previous kernels is 85%. In this work, even though recall was increased, an accuracy of 82.66% was still achieved.

•	A better understanding of the data has been achieved through exploratory analysis of every variable (all 17 features). The relationship of each independent variable to the dependent variable, and its distribution, is now much clearer, which helps in predicting the dependent variable.

•	For dimensionality reduction, Principal Component Analysis (PCA) was used to find how many features to include in the machine learning model for best performance. From the graph obtained, it is evident that out of 62 features (after feature engineering), 40 were needed to explain 90% of the variance in the data (PCA measures variance in the input features, not in the target variable).

•	Logistic regression stood out as the best predictive model for this dataset, with the best accuracy and recall scores. It also outperformed the ensemble models on speed, training and predicting in as little as 1.34 sec, compared with the Bagging classifier, which took 16 sec.
