# Introduction 

The objective of this notebook is to predict whether it will rain tomorrow from today's weather conditions. The dataset is weatherAUS.csv, which contains 160689 entries of weather data recorded between 2007 and 2017 at 49 locations in Australia. Because geography plays a vital role in weather, we import the geo-coordinates of each location from the worldcitiespop dataset and a geocoding API, so that the model can use latitude and longitude as features and make more accurate predictions. <br>
There are four sections in this notebook: <br>
* Import and Join Data
* Deal with Missing Values
* Exploratory Data Analysis
* Make Prediction with XGBoost

In [None]:
# import libraries and environment setting 
import numpy as np 
import pandas as pd 
import missingno as msno
from math import floor
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import re
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import  plot_confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_theme()
sns.set_palette('colorblind')


# Import and Join Data

Import the weatherAUS and worldcitiespop datasets; the GPS coordinates will be extracted from the latter.

In [None]:
df = pd.read_csv('/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv')
coordinates = pd.read_csv('../input/world-cities-database/worldcitiespop.csv',usecols = ['Country','AccentCity','Latitude','Longitude'])
df.head()

In [None]:
coordinates = coordinates[coordinates.Country=='au'].drop(['Country'],axis = 1)
coordinates.head()

After keeping only the Australian rows of the coordinates table, join the two data frames on Location and AccentCity.


In [None]:
df = df.merge(coordinates,how = 'left', left_on='Location', right_on='AccentCity')
df.info()
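
A left join keeps every weather row, but if a city name appears more than once in the coordinates table, each matching weather row is repeated once per duplicate. The sketch below uses small made-up tables (the coordinate values and the `cities_unique` name are illustrative assumptions, not taken from the datasets) to show one way to guard against that with `drop_duplicates` and the `validate` argument of `merge`.

```python
import pandas as pd

# Toy stand-ins for the weather table and the gazetteer (values are illustrative).
weather = pd.DataFrame({"Location": ["Albury", "Albury", "NorahHead"],
                        "Rainfall": [0.6, 0.0, 1.2]})
cities = pd.DataFrame({"AccentCity": ["Albury", "Albury", "Sydney"],
                       "Latitude": [-36.08, -36.08, -33.87],
                       "Longitude": [146.92, 146.92, 151.21]})

# Keep one coordinate row per city name so the left join cannot multiply rows.
cities_unique = cities.drop_duplicates(subset="AccentCity")

# validate="many_to_one" raises an error if the right-hand keys are not unique.
merged = weather.merge(cities_unique, how="left",
                       left_on="Location", right_on="AccentCity",
                       validate="many_to_one")

# Locations absent from the gazetteer keep their rows but get NaN coordinates.
missing = merged.loc[merged["Latitude"].isna(), "Location"].unique()
```

Comparing `len(merged)` with `len(weather)` is a quick check that the join did not change the number of observations.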

Some locations still have missing coordinates; we will look these up with the Positionstack Geocoding API below.

In [None]:
locations = df.Location[df.Latitude.isna()].unique()
locations

In [None]:
params = {'access_key':'b8a7f574448bf69e3fa2fe41cc5e4fa8','country':'AU'} 
base_url = 'http://api.positionstack.com/v1/forward'
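for address in locations:
    # 'PearceRAAF' does not split into a searchable name, so it is special-cased
    if address == 'PearceRAAF':
        query_address ='RAAF Base Pearce'
    else:
        # split CamelCase location names into separate words
        query_address = ' '.join(re.findall('[A-Z][a-z]*', address))
    params['query'] =  query_address
    response = requests.get(base_url, params = params).json()
    try:
        df.loc[df.Location==address,'Latitude'] = response['data'][0]['latitude']
        df.loc[df.Location==address,'Longitude'] = response['data'][0]['longitude']
    except (KeyError, IndexError, TypeError):
        # no usable result: show the request and the response, then stop
        print(params)
        print(response)
        break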
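
The query string is built by splitting the CamelCase location name with a regular expression. The following sketch isolates that step in a helper (the name `split_camel_case` is hypothetical, introduced only for illustration) and shows why a name ending in an acronym needs special handling.

```python
import re

def split_camel_case(name: str) -> str:
    """Join the capitalised words of a CamelCase name with spaces."""
    # Each match is one uppercase letter followed by any lowercase letters.
    return " ".join(re.findall(r"[A-Z][a-z]*", name))

two_words = split_camel_case("NorahHead")
one_word = split_camel_case("Albury")
# An acronym is broken into single letters, which is not a useful search query.
acronym = split_camel_case("PearceRAAF")
```

Because every uppercase letter starts a new match, "RAAF" becomes four separate tokens, so mapping that location to a hand-written query is the simpler option.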


# Deal with Missing Values

Plot missing values with msno.

In [None]:
msno.matrix(df)
df.isna().sum()

Columns with a large number of missing values ("Evaporation", "Sunshine", "Cloud9am", and "Cloud3pm") will be dropped, along with rows where the target "RainTomorrow" is missing. <br />  We will also extract the year and day of year from Date.


In [None]:
df.drop(["Evaporation","Sunshine","Cloud9am","Cloud3pm","AccentCity"], axis = 1, inplace = True)
df.dropna(axis=0,subset=['RainTomorrow'], inplace= True)
df['Year'] = pd.to_datetime(df['Date']).dt.year
df['Day_of_year'] = pd.to_datetime(df['Date']).dt.dayofyear


Here we sort the data by location and time, then forward-fill missing values across at most 10 consecutive rows.

In [None]:
df.sort_values(by = ['Location','Year','Day_of_year'],inplace = True)
# DataFrame.ffill is the non-deprecated equivalent of fillna(method='ffill')
df.ffill(limit = 10, inplace = True)
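
A forward fill over the whole sorted frame can carry the last value of one location into the first rows of the next. As a hedged alternative, not what the cell above does, the sketch below compares a plain forward fill with a group-wise one on a made-up frame (the `toy` data and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Location": ["A", "A", "B", "B"],
    "Day_of_year": [1, 2, 1, 2],
    "Humidity3pm": [40.0, np.nan, np.nan, 70.0],
})
toy = toy.sort_values(["Location", "Day_of_year"])

# Plain forward fill: the first day at B inherits the last value seen at A.
plain = toy["Humidity3pm"].ffill(limit=10)

# Group-wise forward fill: gaps are only filled from the same location.
grouped = toy.groupby("Location")["Humidity3pm"].ffill(limit=10)
```

With the group-wise version, a gap at the start of a location's record stays missing and is removed later together with the other incomplete rows.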

Now that we have acquired the geo-coordinates and transformed Date into the desired format, we can drop "Location" and "Date". <br>
Any rows that still contain missing values will be deleted as well.

In [None]:
df.drop(["Location", "Date"], axis =1,inplace= True)
df.dropna(axis=0,how="any",inplace= True)
df.reset_index(drop = True, inplace = True)
df.isna().sum()

# Exploratory Data Analysis

First, let's check the target variable 'RainTomorrow', then inspect the distributions of the numeric and categorical predictor variables.

In [None]:
print(df.shape)
print(df['RainTomorrow'].value_counts(normalize = True))
sns.countplot(x='RainTomorrow', data=df)


* There are 99581 observations and 21 columns left.
* RainTomorrow is positive in 22% of entries.

In [None]:
cat_cols = df.columns[(df.dtypes=='O') & (df.columns!='RainTomorrow')]
num_cols = df.columns[df.dtypes!='O']

fig, axes = plt.subplots(4,4,figsize=(25, 25))
for i,col in enumerate(num_cols):
    plt_col = i % 4
    plt_row = i // 4
    sns.boxplot(ax = axes[plt_row,plt_col], data = df, y = col, linewidth=2.5)
df.describe()

* Rainfall and WindGustSpeed have some extreme values, which could be caused by extreme weather conditions.
* The majority of observations are located in southeastern Australia, the most densely populated region of the country.


In [None]:
for col in cat_cols:
    print('Column','"'+ col +'"', 'has', df[col].nunique(),'unique values:\n' )
    print(df[col].value_counts(), '\n\n------------------------------\n')

* Wind directions are encoded as 16 values.
* The proportion of "RainToday" is similar to "RainTomorrow", which makes sense.

We will explore the effects of the predictor variables on the target variable below.

In [None]:
fig, axes = plt.subplots(4,4,figsize=(25, 25))
for i,col in enumerate(num_cols):
    plt_col = i % 4
    plt_row = i // 4
    sns.kdeplot(ax = axes[plt_row,plt_col], data = df, x = col, hue = "RainTomorrow")

* Humidity in the afternoon is an effective predictor of RainTomorrow.
* The chance of rain is slightly higher in the middle of the year, which is winter in the southern hemisphere.


In [None]:
plt.figure(figsize=(20,15))
sns.kdeplot(data = df, x = 'Longitude',y = 'Latitude',hue = 'RainTomorrow')

* It rains more often in the southeastern area than in other regions.

In [None]:
fig, axes = plt.subplots(4,1,figsize=(15, 40))
for i,col in enumerate(cat_cols):
    chart = sns.countplot(ax =axes[i],x= col ,hue = 'RainTomorrow' ,data = df)

* It is more likely to rain tomorrow if it rains today.

# Make Prediction with XGBoost

Let's convert wind directions into angles first.

In [None]:
dirangle_map = {'N':0, 'NNE':22.5, 'NE':45, 'ENE':67.5, 'E':90, 'ESE': 112.5, 'SE':135, 'SSE':157.5, 'S':180, 'SSW':202.5, 'SW':225, 'WSW':247.5, 'W':270, 'WNW':292.5, 'NW':315, 'NNW':337.5 }
bool_map = {'No':0, 'Yes':1}
df.replace({"WindGustDir": dirangle_map, 'WindDir9am':dirangle_map, 'WindDir3pm':dirangle_map, 'RainToday':bool_map, 'RainTomorrow':bool_map }, inplace = True)
df.info()
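
The angles in the mapping follow a simple rule: the 16 compass points are evenly spaced, so each one lies 360 / 16 = 22.5 degrees clockwise from the previous one. As a minimal sketch, the same dictionary can be generated rather than typed out (the names `COMPASS_POINTS`, `STEP`, and `compass_angles` are illustrative, not from the notebook).

```python
# The 16 compass points in clockwise order, starting from north.
COMPASS_POINTS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
                  "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

# Each point is 360 / 16 = 22.5 degrees further clockwise than the previous one.
STEP = 360 / len(COMPASS_POINTS)
compass_angles = {point: i * STEP for i, point in enumerate(COMPASS_POINTS)}
```

Generating the mapping avoids typing errors in the 16 hand-written values and makes the spacing explicit.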

In [None]:
rng = 42
X = df.drop(['RainTomorrow'],axis = 1)
y = df['RainTomorrow']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rng,stratify =y)

Here, we fix the learning rate of the XGBClassifier at 0.1 and search for the optimal max_depth and n_estimators with RandomizedSearchCV.

In [None]:
param_grid = {
    'max_depth': range(2, 13),
    'n_estimators':range(300, 1500)
}
# learning_rate is the scikit-learn wrapper's name for the booster parameter eta
clf = xgb.XGBClassifier(learning_rate = 0.1)
randomized_clf = RandomizedSearchCV(estimator=clf,param_distributions=param_grid,scoring = 'accuracy',n_iter = 7, cv = 3, random_state = rng)
randomized_clf.fit(X_train,y_train)
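
To put the search budget in perspective, the short calculation below counts how many parameter combinations the two ranges define and how many model fits the randomized search performs. It only restates the settings used in the cell above; the variable names are illustrative.

```python
max_depth_values = range(2, 13)          # 11 candidate depths
n_estimators_values = range(300, 1500)   # 1200 candidate tree counts
n_iter, cv = 7, 3                        # settings passed to RandomizedSearchCV

# Number of distinct (max_depth, n_estimators) pairs an exhaustive grid would try.
grid_size = len(max_depth_values) * len(n_estimators_values)

# Each sampled candidate is fitted once per cross-validation fold.
search_fits = n_iter * cv
```

The search therefore evaluates only 7 of the 13200 possible combinations, so the reported best parameters are the best of a small random sample rather than a guaranteed optimum.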

In [None]:
print("Best parameters: ", randomized_clf.best_params_)
print("Best Score: ", randomized_clf.best_score_)
features = pd.DataFrame(randomized_clf.best_estimator_.feature_importances_,index = X.columns)
features.sort_values(by = 0, ascending = True, inplace = True)
plt.figure(figsize=(20,15))
features.plot(kind = 'barh')

* The best hyperparameters found by RandomizedSearchCV are 1464 estimators and a max_depth of 11.
* Humidity3pm is the most important feature, followed by WindGustSpeed, Pressure3pm, and Longitude.

In [None]:
y_pred = randomized_clf.best_estimator_.predict(X_test)
plot_confusion_matrix(randomized_clf.best_estimator_, X_test, y_test)
print(classification_report(y_test,y_pred))

* The model reaches 89% accuracy on the test set.
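
Because only about 22% of the observations are positive, accuracy is easier to interpret next to a majority-class baseline, that is, a model that always predicts the more common class. The sketch below computes that baseline on a made-up label vector with the same class balance (the helper `majority_baseline_accuracy` and the `y_toy` array are assumptions for illustration).

```python
import numpy as np

def majority_baseline_accuracy(y) -> float:
    """Accuracy of always predicting the more frequent of two classes (0 or 1)."""
    y = np.asarray(y)
    positive_rate = y.mean()
    return float(max(positive_rate, 1 - positive_rate))

# Toy labels with 22% positives, mirroring the class balance reported earlier.
y_toy = np.array([1] * 22 + [0] * 78)
baseline = majority_baseline_accuracy(y_toy)
```

On labels with this balance the baseline is about 78%, so the per-class precision and recall in the classification report are a useful complement to the overall accuracy figure.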
