**Will it rain or won't it rain? I gotta know so I know how to dress!!**

In [None]:
import numpy as np
import pandas as pd
pd.set_option("display.max_columns",100)
pd.set_option("display.max_rows",120)

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
params={"figure.facecolor":(0.0,0.0,0.0,0),
        "axes.facecolor":(1.0,1.0,1.0,1),
        "savefig.facecolor":(0.0,0.0,0.0,0)}
plt.rcParams.update(params)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report,roc_auc_score,roc_curve,confusion_matrix

import warnings
warnings.filterwarnings("ignore")

In [None]:
df=pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv")
df.head()

Let's check our data:

In [None]:
df.info()

There are 142193 readings and 23 columns, of which *RainTomorrow* is our target variable. We also have a mixture of numerical and categorical variables, and some missing values (which we shall tackle when we split the data into the training and testing sets).

First let's view our target variable ***RainTomorrow***:

In [None]:
df["RainTomorrow"].value_counts()

In [None]:
sns.countplot(x=df["RainTomorrow"],palette=["lightcoral","skyblue"])
plt.ylabel("Count")

In [None]:
df["RainTomorrow"]=df["RainTomorrow"].apply(lambda x:0 if x=="No" else 1)

Now let's go through and check the values for each feature. We will start with the numerical features:

In [None]:
df.describe().drop(["RainTomorrow"],axis=1).T

In [None]:
for column in df.select_dtypes(exclude="object").drop(["RainTomorrow"],axis=1).columns:
    print(column,":",df[column].isnull().sum(),"missing values.")

Quite a number of missing values, which we will impute after we split the data.

For numerical features, it is important to check for outliers, as extreme values can hurt the model's performance.

In [None]:
fig,axes=plt.subplots(1,2,figsize=(12,5))

df[df.select_dtypes(exclude="object").columns.drop(["Pressure9am","Pressure3pm","RainTomorrow"])].plot(kind="box",color="#AE9CCD",ax=axes[0])
axes[0].set_xticklabels(axes[0].get_xticklabels(),rotation=90)
axes[0].set_ylabel("Measurement")

df[["Pressure9am","Pressure3pm"]].plot(kind="box",color="#AE9CCD",ax=axes[1])

From the above boxplots, we have quite a number of outliers beyond 1.5 times the interquartile range. But because there are no hard bounds for weather data (extreme weather events do happen), we will not be removing all of these outliers. If we did, we would create an overly clean dataset that doesn't properly reflect real-world weather. Instead let's just examine the most extreme values in *Rainfall*, *Evaporation* and *WindSpeed9am*:

In [None]:
fig,axes=plt.subplots(1,3,figsize=(15,4))

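# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot.
sns.histplot(df["Rainfall"],bins=12,kde=True,stat="density",color="lightskyblue",ax=axes[0])
sns.histplot(df["Evaporation"],bins=12,kde=True,stat="density",color="lightcoral",ax=axes[1])
sns.histplot(df["WindSpeed9am"],bins=12,kde=True,stat="density",color="lightgreen",ax=axes[2])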

Although it is possible to achieve these amounts of rainfall, evaporation and wind speed - for example in a storm or heatwave event - we remove them from the dataset so the model doesn't think these extreme weather events are common.

In [None]:
droppers=df.loc[(df["Rainfall"]>300)|(df["Evaporation"]>100)|(df["WindSpeed9am"]>100)]
df.drop(droppers.index,inplace=True)

In [None]:
print("We have dropped {num1} rows, so now instead of the initial 142193 readings, we have {num2}.".format(num1=142193-df.shape[0],num2=df.shape[0]))
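For reference, the 1.5 × IQR rule behind the boxplot whiskers can be sketched on a handful of made-up readings. The helper name `iqr_outlier_mask` and the toy values are assumptions for illustration only, not part of the dataset or the analysis above:

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy rainfall readings in mm (made up for illustration).
toy_rainfall = pd.Series([0.0, 0.0, 0.2, 1.0, 2.4, 3.0, 5.0, 120.0])
flagged = toy_rainfall[iqr_outlier_mask(toy_rainfall)]
print(flagged.tolist())
```

On real weather data this rule flags far more readings than we would want to drop, which is why only the most extreme values were removed by hand.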

Let's continue with the categorical features:

In [None]:
df.select_dtypes(include="object").describe()

For categorical features, it is important to check the actual categories and change the format into numbers. Remember we will only impute the missing data after we split the data.

***Date***

- There are 3436 unique values in the format YYYY-MM-DD. Rather than treating each date as its own category, we will parse the dates and keep only the month, since rainfall is seasonal and the year and day of the month tell us much less.

In [None]:
print("{num} missing values.".format(num=df["Date"].isnull().sum()))

In [None]:
df["Date"]=pd.to_datetime(df["Date"])

In [None]:
df["Month"]=df["Date"].dt.month

- Now we can drop the *Date* column:

In [None]:
df.drop(["Date"],axis=1,inplace=True)
df.head(2)

***Location***

- We will not be dropping *Location* because rain is regional.

In [None]:
print("{num} missing values.".format(num=df["Location"].isnull().sum()))

In [None]:
df["Location"].value_counts()

- We will convert these categories into numbers when we impute the missing values after we split the data.

***WindGustDir***

In [None]:
print("{num} missing values.".format(num=df["WindGustDir"].isnull().sum()))

In [None]:
df["WindGustDir"].value_counts()

- We will convert these categories into numbers when we impute the missing values after we split the data.

***WindDir9am***

In [None]:
print("{num} missing values.".format(num=df["WindDir9am"].isnull().sum()))

In [None]:
df["WindDir9am"].value_counts()

- We will convert these categories into numbers when we impute the missing values after we split the data.

***WindDir3pm***

In [None]:
print("{num} missing values.".format(num=df["WindDir3pm"].isnull().sum()))

In [None]:
df["WindDir3pm"].value_counts()

- We will convert these categories into numbers when we impute the missing values after we split the data.

***RainToday***

In [None]:
print("{num} missing values.".format(num=df["RainToday"].isnull().sum()))

In [None]:
df["RainToday"].value_counts()

- We will also convert this text into numbers, using a simple mapping that leaves missing entries as NaN so they can be imputed after we split the data:

In [None]:
df["RainToday"]=df["RainToday"].map({"No":0,"Yes":1})
df.head(2)
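One detail worth checking when turning Yes/No text into numbers is what happens to missing entries. The toy series below is made up for illustration; it compares a lambda with a catch-all `else` branch against a dictionary mapping:

```python
import numpy as np
import pandas as pd

# Toy Yes/No column with one missing entry (made up for illustration).
toy = pd.Series(["No", "Yes", np.nan, "No"])

# A lambda with a catch-all else branch sends NaN to 1.
with_lambda = toy.apply(lambda x: 0 if x == "No" else 1)

# A dictionary mapping leaves unmatched entries (including NaN) as NaN.
with_map = toy.map({"No": 0, "Yes": 1})

print(with_lambda.tolist())
print(with_map.isnull().sum())
```

With the mapping, the missing reading stays missing and can be filled later from the training set instead of silently counting as a rainy day.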

So this is what our data looks like now:

In [None]:
df.head()

Before we tackle the missing values or scale the data, we must first split the data into the training and testing sets to ensure we do not cause any data leakage.

In [None]:
x=df.drop(["RainTomorrow"],axis=1)
y=df["RainTomorrow"]

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.33,random_state=7)

In [None]:
print("Training set shape:",x_train.shape)
print("Testing set shape:",x_test.shape)

To replace the missing values, we will compute a fill value for the numerical and categorical features based on the training set and then apply them to the testing set.

In [None]:
x_train.isnull().sum()

In [None]:
x_test.isnull().sum()

Missing values in numerical features will be filled with the median. We could in fact use the mean or a set constant instead, but because of the range and the number of outliers in the data we will use the median.

In [None]:
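# Compute the medians on the training set only, then fill both sets with them.
num_cols=x_train.select_dtypes(exclude="object").columns
train_medians=x_train[num_cols].median()
x_train[num_cols]=x_train[num_cols].fillna(train_medians)
x_test[num_cols]=x_test[num_cols].fillna(train_medians)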

In [None]:
x_train.isnull().sum()

In [None]:
x_test.isnull().sum()

Missing values in categorical features will be filled with the mode.

In [None]:
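# Compute the modes on the training set only, then fill both sets with them.
cat_cols=x_train.select_dtypes("object").columns
train_modes=x_train[cat_cols].mode().iloc[0]
x_train[cat_cols]=x_train[cat_cols].fillna(train_modes)
x_test[cat_cols]=x_test[cat_cols].fillna(train_modes)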

In [None]:
x_train.isnull().sum()

In [None]:
x_test.isnull().sum()
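The same fit-on-train, apply-to-test idea can also be written with scikit-learn's `SimpleImputer`. This is only a sketch on made-up numbers, not what the notebook runs; `strategy="most_frequent"` would be the counterpart for the categorical columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy humidity readings (made up for illustration).
toy_train = pd.DataFrame({"Humidity": [40.0, 60.0, np.nan, 80.0]})
toy_test = pd.DataFrame({"Humidity": [np.nan, 100.0]})

# The imputer learns the median from the training set only...
median_imputer = SimpleImputer(strategy="median")
median_imputer.fit(toy_train)

# ...and that same training median fills the gaps in the testing set.
toy_test_filled = median_imputer.transform(toy_test).ravel()
print(toy_test_filled.tolist())
```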

Up until now the categorical features are still in text format. We will have to convert them into a format the model will be able to use as input (i.e. numbers). We shall do so by converting the text into numbers using pd.get_dummies, concatenating the dummies to the dataframe, and then dropping the original text column:

In [None]:
for col in x_train.select_dtypes("object").columns:
    x_train=pd.concat([x_train,pd.get_dummies(x_train[col],drop_first=True)],axis=1)
    x_train.drop([col],axis=1,inplace=True)

In [None]:
x_train.head(2)

In [None]:
for col in x_test.select_dtypes("object").columns:
    x_test=pd.concat([x_test,pd.get_dummies(x_test[col],drop_first=True)],axis=1)
    x_test.drop([col],axis=1,inplace=True)

In [None]:
x_test.head(2)
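Two things are worth keeping an eye on with this kind of encoding: the three wind direction columns share the same category names, and the training and testing sets are encoded separately, so their dummy columns are not guaranteed to line up. One possible way to guard against both, sketched on made-up data (the `encode` helper and the toy frames are assumptions, not part of the notebook), is to fix the categories from the training set and let `pd.get_dummies` prefix each dummy with its source column:

```python
import pandas as pd

# Toy wind directions (made up for illustration).
toy_train = pd.DataFrame({"WindGustDir": ["N", "S", "E"],
                          "WindDir9am": ["N", "E", "E"]})
toy_test = pd.DataFrame({"WindGustDir": ["N", "N"],
                         "WindDir9am": ["S", "E"]})

# Learn the category list for each column from the training set only.
train_categories = {col: sorted(toy_train[col].dropna().unique())
                    for col in toy_train.columns}

def encode(frame: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode using the training categories; unseen values become all zeros."""
    typed = frame.copy()
    for col, cats in train_categories.items():
        typed[col] = pd.Categorical(typed[col], categories=cats)
    # Passing the whole frame prefixes every dummy with its source column name.
    return pd.get_dummies(typed, drop_first=True)

train_encoded = encode(toy_train)
test_encoded = encode(toy_test)
print(train_encoded.columns.tolist())
print(test_encoded.columns.tolist())
```

Both frames end up with identical, uniquely named columns, even though the toy testing set contains a direction the training set never saw.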

Since each feature has its own range of values, we will scale the data (again, fitting the scaler on the training set only and then applying it to the testing set):

In [None]:
scaler=StandardScaler()

x_train=pd.DataFrame(scaler.fit_transform(x_train),columns=x_train.columns)
x_test=pd.DataFrame(scaler.transform(x_test),columns=x_test.columns)

In [None]:
x_train.head(2)

In [None]:
x_test.head(2)
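To make the "fit on train, apply to test" point concrete, here is a tiny sketch with made-up numbers. The scaler learns its mean and standard deviation from the training values and simply reuses them on the testing values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (made up for illustration).
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[2.0], [5.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from the training values
test_scaled = toy_scaler.transform(toy_test)        # reuses the training mean and std

print(toy_scaler.mean_)
print(test_scaled.ravel())
```

The scaled training values are centred on zero, while the testing values need not be, which is expected.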

But before we fit our model, perhaps we should reduce the number of features selected:

In [None]:
model=LogisticRegression(random_state=7)

min_features_to_select=1
rfecv=RFECV(estimator=model,step=1,cv=5,scoring="accuracy",min_features_to_select=min_features_to_select)
rfecv.fit(x_train,y_train)

print("Optimal number of features : %d" % rfecv.n_features_)

plt.figure()
plt.xlabel("Number of features selected")
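plt.ylabel("Cross-validation score (accuracy)")
# grid_scores_ was removed in scikit-learn 1.2; the mean CV scores are in cv_results_.
mean_scores=rfecv.cv_results_["mean_test_score"]
plt.plot(range(min_features_to_select,
               len(mean_scores)+min_features_to_select),
         mean_scores)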
plt.show()

In [None]:
rfetable=pd.DataFrame({"Feature":x_train.columns,"Support":rfecv.support_,"Ranking":rfecv.ranking_,}).sort_values(by="Ranking",ascending=False)
rfetable

Recursive feature elimination suggests that the optimal number of features is 101, so we can drop the remaining columns.

In [None]:
# Drop the same columns from both sets so they keep identical features.
to_drop=["Rainfall","SSE","SSW","NNE","ESE","W","Williamtown","PearceRAAF","ENE","SE"]
x_train=x_train.drop(to_drop,axis=1)
x_test=x_test.drop(to_drop,axis=1)
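Instead of typing the column names by hand, the boolean mask in `support_` can select the kept features directly. A minimal sketch on synthetic data (the `toy_x` frame and its `f0`...`f5` column names are made up for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the scaled training set.
features, target = make_classification(n_samples=200, n_features=6,
                                       n_informative=3, n_redundant=0,
                                       random_state=7)
toy_x = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])

selector = RFECV(estimator=LogisticRegression(max_iter=1000),
                 step=1, cv=3, scoring="accuracy")
selector.fit(toy_x, target)

# support_ is True for every feature RFECV decided to keep.
kept = toy_x.loc[:, selector.support_]
print(kept.columns.tolist())
```

Because the mask selects by position, it also works when several columns happen to share a name.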

Now that we have removed some features, we can finally fit our model:

In [None]:
model.fit(x_train,y_train)

In [None]:
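# Note: a list of dicts is searched as separate grids, so each parameter below
# is varied on its own while the others keep their default values.
parameters=[{"penalty":["l1","l2","elasticnet"]},
            {"C":[0.1,1,10,100]},
            {"class_weight":["balanced",None]},
            {"solver":["newton-cg","lbfgs","liblinear","sag","saga"]},
            {"multi_class":["auto","ovr","multinomial"]}]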

grid=GridSearchCV(estimator=model,param_grid=parameters,refit=True,cv=5,verbose=1)

grid.fit(x_train,y_train)

y_predict=grid.predict(x_test)
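If we wanted to try parameters in combination rather than one at a time, they can go into a single dict. This is only a sketch on synthetic data with two parameters, not the search run above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the training set.
toy_x, toy_y = make_classification(n_samples=200, n_features=5, random_state=7)

# A single dict means every combination is tried: 3 values of C x 2 class weights.
combined_grid = {"C": [0.1, 1, 10], "class_weight": ["balanced", None]}

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid=combined_grid, cv=3)
search.fit(toy_x, toy_y)

print(len(search.cv_results_["params"]))
print(search.best_params_)
```

The trade-off is run time: the number of candidates is the product of the list lengths, so a combined grid grows quickly.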

Next we can evaluate our model using a classification report, ROC AUC score, ROC curve and a confusion matrix:

In [None]:
def cm(predictions):
    cm_matrix=pd.DataFrame(data=confusion_matrix(y_test,predictions),columns=["No Rain","Rain"],index=["No Rain","Rain"])
    sns.heatmap(cm_matrix,annot=True,square=True,fmt="d",cmap="Purples",linecolor="w",linewidths=2)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.yticks(va="center")

In [None]:
print(classification_report(y_test,y_predict))

In [None]:
y_score=model.predict_proba(x_test)[:,1]

print("roc_auc_score: ",roc_auc_score(y_test,y_score))

false_positive_rate,true_positive_rate,threshold=roc_curve(y_test,y_score)
plt.plot(false_positive_rate,true_positive_rate)
plt.plot([0,1],ls="--")
plt.plot([0,0],[1,0],c=".7")
plt.plot([1,1],c=".7")
plt.title("Receiver Operating Characteristic")
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()

In [None]:
cm(y_predict)

In [None]:
print("Training set score: {num:.4f}.".format(num=model.score(x_train,y_train)))
print("Testing set score: {num:.4f}.".format(num=model.score(x_test,y_test)))

Our model didn't do badly, with an accuracy score of 0.84 and a ROC AUC score of 0.86! This means our model correctly predicted 84% of the test instances. It was however better at predicting class 0 (i.e. no rain) than class 1 (i.e. rain), with higher precision and recall for class 0, and it also produced a lot more false negatives (i.e. predicted that it would not rain when it actually did) than false positives (i.e. predicted that it would rain when it actually did not). Thankfully, after all that work, the training and testing scores are very similar, so there is no obvious sign of over/underfitting, which suggests our model should fare reasonably well with new data.
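As a reminder of how these numbers relate to the confusion matrix, here is the arithmetic on ten made-up predictions (the labels below are illustrative, not our model's output):

```python
from sklearn.metrics import confusion_matrix

# Ten made-up days: 0 = no rain, 1 = rain.
actual =    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

precision = tp / (tp + fp)                  # of the predicted rainy days, how many were rainy
recall = tp / (tp + fn)                     # of the actual rainy days, how many we caught
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all days predicted correctly

print(tn, fp, fn, tp)
print(precision, recall, accuracy)
```

More false negatives than false positives shows up as a recall for the rain class that is lower than its precision, which is the pattern described above.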

**Now should I bring my umbrella or sunglasses..**