# Rainfall prediction project (Classification)


Predict next-day rain by training classification models on the target variable RainTomorrow.
This dataset contains about 10 years of daily weather observations from many locations across Australia.

RainTomorrow is the target variable to predict. It answers the question: did it rain the next day, Yes or No? The value is Yes if the rainfall recorded for that day was 1mm or more.
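
To make the definition of the target concrete, here is a minimal sketch of how a Yes/No next-day flag could be derived from daily rainfall amounts. The toy values, the single-station assumption and the name `toy` are made up for illustration; the Kaggle file already ships with `RainToday` and `RainTomorrow` filled in.

```python
import pandas as pd

# Toy rainfall amounts (mm) for four consecutive days at one hypothetical station
toy = pd.DataFrame({"Rainfall": [0.0, 2.4, 0.6, 1.0]})

# A day counts as rainy when it records 1 mm or more
toy["RainToday"] = toy["Rainfall"].ge(1.0).map({True: "Yes", False: "No"})

# The target is the next day's flag; the last day has no "tomorrow" yet
toy["RainTomorrow"] = toy["RainToday"].shift(-1)

print(toy)
```

The last row ends up with a missing target, which is one reason a column like `RainTomorrow` can contain missing values.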


## Framework for the project


* Step 1 - Download the data from Kaggle: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

* Step 2 - Import the data

* Step 3 - EDA (Exploratory Data Analysis)

* Step 4 - Data preparation (Converting the data into numeric form and filling the missing values)

* Step 5 - Fit a Machine Learning model and Evaluate the model on the data 

* Step 6 - Improving the model (hyperparameter tuning)

* Step 7 - Evaluating the final model (Confusion matrix, ROC curve, Precision, Recall, F1 score)

* Step 8 - Feature importance 



## Data dictionary for the project

1. Date - The date of observation
2. Location - The common name of the location of the weather station
3. MinTemp - The minimum temperature in degrees Celsius
4. MaxTemp - The maximum temperature in degrees Celsius
5. Rainfall - The amount of rainfall recorded for the day in mm
6. Evaporation - The so-called Class A pan evaporation (mm) in the 24 hours to 9am
7. Sunshine - The number of hours of bright sunshine in the day.
8. WindGustDir - The direction of the strongest wind gust in the 24 hours to midnight
9. WindGustSpeed - The speed (km/h) of the strongest wind gust in the 24 hours to midnight
10. WindDir9am - Direction of the wind at 9am
11. WindDir3pm - Direction of the wind at 3pm
12. WindSpeed9am - Wind speed (km/h) averaged over 10 minutes prior to 9am
13. WindSpeed3pm - Wind speed (km/h) averaged over 10 minutes prior to 3pm
14. Humidity9am - Humidity (percent) at 9am
15. Humidity3pm - Humidity (percent) at 3pm
16. Pressure9am - Atmospheric pressure (hPa) reduced to mean sea level at 9am
17. Pressure3pm - Atmospheric pressure (hPa) reduced to mean sea level at 3pm
18. Cloud9am - Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud
19. Cloud3pm - Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values
20. Temp9am - Temperature (degrees C) at 9am
21. Temp3pm - Temperature (degrees C) at 3pm
22. RainToday - Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
23. RainTomorrow - The target variable: Yes if 1mm or more of rain was recorded the next day, otherwise No

In [None]:
# Importing all necessary tools

# Importing the data analysis libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Matplotlib inline makes our plots appear inside the notebook
%matplotlib inline

# Importing the Evaluation tools
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV 
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
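# plot_roc_curve was removed in scikit-learn 1.2; fall back to RocCurveDisplay there
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay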

# Importing our machine learning models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


## 1. Importing the data

In [None]:
# 1. Importing the data
df = pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv")
df.head()

## 2. EDA (Exploratory Data Analysis)

The goal here is to find out more about the data and become a 
subject matter expert on the dataset you're working with.

1. What question are you trying to solve?
2. What kind of data do we have and how do we treat different types?
3. What's missing from the data and how do you deal with it?
4. What are the outliers and why should you care about them?
5. How can you add, change or remove features to get more from your data?


In [None]:
df.head().T

In [None]:
df.info()

In [None]:
len(df)

In [None]:
df["RainTomorrow"].value_counts().plot(kind="bar", color=["lightblue", "salmon"]);

In [None]:
# We have a class imbalance in our problem
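
The bar chart shows the imbalance visually. To put a number on it, `value_counts(normalize=True)` returns the share of each class. A minimal sketch on a made-up target column (the 8-to-2 ratio and the name `toy_target` are assumptions for illustration, not the real proportions in the dataset):

```python
import pandas as pd

# Toy target column: 8 dry days and 2 rainy days (illustrative values only)
toy_target = pd.Series(["No"] * 8 + ["Yes"] * 2, name="RainTomorrow")

# normalize=True returns each class's share instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be `df["RainTomorrow"].value_counts(normalize=True)`.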

In [None]:
pd.crosstab(df.Rainfall, df.Location)

In [None]:
fig, ax = plt.subplots(figsize=(10,7))

ax.scatter(df.MaxTemp,
            df.Rainfall,
            color=["salmon"])
plt.title("MaxTemp vs Rainfall")
plt.ylabel("Rainfall")
plt.xlabel("MaxTemp");

In [None]:
fig, ax = plt.subplots(figsize=(10,7))

ax.scatter(df.MinTemp,
            df.Rainfall,
            color=["lightblue"])
plt.title("MinTemp vs Rainfall")
plt.ylabel("Rainfall")
plt.xlabel("MinTemp");

In [None]:
df.head()

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

ax.scatter(df.Date[:1000],
           df.Rainfall[:1000],
           color=["blue"])
plt.title("Date vs Rainfall")
plt.ylabel("Rainfall")
plt.xlabel("Date");

### Parsing dates 
When working with time series data, we want to enrich the time and date component as much as possible.

We can do that by telling pandas which column has dates in it using the `parse_dates` parameter.

In [None]:
# Import the data again but this time parse the dates 
df = pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv",
                 parse_dates=["Date"])
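
To see what `parse_dates` changes, here is a small self-contained sketch that reads the same made-up CSV text twice, once without and once with the parameter. The CSV content and the names `csv_text`, `as_text` and `as_dates` are hypothetical stand-ins for `weatherAUS.csv`.

```python
import io
import pandas as pd

# Toy CSV text standing in for weatherAUS.csv (values are made up)
csv_text = "Date,Rainfall\n2008-12-01,0.6\n2008-12-02,0.0\n2008-12-03,1.2\n"

as_text = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(as_text["Date"].dtype)   # a text (object/string) dtype
print(as_dates["Date"].dtype)  # a datetime64 dtype
```

Only the parsed version supports the `.dt` accessor and plots on a proper time axis.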

In [None]:
df.Date.dtype

In [None]:
df.Date[:1000]

In [None]:
fig, ax = plt.subplots(figsize=(8,7))

ax.scatter(df.Date[:1000],
           df.Rainfall[:1000],
           color=["darkred"])
plt.title("Rainfall by Date")
plt.ylabel("Rainfall")
plt.xlabel("Date");

In [None]:
plt.style.use("default")
fig, ax = plt.subplots(figsize=(8,7))

ax.scatter(df.Date[:1000],
           df.WindGustSpeed[:1000],
           color=["blue"])
plt.title("WindGustSpeed by Date")
plt.ylabel("WindGustSpeed")
plt.xlabel("Date");

In [None]:
fig,(ax0,ax1) = plt.subplots(nrows=2,
                             ncols=1,
                             figsize=(10,8),
                             sharex=True)

# Scatter plot with WindSpeed9am
ax0.scatter(df.Date[:1000],
            df.WindSpeed9am[:1000],
            color="teal");

ax0.set(title="WindSpeed9am vs Date",
        xlabel="date",
        ylabel="WindSpeed9am")

# Scatter plot with WindSpeed3pm
ax1.scatter(df.Date[:1000],
            df.WindSpeed3pm[:1000],
            color="pink")

ax1.set(title="WindSpeed3pm vs Date",
        xlabel="date",
        ylabel="WindSpeed3pm");

In [None]:
fig,(ax0,ax1) = plt.subplots(nrows=2,
                             ncols=1,
                             figsize=(10,8),
                             sharex=True)

# Scatter plot with Humidity9am
ax0.scatter(df.Date[:1000],
            df.Humidity9am[:1000],
            color="teal");

ax0.set(title="Humidity9am vs Date",
        xlabel="date",
        ylabel="Humidity9am")

# Scatter plot with Humidity3pm
ax1.scatter(df.Date[:1000],
            df.Humidity3pm[:1000],
            color="blue")

ax1.set(title="Humidity3pm vs Date",
        xlabel="date",
        ylabel="Humidity3pm");

From this we can infer that humidity is lowest in January, rises from February through August with a peak in July, 
and then falls from September to December.

In [None]:
fig,(ax0,ax1) = plt.subplots(nrows=2,
                             ncols=1,
                             figsize=(10,8),
                             sharex=True)

# Scatter plot with Pressure9am
ax0.scatter(df.Date[:1000],
            df.Pressure9am[:1000],
            color="red");

ax0.set(title="Pressure9am vs Date",
        xlabel="date",
        ylabel="Pressure9am")

# Scatter plot with Pressure3pm
ax1.scatter(df.Date[:1000],
            df.Pressure3pm[:1000],
            color="blue")

ax1.set(title="Pressure3pm vs Date",
        xlabel="date",
        ylabel="Pressure3pm");

In [None]:
fig,(ax0,ax1) = plt.subplots(nrows=2,
                             ncols=1,
                             figsize=(10,8),
                             sharex=True)

# Scatter plot with Temp9am
ax0.scatter(df.Date[:1000],
            df.Temp9am[:1000],
            color="navy");

ax0.set(title="Temp9am vs Date",
        xlabel="date",
        ylabel="Temp9am")

# Scatter plot with Temp3pm
ax1.scatter(df.Date[:1000],
            df.Temp3pm[:1000],
            color="firebrick")

ax1.set(title="Temp3pm vs Date",
        xlabel="date",
        ylabel="Temp3pm");

As Australia lies in the Southern Hemisphere, summer runs from December to February and winter from June to August, so we see high temperatures during summer and low temperatures during winter.

In [None]:
fig,(ax0,ax1) = plt.subplots(nrows=2,
                             ncols=1,
                             figsize=(10,8),
                             sharex=True)

# Scatter plot with Maxtemp
ax0.scatter(df.Date[:1000],
            df.MaxTemp[:1000],
            color="teal");

ax0.set(title="Maxtemp vs Date",
        xlabel="date",
        ylabel="Maxtemp")

# Scatter plot with MinTemp
ax1.scatter(df.Date[:1000],
            df.MinTemp[:1000],
            color="navy")

ax1.set(title="MinTemp vs Date",
        xlabel="date",
        ylabel="MinTemp");

From this we can infer the following.

During summer:

* maximum MaxTemp is over 45 degrees
* maximum MinTemp is over 25 degrees
* minimum MaxTemp is between 20-30 degrees
* minimum MinTemp is between 5-15 degrees

During winter:

* maximum MaxTemp is between 20-25 degrees
* maximum MinTemp is between 5-10 degrees
* minimum MaxTemp is just under 10 degrees
* minimum MinTemp is under 0 degrees



In [None]:
df_tmp = df

In [None]:
df_tmp.head().T

In [None]:
df_tmp.Date.head(20)

### Sort the DataFrame by the Date 

When working with time series data, it's a good idea to sort it by date.

In [None]:
# Sort the DataFrame in Date order
df_tmp.sort_values(by=["Date"], inplace=True, ascending=True)
df_tmp.Date.head(20)

In [None]:
df_tmp.head()

In [None]:
# make a copy of the original dataset
df_temp = df.copy()
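
A quick note on the difference between plain assignment (as in `df_tmp = df`) and `.copy()`: assignment only gives the same DataFrame a second name, so an in-place sort through one name also reorders the other. A minimal sketch on a made-up two-row frame (the names `original`, `alias` and `independent` are just for illustration):

```python
import pandas as pd

original = pd.DataFrame({"Rainfall": [2.0, 1.0]})
alias = original               # second name for the same object
independent = original.copy()  # separate data

# Sorting in place through the alias also changes `original`
alias.sort_values(by="Rainfall", inplace=True)

print(original["Rainfall"].tolist())     # sorted, because alias is original
print(independent["Rainfall"].tolist())  # still in the original order
```

Working on a copy from here on keeps the loaded data untouched if we need to go back to it.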

### Adding datetime parameters for the `Date` column

In [None]:
df_temp["Year"] = df_temp.Date.dt.year
df_temp["Month"] = df_temp.Date.dt.month
df_temp["Day"] = df_temp.Date.dt.day
df_temp["DayOfWeek"] = df_temp.Date.dt.dayofweek
df_temp["DayOfYear"] = df_temp.Date.dt.dayofyear
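
As a sanity check on what the `.dt` accessor returns, here is the same enrichment on two hand-picked dates. The frame `toy_dates` is made up for illustration.

```python
import pandas as pd

toy_dates = pd.DataFrame({"Date": pd.to_datetime(["2008-12-01", "2009-01-15"])})

toy_dates["Year"] = toy_dates["Date"].dt.year
toy_dates["Month"] = toy_dates["Date"].dt.month
toy_dates["Day"] = toy_dates["Date"].dt.day
toy_dates["DayOfWeek"] = toy_dates["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
toy_dates["DayOfYear"] = toy_dates["Date"].dt.dayofyear  # 1-based, leap years included

print(toy_dates.T)
```

Note that `dayofweek` starts counting at 0 for Monday, while `dayofyear` starts at 1.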

In [None]:
df_temp.head().T

In [None]:
# Since we've enriched our DataFrame with datetime features we can now remove the Date column
df_temp.drop("Date", axis=1,inplace=True)

In [None]:
df_temp.Location.value_counts()

In [None]:
df_temp.tail().T

In [None]:
df_temp.info()

## Data Preprocessing

### Converting string to categories
One way we can turn all of our data into numbers is by converting the string columns into pandas categories.

In [None]:
pd.api.types.is_string_dtype(df_temp.Location)

In [None]:
# Find the columns which contain strings
for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# This will turn all the string values into categorical values
for label, content in df_temp.items():
    if pd.api.types.is_string_dtype(content):
        df_temp[label] = content.astype("category").cat.as_ordered()

In [None]:
df_temp.info()

In [None]:
df_temp.RainTomorrow.cat.categories

In [None]:
df_temp.Location.cat.codes

Thanks to pandas categories, we now have a way to access all our data in the form of numbers.
But we still have to fill the missing data...
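
One detail worth knowing before filling: pandas encodes a missing category as the code `-1`. A minimal sketch on a made-up wind direction column (the names `toy_dir` and `toy_cat` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy wind directions with one missing value
toy_dir = pd.Series(["N", "SW", np.nan, "N"], name="WindGustDir")
toy_cat = toy_dir.astype("category").cat.as_ordered()

print(list(toy_cat.cat.categories))  # categories are sorted alphabetically
print(toy_cat.cat.codes.tolist())    # the missing entry shows up as -1
```

So if we took the codes before filling, every missing value would silently become the number `-1`, which is why the missing data is filled first.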

In [None]:
df_temp.isna().sum()

In [None]:
### Saving preprocessed data
# df_temp.to_csv("datasets/temp.csv",
#                index=False)

In [None]:
df_temp.RainTomorrow.cat.codes

## Filling missing values 
### Filling numeric missing values first


In [None]:
# Check for numerical values 
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
df_temp.Pressure9am

In [None]:
# Check for which numeric columns have null values
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
# Filling numeric rows with the mean
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Fill missing numeric values with the mean
            df_temp[label] = content.fillna(content.mean())

In [None]:
# Check for missing values now if any
for label, content in df_temp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

There are no missing numeric values left, since we filled them with the mean of each column.

In [None]:
df_temp.isna().sum()

In [None]:
df_temp.head().T

In [None]:
# Check to see how many examples were missing in MinTemp
#df_temp.MinTemp_is_missing.value_counts()
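
The commented-out line above refers to a `MinTemp_is_missing` column, which this notebook never creates. One possible way to build such indicator columns while filling with the mean is sketched below, on made-up data; the `_is_missing` suffix and the frame `toy_num` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy_num = pd.DataFrame({"MinTemp": [10.0, np.nan, 14.0],
                        "Rainfall": [0.0, 3.0, np.nan]})

# Iterate over a fixed list of names, since new columns are added inside the loop
for label in list(toy_num.columns):
    content = toy_num[label]
    if pd.api.types.is_numeric_dtype(content) and content.isna().any():
        # Remember which rows were missing before overwriting them
        toy_num[label + "_is_missing"] = content.isna()
        # Fill missing numeric values with the column mean
        toy_num[label] = content.fillna(content.mean())

print(toy_num)
```

The indicator columns let a model see that a value was imputed, at the cost of extra features.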

In [None]:
df_temp.isna().sum()

### Filling and turning categorical features into numbers

In [None]:
# Check for columns which aren't numeric
for label,content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
pd.Categorical(df_temp.RainTomorrow).codes

In [None]:
df_temp.RainTomorrow.value_counts()

In [None]:
# Filling categorical columns with the mode
for label, content in df_temp.items():
    if not pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Fill missing categorical values with the most frequent category;
            # the category codes are extracted in the next cell
            df_temp[label] = content.fillna(content.mode()[0])
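
The order matters here: fill with the mode first, then take the codes. A small sketch on a made-up categorical column (the names `toy_wind`, `most_common` and `filled` are hypothetical):

```python
import numpy as np
import pandas as pd

toy_wind = pd.Series(["N", "SW", np.nan, "N"], dtype="category")

# mode() returns a Series of the most frequent values; take the first one
most_common = toy_wind.mode()[0]
filled = toy_wind.fillna(most_common)

print(most_common)
print(filled.cat.codes.tolist())  # no -1 left after filling
```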

In [None]:
df_temp["Location"] = df_temp["Location"].cat.codes
df_temp["WindGustDir"] = df_temp["WindGustDir"].cat.codes
df_temp["WindDir9am"] = df_temp["WindDir9am"].cat.codes
df_temp["WindDir3pm"] = df_temp["WindDir3pm"].cat.codes
df_temp["RainToday"] = df_temp["RainToday"].cat.codes
df_temp["RainTomorrow"] = df_temp["RainTomorrow"].cat.codes

In [None]:
df_temp.head().T

In [None]:
df_temp.isna().sum()

In [None]:
df_temp.RainTomorrow.value_counts()

In [None]:
df_temp.RainTomorrow.value_counts().plot(kind="bar", 
                                         color=["lightblue", "salmon"]);
                       

In [None]:
# Splitting the data
X = df_temp.drop(["RainTomorrow"], axis=1)
y = df_temp["RainTomorrow"]

# Splitting the data into train and test set (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
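
Since the classes are imbalanced, one option (not used in this notebook) is to pass `stratify=y` so the train and test sets keep the same class proportions. A minimal sketch on made-up arrays; `X_toy`, `y_toy` and the 80/20 class ratio are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0 and 20 samples of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=12)

# With stratify, both splits keep the 20% share of the positive class
print(y_tr.mean(), y_te.mean())
```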

In [None]:
df_temp.info()

## Machine learning modelling

With data preprocessing done, we can now build and fit the machine learning models.

We are going to experiment with two different models (RandomForestClassifier and LogisticRegression) on our dataset and see which one performs best.
We will start with baseline models.


In [None]:
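# We actually don't need these columns.
# Note: X_train/X_test were created above and still contain them until the data is re-split below.
df_temp = df_temp.drop(["DayOfWeek","DayOfYear"], axis=1)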

In [None]:
df_temp.Location.dtype

In [None]:
# %%time
# # Instantiate the model
# clf1 = LogisticRegression(n_jobs=-1,
#                           random_state=12)
# # fit the model
# clf1.fit(X_train, y_train)

In [None]:
#clf1.score(X_test, y_test)

### First we will try the RandomForestClassifier

This first model is trained without dropping the highly correlated columns from our dataset.

In [None]:
%%time
# Instantiate the 2nd model
clf2 = RandomForestClassifier(n_jobs=-1,
                              random_state=12)

# fit the model
clf2.fit(X_train, y_train)

In [None]:
clf2.score(X_test, y_test)

In [None]:
# make a confusion matrix for randomForest Classifier
y_preds = clf2.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_preds))

In [None]:
# Print the classification report 
print(classification_report(y_test, y_preds))

In [None]:
# Visualize the confusion matrix
sns.set(font_scale=1.5)

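def plot_conf_mat(y_test, y_preds):
    """Plot the confusion matrix as a heatmap (rows: true labels, columns: predictions)."""
    fig,ax = plt.subplots(figsize=(4,4))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True,
                     fmt="d",
                     cbar=False)
    # scikit-learn puts true labels on the rows and predictions on the columns
    plt.xlabel("predicted label")
    plt.ylabel("true label")
    
plot_conf_mat(y_test,y_preds)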
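
To read the heatmap correctly it helps to know the layout scikit-learn uses: rows are the true labels and columns are the predicted labels. A tiny worked example with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# 0 = no rain, 1 = rain (toy labels)
y_true_toy = [0, 0, 0, 1, 1, 0]
y_pred_toy = [0, 0, 1, 1, 0, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)

# For a binary problem, ravel() unpacks the cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```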

As we can see, our model has trouble correctly predicting the positive class (rain tomorrow) because of class imbalance: 
the dataset contains many more no-rain samples than rain samples.
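
This is also why accuracy alone can look better than the model really is on the rainy days. A minimal sketch with made-up labels, where a model misses half of the rainy days yet still scores 90% accuracy (`y_true_toy` and `y_pred_toy` are illustrative, not our model's output):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 8 dry days (0) and 2 rainy days (1); the predictions catch only one rainy day
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 8 + [1, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)
prec = precision_score(y_true_toy, y_pred_toy)  # of predicted rain, how much was rain
rec = recall_score(y_true_toy, y_pred_toy)      # of actual rain, how much was caught

print(acc, prec, rec)
```

The recall for the rain class in the classification report is the number to watch here.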



In [None]:
# plotting a correlation matrix 
plt.figure(figsize=(28,15))
sns.heatmap(df_temp.corr(),
            annot=True)
plt.xticks(rotation=90)
plt.show()

From the correlation matrix we can infer that 

* Temp9am and MaxTemp are highly correlated
* Temp9am and MinTemp are highly correlated
* Temp3pm and MaxTemp are highly correlated 
* Temp3pm and MinTemp are highly correlated
* Pressure9am and Pressure3pm are highly correlated
* MinTemp and MaxTemp are highly correlated
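
Reading pairs off a large heatmap is error-prone, so the same list could also be pulled out programmatically. A sketch on synthetic data, assuming a cut-off of 0.9 for "highly correlated" (the threshold, the frame `toy_corr` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12)
base = rng.normal(size=200)
toy_corr = pd.DataFrame({
    "MaxTemp": base,
    "Temp3pm": base + rng.normal(scale=0.1, size=200),  # nearly a copy of MaxTemp
    "Humidity3pm": rng.normal(size=200),                # unrelated noise
})

corr = toy_corr.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) pairs and keep those above the threshold
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

On the notebook's data, `toy_corr` would be replaced by `df_temp`.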

In [None]:
# Let's drop the highly correlated columns
df_temp = df_temp.drop(["Temp9am", "Temp3pm", "Pressure3pm","MaxTemp"], axis=1)
df_temp.columns

In [None]:
# df_temp = df_temp.drop(["Humidity9am"], axis=1)
# df_temp.columns

In [None]:
df_temp.head().T

In [None]:
# Split the data into X and y
X = df_temp.drop(["RainTomorrow"], axis=1)
y = df_temp["RainTomorrow"]


# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=12)

In [None]:
%%time
# Let's fit our model again
# Instantiate the 2nd model
clf2 = RandomForestClassifier(n_jobs=-1,
                              random_state=12)

# fit the model
clf2.fit(X_train, y_train)

In [None]:
clf2.score(X_test, y_test)

So we got about 85% accuracy with the baseline RandomForestClassifier.

In [None]:
y_preds = clf2.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_preds))

In [None]:
print(classification_report(y_test, y_preds))

### Let's try the LogisticRegression model


In [None]:
%%time
# Instantiate the model
clf1 = LogisticRegression(n_jobs=-1,
                          random_state=12)

# Fit the model
clf1.fit(X_train, y_train)

In [None]:
# score the model
clf1.score(X_test, y_test)

In [None]:
# make a confusion matrix and classification report
y_preds = clf1.predict(X_test)
print(confusion_matrix(y_test, y_preds))

In [None]:
print(classification_report(y_test, y_preds))

After training and evaluating two models, we can see that RandomForestClassifier gives better results than LogisticRegression, 
so next we are going to tune the RandomForestClassifier to improve it. 
We could also try other models to see if they do a better job than these two, but for the time being we will focus on tuning the RandomForestClassifier.
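
As a rough preview of what that tuning step could look like, here is a sketch using `RandomizedSearchCV` on a small synthetic dataset. The search space, `n_iter=5`, `cv=3` and the `f1` scoring are illustrative assumptions, not the settings used later in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic, imbalanced stand-in for the weather data
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   weights=[0.8, 0.2], random_state=12)

# Hypothetical search space; these ranges are illustrative, not tuned
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=12),
    param_distributions=param_distributions,
    n_iter=5,          # number of random parameter combinations to try
    cv=3,              # 3-fold cross-validation for each combination
    scoring="f1",      # F1 is less flattering than accuracy on imbalanced data
    random_state=12,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Scoring on F1 rather than accuracy is a deliberate choice in this sketch, since accuracy is dominated by the majority no-rain class.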