The content is daily weather observations from numerous Australian weather stations.

The target RainTomorrow means: Did it rain the next day? Yes or No.

# Importing libraries and dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('../input/weatherAUS.csv')

# Exploring the data

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.head()

The dataset has the following columns:
*  **Date** — The date of observation
*  **Location** — The common name of the location of the weather station
*  **MinTemp** — The minimum temperature in degrees celsius
*  **MaxTemp** — The maximum temperature in degrees celsius
*  **Rainfall** — The amount of rainfall recorded for the day in mm
*  **Evaporation** — The so-called Class A pan evaporation (mm) in the 24 hours to 9am
*  **Sunshine** — The number of hours of bright sunshine in the day.
*  **WindGustDir** — The direction of the strongest wind gust in the 24 hours to midnight
*  **WindGustSpeed** — The speed (km/h) of the strongest wind gust in the 24 hours to midnight
*  **WindDir9am** — Direction of the wind at 9am
*  **WindDir3pm** — Direction of the wind at 3pm
*  **WindSpeed9am** — Wind speed (km/hr) averaged over 10 minutes prior to 9am
*  **WindSpeed3pm** — Wind speed (km/hr) averaged over 10 minutes prior to 3pm
*  **Humidity9am** — Humidity (percent) at 9am
*  **Humidity3pm** — Humidity (percent) at 3pm
*  **Pressure9am** — Atmospheric pressure (hpa) reduced to mean sea level at 9am
*  **Pressure3pm** — Atmospheric pressure (hpa) reduced to mean sea level at 3pm
*  **Cloud9am** — Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A measure of 0 indicates a completely clear sky, whilst 8 indicates that it is completely overcast.
*  **Cloud3pm** — Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values
*  **Temp9am** — Temperature (degrees C) at 9am
*  **Temp3pm** — Temperature (degrees C) at 3pm
*  **RainToday** — Yes if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise No
*  **RISK_MM** — The amount of rain recorded the next day, in mm. A kind of measure of the "risk".
*  **RainTomorrow** — The target variable. Did it rain the next day?

The type of machine learning we will be doing is called **classification**, because when we make predictions we are classifying each day as rainy or not. More specifically, we are performing **binary classification**, which means that there are only two different states we are classifying.

# Null values
Let's get rid of the columns with a significant share of null values. In the remaining columns, we will drop the rows that contain null values.

In [None]:
# Fraction of null values in each column (0.0 means nothing is missing)
data_null_percent = data.isnull().mean()

data_null_percent_sorted = data_null_percent.sort_values()

In [None]:
data_null_percent_sorted.plot.barh()

**Cloud9am, Cloud3pm, Evaporation, and Sunshine** must be dropped because a significant share of the records in these columns is missing. We should also exclude **RISK_MM**, because it can leak the answer to the model and make its measured accuracy misleadingly high.

In [None]:
data = data.drop(columns=['Cloud9am','Cloud3pm', 'Evaporation', 'Sunshine','RISK_MM'])

Let's drop rows with null values in them.

In [None]:
data = data.dropna()
data.isnull().any()

In [None]:
data.shape

In [None]:
data.head()

# Split into train and test
We must be aware of one important thing: any change we make to the train data, we also need to make to the test data, otherwise we will be unable to use our model. 

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2)

In [None]:
print("train: " + str(train.shape) + ", test: " + str(test.shape))

# Deal with categorical variables
To apply algorithms such as logistic regression, we need to convert the non-numeric data into numeric data. Categorical variables with only two possible values can be converted into variables with 0s and 1s as values. For categorical variables with three or more possible values, we will create dummy variables.

Convert values in columns "RainToday" and "RainTomorrow" from **"No" and "Yes"** to **0 and 1**.

In [None]:
train["RainToday"] = train["RainToday"].map({"No":0, "Yes":1})
train["RainTomorrow"] = train["RainTomorrow"].map({"No":0, "Yes":1})

test["RainToday"] = test["RainToday"].map({"No":0, "Yes":1})
test["RainTomorrow"] = test["RainTomorrow"].map({"No":0, "Yes":1})

Let's visualize how the categorical variables relate to rain on the following day.

In [None]:
def category_impact_plot(variable, subplot_position):
    plt.subplot(subplot_position)
    pd.pivot_table(train, index=variable, values='RainTomorrow').plot.bar(figsize=(25,5), ax=plt.gca()) 
   
plt.figure(1)
category_impact_plot("WindGustDir", 131)
category_impact_plot("WindDir9am", 132)
category_impact_plot("WindDir3pm", 133)


Create dummy variables for **WindGustDir, WindDir9am, WindDir3pm**

In [None]:
categorical_variables = ["WindGustDir", "WindDir9am", "WindDir3pm"]

train = pd.get_dummies(train, columns=categorical_variables)
test = pd.get_dummies(test, columns=categorical_variables)

In [None]:
train.head()
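Because `pd.get_dummies` is called separately on `train` and `test`, the two frames can end up with different columns if a category is missing from one split. A minimal sketch of one way to guard against that, using tiny made-up frames (`toy_train` and `toy_test` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Tiny made-up frames: the test split happens to contain no "W" gusts
toy_train = pd.DataFrame({"WindGustDir": ["N", "E", "W", "E"]})
toy_test = pd.DataFrame({"WindGustDir": ["E", "N"]})

toy_train_d = pd.get_dummies(toy_train, columns=["WindGustDir"])
toy_test_d = pd.get_dummies(toy_test, columns=["WindGustDir"])

# Reindex the test frame to the training columns;
# dummy columns missing from the test split are filled with 0
toy_test_d = toy_test_d.reindex(columns=toy_train_d.columns, fill_value=0)

print(list(toy_train_d.columns))
print(list(toy_test_d.columns))
```

After the `reindex`, both frames have the same columns in the same order, so a model fitted on one can be applied to the other.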

# Does Location affect the formation of rain?

In [None]:
location_pivot = train.pivot_table(index="Location", values="RainTomorrow")
location_pivot_sorted = location_pivot.sort_values(by=["RainTomorrow"])

location_pivot_sorted.plot.barh(figsize=(10,12))
plt.ylabel('')

Yes, **Location** clearly affects the chance of rain tomorrow! So we're going to use this variable, and since it is categorical we have to create dummies for it.

In [None]:
train = pd.get_dummies(train, columns=["Location"])
test = pd.get_dummies(test, columns=["Location"])

# Does Date affect the formation of rain?

In [None]:
train["Month"] = pd.to_datetime(train["Date"]).dt.month
test["Month"] = pd.to_datetime(test["Date"]).dt.month

In [None]:
date_pivot = train.pivot_table(index="Month", values="RainTomorrow")#.sort_index(ascending=False)

date_pivot.plot.barh()
plt.ylabel('')

There is a noticeable trend: months 6 to 8 (June to August) form the rainy season.

In [None]:
train = pd.get_dummies(train, columns=["Month"])
test = pd.get_dummies(test, columns=["Month"])

# Rescaling
Looking at our numeric columns, we can see a big difference between the range of each.  In order to make sure these values are equally weighted within our model, we'll need to rescale the data.

Rescaling simply stretches or shrinks the data as needed to be on the same scale, in our case between 0 and 1.

In [None]:
# the preprocessing.minmax_scale() function allows us to quickly and easily rescale our data
from sklearn.preprocessing import minmax_scale

# Use double brackets to pass a one-column DataFrame. Otherwise you may get a type error stating "cannot iterate over 0-d array".
def apply_minmax_scale(dataset, features):
    for feature in features:
        dataset[feature] = minmax_scale(dataset[[feature]])
        
numerical_features = ["MinTemp","MaxTemp", "Rainfall", "WindGustSpeed", "WindSpeed9am",
                     "WindSpeed3pm", "Humidity9am", "Humidity3pm", "Pressure9am", 
                     "Pressure3pm", "Temp9am", "Temp3pm"]

apply_minmax_scale(train, numerical_features)
apply_minmax_scale(test, numerical_features)

train[numerical_features].head()
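The cell above scales `train` and `test` independently, so each split uses its own minimum and maximum. An alternative that applies exactly the same transformation to both splits is to learn the range on the training data and reuse it. The sketch below uses scikit-learn's `MinMaxScaler` on made-up temperatures. It is not the method used in this notebook, and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up temperatures for illustration only
toy_train = pd.DataFrame({"MaxTemp": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"MaxTemp": [15.0, 35.0]})

scaler = MinMaxScaler()
# Learn min and max from the training data only ...
toy_train[["MaxTemp"]] = scaler.fit_transform(toy_train[["MaxTemp"]])
# ... and reuse them for the test data
toy_test[["MaxTemp"]] = scaler.transform(toy_test[["MaxTemp"]])

print(toy_train["MaxTemp"].tolist())
print(toy_test["MaxTemp"].tolist())
```

With this approach a test value above the training maximum (35 here) maps to a value slightly above 1, roughly 1.25. That is expected and usually harmless.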

# How numerical variables relate to tomorrow's rain

In [None]:
rainTomorrow_yes = train[train["RainTomorrow"] == 1]
rainTomorrow_no = train[train["RainTomorrow"] == 0]

In [None]:
def variable_impact_plot(variable, subplot_position):
    plt.subplot(subplot_position)
    rainTomorrow_yes[variable].plot.hist(figsize=(25,10), alpha=0.5, color="blue", bins=50, ax=plt.gca())
    rainTomorrow_no[variable].plot.hist(figsize=(25,10), alpha=0.5, color="yellow", bins=50, ax=plt.gca())
    plt.ylabel('')
    plt.xticks([], [])
    plt.yticks([], [])
    plt.title(variable)

plt.figure(1)
variable_impact_plot("MinTemp", 341)
variable_impact_plot("MaxTemp", 342)
variable_impact_plot("Rainfall", 343)
variable_impact_plot("WindGustSpeed", 344)
variable_impact_plot("WindSpeed9am", 345)
variable_impact_plot("WindSpeed3pm", 346)
variable_impact_plot("Humidity9am", 347)
variable_impact_plot("Humidity3pm", 348)
plt.figure(2)
variable_impact_plot("Pressure9am", 341)
variable_impact_plot("Pressure3pm", 342)
variable_impact_plot("Temp9am", 343)
variable_impact_plot("Temp3pm", 344)

We are interested in variables whose plots show blue and yellow areas with different shapes. Such variables have an impact (positive or negative) on whether it rains tomorrow. The most obvious one is **Humidity3pm**! The rest are not that clear, so we will use another feature selection method.

# Collinearity
We now have 73 possible feature columns we can use to train our model. One thing to be aware of as you start to add more features is a concept called collinearity. Collinearity occurs where more than one feature contains data that are similar.

The effect of collinearity is that your model will overfit: you may get great results on your training data set, but then the model performs worse on unseen data (like the test set).

A common way to spot collinearity is to plot correlations between each pair of variables in a heatmap.

In [None]:
# columns we will be using all the way down
columns = list(train.columns[1:])
columns.remove("RainTomorrow")

In [None]:
import seaborn as sns

# custom function to set the style for heatmap
def plot_correlation_heatmap(df):
    corr = df.corr()
    sns.set(style="white")
    mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24
    mask[np.triu_indices_from(mask)] = True

    f, ax = plt.subplots(figsize=(30, 25))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
    plt.show()

plot_correlation_heatmap(train[columns])

We can see that some pairs of variables have a correlation of about 30-50%. That's not strong enough to justify removing one of them and relying on the other.

Apart from that, we should remove one column from each set of dummy variables to reduce the collinearity within each set. We'll remove:
* WindGustDir_E
* WindDir9am_E
* WindDir3pm_E
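No cell in this notebook performs the removal, so here is one possible way to do it, assuming `columns` is the feature list built earlier. The list below is a shortened stand-in for illustration, and `reference_dummies` is a hypothetical name.

```python
import pandas as pd

# Shortened stand-in for the notebook's `columns` list (illustration only)
columns = ["MinTemp", "WindGustDir_E", "WindGustDir_N",
           "WindDir9am_E", "WindDir9am_N", "WindDir3pm_E", "WindDir3pm_N"]

# One dummy column per set is left out and acts as the reference category
reference_dummies = ["WindGustDir_E", "WindDir9am_E", "WindDir3pm_E"]
columns = [c for c in columns if c not in reference_dummies]
print(columns)

# Alternative: let pandas drop the first (alphabetical) category up front
toy = pd.DataFrame({"WindGustDir": ["N", "E", "W"]})
toy_dummies = pd.get_dummies(toy, columns=["WindGustDir"], drop_first=True)
print(list(toy_dummies.columns))
```

Filtering the feature list keeps the DataFrames untouched, whereas `drop_first=True` never creates the reference column in the first place.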

# Feature selection
In order to select the best-performing features, we need a way to measure which of our features are relevant to our outcome - in this case, the impact on forming tomorrow's rain. One effective way is by training a logistic regression model using all of our features, and then looking at the coefficients of each feature.

The scikit-learn LogisticRegression class has an attribute in which coefficients are stored after the model is fit, LogisticRegression.coef_. We first need to train our model, after which we can access this attribute.

In [None]:
# Applying Logistic Regression
from sklearn.linear_model import LogisticRegression
logisticRegression = LogisticRegression()
logisticRegression.fit(train[columns], train["RainTomorrow"])
coefficients = logisticRegression.coef_
print(coefficients)

The coef_ attribute holds a NumPy array of coefficients, in the same order as the features that were used to fit the model. To make these easier to interpret, we can convert the coefficients to a pandas series, adding the column names as the index:

In [None]:
feature_importance = pd.Series(coefficients[0], index=columns)
print(feature_importance)

In [None]:
# Plotting as a horizontal Bar chart
feature_importance.plot.barh(figsize=(10,25))
plt.show()

The plot we generated shows a range of both positive and negative values. Whether a value is positive or negative isn't as important in this case as its magnitude. If you think about it, this makes sense. A feature that strongly indicates that it's not going to rain tomorrow is just as useful as a feature that strongly indicates that it is going to rain tomorrow, given that these are mutually exclusive outcomes.
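The point about sign versus magnitude can be illustrated with a small sketch on made-up data. The feature names `wet` and `dry` are hypothetical: one pushes the outcome towards rain, the other pushes it away by the same amount.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data for illustration only
rng = np.random.default_rng(0)
n = 1000
wet = rng.uniform(0, 1, n)    # higher values make rain more likely
dry = rng.uniform(0, 1, n)    # higher values make rain less likely
noise = rng.normal(0, 0.3, n)
y = (wet - dry + noise > 0).astype(int)

X = np.column_stack([wet, dry])
model = LogisticRegression().fit(X, y)
w_wet, w_dry = model.coef_[0]
print("wet:", w_wet, "dry:", w_dry)
```

The two coefficients should come out with opposite signs, yet both features carry information about the outcome, which is why the features are ranked by absolute value. Comparing magnitudes like this is only meaningful when the features share a scale, as they do after rescaling.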

To make things easier to interpret, we'll alter the plot to show absolute values and sort the bars by size:

In [None]:
ordered_feature_importance = feature_importance.abs().sort_values()
ordered_feature_importance.plot.barh(figsize=(10,25))
plt.show()

We'll train a model on the four features with the largest absolute coefficients.

In [None]:
predictors = ["Pressure3pm", "WindGustSpeed", "Pressure9am", "Humidity3pm"]

lr = LogisticRegression()
lr.fit(train[predictors], train["RainTomorrow"])
predictions = lr.predict(test[predictors])
print(predictions)

In [None]:
# Calculating the accuracy using the k-fold cross validation method with k=10
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, train[predictors], train["RainTomorrow"], cv=10)
print(scores)

In [None]:
# Taking the mean of all the scores
accuracy = scores.mean()
print(accuracy)
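The cross-validation score above is computed on the training data only. As a complementary check, predictions on the held-out test set can be compared with the true labels and with a naive baseline that always predicts the most frequent class. The sketch below uses made-up data so that it runs on its own. In the notebook, the same idea would apply to `predictions` and `test["RainTomorrow"]`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up data for illustration only: four predictors scaled to 0..1
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 1, (n, 4))
# Imbalanced target driven by the last predictor plus noise
y = (X[:, 3] + rng.normal(0, 0.2, n) > 0.75).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Baseline: always predict the most frequent class seen in training
majority_class = np.bincount(y_train).argmax()
baseline_accuracy = accuracy_score(y_test, np.full_like(y_test, majority_class))

print("model:   ", test_accuracy)
print("baseline:", baseline_accuracy)
```

When most days are dry, a high accuracy on its own can be misleading, so the gap to the baseline is the more informative number.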