This kernel mainly focuses on handling missing data.

In [None]:
import numpy as np
import pandas as pd

import datetime
import math

First we double-check where our input data is and what the exact directory name is. Then we load the CSV file as a dataframe.

In [None]:
!ls ../input

In [None]:
rain = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
rain.head()

Since we're treating this as a classification problem, we must drop 'RISK_MM', because it leaks information about our target variable 'RainTomorrow'.

In [None]:
#drop risk_mm
rain.drop(['RISK_MM'], axis=1, inplace=True)
rain.head()

Let's look at the missing value stats, computed as the number of missing values in a column divided by the total number of rows.

In [None]:
for col in rain.columns:
    print(col + ' has ' + str(round((rain[col].isnull().sum() / rain.shape[0]) * 100, 2)) + '% missing values')
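
The same percentages can also be computed without a loop, since the mean of a boolean mask is the fraction of True values. Here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from weatherAUS.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `rain`
toy = pd.DataFrame({
    'MinTemp': [13.4, np.nan, 12.9, 9.2],
    'Rainfall': [0.6, 0.0, np.nan, np.nan],
    'Location': ['Albury', 'Albury', 'Cobar', 'Cobar'],
})

# isnull() gives a boolean frame; its column-wise mean is the missing fraction
missing_pct = (toy.isnull().mean() * 100).round(2)

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```

Sorting the result makes it easier to spot the worst columns at a glance.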

Then check which of our features are numerical and which are categorical.

In [None]:
rain.info()

Since we notice that 'Date' is an object instead of a datetime object, we convert it accordingly.

In [None]:
#set date to datetime object
rain['Date'] = pd.to_datetime(rain['Date'])
rain.head()

Our data covers all of Australia, which means there are lots of locations. Let's take all of the unique locations and store them in the variable 'locations'.

In [None]:
#get unique locations
locations = rain['Location'].unique()
locations

Notice that 'Evaporation' and 'Sunshine' have ~40% missing values? To be honest, we don't even know how these were measured, so let's just drop these columns.

In [None]:
rain.drop(['Evaporation', 'Sunshine'], axis=1, inplace=True)

We know that Australia has 4 seasons, so we can't just fill the missing values with the overall mean of each column. For example, 'MinTemp' (the minimum temperature of an area) changes with the season: what might be a minimum of 24 in summer might be 14 in winter. To impute accurately, we must note which season each missing value falls in. The easiest way to determine the season is to look at the month.

So let's create a new feature called 'Month'.

In [None]:
#extract the month number (1-12) from the datetime column
rain['Month'] = rain['Date'].dt.month
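
The month-to-season grouping used in the rest of this kernel (Sep-Nov spring, Dec-Feb summer, Mar-May fall, Jun-Aug winter) could also be stored as an explicit column. This is a rough sketch. The `month_to_season` helper, the `Season` column and the toy dates are hypothetical names and data, not part of the original kernel:

```python
import pandas as pd

def month_to_season(month):
    """Map a month number (1-12) to a Southern Hemisphere season."""
    if month in (12, 1, 2):
        return 'summer'
    if month in (3, 4, 5):
        return 'fall'
    if month in (6, 7, 8):
        return 'winter'
    return 'spring'  # months 9, 10 and 11

# Hypothetical toy dates, one per season plus a December date
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-15', '2010-04-01', '2010-07-20',
                            '2010-10-05', '2010-12-31'])
})
toy['Month'] = toy['Date'].dt.month
toy['Season'] = toy['Month'].map(month_to_season)
print(toy)
```

Note that summer wraps around the year end, so December belongs with January and February.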

And let's double-check all our changes so far.

The RISK_MM, Evaporation, and Sunshine columns should no longer be there, and the Month column should be.

In [None]:
rain.head()

Now let's define a function to compute the seasonal numerical values. It takes a dataframe (which should already be narrowed down to a single location; more on this later) and the name of the column we're working with.

It separates the dataframe into seasons (spring, summer, fall and winter dataframes) and computes the mean value of that column for each season.

Note that since our dataframe has WindGustSpeed and Pressure values missing even at the location level, a seasonal mean can come out as NaN, so we must decide on default values for these features. If the computed mean is NaN, we use these defaults.

In [None]:
def compute_missing_seasonal_num_values(dataframe, column_name):
    #fallback values used when a season has no observed values at all
    defaults = {'WindGustSpeed': 5.4, 'Pressure9am': 1013.00, 'Pressure3pm': 1013.00}
    
    #separate to seasons; each comparison needs its own parentheses
    #because & binds more tightly than >= and <
    spring = dataframe[(dataframe['Month'] >= 9) & (dataframe['Month'] < 12)]
    sp_mean = spring[column_name].mean()
    if math.isnan(sp_mean):
        sp_mean = defaults[column_name]
    
    #summer wraps around the year end: December, January and February
    summer = dataframe[((dataframe['Month'] >= 1) & (dataframe['Month'] < 3)) | (dataframe['Month'] == 12)]
    sm_mean = summer[column_name].mean()
    if math.isnan(sm_mean):
        sm_mean = defaults[column_name]
    
    fall = dataframe[(dataframe['Month'] >= 3) & (dataframe['Month'] < 6)]
    fa_mean = fall[column_name].mean()
    if math.isnan(fa_mean):
        fa_mean = defaults[column_name]

    winter = dataframe[(dataframe['Month'] >= 6) & (dataframe['Month'] < 9)]
    wt_mean = winter[column_name].mean()
    if math.isnan(wt_mean):
        wt_mean = defaults[column_name]

    return sp_mean, sm_mean, fa_mean, wt_mean
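
When building masks like the ones above, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>=` and `<` in Python. A small sketch on a hypothetical `months` Series shows how a misplaced parenthesis changes which rows are selected:

```python
import pandas as pd

# Hypothetical month numbers standing in for the 'Month' column
months = pd.Series([1, 5, 9, 10, 11, 12])

# Correct: both comparisons are parenthesised before combining with &
spring_ok = months[(months >= 9) & (months < 12)]

# Misplaced parenthesis: this is evaluated as months >= (9 & (months < 12)),
# so the mask no longer means "between September and November"
spring_bad = months[(months) >= 9 & (months < 12)]

print(spring_ok.tolist())
print(spring_bad.tolist())
```

Comparing the two printed lists is a quick way to confirm that a seasonal mask selects only the months it should.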

I defined another function for filling the missing values on the dataframe itself. The function above only computes the values for each season.

In [None]:
def fill_missing_seasonal_num_values(dataframe, location, column_name):
    #restrict to one location, then compute that location's seasonal means
    df = dataframe[dataframe['Location'] == location]
    sp, sm, fa, wt = compute_missing_seasonal_num_values(df, column_name)

    #copy each seasonal slice and assign the result, so that the fill
    #is applied to a real dataframe rather than a temporary object
    sp_df = df[(df['Month'] >= 9) & (df['Month'] < 12)].copy()
    sp_df[column_name] = sp_df[column_name].fillna(sp)
    
    sm_df = df[((df['Month'] >= 1) & (df['Month'] < 3)) | (df['Month'] == 12)].copy()
    sm_df[column_name] = sm_df[column_name].fillna(sm)
    
    fa_df = df[(df['Month'] >= 3) & (df['Month'] < 6)].copy()
    fa_df[column_name] = fa_df[column_name].fillna(fa)
    
    wt_df = df[(df['Month'] >= 6) & (df['Month'] < 9)].copy()
    wt_df[column_name] = wt_df[column_name].fillna(wt)

    #stitch all four seasons back together
    return pd.concat([sp_df, sm_df, fa_df, wt_df])

Now let's fill up those missing numerical values.

In [None]:
cols = ['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm', 'Humidity9am', 'Humidity3pm',
        'Pressure9am', 'Pressure3pm', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm']
df = rain.copy()
for col in cols:
    dfs=[]
    for location in locations:
        dfs.append(fill_missing_seasonal_num_values(df.copy(), location, col))
    df = pd.concat(dfs)

df.head()
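
For comparison, the same location-and-season mean imputation can be written with `groupby(...).transform('mean')`, which avoids splitting and re-concatenating and keeps the original row order. This is an alternative sketch rather than the approach used in this kernel. The toy data and the `Season` column are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two locations and two seasons
toy = pd.DataFrame({
    'Location': ['Albury'] * 4 + ['Cobar'] * 4,
    'Season': ['summer', 'summer', 'winter', 'winter'] * 2,
    'MinTemp': [20.0, np.nan, 4.0, 6.0, 24.0, 26.0, np.nan, 8.0],
})

# Mean of each (Location, Season) group, broadcast back to every row
group_mean = toy.groupby(['Location', 'Season'])['MinTemp'].transform('mean')

# Only the NaN entries are replaced; observed values are left untouched
toy['MinTemp'] = toy['MinTemp'].fillna(group_mean)
print(toy)
```

A group with no observed values at all still yields NaN here, so a fallback value would still be needed for such cases.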

And let's check if it worked and set the main dataframe as the computed one.

In [None]:
df.isnull().sum()

In [None]:
rain = df
rain.isnull().sum()

'Rainfall' is the amount of rain that fell on the corresponding 'Date'. We assume that a missing value means it didn't rain at all, i.e. 0 mm of precipitation. Let's fill the missing values with zeroes.

In [None]:
rain['Rainfall'] = rain['Rainfall'].fillna(0)
rain.isnull().sum()

'RainToday' is filled up with the mode of the column.

In [None]:
rt_mode = rain['RainToday'].describe().top
rain['RainToday'] = rain['RainToday'].fillna(rt_mode)
rain.isnull().sum()

'Cloud9am' and 'Cloud3pm' are measured in oktas, from 0 (clear sky) to 8 (completely overcast). We fill the missing values with 0 oktas, which assumes a clear sky.

In [None]:
rain['Cloud9am'] = rain['Cloud9am'].fillna(0)
rain['Cloud3pm'] = rain['Cloud3pm'].fillna(0)

Now let's take care of the missing categorical values. Same as above, we define 2 functions: (1) one to compute the fill values; (2) one to fill the missing values. For the fill value I'm using the mode, and again we compute it per location and season.

In [None]:
def compute_missing_seasonal_cat_values(dataframe, column_name):
    #separate to seasons
    #describe().top is NaN (a float) when a season has no observed values,
    #in which case we fall back to the mode of the whole dataframe passed in
    
    spring = dataframe[(dataframe['Month'] >= 9) & (dataframe['Month'] < 12)]
    sp_mode = spring[column_name].describe().top
    if isinstance(sp_mode, float):
        sp_mode = dataframe[column_name].describe().top
    
    #summer wraps around the year end: December, January and February
    summer = dataframe[((dataframe['Month'] >= 1) & (dataframe['Month'] < 3)) | (dataframe['Month'] == 12)]
    sm_mode = summer[column_name].describe().top
    if isinstance(sm_mode, float):
        sm_mode = dataframe[column_name].describe().top

    fall = dataframe[(dataframe['Month'] >= 3) & (dataframe['Month'] < 6)]
    fa_mode = fall[column_name].describe().top
    if isinstance(fa_mode, float):
        fa_mode = dataframe[column_name].describe().top

    winter = dataframe[(dataframe['Month'] >= 6) & (dataframe['Month'] < 9)]
    wt_mode = winter[column_name].describe().top
    if isinstance(wt_mode, float):
        wt_mode = dataframe[column_name].describe().top
       
    return sp_mode, sm_mode, fa_mode, wt_mode

In [None]:
def fill_missing_seasonal_cat_values(dataframe, location, column_name):
    #restrict to one location, then compute that location's seasonal modes
    df = dataframe[dataframe['Location'] == location]
    sp, sm, fa, wt = compute_missing_seasonal_cat_values(df, column_name)

    #copy each seasonal slice and assign the result, so that the fill
    #is applied to a real dataframe rather than a temporary object
    sp_df = df[(df['Month'] >= 9) & (df['Month'] < 12)].copy()
    sp_df[column_name] = sp_df[column_name].fillna(sp)
    
    sm_df = df[((df['Month'] >= 1) & (df['Month'] < 3)) | (df['Month'] == 12)].copy()
    sm_df[column_name] = sm_df[column_name].fillna(sm)
    
    fa_df = df[(df['Month'] >= 3) & (df['Month'] < 6)].copy()
    fa_df[column_name] = fa_df[column_name].fillna(fa)
    
    wt_df = df[(df['Month'] >= 6) & (df['Month'] < 9)].copy()
    wt_df[column_name] = wt_df[column_name].fillna(wt)

    #stitch all four seasons back together
    return pd.concat([sp_df, sm_df, fa_df, wt_df])

In [None]:
cols = ['WindGustDir', 'WindDir9am', 'WindDir3pm']
df = rain.copy()
for col in cols:
    dfs=[]
    for location in locations:
        dfs.append(fill_missing_seasonal_cat_values(df.copy(), location, col))
    df = pd.concat(dfs)

df.head()

Again, as with the earlier problem, WindGustDir has missing values at the location level. I just filled these with the mode of the overall dataframe, considering that we've already dealt with the missing values of the other locations.

In [None]:
top = df['WindGustDir'].describe().top
df['WindGustDir'] = df['WindGustDir'].fillna(top)
df.isnull().sum()
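
The two-step fill (group mode first, overall mode as a fallback) can also be sketched with `groupby`. This is an illustrative alternative, not the code used in this kernel. The `group_mode` helper, the `Season` column and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the Cobar group has no observed direction at all
toy = pd.DataFrame({
    'Location': ['Albury'] * 4 + ['Cobar'] * 3,
    'Season': ['summer'] * 4 + ['winter'] * 3,
    'WindGustDir': ['W', 'W', 'NE', np.nan, np.nan, np.nan, np.nan],
})

def group_mode(series):
    """Most frequent non-null value, or NaN if the group has none."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Step 1: fill with the mode of each (Location, Season) group
mode_per_group = toy.groupby(['Location', 'Season'])['WindGustDir'].transform(group_mode)
toy['WindGustDir'] = toy['WindGustDir'].fillna(mode_per_group)

# Step 2: groups with no observed value stay NaN, so fall back to the overall mode
overall_mode = toy['WindGustDir'].mode().iloc[0]
filled = toy['WindGustDir'].fillna(overall_mode)
print(filled.tolist())
```

Keeping the two steps separate makes it easy to see how many values needed the coarser, dataset-wide fallback.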

In [None]:
rain = df

And final check...

In [None]:
rain.isnull().sum()

So now we can save this cleaned dataframe as a new csv file.

In [None]:
rain.to_csv('cleaned_weatherAUS.csv')
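
One thing to keep in mind when reloading the saved file: by default `to_csv` also writes the index as an unnamed first column, and dates come back as strings unless `parse_dates` is given. Here is a small round-trip sketch using an in-memory buffer instead of a real file (the toy data is hypothetical):

```python
import io
import pandas as pd

# Hypothetical toy frame with a shuffled index, as left behind by pd.concat
toy = pd.DataFrame(
    {'Date': pd.to_datetime(['2010-01-15', '2010-04-01']), 'Rainfall': [0.0, 1.2]},
    index=[7, 3],
)

# index=False leaves the row labels out of the CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV back, asking pandas to parse 'Date' as datetimes again
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['Date'])
print(restored.dtypes)
```

Whether to keep the index depends on whether the original row numbers are needed downstream.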