 [Click here to download dataset](https://kaggle.com/jsphyg/weather-dataset-rattle-package)

In [1]:
# import libraries
import numpy as np # linear algebra
import pandas as pd # data preprocessing , CSV file I/O

# import libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# import dataset
df = pd.read_csv("weatherAUS.csv")

In [3]:
# drop row in which target column value is missing
df.dropna(subset=['RainTomorrow'], inplace=True)

##### Feature Engineering of Date Variable

In [4]:
# parse the dates, currently coded as strings, into datetime format
df['Date'] = pd.to_datetime(df['Date'])

In [5]:
# extract year from date
df['Year'] = df['Date'].dt.year
df['Year'].head()

0    2008
1    2008
2    2008
3    2008
4    2008
Name: Year, dtype: int64

In [6]:
# extract month from date
df['Month'] = df['Date'].dt.year
df['Month'].head()

0    2008
1    2008
2    2008
3    2008
4    2008
Name: Month, dtype: int64

In [7]:
# extract day from date
df['Day'] = df['Date'].dt.year
df['Day'].head()

0    2008
1    2008
2    2008
3    2008
4    2008
Name: Day, dtype: int64

In [8]:
# again view the summary of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142193 entries, 0 to 145458
Data columns (total 26 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   Date           142193 non-null  datetime64[ns]
 1   Location       142193 non-null  object        
 2   MinTemp        141556 non-null  float64       
 3   MaxTemp        141871 non-null  float64       
 4   Rainfall       140787 non-null  float64       
 5   Evaporation    81350 non-null   float64       
 6   Sunshine       74377 non-null   float64       
 7   WindGustDir    132863 non-null  object        
 8   WindGustSpeed  132923 non-null  float64       
 9   WindDir9am     132180 non-null  object        
 10  WindDir3pm     138415 non-null  object        
 11  WindSpeed9am   140845 non-null  float64       
 12  WindSpeed3pm   139563 non-null  float64       
 13  Humidity9am    140419 non-null  float64       
 14  Humidity3pm    138583 non-null  float64       
 15  

In [9]:
# drop the original Date variable
df.drop('Date', axis=1, inplace = True)

In [10]:
df.head()

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow,Year,Month,Day
0,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,...,1007.1,8.0,,16.9,21.8,No,No,2008,2008,2008
1,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,...,1007.8,,,17.2,24.3,No,No,2008,2008,2008
2,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,...,1008.7,,2.0,21.0,23.2,No,No,2008,2008,2008
3,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,...,1012.8,,,18.1,26.5,No,No,2008,2008,2008
4,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,...,1006.0,7.0,8.0,17.8,29.7,No,No,2008,2008,2008


#### Declare feature vector and target variable

In [11]:
X = df.drop(['RainTomorrow'], axis=1)

y = df['RainTomorrow']

In [12]:
y.isnull().sum()

0

In [13]:
# Split data into separate training and test set
# split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
# check the shape of X_train and X_test
X_train.shape, X_test.shape

((113754, 24), (28439, 24))

#### Feature Engineering 
    Feature Engineering is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.

    First, I will display the categorical and numerical variables again separately.

In [15]:
# check data types in X_train
X_train.dtypes

Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
Year               int64
Month              int64
Day                int64
dtype: object

In [16]:
# display categorical variables

categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

In [17]:
# display numerical variables

numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

['MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Cloud9am',
 'Cloud3pm',
 'Temp9am',
 'Temp3pm',
 'Year',
 'Month',
 'Day']

In [18]:
# check missing values in numerical variables in X_train

X_train[numerical].isnull().sum()

MinTemp            525
MaxTemp            268
Rainfall          1182
Evaporation      48791
Sunshine         54345
WindGustSpeed     7406
WindSpeed9am      1083
WindSpeed3pm      2109
Humidity9am       1420
Humidity3pm       2913
Pressure9am      11257
Pressure3pm      11225
Cloud9am         43041
Cloud3pm         45767
Temp9am            736
Temp3pm           2206
Year                 0
Month                0
Day                  0
dtype: int64

In [19]:
# check missing values in numerical variables in X_test

X_test[numerical].isnull().sum()

MinTemp            112
MaxTemp             54
Rainfall           224
Evaporation      12052
Sunshine         13471
WindGustSpeed     1864
WindSpeed9am       265
WindSpeed3pm       521
Humidity9am        354
Humidity3pm        697
Pressure9am       2757
Pressure3pm       2756
Cloud9am         10616
Cloud3pm         11327
Temp9am            168
Temp3pm            520
Year                 0
Month                0
Day                  0
dtype: int64

In [20]:
# print percentage of missing values in the numerical variables in training set

for col in numerical:
    if X_train[col].isnull().mean()>0:
        print(col, round(X_train[col].isnull().mean(),4))

MinTemp 0.0046
MaxTemp 0.0024
Rainfall 0.0104
Evaporation 0.4289
Sunshine 0.4777
WindGustSpeed 0.0651
WindSpeed9am 0.0095
WindSpeed3pm 0.0185
Humidity9am 0.0125
Humidity3pm 0.0256
Pressure9am 0.099
Pressure3pm 0.0987
Cloud9am 0.3784
Cloud3pm 0.4023
Temp9am 0.0065
Temp3pm 0.0194


In [21]:
# impute missing values in X_train and X_test with respective column median in X_train
for df1 in [X_train, X_test]:
    for col in numerical:
        col_median = X_train[col].median()
        df1[col].fillna(col_median,inplace=True);

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the do

In [22]:
# check again missing values in numerical variables in X_train
X_train[numerical].isnull().sum()

MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
Year             0
Month            0
Day              0
dtype: int64

In [23]:
# check missing values in numerical variables in X_test

X_test[numerical].isnull().sum()

MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
Year             0
Month            0
Day              0
dtype: int64

Now, we can see that there are no missing values in the numerical columns of training and test set.

In [24]:
# print percentage of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()

Location       0.000000
WindGustDir    0.065563
WindDir9am     0.070644
WindDir3pm     0.026575
RainToday      0.010391
dtype: float64

In [25]:
# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))

WindGustDir 0.06556252966928636
WindDir9am 0.07064366967315434
WindDir3pm 0.02657488967420926
RainToday 0.010390843398913444


In [26]:
# impute missing categorical variables with most frequent value

for df2 in [X_train, X_test]:
    df2['WindGustDir'].fillna(X_train['WindGustDir'].mode()[0], inplace=True)
    df2['WindDir9am'].fillna(X_train['WindDir9am'].mode()[0], inplace=True)
    df2['WindDir3pm'].fillna(X_train['WindDir3pm'].mode()[0], inplace=True)
    df2['RainToday'].fillna(X_train['RainToday'].mode()[0], inplace=True)

In [27]:
# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()

Location       0
WindGustDir    0
WindDir9am     0
WindDir3pm     0
RainToday      0
dtype: int64

In [28]:
# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()

Location       0
WindGustDir    0
WindDir9am     0
WindDir3pm     0
RainToday      0
dtype: int64

In [29]:
# check missing values in X_train

X_train.isnull().sum()

Location         0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
Year             0
Month            0
Day              0
dtype: int64

In [30]:
# check missing values in X_test

X_test.isnull().sum()

Location         0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
Year             0
Month            0
Day              0
dtype: int64

#### Engineering outliers in numerical variables 
    We have seen that the Rainfall, Evaporation, WindSpeed9am and WindSpeed3pm columns contain outliers. I will use top-coding approach to cap maximum values and remove outliers from the above variables.

In [31]:
def max_value(df3, variable, top):
    return np.where(df3[variable]>top, top, df3[variable])

for df3 in [X_train, X_test]:
    df3['Rainfall'] = max_value(df3, 'Rainfall', 3.2)
    df3['Evaporation'] = max_value(df3, 'Evaporation', 21.8)
    df3['WindSpeed9am'] = max_value(df3, 'WindSpeed9am', 55)
    df3['WindSpeed3pm'] = max_value(df3, 'WindSpeed3pm', 57)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user

In [32]:
X_train.Rainfall.max(), X_test.Rainfall.max()

(3.2, 3.2)

In [33]:
X_train.Evaporation.max(), X_test.Evaporation.max()

(21.8, 21.8)

In [34]:
X_train.WindSpeed9am.max(), X_test.WindSpeed9am.max()

(55.0, 55.0)

In [35]:
X_train.WindSpeed3pm.max(), X_test.WindSpeed3pm.max()

(57.0, 57.0)

In [36]:
X_train[numerical].describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,Year,Month,Day
count,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0
mean,12.175225,23.221349,0.67701,5.143557,7.993989,39.895063,13.980124,18.629719,68.839434,51.480317,1017.646309,1015.252305,4.647652,4.701452,16.98134,21.6707,2012.757802,2012.757802,2012.757802
std,6.384019,7.109859,1.185327,2.814837,2.758049,13.127684,8.815668,8.691237,18.944168,20.531492,6.750345,6.683925,2.29266,2.118964,6.470597,6.87257,2.541504,2.541504,2.541504
min,-8.5,-4.8,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,980.5,978.2,0.0,0.0,-7.2,-5.4,2007.0,2007.0,2007.0
25%,7.6,17.9,0.0,4.0,8.2,31.0,7.0,13.0,57.0,37.0,1013.5,1011.0,3.0,4.0,12.3,16.7,2011.0,2011.0,2011.0
50%,12.0,22.6,0.0,4.8,8.4,39.0,13.0,19.0,70.0,52.0,1017.6,1015.2,5.0,5.0,16.7,21.1,2013.0,2013.0,2013.0
75%,16.8,28.2,0.6,5.4,8.7,46.0,19.0,24.0,83.0,65.0,1021.8,1019.4,6.0,6.0,21.5,26.3,2015.0,2015.0,2015.0
max,33.9,48.1,3.2,21.8,14.5,135.0,55.0,57.0,100.0,100.0,1041.0,1038.9,8.0,9.0,40.2,46.7,2017.0,2017.0,2017.0


We can now see that the outliers in Rainfall, Evaporation, WindSpeed9am and WindSpeed3pm columns are capped.

#### Encode categorical variables

In [37]:
# print categorical variables
categorical

['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

In [38]:
X_train.head()

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,Year,Month,Day
18401,NorahHead,15.1,23.9,0.0,4.8,8.4,SSW,67.0,NW,W,...,1001.9,1002.4,5.0,5.0,19.8,14.3,No,2009,2009,2009
127797,Walpole,9.7,14.2,3.2,4.8,8.4,WSW,50.0,WNW,W,...,1008.2,1007.7,5.0,5.0,11.1,13.4,Yes,2011,2011,2011
40012,Williamtown,13.2,25.4,0.0,3.2,8.8,ENE,30.0,W,E,...,1025.2,1021.5,6.0,5.0,21.2,24.0,No,2010,2010,2010
130914,Hobart,7.6,14.8,0.0,4.0,7.0,WNW,94.0,WNW,WNW,...,1004.6,1001.4,5.0,5.0,11.1,12.9,No,2011,2011,2011
41742,Williamtown,12.9,22.2,0.0,4.0,7.9,S,37.0,SW,SSE,...,1023.0,1021.2,6.0,2.0,18.8,20.6,No,2015,2015,2015


In [39]:
# encode RainToday variable

import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['RainToday'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [40]:
X_train.head()

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday_0,RainToday_1,Year,Month,Day
18401,NorahHead,15.1,23.9,0.0,4.8,8.4,SSW,67.0,NW,W,...,1002.4,5.0,5.0,19.8,14.3,0,1,2009,2009,2009
127797,Walpole,9.7,14.2,3.2,4.8,8.4,WSW,50.0,WNW,W,...,1007.7,5.0,5.0,11.1,13.4,1,0,2011,2011,2011
40012,Williamtown,13.2,25.4,0.0,3.2,8.8,ENE,30.0,W,E,...,1021.5,6.0,5.0,21.2,24.0,0,1,2010,2010,2010
130914,Hobart,7.6,14.8,0.0,4.0,7.0,WNW,94.0,WNW,WNW,...,1001.4,5.0,5.0,11.1,12.9,0,1,2011,2011,2011
41742,Williamtown,12.9,22.2,0.0,4.0,7.9,S,37.0,SW,SSE,...,1021.2,6.0,2.0,18.8,20.6,0,1,2015,2015,2015


In [41]:
X_train = pd.concat([X_train[numerical], X_train[['RainToday_0', 'RainToday_1']],
                     pd.get_dummies(X_train.Location), 
                     pd.get_dummies(X_train.WindGustDir),
                     pd.get_dummies(X_train.WindDir9am),
                     pd.get_dummies(X_train.WindDir3pm)], axis=1)

In [42]:
X_train.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,NNW,NW,S,SE,SSE,SSW,SW,W,WNW,WSW
18401,15.1,23.9,0.0,4.8,8.4,67.0,19.0,22.0,38.0,68.0,...,0,0,0,0,0,0,0,1,0,0
127797,9.7,14.2,3.2,4.8,8.4,50.0,15.0,28.0,91.0,56.0,...,0,0,0,0,0,0,0,1,0,0
40012,13.2,25.4,0.0,3.2,8.8,30.0,6.0,17.0,79.0,63.0,...,0,0,0,0,0,0,0,0,0,0
130914,7.6,14.8,0.0,4.0,7.0,94.0,30.0,35.0,52.0,45.0,...,0,0,0,0,0,0,0,0,1,0
41742,12.9,22.2,0.0,4.0,7.9,37.0,15.0,20.0,69.0,52.0,...,0,0,0,0,1,0,0,0,0,0


In [43]:
# Similarly, I will create the X_test testing set.
X_test = pd.concat([X_test[numerical], X_test[['RainToday_0', 'RainToday_1']],
                     pd.get_dummies(X_test.Location), 
                     pd.get_dummies(X_test.WindGustDir),
                     pd.get_dummies(X_test.WindDir9am),
                     pd.get_dummies(X_test.WindDir3pm)], axis=1)

In [44]:
X_test.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,NNW,NW,S,SE,SSE,SSW,SW,W,WNW,WSW
57760,7.1,13.0,3.2,4.8,8.4,41.0,24.0,22.0,100.0,98.0,...,0,0,0,0,0,0,0,0,1,0
127128,13.2,18.3,0.0,4.8,8.4,48.0,24.0,20.0,73.0,73.0,...,0,0,0,0,0,0,0,0,0,0
119994,9.2,22.7,0.0,5.0,11.1,52.0,26.0,20.0,45.0,25.0,...,0,0,0,0,0,0,0,0,0,0
7088,15.3,26.1,0.0,10.4,8.4,44.0,24.0,19.0,48.0,40.0,...,0,0,0,0,0,0,0,0,0,0
62992,11.9,31.8,0.0,5.0,4.1,72.0,6.0,19.0,89.0,25.0,...,0,0,0,0,0,0,0,0,0,0


In [45]:
X_test.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,NNW,NW,S,SE,SSE,SSW,SW,W,WNW,WSW
57760,7.1,13.0,3.2,4.8,8.4,41.0,24.0,22.0,100.0,98.0,...,0,0,0,0,0,0,0,0,1,0
127128,13.2,18.3,0.0,4.8,8.4,48.0,24.0,20.0,73.0,73.0,...,0,0,0,0,0,0,0,0,0,0
119994,9.2,22.7,0.0,5.0,11.1,52.0,26.0,20.0,45.0,25.0,...,0,0,0,0,0,0,0,0,0,0
7088,15.3,26.1,0.0,10.4,8.4,44.0,24.0,19.0,48.0,40.0,...,0,0,0,0,0,0,0,0,0,0
62992,11.9,31.8,0.0,5.0,4.1,72.0,6.0,19.0,89.0,25.0,...,0,0,0,0,0,0,0,0,0,0


We now have training and testing set ready for model building. Before that, we should map all the feature variables onto the same scale. It is called feature scaling. I will do it as follows.

#### Feature Scaling

In [46]:
X_train.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,NNW,NW,S,SE,SSE,SSW,SW,W,WNW,WSW
count,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,...,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0
mean,12.175225,23.221349,0.67701,5.143557,7.993989,39.895063,13.980124,18.629719,68.839434,51.480317,...,0.05431,0.059699,0.067567,0.101183,0.065114,0.056534,0.064481,0.069562,0.060763,0.065694
std,6.384019,7.109859,1.185327,2.814837,2.758049,13.127684,8.815668,8.691237,18.944168,20.531492,...,0.22663,0.236929,0.251002,0.301573,0.246728,0.230952,0.245609,0.254409,0.238896,0.247748
min,-8.5,-4.8,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7.6,17.9,0.0,4.0,8.2,31.0,7.0,13.0,57.0,37.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,12.0,22.6,0.0,4.8,8.4,39.0,13.0,19.0,70.0,52.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,16.8,28.2,0.6,5.4,8.7,46.0,19.0,24.0,83.0,65.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,33.9,48.1,3.2,21.8,14.5,135.0,55.0,57.0,100.0,100.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [47]:
cols = X_train.columns

In [48]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

In [49]:
X_train = pd.DataFrame(X_train,columns=[cols])
X_test = pd.DataFrame(X_test,columns=[cols])


In [50]:
X_train.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,NNW,NW,S,SE,SSE,SSW,SW,W,WNW,WSW
0,0.556604,0.542533,0.0,0.220183,0.57931,0.472868,0.345455,0.385965,0.38,0.68,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.429245,0.359168,1.0,0.220183,0.57931,0.341085,0.272727,0.491228,0.91,0.56,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.511792,0.570888,0.0,0.146789,0.606897,0.186047,0.109091,0.298246,0.79,0.63,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.379717,0.37051,0.0,0.183486,0.482759,0.682171,0.545455,0.614035,0.52,0.45,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.504717,0.510397,0.0,0.183486,0.544828,0.24031,0.272727,0.350877,0.69,0.52,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [51]:
X_test.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,NNW,NW,S,SE,SSE,SSW,SW,W,WNW,WSW
0,0.367925,0.336484,1.0,0.220183,0.57931,0.271318,0.436364,0.385965,1.0,0.98,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.511792,0.436673,0.0,0.220183,0.57931,0.325581,0.436364,0.350877,0.73,0.73,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.417453,0.519849,0.0,0.229358,0.765517,0.356589,0.472727,0.350877,0.45,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.561321,0.584121,0.0,0.477064,0.57931,0.294574,0.436364,0.333333,0.48,0.4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.481132,0.691871,0.0,0.229358,0.282759,0.511628,0.109091,0.333333,0.89,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [52]:
X_train.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,NNW,NW,S,SE,SSE,SSW,SW,W,WNW,WSW
count,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,...,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0
mean,0.487623,0.529704,0.211566,0.235943,0.55131,0.262752,0.254184,0.326837,0.688394,0.514803,...,0.05431,0.059699,0.067567,0.101183,0.065114,0.056534,0.064481,0.069562,0.060763,0.065694
std,0.150566,0.134402,0.370415,0.129121,0.19021,0.101765,0.160285,0.152478,0.189442,0.205315,...,0.22663,0.236929,0.251002,0.301573,0.246728,0.230952,0.245609,0.254409,0.238896,0.247748
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.379717,0.429112,0.0,0.183486,0.565517,0.193798,0.127273,0.22807,0.57,0.37,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.483491,0.517958,0.0,0.220183,0.57931,0.255814,0.236364,0.333333,0.7,0.52,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.596698,0.623819,0.1875,0.247706,0.6,0.310078,0.345455,0.421053,0.83,0.65,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


We now have X_train dataset ready to be fed into the Logistic Regression classifier.

In [53]:
y_train.isnull().sum()

0

In [54]:
X_train.isnull().sum()

MinTemp        0
MaxTemp        0
Rainfall       0
Evaporation    0
Sunshine       0
              ..
SSW            0
SW             0
W              0
WNW            0
WSW            0
Length: 118, dtype: int64

## Saving Processed Data to Disk

It can be useful to save processed data to disk, especially for really large datasets, to avoid repeating the preprocessing steps every time you start the Jupyter notebook. The parquet format is a fast and efficient format for saving and loading Pandas dataframes.

In [55]:
print('X_traim:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_train.shape)
print('y_test:', y_test.shape)

X_traim: (113754, 118)
y_train: (113754,)
X_test: (113754, 118)
y_test: (28439,)


In [57]:
X_train.columns

MultiIndex([(      'MinTemp',),
            (      'MaxTemp',),
            (     'Rainfall',),
            (  'Evaporation',),
            (     'Sunshine',),
            ('WindGustSpeed',),
            ( 'WindSpeed9am',),
            ( 'WindSpeed3pm',),
            (  'Humidity9am',),
            (  'Humidity3pm',),
            ...
            (          'NNW',),
            (           'NW',),
            (            'S',),
            (           'SE',),
            (          'SSE',),
            (          'SSW',),
            (           'SW',),
            (            'W',),
            (          'WNW',),
            (          'WSW',)],
           length=118)

In [58]:
X_train.to_csv('train_inputs.csv')
X_test.to_csv('test_inputs.csv')

In [71]:
y_train.to_csv('train_targets.csv')
y_test.to_csv('test_targets.csv')

#### Model training 

In [59]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression


# instantiate the model
logreg = LogisticRegression(solver='liblinear', random_state=42)


# fit the model
logreg.fit(X_train, y_train)

LogisticRegression(random_state=42, solver='liblinear')

#### Predict results

In [60]:
y_pred_test = logreg.predict(X_test)
y_pred_test

array(['Yes', 'No', 'No', ..., 'No', 'No', 'No'], dtype=object)

In [61]:
# probability of getting output as 0 - no rain
logreg.predict_proba(X_test)

array([[0.15250108, 0.84749892],
       [0.711425  , 0.288575  ],
       [0.98292875, 0.01707125],
       ...,
       [0.9862669 , 0.0137331 ],
       [0.95182691, 0.04817309],
       [0.95407329, 0.04592671]])

#### Check accuracy score

In [62]:
from sklearn.metrics import accuracy_score
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred_test)))

Model accuracy score: 0.8455


Here, y_test are the true class labels and y_pred_test are the predicted class labels in the test-set.

#### Compare the train-set and test-set accuracy
 Now, I will compare the train-set and test-set accuracy to check for overfitting.

In [63]:
y_pred_train = logreg.predict(X_train)
y_pred_train
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

Model accuracy score: 0.8485


#### Check for overfitting and underfitting 
     

In [64]:
# print the scores on training and test set
print('Training set score: {:.4f}'.format(logreg.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(logreg.score(X_test, y_test)))

Training set score: 0.8485
Test set score: 0.8455


The training-set accuracy score is 0.8485 while the test-set accuracy to be 0.8455. These two values are quite comparable. So, there is no question of overfitting.

In Logistic Regression, we use default value of C = 1. It provides good performance with approximately 85% accuracy on both the training and the test set. But the model performance on both the training and test set are very comparable. It is likely the case of underfitting.

I will increase C and fit a more flexible model.

In [65]:
# fit the Logsitic Regression model with C=100

# instantiate the model

model100 = LogisticRegression(C=100, solver='liblinear', random_state=42)

# fit the model
model100.fit(X_train,y_train)

LogisticRegression(C=100, random_state=42, solver='liblinear')

In [66]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(model100.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(model100.score(X_test, y_test)))

Training set score: 0.8485
Test set score: 0.8454


We can see that, C=100 results in higher test set accuracy and also a slightly increased training set accuracy. So, we can conclude that a more complex model should perform better.

Now, I will investigate, what happens if we use more regularized model than the default value of C=1, by setting C=0.01.

In [67]:
# fit the Logsitic Regression model with C=001

# instantiate the model
model001 = LogisticRegression(C=0.01, solver='liblinear', random_state=42)


# fit the model
model001.fit(X_train, y_train)

LogisticRegression(C=0.01, random_state=42, solver='liblinear')

In [68]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(model001.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(model001.score(X_test, y_test)))

Training set score: 0.8426
Test set score: 0.8390


So, if we use more regularized model by setting C=0.01, then both the training and test set accuracy decrease relative to the default parameters.

#### Compare model accuracy with null accuracy 

So, the model accuracy is 0.8501. But, we cannot say that our model is very good based on the above accuracy. We must compare it with the null accuracy. Null accuracy is the accuracy that could be achieved by always predicting the most frequent class.

In [69]:
# So, we should first check the class distribution in the test set.
# check class distribution in test set
y_test.value_counts()

No     22098
Yes     6341
Name: RainTomorrow, dtype: int64

We can see that the occurences of most frequent class is 22067. So, we can calculate null accuracy by dividing 22067 by total number of occurences.

In [70]:
# check null accuracy score

null_accuracy = (22067/(22067+6372))

print('Null accuracy score: {0:0.4f}'. format(null_accuracy))

Null accuracy score: 0.7759


We can see that our model accuracy score is 0.8455 but null accuracy score is 0.7759. So, we can conclude that our Logistic Regression model is doing a very good job in predicting the class labels.

Now, based on the above analysis we can conclude that our classification model accuracy is very good. Our model is doing a very good job in terms of predicting the class labels.

But, it does not give the underlying distribution of values. Also, it does not tell anything about the type of errors our classifer is making.

We have another tool called Confusion matrix that comes to our rescue.