# Lab 3 : Clustering

### Group 3 - Members:

_Tai Chowdhury_<br>
_Apurv Mittal_<br>
_Ravi Sivaraman_<br>
_Seemant Srivastava_<br>


## Business Understanding 1

Weather prediction has been of interest for ages because it affects our day-to-day lives in many ways. The ability to predict with high accuracy whether it is going to rain today or tomorrow, and how much, helps us plan our day better and take precautions if needed.

We acquired the Australian Weather dataset from the Kaggle portal. It contains 10 years of daily weather observations collected from many locations across Australia. There are 145,460 observations with 23 attributes. These attributes describe temperature, wind, cloud, pressure, and humidity conditions. Many of the numeric measurements are recorded both in the morning (9 am) and in the afternoon (3 pm).

This dataset can be useful for scientific weather reporting and analysis projects for the respective country's regions. These projects can provide solutions to weather prediction problems. For our project, we have chosen `RainTomorrow` (categorical) and `Rainfall` (continuous) as response variables. `RainTomorrow` is a categorical attribute which indicates whether it is going to rain tomorrow - yes or no. `Rainfall` is a continuous attribute that measures the amount of rainfall each location received (in mm). Using our models, we aim to design an algorithm that can help the bureau predict rainfall for different regions in Australia.

We will measure the accuracy and effectiveness of our model for the categorical variable `RainTomorrow` using 10-fold cross validation, evaluated with confusion-matrix measurements such as sensitivity, specificity and accuracy. We can use Logistic Regression, Random Forest and other parametric and non-parametric models, compare their effectiveness, and determine the most appropriate model for prediction.
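As a rough illustration of this evaluation scheme, the sketch below runs 10-fold cross validation on synthetic data and derives sensitivity, specificity and accuracy from the confusion matrix. The data from `make_classification`, the class balance, and the names `X_demo` and `y_demo` are assumptions made for the example; they are not the weather data or the final model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the weather features and the RainTomorrow label
# (assumption: an imbalanced binary target, purely for illustration).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.78, 0.22], random_state=0
)

# Collect out-of-fold predictions from 10-fold stratified cross validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

# For binary labels {0, 1} the confusion matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_demo, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # true positive rate (recall)
specificity = tn / (tn + fp)                 # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} accuracy={accuracy:.3f}")
```

Because the rainy class is the minority, accuracy alone can look good even when sensitivity is poor, which is why the three measures are reported together.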

Similarly, we will predict `Rainfall` (in mm), which is a continuous variable, using a regression model. We will measure its effectiveness using 10-fold cross validation with RMSE (Root Mean Square Error).
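RMSE is the square root of the mean squared difference between observed and predicted values. A minimal sketch of a 10-fold RMSE estimate is shown below, assuming synthetic data from `make_regression` and a plain `LinearRegression`; both are placeholders rather than the model chosen in this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the features and the Rainfall target (assumption).
X_reg, y_reg = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)

# scikit-learn reports the negated MSE so that "higher is better" holds for all scorers.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
neg_mse = cross_val_score(
    LinearRegression(), X_reg, y_reg, cv=cv, scoring="neg_mean_squared_error"
)

# Flip the sign and take the square root to get one RMSE value per fold.
rmse_per_fold = np.sqrt(-neg_mse)
print("mean RMSE over 10 folds:", rmse_per_fold.mean())
```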

Once the machine learning model is built, we can test and measure its validity in other geographies rather than confining it to Australia.



Source: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

## Data Understanding 1

In [None]:
# Import libraries

import pandas as pd
import numpy as np
import seaborn as sns  # used by the boxplots, count plots and heatmap below
import matplotlib.pyplot as plt  # used to create the figures below
import plotly.graph_objs as go
from scipy import stats
import warnings
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame
import plotly.express as px

In [None]:
# Read the Australia weather data
df = pd.read_csv("weatherAUS.csv")

In [None]:
#  View the top rows of the data imported
df.head()

Data imported successfully. We can view all the variables and the top rows above. It is visible that there are several null values, and we need to decide how to accommodate the missing information. As we go along, we will describe the approach we adopted to handle the scenarios with missing information.

In [None]:
# A quick look at the variables and the data type
df.info()

Below are the descriptions for all 23 attributes for our dataset:

| Name | Type | Description |
| --- | --- | --- |
| `Date` | Date | The date of observation. |
| `Location` | Nominal | The name of the location of the weather station. |
| `MinTemp` | float64 | Minimum temperature in the 24 hours to 9 am (in degrees Celsius). |
| `MaxTemp` | float64 | Maximum temperature in the 24 hours from 9 am (in degrees Celsius). |
| `Rainfall` | float64 | Precipitation (rainfall) in the 24 hours to 9 am (in mm). |
| `Evaporation` | float64 | "Class A" pan evaporation in the 24 hours to 9 am (in mm). |
| `Sunshine` | float64 | Bright sunshine in the 24 hours to midnight (in hours). |
| `WindGustDir` | Nominal | Direction of the strongest gust in the 24 hours to midnight. |
| `WindGustSpeed` | float64 | Speed of the strongest wind gust in the 24 hours to midnight (km/h). |
| `WindDir9am` | Nominal | Wind direction averaged over the 10 minutes prior to 9 am. |
| `WindDir3pm` | Nominal | Wind direction averaged over the 10 minutes prior to 3 pm. |
| `WindSpeed9am` | float64 | Wind speed averaged over the 10 minutes prior to 9 am (km/h). |
| `WindSpeed3pm` | float64 | Wind speed averaged over the 10 minutes prior to 3 pm (km/h). |
| `Humidity9am` | float64 | Relative humidity at 9 am (in percent). |
| `Humidity3pm` | float64 | Relative humidity at 3 pm (in percent). |
| `Pressure9am` | float64 | Atmospheric pressure reduced to mean sea level at 9 am (hectopascals). |
| `Pressure3pm` | float64 | Atmospheric pressure reduced to mean sea level at 3 pm (hectopascals). |
| `Cloud9am` | float64 | Fraction of sky obscured by cloud at 9 am (eighths). |
| `Cloud3pm` | float64 | Fraction of sky obscured by cloud at 3 pm (eighths). |
| `Temp9am` | float64 | Temperature at 9 am (in degrees Celsius). |
| `Temp3pm` | float64 | Temperature at 3 pm (in degrees Celsius). |
| `RainToday` | Nominal | Whether it rained on the current day - Yes or No. |
| `RainTomorrow` | Nominal | Whether there will be rainfall tomorrow - Yes or No. |


Source: http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml

### Data Quality

In [None]:
# summarize the dataset with statistical summary of numeric "float" variables

df.describe().transpose()

We ran summary statistics on the imported dataset. The table shows the statistical summary for the "float" (numeric) variables. We see some large variations in the dataset: `Evaporation` ranges from 0 to 145, and wind gust speed varies from 6 km/h to 135 km/h. These are large ranges, but are they invalid data or genuine outliers? We will investigate that in the later sections.

In [None]:
# Count of data types

df.dtypes.value_counts()

In [None]:
# Check for duplicates
df.duplicated().sum()

We ran a duplicate check and we identify there are no duplicates in our dataset which means we don't need to take any action to reduce the impact of duplicate data.

In [None]:
# Check for null values

df.isnull().sum()

We can see there are many missing values across the variables in our dataset. Some variables, such as `Evaporation` and `Sunshine`, stand out for the number of missing values. We will continue to investigate further.

In [None]:
# Number of total records
len(df)

We have a total of 145,460 records, including those with missing data. This count is useful in understanding the magnitude of missing data. What percentage of the data is actually missing? We find out below:

In [None]:
# List the percentage of missing information

(df.isnull().sum()/len(df)*100).sort_values(ascending=True)

We listed the missing data in ascending order to understand what percentage of data is missing. This will help us in determining the most appropriate action we can take to handle the missing information. 

As seen above, there are 6 variables with more than 10% missing data. `Sunshine`, `Evaporation`, `Cloud3pm`, and `Cloud9am` have the most missing data, in that order. With more than 38% of their values missing, we have to decide how to impute them. If we delete the rows with missing values, we will lose a lot of important and pertinent information, which is not desirable.

However, before we impute anything, we also notice that `RainToday` and `RainTomorrow` have about equal amounts of missing data, but the percentage is not very high: it is under 2.5%. Since `RainTomorrow` is one of our response variables, we don't want to impute its values based on assumptions, as that may distort the relationships in the data and make our models less reliable.

With that in mind, we first delete the rows with missing `RainToday` or `RainTomorrow` values, as shown below.

In [None]:
# Removing records which are blank for Rain today and Rain tomorrow

df.dropna(subset = ["RainToday"], inplace=True)
df.dropna(subset = ["RainTomorrow"], inplace=True)

# REFERENCE: https://www.kite.com/python/answers/how-to-drop-empty-rows-from-a-pandas-dataframe-in-python

As explained above, we decided to drop the records with missing (null) data for RainToday and RainTomorrow variables which is under 2.5% of the total dataset.

In [None]:
# Check the null values again
(df.isnull().sum()/len(df)*100).sort_values(ascending=True)

A quick look at the percentage of missing data after deletion of the missing rows for RainToday and RainTomorrow confirms the data got deleted successfully.

In [None]:
# Separate the data into categorical and numeric

df_num = df.columns[df.dtypes == 'float64']
df_cat=df.columns[df.dtypes == 'object']
print("Numeric Variables:", df_num)
print("Categorical Variables:", df_cat)

In [None]:
df[df_num].groupby([df['RainToday'],df['RainTomorrow']]).mean()

Since rainfall is the focus of this study, we decided to check the mean of all numeric variables grouped by the values of `RainToday` and `RainTomorrow`. We believe that rain is a very significant weather event, and many other variations in the weather happen on account of it, so it is appropriate to check how the variable means differ depending on whether it rains or not.

As expected, we notice that the variation is significant among the variables depending on the rain event. For example, `Humidity` varies significantly (particularly in the afternoon) between days with rain today or tomorrow and days with no rain at all. Similarly, cloud cover also shows a significant variation.

We will closely analyze `Evaporation`, `Sunshine`, `Cloud9am`, and `Cloud3pm`, as these variables have the highest number of missing values. We need to determine whether it is safe to impute the missing values with the mean of each variable or whether we should take a different approach.

In [None]:
# Number of null for Evaporation by the RainToday And Rain Tomorrow
df_E = df.Evaporation.isnull().groupby([df['RainToday'],df['RainTomorrow']]).sum()
df_E_mean = df.Evaporation.groupby([df['RainToday'],df['RainTomorrow']]).mean()
print('Number of Nulls in Evaporation grouped by Rain Today and Rain Tomorrow:\n',df_E)
print('\nMean of Evaporation grouped by Rain Today and Rain Tomorrow:\n',df_E_mean)

print('\nOverall Mean of Evaporation:\n',df.Evaporation.mean())

`Evaporation` has most of its missing values on days without rain, that is, when both `RainToday` and `RainTomorrow` are No. For all other combinations the number of missing records is comparable.

The average `Evaporation` on the days it doesn't rain i.e. both `RainToday` and `RainTomorrow` are "No" is 6.03 while the average `Evaporation` on the days it rains both Today and Tomorrow is 3.87, which is a variation of more than `55%`.
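The percentage quoted above can be checked with a short calculation. The two means are taken from the text; expressing the difference relative to the rainy-day mean is an assumption about how the figure was derived.

```python
# Group means of Evaporation quoted in the text above.
mean_no_rain = 6.03   # RainToday = No and RainTomorrow = No
mean_rain = 3.87      # RainToday = Yes and RainTomorrow = Yes

# Difference expressed relative to the rainy-day mean (assumed baseline).
relative_difference = (mean_no_rain - mean_rain) / mean_rain
print(f"{relative_difference:.1%}")  # prints 55.8%
```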

Based on the above data, it is not appropriate to impute a single overall mean for every missing record of `Evaporation`. We will continue to investigate further.

In [None]:
# Number of null for Cloud 9 AM by the RainToday And Rain Tomorrow
df_C9 = df.Cloud9am.isnull().groupby([df['RainToday'],df['RainTomorrow']]).sum()
df_C9_mean = df.Cloud9am.groupby([df['RainToday'],df['RainTomorrow']]).mean()
print('Number of Nulls in Cloud at 9 AM grouped by Rain Today and Rain Tomorrow:\n',df_C9, '\n')
print('\nMean of Cloud at 9 AM  grouped by Rain Today and Rain Tomorrow:\n',df_C9_mean)
print('\nOverall Mean of Cloud at 9 AM:\n',df.Cloud9am.mean())

We check another variable with a large number of missing values, `Cloud9am` (cloud cover at 9 am), which has more than 36,000 missing records. As we noticed for `Evaporation`, `Sunshine` and `Cloud3pm`, the mean cloud cover depends significantly on `RainToday` and `RainTomorrow`.

Cloud cover at 9 am is significantly higher on the days it rains, and the overall mean is much lower than the rainy-day mean.

Considering the above examples, it is appropriate to say that we shouldn't impute the overall variable mean for the missing records, as it would be significantly off depending on whether it rains today and/or tomorrow.

So, we decided to impute data based on the mean of numeric variables for the days of `RainToday` and `RainTomorrow`.

The categorical variables will be imputed based on the mode.

#### Data Imputation

In [None]:
# Impute data (numeric) based on the mean for RainToday and RainTomorrow

df_impute = df.copy()  # work on a copy so the raw DataFrame stays unchanged
mat_yesno = df[df_num].groupby([df['RainToday'],df['RainTomorrow']]).mean()
RAINTODAY=0
RAINTOMORROW=1
COUNTER = 0
for i in range(2):
    for j in range(2):
        for indexattr in mat_yesno.iloc[COUNTER].index:
            df_impute.loc[(df_impute["RainToday"] == mat_yesno.iloc[COUNTER].name[RAINTODAY] ) 
                          & (df_impute["RainTomorrow"] == mat_yesno.iloc[COUNTER].name[RAINTOMORROW]) 
                          & (df_impute[indexattr].isnull()), indexattr] = mat_yesno.iloc[COUNTER][indexattr]
        COUNTER = COUNTER + 1

        
        
# Impute data (categorical) with mode of each variable

df_impute['WindDir9am'] = df_impute['WindDir9am'].fillna(df_impute['WindDir9am'].mode()[0])
df_impute['WindGustDir'] = df_impute['WindGustDir'].fillna(df_impute['WindGustDir'].mode()[0])
df_impute['WindDir3pm'] = df_impute['WindDir3pm'].fillna(df_impute['WindDir3pm'].mode()[0])

As mentioned above, we imputed all numeric variables with the group means for each combination of `RainToday` and `RainTomorrow`. We calculated the mean for rows where both `RainToday` and `RainTomorrow` are "No" and used it to fill the missing values in that group; we then did the same for `RainToday` "Yes" and `RainTomorrow` "No", and so on for the remaining combinations.

The categorical variables `WindDir9am` and `WindDir3pm` record the wind direction at 9 am and 3 pm respectively, while `WindGustDir` is the direction of the strongest wind gust. All of these variables describe direction, and the largest share of missing values is `6.8%` for `WindDir9am`. We decided to impute these with the mode of each categorical variable.
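The same group-wise imputation can also be written more compactly with `groupby(...).transform("mean")`. The sketch below is an alternative formulation on a small made-up frame, not the code used above; the frame `toy` and its values are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with two rain groups and a few missing values.
toy = pd.DataFrame({
    "RainToday":    ["No", "No", "No", "Yes", "Yes", "Yes"],
    "RainTomorrow": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "Evaporation":  [6.0, np.nan, 8.0, 2.0, 4.0, np.nan],
    "WindGustDir":  ["W", None, "W", "N", "W", "N"],
})

# Numeric columns: fill each gap with the mean of its (RainToday, RainTomorrow) group.
num_cols = ["Evaporation"]
group_means = toy.groupby(["RainToday", "RainTomorrow"])[num_cols].transform("mean")
toy[num_cols] = toy[num_cols].fillna(group_means)

# Categorical column: fill gaps with the most frequent value (the mode).
toy["WindGustDir"] = toy["WindGustDir"].fillna(toy["WindGustDir"].mode()[0])

print(toy)
```

In this toy frame the missing `Evaporation` value in the No/No group becomes 7.0 and the one in the Yes/Yes group becomes 3.0, which shows how far apart the imputed values can be compared with a single overall mean.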

#### Outlier Detection and Removal

The Z-score is the signed number of standard deviations by which an observation lies above or below the mean of the values being measured.

The intuition behind the Z-score is to describe each data point by its relationship to the mean and standard deviation of the group. Computing Z-scores rescales the data to have a mean of 0 and a standard deviation of 1; it does not by itself make the data normally distributed.

After centering and rescaling the data, we look for points that are too far from zero and treat them as outliers. In most cases a threshold of 3 is used: a data point whose Z-score is greater than 3 or less than -3 is identified as an outlier.

The search returns two arrays: the first contains the row indices and the second the corresponding column indices. For example, if z[8][5] has a Z-score higher than 3, the record at row index 8, column index 5 is an outlier.


We found 8,309 outliers for our Rainfall attributes and we have removed the rows using z-score technique.
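A minimal sketch of the Z-score search described above is given below. It assumes `scipy.stats.zscore` and `numpy.where` on a synthetic frame with one planted extreme value; the frame `demo` and the planted value are assumptions for illustration, not the weather data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data with one planted extreme value (assumption for illustration).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Rainfall": rng.exponential(scale=2.0, size=500),
    "WindGustSpeed": rng.normal(loc=40.0, scale=10.0, size=500),
})
demo.loc[8, "Rainfall"] = 200.0

# Absolute Z-score of every cell, computed column by column.
z = np.abs(stats.zscore(demo.to_numpy(), axis=0))

# np.where returns two arrays: row indices and the matching column indices.
rows, cols = np.where(z > 3)
print("flagged (row, column) pairs:", list(zip(rows, cols))[:5])

# Drop every row that has at least one flagged cell.
outlier_rows = np.unique(rows)
demo_no_outliers = demo.drop(index=demo.index[outlier_rows])
print(len(demo), "->", len(demo_no_outliers))
```

Note that for a heavily right-skewed variable such as rainfall, the mean and standard deviation are themselves pulled by the extreme values, so a Z-score threshold should be read as a screening aid rather than proof that a value is invalid.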

###### Reference: https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
###### Reference: https://towardsdatascience.com/detecting-and-treating-outliers-in-python-part-1-4ece5098b755

In [None]:
# Outlier - Uni-variate(one variable outlier analysis) using Box plot

df_rainfall = df_impute[['Rainfall']]

fig, ax = plt.subplots(figsize=(8,8))
ax.set_xlabel("va=baseline")
sns.boxplot(x="variable", y="value", data=pd.melt(df_rainfall))
plt.xticks(rotation=45)
plt.show()

In [None]:
# Checking the maximum value of the Rainfall variable
df_impute[['Rainfall']].max()

In the above boxplot we see that the `Rainfall` data is highly skewed, with apparent outliers. Most of the values (including the mean and median) fall around `0`, which is understandable considering it doesn't rain on most days in Australia.

If we look at the extreme value for Rainfall alone, it is `371 mm`. Based on recorded weather history, this is nowhere near the highest value on record and is not an implausible outlier: the highest daily rainfall in a 24-hour period in Australia is recorded as 907 mm.

So, we decided to treat this as a valid observation and not change it in any way.

##### Reference:  https://www.ga.gov.au/scientific-topics/national-location-information/dimensions/climatic-extremes

In [None]:
# Boxplot of subset of variables
df_num
df_boxplot = df_impute[['MinTemp', 'MaxTemp', 'Evaporation', 'Sunshine',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm']]

fig, ax = plt.subplots(figsize=(12,12))
ax.set_xlabel("va=baseline")
sns.boxplot(x="variable", y="value", data=pd.melt(df_boxplot))
plt.xticks(rotation=45)
plt.show()

In [None]:
# Boxplot for Pressure at 9 am and 3 pm

df_pressure = df_impute[['Pressure9am', 'Pressure3pm']]


fig, ax = plt.subplots(figsize=(8,8))
ax.set_xlabel("va=baseline")
sns.boxplot(x="variable", y="value", data=pd.melt(df_pressure))
plt.xticks(rotation=45)
plt.show()

In [None]:
# Maximum of each variable
df_impute.max()

Australia is a land of extremes with temperatures ranging from highs of 40°C in the central desert regions to below freezing in the higher regions of the country's southeast. Sometimes these extremes can be experienced on a single day.

##### Reference: https://www.ga.gov.au/scientific-topics/national-location-information/dimensions/climatic-extremes

Similarly, if we look at barometric pressure, the highest barometric pressure ever recorded was 1083.8 mb, while the lowest non-tornadic atmospheric pressure ever measured was 870 hPa (0.858 atm; 25.69 inHg).

###### https://en.wikipedia.org/wiki/Atmospheric_pressure

###### https://www.guinnessworldrecords.com/world-records/highest-barometric-pressure-

Based on this evidence, we conclude that even though we have some extreme values in our dataset, they are not entirely wrong or improbable. We decided not to delete or impute any of our outliers and to continue our analysis with the data as observed.

## Data Understanding 2

In [None]:
# Count of Rainfall days today and tomorrow

fig, ax =plt.subplots(1,2)
print(df_impute.RainToday.value_counts())
print(df_impute.RainTomorrow.value_counts())
plt.figure(figsize=(12,12))
sns.countplot(data=df_impute,x='RainToday',ax=ax[0])
sns.countplot(data=df_impute,x='RainTomorrow',ax=ax[1])

plt.figure(figsize=(12,12))
plt.subplot(121)
df_impute['RainToday'].value_counts().plot.pie(autopct='%0.2f%%')
plt.subplot(122)
df_impute['RainTomorrow'].value_counts().plot.pie(autopct='%0.2f%%')
plt.show() 

# Reference: https://www.kaggle.com/fahadmehfoooz/rain-prediction-with-90-65-accuracy

As expected, the number of days with rainfall is far lower than the number of days without. This is true for both `RainToday` and `RainTomorrow`, and the number of rainy days is quite similar for the two variables.

In [None]:
#Histograms for continuous attributes.

fig = plt.figure(figsize = (20,15))
ax = fig.gca()
df_impute.hist(ax=ax)
plt.show()

The above histograms show the distribution of all the continuous variables in our dataset. They help us assess the normality (skewness and data range) of each variable. Most of the histograms look approximately normal. A few variables, such as `Rainfall`, `Evaporation`, and `WindSpeed9am`, are right skewed.

`Rainfall` is expected to be skewed, as it doesn't rain on most days in Australia. Similarly, the wind speed variables are expected to be skewed, since high winds are uncommon and most days have low wind speeds.

`Evaporation` requires further analysis in the context of `Sunshine` and other variables, which will be covered in later sections. `Evaporation` depends on humidity, temperature and wind speed, and has to be checked in that context.

In [None]:
# Wind direction count, stacked by RainTomorrow
#plt.xticks(rotation=45)
#sns.barplot(x="WindGustDir", hue ="RainTomorrow", data=df_impute)

df_plot = df_impute.groupby(['RainTomorrow', 'WindGustDir']).size().reset_index().pivot(columns='RainTomorrow', index='WindGustDir', values=0)

df_plot.plot(kind='bar', stacked=True)

#Source for stacked bar plot: https://stackoverflow.com/questions/50319614/count-plot-with-stacked-bars-per-hue

Most observations are recorded with a westerly wind gust direction. This holds both for days followed by rain and for days that are not, which is why west has the highest count in both `RainTomorrow` groups.

In [None]:
# State count in dataframe
location_count = df_impute.State.value_counts().sort_values(ascending=False)
location_count.plot(kind='pie')

There are more observations recorded from New South Wales, Victoria, and Western Australia in our dataframe. These states may influence our modeling and analysis. 

In [None]:
# Boxplot for Pressure at 9 am and 3 pm
df_pressure = df_impute[['Pressure9am', 'Pressure3pm']]
fig, ax = plt.subplots(figsize=(8,8))
ax.set_xlabel("va=baseline")
sns.boxplot(x="variable", y="value", data=pd.melt(df_pressure))
plt.xticks(rotation=45)
plt.show()

The above boxplot indicates that `Pressure9am` and `Pressure3pm` are consistent, with little variation through the day. `Pressure9am` has a slightly higher median than `Pressure3pm`, but the distributions appear similar.

In [None]:
# Boxplot of subset of variables
df_num
df_boxplot = df_impute[['MinTemp', 'MaxTemp', 'Evaporation', 'Sunshine',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm']]
fig, ax = plt.subplots(figsize=(12,12))
ax.set_xlabel("va=baseline")
sns.boxplot(x="variable", y="value", data=pd.melt(df_boxplot))
plt.xticks(rotation=45)
plt.show()

The above boxplot shows how our continuous attributes are distributed. `WindGustSpeed` has the most variation.

`Cloud9am` and `Cloud3pm` have the lowest variation and similar distributions.

`Evaporation` is highly skewed and has a long upper whisker. It also has the most extreme outlier.

In [None]:
# RainToday By State (first one) and RainTomorrow by State (second one)

Location_Windir_RainToday = pd.crosstab(df_impute['State'], df_impute['RainToday'])
Location_Windir_RainToday.div(Location_Windir_RainToday.sum(1),axis=0).plot.barh(stacked = True)

Location_Windir_Raintomorrow = pd.crosstab(df_impute['State'], df_impute['RainTomorrow'])
Location_Windir_Raintomorrow.div(Location_Windir_Raintomorrow.sum(1),axis=0).plot.barh(stacked = True)

The crosstab charts show that `Queensland` and `Tasmania` have the highest chance of rainfall. Both states show the largest share of rainy days for both `RainToday` and `RainTomorrow`, and several other states are close behind. However, `Northern Territory` tends to have the smallest share of rainy days (for both `RainToday` and `RainTomorrow`).

We will further analyze this topic.

In [None]:
# Rainfall (mm) By State 
fig = plt.figure(figsize =(7, 4)) 

# Vertical bar plot of mean rainfall by state
dtg = df_impute.groupby(by=df_impute.State)['Rainfall'].mean()


dtg.plot(kind = 'bar') 


groupby_single = df_impute.groupby(['State']).agg({'Rainfall': ['mean', 'min', 'max']})
groupby_single

Queensland has received the most rainfall on average, with a mean of 4.02 mm. South Australia has received the least, with a mean of 1.38 mm. Although we noticed previously that Tasmania has the second-highest share of rainy days (close to Queensland), its mean rainfall is modest compared to some of the other states. New South Wales recorded the highest single-day rainfall, and South Australia the lowest maximum.

It is interesting to see that `South Australia` not only has a low mean rainfall, but its maximum rainfall is also significantly lower than in other states: the maximum in `New South Wales` is more than three times as large.

In [None]:
#HeatMap for plot on the correlation matrix using seaborn
plt.figure(figsize=(12,12))
cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings
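# Restrict to numeric columns: recent pandas versions raise an error if corr() meets text columns.
ax = sns.heatmap(df_impute.select_dtypes(include='number').corr(), cmap=cmap, square=True, annot=True, fmt='.2f')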
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

In this correlation matrix, we notice that most of the correlations are positive. Pressure and Humidity seem to have negative correlations against other attributes, but most of those are weak. Here are some of the significant correlations we notice:

    MinTemp and MaxTemp

    Temp9am and Temp3pm

    Humidity9am and Humidity3pm

    Cloud9am and Cloud3pm

    Pressure9am and Pressure3pm

    WindGustSpeed and WindSpeed9am

    WindGustSpeed and WindSpeed3pm

    WindSpeed9am and WindSpeed3pm

`Humidity` is negatively correlated with `Evaporation` and `Temperature`. This correlation is notable and physically plausible: evaporation is higher during dry conditions. So the skewness in the `Evaporation` distribution is also affected by the humidity in the region.

`Sunshine` is also negatively correlated with the cloud variables, which is expected since cloud cover reduces sunshine. This supports the validity of our data, which appears to follow the correct trends.

One common observation is that there are strong correlations between the morning and late-afternoon values for each weather measurement. The only other significant correlation we notice is between Cloud (am/pm) and Humidity (am/pm). That is expected, as humidity usually builds up as clouds gather before rainfall.
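Reading the strongest pairs off a large heatmap is error-prone, so it can help to list them programmatically. The sketch below ranks pairs by absolute correlation on synthetic columns; the frame `demo`, its column values, and the way they are generated are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic columns built so that some pairs are strongly related (assumption).
rng = np.random.default_rng(1)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "Temp9am": base,
    "Temp3pm": base + rng.normal(scale=0.2, size=300),
    "Humidity3pm": -base + rng.normal(scale=0.5, size=300),
    "WindSpeed9am": rng.normal(size=300),
})

corr = demo.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs by the absolute strength of their correlation.
strongest = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(strongest.head(3))
```

Sorting on the absolute value keeps strong negative relationships, such as humidity against temperature, near the top of the list alongside the positive ones.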

## Data Preparation Part 1

As discussed in *Lab 1*, we acquired the Australian Weather dataset from the Kaggle portal. It contains 10 years of daily weather observations collected from many locations across Australia. There are 145,460 observations with 23 attributes in the original dataset.

We have chosen `RainTomorrow` (categorical) and `Rainfall` (continuous) as response variables. `RainTomorrow` is a categorical attribute which indicates whether it is going to rain tomorrow - yes or no. `Rainfall` is a continuous attribute that measures the amount of rainfall each location received (in mm). Using our models, we aim to design an algorithm that can help the bureau predict rainfall for different regions in Australia.

In this Lab 2 assignment, we have measured the accuracy and effectiveness of our model for categorical variable RainTomorrow by using 10-fold cross validation against the confusion matrix measurements like: Precision, Recall and Accuracy. We have explored the methods of logistic regression and support vector machine (SVM) models on our dataset. 

We used `scikit-learn` packages for our exploration. We ran logistic regression models with all the available solvers in the `scikit-learn` package and compared the effectiveness and accuracy of the models in predicting `RainTomorrow`. We also measured the run time of each model to compare performance and efficiency.

To get started, we load all the necessary packages for our analysis. We start with `df_impute`, the imputed dataframe from our previous exploratory data analysis (Lab 1) project. Using this dataframe ensures data consistency across all the labs going forward.

In [None]:
# Import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from scipy import stats
import warnings
from shapely.geometry import Point
import plotly.express as px
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.model_selection import ShuffleSplit
from sklearn.utils import resample

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc
from sklearn import metrics

In [None]:
#Ignore Warnings on final

warnings.filterwarnings('ignore')

In [None]:
#Original Data
df = pd.read_csv("weatherAUS.csv")

In [None]:
df.head()

#### Dropping columns

We decided to drop `Date` and `Location` as they are not pertinent to our analysis in this Lab 2 project.

In [None]:
df = df.drop(['Date', 'Location'], axis = 1)
df.head()


We imputed data in EDA project by substituting the missing and `NaN` values. We are reusing the imputed data from EDA (Lab1) project.
Here is the link to the EDA for reference:

https://nbviewer.jupyter.org/github/ravisiv/AussieWeatherEDA/blob/c0ba412cb75da21eba386ea9ea39f645ad6af1d0/DS7331_Lab1_Group3_Ravi_Taifur_Seemant_Apurv_Submission.ipynb


In [None]:
# Read the Imputed Australia weather data
df_impute = pd.read_csv("weatherAUS_imputed.csv")
df_impute.shape

The imputed data doesn't include any null or missing values. Also, we have dropped columns such as the date of observation and the city name.

In [None]:
df_impute_num = df_impute.columns[df_impute.dtypes == 'float64']
df_impute_cat=df_impute.columns[df_impute.dtypes == 'object']
print("Numeric Variables:", df_impute_num)
print("Categorical Variables:", df_impute_cat)

Before continuing, we need to check which variables are numeric and which are not, because the models expect numerical variables. We filter and identify the non-numeric variables.

`WindGustDir`, `WindDir9am`, `WindDir3pm`, `RainToday` and `RainTomorrow` are not numeric. Here `RainTomorrow` is our response variable. We handle the other variables with one-hot encoding later in the flow.

In [None]:
#Keep the original data
df_model = df_impute.copy()

Creating a new DataFrame `df_model` for modeling to avoid any changes to the original dataset `df_impute`.

In [None]:
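# Create a new variable to identify whether it rained today,
# mapping No to 0 and Yes to 1.
# Assigning the mapped column avoids an in-place replace on a column selection,
# which newer pandas versions may not propagate back to the DataFrame.

df_model["IsRainToday"] = df_impute["RainToday"].map({"No": 0, "Yes": 1})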


Assigning `0` to No values and `1` to Yes values in `RainToday` (Changed to `IsRainToday`)

In [None]:
print("df_impute", df_impute.shape)
print("df_model", df_model.shape)


In [None]:
# Printing the values to check if the data looks good

df_model.head()

### One-hot encoding

Before we create our models, we need to format our attributes. We are converting `RainToday` and `RainTomorrow` into numeric variables to `0` and `1`. We also decided to go ahead with one-hot-encoding `WindGustDir`, `WindDir9am`, and `WindDir3pm` attributes based on the direction of the wind. 

In [None]:
# perform one-hot encoding using dummies

gust_df = pd.get_dummies(df_model.WindGustDir,prefix='GustDir', drop_first= True)
wind3pm_df = pd.get_dummies(df_model.WindDir3pm,prefix='Wind3pm', drop_first= True)
wind9am_df = pd.get_dummies(df_model.WindDir9am,prefix='Wind9am' , drop_first= True)
df_model = pd.concat((df_model,gust_df, wind3pm_df, wind9am_df),axis=1) # add back into the dataframe


We decided to do one-hot encoding using the `get_dummies` function, as machine learning algorithms and models require numerical values for both input and output attributes.

Since `get_dummies` creates a variable for each unique value, we drop the first one to avoid multicollinearity: the value of the dropped variable can be inferred from the values of the other variables created by the one-hot encoding.
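To see why one dummy column is redundant, the small sketch below compares the full encoding with the `drop_first=True` encoding and reconstructs the dropped column from the remaining ones. The four-direction series `wind` is a made-up example, not the full set of wind directions in the dataset.

```python
import pandas as pd

# Made-up wind directions (assumption: only four levels, for readability).
wind = pd.Series(["N", "E", "S", "W", "N"], name="WindGustDir")

full = pd.get_dummies(wind, prefix="GustDir")                      # one column per level
reduced = pd.get_dummies(wind, prefix="GustDir", drop_first=True)  # first level (E) dropped

# A row is level E exactly when none of the remaining dummy columns is set.
recovered_E = 1 - reduced.astype(int).sum(axis=1)

print(list(reduced.columns))
print((recovered_E == full["GustDir_E"].astype(int)).all())
```

The dropped level acts as the reference category, so coefficients of the remaining dummy variables in a linear model are interpreted relative to it.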



In [None]:
# Drop categorical columns

df_model = df_model.drop(['WindDir3pm', 'WindDir9am', 'WindGustDir', 'RainToday'], axis = 1)

After the conversions, we remove these categorical attributes to avoid duplication, since the same data is now available in numerical format. The newly encoded attributes and the remaining continuous attributes are all in the dataframe `df_model`, which we will use for modeling.

Reference: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

In [None]:
# Check that Yes/No were replaced with 1/0

print("Are there 1's and 0's in the IsRainToday column?", 
      (df_model['IsRainToday'].sum() > 0) and (df_model['IsRainToday'].sum() < len(df_model['IsRainToday'])))

# True means the column contains a mixture of 1's and 0's


Checking that the conversion of `RainToday` to `0`/`1` worked as expected.

In [None]:
df_model_num = df_model.columns[df_model.dtypes != 'object' ]
df_model_cat=df_model.columns[df_model.dtypes == 'object']
print("Numeric Variables:", df_model_num)
print("Categorical Variables:", df_model_cat)

Check that all the numerical variables were created correctly and whether any non-numeric data remains.

Assigning `RainTomorrow` as our response variable (y) and all other numeric variables, including the one-hot-encoded values, as X.

In [None]:
X=df_model[df_model_num]
y = df_model.RainTomorrow
print('features shape:', X.shape) 
print('target shape:', y.shape )

#### Response Variables
For our dataset, we are using two response variables:

1. `RainTomorrow` - Categorical variable for classification
2. `Rainfall` - Continuous variable for regression

We are going to introduce an additional variable in our dataset:

`RainfallAmount` - Categorical variable for rainfall classification. We have covered this in more detail in another section.

#### Scaling

We will use scaled data for our models. Scaling is built into the custom function we use to run our classification and regression models. We discuss this in more detail in the modeling sections.
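The custom function is not shown in this section, so the following is only a minimal sketch of what standardization looks like, assuming scikit-learn's `StandardScaler` is the scaler in use. The arrays `X_train` and `X_test` are illustrative stand-ins, not the project's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices whose columns have very different scales (illustrative values)
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.5, 150.0]])

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# After scaling, each training column has mean 0 and unit variance
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Fitting on the training split and only transforming the test split keeps information from the test data out of the scaling parameters.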

#### New Feature

We are adding a new classification feature called `RainfallAmount`, which has four values: `None` (0), `Low` (1), `Moderate` (2) and `High` (3). We create this feature from the `Rainfall` feature in our dataframe. The values are stored as numbers because the classification models require numerical input.

In [None]:
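# New Feature - RainfallAmount

def rain_classifier(row):
    """Map daily rainfall (mm) to 0 = None, 1 = Low, 2 = Moderate, 3 = High."""
    if row["Rainfall"] > 30:
        return 3
    elif row["Rainfall"] > 10:   # 10 < Rainfall <= 30
        return 2
    elif row["Rainfall"] > 1:    # 1 < Rainfall <= 10
        return 1
    else:                        # Rainfall <= 1
        return 0

# The upper bounds are inclusive so that values of exactly 10 or 30 mm
# are not mistakenly assigned to class 0.
df_model["RainfallAmount"] = df_impute.apply(rain_classifier, axis=1)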


In [None]:
df_model.RainfallAmount.unique()
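The same binning can also be expressed without a row-wise `apply`. The sketch below uses `pd.cut` on a toy series and assumes that rainfall of exactly 10 mm counts as `Low` and exactly 30 mm as `Moderate`, which is an assumption about the intended boundaries. It also assumes there are no missing values in `Rainfall`.

```python
import numpy as np
import pandas as pd

# Illustrative rainfall values in mm, including the boundary cases 1, 10 and 30
rainfall = pd.Series([0.0, 1.0, 5.0, 10.0, 20.0, 30.0, 45.0], name="Rainfall")

# Right-inclusive bins: (-inf, 1] -> 0, (1, 10] -> 1, (10, 30] -> 2, (30, inf) -> 3
bins = [-np.inf, 1, 10, 30, np.inf]
rainfall_amount = pd.cut(rainfall, bins=bins, labels=[0, 1, 2, 3], right=True).astype(int)

print(rainfall_amount.tolist())
```

On a dataset of this size, a vectorized call like this is usually much faster than applying a Python function to every row.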

#### Down-sampling

Because of the large amount of data and the number of models evaluated in this project, our computers could not handle the load and kept crashing, which led to longer processing times and repeated work. To avoid this, we made a few changes:

1. Down-sampled the data based on `RainToday`.
2. Reduced the number of hyperparameter combinations tried during tuning to conserve memory and processing power.
3. Reduced the number of additional models we ran as part of the exceptional work, such as XGBoost and Linear SVC.

In [None]:
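# Downsampling before we run our models.
# Note: this keeps only the rows whose IsRainToday value equals one randomly
# chosen class (0 or 1), so the result differs between runs unless a seed is set.
df_model_copy = df_model[df_model.IsRainToday  == np.random.choice(df_model['IsRainToday'].unique())].reset_index(drop=True)
df_model = df_model_copy.copy()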
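As a point of comparison, a stratified random sample reduces the number of rows while keeping both `IsRainToday` classes in their original proportions. This is a sketch of an alternative technique, not the approach used in the cell above. The toy dataframe, the 30% sample size and the seed are all illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced binary column (illustrative, not the real data)
toy = pd.DataFrame({"IsRainToday": [0] * 78 + [1] * 22, "Rainfall": range(100)})

# Keep 30% of the rows, stratified so the class proportions are preserved
sample, _ = train_test_split(
    toy, train_size=0.3, stratify=toy["IsRainToday"], random_state=42
)
sample = sample.reset_index(drop=True)

print(len(sample))
print(sample["IsRainToday"].value_counts(normalize=True))
```

Fixing `random_state` makes the sample reproducible across notebook runs.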


### Data Distribution

Check if the data distribution is balanced or not for the response variable `RainTomorrow`.

In [None]:
       

df_impute['RainTomorrow'].value_counts(normalize = True).plot(kind='bar', color= ['skyblue','navy'], alpha = 0.9, rot=0)
plt.title('RainTomorrow Indicator No(0) and Yes(1) in the Imbalanced Dataset')
plt.show()


As expected, the data for `RainTomorrow` is imbalanced: the majority of observations are `No` rather than `Yes`.

The ratio of `0` to `1` is roughly `78:22`. Our models may not be very effective if we do not address this imbalance, so we discuss and adjust for it in our analysis.
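One common way to adjust for an imbalance like this is to weight the classes inversely to their frequency. The sketch below is illustrative only: it uses a synthetic label vector with a 78:22 split and scikit-learn's `compute_class_weight`, and it is not necessarily the adjustment applied later in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels mirroring the observed 78:22 split (illustrative)
y = np.array([0] * 78 + [1] * 22)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
```

Many scikit-learn classifiers, such as `LogisticRegression` and `RandomForestClassifier`, accept the same weighting directly through `class_weight="balanced"`.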

In [None]:
df_model_copy = df_model.copy()

In [None]:
df_model.info()

The dataframe above contains float64, object, int64, and uint8 dtypes. float64, int64, and uint8 are numerical data types; object is used here for string data.

#### Response Variables

We have three response features in our current dataframe: `RainTomorrow`, `Rainfall`, and `RainfallAmount`. `RainTomorrow` and `RainfallAmount` are for our classification models, and `Rainfall` is used for the continuous regression models. Our primary focus is on `RainTomorrow` and `RainfallAmount`, as the prediction for `Rainfall` is not very accurate. We present the accuracy for this feature in a later section.

In [None]:
df_model.describe().transpose()

We ran summary statistics on the final model dataset and can see the statistical summary of each feature. There are some large ranges in the dataset: `Evaporation` ranges from 0 to 145, and `Rainfall` varies from 0 mm to 371 mm. These are wide ranges, but as determined during `EDA` (Lab 1) the extreme values are not outliers, and for our analysis we consider them valid observations.

# Tai

### Agglomerative clustering

# Seemant

### Optics

# Ravi

### K-Means++

# Apurv

### DBSCAN

In [None]:
# Read the world cities data for State and Latitude/Longitude lookup
# to create a geography dataframe for Weather Australia
worldcities = pd.read_csv("worldcities.csv", header=[0], encoding = "ISO-8859-1", engine='python')
worldcities = worldcities[(worldcities.country == "Australia")]
worldcities.rename(columns={'city': 'Location', 'lat': 'Latitude', 'lng': 'Longitude', 'admin_name': 'State'}, inplace=True)
worldcities = worldcities.drop(['city_ascii','country','iso2','iso3','capital','population','id'],axis=1)
df_impute_temp = df_impute
df_geo = pd.merge(df_impute_temp, worldcities, how="left", on=["Location"])
df_geo.head()
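Because this is a left merge on `Location`, any weather station whose name has no exact match in the cities file ends up with missing `Latitude`, `Longitude` and `State`. A minimal sketch of how unmatched names could be listed, using pandas' `indicator=True` option on toy frames (the station and city names below are illustrative):

```python
import pandas as pd

# Toy stand-ins for the weather data and the cities lookup (illustrative names)
weather = pd.DataFrame({"Location": ["Sydney", "SydneyAirport", "Perth"]})
cities = pd.DataFrame({
    "Location": ["Sydney", "Perth"],
    "Latitude": [-33.87, -31.95],
    "Longitude": [151.21, 115.86],
})

# indicator=True adds a _merge column marking rows found only in the left frame
merged = pd.merge(weather, cities, how="left", on="Location", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Location"].unique().tolist()

print(unmatched)
```

Names that show up in `unmatched` would need a manual mapping or a cleaned-up key before the coordinates can be used for clustering.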


Reference for World Cities data: https://simplemaps.com/data/world-cities

In [4]:
# Plotting libraries (duplicate imports removed)
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.basemap import Basemap
from PIL import Image
%matplotlib inline
rcParams['figure.figsize'] = (14,10)

ModuleNotFoundError: No module named 'matplotlib.pyplot'

In [1]:
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import sklearn.utils
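These imports point toward a density-based clustering of locations. As a rough sketch of how they might fit together, the example below standardizes synthetic latitude/longitude pairs and runs `DBSCAN` on them. The coordinates, `eps=0.3` and `min_samples=5` are illustrative assumptions and would need tuning on the real `df_geo` data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic (latitude, longitude) points scattered around two city centers
rng = np.random.default_rng(0)
centers = np.array([[-33.87, 151.21], [-37.81, 144.96]])
coords = np.vstack([c + rng.normal(scale=0.05, size=(20, 2)) for c in centers])

# Standardize so both coordinates contribute on a comparable scale
scaled = StandardScaler().fit_transform(coords)

# DBSCAN labels noise points as -1 and clusters as 0, 1, ...
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("clusters:", n_clusters, "noise points:", n_noise)
```

Unlike K-Means, DBSCAN does not need the number of clusters up front, but its results are sensitive to `eps` and `min_samples`.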

In [5]:
!pip install matplotlib



In [2]:
import importlib
mpl_toolkits = importlib.import_module('mpl_toolkits')
from mpl_toolkits.basemap import Basemap

ImportError: cannot import name '__version__' from 'matplotlib' (unknown location)

## Deployment

Our models are primarily designed for meteorologists. They can also be very useful for event organizers in cities across the country, especially for outdoor events, and for government organizations such as the military (Navy). The models not only predict whether it is going to rain tomorrow (next day); they also predict the amount of `Rainfall` for today in mm and classify it as `Low`, `Moderate` or `High`. Because the output is easy for anyone to interpret, it is useful to a broad audience and can also be integrated into weather channels and apps.


The model’s value can be measured in terms of its accuracy: the higher the accuracy, the greater the value of the model over existing models in use. Some parties may value our models more highly than others, depending on how important accurate rainfall prediction is for their business or area of operations.


Our models can be integrated with existing feeds of weather-related data, which the models need in order to predict accurately. This data is easily available from government websites.


Data from those sources can be integrated with our models to build an API that anyone can consume, monetized by the count of API calls.


Weather is ever-changing and, due to climate change, predictive models need to evolve continuously. Our models need to be validated against recorded observations and retrained daily with the new data (models can be built within hours).


Additional data points may be required, such as season, time of year, and the impact of natural events like cyclones/hurricanes, wildfires, and El Niño and La Niña effects.


Overall, our classification models will be more useful, since they clearly indicate the weather condition for all interested parties.

## Exceptional Work

#### Conclusion


