# **CS 4641 Final Project**
**Team Information**

* Brian Lee     (903447932)
* Hyuntaek Lim  (903336480)
* Dowon Kim     (903218787)



## 1. Introduction
Australia is well known for some of the most extreme and unpredictable weather in the world. With its tropical-influenced climate, northern Australia has been reported as the hottest place on the planet [1]. Melbourne, a city on the southern coast, is said to have "four seasons in one day".

Meteorologists in Australia collect data from polar-orbiting satellites, aircraft, and weather balloons to produce accurate forecasts. Their supercomputers use these data to build a "spaghetti model," which produces a week-long forecast by calculating the high- and low-pressure systems [2]. However, because weather conditions shift constantly, these forecasts are often wrong: the accuracy of predicting the amount of rain in Melbourne was only 24% on average [3].

Using a dataset of about 10 years of daily weather observations in Australia, we apply machine learning algorithms to predict whether it will rain the next day, which may help improve the accuracy of weather forecasts. In this project, we compare several machine learning algorithms and their predictions, with the aim of providing insights to meteorologists. We also analyze the models to identify the factors that play significant roles in predicting rain.

## 2. Dataset

The dataset contains about 10 years of daily weather observations from various locations across Australia. It includes weather features that are considered critical to predicting rain, such as cloud cover, humidity, and wind, along with basic information such as temperature and location.

Our dataset provides the following data columns:

* Date: in YYYY-MM-DD format, from 2008 to 2017
* Location: various cities in Australia, such as Albury
* MinTemp: the lowest temperature at the location on the corresponding date
* MaxTemp: the highest temperature at the location on the corresponding date
* Rainfall: the amount of rainfall in mm
* Evaporation: the amount of daily water evaporation in mm
* Sunshine: the number of hours of bright sunshine per day
* WindGustDir: the direction of wind gust (a sudden increase in the speed of wind) 
* WindGustSpeed: the speed of wind gust in km/h
* WindDir9am: the direction of wind at 09:00
* WindDir3pm: the direction of wind at 15:00
* WindSpeed9am: the speed of wind at 09:00 in km/h
* WindSpeed3pm: the speed of wind at 15:00 in km/h
* Humidity9am: the humidity at 09:00 in %
* Humidity3pm: the humidity at 15:00 in %
* Pressure9am: the atmospheric pressure at 09:00 in hPa
* Pressure3pm: the atmospheric pressure at 15:00 in hPa
* Cloud9am: the fraction of sky covered by cloud at 09:00, in oktas (eighths)
* Cloud3pm: the fraction of sky covered by cloud at 15:00, in oktas (eighths)
* Temp9am: the temperature at 09:00 in degrees Celsius
* Temp3pm: the temperature at 15:00 in degrees Celsius
* RainToday: whether it rained on the corresponding date (Yes or No)

* **RainTomorrow: whether it rained the next day (Yes or No)  --> target variable**


Source: [WeatherAUS](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package)

**The ultimate goal of our project is to predict the binary target variable RainTomorrow, which indicates whether it rained the next day, from the data provided. We use supervised learning to make this prediction.**

In [48]:
# Importing the modules

import pandas as pd

from sklearn import datasets



from sklearn.naive_bayes import BernoulliNB           
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

## 3. Reading the Dataset

In [49]:
NUM_SAMPLES = 5  # Number of samples to view

DATA = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
DATA.head(NUM_SAMPLES)

In [50]:
# Generating shape of the data

num_rows, num_col = DATA.shape
print("# ROWS\t\t", num_rows)
print("# COLUMNS\t", num_col)

In total, there are 145,460 rows and 23 columns of data.


Before training our models, the following problems had to be resolved through data preprocessing.

1. Dropping the date column, which is not a useful feature for our prediction
2. Filling the NaN (missing data) values, as seen in the sample data table
3. Encoding categorical attributes, such as the RainTomorrow column, into numbers
4. Removing outliers that skew our models

## 4. Data Preprocessing

### 4.1 Dropping the Date Column



In [51]:
# Drop the Date column by name rather than by position
DATA = DATA.drop(columns=['Date'])
DATA.head()

As the table shows, the Date column has been removed; the data now has 22 columns.






### 4.2 Handling the Missing Data
Before filling in the missing data, we wanted to visualize how much of it there is. The missingno module allowed us to create a bar chart that shows how many values are present, and therefore how many are missing, in each column of the dataset. After the visualization, we filled in the missing values using a different approach for each type of data: continuous and categorical.

In [52]:
# Visualizing the missing data

import missingno

missingno.bar(DATA, sort='descending', color="green")

In [53]:
# Only show continuous attributes
cont = DATA.select_dtypes("float64")
cont.head()

In [54]:
# Display the number of missing values in each continuous attribute
cont.isnull().sum()

In [55]:
# We need to replace the null values of our continuous data
# We fill them with the column mean, a simple and common choice for numeric data

for col in cont:
    DATA[col] = DATA[col].fillna(DATA[col].mean())

# All of our continuous attributes now have 0 missing values
cont = DATA.select_dtypes("float64")
cont.isnull().sum()

In [56]:
# Only show categorical attributes
catg = DATA.select_dtypes("object")
catg.head()

In [57]:
# Display the number of missing values in each categorical attribute
catg.isnull().sum()

In [58]:
# We need to replace the null values of our categorical data
# We fill them with the mode (most frequent value), since a mean is undefined for non-numeric data

for col in catg:
    DATA[col] = DATA[col].fillna(DATA[col].mode()[0])

# All of our categorical attributes now have 0 missing values
catg = DATA.select_dtypes("object")
catg.isnull().sum()

In [59]:
# Now, our dataset has no missing data
DATA.isnull().sum()

### 4.3 Encoding Categorical Attributes

Since most machine learning models work only with numbers, we need to encode the textual data (categorical attributes) as numbers.

In [60]:
from sklearn import preprocessing

labelEncoder = preprocessing.LabelEncoder()

for col in catg:
    DATA[col] = labelEncoder.fit_transform(DATA[col])
    
DATA.head(NUM_SAMPLES)

All of the textual data have been transformed into numerical data, which can be used to construct machine learning models. Compared to the very first data table, the city "Albury" has changed to the integer 2, and every "No" in RainTomorrow is now 0. The other categorical columns - WindGustDir, WindDir9am, WindDir3pm, and RainToday - have also been transformed.
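To make the mapping concrete: `LabelEncoder` assigns integers according to the sorted order of the distinct labels, which is why a city name ends up as a small integer. The following is a minimal standard-library sketch of that ordering rule, not the scikit-learn implementation. The helper `encode_labels` and the short list of cities are hypothetical and chosen only for illustration.

```python
def encode_labels(values):
    """Map each distinct label to its index in sorted order (hypothetical helper)."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Hypothetical sample; the real Location column contains many more cities
locations = ["Albury", "Sydney", "Adelaide", "Albany", "Albury"]
codes, mapping = encode_labels(locations)

print(mapping)  # label -> integer, in alphabetical order
print(codes)    # the encoded column
```

Because the loop above reuses a single `labelEncoder`, after the loop it only remembers the classes of the last column it was fitted on. Keeping one mapping per column, as in this sketch, would make it possible to decode values back to their labels later.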

### 4.4 Removing Outliers

Outliers can distort the machine learning models that we plan to build, so it is important to remove extreme values from the dataset before training.

In [61]:
import numpy as np
from scipy import stats

# Remove outliers using z-score

DATA = DATA[(np.abs(stats.zscore(DATA)) < 3).all(axis=1)]
DATA.shape

After removing the outliers, we now have 136,608 rows that contain weather features in various cities of Australia. 8,852 rows have been deleted.
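The filter above keeps a row only if the z-score of every column is below 3 in absolute value. As a rough illustration of the rule on a single column, the sketch below uses only the standard library. The `zscores` helper and the rainfall readings are hypothetical, and `pstdev` (population standard deviation) is used on the assumption that it matches the default behavior of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standard scores based on the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical daily rainfall readings in mm; the last value is an extreme one
rainfall = [0.0, 0.2, 0.0, 1.4, 0.6, 0.0, 2.0, 0.8, 0.0, 1.0,
            0.4, 0.0, 1.2, 0.0, 0.6, 0.2, 0.0, 1.8, 0.4, 60.0]

z = zscores(rainfall)

# Keep only the readings whose z-score is within three standard deviations
kept = [v for v, score in zip(rainfall, z) if abs(score) < 3]

print(len(rainfall), "->", len(kept))
```

In this toy column only the 60 mm reading is dropped. Since the real filter is applied to all columns at once, heavily skewed features such as Rainfall may lose their largest values this way, which is worth keeping in mind when interpreting the results.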

## 5. Data Visualization

Now that we have preprocessed the data, we can visualize the cleaned data to examine the correlations between the weather variables.

In [62]:
import matplotlib.pyplot as plt
import seaborn as sns

# Using a heatmap to show the correlations between variables
plt.figure(figsize=(17,15))
ax = sns.heatmap(DATA.corr(), square=True, annot=True, fmt='.2f',cmap="YlGnBu")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)          
plt.show()

From the first look at the heatmap, we notice the following: 

* Positive correlation with RainTomorrow: RainToday, Cloud3pm, Cloud9am, Humidity3pm, Humidity9am, Rainfall, and WindGustSpeed
* Negative correlation with RainTomorrow: Sunshine, Pressure9am, Pressure3pm, and Temp3pm

This analysis intuitively makes sense. There is a high chance of precipitation the next day if the weather is rainy today, or if the humidity is high. On the other hand, if it is really sunny today, it is less likely to be rainy the next day.

Now, we are ready to create and train a model.

## 6. Training and Testing Our Models

First, we need to split the dataset into training and testing sets. Then we define a function that calculates the prediction accuracy of each trained model.

In [63]:
x = DATA.drop('RainTomorrow', axis=1)    # x = the input data
y = DATA['RainTomorrow']                 # y = output variable

# Split the dataset into testing and training sets
from sklearn.model_selection import train_test_split

# Test Size
test_size = 0.2

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = test_size)

# Evaluation
from sklearn.metrics import accuracy_score

def model_accuracy(modelName, predicted):
    accuracy = accuracy_score(y_test, predicted.round())*100
    print(f"Accuracy for {modelName}: ", accuracy, "%")
    return accuracy

### 6.1 Decision Tree
Our first model to predict the next day's precipitation is the decision tree.

In [64]:
from sklearn.tree import DecisionTreeClassifier

# the classifier
clf_tree = DecisionTreeClassifier()
# training the decision tree model
tree_model = clf_tree.fit(x_train, y_train)
# predict!
tree_pred = clf_tree.predict(x_test)

With the trained model, we can generate various reports that indicate the accuracy of the decision tree's predictions.

In [65]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, tree_pred))

[[a b]  
 [c d]]

* a (true negative): we predicted no rain, and it did not rain: 18,675 cases (correct prediction)
* b (false positive): we predicted rain, but it did not rain: 3,082 cases (wrong prediction)
* c (false negative): we predicted no rain, but it did rain: 2,659 cases (wrong prediction)
* d (true positive): we predicted rain, and it did rain: 2,906 cases (correct prediction)

Total correct predictions (a + d): 21,581 cases

Total predictions: 27,322 cases

The accuracy of the decision tree's predictions is approximately 0.7899 according to the generated confusion matrix. 
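The arithmetic behind these figures can be reproduced directly from the four counts. The sketch below is only a worked example using the numbers reported above. The variable names are ours, and the final "baseline" line is an added point of comparison that is not part of the original analysis.

```python
# Counts taken from the decision tree's confusion matrix above
tn, fp, fn, tp = 18675, 3082, 2659, 2906
total = tn + fp + fn + tp

accuracy = (tn + tp) / total

# Precision, recall and F1 for the "rain" class (label 1)
precision_rain = tp / (tp + fp)
recall_rain = tp / (tp + fn)
f1_rain = 2 * precision_rain * recall_rain / (precision_rain + recall_rain)

# The same metrics for the "no rain" class (label 0)
precision_dry = tn / (tn + fn)
recall_dry = tn / (tn + fp)
f1_dry = 2 * precision_dry * recall_dry / (precision_dry + recall_dry)

# Share of test days without rain, i.e. the accuracy of always predicting "no rain"
baseline = (tn + fp) / total

print(f"accuracy      : {accuracy:.4f}")
print(f"rain    P/R/F1: {precision_rain:.2f} / {recall_rain:.2f} / {f1_rain:.2f}")
print(f"no rain P/R/F1: {precision_dry:.2f} / {recall_dry:.2f} / {f1_dry:.2f}")
print(f"baseline      : {baseline:.4f}")
```

With these counts, the F1 score is about 0.87 for the no-rain class and about 0.50 for the rain class. About 79.6% of the test days had no rain, so accuracy alone says little about how well rainy days are detected.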

In [66]:
from sklearn.metrics import classification_report

print(classification_report(y_test, tree_pred))

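As the classification report shows, the accuracy of the decision tree is about 0.79, which matches the value computed from the confusion matrix. The F1 score, the harmonic mean of precision and recall, gives a more detailed picture: computed from the confusion matrix above, it is about 0.87 for the no-rain class but only about 0.50 for the rain class. The decision tree therefore predicts dry days well but is much less reliable on rainy days.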

### 6.2 Gaussian Naive Bayes

Our second model is Gaussian Naive Bayes.

In [67]:
from sklearn.naive_bayes import GaussianNB

# the model
clf_g_nb = GaussianNB()
# training the Gaussian Naive Bayes model
g_nb_model = clf_g_nb.fit(x_train, y_train)
# predict!
g_nb_pred = clf_g_nb.predict(x_test)

print(confusion_matrix(y_test, g_nb_pred))
print('\n')
print(classification_report(y_test, g_nb_pred))

Based on the confusion matrix, the accuracy of the prediction is approximately 0.7967 (21,767 / 27,322). The classification report also indicates that the accuracy is 0.80.

### 6.3 Logistic Regression

Our third machine learning model is the Logistic Regression model.

In [68]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

logreg = LogisticRegression(solver='liblinear')
logreg.fit(x_train, y_train)

logreg_pred = logreg.predict(x_test)

print(classification_report(y_test, logreg_pred))


# Note: logreg_pred holds hard 0/1 predictions, so this score reflects a single threshold only
logit_roc_auc = roc_auc_score(y_test, logreg_pred)
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(x_test)[:,1])

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

According to the classification report, Logistic Regression predicts rain with the highest accuracy of the three models: 0.84.

In addition, we plotted a ROC (receiver operating characteristic) curve, which shows the diagnostic ability of a binary classifier. Its two axes are the true positive rate and the false positive rate. Each point on the blue curve shows the trade-off between the two rates at a different decision threshold. The area below the blue curve is known as the AUC (area under the ROC curve); the closer it is to 1.0, the better the model separates rainy days from dry ones. The area reported in the legend is approximately 0.69. Note that this value is computed from the hard 0/1 predictions (`logreg_pred`) rather than from the predicted probabilities used to draw the curve, so it equals the average of the true positive and true negative rates at the default threshold and may understate the area under the plotted curve.
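To see why the input to the AUC computation matters, the sketch below computes the AUC twice on a tiny hypothetical example: once from predicted probabilities and once from hard 0/1 predictions. It uses the pairwise (rank-based) definition of the AUC, which for binary labels should agree with the area under the ROC curve. The helper `pairwise_auc` and the six data points are assumptions made for illustration and are not taken from the notebook.

```python
def pairwise_auc(y_true, scores):
    """Share of (rain, no-rain) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted rain probabilities for six days
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]

# Hard 0/1 predictions at the usual 0.5 threshold
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_hard = pairwise_auc(y_true, hard)

print(round(auc_from_proba, 3), round(auc_from_hard, 3))
```

In this toy case the probabilities give an AUC of about 0.889, while the hard predictions give about 0.667, because thresholding discards the ranking information within each predicted class.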

Now, using the model_accuracy function we defined earlier, let's compute the prediction accuracy of each model and visualize the results.

In [69]:
results = {}

models = {
    "Decision Tree Classifier" : clf_tree,
    "Gaussian Naive Bayes" : clf_g_nb,
    "Logistic Regression": logreg
}

# Our pipeline
for m in models:
    # Selecting our model
    model = models[m]
    # Training
    model.fit(x_train, y_train)
    # Prediction
    predicted = model.predict(x_test)
    # Evaluation
    acc = model_accuracy(m, predicted)
    # Record accuracy for the model
    results[m] = acc

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

labels = results.keys()
values = results.values()
ax.bar(labels,values, width=0.1)
plt.show()

As expected, the accuracies computed here for each model - Decision Tree, Gaussian Naive Bayes, and Logistic Regression - closely match the classification reports and confusion matrices above. The Decision Tree and Gaussian Naive Bayes yielded similar accuracies, whereas Logistic Regression was the only model with an accuracy above 80%.


We believe Logistic Regression performed better than Gaussian Naive Bayes because the latter assumes that the attributes are conditionally independent given the class. For rain prediction, however, this assumption is unlikely to hold: many attributes are strongly correlated with one another, as the heatmap in the Data Visualization section shows. Logistic Regression, on the other hand, models the probability of the outcome (y) directly as a function of the input variables (x). Because it does not rely on the independence assumption, we would expect it to perform better in weather forecasting.
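The contrast between the two models can be written out explicitly. The notation below is ours and is only a sketch: $x_1, \dots, x_d$ are the input features, $y$ is RainTomorrow, and $w$ and $b$ are the weights and intercept learned by Logistic Regression.

$$
P(y \mid x_1, \dots, x_d) \;\propto\; P(y) \prod_{i=1}^{d} P(x_i \mid y) \qquad \text{(Naive Bayes)}
$$

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}} \qquad \text{(Logistic Regression)}
$$

In the Naive Bayes product, closely related features such as Humidity9am and Humidity3pm each contribute as if they were independent evidence, which can overstate their combined effect. In the logistic model the weights are fitted jointly, so correlated features can share the credit.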


We also attribute the higher accuracy of Logistic Regression over the Decision Tree to the number of columns in the dataset. Logistic Regression tends to struggle when the dataset contains many outliers, which make it hard to find a linear decision boundary that separates the two classes, whereas a Decision Tree is more robust in that situation. In our project, though, we removed the outliers during preprocessing, which reduced this disadvantage of the Logistic Regression model. The 22 columns of the dataset, on the other hand, were difficult for our Decision Tree to handle: the textual report of the tree included a very large number of splits, which may indicate overfitting and reduces the credibility of its predictions.

[1] https://www.dailymail.co.uk/news/article-7807737/Fascinating-world-temperature-map-shows-country-hottest-place-EARTH.html

[2] https://www.smh.com.au/national/i-know-i-m-a-weather-geek-but-i-d-love-to-see-that-the-bom-s-new-three-week-outlooks-20190510-p51m0u.html

[3] http://weather-climate.com/ForecastAccuracyMelbourne29June2007.pdf