# **Machine Learning Project: Rain Prediction in Australia**

## Table of Contents
* ML Algorithms and Evaluation Metrics to be Used
* About the Dataset
* Importing Data
* Data Preprocessing
    * One Hot Encoding
    * Train and Test Data Split
* Train Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression Models
* Model Metrics Report

## ML Algorithms and Evaluation Metrics to be Used

ML Algorithms that will be used for this project:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

Evaluation Metrics for the Models:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score
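
The three classification metrics that are reported for every classifier can all be derived from the same confusion counts. The following is a minimal sketch in plain Python, assuming binary labels where 1 means rain; the helper names and the toy labels are made up for illustration and are not the scikit-learn implementations used later in the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def jaccard(y_true, y_pred):
    # Overlap between predicted and actual positives; 0.0 when there are none.
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1(y_true, y_pred):
    # Harmonic mean of precision and recall, written in terms of the counts.
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy labels (not taken from the dataset).
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(toy_true, toy_pred), jaccard(toy_true, toy_pred), f1(toy_true, toy_pred))
```

Accuracy counts true negatives, while the Jaccard index and F1-score ignore them, which is why the latter two are more informative when rainy days are the minority class.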

## About the Dataset


This dataset contains observations of weather metrics for each day from 2008 to 2017. The **Weather_Data.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the observation in M/D/YYYY                   | Date            | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | Hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |



## Import the required libraries


In [1]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [2]:
import pandas as pd
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import jaccard_score, f1_score, log_loss, accuracy_score, mean_absolute_error, mean_squared_error, r2_score

## Importing the Dataset


In [3]:
df = pd.read_csv("data/Weather_Data.csv")
df.head()

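|   | Date     | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| - | -------- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0 | 2/1/2008 | 19.5    | 22.4    | 15.6     | 6.2         | 0.0      | W           | 41            | S          | SSW        | ... | 92          | 84          | 1017.6      | 1017.4      | 8        | 8        | 20.7    | 20.9    | Yes       | Yes          |
| 1 | 2/2/2008 | 19.5    | 25.6    | 6.0      | 3.4         | 2.7      | W           | 41            | W          | E          | ... | 83          | 73          | 1017.9      | 1016.4      | 7        | 7        | 22.4    | 24.8    | Yes       | Yes          |
| 2 | 2/3/2008 | 21.6    | 24.5    | 6.6      | 2.4         | 0.1      | W           | 41            | ESE        | ESE        | ... | 88          | 86          | 1016.7      | 1015.6      | 7        | 8        | 23.5    | 23.0    | Yes       | Yes          |
| 3 | 2/4/2008 | 20.2    | 22.8    | 18.8     | 2.2         | 0.0      | W           | 41            | NNE        | E          | ... | 83          | 90          | 1014.2      | 1011.8      | 8        | 8        | 21.4    | 20.9    | Yes       | Yes          |
| 4 | 2/5/2008 | 19.7    | 25.7    | 77.4     | 4.8         | 0.0      | W           | 41            | NNE        | W          | ... | 88          | 74          | 1008.3      | 1004.8      | 8        | 8        | 22.5    | 25.5    | Yes       | Yes          |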


## Data Preprocessing


### One Hot Encoding


In [4]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
df_sydney_processed['RainTomorrow']

0       Yes
1       Yes
2       Yes
3       Yes
4       Yes
       ... 
3266     No
3267     No
3268     No
3269     No
3270     No
Name: RainTomorrow, Length: 3271, dtype: object
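
To make the encoding step concrete, here is a minimal plain-Python sketch of what one hot encoding does to a single categorical column: the column is replaced by one 0/1 indicator column per category, named `<column>_<value>`. The `one_hot` helper and the two sample rows are hypothetical and only illustrate the idea; the notebook itself relies on `pd.get_dummies`.

```python
def one_hot(rows, column):
    """Replace `column` in each row dict with one 0/1 indicator per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other field unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows using two of the dataset's column names.
sample_rows = [
    {"RainToday": "Yes", "WindGustDir": "W"},
    {"RainToday": "No", "WindGustDir": "S"},
]

encoded_rows = one_hot(sample_rows, "RainToday")
print(encoded_rows)
```

Note that `RainTomorrow` is deliberately left out of the encoded columns in the cell above, since it is the prediction target and is mapped to 0/1 in the next step instead.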

In [5]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
df_sydney_processed['RainTomorrow']

0       1
1       1
2       1
3       1
4       1
       ..
3266    0
3267    0
3268    0
3269    0
3270    0
Name: RainTomorrow, Length: 3271, dtype: int64

### Training Data and Test Data


In [6]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [7]:
df_sydney_processed = df_sydney_processed.astype(float)

In [8]:
X = df_sydney_processed.drop(columns='RainTomorrow')  # `columns=` already implies axis=1
y = df_sydney_processed['RainTomorrow']

### Train Test Split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
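
As a rough illustration of what an 80/20 split with a fixed seed means for this dataset, the sketch below shuffles row indices reproducibly and cuts them into two disjoint sets. It assumes the test size is rounded up, which gives 655 test rows and 2616 training rows for the 3271 observations shown earlier. The `split_indices` helper is hypothetical, and its shuffle order will not match the one scikit-learn produces for `random_state=42`.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    # Assumption: the test fraction is rounded up to a whole number of rows.
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same split every run
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(3271)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters here because every model in the notebook is evaluated on the same held-out rows, which keeps the metrics comparable across models.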

## Linear Regression


### Create and Train a Linear Regression Model


In [10]:
LinearReg = LinearRegression()
LinearReg.fit(X_train, y_train)

### Testing and Evaluation


In [11]:
predictions = LinearReg.predict(X_test)

In [12]:
LinearRegression_MAE = mean_absolute_error(y_test, predictions)
LinearRegression_MSE = mean_squared_error(y_test, predictions)
LinearRegression_R2 = r2_score(y_test, predictions)
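
For reference, the three regression metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the scikit-learn code; the function names and toy values are made up for illustration. The toy targets are 0/1, as in this notebook, while the predictions are continuous, which is what a linear regression model returns.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values (not taken from the dataset).
toy_targets = [0.0, 1.0, 0.0, 1.0]
toy_outputs = [0.2, 0.7, 0.1, 0.6]

print(mae(toy_targets, toy_outputs), mse(toy_targets, toy_outputs), r2(toy_targets, toy_outputs))
```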

In [13]:
Report = {
    'Model' : ['Linear Regression'],
    'Mean Absolute Error' : [LinearRegression_MAE],
    'Mean Squared Error' : [LinearRegression_MSE],
    'R-Squared' : [LinearRegression_R2]
}
report_df = pd.DataFrame(Report)
report_df

|   | Model             | Mean Absolute Error | Mean Squared Error | R-Squared |
| - | ----------------- | ------------------- | ------------------ | --------- |
| 0 | Linear Regression | 0.27051             | 0.131724           | 0.336736  |


## KNN


### Create and Train a KNN Model


In [14]:
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(X_train, y_train)
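
The idea behind the classifier can be sketched in plain Python: find the `k` training points closest to a query point and return the most common label among them. This is only an illustration with Euclidean distance and made-up points; `knn_predict` is a hypothetical helper, and its tie handling (possible with an even `k` such as 4) is this sketch's own choice and may differ from scikit-learn's.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=4):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort (distance, label) pairs so the closest neighbours come first.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D points forming two well-separated groups.
toy_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, (0.5, 0.5)))
print(knn_predict(toy_X, toy_y, (5.5, 5.5)))
```

Because the vote depends on raw distances, features with large numeric ranges (pressure in hectopascals, for example) weigh more heavily than small-range ones unless the data is scaled first.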

### Testing and Evaluation


In [15]:
predictions = KNN.predict(X_test)

In [16]:
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)

## Decision Tree


### Create and Train a Decision Tree Model


In [17]:
Tree = DecisionTreeClassifier()
Tree.fit(X_train, y_train)

### Testing and Evaluation

In [18]:
predictions = Tree.predict(X_test)

In [19]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)

## Logistic Regression


### Create and Train a Logistic Regression Model

In [20]:
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)

### Testing and Evaluation

In [21]:
predictions = LR.predict(X_test)

In [22]:
predict_proba = LR.predict_proba(X_test)

In [23]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)
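
Log loss is the only metric here that uses the predicted probabilities rather than the hard 0/1 predictions, which is why it is computed from `predict_proba`. The sketch below writes out the binary form of the definition in plain Python, assuming `p_positive` holds the predicted probability of class 1 for each sample. The function name, the clipping constant, and the toy values are illustrative assumptions, not the scikit-learn implementation.

```python
import math


def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Average negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for t, p in zip(y_true, p_positive):
        # Clip so that log(0) can never occur for over-confident predictions.
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical toy values: both samples get probability 0.8 for the correct class.
confident_loss = binary_log_loss([1, 0], [0.8, 0.2])
# Predicting 0.5 everywhere gives ln(2), a useful reference point.
coin_flip_loss = binary_log_loss([1, 0], [0.5, 0.5])

print(confident_loss, coin_flip_loss)
```

Lower values are better, and a confident wrong prediction is penalized much more heavily than an uncertain one.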

## SVM


### Create and Train an SVM Model

In [24]:
SVM = svm.SVC()
SVM.fit(X_train, y_train)

### Testing and Evaluation

In [25]:
predictions = SVM.predict(X_test)

In [26]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)

## Model Metrics Report


In [27]:
Report_lr = {
    'Model' : ['Linear Regression'],
    'Mean Absolute Error' : [LinearRegression_MAE],
    'Mean Squared Error' : [LinearRegression_MSE],
    'R-Squared' : [LinearRegression_R2]
}
lr_report_df = pd.DataFrame(Report_lr)
lr_report_df

|   | Model             | Mean Absolute Error | Mean Squared Error | R-Squared |
| - | ----------------- | ------------------- | ------------------ | --------- |
| 0 | Linear Regression | 0.27051             | 0.131724           | 0.336736  |


In [28]:
Report_cm = {
    'Model': ['KNN', 'Decision Tree', 'Logistic Regression', 'SVM'],
    'Accuracy Score': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    'F1 Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    'Log Loss': [None, None, LR_Log_Loss, None]
}

report_df = pd.DataFrame(Report_cm)
report_df

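|   | Model               | Accuracy Score | Jaccard Index | F1 Score | Log Loss |
| - | ------------------- | -------------- | ------------- | -------- | -------- |
| 0 | KNN                 | 0.792366       | 0.349282      | 0.51773  |          |
| 1 | Decision Tree       | 0.757252       | 0.3861        | 0.557103 |          |
| 2 | Logistic Regression | 0.827481       | 0.466981      | 0.636656 | 0.411985 |
| 3 | SVM                 | 0.726718       | 0.0           | 0.0      |          |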
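
One way to sanity-check the table is to use the fact that, for binary labels, the F1-score and the Jaccard index J depend on the same counts (true positives, false positives, false negatives), so F1 = 2J / (1 + J). The sketch below applies that identity to the values copied from the report above (rounded to six decimals); the dictionary and helper name are illustrative.

```python
def f1_from_jaccard(j):
    """Binary-case identity: F1 = 2J / (1 + J)."""
    return 2 * j / (1 + j)


# Values copied from the report table (rounded as displayed).
reported = {
    "KNN": {"jaccard": 0.349282, "f1": 0.517730},
    "Decision Tree": {"jaccard": 0.386100, "f1": 0.557103},
    "Logistic Regression": {"jaccard": 0.466981, "f1": 0.636656},
    "SVM": {"jaccard": 0.0, "f1": 0.0},
}

for model, scores in reported.items():
    implied = f1_from_jaccard(scores["jaccard"])
    print(f"{model}: reported F1 = {scores['f1']:.6f}, implied by Jaccard = {implied:.6f}")
```

The SVM row deserves attention: a Jaccard index and F1-score of exactly 0 mean the model produced no true positives on the test set, that is, it never correctly predicted a rainy day. Its accuracy still reaches about 0.73 because most test days are dry, so accuracy alone overstates how useful that model is.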
