# Rain Prediction in Australia

## Classification Machine Learning algorithms

---

<h2>Table of Contents</h2>
<div>
    <ul>
    <li>Importing Data</li>
    <li>Data Preprocessing</li>
    <li>One Hot Encoding</li>
    <li>Training and Testing Data Split</li>
    <li>Training Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression models & returning their appropriate accuracy scores</li>
    </li>
</div>
<br>

<hr>

Below, is where we are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using evaluation metrics.

We will use some of the algorithms, specifically:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

Finally, we will create a pandas dataframe with the models and their evaluation metrics constituting the rows and columns of the dataframe respectively.

## About The Dataset


The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged of 10 minutes prior to 9am    | Compass Points  | object |
| WindDir3pm    | Wind direction averaged of 10 minutes prior to 3pm    | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged of 10 minutes prior to 9am        | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged of 10 minutes prior to 3pm        | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eights          | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eights          | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | float  |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## Import the required libraries

In [59]:
import warnings
warnings.filterwarnings('ignore')

In [60]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset


In [61]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

In [62]:
df = pd.read_csv(path)
df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


### Data Preprocessing


#### One Hot Encoding


First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [63]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.

In [64]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)

### Training Data and Testing Data

Now, we set our 'features' as x values and our Y as target variable.

In [65]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [66]:
df_sydney_processed = df_sydney_processed.astype(float)

In [67]:
x = df_sydney_processed.drop(columns='RainTomorrow', axis=1)
y = df_sydney_processed['RainTomorrow']

In [68]:
x.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,19.5,22.4,15.6,6.2,0.0,41.0,17.0,20.0,92.0,84.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,19.5,25.6,6.0,3.4,2.7,41.0,9.0,13.0,83.0,73.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,21.6,24.5,6.6,2.4,0.1,41.0,17.0,2.0,88.0,86.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,20.2,22.8,18.8,2.2,0.0,41.0,22.0,20.0,83.0,90.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,19.7,25.7,77.4,4.8,0.0,41.0,11.0,6.0,88.0,74.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [69]:
y

0       1.0
1       1.0
2       1.0
3       1.0
4       1.0
       ... 
3266    0.0
3267    0.0
3268    0.0
3269    0.0
3270    0.0
Name: RainTomorrow, Length: 3271, dtype: float64

### Linear Regression


##### Using the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`

In [70]:
from sklearn.model_selection import train_test_split
x_train1, x_test1, y_train1, y_test1 = train_test_split(x, y, test_size=0.2, random_state=10)
print("Train set: ", x_train1.shape, y_train1.shape)
print("Test set: ", x_test1.shape, y_test1.shape)

Train set:  (2616, 66) (2616,)
Test set:  (655, 66) (655,)


##### Creating and training a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`)

In [71]:
#Enter Your Code, Execute and take the Screenshot

from sklearn import linear_model

In [72]:
LinearReg = linear_model.LinearRegression()
LinearReg.fit(x_train1, y_train1)

print ('Coefficients: ', LinearReg.coef_)
print ('Intercept: ',LinearReg.intercept_)
type(LinearReg.coef_)
type(LinearReg.intercept_)

Coefficients:  [-2.36917590e-02  1.30054373e-02  7.29815718e-04  6.49074935e-03
 -3.51643560e-02  4.23764863e-03  1.82923556e-03  7.89868558e-04
  9.56084036e-04  8.56062272e-03  7.69810332e-03 -9.24433319e-03
 -8.87439523e-03  1.00475778e-02  1.44655485e-02 -3.48059951e-03
  3.53652210e+08  3.53652210e+08 -1.48636792e+08 -1.48636792e+08
 -1.48636792e+08 -1.48636792e+08 -1.48636792e+08 -1.48636792e+08
 -1.48636792e+08 -1.48636792e+08 -1.48636792e+08 -1.48636792e+08
 -1.48636792e+08 -1.48636792e+08 -1.48636792e+08 -1.48636792e+08
 -1.48636792e+08 -1.48636792e+08  2.45978566e+08  2.45978566e+08
  2.45978566e+08  2.45978566e+08  2.45978566e+08  2.45978566e+08
  2.45978566e+08  2.45978566e+08  2.45978566e+08  2.45978566e+08
  2.45978566e+08  2.45978566e+08  2.45978566e+08  2.45978566e+08
  2.45978566e+08  2.45978566e+08 -1.68435103e+08 -1.68435103e+08
 -1.68435103e+08 -1.68435103e+08 -1.68435103e+08 -1.68435103e+08
 -1.68435103e+08 -1.68435103e+08 -1.68435103e+08 -1.68435103e+08
 -1.684351

numpy.float64

##### Now we use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`

In [73]:
predictions1 = LinearReg.predict(x_test1)
print(predictions1)

[ 1.31837666e-01  2.76184976e-01  9.78188813e-01  2.87457049e-01
  1.32413983e-01  4.60464776e-01  3.56785357e-01  8.56409013e-01
  6.75011694e-01  3.82469296e-02  4.77451086e-03  2.81215012e-01
  3.39082420e-01  7.80925155e-02  6.25942945e-02  5.64438164e-01
 -6.15522265e-02  5.24162114e-01  1.53691232e-01  3.59700620e-01
  6.05329871e-02  9.03560758e-01  4.67318714e-01  2.03370869e-01
 -7.10244775e-02  3.83878171e-01  5.36085367e-01 -2.28936672e-02
  6.40129626e-01 -9.56752896e-02  3.78086269e-01  1.20264471e-01
 -1.81462169e-02  5.53833842e-02  5.63534856e-01  1.06298536e+00
 -6.75231218e-03  5.14394581e-01 -8.83882046e-02  6.91938996e-02
  2.44745016e-02  8.71741116e-01  2.44666278e-01  3.94727230e-01
  2.67560542e-01  4.46795344e-01 -4.75681424e-02  1.89430654e-01
  7.76609361e-01  1.57759488e-01  3.94415855e-03 -5.19683957e-02
  2.07340419e-01 -2.07888544e-01 -7.61141777e-02  2.49651730e-01
  2.79297054e-01  6.02773964e-01  6.29590929e-01  4.90636110e-01
  5.64552546e-02  1.05475

##### Using the `predictions` and the `y_test` dataframe, calculating the value for each metric using the appropriate function.

In [74]:
from sklearn.metrics import r2_score

In [75]:
LinearRegression_MAE = np.mean(np.absolute(predictions1 - y_test1))
LinearRegression_MSE = np.mean((predictions1 - y_test1) ** 2)
LinearRegression_R2 =  r2_score(predictions1 , y_test1)

##### Showing the MAE, MSE, and R2 in a tabular format using data frame for the linear model.

In [76]:
data = [['MAE',LinearRegression_MAE],['MSE',LinearRegression_MSE],['R2',LinearRegression_R2]]

Report = pd.DataFrame(data, columns=['Name','Value'])

print(Report)

  Name     Value
0  MAE  0.256318
1  MSE  0.115721
2   R2 -0.384756


### KNN


##### Creating and training a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.

In [77]:
from sklearn.model_selection import train_test_split
x_train2, x_test2, y_train2, y_test2 = train_test_split( x, y, test_size=0.2, random_state=4)
print ('Train set:', x_train2.shape,  y_train2.shape)
print ('Test set:', x_test2.shape,  y_test2.shape)

Train set: (2616, 66) (2616,)
Test set: (655, 66) (655,)


In [78]:
from sklearn.neighbors import KNeighborsClassifier

k = 4

neigh = KNeighborsClassifier(n_neighbors = k).fit(x_train2, y_train2)
neigh

##### Now, we use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`

In [79]:
predictions2 = neigh.predict(x_test2)
predictions2[0:5]

array([0., 0., 0., 0., 0.])

##### Using the `predictions` and the `y_test` dataframe, we calculate the value for each metric using the appropriate function

In [80]:
KNN_Accuracy_Score = metrics.accuracy_score(y_test2, predictions2)
KNN_JaccardIndex = jaccard_score(y_test2, predictions2)
KNN_F1_Score = f1_score(y_test2, predictions2)

In [81]:
print(KNN_Accuracy_Score)
print(KNN_JaccardIndex)
print(KNN_F1_Score)

0.8213740458015267
0.4264705882352941
0.5979381443298969


### Decision Tree


##### Creating and training a Decision Tree model called Tree using the training data (`x_train`, `y_train`)

In [82]:
from sklearn.model_selection import train_test_split

x_train3, x_test3, y_train3, y_test3 = train_test_split(x, y, test_size=0.3, random_state=3)

In [83]:
Tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
Tree

##### Now, we use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`

In [84]:
Tree.fit(x_train3, y_train3)
predictions3 = Tree.predict(x_test3)
predictions3

array([0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 1., 0.

##### Using the `predictions` and the `y_test` dataframe, we calculate the value for each metric using the appropriate function

In [85]:
Tree_Accuracy_Score = metrics.accuracy_score(y_test3, predictions3)
Tree_JaccardIndex = jaccard_score(y_test3, predictions3)
Tree_F1_Score = f1_score(y_test3, predictions3)

In [86]:
print(Tree_Accuracy_Score)
print(Tree_JaccardIndex)
print(Tree_F1_Score)

0.8177189409368636
0.39730639730639733
0.5686746987951807


### Logistic Regression


##### Using the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`

In [87]:
from sklearn.model_selection import train_test_split

In [88]:
x_train4, x_test4, y_train4, y_test4 = train_test_split(x, y, test_size=0.2, random_state=1)

##### Creating and training a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.

In [89]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train4,y_train4)
LR

##### Now, we use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`

In [90]:
predictions4 = LR.predict(x_test4)
predict_proba = LR.predict_log_proba(x_test4)
type(predictions4)
type(predict_proba)

numpy.ndarray

In [91]:
print(predict_proba.shape)
print(y_test4.shape)

(655, 2)
(655,)


##### Using the `predictions`, `predict_proba` and the `y_test` dataframe, we calculate the value for each metric using the appropriate function

In [92]:
LR_Accuracy_Score = accuracy_score(y_test4, predictions4)
LR_JaccardIndex = jaccard_score(y_test4, predictions4)
LR_Log_Loss = log_loss(y_test4, predict_proba)

In [93]:
print(LR_Accuracy_Score)
print(LR_JaccardIndex)
print(LR_Log_Loss)

0.8274809160305343
0.4840182648401826
0.6931471805599454


### SVM


##### Creating and training a SVM model called SVM using the training data (`x_train`, `y_train`)

In [94]:
x_train5, x_test5, y_train5, y_test5 = train_test_split( x, y, test_size=0.2, random_state=4)
print ('Train set:', x_train5.shape,  y_train5.shape)
print ('Test set:', x_test5.shape,  y_test5.shape)

Train set: (2616, 66) (2616,)
Test set: (655, 66) (655,)


In [95]:
from sklearn import svm

svm = svm.SVC(kernel='rbf')
svm.fit(x_train5, y_train5)

##### Now, we use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`

In [96]:
predictions5 = svm.predict(x_test5)
predictions5[0:5]

array([0., 0., 0., 0., 0.])

##### Using the `predictions` and the `y_test` dataframe, we calculate the value for each metric using the appropriate function

In [97]:
SVM_Accuracy_Score = accuracy_score(y_test5, predictions5)
SVM_JaccardIndex = jaccard_score(y_test5, predictions5)
SVM_F1_Score = f1_score(y_test5, predictions5)

In [98]:
print(SVM_Accuracy_Score)
print(SVM_JaccardIndex)
print(SVM_F1_Score)

0.7206106870229008
0.0
0.0


##### Showing the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.

\*LogLoss is only for Logistic Regression Model

In [99]:
data = [['KNN',KNN_Accuracy_Score,KNN_JaccardIndex,KNN_F1_Score, '-'],
        ['Decision Tree', Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score, '-'],
        ['Logistic Regression', LR_Accuracy_Score, LR_JaccardIndex, '-', LR_Log_Loss],
        ['SVM', SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score, '-']
    ]

Report = pd.DataFrame(data, columns=['Name', 'Accuracy Score', 'Jaccard Index', 'F1 Score', 'Log Loss'])

print(Report)

                  Name  Accuracy Score  Jaccard Index  F1 Score  Log Loss
0                  KNN        0.821374       0.426471  0.597938         -
1        Decision Tree        0.817719       0.397306  0.568675         -
2  Logistic Regression        0.827481       0.484018         -  0.693147
3                  SVM        0.720611       0.000000       0.0         -


Copyright © 2021 IBM Corporation. All rights reserved.