
<h1 align="center"><font size="5">Rain Prediction in Australia using Classification Algorithms</font></h1>

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
    <li><a href="https://#Section_1">Introduction</a></li>
    <li><a href="https://#Section_2">About the Data</a></li>
    <li><a href="https://#Section_3">Importing Data </a></li>
    <li><a href="https://#Section_4">Data Preprocessing</a> </li>
    <li><a href="https://#Section_5">One Hot Encoding </a></li>
    <li><a href="https://#Section_6">Train and Test Data Split </a></li>
    <li><a href="https://#Section_7">Train Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression models and return their appropriate accuracy scores</a></li>
    </ul>
</div>

<hr>


# Introduction


We will use the following algorithms:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

Finally, we will combine the metrics of the classification models into a single report at the end.


# About The Dataset


The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday', and our target is 'RainTomorrow'. It was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## **Import the required libraries**


In [334]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

In [335]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [336]:
# You are running the lab in your browser, so we will install the libraries using `piplite`
import piplite
await piplite.install(['pandas'])
await piplite.install(['numpy'])

In [337]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset


In [339]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())
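
The `download` helper above relies on `pyodide`, which is only available when the notebook runs in the browser. As a rough sketch for a regular Python environment, the same step could be done with the standard library. The name `download_sync` is hypothetical and not part of the lab.

```python
import shutil
import urllib.request


def download_sync(url, filename):
    """Hypothetical stand-in for the async download(): fetch url and save it to filename."""
    with urllib.request.urlopen(url) as response, open(filename, "wb") as f:
        # Stream the response body to disk without loading it all into memory
        shutil.copyfileobj(response, f)


# Usage (commented out so that the sketch does not touch the network when run):
# download_sync(path, "Weather_Data.csv")
```

In many environments `pd.read_csv` can also read directly from an HTTP(S) URL, which would make the explicit download step unnecessary.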


In [340]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

In [341]:
filename = "Weather_Data.csv"
await download(path, filename)

In [342]:
df = pd.read_csv("Weather_Data.csv")
df.head()

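|   | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |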


### Data Preprocessing


#### One Hot Encoding


First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [346]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
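
To make the effect of one hot encoding concrete, here is a minimal sketch on a tiny made-up frame. The column names mirror the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame; only the column names come from the dataset
toy = pd.DataFrame({
    "Rainfall": [15.6, 0.0, 6.6],
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
})

# Each listed categorical column is replaced by one indicator column per category
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])
print(list(encoded.columns))
```

The numeric column `Rainfall` is left untouched, while `RainToday` and `WindGustDir` are each expanded into one column per distinct value.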

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [348]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
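
The `replace` call above acts on every column of the frame. As an alternative sketch, assuming only the target column should be touched, the mapping can be restricted to that column with `Series.map`. The toy frame below is invented for illustration.

```python
import pandas as pd

# Made-up frame with a categorical target column
toy = pd.DataFrame({
    "Rainfall": [15.6, 0.0, 6.6],
    "RainTomorrow": ["Yes", "No", "Yes"],
})

# Map only the target column to 0/1; other columns are left as they are
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})
print(toy)
```

Either way the target stays in a single column, which is what the model fitting step expects.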

### Training Data and Test Data


Now, we set our features (the x values) and our target variable `Y`.


In [351]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [352]:
df_sydney_processed = df_sydney_processed.astype(float)

In [353]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

### Linear Regression


#### Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [356]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)
x_train.head()

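|   | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | ... | WindDir3pm_NNW | WindDir3pm_NW | WindDir3pm_S | WindDir3pm_SE | WindDir3pm_SSE | WindDir3pm_SSW | WindDir3pm_SW | WindDir3pm_W | WindDir3pm_WNW | WindDir3pm_WSW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3188 | 14.8 | 22.0 | 33.8 | 4.2 | 1.5 | 50.0 | 19.0 | 24.0 | 90.0 | 49.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2582 | 8.1 | 18.4 | 0.0 | 4.8 | 8.5 | 41.0 | 20.0 | 11.0 | 70.0 | 50.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 815 | 15.4 | 21.1 | 0.0 | 3.8 | 5.9 | 41.0 | 7.0 | 28.0 | 59.0 | 41.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1024 | 20.0 | 26.5 | 0.0 | 8.6 | 13.1 | 30.0 | 7.0 | 20.0 | 64.0 | 63.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1320 | 14.8 | 18.3 | 38.8 | 7.4 | 0.1 | 48.0 | 19.0 | 24.0 | 93.0 | 86.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |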
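
As a small illustration of what `test_size` and `random_state` do, the sketch below splits a synthetic ten-row frame. The names `toy_features` and `toy_target` are made up and stand in for `features` and `Y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for `features` and `Y`: 10 rows, 2 feature columns
toy_features = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2.0})
toy_target = pd.Series(np.arange(10) % 2, name="target")

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
xa, xb, ya, yb = train_test_split(toy_features, toy_target, test_size=0.2, random_state=10)
# Repeating the call with the same random_state selects the same rows again
xa2, xb2, ya2, yb2 = train_test_split(toy_features, toy_target, test_size=0.2, random_state=10)

print(xa.shape, xb.shape)
```

Because the rows are shuffled before splitting, the index of the training frame is no longer in order, which matches the shuffled index values seen in `x_train.head()`.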


#### Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [358]:
LinearReg = LinearRegression().fit(x_train, y_train)

#### Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [360]:
predictions =  LinearReg.predict(x_test)

#### Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [362]:
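from sklearn.metrics import r2_score
# Mean absolute error and mean squared error between predictions and true values
LinearRegression_MAE = np.mean(np.absolute(predictions - y_test))
LinearRegression_MSE = np.mean((predictions - y_test) ** 2)
# R2 score of the predictions (equivalent to LinearReg.score(x_test, y_test))
LinearRegression_R2 = r2_score(y_test, predictions)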
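
For reference, the three regression metrics can be written out by hand and compared with the corresponding scikit-learn functions. The arrays below are invented toy values, not results from the weather data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions, invented for illustration
y_true = np.array([0.0, 1.0, 0.0, 1.0])
y_hat = np.array([0.2, 0.7, 0.1, 0.4])

# MAE: average absolute difference between prediction and target
mae = np.mean(np.abs(y_hat - y_true))
# MSE: average squared difference between prediction and target
mse = np.mean((y_hat - y_true) ** 2)
# R2: one minus residual sum of squares over total sum of squares
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, r2)
print(mean_absolute_error(y_true, y_hat), mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```

Both lines should print matching values, which shows that the manual NumPy expressions and the library functions compute the same quantities.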

#### Show the MAE, MSE, and R2 in a tabular format using a data frame for the linear model.


In [365]:
Report_LiR=pd.DataFrame(np.array([LinearRegression_MAE,LinearRegression_MSE,LinearRegression_R2]).reshape(1,-1),columns=['MAE','MSE','R2 Score']
                    ,index=['Linear Regression'])
Report_LiR

|   | MAE | MSE | R2 Score |
| --- | --- | --- | --- |
| Linear Regression | 0.256328 | 0.115724 | 0.427117 |


### KNN


#### Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [368]:
KNN = KNeighborsClassifier(n_neighbors=4).fit(x_train,y_train)

#### Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [370]:
predictions = KNN.predict(x_test)

#### Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [372]:
# scikit-learn metric functions take the true labels first, then the predictions
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)

Report_knn=pd.DataFrame(np.array([KNN_Accuracy_Score,KNN_JaccardIndex,KNN_F1_Score]).reshape(1,-1),columns=['Accuracy Score','Jaccard Index','F1 Score']
                    ,index=['KNN'])
Report_knn

|   | Accuracy Score | Jaccard Index | F1 Score |
| --- | --- | --- | --- |
| KNN | 0.818321 | 0.425121 | 0.59661 |
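
To show how the three classification metrics relate to each other, the sketch below derives them from the confusion matrix counts of a small invented example and compares them with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score

# Toy labels, invented for illustration (1 = rain, 0 = no rain)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
jaccard = tp / (tp + fp + fn)               # overlap of predicted and actual positives
f1 = 2 * tp / (2 * tp + fp + fn)            # harmonic mean of precision and recall

print(accuracy, jaccard, f1)
print(accuracy_score(y_true, y_pred), jaccard_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Swapping the two arguments only exchanges the false positive and false negative counts, so these three binary metrics keep the same value. The conventional order is still true labels first, because other metrics are not symmetric.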


### Decision Tree


#### Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).


In [375]:
Tree = DecisionTreeClassifier().fit(x_train,y_train)

#### Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [377]:
predictions = Tree.predict(x_test)

#### Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [379]:
# scikit-learn metric functions take the true labels first, then the predictions
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)

Report_tree=pd.DataFrame(np.array([Tree_Accuracy_Score,Tree_JaccardIndex,Tree_F1_Score]).reshape(1,-1),columns=['Accuracy Score','Jaccard Index','F1 Score']
                    ,index=['Decision Tree'])
Report_tree

|   | Accuracy Score | Jaccard Index | F1 Score |
| --- | --- | --- | --- |
| Decision Tree | 0.746565 | 0.3829 | 0.553763 |
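
The tree above is created without a `random_state`, so repeated runs may produce slightly different trees and scores when several candidate splits are equally good. As a hedged sketch on synthetic data (not the weather data), fixing `random_state` makes the fit repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two trees fitted with the same seed on the same data
tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=0).fit(X, y)

# With a fixed seed both fits give the same predictions
same_predictions = bool((tree_a.predict(X) == tree_b.predict(X)).all())
print(same_predictions, tree_a.get_depth(), tree_b.get_depth())
```

The seed value `0` is an arbitrary choice for this sketch; any fixed integer serves the same purpose.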


### Logistic Regression


#### Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.


In [382]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)
x_train.head()

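|   | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | ... | WindDir3pm_NNW | WindDir3pm_NW | WindDir3pm_S | WindDir3pm_SE | WindDir3pm_SSE | WindDir3pm_SSW | WindDir3pm_SW | WindDir3pm_W | WindDir3pm_WNW | WindDir3pm_WSW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1264 | 12.3 | 19.7 | 30.0 | 4.0 | 7.2 | 44.0 | 24.0 | 17.0 | 52.0 | 45.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3113 | 25.1 | 38.1 | 0.0 | 13.8 | 11.1 | 59.0 | 20.0 | 31.0 | 26.0 | 52.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1230 | 12.2 | 17.5 | 0.0 | 2.8 | 3.0 | 41.0 | 17.0 | 11.0 | 86.0 | 69.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1221 | 10.3 | 16.5 | 1.0 | 2.2 | 0.8 | 24.0 | 7.0 | 11.0 | 89.0 | 71.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3070 | 20.4 | 23.6 | 3.2 | 7.0 | 0.0 | 33.0 | 15.0 | 7.0 | 86.0 | 83.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |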


#### Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [384]:
LR = LogisticRegression(solver='liblinear').fit(x_train,y_train)

#### Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save it as 2 arrays `predictions` and `predict_proba`.


In [386]:
predictions = LR.predict(x_test)
predictions[:10]

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.])

In [387]:
predict_proba = LR.predict_proba(x_test)
predict_proba[:5]

array([[0.73904476, 0.26095524],
       [0.97511927, 0.02488073],
       [0.51767575, 0.48232425],
       [0.8455409 , 0.1544591 ],
       [0.968682  , 0.031318  ]])
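
Each row of `predict_proba` holds the estimated probability of class 0 and class 1, and the two values sum to 1. A minimal sketch on synthetic data (not the weather data) shows how these probabilities relate to the labels returned by `predict`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(solver="liblinear").fit(X, y)

proba = model.predict_proba(X)   # shape (n_samples, 2): P(class 0), P(class 1)
labels = model.predict(X)        # hard class labels

# Picking the class with the larger probability reproduces predict()
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels, labels_from_proba))
```

In the output above, for example, the third row `[0.51767575, 0.48232425]` is a close call that still resolves to class 0.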

#### Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [389]:
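# scikit-learn metric functions take the true labels first, then the predictions
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
# Log loss is computed here from the hard class labels in `predictions`;
# passing the probabilities in `predict_proba` instead would give a different value
LR_Log_Loss = log_loss(y_test, predictions)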

Report_LR=pd.DataFrame(np.array([LR_Accuracy_Score,LR_JaccardIndex,LR_F1_Score,LR_Log_Loss ]).reshape(1,-1),
                    columns=['Accuracy Score','Jaccard Index','F1 Score','Log Loss']
                    ,index=['Logistic Regression'])
Report_LR

|   | Accuracy Score | Jaccard Index | F1 Score | Log Loss |
| --- | --- | --- | --- | --- |
| Logistic Regression | 0.836641 | 0.509174 | 0.674772 | 5.888047 |
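
The log loss reported here is computed from hard 0/1 predictions. Log loss is normally evaluated on predicted probabilities, and hard labels are penalized very heavily for every mistake. The sketch below uses invented toy values to show the difference; it does not reproduce the number in the table.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy true labels and hypothetical predicted probabilities of rain
y_true = np.array([0, 0, 1, 1])
proba_rain = np.array([0.1, 0.6, 0.8, 0.4])

# Hard labels obtained by thresholding the probabilities at 0.5
hard_labels = (proba_rain >= 0.5).astype(float)

# Log loss from probabilities: moderate penalty for uncertain mistakes
loss_from_proba = log_loss(y_true, proba_rain)
# Log loss from hard labels: each mistake is treated as a fully confident error
loss_from_labels = log_loss(y_true, hard_labels)

print(loss_from_proba, loss_from_labels)
```

For the model in this notebook, the probability-based version would be obtained by passing `predict_proba` (or its second column) together with `y_test`.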


### SVM


#### Create and train an SVM model called SVM using the training data (`x_train`, `y_train`).


In [392]:
SVM = svm.SVC().fit(x_train,y_train)
SVM

#### Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [394]:
predictions = SVM.predict(x_test)

#### Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [396]:
# scikit-learn metric functions take the true labels first, then the predictions
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)
Report_SVM=pd.DataFrame(np.array([SVM_Accuracy_Score,SVM_JaccardIndex,SVM_F1_Score]).reshape(1,-1),
                    columns=['Accuracy Score','Jaccard Index','F1 Score']
                    ,index=['SVM'])
Report_SVM

|   | Accuracy Score | Jaccard Index | F1 Score |
| --- | --- | --- | --- |
| SVM | 0.722137 | 0.0 | 0.0 |
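
A Jaccard index and F1-score of exactly 0 next to an accuracy of about 0.72 is the pattern one would expect if the model never predicted the positive class. This is an interpretation, not something the notebook confirms; it could be checked with `np.unique(predictions, return_counts=True)`. The sketch below reproduces the pattern on invented labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Toy labels: 7 dry days and 3 rainy days (invented for illustration)
y_true = np.array([0] * 7 + [1] * 3)
# A degenerate classifier that always predicts "no rain"
y_pred = np.zeros_like(y_true)

# Count how often each class appears among the predictions
labels, counts = np.unique(y_pred, return_counts=True)

acc = accuracy_score(y_true, y_pred)             # equals the share of dry days
f1 = f1_score(y_true, y_pred, zero_division=0)   # no true positives, so 0
jac = jaccard_score(y_true, y_pred)              # no true positives, so 0

print(labels, counts, acc, f1, jac)
```

Accuracy alone can therefore look acceptable on imbalanced data even when no rainy day is detected, which is why the other two metrics are reported alongside it.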


### Report


#### Show the Accuracy, Jaccard Index, F1-Score, and LogLoss in a tabular format using a data frame for all of the above models.

\*LogLoss is only reported for the Logistic Regression model.


In [399]:
Report= pd.concat([Report_knn, Report_tree, Report_LR, Report_SVM])


In [400]:
Report.fillna(0.0,inplace=True)
Report

|   | Accuracy Score | Jaccard Index | F1 Score | Log Loss |
| --- | --- | --- | --- | --- |
| KNN | 0.818321 | 0.425121 | 0.59661 | 0.0 |
| Decision Tree | 0.746565 | 0.3829 | 0.553763 | 0.0 |
| Logistic Regression | 0.836641 | 0.509174 | 0.674772 | 5.888047 |
| SVM | 0.722137 | 0.0 | 0.0 | 0.0 |
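
The zeros in the Log Loss column for KNN, Decision Tree, and SVM come from `fillna(0.0)`, not from a measurement. A minimal sketch with invented numbers shows how `pd.concat` aligns reports that have different columns:

```python
import numpy as np
import pandas as pd

# Two made-up one-row reports; only the second has a 'Log Loss' column
report_a = pd.DataFrame({"Accuracy Score": [0.8], "F1 Score": [0.6]}, index=["Model A"])
report_b = pd.DataFrame(
    {"Accuracy Score": [0.9], "F1 Score": [0.7], "Log Loss": [0.4]},
    index=["Model B"],
)

# concat takes the union of the columns and fills missing cells with NaN
combined = pd.concat([report_a, report_b])
print(combined)
```

Leaving the missing cells as `NaN` is one possible alternative to filling them with `0.0`, since a log loss of 0 could be misread as a perfect score.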
