#Rain Prediction in Australia

For your project, you will use a rainfall dataset from the Australian Government's Bureau of Meteorology, clean the data, and apply different classification algorithms to the data. A prepared copy of the data is downloaded as a CSV file in the "Importing the Dataset" section below.

The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow'. It was gathered from the Rattle package at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).




You are expected to use the following algorithms to build your models:

1.  Linear Regression
2.  KNN
3.  Decision Trees
4.  Logistic Regression
5.  SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

**Finally, you will use your models to generate the report at the end.**

#Objectives
1.     Splitting the dataset into training and testing data for regression
2.     Building and training a model using Linear Regression and calculating evaluation metrics
3.     Creating a final regression report/table of evaluation metrics
4.     Building and training a model using KNN and calculating evaluation metrics
5.     Building and training a model using Decision Trees and calculating evaluation metrics
6.     Building and training a model using Logistic Regression and calculating evaluation metrics
7.     Building and training a model using SVM and calculating evaluation metrics
8.     Creating a final classification report/table of evaluation metrics

#Import the required libraries

In [None]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
import sklearn.metrics as metrics
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    jaccard_score,
    log_loss,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

#Importing the Dataset

In [5]:
import requests

# Define the URL and filename
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'
filename = "Weather_Data.csv"

# Download the CSV file
response = requests.get(path)
if response.status_code == 200:
    with open(filename, 'wb') as file:
        file.write(response.content)
    print(f"{filename} has been downloaded successfully.")
else:
    print(f"Failed to download the file. Status code: {response.status_code}")

# Load the CSV file into a DataFrame
df = pd.read_csv(filename)

# Display the first few rows of the DataFrame
print(df.head())


Weather_Data.csv has been downloaded successfully.
       Date  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  \
0  2/1/2008     19.5     22.4      15.6          6.2       0.0           W   
1  2/2/2008     19.5     25.6       6.0          3.4       2.7           W   
2  2/3/2008     21.6     24.5       6.6          2.4       0.1           W   
3  2/4/2008     20.2     22.8      18.8          2.2       0.0           W   
4  2/5/2008     19.7     25.7      77.4          4.8       0.0           W   

   WindGustSpeed WindDir9am WindDir3pm  ...  Humidity9am  Humidity3pm  \
0             41          S        SSW  ...           92           84   
1             41          W          E  ...           83           73   
2             41        ESE        ESE  ...           88           86   
3             41        NNE          E  ...           83           90   
4             41        NNE          W  ...           88           74   


In [6]:
df.head()

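| | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |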


#Data Preprocessing

One Hot Encoding

First, we need to perform one-hot encoding to convert categorical variables to binary variables.

In [7]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
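
As a minimal sketch of what `pd.get_dummies` does here, the toy frame below (hypothetical values, not rows from Weather_Data.csv) encodes a single categorical column. Each category becomes its own indicator column, and the numeric column is left untouched.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from Weather_Data.csv.
toy = pd.DataFrame({
    "MinTemp": [19.5, 21.6, 20.2],
    "RainToday": ["Yes", "No", "Yes"],
})

# One indicator column is created per category of 'RainToday'.
toy_encoded = pd.get_dummies(data=toy, columns=["RainToday"])
print(list(toy_encoded.columns))  # ['MinTemp', 'RainToday_No', 'RainToday_Yes']
```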

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the get_dummies method because we would end up with two columns for 'RainTomorrow', which we do not want, since 'RainTomorrow' is our target.

In [8]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
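
The `replace` call above substitutes 'No' and 'Yes' wherever they occur in the frame. A narrower alternative, sketched here on a hypothetical toy column rather than the real data, maps only the target column and leaves every other column alone.

```python
import pandas as pd

# Toy target column for illustration only.
toy = pd.DataFrame({"RainTomorrow": ["Yes", "No", "No", "Yes"]})

# Map the two categories to 1 and 0 in this one column.
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})
print(toy["RainTomorrow"].tolist())  # [1, 0, 0, 1]
```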



#Training Data and Test Data
Now we set our features (the X values) and our target variable Y. First, we drop the 'Date' column and cast the remaining columns to float.

In [9]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [10]:
df_sydney_processed = df_sydney_processed.astype(float)

In [11]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

#Linear Regression

Q1) Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 10.

In [12]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)
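
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on hypothetical toy arrays (10 rows, so a `test_size` of 0.2 holds out 2 of them). Reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```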

Q2) Create and train a Linear Regression model called LinearReg using the training data (x_train, y_train).

In [13]:
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)

Q3) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [14]:
predictions = LinearReg.predict(x_test)

Q4) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [17]:
LinearRegression_MAE = mean_absolute_error(y_test, predictions)
LinearRegression_MSE = mean_squared_error(y_test, predictions)
LinearRegression_R2 = r2_score(y_test, predictions)

Q5) Show the MAE, MSE, and R2 in a tabular format using a data frame for the linear model.

In [18]:
Report = pd.DataFrame({
    'Metric': ['MAE', 'MSE', 'R2'],
    'Linear Regression': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]
})
Report

| | Metric | Linear Regression |
|---|---|---|
| 0 | MAE | 0.256318 |
| 1 | MSE | 0.115721 |
| 2 | R2 | 0.427132 |
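
For reference, the three regression metrics can be computed by hand. The sketch below uses hypothetical toy values (not the notebook's predictions) and checks the manual arithmetic against the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy 0/1 targets and continuous predictions, for illustration only.
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.2, 0.8, 0.6, 0.4])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean of |error|
mse = np.mean(errors ** 2)      # mean of error squared
# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The manual values agree with scikit-learn.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(round(mae, 3), round(mse, 3), round(r2, 3))  # approximately 0.3, 0.1, 0.6
```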


#KNN

Q6) Create and train a KNN model called KNN using the training data (x_train, y_train) with the n_neighbors parameter set to 4.

In [19]:
# KNeighborsRegressor is already imported above; it predicts the mean label of the 4 nearest neighbours
KNN = KNeighborsRegressor(n_neighbors=4)
KNN.fit(x_train, y_train)
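
The cell above fits a `KNeighborsRegressor`, whose outputs are rounded to class labels later on. Because the target is binary, `KNeighborsClassifier` (also imported above) is a more direct choice. The sketch below uses hypothetical, well-separated toy points to show both routes giving the same labels; it is an illustration under those assumptions and says nothing about the scores reported in this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy one-dimensional data: two well-separated groups (illustration only).
X_toy = np.array([[0], [1], [2], [3], [10], [11], [12], [13]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_new = np.array([[1.5], [11.5]])

clf = KNeighborsClassifier(n_neighbors=4).fit(X_toy, y_toy)
reg = KNeighborsRegressor(n_neighbors=4).fit(X_toy, y_toy)

# The classifier votes among the 4 neighbours; the regressor averages their labels.
class_labels = clf.predict(X_new)
rounded_labels = np.round(reg.predict(X_new)).astype(int)
print(class_labels, rounded_labels)  # [0 1] [0 1]
```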

Q7) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [20]:
predictions = KNN.predict(x_test)

Q8) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [21]:
# The regressor outputs the mean of the neighbours' 0/1 labels; round it to get a 0/1 class
binary_predictions = [round(pred) for pred in predictions]

KNN_Accuracy_Score = accuracy_score(y_test, binary_predictions)
KNN_JaccardIndex = jaccard_score(y_test, binary_predictions)
KNN_F1_Score = f1_score(y_test, binary_predictions)
Report = pd.DataFrame({
    'Metric': ['Accuracy', 'Jaccard', 'F1'],
    'KNN Regression': [KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score]
})
Report

| | Metric | KNN Regression |
|---|---|---|
| 0 | Accuracy | 0.818321 |
| 1 | Jaccard | 0.425121 |
| 2 | F1 | 0.59661 |
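
The three classification metrics can all be derived from the counts of true and false positives and negatives. The sketch below works through that arithmetic on hypothetical toy labels and checks it against the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Toy labels for illustration only (1 = rain, 0 = no rain).
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)    # share of correct predictions
jaccard = tp / (tp + fp + fn)         # overlap of predicted and actual positives
f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall

# The manual values agree with scikit-learn.
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(jaccard, jaccard_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```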


#Decision Tree

Q9) Create and train a Decision Tree model called Tree using the training data (x_train, y_train).

In [22]:
Tree = DecisionTreeRegressor()
Tree.fit(x_train, y_train)


Q10) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [23]:
predictions = Tree.predict(x_test)

Q11) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [24]:
# Round the regressor's outputs to 0/1 class labels
binary_predictions_tree = [round(pred) for pred in predictions]

Tree_Accuracy_Score = accuracy_score(y_test, binary_predictions_tree)
Tree_JaccardIndex = jaccard_score(y_test, binary_predictions_tree)
Tree_F1_Score = f1_score(y_test, binary_predictions_tree)

Report = pd.DataFrame({
    'Metric': ['Accuracy', 'Jaccard', 'F1'],
    'Decision Tree': [Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score]
})
Report

| | Metric | Decision Tree |
|---|---|---|
| 0 | Accuracy | 0.754198 |
| 1 | Jaccard | 0.399254 |
| 2 | F1 | 0.570667 |


#Logistic Regression

Q12) Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 1.

In [25]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)

Q13) Create and train a LogisticRegression model called LR using the training data (x_train, y_train) with the solver parameter set to liblinear.

In [26]:
LR = LogisticRegression(solver='liblinear')
LR.fit(x_train, y_train)

Q14) Now, use the predict and predict_proba methods on the testing data (x_test) and save the results as two arrays, predictions and predict_proba.

In [27]:
predictions = LR.predict(x_test)
predict_proba = LR.predict_proba(x_test)

Q15) Using the predictions, predict_proba and the y_test dataframe calculate the value for each metric using the appropriate function.

In [28]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)

Report = pd.DataFrame({
    'Metric': ['Accuracy', 'Jaccard', 'F1', 'logloss'],
    'LogisticRegression': [LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss]
})
Report

| | Metric | LogisticRegression |
|---|---|---|
| 0 | Accuracy | 0.836641 |
| 1 | Jaccard | 0.509174 |
| 2 | F1 | 0.674772 |
| 3 | logloss | 0.381259 |
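
Unlike the other metrics, log loss is computed from predicted probabilities rather than hard labels, which is why it needs the output of `predict_proba`. The sketch below uses hypothetical toy probabilities laid out like `predict_proba` output (one column per class) and checks the binary log loss formula against `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and probabilities for illustration only.
y_true = np.array([1, 0, 1, 0])
# Columns: P(class 0), P(class 1), the same layout predict_proba returns.
proba = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.4, 0.6],
    [0.3, 0.7],
])

p1 = proba[:, 1]  # predicted probability of the positive class
# Binary log loss: -mean(y * log(p) + (1 - y) * log(1 - p))
manual = -np.mean(y_true * np.log(p1) + (1 - y_true) * np.log(1 - p1))

# The manual value agrees with scikit-learn.
assert np.isclose(manual, log_loss(y_true, proba))
print(round(manual, 4))
```

Confident wrong predictions (such as the last row, which gives 0.7 to the wrong class) are penalized more heavily than hesitant ones, so a lower log loss is better.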


#SVM

Q16) Create and train an SVM model called SVM using the training data (x_train, y_train).

In [29]:
SVM = SVC()
SVM.fit(x_train, y_train)

Q17) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [30]:
predictions = SVM.predict(x_test)

Q18) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [31]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)

Report = pd.DataFrame({
    'Metric': ['Accuracy', 'Jaccard', 'F1'],
    'SVM': [SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score]
})
Report

| | Metric | SVM |
|---|---|---|
| 0 | Accuracy | 0.722137 |
| 1 | Jaccard | 0.0 |
| 2 | F1 | 0.0 |
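
A Jaccard index and F1-score of exactly 0 mean the model produced no true positives. One reading consistent with these numbers, though not confirmed by the source, is that the SVM predicted "no rain" for every test row, in which case the accuracy equals the share of dry days in the test set. The toy example below (hypothetical labels) reproduces that pattern.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Toy labels for illustration only: 7 dry days, 3 rainy days.
y_true = np.array([0] * 7 + [1] * 3)
# A model that always predicts the majority class (no rain).
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)            # 7 of 10 correct
jac = jaccard_score(y_true, y_pred)             # no true positives
f1 = f1_score(y_true, y_pred, zero_division=0)  # no true positives
print(acc, jac, f1)
```

Accuracy alone can look acceptable on an imbalanced target, which is why the report also includes the Jaccard index and F1-score.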


#Final Report

Q19) Show the Accuracy, Jaccard Index, F1-Score, and LogLoss in a tabular format using a data frame for all of the above models.

Note: LogLoss is only calculated for the Logistic Regression model.

In [32]:
Report = pd.DataFrame({
    'Model': ['K-Nearest Neighbors', 'Decision Tree', 'Logistic Regression', 'Support Vector Machine'],
    'Accuracy': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    'F1 Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    'Log Loss': [None, None, LR_Log_Loss, None]
})
Report

| | Model | Accuracy | Jaccard Index | F1 Score | Log Loss |
|---|---|---|---|---|---|
| 0 | K-Nearest Neighbors | 0.818321 | 0.425121 | 0.59661 | |
| 1 | Decision Tree | 0.754198 | 0.399254 | 0.570667 | |
| 2 | Logistic Regression | 0.836641 | 0.509174 | 0.674772 | 0.381259 |
| 3 | Support Vector Machine | 0.722137 | 0.0 | 0.0 | |
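
To read the report programmatically, the table can be ranked by a metric. The sketch below rebuilds the frame from the values shown in the table above (the name `report` and the choice of F1 as the ranking metric are assumptions for illustration) and finds the best model per metric.

```python
import numpy as np
import pandas as pd

# Values copied from the final report table above.
report = pd.DataFrame({
    "Model": ["K-Nearest Neighbors", "Decision Tree",
              "Logistic Regression", "Support Vector Machine"],
    "Accuracy": [0.818321, 0.754198, 0.836641, 0.722137],
    "Jaccard Index": [0.425121, 0.399254, 0.509174, 0.0],
    "F1 Score": [0.59661, 0.570667, 0.674772, 0.0],
    "Log Loss": [np.nan, np.nan, 0.381259, np.nan],
}).set_index("Model")

# Rank the models by F1 score, best first.
ranked = report.sort_values("F1 Score", ascending=False)

# For the metrics where higher is better, find the top model per metric.
best_per_metric = report[["Accuracy", "Jaccard Index", "F1 Score"]].idxmax()
print(best_per_metric)
```

Logistic Regression leads on all three metrics. The KNN and Decision Tree scores come from the split with random_state set to 10, while the Logistic Regression and SVM scores come from the split with random_state set to 1, so the comparison is indicative rather than exact.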
