# Overview of the process


We are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using various evaluation metrics.

We will use these algorithms:

1.  Linear Regression
2.  Decision Trees
3.  Logistic Regression
4.  SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

Finally, we will use models to generate the report displaying the accuracy scores.


## **Importing the required libraries**


In [125]:
import pandas as pd
import numpy as np
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score
import sklearn.metrics as metrics

### Importing the <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv">Weather Dataset</a>


In [126]:
data_frame = pd.read_csv('Weather_Data.csv')

data_frame.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


Describing the data

In [127]:
data_frame.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
count,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0,3271.0
mean,14.877102,23.005564,3.342158,5.175787,7.16897,41.476307,15.077041,19.294405,68.243962,54.698563,1018.334424,1016.003085,4.318557,4.176093,17.821461,21.543656
std,4.55471,4.483752,9.917746,2.757684,3.815966,10.806951,7.043825,7.453331,15.086127,16.279241,7.02009,7.019915,2.526923,2.411274,4.894316,4.297053
min,4.3,11.7,0.0,0.0,0.0,17.0,0.0,0.0,19.0,10.0,986.7,989.8,0.0,0.0,6.4,10.2
25%,11.0,19.6,0.0,3.2,4.25,35.0,11.0,15.0,58.0,44.0,1013.7,1011.3,2.0,2.0,13.8,18.4
50%,14.9,22.8,0.0,4.8,8.3,41.0,15.0,19.0,69.0,56.0,1018.6,1016.3,5.0,4.0,18.2,21.3
75%,18.8,26.0,1.4,7.0,10.2,44.0,20.0,24.0,80.0,64.0,1023.1,1020.8,7.0,7.0,21.7,24.5
max,27.6,45.8,119.4,18.4,13.6,96.0,54.0,57.0,100.0,99.0,1039.0,1036.7,9.0,8.0,36.5,44.7


### Data Preprocessing


#### Transforming Categorical Variables


First, we need to convert categorical variables to binary variables. We will use pandas `get_dummies()` method for this.


In [128]:
df_proc = pd.get_dummies(data=data_frame, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
df_proc

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,41,17,20,92,...,False,False,False,False,False,True,False,False,False,False
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,41,9,13,83,...,False,False,False,False,False,False,False,False,False,False
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,41,17,2,88,...,False,False,False,False,False,False,False,False,False,False
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,41,22,20,83,...,False,False,False,False,False,False,False,False,False,False
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,41,11,6,88,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3266,6/21/2017,8.6,19.6,0.0,2.0,7.8,37,22,20,73,...,False,False,False,False,True,False,False,False,False,False
3267,6/22/2017,9.3,19.2,0.0,2.0,9.2,30,20,7,78,...,False,False,False,False,False,False,False,False,False,False
3268,6/23/2017,9.4,17.7,0.0,2.4,2.7,24,15,13,85,...,False,False,False,False,False,False,False,False,False,False
3269,6/24/2017,10.1,19.3,0.0,1.4,9.3,43,17,19,56,...,False,False,False,False,False,False,False,True,False,False


Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.


In [129]:
df_proc.replace(['No', 'Yes'], [0,1], inplace=True)
df_proc[['Date','RainTomorrow']]

Unnamed: 0,Date,RainTomorrow
0,2/1/2008,1
1,2/2/2008,1
2,2/3/2008,1
3,2/4/2008,1
4,2/5/2008,1
...,...,...
3266,6/21/2017,0
3267,6/22/2017,0
3268,6/23/2017,0
3269,6/24/2017,0


## Training Data and Test Data


Now, we set our 'features' or X values and our Y or target variable.


In [130]:
df_proc.drop('Date',axis=1,inplace=True)

In [131]:
df_proc = df_proc.astype(float)

We need to predict if the rain will fall or not. So we choose the RainTomorrow column as the target.

In [132]:
X = df_proc.drop(columns='RainTomorrow', axis=1)
Y = df_proc['RainTomorrow']

## Linear Regression


### Used the `train_test_split` function to split the `X` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [133]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=10)

#### Created and trained a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [134]:
from sklearn.linear_model import LinearRegression
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)

#### Now  we use the `predict` method on the testing data (`x_test`) and save it to the array `y_predict`.


In [135]:
y_predict = LinearReg.predict(x_test)


####  we use the `y_predict` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [136]:
LinearRegression_MAE = metrics.mean_absolute_error(y_predict, y_test)
LinearRegression_MSE = metrics.mean_squared_error(y_predict, y_test)
LinearRegression_R2 = metrics.r2_score(y_predict, y_test)
print('Linear Regression MAE: ', LinearRegression_MAE)
print('Linear Regression MSE: ', LinearRegression_MSE)
print('Linear Regression R2: ', LinearRegression_R2)

Linear Regression MAE:  0.2563149721567867
Linear Regression MSE:  0.11571467424003506
Linear Regression R2:  -0.3851045332592635


#### Representing the MAE, MSE, and R2 in a tabular format using data frame for the linear model.


In [137]:
Report = {'MSE': [LinearRegression_MSE], 'MAE': [LinearRegression_MAE], 'R2': [LinearRegression_R2]}
Report=pd.DataFrame(Report)
Report.index = ['Linear Regression']
Report

Unnamed: 0,MSE,MAE,R2
Linear Regression,0.115715,0.256315,-0.385105


### Decision Tree


In [138]:
from sklearn.tree import DecisionTreeClassifier
Tree = DecisionTreeClassifier(criterion="entropy", max_depth = 8)
Tree.fit(x_train,y_train)

#### we use the `predict` method on the testing data (`x_test`) and save it to the array `y_predictions`.


In [139]:
y_predict = Tree.predict(x_test)

#### Using the `Y_predict` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [140]:
Tree_Accuracy_Score = accuracy_score(y_test, y_predict)
Tree_JaccardIndex = jaccard_score(y_test, y_predict, pos_label=0)
Tree_F1_Score = f1_score(y_test, y_predict)
print('Tree Accuracy Score: ', Tree_Accuracy_Score)
print('Tree Jaccard Index: ', Tree_JaccardIndex)
print('Tree F1 Score: ', Tree_F1_Score)

Tree Accuracy Score:  0.7938931297709924
Tree Jaccard Index:  0.7589285714285714
Tree F1 Score:  0.5846153846153846


### Logistic Regression


#### Use the `train_test_split` function to split the `X` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.


In [141]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

In [142]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(solver='liblinear')
LR.fit(x_train,y_train)

#### We used the `predict` method on the testing data (`x_test`) and save it to the array `y_predict`.


In [143]:
y_predict = LR.predict(x_test)


#### Using the `y_predict` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [144]:
LR_Accuracy_Score = accuracy_score(y_test, y_predict)
LR_JaccardIndex = jaccard_score(y_test, y_predict, pos_label=0)
LR_F1_Score = f1_score(y_test, y_predict)
LR_Log_Loss = log_loss(y_test, LR.predict_proba(x_test))
print('LR Accuracy Score: ', LR_Accuracy_Score)
print('LR Jaccard Index: ', LR_JaccardIndex)
print('LR F1 Score: ', LR_F1_Score)
print('LR Log Loss: ', LR_Log_Loss)

LR Accuracy Score:  0.8366412213740458
LR Jaccard Index:  0.8033088235294118
LR F1 Score:  0.6747720364741641
LR Log Loss:  0.3804510672347215


### SVM


In [145]:
from sklearn import svm
SVM = svm.SVC(kernel='linear')
SVM.fit(x_train, y_train)

#### Using the `predict` method on the testing data (`x_test`) and save it to the array `Y_predict`.


In [146]:
y_predict = SVM.predict(x_test)


 #### Using the `y_predict` and the `y_test` dataframe calculating the value for each metric using the appropriate function.


In [147]:
SVM_Accuracy_Score = accuracy_score(y_test, y_predict)
SVM_JaccardIndex = jaccard_score(y_test, y_predict, pos_label=0)
SVM_F1_Score = f1_score(y_test, y_predict)
print('SVM Accuracy Score: ', SVM_Accuracy_Score)
print('SVM Jaccard Index: ', SVM_JaccardIndex)
print('SVM F1 Score: ', SVM_F1_Score)

SVM Accuracy Score:  0.8458015267175573
SVM Jaccard Index:  0.8126159554730983
SVM F1 Score:  0.6966966966966968


## Report


#### Showing the Accuracy,Jaccard Index, F1-Score and LogLoss in a tabular format using data frame for all of the above models.

\*LogLoss is only for Logistic Regression Model


In [148]:
Report = {'Classification Algorithm': ['Decision Tree',  'LogisticRegression','SVM'],
          'Accuracy Score': [Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
          'Jaccard Score': [Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex,],
          'F1-score': [Tree_F1_Score, LR_F1_Score, SVM_F1_Score], 
          'LogLoss': ['N/A', LR_Log_Loss, 'N/A']}
Report = pd.DataFrame(Report)
Report

Unnamed: 0,Classification Algorithm,Accuracy Score,Jaccard Score,F1-score,LogLoss
0,Decision Tree,0.793893,0.758929,0.584615,
1,LogisticRegression,0.836641,0.803309,0.674772,0.380451
2,SVM,0.845802,0.812616,0.696697,


# Observation
* We can see that SVM is the best model for predicting rain with almost 80% accuracy