# Classification with Python

## About the dataset

The original source of the data is the Australian Government's Bureau of Meteorology; the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/).

The dataset used here has extra columns such as 'RainToday', plus our target 'RainTomorrow'. It was gathered from the Rattle package at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData).

This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | float  |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml).

In [1]:
#suppress warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [2]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score, f1_score, log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing Dataset

In [3]:
path = 'Weather_Data.csv'
df = pd.read_csv(path)
df.head()

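|     | Date     | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| --- | -------- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0   | 2/1/2008 | 19.5    | 22.4    | 15.6     | 6.2         | 0.0      | W           | 41            | S          | SSW        | ... | 92          | 84          | 1017.6      | 1017.4      | 8        | 8        | 20.7    | 20.9    | Yes       | Yes          |
| 1   | 2/2/2008 | 19.5    | 25.6    | 6.0      | 3.4         | 2.7      | W           | 41            | W          | E          | ... | 83          | 73          | 1017.9      | 1016.4      | 7        | 7        | 22.4    | 24.8    | Yes       | Yes          |
| 2   | 2/3/2008 | 21.6    | 24.5    | 6.6      | 2.4         | 0.1      | W           | 41            | ESE        | ESE        | ... | 88          | 86          | 1016.7      | 1015.6      | 7        | 8        | 23.5    | 23.0    | Yes       | Yes          |
| 3   | 2/4/2008 | 20.2    | 22.8    | 18.8     | 2.2         | 0.0      | W           | 41            | NNE        | E          | ... | 83          | 90          | 1014.2      | 1011.8      | 8        | 8        | 21.4    | 20.9    | Yes       | Yes          |
| 4   | 2/5/2008 | 19.7    | 25.7    | 77.4     | 4.8         | 0.0      | W           | 41            | NNE        | W          | ... | 88          | 74          | 1008.3      | 1004.8      | 8        | 8        | 22.5    | 25.5    | Yes       | Yes          |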


### Data PreProcessing

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           3271 non-null   object 
 1   MinTemp        3271 non-null   float64
 2   MaxTemp        3271 non-null   float64
 3   Rainfall       3271 non-null   float64
 4   Evaporation    3271 non-null   float64
 5   Sunshine       3271 non-null   float64
 6   WindGustDir    3271 non-null   object 
 7   WindGustSpeed  3271 non-null   int64  
 8   WindDir9am     3271 non-null   object 
 9   WindDir3pm     3271 non-null   object 
 10  WindSpeed9am   3271 non-null   int64  
 11  WindSpeed3pm   3271 non-null   int64  
 12  Humidity9am    3271 non-null   int64  
 13  Humidity3pm    3271 non-null   int64  
 14  Pressure9am    3271 non-null   float64
 15  Pressure3pm    3271 non-null   float64
 16  Cloud9am       3271 non-null   int64  
 17  Cloud3pm       3271 non-null   int64  
 18  Temp9am 

### One Hot Encoding

First, we perform one-hot encoding to convert the categorical variables into binary indicator variables.

In [5]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am','WindDir3pm'])
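
To make the effect of `get_dummies` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "No"],
    "WindGustDir": ["W", "E", "W"],
    "MinTemp": [19.5, 21.6, 20.2],
})

encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])

# Each category becomes its own indicator column named <column>_<category>;
# columns not listed in `columns` (here MinTemp) pass through unchanged.
print(list(encoded.columns))
```

Each original categorical column is replaced by one indicator column per distinct value, which is why the processed frame ends up much wider than the raw one.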

Next, we convert the 'RainTomorrow' column from categorical values ('No'/'Yes') to binary values (0/1). We do not use the `get_dummies` method here because it would produce two columns for `RainTomorrow`, and we want a single column for our target variable.

In [6]:
df_sydney_processed.replace(['No', 'Yes'], [0, 1], inplace=True)
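
Because `RainToday` has already been one-hot encoded at this point, the frame-wide `replace` above should only end up touching `RainTomorrow`. A column-scoped alternative, sketched here on made-up values, makes that intent explicit:

```python
import pandas as pd

# Hypothetical stand-in for the target column
toy = pd.DataFrame({"RainTomorrow": ["Yes", "No", "Yes"]})

# Map each label to its binary code, touching only this one column
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})
print(toy["RainTomorrow"].tolist())
```

One design difference to keep in mind: `map` turns any label missing from the dictionary into `NaN`, whereas `replace` leaves unmatched values as they are.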

### Train/Test Split Data

In [7]:
#setting our features
df_sydney_processed.drop('Date', axis=1, inplace=True)

In [8]:
df_sydney_processed = df_sydney_processed.astype('float')

In [9]:
features = df_sydney_processed.drop(columns=['RainTomorrow'])
Y = df_sydney_processed['RainTomorrow']

### Linear Regression

#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)
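
As a quick sanity check of what `train_test_split` does, the sketch below uses a made-up 10-row array rather than the weather data. It illustrates three things: `test_size=0.2` holds out 2 of the 10 rows, rows of `X` and `y` stay paired after shuffling, and reusing the same `random_state` reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features; row i is [2*i, 2*i + 1] and its label is i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)
print(X_tr.shape, X_te.shape)

# Splitting again with the same seed yields the same held-out rows
_, X_te_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)
print(np.array_equal(X_te, X_te_again))
```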

#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`X_train`, `y_train`).

In [11]:
LinearReg = LinearRegression()
LinearReg.fit(X_train, y_train)

LinearRegression()

#### Q3) Now use the `predict` method on the testing data (`X_test`) and save it to the array `predictions`.

In [12]:
predictions = LinearReg.predict(X_test)

#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.

In [13]:
LinearRegression_MAE = metrics.mean_absolute_error(y_test, predictions)
LinearRegression_MSE = metrics.mean_squared_error(y_test, predictions)
LinearRegression_R2 = metrics.r2_score(y_test, predictions)
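
Because `RainTomorrow` is coded as 0/1, the linear model's continuous outputs are being scored against binary labels. The following sketch, on made-up numbers, recomputes the three metrics by hand with NumPy to show what each one measures.

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0, 1.0])   # hypothetical 0/1 labels
y_pred = np.array([0.1, 0.8, 0.3, 0.6])   # hypothetical continuous predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(mae, mse, r2)
```

For these numbers the results are approximately 0.25, 0.075 and 0.7. These follow the standard definitions that `mean_absolute_error`, `mean_squared_error` and `r2_score` implement.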

#### Q5) Show the MAE, MSE, and R2 in a tabular format using a data frame for the linear model.

In [14]:
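# Named metric_names rather than metrics, which would shadow the imported sklearn.metrics module
metric_names = ['MAE', 'MSE', 'R2']
metrics_val = [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]

Report = {"Metrics": metric_names,
          "Result": metrics_val
         }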

pd.DataFrame(Report)

|     | Metrics | Result   |
| --- | ------- | -------- |
| 0   | MAE     | 0.256316 |
| 1   | MSE     | 0.115723 |
| 2   | R2      | 0.427121 |


### K-Nearest-Neighbors (KNN)

#### Q6) Create and train a KNN model called KNN using the training data (`X_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [15]:
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=4)

#### Q7) Now use the `predict` method on the testing data (`X_test`) and save it to the array `predictions`.


In [16]:
predictions = KNN.predict(X_test)

#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.

In [17]:
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions, pos_label=0)
KNN_F1_Score = f1_score(y_test, predictions, average='weighted')
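
The arguments `pos_label=0` and `average='weighted'` change what is being measured, so a small worked example may help. The labels below are made up. With `pos_label=0`, the Jaccard index is computed for the "no rain" class, and the weighted F1 averages the per-class F1 scores in proportion to how often each class occurs in the true labels.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

y_true = [0, 0, 0, 1, 1]   # hypothetical true labels
y_pred = [0, 0, 1, 1, 0]   # hypothetical predictions

acc = accuracy_score(y_true, y_pred)

# Class 0 as the positive class: TP=2, FP=1, FN=1, so 2 / (2 + 1 + 1)
jac0 = jaccard_score(y_true, y_pred, pos_label=0)

# Per-class F1 is 2/3 for class 0 and 1/2 for class 1;
# weighting by class support gives (3 * 2/3 + 2 * 1/2) / 5
f1w = f1_score(y_true, y_pred, average='weighted')

print(acc, jac0, f1w)
```

Here the accuracy is 0.6, the Jaccard index for class 0 is 0.5, and the weighted F1 is 0.6.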

In [18]:
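# Named metric_names rather than metrics, which would shadow the imported sklearn.metrics module
metric_names = ['AccuracyScore', 'Jaccard', 'F1']
metrics_val = [KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score]

Report = {"Metrics": metric_names,
          "Result": metrics_val
         }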

pd.DataFrame(Report)

|     | Metrics       | Result   |
| --- | ------------- | -------- |
| 0   | AccuracyScore | 0.818321 |
| 1   | Jaccard       | 0.790123 |
| 2   | F1            | 0.802375 |


### Decision Tree

#### Q9) Create and train a Decision Tree model called Tree using the training data (`X_train`, `y_train`).

In [19]:
Tree = DecisionTreeClassifier(criterion='entropy', max_depth=4)
Tree.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

#### Q10) Now use the `predict` method on the testing data (`X_test`) and save it to the array `predictions`.

In [20]:
predictions = Tree.predict(X_test)

#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.

In [21]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions, pos_label=0)
Tree_F1_Score = f1_score(y_test, predictions, average='weighted')

In [22]:
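# Named metric_names rather than metrics, which would shadow the imported sklearn.metrics module
metric_names = ['AccuracyScore', 'Jaccard', 'F1']
metrics_val = [Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score]

Report = {"Metrics": metric_names,
          "Result": metrics_val
         }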

pd.DataFrame(Report)

|     | Metrics       | Result   |
| --- | ------------- | -------- |
| 0   | AccuracyScore | 0.818321 |
| 1   | Jaccard       | 0.781651 |
| 2   | F1            | 0.813263 |


### Logistic Regression

#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.

In [23]:
X_train, X_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)

#### Q13) Create and train a LogisticRegression model called LR using the training data (`X_train`, `y_train`) with the `solver` parameter set to `liblinear`.

In [24]:
LR = LogisticRegression(C=0.01, solver='liblinear')
LR.fit(X_train, y_train)

LogisticRegression(C=0.01, solver='liblinear')

#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`X_test`) and save the results as two arrays, `predictions` and `predict_proba`.

In [25]:
predictions = LR.predict(X_test)

In [26]:
predict_proba = LR.predict_proba(X_test)

#### Q15) Using the `predictions`, `predict_proba` and the `y_test` dataframe calculate the value for each metric using the appropriate function.

In [27]:
#Predictions set
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions, pos_label=0)
LR_F1_Score = f1_score(y_test, predictions, average='weighted')

In [28]:
#Predictions_Proba set
LR_Log_Loss_proba = log_loss(y_test, predict_proba)
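
Unlike the other metrics, log loss is computed from the predicted probabilities rather than the hard labels, which is why `predict_proba` is needed. As a rough illustration on made-up probabilities, the sketch below takes the probability assigned to the true class of each row, averages the negative logarithms, and compares the result with `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1])            # hypothetical true labels
proba = np.array([[0.9, 0.1],           # hypothetical [P(class 0), P(class 1)] per row
                  [0.2, 0.8],
                  [0.6, 0.4]])

# Probability the model assigned to the class that actually occurred
p_true = proba[np.arange(len(y_true)), y_true]

manual = -np.mean(np.log(p_true))       # average negative log-likelihood
library = log_loss(y_true, proba)

print(manual, library)
```

Both values come out at roughly 0.415. The third row, where the model gave the true class only 0.4, contributes most of the loss, which shows how confident wrong predictions are penalized.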

In [29]:
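# Named metric_names rather than metrics, which would shadow the imported sklearn.metrics module
metric_names = ['AccuracyScore', 'Jaccard', 'F1', 'LogLoss']
metrics_val = [LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss_proba]

Report = {"Metrics": metric_names,
          "Result": metrics_val
         }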

pd.DataFrame(Report)

|     | Metrics       | Result   |
| --- | ------------- | -------- |
| 0   | AccuracyScore | 0.827481 |
| 1   | Jaccard       | 0.794171 |
| 2   | F1            | 0.820545 |
| 3   | LogLoss       | 0.380085 |


### SVM

#### Q16) Create and train an SVM model called SVM using the training data (`X_train`, `y_train`).

In [30]:
SVM = svm.SVC(kernel='rbf')
# Fit on the training split; fitting on the test split would leak the evaluation data
SVM.fit(X_train, y_train)

SVC()

#### Q17) Now use the `predict` method on the testing data (`X_test`) and save it to the array `predictions`.

In [31]:
predictions = SVM.predict(X_test)

#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.

In [32]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions, pos_label=0)
SVM_F1_Score = f1_score(y_test, predictions, average='weighted')

In [33]:
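# Named metric_names rather than metrics, which would shadow the imported sklearn.metrics module
metric_names = ['AccuracyScore', 'Jaccard', 'F1']
metrics_val = [SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score]

Report = {"Metrics": metric_names,
          "Result": metrics_val
         }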

pd.DataFrame(Report)

|     | Metrics       | Result   |
| --- | ------------- | -------- |
| 0   | AccuracyScore | 0.722137 |
| 1   | Jaccard       | 0.722137 |
| 2   | F1            | 0.605622 |
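
The SVM accuracy and Jaccard values above are identical (0.722137). That pattern is consistent with, though on its own does not prove, a model that predicts class 0 for every row. The sketch below uses made-up labels to show why the two metrics coincide in that situation, and how `confusion_matrix` exposes it.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, jaccard_score

y_true = [0, 0, 0, 1]   # hypothetical true labels
y_pred = [0, 0, 0, 0]   # a classifier that always predicts class 0

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
jac0 = jaccard_score(y_true, y_pred, pos_label=0)

print(cm)
print(acc, jac0)
```

The second column of the confusion matrix is all zeros because class 1 is never predicted, and both metrics reduce to the share of class 0 in the true labels (0.75 here). Running `confusion_matrix(y_test, predictions)` on the notebook's own SVM output would be one way to check whether this is what happened.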


### Report

#### Q19) Show the Accuracy, Jaccard Index, F1-Score and LogLoss in a tabular format using a data frame for all of the above models.

In [35]:
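# Named metric_names rather than metrics, which would shadow the imported sklearn.metrics module
metric_names = ['Accuracy Score', 'Jaccard', 'F1', 'LogLoss']
Report = {
    "Metrics": metric_names,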
    "KNN": [KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score, np.nan],
    "Decision Tree": [Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score, np.nan],
    "Logistic": [LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss_proba],
    "SVM": [SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score, np.nan]
}

print("Report of different Classification methods on accuracy of RainingTomorrow")
pd.DataFrame(Report)

Report of different Classification methods on accuracy of RainingTomorrow


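|     | Metrics        | KNN      | Decision Tree | Logistic | SVM      |
| --- | -------------- | -------- | ------------- | -------- | -------- |
| 0   | Accuracy Score | 0.818321 | 0.818321      | 0.827481 | 0.722137 |
| 1   | Jaccard        | 0.790123 | 0.781651      | 0.794171 | 0.722137 |
| 2   | F1             | 0.802375 | 0.813263      | 0.820545 | 0.605622 |
| 3   | LogLoss        | NaN      | NaN           | 0.380085 | NaN      |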
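
The same three metric calls are repeated for every classifier above. One possible way to cut that repetition is sketched below. `score_classifier` is a hypothetical helper, not part of the notebook or of scikit-learn, and the labels and model names are made up to stand in for `y_test` and each model's predictions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

def score_classifier(y_true, y_pred):
    """Hypothetical helper: the three label-based metrics used in this notebook."""
    return {
        "Accuracy Score": accuracy_score(y_true, y_pred),
        "Jaccard": jaccard_score(y_true, y_pred, pos_label=0),
        "F1": f1_score(y_true, y_pred, average="weighted"),
    }

# Made-up labels and predictions standing in for y_test and each model's output
y_true = [0, 0, 0, 1, 1]
preds = {
    "Model A": [0, 0, 1, 1, 0],
    "Model B": [0, 0, 0, 1, 1],
}

# One column per model, one row per metric
report = pd.DataFrame({name: score_classifier(y_true, p) for name, p in preds.items()})
print(report)
```

Keeping the metric definitions in one place also guarantees that every model is scored with the same `pos_label` and `average` settings.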
