<img src="./images/dsi_13_sg_shaun_project_4_banner.jpg" width=1000>

# Project 4: West Nile Virus Prediction (Modelling)
**<font color = blue> Shaun Chua 
<br> (DSI-13) </font>**

---

# Table of Contents: <a id="top"></a>
[**1. Importing Libraries**](#1)
<br> [**2. Importing Datasets**](#2)
<br> [**3. Preprocessing: `train_df`**](#3)
<br> [**4. Baseline Accuracy**](#4)
<br> [**5. Modelling**](#5)
<br> &emsp; [5.1 Decision Tree Classifier](#5.1)
<br> &emsp; [5.2 Random Forest Classifier](#5.2)
<br> &emsp; [5.3 AdaBoost Classifier](#5.3)
<br> &emsp; [5.4 Gradient Boosting Classifier](#5.4)
<br> &emsp; [5.5 Summary of Classification Metrics](#5.5)
<br> &emsp; [5.6 Chosen Model: AdaBoost Classifier](#5.6)
<br> &emsp; &emsp; [5.6.1 Cleaning test_df Columns](#5.6.1)
<br> &emsp; &emsp; [5.6.2 Prediction using Optimised AdaBoost Classifier](#5.6.2)
<br> [**6. Kaggle Submission**](#6)
<br> &emsp; [6.1 Preparing Dataset](#6.1)
<br> &emsp; [6.2 Exporting Kaggle Submission File](#6.2)
<br> &emsp; [6.3 Kaggle Submission Score](#6.3)

# 1. Importing Libraries <a id="1"></a>

In [1]:
import time
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV 

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier,RandomForestClassifier
from sklearn.svm import SVC

from sklearn.decomposition import PCA

from sklearn.metrics import confusion_matrix, roc_auc_score

from matplotlib.pylab import rcParams

from imblearn.combine import SMOTETomek

class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

Using TensorFlow backend.


In [2]:
# Starting timer for notebook 

t0 = time.time()

# 2. Importing Datasets <a id="2"></a>

In [3]:
train_df = pd.read_csv("./datasets/train_cleaned_final.csv")
test_df = pd.read_csv("./datasets/test_cleaned_final.csv")

In [4]:
display(train_df.shape, test_df.shape)

(8523, 30)

(116293, 28)

# 3. Preprocessing: `train_df` <a id="3"></a>

##### Defining features and target 

In [5]:
X = train_df.drop(columns=["station","wnvpresent","trap","date","Unnamed: 0","nummosquitos","tot_mos_species"])
y = train_df["wnvpresent"]

In [6]:
y.value_counts()

0    8066
1     457
Name: wnvpresent, dtype: int64

## 3.1 Using SMOTETomek to handle Imbalanced Classes

In [7]:
# https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.combine.SMOTETomek.html 

smt = SMOTETomek(sampling_strategy="auto",
                 random_state=42)

# fit_resample is the current name; fit_sample was a deprecated alias
X_smt, y_smt = smt.fit_resample(X, y)

In [8]:
# Shaun: Cool, so SMOTE helps me balance it to 50-50

display(X_smt.shape, y_smt.shape)

(15722, 23)

(15722,)
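As a rough sanity check on the resampled shape, the arithmetic below assumes that SMOTE first oversamples the minority class up to the majority count and that the Tomek step then drops both members of each Tomek link (the imbalanced-learn defaults, as far as I understand them). The variable names are illustrative and not part of the notebook.

```python
# Class counts reported by y.value_counts() above.
n_negative, n_positive = 8066, 457

# Assumption: SMOTE with sampling_strategy="auto" synthesises minority rows
# until both classes have the majority count.
rows_after_smote = 2 * n_negative

# Row count reported by X_smt.shape above.
rows_after_smotetomek = 15722

# Assumption: every Tomek link removes one row from each class.
rows_removed = rows_after_smote - rows_after_smotetomek
tomek_links = rows_removed // 2
rows_per_class = rows_after_smotetomek // 2

print(f"synthetic positives added: {n_negative - n_positive}")
print(f"rows removed by Tomek links: {rows_removed} ({tomek_links} links)")
print(f"rows per class after resampling: {rows_per_class}")
```

Under those assumptions the 15722 rows split evenly between the two classes, which is what a 50-50 balance requires.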

# 4. Baseline Accuracy <a id="4"></a>

In [9]:
baseline_score = y_smt.value_counts(normalize=True)
baseline_score

1    0.5
0    0.5
Name: wnvpresent, dtype: float64
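For context, the majority-class baseline on the original, unresampled labels is far higher than 0.5. A small sketch using the counts from `y.value_counts()` earlier (the helper `majority_baseline` is illustrative; the resampled counts assume the even split reported above):

```python
def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())

# Counts from y.value_counts() before resampling.
counts_original = {0: 8066, 1: 457}

# Assumption: 15722 resampled rows split evenly, as the 0.5 / 0.5 output shows.
counts_resampled = {0: 7861, 1: 7861}

baseline_original = majority_baseline(counts_original)
baseline_resampled = majority_baseline(counts_resampled)

print(f"original baseline:  {baseline_original:.4f}")
print(f"resampled baseline: {baseline_resampled:.4f}")
```

So the accuracies in this notebook should be read against 0.5 on the resampled data, not against the roughly 0.946 that a trivial classifier would reach on the raw labels.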

##### Storing Baseline Score in Series, to merge with metrics Dataframe later

In [10]:
baseline_score_series = pd.Series(data=[baseline_score[1], 
                                        "NaN", 
                                        "NaN", 
                                        "NaN"],
                                  
                                  index=["Accuracy", 
                                         "NaN", 
                                         "NaN", 
                                         "NaN"],
                                  
                                  name="Baseline")

baseline_score_series

Accuracy    0.5
NaN         NaN
NaN         NaN
NaN         NaN
Name: Baseline, dtype: object

In [11]:
# Shaun: Using the correct lingo here, "train" and "validation", because the test set is a different thing

X_train, X_val, y_train, y_val = train_test_split(X_smt, y_smt, random_state=42, stratify=y_smt) 
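`train_test_split` is called without `test_size`, so scikit-learn falls back to a 25% hold-out. A quick sketch of the resulting sizes, assuming the usual rounding (validation size rounded up, remainder to train):

```python
import math

n_rows = 15722          # rows in X_smt, from the shape above
test_fraction = 0.25    # scikit-learn default when test_size is not given

n_val = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_val

print(f"train rows: {n_train}, validation rows: {n_val}")
```

With `stratify=y_smt`, both parts keep the 50-50 class balance as closely as an odd row count allows.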

# 5. Modelling <a id="5"></a>

## 5.1 Decision Tree Classifier <a id="5.1"></a>

In [12]:
# Instantiating the classifier
dt = DecisionTreeClassifier() 

##### Hyperparameter Tuning

In [13]:
# Finding best parameter for classifier

dt_optimised = GridSearchCV(estimator=DecisionTreeClassifier(),
                            param_grid = {'max_depth':[3, 6, 8],
                                          'min_samples_split' :[3, 6, 9, 12],
                                          'min_samples_leaf':[2, 4, 6, 8]},
                            cv=3)
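To get a feel for how much work this search does, the sketch below enumerates the same grid with `itertools.product`. It only counts candidates; it does not fit anything, and the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "max_depth": [3, 6, 8],
    "min_samples_split": [3, 6, 9, 12],
    "min_samples_leaf": [2, 4, 6, 8],
}
cv_folds = 3

# Every combination of one value per hyperparameter.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {n_fits} fits")
print("first candidate:", candidates[0])
```

`GridSearchCV` also refits the best candidate once on the full training set by default, so the total is one fit more than the count printed here.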

In [14]:
dt_optimised.fit(X_train, y_train)

GridSearchCV(cv=3, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': [3, 6, 8],
                         'min

##### Summary of Hyperparameter Tuning

In [15]:
# Best mean cross-validated accuracy over the above param_grid
dt_optimised_accuracy = dt_optimised.best_score_

# Best parameters that returned that score
dt_optimised_best_params = dt_optimised.best_params_

# Predicting y using X_val (for confusion matrix later)
dt_optimised_ypred = dt_optimised.predict(X_val)

# ROC AUC Score (computed from the hard class predictions)
dt_optimised_roc_score = roc_auc_score(y_val, dt_optimised_ypred)

display(dt_optimised_best_params)
display(dt_optimised_ypred)

{'max_depth': 8, 'min_samples_leaf': 4, 'min_samples_split': 9}

array([0, 0, 0, ..., 1, 0, 1], dtype=int64)

##### Classification Metrics for Decision Tree Classifier

In [16]:
cm_dt_optimised = confusion_matrix(y_val, dt_optimised_ypred)
tn_dt_optimised, fp_dt_optimised, fn_dt_optimised, tp_dt_optimised = cm_dt_optimised.ravel()

dt_optimised_sensitivity = tp_dt_optimised/(tp_dt_optimised+fn_dt_optimised)
dt_optimised_specificity = tn_dt_optimised/(tn_dt_optimised+fp_dt_optimised)

print("The Accuracy Score for the", color.BOLD + "Optimised Decision Tree" + color.END, f"is {dt_optimised_accuracy}")
print("The Sensitivity Score for the", color.BOLD + "Optimised Decision Tree" + color.END, f"is {dt_optimised_sensitivity}")
print("The Specificity Score for the", color.BOLD + "Optimised Decision Tree" + color.END, f"is {dt_optimised_specificity}")
print("The ROC AUC Score for the", color.BOLD + "Optimised Decision Tree" + color.END, f"is {dt_optimised_roc_score}")

The Accuracy Score for the Optimised Decision Tree is 0.8513274683800218
The Sensitivity Score for the Optimised Decision Tree is 0.9190839694656489
The Specificity Score for the Optimised Decision Tree is 0.78382502543235
The ROC AUC Score for the Optimised Decision Tree is 0.8514544974489995
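The unpacking `tn, fp, fn, tp = cm.ravel()` relies on the layout of a binary confusion matrix: rows are actual classes, columns are predicted classes, both in the order 0 then 1. The sketch below rebuilds those four counts by hand on toy labels (purely illustrative, not taken from the notebook) and derives sensitivity and specificity from them. `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Toy labels, purely illustrative.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)   # share of actual positives that were caught
specificity = tn / (tn + fp)   # share of actual negatives that were cleared

print(tn, fp, fn, tp)
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```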


##### Saving Decision Tree Metrics into a Dataframe

In [17]:
metrics_df = pd.DataFrame(data=[dt_optimised_accuracy, 
                                dt_optimised_sensitivity, 
                                dt_optimised_specificity, 
                                dt_optimised_roc_score],
                          
                          index=["Accuracy", 
                                 "Sensitivity (TPR)", 
                                 "Specificity (TNR)", 
                                 "ROC AUC Score"], 
                          
                           columns=["Optimised Decision Tree"])

metrics_df

Unnamed: 0,Optimised Decision Tree
Accuracy,0.851327
Sensitivity (TPR),0.919084
Specificity (TNR),0.783825
ROC AUC Score,0.851454


## 5.2 Random Forest Classifier <a id="5.2"></a>

In [18]:
# Instantiating the classifier
rfc = RandomForestClassifier()

##### Hyperparameter Tuning

In [19]:
# Finding best parameters for rfc

rfc_optimised = GridSearchCV(estimator = rfc,
                             param_grid ={'n_estimators': [100, 150, 200],
                                          'max_depth': [None, 1, 2, 3, 4, 5],}, 
                             cv=3)

In [20]:
rfc_optimised.fit(X_train, y_train)

GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

##### Summary of Hyperparameter Tuning

In [21]:
# Best mean cross-validated accuracy over the above param_grid
rfc_optimised_accuracy = rfc_optimised.best_score_

# Best parameters that returned that score
rfc_optimised_best_params = rfc_optimised.best_params_

# Predicting y using X_val (for confusion matrix later)
rfc_optimised_ypred = rfc_optimised.predict(X_val)

# ROC AUC Score (computed from the hard class predictions)
rfc_optimised_roc_score = roc_auc_score(y_val, rfc_optimised_ypred)

display(rfc_optimised_best_params)
display(rfc_optimised_ypred)

{'max_depth': None, 'n_estimators': 150}

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

##### Classification Metrics for Random Forest Classifier

In [22]:
cm_rfc_optimised = confusion_matrix(y_val, rfc_optimised_ypred)
tn_rfc_optimised, fp_rfc_optimised, fn_rfc_optimised, tp_rfc_optimised = cm_rfc_optimised.ravel()

rfc_optimised_sensitivity = tp_rfc_optimised/(tp_rfc_optimised+fn_rfc_optimised)
rfc_optimised_specificity = tn_rfc_optimised/(tn_rfc_optimised+fp_rfc_optimised)

print("The Accuracy Score for the", color.BOLD + "Optimised Random Forest Classifier" + color.END, f"is {rfc_optimised_accuracy}")
print("The Sensitivity Score for the", color.BOLD + "Optimised Random Forest Classifier" + color.END, f"is {rfc_optimised_sensitivity}")
print("The Specificity Score for the", color.BOLD + "Optimised Random Forest Classifier" + color.END, f"is {rfc_optimised_specificity}")
print("The ROC AUC Score for the", color.BOLD + "Optimised Random Forest Classifier" + color.END, f"is {rfc_optimised_roc_score}")

The Accuracy Score for the Optimised Random Forest Classifier is 0.9296071611895528
The Sensitivity Score for the Optimised Random Forest Classifier is 0.9669211195928753
The Specificity Score for the Optimised Random Forest Classifier is 0.9211597151576806
The ROC AUC Score for the Optimised Random Forest Classifier is 0.944040417375278
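One detail worth knowing about the cells above: `roc_auc_score` receives the hard 0/1 output of `predict`, not probabilities. With only two distinct score values, the area under the ROC curve reduces to the mean of sensitivity and specificity. The sketch below checks this on toy data (hypothetical values) with a small pairwise AUC helper, `pairwise_auc`, which is illustrative and not part of scikit-learn, and then against the Random Forest figures printed above.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, purely illustrative.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.7, 0.9]
labels = [int(p >= 0.5) for p in proba]   # what predict() would return

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos
tpr = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1) / n_pos
tnr = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0) / n_neg

print(f"AUC from probabilities: {auc_from_proba}")
print(f"AUC from hard labels:   {auc_from_labels}")
print(f"(TPR + TNR) / 2:        {(tpr + tnr) / 2}")

# Same identity with the Random Forest sensitivity and specificity printed above.
rfc_mean = (0.9669211195928753 + 0.9211597151576806) / 2
print(f"Random Forest (TPR + TNR) / 2: {rfc_mean}")
```

Passing `predict_proba(X_val)[:, 1]` to `roc_auc_score` instead would score how well the predicted probabilities rank the cases, which is usually what AUC is meant to measure.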


##### Saving Random Forest Classification Metrics into a Dataframe

In [23]:
optimised_rfc_metrics_df = pd.DataFrame(data=[rfc_optimised_accuracy, 
                                       rfc_optimised_sensitivity, 
                                       rfc_optimised_specificity, 
                                       rfc_optimised_roc_score],
                          
                                 index=["Accuracy", 
                                        "Sensitivity (TPR)", 
                                        "Specificity (TNR)", 
                                        "ROC AUC Score"], 
                          
                                 columns=["Optimised Random Forest Classifier"])

metrics_df = metrics_df.join(optimised_rfc_metrics_df)
metrics_df

Unnamed: 0,Optimised Decision Tree,Optimised Random Forest Classifier
Accuracy,0.851327,0.929607
Sensitivity (TPR),0.919084,0.966921
Specificity (TNR),0.783825,0.92116
ROC AUC Score,0.851454,0.94404


## 5.3 AdaBoost Classifier <a id="5.3"></a>

In [24]:
# Instantiating the classifier (Taken from 6.04)

ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())   

##### Hyperparameter Tuning

In [25]:
# Finding best parameter for model

ada_optimised = GridSearchCV(estimator=ada, 
                             param_grid={'n_estimators':[50,100],
                                         'learning_rate' : [0.9,1],
                                         'base_estimator__max_depth': [1,2,3]},
                             cv=3)

In [26]:
ada_optimised.fit(X_train, y_train)

GridSearchCV(cv=3, error_score=nan,
             estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                          base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                                class_weight=None,
                                                                                criterion='gini',
                                                                                max_depth=None,
                                                                                max_features=None,
                                                                                max_leaf_nodes=None,
                                                                                min_impurity_decrease=0.0,
                                                                                min_impurity_split=None,
                                                                                min_samples_leaf=1,
 

##### Summary of Hyperparameter Tuning

In [27]:
# Best mean cross-validated accuracy over the above param_grid
ada_optimised_accuracy = ada_optimised.best_score_

# Best parameters that returned that score
ada_optimised_best_params = ada_optimised.best_params_

# Predicting y using X_val (for confusion matrix later)
ada_optimised_ypred = ada_optimised.predict(X_val)

# ROC AUC Score (computed from the hard class predictions)
ada_optimised_roc_score = roc_auc_score(y_val, ada_optimised_ypred)

display(ada_optimised_best_params)
display(ada_optimised_ypred)

{'base_estimator__max_depth': 3, 'learning_rate': 1, 'n_estimators': 100}

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

##### Classification Metrics for AdaBoost Classifier

In [28]:
cm_ada_optimised = confusion_matrix(y_val, ada_optimised_ypred)
tn_ada_optimised, fp_ada_optimised, fn_ada_optimised, tp_ada_optimised = cm_ada_optimised.ravel()

ada_optimised_sensitivity = tp_ada_optimised/(tp_ada_optimised+fn_ada_optimised)
ada_optimised_specificity = tn_ada_optimised/(tn_ada_optimised+fp_ada_optimised)

print("The Accuracy Score for the", color.BOLD + "Optimised AdaBoost Classifier" + color.END, f"is {ada_optimised_accuracy}")
print("The Sensitivity Score for the", color.BOLD + "Optimised AdaBoost Classifier" + color.END, f"is {ada_optimised_sensitivity}")
print("The Specificity Score for the", color.BOLD + "Optimised AdaBoost Classifier" + color.END, f"is {ada_optimised_specificity}")
print("The ROC AUC Score for the", color.BOLD + "Optimised AdaBoost Classifier" + color.END, f"is {ada_optimised_roc_score}")

The Accuracy Score for the Optimised AdaBoost Classifier is 0.9408020111123842
The Sensitivity Score for the Optimised AdaBoost Classifier is 0.9450381679389313
The Specificity Score for the Optimised AdaBoost Classifier is 0.9491353001017294
The ROC AUC Score for the Optimised AdaBoost Classifier is 0.9470867340203304


##### Saving AdaBoost Classification Metrics into a Dataframe

In [29]:
optimised_ada_metrics_df = pd.DataFrame(data=[ada_optimised_accuracy, 
                                       ada_optimised_sensitivity, 
                                       ada_optimised_specificity, 
                                       ada_optimised_roc_score],
                          
                                 index=["Accuracy", 
                                        "Sensitivity (TPR)", 
                                        "Specificity (TNR)", 
                                        "ROC AUC Score"], 
                          
                                 columns=["Optimised AdaBoost Classifier"])

metrics_df = metrics_df.join(optimised_ada_metrics_df)
metrics_df

Unnamed: 0,Optimised Decision Tree,Optimised Random Forest Classifier,Optimised AdaBoost Classifier
Accuracy,0.851327,0.929607,0.940802
Sensitivity (TPR),0.919084,0.966921,0.945038
Specificity (TNR),0.783825,0.92116,0.949135
ROC AUC Score,0.851454,0.94404,0.947087


## 5.4 Gradient Boosting Classifier <a id="5.4"></a>

In [30]:
# Instantiating the Classifier

gbc = GradientBoostingClassifier()

##### Hyperparameter Tuning

In [31]:
# Finding best parameter for model

gbc_optimised = GridSearchCV(estimator=gbc, 
                             param_grid={"max_depth": [2,3,4],
                                         "n_estimators": [100, 125, 150],
                                         "learning_rate": [.08, .1, .12]},
                             cv=3)

In [32]:
gbc_optimised.fit(X_train, y_train)

GridSearchCV(cv=3, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
         

##### Summary of Hyperparameter Tuning

In [33]:
# Best mean cross-validated accuracy over the above param_grid
gbc_optimised_accuracy = gbc_optimised.best_score_

# Best parameters that returned that score
gbc_optimised_best_params = gbc_optimised.best_params_

# Predicting y using X_val (for confusion matrix later)
gbc_optimised_ypred = gbc_optimised.predict(X_val)

# ROC AUC Score (computed from the hard class predictions)
gbc_optimised_roc_score = roc_auc_score(y_val, gbc_optimised_ypred)

display(gbc_optimised_best_params)
display(gbc_optimised_ypred)

{'learning_rate': 0.12, 'max_depth': 4, 'n_estimators': 150}

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)
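All three best values here sit at the top of their grid. As a heuristic only, a best value on the boundary of a search range can hint that widening the grid in that direction is worth a try. The sketch below flags such cases; `params_on_grid_edge` is a hypothetical helper written for this note, not a scikit-learn function.

```python
def params_on_grid_edge(best_params, param_grid):
    """Return hyperparameters whose best value is the min or max of a numeric grid."""
    on_edge = {}
    for name, best in best_params.items():
        numeric = [v for v in param_grid[name] if isinstance(v, (int, float))]
        # Skip non-numeric choices (e.g. None) and single-value grids.
        if len(numeric) < 2 or best not in numeric:
            continue
        if best == min(numeric):
            on_edge[name] = "lower"
        elif best == max(numeric):
            on_edge[name] = "upper"
    return on_edge

# Grid and best parameters as reported for the Gradient Boosting search above.
gbc_grid = {"max_depth": [2, 3, 4],
            "n_estimators": [100, 125, 150],
            "learning_rate": [0.08, 0.1, 0.12]}
gbc_best = {"learning_rate": 0.12, "max_depth": 4, "n_estimators": 150}

gbc_edges = params_on_grid_edge(gbc_best, gbc_grid)
print(gbc_edges)
```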

##### Classification Metrics for Gradient Boosting Classifier

In [34]:
cm_gbc_optimised = confusion_matrix(y_val, gbc_optimised_ypred)
tn_gbc_optimised, fp_gbc_optimised, fn_gbc_optimised, tp_gbc_optimised = cm_gbc_optimised.ravel()

gbc_optimised_sensitivity = tp_gbc_optimised/(tp_gbc_optimised+fn_gbc_optimised)
gbc_optimised_specificity = tn_gbc_optimised/(tn_gbc_optimised+fp_gbc_optimised)

print("The Accuracy Score for the", color.BOLD + "Optimised Gradient Boosting Classifier" + color.END, f"is {gbc_optimised_accuracy}")
print("The Sensitivity Score for the", color.BOLD + "Optimised Gradient Boosting Classifier" + color.END, f"is {gbc_optimised_sensitivity}")
print("The Specificity Score for the", color.BOLD + "Optimised Gradient Boosting Classifier" + color.END, f"is {gbc_optimised_specificity}")
print("The ROC AUC Score for the", color.BOLD + "Optimised Gradient Boosting Classifier" + color.END, f"is {gbc_optimised_roc_score}")

The Accuracy Score for the Optimised Gradient Boosting Classifier is 0.9197693719632274
The Sensitivity Score for the Optimised Gradient Boosting Classifier is 0.9536895674300254
The Specificity Score for the Optimised Gradient Boosting Classifier is 0.8952187182095626
The ROC AUC Score for the Optimised Gradient Boosting Classifier is 0.924454142819794


##### Saving Gradient Boosting Classification Metrics into a Dataframe

In [35]:
optimised_gbc_metrics_df = pd.DataFrame(data=[gbc_optimised_accuracy, 
                                       gbc_optimised_sensitivity, 
                                       gbc_optimised_specificity, 
                                       gbc_optimised_roc_score],
                          
                                 index=["Accuracy", 
                                        "Sensitivity (TPR)", 
                                        "Specificity (TNR)", 
                                        "ROC AUC Score"], 
                          
                                 columns=["Optimised Gradient Boosting Classifier"])

metrics_df = metrics_df.join(optimised_gbc_metrics_df)
metrics_df

Unnamed: 0,Optimised Decision Tree,Optimised Random Forest Classifier,Optimised AdaBoost Classifier,Optimised Gradient Boosting Classifier
Accuracy,0.851327,0.929607,0.940802,0.919769
Sensitivity (TPR),0.919084,0.966921,0.945038,0.95369
Specificity (TNR),0.783825,0.92116,0.949135,0.895219
ROC AUC Score,0.851454,0.94404,0.947087,0.924454


## 5.5 Summary of Classification Metrics <a id="5.5"></a>

In [36]:
# Transposing cos it looks nicer to me
metrics_df = metrics_df.T

# Cool function to bold text in DF
metrics_df.style.apply(lambda x: ["font-weight:bold" if x.name in ["Optimised AdaBoost Classifier"] 
                                  else "" for i in x], axis=1)

# Can't seem to flush the model names to the left

Unnamed: 0,Accuracy,Sensitivity (TPR),Specificity (TNR),ROC AUC Score
Optimised Decision Tree,0.851327,0.919084,0.783825,0.851454
Optimised Random Forest Classifier,0.929607,0.966921,0.92116,0.94404
Optimised AdaBoost Classifier,0.940802,0.945038,0.949135,0.947087
Optimised Gradient Boosting Classifier,0.919769,0.95369,0.895219,0.924454


##### <font color = blue> Shaun: </font>

The **Optimised Random Forest Classifier** gives the highest Sensitivity, or True Positive Rate (TPR), at 0.967.

However, that higher TPR comes at a cost in Specificity, or True Negative Rate (TNR): 0.921, against 0.949 for the **Optimised AdaBoost Classifier**.

In other words, while the Random Forest catches more of the cases where West Nile Virus is present, it also raises more false alarms, flagging WNV in more cases where it is actually absent.

Therefore, I feel it is still wise to go with the **Optimised AdaBoost Classifier**. It has the highest Accuracy (0.941) and the highest ROC AUC Score (0.947) of all the models created in this notebook, which indicates that it does the best job of distinguishing between the presence and absence of WNV.
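The comparison above can be reproduced from the summary table. A minimal sketch, with the table values copied into a plain dictionary (the names `metrics` and `best_per_metric` are illustrative):

```python
metric_names = ["Accuracy", "Sensitivity (TPR)", "Specificity (TNR)", "ROC AUC Score"]

# Values copied from the summary table above.
metrics = {
    "Optimised Decision Tree": [0.851327, 0.919084, 0.783825, 0.851454],
    "Optimised Random Forest Classifier": [0.929607, 0.966921, 0.92116, 0.94404],
    "Optimised AdaBoost Classifier": [0.940802, 0.945038, 0.949135, 0.947087],
    "Optimised Gradient Boosting Classifier": [0.919769, 0.95369, 0.895219, 0.924454],
}

# For each metric, pick the model with the largest value.
best_per_metric = {
    name: max(metrics, key=lambda model: metrics[model][i])
    for i, name in enumerate(metric_names)
}

for name, model in best_per_metric.items():
    print(f"{name}: {model}")
```

On the transposed `metrics_df`, `metrics_df.idxmax()` should give the same answer in one call.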

## 5.6 Chosen Model: AdaBoost Classifier <a id="5.6"></a>

### 5.6.1 Cleaning test_df Columns <a id="5.6.1"></a>

In [37]:
# Checking columns in X_smt and test_df
# Not using train_df because I already dropped some irrelevant columns in it, X_smt reflects this drop

X_smt.shape, test_df.shape

((15722, 23), (116293, 28))

In [38]:
display(X_smt.head(), test_df.head())

Unnamed: 0,tavg,preciptotal,sealevel,resultspeed,resultdir,avgspeed,rel_hum,day,species_PIPIENS,species_PIPIENS/RESTUANS,...,year_2009,year_2011,year_2013,year_2010,year_2012,year_2014,month_7,month_8,month_9,month_10
0,74.0,0.0,30.11,5.8,18,6.5,57.893159,29,0,1,...,0,0,0,0,0,0,0,0,0,0
1,74.0,0.0,30.11,5.8,18,6.5,57.893159,29,0,0,...,0,0,0,0,0,0,0,0,0,0
2,74.0,0.0,30.11,5.8,18,6.5,57.893159,29,0,0,...,0,0,0,0,0,0,0,0,0,0
3,74.0,0.0,30.11,5.8,18,6.5,57.893159,29,0,1,...,0,0,0,0,0,0,0,0,0,0
4,74.0,0.0,30.11,5.8,18,6.5,57.893159,29,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0.1,Unnamed: 0,station,date,tavg,preciptotal,sealevel,resultspeed,resultdir,avgspeed,rel_hum,...,year_2010,year_2012,year_2014,year_2009,year_2011,year_2013,month_7,month_8,month_9,month_10
0,0,1,2008-06-11,74.0,0.0,29.99,8.9,18,10.0,53.941117,...,0,0,0,0,0,0,0,0,0,0
1,1,1,2008-06-11,74.0,0.0,29.99,8.9,18,10.0,53.941117,...,0,0,0,0,0,0,0,0,0,0
2,2,1,2008-06-11,74.0,0.0,29.99,8.9,18,10.0,53.941117,...,0,0,0,0,0,0,0,0,0,0
3,3,1,2008-06-11,74.0,0.0,29.99,8.9,18,10.0,53.941117,...,0,0,0,0,0,0,0,0,0,0
4,4,1,2008-06-11,74.0,0.0,29.99,8.9,18,10.0,53.941117,...,0,0,0,0,0,0,0,0,0,0


In [39]:
X_smt.columns, test_df.columns

(Index(['tavg', 'preciptotal', 'sealevel', 'resultspeed', 'resultdir',
        'avgspeed', 'rel_hum', 'day', 'species_PIPIENS',
        'species_PIPIENS/RESTUANS', 'species_RESTUANS', 'latitude', 'longitude',
        'year_2009', 'year_2011', 'year_2013', 'year_2010', 'year_2012',
        'year_2014', 'month_7', 'month_8', 'month_9', 'month_10'],
       dtype='object'),
 Index(['Unnamed: 0', 'station', 'date', 'tavg', 'preciptotal', 'sealevel',
        'resultspeed', 'resultdir', 'avgspeed', 'rel_hum', 'id', 'trap', 'day',
        'species_PIPIENS', 'species_PIPIENS/RESTUANS', 'species_RESTUANS',
        'latitude', 'longitude', 'year_2010', 'year_2012', 'year_2014',
        'year_2009', 'year_2011', 'year_2013', 'month_7', 'month_8', 'month_9',
        'month_10'],
       dtype='object'))

##### Dropping extra columns

In [40]:
test_data = test_df.drop(columns=["Unnamed: 0",
                                  "station",
                                  "date",
                                  "id",
                                  "trap"])

##### Final check that test_data and X_smt have same columns

In [41]:
X_smt.columns, test_data.columns

(Index(['tavg', 'preciptotal', 'sealevel', 'resultspeed', 'resultdir',
        'avgspeed', 'rel_hum', 'day', 'species_PIPIENS',
        'species_PIPIENS/RESTUANS', 'species_RESTUANS', 'latitude', 'longitude',
        'year_2009', 'year_2011', 'year_2013', 'year_2010', 'year_2012',
        'year_2014', 'month_7', 'month_8', 'month_9', 'month_10'],
       dtype='object'),
 Index(['tavg', 'preciptotal', 'sealevel', 'resultspeed', 'resultdir',
        'avgspeed', 'rel_hum', 'day', 'species_PIPIENS',
        'species_PIPIENS/RESTUANS', 'species_RESTUANS', 'latitude', 'longitude',
        'year_2010', 'year_2012', 'year_2014', 'year_2009', 'year_2011',
        'year_2013', 'month_7', 'month_8', 'month_9', 'month_10'],
       dtype='object'))
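The two indexes above contain the same column names, but the year dummies appear in a different order. The sketch below makes that explicit using the names copied from the output. It assumes nothing beyond plain Python lists; the variable names are illustrative.

```python
shared = ['tavg', 'preciptotal', 'sealevel', 'resultspeed', 'resultdir',
          'avgspeed', 'rel_hum', 'day', 'species_PIPIENS',
          'species_PIPIENS/RESTUANS', 'species_RESTUANS', 'latitude', 'longitude']
months = ['month_7', 'month_8', 'month_9', 'month_10']

# Year dummies in the order shown for X_smt and for test_data respectively.
train_years = ['year_2009', 'year_2011', 'year_2013', 'year_2010', 'year_2012', 'year_2014']
test_years = ['year_2010', 'year_2012', 'year_2014', 'year_2009', 'year_2011', 'year_2013']

train_cols = shared + train_years + months
test_cols = shared + test_years + months

same_names = set(train_cols) == set(test_cols)
same_order = train_cols == test_cols
mismatched = [(i, a, b) for i, (a, b) in enumerate(zip(train_cols, test_cols)) if a != b]

print(f"same names: {same_names}, same order: {same_order}")
for position, train_name, test_name in mismatched:
    print(f"position {position}: train={train_name}, test={test_name}")
```

As far as I know, estimators in the scikit-learn version shown in the outputs above read DataFrame columns by position and do not check their names, so the order may matter at prediction time. Selecting `test_data[X_smt.columns]` in pandas would put the test columns in the training order.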

### 5.6.2 Prediction using Optimised AdaBoost Classifier <a id="5.6.2"></a>

##### <font color = blue> Shaun: </font>

Using `predict_proba` because Kaggle wants it that way.

In [42]:
pred = ada_optimised.predict_proba(test_data)
pred

array([[0.46030949, 0.53969051],
       [0.47412123, 0.52587877],
       [0.46346595, 0.53653405],
       ...,
       [0.57500883, 0.42499117],
       [0.57500883, 0.42499117],
       [0.57500883, 0.42499117]])

# 6. Kaggle Submission <a id="6"></a>

## 6.1 Preparing Dataset <a id="6.1"></a>

In [43]:
pred_df = pd.DataFrame(pred)
pred_df

Unnamed: 0,0,1
0,0.460309,0.539691
1,0.474121,0.525879
2,0.463466,0.536534
3,0.455252,0.544748
4,0.455252,0.544748
...,...,...
116288,0.575009,0.424991
116289,0.575009,0.424991
116290,0.575009,0.424991
116291,0.575009,0.424991


In [44]:
kaggle_id = test_df['id']
kaggle_id

0              1
1              2
2              3
3              4
4              5
           ...  
116288    116281
116289    116282
116290    116283
116291    116284
116292    116285
Name: id, Length: 116293, dtype: int64

In [45]:
pred_df = pd.concat([kaggle_id, pred_df], axis =1)

##### Kaggle only wants the probability of class `1`, renamed to `WnvPresent`

In [46]:
pred_df.drop(columns=0, inplace=True)

In [47]:
pred_df.rename(columns={1:"WnvPresent"}, inplace=True)

##### Final Look at Kaggle Submission

In [48]:
pred_df

Unnamed: 0,id,WnvPresent
0,1,0.539691
1,2,0.525879
2,3,0.536534
3,4,0.544748
4,5,0.544748
...,...,...
116288,116281,0.424991
116289,116282,0.424991
116290,116283,0.424991
116291,116284,0.424991


## 6.2 Exporting Kaggle Submission File <a id="6.2"></a>

In [51]:
pred_df.to_csv("./datasets/kaggle_submission.csv", index=False)
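Before uploading, it can be worth a quick structural check of the file. The sketch below validates a submission given as CSV text: the header, unique ids, and probabilities within [0, 1]. `check_submission` is a hypothetical helper, and the exact rules Kaggle enforces are an assumption here; the sample rows reuse values from the table above.

```python
import csv
import io

def check_submission(csv_text, expected_header=("id", "WnvPresent")):
    """Return a list of problems found in a submission CSV (empty if none)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or tuple(rows[0]) != tuple(expected_header):
        return [f"unexpected header: {rows[0] if rows else None}"]

    problems = []
    ids = [row[0] for row in rows[1:]]
    if len(ids) != len(set(ids)):
        problems.append("duplicate ids")
    for row in rows[1:]:
        if not 0.0 <= float(row[1]) <= 1.0:
            problems.append(f"probability out of range for id {row[0]}")
    return problems

good = "id,WnvPresent\n1,0.539691\n2,0.525879\n3,0.536534\n"
bad = "id,WnvPresent\n1,0.539691\n1,1.2\n"

print(check_submission(good))
print(check_submission(bad))
```

To run it on the real file, the text could be read with `open("./datasets/kaggle_submission.csv").read()`.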

## 6.3 Kaggle Submission Score <a id="6.3"></a>

<img src="./images/dsi_13_sg_shaun_project_4_kaggle_submission_score.png">

In [50]:
print(f"Run complete, total time taken \u2248 {time.time()-t0:.2f}s")

Run complete, total time taken ≈ 172.78s
