<img src="./images/dsi_13_sg_shaun_project_4_banner.jpg" width=1000>

# Project 4: West Nile Virus Prediction (Preparing Modelling Datasets)
**<font color = blue> Shaun Chua 
<br> (DSI-13) </font>**

---

# Table of Contents: <a id="top"></a>
<br> [**1. Importing Libraries**](#1)
<br> [**2. Importing Datasets**](#2)
<br> [**3. Merging weather_df with train_df and test_df**](#3)
<br> &emsp; [3.01 Cleaning: Adding `station` to train_df and test_df](#3.01)
<br> &emsp; [3.02 Cleaning: Merging `weather_df` on `station` and `date`](#3.02)
<br> &emsp; [3.03 Cleaning: Dropping `longitude` and `latitude`](#3.03)
<br> &emsp; [3.04 Cleaning: Dropping `index`](#3.04)
<br> &emsp; [3.05 Cleaning: Getting Dummies for `year`](#3.05)
<br> &emsp; [3.06 Cleaning: Getting Dummies for `month`](#3.06)
<br> [**4. Exporting Cleaned Datasets**](#4)

# 1. Importing Libraries <a id="1"></a>

In [1]:
# Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import time
%matplotlib inline

In [2]:
# Starting timer for notebook 

t0 = time.time()

# 2. Importing Datasets <a id="2"></a>

In [3]:
train_df = pd.read_csv("./datasets/train_cleaned.csv")
test_df = pd.read_csv("./datasets/test_cleaned.csv")
weather_df = pd.read_csv("./datasets/weather_cleaned.csv")

##### <font color = blue> Shaun: </font>

For consistency, I will lowercase the column names of `weather_df`, since I did the same for the other datasets.

##### Applying `.lower()` to weather_df.columns

In [4]:
weather_df.columns = weather_df.columns.str.lower()

In [5]:
# Viewing all datasets again

display(train_df.head(), train_df.shape,
        test_df.head(), test_df.shape,
        weather_df.head(), weather_df.shape)

Unnamed: 0,date,trap,latitude,longitude,nummosquitos,wnvpresent,year,month,day,tot_mos_species,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS
0,2007-05-29,T002,41.95469,-87.800991,1,0,2007,5,29,1,0,1,0
1,2007-05-29,T002,41.95469,-87.800991,1,0,2007,5,29,1,0,0,1
2,2007-05-29,T007,41.994991,-87.769279,1,0,2007,5,29,1,0,0,1
3,2007-05-29,T015,41.974089,-87.824812,1,0,2007,5,29,1,0,1,0
4,2007-05-29,T015,41.974089,-87.824812,4,0,2007,5,29,4,0,0,1


(8585, 13)

Unnamed: 0,id,date,trap,latitude,longitude,year,month,day,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS
0,1,2008-06-11,T002,41.95469,-87.800991,2008,6,11,0,1,0
1,2,2008-06-11,T002,41.95469,-87.800991,2008,6,11,0,0,1
2,3,2008-06-11,T002,41.95469,-87.800991,2008,6,11,1,0,0
3,4,2008-06-11,T002,41.95469,-87.800991,2008,6,11,0,0,0
4,5,2008-06-11,T002,41.95469,-87.800991,2008,6,11,0,0,0


(116293, 11)

Unnamed: 0,index,station,date,tavg,preciptotal,sealevel,resultspeed,resultdir,avgspeed,rel_hum,latitude,longitude
0,0,1,2007-05-01,67.0,0.0,29.82,1.7,27,9.2,57.039444,41.995,-87.933
1,1,2,2007-05-01,68.0,0.0,29.82,2.7,25,9.6,55.134977,41.786,-87.752
2,2,1,2007-05-02,51.0,0.0,30.09,13.0,4,13.4,71.781719,41.995,-87.933
3,3,2,2007-05-02,52.0,0.0,30.08,13.3,2,13.4,69.235378,41.786,-87.752
4,4,1,2007-05-03,56.0,0.0,30.12,11.7,7,11.9,55.653432,41.995,-87.933


(2919, 12)

##### <font color = blue> Shaun: </font>

Since `weather_df` may contain features that influence our target variable `wnvpresent`, I will add them to the training and test datasets to give the model more information to learn from.

# 3. Merging weather_df with train_df and test_df <a id="3"></a>

## 3.01 Adding `station` to train_df and test_df <a id="3.01"></a>

##### <font color = blue> Shaun: </font>

Next, I will add the weather data to `train_df` and `test_df`.

I will first allocate a station to each trap by referring to the `latitude` feature. `longitude` would probably work too, but I went with `latitude` since its values are positive floats and I wanted to avoid any sign-related errors.

From the <a href="https://www.kaggle.com/c/predict-west-nile-virus/data">data description</a> on Kaggle, we are told that there are **two** weather stations, each with a specific coordinate.

For `train_df` and `test_df`, we also have the coordinates of each trap.

Therefore, I will assign a station to each trap, and take the corresponding weather data for that station and date.

In [6]:
# Finding out min/max latitude

print(f"train_df latitude min/max: {(train_df.latitude.min(), train_df.latitude.max())}")
print(f"test_df latitude min/max: {(test_df.latitude.min(), test_df.latitude.max())}")
print(f"weather_df latitude min/max: {(weather_df.latitude.min(), weather_df.latitude.max())}") 

train_df latitude min/max: (41.644612, 42.01743)
test_df latitude min/max: (41.644612, 42.01743)
weather_df latitude min/max: (41.786, 41.995)


In [7]:
# Finding out min/max longitude

print(f"train_df longitude min/max: {(train_df.longitude.min(), train_df.longitude.max())}")
print(f"test_df longitude min/max: {(test_df.longitude.min(), test_df.longitude.max())}")
print(f"weather_df longitude min/max: {(weather_df.longitude.min(), weather_df.longitude.max())}") 

train_df longitude min/max: (-87.930995, -87.531635)
test_df longitude min/max: (-87.930995, -87.531635)
weather_df longitude min/max: (-87.93299999999999, -87.75200000000001)


##### <font color = blue> Shaun: </font>

From the cells above, to 3 d.p.,


|Dataset|Latitude Range|Longitude Range|
|:--|:---|:---|
|**train_df**|41.645 to 42.017|-87.931 to -87.532|
|**test_df**|41.645 to 42.017|-87.931 to -87.532|
|**weather_df**|41.786 to 41.995|-87.933 to -87.752|

<br> Therefore, we can deduce that the traps lie in roughly the same area as the two weather stations, although some traps fall outside the stations' latitude and longitude ranges.

I'll assign weather data to each trap based on which station it is closer to. The cutoff will be the midpoint between the 2 stations, measured by latitude:

**Cut Off (Latitude)**

$\frac{41.786 + 41.995}{2} = 41.8905 \approx 41.891$ (3 d.p.)
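As a rough sanity check on the latitude-only cutoff, the sketch below compares it with a nearest-station assignment based on great-circle (haversine) distance. It uses the station coordinates shown in `weather_df` and a handful of trap coordinates copied from the previews in this notebook. The helper `haversine_km` and the columns `station_by_cutoff` and `station_nearest` are illustrative names only, and the Earth radius of 6371 km is an assumption.

```python
import numpy as np
import pandas as pd

# Station coordinates as shown in the weather_df preview
stations = pd.DataFrame({
    "station": [1, 2],
    "latitude": [41.995, 41.786],
    "longitude": [-87.933, -87.752],
})

# A few trap coordinates copied from the dataset previews in this notebook
traps = pd.DataFrame({
    "trap": ["T002", "T007", "T015", "T094"],
    "latitude": [41.95469, 41.994991, 41.974089, 41.71914],
    "longitude": [-87.800991, -87.769279, -87.824812, -87.669539],
})

# Latitude-only rule: midpoint between the two stations
cutoff = stations["latitude"].mean()
traps["station_by_cutoff"] = np.where(traps["latitude"] >= cutoff, 1, 2)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (hypothetical helper, radius assumed 6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


# Distance from every trap to each station: one column per station number
distances = pd.DataFrame({
    row.station: haversine_km(traps["latitude"], traps["longitude"],
                              row.latitude, row.longitude)
    for row in stations.itertuples()
})

# The nearest station is the column label with the smallest distance
traps["station_nearest"] = distances.idxmin(axis=1)

print(traps[["trap", "station_by_cutoff", "station_nearest"]])
```

On the full datasets, counting the rows where the two columns differ would show how many traps the latitude-only rule assigns to the farther station.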

##### Assigning `station` to `train_df`

In [8]:
# Creating a station column, and assigning it:
# 1 if latitude >= 41.8905, because it's closer to Station 1 (41.995)
# 2 if latitude < 41.8905, since it's closer to Station 2 (41.786)

train_df["station"] = [1 if lat >= 41.8905 else 2 for lat in train_df["latitude"]]


In [9]:
# Checking to see if assigned correctly

train_df.station.value_counts()

2    4875
1    3710
Name: station, dtype: int64

In [10]:
train_df.head()

Unnamed: 0,date,trap,latitude,longitude,nummosquitos,wnvpresent,year,month,day,tot_mos_species,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS,station
0,2007-05-29,T002,41.95469,-87.800991,1,0,2007,5,29,1,0,1,0,1
1,2007-05-29,T002,41.95469,-87.800991,1,0,2007,5,29,1,0,0,1,1
2,2007-05-29,T007,41.994991,-87.769279,1,0,2007,5,29,1,0,0,1,1
3,2007-05-29,T015,41.974089,-87.824812,1,0,2007,5,29,1,0,1,0,1
4,2007-05-29,T015,41.974089,-87.824812,4,0,2007,5,29,4,0,0,1,1


##### Assigning `station` to `test_df`

In [11]:
test_df["station"] = [1 if lat >= 41.8905 else 2 for lat in test_df["latitude"]]

In [12]:
# Checking to see if assigned correctly

test_df.station.value_counts()

2    64865
1    51428
Name: station, dtype: int64

In [13]:
test_df.head()

Unnamed: 0,id,date,trap,latitude,longitude,year,month,day,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS,station
0,1,2008-06-11,T002,41.95469,-87.800991,2008,6,11,0,1,0,1
1,2,2008-06-11,T002,41.95469,-87.800991,2008,6,11,0,0,1,1
2,3,2008-06-11,T002,41.95469,-87.800991,2008,6,11,1,0,0,1
3,4,2008-06-11,T002,41.95469,-87.800991,2008,6,11,0,0,0,1
4,5,2008-06-11,T002,41.95469,-87.800991,2008,6,11,0,0,0,1


## 3.02 Merging `weather_df` on `station` and `date` <a id="3.02"></a>

In [14]:
# https://stackoverflow.com/questions/46386402/how-to-properly-understand-pandas-dataframe-merge-how-left-on-right-on

In [15]:
train_df_cleaned_final = pd.merge(weather_df, train_df, how="inner", on=["date", "station"])

In [16]:
train_df_cleaned_final

Unnamed: 0,index,station,date,tavg,preciptotal,sealevel,resultspeed,resultdir,avgspeed,rel_hum,...,longitude_y,nummosquitos,wnvpresent,year,month,day,tot_mos_species,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS
0,56,1,2007-05-29,74.0,0.0,30.11,5.8,18,6.5,57.893159,...,-87.800991,1,0,2007,5,29,1,0,1,0
1,56,1,2007-05-29,74.0,0.0,30.11,5.8,18,6.5,57.893159,...,-87.800991,1,0,2007,5,29,1,0,0,1
2,56,1,2007-05-29,74.0,0.0,30.11,5.8,18,6.5,57.893159,...,-87.769279,1,0,2007,5,29,1,0,0,1
3,56,1,2007-05-29,74.0,0.0,30.11,5.8,18,6.5,57.893159,...,-87.824812,1,0,2007,5,29,1,0,1,0
4,56,1,2007-05-29,74.0,0.0,30.11,5.8,18,6.5,57.893159,...,-87.824812,4,0,2007,5,29,4,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,2505,2,2013-09-26,65.0,0.0,30.04,4.1,9,4.6,63.317812,...,-87.600963,7,0,2013,9,26,7,0,1,0
8519,2505,2,2013-09-26,65.0,0.0,30.04,4.1,9,4.6,63.317812,...,-87.600963,1,0,2013,9,26,1,1,0,0
8520,2505,2,2013-09-26,65.0,0.0,30.04,4.1,9,4.6,63.317812,...,-87.654234,8,0,2013,9,26,8,0,1,0
8521,2505,2,2013-09-26,65.0,0.0,30.04,4.1,9,4.6,63.317812,...,-87.742302,6,1,2013,9,26,6,0,1,0


In [17]:
test_df_cleaned_final = pd.merge(weather_df, test_df, how="inner", on=["date", "station"])

In [18]:
test_df_cleaned_final

Unnamed: 0,index,station,date,tavg,preciptotal,sealevel,resultspeed,resultdir,avgspeed,rel_hum,...,id,trap,latitude_y,longitude_y,year,month,day,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS
0,450,1,2008-06-11,74.0,0.00,29.99,8.9,18,10.0,53.941117,...,1,T002,41.95469,-87.800991,2008,6,11,0,1,0
1,450,1,2008-06-11,74.0,0.00,29.99,8.9,18,10.0,53.941117,...,2,T002,41.95469,-87.800991,2008,6,11,0,0,1
2,450,1,2008-06-11,74.0,0.00,29.99,8.9,18,10.0,53.941117,...,3,T002,41.95469,-87.800991,2008,6,11,1,0,0
3,450,1,2008-06-11,74.0,0.00,29.99,8.9,18,10.0,53.941117,...,4,T002,41.95469,-87.800991,2008,6,11,0,0,0
4,450,1,2008-06-11,74.0,0.00,29.99,8.9,18,10.0,53.941117,...,5,T002,41.95469,-87.800991,2008,6,11,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116288,2885,2,2014-10-02,71.0,0.72,29.78,7.2,17,7.9,76.170089,...,116281,T094,41.71914,-87.669539,2014,10,2,0,0,0
116289,2885,2,2014-10-02,71.0,0.72,29.78,7.2,17,7.9,76.170089,...,116282,T094,41.71914,-87.669539,2014,10,2,0,0,0
116290,2885,2,2014-10-02,71.0,0.72,29.78,7.2,17,7.9,76.170089,...,116283,T094,41.71914,-87.669539,2014,10,2,0,0,0
116291,2885,2,2014-10-02,71.0,0.72,29.78,7.2,17,7.9,76.170089,...,116284,T094,41.71914,-87.669539,2014,10,2,0,0,0


In [19]:
# Checking for nulls

print(train_df_cleaned_final.isnull().sum().sum(), 
      test_df_cleaned_final.isnull().sum().sum())

0 0


In [20]:
# Checking shape

print(train_df_cleaned_final.shape, test_df_cleaned_final.shape)

(8523, 24) (116293, 22)
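The training set had 8585 rows before the merge and has 8523 after it, which is what an inner join does when some `date`/`station` pairs have no matching weather record. A minimal sketch of how such rows could be surfaced is shown below, on toy data with made-up values. It uses a left join from the trap data with `indicator=True`, and `validate="many_to_one"` to assert that each weather key appears only once. The frames `traps_toy` and `weather_toy` are hypothetical stand-ins for `train_df` and `weather_df`.

```python
import pandas as pd

# Toy stand-in for weather_df: one row per (date, station)
weather_toy = pd.DataFrame({
    "date": ["2007-05-29", "2007-05-29"],
    "station": [1, 2],
    "tavg": [74.0, 75.0],  # made-up values for illustration
})

# Toy stand-in for train_df: the last row has no weather record
traps_toy = pd.DataFrame({
    "date": ["2007-05-29", "2007-05-29", "2007-06-05"],
    "station": [1, 2, 1],
    "trap": ["T002", "T094", "T002"],
})

# A left join keeps every trap row; _merge records whether weather was found
checked = traps_toy.merge(
    weather_toy,
    how="left",
    on=["date", "station"],
    validate="many_to_one",  # raises if weather keys are duplicated
    indicator=True,
)

# Rows that an inner join would have silently dropped
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["date", "station", "trap"]])
```

Inspecting the equivalent of `unmatched` on the real data would show which dates and stations are missing from `weather_df`.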


## 3.03 Dropping `longitude` and `latitude`  <a id="3.03"></a>

##### <font color = blue> Shaun: </font>

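Now that the merge is done, the weather stations' coordinates (`latitude_x` and `longitude_x`) have served their purpose, so I'm going to remove them. The traps' coordinates (`latitude_y` and `longitude_y`) are kept and renamed back to `latitude` and `longitude`.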

##### Removing `latitude` and `longitude` for train_df_cleaned_final

In [21]:
train_df_cleaned_final.columns

Index(['index', 'station', 'date', 'tavg', 'preciptotal', 'sealevel',
       'resultspeed', 'resultdir', 'avgspeed', 'rel_hum', 'latitude_x',
       'longitude_x', 'trap', 'latitude_y', 'longitude_y', 'nummosquitos',
       'wnvpresent', 'year', 'month', 'day', 'tot_mos_species',
       'species_PIPIENS', 'species_PIPIENS/RESTUANS', 'species_RESTUANS'],
      dtype='object')

In [22]:
# Drop the weather-station coordinates (suffix _x) brought in by the merge
train_df_cleaned_final.drop(columns=["latitude_x", "longitude_x"], inplace=True)

# Move the trap coordinates (suffix _y) back to their original names
train_df_cleaned_final["latitude"] = train_df_cleaned_final.pop("latitude_y")
train_df_cleaned_final["longitude"] = train_df_cleaned_final.pop("longitude_y")

##### Removing `latitude` and `longitude` for test_df_cleaned_final

In [23]:
# Drop the weather-station coordinates (suffix _x) brought in by the merge
test_df_cleaned_final.drop(columns=["latitude_x", "longitude_x"], inplace=True)

# Move the trap coordinates (suffix _y) back to their original names
test_df_cleaned_final["latitude"] = test_df_cleaned_final.pop("latitude_y")
test_df_cleaned_final["longitude"] = test_df_cleaned_final.pop("longitude_y")

In [24]:
display(train_df_cleaned_final.columns, test_df_cleaned_final.columns)

Index(['index', 'station', 'date', 'tavg', 'preciptotal', 'sealevel',
       'resultspeed', 'resultdir', 'avgspeed', 'rel_hum', 'trap',
       'nummosquitos', 'wnvpresent', 'year', 'month', 'day', 'tot_mos_species',
       'species_PIPIENS', 'species_PIPIENS/RESTUANS', 'species_RESTUANS',
       'latitude', 'longitude'],
      dtype='object')

Index(['index', 'station', 'date', 'tavg', 'preciptotal', 'sealevel',
       'resultspeed', 'resultdir', 'avgspeed', 'rel_hum', 'id', 'trap', 'year',
       'month', 'day', 'species_PIPIENS', 'species_PIPIENS/RESTUANS',
       'species_RESTUANS', 'latitude', 'longitude'],
      dtype='object')

In [25]:
print(train_df_cleaned_final.shape, test_df_cleaned_final.shape)

(8523, 22) (116293, 20)


## 3.04 Dropping `index`  <a id="3.04"></a>

In [26]:
train_df_cleaned_final.drop("index", axis=1, inplace=True)

In [27]:
test_df_cleaned_final.drop("index", axis=1, inplace=True)

In [28]:
display(train_df_cleaned_final.columns, test_df_cleaned_final.columns)

Index(['station', 'date', 'tavg', 'preciptotal', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed', 'rel_hum', 'trap', 'nummosquitos',
       'wnvpresent', 'year', 'month', 'day', 'tot_mos_species',
       'species_PIPIENS', 'species_PIPIENS/RESTUANS', 'species_RESTUANS',
       'latitude', 'longitude'],
      dtype='object')

Index(['station', 'date', 'tavg', 'preciptotal', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed', 'rel_hum', 'id', 'trap', 'year', 'month',
       'day', 'species_PIPIENS', 'species_PIPIENS/RESTUANS',
       'species_RESTUANS', 'latitude', 'longitude'],
      dtype='object')

In [29]:
print(train_df_cleaned_final.shape, test_df_cleaned_final.shape)

(8523, 21) (116293, 19)


## 3.05 Getting Dummies for `year`  <a id="3.05"></a>

In [30]:
display(train_df_cleaned_final.year.unique(), test_df_cleaned_final.year.unique())

array([2007, 2009, 2011, 2013], dtype=int64)

array([2008, 2010, 2012, 2014], dtype=int64)

In [31]:
train_df_cleaned_final = pd.get_dummies(data=train_df_cleaned_final, prefix=["year"], columns=["year"], drop_first=True)
test_df_cleaned_final = pd.get_dummies(data=test_df_cleaned_final, prefix=["year"], columns=["year"], drop_first=True)

In [32]:
display(train_df_cleaned_final.columns, test_df_cleaned_final.columns) 

Index(['station', 'date', 'tavg', 'preciptotal', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed', 'rel_hum', 'trap', 'nummosquitos',
       'wnvpresent', 'month', 'day', 'tot_mos_species', 'species_PIPIENS',
       'species_PIPIENS/RESTUANS', 'species_RESTUANS', 'latitude', 'longitude',
       'year_2009', 'year_2011', 'year_2013'],
      dtype='object')

Index(['station', 'date', 'tavg', 'preciptotal', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed', 'rel_hum', 'id', 'trap', 'month', 'day',
       'species_PIPIENS', 'species_PIPIENS/RESTUANS', 'species_RESTUANS',
       'latitude', 'longitude', 'year_2010', 'year_2012', 'year_2014'],
      dtype='object')

##### Adding test_df years to train_df, and vice versa

In [33]:
# Years that only appear in test_df: all zeros in the training set
train_df_cleaned_final["year_2010"] = 0
train_df_cleaned_final["year_2012"] = 0
train_df_cleaned_final["year_2014"] = 0

# Years that only appear in train_df: all zeros in the test set
test_df_cleaned_final["year_2009"] = 0
test_df_cleaned_final["year_2011"] = 0
test_df_cleaned_final["year_2013"] = 0

In [34]:
print(train_df_cleaned_final.shape, test_df_cleaned_final.shape)

(8523, 26) (116293, 24)
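Adding the missing year columns by hand gives both datasets the same set of `year_*` columns, but in a different order for each, and column order can matter once the frames are passed to a model. As a hedged alternative, the sketch below aligns two toy frames with `reindex`, filling columns that are absent from one side with 0. The frames `train_toy` and `test_toy` and their values are made up for illustration.

```python
import pandas as pd

# Toy frames with disjoint years, mimicking the odd/even split in this project
train_toy = pd.DataFrame({"tavg": [74.0, 65.0, 70.0], "year": [2007, 2009, 2011]})
test_toy = pd.DataFrame({"tavg": [71.0, 68.0], "year": [2008, 2010]})

train_d = pd.get_dummies(train_toy, columns=["year"], prefix="year", drop_first=True)
test_d = pd.get_dummies(test_toy, columns=["year"], prefix="year", drop_first=True)

# Union of all column names, in one fixed (sorted) order for both frames
all_cols = sorted(set(train_d.columns) | set(test_d.columns))

# reindex adds any missing column, filled with 0, and reorders the rest
train_aligned = train_d.reindex(columns=all_cols, fill_value=0)
test_aligned = test_d.reindex(columns=all_cols, fill_value=0)

print(list(train_aligned.columns))
print(list(test_aligned.columns))
```

In the real datasets the union would also pull target-only columns such as `wnvpresent` into the test frame, so those would need to be excluded from `all_cols` first.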



## 3.06 Getting Dummies for `month`  <a id="3.06"></a>

In [36]:
display(train_df_cleaned_final.month.unique(), test_df_cleaned_final.month.unique())

array([ 5,  6,  7,  8,  9, 10], dtype=int64)

array([ 6,  7,  8,  9, 10], dtype=int64)

In [37]:
display(train_df_cleaned_final.columns, test_df_cleaned_final.columns) 

Index(['station', 'date', 'tavg', 'preciptotal', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed', 'rel_hum', 'trap', 'nummosquitos',
       'wnvpresent', 'month', 'day', 'tot_mos_species', 'species_PIPIENS',
       'species_PIPIENS/RESTUANS', 'species_RESTUANS', 'latitude', 'longitude',
       'year_2009', 'year_2011', 'year_2013', 'year_2010', 'year_2012',
       'year_2014'],
      dtype='object')

Index(['station', 'date', 'tavg', 'preciptotal', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed', 'rel_hum', 'id', 'trap', 'month', 'day',
       'species_PIPIENS', 'species_PIPIENS/RESTUANS', 'species_RESTUANS',
       'latitude', 'longitude', 'year_2010', 'year_2012', 'year_2014',
       'year_2009', 'year_2011', 'year_2013'],
      dtype='object')

In [38]:
train_df_cleaned_final = pd.get_dummies(data=train_df_cleaned_final, prefix=["month"], columns=["month"], drop_first=True)
test_df_cleaned_final = pd.get_dummies(data=test_df_cleaned_final, prefix=["month"], columns=["month"], drop_first=True)

In [39]:
display(train_df_cleaned_final.columns, test_df_cleaned_final.columns) 

Index(['station', 'date', 'tavg', 'preciptotal', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed', 'rel_hum', 'trap', 'nummosquitos',
       'wnvpresent', 'day', 'tot_mos_species', 'species_PIPIENS',
       'species_PIPIENS/RESTUANS', 'species_RESTUANS', 'latitude', 'longitude',
       'year_2009', 'year_2011', 'year_2013', 'year_2010', 'year_2012',
       'year_2014', 'month_6', 'month_7', 'month_8', 'month_9', 'month_10'],
      dtype='object')

Index(['station', 'date', 'tavg', 'preciptotal', 'sealevel', 'resultspeed',
       'resultdir', 'avgspeed', 'rel_hum', 'id', 'trap', 'day',
       'species_PIPIENS', 'species_PIPIENS/RESTUANS', 'species_RESTUANS',
       'latitude', 'longitude', 'year_2010', 'year_2012', 'year_2014',
       'year_2009', 'year_2011', 'year_2013', 'month_7', 'month_8', 'month_9',
       'month_10'],
      dtype='object')

##### Dropping `month_6` from the training set, since the test set's dummies do not include it (dropped there by `drop_first=True`)

In [40]:
train_df_cleaned_final.shape

(8523, 30)

In [41]:
train_df_cleaned_final.drop("month_6", axis=1, inplace=True)

In [42]:
train_df_cleaned_final.shape

(8523, 29)

In [43]:
print(train_df_cleaned_final.shape, test_df_cleaned_final.shape)

(8523, 29) (116293, 27)
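With `drop_first=True`, `get_dummies` drops the first level it finds in each frame: `month_5` for the training set and `month_6` for the test set. June is therefore still present in the test data, but only as the all-zero baseline. One way to keep the two encodings aligned, sketched below on toy data, is to declare the full list of months as a shared categorical before encoding, so both frames get the same columns and the same baseline. The helper `month_dummies` and the toy frames are hypothetical.

```python
import pandas as pd

# Every month observed across both datasets (May to October)
ALL_MONTHS = [5, 6, 7, 8, 9, 10]

train_toy = pd.DataFrame({"month": [5, 6, 7, 10]})
test_toy = pd.DataFrame({"month": [6, 7, 10]})  # no May, as in test_df


def month_dummies(df):
    """Encode month against a fixed category list (hypothetical helper)."""
    out = df.copy()
    # Unobserved categories still get a column, so both frames match
    out["month"] = pd.Categorical(out["month"], categories=ALL_MONTHS)
    # drop_first now always drops month_5, the first declared category
    return pd.get_dummies(out, columns=["month"], prefix="month", drop_first=True)


train_m = month_dummies(train_toy)
test_m = month_dummies(test_toy)

print(list(train_m.columns))
print(list(test_m.columns))
```

Under this encoding May is the baseline in both frames, and June keeps its own `month_6` column in the test set instead of being merged into the baseline.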


In [44]:
print(train_df_cleaned_final.isnull().sum().sum(), test_df_cleaned_final.isnull().sum().sum())

0 0


# 4. Exporting Cleaned Datasets <a id="4"></a> 

In [45]:
train_df_cleaned_final.to_csv("./datasets/train_cleaned_final.csv")

In [46]:
test_df_cleaned_final.to_csv("./datasets/test_cleaned_final.csv")
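By default, `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again with `read_csv`. The sketch below shows the difference on a small made-up frame, writing to an in-memory buffer instead of `./datasets/`. Whether `index=False` is wanted here depends on how the modelling notebook reads these files, which I am not assuming.

```python
import io

import pandas as pd

# Small made-up frame standing in for the cleaned datasets
toy = pd.DataFrame({"station": [1, 2], "tavg": [74.0, 65.0]})

# Default export: the index is written as an extra, unnamed column
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
with_index = pd.read_csv(buf_default)

# Export without the index: only the real columns are written
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
without_index = pd.read_csv(buf_no_index)

print(list(with_index.columns))
print(list(without_index.columns))
```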

In [47]:
print(f"Run complete, total time taken \u2248 {time.time()-t0:.2f}s")

Run complete, total time taken ≈ 2.60s
