# Project 4: West Nile Virus Prediction

## Problem Statement

1. **Predict locations with a high potential** of having mosquitoes carrying West Nile Virus.
2. **Identify factors that contribute to the growth and spread** of the virus in mosquitoes.

## Executive Summary

### Context

West Nile Virus (WNV) is a mosquito-borne disease that has plagued the continental United States since 1999. The vast majority of infected people develop no symptoms or only mild symptoms that subside over a few days to several weeks ([source](https://medlineplus.gov/westnilevirus.html)). About 1 in 150 infected people develop a serious illness. The virus is sometimes neuroinvasive and may cause encephalitis and meningitis, which can prove fatal. There is currently no vaccine to prevent or medication to treat WNV ([source](https://www.cdc.gov/westnile/index.html)).

Outbreaks typically intensify over as little as a couple of weeks; however, human case reports are lagging indicators of risk, since cases are reported weeks after the time of infection. Thus, environmental surveillance (monitoring enzootic and epizootic WNV transmission in mosquitoes and birds) forms a timelier index of risk and is an important cornerstone for implementing effective WNV risk reduction efforts. Research and operational experience show that increases in WNV infection rates in mosquito populations can indicate developing outbreak conditions several weeks in advance of increases in human infections. Aggressive and timely efforts to reduce the number of infected adult mosquitoes will optimally impact human WNV case incidence ([source](https://www.cdc.gov/westnile/resources/pdfs/wnvGuidelines.pdf)).

### Scope

The goal of this project is to derive a plan to deploy pesticides throughout the city of Chicago. Hence, the scope of our deployment plan and cost analysis is limited to the city of Chicago.

However, the model we have trained could also be used for other cities or states within the United States to predict potential West Nile Virus outbreaks, provided comparable trap, spray and weather data are available.

The modelling techniques and strategies will be presented to biostatisticians, epidemiologists, the American Public Health Service and decision makers from the Centers for Disease Control and Prevention (CDC).

### Contents:
- Jupyter Notebook 1 - ***1_data_cleaning.ipynb***
    - Data Importing and Cleaning
        - Cleaning train.csv
        - Cleaning test.csv
        - Cleaning spray.csv
        - Cleaning weather.csv
        - Exporting cleaned data
- Jupyter Notebook 2 - ***2_merging_data_feature_engineering.ipynb***
   - [Feature Engineering & Data Merging](#Feature-Engineering-&-Data-Merging)
        - [Feature engineering to merge spray data with train & test data](#Feature-engineering-to-merge-spray-data-with-train-&-test-data)
        - [Feature engineering for spatial correlation](#Feature-engineering-for-Spatial-Coorelation)
        - [Merging weather data with train & test data](#Merging-weather-data-with-train-&-test-data)
        - [Exporting merged data](#Exporting-merged-data)
- Jupyter Notebook 3 - ***3_EDA_modelling_evaluation.ipynb.ipynb***
     - Exploratory Data Analysis
     - Model Preparation
        - Modelling Approach
        - Classification Metrics
        - Class Balancing Techniques
     - Classification Modelling
        - *GridSearchCV* for *LogisticRegression* with *SMOTE* balancing technique
        - *RandomForestClassifier* with *SMOTE* balancing technique
        - *SVC* with *SMOTE* balancing technique
        - *GradientBoostingClassifier* with *SMOTE* balancing technique
        - *RandomForestClassifier* with *ADASYN* balancing technique
        - *RandomForestClassifier* with *ClusterCentroids* balancing technique
        - *RandomForestClassifier* with hyperparameter *class_weight='balanced_subsample'*
        - Feature Importance
        - ROC Curve
        - Visualizing the Predictions
        - Cost-Benefit Analysis
        - Deployment of Model
     - Conclusions and Recommendations

In [1]:
# Imports

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

## Feature Engineering & Data Merging

In [2]:
# Importing cleaned_train.csv, cleaned_test.csv, cleaned_spray.csv and cleaned_weather.csv.

train = pd.read_csv("../input/cleaned_train.csv")
test = pd.read_csv("../input/cleaned_test.csv")
spray = pd.read_csv("../input/cleaned_spray.csv")
weather = pd.read_csv("../input/cleaned_weather.csv")

In [3]:
train.head()

Unnamed: 0,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,WnvPresent
0,2007-05-29,41.95469,-87.800991,0,1,0,0
1,2007-05-29,41.95469,-87.800991,0,0,1,0
2,2007-05-29,41.994991,-87.769279,0,0,1,0
3,2007-05-29,41.974089,-87.824812,0,1,0,0
4,2007-05-29,41.974089,-87.824812,0,0,1,0


In [4]:
train.shape

(8304, 7)

In [5]:
test.head()

Unnamed: 0,Id,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS
0,1,2008-06-11,41.95469,-87.800991,0,1,0
1,2,2008-06-11,41.95469,-87.800991,0,0,1
2,3,2008-06-11,41.95469,-87.800991,1,0,0
3,4,2008-06-11,41.95469,-87.800991,0,0,0
4,5,2008-06-11,41.95469,-87.800991,0,0,0


In [6]:
test.shape

(116293, 7)

In [7]:
train.dtypes

Date                               object
Latitude                          float64
Longitude                         float64
Species_CULEX PIPIENS               int64
Species_CULEX PIPIENS/RESTUANS      int64
Species_CULEX RESTUANS              int64
WnvPresent                          int64
dtype: object

In [8]:
test.dtypes

Id                                  int64
Date                               object
Latitude                          float64
Longitude                         float64
Species_CULEX PIPIENS               int64
Species_CULEX PIPIENS/RESTUANS      int64
Species_CULEX RESTUANS              int64
dtype: object

In [9]:
# Converting dates from str to datetime type.

train['Date'] = train['Date'].map(lambda date: datetime.strptime(date, '%Y-%m-%d'))
test['Date'] = test['Date'].map(lambda date: datetime.strptime(date, '%Y-%m-%d'))
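
As an aside, the same conversion can also be written with pandas' vectorized `pd.to_datetime`, which parses the whole column at once instead of calling `strptime` row by row. The sketch below uses a small made-up frame (`demo`, an assumption for illustration) rather than the project data.

```python
import pandas as pd

# Hypothetical stand-in for train/test/spray; only the Date column matters here.
demo = pd.DataFrame({'Date': ['2007-05-29', '2008-06-11']})

# Parse the whole column in one call; the format mirrors the '%Y-%m-%d' used above.
demo['Date'] = pd.to_datetime(demo['Date'], format='%Y-%m-%d')

print(demo.dtypes)
```
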

In [10]:
train.dtypes

Date                              datetime64[ns]
Latitude                                 float64
Longitude                                float64
Species_CULEX PIPIENS                      int64
Species_CULEX PIPIENS/RESTUANS             int64
Species_CULEX RESTUANS                     int64
WnvPresent                                 int64
dtype: object

In [11]:
test.dtypes

Id                                         int64
Date                              datetime64[ns]
Latitude                                 float64
Longitude                                float64
Species_CULEX PIPIENS                      int64
Species_CULEX PIPIENS/RESTUANS             int64
Species_CULEX RESTUANS                     int64
dtype: object

### Feature engineering to merge *spray* data with *train* & *test* data

In [12]:
spray.head()

Unnamed: 0,Date,Latitude,Longitude
0,2011-08-29,42.391623,-88.089163
1,2011-08-29,42.391348,-88.089163
2,2011-08-29,42.391022,-88.089157
3,2011-08-29,42.390637,-88.089158
4,2011-08-29,42.39041,-88.088858


In [13]:
spray.shape

(14835, 3)

In [14]:
spray.dtypes

Date          object
Latitude     float64
Longitude    float64
dtype: object

In [15]:
# Converting dates from str to datetime type.

spray['Date'] = spray['Date'].map(lambda date: datetime.strptime(date, '%Y-%m-%d'))

In [16]:
spray.dtypes

Date         datetime64[ns]
Latitude            float64
Longitude           float64
dtype: object

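For the Latitude & Longitude coordinates given to us, **a change of 0.001 in the latitude value corresponds to roughly 0.1112km (111.2m) distance**. At Chicago's latitude (about 41.9°N), a change of 0.001 in longitude covers a shorter distance, roughly 83m, so treating a 0.001 change in either coordinate as ~100m is an approximation that we accept for simplicity.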
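
The degree-to-distance conversion can be checked with a few lines of arithmetic. This is a minimal sketch assuming a spherical Earth with a mean radius of 6,371 km; the helper name `metres_per_degree` and the latitude value 41.9 are illustrative assumptions, not part of the project code.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def metres_per_degree(lat_deg):
    """Return (metres per degree of latitude, metres per degree of longitude) at lat_deg."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180
    # Meridians converge towards the poles, so a degree of longitude shrinks with cos(latitude).
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon

lat_m, lon_m = metres_per_degree(41.9)  # approximate latitude of Chicago (assumed)

# Distance covered by a 0.001 change in latitude and in longitude, in metres.
print(round(lat_m * 0.001, 1), round(lon_m * 0.001, 1))
```

Because the two scales differ, a Euclidean distance computed directly on degrees is slightly stretched in the east-west direction; for the small radii used in this notebook the effect is modest.
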

To engineer a new feature that merges the *spray* data with the *train* and *test* data, let us first make some assumptions about the effectiveness of spraying on the mosquito traps.

In this context, we make the following **assumptions**:
1. **Spraying is effective for up to 7 days.**
2. **Spraying is effective in a ~100m radius from the location of spraying.**

We will now engineer a new feature called *Sprayed* which will use the above assumptions to merge *spray* data with *train* and *test* data. So, **for every mozzie trap on any particular testing date, it will be considered sprayed *(Sprayed = 1)* if:**
1. **Spraying happened within a period of 7 days prior to the testing date.**
2. **Spraying happened within a distance of ~100m from the location of the mozzie trap.**

**Otherwise the mozzie trap is considered not sprayed *(Sprayed = 0)*.**

In [17]:
# Defining a function to apply the above mentioned conditions on every row of train & test dataframes.
# Function will take in each row from train & test dataframes (corresponding to individual mozzie traps on any particular testing date).

def spray_within_days(row, days=7, dist=0.001):
    
    # Accessing the tested mozzie trap's Latitude & Longitude.
    lat = row['Latitude']
    lon = row['Longitude']
    
    # Accessing the test date for that mozzie trap, and calculating the date_before using the given no. of days as input.
    date = row['Date']
    date_before = row['Date'] - timedelta(days=days)
    
    # Filtering spray df to only keep sprays inside the window calculated above (date_before < spray date <= date).
    spray_filtered = spray[(spray['Date']<=date) & (spray['Date']>date_before)]
    
    for i in spray_filtered.index:
        
        # Calculating euclidean distance between the row's location and the spray's location (using values in Latitude and Longitude columns)
        spray_dist = ((spray_filtered.loc[i, 'Latitude'] - lat) ** 2 + (spray_filtered.loc[i, 'Longitude'] - lon) ** 2) ** 0.5
        
        # So if the calculated distance is <= distance value given as input, the tested trap is considered sprayed, so return 1.
        if spray_dist <= dist:
            return 1
    
    # Return 0 (mozzie trap not sprayed) if spray_dist > dist.
    return 0

In [18]:
# Creating a new Sprayed column in train & test dataframes using the above defined function.
# THIS CELL TAKES ABOUT 1-2 MINUTES TO RUN.

train['Sprayed'] = train.apply(spray_within_days, axis=1)
test['Sprayed'] = test.apply(spray_within_days, axis=1)
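
Since many rows share the same date and trap location, the row-wise `apply` above repeats identical work. The following is a sketch of one possible alternative that evaluates each distinct (date, location) once and uses NumPy for the distance check; it applies the same 7-day and 0.001 rule. The function name `sprayed_flags` and the two small demo frames are assumptions made for illustration, and any runtime benefit on the real data has not been measured here.

```python
import numpy as np
import pandas as pd

KEY_COLS = ['Date', 'Latitude', 'Longitude']

def sprayed_flags(traps, sprays, days=7, dist=0.001):
    """Return a 0/1 Series aligned to traps.index, using the same rule as spray_within_days."""
    keys = traps[KEY_COLS].drop_duplicates()
    flags = []
    for date, lat, lon in keys.itertuples(index=False):
        # Sprays in the window (date - days, date].
        in_window = (sprays['Date'] <= date) & (sprays['Date'] > date - pd.Timedelta(days=days))
        nearby = sprays.loc[in_window, ['Latitude', 'Longitude']].to_numpy()
        # Euclidean distance in degrees to every spray point in the window, computed at once.
        near = np.hypot(nearby[:, 0] - lat, nearby[:, 1] - lon) <= dist
        flags.append(int(near.any()))
    keys = keys.assign(Sprayed=flags)
    # Map the per-key result back onto every original row.
    merged = traps[KEY_COLS].merge(keys, on=KEY_COLS, how='left')
    return pd.Series(merged['Sprayed'].to_numpy(), index=traps.index)

# Made-up data: one spray close to the first trap, inside the 7-day window.
sprays_demo = pd.DataFrame({
    'Date': pd.to_datetime(['2011-08-29', '2011-08-29']),
    'Latitude': [42.3916, 41.9000],
    'Longitude': [-88.0891, -87.7000],
})
traps_demo = pd.DataFrame({
    'Date': pd.to_datetime(['2011-09-01', '2011-09-01', '2011-09-10']),
    'Latitude': [42.3920, 41.9500, 42.3920],
    'Longitude': [-88.0890, -87.8000, -88.0890],
})

flags_demo = sprayed_flags(traps_demo, sprays_demo)
print(flags_demo.tolist())
```

In the demo, only the first trap is both near a spray point and tested within 7 days of it; the third trap is at the same location but tested too late.
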

In [19]:
train.head()

Unnamed: 0,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,WnvPresent,Sprayed
0,2007-05-29,41.95469,-87.800991,0,1,0,0,0
1,2007-05-29,41.95469,-87.800991,0,0,1,0,0
2,2007-05-29,41.994991,-87.769279,0,0,1,0,0
3,2007-05-29,41.974089,-87.824812,0,1,0,0,0
4,2007-05-29,41.974089,-87.824812,0,0,1,0,0


In [20]:
train.shape

(8304, 8)

In [21]:
# Finding number of mozzie traps which are considered sprayed (Sprayed = 1) in train dataframe.

train['Sprayed'].sum()

21

In [22]:
test.head()

Unnamed: 0,Id,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,Sprayed
0,1,2008-06-11,41.95469,-87.800991,0,1,0,0
1,2,2008-06-11,41.95469,-87.800991,0,0,1,0
2,3,2008-06-11,41.95469,-87.800991,1,0,0,0
3,4,2008-06-11,41.95469,-87.800991,0,0,0,0
4,5,2008-06-11,41.95469,-87.800991,0,0,0,0


In [23]:
test.shape

(116293, 8)

In [24]:
# Finding number of mozzie traps which are considered sprayed (Sprayed = 1) in test dataframe.

test['Sprayed'].sum()

0

### Feature engineering for Spatial Coorelation

Now, we will engineer a new feature to take into account the spatial correlation between the mosquito traps around Chicago. This feature will be the average distance of previously positively tested mosquito traps from each individual trap being tested.

Consider testing a trap $T$. The presence of WNV in this trap $T$ could be influenced by the presence of WNV in nearby traps around it. To establish this "influence", we first need to understand the behaviour and characteristics of the carrier mosquitoes themselves.

**How far can mosquitoes fly?** - According to this [page](https://www.mosquito.org/page/faq#How%20far%20can%20mosquitoes%20fly?), **"most mosquito species have flight ranges of 1-3 miles."** So, mosquitoes from any traps within a distance of 1-3 miles from trap $T$ could potentially have flown over to trap $T$. If these mosquitoes flew over from traps which previously tested positive, they could have brought WNV with them to trap $T$ as well. So, in order to consider the spatial influence of nearby traps on trap $T$ in our model, we need to calculate the average distance of previously positively tested traps from trap $T$. **The closer (on average) these previously positively tested traps are to trap $T$, the higher the chances that mosquitoes could have flown over and brought the virus to trap $T$ (and vice-versa). In this context, we will assume that on average, mosquitoes fly a distance of ~2 miles (3.2km).**

One key point to note before we engineer this feature is that **we will be ignoring the influence of time as a factor in determining this spatial correlation between the mosquito traps around Chicago**. Consider this - in the real world, if traps around trap $T$ test positive for WNV, then mosquitoes from these positively tested traps only have their lifespan worth of time to fly over to trap $T$ and potentially bring the virus with them. According to this [page](https://www.mosquito.org/page/faq#How%20long%20do%20mosquitoes%20live?), the average lifespan of mosquitoes is about 2-3 weeks. This means that, while engineering the new feature explained above for each trap $T$, we should consider not only the positively tested traps within a distance of ~2 miles from trap $T$, but also whether these traps tested positive within the last 2-3 weeks. If the traps around trap $T$ tested positive more than 2-3 weeks ago, then the WNV carrying mosquitoes from those traps would have already died, and so would no longer be able to fly over to trap $T$. While this makes sense in reality, **given the limited amount of temporal data for this project, we will ignore this time factor and consider only the spatial correlation of all traps around trap $T$ (within ~2 miles distance) which tested positive for WNV historically.**

In [25]:
# Filtering the dates, latitudes and longitudes of all traps which tested positive for WNV, and storing them in a new df called positives.

positives_mask = train['WnvPresent']==1

positives = train[positives_mask].loc[:, ['Date', 'Latitude', 'Longitude']]

In [26]:
positives.head()

Unnamed: 0,Date,Latitude,Longitude
514,2007-07-18,41.686398,-87.531635
548,2007-07-25,41.673408,-87.599862
550,2007-07-25,41.673408,-87.599862
641,2007-08-01,41.95469,-87.800991
646,2007-08-01,41.974089,-87.824812


In [27]:
# Defining a function to apply the above mentioned conditions on every row of train & test dataframes.
# Function will take in each row from train & test dataframes (corresponding to individual mozzie traps on any particular testing date).

def avg_dist_nearby_positives(row, dist=0.03):
    
    # Accessing the tested mozzie trap's Latitude & Longitude.
    lat = row['Latitude']
    lon = row['Longitude']
    
    # Accessing the test date for that mozzie trap.
    date = row['Date']
    
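    # Filtering positives df to only keep positives recorded on or before this trap's test date.
    # Note: '<=' includes same-day positives, so a positive row in train also matches itself at distance 0.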
    positives_filtered = positives[positives['Date']<=date]
    
    # Defining a list to store the distances of previously positively tested traps, and defining a variable to keep a count of such traps for average distance calculation later.
    trap_distances = []
    trap_count = 0
    
    for i in positives_filtered.index:
        
        # Calculating euclidean distance between the row's location and the previously positively tested trap's location (using values in Latitude and Longitude columns)
        trap_dist = ((positives_filtered.loc[i, 'Latitude'] - lat) ** 2 + (positives_filtered.loc[i, 'Longitude'] - lon) ** 2) ** 0.5
        
        # So if the calculated distance is <= distance value given as input (0.03 = ~3.2km),
        # this trap's distance should be considered in the average distance calculation for spatial correlation.
        if trap_dist <= dist:
            trap_distances.append(trap_dist)
            trap_count += 1
            
    # Calculating the average distance, which will be used as a measure of "influence" of the spatial correlation.         
    # For traps which don't have any previously positively tested traps within the 3.2km radius circle, the average distance is set to 1 (equivalent to ~100km).
    avg_pos_trap_distance = 1
    if trap_count!=0:
        avg_pos_trap_distance = sum(trap_distances) / trap_count
    
    return avg_pos_trap_distance

In [28]:
# Creating a new column 'Avg Pos Dist' in train & test dataframes using the above defined function.
# THIS CELL TAKES ABOUT 15-20 MINUTES TO RUN.

train['Avg Pos Dist'] = train.apply(avg_dist_nearby_positives, axis=1)
test['Avg Pos Dist'] = test.apply(avg_dist_nearby_positives, axis=1)
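
Because the row-wise `apply` above loops over every earlier positive for every row, it is slow on the large *test* frame. The following is a sketch of one possible alternative that computes the feature once per distinct (date, location) and uses NumPy for the distances, keeping the same rule (positives on or before the test date, within `dist`, default of 1 when none qualify). The function name `avg_dist_to_prior_positives` and the demo frames are assumptions for illustration; the runtime on the real data has not been measured here.

```python
import numpy as np
import pandas as pd

KEY_COLS = ['Date', 'Latitude', 'Longitude']

def avg_dist_to_prior_positives(traps, positives, dist=0.03, default=1.0):
    """Return the 'Avg Pos Dist' feature as a Series aligned to traps.index."""
    keys = traps[KEY_COLS].drop_duplicates()
    values = []
    for date, lat, lon in keys.itertuples(index=False):
        # Positives recorded on or before the test date (same '<=' rule as the function above).
        earlier = positives.loc[positives['Date'] <= date, ['Latitude', 'Longitude']].to_numpy()
        d = np.hypot(earlier[:, 0] - lat, earlier[:, 1] - lon)
        d = d[d <= dist]
        # Fall back to the default when no earlier positive lies within the radius.
        values.append(d.mean() if d.size else default)
    keys = keys.assign(**{'Avg Pos Dist': values})
    merged = traps[KEY_COLS].merge(keys, on=KEY_COLS, how='left')
    return pd.Series(merged['Avg Pos Dist'].to_numpy(), index=traps.index)

# Made-up data: two positives on the same meridian, 0.03 degrees apart.
positives_demo = pd.DataFrame({
    'Date': pd.to_datetime(['2007-07-18', '2007-07-25']),
    'Latitude': [41.68, 41.71],
    'Longitude': [-87.53, -87.53],
})
traps_demo = pd.DataFrame({
    'Date': pd.to_datetime(['2007-07-20', '2007-08-01', '2007-08-01', '2007-07-01']),
    'Latitude': [41.69, 41.69, 42.50, 41.69],
    'Longitude': [-87.53, -87.53, -88.00, -87.53],
})

result_demo = avg_dist_to_prior_positives(traps_demo, positives_demo)
print(result_demo.round(4).tolist())
```

In the demo, the first trap sees only the earlier positive, the second sees both, the third is too far from either, and the fourth is tested before any positive was recorded.
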

In [29]:
train.head()

Unnamed: 0,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,WnvPresent,Sprayed,Avg Pos Dist
0,2007-05-29,41.95469,-87.800991,0,1,0,0,0,1.0
1,2007-05-29,41.95469,-87.800991,0,0,1,0,0,1.0
2,2007-05-29,41.994991,-87.769279,0,0,1,0,0,1.0
3,2007-05-29,41.974089,-87.824812,0,1,0,0,0,1.0
4,2007-05-29,41.974089,-87.824812,0,0,1,0,0,1.0


In [30]:
train.tail()

Unnamed: 0,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,WnvPresent,Sprayed,Avg Pos Dist
8299,2013-09-26,41.763733,-87.742302,0,1,0,1,0,0.017862
8300,2013-09-26,41.98728,-87.666066,0,1,0,0,0,0.016418
8301,2013-09-26,41.912563,-87.668055,0,1,0,0,0,0.013769
8302,2013-09-26,42.009876,-87.807277,0,1,0,0,0,0.016889
8303,2013-09-26,41.776428,-87.627096,0,1,0,0,0,0.012121


In [31]:
test.head()

Unnamed: 0,Id,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,Sprayed,Avg Pos Dist
0,1,2008-06-11,41.95469,-87.800991,0,1,0,0,0.014806
1,2,2008-06-11,41.95469,-87.800991,0,0,1,0,0.014806
2,3,2008-06-11,41.95469,-87.800991,1,0,0,0,0.014806
3,4,2008-06-11,41.95469,-87.800991,0,0,0,0,0.014806
4,5,2008-06-11,41.95469,-87.800991,0,0,0,0,0.014806


### Merging *weather* data with *train* & *test* data

In [32]:
weather.head()

Unnamed: 0,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Sunrise,Sunset,bc,br,dz,fg,fg+,fu,gr,hz,mi,ra,sn,sq,ts,vc
0,2007-05-01,83.5,51.0,67.25,51.0,56.5,0.0,2.5,0.0,29.14,29.82,2.2,26.0,9.4,4.8,18.816667,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2007-05-02,59.5,42.5,51.0,42.0,47.0,13.5,0.0,0.0,29.41,30.085,13.15,3.0,13.4,4.783333,18.833333,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2,2007-05-03,66.5,47.0,56.75,40.0,49.0,8.0,0.0,0.0,29.425,30.12,12.3,6.5,12.55,4.766667,18.85,0,0,0,0,0,0,0,1,0,0,0,0,0,0
3,2007-05-04,72.0,50.0,61.0,41.5,50.0,3.5,0.0,0.0,29.335,30.045,10.25,7.5,10.6,4.733333,18.866667,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,2007-05-05,66.0,53.5,59.75,38.5,49.5,5.0,0.0,0.0,29.43,30.095,11.45,7.0,11.75,4.716667,18.883333,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [33]:
weather.shape

(1472, 30)

In [34]:
weather.dtypes

Date            object
Tmax           float64
Tmin           float64
Tavg           float64
DewPoint       float64
WetBulb        float64
Heat           float64
Cool           float64
PrecipTotal    float64
StnPressure    float64
SeaLevel       float64
ResultSpeed    float64
ResultDir      float64
AvgSpeed       float64
Sunrise        float64
Sunset         float64
bc               int64
br               int64
dz               int64
fg               int64
fg+              int64
fu               int64
gr               int64
hz               int64
mi               int64
ra               int64
sn               int64
sq               int64
ts               int64
vc               int64
dtype: object

In [35]:
# Converting dates from str to datetime type.

weather['Date'] = weather['Date'].map(lambda date: datetime.strptime(date, '%Y-%m-%d'))

In [36]:
weather.dtypes

Date           datetime64[ns]
Tmax                  float64
Tmin                  float64
Tavg                  float64
DewPoint              float64
WetBulb               float64
Heat                  float64
Cool                  float64
PrecipTotal           float64
StnPressure           float64
SeaLevel              float64
ResultSpeed           float64
ResultDir             float64
AvgSpeed              float64
Sunrise               float64
Sunset                float64
bc                      int64
br                      int64
dz                      int64
fg                      int64
fg+                     int64
fu                      int64
gr                      int64
hz                      int64
mi                      int64
ra                      int64
sn                      int64
sq                      int64
ts                      int64
vc                      int64
dtype: object

In [37]:
weather.set_index('Date', inplace=True)

In [38]:
weather.head(10)

Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Sunrise,Sunset,bc,br,dz,fg,fg+,fu,gr,hz,mi,ra,sn,sq,ts,vc
2007-05-01,83.5,51.0,67.25,51.0,56.5,0.0,2.5,0.0,29.14,29.82,2.2,26.0,9.4,4.8,18.816667,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2007-05-02,59.5,42.5,51.0,42.0,47.0,13.5,0.0,0.0,29.41,30.085,13.15,3.0,13.4,4.783333,18.833333,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2007-05-03,66.5,47.0,56.75,40.0,49.0,8.0,0.0,0.0,29.425,30.12,12.3,6.5,12.55,4.766667,18.85,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2007-05-04,72.0,50.0,61.0,41.5,50.0,3.5,0.0,0.0,29.335,30.045,10.25,7.5,10.6,4.733333,18.866667,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2007-05-05,66.0,53.5,59.75,38.5,49.5,5.0,0.0,0.0,29.43,30.095,11.45,7.0,11.75,4.716667,18.883333,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2007-05-06,68.0,50.5,59.25,30.0,46.0,5.5,0.0,0.0,29.595,30.285,14.1,10.5,14.75,4.7,18.916667,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2007-05-07,83.5,48.5,66.0,40.0,53.5,0.0,1.0,0.0,29.41,30.12,8.55,17.5,10.2,4.683333,18.933333,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2007-05-08,81.0,57.0,69.0,57.5,62.5,0.0,4.0,0.0,29.325,30.025,2.6,9.5,5.6,4.65,18.95,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2007-05-09,76.5,62.0,69.25,59.5,63.0,0.0,4.5,0.075,29.245,29.935,3.9,8.0,6.05,4.633333,18.966667,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2007-05-10,83.5,57.5,70.5,52.0,60.5,0.0,5.5,0.0,29.23,29.915,1.35,13.0,4.0,4.616667,18.983333,0,1,0,0,0,0,0,1,0,0,0,0,0,0


In order to merge the *weather* data with *train* and *test* data, **we decided to shift the *weather* data with a lag of 10 days**, so that each trap record is paired with the weather observed 10 days before its test date. The reasoning is that the weather conditions on any given day will not have an immediate impact on the presence of WNV on that same day. Generally, there is a delay of a few days before the effects of weather conditions show up in the mosquito traps. For instance, **if the weather on a particular day is ideal for breeding of mosquitoes, the mosquitoes would breed more on that day and lay more eggs than usual. These eggs will then hatch and the hatchlings will grow into adult mosquitoes over a period of time, typically about 10 days (according to this [page](https://www.mosquito.org/page/lifecycle)). Only after these mosquitoes grow into adults will we see an increase in the presence of WNV in the mosquito traps.**
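
To make the direction of the shift concrete, here is a minimal sketch on made-up data. The helper `merge_weather_with_lag` and the frames `weather_demo` and `traps_demo` are illustrative assumptions, not the project code; the sketch also assumes both `Date` columns are already of datetime type.

```python
import pandas as pd

def merge_weather_with_lag(df, weather, lag_days):
    """Left-join onto df the weather observed lag_days before each row's Date."""
    shifted = weather.copy()
    # Moving the weather dates forward by the lag means a trap row dated D
    # is matched with the weather recorded on D - lag_days.
    shifted['Date'] = shifted['Date'] + pd.Timedelta(days=lag_days)
    return df.merge(shifted, on='Date', how='left')

# Made-up weather for two days that are 10 days apart.
weather_demo = pd.DataFrame({
    'Date': pd.to_datetime(['2007-05-19', '2007-05-29']),
    'Tavg': [60.0, 75.25],
})
# One trap record tested on the later day.
traps_demo = pd.DataFrame({'Date': pd.to_datetime(['2007-05-29'])})

lag_0 = merge_weather_with_lag(traps_demo, weather_demo, 0)
lag_10 = merge_weather_with_lag(traps_demo, weather_demo, 10)
print(lag_0['Tavg'].tolist(), lag_10['Tavg'].tolist())
```

With a lag of 0 the trap record gets the same-day temperature, while with a lag of 10 it gets the temperature recorded 10 days earlier.
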

We chose a value of 10 days after trying a few different values of lag and modelling the classification using a *RandomForestClassifier* model. **The plot below shows the trend in the various classification metrics for the same model with different days of lag in weather data.** As seen in the plot, a lag of 5 days gives the highest *train* & *test accuracies* and *specificity (true negative rate)*, but the lowest *sensitivity (true positive rate)*. On the other hand, lags of 2 days and 10 days give identical values for all classification metrics and, for a small decrease in *accuracies* and *specificity* (as compared to a lag of 5 days), give a significant increase in *sensitivity*. Between these two, a lag of 10 days was chosen because it is also consistent with the mosquito lifecycle described above.

![](../plot_images/eval_metrics_vs_weather_lag.png)

**The code below shows how *weather* data was merged with *train* and *test* data with lags of 0, 2, 5, 7, 10, 14, 21 and 28 days. Only the merged data with lag of 10 days was used for further evaluation using different classifiers.**

In [39]:
# Creating new columns with dates lagged by 2, 5, 7, 10, 14, 21 & 28 days.

weather['Date_lag_2'] = weather.index.shift(2, freq='D')
weather['Date_lag_5'] = weather.index.shift(5, freq='D')
weather['Date_lag_7'] = weather.index.shift(7, freq='D')
weather['Date_lag_10'] = weather.index.shift(10, freq='D')
weather['Date_lag_14'] = weather.index.shift(14, freq='D')
weather['Date_lag_21'] = weather.index.shift(21, freq='D')
weather['Date_lag_28'] = weather.index.shift(28, freq='D')
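
The seven assignments above follow one pattern, so they could also be generated in a loop. The sketch below uses a small made-up frame (`weather_demo`, an assumption for illustration) indexed by date, and the same `index.shift(..., freq='D')` call as the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the weather frame, indexed by Date as above.
weather_demo = pd.DataFrame(
    {'Tavg': [67.25, 51.0, 56.75]},
    index=pd.DatetimeIndex(['2007-05-01', '2007-05-02', '2007-05-03'], name='Date'),
)

lags = [2, 5, 7, 10, 14, 21, 28]
for lag in lags:
    # Each new column holds the date on which this row's weather "arrives" after the lag.
    weather_demo[f'Date_lag_{lag}'] = weather_demo.index.shift(lag, freq='D')

print(weather_demo.filter(like='Date_lag_').head())
```
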

In [40]:
weather.head(10)

Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Sunrise,Sunset,bc,br,dz,fg,fg+,fu,gr,hz,mi,ra,sn,sq,ts,vc,Date_lag_2,Date_lag_5,Date_lag_7,Date_lag_10,Date_lag_14,Date_lag_21,Date_lag_28
2007-05-01,83.5,51.0,67.25,51.0,56.5,0.0,2.5,0.0,29.14,29.82,2.2,26.0,9.4,4.8,18.816667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2007-05-03,2007-05-06,2007-05-08,2007-05-11,2007-05-15,2007-05-22,2007-05-29
2007-05-02,59.5,42.5,51.0,42.0,47.0,13.5,0.0,0.0,29.41,30.085,13.15,3.0,13.4,4.783333,18.833333,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2007-05-04,2007-05-07,2007-05-09,2007-05-12,2007-05-16,2007-05-23,2007-05-30
2007-05-03,66.5,47.0,56.75,40.0,49.0,8.0,0.0,0.0,29.425,30.12,12.3,6.5,12.55,4.766667,18.85,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2007-05-05,2007-05-08,2007-05-10,2007-05-13,2007-05-17,2007-05-24,2007-05-31
2007-05-04,72.0,50.0,61.0,41.5,50.0,3.5,0.0,0.0,29.335,30.045,10.25,7.5,10.6,4.733333,18.866667,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2007-05-06,2007-05-09,2007-05-11,2007-05-14,2007-05-18,2007-05-25,2007-06-01
2007-05-05,66.0,53.5,59.75,38.5,49.5,5.0,0.0,0.0,29.43,30.095,11.45,7.0,11.75,4.716667,18.883333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2007-05-07,2007-05-10,2007-05-12,2007-05-15,2007-05-19,2007-05-26,2007-06-02
2007-05-06,68.0,50.5,59.25,30.0,46.0,5.5,0.0,0.0,29.595,30.285,14.1,10.5,14.75,4.7,18.916667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2007-05-08,2007-05-11,2007-05-13,2007-05-16,2007-05-20,2007-05-27,2007-06-03
2007-05-07,83.5,48.5,66.0,40.0,53.5,0.0,1.0,0.0,29.41,30.12,8.55,17.5,10.2,4.683333,18.933333,0,0,0,0,0,0,0,0,0,1,0,0,0,0,2007-05-09,2007-05-12,2007-05-14,2007-05-17,2007-05-21,2007-05-28,2007-06-04
2007-05-08,81.0,57.0,69.0,57.5,62.5,0.0,4.0,0.0,29.325,30.025,2.6,9.5,5.6,4.65,18.95,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2007-05-10,2007-05-13,2007-05-15,2007-05-18,2007-05-22,2007-05-29,2007-06-05
2007-05-09,76.5,62.0,69.25,59.5,63.0,0.0,4.5,0.075,29.245,29.935,3.9,8.0,6.05,4.633333,18.966667,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2007-05-11,2007-05-14,2007-05-16,2007-05-19,2007-05-23,2007-05-30,2007-06-06
2007-05-10,83.5,57.5,70.5,52.0,60.5,0.0,5.5,0.0,29.23,29.915,1.35,13.0,4.0,4.616667,18.983333,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2007-05-12,2007-05-15,2007-05-17,2007-05-20,2007-05-24,2007-05-31,2007-06-07


In [41]:
# Resetting the index and renaming the Date column to 'Date_lag_0'.

weather.reset_index(inplace=True)
weather.rename(columns={'Date':'Date_lag_0'}, inplace=True)

In [42]:
# Converting all date columns to type string so that they can be used in pd.merge functions later.

train['Date'] = train['Date'].dt.strftime('%Y-%m-%d')
test['Date'] = test['Date'].dt.strftime('%Y-%m-%d')
weather['Date_lag_0'] = weather['Date_lag_0'].dt.strftime('%Y-%m-%d')
weather['Date_lag_2'] = weather['Date_lag_2'].dt.strftime('%Y-%m-%d')
weather['Date_lag_5'] = weather['Date_lag_5'].dt.strftime('%Y-%m-%d')
weather['Date_lag_7'] = weather['Date_lag_7'].dt.strftime('%Y-%m-%d')
weather['Date_lag_10'] = weather['Date_lag_10'].dt.strftime('%Y-%m-%d')
weather['Date_lag_14'] = weather['Date_lag_14'].dt.strftime('%Y-%m-%d')
weather['Date_lag_21'] = weather['Date_lag_21'].dt.strftime('%Y-%m-%d')
weather['Date_lag_28'] = weather['Date_lag_28'].dt.strftime('%Y-%m-%d')

In [43]:
weather.columns

Index(['Date_lag_0', 'Tmax', 'Tmin', 'Tavg', 'DewPoint', 'WetBulb', 'Heat',
       'Cool', 'PrecipTotal', 'StnPressure', 'SeaLevel', 'ResultSpeed',
       'ResultDir', 'AvgSpeed', 'Sunrise', 'Sunset', 'bc', 'br', 'dz', 'fg',
       'fg+', 'fu', 'gr', 'hz', 'mi', 'ra', 'sn', 'sq', 'ts', 'vc',
       'Date_lag_2', 'Date_lag_5', 'Date_lag_7', 'Date_lag_10', 'Date_lag_14',
       'Date_lag_21', 'Date_lag_28'],
      dtype='object')

Merging *weather* data with *train* & *test* data with **0 days lag**.

In [44]:
# Combining weather dataframe with train & test dataframes along Date_lag_0 column.
# Using left join to only keep rows from train & test dataframes and discard all other rows.

weather_cols = weather.columns[:30]

merged_train_lag_0 = pd.merge(left=train, right=weather[weather_cols], left_on='Date', right_on='Date_lag_0', how='left')
merged_test_lag_0 = pd.merge(left=test, right=weather[weather_cols], left_on='Date', right_on='Date_lag_0', how='left')

# Dropping the extra column 'Date_lag_0'.
merged_train_lag_0.drop(columns='Date_lag_0', inplace=True)
merged_test_lag_0.drop(columns='Date_lag_0', inplace=True)

In [45]:
merged_train_lag_0.head()

Unnamed: 0,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,WnvPresent,Sprayed,Avg Pos Dist,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Sunrise,Sunset,bc,br,dz,fg,fg+,fu,gr,hz,mi,ra,sn,sq,ts,vc
0,2007-05-29,41.95469,-87.800991,0,1,0,0,0,1.0,88.0,62.5,75.25,58.5,65.5,0.0,10.5,0.0,29.415,30.1,5.8,17.0,6.95,4.35,19.283333,0,1,0,0,0,0,0,1,0,0,0,0,0,0
1,2007-05-29,41.95469,-87.800991,0,0,1,0,0,1.0,88.0,62.5,75.25,58.5,65.5,0.0,10.5,0.0,29.415,30.1,5.8,17.0,6.95,4.35,19.283333,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2,2007-05-29,41.994991,-87.769279,0,0,1,0,0,1.0,88.0,62.5,75.25,58.5,65.5,0.0,10.5,0.0,29.415,30.1,5.8,17.0,6.95,4.35,19.283333,0,1,0,0,0,0,0,1,0,0,0,0,0,0
3,2007-05-29,41.974089,-87.824812,0,1,0,0,0,1.0,88.0,62.5,75.25,58.5,65.5,0.0,10.5,0.0,29.415,30.1,5.8,17.0,6.95,4.35,19.283333,0,1,0,0,0,0,0,1,0,0,0,0,0,0
4,2007-05-29,41.974089,-87.824812,0,0,1,0,0,1.0,88.0,62.5,75.25,58.5,65.5,0.0,10.5,0.0,29.415,30.1,5.8,17.0,6.95,4.35,19.283333,0,1,0,0,0,0,0,1,0,0,0,0,0,0


In [46]:
merged_train_lag_0.shape

(8304, 38)

In [47]:
merged_train_lag_0.isnull().sum().sum()

0

In [48]:
merged_test_lag_0.head()

Unnamed: 0,Id,Date,Latitude,Longitude,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,Sprayed,Avg Pos Dist,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Sunrise,Sunset,bc,br,dz,fg,fg+,fu,gr,hz,mi,ra,sn,sq,ts,vc
0,1,2008-06-11,41.95469,-87.800991,0,1,0,0,0.014806,86.0,63.5,74.75,55.5,64.0,0.0,10.0,0.0,29.31,29.98,9.15,18.0,10.2,4.266667,19.433333,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,2008-06-11,41.95469,-87.800991,0,0,1,0,0.014806,86.0,63.5,74.75,55.5,64.0,0.0,10.0,0.0,29.31,29.98,9.15,18.0,10.2,4.266667,19.433333,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,3,2008-06-11,41.95469,-87.800991,1,0,0,0,0.014806,86.0,63.5,74.75,55.5,64.0,0.0,10.0,0.0,29.31,29.98,9.15,18.0,10.2,4.266667,19.433333,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,2008-06-11,41.95469,-87.800991,0,0,0,0,0.014806,86.0,63.5,74.75,55.5,64.0,0.0,10.0,0.0,29.31,29.98,9.15,18.0,10.2,4.266667,19.433333,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,5,2008-06-11,41.95469,-87.800991,0,0,0,0,0.014806,86.0,63.5,74.75,55.5,64.0,0.0,10.0,0.0,29.31,29.98,9.15,18.0,10.2,4.266667,19.433333,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [49]:
merged_test_lag_0.shape

(116293, 38)

In [50]:
merged_test_lag_0.isnull().sum().sum()

0

Merging *weather* data with *train* & *test* data with **2 days lag**.

In [51]:
# Combining weather dataframe with train & test dataframes along Date_lag_2 column.
# Using left join to only keep rows from train & test dataframes and discard all other rows.

weather_cols = ['Date_lag_2']
weather_cols.extend(weather.columns[1:30])

merged_train_lag_2 = pd.merge(left=train, right=weather[weather_cols], left_on='Date', right_on='Date_lag_2', how='left')
merged_test_lag_2 = pd.merge(left=test, right=weather[weather_cols], left_on='Date', right_on='Date_lag_2', how='left')

# Dropping the extra column 'Date_lag_2'.
merged_train_lag_2.drop(columns='Date_lag_2', inplace=True)
merged_test_lag_2.drop(columns='Date_lag_2', inplace=True)

In [52]:
print(merged_train_lag_2.shape)
merged_train_lag_2.isnull().sum().sum()

(8304, 38)


0

In [53]:
print(merged_test_lag_2.shape)
merged_test_lag_2.isnull().sum().sum()

(116293, 38)


0

Merging *weather* data with *train* & *test* data with **5 days lag**.

In [54]:
# Combining weather dataframe with train & test dataframes along Date_lag_5 column.
# Using left join to only keep rows from train & test dataframes and discard all other rows.

weather_cols = ['Date_lag_5']
weather_cols.extend(weather.columns[1:30])

merged_train_lag_5 = pd.merge(left=train, right=weather[weather_cols], left_on='Date', right_on='Date_lag_5', how='left')
merged_test_lag_5 = pd.merge(left=test, right=weather[weather_cols], left_on='Date', right_on='Date_lag_5', how='left')

# Dropping the extra column 'Date_lag_5'.
merged_train_lag_5.drop(columns='Date_lag_5', inplace=True)
merged_test_lag_5.drop(columns='Date_lag_5', inplace=True)

In [55]:
print(merged_train_lag_5.shape)
merged_train_lag_5.isnull().sum().sum()

(8304, 38)


0

In [56]:
print(merged_test_lag_5.shape)
merged_test_lag_5.isnull().sum().sum()

(116293, 38)


0

Merging *weather* data with *train* & *test* data with **7 days lag**.

In [57]:
# Combining weather dataframe with train & test dataframes along Date_lag_7 column.
# Using left join to only keep rows from train & test dataframes and discard all other rows.

weather_cols = ['Date_lag_7']
weather_cols.extend(weather.columns[1:30])

merged_train_lag_7 = pd.merge(left=train, right=weather[weather_cols], left_on='Date', right_on='Date_lag_7', how='left')
merged_test_lag_7 = pd.merge(left=test, right=weather[weather_cols], left_on='Date', right_on='Date_lag_7', how='left')

# Dropping the extra column 'Date_lag_7'.
merged_train_lag_7.drop(columns='Date_lag_7', inplace=True)
merged_test_lag_7.drop(columns='Date_lag_7', inplace=True)

In [58]:
print(merged_train_lag_7.shape)
merged_train_lag_7.isnull().sum().sum()

(8304, 38)


0

In [59]:
print(merged_test_lag_7.shape)
merged_test_lag_7.isnull().sum().sum()

(116293, 38)


0

Merging *weather* data with *train* & *test* data with **10 days lag**.

In [60]:
# Combining weather dataframe with train & test dataframes along Date_lag_10 column.
# Using left join to only keep rows from train & test dataframes and discard all other rows.

weather_cols = ['Date_lag_10']
weather_cols.extend(weather.columns[1:30])

merged_train_lag_10 = pd.merge(left=train, right=weather[weather_cols], left_on='Date', right_on='Date_lag_10', how='left')
merged_test_lag_10 = pd.merge(left=test, right=weather[weather_cols], left_on='Date', right_on='Date_lag_10', how='left')

# Dropping the extra column 'Date_lag_10'.
merged_train_lag_10.drop(columns='Date_lag_10', inplace=True)
merged_test_lag_10.drop(columns='Date_lag_10', inplace=True)

In [61]:
print(merged_train_lag_10.shape)
merged_train_lag_10.isnull().sum().sum()

(8304, 38)


0

In [62]:
print(merged_test_lag_10.shape)
merged_test_lag_10.isnull().sum().sum()

(116293, 38)


0

Merging *weather* data with *train* & *test* data with **14 days lag**.

In [63]:
# Combining weather dataframe with train & test dataframes along Date_lag_14 column.
# Using left join to only keep rows from train & test dataframes and discard all other rows.

weather_cols = ['Date_lag_14']
weather_cols.extend(weather.columns[1:30])

merged_train_lag_14 = pd.merge(left=train, right=weather[weather_cols], left_on='Date', right_on='Date_lag_14', how='left')
merged_test_lag_14 = pd.merge(left=test, right=weather[weather_cols], left_on='Date', right_on='Date_lag_14', how='left')

# Dropping the extra column 'Date_lag_14'.
merged_train_lag_14.drop(columns='Date_lag_14', inplace=True)
merged_test_lag_14.drop(columns='Date_lag_14', inplace=True)

In [64]:
print(merged_train_lag_14.shape)
merged_train_lag_14.isnull().sum().sum()

(8304, 38)


0

In [65]:
print(merged_test_lag_14.shape)
merged_test_lag_14.isnull().sum().sum()

(116293, 38)


0

Merging *weather* data with *train* & *test* data with **21 days lag**.

In [66]:
# Combining weather dataframe with train & test dataframes along Date_lag_21 column.
# Using left join to only keep rows from train & test dataframes and discard all other rows.

weather_cols = ['Date_lag_21']
weather_cols.extend(weather.columns[1:30])

merged_train_lag_21 = pd.merge(left=train, right=weather[weather_cols], left_on='Date', right_on='Date_lag_21', how='left')
merged_test_lag_21 = pd.merge(left=test, right=weather[weather_cols], left_on='Date', right_on='Date_lag_21', how='left')

# Dropping the extra column 'Date_lag_21'.
merged_train_lag_21.drop(columns='Date_lag_21', inplace=True)
merged_test_lag_21.drop(columns='Date_lag_21', inplace=True)

In [67]:
print(merged_train_lag_21.shape)
merged_train_lag_21.isnull().sum().sum()

(8304, 38)


0

In [68]:
print(merged_test_lag_21.shape)
merged_test_lag_21.isnull().sum().sum()

(116293, 38)


0

Merging *weather* data with *train* & *test* data with **28 days lag**.

In [69]:
# Combining weather dataframe with train & test dataframes along Date_lag_28 column.
# Using left join to only keep rows from train & test dataframes and discard all other rows.

weather_cols = ['Date_lag_28']
weather_cols.extend(weather.columns[1:30])

merged_train_lag_28 = pd.merge(left=train, right=weather[weather_cols], left_on='Date', right_on='Date_lag_28', how='left')
merged_test_lag_28 = pd.merge(left=test, right=weather[weather_cols], left_on='Date', right_on='Date_lag_28', how='left')

# Dropping the extra column 'Date_lag_28'.
merged_train_lag_28.drop(columns='Date_lag_28', inplace=True)
merged_test_lag_28.drop(columns='Date_lag_28', inplace=True)

In [70]:
print(merged_train_lag_28.shape)
merged_train_lag_28.isnull().sum().sum()

(8304, 38)


1624

In [71]:
print(merged_test_lag_28.shape)
merged_test_lag_28.isnull().sum().sum()

(116293, 38)


0

**A 28-day lag in the *weather* data leaves missing values in the *train* data.** Once the *weather* dates are shifted by 28 days, some dates in *train* no longer have a matching *weather* date, so the left join fills the weather columns of those rows with NaN. This gives 1,624 missing values in total, which would correspond to 56 unmatched rows if each of them lacks all 29 weather columns. The *test* data is unaffected. We therefore cannot use the 28-day lag for our classification modelling.
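
To see which dates fail to match, the merge can be rerun with `indicator=True`, which adds a `_merge` column marking rows found only in the left dataframe. This is a minimal sketch on toy data; `toy_train` and `toy_weather` are hypothetical stand-ins, and the dates in them are made up for illustration rather than taken from the real datasets.

```python
import pandas as pd

# Toy stand-ins: two train rows fall on a date with no shifted weather date.
toy_train = pd.DataFrame({
    'Date': pd.to_datetime(['2007-05-29', '2007-05-29', '2007-06-26']),
    'WnvPresent': [0, 1, 0],
})
toy_weather = pd.DataFrame({
    'Date_lag_28': pd.to_datetime(['2007-06-26', '2007-06-27']),
    'Tavg': [70.0, 72.5],
    'Tmax': [81.0, 84.0],
})

checked = pd.merge(
    toy_train,
    toy_weather,
    left_on='Date',
    right_on='Date_lag_28',
    how='left',
    indicator=True,  # adds a '_merge' column: 'both' or 'left_only'
)

# Rows whose date found no match in the shifted weather data.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Date']
unmatched_dates = sorted(unmatched.dt.strftime('%Y-%m-%d').unique())
n_unmatched_rows = len(unmatched)

# Total missing values, counted the same way as in the cells above.
n_missing = int(
    checked.drop(columns=['Date_lag_28', '_merge']).isnull().sum().sum()
)

print(unmatched_dates, n_unmatched_rows, n_missing)
```

In this toy case each unmatched row contributes one missing value per weather column, so the total is the number of unmatched rows times the number of weather columns.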

### Exporting merged data

In [72]:
# Exporting merged dataframes as CSV files.
# Only the 10-day-lag files are written here; the other exports are commented out.

# merged_train_lag_0.to_csv("../input/merged_train_lag_0.csv", index=False)
# merged_test_lag_0.to_csv("../input/merged_test_lag_0.csv", index=False)

# merged_train_lag_2.to_csv("../input/merged_train_lag_2.csv", index=False)
# merged_test_lag_2.to_csv("../input/merged_test_lag_2.csv", index=False)

# merged_train_lag_5.to_csv("../input/merged_train_lag_5.csv", index=False)
# merged_test_lag_5.to_csv("../input/merged_test_lag_5.csv", index=False)

# merged_train_lag_7.to_csv("../input/merged_train_lag_7.csv", index=False)
# merged_test_lag_7.to_csv("../input/merged_test_lag_7.csv", index=False)

merged_train_lag_10.to_csv("../input/merged_train_lag_10.csv", index=False)
merged_test_lag_10.to_csv("../input/merged_test_lag_10.csv", index=False)

# merged_train_lag_14.to_csv("../input/merged_train_lag_14.csv", index=False)
# merged_test_lag_14.to_csv("../input/merged_test_lag_14.csv", index=False)

# merged_train_lag_21.to_csv("../input/merged_train_lag_21.csv", index=False)
# merged_test_lag_21.to_csv("../input/merged_test_lag_21.csv", index=False)

# merged_train_lag_28.to_csv("../input/merged_train_lag_28.csv", index=False)
# merged_test_lag_28.to_csv("../input/merged_test_lag_28.csv", index=False)
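
As an alternative to commenting individual export lines in and out, the frames could be held in a dictionary and written in a loop that skips any frame containing missing values. This is only a sketch of that idea, not what the notebook does: `export_merged` and `toy_frames` are hypothetical names, and a temporary directory stands in for `../input/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_merged(frames, out_dir):
    """Write each complete frame to out_dir/<name>.csv and return the names written."""
    out_dir = Path(out_dir)
    written = []
    for name, frame in frames.items():
        if frame.isnull().values.any():
            continue  # skip frames with missing values, e.g. an unusable lag
        frame.to_csv(out_dir / f"{name}.csv", index=False)
        written.append(name)
    return written


# Toy stand-ins for the merged dataframes (assumed contents).
toy_frames = {
    'merged_train_lag_10': pd.DataFrame({'Date': ['2007-05-29'], 'Tavg': [75.25]}),
    'merged_train_lag_28': pd.DataFrame({'Date': ['2007-05-29'], 'Tavg': [float('nan')]}),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    written = export_merged(toy_frames, tmp_dir)
    exported_files = sorted(p.name for p in Path(tmp_dir).iterdir())

print(written, exported_files)
```

Note that this helper exports every complete frame, whereas the cell above deliberately writes only the 10-day-lag files.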