# Project 4: West Nile Virus Prediction 

## Introduction 

The West Nile virus (WNV) is a mosquito-borne virus that can cause severe neurological disease and death in humans. First identified in <a href="https://www.who.int/news-room/fact-sheets/detail/west-nile-virus">Uganda in 1937</a>, it was first recorded in the United States (US) in New York City in 1999 and has since spread across the country to become the leading mosquito-borne disease in the US.

The disease is spread primarily by several species of <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0227160">Culex</a> mosquitoes, which become infected when they feed on infected birds. Infected mosquitoes then spread the virus to humans and other animals by biting them.

About 8 out of 10 people infected with the WNV do not develop any symptoms. About 1 in 5 develop a fever together with other symptoms such as headaches or body aches. Fortunately, most of these people recover completely.

However, about 1 in 150 infected people develop much more severe symptoms affecting the central nervous system, including the brain and the spinal cord. These can lead to high fever, coma and even death. People over the age of 60 and those with pre-existing medical conditions are at greater risk (<a href="https://www.cdc.gov/westnile/index.html">source</a>).

## Problem Statement 

There is no vaccine or specific antiviral treatment for the WNV, and the disease has an <a href="https://www.sciencedaily.com/releases/2014/02/140210184713.htm">estimated cost of $800 million</a> in healthcare expenditures and lost productivity over a 15-year period from 1999 to 2014. It is therefore paramount to develop a solution that curbs the spread of the WNV. This project focuses on the city of Chicago.

Given weather, GIS and spray data from the Chicago Department of Public Health, we selected and trained 6 classification models to determine which performs best at predicting whether the WNV is present at a given location. Our goal is to use these predictions to devise a strategy for deploying pesticides in WNV hotspots that maximises effectiveness at minimal cost.

## Executive Summary

This project explores the <a href="https://www.kaggle.com/c/predict-west-nile-virus/">West Nile Virus Prediction Challenge</a>, a Kaggle competition to predict the presence of WNV in mosquitoes across the city of Chicago. The first instances of the virus in Chicago were detected in 2002, and since 2004 the Chicago Department of Public Health has increased its surveillance and control efforts to prevent transmission of the virus.

We began with an initial cleaning of our 4 datasets: main, spray, weather and map. This was followed by an extensive exploratory data analysis in which we assessed the numerical and categorical variables against the count of WNV-positive mosquitoes. The numerical weather features related to temperature, precipitation, dew point, wet bulb and average wind speed appeared to have some effect on the WNV-positive mosquito count. These effects became more pronounced after time series analysis of the features, using 14-day rolling windows and lags of 3 to 10 days.

These features, as well as the weather phenomena, were then fed into our modelling process. The models used included:
- Logistic Regression
- K Nearest Neighbors 
- Support Vector Machines
- XGBoost

Hyperparameter tuning was done to obtain the best training and cross-validated scores for each pipeline. We then computed the classification metrics for the best model and plotted its ROC (Receiver Operating Characteristic) curve, using the AUC (Area Under the Curve) to measure how well the model distinguishes between locations with and without the WNV.
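As a rough illustration of what the AUC measures, the sketch below computes it directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores are made-up toy values, not results from this project, and the helper name `pairwise_auc` is hypothetical. In practice `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
from itertools import product

def pairwise_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A tie between a positive and a negative counts as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy labels (1 = WNV present) and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of WNV-positive and WNV-negative locations.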

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

from sklearn.feature_extraction.text import CountVectorizer

# Train

In [2]:
train = pd.read_csv('../datasets/train.csv',parse_dates=['Date'])

In [3]:
train.drop(columns=['Address', 'Block', 'Street', 'AddressNumberAndStreet'], inplace=True)

In [4]:
# custom function to rename columns to lower case and replace whitespaces with underscore.
def rename_cols(df):
    old_cols = df.columns
    new_cols = [x.lower().replace(' ','_') for x in old_cols]
    #create dictionary with old column label as key and new label as value
    col_dict = {old:new for old, new in zip(old_cols,new_cols)}
    df.rename(columns=col_dict, inplace=True)
    return df

In [5]:
rename_cols(train)

Unnamed: 0,date,species,trap,latitude,longitude,addressaccuracy,nummosquitos,wnvpresent
0,2007-05-29,CULEX PIPIENS/RESTUANS,T002,41.954690,-87.800991,9,1,0
1,2007-05-29,CULEX RESTUANS,T002,41.954690,-87.800991,9,1,0
2,2007-05-29,CULEX RESTUANS,T007,41.994991,-87.769279,9,1,0
3,2007-05-29,CULEX PIPIENS/RESTUANS,T015,41.974089,-87.824812,8,1,0
4,2007-05-29,CULEX RESTUANS,T015,41.974089,-87.824812,8,4,0
...,...,...,...,...,...,...,...,...
10501,2013-09-26,CULEX PIPIENS/RESTUANS,T035,41.763733,-87.742302,8,6,1
10502,2013-09-26,CULEX PIPIENS/RESTUANS,T231,41.987280,-87.666066,8,5,0
10503,2013-09-26,CULEX PIPIENS/RESTUANS,T232,41.912563,-87.668055,9,1,0
10504,2013-09-26,CULEX PIPIENS/RESTUANS,T233,42.009876,-87.807277,9,5,0


In [6]:
train.rename(columns={'latitude':'lat', 'longitude':'long',
                      'addressaccuracy':'accuracy',
                      'nummosquitos':'num_mos', 'wnvpresent':'wnv'},
             inplace=True)

## Date

In [7]:
# 'date' is now of datetime type
train['date']

0       2007-05-29
1       2007-05-29
2       2007-05-29
3       2007-05-29
4       2007-05-29
           ...    
10501   2013-09-26
10502   2013-09-26
10503   2013-09-26
10504   2013-09-26
10505   2013-09-26
Name: date, Length: 10506, dtype: datetime64[ns]

In [8]:
# odd years only
train['date'].dt.year.unique()

array([2007, 2009, 2011, 2013])

In [9]:
# May to October only (late spring to autumn)
train['date'].dt.month.unique()

array([ 5,  6,  7,  8,  9, 10])

## Location: Nearest Weather Station

In [10]:
# custom function to assign nearer weather station to each trap
def assign_station(row):
    
    def distance(origin, destination):
    
        # Origin and destination coordinates
        lat1, lon1 = origin
        lat2, lon2 = destination

        # Radius of the Earth in km
        radius = 6371

        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1)
        a = (math.sin(dlat / 2) * math.sin(dlat / 2) +
             math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
             math.sin(dlon / 2) * math.sin(dlon / 2))
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        d = radius * c

        # Returns distance in km
        return d
    
    #calc distance to station 1
    station_1 = (41.995, -87.933)
    dist1 = distance(station_1, (row['lat'],row['long']))
    
    #calc distance to station 2
    station_2 = (41.786, -87.752)
    dist2 = distance(station_2, (row['lat'],row['long']))
    
    if dist1 <= dist2:
        station=1
    else: station=2
    return station
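The row-wise function above can be slow on large frames such as the test set. As a possible alternative, here is a minimal sketch of a vectorised haversine distance in NumPy. The station coordinates are the ones used in `assign_station`; the helper names `haversine_km` and `nearest_station` are hypothetical and not part of the original notebook.

```python
import numpy as np

# Station coordinates as used in assign_station above
STATIONS = {1: (41.995, -87.933), 2: (41.786, -87.752)}

def haversine_km(lat1, lon1, lat2, lon2, radius=6371.0):
    """Great-circle distance in km; accepts scalars or NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(a))

def nearest_station(lat, lon):
    """Return 1 or 2 per point; ties go to station 1, as in assign_station."""
    d1 = haversine_km(lat, lon, *STATIONS[1])
    d2 = haversine_km(lat, lon, *STATIONS[2])
    return np.where(d1 <= d2, 1, 2)

# Trap coordinates copied from the train output above (T002, T035, T231, T232)
lat = np.array([41.954690, 41.763733, 41.987280, 41.912563])
lon = np.array([-87.800991, -87.742302, -87.666066, -87.668055])
stations = nearest_station(lat, lon)
print(stations)
```

Under these assumptions, the whole column could be filled in one call, for example `nearest_station(train['lat'].to_numpy(), train['long'].to_numpy())`.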

In [11]:
train['station'] = train.apply(assign_station, axis=1)

In [12]:
train

Unnamed: 0,date,species,trap,lat,long,accuracy,num_mos,wnv,station
0,2007-05-29,CULEX PIPIENS/RESTUANS,T002,41.954690,-87.800991,9,1,0,1
1,2007-05-29,CULEX RESTUANS,T002,41.954690,-87.800991,9,1,0,1
2,2007-05-29,CULEX RESTUANS,T007,41.994991,-87.769279,9,1,0,1
3,2007-05-29,CULEX PIPIENS/RESTUANS,T015,41.974089,-87.824812,8,1,0,1
4,2007-05-29,CULEX RESTUANS,T015,41.974089,-87.824812,8,4,0,1
...,...,...,...,...,...,...,...,...,...
10501,2013-09-26,CULEX PIPIENS/RESTUANS,T035,41.763733,-87.742302,8,6,1,2
10502,2013-09-26,CULEX PIPIENS/RESTUANS,T231,41.987280,-87.666066,8,5,0,1
10503,2013-09-26,CULEX PIPIENS/RESTUANS,T232,41.912563,-87.668055,9,1,0,2
10504,2013-09-26,CULEX PIPIENS/RESTUANS,T233,42.009876,-87.807277,9,5,0,1


## Species

In [13]:
# remove 'CULEX'
train['species']=train['species'].str.replace('CULEX ','')

In [14]:
train['species'].value_counts()

PIPIENS/RESTUANS    4752
RESTUANS            2740
PIPIENS             2699
TERRITANS            222
SALINARIUS            86
TARSALIS               6
ERRATICUS              1
Name: species, dtype: int64

In [15]:
train.groupby('species')['wnv'].value_counts()

species           wnv
ERRATICUS         0         1
PIPIENS           0      2459
                  1       240
PIPIENS/RESTUANS  0      4490
                  1       262
RESTUANS          0      2691
                  1        49
SALINARIUS        0        86
TARSALIS          0         6
TERRITANS         0       222
Name: wnv, dtype: int64

In [16]:
# group rows that share the same 7 key columns and sum up num_mos and wnv
# (wnv becomes a count of positive records and can exceed 1)
grouped_train = pd.DataFrame(train.groupby(['date','species', 'trap', 'lat', 'long','station','accuracy'])\
                             [['num_mos', 'wnv']].sum())

In [17]:
grouped_train

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,num_mos,wnv
date,species,trap,lat,long,station,accuracy,Unnamed: 7_level_1,Unnamed: 8_level_1
2007-05-29,PIPIENS,T096,41.731922,-87.677512,2,8,1,0
2007-05-29,PIPIENS/RESTUANS,T002,41.954690,-87.800991,1,9,1,0
2007-05-29,PIPIENS/RESTUANS,T015,41.974089,-87.824812,1,8,1,0
2007-05-29,PIPIENS/RESTUANS,T048,41.867108,-87.654224,2,8,1,0
2007-05-29,PIPIENS/RESTUANS,T050,41.919343,-87.694259,2,8,1,0
...,...,...,...,...,...,...,...,...
2013-09-26,RESTUANS,T082,41.803423,-87.642984,2,8,2,0
2013-09-26,RESTUANS,T102,41.750498,-87.605294,2,5,1,0
2013-09-26,RESTUANS,T209,41.740641,-87.546587,2,5,1,0
2013-09-26,RESTUANS,T220,41.963976,-87.691810,1,9,8,0


In [18]:
grouped_train.reset_index(inplace=True)

In [19]:
# check
grouped_train[grouped_train['date']=='2007-07-25'].sort_values(by='trap')

Unnamed: 0,date,species,trap,lat,long,station,accuracy,num_mos,wnv
563,2007-07-25,PIPIENS,T107,41.729669,-87.582699,2,5,1,0
567,2007-07-25,PIPIENS/RESTUANS,T107,41.729669,-87.582699,2,5,2,0
564,2007-07-25,PIPIENS,T115,41.673408,-87.599862,2,5,2356,3
568,2007-07-25,PIPIENS/RESTUANS,T115,41.673408,-87.599862,2,5,644,2
572,2007-07-25,SALINARIUS,T115,41.673408,-87.599862,2,5,1,0
574,2007-07-25,TERRITANS,T115,41.673408,-87.599862,2,5,1,0
565,2007-07-25,PIPIENS,T128,41.704572,-87.565666,2,8,55,0
569,2007-07-25,PIPIENS/RESTUANS,T128,41.704572,-87.565666,2,8,83,0
571,2007-07-25,RESTUANS,T128,41.704572,-87.565666,2,8,11,0
573,2007-07-25,SALINARIUS,T128,41.704572,-87.565666,2,8,2,0
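In the check above, summing `wnv` yields values such as 3 for trap T115, so the column is no longer a 0/1 flag. If a binary target is wanted for classification, one option is to aggregate `wnv` with `max` while still summing `num_mos`. The sketch below uses a small made-up frame and the hypothetical column name `wnv_count`; it is not the aggregation used in this notebook.

```python
import pandas as pd

# Toy rows mimicking one trap/species/date split across several records
toy = pd.DataFrame({
    'date': ['2007-07-25'] * 4,
    'trap': ['T115', 'T115', 'T115', 'T128'],
    'species': ['PIPIENS', 'PIPIENS', 'PIPIENS', 'RESTUANS'],
    'num_mos': [50, 50, 12, 11],
    'wnv': [1, 1, 0, 0],
})

agg = (toy.groupby(['date', 'trap', 'species'], as_index=False)
          .agg(num_mos=('num_mos', 'sum'),   # total mosquitoes caught
               wnv_count=('wnv', 'sum'),     # number of positive records
               wnv=('wnv', 'max')))          # 1 if any record was positive
print(agg)
```

Keeping both `wnv_count` and the binary `wnv` preserves the information from the summed version while giving the classifier a clean 0/1 label.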


In [20]:
grouped_train = pd.get_dummies(grouped_train, columns=['species'], drop_first=False)

In [21]:
grouped_train = grouped_train[['date','trap','lat','long','station','accuracy','species_ERRATICUS','species_PIPIENS',\
                               'species_PIPIENS/RESTUANS','species_RESTUANS','species_SALINARIUS',\
                               'species_TARSALIS','species_TERRITANS','num_mos','wnv']]

In [22]:
grouped_train

Unnamed: 0,date,trap,lat,long,station,accuracy,species_ERRATICUS,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS,species_SALINARIUS,species_TARSALIS,species_TERRITANS,num_mos,wnv
0,2007-05-29,T096,41.731922,-87.677512,2,8,0,1,0,0,0,0,0,1,0
1,2007-05-29,T002,41.954690,-87.800991,1,9,0,0,1,0,0,0,0,1,0
2,2007-05-29,T015,41.974089,-87.824812,1,8,0,0,1,0,0,0,0,1,0
3,2007-05-29,T048,41.867108,-87.654224,2,8,0,0,1,0,0,0,0,1,0
4,2007-05-29,T050,41.919343,-87.694259,2,8,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8470,2013-09-26,T082,41.803423,-87.642984,2,8,0,0,0,1,0,0,0,2,0
8471,2013-09-26,T102,41.750498,-87.605294,2,5,0,0,0,1,0,0,0,1,0
8472,2013-09-26,T209,41.740641,-87.546587,2,5,0,0,0,1,0,0,0,1,0
8473,2013-09-26,T220,41.963976,-87.691810,1,9,0,0,0,1,0,0,0,8,0


In [23]:
grouped_train.sort_values(by=['date','trap'], inplace=True)

In [24]:
# Export clean train data for EDA
grouped_train.to_csv('../datasets/train_cleaned.csv',index=False)

# Test

In [25]:
test = pd.read_csv('../datasets/test.csv',parse_dates=['Date'])

In [26]:
test.drop(columns=['Address', 'Block', 'Street', 'AddressNumberAndStreet'], inplace=True)

In [27]:
rename_cols(test)

Unnamed: 0,id,date,species,trap,latitude,longitude,addressaccuracy
0,1,2008-06-11,CULEX PIPIENS/RESTUANS,T002,41.954690,-87.800991,9
1,2,2008-06-11,CULEX RESTUANS,T002,41.954690,-87.800991,9
2,3,2008-06-11,CULEX PIPIENS,T002,41.954690,-87.800991,9
3,4,2008-06-11,CULEX SALINARIUS,T002,41.954690,-87.800991,9
4,5,2008-06-11,CULEX TERRITANS,T002,41.954690,-87.800991,9
...,...,...,...,...,...,...,...
116288,116289,2014-10-02,CULEX SALINARIUS,T054C,41.925652,-87.633590,8
116289,116290,2014-10-02,CULEX TERRITANS,T054C,41.925652,-87.633590,8
116290,116291,2014-10-02,CULEX TARSALIS,T054C,41.925652,-87.633590,8
116291,116292,2014-10-02,UNSPECIFIED CULEX,T054C,41.925652,-87.633590,8


In [28]:
test.rename(columns={'latitude':'lat', 'longitude':'long', 'addressaccuracy':'accuracy'}, 
             inplace=True)

In [29]:
# id required for Kaggle submission
test.head()

Unnamed: 0,id,date,species,trap,lat,long,accuracy
0,1,2008-06-11,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,9
1,2,2008-06-11,CULEX RESTUANS,T002,41.95469,-87.800991,9
2,3,2008-06-11,CULEX PIPIENS,T002,41.95469,-87.800991,9
3,4,2008-06-11,CULEX SALINARIUS,T002,41.95469,-87.800991,9
4,5,2008-06-11,CULEX TERRITANS,T002,41.95469,-87.800991,9


## Date

In [30]:
# 'Date' is now of datetime type
test['date']

0        2008-06-11
1        2008-06-11
2        2008-06-11
3        2008-06-11
4        2008-06-11
            ...    
116288   2014-10-02
116289   2014-10-02
116290   2014-10-02
116291   2014-10-02
116292   2014-10-02
Name: date, Length: 116293, dtype: datetime64[ns]

In [31]:
# even years only
test['date'].dt.year.unique()

array([2008, 2010, 2012, 2014])

In [32]:
# June to October only (summer and autumn)
test['date'].dt.month.unique()

array([ 6,  7,  8,  9, 10])

## Location: Traps and Coordinates

In [33]:
test['station'] = test.apply(assign_station, axis=1)

In [34]:
test['trap'].nunique()

149

In [35]:
set(test['trap'].unique()) - set(train['trap'].unique())

{'T002A',
 'T002B',
 'T065A',
 'T090A',
 'T090B',
 'T090C',
 'T128A',
 'T200A',
 'T200B',
 'T218A',
 'T218B',
 'T218C',
 'T234'}

## Species

In [36]:
# remove 'CULEX'
test['species']=test['species'].str.replace('CULEX ','')

In [37]:
test['species'].value_counts()

PIPIENS/RESTUANS     15359
RESTUANS             14670
PIPIENS              14521
SALINARIUS           14355
TERRITANS            14351
TARSALIS             14347
ERRATICUS            14345
UNSPECIFIED CULEX    14345
Name: species, dtype: int64

In [38]:
test[test['trap']=='T002']

Unnamed: 0,id,date,species,trap,lat,long,accuracy,station
0,1,2008-06-11,PIPIENS/RESTUANS,T002,41.95469,-87.800991,9,1
1,2,2008-06-11,RESTUANS,T002,41.95469,-87.800991,9,1
2,3,2008-06-11,PIPIENS,T002,41.95469,-87.800991,9,1
3,4,2008-06-11,SALINARIUS,T002,41.95469,-87.800991,9,1
4,5,2008-06-11,TERRITANS,T002,41.95469,-87.800991,9,1
...,...,...,...,...,...,...,...,...
115088,115089,2014-10-02,SALINARIUS,T002,41.95469,-87.800991,9,1
115089,115090,2014-10-02,TERRITANS,T002,41.95469,-87.800991,9,1
115090,115091,2014-10-02,TARSALIS,T002,41.95469,-87.800991,9,1
115091,115092,2014-10-02,UNSPECIFIED CULEX,T002,41.95469,-87.800991,9,1


In [39]:
test[test['trap']=='T002A']

Unnamed: 0,id,date,species,trap,lat,long,accuracy,station
1120,1121,2008-06-11,PIPIENS/RESTUANS,T002A,41.965571,-87.781978,8,1
1121,1122,2008-06-11,RESTUANS,T002A,41.965571,-87.781978,8,1
1122,1123,2008-06-11,PIPIENS,T002A,41.965571,-87.781978,8,1
1123,1124,2008-06-11,SALINARIUS,T002A,41.965571,-87.781978,8,1
1124,1125,2008-06-11,TERRITANS,T002A,41.965571,-87.781978,8,1
...,...,...,...,...,...,...,...,...
116208,116209,2014-10-02,SALINARIUS,T002A,41.965571,-87.781978,8,1
116209,116210,2014-10-02,TERRITANS,T002A,41.965571,-87.781978,8,1
116210,116211,2014-10-02,TARSALIS,T002A,41.965571,-87.781978,8,1
116211,116212,2014-10-02,UNSPECIFIED CULEX,T002A,41.965571,-87.781978,8,1


In [40]:
grouped_test = pd.get_dummies(test, columns=['species'], drop_first=False)

In [41]:
grouped_test.drop('species_UNSPECIFIED CULEX', axis=1, inplace=True)

In [42]:
grouped_test = grouped_test[['id','date','trap','lat','long','station','accuracy','species_ERRATICUS','species_PIPIENS',\
                               'species_PIPIENS/RESTUANS','species_RESTUANS','species_SALINARIUS',\
                               'species_TARSALIS','species_TERRITANS']]
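The test set contains a species label (`UNSPECIFIED CULEX`) that never appears in the training data, which is why its dummy column is dropped and the columns are re-listed by hand above. A more general way to keep the two frames aligned, sketched here on toy data, is to reindex the test dummies to the training columns. The frame names are illustrative only.

```python
import pandas as pd

train_toy = pd.DataFrame({'species': ['PIPIENS', 'RESTUANS', 'TERRITANS']})
test_toy = pd.DataFrame({'species': ['PIPIENS', 'UNSPECIFIED CULEX', 'RESTUANS']})

train_dummies = pd.get_dummies(train_toy, columns=['species'])
test_dummies = pd.get_dummies(test_toy, columns=['species'])

# Keep exactly the training columns: unseen test categories are dropped,
# and categories missing from test are added as all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

With this approach, a row whose species was unseen in training ends up with zeros in every species column.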

In [43]:
grouped_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 14 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   id                        116293 non-null  int64         
 1   date                      116293 non-null  datetime64[ns]
 2   trap                      116293 non-null  object        
 3   lat                       116293 non-null  float64       
 4   long                      116293 non-null  float64       
 5   station                   116293 non-null  int64         
 6   accuracy                  116293 non-null  int64         
 7   species_ERRATICUS         116293 non-null  uint8         
 8   species_PIPIENS           116293 non-null  uint8         
 9   species_PIPIENS/RESTUANS  116293 non-null  uint8         
 10  species_RESTUANS          116293 non-null  uint8         
 11  species_SALINARIUS        116293 non-null  uint8         
 12  sp

In [44]:
grouped_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8475 entries, 1 to 8467
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      8475 non-null   datetime64[ns]
 1   trap                      8475 non-null   object        
 2   lat                       8475 non-null   float64       
 3   long                      8475 non-null   float64       
 4   station                   8475 non-null   int64         
 5   accuracy                  8475 non-null   int64         
 6   species_ERRATICUS         8475 non-null   uint8         
 7   species_PIPIENS           8475 non-null   uint8         
 8   species_PIPIENS/RESTUANS  8475 non-null   uint8         
 9   species_RESTUANS          8475 non-null   uint8         
 10  species_SALINARIUS        8475 non-null   uint8         
 11  species_TARSALIS          8475 non-null   uint8         
 12  species_TERRITANS   

In [45]:
grouped_train.head()

Unnamed: 0,date,trap,lat,long,station,accuracy,species_ERRATICUS,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS,species_SALINARIUS,species_TARSALIS,species_TERRITANS,num_mos,wnv
1,2007-05-29,T002,41.95469,-87.800991,1,9,0,0,1,0,0,0,0,1,0
10,2007-05-29,T002,41.95469,-87.800991,1,9,0,0,0,1,0,0,0,1,0
11,2007-05-29,T007,41.994991,-87.769279,1,9,0,0,0,1,0,0,0,1,0
2,2007-05-29,T015,41.974089,-87.824812,1,8,0,0,1,0,0,0,0,1,0
12,2007-05-29,T015,41.974089,-87.824812,1,8,0,0,0,1,0,0,0,4,0


In [46]:
grouped_test.head()

Unnamed: 0,id,date,trap,lat,long,station,accuracy,species_ERRATICUS,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS,species_SALINARIUS,species_TARSALIS,species_TERRITANS
0,1,2008-06-11,T002,41.95469,-87.800991,1,9,0,0,1,0,0,0,0
1,2,2008-06-11,T002,41.95469,-87.800991,1,9,0,0,0,1,0,0,0
2,3,2008-06-11,T002,41.95469,-87.800991,1,9,0,1,0,0,0,0,0
3,4,2008-06-11,T002,41.95469,-87.800991,1,9,0,0,0,0,1,0,0
4,5,2008-06-11,T002,41.95469,-87.800991,1,9,0,0,0,0,0,0,1


In [47]:
# Export clean test data for EDA
grouped_test.to_csv('../datasets/test_cleaned.csv', index=False)

# Weather

In [48]:
weather = pd.read_csv('../datasets/weather.csv')

In [49]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,...,,0,M,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,...,,M,M,M,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,...,BR,0,M,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,...,BR HZ,M,M,M,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,...,,0,M,0.0,0.0,29.39,30.12,11.7,7,11.9


In [50]:
weather['Date']=pd.to_datetime(weather['Date'],format='%Y-%m-%d')

In [51]:
weather['Water1'].unique()

array(['M'], dtype=object)

In [52]:
weather.drop(columns=['Depart','Water1','SnowFall','Depth','Sunrise','Sunset'], inplace=True)

In [53]:
# use countvectorizer to split weather codes
# raw string avoids an invalid-escape warning for '\+'
cv=CountVectorizer(token_pattern=r'[a-zA-Z\+]+')

In [54]:
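# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
weather_code=pd.DataFrame(cv.fit_transform(weather['CodeSum']).toarray(),
             columns=cv.get_feature_names_out())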

In [55]:
weather=pd.concat([weather,weather_code],axis=1).drop(columns='CodeSum')
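To make the tokenisation step more concrete, the sketch below reproduces what the token pattern does using only the standard library. `CountVectorizer` lower-cases text by default and extracts tokens with its `token_pattern`, so each weather code becomes one lower-case column. The helper name `tokenize_codes` is hypothetical, and apart from `BR HZ` the sample strings are illustrative.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'[a-zA-Z+]+')

def tokenize_codes(code_sum):
    """Lower-case a CodeSum string and split it into weather-code tokens."""
    return TOKEN_PATTERN.findall(code_sum.lower())

# 'BR HZ' appears in the weather output above; the other strings are illustrative
samples = ['BR HZ', ' ', 'TSRA RA BR', 'FG+ BR']
tokens = [tokenize_codes(s) for s in samples]
counts = Counter(tok for row in tokens for tok in row)
print(tokens)
print(counts)
```

Note that a fused code such as `TSRA` stays a single token rather than being split into `ts` and `ra`, and that the `+` in the pattern is what keeps `fg+` distinct from `fg`.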

In [56]:
weather['PrecipTotal'].unique()

array(['0.00', '  T', '0.13', '0.02', '0.38', '0.60', '0.14', '0.07',
       '0.11', '0.09', '1.01', '0.28', '0.04', '0.08', '0.01', '0.53',
       '0.19', '0.21', '0.32', '0.39', '0.31', '0.42', '0.27', '0.16',
       '0.58', '0.93', '0.05', '0.34', '0.15', '0.35', 'M', '0.40',
       '0.66', '0.30', '0.24', '0.43', '1.55', '0.92', '0.89', '0.17',
       '0.03', '1.43', '0.97', '0.26', '1.31', '0.06', '0.46', '0.29',
       '0.23', '0.41', '0.45', '0.83', '1.33', '0.91', '0.48', '0.37',
       '0.88', '2.35', '1.96', '0.20', '0.25', '0.18', '0.67', '0.36',
       '0.33', '1.28', '0.74', '0.76', '0.71', '0.95', '1.46', '0.12',
       '0.52', '0.64', '0.22', '1.24', '0.72', '0.73', '0.65', '1.61',
       '1.22', '0.50', '1.05', '2.43', '0.59', '2.90', '2.68', '1.23',
       '0.62', '6.64', '3.07', '1.44', '1.75', '0.82', '0.80', '0.86',
       '0.63', '0.55', '1.03', '0.70', '1.73', '1.38', '0.44', '1.14',
       '1.07', '3.97', '0.87', '0.78', '1.12', '0.68', '0.10', '0.61',
       '0.

In [57]:
# treat missing ('M') and trace ('  T') precipitation as 0
weather['PrecipTotal']=weather['PrecipTotal'].replace(['M','  T'],0)
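The cell above sets both missing (`M`) and trace (`T`) precipitation to 0, which makes the two cases indistinguishable afterwards. As a hedged alternative, the sketch below keeps a separate trace flag and leaves missing values as NaN so they can be imputed deliberately later. The series is a toy sample of the values listed above, and `is_trace` is a hypothetical column name.

```python
import pandas as pd

raw = pd.Series(['0.00', '  T', '0.13', 'M'])

cleaned = raw.str.strip()
is_trace = cleaned.eq('T')   # flag trace amounts before overwriting them
# Trace becomes 0; anything non-numeric that remains ('M') becomes NaN
precip = pd.to_numeric(cleaned.where(~is_trace, '0'), errors='coerce')
print(pd.DataFrame({'precip': precip, 'is_trace': is_trace}))
```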

In [58]:
weather['Tavg'].unique()

array(['67', '68', '51', '52', '56', '58', 'M', '60', '59', '65', '70',
       '69', '71', '61', '55', '57', '73', '72', '53', '62', '63', '74',
       '75', '78', '76', '77', '66', '80', '64', '81', '82', '79', '85',
       '84', '83', '50', '49', '46', '48', '45', '54', '47', '44', '40',
       '41', '38', '39', '42', '37', '43', '86', '87', '89', '92', '88',
       '91', '93', '94', '90', '36'], dtype=object)

In [59]:
weather[weather['Tavg']=='M']

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,...,gr,hz,mifg,ra,sn,sq,ts,tsra,vcfg,vcts
7,2,2007-05-04,78,51,M,42,50,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
505,2,2008-07-08,86,46,M,68,71,M,M,0.28,...,0,0,0,1,0,0,1,0,0,0
675,2,2008-10-01,62,46,M,41,47,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
1637,2,2011-07-22,100,71,M,70,74,M,M,0.14,...,0,0,0,0,0,0,1,1,0,0
2067,2,2012-08-22,84,72,M,51,61,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2211,2,2013-05-02,71,42,M,39,45,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2501,2,2013-09-24,91,52,M,48,54,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2511,2,2013-09-29,84,53,M,48,54,M,M,0.22,...,0,0,0,1,0,0,0,0,0,0
2525,2,2013-10-06,76,48,M,44,50,M,M,0.06,...,0,0,0,1,0,0,0,0,0,0
2579,2,2014-05-02,80,47,M,43,47,M,M,0.04,...,0,0,0,1,0,0,0,0,0,0


In [60]:
weather.loc[weather[weather['Tavg']=='M'].index]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,...,gr,hz,mifg,ra,sn,sq,ts,tsra,vcfg,vcts
7,2,2007-05-04,78,51,M,42,50,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
505,2,2008-07-08,86,46,M,68,71,M,M,0.28,...,0,0,0,1,0,0,1,0,0,0
675,2,2008-10-01,62,46,M,41,47,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
1637,2,2011-07-22,100,71,M,70,74,M,M,0.14,...,0,0,0,0,0,0,1,1,0,0
2067,2,2012-08-22,84,72,M,51,61,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2211,2,2013-05-02,71,42,M,39,45,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2501,2,2013-09-24,91,52,M,48,54,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2511,2,2013-09-29,84,53,M,48,54,M,M,0.22,...,0,0,0,1,0,0,0,0,0,0
2525,2,2013-10-06,76,48,M,44,50,M,M,0.06,...,0,0,0,1,0,0,0,0,0,0
2579,2,2014-05-02,80,47,M,43,47,M,M,0.04,...,0,0,0,1,0,0,0,0,0,0


In [61]:
# impute with average of Tmax and Tmin if Tavg is missing
weather['Tavg']=weather.apply(lambda x:(x['Tmax']+x['Tmin'])//2 if x['Tavg']=='M' else x['Tavg'],axis=1)

In [62]:
weather[weather['Heat']=='M']

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,...,gr,hz,mifg,ra,sn,sq,ts,tsra,vcfg,vcts
7,2,2007-05-04,78,51,64,42,50,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
505,2,2008-07-08,86,46,66,68,71,M,M,0.28,...,0,0,0,1,0,0,1,0,0,0
675,2,2008-10-01,62,46,54,41,47,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
1637,2,2011-07-22,100,71,85,70,74,M,M,0.14,...,0,0,0,0,0,0,1,1,0,0
2067,2,2012-08-22,84,72,78,51,61,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2211,2,2013-05-02,71,42,56,39,45,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2501,2,2013-09-24,91,52,71,48,54,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2511,2,2013-09-29,84,53,68,48,54,M,M,0.22,...,0,0,0,1,0,0,0,0,0,0
2525,2,2013-10-06,76,48,62,44,50,M,M,0.06,...,0,0,0,1,0,0,0,0,0,0
2579,2,2014-05-02,80,47,63,43,47,M,M,0.04,...,0,0,0,1,0,0,0,0,0,0


In [63]:
weather[weather['Cool']=='M']

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,...,gr,hz,mifg,ra,sn,sq,ts,tsra,vcfg,vcts
7,2,2007-05-04,78,51,64,42,50,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
505,2,2008-07-08,86,46,66,68,71,M,M,0.28,...,0,0,0,1,0,0,1,0,0,0
675,2,2008-10-01,62,46,54,41,47,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
1637,2,2011-07-22,100,71,85,70,74,M,M,0.14,...,0,0,0,0,0,0,1,1,0,0
2067,2,2012-08-22,84,72,78,51,61,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2211,2,2013-05-02,71,42,56,39,45,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2501,2,2013-09-24,91,52,71,48,54,M,M,0.0,...,0,0,0,0,0,0,0,0,0,0
2511,2,2013-09-29,84,53,68,48,54,M,M,0.22,...,0,0,0,1,0,0,0,0,0,0
2525,2,2013-10-06,76,48,62,44,50,M,M,0.06,...,0,0,0,1,0,0,0,0,0,0
2579,2,2014-05-02,80,47,63,43,47,M,M,0.04,...,0,0,0,1,0,0,0,0,0,0


In [64]:
weather['Heat']=weather['Heat'].apply(lambda x:0 if x=='M' else x)
weather['Cool']=weather['Cool'].apply(lambda x:0 if x=='M' else x)
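The rows shown earlier suggest that `Heat` and `Cool` are heating and cooling degree days relative to 65°F: for example, a `Tavg` of 67 comes with `Heat` 0 and `Cool` 2, and a `Tavg` of 51 comes with `Heat` 14 and `Cool` 0. Under that assumption, the missing values could be derived from the imputed `Tavg` instead of being set to 0. The function name `degree_days` is hypothetical.

```python
BASE_TEMP_F = 65  # base temperature implied by the weather rows shown above

def degree_days(tavg):
    """Return (heat, cool) degree days for an average temperature in °F."""
    heat = max(BASE_TEMP_F - tavg, 0)
    cool = max(tavg - BASE_TEMP_F, 0)
    return heat, cool

# Tavg values taken from the weather output above
for tavg in (67, 51, 56):
    print(tavg, degree_days(tavg))
```

For the first imputed row (`Tavg` of 64), this would give `Heat` 1 and `Cool` 0 rather than 0 for both.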

In [65]:
weather[weather['WetBulb']=='M']

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,...,gr,hz,mifg,ra,sn,sq,ts,tsra,vcfg,vcts
848,1,2009-06-26,86,69,78,60,M,0,13,0.0,...,0,0,0,0,0,0,0,0,0,0
2410,1,2013-08-10,81,64,73,57,M,0,8,0.0,...,0,0,0,0,0,0,0,0,0,0
2412,1,2013-08-11,81,60,71,61,M,0,6,0.01,...,0,0,0,1,0,0,0,0,0,0
2415,2,2013-08-12,85,69,77,63,M,0,12,0.66,...,0,0,0,1,0,0,0,0,0,0


In [66]:
np.floor(weather[(weather['DewPoint']==60)&(weather['WetBulb']!='M')]['WetBulb'].astype(int).mean())

64.0

In [67]:
np.floor(weather[(weather['DewPoint']==57)&(weather['WetBulb']!='M')]['WetBulb'].astype(int).mean())

62.0

In [68]:
np.floor(weather[(weather['DewPoint']==61)&(weather['WetBulb']!='M')]['WetBulb'].astype(int).mean())

65.0

In [69]:
np.floor(weather[(weather['DewPoint']==63)&(weather['WetBulb']!='M')]['WetBulb'].astype(int).mean())

67.0

In [70]:
weather.loc[848,'WetBulb']=64
weather.loc[2410,'WetBulb']=62
weather.loc[2412,'WetBulb']=65
weather.loc[2415,'WetBulb']=67
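The four cells above repeat the same lookup by hand: take the floored mean `WetBulb` of rows with the same `DewPoint` and write it into the missing row. A minimal sketch of the same idea in one step, on toy data, is shown below. It assumes every dew point group has at least one observed wet-bulb value.

```python
import numpy as np
import pandas as pd

# Toy data: 'M' marks a missing wet-bulb reading, as in the weather file
toy = pd.DataFrame({
    'DewPoint': [60, 60, 60, 57, 57],
    'WetBulb': ['64', '65', 'M', '62', 'M'],
})

wet = pd.to_numeric(toy['WetBulb'], errors='coerce')          # 'M' -> NaN
group_mean = wet.groupby(toy['DewPoint']).transform('mean')   # mean per dew point, NaN ignored
toy['WetBulb'] = wet.fillna(np.floor(group_mean)).astype(int)
print(toy)
```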

In [71]:
weather[weather['StnPressure']=='M']

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,...,gr,hz,mifg,ra,sn,sq,ts,tsra,vcfg,vcts
87,2,2007-06-13,86,68,77,53,62,0,12,0.0,...,0,0,0,0,0,0,0,0,0,0
848,1,2009-06-26,86,69,78,60,64,0,13,0.0,...,0,0,0,0,0,0,0,0,0,0
2410,1,2013-08-10,81,64,73,57,62,0,8,0.0,...,0,0,0,0,0,0,0,0,0,0
2411,2,2013-08-10,81,68,75,55,63,0,10,0.0,...,0,0,0,0,0,0,0,0,0,0


In [72]:
# treat remaining 'M' markers as NaN, then impute pressure columns with the column mean
weather.replace('M',np.nan,inplace=True)
weather['StnPressure']=weather['StnPressure'].fillna(round(weather['StnPressure'].astype(float).mean(),2))
weather['SeaLevel']=weather['SeaLevel'].fillna(round(weather['SeaLevel'].astype(float).mean(),2))

In [73]:
weather['AvgSpeed']

0        9.2
1        9.6
2       13.4
3       13.4
4       11.9
        ... 
2939     9.0
2940     5.5
2941     6.5
2942    22.9
2943    22.6
Name: AvgSpeed, Length: 2944, dtype: object

In [74]:
# fall back to ResultSpeed when AvgSpeed is missing
weather['AvgSpeed']=weather.apply(lambda x: float(x['ResultSpeed']) \
                                  if np.isnan(float(x['AvgSpeed'])) else float(x['AvgSpeed']),axis=1)

In [75]:
weather=weather.astype({
    'Tavg':int,
    'WetBulb':int,
    'Heat':int,
    'Cool':int,
    'PrecipTotal':float,
    'StnPressure':float,
    'SeaLevel':float,
})

In [76]:
weather.reset_index(inplace=True, drop=True)
rename_cols(weather)

Unnamed: 0,station,date,tmax,tmin,tavg,dewpoint,wetbulb,heat,cool,preciptotal,...,gr,hz,mifg,ra,sn,sq,ts,tsra,vcfg,vcts
0,1,2007-05-01,83,50,67,51,56,0,2,0.00,...,0,0,0,0,0,0,0,0,0,0
1,2,2007-05-01,84,52,68,51,57,0,3,0.00,...,0,0,0,0,0,0,0,0,0,0
2,1,2007-05-02,59,42,51,42,47,14,0,0.00,...,0,0,0,0,0,0,0,0,0,0
3,2,2007-05-02,60,43,52,42,47,13,0,0.00,...,0,1,0,0,0,0,0,0,0,0
4,1,2007-05-03,66,46,56,40,48,9,0,0.00,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2939,2,2014-10-29,49,40,45,34,42,20,0,0.00,...,0,0,0,0,0,0,0,0,0,0
2940,1,2014-10-30,51,32,42,34,40,23,0,0.00,...,0,0,0,0,0,0,0,0,0,0
2941,2,2014-10-30,53,37,45,35,42,20,0,0.00,...,0,0,0,1,0,0,0,0,0,0
2942,1,2014-10-31,47,33,40,25,33,25,0,0.03,...,0,0,0,1,1,0,0,0,0,0


In [77]:
weather_s1=weather[weather['station']==1]
weather_s2=weather[weather['station']==2]

In [78]:
# Export weather data for EDA
weather_s1.to_csv('../datasets/weather_clean_s1.csv', index=False)
weather_s2.to_csv('../datasets/weather_clean_s2.csv', index=False)

In [79]:
# Rolling sum of weather conditions for past 14 days
weather_conditions = ['bcfg','br','dz','fg','fg+','fu','gr','hz','mifg','ra','sn','sq','ts','tsra','vcfg','vcts']

for weather_cond in weather_conditions:
    weather_s1[f'{weather_cond}_rolling_14'] = weather_s1[weather_cond].rolling(14).sum()
    weather_s2[f'{weather_cond}_rolling_14'] = weather_s2[weather_cond].rolling(14).sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  weather_s1[f'{weather_cond}_rolling_14'] = weather_s1[weather_cond].rolling(14).sum()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  weather_s2[f'{weather_cond}_rolling_14'] = weather_s2[weather_cond].rolling(14).sum()
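The warning above appears because `weather_s1` and `weather_s2` were created by filtering `weather`, so pandas cannot tell whether new columns are meant for the slice or the original frame. Taking an explicit `.copy()` when filtering is one common way to avoid it. The sketch below shows this on toy data and also illustrates that a rolling window returns NaN until it is full. A window of 3 is used for brevity instead of 14, and the column names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'station': [1, 2, 1, 2, 1, 2],
    'ra': [1, 0, 0, 1, 1, 1],
})

# .copy() makes an independent frame, so adding columns does not warn
s1 = toy[toy['station'] == 1].copy()

# Full window: NaN until 3 rows are available
s1['ra_rolling_3'] = s1['ra'].rolling(3).sum()
# min_periods=1: use however many rows exist so far
s1['ra_rolling_3_partial'] = s1['ra'].rolling(3, min_periods=1).sum()
print(s1)
```

With a 14-row window, the first 13 rows of each station have no rolling value unless `min_periods` is lowered.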


In [80]:
# Rolling average of weather data for past 14 days
weather_data = ['tmax','tmin','tavg','dewpoint','wetbulb','heat','cool',
                'preciptotal','stnpressure','sealevel','resultspeed','resultdir','avgspeed']

for data in weather_data:
    weather_s1[f'{data}_rolling_14'] = weather_s1[data].rolling(14).mean()
    weather_s2[f'{data}_rolling_14'] = weather_s2[data].rolling(14).mean()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  weather_s1[f'{data}_rolling_14'] = weather_s1[data].rolling(14).mean()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  weather_s2[f'{data}_rolling_14'] = weather_s2[data].rolling(14).mean()


In [81]:
# Create columns for different lag days
lags = [3, 5, 7, 10]

columns = ['bcfg_rolling_14','br_rolling_14','dz_rolling_14','fg_rolling_14','fg+_rolling_14','fu_rolling_14','gr_rolling_14',
           'hz_rolling_14','mifg_rolling_14','ra_rolling_14','sn_rolling_14','sq_rolling_14','ts_rolling_14','tsra_rolling_14',
           'vcfg_rolling_14','vcts_rolling_14','tmax_rolling_14','tmin_rolling_14','tavg_rolling_14','dewpoint_rolling_14',
           'wetbulb_rolling_14','heat_rolling_14','cool_rolling_14','preciptotal_rolling_14','stnpressure_rolling_14',
           'sealevel_rolling_14','resultspeed_rolling_14','resultdir_rolling_14','avgspeed_rolling_14']

for col in columns:
    for lag in lags:
        weather_s1[f'{col}_lag_{lag}'] = weather_s1[col].shift(lag)
        weather_s2[f'{col}_lag_{lag}'] = weather_s2[col].shift(lag)
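
As a rough illustration of what `shift` does here, the sketch below lags a toy rolling feature. The frame, the shortened `lags` list and the `toy_lagged` name are assumptions made for the example. Note that `shift(n)` moves values by `n` rows, so it equals an `n`-day lag only when each station has exactly one row per day. Building all lagged columns in one `pd.concat` call is an alternative to assigning them one by one in a loop, which can fragment a wide frame.

```python
import pandas as pd

# Toy rolling feature for one station, one row per day (made-up values).
toy = pd.DataFrame({'tavg_rolling_14': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
lags = [1, 3]  # shortened for the example; the notebook uses [3, 5, 7, 10]

# Build every lagged column first, then attach them in a single concat.
lagged = pd.concat(
    {f'{col}_lag_{lag}': toy[col].shift(lag)
     for col in toy.columns for lag in lags},
    axis=1,
)
toy_lagged = pd.concat([toy, lagged], axis=1)

print(toy_lagged)
```

The first `lag` rows of each lagged column are `NaN`, because there is no earlier row to pull a value from.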



In [82]:
# Recombine the two stations into a single frame, ordered by the original row index
weather = pd.concat([weather_s1, weather_s2], axis=0).sort_index()

## Merge weather with train/test

In [83]:
grouped_train.shape

(8475, 15)

In [84]:
weather.shape

(2944, 176)

In [85]:
# Merge weather data to train and test data based on date and weather station
train_weather = grouped_train.merge(weather, how='left', on = ['date','station'])
test_weather = grouped_test.merge(weather, how='left', on = ['date','station'])
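
A left merge can silently duplicate rows (if the right-hand keys are not unique) or leave rows without weather data (if a key is missing). The sketch below shows one way to guard against both using the `validate` and `indicator` arguments of `DataFrame.merge`. The `traps` and `weather_toy` frames are made up for illustration, and this check is a suggestion rather than something the notebook runs.

```python
import pandas as pd

# Toy trap records and weather rows (illustrative values only).
traps = pd.DataFrame({
    'date': ['2007-05-29', '2007-05-29', '2007-06-05'],
    'station': [1, 2, 1],
    'trap': ['T002', 'T007', 'T015'],
})
weather_toy = pd.DataFrame({
    'date': ['2007-05-29', '2007-05-29'],
    'station': [1, 2],
    'tavg': [74, 77],
})

merged = traps.merge(
    weather_toy,
    how='left',
    on=['date', 'station'],
    validate='many_to_one',  # raises if (date, station) repeats in weather_toy
    indicator=True,          # adds a '_merge' column describing each row's match
)

# Rows that found no matching weather record.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(traps), len(merged), len(unmatched))
```

With `many_to_one` validation in place, the merged frame is guaranteed to have the same number of rows as the left frame, so comparing shapes before and after becomes a simple sanity check.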

In [86]:
weather.loc[weather['date']=='2007-05-29']

Unnamed: 0,station,date,tmax,tmin,tavg,dewpoint,wetbulb,heat,cool,preciptotal,...,resultspeed_rolling_14_lag_7,resultspeed_rolling_14_lag_10,resultdir_rolling_14_lag_3,resultdir_rolling_14_lag_5,resultdir_rolling_14_lag_7,resultdir_rolling_14_lag_10,avgspeed_rolling_14_lag_3,avgspeed_rolling_14_lag_5,avgspeed_rolling_14_lag_7,avgspeed_rolling_14_lag_10
56,1,2007-05-29,88,60,74,58,65,0,9,0.0,...,8.671429,8.6,17.5,16.357143,15.357143,15.857143,11.021429,11.721429,10.378571,10.535714
57,2,2007-05-29,88,65,77,59,66,0,12,0.0,...,8.435714,8.421429,16.142857,17.357143,15.785714,16.214286,10.314286,11.014286,9.828571,9.95


In [87]:
# Spot-check the merge: station 1 traps should carry the station 1 weather values shown above
train_weather.loc[train_weather['date']=='2007-05-29'].head()

Unnamed: 0,date,trap,lat,long,station,accuracy,species_ERRATICUS,species_PIPIENS,species_PIPIENS/RESTUANS,species_RESTUANS,...,resultspeed_rolling_14_lag_7,resultspeed_rolling_14_lag_10,resultdir_rolling_14_lag_3,resultdir_rolling_14_lag_5,resultdir_rolling_14_lag_7,resultdir_rolling_14_lag_10,avgspeed_rolling_14_lag_3,avgspeed_rolling_14_lag_5,avgspeed_rolling_14_lag_7,avgspeed_rolling_14_lag_10
0,2007-05-29,T002,41.95469,-87.800991,1,9,0,0,1,0,...,8.671429,8.6,17.5,16.357143,15.357143,15.857143,11.021429,11.721429,10.378571,10.535714
1,2007-05-29,T002,41.95469,-87.800991,1,9,0,0,0,1,...,8.671429,8.6,17.5,16.357143,15.357143,15.857143,11.021429,11.721429,10.378571,10.535714
2,2007-05-29,T007,41.994991,-87.769279,1,9,0,0,0,1,...,8.671429,8.6,17.5,16.357143,15.357143,15.857143,11.021429,11.721429,10.378571,10.535714
3,2007-05-29,T015,41.974089,-87.824812,1,8,0,0,1,0,...,8.671429,8.6,17.5,16.357143,15.357143,15.857143,11.021429,11.721429,10.378571,10.535714
4,2007-05-29,T015,41.974089,-87.824812,1,8,0,0,0,1,...,8.671429,8.6,17.5,16.357143,15.357143,15.857143,11.021429,11.721429,10.378571,10.535714


In [88]:
test_weather.head()

Unnamed: 0,id,date,trap,lat,long,station,accuracy,species_ERRATICUS,species_PIPIENS,species_PIPIENS/RESTUANS,...,resultspeed_rolling_14_lag_7,resultspeed_rolling_14_lag_10,resultdir_rolling_14_lag_3,resultdir_rolling_14_lag_5,resultdir_rolling_14_lag_7,resultdir_rolling_14_lag_10,avgspeed_rolling_14_lag_3,avgspeed_rolling_14_lag_5,avgspeed_rolling_14_lag_7,avgspeed_rolling_14_lag_10
0,1,2008-06-11,T002,41.95469,-87.800991,1,9,0,0,1,...,7.985714,8.321429,14.071429,13.214286,10.928571,14.785714,10.842857,10.478571,9.421429,9.771429
1,2,2008-06-11,T002,41.95469,-87.800991,1,9,0,0,0,...,7.985714,8.321429,14.071429,13.214286,10.928571,14.785714,10.842857,10.478571,9.421429,9.771429
2,3,2008-06-11,T002,41.95469,-87.800991,1,9,0,1,0,...,7.985714,8.321429,14.071429,13.214286,10.928571,14.785714,10.842857,10.478571,9.421429,9.771429
3,4,2008-06-11,T002,41.95469,-87.800991,1,9,0,0,0,...,7.985714,8.321429,14.071429,13.214286,10.928571,14.785714,10.842857,10.478571,9.421429,9.771429
4,5,2008-06-11,T002,41.95469,-87.800991,1,9,0,0,0,...,7.985714,8.321429,14.071429,13.214286,10.928571,14.785714,10.842857,10.478571,9.421429,9.771429


In [89]:
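# Rank numeric features by their correlation with WNV presence
# (numeric_only=True skips non-numeric columns such as 'trap'; required on pandas >= 2.0)
train_weather.corr(numeric_only=True)['wnv'].sort_values(ascending=False).head(60)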

wnv                             1.000000
num_mos                         0.485083
dewpoint_rolling_14_lag_7       0.123059
dewpoint_rolling_14_lag_5       0.119969
wetbulb_rolling_14_lag_7        0.118706
wetbulb_rolling_14_lag_10       0.118518
dewpoint_rolling_14             0.118297
dewpoint_rolling_14_lag_10      0.117584
dewpoint_rolling_14_lag_3       0.116146
wetbulb_rolling_14_lag_5        0.114342
tmax_rolling_14_lag_10          0.107532
wetbulb_rolling_14_lag_3        0.106995
wetbulb_rolling_14              0.106450
tavg_rolling_14_lag_10          0.106093
cool_rolling_14_lag_10          0.105307
tmin_rolling_14_lag_10          0.101629
tavg_rolling_14_lag_7           0.100321
tmin_rolling_14_lag_7           0.099851
cool_rolling_14_lag_7           0.098788
tmax_rolling_14_lag_7           0.097068
species_PIPIENS                 0.094056
tmin_rolling_14_lag_5           0.093537
tavg_rolling_14_lag_5           0.092326
cool_rolling_14_lag_5           0.090365
tmax_rolling_14_
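
Sorting correlations in descending order surfaces only the positively correlated features; a strongly negative feature would land at the bottom of the list. A possible complement, sketched below on made-up data, is to rank by absolute correlation instead. The `toy` frame and the `ranked` name are assumptions for the example, and `numeric_only` requires pandas 1.5 or later.

```python
import pandas as pd

# Toy data: one feature rising with wnv, one falling, plus a text column.
toy = pd.DataFrame({
    'wnv': [0, 0, 1, 1, 0, 1],
    'dewpoint': [50, 52, 60, 63, 51, 61],
    'heat': [9, 8, 1, 0, 7, 2],
    'trap': ['T1', 'T2', 'T3', 'T4', 'T5', 'T6'],
})

# Correlation of every numeric column with the target, excluding the target itself.
corr = toy.corr(numeric_only=True)['wnv'].drop('wnv')

# Order by strength of association while keeping the original sign.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Pearson correlation against a binary target only captures linear association, so a ranking like this is best read as a first screen rather than a measure of feature importance.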

In [90]:
train_weather.columns

Index(['date', 'trap', 'lat', 'long', 'station', 'accuracy',
       'species_ERRATICUS', 'species_PIPIENS', 'species_PIPIENS/RESTUANS',
       'species_RESTUANS',
       ...
       'resultspeed_rolling_14_lag_7', 'resultspeed_rolling_14_lag_10',
       'resultdir_rolling_14_lag_3', 'resultdir_rolling_14_lag_5',
       'resultdir_rolling_14_lag_7', 'resultdir_rolling_14_lag_10',
       'avgspeed_rolling_14_lag_3', 'avgspeed_rolling_14_lag_5',
       'avgspeed_rolling_14_lag_7', 'avgspeed_rolling_14_lag_10'],
      dtype='object', length=189)

In [91]:
test_weather.columns

Index(['id', 'date', 'trap', 'lat', 'long', 'station', 'accuracy',
       'species_ERRATICUS', 'species_PIPIENS', 'species_PIPIENS/RESTUANS',
       ...
       'resultspeed_rolling_14_lag_7', 'resultspeed_rolling_14_lag_10',
       'resultdir_rolling_14_lag_3', 'resultdir_rolling_14_lag_5',
       'resultdir_rolling_14_lag_7', 'resultdir_rolling_14_lag_10',
       'avgspeed_rolling_14_lag_3', 'avgspeed_rolling_14_lag_5',
       'avgspeed_rolling_14_lag_7', 'avgspeed_rolling_14_lag_10'],
      dtype='object', length=188)

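Based on initial modelling results, we keep only the features that proved most important. Each retained weather variable appears in its raw form, as a 14-day rolling value, and as a lagged rolling value (lagged 10 days for temperature-related variables and 7 days for precipitation-related ones), alongside the mosquito species indicators, trap coordinates and date.

The train and test feature lists are identical, except that train contains the target 'wnv' and test contains 'id'.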

In [92]:
# Temperature related: raw value, 14-day rolling value, rolling value lagged 10 days
# Precipitation related: raw value, 14-day rolling value, rolling value lagged 7 days
common_features = ['wetbulb', 'wetbulb_rolling_14', 'wetbulb_rolling_14_lag_10',
                    'dewpoint', 'dewpoint_rolling_14', 'dewpoint_rolling_14_lag_10',
                    'tmin', 'tmin_rolling_14', 'tmin_rolling_14_lag_10',
                    'tavg', 'tavg_rolling_14', 'tavg_rolling_14_lag_10',
                    'tmax', 'tmax_rolling_14', 'tmax_rolling_14_lag_10',
                    'species_PIPIENS', 'species_PIPIENS/RESTUANS', 'species_RESTUANS',
                    'species_SALINARIUS', 'species_ERRATICUS', 'species_TARSALIS', 'species_TERRITANS',
                    'ts', 'ts_rolling_14', 'ts_rolling_14_lag_7',
                    'tsra', 'tsra_rolling_14', 'tsra_rolling_14_lag_7',
                    'vcts', 'vcts_rolling_14', 'vcts_rolling_14_lag_7',
                    'ra', 'ra_rolling_14', 'ra_rolling_14_lag_7',
                    'br', 'br_rolling_14', 'br_rolling_14_lag_7',
                    'hz', 'hz_rolling_14', 'hz_rolling_14_lag_7',
                    'preciptotal', 'preciptotal_rolling_14', 'preciptotal_rolling_14_lag_7',
                    'avgspeed', 'avgspeed_rolling_14',
                    'lat', 'long',
                    'cool',
                    'heat',
                    'date']

# Train keeps the target column; test keeps the submission id
train_features = common_features + ['wnv']
test_features = common_features + ['id']

In [93]:
train_weather = train_weather[train_features]
test_weather = test_weather[test_features]
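
The claim that the two feature lists differ only in 'wnv' and 'id' can be checked mechanically rather than by eye. The sketch below uses shortened, made-up lists and a toy frame; the names `common`, `train_cols`, `test_cols` and `train_toy` are hypothetical stand-ins for the notebook's lists and frames.

```python
import pandas as pd

# Shortened stand-ins for the notebook's feature lists (illustrative only).
common = ['tavg', 'lat', 'date']
train_cols = common + ['wnv']
test_cols = common + ['id']

# Toy frame standing in for the merged training data.
train_toy = pd.DataFrame({
    'tavg': [74], 'lat': [41.95], 'date': ['2007-05-29'],
    'wnv': [0], 'trap': ['T002'],
})

# Columns that appear in only one of the two lists.
only_in_train = set(train_cols) - set(test_cols)
only_in_test = set(test_cols) - set(train_cols)

# Requested columns that the frame does not actually have;
# selecting them would raise a KeyError.
missing_from_train = [c for c in train_cols if c not in train_toy.columns]

print(only_in_train, only_in_test, missing_from_train)
```

Running a check like this before the column selection makes a typo in a feature name fail early, with a readable message, instead of surfacing later during modelling.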

In [94]:
train_weather.to_csv('../datasets/train_weather.csv', index=False)
test_weather.to_csv('../datasets/test_weather.csv', index=False)
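
One thing to keep in mind when these files are read back: CSV does not store column types, so a datetime column returns as plain text unless it is parsed again. The sketch below demonstrates this with an in-memory buffer instead of the real files. It assumes 'date' was a datetime column before saving, which the notebook does not confirm at this point.

```python
import io

import pandas as pd

# Toy frame with a datetime column (made-up values).
toy = pd.DataFrame({
    'date': pd.to_datetime(['2007-05-29', '2007-06-05']),
    'wnv': [0, 1],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without hints leaves 'date' as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['date'])

print(plain['date'].dtype, parsed['date'].dtype)
```

Passing `parse_dates=['date']` when loading `train_weather.csv` and `test_weather.csv` would keep any later date-based feature engineering working as it did before the save.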