### In this competition, you will be analyzing weather data and GIS data and predicting whether or not West Nile virus is present, for a given _time_, _location_, and _species_. 

#### Questions

- Do we have spray data from just two years 2011 and 2013, yes! , is 2011 when city started to spray?
- How does spraying impact wnv presence?

- Let's looking at training data by year


In [380]:
import pandas as pd
import numpy as np
import geopy as gp
from geopy.geocoders import Nominatim
from geopy.distance import vincenty


In [381]:
geolocator = Nominatim()
location = geolocator.geocode("175 5th Avenue NYC")
print(location)

Flatiron Building, 175, 5th Avenue, Flatiron Building, Manhattan Community Board 5, New York County, NYC, New York, 10010, United States of America


#### Read in csv data files

In [382]:
train = pd.read_csv('./assets/train.csv')
#test = pd.read_csv('./assets/test.csv')
spray = pd.read_csv('./assets/spray.csv')
weather = pd.read_csv('./assets/weather.csv')

print(train.shape)
print(spray.shape)
print(weather.shape)

(10506, 12)
(14835, 4)
(2944, 22)


### Spray Data

In [24]:
spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude,Day,Month,Year
0,2011-08-29,6:56:58 PM,42.391623,-88.089163,29,8,2011
1,2011-08-29,6:57:08 PM,42.391348,-88.089163,29,8,2011
2,2011-08-29,6:57:18 PM,42.391022,-88.089157,29,8,2011
3,2011-08-29,6:57:28 PM,42.390637,-88.089158,29,8,2011
4,2011-08-29,6:57:38 PM,42.39041,-88.088858,29,8,2011


In [18]:
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 4 columns):
Date         14835 non-null datetime64[ns]
Time         14835 non-null object
Latitude     14835 non-null float64
Longitude    14835 non-null float64
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 463.7+ KB


In [383]:
# Fill in empty times with zero for now (not sure the exact spray time will come into play)
spray['Time'].fillna(0, inplace=True)

In [384]:
# Convert Date to datetime and break apart
spray['Date'] = pd.to_datetime(spray['Date'])
spray['Year'] = spray['Date'].apply(lambda x: x.year)
spray['Month'] = spray['Date'].apply(lambda x: x.month)
spray['Day'] = spray['Date'].apply(lambda x: x.day)

In [153]:
spray['Date'].value_counts()

2013-08-15    2668
2013-08-29    2302
2013-07-17    2202
2011-09-07    2114
2013-07-25    1607
2013-08-22    1587
2013-08-08    1195
2013-09-05     924
2013-08-16     141
2011-08-29      95
Name: Date, dtype: int64

#### Are the spray coordinates unique for each observations or are the same coordinates being sprayed more than once?

** !! There is some bad data, one location that shows up 541 time for the same Date and Time and another show up twice. This not correct,  Same date might be okay, but same Date and Time is questionable.**

In [385]:
df_spray = pd.DataFrame({'count' : spray.groupby(['Date','Latitude','Longitude','Time', 'Year', 'Month']).size()}).reset_index()
df_spray.sort_values('count', ascending=False).head(10)
# Note look at results, we have one location that show up 541 time for the same Date and Time?   

Unnamed: 0,Date,Latitude,Longitude,Time,Year,Month,count
1208,2011-09-07,41.98646,-87.794225,7:44:32 PM,2011,9,541
1035,2011-09-07,41.983917,-87.793088,7:43:40 PM,2011,9,2
0,2011-08-29,42.38946,-88.093895,7:11:28 PM,2011,8,1
9523,2013-08-22,41.716443,-87.615177,11:38:46 PM,2013,8,1
9524,2013-08-22,41.716572,-87.594697,8:18:20 PM,2013,8,1
9525,2013-08-22,41.716612,-87.616337,11:37:06 PM,2013,8,1
9526,2013-08-22,41.716637,-87.600332,8:21:20 PM,2013,8,1
9527,2013-08-22,41.71665,-87.599487,8:21:10 PM,2013,8,1
9528,2013-08-22,41.716668,-87.598808,8:21:00 PM,2013,8,1
9529,2013-08-22,41.716687,-87.598012,8:20:50 PM,2013,8,1


### Training Data for Mosquito testing

In [386]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 12 columns):
Date                      10506 non-null object
Address                   10506 non-null object
Species                   10506 non-null object
Block                     10506 non-null int64
Street                    10506 non-null object
Trap                      10506 non-null object
AddressNumberAndStreet    10506 non-null object
Latitude                  10506 non-null float64
Longitude                 10506 non-null float64
AddressAccuracy           10506 non-null int64
NumMosquitos              10506 non-null int64
WnvPresent                10506 non-null int64
dtypes: float64(2), int64(4), object(6)
memory usage: 985.0+ KB


In [37]:
train.describe()

Unnamed: 0,Block,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
count,10506.0,10506.0,10506.0,10506.0,10506.0,10506.0
mean,35.687797,41.841139,-87.699908,7.819532,12.853512,0.052446
std,24.339468,0.112742,0.096514,1.452921,16.133816,0.222936
min,10.0,41.644612,-87.930995,3.0,1.0,0.0
25%,12.0,41.732984,-87.76007,8.0,2.0,0.0
50%,33.0,41.846283,-87.694991,8.0,5.0,0.0
75%,52.0,41.95469,-87.627796,9.0,17.0,0.0
max,98.0,42.01743,-87.531635,9.0,50.0,1.0


In [387]:
# Lets convert date from object to datetime.
train['Date'] = pd.to_datetime(train['Date'])
train['Year'] = train['Date'].apply(lambda x: x.year)
train['Month'] = train['Date'].apply(lambda x: x.month)
train['Day'] = train['Date'].apply(lambda x: x.day)

In [30]:
train['Species'].value_counts()

CULEX PIPIENS/RESTUANS    4752
CULEX RESTUANS            2740
CULEX PIPIENS             2699
CULEX TERRITANS            222
CULEX SALINARIUS            86
CULEX TARSALIS               6
CULEX ERRATICUS              1
Name: Species, dtype: int64

In [35]:
len(train['Trap'].value_counts())

136

In [228]:
train['NumMosquitos'].value_counts()

1     2307
2     1300
50    1019
3      896
4      593
5      489
6      398
7      326
8      244
9      237
10     206
11     170
13     163
12     132
16     128
14     120
15     112
17     107
18      92
19      86
21      85
20      79
23      69
27      67
37      61
26      57
24      57
22      56
25      50
39      49
29      48
36      47
31      47
30      44
35      43
28      43
46      43
43      39
32      39
47      37
33      36
48      36
49      35
45      35
38      35
34      31
41      31
42      29
40      28
44      25
Name: NumMosquitos, dtype: int64

In [219]:
train[train['NumMosquitos'] >=45]

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent,Month,Year,Day,spray_date,spray_date2,JustDate
293,2007-07-11,"2200 West 113th Street, Chicago, IL 60643, USA",CULEX PIPIENS/RESTUANS,22,W 113TH ST,T086,"2200 W 113TH ST, Chicago, IL",41.688324,-87.676709,8,50,0,7,2007,11,,0,<built-in method date of Timestamp object at 0...
295,2007-07-11,"2200 West 113th Street, Chicago, IL 60643, USA",CULEX PIPIENS/RESTUANS,22,W 113TH ST,T086,"2200 W 113TH ST, Chicago, IL",41.688324,-87.676709,8,50,0,7,2007,11,,0,<built-in method date of Timestamp object at 0...
350,2007-07-11,"3500 West 116th Street, Chicago, IL 60655, USA",CULEX PIPIENS/RESTUANS,35,W 116TH ST,T158,"3500 W 116TH ST, Chicago, IL",41.682587,-87.707973,9,50,0,7,2007,11,,0,<built-in method date of Timestamp object at 0...
351,2007-07-11,"3500 West 116th Street, Chicago, IL 60655, USA",CULEX PIPIENS/RESTUANS,35,W 116TH ST,T158,"3500 W 116TH ST, Chicago, IL",41.682587,-87.707973,9,50,0,7,2007,11,,0,<built-in method date of Timestamp object at 0...
353,2007-07-11,"3500 West 116th Street, Chicago, IL 60655, USA",CULEX PIPIENS/RESTUANS,35,W 116TH ST,T158,"3500 W 116TH ST, Chicago, IL",41.682587,-87.707973,9,50,0,7,2007,11,,0,<built-in method date of Timestamp object at 0...
529,2007-07-18,"South Doty Avenue, Chicago, IL, USA",CULEX PIPIENS/RESTUANS,12,S DOTY AVE,T115,"1200 S DOTY AVE, Chicago, IL",41.673408,-87.599862,5,50,0,7,2007,18,,0,<built-in method date of Timestamp object at 0...
530,2007-07-18,"South Stony Island Avenue, Chicago, IL, USA",CULEX PIPIENS/RESTUANS,10,S STONY ISLAND AVE,T138,"1000 S STONY ISLAND AVE, Chicago, IL",41.726465,-87.585413,5,50,0,7,2007,18,,0,<built-in method date of Timestamp object at 0...
531,2007-07-18,"South Stony Island Avenue, Chicago, IL, USA",CULEX PIPIENS/RESTUANS,10,S STONY ISLAND AVE,T138,"1000 S STONY ISLAND AVE, Chicago, IL",41.726465,-87.585413,5,50,0,7,2007,18,,0,<built-in method date of Timestamp object at 0...
533,2007-07-18,"South Stony Island Avenue, Chicago, IL, USA",CULEX RESTUANS,10,S STONY ISLAND AVE,T138,"1000 S STONY ISLAND AVE, Chicago, IL",41.726465,-87.585413,5,50,0,7,2007,18,,0,<built-in method date of Timestamp object at 0...
547,2007-07-18,"3700 118th Street, Chicago, IL 60617, USA",CULEX PIPIENS/RESTUANS,37,E 118TH ST,T212,"3700 E 118TH ST, Chicago, IL",41.680946,-87.535198,8,50,0,7,2007,18,,0,<built-in method date of Timestamp object at 0...


In [257]:
pd.DataFrame(train[(train['Trap'] == 'T900') & (train['Year'] == 2013) & (train['Month'] == 9)]).reset_index()

Unnamed: 0,index,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent,Month,Year,Day,spray_date,spray_date2,JustDate
0,10118,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX PIPIENS/RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,50,0,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
1,10119,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX PIPIENS/RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,44,1,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
2,10120,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX PIPIENS/RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,39,1,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
3,10121,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX PIPIENS/RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,38,1,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
4,10122,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX PIPIENS/RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,50,1,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
5,10123,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX PIPIENS/RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,37,0,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
6,10124,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX PIPIENS/RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,50,1,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
7,10125,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,26,0,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
8,10126,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX RESTUANS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,14,1,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...
9,10127,2013-09-06,"ORD Terminal 5, O'Hare International Airport, ...",CULEX PIPIENS,10,W OHARE AIRPORT,T900,"1000 W OHARE AIRPORT, Chicago, IL",41.974689,-87.890615,9,23,1,9,2013,6,,2013-07-17 00:00:00,<built-in method date of Timestamp object at 0...


In [341]:
#df = pd.DataFrame(train.groupby(['Date','Block','Trap','Latitude','Longitude'])['WnvPresent'].sum()).reset_index()
#df['wnv_present_overall'] = np.where(df['WnvPresent'] > 0, 1, 0)
#df.shape

(3480, 12)

In [109]:
#df.sort_values(['wnv_present_overall', 'Block', 'Trap', "Date"], ascending=False)

Unnamed: 0,Date,Block,Trap,Latitude,Longitude,WnvPresent,wnv_present_overall
622,2007-08-07,93,T162,41.725517,-87.614258,1,1
4412,2013-09-06,91,T009,41.992478,-87.862995,1,1
4342,2013-08-29,91,T009,41.992478,-87.862995,3,1
4128,2013-08-08,91,T009,41.992478,-87.862995,1,1
2183,2009-08-25,91,T009,41.992478,-87.862995,1,1
956,2007-08-24,91,T009,41.981964,-87.812827,1,1
739,2007-08-15,91,T009,41.981964,-87.812827,2,1
4197,2013-08-15,90,T226,41.793818,-87.654234,1,1
4127,2013-08-08,90,T226,41.793818,-87.654234,2,1
3911,2013-07-19,90,T226,41.793818,-87.654234,1,1


### Joiniing spray data to mosquito test data

The spray location coordinates will not match the trap coordinates exactly so we will find the nearest spray location. Note provided spray data is limited to 2013 and 2011 and the data for 2011 is very limited.  To find the nearest spray location for a trap... 
- Take the mosquito test year and date to determine appropriate spray data  
- Find the distance from each spray location to the trap location
- Finally take the smallest distance and assume that spray location is the nearest to the trap

In [427]:
# mask = (df['date'] > start_date) & (df['date'] <= end_date)

import datetime

def find_spray_datetime(x):
    location = [x.Latitude, x.Longitude]
    distances = []
    spray_info = []
    # get spray observations for given year AND date/time prior to mosquito test
    df = pd.DataFrame(df_spray[(df_spray['Date'] < x.Date) & (df_spray['Year'] == x.Year)]).reset_index()
    # iterate through resulting set and grab distance in miles, date and lat and long as well
    if df.shape[0] > 0:
        for index, row in df.iterrows():
            dict = {}
            spray_location = [row.Latitude, row.Longitude]
            # get distance between trap and spray location
            distance = vincenty(location, spray_location).miles
            distance = str(round(distance, 6))
            distances.append(distance)
            dict['distance'] = distance
            dict['spray_info'] = '|'.join([row['Date'].strftime('%Y-%m-%d'), distance, str(row.Latitude), str(row.Longitude)])
            spray_info.append(dict)
            
        # sort distances
        distances.sort(reverse=False)
        #print(distances)
        shortest_distance = distances.pop(0)
        for d in spray_info:
            if shortest_distance == d['distance']:
                info = d['spray_info']
                break
        return info
    else:
        # return none if no valid spray info for mosquito test date
        return 'none'
            

In [407]:
# only call the find spray time if year is 2011 or 2013 and month is after July.  
# Spray data does not exist for other times
train['spray_info'] = train[(train['Year'] >= 2011) & (train['Month'] >= 7)].apply(lambda x: find_spray_datetime(x), axis=1)

In [408]:
#train['spray_info'].value_counts()

none                                                         1434
2013-07-17|1.016937|41.9728533333333|-87.8710233333333        170
2011-09-07|2.784849|41.9757483333333|-87.83655999999999        39
2013-07-17|10.002319|42.011858333333294|-87.784405             36
2013-07-17|0.03972|42.0084983333333|-87.77719                  33
2011-08-29|30.930854|42.390395|-88.08831500000001              25
2013-07-17|10.000144|41.723381666666704|-87.656675             25
2013-07-17|0.009117|41.7331116666667|-87.6495966666667         25
2013-07-17|0.970296|41.99808|-87.7637166666667                 24
2013-07-17|10.000037|41.722215000000006|-87.6529733333333      22
2013-07-17|0.270572|41.99799|-87.768085                        20
2013-07-17|10.00371|42.0050183333333|-87.7745716666667         20
2013-07-17|0.168923|41.7211583333333|-87.6627733333333         20
2013-07-17|1.385875|41.973141666666706|-87.8702516666667       20
2013-07-25|0.00201|41.9518783333333|-87.72502166666669         19
2013-07-17

In [413]:
train['spray_info'].fillna('none', inplace=True) # 3629

In [425]:
# Spray info = spray date | distance from nearest spray location in miles | spray latitude | spray longitude
# ! NOTE String is temp, will have to be added as separate columns
train[['Date', 'Species', 'Block', 'Trap', 'spray_info']].sort_values(['Date', 'spray_info'], ascending=False).head(200)

Unnamed: 0,Date,Species,Block,Trap,spray_info
10483,2013-09-26,CULEX RESTUANS,91,T009,2013-09-05|1.200799|41.9809333333333|-87.84554...
10504,2013-09-26,CULEX PIPIENS/RESTUANS,71,T233,2013-09-05|0.027713|42.010101666666706|-87.806...
10480,2013-09-26,CULEX PIPIENS/RESTUANS,40,T221,2013-08-29|10.607458|41.7593966666667|-87.6941...
10481,2013-09-26,CULEX RESTUANS,40,T221,2013-08-29|10.607458|41.7593966666667|-87.6941...
10482,2013-09-26,CULEX PIPIENS,40,T221,2013-08-29|10.607458|41.7593966666667|-87.6941...
10468,2013-09-26,CULEX PIPIENS/RESTUANS,13,T209,2013-08-29|10.003016|41.774173333333295|-87.73...
10469,2013-09-26,CULEX RESTUANS,13,T209,2013-08-29|10.003016|41.774173333333295|-87.73...
10470,2013-09-26,CULEX PIPIENS,13,T209,2013-08-29|10.003016|41.774173333333295|-87.73...
10471,2013-09-26,CULEX PIPIENS/RESTUANS,37,T212,2013-08-29|10.001613|41.7594983333333|-87.69775
10472,2013-09-26,CULEX PIPIENS,37,T212,2013-08-29|10.001613|41.7594983333333|-87.69775


In [418]:
# push to csv file
#train.sort_values('spray_info', ascending=True).to_csv('train_w_spray.csv')

** Group the training data by Year, Month, Trap and Species and determing if WNV is present for that group **

In [344]:
df_train = pd.DataFrame(train.groupby(['Year','Month','Species', 'Block','Trap', 'Street','Latitude', 'Longitude']).agg({'NumMosquitos':'sum','WnvPresent': 'sum', 'Date': 'size'})).reset_index()
 #['WnvPresent'].sum()).reset_index()
df_train['WnvPresentForGroup'] = np.where(df_train['WnvPresent'] > 0, 1, 0)
print(df_train.shape)
df_train.columns = ['Year','Month','Species', 'Block','Trap', 'Street','Latitude', 'Longitude', 'SumNumMosquitos', 'SumWnvPresent', 'GroupCount', 'WnvPresentForGroup']
df_train.sort_values(['Year','Month','Trap', 'WnvPresentForGroup'], ascending=False).head(5)

(3480, 12)


Unnamed: 0,Year,Month,Species,Block,Trap,Street,Latitude,Longitude,SumNumMosquitos,SumWnvPresent,GroupCount,WnvPresentForGroup
3310,2013,9,CULEX PIPIENS,10,T903,W OHARE,41.957799,-87.930995,20,0,1,0
3381,2013,9,CULEX PIPIENS/RESTUANS,10,T903,W OHARE,41.957799,-87.930995,10,0,1,0
3309,2013,9,CULEX PIPIENS,10,T900,W OHARE AIRPORT,41.974689,-87.890615,546,5,23,1
3380,2013,9,CULEX PIPIENS/RESTUANS,10,T900,W OHARE AIRPORT,41.974689,-87.890615,640,9,19,1
3451,2013,9,CULEX RESTUANS,10,T900,W OHARE AIRPORT,41.974689,-87.890615,45,1,3,1


In [54]:
#  Keep has reference to timedelta

#(df['date'] > start_date) & (df['date'] <= end_date)
#import datetime
#spray['DatePlus15'] = spray['Date'].apply(lambda x: x + datetime.timedelta(days=15))
#end_date = date_1 + datetime.timedelta(days=10)

### Weather Data

** _It is believed that hot and dry conditions are more favorable for West Nile virus than cold and wet. We provide you with the dataset from NOAA of the weather conditions of 2007 to 2014, during the months of the tests._** 

Based on the above comment from Kaggle, will focus on **Heating** and **Cooling** days and **Total Precipitation**

Link to NOAA doc that explains heating and cooling days
http://www.cpc.ncep.noaa.gov/products/analysis_monitoring/cdus/degree_days/ddayexp.shtml

In [300]:
print(weather.shape)
weather.head()

(2944, 22)
(2931, 22)


In [490]:
weather.info() 
# No nulls, but will have to change column types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
Station        2944 non-null int64
Date           2944 non-null object
Tmax           2944 non-null int64
Tmin           2944 non-null int64
Tavg           2944 non-null object
Depart         2944 non-null object
DewPoint       2944 non-null int64
WetBulb        2944 non-null object
Heat           2944 non-null object
Cool           2944 non-null object
Sunrise        2944 non-null object
Sunset         2944 non-null object
CodeSum        2944 non-null object
Depth          2944 non-null object
Water1         2944 non-null object
SnowFall       2944 non-null object
PrecipTotal    2944 non-null object
StnPressure    2944 non-null object
SeaLevel       2944 non-null object
ResultSpeed    2944 non-null float64
ResultDir      2944 non-null int64
AvgSpeed       2944 non-null object
dtypes: float64(1), int64(5), object(16)
memory usage: 506.1+ KB


In [None]:
# Drop rows that have an M (missing data)
df_weather = weather[(weather.Cool != 'M') & (weather.Heat != 'M') & (weather.PrecipTotal != 'M')]
print(df_weather.shape)

In [301]:
# Drop rows where precip == T
df_weather = df_weather[df_weather.PrecipTotal.str.replace(' ', '') != 'T']
print(df_weather.shape)

(2614, 22)


In [492]:
df_weather['Cool'] = df_weather['Cool'].astype('int')
df_weather['Heat'] = df_weather['Heat'].astype('int')
df_weather['PrecipTotal'] = df_weather['PrecipTotal'].astype('float')

In [493]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2614 entries, 0 to 2943
Data columns (total 22 columns):
Station        2614 non-null int64
Date           2614 non-null object
Tmax           2614 non-null int64
Tmin           2614 non-null int64
Tavg           2614 non-null object
Depart         2614 non-null object
DewPoint       2614 non-null int64
WetBulb        2614 non-null object
Heat           2614 non-null int64
Cool           2614 non-null int64
Sunrise        2614 non-null object
Sunset         2614 non-null object
CodeSum        2614 non-null object
Depth          2614 non-null object
Water1         2614 non-null object
SnowFall       2614 non-null object
PrecipTotal    2614 non-null float64
StnPressure    2614 non-null object
SeaLevel       2614 non-null object
ResultSpeed    2614 non-null float64
ResultDir      2614 non-null int64
AvgSpeed       2614 non-null object
dtypes: float64(2), int64(7), object(13)
memory usage: 469.7+ KB


** Considered just grabbing data for Station 1, but better to get average of Station data since some records are removed due to Missing and Trace values **

In [317]:
#df_weather1 = df_weather[['Date','Heat', 'Cool', 'PrecipTotal']][df_weather['Station'] == 1]
#df_weather1.head(31)

Unnamed: 0,Date,Heat,Cool,PrecipTotal
0,2007-05-01,0,2,0.0
2,2007-05-02,14,0,0.0
4,2007-05-03,9,0,0.0
10,2007-05-06,6,0,0.0
14,2007-05-08,0,3,0.0
16,2007-05-09,0,4,0.13
18,2007-05-10,0,5,0.0
20,2007-05-11,4,0,0.0
22,2007-05-12,10,0,0.0
24,2007-05-13,9,0,0.0


** Better to get average of Station 1 and Station 2 **

In [323]:
df_weather2 = pd.DataFrame(df_weather.groupby(['Date']).agg({'Heat': 'mean', 'Cool': 'mean', 'PrecipTotal': 'mean'})).reset_index()

In [324]:
# We can round later
#df_weather2['Heat'] = round(df_weather2['Heat'])
#df_weather2['Cool'] = round(df_weather2['Cool'])

In [326]:
df_weather2['Date'] = pd.to_datetime(df_weather2['Date'])
df_weather2['Year'] = df_weather2['Date'].apply(lambda x: x.year)
df_weather2['Month'] = df_weather2['Date'].apply(lambda x: x.month)
df_weather2['Day'] = df_weather2['Date'].apply(lambda x: x.day)

In [495]:
df_weather3 = pd.DataFrame(df_weather2.groupby(['Year', 'Month']).agg({'Heat': 'mean', 'Cool': 'mean', 'PrecipTotal': 'mean'})).reset_index()
print(df_weather3.shape)
df_weather3['Heat'] = round(df_weather3['Heat'])
df_weather3['Cool'] = round(df_weather3['Cool'])
df_weather3['Heat'] = df_weather3['Heat'].astype('int')
df_weather3['Cool'] = df_weather3['Cool'].astype('int')
df_weather3.sort_values(['PrecipTotal'], ascending=False)

(48, 5)


Unnamed: 0,Year,Month,Heat,Cool,PrecipTotal
10,2008,9,1,3,0.382333
20,2010,7,0,14,0.331897
3,2007,8,0,10,0.306667
45,2014,8,0,10,0.275645
43,2014,6,0,7,0.250345
26,2011,7,0,15,0.2475
19,2010,6,0,7,0.2475
13,2009,6,2,6,0.229821
24,2011,5,8,2,0.229815
17,2009,10,15,0,0.208548


In [497]:
df_weather3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
Year           48 non-null int64
Month          48 non-null int64
Heat           48 non-null int64
Cool           48 non-null int64
PrecipTotal    48 non-null float64
dtypes: float64(1), int64(4)
memory usage: 2.0 KB


In [496]:
df_weather3.head()

Unnamed: 0,Year,Month,Heat,Cool,PrecipTotal
0,2007,5,3,3,0.063621
1,2007,6,1,7,0.096786
2,2007,7,0,9,0.112419
3,2007,8,0,10,0.306667
4,2007,9,2,6,0.045


In [429]:
df_weather3[(df_weather3['Year'] == 2007) & (df_weather3['Month'] == 7) ]

Unnamed: 0,Year,Month,Heat,Cool,PrecipTotal
2,2007,7,0.0,9.0,0.112419


In [483]:
df_weather3.sort_values('PrecipTotal', ascending=False)

Unnamed: 0,Year,Month,Heat,Cool,PrecipTotal
10,2008,9,1.0,3.0,0.382333
20,2010,7,0.0,14.0,0.331897
3,2007,8,0.0,10.0,0.306667
45,2014,8,0.0,10.0,0.275645
43,2014,6,0.0,7.0,0.250345
26,2011,7,0.0,15.0,0.2475
19,2010,6,0.0,7.0,0.2475
13,2009,6,2.0,6.0,0.229821
24,2011,5,8.0,2.0,0.229815
17,2009,10,15.0,0.0,0.208548


In [498]:
df_weather3.sort_values('Heat', ascending=False).head()

Unnamed: 0,Year,Month,Heat,Cool,PrecipTotal
17,2009,10,15,0,0.208548
47,2014,10,12,0,0.103
11,2008,10,12,1,0.08375
35,2012,10,12,0,0.0995
41,2013,10,11,1,0.148214


### Join monthly weather averages to mosquito test data 

**The following method is a quick way to append month averages to the mosquito test data.
The three values in the delimited string will ultimately need to be in separate columns**

In [499]:
def add_weather_data(row):
    df3 = df_weather3[(df_weather3['Year'] == row['Year']) & (df_weather3['Month'] == row['Month'])]
    precip = str(round(df3['PrecipTotal'].values[0], 4))
    heat = str(df3['Heat'].values[0])
    cool = str(df3['Cool'].values[0])
    return '|'.join([heat, cool, precip])

In [500]:
df_train['Weather_Info'] = df_train.apply(lambda x: add_weather_data(x), axis=1 )

In [504]:
df_train[['Year', 'Month', 'Trap','Weather_Info']].head(20)

Unnamed: 0,Year,Month,Trap,Weather_Info
0,2007,5,T096,3|3|0.0636
1,2007,5,T048,3|3|0.0636
2,2007,5,T050,3|3|0.0636
3,2007,5,T054,3|3|0.0636
4,2007,5,T086,3|3|0.0636
5,2007,5,T002,3|3|0.0636
6,2007,5,T129,3|3|0.0636
7,2007,5,T143,3|3|0.0636
8,2007,5,T148,3|3|0.0636
9,2007,5,T015,3|3|0.0636


In [506]:
def get_heat(x):
    values = x.split('|')
    return values[0]

def get_cool(x):
    values = x.split('|')
    return values[1]

def get_precip(x):
    values = x.split('|')
    return values[2]


In [507]:
df_train['HeatMonthAvg'] = df_train['Weather_Info'].apply(get_heat)

In [509]:
df_train['CoolMonthAvg'] = df_train['Weather_Info'].apply(get_cool)
df_train['PrecipMonthAvg'] = df_train['Weather_Info'].apply(get_precip)

In [513]:
df_train[['Year', 'Month', 'Trap','HeatMonthAvg','CoolMonthAvg','PrecipMonthAvg']].head(20)

Unnamed: 0,Year,Month,Trap,HeatMonthAvg,CoolMonthAvg,PrecipMonthAvg
0,2007,5,T096,3,3,0.0636
1,2007,5,T048,3,3,0.0636
2,2007,5,T050,3,3,0.0636
3,2007,5,T054,3,3,0.0636
4,2007,5,T086,3,3,0.0636
5,2007,5,T002,3,3,0.0636
6,2007,5,T129,3,3,0.0636
7,2007,5,T143,3,3,0.0636
8,2007,5,T148,3,3,0.0636
9,2007,5,T015,3,3,0.0636
