# Test Data Summary
This dataset contains mosquito trap records from the city of Chicago. Because it is the **test** data, it does not indicate whether West Nile Virus is present.

## Data Cleaning
#### Type Conversions
- Date must be converted to a datetime (timeseries) type
- It begins as an object/string

#### Dummies
- Traps
- Species

#### Nulls
- No nulls present in testing data


## Final Data Types
#### Strings/Objects
- Address
- (original) Species 
- Street
- (original) Trap
- Address Number and Street
#### Floats
- Latitude
- Longitude
#### Integers
- ID
- Block
- Address Accuracy
- (dummy) Trap
- (dummy) Species
#### Timeseries
- Date

### Import Libraries and Data

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 200)
from datetime import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

test = pd.read_csv('../project-4/project-4/wnv-data/test.csv')

print(test.shape)
test.head(2)

(116293, 11)


Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9


### Check for Nulls

In [2]:
test.isnull().sum()

Id                        0
Date                      0
Address                   0
Species                   0
Block                     0
Street                    0
Trap                      0
AddressNumberAndStreet    0
Latitude                  0
Longitude                 0
AddressAccuracy           0
dtype: int64
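
The null check above can also be written as an assertion, so that the notebook fails loudly if a later version of the file contains missing values. This is a minimal sketch on a toy frame; `toy` and its values are illustrative stand-ins, not rows from the real file.

```python
import pandas as pd

# Toy stand-in for the test data (illustrative values only)
toy = pd.DataFrame({
    "Trap": ["T002", "T002", "T009"],
    "Latitude": [41.95469, 41.95469, 41.99],
})

# Count nulls per column, then across the whole frame
null_counts = toy.isnull().sum()
total_nulls = int(null_counts.sum())

# Stop early if any column has missing values
assert total_nulls == 0, f"unexpected nulls:\n{null_counts[null_counts > 0]}"
```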

### Check Data Types

In [3]:
test.dtypes

Id                          int64
Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
dtype: object

In [4]:
# convert Date from object/string to datetime64
test['Date'] = pd.to_datetime(test['Date'])

# check data types
test.dtypes

Id                                 int64
Date                      datetime64[ns]
Address                           object
Species                           object
Block                              int64
Street                            object
Trap                              object
AddressNumberAndStreet            object
Latitude                         float64
Longitude                        float64
AddressAccuracy                    int64
dtype: object
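
As a possible variation on the conversion above, passing an explicit `format` makes `pd.to_datetime` raise on any value that does not match the expected pattern instead of guessing. The sketch below assumes the `YYYY-MM-DD` layout seen in the sample rows; the `toy` frame is a hypothetical stand-in.

```python
import pandas as pd

# Toy stand-in with the same string layout as the Date column
toy = pd.DataFrame({"Date": ["2008-06-11", "2008-06-11", "2014-10-02"]})

# An explicit format raises on strings that do not match the pattern
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Confirm the column now holds datetimes rather than objects
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["Date"])
print(toy.dtypes)
```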

### Create Dummy Variables

In [5]:
# create dummy variables for trap and species
dummies = pd.get_dummies(test[['Trap', 'Species']], drop_first=False)

# add it back to original df without removing original trap and species vars
test = pd.concat([test, dummies], axis = 1)

# confirm everything merged correctly
print(test.shape)
test.head(2)

(116293, 168)


Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,...,Trap_T900,Trap_T903,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS,Species_UNSPECIFIED CULEX
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,...,0,0,0,0,1,0,0,0,0,0
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,...,0,0,0,0,0,1,0,0,0,0
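
If a model is later fit on dummies built from the training data (an assumption; the training notebook is not shown here), the test dummies may need the same columns in the same order. One way to sketch that alignment is `DataFrame.reindex`, which drops columns the training set never saw and fills columns missing from the test set with zeros. The frames `train_toy` and `test_toy` are hypothetical.

```python
import pandas as pd

# Hypothetical training and test frames with partly different categories
train_toy = pd.DataFrame({
    "Trap": ["T002", "T009"],
    "Species": ["CULEX PIPIENS", "CULEX RESTUANS"],
})
test_toy = pd.DataFrame({
    "Trap": ["T002", "T900"],
    "Species": ["CULEX PIPIENS", "CULEX ERRATICUS"],
})

train_dummies = pd.get_dummies(train_toy[["Trap", "Species"]])
test_dummies = pd.get_dummies(test_toy[["Trap", "Species"]])

# Keep exactly the training columns; categories absent from test become 0
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned.columns.tolist())
```

Categories that appear only in the test data (here `Trap_T900`) are dropped by this step, so whether that is acceptable depends on the model.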


In [6]:
test.dtypes

Id                                         int64
Date                              datetime64[ns]
Address                                   object
Species                                   object
Block                                      int64
Street                                    object
Trap                                      object
AddressNumberAndStreet                    object
Latitude                                 float64
Longitude                                float64
AddressAccuracy                            int64
Trap_T001                                  uint8
Trap_T002                                  uint8
Trap_T002A                                 uint8
Trap_T002B                                 uint8
Trap_T003                                  uint8
Trap_T004                                  uint8
Trap_T005                                  uint8
Trap_T006                                  uint8
Trap_T007                                  uint8
Trap_T008           

In [7]:
test['Date'].value_counts()
# earliest recorded date: 2008-06-11
# latest recorded date: 2014-10-02

2012-07-09    1293
2012-08-03    1282
2012-07-27    1282
2012-07-19    1260
2010-07-13    1257
2012-07-13    1256
2014-08-14    1254
2014-08-21    1253
2008-09-09    1248
2008-09-02    1247
2014-08-07    1247
2008-08-19    1247
2014-07-31    1245
2012-07-20    1245
2010-07-19    1241
2014-06-19    1236
2014-08-28    1234
2008-07-24    1234
2014-07-10    1233
2008-08-13    1233
2008-08-26    1232
2012-06-29    1231
2012-08-10    1231
2014-07-03    1229
2010-07-26    1229
2012-08-17    1229
2014-09-05    1229
2010-06-28    1227
2014-06-12    1225
2010-08-20    1225
2010-07-30    1225
2010-08-13    1224
2010-08-06    1224
2012-09-13    1224
2012-09-10    1223
2012-08-24    1223
2010-09-13    1223
2008-07-29    1222
2014-09-11    1222
2010-07-23    1222
2010-07-02    1221
2010-07-12    1221
2008-08-25    1221
2014-09-18    1220
2014-07-17    1220
2010-08-27    1219
2014-06-05    1218
2012-08-31    1218
2008-08-05    1218
2014-09-25    1218
2008-09-15    1217
2014-06-26    1217
2008-06-24  
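
Because `value_counts()` sorts by frequency, the date range is easier to read from `min()` and `max()`, or from the counts sorted by date. A minimal sketch on a toy series (the dates below are a small hypothetical sample, not the full column):

```python
import pandas as pd

# Small hypothetical sample of dates
dates = pd.to_datetime(
    pd.Series(["2012-07-09", "2008-06-11", "2014-10-02", "2012-07-09"])
)

# Earliest and latest dates, read directly
earliest = dates.min()
latest = dates.max()

# Same counts as value_counts(), but in chronological order
counts_by_date = dates.value_counts().sort_index()
print(earliest, latest)
print(counts_by_date)
```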

In [8]:
# set date as index
test.set_index('Date', inplace=True)
print(test.shape)
test.head()

(116293, 167)


Date,Id,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,...,Trap_T900,Trap_T903,Species_CULEX ERRATICUS,Species_CULEX PIPIENS,Species_CULEX PIPIENS/RESTUANS,Species_CULEX RESTUANS,Species_CULEX SALINARIUS,Species_CULEX TARSALIS,Species_CULEX TERRITANS,Species_UNSPECIFIED CULEX
2008-06-11,1,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,...,0,0,0,0,1,0,0,0,0,0
2008-06-11,2,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,...,0,0,0,0,0,1,0,0,0,0
2008-06-11,3,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,...,0,0,0,1,0,0,0,0,0,0
2008-06-11,4,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX SALINARIUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,...,0,0,0,0,0,0,1,0,0,0
2008-06-11,5,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TERRITANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,...,0,0,0,0,0,0,0,0,1,0
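
One benefit of a datetime index is that rows can be selected or grouped by calendar period. The sketch below shows partial-string selection with `.loc` and a per-year row count; the `toy` frame is a hypothetical stand-in for the indexed test data.

```python
import pandas as pd

# Toy frame indexed by date, mirroring the structure after set_index
toy = pd.DataFrame(
    {"Trap": ["T002", "T009", "T900"]},
    index=pd.to_datetime(["2008-06-11", "2010-07-13", "2014-10-02"]),
)
toy.index.name = "Date"

# Select every row from a given year using a partial date string
rows_2008 = toy.loc["2008"]

# Count rows per calendar year
per_year = toy.groupby(toy.index.year).size()
print(per_year)
```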


In [15]:
test['Trap'].value_counts()
# number of test records per trap

T009     1528
T035     1520
T900     1468
T002      857
T008      822
T011      814
T027      803
T151      797
T054      792
T903      784
T028      783
T012      781
T231      781
T003      780
T220      776
T115      776
T073      776
T016      775
T135      774
T223      774
T063      774
T128      774
T212      773
T158      773
T102      772
T114      772
T030      771
T159      770
T221      769
T147      769
T082      769
T013      769
T090A     768
T074      768
T138      768
T218      767
T065A     767
T047      767
T080      767
T225      766
T200A     766
T089      765
T144      765
T233      765
T065      765
T200      765
T152      764
T145      764
T228      764
T066      763
T048      763
T236      763
T090      763
T076      763
T083      763
T209      763
T031      763
T062      762
T069      762
T014      762
T218C     762
T222      762
T033      762
T045      762
T227      762
T061      762
T218B     762
T095      762
T230      762
T046      762
T094      761
T224  

In [16]:
test['AddressAccuracy'].value_counts()

8    61973
9    39795
5    13761
3      764
Name: AddressAccuracy, dtype: int64
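
The counts above can be turned into shares to see how the rows split across accuracy levels. This sketch rebuilds the counts from the output shown and divides by their total; on the real frame, `value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({8: 61973, 9: 39795, 5: 13761, 3: 764}, name="AddressAccuracy")

# The total should equal the number of rows in the test data
total = int(counts.sum())

# Share of rows at each accuracy level, rounded for display
shares = (counts / total).round(3)
print(total)
print(shares)
```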

### Save as CSV

In [11]:
# save to csv

test.to_csv('./West-Nile_Team-MATH/assets/test.csv')
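
Since `Date` is the index when the file is written, it is saved as the first column and comes back as plain strings unless it is parsed on load. A minimal round-trip sketch, using a temporary file and a toy frame rather than the project path above:

```python
import os
import tempfile

import pandas as pd

# Toy frame with a datetime index named Date
toy = pd.DataFrame(
    {"Id": [1, 2], "Trap": ["T002", "T002"]},
    index=pd.to_datetime(["2008-06-11", "2008-06-11"]),
)
toy.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_test.csv")  # hypothetical file name
    toy.to_csv(path)
    # Parse Date on load and restore it as the index
    reloaded = pd.read_csv(path, index_col="Date", parse_dates=["Date"])

print(type(reloaded.index).__name__)
```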