# Feature Engineering/Preprocessing

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import pickle
np.random.seed(42)  # seed NumPy's global random number generator for reproducibility

In [2]:
test = pd.read_csv('../Data/clean_test.csv',index_col=[0])
train = pd.read_csv('../Data/clean_train.csv',index_col=[0])
weather = pd.read_csv('../Data/clean_weather.csv',index_col=[0])
spray = pd.read_csv('../Data/clean_spray.csv',index_col=[0])
for df in [train,spray,test,weather]:
    # Parse the Date column and use it as the index
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date',inplace=True)

## Weather Feature Engineering/Preprocessing

> Outside research indicated that the amount of daylight in a day is an important predictor of mosquito abundance, so this cell computes a Daylight feature from the sunrise and sunset times.

In [3]:
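# Sunrise/Sunset are assumed to be HHMM integers (e.g. 1849 = 18:49); convert to decimal hours before subtracting
to_hours = lambda hhmm: hhmm // 100 + (hhmm % 100) / 60
weather['Daylight'] = to_hours(weather['Sunset']) - to_hours(weather['Sunrise'])
weather.drop(columns=['Sunset','Sunrise'],inplace = True)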

> Using the CodeSum feature, the columns Rain and Fog were created to indicate whether it was raining or foggy that day. This may be useful as a feature because mosquitoes cannot fly in either rain or fog.

In [4]:
def string_contain(x,string):
    if string in x:
        return 1
    else:
        return 0
weather['Rain'] = weather["CodeSum"].apply(lambda x: string_contain(x,"RA"))
weather['Fog'] = weather['CodeSum'].apply(lambda x: string_contain(x,'FG'))
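
> The same flags can also be built without a Python-level helper. The sketch below is only an illustration: `demo` is a made-up frame, not the real weather data, and it relies on pandas' vectorized `Series.str.contains`. Passing `regex=False` makes the match a literal substring test, like the `in` check above, so a code such as TSRA still counts as rain.

```python
import pandas as pd

# Hypothetical sample of CodeSum values, for illustration only
demo = pd.DataFrame({'CodeSum': ['RA BR', 'TSRA FG', ' ', 'HZ']})

# Literal substring match, then cast True/False to 1/0
demo['Rain'] = demo['CodeSum'].str.contains('RA', regex=False).astype(int)
demo['Fog'] = demo['CodeSum'].str.contains('FG', regex=False).astype(int)
print(demo)
```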

> This cell creates a new column, Humidity. This feature was engineered because outside research indicated that mosquitoes thrive in humid environments. It is estimated from the average temperature and the dew point, both converted to Celsius.

In [5]:
def change_to_celsius(tf):
    # Convert degrees Fahrenheit to degrees Celsius
    tc= (tf -32)/1.8
    return tc
def humidity(t, td):
    # Relative humidity (%) from air temperature t and dew point td, both in Celsius
    a= 17.625
    b= 243.04
    rh= 100 * (np.exp((a * td)/(b + td)))/(np.exp((a * t)/(b + t)))
    return round(rh, 2)
weather.loc[:, 'Celsius_Dew']= change_to_celsius(weather['DewPoint'])
weather['Tavg']= weather['Tavg'].astype(float)
weather.loc[:, 'Celsius_Temp']= change_to_celsius(weather['Tavg'])
weather.loc[:, 'Humidity']= humidity(weather['Celsius_Temp'], weather['Celsius_Dew'])
weather.drop(columns = ['Celsius_Temp','Celsius_Dew','DewPoint'],inplace = True)
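
> As a sanity check on the formula, which appears to be a Magnus-type approximation, the sketch below evaluates it on two hand-picked inputs. The function name `relative_humidity` and the sample temperatures are assumptions made for illustration; only the constants 17.625 and 243.04 come from the cell above. When the dew point equals the air temperature the result should be 100%, and it should fall as the dew point drops below the air temperature.

```python
import numpy as np

def relative_humidity(t_c, td_c, a=17.625, b=243.04):
    """Approximate relative humidity (%) from air temperature and dew point in Celsius."""
    return 100 * np.exp((a * td_c) / (b + td_c)) / np.exp((a * t_c) / (b + t_c))

saturated = relative_humidity(20.0, 20.0)  # dew point equals air temperature
drier = relative_humidity(20.0, 10.0)      # dew point 10 degrees below air temperature
print(round(saturated, 2), round(drier, 2))
```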

> By making a separate data frame for each station, we can more easily compute the 7-day averages and sums for certain features.

In [6]:
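# groupby yields (station_id, frame) pairs in sorted key order; copy each frame so later in-place edits do not touch `weather`
(_, station1_weather), (_, station2_weather) = weather.groupby(by="Station")
station1_weather = station1_weather.copy()
station2_weather = station2_weather.copy()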

> This cell creates new features that describe either the mean of a feature over a 7-day window, or the number of days in that window on which an event occurred. For example, prev_7_day_avg_Precip is the average precipitation over 7 days, and prev_7_day_Rain is the number of days out of 7 on which it rained. Note that `rolling(7)` takes the current day and the six days before it, so the first six rows for each station are NaN.

In [7]:
for station in [station1_weather,station2_weather]:
    station['prev_7_day_avg_Precip'] = station['PrecipTotal'].rolling(7).mean()
    station['prev_7_day_avg_Temp'] = station['Tavg'].rolling(7).mean()
    station['prev_7_day_Rain'] = station['Rain'].rolling(7).sum()
    station['prev_7_day_Fog'] = station['Fog'].rolling(7).sum()
    station['prev_7_day_Daylight'] = station['Daylight'].rolling(7).mean()
    station.drop(columns=['PrecipTotal','Tavg','Rain','Fog','Daylight','Heat','Cool'],inplace=True)
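
> One detail worth checking is which days a rolling window covers. The sketch below uses a made-up daily series `s` (an assumption, not the weather data) to compare `rolling(7)`, whose window ends on the current row, with `shift(1).rolling(7)`, which would cover strictly the seven preceding rows. It is an illustration of the difference, not a change to the features built above.

```python
import pandas as pd

# Hypothetical daily series: values 1..10 on consecutive days
s = pd.Series(range(1, 11), index=pd.date_range('2020-01-01', periods=10), dtype=float)

window_incl_today = s.rolling(7).mean()          # rows t-6 .. t (includes the current day)
window_prev_only = s.shift(1).rolling(7).mean()  # rows t-7 .. t-1 (excludes the current day)

# Both windows need 7 observations, so the leading rows are NaN
print(pd.DataFrame({'incl_today': window_incl_today, 'prev_only': window_prev_only}))
```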

> Dropping columns that are no longer needed or that may be collinear with other features.

In [8]:
for station in [station1_weather,station2_weather]:
    station.drop(columns=['CodeSum','Tmin','Tmax','Station'],inplace = True)

## Train/Test Data Preprocessing

> Creating columns that describe the presence of each species in a trap, and the block each trap is in, for both the training and testing data. This one-hot encoding turns categorical data into numeric columns that a model can use.

In [9]:
train = pd.concat([train,pd.get_dummies(train['Species'])],axis=1)
test = pd.concat([test,pd.get_dummies(test['Species'])],axis=1)
test.drop(columns=['Species'],inplace=True)
train.drop(columns=['Species'],inplace=True)
train = pd.concat([train,pd.get_dummies(train['Block'],prefix = "Block")],axis=1)
test = pd.concat([test,pd.get_dummies(test['Block'],prefix = "Block")],axis=1)
test.drop(columns=['Block','Trap'],inplace=True)
train.drop(columns=['Block','Trap'],inplace=True)

> The output of this cell shows that UNSPECIFIED CULEX and Block_26 appear only in the testing dataframe, so both must be created manually in the training dataframe. Id should only be in the testing dataframe, and NumMosquitos and WnvPresent should only be in the training dataframe.

In [10]:
def column_check(df1,df2):
    # Columns unique to each dataframe
    only_in_first = set(df1.columns) - set(df2.columns)
    only_in_second = set(df2.columns) - set(df1.columns)
    if not only_in_first and not only_in_second:
        print('These Dataframes have the same columns')
    else:
        print('The',only_in_first,'Features are in our first dataframe,but not in the second dataframe')
        print('The',only_in_second,'Features are in our second dataframe,but not in the first dataframe')
column_check(test,train)

The {'UNSPECIFIED CULEX', 'Block_26', 'Id'} Features are in our first dataframe,but not in the second dataframe
The {'WnvPresent', 'NumMosquitos'} Features are in our second dataframe,but not in the first dataframe


In [11]:
train['UNSPECIFIED CULEX'] = 0
train['Block_26'] = 0
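
> Adding the missing columns by hand works for two columns, but the same step can be written generically. The sketch below uses two made-up frames, `train_demo` and `test_demo` (hypothetical names and values), and fills every dummy column that exists only in the test frame with 0. In the real data, columns such as Id that are meant to stay test-only would need to be excluded from the list first.

```python
import pandas as pd

# Hypothetical dummy-encoded frames with mismatched columns
train_demo = pd.DataFrame({'Block_10': [1, 0], 'Block_11': [0, 1]})
test_demo = pd.DataFrame({'Block_10': [1, 0], 'Block_26': [0, 1]})

# Columns present in test_demo but absent from train_demo
missing = [c for c in test_demo.columns if c not in train_demo.columns]

# A category never seen in training is all zeros there
for col in missing:
    train_demo[col] = 0
print(sorted(train_demo.columns))
```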

> This cell derives the month and year from each trap's collection date and stores their squares as the features Month_sq and Year_sq. The raw Month and Year columns are then dropped.

In [12]:
for df in [test,train]:
    # The DatetimeIndex exposes month and year directly
    df['Month'] = df.index.month
    df['Year'] = df.index.year
    df['Year_sq'] = df['Year'] * df['Year']
    df['Month_sq'] = df['Month'] * df['Month']
    df.drop(columns = ['Month','Year'],inplace = True)

In [13]:
column_check(test,train)

The {'Id'} Features are in our first dataframe,but not in the second dataframe
The {'WnvPresent', 'NumMosquitos'} Features are in our second dataframe,but not in the first dataframe


> This cell splits both the training and testing data at latitude 41.85, giving two data frames for train and two for test. This is done to assign each station's weather data to the traps that are closest to it.

In [14]:
station1_train= train[train.Latitude > 41.85]
station2_train= train[train.Latitude <= 41.85]
station1_test= test[test.Latitude > 41.85]
station2_test= test[test.Latitude <= 41.85]

> This cell merges each training and testing data frame with the matching weather station's data frame on the date index, then concatenates the two training data frames back together and the two testing data frames back together.

In [15]:
stat1= pd.merge(station1_weather, station1_train, how= 'inner', right_index= True, left_index= True)
stat2= pd.merge(station2_weather, station2_train, how= 'inner', right_index= True, left_index= True)
stat_test1= pd.merge(station1_weather, station1_test, how= 'inner', right_index= True, left_index= True)
stat_test2= pd.merge(station2_weather, station2_test, how= 'inner', right_index= True, left_index= True)
train = pd.concat([stat1, stat2],sort=True)
test = pd.concat([stat_test1,stat_test2],sort=True)
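
> To make the behavior of the index-on-index inner merge concrete, the sketch below uses two tiny made-up frames, `weather_demo` and `traps_demo` (hypothetical names and values). Each trap row picks up the weather row for its date, several traps on one date all receive the same weather values, and a trap whose date has no weather row is dropped by `how='inner'`.

```python
import pandas as pd

# One weather row per day (hypothetical values)
dates = pd.to_datetime(['2020-06-01', '2020-06-02'])
weather_demo = pd.DataFrame({'Humidity': [60.0, 65.0]}, index=dates)

# Two traps on 06-01 and one on 06-03, a day with no weather row
trap_dates = pd.to_datetime(['2020-06-01', '2020-06-01', '2020-06-03'])
traps_demo = pd.DataFrame({'Latitude': [41.90, 41.95, 41.88]}, index=trap_dates)

# Inner join on the date index keeps only dates present in both frames
merged = pd.merge(weather_demo, traps_demo, how='inner', left_index=True, right_index=True)
print(merged)
```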

In [16]:
column_check(test,train)

The {'Id'} Features are in our first dataframe,but not in the second dataframe
The {'NumMosquitos', 'WnvPresent'} Features are in our second dataframe,but not in the first dataframe


## Creating X and y Variables

> Because the outcome that we want to predict is the presence of West Nile virus, WnvPresent becomes the y variable. The remaining features used to predict it become our X variable. NumMosquitos is left out of X because it is not available in the testing data.

In [17]:
y = train[['WnvPresent']]
X = train.drop(columns=['WnvPresent', 'NumMosquitos'])
y_stuff = train[['NumMosquitos']]

> This cell saves our list of IDs so we can submit to Kaggle later.

In [18]:
Id_for_test = test['Id']
test.drop(columns='Id',inplace = True)
with open('../Assets/ID_list.pkl','wb+') as f:
    pickle.dump(Id_for_test,f)

> The cell below confirms that our X variable and our testing data have the same columns. Therefore, we are ready to start building models and making predictions on the testing data.

In [19]:
column_check(test,X)

These Dataframes have the same columns


## Saving to CSV for Modeling

In [20]:
X.to_csv('../Data/X.csv')
y.to_csv('../Data/y.csv')
test.to_csv('../Data/formatted_test.csv')