# What we've learned

1. There are patterns linking `Weather`, `Mosquitoes`, and `WnvPresent`.
2. `LabelEncoder` lets us avoid creating hundreds of dummy columns.
3. Rounding/estimating location works much better than an exact `Lat & Long`.
4. `Weather` is cyclical.

## Imports

In [1]:
import pickle
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Data

In [2]:
df_test = pd.read_csv('../data/test.csv')
df_train = pd.read_csv('../data/train.csv')
df_weather = pd.read_csv('../data/weather_cleaned.csv')

## Datetime

In [3]:
df_test['Date'] = pd.to_datetime(df_test['Date'])
df_test.set_index('Date',inplace=True)

df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train.set_index('Date',inplace=True)

df_weather['Date'] = pd.to_datetime(df_weather['Date'])
df_weather.set_index('Date',inplace=True)

## Rolling Weather

Since we know that weather is cyclical and that mosquitoes have birth patterns, we want to use rolling means of weather data to help us predict `WnvPresent`. At first we focused on fixed windows of `7`, `14`, `21`, and `28` days, but that process didn't produce the best results and was also extremely tedious.

So we wrote a function that randomly selects a window length between 3 and 29 days for each weather column's rolling mean. We found the hard-coded values below to be best.

In [4]:
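def random_roll(weather_cols):
    # Add a rolling-mean column to df_weather for each weather column,
    # using a randomly chosen window of 3 to 29 rows (days)
    for w in weather_cols:
        days = np.random.choice(range(3,30),1)[0]
        df_weather[w+'_roll_'+str(days)] = df_weather[w].rolling(days).mean()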

In [5]:
#random_roll(['ResultSpeed','PrecipTotal','DewPoint','WetBulb','StnPressure','SeaLevel','AvgSpeed','Heat','Tmax','Tmin'])
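One way to make this random search repeatable is to seed the generator and record which window each column received, so a good draw can be hard-coded afterwards. The sketch below is an assumption about how that could look, not the function used above: `random_roll_seeded`, its `seed` argument, and the toy frame are all hypothetical.

```python
import numpy as np
import pandas as pd

def random_roll_seeded(df, cols, low=3, high=30, seed=0):
    """Add one rolling-mean column per input column; return the frame and the chosen windows."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    windows = {}
    for col in cols:
        # integers() excludes `high`, matching range(low, high)
        days = int(rng.integers(low, high))
        out[f'{col}_roll_{days}'] = out[col].rolling(days).mean()
        windows[col] = days
    return out, windows

# Toy weather frame, for illustration only
toy = pd.DataFrame(
    {'Tmax': np.arange(60.0), 'Tmin': np.arange(60.0) - 10},
    index=pd.date_range('2013-05-01', periods=60, name='Date'),
)
rolled, windows = random_roll_seeded(toy, ['Tmax', 'Tmin'], seed=42)
print(windows)
```

Calling the function again with the same `seed` reproduces the same windows, which makes it possible to tie a submission score back to the exact draw that produced it.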

In [6]:
df_weather['ResultSpeed_21'] = df_weather['ResultSpeed'].rolling(21).mean()
df_weather['PrecipTotal_15'] = df_weather['PrecipTotal'].rolling(15).sum()
df_weather['DewPoint_16'] = df_weather['DewPoint'].rolling(16).mean()
df_weather['AvgSpeed_19'] = df_weather['AvgSpeed'].rolling(19).mean()
df_weather['Heat_28'] = df_weather['Heat'].rolling(28).mean()
df_weather['Tmax_4'] = df_weather['Tmax'].rolling(4).mean()
df_weather['Tmin_8'] = df_weather['Tmin'].rolling(8).mean()
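As a small illustration of what these columns contain (toy numbers, not the weather data): a window of `n` rows yields `NaN` for the first `n - 1` rows, and `rolling(...).sum()` accumulates totals where `rolling(...).mean()` averages them. The window counts rows, which equals days only if there is one row per date; that is an assumption about `weather_cleaned.csv`.

```python
import pandas as pd

precip = pd.Series(
    [0.0, 0.2, 0.0, 0.5, 0.1],
    index=pd.date_range('2013-06-01', periods=5, name='Date'),
)

roll_mean = precip.rolling(3).mean()  # average over the current and two previous rows
roll_sum = precip.rolling(3).sum()    # total over the same three rows

print(pd.DataFrame({'precip': precip, 'mean_3': roll_mean, 'sum_3': roll_sum}))
```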

## Round Lat & Long

In [7]:
# Note: int() truncates toward zero (41.95 -> 41, -87.80 -> -87); it does not round
df_train['Lat_int'] = df_train['Latitude'].apply(int)
df_train['Long_int'] = df_train['Longitude'].apply(int)
df_test['Lat_int'] = df_test['Latitude'].apply(int)
df_test['Long_int'] = df_test['Longitude'].apply(int)
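Because `int` truncates, nearby traps collapse onto the same value. The sketch below uses made-up coordinates in the Chicago area (an assumption about the data's range) to show this, next to a hypothetical finer-grained alternative using `round(1)`, which is not what the notebook does.

```python
import pandas as pd

# Made-up coordinates, for illustration only
coords = pd.DataFrame({
    'Latitude': [41.95469, 41.73120, 41.99400],
    'Longitude': [-87.80099, -87.64900, -87.93000],
})

# What the cell above does: truncation toward zero
coords['Lat_int'] = coords['Latitude'].apply(int)
coords['Long_int'] = coords['Longitude'].apply(int)

# Hypothetical alternative: keep one decimal place
coords['Lat_r1'] = coords['Latitude'].round(1)
coords['Long_r1'] = coords['Longitude'].round(1)

print(coords)
```

A column that holds a single value for every row gives a tree nothing to split on, which is consistent with the `Long_int` weight of `0.0` in the feature table further down.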

## Merge

In [8]:
df_train = pd.merge(left=df_train,right=df_weather,on='Date')
df_test = pd.merge(left=df_test,right=df_weather,on='Date')
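A merge like this silently multiplies rows if the weather table holds more than one row per date. The sketch below shows one possible guard using the `validate` argument of `pd.merge`. It uses toy frames with `Date` as an ordinary column, and the frame names are hypothetical; whether `weather_cleaned.csv` has exactly one row per date is an assumption.

```python
import pandas as pd

traps = pd.DataFrame({
    'Date': pd.to_datetime(['2013-06-01', '2013-06-01', '2013-06-02']),
    'Trap': ['T001', 'T002', 'T001'],
})
weather = pd.DataFrame({
    'Date': pd.to_datetime(['2013-06-01', '2013-06-02']),
    'Tmax': [80, 85],
})

# many_to_one: many trap rows per date, exactly one weather row per date
merged = pd.merge(left=traps, right=weather, on='Date', how='left',
                  validate='many_to_one')
print(len(traps), len(merged))

# A duplicated weather date is rejected instead of silently duplicating trap rows
weather_dup = pd.concat([weather, weather.iloc[[0]]])
try:
    pd.merge(left=traps, right=weather_dup, on='Date', how='left',
             validate='many_to_one')
    caught = False
except pd.errors.MergeError:
    caught = True
print(caught)
```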

## Drops

In [9]:
df_train.drop(['Station', 'Tmax', 'Tmin', 'Tavg', 'DewPoint', 'WetBulb', 'Heat',
       'Cool', 'PrecipTotal', 'StnPressure', 'SeaLevel', 'ResultSpeed',
       'ResultDir', 'AvgSpeed','Address', 'Block','Latitude','Longitude',
         'AddressNumberAndStreet','AddressAccuracy','NumMosquitos'],axis=1,inplace=True)

df_test.drop(['Id','Station', 'Tmax', 'Tmin', 'Tavg', 'DewPoint', 'WetBulb', 'Heat',
       'Cool', 'PrecipTotal', 'StnPressure', 'SeaLevel', 'ResultSpeed',
       'ResultDir', 'AvgSpeed','Address', 'Block','Latitude','Longitude',
         'AddressNumberAndStreet','AddressAccuracy'],axis=1,inplace=True)

## Binary Species

At first we used `LabelEncoder` for `Species`, but after further tuning and scoring we found that a binary flag works better: `1` for species that correlate with `WnvPresent`, `0` for all others.

In [10]:
no_wn = ['CULEX ERRATICUS','CULEX SALINARIUS','CULEX TARSALIS','CULEX TERRITANS']
yes_wn = ['CULEX PIPIENS/RESTUANS', 'CULEX RESTUANS', 'CULEX PIPIENS']

df_train['Species'] = df_train['Species'].map(lambda x: 1 if x in yes_wn else 0)
df_test['Species'] = df_test['Species'].map(lambda x: 1 if x in yes_wn else 0)
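The same binary flag can be built without a `lambda`. This is a minimal sketch on a made-up `Series`, showing that `isin` followed by `astype(int)` produces the same result as the `map` call above.

```python
import pandas as pd

yes_wn = ['CULEX PIPIENS/RESTUANS', 'CULEX RESTUANS', 'CULEX PIPIENS']

# Made-up sample of species labels
species = pd.Series(['CULEX PIPIENS', 'CULEX TERRITANS',
                     'CULEX RESTUANS', 'CULEX SALINARIUS'])

via_map = species.map(lambda x: 1 if x in yes_wn else 0)
via_isin = species.isin(yes_wn).astype(int)  # vectorized equivalent

print(via_isin.tolist())  # [1, 0, 1, 0]
```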

## Label Encoder

In [11]:
from sklearn import preprocessing

# Convert categorical data to numbers
lbl = preprocessing.LabelEncoder()
# lbl.fit(list(df['Species'].values) + list(df1['Species'].values))
# df['Species'] = lbl.transform(df['Species'].values)
# df1['Species'] = lbl.transform(df1['Species'].values)

lbl.fit(list(df_train['Street'].values) + list(df_test['Street'].values))
df_train['Street'] = lbl.transform(df_train['Street'].values)
df_test['Street'] = lbl.transform(df_test['Street'].values)

lbl.fit(list(df_train['Trap'].values) + list(df_test['Trap'].values))
df_train['Trap'] = lbl.transform(df_train['Trap'].values)
df_test['Trap'] = lbl.transform(df_test['Trap'].values)
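The encoder is fit on the union of train and test values because `LabelEncoder.transform` raises on labels it has not seen. The toy trap IDs below are made up to illustrate this.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up trap IDs; 'T003' appears only in the test set
train_traps = ['T002', 'T001', 'T002']
test_traps = ['T003', 'T001']

enc = LabelEncoder()
enc.fit(train_traps + test_traps)   # fit on the union so no test label is unseen

print(list(enc.classes_))           # ['T001', 'T002', 'T003']
print(enc.transform(test_traps))    # [2 0]

# Fitting on train alone fails as soon as the test set holds a new label
try:
    LabelEncoder().fit(train_traps).transform(test_traps)
    unseen_raised = False
except ValueError:
    unseen_raised = True
print(unseen_raised)
```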

## Final Drops

Through numerous iterations of feature comparison, testing, and tuning, we tried dropping certain features right before modeling to obtain the best results. However, our best models kept all of our final features.

In [12]:
#df_train.drop([''],axis=1,inplace=True)
#df_test.drop([''],axis=1,inplace=True)

# Model RandomForestClassifier

A random forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the features, so the model isn't biased toward always splitting on the strongest feature. We found little to no gain from hyperparameter tuning, so we're sticking with 1000 trees.
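To make the moving parts concrete, here is a minimal sketch on synthetic data from `make_classification` (not the West Nile data). The small `n_estimators` and the explicit `max_features='sqrt'` are illustrative choices, not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,       # number of trees in the ensemble
    max_features='sqrt',   # size of the random feature subset tried at each split
    random_state=0,
)
forest.fit(X_toy, y_toy)

proba = forest.predict_proba(X_toy[:3])

print(len(forest.estimators_))   # one fitted tree per estimator
print(forest.classes_)           # column order of predict_proba
print(proba.shape)               # one column per class
print(forest.feature_importances_.sum())
```

Note that `predict_proba` returns one column per class in the order given by `classes_`, so a single column has to be selected before writing a one-column submission.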

In [13]:
X = df_train.drop('WnvPresent',axis=1)
y = df_train['WnvPresent']

In [14]:
ss = StandardScaler()
rf = RandomForestClassifier(n_estimators=1000,n_jobs=3)

In [15]:
X_train_ss = ss.fit_transform(X)
X_test_ss = ss.transform(df_test)

rf.fit(X_train_ss,y);

test_preds = rf.predict_proba(X_test_ss)

In [16]:
submit = pd.read_csv('../data/sampleSubmission.csv')
# predict_proba returns one column per class; 1 - P(class 0) is P(WnvPresent == 1)
submit['WnvPresent'] = 1 - test_preds[:, 0]
submit.to_csv('../kaggle/random_roll_random_forest.csv',index=False)

At times, we get very high scores between 60-67% or very low scores between 31-35%. In the latter case, we submit `1 - preds` to get the higher score: if we're consistently 'bad' at predicting `WnvPresent`, we can simply flip our predictions.
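Flipping only helps with ranking-style metrics. Assuming the leaderboard score is ROC AUC (an assumption; this notebook does not name the metric), inverting the predictions turns a score of `s` into `1 - s`, which would match the 31-35% and 60-67% ranges mirroring each other. The labels and scores below are made up to illustrate this.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 0, 1, 0])
# Made-up scores that rank the positives too low, i.e. a 'bad' model
p_bad = np.array([0.9, 0.8, 0.2, 0.7, 0.6, 0.65, 0.3, 0.4])

auc_bad = roc_auc_score(y_true, p_bad)
auc_flipped = roc_auc_score(y_true, 1 - p_bad)

print(auc_bad, auc_flipped)
```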

In [17]:
# Saving the model as a pickle
with open('../assets/random_roll_random_forest.pkl', 'wb+') as f:
    pickle.dump(rf, f)

In [18]:
features = pd.DataFrame({'Feature':X.columns,'Weight':rf.feature_importances_})
features

| | Feature | Weight |
|---|---|---|
| 0 | Species | 0.008183 |
| 1 | Street | 0.343346 |
| 2 | Trap | 0.389918 |
| 3 | Lat_int | 0.014845 |
| 4 | Long_int | 0.0 |
| 5 | ResultSpeed_21 | 0.044472 |
| 6 | PrecipTotal_15 | 0.030347 |
| 7 | DewPoint_16 | 0.047255 |
| 8 | AvgSpeed_19 | 0.034524 |
| 9 | Heat_28 | 0.030297 |


# Score 0.71717

This is our highest score yet, and on the full Kaggle dataset we score `0.68223`! But we want over 70% on the full Kaggle dataset. So let's do that!

# PCA - Principal Component Analysis
PCA lets us reduce the number of model inputs without dropping any original feature outright. We believe we already have the best selection of features, so to reduce further we turn to feature extraction.

**Feature Extraction**
- In feature extraction, we take our existing features and combine them together in a particular way. We can then drop some of these "new" variables, but the variables we keep are still a combination of the old variables!
- This allows us to still reduce the number of features in our model **but** we can keep all of the most important pieces of the original features!

After multiple iterations, we found 5 components to be the best. Each component is a linear combination of the original features, and each added component increases the total explained variance.
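To show what "a combination of the old variables" means in practice, the sketch below runs PCA on random toy data (not the notebook's features) and reproduces the transformed values by hand from `components_`. The array sizes and the choice of 3 components are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random toy data with two correlated columns, for illustration only
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(200, 6))
X_toy[:, 1] += 2 * X_toy[:, 0]

X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=3, random_state=3)
X_reduced = pca_toy.fit_transform(X_scaled)

# Each row of components_ holds the weights of one linear combination
manual = (X_scaled - pca_toy.mean_) @ pca_toy.components_.T

print(pca_toy.components_.shape)        # (3, 6): 3 components, 6 original features
print(np.allclose(X_reduced, manual))
```

Every one of the 6 original columns contributes a weight to each of the 3 new columns, which is why no feature is dropped outright even though the model sees fewer inputs.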

In [19]:
pca = PCA(random_state=3,n_components=5)

In [20]:
X_train_ss = ss.fit_transform(X)
X_test_ss = ss.transform(df_test)

X_train_pca = pca.fit_transform(X_train_ss)
X_test_pca = pca.transform(X_test_ss)

In [21]:
print('Our explained variance:\n',np.cumsum(pca.explained_variance_ratio_))

Our explained variance:
 [0.37717204 0.48816358 0.59423562 0.68681201 0.77662438]


This tells us that these 5 components explain about 78% of the variance in our dataset. There was little to no benefit in going to 6 components, which explains closer to 85%.
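The cumulative values printed above can also be read programmatically. This sketch recovers the variance contributed by each component and finds the smallest number of components that reaches a target; the 75% target is a hypothetical threshold chosen for illustration.

```python
import numpy as np

# Cumulative explained variance printed above
cumulative = np.array([0.37717204, 0.48816358, 0.59423562, 0.68681201, 0.77662438])

# Variance added by each individual component
per_component = np.diff(cumulative, prepend=0.0)

# Smallest number of components reaching the hypothetical 75% target
target = 0.75
n_needed = int(np.argmax(cumulative >= target)) + 1

print(per_component.round(3))
print(n_needed)  # 5
```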

In [22]:
rf.fit(X_train_pca,y);
test_preds = rf.predict_proba(X_test_pca)

In [23]:
# Saving the model as a pickle
with open('../assets/random_roll_random_forest_pca_5.pkl', 'wb+') as f:
    pickle.dump(rf, f)

In [24]:
# Keep a single column: 1 - P(class 0) is P(WnvPresent == 1)
submit['WnvPresent'] = 1 - test_preds[:, 0]
submit.to_csv('../kaggle/random_roll_random_forest_pca_5.csv',index=False)

# Score 0.71974

This is our highest score yet. And on the full Kaggle dataset, we score `0.70150`!