# Beating the Benchmark

## This notebook adapts a script from the [Kaggle competition discussion board](https://www.kaggle.com/abhishek/vote-me-up) to better understand which features produce a signal for predicting `WnvPresent`

First, we import the dependencies for data manipulation and machine learning.

In [1]:
import pickle
import pandas as pd
from sklearn import ensemble, preprocessing

Next, we load the data sets we use to predict West Nile Virus.

Here is a description of each file and how it is used:
1. **train.csv**: The labeled observations we use to train our machine learning model so that it can predict the incidence of West Nile Virus in unseen data.
2. **test.csv**: Similar to the train spreadsheet, but with some columns missing (including the target). We generate our predictions from these rows.
3. **sampleSubmission.csv**: The template for our Kaggle submission. We overwrite its `WnvPresent` column with our predictions.
4. **weather.csv**: Daily weather observations that we merge onto both train.csv and test.csv by date to give the model extra features.

In [2]:
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
sample = pd.read_csv('../data/sampleSubmission.csv')
weather = pd.read_csv('../data/weather.csv')

Here we pull out the target values, `WnvPresent`, before that column is dropped from the features. These are the true labels the model is fit against during training.

In [3]:
# Get labels
labels = train.WnvPresent.values

From here, we clean up our data by:
1. Dropping columns that do not seem helpful for building our predictions
2. Replacing missing values
3. Creating `month` and `day` columns
4. Truncating `Latitude` and `Longitude` to integers
5. Merging the weather data onto train and test by `Date`
6. Encoding categorical columns with `LabelEncoder`

In [4]:
# Not using codesum for this benchmark
weather = weather.drop('CodeSum', axis=1)

In [5]:
# Split station 1 and 2 and join horizontally
weather_stn1 = weather[weather['Station']==1]
weather_stn2 = weather[weather['Station']==2]
weather_stn1 = weather_stn1.drop('Station', axis=1)
weather_stn2 = weather_stn2.drop('Station', axis=1)
weather = weather_stn1.merge(weather_stn2, on='Date')
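
Because both station frames share the same column names, `merge` appends the default suffixes `_x` (station 1) and `_y` (station 2) to every overlapping column except the key. A minimal sketch of that behaviour, using a toy frame with invented values rather than the real `weather.csv`:

```python
import pandas as pd

# Toy stand-in for weather.csv: one row per station per date (values are invented)
toy = pd.DataFrame({
    'Station': [1, 2, 1, 2],
    'Date': ['2007-05-01', '2007-05-01', '2007-05-02', '2007-05-02'],
    'Tmax': [83, 84, 59, 60],
})

stn1 = toy[toy['Station'] == 1].drop('Station', axis=1)
stn2 = toy[toy['Station'] == 2].drop('Station', axis=1)

# Overlapping non-key columns receive the default suffixes _x and _y
wide = stn1.merge(stn2, on='Date')
print(wide.columns.tolist())
```

After this step there is one row per date, and every weather column in the later cells carries one of the two suffixes.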

In [6]:
# replace some missing values and T with -1
weather = weather.replace('M', -1)
weather = weather.replace('-', -1)
weather = weather.replace('T', -1)
weather = weather.replace(' T', -1)
weather = weather.replace('  T', -1)
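
The five calls above could probably be collapsed into one, since `replace` accepts a list of values to swap out. The sketch below uses a toy frame with invented values; note that `replace` matches whole cells, so a legitimate negative number such as `'-3'` is left alone.

```python
import pandas as pd

# Toy columns with invented values; object dtype mimics text columns read from a CSV
toy = pd.DataFrame({
    'PrecipTotal': ['0.00', '  T', 'M', '0.13'],
    'Depart': ['14', 'M', '-', '-3'],
}, dtype=object)

# One call handles every placeholder; only cells that match exactly are replaced
cleaned = toy.replace(['M', '-', 'T', ' T', '  T'], -1)
print(cleaned)
```

Keep in mind that -1 is only a sentinel: the model will treat it as an ordinary number, and the remaining cells are still strings until they are converted.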

In [7]:
# Functions to extract month and day from dataset
# You can also use parse_dates of Pandas.
def create_month(x):
    return x.split('-')[1]

def create_day(x):
    return x.split('-')[2]

train['month'] = train.Date.apply(create_month)
train['day'] = train.Date.apply(create_day)
test['month'] = test.Date.apply(create_month)
test['day'] = test.Date.apply(create_day)
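
As the comment in the cell notes, pandas can also parse the dates for us. Here is a small sketch of that alternative, assuming the `Date` strings are formatted `YYYY-MM-DD` (which the `split('-')` calls imply). One difference to be aware of: splitting yields zero-padded strings such as `'05'`, while the datetime accessor yields integers.

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['2007-05-29', '2007-06-05']})  # invented dates

# String-splitting approach, as in the cell above: zero-padded strings
month_str = toy['Date'].apply(lambda x: x.split('-')[1])

# Datetime approach: parse once, then read integer components
dates = pd.to_datetime(toy['Date'], format='%Y-%m-%d')
toy['month'] = dates.dt.month
toy['day'] = dates.dt.day

print(month_str.tolist())
print(toy[['month', 'day']])
```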

In [8]:
# Add integer latitude/longitude columns
train['Lat_int'] = train['Latitude'].apply(int)
train['Long_int'] = train['Longitude'].apply(int)
test['Lat_int'] = test['Latitude'].apply(int)
test['Long_int'] = test['Longitude'].apply(int)
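
Note that `int` truncates toward zero rather than rounding, so any coordinates that share the same whole-degree part collapse to a single value. The sketch below uses illustrative coordinates (not rows from the data set) and shows `round` as one hypothetical, finer-grained alternative.

```python
import pandas as pd

# Illustrative coordinates only
coords = pd.Series([41.954690, 41.644612, -87.800991, -87.531635])

truncated = coords.apply(int)  # drops the fractional part (toward zero)
rounded = coords.round(2)      # hypothetical alternative that keeps two decimals

print(truncated.tolist())
print(rounded.tolist())
```

If most traps sit within the same degree of latitude and longitude, the truncated columns carry very little information.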

In [9]:
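# Drop the address columns; also drop the target (WnvPresent) and NumMosquitos from train,
# and Id from test, so that both frames end up with the same feature columns
train = train.drop(['Address', 'AddressNumberAndStreet', 'WnvPresent', 'NumMosquitos'], axis=1)
test = test.drop(['Id', 'Address', 'AddressNumberAndStreet'], axis=1)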

In [10]:
# Merge with weather data
train = train.merge(weather, on='Date')
test = test.merge(weather, on='Date')
train = train.drop(['Date'], axis = 1)
test = test.drop(['Date'], axis = 1)

In [11]:
# Convert categorical data to numbers
lbl = preprocessing.LabelEncoder()
lbl.fit(list(train['Species'].values) + list(test['Species'].values))
train['Species'] = lbl.transform(train['Species'].values)
test['Species'] = lbl.transform(test['Species'].values)

lbl.fit(list(train['Street'].values) + list(test['Street'].values))
train['Street'] = lbl.transform(train['Street'].values)
test['Street'] = lbl.transform(test['Street'].values)

lbl.fit(list(train['Trap'].values) + list(test['Trap'].values))
train['Trap'] = lbl.transform(train['Trap'].values)
test['Trap'] = lbl.transform(test['Trap'].values)
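
Fitting the encoder on the combined train and test values matters because `transform` raises an error for any label it did not see during `fit`. A minimal sketch with illustrative category values (the names below are placeholders, not taken from the data):

```python
from sklearn import preprocessing

train_species = ['CULEX PIPIENS', 'CULEX RESTUANS']    # illustrative values
test_species = ['CULEX PIPIENS', 'UNSPECIFIED CULEX']  # contains a label absent from train

# Fit on the union so every label in either set gets an integer code
lbl = preprocessing.LabelEncoder()
lbl.fit(train_species + test_species)
print(list(lbl.classes_))
print(lbl.transform(test_species).tolist())

# Fitting on train alone fails as soon as test contains a new label
unseen_error = None
try:
    preprocessing.LabelEncoder().fit(train_species).transform(test_species)
except ValueError as err:
    unseen_error = str(err)
print(unseen_error)
```

The integer codes follow sorted label order and imply no real ranking; tree-based models such as a random forest usually tolerate this, but linear models generally would not.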

In [12]:
# Drop columns in which every value is -1 (columns with at least one other value are kept)
train = train.loc[:, (train != -1).any(axis=0)]
test = test.loc[:, (test != -1).any(axis=0)]
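
The mask `(df != -1).any(axis=0)` is `True` for any column with at least one value other than -1, so only columns made up entirely of -1 are removed. Because the filter runs on train and test separately, the two frames could end up with different columns. The sketch below uses toy frames with invented column names to show the effect, plus one hypothetical safeguard (`reindex`) that realigns test to train.

```python
import pandas as pd

# Invented columns: one fully missing in both, one fully missing only in test
toy_train = pd.DataFrame({'temp': [83, 59], 'all_missing': [-1, -1], 'partly_missing': [-1, 0]})
toy_test = pd.DataFrame({'temp': [70, 71], 'all_missing': [-1, -1], 'partly_missing': [-1, -1]})

toy_train = toy_train.loc[:, (toy_train != -1).any(axis=0)]
toy_test = toy_test.loc[:, (toy_test != -1).any(axis=0)]
print(toy_train.columns.tolist())  # partly_missing survives in train
print(toy_test.columns.tolist())   # partly_missing is dropped from test

# Hypothetical safeguard: give test exactly the columns the model will be trained on
toy_test_aligned = toy_test.reindex(columns=toy_train.columns, fill_value=-1)
print(toy_test_aligned.columns.tolist())
```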

In [13]:
# Random Forest Classifier 
clf = ensemble.RandomForestClassifier(n_estimators=1000)
clf.fit(train, labels);

In [14]:
print('This is our X_train score: ',clf.score(train, labels))

This is our X_train score:  0.9810584427945935
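
The score above is accuracy measured on the same rows the model was trained on, so it is an optimistic number and says little about performance on unseen data. A rough sketch of a more honest estimate using cross-validation follows. It runs on synthetic data rather than our `train` frame, and it uses ROC AUC on the assumption that this is closer to how the leaderboard is scored.

```python
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for (train, labels); not the competition data
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0)

clf_demo = ensemble.RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the training rows themselves: almost always close to 1 for a random forest
train_acc = clf_demo.fit(X, y).score(X, y)

# AUC on held-out folds: each fold is scored by a model that never saw it
cv_auc = cross_val_score(clf_demo, X, y, cv=5, scoring='roc_auc')

print('training accuracy:', train_acc)
print('mean cross-validated AUC:', cv_auc.mean())
```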


In [15]:
# Saving the model as a pickle
with open('../assets/random_forest_all_features.pkl', 'wb+') as f:
    pickle.dump(clf, f)
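
To reuse a saved model later, it can be read back with `pickle.load`. The sketch below round-trips a tiny stand-in model through a temporary file (the path and the toy training data are placeholders). Only unpickle files you trust, and load them with the same scikit-learn version that wrote them where possible.

```python
import os
import pickle
import tempfile

from sklearn import ensemble

# Tiny stand-in model trained on invented data
model = ensemble.RandomForestClassifier(n_estimators=5, random_state=0)
model.fit([[0, 0], [1, 1], [0, 1], [1, 0]], [0, 1, 0, 1])

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')  # placeholder path
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Read the model back in binary mode
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original
same = bool((restored.predict([[1, 1]]) == model.predict([[1, 1]])).all())
print(same)
```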

In [16]:
# create predictions and submission file
predictions = clf.predict_proba(test)[:,1]
sample['WnvPresent'] = predictions
sample.to_csv('../kaggle/random_forest_all_features.csv', index=False)
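
`predict_proba` returns one column per class, ordered as in `clf.classes_`, so `[:, 1]` selects the probability of the second class. With 0/1 labels that is the probability that `WnvPresent` is 1. A small sketch on invented data:

```python
from sklearn import ensemble

# Invented, perfectly separable toy data with 0/1 labels
X = [[0], [0], [1], [1]]
y = [0, 0, 1, 1]
clf_demo = ensemble.RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

proba = clf_demo.predict_proba([[0], [1]])
print(clf_demo.classes_.tolist())  # column order of proba
print(proba.shape)                 # (rows, classes)

positive = proba[:, 1]             # probability of class 1 for each row
print(positive)
```

Submitting probabilities instead of hard 0/1 predictions preserves the ranking of the rows, which is what a ranking-based metric such as AUC rewards.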

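Here is how the `RandomForestClassifier` weighted the features:

In [17]:
# Pair each feature name with its own importance, then sort the pairs together
pd.DataFrame({'Feature': train.columns, 'Weight': clf.feature_importances_}).sort_values('Weight', ascending=False).head()

The output recorded below came from a version of this cell that sorted the weights on their own and paired them with the unsorted column list. The weights are real, but the names simply follow column order (`Species`, `Block`, `Street`, `Trap`, and `Latitude` are the first five columns), not importance order.

| | Feature | Weight |
|---|---|---|
| 0 | Species | 0.191937 |
| 1 | Block | 0.124575 |
| 2 | Street | 0.122056 |
| 3 | Trap | 0.115064 |
| 4 | Latitude | 0.107482 |

Because of that mismatch, this table does not tell us which features are strongest. The five largest weights range from about 0.11 to 0.19, but they need to be matched to their features by rerunning the cell above before we decide what to focus on moving forward.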

## Submit to Kaggle

We're using the [Kaggle CLI](http://wiki.fast.ai/index.php/Kaggle_CLI) to submit our predictions. Once the submission has been uploaded, the browser opens the Kaggle leaderboard page so we can check the score.

In [18]:
import subprocess, webbrowser
# Pass the command as an argument list; a single command string without shell=True fails on POSIX systems
result = subprocess.check_output(['kaggle', 'competitions', 'submit', '-f', '../kaggle/random_forest_all_features.csv', '-m', 'uploading a new set', 'predict-west-nile-virus'])
webbrowser.open("https://www.kaggle.com/c/predict-west-nile-virus/leaderboard")

True

## Score: 0.71922

This is much better, but we still want to improve. We're going to take what we learned and try to use the dates in our `weather` data to our advantage. Weather is cyclical, follows patterns, and affects mosquitoes, and we've seen before that there are seasonal trends. That's our next step.