# Beating the Benchmark

## This notebook adapts a script from the Kaggle competition discussion board to better understand which features produce a signal for predicting West Nile Virus

First, we import the dependencies for data manipulation and machine learning.

In [109]:
# Borrowing from https://www.kaggle.com/abhishek/vote-me-up
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import BaggingRegressor, BaggingClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

Next, we load the data sets we will use to predict West Nile Virus.

Here is a description of each file and how it is used:
1. **train.csv**: The labeled observations we use to train our machine learning model so that it can predict the presence of West Nile Virus in unseen data.
2. **test.csv**: Similar to train.csv, but with some columns missing, such as the "WnvPresent" target. We generate our predictions from these rows.
3. **sampleSubmission.csv**: The template for our Kaggle submission. We overwrite its "WnvPresent" column with our predictions.
4. **weather.csv**: Weather readings that we merge into both train.csv and test.csv by date to give the model additional features.

In [110]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
sample = pd.read_csv('../input/sampleSubmission.csv')
weather = pd.read_csv('../input/weather.csv')

Here we grab the target values for training. These are the true labels that we compare our predictions against as we train our model.

In [111]:
# Get labels
labels = train.WnvPresent.values

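From here, we are going to clean up our data by:
1. Dropping columns that do not seem helpful for building our predictions
2. Joining the readings from the two weather stations side by side and replacing missing-value markers with -1
3. Extracting month, day, and integer latitude/longitude features
4. Merging the weather data into the train and test sets by date
5. Converting categorical columns to numbers
6. Dropping columns that contain nothing but -1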

In [112]:
# Not using codesum for this benchmark
weather = weather.drop('CodeSum', axis=1)

In [113]:
# Split station 1 and 2 and join horizontally
weather_stn1 = weather[weather['Station']==1]
weather_stn2 = weather[weather['Station']==2]
weather_stn1 = weather_stn1.drop('Station', axis=1)
weather_stn2 = weather_stn2.drop('Station', axis=1)
weather = weather_stn1.merge(weather_stn2, on='Date')
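
The station join above can be sketched on a toy table. The values below are made up for illustration; the point is that columns shared by both stations come out of `merge` with pandas' default `_x` (station 1) and `_y` (station 2) suffixes.

```python
import pandas as pd

# Toy weather table (made-up values): two stations reporting on the same dates.
toy_weather = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Date": ["2007-05-01", "2007-05-01", "2007-05-02", "2007-05-02"],
    "Tmax": [83, 84, 59, 60],
})

# Same steps as the cell above: split by station, drop the station id, join on Date.
stn1 = toy_weather[toy_weather["Station"] == 1].drop("Station", axis=1)
stn2 = toy_weather[toy_weather["Station"] == 2].drop("Station", axis=1)
wide = stn1.merge(stn2, on="Date")

print(wide.columns.tolist())  # ['Date', 'Tmax_x', 'Tmax_y']
```

Each date ends up as a single row holding both stations' readings, which is what lets the later merge with train and test add one set of weather columns per observation.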

In [114]:
# Replace missing-value markers ('M', '-') and 'T' entries (with or without leading spaces) with -1
weather = weather.replace(['M', '-', 'T', ' T', '  T'], -1)

In [115]:
# Functions to extract month and day from dataset
# You can also use parse_dates of Pandas.
def create_month(x):
    return x.split('-')[1]

def create_day(x):
    return x.split('-')[2]

train['month'] = train.Date.apply(create_month)
train['day'] = train.Date.apply(create_day)
test['month'] = test.Date.apply(create_month)
test['day'] = test.Date.apply(create_day)
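
As the comment in the cell notes, pandas can parse the dates directly. A minimal sketch on toy dates, assuming the `YYYY-MM-DD` format that the string splitting above relies on. One difference worth knowing: splitting yields zero-padded strings, while the datetime accessors yield integers.

```python
import pandas as pd

# Toy dates in the same YYYY-MM-DD layout the split-based helpers expect.
toy = pd.DataFrame({"Date": ["2007-05-29", "2007-08-01"]})

# String-splitting approach: the result is a string such as '05'.
toy["month_str"] = toy.Date.apply(lambda x: x.split("-")[1])

# Datetime approach: the result is an integer such as 5.
dates = pd.to_datetime(toy["Date"])
toy["month"] = dates.dt.month
toy["day"] = dates.dt.day

print(toy)
```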

In [116]:
# Add integer latitude/longitude columns
train['Lat_int'] = train.Latitude.apply(int)
train['Long_int'] = train.Longitude.apply(int)
test['Lat_int'] = test.Latitude.apply(int)
test['Long_int'] = test.Longitude.apply(int)

In [117]:
# Drop address columns (and the Id column from the test set)
train = train.drop(['Address', 'AddressNumberAndStreet'], axis = 1)
test = test.drop(['Id', 'Address', 'AddressNumberAndStreet'], axis = 1)

In [118]:
# Merge with weather data
train = train.merge(weather, on='Date')
test = test.merge(weather, on='Date')
train = train.drop(['Date'], axis = 1)
test = test.drop(['Date'], axis = 1)

In [119]:
# Convert categorical data to numbers
lbl = LabelEncoder()
lbl.fit(list(train['Species'].values) + list(test['Species'].values))
train['Species'] = lbl.transform(train['Species'].values)
test['Species'] = lbl.transform(test['Species'].values)

lbl.fit(list(train['Street'].values) + list(test['Street'].values))
train['Street'] = lbl.transform(train['Street'].values)
test['Street'] = lbl.transform(test['Street'].values)

lbl.fit(list(train['Trap'].values) + list(test['Trap'].values))
train['Trap'] = lbl.transform(train['Trap'].values)
test['Trap'] = lbl.transform(test['Trap'].values)
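
The encoder above is fit on the train and test values together. A small sketch of why that matters, using toy category values chosen for illustration: an encoder fit on train alone raises an error when test contains a category it has never seen.

```python
from sklearn.preprocessing import LabelEncoder

# Toy category values for illustration only.
train_species = ["CULEX PIPIENS", "CULEX RESTUANS"]
test_species = ["CULEX PIPIENS", "CULEX TERRITANS"]  # second value never appears in train

# Fitting on train + test gives every category an integer code.
enc = LabelEncoder()
enc.fit(train_species + test_species)
test_codes = enc.transform(test_species)
print(list(enc.classes_))
print(test_codes)

# Fitting on train alone fails when test holds an unseen category.
train_only = LabelEncoder().fit(train_species)
try:
    train_only.transform(test_species)
    unseen_error = None
except ValueError as err:
    unseen_error = err
print(unseen_error)
```

The integer codes follow the sorted order of the category names, so they carry no meaning beyond identity.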

In [120]:
# Drop columns in which every value is -1 (keep any column with at least one other value)
train = train.loc[:, (train != -1).any(axis=0)]
test = test.loc[:, (test != -1).any(axis=0)]
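
To make the behavior of this filter concrete, here is a toy frame with hypothetical column names. `(df != -1).any(axis=0)` is true for any column holding at least one value other than -1, so only columns that are -1 throughout are removed.

```python
import pandas as pd

# Toy frame: one column is entirely -1, one is partly -1, one has no -1 at all.
toy = pd.DataFrame({
    "all_missing": [-1, -1, -1],
    "some_missing": [-1, 0.5, 2.0],
    "complete": [1, 2, 3],
})

kept = toy.loc[:, (toy != -1).any(axis=0)]
print(kept.columns.tolist())  # ['some_missing', 'complete']
```

Columns that are only partly -1 survive, so the model still sees -1 as a placeholder value in those columns.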

In [122]:
X = train.drop(columns=["WnvPresent"])
y = train["WnvPresent"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

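# Scale our data: fit the scaler on the training split only,
# then reuse its statistics to transform the held-out split
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)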
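
A note on scaling held-out data: the usual convention is to learn the mean and standard deviation from the training split and reuse them everywhere else. The toy arrays below (made-up numbers) show how refitting on the held-out split puts it on a different scale from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature splits: the training split has mean 5 and std 5.
X_tr = np.array([[0.0], [10.0]])
X_te = np.array([[10.0], [20.0]])

scaler = StandardScaler().fit(X_tr)

# Reusing the training statistics keeps both splits on one scale.
reused = scaler.transform(X_te)                # [[1.], [3.]]

# Refitting on the held-out split recenters it around its own mean.
refit = StandardScaler().fit_transform(X_te)   # [[-1.], [1.]]

print(reused.ravel(), refit.ravel())
```

With refitting, the raw value 10.0 maps to 1.0 in the training split but to -1.0 in the held-out split, so a model trained on one would misread the other.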

In [127]:
grid = ParameterGrid({"max_samples": [0.5, 1.0],
                      "max_features": [0.5, 1.0],
                      "bootstrap": [True, False],
                      "bootstrap_features": [True, False]})

# Try every parameter combination with each base estimator
# (None means BaggingClassifier falls back to its default decision tree)
for base_estimator in [None, DecisionTreeClassifier(), RandomForestClassifier()]:
    for params in grid:
        br = BaggingClassifier(base_estimator=base_estimator,
                               **params).fit(X_train_scaled, y_train)
        preds = br.predict(X_train_scaled)
        test_preds = br.predict(X_test_scaled)
        print(f"Scores for {str(base_estimator)} and {params}")
        print(f"RocAuc for train data: {roc_auc_score(y_train, preds)}")
        print(f"RocAuc for test data: {roc_auc_score(y_test, test_preds)}")
        print("\n" * 4)  # five blank lines between result blocks

Scores for None and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 0.5}
RocAuc for train data: 0.6630602549744463
RocAuc for test data: 0.532835853859193





Scores for None and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 1.0}
RocAuc for train data: 0.7854831729419188
RocAuc for test data: 0.5370652576016077





Scores for None and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 0.5}
RocAuc for train data: 0.677540576880511
RocAuc for test data: 0.5245758173074269





Scores for None and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 1.0}
RocAuc for train data: 0.8697013845528696
RocAuc for test data: 0.5479438251257501





Scores for None and {'bootstrap': True, 'bootstrap_features': False, 'max_features': 0.5, 'max_samples': 0.5}
RocAuc for train data: 0.6979626491177646
RocAuc for test data: 0.5344467266292315





Scores for None 

Scores for DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best') and {'bootstrap': False, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 0.5}
RocAuc for train data: 0.7866508422614032
RocAuc for test data: 0.5439145726701009





Scores for DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best') and {'bootstrap': False, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 1.0}
Ro

Scores for RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False) and {'bootstrap': False, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 1.0}
RocAuc for train data: 0.830596603203864
RocAuc for test data: 0.5332389171401279





Scores for RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=No
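
One thing to keep in mind when reading these scores: ROC AUC measures how well positives are ranked above negatives, so it is usually computed from probabilities rather than hard class labels. The loop above scores the output of `predict`. A minimal sketch with made-up numbers shows how the two can differ; in this notebook the analogous call would be `br.predict_proba(X_test_scaled)[:, 1]`, which has not been run here.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class probabilities for illustration.
y_true = [0, 0, 1, 1]
proba = [0.1, 0.3, 0.2, 0.4]

# Hard labels at a 0.5 cutoff, as predict() would return: every row becomes 0 here.
hard = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc_score(y_true, proba)   # 0.75
auc_from_labels = roc_auc_score(y_true, hard)   # 0.5

print(auc_from_proba, auc_from_labels)
```

When the positive class is rare and most predicted labels are 0, the label-based score collapses toward 0.5 even if the underlying probabilities rank the rows reasonably well.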

In [72]:
# X = train.drop(columns=["NumMosquitos"])
# y = train["NumMosquitos"]

# test_preds = BaggingRegressor(
#     base_estimator=DecisionTreeRegressor(),
#     bootstrap=False,
#     bootstrap_features=False,
#     max_features=1.0,
#     max_samples=1.0).fit(X, y).predict(test)

# test["NumMosquitos"] = test_preds

In [15]:
# pca = PCA(n_components=4)

# pca_train = pca.fit_transform(train)
# pca_test = pca.transform(test)

In [16]:
# np.cumsum(pca.explained_variance_ratio_)

In [60]:
# # Random Forest Classifier 
# clf = RandomForestClassifier(n_estimators=1000)
# clf.fit(pca_train, labels)

# # create predictions and submission file
# predictions = clf.predict_proba(pca_test)[:,1]
# sample['WnvPresent'] = predictions
# file_name = 'beat_the_benchmark.csv'
# sample.to_csv(file_name, index=False)

Here is how the features were weighted by the RandomForestClassifier:

In [30]:
# pd.DataFrame({'Feature':test.columns,'Weight':sorted(clf.feature_importances_, reverse=True)})
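
When building a table like this, each weight has to stay attached to its feature name, which means sorting the rows together rather than sorting the weights on their own. A minimal sketch on toy data, assuming the classifier is fit directly on named columns (the names `noise` and `signal` are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: 'signal' determines the label, 'noise' is constant and carries no information.
X_toy = pd.DataFrame({
    "noise": [1, 1, 1, 1, 1, 1, 1, 1],
    "signal": [0, 0, 1, 1, 0, 1, 0, 1],
})
y_toy = [0, 0, 1, 1, 0, 1, 0, 1]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Pair names with importances first, then sort the rows so the pairs stay intact.
weights = (
    pd.DataFrame({"Feature": X_toy.columns, "Weight": forest.feature_importances_})
    .sort_values("Weight", ascending=False)
    .reset_index(drop=True)
)
print(weights)
```

Note that if the classifier is fit on PCA output instead, the importances describe the principal components, not the original columns.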

## Submit to Kaggle

I am using the [Kaggle CLI](http://wiki.fast.ai/index.php/Kaggle_CLI) to submit my predictions. Once the submission has uploaded successfully, the browser opens the Kaggle leaderboard page so I can check the scores.

In [31]:
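import subprocess, webbrowser
# file_name is defined in the submission cell above, which must be run (uncommented) first
# Pass the command as a list of arguments so it runs without a shell on any platform
result = subprocess.check_output(['kaggle', 'competitions', 'submit', '-f', file_name,
                                  '-m', 'uploading a new set', 'predict-west-nile-virus'])
webbrowser.open("https://www.kaggle.com/c/predict-west-nile-virus/leaderboard")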

NameError: name 'file_name' is not defined