# Project 4: West Nile Virus Prediction

## Problem Statement

Following the recent epidemic of West Nile Virus (WNV) in the Windy City, the Department of Public Health has set up a surveillance and control system, and we hope to learn from the mosquito population as data is collected over time. Pesticides are a necessary evil in the fight for public health and safety, and an expensive one, so we need an effective plan for deploying them throughout the city. Our goal is to predict the areas where West Nile Virus will be present and to evaluate the costs and benefits of spraying.

## Contents:
- [Imports and Downloads](#Imports-and-Downloads)
- [Train Data Cleaning and Engineering](#Train-Data-Cleaning-and-Engineering)
- [Reading and Data Cleaning for Weather Data](#Reading-and-Data-Cleaning-for-Weather-Data)
- [Combine Train and Weather Data](#Combine-Train-and-Weather-Data)
- [Test Data Cleaning and Engineering](#Test-Data-Cleaning-and-Engineering)
- [Combine Test and Weather Data](#Combine-Test-and-Weather-Data)
- [Combine Train and Test Data to Perform Encoding](#Combine-Train-and-Test-Data-to-Perform-Encoding)
- [Split back into Train and Test Data](#Split-back-into-Train-and-Test-Data)
- [Run Classification Models to Predict if WNV is Present](#Run-Classfication-Models-to-Predict-if-WNV-is-Present)
- [Perform Predictions for Submission to Kaggle](#Perform-Predictions-for-Submission-to-Kaggle)

## Imports and Downloads

In [41]:
#Imports:
import pandas as pd
from tqdm import tqdm
import requests
import time
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as stats
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

In [4]:
pip install geopy

Successfully installed geographiclib-1.49 geopy-1.20.0
Note: you may need to restart the kernel to use updated packages.


## Train Data Cleaning and Engineering

In [42]:
train = pd.read_csv('./train.csv')
train.Date = pd.to_datetime(train.Date)

**Find Distances from the location to the weather stations**

Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT Lat: 41.995 Lon: -87.933 Elev: 662 ft. above sea level<br>
Station 2: CHICAGO MIDWAY INTL ARPT Lat: 41.786 Lon: -87.752 Elev: 612 ft. above sea level

In [43]:
import geopy.distance as gp

coord1 = (41.995,-87.933)  #coordinates for Chicago O'hare Airport (station 1)
coord2 = (41.786,-87.752)  #coordinates for Chicago Midway Airport (station 2)

dist_fr_1 = []
dist_fr_2 = []

for i in range(len(train.Latitude)):
    coord = (train.Latitude[i],train.Longitude[i])
    dist_fr_1.append(gp.distance(coord1, coord).km)
    dist_fr_2.append(gp.distance(coord2, coord).km)

train['dist_fr_1'] = dist_fr_1
train['dist_fr_2'] = dist_fr_2
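
As a rough cross-check on the distances computed above, the sketch below uses the haversine (great-circle) formula from the standard library instead of `geopy`. This is a swapped-in approximation, not the method used in this notebook: it assumes a spherical Earth with a mean radius of 6371 km, and `haversine_km` is a hypothetical helper name. The trap coordinates are those shown in the test data preview further down.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_km(coord_a, coord_b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, coord_a)
    lat2, lon2 = map(math.radians, coord_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

ohare = (41.995, -87.933)       # station 1
midway = (41.786, -87.752)      # station 2
trap = (41.95469, -87.800991)   # trap location from the test data preview

dist_to_ohare = haversine_km(ohare, trap)
dist_to_midway = haversine_km(midway, trap)
print(round(dist_to_ohare, 2), round(dist_to_midway, 2))
```

The two values should land close to the `dist_fr_1` and `dist_fr_2` entries reported for that location, with small differences expected because `geopy` works on an ellipsoid rather than a sphere.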

Drop address and location specific columns as this information is already captured in the distance from each weather station.

In [44]:
train = train.drop(['Address','Block','Street','AddressNumberAndStreet','Latitude','Longitude'],axis=1)
train.head()

Unnamed: 0,Date,Species,Trap,AddressAccuracy,NumMosquitos,WnvPresent,dist_fr_1,dist_fr_2
0,2007-05-29,CULEX PIPIENS/RESTUANS,T002,9,1,0,11.822004,19.172879
1,2007-05-29,CULEX RESTUANS,T002,9,1,0,11.822004,19.172879
2,2007-05-29,CULEX RESTUANS,T007,9,1,0,13.56547,23.257125
3,2007-05-29,CULEX PIPIENS/RESTUANS,T015,8,1,0,9.261595,21.747902
4,2007-05-29,CULEX RESTUANS,T015,8,4,0,9.261595,21.747902


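`NumMosquitos` is not available in the test data set, so it cannot be used as a feature. The cell below only records the remaining column names; the column itself is dropped from the training frame just before encoding.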

In [45]:
cols = list(train.drop(columns=["NumMosquitos"]).columns)

## Reading and Data Cleaning for Weather Data

In [46]:
weather = pd.read_csv('./weather.csv')
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,...,,0,M,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,...,,M,M,M,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,...,BR,0,M,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,...,BR HZ,M,M,M,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,...,,0,M,0.0,0.0,29.39,30.12,11.7,7,11.9


Replace missing weather information based on the average values of approximately 5 days before and after.

In [47]:
weather.iloc[[87, 1745, 2067], 21] = (7.48, 7.61, 6.37) # AvgSpeed
weather.iloc[[87, 848, 2410, 2411], 17] = (29.37, 29.09, 29.30, 29.30) # StnPressure
weather.iloc[[848, 2410, 2412, 2415], 7] = (71.89, 64.67, 64.67, 59.44) # WetBulb
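
The hard-coded replacements above are described as averages over roughly five days on either side of each gap. A minimal sketch of that kind of windowed mean follows, using made-up readings and a hypothetical helper `window_mean`. How the notebook's actual values were derived is not shown in the source, so this is only an illustration of the idea.

```python
def window_mean(values, idx, k=5):
    """Mean of the non-missing values within k positions either side of idx."""
    lo = max(0, idx - k)
    hi = min(len(values), idx + k + 1)
    neighbours = [v for j, v in enumerate(values[lo:hi], start=lo)
                  if j != idx and v is not None]
    if not neighbours:
        raise ValueError("no non-missing neighbours in the window")
    return sum(neighbours) / len(neighbours)

# Hypothetical daily wind speeds with one missing reading (None) at position 3.
speeds = [7.0, 8.0, 6.0, None, 9.0, 7.0, 8.0]
filled = window_mean(speeds, 3)
print(filled)
```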

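Compute the length of each day in minutes from the sunrise and sunset times. Rows alternate between the two stations, so the value computed from the first row of each pair is reused for the second.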

In [48]:
def timediff(a,b):
    amin = int(a[:-2])*60 + int(a[-2:])
    bmin = int(b[:-2])*60 + int(b[-2:])
    return abs(amin-bmin)

daylen = []
for i in range(0,len(weather.Sunrise),2):
    diff = timediff(weather.Sunset[i],weather.Sunrise[i])
    daylen.append(diff)
    daylen.append(diff)
weather['daylen']=daylen
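
A worked example of the conversion may help: times are stored as `HHMM` strings, so the last two characters are minutes and everything before them is hours. The sunrise and sunset values below are illustrative and not taken from the data.

```python
def timediff(a, b):
    """Absolute difference in minutes between two 'HHMM' strings."""
    amin = int(a[:-2]) * 60 + int(a[-2:])
    bmin = int(b[:-2]) * 60 + int(b[-2:])
    return abs(amin - bmin)

# Illustrative values: sunrise at 04:48, sunset at 18:49.
day_minutes = timediff("1849", "0448")
print(day_minutes)  # 841 minutes, i.e. 14 h 1 min
```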

A `PrecipTotal` of T indicates trace amounts of precipitation. Since the smallest recorded precipitation value is 0.01, trace amounts are set to 0.001. Missing values (M) are assumed to be 0, indicating no precipitation.


In [49]:
weather.PrecipTotal = weather.PrecipTotal.replace('  T',0.001)
weather.PrecipTotal = weather.PrecipTotal.replace('M',0)
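
The same mapping can be sketched as a single function applied to one raw entry. `parse_precip` is a hypothetical helper, and stripping surrounding whitespace is an assumption made here so that differently padded trace markers are handled alike.

```python
TRACE = 0.001  # chosen to sit below the minimum recorded value of 0.01

def parse_precip(raw):
    """Convert a raw PrecipTotal entry to a float."""
    value = str(raw).strip()
    if value == "T":   # trace amount
        return TRACE
    if value == "M":   # missing, assumed to mean no precipitation
        return 0.0
    return float(value)

samples = ["0.00", "  T", "M", "0.12"]
parsed = [parse_precip(s) for s in samples]
print(parsed)
```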

Missing values of `Tavg` are computed as the mean of the minimum and maximum temperatures.

In [50]:
# Assign through .loc so the value is written to the DataFrame itself, not to a possible copy.
for i in range(len(weather.Tavg)):
    if weather.Tavg[i]=='M':
        weather.loc[i, 'Tavg'] = round((weather.Tmax[i]+weather.Tmin[i])/2)


Missing `SeaLevel` values are taken from the other station's reading on the same day.

In [51]:
# np.nanmean ignores the missing entry, leaving the other station's reading for that date.
for index, row in weather[weather['SeaLevel']=='M'].iterrows():
    same_day = pd.to_numeric(weather[weather['Date']==row['Date']]['SeaLevel'], errors='coerce')
    weather.loc[index, 'SeaLevel'] = np.nanmean(same_day)


For `CodeSum`, each weather phenomenon is encoded as 1 when it is present and 0 otherwise.

In [52]:
cols = ["+FC", "FC", "TS", "GR", "RA","DZ","SN","SG","GS","PL","IC","FG+","FG","BR","UP","HZ","FU",
"VA","DU","DS","PO","SA","SS","PY","SQ","DR","SH","FZ","MI","PR","BC","BL","VC"]

for col in cols:
    weather[col] = 0
    
weather.CodeSum = weather.CodeSum.str.rsplit()
for i in weather[cols]:
    # Series.items() replaces iteritems(), which newer pandas versions no longer provide.
    for index, word in weather.CodeSum.items():
        for code in word:
            if code == i:
                weather.loc[index, i] = 1
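
The encoding above flags a code only when it appears as a whole whitespace-separated token. A minimal standalone sketch of that rule, with a hypothetical helper `encode_codesum` and a shortened code list:

```python
codes = ["TS", "RA", "BR", "HZ"]  # a subset of the codes listed above

def encode_codesum(codesum, codes):
    """Return {code: 0/1}, flagging codes that appear as whole tokens."""
    tokens = set(codesum.split())
    return {code: int(code in tokens) for code in codes}

row_flags = encode_codesum("BR HZ", codes)
# Hypothetical input with two codes written together as one token.
combined_flags = encode_codesum("TSRA BR", codes)
print(row_flags)
print(combined_flags)
```

Under exact token matching, a concatenated entry such as the hypothetical `TSRA` sets neither `TS` nor `RA`. Whether the raw `CodeSum` column contains such tokens is worth checking before relying on these indicator columns.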

#### Drop columns and change data types to numeric.

**Dropped columns**

Heating/Cooling:
Degrees below/above a base temperature of 65F. This information is already captured in Tmin, Tmax and Tavg.

Temperature Departure:
Difference from the 30-year average temperature, already represented in the Tavg feature.

Depth/SnowFall/Water1:
Zero, missing, or insignificant (trace) values. The relevant information is captured in PrecipTotal.

Sunrise/Sunset:
Information captured in the length of day (daylen).

CodeSum:
Information captured in the indicator columns created above.


In [53]:
weather = weather.drop(['Depart','Heat','Cool','Depth','Water1','SnowFall','Sunrise','Sunset','CodeSum'],axis=1)
weather.Date = pd.to_datetime(weather.Date)
for col in ['Tavg','WetBulb','PrecipTotal','StnPressure','SeaLevel','AvgSpeed']:
    weather[col] = pd.to_numeric(weather[col])
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,PrecipTotal,StnPressure,SeaLevel,...,PY,SQ,DR,SH,FZ,MI,PR,BC,BL,VC
0,1,2007-05-01,83,50,67.0,51,56.0,0.0,29.1,29.82,...,0,0,0,0,0,0,0,0,0,0
1,2,2007-05-01,84,52,68.0,51,57.0,0.0,29.18,29.82,...,0,0,0,0,0,0,0,0,0,0
2,1,2007-05-02,59,42,51.0,42,47.0,0.0,29.38,30.09,...,0,0,0,0,0,0,0,0,0,0
3,2,2007-05-02,60,43,52.0,42,47.0,0.0,29.44,30.08,...,0,0,0,0,0,0,0,0,0,0
4,1,2007-05-03,66,46,56.0,40,48.0,0.0,29.39,30.12,...,0,0,0,0,0,0,0,0,0,0


## Combine Train and Weather Data

In [54]:
#create new weather dataframe for train data. 
#Weather for each entry is weighted by the distance from the stations where the weather was recorded

wdf = pd.DataFrame()
for i in range(len(train.Date)):
    dateweather = weather[weather.Date == train.Date[i]].drop(['Date','Station'],axis=1)
    weightedw = (dist_fr_2[i] * dateweather.iloc[0].astype(float) +
             dist_fr_1[i] * dateweather.iloc[1].astype(float))/(dist_fr_1[i] + dist_fr_2[i])
    wdf[i] = weightedw
wdf = wdf.T
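
The weighting in the cell above multiplies each station's reading by the distance to the *other* station, so the nearer station contributes more. A minimal sketch of the formula for a single feature follows. The distances are those of the first training row, while the two station readings are illustrative values rather than numbers read from `weather.csv`, and `distance_weighted` is a hypothetical helper.

```python
def distance_weighted(value_1, value_2, dist_1, dist_2):
    """Blend two station readings; the nearer station gets the larger weight."""
    return (dist_2 * value_1 + dist_1 * value_2) / (dist_1 + dist_2)

# Distances (km) of the first training row from stations 1 and 2.
dist_1, dist_2 = 11.822004, 19.172879
# Illustrative station readings (assumed for this example).
tmin_1, tmin_2 = 60.0, 65.0

blended = distance_weighted(tmin_1, tmin_2, dist_1, dist_2)
print(round(blended, 3))
```

Because the location is closer to station 1, the blended value sits nearer to station 1's reading than to station 2's. A location exactly at a station (distance 0) takes that station's reading unchanged.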

In [55]:
#join train and weather dataframes.
finaltrain = train.join(wdf)
finaltrain

Unnamed: 0,Date,Species,Trap,AddressAccuracy,NumMosquitos,WnvPresent,dist_fr_1,dist_fr_2,Tmax,Tmin,...,PY,SQ,DR,SH,FZ,MI,PR,BC,BL,VC
0,2007-05-29,CULEX PIPIENS/RESTUANS,T002,9,1,0,11.822004,19.172879,88.000000,61.907090,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2007-05-29,CULEX RESTUANS,T002,9,1,0,11.822004,19.172879,88.000000,61.907090,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2007-05-29,CULEX RESTUANS,T007,9,1,0,13.565470,23.257125,88.000000,61.842004,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2007-05-29,CULEX PIPIENS/RESTUANS,T015,8,1,0,9.261595,21.747902,88.000000,61.493348,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2007-05-29,CULEX RESTUANS,T015,8,4,0,9.261595,21.747902,88.000000,61.493348,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2007-05-29,CULEX RESTUANS,T045,8,2,0,23.553811,16.652337,88.000000,62.929130,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2007-05-29,CULEX RESTUANS,T046,8,1,0,25.817125,14.209564,88.000000,63.224989,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2007-05-29,CULEX PIPIENS/RESTUANS,T048,8,1,0,27.136705,12.129938,88.000000,63.455440,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2007-05-29,CULEX RESTUANS,T048,8,2,0,27.136705,12.129938,88.000000,63.455440,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2007-05-29,CULEX RESTUANS,T049,8,1,0,25.509596,14.650453,88.000000,63.175992,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Test Data Cleaning and Engineering

In [165]:
test = pd.read_csv('./test.csv')
test = test.drop(['Address','Block','Street','Trap','AddressNumberAndStreet'],axis=1)

In [166]:
#distance of each data entry from each weather station.
coord1 = (41.995,-87.933)  #coordinates for Chicago O'hare Airport (station 1)
coord2 = (41.786,-87.752)  #coordinates for Chicago Midway Airport (station 2)

dist_fr_1 = []
dist_fr_2 = []

for i in range(len(test.Latitude)):
    coord = (test.Latitude[i],test.Longitude[i])
    dist_fr_1.append(gp.distance(coord1, coord).km)
    dist_fr_2.append(gp.distance(coord2, coord).km)

test['dist_fr_1'] = dist_fr_1
test['dist_fr_2'] = dist_fr_2

In [167]:
test.head()

Unnamed: 0,Id,Date,Species,Latitude,Longitude,AddressAccuracy,dist_fr_1,dist_fr_2
0,1,2008-06-11,CULEX PIPIENS/RESTUANS,41.95469,-87.800991,9,11.822004,19.172879
1,2,2008-06-11,CULEX RESTUANS,41.95469,-87.800991,9,11.822004,19.172879
2,3,2008-06-11,CULEX PIPIENS,41.95469,-87.800991,9,11.822004,19.172879
3,4,2008-06-11,CULEX SALINARIUS,41.95469,-87.800991,9,11.822004,19.172879
4,5,2008-06-11,CULEX TERRITANS,41.95469,-87.800991,9,11.822004,19.172879


## Combine Test and Weather Data

In [59]:
#create new weather dataframe for test data entries

wdftest = pd.DataFrame()
for i in range(len(test.Date)):
    dateweather = weather[weather.Date == test.Date[i]].drop(['Date','Station'],axis=1)
    weightedw = (test.dist_fr_2[i] * dateweather.iloc[0].astype(float) +
             test.dist_fr_1[i] * dateweather.iloc[1].astype(float))/(test.dist_fr_1[i] + test.dist_fr_2[i])
    wdftest[i] = weightedw
wdftest = wdftest.T

In [168]:
#combine test data with the weather data corresponding to its entries.
finaltest = test.join(wdftest)
finaltest.drop(['Id','Latitude','Longitude'],axis=1,inplace=True)

#add in month data
test = pd.read_csv('./test.csv')
finaltest['Month'] = test['Date'].apply(lambda x: x.split("-")[1])

#Test data source set to be 0.
finaltest['Source']=0
finaltest.columns

Index(['Date', 'Species', 'AddressAccuracy', 'dist_fr_1', 'dist_fr_2', 'Tmax',
       'Tmin', 'Tavg', 'DewPoint', 'WetBulb', 'PrecipTotal', 'StnPressure',
       'SeaLevel', 'ResultSpeed', 'ResultDir', 'AvgSpeed', 'daylen', '+FC',
       'FC', 'TS', 'GR', 'RA', 'DZ', 'SN', 'SG', 'GS', 'PL', 'IC', 'FG+', 'FG',
       'BR', 'UP', 'HZ', 'FU', 'VA', 'DU', 'DS', 'PO', 'SA', 'SS', 'PY', 'SQ',
       'DR', 'SH', 'FZ', 'MI', 'PR', 'BC', 'BL', 'VC', 'Month', 'Source'],
      dtype='object')

## Combine Train and Test Data to Perform Encoding

In [None]:
target = finaltrain.WnvPresent

In [None]:
finaltrain = finaltrain.drop(['Trap','NumMosquitos','WnvPresent'],axis=1)

In [149]:
train = pd.read_csv('./train.csv')
finaltrain['Month'] = train['Date'].apply(lambda x: x.split("-")[1])

#Train data source set to be 1.
finaltrain['Source']=1
finaltrain.columns

Index(['Date', 'Species', 'AddressAccuracy', 'dist_fr_1', 'dist_fr_2', 'Tmax',
       'Tmin', 'Tavg', 'DewPoint', 'WetBulb', 'PrecipTotal', 'StnPressure',
       'SeaLevel', 'ResultSpeed', 'ResultDir', 'AvgSpeed', 'daylen', '+FC',
       'FC', 'TS', 'GR', 'RA', 'DZ', 'SN', 'SG', 'GS', 'PL', 'IC', 'FG+', 'FG',
       'BR', 'UP', 'HZ', 'FU', 'VA', 'DU', 'DS', 'PO', 'SA', 'SS', 'PY', 'SQ',
       'DR', 'SH', 'FZ', 'MI', 'PR', 'BC', 'BL', 'VC', 'Source', 'Month'],
      dtype='object')

In [186]:
#combine test and train data and perform encoding for the Species and Month.

# finaltrain and finaltest list 'Month' and 'Source' in different orders.
# sort=True keeps the sorted column order used in this run and avoids the pandas warning.
final = pd.concat([finaltrain, finaltest], ignore_index=True, sort=True)
final = pd.concat([final, pd.get_dummies(final.Species)], axis = 1)
final = pd.concat([final, pd.get_dummies(final.Month)], axis = 1)
final.drop(['Species','Date','Month'],axis=1,inplace=True)


## Split back into Train and Test Data

In [187]:
final_test = final[final.Source==0].reset_index().drop(['Source','index'],axis=1)
final_train = final[final.Source==1].drop(['Source'],axis=1)
final_test.to_csv('./finaltest.csv',index=False)
final_train.to_csv('./finaltrain.csv',index=False)

In [65]:
final_test = pd.read_csv('./finaltest.csv')
final_train = pd.read_csv('./finaltrain.csv')

## Run Classfication Models to Predict if WNV is Present

In [374]:
pip install imblearn


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier, AdaBoostClassifier)
from sklearn.svm import SVC
from imblearn.ensemble import BalancedBaggingClassifier

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

### Scale data and Split into train and test data

In [188]:
scaled = StandardScaler().fit_transform(final_train)

In [189]:
X_train, X_test, y_train, y_test = train_test_split(scaled, target, test_size=0.25, random_state=42)
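
The cells above standardise all rows before splitting, so the held-out rows contribute to the mean and standard deviation used for scaling. An alternative, sketched here with the standard library and made-up numbers, is to estimate the statistics on the training split only and reuse them on the held-out split. `fit_scaler` and `apply_scaler` are hypothetical helpers, not scikit-learn functions.

```python
import statistics

def fit_scaler(column):
    """Return (mean, population std) estimated from the training values only."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    # Guard against constant columns to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)

def apply_scaler(column, mean, std):
    """Standardise values with previously fitted statistics."""
    return [(x - mean) / std for x in column]

train_col = [60.0, 62.0, 64.0, 66.0]   # made-up training values
heldout_col = [63.0, 70.0]             # made-up held-out values

mean, std = fit_scaler(train_col)
train_scaled = apply_scaler(train_col, mean, std)
heldout_scaled = apply_scaler(heldout_col, mean, std)
print(train_scaled, heldout_scaled)
```

With scikit-learn the same idea would be to call `fit` on the training split and `transform` on each split, so the held-out rows never influence the scaling parameters.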

In [129]:
1-target.mean()

0.9475537787930707

### Run Classification Models and Score on Test Data

In [208]:
bbc = BalancedBaggingClassifier().fit(X_train,y_train)
print('score:', bbc.score(X_test,y_test))
tn, fp, fn, tp = confusion_matrix(y_test, bbc.predict(X_test)).ravel()
print('True Negative:',tn)
print('False Positive:',fp)
print('False Negative:',fn)
print('True Positive:',tp)
roc_auc_score(y_test, bbc.predict(X_test))

score: 0.8016749143509707
True Negative: 2026
False Positive: 467
False Negative: 54
True Positive: 80


0.7048452083744934

In [191]:
lr = LogisticRegression(class_weight='balanced').fit(X_train,y_train)
print('score:', lr.score(X_test,y_test))
tn, fp, fn, tp = confusion_matrix(y_test, lr.predict(X_test)).ravel()
print('True Negative:',tn)
print('False Positive:',fp)
print('False Negative:',fn)
print('True Positive:',tp)
roc_auc_score(y_test, lr.predict(X_test))



score: 0.650171298058622
True Negative: 1604
False Positive: 889
False Negative: 30
True Positive: 104


0.7097604636265125

In [218]:
knn = KNeighborsClassifier().fit(X_train,y_train)
print('score:', knn.score(X_test,y_test))
tn, fp, fn, tp = confusion_matrix(y_test, knn.predict(X_test)).ravel()
print('True Negative:',tn)
print('False Positive:',fp)
print('False Negative:',fn)
print('True Positive:',tp)
roc_auc_score(y_test, knn.predict(X_test))

score: 0.9459459459459459
True Negative: 2471
False Positive: 22
False Negative: 120
True Positive: 14


0.5478264513772892

In [219]:
dtc = DecisionTreeClassifier().fit(X_train,y_train)
print('score:',dtc.score(X_test,y_test))
tn, fp, fn, tp = confusion_matrix(y_test,dtc.predict(X_test)).ravel()
print('True Negative:',tn)
print('False Positive:',fp)
print('False Negative:',fn)
print('True Positive:',tp)
roc_auc_score(y_test, dtc.predict(X_test))

score: 0.9265321659687857
True Negative: 2409
False Positive: 84
False Negative: 109
True Positive: 25


0.576436410007723

In [194]:
bc = BaggingClassifier().fit(X_train,y_train)
print('score:', bc.score(X_test,y_test))
tn, fp, fn, tp = confusion_matrix(y_test, bc.predict(X_test)).ravel()
print('True Negative:',tn)
print('False Positive:',fp)
print('False Negative:',fn)
print('True Positive:',tp)
roc_auc_score(y_test, bc.predict(X_test))

score: 0.9421393224210126
True Negative: 2455
False Positive: 38
False Negative: 114
True Positive: 20


0.5670055259203381

In [227]:
rf = RandomForestClassifier(class_weight='balanced').fit(X_train,y_train)
print('score:',rf.score(X_test,y_test))
tn, fp, fn, tp = confusion_matrix(y_test, rf.predict(X_test)).ravel()
print('True Negative:',tn)
print('False Positive:',fp)
print('False Negative:',fn)
print('True Positive:',tp)
roc_auc_score(y_test, rf.predict(X_test))



score: 0.9097830224590788
True Negative: 2363
False Positive: 130
False Negative: 107
True Positive: 27


0.5746732642443618

In [228]:
ab = AdaBoostClassifier().fit(X_train,y_train)
print('score:', ab.score(X_test,y_test))
tn, fp, fn, tp = confusion_matrix(y_test, ab.predict(X_test)).ravel()
print('True Negative:',tn)
print('False Positive:',fp)
print('False Negative:',fn)
print('True Positive:',tp)
roc_auc_score(y_test, ab.predict(X_test))

score: 0.9489912447658927
True Negative: 2493
False Positive: 0
False Negative: 134
True Positive: 0


0.5

In [214]:
svc = SVC(class_weight='balanced').fit(X_train,y_train)
print('score:', svc.score(X_test,y_test))
tn, fp, fn, tp = confusion_matrix(y_test, svc.predict(X_test)).ravel()
print('True Negative:',tn)
print('False Positive:',fp)
print('False Negative:',fn)
print('True Positive:',tp)
roc_auc_score(y_test, svc.predict(X_test))



score: 0.705367339170156
True Negative: 1742
False Positive: 751
False Negative: 23
True Positive: 111


0.7635573636031635
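
Each cell above passes hard class predictions from `predict` to `roc_auc_score`. With 0/1 predictions the ROC curve has a single operating point, and the resulting area equals the average of sensitivity and specificity. The sketch below reproduces the reported values from the confusion-matrix counts; `auc_from_hard_labels` is a hypothetical helper.

```python
def auc_from_hard_labels(tn, fp, fn, tp):
    """Area under the ROC curve when predictions are hard 0/1 labels."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return (sensitivity + specificity) / 2

# Counts printed above for BalancedBaggingClassifier.
bbc_auc = auc_from_hard_labels(tn=2026, fp=467, fn=54, tp=80)
# Counts printed above for AdaBoostClassifier (every row predicted negative).
ab_auc = auc_from_hard_labels(tn=2493, fp=0, fn=134, tp=0)
print(bbc_auc, ab_auc)
```

Scoring with predicted probabilities instead (for example from `predict_proba`, where the model provides it) would generally give a different and more informative AUC, since it evaluates the ranking of rows rather than a single threshold.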

## Perform Predictions for Submission to Kaggle

In [210]:
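# Scale the Kaggle test set with statistics fitted on the training data, so both sets share one scaling.
final_test_scaled = StandardScaler().fit(final_train).transform(final_test)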

In [232]:
predictions = ab.predict(final_test_scaled)

In [233]:
result = pd.DataFrame(np.array(range(1,len(predictions)+1)),columns=['Id'])
result['WnvPresent'] = predictions

In [234]:
result.to_csv('./results_ab2.csv',index=False)

**Results from Kaggle**

lr score: 0.65<br>
svc score: 0.61<br>
bbc score: 0.57<br>
knn score: 0.51<br>
dtc score: 0.51<br>
rf score: 0.5<br>
ab score: 0.5

## Explore coefficients for the Logistic Regression model

In [198]:
coef = pd.DataFrame(lr.coef_, columns = final_train.columns).T

In [199]:
coef.sort_values(0).head()

Unnamed: 0,0
06,-1.877188
CULEX TERRITANS,-0.855579
WetBulb,-0.838532
CULEX SALINARIUS,-0.521833
05,-0.407991


In [200]:
coef.sort_values(0).tail()

Unnamed: 0,0
Tavg,0.517418
AvgSpeed,0.549565
09,0.646901
DewPoint,0.802585
08,0.94089
