<a id = 'Content'><a/>
### Contents:
- [1.0 Imports and Functions](#1.0)
- [2.0 Models](#2.0)
    - [2.1 Random Forest Regular](#2.1)
    - [2.2 Gradient Boost Regular](#2.2)
    - [2.3 PCA model setup](#2.3)
    - [2.4 Gradient Boost w/poly features+PCA](#2.4)
    - [2.5 Log Reg w/poly features+smote+PCA](#2.5)
    - [2.6 Random Forest w/smotetomek+PCA](#2.6)
    - [2.7 Log Reg w/smote](#2.7)
    - [2.8 Gradient Boost w/smote](#2.8)
    - [2.9 Gradient Boost w/smotetomek](#2.9)
    - [2.10 Random Forest w/smote](#2.10)
    - [2.11 XGBoost w/smotetomek](#2.11)
- [3.0 Make Model Table](#3.0)
- [4.0 Model Evaluations](#4.0)
- [5.0 Feature Importances](#5.0)
- [6.0 Cost-Benefit Analysis](#6.0)
- [7.0 Conclusions](#7.0)

<a id = '1.0'><a/>
### 1.0 Imports and Functions
* [Back To Top](#Content) 

In [1]:
#!pip install xgboost

In [2]:
# Necessary imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, recall_score, f1_score, precision_score, confusion_matrix, plot_confusion_matrix
from sklearn.decomposition import PCA
from imblearn.pipeline import Pipeline as Pipeline_imb

from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE


In [None]:
# test = pd.read_csv('../datasets/test.csv', parse_dates=['Date'],index_col=['Id'])
train = pd.read_csv("../datasets/final_train.csv")
test = pd.read_csv("../datasets/final_test.csv", index_col='id')

In [None]:
# this function is the heart of this notebook. it will be used to run all models,
# except the ones that use both PCA and SMOTE

def run_model(clsf, params, kind='reg'):
    """ Input classifier, parameters, and
    kind ('reg' (default), 'pca', 'smote', or 'smotetomek') """

    # 'reg' stands for just a normal model: pipe with only a scaler and a classifier
    if kind == 'reg':
        pipe = Pipeline([
            ('sc', StandardScaler()),
            ('clsf', clsf)
        ])

    # pipe for models with SMOTE
    elif kind == 'smote':
        pipe = Pipeline_imb([
            ('sc', StandardScaler()),
            ('smpl', SMOTE(sampling_strategy='auto', random_state=42)),
            ('clsf', clsf)
        ])

    # pipe for models with SMOTETomek
    elif kind == 'smotetomek':
        pipe = Pipeline_imb([
            ('sc', StandardScaler()),
            ('smpl', SMOTETomek(random_state=42)),
            ('clsf', clsf)
        ])

    # pipe for models with PCA
    elif kind == 'pca':
        pipe = Pipeline([
            ('sc', StandardScaler()),
            ('pca', PCA(n_components=50, random_state=42)),
            ('clsf', clsf)
        ])

    # unknown kind: stop here, otherwise the grid search below would hit an undefined pipe
    else:
        print("Try again")
        return

    # initiate a gridsearch
    grid = GridSearchCV(
        pipe,
        param_grid=params,
        scoring='roc_auc',
        cv=4,
        n_jobs=-1,
        verbose=2)

    # only PCA gets fitted with polynomial order 2 data, the rest with regular
    if kind == 'pca':
        grid.fit(Xp_train, y_train)
        pred = grid.predict(Xp_test)
        pred_prob = grid.predict_proba(Xp_test)[:, 1]

    else:
        grid.fit(X_train, y_train)
        pred = grid.predict(X_test)
        pred_prob = grid.predict_proba(X_test)[:, 1]

    # initiate a dictionary called table to store each model's scores and parameters
    table = {'Model': clsf}
    table['Type'] = kind

    table['ROC-AUC'] = roc_auc_score(y_test, pred_prob)
    table['Precision'] = precision_score(y_test, pred)
    table['Recall'] = recall_score(y_test, pred)
    table['F1'] = f1_score(y_test, pred)

    for key, value in grid.best_params_.items():
        table[key] = value


    # quick printout of parameters and confusion matrix, to aid in additional parameter tuning
    print('\n')
    print('='*30)
    print(f"\033[1m {clsf} \033[0m".center(38, "="))
    print('='*30)
    print('')
    print(" Best Parameters:")
    print('-'*30, '\n')
    for key, value in grid.best_params_.items():
        print(key, ':', value)

    print(confusion_matrix(y_test, pred))
    scoring_table.append(table)
    return grid

- Kaggle's test set has over 100k rows, far more than the train/test sets we obtain from train-test-split, so a score on it is the more reliable estimate. Kaggle, however, reports only the ROC-AUC score
- we will therefore use the Kaggle submission for our primary ROC-AUC score. Other scores, unavailable from Kaggle, will come from the test set we split off from the training data
- as we will be submitting every model to Kaggle to get scores, we create the submit_kaggle function in the cell below

In [None]:
# function to submit a model to kaggle for scoring
def submit_kaggle(name, gs):
    """ the function takes filename and gridsearch (best estimator),
    and creates a csv file ready for kaggle's WestNile competition
    """

    submit = pd.DataFrame()
    submit['Id'] = test.index
    submit.set_index('Id', drop=True, inplace=True)
    submit['WnvPresent'] = gs.predict_proba(test)[:, 1]
    submit.to_csv(f'../datasets/{name}2.csv')

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
test.head()

In [None]:
# dropping columns: columns date, species, latlong and year are not part of features. column nummosquitos is not part of
# the test set, so it can't be used in training. column wnvpresent is our dependent variable. finally, column
# stnpressure will be removed due to its collinearity with sealevel, as uncovered during EDA
X = train.drop(['wnvpresent', 'date', 'species',
               'nummosquitos', 'year', 'stnpressure'], axis=1)
y = train['wnvpresent']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

In [None]:
# create scoring_table. it will hold all our model data.
scoring_table=[]

In [None]:
test.shape

In [None]:
set(test.columns)-set(X_test.columns)

In [None]:
test.drop(columns=['date', 'species', 'year', 'stnpressure'], inplace=True)
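After this drop, the Kaggle features should match the training features column for column, since `predict_proba` in `submit_kaggle` receives `test` as-is. Depending on the scikit-learn version, a mismatch in column order is either rejected or silently scored against the wrong features. A minimal sketch of a guard, assuming a hypothetical helper `align_to_training_columns` and toy frames (the column name `tavg` is illustrative, not taken from the dataset):

```python
import pandas as pd


def align_to_training_columns(features, reference_columns):
    """Return `features` with exactly the training columns, in training order."""
    missing = [col for col in reference_columns if col not in features.columns]
    if missing:
        raise ValueError(f"columns missing from the test set: {missing}")
    # selecting by the reference list both reorders and drops extra columns
    return features.loc[:, list(reference_columns)]


# toy stand-ins for X and test; values and column names are made up
X_demo = pd.DataFrame({"tavg": [70, 75], "resultspeed": [5.0, 6.5]})
test_demo = pd.DataFrame({"resultspeed": [4.0], "year": [2014], "tavg": [72]})

aligned_demo = align_to_training_columns(test_demo, X_demo.columns)
print(list(aligned_demo.columns))
```

In the notebook this would be called as `align_to_training_columns(test, X.columns)`, which fails loudly if a training feature is absent from the Kaggle file.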

after splitting the data, we plan to run about a dozen different models using the run_model function above

the section for each model will consist of the following:
- initiate the model and its parameters, including the scaler and, where used, smote and pca
- a brief output showing the essential model info (best parameters and confusion matrix) to guide further tuning
- kaggle submission via the submit_kaggle function
- screenshot of the kaggle score
- the scores and other relevant info get appended to a list. after we run all the models, the list will show us all the stats based on which we will decide what our production model will be.

<a id = '2.0'><a/>
### 2.0 Models
* [Back To Top](#Content) 

- we plan to run about a dozen different models with different classifiers, SMOTE and PCA
- for the sake of simplicity and readability of this notebook, we will hold off on commenting on individual models and their performance until all the models are run

<a id = '2.1'><a/>
### 2.1 Random Forest regular
* [Back To Top](#Content) 

In [None]:
rf = RandomForestClassifier(random_state=42, n_jobs=4)

params_rf = {
    'clsf__n_estimators': [200, 300],
    'clsf__max_depth': [7, 10],
    'clsf__ccp_alpha': [0, 0.01]
}

In [None]:
gs = run_model(rf, params_rf)

In [None]:
plot_confusion_matrix(gs, X_test, y_test)

In [None]:
# make sure the scoring table works
scoring_table

In [None]:
#submit to kaggle
#submit_kaggle('submit_rf', gs)

###  RandomForest Kaggle

<img  src="../images/submit_rf2.png">

<a id = '2.2'><a/>
### 2.2 Gradient Boost regular
* [Back To Top](#Content) 

In [None]:
# Gradient Boost
gb = GradientBoostingClassifier(random_state=42)

params_gb = {
    'clsf__learning_rate': [ 0.01, 0.1],
    'clsf__max_depth': [7, 10],
    'clsf__ccp_alpha':[0, 0.1],
    'clsf__n_estimators':[200, 350]
}

In [None]:
gs_gb=run_model(gb, params_gb)

In [None]:
plot_confusion_matrix(gs_gb, X_test, y_test)

In [None]:
pd.DataFrame(scoring_table).sort_values(by='ROC-AUC', ascending=False)

recall is quite low. smote will be needed.

In [None]:
#submit_kaggle('submit_gb', gs_gb)

###  GradientBoost Kaggle

<img  src="../images/submit_gb.png">

<a id = '2.3'><a/>
### 2.3 PCA model setup
* [Back To Top](#Content) 

Running PCA models will be slightly different:

- first, we will see if any interaction terms look interesting in terms of predictive power, to be used in our other models
- secondly, we will use the 'explained variance' method to see how many PCA components are needed to explain 98%+ of the total variance
- finally, we will use the n_components number we get in step two to run several PCA models using our run_model function

In [None]:
# Poly by power of 2.  
pf = PolynomialFeatures(degree=2, include_bias=True, interaction_only=True )  
# Fit and transform our X data using Polynomial Features.  
X_poly = pf.fit_transform(X)

#transform the real (Kaggle) test set
test_poly = pf.transform(test)
                         
#Train/test split our data.
Xp_train, Xp_test, y_train, y_test = train_test_split(X_poly,
                                                            y,
                                                            stratify=y,
                                                            random_state=42)      

# Instantiate our StandardScaler.
sc = StandardScaler()
# Scale X_train.
Xp_trainsc = sc.fit_transform(Xp_train)
# Scale X_test.
Xp_testsc = sc.transform(Xp_test)
# Scale the real Kaggle test set
testsc = sc.transform(test_poly)

In [None]:
pf.get_feature_names(X.columns)

In [None]:
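# create a new df with polynomial columns, interaction only
# reuse y_train's index so that corrwith below pairs each row with its own label
new = pd.DataFrame(Xp_trainsc, columns = pf.get_feature_names(X.columns), index = y_train.index)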

In [None]:
new.head()

In [None]:
# top polynomial columns with negative correlation to west nile
new.corrwith(y_train).dropna().sort_values()[:10]

In [None]:
# top polynomial columns with positive correlation to west nile
new.corrwith(y_train).dropna().sort_values(ascending=False)[:10]
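`corrwith` pairs rows by index label, not by position, so the frame of polynomial features needs to carry the same index as `y_train` (which keeps its shuffled labels from the split). A small sketch with made-up numbers, not the notebook's data, shows how the result shifts when a frame is rebuilt with a fresh default index:

```python
import pandas as pd

# labels shuffled the way train_test_split leaves them; values are made up
y_demo = pd.Series([0, 1, 0, 1, 1, 0], index=[10, 3, 7, 1, 8, 5])
feature_values = [0.1, 0.9, 0.2, 0.8, 0.7, 0.3]

# rebuilt from a bare array: gets labels 0..5, which only partly overlap y_demo
misaligned = pd.DataFrame({"feat": feature_values})
# same values, but carrying y_demo's labels
aligned = pd.DataFrame({"feat": feature_values}, index=y_demo.index)

shared_labels = misaligned.index.intersection(y_demo.index)
corr_misaligned = misaligned.corrwith(y_demo)["feat"]
corr_aligned = aligned.corrwith(y_demo)["feat"]

print(len(shared_labels), "of", len(y_demo), "rows used when misaligned")
print(corr_misaligned, corr_aligned)
```

In the misaligned case only the labels present in both objects are used, and each is paired with whichever row happens to share its label.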

- none of the interaction columns look like they ought to be included in our other models. the top interaction columns with negative correlation to y are all products involving resultspeed, which itself has a high negative correlation with y. similarly, the top interaction features with positive correlation to y are products involving dumweek_35, which itself has a similarly high positive correlation with y
- next, we set up pca to find out how many components are needed to reach a 98%+ explained variance ratio, and use that n_components in our run_model function

In [None]:
# Instantiate PCA.
pca = PCA(random_state=42)

# Fit and transform PCA for train, transform for test
Z_train = pca.fit_transform(Xp_trainsc)

Z_test = pca.transform(Xp_testsc)
Z_kaggle_test = pca.transform(testsc)


In [None]:
expl_var = np.cumsum(pca.explained_variance_ratio_)[:90]

In [None]:
plt.figure(figsize=(10,7))
plt.plot(range(1, len(expl_var) + 1), expl_var)  # x-axis = number of components
plt.show()

In [None]:
# 50 components give us around 99% of explained variance
# (expl_var is zero-indexed, so the first 50 components end at position 49)
expl_var[49]

- next we will run a few pipelines with PCA with our run_model function using n_components of 50.
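The component count can also be read off programmatically rather than from the plot. A minimal sketch, assuming a hypothetical helper `components_for_variance` and a made-up ratio array standing in for `pca.explained_variance_ratio_`:

```python
import numpy as np


def components_for_variance(variance_ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches `target`."""
    cumulative = np.cumsum(variance_ratios)
    # searchsorted returns a zero-based position, so add 1 to get a count
    return int(np.searchsorted(cumulative, target) + 1)


# made-up ratios for illustration; real values come from the fitted PCA
demo_ratios = np.array([0.5, 0.3, 0.15, 0.05])
n_needed = components_for_variance(demo_ratios, target=0.9)
print(n_needed)
```

With the fitted object from this section, the call would be `components_for_variance(pca.explained_variance_ratio_, 0.98)`.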

<a id = '2.4'><a/>
### 2.4 Gradient Boost w/poly features+PCA
* [Back To Top](#Content)

In [None]:
params_gb

In [None]:
gs_gb_pca=run_model(gb, params_gb, 'pca')

In [None]:
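# submit_kaggle passes the raw test set, but this grid was fit on polynomial
# features, so the submission is built by hand from test_poly
submit_gb_pca = pd.DataFrame()
submit_gb_pca['Id'] = test.index
submit_gb_pca.set_index('Id', drop=True, inplace=True)
submit_gb_pca['WnvPresent'] = gs_gb_pca.predict_proba(test_poly)[:, 1]
submit_gb_pca.to_csv('../datasets/submit_gb_pca2.csv')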

###  Gradient Boost w/poly features and PCA Kaggle

<img src="../images/submit_gbpca.png" alt="not great">

<a id = '2.5'><a/>
### 2.5 Log Reg w/poly features+smote+PCA
* [Back To Top](#Content)

- our run_model function does not handle PCA and SMOTE at the same time, so we'll do it below, 'manually'

In [None]:
lr = LogisticRegression(
    solver='liblinear',
    random_state=42,
)

params_lr = {
    'clsf__penalty': ['l1'],
    'clsf__C': [0.1, 1.5],
    'clsf__max_iter': [200]
}

pipe_lrpca = Pipeline_imb([
            ('pf', PolynomialFeatures(degree=2, include_bias=True, interaction_only=True )),
            ('sc', StandardScaler()),
            ('pca', PCA(n_components=50, random_state=42)),
            ('smpl', SMOTE(sampling_strategy='auto', random_state=42)),
            ('clsf', lr)
        ])



In [None]:
grid_lrpca = GridSearchCV(
    pipe_lrpca,
    param_grid=params_lr,
    scoring='roc_auc',
    cv=4,
    n_jobs=-1,
    verbose=2)

In [None]:
grid_lrpca.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(grid_lrpca, X_test, y_test)

In [None]:
# attach the results to our table list here, since we didn't use the run_model function
pred = grid_lrpca.predict(X_test)
pred_prob = grid_lrpca.predict_proba(X_test)[:, 1]

table = {'Model': lr}
table['Type'] = 'pcasmote'
table['ROC-AUC'] = roc_auc_score(y_test, pred_prob)
table['Precision'] = precision_score(y_test, pred)
table['Recall'] = recall_score(y_test, pred)
table['F1'] = f1_score(y_test, pred)

for key, value in grid_lrpca.best_params_.items():
        table[key] = value

scoring_table.append(table)

In [None]:
#submit_kaggle('submit_lrpca', grid_lrpca)

###  LogReg w/smote and PCA Kaggle

<img src="../images/submit_lrpcasmote.png" alt="not great">

<a id = '2.6'><a/>
### 2.6 Random Forest w/smotetomek+PCA
* [Back To Top](#Content)

In [None]:
rf = RandomForestClassifier(
    random_state=42,
)

params_rfpca = {
    'clsf__n_estimators': [ 200,300],
    'clsf__max_depth': [ 7,10],
    'clsf__min_samples_leaf':[4,10]
}


pipe_rfpca = Pipeline_imb([
            ('pf', PolynomialFeatures(degree=2, include_bias=True, interaction_only=True )),
            ('sc', StandardScaler()),
            ('pca', PCA(n_components=50, random_state=42)),
            ('smpl', SMOTETomek(random_state=42)),
            ('clsf', rf)
        ])

In [None]:
grid_rfpca = GridSearchCV(
        pipe_rfpca,
        param_grid=params_rfpca,
        scoring='roc_auc',
        cv=4,
        n_jobs=-1,
        verbose=2)

In [None]:
grid_rfpca.fit(X_train, y_train)

In [None]:
plot_confusion_matrix(grid_rfpca, X_test, y_test)

In [None]:
#submit_kaggle('submit_rfpca', grid_rfpca)

###  Random Forest w/smotetomek and pca50 Kaggle

<img src="../images/submit_rfsmotepca.png" alt="not great">

In [None]:
pred = grid_rfpca.predict(X_test)
pred_prob = grid_rfpca.predict_proba(X_test)[:, 1]

table = {'Model': rf}
table['Type'] = 'pcasmotetomek'
table['ROC-AUC'] = roc_auc_score(y_test, pred_prob)
table['Precision'] = precision_score(y_test, pred)
table['Recall'] = recall_score(y_test, pred)
table['F1'] = f1_score(y_test, pred)

for key, value in grid_rfpca.best_params_.items():
        table[key] = value

scoring_table.append(table)

<a id = '2.7'><a/>
### 2.7 Log Reg w/smote
* [Back To Top](#Content)

In [None]:
lr = LogisticRegression(
    solver='liblinear',
    random_state=42,
)

params_lr_smote = {
    'clsf__penalty': ['l1', 'l2'],
    'clsf__C': [0.1, 1.5, 10, 40],
    'clsf__max_iter': [50, 200, 1000],
    'smpl__k_neighbors': [3, 5, 7],
}

In [None]:
gs_lr_smote = run_model(lr, params_lr_smote, 'smote')

In [None]:
plot_confusion_matrix(gs_lr_smote, X_test, y_test)

In [None]:
#submit_kaggle('submit_lr_smote', gs_lr_smote)

###  LogReg w/smote Kaggle

<img src="../images/submit_lrsmote2.png" alt="not great">

<a id = '2.8'><a/>
### 2.8 Gradient Boost w/smote
* [Back To Top](#Content)

In [None]:
gb = GradientBoostingClassifier(random_state=42)

params_gb_smote = {
    'clsf__learning_rate': [ 0.05, 0.1],
    'clsf__max_depth': [7, 10],
    'clsf__ccp_alpha':[0, 0.01],
    'clsf__n_estimators':[150, 300],
    'smpl__k_neighbors': [3, 5, 7],
}

In [None]:
gs_gb_smote = run_model(gb, params_gb_smote, 'smote')

In [None]:
#submit_kaggle('submit_gb_smote', gs_gb_smote)

### Gradient Boost w/smote Kaggle

<img src="../images/submit_gbsmote.png" alt="not great">

<a id = '2.9'><a/>
### 2.9 Gradient Boost w/smotetomek
* [Back To Top](#Content)

In [None]:
gb = GradientBoostingClassifier(random_state=42)

params_gb_smotetomek = {
    'clsf__learning_rate': [0.025, 0.05],
    'clsf__max_depth': [7, 10],
    'clsf__ccp_alpha':[0, 0.01],
    'clsf__n_estimators':[150, 250],
    'smpl__sampling_strategy':['all', 'auto']
}

In [None]:
gs_gb_smotetomek = run_model(gb, params_gb_smotetomek, 'smotetomek')

In [None]:
#submit_kaggle('submit_gbsmotetomek', gs_gb_smotetomek)

### Gradient boost w/smotetomek Kaggle

<img src="../images/submit_gbsmotetomek.png" alt="not great">

<a id = '2.10'><a/>
### 2.10 Random Forest w/smote
* [Back To Top](#Content)

In [None]:
rf = RandomForestClassifier(random_state=42, n_jobs=4)

params_rf_smote = {
    'clsf__n_estimators': [150, 250],
    'clsf__max_depth': [ 5, 7],
    'clsf__ccp_alpha':[0, 0.01],
    'smpl__k_neighbors': [3, 5, 7],
}

In [None]:
gs_rf_smote = run_model(rf, params_rf_smote, 'smote')

In [None]:
plot_confusion_matrix(gs_rf_smote, X_test, y_test)

In [None]:
#submit_kaggle('rf_smote', gs_rf_smote)

### Random Forest w/smote Kaggle

<img src="../images/submit_rfsmote2.png" alt="not great">

<a id = '2.11'><a/>
### 2.11 XGBoost w/smotetomek
* [Back To Top](#Content)

In [None]:
xg_smotetomek = XGBClassifier(
    use_label_encoder=False,
    eval_metric='auc',
    objective='binary:logistic',
    random_state=42,
)

xg_smotetomek_params = { 
    'clsf__max_depth': [3, 5],
    'clsf__gamma' : [0.15, 0.25],
    'clsf__learning_rate' : [0.125, 0.2],
    'clsf__n_estimators':[150, 200],
    'clsf__reg_alpha':[5,10],
    'smpl__sampling_strategy':['all', 'auto', 'not minority']
}

In [None]:
gs_xg_smotetomek = run_model(xg_smotetomek, xg_smotetomek_params, kind='smotetomek')

In [None]:
#submit_kaggle('submit_xgsmotetomek', gs_xg_smotetomek)

### XGBoost w/smotetomek Kaggle

<img src="../images/submit_xgsmotetomek.png" alt="not great">

<a id = '3.0'><a/>
### 3.0 Make model table
* [Back To Top](#Content)

In [None]:
Table = pd.DataFrame(scoring_table)

#add Kaggle scores
Table['Kaggle_AUC'] = [0.712, 0.718, 0.648, 0.676, 0.698, 0.686, 0.697, 0.700, 0.716, 0.707]

In [None]:
# change column order for better viewing
Table.insert(6, 'Kaggle-AUC', Table['Kaggle_AUC'])
Table.drop('Kaggle_AUC', axis=1, inplace=True)

In [None]:
Table.index +=1
Table

Formatting the final table output

In [None]:
Table2 = round(Table.loc[:,['Model','Type', 'ROC-AUC', 'Precision', 'Recall', 'F1', 'Kaggle-AUC']], 3)

In [None]:
Table2.loc[10, 'Model'] = 'XGBClassifier(random_state=42)'

In [None]:
def highlight_model(s):
    if s['Kaggle-AUC'] ==0.716:
        return ['background-color: yellow']*7
    else:
        return ['background-color: white']*7


In [None]:
Table2.style.apply(highlight_model, axis=1)

In [None]:
Table.to_csv('../datasets/modeltable2.csv', index=False)

<a id = '4.0'><a/>
### 4.0 Model Evaluations
* [Back To Top](#Content)

**Model conclusions:**

- the real difficulty our models encountered is identifying west-nile carrying mosquitoes correctly, since only about 5% of them carry the virus
- our goal is to avoid false negatives (identifying west-nile mosquitoes as normal). we do not mind a high number of false positives (identifying normal mosquitoes as west-nile carrying ones)
- there is a clear tradeoff between recall and precision. models employing smote score high on recall, as smote trains them to uncover more positives, but that comes at the expense of precision
- nevertheless, we do care about a high recall score, and as such do not mind seeing our precision score dip
- in addition to recall we also care about the overall soundness of the model. out of all the scores, we trust the Kaggle competition score the most: ROC-AUC on completely unseen data with 116k rows
- **we thus choose Random Forest Classifier with SMOTE as our production model**
- it scores among the top two in both Kaggle ROC-AUC and recall, at 0.716 and 0.789 respectively
- we are willing to tolerate low precision in our production model, as identifying as many Wnv mosquitoes as possible is the priority

Production model's confusion matrix: 90 Wnv mosquitoes correctly predicted, only 24 missed
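The 0.789 recall quoted for the production model follows directly from these two counts. A quick check, using only the numbers stated above:

```python
# counts read from the production model's confusion matrix
true_positives = 90   # Wnv cases the model flagged
false_negatives = 24  # Wnv cases the model missed

# recall = share of actual positives that were caught
recall = true_positives / (true_positives + false_negatives)
print(round(recall, 3))
```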

<img src="../images/rf_conf_matrix.png" alt="not great">

<a id = '5.0'><a/>
### 5.0 Feature Importances
* [Back To Top](#Content)

In [None]:
feat_imp = gs_rf_smote.best_estimator_.named_steps['clsf'].feature_importances_

In [None]:
top_features = pd.DataFrame({'top_features': X.columns , 'coef': feat_imp}).sort_values(
    ['coef'], ascending=False)
top_features

In [None]:
fig, ax = plt.subplots(figsize=(8,8))
sns.barplot(x=top_features['coef'] , y = top_features['top_features'])
plt.title('Top Features', fontsize=14, 
            color='Darkorchid')
plt.xlabel(None)
plt.ylabel(None)
plt.show()

- length of the day and mosquito species turned out to have the most influence in predicting the incidence of WnV.
- the days are longest on June 21st, the Summer Solstice, and after that they get shorter, just as our West Nile season gets under way. this implies that any effort to kill mosquitoes should be undertaken in July and early August, just before the peak west nile season in late summer
- as far as species are concerned, this knowledge is useful in predicting and modeling the spread of west nile. Spraying, however, works on all mosquito species indiscriminately.
- some dummy week variables have proven more useful than others. Namely, week28 (the first week in our dummy mosquito season) and weeks 33 and 34 (the peaks of the season) have the most predictive power. this is consistent with our eda finding in notebook 1
- we could have added some polynomial features, but we chose not to in order to maintain interpretability of our model and all of its features
- the fact that the location of the trap (latitude and longitude) does not rank high gives us a clue as to why overall model performance isn't very reliable. Being able to tell which trap would catch Wnv mosquitoes and which wouldn't is not something that this model was able to predict. The question we must ask ourselves is: as mosquitoes fly around the city and are carried by wind, is there a model which can reliably predict which traps in the city, some of them only 1 or 2km away from each other, will record a Wnv mosquito, and which ones will not?

<a id = '6.0'><a/>
### 6.0 Cost-Benefit analysis
* [Back To Top](#Content)

**COSTS**

- the city of Chicago uses Zenivex spray to control its mosquito population. The cost, under some assumptions, comes out to around $200USD per km² per week.



**BENEFITS**

- the benefits are somewhat harder to quantify, and are based on many assumptions (see image below), but we came up with the figure of $160,000USD in productivity and medical costs that would be saved if every single west-nile carrying mosquito were eliminated.

**RESULT**

- 160000/200 gives us a break-even spraying budget of 800km² over the season (counting each km² once per week sprayed). If we further assume that the summer spray season is about 8 weeks long, that gives us an upper limit for cost-effective spraying of 100km² per week. Anything more than 100km² per week and the City would be spending more money trying to eradicate the disease than the disease, even in its most severe 2002 form, brings about in lost productivity and health-care costs.
- 100km² is only 15% of the area of the city of Chicago proper.
- calculations below:

<img src="../images/cb-analysis.png" alt="not great">
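The break-even arithmetic can be reproduced in a few lines. This sketch uses only the three figures quoted in this section; the variable names are ours:

```python
# figures quoted in the cost-benefit section
cost_per_km2_week_usd = 200   # Zenivex spraying cost per km² per week
max_benefit_usd = 160_000     # savings if every west-nile mosquito were eliminated
season_weeks = 8              # assumed length of the summer spray season

# total area-weeks of spraying the benefit can pay for
break_even_km2_weeks = max_benefit_usd / cost_per_km2_week_usd
# spread evenly over the season
break_even_km2_per_week = break_even_km2_weeks / season_weeks

print(break_even_km2_weeks, break_even_km2_per_week)
```

Changing any of the three inputs shows how sensitive the 100km² weekly ceiling is to the underlying assumptions.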

<a id = '7.0'><a/>
### 7.0 Conclusions
* [Back To Top](#Content)

**There are many reasons to suggest not spending resources on spraying:**

- our charts made during the EDA section show no significant decrease in either the mosquito numbers or the incidence of WnV after previous spraying
- under any reasonable assumptions, we can only afford to spray 15% of the city area for 8 weeks during the peak mosquito season, which wouldn't make much of a difference, as mosquitoes from other parts of the city could easily migrate to replace those killed

**Still, we believe, and this is our final recommendation, that limited spraying should take place.**

- even assuming that only 15% of the chicago area can be sprayed in any given week, we could use our model to maximize the efficiency of such a limited spray, by focusing on the areas to which our model assigns the highest 'WnV present' probability (model.predict_proba)
- scientifically speaking, the spray **does** kill off mosquitoes and their larvae, even if our charts/data do not show it. We choose to believe the science and use spraying as a mosquito-reduction technique. We will, however, act on our finding of the spray's lack of measurable success by deciding not to spend too much money on it.
- it is important for a city government to show that it cares about its people, so standing idly by while the virus is affecting its people is, politically speaking, not an option
- lastly, even though spraying does not have a substantial effect now, that does not mean it will remain so in the future, especially if west nile virus were to start spreading at faster rates. that is why it is important to have a well-oiled and functioning spraying program in place now, which can then easily be ramped up in the future if the need were suddenly to arise.
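The first point above, spending a limited spray budget where the predicted probability is highest, can be sketched as a simple ranking. Everything here is hypothetical: the helper `pick_spray_targets`, the site names, the probabilities, and the assumption that each site covers a fixed area.

```python
def pick_spray_targets(probabilities, area_per_site_km2, weekly_budget_km2):
    """Return the highest-probability sites that fit within the weekly area budget."""
    # rank sites from most to least likely to have WnV present
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    # number of whole sites the budget covers
    n_sites = int(weekly_budget_km2 // area_per_site_km2)
    return [site for site, _ in ranked[:n_sites]]


# made-up predict_proba outputs keyed by a hypothetical site id
demo_probabilities = {"site_a": 0.05, "site_b": 0.42, "site_c": 0.31, "site_d": 0.12}

targets = pick_spray_targets(
    demo_probabilities, area_per_site_km2=40, weekly_budget_km2=100
)
print(targets)
```

In practice the probabilities would come from the production model's `predict_proba` for the coming week, and the area per site would depend on how spray zones are drawn.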

**Beyond the cost-benefit: What's next?**

- the only way to substantially improve our model and make use of the City's limited spraying funds is to extend the model so that it is able to pinpoint the micro-location where the Wnv mosquitoes will be present. This would include more advanced use of GPS data, as well as teaming up with weather experts and entomologists to model the ways in which mosquitoes move around the city based on atmospheric data and their biological needs.
