# Using Linear Regression to Predict Number of Mosquitos

### This notebook is to explore how to predict number of mosquitos so that we can use this as a feature for predicting the presence of West Nile Virus

First, let's import some dependencies for numerical calculations; create and manipulate data; train and test our machine learning algorithms; something to scale our data; and import our Linear Regression model that we will use to predict the number of mosquitos.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ParameterGrid
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.dummy import DummyRegressor
from sklearn.svm import SVR

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

  from numpy.core.umath_tests import inner1d


Next, we will import our training and test data

In [2]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

We are going to create a function for removing columns/creating new dummy columns

In [3]:
def create_month(x):
    return int(x.split('-')[1])

def create_day(x):
    return int(x.split('-')[2])

def transform_df(df):
    # Turn month, date into features
    df['month'] = df["Date"].apply(create_month)
    df['day'] = df["Date"].apply(create_day)
    df.drop(columns=["Street", "Address", "AddressAccuracy", "Date","Species","AddressNumberAndStreet"], inplace=True)
    return pd.get_dummies(df, columns=["Trap"])

train = transform_df(train)

I am not going to do too much feature selection for now; will do a basic train-test split and then perform the same on the test data

In [4]:
X = train.drop(columns=["NumMosquitos"])
y = train["NumMosquitos"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [5]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.fit_transform(X_test)

In [6]:
models_and_penalties = [
    LinearRegression(),
    Lasso(alpha=.05,normalize=True, max_iter=1e5),
    Ridge(alpha=1,normalize=True, max_iter=1e5),
    ElasticNet(alpha=.01,max_iter=1e5)
]

In [7]:
for model in models_and_penalties:
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_train_scaled)
    test_preds = model.predict(X_test_scaled)
    print(f"Predicted {sum(preds)} total number of mosquitos for the training data")
    print(f"Predicted {sum(test_preds)} total number of mosquitos for the testing data")
    print(f"Actual is {train.NumMosquitos.sum()}")
    print()
    print(f"Mean Squared Error for train data: {mean_squared_error(y_train, preds)}")
    print(f"Mean Squared Error for test data: {mean_squared_error(y_test, test_preds)}")
    print()
    print()

Predicted 100626.14221319814 total number of mosquitos for the training data
Predicted 33038.427642822266 total number of mosquitos for the testing data
Actual is 135039

Mean Squared Error for train data: 163.46031591017302
Mean Squared Error for test data: 2.084878551720714e+23


Predicted 100633.00000001147 total number of mosquitos for the training data
Predicted 33552.848204086215 total number of mosquitos for the testing data
Actual is 135039

Mean Squared Error for train data: 238.77900346445094
Mean Squared Error for test data: 244.10026097374276


Predicted 100633.00000000044 total number of mosquitos for the training data
Predicted 33552.84820408697 total number of mosquitos for the testing data
Actual is 135039

Mean Squared Error for train data: 183.75954231869144
Mean Squared Error for test data: 192.75389384463708


Predicted 100632.99999999994 total number of mosquitos for the training data
Predicted 33552.84820408725 total number of mosquitos for the testing data
Actual

In [8]:
# borrowed from https://www.programcreek.com/python/example/85938/sklearn.ensemble.BaggingRegressor
grid = ParameterGrid({"max_samples": [0.5, 1.0],
                          "max_features": [0.5, 1.0],
                          "bootstrap": [True, False],
                          "bootstrap_features": [True, False]})

for base_estimator in [None,
                           DummyRegressor(),
                           DecisionTreeRegressor(),
                           KNeighborsRegressor(),
                           SVR()]:
    for params in grid:
        br = BaggingRegressor(base_estimator=base_estimator,
                         **params).fit(X_train, y_train)
        pred = br.predict(X_train)
        test_preds = br.predict(X_test)
        print(f"Scores for {str(base_estimator)} and {params}")
        print(f"Mean Squared Error for train data: {mean_squared_error(y_train, preds)}")
        print(f"Mean Squared Error for test data: {mean_squared_error(y_test, test_preds)}")
        [print() for _ in range(5)]

Scores for None and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 0.5}
Mean Squared Error for train data: 122.76840517378167
Mean Squared Error for test data: 160.11653495368537





Scores for None and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 1.0}
Mean Squared Error for train data: 137.30020711496633
Mean Squared Error for test data: 156.82911500757916





Scores for None and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 0.5}
Mean Squared Error for train data: 100.91531733338975
Mean Squared Error for test data: 140.7384786370963





Scores for None and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 1.0}
Mean Squared Error for train data: 93.73972535766899
Mean Squared Error for test data: 138.8047491998493





Scores for None and {'bootstrap': True, 'bootstrap_features': False, 'max_features': 0.5, 'max_samples': 0.5}
Mean Squa

Scores for DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best') and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 1.0}
Mean Squared Error for train data: 125.87951273360127
Mean Squared Error for test data: 149.99895219691373





Scores for DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best') and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 0.5}
Mean Squared Error for train data: 101.79559741079426


Scores for KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform') and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 1.0}
Mean Squared Error for train data: 125.02251336464019
Mean Squared Error for test data: 146.3109644461363





Scores for KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform') and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 0.5}
Mean Squared Error for train data: 129.8508019799467
Mean Squared Error for test data: 144.31852470498666





Scores for KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform') and {'bootstrap': True, 'bootstrap_features': True, 'max_features': 1.0, 'ma

Scores for SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False) and {'bootstrap': False, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 0.5}
Mean Squared Error for train data: 285.87155037438157
Mean Squared Error for test data: 314.9746960326796





Scores for SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False) and {'bootstrap': False, 'bootstrap_features': True, 'max_features': 0.5, 'max_samples': 1.0}
Mean Squared Error for train data: 286.02303508089824
Mean Squared Error for test data: 302.68822271629676





Scores for SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False) and {'bootstrap': False, 'bootstrap_features': True, 'max_features': 1.0, 'max_samples': 0.5}
Mean Squared Error for t