# West Nile Virus Analysis

## Introduction and Background

## Problem Statement 

In response to the epidemic of West Nile Virus (WNV) in Chicago, the Department of Public Health set up a surveillance and control system to track the mosquito population and WNV incidence over time.
As part of the hiring assessment for the new data science team, we have been tasked with analysing the historical data and developing a robust predictive model. It is hoped that the department can use insights from the analysis and the model to devise more targeted plans to control WNV outbreaks in Chicago, taking into account the costs and benefits of any future mitigation measures.

## Dataset

The datasets used consist of:

[`train.csv`](datasets/original_data/train.csv): The training set consists of data from the years 2007, 2009, 2011, and 2013.

[`test.csv`](datasets/original_data/test.csv): The test set consists of data for 2008, 2010, 2012, and 2014. It is similar to the training set except that it lacks the `WnvPresent` and `NumMosquitos` columns, which we are required to predict.

[`spray.csv`](datasets/original_data/spray.csv): Records of spraying efforts in Chicago in 2011 and 2013.

[`weather.csv`](datasets/original_data/weather.csv): Weather data of Chicago from 2007 to 2014.

## Findings from Exploratory Data Analysis

Some interesting observations were gathered through the EDA process. 

**Imbalanced class**: The response variable `WnvPresent` is imbalanced in the training set, so the minority class has to be oversampled during preprocessing.

**Lack of representation for non-Pipiens/Restuans species**: The `Species` feature is also highly imbalanced. Furthermore, all rows for these minority species have `WnvPresent` equal to 0, so our model may not generalize well to them.

**Spatial-Temporal Relations**: We observed a degree of temporal and spatial correlation between observations.

**Lack of linear correlation**: We did not observe a strong linear correlation between the predictors and the response variable.

## Feature Engineering

Some steps were performed for feature engineering:

1. Wind data were processed to account for the effect of Lake Michigan, a major body of water bordering Chicago.

2. Categorical variables such as `Species` and `Month` were one-hot encoded.

3. A feature was created to indicate whether the row belongs to the Pipiens/Restuans species.

4. Weather features were lagged by 5 to 14 days to account for time-series relations.

5. A weighted distance matrix was created to account for spatial autocorrelation.
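Steps 4 and 5 can be illustrated with a minimal sketch. The column names (`Date`, `Tavg`), the helper names, and the choice of row-normalised inverse-distance weights on raw latitude/longitude are assumptions made for illustration; the weighting scheme actually used in the project may differ.

```python
import numpy as np
import pandas as pd


def add_lagged_features(weather: pd.DataFrame, cols, lags) -> pd.DataFrame:
    """Append each weather column as observed `lag` rows earlier.

    Assumes one row per calendar day with no gaps, so a row shift equals a day shift.
    """
    out = weather.sort_values("Date").reset_index(drop=True).copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out


def inverse_distance_weights(lat, lon) -> np.ndarray:
    """Row-normalised inverse-distance weights between locations.

    Uses plain Euclidean distance in degrees, a simplification for a small area.
    """
    coords = np.column_stack([lat, lon]).astype(float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    weights = np.zeros_like(dist)
    nonzero = dist > 0
    weights[nonzero] = 1.0 / dist[nonzero]  # diagonal (self-distance) stays 0
    row_sums = weights.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # guard against isolated or duplicate-only points
    return weights / row_sums


# Toy weather series (illustrative values, not the real weather.csv).
weather = pd.DataFrame({
    "Date": pd.date_range("2007-05-01", periods=20, freq="D"),
    "Tavg": np.arange(60, 80),
})
lagged = add_lagged_features(weather, ["Tavg"], lags=range(5, 15))

# Three hypothetical trap locations and their mosquito counts.
W = inverse_distance_weights([41.95, 41.99, 41.73], [-87.80, -87.77, -87.65])
spatial_lag = W @ np.array([10.0, 4.0, 1.0])  # weighted average of the other traps
```

The first 5 to 14 rows of each lagged column are `NaN` and need to be dropped or imputed before modelling. Multiplying the weight matrix by a vector of trap-level values gives, for each trap, a distance-weighted average of its neighbours, which is one common way to encode spatial autocorrelation as a feature.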

## Modelling

A two-step approach was developed for modelling. First, a regression model was fitted to the training data to predict `NumMosquitos` for the test data. A classification model was then used to predict `WnvPresent`.
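The two-step idea can be sketched as follows. The data here is synthetic, the variable names are hypothetical, and training the classifier on the observed mosquito counts (while scoring the test set with predicted counts) is an assumption about how the two steps were chained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for the engineered features (not the real data).
X_train = rng.normal(size=(300, 4))
num_mosquitos = rng.poisson(lam=np.exp(1.0 + 0.5 * X_train[:, 0]))
wnv_present = (num_mosquitos + rng.normal(scale=2.0, size=300) > 6).astype(int)
X_test = rng.normal(size=(50, 4))  # test rows have no NumMosquitos column

# Step 1: regress NumMosquitos on the features, then predict it for the test set.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, num_mosquitos)
test_counts = reg.predict(X_test)

# Step 2: classify WnvPresent using the features plus the mosquito count.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(np.column_stack([X_train, num_mosquitos]), wnv_present)
wnv_proba = clf.predict_proba(np.column_stack([X_test, test_counts]))[:, 1]
```

Because the classifier sees predicted rather than observed counts at test time, errors from the regression step carry over into the classification step.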

During preprocessing, we oversampled the minority class, scaled the features, and applied Principal Component Analysis (PCA).
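A minimal sketch of these preprocessing steps is shown below. It assumes simple random oversampling with replacement and a 95% explained-variance threshold for PCA; neither detail is confirmed by the project, and the helper `oversample_minority` and the toy features are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def oversample_minority(df: pd.DataFrame, target: str = "WnvPresent", seed: int = 42) -> pd.DataFrame:
    """Randomly duplicate minority-class rows until both classes are the same size."""
    counts = df[target].value_counts()
    majority = df[df[target] == counts.idxmax()]
    minority = df[df[target] == counts.idxmin()]
    # Sample with replacement so the minority class can grow to the majority size.
    upsampled = minority.sample(n=len(majority), replace=True, random_state=seed)
    # Shuffle so the duplicated rows are not grouped at the end.
    return pd.concat([majority, upsampled]).sample(frac=1, random_state=seed).reset_index(drop=True)


# Synthetic training frame: 20 positive rows out of 200 (not the real data).
rng = np.random.default_rng(1)
train = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
train["WnvPresent"] = np.r_[np.ones(20, dtype=int), np.zeros(180, dtype=int)]

balanced = oversample_minority(train)
X = balanced.drop(columns="WnvPresent")
y = balanced["WnvPresent"]

preprocess = Pipeline([
    ("scale", StandardScaler()),      # PCA is sensitive to feature scale
    ("pca", PCA(n_components=0.95)),  # assumed threshold: keep 95% of the variance
])
X_reduced = preprocess.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```

Oversampling should be applied to training data only. The fitted scaler and PCA are then reused on the test set via `preprocess.transform`, so no test information leaks into the fit.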

For regression, the following models were put through 5-fold cross-validation and tuned using Root Mean Squared Error (RMSE) as the metric:
- Linear Regression 
- Ridge Regression
- Lasso Regression
- Elastic Net
- Poisson Regression
- KNN Regression
- Random Forest Regression

The best model was found to be the random forest, with an RMSE of 7.7 and a distribution of predictions that resembles that of the training dataset.
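The comparison procedure can be sketched as below, using scikit-learn's `cross_val_score` with the `neg_root_mean_squared_error` scorer. The data is synthetic, only a subset of the candidate models is shown, and default hyperparameters are an assumption, so the numbers will not match the reported RMSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with a non-linear term (not the real data).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=200)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=2),
}

rmse = {}
for name, model in candidates.items():
    # scikit-learn maximises scores, so RMSE is returned negated.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name:>14}: RMSE = {value:.2f}")
```

RMSE is in the same units as the target, so a value of 7.7 means the predicted mosquito count is typically off by roughly that many mosquitoes per record.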

For classification, three models were put through 5-fold cross-validation and assessed on multiple metrics, including ROC-AUC score, accuracy, recall and precision:

- Logistic Regression
- Gradient Boosting Classifier
- Random Forest Classifier
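Scoring several metrics in one cross-validation run can be done with scikit-learn's `cross_validate`, as in the sketch below. The imbalanced dataset is generated with `make_classification`, and the stratified, shuffled folds are an assumption rather than the project's confirmed setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced problem: about 10% positives (not the real data).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=3)

scoring = ["roc_auc", "accuracy", "recall", "precision"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)  # keeps the class ratio in each fold

results = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=3),
    X, y, cv=cv, scoring=scoring,
)
summary = {metric: results[f"test_{metric}"].mean() for metric in scoring}
for metric, value in summary.items():
    print(f"{metric:>9}: {value:.2f}")
```

On imbalanced data, accuracy alone can look strong even when few positives are found, which is why recall and precision are reported alongside it.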

The best model chosen was the random forest classifier, which achieved a ROC-AUC score of 0.91 and an accuracy of 0.93, with a recall of 0.70 and a precision of 0.14. Because our focus is on curbing future outbreaks, the low precision is not a primary concern for this project. Future improvements such as better features, more rigorous feature selection and hyperparameter tuning, and more advanced models such as XGBoost and recurrent neural networks could improve our results.
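To make the precision and recall figures concrete, the short calculation below converts them into false alarms per correct detection and the share of WNV-positive records missed. It follows directly from the definitions precision = TP / (TP + FP) and recall = TP / (TP + FN); the function names are illustrative.

```python
def false_alarms_per_hit(precision: float) -> float:
    """False positives per true positive: FP / TP = (1 - precision) / precision."""
    return (1.0 - precision) / precision


def missed_fraction(recall: float) -> float:
    """Share of actual positives the model fails to flag: FN / (TP + FN) = 1 - recall."""
    return 1.0 - recall


precision, recall = 0.14, 0.70  # values reported for the random forest classifier
print(f"False alarms per correct detection: {false_alarms_per_hit(precision):.1f}")
print(f"Share of WNV-positive records missed: {missed_fraction(recall):.0%}")
```

At these values the model raises roughly six false alarms for every correct detection while missing about 30% of positive records.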

## Cost Benefit Analysis

## Conclusions and Recommendations