# Project 4 : Kaggle West Nile Virus

----------

## Executive Summary

The West Nile Virus is most commonly spread to humans through infected mosquitoes. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever to serious neurological illnesses that can result in death.
West Nile Virus-related hospitalizations and follow-ups in the United States cost [$778 million](https://www.medicinenet.com/script/main/art.asp?articlekey=176668) in health care expenses and lost productivity from 1999 through 2012.

In 2002, the first human cases of the West Nile Virus were reported in Chicago. By 2004, the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program, which is still in effect today. Since the program was implemented, occurrences of the West Nile Virus have been declining.

As part of the last group project for General Assembly Immersive Data Science, our team aims to build a robust classifier model to predict the presence of the West Nile Virus in Chicago. The following models were tested and compared: Logistic Regression, Random Forest, Gradient Boosting and AdaBoost. Models were evaluated primarily on their ROC AUC, recall and precision scores. Finally, we looked into potential interventions derived from our model and performed a cost-benefit analysis for a proposal to the CDPH.

## Problem Statement

Recognising that the West Nile Virus could become endemic, we aim to improve the cost-effectiveness of existing strategies to control adult mosquito populations and mitigate the spread of the virus.

Capitalising on data on past weather conditions and the locations where the West Nile Virus was found, we aim to develop a machine learning model that predicts the presence of the virus at a particular location under specific weather conditions. This prediction tool allows targeted spraying of the neighbourhoods that face a higher threat from the West Nile Virus. We hope to help Chicago achieve cost savings through efficient resource management in preventing the transmission of the virus.

## Contents:
### Part 1a Spray Data Cleaning & Exploratory Data Analysis (EDA)

1. [Importing Libraries](#1.-Importing-Libraries)
2. [Importing Data](#2.-Importing-Data)
3. [Data Cleaning](#3.-Data-Cleaning)
-------

## 1. Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Show up to 500 rows and columns, and do not truncate cell contents
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', None)

## 2. Importing Data

In [5]:
spray = pd.read_csv('./assets/spray.csv')

In [6]:
spray.shape

(14835, 4)

## 3. Data Cleaning

<b>Spray Data</b>:
- Rename the spray data columns to snake_case to follow the naming convention
- Convert the date column from string object to datetime64
- Drop the Time column

In [7]:
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       14835 non-null  object 
 1   Time       14251 non-null  object 
 2   Latitude   14835 non-null  float64
 3   Longitude  14835 non-null  float64
dtypes: float64(2), object(2)
memory usage: 463.7+ KB


In [8]:
spray.head()

| | Date | Time | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 2011-08-29 | 6:56:58 PM | 42.391623 | -88.089163 |
| 1 | 2011-08-29 | 6:57:08 PM | 42.391348 | -88.089163 |
| 2 | 2011-08-29 | 6:57:18 PM | 42.391022 | -88.089157 |
| 3 | 2011-08-29 | 6:57:28 PM | 42.390637 | -88.089158 |
| 4 | 2011-08-29 | 6:57:38 PM | 42.39041 | -88.088858 |


In [9]:
spray.isnull().sum()

Date           0
Time         584
Latitude       0
Longitude      0
dtype: int64
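
Before dropping a column with missing values, it can help to check whether the gaps are spread evenly or concentrated on a few dates. The sketch below shows one way to do that by grouping the null flags by date. It runs on a small made-up frame named `toy` with the same columns as `spray`; the rows are illustrative only and the result says nothing about the real data.

```python
import pandas as pd

# Toy stand-in for the spray data (illustrative values, not the real file)
toy = pd.DataFrame({
    'Date': ['2011-08-29', '2011-08-29', '2011-09-07', '2011-09-07', '2011-09-07'],
    'Time': ['6:56:58 PM', '6:57:08 PM', None, None, '7:44:32 PM'],
    'Latitude': [42.391623, 42.391348, 41.983917, 41.986460, 41.986460],
    'Longitude': [-88.089163, -88.089163, -87.793088, -87.794225, -87.794225],
})

# Flag missing Time values, then count the flags within each date
missing_by_date = toy['Time'].isnull().groupby(toy['Date']).sum()
print(missing_by_date)
```

Running the same two lines on `spray` itself would show how the 584 missing values are distributed across spray dates.
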

In [10]:
spray[spray.duplicated()]

| | Date | Time | Latitude | Longitude |
|---|---|---|---|---|
| 485 | 2011-09-07 | 7:43:40 PM | 41.983917 | -87.793088 |
| 490 | 2011-09-07 | 7:44:32 PM | 41.986460 | -87.794225 |
| 491 | 2011-09-07 | 7:44:32 PM | 41.986460 | -87.794225 |
| 492 | 2011-09-07 | 7:44:32 PM | 41.986460 | -87.794225 |
| 493 | 2011-09-07 | 7:44:32 PM | 41.986460 | -87.794225 |
| ... | ... | ... | ... | ... |
| 1025 | 2011-09-07 | 7:44:32 PM | 41.986460 | -87.794225 |
| 1026 | 2011-09-07 | 7:44:32 PM | 41.986460 | -87.794225 |
| 1027 | 2011-09-07 | 7:44:32 PM | 41.986460 | -87.794225 |
| 1028 | 2011-09-07 | 7:44:32 PM | 41.986460 | -87.794225 |


<b>Observations</b>:

1. The spray data columns have to be renamed to snake_case to follow the naming convention.
2. The Date column needs to be converted to datetime64 format.
3. The <b>Time</b> column has 584 missing values and is not crucial for our analysis, so we will drop it.
4. There are duplicate rows. We decided not to drop them, since they might indicate a high spray frequency at certain locations due to a high number of West Nile Virus cases.
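
If the duplicate rows are kept as a possible signal of spray intensity, one way to make that signal explicit is to count the records at each date and location. The following is a minimal sketch on a made-up frame named `toy`; the column name `n_records` is an assumption introduced here for illustration. Whether repeated rows reflect repeated spraying or a logging artifact is not something the counts alone can settle.

```python
import pandas as pd

# Toy stand-in with three identical rows (illustrative values only)
toy = pd.DataFrame({
    'Date': ['2011-09-07'] * 4 + ['2011-08-29'],
    'Latitude': [41.986460] * 3 + [41.983917, 42.391623],
    'Longitude': [-87.794225] * 3 + [-87.793088, -88.089163],
})

# Number of records per (date, location); values above 1 are repeated rows
records_per_point = (
    toy.groupby(['Date', 'Latitude', 'Longitude'])
       .size()
       .reset_index(name='n_records')
       .sort_values('n_records', ascending=False)
)
print(records_per_point)

# Rows that repeat an earlier row (the first occurrence is not flagged)
n_repeats = toy.duplicated().sum()
print(n_repeats)
```
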

#### a) Rename the spray data columns to snake_case to follow the naming convention

In [11]:
spray.columns = spray.columns.str.lower()
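
Lowercasing is enough here because all four column names are single words. For multi-word names, a small helper is needed to insert the underscores. The sketch below is one possible approach; `to_snake_case` and the extra column names `WnvPresent` and `Spray Date` are hypothetical examples, not part of the spray data.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Convert a column name such as 'WnvPresent' or 'Spray Date' to snake_case."""
    name = name.strip()
    # Insert an underscore where a lowercase letter or digit meets an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    # Replace runs of whitespace or hyphens with a single underscore
    name = re.sub(r'[\s\-]+', '_', name)
    return name.lower()

# Empty frame that only carries column names (the last two are hypothetical)
toy = pd.DataFrame(columns=['Date', 'Time', 'Latitude', 'Longitude', 'WnvPresent', 'Spray Date'])
toy.columns = [to_snake_case(c) for c in toy.columns]
print(list(toy.columns))
```
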

#### b) Convert the date column from string object to datetime64

In [12]:
spray['date'] = pd.to_datetime(spray['date'])
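
A stricter variant of this conversion passes an explicit format and counts the values that fail to parse. The sketch below assumes the `YYYY-MM-DD` layout seen in the `head()` output holds for every row; the frame `toy` and its deliberately invalid third value are made up for illustration.

```python
import pandas as pd

# Toy date strings; the last one is intentionally invalid
toy = pd.DataFrame({'date': ['2011-08-29', '2011-09-07', 'not a date']})

# errors='coerce' turns unparseable strings into NaT instead of raising an error
parsed = pd.to_datetime(toy['date'], format='%Y-%m-%d', errors='coerce')

print(parsed.dtype)
print(parsed.isna().sum())  # number of values that failed to parse
```

A non-zero count of `NaT` values would be a prompt to inspect those rows before relying on the date column.
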

#### c) Drop the Time column

In [13]:
spray.drop(columns = 'time', inplace = True)

In [15]:
spray.to_pickle('./data/spray.pk1')
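
Pickle is a convenient format for this hand-off because it preserves dtypes, so the date column should not need to be parsed again in a later notebook. The sketch below checks that with a round trip on a made-up frame; the temporary directory and the file name `spray_demo.pkl` are hypothetical and unrelated to the project's `./data` folder.

```python
import os
import tempfile
import pandas as pd

# Toy frame shaped like the cleaned spray data (illustrative values only)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2011-08-29', '2011-09-07']),
    'latitude': [42.391623, 41.983917],
    'longitude': [-88.089163, -87.793088],
})

# Write to a temporary location and read the file straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'spray_demo.pkl')  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The reloaded frame keeps the datetime dtype of the 'date' column
print(restored.dtypes)
```
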