# Project 4: West Nile Virus Prediction

## Problem Statement

1. **Predict locations with high potential** of having mosquitoes carrying West Nile Virus.
2. **Identify factors that contribute to the growth and spread** of the virus in mosquitoes.

## Executive Summary

### Context

West Nile Virus (WNV) is a mosquito-borne disease that has plagued the continental United States since 1999. The vast majority of infected people develop mild symptoms that subside within a few days to several weeks ([source](https://medlineplus.gov/westnilevirus.html)). About 1 in 150 infected people develops a serious illness. The virus is sometimes neuroinvasive and may cause encephalitis and meningitis, which can be fatal. There is currently no vaccine to prevent WNV and no medication to treat it ([source](https://www.cdc.gov/westnile/index.html)).

Outbreaks typically intensify over as little as a couple of weeks; however, human case reports are lagging indicators of risk, since they arrive weeks after the time of infection. Thus, environmental surveillance – monitoring enzootic and epizootic WNV transmission in mosquitoes and birds – forms a timelier index of risk and is an important cornerstone for implementing effective WNV risk reduction efforts. Research and operational experience show that increases in WNV infection rates in mosquito populations can indicate developing outbreak conditions several weeks in advance of increases in human infections. Aggressive and timely efforts to reduce the number of infected adult mosquitoes will optimally impact human WNV case incidence ([source](https://www.cdc.gov/westnile/resources/pdfs/wnvGuidelines.pdf)).

### Scope

The goal of this project is to derive a plan to deploy pesticides throughout the city of Chicago. Hence, the scope of our deployment plan and cost analysis is limited to the city of Chicago.

However, the model that we have trained can certainly be used in other states/cities within the United States to predict the potential outbreak of West Nile Virus. 

The modelling techniques and strategies will be presented to biostatisticians, epidemiologists, the American Public Health Service and decision makers from the Centers for Disease Control and Prevention (CDC).

### Contents:
- Jupyter Notebook 1 - ***1_data_cleaning.ipynb***
    - [Data Importing and Cleaning](#Data-Importing-and-Cleaning)
        - [Cleaning train.csv](#Cleaning-train.csv)
        - [Cleaning test.csv](#Cleaning-test.csv)
        - [Cleaning spray.csv](#Cleaning-spray.csv)
        - [Cleaning weather.csv](#Cleaning-weather.csv)
        - [Exporting cleaned data](#Exporting-cleaned-data)
- Jupyter Notebook 2 - ***2_merging_data_feature_engineering.ipynb***
    - Feature Engineering & Data Merging
        - Feature engineering to merge spray data with train & test data
        - Feature engineering for spatial correlation
        - Merging weather data with train & test data
        - Exporting merged data
- Jupyter Notebook 3 - ***3_EDA_modelling_evaluation.ipynb.ipynb***
    - Exploratory Data Analysis
    - Model Preparation
        - Modelling Approach
        - Classification Metrics
        - Class Balancing Techniques
    - Classification Modelling
        - *GridSearchCV* for *LogisticRegression* with *SMOTE* balancing technique
        - *RandomForestClassifier* with *SMOTE* balancing technique
        - *SVC* with *SMOTE* balancing technique
        - *GradientBoostingClassifier* with *SMOTE* balancing technique
        - *RandomForestClassifier* with *ADASYN* balancing technique
        - *RandomForestClassifier* with *ClusterCentroids* balancing technique
        - *RandomForestClassifier* with hyperparameter *class_weight='balanced_subsample'*
        - Feature Importance
        - ROC Curve
        - Visualizing the Predictions
        - Cost-Benefit Analysis
        - Deployment of Model
    - Conclusions and Recommendations

In [1]:
# Imports

import pandas as pd
import numpy as np
from datetime import datetime
import re
from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

## Data Importing and Cleaning

In [2]:
# Importing all given CSV files.

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
spray = pd.read_csv("../input/spray.csv")
weather = pd.read_csv("../input/weather.csv")

In [3]:
# Defining a helper function to summarise each dataframe.

def eda(df, df_name):
    print(df_name.capitalize())
    print()
    print(f"Rows: {df.shape[0]} \t Columns: {df.shape[1]}")
    print()

    # Note: this is the total count of missing cells across all columns, not a count of rows.
    print(f"Number of Missing rows: {df.isnull().sum().sum()}")
    print()

    # keep=False counts every row that belongs to a duplicate group, including the first occurrence.
    print(f"Number of Duplicate rows: {df[df.duplicated(keep=False)].shape[0]}")
    print()

    print(df.dtypes)
    print("_________________________________________\n")

In [4]:
data = [(train, 'train'),
       (spray, 'spray'),
       (weather, 'weather'),
       (test,'test')]

In [5]:
# Printing a summary of all data files.

[eda(df, name) for df, name in data]

Train

Rows: 10506 	 Columns: 12

Number of Missing rows: 0

Number of Duplicate rows: 1062

Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
dtype: object
_________________________________________

Spray

Rows: 14835 	 Columns: 4

Number of Missing rows: 584

Number of Duplicate rows: 543

Date          object
Time          object
Latitude     float64
Longitude    float64
dtype: object
_________________________________________

Weather

Rows: 2944 	 Columns: 22

Number of Missing rows: 0

Number of Duplicate rows: 0

Station          int64
Date            object
Tmax             int64
Tmin             int64
Tavg            object
De

[None, None, None, None]
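
The duplicate count printed by `eda()` uses `duplicated(keep=False)`, so it is larger than the number of rows a later `drop_duplicates()` call would remove. A minimal sketch of the difference, using a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with hypothetical values: rows 0 and 1 are identical.
toy = pd.DataFrame({"Trap": ["T002", "T002", "T007"],
                    "WnvPresent": [0, 0, 1]})

# keep=False flags every member of a duplicate group (this is what eda() reports).
all_members = int(toy.duplicated(keep=False).sum())

# The default keep='first' flags only the repeats, i.e. the rows
# that drop_duplicates() would actually remove.
repeats_only = int(toy.duplicated().sum())

print(all_members, repeats_only)
```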

### Cleaning *train.csv*

In [6]:
train.head()

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634, USA",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634, USA",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0


In [7]:
train.shape

(10506, 12)

In [8]:
train.columns

Index(['Date', 'Address', 'Species', 'Block', 'Street', 'Trap',
       'AddressNumberAndStreet', 'Latitude', 'Longitude', 'AddressAccuracy',
       'NumMosquitos', 'WnvPresent'],
      dtype='object')

In [9]:
train.dtypes

Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
dtype: object

In [10]:
# Checking for null values.

train.isnull().sum()

Date                      0
Address                   0
Species                   0
Block                     0
Street                    0
Trap                      0
AddressNumberAndStreet    0
Latitude                  0
Longitude                 0
AddressAccuracy           0
NumMosquitos              0
WnvPresent                0
dtype: int64

In [11]:
# Dropping unnecessary columns.

train_cols_to_drop = ['Address', 'Block', 'Street', 'Trap', 'AddressNumberAndStreet', 'AddressAccuracy', 'NumMosquitos']
train.drop(columns=train_cols_to_drop, inplace=True)

In [12]:
train.shape

(10506, 5)

In [13]:
# Dropping duplicate rows
# Since NumMosquitos column is already dropped, there are a lot of rows which have duplicate values in all other columns. These rows need to be dropped.
# ignore_index=True to reset the index after dropping all duplicate rows.

train.drop_duplicates(ignore_index=True, inplace=True)

In [14]:
train.shape

(8610, 5)
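
Why dropping a column can create duplicates is easiest to see on a small example. The sketch below uses invented rows (not rows from `train.csv`) that differ only in `NumMosquitos`:

```python
import pandas as pd

# Hypothetical rows: same date and species, different mosquito counts.
toy = pd.DataFrame({
    "Date": ["2007-05-29", "2007-05-29", "2007-05-29"],
    "Species": ["CULEX RESTUANS"] * 3,
    "NumMosquitos": [50, 12, 3],
    "WnvPresent": [0, 0, 1],
})

# With NumMosquitos present, every row is distinct.
dupes_before = int(toy.duplicated().sum())

# Once NumMosquitos is gone, the first two rows become identical.
reduced = toy.drop(columns=["NumMosquitos"])
dupes_after = int(reduced.duplicated().sum())

# ignore_index=True renumbers the surviving rows from 0.
deduped = reduced.drop_duplicates(ignore_index=True)
print(dupes_before, dupes_after, deduped.shape)
```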

In [15]:
# Converting Date column from object type to datetime type.

train['Date'] = pd.to_datetime(train['Date'], format='%Y-%m-%d')

In [16]:
train.dtypes

Date          datetime64[ns]
Species               object
Latitude             float64
Longitude            float64
WnvPresent             int64
dtype: object

In [17]:
# Grouping by individual species which have (or don't have) the virus.

species = train.groupby('Species')['WnvPresent'].sum()
species

Species
CULEX ERRATICUS             0
CULEX PIPIENS             184
CULEX PIPIENS/RESTUANS    225
CULEX RESTUANS             48
CULEX SALINARIUS            0
CULEX TARSALIS              0
CULEX TERRITANS             0
Name: WnvPresent, dtype: int64

**Only 3 of 7 species have the West Nile Virus.** The remaining 4 species have no WNV-positive samples in the training data, i.e. they are virus free here. Rows of these virus-free species should be dropped.

In [18]:
# Filtering the species with virus.

species_with_virus = species[species!=0].index
species_with_virus

Index(['CULEX PIPIENS', 'CULEX PIPIENS/RESTUANS', 'CULEX RESTUANS'], dtype='object', name='Species')

In [19]:
# Using list comprehension to create a list of 1 and 0.
# 1 if the species is in the list of species_with_virus above. 0 otherwise.
# The loop variable is named sp so that it does not shadow the species Series defined earlier.

train_species_with_virus = [1 if sp in species_with_virus else 0 for sp in train['Species']]

In [20]:
# Counting the rows that belong to the 4 species with no WNV-positive samples.

train.shape[0] - sum(train_species_with_virus)

# Only 306 rows out of 8610 (3.55%), so these rows can safely be dropped.

306

In [21]:
# Assigning above defined list to a new column VirusSpecies.

train['VirusSpecies'] = train_species_with_virus

In [22]:
# Keeping only rows with VirusSpecies = 1. Dropping all rows which have VirusSpecies = 0.
# .copy() makes train an independent dataframe rather than a view of the filtered original, which avoids SettingWithCopyWarning in later edits.

train = train[train['VirusSpecies']==1].copy()

In [23]:
train.shape

(8304, 6)

In [24]:
# Dropping the VirusSpecies column now because it is no longer needed.

train.drop(columns=['VirusSpecies'], inplace=True)

In [25]:
train.shape

(8304, 5)
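
The filtering done across the cells above (flag column, boolean filter, drop the flag) can also be written in one step with `Series.isin`. This is an alternative sketch on invented rows, not the notebook's own code:

```python
import pandas as pd

# Hypothetical sample; species names follow the dataset, values are made up.
toy = pd.DataFrame({
    "Species": ["CULEX PIPIENS", "CULEX TERRITANS", "CULEX RESTUANS",
                "CULEX RESTUANS", "CULEX TERRITANS"],
    "WnvPresent": [1, 0, 0, 1, 0],
})

# Species with at least one positive sample.
positives = toy.groupby("Species")["WnvPresent"].sum()
carriers = positives[positives > 0].index

# Keep only rows whose species is in that set, then renumber the rows.
filtered = toy[toy["Species"].isin(carriers)].reset_index(drop=True)
print(sorted(filtered["Species"].unique()), len(filtered))
```

Because no helper column is created, there is nothing to drop afterwards.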

In [26]:
train.dtypes

Date          datetime64[ns]
Species               object
Latitude             float64
Longitude            float64
WnvPresent             int64
dtype: object

In [27]:
# Checking for null values.

train.isnull().sum()

Date          0
Species       0
Latitude      0
Longitude     0
WnvPresent    0
dtype: int64

In [28]:
# Checking no. of unique species remaining.

len(train['Species'].unique())

3

In [29]:
# One-hot encoding the Species column.

train = pd.get_dummies(train, columns=['Species'])

In [30]:
train.shape

(8304, 7)

In [31]:
# Checking final number of columns to ensure they are correct.
# 5 = Initial number of columns before one-hot encoding.
# 1 = Only Species column is being one-hot encoded.
# 3 = No. of unique values in Species column.

5 - 1 + 3

7

In [32]:
train.columns

Index(['Date', 'Latitude', 'Longitude', 'WnvPresent', 'Species_CULEX PIPIENS',
       'Species_CULEX PIPIENS/RESTUANS', 'Species_CULEX RESTUANS'],
      dtype='object')
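
The column arithmetic used above (original columns, minus the encoded column, plus one column per unique value) can be checked programmatically. A small sketch on a hypothetical frame with two species labels:

```python
import pandas as pd

# Hypothetical frame: 3 columns, of which Species has 2 unique values.
toy = pd.DataFrame({
    "Latitude": [41.95, 41.99, 41.97],
    "Species": ["CULEX PIPIENS", "CULEX RESTUANS", "CULEX PIPIENS"],
    "WnvPresent": [0, 1, 0],
})

encoded = pd.get_dummies(toy, columns=["Species"])

# Original columns - 1 encoded column + number of unique values.
expected_cols = toy.shape[1] - 1 + toy["Species"].nunique()
print(list(encoded.columns), expected_cols)
```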

In [33]:
# Re-arranging the columns to put target variable WnvPresent as the last column.

train_cols = list(train.columns)
train_cols.remove('WnvPresent')
train_cols.append('WnvPresent')

train = train[train_cols]

### Cleaning *test.csv*

In [34]:
test.head()

Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634, USA",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634, USA",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634, USA",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
3,4,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634, USA",CULEX SALINARIUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
4,5,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634, USA",CULEX TERRITANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9


In [35]:
test.shape

(116293, 11)

In [36]:
test.columns

Index(['Id', 'Date', 'Address', 'Species', 'Block', 'Street', 'Trap',
       'AddressNumberAndStreet', 'Latitude', 'Longitude', 'AddressAccuracy'],
      dtype='object')

In [37]:
test.dtypes

Id                          int64
Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
dtype: object

In [38]:
# Checking for null values.

test.isnull().sum()

Id                        0
Date                      0
Address                   0
Species                   0
Block                     0
Street                    0
Trap                      0
AddressNumberAndStreet    0
Latitude                  0
Longitude                 0
AddressAccuracy           0
dtype: int64

In [39]:
# Dropping unnecessary columns.

cols_to_drop = ['Address', 'Block', 'Street', 'Trap', 'AddressNumberAndStreet', 'AddressAccuracy']
test.drop(columns=cols_to_drop, inplace=True)

In [40]:
test.shape

(116293, 5)

In [41]:
# Converting Date column from object type to datetime type.

test['Date'] = pd.to_datetime(test['Date'], format='%Y-%m-%d')

In [42]:
test.dtypes

Id                    int64
Date         datetime64[ns]
Species              object
Latitude            float64
Longitude           float64
dtype: object

In [43]:
test = pd.get_dummies(test, columns=['Species'])

In [44]:
test.shape

(116293, 12)

In [45]:
# Filtering extra columns in test df that do not exist in train df.

test_extra_cols = [col for col in test.columns if col not in train.columns]

test_extra_cols

['Id',
 'Species_CULEX ERRATICUS',
 'Species_CULEX SALINARIUS',
 'Species_CULEX TARSALIS',
 'Species_CULEX TERRITANS',
 'Species_UNSPECIFIED CULEX']

In [46]:
# Removing Id column from list of extra columns in test df.
# We cannot drop the Id column because it is needed for the final submission of results to Kaggle.

test_extra_cols.remove('Id')
test_extra_cols

['Species_CULEX ERRATICUS',
 'Species_CULEX SALINARIUS',
 'Species_CULEX TARSALIS',
 'Species_CULEX TERRITANS',
 'Species_UNSPECIFIED CULEX']

In [47]:
# Dropping all other extra columns from test df.
# These columns can be dropped because they do not exist in train df. So they will never be used in our modelling process.

test.drop(columns=test_extra_cols, inplace=True)

In [48]:
# Filtering missing columns in test df that exist in train df.

test_missing_cols = [col for col in train.columns if col not in test.columns]

test_missing_cols

# As expected, only 'WnvPresent' column is missing in test df, and it exists in train df.

['WnvPresent']

In [49]:
# Ensuring the order of columns in test df is same as train df (just for ease of viewing and comparison purposes).
# Retaining the Id column in test as well.

test_clean_cols = ['Id']
test_clean_cols.extend(list(train.columns))
test_clean_cols.remove('WnvPresent')
test_clean_cols

test = test[test_clean_cols]
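
The drop-extra, check-missing, reorder sequence above can be approximated with a single `DataFrame.reindex` call. The sketch below is an alternative under assumptions, not the notebook's method: the frames are invented, and `fill_value=0` is an assumed choice for a dummy column that exists in train but not in test.

```python
import pandas as pd

# Hypothetical train/test frames after one-hot encoding.
train_toy = pd.DataFrame({
    "Latitude": [41.95],
    "Species_CULEX PIPIENS": [1],
    "Species_CULEX RESTUANS": [0],
    "WnvPresent": [0],
})
test_toy = pd.DataFrame({
    "Id": [1, 2],
    "Latitude": [41.95, 41.99],
    "Species_CULEX PIPIENS": [1, 0],
    "Species_CULEX TERRITANS": [0, 1],
})

# Feature columns in train order, without the target.
feature_cols = [c for c in train_toy.columns if c != "WnvPresent"]

# reindex drops columns not listed, adds listed columns that are missing
# (filled with 0), and applies the requested order.
aligned = test_toy.reindex(columns=["Id"] + feature_cols, fill_value=0)
print(list(aligned.columns))
```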

### Cleaning *spray.csv*

In [50]:
spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


In [51]:
spray.shape

(14835, 4)

In [52]:
spray.columns

Index(['Date', 'Time', 'Latitude', 'Longitude'], dtype='object')

In [53]:
spray.dtypes

Date          object
Time          object
Latitude     float64
Longitude    float64
dtype: object

In [54]:
spray.isnull().sum()

Date           0
Time         584
Latitude       0
Longitude      0
dtype: int64

In [55]:
# Dropping unnecessary columns.

spray.drop(columns='Time', inplace=True)

In [56]:
# Converting Date column from object type to datetime type.

spray['Date'] = pd.to_datetime(spray['Date'], format='%Y-%m-%d')

In [57]:
spray.dtypes

Date         datetime64[ns]
Latitude            float64
Longitude           float64
dtype: object

### Cleaning *weather.csv*

In [58]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,0448,1849,,0,M,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,-,-,,M,M,M,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,0447,1850,BR,0,M,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,-,-,BR HZ,M,M,M,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,0446,1851,,0,M,0.0,0.0,29.39,30.12,11.7,7,11.9


In [59]:
weather.shape

(2944, 22)

In [60]:
weather.columns

Index(['Station', 'Date', 'Tmax', 'Tmin', 'Tavg', 'Depart', 'DewPoint',
       'WetBulb', 'Heat', 'Cool', 'Sunrise', 'Sunset', 'CodeSum', 'Depth',
       'Water1', 'SnowFall', 'PrecipTotal', 'StnPressure', 'SeaLevel',
       'ResultSpeed', 'ResultDir', 'AvgSpeed'],
      dtype='object')

In [61]:
weather.dtypes

Station          int64
Date            object
Tmax             int64
Tmin             int64
Tavg            object
Depart          object
DewPoint         int64
WetBulb         object
Heat            object
Cool            object
Sunrise         object
Sunset          object
CodeSum         object
Depth           object
Water1          object
SnowFall        object
PrecipTotal     object
StnPressure     object
SeaLevel        object
ResultSpeed    float64
ResultDir        int64
AvgSpeed        object
dtype: object

In [62]:
weather.isnull().sum()

Station        0
Date           0
Tmax           0
Tmin           0
Tavg           0
Depart         0
DewPoint       0
WetBulb        0
Heat           0
Cool           0
Sunrise        0
Sunset         0
CodeSum        0
Depth          0
Water1         0
SnowFall       0
PrecipTotal    0
StnPressure    0
SeaLevel       0
ResultSpeed    0
ResultDir      0
AvgSpeed       0
dtype: int64

In [63]:
print(weather['Date'].max(), weather['Date'].min())

2014-10-31 2007-05-01


In [64]:
# Dropping unnecessary columns.

cols_to_drop = ['Water1','Depth','SnowFall']
weather.drop(columns=cols_to_drop, inplace=True)

In [65]:
# Replacing 'T's and 'M's with 0s in some columns.
# The result is assigned back to the column; calling replace(inplace=True) on an attribute may not update the dataframe in recent pandas versions.

weather['PrecipTotal'] = weather['PrecipTotal'].replace(['  T', 'M'], 0)
for col in ['WetBulb', 'Heat', 'Cool', 'StnPressure', 'SeaLevel', 'AvgSpeed']:
    weather[col] = weather[col].replace('M', 0)

In [66]:
# Changing column dtypes from type object to type float.

float_cols = ['PrecipTotal', 'WetBulb', 'Heat', 'Cool', 'StnPressure', 'SeaLevel', 'AvgSpeed']
weather = weather.astype({col: float for col in float_cols})
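
Setting 'M' to 0 treats a missing reading as a real zero, which matters for columns such as pressure. One possible alternative, shown only as a sketch on made-up values, is to let `pd.to_numeric(errors="coerce")` turn unparseable markers into `NaN` so they can be imputed later. Treating 'T' as 0 below mirrors the notebook's choice.

```python
import pandas as pd

# Hypothetical raw column mixing numbers, the '  T' marker and the 'M' marker.
raw = pd.Series(["0.00", "  T", "M", "0.13"])

# Map '  T' to zero explicitly, then coerce anything non-numeric ('M') to NaN.
cleaned = pd.to_numeric(raw.replace("  T", "0"), errors="coerce")

print(cleaned.tolist())
print(int(cleaned.isna().sum()), "value left as missing")
```

Whether `NaN` or 0 is preferable depends on the downstream model; the notebook's approach keeps every row usable without an imputation step.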

In [67]:
# Calculate Tavg using (Tmax + Tmin)/2 to deal with 'M's in Tavg.

weather.Tavg = (weather.Tmax + weather.Tmin)/2

In [68]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,66.5,14,51,56.0,0.0,2.0,0448,1849,,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68.0,M,51,57.0,0.0,3.0,-,-,,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,50.5,-3,42,47.0,14.0,0.0,0447,1850,BR,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,51.5,M,42,47.0,13.0,0.0,-,-,BR HZ,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56.0,2,40,48.0,9.0,0.0,0446,1851,,0.0,29.39,30.12,11.7,7,11.9


In [69]:
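# Grouping values from station 1 & 2 by dates.
# For all numerical columns, we take the average of the values from stations 1 & 2.
# numeric_only=True skips the text columns (Depart, Sunrise, Sunset, CodeSum), which recent pandas versions no longer drop automatically.

weather_stations_combined = weather.groupby('Date').mean(numeric_only=True)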

In [70]:
# Dropping the Station column from weather_stations_combined.

weather_stations_combined.drop(['Station'], axis=1, inplace=True)

In [71]:
weather_stations_combined.head()

Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
2007-05-01,83.5,51.0,67.25,51.0,56.5,0.0,2.5,0.0,29.14,29.82,2.2,26.0,9.4
2007-05-02,59.5,42.5,51.0,42.0,47.0,13.5,0.0,0.0,29.41,30.085,13.15,3.0,13.4
2007-05-03,66.5,47.0,56.75,40.0,49.0,8.0,0.0,0.0,29.425,30.12,12.3,6.5,12.55
2007-05-04,72.0,50.0,61.0,41.5,50.0,3.5,0.0,0.0,29.335,30.045,10.25,7.5,10.6
2007-05-05,66.0,53.5,59.75,38.5,49.5,5.0,0.0,0.0,29.43,30.095,11.45,7.0,11.75


In [72]:
# Defining a function to convert 24h time to float number (for example, 0445 will be converted to 4.75).

def sun_time_converter(sun_time):
    # Whole hours: integer-divide the HHMM value by 100.
    hours = sun_time // 100

    # Minutes expressed as a fraction of an hour.
    minutes = (sun_time % 100) / 60

    return hours + minutes
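
A quick check of the conversion on a few HHMM strings (two of them appear in the `weather.head()` output above) may help confirm the arithmetic. The function is repeated here so that the snippet runs on its own:

```python
def sun_time_converter(sun_time):
    hours = sun_time // 100            # whole hours
    minutes = (sun_time % 100) / 60    # minutes as a fraction of an hour
    return hours + minutes

# The raw values are strings such as '0448'; they are cast to float first,
# as the notebook does with astype(float).
examples = {raw: sun_time_converter(float(raw)) for raw in ["0445", "0448", "1849"]}

for raw, hours in examples.items():
    print(raw, "->", round(hours, 6))
```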

In [73]:
# Only station 1 reports sunrise and sunset times (station 2 shows '-'), so station 1 values are used.
sunrise_times = weather[weather['Station']==1]['Sunrise'].astype(float)
sunset_times = weather[weather['Station']==1]['Sunset'].astype(float)

# Mapping the above created function to convert 24h time values to floating point numbers.
# The values are assigned by position, which assumes one station 1 row per date, in date order.
weather_stations_combined['Sunrise'] = list(sunrise_times.map(sun_time_converter))
weather_stations_combined['Sunset'] = list(sunset_times.map(sun_time_converter))

In [74]:
weather_stations_combined.head()

Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Sunrise,Sunset
2007-05-01,83.5,51.0,67.25,51.0,56.5,0.0,2.5,0.0,29.14,29.82,2.2,26.0,9.4,4.8,18.816667
2007-05-02,59.5,42.5,51.0,42.0,47.0,13.5,0.0,0.0,29.41,30.085,13.15,3.0,13.4,4.783333,18.833333
2007-05-03,66.5,47.0,56.75,40.0,49.0,8.0,0.0,0.0,29.425,30.12,12.3,6.5,12.55,4.766667,18.85
2007-05-04,72.0,50.0,61.0,41.5,50.0,3.5,0.0,0.0,29.335,30.045,10.25,7.5,10.6,4.733333,18.866667
2007-05-05,66.0,53.5,59.75,38.5,49.5,5.0,0.0,0.0,29.43,30.095,11.45,7.0,11.75,4.716667,18.883333


In [75]:
weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,66.5,14,51,56.0,0.0,2.0,0448,1849,,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68.0,M,51,57.0,0.0,3.0,-,-,,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,50.5,-3,42,47.0,14.0,0.0,0447,1850,BR,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,51.5,M,42,47.0,13.0,0.0,-,-,BR HZ,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56.0,2,40,48.0,9.0,0.0,0446,1851,,0.0,29.39,30.12,11.7,7,11.9


In [76]:
# Splitting concatenated weather codes in the CodeSum column into separate codes (for example, "TSRA" becomes "TS RA").

weather['CodeSum'] = weather['CodeSum'].str.replace("BCFG","BC FG")
weather['CodeSum'] = weather['CodeSum'].str.replace("MIFG","MI FG")
weather['CodeSum'] = weather['CodeSum'].str.replace("TSRA","TS RA")
weather['CodeSum'] = weather['CodeSum'].str.replace("VCFG","VC FG")
weather['CodeSum'] = weather['CodeSum'].str.replace("VCTS","VC TS")

In [77]:
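# Defining a function to merge the weather codes reported by both stations on a given date.

def clean_codesum(codes):
    # codes holds the CodeSum strings of all stations for one date.
    merged = " ".join(codes)
    # Splitting on whitespace and de-duplicating; sorting keeps the output reproducible across runs.
    unique_codes = sorted(set(merged.split()))
    return " ".join(unique_codes)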
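
To illustrate what the aggregation does for a single date, the sketch below applies the same logic to the two station strings shown for 2007-05-02 in `weather.head()` ("BR" and "BR HZ"). The function is restated so the snippet is self-contained; the token order of a set is not guaranteed, so the result is compared as a sorted list.

```python
def clean_codesum(codes):
    merged = " ".join(codes)                  # concatenate both stations' codes
    return " ".join(set(merged.split()))      # de-duplicate the individual codes

# Station 1 reported 'BR', station 2 reported 'BR HZ'.
result = clean_codesum(["BR", "BR HZ"])
tokens = sorted(result.split())

# Dates with no reported conditions collapse to an empty string.
empty = clean_codesum(["", " "])

print(tokens, repr(empty))
```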

In [78]:
# Applying the above defined function on the Codesum column group by dates.

code_sum_combined = weather.groupby('Date')['CodeSum'].agg(clean_codesum)
code_sum_list = code_sum_combined.tolist()

In [79]:
# Applying CountVectorizer to create dummy columns of each individual weather condition.

cvec = CountVectorizer(analyzer='word', token_pattern=r'[\w\+]+')
code_sum_cvec = cvec.fit_transform(code_sum_list)
code_sum_df = pd.DataFrame(code_sum_cvec.toarray(), columns=cvec.get_feature_names_out())
code_sum_df.head()

Unnamed: 0,bc,br,dz,fg,fg+,fu,gr,hz,mi,ra,sn,sq,ts,vc
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0
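
The custom `token_pattern` is what keeps `fg+` as its own column. `CountVectorizer` lowercases the text by default and extracts tokens with its regular expression, so the effect can be approximated with the standard `re` module. This is a simplified stand-in for the vectorizer's tokenizer, and the input string is made up:

```python
import re

# Hypothetical CodeSum string for one date, lowercased as CountVectorizer would do.
doc = "FG+ BR TS RA".lower()

# Pattern used in the notebook: word characters and '+' form one token.
custom_tokens = re.findall(r"[\w\+]+", doc)

# scikit-learn's default pattern: two or more word characters between word boundaries.
default_tokens = re.findall(r"(?u)\b\w\w+\b", doc)

print(custom_tokens)
print(default_tokens)
```

With the default pattern, `fg+` would be counted as plain `fg`, merging two distinct weather codes.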


In [80]:
code_sum_df.shape

(1472, 14)

In [81]:
weather_stations_combined = weather_stations_combined.reset_index()
weather_stations_combined.shape

(1472, 16)

In [82]:
# Merging weather_stations_combined and the dummy columns of all weather conditions.

cleaned_weather = pd.concat([weather_stations_combined, code_sum_df.reset_index(drop=True)], axis=1)
cleaned_weather.head()

Unnamed: 0,Date,Tmax,Tmin,Tavg,DewPoint,WetBulb,Heat,Cool,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Sunrise,Sunset,bc,br,dz,fg,fg+,fu,gr,hz,mi,ra,sn,sq,ts,vc
0,2007-05-01,83.5,51.0,67.25,51.0,56.5,0.0,2.5,0.0,29.14,29.82,2.2,26.0,9.4,4.8,18.816667,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2007-05-02,59.5,42.5,51.0,42.0,47.0,13.5,0.0,0.0,29.41,30.085,13.15,3.0,13.4,4.783333,18.833333,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2,2007-05-03,66.5,47.0,56.75,40.0,49.0,8.0,0.0,0.0,29.425,30.12,12.3,6.5,12.55,4.766667,18.85,0,0,0,0,0,0,0,1,0,0,0,0,0,0
3,2007-05-04,72.0,50.0,61.0,41.5,50.0,3.5,0.0,0.0,29.335,30.045,10.25,7.5,10.6,4.733333,18.866667,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,2007-05-05,66.0,53.5,59.75,38.5,49.5,5.0,0.0,0.0,29.43,30.095,11.45,7.0,11.75,4.716667,18.883333,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [83]:
cleaned_weather.shape

(1472, 30)

### Exporting cleaned data

In [84]:
# Exporting cleaned dataframes as CSV files.

train.to_csv("../input/cleaned_train.csv", index=False)
test.to_csv("../input/cleaned_test.csv", index=False)
spray.to_csv("../input/cleaned_spray.csv", index=False)
cleaned_weather.to_csv("../input/cleaned_weather.csv", index=False)
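
One point worth keeping in mind for the next notebook: CSV files do not store dtypes, so the `Date` columns converted above come back as plain text unless `parse_dates` is passed when reading. A minimal sketch with an in-memory buffer and invented rows:

```python
import io
import pandas as pd

# Hypothetical cleaned frame with a datetime column.
frame = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2007-06-05"]),
    "WnvPresent": [0, 1],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Reading back without parse_dates yields text dates.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading back with parse_dates restores the datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain["Date"].dtype, parsed["Date"].dtype)
```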