# Cleaning the spray data

In [42]:
import pandas as pd
import numpy as np

Load the spray data

In [43]:
spray = pd.read_csv('./spray.csv')

This is what the first five rows look like

In [44]:
spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


**Get more info about the spray data frame**

In [45]:
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 4 columns):
Date         14835 non-null object
Time         14251 non-null object
Latitude     14835 non-null float64
Longitude    14835 non-null float64
dtypes: float64(2), object(2)
memory usage: 463.7+ KB


**Date**

Convert the Date column to datetime format

In [46]:
# Use an uppercase %Y so that pandas knows the year has 4 digits, not 2
spray['Date'] = pd.to_datetime(spray['Date'], format='%Y-%m-%d')
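
With an explicit `format`, `pd.to_datetime` raises an error on the first string that does not match. A minimal sketch of how malformed dates could be located instead, assuming a small hypothetical sample (the values in `raw` are made up for illustration and are not taken from spray.csv):

```python
import pandas as pd

# Hypothetical sample; the last entry does not follow the YYYY-MM-DD layout.
raw = pd.Series(['2011-08-29', '2013-09-05', '29/08/2011'])

# errors='coerce' turns strings that do not match the format into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Keep the raw strings that failed to parse so they can be inspected.
bad = raw[parsed.isna()]
print(bad.tolist())
```

If `bad` is empty, every value matched the format and the strict conversion used above is safe.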

Over what time period was the data collected?

In [47]:
# Earliest collection
print(spray['Date'].min())
# Latest collection
print(spray['Date'].max())

2011-08-29 00:00:00
2013-09-05 00:00:00
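
The minimum and maximum only give the outer bounds of the period. A short sketch of how the span and the number of distinct collection days could be summarised, assuming a hypothetical `dates` series (only its first and last values mirror the output above; the rest are made up):

```python
import pandas as pd

# Hypothetical dates; only the earliest and latest match the real data.
dates = pd.to_datetime(
    pd.Series(['2011-08-29', '2011-08-29', '2011-09-07', '2013-09-05'])
)

# Length of the collection period as a Timedelta.
span = dates.max() - dates.min()

# Number of distinct days on which rows were recorded.
n_days = dates.dt.normalize().nunique()

# Number of rows per calendar year, ordered by year.
rows_per_year = dates.dt.year.value_counts().sort_index()

print(span.days, n_days)
print(rows_per_year)
```

On the real data frame the same calls would be applied to `spray['Date']`.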


Add new columns to the data frame for Year, Month and Day

In [48]:
spray['Year'] = spray['Date'].dt.year

In [49]:
spray['Month'] = spray['Date'].dt.month

In [50]:
spray['Day'] = spray['Date'].dt.day

In [51]:
spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude,Year,Month,Day
0,2011-08-29,6:56:58 PM,42.391623,-88.089163,2011,8,29
1,2011-08-29,6:57:08 PM,42.391348,-88.089163,2011,8,29
2,2011-08-29,6:57:18 PM,42.391022,-88.089157,2011,8,29
3,2011-08-29,6:57:28 PM,42.390637,-88.089158,2011,8,29
4,2011-08-29,6:57:38 PM,42.39041,-88.088858,2011,8,29
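
The three date parts could also be added in one step with `DataFrame.assign`. A minimal sketch on a hypothetical two-row frame `df` (not the real spray data):

```python
import pandas as pd

# Hypothetical frame with an already parsed Date column.
df = pd.DataFrame({'Date': pd.to_datetime(['2011-08-29', '2013-09-05'])})

# assign returns a new frame with the extra columns appended.
df = df.assign(
    Year=df['Date'].dt.year,
    Month=df['Date'].dt.month,
    Day=df['Date'].dt.day,
)
print(df)
```

The result is the same as three separate assignments; `assign` simply keeps the step in a single expression.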


**Time**

In [52]:
spray['Time'].isnull().sum()

584

In [53]:
spray['Time'].unique()

array(['6:56:58 PM', '6:57:08 PM', '6:57:18 PM', ..., '8:04:01 PM',
       '8:04:11 PM', '8:04:21 PM'], dtype=object)
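
To see whether the missing Time values are spread out or concentrated on particular days, the null counts can be grouped by Date. A sketch on a hypothetical `sample` frame (the dates and their missing values are invented for illustration and say nothing about where the gaps are in spray.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two of them have no Time recorded.
sample = pd.DataFrame({
    'Date': pd.to_datetime(
        ['2011-08-29', '2011-09-07', '2011-09-07', '2013-07-17']
    ),
    'Time': ['6:56:58 PM', np.nan, np.nan, '8:04:01 PM'],
})

# Count missing Time values per date, then keep only dates with gaps.
missing_by_date = sample['Time'].isnull().groupby(sample['Date']).sum()
missing_by_date = missing_by_date[missing_by_date > 0]
print(missing_by_date)
```

On the real data the same pattern would use `spray['Time']` and `spray['Date']`.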

We decided to drop the Time column from the spray data frame

In [54]:
spray = spray.drop(['Time'], axis=1)
spray

Unnamed: 0,Date,Latitude,Longitude,Year,Month,Day
0,2011-08-29,42.391623,-88.089163,2011,8,29
1,2011-08-29,42.391348,-88.089163,2011,8,29
2,2011-08-29,42.391022,-88.089157,2011,8,29
3,2011-08-29,42.390637,-88.089158,2011,8,29
4,2011-08-29,42.390410,-88.088858,2011,8,29
5,2011-08-29,42.390395,-88.088315,2011,8,29
6,2011-08-29,42.390673,-88.088002,2011,8,29
7,2011-08-29,42.391027,-88.088002,2011,8,29
8,2011-08-29,42.391403,-88.088003,2011,8,29
9,2011-08-29,42.391718,-88.087995,2011,8,29


In [55]:
spray[['Latitude','Longitude']].describe()

Unnamed: 0,Latitude,Longitude
count,14835.0,14835.0
mean,41.904828,-87.73669
std,0.104381,0.067292
min,41.713925,-88.096468
25%,41.785001,-87.794225
50%,41.940075,-87.727853
75%,41.980978,-87.694108
max,42.395983,-87.586727


**Save the cleaned data to CSV**

In [56]:
spray.to_csv('spray_clean.csv')
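
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back, and the Date column is read back as plain strings unless it is parsed again. A sketch of a round trip that avoids both, assuming the index is not needed downstream; `clean` is a hypothetical two-row frame and an in-memory buffer stands in for the file:

```python
import io

import pandas as pd

# Hypothetical rows standing in for the cleaned spray data.
clean = pd.DataFrame({
    'Date': pd.to_datetime(['2011-08-29', '2013-09-05']),
    'Latitude': [42.391623, 41.980978],
    'Longitude': [-88.089163, -87.694108],
})

# index=False leaves the row index out of the CSV.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# parse_dates restores the datetime dtype when reading the data back.
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=['Date'])
print(reloaded.columns.tolist())
```

With a real file, the buffer would be replaced by a path such as `'spray_clean.csv'` in both calls.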