# Clean Spray Data

In this notebook, we will inspect and clean the `spray.csv` file provided in the Kaggle competition.

In [1]:
# For Calculation and Data Manipulation
import numpy as np
import pandas as pd
# import datetime

In [2]:
# both files are located in the same folder
original_filename = '../data/spray.csv'
clean_filename = '../data/cleaned_spray.csv'

In [3]:
# read in data
spray = pd.read_csv(original_filename)

# see data shape and first 5 rows
print(spray.shape)
spray.head()

(14835, 4)


Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


In [4]:
# see data info and check for null values
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       14835 non-null  object 
 1   Time       14251 non-null  object 
 2   Latitude   14835 non-null  float64
 3   Longitude  14835 non-null  float64
dtypes: float64(2), object(2)
memory usage: 463.7+ KB
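
The `info()` output above shows 14251 non-null values in `Time` out of 14835 rows, so 584 rows have no time recorded. As a minimal sketch, the per-column null counts can also be read directly with `isnull().sum()`. The toy frame below is a hypothetical stand-in for `spray.csv`, with illustrative values only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for spray.csv (values are illustrative only)
toy = pd.DataFrame({
    'Date': ['2011-08-29', '2011-08-29', '2011-09-07'],
    'Time': ['6:56:58 PM', np.nan, '7:44:32 PM'],
    'Latitude': [42.391623, 42.391348, 41.980000],
    'Longitude': [-88.089163, -88.089163, -87.790000],
})

# count missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)
```

On the real data, the same call would be `spray.isnull().sum()`.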


Let's convert `Date` to a datetime dtype and create a new column that stores the time in 24-hour format.

In [None]:
# convert date Dtype
spray['Date'] = pd.to_datetime(spray['Date'])

# create new column to store time in 24 hour format
spray['Time_24h'] = pd.to_datetime(spray['Time']).dt.time
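
The call above lets pandas infer the time format. As an optional sketch, the format can be passed explicitly, assuming every non-null value follows the 12-hour `H:MM:SS AM/PM` pattern seen in the first rows. The `times` series below is a hypothetical sample, not the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of 12-hour time strings, including one missing value
times = pd.Series(['6:56:58 PM', np.nan, '12:05:00 AM'])

# %I = hour on a 12-hour clock, %p = AM/PM marker
parsed = pd.to_datetime(times, format='%I:%M:%S %p')

# keep only the time of day; missing values stay missing
time_24h = parsed.dt.time
print(time_24h.tolist())
```

With an explicit format, a value that does not match the pattern raises an error instead of being parsed by guesswork.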

In [None]:
# see data shape and first 5 rows
print(spray.shape)
spray.head()

(14835, 5)


Unnamed: 0,Date,Time,Latitude,Longitude,Time_24h
0,2011-08-29,6:56:58 PM,42.391623,-88.089163,18:56:58
1,2011-08-29,6:57:08 PM,42.391348,-88.089163,18:57:08
2,2011-08-29,6:57:18 PM,42.391022,-88.089157,18:57:18
3,2011-08-29,6:57:28 PM,42.390637,-88.089158,18:57:28
4,2011-08-29,6:57:38 PM,42.39041,-88.088858,18:57:38


In [None]:
# see data info
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       14835 non-null  datetime64[ns]
 1   Time       14251 non-null  object        
 2   Latitude   14835 non-null  float64       
 3   Longitude  14835 non-null  float64       
 4   Time_24h   14251 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 579.6+ KB


For easier manipulation, we will create 3 columns to store the day, month and year of the date. 

In [None]:
# create 3 columns to store the day, month and year
spray['day'] = spray['Date'].dt.day
spray['month'] = spray['Date'].dt.month
spray['year'] = spray['Date'].dt.year

In [None]:
# see data shape and first 5 rows
print(spray.shape)
spray.head()

(14835, 8)


Unnamed: 0,Date,Time,Latitude,Longitude,Time_24h,day,month,year
0,2011-08-29,6:56:58 PM,42.391623,-88.089163,18:56:58,29,8,2011
1,2011-08-29,6:57:08 PM,42.391348,-88.089163,18:57:08,29,8,2011
2,2011-08-29,6:57:18 PM,42.391022,-88.089157,18:57:18,29,8,2011
3,2011-08-29,6:57:28 PM,42.390637,-88.089158,18:57:28,29,8,2011
4,2011-08-29,6:57:38 PM,42.39041,-88.088858,18:57:38,29,8,2011


In [None]:
# see data info
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       14835 non-null  datetime64[ns]
 1   Time       14251 non-null  object        
 2   Latitude   14835 non-null  float64       
 3   Longitude  14835 non-null  float64       
 4   Time_24h   14251 non-null  object        
 5   day        14835 non-null  int64         
 6   month      14835 non-null  int64         
 7   year       14835 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(3), object(2)
memory usage: 927.3+ KB


In [None]:
# convert all column names to lower case
spray.rename(columns={column: column.lower() for column in spray.columns}, inplace=True)

# see data shape and first 5 rows
print(spray.shape)
spray.head()

(14835, 8)


Unnamed: 0,date,time,latitude,longitude,time_24h,day,month,year
0,2011-08-29,6:56:58 PM,42.391623,-88.089163,18:56:58,29,8,2011
1,2011-08-29,6:57:08 PM,42.391348,-88.089163,18:57:08,29,8,2011
2,2011-08-29,6:57:18 PM,42.391022,-88.089157,18:57:18,29,8,2011
3,2011-08-29,6:57:28 PM,42.390637,-88.089158,18:57:28,29,8,2011
4,2011-08-29,6:57:38 PM,42.39041,-88.088858,18:57:38,29,8,2011


In [None]:
# see data info
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       14835 non-null  datetime64[ns]
 1   time       14251 non-null  object        
 2   latitude   14835 non-null  float64       
 3   longitude  14835 non-null  float64       
 4   time_24h   14251 non-null  object        
 5   day        14835 non-null  int64         
 6   month      14835 non-null  int64         
 7   year       14835 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(3), object(2)
memory usage: 927.3+ KB


We will **not remove the rows with null values**, as it is not yet clear whether the time will be used in the analysis. At any point, we can remove them either by dropping the rows that contain null values or by dropping the time columns.

```python
# run the code below to drop the rows that contain null values
spray.dropna(inplace=True)

# or run the code below to drop the time columns instead
spray.drop(['time', 'time_24h'], axis=1, inplace=True)
```
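
To illustrate the trade-off between the two options, here is a minimal sketch on a hypothetical toy frame (illustrative values, not the real data). Passing `subset` to `dropna` limits the null check to the listed columns, and both calls return new frames rather than modifying `toy` in place.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing time (values are illustrative only)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2011-08-29', '2011-08-29', '2011-09-07']),
    'time': ['6:56:58 PM', np.nan, '7:44:32 PM'],
    'latitude': [42.391623, 42.391348, 41.980000],
    'longitude': [-88.089163, -88.089163, -87.790000],
})

# option 1: drop only the rows where 'time' is missing
rows_dropped = toy.dropna(subset=['time'])

# option 2: keep every row but drop the 'time' column
cols_dropped = toy.drop(columns=['time'])

print(rows_dropped.shape)
print(cols_dropped.shape)
```

Option 1 loses spray locations that have no time, while option 2 keeps every location but loses the time information.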

In [None]:
# export cleaned data file
spray.to_csv(clean_filename, index=False)
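
Note that a CSV file does not store dtypes, so the datetime column is read back as plain text unless it is parsed again. A minimal sketch of the round trip, using an in-memory buffer and a hypothetical toy frame instead of the real file:

```python
import io
import pandas as pd

# Toy frame with a datetime column (values are illustrative only)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2011-08-29', '2011-09-07']),
    'latitude': [42.391623, 41.980000],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# read back without date parsing: 'date' comes back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# read back with parse_dates: 'date' is a datetime column again
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=['date'])

print(plain.dtypes)
print(reloaded.dtypes)
```

When loading the cleaned file later, the equivalent call would be `pd.read_csv(clean_filename, parse_dates=['date'])`.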