# Data Engineering

The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

* Create a Jupyter Notebook file called data_engineering.ipynb and use this to complete all of your Data Engineering tasks.
* Use Pandas to read in the measurement and station CSV files as DataFrames.
* Inspect the data for NaNs and missing values. You must decide what to do with this data.
* Save your cleaned CSV files with the prefix clean_.
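
The `clean_` naming convention in the last step can be sketched with a small helper. `clean_path` is a hypothetical function introduced here for illustration only; the notebook below simply hard-codes the output file names.

```python
from pathlib import Path


def clean_path(raw_path):
    """Return the path of the cleaned copy: same folder, 'clean_' prefix.

    Hypothetical helper for illustration; not part of the assignment.
    """
    raw = Path(raw_path)
    return raw.with_name("clean_" + raw.name)


print(clean_path("hawaii_measurements.csv"))
print(clean_path("hawaii_stations.csv"))
```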

In [1]:
#dependencies
import pandas as pd
import numpy as np

In [2]:
#defining the paths
measurements_path = "hawaii_measurements.csv"
stations_path = "hawaii_stations.csv"

In [3]:
#reading data from csv to dataframes
measurements_df = pd.read_csv(measurements_path)
stations_df = pd.read_csv(stations_path)

## Inspection of Measurements Data

In [4]:
#check the datasource
measurements_df.head(10)

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73
5,USC00519397,2010-01-07,0.06,70
6,USC00519397,2010-01-08,0.0,64
7,USC00519397,2010-01-09,0.0,68
8,USC00519397,2010-01-10,0.0,73
9,USC00519397,2010-01-11,0.01,64


In [5]:
# describe the dataset
measurements_df.describe()

Unnamed: 0,prcp,tobs
count,18103.0,19550.0
mean,0.160644,73.097954
std,0.468746,4.523527
min,0.0,53.0
25%,0.0,70.0
50%,0.01,73.0
75%,0.11,76.0
max,11.53,87.0


In [6]:
#call the info function to get a sense of the datatypes and nulls
measurements_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19550 entries, 0 to 19549
Data columns (total 4 columns):
station    19550 non-null object
date       19550 non-null object
prcp       18103 non-null float64
tobs       19550 non-null int64
dtypes: float64(1), int64(1), object(2)
memory usage: 611.0+ KB
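
`info()` reports non-null counts, so the number of missing `prcp` values has to be inferred by subtraction (19550 - 18103 = 1447). A more direct check is `isna().sum()`. The sketch below runs on a small made-up frame rather than the real file, so the numbers it prints are not results from the Hawaii data.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from hawaii_measurements.csv.
sample = pd.DataFrame(
    {
        "station": ["A", "A", "B", "B"],
        "prcp": [0.08, np.nan, 0.0, np.nan],
        "tobs": [65, 63, 74, 76],
    }
)

# Number of missing values in each column.
null_counts = sample.isna().sum()
print(null_counts)

# Fraction of rows that have at least one missing value.
null_row_fraction = sample.isna().any(axis=1).mean()
print(null_row_fraction)
```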


In [7]:
#summary counts: stations, distinct dates, date range, and records with null prcp
num_stations = measurements_df['station'].nunique()
num_dates = measurements_df['date'].nunique()
min_date = measurements_df['date'].min()
max_date = measurements_df['date'].max()
nulls = measurements_df[measurements_df['prcp'].isnull()]
total_size = len(measurements_df)
print(f"There are {num_stations} unique stations, measuring data from {num_dates} days from {min_date} to {max_date}.")
print(f"There are {len(nulls)} records with null values, {len(nulls) / total_size} of the total data set.")

There are 9 unique stations, measuring data from 2792 days from 2010-01-01 to 2017-08-23.
There are 1447 records with null values, 0.0740153452685422 of the total data set.


In [8]:
#Investigating the null values
grouped_nulls = nulls[['station', 'date']].groupby('station').count()
grouped_nulls = grouped_nulls.rename(columns = {'date': 'Null_Values'})
grouped_full = measurements_df[['station', 'date']].groupby('station').count()
grouped_full = grouped_full.rename(columns = {'date': 'Full_Data_Set'})
station_comparison = grouped_full.merge(grouped_nulls, left_index = True, right_index = True, how = "outer")
station_comparison['%_null'] = station_comparison['Null_Values'] / station_comparison['Full_Data_Set']
station_comparison

station,Full_Data_Set,Null_Values,%_null
USC00511918,1979,47.0,0.023749
USC00513117,2709,13.0,0.004799
USC00514830,2202,265.0,0.120345
USC00516128,2612,128.0,0.049005
USC00517948,1372,689.0,0.502187
USC00518838,511,169.0,0.330724
USC00519281,2772,,
USC00519397,2724,39.0,0.014317
USC00519523,2669,97.0,0.036343
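
The same per-station comparison can also be computed without a merge, because the mean of a boolean mask is the fraction of `True` values. This is a sketch on a made-up frame, not the author's method. One difference in behavior: a station with no nulls (such as USC00519281 above) would get 0.0 rather than NaN.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
sample = pd.DataFrame(
    {
        "station": ["A", "A", "A", "A", "B", "B"],
        "prcp": [0.1, np.nan, 0.0, np.nan, 0.2, 0.0],
    }
)

# Flag missing readings, then average the flags within each station.
null_fraction = sample["prcp"].isna().groupby(sample["station"]).mean()
print(null_fraction)
```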


### Based on the analysis above, I will take the following actions:

* Drop all measurements from station USC00517948: about 50% of its precipitation readings are null, so it appears to be faulty
* Drop all remaining rows with a null prcp value
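
One way to sketch these two actions is with a boolean mask. `DataFrame.drop` expects index labels rather than a mask, so the station filter is written as a row selection. The frame and the station name `BAD` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; "BAD" stands in for the faulty station.
sample = pd.DataFrame(
    {
        "station": ["A", "BAD", "A", "B", "BAD"],
        "prcp": [0.1, np.nan, np.nan, 0.0, 0.3],
    }
)

# Keep every row that does NOT belong to the faulty station.
kept = sample[sample["station"] != "BAD"]

# Drop the remaining rows with a missing value and renumber the index.
cleaned = kept.dropna().reset_index(drop=True)
print(cleaned)
```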

In [9]:
#Drop measurements from station USC00517948
#(drop() expects index labels, not a boolean mask, so filter the rows instead)
measurements_df = measurements_df[measurements_df['station'] != 'USC00517948']

#Drop all other null values
measurements_df = measurements_df.dropna()

#resetting index on measurements_df
measurements_df = measurements_df.reset_index(drop = True)

In [10]:
measurements_df.to_csv('clean_hawaii_measurements.csv', index = False)
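
Before moving on, it can be worth confirming that the saved file reads back with the expected columns and no nulls. A minimal sketch, assuming made-up rows and writing to an in-memory buffer instead of `clean_hawaii_measurements.csv`:

```python
import io

import pandas as pd

# Toy stand-in for the cleaned measurements frame.
cleaned = pd.DataFrame(
    {
        "station": ["A", "B"],
        "date": ["2010-01-01", "2010-01-02"],
        "prcp": [0.08, 0.0],
        "tobs": [65, 63],
    }
)

# Write without the index, as in the cell above, then read the text back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The round trip should preserve the columns and introduce no nulls.
print(list(reloaded.columns))
print(int(reloaded.isna().sum().sum()))
```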

## Inspection of Stations Data

In [11]:
stations_df

Unnamed: 0,station,name,latitude,longitude,elevation
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0
1,USC00513117,"KANEOHE 838.1, HI US",21.4234,-157.8015,14.6
2,USC00514830,"KUALOA RANCH HEADQUARTERS 886.9, HI US",21.5213,-157.8374,7.0
3,USC00517948,"PEARL CITY, HI US",21.3934,-157.9751,11.9
4,USC00518838,"UPPER WAHIAWA 874.3, HI US",21.4992,-158.0111,306.6
5,USC00519523,"WAIMANALO EXPERIMENTAL FARM, HI US",21.33556,-157.71139,19.5
6,USC00519281,"WAIHEE 837.5, HI US",21.45167,-157.84889,32.9
7,USC00511918,"HONOLULU OBSERVATORY 702.2, HI US",21.3152,-157.9992,0.9
8,USC00516128,"MANOA LYON ARBO 785.2, HI US",21.3331,-157.8025,152.4


In [12]:
stations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 5 columns):
station      9 non-null object
name         9 non-null object
latitude     9 non-null float64
longitude    9 non-null float64
elevation    9 non-null float64
dtypes: float64(3), object(2)
memory usage: 440.0+ bytes


In [13]:
#Remove station USC00517948 (drop() expects index labels, not a boolean mask, so filter the rows instead)
stations_df = stations_df[stations_df['station'] != 'USC00517948']
stations_df = stations_df.reset_index(drop = True)

In [14]:
stations_df.to_csv('clean_hawaii_stations.csv', index = False)

In [15]:
stations_df.head()

Unnamed: 0,station,name,latitude,longitude,elevation
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0
1,USC00513117,"KANEOHE 838.1, HI US",21.4234,-157.8015,14.6
2,USC00514830,"KUALOA RANCH HEADQUARTERS 886.9, HI US",21.5213,-157.8374,7.0
3,USC00518838,"UPPER WAHIAWA 874.3, HI US",21.4992,-158.0111,306.6
4,USC00519523,"WAIMANALO EXPERIMENTAL FARM, HI US",21.33556,-157.71139,19.5
