### Step 1 - Data Engineering

The climate data for Hawaii is provided through two CSV files. Start by using Python and Pandas to inspect the content of these files and clean the data.

• Create a Jupyter Notebook file called data_engineering.ipynb and use this to complete all of your Data Engineering tasks.

• Use Pandas to read in the measurement and station CSV files as DataFrames.

• Inspect the data for NaNs and missing values. You must decide what to do with this data.

• Save your cleaned CSV files with the prefix clean_.

In [1]:
# Import dependencies (csv and numpy were imported but never used)

import os

import pandas as pd

In [2]:
# Build the paths to the source CSV files

measurements_path = os.path.join("Resources", "hawaii_measurements.csv")
stations_path = os.path.join("Resources", "hawaii_stations.csv")

In [3]:
# Read the Hawaii measurements file into a DataFrame

measurements_df = pd.read_csv(measurements_path)
measurements_df.head()

,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [4]:
# Read the Hawaii stations file into a DataFrame

stations_df = pd.read_csv(stations_path)
stations_df.head()

,station,name,latitude,longitude,elevation
0,USC00519397,"WAIKIKI 717.2, HI US",21.2716,-157.8168,3.0
1,USC00513117,"KANEOHE 838.1, HI US",21.4234,-157.8015,14.6
2,USC00514830,"KUALOA RANCH HEADQUARTERS 886.9, HI US",21.5213,-157.8374,7.0
3,USC00517948,"PEARL CITY, HI US",21.3934,-157.9751,11.9
4,USC00518838,"UPPER WAHIAWA 874.3, HI US",21.4992,-158.0111,306.6


In [5]:
# Summary statistics for the numeric columns
measurements_df.describe()

,prcp,tobs
count,18103.0,19550.0
mean,0.160644,73.097954
std,0.468746,4.523527
min,0.0,53.0
25%,0.0,70.0
50%,0.01,73.0
75%,0.11,76.0
max,11.53,87.0


In [6]:
# Non-null values per column
measurements_df.count()

station    19550
date       19550
prcp       18103
tobs       19550
dtype: int64

In [7]:
# Missing values per column in the measurements data
measurements_df.isnull().sum()

station       0
date          0
prcp       1447
tobs          0
dtype: int64

In [8]:
# Missing values per column in the stations data
stations_df.isnull().sum()

station      0
name         0
latitude     0
longitude    0
elevation    0
dtype: int64
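
The null counts above show that only `prcp` has gaps. Before deciding what to do with them, it may help to see how the common options behave. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the Hawaii files); which option is appropriate depends on the analysis that follows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the measurements table (values are made up)
sample = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B"],
    "date": ["2010-01-01", "2010-01-02", "2010-01-03", "2010-01-01", "2010-01-02"],
    "prcp": [0.08, np.nan, 0.0, np.nan, 0.06],
    "tobs": [65, 63, 74, 70, 71],
})

# Share of missing prcp readings per station
missing_share = sample["prcp"].isnull().groupby(sample["station"]).mean()

# Option 1: drop every row that contains a NaN (the tobs reading is lost too)
dropped = sample.dropna()

# Option 2: keep the rows and treat a missing reading as zero rainfall
# (an assumption that may understate precipitation)
filled = sample.fillna({"prcp": 0.0})

print(missing_share)
print(len(sample), len(dropped), len(filled))
```

Dropping is the simpler choice and is the one this notebook takes; filling keeps the temperature observations at the cost of an assumption about rainfall.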

In [9]:
# Drop the 1,447 rows with a missing prcp value, leaving 18,103 of 19,550
clean_measurements = measurements_df.dropna()
clean_measurements.count()

station    18103
date       18103
prcp       18103
tobs       18103
dtype: int64

In [10]:
# Confirm that no missing values remain
clean_measurements.isnull().sum()

station    0
date       0
prcp       0
tobs       0
dtype: int64
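
The counts are consistent: 19,550 original rows minus 1,447 rows with a missing `prcp` leaves 18,103. The same check can be written as an assertion so that it fails loudly if the data changes. This is a sketch on a made-up stand-in frame, not the notebook's own `measurements_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for measurements_df (values are made up)
raw = pd.DataFrame({
    "prcp": [0.08, np.nan, 0.0, np.nan],
    "tobs": [65, 63, 74, 76],
})

# Rows holding at least one NaN
rows_with_nan = int(raw.isnull().any(axis=1).sum())
cleaned = raw.dropna()

# Rows removed by dropna() should equal the rows that held at least one NaN
assert len(raw) - len(cleaned) == rows_with_nan
print(len(raw), rows_with_nan, len(cleaned))
```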

In [11]:
# Export the cleaned DataFrame (index=False avoids writing the row index as an extra column)

clean_measurements.to_csv(os.path.join("Resources", "clean_hawaii_measurements.csv"), index=False)
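
By default, `DataFrame.to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of the difference, using a made-up frame and an in-memory buffer rather than the real Resources folder:

```python
import io

import pandas as pd

# Hypothetical frame with a non-contiguous index, as left behind by dropna()
df = pd.DataFrame({"station": ["A", "B"], "prcp": [0.08, 0.06]}, index=[0, 5])

# Default export: the index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether the index is worth keeping depends on how the cleaned file is consumed later; for a plain table of observations it is usually safe to leave it out.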

In [12]:
# Preview the cleaned data (the row at index 4, which had no prcp value, is gone)
clean_measurements.head()

,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
5,USC00519397,2010-01-07,0.06,70
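
The stations table had no missing values, so no rows need removing, but the task asks for the cleaned files to carry the `clean_` prefix. A minimal sketch of one way to derive the output name follows; `clean_path` is a hypothetical helper, not part of the original notebook.

```python
import os


def clean_path(source_path):
    """Return source_path with 'clean_' prefixed to the file name."""
    folder, filename = os.path.split(source_path)
    return os.path.join(folder, "clean_" + filename)


stations_path = os.path.join("Resources", "hawaii_stations.csv")
clean_stations_path = clean_path(stations_path)
print(clean_stations_path)
```

With the notebook's `stations_df` in scope, `stations_df.to_csv(clean_path(stations_path), index=False)` would then write the stations file alongside the cleaned measurements.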
