# Data engineering

In this section we'll explore the climate data for Hawaii by:

+ checking for missing data (null values)
+ checking for duplicates

## Results

No duplicate data was found in either the station data or the measurements data.
About 7.4% of the rows in the measurements data contained null values (1447 of 19550 rows).

The rows with null values were dropped from the measurements data, and the result was saved to a new file with the prefix `clean_`.
Losing 7.4% of a total of 19550 rows seems acceptable in this case.
Be aware, however, that this might not always be the case: statistical analysis might be needed to determine whether deleting data harms the reliability of the data, and hence of any analysis based on it.
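
One simple way to start such an analysis is to check whether the rows with missing precipitation differ from the rest. The sketch below is only an illustration under assumptions: `toy_df` is hypothetical toy data that borrows the `station`, `prcp` and `tobs` column names from the measurements file, and the two comparisons are a first look, not a formal statistical test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the measurements table
toy_df = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "prcp": [0.1, np.nan, 0.0, np.nan, 0.3, 0.2],
    "tobs": [70, 72, 68, 75, 71, 69],
})

# Flag the rows where precipitation is missing
missing = toy_df["prcp"].isnull()

# Share of missing rows per station: a strong imbalance would suggest
# that dropping rows removes some stations more than others
missing_share_by_station = missing.groupby(toy_df["station"]).mean()

# Mean temperature for rows with (False) and without (True) a prcp value
tobs_by_missing = toy_df.groupby(missing)["tobs"].mean()

print(missing_share_by_station)
print(tobs_by_missing)
```

If the missing rows cluster in particular stations or temperature ranges, dropping them could bias later results, and a more careful treatment might be worth considering.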

In [2]:
# Dependencies
import pandas as pd
import os

In [3]:
# Load files into dataframe
## Measurement file
input_file_m = input("Enter the name of the measurement file you want to analyze (without extension): ") + ".csv"
filepath_m = os.path.join('Resources', input_file_m)
measurement_df = pd.read_csv(filepath_m)
## Station file
input_file_s = input("Enter the name of the station file you want to analyze (without extension): ") + ".csv"
filepath_s = os.path.join('Resources', input_file_s)
station_df = pd.read_csv(filepath_s)

Enter the name of the measurement file you want to analyze (without extension): hawaii_measurements
Enter the name of the station file you want to analyze (without extension): hawaii_stations


In [4]:
# Check for NaN values in measurements data:
measurement_df.isnull().sum()

station       0
date          0
prcp       1447
tobs          0
dtype: int64
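
The 7.4% quoted in the results follows from these counts. As a minimal sketch, the share of nulls per column can be computed with `isnull().mean()`; the `toy_df` frame below is hypothetical and only illustrates the call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 1 of 4 rows lacks a precipitation value
toy_df = pd.DataFrame({
    "prcp": [0.1, np.nan, 0.0, 0.3],
    "tobs": [70, 72, 68, 75],
})

# isnull().mean() gives the fraction of nulls per column
null_pct = toy_df.isnull().mean() * 100
print(null_pct.round(1))

# The same arithmetic with the counts reported in this notebook
reported_pct = round(1447 / 19550 * 100, 1)
print(reported_pct)  # 7.4
```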

In [5]:
# Check for NaN values in stations data:
station_df.isnull().sum()

station      0
name         0
latitude     0
longitude    0
elevation    0
dtype: int64

In [6]:
# Check for duplicates in measurements data
# (if no rows are duplicated, the selection is empty and every sum is 0):
measurement_df[measurement_df.duplicated(keep=False)].sum()

station    0.0
date       0.0
prcp       0.0
tobs       0.0
dtype: float64
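
Summing the selected rows works, but the result is indirect to read. A possibly clearer alternative, sketched here on a hypothetical `toy_df`, is to count the boolean flags returned by `duplicated()` directly.

```python
import pandas as pd

# Hypothetical toy frame with one repeated row
toy_df = pd.DataFrame({
    "station": ["A", "A", "B"],
    "date": ["2017-01-01", "2017-01-01", "2017-01-01"],
    "tobs": [70, 70, 71],
})

# duplicated() flags each repeat of an earlier row (keep="first" by default)
n_repeats = int(toy_df.duplicated().sum())

# keep=False flags every member of a duplicated group, including the first
n_involved = int(toy_df.duplicated(keep=False).sum())

print(n_repeats, n_involved)  # 1 2
```

A count of 0 from either call means the frame holds no fully duplicated rows.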

In [None]:
# Check for duplicates in stations data:
station_df[station_df.duplicated(keep=False)].sum()

In [9]:
# Precipitation is missing for 1447 rows in measurement_df.
# There are no clues as to what the missing values might be, so we'll delete all rows with null values:
measurement_df = measurement_df.dropna()

In [10]:
# Check again:
measurement_df.isnull().sum()

station    0
date       0
prcp       0
tobs       0
dtype: int64
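
Besides re-checking the null counts, it can help to confirm that only the expected number of rows was removed. The sketch below uses a hypothetical `toy_df`. With the counts reported in this notebook, 19550 - 1447 = 18103 rows should remain.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 2 of 4 rows contain a null
toy_df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "prcp": [0.1, np.nan, np.nan, 0.2],
    "tobs": [70, 72, 75, 69],
})

rows_before = len(toy_df)
# Number of rows holding at least one null value
n_null_rows = int(toy_df.isnull().any(axis=1).sum())

cleaned_df = toy_df.dropna()
rows_after = len(cleaned_df)

# Every removed row should be one that held a null
rows_removed = rows_before - rows_after
print(rows_before, rows_after, rows_removed == n_null_rows)
```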

In [11]:
# Save cleaned csv files with prefix clean_
# (relative file names, so they are written to the current working directory)
prefix = "clean_"
measurement_df.to_csv(prefix + input_file_m, encoding="utf-8-sig", index=False)
station_df.to_csv(prefix + input_file_s, encoding="utf-8-sig", index=False)
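
As an optional last step, the saved file can be read back to confirm that it was written as expected. This is a sketch under assumptions: `toy_df` and the file name `clean_toy_measurements.csv` are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned measurements
toy_df = pd.DataFrame({
    "station": ["A", "B"],
    "prcp": [0.1, 0.2],
    "tobs": [70, 69],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "clean_toy_measurements.csv")
    # Same options as the save cell: BOM-prefixed UTF-8, no index column
    toy_df.to_csv(out_path, encoding="utf-8-sig", index=False)
    reloaded_df = pd.read_csv(out_path, encoding="utf-8-sig")

# The reloaded frame should have the same columns, shape and no nulls
same_columns = list(reloaded_df.columns) == list(toy_df.columns)
same_shape = reloaded_df.shape == toy_df.shape
total_nulls = int(reloaded_df.isnull().sum().sum())
print(same_columns, same_shape, total_nulls)
```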