# Capstone - Port Hardy Data Cleaning
Date Started: 2021.10.28<br>
Date Completed: 2021...<br>
William Matthews

### Data Set

### Data Dictionary

### Report Objectives and Flow

This report's primary objective is to outline the data cleaning process for the Port Hardy Weather Balloon Station Data. The rough plan is as follows:
- Confirm we have the appropriate dates
- Identify any rows with missing data
    - Shift data to correct columns if possible (the scraping process was not perfect!)
    - If missing data is in columns of interest, explore dropping or imputing
        - Drop/impute as needed
- Drop columns that are not of interest
- Check for duplicates and drop as necessary
- Compress multiple soundings for a specific datetime into some sort of aggregated measure
- Write to csv
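The duplicate check and the aggregation step in the plan above could look roughly like the sketch below. It runs on a small made-up frame (`toy`), and the choice of a simple mean per datetime is an assumption for illustration only; the plan leaves the aggregated measure open. Wind direction is left out on purpose, since a plain mean is not meaningful for angles.

```python
import pandas as pd

# Toy stand-in for the sounding data; the values are illustrative only.
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-01-01 04:00", "2015-01-01 04:00",
                            "2015-01-01 04:00", "2015-01-01 16:00"]),
    "PRES": [700.0, 700.0, 690.0, 700.0],
    "SKNT": [25.0, 25.0, 27.0, 10.0],
})

# Drop rows that are exact duplicates across every column.
deduped = toy.drop_duplicates()

# Collapse the remaining levels of each sounding into one row per datetime.
# The mean is a placeholder; another summary may suit the project better.
per_sounding = deduped.groupby("DATE", as_index=False).agg(
    PRES_mean=("PRES", "mean"),
    SKNT_mean=("SKNT", "mean"),
    n_levels=("PRES", "size"),
)
print(per_sounding)
```

Keeping `n_levels` alongside the means makes it easy to spot datetimes that were summarized from very few levels.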

### Libraries and Imports

In [7]:
# managing data
import pandas as pd

# managing time
from datetime import datetime

In [2]:
ph_df = pd.read_csv('./Data/BallonData/Port Hardy.csv')

### Data Exploration

Let's start out with a high level look at our data.

In [15]:
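# info() prints its summary directly and returns None, hence the `None` in the output
display(ph_df.info(),
        ph_df.head(),
        ph_df.shape,
        # missing cells per row (total NaN count / row count), not a share of all cells
        ph_df.isna().sum().sum() / ph_df.shape[0])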

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8838 entries, 0 to 8837
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   DATE    8838 non-null   datetime64[ns]
 1   PRES    8838 non-null   float64       
 2   HGHT    8838 non-null   int64         
 3   TEMP    8838 non-null   float64       
 4   DWPT    8838 non-null   float64       
 5   RELH    8838 non-null   int64         
 6   MIXR    8838 non-null   float64       
 7   DRCT    8838 non-null   float64       
 8   SKNT    8827 non-null   float64       
 9   THTA    8827 non-null   float64       
 10  THTE    8813 non-null   float64       
 11  THTV    8813 non-null   float64       
 12  STNM    8838 non-null   object        
dtypes: datetime64[ns](1), float64(9), int64(2), object(1)
memory usage: 897.7+ KB


None

|   | DATE | PRES | HGHT | TEMP | DWPT | RELH | MIXR | DRCT | SKNT | THTA | THTE | THTV | STNM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-12-31 16:00:00 | 724.0 | 2899 | -1.7 | -5.3 | 76 | 3.58 | 351.0 | 25.0 | 297.7 | 308.6 | 298.3 | Port Hardy |
| 1 | 2014-12-31 16:00:00 | 714.0 | 3010 | -1.9 | -6.6 | 70 | 3.28 | 344.0 | 25.0 | 298.7 | 308.8 | 299.2 | Port Hardy |
| 2 | 2014-12-31 16:00:00 | 703.0 | 3134 | -1.1 | -10.1 | 50 | 2.54 | 337.0 | 26.0 | 300.9 | 308.9 | 301.3 | Port Hardy |
| 3 | 2014-12-31 16:00:00 | 700.0 | 3168 | -1.3 | -11.3 | 47 | 2.31 | 335.0 | 26.0 | 301.0 | 308.4 | 301.4 | Port Hardy |
| 4 | 2014-12-31 16:00:00 | 699.0 | 3179 | -1.3 | -11.3 | 47 | 2.32 | 333.0 | 26.0 | 301.1 | 308.5 | 301.6 | Port Hardy |


(8838, 13)

0.008146639511201629

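From the above it looks like we have 8,838 rows across 13 columns. Four columns (`SKNT`, `THTA`, `THTE`, and `THTV`) have some missing values. The ratio printed last is missing cells per row (72 missing cells over 8,838 rows), which works out to about 0.06% of all cells, so the missing values account for well under 1% of our data. We will address this shortly. The data appears to be what is expected based on the first 5 rows, and the data types are all as expected. Since we might do some manipulation on the `DATE` column, we will make sure it is stored as a datetime. (The `info()` output above already lists `DATE` as `datetime64[ns]`, presumably because that cell was re-run after the conversion further down.) First, we will do a quick check that the first and last dates match our known date range (01 Jan 2015 through 17 Apr 2021).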
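Because "how much is missing" can be measured per column, per row, or per cell, a short sketch of all three may help. It uses a small made-up frame (`toy`) rather than `ph_df`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; values are illustrative only.
toy = pd.DataFrame({
    "SKNT": [25.0, np.nan, 26.0, 26.0],
    "THTA": [297.7, np.nan, 300.9, 301.0],
    "THTE": [308.6, np.nan, np.nan, 308.4],
})

# Count of missing values in each column.
missing_per_column = toy.isna().sum()

# Number and share of rows with at least one missing value.
rows_with_missing = toy.isna().any(axis=1).sum()
share_of_rows = rows_with_missing / len(toy)

# Share of all cells (rows x columns) that are missing.
share_of_cells = toy.isna().sum().sum() / toy.size

print(missing_per_column, share_of_rows, share_of_cells, sep="\n")
```

The share of rows is usually the most relevant figure when deciding whether dropping records is acceptable.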

In [6]:
print(f"First Date: {ph_df.iloc[0, :]['DATE']}")
print(f"Last Date: {ph_df.iloc[-1, :]['DATE']}")

First Date: 2014-12-31 16:00:00
Last Date: 2021-04-30 04:00:00


From the above it looks like the first date matches our expectations (the night before our first target date). The last date, 30 Apr 2021, is a little past our end date of 17 Apr 2021. Let's go ahead and convert the column to datetime and then drop the records that are later than 17 Apr 2021.

In [8]:
# convert column to date time
ph_df['DATE'] = pd.to_datetime(ph_df['DATE'])

# confirm it worked
ph_df['DATE'].dtypes

dtype('<M8[ns]')
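Checking the dtype confirms the conversion ran, but it does not show whether any individual value failed to parse. One optional extra check, sketched here on a made-up series (`raw`) and assuming the timestamps follow the `YYYY-MM-DD HH:MM:SS` pattern seen in the data, is to pass `errors="coerce"` so that unparseable entries become `NaT` and can be counted.

```python
import pandas as pd

# Made-up raw strings; the last entry is deliberately invalid.
raw = pd.Series(["2014-12-31 16:00:00", "2015-01-01 04:00:00", "not a date"])

# errors="coerce" turns anything that does not match the format into NaT
# instead of raising, so bad entries can be counted afterwards.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_bad = parsed.isna().sum()
print(f"Unparseable dates: {n_bad}")
```

If the count is zero on the real column, the stricter default behavior (raising on bad input) and the coerced result are equivalent.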

Now that we have the `DATE` column as datetime objects, let's move ahead and drop the records outside our time range.

In [13]:
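# get the indices of rows dated after the cutoff; datetime(2021, 4, 17) is midnight
# at the start of 17 Apr, so any records later on 17 Apr are selected too
past_indicies = ph_df[ph_df['DATE'] > datetime(2021, 4, 17)].index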

# confirm this drop is going to do what we want it to
(ph_df.drop(index = past_indicies)['DATE'].max() < datetime(2021, 4, 18))

True

In [14]:
# drop the additional rows
ph_df1 = ph_df.drop(index = past_indicies).copy()

# confirm
display(ph_df1['DATE'].max() < datetime(2021, 4, 18))

# double check
display(ph_df1.iloc[-1, :])

True

DATE    2021-04-16 16:00:00
PRES                  682.4
HGHT                   3353
TEMP                   -1.2
DWPT                  -18.2
RELH                     26
MIXR                   1.34
DRCT                   95.0
SKNT                   13.0
THTA                  303.4
THTE                  307.8
THTV                  303.6
STNM             Port Hardy
Name: 8763, dtype: object
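The same filtering can be written as a boolean mask, which also makes the exact cutoff easy to see. The sketch below uses a made-up frame (`toy`) to show how the choice of cutoff decides whether records timestamped on 17 Apr itself are kept; which of the two is right for this project depends on how the target date range is defined.

```python
import pandas as pd

# Made-up timestamps around the end of the target range.
toy = pd.DataFrame({"DATE": pd.to_datetime([
    "2021-04-16 16:00", "2021-04-17 04:00",
    "2021-04-17 16:00", "2021-04-18 04:00",
])})

# Cutoff at midnight starting 17 Apr: records later on 17 Apr are removed.
keep_before_17th = toy[toy["DATE"] <= pd.Timestamp("2021-04-17")]

# Cutoff at midnight starting 18 Apr: records on 17 Apr are kept.
keep_through_17th = toy[toy["DATE"] < pd.Timestamp("2021-04-18")]

print(len(keep_before_17th), len(keep_through_17th))
```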

With the data outside of our time range removed, let's move on to the rows with missing values. We know that when we scraped the data from the web, a few records were missing various columns. The scraping algorithm dealt with this by simply shifting the remaining values into the leftmost empty columns, so the gaps end up on the right. Let's take a look at all of the rows missing data (only 25) to make sure nothing strange is going on.

In [17]:
# use the date-filtered frame (ph_df1) so the dropped rows stay out of the picture
ph_df1[ph_df1.isna().any(axis = 1)]

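|   | DATE | PRES | HGHT | TEMP | DWPT | RELH | MIXR | DRCT | SKNT | THTA | THTE | THTV | STNM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 488 | 2015-02-25 04:00:00 | 704.2 | 3048 | -5.0 | 265.0 | 15 | 296.4 | 296.4 | NaN | NaN | NaN | NaN | Port Hardy |
| 489 | 2015-02-25 04:00:00 | 700.0 | 3095 | -5.3 | 275.0 | 14 | 296.6 | 296.6 | NaN | NaN | NaN | NaN | Port Hardy |
| 490 | 2015-02-25 04:00:00 | 698.0 | 3117 | -5.5 | 276.0 | 15 | 296.6 | 296.6 | NaN | NaN | NaN | NaN | Port Hardy |
| 491 | 2015-02-25 04:00:00 | 677.2 | 3353 | -6.3 | 285.0 | 23 | 298.3 | 298.3 | NaN | NaN | NaN | NaN | Port Hardy |
| 2451 | 2016-11-08 16:00:00 | 720.9 | 2743 | -1.3 | 190.0 | 45 | 298.5 | 298.5 | NaN | NaN | NaN | NaN | Port Hardy |
| 2452 | 2016-11-08 16:00:00 | 700.0 | 2978 | -2.3 | 185.0 | 50 | 299.9 | 299.9 | NaN | NaN | NaN | NaN | Port Hardy |
| 2453 | 2016-11-08 16:00:00 | 693.8 | 3048 | -2.7 | 185.0 | 48 | 300.3 | 300.3 | NaN | NaN | NaN | NaN | Port Hardy |
| 2666 | 2016-12-01 16:00:00 | 700.0 | 3003 | -9.3 | -34.3 | 11 | 0.3 | 292.2 | 293.2 | 292.2 | NaN | NaN | Port Hardy |
| 2851 | 2016-12-26 16:00:00 | 715.0 | 2712 | -9.3 | -9.3 | 100 | 2.66 | 290.4 | 298.4 | 290.9 | NaN | NaN | Port Hardy |
| 2852 | 2016-12-26 16:00:00 | 703.0 | 2843 | -11.3 | -11.9 | 95 | 2.2 | 289.6 | 296.3 | 290.0 | NaN | NaN | Port Hardy |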
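If the columns that were actually reported for a row can be established, the left-packed values can be moved back to where they belong. The sketch below is one hypothetical way to do that. The helper `realign` and the list `assumed_cols` are illustrative names, and the column mapping used for row 488 is a guess based on the magnitudes of its values (265.0 looks like a wind direction rather than a dew point); it would need to be confirmed against the original sounding before being applied.

```python
import numpy as np
import pandas as pd

MEASURE_COLS = ["PRES", "HGHT", "TEMP", "DWPT", "RELH", "MIXR",
                "DRCT", "SKNT", "THTA", "THTE", "THTV"]


def realign(row, reported_cols):
    """Move left-packed values to the columns named in reported_cols.

    reported_cols lists, in order, the columns the source actually reported
    for this row (an input that has to be worked out by hand).
    """
    values = row[MEASURE_COLS].dropna().to_numpy()
    if len(values) != len(reported_cols):
        raise ValueError("number of values does not match reported_cols")
    fixed = row.copy()
    fixed.loc[MEASURE_COLS] = np.nan          # clear every measurement
    for col, value in zip(reported_cols, values):
        fixed.loc[col] = value                # put each value where it belongs
    return fixed


# Row 488 as it appears in the table above (measurement columns only).
shifted = pd.Series(
    [704.2, 3048, -5.0, 265.0, 15, 296.4, 296.4,
     np.nan, np.nan, np.nan, np.nan],
    index=MEASURE_COLS, dtype="float64")

# ASSUMPTION: these are the columns that were really reported for this row.
assumed_cols = ["PRES", "HGHT", "TEMP", "DRCT", "SKNT", "THTA", "THTV"]

fixed = realign(shifted, assumed_cols)
print(fixed)
```

The length check guards against applying a mapping to a row that has a different number of reported values.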


Two of the key variables of interest are `DRCT` (wind direction) and `SKNT` (wind speed in knots), so it is unfortunate they have been affected. `MIXR`, `THTA`, `THTE`, and `THTV` are not of interest to us, so we are not worried about them. The trouble with simply deleting these records is that we only have two measurement times per day and relatively few soundings for each datetime, so deleting them would leave holes in our record. Let's grab each unique datetime and then see if we can impute the values. This may take some manual research to recover the sounding data we omitted from our web scraping!
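Grabbing the unique datetimes, together with how much of each sounding is affected, might be sketched as follows. The frame `toy` is made up for illustration; on the real data the same steps would be applied to the date-filtered frame.

```python
import numpy as np
import pandas as pd

# Made-up frame: two soundings, each with two levels; values are illustrative.
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-02-25 04:00", "2015-02-25 04:00",
                            "2016-12-01 16:00", "2016-12-01 16:00"]),
    "SKNT": [np.nan, np.nan, 12.0, 20.0],
    "THTE": [np.nan, np.nan, np.nan, 300.0],
})

# Flag rows with at least one missing value.
has_missing = toy.isna().any(axis=1)

# Unique datetimes that need attention.
affected_dates = toy.loc[has_missing, "DATE"].unique()

# Affected rows per datetime, and their share of that sounding's rows.
affected = toy.loc[has_missing].groupby("DATE").size()
total = toy.groupby("DATE").size()
share = (affected / total).fillna(0)

print(affected_dates)
print(share)
```

A sounding where every row is affected is a candidate for manual lookup, while one with only a few affected rows may be easier to impute from its remaining levels.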

_Personal Note: Roughly 1 hr to this point_