## Capstone 1 
# San Francisco Bay Water Quality

ref. [Water quality of SF Bay home page](https://sfbay.wr.usgs.gov/access/wqdata/index.html)
     
     

## Unit 5 - Data Wrangling, part 1

### Tasks

The first step in completing your capstone project is to collect data. Depending on your dataset, you may apply some of the data wrangling techniques that you learned in this unit.

Include answers to these questions in your submission:
   * What kind of cleaning steps did you perform?

   * How did you deal with missing values, if any?

   * Were there outliers, and how did you handle them?


## Data Acquisition

### Water Quality Data

#### Access
   1. Water quality data, 1969 - 2019, requested via query form. No API is available.
   [Expert query](https://sfbay.wr.usgs.gov/access/wqdata/query/expert.html) in three chunks, saved as CSV files
      1. Julian Date < 1999001 <br/>
      2. 1999001 < Julian Date < 2009001 <br/>
      3. Julian Date > 2009001 <br/>

      
**Note**: Water quality data is also available for download from [ScienceBase](https://www.sciencebase.gov/catalog/item/5841f97ee4b04fc80e518d9f); however, that archive includes fewer parameters and is not as up to date as the database at sfbay.wr.usgs.gov.

#### Files
   1. `SFBayWaterQuality1969-1998.csv` 
   2. `SFBayWaterQuality1999-2008.csv`
   3. `SFBayWaterQuality2009-2019.csv`

   
#### Data Format

All files are formatted as CSV (comma-separated values) with 26 columns.

WaterQuality files have two header rows; the second row shows units of measure. 

<small>

```
Date, Time, Station Number, Distance from 36, Depth, Discrete Chlorophyll, Chlorophyll a/a+PHA, Fluorescence, Calculated Chlorophyll, Discrete Oxygen, Oxygen Electrode Output, Oxygen Saturation %, Calculated Oxygen, Discrete SPM, Optical Backscatter, Calculated SPM, Measured Extinction Coefficient, Calculated Extinction Coefficient, Salinity, Temperature, Sigma-t, Nitrite, Nitrate + Nitrite, Ammonium, Phosphate, Silicate
```
```
MM/DD/YYYY, 24 hr., , [km], [meters], [mg/m3], , [volts], [mg/m3], [mg/L], [volts], , [mg/L], [mg/L], [volts], [mg/L], [per meter], [per meter], [psu], [°C], [kg/m3], [µM], [µM], [µM], [µM], [µM]
```
</small>





## Setup

Import libraries

In [1]:
# Import useful libraries

import pandas as pd
import matplotlib.pyplot as plt
import datetime
import re
import json


## Read in the Water Quality data

In [2]:
# Read in the Water Quality data
wq_df1 = pd.read_csv('Data/orig/SFBayWaterQuality1969-1998.csv', header=[0,1])
wq_df2 = pd.read_csv('Data/orig/SFBayWaterQuality1999-2008.csv', header=[0,1])
wq_df3 = pd.read_csv('Data/orig/SFBayWaterQuality2009-2019.csv', header=[0,1])

## Combine datasets

The three water quality DataFrames have identical columns and can easily be concatenated into one DataFrame.

In [3]:
# Concatenate water quality DataFrames
wq_df = pd.concat([wq_df1, wq_df2, wq_df3]).reset_index(drop=True)
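
Before concatenating, it can be worth confirming that the column layouts really do match. The following is a minimal sketch on toy data; the names `chunk_a`, `chunk_b`, `chunk_c`, and `combined` and all values are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the three yearly chunks (values are made up)
chunk_a = pd.DataFrame({"Depth": [1.0, 2.0], "Salinity": [30.1, 29.8]})
chunk_b = pd.DataFrame({"Depth": [1.5], "Salinity": [28.7]})
chunk_c = pd.DataFrame({"Depth": [3.0], "Salinity": [31.2]})

chunks = [chunk_a, chunk_b, chunk_c]

# Confirm every chunk has the same columns, in the same order, as the first
same_columns = all(chunk.columns.equals(chunks[0].columns) for chunk in chunks[1:])

# Concatenate only if the layouts match; ignore_index renumbers the rows 0..n-1
if same_columns:
    combined = pd.concat(chunks, ignore_index=True)
```

Passing `ignore_index=True` has the same effect here as calling `reset_index(drop=True)` after the concatenation.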

In [4]:
# Examine the new DataFrame
wq_df.sample(20)

Unnamed: 0_level_0,Date,Time,Station Number,Distance from 36,Depth,Discrete Chlorophyll,Chlorophyll a/a+PHA,Fluorescence,Calculated Chlorophyll,Discrete Oxygen,...,Measured Extinction Coefficient,Calculated Extinction Coefficient,Salinity,Temperature,Sigma-t,Nitrite,Nitrate + Nitrite,Ammonium,Phosphate,Silicate
Unnamed: 0_level_1,MM/DD/YYYY,24 hr.,Unnamed: 2_level_1,[km],[meters],[mg/m3],Unnamed: 6_level_1,[volts],[mg/m3],[mg/L],...,[per meter],[per meter],[psu],[°C],[kg/m3],[µM],[µM],[µM],[µM],[µM]
67959,5/1/1996,1423,6.0,110.9,11.0,,,0.72,6.5,,...,,,5.53,17.84,2.85,,,,,
137488,1/9/2007,1507,6.0,110.9,4.0,,,0.22,2.1,,...,,,4.36,8.39,3.26,,,,,
57028,2/7/1995,816,27.0,26.19,13.0,1.2,0.57,0.66,1.9,,...,,,16.4,13.0,12.03,,,,,
68989,7/17/1996,1359,8.0,99.77,13.0,,,0.32,2.3,,...,,,12.34,19.39,7.7,,,,,
10086,5/16/1983,1045,25.0,32.84,1.0,,,,13.4,,...,,2.9,15.92,16.3,,,,,,
222470,9/19/2017,1217,9.0,96.79,34.0,2.4,0.32,0.28,3.7,,...,,,13.47,20.42,8.32,,,,,
218779,4/4/2017,915,30.0,14.75,4.0,,,0.54,11.2,,...,,,15.63,15.73,10.96,,,,,
117890,3/9/2004,747,29.5,17.56,7.0,,,2.06,6.9,,...,,,17.33,14.66,12.46,,,,,
71363,1/13/1997,1024,25.0,32.84,1.0,,,0.25,0.9,,...,2.1,,10.44,10.73,7.76,,,,,
53116,5/17/1994,1825,2.0,127.71,5.0,,,1.13,8.3,,...,,,0.47,18.2,,,,,,


We can now ignore the original Water Quality files / DFs and use the concatenated DF containing all data from 1969 to 2019.

### Handle multi-level index for water quality columns

The original Water Quality CSV files had two-row column headers. The second level is units.

```
wq_df.columns
```
<small>

```
MultiIndex([(                             'Date',          'MM/DD/YYYY'),
            (                             'Time',              '24 hr.'),
            (                   'Station Number',  'Unnamed: 2_level_1'),
            (                 'Distance from 36',                '[km]'),
            (                            'Depth',            '[meters]'),
            (             'Discrete Chlorophyll',             '[mg/m3]'),
            (              'Chlorophyll a/a+PHA',  'Unnamed: 6_level_1'),
            (                     'Fluorescence',             '[volts]'),
            (           'Calculated Chlorophyll',             '[mg/m3]'),
            (                  'Discrete Oxygen',              '[mg/L]'),
            (          'Oxygen Electrode Output',             '[volts]'),
            (              'Oxygen Saturation %', 'Unnamed: 11_level_1'),
            (                'Calculated Oxygen',              '[mg/L]'),
            (                     'Discrete SPM',              '[mg/L]'),
            (              'Optical Backscatter',             '[volts]'),
            (                   'Calculated SPM',              '[mg/L]'),
            (  'Measured Extinction Coefficient',         '[per meter]'),
            ('Calculated Extinction Coefficient',         '[per meter]'),
            (                         'Salinity',               '[psu]'),
            (                      'Temperature',                '[°C]'),
            (                          'Sigma-t',             '[kg/m3]'),
            (                          'Nitrite',                '[µM]'),
            (                'Nitrate + Nitrite',                '[µM]'),
            (                         'Ammonium',                '[µM]'),
            (                        'Phosphate',                '[µM]'),
            (                         'Silicate',                '[µM]')],
           )
```
</small>

It will be easier to work with the data if I save the units into a dictionary and change the DataFrame to only have one level of headers.

In [5]:
# create a dictionary of Water Quality parameters and units
wq_units = {}
for param, unit in wq_df.columns:
    if 'Unnamed:' in unit:
        # handle fields with no units
        unit = ''
    wq_units[param] = unit
    
wq_units

{'Date': 'MM/DD/YYYY',
 'Time': '24 hr.',
 'Station Number': '',
 'Distance from 36': '[km]',
 'Depth': '[meters]',
 'Discrete Chlorophyll': '[mg/m3]',
 'Chlorophyll a/a+PHA': '',
 'Fluorescence': '[volts]',
 'Calculated Chlorophyll': '[mg/m3]',
 'Discrete Oxygen': '[mg/L]',
 'Oxygen Electrode Output': '[volts]',
 'Oxygen Saturation %': '',
 'Calculated Oxygen': '[mg/L]',
 'Discrete SPM': '[mg/L]',
 'Optical Backscatter': '[volts]',
 'Calculated SPM': '[mg/L]',
 'Measured Extinction Coefficient': '[per meter]',
 'Calculated Extinction Coefficient': '[per meter]',
 'Salinity': '[psu]',
 'Temperature': '[°C]',
 'Sigma-t': '[kg/m3]',
 'Nitrite': '[µM]',
 'Nitrate + Nitrite': '[µM]',
 'Ammonium': '[µM]',
 'Phosphate': '[µM]',
 'Silicate': '[µM]'}

In [6]:
# Reset the Water Quality column headers
wq_df.columns = wq_units.keys()

wq_df.columns

Index(['Date', 'Time', 'Station Number', 'Distance from 36', 'Depth',
       'Discrete Chlorophyll', 'Chlorophyll a/a+PHA', 'Fluorescence',
       'Calculated Chlorophyll', 'Discrete Oxygen', 'Oxygen Electrode Output',
       'Oxygen Saturation %', 'Calculated Oxygen', 'Discrete SPM',
       'Optical Backscatter', 'Calculated SPM',
       'Measured Extinction Coefficient', 'Calculated Extinction Coefficient',
       'Salinity', 'Temperature', 'Sigma-t', 'Nitrite', 'Nitrate + Nitrite',
       'Ammonium', 'Phosphate', 'Silicate'],
      dtype='object')
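
As an alternative sketch, the first header level can also be pulled directly from the MultiIndex with `get_level_values`. The toy DataFrame and the names `toy_df` and `toy_units` below are hypothetical and only mimic the (parameter, unit) layout described above.

```python
import pandas as pd

# Toy two-level header mimicking the (parameter, unit) layout; values are made up
columns = pd.MultiIndex.from_tuples([
    ("Depth", "[meters]"),
    ("Station Number", "Unnamed: 2_level_1"),
    ("Salinity", "[psu]"),
])
toy_df = pd.DataFrame([[1.0, 6.0, 30.1]], columns=columns)

# Build the units lookup, blanking the pandas placeholder for empty header cells
toy_units = {
    param: "" if unit.startswith("Unnamed:") else unit
    for param, unit in toy_df.columns
}

# Keep only the first header level (the parameter names)
toy_df.columns = toy_df.columns.get_level_values(0)
```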

### Convert Date/Time columns to DateTime

The initial dataset has a Date column and a Time column, both in non-standard format. It will be useful to have a single DateTime column.

Issues:
   * The initial Date column is type `string`, M/D/YYYY, with no leading zeroes on day or month, possibly with a leading space. Conveniently, `pd.to_datetime` is able to convert this to DateTime format without trouble.
   * The initial Time column is type `int`, with no leading zeroes on the hour. To concatenate this to the Date column, I need it to be type `string`, 0-padded.

Once I have two strings, I can concatenate them into a new DateTime column and convert that to DateTime format.

In [7]:
# Convert the Date field to datetime format
# 6/4/2010 => 2010-06-04
wq_df['Date'] = pd.to_datetime(wq_df['Date'])

# Convert back to string
wq_df['Date'] = wq_df['Date'].astype('str')

# convert Time field from int to str
wq_df['Time'] = wq_df['Time'].astype('str')

# 0-pad Time values
wq_df['Time'] = wq_df['Time'].transform(lambda x: x.rjust(4,'0')) 

# create a new DateTime field by concatenating the strings
wq_df['DateTime'] = wq_df['Date'].str.cat(wq_df['Time'],sep=' ')

# convert the new field to DateTime format
wq_df['DateTime'] = pd.to_datetime(wq_df['DateTime'])
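
The same conversion can be sketched on toy data with explicit format strings, which avoids relying on format inference. The DataFrame `toy` and its values are made up for illustration; the format strings assume dates of the form M/D/YYYY and times stored as integers such as 815.

```python
import pandas as pd

# Made-up rows: one date has a leading space, one time lacks a leading zero
toy = pd.DataFrame({
    "Date": ["6/4/2010", " 12/25/1999"],
    "Time": [815, 1423],
})

# Strip stray spaces and parse the M/D/YYYY strings with an explicit format
dates = pd.to_datetime(toy["Date"].str.strip(), format="%m/%d/%Y")

# Zero-pad the integer times to four digits: 815 -> "0815"
times = toy["Time"].astype(str).str.zfill(4)

# Join the date and time text, then parse it with an explicit format
toy["DateTime"] = pd.to_datetime(
    dates.dt.strftime("%Y-%m-%d") + " " + times,
    format="%Y-%m-%d %H%M",
)
```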



In [8]:
# Move the new DateTime column to the front of the DataFrame
cols = list(wq_df.columns)    # get the list of columns
cols = [cols[-1]] + cols[:-1] # rearrange the list

wq_df = wq_df[cols]   # rearrange the columns

wq_df.columns

Index(['DateTime', 'Date', 'Time', 'Station Number', 'Distance from 36',
       'Depth', 'Discrete Chlorophyll', 'Chlorophyll a/a+PHA', 'Fluorescence',
       'Calculated Chlorophyll', 'Discrete Oxygen', 'Oxygen Electrode Output',
       'Oxygen Saturation %', 'Calculated Oxygen', 'Discrete SPM',
       'Optical Backscatter', 'Calculated SPM',
       'Measured Extinction Coefficient', 'Calculated Extinction Coefficient',
       'Salinity', 'Temperature', 'Sigma-t', 'Nitrite', 'Nitrate + Nitrite',
       'Ammonium', 'Phosphate', 'Silicate'],
      dtype='object')

In [9]:
# Update the Water Quality units dictionary with the enhanced date data
wq_units['Date'] = 'YYYY-MM-DD'
wq_units['DateTime'] = 'YYYY-MM-DD HH:MM:SS'


### Remove Columns that are not useful

**Optical Backscatter** 

According to the data dictionary, due to sensor changes and gain differences, this value is only comparable within cruises and may not be comparable between cruises.

Thus, I will remove this column.

In [10]:
wq_df.drop(columns=['Optical Backscatter'], inplace=True)

In [11]:
del wq_units['Optical Backscatter']

Calculated chlorophyll, SPM, and O2 values are determined using linear regression between the discrete values and other measurements.
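
To illustrate the general idea only, here is a minimal sketch of a linear calibration with `numpy.polyfit`. The numbers are invented, and the variable names are hypothetical; this is not the USGS procedure, whose details are not given here.

```python
import numpy as np

# Made-up calibration pairs: sensor output (volts) vs. lab-measured discrete value
sensor_volts = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
discrete_value = np.array([1.1, 2.0, 3.1, 3.9, 5.0])

# Fit discrete = slope * volts + intercept by least squares
slope, intercept = np.polyfit(sensor_volts, discrete_value, deg=1)

# Apply the fit to sensor readings, including ones with no lab sample
all_volts = np.array([0.3, 0.5, 0.9])
calculated_value = slope * all_volts + intercept
```

The discrete samples are sparse, while the sensor records at every depth, so a fit of this kind yields a "calculated" value for every sensor reading.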

My USGS contact has suggested that I ignore the "discrete" values going forward. I will remove these from the dataset.

In [12]:
wq_df.drop(columns=['Discrete Chlorophyll', 'Discrete Oxygen', 'Discrete SPM'
                   ], inplace=True)

In [13]:
del wq_units['Discrete Chlorophyll']
del wq_units['Discrete Oxygen']
del wq_units['Discrete SPM']

Convert Station numbers to strings and remove unnecessary trailing `.0`. 

Also, shorten the column name to one word for ease of use.

In [14]:
wq_df['Station Number'] = wq_df['Station Number'].astype(str)

In [15]:
# Strip '.0' only at the end of the string, so values such as '29.5' are untouched
wq_df['Station Number'] = wq_df['Station Number'].str.replace(r'\.0$', '', regex=True)

In [16]:
wq_df.rename(columns={"Station Number": "Station"}, inplace=True)
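
When stripping the trailing `.0`, note that a plain substring replace of `'.0'` would also alter a value that contains `.0` in the middle. The sketch below uses made-up station numbers (12.05 in particular is hypothetical) to compare a plain replace with a regular expression anchored to the end of the string.

```python
import pandas as pd

# Made-up station numbers as they might appear after being read as floats
stations = pd.Series([6.0, 29.5, 30.0, 12.05]).astype(str)

# Plain substring replace: removes '.0' wherever it occurs
plain = [s.replace(".0", "") for s in stations]

# Anchored regex: removes '.0' only when it ends the string
anchored = stations.str.replace(r"\.0$", "", regex=True)
```

Here `plain` turns `'12.05'` into `'125'`, while `anchored` leaves it unchanged.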

I no longer need the Time column; however, I will keep the Date column for now.

In [17]:
wq_df.drop(columns=['Time'], inplace=True)
del wq_units['Time']

In [18]:
# Save the units dictionary
with open('Data/water_quality_units.json', 'w') as fp:
    json.dump(wq_units, fp)
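
Because some units contain non-ASCII characters (° and µ), a quick round-trip check can confirm that the dictionary survives being written and read back. This is a sketch only: `toy_units` and the temporary file path are hypothetical, and `ensure_ascii=False` is an optional choice that keeps the symbols readable in the file.

```python
import json
import os
import tempfile

# A small made-up subset of the units dictionary
toy_units = {"Temperature": "[°C]", "Nitrite": "[µM]", "Station": ""}

# Write to a temporary file; ensure_ascii=False keeps ° and µ readable in the file
path = os.path.join(tempfile.mkdtemp(), "units.json")
with open(path, "w", encoding="utf-8") as fp:
    json.dump(toy_units, fp, ensure_ascii=False, indent=2)

# Read the file back and keep the result for comparison
with open(path, "r", encoding="utf-8") as fp:
    restored_units = json.load(fp)
```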

In [19]:
# Save the DataFrame to CSV
wq_df.to_csv('Data/SFBayWaterQualityCleaned.csv', index=False)

<hr style="border: 5px solid green;">

Next time, we can read the data in with
```
# Read the cleaned file saved above; the Time column was dropped, so it is not parsed
wq_df = pd.read_csv('Data/SFBayWaterQualityCleaned.csv',
                    header=0,
                    parse_dates=['DateTime', 'Date'],
                    dtype={'Station' : str}
                    )


with open('Data/water_quality_units.json', 'r') as f:
    wq_units = json.load(f)

```