## Capstone 1 
# San Francisco Bay Water Quality

ref. [Water quality of SF Bay home page](https://sfbay.wr.usgs.gov/access/wqdata/index.html)
     
     
## Unit 5 - Data Wrangling - Phytoplankton, part 2

Merge Phytoplankton data into Water Quality DataFrame

Recommendation from Tara Schraga: sum the biovolume for each date, station, and depth, then add the result to the WQ DataFrame.

## Setup

Import libraries



In [1]:
# Import useful libraries

import csv
import json
import pandas as pd
import matplotlib.pyplot as plt
import datetime


## Read in the Phytoplankton data


In [2]:
ph_df = pd.read_csv('Data/PhytoplanktonCleaned.csv', 
                     header=0, 
                     parse_dates=['Date'],
                     dtype={'Station Number' : str}
                   )

In [3]:
ph_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33185 entries, 0 to 33184
Data columns (total 6 columns):
Date                        33185 non-null datetime64[ns]
Station Number              33185 non-null object
Depth                       33185 non-null float64
Biovolume                   33185 non-null float64
Phylum or Class             33185 non-null object
Taxonomic Identification    33185 non-null object
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 1.5+ MB


In [4]:
ph_df.head(20)

Unnamed: 0,Date,Station Number,Depth,Biovolume,Phylum or Class,Taxonomic Identification
0,1992-04-01,30,1.0,47502.0,BACILLARIOPHYTA,Asterionellopsis glacialis
1,1992-04-01,30,1.0,950.6,BACILLARIOPHYTA,Ceratoneis closterium
2,1992-04-01,30,1.0,307827.0,BACILLARIOPHYTA,Chaetoceros debilis
3,1992-04-01,30,1.0,476403.6,BACILLARIOPHYTA,Coscinodiscus radiatus
4,1992-04-01,30,1.0,21122.2,BACILLARIOPHYTA,Cyclotella sp.
5,1992-04-01,30,1.0,6160.0,BACILLARIOPHYTA,Gyrosigma acuminatum
6,1992-04-01,30,1.0,36779.8,BACILLARIOPHYTA,Melosira nummuloides
7,1992-04-01,30,1.0,3625.9,BACILLARIOPHYTA,Nitzschia spp.
8,1992-04-01,30,1.0,37225.5,BACILLARIOPHYTA,Paralia sulcata
9,1992-04-01,30,1.0,31356.0,BACILLARIOPHYTA,Pleurosigma normanii


Build a new data structure in which each record represents a unique combination of date, station, and depth, and holds the sum of all biovolume sampled at that combination.

```
Date	    Station Number	Depth	Biovolume
1992-04-01	            30	1.0     2932720
1992-04-01	            32	1.0     6483381
...
```
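
As a minimal sketch of this reshaping, here is the same group-and-sum pattern on a tiny made-up DataFrame. The name `toy` and all of its values are hypothetical, chosen only for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: two taxa at one sample point, one taxon at another
toy = pd.DataFrame({
    'Date': pd.to_datetime(['1992-04-01', '1992-04-01', '1992-04-07']),
    'Station Number': ['30', '30', '13'],
    'Depth': [1.0, 1.0, 1.0],
    'Biovolume': [100.5, 200.0, 50.0],
})

# One row per {date, station, depth}, with biovolume summed across taxa.
# reset_index() turns the grouped keys back into ordinary columns.
toy_sum = (toy.groupby(['Date', 'Station Number', 'Depth'])['Biovolume']
              .sum()
              .reset_index())

print(toy_sum)
```

The two rows for station 30 collapse into one, so three input rows become two output rows.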


In [5]:
# Group by {date, station, depth}, summing biovolume for each set

ph2 = ph_df.groupby(['Date', 'Station Number', 'Depth'])['Biovolume'].sum()

In [6]:
ph2

Date        Station Number  Depth
1992-04-01  30              1.0      3.289001e+06
1992-04-07  13              1.0      6.516864e+06
            18              1.0      3.819823e+06
            657             1.0      9.504703e+05
1992-04-08  30              1.0      1.314765e+07
                                         ...     
2018-12-14  18              2.0      8.769752e+05
            22              2.0      1.280684e+06
            27              2.0      1.654259e+06
            32              2.0      6.115351e+05
            36              2.0      9.222331e+05
Name: Biovolume, Length: 1475, dtype: float64

In [7]:
ph2_df = ph2.reset_index()  
ph2_df.sort_values(by=['Date', 'Station Number', 'Depth'], inplace=True)


In [8]:
type(ph2_df)

pandas.core.frame.DataFrame

In [9]:
ph2_df.head()

Unnamed: 0,Date,Station Number,Depth,Biovolume
0,1992-04-01,30,1.0,3289000.9
1,1992-04-07,13,1.0,6516864.4
2,1992-04-07,18,1.0,3819822.7
3,1992-04-07,657,1.0,950470.3
4,1992-04-08,30,1.0,13147645.7


In [10]:
ph2_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1475 entries, 0 to 1474
Data columns (total 4 columns):
Date              1475 non-null datetime64[ns]
Station Number    1475 non-null object
Depth             1475 non-null float64
Biovolume         1475 non-null float64
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 57.6+ KB


In [11]:
# The fractional part is not significant at these volumes; astype(int) truncates it
ph2_df.Biovolume = ph2_df.Biovolume.astype(int)

ph2_df.head()

Unnamed: 0,Date,Station Number,Depth,Biovolume
0,1992-04-01,30,1.0,3289000
1,1992-04-07,13,1.0,6516864
2,1992-04-07,18,1.0,3819822
3,1992-04-07,657,1.0,950470
4,1992-04-08,30,1.0,13147645
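
Note that `astype(int)` drops the fractional part rather than rounding, which is why 3289000.9 became 3289000 above. The difference is negligible at this scale, but if rounding to the nearest integer were preferred, a sketch of the alternative (using two of the values shown above) might look like this:

```python
import pandas as pd

# Two of the summed biovolumes from the table above
vals = pd.Series([3289000.9, 950470.3])

truncated = vals.astype(int)        # drops the fractional part
rounded = vals.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```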


### Save Phytoplankton DataFrame to disk


In [12]:
# Save our work
ph2_df.to_csv('Data/PhytoplanktonBiovolume.csv', index=False)

Review the first two lines of the file on disk

```
Date,Station Number,Depth,Biovolume
1992-04-01,30,1.0,3289000
```
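
A quick way to confirm that the saved format reads back with the intended types is to parse it with the same `read_csv` options used earlier. The sketch below feeds the two lines shown above through `io.StringIO` rather than opening the real file, so it can run anywhere; `check_df` is a name introduced here for illustration.

```python
import io
import pandas as pd

# The two lines shown above, as they appear in the saved file
csv_text = (
    "Date,Station Number,Depth,Biovolume\n"
    "1992-04-01,30,1.0,3289000\n"
)

# Same options as the earlier read: parse dates, keep station numbers as strings
check_df = pd.read_csv(io.StringIO(csv_text),
                       parse_dates=['Date'],
                       dtype={'Station Number': str})

print(check_df.dtypes)
```

Keeping `Station Number` as a string on both sides matters later, because the merge keys must have matching types.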

## Read in the Water Quality data

In [13]:
wq_df = pd.read_csv('Data/SFBayWaterQuality.csv', 
                     header=0, 
                     parse_dates=['DateTime', 'Date'],
                     dtype={'Station Number' : str}
                   )

In [14]:
wq_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237061 entries, 0 to 237060
Data columns (total 22 columns):
DateTime                             237061 non-null datetime64[ns]
Date                                 237061 non-null datetime64[ns]
Station Number                       237061 non-null object
Distance from 36                     236597 non-null float64
Depth                                237061 non-null float64
Chlorophyll a/a+PHA                  12895 non-null float64
Fluorescence                         220308 non-null float64
Calculated Chlorophyll               225275 non-null float64
Oxygen Electrode Output              189908 non-null float64
Oxygen Saturation %                  191606 non-null float64
Calculated Oxygen                    188107 non-null float64
Calculated SPM                       200937 non-null float64
Measured Extinction Coefficient      13987 non-null float64
Calculated Extinction Coefficient    4772 non-null float64
Salinity                  

In [15]:
cols_to_order = ['Date', 'Station Number', 'Depth']
new_columns = cols_to_order + (wq_df.columns.drop(cols_to_order).tolist())

new_columns


['Date',
 'Station Number',
 'Depth',
 'DateTime',
 'Distance from 36',
 'Chlorophyll a/a+PHA',
 'Fluorescence',
 'Calculated Chlorophyll',
 'Oxygen Electrode Output',
 'Oxygen Saturation %',
 'Calculated Oxygen',
 'Calculated SPM',
 'Measured Extinction Coefficient',
 'Calculated Extinction Coefficient',
 'Salinity',
 'Temperature',
 'Sigma-t',
 'Nitrite',
 'Nitrate + Nitrite',
 'Ammonium',
 'Phosphate',
 'Silicate']

In [16]:
# Reorder the columns and sort in one step, rather than sorting a column subset in place
wq_df = wq_df[new_columns].sort_values(by=['Date', 'Station Number', 'Depth'])

wq_df.head()

Unnamed: 0,Date,Station Number,Depth,DateTime,Distance from 36,Chlorophyll a/a+PHA,Fluorescence,Calculated Chlorophyll,Oxygen Electrode Output,Oxygen Saturation %,...,Measured Extinction Coefficient,Calculated Extinction Coefficient,Salinity,Temperature,Sigma-t,Nitrite,Nitrate + Nitrite,Ammonium,Phosphate,Silicate
0,1969-04-10,4,0.5,1969-04-10 16:15:00,119.9,,,,,,...,,,0.3,13.1,,,,,,
1,1969-04-10,4,2.0,1969-04-10 16:16:00,119.9,,,,,,...,,,0.3,13.1,,0.7,,,1.6,236.0
2,1969-04-10,4,4.0,1969-04-10 16:17:00,119.9,,,,,,...,,,0.3,13.0,,,,,,
3,1969-04-10,4,11.0,1969-04-10 16:18:00,119.9,,,,,,...,,,0.3,13.0,,,,,,
4,1969-04-10,5,0.5,1969-04-10 16:30:00,115.63,,,,,,...,,,0.3,14.1,,,,,,


There are 64 phytoplankton samples that were recorded when no other parameters were measured for that {date, station number, depth}. An outer join would include them. However, I have decided to exclude these samples from the combined Water Quality DataFrame, so the merge uses a left join.
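
One way to count phytoplankton samples that have no matching water quality record is an outer merge with `indicator=True`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. The sketch below uses small hypothetical stand-ins (`wq_toy`, `ph_toy`, and the salinity values are made up) rather than the real DataFrames.

```python
import pandas as pd

keys = ['Date', 'Station Number', 'Depth']

# Hypothetical stand-in for the water quality data
wq_toy = pd.DataFrame({
    'Date': pd.to_datetime(['1992-04-01', '1992-04-07']),
    'Station Number': ['30', '13'],
    'Depth': [1.0, 1.0],
    'Salinity': [30.1, 28.4],   # made-up values
})

# Hypothetical stand-in for the summed phytoplankton data;
# the second row has no counterpart in wq_toy
ph_toy = pd.DataFrame({
    'Date': pd.to_datetime(['1992-04-01', '1992-04-08']),
    'Station Number': ['30', '30'],
    'Depth': [1.0, 1.0],
    'Biovolume': [3289000, 13147645],
})

# Outer join with an indicator column to see where each row came from
both = pd.merge(wq_toy, ph_toy, on=keys, how='outer', indicator=True)
phyto_only = both[both['_merge'] == 'right_only']
print(len(phyto_only), 'phytoplankton sample(s) without water quality data')

# A left join keeps every water quality row and drops the phyto-only rows
left = pd.merge(wq_toy, ph_toy, on=keys, how='left')
print(len(left), 'rows after the left join')
```

Run against the real data, the same pattern should give the count of unmatched samples.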

In [17]:
new_df = pd.merge(wq_df, ph2_df, on=['Date', 'Station Number', 'Depth'], how='left') 
                  
new_df.info()        

<class 'pandas.core.frame.DataFrame'>
Int64Index: 237061 entries, 0 to 237060
Data columns (total 23 columns):
Date                                 237061 non-null datetime64[ns]
Station Number                       237061 non-null object
Depth                                237061 non-null float64
DateTime                             237061 non-null datetime64[ns]
Distance from 36                     236597 non-null float64
Chlorophyll a/a+PHA                  12895 non-null float64
Fluorescence                         220308 non-null float64
Calculated Chlorophyll               225275 non-null float64
Oxygen Electrode Output              189908 non-null float64
Oxygen Saturation %                  191606 non-null float64
Calculated Oxygen                    188107 non-null float64
Calculated SPM                       200937 non-null float64
Measured Extinction Coefficient      13987 non-null float64
Calculated Extinction Coefficient    4772 non-null float64
Salinity                  
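
One side effect of a left join is worth knowing: rows with no phytoplankton match receive NaN, so an integer `Biovolume` column comes back as `float64`. The sketch below shows this on hypothetical two-row data (the `key` column is a stand-in for the three real merge keys). If integer values are wanted, the nullable `Int64` dtype is one option, assuming a pandas version that supports it.

```python
import pandas as pd

# Hypothetical stand-ins: 'key' replaces the three real merge keys
left = pd.DataFrame({'key': ['a', 'b']})
right = pd.DataFrame({'key': ['a'], 'Biovolume': [3289000]})  # integer column

merged = pd.merge(left, right, on='key', how='left')
dtype_after_merge = str(merged['Biovolume'].dtype)
print(dtype_after_merge)   # the unmatched row holds NaN, forcing a float dtype

# Optional: the nullable integer dtype keeps whole numbers alongside missing values
merged['Biovolume'] = merged['Biovolume'].astype('Int64')
print(merged['Biovolume'].dtype)
```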

In [18]:
# update the dictionary of Water Quality parameters and units

with open('Data/water_quality_units.json', 'r') as f:
    wq_units = json.load(f)

wq_units['Biovolume'] = '(cubic micrometers/mL)'

# Save the units dictionary
with open('Data/water_quality_units.json', 'w') as fp:
    json.dump(wq_units, fp)

Write the updated Water Quality DataFrame to disk.

In [19]:
new_df.to_csv('Data/SFBayWaterQualityPlus.csv', index=False)

```
Date,Station Number,Depth,DateTime,Time,Distance from 36,Chlorophyll a/a+PHA,Fluorescence,Calculated Chlorophyll,Oxygen Electrode Output,Oxygen Saturation %,Calculated Oxygen,Calculated SPM,Measured Extinction Coefficient,Calculated Extinction Coefficient,Salinity,Temperature,Sigma-t,Nitrite,Nitrate + Nitrite,Ammonium,Phosphate,Silicate,Biovolume
1969-04-10 00:00:00,4,0.5,1969-04-10 16:15:00,1615,119.9,,,,,,,,,,0.3,13.1,,,,,,,
```