# Mini project 2: primary productivity in coastal waters

In this project you're again given a dataset and some questions. The data for this project come from the [EPA's National Aquatic Resource Surveys](https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys), and in particular the National Coastal Condition Assessment (NCCA); broadly, you'll do an exploratory analysis of primary productivity in coastal waters.

By way of background, chlorophyll a is often used as a proxy for [primary productivity in marine ecosystems](https://en.wikipedia.org/wiki/Marine_primary_production); primary producers are important because they are at the base of the food web. Nitrogen and phosphorus are key nutrients that stimulate primary production.

In the data folder you'll find water chemistry data, site information, and metadata files. It might be helpful to keep the metadata files open when tidying up the data for analysis. It might also be helpful to keep in mind that these datasets contain a considerable amount of information, not all of which is relevant to answering the questions of interest. Notice that the questions pertain somewhat narrowly to just a few variables. It's recommended that you determine which variables might be useful and drop the rest.

As in the first mini project, there are accurate answers to each question that are mutually consistent with the data, but there aren't uniquely correct answers. You will likely notice that you have even more latitude in this project than in the first, as the questions are slightly broader. Since we've been emphasizing visual and exploratory techniques in class, you are encouraged (but not required) to support your answers with graphics.

The broader goal of these mini projects is to cultivate your problem-solving ability in an unstructured setting. Your work will be evaluated based on the following:
- choice of method(s) used to answer questions;
- clarity of presentation;
- code style and documentation.

Please write up your results separately from your codes; codes should be included at the end of the notebook.

---

## Part 1: dataset

Merge the site information with the chemistry data and tidy it up. Determine which columns to keep based on what you use in answering the questions in part 2; then, print the first few rows here (but *do not include your codes used in tidying the data*) and write a brief description (1-2 paragraphs) of the dataset conveying what you take to be the key attributes. Direct your description to a reader unfamiliar with the data; ensure that in your data preview the columns are named intelligibly.

*Suggestion*: export your cleaned data as a separate `.csv` file and read that directly in below, as in: `pd.read_csv('YOUR DATA FILE').head()`.
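
As a rough sketch of that round trip, the example below writes a small made-up table to a `.csv` file and reads it back. The column names, values, and the file name `ncca_clean_example.csv` are placeholders for illustration, not part of the NCCA data.

```python
import os
import tempfile

import pandas as pd

# toy stand-in for a tidied dataset; names and values are made up
clean = pd.DataFrame({
    'site_id': ['A-1', 'A-2', 'B-1'],
    'region': ['West', 'West', 'Gulf'],
    'chlorophyll': [3.3, 2.5, 9.8],
})

# write without the row index so it does not come back as an 'Unnamed: 0' column
out_path = os.path.join(tempfile.mkdtemp(), 'ncca_clean_example.csv')
clean.to_csv(out_path, index = False)

# read the file back and preview the first few rows
preview = pd.read_csv(out_path).head()
print(preview)
```

Passing `index = False` is a design choice: it keeps the preview limited to the named columns.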

In [92]:
# show a few rows of clean data


*Write your description here.*

## Part 2: exploratory analysis

Answer each question below and provide a visualization supporting your answer. Describe and interpret each visualization you include.

*Comment:* you can either designate your plots in the codes section with clear names and reference them in your answers; or you can export your plots as image files and display them in markdown cells.

### What is the apparent relationship between nutrient availability and productivity?

*Comment*: it's fine to examine each nutrient -- nitrogen and phosphorus -- separately, but do consider whether they might be related to each other.
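
One possible numeric companion to a scatterplot is a correlation matrix on the log scale. The sketch below uses made-up concentrations and hypothetical column names (`nitrogen`, `phosphorus`, `chlorophyll`); it is meant only to show the mechanics, not to suggest what the NCCA data will show.

```python
import numpy as np
import pandas as pd

# made-up concentrations for illustration only
toy = pd.DataFrame({
    'nitrogen': [0.2, 0.4, 0.8, 1.6, 3.2],
    'phosphorus': [0.02, 0.03, 0.07, 0.10, 0.25],
    'chlorophyll': [1.0, 2.5, 4.0, 9.0, 20.0],
})

# keep rows where every value is strictly positive so the log is defined
positive = toy[(toy > 0).all(axis = 1)]

# pairwise Pearson correlations after a log10 transform
log_corr = np.log10(positive).corr()
print(log_corr.round(2))
```

If the two nutrients turn out to be strongly correlated with each other, it becomes harder to attribute a pattern in productivity to either one alone.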

*Write your answer here.*

### Are there any notable differences in available nutrients among U.S. coastal regions?

*Write your answer here.*

### Based on the 2010 data, does productivity seem to vary geographically in some way? 

If so, explain how; if not, explain what options you considered and ruled out.
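
One option to consider is summarizing productivity within latitude bands. The sketch below does this with `pd.cut` on a made-up table; the column names, the values, and the choice of three equal-width bands are all assumptions for illustration.

```python
import pandas as pd

# made-up site latitudes and chlorophyll values
toy = pd.DataFrame({
    'lat': [27.5, 29.0, 33.1, 36.4, 41.2, 44.8],
    'chlorophyll': [12.0, 9.5, 6.0, 4.0, 3.0, 2.0],
})

# split the latitude range into three equal-width bands
toy['lat_band'] = pd.cut(toy['lat'], bins = 3)

# number of sites and median chlorophyll within each band
summary = toy.groupby('lat_band', observed = True)['chlorophyll'].agg(['count', 'median'])
print(summary)
```

The median is used here rather than the mean because chlorophyll measurements tend to be right-skewed; other groupings (region, state, longitude) can be swapped in the same way.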

*Write your answer here.*

### How does primary productivity in California coastal waters change seasonally in 2010, if at all?

Does your result make intuitive sense?
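
A minimal sketch of a monthly summary follows, assuming collection dates are stored as month/day/year strings such as `7/1/2010`. The table is made up, and the column names are placeholders. Parsing the dates yields integer months, which sort in calendar order (text months would place `10` before `5`).

```python
import pandas as pd

# made-up samples; only the date format mirrors the raw data
toy = pd.DataFrame({
    'state': ['CA', 'CA', 'CA', 'CA', 'OR'],
    'date_col': ['7/1/2010', '7/15/2010', '8/3/2010', '10/2/2010', '8/9/2010'],
    'chlorophyll': [3.0, 5.0, 6.0, 8.0, 2.0],
})

# parse the date strings and extract the month as an integer
toy['date'] = pd.to_datetime(toy['date_col'], format = '%m/%d/%Y')
toy['month'] = toy['date'].dt.month

# median chlorophyll per month for California samples only
ca_monthly = toy[toy['state'] == 'CA'].groupby('month')['chlorophyll'].median()
print(ca_monthly)
```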

*Write your answer here.*

### Pose and answer one additional question.

*Write your answer here.*

---

# Codes

In [1]:
import pandas as pd
import numpy as np
import altair as alt

ncca_raw = pd.read_csv('assessed_ncca2010_waterchem.csv')
ncca_sites = pd.read_csv('assessed_ncca2010_siteinfo.csv')

In [4]:
ncca_sites.PROVINCE

0       Californian Province
1       Californian Province
2       Californian Province
3       Californian Province
4        Carolinian Province
                ...         
1099    Great Lakes Province
1100    Great Lakes Province
1101    Louisianian Province
1102    Louisianian Province
1103    Great Lakes Province
Name: PROVINCE, Length: 1104, dtype: object

In [31]:
# select site and date info along with chemistry data
ncca_subset = ncca_raw.iloc[:, [0, 1, 2, 3, 5, 7]].pivot(
    index = ['UID', 'SITE_ID', 'STATE', 'DATE_COL'],
    columns = 'PARAMETER',
    values = 'RESULT'
).reset_index()

# select waterbody, region, and lat/long from site data
site_subset = ncca_sites.loc[:, ['UID', 'WTBDY_NM', 'NCCR_REG', 'ALAT_DD', 'ALON_DD']]

# merge site info with chemistry data
ncca = pd.merge(ncca_subset, site_subset, how = 'left', on = 'UID')

# lowercase column names
ncca.columns = ncca.columns.str.lower()

# split dates into month, day, year
ncca_dates = ncca.date_col.str.split(
    pat = '/', 
    n = 3, 
    expand = True
).rename(
    columns = {0: 'month', 
               1: 'day', 
               2: 'year'}
)

# append
ncca = pd.concat([ncca, ncca_dates], axis = 1)

# filter to chemical parameters missing in fewer than 1% of instances
ncca = ncca.loc[:, ncca.isna().mean() < 0.01]

# rename some columns
ncca.rename(columns = {'alat_dd': 'lat', 'alon_dd': 'lon', 'wtbdy_nm': 'waterbody', 'nccr_reg': 'region'}, inplace = True)

# preview
ncca.head()

Unnamed: 0,uid,site_id,state,date_col,chla,din,nh3,no3no2,ntl,ptl,srp,waterbody,region,lat,lon,month,day,year
0,59,NCCA10-1111,CA,7/1/2010,3.34,0.014,0.0,0.014,0.4075,0.061254,0.028,Mission Bay,West,32.77361,-117.21471,7,1,2010
1,60,NCCA10-1119,CA,7/1/2010,2.45,0.02,0.01,0.01,0.23,0.037379,0.026,San Diego Bay,West,32.71424,-117.23527,7,1,2010
2,61,NCCA10-1123,CA,7/1/2010,3.82,0.009,0.0,0.009,0.33625,0.0481,0.03,Mission Bay,West,32.78372,-117.22132,7,1,2010
3,62,NCCA10-1127,CA,7/1/2010,6.13,0.01,0.0,0.01,0.23875,0.044251,0.028,San Diego Bay,West,32.72245,-117.20443,7,1,2010
4,63,NCCA10-1133,NC,6/9/2010,9.79,0.03,0.002,0.028,0.6325,0.090636,0.043,White Oak River,Southeast,34.75098,-77.12117,6,9,2010
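
The long-to-wide reshaping and the missingness filter used in the cell above can be seen in isolation on a toy table. This is only a sketch: the values are made up, and the 30% threshold is chosen so that the small example drops a column (the cell above uses 1%).

```python
import pandas as pd

# toy long-format chemistry table: one row per site and parameter
long = pd.DataFrame({
    'UID': [1, 1, 2, 2, 3],
    'PARAMETER': ['CHLA', 'PTL', 'CHLA', 'PTL', 'CHLA'],
    'RESULT': [3.3, 0.06, 2.5, 0.04, 6.1],
})

# one row per site, one column per parameter; absent measurements become NaN
wide = long.pivot(index = 'UID', columns = 'PARAMETER', values = 'RESULT').reset_index()

# proportion of missing values in each column
missing_rate = wide.isna().mean()

# keep only columns missing in fewer than 30% of rows
kept = wide.loc[:, missing_rate < 0.3]
print(kept)
```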


In [66]:
# relationship between total phosphorus and chlorophyll
alt.Chart(ncca).transform_filter(
    alt.FieldGTPredicate(field = 'chla', gt = 0)
).transform_filter(
    alt.FieldGTPredicate(field = 'ptl', gt = 0)
).encode(
    x = alt.X('ptl', scale = alt.Scale(type = 'log')),
    y = alt.Y('chla', scale = alt.Scale(type = 'log'))
).mark_point()

In [65]:
# relationship between total nitrogen and chlorophyll
alt.Chart(ncca).transform_filter(
    alt.FieldGTPredicate(field = 'chla', gt = 0)
).transform_filter(
    alt.FieldGTPredicate(field = 'ntl', gt = 0)
).encode(
    x = alt.X('ntl', scale = alt.Scale(type = 'log')),
    y = alt.Y('chla', scale = alt.Scale(type = 'log'))
).mark_point()

In [62]:
# relationship between total nitrogen and total phosphorus
alt.Chart(ncca).transform_filter(
    alt.FieldGTPredicate(field = 'ptl', gt = 0)
).transform_filter(
    alt.FieldGTPredicate(field = 'ntl', gt = 0)
).encode(
    x = alt.X('ntl', scale = alt.Scale(type = 'log')),
    y = alt.Y('ptl', scale = alt.Scale(type = 'log'))
).mark_point()

In [59]:
# one way to look at geographic variation -- more skewness at lower latitudes suggests higher productivity there
alt.Chart(ncca).transform_bin(
    as_ = 'binned latitude',
    field = 'lat',
    bin = alt.Bin(maxbins = 3)
).transform_density(
    density = 'chla', 
    as_ = ['Chlorophyll', 'Estimated Density'], 
    groupby = ['binned latitude'],
    bandwidth = 2, 
    extent = [0, 25],
    steps = 1000 
).mark_line().encode(
    x = 'Chlorophyll:Q',
    y = 'Estimated Density:Q',
    color = 'binned latitude:Q'
)

In [64]:
# another way -- using regions
alt.Chart(ncca).transform_density(
    density = 'chla', 
    as_ = ['Chlorophyll', 'Estimated Density'], 
    groupby = ['region'],
    bandwidth = 2, 
    extent = [0, 25],
    steps = 1000 
).mark_line().encode(
    x = 'Chlorophyll:Q',
    y = 'Estimated Density:Q',
    color = 'region'
)

In [62]:
# interestingly, region seems to partly account for the clustering
alt.Chart(ncca).transform_filter(
    alt.FieldGTPredicate(field = 'ptl', gt = 0)
).transform_filter(
    alt.FieldGTPredicate(field = 'ntl', gt = 0)
).encode(
    x = alt.X('ntl', scale = alt.Scale(type = 'log')),
    y = alt.Y('ptl', scale = alt.Scale(type = 'log')),
    color = 'region' # color by region so the clusters can be compared across regions
).mark_point()

In [80]:
# most variable states in each region by month
ncca.groupby(
    ['state', 'region', 'month']
).var().sort_values(
    by = ['chla'], 
    ascending = False
).reset_index().groupby(
    ['region', 'month']
).head(1).sort_values(
    by = 'month'
).iloc[:, 0:3].pivot(
    index = 'region',
    values = 'state',
    columns = 'month'
).iloc[:, 2:6]

region,6,7,8,9
Great Lakes,OH,OH,WI,IN
Gulf,LA,AL,FL,LA
Northeast,NH,NJ,VA,NJ
Southeast,FL,NC,GA,
West,OR,WA,OR,OR


In [82]:
# states with the highest mean chlorophyll in each region by month
ncca.groupby(
    ['state', 'region', 'month']
).mean().sort_values(
    by = ['chla'], 
    ascending = False
).reset_index().groupby(
    ['region', 'month']
).head(1).sort_values(
    by = 'month'
).iloc[:, 0:3].pivot(
    index = 'region',
    values = 'state',
    columns = 'month'
)

region,10,5,6,7,8,9
Great Lakes,,MI,OH,OH,WI,IN
Gulf,TX,,LA,MS,LA,LA
Northeast,MD,,VA,NJ,VA,NJ
Southeast,,,FL,NC,GA,
West,,,WA,OR,OR,OR


In [90]:
# in CA, higher productivity in later months
alt.Chart(ncca).transform_filter(
    alt.FieldEqualPredicate(field = 'state', equal = 'CA')
).transform_density(
    density = 'chla', 
    as_ = ['Chlorophyll', 'Estimated Density'], 
    groupby = ['month'],
    bandwidth = 3.5, 
    extent = [0, 20],
    steps = 1000 
).mark_line().encode(
    x = 'Chlorophyll:Q',
    y = 'Estimated Density:Q',
    color = 'month:Q'
)