## Part 0. Setup Steps

- Create a repo on GitHub named `eds217-trypy-02`
- Clone the repo to create a version-controlled project
- Create some subfolder infrastructure (`nbs`, `data`, etc.)
- Create a new Python notebook.
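
The folder setup can also be scripted. This is a minimal sketch, assuming it is run from the root of the cloned project and that the subfolders are called `nbs` and `data`, as in the list above:

```python
from pathlib import Path

# Assumption: the current working directory is the root of the cloned repo.
project_root = Path.cwd()

# Folder names taken from the setup step above.
for name in ("nbs", "data"):
    # exist_ok=True makes the call safe to re-run.
    (project_root / name).mkdir(exist_ok=True)

# List the folders that now exist in the project root.
print(sorted(p.name for p in project_root.iterdir() if p.is_dir()))
```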

## Part 1. Real data

Explore this [data package](https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-arc.10341.5) from EDI, which contains a "Data file describing the biogeochemistry of samples collected at various sites near Toolik Lake, North Slope of Alaska". Familiarize yourself with the metadata (particularly, View full metadata > expand 'Data entities' to learn more about the variables in the dataset). 

**Citation:** Kling, G. 2016. Biogeochemistry data set for soil waters, streams, and lakes near Toolik on the North Slope of Alaska, 2011. ver 5. Environmental Data Initiative. https://doi.org/10.6073/pasta/362c8eeac5cad9a45288cf1b0d617ba7 

1. Download the CSV containing the Toolik biogeochemistry data
2. Take a look at it - how are missing values stored? Keep that in mind. 
3. Move the CSV into the `data` folder of your project
4. Create a new notebook document in VS Code and save it in your `nbs` folder as `toolik_chem.ipynb`
5. Import the `pandas` (as `pd`), `numpy` (as `np`), and `matplotlib.pyplot` (as `plt`) libraries into your first code cell.
6. Read in the data as `toolik_biochem`. Remember, you'll want to specify here how `NA` values are stored. Use the `na_values` argument in your `pd.read_csv()` call to do that.
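
To see what `na_values` does before using it on the real file, here is a small sketch on a hypothetical three-row CSV held in memory. It assumes missing values are written as a lone period, which you should confirm by looking at the downloaded file:

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the real file has many more rows and columns.
raw = io.StringIO("Site,pH\nA,7.1\nB,.\nC,6.8\n")

# Tell pandas that a lone "." marks a missing value.
demo = pd.read_csv(raw, na_values=".")

print(demo["pH"].isna().sum())  # number of missing pH values
print(demo["pH"].dtype)         # numeric dtype, because "." became NaN
```

Without `na_values`, the `.` would be read as text and the whole column would be stored as strings rather than numbers.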



Copy and paste this code into a cell to create a clean_names function:

```python
import re


def clean_names(df):
    """Convert CamelCase dataframe column names
    to snake_case and lowercase the string

    df: Dataframe

    Returns a new dataframe with updated columns
    """
    def snakecase(s):
        s = re.sub(
            # Find a lower case letter or number (group 1)
            # followed by an upper case letter (group 2):
            r'([a-z0-9])([A-Z])',
            # Replace with -
            # \1, the lower case letter or number,
            # _, an underscore, and
            # \2, the upper case letter:
            r'\1_\2',
            # Perform the search and replace in
            # the string s:
            s
        )
        s = re.sub(
            ' ',  # Find a space
            '_',  # Replace with an underscore
            s     # In the string
        ).lower()  # Convert to lower case
        return s

    # rename() returns a new dataframe and leaves df unchanged
    return df.rename(columns=snakecase)
```
Run the cell to create the function, which you can use to clean the column names of a dataframe.
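
To get a feel for what the two substitutions do, the sketch below applies them to a few single strings. The helper name `to_snake` and the header names are hypothetical, chosen only to show the behaviour:

```python
import re


def to_snake(name):
    # Same two substitutions as snakecase() above, applied to one string.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return re.sub(" ", "_", name).lower()


# Hypothetical header names, not necessarily those in the Toolik file.
for header in ["SiteName", "Water Temp", "pH", "DOC_uM"]:
    print(header, "->", to_snake(header))
```

A lower case letter directly followed by a capital always gets an underscore, so `pH` becomes `p_h`. Check `df.columns` after cleaning rather than guessing the new names.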

```python
import pandas as pd

# Read the data, treating "." as missing, then clean the column names
# with the clean_names() function defined above.
toolik_biochem = clean_names(
    pd.read_csv(
        '../data/2011_Kling_Akchem.csv',
        na_values="."))
```

7. Create a subset of the data that contains only observations from the "Toolik Inlet" site, and that only contains the variables (columns) for pH, dissolved organic carbon (DOC), and total dissolved nitrogen (TDN). Store this subset as `inlet_biochem`. Make sure to LOOK AT the subset you've created. 

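```python
# Keep only rows from the "Toolik Inlet" site
valid = toolik_biochem["site"] == "Toolik Inlet"

# Column names here assume the cleaned headers are 'ph', 'doc_um', and
# 'tdn_um'; check toolik_biochem.columns and adjust if yours differ.
inlet_biochem = toolik_biochem[valid][['ph', 'doc_um', 'tdn_um']]

# Look at the result
inlet_biochem.head()
```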
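
The same kind of row-and-column selection can also be written in one step with `.loc`. This is a sketch on hypothetical toy data, not the Toolik file:

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned table.
toy = pd.DataFrame({
    "site": ["Toolik Inlet", "Other Site", "Toolik Inlet"],
    "ph": [7.2, 6.9, 7.4],
    "depth": [0.1, 0.5, 0.1],
})

# Row condition first, then the list of columns to keep.
subset = toy.loc[toy["site"] == "Toolik Inlet", ["ph"]]
print(subset)
```

Using `.loc` with both a row mask and a column list avoids chained indexing, which matters if you later assign values into the subset.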

8. Find the mean value of each column in `inlet_biochem` 2 different ways: 

a. Write a for loop from scratch to calculate the mean for each column
b. Use *one other method* (e.g. `.mean()`, or `.apply()`) to find the mean for each column.

```python
import numpy as np

# Strategy a:
print("Using for loop:")
for col in inlet_biochem.columns:
    mean_val = np.nanmean(inlet_biochem[col])
    print(f"col {col}: {mean_val:.2f}")

# Strategy b:
print("Using list comprehension:")
col_means = [np.nanmean(inlet_biochem[col]) for col in inlet_biochem.columns]
for col, mean_val in zip(inlet_biochem.columns, col_means):
    print(f"col {col}: {mean_val:.2f}")

# Strategy c (df.mean() skips missing values by default):
print("Using df.mean()")
print(inlet_biochem.mean())

# Strategy d:
print("Using .apply()")
print(inlet_biochem.apply(np.nanmean))
```
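
If "from scratch" is taken to mean not calling any mean function at all, the loop can accumulate a sum and a count itself. This is a minimal sketch on hypothetical toy data, skipping missing values by hand:

```python
import math

import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "a": [1.0, float("nan"), 3.0],
    "b": [10.0, 20.0, float("nan")],
})

means = {}
for col in toy.columns:
    total, count = 0.0, 0
    for value in toy[col]:
        if not math.isnan(value):  # skip missing values
            total += value
            count += 1
    # Guard against a column that is entirely missing.
    means[col] = total / count if count else float("nan")

print(means)
```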

### Save, stage, commit, pull, push!

## END activities