---
title: Back Bay National Wildlife Refuge
jupyter: python3
---

> Back Bay National Wildlife Refuge is located in the southeastern corner of the City of Virginia Beach. The refuge was established in 1938 to protect and provide habitat for migrating and wintering waterfowl. Diverse habitats, including beachfront, freshwater marsh, dunes, shrub-scrub and upland forest are home to hundreds of species of birds, reptiles, amphibians, mammals and fish.

![BBNWR](https://www.fws.gov/sites/default/files/styles/banner_image_xl/public/banner_images/2020-09/waterfowl%20%28tundras%29.jpg?h=0c8d0f81&itok=NcZlpD27)


For an introduction to the refuge and its history, view the following interactive story map.

[BBNWR History and Introduction](https://storymaps.arcgis.com/stories/960d9db38cca4f3d8d38111119b9874f)

Additionally, here is some drone footage of the refuge, which gives a better look at the geography and ecology of the area.

[BBNWR Drone Footage](https://www.youtube.com/watch?v=NlW330aBTCc)

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sb

In [None]:
bbnwr = pd.read_csv("./BKB_WaterQualityData_2020084.csv")
bbnwr["Site_Id"] = bbnwr["Site_Id"].replace("d", "D")
bbnwr.columns

## Question 1


### Part (a)

Plot the distribution of "Water Depth (m)". Comment on this distribution. What do you notice? Calculate a numerical summary that expresses this main feature of the plot.

In [None]:
# answer 

### Part (b)

In this problem we will treat the values of "Water Depth (m)" as a population. Pull out this column by itself into a variable (`waterdepth`). Remove all missing values.

For this population, compute the mean and standard deviation.

In [None]:
# population mean and std dev

Using theoretical results from class, for a sample size of $n = 20$ units, calculate the *theoretical* mean and standard error of a sample mean of water depths using a simple random sample. Print out the values.

In [None]:
# theoretical mean and std error of sample mean
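
As a hedged illustration of the theoretical results referenced above: for a simple random sample of size $n$, the sample mean has expected value equal to the population mean and standard error $\sigma/\sqrt{n}$. The sketch below uses a synthetic right-skewed population (an assumption, not the refuge data), and the names `population`, `theo_mean`, and `theo_sem` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in population: synthetic right-skewed values, NOT the refuge data.
rng = np.random.default_rng(0)
population = pd.Series(rng.exponential(scale=0.8, size=5000), name="depth")

n = 20
pop_mean = population.mean()
pop_sd = population.std(ddof=0)   # population SD: divide by N, not N - 1

theo_mean = pop_mean              # E[X-bar] equals the population mean
theo_sem = pop_sd / np.sqrt(n)    # SD[X-bar] = sigma / sqrt(n)

print(f"theoretical mean of the sample mean: {theo_mean:.4f}")
print(f"theoretical standard error (n={n}): {theo_sem:.4f}")
```

Note the `ddof=0` argument: pandas defaults to the sample standard deviation (`ddof=1`), while treating the column as a population calls for dividing by $N$.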

### Part (c)

Simulate drawing a sample of 20 units and computing its sample mean. Repeat this simulation 1000 times. So that our "population" behaves as if it were sufficiently large, sample with replacement by passing the `replace = True` option to the `sample` method.

Using the results, calculate the **empirical** mean and standard error of the mean. We would not expect them to be exactly the same as the values calculated in the previous part, but they should be close.

In [None]:
# simulation
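
One possible shape for such a simulation is sketched below, again on synthetic data rather than the water depth column. The helper `simulate_sample_means` and its parameters are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "population" (an assumption; not the refuge data).
rng = np.random.default_rng(1)
population = pd.Series(rng.exponential(scale=0.8, size=5000))

def simulate_sample_means(pop, n, reps=1000, seed=0):
    """Draw `reps` samples of size `n` with replacement and return their means."""
    state = np.random.RandomState(seed)  # shared state so successive draws differ
    means = [pop.sample(n=n, replace=True, random_state=state).mean()
             for _ in range(reps)]
    return pd.Series(means)

means_20 = simulate_sample_means(population, n=20)

empirical_mean = means_20.mean()  # should be close to the population mean
empirical_sem = means_20.std()    # should be close to sigma / sqrt(20)

print(f"empirical mean: {empirical_mean:.4f}")
print(f"empirical SEM:  {empirical_sem:.4f}")
```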

### Part (d)

Repeat Part (c) for a sample size of 1000. Graph the two distributions (use two code cells so that they get separate plots). What do you notice? Look carefully at the scales and at the shape of each distribution.

In [None]:
# Simulation 

### Part (e)

According to the Gaussian empirical rule, about 68% of observations should fall within one standard error of the theoretical mean for $\bar X$.

For the two simulations, compute the empirical probability of a sample mean falling within one (theoretical) SEM of the (theoretical) mean of $\bar X$. (Note: you will need to calculate the SEM for the sample mean of samples of size $n = 1000$.)

In [None]:
# empirical probability of a sample mean falling within one SEM of the mean
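
The coverage calculation can be sketched as follows, under stated assumptions: the population is synthetic, and NumPy's `Generator.choice` is swapped in for the pandas `sample` method purely for speed at $n = 1000$. The function name `coverage_within_one_sem` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=0.8, size=5000)  # synthetic stand-in, not the refuge data

mu = population.mean()
sigma = population.std()  # NumPy's default ddof=0 gives the population SD

def coverage_within_one_sem(pop, n, reps=1000):
    """Proportion of simulated sample means within one theoretical SEM of mu."""
    sem = sigma / np.sqrt(n)                                  # theoretical SEM for this n
    samples = rng.choice(pop, size=(reps, n), replace=True)   # one sample per row
    means = samples.mean(axis=1)
    return np.mean(np.abs(means - mu) <= sem)

coverage = {n: coverage_within_one_sem(population, n) for n in (20, 1000)}
for n, prop in coverage.items():
    print(f"n = {n}: proportion within one SEM = {prop:.3f}")
```

If the normal approximation holds, each proportion should land near 0.68; a noticeable departure at small $n$ is one sign that the sampling distribution is still skewed.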

Comment on what you see. In particular, do you think the Central Limit Theorem would apply to a sample of 20 for this population? What about a sample of size 1000?

Double click to edit for your answer.

## Question 2

### Part (a)
As you probably noticed, the water depth measure is rather skewed. Select a transformation that minimizes the coefficient of skewness. Implement this and save it as `water_xform`.

In [None]:
# transformation
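
One way to choose among transformations is to compute the skewness coefficient for several candidates and keep the one closest to zero. The sketch below does this on synthetic, strictly positive data; the candidate list is an assumption, and a log transform would need extra care if the real column contained zeros.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive, right-skewed values (not the refuge data).
rng = np.random.default_rng(3)
depth = pd.Series(rng.lognormal(mean=-0.5, sigma=0.6, size=5000))

# Hypothetical candidate transformations to compare.
candidates = {
    "identity": depth,
    "sqrt": np.sqrt(depth),
    "cube root": np.cbrt(depth),
    "log": np.log(depth),
}

# Series.skew() returns the sample skewness coefficient.
skews = pd.Series({name: values.skew() for name, values in candidates.items()})
best = skews.abs().idxmin()  # transformation whose skewness is closest to zero

print(skews.round(3))
print("least skewed:", best)
```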

### Part (b)

Repeat the two simulations ($n = 20$, $n = 1000$) using the transformed water depth (`water_xform`) and compute the proportion of sample means that fall within one SEM of the mean. Note that you will need to recalculate the theoretical mean and standard error based on the transformed values.

In [None]:
# simulations

### Part (c)

Comment on the results of Question 1 and Question 2. What do we see about the quality of the Central Limit Theorem approximation in these two cases?

Double click to edit for your answer.

## Question 3

### Part (a)

Create a table that has only the log of "Water Depth (m)" and "Year" as values. Limit the table to years after 1980. Drop any rows with missing values. Call this table `water_year`.

Plot these values in a scatter plot.

In [None]:
# scatter plot
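
The table construction described above (filter by year, take the log, drop missing rows) might look like the following sketch. The rows are made up, and `toy`, `water_year_toy`, and the column name `log_depth` are hypothetical; only "Year" and "Water Depth (m)" come from the prompt.

```python
import numpy as np
import pandas as pd

# Made-up rows using the column names from the prompt.
toy = pd.DataFrame({
    "Year": [1975, 1985, 1990, 1995, 2000, 2005],
    "Water Depth (m)": [0.5, 0.9, np.nan, 1.2, 0.4, 0.7],
})

recent = toy[toy["Year"] > 1980]  # keep only years after 1980

water_year_toy = pd.DataFrame({
    "Year": recent["Year"],
    "log_depth": np.log(recent["Water Depth (m)"]),  # NaN stays NaN under log
}).dropna()                                          # drop rows with missing values

print(water_year_toy)
```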

### Part (b)

Treating `water_year` as a population, take samples of size $n = 100$ (using `replace = True` again). For each, calculate the correlation of log water depth and year.

Compare the average correlation coefficient with the "population" correlation coefficient of the `water_year` table. Evaluate the standard deviation of the sampling distribution and compare it to the theoretical standard error ($1/\sqrt{n}$).

In [None]:
# simulation and comparison
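
A minimal sketch of this resampling comparison is shown below, assuming a synthetic two-column population with a weak linear trend in place of `water_year`. The names `pop`, `sample_corr`, and `log_depth` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic "population": weak linear trend plus noise (not the refuge data).
rng = np.random.default_rng(4)
N = 5000
year = rng.integers(1981, 2021, size=N)  # years 1981 through 2020
log_depth = 0.01 * (year - 2000) + rng.normal(0, 0.5, size=N)
pop = pd.DataFrame({"Year": year, "log_depth": log_depth})

pop_corr = pop["Year"].corr(pop["log_depth"])  # "population" Pearson correlation

def sample_corr(df, n, state):
    """Correlation of Year and log_depth in one sample of n rows, with replacement."""
    s = df.sample(n=n, replace=True, random_state=state)
    return s["Year"].corr(s["log_depth"])

n, reps = 100, 1000
state = np.random.RandomState(0)  # shared state so successive draws differ
corrs = pd.Series([sample_corr(pop, n, state) for _ in range(reps)])

print(f"population correlation:   {pop_corr:.3f}")
print(f"mean sample correlation:  {corrs.mean():.3f}")
print(f"SD of sample correlation: {corrs.std():.3f}")
print(f"1 / sqrt(n):              {1 / np.sqrt(n):.3f}")
```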