# SnowGW Exploratory Data Analysis (EDA)

---

### Preliminary visualization

Time series plots of all variables

In [2]:
import numpy as np
import pandas as pd

# scipy statistics package
import scipy.stats as st

# for plotting
import matplotlib.pyplot as plt
# tell jupyter to show our plots in the notebook here
%matplotlib inline

Load the CSV file

In [3]:
data = pd.read_csv('data/mashupdata.csv')

data.head(3)

| Unnamed: 0 | prcp | et | disch | swe | stn_swe | gw | month | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 334.104779 | 4.4299 | 153.008225 | 65.746676 | 11.374194 | 63.449355 | 1 | 2008 |
| 1 | 168.820151 | 6.240035 | 110.803206 | 56.775266 | 20.47931 | 63.017931 | 2 | 2008 |
| 2 | 246.50977 | 18.118176 | 139.330672 | 33.609042 | 21.335484 | 62.409355 | 3 | 2008 |
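
The `month` and `year` columns in the preview above can be combined into a datetime index, which makes time series plots of every variable easier. A minimal sketch, assuming each row is one calendar month; the `sample` frame just retypes a few columns of the three previewed rows, and the `date` column name is an assumption:

```python
import pandas as pd

# A few columns of the three rows shown by data.head(3), retyped for illustration
sample = pd.DataFrame({
    "prcp": [334.104779, 168.820151, 246.509770],
    "disch": [153.008225, 110.803206, 139.330672],
    "swe": [65.746676, 56.775266, 33.609042],
    "month": [1, 2, 3],
    "year": [2008, 2008, 2008],
})

# pd.to_datetime accepts a frame with year/month/day columns;
# day=1 is an assumed placeholder because the data are monthly
sample["date"] = pd.to_datetime(sample[["year", "month"]].assign(day=1))
sample = sample.set_index("date")

print(sample.index)
```

With a datetime index in place, `sample.plot(subplots=True)` would draw one panel per variable on a shared time axis.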


In [None]:
fig, ax = plt.subplots()

data.plot(x='years', y='SLI_max', c='b', ax=ax, label='Slide Canyon')
data.plot(x='years', y='BLC_max', c='r', ax=ax, label='Blue Canyon')

ax.set_title('Timeline of Peak Snow Water Equivalent (SWE)')
ax.set_xlabel('Water Year')
ax.set_ylabel('Peak SWE (mm)');
plt.legend(loc="best")

Hydrograph/Hyetograph

In [None]:
# From Nina's code

fig, ax1 = plt.subplots(figsize=(12,8))

color = 'teal'
ax1.set_xlabel('Date')  # the x-axis is a datetime column, not seconds
ax1.set_ylabel('swe', color=color)
ax1.plot(swe['datetime'], swe['687_OR_SNTL'], color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('precip', color=color)  # we already handled the x-label with ax1
# Precipitation is negated so the bars plot as negative values on the second axis
ax2.bar(swe['datetime'], -swe['687_OR_SNTL_daymetPR'], color=color, alpha=0.6)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()  # otherwise the right y-label is slightly clipped
plt.show()

Other time series comparisons?

---

### Questions

**How correlated are Daymet and SNOTEL SWE?**
* Scatterplot Daymet ~ SNOTEL
* Estimate correlation coefficient
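
A correlation coefficient could be estimated along these lines. This is only a sketch on synthetic numbers: the `snotel` and `daymet` arrays are made-up stand-ins, not the project data, and reporting Spearman alongside Pearson is a suggestion rather than part of the original plan.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for the two SWE series (assumed, not real data)
rng = np.random.default_rng(0)
snotel = rng.gamma(shape=2.0, scale=50.0, size=120)
daymet = 0.8 * snotel + rng.normal(0.0, 20.0, size=120)

# Pearson measures linear association; Spearman is rank-based and
# less sensitive to skew and outliers, which are common in SWE data
r, p = st.pearsonr(snotel, daymet)
rho, p_s = st.spearmanr(snotel, daymet)

print('Pearson r = {:.3f}, Spearman rho = {:.3f}'.format(r, rho))
```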

**How do Daymet and SNOTEL distributions compare? Are there aggregation differences (e.g., Daymet smoother than SNOTEL, SNOTEL greater in magnitude)?**
* Histogram or density plots
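
To compare the two distributions on an equal footing, the histograms should share the same bin edges. The sketch below does that with synthetic stand-in arrays (assumed, not the project data) and adds a two-sample Kolmogorov-Smirnov test as one possible numeric complement to the plots; that test is a suggestion, not part of the original plan.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins with different spreads (assumed, not real data)
rng = np.random.default_rng(1)
snotel = rng.gamma(shape=2.0, scale=60.0, size=200)
daymet = rng.gamma(shape=4.0, scale=25.0, size=200)

# Shared bin edges so the two densities are directly comparable
bins = np.linspace(0.0, max(snotel.max(), daymet.max()), 21)
snotel_density, _ = np.histogram(snotel, bins=bins, density=True)
daymet_density, _ = np.histogram(daymet, bins=bins, density=True)

# Two-sample KS test: large statistic / small p suggests different distributions
ks_stat, ks_p = st.ks_2samp(snotel, daymet)

print('std ratio (SNOTEL / Daymet) = {:.2f}'.format(snotel.std() / daymet.std()))
print('KS statistic = {:.3f}, p = {:.3g}'.format(ks_stat, ks_p))
```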

**What predictors is discharge most correlated with?**
* Multiple linear regression of discharge on water storage, GW levels, SWE, and precipitation
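
`st.linregress` handles only one predictor, so a multiple regression needs a different tool. One option is ordinary least squares via `np.linalg.lstsq`, sketched here on synthetic data. The predictor arrays and `true_coefs` are invented for illustration, and water storage is left out only to keep the example short.

```python
import numpy as np

# Synthetic predictors and response (assumed values, not the project data)
rng = np.random.default_rng(2)
n = 120
swe = rng.normal(0.0, 10.0, n)
gw = rng.normal(0.0, 10.0, n)
prcp = rng.normal(0.0, 10.0, n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), swe, gw, prcp])
true_coefs = np.array([5.0, 0.5, -0.3, 0.2])  # hypothetical
disch = X @ true_coefs + rng.normal(0.0, 1.0, n)

# Ordinary least squares fit
coefs, _, rank, _ = np.linalg.lstsq(X, disch, rcond=None)

# Coefficient of determination for the fitted model
fitted = X @ coefs
ss_res = np.sum((disch - fitted) ** 2)
ss_tot = np.sum((disch - disch.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print('intercept, swe, gw, prcp =', np.round(coefs, 3))
print('R-squared = {:.3f}'.format(r_squared))
```

Raw coefficients depend on each predictor's units, so standardizing the predictors first would make their magnitudes easier to compare.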

**What are the lags between water storage, SWE, and discharge? How do we interpret them?**
* Visualizations; differences between seasonal peaks
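
One way to put a number on a lag is to correlate one series against shifted copies of the other and pick the shift with the highest correlation. The sketch below uses idealized, noise-free monthly sinusoids with an assumed 3-month lag; `lagged_corr` is a hypothetical helper, not an existing library function.

```python
import numpy as np

# Idealized monthly series: discharge follows SWE by an assumed 3 months
months = np.arange(120)
true_lag = 3
swe = np.sin(2 * np.pi * months / 12)
disch = np.sin(2 * np.pi * (months - true_lag) / 12)

def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag] for lag >= 0."""
    if lag == 0:
        return np.corrcoef(x, y)[0, 1]
    return np.corrcoef(x[:-lag], y[lag:])[0, 1]

# Try lags of 0 to 11 months and keep the one with the highest correlation
lags = np.arange(0, 12)
corrs = np.array([lagged_corr(swe, disch, k) for k in lags])
best_lag = int(lags[np.argmax(corrs)])

print('best lag = {} months'.format(best_lag))
```

Real series are noisy and strongly seasonal, so the peak is usually broader than in this example and is best read alongside the time series plots.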

----------------

<div class="alert alert-success" style="font-size:100%">
<b style="font-size:120%">Scatterplot</b><br>
</div>

In [None]:
fig, ax = plt.subplots(figsize=(4,4))

# Scatterplot
data.plot.scatter(x='SLI_max', y='BLC_max', c='k', ax=ax);

ax.set_xlabel('Slide Canyon max SWE (mm)')
ax.set_ylabel('Blue Canyon max SWE (mm)');

----------------

<div class="alert alert-success" style="font-size:100%">
<b style="font-size:120%">Distributions</b><br>
</div>

In [None]:
# From Nina's code
dif = swe['dif'][swe['dif']!=0]
dif.hist(bins=120, color = 'b')

In [None]:
# From Nina's code
import seaborn as sns
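# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
ax = sns.histplot(dif, kde=True)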

In [None]:
# From Nina's code

# Use the nonzero differences computed above as the sample
samples = dif

# Compute a histogram of the sample
bins = np.linspace(-5, 5, 30)
histogram, bins = np.histogram(samples, bins=bins, density=True)

bin_centers = 0.5*(bins[1:] + bins[:-1])

# Evaluate the standard normal PDF (mean 0, std 1) at the bin centers,
# using the scipy.stats module already imported as `st`
pdf = st.norm.pdf(bin_centers)

plt.figure(figsize=(6, 4))
plt.plot(bin_centers, histogram, label="Histogram of samples")
plt.plot(bin_centers, pdf, label="PDF")
plt.legend()
plt.show()

----------------

<div class="alert alert-success" style="font-size:100%">
<b style="font-size:120%">Regression</b><br>
</div>

**Linear regression**: regression of discharge on water storage, GW levels, SWE, and precipitation

In [None]:
st.linregress?

In [None]:
# use the linear regression function
slope, intercept, rvalue, pvalue, stderr = st.linregress(data.SLI_max, data.BLC_max)

Plot the result

In [None]:
fig, ax = plt.subplots(figsize=(4,4))

# Scatterplot
data.plot.scatter(x='SLI_max', y='BLC_max', c='k', ax=ax);

# Create points for the regression line
x = np.linspace(data.SLI_max.min(), data.SLI_max.max(), data.SLI_max.size) # x coordinates from min and max values of SLI_max
y = slope * x + intercept # y coordinates using the slope and intercept from our linear regression

# Plot the regression line
ax.plot(x, y, '-r')

ax.set_xlabel('Slide Canyon max SWE (mm)')
ax.set_ylabel('Blue Canyon max SWE (mm)');

We've used the slope and intercept from the linear regression. What were the other values the function returned to us?

The function also returns the R value (the correlation coefficient). We can report how well the linear regression fits our data using either R or R-squared (you can see that in this case linear regression did a poor job).

In [None]:
print('r-value = {}'.format(rvalue))

print('r-squared = {}'.format(rvalue**2))

This function also performed a two-sided Wald test (using the t-distribution) of whether the slope of the linear regression differs from zero; the null hypothesis is that the slope is zero. Be careful with this default statistical test, though: is it really the test you need for your data set?

In [None]:
print('p-value = {}'.format(pvalue))
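
The p-value printed above can be reproduced by hand, which shows what the test does: the t statistic is the slope divided by its standard error, compared against a t-distribution with n - 2 degrees of freedom. A minimal sketch on synthetic data (the `x` and `y` arrays are invented for illustration):

```python
import numpy as np
import scipy.stats as st

# Synthetic data with a weak linear relationship (assumed values)
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 30)
y = 0.5 * x + rng.normal(0.0, 1.0, 30)

result = st.linregress(x, y)

# t statistic for H0: slope == 0
t_stat = result.slope / result.stderr
dof = len(x) - 2  # two parameters estimated: slope and intercept

# Two-sided p-value from the t-distribution survival function
p_manual = 2.0 * st.t.sf(abs(t_stat), dof)

print('linregress p = {:.6g}, manual p = {:.6g}'.format(result.pvalue, p_manual))
```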

Finally, it gives us the standard error of the estimated slope:

In [None]:
print('standard error = {}'.format(stderr))

We can also make a plot of the residuals (actual - predicted values)

In [None]:
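# Evaluate the model at the observed x values; `y` above was computed on an
# evenly spaced grid only for drawing the regression line
residuals = data.BLC_max - (slope * data.SLI_max + intercept)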

For a good linear fit, we hope that the residuals are small, show no trends or patterns of their own, and are approximately normally distributed:

In [None]:
f, (ax1, ax2) = plt.subplots(2,1,figsize=(6,6))

ax1.hist(residuals)
ax1.set_xlabel('residuals (mm SWE)')
ax1.set_ylabel('count')


ax2.plot(data.years,residuals)
ax2.set_xlabel('years')
ax2.set_ylabel('residuals (mm SWE)')

f.tight_layout()

That distribution doesn't look quite normal, and there seems to be a negative bias (our predictions are higher than the observations).

There doesn't seem to be a trend in the residuals over time, but they're very noisy.
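
The visual checks on the residuals can be complemented with numbers. The sketch below uses synthetic data (the `x` and `y` arrays are invented) and runs a Shapiro-Wilk normality test, which is one possible choice rather than the test prescribed for this data set. For a least-squares fit with an intercept, the mean of the residuals computed at the observed x values is zero up to rounding.

```python
import numpy as np
import scipy.stats as st

# Synthetic data standing in for two SWE series (assumed values)
rng = np.random.default_rng(4)
x = rng.normal(500.0, 150.0, 40)
y = 100.0 + 0.5 * x + rng.normal(0.0, 80.0, 40)

slope, intercept, rvalue, pvalue, stderr = st.linregress(x, y)

# Residuals: observed minus predicted, evaluated at the observed x values
residuals = y - (slope * x + intercept)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
w_stat, w_p = st.shapiro(residuals)

print('mean residual = {:.2e}'.format(residuals.mean()))
print('Shapiro-Wilk W = {:.3f}, p = {:.3f}'.format(w_stat, w_p))
```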

Let's plot what the predictions of Blue Canyon SWE would look like if we were to use this linear model:

In [None]:
# Use our linear model to make predictions:
BLC_pred = slope * data.SLI_max + intercept

In [None]:
fig, ax = plt.subplots()

data.plot(x='years', y='SLI_max', c='b', ax=ax, label='Slide Canyon Observed')
data.plot(x='years', y='BLC_max', c='r', ax=ax, label='Blue Canyon Observed')

# Plot the predicted SWE at Blue Canyon
ax.plot(data.years, BLC_pred, c='k', linestyle='--', label='Blue Canyon Predictions')

ax.set_title('Timeline of Peak Snow Water Equivalent (SWE)')
ax.set_xlabel('Water Year')
ax.set_ylabel('Peak SWE (mm)');
plt.legend(loc="best")