# Wyoming Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of Wyoming. The area under analysis is inside the blue box in the map displayed below.

![ipyleaflet](HW5_figures/ipyleaflet.png)

The data we will use here comes from the [National Oceanic and Atmospheric Administration (NOAA)](https://www.ncdc.noaa.gov/). Specifically, Global Historical Climatology Network - Daily (GHCN-D) data was downloaded from this FTP site (ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We will focus on the following measurements of six meteorological elements:
* **TMIN, TMAX:** the daily minimum and maximum temperature (tenths of degrees C).
* **TOBS:** Temperature at the time of observation (tenths of degrees C).
* **PRCP:** Daily precipitation (i.e., rainfall and snow water equivalent) (tenths of mm).
* **SNOW:** Daily snowfall (mm).
* **SNWD:** The depth of accumulated snow (mm).

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/jackson/wyoming/united-states/uswy0088" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>


<p><img alt="Wyoming-Jackson-Climate" src="HW5_figures/Wyoming-Jackson-Climate.png" /></p>

<p>&nbsp;</p>

<p>We see that the min and max daily&nbsp;temperature agree with the ones we got from our data, once we translate Fahrenheit to Centigrade.</p>

<p>&nbsp;</p>

<p><img alt="TMIN-TMAX" src="HW5_figures/TMIN-TMAX.png" style="height:400px; width:1060px" /></p>

<p>To compare the precipitation&nbsp;we need to translate millimeters/day to inches/month. According to our analysis, the average rainfall is 1.00 mm/day, which translates to about 1.18 inches&nbsp;per month, for most of the year. Average rainfall rises to 2.00 mm/day, which translates to about 2.36 inches&nbsp;per month, in May and June. According to US-Climate-Data, the average rainfall is closer to 1 inch per month. However, there is clear agreement that average precipitation&nbsp;is close to constant throughout the year except for May and June.</p>
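
<p>The conversion used above can be checked with a short sketch. It assumes a nominal 30-day month, which the report does not state explicitly; the function and constant names are illustrative.</p>

```python
MM_PER_INCH = 25.4
DAYS_PER_MONTH = 30  # assumption: a nominal 30-day month


def mm_per_day_to_inches_per_month(mm_per_day, days=DAYS_PER_MONTH):
    """Convert an average daily precipitation rate (mm/day) to inches per month."""
    return mm_per_day * days / MM_PER_INCH


for rate in (1.0, 2.0):
    inches = mm_per_day_to_inches_per_month(rate)
    print(f"{rate:.2f} mm/day = {inches:.2f} inches/month")
# 1.00 mm/day = 1.18 inches/month
# 2.00 mm/day = 2.36 inches/month
```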

<p>&nbsp;<img alt="PRCP" src="HW5_figures/PRCP_expanded.png"  /></p>


## PCA analysis

For each of the six elements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
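
As a minimal sketch of how such a curve could be computed, the following uses NumPy on synthetic data. It assumes a matrix with one row per station-year and one column per day of the year, with no missing values (real GHCN-D data has gaps that would need handling). The function name and the synthetic data are illustrative and are not taken from the code behind this report.

```python
import numpy as np


def cumulative_variance_explained(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    `data` has one row per station-year and one column per day of the year.
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic stand-in for real measurements: a seasonal signal plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
season = np.sin(2 * np.pi * days / 365)
data = rng.normal(size=(200, 1)) * season + 0.3 * rng.normal(size=(200, 365))

explained = cumulative_variance_explained(data)
print(f"top 5 eigenvectors explain {100 * explained[4]:.1f}% of the variance")
```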

### Percentage of variance explained

![VarExplained1](HW5_figures/VarExplained1.png)

We see that the top 5 eigen-vectors explain 33% of the variance for TMIN, 55% for TOBS and 38% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 45% of the variance.

![VarExplained2](HW5_figures/VarExplained2.png)

The top 5 eigenvectors explain 10.5% of the variance for PRCP and 12% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 99% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow-depth signals. Based on that, we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
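
The claim that a decaying integral is smoother than its input can be illustrated with a small simulation. The model below (depth loses a fixed 10% per day, snowfall arrives on random days) is an assumption made only for illustration; it is not a description of how SNWD is measured, and all names are hypothetical.

```python
import random
import statistics


def decaying_integral(snowfall, decay=0.9):
    """Accumulate snowfall into a depth series that loses a fixed fraction each day."""
    depth, series = 0.0, []
    for s in snowfall:
        depth = decay * depth + s
        series.append(depth)
    return series


def relative_spread(values):
    """Standard deviation divided by the mean (coefficient of variation)."""
    return statistics.pstdev(values) / statistics.mean(values)


random.seed(0)
# Hypothetical daily snowfall: most days dry, occasional storms.
snowfall = [random.expovariate(1 / 50) if random.random() < 0.2 else 0.0
            for _ in range(5000)]
depth = decaying_integral(snowfall)

print(f"relative spread of SNOW-like series: {relative_spread(snowfall):.2f}")
print(f"relative spread of SNWD-like series: {relative_spread(depth):.2f}")
```

Under this model the accumulated series has a much smaller spread relative to its mean than the raw snowfall, which is consistent with SNWD being easier to capture with a few eigenvectors.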

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first eigen-vector explains 89% of the variance.

First, we graph the mean and the top 3 eigen-vectors.

We observe that the snow season runs from early November to early June, with the middle of February marking the peak of the snow-depth.

![SNWD_mean_eigs](HW5_figures/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. 

**eig2 and eig3** are similar in the following way: both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow from November to mid-March, more snow in mid-May.
* **eig3:** more snow in mid-February and mid-March, less snow in mid-May.



### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](HW5_figures/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](HW5_figures/SNWD_grid_neg_coeff1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](HW5_figures/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](HW5_figures/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to a late snow season (most of the snowfall is after mid-April). Negative values of coeff2 correspond to an early snow season (most of the snow is before mid-April).

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](HW5_figures/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](HW5_figures/SNWD_grid_neg_coeff3.png)

Large positive values of coeff3 correspond to a snow season with two spikes: one at the end of February, the other at the end of March. Negative values of coeff3 correspond to a season with a single peak at the end of April.
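
Reconstructions like the ones shown in this section combine the mean with the top eigenvectors, each weighted by its coefficient. The following is a minimal sketch of that idea on synthetic data, assuming complete rows with no missing days; the function names and the synthetic snow-depth curve are illustrative and not taken from the code behind this report.

```python
import numpy as np


def top_eigenvectors(data, k):
    """Return the mean and the top-k eigenvectors (as columns) of the covariance of `data`."""
    mean = data.mean(axis=0)
    centered = data - mean
    cov = centered.T @ centered / data.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return mean, eigvecs[:, order]


def reconstruct(row, mean, eigvecs):
    """Project one station-year onto the eigenvectors and rebuild it from the coefficients."""
    coeffs = eigvecs.T @ (row - mean)           # coeff1, coeff2, ...
    return coeffs, mean + eigvecs @ coeffs


# Synthetic snow-depth-like data: a winter bump of varying height, plus noise.
rng = np.random.default_rng(1)
days = np.arange(365)
bump = np.exp(-0.5 * ((days - 45) / 40.0) ** 2)
heights = rng.uniform(0.5, 1.5, size=(100, 1))
data = heights * 500 * bump + rng.normal(scale=10, size=(100, 365))

mean, eigvecs = top_eigenvectors(data, k=3)
coeffs, approx = reconstruct(data[0], mean, eigvecs)
residual = np.sum((data[0] - approx) ** 2) / np.sum((data[0] - mean) ** 2)
print("coefficients:", np.round(coeffs, 1))
print(f"residual fraction after 3 eigenvectors: {residual:.3f}")
```

The sign of each eigenvector returned by a numerical routine is arbitrary, so whether a positive coefficient means "more" or "less" depends on the sign convention of the eigenvectors being used.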



### Geographical distribution of the first 3 coefficients for SNWD

![SNWD_pdf](HW5_figures/SNWD_pdf.png)

We see from the figure above that most of the lowest coefficients are concentrated at elevations of 1400-1600, while higher coefficients occur at higher elevations. This indicates that snow depth varies more at higher elevations than at lower elevations.

## Analysis of correlation between precipitation across locations

To estimate the effect of time vs. location on the first eigenvector coefficient we compute:

    The average row: mean-by-station
    The average column: mean-by-year

We then compute the RMS before and after subtracting either the row or the column vector:

    Total RMS                    =  36.5489939587
    RMS removing mean-by-station =  34.4126343557
    RMS removing mean-by-day     =  28.4299656446
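
A minimal sketch of this kind of computation is given below. It assumes a table with one row per station and one column per time index, with NaN marking missing measurements; the table itself is synthetic and the names are illustrative, so the printed values are not the ones reported above.

```python
import numpy as np


def rms(a):
    """Root-mean-square of the entries that are not NaN."""
    return np.sqrt(np.nanmean(a ** 2))


# Hypothetical table of first-eigenvector coefficients:
# one row per station, one column per time index; NaN marks a missing value.
rng = np.random.default_rng(2)
station_effect = rng.normal(scale=10, size=(8, 1))
time_effect = rng.normal(scale=30, size=(1, 12))
table = station_effect + time_effect + rng.normal(scale=5, size=(8, 12))
table[0, 3] = np.nan

mean_by_station = np.nanmean(table, axis=1, keepdims=True)  # one value per row
mean_by_time = np.nanmean(table, axis=0, keepdims=True)     # one value per column

print("Total RMS                    =", rms(table))
print("RMS removing mean-by-station =", rms(table - mean_by_station))
print("RMS removing mean-by-time    =", rms(table - mean_by_time))
```

In the values reported above, removing the mean-by-day lowers the RMS from about 36.5 to 28.4, while removing the mean-by-station lowers it only to 34.4, which suggests that the coefficient depends more on time than on location.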

<p><img alt="PRCP_CDF.png" src="HW5_figures/PRCP_CDF.png" style="height:350px; width:500px" /></p>

It is likely to be hard to find correlations between the **amount** of rain on the same day at different stations, because amounts of rain vary a lot even between nearby locations. It is more reasonable to compare whether or not it rained on the same day at different stations. As we see from the graph above, in our region it rains on about one third of the days.


### Definition of statistical test

We want to find a statistical test for rejecting the null hypothesis that says that the rainfall in the two locations is independent.

Using the inner product is too noisy, because you multiply the rainfall on the same day in two locations and that product can be very large - leading to a large variance and poor ability to discriminate.

An alternative is to ignore the amount of rain, and just ask whether it rained in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.

### Measuring statistical significance
Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days that it rained at station 1.
* $n_2$ : the number of days that it rained at station 2.
* $l$ : the number of days that it rained at both stations.
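
The four quantities above can be computed from two aligned daily series as in the following sketch. It assumes that missing measurements are represented as `None` and that a day counts as rainy when the recorded amount is strictly positive; both conventions, and the function name, are assumptions made for illustration.

```python
def overlap_counts(rain1, rain2):
    """Compute (m, n1, n2, l) from two aligned daily precipitation series.

    Each entry is a precipitation amount, or None when the measurement is missing.
    A day counts as rainy when the amount is strictly positive.
    """
    both = [(a, b) for a, b in zip(rain1, rain2)
            if a is not None and b is not None]
    m = len(both)
    n1 = sum(a > 0 for a, _ in both)
    n2 = sum(b > 0 for _, b in both)
    l = sum(a > 0 and b > 0 for a, b in both)
    return m, n1, n2, l


station1 = [0, 12, None, 5, 0, 0, 30, 0]
station2 = [0, 8, 3, 0, None, 0, 14, 2]
print(overlap_counts(station1, station2))  # (6, 3, 3, 2)
```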

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{{m \choose l,\,n_1-l,\,n_2-l,\,m-n_1-n_2+l}}{{m \choose n_1}{m \choose n_2}}
$$

where
$$
{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$:
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
which simplifies slightly to
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the number of days for which both stations have measurements. So to get a per-day significance we consider $\frac{1}{m} \log P$.
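
The simplified expression for $\log P$ can be evaluated directly with the log-gamma function. The sketch below uses `math.lgamma` from the Python standard library; the function names are illustrative, and the check that the probabilities sum to one over all feasible overlaps is included only as a sanity test of the formula.

```python
from math import exp, lgamma


def G(n):
    """log(n!) computed as log Gamma(n + 1)."""
    return lgamma(n + 1)


def log_prob_overlap(m, n1, n2, l):
    """Log of P for an overlap of l rainy days, following the simplified formula."""
    if l < max(0, n1 + n2 - m) or l > min(n1, n2):
        return float("-inf")  # this overlap is impossible
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))


def per_day_significance(m, n1, n2, l):
    """Normalize the log probability by the number of shared measurement days."""
    return log_prob_overlap(m, n1, n2, l) / m


m, n1, n2 = 6, 3, 3
total = sum(exp(log_prob_overlap(m, n1, n2, l)) for l in range(min(n1, n2) + 1))
print(f"sum of P over all overlaps: {total:.6f}")
print(f"per-day significance for l=2: {per_day_significance(m, n1, n2, 2):.4f}")
```

Working in log space avoids overflow in the factorials when $m$ is in the thousands, which is the typical order of magnitude for multi-year daily records.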

After we calculate per-day significance for each pair of stations we get the following histogram:

<p><img alt="PRCP_Significance_Histogram" src="HW5_figures/PRCP_Significance_Histogram.png" style="height:350px; width:500px" /></p>

We now analyze the dependency matrix:

<p><img alt="PRCP_rependency matrix" src="HW5_figures/PRCP_rependency matrix.png" style="height:500px; width:500px" /></p>

The matrix above shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.

We see immediately that the first 10 stations are highly correlated with each other.

To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). As we shall see, the top 10 eigenvectors explain about 60% of the square magnitude of the matrix.
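
The fraction of the square magnitude explained by the leading vectors can be read off the singular values. The sketch below uses `numpy.linalg.svd` on a synthetic symmetric matrix that stands in for the matrix of per-day significance values; the matrix, its size, and the function name are assumptions for illustration, so the printed percentage is unrelated to the 60% reported here.

```python
import numpy as np


def squared_magnitude_explained(matrix):
    """Cumulative fraction of the squared Frobenius norm captured by the top-k singular vectors."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # descending order
    energy = singular_values ** 2
    return np.cumsum(energy) / energy.sum()


# Hypothetical symmetric matrix of per-day significance values for 30 stations,
# with one block of 10 mutually dependent stations.
rng = np.random.default_rng(3)
noise = rng.normal(scale=0.02, size=(30, 30))
dependency = (noise + noise.T) / 2
dependency[:10, :10] -= 0.3

explained = squared_magnitude_explained(dependency)
print(f"top 10 singular vectors explain {100 * explained[9]:.1f}% of the squared magnitude")
```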

One group of highly correlated stations is: USC00488209, USC00488830, USC00488889, USC00480228, USS0009F27S, USC00488810, USC00486200, USC00486680, US1WYCV0011, US1WYTT0024.

### PCA Analysis of precipitation

Using SVD, the top 10 eigenvectors explain about 60% of the variance, as shown in the figure below:

<p><img alt="PRCP_var_explained" src="HW5_figures/PRCP_var_explained.png" style="height:350px; width:500px" /></p>

We will look only at the top 4 eigenvectors and compare them with the mean.

![PRCP_top4_eigs](HW5_figures/PRCP_top4_eigs.png)
![PRCP_mean_eigs](HW5_figures/PRCP_mean_eigs.png)

#### Reorder Matrix
When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the upper left corner of the second matrix (the upper left one). The stations at positions 0-20 are clearly strongly correlated with each other, even though some stations, at positions 8-10 or so, are more related to each other than to the rest of this block.

This type of organization is called block diagonal, and it typically reveals important structure such as grouping or clustering.
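
The reordering step can be sketched as sorting the rows and columns by the entries of an eigenvector. The example below is a toy case with two interleaved groups of stations; the matrix, the group strengths, and the function name are hypothetical, and the leading eigenvector is used only as one possible choice of sorting vector.

```python
import numpy as np


def reorder_by_vector(matrix, vector):
    """Sort the rows and columns of a square matrix by the entries of `vector`."""
    order = np.argsort(vector)
    return matrix[np.ix_(order, order)], order


# Hypothetical example: two groups of stations whose rows are interleaved.
labels = np.array([0, 1, 0, 0, 1, 0])
strength = np.array([1.0, 0.6])          # within-group dependence for each group
same_group = labels[:, None] == labels[None, :]
matrix = np.where(same_group, strength[labels][:, None], 0.0)

# eigh returns eigenvalues in ascending order, so the last column is the leading eigenvector.
eigvals, eigvecs = np.linalg.eigh(matrix)
reordered, order = reorder_by_vector(matrix, eigvecs[:, -1])

print("group labels after reordering:", labels[order])
print(reordered)
```

After reordering, stations from the same group sit next to each other, so the nonzero entries collect into square blocks along the diagonal.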

![PRCP_top4_eigs_matrix](HW5_figures/PRCP_top4_eigs_matrix.png)

#### Residuals Analysis

In the figure below, x=0 corresponds to the fraction of the variance explained by the mean alone; x=1,2,3,... correspond to the residuals after also using eig1, eig1+eig2, eig1+eig2+eig3, ...

<p><img alt="PRCP_residual_var_explained" src="HW5_figures/PRCP_residual_var_explained.png" style="height:350px; width:500px" /></p>

We can visualize the distribution of the residuals in the following map. We can see that the residuals are spread all across the region under analysis.

![PRCP_residuals_map](HW5_figures/PRCP_residuals_map.png) 

### Conclusions

Summer (July and August): Warm days and cool nights prevail, with some rain.

Fall (September - November): Sunny days and cold nights alternate with rain and snowstorms. Snow becomes persistent by late fall.

Spring (late April - June): Mild days and cold nights are interspersed with rain and occasional snow. Rainfall is highest in May and June.

Winter (late November - April): Snow blankets the mountains and the valley in Wyoming. Snow depth varies hugely over the winter months and between stations, but these variations do not change the total (much).

Average rainfall is consistent throughout the year except for May and June, where it is highest. Snow accumulates from early November to late June, starts melting in July, and is almost completely melted by August.