# Wyoming Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of Wyoming.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/wyoming/united-states/3220#" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Wyoming_-_ Cheyenne_and_Weather_averages_Wyoming.png" src="my_figures/Climate_Wyoming_-_ Cheyenne_and_Weather_averages_Wyoming.png" /></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we convert our values from tenths of a degree to degrees Centigrade.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="my_figures/TMIN,TMAX.png" style="height:300px; width:800px" /></p>

<p>Next we compare the precipitation, which also agrees with our data. Precipitation rises to a maximum in the middle of the year (May-June), which is what we observe in our data.</p>

<p>&nbsp;<img alt="PRCP.png" src="my_figures/PRCP.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
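
As a rough illustration of this computation, the sketch below uses NumPy on a synthetic matrix with one row per station-year and one column per day. The function name `variance_explained` and the random data are assumptions made for illustration; the pipeline behind this report may differ, for example in how it handles missing values.

```python
import numpy as np

def variance_explained(X):
    """Cumulative fraction of variance explained by the top-k eigenvectors
    of the covariance matrix of X (rows = samples, columns = days)."""
    cov = np.cov(X, rowvar=False)            # day-by-day covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)    # clip tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic stand-in for the real data: 200 "station-years" x 30 "days".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30)) * np.linspace(5.0, 0.5, 30)
frac = variance_explained(X)
print(f"top 5 eigenvectors explain {100 * frac[4]:.1f}% of the variance")
```

Entry `frac[k - 1]` is the value that the curves below would show at `k` eigenvectors.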

### Percentage of variance explained.
![VarExplained1.png](my_figures/VarExplained1.png)
We see that the top 5 eigen-vectors explain 29% of variance for TMIN, 51% for TOBS and 32% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 45% of the variance.

![VarExplained2.png](my_figures/VarExplained2.png)

The top 5 eigenvectors explain 11% of the variance for PRCP and 9% for SNOW. Both are low values. On the other hand the top 5 eigenvectors explain 85% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow depth signals. Based on that we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
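
The "decaying integral" argument can be checked on simulated data. The sketch below is a toy model, not the report's data: the snowfall distribution and the daily decay factor of 0.95 are assumptions chosen only to show the smoothing effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days = 2000
# Hypothetical snowfall: snow on ~15% of days, exponential amounts (mm).
snow = rng.exponential(scale=50.0, size=n_days) * (rng.random(n_days) < 0.15)

decay = 0.95  # assumed fraction of the snow pack that survives each day
depth = np.zeros(n_days)
for t in range(1, n_days):
    # Decaying integral: yesterday's depth shrinks, today's snowfall is added.
    depth[t] = decay * depth[t - 1] + snow[t]

def lag1_autocorr(x):
    """Correlation between a series and itself shifted by one day."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print("lag-1 autocorrelation, simulated SNOW:", round(lag1_autocorr(snow), 2))
print("lag-1 autocorrelation, simulated SNWD:", round(lag1_autocorr(depth), 2))
```

In this toy model consecutive days of depth are strongly correlated while consecutive days of snowfall are not, which is the sense in which the integrated signal is smoother.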

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 82% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from the end of October to the beginning of May, with the snow depth peaking at the end of February.
![SNWD_mean_eigs.png](my_figures/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is very close to zero from the end of October to the beginning of November, while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** more snow in Dec-Jan, average snow in Jan-Feb,  less snow in mid Feb-May.
* **eig3:** less snow in Jan-Feb, more snow in April-May, slightly more snow in Dec
* **eig4:** more snow in start of Feb, less snow in March, more snow in April and beginning of May.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](my_figures/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](my_figures/SNWD_grid_neg_coeff1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](my_figures/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](my_figures/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to an early snow season (most of the snowfall is in Dec-Jan). Negative values of coeff2 correspond to a late snow season (most of the snow is after mid-Feb). With large positive values of coeff2 we also observe a second, small spike in snow around April.

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](my_figures/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](my_figures/SNWD_grid_neg_coeff3.png)

Large positive values of coeff3 correspond to a late snow season (most of the snowfall is after mid Feb with some peaks in Dec). Negative values for coeff3 correspond to an early snow season (most of the snow is before mid-Feb).
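
The reconstructions in these grids combine the mean with a few eigenvectors weighted by their coefficients. A minimal sketch of that idea follows, on synthetic data; the helper name `reconstruct` and the random-walk "seasons" are assumptions for illustration, and missing values are ignored.

```python
import numpy as np

def reconstruct(x, mean, eigvecs, k):
    """Approximate x by the mean plus its projection on the first k eigenvectors.

    eigvecs holds one orthonormal eigenvector per column.
    Returns (coefficients, reconstruction).
    """
    V = eigvecs[:, :k]
    coeffs = V.T @ (x - mean)      # coeff1, coeff2, ... for this station-year
    return coeffs, mean + V @ coeffs

# Synthetic stand-in: 100 seasons x 12 time points.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 12)).cumsum(axis=1)
mean = X.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
eigvecs = eigvecs[:, ::-1]         # order by decreasing eigenvalue

coeffs, approx = reconstruct(X[0], mean, eigvecs, k=3)
print("coefficients:", np.round(coeffs, 2))
```

A large positive or negative first coefficient moves the whole reconstruction along eig1, which is why the grids sorted by coeff1 show uniformly more or less snow than the mean.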

### Cumulative Residuals
Below we plot the cumulative distributions of the top 3 residuals. For instance, residual 3 is obtained after subtracting in sequence: the mean, the projection on the first eigen-vector, and the projection on the second eigen-vector.<br>
Res1: 
![SNWD_res_1_CDF.png](my_figures/SNWD_res_1_CDF.png)
Res2: 
![SNWD_res_2_CDF.png](my_figures/SNWD_res_2_CDF.png)
Res3: 
![SNWD_res_3_CDF.png](my_figures/SNWD_res_3_CDF.png)
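
One way to compute residuals of this kind and their empirical CDF is sketched below. The normalization (dividing by the energy of the mean-subtracted signal) and the indexing (entry `k` is what remains after removing the first `k + 1` eigenvectors) are assumptions of this sketch and may not match the convention used for the plots above.

```python
import numpy as np

def residual_fractions(x, mean, eigvecs, k_max=3):
    """Fraction of the mean-subtracted energy left after removing the
    projections on the first 1..k_max eigenvectors (orthonormal columns)."""
    r = x - mean
    total = np.sum(r ** 2)
    fractions = []
    for k in range(k_max):
        v = eigvecs[:, k]
        r = r - (v @ r) * v          # remove the component along eigenvector k
        fractions.append(np.sum(r ** 2) / total)
    return fractions

# Synthetic stand-in: 100 seasons x 12 time points.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 12)).cumsum(axis=1)
mean = X.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
eigvecs = eigvecs[:, ::-1]           # order by decreasing eigenvalue

res = np.array([residual_fractions(x, mean, eigvecs) for x in X])
# Empirical CDF of the residual left after the first eigenvector.
sorted_res1 = np.sort(res[:, 0])
cdf = np.arange(1, len(sorted_res1) + 1) / len(sorted_res1)
```

Plotting `cdf` against `sorted_res1` gives a curve of the same type as the figures above: the further left it rises, the better the first eigenvector explains individual station-years.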


### The spatio-temporal variation in the snow depth pattern
We now estimate the relative importance of location-to-location variation relative to year-by-year variation for snow depth.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. 

To estimate the effect of time vs. location on the eigenvector coefficients we
compute:

* The average row: `mean-by-station`
* The average column: `mean-by-year`

We then compute the MS (mean-squared) error before and after subtracting either the row or the column vector. Here are the results:

#### Coeff_1
total MS                   =  3417312.82625 <br>
MS removing mean-by-station= 1103030.053516, percentage explained = 67.722298%<br>
MS removing mean-by-year   = 2629209.314931, percentage explained = 23.062083%

#### Coeff_2
total MS                   =  410441.46651<br>
MS removing mean-by-station= 358630.637717, percentage explained = 12.623196%<br>
MS removing mean-by-year   = 244812.957351, percentage explained = 40.353747%

#### Coeff_3
total MS                   =  384303.953987<br>
MS removing mean-by-station= 289658.003703, percentage explained = 24.627889%<br>
MS removing mean-by-year   = 241249.504596, percentage explained = 37.224298%

We see that for coeff_1 the variation by station explains more of the variance than the variation by year (about 68% vs. 23%). Coeff_1 has to do with the total amount of snow, whereas coeff_2 and coeff_3, as we saw above, have to do with the timing of the snow. For coeff_2 and coeff_3 the station explains roughly 13-25% of the variance while the year explains 37-40%. Hence location better explains the amount of snow, and time (year) better explains the timing of the snow season.
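
The percentages above follow from the formula `100 * (1 - MS_after / MS_total)`. The sketch below shows one way to compute them from a station-by-year table and re-derives the coeff_1 percentages from the reported MS values. The function name `fraction_explained` and the layout of `table` are assumptions, and missing station/year entries (which the real table almost certainly has) are not handled.

```python
import numpy as np

def fraction_explained(table):
    """table[i, j] = coefficient for station i in year j (no missing values).

    Returns the percentage of the mean square removed by subtracting
    the per-station mean and the per-year mean, respectively.
    """
    ms_total = np.mean(table ** 2)
    ms_no_station = np.mean((table - table.mean(axis=1, keepdims=True)) ** 2)
    ms_no_year = np.mean((table - table.mean(axis=0, keepdims=True)) ** 2)
    return (100 * (1 - ms_no_station / ms_total),
            100 * (1 - ms_no_year / ms_total))

# Arithmetic check against the coeff_1 values reported in this section.
by_station = 100 * (1 - 1103030.053516 / 3417312.82625)
by_year = 100 * (1 - 2629209.314931 / 3417312.82625)
print(round(by_station, 2), round(by_year, 2))  # 67.72 23.06
```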

### Geographical distribution of first coefficient

The following map shows the density of snowfall for different stations. The radius of each circle is directly proportional to the number of data entries for that station.

<table><tr><td><img src='my_figures/legend.png' width='320'></td><td><img src='my_figures/5__maps_using_iPyLeaflet.png'></td></tr></table>

## Analysis of correlation between precipitation across locations

### Definition of statistical test
It is likely to be hard to find correlations between the amount of rain on the same day in different stations, because amounts of rain vary a lot even between close locations. It is more reasonable to compare whether or not it rained on the same day in different stations. As we see from the plot below, in our region it rains on about one fifth of the days.
![rainfall_cdf.png](my_figures/rainfall_cdf.png)
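
One possible form of such a test is sketched below: given the number of days on which both stations reported, and the number of rain days at each, it computes the probability of seeing at least the observed overlap if the rain days were placed independently at random (a hypergeometric tail). This is an assumed formulation for illustration; the exact statistic and normalization used in this report may differ, and the example counts are made up.

```python
from math import comb, log

def overlap_tail_probability(n, m1, m2, overlap):
    """P(at least `overlap` shared rain days) if station 2's m2 rain days
    were placed at random among n days, m1 of which are rainy at station 1."""
    total = comb(n, m2)
    favourable = sum(comb(m1, k) * comb(n - m1, m2 - k)
                     for k in range(overlap, min(m1, m2) + 1))
    return favourable / total

# Hypothetical example: 1000 common days, rain on 200 of them at each station.
n, m1, m2 = 1000, 200, 200
expected = m1 * m2 / n                      # shared rain days expected by chance
p_chance = overlap_tail_probability(n, m1, m2, 40)
p_strong = overlap_tail_probability(n, m1, m2, 120)
print("expected overlap:", expected)
print("P(overlap >= 40): ", p_chance)
print("P(overlap >= 120):", p_strong)
print("log-probability per day:", log(p_strong) / n)
```

A very small probability means the overlap is unlikely to be random, so the two stations tend to have rain on the same days.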


### Correlations Matrix
The matrix below shows, for each pair of stations, the normalized log probability that the overlap in rain days is random. We see immediately that stations 14-19 are highly correlated with each other.
![correlation_mat.png](my_figures/correlation_mat.png)

### PCA analysis
To find more correlations we use the SVD of the dependency matrix (the term PCA is reserved for decomposition of the covariance matrix). As we shall see, the top 10 eigenvectors explain about 64% of the squared magnitude of the matrix.
![pca_analysis.png](my_figures/pca_analysis.png)
Below are the plots for the top 4 PCA components, each sorted in its own order: plot 1 is sorted according to pca_component 1, plot 2 according to pca_component 2, and so on.
![pca_comp_plot.png](my_figures/pca_comp_plot.png)
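
The "fraction of the squared magnitude explained" can be read directly off the singular values. The sketch below shows the computation on a small synthetic matrix; the function name `energy_explained` and the two-group test matrix are assumptions for illustration only.

```python
import numpy as np

def energy_explained(M):
    """Cumulative fraction of the squared (Frobenius) magnitude of M
    captured by its top-k singular values."""
    s = np.linalg.svd(M, compute_uv=False)   # singular values, largest first
    return np.cumsum(s ** 2) / np.sum(s ** 2)

# Synthetic symmetric matrix with two groups of 10 "stations" plus noise.
rng = np.random.default_rng(4)
groups = np.repeat([0, 1], 10)
M = (groups[:, None] == groups[None, :]).astype(float)
M += 0.1 * rng.normal(size=M.shape)
M = (M + M.T) / 2                            # keep the matrix symmetric
frac = energy_explained(M)
print(f"top 2 components: {100 * frac[1]:.0f}% of the squared magnitude")
```

For a symmetric matrix the singular values are the absolute values of the eigenvalues, so this is the same quantity as the one quoted for the top 10 eigenvectors.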

### Geographical distribution of top pca components.
When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the upper left corner of the first (upper left) matrix. The stations at positions 0-6 are clearly strongly correlated with each other.

![sorted_correlation.png](my_figures/sorted_correlation.png)

This type of organization is called **Block Diagonal** and it typically reveals important structure such as grouping or clustering.
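
The reordering itself takes only a few lines of NumPy. The sketch below builds a hypothetical matrix in which two groups of stations are listed in shuffled order, then sorts rows and columns by the entries of one eigenvector. The matrix, the group sizes, and the choice of the second eigenvector are assumptions of this example, not the report's data.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical stations: two groups of 8, listed in shuffled order.
groups = rng.permutation(np.repeat([0, 1], 8))
same = (groups[:, None] == groups[None, :]).astype(float)
# Baseline similarity 0.2, same-group similarity 1.0, plus a little noise.
M = 0.2 + 0.8 * same + 0.05 * rng.normal(size=same.shape)
M = (M + M.T) / 2                      # keep the matrix symmetric

# Sort the stations by the entries of one eigenvector and apply the
# same permutation to both rows and columns.
_, eigvecs = np.linalg.eigh(M)         # eigenvalues in ascending order
v = eigvecs[:, -2]                     # eigenvector of the second-largest eigenvalue
order = np.argsort(v)
M_sorted = M[np.ix_(order, order)]
print(groups[order])                   # members of each group end up adjacent
```

In `M_sorted` the two groups appear as dense square blocks along the diagonal, which is the block-diagonal structure described above.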

I extracted the set of 6 stations that forms the top dense block for the region in upper left matrix, and then plotted them on the map to see their spatial relationship. I also added the 7th station from the matrix just to verify if it is far away from the other 6. 

![geographical_significance.png](my_figures/geographical_significance.png)

The plot supports the grouping idea: the top 6 stations are very close to each other geographically, and the 7th station lies somewhat farther from the cluster of 6.
