# New York State Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the central area of the state of New York.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="watertown.png.jpg" src="hw5/watertown.png" style="height:300px; width:400px" /></p>

<p>&nbsp;</p>

<p>We see that the daily minimum and maximum temperatures agree with the ones we obtained from our data.</p>

<p>&nbsp;</p>

<p><img alt="TMINTMAX.png" src="hw5/TMINTMAX.png" style="height:300px; width:800px" /></p>

<p>According to US Climate Data, the average rainfall is close to 100 millimeters per month, which agrees with our observations.</p>

<p>&nbsp;<img alt="PRCP.png" src="hw5/PRCP.png" style="height:300px; width:400px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
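
As a rough illustration of how such a curve can be computed, the sketch below takes a matrix with one row per station-year and one column per day of the year, and returns the cumulative fraction of variance explained by the top-k eigen-vectors. The function name `variance_explained`, the synthetic data, and the choice to zero-fill missing values after centering are assumptions made for this example; they are not taken from the code that produced the figures in this report.

```python
import numpy as np

def variance_explained(X):
    """Cumulative fraction of variance explained by the top-k eigen-vectors.

    X has shape (n_samples, n_days); NaN marks a missing observation.
    """
    # Center each day (column) on its mean, ignoring missing values.
    centered = X - np.nanmean(X, axis=0)
    # Assumption for this sketch: missing entries contribute zero after centering.
    centered = np.nan_to_num(centered)
    cov = centered.T @ centered / centered.shape[0]
    # eigvalsh returns eigenvalues in ascending order, so reverse them.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0, None)
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic stand-in for real station-year vectors: one seasonal shape
# with a random amplitude per row, plus a little noise.
rng = np.random.default_rng(0)
days = np.arange(365)
season = np.sin(2 * np.pi * days / 365)
X = rng.normal(size=(200, 1)) * season + 0.1 * rng.normal(size=(200, 365))

frac = variance_explained(X)
print(f"top 1: {frac[0]:.1%}, top 5: {frac[4]:.1%}")
```

Because the synthetic data has a single dominant shape, the first eigen-vector explains almost all of its variance; real measurements such as PRCP are far noisier, which is why their curves rise slowly.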

### Percentage of variance explained.
![VarExplained1.png](hw5/PCA1.png)
We see that the top 5 eigen-vectors explain 19% of variance for TMIN, 32% for TOBS and 19% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 24% of the variance.

![VarExplained2.png](hw5/PCA2.png)

The top 5 eigenvectors explain 7.3% of the variance for PRCP and 12.5% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 85% of the variance for SNWD, which means that they capture most of the variation in the snow-depth signals. Based on that, we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less from day to day and between the same date in different years.
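
The "decaying integral" idea can be sketched with a toy model in which each day's depth is a fixed fraction of the previous day's depth plus the new snowfall. The decay factor of 0.9, the synthetic snowfall series, and the helper names are all assumptions chosen for illustration; real melting depends on temperature and is not modeled here.

```python
import numpy as np

def snow_depth(snowfall, decay=0.9):
    """Toy model: depth[t] = decay * depth[t-1] + snowfall[t]."""
    depth = np.zeros(len(snowfall))
    level = 0.0
    for t, s in enumerate(snowfall):
        level = decay * level + s
        depth[t] = level
    return depth

def day_to_day_variation(x):
    """Standard deviation of daily changes, relative to the mean level."""
    return np.std(np.diff(x)) / np.mean(x)

# Hypothetical winter: it snows on roughly 20% of days, in random amounts (mm).
rng = np.random.default_rng(1)
snowfall = rng.exponential(50.0, size=120) * (rng.random(120) < 0.2)
depth = snow_depth(snowfall)

print("relative day-to-day variation, SNOW-like:", day_to_day_variation(snowfall))
print("relative day-to-day variation, SNWD-like:", day_to_day_variation(depth))
```

In this toy setting the accumulated series changes much less from one day to the next, relative to its level, than the snowfall series that drives it.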

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from early November to the beginning of May, with the snow-depth peaking in the middle of February.
![SNWD_mean_eigs.png](hw5/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** more snow in early Jan - mid Feb, less snow in mid Feb - late Mar.
* **eig3:** more snow in Nov, Dec and Feb, slightly less snow in Jan and Mar.
* **eig4:** more snow in Jan, Mar and Apr, slightly more snow in Dec, less snow in Feb.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](hw5/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](hw5/SNWD_grid_neg_coeff1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](hw5/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](hw5/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to a snow season that peaks in January and February. Negative values of coeff2 correspond to a snow season that peaks in March.

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](hw5/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](hw5/SNWD_grid_neg_coeff3.png)

Large positive values of coeff3 correspond to a snow season with two spikes: one in December, the other in February.
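
The coefficients and reconstructions shown in this section follow the usual PCA recipe: project the mean-subtracted signal on each eigen-vector to get a coefficient, then add the weighted eigen-vectors back to the mean. The sketch below is a minimal version of that recipe. The function name `reconstruct`, the random orthonormal basis, and the assumption that the signal has no missing days are all illustrative and are not taken from the code behind the plots.

```python
import numpy as np

def reconstruct(x, mean, eigvecs, k):
    """Approximate x by the mean plus its projection on the first k eigen-vectors.

    eigvecs has shape (n_days, n_vecs) with orthonormal columns.
    Returns (coeffs, approximation).
    """
    V = eigvecs[:, :k]
    coeffs = V.T @ (x - mean)        # coeff_i = <x - mean, eig_i>
    return coeffs, mean + V @ coeffs

# Hypothetical setup: a random orthonormal basis stands in for eig1..eig4.
rng = np.random.default_rng(3)
n_days = 365
eigvecs, _ = np.linalg.qr(rng.normal(size=(n_days, 4)))
mean = np.full(n_days, 100.0)

# A signal built from the first two basis vectors only.
x = mean + 3.0 * eigvecs[:, 0] - 2.0 * eigvecs[:, 1]
coeffs, approx = reconstruct(x, mean, eigvecs, k=2)
print("coefficients:", np.round(coeffs, 3))
print("residual norm:", np.linalg.norm(x - approx))
```

A large positive first coefficient then means "more of eig1 than average", which matches the reading of coeff1 above as more snow than average.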



## Visualizing the distribution of the observations

Here we visualize the distribution of the observations. In the map, the circles represent the stations, the radius of each circle represents the number of records that station contributes (here the records are SNWD), and the color represents the value of the first coefficient. The map clearly shows that the region is located in the central area of New York State.
![map.png](hw5/map.png)

## The variation in the timing of snow is mostly due to year-to-year variation
In this section, we estimate the importance of location-to-location variation compared with year-by-year variation.

These are measured using the fraction by which the mean square (MS) is reduced when we subtract from each station/year entry the average-per-station or the average-per-year, respectively. Here are the results:

**coeff_1**  
total MS                   =  4987805.0512  
MS removing mean-by-station=  2765617.07227, fraction explained=44.5  
MS removing mean-by-year   =  3193238.40882, fraction explained=36.0

**coeff_2**  
total MS                   =  1041796.39176  
MS removing mean-by-station=  926429.893424, fraction explained=11.1  
MS removing mean-by-year   =  358636.448135, fraction explained=65.6  

**coeff_3**  
total MS                   =  614551.497599  
MS removing mean-by-station=  563510.858921, fraction explained=8.3  
MS removing mean-by-year   =  219946.138822, fraction explained=64.2  

**coeff_4**  
total MS                   =  533023.911379  
MS removing mean-by-station=  467997.234032, fraction explained=12.2  
MS removing mean-by-year   =  239259.558137, fraction explained=55.1  
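
Each "fraction explained" value is 100 × (1 − MS after removing the group mean / total MS). For coeff_2 and mean-by-year, for example, 1 − 358636.448135 / 1041796.39176 ≈ 0.656, which is the 65.6 reported above. The sketch below shows one way to compute these quantities from a station-by-year table of coefficients. The table layout, the function names, and the synthetic data (built so that the year effect dominates) are assumptions for illustration only.

```python
import numpy as np

def mean_square(table):
    """Mean of the squared entries, ignoring missing station/year pairs."""
    return np.nanmean(table ** 2)

def fraction_explained(table, axis):
    """Percent reduction in mean square after subtracting a group mean.

    table has shape (n_stations, n_years), NaN where a station has no record.
    axis=1 subtracts each station's mean; axis=0 subtracts each year's mean.
    """
    group_mean = np.nanmean(table, axis=axis, keepdims=True)
    return 100 * (1 - mean_square(table - group_mean) / mean_square(table))

# Hypothetical coefficient table: a strong shared year effect plus noise.
rng = np.random.default_rng(2)
n_stations, n_years = 20, 30
year_effect = rng.normal(0, 10, size=(1, n_years))
table = year_effect + rng.normal(0, 3, size=(n_stations, n_years))
table[rng.random(table.shape) < 0.1] = np.nan   # some missing station/years

by_station = fraction_explained(table, axis=1)
by_year = fraction_explained(table, axis=0)
print(f"removing mean-by-station explains {by_station:.1f}%")
print(f"removing mean-by-year explains {by_year:.1f}%")
```

On this synthetic table the year mean removes most of the mean square, which is the same pattern the report shows for coeff_2, coeff_3 and coeff_4.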

We see that the variation by station explains more than the variation by year for the first coefficient, while the opposite holds for the remaining coefficients.

This is because coeff_1 reflects the overall amount of snow at a station and serves as the base coefficient. Although the amount of snow varies from year to year, it is not surprising that it varies even more from station to station, so location plays the more important role. The remaining coefficients, however, reflect the timing of the snow. When a cold wave arrives, it affects a large number of stations at once, so all of these stations show the same trend in that year. The timing of the cold waves, however, varies from year to year. Thus, the variation by year explains more than the variation by station for coeff_2, coeff_3 and coeff_4.