# Idaho Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of Idaho.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usid0119" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>


<p><img alt="Climate_Idaho_City_-_Idaho_and_Weather_averages_Idaho_City.png" style="width:608px;height:456px;" src="r_figures/Climate_Idaho_City_-_Idaho_and_Weather_averages_Idaho_City.png" /></p>


<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius. The average annual low temperature from US Climate Data is 27.8°F and the average annual high temperature is 59.4°F. The Idaho climate chart above also shows that our data match the annual temperature trend well.
</p>


<p><img alt="TMIN,TMAX.svg" src="r_figures/TMIN,TMAX.svg" style="height:300px; width:1500px" /></p>

<p>To analyze the precipitation we need to convert the units from millimeters per day to inches per month. Since a precipitation of 3.00 mm/day corresponds to about 3.55 inches per month, we can plot our annual precipitation as follows. According to US Climate Data the average rainfall is close to 2 inches per month, which agrees closely with our average precipitation analysis. The low-precipitation months also fall between June and October, which matches the data from US Climate Data well.</p>

<p>&nbsp;<img alt="PRCP.svg" src="r_figures/PRCP.svg" style="height:450px; width:600px" /></p>
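The two unit conversions used in this sanity check can be sketched as follows. This is a minimal illustration, not the code used to produce the figures; the function names and the 30-day month are assumptions made here.

```python
MM_PER_INCH = 25.4


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=30):
    """Convert a daily precipitation rate in mm to a monthly total in inches.

    days_per_month=30 is an assumption; the report does not state
    which month length was used.
    """
    return mm_per_day * days_per_month / MM_PER_INCH


print(round(mm_per_day_to_inches_per_month(3.0), 2))  # 3.54
print(round(fahrenheit_to_celsius(27.8), 1))          # -2.3
print(round(fahrenheit_to_celsius(59.4), 1))          # 15.2
```

With a 30-day month, 3.00 mm/day comes out at roughly 3.54 inches per month, close to the figure quoted above.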


## PCA analysis

For each of the six measurements (TMIN, TOBS, TMAX, SNOW, SNWD, PRCP), we compute the percentage of the variance explained as a function of the number of eigenvectors used.
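As a rough sketch of this computation, the cumulative fraction of variance explained can be read off the eigenvalues of the covariance matrix. The function name and the toy data below are hypothetical, and the sketch assumes a complete matrix (one row per station-year, one column per day) with no missing values, which real station data does not satisfy without extra handling.

```python
import numpy as np


def cumulative_variance_explained(data):
    """Return the fraction of variance explained by the top 1, 2, ... eigenvectors.

    data: array of shape (n_samples, n_days); assumed to contain no NaNs.
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    return np.cumsum(eigvals) / eigvals.sum()


# Toy data: one dominant seasonal pattern plus a little noise.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 1)) * np.linspace(1, 2, 365)
toy = toy + 0.1 * rng.normal(size=(200, 365))

frac = cumulative_variance_explained(toy)
print(frac[:5])  # fraction explained by the top 1..5 eigenvectors
```

A curve that rises quickly, as in this toy case, indicates a measurement dominated by a few patterns; a slowly rising curve indicates a noisy one.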

### Percentage of variance explained.
![VarExplained.svg](r_figures/VarExplained.svg)

We can see that the top five eigenvectors explain approximately 30% of the variance for TMIN, 56% for TOBS, 32% for TMAX, 13% for SNOW, 94% for SNWD and 10% for PRCP.

Of the six measurements, SNOW and PRCP are not well explained by the top five eigenvectors: their first eigenvectors explain only about 5% and 3% of the variance respectively.

SNWD is best explained by the top five eigenvectors, and its first eigenvector alone explains almost 80% of the variance. Since the top five eigenvectors capture most of the variation in snow depth, we dig deeper into the PCA analysis for this measurement.

It makes sense that SNWD is less noisy than SNOW and PRCP: snowfall and precipitation can vary from day to day, while snow depth behaves more like a continuous process and varies less than the other two.


## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow depth because the top 4 eigenvectors explain 92% of the variance.

First, we graph the mean and the top 4 eigenvectors.

We observe that the snow season runs from mid-November to the end of March, with snow depth peaking in the middle of February.
![SNWD_mean_eigs.svg](r_figures/SNWD_mean_eigs.svg)

Then we interpret the eigen-functions. 

The first eigen-function (eig1) has a shape very similar to the mean function, but with a negative sign. Both have a peak from Jan. to Apr. and move away from zero from Nov. to Jan. This shows that eig1 best captures the shape of the mean function and the distribution of SNWD over time. As noted above, eig1 explains approximately 80% of the variance.

The second eigen-function (eig2) has a shape similar to eig1 from Jan. to Apr., but from Nov. to Jan. it has the opposite sign from eig1 and a larger magnitude. eig2 can be interpreted as an adjustment to eig1 that brings the combined result closer to the target function.

**eig3 and eig4** are similar in that both oscillate between positive and negative values. They shift the distribution of snow depth across the winter months with lower weights, but they do not change the total much.

They may be interpreted as follows:
* **eig2:** less snow in Mar., more snow in Nov. to early Jan.
* **eig3:** less snow in Jan. and Feb., more snow in Mar. and early Apr.
* **eig4:** less snow from Jan. to mid Feb., more snow in end of Feb. and Mar., less snow in end of Mar., Apr. and early May.
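To make the role of the coefficients concrete, here is a minimal sketch of how one station-year is projected onto the eigenvectors and then reconstructed as the mean plus a weighted sum of eigenvectors. The function name and the toy mean and eigenvectors are hypothetical and stand in for the real SNWD decomposition.

```python
import numpy as np


def project_and_reconstruct(year_vector, mean, eigvecs):
    """Project one year of data onto the eigenvectors and reconstruct it.

    eigvecs: array of shape (n_days, k) with orthonormal columns.
    Returns the k coefficients and the reconstruction mean + eigvecs @ coeffs.
    """
    coeffs = eigvecs.T @ (year_vector - mean)
    reconstruction = mean + eigvecs @ coeffs
    return coeffs, reconstruction


# Toy stand-ins for the mean and the top 4 eigenvectors (assumptions).
rng = np.random.default_rng(0)
n_days = 365
days = np.arange(n_days)
mean = 100.0 * np.exp(-((days - 200) / 40.0) ** 2)
eigvecs, _ = np.linalg.qr(rng.normal(size=(n_days, 4)))

# A year built from the mean, a negative coeff1 and a small positive coeff2.
year_vector = mean - 3.0 * eigvecs[:, 0] + 0.5 * eigvecs[:, 1]
coeffs, reconstruction = project_and_reconstruct(year_vector, mean, eigvecs)
print(np.round(coeffs, 3))  # approximately [-3, 0.5, 0, 0]
```

Each coefficient says how much of the corresponding eigen-function is added to the mean, which is why the sign of a coefficient matters when reading the reconstructions.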

### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff_1.svg](r_figures/SNWD_grid_Pos_coeff_1.svg)
Coeff1: most negative
![SNWD_grid_Neg_coeff_1.svg](r_figures/SNWD_grid_Neg_coeff_1.svg)
Large positive values of coeff1 correspond to less-than-average snow, while large negative values correspond to more-than-average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff_2.svg](r_figures/SNWD_grid_Pos_coeff_2.svg)
Coeff2: most negative
![SNWD_grid_Neg_coeff_2.svg](r_figures/SNWD_grid_Neg_coeff_2.svg)

Large positive values of coeff2 correspond to an early snow season (most of the snowfall is before early Mar.). Negative values of coeff2 correspond to a late snow season (most of the snow is after early Mar.).
#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff_3.svg](r_figures/SNWD_grid_Pos_coeff_3.svg)
Coeff3: most negative
![SNWD_grid_Neg_coeff_3.svg](r_figures/SNWD_grid_Neg_coeff_3.svg)

Large positive values of coeff3 correspond to a season with a single peak at the end of Mar. Large negative values of coeff3 correspond to a snow season with two spikes: one at the end of Feb., the other at the beginning of Apr.



## The variation in the timing of snow is mostly due to year-to-year variation


In this section, we estimate the importance of location-to-location variation relative to year-by-year variation.
Both are measured by the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

### coeff_1 
total MS = 7645869.42 <br>
MS removing mean-by-station= 2337260.73, fraction explained=69.43% <br>
MS removing mean-by-year = 4600023.35, fraction explained=39.84%

### coeff_2 
total MS = 720290.29 <br>
MS removing mean-by-station= 686028.44, fraction explained= 4.76% <br>
MS removing mean-by-year = 292349.05, fraction explained=59.41% 

### coeff_3 
total MS = 614998.07 <br>
MS removing mean-by-station= 450135.43, fraction explained= 26.81% <br>
MS removing mean-by-year = 346540.70, fraction explained=43.65%

We see that the variation by year explains more than the variation by station for coeff_2 and coeff_3. This does not hold for coeff_1, where the station explains about 69% of the variance and the year only about 40%. coeff_1 has to do with the total amount of snow (with a negative sign), whereas coeff_2 and coeff_3, as we saw above, have to do with the timing of snowfall. For coeff_2 and coeff_3 the stations explain about 5-27% of the variance while the year explains about 44-59%.
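The "fraction explained" figures above can be reproduced in spirit with a short sketch: subtract the per-group mean (by station or by year) and compare the remaining mean square with the total. The function names and the toy values are hypothetical, and the sketch assumes the total MS is the mean square of the raw coefficient values with no missing entries.

```python
import numpy as np


def mean_square(values):
    """Mean of the squared values."""
    return float(np.mean(np.square(values)))


def fraction_explained(values, groups):
    """Fraction by which the mean square drops after removing group means."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    residual = values.copy()
    for g in np.unique(groups):
        mask = groups == g
        residual[mask] -= values[mask].mean()
    return 1.0 - mean_square(residual) / mean_square(values)


# Toy coefficient that depends only on the station, not on the year.
values = [10.0, 10.0, -10.0, -10.0]
stations = ["A", "A", "B", "B"]
years = [2000, 2001, 2000, 2001]

print(fraction_explained(values, stations))  # 1.0
print(fraction_explained(values, years))     # 0.0
```

The same ratio applied to the reported numbers gives, for example, 1 - 2337260.73 / 7645869.42, which is about 69.43% for coeff_1 by station.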

### Geographical distribution of first 3 coefficients. 

The following pictures show the distribution of the first 3 coefficients of the **SNWD** measurement. The color bar ranges from red to blue, representing the largest positive value and the most negative value respectively.

#### Coeff1

<p><img alt="SNWD_distribution_coeff1_1.png" src="r_figures/SNWD_distribution_coeff1_1.png" style="height:300px; width:750px" /></p>


#### Coeff2
<p><img alt="SNWD_distribution_coeff2_1.png" src="r_figures/SNWD_distribution_coeff2_1.png" style="height:300px; width:750px" /></p>

#### Coeff3
<p><img alt="SNWD_distribution_coeff3_1.png" src="r_figures/SNWD_distribution_coeff3_1.png" style="height:300px; width:750px" /></p>


We can see that coeff1 and coeff2 share a similar distribution of snow-depth values, while coeff3's values are clearly opposite to the other two. From the analysis above, we already know that **MS removing mean-by-station** for coeff1 is comparatively small and the corresponding fraction explained is quite large (69.43%). The maps also show that the colors are more diverse for coeff1 and coeff2 than for coeff3.

## Analysis of correlation between precipitation across locations
### Definition of statistical test

In this section, we analyze precipitation data across locations.
The eigen-decomposition analysis for precipitation differs from the one for snow depth because this data is much noisier and varies from day to day. As seen in the statistics above, the top five eigenvectors of PRCP explain only 10% of the variance.

This phenomenon is shown in the following figure, in which the data is quite noisy and eig1, eig2 and eig3 all oscillate frequently between positive and negative values.

<p><img alt="PRCP_Mean_Eigs.svg" src="r_figures/PRCP_Mean_Eigs.svg" style="height:600px; width:1500px" /></p>

Then we extract all of the values of `res_3`, which is the residual variance after the mean and the first two eigenvectors have been subtracted out. From the following figure, we can see that approximately 10% of `res_3` is explained by 90% of the station instances.

<p><img alt="res_3_distribution_instances.svg" src="r_figures/res_3_distribution_instances.svg" style="height:400px; width:750px" /></p>

Then we combine the measurements for each station into a single long array with an entry for each day of each year, and compute the RMS after removing the average-per-day and the average-per-station respectively, to see the significance of location-to-location variation relative to day-by-day variation. The results are as follows.

* total RMS = 31.71
* RMS removing mean-by-station = 30.01, fraction explained = 5.36%
* RMS removing mean-by-day = 23.24, fraction explained = 26.71%

This illustrates that the daily data depends more on the day than on the station. Since amounts of rain vary a lot even between nearby locations, it is hard to find correlations between the amount of rain on the same day at different stations. It is therefore more reasonable to compare whether or not it rained on the same day at different stations.


### Correlations Matrix

We aim to find statistical evidence to reject the null hypothesis that the rainfall at two locations is independent. Using the inner product would be too noisy, because it multiplies the rainfall on the same day at the two locations; the resulting products can be very large, which leads to a large variance and makes it difficult to discriminate.
The alternative we adopt is to ignore the amount of rain and consider only whether it rained at both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.
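One way to make this test concrete is sketched below. It is an illustration under stated assumptions, not necessarily the computation behind the figures: it models the overlap with a hypergeometric distribution (the rain days of one station placed at random among the shared days), takes the upper tail as the probability, and normalizes the log probability by the number of shared days. All function names are hypothetical.

```python
from math import comb, log


def log_overlap_tail_probability(n_days, rain_a, rain_b, overlap):
    """Log of P(X >= overlap) under independence (hypergeometric tail).

    n_days: days on which both stations reported.
    rain_a, rain_b: number of rainy days at each station.
    overlap: number of days on which it rained at both.
    """
    total = comb(n_days, rain_b)
    tail = sum(
        comb(rain_a, k) * comb(n_days - rain_a, rain_b - k)
        for k in range(overlap, min(rain_a, rain_b) + 1)
    )
    # Work with logs of exact integers to avoid floating-point underflow.
    return log(tail) - log(total)


def normalized_log_probability(rained_a, rained_b):
    """Log tail probability divided by the number of shared days (an assumption)."""
    n_days = len(rained_a)
    overlap = sum(1 for a, b in zip(rained_a, rained_b) if a and b)
    log_p = log_overlap_tail_probability(
        n_days, sum(rained_a), sum(rained_b), overlap
    )
    return log_p / n_days


station_a = [True] * 5 + [False] * 5
station_c = [True, False] * 5

# Identical rain days: strong evidence against independence.
print(normalized_log_probability(station_a, station_a))
# Overlap close to what chance predicts: weak evidence.
print(normalized_log_probability(station_a, station_c))
```

The more negative the normalized value, the less plausible it is that the overlap in rain days arose by chance.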

We then calculate the normalized log probability for each pair of stations. From the following figure we can see that many pairs of stations are dependent and reject the null hypothesis at the 5% significance level.

<p><img alt="significance_PRCP.svg" src="r_figures/significance_PRCP.svg" style="height:400px; width:400px" /></p>

The following matrix shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.

<p><img alt="random_p_norm.svg" src="r_figures/random_p_norm.svg" style="height:400px; width:400px" /></p>

Then we use SVD to find more correlation structure among station pairs. The following figure shows the correlation between stations. If two stations are highly correlated, their block is green.

<p><img alt="correlation_PRCP.svg" src="r_figures/correlation_PRCP.svg" style="height:800px; width:800px" /></p>



From the figure above, we can see that when we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, in the upper left corner of the second matrix (the upper right one), the stations at positions 0-10 are clearly strongly correlated with each other.
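The reordering step can be sketched as follows: sort the stations by their entries in one eigenvector and apply the same permutation to both rows and columns. The function name and the toy matrix are hypothetical; the toy matrix has two interleaved groups of stations so that the effect of the reordering is visible.

```python
import numpy as np


def reorder_by_eigenvector(matrix, which=0):
    """Reorder rows and columns of a symmetric matrix by one eigenvector.

    which=0 selects the eigenvector with the largest absolute eigenvalue.
    Returns the reordered matrix and the permutation that was applied.
    """
    eigvals, eigvecs = np.linalg.eigh(matrix)
    ranked = np.argsort(np.abs(eigvals))[::-1]
    vec = eigvecs[:, ranked[which]]
    order = np.argsort(vec)
    return matrix[np.ix_(order, order)], order


# Toy matrix: stations 0, 2, 4 form one group and 1, 3, 5 another.
groups = np.array([0, 1, 0, 1, 0, 1])
same_group = groups[:, None] == groups[None, :]
strength = np.where(groups == 0, 1.0, 0.5)
matrix = same_group * strength[:, None]

reordered, order = reorder_by_eigenvector(matrix)
print(order)      # the two groups end up contiguous
print(reordered)  # block-diagonal after reordering
```

After the permutation, stations from the same group sit next to each other, so correlated groups show up as square blocks along the diagonal.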


### PCA analysis

We then conduct PCA on the matrix made up of the normalized log probability for each pair of stations. The top 10 eigenvectors explain about 70% of the square magnitude of the matrix.

<p><img alt="pca_explained_var_ratio_PRCP.svg" src="r_figures/pca_explained_var_ratio_PRCP.svg" style="height:400px; width:400px" /></p>

We then pick the top 4 eigenvectors and sort by each of them in turn to see its relationship with the other 3 eigenvectors.

<p><img alt="svd_PRCP_egi.svg" src="r_figures/svd_PRCP_egi.svg" /></p>

The reconstruction using the first 3 coefficients is shown below.
<p><img alt="reconstruction_PCRP.png" src="r_figures/reconstruction_PCRP.png" style="height:400px; width:600px" /></p>



### Geographical distribution of top 4 coefficients

<p><img alt="map_PRCP_4eig.png" src="r_figures/map_PRCP_4eig.png" /></p>


In the geographical distribution map of the top 4 coefficients of the PRCP measurement, triangles colored red, green, blue and black represent coeff1, coeff2, coeff3 and coeff4 respectively. The correlation between stations is visible on the map, since stations in the same area seem to share more rainy days. For example, in the area around Pocatello, stations share the same pattern, in which the red triangle dominates, followed by blue and black.