# Idaho Weather Analysis


## 1. Introduction

This report presents a historical analysis of weather patterns in a region that approximately overlaps the state of Idaho. It covers the analysis and reconstruction of parameters such as the depth of accumulated snow and precipitation patterns, using statistical analysis and principal component analysis (PCA).

For the analysis we focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation for each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.


### Data Source

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).


## 2. Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Rupert_Idaho.png" src="r_figures/Climate_Rupert_Idaho.png" style="height:400px; width:600px"/></p>

<p>&nbsp;</p>

<p>We see that the min and max daily&nbsp;temperature agree with the ones we got from our data, once we translate Fahrenheit to Centigrade.</p>

<p>&nbsp;</p>

<p><img alt="Rupert_TMIN,TMAX.png" src="r_figures/Rupert_TMIN,TMAX.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to convert millimeters per day to inches per month. According to our analysis the average rainfall is 8.72 mm/day, which translates to about 10.29 inches per month. According to US Climate Data the average rainfall is closer to 9.78 inches per month. However, there is clear agreement that the average precipitation is close to constant throughout the year.</p>
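The unit conversion can be checked with a few lines of Python. This is a minimal sketch that assumes a nominal 30-day month; the function name and constants are ours for illustration and do not come from the analysis code.

```python
MM_PER_INCH = 25.4
DAYS_PER_MONTH = 30  # assumption: a nominal 30-day month


def mm_per_day_to_inches_per_month(mm_per_day, days=DAYS_PER_MONTH):
    """Convert a daily rate in millimeters to a monthly total in inches."""
    return mm_per_day * days / MM_PER_INCH


# 8.72 mm/day is the average reported in the text above.
monthly_inches = mm_per_day_to_inches_per_month(8.72)
print(f"{monthly_inches:.2f} inches per month")
```

With these assumptions the result is about 10.3 inches per month, in line with the figure quoted above.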

<p>&nbsp;<img alt="Rupert_PRCP.png" src="r_figures/Rupert_PRCP.png" style="height:400px; width:600px" /></p>

## 3. Data Analysis

We start the analysis by identifying the stations from which the data was collected. From the geo chart below, we observe that there are close to 40 stations and that they are distributed all over the state of Idaho.

<p>&nbsp;<img alt="Geo distribution of stations" src="r_figures/geo_distribution_analysis.png" style="height:400px; width:600px" /></p>

We also analyze how precipitation and snow depth vary over the year for a particular station. For example, we take station USC00105980 and look at precipitation and snow depth during 1949. It is clear that snow and rain do not overlap with each other.

<p>&nbsp;<img alt="PRCP vs SNWD" src="r_figures/prcp_vs_snwd.png" style="height:400px; width:600px" /></p>

We check for sparse (or missing) data and observe that the count of valid points (data points that are not NaN) is high from April to October, with more missing or invalid points in January and December. We observe more or less similar behavior for all the parameters.
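As an illustration of how such a valid-point count can be computed, here is a minimal NumPy sketch. The array layout (one row per station-year, one column per day of the year) and the synthetic data are assumptions made for the example.

```python
import numpy as np

# Hypothetical data: rows are station-years, columns are days of the year.
rng = np.random.default_rng(0)
measurements = rng.normal(size=(6, 365))
measurements[0, :31] = np.nan    # one station-year with January missing
measurements[1, 334:] = np.nan   # one station-year with December missing

# A point is valid when it is not NaN; summing over rows gives a per-day count.
valid_counts = np.sum(~np.isnan(measurements), axis=0)

print(valid_counts[0], valid_counts[180], valid_counts[364])
```

Plotting `valid_counts` against the day of the year gives the kind of curve shown in the figure.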

<p>&nbsp;<img alt="Valid counts for TMAX and TMIN" src="r_figures/valid_TMAX_TMIN.png" style="height:300px; width:900px" /></p>

As in any data analysis task, it is essential to look at the distribution of the data at hand, especially the mean and standard deviation. We plot mean-SD, mean and mean+SD of the snow depth data and find an appreciable deviation from the mean from January to April and from October to December.
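The three curves can be computed per day of the year while ignoring missing values. The sketch below uses synthetic numbers in place of the real SNWD table, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the SNWD table: station-years x days, with gaps.
rng = np.random.default_rng(1)
snwd = rng.gamma(shape=2.0, scale=50.0, size=(20, 365))
snwd[rng.random(snwd.shape) < 0.1] = np.nan

# nanmean / nanstd skip NaN entries, so each day uses only its valid points.
mean = np.nanmean(snwd, axis=0)
std = np.nanstd(snwd, axis=0)
lower, upper = mean - std, mean + std
```

The band between `lower` and `upper` is wide wherever the day-to-day spread across station-years is large.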

<p>&nbsp;<img alt="Data distribution of SNWD" src="r_figures/snwd_mean.png" style="height:400px; width:500px" /></p>


### 3.A. Percentage of variance explained.

In this homework, we reconstruct the data pattern from its eigenvectors with minimal reconstruction error. To that end, we compute the percentage of the variance explained as a function of the number of eigenvectors used. The more variance the eigenvectors explain, the lower the reconstruction error.
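A minimal sketch of this computation is shown below. It assumes a data matrix without missing values (the real data contains NaNs, which need separate handling), and the function name is ours.

```python
import numpy as np


def variance_explained(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors."""
    cov = np.cov(data, rowvar=False)           # np.cov centers the data itself
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigvalsh is ascending; reverse it
    eigvals = np.clip(eigvals, 0, None)        # clip tiny negative rounding errors
    return np.cumsum(eigvals) / np.sum(eigvals)


# Hypothetical data: 200 samples in 10 dimensions with one dominant direction.
rng = np.random.default_rng(2)
data = rng.normal(size=(200, 10)) * np.array([10, 3, 1, 1, 1, 1, 1, 1, 1, 1])
cum = variance_explained(data)
print(np.round(cum[:5], 3))
```

Entry `cum[k - 1]` is the fraction of variance explained by the first `k` eigenvectors, which is the quantity plotted in the figures of this section.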

![VarExplained1.png](r_figures/Rupert_VarExplained1.png)
We see that the top 5 eigenvectors explain 31% of the variance for TMIN, 55% for TOBS and 32% for TMAX.

We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains 50% of the variance in TOBS.

![VarExplained2.png](r_figures/Rupert_VarExplained2.png)

The top 5 eigenvectors explain 10% of the variance for PRCP and 13% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 91% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow depth signal. Based on that, we dig deeper into the PCA analysis for snow depth.




## 4. Snow Depth analysis using PCA

### 4.A Eigen vectors

We plot the mean of SNWD and observe that the snow season in Idaho starts in late October, with snow depth increasing gradually until it peaks in mid-February. After February, snow depth declines steadily and the snow season ends around April.

<p><img alt="Rupert_SNWD_mean_eigs1.png" src="r_figures/Rupert_SNWD_mean_eigs1.png" style="height:600px; width:500px" /></p>

Let's now use our observation on the variance in the data to identify the principal eigenvectors with which we can reconstruct the data pattern with minimal error. The first eigenfunction (eig1) has a shape very similar to the mean function, but in the opposite direction. The interpretation of this shape is that eig1 represents the overall amount of snow above or below the mean, without changing the distribution over time. Eig1 helps to reconstruct the overall distribution of the snow. As a side note, since eig1 is negative, we would expect its coefficient c1 to be negative as well. This is supported by the observation that coefficient 1 (coeff_1) is negative for most of the weather stations.

<p>&nbsp;<img alt="Heatmap distribution" src="r_figures/coeff1_heatmap.png" style="height:300px; width:400px" />

**eig2 and eig3** are similar in the following way. Both oscillate between positive and negative values and stay at zero from May to October. They can be interpreted as follows:
* **eig2:** more snow during March-April and less snow in November-December than the rest of the year.
* **eig3:** more snow from mid-February to May and in November-December than the rest of the year.



### 4.B Eigen values

We plot the eigenvalues as a sanity check: negative eigenvalues would indicate that the data set is corrupted. Based on the graph there are none and, as one would expect, the eigenvalues decrease as we move away from the first principal component.
<p><img alt="Rupert_EigenVal.png" src="r_figures/Rupert_EigenVal.png" style="height:400px; width:600px" /></p>
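The same sanity check can be done numerically. The sketch below runs it on a synthetic covariance matrix; the tolerance is an arbitrary choice meant to absorb floating-point rounding and is not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(100, 8))          # hypothetical stand-in data
cov = np.cov(data, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]         # reorder so the largest comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

tolerance = 1e-10 * eigvals[0]            # assumed tolerance for rounding noise
has_negative = bool(np.any(eigvals < -tolerance))
is_decreasing = bool(np.all(np.diff(eigvals) <= 0))
print(has_negative, is_decreasing)
```

A covariance matrix is positive semi-definite, so a clearly negative eigenvalue points to a problem in how the matrix was assembled.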

### 4.C Reconstruction using eigen vectors

We experiment by reconstructing the data pattern using the first eigenvector, the first two eigenvectors and the first three eigenvectors, and record the coefficients and the residual error for each.
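The reconstruction step can be sketched as a projection onto the first k eigenvectors. In this illustration the residual is defined as the fraction of squared deviation from the mean that is left unexplained; that definition, the function name and the synthetic data are assumptions for the example.

```python
import numpy as np


def reconstruct(x, mean, eigvecs, k):
    """Project x - mean onto the first k eigenvectors (columns of eigvecs)."""
    centered = x - mean
    basis = eigvecs[:, :k]                 # orthonormal columns
    coeffs = basis.T @ centered            # one coefficient per eigenvector
    approx = mean + basis @ coeffs
    residual = np.sum((x - approx) ** 2) / np.sum(centered ** 2)
    return coeffs, approx, residual


# Hypothetical data with decreasing variance along the coordinate axes.
rng = np.random.default_rng(4)
data = rng.normal(size=(300, 12)) * np.linspace(5, 0.5, 12)
mean = data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
eigvecs = eigvecs[:, ::-1]                 # largest eigenvalue first

x = data[0]
residuals = [reconstruct(x, mean, eigvecs, k)[2] for k in (1, 2, 3)]
print(np.round(residuals, 3))
```

Because the eigenvectors are orthonormal, adding one more of them can only lower the residual or leave it unchanged.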

#### Coeff1

Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](r_figures/Rupert_SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](r_figures/Rupert_SNWD_grid_neg_coeff1.png)

As expected, we observe that the best reconstruction is obtained with negative c1 coefficients. Moreover, large positive values of coeff1 correspond to less than average snow, and large negative values correspond to more than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](r_figures/Rupert_SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](r_figures/Rupert_SNWD_grid_neg_coeff2.png)

Large positive and large negative values of coeff2 both correspond to an early snow season (most of the snowfall occurs from November to the end of April). Large positive values of coeff2 indicate high snowfall during March, and large negative values indicate high snowfall during January.

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](r_figures/Rupert_SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](r_figures/Rupert_SNWD_grid_neg_coeff3.png)

Large positive and large negative values of coeff3 both correspond to an early snow season (most of the snowfall occurs from November to the end of April). Large positive values of coeff3 indicate high snowfall during March, and large negative values indicate high snowfall during February.



### 4.D Cumulative distribution of residuals 

We then analyze the cumulative distribution of residuals.

<p>&nbsp;</p>
<p><img alt="Cumulative distribution" src="r_figures/cumul_residuals.png" style="height:300px; width:500px" />

As the graph above shows, the residual error decreases as we include more eigenvectors in the reconstruction. This explains why res3 has lower error than the other two residuals. We also observe a significant difference between the residual distributions, which is explained by the following graph: including eigen_2 and eigen_3 improves the reconstruction accuracy by a significant percentage.

<p><img alt="eigen vector" src="r_figures/Rupert_eigen_dist.png" style="height:300px; width:500px" />

Based on this graph, the cumulative distributions saturate after res_4, and we would not expect to see any difference between the residual distributions beyond res_4.
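For reference, a cumulative distribution of residuals like the ones plotted in this section can be built as an empirical CDF. The residual values below are synthetic and only mimic the situation where adding eigenvectors shrinks every residual.

```python
import numpy as np


def empirical_cdf(values):
    """Return sorted values and the fraction of samples at or below each one."""
    xs = np.sort(np.asarray(values, dtype=float))
    ps = np.arange(1, xs.size + 1) / xs.size
    return xs, ps


rng = np.random.default_rng(5)
res_1 = rng.beta(2, 5, size=500)                   # hypothetical residuals
res_3 = res_1 * rng.uniform(0.3, 0.9, size=500)    # smaller after more eigenvectors

xs1, ps1 = empirical_cdf(res_1)
xs3, ps3 = empirical_cdf(res_3)

# Median of each distribution: first sorted value whose CDF reaches 0.5.
median_1 = xs1[np.searchsorted(ps1, 0.5)]
median_3 = xs3[np.searchsorted(ps3, 0.5)]
```

A curve that rises further to the left, as `res_3` does here, means that more station-years are reconstructed with a small residual.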

### 4.E Geographical distribution of first 3 coefficients. 

We analyze the distribution of coefficients across the different stations in Idaho and also try to understand if there is any correlation between the elevation of the station and the coefficients.

#### Coefficient 1

From the two graphs below, it is clear that the average coefficient 1 is negative for most of the stations. This confirms our intuition that most values of c1 would be negative, owing to the fact that eigenfunction 1 is negative.

<p>&nbsp;</p>

<p><img alt="Geo distribution" src="r_figures/coeff1_geomap.png" style="height:400px; width:800px" /></p>
<p><br />
&nbsp;</p>

<p>&nbsp;<img alt="Heatmap distribution" src="r_figures/coeff1_heatmap.png" style="height:300px; width:400px" />

<p><br />
&nbsp;</p>

We plot a graph to identify the correlation between elevation and coefficient 1, and we find some level of correlation between them: as the elevation increases, c1 decreases. Because eigenfunction 1 is negative, this means that snow is above average at high elevations. Coefficient 1 represents the overall snowfall pattern, with more snow at higher elevations and less at lower elevations.
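The strength of such a relationship can be quantified with a correlation coefficient and a fitted line. The sketch below uses made-up elevations and coefficients, so the numbers it prints say nothing about the real stations.

```python
import numpy as np

# Hypothetical per-station values: elevation in meters and average coeff_1.
rng = np.random.default_rng(6)
elevation = rng.uniform(500, 2500, size=40)
coeff_1 = -1.5 * elevation + rng.normal(0, 400, size=40)

# Pearson correlation and a least-squares line through the scatter plot.
r = np.corrcoef(elevation, coeff_1)[0, 1]
slope, intercept = np.polyfit(elevation, coeff_1, deg=1)
print(f"r = {r:.2f}, slope = {slope:.2f}")
```

A negative `r` and a negative `slope` correspond to the pattern described above, where c1 decreases as elevation increases.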

<img alt="Scatterplot" src="r_figures/coeff1_scatter.png" style="height:300px; width:400px" /></p>


#### Coefficient 2

<p>&nbsp;</p>

<p><img alt="Geo distribution" src="r_figures/coeff2_geomap.png" style="height:400px; width:800px" /></p>
<p><br />
&nbsp;</p>

<p>&nbsp;<img alt="Heatmap distribution" src="r_figures/coeff2_heatmap.png" style="height:300px; width:400px" />

<p><br />
&nbsp;</p>

From the two graphs above, it is clear that there is a mix of negative and positive coefficients. Let's analyze this observation by looking more closely at eigenfunction 2 and its coefficients.

Eigen function:
<img alt="EigenFunction" src="r_figures/Rupert_eigenFunc.png" style="height:300px; width:600px" /></p>


Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](r_figures/Rupert_SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](r_figures/Rupert_SNWD_grid_neg_coeff2.png)

Eigenfunction 2 is positive from mid-October to December. However, in our data we observe very heavy snow during October-December and heavy snow from January to April. This implies that we should observe negative coefficients for October-December and positive coefficients for January to April. This explains why we see a mix of negative and positive coefficients across stations.

We plot a graph to identify the correlation between elevation and coefficient 2, and we find that there is not much correlation between them. Coefficient 2 is almost constant irrespective of the elevation. So, coefficient 2 explains the marginal variation of snowfall from the mean at different elevations.

<img alt="Scatterplot" src="r_figures/coeff2_scatter.png" style="height:300px; width:400px" /></p>


#### Coefficient 3

From the two graphs below, it is clear that there is a large number of high positive coefficients. This is again because eigenfunction 3 is positive for most of the year (from October to December and from mid-February to April). Hence, we observe more positive coefficients. Also, because it is negative from January to mid-February, we see a few negative coefficients.

<p>&nbsp;</p>

<p><img alt="Geo distribution" src="r_figures/coeff3_geomap.png" style="height:400px; width:800px" /></p>
<p><br />
&nbsp;</p>

<p>&nbsp;<img alt="Heatmap distribution" src="r_figures/coeff3_heatmap.png" style="height:300px; width:400px" />

<p><br />
&nbsp;</p>

We plot a graph to identify the correlation between elevation and coefficient 3, and we find some level of correlation between them: as the elevation increases, c3 increases, and thereby the snow is above average at high elevations. Coefficient 3 represents the deviation of the snowfall from the mean at different elevations.

<img alt="Scatterplot" src="r_figures/coeff3_scatter.png" style="height:300px; width:400px" /></p>

### 4.F The variation of the timing of snow 

In the previous section we discussed the variation of Coeff1, which corresponds to the total amount of snow, with respect to location (elevation). We now estimate the relative importance of location-to-location variation relative to year-by-year variation.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. 

Here are the results:

#### Coeff_1
total RMS = 2765.11652981

RMS removing mean-by-station= 1528.810233, percentage explained=44.7%

RMS removing mean-by-year = 2144.76650138, percentage explained=22.43%

#### Coeff_2
total RMS = 848.69917793

RMS removing mean-by-station= 828.268337939, percentage explained= 2.4%

RMS removing mean-by-year = 540.693118362, percentage explained= 36.29%

#### Coeff_3
total RMS = 784.218123513

RMS removing mean-by-station= 670.921326458, percentage explained= 14.44%

RMS removing mean-by-year = 588.677075016, percentage explained= 24.93%

We see that for coeff_1, which has to do with the total snowfall pattern, the variation by station explains more than the variation by year (44.7% vs. 22.43%). The situation is reversed for coeff_2 and coeff_3, which, as we saw above, have to do with the timing of snowfall: there the variation by year explains more (36.29% vs. 2.4% and 24.93% vs. 14.44%). In other words, the total amount of snow varies mostly from station to station, while the timing of snow varies mostly from year to year, since most of the station-to-station variance is captured by coeff_1.
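The percentages in the listing above are consistent with computing one minus the ratio of the RMS after removing a mean to the total RMS. The sketch below applies that reading to a synthetic station-by-year table; the table layout, the function names and the exact definition of "explained" are assumptions inferred from the reported numbers.

```python
import numpy as np


def rms(a):
    """Root mean square over all non-NaN entries."""
    return np.sqrt(np.nanmean(a ** 2))


def fraction_explained(table):
    """table: rows are stations, columns are years, NaN marks missing entries."""
    total = rms(table)
    no_station = rms(table - np.nanmean(table, axis=1, keepdims=True))
    no_year = rms(table - np.nanmean(table, axis=0, keepdims=True))
    return 1 - no_station / total, 1 - no_year / total


# Hypothetical coefficient table dominated by a per-station effect.
rng = np.random.default_rng(7)
station_effect = rng.normal(0, 3.0, size=(30, 1))
year_effect = rng.normal(0, 1.0, size=(1, 50))
table = station_effect + year_effect + rng.normal(0, 0.5, size=(30, 50))
table[0, 0] = np.nan                      # a missing station-year
by_station, by_year = fraction_explained(table)

# Same formula applied to the coeff_1 figures reported in the listing.
reported = 1 - 1528.810233 / 2765.11652981
print(f"{by_station:.3f} {by_year:.3f} {reported:.3f}")
```

In the synthetic table the station effect dominates, so removing the per-station mean explains more than removing the per-year mean, which is the same qualitative picture as for coeff_1.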





## 5. Precipitation at different stations

We want to find a statistical test for rejecting the null hypothesis that the rainfall at two locations is independent.

<p><img alt="Cumulative distribution of precipitation" src="r_figures/cumul_precip.png" style="height:300px; width:500px" />

It is likely to be hard to find correlations between the amount of rain on the same day at different stations, because the amount of rain varies a lot even between nearby locations. It is more reasonable to compare whether or not it rained on the same day at different stations. As we see from the graph above, in our region it rains on about one eighth of the days.
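One possible way to test whether two stations share more rain days than chance would allow is a hypergeometric tail probability. The sketch below illustrates that idea under the null hypothesis of independence; it is not necessarily the computation behind the figures in this section, and the function name and example counts are hypothetical.

```python
from math import comb, log


def overlap_p_value(m, n1, n2, k):
    """P(overlap >= k) if n1 and n2 rain days fall independently among m shared days."""
    total = comb(m, n2)
    upper = min(n1, n2)
    # Count placements of station 2's rain days with j of them on station 1's rain days.
    tail = sum(comb(n1, j) * comb(m - n1, n2 - j) for j in range(k, upper + 1))
    return tail / total


# Hypothetical example: 365 shared days, rain on about one eighth of them at each station.
p_chance = overlap_p_value(365, 46, 46, 6)     # overlap near what chance predicts
p_strong = overlap_p_value(365, 46, 46, 30)    # overlap far above what chance predicts
log_p_strong = log(p_strong)
print(p_chance, p_strong)
```

A very small probability, or equivalently a very negative log probability, is evidence against the null hypothesis that the two stations get rain independently.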

<p><img alt="Dependency matrix" src="r_figures/rependency_matrix.png" style="height:500px; width:500px" />


The dependency matrix above shows, for each pair of stations, the normalized log probability that the overlap in rain days is random. However, we do not observe any obvious correlation between the stations in this form. To find more correlations we use SVD, and we see that the top 10 eigenvectors explain about 90% of the square magnitude of the matrix.

<p><img alt="Eigenvectors" src="r_figures/precip_variance.png" style="height:300px; width:500px" />

Next, we plot heatmaps based on the principal components and analyze the correlation between the stations. The four plots show the correlation between the stations based on the first, second, third and fourth principal components, respectively. We observe a high level of correlation among the first 8-10 stations and among stations 21-40, as is evident from the block-diagonal clustering in the heatmaps.

<p><img alt="Correlation" src="r_figures/precip_heatmap.png" style="height:500px; width:500px" />

## 6. Conclusion

In summary, we analyzed the weather of the state of Idaho using six different measurements and observed that most of the snow depth pattern can be reconstructed with 90% accuracy using the first 3 eigenvectors. Based on the analysis above, there is station-to-station variation in the snow depth, which is explained by the first eigenvector. Moreover, we observed strong correlation between the different stations in terms of precipitation.