# Montana Weather Analysis

This report presents a historical analysis of weather patterns in an area that approximately overlaps the state of Montana.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/montana/united-states/919" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p>First, the temperature data comparison:</p>

<p>&nbsp;</p>

<p><img alt="monota.png" src="r_figures/monota.png" style="height:400px; width:600px"/></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily&nbsp;temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius.
(Our data is scaled by a factor of 10, perhaps to preserve precision.)</p>
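
The unit handling behind this comparison can be sketched as follows. The function names and sample values are illustrative and not part of the original analysis, and the factor of 10 rests on the assumption that the raw readings are stored in tenths of a degree Celsius.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def tenths_to_celsius(raw_value):
    """Convert a raw reading to degrees Celsius.

    Assumption: the dataset stores temperatures scaled by a factor of 10.
    """
    return raw_value / 10.0


# Hypothetical values, for illustration only.
external_f = 50.0   # a temperature read off an external Fahrenheit chart
raw_reading = 100   # a raw dataset value, assumed to be in tenths of a degree

print(fahrenheit_to_celsius(external_f))
print(tenths_to_celsius(raw_reading))
```

Once both sources are expressed in degrees Celsius, the monthly curves can be compared directly.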

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="r_figures/TMIN,TMAX.png" style="height:300px; width:600px" /></p>

<p>&nbsp;</p>

<p>However, the distribution of our data is a little more negatively skewed than that of US Climate Data.
To make the comparison more convincing, we use data from a third source, <a href="http://www.worldclimate.com/climate/us/montana/roy" target="_blank">World Climate</a>:
</p>

<p><img alt="roy_t.png" src="r_figures/roy_t.png" style="height:400px; width:700px" /></p>

<p>&nbsp;</p>

<p>This one is closer to the US Climate Data. One possible reason for the negative skew in our data is the greenhouse effect. Since the dataset we are analyzing contains a lot of historical data, it could be expected to skew colder, because the greenhouse effect was less pronounced in earlier decades.</p>

<p>&nbsp;</p>

<p>Now let's compare precipitation data:</p>

<p>&nbsp;</p>

<p>&nbsp;<img alt="PRCP.png" src="r_figures/PRCP.png" style="height:450px; width:600px" /></p>

<p>
One thing that attracts our attention is that the data close to June fluctuates strongly. The data from US Climate Data shows a similar pattern: precipitation stays roughly constant at about 0.5-0.6 inches, rises to about 2 inches as June approaches, and falls back to the 0.5-0.6 inch level after July.
</p>





## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.
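
As a minimal sketch of this computation, the snippet below builds a covariance matrix, sorts its eigenvalues, and reports the cumulative fraction of variance explained. The data layout (one row per station-year, one column per day), the synthetic data, and the function name are assumptions made for illustration. Real measurements contain missing values, which this sketch does not handle.

```python
import numpy as np


def variance_explained(data, k_max):
    """Cumulative fraction of variance explained by the top 1..k_max eigenvectors.

    data: array of shape (n_samples, n_features), assumed to have no missing values.
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    return np.cumsum(eigvals[:k_max]) / eigvals.sum()


# Synthetic stand-in: one strong seasonal component plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
season = np.sin(2 * np.pi * days / 365)
data = rng.normal(size=(200, 1)) * season + 0.3 * rng.normal(size=(200, 365))

explained = variance_explained(data, k_max=5)
print(explained)
```

Plotting `explained` against the number of eigenvectors gives a curve of the same kind as the figures in this section.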

### Percentage of variance explained.
![VarExplained1.png](r_figures/VarExplained1.png)
We see that the top 5 eigenvectors explain 25% of the variance for TMIN, 33% for TOBS and 27% for TMAX.

We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains 25% of the variance.

![VarExplained2.png](r_figures/VarExplained2.png)

The top 5 eigenvectors explain 14% of the variance for PRCP and 15% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 72% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow-depth signal. Based on that, we will dig deeper into the PCA analysis for snow depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
The <a href="https://www.nohrsc.noaa.gov/nsa/" target="_blank">National Snow Analysis</a> site shows SNOW and SNWD evolving over time, which illustrates well the idea that SNWD data is less noisy.
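
The "decaying integral" idea can be illustrated with a toy model. This is a sketch under stated assumptions, not the physical process: snow depth is modeled as `depth[t] = decay * depth[t-1] + snowfall[t]`, the decay factor of 0.9 is arbitrary, and the snowfall series is synthetic. Lag-1 autocorrelation is used here as one possible measure of how smoothly a series varies from day to day.

```python
import random


def accumulate_with_decay(snowfall, decay=0.9):
    """Model snow depth as a decaying running sum of daily snowfall.

    Assumption: the decay factor stands in for melting and compaction.
    """
    depth, current = [], 0.0
    for s in snowfall:
        current = decay * current + s
        depth.append(current)
    return depth


def lag1_autocorrelation(series):
    """Correlation between the series and itself shifted by one day."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return cov / var


rng = random.Random(0)
# Synthetic snowfall: it snows on roughly one day in five.
snow = [rng.expovariate(0.1) if rng.random() < 0.2 else 0.0 for _ in range(2000)]
snwd = accumulate_with_decay(snow)

print("SNOW lag-1 autocorrelation:", round(lag1_autocorrelation(snow), 2))
print("SNWD lag-1 autocorrelation:", round(lag1_autocorrelation(snwd), 2))
```

In this toy model the depth series is much more strongly autocorrelated than the snowfall series, which is the sense in which an accumulated quantity is smoother than the daily increments that feed it.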


## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow depth because, as noted above, its top 5 eigenvectors explain 72% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from mid-November to the end of March, with the middle of February marking the peak of the snow depth.
![SNWD_mean_eigs.png](r_figures/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow in Jan - mid Feb, more snow in mid Feb - March.
* **eig3:** more snow in Jan, less snow in Feb, slightly more snow in March.
* **eig4:** more snow in Dec, more snow at the start of Feb, less snow at the end of Feb, more snow in March.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](r_figures/1_p.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](r_figures/1_n.png)
Large positive values of coeff1 correspond to more than average snow. Large negative values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](r_figures/2_p.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](r_figures/2_n.png)

Large positive values of coeff2 correspond to a late snow season (most of the snowfall is after mid-Feb). Negative values of coeff2 correspond to an early snow season (most of the snow is before mid-Feb).
#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](r_figures/3_p.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](r_figures/3_n.png)

Large positive values of coeff3 correspond to a snow season with two spikes: one at the start of January, the other at the end of February. Negative values of coeff3 correspond to a season with a single peak at the end of January.
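
The reconstructions shown in this section follow the usual PCA recipe: each coefficient is the projection of the mean-subtracted signal onto an eigenvector, and the approximation is the mean plus the weighted sum of eigenvectors. The sketch below applies this to synthetic data. The function name, the bump-shaped stand-in for snow depth, and the array layout are assumptions made for illustration.

```python
import numpy as np


def reconstruct(signal, mean, eigvecs, k):
    """Approximate `signal` using the mean and the top-k eigenvectors.

    eigvecs: array of shape (n_days, n_components) with orthonormal columns,
             ordered from largest to smallest eigenvalue.
    Returns the coefficients and the reconstructed signal.
    """
    basis = eigvecs[:, :k]
    coeffs = basis.T @ (signal - mean)   # coeff_i = <signal - mean, eig_i>
    approx = mean + basis @ coeffs       # mean + sum_i coeff_i * eig_i
    return coeffs, approx


# Synthetic stand-in for snow-depth data: 100 "years" of 365 days each.
rng = np.random.default_rng(1)
days = np.arange(365)
bump = np.exp(-((days - 45) / 30.0) ** 2)   # a single winter peak (illustrative)
years = rng.uniform(0.5, 1.5, size=(100, 1)) * bump + 0.05 * rng.normal(size=(100, 365))

mean = years.mean(axis=0)
cov = np.cov(years, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvecs = eigvecs[:, ::-1]                  # largest eigenvalue first

coeffs, approx = reconstruct(years[0], mean, eigvecs, k=3)
residual = np.linalg.norm(years[0] - approx) / np.linalg.norm(years[0] - mean)
print(coeffs, residual)
```

Sorting station-years by a single coefficient, as in the grids above, then picks out the years where the corresponding eigenvector contributes most strongly in either direction.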



## Analysis of correlation between precipitation across locations

### Definition of statistical test
<p>
A statistical test is a mechanism that enables us to make quantitative decisions about a hypothesis.
Generally speaking, the basic steps of a statistical test are:
<p></p>
1) State the null hypothesis, e.g. the mean precipitation in Montana in 1908 is 2 inches.
<p></p>
2) State the alternative hypothesis, e.g. the mean precipitation in Montana in 1908 is not 2 inches.
<p></p>
3) Set the significance level α, e.g. α = 0.05, which corresponds to a 95% confidence level.
<p></p>
4) Collect data.
<p></p>
5) Calculate a test statistic.
<p></p>
6) Construct an acceptance region, e.g. mean ± 1.96 × standard error.
<p></p>
7) Reach a conclusion.
</p>
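
The seven steps above can be walked through on synthetic data. The snippet is a sketch only: the sample is simulated rather than taken from the weather dataset, the hypothesized mean of 2.0 is arbitrary, and a normal approximation is assumed because the sample is large.

```python
import math
import random
import statistics

# Steps 1-2: H0: mean == 2.0 versus H1: mean != 2.0 (value is illustrative).
mu0 = 2.0
# Step 3: significance level.
alpha = 0.05
z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

# Step 4: collect data (synthetic here, drawn with true mean 2.5).
rng = random.Random(0)
sample = [rng.gauss(2.5, 1.0) for _ in range(400)]

# Step 5: test statistic, using the normal approximation for a large sample.
sample_mean = statistics.fmean(sample)
std_err = statistics.stdev(sample) / math.sqrt(len(sample))
z = (sample_mean - mu0) / std_err

# Step 6: acceptance region for the sample mean under H0.
lower, upper = mu0 - z_crit * std_err, mu0 + z_crit * std_err

# Step 7: conclusion.
reject_h0 = not (lower <= sample_mean <= upper)
print(f"z = {z:.2f}, acceptance region = [{lower:.3f}, {upper:.3f}], reject H0: {reject_h0}")
```

Checking whether the sample mean falls outside the acceptance region is equivalent to checking whether the absolute value of the test statistic exceeds the critical value.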

### Correlation matrix
In this part, we first describe how to compute correlations, and then we perform a significance test on the correlations.
#### Computing correlations
To compute the correlations, we first compute the covariance matrix using the following formula:
$$
E(x x^T)-E(x)E(x)^T
$$
If the data that we have is $x_1, x_2, \ldots, x_n$, then the estimates we use are:
$$
\hat{E}(x x^T) = \frac{1}{n} \sum_{i=1}^n x_i x_i^T,\;\;\;\;\;
\hat{E}(x) = \frac{1}{n} \sum_{i=1}^n x_i
$$
After we obtain the covariance matrix, we compute each correlation by dividing the covariance by the product of the corresponding standard deviations, $\rho_{ij} = \mathrm{Cov}_{ij} / (\sigma_i \sigma_j)$. All of this can be done by the function `computeVar` in homework4.
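
The formulas above translate almost line by line into NumPy. The following is a sketch on synthetic data and is not the `computeVar` function mentioned above. The function name and the assumption of no missing values are ours.

```python
import numpy as np


def covariance_and_correlation(X):
    """Estimate covariance and correlation matrices from the rows of X.

    X: array of shape (n, d), one observation vector x_i per row.
    Assumes no missing values; real station data would need NaN handling.
    """
    n = X.shape[0]
    E_xxT = X.T @ X / n                 # (1/n) * sum_i x_i x_i^T
    E_x = X.mean(axis=0)                # (1/n) * sum_i x_i
    cov = E_xxT - np.outer(E_x, E_x)    # E(x x^T) - E(x) E(x)^T
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)     # divide by the product of standard deviations
    return cov, corr


# Synthetic data: three "stations", the first two correlated with each other.
rng = np.random.default_rng(0)
base = rng.normal(size=(5000, 1))
X = np.hstack([base + 0.5 * rng.normal(size=(5000, 1)),
               base + 0.5 * rng.normal(size=(5000, 1)),
               rng.normal(size=(5000, 1))])

cov, corr = covariance_and_correlation(X)
print(np.round(corr, 2))
```

The result can be cross-checked against `np.corrcoef(X, rowvar=False)`, which should agree up to floating-point error.
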
#### Significance test
After the calculation above, we want to perform a significance test to see whether the low correlation values are significantly different from zero. The test statistic we use is:
<p><img alt="test.png" src="r_figures/test.png" style="height:100px; width:200px"/></p>
In our case, n is 43800.
<p>
As an example, the correlation between USC00246001 and USW00024036 is 0.34. Performing the test gives a t-value of 122.0328, so at the 95% confidence level the correlation between these two stations is significantly different from 0. This test can help us find out whether two stations are truly unrelated to each other.
</p>
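
For readers who want to experiment, here is a sketch of one common test for a correlation coefficient: the t-statistic for the null hypothesis that the true correlation is zero. This is an assumption about the general approach and may not match the exact formula in the figure above. The inputs below are hypothetical and are not taken from the station data.

```python
import math


def correlation_t_statistic(r, n):
    """t-statistic for H0: true correlation is zero, given sample correlation r and sample size n."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical inputs, for illustration only.
r, n = 0.1, 1000
t = correlation_t_statistic(r, n)

# For large n the t distribution is close to the standard normal,
# so 1.96 approximates the two-sided 5% critical value.
significant = abs(t) > 1.96
print(f"t = {t:.3f}, significant at the 5% level: {significant}")
```

With a large sample, even a small correlation can produce a statistic well beyond the critical value, which is why low correlations between stations can still be statistically significant.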

### PCA analysis

Similarly, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
<p><img alt="p_pca.png" src="r_figures/p_pca.png" style="height:300px; width:500px"/></p>

<p>The interesting thing here is that the relationship appears to be linear, which means we can't use the top 5 eigenvectors to explain most of the variance, as we did for SNWD.
To make this more convincing, we set the number of eigenvectors to 300 and compare PRCP with SNOW and SNWD. Here is what we get:
</p>
<p><img alt="pss_pca300.png" src="r_figures/pss_pca300.png" style="height:300px; width:700px"/></p>
<p>
The result above makes sense. As we discussed above, SNWD is less noisy because it is a decaying integral process. On the other hand, SNOW and PRCP are influenced by many more climate factors, which is why their plots behave so similarly.
</p>
