# Eastern California Mountain Area Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the eastern mountain area of California and western mountain area of Nevada.

The data we use here comes from NOAA.

We focused on six measurements:
- **TMIN, TMAX**: the daily minimum and maximum temperature.
- **TOBS**: the temperature at the time of observation each day.
- **PRCP**: Daily precipitation (in mm)
- **SNOW**: Daily snowfall (in mm)
- **SNWD**: The depth of accumulated snow.

## Sanity-check: comparison with outside sources
We start by comparing some of the general statistics with graphs that we obtained from a site called [US Climate Data](http://www.usclimatedata.com/climate/yosemite-natl-park/california/united-states/usca1269).
The map below shows the specific area where our data comes from.
<img src="hw5_figures/map1.png" width="700" height="500" alt=""/>

The graphs below show the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month, for the Yosemite National Park area and the Tonopah area.
<img src="hw5_figures/Yosemite.png" width="500" height="500" alt=""/>
<img src="hw5_figures/Tonopah.png" width="500" height="500" alt=""/>
We see that the min and max daily temperature agree with the ones we got from our data, once we translate Fahrenheit to Centigrade.
<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="hw5_figures/TMIN,TMAX.png" style="height:300px; width:800px" /></p>
<p>To compare the precipitation we need to translate millimeters/day to inches/month. According to our analysis the average rainfall is 1.38 mm/day, which translates to about 1.62 inches per month. According to US Climate Data the rainfall differs between the eastern California mountains and the western Nevada mountains: the average rainfall in Yosemite National Park is close to 2.9 inches per month, while the average rainfall in Tonopah, Nevada is only 0.43 inches per month. The reason is that the California side lies on the windward slope of the mountains while the Nevada side lies on the leeward slope. Satellite imagery also shows more vegetation on the California side than on the Nevada side. In most of the area, the bulk of the precipitation falls between October and the following March.</p>
<p>&nbsp;<img alt="PRCP.png" src="hw5_figures/PRCP.png" style="height:450px; width:600px" /></p>
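The unit conversion used in the comparison above can be sketched as follows. This is a minimal illustration that assumes a 30-day month; the function name and the month length are our own choices, not part of the original analysis, and the result shifts slightly with a different month length or a less rounded daily rate.

```python
MM_PER_INCH = 25.4
DAYS_PER_MONTH = 30  # assumption: average month length used for the conversion


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=DAYS_PER_MONTH):
    """Convert a daily precipitation rate in mm to a monthly total in inches."""
    return mm_per_day * days_per_month / MM_PER_INCH


# Average daily rainfall reported in this analysis.
print(round(mm_per_day_to_inches_per_month(1.38), 2))
```

With these assumptions the result is roughly 1.6 inches per month, which sits between the Yosemite and Tonopah figures quoted above.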

## PCA analysis
For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
### Percentage of variance explained.
![VarExplained1.png](hw5_figures/VarExplained1.png)
We see that the top 5 eigen-vectors explain 62% of variance for TMIN, 70% for TOBS and 58% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 60% of the variance.
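As a rough sketch of how such a percentage can be computed, the snippet below takes the eigenvalues of the covariance matrix and reports the share captured by the top k. It assumes a data matrix with one row per station-year and no missing values; the synthetic data and the name `variance_explained` are illustrative only and are not taken from our pipeline.

```python
import numpy as np


def variance_explained(X, k):
    """Fraction of total variance captured by the top-k eigenvectors of cov(X).

    X has shape (n_samples, n_days), one row per station-year (assumed NaN-free).
    """
    cov = np.cov(X, rowvar=False)             # (n_days, n_days) covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    return eigvals[:k].sum() / eigvals.sum()


# Synthetic example: one dominant seasonal shape plus independent noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 20)
X = 5.0 * rng.normal(size=(200, 1)) * np.sin(t) + rng.normal(size=(200, 20))

for k in (1, 5, 20):
    print(k, round(variance_explained(X, k), 3))
```

A curve of this quantity against k is what the variance-explained plots show for each measurement.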

![VarExplained2.png](hw5_figures/VarExplained2.png)

The top 5 eigenvectors explain 15% of the variance for PRCP and 14% for SNOW. Both are low values. On the other hand the top 5 eigenvectors explain more than 80% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow signals. Based on that we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
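The smoothing effect of a decaying integral can be illustrated with a small simulation. Everything here is hypothetical: the snowfall distribution, the daily decay factor of 0.95, and the variability measure are assumptions chosen only to show the idea, not a model fitted to our data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 2000

# Hypothetical daily snowfall: zero on most days, occasional storms.
snow = rng.exponential(scale=50.0, size=n_days) * (rng.random(n_days) < 0.2)

decay = 0.95  # assumed fraction of the snow pack that survives each day
depth = np.zeros(n_days)
for day in range(1, n_days):
    # Snow depth as a decaying running sum of snowfall.
    depth[day] = decay * depth[day - 1] + snow[day]


def relative_daily_change(x):
    """Standard deviation of day-to-day differences, relative to the mean level."""
    return np.diff(x).std() / x.mean()


print("SNOW-like series:", relative_daily_change(snow))
print("SNWD-like series:", relative_daily_change(depth))
```

Under these assumptions the depth series changes far less from day to day, relative to its level, than the snowfall series does.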

## Analysis of snow depth
We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season is from November to the end of May, where the middle of February marks the peak of the snow-depth.
![SNWD_mean_eigs.png](hw5_figures/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The absolute value of the first eigen-function (eig1) has a shape similar to the mean function. The difference is that eig1 is below zero while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** fast decrease of snow depth in March, more snow from mid-February to March.
* **eig3:** more snow in January, less snow in February, slightly more snow in March.
* **eig4:** more snow in December, more snow at the start of February, less snow at the end of February, slightly more snow in March.

### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](hw5_figures/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](hw5_figures/SNWD_grid_neg_coeff1.png)

Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](hw5_figures/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](hw5_figures/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to an early snow season (most of the snowfall is after mid-February). Negative values of coeff2 correspond to a late snow season (most of the snow is after March).
#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](hw5_figures/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](hw5_figures/SNWD_grid_neg_coeff3.png)

Large positive values of coeff3 correspond to a snow season with two spikes: one at the start of January, the other at the end of February. Negative values of coeff3 correspond to a season with a single peak at the end of January.

### Analysis of the distribution of the residuals

**Distribution of Residual_Mean, Residual_1, Residual_2, Residual_3**
![Distribution_res.png](hw5_figures/Distribution_res.png)

**Distribution of Coeff_1, Coeff_2, Coeff_3**
![Distribution_coeff.png](hw5_figures/Distribution_coeff.png)

The distributions of the residuals show that, as we add more eigenvectors to the reconstruction, the residual for the same number of instances becomes smaller.

The distribution of the coefficients shows that coeff_1 is negative in most cases; recall that the first eigenvector has the shape of the mean function but with negative values.
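To make the residuals concrete, here is a minimal sketch of how they could be computed for a single station-year. It assumes the residuals are normalized by the squared norm of the original vector and that the eigenvectors are orthonormal; the names `residual_fractions`, `fractions` and `coeffs`, and the synthetic data, are illustrative and not taken from our code.

```python
import numpy as np


def residual_fractions(x, mean, eigvecs):
    """Residual energy after removing the mean, then the top-1, top-2, ... eigenvectors.

    x, mean: arrays of shape (n_days,).
    eigvecs: array of shape (n_days, K) with orthonormal columns.
    Returns (fractions, coeffs); fractions = [res_mean, res_1, ..., res_K],
    each relative to ||x||^2 (an assumed normalization).
    """
    total = np.dot(x, x)
    centered = x - mean
    fractions = [np.dot(centered, centered) / total]
    coeffs = eigvecs.T @ centered              # projection coefficients
    for k in range(1, eigvecs.shape[1] + 1):
        recon = eigvecs[:, :k] @ coeffs[:k]    # reconstruction from the top-k eigenvectors
        r = centered - recon
        fractions.append(np.dot(r, r) / total)
    return np.array(fractions), coeffs


# Synthetic example with hypothetical orthonormal "eigenvectors" from a QR factorization.
rng = np.random.default_rng(0)
n_days, K = 365, 3
eigvecs, _ = np.linalg.qr(rng.normal(size=(n_days, K)))
mean = np.full(n_days, 100.0)
x = mean + eigvecs @ np.array([-300.0, 120.0, 40.0]) + rng.normal(scale=5.0, size=n_days)

fractions, coeffs = residual_fractions(x, mean, eigvecs)
print("residual fractions:", fractions)
print("coefficients:", coeffs)
```

Because each added eigenvector removes the squared coefficient along that direction, the residual can only stay the same or shrink as more eigenvectors are used.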

## Analyze whether precipitation/snow depth varies more from year to year or from place to place

The data from US Climate Data tells us that the eastern mountain area of California receives much more rainfall than the western mountain area of Nevada because of geographical factors. However, we are not sure whether precipitation and snow depth vary more from year to year or from place to place, so we estimate the effect of the year vs. the effect of the station.

To estimate the effect of time vs. location on the first eigenvector coefficient we
compute:

* The average row: `mean-by-station`
* The average column: `mean-by-year`

The resulting estimates are as follows (for coeff_2):

**For Precipitation:**

*    Total RMS                   =  210.717922248

*    RMS removing mean-by-station=  207.78719153, fraction explained = 1.4%

*    RMS removing mean-by-year   =  89.8353828466, fraction explained = 57.4%

**For Snow depth:**

*    Total RMS                   =  1229.5296533

*    RMS removing mean-by-station=  1090.23742062, fraction explained = 11.3%

*    RMS removing mean-by-year   =  859.293116903, fraction explained = 30.1%

From these results we see that, for both PRCP and SNWD, the RMS that remains after removing the mean-by-station is larger than the RMS that remains after removing the mean-by-year. In other words, the year explains more of the variation than the station: PRCP and SNWD vary more from year to year than from place to place.
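The comparison above can be sketched as follows. This is an illustration under stated assumptions: the coefficients are arranged in a table with one row per year and one column per station, missing entries are NaN, and the fraction explained is the percentage reduction in RMS. The function names and the synthetic table are ours, not the original code.

```python
import numpy as np


def rms(a):
    """Root mean square of an array, ignoring NaN entries."""
    return np.sqrt(np.nanmean(a ** 2))


def fraction_explained(total_rms, remaining_rms):
    """Percentage of the total RMS removed by subtracting a mean."""
    return 100.0 * (1.0 - remaining_rms / total_rms)


def year_vs_station(table):
    """table: (n_years, n_stations) array of coefficients, NaN where missing."""
    total = rms(table)
    mean_by_year = np.nanmean(table, axis=1, keepdims=True)     # one value per year
    mean_by_station = np.nanmean(table, axis=0, keepdims=True)  # one value per station
    return {
        "total_rms": total,
        "explained_by_year": fraction_explained(total, rms(table - mean_by_year)),
        "explained_by_station": fraction_explained(total, rms(table - mean_by_station)),
    }


# Synthetic table with a strong year effect and a weak station effect.
rng = np.random.default_rng(0)
table = (rng.normal(0, 10, size=(30, 1))      # year effect
         + rng.normal(0, 1, size=(1, 8))      # station effect
         + rng.normal(0, 2, size=(30, 8)))    # noise
table[rng.random(table.shape) < 0.1] = np.nan  # some missing measurements

print(year_vs_station(table))

# The same formula reproduces the percentages reported for precipitation.
print(fraction_explained(210.717922248, 207.78719153))
print(fraction_explained(210.717922248, 89.8353828466))
```

In the synthetic table the year effect dominates by construction, so removing the mean-by-year explains far more of the RMS than removing the mean-by-station.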

### Measuring statistical significance
We want to find a statistical test for rejecting the null hypothesis that says that the rainfall in the two locations is independent.

Using the inner product is too noisy, because you multiply the rainfall on the same day in two locations and that product can be very large - leading to a large variance and poor ability to discriminate.

An alternative is to ignore the amount of rain, and just ask whether it rained in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.

Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days on which it rained at station 1.
* $n_2$ : the number of days on which it rained at station 2.
* $l$ : the number of days on which it rained at both stations.

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l}}{{m \choose n_1}{m \choose n_2}}
$$

Where
$$
{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$, which gives:
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
which simplifies slightly to
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the length of the overlap. So to get a per-day significance we consider $\frac{1}{m} \log P$.
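The simplified expression for $\log P$ and its per-day normalization can be sketched with the standard library's `math.lgamma`. The function names are ours, and the sketch assumes that $l$ lies in the valid range $\max(0, n_1+n_2-m) \le l \le \min(n_1, n_2)$.

```python
from math import exp, lgamma


def G(n):
    """log(n!) computed as log Gamma(n + 1)."""
    return lgamma(n + 1)


def log_prob_overlap(m, n1, n2, l):
    """log P of exactly l overlapping rain days, given m, n1 and n2."""
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))


def normalized_log_prob(m, n1, n2, l):
    """Per-day significance: (1/m) * log P."""
    return log_prob_overlap(m, n1, n2, l) / m


# Small example: 10 shared days, 4 rain days at station 1, 5 at station 2.
m, n1, n2 = 10, 4, 5
probs = [exp(log_prob_overlap(m, n1, n2, l)) for l in range(0, min(n1, n2) + 1)]
print(probs, sum(probs))
print(normalized_log_prob(m, n1, n2, 4))
```

Working in log space avoids overflow in the factorials when $m$ is in the thousands, and summing the probabilities over all valid $l$ gives 1, which is a useful check on the formula.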

We calculate the normalized log probability for each pair of stations and draw the dependency matrix.
![Dependency Matrix.png](hw5_figures/Dependency%20Matrix.png)
The matrix shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.
We see immediately that the 3rd to 10th stations are highly correlated with each other.
To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). The plot below shows that the top 8 eigenvectors explain nearly 80% of the square magnitude of the matrix.
![Eig_8.png](hw5_figures/Eig_8.png)
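The fraction of the square magnitude explained by the top k components can be sketched from the singular values. This assumes the dependency matrix has no missing entries (for example, pairs without overlap filled with 0); the synthetic matrix and the function name are illustrative only.

```python
import numpy as np


def squared_magnitude_explained(M, k):
    """Share of the squared Frobenius norm of M captured by its top-k singular values."""
    s = np.linalg.svd(M, compute_uv=False)   # singular values, largest first
    return (s[:k] ** 2).sum() / (s ** 2).sum()


# Synthetic symmetric matrix that is close to rank 3.
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 3))
noise = rng.normal(size=(12, 12))
M = A @ A.T + 0.01 * (noise + noise.T)

for k in (1, 3, 12):
    print(k, round(squared_magnitude_explained(M, k), 4))
```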

The histogram below shows the P_norm values obtained for every pair of stations:

![pValue.png](hw5_figures/pValue.png)

As we can see from the distribution, most of the p-values are larger than 0.05, so in most cases we cannot reject the null hypothesis that rain on any given day is independent at the two stations in question. There is only some weak evidence that rain on a given day is dependent between two stations.

The top 4 eigenvectors:
![EigVector_4.png](hw5_figures/EigVector_4.png)
The dependency matrix:
![P_norm_4.png](hw5_figures/P_norm_4.png)

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. We find that in most cases the stations at positions 0-10 are strongly correlated with each other.
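The reordering step can be sketched as follows. This is a toy illustration, not our actual code: it uses `numpy.linalg.eigh` on a small symmetric matrix (an assumption that fits a symmetric dependency matrix) and sorts the stations by their entry in the leading eigenvector.

```python
import numpy as np


def reorder_by_vector(M, v):
    """Permute the rows and columns of M by the sorted order of the entries of v."""
    order = np.argsort(v)
    return M[np.ix_(order, order)], order


# Toy dependency matrix: stations 0, 2, 4 form a strongly dependent group,
# stations 1, 3, 5 form a weaker one, and the two groups are interleaved.
M = np.zeros((6, 6))
strong, weak = [0, 2, 4], [1, 3, 5]
M[np.ix_(strong, strong)] = 1.0
M[np.ix_(weak, weak)] = 0.5

eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
top = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

M_sorted, order = reorder_by_vector(M, top)
print(order)
print(M_sorted)
```

After the permutation the strongly dependent stations occupy adjacent rows and columns, so the group shows up as a solid block on the diagonal.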

We plot stations 0-10 on the map to see their spatial relationship:
<img src="hw5_figures/Top_10.png" width="700" height="700" alt=""/>