# New Mexico Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of New Mexico.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** the temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/santa-fe/new-mexico/united-states/usnm0292" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>


<p><img alt="Climate_SantaFe_NewMexico.jpeg" src="my_figures/Climate_SantaFe_NewMexico.jpeg" style="height:400px; width:500px"/></p>


<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius. Both graphs show the minimum and maximum temperatures at their highest in July and August, and the two curves have nearly the same shape.</p>


<p><img alt="TminTmax.jpeg" src="my_figures/TminTmax.jpeg" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to convert millimeters per day to inches per month. According to our analysis the average rainfall is 1.00 mm/day, which translates to about 1.18 inches per month (taking a 30-day month and 25.4 mm per inch). According to US Climate Data the average rainfall is closer to 1.2 inches per month. Both sources also agree that precipitation rises sharply in July and August, and the trend matches throughout the year.</p>

<p><img alt="PRCP.jpeg" src="my_figures/PRCP.jpeg" style="height:400px; width:500px" /></p>
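
The two unit conversions used in this sanity check can be sketched as follows. This is a minimal illustration, not the code used to produce the figures; the function names and the 30-day month are assumptions made here for the example.

```python
MM_PER_INCH = 25.4  # exact definition of the inch in millimeters


def fahrenheit_to_celsius(deg_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=30.0):
    """Convert a daily precipitation rate (mm/day) to a monthly total (inches).

    days_per_month=30 is an assumed average month length.
    """
    return mm_per_day * days_per_month / MM_PER_INCH


# The average daily rainfall quoted in the text.
inches = mm_per_day_to_inches_per_month(1.00)
print(f"{inches:.2f} inches per month")
```

With 1.00 mm/day as input this gives the 1.18 inches per month quoted above.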



## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.

### Percentage of variance explained.
![VarExplained1.jpeg](my_figures/VarExplained1.jpeg)

We see that the top 5 eigen-vectors explain 40% of the variance for TMIN, 60% for TOBS and 45% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 50% of the variance.

![VarExplained2.jpeg](my_figures/VarExplained2.jpeg)

The top 5 eigenvectors explain 8% of the variance for PRCP and 10% for SNOW. Both are low values. On the other hand the top 5 eigenvectors explain 80% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow signals. Based on that we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date on different years.
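
The percentage of variance explained by the top k eigenvectors is the sum of the k largest eigenvalues of the covariance matrix divided by the sum of all eigenvalues. A minimal sketch of that computation follows; the function name and the eigenvalues are hypothetical and are not taken from the data in this report.

```python
from itertools import accumulate


def cumulative_variance_explained(eigenvalues):
    """Return, for k = 1..n, the fraction of variance explained by the top k eigenvectors."""
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(ordered)
    return [partial / total for partial in accumulate(ordered)]


# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [5.0, 2.0, 1.0, 1.0, 1.0]
fractions = cumulative_variance_explained(example_eigenvalues)
for k, fraction in enumerate(fractions, start=1):
    print(f"top {k}: {100 * fraction:.0f}% of variance")
```

A curve that climbs quickly, as for SNWD, means a few eigenvectors capture most of the signal; a curve that climbs slowly, as for PRCP and SNOW, means the signal is spread over many directions.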

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season is from mid-October to mid-May, where the beginning of February marks the peak of the snow-depth.
<img alt="SNWD_mean_eigs.jpeg" src="my_figures/SNWD_mean_eigs.jpeg" style="height:400px; width:600px"/>

Next we interpret the eigen-functions:
* **Eig1:** less snow from mid October to mid May.
* **Eig2:** less snow from December to mid February, more snow from mid February to mid May and mid October to December.
* **Eig3:** more snow from mid October to mid January, slightly more snow from March to mid May, less snow from mid January to March
* **Eig4:** more snow from February to mid March and from December to January, less snow from January to February, mid March to mid May and mid October to December.

### Examples of reconstructions

#### Coeff1
Coeff1: High values
![c1pos.jpeg](my_figures/c1pos.jpeg)
Coeff1: Low values
![c1neg.jpeg](my_figures/c1neg.jpeg)
Low values of Coeff1 correspond to more than average snow. High values have uneven spikes that go both higher and lower than the average snow. The overall SNWD increases when going from high to low values of Coeff1.

#### Coeff2
Coeff2: Most positive
![c2pos.jpeg](my_figures/c2pos.jpeg)
Coeff2: Most negative
![c2neg.jpeg](my_figures/c2neg.jpeg)

Large positive values of Coeff2 correspond to a late snow season (most of the snowfall is after March). Negative values for Coeff2 correspond to an early snow season (most of the snowfall is before March).

#### Coeff3
Coeff3: Most positive
![c3pos.jpeg](my_figures/c3pos.jpeg)
Coeff3: Most negative
![c3neg.jpeg](my_figures/c3neg.jpeg)

Large positive values of Coeff3 correspond to a snow season with a spike in November and December. Negative values of Coeff3 correspond to a season with no spike during that period.

#### Res1
res_1: High values
![r1worst.jpeg](my_figures/r1worst.jpeg)
res_1: Low values
![r1best.jpeg](my_figures/r1best.jpeg)

#### Res2
res_2: High values
![r2worst.jpeg](my_figures/r2worst.jpeg)
res_2: Low values
![r2best.jpeg](my_figures/r2best.jpeg)

#### Res3
res_3: High values
![r3worst.jpeg](my_figures/r3worst.jpeg)
res_3: Low values
![r3best.jpeg](my_figures/r3best.jpeg)

#### ResMean
res_mean: High values
![rmworst.jpeg](my_figures/rmworst.jpeg)
res_mean: Low values
![rmbest.jpeg](my_figures/rmbest.jpeg)

High values of all residuals correspond to snow seasons with uneven spikes that go both above and below the average snow depth. This is expected: a high residual means that the season deviates from the mean and the top eigenvectors, and is therefore poorly represented by them. Low values of res_1, res_2 and res_3 (but not res_mean) correspond to seasons with more snow than average.
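
To make the coefficients and residuals concrete, here is a minimal sketch of one common way to compute them. It assumes that each coefficient is the projection of the mean-subtracted vector on an orthonormal eigenvector, and that each residual is the squared norm of what remains, divided by the squared norm of the original vector. The report does not state the exact normalization, so treat these definitions, the function names and the toy data as assumptions. Missing values are not handled.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def decompose(x, mean, eigvecs):
    """Project x on orthonormal eigenvectors after removing the mean.

    Returns (coeffs, residuals). residuals[0] is the fraction of the squared
    norm of x left after subtracting the mean (res_mean); residuals[k] is the
    fraction left after also subtracting the first k eigenvectors (res_k).
    """
    total = dot(x, x)
    remainder = [xi - mi for xi, mi in zip(x, mean)]
    coeffs = []
    residuals = [dot(remainder, remainder) / total]
    for v in eigvecs:
        c = dot(remainder, v)  # coefficient on this eigenvector
        coeffs.append(c)
        remainder = [ri - c * vi for ri, vi in zip(remainder, v)]
        residuals.append(dot(remainder, remainder) / total)
    return coeffs, residuals


# Toy data, for illustration only.
x = [3.0, 4.0, 0.0]
mean = [1.0, 1.0, 0.0]
eigvecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
coeffs, residuals = decompose(x, mean, eigvecs)
print(coeffs, residuals)
```

Under these definitions the residuals can only decrease as more eigenvectors are subtracted, so a season whose residuals stay high is one that the mean and the top eigenvectors do not describe well.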


### The variation in the timing of snow is mostly due to year-to-year variation

In the previous section we see the variation of Coeff1, which corresponds to the total amount of snow, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

**coeff_1**  
total MS                   = 2224620.206  
MS removing mean-by-station= 757518.981, fraction explained= 65.94  
MS removing mean-by-year   = 1802372.496, fraction explained= 18.98  

**coeff_2**  
total MS                   = 318159.534  
MS removing mean-by-station= 264675.691, fraction explained= 16.81  
MS removing mean-by-year   = 211595.745, fraction explained= 33.49  

**coeff_3**  
total MS                   = 171991.618  
MS removing mean-by-station= 160686.786, fraction explained= 6.57  
MS removing mean-by-year   = 88446.976, fraction explained= 48.57  

**coeff_4**  
total MS                   = 144836.922  
MS removing mean-by-station= 135086.686, fraction explained= 6.73  
MS removing mean-by-year   = 82249.185, fraction explained= 43.21  


The balance between the two sources of variation differs between coefficients. For coeff_1, which has to do with the total snowfall, the station explains far more of the variance (about 66%) than the year (about 19%). For coeff_2, coeff_3 and coeff_4, which, as we saw above, have to do with the timing of snowfall, the effect is reversed: the station explains about 7-17% of the variance while the year explains about 33-49%.
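
The "fraction explained" figures above can be sketched as follows: compute the mean square (MS) of all entries, subtract the per-station or per-year mean from each entry, and report the percentage by which the MS drops. This is a minimal illustration under that reading of the text; the record layout, the function names and the toy values are assumptions, not the code that produced the numbers above.

```python
from collections import defaultdict


def mean_square(values):
    """Mean of the squared values."""
    return sum(v * v for v in values) / len(values)


def fraction_explained(records, key_index):
    """Percentage of the mean square removed by subtracting the per-group mean.

    records: list of (station, year, value) tuples.
    key_index: 0 to group by station, 1 to group by year.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    group_mean = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    values = [record[2] for record in records]
    residuals = [record[2] - group_mean[record[key_index]] for record in records]
    return 100.0 * (1.0 - mean_square(residuals) / mean_square(values))


# Toy (station, year, coefficient) records, for illustration only.
records = [("A", 2000, 1.0), ("A", 2001, 3.0), ("B", 2000, 5.0), ("B", 2001, 7.0)]
by_station = fraction_explained(records, 0)
by_year = fraction_explained(records, 1)
print(f"station: {by_station:.2f}%, year: {by_year:.2f}%")
```

In this toy example the values differ more between stations than between years, so subtracting the station mean removes more of the mean square, which is the pattern seen for coeff_1.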

### Cumulative distribution of the coefficients

#### Coeff1

<img alt="c1cdf.jpeg" src="my_figures/c1cdf.jpeg" style="height:300px; width:400px"/>

The CDF is already close to 1 at 0, so Coeff1 is almost always less than 0.

#### Coeff2

<img alt="c2cdf.jpeg" src="my_figures/c2cdf.jpeg" style="height:300px; width:400px"/>

The CDF is close to 1 at 2000, so Coeff2 is almost always less than 2000.

#### Coeff3

<img alt="c3cdf.jpeg" src="my_figures/c3cdf.jpeg" style="height:300px; width:400px"/>

The CDF is close to 1 at 1500, so Coeff3 is almost always less than 1500.

#### Coeff4

<img alt="c4cdf.jpeg" src="my_figures/c4cdf.jpeg" style="height:300px; width:400px"/>

The CDF is close to 1 at 1500, so Coeff4 is almost always less than 1500.
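
Statements of this kind can be read off an empirical CDF, which for a threshold t is the fraction of observed coefficients below t. A minimal sketch, with a hypothetical function name and made-up sample values:

```python
from bisect import bisect_left


def empirical_cdf(samples, threshold):
    """Fraction of samples strictly below the threshold."""
    ordered = sorted(samples)
    return bisect_left(ordered, threshold) / len(ordered)


# Hypothetical coefficient values, for illustration only.
coefficient_values = [-3.0, -1.5, -0.2, 0.4]
p_below_zero = empirical_cdf(coefficient_values, 0.0)
print(p_below_zero)
```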

### Visualizing the distribution of the observations

***Legend:***

* Area of the circle is proportional to the count of stations.
* Color changes from Blue -> Green -> Yellow -> Orange -> Red as the average Coeff value goes from low negative to high positive.

#### Coeff1:

![map1.jpeg](my_figures/map1.jpeg)

#### Coeff2:

![map2.jpeg](my_figures/map2.jpeg)

#### Coeff3:

![map3.jpeg](my_figures/map3.jpeg)

#### Coeff4:

![map4.jpeg](my_figures/map4.jpeg)

## Analysis of correlation between precipitation across locations

### Precipitation mean and eigen vectors

<img alt="PRCPmean.jpeg" src="my_figures/PRCPmean.jpeg" style="height:300px; width:600px"/>

We observe that precipitation occurs throughout the year but is heaviest in July, August and September. The eigen-functions oscillate between positive and negative values close to 0 and show their largest deviations in July, August, September and October.

### How well-explained are the vectors in this collection?

We extract all of the values of `res_3` which is the residual variance after the Mean and the 
first two Eigen-vectors have been subtracted out.

<img alt="res_3.jpeg" src="my_figures/res_3.jpeg" style="height:300px; width:400px"/>

### Definition of statistical test

Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days that it rained on station 1
* $n_2$ : the number of days that it rained on station 2
* $l$ : the number of days that it rained on both stations.

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l}}{{m \choose n_1}{m \choose n_2}}
$$

Where
$$
{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
Which slightly simplifies to 
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the length of the overlap. So to get a per-day significance we consider $\frac{1}{m} \log P$.

The null hypothesis is that the occurrence of rain on any given day is independent between the two stations.
Here $P$ plays the role of a p-value: it is the probability of seeing the overlap observed in the data under the null hypothesis.
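
The simplified expression for $\log P$ can be evaluated directly with the log-gamma function, which avoids overflow in the factorials for large $m$. The sketch below uses `math.lgamma` from the Python standard library; the function names are chosen here for illustration, and this is not necessarily how the figures in this report were computed.

```python
from math import exp, lgamma


def G(n):
    """log(n!) computed as log Gamma(n + 1)."""
    return lgamma(n + 1)


def log_prob_overlap(m, n1, n2, l):
    """log P of exactly l overlapping rain days, given m, n1 and n2."""
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))


def per_day_significance(m, n1, n2, l):
    """Normalized log probability, (1/m) log P."""
    return log_prob_overlap(m, n1, n2, l) / m


# Small check: 4 shared days, 2 rain days at each station, 1 day in common.
# The multinomial coefficient is 24 and the denominator is 6 * 6, so P = 2/3.
p = exp(log_prob_overlap(4, 2, 2, 1))
print(p, per_day_significance(4, 2, 2, 1))
```

For this small case the probabilities of the three possible overlaps, 0, 1 and 2 days, are 1/6, 2/3 and 1/6, which sum to 1 as they should.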

<img alt="sign.jpeg" src="my_figures/sign.jpeg" style="height:300px; width:400px"/>

We can see that the majority of instances have a low P-value. A low P-value means that it is unlikely to see such results under the null hypothesis. Therefore, for most of the dataset we reject the null hypothesis that rain at the two stations is independent.

### Correlations matrix

<img alt="corrmat.jpeg" src="my_figures/corrmat.jpeg" style="height:500px; width:500px"/>

The matrix above shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.

We see immediately that the first 22 stations are highly correlated with each other.

### SVD analysis

<img alt="svd.jpeg" src="my_figures/svd.jpeg" style="height:300px; width:400px"/>

To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). We see that the top 10 eigenvectors explain about 70% of the square magnitude of the matrix.

### Geographical distribution of top 4 coefficients

![eig4.jpeg](my_figures/eig4.jpeg)
<img alt="mat1.jpeg" src="my_figures/mat1.jpeg" style="height:400px; width:900px"/>
<img alt="mat2.jpeg" src="my_figures/mat2.jpeg" style="height:400px; width:900px"/>

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the upper left corner of the last matrix (the lower right one). The stations at positions 0-22 are clearly strongly correlated with each other.

This type of organization is called **Block Diagonal** and it typically reveals important structure such as grouping or clustering.
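
The reordering step can be sketched as follows: sort the station indices by their entry in one eigenvector, then apply the same permutation to both the rows and the columns of the matrix. This is a minimal illustration with a hypothetical function name and a toy matrix, not the code used to produce the figures.

```python
def reorder_by_vector(matrix, vector):
    """Permute the rows and columns of a square matrix by the sort order of vector.

    Returns (reordered_matrix, order), where order[k] is the original index
    of the station placed at position k.
    """
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    reordered = [[matrix[i][j] for j in order] for i in order]
    return reordered, order


# Toy example: stations 0 and 2 form one group, stations 1 and 3 another.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
vector = [-0.5, 0.5, -0.4, 0.6]  # hypothetical eigenvector entries
reordered, order = reorder_by_vector(matrix, vector)
for row in reordered:
    print(row)
```

After reordering, the two groups appear as two blocks on the diagonal, which is the structure described above.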

### Visualizing the distribution of the observations

Each station’s location is marked using the values of its 4 coefficients.

***Legend:***

Coeff 1 = Blue,
Coeff 2 = Orange,
Coeff 3 = Green,
Coeff 4 = Red

Hollow triangle = +ve Coeff,
Filled triangle = -ve Coeff

Size of the triangle = Proportional to the absolute value of the Coeff

![prcpmap.jpeg](my_figures/prcpmap.jpeg)