# New Mexico Weather Analysis

This report presents a historical analysis of weather patterns in the northern New Mexico area, covering the cities of Albuquerque, Rio Rancho, Santa Fe and Los Alamos. The following map gives a brief look at that area.

<img alt="NewMexico.png" src="HW5_figures/NewMexico.png" />

The data we use comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focus on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/new-mexico/united-states/3201" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month in the four major cities, as well as the total precipitation for each month.

<img alt="SanityCheck.png" src="HW5_figures/SanityCheck.png" />
<p>&nbsp;</p>

Once we convert Fahrenheit to Celsius, we can see that the minimum and maximum daily temperatures agree with the ones we got from our data.

<p>&nbsp;</p>
<img alt="TMINTMAX.png" src="HW5_figures/TMINTMAX.png" />
<p>&nbsp;</p>

To compare the precipitation, we need to convert millimeters per day to inches per month. According to our analysis the average rainfall is 0.997 mm/day, which translates to about 1.18 inches per month (using a 30-day month). According to US Climate Data the average rainfall for the four major cities in this area is around 1.12 inches per month, which is close to our estimate.
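The unit conversion above can be checked with a few lines of Python. This is only a sketch: the 30-day month is an assumption chosen because it reproduces the figure quoted in the text, and the function name is illustrative.

```python
MM_PER_INCH = 25.4
DAYS_PER_MONTH = 30  # assumption: a nominal 30-day month


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=DAYS_PER_MONTH):
    """Convert an average daily rainfall in mm to inches per month."""
    return mm_per_day * days_per_month / MM_PER_INCH


# Average daily rainfall reported in the text
inches_per_month = mm_per_day_to_inches_per_month(0.997)
print(round(inches_per_month, 2))
```

Using the average month length of a calendar year instead of 30 days would shift the result slightly upward, so the comparison with the outside source should be read as approximate.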

<p>&nbsp;</p>
<img alt="PRCP.png" src="HW5_figures/PRCP.png" />

## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.
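As a minimal sketch of how such a curve can be computed, assume the eigenvalues of the covariance matrix are already available. The eigenvalues below are hypothetical and are not taken from our data.

```python
from itertools import accumulate


def cumulative_variance_explained(eigenvalues):
    """Fraction of total variance explained by the top-k eigenvectors, for k = 1..n."""
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(ordered)
    return [partial / total for partial in accumulate(ordered)]


# Hypothetical eigenvalues, for illustration only
example_eigenvalues = [5.0, 2.0, 1.0, 1.0, 1.0]
fractions = cumulative_variance_explained(example_eigenvalues)
print(fractions)
```

Each entry of `fractions` is one point on the variance-explained curve; multiplying by 100 gives the percentages shown in the plots.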

<p>&nbsp;</p>
![VarExplained1.png](HW5_figures/VarExplained1.png)
<p>&nbsp;</p>

The top 5 eigenvectors explain about 40% of the variance for TMIN, 60% for TOBS, and 40% for TMAX. Of these three, TOBS is best explained by the top 5 eigenvectors. This is especially true of the first eigenvector, which alone explains more than 50% of the variance.

<p>&nbsp;</p>
![VarExplained2.png](HW5_figures/VarExplained2.png)
<p>&nbsp;</p>

The top 5 eigenvectors explain only 10% of the variance for SNOW and 8% for PRCP, but more than 80% for SNWD. The top 5 eigenvectors therefore capture most of the variation in SNWD, which we analyze in more depth below. Also note that SNWD is less noisy than SNOW, since SNWD is a decaying integral of SNOW.

## Analysis of snow depth

We choose to analyze the eigen-decomposition of snow depth because the first 4 eigenvectors explain more than 80% of the variance. The following plot shows the mean and the top 4 eigenvectors of SNWD.

<p>&nbsp;</p>
![SNWDmeaneigs.png](HW5_figures/SNWDmeaneigs.png)
<p>&nbsp;</p>

We can see that the snow season starts in mid-November and ends in early May. Snow depth peaks in January and February.

Now we move on to the eigenfunctions. All of them oscillate between positive and negative values except for the first (eig1). Eig1 is zero outside the snow season and negative throughout it; its shape looks a little like the mean function flipped vertically. It probably corresponds to stations in areas with little snowfall. Eig2, eig3 and eig4 have different peaks between January and May, and oscillate differently between November and January.

They can be interpreted as follows:
* **eig1:** less snow during the whole snow season.
* **eig2:** less snow in December-March, more snow in March-May.
* **eig3:** more snow in February, less snow in mid-November to January, slightly less snow in April.
* **eig4:** more snow in mid-February to March, more snow in December, less snow in January, slightly less snow in late April to early May.

### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![c1pos.png](HW5_figures/c1pos.png)
Coeff1: most negative
![c1neg.png](HW5_figures/c1neg.png)
<p>&nbsp;</p>
Large positive values of coeff1 correspond to less than average snow. Large negative values correspond to more than average snow.

#### Coeff2
Coeff2: most positive
![c2pos.png](HW5_figures/c2pos.png)
Coeff2: most negative
![c2neg.png](HW5_figures/c2neg.png)
<p>&nbsp;</p>
Large positive values of coeff2 correspond to a late snow season, from February to May. Large negative values of coeff2 correspond to an early snow season, from November to April.

#### Coeff3
Coeff3: most positive
![c3pos.png](HW5_figures/c3pos.png)
Coeff3: most negative
![c3neg.png](HW5_figures/c3neg.png)
<p>&nbsp;</p>
Large positive values of coeff3 correspond to a snow season that peaks from January to April. Negative values of coeff3 correspond to a season with two peaks, one in December and one in March to April.

### The variation in the timing of snow

Now we estimate the relative importance of location-to-location variation and year-to-year variation. Each is measured by the fraction by which the mean square (MS) is reduced when we subtract from each station/year entry the average-per-station or the average-per-year, respectively. Here are the results:

| Coefficient | Total MS | MS removing mean-by-station | Fraction explained (%) | MS removing mean-by-year | Fraction explained (%) |
|---|---|---|---|---|---|
| coeff_1 | 2224620.21 | 757518.98 | 65.9 | 1802372.5 | 19.0 |
| coeff_2 | 318159.53 | 264675.69 | 16.8 | 211595.75 | 33.5 |
| coeff_3 | 171991.62 | 160686.79 | 6.6 | 88446.98 | 48.6 |
| coeff_4 | 144836.92 | 135086.69 | 6.7 | 82249.19 | 43.2 |

We can see that for coeff1 the variation by station explains more, while for the other three coefficients the variation by year explains more. For coeff1 this is probably because eig1 corresponds to stations in areas with little annual snowfall, so it mostly reflects spatial differences. Coeff2, coeff3 and coeff4, on the other hand, mostly reflect the timing of the snowfall, so for them the variation by year explains much more. Overall, the variation in the timing of snow is due more to year-to-year than to place-to-place variation.
<p>&nbsp;</p>
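The "fraction explained" figures in this section can be sketched as follows. This is an illustration under assumptions, not the exact code used for the report: the record layout `(station, year, coefficient)` and the toy data are hypothetical, and the fraction is taken to be one minus the ratio of the residual mean square to the total mean square.

```python
from collections import defaultdict


def mean_square(values):
    """Mean of the squared values."""
    return sum(v * v for v in values) / len(values)


def fraction_explained(records, key_index):
    """Fraction of the total MS removed by subtracting group means.

    records: list of (station, year, coefficient) tuples.
    key_index: 0 to group by station, 1 to group by year.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[2])
    group_mean = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    residuals = [rec[2] - group_mean[rec[key_index]] for rec in records]
    total = mean_square([rec[2] for rec in records])
    return 1.0 - mean_square(residuals) / total


# Hypothetical toy data in which the coefficient depends only on the station
records = [
    ("A", 2000, 10.0), ("A", 2001, 10.0),
    ("B", 2000, -10.0), ("B", 2001, -10.0),
]
by_station = fraction_explained(records, 0)
by_year = fraction_explained(records, 1)
print(by_station, by_year)
```

In the toy data all variation is between stations, so removing the per-station mean removes everything and removing the per-year mean removes nothing. Real data fall between these two extremes.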

## Measuring statistical significance on PRCP

Now we measure statistical significance on PRCP. Part of the reason for using PRCP instead of SNWD is that the snow season lasts only about six months, so SNWD would contain many missing (None) values. We therefore carry out this part of the analysis on PRCP.

We want a statistical test for rejecting the null hypothesis that the rainfall at two locations is independent.
Using the inner product is too noisy, because it multiplies the rainfall on the same day at the two locations and that product can be very large, leading to a large variance and poor ability to discriminate.

An alternative is to ignore the amount of rain, and just ask whether it rained in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.

Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days on which it rained at station 1.
* $n_2$ : the number of days on which it rained at station 2.
* $l$ : the number of days on which it rained at both stations.

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{\binom{m}{l,\,n_1-l,\,n_2-l,\,m-n_1-n_2+l}}{\binom{m}{n_1}\binom{m}{n_2}}
$$

where
$$
\binom{m}{l,\,n_1-l,\,n_2-l,\,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and define $G(n) \doteq \log \Gamma(n+1)$, which gives
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
which simplifies slightly to
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the number of days for which both stations have measurements. To get a per-day significance we therefore consider $\frac{1}{m} \log P$.
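The formula for $\log P$ can be evaluated directly with the standard library's log-gamma function. The sketch below is illustrative rather than the code used for the report: the helper names are hypothetical, a "rainy day" is assumed to mean a measurement greater than zero, and missing measurements are assumed to be encoded as `None`.

```python
from math import exp, lgamma


def G(n):
    """log(n!) computed as log Gamma(n + 1)."""
    return lgamma(n + 1)


def log_overlap_probability(m, n1, n2, l):
    """log P of exactly l overlapping rainy days under the independence null.

    Valid for max(0, n1 + n2 - m) <= l <= min(n1, n2).
    """
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))


def rain_counts(series1, series2):
    """Count m, n1, n2, l from two aligned daily series (None = missing)."""
    pairs = [(a, b) for a, b in zip(series1, series2)
             if a is not None and b is not None]
    m = len(pairs)
    n1 = sum(1 for a, _ in pairs if a > 0)
    n2 = sum(1 for _, b in pairs if b > 0)
    l = sum(1 for a, b in pairs if a > 0 and b > 0)
    return m, n1, n2, l


# Hypothetical toy series, for illustration only
station1 = [0.0, 1.2, None, 3.0, 0.0]
station2 = [0.0, 0.5, 2.0, 0.0, 0.4]
m, n1, n2, l = rain_counts(station1, station2)
log_p = log_overlap_probability(m, n1, n2, l)
normalized_log_p = log_p / m
print(m, n1, n2, l, exp(log_p), normalized_log_p)
```

Working in log space avoids overflow in the factorials when $m$ is in the thousands, which is the reason for introducing $G$ in the first place.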

We compute the normalized log probability for every pair of stations. The graph below shows the distribution of the normalized log probability. Note that the value 0.4 at the right edge is the default assigned to the normalized log probability of a station with itself.

<p>&nbsp;</p>
![sigdistrib.png](HW5_figures/sigdistrib.png)
<p>&nbsp;</p>

Using the normalized log probability, we then construct a dependency matrix that shows the relation between stations. Both the x and y axes index the stations.

<p>&nbsp;</p>
![depmatrix.png](HW5_figures/depmatrix.png)
<p>&nbsp;</p>

### Finding structure in the dependency matrix

In the matrix above, we see immediately that the first 25 stations are highly correlated with each other. 

To find more correlations we use SVD (the term PCA is reserved for the decomposition of the covariance matrix). The top 10 eigenvectors explain about 75% of the squared magnitude of the matrix.

<p>&nbsp;</p>
![pcaexplained.png](HW5_figures/pcaexplained.png)
<p>&nbsp;</p>

Now we use the top 4 eigenvectors to reorder the rows and columns of the matrix, which makes the grouping of the stations more evident. The reordered dependency matrices are shown below.

<p>&nbsp;</p>
![pcareorder.png](HW5_figures/pcareorder.png)
<p>&nbsp;</p>

We can clearly observe stronger correlations than in the original dependency matrix. For example, the stations at positions 0-50 in the upper-left matrix are more correlated with each other, and the stations at positions 160-180 in the lower-left matrix are strongly correlated. Strong correlations also appear at other positions in the remaining matrices.
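The SVD step and the reordering used in this section could be sketched as follows with NumPy. This is a minimal illustration under assumptions: the small matrix is hypothetical, the function names are not from the original analysis, and the rows and columns are sorted by the entries of a single singular vector (one such reordering per component).

```python
import numpy as np


def explained_fraction(matrix, k):
    """Fraction of the squared (Frobenius) magnitude captured by the top-k singular values."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    energy = singular_values ** 2
    return float(energy[:k].sum() / energy.sum())


def reorder_by_component(matrix, component=0):
    """Sort rows and columns of a square matrix by one left singular vector."""
    u, _, _ = np.linalg.svd(matrix)
    order = np.argsort(u[:, component])
    return matrix[np.ix_(order, order)], order


# Hypothetical dependency matrix with two interleaved groups: {0, 2} and {1, 3}
dep = np.array([
    [1.0, 0.0, 0.9, 0.0],
    [0.0, 1.0, 0.0, 0.8],
    [0.9, 0.0, 1.0, 0.0],
    [0.0, 0.8, 0.0, 1.0],
])
frac_top2 = explained_fraction(dep, 2)
reordered, order = reorder_by_component(dep, component=0)
print(frac_top2)
print(order)
print(reordered)
```

After reordering, stations that load similarly on the chosen component sit next to each other, so groups that were interleaved in the original ordering show up as contiguous blocks along the diagonal.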

### Visualizing the distribution of the observations

We now use the top 4 components to plot the stations on the map. Each station is represented by a point with 4 triangles. Each triangle represents the coefficient of a particular component, and its size corresponds to the magnitude of that coefficient.

We first look at the part of northern New Mexico where our stations are located. Most stations are clustered around Albuquerque and the three other cities (Rio Rancho, Santa Fe, Los Alamos), and some are in rural and mountainous areas.

<p>&nbsp;</p>
![DataonMap.png](HW5_figures/DataonMap.png)
<p>&nbsp;</p>

We now zoom in on the Albuquerque area, where most stations are located. The coefficients corresponding to the red and blue triangles are dominant here, and the stations around Albuquerque are similar to each other, which explains the clusters in the dependency matrices.

<p>&nbsp;</p>
![DataonAlbuquerque.png](HW5_figures/DataonAlbuquerque.png)
<p>&nbsp;</p>

Now we look at the north. The stations around Santa Fe are dominated by the coefficient corresponding to the red triangle, and the blue one has much less effect here. This shows that rainfall in Santa Fe differs somewhat from rainfall in Albuquerque. As the figures in the sanity-check section show, Santa Fe has higher rainfall than Albuquerque.

<p>&nbsp;</p>
![DataonMountain.png](HW5_figures/DataonMountain.png)
<p>&nbsp;</p>

We can also look at the rural areas. The stations there differ greatly from each other; little pattern can be found and no clusters emerge. This is mainly because they are in different environments and relatively far from each other, which explains the low-dependency parts of the dependency matrices.

## Conclusion

In this work we analyzed the weather in northern New Mexico, including the cities of Albuquerque, Rio Rancho, Santa Fe and Los Alamos. Specifically, we analyzed snow depth in this area and identified the main components that explain most of the variance and reconstruct the fluctuation of the annual SNWD. We also found that the variation in the timing of the snow is due more to year-to-year change than to differences in location.

We also measured the statistical significance of the dependence in PRCP, i.e., rainfall, between stations. By reordering the dependency matrices and plotting the data on the map, we can clearly see the correlation between nearby stations, as well as the different rainfall levels in different cities.

Overall, we analyzed both snow depth and rainfall in northern New Mexico. Future work could focus on the temperature variation in that area.