# New Mexico Weather Analysis

* **Author: Zhenxuan Hu**
* **PID: A53204236**
* **Dataset: SSSSBBBB**


This is a report on the historical analysis of weather patterns in an area that approximately overlaps the north of the state of New Mexico.

The following maps show the region and the locations of the stations.

<p><img alt="MAP.png" src="r_figures/MAP_SSSSBBBB.png" /></p>
<p><img alt="MAP.png" src="r_figures/MAP2_SSSSBBBB.png" /></p>

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** the temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>


<p><img alt="Climate_Albuquerque.png" src="r_figures/Climate_Albuquerque_SSSSBBBB.png" /></p>


<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius.</p>
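
The unit conversion used in this comparison is simple enough to sketch. The function name and the sample values below are illustrative assumptions, not taken from the analysis code.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Hypothetical monthly averages in Fahrenheit, for illustration only.
example_highs_f = [50.0, 68.0, 86.0]
example_highs_c = [round(fahrenheit_to_celsius(t), 1) for t in example_highs_f]
print(example_highs_c)
```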


<p><img alt="TMIN,TMAX.png" src="r_figures/TMIN,TMAX_SSSSBBBB.png" style="height:300px; width:800px" /></p>

<p>Precipitation in this part of New Mexico is low. According to US Climate Data, the average rainfall is about 1 inch per month. It peaks in July and August, when it reaches about 2 inches per month.</p>

<p><img alt="PRCP.png" src="r_figures/PRCP_SSSSBBBB.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.

### Percentage of variance explained.
![VarExplained1.png](r_figures/VarExplained1_SSSSBBBB.png)
We see that the top 5 eigen-vectors explain 40% of variance for TMIN, 58% for TOBS and 45% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 52% of the variance.
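
As a rough illustration of how such a curve can be computed, the sketch below turns a list of eigenvalues into cumulative fractions of variance explained. The function name and the eigenvalues are hypothetical; they are not the values behind the figures.

```python
from itertools import accumulate


def variance_explained(eigenvalues):
    """Return the cumulative fraction of variance explained by the top-k
    eigenvectors, for k = 1 .. len(eigenvalues)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [partial / total for partial in accumulate(ordered)]


# Hypothetical eigenvalues (not taken from the weather data).
example_eigenvalues = [5.0, 2.0, 1.0, 1.0, 1.0]
fractions = variance_explained(example_eigenvalues)
print([round(f, 2) for f in fractions])
```

In this toy case the first eigen-vector alone explains half of the variance, which is the kind of statement made about TOBS above.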

![VarExplained2.png](r_figures/VarExplained2_SSSSBBBB.png)

Similarly, we see that the top 5 eigen-vectors explain 11% of variance for SNOW, 86% for SNWD and 8% for PRCP.

Of these three measurements, SNWD is by far the best explained by its top eigen-vectors, so it captures most of the variation in the snow signal. Based on that, we dig deeper into the PCA analysis for snow depth.

It makes sense that SNWD is less noisy than SNOW: SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
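
The "decaying integral" argument can be checked on synthetic data. The sketch below assumes a simple model in which the snow pack loses a fixed fraction of its depth each day. The decay rate, the snowfall distribution and the function names are all assumptions made for illustration.

```python
import random
import statistics


def decaying_integral(snowfall, decay=0.9):
    """Accumulate daily snowfall into a depth that loses a fixed fraction
    (1 - decay) of itself every day."""
    depth, current = [], 0.0
    for amount in snowfall:
        current = decay * current + amount
        depth.append(current)
    return depth


def relative_daily_change(series):
    """Standard deviation of day-to-day changes, relative to the series mean."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return statistics.pstdev(changes) / statistics.fmean(series)


rng = random.Random(0)
# Synthetic snowfall: snow on roughly 1 day in 10, random amount in mm.
snow = [rng.expovariate(1 / 50.0) if rng.random() < 0.1 else 0.0 for _ in range(2000)]
depth = decaying_integral(snow)

# The depth series changes much less from day to day, relative to its mean.
print(relative_daily_change(snow), relative_daily_change(depth))
```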

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain more than 80% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season is from mid-October to mid-May, where January and February mark the peak of the snow-depth.
![SNWD_mean_eigs.png](r_figures/SNWD_mean_eigs_SSSSBBBB.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during October to December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow in Jan, Feb and Dec, more snow in Mar and Apr.
* **eig3:** less snow in Feb, slightly more snow in Apr, much more snow in Dec.
* **eig4:** less snow in Jan, Mar and Apr, more snow in Feb and Dec.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](r_figures/SNWD_grid_pos_coeff1_SSSSBBBB.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](r_figures/SNWD_grid_neg_coeff1_SSSSBBBB.png)
Even the most positive values of coeff1 are negative, so all of the coefficient values are negative. This is because eig1 is negative while the data is positive, so their product is negative.

Negative values with a large absolute value correspond to more snow than average. Negative values with a small absolute value correspond to less snow than average.
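
The sign of a coefficient depends on the arbitrary sign of its eigen-vector. The minimal sketch below shows this on a toy three-day "season". It assumes that coefficients are obtained by projecting the mean-subtracted data onto orthonormal eigen-vectors, which may differ from the exact procedure used for the figures. All names and numbers are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(x, mean, eigvecs):
    """Coefficients of (x - mean) on each orthonormal eigenvector."""
    centered = [a - m for a, m in zip(x, mean)]
    return [dot(centered, v) for v in eigvecs]


def reconstruct(mean, eigvecs, coeffs):
    """Return mean + sum_k coeffs[k] * eigvecs[k]."""
    out = list(mean)
    for c, v in zip(coeffs, eigvecs):
        out = [o + c * vi for o, vi in zip(out, v)]
    return out


# Toy 3-day "season"; all values are hypothetical.
mean = [1.0, 2.0, 1.0]
eig1 = [-0.6, -0.8, 0.0]   # unit vector whose entries are negative or zero
x = [4.0, 6.0, 1.0]        # a year with more snow than the mean

coeff = project(x, mean, [eig1])              # negative coefficient
flipped = [-e for e in eig1]
coeff_flipped = project(x, mean, [flipped])   # same magnitude, opposite sign

print(coeff, coeff_flipped)
print(reconstruct(mean, [eig1], coeff))
print(reconstruct(mean, [flipped], coeff_flipped))
```

Flipping the sign of the eigen-vector flips the sign of the coefficient, but the reconstruction is unchanged. This is why a large negative coeff1 can still mean "more snow than average".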

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](r_figures/SNWD_grid_pos_coeff2_SSSSBBBB.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](r_figures/SNWD_grid_neg_coeff2_SSSSBBBB.png)

Large positive values of coeff2 correspond to a late snow season (most of the snow falls in March and April).

Large negative values of coeff2 correspond to an early snow season (most of the snow falls in January and February).

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](r_figures/SNWD_grid_pos_coeff3_SSSSBBBB.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](r_figures/SNWD_grid_neg_coeff3_SSSSBBBB.png)

Large positive values of coeff3 correspond to a snow season with two peaks: one in the middle of March, the other at the end of December.

Large negative values of coeff3 correspond to a season with a single peak in February and March, or with no obvious peak.




### Geographical distribution of first 3 coefficients. 
![SNWD_coeff_1_CDF.png](r_figures/SNWD_coeff_1_CDF_SSSSBBBB.png)
Nearly all of the data has a negative coefficient for the first eigen-vector, which is consistent with the first eigen-vector being negative or zero everywhere. The larger the absolute value of the coefficient, the greater the snow depth.
![SNWD_coeff_2_CDF.png](r_figures/SNWD_coeff_2_CDF_SSSSBBBB.png)
![SNWD_coeff_3_CDF.png](r_figures/SNWD_coeff_3_CDF_SSSSBBBB.png)
The distributions for the second and third eigen-vectors are similar to each other. In both, many entries have a negative coefficient and many have a positive one. Snow depth is large from December to March, so a positive coefficient indicates late snow (February to March) and a negative coefficient indicates early snow (December to January).
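
The CDF plots can be read through the value of the empirical CDF at zero, which is the share of coefficients that are negative or zero. The sketch below uses made-up coefficients to contrast a mostly negative distribution with a balanced one. The helper name and the data are assumptions.

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return a function F where F(t) is the fraction of values <= t."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n


# Hypothetical coefficients, for illustration only.
coeff1 = [-900.0, -400.0, -250.0, -100.0, -20.0, -5.0, -1.0, 3.0]
coeff2 = [-300.0, -120.0, -40.0, -10.0, 15.0, 60.0, 150.0, 280.0]

print(empirical_cdf(coeff1)(0.0))  # share of non-positive coeff1 values
print(empirical_cdf(coeff2)(0.0))  # share of non-positive coeff2 values
```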

## The variation in the timing of snow
In the previous section we saw how Coeff1, which corresponds to the total amount of snow, varies with location. We now estimate the importance of location-to-location variation compared with year-by-year variation.

These are measured using the fraction by which the root-mean-square (RMS) of the coefficient is reduced when we subtract from each station/year entry the average-per-station or the average-per-year respectively. Here are the results:

| Coefficient | Total RMS | RMS removing mean-by-station | Fraction explained by station (%) | RMS removing mean-by-year | Fraction explained by year (%) |
|---|---|---|---|---|---|
| coeff_1 | 1491.51607638 | 870.355664009 | 41.6 | 1342.52467265 | 10.0 |
| coeff_2 | 564.056321801 | 514.466414579 | 8.7 | 459.995375319 | 18.4 |
| coeff_3 | 414.718721629 | 400.857563491 | 3.3 | 297.400363389 | 28.2 |
| coeff_4 | 380.574463507 | 367.541407743 | 3.4 | 286.791188233 | 24.6 |
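
The reported fractions are consistent with the formula `100 * (1 - RMS_after / RMS_total)`. The sketch below implements that formula for a list of `(station, year, coefficient)` records. The record layout, the function names and the toy data are assumptions, and the total RMS is taken over the raw coefficients, which may not match the original code exactly.

```python
from collections import defaultdict
from math import sqrt


def rms(values):
    """Root-mean-square of a list of numbers."""
    return sqrt(sum(v * v for v in values) / len(values))


def fraction_explained(records, key_index):
    """Percent reduction in RMS after subtracting the per-group mean.

    records: list of (station, year, coefficient) tuples.
    key_index: 0 to group by station, 1 to group by year.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[2])
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    residuals = [rec[2] - means[rec[key_index]] for rec in records]
    total = rms([rec[2] for rec in records])
    return 100.0 * (1.0 - rms(residuals) / total)


# Toy data in which the coefficient depends only on the station.
records = [
    ("A", 2000, 10.0), ("A", 2001, 10.0),
    ("B", 2000, -10.0), ("B", 2001, -10.0),
]
print(fraction_explained(records, 0))  # grouping by station
print(fraction_explained(records, 1))  # grouping by year

# Same formula applied to the coeff_1 station figures reported above.
print(100.0 * (1.0 - 870.355664009 / 1491.51607638))
```

In the toy data the station mean removes all of the RMS and the year mean removes none, the extreme version of the pattern seen for coeff_1.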

For coeff_1, variation between stations explains more than variation between years. For coeff_2, coeff_3 and coeff_4, however, variation between years explains more than variation between stations.
As discussed before, coeff_1 captures the total amount of snow, so the station explains more of the total snow. When we want to explain the timing of the snow peak, the year becomes the primary factor. However, neither fraction is very high, which means that neither the station nor the year alone is a very good explanation of the coefficient values.


## Analysis of correlation between precipitation across locations

### Plot Mean and Eigen Vectors
![PRCP1.png](r_figures/PRCP1_SSSSBBBB.png)
The PRCP mean plot shows that rainfall in New Mexico peaks in July and August.

The first PRCP eigen-vector is correlated with the average rainfall, and the second is related to the timing of the rainfall. There is a rainfall peak in October, which the second eigen-vector represents with a negative peak.

### Definition of statistical test
![PRCP2.png](r_figures/PRCP2_SSSSBBBB.png)

This plot displays a histogram of the resulting p-values. The null hypothesis is that whether it rains on a given day at one station is independent of whether it rains at the other station.

Most of the p-values are very small, so the null hypothesis is rejected for most pairs of stations.

This shows that whether or not it rains at one station is related to whether or not it rains at the others.
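
One possible way to test this null hypothesis is a hypergeometric tail test on the number of days when it rained at both stations. This is an illustrative sketch, not necessarily the test used to produce the histogram. The function name, parameters and example counts are assumptions.

```python
from math import comb


def overlap_p_value(n_days, rain_a, rain_b, rain_both):
    """P(overlap >= rain_both) if the rainy days of the two stations were
    placed independently among n_days shared days (hypergeometric tail)."""
    upper = min(rain_a, rain_b)
    tail = sum(
        comb(rain_a, j) * comb(n_days - rain_a, rain_b - j)
        for j in range(rain_both, upper + 1)
    )
    return tail / comb(n_days, rain_b)


# Hypothetical counts: 100 shared days, 20 rainy days at each station.
print(overlap_p_value(100, 20, 20, 15))  # far more overlap than chance predicts
print(overlap_p_value(100, 20, 20, 4))   # overlap close to what chance predicts
```

A very small p-value means that the two stations share many more rainy days than independence would produce, which is the situation described for most station pairs.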

### Correlation matrix
![PRCP3.png](r_figures/PRCP3_SSSSBBBB.png)
The x and y axes give the index of a station. The color of each dot represents the significance (p-value) of the test for that pair of stations.

The first 25 stations are strongly related to each other. The first 75 stations are related to each other to some extent.

The top 5 related stations are:

USC00299657, USC00292810, US1NMSM0005, US1NMBR0053, US1NMSM0007

### PCA analysis
![PRCP4.png](r_figures/PRCP4_SSSSBBBB.png)
The images show the matrix with its rows and columns reordered according to each eigen-vector, which makes the grouping of the stations more evident. In the top two images (eigen-vectors 1 and 2), the first 50 stations are correlated with each other. The bottom-left image (eigen-vector 3) shows that the last 25 stations are related to each other. The bottom-right image shows that the first 25 stations are related in whether or not it rains.
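
The reordering idea can be sketched as follows: sort the stations by their component in an eigen-vector, then apply the same permutation to the rows and columns of the matrix. This is an assumed reading of how the figures were produced. The matrix, the vector and the function name are hypothetical.

```python
def reorder_by_vector(matrix, vector):
    """Sort stations by their component in `vector` and apply the same
    permutation to the rows and columns of `matrix`."""
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    reordered = [[matrix[i][j] for j in order] for i in order]
    return order, reordered


# Hypothetical 4-station similarity matrix: stations 0 and 2 form one group,
# stations 1 and 3 another.
similarity = [
    [1.0, 0.1, 0.9, 0.0],
    [0.1, 1.0, 0.0, 0.8],
    [0.9, 0.0, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
]
# Hypothetical eigenvector components for the same stations.
eigvec = [-0.5, 0.5, -0.4, 0.6]

order, blocks = reorder_by_vector(similarity, eigvec)
print(order)
for row in blocks:
    print(row)
```

After reordering, related stations sit next to each other and the large entries gather into blocks along the diagonal.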