# New Mexico Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of New Mexico.

The data we use here come from [NOAA](https://www.ncdc.noaa.gov/). Specifically, they were downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature (in degrees C).
* **TOBS:** The average temperature for each day (in degrees C).
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow (in mm)

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/new-mexico/united-states/3201" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="NMclimate.jpg" src="report_figures/NMclimate.png" /></p>

<p>&nbsp;</p>



<p>We see that the minimum and maximum daily&nbsp;temperatures agree with the ones we got from our data. Our data cover the whole state of New Mexico, while the US Climate Data figures are for the Santa Fe, NM area, so the two differ slightly, but overall they agree with each other.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="report_figures/Mean+-std_TMINMean+-std_TMAX.png" style="height:450px; width:800px" /></p>

<p>To compare the precipitation&nbsp;we need to translate millimeters/day to inches/month. According to our analysis the average rainfall is 1.00 mm/day, which translates to about 1.18 inches&nbsp;per month (taking a 30-day month). According to US Climate Data the average rainfall is 361 mm/year, which is also about 1.18 inches per month. So the two sources clearly agree on the average precipitation, which is close to constant throughout the year.</p>
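
The unit conversion above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are our own, and the 30-day month is an assumption.

```python
MM_PER_INCH = 25.4


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=30):
    """Convert a daily rate in mm to a monthly total in inches."""
    return mm_per_day * days_per_month / MM_PER_INCH


def mm_per_year_to_inches_per_month(mm_per_year):
    """Convert a yearly total in mm to an average monthly total in inches."""
    return mm_per_year / 12 / MM_PER_INCH


ours = mm_per_day_to_inches_per_month(1.00)     # estimate from our data
theirs = mm_per_year_to_inches_per_month(361)   # figure quoted from US Climate Data
print(f"ours: {ours:.2f} in/month, theirs: {theirs:.2f} in/month")
```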

<p>&nbsp;<img alt="PRCP.png" src="report_figures/Mean+-std_TOBSMean+-std_PRCP.png" style="height:450px; width:800px" /></p>



## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
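
As a rough illustration of this computation, the sketch below takes the eigenvalues of a covariance matrix and accumulates them. It uses synthetic data and assumes a complete matrix with no missing days, which is not true of the real station data; `percent_variance_explained` is a name made up for this example.

```python
import numpy as np


def percent_variance_explained(data, k_max=5):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    `data` is a hypothetical (n_samples, n_days) array with no missing values.
    """
    cov = np.cov(data, rowvar=False)           # (n_days, n_days) covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0, None)        # guard against tiny negative round-off
    return np.cumsum(eigvals[:k_max]) / eigvals.sum()


# Synthetic example: one strong shared pattern plus independent noise.
rng = np.random.default_rng(0)
pattern = np.sin(np.linspace(0, np.pi, 20))
data = rng.normal(size=(200, 1)) * 3 * pattern + rng.normal(size=(200, 20))
explained = percent_variance_explained(data)
print(np.round(100 * explained, 1))
```

A curve that rises quickly, as it does for this synthetic pattern, means a few eigenvectors capture most of the signal.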

### Percentage of variance explained.
![PV_Explained_TMIN_TOBS_TMAX.png](report_figures/PV_Explained_TMIN_TOBS_TMAX.png)
We see that the top 5 eigen-vectors explain 43% of variance for TMIN, 58% for TOBS and 48% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 50% of the variance.

![PV_Explained_SNOW_SNWD_PRCP.png](report_figures/PV_Explained_SNOW_SNWD_PRCP.png)

The top 5 eigenvectors explain 8% of the variance for PRCP and 14% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 75% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow-depth signals. Based on that, we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
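
To make the "decaying integral" idea concrete, here is a toy simulation. The decay rate, the storm frequency and the snowfall amounts are all invented for illustration and are not fitted to the New Mexico data.

```python
import numpy as np


def snow_depth_from_snowfall(snowfall, decay=0.9):
    """Accumulate snowfall into depth, losing a fixed fraction each day."""
    depth = np.zeros(len(snowfall))
    level = 0.0
    for day, fall in enumerate(snowfall):
        level = decay * level + fall
        depth[day] = level
    return depth


def lag1_autocorrelation(x):
    """Correlation between consecutive days; near 1 means a smooth series."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]


rng = np.random.default_rng(0)
# Hypothetical snowfall: no snow on most days, occasional storms.
snowfall = rng.exponential(20.0, size=2000) * (rng.random(2000) < 0.1)
depth = snow_depth_from_snowfall(snowfall)
print(lag1_autocorrelation(snowfall), lag1_autocorrelation(depth))
```

In this toy model the depth series is strongly correlated from one day to the next while the snowfall series is not, which is the sense in which SNWD is the smoother signal.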

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 70% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from mid-October to mid-April, with the snow-depth peaking in the middle of February.
![SNWD_Mean.png](report_figures/SNWD_Mean.png)
![Top Eigenvectors SNWD.png](report_figures/Top%20Eigenvectors%20SNWD.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow in mid-Feb to April, more snow in mid-Jan to start of Feb.
* **eig3:** more snow in mid-Feb, less snow in Jan, slightly more snow in March.
* **eig4:** more snow in mid-Feb, less snow in Mar, more snow in April, more snow in Nov and Dec.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![most_coeff_1.png](report_figures/most_coeff_1.png)

Coeff1: most negative
![least_coeff_1.png](report_figures/least_coeff_1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![most_coeff_2.png](report_figures/most_coeff_2.png)
Coeff2: most negative
![least_coeff_2.png](report_figures/least_coeff_2.png)

Large positive values of coeff2 correspond to a late snow season (most of the snowfall is after mid-Feb). Negative values of coeff2 correspond to an early snow season (most of the snow is before mid-Feb).
#### Coeff3
Coeff3: most positive
![most_coeff_3.png](report_figures/most_coeff_3.png)
Coeff3: most negative
![least_coeff_3.png](report_figures/least_coeff_3.png)

Large positive values of coeff3 correspond to a snow season with two spikes: one at the start of January, the other at the end of February. Negative values of coeff3 correspond to a season with a single peak at the end of January.
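
The reconstructions in this section follow the usual PCA recipe: project the mean-subtracted signal on the top eigenvectors and add the weighted eigenvectors back to the mean. The sketch below shows that recipe on a toy example with a made-up orthonormal basis. It is not the code that produced the figures, and it ignores missing values.

```python
import numpy as np


def reconstruct(x, mean, eigvecs, k):
    """Approximate x by the mean plus its projection on the top-k eigenvectors.

    eigvecs: (n_days, n_vectors) array with orthonormal columns.
    Returns the reconstruction and the coefficients (coeff_1 ... coeff_k).
    """
    centered = x - mean
    coeffs = eigvecs[:, :k].T @ centered
    return mean + eigvecs[:, :k] @ coeffs, coeffs


# Toy example with a hypothetical orthonormal basis.
rng = np.random.default_rng(0)
eigvecs, _ = np.linalg.qr(rng.normal(size=(30, 3)))
mean = np.linspace(0, 1, 30)
x = mean + 5.0 * eigvecs[:, 0] - 2.0 * eigvecs[:, 1]
approx, coeffs = reconstruct(x, mean, eigvecs, k=3)
print(np.round(coeffs, 2))
```

Here the recovered coefficients match the weights used to build `x`, so the reconstruction is exact. For real station-years the match is only approximate, and the leftover part is the residual.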

## Snow depth varies more from station to station than from year to year
In the previous section we saw how Coeff1, which corresponds to the total amount of snow, varies with the year. We now estimate the importance of location-to-location variation relative to year-by-year variation.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

**coeff_1**  
total MS                   = 258.253132732  
MS removing mean-by-station= 184.158764595, fraction explained=28.6  
MS removing mean-by-year   = 249.142933595, fraction explained=3.5  

**coeff_2**  
total MS                   = 102.342894653  
MS removing mean-by-station= 85.6457997716, fraction explained=16.7  
MS removing mean-by-year   = 98.0453618425, fraction explained=3.9  

**coeff_3**  
total MS                   = 77.2650949197  
MS removing mean-by-station= 76.4109837537, fraction explained=1.1  
MS removing mean-by-year   = 73.5791691959, fraction explained=4.8  
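
A minimal sketch of how such fractions can be computed is given below. It assumes the coefficient is arranged as a year-by-station table with NaN for missing entries, and that "MS" means the mean of the squared entries. Both are our assumptions about the layout, and the table itself is synthetic.

```python
import numpy as np


def mean_square(table):
    """Mean of squared entries, ignoring missing (NaN) cells."""
    return np.nanmean(table ** 2)


def percent_explained(table, axis):
    """Percent drop in mean square after subtracting the mean along `axis`.

    table: (n_years, n_stations) array of one coefficient, NaN where missing.
    axis=0 removes the per-station mean; axis=1 removes the per-year mean.
    """
    group_mean = np.nanmean(table, axis=axis, keepdims=True)
    return 100 * (1 - mean_square(table - group_mean) / mean_square(table))


# Hypothetical table where stations differ a lot and years only a little.
rng = np.random.default_rng(0)
station_effect = rng.normal(scale=10, size=(1, 8))
year_effect = rng.normal(scale=1, size=(20, 1))
table = station_effect + year_effect + rng.normal(scale=3, size=(20, 8))
table[rng.random(table.shape) < 0.1] = np.nan   # some station-years are missing
by_station = percent_explained(table, axis=0)
by_year = percent_explained(table, axis=1)
print(by_station, by_year)
```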


We see that, for coeff_1 and coeff_2, the variation by station explains more than the variation by year. The effect is strongest for coeff_1, which has to do with the total amount of snow, and weaker for coeff_2, which, as we saw above, has to do with the timing of the snow season. For coeff_3, which also relates to timing, neither factor explains much, and the year explains slightly more (4.8) than the station (1.1). We can also see that even the station factor explains a smaller share of the variance for SNWD than it does for PRCP.


## Residual Analysis of SNWD

We now analyze the residuals for snow depth in this dataset.

The following graph shows the variance explained by the top eigenvectors:

![SNWD_PV_explained.png](report_figures/SNWD_PV_explained.png)


We can see that the top 8 eigenvectors explain more than 80% of the variance.

Next we want to see how well the vectors in this collection are explained. I sort the dataframe by `res_3` in ascending order and extract all of its values. `res_3` is the residual variance after the mean and the first two eigen-vectors have been subtracted out.

The following graph shows the distribution:

![SNWD_plot_CDF_res_3.png](report_figures/SNWD_plot_CDF_res_3.png)

More than half of the instances have res_3 less than 0.1, and about 60% have res_3 less than 0.3. The residuals are small for the first half of the instances and somewhat higher for the second half. So I think the data is well explained, which also matches the previous analysis of the variance explained for SNWD.
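
Fractions like these can be read off an empirical CDF. The sketch below shows one way to compute it. The residual values are invented for illustration; in the real analysis they would come from the `res_3` column.

```python
import numpy as np


def empirical_cdf(values):
    """Return sorted values and the fraction of instances at or below each."""
    x = np.sort(np.asarray(values))
    p = np.arange(1, len(x) + 1) / len(x)
    return x, p


def fraction_below(values, threshold):
    """Fraction of instances whose value is strictly below `threshold`."""
    return float(np.mean(np.asarray(values) < threshold))


# Hypothetical residuals in [0, 1], not the report's data.
res_3 = np.array([0.02, 0.05, 0.08, 0.09, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9])
x, p = empirical_cdf(res_3)
print(fraction_below(res_3, 0.1), fraction_below(res_3, 0.3))
```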

Then we try to figure out whether the snow depths at two stations are related to each other. The null hypothesis is that they are independent.

I compute the p-value for each pair of stations, and the following graphs show the result:

![SNWD_significance_hist.png](report_figures/SNWD_significance_hist.png)
![P_norm_SNWD.png](report_figures/P_norm_SNWD.png)

From the graphs above, most station pairs have a p-value close to 0 under the null hypothesis, so we reject the null hypothesis: the snow depths at pairs of stations are highly dependent.
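
For readers who want to try a test of this kind, here is one possible sketch: a permutation test on the correlation between two stations. It is a stand-in chosen for illustration and is not necessarily the test used to produce the histograms above. The station series are synthetic.

```python
import numpy as np


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between a and b.

    a, b: values for two stations on the same years (no missing entries).
    Null hypothesis: the two stations are independent.
    """
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(a, b)[0, 1])
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(b)           # break the year-to-year pairing
        if abs(np.corrcoef(a, shuffled)[0, 1]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)            # add-one avoids reporting p = 0


# Hypothetical stations that share a common regional signal.
rng = np.random.default_rng(1)
regional = rng.normal(size=40)
station_a = regional + 0.3 * rng.normal(size=40)
station_b = regional + 0.3 * rng.normal(size=40)
p_dependent = permutation_p_value(station_a, station_b)
p_unrelated = permutation_p_value(station_a, rng.normal(size=40))
print(p_dependent, p_unrelated)
```

A small p-value for a pair means the observed agreement would be very unlikely if the two stations were independent.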

To double-check this conclusion, I run the same test on the SNOW (snowfall) measurement. The following graph shows the p-values for station pairs:

![SNOW_significance_hist.png](report_figures/SNOW_significance_hist.png)

The result for SNOW has a very similar distribution. Since SNWD is the integral of SNOW, the two should be positively related, so I think we can conclude that both snow depth and snowfall are highly dependent between station pairs. This matches common sense: stations near each other will probably have similar amounts of snowfall. Since the data are collected from the New Mexico area, the snow depth at one location depends on the snow depth at the others.