# New Mexico Weather Analysis

This is a report on analysis of historical weather patterns in an area centered at Truth or Consequences in the state of New Mexico.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

I focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We observe that the stations in the data set lie around Truth or Consequences, New Mexico. We therefore compare some general statistics for Truth or Consequences with graphs obtained from <a href="http://www.usclimatedata.com/climate/truth-or-consequences/new-mexico/united-states/usnm0332" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="NewMexico_WR.png" src="report_figures/NewMexico_WR.png" style="height:450px; width:600px" /></p>

<p>&nbsp;</p>

<p>We see that the min and max daily&nbsp;temperatures agree with the ones we got from the data, once we translate Fahrenheit to Centigrade.</p>

<p>&nbsp;</p>

<p><img alt="TMIN_TMAX1.png" src="report_figures/TMIN_TMAX1.png" style="height:300px; width:800px" /></p>

<p>The precipitation trend in the graph below agrees with that of US Climate Data. According to US Climate Data, precipitation peaks in July and August, which agrees with the analysed data. In our analysis the average rainfall peaks at around 2 mm/day in July and August, which translates to about 2.36 inches per month (assuming a 30-day month).</p>

<p>&nbsp;<img alt="PRCP2.png" src="report_figures/PRCP2.png" style="height:450px; width:600px" /></p>
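
The conversion from a daily rate in millimetres to a monthly total in inches can be checked with a few lines of Python. This is only a sketch: the function name and the 30-day month are assumptions made for illustration, not part of the original analysis.

```python
MM_PER_INCH = 25.4  # exact by definition of the inch

def mm_per_day_to_inches_per_month(mm_per_day, days_in_month=30):
    """Convert an average daily rainfall rate (mm/day) to a monthly total in inches."""
    return mm_per_day * days_in_month / MM_PER_INCH

monthly_inches = mm_per_day_to_inches_per_month(2.0)
print(round(monthly_inches, 2))  # 2.36
```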

## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used to approximate them.
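
A minimal sketch of how such a curve can be computed with NumPy is shown here. It assumes a matrix `X` with one row per station-year and one column per day of the year. The synthetic data, the function name `explained_variance_curve`, and the absence of missing-value handling are all simplifications for illustration; the real measurements contain gaps that would need separate treatment.

```python
import numpy as np

def explained_variance_curve(X):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    X: 2-D array, rows = samples (station-years), columns = days.
    """
    cov = np.cov(X, rowvar=False)              # days x days covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic example: one seasonal pattern with a random amplitude, plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.sin(2 * np.pi * days / 365)
X = rng.normal(size=(200, 1)) * 5 * seasonal + rng.normal(size=(200, 365))

curve = explained_variance_curve(X)
print(curve[:5])  # fraction explained by the top 1, 2, ..., 5 eigenvectors
```

In this toy data a single eigenvector captures most of the variance because only one pattern was planted; real measurements spread the variance over more directions.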

### Percentage of variance explained.
![VarExplained1.png](report_figures/Variances_T.png)
We see that the top 5 eigen-vectors explain 46% of variance for TMIN, 58% for TOBS and 48% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 50% of the variance. Based on that we will dig deeper into the PCA analysis for TOBS.

![VarExplained2.png](report_figures/Variances_Meas.png)

The top 5 eigenvectors explain 9% of the variance for PRCP and 14% for SNOW. Both are low values. On the other hand the top 5 eigenvectors explain 75% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow signals. Based on that we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
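
The smoothing effect of a decaying integral can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the decay factor of 0.9 per day, the 20% chance of snowfall, and the exponential snowfall amounts are invented, not estimated from the data.

```python
import random

def decaying_integral(snow, decay=0.9):
    """Accumulate daily snowfall into a depth that shrinks by `decay` each day."""
    depth, out = 0.0, []
    for s in snow:
        depth = decay * depth + s
        out.append(depth)
    return out

def relative_daily_change(series):
    """Mean absolute day-to-day change divided by the mean level of the series."""
    mean_level = sum(series) / len(series)
    mean_step = sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)
    return mean_step / mean_level

random.seed(0)
# Hypothetical winter: snow falls on roughly one day in five.
snow = [random.expovariate(1 / 10.0) if random.random() < 0.2 else 0.0
        for _ in range(120)]
depth = decaying_integral(snow)

print(relative_daily_change(snow), relative_daily_change(depth))
```

Relative to its own mean level, the simulated depth changes far less from day to day than the snowfall that drives it.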

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 72% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from the middle of November to the end of March, with snow depth peaking in the middle of February.
![SNWD_mean_eigs.png](report_figures/SNWD_eigen.png)

Next we interpret the eigen-functions. The first eigen-function (**eig1**) has a shape very similar to the mean function. So adding or subtracting a multiple of this eigenvector to the mean raises or lowers the whole reconstructed curve in proportion. Hence, this eigenvector represents the overall amount of snow above/below the mean, without changing the distribution over time.

**eig2** and **eig3** are similar in the following way. They both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much). They both start in the middle of November, but have different frequencies.

They can be interpreted as follows:
* **eig2:** more snow from December to mid-February, less snow from mid-February through March.
* **eig3:** more snow in December, less snow in January, more snow in February and much less snow in March.
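
The interpretation above rests on the reconstruction `mean + coeff1*eig1 + coeff2*eig2 + ...`. The sketch below uses a made-up 10-point season instead of the real 365-day vectors. The toy `mean`, `eig1` and `eig2` are assumptions chosen only to mimic the shapes described: `eig1` is proportional to the mean, and `eig2` is positive early and negative late.

```python
import math

def reconstruct(mean, eigvecs, coeffs):
    """Return mean + sum_k coeffs[k] * eigvecs[k], element by element."""
    out = list(mean)
    for vec, c in zip(eigvecs, coeffs):
        out = [o + c * v for o, v in zip(out, vec)]
    return out

# Toy season: snow depth rises and then falls.
mean = [0, 1, 2, 3, 4, 4, 3, 2, 1, 0]
norm = math.sqrt(sum(m * m for m in mean))
eig1 = [m / norm for m in mean]                                   # same shape as the mean
eig2 = [x / norm for x in [0, 1, 2, 3, 4, -4, -3, -2, -1, 0]]     # early plus, late minus

more_snow = reconstruct(mean, [eig1, eig2], [norm, 0.0])          # large positive coeff1
early_snow = reconstruct(mean, [eig1, eig2], [0.0, 0.5 * norm])   # positive coeff2 only

print(sum(mean), sum(more_snow), sum(early_snow))
```

In this toy setup a positive coeff1 doubles the total snow while keeping the shape, whereas a positive coeff2 moves snow toward the early part of the season and leaves the total essentially unchanged.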

### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](report_figures/r1_posCoeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](report_figures/r1_negCoeff1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](report_figures/r2_posCoeff.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](report_figures/r2_negCoeff.png)

Large positive values of coeff2 correspond to a late snow season (most of the snowfall is after mid-February). Negative values of coeff2 correspond to an early snow season (most of the snow is before mid-February).
#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](report_figures/r3_posCoeff.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](report_figures/r3_posCoeff.png)

Large positive values of coeff3 correspond to a snow season with two spikes: one at the start of January, the other at the end of February. Negative values of coeff3 correspond to a season with a single peak at the end of January.



## Analysis of TOBS

We choose to analyze the eigen-decomposition for TOBS because the first 4 eigen-vectors explain 56% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the observed temperature is highest in July and lower during the winter.
![SNWD_mean_eigs.png](report_figures/TOBS_eigen.png)

Next we interpret the eigen-functions. The first eigen-function (**eig1**) has a shape very similar to the negative of the mean function. Hence, this eigenvector represents the overall temperature above/below the mean, with the corresponding coefficient being negative in most cases.

**eig2** and **eig3** are similar in the following way. They both oscillate between positive and negative values. In other words, they correspond to changing the distribution of temperature over the year, but they don't change the overall average (much).

### Examples of reconstructions

#### Coeff1
Coeff1: highly negative
![SNWD_grid_Pos_coeff1.png](report_figures/Tobs_r1_posCoeff.png)
Coeff1: less negative
![SNWD_grid_neg_coeff1.png](report_figures/Tobs_r1_negCoeff.png)

Large negative values of coeff1 correspond to high temperatures, whereas less negative values correspond to lower temperatures.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](report_figures/Tobs_r1_posCoeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](report_figures/Tobs_r2_negCoeff.png)

Large positive values of coeff2 correspond to an early summer season (high temperatures in early April). Negative values of coeff2 correspond to a late summer season.
#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](report_figures/Tobs_r3_posCoeff.png)
Coeff3: less positive
![SNWD_grid_neg_coeff3.png](report_figures/Tobs_r3_negCoeff.png)

Large positive values of coeff3 correspond to samples with zero temperatures in December and January. Lower values of coeff3 correspond to samples where the July data is not available. So there is one case where we get bad results due to the unavailability of data on particular days.


### The variation in TOBS is mostly due to year-to-year variation

We now estimate the relative importance of location-to-location variation relative to year-by-year variation.
These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. 

Here are the results:

| Coefficient | Total RMS | RMS removing mean-by-station | % explained (station) | RMS removing mean-by-year | % explained (year) |
|---|---|---|---|---|---|
| **coeff_1** | 2815.95999742 | 511.376509436 | 82 | 758.324774773 | 73 |
| **coeff_2** | 238.483246372 | 180.44085281 | 24.3 | 125.658578984 | 47.5 |
| **coeff_3** | 1115.30894622 | 164.002888019 | 85.3 | 160.49394686 | 85.6 |

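We see that the variation by year explains more than the variation by station for coeff_2 (47.5% versus 24.3%). This does not hold for coeff_1, where the station mean explains more (82% versus 73%), and for coeff_3 the two are nearly equal (85.3% versus 85.6%). The year effect dominates for coeff_2 because coeff_2 represents the arrival of summer, which varies from year to year.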
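
The kind of computation behind these numbers can be sketched as follows. It assumes a table with one row per year and one column per station (NaN where a station-year is missing), and it assumes the percentage explained is one minus the ratio of the residual to the total, which is consistent with the figures above (for example, 1 - 511.38/2815.96 is about 0.82). The function names and the toy table are hypothetical.

```python
import numpy as np

def mean_square(table):
    """Mean of squared entries, ignoring missing (NaN) values."""
    return np.nanmean(table ** 2)

def fraction_explained(table):
    """table: rows = years, columns = stations, NaN = missing entry.

    Returns (fraction explained by station means, fraction explained by year means).
    """
    total = mean_square(table)
    minus_station = table - np.nanmean(table, axis=0)               # subtract each station's mean
    minus_year = table - np.nanmean(table, axis=1, keepdims=True)   # subtract each year's mean
    return (1 - mean_square(minus_station) / total,
            1 - mean_square(minus_year) / total)

# Toy table: 4 years x 3 stations, where only the station matters.
station_effect = np.array([1.0, -2.0, 3.0])
table = np.tile(station_effect, (4, 1))

by_station, by_year = fraction_explained(table)
print(by_station, by_year)
```

In this toy table all variation comes from the stations, so removing the station means explains everything while removing the year means explains very little.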

## Analysis of residuals for Precipitation
The figure below shows the mean and the top 4 eigen-vectors for precipitation.

We observe that the precipitation period runs from July to October and peaks at the start of August.
![SNWD_mean_eigs.png](report_figures/PRCP_eigen1.png)

As we have seen before, the top eigen-vectors from the PCA analysis do not explain much of the variance for precipitation. One alternative is to apply PCA based on stations. For that, we combine the measurements for each station into a single long array with an entry for each day of each year.

The figure below shows the cumulative distribution function for daily rainfall. It is evident that it rains on only about 15% of the days.
![SNWD_mean_eigs.png](report_figures/rainfall_cdf.png)
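
The fraction of rainy days is one minus the value of the empirical CDF at zero. A small sketch with an invented 20-day record (the data and function names are purely illustrative) shows the relationship:

```python
def fraction_rainy(prcp, threshold=0.0):
    """Fraction of recorded days with precipitation above `threshold`.

    Missing days are represented by None and skipped.
    """
    valid = [p for p in prcp if p is not None]
    return sum(1 for p in valid if p > threshold) / len(valid)

def empirical_cdf(values, x):
    """Fraction of values that are less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

# Hypothetical record: 17 dry days and 3 rainy days (amounts in mm).
prcp = [0.0] * 17 + [1.5, 4.0, 12.0]

print(fraction_rainy(prcp))       # 0.15
print(empirical_cdf(prcp, 0.0))   # 0.85
```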

It is likely to be hard to find correlations between the amount of rain on the same day in different stations, because amounts of rain vary a lot even between close locations. It is more reasonable to compare whether or not it rained on the same day in different stations. As we see from the graph above, in our region it rains on about 15% of the days.

### Measuring statistical significance
We want to find a statistical test for rejecting the null hypothesis that says that the rainfall in the two locations is independent. 
Using the inner product is too noisy, because you multiply the rainfall on the same day in two locations and that product can be very large - leading to a large variance and poor ability to discriminate.
An alternative is to ignore the amount of rain, and just ask whether it rained in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis. 
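
One way to attach a probability to the number of overlaps is the hypergeometric tail: given the number of rainy days at each station, how likely is an overlap at least this large if the rainy days were placed independently? The sketch below is an assumption about how such a test could be set up, not necessarily the computation used for the figures in this report. The function and parameter names are hypothetical.

```python
from math import comb, log

def overlap_tail_probability(n_days, rain_a, rain_b, overlap):
    """P(at least `overlap` shared rainy days) under independence.

    n_days: number of days on which both stations reported
    rain_a, rain_b: number of rainy days at station A and station B
    overlap: observed number of days on which it rained at both
    """
    total = comb(n_days, rain_b)
    tail = sum(comb(rain_a, j) * comb(n_days - rain_a, rain_b - j)
               for j in range(overlap, min(rain_a, rain_b) + 1))
    return tail / total

# Toy case: 10 shared days, 3 rainy days at each station, all 3 coincide.
p = overlap_tail_probability(n_days=10, rain_a=3, rain_b=3, overlap=3)
print(p, -log(p))
```

A small probability (a large negative log probability) means the observed overlap would be unlikely if the two stations were independent.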

The figure below represents the negative log probability for each pair of stations.

![SNWD_mean_eigs.png](report_figures/significane_prcp.png)

We observe that there are many pairs with probability less than 0.05, so we can reject the null hypothesis for those pairs of stations. But there is also a fair number of pairs for which we cannot reject the null hypothesis.


### Calculate the normalized log probability for each pair of stations.

This is the matrix that shows the correlation between each pair of stations. We see 3 clusters of correlated stations in the plot. The first 10 stations are highly correlated, and so are stations 10-23. Stations 25-55 are also correlated, but not as strongly as the first two clusters.
![SNWD_mean_eigs.png](report_figures/showmat_prcp.png)


### Finding structure in the dependency matrix.
To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). As we shall see, the top 10 eigenvectors explain about 73% of the square magnitude of the matrix.
![SNWD_mean_eigs.png](report_figures/pca_Pnorm.png)
![SNWD_mean_eigs.png](report_figures/sorted_mats.png)

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the upper left corner of the second matrix (the upper left one). The stations at positions 0-20 are clearly strongly correlated with each other, even though some stations, in positions 15-18 or so, are more related to each other than to the rest of this block.
This type of organization is called block diagonal and it typically reveals important structure such as grouping or clustering.
A natural next step is to extract the sets of stations that form blocks and plot them on the map to see their spatial relationship.
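
The reordering idea can be sketched on a tiny synthetic dependency matrix. The six "stations", their hidden grouping, and the similarity values 1.0 and 0.2 are invented for illustration. The second eigenvector is used for sorting because in this toy matrix the top one is nearly constant and carries no grouping information.

```python
import numpy as np

# Hidden cluster of each of six hypothetical stations, deliberately interleaved.
groups = np.array([0, 1, 0, 1, 0, 1])

# Symmetric similarity matrix: 1.0 within a cluster, 0.2 across clusters.
M = np.where(groups[:, None] == groups[None, :], 1.0, 0.2)

eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
second = eigvecs[:, -2]                # eigenvector of the 2nd-largest eigenvalue

order = np.argsort(second)             # sort stations by their entry in that eigenvector
M_sorted = M[np.ix_(order, order)]     # apply the same order to rows and columns

print(groups[order])                   # cluster labels are now contiguous
print(M_sorted)
```

After sorting, stations from the same cluster sit next to each other and the matrix shows two blocks along the diagonal.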

### Visualization of these correlated stations on the Map

The figure below shows the stations with very similar (both magnitude and sign) coefficients. This means they have similar weather patterns over different days. One possible inference is that these stations are very close to forests on the map, so they have similar weather patterns even though they are geographically far apart. We can also observe a swarm of 5 stations at the bottom right surrounding a forest with similar weather patterns.
![SNWD_mean_eigs.png](report_figures/closeStations.png)