# Montana Weather Analysis

This report presents a historical analysis of weather patterns in the state of Montana.

The data used here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation on each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/montana/united-states/919" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Montana_and_Weather_averages_Montana.png" src="hw5_figures/outside_source_weather.png" style="width:800px" /></p>

<p>&nbsp;</p>

<p>We see that the daily minimum and maximum temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="hw5_figures/tmin_tmax.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to convert millimeters/day to inches/month. According to our analysis, the average rainfall is 3.00 mm/day, which translates to about 3.55 inches per month. According to US Climate Data, the average rainfall is closer to 4 inches per month. However, there is clear agreement that average precipitation is close to constant throughout the year.</p>
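
The unit conversions used in this comparison can be sketched as follows. This is a minimal illustration, not the notebook's code: the function names and the 30-day month are assumptions made for the example.

```python
MM_PER_INCH = 25.4  # exact definition of the inch in millimeters


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=30.0):
    """Convert an average daily rate in mm to a monthly total in inches.

    days_per_month=30.0 is an assumed month length for this sketch.
    """
    return mm_per_day * days_per_month / MM_PER_INCH


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# 3.00 mm/day is the average rainfall quoted in the text.
print(round(mm_per_day_to_inches_per_month(3.00), 2))  # 3.54
```

With a 30-day month the result is roughly 3.5 inches, in line with the figure quoted above. A longer assumed month gives a slightly larger value.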

<p>&nbsp;<img alt="PRCP.png" src="hw5_figures/prcp.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.
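
As a rough sketch of this computation: once the eigenvalues of the covariance matrix are known, the fraction of variance explained by the top k eigenvectors is the sum of the k largest eigenvalues divided by the sum of all of them. The function name and the eigenvalues below are hypothetical and are not taken from the report's data.

```python
def cumulative_variance_explained(eigenvalues):
    """Fraction of total variance explained by the top k eigenvectors,
    for k = 1..len(eigenvalues)."""
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(ordered)
    fractions = []
    running = 0.0
    for value in ordered:
        running += value
        fractions.append(running / total)
    return fractions


# Hypothetical eigenvalues of a covariance matrix (illustration only).
example_eigenvalues = [4.0, 2.0, 1.0, 1.0]
print(cumulative_variance_explained(example_eigenvalues))
# [0.5, 0.75, 0.875, 1.0]
```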

### Percentage of variance explained.
![VarExplained1.png](hw5_figures/variance_1.png)
We see that the top 5 eigen-vectors explain 22% of variance for TMIN, 34% for TOBS and 26% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 24% of the variance.

![VarExplained2.png](hw5_figures/variance_2.png)

The top 5 eigenvectors explain 11% of the variance for PRCP and 11% for SNOW. Both are low values. On the other hand the top 5 eigenvectors explain 81% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow signals. Based on that we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less from day to day and between the same date in different years.
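
To make the "decaying integral" idea concrete, here is a toy model in which a fixed fraction of the accumulated snow survives each day and new snowfall is added on top. The exponential decay and the `decay` parameter are assumptions made for illustration. They are not a model fitted in this report.

```python
def decaying_integral(snowfall, decay=0.9):
    """Accumulate daily snowfall while only a `decay` fraction of the
    existing depth survives from one day to the next (assumed toy model)."""
    depth = []
    current = 0.0
    for fall in snowfall:
        current = decay * current + fall
        depth.append(current)
    return depth


# One large snowfall, two dry days, then a smaller snowfall.
print(decaying_integral([10.0, 0.0, 0.0, 5.0], decay=0.5))
# [10.0, 5.0, 2.5, 6.25]
```

The depth series changes gradually on dry days, whereas the snowfall series jumps between zero and large values. This is the smoothing effect described above.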

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from October to June, with the snow depth peaking at the start of January.
![SNWD_mean_eigs.png](hw5_figures/snwd_mean_eigvecs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** more snow in Jan-Feb, less snow in March-June and Oct-Dec.
* **eig3:** more snow in mid Oct-Dec, less snow in mid Jan-June, slightly more snow at the start of Jan.
* **eig4:** more snow in Jan, slightly more snow in mid March-June, less snow in Feb-mid March, less snow in Oct-Dec.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](hw5_figures/SNWD_positive_coeff_1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](hw5_figures/SNWD_negative_coeff_1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](hw5_figures/SNWD_positive_coeff_2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](hw5_figures/SNWD_negative_coeff_2.png)

Large positive values of coeff2 correspond to an early snow season (most of the snowfall is before mid March). Negative values of coeff2 correspond to a late snow season (most of the snowfall is after mid March).

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](hw5_figures/SNWD_positive_coeff_3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](hw5_figures/SNWD_negative_coeff_3.png)

Large positive values of coeff3 correspond to low snowfall during Jan-April, with mostly no snow from April to the end of June. Negative values of coeff3 correspond to a season with high snowfall from Jan to the end of June.
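
The reconstructions shown in these grids follow the usual PCA recipe: project the mean-subtracted signal onto each eigenvector to get a coefficient, then add the coefficient-weighted eigenvectors back onto the mean. Below is a minimal sketch on a toy 4-day "season". The helper names and all numbers are hypothetical, and the eigenvectors are assumed to be orthonormal.

```python
def project(signal, mean, eigenvectors):
    """Coefficients of (signal - mean) on each orthonormal eigenvector."""
    centered = [s - m for s, m in zip(signal, mean)]
    return [sum(c * e for c, e in zip(centered, vec)) for vec in eigenvectors]


def reconstruct(mean, eigenvectors, coefficients):
    """Mean plus the coefficient-weighted sum of the given eigenvectors."""
    result = list(mean)
    for coeff, vec in zip(coefficients, eigenvectors):
        for i, e in enumerate(vec):
            result[i] += coeff * e
    return result


# Toy data (illustration only): two orthonormal eigenvectors.
mean = [1.0, 2.0, 2.0, 1.0]
eigs = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
signal = [3.0, 4.0, 2.0, 1.0]

coeffs = project(signal, mean, eigs)
print(coeffs)                                  # [2.0, 2.0]
print(reconstruct(mean, eigs[:1], coeffs[:1])) # [2.0, 3.0, 3.0, 2.0]
print(reconstruct(mean, eigs, coeffs))         # [3.0, 4.0, 2.0, 1.0]
```

Using only the first eigenvector shifts the whole curve up, while adding the second one moves snow from the later days to the earlier days.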


## The variation in the timing of snow is mostly due to year-to-year variation
In the previous section we see the variation of Coeff1, which corresponds to the total amount of snow, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.

These are measured using the percentage by which the root-mean-square (RMS) of the coefficient is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

| Coefficient | Total RMS | RMS removing mean-by-station | % explained by station | RMS removing mean-by-year | % explained by year |
|---|---|---|---|---|---|
| coeff_1 | 1757.62414965 | 911.871235198 | 48.1 | 1510.65793281 | 14.05 |
| coeff_2 | 756.872346905 | 578.445915616 | 23.57 | 555.289152167 | 26.63 |
| coeff_3 | 530.7728529 | 499.701075446 | 5.85 | 345.465554066 | 34.91 |


For coeff_2 and coeff_3, which as we saw above relate to the timing of snowfall, the variation by year explains more than the variation by station. The opposite holds for coeff_1, which relates to the total snowfall: there, the variation by station explains far more (48.1% vs. 14.05%).
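
The "percentage explained" figures are consistent with the formula 100 * (1 - residual RMS / total RMS). Here is a minimal sketch of that calculation, assuming the data is a list of hypothetical `(station, year, coefficient)` tuples. The notebook's actual data layout may differ.

```python
from collections import defaultdict
from math import sqrt


def rms(values):
    """Root-mean-square of a list of numbers."""
    return sqrt(sum(v * v for v in values) / len(values))


def percent_explained(entries, key_index):
    """entries: list of (station, year, coefficient) tuples.
    key_index: 0 to remove the mean by station, 1 to remove it by year."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[key_index]].append(entry[2])
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    total = rms([e[2] for e in entries])
    residual = rms([e[2] - means[e[key_index]] for e in entries])
    return 100.0 * (1.0 - residual / total)


# Toy data (illustration only): two stations that differ a lot,
# and two years that differ only a little.
entries = [("A", 2000, 3.0), ("A", 2001, 5.0),
           ("B", 2000, -5.0), ("B", 2001, -3.0)]
by_station = percent_explained(entries, 0)
by_year = percent_explained(entries, 1)
print(round(by_station, 2), round(by_year, 2))
```

In this toy example the station means absorb most of the RMS, which is the same pattern as coeff_1 in the table above.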

## Analysis of precipitation

To analyze precipitation, we analyze how various stations correlate with each other. 

We first combine the measurements for each station into a single long array with an entry for each day of each year. This is the graph we obtained for the cumulative distribution of daily rainfall.
![CDF_rainfall.png](hw5_figures/CDF_rainfall.png)

We observe that on approximately 80% of days there is no rainfall, 96% of days have rainfall of less than 100mm/day, and only 4% of days over all years have more than 100mm/day of rainfall.
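
Values like these can be read off an empirical cumulative distribution. The sketch below assumes the daily values are held in a plain list with `None` marking missing days. The data shown is made up for illustration.

```python
def fraction_at_or_below(values, threshold):
    """Empirical CDF at `threshold`, skipping missing (None) entries."""
    observed = [v for v in values if v is not None]
    return sum(1 for v in observed if v <= threshold) / len(observed)


# Hypothetical daily rainfall values; None marks a missing measurement.
daily_rain = [0, 0, 0, 0, 12, None, 0, 150, 0, 0, 0]
print(fraction_at_or_below(daily_rain, 0))    # 0.8
print(fraction_at_or_below(daily_rain, 100))  # 0.9
```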


### Conclusions
It is likely to be hard to find correlations between the **amount** of rain on the same day in different stations, because amounts of rain vary a lot even between nearby locations. It is more reasonable to compare whether or not it rained on the same day in different stations. As we see from the graph above, in our region it rains on about one fifth of the days.

### Measuring statistical significance
We want to find a statistical test for rejecting the null hypothesis that says that the rainfall in the two locations is independent.

Using the inner product is too noisy, because you multiply the rainfall on the same day in two locations and that product can be very large - leading to a large variance and poor ability to discriminate.

An alternative is to ignore the amount of rain, and just ask whether it rained in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.
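
One way to put a probability on the number of overlaps is the hypergeometric tail: if station A has rain on `rain_a` of `n_days` days and station B on `rain_b` days, and the two are independent, how likely is it that at least `overlap` rain days coincide? The sketch below takes this approach. It is an assumption about the test, not necessarily the exact formula used in the notebook, and dividing the log probability by the number of days is one plausible normalization.

```python
from math import comb, log


def overlap_tail_probability(n_days, rain_a, rain_b, overlap):
    """P(at least `overlap` shared rain days) if station B's rain days were
    placed at random among `n_days` days, independently of station A."""
    max_overlap = min(rain_a, rain_b)
    total = comb(n_days, rain_b)
    tail = sum(
        comb(rain_a, k) * comb(n_days - rain_a, rain_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def normalized_log_probability(n_days, rain_a, rain_b, overlap):
    """Log of the tail probability divided by the number of days
    (the normalization is an assumption of this sketch)."""
    return log(overlap_tail_probability(n_days, rain_a, rain_b, overlap)) / n_days


# Toy example: 10 days, 5 rain days at each station, all 5 coincide.
print(overlap_tail_probability(10, 5, 5, 5))  # 1/252, about 0.004
```

A very small probability (a very negative log probability) is evidence against independence for that pair of stations.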



This is the curve representing the distribution of the normalized log probability for each pair of stations.
![PRCP_pvalue.png](hw5_figures/PRCP_pvalue.png)

We can observe that the null hypothesis can be rejected for very few station pairs. We further analyze the correlation between 99 stations with a dependency matrix/heatmap.
![PRCP_stations_heatmap.png](hw5_figures/PRCP_stations_heatmap.png)

### Finding structure in the dependency matrix.
The 99x99 matrix above shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.

We see immediately that the 3rd through 8th stations are highly correlated with each other. A group of very correlated stations is: 'USC00244522', 'USC00242347', 'USC00247342', 'USC00247214', 'USC00249192', 'USC00243463'

To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). As we shall see, the top 10 eigenvectors explain about 65% of the squared magnitude of the matrix.
![PRCP_stations_var_explained.png](hw5_figures/PRCP_stations_var_explained.png)

### Ordering the dependency matrix by eigenvectors
![PRCP_ordered_rependency.png](hw5_figures/PRCP_ordered_rependency.png)
When we reorder the rows and columns of the matrix using one of the top 4 eigenvectors, the grouping of the stations becomes more evident. For example, consider the upper left corner of the second matrix (the upper left one). The stations at positions 0-18 are clearly strongly correlated with each other. Stations in positions 0-5 are more related to each other than to the rest of this block. Similarly, consider the lower left corner of the first matrix (the upper right one). The stations at positions 75-80 are strongly correlated with each other, and the first 10 stations are also correlated.

This type of organization is called **Block Diagonal** and it typically reveals important structure such as grouping or clustering.
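
The reordering step can be sketched as follows: sort the station indices by their entry in the chosen eigenvector, then apply the same permutation to both rows and columns. The matrix and vector below are hypothetical and only illustrate how interleaved groups become diagonal blocks.

```python
def reorder_by_vector(matrix, vector):
    """Permute rows and columns of a square matrix by the sort order of `vector`."""
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    reordered = [[matrix[i][j] for j in order] for i in order]
    return order, reordered


# Hypothetical 4x4 dependency matrix: stations 0 and 2 form one group,
# stations 1 and 3 another, but the groups are interleaved.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
vector = [0.5, -0.5, 0.5, -0.5]  # stands in for one eigenvector

order, blocks = reorder_by_vector(matrix, vector)
print(order)  # [1, 3, 0, 2]
for row in blocks:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [0, 0, 1, 1]
# [0, 0, 1, 1]
```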

### Analyzing the spatial location of correlated stations
![PRCP_coeff1_map10.png](hw5_figures/PRCP_coeff1_map10.png)
This map shows the spatial arrangement of the first 10 stations as ordered by the first eigenvector, which we found to be correlated in the previous dependency matrix analysis. As we can observe, the stations are spatially close to each other, which might explain the higher correlation in rainfall trends.

## Analysis of observed temperature (TOBS)

To analyze observed temperature, we first analyze the eigen-decomposition because the first 4 eigen-vectors explain 34% of the variance.

First, we graph the mean and the top 3 eigen-vectors.

We observe that mid July marks the peak temperature. Because Montana lies at a fairly high northern latitude, it is cold for much of the year, and the highest observed temperature is approximately 25°C.
![TOBS_mean_eig.png](hw5_figures/TOBS_mean_eig.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the negative of the mean function. The main difference is that the eigen-function is close to zero during July-October while the mean is not. The interpretation of this shape is that eig1 represents the overall temperature being above or below the mean, without changing the distribution over time.

**eig2 and eig3** are similar in the following way. They both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the temperature, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** warmer in Jan-April and Dec, cooler in May-Nov.
* **eig3:** warmer in mid April-Nov, cooler in Jan-March, slightly warmer in Dec.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![TOBS_pos_coeff1.png](hw5_figures/TOBS_pos_coeff1.png)
Coeff1: most negative
![TOBS_neg_coeff1.png](hw5_figures/TOBS_neg_coeff1.png)
Large positive values of coeff1 correspond to lower-than-average temperatures. Low (most negative) values correspond to higher-than-average temperatures. Also, the highest temperature for low values of coeff1 is higher than the highest temperature for large values.

## The variation in TOBS is mostly due to year-to-year variation
In the previous section we see the variation of Coeff1, which corresponds to the overall temperature level, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.

These are measured using the percentage by which the root-mean-square (RMS) of the coefficient is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

| Coefficient | Total RMS | RMS removing mean-by-station | % explained by station | RMS removing mean-by-year | % explained by year |
|---|---|---|---|---|---|
| coeff_1 | 516.704145862 | 338.097680685 | 34.56 | 479.004787549 | 7.3 |
| coeff_2 | 219.808955442 | 206.68609511 | 5.97 | 100.102580817 | 54.46 |
| coeff_3 | 197.843887824 | 190.730385402 | 3.59 | 72.0407486943 | 63.58 |


For coeff_2 and coeff_3, which as we saw above relate to the arrival of summer, the variation by year explains more than the variation by station. The opposite holds for coeff_1, which relates to the overall temperature: there, the variation by station explains more (34.56% vs. 7.3%).

## Conclusion

We analysed three major measurements in this report: snow depth, TOBS and precipitation. We see that the variation of snow depth and TOBS can be explained better by performing PCA over the days of the year, because the variation of these measurements is temporal. However, the variation of precipitation can be explained better when we perform PCA across stations.

We also did a year-wise analysis for snow depth, where we took the mean over all days to represent a year and then built one large vector per station covering every year. However, this analysis didn't explain much of the variation. The notebook for this new analysis is included in the code.