# North Dakota Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of Massachusetts.

The data we will use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>


<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.png" src="r_figures/Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.png" style="height:300px;" /></p>


<p>We see that the min and max daily&nbsp;temperature agree with the ones we got from our data, once we translate Fahrenheit to Centigrade.</p>


<p><img alt="TMIN,TMAX.png" src="r_figures/TMIN,TMAX.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to translate millimeter/day to inches/month. According to our analysis the average rainfall is 1.22 mm/day, which translates to about 1.44 inches per month. According to US-Climate-Data the average rainfall is closer to 1.48 inches per month. One can see that the precipitation increases from April until August.</p>
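
The conversion above can be checked with a few lines of Python. This is a minimal sketch that assumes a nominal 30-day month; the function name and the constants are illustrative and not taken from the original analysis code.

```python
MM_PER_INCH = 25.4
DAYS_PER_MONTH = 30  # assumption: a nominal 30-day month


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=DAYS_PER_MONTH):
    """Convert an average daily rainfall in mm to inches per month."""
    return mm_per_day * days_per_month / MM_PER_INCH


inches = mm_per_day_to_inches_per_month(1.22)
print(round(inches, 2))  # 1.44
```

Using a longer month (for example 30.4 days) would give a slightly larger value, so the small gap to the reference figure partly depends on this choice.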

<p>&nbsp;<img alt="PRCP.png" src="r_figures/PRCP.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
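
As a rough illustration of what "variance explained" means here, the sketch below computes the cumulative fraction of the total variance captured by the top-k eigenvectors from a list of eigenvalues. The eigenvalues are made up for the example, and the function name is hypothetical; it is not the code that produced the plots.

```python
from itertools import accumulate


def variance_explained(eigenvalues):
    """Cumulative fraction of variance explained by the top-k eigenvectors."""
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


# Hypothetical eigenvalues, for illustration only (not from the report's data).
example = variance_explained([4.0, 3.0, 2.0, 1.0])
print(example)
```

The k-th entry of the result is the value one would read off the curve at k eigenvectors.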

### Percentage of variance explained.
![VarExplained1.png](r_figures/VarExplained1.png)
We see that the top 5 eigen-vectors explain 11% of variance for TMIN, 34% for TOBS and 20% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 23% of the variance. For TMAX, in contrast, the variance explained keeps increasing as more eigenvectors are added.

![VarExplained2.png](r_figures/VarExplained2.png)

The top 5 eigenvectors explain 8% of the variance for PRCP and 12% for SNOW. Both are low values. On the other hand the top 5 eigenvectors explain 88% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow signals. Based on that we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date on different years.
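
To make the "decaying integral" idea concrete, here is a toy model in which each day's depth is a fixed fraction of the previous day's depth plus the new snowfall. The decay factor of 0.5 and the snowfall values are assumptions chosen for readability, not estimates from the data.

```python
def snow_depth(snowfall, decay=0.5):
    """Toy model: depth[t] = decay * depth[t-1] + snowfall[t].

    `decay` is a hypothetical melt/compaction factor between 0 and 1.
    """
    depth = 0.0
    depths = []
    for fall in snowfall:
        depth = decay * depth + fall
        depths.append(depth)
    return depths


daily_snow = [0, 20, 0, 0, 10, 0]  # made-up daily snowfall
depths = snow_depth(daily_snow)
print(depths)  # [0.0, 20.0, 10.0, 5.0, 12.5, 6.25]
```

In this toy series the snowfall drops back to zero the day after each storm, while the depth declines gradually.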

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 88% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from November to the end of May, with the snow depth peaking in the middle of February.
![SNWD_mean_eigs.png](r_figures/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow in Jan - mid Feb, more snow in mid Feb - Apr.
* **eig3:** more snow in Jan - mid Feb, less snow in mid Feb - mid Mar, slightly more snow during mid Mar.
* **eig4:** less snow in Dec, more snow in mid Jan to mid Feb, less snow after mid Feb.
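
The role of the coefficients can be sketched with a toy reconstruction: a season is approximated as the mean plus a weighted sum of eigenvectors. The four-point "season", the two vectors and the coefficients below are invented for illustration; they only mimic the qualitative behaviour described above (one vector with a constant sign, one that changes sign).

```python
def reconstruct(mean, eigenvectors, coeffs):
    """Return mean + sum_k coeffs[k] * eigenvectors[k], element by element."""
    recon = list(mean)
    for c, vec in zip(coeffs, eigenvectors):
        for day, v in enumerate(vec):
            recon[day] += c * v
    return recon


# Toy data: a 4-point "season" (all values are hypothetical).
mean = [10.0, 20.0, 20.0, 10.0]
eig1 = [0.5, 0.5, 0.5, 0.5]    # same sign everywhere: changes the total
eig2 = [-0.5, -0.5, 0.5, 0.5]  # changes sign: shifts snow from early to late

more_snow = reconstruct(mean, [eig1, eig2], [8.0, 0.0])
late_snow = reconstruct(mean, [eig1, eig2], [0.0, 8.0])

print(more_snow, sum(more_snow))  # [14.0, 24.0, 24.0, 14.0] 76.0
print(late_snow, sum(late_snow))  # [6.0, 16.0, 24.0, 14.0] 60.0
```

A positive coefficient on the sign-changing vector moves snow toward the end of the season while leaving the total equal to that of the mean, which is the behaviour attributed to eig2-eig4.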


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](r_figures/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](r_figures/SNWD_grid_neg_coeff1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](r_figures/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](r_figures/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to a late snow season (most of the snowfall is after mid-Feb). Negative values for coeff2 correspond to an early snow season (most of the snow is before mid-Feb).

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](r_figures/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](r_figures/SNWD_grid_neg_coeff3.png)

Large positive values of coeff3 correspond to an early snow season. Negative values of coeff3 correspond to a late snow season.

#### Coeff4
Coeff4: most positive
![SNWD_grid_Pos_coeff4.png](r_figures/SNWD_grid_Pos_coeff4.png)
Coeff4: most negative
![SNWD_grid_neg_coeff4.png](r_figures/SNWD_grid_neg_coeff4.png)

Large positive values of coeff4 correspond to a snow season with two peaks. Negative values of coeff4 correspond to a snow season with a single peak.


## The variation in the timing of snow is mostly due to year-to-year variation
In the previous section we see the variation of Coeff1, which corresponds to the total amount of snow, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

**coeff_1**  
total RMS                   = 1404.59  
RMS removing mean-by-station= 1227.21, fraction explained=12.62  
RMS removing mean-by-year   = 861.91, fraction explained=38.63  

**coeff_2**  
total RMS                   = 646.118  
RMS removing mean-by-station= 617.69, fraction explained= 4.42  
RMS removing mean-by-year   = 389.18, fraction explained=39.7

**coeff_3**  
total RMS                   = 593.15  
RMS removing mean-by-station= 571.38, fraction explained= 3.67  
RMS removing mean-by-year   = 304.36, fraction explained=48.68

We see that the variation by year explains more than the variation by station. However, this effect is weaker for coeff_1, which has to do with the total snowfall, than for coeff_2,3,4 which, as we saw above, have to do with the timing of snowfall. We see that for coeff_2,3 the stations explain 3.5-4.5% of the variance while the year explains 40-50%.
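
The "fraction explained" figures listed above appear consistent with the percentage reduction in RMS, that is `100 * (1 - RMS_after / RMS_total)`; for example, `100 * (1 - 1227.21 / 1404.59)` is about 12.6. The sketch below implements that reading on a tiny made-up table. The function names, the `(station, year)` dictionary layout and the numbers are assumptions for illustration, not the original analysis code.

```python
from collections import defaultdict
from math import sqrt


def rms(values):
    """Root mean square of a sequence of numbers."""
    return sqrt(sum(v * v for v in values) / len(values))


def percent_explained(table, key_index):
    """Percent reduction in RMS after removing group means.

    table: dict mapping (station, year) -> coefficient value.
    key_index: 0 to remove the mean-by-station, 1 to remove the mean-by-year.
    """
    groups = defaultdict(list)
    for key, value in table.items():
        groups[key[key_index]].append(value)
    means = {g: sum(vs) / len(vs) for g, vs in groups.items()}
    residuals = [value - means[key[key_index]] for key, value in table.items()]
    return 100.0 * (1.0 - rms(residuals) / rms(list(table.values())))


# Hypothetical coefficients: the year matters a lot, the station hardly at all.
toy = {
    ("A", 2000): 10.0, ("B", 2000): 12.0,
    ("A", 2001): -10.0, ("B", 2001): -12.0,
}
by_station = percent_explained(toy, 0)
by_year = percent_explained(toy, 1)
print(round(by_station, 2), round(by_year, 2))
```

In this toy table the per-station means are zero, so removing them changes nothing, while removing the per-year means removes most of the RMS.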

## Analysis of Precipitation

Here we plot the cumulative distribution of daily rainfall, accumulated over the years and across all the stations.
![PRCP_CDF_rainfall.png](r_figures/PRCP_CDF_rainfall.png)

We observe that about four-fifths of the days in a year receive no rainfall, whereas about 3% of the days receive more than 100 mm/day of rainfall. The remaining days, roughly 19%, receive less than 100 mm of rainfall.
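
Fractions like these can be read off the CDF or computed directly from the daily values. The sketch below shows the direct computation on a made-up list of ten daily readings; the threshold of 100 matches the text, while the function name and data are hypothetical.

```python
def rain_day_fractions(daily_prcp, heavy_threshold=100):
    """Return the fractions of dry, light-rain and heavy-rain days."""
    n = len(daily_prcp)
    dry = sum(1 for p in daily_prcp if p == 0) / n
    heavy = sum(1 for p in daily_prcp if p > heavy_threshold) / n
    light = 1.0 - dry - heavy
    return dry, light, heavy


# Ten made-up daily readings: eight dry days, one light, one heavy.
dry, light, heavy = rain_day_fractions([0, 0, 0, 0, 0, 0, 0, 0, 50, 150])
print(dry, round(light, 2), heavy)
```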

### Conclusions
It is likely to be hard to find correlations between the **amount** of rain on the same day in different stations, because amounts of rain vary a lot even between close locations. It is more reasonable to compare whether or not it rained on the same day in different stations. As we see from the graph above, in our region it rains on about one fifth of the days.

### Measuring statistical significance
We want to find a statistical test for rejecting the null hypothesis that says that the rainfall in the two locations is independent.

An alternative is to ignore the amount of rain, and just ask whether it rained in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.
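
One way to compute such a probability is a hypergeometric tail, sketched below. It assumes that, under the null hypothesis, the rain days of one station are a uniformly random subset of the days on which both stations reported; this model, the function names, and the division by the number of days are assumptions for illustration and may differ from the computation behind the plots.

```python
from math import comb, log


def overlap_tail_prob(n_days, rain_a, rain_b, overlap):
    """P(overlap >= observed) if the two stations' rain days are independent.

    Model (an assumption): station A's rain days are fixed and station B's
    rain_b rain days are a uniformly random subset of the n_days days.
    """
    upper = min(rain_a, rain_b)
    tail = sum(
        comb(rain_a, k) * comb(n_days - rain_a, rain_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(n_days, rain_b)


def neg_log_prob_per_day(n_days, rain_a, rain_b, overlap):
    """-log(p) / n_days; dividing by n_days is an assumed normalization."""
    return -log(overlap_tail_prob(n_days, rain_a, rain_b, overlap)) / n_days


p_all = overlap_tail_prob(10, 3, 3, 3)   # all three rain days coincide
p_none = overlap_tail_prob(10, 3, 3, 0)  # zero or more days coincide
print(p_all, p_none)
```

A small probability (equivalently, a large negative log probability) means the observed overlap would be unlikely if the two stations were independent.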

Here we see the distribution of the negative normalized log probability for pairs of stations.
![PRCP_p-value.png](r_figures/PRCP_p-value.png)
We can see that some station pairs have a significance score below 0.05; these can be treated as independent and discarded.
![PRCP_stations_correlation.png](r_figures/PRCP_stations_correlation.png)


### Finding structure in the dependency matrix.
The matrix above shows, for each pair of stations, the normalized log probability that the overlap in rain days is random. The size of the matrix is 122x122 denoting 122 stations. The darker the cell of the matrix, the more correlated the station pairs are. 

It is worth noting that the first 18 pairs of stations are more correlated than the other pairs of stations.

To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). As we can see below, the top 10 eigenvectors explain about 80% of the square magnitude of the matrix.

![PRCP_explained_variance.png](r_figures/PRCP_explained_variance.png)
For further analysis we will look at only the top 4 eigenvectors. We would now like to reorder the stations so that highly correlated stations are placed close to each other. We do this by ordering the stations based on their eigenvector coefficients.

### Ordering dependency matrix based on eigen vectors.
![PRCP_Block-Diagonal.png](r_figures/PRCP_Block-Diagonal.png)
When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the first matrix. The stations at positions 0-30 and 80-120 are clearly less correlated with each other. The stations in the 60-80 region are strongly correlated.
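
The reordering step can be sketched as follows: sort the station indices by their coefficient in one eigenvector, then apply the same permutation to both the rows and the columns of the matrix. The 4x4 matrix and the vector below are invented so that the block structure is easy to see; they are not taken from the 122-station data.

```python
def reorder_by_vector(matrix, vector):
    """Permute rows and columns of a square matrix by sorting on `vector`."""
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    reordered = [[matrix[i][j] for j in order] for i in order]
    return order, reordered


# Hypothetical dependency matrix: stations 0 and 2 are related, as are 1 and 3.
dependency = [
    [1.0, 0.0, 0.9, 0.0],
    [0.0, 1.0, 0.0, 0.8],
    [0.9, 0.0, 1.0, 0.0],
    [0.0, 0.8, 0.0, 1.0],
]
# Hypothetical eigenvector coefficients, one per station.
coefficients = [0.5, -0.5, 0.5, -0.5]

order, blocks = reorder_by_vector(dependency, coefficients)
print(order)  # [1, 3, 0, 2]
for row in blocks:
    print(row)
```

After the permutation the related stations sit next to each other, so the large entries gather into blocks along the diagonal.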

#### Spatial visualization of correlated data
Here we plot the correlated stations from the above dependency matrix, taking the top-left 20 stations of the first matrix shown above.
![PRCP_spatial_analysis.png](r_figures/PRCP_spatial_analysis.png)
We can see that the majority of the stations shown above are located near the same roadways, and thus they have similar precipitation patterns throughout the year.

## Analysis of TOBS
We now estimate the relative importance of location-to-location variation relative to year-by-year variation for TOBS.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

**coeff_1**  
total RMS                   = 552.29  
RMS removing mean-by-station= 435.51, fraction explained=21.14  
RMS removing mean-by-year   = 474.62, fraction explained=14.06  

**coeff_2**  
total RMS                   = 200.52  
RMS removing mean-by-station= 198.68, fraction explained= 0.91  
RMS removing mean-by-year   = 59.09, fraction explained=70.52

**coeff_3**  
total RMS                   = 187.47  
RMS removing mean-by-station= 185.23, fraction explained= 1.19  
RMS removing mean-by-year   = 61.93, fraction explained=66.93

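We see that for coeff_2 and coeff_3 the variation by year explains far more (about 67-71%) than the variation by station (about 1%), which is similar to the analysis done for SNWD. For coeff_1 the pattern is reversed: the station explains about 21% while the year explains about 14%.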

We choose to analyze the eigen-decomposition for TOBS because the first 4 eigen-vectors explain 33% of the variance. Also, the analysis above shows that variation by year accounts for most of the variation in coeff_2 and coeff_3 for TOBS.

First, we graph the mean and the top 4 eigen-vectors.

![TOBS_mean_eigs.png](r_figures/TOBS_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape similar to the mean function: it rises when the mean rises and falls when the mean falls. The main difference is that it rises and falls more slowly than the mean. The interpretation of this shape is that eig1 represents the overall temperature above/below the mean, but without changing the distribution over time.

**eig2 and eig3** are similar in the following way. They both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the temperature over the year, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** temperature rises from Apr to mid-May. Overall it oscillates about 0.
* **eig3:** temperature falls from mid-Feb to mid-Apr. The rest of the time it oscillates about 0, with a slight rise during Nov.

### Examples of reconstructions

#### Coeff2
Coeff2: most positive
![TOBS_grid_Pos_coeff2.png](r_figures/TOBS_grid_Pos_coeff2.png)
Coeff2: most negative
![TOBS_grid_neg_coeff2.png](r_figures/TOBS_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to the places which are farther away from the mean temperature. Whereas large negative values of coeff2 correspond to the places which have temperature close to the mean temperature.
#### Coeff3
Coeff3: most positive
![TOBS_grid_Pos_coeff3.png](r_figures/TOBS_grid_Pos_coeff3.png)
Coeff3: most negative
![TOBS_grid_neg_coeff3.png](r_figures/TOBS_grid_neg_coeff3.png)

Large negative values of coeff3 correspond to places whose minimum and maximum temperatures range from -100 to 150C, whereas large positive values of coeff3 correspond to places whose minimum and maximum temperatures range from -100 to 200C.