# Montana Lewistown Weather Analysis

This report analyzes historical weather patterns in an area that approximately overlaps Lewistown, Montana.

The data used here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="sanity_check.png" src="hw5_figures/sanity_check.png" style="height:424px; width:649px"/></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we translate Fahrenheit to Celsius.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="hw5_figures/TMIN,TMAX.png" style="height:348px; width:945px" /></p>

<p>To compare the precipitation&nbsp;we need to translate millimeters/day to inches/month. We first compare the peak rainfall in June. According to our analysis, the average rainfall in June is 2.50 mm/day, which translates to about 2.95 inches&nbsp;per month. According to US-Climate-Data, the average rainfall in June is close to 3 inches per month. In addition, there is clear agreement between </p>

<p>&nbsp;<img alt="PRCP.png" src="hw5_figures/PRCP.png" style="height:348px; width:618px" /></p>
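
The unit conversion used in the June comparison can be sketched as follows. This is a minimal illustration, not the code used for the report; the function name and the 30-day month length are assumptions made for the example.

```python
MM_PER_INCH = 25.4  # exact definition of the inch in millimeters

def mm_per_day_to_inches_per_month(mm_per_day, days_in_month=30):
    """Convert an average daily precipitation rate (mm/day) to a monthly total (inches).

    days_in_month defaults to 30 (June); this helper is hypothetical.
    """
    return mm_per_day * days_in_month / MM_PER_INCH

# June average from our analysis: 2.50 mm/day
june_inches = mm_per_day_to_inches_per_month(2.50, days_in_month=30)
print(f"{june_inches:.2f} inches/month")  # prints "2.95 inches/month"
```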


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](hw5_figures/VarExplained1.png)

We see that the top 5 eigenvectors explain 23% of the variance for TMIN, 34% for TOBS and 26% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 24% of the variance.
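
The variance-explained curves can be computed from the eigenvalues of the covariance matrix. The sketch below shows the idea on synthetic data; it assumes a complete matrix with one row per station-year and one column per day of the year (the real data contains missing values, which this example does not handle), and the function name is hypothetical.

```python
import numpy as np

def cumulative_variance_explained(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    data: 2-D array, one row per station-year, one column per day of year.
    Assumes no missing values (an assumption made for this sketch).
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0, None)       # remove tiny negative rounding errors
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic example: a seasonal curve with a random yearly amplitude, plus noise
rng = np.random.default_rng(0)
days = np.arange(365)
season = np.cos(2 * np.pi * days / 365)
toy = rng.normal(size=(200, 1)) * season + 0.5 * rng.normal(size=(200, 365))

explained = cumulative_variance_explained(toy)
print(explained[:5])  # fraction explained by the top 1..5 eigenvectors
```

Entry `k - 1` of the returned array is the fraction of variance explained by the top `k` eigenvectors, which is the quantity plotted in the figures.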

![VarExplained2.png](hw5_figures/VarExplained2.png)

The top 5 eigenvectors explain 11% of the variance for PRCP and 11% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 82% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow-depth signals. Based on that, we will dig deeper into the PCA analysis for snow depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date on different years.
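
To illustrate the "decaying integral" argument, here is a small simulation. It is only a toy model: the recurrence `depth[t] = decay * depth[t-1] + snow[t]`, the decay rate and the synthetic snowfall are all assumptions, not fitted to the Lewistown data.

```python
import random
import statistics

def simulate_depth(snowfall, decay=0.9):
    """Toy snow-depth model: yesterday's depth shrinks by `decay`, then today's snow is added."""
    depth, total = [], 0.0
    for s in snowfall:
        total = decay * total + s
        depth.append(total)
    return depth

def lag1_autocorrelation(series):
    """Correlation between each day's value and the next day's value."""
    mean = statistics.fmean(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((a - mean) ** 2 for a in series)
    return num / den

random.seed(0)
# Synthetic snowfall: snow on about 20% of days, with a mean of 20 mm on those days
snow = [random.expovariate(1 / 20) if random.random() < 0.2 else 0.0
        for _ in range(2000)]
depth = simulate_depth(snow)

print("SNOW lag-1 autocorrelation:", lag1_autocorrelation(snow))
print("SNWD lag-1 autocorrelation:", lag1_autocorrelation(depth))
```

In this toy model the daily snowfall is essentially uncorrelated from one day to the next, while the depth changes slowly, which is why a few smooth eigenvectors can describe it well.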


## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from October to the end of June, with the snow depth peaking at the start of February.

![SNWD_mean_eigs.png](hw5_figures/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (**eig1**) has a shape very similar to the mean function; it can be seen as a scaled version of the mean function. The main difference is that the eigen-function is smaller than the mean function during October - January. The interpretation of this shape is that **eig1** represents the overall amount of snow above/below the mean, without changing the distribution over time.

**eig2** and **eig4** are similar in the following way: both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they do not change the total much.

**eig3**, on the other hand, is above zero in October - December. It may represent the overall amount of snow above/below the mean, especially in October - December.


They can be interpreted as follows:
* **eig2:** more snow in January - February, less snow in March - June.
* **eig3:** slightly more snow in November - December, less snow in March - April.
* **eig4:** more snow in January and April - May, less snow in February - March.



### Examples of reconstructions

#### Coeff0
Coeff0: most positive
![SNWD_grid_Pos_coeff0.png](hw5_figures/SNWD_grid_Pos_coeff0.png)
Coeff0: most negative
![SNWD_grid_neg_coeff0.png](hw5_figures/SNWD_grid_neg_coeff0.png)
Large positive values of coeff0 correspond to a snow depth well above average. Small values of coeff0 correspond to a snow depth close to, or slightly above, the average.


#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](hw5_figures/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](hw5_figures/SNWD_grid_neg_coeff1.png)
Large positive values of coeff1 correspond to an early snow season (most of the snowfall is in January - February). Negative values of coeff1 correspond to a late snow season (most of the snow is in March and early April).

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](hw5_figures/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](hw5_figures/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to much more snow in November - December. Small values of coeff2 indicate that snow in November - December is a little above average.

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](hw5_figures/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](hw5_figures/SNWD_grid_neg_coeff3.png)

Large positive values of coeff3 correspond to an earlier snow season than for coeff1 (most of the snowfall is in December - January). Negative values of coeff3 correspond to a slightly late snow season.





### The variation in the timing of snow is both due to year-to-year variation and station-to-station variation
In the previous section we saw the variation of coeff0, which corresponds to the total amount of snow. We now estimate the importance of location-to-location variation relative to year-by-year variation.
 
These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results: 
 
| Coefficient | Total MS | MS removing mean-by-station | Fraction explained (station) | MS removing mean-by-year | Fraction explained (year) |
|---|---|---|---|---|---|
| coeff_0 | 3089242.65143 | 831509.149582 | 0.730837216948 | 2282087.38995 | 0.26127933366 |
| coeff_1 | 572855.74951 | 334599.677293 | 0.415909367097 | 308346.042514 | 0.461738766212 |
| coeff_2 | 281719.821376 | 249701.164801 | 0.113654255558 | 119346.449046 | 0.57636474259 |
| coeff_3 | 233475.366452 | 216367.293934 | 0.073275706889 | 94764.0688572 | 0.594115343742 |
 
For **coeff_2** and **coeff_3**, the variation by year explains far more than the variation by station: station explains only about 7-11% of the variance, while year explains about 58-59%. This indicates that the phenomenon represented by **coeff_2** and **coeff_3** (considerably more snow in November - January) appears across the whole Lewistown area.

For **coeff_0**, the variation by station explains more than the variation by year (73% versus 26%). This indicates that the total amount of snow is distributed unevenly across the Lewistown area.

For **coeff_1**, the fractions explained by year and by station are approximately the same (46% and 42%), so we cannot conclude much.
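
The station/year comparison above can be sketched as follows. This is a minimal illustration with made-up records, not the code that produced the reported numbers; the record layout `(station, year, coeff)` and the function names are assumptions, and "MS" is taken to be the mean of the squared values.

```python
from collections import defaultdict

def mean_square(values):
    """Mean of the squared values."""
    return sum(v * v for v in values) / len(values)

def fraction_explained(records, key_index):
    """Fraction of the mean square removed by subtracting a group mean.

    records:   list of (station, year, coeff) tuples (hypothetical layout).
    key_index: 0 to remove the mean-by-station, 1 to remove the mean-by-year.
    Returns (total MS, residual MS, fraction explained).
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[2])
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}

    total_ms = mean_square([rec[2] for rec in records])
    residual_ms = mean_square([rec[2] - means[rec[key_index]] for rec in records])
    return total_ms, residual_ms, 1 - residual_ms / total_ms

# Toy data in which the coefficient depends only on the station, not on the year
records = [("A", 2000, 10.0), ("A", 2001, 10.0),
           ("B", 2000, -10.0), ("B", 2001, -10.0)]

print("by station:", fraction_explained(records, 0))  # fraction explained is 1.0
print("by year:   ", fraction_explained(records, 1))  # fraction explained is 0.0
```

Each reported "fraction explained" follows the same pattern: one minus the ratio of the residual MS to the total MS.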

## Analysis of precipitation

We then analyze the precipitation in the Lewistown area. The precipitation is clearly distributed unevenly.
![percipitation.png](hw5_figures/percipitation.png)

However, the top 5 eigenvectors of precipitation explain only 11% of the variance.
![prcp_explained.png](hw5_figures/prcp_explained.png)


### The variation in the timing of precipitation is not the same across the area
We estimate the relative importance of location-to-location variation relative to year-by-year variation. 
 
These are measured using the fraction by which the RMS is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results: 

total RMS                   =  38.7679058385  
RMS removing mean-by-station=  37.0238864075  
RMS removing mean-by-day   =  26.9365773318  

It is clear that the per-station mean explains very little of the RMS (removing it lowers the RMS by only about 4.5%), which indicates that the precipitation distribution is not the same across the area. On the other hand, removing the per-day mean lowers the RMS by about 30.5%, which indicates that there may be a precipitation pattern over time that is shared by some of the stations.


### Statistical relation of precipitation between stations

We look for a statistical test that can reject the null hypothesis that the rainfall at two locations is independent. To avoid the noise introduced by the actual amount of rainfall, we use a boolean value that represents whether or not it rained.

Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days on which it rained at station 1.
* $n_2$ : the number of days on which it rained at station 2.
* $l$ : the number of days on which it rained at both stations.

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l}}{{m \choose n_1}{m \choose n_2}}
$$

Where
$$
{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$:
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
Which slightly simplifies to 
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the length of the overlap. So, to get a per-day significance, we consider $\frac{1}{m} \log P$.
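
The simplified formula for $\log P$ can be evaluated directly with the log-gamma function. The sketch below is one possible implementation under stated assumptions: the function names and the representation of rain data as sequences of `True`/`False`/`None` (with `None` for a missing measurement) are hypothetical.

```python
from math import exp, lgamma

def G(n):
    """G(n) = log Gamma(n + 1) = log(n!)."""
    return lgamma(n + 1)

def log_prob_overlap(m, n1, n2, l):
    """log P of exactly l shared rain days, given m, n1, n2, under independence."""
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))

def rain_overlap_counts(rain1, rain2):
    """Count m, n1, n2, l over the days on which both stations have a measurement."""
    pairs = [(a, b) for a, b in zip(rain1, rain2)
             if a is not None and b is not None]
    m = len(pairs)
    n1 = sum(1 for a, _ in pairs if a)
    n2 = sum(1 for _, b in pairs if b)
    l = sum(1 for a, b in pairs if a and b)
    return m, n1, n2, l

# Sanity check: for fixed m, n1, n2 the probabilities over all feasible l sum to 1
m, n1, n2 = 10, 4, 3
feasible = range(max(0, n1 + n2 - m), min(n1, n2) + 1)
total = sum(exp(log_prob_overlap(m, n1, n2, l)) for l in feasible)
print(total)

# Per-day significance for one (m, n1, n2, l) combination
print(log_prob_overlap(10, 4, 3, 3) / 10)
```

Working in log space avoids overflow in the factorials when $m$ is in the thousands of days.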


#### The distribution of significance

The distribution of $\frac{1}{m} \log P$ is shown below. The larger its magnitude, the more correlated the two stations are. The peak around 0.4 comes from comparing each station with itself. We can see that most pairs of stations have little correlation, while some pairs have larger values, which indicates some level of correlation.
![PRCP_significance_distribution.png](hw5_figures/PRCP_significance_distribution.png)

#### The significance correlation matrix

The matrix below shows, for each pair of stations, the normalized log probability that the overlap in rain days is random. We can identify a group of stations in the top-left corner of the matrix. Precipitation at the stations in this group is correlated.
![PRCP_correlation.png](hw5_figures/PRCP_correlation.png)

These five stations are shown on the map in the figure below.
![PRCP_correlation_map.png](hw5_figures/PRCP_correlation_map.png)


#### PCA analysis of significance correlation matrix

To find more patterns in the matrix, we use PCA to group the stations. We first calculate the top 4 eigenvectors of the significance correlation matrix and fit the original data with them. We then sort the data according to its coefficient on each of the top 4 eigenvectors in turn. The results are shown below. The coefficients do not appear to be significantly correlated with each other.

![Pvalue_PCA_coeff_PRCP.png](hw5_figures/Pvalue_PCA_coeff_PRCP.png)

We also sort the original significance correlation matrix according to the coefficient of top 4 eigenvectors. The results are shown below.

![Pvalue_PCA_reorder_PRCP.png](hw5_figures/Pvalue_PCA_reorder_PRCP.png)
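
The reordering step can be sketched as follows. This is one possible reading of the procedure, not the code used for the figures: it assumes the significance matrix is symmetric, takes the eigenvectors of the matrix itself, and sorts stations by their component on a chosen eigenvector. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def reorder_by_eigenvector(sig, k=0):
    """Sort the rows and columns of a symmetric matrix by the k-th top eigenvector.

    sig: symmetric station-by-station significance matrix (assumed complete).
    k:   0 for the eigenvector with the largest eigenvalue, 1 for the next, ...
    Returns the station order and the reordered matrix.
    """
    eigvals, eigvecs = np.linalg.eigh(sig)
    top = np.argsort(eigvals)[::-1]      # eigenvalue indices, largest first
    coeff = eigvecs[:, top[k]]           # one coefficient per station
    order = np.argsort(coeff)
    return order, sig[np.ix_(order, order)]

# Toy matrix: stations 0 and 2 are strongly related, stations 1 and 3 weakly
toy = np.array([[1.0, 0.1, 0.8, 0.1],
                [0.1, 1.0, 0.1, 0.3],
                [0.8, 0.1, 1.0, 0.1],
                [0.1, 0.3, 0.1, 1.0]])

order, reordered = reorder_by_eigenvector(toy, k=0)
print(order)
print(reordered)
```

After reordering, related stations end up next to each other, so groups show up as blocks along the diagonal of the matrix.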

The top-left (eigenvector 1) subfigure shows another separation into two groups. These two groups are shown in the figure below.
![PRCP_correlation_map_2.png](hw5_figures/PRCP_correlation_map_2.png)

We can easily find a group of stations in the top-right (eigenvector 2) subfigure. This group of stations is exactly the group (US1MTMH0006, US1MTMH0012, US1MTMH0013, US1MTRB0008, USC00241225) that we identified in ***The significance correlation matrix***.

We finally plot the coefficients as triangles on the map. A solid triangle represents a positive value, while a hollow triangle represents a negative value. The size of a triangle represents the magnitude of the coefficient. The geographical map is chosen to show the influence of landform. The distribution of the coefficients shows that most of the stations in the flat area have coefficients that are large in absolute value, while stations in the mountain area have coefficients that are small in absolute value.
![PRCP_correlation_map_3.png](hw5_figures/PRCP_correlation_map_3.png)
