# Weather Analysis - New Hampshire and Maine

This is a report on the historical analysis of weather patterns in an area that overlaps the state of New Hampshire and parts of Maine.

I focused on the following six measurements for analysis:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** the temperature at the time of observation on each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/new-hampshire/united-states/3199" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate New Hampshire.png" src="my_report_figs/Climate New Hampshire.png" /></p>

<p>&nbsp;</p>

<p>We see that the min and max daily&nbsp;temperature agree with the ones we got from our data.</p>

<p>&nbsp;</p>

<p><img alt="tmin tmax.png" src="my_report_figs/tmin tmax.png" style="height:300px; width:800px" /></p>

<!--<p>To compare the precipitation&nbsp;we need to translate millimeter/day to inches/month. According to our analysis the average rainfall is 3.00 mm/day which translates to about 3.55 Inches&nbsp;per month. According to US-Climate-Data the average rainfall is closer to 4 inch per month. However, there is clear agreement that average precipitation&nbsp;is close to a constant throughout the year</p>-->
<p>Similar to the data in US-Climate-Data for New Hampshire and Maine, the average precipitation is close to constant throughout the year.</p>

<p>&nbsp;<img alt="precp.png" src="my_report_figs/precp.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.

### Percentage of variance explained.
![VarExplained1.png](my_report_figs/VarExplained1.png)
We see that the top 5 eigen-vectors explain 22% of variance for TMIN, 37% for TOBS and 20% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 30% of the variance.
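The percentages quoted here are cumulative sums of the covariance eigenvalues, normalized by their total. The block below is a minimal sketch of that computation on synthetic data. The function name, the toy matrix, and the assumption that there are no missing values are all illustrative and are not taken from the report's own code.

```python
import numpy as np

def cumulative_variance_explained(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    Assumes rows are observations (e.g. one station-year each), columns are
    days, and there are no missing values.
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic stand-in for a measurement matrix (not the report's data).
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

explained = cumulative_variance_explained(toy)
print(f"top 5 eigenvectors explain {100 * explained[4]:.1f}% of the variance")
```

Real weather vectors contain undefined entries, so a practical version would need a covariance estimate that tolerates missing values.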

![VarExplained2.png](my_report_figs/VarExplained2.png)

The top 5 eigenvectors explain 11% of the variance for PRCP and 13% for SNOW. Both are low values. On the other hand the top 5 eigenvectors explain almost 86% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow depth signals. Based on that we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date on different years.
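To illustrate the "decaying integral" argument, here is a small sketch under stated assumptions: snow depth is modeled as a leaky running sum of daily snowfall with a hypothetical decay factor of 0.9, and smoothness is measured by the lag-1 autocorrelation. Neither the model nor the data comes from the report.

```python
import numpy as np

def decaying_integral(snowfall, decay=0.9):
    """Leaky running sum: depth[t] = decay * depth[t-1] + snowfall[t]."""
    depth = np.zeros(len(snowfall))
    level = 0.0
    for day, fall in enumerate(snowfall):
        level = decay * level + fall
        depth[day] = level
    return depth

def lag1_autocorrelation(x):
    """Correlation between the series and itself shifted by one day."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

# Synthetic snowfall: it snows on roughly 20% of days, with random amounts.
rng = np.random.default_rng(0)
n_days = 2000
snow = rng.exponential(scale=10.0, size=n_days) * (rng.random(n_days) < 0.2)
depth = decaying_integral(snow)

print("lag-1 autocorrelation of SNOW-like series:", round(lag1_autocorrelation(snow), 2))
print("lag-1 autocorrelation of SNWD-like series:", round(lag1_autocorrelation(depth), 2))
```

The integrated series changes much more smoothly from day to day, which is the property that makes a few eigenvectors sufficient for SNWD.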

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 82% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season is from mid-November to the beginning of April, where the end of February marks the peak of the snow-depth. We can also observe that the mean of snow depth is almost constant in the month of February, at the peak.
![SNWD_mean_eigs.png](my_report_figs/snwd_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. We can also notice that the peak is wider in the mean function. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow in January - February, more snow in March - April.
* **eig3:** more snow in January, less snow in February, slightly more snow in March.
* **eig4:** less snow in December, more snow in January, less snow in February, slightly more snow in March.

### Examples of reconstructions of SNWD using first 3 eigenvectors

#### Coeff1
Coeff1: most positive
![snwd_pos_coeff1.png](my_report_figs/snwd_pos_coeff1.png)

Coeff1: most negative
![snwd_neg_coeff1.png](my_report_figs/snwd_neg_coeff1.png)

Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![snwd_pos_coeff2.png](my_report_figs/snwd_pos_coeff2.png)

Coeff2: most negative
![snwd_neg_coeff2.png](my_report_figs/snwd_neg_coeff2.png)

Large positive values of coeff2 correspond to a late snow season (most of the snowfall and the peak are after mid-February). Negative values of coeff2 correspond to an earlier snow season (most of the snow and the peak are before mid-February).

#### Coeff3
Coeff3: most positive
![snwd_pos_coeff3.png](my_report_figs/snwd_pos_coeff3.png)

Coeff3: most negative
![snwd_neg_coeff3.png](my_report_figs/snwd_neg_coeff3.png)

Large positive values of coeff3 correspond to a snow season with two spikes: one in January, the other in March, with a dip in February. Negative values of coeff3 correspond to a season with a single peak in February.
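The reconstructions in this section add the mean to a weighted sum of the top eigenvectors, where the weights are the coefficients (coeff1, coeff2, ...). A minimal sketch on synthetic data follows. The function name and the toy data are assumptions for illustration, and missing values are ignored.

```python
import numpy as np

def reconstruct(signal, mean, eigvecs, k):
    """Approximate `signal` by the mean plus its projection on the first k eigenvectors."""
    top = eigvecs[:, :k]                 # columns are orthonormal eigenvectors
    coeffs = top.T @ (signal - mean)     # coeff1, coeff2, ..., coeffk
    return mean + top @ coeffs, coeffs

# Synthetic data with decreasing variance per direction (not the report's data).
rng = np.random.default_rng(1)
data = rng.normal(size=(300, 6)) * np.array([5, 3, 2, 1, 0.5, 0.2])
mean = data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
eigvecs = eigvecs[:, ::-1]               # reorder to descending eigenvalue

signal = data[0]
errors = []
for k in range(7):
    approx, _ = reconstruct(signal, mean, eigvecs, k)
    errors.append(float(np.linalg.norm(signal - approx)))

print("residual norm for k = 0..6:", np.round(errors, 3))
```

The residual can only shrink as more eigenvectors are added, and it vanishes once all of them are used.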



### Geographical distribution of first 4 coefficients. 

Now we analyze the geographical distribution of coeff1 for SNWD. In the following map, each station is drawn as a circle whose radius reflects the number of observations at that station and whose color is mapped to the value of coeff1.
<p><img alt="snwd_geo_coeff1_zoomed_legend.jpg" src="my_report_figs/snwd_geo_coeff1_zoomed_legend.jpg" /></p>

On close observation, we can see that most of the stations along the coast have negative coeff1, and that coeff1 increases as we move away from the coast. We have seen in the reconstruction plots (in the previous section) that negative coeff1 corresponds to less snow than the mean. This pattern is consistent with places closer to the coast having lower snow depth than places inland.

## The variation in the timing of snow is mostly due to year-to-year variation

In the previous section we see the variation of Coeff1, which corresponds to the total amount of snow, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.
These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

**coeff_1**  
total MS                    = 1855.51190003  
MS removing mean-by-station = 1392.70899374, fraction explained = 24.9%  
MS removing mean-by-year    = 1339.59182266, fraction explained = 27.8%  

**coeff_2**  
total MS                    = 977.398862638  
MS removing mean-by-station = 930.247992826, fraction explained = 4.8%  
MS removing mean-by-year    = 496.582543868, fraction explained = 49.2%  

**coeff_3**  
total MS                    = 633.578643968  
MS removing mean-by-station = 593.927565955, fraction explained = 6.2%  
MS removing mean-by-year    = 301.443961333, fraction explained = 52.4%  

We see that the variation by year explains more than the variation by station. However, this effect is weaker for coeff1, which has to do with the total snowfall and for which the two fractions are close, than for coeff2 and coeff3 which, as we saw above, have to do with the timing of snowfall.
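The "fraction explained" figures can be reproduced in spirit with the sketch below. It assumes that "MS" means the mean of the squared coefficient values and that the fraction explained is the percentage drop in MS after subtracting each group's mean. The function name and the synthetic station/year effects are illustrative and are not the report's code or data.

```python
import numpy as np

def fraction_explained(values, groups):
    """Percent reduction in mean square after subtracting each group's mean."""
    values = np.asarray(values, dtype=float)
    _, idx = np.unique(np.asarray(groups), return_inverse=True)
    group_means = np.bincount(idx, weights=values) / np.bincount(idx)
    total_ms = np.mean(values ** 2)
    residual_ms = np.mean((values - group_means[idx]) ** 2)
    return 100.0 * (1.0 - residual_ms / total_ms)

# Synthetic coefficient with a strong year effect and a weak station effect.
rng = np.random.default_rng(0)
stations = np.repeat(np.arange(20), 30)   # 20 hypothetical stations
years = np.tile(np.arange(30), 20)        # 30 hypothetical years
coeff = (rng.normal(scale=3.0, size=30)[years]
         + rng.normal(scale=1.0, size=20)[stations]
         + rng.normal(scale=1.0, size=600))

by_station = fraction_explained(coeff, stations)
by_year = fraction_explained(coeff, years)
print(f"removing mean-by-station explains {by_station:.1f}%")
print(f"removing mean-by-year    explains {by_year:.1f}%")
```

With a dominant year effect, as constructed here, removing the per-year mean explains far more than removing the per-station mean, which mirrors the pattern reported for coeff2 and coeff3.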

## Analysis of Snow

### Measure of statistical significance of Snow

The null hypothesis is that the snowfall in the two locations is independent.
Here, we only consider whether it snowed in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.

Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days on which it snowed at station 1.
* $n_2$ : the number of days on which it snowed at station 2.
* $l$ : the number of days on which it snowed at both stations.

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l}}{{m \choose n_1}{m \choose n_2}}
$$

Where
$$
{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
Which slightly simplifies to 
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the length of the overlap. So, to get a per-day significance we consider $\frac{1}{m} \log P$.
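The simplified expression for $\log P$ can be evaluated directly with the log-gamma function. The following is a minimal sketch using only the standard library. The function names and the example counts are hypothetical and do not come from the dataset.

```python
from math import exp, lgamma

def G(n):
    """G(n) = log Gamma(n + 1) = log(n!)."""
    return lgamma(n + 1)

def log_overlap_probability(m, n1, n2, l):
    """log P of exactly l overlap days, given m, n1, n2, under independence."""
    if not (max(0, n1 + n2 - m) <= l <= min(n1, n2)):
        return float("-inf")   # this overlap is impossible
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))

# Sanity check: the probabilities over all feasible overlaps sum to one.
total = sum(exp(log_overlap_probability(10, 4, 3, l)) for l in range(0, 4))
print("sum over l:", round(total, 6))

# Hypothetical pair of stations: 1000 shared days, 100 and 120 snow days, 60 shared.
m, n1, n2, l = 1000, 100, 120, 60
per_day = log_overlap_probability(m, n1, n2, l) / m
print("per-day normalized log probability:", per_day)
```

Working in log space avoids overflow in the factorials, which would otherwise be astronomically large for realistic values of $m$.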

The following graph shows the distribution of the probability values for different pairs of stations. Since the values are low, such overlaps would be unlikely under the null hypothesis, so we reject the null hypothesis of independence for this dataset.
<p><img alt="p_norm_snow.png" src="my_report_figs/p_norm_snow.png" /></p>

The following matrix captures the p-values for the pairs of stations.
<p><img alt="p_norm_snow_mat.png" src="my_report_figs/p_norm_snow_mat.png" /></p>

From the plot above, we can say that many stations are correlated with each other. To find more correlations between stations, we use SVD. The top 10 eigenvectors explain about 72% of the square magnitude of the matrix as we can see from the plot below:
<p><img alt="var_explained_pnorm.png" src="my_report_figs/var_explained_pnorm.png" /></p>

The plots below are the matrices we get when we reorder the rows and columns of the probability matrix using each of the first four eigenvectors.
<p><img alt="p_norm_snow_coeff_mat.png" src="my_report_figs/p_norm_snow_coeff_mat.png" /></p>

After reordering, the grouping of the stations becomes more evident. For example, consider the upper left corner of the first matrix (the upper left plot). The stations at positions 0-22 are clearly strongly correlated with each other. We can also see a group of stations in the bottom right of the second matrix (top right plot) that are correlated with each other.

This type of organization is called Block Diagonal and it typically reveals important structure such as grouping or clustering.
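As a rough illustration of how sorting by a singular vector exposes block structure, the sketch below builds a small symmetric matrix with two hidden groups, shuffled together, and reorders it by the entries of its top singular vector. The function name, the toy matrix, and the group labels are assumptions for illustration only.

```python
import numpy as np

def reorder_by_singular_vector(matrix, which=0):
    """Sort rows and columns of a symmetric matrix by one left singular vector."""
    u, s, vt = np.linalg.svd(matrix)      # singular values come in descending order
    order = np.argsort(u[:, which])
    return matrix[np.ix_(order, order)], order

# Toy similarity matrix: two interleaved groups of stations (hypothetical labels).
labels = np.array([0, 1, 0, 0, 1, 0, 1, 0])
matrix = np.where(labels[:, None] == labels[None, :], 1.0, 0.1)

reordered, order = reorder_by_singular_vector(matrix)
print("original group order: ", labels)
print("reordered group order:", labels[order])
print(reordered)
```

After reordering, members of the same group sit next to each other, so the strong entries collect into square blocks along the diagonal.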

## Analysis of Temperature

### Change in TMIN over the years

The plot below shows the maximum of TMIN in each year.
<p><img alt="tmin over year.png" src="my_report_figs/tmin over year.png" /></p>
We can see a trend of the temperatures increasing over the years, which is consistent with a warming climate.
A similar trend can be observed for TMAX and TOBS as well.
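One way to quantify such a trend is to fit a straight line to the yearly values and inspect the slope. The sketch below does this on synthetic data. The year range, the built-in slope of 0.02 units per year, and the noise level are made up for illustration and say nothing about the actual measurements.

```python
import numpy as np

def yearly_trend(years, values):
    """Least-squares slope (units per year) and mean-year level of a yearly series."""
    years = np.asarray(years, dtype=float)
    centered = years - years.mean()       # centering keeps the fit well conditioned
    slope, level = np.polyfit(centered, values, deg=1)
    return slope, level

# Synthetic yearly maxima with a small upward drift plus noise.
rng = np.random.default_rng(0)
years = np.arange(1950, 2011)
values = 20.0 + 0.02 * (years - 1950) + rng.normal(scale=0.5, size=years.size)

slope, level = yearly_trend(years, values)
print(f"estimated trend: {slope:.3f} units per year")
```

A positive slope alone does not establish a cause. It only summarizes the direction and size of the change visible in the plot.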