# Analysis On Maine Climate 
> Author: Lifan Chen     
Date: 5/13/2017   
PID: A53223155

This is a report on the historical analysis of climate patterns in an area that approximately overlaps the area of the state of Maine.


### Data source
The data we use comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We partition the stations into 256 geographical rectangles, indexed from BBBBBBBB to SSSSSSSS, each containing about 12,000 (station, year) pairs. This report focuses only on the region whose code is __BBBBBSBS__.


### Measurements
Most measurements exist only for a tiny fraction of the stations and years. We therefore restrict our use to the following measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation for each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.


### Sanity-check: comparison with outside sources
We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/maine/united-states/3189" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.

<p><img alt="scheck_1.png" src="m_figures/scheck_1.png" width = "500" /></p>
<br><br>
We see that the min and max daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Centigrade.
<br><br><br>

<img src="m_figures/Tmin_Tmax.png" width="800">

We can also plot the mean and standard deviation of daily precipitation (PRCP) and of the daily temperature observation (TOBS), as follows.
<br>
<img src="m_figures/tobs_prcp.png" width="800">

We can see that the __TOBS__ plot closely matches both our observed daily min and max temperatures and the graph from _US Climate Data_.

<br>
However, the __PRCP__ plot differs substantially from the _US Climate Data_ graph. In our data, the average precipitation of the area is around 30.30107 millimeters/day. Converted to inches/month, this gives about 36 inches/month, which is far larger than the _US Climate Data_ figure. At the same time, the standard deviation is extremely large, which means the data is very noisy. Such a high level of precipitation does not seem reasonable; possible explanations are a regionally high precipitation rate or inaccuracies in the data itself.
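
The unit conversion behind the 36 inches/month figure can be sketched as follows. This is a minimal sketch: the function name, the 30-day month and the `scale` parameter are our own assumptions for illustration, not part of the original analysis code.

```python
MM_PER_INCH = 25.4

def mm_per_day_to_inch_per_month(mm_per_day, days_per_month=30, scale=1.0):
    """Convert a mean daily precipitation value to inches per month.

    scale is the size of one data unit in millimeters (1.0 means the
    values are already in mm). Both defaults are assumptions.
    """
    return mm_per_day * scale * days_per_month / MM_PER_INCH

# Mean daily PRCP quoted in the text, taken at face value as mm/day.
monthly_inches = mm_per_day_to_inch_per_month(30.30107)
print(f"{monthly_inches:.1f} inch/month")
```

The `scale` argument is there because the GHCN-Daily documentation lists PRCP in tenths of a millimeter. Whether the values used in this report were already rescaled cannot be told from the text, so `scale=0.1` (about 3.6 inches/month) is only a check worth trying, not a conclusion.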

## PCA Analysis

We perform principal component analysis on each measurement by computing the percentage of the variance explained as a function of the number of eigen-vectors used. The results are plotted below.

#### Temperature analysis
<img src="m_figures/eigen_explain.png" width="800">

We see that the top 5 eigen-vectors explain 11% of variance for TMIN, 37% for TOBS and 10% for TMAX.
We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 29% of the variance.

None of these values is very high. This makes sense, since daily temperature varies a lot during the year and noise makes up a large part of the data. Therefore, the top eigenvectors alone cannot explain the data entirely.
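
The "percentage of variance explained" curve can be sketched as below. This is a hedged illustration on synthetic data, not the code used for the figures: it assumes a dense matrix with one row per (station, year) and no missing values, and the function name and toy data are made up for the example.

```python
import numpy as np

def explained_variance_fraction(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    data: array of shape (n_samples, n_days), one row per (station, year).
    """
    cov = np.cov(data, rowvar=False)            # n_days x n_days covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0, None)         # remove tiny negative round-off
    return np.cumsum(eigvals) / np.sum(eigvals)

# Toy data: one seasonal shape with a random amplitude per row, plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.sin(2 * np.pi * days / 365)
amplitude = rng.normal(size=(200, 1))
toy = amplitude * seasonal + 0.5 * rng.normal(size=(200, 365))

frac = explained_variance_fraction(toy)
print(frac[:5])  # fraction explained by the top 1..5 eigenvectors
```

In the toy data a single eigenvector carries most of the variance because there is only one underlying shape; noisier measurements spread the variance over many eigenvectors, which gives the low percentages seen above.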

#### Snow and precipitation analysis
<img src="m_figures/eigen_explain_snow.png" width="800">

The top 5 eigenvectors explain 11% of the variance for PRCP and 13% for SNOW. Both are low values. On the other hand the top 5 eigenvectors explain 86% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow signals. Based on that we will dig deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
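
The smoothing effect of a decaying integral can be illustrated with a small simulation. This is only a sketch under strong assumptions: the exponential decay with factor 0.95 and the synthetic snowfall are invented for the example, and real snow depth also depends on melting and compaction.

```python
import numpy as np

def decaying_integral(snow, decay=0.95):
    """Accumulate daily snowfall, losing a fixed fraction of the depth each day."""
    depth = np.zeros(len(snow))
    level = 0.0
    for t, s in enumerate(snow):
        level = decay * level + s
        depth[t] = level
    return depth

def roughness(x):
    """Mean absolute day-to-day change, relative to the series' standard deviation."""
    return np.mean(np.abs(np.diff(x))) / np.std(x)

# Synthetic season: snow falls on roughly 30% of the days.
rng = np.random.default_rng(0)
snow = rng.exponential(10.0, size=120) * (rng.random(120) < 0.3)
depth = decaying_integral(snow)

print("SNOW roughness:", roughness(snow))
print("SNWD roughness:", roughness(depth))
```

The accumulated series changes much less from day to day, relative to its overall spread, than the daily snowfall it is built from.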

## Analysis on snow
We will perform an analysis on snow and snow depth because the first 5 eigen-vectors of SNWD explain 86% of the variance, which is a very large value, while the first 5 eigen-vectors of SNOW explain only 13%. Since SNOW and SNWD are strongly related, let's look further into the data to see how this happens.

First, we graph the mean+std and the top 3 eigen-vectors of each.
<img src="m_figures/snow_snwd.png" width="800">
<br>
<img src="m_figures/snow_snwd_eigen.png" width="800">


We observe that the snow season runs from mid-November to the beginning of May, with the peak of the snow depth at the end of February.

From the mean plots, we can see that daily snowfall (SNOW) is much noisier and has larger variance than the accumulated snow depth, while SNWD is much smoother. The top 3 eigenvectors show the same pattern: the eigenvectors of SNOW zig-zag across the entire plot, while the eigenvectors of SNWD remain stable.


<img src="m_figures/SWND_sd.png" width="600">

Looking further into the eigenvectors, we observe that the __first eigen-function__ (eig0) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. The interpretation of this shape is that eig0 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

eig1 and eig2 are similar in the following way: they both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* __eig1__: less snow in Jan - mid Feb, more snow in mid Feb - March.
* __eig2__: more snow in Jan, less snow in Feb, slightly more snow in March.

### Reconstruction from the eigen-vectors
Based on the eigen-decomposition of SNWD, we know that the top 5 eigen-vectors explain 86% of the variance of SNWD. Thus, we can try to reconstruct the data using only the top eigen-vectors. Let's look at the reconstructions for the samples with the most positive and the most negative coefficients and see what they look like.
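
A reconstruction of this kind can be sketched as below. It is an illustration on random data, assuming orthonormal eigenvectors sorted by decreasing eigenvalue and no missing values; the function name and the toy matrix are hypothetical and not taken from the analysis code.

```python
import numpy as np

def reconstruct(x, mean, eigvecs, k):
    """Approximate x by the mean plus its projection on the top-k eigenvectors.

    eigvecs: (n_days, n_vecs) array with orthonormal columns, sorted by
    decreasing eigenvalue. Returns (reconstruction, coefficients).
    """
    top = eigvecs[:, :k]
    coeffs = top.T @ (x - mean)       # one coefficient per eigenvector
    return mean + top @ coeffs, coeffs

# Toy data standing in for the SNWD matrix (rows = samples, columns = days).
rng = np.random.default_rng(1)
data = rng.normal(size=(100, 20)) @ rng.normal(size=(20, 20))
mean = data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
eigvecs = eigvecs[:, ::-1]            # eigh is ascending; put largest first

x = data[0]
for k in (1, 3, 5, 20):
    approx, coeffs = reconstruct(x, mean, eigvecs, k)
    print(k, np.linalg.norm(x - approx))   # residual shrinks as k grows
```

The coefficients returned for each sample are the quantities sorted on when picking the "top 4 positive" and "top 4 negative" examples in the figures that follow.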

#### Coefficient of 1st eigenvector
Top 4 positive 
<br>
<img src="m_figures/co1_pos.png" width="800">
Top 4 negative
<br>
<img src="m_figures/co1_neg.png" width="800">
Thus, large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coefficient of 2nd eigenvector
Top 4 positive 
<br>
<img src="m_figures/co2_pos.png" width="800">
Top 4 negative
<br>
<img src="m_figures/co2_neg.png" width="800">
Large positive values of coeff2 correspond to a late snow season (most of the snowfall is after mid-Feb). Negative values of coeff2 correspond to an early snow season (most of the snow is before mid-Feb).

#### Coefficient of 3rd eigenvector
Top 4 positive 
<br>
<img src="m_figures/co3_pos.png" width="800">
Top 4 negative
<br>
<img src="m_figures/co3_neg.png" width="800">
Large positive values of coeff3 correspond to a snow season with two spikes: one at the start of January, the other at the end of February. Negative values of coeff3 correspond to a season with a single peak at the end of January.


## Measuring statistical significance
We want a statistical test for rejecting the null hypothesis that the rainfall at two locations is independent.

#### Definition of statistical significance
Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days that it rained at station 1.
* $n_2$ : the number of days that it rained at station 2.
* $l$ : the number of days that it rained at both stations.

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{ {m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} }{ {m \choose n_1}{m \choose n_2} }
$$

Where
$$
{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
Which slightly simplifies to 
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the length of the overlap. So to get a per-day significance we consider $\frac{1}{m} \log P$.
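
The simplified formula for $\log P$ translates directly into code. The sketch below uses `math.lgamma` from the Python standard library for $\log \Gamma$; the function names and the small example counts are our own and only serve as illustration.

```python
from math import lgamma, exp

def G(n):
    """log(n!) computed as log Gamma(n + 1)."""
    return lgamma(n + 1)

def log_overlap_probability(m, n1, n2, l):
    """log P of seeing exactly l overlap days, given m, n1 and n2."""
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))

def per_day_significance(m, n1, n2, l):
    """Normalized log probability, (1/m) * log P."""
    return log_overlap_probability(m, n1, n2, l) / m

# Small example: 10 shared days, 4 rainy days at station 1, 5 at station 2.
m, n1, n2 = 10, 4, 5
probs = [exp(log_overlap_probability(m, n1, n2, l))
         for l in range(max(0, n1 + n2 - m), min(n1, n2) + 1)]
print(probs, sum(probs))
```

As a consistency check, the probabilities over all feasible values of $l$ sum to 1, since $P$ reduces to the hypergeometric distribution ${n_1 \choose l}{m-n_1 \choose n_2-l}/{m \choose n_2}$.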

<br><br><br>

#### Significance histogram
Now, based on this method, we can calculate the normalized log probability for each pair of stations and plot the __histogram__ of significance below, where the X-axis represents the bins of significance and the Y-axis represents counts.
<br>
<img src="m_figures/sig_hist.png" width="600">
<br><br>
We can see from the histogram that there are two peaks: the first is around 0.08, meaning the significance is not strong, and the other is around 0.3, meaning the significance is strong. This indicates that different pairs of stations may have very different levels of correlation: some may be strongly dependent, while others may be essentially unrelated.


#### Correlation matrix
Let's look further into this by plotting the dependency matrix below.
<br>
<img src="m_figures/dep_mat.png" width="600">
<br><br>
The matrix above shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.

In this plot, we can see two strongly related regions: one at the top left and the other at the bottom right.


#### PCA analysis
To find more correlations we use SVD. As we shall see, the top 10 eigenvectors explain about 90% of the square magnitude of the matrix, which indicates that the top eigenvectors are representative of the dataset.
<br>
<img src="m_figures/expl.png" width="600">
<br><br>
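
The fraction of the square magnitude explained can be read off the singular values. The sketch below assumes a dense matrix with no missing entries (pairs without overlapping days would need to be filled in first); the function name and the toy block matrix are invented for illustration.

```python
import numpy as np

def squared_magnitude_explained(matrix):
    """Cumulative fraction of the squared Frobenius norm captured by the
    top-k singular vectors."""
    s = np.linalg.svd(matrix, compute_uv=False)   # singular values, descending
    return np.cumsum(s ** 2) / np.sum(s ** 2)

# Toy significance matrix with two independent groups of three stations.
block = np.zeros((6, 6))
block[:3, :3] = 0.3
block[3:, 3:] = 0.2

frac = squared_magnitude_explained(block)
print(frac)
```

In this toy case two singular vectors already capture everything, because the matrix is built from exactly two groups.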

Then we can try to reorder the rows (and columns) of the matrix so that stations with similar eigenvector components are grouped together, to reveal the structure underlying the data.

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident, as shown below.

<br>
<img src="m_figures/reord.png" width="600">
<br><br>

For example, the top left part of the top left image above is strongly grouped. Similarly, each of the images above contains several strongly grouped sets of stations, which indicate clusters of stations.
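
Reordering by an eigenvector amounts to sorting the stations by their entry in that eigenvector and applying the same permutation to rows and columns. The following is a minimal sketch on a toy matrix; the function name, the two interleaved groups and the values 0.3 and 0.05 are assumptions made for the example.

```python
import numpy as np

def reorder_by_eigenvector(matrix, vec):
    """Sort rows and columns of a square matrix by the entries of vec."""
    order = np.argsort(vec)
    return matrix[np.ix_(order, order)], order

# Toy matrix: stations 0, 2, 4 form one group and 1, 3, 5 another.
groups = np.array([0, 1, 0, 1, 0, 1])
sig = np.where(groups[:, None] == groups[None, :], 0.3, 0.05)

# eigh returns eigenvalues in ascending order, so column -2 belongs to the
# second-largest one; in this toy matrix it contrasts the two groups.
eigvals, eigvecs = np.linalg.eigh(sig)
reordered, order = reorder_by_eigenvector(sig, eigvecs[:, -2])

print(order)       # stations of the same group end up next to each other
print(reordered)   # block structure: strong blocks on the diagonal
```

Before reordering the two groups are interleaved and the structure is hard to see; after reordering, the strongly related stations form contiguous blocks on the diagonal.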

#### Map 
Now let's look further into the grouped stations and plot them on the map to figure out whether there is a spatial relationship between them.

<br>
<img src="m_figures/map.png" width="600">
<br><br>
The image above is the general map of the stations, where the size of each triangle corresponds to the value of its coefficient. We can easily find some regional patterns in the data set, as shown below.

<img src="m_figures/reg1.png" width="300"><img src="m_figures/reg2.png" width="300"><img src="m_figures/reg3.png" width="300">

Given these patterns and the analysis above, we can conclude that there is some spatial relationship between the SNWD of different stations.