###### Florida Weather Analysis

This report analyzes historical weather patterns in an area that approximately overlaps the state of Florida.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.

<b><i><font color="red">Notice: To make the analysis more precise, I narrowed the dataset down to stations whose IDs begin with "USC", the most frequent station prefix in the overall dataset. The original dataset has 12583 records, and 10284 records are retained after filtering, which is an acceptable loss.</font></i></b>
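The filtering step in the notice can be sketched as follows. This is only an illustration: the record layout (a dictionary with a `station` key) and the sample station IDs are hypothetical, since the dataset layout is not shown in this report. The two record counts are the ones quoted above.

```python
# Hypothetical records: the real dataset layout is not shown in this report.
records = [
    {"station": "USC00000001", "measurement": "TMAX"},
    {"station": "USW00000002", "measurement": "TMAX"},
    {"station": "USC00000003", "measurement": "PRCP"},
]


def keep_prefix(rows, prefix="USC"):
    """Keep only rows whose station ID starts with the given prefix."""
    return [row for row in rows if row["station"].startswith(prefix)]


filtered = keep_prefix(records)

# Fraction of records retained, using the counts quoted in the notice.
retained_fraction = 10284 / 12583
print(len(filtered), round(retained_fraction, 3))
```

With the quoted counts, roughly 82% of the records survive the filter.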

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>


<p><img src="report_figures/florida_hollywood.png" /></p>


<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius.</p>


<p><img src="report_figures/4.4/1.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to convert millimeters per day to inches per month. According to our analysis, the average rainfall is around 6 mm/day in summer and 2 mm/day in other seasons, which translates to about 7.1 inches per month in summer and 2.36 inches per month in other seasons. According to US Climate Data, the average rainfall is around 8 inches per month in summer and 3.5 inches per month in other seasons. Our estimates are therefore somewhat lower, but there is clear agreement that average precipitation is much higher in summer than in other seasons and fairly even within each individual season.</p>
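The unit conversion above can be checked with a few lines of Python. This is a minimal sketch that assumes a nominal 30-day month (an assumption, not stated in the report); the function name is illustrative.

```python
MM_PER_INCH = 25.4
DAYS_PER_MONTH = 30  # assumption: a nominal 30-day month


def mm_per_day_to_inches_per_month(mm_per_day, days=DAYS_PER_MONTH):
    """Convert an average daily rainfall in mm to a monthly total in inches."""
    return mm_per_day * days / MM_PER_INCH


summer = mm_per_day_to_inches_per_month(6)  # about 7.09 inches per month
other = mm_per_day_to_inches_per_month(2)   # about 2.36 inches per month
print(round(summer, 2), round(other, 2))
```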

<p><img src="report_figures/4.4/2.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.
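As a rough illustration of what this quantity is, the sketch below computes the cumulative fraction of variance explained from a list of covariance eigenvalues. The eigenvalues are made up for the example and are not taken from our data.

```python
from itertools import accumulate


def variance_explained(eigenvalues):
    """Cumulative fraction of variance explained by the top-k eigenvectors."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [partial / total for partial in accumulate(ordered)]


# Hypothetical eigenvalues of a covariance matrix, for illustration only.
example_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
fractions = variance_explained(example_eigenvalues)
print(fractions)
```

Entry `k` of the result is the fraction of the total variance captured by the `k + 1` largest eigenvalues, which is the curve plotted in the figures of this section.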

### Percentage of variance explained.
![VarExplained1.png](report_figures/4.4/3.png)

We see that the top 5 eigenvectors explain 37% of the variance for TMIN, 52% for TOBS and 28% for TMAX.

We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains 41% of the variance.

![VarExplained2.png](report_figures/4.4/4.png)

The top 5 eigenvectors explain 7% of the variance for PRCP and 100% for SNOW and SNWD. The reason the top 5 eigenvectors explain all of the variance for SNOW and SNWD is that it rarely snows in Florida, as the following figure shows. These measurements are almost always zero, so only a few eigenvectors are needed to explain all of the variance.

![SNOW/SNOWD.png](report_figures/4.4/5.png)


## Analysis of snow

We choose to analyze the eigen-decomposition for snowfall (SNOW) because the first 5 eigenvectors explain 100% of the variance, and because SNOW has more non-zero data points than SNWD.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs mainly from February to March, and the period from the middle of February to the middle of March has a steadier number of snow days. There are also a few days in June and October with snowfall, but this snow did not accumulate: the corresponding SNWD is still zero, as the previous figure shows.
![SNOW_mean_eigs.png](report_figures/4.5/2.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function, but with the opposite sign. Another difference is that the mean has some peaks in June and October, whereas eig1 is close to zero there. The interpretation of this shape is that eig1 represents the overall amount of snow above or below the mean, without changing its distribution over time.

**eig2 and eig5** are similar in that both oscillate between positive and negative values.

**eig3 and eig4** mainly represent the occasional snow in June and October, whereas the others capture the usual snow in February and March.


### Examples of reconstructions

![SNOW_grid.png](report_figures/4.5/3.png)
Because eig1 has the opposite sign to the mean, large positive values of coeff1 correspond to less-than-average snow, and large negative values correspond to more-than-average snow.

Large positive values of coeff2 correspond to a late snow season (most of the snowfall is after mid-February). Negative values of coeff2 correspond to an early snow season (most of the snowfall is before mid-February).

Large positive values of coeff3 correspond to a snow season with two spikes: one at the start of January, the other at the end of February. Negative values of coeff3 correspond to a season with a single peak at the end of January.
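To make the role of the coefficients concrete, here is a minimal sketch of how a season can be projected onto eigenvectors and reconstructed as the mean plus a weighted sum of them. The 4-day "season", the two orthonormal vectors, and all function names are hypothetical and chosen only so the arithmetic is easy to follow; they are not the eigenvectors from our data.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def coefficients(sample, mean, eigvecs):
    """Project the mean-subtracted sample onto each eigenvector."""
    centered = [s - m for s, m in zip(sample, mean)]
    return [dot(centered, vec) for vec in eigvecs]


def reconstruct(mean, eigvecs, coeffs):
    """Return mean + sum_k coeffs[k] * eigvecs[k]."""
    result = list(mean)
    for c, vec in zip(coeffs, eigvecs):
        result = [r + c * x for r, x in zip(result, vec)]
    return result


# Hypothetical data: a 4-day "season" and two orthonormal vectors.
mean = [1.0, 2.0, 2.0, 1.0]
eigvecs = [
    [-0.5, -0.5, -0.5, -0.5],  # one sign everywhere, opposite to the mean
    [-0.5, -0.5, 0.5, 0.5],    # contrast between early and late days
]
sample = [0.0, 1.0, 2.0, 1.0]

coeffs = coefficients(sample, mean, eigvecs)
approx = reconstruct(mean, eigvecs, coeffs)
print(coeffs, approx)
```

In this toy case the sample lies in the span of the two vectors, so the reconstruction is exact. With real data the top few eigenvectors give only an approximation.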



## Analysis of correlation between precipitation across locations

### Definition of statistical test

We want a statistical test for rejecting the null hypothesis that the rainfall at the two locations is independent.

Using the inner product is too noisy: it multiplies the rainfall amounts on the same day at the two locations, and that product can be very large, leading to a large variance and poor ability to discriminate.
An alternative is to ignore the amount of rain and just ask whether it rained at both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.

Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days on which it rained at station 1.
* $n_2$ : the number of days on which it rained at station 2.
* $l$ : the number of days on which it rained at both stations.
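The four counts defined above could be computed along these lines. This is a sketch under two assumptions that the report does not state: missing measurements are represented by `None`, and a day counts as rainy when the recorded precipitation is greater than zero.

```python
def overlap_counts(rain1, rain2):
    """Return (m, n1, n2, l) for two daily series; None marks a missing day."""
    m = n1 = n2 = l = 0
    for a, b in zip(rain1, rain2):
        if a is None or b is None:
            continue  # keep only days measured at both stations
        m += 1
        wet1, wet2 = a > 0, b > 0  # assumption: any precipitation counts as rain
        n1 += wet1
        n2 += wet2
        l += wet1 and wet2
    return m, n1, n2, l


# Hypothetical daily precipitation (mm) for two stations.
station1 = [0.0, 5.1, None, 2.0, 0.0, 3.3]
station2 = [0.0, 1.2, 4.0, 0.0, None, 0.8]
counts = overlap_counts(station1, station2)
print(counts)
```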

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l}}{{m \choose n_1}{m \choose n_2}}
$$

where
$$
{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$. Then
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
which simplifies slightly to
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the length of the overlap. To get a per-day significance we therefore consider $\frac{1}{m} \log P$.
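The simplified expression for $\log P$ translates directly into code. The sketch below uses `math.lgamma` from the Python standard library for $\log \Gamma$. The function names are illustrative, and returning negative infinity for impossible values of $l$ is a choice made for this example.

```python
from math import exp, lgamma


def G(n):
    """log(n!) computed as log Gamma(n + 1)."""
    return lgamma(n + 1)


def log_p(m, n1, n2, l):
    """Log probability of exactly l overlap days, given m, n1 and n2."""
    if l < max(0, n1 + n2 - m) or l > min(n1, n2):
        return float("-inf")  # this overlap count is impossible
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))


def per_day_significance(m, n1, n2, l):
    """Normalize the log probability by the number of shared days."""
    return log_p(m, n1, n2, l) / m


# Sanity check: the probabilities over all possible l should sum to 1.
total = sum(exp(log_p(10, 4, 3, l)) for l in range(0, 4))
print(total, per_day_significance(10, 4, 3, 3))
```

Working in log space avoids overflow in the factorials when $m$ is in the thousands, which is typical for multi-year daily records.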

### Significance distribution

![significance.png](report_figures/7/2.png)


### Correlation matrix

![nlp_matrix.png](report_figures/7/4.png)

We see that the first 35 stations are slightly more correlated than the others, but in general the correlations are not very strong.

### Explanation and possible extensions
![nlp_matrix.png](report_figures/7/5.png)

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the lower-left and upper-right corners of the first matrix. The stations at positions 0-25 are clearly strongly correlated with the stations at positions 45-83.
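The reordering itself is a simple permutation: sort the station indices by their entries in the chosen eigenvector, then apply that order to both rows and columns. The sketch below uses a tiny made-up matrix and vector. How the eigenvector is obtained is not shown, and the names are illustrative.

```python
def reorder_by_vector(matrix, vector):
    """Permute rows and columns of a square matrix by the sort order of vector."""
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    reordered = [[matrix[i][j] for j in order] for i in order]
    return order, reordered


# Hypothetical 3-station matrix and eigenvector entries, for illustration only.
matrix = [
    [1.0, 0.1, 0.9],
    [0.1, 1.0, 0.2],
    [0.9, 0.2, 1.0],
]
vector = [0.6, -0.5, 0.62]

order, reordered = reorder_by_vector(matrix, vector)
print(order)
for row in reordered:
    print(row)
```

After reordering, stations 0 and 2, which have similar eigenvector entries and a high mutual value, end up adjacent, so their block becomes visible in the matrix.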
