# Middle-west Georgia Weather Analysis

This report presents a historical analysis of weather patterns in an area that approximately overlaps the middle-western part of the state of Georgia.

The data we use comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate Columbus - Georgia and Weather averages Columbus.jpeg" src="my_figures/Climate Columbus - Georgia and Weather averages Columbus.jpeg" style="height:300px" /></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="my_figures/TMIN,TMAX.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation, we need to convert millimeters per day to inches per month. According to our analysis, the average rainfall is 3.50 mm/day, which translates to about 4.14 inches per month. According to US Climate Data, the average rainfall is closer to 3.8 inches per month. However, the shape of the month-to-month fluctuation clearly agrees with that of US Climate Data.</p>
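The unit conversion can be sketched as follows. This is a minimal illustration that assumes a 30-day month; the report does not state the month length it used, so the result may differ slightly from the figure quoted above. The function name is hypothetical.

```python
MM_PER_INCH = 25.4  # exact definition of the inch in millimeters


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=30.0):
    """Convert an average daily rainfall in mm to inches per month.

    days_per_month is an assumption (30 by default), not a value
    taken from the report.
    """
    return mm_per_day * days_per_month / MM_PER_INCH


inches = mm_per_day_to_inches_per_month(3.50)
print(f"{inches:.2f} inches per month")
```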

<p>&nbsp;<img alt="PRCP.png" src="my_figures/PRCP.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
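As a rough sketch of this computation: if the eigenvalues of the covariance matrix are known, the fraction of variance explained by the top k eigen-vectors is the sum of the k largest eigenvalues divided by the sum of all of them. The eigenvalues below are made up for illustration and are not taken from the weather data; the function name is hypothetical.

```python
from itertools import accumulate


def cumulative_variance_explained(eigenvalues):
    """Fraction of total variance explained by the top-k eigen-vectors,
    for k = 1 .. len(eigenvalues)."""
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(ordered)
    return [partial / total for partial in accumulate(ordered)]


# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [4.0, 2.0, 1.0, 1.0, 2.0]
fractions = cumulative_variance_explained(example_eigenvalues)
print([round(f, 2) for f in fractions])
```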

### Percentage of variance explained.
![VarExplained1.png](my_figures/VarExplained1.png)
We see that the top 5 eigen-vectors explain 22% of variance for TMIN, 48% for TOBS and 21% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 39% of the variance.

![VarExplained2.png](my_figures/VarExplained2.png)

The top 5 eigenvectors explain 7% of the variance for PRCP, which is a low value. On the other hand, the top 5 eigenvectors explain 53% of the variance for SNOW (much higher than that of Boston) and 85% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow signals. Based on that, we will dig deeper into the PCA analysis for snow depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
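One simple way to picture a "decaying integral" is a running total in which yesterday's depth shrinks by a fixed factor before today's snowfall is added. The constant decay factor is an assumption made purely for illustration; real snowpack melts at a rate that depends on temperature and other conditions.

```python
def decaying_accumulation(snowfall, decay=0.9):
    """Accumulate daily snowfall into a depth series.

    Each day the previous depth is multiplied by `decay` (a hypothetical
    constant melt factor) and the new snowfall is added.
    """
    depth = []
    current = 0.0
    for fall in snowfall:
        current = decay * current + fall
        depth.append(current)
    return depth


# A single snowfall event followed by dry days decays gradually
# instead of dropping to zero the next day.
print(decaying_accumulation([10.0, 0.0, 0.0], decay=0.5))
```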

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 83% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from mid-December to the end of April, with the peak of the snow depth in the middle of February.
![SNWD_mean_eigs.png](my_figures/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function throughout January, but it is much smoother. The interpretation of this shape is that **eig1** represents a snow depth that is distributed almost uniformly over January, holding a nearly constant value of 0.2 despite some other variations.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow in mid-Feb.
* **eig3:** less snow in the early part of mid-Feb, more snow in the later part of mid-Feb.
* **eig4:** more snow at the end of Jan, less snow at the start of Jan.

### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](my_figures/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](my_figures/SNWD_grid_neg_coeff1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](my_figures/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](my_figures/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to less snow in mid-Feb. Negative values for coeff2 correspond to more snow in mid-Feb.

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](my_figures/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](my_figures/SNWD_grid_neg_coeff3.png)

Judging by the shape of the mean, there always seems to be a snow spike in the middle of Feb. Large positive values of coeff3 correspond to a late snow spike. Negative values of coeff3 correspond to an early snow spike.



### Best reconstruction
c1 = -2.53, c2 = -429.80, c3 = 119.03  
residual normalized norm  after mean: 0.995878434734  
residual normalized norm  after mean + top eigs: [ 0.99996826  0.08271237  0.01236778]  
![SNWD_best_reconstruction.png](my_figures/SNWD_best_reconstruction.png)

As we can see in the figure above, only the snow spike in the middle of Feb is reconstructed. It does not snow much in this area (compared to Boston). We filtered out the data vectors for which the mean is a worse approximation than zero, which left only 103 out of 2015. The SNWD mean curve is also very noisy, which means that there is no obvious relationship between the season and the snow depth. That may be the reason why the snow depth curve cannot be reconstructed very accurately.
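The idea behind the residual figures can be sketched as follows. This is an illustration under stated assumptions, not the code that produced the numbers above: it assumes orthonormal eigen-vectors and normalizes each residual by the squared norm of the mean-subtracted vector, and the report does not say which normalization it used. All names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def residual_fractions(x, mean, eigvecs):
    """Fraction of the squared norm of (x - mean) that remains after
    removing the projections on the first 1, 2, ... eigen-vectors.

    Assumes the eigen-vectors are orthonormal and x differs from mean.
    """
    residual = [a - m for a, m in zip(x, mean)]
    base = dot(residual, residual)
    fractions = []
    for vec in eigvecs:
        coeff = dot(residual, vec)  # coefficient of this eigen-vector
        residual = [r - coeff * e for r, e in zip(residual, vec)]
        fractions.append(dot(residual, residual) / base)
    return fractions


# Toy example: the vector lies entirely in the span of the two eigen-vectors.
print(residual_fractions([3.0, 4.0, 0.0], [0.0, 0.0, 0.0],
                         [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
```

A fraction close to 0 means the reconstruction captures almost all of the deviation from the mean; a fraction close to 1 means it captures almost none.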

### Distribution of the residuals and the coefficients
![SNWD_coeff_1_CDF.png](my_figures/SNWD_coeff_1_CDF.png)
![SNWD_coeff_2_CDF.png](my_figures/SNWD_coeff_2_CDF.png)
![SNWD_coeff_3_CDF.png](my_figures/SNWD_coeff_3_CDF.png)
![SNWD_res_1_CDF.png](my_figures/SNWD_res_1_CDF.png)
![SNWD_res_2_CDF.png](my_figures/SNWD_res_2_CDF.png)
![SNWD_res_3_CDF.png](my_figures/SNWD_res_3_CDF.png)

### Geographical distribution of first 3 coefficients. 
The area of each circle corresponds to the number of observations at that station. The color of the circle represents the value of the coefficient: the warmer the color, the larger the coefficient; the colder the color, the smaller the coefficient.
#### Coefficient 1
![7_SNWD_coeff1_map.png](my_figures/7_SNWD_coeff1_map.png)
In this figure, the colors are warmer at the edges of the area, and there are some blue circles east of Columbus. Since coeff1 represents the overall magnitude of the snow depth, there is more snow at the edges of the area and less snow east of Columbus.
#### Coefficient 2
![7_SNWD_coeff2_map.png](my_figures/7_SNWD_coeff2_map.png)

#### Coefficient 3
![7_SNWD_coeff3_map.png](my_figures/7_SNWD_coeff3_map.png)

In this figure, there are some blue circles in the northeast of the area, while the other circles are almost all yellow. Since eig3 represents the relative timing of the mid-Feb snow spike, the spike happens a little earlier in the northeast of the area and a little later at the other stations.

## The variation in the timing of snow is mostly due to year-to-year variation
In the previous section we see the variation of Coeff1, which corresponds to the total amount of snow, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

**coeff_1**  
total MS                   =  9734.99735929  
MS removing mean-by-station=  4443.87675531  fraction explained= 0.543515360991  
MS removing mean-by-year   =  80.3140511889  fraction explained= 0.991749966823  


**coeff_2**  
total MS                   =  5689.913501  
MS removing mean-by-station=  4012.18198595  fraction explained= 0.29486063624  
MS removing mean-by-year   =  1611.59505356  fraction explained= 0.71676282016  

**coeff_3**  
total MS                   =  731.724942806  
MS removing mean-by-station=  477.816011085 fraction explained= 0.34700051463  
MS removing mean-by-year   =  618.131053812 fraction explained= 0.155241242096  



For coeff_1 and coeff_2, the variation by year explains more than the variation by station. For coeff_3, however, the variation by year explains less than the variation by station.


Coeff_1 has to do with the total amount of snow. Coeff_2 and coeff_3 have to do with the timing of the snow. For coeff_1, the station explains 54.4% of the variance while the year explains 99.2%. For coeff_2, the station explains 29.5% of the variance while the year explains 71.7%. For coeff_3, the station explains 34.7% of the variance while the year explains 15.5%.
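The "fraction explained" figures can be reproduced in spirit with a short sketch like the one below. It assumes that MS means the mean of the squared values and that each record is a hypothetical `(station, year, coefficient)` tuple; the actual data layout used for the report is not shown here.

```python
from collections import defaultdict


def mean_square(values):
    """Mean of the squared values (the 'MS' in the tables above)."""
    return sum(v * v for v in values) / len(values)


def fraction_explained(records, key_index):
    """Fraction by which the mean square drops after subtracting the
    per-group mean from every record.

    records:   list of (station, year, coefficient) tuples (assumed layout)
    key_index: 0 to group by station, 1 to group by year
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[2])
    group_mean = {key: sum(vals) / len(vals) for key, vals in groups.items()}

    total_ms = mean_square([rec[2] for rec in records])
    residual_ms = mean_square(
        [rec[2] - group_mean[rec[key_index]] for rec in records]
    )
    return 1.0 - residual_ms / total_ms


# Toy data in which the coefficient depends only on the year.
toy = [("A", 2000, 1.0), ("B", 2000, 1.0), ("A", 2001, -1.0), ("B", 2001, -1.0)]
print(fraction_explained(toy, key_index=1))  # grouped by year
print(fraction_explained(toy, key_index=0))  # grouped by station
```

In the toy data the year explains everything and the station explains nothing, which is the extreme version of the pattern seen for coeff_1.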

## Analysis of Temperature at the Observation Time

We choose to analyze the eigen-decomposition for TOBS because the first 4 eigen-vectors explain 48% of the variance.

First, we graph the mean and the top 3 eigen-vectors.

We observe that the temperature at the observation time increases from January to July, stays at about 26.5 degrees Celsius until the start of September, and then goes down to about 8 degrees Celsius.
![TOBS_mean_eigs.png](my_figures/TOBS_mean_eigs.png)

Next we interpret the eigen-functions. 
They can be interpreted as follows:
* **eig1:** acts like a base signal and represents the average temperature across the year.
* **eig2:** the difference in temperature between summer and winter.
* **eig3:** the trend of the average temperature (whether it increases or decreases).

## Analysis of correlation between precipitation across locations


We graph the mean and the top 3 eigen-vectors. We observe that there is less precipitation in June and September, and that it fluctuates toward higher values in winter and early summer.
![PRCP_meanEigs.png](my_figures/PRCP_meanEigs.png)


### Definition of statistical test
We want to find a statistical test for rejecting the null hypothesis that says that the rainfall in the two locations is independent.

Using the inner product is too noisy, because you multiply the rainfall on the same day in two locations and that product can be very large - leading to a large variance and poor ability to discriminate.

An alternative is to ignore the amount of rain, and just ask whether it rained in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.
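The overlap idea can be sketched as follows. The report does not state the exact distribution it used, so this sketch makes an assumption: given the number of common days and the number of rainy days at each station, the count of shared rainy days under independence follows a hypergeometric distribution. The function name and arguments are hypothetical.

```python
from math import comb


def overlap_p_value(n_days, rain_a, rain_b, overlap):
    """Probability of seeing at least `overlap` shared rainy days if the
    rainy days of the two stations were placed independently.

    n_days:  number of days on which both stations reported
    rain_a:  rainy days at station A among those days
    rain_b:  rainy days at station B among those days
    overlap: days on which it rained at both stations
    """
    total = comb(n_days, rain_b)
    upper = min(rain_a, rain_b)
    tail = sum(
        comb(rain_a, j) * comb(n_days - rain_a, rain_b - j)
        for j in range(overlap, upper + 1)
    )
    return tail / total


# Toy example: 4 common days, 2 rainy days at each station, both shared.
print(overlap_p_value(4, 2, 2, 2))
```

A small value means that so many shared rainy days would be unlikely if the two stations were independent.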
### Percentage of variance explained.

![PRCP_explained.png](my_figures/PRCP_explained.png)

As we can see in the figure, the top 9 eigenvectors explain only 12% of the variance, which means that PCA is not well suited to precipitation. The precipitation varies a lot from day to day within a year, and the data is quite noisy.

![PRCP_CDF.png](my_figures/PRCP_CDF.png)
As we see from the graph above, it rains on about one third of the days in our region.

### Histogram of the resulting p-values
We choose 2 stations to illustrate the hypothesis test:
* USC00092161, which has 24087 records.
* USC00093271, which has 32598 records.

After doing hypothesis testing for all unique pairs of stations, the plot below displays a histogram of the resulting p-values. The p-value is the probability of seeing a result at least as extreme as the one given by the data under the null hypothesis. So a low p-value means that it is unlikely to see such results under the null hypothesis.

Since we do not find many low p-values, we cannot reject the null hypothesis for most pairs of stations. The pairs that do have low p-values indicate that the stations are not completely independent.


![PRCP_significance.png](my_figures/PRCP_significance.png)


### Finding structure in the dependency matrix.

![PRCP_correlation.png](my_figures/PRCP_correlation.png)

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the upper left corner and the lower right corner of the first matrix. The stations at positions 0-55 are clearly strongly correlated with each other, and so are the stations at positions 90-110. Now consider the second matrix. The upper left corner and a square at the lower right show that the stations at positions 0-60 are somewhat correlated with each other, as are those at positions 65-100.
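The reordering step can be sketched as follows. This is a minimal illustration that assumes the eigenvector is already available as a list with one entry per station; the matrix and vector below are made up, and the function name is hypothetical.

```python
def reorder_by_vector(matrix, vector):
    """Reorder the rows and columns of a square matrix by the sorted
    order of the entries of `vector` (for example, an eigenvector).

    Returns the reordered matrix and the permutation that was applied.
    """
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    reordered = [[matrix[i][j] for j in order] for i in order]
    return reordered, order


# Toy dependency matrix: stations 0 and 2 form one group, 1 and 3 another.
toy_matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
toy_vector = [-0.5, 0.5, -0.5, 0.5]  # hypothetical eigenvector entries

reordered, order = reorder_by_vector(toy_matrix, toy_vector)
for row in reordered:
    print(row)
```

After reordering, stations with similar eigenvector entries sit next to each other, so groups of dependent stations show up as square blocks along the diagonal.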