# Alabama, Georgia and Florida Region Weather Analysis

This report presents a historical analysis of weather patterns in a region of the southern United States that covers parts of Alabama, Georgia and Florida.

<p><img alt="region.png" style="height:400px; width:1000px" src="r_figures/region.png" /></p>

The data we use comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** the temperature at the time of observation, used in this report as the average temperature for each day.
* **PRCP:** daily precipitation (in mm).
* **SNOW:** daily snowfall (in mm).
* **SNWD:** the depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/florida/united-states/3179" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="sanity.png" style="height:350px; width:450px" src="r_figures/sanity.png" /></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily&nbsp;temperatures agree with the ones from our data, once we convert Fahrenheit to Celsius.</p>

<p>&nbsp;</p>

<p><img alt="tmin_tmax_c.png" src="r_figures/tmin_tmax_c.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation&nbsp;we need to convert millimeters per day to inches per month. According to our analysis the average rainfall is 3.3377 mm/day, which translates to about 3.997 inches&nbsp;per month. According to US Climate Data the average rainfall is closer to 4.93 inches per month.</p>
<p>The difference is most likely because the US Climate Data figures are centered on Florida, whereas our data is centered on the intersection of the three states. Even so, the two graphs show a similar rainfall pattern: most of the rain falls in the summer, from June to August, and rainfall drops significantly in April and May.</p>
<p>Overall, US Climate Data and our data show many similar trends. We therefore conclude that our data is valid and unlikely to be corrupted, and that it supports a reliable analysis of the region between Alabama, Georgia and Florida.</p>
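
The unit conversions used in the comparison above can be sketched as follows. This is a minimal illustration, not the code used to produce the report: the function names are hypothetical, and the month length of 365/12 days is an assumption that happens to reproduce the 3.997 figure.

```python
# Assumed constants: the report does not state which month length it used.
DAYS_PER_MONTH = 365 / 12   # average month length, ignoring leap years
MM_PER_INCH = 25.4


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9


def mm_per_day_to_inches_per_month(mm_per_day):
    """Convert an average daily rainfall (mm/day) to inches per average month."""
    return mm_per_day * DAYS_PER_MONTH / MM_PER_INCH


# Average daily rainfall reported in the text above.
print(round(mm_per_day_to_inches_per_month(3.3377), 3))  # 3.997
```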

<p>&nbsp;<img alt="prcp_new.png" src="r_figures/prcp_new.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
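
As a rough illustration of how such a curve can be computed, the sketch below derives the cumulative fraction of variance explained from the eigenvalues of the covariance matrix. It assumes a hypothetical array `X` with one row per station-year and one column per day, and it assumes there are no missing values, which real weather data would not satisfy. The synthetic data is for demonstration only.

```python
import numpy as np


def explained_variance_curve(X):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    X is a hypothetical (n_samples, n_days) array with no missing values.
    Entry k-1 of the result is the fraction explained by the top k eigenvectors.
    """
    Xc = X - X.mean(axis=0)                    # remove the mean vector
    cov = Xc.T @ Xc / X.shape[0]               # (n_days, n_days) covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0, None)        # guard against tiny negative values
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic example: one strong seasonal direction plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.cos(2 * np.pi * days / 365)
X = 5 * rng.normal(size=(200, 1)) * seasonal + rng.normal(size=(200, 365))

curve = explained_variance_curve(X)
print(curve[:5])   # fraction explained by the top 1..5 eigenvectors
```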

### Percentage of variance explained

![VarExplained1.png](r_figures/pca_1.png)

We see that the top 5 eigen-vectors explain 23% of variance for TMIN, 50% for TOBS and 22% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 40% of the variance. We will dig deeper into the PCA analysis for TOBS.

![VarExplained2.png](r_figures/pca_2.png)

The top 5 eigenvectors explain only 6.5% of the variance for PRCP, a very low value. On the other hand, the top 5 eigenvectors explain 90% of the variance for SNWD and SNOW, which means that they capture most of the variation in the snow signals.

This region receives a large amount of rainfall throughout the year, which results in a lot of variance in precipitation. Capturing that variance with a few eigenvectors is therefore very hard, and the top 5 eigenvectors explain only 6.5% of it.

In contrast, very little snow is expected in this region, so the variance in snowfall is small and a few eigenvectors are enough to explain most of it: the top 5 eigenvectors explain 90% of the variance for both SNOW and SNWD. Normally we would expect SNWD to be less noisy than SNOW, since SNWD is a decaying integral of SNOW and varies less between days and between the same date in different years. That difference disappears when the amounts of SNOW and SNWD are very small, as they are here, probably because there are only 1 or 2 snowfalls in a whole year.

## Analysis of Average Temperature (TOBS)

We choose to analyze the eigen-decomposition for average temperature because the first 5 eigen-vectors explain 50% of the variance. 

In addition, TMIN and TMAX are harder to capture because they are inconsistent from year to year, whereas average temperature is much more consistent. We do not analyze precipitation because it is very hard to capture by eigen-decomposition: as we saw in the earlier graph, the top 5 eigenvectors explain only 6.5% of its variance. Snowfall is not interesting to analyze either, because this region rarely gets any snow.

First, we graph the mean and the top 5 eigen-vectors.

We observe that the average temperature has less variance in the middle of the year (from May to August) and much more variance in the other months. This is also reflected in the eigen-function graph, where all 5 eigen-functions vary less in the middle of the year and much more at the ends. The temperature peaks in the middle of May and stays there until the middle of August.

![tobs_eigen.png](r_figures/tobs_eigen.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function: a straight line that indicates the overall average temperature throughout the year. Like the mean, it mostly captures the overall average temperature throughout the year and does not oscillate at the ends. We interpret this shape to mean that eig1 represents the overall temperature level for this region and acts as a baseline, with the rest of the eigen-functions capturing the remaining variance of TOBS.

**eig2, eig3, eig4 and eig5** are similar in the following way: they all oscillate between positive and negative values, and the oscillation has a larger range in the colder months and a smaller one in the warmer months. In other words, they correspond to changes in the distribution of the average temperature, with more variance at the ends of the year and less in the middle. From the variance graph we also know that the first eigen-vector captures 40% of the variance and each subsequent eigen-vector captures roughly 2%. Eigen-vectors 2 to 5 therefore do not carry much meaningful information.

They can be interpreted as follows:
* **eig2:** decreasing average temperature from Jan to Feb.
* **eig3:** higher and increasing average temperature from Feb to April.
* **eig4:** consistent temperature increase from Feb to April. 
* **eig5:** consistent temperature increase from Feb to April.


### Best reconstruction
This report does not go into detail on the coefficients, since the first eigenvector is by far the most important; the remaining eigenvectors each explain about the same amount of variance, and far less than the first one.

![best_reconstruction.png](r_figures/best_reconstruction.png)

The residual normalized norm after the mean is 0.218213868841.

The residual normalized norms after the mean plus the top eigen-functions are 0.52020091, 0.34125552 and 0.34085702. As we expected, the residual barely changes between the last two values, since the later eigen-functions explain little of the variance.
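
To make the notion of a residual norm concrete, here is a minimal sketch of one way to compute it. The normalization is an assumption, since the report does not say which one it uses: the residual after the mean is divided by the squared norm of the original vector, and each residual after the top-k eigen-functions is divided by the squared norm of the mean-removed vector. Under that convention the two sets of numbers are not directly comparable. The function name and the toy data are hypothetical.

```python
import numpy as np


def residual_norms(x, mean, eigvecs):
    """Normalized residual norms for a single station-year vector x.

    eigvecs is an (n_days, k) array with orthonormal columns.
    Returns (res_mean, [res_1, ..., res_k]) under the assumed normalization:
    res_mean is relative to ||x||^2, res_j is relative to ||x - mean||^2.
    """
    centered = x - mean
    res_mean = np.sum(centered ** 2) / np.sum(x ** 2)
    res_k = []
    for k in range(1, eigvecs.shape[1] + 1):
        U = eigvecs[:, :k]
        recon = U @ (U.T @ centered)           # projection on the top-k eigenvectors
        res_k.append(np.sum((centered - recon) ** 2) / np.sum(centered ** 2))
    return res_mean, res_k


# Toy data: a flat mean, three orthonormal directions and a little noise.
rng = np.random.default_rng(1)
n_days = 365
U, _ = np.linalg.qr(rng.normal(size=(n_days, 3)))   # orthonormal columns
mean = np.full(n_days, 20.0)
x = mean + U @ np.array([5.0, 2.0, 1.0]) + 0.1 * rng.normal(size=n_days)

res_mean, res_k = residual_norms(x, mean, U)
print(res_mean, res_k)
```

Because each step projects onto a larger subspace, the values in `res_k` can only decrease or stay the same as more eigen-functions are added.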

## The variation in average temperature is mostly due to year-to-year variation
We now estimate the relative importance of location-to-location variation relative to year-by-year variation. To estimate the effect of time vs. location on the first eigenvector coefficient we compute:

    The average row: mean-by-station
    The average column: mean-by-year

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

    total RMS                   =  198.26720752
    RMS removing mean-by-station=  190.278337595
    RMS removing mean-by-year   =  106.61472991

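We see that the variation by year explains more than the variation by station: removing the mean-by-year reduces the RMS from about 198.3 to 106.6, whereas removing the mean-by-station only reduces it to 190.3.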
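
The comparison above can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the numbers in this report: it assumes a hypothetical table with one row per year and one column per station, with NaN where a station has no data for a year, and it uses synthetic data in which the year effect is deliberately larger than the station effect.

```python
import numpy as np


def rms(a):
    """Root mean square over the non-missing entries."""
    return np.sqrt(np.nanmean(a ** 2))


def rms_decomposition(table):
    """table: hypothetical (n_years, n_stations) array, NaN where missing."""
    mean_by_station = np.nanmean(table, axis=0)   # the average row
    mean_by_year = np.nanmean(table, axis=1)      # the average column
    return {
        "total": rms(table),
        "minus_station": rms(table - mean_by_station),
        "minus_year": rms(table - mean_by_year[:, np.newaxis]),
    }


# Synthetic coefficients: strong year effect, weak station effect, some gaps.
rng = np.random.default_rng(2)
year_effect = rng.normal(scale=100, size=(50, 1))
station_effect = rng.normal(scale=20, size=(1, 12))
table = year_effect + station_effect + rng.normal(scale=30, size=(50, 12))
table[rng.random(table.shape) < 0.1] = np.nan

result = rms_decomposition(table)
for name, value in result.items():
    print(f"{name:14s} = {value:.2f}")
```

The larger the drop from `total`, the more of the variation that factor accounts for.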

## Average Temperature trends in the 70 years of this region
The following graph shows the average temperature for the years 1902 and 1972. There have not been any dramatic temperature changes over these 7 decades.

<p><img alt="trend.png" style="height:300px; width:500px" src="r_figures/trend.png" /></p>
