# Georgia-Alabama Region Weather Analysis

This report presents a historical analysis of weather patterns in an area that roughly overlaps the US states of Georgia and Alabama.

The data comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.

## Checking Valid Counts

We first check the distribution of valid counts for the various measurements over the months of the year. The following graphs show the valid counts for TMIN, TMAX, TOBS, PRCP, SNOW and SNWD.

![r_Valid_Counts_TMIN_TMAX.png](r_figures/r_Valid_Counts_TMIN_TMAX.png)
![r_Valid_Counts_TOBS_PRCP.png](r_figures/r_Valid_Counts_TOBS_PRCP.png)
![r_Valid_Counts_SNOW_SNWD.png](r_figures/r_Valid_Counts_SNOW_SNWD.png)

The distribution of valid counts is clearly not uniform. In particular, the valid counts for SNWD drop sharply in the winter. This is probably because snow depth is difficult to measure in the cold and was, at least historically, measured manually.
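A minimal sketch of how such valid counts might be tallied, assuming each station-year record is a list of daily values with `None` marking a missing reading. The record layout and the name `valid_counts_by_day` are illustrative and are not taken from the original notebooks.

```python
def valid_counts_by_day(records):
    """Count, for each day index, how many records hold a valid value.

    `records` is a list of equal-length lists; None marks a missing reading
    (an assumption of this sketch).
    """
    if not records:
        return []
    counts = [0] * len(records[0])
    for record in records:
        for day, value in enumerate(record):
            if value is not None:
                counts[day] += 1
    return counts


# Toy example: three station-years, four days each.
toy = [
    [1.0, None, 3.0, None],
    [None, None, 2.0, 5.0],
    [4.0, 0.0, 1.0, None],
]
print(valid_counts_by_day(toy))  # [2, 1, 3, 1]
```

Note that a reading of `0.0` counts as valid; only `None` is treated as missing.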

## Sanity-check: comparison with outside sources

<p>To begin with, we compare some of our general statistics with graphs obtained from a site called <a href="http://www.usclimatedata.com/climate/columbus/georgia/united-states/usga0823" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month. This data is for Columbus, GA, a prominent city near the middle of our region.</p>


<p><img alt="r_Columbus_Averages.png" src="r_figures/r_Columbus_Averages.png" /></p>


<p>We observe that the High closely follows the pattern of our TMAX mean, being as low as 15 degrees Celsius during the winter months and rising to around 33 degrees Celsius in peak summer. Likewise, the Low closely follows the pattern of our TMIN mean, being around 20 degrees Celsius in peak summer and dropping to around 3 degrees Celsius in peak winter.</p>


<p><img alt="r_TMIN_TMAX_Mean_STD.png" src="r_figures/r_TMIN_TMAX_Mean_STD.png" style="height:450px; width:1200px" /></p>

<p>Now, let's compare the precipitation. The website gives precipitation in millimeters per month. According to its graph, precipitation fluctuates between 60 mm and 120 mm per month, which translates to roughly 2 mm to 4 mm of rainfall per day. This closely matches our PRCP mean, which fluctuates between 2.00 mm/day and 5.00 mm/day. In addition, the pattern of fluctuation closely resembles that of the provided graph.</p>

<p>&nbsp;<img alt="r_PRCP_Mean_STD.png" src="r_figures/r_PRCP_Mean_STD.png" style="height:450px; width:600px" /></p>
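The monthly-to-daily conversion used in the comparison above can be written out as a short sketch. The 30-day month is a rough assumption made here for illustration, and the function name is hypothetical.

```python
DAYS_PER_MONTH = 30  # rough assumption; real months have 28 to 31 days


def monthly_to_daily_mm(monthly_mm, days=DAYS_PER_MONTH):
    """Convert a monthly precipitation total (mm) to an average daily value (mm/day)."""
    return monthly_mm / days


low = monthly_to_daily_mm(60)
high = monthly_to_daily_mm(120)
print(low, high)  # 2.0 4.0
```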

<p>Overall, our observed statistical patterns agree with the external source. Hence, the data passes the sanity check!</p>



## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](r_figures/r_VarExplained1.png)

We see that the top 5 eigenvectors explain 22% of the variance for TMIN, 48% for TOBS and 20% for TMAX.

We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains 40% of the variance.
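As a sketch of how such a curve can be derived, the fraction of variance explained by the top k eigenvectors is the cumulative sum of the k largest eigenvalues divided by the sum of all eigenvalues. The eigenvalues below are made up so that the output mimics the TOBS figures quoted above; they are not the values from the actual decomposition.

```python
from itertools import accumulate


def cumulative_variance_explained(eigenvalues):
    """Fraction of total variance explained by the top-k eigenvectors, k = 1..n."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("eigenvalues sum to zero; nothing to explain")
    return [partial / total for partial in accumulate(ordered)]


# Hypothetical spectrum: one dominant eigenvalue and a long flat tail.
eigs = [40.0, 3.0, 2.0, 2.0, 1.0] + [1.0] * 52
explained = cumulative_variance_explained(eigs)
print(explained[0], explained[4])  # 0.4 0.48
```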

![VarExplained2.png](r_figures/r_VarExplained2.png)

The top 5 eigenvectors explain 7% of the variance for PRCP, 52% for SNOW and 82% for SNWD. The value for PRCP is very low, whereas the values for SNOW and SNWD are quite high, with SNWD the highest. This shows that the top 5 eigenvectors capture most of the variance in the SNOW and SNWD signals but not in the PRCP signal. Before we try to analyze this, let's first look at the general snowfall trends in this region.

The Georgia-Alabama region under consideration receives little to no snowfall. The SNOW and SNWD mean ± std plots below show that it snows very little in this region compared to the USA average.
![r_SNOW_SNWD_Mean_STD.png](r_figures/r_SNOW_SNWD_Mean_STD.png)

Snow is therefore close to zero almost throughout the year, with a little snowfall in the winter months. Given this relatively simple trend, SNOW and SNWD are explained very well by the top 5 eigenvectors, which implies that they have a very low-dimensional pattern. Moreover, SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years. It is therefore captured best, with the top 5 eigenvectors explaining as much as 82% of the total variance.

PRCP, on the other hand, has a more complex pattern, with fluctuations throughout the year and less obvious high-level patterns. In other words, it is much noisier than SNOW or SNWD. Hence, the top 5 eigenvectors explain only 7% of the variance for PRCP, and we would need many more eigenvectors to explain a considerable amount of the variance.
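To illustrate what "decaying integral" means here, the following toy model accumulates snowfall and lets the existing depth shrink by a fixed factor each day. The geometric decay and the `decay` value are assumptions made for illustration only; real melting and compaction are more complicated.

```python
def snow_depth_from_snowfall(snowfall, decay=0.8):
    """Toy model: depth[t] = decay * depth[t-1] + snowfall[t]."""
    depth = 0.0
    depths = []
    for fall in snowfall:
        depth = decay * depth + fall
        depths.append(depth)
    return depths


# Two isolated snowfall events produce a smoother, slowly fading depth curve.
snowfall = [0.0, 10.0, 0.0, 0.0, 5.0, 0.0]
print(snow_depth_from_snowfall(snowfall, decay=0.5))
# [0.0, 10.0, 5.0, 2.5, 6.25, 3.125]
```

The depth series changes more gradually from day to day than the snowfall series, which is consistent with SNWD being easier to capture with a few eigenvectors.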

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 5 eigen-vectors explain around 82% of the variance.

First, we plot a graph of the mean and the top 4 eigen-vectors for SNWD.

As discussed before, it snows very little in this region. There is light snow from December to March, with the end of February marking the peak of the snow depth.
![r_SNWD_Mean_Eigs.png](r_figures/r_SNWD_Mean_Eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during December while the mean is not.  The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** neutral in Jan, less snow in Feb, neutral in March.
* **eig3:** neutral in Jan, less snow at the end of Feb, immediately followed by a spike in snowfall.
* **eig4:** less snow in Jan, more snow in Feb, neutral in March.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![r_coeff1_pos.png](r_figures/r_coeff1_pos.png)
Coeff1: most negative
![r_coeff1_neg.png](r_figures/r_coeff1_neg.png)

#### Coeff2
Coeff2: most positive
![r_coeff2_pos.png](r_figures/r_coeff2_pos.png)
Coeff2: most negative
![r_coeff2_neg.png](r_figures/r_coeff2_neg.png)

#### Coeff3
Coeff3: most positive
![r_coeff3_pos.png](r_figures/r_coeff3_pos.png)
Coeff3: most negative
![r_coeff3_neg.png](r_figures/r_coeff3_neg.png)

Here, we would like to point out a few things:

1. Our region receives very little snow throughout the year. The distribution of snow has little noise, and there are no rich or interesting patterns to infer.
2. We have only 103 data points, which we suspect is too few for a very useful analysis.

Given these two points, it is not really clear which coefficient has what particular effect on the snow distribution.
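For reference, reconstructions of this kind can be sketched as the mean plus the projection of the centered signal onto the first k eigenvectors. The sketch below assumes orthonormal eigenvectors and no missing values, and it uses a trivial standard basis as stand-in eigenvectors; the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def reconstruct(signal, mean, eigvecs, k):
    """Return (coefficients, approximation) using the first k eigenvectors.

    Assumes the eigenvectors are orthonormal and the signal has no gaps.
    """
    centered = [s - m for s, m in zip(signal, mean)]
    coeffs = [dot(centered, e) for e in eigvecs[:k]]
    approx = list(mean)
    for c, e in zip(coeffs, eigvecs[:k]):
        approx = [a + c * x for a, x in zip(approx, e)]
    return coeffs, approx


# Stand-in data: a 3-day "year" and the standard basis as eigenvectors.
mean = [1.0, 1.0, 1.0]
signal = [4.0, 3.0, 1.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(reconstruct(signal, mean, basis, k=1))  # ([3.0], [4.0, 1.0, 1.0])
```

A large positive or negative first coefficient then corresponds to a year-station pair whose signal departs strongly from the mean in the direction of eig1.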

### The variation in the timing of snow is mostly due to year-to-year variation
We now compare the importance of location-to-location variation with that of year-by-year variation.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

| Coefficient | Total MS | MS removing mean-by-station | Explained by station (%) | MS removing mean-by-year | Explained by year (%) |
|---|---|---|---|---|---|
| coeff_1 | 9734.99735929 | 4443.87675531 | 54.3515360991 | 80.3140511889 | 99.1749966823 |
| coeff_2 | 5689.913501 | 4012.18198595 | 29.486063624 | 1611.59505356 | 71.676282016 |
| coeff_3 | 731.724942806 | 477.816011085 | 34.700051463 | 618.131053812 | 15.5241242096 |

Overall, we observe that for coeff_1 and coeff_2, variation by year explains more than variation by station, which means that the variation in the timing of snow is mostly due to year-to-year variation. The effect is very strong for coeff_1, where year explains over 99% of the variation, and weaker for coeff_2 (about 72%). For coeff_3 the picture is reversed: station explains about 35% while year explains only about 16%. We conclude that for coeff_1, year predominantly explains the variation.
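A minimal sketch of the variance-reduction computation described above, assuming the coefficient is arranged in a table with one row per year and one column per station, and `None` for missing station-year pairs. The layout and the helper names are assumptions, and the sketch takes the mean square of the raw table as the total.

```python
def mean_square(table):
    """Mean of squared entries, ignoring missing (None) cells."""
    values = [v for row in table for v in row if v is not None]
    return sum(v * v for v in values) / len(values)


def remove_row_means(table):
    """Subtract from each row the mean of its non-missing entries."""
    out = []
    for row in table:
        present = [v for v in row if v is not None]
        mu = sum(present) / len(present) if present else 0.0
        out.append([None if v is None else v - mu for v in row])
    return out


def transpose(table):
    return [list(col) for col in zip(*table)]


def percent_explained(table):
    """Rows are years, columns are stations (an assumed layout)."""
    total = mean_square(table)
    ms_year = mean_square(remove_row_means(table))
    ms_station = mean_square(remove_row_means(transpose(table)))
    return {
        "total": total,
        "year": 100 * (1 - ms_year / total),
        "station": 100 * (1 - ms_station / total),
    }


# Toy table: all stations agree within a year, but years differ.
toy_table = [[1.0, 1.0], [-1.0, -1.0]]
print(percent_explained(toy_table))
# {'total': 1.0, 'year': 100.0, 'station': 0.0}
```

In this toy case the year means absorb all of the variation, which is the extreme version of the pattern seen for coeff_1.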


## Analysis of TOBS [Originality]

Next, we choose to analyze the eigen-decomposition for TOBS, because the first 5 eigenvectors explain around 48% of its variance.

First, we plot a graph of the mean and the top 4 eigenvectors for TOBS.
![r_TOBS_mean_eigs.png](r_figures/r_TOBS_mean_eigs.png)

We observe that TOBS peaks in the summer and has low values in the winter months.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![r_TOBS_pos.png](r_figures/r_TOBS_pos.png)

Coeff1: most negative
![r_TOBS_neg.png](r_figures/r_TOBS_neg.png)


Clearly, the very positive values of Coeff1 imply lower than mean TOBS for a particular year-station pair. On the other hand, very negative values of Coeff1 imply higher than mean TOBS for a particular year-station pair.


### The variation in the TOBS is mostly due to year-to-year variation

These are again measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

| Coefficient | Total MS | MS removing mean-by-station | Explained by station (%) | MS removing mean-by-year | Explained by year (%) |
|---|---|---|---|---|---|
| coeff_1 | 474281.847681 | 235389.207946 | 50.3693406997 | 348154.794874 | 26.5932701037 |
| coeff_2 | 38193.7391873 | 34362.8992774 | 10.0300206039 | 13070.610299 | 65.7781338587 |
| coeff_3 | 29287.9121458 | 27747.2174274 | 5.26051399888 | 4513.61237665 | 84.588821647 |


We see that for coeff_2 and coeff_3, variation by year explains more than variation by station: the stations explain 5-10% of the variance while the year explains 65-85%. These coefficients might have to do with the timing of peak and low TOBS. For coeff_1, which has to do with how far above or below the mean the TOBS is, the effect is reversed: station explains about 50% of the variance while year explains only about 27%.

## Analyzing Residuals

The following are the distributions of residuals after 3 eigenvectors for PRCP, SNWD and TOBS, respectively.
![r_residual_PRCP.png](r_figures/r_residual_PRCP.png)
![r_residual_SNWD.png](r_figures/r_residual_SNWD.png)
![r_residual_TOBS.png](r_figures/r_residual_TOBS.png)

For PRCP, the residuals are mostly quite high, which means that the top 3 eigenvectors do not provide a good approximation.

The graph for SNWD is a bit rough and, given that we have only 103 data points, we feel it is not a good idea to draw any inference from the plot.

For TOBS, the distribution is much more even than for PRCP, which implies that, overall, the top 3 eigenvectors provide a better approximation for TOBS than for PRCP.
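One plausible definition of such a residual, given here as an assumption rather than as the exact quantity plotted, is the fraction of the centered signal's squared norm that remains after removing its projection onto the first k eigenvectors. The sketch assumes orthonormal eigenvectors and complete data.

```python
def residual_fraction(signal, mean, eigvecs, k):
    """Share of squared deviation from the mean left after k eigenvectors.

    Returns a value in [0, 1] when the eigenvectors are orthonormal:
    0 means a perfect approximation, 1 means nothing was explained.
    """
    centered = [s - m for s, m in zip(signal, mean)]
    total = sum(c * c for c in centered)
    if total == 0:
        return 0.0
    residual = list(centered)
    for e in eigvecs[:k]:
        coeff = sum(c * x for c, x in zip(centered, e))
        residual = [r - coeff * x for r, x in zip(residual, e)]
    return sum(r * r for r in residual) / total


# Stand-in data with the standard basis playing the role of eigenvectors.
res_mean = [1.0, 1.0, 1.0]
res_signal = [4.0, 3.0, 1.0]
res_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(residual_fraction(res_signal, res_mean, res_basis, k=1))  # 4/13, about 0.31
```

Under this definition, a distribution concentrated near 1 (as described for PRCP) indicates that the leading eigenvectors capture little of the signal.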



## Maps

The following are maps of our region, plotted using ipyleaflet.
The circles in the following map show precipitation. The color corresponds to the amount of precipitation and the circle size corresponds to the count of observations for that particular place on the map. The interactive maps can be found in the respective notebooks.
![r_Map.png](r_figures/r_Map.png)

![r_Map_Polygons1.png](r_figures/r_Map_Polygons1.png)

The following is a zoomed-in version of the previous map.

![r_Map_Polygons2.png](r_figures/r_Map_Polygons2.png)

## Analyzing Station-to-Station Correlation

Let's analyze the correlation between stations. The following matrix shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.

![r_72.png](r_figures/r_72.png)
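The exact null model behind this matrix is not spelled out here. As one plausible sketch, the code below assumes a hypergeometric model: given the number of rain days at each station over their shared days, it computes the probability of seeing at least the observed overlap by chance, takes the log, and divides by the number of shared days. The function name and the normalization are assumptions.

```python
import math


def normalized_log_prob(rain_a, rain_b):
    """Log-probability, per shared day, of at least the observed rain-day
    overlap under independence (hypergeometric null; an assumed model)."""
    n = len(rain_a)
    if n == 0 or n != len(rain_b):
        raise ValueError("need two equal-length, non-empty day lists")
    a = sum(1 for x in rain_a if x)
    b = sum(1 for y in rain_b if y)
    overlap = sum(1 for x, y in zip(rain_a, rain_b) if x and y)
    # Ways to place station B's b rain days so that at least `overlap`
    # of them coincide with station A's a rain days.
    tail = sum(
        math.comb(a, j) * math.comb(n - a, b - j)
        for j in range(overlap, min(a, b) + 1)
    )
    return (math.log(tail) - math.log(math.comb(n, b))) / n


days_a = [True, True, False, False]
same = normalized_log_prob(days_a, days_a)
mixed = normalized_log_prob(days_a, [True, False, True, False])
print(same < mixed)  # True: identical rain days are less likely to be random
```

More negative values indicate an overlap that is harder to attribute to chance, that is, more strongly related stations.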

Further, after we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. In the upper-left graph of the figure below, we can observe that most stations in positions 0-50 are very strongly correlated. This technique helps us better analyze how strongly the stations are correlated.

![r_73.png](r_figures/r_73.png)
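The reordering step can be sketched as sorting the station indices by their entry in the chosen eigenvector and applying the same permutation to both rows and columns. The matrix and vector below are toy values, and the function name is hypothetical.

```python
def reorder_by_vector(matrix, vector):
    """Permute rows and columns of a square matrix by ascending `vector` entries."""
    order = sorted(range(len(vector)), key=lambda i: vector[i])
    reordered = [[matrix[i][j] for j in order] for i in order]
    return order, reordered


# Toy matrix: stations 0 and 2 form one group, stations 1 and 3 another.
toy_matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
toy_vector = [0.5, -0.5, 0.5, -0.5]  # stand-in for an eigenvector
order, blocks = reorder_by_vector(toy_matrix, toy_vector)
print(order)   # [1, 3, 0, 2]
for row in blocks:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [0, 0, 1, 1]
# [0, 0, 1, 1]
```

After the permutation the two groups appear as contiguous blocks on the diagonal, which is the effect visible in the reordered figure.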
