# Georgia Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of Georgia.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was obtained from [this GitHub repository](https://github.com/ucsd-edx/CSE255-DSE230/tree/master/Data/Weather), using the file with index 'BSBSSSSB'. The dataset contains 12127 samples.

In my analysis, I focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs obtained from <a href="http://www.usclimatedata.com/climate/columbus/georgia/united-states/usga0823" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Columbia_-_Georgia_and_Weather_averages_Columbia.jpg" src="report_figures_ye/Climate_Columbia_-_Georgia_and_Weather_averages_Columbia.jpg" /></p>

<p>&nbsp;</p>

<p>We see that the min and max daily&nbsp;temperature agree with the ones we got from our data, once we translate Fahrenheit to Centigrade.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="report_figures_ye/TMIN,TMAX.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation&nbsp;we need to translate millimeters/day to inches/month. According to our analysis the average rainfall is 3.50 mm/day, which translates to about 4.13 inches&nbsp;per month (3.50 &times; 30 / 25.4, taking a 30-day month). According to US Climate Data the average rainfall is about 3.9 inches per month. The two estimates differ slightly, but both sources clearly agree that average precipitation&nbsp;is close to constant throughout the year.</p>

<p>&nbsp;<img alt="PRCP.png" src="report_figures_ye/PRCP.png" style="height:300px; width:800px" /></p>
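
The unit conversion used in the comparison above can be sketched as follows. This is a minimal illustration, not code from the original analysis; the function name and the 30-day month are assumptions chosen because they reproduce the figure quoted in the text.

```python
MM_PER_INCH = 25.4  # exact definition of the inch in millimeters


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=30):
    """Convert an average daily rainfall (mm/day) to inches per month.

    days_per_month=30 is an assumption; a different month length
    shifts the result slightly.
    """
    return mm_per_day * days_per_month / MM_PER_INCH


monthly_inches = mm_per_day_to_inches_per_month(3.50)
print(round(monthly_inches, 2))
```

Using an average month length of about 30.4 days instead would give a slightly larger value, which does not change the conclusion.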


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.
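
As a rough sketch of how such a curve can be computed, the snippet below takes the eigenvalues of a covariance matrix and accumulates them. It uses synthetic data as a stand-in for the weather measurements, and it assumes there are no missing values; the real data contains gaps that would need NaN-aware handling, which is not shown here. The function name is hypothetical.

```python
import numpy as np


def cumulative_variance_explained(cov):
    """Fraction of total variance explained by the top-k eigenvectors, for every k."""
    eigvals = np.linalg.eigvalsh(cov)          # ascending order for symmetric input
    eigvals = np.clip(eigvals[::-1], 0, None)  # largest first; clip round-off negatives
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic stand-in: 200 station-years, each a vector of 365 daily values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 365))
cov = np.cov(X, rowvar=False)                  # 365 x 365 covariance matrix

explained = cumulative_variance_explained(cov)
print(explained[:5])                           # fraction explained by the top 1..5 eigenvectors
```

Reading off `explained[4]` gives the "top 5 eigenvectors" figure of the kind quoted in this section.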

### Percentage of variance explained.
![VarExplained1.png](report_figures_ye/VarExplained1.png)
We see that the top 5 eigen-vectors explain 22% of variance for TMIN, 48% for TOBS and 20.5% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains up to 40% of the variance of TOBS.

![VarExplained2.png](report_figures_ye/VarExplained2.png)

The top 5 eigenvectors explain 7% of the variance for PRCP, 54% for SNOW and 85% for SNWD. The top 5 eigenvectors therefore capture very little of the variation in PRCP, but most of the variation in snow depth. Based on that, we will dig deeper into the PCA analysis for snow depth.

It makes sense that the snow signals, both SNOW and SNWD, are less noisy than the rainfall signal PRCP. Temperatures in Georgia are high, so rain is far more common than snow, and the snow signal varies less between the same date in different years. Comparing SNOW and SNWD, it makes sense that SNWD is less noisy than SNOW: SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.

## Basic statistics of snow-depth and showing stations on map
There are 103 samples with the measurement 'SNWD', collected from 38 different stations. The latitude of these stations ranges from 31.77 to 33.28, and the longitude from -85.5 to -83.60.
![map.png](report_figures_ye/map.png)
This figure is a snapshot of the map created in the notebook using iPyLeaflet. We can see that only some stations have 'SNWD' measurements. This is consistent with the mean temperature in this region: it is subtropical, so snow is rare.

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 3 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 3 eigen-vectors.

We observe that the snow season runs from the beginning of January to the end of February. Almost all snow falls in January. Snow stops at the beginning of February, but the peak of the mean snow depth appears in mid-February. There are also several small spikes between mid-February and the end of March, but they are below 1 mm and can be ignored. Overall there is very little snow during the year: even at the mid-February peak, the mean snow depth is only about 0.9 cm.
![SNWD_mean_eigs.png](report_figures_ye/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has the shape of a rectangular function starting at the beginning of January and ending at the end of January. The mean snow depth at this location is somewhat irregular: almost all snow falls in January, but the peak occurs in mid-February. We interpret eig1 as representing the overall amount of snow above or below the mean in January. It does not capture the spikes that occur in other months.

**eig2 and eig3** are similar in the following way: each is concentrated around a specific time, with either a positive or a negative sign. They carry little information about snow depth over the rest of the year, since it snows very little at these locations. These two eigenvectors capture station-specific features that vary over the winter months.

They can be interpreted as follows:
* **eig2:** less snow from the beginning of Feb to mid-Feb.
* **eig3:** more snow in mid-Feb, slightly less snow at the beginning of Feb.


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](report_figures_ye/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_Neg_coeff1.png](report_figures_ye/SNWD_grid_Neg_coeff1.png)
Large positive values of coeff1 correspond to more than average snow in January. Low values correspond to less than average snow in January.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](report_figures_ye/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](report_figures_ye/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to a snow season with one peak at the beginning of March. Negative values of coeff2 correspond to a snow season with a single peak in mid-Feb.

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](report_figures_ye/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](report_figures_ye/SNWD_grid_neg_coeff3.png)

Both large positive values and negative values of coeff3 correspond to a single peak between the beginning of Feb and mid-Feb. So, I speculate that the third eigenvector is not as important as the first two. This is supported by the "Percentage of Variance Explained" plot for SNWD above: when the number of eigenvectors increases from 2 to 3, the percentage explained increases by only 5%.
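
The reconstructions in these grids combine the mean with the top eigenvectors, each weighted by its coefficient. The sketch below illustrates that idea on synthetic data. The function names are hypothetical, a random orthonormal basis stands in for the real eigenvectors, and missing values are ignored, so this should be read as an illustration under those assumptions rather than the code that produced the figures.

```python
import numpy as np


def reconstruct(mean, eigvecs, coeffs):
    """Mean plus a weighted sum of eigenvectors (the columns of eigvecs)."""
    return mean + eigvecs @ coeffs


def residual_fractions(x, mean, eigvecs):
    """Project x onto orthonormal eigenvectors.

    Returns the coefficients and, for k = 1..K, the fraction of the
    mean-removed variance that is left after using the first k eigenvectors.
    """
    centered = x - mean
    coeffs = eigvecs.T @ centered
    total = centered @ centered
    fractions = []
    for k in range(1, eigvecs.shape[1] + 1):
        leftover = centered - eigvecs[:, :k] @ coeffs[:k]
        fractions.append((leftover @ leftover) / total)
    return coeffs, np.array(fractions)


# Hypothetical stand-in data: a random orthonormal basis, not real eigenvectors
rng = np.random.default_rng(0)
eigvecs, _ = np.linalg.qr(rng.normal(size=(365, 3)))
mean = np.zeros(365)
x = reconstruct(mean, eigvecs, np.array([3.0, -2.0, 0.0]))

coeffs, fractions = residual_fractions(x, mean, eigvecs)
print(np.round(coeffs, 3))
print(np.round(fractions, 3))
```

In this toy case the signal lies entirely in the first two directions, so the leftover fraction drops to essentially zero once two eigenvectors are used, and the third coefficient is essentially zero.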



## Distribution of the residuals and the coefficients
We extract all of the values of `res_3`, which is the residual variance after the mean and the first two eigenvectors have been subtracted out, and plot the cumulative distributions of `res_3` and `coeff_3`.
![CDF_of_res_3.png](report_figures_ye/CDF_of_res_3.png)
![CDF_of_coeff_3.png](report_figures_ye/CDF_of_coeff_3.png)
From the two figures, we can see that the distributions are not uniform. Most instances contribute little to the residuals; the variance is dominated by a few eigenvectors.
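
For reference, a cumulative distribution of this kind can be computed from a list of values as sketched below. The function name and the sample values are made up for illustration; they are not taken from the dataset.

```python
import numpy as np


def empirical_cdf(values):
    """Return sorted values and the fraction of samples at or below each one."""
    v = np.sort(np.asarray(values, dtype=float))
    p = np.arange(1, len(v) + 1) / len(v)
    return v, p


# Made-up residual values, for illustration only
v, p = empirical_cdf([0.2, 0.9, 0.1, 0.5])
print(v)
print(p)
```

Plotting `p` against `v` as a step curve gives a figure of the same type as the two CDF plots above.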

## The variation in the timing of snow is mostly due to year-to-year variation
We now estimate the relative importance of location-to-location variation relative to year-by-year variation.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

| Coefficient | Total MS | MS removing mean-by-station | Fraction explained by station (%) | MS removing mean-by-year | Fraction explained by year (%) |
|---|---|---|---|---|---|
| coeff_1 | 96.67 | 66.67 | 31.0 | 8.96 | 90.7 |
| coeff_2 | 75.43 | 63.34 | 16.0 | 40.14 | 46.8 |
| coeff_3 | 27.05 | 21.86 | 19.2 | 24.86 | 8.1 |
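
The calculation behind these numbers can be sketched as follows, assuming the coefficients are arranged in a table with one row per year and one column per station, and that "MS" is the mean of the squared entries. Both the layout and the function names are assumptions made for illustration, and the toy table is made up.

```python
import numpy as np


def mean_square(table):
    """Mean of squared entries, ignoring missing (NaN) cells."""
    return np.nanmean(table ** 2)


def fraction_explained(table, axis):
    """Percentage of the mean square removed by subtracting group means.

    axis=1 subtracts each row's mean (rows are years here);
    axis=0 subtracts each column's mean (columns are stations here).
    """
    means = np.nanmean(table, axis=axis, keepdims=True)
    total = mean_square(table)
    remaining = mean_square(table - means)
    return total, remaining, 100 * (1 - remaining / total)


# Toy table: rows = 2 years, columns = 2 stations (values are made up)
table = np.array([[10.0, 12.0],
                  [-10.0, -8.0]])

by_year = fraction_explained(table, axis=1)
by_station = fraction_explained(table, axis=0)
print(by_year)
print(by_station)
```

In the toy table the values differ mainly between years, so removing the per-year mean leaves almost nothing, which is the same pattern reported for coeff_1. The reported fractions follow the same formula, for example 100 × (1 − 8.96 / 96.67) ≈ 90.7.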
 
We see that for coeff_1, the variation by year explains more than the variation by station: the stations explain 31% of the variance while the year explains 90.7%. For coeff_2, the variation by year also explains more: the stations explain 16% of the variance while the year explains 46.8%. For coeff_3, the variation by station explains more than the variation by year (19.2% versus 8.1%), but the difference is small. This is consistent with the earlier result that the first two eigenvectors contain most of the information. So, I speculate that the variation in the timing of snow is mostly due to year-to-year variation.

## Analysis of PRCP
As the figure of the percentage of variance explained shows, PRCP is not dominated by a few eigenvectors: the first 5 eigenvectors explain only 7% of the variance. This is also supported by the graph of the mean and first 3 eigenvectors of PRCP.
![PRCP_pca.png](report_figures_ye/PRCP_pca.png)
From the figure above, it is hard to reconstruct the dataset from a few eigenvectors. Therefore, I do not analyze this feature further.