# Florida Weather Analysis

## Dataset
This is a report on the historical analysis of weather patterns in an area that approximately overlaps the state of Florida. The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation, treated in this report as the average temperature for each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.


### Positions of Stations 
<p>As the maps show, all the stations lie in the southern part of the U.S., so it makes sense that there is little snowfall. Note that the latitude is in the range [30.0667, 31.4714] North, and the longitude is in the range [-88.4833, -85.5833].</p>

<p>&nbsp;<img alt="StationPosition_zoomOut.png" src="report_figures/StationPosition_zoomOut.png" style="height:300px; width:800px" /></p>

<p>&nbsp;<img alt="StationPosition.png" src="report_figures/StationPosition.png" style="height:300px; width:800px" /></p>

## Sanity-check: comparison with outside sources

###  Reference Weather Dataset
<p>We compare some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/pensacola/florida/united-states/usfl0715" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_of_Pensacola_Florida.png" src="report_figures/Climate_of_Pensacola_Florida.png" style="height:450px; width:600px" /></p>

<p>&nbsp;</p>


### Max and Min Temperature
<p>We can see that the daily minimum and maximum temperatures agree with the reference data once we convert Fahrenheit to Centigrade.
Note that the vertical axis is in tenths of degrees C.</p>
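
The unit conversion behind this comparison can be sketched as follows. This is a minimal illustration rather than the code used for the report, and the function name `fahrenheit_to_tenths_c` is hypothetical.

```python
def fahrenheit_to_tenths_c(temp_f):
    """Convert degrees Fahrenheit to tenths of degrees Celsius."""
    temp_c = (temp_f - 32.0) * 5.0 / 9.0
    return temp_c * 10.0


print(fahrenheit_to_tenths_c(32.0))   # freezing point -> 0.0
print(fahrenheit_to_tenths_c(212.0))  # boiling point -> 1000.0
```

Reading the reference chart in Fahrenheit and passing each value through such a function puts it on the same scale as the vertical axis of our plots.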

<p>&nbsp;</p>

<p><img alt="TMIN_TMAX.png" src="report_figures/TMIN_TMAX.png" style="height:300px; width:800px" /></p>

<p>&nbsp;</p>

### Average Temperature and Rainfall
<p>In the following graph, we plot TOBS, the average temperature for each day, which is quite similar to the reference data. Note that the vertical axis is in tenths of degrees C.</p>

<p>To compare the precipitation, we need to convert millimeters/day to inches/month.
According to our analysis, the average rainfall is 3.64 mm/day, which translates to about 4.33 inches per month.
According to US Climate Data, the average rainfall is closer to 5.44 inches per month. However, there is clear agreement that average precipitation is close to constant throughout the year. Note that the vertical axis is in tenths of mm.</p>
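
The conversion from mm/day to inches/month can be sketched as below. The function name and the `days_per_month` parameter are assumptions made for illustration; the result depends on the month length used, with 30 days giving roughly 4.30 inches and 365/12 days giving roughly 4.36 inches, which brackets the figure of about 4.33 quoted above.

```python
MM_PER_INCH = 25.4


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=365.0 / 12.0):
    """Convert a daily rainfall rate in mm to a monthly total in inches."""
    return mm_per_day * days_per_month / MM_PER_INCH


# Average daily rainfall from our analysis, under two month-length assumptions.
print(round(mm_per_day_to_inches_per_month(3.64, days_per_month=30.0), 2))
print(round(mm_per_day_to_inches_per_month(3.64), 2))
```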

<p>&nbsp;<img alt="TOBS_PRCP.png" src="report_figures/TOBS_PRCP.png" style="height:300px; width:800px" /></p>


### Snowfall and Snow Depth
<p>In this graph, it is easy to see that this area rarely has snow at any time of the year. Only a very small fraction of the stations see any snow during the year, and the mean of SNOW is approximately zero.</p>

<p>&nbsp;<img alt="SNOW_SNWD.png" src="report_figures/SNOW_SNWD.png" style="height:300px; width:800px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.
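
As a rough sketch of this computation, the cumulative percentage can be derived from the eigenvalues of the covariance matrix. The code below uses NumPy on made-up data; the helper name `variance_explained` and the toy matrix are assumptions, not the report's actual pipeline.

```python
import numpy as np


def variance_explained(eigenvalues):
    """Cumulative fraction of variance explained by the top-k eigen-vectors."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    return np.cumsum(vals) / vals.sum()


# Toy covariance matrix built from made-up samples (not the report's data).
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 1.0, 0.5])
cov = np.cov(samples, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)  # symmetric matrix, real eigenvalues

cumulative = variance_explained(eigenvalues)
for k, frac in enumerate(cumulative, start=1):
    print(f"top {k} eigen-vectors explain {100 * frac:.1f}% of the variance")
```

Plotting `cumulative` against `k` gives a curve of the same kind as the figures in this section.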

### Percentage of variance explained.

#### Temperature
We see that the top 5 eigen-vectors explain 27% of the variance for TMIN, 43% for TOBS and 22% for TMAX. Of the three, TOBS is therefore the best explained by the top 5 eigen-vectors. This is especially true for the first eigen-vector which, by itself, explains 35% of the variance. Since less than half of the variance is captured in every case, PCA may not be a good choice for the temperature measurements: a few eigen-vectors cannot preserve the variance of the sampled data, or even its distribution.

![VarExplained_TMIN_TOBS_TMAX.png](report_figures/VarExplained_TMIN_TOBS_TMAX.png)


#### Snow and Rainfall
The top 5 eigen-vectors explain 75% of the variance for SNOW, 94% for SNWD, and 9% for PRCP. This result implies that the top 5 eigen-vectors capture most of the variation in the snow depth signal.
It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
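
To illustrate what a decaying integral means here, the sketch below accumulates daily snowfall while shrinking the existing depth by a fixed factor each day. The constant `decay` rate is a simplifying assumption made for illustration; real melting and compaction are not constant.

```python
def decaying_integral(snowfall, decay=0.5):
    """Accumulate snowfall while letting the existing depth decay each day."""
    depth = 0.0
    depths = []
    for fall in snowfall:
        depth = decay * depth + fall  # old snow shrinks, new snow is added
        depths.append(depth)
    return depths


# Two isolated snowfall events produce a smoother depth curve.
print(decaying_integral([10.0, 0.0, 0.0, 5.0]))  # [10.0, 5.0, 2.5, 6.25]
```

A single snowfall spike is spread over several days of depth readings, which is why the depth series changes more gradually than the snowfall series.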

![VarExplained_SNOW_SNWD_PRCP.png](report_figures/VarExplained_SNOW_SNWD_PRCP.png)


We will dig deeper into the PCA analysis for TOBS and SNWD.

## Analysis of The Average Temperature for Each Day (TOBS)

### Mean and Eigen-vectors
We analyze the eigen-decomposition for the average temperature for each day, since the first 3 eigen-vectors explain 40% of the variance. First, we graph the mean and the top 3 eigen-vectors. We observe that the average temperature is at its highest from July until late September. It is also obvious that the data fluctuate over time, which means that they contain a lot of random noise. We will see this in more detail in the examples of reconstructions that follow.

Note also that the superposition of the three eigen-vectors has a hump-shaped (camel-like) but smooth peak from July to the beginning of October. Eigen-vectors 2 and 3 help lower the temperature level during the period from January to March and the period from December to January.

![TOBS_mean_eigs.png](report_figures/TOBS_mean_eigs.png)

Next we interpret the eigen-functions. The interpretation of the shape of eig1 is that it represents the overall amount by which the temperature lies below the mean, without changing the distribution or data properties over time.

**eig2** oscillates between positive and negative values. In other words, it corresponds to changing the distribution of the temperature over the winter months. Eigen-vector 2 also helps keep the hump-shaped but smooth peak of the average temperature curve.

**eig3** oscillates in the opposite direction to eigen-vector 2 from November to April. In other words, it corresponds to changing the distribution of the temperature over the winter months.

As we expect, although the data appear to contain random noise throughout the year, PCA is useful for revealing the actual signal in this case. Please see the following examples.
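
A reconstruction of this kind adds to the mean a weighted sum of the top eigen-vectors, where each weight (coeff1, coeff2, ...) is the projection of the centred target on the corresponding eigen-vector. The sketch below shows the idea on made-up data; the function name `reconstruct` and the toy basis are assumptions, not the report's code.

```python
import numpy as np


def reconstruct(target, mean, eigvecs, k):
    """Approximate `target` by the mean plus its projection on k eigen-vectors.

    `eigvecs` has shape (n_days, n_vectors) with orthonormal columns.
    """
    basis = eigvecs[:, :k]
    coeffs = basis.T @ (target - mean)  # coeff1, coeff2, ...
    return mean + basis @ coeffs, coeffs


# Toy example with made-up data (not the report's TOBS vectors).
days = np.arange(365)
mean = 200.0 + 80.0 * np.sin(2 * np.pi * (days - 110) / 365)
raw = np.stack([np.ones(365), np.cos(2 * np.pi * days / 365)], axis=1)
eigvecs, _ = np.linalg.qr(raw)  # orthonormalise the toy basis
target = mean - 15.0            # a year that is uniformly cooler than the mean

approx, coeffs = reconstruct(target, mean, eigvecs, k=1)
print(np.abs(approx - target).max())  # reconstruction error for this toy case
```

In this toy case the first vector is constant, so a single coefficient is enough to shift the mean down to the target, which mirrors the role described for coeff1.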


### Examples of reconstructions

#### Coeff1

The average temperature varies throughout the year, but the reconstruction stays within the center of the curve. Although the variance explained is merely 40%, the reconstructions perform very well without any overfitting problem. Note that the target lies below the mean: coeff1 pulls the mean down a little in order to fit the target.

Coeff1: most positive
![TOBS_grid_Pos_coeff1.png](report_figures/TOBS_grid_Pos_coeff1.png)
Coeff1: most negative
![TOBS_grid_neg_coeff1.png](report_figures/TOBS_grid_neg_coeff1.png)


#### Coeff2

The same thing happens for coefficient 2, but note that the target stays above the mean. Coeff2 tries to maintain a hump-shaped but smooth peak.

Coeff2: most positive
![TOBS_grid_Pos_coeff2.png](report_figures/TOBS_grid_Pos_coeff2.png)
Coeff2: most negative
![TOBS_grid_neg_coeff2.png](report_figures/TOBS_grid_neg_coeff2.png)


#### Coeff3

The same thing happens for coefficient 3, but note that the target oscillates above and below the mean in different years.

Coeff3: most positive
![TOBS_grid_Pos_coeff3.png](report_figures/TOBS_grid_Pos_coeff3.png)
Coeff3: most negative
![TOBS_grid_neg_coeff3.png](report_figures/TOBS_grid_neg_coeff3.png)



### Distributions of residuals

In the following plots, the CDFs suggest that the noise is close to a normal distribution. This supports the view that noise is the reason the percentage of variance explained is not very high, even though the reconstruction fits the target curve well in different years.
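
One plausible way to produce residual CDFs of this kind is sketched below, assuming that Res_k denotes the fraction of the centred signal's energy that remains after subtracting the projection on the first k eigen-vectors. That definition, the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def residual_ratio(target, mean, eigvecs, k):
    """Fraction of the centred target's energy left after using k eigen-vectors."""
    centred = target - mean
    basis = eigvecs[:, :k]  # orthonormal columns assumed
    residual = centred - basis @ (basis.T @ centred)
    return np.sum(residual ** 2) / np.sum(centred ** 2)


def empirical_cdf(values):
    """Return sorted values and the fraction of samples at or below each one."""
    xs = np.sort(np.asarray(values, dtype=float))
    ps = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps


# Toy data: 100 made-up station-years, each a scaled first vector plus noise.
rng = np.random.default_rng(1)
eigvecs, _ = np.linalg.qr(rng.normal(size=(365, 3)))
mean = np.zeros(365)
signal = rng.normal(size=(100, 1)) * 20.0 * eigvecs[:, 0]
targets = signal + rng.normal(size=(100, 365))

res_1 = [residual_ratio(t, mean, eigvecs, 1) for t in targets]
xs, ps = empirical_cdf(res_1)
print(f"median res_1 = {np.median(res_1):.2f}")
```

Plotting `ps` against `xs` gives the CDF of the residual; a curve that rises only near 1 indicates that most of the energy is left unexplained.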

#### Original CDF of TOBS
![TOBS_CDF.png](report_figures/TOBS_CDF.png)


#### Res_1
![TOBS_grid_CDF_res1.png](report_figures/TOBS_grid_CDF_res1.png)


#### Res_2
![TOBS_grid_CDF_res2.png](report_figures/TOBS_grid_CDF_res2.png)


#### Res_3
![TOBS_grid_CDF_res3.png](report_figures/TOBS_grid_CDF_res3.png)



### Significance

As illustrated, the variance explained by the top 3 eigen-vectors is not very high, so the significance level is also not very high when fitting the original data. The average temperature on each day oscillates from year to year, while it makes sense that the underlying climate is roughly constant from year to year. Accordingly, if we use the average across the years as the target, PCA can perform well.

![TOBS_significance.png](report_figures/TOBS_significance.png)

## Analysis of Snow Depth (SNWD)

### Mean and Eigen-vectors
We analyze the eigen-decomposition for snow depth, since the first 2 eigen-vectors explain 90% of the variance. First, we graph the mean and the top 2 eigen-vectors. We observe that the snow season runs from November to the end of March. The middle of December marks the peak of the snow depth.

![SNWD_mean_eigs.png](report_figures/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is zero from Jan to Apr while the mean is not.  The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution or data properties over time.

**eig2** oscillates between positive and negative values. In other words, it corresponds to changing the distribution of the snow depth over the winter months without changing the total much.

As expected, since snow is uncommon in Florida, the small amounts of snow recorded can have a very low signal-to-noise ratio. Even so, although one might expect PCA not to be useful here, it remains a suitable and powerful analysis tool, and the percentage of variance explained is high in this case.


### Examples of reconstructions

#### Coeff1

Snow depth varies from year to year, but most of it is concentrated in March.

Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](report_figures/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](report_figures/SNWD_grid_neg_coeff1.png)


#### Coeff2

Snow depth varies from year to year, but most of it is concentrated in March.

Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](report_figures/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](report_figures/SNWD_grid_neg_coeff2.png)

## Geography and Correlation of TOBS


### Geographical distribution of TOBS

The map below shows the geographical distribution of the PCA results for TOBS.

![TOBS_PCA_map.png](report_figures/TOBS_PCA_map.png)


### Correlations Matrix of TOBS

Note that the upper left corner has higher correlation while the rest does not.
![TOBS_corr_matrix.png](report_figures/TOBS_corr_matrix.png)


### Is TOBS variance spatial or temporal?

We can expect the variance to be mainly due to the year, since we have already seen that the target curve oscillates from year to year. The statistical results below confirm this claim.

These results show that after removing the mean by station, the RMS drops by only a small amount. However, after removing the mean by year, the RMS decreases significantly. Thus, we can say that the temporal component has more influence on the RMS of TOBS.
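
The comparison can be sketched as follows, assuming a station-by-year table of values (for example, one PCA coefficient per station and year). The table layout, the function names, and the toy data are assumptions made for illustration, not the report's actual computation.

```python
import numpy as np


def rms(matrix):
    """Root-mean-square of all entries, ignoring missing values (NaN)."""
    return np.sqrt(np.nanmean(matrix ** 2))


def rms_after_removing_means(values):
    """`values` is a 2-D array with rows = stations and columns = years."""
    by_station = values - np.nanmean(values, axis=1, keepdims=True)
    by_year = values - np.nanmean(values, axis=0, keepdims=True)
    return rms(values), rms(by_station), rms(by_year)


# Toy data in which the year effect dominates (made-up numbers).
rng = np.random.default_rng(2)
year_effect = rng.normal(scale=10.0, size=(1, 30))
station_effect = rng.normal(scale=1.0, size=(8, 1))
values = year_effect + station_effect + rng.normal(scale=0.5, size=(8, 30))

total, no_station_mean, no_year_mean = rms_after_removing_means(values)
print(f"total RMS                       = {total:.2f}")
print(f"RMS after removing station mean = {no_station_mean:.2f}")
print(f"RMS after removing year mean    = {no_year_mean:.2f}")
```

When the year-to-year effect dominates, as in this toy table, removing the mean by year reduces the RMS far more than removing the mean by station.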

<p><img alt="TOBS_RMS_summary.png" src="report_figures/TOBS_RMS_summary.png" style="height:100px; width:500px" /></p>

<p><img alt="TOBS_RMS_defactored.png" src="report_figures/TOBS_RMS_defactored.png" style="height:400px; width:500px" /></p>
