# Florida Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of Florida.

The data used here comes from [NOAA](https://www.ncdc.noaa.gov/).

I focused on six measurements:
* **TMIN, TMAX**: the daily minimum and maximum temperature.
* **TOBS**: the temperature at the time of observation, used here as a proxy for the average temperature of each day.
* **PRCP**: Daily precipitation (in mm)
* **SNOW**: Daily snowfall (in mm)
* **SNWD**: The depth of accumulated snow.

The distribution of the stations is shown in the map below. With all the station data in hand, I chose the station next to Miami International Airport to start my analysis.
<p><img alt="map.png" src="r_figures/my_figures/map1.png" style="height:380px; width:660px" /></p>

## Distribution of missing observations
For this analysis, I first inspect the data to see whether any measurement has a prominent number of missing observations.
The distribution of missing observations is not uniform throughout the year, as visualized below.
<p><img alt="TMIN_TMAX_missing.png" src="r_figures/my_figures/TMIN_TMAX_missing.png" style="height:300px; width:800px" /></p>
**TMIN** has some missing values between July and October, while **TMAX** has some missing values in January.
<p><img alt="TOBS_PRCP_missing.png" src="r_figures/my_figures/TOBS_PRCP_missing.png" style="height:300px; width:800px" /></p>
**TOBS** has some missing values in January, while **PRCP** has some missing values in September.
<p><img alt="SNOW_SNWD_missing.png" src="r_figures/my_figures/SNOW_SNWD_missing.png" style="height:300px; width:800px" /></p>
**SNOW** has some missing values in January and February, while **SNWD** has some missing values from December to early March.
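
The per-day counts behind plots like the ones above can be sketched as follows. This is a minimal illustration, not the code used for this report: it assumes each measurement is stored as an array with one row per station-year and one column per day, with `NaN` marking a missing day. The function name `missing_per_day` and the tiny synthetic array are hypothetical.

```python
import numpy as np

def missing_per_day(measurements):
    """Count missing (NaN) observations for each day (column).

    measurements: array-like of shape (n_station_years, n_days),
    where NaN marks a missing observation (an assumption of this sketch).
    """
    data = np.asarray(measurements, dtype=float)
    return np.isnan(data).sum(axis=0)

# Synthetic example: 3 station-years, 5 "days"
demo = np.array([
    [1.0, np.nan, 3.0, 4.0, np.nan],
    [np.nan, np.nan, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
])
counts = missing_per_day(demo)
print(counts.tolist())  # [1, 2, 0, 0, 1]
```

Plotting `counts` against the day of the year gives the kind of non-uniform profile described in this section.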

## Sanity-check: comparison with outside sources

<p>I start by comparing some of the general statistics with graphs that I obtained from a site called <a href="http://www.usclimatedata.com/climate/miami/florida/united-states/usfl0316">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p><img alt="Climate_Miami.jpg" src="r_figures/my_figures/Climate_Miami.png"></p>

<p>From my data, the average low and high temperatures are 18.8 and 28.6 Celsius respectively, while US-Climate-Data reports 19 and 29 Celsius respectively. The daily minimum and maximum temperatures in my data therefore agree with the outside source.</p>

<p><img alt="TMIN,TMAX.png" src="r_figures/my_figures/TMIN,TMAX.png" style="height:300px; width:800px" /></p>

<p>According to my analysis the average rainfall is 3.36 mm per day, while according to US-Climate-Data it is closer to 4 mm per day, so the two averages are quite close. Furthermore, I calculated the average precipitation in each month and found clear agreement that precipitation is slightly higher from May to October and lower from November to April.</p>

<p><img alt="PRCP.png" src="r_figures/my_figures/PRCP.png" style="height:360px; width:480px" /></p>
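
A monthly average like the one used in this comparison can be sketched as below. This is only an illustration under stated assumptions: the input is a hypothetical 365-value vector of daily precipitation for a non-leap year, and `monthly_means` and `MONTH_LENGTHS` are names introduced here, not taken from the analysis code. The synthetic data simply mimics a wetter May to October period.

```python
import numpy as np

# Days per month in a non-leap year (assumption: 365-day vectors)
MONTH_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def monthly_means(daily):
    """Average a 365-value daily series into 12 monthly means, ignoring NaNs."""
    daily = np.asarray(daily, dtype=float)
    if daily.shape != (365,):
        raise ValueError("expected a vector with 365 daily values")
    edges = np.cumsum([0] + MONTH_LENGTHS)
    return np.array([np.nanmean(daily[a:b]) for a, b in zip(edges[:-1], edges[1:])])

# Synthetic series: 2 mm/day in the dry months, 5 mm/day from May to October
daily_prcp = np.full(365, 2.0)
daily_prcp[120:304] = 5.0  # indices 120..303 cover May 1 through October 31
means = monthly_means(daily_prcp)
print(means.tolist())
```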

## PCA analysis

For each of the six measurements, I compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](r_figures/my_figures/VarExplained1.png)
I see that the top 5 eigenvectors explain 36% of variance for **TMIN**, 51% for **TOBS** and 27% for **TMAX**.

I conclude that of the three, **TOBS** is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains up to 40% of the variance.
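
The percentage of variance explained can be computed from the eigenvalues of the covariance matrix. The sketch below is a hedged illustration rather than the pipeline used for this report: it assumes a complete data matrix (no missing values) with one row per station-year, and the function name `cumulative_variance_explained` and the synthetic data are hypothetical.

```python
import numpy as np

def cumulative_variance_explained(samples):
    """Return the cumulative fraction of variance explained by the top-k eigenvectors.

    samples: array of shape (n_samples, n_days) without missing values
    (an assumption of this sketch).
    """
    data = np.asarray(samples, dtype=float)
    cov = np.cov(data, rowvar=False)           # covariance between days
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative values
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic data: a strong shared offset plus independent noise,
# so the first eigenvector should dominate.
rng = np.random.default_rng(0)
offset = 5.0 * rng.normal(size=(200, 1))
demo = offset + rng.normal(size=(200, 6))
explained = cumulative_variance_explained(demo)
print(np.round(explained, 3))
```

Reading `explained[4]` gives the fraction explained by the top 5 eigenvectors, which is the quantity quoted for each measurement in this section.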

![VarExplained2.png](r_figures/my_figures/VarExplained2.png)

Not surprisingly, the top 5 eigenvectors explain 100% of the variance for both **SNOW** and **SNWD**, which means that they capture all of the variation in the snow signals. This is expected, since it barely snows in Miami. On the other hand, the top 5 eigenvectors explain only 7% of the variance for **PRCP**. This value is low, which means the variance is spread across many dimensions. Based on these figures and results, I will dig deeper into the PCA analysis for the average temperature.

## Analysis of the average temperature

I choose to analyze the eigen-decomposition for the average temperature because the first 4 eigenvectors explain 50% of the variance.

First, we graph the mean and the top 4 eigenvectors.

I observe that the summer season lasts from early June to the end of September, and the average temperature stays high throughout these months. Furthermore, the temperature starts rising at the end of February and reaches its highest value in early June; it starts dropping at the end of September and reaches its lowest value in January. From the figure, it seems that seasons other than summer are not prominent.  
![TOBS_mean_eigs.png](r_figures/my_figures/TOBS_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (**eig1**) has a relatively flat shape and is greater than zero throughout. The interpretation of this shape is that **eig1** represents an offset of the mean that does not change the distribution over time. The second eigen-function (**eig2**) loosely follows the overall trend of the temperature change. In other words, it shows that the summer season has the peak and that high temperature lasts for about 4 months, while in the other months the temperature changes smoothly.

**eig3** and **eig4** are similar in the following way: both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the temperature over the months outside the summer season, but they don't change the total (much).

They can be interpreted as follows:
* **eig3:** high temperature in late January and early February, low temperature in November and December.
* **eig4:** high temperature in late February and March, low temperature in January and early February.

## Reconstructions of Data using top 3 eigenvectors

### Coeff1
Coeff1: most positive
![TOBS_grid_pos_coeff1.png](r_figures/my_figures/TOBS_grid_pos_coeff1.png)
Large positive values of coeff1 correspond to higher-than-average temperature.

Coeff1: most negative
![TOBS_grid_neg_coeff1.png](r_figures/my_figures/TOBS_grid_neg_coeff1.png)
Large negative values of coeff1 correspond to lower-than-average temperature.

### Coeff2
Coeff2: most positive
![TOBS_grid_pos_coeff2.png](r_figures/my_figures/TOBS_grid_pos_coeff2.png)
Large positive values of coeff2 correspond to a temperature rise that starts in February, and also to higher-than-average temperature.

Coeff2: most negative
![TOBS_grid_neg_coeff2.png](r_figures/my_figures/TOBS_grid_neg_coeff2.png)

Negative values of coeff2 correspond to a temperature drop in February, and also to lower-than-average temperature.

### Coeff3
Coeff3: most positive
![TOBS_grid_pos_coeff3.png](r_figures/my_figures/TOBS_grid_pos_coeff3.png)
Large positive values of coeff3 correspond to low temperature with a single peak at the end of January.

Coeff3: most negative
![TOBS_grid_neg_coeff3.png](r_figures/my_figures/TOBS_grid_neg_coeff3.png)
Negative values of coeff3 correspond to the absence of an obvious peak throughout the whole year.
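
The reconstructions in this section follow the usual PCA recipe: project the mean-subtracted signal onto the top eigenvectors to get the coefficients, then add the weighted eigenvectors back to the mean. The sketch below illustrates this under the assumption that the eigenvectors are stored as orthonormal columns; the function name `reconstruct` and the toy 4-dimensional data are hypothetical and stand in for the 365-day temperature vectors.

```python
import numpy as np

def reconstruct(x, mean, eigvecs, k):
    """Approximate x using the mean and the top-k eigenvectors.

    eigvecs: array of shape (n_days, n_components) with orthonormal columns
    (an assumption of this sketch), ordered by decreasing eigenvalue.
    Returns the approximation and the k coefficients.
    """
    top = eigvecs[:, :k]
    coeffs = top.T @ (x - mean)      # coeff1..coeffk
    return mean + top @ coeffs, coeffs

# Toy example: 4 "days", with the standard basis as the eigenvectors
mean = np.array([1.0, 1.0, 1.0, 1.0])
eigvecs = np.eye(4)
x = np.array([3.0, 1.0, 2.0, 5.0])

approx, coeffs = reconstruct(x, mean, eigvecs, k=3)
print(coeffs.tolist())  # [2.0, 0.0, 1.0]
print(approx.tolist())  # [3.0, 1.0, 2.0, 1.0]
```

The station-years with the most positive or most negative `coeffs[0]`, `coeffs[1]` and `coeffs[2]` are the ones a grid of plots like those above would display.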

## How well-explained is our reconstruction?

To answer this question, recall that we use the top 3 eigenvectors to reconstruct the data. We then subtract the projections onto these eigenvectors, up to and including the third, to get the residual variance. The result is shown below:
![TOBS_res_3.png](r_figures/my_figures/TOBS_res_3.png)
This plot shows how well the top 3 eigenvectors explain the data. We see that the residual variance keeps decreasing as each projection is subtracted. The smaller the residual variance after subtracting the projections onto the top 3 eigenvectors, the more variance those eigenvectors explain, which also means that most of the variance lies along these 3 directions.
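
The residual computation can be sketched as follows. This is an illustrative version under stated assumptions, not the code that produced the plot: the eigenvectors are assumed to be orthonormal columns, the residual is expressed as a fraction of the variance around the mean, and the name `residual_fractions` and the toy data are hypothetical.

```python
import numpy as np

def residual_fractions(x, mean, eigvecs, k):
    """Fraction of variance left after removing the top 1, 2, ..., k projections.

    Assumes orthonormal eigenvector columns and that x differs from the mean.
    """
    residual = x - mean
    total = residual @ residual
    fractions = []
    for i in range(k):
        v = eigvecs[:, i]
        residual = residual - (v @ residual) * v   # remove the i-th projection
        fractions.append(float(residual @ residual) / float(total))
    return fractions

# Toy example: 4 "days", with the standard basis as the eigenvectors
mean = np.array([1.0, 1.0, 1.0, 1.0])
eigvecs = np.eye(4)
x = np.array([3.0, 1.0, 2.0, 5.0])

fractions = residual_fractions(x, mean, eigvecs, k=3)
print(fractions)  # never increases as more eigenvectors are removed
```

Because each step removes an orthogonal component, the residual fraction can only stay the same or shrink, which is the decreasing behaviour described above.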

## Is the variation in the timing of temperature due to year-by-year or station-by-station differences?

<p>In the previous section we see the variation of Coeff1, which corresponds to the mean offset of the temperature, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.</p>

<p>These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:</p>

**coeff_1**<br>
total MS = 234951.90<br>
RMS removing mean-by-station = 92102.83, fraction explained = 60.80%<br>
RMS removing mean-by-year    = 194259.87, fraction explained = 17.32%<br>

**coeff_2**<br>
total MS = 34321.99<br>
RMS removing mean-by-station = 29474.18, fraction explained = 14.12%<br>
RMS removing mean-by-year    = 20641.98, fraction explained = 39.86%<br>

**coeff_3**<br>
total MS =  27414.57<br>
RMS removing mean-by-station = 25584.63, fraction explained = 6.68%<br>
RMS removing mean-by-year    = 11382.86, fraction explained = 58.48%<br>

We see that the variation by station explains more than the variation by year for **coeff_1**, which has to do with the offset of the mean temperature. However, for **coeff_2** and **coeff_3**, the variation by year explains more than the variation by station: the station explains 6-15% of the variance, while the year explains 40-60%.
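
The "fraction explained" figures can be reproduced with a short calculation. The sketch below assumes the coefficients are arranged in a table with one row per station and one column per year, with `NaN` for missing entries, and that the mean square is taken without centering; these are assumptions of this illustration, and the name `fraction_explained` and the toy table are hypothetical.

```python
import numpy as np

def fraction_explained(table, axis):
    """Fraction of the mean square removed by subtracting a per-group mean.

    table: rows are stations, columns are years (assumed layout).
    axis=1 subtracts each station's mean; axis=0 subtracts each year's mean.
    """
    table = np.asarray(table, dtype=float)
    total_ms = np.nanmean(table ** 2)
    centered = table - np.nanmean(table, axis=axis, keepdims=True)
    residual_ms = np.nanmean(centered ** 2)
    return 1.0 - residual_ms / total_ms

# Toy table: two stations that never change from year to year
toy = np.array([[1.0, 1.0, 1.0],
                [5.0, 5.0, 5.0]])
by_station = fraction_explained(toy, axis=1)   # station means explain everything
by_year = fraction_explained(toy, axis=0)      # year means explain only part

# Arithmetic check against the coeff_1 figures reported above
coeff1_by_station = 1.0 - 92102.83 / 234951.90
print(round(by_station, 4), round(by_year, 4), round(coeff1_by_station, 4))
```

With the reported values, `coeff1_by_station` comes out at about 0.608, matching the 60.80% listed for **coeff_1**.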