# Florida Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the state of Florida.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/Miami.jpg" style="height:300px; width:500px"/></p>

<p>&nbsp;</p>

<p>We see that the min and max daily temperatures given by US Climate Data, 78 Fahrenheit (25.56 Celsius) and 90 Fahrenheit (32.22 Celsius), agree with the ones we got from our data, 25 Celsius and 32.5 Celsius respectively.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="r_figures/Tmin,Tmax.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to translate millimeters/day to inches/month. According to our analysis the average rainfall is 7 mm/day, which translates to about 8.26 inches per month. According to US Climate Data the average rainfall is closer to 9.5 inches per month. However, there is clear agreement that average precipitation is close to constant throughout the year.</p>
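
The unit conversions used in this comparison can be sketched as follows. This is a minimal illustration, not the code used for the report; the function names and the 30-day month are assumptions.

```python
MM_PER_INCH = 25.4


def fahrenheit_to_celsius(deg_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=30.0):
    """Convert a daily rainfall rate in mm to a monthly total in inches.

    days_per_month=30 is an assumption; the report does not state the value used.
    """
    return mm_per_day * days_per_month / MM_PER_INCH


print(round(fahrenheit_to_celsius(78), 2))           # 25.56
print(round(fahrenheit_to_celsius(90), 2))           # 32.22
print(round(mm_per_day_to_inches_per_month(7), 2))   # 8.27
```

With a 30-day month the 7 mm/day average comes out at roughly 8.27 inches per month, in line with the figure quoted above.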

<p>&nbsp;<img alt="PRCP.png" src="r_figures/Prcp.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](r_figures/1.png)
We see that the top 5 eigenvectors explain about 37% of the variance for TMIN, 54% for TOBS and 28% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 42% of the variance.
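
As a rough sketch of how such a curve can be computed, the snippet below derives the cumulative fraction of variance explained from the eigenvalues of the covariance matrix. It runs on synthetic data; the function name, the array layout (one row per station-year, one column per day) and the absence of missing values are assumptions, not a description of the actual pipeline.

```python
import numpy as np


def cumulative_variance_explained(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    data: 2-D array, one row per station-year, one column per day of the year.
    This sketch assumes complete data (no NaN entries).
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative values
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic stand-in: one dominant seasonal pattern plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.sin(2 * np.pi * days / 365)
toy = rng.normal(size=(200, 1)) * 5.0 * seasonal + rng.normal(size=(200, 365))

explained = cumulative_variance_explained(toy)
print(f"top 1: {explained[0]:.0%}, top 5: {explained[4]:.0%}")
```

Reading off `explained[4]` gives the "top 5 eigenvectors" figure for whichever measurement is passed in.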

![VarExplained2.png](r_figures/2.png)

The top 5 eigenvectors explain only 6.7% of the variance for PRCP, which is extremely low. On the other hand, the top 5 eigenvectors explain 100% of the variance for SNOW and the top 2 eigenvectors explain 100% of the variance for SNWD. Since it very rarely snows in Florida, there are too few samples (only 2 stations in my case) to analyze SNOW and SNWD. For PRCP, day-based PCA is not useful for further analysis because the top 5 eigenvectors explain so little of the total variance. One alternative is to apply PCA based on stations rather than days, which we explore for PRCP later in this report.


## Analysis of SNWD and SNOW

We choose to analyze the eigen-decomposition for SNOW and SNWD because, for SNWD, the first 2 eigenvectors explain 100% of the variance and, for SNOW, the first 4 eigenvectors explain 100% of the variance.

First, we graph the mean and the top 3 eigen-vectors for SNWD.

We observe that there are only 2 stations and that we cannot infer much from this meagre amount of data. 
<!--![SNWD_mean_eigs.png](r_figures/4.png)-->
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/mean4.png" style="height:300px; width:500px"/></p>
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/4.png" style="height:300px; width:500px"/></p>

Secondly, we graph the mean and the top 3 eigen-vectors for SNOW.

We observe in the mean graph that the mean is zero almost throughout the year. Traces of snow were observed in February, and a very small amount (a very small orange line) was observed at the start of October.
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/mean5.png" style="height:300px; width:500px"/></p>
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/5.png" style="height:300px; width:500px"/></p>

From the above plots of SNWD and SNOW, it is evident that PCA on SNWD and SNOW is not informative, so I will not analyze them further and turn to PRCP instead.

## Residual analysis for PRCP

Since the PCA of SNWD and SNOW turned out to give little information, in this section I perform the residual analysis for PRCP.

### Plot of Mean and Eigen-vectors 
The figure below shows the mean and the top 3 eigenvectors for precipitation.

We observe that the main precipitation period runs from June to October, with the highest values at the start of October and slightly reduced precipitation in August.

<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/6.png" style="height:300px; width:500px"/></p>

As we have seen before, the top eigenvectors from the PCA do not explain much of the variance for precipitation. An alternative way of analyzing it is to apply PCA based on stations. For that, we combine the measurements for each station into a single long array with an entry for each day of each year.


### Plot of the Cumulative residual of res_3
In the graph below, we can see that only 10% of the data has a residual variance of less than 90%, so this decomposition does not give much information.
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/8.png" style="height:300px; width:500px"/></p>
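
For readers who want to reproduce a curve like this, the following sketch computes, for each sample, the fraction of variance left after subtracting the mean and the projection on the top 3 eigenvectors. It uses synthetic data; the function name and array layout are assumptions rather than the code behind the figure.

```python
import numpy as np


def residual_fraction(data, k=3):
    """Per-row fraction of variance (about the mean) remaining after
    removing the projection on the top-k eigenvectors of the covariance."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    _, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]         # top-k eigenvectors as columns
    coeffs = centered @ top
    residual = centered - coeffs @ top.T
    return (residual ** 2).sum(axis=1) / (centered ** 2).sum(axis=1)


# Synthetic stand-in for the precipitation matrix.
rng = np.random.default_rng(1)
toy = rng.normal(size=(100, 20))

res_3 = residual_fraction(toy, k=3)
share_below_90 = float(np.mean(res_3 < 0.9))   # one point on the cumulative curve
print(f"{share_below_90:.0%} of rows have a residual below 90%")
```

Sorting `res_3` and plotting it against the row rank gives the cumulative distribution shown in the figure.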

### Plot of the CDF of daily rainfall
The figure below shows the cumulative distribution function of daily rainfall. It is evident that it rains on about one third (~30%) of the days, which is consistent with Florida being one of the wettest states in the US.
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/9.png" style="height:300px; width:500px"/></p>

It is likely to be hard to find correlations between the amounts of rain on the same day at different stations, because amounts of rain vary a lot even between close locations. It is more reasonable to compare whether or not it rained on the same day at different stations. As we see from the graph above, in our region it rains on about one third of the days.

### Measuring statistical significance

We want to find a statistical test for rejecting the null hypothesis that says that the rainfall in the two locations is independent. Using the inner product is too noisy, because you multiply the rainfall on the same day in two locations and that product can be very large - leading to a large variance and poor ability to discriminate. An alternative is to ignore the amount of rain, and just ask whether it rained in both locations. We can then compute the probability associated with the number of overlaps under the null hypothesis.
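
One way to compute such a probability is a hypergeometric tail, sketched below. This is an assumed formulation, not necessarily the exact test used here: if the two stations share `n_days` days, with `rain_a` and `rain_b` rainy days respectively, then under independence the number of overlapping rainy days follows a hypergeometric distribution. All names are hypothetical.

```python
from math import comb, log


def overlap_tail_probability(n_days, rain_a, rain_b, overlap):
    """P(at least `overlap` shared rainy days) when the rainy days of the
    two stations are placed independently among n_days."""
    total = comb(n_days, rain_b)
    upper = min(rain_a, rain_b)
    tail = sum(
        comb(rain_a, j) * comb(n_days - rain_a, rain_b - j)
        for j in range(overlap, upper + 1)
    )
    return tail / total


# Both stations rain on 30 of 100 days; independence predicts about 9 overlaps.
p_chance = overlap_tail_probability(100, 30, 30, 9)
p_strong = overlap_tail_probability(100, 30, 30, 20)
print(-log(p_chance), -log(p_strong))   # larger value = stronger evidence of dependence
```

Working with the negative log keeps very small probabilities on a readable scale, which is why the results are reported that way.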

The figure below represents the negative log probability for each pair of stations.
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/10.png" style="height:300px; width:500px"/></p>

We observe that there are many pairs of stations whose probability under the null hypothesis is below 0.05, so we can reject the null hypothesis for those pairs. However, there is also a fair number of pairs for which we cannot reject the null hypothesis.

### Calculating the normalized log probability for each pair of stations
The matrix below shows correlation between each pair of stations. We see one big cluster of correlated stations in the plot, implying that the first 50 stations are highly correlated. In this big cluster, there are 4 darker clusters (stations between 0-10, 10-25, 25-35, 35-55), showing a greater correlation.  
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/11.png" style="height:800px; width:800px"/></p>

### Finding structure in the dependency matrix
To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). As we shall see, the top 10 eigenvectors explain about 77% of the square magnitude of the matrix.
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/12.png" style="height:300px; width:500px"/></p>
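
The "fraction of the square magnitude explained" can be computed from the singular values, as in this sketch. The matrix here is synthetic and the function name is an assumption; the real input would be the normalized log-probability matrix.

```python
import numpy as np


def energy_explained(matrix, k):
    """Fraction of the squared Frobenius norm captured by the top-k singular vectors."""
    s = np.linalg.svd(matrix, compute_uv=False)   # singular values, descending
    return float((s[:k] ** 2).sum() / (s ** 2).sum())


# Synthetic symmetric stand-in for the station-by-station dependency matrix.
rng = np.random.default_rng(2)
noise = rng.normal(scale=0.1, size=(30, 30))
toy = np.ones((30, 30)) + (noise + noise.T) / 2

top_10 = energy_explained(toy, 10)
print(f"top 10 singular vectors: {top_10:.0%} of the squared magnitude")
```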

<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/14.png" style="height:800px; width:800px"/></p>

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the upper left corner of the second matrix (the upper left one). The stations at positions 0-22 are clearly strongly correlated with each other, even though some stations, in positions 0-5 or so, are more related to each other than to the rest of this block. This type of organization is called block diagonal and it typically reveals important structure such as grouping or clustering. A natural next step is to extract the sets of stations that form blocks and plot them on the map to see their spatial relationship.
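
The reordering step can be illustrated on a small synthetic matrix. In this sketch two hidden groups of stations are interleaved, and sorting by the leading eigenvector brings each group together; the names and the toy data are assumptions made for illustration.

```python
import numpy as np


def reorder_by_vector(matrix, vector):
    """Permute rows and columns of a square matrix by the sort order of `vector`."""
    order = np.argsort(vector)
    return matrix[np.ix_(order, order)], order


# Hidden group of each station (five in group 0, three in group 1), interleaved.
labels = np.array([0, 1, 0, 0, 1, 0, 1, 0])
same_group = labels[:, None] == labels[None, :]
strength = np.where(labels == 0, 1.0, 0.6)
dependency = same_group * strength[:, None]   # symmetric, block structure hidden

_, eigvecs = np.linalg.eigh(dependency)
leading = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

reordered, order = reorder_by_vector(dependency, leading)
print(labels[order])                          # members of each group are now adjacent
print(reordered)
```

After the permutation the matrix is block diagonal, and the index ranges of each block identify the stations to plot together on a map.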


### Data on Maps
The figure below shows the stations with very similar coefficients (in both magnitude and sign). This means they have similar weather patterns over different days. One possible explanation is that these stations all lie very close to the Atlantic Ocean on the map, so they have similar weather patterns even though they are geographically separated.
<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="r_figures/similar_map.png" style="height:600px; width:600px"/></p>

## Analysis for TOBS
We choose to analyze the eigen-decomposition for TOBS because the first 4 eigen-vectors explain 50% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the observed temperature is highest in July and August. It then gradually declines as winter approaches and is lowest during the winter.
![SNWD_mean_eigs.png](r_figures/15.png)

### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](r_figures/c1_max.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](r_figures/c1_min.png)
Large positive values of coeff1 correspond to more than average TOBS. Low values correspond to less than average TOBS.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](r_figures/c2_max.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](r_figures/c2_min.png)

Positive values of coeff2 vary more, over a range of about 270 (0-270), while the negative values vary less, over a range of about 150 (150-300). This could signify that the temperature of regions near the coast varies less than that of regions in the interior of Florida.
#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](r_figures/c3_max.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](r_figures/c3_min.png)


The positive values of coeff3 are closer to their mean than the negative values of coeff3.

### The variation in TOBS is mostly due to year-to-year variation

We now estimate the relative importance of location-to-location variation relative to year-by-year variation. These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively.
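
A minimal sketch of this measurement is shown below, on synthetic data. It reports the fractional drop in RMS after removing group means, which appears consistent with the percentages reported in this section; the function names and the toy values are assumptions.

```python
import numpy as np


def rms(x):
    """Root mean square of an array."""
    return float(np.sqrt(np.mean(np.square(x))))


def fraction_explained(values, groups):
    """Fractional drop in RMS after subtracting each group's mean from its members."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    residual = values.copy()
    for g in np.unique(groups):
        mask = groups == g
        residual[mask] -= values[mask].mean()
    return 1.0 - rms(residual) / rms(values)


# Toy coefficient table: 3 stations x 3 years, dominated by a per-year effect.
stations = np.repeat([0, 1, 2], 3)
years = np.tile([2000, 2001, 2002], 3)
year_effect = {2000: -100.0, 2001: 0.0, 2002: 100.0}
station_effect = {0: -5.0, 1: 0.0, 2: 5.0}
coeff = np.array([year_effect[y] + station_effect[s] for s, y in zip(stations, years)])

by_station = fraction_explained(coeff, stations)
by_year = fraction_explained(coeff, years)
print(f"by station: {by_station:.1%}, by year: {by_year:.1%}")
```

A large value for `by_year` combined with a small value for `by_station` indicates that the coefficient changes mainly from year to year rather than from place to place.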

Here are the results:

| Coefficient | Total RMS | RMS removing mean-by-station | Percentage explained (by station) | RMS removing mean-by-year | Percentage explained (by year) |
|---|---|---|---|---|---|
| coeff_1 | 207.284761951 | 196.32720046 | 5.3 | 120.853088688 | 42 |
| coeff_2 | 200.76099453 | 192.223606713 | 4 | 108.74828864 | 46 |
| coeff_3 | 190.175305112 | 183.527239833 | 3.6 | 106.115119692 | 44.2 |

We see that, for all three coefficients, the variation by year explains far more than the variation by station (42-46% versus roughly 4-5%). The effect is strongest for coeff_2 and somewhat weaker for coeff_1 and coeff_3. A possible reason is that coeff_2 represents the arrival of summer, which varies from year to year.

## Conclusion

I analysed three major measurements in this report: snow depth, TOBS and precipitation. The variation of snow depth and TOBS is better explained by performing PCA over days, because the variation of these measurements is temporal. However, since there was not much data for SNWD and SNOW in the Florida region, PCA could not reveal much information for them, whereas a thorough analysis was carried out for TOBS. The variation of precipitation, in contrast, is better explained when we perform PCA over stations.