# Florida Weather Analysis

This report aims to present the potential patterns of weather in Florida. 

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

The dataset index is BSBSSSBS. It covers six measurements recorded at different sites in Florida, for every day of the years on record. The map below shows the observation sites (each red dot is one site):


<p><img alt="ObservationSite.jpg" src="MyFigure/observationSite.jpg" /></p>

<p>&nbsp;</p>



The six measurements are as follows:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The average temperature for each day.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.

Note that not all of the data is valid; we use NaN to represent missing values. For example, the following two figures show, for each day of the year, the number of valid TMIN and TMAX data points across all stations.

<p><img alt="Climate_Gainesville_visualization_5.png" src="MyFigure/visualization_5.png" height="800" width="1000"/></p>

We can see that most of the missing data falls in summer.

## Sanity-check: comparison with outside sources

<p>We start by comparing some general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The map of observation sites above shows that our sites span multiple cities. We choose two representative cities, Gainesville and Tampa, and show the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month, in these two cities.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/visualization_1_update.png" height="500" width="500"/></p>

<p>&nbsp;</p>

<p>&nbsp;</p>

<p><img alt="Climate_Tampa_visualization_4.png" src="MyFigure/visualization_4.png" height="500" width="500"/></p>

<p>&nbsp;</p>

<p>We can see that the two cities show similar trends in TMIN, TMAX and precipitation. Temperature and precipitation in Tampa are slightly higher than in Gainesville, because Tampa lies further south and closer to the sea.</p>

<p>&nbsp;</p>

<p>Now let's consider our own dataset. The daily minimum and maximum temperatures from our data agree with those from US Climate Data once we translate Fahrenheit to Centigrade. Note that our data stores ten times the actual temperature or precipitation value for better precision, so we need to divide each value by ten before doing the translation.</p>
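
The unit handling described above can be sketched as follows. This is a minimal illustration, assuming the stored temperatures are ten times the value in degrees Celsius; the function names and the sample values are hypothetical and not taken from the original analysis.

```python
def stored_to_celsius(raw_value):
    """Convert a stored temperature (ten times degrees Celsius) to degrees Celsius."""
    return raw_value / 10.0


def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit (the unit of the outside source) to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


# Hypothetical values: a stored TMAX of 322 is 32.2 degrees Celsius, which can
# be compared with a published 90 degrees Fahrenheit once that is converted.
print(stored_to_celsius(322))
print(round(fahrenheit_to_celsius(90.0), 1))
```

Both conversions bring the two sources onto the same Celsius scale before they are compared.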

<p>&nbsp;</p>

<p><img alt="Climate_Gainesville_visualization_5.png" src="MyFigure/visualization_7.png" height="800" width="1000"/></p>

<p>&nbsp;</p>

<p><img alt="Climate_Gainesville_visualization_5.png" src="MyFigure/visualization_8.png" height="800" width="1000"/></p>


<p>We can also see from the figures that the average temperature agrees well with the data provided by US Climate Data. To compare the precipitation we need to translate millimeters/day to inches/month. According to our analysis the average rainfall is 2.5 mm/day, which translates to about 3.0 inches per month. According to US Climate Data the average rainfall is closer to 3.94 inches per month. However, the two sources clearly agree on the trend of the precipitation: there is more rainfall during summer and less during spring and autumn.</p>
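
The precipitation conversion above is simple arithmetic. Here is a short sketch, assuming a 30-day month (the source does not state the month length it used); the function name is hypothetical.

```python
MM_PER_INCH = 25.4


def mm_per_day_to_inches_per_month(mm_per_day, days_in_month=30):
    """Convert an average daily rainfall in mm to a monthly total in inches."""
    return mm_per_day * days_in_month / MM_PER_INCH


# 2.5 mm/day over an assumed 30-day month
print(round(mm_per_day_to_inches_per_month(2.5), 2))  # 2.95
```

The result, about 2.95 inches, matches the "about 3.0 inches per month" quoted above.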

<p>Snow is a more interesting case.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Gainesville_visualization_5.png" src="MyFigure/visualization_6.png" height="800" width="1000"/></p>

<p>It should not snow in Florida, and the mean value we get reflects that. However, on some days a few points show a very large variance in snowfall. I printed all the data points with non-zero snowfall and found that, in some cases, the non-zero value may be related to the elevation of the station. For example, there is a station with elevation -999.9 and a total snowfall of 3 in a year (an elevation of -999.9 looks like a placeholder for a missing value rather than a real measurement). However, most of these data points make no sense at all, such as a station at elevation 6.1 with a snowfall of 185. From my point of view, they should be treated as outliers.</p>
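
The check described above, listing every record with non-zero snowfall together with its elevation, might look like the sketch below. The record layout and field names are hypothetical, and treating -999.9 as a missing-elevation placeholder is an assumption rather than something the original analysis states.

```python
MISSING_ELEVATION = -999.9  # assumed placeholder for an unknown elevation


def nonzero_snow_records(records):
    """Return records with positive total snowfall, tagged by elevation validity."""
    flagged = []
    for rec in records:
        if rec["snow_total"] > 0:
            has_elevation = rec["elevation"] != MISSING_ELEVATION
            flagged.append({**rec, "elevation_known": has_elevation})
    return flagged


# Hypothetical records mirroring the two cases mentioned above.
records = [
    {"station": "A", "elevation": -999.9, "snow_total": 3},
    {"station": "B", "elevation": 6.1, "snow_total": 185},
    {"station": "C", "elevation": 12.0, "snow_total": 0},
]
for rec in nonzero_snow_records(records):
    print(rec["station"], rec["elevation"], rec["snow_total"], rec["elevation_known"])
```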

## PCA analysis

For each of the six measurements, we compute the percentage of variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](MyFigure/pca_2.png)
We see that the top 5 eigenvectors explain 35% of the variance for TMIN, 50% for TOBS and 25% for TMAX.
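
As a rough sketch of how such a curve is obtained: given the eigenvalues of the covariance matrix, the fraction of variance explained by the top k eigenvectors is the sum of the k largest eigenvalues divided by the sum of all of them. The eigenvalues below are hypothetical and the function name is not from the original analysis.

```python
from itertools import accumulate


def cumulative_variance_explained(eigenvalues):
    """Fraction of total variance explained by the top-k eigenvectors, k = 1..n."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


# Hypothetical eigenvalues, not taken from the Florida data.
eigenvalues = [4.0, 2.0, 1.0, 1.0, 2.0]
fractions = cumulative_variance_explained(eigenvalues)
print([round(f, 2) for f in fractions])  # [0.4, 0.6, 0.8, 0.9, 1.0]
```

Reading off the fifth entry of such a list gives the "top 5 eigenvectors" percentage for a measurement.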

We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains 42% of the variance. However, we will not take TOBS as our main target of analysis, because TOBS is the average temperature for each day and so does not capture the temperature difference between day and night. We could certainly find other patterns through the average, but I analyze TMIN instead.

![VarExplained2.png](MyFigure/pca_1.png)

The top 5 eigenvectors explain 8% of the variance for PRCP and 100% for SNOW and SNWD. I will not consider PRCP since this percentage is relatively low. On the other hand, although the top 5 eigenvectors capture most of the variation in the snow signals, most of the snow values are zero and do not need further analysis. Thus, we will dig deeper into the PCA analysis of the daily minimum temperature.


## Analysis of TMIN

We choose to analyze the eigen-decomposition of the minimum temperature because the first 5 eigenvectors explain 30% of the variance.

First, we graph the mean and the top 3 eigenvectors.

We observe that the minimum temperature remains high from June to the end of October, with the peak in August.
![SNWD_mean_eigs.png](MyFigure/pca_3.png)

We then plot the mean function and the top 3 eigenfunctions and compare them with the original target. We can see that the first eigenfunction (eig1) has a shape very similar to the mean function. The main difference is that the range of fluctuation of eig1 is smaller than that of the mean. Note that all the values of eig1 are negative. The interpretation of this shape is that eig1 represents the overall temperature being below/above the mean, without changing the distribution over time.

**eig2 and eig3** are similar to each other. Both oscillate between positive and negative values. They correspond to changing the shape of the temperature distribution over the year, but they do not change the total (much).

They can be interpreted as follows:
* **eig1:** overall temperature below/above the mean, without changing the distribution over time.
* **eig2:** adjusts TMIN slightly higher from January to February, slightly lower in late February and early November, and slightly higher from December to January.
* **eig3:** adjusts TMIN slightly higher from November to March, with tiny adjustments fluctuating around 0 from March to June.

### Examples of reconstructions

<p>Our reconstruction graph looks like this:</p>

<p>&nbsp;</p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/pca_5.png" height="500" width="500"/></p>

<p>&nbsp;</p>

<p>The target is the actual value that we need to reconstruct. Here, mean is the value reconstructed using only the mean, c1 is the value reconstructed using the mean and eig1, and so on. We can see from the graph that as we increase the number of eigenvectors involved, the reconstructed value approaches the target.</p>
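
The reconstruction described above can be sketched in a few lines. This is a simplified illustration, assuming orthonormal eigenvectors and no missing values; the function names and the toy numbers are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def reconstruct(target, mean, eigenvectors, k):
    """Approximate `target` by the mean plus its projection on the first k eigenvectors.

    Assumes the eigenvectors are orthonormal and no values are missing.
    Returns (coefficients, reconstruction).
    """
    centered = [t - m for t, m in zip(target, mean)]
    coeffs = [dot(centered, vec) for vec in eigenvectors[:k]]
    recon = list(mean)
    for c, vec in zip(coeffs, eigenvectors[:k]):
        recon = [r + c * v for r, v in zip(recon, vec)]
    return coeffs, recon


# Toy 3-day example with hypothetical values.
mean = [10.0, 20.0, 30.0]
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
target = [12.0, 19.0, 33.0]

for k in range(3):
    coeffs, recon = reconstruct(target, mean, eigenvectors, k)
    print(k, coeffs, recon)
```

With k = 0 the reconstruction is just the mean; each additional eigenvector moves it closer to the target.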

<p>&nbsp;</p>

<p>Our analysis of the reconstructions proceeds in two steps. We first select all the data points with res_2 less than 0.4, which are likely to be better explained by our eigenvectors. We then choose the sites with the 4 largest and 4 smallest coefficients to see whether the interpretation of the coefficients meets our expectation.</p>
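
The two steps can be sketched as below. The definition of the residual used here is an assumption: res_2 is taken to be the share of the centered signal's squared norm that remains after removing the first two eigenvectors. The function names and sample rows are hypothetical.

```python
def project(target, mean, eigenvectors):
    """Return the centered signal and its coefficients on orthonormal eigenvectors."""
    centered = [t - m for t, m in zip(target, mean)]
    coeffs = [sum(x * v for x, v in zip(centered, vec)) for vec in eigenvectors]
    return centered, coeffs


def residual_ratio(centered, coeffs, k):
    """Share of the centered signal's energy left after using the first k eigenvectors."""
    total = sum(x * x for x in centered)
    if total == 0:
        return 0.0
    return (total - sum(c * c for c in coeffs[:k])) / total


def select_extremes(rows, mean, eigenvectors, coeff_index=0, max_res2=0.4, n=4):
    """Keep rows with res_2 below the threshold, then return the n largest and
    the n smallest (coefficient, name) pairs for the chosen coefficient."""
    kept = []
    for name, target in rows:
        centered, coeffs = project(target, mean, eigenvectors)
        if residual_ratio(centered, coeffs, 2) < max_res2:
            kept.append((coeffs[coeff_index], name))
    kept.sort()
    return kept[-n:][::-1], kept[:n]


# Hypothetical rows: (name, signal). Rows "c" and "d" are poorly explained.
mean = [0.0, 0.0, 0.0]
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rows = [
    ("a", [3.0, 0.0, 0.0]),
    ("b", [-2.0, 1.0, 0.0]),
    ("c", [0.0, 0.0, 5.0]),
    ("d", [1.0, 0.0, 1.0]),
]
largest, smallest = select_extremes(rows, mean, eigenvectors, n=1)
print(largest, smallest)
```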

<p>&nbsp;</p>

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](MyFigure/pca_4_c1_pos.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](MyFigure/pca_4_c1_neg.png)
We can see a trend: as coeff1 decreases, the reconstructed values move from below the mean to above it. This is consistent with eig1 being negative everywhere, and it indicates that the smaller the value of coeff1, the larger the average TMIN, and vice versa.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](MyFigure/pca_4_c2_pos.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](MyFigure/pca_4_c2_neg.png)

We can see that data points with a large coefficient 2 have relatively low values from January to February and from December to January, which indicates a slightly higher adjustment towards the mean. However, it seems that we cannot draw any useful conclusion from the data points with the most negative coefficient 2.

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](MyFigure/pca_4_c3_pos.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](MyFigure/pca_4_c3_neg.png)

Large positive values of coeff3 do "drag" the reconstructed values away from the mean, most visibly from November to March, which meets our expectation.

The following is part of the reconstruction of the original TMIN values at different locations on the map, using the top 4 eigenvectors. The four triangles at each location represent the results of the reconstruction using the corresponding eigenvectors. Each color represents an eigenvector. The size of a triangle stands for the absolute value of the coefficient, and whether it is filled indicates whether the coefficient is positive. Part of the map looks like this:

![SNWD_grid_neg_coeff3.png](MyFigure/pca_6.png)

## The variation in the timing of minimum temperature is mostly due to year-to-year variation
We now estimate the importance of location-to-location variation relative to year-by-year variation in the daily minimum temperature. Here are the results:

**coeff_1**
* total RMS = 380.868316327
* RMS removing mean-by-station = 166.317862207
* RMS removing mean-by-year = 347.463829132

**coeff_2**
* total RMS = 166.557468658
* RMS removing mean-by-station = 163.677287709
* RMS removing mean-by-year = 58.1089227476

**coeff_3**
* total RMS = 148.819727105
* RMS removing mean-by-station = 144.367740524
* RMS removing mean-by-year = 62.9208360015

We can see from the results that, for coeff 2 and 3, the variation by year explains more than the variation by station. For coeff 1, things are different: removing the mean by station reduces the RMS far more than removing the mean by year. This meets our expectations, because we interpret coeff 1 as the overall temperature above/below the average, which may vary between locations but not so much between years. Coeff 2 and 3, however, indicate adjustments for particular months, which may vary more from year to year than from location to location.
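
A sketch of how such numbers can be computed: take the RMS of a coefficient over all (station, year) pairs, then subtract the per-station or per-year mean and take the RMS again. Whether the original "total RMS" also removes an overall mean is not stated, so this version does not; the row layout and sample values are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def rms(values):
    """Root mean square of a sequence of numbers."""
    return sqrt(sum(v * v for v in values) / len(values))


def rms_after_removing_group_mean(rows, key_index):
    """RMS of the coefficient after subtracting the mean of each group.

    `rows` are (station, year, coefficient) tuples; `key_index` is 0 to group
    by station or 1 to group by year.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return rms([row[2] - means[row[key_index]] for row in rows])


# Hypothetical coefficients: stations differ a lot, years only slightly.
rows = [
    ("S1", 2000, 100.0), ("S1", 2001, 110.0),
    ("S2", 2000, -100.0), ("S2", 2001, -90.0),
]
print("total RMS                   =", rms([r[2] for r in rows]))
print("RMS removing mean-by-station=", rms_after_removing_group_mean(rows, 0))
print("RMS removing mean-by-year   =", rms_after_removing_group_mean(rows, 1))
```

In this toy case, removing the station mean drops the RMS from about 100 to 5, while removing the year mean barely changes it, which is the same pattern as coeff 1 above.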

## The cumulative distribution of coefficient and residual

We then compute the cumulative distribution of the coefficients and residuals over the whole dataset to evaluate our data. We extract all of the values of res_3, which is the residual variance after the mean and the first two eigenvectors have been subtracted out. The distribution looks like this:

<p>&nbsp;</p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/cumulative_1.png" height="500" width="500"/></p>

We can see that only a small fraction of the data has a residual below 0.5, so our data does not seem to be well explained.

<p>&nbsp;</p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/cumulative_2.png" height="500" width="500"/></p>

From this, we can see that nearly half of the data points have a negative coefficient 1, which is consistent with our assumption that coefficient 1 represents the average TMIN being above/below the mean.
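
For reference, an empirical cumulative distribution like the ones plotted above can be computed as in this sketch. The coefficient values are hypothetical and the function names are not from the original analysis.

```python
def empirical_cdf(values):
    """Return sorted values and, for each, the fraction of points at or below it."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


def fraction_below(values, threshold):
    """Fraction of values strictly below `threshold`."""
    return sum(1 for v in values if v < threshold) / len(values)


# Hypothetical coefficients, not taken from the Florida data.
coeff_1 = [-300.0, -120.0, -10.0, 40.0, 250.0, 400.0]
xs, cdf = empirical_cdf(coeff_1)
print(fraction_below(coeff_1, 0.0))  # 0.5
```

Plotting `cdf` against `xs` gives the cumulative curve, and `fraction_below` answers questions such as "what share of the points has a negative coefficient 1".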




## Analyze the residuals

We now want to know whether "the TMIN of one location is above the average" is unrelated to "the TMIN of another location is above the average". We want a statistical test for rejecting the null hypothesis that these two events, TMIN being above the average at each of the two locations, are independent.
For simplicity, we use the average temperature in Florida, 23.2 degrees Celsius, as our average TMIN.
We have 61 stations in total, giving a 61*61 matrix of p values. First, here is a histogram of all the p values:
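
The text does not say which test was used. One possible choice, shown here purely as an illustration, is a one-sided test based on the hypergeometric distribution (equivalent to a one-sided Fisher exact test): if the two stations were independent, how likely would it be to see at least this many days on which both are above the threshold? The function names and temperature series are hypothetical, and the sketch assumes both series cover the same days with no missing values.

```python
from math import comb


def above_threshold(tmin_values, threshold=23.2):
    """Indicator series: True on days where TMIN exceeds the threshold."""
    return [t > threshold for t in tmin_values]


def overlap_p_value(above_a, above_b):
    """One-sided p-value for the overlap of two above-average indicator series.

    Under the null hypothesis of independence, the number of days on which
    both stations are above average follows a hypergeometric distribution.
    """
    n = len(above_a)
    count_a = sum(above_a)
    count_b = sum(above_b)
    both = sum(1 for a, b in zip(above_a, above_b) if a and b)
    upper = min(count_a, count_b)
    tail = sum(
        comb(count_a, k) * comb(n - count_a, count_b - k)
        for k in range(both, upper + 1)
    )
    return tail / comb(n, count_b)


# Hypothetical TMIN series (degrees Celsius) for two stations on the same days.
tmin_a = [25.0, 24.1, 26.3, 23.5, 20.0, 19.5, 22.0, 21.1]
tmin_b = [24.0, 25.5, 23.9, 24.8, 21.0, 18.0, 22.5, 20.3]
p = overlap_p_value(above_threshold(tmin_a), above_threshold(tmin_b))
print(round(p, 4))  # 0.0143
```

Applying such a function to every pair of stations would fill a matrix of p values like the 61*61 one described above.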

<p>&nbsp;</p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/residual_1.png" height="500" width="500"/></p>


<p>We can see that most of the pairs have a relatively low p value. A low p value means that we can reject the null hypothesis of independence, so most pairs of stations appear to be dependent. We then calculated the normalized probability for each pair of stations to see the correlation between stations, using the P_norm dependency matrix.</p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/residual_2.png" height="300" width="300"/></p>

<p>Stations indexed 5-10 and 55-60 clearly have a stronger correlation than the other stations. We then used PCA to find the top eigenvectors and reordered the P_norm matrix based on the indices of the top eigenvectors. We can see more correlations after reordering:</p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/residual_3.png" height="300" width="300"/></p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/residual_4.png" height="300" width="300"/></p>
 
<p>This technique is called correlation ordering, first introduced by Michael Friendly and Ernest Kwan.</p>

<p>I plotted the stations that have the strong correlations in the original matrix. The green dots indicate the stations indexed 5-10 and 55-61. We can see that most of them are along the coast. My assumption is that TMIN in these cities may be most affected by the sea, so whether they are above the average is more strongly related.</p>

<p><img alt="Climate_Gainesville_visualization_1.png" src="MyFigure/residual_5.png" height="300" width="300"/></p>

Thanks for reading. Have a great day!