# Weather Analysis of Southeastern Part of the US
This report presents a historical analysis of weather patterns in an area near the boundary of Florida, Alabama and Georgia.

The file index of this report is 'BSBSSSSS'. The data comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/) site.

In this report, the main data analytic method we use is based on principal component analysis (PCA).

The original dataset contains different kinds of measurements. In this report we focus on the following six measurements:

* **TMIN, TMAX:** the daily minimum and maximum temperature (in Centigrade).
* **TOBS:** the temperature at the time of observation on each day (in Centigrade).
* **PRCP:** daily precipitation (in mm).
* **SNOW:** daily snowfall (in mm).
* **SNWD:** the depth of accumulated snow (in mm).

## Check and evaluation of original data

Before going into a detailed analysis of the data, the first step is to compare it with another source to check the reliability and accuracy of the dataset. As a reference we choose the data for the city of Tallahassee, which lies in the area covered by our dataset, from the [US climate data website](http://www.usclimatedata.com/):

![US climate data](./US_Climate_Data.png 'US_Climate_data')

For our dataset, we can plot the mean temperature and precipitation with +- standard deviation:

The minimum temperature (TMIN) and maximum temperature (TMAX):
![TMIN_TMAX](./TMIN,TMAX.png "mean+-std TMIN and TMAX")

and the precipitation (PRCP):
![PRCP](./PRCP.png "mean+-std PRCP")

Note the relationship between the units:

$$ 1\ \mathrm{mm/day} \approx 1.183\ \mathrm{inch/month}, \qquad T_{F} = 1.8 \, T_{C} + 32 $$
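
The unit relations above can be written as a short sketch. The function names and the 30-day month are assumptions made for illustration; they are not taken from the original notebook. With a 30-day month the precipitation factor comes out at about 1.18, close to the 1.183 quoted above.

```python
MM_PER_INCH = 25.4  # exact definition of the inch


def mm_per_day_to_inch_per_month(mm_per_day, days_per_month=30.0):
    """Convert a daily precipitation rate in mm to a monthly total in inches.

    days_per_month=30 is an assumed month length, not a value from the report.
    """
    return mm_per_day * days_per_month / MM_PER_INCH


def celsius_to_fahrenheit(temp_c):
    """Convert an absolute temperature from Celsius to Fahrenheit."""
    return 1.8 * temp_c + 32.0


def fahrenheit_to_celsius(temp_f):
    """Inverse conversion, useful when reading the reference chart in F."""
    return (temp_f - 32.0) / 1.8


if __name__ == "__main__":
    print(mm_per_day_to_inch_per_month(1.0))
    print(celsius_to_fahrenheit(20.0))
```

Note that the offset of 32 only applies to absolute temperatures. A temperature *difference* converts with the factor 1.8 alone, so a gap of 4 C corresponds to 7.2 F.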

Then we can find that:

1. The mean PRCP of our dataset is in good agreement with that of the US Climate Data, in both **the value and the shape through the year**;
2. The TMIN and TMAX of our data are both approximately **4 C lower than the US Climate Data**. However, **this is reasonable** because: (i) we only choose one place from the US Climate Data as a reference; and (ii) our dataset is averaged over its whole history while the US Climate Data covers only recent years, and global warming over the past few decades is well known.

Overall, the two datasets are consistent with each other, so we can proceed with further analysis.

## PCA analysis

### Percentage of variance of PCA
Before applying PCA, the first step is to check whether the method is suitable for the data. This can be done by **plotting the percentage of variance explained as a function of the number of leading eigen-vectors**. Here we plot this for each of the six measurements mentioned before:

![TEMP_Var](./Var_Temperature.png "Percentage_Variance_Temperature")

For the temperature data, we can see from these figures that the first five eigen-vectors explain 23%, 50% and 22% of the total variance for TMIN, TOBS and TMAX, respectively. This means that TOBS is best represented by the first 5 eigen-vectors; for TOBS the first eigen-vector alone already explains 41% of the variance.

![PRCP_SNOW_Var](./Var_PRCP_SNOW.png "Percentage_Variance_PRCP_SNOW")

For the precipitation and snow data, the first five eigen-vectors explain 90% and 91% of the total variance of SNOW and SNWD, which means PCA can be applied well to those data. However, only 6% of the variance of PRCP is explained by its first 5 eigen-vectors. Thus, we will focus on the snow data in the PCA analysis. Also, it is intuitive that SNOW will be noisier than SNWD because of how the two are defined; we check this in the analysis that follows.
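
The percentage-of-variance curves discussed in this section can be computed along the following lines. This is a minimal sketch, assuming a data matrix with one row per station-year and one column per day and with no missing values; the real data contains gaps that the course notebook handles separately. The function name and the synthetic data are illustrative only.

```python
import numpy as np


def cumulative_explained_variance(X):
    """Return the cumulative fraction of variance explained by the
    leading eigen-vectors of the covariance matrix of X.

    X: 2-D array, rows are samples (e.g. station-years), columns are features
       (e.g. days of the year). Assumes no NaN entries.
    """
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)             # feature-by-feature covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    return np.cumsum(eigvals) / np.sum(eigvals)


if __name__ == "__main__":
    # Synthetic stand-in for a measurement matrix (not the NOAA data).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
    print(cumulative_explained_variance(X))
```

Plotting the returned array against the number of eigen-vectors gives the kind of curve shown in the figures above. A curve that rises quickly (as for SNOW and SNWD) indicates that a few eigen-vectors suffice.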

##  Analysis of daily snow (SNOW)

From the variance plots in the previous section we have seen that the first **3** eigen-vectors explain about **70%** of the total variance, so we use these 3 eigen-vectors for the PCA analysis. First, we plot the mean and the top 3 eigen-vectors of SNOW:

![SNOW_mean_and_Eigen](./SNOW_Mean_and_Eigen.png 'SNOW_mean_and_Eigen')

From the first figure, we can see that there are **only several peaks** in the mean SNOW curve, which means that the stations in our dataset recorded only a few snow days over the year. These days are almost all in February and December, with small values ($\sim10^{-1}$ mm). The timing is consistent with intuition (it rarely snows in an area like Florida) and the small mean values are consistent with the small number of snow days. One counter-intuitive feature is the record of **a snow day in August**; this is probably corrupted data or perhaps an abnormal weather event.

The second figure shows the first 3 eigen-vectors of SNOW. Each of these eigen-vectors essentially represents one of the 3 peaks in the mean SNOW data, which explains why the first 3 eigen-vectors account for $\sim70\%$ of the total variance. To be specific:

**eig1**: represents the largest snow day, which is in February (i.e. the later one in February).

**eig2**: represents the 2nd largest snow day, which is also in February (i.e. the earlier one in February).

**eig3**: represents the 3rd largest snow day, which is in late December.

Another advantage of this PCA analysis is that the first 3 eigen-vectors correspond only to the reasonable snow data. The abnormal record in August, whose reliability we are not confident in, is naturally dropped in the following PCA analysis.

### Examples of reconstructions
To analyze the data further with PCA, we define $coeff[i]$ as the coefficient of the *i*th eigen-vector in the PCA decomposition. We can plot these coeff values and manually change them to see their influence. In the following part we analyze the influence of the three coefficients in turn.
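
The role of the coefficients can be illustrated with a minimal sketch. It assumes the eigen-vectors are orthonormal and stored as the columns of a matrix; the function names and the toy numbers are hypothetical and are not taken from the course notebook.

```python
import numpy as np


def pca_coefficients(x, mean, eigvecs):
    """Project one sample onto the eigen-vectors.

    x, mean: 1-D arrays of length d; eigvecs: (d, k) array whose columns are
    assumed to be orthonormal eigen-vectors. Returns the k coefficients.
    """
    return eigvecs.T @ (x - mean)


def reconstruct(mean, eigvecs, coeffs):
    """Rebuild an approximation of the sample from the mean and coefficients."""
    return mean + eigvecs @ coeffs


if __name__ == "__main__":
    # Toy example: 4 "days", 2 eigen-vectors (first two standard basis vectors).
    mean = np.zeros(4)
    eigvecs = np.eye(4)[:, :2]
    x = np.array([1.0, 2.0, 3.0, 4.0])

    coeffs = pca_coefficients(x, mean, eigvecs)
    approx = reconstruct(mean, eigvecs, coeffs)
    residual = x - approx
    print(coeffs, approx, np.linalg.norm(residual))
```

Changing one entry of `coeffs` by hand and calling `reconstruct` again shows how that single eigen-vector shapes the reconstructed curve, which is the kind of experiment described in this section.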

Three things to note before we go into the details of the reconstruction analysis:

1. Since the absolute value of the mean SNOW is small compared to the coefficients and targets, we amplify the mean SNOW by a constant factor to bring them to a similar scale. This is reasonable because the mean only serves as a constant reference in the following analysis;

2. To keep the analysis focused and clear, we pre-filter the data so that the target below corresponds to the largest peak in the original SNOW data. This is reasonable because the previous eigen-vector analysis shows that the top 3 eigen-vectors almost "independently" represent the different peaks (snow days) of the original SNOW data;

3. Due to the features of the dataset we are working on, in the following figures we plot *coeff1*, *coeff2* and *coeff3* **separately**, **NOT as the summation used by the professor in the sample report**.

#### Original coefficient values and the corresponding results
![Origin_SNOW_coeff](./Origin_SNOW_Coeff.png 'Origin_SNOW_coeff')

From the figure and the data above we can observe that the target peak is perfectly represented by the 1st eigen-vector, with a negligible residual. This is just as we expected.
#### Coeff1

Coeff1: most positive
![Most_Positive_Coeff1](./Most_Positive_C1.png 'Most_Positive_Coeff1')

Coeff1: most negative
![Most_Negative_Coeff1](./Most_Negative_C1.png 'Most_Negative_Coeff1')

Clearly, *Coeff1* corresponds to the largest peak in the original data (the one left after manual filtering), i.e. the largest snow record in the dataset, in late February. A large value of *Coeff1* corresponds to a large amount of snow on that day in late February, and *vice versa*.
#### Coeff2
Based on the results and analysis above, for simplicity we show **only one** specific example each of the most positive and most negative value of *Coeff2*.

Coeff2: most positive
![Most_Positive_Coeff2](./Most_Positive_2.png 'Most_Positive_Coeff2')

Coeff2: most negative
![Most_Negative_Coeff2](./Most_Negative_2.png 'Most_Negative_Coeff2')

From the figures above, we can see that *Coeff2* influences the peak corresponding to the snow earlier in February, i.e. the snow represented by the 2nd eigen-vector of the PCA analysis. This is consistent with the assumption and results obtained before.
#### Coeff3
Due to the independence of the top eigen-vectors, which was verified in the *Coeff2* part above, we can expect the behavior and influence of *Coeff3* to be very similar to those of *Coeff2*, given that we have filtered the SNOW dataset so that only the largest peak (corresponding to the 1st eigen-vector) remains. For simplicity, we do not include the figures related to *Coeff3*.

#### Summary
The analysis of the coefficients, especially the artificially tuned most positive and most negative cases, further confirms the PCA-based analysis in this report. The top 3 eigen-vectors of the SNOW dataset correspond to the 3 largest snow events in our original dataset, and the results in this section further show that these 3 eigen-vectors give "independent" representations of those three snow records. **We also reran the notebook several times with different filtering**; the results are consistent with the analysis above, so for simplicity we do not include them in this report.

### Spatial variation vs. temporal variation

In the last section we have seen how the variation of the coefficients influences the PCA results. For example, changing the value of *Coeff1* changes our expectation of the specific snow event that corresponds to the 1st eigen-vector. Next we estimate the relative importance of location-to-location variation compared to year-by-year variation, i.e. the spatial variation vs. the temporal variation.

To analyze this, we will calculate the fraction by which the variance is diminished when we subtract from each station/year entry the average-per-year or the average-per-station, respectively. The fraction can be calculated by:

$$frac = 1 - RMS_{remove} / RMS $$

where $frac$ is the fraction explained, $RMS_{remove}$ is the root mean square (RMS) that remains after removing the mean by station or by year, and $RMS$ is the total RMS of each coefficient. The results are as follows:

***Coeff1***   
total RMS = 30.425  
RMS removing mean-by-station= 16.697, fraction explained=45.12%  
RMS removing mean-by-year = 16.413, fraction explained=46.05%  

***Coeff2***   
total RMS = 27.774  
RMS removing mean-by-station= 22.508, fraction explained=18.96%  
RMS removing mean-by-year = 13.501, fraction explained=51.39%  

***Coeff3***   
total RMS = 23.486  
RMS removing mean-by-station= 17.504, fraction explained=25.47%  
RMS removing mean-by-year = 14.227, fraction explained=39.42%  
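
The fractions listed above follow directly from the formula and the reported RMS values, as the sketch below checks. It also includes one possible way to compute the RMS values from a station-by-year table of coefficients. The table layout (rows are stations, columns are years, NaN for missing entries), the removal of the overall mean before computing the total RMS, and all function names are assumptions made for illustration.

```python
import numpy as np


def fraction_from_rms(total_rms, rms_after_removal):
    """frac = 1 - RMS_remove / RMS, as defined in the text."""
    return 1.0 - rms_after_removal / total_rms


def rms(values):
    """Root mean square, ignoring NaN entries."""
    return float(np.sqrt(np.nanmean(np.asarray(values, dtype=float) ** 2)))


def fraction_explained(table, by):
    """Fraction explained by removing the mean 'by' station or year.

    table: 2-D array, rows = stations, columns = years (assumed layout).
    by: "station" removes each row's mean, "year" removes each column's mean.
    """
    table = np.asarray(table, dtype=float)
    axis = {"station": 1, "year": 0}[by]
    total = rms(table - np.nanmean(table))
    remaining = rms(table - np.nanmean(table, axis=axis, keepdims=True))
    return fraction_from_rms(total, remaining)


# RMS values copied from the results above: (total, by-station, by-year).
reported = {
    "Coeff1": (30.425, 16.697, 16.413),
    "Coeff2": (27.774, 22.508, 13.501),
    "Coeff3": (23.486, 17.504, 14.227),
}

if __name__ == "__main__":
    for name, (total, by_station, by_year) in reported.items():
        print(name,
              f"station: {100 * fraction_from_rms(total, by_station):.2f}%",
              f"year: {100 * fraction_from_rms(total, by_year):.2f}%")
```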

The main findings from the results above are:

1. For all three coefficients, the variation by year explains more than the variation by station; the gap is largest for *Coeff2* (51.39% vs 18.96%);
2. For the first coefficient (*Coeff1*), which corresponds to the largest peak in the original dataset, the difference between the variation explained by station and by year is small (45.12% vs 46.05%);
3. For the top three coefficients, the fraction of variance explained is never large (at most 51.39%). However, we should also notice that the absolute values of the total RMS are relatively small (at most 30.425), which is consistent with the intuitive understanding of our original SNOW dataset.

## Analysis of accumulated snow depth (SNWD)

From the variance plots above we also know that the accumulated snow depth (SNWD) data is well represented by its first several eigen-vectors. Here we choose the first **3** eigen-vectors (which explain 85% of the total variance) and carry out an analysis similar to the one for SNOW. The mean and eigen-vectors are plotted below:

![Mean_SNWD](./SNWD_Mean_and_Eigen.png "Mean_SNWD_and_Eigen")

The mean accumulated depth curve is very similar to that of the SNOW data, which means the two measurements are consistent with each other. The top 3 eigen-vectors also correspond to the three largest peaks in the SNWD data, just as in the SNOW data discussed above. Thus, **we can expect the reconstructions and annual variation of the SNWD data to behave similarly to those of the SNOW data**. For simplicity, we do not go into the same level of detail for SNWD as we did for SNOW. In the following parts we also use the **SNOW** data for analysis (such as the residual analysis and the map visualization).

One more thing to note is that the abnormal peak in August in the SNOW data **doesn't** exist in the SNWD data. This is possibly because the SNWD data is not corrupted. If the peak in SNOW is real, this is still reasonable, since snow cannot accumulate in August even if there is an abnormal snowfall. This can also be regarded as evidence supporting the intuition that the **SNWD** data is less noisy than the **SNOW** data.

## Correlation between different stations for PRCP

To study the correlation between different stations, i.e. whether one event (SNOW, PRCP, etc.) is recorded at different stations, as discussed in the lecture, one problem is that we **cannot** use the SNOW or SNWD measurements. Those measurements are rare in our dataset, so the same event is unlikely to be observed at different stations, which makes a study of the correlation between stations meaningless. Thus, in this section we use the PRCP measurement to study the correlation.

We want to compute the probability of a particular overlap under the null hypothesis. The null hypothesis here is that the probability of rain on any given day is independent for the two stations in question. To do so we define the variables as follows (**this part is the same as in the lecture and the notebook**):

Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days that it rained at station 1.
* $n_2$ : the number of days that it rained at station 2.
* $l$ : the number of days that it rained at both stations.

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{\binom{m}{l,\,n_1-l,\,n_2-l,\,m-n_1-n_2+l}}{\binom{m}{n_1}\binom{m}{n_2}}
$$

where
$$
\binom{m}{l,\,n_1-l,\,n_2-l,\,m-n_1-n_2+l} = \frac{m!}{l! \, (n_1-l)! \, (n_2-l)! \, (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$, so that
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
which slightly simplifies to
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the length of the overlap. So to get a per-day significance we consider $\frac{1}{m} \log P$.

\* ***The part above is from the IPython notebook provided by the professor.***
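
The simplified expression for $\log P$ can be evaluated with the standard library's `math.lgamma`, as in the sketch below. The function names are hypothetical and the code is not the professor's notebook implementation; it only transcribes the formula above term by term.

```python
from math import lgamma, exp


def G(n):
    """G(n) = log Gamma(n + 1) = log(n!)."""
    return lgamma(n + 1)


def log_prob_overlap(m, n1, n2, l):
    """log P of exactly l overlapping rainy days under the null hypothesis.

    m: days with measurements at both stations
    n1, n2: rainy days at station 1 and station 2
    l: days on which it rained at both stations
    """
    if min(l, n1 - l, n2 - l, m - n1 - n2 + l) < 0:
        raise ValueError("inconsistent counts: some cell would be negative")
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))


def per_day_significance(m, n1, n2, l):
    """Normalized log probability, (1/m) * log P."""
    return log_prob_overlap(m, n1, n2, l) / m


if __name__ == "__main__":
    # Small made-up example: 10 common days, 4 and 3 rainy days.
    m, n1, n2 = 10, 4, 3
    probs = [exp(log_prob_overlap(m, n1, n2, l)) for l in range(0, min(n1, n2) + 1)]
    print(probs, sum(probs))
```

As a sanity check, the probabilities over all feasible values of $l$ sum to one, since the formula is the hypergeometric distribution for the size of the overlap.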

We randomly choose two stations from the data. To make sure the analysis is meaningful, we choose two stations with comparable, mid-sized counts of measurements:

Station|Count|Common Count|Location
-|-|-|-
USC00093516|28338|19264|(31.18N, 83.75W)
USC00096087|29309|19264|(31.63N, 85.05W)

After calculating the hypothesis testing for the 2 stations, we plot a histogram of the resulting p-values:

![Hist_p_value](./Hist_p_value.png "Hist_p_value")

From the histogram we can see that most of the p-values are in the insignificance bins, which means that it is unlikely to see such results under the null hypothesis. In our situation, this means it is **extremely unlikely that the two stations have independent PRCP measurements.**

We can also look at the details for specific stations:
![Station_p_value](./Station_p_Value.png "Station_p_value")

The X and Y axes give the index of a station. The grayness of each dot represents the significance of the p-value for that pair of stations. From this figure we can see that there are only a few significant p-values; the black dot in the middle of the figure may correspond to the spike on the right of the histogram. The two figures of p-values are in good agreement.

Modified station figures can also be produced, **sorted by significance of p-value and shown in color**. The figure is shown below, and similar conclusions can be drawn as before:
![Station_p_value](./Station_p_value_2.png "Station_p_value")

To double check the hypothesis testing results, we look up the locations of the two stations studied above in the original data file. It turns out that **the two stations are actually close to each other ($\sim 100$ miles) and in similar environments.** This supports our hypothesis testing results and the rejection of the null hypothesis. We use a screenshot from *Google Maps* to show the proximity.

![Station_distance](./Station_distance.png "Station_distance")
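
The separation of the two stations can also be estimated directly from the coordinates in the station table, using the haversine great-circle formula. This is a rough sketch: the mean Earth radius of 6371 km is a standard approximation, the function name is hypothetical, and the longitude of the second station is read as 85.05W, which is an assumption about the table entry.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean Earth radius (approximation)
KM_PER_MILE = 1.609344


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees.

    Latitudes are positive north, longitudes are positive east
    (so western longitudes are negative).
    """
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Coordinates from the station table (second longitude assumed to be 85.05W).
station_a = (31.18, -83.75)   # USC00093516
station_b = (31.63, -85.05)   # USC00096087

if __name__ == "__main__":
    km = haversine_km(*station_a, *station_b)
    print(f"{km:.1f} km, {km / KM_PER_MILE:.1f} miles")
```

The result can be compared with the rough $\sim 100$ miles read off the map.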

## SNOW and PRCP visualized on a map using iPyLeaflet

With the *iPyLeaflet* package, we can further visualize our data on a real map. This visualization mainly focuses on the **count** of measurements. The counts are represented by circles: we take the square root of the count as the radius so that the area of each circle is proportional to the count. Based on these rules, we can visualize the SNOW and PRCP data in an interactive map widget. Here we give three snapshots of the results: one shows the location of our dataset's area within America, and the other two show the details of the SNOW and PRCP measurement counts respectively.
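
The square-root scaling of the circles can be sketched as follows. Only the radius calculation is shown; the function name, the `scale` parameter and the example counts are hypothetical, and the resulting radius would be passed to whichever circle marker the mapping library provides.

```python
from math import sqrt, pi


def circle_radius(count, scale=1.0):
    """Radius proportional to sqrt(count), so the circle AREA is
    proportional to the count itself.

    scale is an arbitrary display factor chosen for the map zoom level.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return scale * sqrt(count)


if __name__ == "__main__":
    # Made-up counts for two stations; the second has 4x as many measurements.
    r_small = circle_radius(100)
    r_large = circle_radius(400)
    area_ratio = (pi * r_large ** 2) / (pi * r_small ** 2)
    print(r_small, r_large, area_ratio)
```

Using the count itself as the radius would instead make the area grow with the square of the count, which visually exaggerates stations with many measurements.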

![Map_area](./Map_1.png "Map_area")
The figure above shows the whole area of the dataset studied in this report. The boundaries are ($29.73N, 31.70N, 85.52W, 83.48W$).

![Map_SNOW](./Map_SNOW.png "Map_SNOW")
The figure above shows the count of SNOW measurements in our dataset. The few, small circles are consistent with the fact that snow is rare in this area and therefore in the dataset.

![Map_PRCP](./Map_PRCP.png "Map_PRCP")
The figure above shows the count of PRCP measurements in our dataset. This result also agrees well with the earlier analysis of the dataset and with intuition about the climate in southeastern America.

## Conclusion

In this report, we analyzed the climate data of an area in southeastern America. We first cross-checked our data against another source to confirm its reliability. The main statistical method we used is PCA. To apply PCA, we chose SNOW and SNWD for detailed analysis, since the top several eigen-vectors of these two measurements explain most of the total variance. Due to the similarity between SNOW and SNWD in our data, the report mainly focused on the SNOW measurement. Using the methods we learned and implemented in the lecture, we studied the meaning and influence of the top eigen-vectors, the coefficients of these eigen-vectors and the related RMS error. We also studied the independence of two stations via hypothesis testing; here we switched to the PRCP measurement since it has plenty of valid records in the dataset (while neither SNOW nor SNWD does). With the help of other packages, such as *iPyLeaflet*, we also visualized the data in figures and on a map. These results show the power of machine learning methods (such as the PCA used here), together with proper visualization techniques, in analyzing and representing massive data.

# Appendix
## Index of figures

File name|Description
-|-
US_Climate_Data.png| The reference data from [US climate data website](http://www.usclimatedata.com/)
TMIN,TMAX.png| The mean+-std data of TMIN and TMAX
PRCP.png| The mean+-std data of PRCP
Var_Temperature.png| Variance explained by eigen-vectors of temperature data
Var_PRCP_SNOW.png| Variance explained by eigen-vectors of precipitation and snow data
SNOW_Mean_and_Eigen.png| Mean and eigen-vectors of SNOW measurement
SNWD_Mean_and_Eigen.png| Mean and eigen-vectors of SNWD measurement
Origin_SNOW_Coeff.png| Untuned coefficients of SNOW measurement
Most_Positive_C1.png| Tuned most positive *Coeff1* of SNOW measurement
Most_Negative_C1.png| Tuned most negative *Coeff1* of SNOW measurement
Most_Positive_2.png| Tuned most positive *Coeff2* of SNOW measurement
Most_Negative_2.png| Tuned most negative *Coeff2* of SNOW measurement
Hist_p_value.png| Histogram of p-value of PRCP measurement
Station_p_Value.png| Detailed station p-value of PRCP measurement
Station_p_value_2.png| Sorted detailed station p-value of PRCP measurement
Station_distance.png| Distance of two stations of PRCP measurement from [Google Map](http://maps.google.com/)
Map_1.png| Area of stations of whole dataset visualized in map
Map_SNOW.png| Counts of SNOW measurement visualized in map
Map_PRCP.png| Counts of PRCP measurement visualized in map