# South Wyoming Weather Analysis

This report analyzes the weather of South Wyoming in recent decades.

The dataset comes from [National Oceanic and Atmospheric Administration (NOAA)](https://www.ncdc.noaa.gov/) and can be downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/). More specifically, the South Wyoming dataset we used here is a subset of the whole dataset and it can be downloaded directly from [here](https://mas-dse-open.s3.amazonaws.com/Weather/small/US_Weather_SBBSSBSS.csv.gz).

The measurements in the dataset include:
* **TMIN**: the daily minimum temperature (in *°C*)
* **TMAX**: the daily maximum temperature (in *°C*)
* **TOBS**: the temperature at the time of observation (in *°C*)
* **PRCP**: the daily precipitation (in *mm*)
* **SNOW**: the daily snowfall (in *mm*)
* **SNWD**: the depth of accumulated snowfall (in *mm*)

## Sanity-check

Before starting the analysis, we need to check the reliability of our dataset. Here we compare our data with the records for Lander, Wyoming, taken from [US Climate Data](http://www.usclimatedata.com/climate/lander/wyoming/united-states/uswy0101/2017/1). The two graphs below show the average low/high temperature for each month, as well as the total precipitation for each month.

<table>
    <tr>
        <td> <img src="graphs/Lander.png" style="width: 400px;"/> </td>
        <td> <img src="graphs/sanity.png" style="width: 400px;"/> </td>
    </tr>
    <tr>
        <td style="text-align:center"> Temperature and precipitation in Lander </td>
        <td style="text-align:center"> Temperature and precipitation in our dataset </td>
    </tr>
</table>

The two graphs show that the patterns of temperature and precipitation match well. In addition, according to US Climate Data, the annual low/high/average temperatures in Lander are 0/14.4/7.2*°C* and the annual precipitation is 321*mm*. In our dataset, those measurements are -1.9/13.1/6.3*°C* and 343*mm* respectively. These values are close enough, and the small discrepancies can be attributed to differences in location and year.

## Dataset overview
To see how the weather changes over time, we first look at the patterns of all six measurements mentioned above. Here we plot the mean of each measurement together with the mean plus/minus one standard deviation.

<img src="graphs/mean_std.png" style="width: 900px;"/>

From the figures above we can see that the three temperature measurements (**TMIN**, **TMAX**, **TOBS**) share the same pattern, as expected. The precipitation measurement (**PRCP**), however, fluctuates heavily because of noise. Of the two snow measurements (**SNOW**, **SNWD**), the second shows a smoother curve, which indicates that it is the more stable of the two. We therefore use **SNWD** rather than **SNOW** for the snowfall analysis later.
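The mean and standard-deviation bands plotted above could be computed along the following lines. This is only a sketch: the array layout (one row per station-year, one column per day, `NaN` for missing days) and the name `daily_mean_std` are assumptions for illustration, not the code used for this report.

```python
import numpy as np

def daily_mean_std(years):
    """Per-day mean and standard deviation, ignoring missing days.

    `years` is a hypothetical (n_station_years, n_days) array in which
    missing measurements are stored as NaN.
    """
    years = np.asarray(years, dtype=float)
    mean = np.nanmean(years, axis=0)
    std = np.nanstd(years, axis=0)
    return mean, std

# Toy data: 3 station-years, 4 days, one missing value.
toy = np.array([[1.0, 2.0, np.nan, 4.0],
                [3.0, 2.0, 5.0, 4.0],
                [5.0, 2.0, 7.0, 4.0]])
mean, std = daily_mean_std(toy)
upper, lower = mean + std, mean - std  # the two band edges to plot
```

Using the `nan`-aware reductions keeps a single missing day from discarding the whole column.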

## PCA analysis
In PCA, each eigenvalue measures how much of the data's variance lies along the corresponding eigenvector. By accumulating eigenvalues, we can therefore tell how much of the variance has been explained. We plot the percentage of variance explained as a function of the number of eigenvectors for all six measurements.

<img src="graphs/eigenvalues.png" style="width: 900px;"/>

From the figures above we can see that the first ten eigenvectors explain 57.5% of the variance of **TOBS**, but only 37.4% and 42.4% of the variance of **TMIN** and **TMAX** respectively. **PRCP**, as mentioned in the overview, is poorly explained: only 17.7% of its variance is covered. In contrast, 89.4% of the variance of **SNWD** is explained, far more than the 15.5% for **SNOW**. This confirms our choice of **SNWD** for the later analysis.
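As a small illustration of how such a curve is obtained, the sketch below accumulates a list of eigenvalues into the fraction of variance explained. The eigenvalues are toy numbers and the function name is hypothetical.

```python
import numpy as np

def explained_variance_curve(eigenvalues):
    """Cumulative fraction of variance explained by the top-k eigenvalues."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    return np.cumsum(ev) / ev.sum()

toy_eigenvalues = [4.0, 3.0, 2.0, 1.0]
curve = explained_variance_curve(toy_eigenvalues)  # [0.4, 0.7, 0.9, 1.0]

# Smallest number of eigenvectors that explains at least 80% of the variance.
n_needed = int(np.searchsorted(curve, 0.8) + 1)    # 3
```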

## Analysis of snowfall
As discussed above, we use **SNWD** alone to represent snowfall. Since the top three eigenvalues account for 80% of the total variance, we keep only the three corresponding eigenvectors. We first plot the mean **SNWD** together with the three eigenvectors.

<img src="graphs/mean_eig_SNWD.png" style="width: 600px;"/>

To understand what each eigenvector represents, we look at them one by one.
* **eig1** is the first eigenvector. It has a shape similar to the mean **SNWD**, so it represents the most common case (the overall snowfall) and can be considered a base upon which extra features are built.
* **eig2** is the second eigenvector. It has a slight rise in February, a large drop in April, and a large rise in November and December.
* **eig3** is the third eigenvector. It has a large drop in February, and a larger rise in April.

### Examples of SNWD reconstruction
To verify the interpretations above, we reconstruct every **SNWD** vector from the mean and the three eigenvectors. To show how the reconstruction changes, we pick for each eigenvector the four examples with the largest coefficients and the four with the smallest.
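The reconstruction step can be sketched as a projection onto the top eigenvectors. The example below uses random toy data and assumes complete vectors (no missing days) and orthonormal eigenvectors; the names are illustrative and not taken from the code behind this report.

```python
import numpy as np

def reconstruct(x, mean, eigvecs):
    """Approximate x by the mean plus its projection on the eigenvectors.

    `eigvecs` holds one orthonormal eigenvector per column.
    Returns the coefficients and the reconstructed vector.
    """
    coeffs = eigvecs.T @ (x - mean)
    approx = mean + eigvecs @ coeffs
    return coeffs, approx

# Toy data standing in for the yearly SNWD vectors.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6))
mean = data.mean(axis=0)
cov = np.cov(data - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
top3 = eigvecs[:, order[:3]]

coeffs, approx = reconstruct(data[0], mean, top3)
```

The entries of `coeffs` play the role of **coeff1** to **coeff3**, and sorting the station-years by one of them gives the "greatest" and "smallest" examples.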

* Four greatest **coeff1**:
<img src="graphs/recon1_great_SNWD.png" style="width: 900px;"/>
* Four smallest **coeff1**:
<img src="graphs/recon1_small_SNWD.png" style="width: 900px;"/>

The first four figures show more snowfall with a more regular shape, while the last four show snowfall only irregularly on a few days. We can thus infer that **eig1** represents the overall snowfall, which fits our interpretation.

* Four greatest **coeff2**:
<img src="graphs/recon2_great_SNWD.png" style="width: 900px;"/>
* Four smallest **coeff2**:
<img src="graphs/recon2_small_SNWD.png" style="width: 900px;"/>

The first four figures show huge rises of snowfall in December, and three of them show drops in April, while the last four show far less (almost no) snowfall in November and December. This fits our interpretation of **eig2**.

* Four greatest **coeff3**:
<img src="graphs/recon3_great_SNWD.png" style="width: 900px;"/>
* Four smallest **coeff3**:
<img src="graphs/recon3_small_SNWD.png" style="width: 900px;"/>

It can be seen that the first four figures have huge rises of snowfall from January to April, and the last four figures have huge drops of snowfall from February to April. This fits our interpretation of **eig3**.

### Statistical analysis of SNWD
Having interpreted the three eigenvectors and verified the interpretations on several reconstructed examples, we now look at how our dataset is distributed over these characteristics of **SNWD**. We plot the CDF of the three coefficients and analyze them one by one.

<img src="graphs/CDF_coeff_SNWD.png" style="width: 900px;"/>

From the first figure we can see that about 40% of **coeff1** are less than 0 while about 60% are greater. This means that a minority of the data has little and irregular snowfall, while the majority follows **eig1**. In addition, comparing **eig1** with the mean **SNWD** shows a ratio of around 1:100, which roughly means that data with **coeff1** > 100 has more snowfall than average. Fewer than 30% of the data have **coeff1** greater than 100.

From the second figure we can see that the distribution of **coeff2** is almost rotationally symmetric around (0,0.5). This means that roughly equal portions of the data follow **eig2** (more snowfall in December) and oppose it (less snowfall in December).

From the third figure we can see that about 60% of **coeff3** are less than 0. This means that slightly more of the data show a drop of snowfall in April than a rise.
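Readings such as "about 40% of **coeff1** are less than 0" come straight from the empirical CDF. A minimal sketch with made-up coefficient values (the function names are hypothetical):

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values and the fraction of data at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    p = np.arange(1, len(x) + 1) / len(x)
    return x, p

def fraction_below(values, threshold):
    """Fraction of the data strictly below `threshold`."""
    return float(np.mean(np.asarray(values, dtype=float) < threshold))

toy_coeffs = [-120.0, -30.0, 10.0, 80.0, 150.0]
x, p = empirical_cdf(toy_coeffs)                   # plot p against x
below_zero = fraction_below(toy_coeffs, 0.0)       # 2 of 5 values
above_100 = 1.0 - fraction_below(toy_coeffs, 100.0001)
```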

Besides the CDF of the coefficients, we can also plot the CDF of the residual variance after the mean and the eigenvectors are subtracted from the data.

<img src="graphs/CDF_res_SNWD.png" style="width: 900px;"/>

From the three figures above we can see that around 20% of the data have less than 0.4 of their variance unexplained after subtracting **eig1**. That share grows to 30% and 40% after also subtracting **eig2** and **eig3**, respectively.
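One way to compute such a residual for a single vector is sketched below. The normalization is an assumption: here the residual is measured relative to the energy of the vector after the mean is removed, which may differ from the exact definition behind the figures.

```python
import numpy as np

def residual_fractions(x, mean, eigvecs):
    """Fraction of variance left after removing eigenvectors one at a time.

    `eigvecs` holds orthonormal eigenvectors as columns. Entry k of the
    result is the residual after subtracting eigenvectors 0..k.
    """
    centered = np.asarray(x, dtype=float) - mean
    total = centered @ centered
    residual = centered.copy()
    fractions = []
    for k in range(eigvecs.shape[1]):
        v = eigvecs[:, k]
        residual = residual - (v @ residual) * v   # remove component along v
        fractions.append(float(residual @ residual / total))
    return fractions

# Toy example with axis-aligned "eigenvectors".
toy_x = np.array([3.0, 4.0, 0.0])
toy_mean = np.zeros(3)
toy_vecs = np.eye(3)[:, :2]
fractions = residual_fractions(toy_x, toy_mean, toy_vecs)  # [0.64, 0.0]
```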

### Analysis of temporal and spatial effects on SNWD
To see whether snowfall depends more on spatial or temporal factors, we form a table for each coefficient in which the $ij^{th}$ element is the coefficient value of the $j^{th}$ station in the $i^{th}$ year. We then subtract either the column means (mean-by-station) or the row means (mean-by-year) from the original table and see how much the mean square (MS) drops.

* **coeff1**: <br/>
total MS                    =  36209.6696032 <br/>
MS removing mean-by-station =  11437.3605478 variance explained =  0.684135186178 <br/>
MS removing mean-by-year    =  27130.3063879 variance explained =  0.250744160738 <br/>
* **coeff2**: <br/>
total MS                    =  4233.00979448 <br/>
MS removing mean-by-station =  3692.89079556 variance explained =  0.127596916885 <br/>
MS removing mean-by-year    =  2507.52353190 variance explained =  0.407626333592 <br/>
* **coeff3**: <br/>
total MS                    =  3975.66221196 <br/>
MS removing mean-by-station =  2981.48998858 variance explained =  0.250064560410 <br/>
MS removing mean-by-year    =  2497.12499315 variance explained =  0.371897092857 <br/>
* **coeff4**: <br/>
total MS                    =  1981.54414450 <br/>
MS removing mean-by-station =  1703.22910161 variance explained =  0.140453617274 <br/>
MS removing mean-by-year    =  1326.31835691 variance explained =  0.330664239505 <br/>

From the numbers above we can see that the spatial factor has more influence on **coeff1**, which determines the overall amount of snowfall. The temporal factor, however, has more influence on **coeff2-4**, which determine the rises and drops of snowfall in particular months.
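The quantities listed above can be reproduced in spirit with the following sketch. It assumes a year-by-station table with `NaN` for missing entries and defines "variance explained" as one minus the ratio of the remaining MS to the total MS, which is consistent with the numbers reported; the function names and the toy table are illustrative.

```python
import numpy as np

def mean_square(table):
    """Mean of the squared entries, ignoring NaN."""
    return float(np.nanmean(np.asarray(table, dtype=float) ** 2))

def explained_by_means(table):
    """Fraction of MS removed by subtracting station means or year means.

    Rows are years and columns are stations.
    """
    table = np.asarray(table, dtype=float)
    total = mean_square(table)
    minus_station = table - np.nanmean(table, axis=0, keepdims=True)
    minus_year = table - np.nanmean(table, axis=1, keepdims=True)
    return {
        "station": 1.0 - mean_square(minus_station) / total,
        "year": 1.0 - mean_square(minus_year) / total,
    }

# Toy table: 3 years x 2 stations, dominated by the station effect.
toy = np.array([[10.0, -10.0],
                [12.0, -8.0],
                [8.0, -12.0]])
explained = explained_by_means(toy)
```

In this toy table the station means remove almost all of the MS, which is the same qualitative picture as **coeff1** above.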

## Analysis of temperature
Since variance is better explained in **TOBS** than in the other two temperature measurements, we choose **TOBS** for the temperature analysis. As before, we first pick the three eigenvectors with the largest eigenvalues and plot them together with the mean.

<img src="graphs/mean_eig_TOBS.png" style="width: 600px;"/>

* **eig1** is the first eigenvector. It represents the most common case (the overall average temperature) and can be considered a base upon which extra features are built. Note that the first eigenvector is entirely below 0, which means that the larger the coefficient, the lower the temperature.
* **eig2** is the second eigenvector. It has slight rises in spring and winter, and slight drops in summer and autumn. Since this pattern is the opposite of the normal seasonal cycle, the larger the coefficient, the milder the year (smaller annual temperature difference).
* **eig3** is the third eigenvector. It has a large drop in early November.

### Examples of TOBS reconstruction
We then reconstruct every **TOBS** vector from the mean and the three eigenvectors, and show for each eigenvector the four examples with the largest coefficients and the four with the smallest.

* Four greatest **coeff1**:
<img src="graphs/recon1_great_TOBS.png" style="width: 900px;"/>
* Four smallest **coeff1**:
<img src="graphs/recon1_small_TOBS.png" style="width: 900px;"/>

It can be seen that the first four figures have lower temperature, and the last four figures have higher temperature.

* Four greatest **coeff2**:
<img src="graphs/recon2_great_TOBS.png" style="width: 900px;"/>
* Four smallest **coeff2**:
<img src="graphs/recon2_small_TOBS.png" style="width: 900px;"/>

It can be seen that the first four figures have milder temperature, and the last four figures have larger difference in annual temperature.

* Four greatest **coeff3**:
<img src="graphs/recon3_great_TOBS.png" style="width: 900px;"/>
* Four smallest **coeff3**:
<img src="graphs/recon3_small_TOBS.png" style="width: 900px;"/>

It can be seen that the first four figures have drops of temperature in early November, and the last four figures have rises of temperature in early November.

### Statistical analysis of TOBS
Here we plot the CDF of the three coefficients and analyze them one by one.

<img src="graphs/CDF_coeff_TOBS.png" style="width: 900px;"/>

From the first figure we can see that almost 70% of **coeff1** are less than 0. This is because the first eigenvector, as plotted before, is roughly the opposite of the real-world pattern: it is colder in summer. It is therefore natural for most of the data to have a negative **coeff1**. For data with a positive **coeff1**, other eigenvectors raise the summer temperature.

From the second figure we can see that around 55% of **coeff2** are greater than 0. This means that slightly more than half of the data have milder temperatures.

From the third figure we can see that the distribution of **coeff3** is almost rotationally symmetric around (0,0.5). This means that roughly equal portions of the data have drops and rises of temperature in early November.

### Coefficient visualization on the map
To see how the coefficients are distributed across stations, we draw several markers on the following map. At each station, the top-left, top-right, bottom-left, and bottom-right triangles represent **coeff1**, **coeff2**, **coeff3**, and **coeff4** respectively. The size of each triangle is determined by the coefficient. Let $c_{s,y,k}$ be the $k^{th}$ coefficient of station $s$ in year $y$; we calculate the size $l_{s,k}$ of the $k^{th}$ triangle at station $s$ as:
$$
\begin{aligned}
c_{s,k} &= \frac{1}{|year|}\sum_{y \in year}c_{s,y,k} \\
max\_c_k &= \max_{s \in station}|c_{s,k}| \\
l_{s,k} &= \frac{c_{s,k}}{5\,max\_c_k}
\end{aligned}
$$
If $l_{s,k}$ is positive, we mark it as a solid triangle; if not, we mark it as a hollow triangle.
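The three formulas translate directly into array operations. The sketch below assumes a hypothetical `(n_stations, n_years, n_coeffs)` array with `NaN` for missing station-years; the toy values are made up.

```python
import numpy as np

def triangle_sizes(coeffs):
    """Signed triangle sizes l_{s,k} from per-year coefficients.

    `coeffs` has shape (n_stations, n_years, n_coeffs).
    """
    c = np.nanmean(coeffs, axis=1)          # c_{s,k}: average over years
    max_c = np.max(np.abs(c), axis=0)       # max_c_k: largest |c_{s,k}| per k
    return c / (5 * max_c)                  # l_{s,k}

toy = np.array([[[10.0, -2.0], [30.0, -4.0]],
                [[-5.0, 1.0], [-15.0, 3.0]]])
sizes = triangle_sizes(toy)   # [[0.2, -0.2], [-0.1, 0.1333...]]
solid = sizes > 0             # solid triangle if positive, hollow otherwise
```

Dividing by `5 * max_c` bounds every size to the range [-0.2, 0.2], so the largest triangle for each coefficient has the same extent on the map.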

<img src="graphs/map_coeff_TOBS.png" style="width: 900px;"/>

It can be seen that the green areas consistently have large solid top-left triangles and small hollow top-right triangles. This suggests that vegetated areas tend to be much cooler and slightly less mild (larger annual temperature difference).

## Analysis of precipitation
As before, we first pick the three eigenvectors with the largest eigenvalues and plot them together with the mean.

<img src="graphs/mean_eig_PRCP.png" style="width: 600px;"/>

Unlike before, however, we find the eigenvectors of **PRCP** hard to interpret, because the differences are small everywhere except at one point in June.

We then plot the CDF of the residual variance of **PRCP** after the mean and the eigenvectors are subtracted from the data.

<img src="graphs/CDF_res_PRCP.png" style="width: 900px;"/>

From the third figure we can see that around 50% of the data have less than 0.95 of their variance unexplained after subtracting **eig3**. In other words, for the other half of the data, the first three eigenvectors cover less than 0.05 of the total variance. This again shows that PCA performs very poorly on **PRCP**.

### Correlation analysis of PRCP
Since the amount of rain is too hard to analyze, as the previous discussion shows, we use **whether it rains** rather than the amount as our measurement for the correlation analysis. For every station, we concatenate its per-year measurements into a single long vector that represents the station. With these vectors, we use a per-day p_value as the significance measure to compute the correlation between every pair of stations. Note that this method only works for **PRCP**: the snowfall measurements contain too many zeros (all of them in summer), and the temperature measurements have no notion of "happen-or-not".
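To make the idea of a rain/no-rain significance concrete, the sketch below binarizes two stations' daily precipitation and asks how surprising their number of shared rainy days would be if the stations were independent. It uses a hypergeometric tail probability as a stand-in; this is one possible significance measure chosen for illustration and not necessarily the per-day p_value used in this report. Missing days are represented by `None`, and all names are hypothetical.

```python
from math import comb

def rain_indicator(prcp, threshold=0.0):
    """Map daily precipitation to True/False, keeping None for missing days."""
    return [None if v is None else v > threshold for v in prcp]

def cooccurrence_p_value(a, b):
    """P(shared rainy days >= observed) if the two stations were independent.

    Only days observed at both stations are used.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    rainy_a = sum(x for x, _ in pairs)
    rainy_b = sum(y for _, y in pairs)
    overlap = sum(x and y for x, y in pairs)
    tail = sum(comb(rainy_a, i) * comb(n - rainy_a, rainy_b - i)
               for i in range(overlap, min(rainy_a, rainy_b) + 1))
    return tail / comb(n, rainy_b)

station_a = rain_indicator([2.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
station_b = rain_indicator([1.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.5, 0.0])

p_same = cooccurrence_p_value(station_a, station_a)    # 1/56, strong agreement
p_other = cooccurrence_p_value(station_a, station_b)   # 46/56, weak agreement
```

A small value means the two stations share more rainy days than chance would suggest.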

We first plot minus the p_value of every pair of stations in a histogram.

<img src="graphs/P_norm_flatten.png" style="width: 400px;"/>

Here we use p_value=-0.4 as the mark of significance between a station and itself. As the figure above shows, a very large portion of the significance values lie in [0.0,0.1], which means that the correlations between those stations are actually very weak.

Then we plot the correlation matrix as a grey-scale heatmap. We also run PCA on the correlation matrix and reorder the matrix according to the sorted order of each of the first three eigenvectors.

<img src="graphs/P_norm.png" style="width: 900px;"/>

From the first graph we can see that the highly correlated stations are concentrated within 27~28 stations. If we reorder the stations according to each eigenvector, however, the correlations become clearer. For example, 58~59 stations show high correlation at the top-left corner of the second graph, and similar groups appear at the top-left and bottom-right corners of the fourth graph. This effect is the idea behind "correlation ordering": samples that behave similarly along some direction reveal their similarity better when they are arranged contiguously according to that direction.
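The reordering can be sketched as follows. For simplicity the example takes the eigenvectors of the symmetric matrix itself, which is an assumption about how PCA is applied here, and the 4-station matrix is made up.

```python
import numpy as np

def reorder_by_eigenvector(matrix, k=0):
    """Reorder rows and columns by the entries of the k-th top eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(matrix)   # ascending eigenvalues
    top = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    order = np.argsort(eigvecs[:, top[k]])
    return order, matrix[np.ix_(order, order)]

# Toy matrix: stations 0 and 2 are strongly related, 1 and 3 weakly,
# but they are interleaved in the original station order.
toy = np.array([[1.0, 0.0, 0.9, 0.0],
                [0.0, 1.0, 0.0, 0.5],
                [0.9, 0.0, 1.0, 0.0],
                [0.0, 0.5, 0.0, 1.0]])
order, reordered = reorder_by_eigenvector(toy, k=0)
```

After reordering, stations 0 and 2 sit next to each other, so their high correlation shows up as a contiguous block in a corner of the heatmap.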

For reference, we also sort these three eigenvectors together according to each one of them and plot them as follows.

<img src="graphs/P_norm_eig.png" style="width: 900px;"/>

It can be seen from the figures above that sorting by one eigenvector has no effect on the trends of the other two, which means the eigenvectors are quite independent. In other words, the eigenvectors represent mutually orthogonal directions.

### Month-based analysis of PRCP
Here we try to analyze precipitation from a new perspective. Since the daily **PRCP** has too much noise, we now discard the old measurement and focus on the monthly precipitation, denoted as **mPRCP**. The first thing to do is to plot the mean **mPRCP** plus/minus its standard deviation.
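A monthly measurement of this kind can be derived from the daily vectors as sketched below. The sketch assumes 365-day years (no February 29), monthly totals rather than monthly means, and missing days counted as zero; all three are assumptions for illustration.

```python
import numpy as np

# Assumed calendar: every year has 365 days.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def monthly_totals(daily):
    """Sum daily values into 12 monthly totals along the last axis.

    Accepts a single 365-day vector or an (n_years, 365) array.
    NaN entries are treated as zero, which can understate the totals.
    """
    daily = np.asarray(daily, dtype=float)
    edges = np.cumsum([0] + DAYS_IN_MONTH)
    months = [np.nansum(daily[..., a:b], axis=-1)
              for a, b in zip(edges[:-1], edges[1:])]
    return np.array(months).T

one_year = monthly_totals(np.ones(365))          # equals DAYS_IN_MONTH
two_years = monthly_totals(np.ones((2, 365)))    # shape (2, 12)
```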

<img src="graphs/mean_std_PRCP_month.png" style="width: 400px;"/>

Then we plot its accumulated eigenvalues as well as the first eight eigenvectors.

<img src="graphs/eigen_PRCP_month.png" style="width: 700px;"/>

It can be seen from the first figure that the new measurement **mPRCP** allows us to easily cover 50% of the total variance with just two eigenvectors, which is a huge improvement compared to the old measurement **PRCP**. Besides, the eigenvectors in the second figure are also easier to interpret.
* **eig1** is the first eigenvector. It represents the most common case (the overall monthly precipitation) and can be considered a base upon which extra features are built.
* **eig2-8** are the second-to-eighth eigenvectors. Together they compose eight rises of precipitation, each of which corresponds to a month. The four months they do not include are January, March, August, and December, and three of them (Jan, Aug, Dec) have almost no rain according to the mean of **mPRCP**.

### Statistical analysis of mPRCP
Next we plot the CDF of the eight coefficients.

<img src="graphs/CDF_coeff_PRCP_month.png" style="width: 900px;"/>

From the figures above we can see that all curves except the first share the same pattern. That is because **eig2-8** convey similar information, differing slightly in time. So to represent the **mPRCP** of a certain station, we just use **eig1** together with the eigenvectors that have the desired rises, and combine them with proper coefficients.

We also plot the CDF of the residual variances.

<img src="graphs/CDF_res_PRCP_month.png" style="width: 400px;"/>

From the figure above we can see that adding each eigenvector significantly reduces the residual variance. Among them, **eig2-4** contribute the most. That is because **eig2-4** correspond to rises of precipitation in April, May, and June, which are exactly the three months with the most precipitation.