# Wyoming Weather Analysis

This report presents a historical analysis of weather patterns in an area that approximately overlaps the state of Wyoming.

The data used here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** the temperature at the time of observation.
* **PRCP:** daily precipitation (in mm).
* **SNOW:** daily snowfall (in mm).
* **SNWD:** the depth of accumulated snow.

## Sanity-check: comparison with outside sources

We start by comparing some general statistics with graphs from [US Climate Data](http://www.usclimatedata.com/climate/wyoming/united-states/3220). Specifically, we compare the weather data from the two sources for Rawlins. The US Climate Data graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.

<img src="//cseweb.ucsd.edu/~d5liu/r_figures/Rawlins_USClimateData.png" width=500>

For NOAA, we choose station USW00024057, which is located northeast of Rawlins. Below are the average minimum and maximum temperatures for station USW00024057.

![TMIN,TMAX.svg](//cseweb.ucsd.edu/~d5liu/r_figures/TMIN,TMAX_USW00024057.svg)

To compare the precipitation, we need to convert millimeters per day to inches per month. According to our analysis, the average rainfall is 0.64 mm/day, which translates to about 0.76 inches per month (0.64 × 30 / 25.4). According to US Climate Data, the average rainfall is closer to 0.7 inches per month, which we consider consistent with our result.

![PRCP.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_USW00024057.svg)
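The unit conversion used in this comparison is simple arithmetic, sketched below. The function name is hypothetical, and the 30-day month is an assumption chosen because it reproduces the 0.76 figure quoted above.

```python
MM_PER_INCH = 25.4
DAYS_PER_MONTH = 30  # assumption: a nominal 30-day month


def mm_per_day_to_inches_per_month(mm_per_day, days_per_month=DAYS_PER_MONTH):
    """Convert a mean daily precipitation rate (mm/day) to inches per month."""
    return mm_per_day * days_per_month / MM_PER_INCH


rate = mm_per_day_to_inches_per_month(0.64)
print(f"{rate:.2f} inches/month")  # 0.64 mm/day -> 0.76 inches/month
```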

## PCA analysis

### Percentage of variance explained

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used. The results are plotted together in the graph below.

![VarExplained.svg](//cseweb.ucsd.edu/~d5liu/r_figures/VarExplained.svg)

We see that the top 5 eigen-vectors explain 29% of the variance for TMIN, 51% for TOBS, 32% for TMAX, 9% for SNOW, 84% for SNWD and 11% for PRCP.

We can draw the following conclusions:

1. Compared with TMAX and TMIN, TOBS is better explained by the top 5 eigenvectors. A possible explanation is that TOBS, the temperature recorded at the time of observation, is more stable than the daily extremes TMIN and TMAX.
2. SNOW and PRCP are fitted very poorly. Rain and snowfall are affected by many factors, and these two measurements do not accumulate over time, so they are harder to fit from historical data alone. The result would likely improve if we applied a convolution over time.
3. It makes sense that SNWD is less noisy than SNOW: SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
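The percentage of variance explained discussed above can be computed from the eigenvalues of the covariance matrix. The following is a minimal sketch, not the code used for this report: the function name and the synthetic data are hypothetical, and it assumes a matrix with one row per station-year and no missing values (real station data would need NaN handling).

```python
import numpy as np


def percent_variance_explained(X):
    """Cumulative % of variance explained by the top-k eigenvectors, k = 1..n_days.

    X has shape (n_samples, n_days): one row per station-year.
    Assumption: no missing values.
    """
    centered = X - X.mean(axis=0)
    cov = centered.T @ centered / X.shape[0]
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    return 100.0 * np.cumsum(eigvals) / eigvals.sum()


# Synthetic stand-in data: one seasonal pattern with random amplitude plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.sin(2 * np.pi * days / 365)
X = rng.normal(size=(200, 1)) * seasonal + 0.3 * rng.normal(size=(200, 365))

pve = percent_variance_explained(X)
print("top-5 eigenvectors explain %.1f%% of the variance" % pve[4])
```

Entry `pve[k - 1]` is the percentage explained by the top `k` eigenvectors, so `pve[4]` corresponds to the "top 5" figures quoted above.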

## Analysis of snow depth

We first analyze the eigen-decomposition for snow depth, since its first 4 eigen-vectors explain over 80% of the variance.

We graph the mean and the top 4 eigen-vectors first.

![SNWD_mean_eigs.svg](//cseweb.ucsd.edu/~d5liu/r_figures/SNWD_mean_eigs.svg)

The graph shows that in Wyoming, the snow season begins in October and ends the following May.

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** more snow from November to February, less snow from March to May.
* **eig3:** less snow in January and February, more snow from March to May, slightly more snow in November and December.
* **eig4:** more snow in January, February and April, less snow in March.
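To make the role of the coefficients concrete: each year's snow-depth curve is approximated as the mean plus a weighted sum of the top eigenvectors, and the weights are the coefficients (coeff1, coeff2, ...). The sketch below illustrates this on synthetic data. The function names and the data are hypothetical, and missing values are assumed to be absent.

```python
import numpy as np


def top_eigenvectors(X, k):
    """Return the mean vector and the top-k eigenvectors (as columns) of X's covariance."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = centered.T @ centered / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest
    return mean, eigvecs[:, order]


def project_and_reconstruct(x, mean, V):
    """Coefficients of x on the eigenvectors, and the resulting approximation of x."""
    coeffs = V.T @ (x - mean)
    return coeffs, mean + V @ coeffs


# Synthetic "snow depth": a winter bump with a random yearly amplitude plus noise.
rng = np.random.default_rng(1)
days = np.arange(365)
profile = np.clip(np.cos(2 * np.pi * days / 365), 0.0, None)
X = 100 * rng.uniform(0.5, 1.5, size=(100, 1)) * profile
X = X + rng.normal(scale=5.0, size=(100, 365))

mean, V = top_eigenvectors(X, k=4)
coeffs, approx = project_and_reconstruct(X[0], mean, V)
residual = np.linalg.norm(X[0] - approx)
baseline = np.linalg.norm(X[0] - mean)
print("coefficients:", np.round(coeffs, 1))
print("residual norm %.1f vs. mean-only %.1f" % (residual, baseline))
```

A large positive first coefficient scales up the mean-like shape of eig1, which is why it reads as "more snow than average" in the reconstructions.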

### Examples of reconstructions

#### Coeff1

Coeff1: most positive

![SNWD_grid_pos_coeff1.svg](//cseweb.ucsd.edu/~d5liu/r_figures/SNWD_grid_pos_coeff1.svg)

Coeff1: most negative

![SNWD_grid_neg_coeff1.svg](//cseweb.ucsd.edu/~d5liu/r_figures/SNWD_grid_neg_coeff1.svg)

We can see that large positive values of coeff1 correspond to more snow than average and low values correspond to less snow than average, which is expected since **eig1** has the same shape as the mean function.

#### Coeff2

Coeff2: most positive

![SNWD_grid_pos_coeff2.svg](//cseweb.ucsd.edu/~d5liu/r_figures/SNWD_grid_pos_coeff2.svg)

Coeff2: most negative

![SNWD_grid_neg_coeff2.svg](//cseweb.ucsd.edu/~d5liu/r_figures/SNWD_grid_neg_coeff2.svg)

Large positive values of coeff2 correspond to a nearly snowless April, with less snow than usual in that month. Conversely, negative values of coeff2 correspond to a snowy April.

#### Coeff3

Coeff3: most positive

![SNWD_grid_pos_coeff3.svg](//cseweb.ucsd.edu/~d5liu/r_figures/SNWD_grid_pos_coeff3.svg)

Coeff3: most negative

![SNWD_grid_neg_coeff3.svg](//cseweb.ucsd.edu/~d5liu/r_figures/SNWD_grid_neg_coeff3.svg)

Coeff3 is the opposite of coeff2: positive values correspond to a snowy April and negative values correspond to a snowless April.

## Which factor determines the coefficients: year or station?

In the previous section we saw how Coeff1, which corresponds to the total amount of snow, varies with location. We now estimate the importance of location-to-location variation relative to year-by-year variation.

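These are measured by the percentage by which the RMS of a coefficient is reduced when we subtract from each station/year entry the average-per-station or the average-per-year, respectively. The "fraction explained" values below are percentages, computed as 100 × (1 − RMS after removal / total RMS). Here are the results: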

**coeff_1**
- total RMS                    =  1848.59752955
- RMS removing mean-by-station =  1050.25237611, fraction explained = 43.1865314478
- RMS removing mean-by-year    =  1621.48367705, fraction explained = 12.2857381809

**coeff_2**
- total RMS                    =  640.657058425
- RMS removing mean-by-station =  598.857777538, fraction explained = 6.52443929831
- RMS removing mean-by-year    =  494.785769147, fraction explained = 22.7690130561

**coeff_3**
- total RMS                    =  619.922538699
- RMS removing mean-by-station =  538.19885145, fraction explained = 13.1828869168
- RMS removing mean-by-year    =  491.171563302, fraction explained = 20.7688811681


We see that different coefficients behave differently. For coeff_1, the variation by station explains more (about 43%) than the variation by year (about 12%). For coeff_2 and coeff_3, the variation by year explains more (about 23% and 21%) than the variation by station (about 7% and 13%). Neither factor is dominant in any of these cases, however, so no coefficient can be attributed to year or station alone.
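The RMS comparison above can be sketched as follows. This is an illustration under stated assumptions rather than the report's own code: the function names are hypothetical, the table is assumed to have one row per year and one column per station with NaN for missing entries, and the data is synthetic.

```python
import numpy as np


def rms(a):
    """Root mean square, ignoring NaN entries."""
    return float(np.sqrt(np.nanmean(a ** 2)))


def percent_rms_explained(table):
    """table: rows = years, columns = stations, NaN where a station-year is missing."""
    total = rms(table)
    minus_station = table - np.nanmean(table, axis=0, keepdims=True)
    minus_year = table - np.nanmean(table, axis=1, keepdims=True)
    return {
        "total": total,
        "by_station": 100.0 * (1.0 - rms(minus_station) / total),
        "by_year": 100.0 * (1.0 - rms(minus_year) / total),
    }


# Synthetic table in which stations differ strongly and years barely matter.
rng = np.random.default_rng(2)
station_effect = np.array([-20.0, -10.0, 0.0, 10.0, 20.0])
table = station_effect[np.newaxis, :] + rng.normal(size=(20, 5))
table[0, 0] = np.nan  # one missing station-year

stats = percent_rms_explained(table)
print(stats)

# The same formula applied to the snow-depth coeff_1 RMS values reported above.
reported = 100.0 * (1.0 - 1050.25237611 / 1848.59752955)
print("%.4f" % reported)  # matches the reported 43.1865...
```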

## Analysis of precipitation

Precipitation, as mentioned above, does not accumulate over time and is therefore very noisy. One way to address this is to apply a convolution over time. We plot the original and the processed vectors below, with the convolution parameter (conv) ranging from 0 to 50.

![PRCP_Conv.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_Conv.svg)

The percentage of explained variance after each convolution is shown below.

![PRCP_Conv_VarExplained.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_Conv_VarExplained.svg)

We can see that conv = 30 or conv = 40 would be a good choice. We take conv = 40.
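The smoothing step can be sketched as a moving average. The report does not state which kernel was used, so the uniform (box) kernel, the function name, and the synthetic series below are all assumptions made for illustration.

```python
import numpy as np


def smooth(series, window):
    """Moving average over `window` consecutive days; window <= 1 returns a copy.

    Assumption: a uniform (box) kernel. mode="same" keeps the output length
    equal to the input length, at the cost of zero-padding effects near the edges.
    """
    series = np.asarray(series, dtype=float)
    if window <= 1:
        return series.copy()
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="same")


# Synthetic daily precipitation: mostly dry days with occasional bursts.
rng = np.random.default_rng(3)
raw = rng.exponential(1.0, size=365) * (rng.random(365) < 0.2)
smoothed = smooth(raw, window=40)

print("std of raw series:      %.3f" % raw[60:300].std())
print("std of smoothed series: %.3f" % smoothed[60:300].std())
```

Averaging over a longer window suppresses day-to-day noise, which is consistent with the higher percentage of variance explained for larger conv values.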

We then graph the mean and the top 4 eigenvectors.

![PRCP_mean_eigs.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_mean_eigs.svg)


### Examples of reconstructions

Repeating the above analysis for precipitation gives the following graphs.

#### Coeff1

Coeff1: most positive

![PRCP_grid_pos_coeff1.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_grid_pos_coeff1.svg)

Coeff1: most negative

![PRCP_grid_neg_coeff1.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_grid_neg_coeff1.svg)

#### Coeff2

Coeff2: most positive

![PRCP_grid_pos_coeff2.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_grid_pos_coeff2.svg)

Coeff2: most negative

![PRCP_grid_neg_coeff2.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_grid_neg_coeff2.svg)

#### Coeff3

Coeff3: most positive

![PRCP_grid_pos_coeff3.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_grid_pos_coeff3.svg)

Coeff3: most negative

![PRCP_grid_neg_coeff3.svg](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_grid_neg_coeff3.svg)

### Spatial or temporal?

**coeff_1**
- total RMS                    =  105.01855224
- RMS removing mean-by-station =  57.303438757, fraction explained = 45.4349374138
- RMS removing mean-by-year    =  90.5520063323, fraction explained = 13.7752288519

**coeff_2**
- total RMS                    =  60.6492657721
- RMS removing mean-by-station =  52.2347873505, fraction explained = 13.8739988266
- RMS removing mean-by-year    =  52.2600973837, fraction explained = 13.8322670219

**coeff_3**
- total RMS                    =  45.8879940114
- RMS removing mean-by-station =  43.6733582346, fraction explained = 4.8261769216
- RMS removing mean-by-year    =  38.4352589105, fraction explained = 16.2411438142

The conclusion is that coeff_1 is more spatial than temporal, while coeff_2 and coeff_3 are not strongly related to either station or year.

We can plot coeff_1 on a map. The size of each marker represents the number of records; the color represents coeff_1, with blue corresponding to the minimum coeff_1 and red to the maximum.

![PRCP_Map.png](//cseweb.ucsd.edu/~d5liu/r_figures/PRCP_Map.png)

We can see that coeff_1 is closely tied to the vegetation of the area: in forest, coeff_1 is mostly positive, and in desert it is mostly negative. This matches our expectation, since forests usually receive more rain than deserts.
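The marker encoding used on the map (size for the number of records, a blue-to-red ramp for coeff_1) could be computed along the following lines. Both helper functions are hypothetical, and the square-root size scaling is an assumption, chosen so that marker area rather than radius grows with the record count.

```python
import math


def coeff_to_hex(value, vmin, vmax):
    """Linear color ramp from blue (vmin) to red (vmax), returned as '#rrggbb'."""
    if vmax == vmin:
        t = 0.5
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp out-of-range values
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return f"#{red:02x}00{blue:02x}"


def marker_radius(count, scale=1.0):
    """Radius proportional to sqrt(count), so marker area is proportional to count."""
    return scale * math.sqrt(count)


print(coeff_to_hex(-1.0, -1.0, 1.0))  # minimum coefficient -> #0000ff (blue)
print(coeff_to_hex(1.0, -1.0, 1.0))   # maximum coefficient -> #ff0000 (red)
print(marker_radius(4))               # 2.0
```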