# Weather Analysis - Southern part of Wyoming

### Written by Hao-en Sung (A53204772)

## Summary
This work focuses on the analysis of historical weather patterns in a specific region of the United States. In this report, I will:
1. Introduce the source of the dataset
2. Run a sanity check to verify it
3. Perform a PCA analysis
4. Dive into the two most interesting measurements: **SNWD** and **TOBS**

For each of these two measurements, I am going to:
1. Explain the physical meaning of the most important eigenvectors
2. Find the best coefficients to reconstruct the original data from them
3. Analyze the distribution of the coefficients and residuals for the eigenvectors
4. Explore the relationship between those eigenvectors and other features
5. Explore the correlation between different stations

There are also some parts that I decided to omit. They are listed below with a brief explanation.
* Identify the cause of variation, either spatial or temporal
    * Intuitively, both factors are crucial. In my opinion, it is not very meaningful to find out which one is *more important*; instead, one should look for a model that takes both factors into consideration. Besides that, I could not fully understand how the professor's code relates to this question.
* Run a p-value test to examine the independence of two stations
    * The purpose of the p-value test is to identify whether two stations are independent of each other. However, a similar question can also be approached by calculating the correlation between the eigenvector coefficients of the stations.

## Introduction to Dataset
In this analysis, I use the data provided by [NOAA](https://www.ncdc.noaa.gov/). The dataset can be downloaded directly from the [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

In my report, I focus on a sub-region whose latitude ranges from 41.4 to 42.85 and whose longitude ranges from -111.0 to -104.9128. In total, there are 12473 observations from different weather stations across years.
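As a rough illustration of how such a sub-region can be selected, the sketch below filters stations by the bounding box quoted above. The station names and coordinates are made up for the example, and `in_region` is a hypothetical helper, not part of the original analysis code.

```python
# Bounding box quoted in the text, in degrees.
LAT_MIN, LAT_MAX = 41.4, 42.85
LON_MIN, LON_MAX = -111.0, -104.9128


def in_region(lat, lon):
    """Return True if (lat, lon) lies inside the sub-region studied here."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# Hypothetical (name, latitude, longitude) records, for illustration only.
stations = [
    ("station_a", 41.8, -107.2),  # inside the box
    ("station_b", 44.5, -108.0),  # too far north
    ("station_c", 42.0, -103.0),  # too far east
]

selected = [name for name, lat, lon in stations if in_region(lat, lon)]
print(selected)
```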

There are six measurements that interest me: TMIN, TOBS, TMAX, PRCP, SNOW, and SNWD.
* **TMIN**: daily minimum temperature (*2563 entries*)
* **TOBS**: temperature at the time of observation, which I treat as the daily average temperature in this report (*1714 entries*)
* **TMAX**: daily maximum temperature (*2585 entries*)
* **PRCP**: daily precipitation (*1803 entries*)
* **SNOW**: daily snowfall (*1245 entries*)
* **SNWD**: snow depth, i.e. the depth of the snow accumulated on the ground (*2563 entries*)

## Sanity Check
Before I go into the detailed analysis, it is important to run a sanity check on the downloaded dataset; otherwise, I might draw wrong conclusions from an incorrect dataset. To this end, I compare it against another climate website, [US Climate Data](http://www.usclimatedata.com/climate/).

### Identify the location in the real world
From the given dataset, I have the latitude and longitude of each observation. Thus, I can use a visualization tool named *ipyleaflet* to interactively show the stations on a real-world map. To clearly identify the state boundaries, I read the source files and slightly edited the professor's code as follows.

``` python
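from ipyleaflet import Map, TileLayer
# OpenStreetMap tiles show state boundaries; `center` and `zoom` are assumed to be defined earlier (Python 2 code).
m = Map(default_tiles=TileLayer(url=unicode('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png'),
                                opacity=1.0), center=center, zoom=zoom)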
```

From the screenshot below, I can confirm that all given stations are located in the southern part of **Wyoming**.

![](img/station_location.png)

Thus, I take a screenshot of the figure from [Wyoming Climate (US Climate Data)](http://www.usclimatedata.com/climate/wyoming/united-states/3220) for reference, shown below.

![](img/us_climate_wyoming.png)

For comparison, I also plot the minimum temperature, maximum temperature, and precipitation directly from the given dataset.

![](img/TMIN_mv.png)
![](img/TMAX_mv.png)
![](img/PRCP_mv.png)

Comparing these three measurements with the reference figure, one can tell that the given dataset is trustworthy enough for the following analysis.

## PCA Analysis
After the sanity check, I need to find out which measurement provides more information. To do so, I use a well-known classical technique, *principal component analysis (PCA)*, to decompose the given dataset.

In order to objectively assess the importance of each measurement, I look at the cumulative explained variance as the number of eigenvectors increases. The corresponding figures are shown below.

![](img/sum_explained_var_0.png)
![](img/sum_explained_var_1.png)

It is clear that **SNWD** and **TOBS** are the two measurements whose variance is best explained by a small number of eigenvectors, while the top eigenvectors of the other measurements can barely explain the data. In view of this, I decided to analyze both **SNWD** and **TOBS** in more depth in the later sections.
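The cumulative explained variance used in this section can be sketched with NumPy as follows. This is a minimal example on synthetic data, assuming a complete matrix without missing values; the real weather data contains missing days, which the course code has to handle separately. The function name `cumulative_explained_variance` is my own choice for illustration.

```python
import numpy as np


def cumulative_explained_variance(data):
    """Fraction of total variance explained by the top-k eigenvectors, for every k.

    `data` is an (n_samples, n_days) array without missing values.
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues in descending order
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()


rng = np.random.default_rng(0)
# Synthetic stand-in for one measurement: 200 rows of 30 "days" driven by 2 hidden factors.
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 30))
data = factors @ loadings + 0.1 * rng.normal(size=(200, 30))

cum_var = cumulative_explained_variance(data)
print(np.round(cum_var[:5], 3))
```

In this synthetic case the first two eigenvectors explain almost all of the variance, which is the kind of steep curve that makes a measurement attractive for further analysis.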

## Analysis of Snow Depth (SNWD)
In this section, I focus only on the **SNWD** measurement and perform a series of analyses on its eigenvectors.

From the previous PCA analysis figure, it is known that the top three eigenvectors can explain around 80 percent of the data. Thus, in the following analysis, I will focus on only the top three eigenvectors.

I am first interested in the average *snow depth* across different years. From the figure below, one can tell that there is no snow left from mid May to early October. On top of that, the deepest snow accumulation occurs in approximately mid February.

![](img/SNWD_m.png)

Later on, I plot the top three eigenvectors in three different colors as follows.

![](img/SNWD_e.png)

From this plot, one can see that **eig1** mostly explains the shape of the average snow depth, while **eig2** and **eig3** account for the snow season arriving earlier or later. It is also interesting to notice that the values of **eig2** and **eig3** range both slightly above and slightly below 0. This means that they indicate slight increases and decreases in snow depth on top of the **eig1** curve.


### Reconstruction from Eigenvectors
After looking into the physical meaning of the top three eigenvectors, I am now interested in reconstructing the target curve for each station. With the reconstruction, I can find the (station, year) combinations that are affected the most (or the least) by each eigenvector. Note that I quantify this effect by the value of the coefficient of the corresponding eigenvector in the reconstruction.

To avoid outliers, I first filter out those (station, year) pairs whose average residual is larger than or equal to 1. Note that this filter reduces the number of valid pairs from 1245 to 584.

After that, I will sort all remaining pairs by their coefficient values and print out the top four positive and negative ones.
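The projection, filtering, and sorting steps described above can be sketched as follows. This is only an illustration on synthetic data: the normalized residual used here (squared reconstruction error divided by the squared norm of the centered curve) and the threshold of 0.5 are assumptions of mine, and may differ from the residual definition and the threshold of 1 used in the actual analysis. The names `project`, `keep`, and `kept_coef1` are hypothetical.

```python
import numpy as np


def project(curves, mean, eigvecs):
    """Coefficients and normalized residuals of each curve.

    curves:  (n_pairs, n_days) array, one row per (station, year) pair
    mean:    (n_days,) mean curve
    eigvecs: (n_days, k) array with orthonormal columns
    """
    centered = curves - mean
    coeffs = centered @ eigvecs   # least-squares coefficients for orthonormal columns
    recon = coeffs @ eigvecs.T    # reconstruction of the centered curves
    resid = np.sum((centered - recon) ** 2, axis=1) / np.sum(centered ** 2, axis=1)
    return coeffs, resid


rng = np.random.default_rng(1)
n_pairs, n_days, k = 50, 365, 3
eigvecs, _ = np.linalg.qr(rng.normal(size=(n_days, k)))  # synthetic orthonormal basis
mean = rng.normal(size=n_days)
true_coeffs = rng.normal(scale=10.0, size=(n_pairs, k))
curves = mean + true_coeffs @ eigvecs.T + 0.1 * rng.normal(size=(n_pairs, n_days))

coeffs, resid = project(curves, mean, eigvecs)

keep = resid < 0.5                # assumed threshold, for illustration only
kept_coef1 = coeffs[keep, 0]      # first-eigenvector coefficients of the kept pairs
order = np.argsort(kept_coef1)
top_negative = order[:4]          # indices (into kept_coef1) of the 4 most negative
top_positive = order[::-1][:4]    # indices (into kept_coef1) of the 4 most positive
print(kept_coef1[top_positive], kept_coef1[top_negative])
```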

#### 1st EigenVector
The top four positive and negative coefficients for the first eigenvector are shown as follows.

##### Top four positive pairs of (station, year)
![](img/SNWD_eig1_coef_p.png)

##### Top four negative pairs of (station, year)
![](img/SNWD_eig1_coef_n.png)

According to the interpretation of **eig1**, the more positive the coefficient is, the more snow accumulates on average.

#### 2nd EigenVector

##### Top four positive pairs of (station, year)
![](img/SNWD_eig2_coef_p.png)

##### Top four negative pairs of (station, year)
![](img/SNWD_eig2_coef_n.png)

According to the interpretation of **eig2**, the more positive the coefficient is, the more snow accumulates in the earlier months.

#### 3rd EigenVector

##### Top four positive pairs of (station, year)
![](img/SNWD_eig3_coef_p.png)

##### Top four negative pairs of (station, year)
![](img/SNWD_eig3_coef_n.png)

According to the interpretation of **eig3**, the more positive the coefficient is, the more snow accumulates in the later months.


### Distribution of Coefficients and Residuals for Eigenvectors 
In this part, I am curious about the distribution of coefficients and residuals in terms of different eigenvectors.
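For reference, a cumulative distribution like the ones plotted in this part can be computed from raw values with a few lines of NumPy. The helper name `empirical_cdf` and the sample values are made up for this sketch.

```python
import numpy as np


def empirical_cdf(values):
    """Return the sorted values and the fraction of samples <= each of them."""
    x = np.sort(np.asarray(values, dtype=float))
    p = np.arange(1, x.size + 1) / x.size
    return x, p


x, p = empirical_cdf([3.0, -1.0, 2.0, 0.0])
print(x)
print(p)
```

When reading such a plot, a nearly straight line indicates values spread evenly over their range, while an S-shaped curve indicates values concentrated around the center with a few extreme ones at both ends.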

#### 1st EigenVector

##### Cumulative Distribution of Coefficient
![](img/SNWD_coef1.png)

This curve is very interesting. It indicates that most of the coefficients for the first eigenvector are not very large.

##### Cumulative Distribution of Residual
![](img/SNWD_res1.png)

The number of instances and the cumulative probability are almost linear in each other, which means that the residuals are distributed quite uniformly.

#### 2nd EigenVector

##### Cumulative Distribution of Coefficient
![](img/SNWD_coef2.png)

This distribution resembles a *sigmoid* or *tanh* function. It means that the outliers are quite significant, while most of the (station, year) pairs have coefficients close to zero.

##### Cumulative Distribution of Residual
![](img/SNWD_res2.png)

Similar to residuals for other eigenvectors, it is a uniform distribution.

#### 3rd EigenVector

##### Cumulative Distribution of Coefficient
![](img/SNWD_coef3.png)

This is similar to the coefficient distribution for the second eigenvector.

##### Cumulative Distribution of Residual
![](img/SNWD_res3.png)

Similar to residuals for other eigenvectors, it is a uniform distribution.


### Relationship between Elevation and Coefficient of Eigenvectors

#### 1st EigenVector

![](img/SNWD_elev_coef1.png)

#### 2nd EigenVector

![](img/SNWD_elev_coef2.png)

#### 3rd EigenVector

![](img/SNWD_elev_coef3.png)

From the above three figures, one can easily tell that elevation is a key factor for the coefficient of the first eigenvector; however, this observation does not apply to the second and third eigenvectors.

I also tried plotting the relationship between elevation and the residuals of the eigenvectors, but there is no clear trend.
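One way to back up the visual impression from such scatter plots is to compute the Pearson correlation between elevation and each coefficient. The sketch below does this on synthetic data; the elevations, the coefficients, and the helper `pearson` are all invented for illustration and do not reproduce the values in the figures.

```python
import numpy as np

rng = np.random.default_rng(2)
elevation = rng.uniform(1200.0, 2800.0, size=100)            # synthetic elevations
coef1 = 2.0 * elevation + rng.normal(scale=300.0, size=100)  # depends on elevation
coef2 = rng.normal(scale=300.0, size=100)                    # unrelated to elevation


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r1 = pearson(elevation, coef1)
r2 = pearson(elevation, coef2)
print(round(r1, 2), round(r2, 2))
```

A value of `r1` close to 1 corresponds to the clear trend seen for the first eigenvector, whereas a value near 0, as for `r2`, corresponds to a scatter plot without a visible trend.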


### Correlation between Stations
In this section, I am curious about the relationship between different stations across years. Thus, I use two different ways to demonstrate the information: map utility in 2D-grid and correlation matrix.

#### Map Utility

![](img/SNWD_station_map.png)

I use the ipyleaflet utility to show the information for each station. If two stations share a similar weather pattern, they should have similar triangle symbols on the map. From the image, one can easily tell that this utility captures the geographic information well.

#### Correlation Matrix

![](img/SNWD_station_corr.png)

To generate this figure, I first compute the coefficients of the top three eigenvectors for each station across years. I then use this information as the features of each station and plot the correlation matrix. One can easily tell that some stations indeed share a very high correlation score with each other. For a more detailed analysis, one could run a simple station clustering based on this correlation matrix.
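A minimal sketch of this kind of station correlation matrix is given below. It assumes that every station has coefficients for every year, so that each station becomes one feature row of equal length; the actual data has missing (station, year) pairs, which would need extra handling. All numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n_years, k = 20, 3  # 20 years, top three eigenvector coefficients per year

# Each row is one station: its coefficients for all years, flattened into one vector.
regional = rng.normal(size=n_years * k)  # pattern shared by nearby stations
features = np.vstack([
    regional + 0.2 * rng.normal(size=n_years * k),  # station 0
    regional + 0.2 * rng.normal(size=n_years * k),  # station 1, similar to station 0
    rng.normal(size=n_years * k),                   # station 2, unrelated
])

# np.corrcoef treats rows as variables, so the result is (n_stations, n_stations).
corr = np.corrcoef(features)
print(np.round(corr, 2))
```

Stations 0 and 1 share the regional pattern and end up strongly correlated, while station 2 does not, which mirrors the block structure one would look for before clustering stations.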

## Analysis of Temperature (TOBS)
In this section, I focus only on the **TOBS** measurement and perform a series of analyses on its eigenvectors.

From the previous PCA analysis figure, it is known that the top three eigenvectors can explain around 50 percent of the data. Similar to what I just did for **SNWD**, in the following analysis, I will focus on only the top three eigenvectors.

I am first interested in the mean of the *average temperature* across different years. From the figure below, one can tell that the hottest days occur between mid July and early August, which is consistent with what I observed in the last section: there is no snow during the summer.

![](img/TOBS_m.png)

Later on, I plot the top three eigenvectors in three different colors as follows.

![](img/TOBS_e.png)

This plot is much more interesting than the one for **SNWD**. Intuitively, one might assume that **eig1** approximates the *most common* situation; however, its curve is totally different from the mean of *average temperature*.

In my understanding, the unusual shape of **eig1** is caused by a flipped sign of that eigenvector. This is quite reasonable, since the eigendecomposition does not determine the sign of an eigenvector. Curve **eig2** seems to have the same issue as **eig1**, and it helps deliver more variance to the reconstructed curve. Curve **eig3** is also very interesting to me. From the figure, it seems to delay the hottest days from June to mid July, though I do not have a physical explanation for it.
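The sign ambiguity mentioned above is easy to verify numerically: if `v` is an eigenvector, so is `-v`. The sketch below checks this on a small made-up covariance matrix and shows one possible convention for fixing the sign. The helper `fix_sign` is hypothetical and is not part of the analysis in this report.

```python
import numpy as np

cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])  # small symmetric matrix, for illustration only
eigvals, eigvecs = np.linalg.eigh(cov)
v = eigvecs[:, -1]            # eigenvector of the largest eigenvalue

# Both v and -v satisfy the eigenvector equation cov @ v = lambda * v.
assert np.allclose(cov @ v, eigvals[-1] * v)
assert np.allclose(cov @ (-v), eigvals[-1] * (-v))


def fix_sign(vec):
    """Flip vec so that its largest-magnitude entry is positive (one possible convention)."""
    return vec if vec[np.argmax(np.abs(vec))] > 0 else -vec


print(fix_sign(v), fix_sign(-v))
```

Flipping the sign of an eigenvector also flips the sign of its coefficient, so the reconstructed curve stays the same; only the interpretation of "positive" and "negative" coefficients changes.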


### Reconstruction Examples
Similar to the analysis of **SNWD**, after looking into the physical meaning of the top three eigenvectors, I am now interested in reconstructing the target curve for each station. With the reconstruction, I can find the (station, year) combinations that are affected the most (or the least) by each eigenvector. Note that I quantify this effect by the value of the coefficient of the corresponding eigenvector in the reconstruction.

To avoid outliers, I first filter out those (station, year) pairs whose average residual is larger than or equal to 1. Note that this filter reduces the number of valid pairs from 1714 to 1493.

After that, I will sort all remaining pairs by their coefficient values and print out the top four positive and negative ones.

#### 1st EigenVector
The top four positive and negative coefficients for the first eigenvector are shown as follows.

##### Top four positive pairs of (station, year)
![](img/TOBS_eig1_coef_p.png)

##### Top four negative pairs of (station, year)
![](img/TOBS_eig1_coef_n.png)

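Here the coefficient of **eig1** measures how strongly the temperature curve of a (station, year) pair follows the shape of **eig1**. Its sign should be read together with the possibly flipped sign of the eigenvector itself.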

#### 2nd EigenVector

##### Top four positive pairs of (station, year)
![](img/TOBS_eig2_coef_p.png)

##### Top four negative pairs of (station, year)
![](img/TOBS_eig2_coef_n.png)

Similarly, the coefficient of **eig2** measures how strongly the temperature curve of a (station, year) pair follows the shape of **eig2**, again up to the sign of the eigenvector.

#### 3rd EigenVector

##### Top four positive pairs of (station, year)
![](img/TOBS_eig3_coef_p.png)

##### Top four negative pairs of (station, year)
![](img/TOBS_eig3_coef_n.png)

According to the interpretation of **eig3** above, the more positive the coefficient is, the more the hottest days seem to shift from June toward mid July.


### Distribution of Coefficients and Residuals
In this part, I am curious about the distribution of coefficients and residuals in terms of different eigenvectors.

#### 1st EigenVector

##### Cumulative Distribution of Coefficient
![](img/TOBS_coef1.png)

This distribution is very close to a uniform distribution.

##### Cumulative Distribution of Residual
![](img/TOBS_res1.png)

Similar to residuals for **SNWD**, it is a uniform distribution.

#### 2nd EigenVector

##### Cumulative Distribution of Coefficient
![](img/TOBS_coef2.png)

Similar to what I observed for the **SNWD** measurement, the outliers are relatively significant in the coefficient of the second eigenvector.

##### Cumulative Distribution of Residual
![](img/TOBS_res2.png)

Similar to residuals for other eigenvectors, it is a uniform distribution.

#### 3rd EigenVector

##### Cumulative Distribution of Coefficient
![](img/TOBS_coef3.png)

This is similar to the coefficient distribution of the second eigenvector.

##### Cumulative Distribution of Residual
![](img/TOBS_res3.png)

Similar to residuals for other eigenvectors, it is a uniform distribution.


### Relationship between Elevation and Coefficient of Eigenvectors

#### 1st EigenVector

![](img/TOBS_elev_coef1.png)

#### 2nd EigenVector

![](img/TOBS_elev_coef2.png)

#### 3rd EigenVector

![](img/TOBS_elev_coef3.png)

Unfortunately, I cannot draw any useful conclusion from the above three figures, except that there seem to be some outliers that I might remove from the dataset.


### Correlation between Stations
Same as what I did for **SNWD**, I try to find out the relationship between different stations across years here. Thus, I use two different ways to demonstrate the information: map utility in 2D-grid and correlation matrix.

#### Map Utility

![](img/TOBS_station_map.png)

From the image, one can easily tell that this utility captures the geographic information well.

#### Correlation Matrix

![](img/TOBS_station_corr.png)

The correlation pattern for **TOBS** is even stronger than that for **SNWD**. My personal explanation for this phenomenon is that the temperature ranges are more polarized, so that the correlations between stations are either strongly positive or strongly negative.

## Conclusion
In this report, I not only analyze the behavior of the eigenvectors and the corresponding residuals, but also compute the correlation between stations with code I wrote myself. For each figure, I provide a simple and straightforward explanation. All figures were generated by myself and are listed below.

![](img/fig_list.png)