# Washington Weather Pattern Analysis Report

## Introduction

In this report, we analyse historical weather-pattern data for an area in Washington state. The area extends from Moses Lake in the northwest to Municipal Airport in the southeast, as shown in the figure below. In this map, the area of each circle corresponds to the number of stations at that location.

![area.png](area.png)

The weather pattern data used for our analysis can be downloaded from [Data Link](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

There are six core measurements in our data set: 

* **PRCP** = Precipitation (tenths of mm)
* **SNOW** = Snowfall (mm)
* **SNWD** = Snow depth (mm)
* **TMAX** = Maximum temperature (tenths of degrees C)
* **TMIN** = Minimum temperature (tenths of degrees C)
* **TOBS** = Temperature at the time of observation (tenths of degrees C)

## Comparison with Other Sources

**Let us perform a sanity check on our data. The figure below is taken from the 'US Climate Data' [Website](http://www.usclimatedata.com/climate/moses-lake/washington/united-states/uswa0762).**

![source.png](source.png)

The figure above shows the maximum/minimum temperature and the precipitation for each month in the Moses Lake area. In the graph below, we plot the maximum and minimum temperature curves based on our data.

![Tmm.png](Tmm.png)

Clearly, the above curves have the same shape as the curves in the 'US Climate Data' plot. 

If we look only at the maximum temperature curve, the peak is reached around July in both plots. Moreover, the peak value in the 'US Climate Data' plot is 90F, which is roughly 32C. This is roughly equal to the peak value in our data plot. Thus, our data is consistent with an independent source and can be trusted.
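The unit conversion behind this comparison can be sketched as follows. The helper names are illustrative and not part of the original analysis; the only facts used are the standard Fahrenheit-to-Celsius formula and the "tenths of degrees C" unit listed in the introduction.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def tenths_to_celsius(raw_value):
    """Convert a raw TMAX/TMIN/TOBS reading (tenths of degrees C) to degrees C."""
    return raw_value / 10.0


# Peak monthly maximum read off the 'US Climate Data' plot.
peak_f = 90.0
peak_c = fahrenheit_to_celsius(peak_f)
print(f"{peak_f:.0f}F = {peak_c:.1f}C")
```

A raw TMAX value would be passed through `tenths_to_celsius` before being compared with `peak_c`.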

## Principal Component Analysis

To discover the weather patterns behind our data, we performed principal component analysis on each of the six features (measurements). To identify which feature is the most suitable for analysis, **we plot the percentage of variance explained by the top-k eigen-vectors as a function of k as follows**.

![PCA1.png](PCA1.png)

![PCA2.png](PCA2.png)
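A minimal sketch of how such an explained-variance curve could be computed with NumPy is given below. It is not the code used to produce the plots: the function name is hypothetical, the input is synthetic stand-in data, and missing values (which real station data would contain) are assumed to be absent.

```python
import numpy as np


def explained_variance_curve(data):
    """Cumulative fraction of variance explained by the top-k eigen-vectors.

    data: 2-D array with one row per station-year and one column per day.
    Assumes no missing values (a simplification for this sketch).
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic stand-in data: 200 rows of 365 daily values with one seasonal mode.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.cos(2 * np.pi * days / 365)
synthetic = rng.normal(size=(200, 1)) * seasonal + 0.1 * rng.normal(size=(200, 365))

curve = explained_variance_curve(synthetic)
print(f"variance explained by top 10 eigen-vectors: {curve[9]:.1%}")
```

Entry `curve[k - 1]` is the fraction of variance explained by the top-k eigen-vectors, which is the quantity on the vertical axis of the plots.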


#### Feature Suitable for PCA Analysis

The percentages of variance explained by the top **10** eigen-vectors for our six core measurements are:

* PRCP = 19.5%
* SNOW = 28%
* SNWD = **96%**
* TMAX = 45%
* TMIN = 45%
* TOBS = 59%

Among these six features, SNWD is best explained by the top **10** eigen-vectors, i.e., **96% of the variance is explained by the top 10 eigen-vectors**. However, before selecting SNWD as the feature to analyse, we need to make sure that the signal has some meaningful structure. We present the mean+-sd plot of SNWD as a function of time.

![snwd.png](snwd.png)

In the figure above, the orange curve is the mean SNWD signal. It is a bowl-shaped curve with value zero in the middle. This makes sense because the temperature rises to 32C in the middle of the year, so there should be no snow. To conclude, we use SNWD as the feature for further analysis.

## PCA Analysis for SNWD Signal

### How Many Eigen-Vectors Should We Select?
The length of our data vector is 365, so there are 365 eigen-vectors. It is not practical to use all of them. Since 96% of the variance is explained by the top-10 eigen-vectors, it suffices to choose from among those.

**There are 10 vectors in total; let us first plot the last five of them**.

![last5.png](last5.png)

At the left and right ends of these curves, the signal oscillates at high frequency and looks like random noise. In other words, the underlying structure cannot be described simply.

**Now let us look at the top five eigen-vectors**.

![first5.png](first5.png)

In the above plot, the curves have very clear structure. Besides, it turns out that the top five eigen-vectors actually explain 92% of the variance. 

**Thus, it is reasonable to choose the top five eigen-vectors for our following analysis.**

### How Should We Interpret These Eigen-Vectors?

To interpret the eigen-vectors, we plot the mean vector and the eigen-vectors separately as follows.

![snwd1.png](snwd1.png)


Based on the curve shapes and the effects they have on the SNWD signal, we interpret these eigen-vectors as follows:
* **eig1:** less snow in Jan-May and Nov-Dec.
* **eig2:** little change in Jan-May, significant effect in adding more snow in Nov-Dec.
* **eig3:** more snow in Jan-Feb, less snow in Mar-May.
* **eig4:** more snow in Jan & Dec, less snow in Mar.
* **eig5:** more snow in Jan-Feb & Apr-May, less snow in Mar & Dec.

### Distributions of the Residuals and the Coefficients

#### Distributions of Res1

In the analysis above, we work on the average data vectors instead of individual data vectors in the whole data set. Now, let us plot the distribution of **Res1** (i.e., normalised residual variance after subtracting the mean vector and first eigen-vector) for all the SNWD data vectors. Note that the vertical axis is the fraction of samples.

![snwd2.png](snwd2.png)

From the figure above, we make the following observations:

* For 30% of the samples, more than 50% of the variance is explained by the first eigen-vector.
* For 70% of the samples, more than 30% of the variance is explained by the first eigen-vector.

#### Distributions of Coeff1
Below, we plot the distribution of **Coeff1** (i.e., the first eigen-vector coefficient for PCA vector reconstruction) for all the SNWD data vectors. Note that the vertical axis is the fraction of samples.

![snwd3.png](snwd3.png)

From the figure above, we make the following observations:

* Around 40% of the samples have a positive coeff1, with values below 1000.
* Around 60% of the samples have a negative coeff1, and 85% of these coefficients are less than -6000.
* The minimum value of coeff1 is -12400.
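For a single data vector, the two quantities plotted above could be computed along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, the eigen-vector is assumed to have unit length, and the residual is normalised by the variance left after removing the mean alone, which may differ from the normalisation used for the plots.

```python
import numpy as np


def coeff_and_residual(x, mean, eigvec):
    """Return (coeff, res) for one data vector.

    coeff: projection of the mean-removed vector on the unit eigen-vector.
    res:   squared norm left after also removing that projection, divided by
           the squared norm after removing the mean only (assumed normalisation).
    """
    centered = x - mean
    total = float(centered @ centered)
    coeff = float(centered @ eigvec)
    if total == 0.0:
        return coeff, 0.0  # vector equals the mean, so nothing is left to explain
    residual = centered - coeff * eigvec
    return coeff, float(residual @ residual) / total


# Toy example with a 4-day "year" (not taken from the data set).
x = np.array([3.0, 4.0, 0.0, 0.0])
mean = np.zeros(4)
eig1 = np.array([1.0, 0.0, 0.0, 0.0])
coeff1, res1 = coeff_and_residual(x, mean, eig1)
print(coeff1, res1)
```

With this normalisation, `1 - res1` is the fraction of a sample's variance explained by the first eigen-vector.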

### Which Factor is More Important, Year or Location?

In the previous section, we have seen the effect of location on the distribution of eigen-vector coefficients and residual variance. In this section, we analyse the effect of 'locations' on the mean vectors and compare it with 'years'.  

To do this, we compute the RMS (root mean square) before and after subtracting either the year or the station mean vector.

We have the following results **for coeff1:**

* Total RMS = 3056.92085277
* RMS removing station mean = 1664.6460512
* RMS removing year mean = 2294.48138564

The RMS after station mean removal is about 27% smaller than the RMS after year mean removal. In other words, the station feature is **more important** than the year feature in the sense that it has more effect on the eigen-vector coefficients and thus on the SNWD signal.

This makes sense because the weather pattern in a specific location does not change too much each year but could be quite different in different locations.
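The comparison above can be sketched as follows: subtract the per-station (or per-year) mean from each coefficient and measure the RMS of what remains. The function names and the toy values are illustrative assumptions, not the original code or data.

```python
import numpy as np


def rms(values):
    """Root mean square of an array."""
    return float(np.sqrt(np.mean(np.square(values))))


def rms_after_group_mean_removal(values, groups):
    """RMS of the values after subtracting the mean of each group."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    residual = values.copy()
    for g in np.unique(groups):
        mask = groups == g
        residual[mask] -= values[mask].mean()
    return rms(residual)


# Toy coeff1 values for two stations over three years (hypothetical).
stations = ["A", "A", "A", "B", "B", "B"]
years = [2000, 2001, 2002, 2000, 2001, 2002]
coeff1 = [10.0, 12.0, 11.0, -10.0, -8.0, -9.0]

total_rms = rms(np.asarray(coeff1))
station_rms = rms_after_group_mean_removal(coeff1, stations)
year_rms = rms_after_group_mean_removal(coeff1, years)
print(total_rms, station_rms, year_rms)
```

In this toy case the station means absorb most of the spread, so `station_rms` is much smaller than `year_rms`, which is the same kind of signal the report reads off the real coefficients.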

## Cumulative Distribution Function of Daily SNWD

In this section, we extract one of the most important internal features: the CDF (cumulative distribution function) of the daily SNWD signal. This feature reflects how the SNWD values are distributed over the whole domain. We plot the CDF of SNWD as follows.

![cdf.png](cdf.png)

From the CDF plot, we make the following observations:

* For around 80% of the year, the snow depth is zero.
* The median **positive** snow depth value in a year is around 250 mm.
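The two observations above correspond to simple statistics of the daily values, as the following sketch illustrates. The sample array is hypothetical and chosen only to make the arithmetic easy to follow; it is not drawn from the data set.

```python
import numpy as np


def empirical_cdf(samples):
    """Return the sorted values and the fraction of samples <= each value."""
    values = np.sort(np.asarray(samples, dtype=float))
    fractions = np.arange(1, values.size + 1) / values.size
    return values, fractions


# Hypothetical daily snow depths in mm.
snwd = np.array([0, 0, 0, 0, 0, 0, 0, 0, 100, 400])

values, fractions = empirical_cdf(snwd)
zero_fraction = float(np.mean(snwd == 0))            # share of days with no snow
positive_median = float(np.median(snwd[snwd > 0]))   # median over snowy days only
print(zero_fraction, positive_median)
```

Plotting `fractions` against `values` gives the empirical CDF; the height of the curve at zero is `zero_fraction`.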

## Measuring Dependence of SNWD in Different Locations

### Testing Statistical Significance

In this section, we use statistical methods to test whether the SNWD signals at two locations are independent (the null hypothesis). Under the null hypothesis, we can compute the probability associated with the observed number of overlaps (days with SNWD>0 at both locations).

As an example, we choose the following two locations (stations): **USC00107046** and **USC00104150**. The numbers of days with SNWD measurements are 13188 and 13133, respectively, for these two stations. We compute the following parameters:

* **m** : the total number of days that have measurements for both stations.
* **n1** : the number of days with SNWD>0 for station 1.
* **n2** : the number of days with SNWD>0 for station 2.
* **l** : the number of days with SNWD>0 for both stations.

The results are: m=8742, l=2908, n1=3344, n2=2952.
Under the null hypothesis, the negative log probability of observing these parameter values is LogPval=0.462413.
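The report does not state which probability model was used for the overlap count. One common choice, assumed here purely for illustration, is the hypergeometric distribution: given m shared days with n1 and n2 snowy days at the two stations, it gives the chance of at least l overlaps if the snowy days were placed independently. The function names are hypothetical, and this sketch is not expected to reproduce the LogPval figure quoted above.

```python
from math import exp, lgamma, log


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_hypergeom_pmf(m, n1, n2, l):
    """Log probability of exactly l overlaps under the hypergeometric model."""
    return log_comb(n1, l) + log_comb(m - n1, n2 - l) - log_comb(m, n2)


def neg_log_upper_tail(m, n1, n2, l):
    """Negative natural log of P(overlaps >= l), summed stably in log space."""
    lo = max(l, 0, n1 + n2 - m)  # smallest overlap that is actually possible
    hi = min(n1, n2)             # largest possible overlap
    if lo > hi:
        return float("inf")      # the event is impossible
    logs = [log_hypergeom_pmf(m, n1, n2, k) for k in range(lo, hi + 1)]
    peak = max(logs)
    return -(peak + log(sum(exp(x - peak) for x in logs)))


# Toy example: 10 shared days, 4 and 5 snowy days, 4 overlapping days.
print(neg_log_upper_tail(10, 4, 5, 4))
```

Large values of this quantity indicate that the observed overlap would be very unlikely under independence.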

We then compute this negative log probability for every possible pair of stations and plot the values in the following graph. (The vertical axis is the number of pairs.)

![significance.png](significance.png)

The majority of the pairs have negative log probability values less than 0.3, which corresponds to a probability value of 0.7408. This means that for most of the station pairs, the null hypothesis of independence cannot be rejected.
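The conversion from a negative log probability to a probability can be checked in one line. The sketch assumes natural logarithms, which is what the quoted pair of values (0.3 and 0.7408) implies.

```python
from math import exp


def neg_log_to_prob(neg_log_p):
    """Convert a negative natural-log probability back to a probability."""
    return exp(-neg_log_p)


print(round(neg_log_to_prob(0.3), 4))  # 0.7408
```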

### Finding Correlated Stations

If we plot the negative log probability pairs as a matrix, we can find strongly correlated stations (upper left corner):

![station.png](station.png)

These strongly correlated stations are: 'USC00457956', 'USC00451586', 'USC00454835', 'USC00100528', 'USC00102845', 'USC00452706', 'USC00102513', 'USC00457180'.

To uncover deeper correlation structure, we can re-order the rows and columns of the matrix according to the SVD coefficients (the top four in our case). We obtain the following plots.

![svd.png](svd.png)

From the plots above, we can see:

* In the first plot, there is a correlated station group at the lower right corner.
* In the second plot, there are two correlated station groups: one at the upper left corner, the other at the lower right corner.
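The re-ordering idea can be sketched as follows: sort the stations by their entries in one singular vector and apply the same permutation to rows and columns, so that stations with similar entries end up next to each other. The function name and the toy matrix are assumptions for illustration; the actual plots may have been produced differently.

```python
import numpy as np


def reorder_by_singular_vector(matrix, k=0):
    """Reorder rows and columns of a square matrix by the k-th left singular vector."""
    u, s, vt = np.linalg.svd(matrix)
    order = np.argsort(u[:, k])
    return matrix[np.ix_(order, order)], order


# Toy "significance" matrix: stations 0 and 2 form one group, 1 and 3 another,
# but the groups are interleaved in the original ordering.
toy = np.array([
    [5.0, 0.0, 5.0, 0.0],
    [0.0, 3.0, 0.0, 3.0],
    [5.0, 0.0, 5.0, 0.0],
    [0.0, 3.0, 0.0, 3.0],
])

reordered, order = reorder_by_singular_vector(toy, k=0)
print(order)
print(reordered)
```

After re-ordering, each group occupies a contiguous block on the diagonal, which is what shows up as a bright corner in the plots. Which corner a group lands in depends on the sign of the singular vector, which is arbitrary.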

## Conclusion

In this report, we performed data analysis on Washington state historical weather data. With the help of PCA, SVD and statistical testing methods, we have identified patterns that are statistically significant, and presented useful observations in various forms.