# Weather Analysis

## 1 Introduction

In this report, I analyze the weather patterns in an area that approximately overlaps the state of Utah.

The data comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

There are many kinds of weather measurements, but this report focuses on the following six:
- **TMIN**: The daily minimum temperature.
- **TMAX**: The daily maximum temperature.
- **TOBS**: The average temperature for each day.
- **PRCP**: The daily precipitation (in mm).
- **SNOW**: The daily snowfall (in mm).
- **SNWD**: The depth of accumulated snow.

## 2 Data description

The subset of the data I use contains 12691 rows that share the label "SBSSBSBB", which identifies a specific area. Each row has several attributes, such as latitude, longitude, measurement, station, year, label, and vector. Each vector is 365-dimensional and holds the data for one year (365 days).

### 2.1 Data overview

In this section, I display the mean and standard deviation for each measurement. The following two pictures show the mean and standard deviation of TMIN and TMAX. From the figure, we can see that the mean temperature is higher in the summer and lower in the winter, and that the maximum temperature is much higher than the minimum temperature, as expected.
![mean-std-TminTmax.png](fig/mean-std-TminTmax.png)

The following two pictures show the mean and standard deviation of TOBS and PRCP. The mean of TOBS follows the same seasonal pattern as the minimum and maximum temperatures, which suggests that the data is reasonable. In addition, the average daily precipitation is roughly similar throughout the year.
![mean-std-TobsPrcp.png](fig/mean-std-TobsPrcp.png)

The following two pictures show the mean and standard deviation of SNOW and SNWD. From the figure, we can see that there is a lot of snow during the winter and none in the summer, as expected.
![mean-std-SnowSnwd.png](fig/mean-std-SnowSnwd.png)
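
The per-day statistics shown in this section can be sketched as follows. This is a minimal illustration, not the exact code behind the figures: it assumes that the 365-dimensional vectors are stacked into a NumPy array and that missing observations are stored as NaN. The function name `daily_mean_std` and the toy data are made up for the example.

```python
import numpy as np

def daily_mean_std(vectors):
    """Per-day mean and standard deviation, ignoring missing values (NaN).

    vectors: array of shape (n_rows, n_days), one row per station-year.
    """
    X = np.asarray(vectors, dtype=float)
    mean = np.nanmean(X, axis=0)  # average over station-years for each day
    std = np.nanstd(X, axis=0)    # spread over station-years for each day
    return mean, std

# Toy example with 3 days instead of 365 (assumed data, not from the dataset)
toy = np.array([[1.0, np.nan, 3.0],
                [3.0, 4.0, 5.0]])
mean, std = daily_mean_std(toy)
```

Each figure in this section would then plot `mean` together with a band of `mean - std` to `mean + std` against the day of the year.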

### 2.2 Sanity-check: comparison with outside sources

I have compared some of the general statistics with graphs obtained from a website called <a href="http://www.usclimatedata.com/climate/logan/utah/united-states/usut0147" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.
![utah.png](fig/utah.png)

From the figure above and the figures in section 2.1, we can see that the minimum and maximum daily temperatures agree with the ones from our dataset after converting the units from Fahrenheit to Celsius.

The precipitation also agrees with our dataset after converting the units from millimeters per day to inches per day. In particular, July and August have less precipitation than the other months of the year, and this pattern appears both in our dataset and in the figure above, which shows good agreement between the two data sources.
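
The unit conversions used in this comparison are simple enough to write down. The sketch below only shows the standard formulas; the function names are hypothetical and are not taken from the analysis code.

```python
MM_PER_INCH = 25.4  # exact definition of the inch

def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

def mm_to_inches(mm):
    """Convert a precipitation amount from millimeters to inches."""
    return mm / MM_PER_INCH

freezing_c = fahrenheit_to_celsius(32.0)
one_inch = mm_to_inches(25.4)
```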

### 2.3 Distribution of missing observations

The distribution of missing observations is not uniform throughout the year. The following pictures visualize the distribution of missing observations. Each figure is for one measurement.
![valid-counts-TminTmax.png](fig/valid-counts-TminTmax.png)
![valid-counts-TobsPrcp.png](fig/valid-counts-TobsPrcp.png)
![valid-counts-SnowSnwd.png](fig/valid-counts-SnowSnwd.png)
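
Counting the valid observations for each day of the year could look like the sketch below. It assumes, as an illustration only, that missing observations are encoded as NaN in an array with one row per station-year; the name `valid_counts` is hypothetical.

```python
import numpy as np

def valid_counts(vectors):
    """Number of non-missing observations for each day of the year."""
    X = np.asarray(vectors, dtype=float)
    return np.sum(~np.isnan(X), axis=0)

# Toy example with 3 days instead of 365 (assumed data)
toy = np.array([[1.0, np.nan, 3.0],
                [np.nan, np.nan, 5.0]])
counts = valid_counts(toy)
```

A day with a low count has many missing observations, so dips in these curves mark the periods in which data is missing most often.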

## 3 PCA analysis

For each of the six measurements, I have computed the percentage of the variance explained as a function of the number of eigenvectors used.

### 3.1 Top eigenvectors

First, the top 3 eigenvectors for each of the six measurements are shown in the following pictures.
![top-eigen-TminTmax.png](fig/top-eigen-TminTmax.png)
![top-eigen-TobsPrcp.png](fig/top-eigen-TobsPrcp.png)
![top-eigen-SnowSnwd.png](fig/top-eigen-SnowSnwd.png)
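
One way to obtain eigenvectors like these when some days are missing is sketched below. This is an assumption about the procedure, not a copy of the code used for the figures: the covariance of each pair of days is estimated only from the rows in which both days are observed, and the eigenvectors are sorted by decreasing eigenvalue. The name `pca_with_missing` and the random toy data are hypothetical.

```python
import numpy as np

def pca_with_missing(vectors):
    """Mean, eigenvalues and eigenvectors of the per-day covariance matrix.

    Missing values (NaN) are skipped: each covariance entry is averaged
    over the rows in which both days are present.
    """
    X = np.asarray(vectors, dtype=float)
    mean = np.nanmean(X, axis=0)
    centered = X - mean
    valid = ~np.isnan(centered)
    filled = np.where(valid, centered, 0.0)       # missing entries contribute 0
    sums = filled.T @ filled                      # sum of products per day pair
    counts = valid.astype(float).T @ valid.astype(float)  # rows with both days
    cov = sums / np.maximum(counts, 1.0)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    return mean, eigvals[order], eigvecs[:, order]

# Toy data: 200 "station-years" with 6 "days" and one missing value
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 6))
toy[0, 2] = np.nan
mean, eigvals, eigvecs = pca_with_missing(toy)
top3 = eigvecs[:, :3]  # columns are the top 3 eigenvectors
```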

### 3.2 Percentage of variance explained

The following two pictures show the percentage of variance explained for TMIN, TOBS, and TMAX. We can see that the top 5 eigenvectors can explain more than 35% of variance for TMIN and TMAX, and more than 50% for TOBS.

Therefore, we can conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. I think this is mainly because of the first eigenvector of TOBS, which, by itself, explains more than 45% of the variance. In addition, the curves for TMIN and TMAX are similar, which makes sense because these two measurements are closely related.

It makes sense that TOBS would be less noisy than TMIN and TMAX. This is because TOBS is an average temperature, which varies less between days and between the same date in different years.
![percent-var-TminTobsTmax.png](fig/percent-var-TminTobsTmax.png)

The following two pictures show the percentage of variance explained for SNOW, SNWD, and PRCP. We can see that the top 5 eigenvectors explain only about 10% and 12% of the variance for SNOW and PRCP respectively, which are both very low. For SNWD, however, they explain more than 80% of the variance, which means that the top 5 eigenvectors capture most of the variance in SNWD.

It also makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
![percent-var-SnowSnwdPrcp.png](fig/percent-var-SnowSnwdPrcp.png)
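
The percentage of variance explained by the top k eigenvectors is the sum of the k largest eigenvalues divided by the sum of all eigenvalues. A minimal sketch, with a hypothetical function name and made-up eigenvalues:

```python
import numpy as np

def fraction_explained(eigvals):
    """Cumulative fraction of variance explained by the top-k eigenvectors."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # largest first
    return np.cumsum(vals) / np.sum(vals)

# frac[k - 1] is the fraction explained by the top k eigenvectors
frac = fraction_explained([1.0, 4.0, 3.0, 2.0])
```

For these toy eigenvalues, the top eigenvector explains 40% of the variance and the top two explain 70%.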

## 4 Snow depth analysis

In this section, I will analyze the eigen-decomposition for snow-depth because the top 4 eigenvectors have explained more than 80% of the variance.

### 4.1 Eigenvectors

First, the mean and top 4 eigenvectors are shown in the following pictures. From the figure, we can see that the snow season lasts from November to the end of April, with the snow depth peaking in February. In addition, the mean and eigenvectors show that there is no snow during the summer, which matches common knowledge.
![snwd-mean-eigs.png](fig/snwd-mean-eigs.png)

Next, I interpret the eigen-functions. From the figures above, we can see that the shape of the first eigenvector is very similar to the shape of the mean. (The only difference is that the first eigenvector is below 0, while the mean is above 0. This makes no difference because the shapes are similar, and a negative coefficient for the first eigenvector can be used during reconstruction.) The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, without changing the distribution over time.

eig2, eig3, and eig4 are similar in the following way: they all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months without changing the total snow depth much. The eigenvectors can be interpreted as follows:
- **eig1**: overall amount of snow above/below the mean.
- **eig2**: less snow at the end of March and beginning of April, more snow in January.
- **eig3**: less snow in February, more snow in April and January.
- **eig4**: less snow in February, more snow in March.

### 4.2 Examples of reconstructions

#### 4.2.1 Coeff1
Coeff1: most positive
![recon-snwd-c1-pos.png](fig/recon-snwd-c1-pos.png)
Coeff1: most negative
![recon-snwd-c1-neg.png](fig/recon-snwd-c1-neg.png)
From the above figures, we can see that large positive values of coeff1 correspond to less than average snow. Low values of coeff1 correspond to more than average snow.

#### 4.2.2 Coeff2
Coeff2: most positive
![recon-snwd-c2-pos.png](fig/recon-snwd-c2-pos.png)
Coeff2: most negative
![recon-snwd-c2-neg.png](fig/recon-snwd-c2-neg.png)
From the above figures, we can see that large positive values of coeff2 correspond to an early snow season (most of the snowfall is in January), while low values of coeff2 correspond to a late snow season (most of the snowfall is in March).

#### 4.2.3 Coeff3
Coeff3: most positive
![recon-snwd-c3-pos.png](fig/recon-snwd-c3-pos.png)
Coeff3: most negative
![recon-snwd-c3-neg.png](fig/recon-snwd-c3-neg.png)
From the above figures, we can see that large positive values of coeff3 correspond to a late snow peak (the peak is in April), while low values of coeff3 correspond to an early snow peak (the peak is in February).
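
The reconstructions in this section approximate one year of data by the mean plus a weighted sum of the top eigenvectors, where the weights are the coefficients coeff1, coeff2, and so on. The sketch below shows the idea under two assumptions that are mine, not taken from the analysis code: the eigenvectors are the orthonormal columns of a matrix, and missing days are treated as equal to the mean when projecting. The name `reconstruct` is hypothetical.

```python
import numpy as np

def reconstruct(x, mean, eigvecs, k):
    """Coefficients of x on the top-k eigenvectors and the reconstruction."""
    x = np.asarray(x, dtype=float)
    centered = np.nan_to_num(x - mean)  # missing days count as "at the mean"
    V = eigvecs[:, :k]                  # columns are orthonormal eigenvectors
    coeffs = V.T @ centered             # coeff_i = <x - mean, eig_i>
    recon = mean + V @ coeffs           # mean + sum_i coeff_i * eig_i
    return coeffs, recon

# Toy example in 3 dimensions, using the standard basis as "eigenvectors"
mean0 = np.zeros(3)
basis = np.eye(3)
coeffs, recon = reconstruct([2.0, -1.0, 5.0], mean0, basis, 2)
```

Because eig1 for SNWD is negative, a large positive coeff1 pulls the reconstruction below the mean, which is why large positive values of coeff1 correspond to less snow than average.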

### 4.3 Distribution of residuals and coefficients

Next, I am going to display and analyze the distribution of residuals and coefficients.

#### Coeff1
![snwd-cum-dist-1.png](fig/snwd-cum-dist-1.png)

#### Coeff2
![snwd-cum-dist-2.png](fig/snwd-cum-dist-2.png)

#### Coeff3
![snwd-cum-dist-3.png](fig/snwd-cum-dist-3.png)

From the above figures, I draw the following conclusions:
- First, the distribution of coeff1 is different from that of coeff2 and coeff3, while the distributions of coeff2 and coeff3 are similar. This is because coeff1 represents the total amount of snowfall, while coeff2 and coeff3 represent the timing of the snowfall. So coeff1 can be considered a different kind of coefficient from coeff2 and coeff3.
- Second, the three distributions of residuals are very similar and can be considered approximately uniform, which is intuitively reasonable.
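
The cumulative distributions above can be computed with an empirical CDF. The sketch below also includes one possible definition of the residual, namely the fraction of the squared deviation from the mean that the reconstruction fails to capture. That definition is an assumption for illustration, since the report does not spell it out, and both function names are hypothetical.

```python
import numpy as np

def residual_fraction(x, mean, recon):
    """Unexplained fraction: ||x - recon||^2 / ||x - mean||^2 over valid days."""
    x, mean, recon = (np.asarray(a, dtype=float) for a in (x, mean, recon))
    valid = ~np.isnan(x)
    num = np.sum((x[valid] - recon[valid]) ** 2)
    den = np.sum((x[valid] - mean[valid]) ** 2)
    return num / den

def empirical_cdf(values):
    """Sorted values and the fraction of samples less than or equal to each."""
    v = np.sort(np.asarray(values, dtype=float))
    p = np.arange(1, len(v) + 1) / len(v)
    return v, p

v, p = empirical_cdf([3.0, 1.0, 2.0, 4.0])
perfect = residual_fraction([2.0, 0.0], [0.0, 0.0], [2.0, 0.0])
no_gain = residual_fraction([2.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

With this definition, a residual of 0 means a perfect reconstruction and a residual of 1 means that the reconstruction is no better than the mean.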

### 4.4 Coefficients distribution on map

In this section, I have displayed the first coefficient (coeff1) for each station on the map. The result is shown in the following picture. 
![snwd-map.png](fig/snwd-map.png)

In the figure above, each circle represents a station. The bigger the circle, the more observations the station has; the warmer the color, the larger coeff1 is.
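
The two quantities encoded on the map, the number of observations per station and the value of coeff1, can be obtained by grouping the station-year rows by station. The sketch below assumes that the color shows the average coeff1 per station, which the report does not state explicitly; the station names and values are made up.

```python
from collections import defaultdict

def per_station_summary(rows):
    """Map each station to (number of rows, mean coeff1).

    rows: iterable of (station, coeff1) pairs, one per station-year.
    """
    groups = defaultdict(list)
    for station, coeff1 in rows:
        groups[station].append(coeff1)
    return {s: (len(v), sum(v) / len(v)) for s, v in groups.items()}

# Hypothetical station identifiers and coefficients
summary = per_station_summary([("A", 1.0), ("A", 3.0), ("B", -1.0)])
```

The count would then set the circle size and the mean would set the circle color.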

## 5 The variation in the timing of SNWD is mostly due to year-to-year variation

In the previous section, we found that the top 4 eigenvectors of SNWD explain more than 80% of the variance. We also looked at the variation of coeff1, which corresponds to the total amount of snow, with respect to location. In this section, I further analyze the importance of location-to-location variation relative to year-to-year variation.

### 5.1 Different coefficients

I am going to measure the variation using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. The results are shown as follows:

#### Coeff1

total RMS                    =  1542.4295994  
RMS removing mean-by-station =  1167.01854244 ,  fraction explained =  0.243389427372  
RMS removing mean-by-year    =  1181.70634773 ,  fraction explained =  0.233866914775  

#### Coeff2
total RMS                    =  601.90862572  
RMS removing mean-by-station =  575.299716846 ,  fraction explained =  0.0442075553272  
RMS removing mean-by-year    =  346.048423438 ,  fraction explained =  0.425081468099  

#### Coeff3
total RMS                    =  598.320831818  
RMS removing mean-by-station =  545.039058008 ,  fraction explained =  0.0890521789937  
RMS removing mean-by-year    =  393.790733512 ,  fraction explained =  0.341840175755  

From the above results, we can conclude that, in general, the variation by year explains much more than the variation by station. For coeff1, however, variation by year and variation by station explain almost the same fraction (about 23% and 24%). I think this is because coeff1 represents the total snowfall, while coeff2 and coeff3 represent the timing of the snowfall. So the above conclusion is weaker when considering only coeff1. For coeff2 and coeff3, the station explains only about 4-9% of the variation, while the year explains about 34-43%.
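
The numbers reported above are consistent with computing the fraction explained as 1 - (RMS after removing the group mean) / (total RMS). For coeff1 by station, for example, 1 - 1167.02 / 1542.43 is approximately 0.2434. The sketch below reproduces this calculation on made-up data; the function names, the station names, and the `(station, year, coeff)` layout are assumptions for illustration.

```python
import math
from collections import defaultdict

def rms(values):
    """Root mean square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def fraction_explained_by(entries, key_index):
    """Total RMS, RMS after removing group means, and fraction explained.

    entries: list of (station, year, coeff) tuples.
    key_index: 0 to group by station, 1 to group by year.
    """
    groups = defaultdict(list)
    for e in entries:
        groups[e[key_index]].append(e[2])
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    total = rms([e[2] for e in entries])
    removed = rms([e[2] - means[e[key_index]] for e in entries])
    return total, removed, 1.0 - removed / total

# Toy data in which the station, not the year, determines the coefficient
entries = [("A", 2000, 1.0), ("A", 2001, 3.0),
           ("B", 2000, -1.0), ("B", 2001, -3.0)]
_, _, frac_station = fraction_explained_by(entries, 0)
_, _, frac_year = fraction_explained_by(entries, 1)
```

In this toy data, removing the per-station mean reduces the RMS noticeably, while removing the per-year mean changes nothing, so the station explains the variation and the year does not.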

## 6 Average temperature analysis

In this section, I will analyze the eigen-decomposition for average temperature because the top 4 eigenvectors have explained more than 50% of the variance.

### 6.1 Eigenvectors

First, I have displayed the mean and top 4 eigenvectors in the following pictures. From the figures, we can see that the temperature is much higher during the summer, while it is much lower in winter, which corresponds to the real situation.
![tobs-mean-eigs.png](fig/tobs-mean-eigs.png)

Next, I interpret the eigen-functions. From the figures above, it is a little harder to draw conclusions like the ones for SNWD in section 4. This is because, although the top 4 eigenvectors explain more than 50% of the variance, this is much lower than for SNWD. (Recall that the top 4 eigenvectors explain more than 80% of the variance for SNWD.)

However, looking closely at the figures, we can see that the shape of eig1 is similar to the shape of the mean up to scaling. (Here, scaling means that we can multiply eig1 by a fixed coefficient so that its shape looks the same as the shape of the mean.) So eig1 can be considered to capture part of the distribution of the average temperature across the year.

### 6.2 Examples of reconstructions

The following pictures are examples of reconstructions for coeff1, coeff2, and coeff3. Although there is no pattern as clear as the one for SNWD, a similar pattern can still be found for coeff1: large positive coefficients correspond to low temperatures, while low coefficients correspond to high temperatures. This is consistent with the interpretation of eig1 in section 6.1.

#### Coeff1
Coeff1: most positive
![recon-tobs-c1-pos.png](fig/recon-tobs-c1-pos.png)
Coeff1: most negative
![recon-tobs-c1-neg.png](fig/recon-tobs-c1-neg.png)

#### Coeff2
Coeff2: most positive
![recon-tobs-c2-pos.png](fig/recon-tobs-c2-pos.png)
Coeff2: most negative
![recon-tobs-c2-neg.png](fig/recon-tobs-c2-neg.png)

#### Coeff3
Coeff3: most positive
![recon-tobs-c3-pos.png](fig/recon-tobs-c3-pos.png)
Coeff3: most negative
![recon-tobs-c3-neg.png](fig/recon-tobs-c3-neg.png)


### 6.3 Distribution of residuals and coefficients

Next, I am going to display and analyze the distribution of residuals and coefficients.

#### Coeff1
![tobs-cum-dist-1.png](fig/tobs-cum-dist-1.png)

#### Coeff2
![tobs-cum-dist-2.png](fig/tobs-cum-dist-2.png)

#### Coeff3
![tobs-cum-dist-3.png](fig/tobs-cum-dist-3.png)

From the above figures, I draw the following conclusions, which are similar to the ones in section 4.3:
- First, the distribution of coeff1 is different from that of coeff2 and coeff3, while the distributions of coeff2 and coeff3 are similar. This is because coeff1 represents the overall distribution of the average temperature across the year, while coeff2 and coeff3 represent other aspects. So coeff1 can be considered a different kind of coefficient from coeff2 and coeff3.
- Second, the three distributions of residuals are very similar and can be considered approximately uniform, which is intuitively reasonable.

In addition, comparing the distributions for SNWD and TOBS leads to another conclusion:
- The interpretation of coeff1 is similar for SNWD and TOBS, and in both cases the distribution of coeff1 differs from that of coeff2 and coeff3. By analogy with SNWD, we can therefore infer that coeff2 and coeff3 for TOBS behave like those for SNWD and likely represent timing information of the average temperature.

### 6.4 Coefficients distribution on map

In this section, I have displayed the first coefficient (coeff1) for each station on the map. The result is shown in the following picture. 
![tobs-map.png](fig/tobs-map.png)

In the figure above, each circle represents a station. The bigger the circle, the more observations the station has; the warmer the color, the larger coeff1 is.

## 7 The variation in TOBS is due to both location-to-location and year-to-year variation

In the previous section, we found that the top 4 eigenvectors of TOBS explain more than 50% of the variance. We also looked at the variation of coeff1, which corresponds to the distribution of the average temperature across the year, with respect to location. In this section, I further analyze the importance of location-to-location variation relative to year-to-year variation.

### 7.1 Different coefficients

I am going to measure the variation using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. The results are shown as follows:

#### Coeff1
total RMS                    =  756.144877178  
RMS removing mean-by-station =  465.498142415 ,  fraction explained =  0.38437969169  
RMS removing mean-by-year    =  692.268018838 ,  fraction explained =  0.0844770099854  

#### Coeff2
total RMS                    =  215.448222468  
RMS removing mean-by-station =  182.265214649 ,  fraction explained =  0.15401848035  
RMS removing mean-by-year    =  151.823957338 ,  fraction explained =  0.295311162938  

#### Coeff3
total RMS                    =  173.416889085  
RMS removing mean-by-station =  165.933941754 ,  fraction explained =  0.0431500494024  
RMS removing mean-by-year    =  105.261889507 ,  fraction explained =  0.393012468034   

From the above results, we can see that for coeff1 the variation by station explains much more (about 38%) than the variation by year (about 8%). For coeff2 and coeff3 the situation is reversed: the variation by year explains about 30% and 39%, while the variation by station explains only about 15% and 4%. Taken together, location-to-location variation clearly dominates only coeff1, while the variation of coeff2 and coeff3 is mostly year-to-year.

## 8 Precipitation analysis

In this section, I analyze the precipitation. In general, correlations between the amount of rain on the same day at different stations are unlikely to be found, because the amount varies a lot between stations. It is more reasonable to compare whether or not it rained on the same day at different stations.

Therefore, for each pair of stations, I compute a P-value for a statistical test of the null hypothesis that the rainfall at the two locations is independent, and use the significance of this P-value to decide whether the null hypothesis can be rejected.
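
One possible test of this kind is sketched below. It is an illustration of the idea, not necessarily the test used for the figures in this section. Under the null hypothesis, if the two stations share `n_days` days of data with `rain_a` and `rain_b` rainy days, the number of days on which it rained at both follows a hypergeometric distribution, and the P-value is the probability of an overlap at least as large as the observed one. The function and parameter names are hypothetical.

```python
from math import comb

def overlap_p_value(n_days, rain_a, rain_b, overlap):
    """P(overlap >= observed) if rain days at the two stations are independent.

    n_days: days on which both stations reported
    rain_a, rain_b: number of rainy days at each station among those days
    overlap: number of days on which it rained at both stations
    """
    total = comb(n_days, rain_b)  # ways to place station B's rainy days
    upper = min(rain_a, rain_b)   # largest possible overlap
    tail = sum(comb(rain_a, k) * comb(n_days - rain_a, rain_b - k)
               for k in range(overlap, upper + 1))
    return tail / total

# 4 shared days, 2 rainy days at each station, both rainy days coincide
p_small = overlap_p_value(4, 2, 2, 2)
# An observed overlap of 0 is never surprising
p_one = overlap_p_value(10, 5, 5, 0)
```

A small P-value means that the observed overlap in rainy days would be unlikely if the two stations were independent.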

### 8.1 P-value distribution for each pair of stations

Here, I display the significance of the P-value for each pair of stations. The result is shown in the following picture.
![prcp-sign-station.png](fig/prcp-sign-station.png)

In the figure above, the largest significance is 0.4, which corresponds to pairs in which the two stations are the same. From the figure, we can see that most pairs have a low significance, but there are still many pairs whose significance is very high, which suggests that the rainfall at different stations may not be independent.

Further, I have displayed the normalized P-norm matrix below, in which the x and y axes give the index of each station. The darker the color, the more correlated the two stations are.
![prcp-pnorm-0.png](fig/prcp-pnorm-0.png)

From the figure above, we can see that the normalized log probability of the overlap in rain days looks mostly random, but the first 20 stations are clearly highly correlated with each other.

### 8.2 PCA analysis

In order to find more correlations, I use SVD in this section. In the following picture, we can see that the top 8 eigenvectors explain more than 80% of the squared magnitude of the matrix, which means that the stations are more correlated than it first appears, and that the rainfall at different stations is not independent.
![prcp-pca-var.png](fig/prcp-pca-var.png)

Further, the following pictures are the top 4 eigenvectors.
![prcp-pca-eig.png](fig/prcp-pca-eig.png)
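
The fraction of the squared magnitude explained by the top k components follows from the singular values, because the sum of the squared singular values equals the squared Frobenius norm of the matrix. A minimal sketch, assuming the significance matrix has no missing entries; the name `svd_energy` and the toy matrix are hypothetical.

```python
import numpy as np

def svd_energy(matrix):
    """Left singular vectors and cumulative fraction of squared magnitude."""
    U, s, Vt = np.linalg.svd(np.asarray(matrix, dtype=float))
    energy = s ** 2  # singular values come back in descending order
    return U, np.cumsum(energy) / np.sum(energy)

# Toy matrix with singular values 4 and 3
U, explained = svd_energy(np.diag([3.0, 4.0]))
```

Here the first component carries 16 / 25 = 64% of the squared magnitude.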

### 8.3 Re-order matrix

Using the results of section 8.2, I will re-order the rows and columns of significance matrix according to each of the eigenvectors. The results are shown as follows:
![prcp-pnorm-1.png](fig/prcp-pnorm-1.png)

In the figure above, the grouping of the stations becomes more evident. This makes it more likely that the rainfall at different stations is not independent.
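
Re-ordering the matrix according to an eigenvector can be sketched as follows: the stations are sorted by their entry in the eigenvector, and the same permutation is applied to both the rows and the columns. The function name and the toy values are hypothetical.

```python
import numpy as np

def reorder_by_vector(matrix, vector):
    """Sort rows and columns of a square matrix by the entries of a vector."""
    order = np.argsort(vector)  # station indices, smallest entry first
    M = np.asarray(matrix)
    return order, M[np.ix_(order, order)]

toy_matrix = np.arange(9).reshape(3, 3)
toy_vector = np.array([0.3, -0.5, 0.1])
order, reordered = reorder_by_vector(toy_matrix, toy_vector)
```

Stations with similar entries in the eigenvector end up next to each other, which is what makes groups of correlated stations show up as blocks.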