# Weather Analysis (border region between WA and ID)

This is a report on the <b>historical analysis of weather patterns</b> in an area that approximately overlaps the border region between Washington State and Idaho State, covering Moses Lake and Clearwater National Forest, whose center is located around Pullman, WA.

Our data comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).
(file_index: SBSBBSSB)

In this weather analysis homework, we make use of six major measurements:
* **TMIN:** Daily minimum temperature
* **TMAX:** Daily maximum temperature
* **TOBS:** Temperature at the time of observation
* **PRCP:** Daily Precipitation (mm)
* **SNOW:** Daily snowfall (mm)
* **SNWD:** Accumulated snow depth

## I. Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/pullman/washington/united-states/uswa0356" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month, for Pullman, WA, the center of our region.</p>

<p><img alt="Pullman.png" src="rfigure/Pullman.png" style="height:300px"/></p>

<p>The daily minimum and maximum temperatures in the above figure agree closely with those from our data. (Since the region we study is quite large, the slight differences are reasonable.)</p>

<p><img alt="TMIN&TMAX_mean+-std.png" src="rfigure/TMIN&TMAX_mean+-std.png" style="height:250px" /></p>

<p>The region we study is quite large, lying across two states and covering cities, suburbs and forests. According to the <a href="http://www.vidiani.com/average-precipitation-map-of-the-usa/">USA Average Precipitation Map</a>, the amount of rain varies a lot among stations in our region, even between two nearby stations. This is discussed in detail later, in the PRCP analysis part. 
<img alt="PRCPmap.png" src="rfigure/PRCPmap.png" style="height:200px" />
<br/>Therefore, the PRCP mean of our region does not closely match the precipitation reported for Pullman, WA, the center of our region.
<img alt="PRCP_mean+-std.png" src="rfigure/PRCP_mean+-std.png" style="height:250px" /></p>
<br/>
<b><u>Conclusion</u>: <i>Overall, our data can be used for further analysis.</i></b>

## II. PCA Analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.
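
As a minimal sketch of how such a curve can be computed, assume the eigenvalues of the covariance matrix are already available. The function name `percent_variance_explained` and the eigenvalues below are hypothetical and are not taken from the report's data.

```python
from itertools import accumulate

def percent_variance_explained(eigenvalues):
    """Cumulative percentage of variance explained by the top-k eigenvectors."""
    # Each eigenvalue is the variance along its eigenvector, so sort from
    # largest to smallest and accumulate.
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [100.0 * partial / total for partial in accumulate(ordered)]

# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
curve = percent_variance_explained(example_eigenvalues)
print(curve)  # entry k-1 is the percentage explained by the top k eigenvectors
```

A curve that rises quickly (as for SNWD) means a few eigenvectors describe most of the signal; a slowly rising curve (as for PRCP) means they do not.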

### Percentage of variance explained.
<p><img alt="VarExplained1.png" src="rfigure/VarExplained1.png" style="height:250px; width:900px" /></p>
We see that the top 5 eigen-vectors explain 26% of variance for TMIN, 49% for TOBS and 28% for TMAX.
<br/>
Among these three measurements, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 43% of the variance.
<br/>

<p><img alt="VarExplained2.png" src="rfigure/VarExplained2.png" style="height:250px; width:900px" /></p>

The top 5 eigenvectors explain 10% of the variance for PRCP and 14% for SNOW. Both are low values. However, the top 5 eigenvectors explain as much as 93% of the variance for SNWD. These top 5 eigenvectors capture almost all of the variation in the snow signals.
<br/>It is reasonable that SNWD would be less noisy than SNOW, because snow depth is the accumulated result of snowfall and varies less from day to day and between the same date in different years.
<br/><br/><b><u>Conclusion</u>: <i>We will dig deeper into the PCA analysis for snow-depth.</i></b>

## III. Analysis of Snow Depth

We choose to analyze the eigen-decomposition for snow-depth because the first 3 eigen-vectors explain over 90% of the variance, which indicates that they capture most of the structure in the snow-depth data.

First, we graph the mean and the top 3 eigen-vectors.

We observe that <b>the snow season is from November to the end of April, where February marks the peak of the snow-depth</b>.
<p><img alt="SNWD_mean_eigs.png" src="rfigure/SNWD_mean_eigs.png" style="height:400px; width:750px" /></p>

Next, from the shapes shown above, we attempt to interpret eig1 and eig2.
* **eig1:**
<br/>The absolute value of the first eigen-function (eig1) has a shape similar to the mean function. The interpretation of this shape is that eig1 (multiplied by a negative coefficient) can represent the total amount of snow in the whole year.

* **eig2:** 
<br/>eig2 does not vary a lot until November. Its shape is similar to the mean function from November to the end of December. The interpretation of this shape is that eig2 can represent the amount of snow during this specific period, <i>as a supplement to eig1</i>.

Further interpretation is discussed in the next part.

### Samples of reconstructions

#### coeff_1
<img alt="SNWD_recon_c0.png" src="rfigure/SNWD_recon_c0.png" style="width:max-width"/>
* **Coeff_1 represents the total amount of snow in the whole year.**
<br/>A larger absolute value of a negative coeff_1 corresponds to more snow during the whole year (snow season: from January to the end of April, and from November to the end of December). 
<br/>The above figure shows that the amount of snow is close to the mean when c0=0. 

#### coeff_2
<img alt="SNWD_recon_c1.png" src="rfigure/SNWD_recon_c1.png" style="height:220px"/>
* **Coeff_2 focuses on the amount of snow during a specific period (from November to the end of December).**
<br/>A larger absolute value of a positive coeff_2 corresponds to more snow during the period from November to the end of December.
<br/>The red line representing c1 sometimes overlaps with the cyan line of c2.

#### coeff_3
<img alt="SNWD_recon_c2.png" src="rfigure/SNWD_recon_c2.png" style="height:220px"/>
* **Coeff_3 is related to the timing of the end of snow season.**
<br/>A larger absolute value of a negative coeff_3 corresponds to a later end of the snow season.

#### A Sample of Best Reconstruction
coeff_1=-6431 coeff_2=3606 coeff_3=-101
<img alt="bestRecon.png" src="rfigure/bestRecon.png" style="height:220px"/>
<br/>
With appropriate values of coeff_1, coeff_2 and coeff_3, we get a shape that agrees with the target, which verifies that
<br/>
<b><u>Conclusion</u>: <i>the top 3 eigenvectors capture almost all of the variation in the snow signals</i>.</b>
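
The reconstruction idea can be sketched as follows: each coefficient is the projection of the mean-subtracted signal onto an eigenvector, and the reconstruction is the mean plus the coefficient-weighted eigenvectors. This is a toy illustration that assumes orthonormal eigenvectors. The helper names (`project`, `reconstruct`) and the 4-point vectors are hypothetical stand-ins for the 365-day signals.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def project(signal, mean, eigvecs):
    """Coefficients of (signal - mean) on each orthonormal eigenvector."""
    centered = [s - m for s, m in zip(signal, mean)]
    return [dot(centered, e) for e in eigvecs]

def reconstruct(mean, eigvecs, coeffs):
    """Approximate the signal as mean + sum_i coeffs[i] * eigvecs[i]."""
    result = list(mean)
    for c, e in zip(coeffs, eigvecs):
        result = [r + c * x for r, x in zip(result, e)]
    return result

# Toy data, for illustration only: a 4-day "year" and two orthonormal vectors.
mean = [1.0, 2.0, 2.0, 1.0]
eigvecs = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
signal = [3.0, 4.0, 2.0, 1.0]

coeffs = project(signal, mean, eigvecs)
approx = reconstruct(mean, eigvecs, coeffs)
print(coeffs, approx)
```

In this toy case the mean-subtracted signal lies entirely in the span of the two vectors, so the reconstruction is exact. With real data the residual shrinks as more eigenvectors are added.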

### Average Snow Depth with Respect to Location

#### Visualization on Map
<p>
As discussed above, coeff_1 represents the snow depth over the whole year, and coeff_2 additionally captures the snow depth from November to the end of December. We now show the influence of coeff_1 and coeff_2 directly on the map.
</p>

<p>
<img alt="map_avg_SNWD.png" src="rfigure/map_avg_SNWD.png" style="width:max-width*97%"/>
The color of each circle represents the snow depth during a period at the location. 
<br/>Warmer colors (close to red) indicate more snow, while cooler colors (close to blue) indicate less snow.
<br/>The size of each circle corresponds to the number of stations at the location. Data shown with larger circles is more reliable.
<br/>The right figure (coeff_2) is similar to the left one (coeff_1), so the analysis of average SNWD with respect to location depends mainly on coeff_1. 
<br/><br/>
<b><u>Conclusion</u>: <i>Areas in warmer colors tend to have heavier snow during the whole snow season</i>.</b> 
<br/>These areas are in suburban or forested regions, which seems to make sense.
</p>

#### Relationship between SNWD and elevation

It is common sense that snow is usually deeper at higher elevations. Now let us look at the relationship between coeff_1, coeff_2 and elevation.
<p>
<img alt="elevation.png" src="rfigure/elevation.png" style="height:250px"/>
As elevation increases, coeff_1 tends to decrease, while coeff_2 tends to increase.
<br/>Smaller coeff_1 and larger coeff_2 represent more snow. 
<br/><br/><b><u>Conclusion</u>: <i>As elevation increases, snow-depth tends to increase.</i></b>
<br/>This conclusion is consistent with our assumption above.
</p>
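
One way to put a number on the trend in the scatter plot is a Pearson correlation between elevation and a coefficient. The sketch below is illustrative only: the `pearson` helper and the elevation and coeff_1 values are hypothetical and do not come from the report's stations.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical station data, for illustration only.
elevation = [300.0, 600.0, 900.0, 1200.0]
coeff_1 = [-500.0, -1500.0, -2000.0, -4000.0]

r = pearson(elevation, coeff_1)
print(r)  # negative here: coeff_1 decreases as elevation increases
```

A negative correlation for coeff_1 would match the trend described above, since a more negative coeff_1 corresponds to more snow.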

### Year-to-year VS station-to-station variation
In the previous section we saw how coeff_1 and coeff_2, which correspond to the amount of snow, vary with location. We now estimate the importance of location-to-location variation relative to year-to-year variation.

These are measured using the fraction by which the RMS is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

**coeff_1**  
total RMS                    = 2548.79836597  
RMS removing mean-by-station = 1411.54603975, fraction explained = 44.6%  
RMS removing mean-by-year    = 2172.58783671, fraction explained = 14.8%  

**coeff_2**  
total RMS                    = 891.646509155  
RMS removing mean-by-station = 860.293896216, fraction explained = 3.5%  
RMS removing mean-by-year    = 497.017351854, fraction explained = 44.2%  

**coeff_3**  
total RMS                    = 659.849973909  
RMS removing mean-by-station = 610.112428488, fraction explained = 7.5%  
RMS removing mean-by-year    = 529.514148647, fraction explained = 19.8%  

<b><u>Conclusion</u></b>:
<br/>The fraction explained by <b><i>station-to-station variation</i></b> is more than that by year-to-year variation when considering <b><i>coeff_1</i></b>. 
<br/>The fraction explained by <b><i>year-to-year variation</i></b> is much more than that by station-to-station variation when considering <b><i>coeff_2</i></b> and <b><i>coeff_3</i></b>. 

Coeff_1 is related to the amount of snow over <i>the whole year</i>, so its variation between years is less pronounced than its variation between stations.
<br/>Coeff_2 focuses only on the snow during a specific period rather than the whole year, and this may vary more between years.
<br/>Coeff_3 deals with the timing of the end of the snow season, which varies between years.
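
The "fraction explained" figures above are consistent with 1 - (RMS after removing the group mean) / (total RMS). The sketch below shows that computation under the assumption that the coefficients are stored in a table keyed by (station, year). The table layout, the function names and the toy values are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def rms(values):
    """Root mean square of a collection of numbers."""
    values = list(values)
    return sqrt(sum(v * v for v in values) / len(values))

def fraction_explained(table, key_index):
    """Fraction by which the RMS drops after removing group means.

    table maps (station, year) -> coefficient; key_index 0 groups by
    station and key_index 1 groups by year.
    """
    groups = defaultdict(list)
    for key, value in table.items():
        groups[key[key_index]].append(value)
    means = {g: sum(vs) / len(vs) for g, vs in groups.items()}
    residuals = [value - means[key[key_index]] for key, value in table.items()]
    return 1.0 - rms(residuals) / rms(table.values())

# Hypothetical coefficients for two stations and two years.
toy = {("A", 2000): 10.0, ("A", 2001): 12.0,
       ("B", 2000): 30.0, ("B", 2001): 32.0}
by_station = fraction_explained(toy, 0)
by_year = fraction_explained(toy, 1)
print(by_station, by_year)

# Arithmetic check against the reported coeff_1 numbers (about 0.446).
reported_coeff_1_by_station = 1.0 - 1411.54603975 / 2548.79836597
```

In the toy table the stations differ much more than the years do, so removing the per-station mean explains the larger fraction, which is the same pattern reported for coeff_1.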

## IV. Analysis of Precipitation
<p>
From the PCA analysis, we know that the top 5 eigenvectors explain only 10% of the variance for PRCP. In fact, the top 10 eigenvectors explain less than 20% of the variance. Therefore, PCA may not be as meaningful for PRCP as it is for SNWD.
<img alt="PRCP_PCA.png" src="rfigure/PRCP_PCA.png" style="height:200px" />
<b><u>Conclusion</u>: <i>We will use other analyses instead of PCA for PRCP, for instance, the correlations between pairs of stations.</i></b>
</p>

### Overview of Precipitation Analysis 
<p>
<img alt="PRCP_CDF.png" src="rfigure/PRCP_CDF.png" style="height:200px" />
From the CDF of daily rainfall, we find that <b>it rains on about 30% of the days in our region</b>.
<img alt="map_avg_PRCP.png" src="rfigure/map_avg_PRCP.png" style="height:300px" />
This map is similar to the visualization map for SNWD; however, warmer colors (close to red) now correspond to less rainfall, while cooler colors (close to blue) represent more rainfall.
<br/><b><u>Conclusion</u>: <i>The amounts of rain at different stations in our region vary a lot, even between close locations</i>.</b> 
<br/>Therefore, finding correlations between the amounts of rain on the same day at different stations is difficult. It is more reasonable to compare whether or not it rained on the same day at different stations, but even in this case the correlations may not be very large.
</p>
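
Reading the rainy-day fraction off the CDF amounts to evaluating the empirical CDF at zero and taking the complement. The sketch below uses a hypothetical list of daily PRCP values, not the report's data.

```python
def empirical_cdf(values, threshold):
    """Fraction of observations that are less than or equal to threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)

# Hypothetical daily precipitation readings, for illustration only.
daily_prcp = [0.0, 0.0, 0.0, 1.2, 0.0, 0.0, 5.0, 0.0, 0.3, 0.0]

dry_fraction = empirical_cdf(daily_prcp, 0.0)   # CDF at zero rainfall
rainy_fraction = 1.0 - dry_fraction             # days with any rain
print(dry_fraction, rainy_fraction)
```

Missing readings would need to be dropped before calling `empirical_cdf`, otherwise they distort the denominator.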

### PRCP correlations among stations in our region
<p>
We can calculate <b>normalized log probability</b> for each pair of stations to study their correlations.
<img alt="PRCP_significance.png" src="rfigure/PRCP_significance.png" style="height:200px" />
(We set the log probability of a station paired with itself to -0.4.)
<br/>The distribution of log probabilities over pairs of stations (most significance values are around or even less than 0.10) shows that the correlations between pairs of stations in our region are not large.
<img alt="PRCP_showmat.png" src="rfigure/PRCP_showmat.png" style="height:400px" />
The x-axis and y-axis represent the indices of stations. A darker point indicates a larger correlation between the corresponding two stations.
<br/>The first 16 stations, especially the first 8, are highly correlated with each other, while the other stations are not.
<br/><br/>When we reorder the rows and columns of the matrix using one of the eigenvectors, the <b>grouping and clustering</b> of the stations becomes more evident. (Cool colors (close to blue) indicate high correlation.)
<img alt="PRCP_matshow.png" src="rfigure/PRCP_matshow.png" style="height:800px" />
Many stations have NaN data, so there are many red points (log P=nan) in the matshow above.
<br/><br/>
<b><u>Conclusion</u>: <i>The precipitation records of several (around 20) stations in our region are correlated with each other. However, those of the other stations are not</i>.</b>
</p>
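
The report does not spell out how the normalized log probability is computed, so the following is only one plausible sketch and not necessarily the method used here. It assumes a null hypothesis in which the rainy days of two stations are independent, takes the hypergeometric tail probability of seeing at least the observed number of shared rainy days, and divides its logarithm by the number of days. The function name and the toy records are hypothetical.

```python
from math import comb, log

def normalized_log_prob(rain_a, rain_b):
    """Log tail probability of the observed overlap, divided by the day count.

    rain_a and rain_b are equal-length lists of booleans (rained or not)
    covering only the days on which both stations reported data.
    """
    m = len(rain_a)
    n1 = sum(rain_a)
    n2 = sum(rain_b)
    n12 = sum(1 for a, b in zip(rain_a, rain_b) if a and b)
    # Ways to place station B's n2 rainy days so that at least n12 of them
    # fall on station A's n1 rainy days, out of all placements among m days.
    tail = sum(comb(n1, k) * comb(m - n1, n2 - k)
               for k in range(n12, min(n1, n2) + 1))
    total = comb(m, n2)
    # Logs of exact integers avoid floating-point underflow for long records.
    return (log(tail) - log(total)) / m

# Toy records, for illustration only.
a = [True, True, True, False, False, False, False, False]
b = [True, True, True, False, False, False, False, False]   # same days as a
c = [True, False, False, True, False, False, True, False]   # one shared day

same = normalized_log_prob(a, b)
weak = normalized_log_prob(a, c)
print(same, weak)
```

Under this definition, a more negative value means the overlap is less likely to arise by chance, that is, a stronger association between the two stations.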