# Tucson Weather Analysis

## Contents
* Area Introduction
* Data Description
 * Sanity Check on the data
* PCA Analysis
* Analysis of Snow Depth
* Analyze whether SNWD varies more from year to year or from place to place
* Geographical PRCP distribution of first 4 coefficients
* Calculate normalized log probability of PRCP for each pair of stations
* Conclusion

# Area Introduction - Tucson, Arizona
This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of Tucson, Arizona.

Plotting the stations on Google Maps shows that the dataset was collected from stations in Arizona. On closer inspection, most of the stations are in the city of Tucson, so the analysis focuses on the weather in Tucson, Arizona.
<img alt="zoom out map" src="images/zoomOutMap.png"/>
<img alt="zoom in map" src="images/zoomInMap.png"/>

According to the [Wikipedia page](https://en.wikipedia.org/wiki/Tucson,_Arizona#Climate) for Tucson, Arizona, Tucson has a desert climate, where hot summers and temperate winters are the norm. However, Tucson is almost always cooler and wetter than nearby areas such as Phoenix because of its higher elevation (2,643 ft or 806 m above sea level).
<img alt="Tucson city view" src="https://cdn.thecadencetucson.com/wp-content/themes/peak-template-properties-cadence/images/the-cadence-tucson-az-chevron.jpg" style="height:350px;"/>
<p style="text-align:center;"> Skyline from the cadence apartment in Tucson </p>


# Data Description

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day (in GHCN-Daily this is a single reading, not a daily average).
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/tucson/arizona/united-states/usaz0247" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<img alt="tucson_climate_graph.jpg" src="images/tucson_climate_graph.png" style="height:500px; width:500px;"/> 
<img alt="TMIN,TMAX.jpg" src="images/tucson_temp_min_max.png"/>

<p>We see that the min and max daily&nbsp;temperatures mostly agree with the ones we got from our data. However, the mean of TMIN in our dataset is slightly lower than what we find in US Climate Data. This is reasonable, considering that a few stations are located in the mountains to the north and east of Tucson: the higher the elevation, the lower the temperature.</p>

<p>To compare the precipitation, we plot the PRCP statistics from our data:</p>

<p>&nbsp;<img alt="PRCP.jpg" src="images/tucson_prcp.png" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.

### Percentage of variance explained.
![VarExplained1.png](images/VarExplained1.png)
We see that the top 5 eigen-vectors explain 47% of variance for TMIN, 60% for TOBS and 46% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 53% of the variance.
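The curve of variance explained against the number of eigen-vectors can be sketched as follows. This is a minimal illustration on synthetic data, not the code used to produce the figures: the function name `variance_explained`, the synthetic seasonal signal, and the assumption that each row is a complete year without missing days are all ours.

```python
import numpy as np

def variance_explained(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    data: 2-D array, one row per (station, year), one column per day of year.
    Missing-value handling is omitted; this sketch assumes complete rows.
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / centered.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

rng = np.random.default_rng(0)
# Synthetic stand-in for a real measurement: one strong seasonal mode plus noise.
days = np.arange(365)
season = np.sin(2 * np.pi * days / 365)
data = rng.normal(size=(200, 1)) * 10 * season + rng.normal(size=(200, 365))

explained = variance_explained(data)
print(explained[:5])  # fraction explained by the top 1..5 eigenvectors
```

Reading off `explained[4]` gives the "top 5 eigen-vectors" figure quoted for each measurement, under the assumptions above.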

![VarExplained2.png](images/VarExplained2.png)

The top 5 eigenvectors explain 8.5% of the variance for PRCP and 15.6% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain about 88% of the variance for SNWD, which means that they capture most of the variation in the snow-depth signal. Based on that, we will dig deeper into the PCA analysis for snow depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
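To illustrate the "decaying integral" idea, here is a small sketch in which snow depth is modeled as yesterday's depth shrunk by a constant factor plus today's snowfall. The decay factor of 0.9, the synthetic snowfall series, and the `roughness` measure are assumptions chosen for illustration, not properties of the dataset.

```python
import numpy as np

def decaying_integral(snow, decay=0.9):
    """Running sum of snowfall in which the previous depth shrinks by `decay` each day."""
    depth = np.zeros(len(snow))
    level = 0.0
    for day, fall in enumerate(snow):
        level = decay * level + fall
        depth[day] = level
    return depth

def roughness(x):
    """Day-to-day change relative to the overall spread of the series."""
    return np.std(np.diff(x)) / np.std(x)

rng = np.random.default_rng(1)
# Hypothetical snowfall: mostly dry days with occasional bursts.
snow = rng.exponential(20.0, size=120) * (rng.random(120) < 0.15)
depth = decaying_integral(snow)

print(roughness(snow), roughness(depth))
```

In this toy model the depth series changes much less from one day to the next than the snowfall series does, which is the behavior described above.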

## Analysis of Snow Depth (SNWD)

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 84% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season is from mid-November to the beginning of April, where the start of January marks the peak of the snow-depth.
![SNWD_mean_eigs.png](images/SNWD_mean_eigs.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape similar to the mean function. The main difference is that the eigen-function is close to zero during November and December while the mean is not.  The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2 and eig3** are similar in the following way: both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow in late December, more snow in January.
* **eig3:** more snow in March, less snow in February, slightly more snow in December.


### Examples of reconstructions

**Error rate threshold = 40%.**

#### Coeff1
![SNWD_grid_coeff1.png](images/SNWD_c1.png)

We find that if the data is reconstructed using only eig1, only three stations can be reconstructed with an error below the threshold. The figures are reasonable because large positive coefficients correspond to higher-than-average snow. However, note that there are no correctly reconstructed examples with negative coefficients.
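A reconstruction from the mean and a few eigen-vectors, together with its error, can be sketched as below. We assume here that the "error rate" is the residual variance divided by the variance of the mean-removed signal; the report does not spell out the definition, so treat that as our reading. The orthonormal vectors are random placeholders rather than the actual SNWD eigen-vectors.

```python
import numpy as np

def reconstruct(signal, mean, eigvecs):
    """Project a mean-removed signal on orthonormal eigvecs (shape: n_days x k)."""
    centered = signal - mean
    coeffs = eigvecs.T @ centered
    approx = mean + eigvecs @ coeffs
    # Assumed error definition: residual variance over mean-removed variance.
    error = np.sum((signal - approx) ** 2) / np.sum(centered ** 2)
    return coeffs, approx, error

rng = np.random.default_rng(3)
n_days = 365
# Hypothetical orthonormal "eigenvectors" from a QR factorization; the real ones
# would come from the SNWD covariance matrix.
eigvecs, _ = np.linalg.qr(rng.normal(size=(n_days, 2)))
mean = np.full(n_days, 5.0)
signal = mean + 3.0 * eigvecs[:, 0] + 0.5 * eigvecs[:, 1]

THRESHOLD = 0.4  # the 40% error rate threshold used in this section
coeffs1, _, error1 = reconstruct(signal, mean, eigvecs[:, :1])
coeffs2, _, error2 = reconstruct(signal, mean, eigvecs)
print(coeffs1, error1, error1 < THRESHOLD)
print(coeffs2, error2)
```

A station-year passes the threshold when `error1 < THRESHOLD`, that is, when eig1 alone accounts for most of its deviation from the mean.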

#### Coeff2
![SNWD_grid_coeff2.png](images/SNWD_c2.png)

Large positive values of coeff2 correspond to a late snow season around January. Negative values for coeff2 correspond to an early snow season around December.
#### Coeff3
![SNWD_grid_coeff3.png](images/SNWD_c3.png)

Large positive values of coeff3 correspond to more snow in March. Negative values of coeff3 correspond to more snow in February.

### Discussion

Even with over 80% of the variance explained, the top three eigenvectors still do not represent the snow-depth pattern in the Tucson area well. This may be because snow falls only in the mountains while most of the stations are in the city, so very few stations show a snow pattern.




## Analyze whether SNWD varies more from year to year or from place to place.
Here is the result of calculating the root mean square (RMS) of the coefficients of the first eigenvector across different years and different stations:

<img alt="SWND_RMS.png" src="images/SNWD_RMS.png"/>

We can infer that the difference is larger when removing the mean by year, which we read as the weather varying less from year to year than from station to station. This can be explained by how scattered the stations in this dataset are: the weather changes dramatically between the city and the nearby mountains.
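The comparison can be sketched as follows: arrange the first coefficient in a station-by-year table, then compare the RMS before and after subtracting the per-station mean or the per-year mean. The table here is synthetic, with a station effect that we deliberately made larger than the year effect; the function names and effect sizes are assumptions for illustration only.

```python
import numpy as np

def rms(x):
    """Root mean square, ignoring missing entries."""
    return float(np.sqrt(np.nanmean(np.asarray(x) ** 2)))

def rms_after_removing_means(table):
    """table: rows = stations, columns = years, entries = coeff_1 (NaN if missing)."""
    total = rms(table)
    by_station = rms(table - np.nanmean(table, axis=1, keepdims=True))
    by_year = rms(table - np.nanmean(table, axis=0, keepdims=True))
    return total, by_station, by_year

rng = np.random.default_rng(4)
n_stations, n_years = 30, 40
table = (5.0 * rng.normal(size=(n_stations, 1))        # station effect (assumed large)
         + 1.0 * rng.normal(size=(1, n_years))         # year effect (assumed small)
         + 0.5 * rng.normal(size=(n_stations, n_years)))

total, by_station, by_year = rms_after_removing_means(table)
print(total, by_station, by_year)
```

Whichever subtraction lowers the RMS more identifies the factor (station or year) that accounts for more of the variation in the coefficient.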

## Geographical PRCP distribution of first 4 coefficients. 

The map below visualizes the first four coefficients at each station as four triangles, and shows that certain areas have similar coefficients. The size of a triangle indicates the magnitude of the value; a triangle is color-filled when the value is positive and unfilled when it is negative.

<img alt="PRCP_map.png" src="images/tucson_triangles.png"/>


## Calculate normalized log probability of PRCP for each pair of stations.

For each pair of stations, we ask whether it rains at both on the same days, and then compute the associated probability.
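One way to make this concrete is sketched below. It assumes a null hypothesis in which the rainy days of the two stations are placed independently, so the number of shared rainy days follows a hypergeometric distribution; the score is the log of the probability of seeing at least the observed overlap, divided by the number of days. This formulation and the function name are our assumptions, since the report does not give the exact formula.

```python
from math import comb, log

def normalized_log_prob(rain_a, rain_b):
    """log P(overlap >= observed | independent rain days) / number of days.

    rain_a, rain_b: equal-length sequences of booleans (True = rain that day).
    """
    n = len(rain_a)
    m_a = sum(rain_a)
    m_b = sum(rain_b)
    k = sum(a and b for a, b in zip(rain_a, rain_b))
    # Hypergeometric tail: choose which m_b of the n days are rainy at B and
    # count the choices with at least k of them falling on A's rainy days.
    total = comb(n, m_b)
    tail = sum(comb(m_a, j) * comb(n - m_a, m_b - j)
               for j in range(k, min(m_a, m_b) + 1))
    return (log(tail) - log(total)) / n

# Hypothetical 40-day records with 10 rainy days each.
rain_a = [day < 10 for day in range(40)]
rain_same = list(rain_a)                          # rains on exactly the same days
rain_mixed = [day % 4 == 0 for day in range(40)]  # overlap close to chance level

score_same = normalized_log_prob(rain_a, rain_same)
score_mixed = normalized_log_prob(rain_a, rain_mixed)
print(score_same, score_mixed)
```

Under this definition a strongly negative score means the overlap is very unlikely to be a coincidence, while a score near zero is consistent with independent rain days.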

Here is the result of the visualization matrix. The darker the element is, the higher the probability the pair has.

<img alt="PRCP_mat.png" src="images/PRCP_mat.png" style="height:500px;"/>

The first 80 stations clearly show some degree of correlation. We can infer that these are probably stations close to each other in Tucson, so they have relatively similar rainy days.

We can further use Singular Value Decomposition (SVD) to look for more features.

<img alt="PRCP_SVD.png" src="images/PRCP_SVD.png" style="height:500px;"/>

Again, there is strong clustering among these stations. Although we cannot be sure what drives the clusters (e.g. spatial relations, rainfall amounts, or rainy seasons), the first 80 stations and the last 40 stations clearly have something in common.
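The way SVD exposes this kind of grouping can be sketched on a toy matrix. The 8-station score matrix below is hypothetical (two groups with strong scores inside each group and weak scores across groups), and sorting by the second singular vector is just one simple way to reveal the blocks, not necessarily the procedure behind the figure.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical pairwise score matrix: 8 stations in two groups (5 + 3).
groups = np.array([0, 0, 0, 0, 0, 1, 1, 1])
scores = np.where(groups[:, None] == groups[None, :], 1.0, 0.1)
scores = scores + 0.01 * rng.normal(size=scores.shape)
scores = (scores + scores.T) / 2          # keep the matrix symmetric

U, s, Vt = np.linalg.svd(scores)
# The two leading singular vectors carry the block structure; sorting stations
# by the second one places members of the same group next to each other.
order = np.argsort(U[:, 1])
print(s[:3])
print(groups[order])
```

With real data the groups are not known in advance, so the sorted order (or a plot of the leading singular vectors) is what suggests which stations belong together.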

## Conclusion

In this report, we take a deeper look at the weather data for Tucson and its nearby area. We first check the dataset against an existing outside source. Based on the PCA analysis, we choose to look closer at snow depth (SNWD) patterns in the area, only to find that the pattern is not shared across the region but is limited to the mountains where it snows. We then find that the weather varies more between stations than between different years. Finally, we look at precipitation (PRCP), specifically the spatial relationships of stations on the map, and find clusters using the dependency matrix.
