# Weather Analysis Between Palm Springs, CA and Phoenix, AZ

## Introduction

The purpose of this report is to provide a historical analysis of the weather patterns gathered from climate stations between Palm Springs, California and Phoenix, Arizona. The picture below shows the locations of all the weather stations in the assigned geographic area. The weather patterns cover temperature, rain, and snow. We will do some exploratory analysis to determine which feature(s) show promise in producing statistically significant inferences and spend more time on those.


<p><img alt="leaflet_station.png" src="r_figures/leaflet_station.png" width="100%" height="100%"/></p>


## Data Information

The data was gathered from [NOAA](https://www.ncdc.noaa.gov/) using the [FTP](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/) site. The file_index used for this report is 'SSSSSSBB'.

A description pulled from the NOAA site of the data:

<i>"The Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. The data are obtained from more than 20 sources. Some data are more than 175 years old while others are less than an hour old. GHCN is the official archived dataset, and it serves as a replacement product for older NCEI-maintained datasets that are designated for daily temporal resolution (i.e., DSI 3200, DSI 3201, DSI 3202, DSI 3205, DSI 3206, DSI 3208, DSI 3210, etc.)."</i>

The data focuses on 6 features:
* **TMIN:** the daily minimum temperature
* **TMAX:** the daily maximum temperature
* **TOBS:** the temperature at the time of observation
* **PRCP:** the daily precipitation (in mm)
* **SNOW:** the daily snowfall (in mm)
* **SNWD:** the depth of accumulated snow

The data has some inconsistencies. Not all years / days are available for each station, which leaves some NaN values scattered throughout the data. This was accounted for when running the algorithms. For some of the graphs, the values had to be rescaled because the units (mm) were off by a factor of 10.

## Exploratory Analysis
### Validating Data

Because there have been reports of corrupted data from the source, it is important to start with some confidence that the data we pulled matches the region it is identified as. Since the region was identified as lying between Palm Springs and Phoenix, we can pull data from an external source to see if it is similar.

The graph below was pulled from [U.S. climate data](http://www.usclimatedata.com/climate/phoenix/arizona/united-states/usaz0166/2017/1). It gives a continuous line of the minimum and maximum temperatures over the course of a year, along with the total precipitation for each month.


<img alt="arizona_climograph.png" src="r_figures/arizona_climograph.png" width="50%" height="50%"/>

The three graphs below were generated from the weather data across all stations: the mean minimum temperature (+std), the mean maximum temperature (+std), and the mean precipitation (+std).

<img alt="mean_tmin_tmax.png" src="r_figures/mean_tmin_tmax.png" width="65%" height="65%"/><img alt="mean_prcp.png" src="r_figures/mean_prcp.png" width="35%" height="35%" />

The external graph matches the ones generated from the weather data when it comes to mean min/max temperature. The means differ by a few degrees but retain a similar shape: the temperature increases until July/August and then decreases, almost symmetrically. The precipitation is harder to compare, since in our generated graph it barely rains (~1 mm/day). However, if we consider that 1 mm/day in January translates to roughly 30 mm/month, then the weather data precipitation graph matches up fairly well with the U.S. climate data precipitation graph. Thus we are confident that our data has not suffered any corruption.
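As a rough check of that daily-to-monthly conversion, the sketch below turns a mean daily precipitation into an approximate monthly total. The function name and the choice of 2017 as the example year are illustrative assumptions, not part of the original analysis.

```python
import calendar


def daily_mean_to_monthly_total(mean_mm_per_day, year, month):
    """Approximate a monthly total (mm) from a mean daily value (mm/day)."""
    # monthrange returns (weekday of first day, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    return mean_mm_per_day * days_in_month


# Hypothetical example: about 1 mm/day in January is roughly 30 mm for the month.
january_total = daily_mean_to_monthly_total(1.0, 2017, 1)
print(january_total)
```

This only rescales a daily mean, so it is suitable for an order-of-magnitude comparison against a monthly climograph and nothing more precise.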


### PCA: Percentage of Variance Explained

We look at the percentage of variance explained as a function of the number of eigenvectors. The idea is that if a small number of eigenvectors can reconstruct the original signal, it might be worthwhile to do additional analysis, since it is probable that we can describe the data with a simpler model (fewer dimensions) and explain it more easily.
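A minimal sketch of how such a curve can be computed, assuming the eigenvalues of the covariance matrix are already available: the fraction of variance explained by the top k eigenvectors is the sum of the k largest eigenvalues divided by the sum of all of them. The function name and the eigenvalues below are made up for illustration.

```python
from itertools import accumulate


def cumulative_explained_variance(eigenvalues):
    """Return the fraction of variance explained by the top 1, 2, ... eigenvectors."""
    # Sort so the largest eigenvalues (most variance) come first.
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return [running / total for running in accumulate(ordered)]


# Hypothetical eigenvalues, not taken from the weather data.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
fractions = cumulative_explained_variance(example_eigenvalues)
print(fractions)
```

Reading off the fifth entry of such a list gives the "top 5 eigenvectors" figure quoted for each feature.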

![varcap_temp.png](r_figures/varcap_temp.png)

For the temperature features, the top 5 eigenvectors explain 25% of the variance for TMIN, 48% for TOBS, and 7.5% for TMAX. Of the 3 temperature features, TOBS is explained best by the top 5 eigenvectors.

![varcap_snowprcp.png](r_figures/varcap_snowprcp.png)

For the precipitation and snow features, the top 5 eigenvectors explain 50% of the variance for SNOW, 87% for SNWD, and 13% for PRCP. The variance explained by the PRCP eigenvectors is very low compared to the snow-related (SNOW, SNWD) eigenvectors. SNWD likely performs better than SNOW because it has less overall variance: snow-depth takes some time to decay (melt), so it is more likely to be stable on similar days across many different years.

## Analysis of SNWD (Snow-Depth)

From the previous section, we noted that SNWD has a very high explained variance of 87% when using 5 eigenvectors. Since all the other features have an explained variance of 50% or less, we choose the SNWD feature for our in-depth analysis.

### PCA: Interpreting the Eigenvectors

To start the analysis, we first do an eigendecomposition for snow-depth and attempt to interpret the eigenvectors. The mean and the first 3 eigenvectors are plotted below.

![snwd_mean.png](r_figures/snwd_eigen.png)

From the graphs, we make the following observations based on each component's contribution to the distribution:

- Mean: indicates that the positive snow-depth season is limited to mid-November through the beginning of February
- Eig1: more snow-depth in January
- Eig2: less snow-depth in November
- Eig3: less snow-depth in December

### PCA: Interpreting the Coefficients

Following up the eigenvector interpretation, we can look at the most positive/negative coefficients of each eigenvector to get an idea of what kinds of distributions they correspond to.

#### Coeff1 - c0 (orange line in pictures) 
Coeff1: most positive
![coeff1_max.png](r_figures/coeff1_max.png)
Coeff1: most negative
![coeff1_min.png](r_figures/coeff1_min.png)
Positive values of coeff1 (c0) correspond to more snow-depth from mid-November to the beginning of February. Negative values correspond to more than average snow-depth in December.

#### Coeff2 - c1 (green line in pictures) 
Coeff2: most positive
![coeff2_max.png](r_figures/coeff2_max.png)
Coeff2: most negative
![coeff2_min.png](r_figures/coeff2_min.png)
Interestingly, this seems to be the complete opposite of Coeff1 (c0). Positive values correspond to more than average snow-depth in December. Negative values of coeff2 (c1) correspond to more snow-depth from mid-November to the beginning of February.

#### Coeff3 - c2 (purple line in pictures) 
Coeff3: most positive
![coeff3_max.png](r_figures/coeff3_max.png)
Coeff3: most negative
![coeff3_min.png](r_figures/coeff3_min.png)

This graph is a bit misleading because of the scale and because there is only one negative value (so most of the plots actually show positive values). Positive values of coeff3 (c2) correspond to little to no snow-depth all year (the first positive plot actually has a small magnitude). The only negative value corresponds to a large amount of snow-depth in December.

### Calculating Residual Variance

Earlier we determined that the explained variance of the snow-depth was very high. We can now test whether that property results in a small residual variance over the SNWD data. The residual variance we measure in the following plot is `res_3`, which is the residual variance after the mean, eig1, and eig2 are subtracted out. We plot a cumulative distribution function and find that the residual variance is around 20% for a majority of the samples, which corresponds to the explained variance of around 80% for the first 3 eigenvectors.

![culmdist_res3.png](r_figures/culmdist_res3.png)
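One way to compute a residual variance of this kind for a single station-year is sketched below. It assumes the eigenvectors are orthonormal, that the vector has no missing values, and that the residual is reported as a fraction of the variance left after removing the mean. The function name and the toy vectors are hypothetical and are not the report's actual code.

```python
def residual_variance_fraction(signal, mean, eigenvectors):
    """Fraction of the mean-removed signal energy left after projecting out eigenvectors."""
    centered = [s - m for s, m in zip(signal, mean)]
    total = sum(c * c for c in centered)
    if total == 0:
        return 0.0  # nothing left to explain
    residual = list(centered)
    for vec in eigenvectors:
        # Coefficient of the centered signal along this (orthonormal) eigenvector.
        coeff = sum(c * v for c, v in zip(centered, vec))
        residual = [r - coeff * v for r, v in zip(residual, vec)]
    return sum(r * r for r in residual) / total


# Toy 3-day example with made-up numbers.
toy_signal = [3.0, 4.0, 0.0]
toy_mean = [0.0, 0.0, 0.0]
one_vector = residual_variance_fraction(toy_signal, toy_mean, [[1.0, 0.0, 0.0]])
two_vectors = residual_variance_fraction(
    toy_signal, toy_mean, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
)
print(one_vector, two_vectors)
```

Collecting this fraction for every sample and sorting the values gives the points of a cumulative distribution plot like the one above.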

### Determining Station or Day Significance

A good question is whether the snow-depth varies significantly depending on the station or on the day. For a simple statistical analysis, we can compare the RMS values of the snow-depth after removing the influence of the station (remove the mean by station) and after removing the influence of the day (remove the mean by day).

- total RMS = 1.18224004315
- RMS removing mean-by-station = 1.17863438012
- RMS removing mean-by-day = 1.16098597698

Notice that the RMS values (with means removed) are close to each other, which may signify that neither condition changes snow-depth significantly more than the other. In fact, the total RMS (without the mean removed) is about the same as the RMS values with means removed, which indicates that the station and day are not major contributing factors to snow-depth.
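The comparison can be sketched as follows, assuming the data is held as `(station, day, value)` records with missing values already dropped. The record layout, function names, and numbers are illustrative assumptions rather than the code used for the figures above.

```python
from collections import defaultdict
from math import sqrt


def rms(values):
    """Root mean square of a list of numbers."""
    return sqrt(sum(v * v for v in values) / len(values))


def remove_group_mean(records, key_index):
    """Subtract the mean of each group (keyed by station or day) from its values."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        sums[rec[key_index]] += rec[2]
        counts[rec[key_index]] += 1
    return [rec[2] - sums[rec[key_index]] / counts[rec[key_index]] for rec in records]


# Made-up records: (station, day, snow-depth value).
records = [("A", 1, 0.0), ("A", 2, 2.0), ("B", 1, 4.0), ("B", 2, 6.0)]
total_rms = rms([rec[2] for rec in records])
rms_without_station_mean = rms(remove_group_mean(records, 0))
rms_without_day_mean = rms(remove_group_mean(records, 1))
print(total_rms, rms_without_station_mean, rms_without_day_mean)
```

In this toy data the RMS drops sharply once the station mean is removed, which is what a strong station effect would look like. The near-identical values reported above show no such drop.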


### Determining Daily Snow-Depth in Region

We can use a CDF to give us a rough idea of how often there is positive snow-depth in the region.

![cdf_dailysd.png](r_figures/cdf_dailysd.png)

This suggests a possible reason why all the RMS values are so low and so similar. The CDF graph indicates that the region has no snow-depth (a value of 0) on nearly 100% of the days.

### Correlate Stations for Snow At Any Depth

Although the CDF indicated there might not be very many snow-depth days, it might be beneficial to check whether there was snow at any depth on the same day at different stations across our dataset. We can formulate a null hypothesis that positive snow-depth at two locations is independent. To test it, we used a probability function that calculates the probability that the number of overlap days is l given m, n1, and n2, where:

* m : total # of days (overlapping days at both stations)
* n1 : # of days with positive snow-depth at station 1
* n2 : # of days with positive snow-depth at station 2
* l : # of days with positive snow-depth at both stations

The probability function we use is

$$P = \frac{\binom{m}{l,\; n_1-l,\; n_2-l,\; m-n_1-n_2+l}}{\binom{m}{n_1}\binom{m}{n_2}}$$

To simplify the equation, we will use $\log(P)$ and scale that to $\frac{1}{m}\log(P)$, since the log probability scales with the length of the overlap.
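A minimal sketch of the scaled log-probability, using log-factorials so that large day counts do not overflow. The function names are hypothetical, and the example counts are made up rather than taken from the station data.

```python
from math import exp, lgamma


def log_factorial(k):
    """log(k!) computed through the log-gamma function."""
    return lgamma(k + 1)


def log_multinomial(total, parts):
    """log of the multinomial coefficient total! / (parts[0]! * parts[1]! * ...)."""
    return log_factorial(total) - sum(log_factorial(p) for p in parts)


def normalized_log_prob(m, n1, n2, l):
    """(1/m) * log P of seeing l overlap days under the independence hypothesis."""
    rest = m - n1 - n2 + l
    if min(l, n1 - l, n2 - l, rest) < 0:
        raise ValueError("counts are inconsistent")
    log_p = (
        log_multinomial(m, [l, n1 - l, n2 - l, rest])
        - log_multinomial(m, [n1, m - n1])  # log of (m choose n1)
        - log_multinomial(m, [n2, m - n2])  # log of (m choose n2)
    )
    return log_p / m


# Made-up counts: 10 shared days, 3 and 4 positive days, 2 of them overlapping.
example_value = normalized_log_prob(10, 3, 4, 2)
print(example_value)
```

A value close to 0 means the observed overlap is unsurprising under independence, while a strongly negative value means the overlap would be very unlikely if the stations were independent.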

The probabilities are computed for every pair of stations in the dataset, so we can plot a histogram that gives us an idea of how many significant pairings there are. The histogram indicates that there are no significant values; the 0.4 values correspond to comparing a station with itself. So we fail to reject the null hypothesis: we cannot disprove that positive snow-depth at two (different) locations is independent.

![hist_significance.png](r_figures/hist_significance.png)

If we plot a dependency matrix using the probabilities, we can see there are no significant values (all white except the diagonal); the 0.4 values lie on the diagonal of the matrix, which compares each station with itself.

<img alt="dependency_matrix.png" src="r_figures/dependency_matrix.png" width="35%" height="35%"/>


### Determine Geographic Significance

The last thing we can analyze in the snow-depth data is the geographic data, namely the latitude, longitude, and elevation. The picture below shows every station where there is positive snow-depth; the opacity has been tuned so that darker dots mark stations with multiple positive snow-depth days over the entire dataset.

<p><img alt="leaflet_elevation.png" src="r_figures/leaflet_elevation.png" width="100%" height="100%"/></p>

From the picture, we notice that the stations are spread out across longitude and latitude, so there doesn't seem to be anything we can infer from those directions. However, on closer inspection, many of the stations are in brown mountainous areas of the map, which usually indicate higher elevation.

If we plot the elevations of the stations that had positive snow-depth on a boxplot of all the SNWD elevation data, we notice that most of the positive snow-depth stations are above the upper quartile (which marks the highest 25% of the data), and some are even above the maximum line, which makes them outliers. This indicates that stations with higher elevations than the norm are more likely to have a positive snow-depth.

<p><img alt="boxplot_elevation.png" src="r_figures/boxplot_elevation.png" width="50%" height="50%"/></p>
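The outlier check behind such a boxplot can be sketched as below. It assumes the common convention that the upper whisker extends at most 1.5 times the interquartile range above the upper quartile. The report does not state which whisker rule its plot used, and the station names and elevations here are made up.

```python
from statistics import quantiles


def upper_fence(values, whisker=1.5):
    """Upper quartile plus `whisker` times the interquartile range."""
    q1, _median, q3 = quantiles(list(values), n=4, method="inclusive")
    return q3 + whisker * (q3 - q1)


def stations_above_fence(elevations, whisker=1.5):
    """Names of stations whose elevation is above the upper fence (boxplot outliers)."""
    fence = upper_fence(elevations.values(), whisker)
    return [name for name, elev in elevations.items() if elev > fence]


# Hypothetical elevations in meters, keyed by made-up station names.
example_elevations = {
    "station_a": 100.0,
    "station_b": 200.0,
    "station_c": 300.0,
    "station_d": 400.0,
    "station_e": 500.0,
    "station_f": 2000.0,
}
high_outliers = stations_above_fence(example_elevations)
print(upper_fence(example_elevations.values()), high_outliers)
```

Comparing the list of flagged stations with the list of stations that recorded positive snow-depth is one way to quantify the visual impression from the boxplot.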


## Conclusions

After performing analysis on the SNWD snow-depth data for the Palm Springs and Phoenix region, it was determined that:
- When PCA was performed, SNWD had the highest explained variance of the features examined (about 80% with the first 3 eigenvectors) and a very low cumulative residual variance for `res_3`
- The eigenvectors were characterized by sharp spikes and seem to correspond directly to more/less snow-depth in a particular month
- The highly positive/negative coefficients corresponded to very similar distributions across eigenvectors
- The station or day did not seem to be an important indicator of snow-depth
- Positive snow-depth is very rare in this region
- We could not reject the null hypothesis that two (different) stations are independent
- Most of the stations with positive snow-depth have an elevation that is in or above the upper quartile

After completing this report, we elected to do a sanity check by parsing out every single positive snow-depth event from the SNWD data. The table in the appendix is grouped by station and year; the vector_list column lists the days on which a positive snow-depth event occurred. Notice that a station had a positive snow-depth event at the same time as another station only about 4 times, and 3 of those happened during a record snowstorm in 1967. Positive snow-depth events are also rare: about 50 days in a dataset of 500k+ days (2000 records * 365 days).

Because the dataset is sparse and has little variance to begin with, finding eigenvectors with a high explained variance was likely a trivial task. For future work, it would be better to examine another feature of the weather dataset or switch to an entirely different region. An effort should be made to find datasets with a high number of events and a high explained variance.



## Appendix
<img alt="df_snowdep.png" src="r_figures/df_snowdep.png" width="50%" height="50%"/>
<center>Table of Positive Snow Depth Days</center>
