# California/Nevada Weather Analysis

This report presents a historical analysis of weather patterns in a region covering Northeastern California and Nevada.

The data we're analyzing comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from here: [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.


## My region of interest

<p>My dataset includes regions of Northeastern California as well as Nevada. Below is a map of my region of interest, with a box showing the latitudinal and longitudinal boundaries, as well as the locations of the weather stations. The radii of the circles (which denote the weather stations) are proportional to the amount of data collected from them.</p>

<p>&nbsp;<img alt="PRCP.png" src="CA_figs/map.jpg" style="height:400px; width:1000px" /></p>


## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from <a href="http://www.usclimatedata.com/climate/hawthorne/nevada/united-states/usnv0039" target="_blank">US Climate Data</a>. The graphs below show the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month, for two locations central to my dataset (Hawthorne, NV and Markleeville, CA). Because my dataset spans a wide range of elevations and climates, I needed more than one locale to compare against my averages.</p>

<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="CA_figs/Climate_Hawthorne.jpg" style="float: left; height:400px; width:500px" />
<img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="CA_figs/Markleeville.png" style="float: right; height:400px; width:500px"/>
</p>


<p>We see that the min and max daily&nbsp;temperatures roughly agree with those in our dataset:</p>


<p><img alt="myTMIN,TMAX.png" src="CA_figs/MINMAX.png" style="float:center; height:400px; width:1000px" /></p>





<p>Specifically, the mean low temperature (TMIN) peaks at ~12 degrees Celsius in August, and the mean high temperature (TMAX) peaks at ~31 degrees Celsius around the same time of year. The trends are the same as in the US Climate Data plots above, and the averages for those cities closely resemble those of my dataset.</p>
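Monthly averages like the ones compared above can be derived from daily records roughly as follows. This is a minimal sketch rather than the code behind the figure: it assumes each record is a 365-day NumPy array with NaN for missing days, and the name `monthly_means` is hypothetical. The example input is a synthetic seasonal curve, not station data.

```python
import numpy as np

# Days per month in a non-leap year (the 365-day layout is an assumption).
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def monthly_means(daily):
    """Average a (n_records, 365) array of daily values into 12 monthly means.

    Missing days are expected to be NaN and are ignored.
    """
    daily = np.atleast_2d(np.asarray(daily, dtype=float))
    edges = np.cumsum([0] + DAYS_IN_MONTH)  # month boundaries: 0, 31, 59, ...
    return np.array([
        np.nanmean(daily[:, start:stop])
        for start, stop in zip(edges[:-1], edges[1:])
    ])

# Synthetic example: a temperature-like seasonal curve, not real station data.
days = np.arange(365)
synthetic = 10 - 12 * np.cos(2 * np.pi * (days - 20) / 365)
means = monthly_means(synthetic)
print(means.round(1))
```

The position of the largest entry in `means` gives the peak month, which is the quantity compared against the outside source.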


<p>&nbsp;<img alt="PRCP.png" src="CA_figs/PRCP.png" style="height:450px; width:600px" /></p>

<p>
With regard to precipitation (PRCP), things are less clear. We see the same general trend, with PRCP decreasing in the summer months. However, our plot shows average rainfall ranging from ~5mm to 45mm, while
<a href="http://www.usclimatedata.com/climate/hawthorne/nevada/united-states/usnv0039" target="_blank">US Climate Data's</a> data for Hawthorne shows a range of 7mm to 14mm. This is why I opted to look at a second location. While Hawthorne has significantly less rainfall than my average, Markleeville has significantly more, with average rainfall ranging from 20mm to 86mm per month. Averaging the rainfall values of these two locations would generate a curve close to my dataset's average PRCP.
</p>

## Preliminary Analysis

### Correlation between features

We generated 12 heatmaps that show the correlations between our six features of interest, one for each month of the year. The color reflects the correlation coefficient of the two features being compared.

<p>&nbsp;<img alt="PRCP.png" src="CA_figs/heatmap3.png" style="height:600px; width:900px" /></p>


<p>&nbsp;<img alt="PRCP.png" src="CA_figs/heatmap4.png" style="height:600px; width:900px" /></p>

From these heatmaps, we can see that, all year round, there is a strong positive correlation between our temperature metrics (TMIN, TMAX and TOBS). There is also a strong positive correlation between our snow metrics (SNOW and SNWD). Furthermore, there is a strong negative correlation between our temperature and snow metrics.

Precipitation, on the other hand, shows weak correlations with the other features. Interestingly, the strength and directionality (positive or negative) of these correlations are dependent on the time of year. 

Below is a heatmap of annual averages for comparison. It reinforces the conclusion that most of our correlations exist year-round.

<p>&nbsp;<img alt="PRCP.png" src="CA_figs/heatmapALL.png" style="height:400px; width:450px" /></p>

### Correlation with Elevation

Because my data covers a wide range of elevations, I thought it would be interesting to see how my features were distributed across different elevations.

<table>
<tr>
<td> <img src = "CA_figs/elevation_vs_PRCP.png" alt="Sample_alt_text" style = "width: 400px;"/> </td> 
<td> <img src = "CA_figs/elevation_vs_SNOW.png" alt="SNWD_CDF_res1.png" style = "width: 400px;"/> </td>
<td> <img src = "CA_figs/elevation_vs_SNWD.png" alt="SNWD_CDF_res1.png" style = "width: 400px;"/> </td>
</tr>
<tr>
<td> <img src = "CA_figs/elevation_vs_TMIN.png" alt="Sample_alt_text" style = "width: 400px;"/> </td> 
<td> <img src = "CA_figs/elevation_vs_TMAX.png" alt="SNWD_CDF_res1.png" style = "width: 400px;"/> </td>
<td> <img src = "CA_figs/elevation_vs_TOBS.png" alt="SNWD_CDF_res1.png" style = "width: 400px;"/> </td>
</tr>
</table>

We can see that SNOW and SNWD are positively correlated with elevation: that is, there is more snowfall and greater snow depth at higher elevations.

Furthermore, TMIN and TMAX are negatively correlated with elevation, which also makes intuitive sense: temperatures drop at higher elevations.

The relationships between TOBS and elevation, and between PRCP and elevation, are less clear.
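One way to put a number on these relationships is a Pearson correlation between station elevation and each station's mean value of a feature. The sketch below assumes one elevation and one mean per station; `elevation_correlation` is a hypothetical helper, and the example values are synthetic (a straight-line temperature drop of 6.5 °C per km, chosen only for illustration).

```python
import numpy as np

def elevation_correlation(elevation, station_mean):
    """Pearson r between station elevation and a per-station mean value.

    Stations with a missing value in either input are skipped.
    """
    elevation = np.asarray(elevation, dtype=float)
    station_mean = np.asarray(station_mean, dtype=float)
    keep = ~(np.isnan(elevation) | np.isnan(station_mean))
    return float(np.corrcoef(elevation[keep], station_mean[keep])[0, 1])

# Synthetic illustration only: temperature falling linearly with elevation.
elev_m = np.array([1200.0, 1500.0, 1800.0, 2100.0, 2400.0, np.nan])
tmax_c = 30.0 - 6.5 * elev_m / 1000.0
r = elevation_correlation(elev_m, tmax_c)
print(round(r, 3))
```

A value of r near zero for TOBS or PRCP would match the "less clear" scatter plots, though a weak linear correlation does not rule out a nonlinear relationship.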

## PCA analysis

For each of the six measurements, we computed the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](CA_figs/VarExplained1.png)

We see that the top 5 eigenvectors explain approximately 60% of the variance for TMIN, 68% for TOBS and 55% for TMAX.

We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains over 60% of the variance.
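The percentage of variance explained by the top k eigenvectors is the sum of the k largest eigenvalues of the covariance matrix divided by the sum of all eigenvalues. A minimal sketch of that calculation follows; the function name `variance_explained` is hypothetical and the data is synthetic, so the printed figure is not comparable to the numbers reported here.

```python
import numpy as np

def variance_explained(cov):
    """Cumulative fraction of variance explained by the top-k eigenvectors."""
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; flip it
    return np.cumsum(eigvals) / np.sum(eigvals)

# Synthetic data with a few dominant directions (not the NOAA data).
rng = np.random.default_rng(1)
scales = np.array([5.0, 3.0, 2.0] + [0.5] * 17)
samples = rng.normal(size=(400, 20)) * scales
cov = np.cov(samples, rowvar=False)
explained = variance_explained(cov)
print(f"top 5 explain {explained[4]:.0%} of the variance")
```

Entry `explained[k - 1]` is the fraction for the top k eigenvectors, so the curve plotted against k always ends at 1 when every eigenvalue is non-negative.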


![VarExplained1.png](CA_figs/VarExplained2.png)

The top 5 eigenvectors explain 14% of the variance for PRCP and 18% for SNOW. 

However, the top 5 eigenvectors explain 85% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow signals. For this reason, we will dive deeper into the PCA analysis for snow-depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
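The "decaying integral" idea can be illustrated with a toy recurrence in which each day's depth is a fraction of the previous day's depth plus that day's snowfall. This is a sketch for intuition only, not a physical snowpack model: the decay constant of 0.95, the synthetic storm pattern, and the helper names are all assumptions.

```python
import numpy as np

def decaying_sum(snowfall, decay=0.95):
    """depth[t] = decay * depth[t-1] + snowfall[t], a toy snowpack model."""
    depth = np.zeros(len(snowfall))
    level = 0.0
    for t, fall in enumerate(snowfall):
        level = decay * level + fall
        depth[t] = level
    return depth

def relative_jumpiness(x):
    """Typical day-to-day change, relative to the mean level."""
    return np.mean(np.abs(np.diff(x))) / np.mean(x)

# Synthetic snowfall: mostly dry days with occasional storms.
rng = np.random.default_rng(2)
snowfall = rng.exponential(20.0, size=120) * (rng.random(120) < 0.2)
depth = decaying_sum(snowfall)

print(relative_jumpiness(snowfall), relative_jumpiness(depth))
```

In this toy setup the depth series changes far less from day to day, relative to its level, than the snowfall series that drives it, which is the smoothing effect described above.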

## Analysis of snow depth

We chose to analyze the eigendecomposition for snow depth because, of the six measurements, it needs the fewest eigenvectors to explain a high percentage of the variance. Interestingly, the top 50 eigenvectors explain 100% of the variance of my data for SNWD.




<p>&nbsp;<img alt="PRCP.png" src="CA_figs/SNWD_percentExplained.png" style="height:300px; width:450px" /></p>

Furthermore, the first 3 eigenvectors account for 80% of the variance, so those are what we will be focusing on. First, we graph the mean and the top 3 eigenvectors:

<p>&nbsp;<img alt="PRCP.png" src="CA_figs/SNWD_mean_eigs.png" style="height:400px; width:800px" /></p>

We observe that our dataset has a long and heavy snow season, which is not surprising, because it includes one of the world's premier skiing locales, Lake Tahoe. The snow season runs from early November to the beginning of June, with snow depth peaking in early February.


Next we interpret the eigenfunctions:

The first eigenfunction (eig1) has a shape very similar to the mean function. This similarity is not surprising: the first eigenvector explains 68% of the variance for snow depth.

Eig2 and eig3 oscillate between positive and negative values, corresponding to changes in how snow depth is distributed over the snow season. They don't change the total very much.

Specifically, eig2 shows more snow from the end of December to March and less snow from March to June and from November to late December.

Similarly, eig3 shows more snow from mid March to late April and less snow from November to mid March and from mid April to late June.


While performing this analysis, we discovered that our eigenvalues have the following distribution:

<p>&nbsp;<img alt="PRCP.png" src="CA_figs/histogram_eigvals.png" style="float: center; height:300px; width:450px" /></p>

Roughly 100 of our 365 eigenvalues are less than zero. A valid covariance matrix is positive semidefinite and should have no negative eigenvalues, so this suggests that my data may have been corrupted in preprocessing.
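A quick diagnostic for this kind of problem is to count how many eigenvalues fall below zero by more than floating-point round-off. The sketch below is one possible check, not the procedure used in this report: the function name and the relative tolerance are assumptions, and the two test matrices are synthetic.

```python
import numpy as np

def count_negative_eigenvalues(cov, rel_tol=1e-10):
    """Count eigenvalues below -rel_tol * (largest absolute eigenvalue).

    Tiny negative values are usually floating-point round-off; larger ones
    mean the matrix is not a valid (positive semidefinite) covariance.
    """
    cov = np.asarray(cov, dtype=float)
    if not np.allclose(cov, cov.T, equal_nan=True):
        raise ValueError("covariance matrix should be symmetric")
    if np.isnan(cov).any():
        raise ValueError("covariance matrix contains NaN entries")
    eigvals = np.linalg.eigvalsh(cov)
    threshold = -rel_tol * np.max(np.abs(eigvals))
    return int(np.sum(eigvals < threshold))

# A covariance from complete synthetic data has no negative eigenvalues.
rng = np.random.default_rng(3)
good = np.cov(rng.normal(size=(50, 10)), rowvar=False)
# A symmetric matrix that is not a valid covariance (eigenvalues 3 and -1).
bad = np.array([[1.0, 2.0], [2.0, 1.0]])
print(count_negative_eigenvalues(good), count_negative_eigenvalues(bad))
```

If the count is large, as it is here, a natural next step is to inspect how missing values were handled when the covariance matrix was estimated.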

## Analysis of reconstructions

We can look at the eigenvector coefficients to ascertain where and when our different eigenvectors have the greatest influence. Where a coefficient is large and positive, its eigenvector has the most influence. Where a coefficient is close to zero, its eigenvector has the least influence. A large negative coefficient shows that its eigenvector has a strong influence in the opposite direction.
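Each coefficient is the projection of a mean-subtracted yearly profile onto one eigenvector, and the reconstruction is the mean plus the coefficient-weighted eigenvectors. The following sketch shows that relationship under simplifying assumptions: the profile has no missing days, the eigenvectors are orthonormal columns sorted by decreasing eigenvalue, and `project_and_reconstruct` is a hypothetical name. The profiles are synthetic random walks, not SNWD data.

```python
import numpy as np

def project_and_reconstruct(signal, mean, eigvecs, k=3):
    """Return the top-k coefficients and the rank-k reconstruction.

    eigvecs holds orthonormal eigenvectors as columns, sorted by
    decreasing eigenvalue.
    """
    top = eigvecs[:, :k]
    coeffs = top.T @ (signal - mean)      # one coefficient per eigenvector
    reconstruction = mean + top @ coeffs  # mean plus weighted eigenvectors
    return coeffs, reconstruction

# Synthetic year-long profiles (not the SNWD data).
rng = np.random.default_rng(4)
profiles = rng.normal(size=(200, 365)).cumsum(axis=1)
mean = profiles.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(profiles, rowvar=False))
eigvecs = eigvecs[:, ::-1]  # eigh returns ascending order; put largest first

coeffs, approx = project_and_reconstruct(profiles[0], mean, eigvecs, k=3)
residual = np.sum((profiles[0] - approx) ** 2) / np.sum((profiles[0] - mean) ** 2)
print(coeffs.round(1), round(float(residual), 3))
```

The `residual` value is the fraction of the profile's deviation from the mean that the three eigenvectors leave unexplained, so small residuals mark the station-years that the top eigenvectors describe well.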

### Coefficients

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](CA_figs/coeff1_max.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](CA_figs/coeff1_min.png)


Large positive values of coeff1 correspond to more-than-average snow. Because eig1 has nearly the same shape as the mean, the magnitude of coeff1 tracks the overall amount of snow in the target (actual) data.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](CA_figs/coeff2_max.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](CA_figs/coeff1_min.png)

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](CA_figs/coeff3_max.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](CA_figs/coeff3min.png)

Large positive values of coeff3 reflect areas where eigenvector3 has the most influence and moves in the same direction as our target. Large negative values of coeff3 also mark areas where eigenvector3 has a strong influence. However, these are regions where our eigenvector moves in the opposite direction to our target.

### Geographical distribution of first 3 coefficients. 

<p>&nbsp;<img alt="PRCP.png" src="CA_figs/geo_coeffs.png" style="height:400px; width:1000px" /></p>

The geographical distributions of the maxima and minima of the first 3 coefficients can be seen above. There are four points corresponding to each coefficient: the two locations with the highest coefficient values and the two locations with the lowest coefficient values. They are color coded to match the eigenvector they correspond to: gold = coeff1, green = coeff2, red = coeff3.
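Picking the two highest and two lowest locations for a coefficient amounts to sorting the stations by that coefficient. A minimal sketch, assuming one coefficient value per station; the station names, coefficient values, and the helper `extreme_stations` are hypothetical and used only for illustration.

```python
import numpy as np

def extreme_stations(station_ids, coeff, n=2):
    """Return the n station IDs with the largest and smallest coefficient."""
    order = np.argsort(coeff)  # indices from smallest to largest coefficient
    ids = np.asarray(station_ids)
    highest = ids[order[::-1][:n]].tolist()
    lowest = ids[order[:n]].tolist()
    return highest, lowest

# Hypothetical station names and coefficient values, for illustration only.
stations = ["A", "B", "C", "D", "E"]
coeff1 = np.array([120.0, -40.0, 310.0, 5.0, -95.0])
highest, lowest = extreme_stations(stations, coeff1)
print(highest, lowest)
```

The returned IDs can then be matched to station latitude and longitude for plotting on the map.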

## Conclusions

### Summary

My data shows strong positive correlations between TMIN, TMAX and TOBS, as well as between SNOW and SNWD. Furthermore, there are strong negative correlations between the temperature metrics (TMIN, TMAX, TOBS) and the snow metrics (SNOW, SNWD). There were also strong correlations with elevation (positive for SNOW and SNWD and negative for TMIN and TMAX).

We also used PCA to ascertain how effectively we could reduce the size of our dataset, and followed up with an in-depth analysis of our snow-depth feature. We showed that 3 eigenvectors retain 80% of the variance of our SNWD data, and analyzed their temporal and spatial behavior.


### Future Directions

I had a great dataset with a lot of avenues for exploration. In the future, I would like to do a more in-depth analysis of the features other than SNWD. I would also like to create a linear regression model in which specific features predict elevation. Additionally, it would be interesting to see how my dataset changes across my landscape, and how it correlates with other datasets. More specifically, I would like to explore whether and how my dataset supports current climate change models.