# California Weather Analysis
This report analyzes historical weather patterns in an area of the state of California.
We concentrated on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** the temperature at the time of observation, treated in this report as the average temperature for each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some general statistics with graphs obtained from <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>


<p><img alt="Climate_San_Jose.png" src="h_figures/Climate_San_Jose.png" /></p>


<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius.</p>

<p><img alt="TMIN,TMAX.png" src="h_figures/Mean_Std_TMIN_TMAX.png" style="height:300px; width:800px" /></p>

<p><img alt="TOBS,PRCP.png" src="h_figures/Mean_Std_TOBS_PRCP.png" style="height:300px; width:800px" /></p>

<p>In the figures above, temperature is in degrees Celsius and precipitation is in inches. The minimum and maximum temperatures and the precipitation in the dataset agree with the values from US Climate Data.</p>
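
The unit conversions behind this comparison can be sketched as follows. The function names and the example values are illustrative assumptions, not part of the original analysis code.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def inches_to_mm(length_in):
    """Convert a length from inches to millimetres (1 inch = 25.4 mm)."""
    return length_in * 25.4


# Hypothetical values for illustration only (not taken from the dataset).
example_high_f = 68.0
example_precip_in = 2.0

print(fahrenheit_to_celsius(example_high_f))  # 20.0
print(inches_to_mm(example_precip_in))        # 50.8
```

Converting both sources to the same units before comparing avoids mistaking a unit mismatch for a disagreement in the data.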

## PCA analysis

For each of the six measurements, we compute the eigenvectors and show the top 3 of them.

<p><img alt="Top_Eigen_TMIN_TMAX" src="h_figures/Top_Eigen_TMIN_TMAX.png" /></p>
<p><img alt="Top_Eigen_TOBS_PRCP" src="h_figures/Top_Eigen_TOBS_PRCP.png" /></p>
<p><img alt="Top_Eigen_SNOW_SNWD" src="h_figures/Top_Eigen_SNOW_SNWD.png" /></p>

<p>Next, we look at how much variance the top eigenvectors explain. From the two figures below, we conclude that the top 5 eigenvectors explain 38% of the variance for TMIN, 47% for TMAX, 61% for TOBS, 30% for SNOW, 49% for SNWD, and 16% for PRCP.</p>
<p><img alt="Percent_V_TMIN_TMAX_TOBS" src="h_figures/Percent_V_TMIN_TMAX_TOBS.png" /></p>
<p>Among TMIN, TMAX and TOBS, TOBS is best explained by the top eigenvectors: its first eigenvector alone explains 33% of the variance.</p>
<p><img alt="Percent_V_PRCP_SNOW_SNWD" src="h_figures/Percent_V_PRCP_SNOW_SNWD.png" /></p>
<p>Among SNOW, SNWD and PRCP, SNWD is best explained by the top eigenvectors. By comparison, the percentages of variance explained for SNOW and PRCP are much lower, with PRCP below 20%.</p>
<p>TOBS is explained best of all six measurements, so we focus on TOBS in the rest of this report.</p>
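
As a rough sketch of how a "percentage of variance explained by the top k eigenvectors" can be computed, the NumPy example below works on a synthetic matrix. The function name, the synthetic data, and the assumption that there are no missing values are ours; the real station data contains gaps and would need NaN-aware covariance estimation.

```python
import numpy as np


def cumulative_explained_variance(data):
    """Cumulative fraction of variance explained by the covariance eigenvectors.

    data: 2-D array, one row per station-year and one column per day.
    Assumes no missing values (an assumption; the real data has gaps).
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    # eigvalsh returns eigenvalues in ascending order, so reverse them.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic example: one seasonal pattern with random amplitude plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.sin(2 * np.pi * days / 365)
synthetic = rng.normal(size=(200, 1)) * seasonal + 0.3 * rng.normal(size=(200, 365))

explained = cumulative_explained_variance(synthetic)
print("explained by top 1 eigenvector:", explained[0])
print("explained by top 5 eigenvectors:", explained[4])
```

In this synthetic case a single eigenvector dominates because the data was built from one seasonal pattern; measurements such as PRCP, which lack a strong shared pattern, would show a much flatter curve.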

## PCA Analysis of TOBS
<p>We ran the code to compute the residual after subtracting the mean, and after additionally subtracting the projections on the top 3 eigenvectors:</p>
<p>residual normalized norm after mean: 0.0734317048122<br />
residual normalized norm after mean + top eigs: [ 0.52056662  0.27601201  0.26945704]</p>
<p>We read this as follows: the mean value alone conveys little information about the data, while the top eigenvectors capture considerably more of it.</p>
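
The sketch below shows one possible way to compute such residuals for a single station-year vector. The normalization is an assumption on our part: the residual after the mean is divided by the energy of the raw signal, and the residuals after the eigenvectors are divided by the energy left after removing the mean. The code that produced the numbers above may normalize differently, and the function name is hypothetical.

```python
import numpy as np


def residual_norms(signal, mean, eigvecs):
    """Normalized residual energy after removing the mean and the top eigenvectors.

    signal, mean: 1-D arrays of length D.
    eigvecs: array of shape (D, k) whose columns are orthonormal.
    Returns (after_mean, after_eigs), where after_eigs[i] is the residual
    after removing the mean and the first i + 1 eigenvectors.
    Normalization is an assumption, see the lead-in text.
    """
    centered = signal - mean
    total_energy = float(np.dot(signal, signal))
    centered_energy = float(np.dot(centered, centered))
    after_mean = centered_energy / total_energy
    # Because the eigenvectors are orthonormal, each one removes coeff**2 energy.
    coeffs = eigvecs.T @ centered
    removed = np.cumsum(coeffs ** 2)
    after_eigs = (centered_energy - removed) / centered_energy
    return after_mean, after_eigs


# Tiny hand-checkable example (not real data).
signal = np.array([4.0, 2.0, 1.0, 1.0])
mean = np.ones(4)
eigvecs = np.eye(4)[:, :2]  # first two standard basis vectors

after_mean, after_eigs = residual_norms(signal, mean, eigvecs)
print(after_mean, after_eigs)
```

With this convention, values close to 0 mean that almost nothing is left unexplained at that step, and values close to 1 mean that the step removed very little.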



## Analysis of the average temperature for each day
<p>As a warm-up, we first plot two time series:</p>
<p><img alt="Two_Time.png" src="h_figures/Two_Time_TOBS.png" /></p>
<p>Then we plot the mean and the top 3 eigenvectors for TOBS:</p>
<p><img alt="TOBS_Mean_Eigen.png" src="h_figures/TOBS_Mean_Eigen.png" /></p>
<p>We observe that the average temperature rises toward summer and falls toward winter. The peak occurs in August and the lowest point in January.</p>

Next we interpret the eigen-functions. The second eigen-function (eig2) has a shape very similar to the mean function. The main difference is that the eigen-function sometimes goes below zero while the mean does not. We interpret this shape as follows: eig2 represents the overall average temperature being above or below the mean, without changing the distribution over time.

**eig1 and eig3** are similar in the following way: they change how the average temperature is distributed over the months, but they do not change the overall level very much. The shape of eig3 is somewhat unusual; judging from the mean, we think this is because the distribution is skewed.

They can be interpreted as follows:
* **eig1:** lower temperature in summer, from May to October, but higher temperature from November to April.
* **eig3:** higher temperature from January to July but lower temperature from August to December.

## Most Positive and Negative Reconstructions for TOBS
#### Coeff1
Coeff1: most positive
<p><img alt="Positive_C1.png" src="h_figures/Positive_C1.png" /></p>
Coeff1: most negative
<p><img alt="Negative_C1.png" src="h_figures/Negative_C1.png" /></p>

#### Coeff2
Coeff2: most positive
<p><img alt="Positive_C2.png" src="h_figures/Positive_C2.png" /></p>
Coeff2: most negative
<p><img alt="Negative_C2.png" src="h_figures/Negative_C2.png" /></p>

#### Coeff3
Coeff3: most positive
<p><img alt="Positive_C3.png" src="h_figures/Positive_C3.png" /></p>
Coeff3: most negative
<p><img alt="Negative_C3.png" src="h_figures/Negative_C3.png" /></p>
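
The reconstructions shown above can be sketched in a few lines of NumPy: project each centered station-year vector onto the eigenvectors to get its coefficients, sort by one coefficient, and rebuild the signal from the mean plus the weighted eigenvectors. All function names and the toy data below are illustrative assumptions, not the code used for the figures.

```python
import numpy as np


def project(signals, mean, eigvecs):
    """Coefficients of each row of `signals` on the columns of `eigvecs`."""
    return (signals - mean) @ eigvecs


def reconstruct(mean, eigvecs, coeffs):
    """Rebuild signals from the mean plus the weighted eigenvectors."""
    return mean + coeffs @ eigvecs.T


def extreme_rows(coeffs, component, count=4):
    """Row indices with the most positive and most negative coefficient."""
    order = np.argsort(coeffs[:, component])
    most_positive = order[-count:][::-1]
    most_negative = order[:count]
    return most_positive, most_negative


# Toy data: 6 "station-years" of length 3, two orthonormal eigenvectors.
signals = np.array([
    [5.0, 1.0, 9.0],
    [-3.0, 0.0, 1.0],
    [1.0, 2.0, 0.0],
    [0.0, -1.0, 2.0],
    [-7.0, 4.0, 3.0],
    [2.0, 0.5, 1.0],
])
mean = np.zeros(3)
eigvecs = np.eye(3)[:, :2]

coeffs = project(signals, mean, eigvecs)
most_positive, most_negative = extreme_rows(coeffs, component=0, count=2)
approx = reconstruct(mean, eigvecs, coeffs)
print(most_positive, most_negative)
```

Plotting each selected row next to its reconstruction shows what a large positive or negative coefficient looks like in the original signal.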

## The variation in the TOBS is mostly due to location-to-location variation
<p>We now estimate the importance of location-to-location variation relative to year-by-year variation.</p>
<p>Each is measured by the percentage by which the RMS is reduced when we subtract from each station/year entry the average-per-station or the average-per-year, respectively. Here are the results:</p>

#### Coeff1
* total RMS = 707.100124098
* RMS removing mean-by-station = 380.777409608, fraction explained = 46%
* RMS removing mean-by-year = 668.246162667, fraction explained = 5.5%

#### Coeff2
* total RMS = 286.083532778
* RMS removing mean-by-station = 129.759224797, fraction explained = 55.5%
* RMS removing mean-by-year = 267.014357208, fraction explained = 6.7%

#### Coeff3
* total RMS = 167.3190488
* RMS removing mean-by-station = 146.80645052, fraction explained = 12.3%
* RMS removing mean-by-year = 131.740235653, fraction explained = 21.2%

<p>For coefficients 1 and 2, variation by location explains much more than variation by year (46% vs. 5.5% and 55.5% vs. 6.7%). Coefficient 3 is the exception: location explains only 12.3% of it, while year explains 21.2%. Since the first two coefficients have by far the largest total RMS, variation by location explains more than variation by year overall.</p>
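
A minimal sketch of this measurement is given below. It assumes the coefficient values are arranged in a table with one row per year and one column per station, with NaN where a station has no data for a year. The layout and the function names are assumptions. The percentage is computed as 1 minus the ratio of residual RMS to total RMS, which appears consistent with the Coeff1 figures reported above.

```python
import numpy as np


def rms(values):
    """Root mean square, ignoring missing (NaN) entries."""
    return float(np.sqrt(np.nanmean(values ** 2)))


def percent_reduction(total_rms, residual_rms):
    """Percentage by which the RMS drops after removing a mean."""
    return 100.0 * (1.0 - residual_rms / total_rms)


def explained_by_station_and_year(table):
    """table: rows are years, columns are stations (assumed layout)."""
    total = rms(table)
    without_station_mean = rms(table - np.nanmean(table, axis=0))
    without_year_mean = rms(table - np.nanmean(table, axis=1, keepdims=True))
    return (percent_reduction(total, without_station_mean),
            percent_reduction(total, without_year_mean))


# Toy table in which all variation is between stations, none between years.
toy = np.array([[1.0, -1.0],
                [1.0, -1.0]])
by_station, by_year = explained_by_station_and_year(toy)
print(by_station, by_year)

# Applying the same formula to the Coeff1 RMS values reported above.
coeff1_station = percent_reduction(707.100124098, 380.777409608)
coeff1_year = percent_reduction(707.100124098, 668.246162667)
print(round(coeff1_station, 1), round(coeff1_year, 1))
```

In the toy table, removing the station mean eliminates all of the signal while removing the year mean changes nothing, which is the extreme version of the pattern seen for coefficients 1 and 2.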


## Correlation and Dependency Analysis

<p>We arrange the data by station and day, then compute the root mean square (RMS) of the data after removing the mean by station or the mean by day. Before removing any mean, the RMS is 166. The RMS after removing the mean-by-station is 61, and the RMS after removing the mean-by-day is 46.</p>
<p>Removing the mean-by-day leaves a smaller residual (46) than removing the mean-by-station (61), so day-to-day variation accounts for somewhat more of the signal than station-to-station variation. Both reductions are large, however, so the station effect is still substantial. That is why we dig deeper into the correlation and dependency among stations.</p>

### Correlation and Dependency Analysis for TOBS
<p>The correlation figure is as follows:</p>
<p><img alt="Correlation1.png" src="h_figures/Correlation1.png" /></p>
<p><img alt="Correlation2.png" src="h_figures/Correlation2.png" /></p>
<p>However, as the figures show, the correlation between stations is not evident. We see two likely reasons. First, the average temperature in this area does not vary much between stations. Second, temperatures are continuous values rather than Boolean values, so their relationship is harder to detect than for precipitation, which we treat as true (rain) or false (no rain).</p>

### Correlation and Dependency Analysis for PRCP
<p>Although the top eigenvectors explain TOBS better than the other measurements, the figures above show that the correlation between stations for TOBS is very weak, so we also consider another measurement, PRCP:</p>
<p>The correlation figure is as follows:</p>
<p><img alt="PRCP_Correlation1.png" src="h_figures/PRCP_Correlation1.png" /></p>
<p><img alt="PRCP_Correlation2.png" src="h_figures/PRCP_Correlation2.png" /></p>
<p>From the figures above, we can see that about half of the stations are correlated with each other and the rest are not. This indicates that the correlation between stations is stronger for precipitation than for average temperature.</p>
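
For readers who want to reproduce a station-to-station comparison, the sketch below computes a plain Pearson correlation matrix with NumPy on synthetic series. This is a stand-in technique chosen for illustration: the measure used to produce the figures above may differ, and the function name and data are assumptions.

```python
import numpy as np


def station_correlation(series_by_station):
    """Pearson correlation between stations.

    series_by_station: 2-D array, one row per station and one column per day.
    Assumes no missing values (the real data would need NaN handling).
    """
    return np.corrcoef(series_by_station)


# Synthetic example: stations 0 and 1 share a signal, station 2 is independent.
rng = np.random.default_rng(1)
shared = rng.normal(size=365)
stations = np.vstack([
    shared + 0.1 * rng.normal(size=365),
    shared + 0.1 * rng.normal(size=365),
    rng.normal(size=365),
])

corr = station_correlation(stations)
print(np.round(corr, 2))
```

Groups of mutually correlated stations show up as bright blocks in such a matrix once the rows and columns are ordered so that similar stations sit next to each other.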

## Visualization of Geographical Distribution of First Coefficients For TOBS
<p>The first figure visualizes the geographical distribution of the first coefficient. The radius of a circle corresponds to the number of stations in that area, and the fill color of the circle corresponds to the coefficient value. There is no evident geographic pattern here.</p>
<p><img alt="Map1.png" src="h_figures/Map1.png"/></p>
<p>The second figure visualizes the geographical distribution of the top 4 coefficients. The four triangles at each station represent the top four coefficients. The opacity of a triangle encodes the sign (positive or negative) and its size encodes the magnitude of the coefficient.</p>
<p><img alt="Map2.png" src="h_figures/Map2.png"/></p>
<p>The figure below illustrates the relationship between elevation and coefficient1. We can see that coefficient1 varies a lot even at the same elevation, and that it has little to do with elevation, especially when the elevation is very low. At high elevations, however, the coefficients are much more similar. This is consistent with the idea that at low altitudes, latitude has a greater influence on temperature.</p>
<p><img alt="Elevation.png" src="h_figures/Elevation.png" /></p>