# Mid California Area Weather Analysis

Hongfei Hu, A53216516

This report analyzes historical weather patterns. The stations involved are located in mid California, in an area that may overlap the San Francisco region.

The data we use comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** the temperature at the time of observation each day.
* **PRCP:** daily precipitation (stored in tenths of a mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/lodi/california/united-states/usca0623" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="lodi_climate.jpg" src="fig/lodi_climate.jpg" height="400" width="400"/></p>

<p>Notice that the temperatures in the above picture are in Fahrenheit, whereas our data is in Celsius. For convenience, we convert the reference temperatures: 90F=32.22C, 70F=21.11C, 50F=10C.</p>
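
The conversion above can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and `tenths_to_celsius` assumes raw readings stored in tenths of a degree Celsius.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def tenths_to_celsius(raw_value):
    """Convert a raw reading in tenths of a degree Celsius to degrees Celsius.

    Assumption: the raw data stores temperatures in tenths of a degree.
    """
    return raw_value / 10.0


# Reference temperatures read off the Fahrenheit axis of the outside graph.
for temp_f in (90, 70, 50):
    print(f"{temp_f}F = {fahrenheit_to_celsius(temp_f):.2f}C")
```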

<p>Our data is shown below (temperatures are stored in tenths of a degree Celsius; for example, 150 means 15C). The curves have shapes similar to those above. The highest point is about 32C and the lowest temperature is around 10C, similar to the data from the website.</p>

<p>&nbsp;</p>

<p><img alt="2.temp.png" src="fig/2.temp.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to translate millimeters/day to inches/month. According to our analysis, the average rainfall is 3.00 mm/day in winter, with no rain in summer (notice that our data is stored in tenths of a mm; for example, 50 means 5mm). According to US-Climate-Data, the average rainfall is 3.5 inches per month (about 2.96 mm/day) in winter, with almost no rain in summer. The two sources agree.</p>

<p><img alt="PRCP.png" src="fig/3.prcp.png" style="height:300px; width:400px" /></p>
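
The precipitation comparison relies on a unit conversion that is easy to get wrong, so here is a small sketch of the arithmetic. The helper names and the 30-day month are assumptions made for illustration.

```python
MM_PER_INCH = 25.4


def inches_per_month_to_mm_per_day(inches_per_month, days_in_month=30):
    """Convert a monthly total in inches to an average daily amount in mm.

    Assumption: a month is taken to be 30 days long.
    """
    return inches_per_month * MM_PER_INCH / days_in_month


def tenths_mm_to_mm(raw_value):
    """Convert a raw precipitation reading in tenths of a mm to mm."""
    return raw_value / 10.0


website_winter = inches_per_month_to_mm_per_day(3.5)
print(f"3.5 inches/month = {website_winter:.2f} mm/day")
print(f"raw value 50 = {tenths_mm_to_mm(50):.1f} mm")
```

With these assumptions, 3.5 inches per month comes out at roughly 2.96 mm/day, which is close to the 3.00 mm/day found in our data.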


## PCA analysis

In this section, we want to know how much of the variance in the data set the top eigenvectors explain. The fewer eigenvectors we need to reach a high percentage of explained variance, the better those eigenvectors describe the data.
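
As an illustration of what "percentage of variance explained" means, the sketch below computes the cumulative fraction of variance captured by the top eigenvectors of a covariance matrix. It is not the code used for the figures in this report: the data is synthetic, the function name is hypothetical, and it assumes there are no missing values (real station data has gaps that need NaN-aware handling).

```python
import numpy as np


def explained_variance_fractions(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    data: 2D array, one row per station-year, one column per day.
    Assumption: no missing values.
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Tiny negative values can appear from round-off; clip them to zero.
    eigvals = np.clip(eigvals, 0.0, None)
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic example: a seasonal curve with a random amplitude plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.sin(2 * np.pi * days / 365)
amplitudes = rng.normal(10.0, 3.0, size=(200, 1))
data = amplitudes * seasonal + rng.normal(0.0, 1.0, size=(200, 365))

cumulative = explained_variance_fractions(data)
print(f"top 1: {cumulative[0]:.2f}, top 5: {cumulative[4]:.2f}")
```

In this synthetic case a single eigenvector captures most of the variance because every row is a scaled copy of the same seasonal curve; real measurements are far less regular.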

### Percentage of variance explained.
![VarExplained1.png](fig/4.percentage.explain.png)

We see that the top 5 eigenvectors explain 37% of the variance for TMIN, 59% for TOBS and 45% for TMAX.

We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector, which by itself explains 47% of the variance.

![VarExplained2.png](fig/5.percentage.explain.png)

The top 5 eigenvectors explain only 1.6% of the variance for PRCP, which is quite low. However, the top 5 eigenvectors explain almost 68% for SNWD and 72% for SNOW. This means that these top 5 eigenvectors capture most of the variation in the snow signals. Based on that, we will dig deeper into the PCA analysis for snow depth. Why can the snow data be explained so well? We will reveal this in the next part.

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow depth because the first 5 eigenvectors explain close to 70% of the variance.

First, we graph the mean and the top 5 eigenvectors.

![SNWD_mean_eigs.png](fig/6.snowmean.png)

However, when we analyze the snow data, we find that the maximum of the mean is only 0.10. That means it almost never snows in this area, so a detailed snow analysis is not meaningful here.

Still, the eigenvectors show some interesting structure. **eig1,2,5** all peak in February. Together they describe where the maximum of the snow season falls (if we accept that there is indeed a 'snow season').

Also, **eig3,4** take negative values at some peaks. They may describe other features, such as local maxima in March and April.

This explains why 5 eigenvectors describe close to 70% of the variance. On most days there is no snow, so the data is dominated by snow-free days and the little variation that exists is concentrated in a few snowy periods. A handful of eigenvectors is enough to describe such data, even though they say little about the individual snowy days.

## Analysis of TOBS (temperature at the time of observation)

Since the snow measurements are uninformative and precipitation stays at a low level, we turn to temperature. Looking at the eigenvalues, we see that TOBS is described well by only 4 eigenvectors (which explain about 57% of the variance), better than TMIN and TMAX. So we choose to analyze it.

![SNWD_mean_eigs.png](fig/7.tobsmean2.png)

Now let's discuss the shapes of the eigenvectors. **eig1** has a shape quite similar to the mean, and the two follow the same trend: the temperature is higher in summer and lower in winter, which makes sense.

**eig2**, **eig3** and **eig4** are similar in the following way: they all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the temperature over the year, but they don't change the overall average (much). **eig2** also seems to have a trend opposite to that of the mean. This may be because it acts as a kind of inverse indicator of temperature: the higher its value, the lower the temperature. **eig3** and **eig4** look more complex than the others.
It seems that **eig3** moves the maximum temperature to April.

### Examples of reconstructions

For each coefficient, we plot the examples with the most positive and the most negative values and reconstruct the data (the purple line is the reconstructed data). Here are some comparisons between the reconstructed lines and the mean.

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](fig/8_1pos.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](fig/8_1neg.png)

Large positive values of coeff1 correspond to higher than average temperature. Low values correspond to lower than average temperature.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](fig/8_2pos.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](fig/8_2neg.png)

Looking carefully, we see that the target lines are flatter than the mean.
Large positive values of coeff2 correspond to a late high-temperature season (the highest temperature occurs in October). Negative values of coeff2 correspond to an early high-temperature season (the highest temperature occurs in July). The mean, by comparison, shows a high-temperature season that lasts from July to October.

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](fig/8_3pos.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](fig/8_3neg.png)

Large positive values of coeff3 correspond to a high-temperature season with two spikes: one at the start of July, the other at the end of October. Negative values of coeff3 correspond to a season with a single peak at the end of August.
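
The reconstructions in this section follow the usual PCA recipe: project a year of data onto the top eigenvectors to get its coefficients, then add the weighted eigenvectors back to the mean. The sketch below illustrates that idea on synthetic data. The function names and the data are hypothetical, it assumes no missing values, and the sign of each eigenvector (and therefore of each coefficient) is arbitrary.

```python
import numpy as np


def top_eigenvectors(data, k):
    """Return the mean row and the k leading eigenvectors of the covariance."""
    mean = data.mean(axis=0)
    cov = np.cov(data - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    leading = eigvecs[:, ::-1][:, :k]        # reorder so the largest come first
    return mean, leading


def reconstruct(row, mean, eigvecs):
    """Approximate one row using the mean plus a weighted sum of eigenvectors."""
    coeffs = eigvecs.T @ (row - mean)        # one coefficient per eigenvector
    return mean + eigvecs @ coeffs, coeffs


# Synthetic station-years: a seasonal curve, a per-year offset, and noise.
rng = np.random.default_rng(1)
days = np.arange(365)
base = 150 + 100 * np.sin(2 * np.pi * (days - 110) / 365)
offsets = rng.normal(0.0, 30.0, size=(100, 1))
data = base + offsets + rng.normal(0.0, 5.0, size=(100, 365))

mean, eigvecs = top_eigenvectors(data, k=3)
approx, coeffs = reconstruct(data[0], mean, eigvecs)
residual = data[0] - approx
print("coefficients:", np.round(coeffs, 1))
print("residual RMS:", round(float(np.sqrt(np.mean(residual ** 2))), 2))
```

In this toy data the first eigenvector is roughly constant over the year, so its coefficient measures how much warmer or cooler than average a given year is.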



## The variation in TOBS is mostly due to station-to-station variation

We now estimate the importance of location-to-location variation relative to year-by-year variation in the TOBS data. Which one matters more, the spatial or the temporal variation?

We measure this by the percentage by which the RMS of each coefficient is reduced when we subtract from each station/year entry the average-per-station or the average-per-year respectively. Here are the results:

**coeff_1**

total RMS                   = 658

RMS removing mean-by-station= 437, fraction explained=33.59%

RMS removing mean-by-year   = 631, fraction explained=4.10%

**coeff_2**

total RMS                   = 255

RMS removing mean-by-station= 132, fraction explained=48.23%

RMS removing mean-by-year   = 236, fraction explained=7.45%

**coeff_3**

total RMS                   = 152

RMS removing mean-by-station= 135, fraction explained=11.18%

RMS removing mean-by-year   = 116, fraction explained=23.68%


We see that the variation by station explains more than the variation by year for coeff1 and coeff2, which are the main components. For coeff3 the result is reversed. Overall, the variation in TOBS is more spatial than temporal.
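
A measure of this kind can be sketched as follows. This is an illustration under stated assumptions, not the exact code behind the numbers above: the coefficient is assumed to be arranged in a year-by-station table with NaN for missing entries, the function names are hypothetical, and the data is synthetic.

```python
import numpy as np


def rms(table):
    """Root mean square of all non-missing entries."""
    return np.sqrt(np.nanmean(table ** 2))


def fraction_explained(table):
    """Percentage reduction in RMS after removing station or year means.

    table: 2D array, rows = years, columns = stations, NaN = missing.
    """
    total = rms(table)
    minus_station = table - np.nanmean(table, axis=0, keepdims=True)
    minus_year = table - np.nanmean(table, axis=1, keepdims=True)
    return {
        "total": total,
        "by_station": 100 * (1 - rms(minus_station) / total),
        "by_year": 100 * (1 - rms(minus_year) / total),
    }


# Synthetic coefficient table dominated by a per-station effect.
rng = np.random.default_rng(2)
station_effect = rng.normal(0.0, 100.0, size=(1, 20))
table = station_effect + rng.normal(0.0, 10.0, size=(30, 20))
table[0, 0] = np.nan   # a couple of missing station-years
table[3, 5] = np.nan

result = fraction_explained(table)
print({name: round(float(value), 2) for name, value in result.items()})
```

Because the synthetic table is built from a strong per-station effect, removing the station means reduces the RMS far more than removing the year means, which is the same pattern seen for coeff1 and coeff2.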

None of these fractions is high, so neither the station nor the year alone accounts for most of the variation in the coefficients. Note also that the 3 coefficients themselves describe only part of the data: the top 3 eigenvectors explain only about 50% of the variance.

## Measuring statistical significance
We want a statistical test for rejecting the null hypothesis that the rainfall at two locations is independent. (This part works with the precipitation data.)
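
One possible test of this kind, shown only as a sketch and not necessarily the one used for the figures in this section, looks at the days on which both stations report rain. If rainy days at the two stations were independent, the number of shared rainy days would follow a hypergeometric distribution, and a large overlap would give a small p-value. The function name and the example counts are hypothetical.

```python
from math import comb


def overlap_p_value(n_days, rainy_a, rainy_b, overlap):
    """P(shared rainy days >= overlap) under the independence null.

    n_days:  days on which both stations reported
    rainy_a: rainy days at station A
    rainy_b: rainy days at station B
    overlap: days on which it rained at both stations
    """
    total = comb(n_days, rainy_b)
    upper = min(rainy_a, rainy_b)
    tail = sum(
        comb(rainy_a, k) * comb(n_days - rainy_a, rainy_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: independence predicts about 60 * 55 / 365 = 9 shared days.
p_related = overlap_p_value(365, 60, 55, 40)
p_typical = overlap_p_value(365, 60, 55, 9)
print(f"overlap 40: p = {p_related:.3g}")
print(f"overlap  9: p = {p_typical:.3g}")
```

A very small p-value means the overlap is much larger than chance would produce, so the null hypothesis of independence is rejected for that pair of stations.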

We calculate the p-values for pairs of stations and plot them in the following pictures.

![91.png](fig/9.analy.png)

![92.png](fig/9.analy2.png)

From the first picture, we see that only a small part of the data falls below a certain value, which suggests that we should reject the null hypothesis. In other words, the stations have something in common (perhaps they are geographically close).

The second picture shows the p-values for every pair of stations. From it we can see that there is a group of about 30 stations and a smaller group of about 10 stations that are very similar, which means the stations within each group are related to each other. Some stations in the middle are also somewhat similar.

![93.png](fig/9.3.png)

This picture shows the structure of the dependency matrix, that is, the relationships between stations. We reach a similar conclusion. We also see from the first subfigure that almost all the stations are connected to some degree. This means we can predict precipitation in this area more accurately than in places where the stations are more weakly related.