
# State of Idaho Weather Analysis

This report analyzes historical weather patterns in an area that approximately overlaps the state of Idaho.

The data we use comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** the temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/idaho/united-states/3182" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Idaho.png" src="my_figures/Climate_Idaho.png" /></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius. Although there are fluctuations, there is a clear trend in temperature over the year: January is the coldest month, while July and August are the warmest.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="my_figures/TMIN,TMAX.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to convert millimeters per day to inches per month. According to our analysis the average rainfall is 0.93 mm/day, which translates to about 1.11 inches per month. According to US Climate Data the average rainfall is closer to 0.97 inches per month. However, there is clear agreement that average precipitation is close to constant throughout the year.</p>

<p>&nbsp;<img alt="PRCP.png" src="my_figures/PRCP.png" style="height:450px; width:600px" /></p>
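The unit conversion used above can be checked with a few lines of Python. This is a minimal sketch; the function name and the choice of an average month length of 365.25 / 12 days are assumptions made for illustration.

```python
MM_PER_INCH = 25.4
DAYS_PER_MONTH = 365.25 / 12  # assumed average month length


def mm_per_day_to_inches_per_month(mm_per_day):
    """Convert a daily precipitation rate in mm to a monthly total in inches."""
    return mm_per_day * DAYS_PER_MONTH / MM_PER_INCH


monthly_inches = mm_per_day_to_inches_per_month(0.93)
print(round(monthly_inches, 2))  # about 1.11
```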


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](my_figures/VarExplained1.png)
We see that the top 5 eigen-vectors explain 31% of variance for TMIN, 56% for TOBS and 32% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 25% of the variance.
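The curves above plot the cumulative share of variance captured by the leading eigenvectors of the covariance matrix. The sketch below shows one way to compute such a curve with NumPy. It assumes a complete data matrix (no missing values) and uses synthetic data, so the function name and numbers are illustrative only and are not the values reported here.

```python
import numpy as np


def cumulative_variance_explained(X):
    """Cumulative fraction of variance explained by the top eigenvectors.

    X has one row per (station, year) and one column per day of the year.
    Assumes X contains no missing values.
    """
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic example: one seasonal shape with a random amplitude plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.sin(2 * np.pi * days / 365)
amplitudes = rng.normal(size=(100, 1))
X = amplitudes * seasonal + 0.1 * rng.normal(size=(100, 365))

frac = cumulative_variance_explained(X)
print(frac[:5])  # share explained by the top 1..5 eigenvectors
```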

![VarExplained2.png](my_figures/VarExplained2.png)

The top 5 eigenvectors explain 11% of the variance for PRCP and 15% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 87% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow-depth signal. Based on that, we will dig deeper into the PCA analysis for snow depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
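To illustrate why a decaying integral is smoother than its input, here is a toy simulation. The geometric decay model and its rate are assumptions chosen for illustration, not a description of how SNWD is actually measured. The lag-1 autocorrelation is used as a simple measure of day-to-day smoothness.

```python
import numpy as np


def decaying_integral(snow, decay=0.9):
    """Accumulate daily snowfall while a fixed fraction melts each day."""
    depth = np.zeros(len(snow))
    level = 0.0
    for t, fall in enumerate(snow):
        level = decay * level + fall
        depth[t] = level
    return depth


def lag1_autocorr(x):
    """Correlation between consecutive days; near 1 means a smooth signal."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]


# Synthetic snowfall: it snows on roughly 20% of days, independently.
rng = np.random.default_rng(0)
snow = rng.exponential(scale=10.0, size=2000) * (rng.random(2000) < 0.2)
depth = decaying_integral(snow)

print(lag1_autocorr(snow), lag1_autocorr(depth))
```

In this toy model the snowfall series is close to uncorrelated from day to day, while the accumulated depth is strongly correlated.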

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first eigen-vector itself explains 80% of the variance.

First, we graph the mean and the top eigen-vector.

We observe that the snow season runs from mid-November to the end of March, with snow depth peaking in the middle of February.
![SNWD_mean_eigs.png](my_figures/SNWD_mean_eigs.png)

The first eigen-function (eig1) has almost the same shape as the mean function.

### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](my_figures/SNWD_grid_Pos_coeff1.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](my_figures/SNWD_grid_neg_coeff1.png)
Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](my_figures/SNWD_grid_Pos_coeff2.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](my_figures/SNWD_grid_neg_coeff2.png)

Large positive values of coeff2 correspond to an early snow season (most of the snowfall is before mid-January). Negative values of coeff2 correspond to a late snow season (most of the snow is after mid-January).

#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](my_figures/SNWD_grid_Pos_coeff3.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](my_figures/SNWD_grid_neg_coeff3.png)

Large positive values of coeff3 correspond to a snow season with two spikes: one at the start of February, the other in March. Negative values of coeff3 correspond to a season with a single peak at the end of February.
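The reconstructions shown above combine the mean with a weighted sum of the top eigenvectors. The sketch below shows the projection and reconstruction steps on synthetic data; the helper names and the random orthonormal basis are assumptions for illustration, not the eigenvectors computed in this report.

```python
import numpy as np


def project(x, mean, eigvecs):
    """Coefficients of x on the eigenvectors (orthonormal columns)."""
    return eigvecs.T @ (x - mean)


def reconstruct(mean, eigvecs, coeffs):
    """Approximate a yearly profile as mean + sum_i coeffs[i] * eigvecs[:, i]."""
    return mean + eigvecs @ coeffs


rng = np.random.default_rng(0)
d, k = 365, 3
mean = rng.random(d)
eigvecs, _ = np.linalg.qr(rng.normal(size=(d, k)))  # random orthonormal basis
true_coeffs = np.array([500.0, -120.0, 40.0])
x = mean + eigvecs @ true_coeffs

coeffs = project(x, mean, eigvecs)
approx_1 = reconstruct(mean, eigvecs[:, :1], coeffs[:1])  # eig1 only
approx_3 = reconstruct(mean, eigvecs, coeffs)             # eig1..eig3

print(np.linalg.norm(x - approx_1), np.linalg.norm(x - approx_3))
```

Adding more eigenvectors can only reduce the reconstruction error, which is why the grids above show successive approximations getting closer to the target.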



## Residual and Coefficient analysis

Next we look at the CDF of the residuals. The x-axis is the size of the residual relative to the variance that remains after subtracting the mean, and the y-axis is the probability that the residual is at most that value.

![SNWD_res_2_CDF.png](my_figures/SNWD_res_2_CDF.png)
We can see that for about 40% of the instances the residual is at most about 25%, which means that at least 75% of the variance is explained by the eigenvectors. That is a very good fit.

Hence, we can restrict further analysis to this well-explained 40% of the data. This comes in handy when we are dealing with big data.
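A sketch of how such a residual CDF could be computed is given below. It assumes the residual is the squared norm left after projecting the mean-subtracted vector onto the top eigenvectors, divided by the squared norm of the mean-subtracted vector. The data, the basis, and the helper names are synthetic and hypothetical.

```python
import numpy as np


def residual_ratio(x, mean, eigvecs):
    """Fraction of the mean-subtracted energy not captured by the eigenvectors."""
    centered = x - mean
    residual = centered - eigvecs @ (eigvecs.T @ centered)
    return np.sum(residual**2) / np.sum(centered**2)


def empirical_cdf(values):
    """Sorted values and the fraction of instances at or below each one."""
    xs = np.sort(values)
    ps = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps


rng = np.random.default_rng(0)
d, k, n = 50, 2, 200
mean = rng.random(d)
eigvecs, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal columns
X = mean + rng.normal(scale=5.0, size=(n, k)) @ eigvecs.T + rng.normal(size=(n, d))

ratios = np.array([residual_ratio(x, mean, eigvecs) for x in X])
xs, ps = empirical_cdf(ratios)
share_well_explained = np.mean(ratios <= 0.25)  # CDF value at residual 0.25
print(share_well_explained)
```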

### Geographical distribution

Since the data is geographical, another way of looking at it is on a map.
For each station we compute the average snow depth across all of the years. This indicates how much it snows at the station on average.

Each station is drawn as a circle. The size of the circle denotes the number of measurements we have: the bigger the circle, the more measurements are available. The color denotes the average of the first coefficient.
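The two quantities encoded on the map, the number of measurements and the average first coefficient per station, can be computed by grouping on the station identifier. The following is a small NumPy sketch; the station names and coefficient values are made up for illustration.

```python
import numpy as np


def per_station_summary(station_ids, coeff_1):
    """Count of measurements and mean coeff_1 for each distinct station."""
    stations, inverse = np.unique(station_ids, return_inverse=True)
    counts = np.bincount(inverse)                          # circle size
    means = np.bincount(inverse, weights=coeff_1) / counts  # circle color
    return stations, counts, means


# Hypothetical input: one entry per (station, year).
station_ids = np.array(["A", "B", "A", "C", "B", "A"])
coeff_1 = np.array([100.0, -50.0, 300.0, 10.0, -150.0, 200.0])

stations, counts, means = per_station_summary(station_ids, coeff_1)
for name, count, mean in zip(stations, counts, means):
    print(name, count, mean)
```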

![Map of Average snow depth.png](my_figures/Map of Average snow depth.png)


From the map we can see that the forest areas (shown as green regions on the map) receive less snow than other regions.

## The variation in the timing of snow is mostly due to year-to-year variation

We then try to figure out what affects snowfall more: location or year. To do so, we create a table with stations as columns, years as rows, and the values of the second coefficient as the entries.

Missing entries (NaNs) in the table are not filled in; all averages are computed over the available entries only.

To estimate the effect of time vs. location on the first eigenvector coefficient we
compute:

* The average row: `mean-by-station`
* The average column: `mean-by-year`

We take the NaN-aware mean of each year, which tells us how large the coefficient is in that year on average, and treat it as the effect of the year. Subtracting this mean from every entry of that year leaves a remainder that we regard as noise. The same is done per station.

The resulting RMS values are:
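The procedure can be sketched as follows. The table here is synthetic, with a deliberately strong year effect and a weak station effect, and RMS is assumed to mean the square root of the NaN-aware mean of squares. The function names are hypothetical.

```python
import numpy as np


def rms(table):
    """Root mean square over the available (non-NaN) entries."""
    return np.sqrt(np.nanmean(table**2))


def rms_report(table):
    """table: rows = years, columns = stations, NaN where data is missing."""
    mean_by_station = np.nanmean(table, axis=0)  # one value per column
    mean_by_year = np.nanmean(table, axis=1)     # one value per row
    return {
        "total": rms(table),
        "minus_station": rms(table - mean_by_station),
        "minus_year": rms(table - mean_by_year[:, np.newaxis]),
    }


rng = np.random.default_rng(0)
year_effect = rng.normal(scale=5.0, size=(30, 1))
station_effect = rng.normal(scale=1.0, size=(1, 20))
table = year_effect + station_effect + rng.normal(scale=2.0, size=(30, 20))
table[rng.random(table.shape) < 0.1] = np.nan  # missing entries stay missing

report = rms_report(table)
print(report)
```

In this synthetic table, removing the per-year mean reduces the RMS much more than removing the per-station mean, which is the same pattern of comparison used for the real data.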

* total RMS                   =  848.69917793
* RMS removing mean-by-station=  828.268337939
* RMS removing mean-by-year   =  540.693118362

Removing the mean-by-station lowers the RMS by only about 2% (from 848.7 to 828.3), while removing the mean-by-year lowers it by about 36% (to 540.7).
We conclude that the year-to-year variation is much bigger than the station-to-station variation.

### Analysis of correlation between precipitation across locations

### Definition of statistical test
Analysing the correlation between precipitation across locations is difficult because the percentage of variance explained by the eigenvectors is poor. With a residual of 0.9, the eigenvectors account for only 10% of the original variance, which gives them very little explanatory power.
We have data for 140 stations, and most stations produce very little data. Subtracting the mean tells us how much of the error we remove. An alternative is to ignore the amount of rain and just ask whether it rained at both locations on the same day. We can then compute the probability of the observed number of overlaps under the null hypothesis.
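One standard way to attach a probability to the number of overlapping rain days is a hypergeometric tail, sketched below with the standard library only. This is an assumption about the form of the test; the exact normalization used for the figures in this report is not shown here, and the function name is hypothetical.

```python
from math import comb


def overlap_p_value(n_days, rain_a, rain_b, overlap):
    """P(overlap >= observed) if rain days at A and B were independent.

    n_days:  days on which both stations reported
    rain_a:  rainy days at station A
    rain_b:  rainy days at station B
    overlap: days on which it rained at both stations
    """
    upper = min(rain_a, rain_b)
    total = comb(n_days, rain_b)
    tail = sum(
        comb(rain_a, j) * comb(n_days - rain_a, rain_b - j)
        for j in range(overlap, upper + 1)
    )
    return tail / total


# 100 shared days, 30 rainy days at each station.
p_typical = overlap_p_value(100, 30, 30, 9)   # overlap near its expected value
p_extreme = overlap_p_value(100, 30, 30, 25)  # far more overlap than expected
print(p_typical, p_extreme)
```

A very small value for a pair of stations indicates that their rain days coincide far more often than independence would allow.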

### Null Hypothesis
We want a statistical test for rejecting the null hypothesis that the rainfall at the two locations is independent. To do this, we calculate the normalized log probability for each pair of stations. We then draw a correlation matrix of the p-values for all pairs of stations. Each entry indicates how strongly the rain sequences at the two stations are related.

![hypothesis.png](my_figures/hypothesis.png)

After testing the hypothesis for every unique pair of stations, we plot a histogram of the resulting p-values. The null hypothesis is that whether it rains on a given day is independent between the two stations in question.

The p-value is the probability, under the null hypothesis, of seeing a result at least as extreme as the one given by the data. So a low p-value means that such a result is unlikely under the null hypothesis. We can see that many station pairs have significant p-values. For those pairs the data is extremely unlikely under the null hypothesis, and hence we reject it.

### Correlations matrix
![correlation_matrix.png](my_figures/correlation_matrix.png)


The matrix above shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.
We see immediately that the first 8 stations are highly correlated with each other.

### PCA analysis

To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). As we shall see, the top 10 eigenvectors explain about 80% of the squared magnitude of the matrix.

![pca.png](my_figures/pca.png)
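The share of squared magnitude explained by the leading components follows from the singular values, because the sum of their squares equals the squared Frobenius norm of the matrix. Below is a minimal sketch on a synthetic matrix; the function name and data are illustrative assumptions.

```python
import numpy as np


def energy_explained(matrix):
    """Cumulative share of squared magnitude captured by the top components."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    energy = singular_values**2
    return np.cumsum(energy) / energy.sum()


# Synthetic matrix: rank 2 plus a little noise.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 20))
M = M + 0.01 * rng.normal(size=(20, 20))

share = energy_explained(M)
print(share[:3])
```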

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, consider the upper-left corner of the second matrix (the upper-right panel). The stations at positions 0-8 are clearly strongly correlated with each other.
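Reordering by an eigenvector amounts to sorting the stations by their entry in that vector and applying the same permutation to rows and columns. The sketch below uses a toy matrix with two interleaved groups of stations; the group labels and helper name are hypothetical.

```python
import numpy as np


def reorder_by_vector(matrix, vector):
    """Apply the sort order of `vector` to both rows and columns of `matrix`."""
    order = np.argsort(vector)
    return matrix[np.ix_(order, order)], order


# Toy matrix: stations in the same group are fully correlated (1), others 0.
labels = np.array([0, 1, 0, 1, 1, 0, 1, 1, 0, 1])
M = (labels[:, None] == labels[None, :]).astype(float)

U, s, Vt = np.linalg.svd(M)
reordered, order = reorder_by_vector(M, U[:, 0])  # sort by the top vector

print(labels[order])  # the two groups are now contiguous
```

After sorting, each group occupies a contiguous block on the diagonal, which is what makes the clusters visible in the reordered matrices above.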