# Arizona Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the state of Arizona.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation for each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Boston_-_Massachusetts_and_Weather_averages_Boston.jpg" src="report_figs/arz.png" /></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily temperatures agree with the ones we got from our data.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="report_figs/mean_std_1.png" style="height:300px; width:800px" /></p>

<p>According to our analysis, the average rainfall is 19 mm/day, which is close to the average obtained from US Climate Data. However, the average rainfall dips from April to June; for the rest of the year it is close to constant.</p>

<p>&nbsp;<img alt="PRCP.png" src="report_figs/prcp_1.png" style="height:450px; width:600px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigen-vectors used.

### Percentage of variance explained.
![VarExplained1.png](report_figs/var_exp_1.png)
We see that the top 5 eigen-vectors explain 73% of variance for TMIN, 71% for TOBS and 61% for TMAX.

We conclude that of the three, TMIN is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains 69% of the variance.

![VarExplained2.png](report_figs/var_exp_2.png)

The top 5 eigenvectors explain 8.7% of the variance for PRCP and 15% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 82% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow-depth signals. Based on that, we will dig deeper into the PCA analysis for snow-depth.
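
Percentages like these come from the cumulative sum of the eigenvalues of the covariance matrix. The following is a minimal sketch in Python with NumPy, not the code used for this report. It assumes a hypothetical array `X` with one row per station-year and one column per day of the year, and it assumes there are no missing values, which the real data does not guarantee.

```python
import numpy as np

def explained_variance(X):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    X: hypothetical array of shape (n_samples, n_days) with no missing values.
    """
    Xc = X - X.mean(axis=0)                   # center each day-of-year column
    cov = Xc.T @ Xc / X.shape[0]              # covariance matrix (n_days x n_days)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0, None)       # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic stand-in for real measurements: one seasonal pattern plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
pattern = np.sin(2 * np.pi * days / 365)
X = rng.normal(size=(200, 1)) * pattern + 0.1 * rng.normal(size=(200, 365))

frac = explained_variance(X)
print(f"top 1: {frac[0]:.1%}, top 5: {frac[4]:.1%}")
```

Entry `frac[k-1]` is the fraction explained by the top `k` eigenvectors, so the "top 5" figures quoted above correspond to `frac[4]` for each measurement.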

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
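
To illustrate why a decaying integral is smoother than its input, here is a toy sketch. The exponential decay model, the `decay` rate, and the synthetic snowfall are all assumptions made for illustration; the data itself does not tell us how snow depth is related to snowfall.

```python
import random

def decaying_integral(snow, decay=0.9):
    """Toy model: depth[t] = decay * depth[t-1] + snow[t]."""
    depth, level = [], 0.0
    for s in snow:
        level = decay * level + s
        depth.append(level)
    return depth

def relative_day_to_day_change(series):
    """Mean absolute day-to-day change divided by the mean level."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    return (sum(diffs) / len(diffs)) / (sum(series) / len(series))

random.seed(0)
# Hypothetical snowfall: it snows on roughly 1 day in 5, with a random amount.
snow = [random.uniform(0, 50) if random.random() < 0.2 else 0.0 for _ in range(120)]
depth = decaying_integral(snow)

print(relative_day_to_day_change(snow))   # spiky input
print(relative_day_to_day_change(depth))  # smoother accumulated series
```

In this toy model the accumulated series changes far less from day to day, relative to its level, than the snowfall series that drives it.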

## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season runs from mid-November to the end of March, with the snow depth peaking in the middle of February.
![SNWD_mean_eigs.png](report_figs/mean_eigs_1.png)

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during November while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above or below the mean, without changing its distribution over time.

**eig2, eig3 and eig4** are similar in the following way: they all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow from January to mid-February, more snow from mid-February to April, much less snow in December
* **eig3:** more snow in January, less snow from February to April, much less snow in December
* **eig4:** more snow in January, much less snow from the start of February to mid-March, more snow from mid-March to April


### Examples of reconstructions

#### Coeff1
Coeff1: most positive
![SNWD_grid_Pos_coeff1.png](report_figs/recon_c1_max.png)
Coeff1: most negative
![SNWD_grid_neg_coeff1.png](report_figs/recon_c1_min.png)
All the values of coeff1 (including the largest and smallest) correspond to more snow than average. Larger values of coeff1 correspond to a greater snow depth during the Jan-Feb season, whereas the most negative values of coeff1 favour more snow at the end of the year (Nov-Dec).

#### Coeff2
Coeff2: most positive
![SNWD_grid_Pos_coeff2.png](report_figs/recon_c2_max.png)
Coeff2: most negative
![SNWD_grid_neg_coeff2.png](report_figs/recon_c2_min.png)

Large positive values of coeff2 correspond to much less snow at the end of the year (Nov-Dec), whereas negative values of coeff2 favour more snow towards the end of the year.
#### Coeff3
Coeff3: most positive
![SNWD_grid_Pos_coeff3.png](report_figs/recon_c3_max.png)
Coeff3: most negative
![SNWD_grid_neg_coeff3.png](report_figs/recon_c3_min.png)

Large positive values of coeff3 correspond to a snow season with an early onset, starting around January, whereas the negative values display a peak from mid-February to March. Otherwise, there is little coherence between the plots for the largest and smallest values of coeff3.
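
Each reconstruction of this kind is the mean plus a weighted sum of the top eigenvectors, where the weights (coeff1, coeff2, ...) are the projections of the mean-subtracted signal onto each eigenvector. A minimal sketch, assuming orthonormal eigenvectors stored as the columns of a hypothetical array `eigvecs` and a signal with no missing values; the function name `reconstruct` is introduced here only for illustration.

```python
import numpy as np

def reconstruct(signal, mean, eigvecs, k):
    """Approximate `signal` using the mean and the top-k eigenvectors.

    eigvecs: array of shape (n_days, n_vectors) with orthonormal columns,
    ordered from largest to smallest eigenvalue (an assumption of this sketch).
    Returns the coefficients and the reconstructed signal.
    """
    V = eigvecs[:, :k]
    coeffs = V.T @ (signal - mean)   # coeff1, coeff2, ... for this station-year
    return coeffs, mean + V @ coeffs

# Synthetic example: an orthonormal basis from QR and a signal inside its span.
rng = np.random.default_rng(1)
eigvecs, _ = np.linalg.qr(rng.normal(size=(365, 3)))
mean = np.zeros(365)
signal = mean + 40.0 * eigvecs[:, 0] - 5.0 * eigvecs[:, 1]

coeffs, approx = reconstruct(signal, mean, eigvecs, k=2)
print(np.round(coeffs, 3))
print(np.allclose(approx, signal))
```

With real data the signal does not lie exactly in the span of the top eigenvectors, so the reconstruction is only approximate and improves as `k` grows.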



### Spatial and temporal variation in snowfall depth
In the previous section we see the variation of Coeff1, which corresponds to the total amount of snow, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.<br><br>
These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively.<br>
To estimate the effect of time vs. location on the first eigenvector coefficient we
compute:

* The average row: `mean-by-station`
* The average column: `mean-by-year`

We then compute the mean square (MS) before and after subtracting either the row or the column vector.
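
The procedure can be sketched as follows. This is an illustration, not the code used for the report: it assumes a hypothetical table with one row per year and one column per station, with `NaN` where a station has no data for a year, and it uses NaN-aware averages throughout.

```python
import numpy as np

def fraction_explained(table):
    """table: 2-D array, rows = years, columns = stations, NaN = missing.

    Returns (total MS, % explained by mean-by-station, % explained by mean-by-year).
    """
    def ms(a):
        return np.nanmean(a ** 2)

    total = ms(table)
    mean_by_station = np.nanmean(table, axis=0)        # the average row
    mean_by_year = np.nanmean(table, axis=1)           # the average column
    ms_station = ms(table - mean_by_station)           # subtract the row from every row
    ms_year = ms(table - mean_by_year[:, np.newaxis])  # subtract the column from every column
    return total, 100 * (1 - ms_station / total), 100 * (1 - ms_year / total)

# Synthetic table in which the station effect dominates the year effect.
rng = np.random.default_rng(2)
station_effect = rng.normal(scale=10, size=(1, 8))
year_effect = rng.normal(scale=2, size=(30, 1))
table = station_effect + year_effect + rng.normal(scale=1, size=(30, 8))
table[rng.random(table.shape) < 0.1] = np.nan          # knock out 10% of the entries

total, by_station, by_year = fraction_explained(table)
print(f"total MS = {total:.1f}, by station = {by_station:.1f}%, by year = {by_year:.1f}%")
```

The "fraction explained" values reported below follow the same formula, `100 * (1 - MS_after / MS_total)`. For coeff_1, for example, `100 * (1 - 912531.911416 / 3291874.29605)` is approximately 72.3.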

#### coeff_1
total MS                   = 3291874.29605<br>
MS removing mean-by-station= 912531.911416, fraction explained= 72.3<br>
MS removing mean-by-year   = 1953090.63535, fraction explained= 40.7<br>
#### coeff_2
total MS                   = 497559.927705<br>
MS removing mean-by-station= 414501.987562, fraction explained= 16.7<br>
MS removing mean-by-year   = 251941.741568, fraction explained= 49.4<br>
#### coeff_3
total MS                   = 331986.760741<br>
MS removing mean-by-station= 298161.572536, fraction explained= 10.2<br>
MS removing mean-by-year   = 161812.53397, fraction explained= 51.3<br>

coeff_1 is driven more by station-to-station variation (~72% of the variance explained) than by year-to-year variation (~41%). This pattern is reversed for coeff_2 and coeff_3, where around 50% of the variance is explained by year-to-year (temporal) variation.

Recall that coeff_1 has to do with the total snowfall, while coeff_2 and coeff_3, as we saw above, have to do with the timing of snowfall. We see that for coeff_2 and coeff_3 the stations explain 10-17% of the variance while the year explains ~50%, whereas for coeff_1 the stations explain ~72% of the variance while the year explains ~41%.<br>
In other words, the amount of snowfall is determined mostly by location, whereas the timing of the snow season is determined mostly by the year.

### Cumulative Residual 
The following plots show the cumulative distribution of the residuals for the corresponding coefficients.<br>
Residuals are obtained by subtracting in sequence: the mean, the projection on the first eigen-vector, the projection on the second eigen-vector, etc.
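
The sequential subtraction can be sketched as below. The normalization (residual squared norm divided by the squared norm of the original signal) is an assumption of this sketch, as are the names `residual_norms` and `empirical_cdf` and the synthetic data; the report does not state how the plotted residuals were normalized.

```python
import numpy as np

def residual_norms(signal, mean, eigvecs, k):
    """Return [res_mean, res_1, ..., res_k]: the residual squared norm, as a
    fraction of the signal's squared norm, after removing the mean and then
    the projection on each of the first k eigenvectors in turn."""
    total = np.sum(signal ** 2)
    residual = signal - mean
    out = [np.sum(residual ** 2) / total]
    for i in range(k):
        v = eigvecs[:, i]                          # assumed to have unit norm
        residual = residual - (v @ residual) * v   # remove projection on eigenvector i
        out.append(np.sum(residual ** 2) / total)
    return out

def empirical_cdf(values):
    """Sorted values and the fraction of samples at or below each one."""
    x = np.sort(np.asarray(values))
    return x, np.arange(1, len(x) + 1) / len(x)

# Synthetic signals: a constant mean, three orthonormal directions, and noise.
rng = np.random.default_rng(3)
eigvecs, _ = np.linalg.qr(rng.normal(size=(365, 3)))
mean = np.full(365, 2.0)
signals = [mean + rng.normal(scale=[30, 10, 3]) @ eigvecs.T
           + rng.normal(scale=0.5, size=365) for _ in range(100)]

res = np.array([residual_norms(s, mean, eigvecs, 3) for s in signals])
x, p = empirical_cdf(res[:, 1])   # cumulative distribution of residual_1
```

Plotting `p` against `x` for each column of `res` gives one cumulative distribution curve per residual.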

Residual_1:
![SNWD_grid_Pos_coeff1.png](report_figs/cdf_res1.png)
Residual_2:
![SNWD_grid_Pos_coeff1.png](report_figs/cdf_res2.png)
Residual_3: 
![SNWD_grid_Pos_coeff1.png](report_figs/cdf_res3.png)

### Geographical distribution of first coefficient
The following map shows the strength of snowfall at the geographical locations of the stations. The size of each circle corresponds to the number of data points for the station, and the color denotes the strength of snowfall in the region.

<table><tr><td><img src='report_figs/bar.jpg' width='270'></td><td><img src='report_figs/map5.png'></td></tr></table>

## Analysis of correlation between precipitation across locations

### Definition of statistical test
It is likely to be hard to find correlations between the amount of rain on the same day at different stations, because amounts of rain vary a lot even between nearby locations. It is more reasonable to compare whether or not it rained on the same day at different stations. As we see from the graph below, in our region it rains on about 15% of the days.
![report_figs/cdf_prcp.png](report_figs/cdf_prcp.png)


### Correlations Matrix
The matrix below shows, for each pair of stations, the normalized log probability that the overlap in rain days is random.
We see that there is very little correlation between stations: there are some black pixels, but they are scattered and few in number.
![report_figs/correl_mat.png](report_figs/correl_mat.png)
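
One way to quantify "the overlap in rain days is random" is a hypergeometric tail probability. The sketch below is an assumption about how such a test could be written, not the computation used to produce the matrix above; in particular, the normalization by the number of days and the function name `log_prob_overlap` are choices made here for illustration.

```python
from math import comb, log

def log_prob_overlap(n_days, rain_a, rain_b, overlap):
    """Log-probability of at least `overlap` shared rain days if the rain days
    of two stations were placed independently at random among `n_days` days
    (hypergeometric tail), divided by n_days as a simple normalization."""
    total = comb(n_days, rain_b)
    tail = sum(comb(rain_a, j) * comb(n_days - rain_a, rain_b - j)
               for j in range(overlap, min(rain_a, rain_b) + 1))
    # Exact integer counts; take logs separately to avoid floating-point underflow.
    return (log(tail) - log(total)) / n_days

# Hypothetical pair of stations with rain on 55 of 365 days each (about 15%).
print(log_prob_overlap(365, 55, 55, 8))    # overlap near what chance predicts
print(log_prob_overlap(365, 55, 55, 30))   # overlap far above chance
```

Values close to zero mean the overlap is consistent with chance, while strongly negative values indicate that the two stations tend to have rain on the same days.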

### PCA Analysis
To find more correlations we use SVD (the term PCA is reserved for decomposition of the covariance matrix). As the plot below shows, the top 10 eigenvectors explain about 82% of the square magnitude of the matrix.
![report_figs/pca_prcp](report_figs/pca_prcp.png)

The plots below show the top 4 PCA components of the dependency matrix. Each plot is sorted by the values of the k-th component; for instance, the 2nd plot is sorted by the values of the 2nd PCA component.
![report_figs/plot_top_eigen_prcp.png](report_figs/plot_top_eigen_prcp.png)

When we reorder the rows and columns of the matrix using one of the eigenvectors, the grouping of the stations becomes more evident. For example, in the upper left corner of the first matrix in the figure below, the stations at positions 0-3 are clearly strongly correlated with each other.
This type of organization is called block diagonal, and it typically reveals important structure such as grouping or clustering.
![report_figs/correl_top_k.png](report_figs/correl_top_k.png)
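
The reordering step can be sketched as follows, assuming a symmetric matrix `M` of pairwise station scores. The synthetic matrix and the function name `reorder_by_component` are hypothetical and only illustrate how sorting by a singular vector brings related stations next to each other.

```python
import numpy as np

def reorder_by_component(M, k):
    """Sort the rows and columns of a symmetric matrix M by the entries of its
    k-th left singular vector (k = 0 is the top component)."""
    U, S, Vt = np.linalg.svd(M)
    order = np.argsort(U[:, k])
    return order, M[np.ix_(order, order)]

# Synthetic matrix: stations 0, 2, 4 form one group and 1, 3, 5 another,
# so the block structure is hidden by the interleaved ordering.
groups = np.array([0, 1, 0, 1, 0, 1])
same = groups[:, None] == groups[None, :]
strength = np.where(groups == 0, 1.0, 0.5)   # the first group is more strongly related
M = same * strength[:, None]

order, reordered = reorder_by_component(M, 0)
print(groups[order])   # members of each group are now adjacent
print(reordered)       # block-diagonal layout
```

The sign of a singular vector is arbitrary, so the groups may appear in either order along the diagonal; only their contiguity matters.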

### Geographical distribution of top pca components
We extract the sets of stations that form blocks for our region, and then plot them on the map to see their spatial relationship. We observe that the top three stations, marked 1, 2 and 3, are geographically close to each other and hence have high correlation values. Stations 4 and 5 are relatively distant from these three stations and hence show lower correlation values.<br>
![report_figs/map_top_prcp.png](report_figs/map_top_pca.png)
