# North East Arizona Data Analysis

This report analyzes weather patterns in the northeastern region of Arizona. A satellite view of the area being examined is shown in the figure below.

<img src="r_figures/satelliteView.png" alt="Drawing" style="width: 300px;"/>


As is evident from the satellite view, the area lies in a high-elevation belt. Exploratory, statistical, and principal component analyses were performed on the data and are presented in this notebook.

[NOAA](https://www.ncdc.noaa.gov/) is the source of the dataset being studied. The focus is on six climate measurements: TMAX, TMIN, TOBS, PRCP, SNOW, and SNWD. Other climate-related attributes, such as wind and soil measurements, were not considered because of their relative sparsity in the dataset.


# Data Visualization and Sanity Checks


The first set of visualizations shows how each climate measurement varies across stations in the region. I selected the six stations with the richest data coverage and plotted their temperature and precipitation measurements for a particular year. Since the snow depth data is sparser in the summer months, I plotted it separately for 10 different stations. The plots are shown below.


<img src="r_figures/Temp_PRCP_SnW.jpeg" alt="Drawing" style="width: 1500px;"/>


As can be observed from the figures above, the temperatures at these six stations do not differ much from the mean temperature index. Precipitation and snow, on the other hand, have much greater variance across stations.

The next set of visualizations fixes the most famous station in the region (Payson, AZ) in order to observe how the climate parameters there vary across years. The plots below show temperature, precipitation, and snow for Payson, AZ over six consecutive years.


<img src="r_figures/pst_payson.jpeg" alt="Drawing" style="width: 1500px;"/>

For the Payson station, the temperature curves have a low standard deviation compared to precipitation and SNWD. The time of arrival of snow follows a similar pattern over the years, but the volume shows greater variance.

It can be observed from the two sets of plots above that the standard deviation (and hence the variance) is greater across stations than across years.

These points become more evident when PCA is performed on the data.

## Distribution of Climate Parameters

I then observed the distribution of the climate parameters over the whole dataset. This step also helps in understanding and interpreting the tests performed on the data.

<img src="r_figures/distribution_climate.jpeg" alt="Drawing" style="width: 1500px;"/>

It can be observed that temperature follows an approximately normal distribution: a single Gaussian curve fits the entire region, with the TMAX mean around temperature index 200. The distribution of precipitation is hyperbolic in shape, with the frequency falling off roughly exponentially as the amount of rain increases. The distribution of snow is somewhat uniform with slight fluctuations (i.e., different levels of snow depth are observed almost uniformly).

## Plotting Hottest and Coldest Stations

As a sanity check, and to observe the spread in the data, I extracted the hottest stations (highest TMAX) and the coldest stations (lowest TMIN) and plotted them on the map. The resulting regions and their geography could then be used to check the sanity of the data.

<img src="r_figures/hotColdRegion.png" alt="Drawing" style="width: 500px;"/>

The hottest and coldest locations reported were Bartlett Dam and Sunrise Mountain, respectively. I checked the data provided by NOAA for them against various climate databases on the internet, and the information was consistent for this region.
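The extraction step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `hottest_and_coldest`, the `(station, tmax, tmin)` record layout, and the station names are all hypothetical, and the values are synthetic.

```python
def hottest_and_coldest(records):
    """Return the station with the highest TMAX and the one with the lowest TMIN.

    `records` is assumed to be a list of (station, tmax, tmin) tuples, where
    tmax and tmin are the most extreme readings seen at that station.
    """
    hottest = max(records, key=lambda r: r[1])[0]
    coldest = min(records, key=lambda r: r[2])[0]
    return hottest, coldest


# Synthetic records with placeholder station names (not values from the report).
records = [
    ("station_a", 410, -50),
    ("station_b", 350, -220),
    ("station_c", 380, -120),
]
hottest, coldest = hottest_and_coldest(records)
print(hottest, coldest)
```

The two station identifiers can then be looked up against an external climate source, as was done for the sanity check.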

<img src="r_figures/hot_cold.jpeg" alt="Drawing" style="width: 800px;"/>

Another observation was that certain regions experienced both temperature extremes. These regions have overlapping heat and cold circles on the map.


<img src="r_figures/similar_hot_cold.png" alt="Drawing" style="width: 200px;"/>


## Distribution of Valid Counts

The above climate parameters are not observed on every day of the year at all stations in the region. Depending on the time of year, different climate parameters have different distributions of valid counts.


<img src="r_figures/valid_distribution.jpeg" alt="Drawing" style="width: 500px;"/>

From the above plots it can be observed that the valid counts for snow dip during the winter months (this could be due to the difficulty of measuring snow depth in certain high-altitude regions). The valid counts for the temperature curves dip around 2-3 times a year, with the major dips in January and July (the peak winter and summer months, respectively). The precipitation measurements suffer a major dip in October (during the heavy monsoon time in the region).
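A valid count of this kind can be computed along the following lines. The sketch assumes that each measurement is stored as a 2-D array with one row per station-year and one column per day, with `NaN` marking missing observations; that layout and the function name `valid_counts_per_day` are assumptions for illustration.

```python
import numpy as np


def valid_counts_per_day(measurements):
    """Count the non-missing observations for each day of the year.

    `measurements` is assumed to be a 2-D array with one row per
    (station, year) pair and one column per day, with NaN for missing days.
    """
    measurements = np.asarray(measurements, dtype=float)
    return np.sum(~np.isnan(measurements), axis=0)


# Toy example: 3 station-years, 5 days each (synthetic values).
toy = np.array([
    [1.0, np.nan, 3.0, 4.0, np.nan],
    [2.0, 2.5, np.nan, 4.5, np.nan],
    [np.nan, 1.5, 2.0, 3.5, np.nan],
])
counts = valid_counts_per_day(toy)
print(counts)
```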















# Statistical Analysis

After the data exploration and sanity checks, I had a fair idea of the climate behavior and patterns for the geography concerned. The next step was to perform statistical analysis on the data. The mean, variance, and standard deviation of the climate parameters are presented below.

<img src="r_figures/stats_climate.jpeg" alt="Drawing" style="width: 300px;"/>

It can be observed that SNWD has much lower variance than the SNOW measurement. This is because snow depth is a cumulative measure of snow and therefore has fewer spikes and fluctuations, making it an easier and better parameter for capturing the variance in the snow data. All three temperature parameters (TMIN, TMAX, and TOBS) have similar normal distributions, with a low standard deviation.
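Per-day statistics such as these have to skip missing observations. A minimal sketch, assuming the same station-year by day array layout with `NaN` for missing values (the function name `daily_stats` is hypothetical):

```python
import numpy as np


def daily_stats(measurements):
    """Return the per-day mean and standard deviation, ignoring NaN entries.

    `measurements` is assumed to have one row per station-year and one
    column per day of the year.
    """
    measurements = np.asarray(measurements, dtype=float)
    mean = np.nanmean(measurements, axis=0)
    std = np.nanstd(measurements, axis=0)  # population standard deviation
    return mean, std


# Synthetic example: 3 station-years, 3 days.
toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 2.0, 4.0],
    [5.0, np.nan, 8.0],
])
mean, std = daily_stats(toy)
print(mean, std)
```

The `mean + std` and `mean - std` curves can be drawn from these two arrays; the variance is simply `std ** 2`.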

However, the precipitation parameter follows an interesting curve. The standard deviation (σ) decreases during the summer months, almost squeezing the +σ and -σ curves together. Immediately afterwards, σ, and correspondingly the variance, increases drastically (a broadening of the curve), making the mean a really bad estimate of the rain during the monsoon months.






# PCA Analysis

## Analyzing top eigenvectors for the climate parameters

After the above statistical analysis, it was evident that it would be difficult to capture all the variance in the precipitation and snow data, while the temperature and snow depth parameters would be better candidates for representation by a small number of high-variance components. The plots of the top 3 eigenvectors for the above climate parameters are presented below:

<img src="r_figures/top3EV.jpeg" alt="Drawing" style="width: 400px;"/>

The semantics behind the top eigenvectors for the climate parameters are illustrated in the following sections.

## Variance Explained by Top Eigenvectors

The statistical analysis section above showed that precipitation and snow are difficult candidates for capturing variance, whereas SNWD and the temperature parameters can be decomposed into a much smaller number of high-variance components. This is verified by the variance explanation plots for the first three eigenvectors of the different parameters, shown in the figures below.

<img src="r_figures/Explanation.png" alt="Drawing" style="width: 400px;"/>


<img src="r_figures/Temperatur_Explanation.png" alt="Drawing" style="width: 400px;"/>
 
In the case of snow and precipitation, the first five eigenvectors capture 15% and 8% of the variance, respectively, while in the case of SNWD and temperature, the first five eigenvectors explain 85% and 75% of the variance, respectively.
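The fraction of variance explained by the top eigenvectors can be computed from the eigenvalues of the covariance matrix. The sketch below is an illustration under simplifying assumptions: the data has no missing values, rows are station-years, and the function name `explained_variance` is hypothetical. The synthetic data is built so that one pattern dominates.

```python
import numpy as np


def explained_variance(data, k):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    `data` is assumed to be a complete (no NaN) array with one row per
    station-year and one column per day.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    return np.cumsum(eigvals)[:k] / eigvals.sum()


# Synthetic data: one dominant seasonal pattern plus a little noise.
rng = np.random.default_rng(0)
scores = rng.normal(size=(50, 1))
pattern = np.sin(np.linspace(0, np.pi, 12))
data = scores * pattern + 0.01 * rng.normal(size=(50, 12))

frac = explained_variance(data, 3)
print(frac)
```

For data like SNWD or temperature the curve rises quickly, as in this toy case; for spiky data like PRCP it rises slowly.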


The semantics of the eigenvectors of the snow and precipitation data are illustrated in this section. The figure below shows how the eigenvectors spread relative to the mean in an effort to capture the variance.

<img src="r_figures/prcp_snow.jpeg" alt="Drawing" style="width: 600px;"/>

## Plots (Variance Explained vs Eigenvectors)

<img src="r_figures/prcpEV.jpeg" alt="Drawing" style="width: 400px;"/>

This figure depicts how much variance is captured as the number of eigenvectors increases for the PRCP data (the images are in decreasing order). Even with as many as 100 eigenvectors, the variance explained is still less than 70%.


In contrast, in the case of SNWD, as few as 10 eigenvectors explain around 80% of the variance.

<img src="r_figures/prcpEV.jpeg" alt="Drawing" style="width: 200px;"/>


# Semantics behind Eigenvectors

## SNOW

### Coefficient 1:  Volume of the Snow 

To identify the meaning of each coefficient, the first step is to plot the data for which the coefficient is highest and lowest. After sorting the low-error cases by ascending and descending coefficient 1 values, the targets and their approximations by the eigenvectors are presented below.


<img src="r_figures/LVHV.jpeg" alt="Drawing" style="width: 1000px;"/>


It can be observed that in the left section (high positive coefficients) the eigenvectors capture the high volume in the snow data, while the data in the right section (low positive coefficients) corresponds to low-volume snow regions.


<img src="r_figures/coeff0.jpeg" alt="Drawing" style="width: 2000px;"/>

To further verify and illustrate this point, I took an example for which the error reduced drastically, i.e., the first eigenvector captured the majority of the variance. The behavior of the eigenvectors in approximating the function at different instances of increasing c0 is presented above.

It is evident that coefficient 1 captures the volume of snow in the measurement.
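The coefficients and residual errors used throughout this section can be computed along the following lines. This is a sketch under stated assumptions: the eigenvectors are orthonormal and stored as columns, the series has no missing values, and the residual is expressed as a fraction of the variance left after removing the mean. The function name `project_onto_eigenvectors` is hypothetical.

```python
import numpy as np


def project_onto_eigenvectors(series, mean, eigvecs):
    """Project one station-year onto the top eigenvectors.

    Returns the coefficients and, for each k, the fraction of the
    mean-removed squared error that remains after using the first k
    eigenvectors (assumed orthonormal, stored as columns of `eigvecs`).
    """
    centered = np.asarray(series, dtype=float) - mean
    coeffs = eigvecs.T @ centered
    total = np.sum(centered ** 2)
    approx = np.zeros_like(centered)
    residuals = []
    for i in range(eigvecs.shape[1]):
        approx = approx + coeffs[i] * eigvecs[:, i]
        residuals.append(np.sum((centered - approx) ** 2) / total)
    return coeffs, np.array(residuals)


# Toy example with two trivially orthonormal "eigenvectors".
eigvecs = np.eye(4)[:, :2]
mean = np.zeros(4)
series = np.array([3.0, 4.0, 0.0, 0.0])
coeffs, residuals = project_onto_eigenvectors(series, mean, eigvecs)
print(coeffs, residuals)
```

Sorting station-years by one coefficient while keeping only those with a small final residual gives the kind of high/low comparison plotted in this section.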

### Coefficient 2:  Start of the Snow Season

Following the same process, I plotted the data for which the second coefficient was lowest and highest (keeping only the points with low residual error).

<img src="r_figures/np.jpeg" alt="Drawing" style="width: 1000px;"/>

In the figure above, the left section represents the lowest values of coeff2 (strongly negative) and the right section represents the highest values of coeff2 (strongly positive). It can be observed that in the left subfigure the eigenvectors capture an early snow season, while the data corresponding to high positive coefficients have a late snow season.

### Coefficient 3: Spikes in the Snow Fall

Again, the plots of the data with high and low coefficients for the third eigenvector are shown below.

<img src="r_figures/coeff3.jpeg" alt="Drawing" style="width: 1000px;"/>

On the right side, it is evident that the error is reduced, as the third eigenvector is able to capture the variance caused by the sudden spikes in snowfall that occur a couple of times in the plot. On the left side, however, the eigenvector is unable to capture the spikes in the target measurements.

<img src="r_figures/coeff3change.jpeg" alt="Drawing" style="width: 1000px;"/>


To further verify and illustrate this, I considered an example for which the res3 error reduced drastically from res2. The behavior of the eigenvectors in approximating the function at different instances of c2 is presented above.

It is evident that Coefficient 3 tries to capture the sudden spikes in the measurement.


## Precipitation

### Coefficient 1:  Sparsity of the Rainfall 

Following the same method, the plots corresponding to the low and high values of coefficient 1 are presented in the image below:

<img src="r_figures/sparseno.jpeg" alt="Drawing" style="width: 600px;"/>

It is evident that the rainfall in the regions in the left section is much sparser than in the regions on the right, which have dense spells of rainfall.

### Coefficient 2 : Sporadic nature of the Rainfall / Spikes in Rainfall

Coeff2 is positive on the left side of the figure and negative on the right side.
 

<img src="r_figures/sporadic.jpeg" alt="Drawing" style="width: 600px;"/>


Similar to coefficient 3 of snow, it is evident from the above plots that the couple of high spikes in rainfall (at the start and end of the plots) are better captured by higher values of the second coefficient in the case of PRCP (left region).








## Cumulative distribution plots for coefficient and residual 

Since snow depth was captured really well by the first three eigenvectors, it is insightful to observe the cumulative distribution plots for the eigen coefficients and residual errors.

<img src="r_figures/coeff_res.jpeg" alt="Drawing" style="width: 600px;"/>

It can be observed that coefficient 1 (volume) is always positive, with around 80% of the instances having a value less than 2000 and only a tail having coeff1 above this. Coeff2 (early/late snow) and coeff3 (spikes) have similar plots, with a wide spread of both positive and negative values.

The curve for the residual errors is as expected, with a somewhat constant reduction in the step sizes with which the error decreases (this follows from the eigenvectors being sorted by the variance they capture).
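A cumulative distribution plot of this kind is an empirical CDF: sort the values and plot each against the fraction of values at or below it. A minimal sketch, with a hypothetical function name `empirical_cdf` and synthetic coefficient values:

```python
import numpy as np


def empirical_cdf(values):
    """Return sorted values and the fraction of samples at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    p = np.arange(1, x.size + 1) / x.size
    return x, p


# Synthetic coefficient values (not taken from the report).
coeff1 = np.array([500.0, 2500.0, 1500.0, 1000.0, 3000.0])
x, p = empirical_cdf(coeff1)

# Fraction of instances with a coefficient below a chosen threshold.
frac_below_2000 = np.mean(coeff1 < 2000)
print(x, p, frac_below_2000)
```

Plotting `p` against `x` gives the cumulative curve; a statement such as "80% of the instances are below 2000" corresponds to reading `p` at `x = 2000`.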


For precipitation, the cumulative distribution of the residual error after three eigenvectors is still very high (this follows from the claims above). The plot is shown below.

<img src="r_figures/a.png" alt="Drawing" style="width: 400px;"/>



# Mean or Zero? Which is a better estimate?

In the case of SNWD, the mean performs worse than zero and is not a good estimate of the overall distribution. This is because in my assigned region there are months with no snow. (below)

In the case of PRCP, the mean is a better estimate than zero, because precipitation is present almost throughout the year. (above)


<img src="r_figures/meanzero.jpeg" alt="Drawing" style="width: 400px;"/>
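One way to make the comparison concrete is to measure the squared error of a station-year against the all-zero estimate and against the mean profile. The sketch below assumes that reading of "mean versus zero"; the function name `mean_beats_zero` and the toy profiles are hypothetical.

```python
import numpy as np


def mean_beats_zero(series, mean_profile):
    """Return True if the mean profile approximates `series` better than zero.

    Errors are sums of squared differences over the days where `series`
    is observed (NaN entries are skipped).
    """
    series = np.asarray(series, dtype=float)
    mean_profile = np.asarray(mean_profile, dtype=float)
    err_zero = np.nansum(series ** 2)
    err_mean = np.nansum((series - mean_profile) ** 2)
    return bool(err_mean < err_zero)


# A snow-free year compared against a mean profile that contains snow.
snow_free_year = np.zeros(5)
snow_mean = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
snow_result = mean_beats_zero(snow_free_year, snow_mean)

# A year with rain on every day compared against a flat mean profile.
rainy_year = np.array([1.0, 2.0, 1.0, 2.0, 1.0])
rain_mean = np.ones(5)
rain_result = mean_beats_zero(rainy_year, rain_mean)

print(snow_result, rain_result)
```

In this toy case zero wins for the snow-free year and the mean wins for the rainy year.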




# Representation of Eigen Coefficients On the Map

To better understand the correlations between regions with similar climate behavior, or to find regions with specific weather patterns, it helps to highlight regions of the map with similar eigen coefficients. For the SNWD measurement, the plots for coeff1, coeff2, and coeff3 are shown below.


<img src="r_figures/c123.png" alt="Drawing" style="width: 300px;"/>

It can be observed that regions that behave similarly in terms of volume, start of season, and spikes are quite close to each other, forming separate clusters.

## Is variance of climate measurements spatial or temporal?


### Is the Variation of Snow greater Year to Year or Station to Station?


It can be observed from the plot that, in the case of snow, the RMS after removing the mean by year is less than the RMS after removing the mean by station for coeff1, but greater for coeff2 and coeff3. So, for the given dataset, it can be concluded that the volume of snow varies more across stations than across years. However, the start of the snow season and the number of spikes vary more across years than across stations.

<img src="r_figures/cps.jpeg" alt="Drawing" style="width: 400px;"/>
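The two RMS values being compared can be computed by subtracting a per-group mean (per station or per year) from each coefficient and taking the root mean square of what is left. A minimal sketch, with a hypothetical function name and synthetic values:

```python
import numpy as np


def rms_after_removing_group_mean(values, groups):
    """RMS of `values` after subtracting the mean of each group.

    `groups` holds one label per value, e.g. the station or the year of
    the station-year that the coefficient belongs to.
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    residual = values.copy()
    for g in np.unique(groups):
        mask = groups == g
        residual[mask] -= values[mask].mean()
    return np.sqrt(np.mean(residual ** 2))


# Synthetic coefficients for two stations observed in two years.
stations = ["A", "A", "B", "B"]
years = [2000, 2001, 2000, 2001]
coeff = [10.0, 12.0, 20.0, 22.0]

rms_by_station = rms_after_removing_group_mean(coeff, stations)
rms_by_year = rms_after_removing_group_mean(coeff, years)
print(rms_by_station, rms_by_year)
```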



### Is the Variation of Precipitation greater Year to Year or Station to Station?

It can be observed from the plot that, in the case of precipitation, the RMS after removing the mean by station is less than the RMS after removing the mean by year for all the eigen coefficients. So, for this dataset, it can be concluded that the spikes and density of rainfall change more over the years than across stations.

<img src="r_figures/cp.jpeg" alt="Drawing" style="width: 400px;"/>

### How much is the variation of the precipitation from day to day? 

<img src="r_figures/dinimp_prcp.png" alt="Drawing" style="width: 400px;"/>
<img src="r_figures/r.png" alt="Drawing" style="width: 400px;"/>


It is hard to find correlations between the amounts of rain on the same day at different stations, because the amounts vary a lot even between nearby locations. It is more reasonable to compare whether or not it rained on the same day at different stations. As we see from the graph above, in our region it rains on about 19% of the days.
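The fraction of rainy days can be obtained by turning the precipitation amounts into a rained / did-not-rain indicator and averaging over the days that were actually observed. A minimal sketch on synthetic data (the function name `rainy_fraction` is hypothetical):

```python
import numpy as np


def rainy_fraction(prcp):
    """Fraction of observed days with non-zero precipitation (NaN = missing)."""
    prcp = np.asarray(prcp, dtype=float)
    valid = ~np.isnan(prcp)
    return np.sum(prcp[valid] > 0) / np.sum(valid)


# Synthetic daily precipitation: 10 observed days, 2 of them rainy.
prcp = [0, 0, 1.2, np.nan, 0, 3.0, 0, 0, 0, 0, 0]
fraction = rainy_fraction(prcp)
print(fraction)
```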



## In the given region,  are two locations correlated in terms of precipitation? - A statistical test

The corresponding null hypothesis is that the two locations are independent in terms of precipitation.

To measure the strength of the null hypothesis we use the p-value and p-norm. The negative log of the p-norm gives the significance of the hypothesis. The significance plot for the data is presented in the figure below. 


<img src="r_figures/sig.png" alt="Drawing" style="width: 400px;"/>

As we can observe, for most of the stations there is hardly any correlation in terms of precipitation (the Gaussian peaks at a mean of 0.08). Only a few stations are correlated with each other. I therefore claim that only a few stations in the given data are correlated in terms of precipitation.
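One way to test independence of rain occurrence between two stations is to ask how likely the observed number of shared rainy days would be if the rainy days were placed independently. The sketch below uses a hypergeometric tail probability as a stand-in; it is an assumption for illustration and does not reproduce the exact p-norm normalization used for the plot above. The function name `overlap_p_value` is hypothetical.

```python
from math import comb, log


def overlap_p_value(n, a, b, k):
    """Probability of at least k shared rainy days under independence.

    n: days on which both stations reported
    a: rainy days at the first station
    b: rainy days at the second station
    k: days on which it rained at both stations
    """
    total = comb(n, b)
    upper = min(a, b)
    tail = sum(comb(a, j) * comb(n - a, b - j) for j in range(k, upper + 1))
    return tail / total


# Toy example: 10 shared days, 3 rainy days at each station, all 3 coincide.
p = overlap_p_value(10, 3, 3, 3)
significance = -log(p)
print(p, significance)
```

A small p-value (large significance) means the overlap is unlikely under independence, which is the sense in which two stations are called correlated here.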

## Visualizing correlations

The correlation matrix for the stations in terms of precipitation is visualized below.

<img src="r_figures/vis.png" alt="Drawing" style="width: 400px;"/>

It can be observed that around 18 locations are highly correlated with each other. To extract more correlation structure from the data, we can perform SVD on the p-norm matrix to find the latent factors between precipitation measurements and stations.

The four most important latent factors between precipitation and stations are plotted in the figure below.

<img src="r_figures/eig.png" alt="Drawing" style="width: 1000px;"/>

If we reorder the p-norm matrix according to these sorted latent factors, we extract more correlated stations. After reordering the matrix in terms of decreasing latent factors, we get a new correlation matrix for the precipitation measurement. This new correlation matrix captures a lot more correlation by grouping stations which have similar latent factors, as is evident from the image below. Around 28 stations are now correlated with each other, compared to 18 in the previous matrix.

<img src="r_figures/neweig.png" alt="Drawing" style="width: 1000px;"/>
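The reordering step can be sketched as follows: take the SVD of the station-by-station matrix, sort the stations by their entry in one singular vector, and permute rows and columns accordingly. This is an illustration on a small synthetic matrix, not the notebook's code; the function name `reorder_by_latent_factor` is hypothetical.

```python
import numpy as np


def reorder_by_latent_factor(matrix, factor=0):
    """Reorder a square station-by-station matrix by one latent factor.

    Stations are sorted by their entry in the chosen left singular vector,
    and the same permutation is applied to rows and columns.
    """
    matrix = np.asarray(matrix, dtype=float)
    u, s, vt = np.linalg.svd(matrix)
    order = np.argsort(u[:, factor])
    return matrix[np.ix_(order, order)], order


# Synthetic matrix: stations 0 and 2 are related, as are stations 1 and 3.
m = np.array([
    [1.0, 0.0, 0.9, 0.0],
    [0.0, 1.0, 0.0, 0.8],
    [0.9, 0.0, 1.0, 0.0],
    [0.0, 0.8, 0.0, 1.0],
])
reordered, order = reorder_by_latent_factor(m)
print(order)
print(reordered)
```

After the permutation, related stations sit next to each other, so the strong entries gather into blocks along the diagonal.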

Performing a similar exercise for the SNWD measurements in the region led to similar inferences. Highlights of the residual analysis experiment with snow depth can be found below.

<img src="r_figures/snow_analyzingresiduals.jpeg" alt="Drawing" style="width: 150px;"/>


## Does Snow Volume/Spikes/Timing Change with Elevation? 

To check whether snow volume changes with elevation, the scatter plots below track the different eigen coefficients of the SNWD measurement at different levels of elevation.

<img src="r_figures/eigs.jpeg" alt="Drawing" style="width: 1000px;"/>

coeff1 (volume): High-elevation places in general have higher values of coeff1 (more volume).

coeff2 (timing of snow season): High-elevation places have early snowfall.

coeff3 (spikes): Sudden spikes increase with increasing elevation.
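The trends read off the scatter plots can be summarized with a correlation coefficient between elevation and each eigen coefficient. A minimal sketch using the Pearson correlation on synthetic values (the function name `elevation_correlation` and the numbers are assumptions for illustration):

```python
import numpy as np


def elevation_correlation(elevation, coeff):
    """Pearson correlation between station elevation and an eigen coefficient."""
    elevation = np.asarray(elevation, dtype=float)
    coeff = np.asarray(coeff, dtype=float)
    return np.corrcoef(elevation, coeff)[0, 1]


# Synthetic stations: the coefficient grows with elevation.
elevation = [1000, 1500, 2000, 2500]
coeff1 = [100, 300, 450, 700]
r = elevation_correlation(elevation, coeff1)
print(r)
```

A value near +1 or -1 indicates a strong linear trend with elevation, while a value near 0 indicates none.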





## Are highly correlated rainy regions nearby/overlapping on Map?

The highly correlated regions in terms of precipitation are plotted below:

<img src="r_figures/corr.png" alt="Drawing" style="width: 500px;"/>

Some highly correlated regions are near each other on the map, while some are not. Hence, a generalized claim for the dataset cannot be made here.


## Are eigen coefficients of PRCP measurement similar for correlated stations? - A map view

The eigen coefficients of the PRCP measurements are plotted on the map below. The opacity of each triangle depends on the sign of the coefficient, and the value of the coefficient determines the size of the triangle.

<img src="r_figures/corr2.png" alt="Drawing" style="width: 700px;"/>

It can be seen that nearby regions have similar eigen coefficients, so the SVD was able to capture latent factors that are similar for nearby stations. The map presented in the previous cell shows the regions that are highly correlated in terms of precipitation; these regions have similar structures in terms of the size and color of the 4 triangles. Thus, the latent factors of the PRCP measurements are indeed correlated.


