# Data Analysis for Weather of <i>New Mexico</i> State

##### (PID: A53219498)

### Primary Data Source
The data is obtained from [NOAA](https://www.ncdc.noaa.gov/). It was downloaded from this [FTP_site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

#### We will refer to the data from this source as NOAA data

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature (in degrees Centigrade).
* **TOBS:** The temperature at the time of observation each day (in degrees Centigrade).
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow (in mm).

Here, we analyze the weather data corresponding to the index **"SSSSBSBB"**.<br>
The data from this index corresponds to the state of **New Mexico**.<br>
Taking the maximum and minimum latitude/longitude values from the data and plotting the locations of the weather stations, we obtain the following map (the **black** spots mark the station locations):<br>
(The data comes from a total of **55** stations.)

<tr><td><img alt="NewMexicoMap.jpg" src="nsk_figures/US_CD_NewMexicoMap.png" width="300" height="300"/></td><td><img hspace="20"/><img alt="NewMexicoMapZoomed.jpg" src="nsk_figures/US_CD_NewMexicoZoomedMap.png" width="700" height="300"/></td></tr>

### Secondary Source for verification
Official general statistics for the state of New Mexico can be obtained from: http://www.usclimatedata.com/climate/new-mexico/united-states/3201
#### We will refer to the data from this site as US-ClimateData (US-CD)

The graph below shows the annual distribution of <b>PRCP</b> (Precipitation) and <b>TMP</b> (Temperature) in the state of New Mexico, obtained from <i>US-CD</i>. Temperature is shown in both **Fahrenheit** and **Centigrade** degrees, and the precipitation values are in **inches/month**.

<p><img alt="ClimateNewMexico.jpg" src="nsk_figures/US_CD_NewMexico.png" /></p>

# 1. Sanity Check

## 1.1 Temperature

We plot the average of the temperatures, <b>Max (TMAX), Min (TMIN) and Observed (TOBS)</b>, along with the standard deviation, to cross-check against the US-CD data.

<p><img alt="4_T_MeanSTD.jpg" src="nsk_figures/4_T_MeanSTD.png" /></p>

#### Observation:
It is clear that the **mean (average)** values of TMIN and TMAX from the analyzed NOAA data agree well with the US-CD data.

## 1.2 Precipitation

<p><img alt="4_PRCP_MeanSTD.jpg" src="nsk_figures/4_PRCP_MeanSTD.png" /></p>

#### Observation:
The US-CD graph above reports **PRCP** in **inches/month**, whereas the **NOAA data** graph reports it in **mm/day**. In the NOAA data, the average **PRCP** is around ***1 mm/day***, or roughly ***1.2 inches/month***, which agrees reasonably well with the US-CD data.
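The unit conversion behind this comparison can be checked with a short sketch. It assumes a 30-day month and uses the definition 1 inch = 25.4 mm; the function name is illustrative and not part of the original analysis.

```python
MM_PER_INCH = 25.4  # exact by definition


def mm_per_day_to_inches_per_month(mm_per_day, days_in_month=30):
    """Convert a daily precipitation rate in mm to a monthly total in inches."""
    return mm_per_day * days_in_month / MM_PER_INCH


# 1 mm/day over an assumed 30-day month
print(round(mm_per_day_to_inches_per_month(1.0), 2))  # -> 1.18
```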

## 1.3 SNOW and SNWD

<p><img alt="4_SNW_MeanSTD.jpg" src="nsk_figures/4_SNW_MeanSTD.png" /></p>

#### Observation:
As an additional observation from the **NOAA** data, we note that there is no SNOW (and consequently zero SNWD) in New Mexico during the summer season, May-September.

## 1.4 Correlation in data

<p><img alt="Correlation_Data.jpg" src="nsk_figures/Correlation_Data.png" /></p>

**Observations:**<br>
- Finally, to get an idea of what to expect during data analysis, we plot the above correlation matrix using Kendall's rank-correlation method, which depends only on the relative ordering of the values and is therefore less sensitive to outliers.
- As might be expected, TOBS and TMAX are highly correlated, while TMIN is uncorrelated with either of them. This suggests that the average temperature in New Mexico is generally high, and that drops in temperature are inconsistent with the average temperature distribution.
- Also, PRCP and SNOW are highly correlated. It is environmentally logical that the measured PRCP should agree with the measured snowfall, as both are forms of precipitation that occur at different temperatures.
- On careful observation, we see that SNOW and SNWD are more correlated than PRCP and SNWD. This is because SNWD varies in proportion to SNOW, whereas PRCP is unlikely to affect SNWD in the snowfall season.
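To make the idea of a rank correlation concrete, here is a minimal sketch of Kendall's tau in its simplest form (tau-a, with no tie correction). The function name and the toy temperature values are illustrative assumptions; library implementations typically apply a tie correction (tau-b) and may give slightly different numbers on tied data.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # both series move in the same direction
        elif sign < 0:
            discordant += 1      # the series move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values: two series that always rise together
tmax = [10.0, 15.0, 20.0, 25.0]
tobs = [8.0, 12.0, 19.0, 30.0]
print(kendall_tau_a(tmax, tobs))  # -> 1.0
```

Only the ordering of the values enters the statistic, so rescaling either series leaves the result unchanged.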

## 1.5 Valid Data

For reference, we check the number of valid data points in the data. <br>
Below, we plot the number of valid data points available for each observation:<br>
<p><img alt="Valid_Count_T.png" src="nsk_figures/Valid_Count_T.png" /></p>
<p><img alt="Valid_Count_S.png" src="nsk_figures/Valid_Count_S.png" /></p>

**Observation:**<br>
As can be observed, most of the points in the data are valid, and a significant amount of data is available for every month of the year. The SNOW and SNWD observations have slightly fewer valid points than the other observations, but still a significant number.

# 2. Variance Explained

Below we plot the variance explained by the first **10** eigenvectors of the matrix for different weather fields:

<p><img alt="4_T_VarExp.png" src="nsk_figures/4_T_VarExp.png" /></p>
<p><img alt="4_SNW_VarExp.png" src="nsk_figures/4_SNW_VarExp.png" /></p>

#### Observation:
It can be observed that:<br>
The top 10 eigen-vectors explain around **50%** of the variance for **TMIN**, **58%** for **TMAX**, **62%** for **TOBS**, **28%** for **SNOW**, **85%** for **SNWD** and only around **14%** for **PRCP**.<br>
The top 10 eigen-vectors for SNWD explain the variance best, and thus represent most of the span of the data in the SNWD observations. We will analyze this further, as we can expect the best PCA results in the case of SNWD.<br>
**TOBS** is another set of observations that is explained significantly by its top 10 eigen-vectors.
Conversely, the poorest representation by eigen-vectors occurs in the case of **PRCP** and **SNOW**. We will also look at the PCA analysis of **PRCP** to contrast it with that of **SNWD**.
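As a rough illustration of how such a "variance explained" figure can be computed, the sketch below takes the eigen-values of the covariance matrix and reports the share carried by the top k. It runs on synthetic data and assumes no missing values, which the real NOAA data does contain; the function and variable names are illustrative.

```python
import numpy as np


def variance_explained(data, k=10):
    """Fraction of total variance captured by the top-k eigen-vectors
    of the covariance matrix of `data` (rows = samples, columns = days)."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    return eigvals[:k].sum() / eigvals.sum()


# Synthetic stand-in: 200 "station-years" with 30 "days" each
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 30))
print(variance_explained(toy, k=10))
```

For pure noise like this toy matrix the top eigen-vectors explain little more than their proportional share; strongly structured data (as reported for SNWD) concentrates the variance in the first few.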

# 3. PCA and Reconstruction

## 3.1 PCA for SNWD

To start, we first plot the mean and the top 5 eigen-vectors.

<p><img alt="45_PCA_SNWD.jpg" src="nsk_figures/45_PCA_SNWD.png" /></p>

#### Observation:
We observe that in **New Mexico**, the snowfall is from around November till the end of April. The snowdepth peaks around the month of **March**. It may be concluded that the snowfall is the highest during this time.<br>
We see that the first eigen-vector (**eig1**) has a shape similar to that of the Mean. Since eigen-vectors capture the variance of a particular aspect of data, this implies that **eig1** captures the overall amount of snow relative to the mean.<br>
We interpret the remaining eigen-vectors as follows:<br>
**eig2** captures the decrease in SNWD in Jan-March, and increase in SNWD in April and from Nov-Dec<br>
**eig3** captures the decrease in SNWD in Jan-Feb, and increase in SNWD in Feb-April and from Nov-Dec<br>
**eig4** captures the variance similar to **eig2** for Jan-April but captures the decrease in the SNWD from Nov-Jan<br>
**eig5** captures the variance similar to **eig4** for Jan-April but captures the decrease from Nov-Dec and increase from Dec-Jan<br>
<br>
- The difference in scale between the above two plots arises simply because the eigen-vectors are normalized to unit length, so each of their components has a magnitude of at most ***1***.

## 3.2 Coefficients corresponding to eigen-vectors for SNWD

Here, we consider the coefficients corresponding to the first 3 eigen-vectors. Although we can do so for the first 5 eigen-vectors, the plots will be cluttered and difficult to examine.

**coeff1:**
coeff1 captures the variation in the magnitude of the snow
- most negative
<p><img alt="45_c1_NEG_SNWD.png" src="nsk_figures/45_c1_NEG_SNWD.png" /></p>
Lower values of coeff1 correspond to the lower spectrum of magnitudes in the months of Jan-Apr.
- most positive
<p><img alt="45_c1_POS_SNWD.png" src="nsk_figures/45_c1_POS_SNWD.png" /></p>
Higher values of coeff1 correspond to the increasing magnitude during the months of mid-Jan to Feb, and then a decrease till the month of Apr.<br>
**coeff2:**
coeff2 captures the nature of the data according to seasons/timings: increasing/decreasing values
- most negative
<p><img alt="45_c2_NEG_SNWD.png" src="nsk_figures/45_c2_NEG_SNWD.png" /></p>
Lower values of coeff2 show a single peak in the months between Jan-Apr, suggesting that there is a peak snow season.
- most positive
<p><img alt="45_c2_POS_SNWD.png" src="nsk_figures/45_c2_POS_SNWD.png" /></p>
The same goes for higher values of coeff2, except that they also correspond to the increase in snow Nov onwards.<br>
**coeff3:**
coeff3 captures the finer peaks/variations in the snow seasons
- most negative
<p><img alt="45_c3_NEG_SNWD.png" src="nsk_figures/45_c3_NEG_SNWD.png" /></p>
Low values of coeff3 correspond to an above average snow in Jan-Feb and then a below-average snow in months of Mar-Apr
- most positive
<p><img alt="45_c3_POS_SNWD.png" src="nsk_figures/45_c3_POS_SNWD.png" /></p>
Higher values of coeff3 correspond to a season that has below-average snow in Jan-Feb and then above-average snow in the months of Mar-Apr.

### 3.2.1 Best Reconstruction
By taking the average of the top-3 coefficients, we obtain the following best reconstruction:<br>
<p><img alt="BR_SNWD.png" src="nsk_figures/BR_SNWD.png" /></p>
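One common way to build a low-rank reconstruction is to add a sample's projections onto the top eigen-vectors back to the mean. The sketch below illustrates that idea on synthetic data without missing values; it is an assumption about the general procedure, not the exact code used to produce the figure, and all names are illustrative.

```python
import numpy as np


def pca_basis(data):
    """Return the mean vector and the eigen-vectors (as columns),
    sorted by descending eigen-value."""
    mean = data.mean(axis=0)
    centered = data - mean
    cov = centered.T @ centered / data.shape[0]
    _, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigen-values
    return mean, eigvecs[:, ::-1]


def reconstruct(row, mean, eigvecs, k=3):
    """Approximate one sample as mean + sum of its top-k projections."""
    top = eigvecs[:, :k]
    coeffs = top.T @ (row - mean)             # coeff1 .. coeffk
    approx = mean + top @ coeffs
    # Share of the sample's squared deviation from the mean left unexplained
    residual = np.sum((row - approx) ** 2) / np.sum((row - mean) ** 2)
    return coeffs, approx, residual


rng = np.random.default_rng(1)
toy = rng.normal(size=(100, 20))              # synthetic stand-in for yearly vectors
mean, eigvecs = pca_basis(toy)
coeffs, approx, residual = reconstruct(toy[0], mean, eigvecs, k=3)
print(coeffs.shape, residual)
```

The residual shrinks as k grows and reaches zero when all eigen-vectors are used, which is why a small residual with only three coefficients marks a sample that the top eigen-vectors describe well.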

## 3.3 PCA for PRCP

Just as for SNWD, we first plot the mean and the top 5 eigen-vectors.

<p><img alt="45_PCA_PRCP.png" src="nsk_figures/45_PCA_PRCP.png" /></p>

#### Observation:
In case of **PRCP**, it can be observed that the top 5 eigen-vectors don't model the mean distribution as well as **SNWD**. In fact, as seen before, even the top 10 eigen-vectors explain only around **14%** of the variance. <br>
**However**, we do notice that the eigen-vectors capture the peak precipitation period between the months of Jul-Oct., showing high variance w.r.t. the mean.<br>
Finally, we conclude that the data is quite **noisy**. We go on to analyze the PRCP data further by examining the first 3 coefficients of its eigen-vectors, as we did for the SNWD data.

## 3.4 Coefficients corresponding to eigen-vectors for PRCP

**coeff1:**
coeff1 captures the variation in the magnitude of the precipitation
- most negative
<p><img alt="45_c1_NEG_PRCP.png" src="nsk_figures/45_c1_NEG_PRCP.png" /></p>
Lower values of coeff1 correspond to increase in magnitude in the months of Jul-Oct
- most positive
<p><img alt="45_c1_POS_PRCP.png" src="nsk_figures/45_c1_POS_PRCP.png" /></p>
Higher values of coeff1 correspond to those instances where the magnitude of PRCP is positive but still low in the months of Jul-Oct <br>
**coeff2:**
coeff2 captures the nature of the data across seasons/timings: increasing/decreasing values
- most negative
<p><img alt="45_c2_NEG_PRCP.png" src="nsk_figures/45_c2_NEG_PRCP.png" /></p>
Since there is just a single PRCP season, and only a single peak, this is captured by extreme lower values of coeff2, in the months of Jul-Oct
- most positive
<p><img alt="45_c2_POS_PRCP.png" src="nsk_figures/45_c2_POS_PRCP.png" /></p>
The same goes for higher values of coeff2, except that they perhaps correspond to the lower magnitude peaks <br>
**coeff3:**
coeff3 captures the finer peaks/variations in the precipitation seasons
- most negative
<p><img alt="45_c3_NEG_PRCP.png" src="nsk_figures/45_c3_NEG_PRCP.png" /></p>
Low values of coeff3 correspond to a PRCP season with a dip/reduced PRCP in the month of August, and then a peak from Sept. onwards
- most positive
<p><img alt="45_c3_POS_PRCP.png" src="nsk_figures/45_c3_POS_PRCP.png" /></p>
Higher values of coeff3 correspond to seasons with similar characteristics (a small dip and then an increase), but perhaps with lesser variation between the two, that's unnoticeable in the plotted scale

### 3.4.1 Best Reconstruction
By taking the average of the top-3 coefficients, we obtain the following best reconstruction:<br>
<p><img alt="BR_PRCP.png" src="nsk_figures/BR_PRCP.png" /></p>

## 3.5 Residual Distribution in PRCP data

<p><img alt="Residual_Distribution_PRCP.png" src="nsk_figures/Residual_Distribution_PRCP.png" /></p>

**Observation**:<br>
Plotting the distribution of the total residuals, computed using the coefficients corresponding to the top-3 eigen-vectors, shows that as the number of data instances increases, the top-3 eigen-vectors capture the data better and the residuals decrease.<br>
In other words, since more data points reveal more about the data, the top eigen-vectors form a better basis for approximation.

## 3.6 Cumulative Residual Distribution

<p><img alt="Res3_Cum.png" src="nsk_figures/Res3_Cum.png" /></p>

**Observations:**<br>
Similar to the graph in section 3.5, we observe that the 90% of top-3 coefficients and their corresponding residuals explain 

# 4. Analysis using Maps

As discussed previously during analysis using coefficients, we note that **coeff1** captures the magnitude of the data points, as it corresponds to the eigen-vector with the largest eigen-value<br>
Below, we plot the various stations and analyze them w.r.t coeff1.<br>
For these maps, we note the following details:
- Center of the circle is the location of the station
- Size of the circle is proportional to the number of measurements (instances available from that station)
- The color of the station is derived from the average of the coeff1 values. Different values of this average correspond to different colors in the color palette

### 4.1 SNWD

<p><img alt="Map_c1_SNWD.png" src="nsk_figures/Map_c1_SNWD.png" width="800"/></p>

**Observations:**<br>
- For the **SNWD** data, we note that different stations capture different numbers of observations (size of circles).<br> 
- However, the average value of coeff1 for these observations does not vary much (color of the circles). This means that the magnitude of the observations across these different stations is consistent.<br>
- We may conclude from this data that the snowfall, and hence the accumulated snow depth, is uniform across the majority of areas of New Mexico. In simpler words, the amount of snowfall is almost the same across the entire state, in the areas where it snows.
- Also, it's clear that the data from the stations is highly **"correlated"** (uniform variation). It now makes more sense that this data maintains consistency in lower dimensions. Hence, it can be better analyzed using PCA.

### 4.2 PRCP

<p><img alt="Map_c1_PRCP.png" src="nsk_figures/Map_c1_PRCP.png" width="800"/></p>

**Observations:**<br>
- For the **PRCP** data, we note that different stations capture different numbers of observations (size of circles), similar to the **SNWD** data.
- The average value of **coeff1** varies considerably, in contrast to **SNWD**, implying that the amount (magnitude) of rainfall recorded across different stations varies a lot (color of the circles).
- We may also note that the non-uniformity of the precipitation data leads to many relevant observations and high variability. Thus, it is **less** likely that the data lies in a lower-dimensional space, as compared to **SNWD**. We observed this in the PCA analysis of the **PRCP** data: the variance explained by the top few eigen-vectors is very low, and to represent the data better using the same top eigen-vectors, we require more data (complete data). [This requirement was noted during the analysis of the residual distribution in section **3.5**]

# 5. Spatial and Temporal Variation

- It is interesting to know whether the observations vary more from location to location (spatial) or from year to year (temporal).
- To measure this, we consider the amount of SNWD and PRCP in turn, and look at the variation across the different coefficients.
- We create a matrix of stations vs. years, and subtract the year-wise mean and the station-wise mean from it in turn.
- At every step, we calculate the MS (mean squared) value. Squaring makes every deviation positive and gives larger deviations more weight, which makes the variations easier to compare.
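The steps above can be sketched as follows. The orientation of the table (rows = stations, columns = years), the use of NaN for missing station-years, and all names are assumptions made for illustration; the toy values are not taken from the NOAA data.

```python
import numpy as np


def ms(matrix):
    """Mean of the squared entries, ignoring missing (NaN) cells."""
    return np.nanmean(matrix ** 2)


def percent_explained(table):
    """table: rows = stations, columns = years, entries = one coefficient."""
    total = ms(table)
    # Subtract each year's (column) mean, then each station's (row) mean
    by_year = table - np.nanmean(table, axis=0)
    by_station = table - np.nanmean(table, axis=1, keepdims=True)
    return {
        "total_ms": total,
        "year": 100 * (1 - ms(by_year) / total),
        "station": 100 * (1 - ms(by_station) / total),
    }


# Toy table: two stations, three years, each station constant over time
toy = np.array([[1.0, 1.0, 1.0],
                [5.0, 5.0, 5.0]])
print(percent_explained(toy))
```

In this toy table every station is constant over the years, so removing the station mean leaves nothing behind and the station-wise figure is 100%.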

## 5.1 SNWD

**coeff1:**<br>
Total MS= 310770.586241<br>
After removing mean-by-year    = 207213.63556 ,fraction explained:  33.3226358174<br>
After removing mean-by-stations= 120734.675998 ,fraction explained:  61.1499024221<br>
**coeff2:**<br>
Total MS= 48130.8355083<br>
After removing mean-by-year    = 31515.0527011 ,fraction explained:  34.5221158779<br>
After removing mean-by-stations= 23012.0000678 ,fraction explained:  52.1886544774<br>
**coeff3:**<br>
Total MS= 28207.4156708<br>
After removing mean-by-year    = 18655.7458932 ,fraction explained:  33.8622647643<br>
After removing mean-by-stations= 17634.8802018 ,fraction explained:  37.481404154<br>
## 5.2 PRCP

**coeff1:**<br>
Total MS= 10523.3144781<br>
After removing mean-by-year    = 6678.50248673 ,fraction explained:  36.5361312674<br>
After removing mean-by-stations= 2802.2939773 ,fraction explained:  73.3706145234<br>
**coeff2:**<br>
Total MS= 8985.44785647<br>
After removing mean-by-year    = 2958.66115768 ,fraction explained:  67.0727469021<br>
After removing mean-by-stations= 2756.68704809 ,fraction explained:  69.3205381399<br>
**coeff3:**<br>
Total MS= 8712.13058304<br>
After removing mean-by-year    = 3140.22083205 ,fraction explained:  63.9557648715<br>
After removing mean-by-stations= 2688.69560662 ,fraction explained:  69.1384836236<br>

**Observations**:<br>
- The "fraction explained" values above are percentages: they give the reduction in MS, relative to the total MS, when we subtract the respective average from each year or from each station. The larger this reduction, the more of the variation in that coefficient is accounted for by year-to-year or station-to-station differences.
- For **SNWD**, we observe that variation by station explains more than variation by year. This means that the variation observed in the data by different stations is similar. This **agrees** with our previous observation that snowfall across New Mexico varies uniformly [Section 4.1]. Also, this means that the SNWD in New Mexico does not vary much with time, i.e., the snowfall has been consistent over the years.
- For **PRCP**, we observe a similar phenomenon: the variation by station is greater than the variation by year. However, the variation explained by the different coefficients varies greatly. This is **expected**, as we saw in [section 4.2] that the variation in precipitation data is complex to capture. We conclude that precipitation in New Mexico has been inconsistent across both stations and years, although the variation across stations is greater.

# 6. Measuring Statistical Significance in PRCP data

**PRCP** data has complex characteristics. We explore it further with respect to the correlation of data between stations.

We would like to analyze the data from different stations and obtain the correlation, if any. For this, we can consider the actual data from the two stations, and take the **dot product** to calculate **cosine similarity**. However, this product can be computationally expensive for every datapoint.<br>

Hence, we simply analyze the number of days on which it rained at two or more locations simultaneously. We then consider the following null hypothesis:

1. The data from different stations is uncorrelated, or independent.
2. The data generated at each station is random and non-repeating.

Finally, we calculate the probability of the number of overlaps that occur between pairs of stations, with respect to the null hypothesis.

For this, we consider pairs of stations and calculate the probability of observing **L** overlap days between the two stations, given: the total number of days for which we have measurements for both stations, the number of days it rained at station 1, and the number of days it rained at station 2.

This probability approximates to **log probability: logP**.<br>
Thus, higher the value of logP, higher the correlation between two stations.<br>
However, the logP values are negative for the given data and thus for simplification, we negate the logP values and consider just the P-values.<br>
**Thus, lower the P-values, greater the correlation.**
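One way to make this test concrete: if the two stations are independent and the number of rainy days at each is held fixed, the number of shared rainy days follows a hypergeometric distribution. The sketch below computes its log probability exactly using `math.lgamma`. This is an illustrative formulation under that assumption, not necessarily the approximation used to produce the figures here, and the function names are hypothetical.

```python
from math import exp, lgamma, log


def log_choose(n, k):
    """log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_prob_overlap(n, m1, m2, overlap):
    """log P(exactly `overlap` shared rainy days) under independence.

    n  = days on which both stations reported
    m1 = rainy days at station 1, m2 = rainy days at station 2
    """
    if overlap < max(0, m1 + m2 - n) or overlap > min(m1, m2):
        return float("-inf")
    return (log_choose(m1, overlap)
            + log_choose(n - m1, m2 - overlap)
            - log_choose(n, m2))


def log_tail_prob(n, m1, m2, overlap):
    """log P(at least `overlap` shared rainy days), via log-sum-exp."""
    terms = [log_prob_overlap(n, m1, m2, j)
             for j in range(overlap, min(m1, m2) + 1)]
    terms = [t for t in terms if t != float("-inf")]
    if not terms:
        return float("-inf")
    top = max(terms)
    return top + log(sum(exp(t - top) for t in terms))


# Hypothetical counts: 1000 shared days, 100 and 120 rainy days, 40 overlaps
print(log_tail_prob(1000, 100, 120, 40))
```

A very negative value means that such a large overlap would be unlikely if the stations were independent, which is evidence that their rainfall is correlated.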

We plot the histogram of the **P-values** for pairs of stations:

<p><img alt="p_significance.png" src="nsk_figures/p_significance.png"/></p>

**Observation**:<br>
We note that most of the P-values lie in the range **0.01-0.3**. This indicates that there is a significant amount of correlation between stations.

## Pair-wise correlation

We now measure pair-wise correlation between different stations by plotting the pair-wise normalized probability values as a heatmap.

<p><img alt="p_value_imshow.png" src="nsk_figures/p_value_imshow.png"/></p>

We observe that there is high correlation among approximately the first **25** stations. There is also a slightly weaker but still significant correlation among stations 25-45.<br>
On further analysis, we find that the following **25** stations are correlated:<br>
US1NMOT0063  USC00291286  USC00299128  USC00296845  USC00022754  USC00292468  USC00298872  USC00297340  USC00293015  USC00295532  USC00290806  US1NMSR0005  USC00294426  USC00293951  USC00296854  US1NMOT0027  US1NMOT0013  US1NMOT0012  US1NMOT0011  US1NMOT0017  USC00299720  US1NMOT0015  USC00291901  USC00292865  US1NMDA0127 

We further perform **SVD** on the normalized P-value matrix to obtain the following plot. 

<p><img alt="svd_var_explained.png" src="nsk_figures/svd_var_explained.png"/></p>

**Observation:**<br>
It shows that **75%** of the variance in the matrix is explained by the top 10 eigen-vectors.

### Further analysis using PCA:

We further analyze the data by performing PCA on the normalized P-value matrix and observing correlation between different stations.<br>
For this, we calculate the eigen-vectors corresponding to the top-4 eigen-values. We then re-order the normalized P-value matrix once per eigen-vector: we sort the station indices according to the values of their entries in that eigen-vector, and arrange the rows and columns of the matrix in the sorted order.<br>
Such re-ordering helps in understanding the correlation represented by different eigen-vectors.<br>
This can be seen in the following four matrices.<br>

<p><img alt="mat_show.png" src="nsk_figures/mat_show.png"/></p>
<p><img alt="mat_eigen.png" src="nsk_figures/mat_eigen.png"/></p>

**Observations**:<br>
The four graphs above show the variation captured by the four corresponding eigen-vectors. If we match each of these plots to the four matrices, we observe that the variation recorded by the eigen-vectors corresponds to the correlation observed in the matrices. For example, the overlap seen in the first part of the second graph corresponds to the correlation observed among the 20 stations in the second matrix.<br>
Such analysis according to different eigen-vectors helps reveal different clusters or groups.
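The re-ordering step described in this section can be sketched as follows. The sketch works directly on a symmetric matrix with `numpy.linalg.eigh` and skips any centering that a full PCA would apply; that simplification, the function name, and the toy matrix are assumptions for illustration.

```python
import numpy as np


def reorder_by_component(matrix, component=0):
    """Sort the rows and columns of a symmetric matrix by the entries of
    one eigen-vector (component 0 = largest eigen-value)."""
    _, eigvecs = np.linalg.eigh(matrix)       # eigh returns ascending eigen-values
    vec = eigvecs[:, ::-1][:, component]
    order = np.argsort(vec)                   # station indices in sorted order
    return order, matrix[np.ix_(order, order)]


# Toy symmetric "station x station" matrix
rng = np.random.default_rng(2)
raw = rng.normal(size=(6, 6))
sym = (raw + raw.T) / 2
order, reordered = reorder_by_component(sym, component=0)
print(order)
```

Stations with similar entries in the chosen eigen-vector end up next to each other, so groups of mutually correlated stations tend to show up as blocks along the diagonal of the re-ordered matrix.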

# 7. PRCP Data on Maps using top-4 coefficients

- From the two maps in section 4, we notice that the PRCP data is worth analyzing further w.r.t. the different top coefficient values, whereas the PCA analysis for SNWD is sufficient.
- To further analyze the PRCP data, we consider a subset of data attributes. Using the **latitude, longitude and the top 4 coefficients**, we plot the following map.
- In the map, we plot the averages of the first **4** coefficients across different stations. The **4** coefficients are represented by connected triangles. The center of the connection represents the station.
    - The top-right triangle corresponds to **coeff1**, the bottom-right to **coeff2**, the bottom-left to **coeff3** and the top-left to **coeff4**
    - The size of the triangle represents the magnitude of the average of the corresponding coefficients
    - A transparent triangle represents positive value of the coefficients whereas an opaque triangle represents the negative value
    

<p><img alt="Map_all_PRCP.png" src="nsk_figures/Map_all_PRCP.png" width="800"/></p>

**The above image zoomed and split into two:**<br>
<tr><td><img alt="Map_all_PRCP_1.jpg" src="nsk_figures/Map_all_PRCP_1.png" width="700" height="700"/></td><td><img alt="Map_all_PRCP_2.jpg" src="nsk_figures/Map_all_PRCP_2.png" width="700" height="700"/></td></tr>

**Observations:**
- The data for coeff1 (top-right triangle) in blue agrees with the map previously seen in section **4.2**. The data recorded varies from station to station, and so does the average coeff1.
- The data for coeff2 shows even higher variability (red triangle). We have previously seen that coeff2 captures the increasing/decreasing nature of the data according to seasons (or timing). Thus, it takes a variety of positive/negative values, as the PRCP data is non-uniform.
- Similarly, the data for coeff3 (green triangle) can be observed to be mostly positive. We've seen previously that coeff3 captures the finer variations in the data (which do not seem to vary negatively). Thus, the majority of coeff3 values are positive.
- While coeff4 data wasn't analyzed separately, we can see that the majority of coeff4 values are positive. This may imply that coeff4 captures features similar to coeff3.

# Conclusion:

- In conclusion, we observe that we have a good data distribution for most of the observations. 
- PCA works best on the SNWD data analyzed, and worst on the PRCP data
- In order to analyze the best and worst parts of the data (SNWD, PRCP), we looked at how we could understand more about the underlying distribution by plotting the coefficients of the principal components and identifying their role in the data distribution
- We plotted these values on map to analyze spatial data distribution
- We also looked at correlation between PRCP data obtained from different stations