# North Dakota Weather Analysis

This report presents a historical analysis of weather patterns in a region that roughly overlaps the state of North Dakota, in and around the city of Minot.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow (in mm).

Throughout this notebook, we focus primarily on hypothesizing relationships and patterns in the data, choosing the visualizations that best fit each hypothesis, and evaluating whether the hypothesis holds up statistically.


# Data Visualization: The Violin Plots!

The first step in any data analysis journey is to visualize the data as much as possible. It is often hard to get a good idea of the dataset we are dealing with just by using numerical summary statistics like the mean and the standard deviation. It helps to know a priori the structure of the dataset and the types and nature of the measurements available to us.

Violin Plots tremendously aid this process. They provide a "summary picture" and a peek into the data that makes up our dataset.

Interpreting violin plots:
The X and Y axes are as labelled. How "fat" the violin is at each horizontal level indicates the number of valid data points available at that level.

Useful quick-and-dirty inferences can be made by glancing at violin plots. For example, from the plot below, we can see that in our data the oldest SNWD measurements are much more recent than those of some of the other measurements, which start more than 20 years earlier. Also, every measurement has a significant fraction of its data coming from the 1960-1980 era.
The weighted mean year is represented by a tiny dot inside the violin.

<img src="data_visualise.png" />
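
The weighted mean year shown as a dot can be sketched as follows. This is only an illustration: the `records` rows and the idea of weighting each year by its count of valid daily values are assumptions, not the notebook's actual data or code.

```python
from collections import defaultdict

# Hypothetical records: (measurement, year, number of valid daily values).
records = [
    ("TMAX", 1950, 365), ("TMAX", 1970, 360), ("TMAX", 1990, 200),
    ("SNWD", 1975, 150), ("SNWD", 1985, 300),
]

def weighted_mean_year(rows):
    """Return {measurement: mean year, weighted by the valid-value count}."""
    weight_sum = defaultdict(float)
    weighted_year_sum = defaultdict(float)
    for measurement, year, n_valid in rows:
        weight_sum[measurement] += n_valid
        weighted_year_sum[measurement] += year * n_valid
    return {m: weighted_year_sum[m] / weight_sum[m] for m in weight_sum}

mean_years = weighted_mean_year(records)
```

Years with more valid observations pull the dot toward themselves, which is why it need not sit at the midpoint of a measurement's date range.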




## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs obtained from a site called <a href="http://www.usclimatedata.com/climate/minot/north-dakota/united-states/usnd0551" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month, in the city of Minot, North Dakota.</p>

<p>&nbsp;</p>

<p><img src="minot_temp.jpg" /></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily temperatures agree quite well with the ones we got from our data, once we convert Fahrenheit to Celsius.</p>
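
The temperature conversion used for this comparison is the standard one. A minimal sketch, where the sample values are illustrative and not read off the Minot chart:

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

# Illustrative values only (not taken from the US Climate Data chart).
sample_highs_f = [20.0, 50.0, 80.0]
sample_highs_c = [fahrenheit_to_celsius(t) for t in sample_highs_f]
```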

<p>&nbsp;</p>

![tmin_tmax_mean_stddev.png](tmin_tmax_mean_stddev.png)


<p>To compare the precipitation, we need to convert millimeters/day to inches/month. According to our analysis the rainfall varies from less than 1.5 mm/day to a peak of around 3 mm/day, which translates to roughly 1.8 to 3.5 inches per month (assuming a 30-day month). According to US Climate Data the average rainfall is closer to 1.5 inches per month. However, there is clear agreement on the overall pattern: average precipitation rises to a peak around July and then falls to a minimum later in the year.</p>
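
The unit conversion can be sketched as below. The 30-day month is an assumption made here for simplicity; the function name is ours.

```python
MM_PER_INCH = 25.4

def mm_per_day_to_inches_per_month(mm_per_day, days_in_month=30):
    """Convert a daily rate in millimeters to a monthly total in inches.

    days_in_month=30 is a simplifying assumption, not a value from the data.
    """
    return mm_per_day * days_in_month / MM_PER_INCH

low = mm_per_day_to_inches_per_month(1.5)
peak = mm_per_day_to_inches_per_month(3.0)
```

Under these assumptions, 1.5 mm/day comes to about 1.8 inches per month and 3 mm/day to about 3.5 inches per month.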

![precipitation_mean_and _stddev.png](precipitation_mean_and _stddev.png)


## PCA analysis

This is the main statistical method we use in this notebook.

First, for each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![percentage_variance_tmin_tobs_tmax.jpg](percentage_variance_tmin_tobs_tmax.jpg)
We see that the top 5 eigen-vectors explain 17.5% of variance for TMIN, 35% for TOBS and 20% for TMAX.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigen-vector which, by itself, explains close to 25% of the variance.
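
A curve like the ones above can be computed roughly as follows. This is a sketch under stated assumptions: it uses synthetic data and assumes a complete matrix with no missing days, whereas the real station data contains gaps that would need separate handling. The function name is ours.

```python
import numpy as np

def explained_variance_curve(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    data: (n_samples, n_days) array with no missing values (an assumption).
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # sort eigenvalues descending
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic stand-in: one seasonal pattern with a random amplitude, plus noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.cos(2 * np.pi * days / 365)
synthetic = rng.normal(size=(200, 1)) * seasonal + 0.3 * rng.normal(size=(200, 365))

curve = explained_variance_curve(synthetic)
top5_percent = 100 * curve[4]   # percentage explained by the top 5 eigenvectors
```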

![percentage_variance_snow_snwd_prcp.jpg](percentage_variance_snow_snwd_prcp.jpg)

The top 5 eigenvectors explain 12% of the variance for PRCP and 8% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 85% of the variance for SNWD. This means that these 5 eigenvectors capture most of the variation in the snow-depth signal. Based on that, we will dig deeper into the PCA analysis for snow depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
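
The "decaying integral" intuition can be illustrated with a toy model. Everything here is hypothetical: the retention factor `DECAY`, the snowfall probabilities, and the simple recurrence are assumptions chosen for illustration, not a physical model of snow melt.

```python
import random

def lag1_autocorrelation(series):
    """Correlation between a series and itself shifted by one day."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return cov / var

random.seed(0)
DECAY = 0.9  # hypothetical fraction of snow depth retained from one day to the next

# Toy snowfall: snow on roughly 20% of days, with a random amount (mm).
snow = [random.uniform(0, 50) if random.random() < 0.2 else 0.0 for _ in range(2000)]

# Depth accumulates snowfall while the previous depth decays.
depth = []
previous = 0.0
for fall in snow:
    previous = DECAY * previous + fall
    depth.append(previous)

snow_r = lag1_autocorrelation(snow)
depth_r = lag1_autocorrelation(depth)
```

In this toy model the day-to-day autocorrelation of `depth` is much higher than that of `snow`, which is the sense in which the integrated signal is smoother and easier to capture with a few eigenvectors.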

**Does this have something to do with the peculiarity of SNWD in the violin plot we observed earlier?**
We suspect that measurements taken in the earlier years are noisier, and that their scarcity in the case of SNWD may make SNWD better explained and easier to capture in a lower dimension.

To test this, we plot the total variance of the SNWD data against the year. We also report the Pearson correlation coefficient and the p-value, both computed numerically.

![old_data_more_rel.png](old_data_more_rel.png)


Interestingly, total variance increases mildly with year, which runs against our suspicion that the earlier years are noisier. The pattern is very weak, however, and we would need further information to reject that suspicion outright.
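
A correlation coefficient and p-value of this kind can be computed along the following lines. This sketch uses synthetic data and a permutation test for the p-value, which is a swapped-in technique and not necessarily how the notebook obtained its numbers.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: how often shuffled data is at least as correlated."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Synthetic stand-in for (year, total SNWD variance in that year).
data_rng = random.Random(1)
years = list(range(1950, 2010))
total_variance = [100.0 + 0.2 * (y - 1950) + data_rng.gauss(0, 10) for y in years]

r = pearson_r(years, total_variance)
p = permutation_p_value(years, total_variance)
```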


## Analysis of snow depth

Moving on, we next choose to analyze the eigen-decomposition for snow-depth because the first 3 eigen-vectors explain 70% of the variance.

First, we graph the mean and the top 3 eigen-vectors.
We observe that the snow season is from mid-November to the end of April, where the middle of February marks the peak of the snow-depth.


![7_snwd_mean_eigs.jpg](7_snwd_mean_eigs.jpg)





Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, without changing the distribution over time.

**eig2 and eig3** are similar in the following way. Both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:

* **eig2:** less snow in Feb, more snow in March and April
* **eig3:** more snow in Dec, more snow in early April, less snow otherwise.

### Projection  and Residuals

The main power of PCA lies in understanding the data in a few dimensions, which is achieved by projecting the data onto the eigenvectors. The **magnitude** of each projection indicates how important that eigenvector (which in our case is synonymous with 'a particular weather pattern or weather phenomenon') is in determining the actual SNWD value.

We project the SNWD values from every (station, year) pair onto the first few eigenvectors. We first present the results as box plots, which are well suited to showing the distribution of the coefficients and residuals.


Coeff1, Coeff2 and Coeff3 represent the projections onto principal components 1, 2 and 3 respectively. Similarly, res1, res2 and res3 represent the residual variance after a specific vector has been reconstructed using 1, 2 and 3 eigenvectors (and the mean) respectively.
Note that the principal components have been ordered according to their associated eigenvalues in descending order (which is also the order of importance in explaining the observed variance).
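
The coefficients and residuals described above can be sketched as follows. The data is synthetic, and the normalization of the residual (as a fraction of the squared distance from the mean) is an assumption; the notebook may normalize differently.

```python
import numpy as np

def project_and_residuals(vectors, mean, eigvecs):
    """Project centered vectors onto orthonormal eigenvectors.

    vectors: (n, d) array; mean: (d,) array; eigvecs: (d, k) orthonormal columns.
    Returns coeffs (n, k) and residuals (n, k), where residuals[:, j] is the
    share of ||v - mean||^2 left after using the first j + 1 eigenvectors.
    """
    centered = vectors - mean
    coeffs = centered @ eigvecs
    total = np.sum(centered ** 2, axis=1, keepdims=True)
    # With orthonormal eigenvectors, captured energy is the sum of squared coefficients.
    captured = np.cumsum(coeffs ** 2, axis=1)
    residuals = 1.0 - captured / total
    return coeffs, residuals

# Synthetic stand-in for the (station, year) SNWD vectors.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 365)).cumsum(axis=1)
mean = data.mean(axis=0)
cov = np.cov(data - mean, rowvar=False)
eigvals, eigvecs_all = np.linalg.eigh(cov)   # eigenvalues in ascending order
top3 = eigvecs_all[:, ::-1][:, :3]           # reorder so the largest come first

coeffs, residuals = project_and_residuals(data, mean, top3)
```

Each row of `residuals` is non-increasing from left to right: adding an eigenvector can only reduce what is left unexplained.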

<center>**Coefficients box plot**</center>
![45_snwd_coeff_boxplot.png](45_snwd_coeff_boxplot.png)
<center>**Residuals box plot**</center>
![45_snwd_res_boxplot.png](45_snwd_res_boxplot.png)

Again, violin plots help in visualizing the distribution of the residuals. Each black dot corresponds to a (station, year) pair having that specific residual value. In the following three plots, notice how an increasing number of (station, year) pairs have low residual variance. This confirms that as we use more eigenvectors in the reconstruction, we get progressively better approximations to the actual SNWD data.

<center>**Res1 violin Plot**</center>
![45_allm_violin_res1.png](45_allm_violin_res1.png)
<center>**Res2 violin Plot**</center>
![45_allm_violin_res2.png](45_allm_violin_res2.png)
<center>**Res3 violin Plot**</center>
![45_allm_violin_res3.png](45_allm_violin_res3.png)



We now analyze the residuals after reconstruction for the SNWD observation.

We provide a series of reconstructions. We first sort the data points by how much they are explained by coeff2 (the second eigenvector). We treat the first eigenvector separately (using a different method) later  in this notebook.
#### Coeff2
Coeff2: most positive
![45_snwd_recon_grid_2.jpg](45_snwd_recon_grid_2.jpg)
Coeff2: most negative
![45_snwd_recon_grid_1.jpg](45_snwd_recon_grid_1.jpg)

### Conclusions:

We infer that large positive values of coeff2 correspond to a snow season with a peak in April, while large negative values of coeff2 correspond to a season with a single peak in the beginning of February.

### CDF of Residuals
Specifically, we focus on residual 2 and residual 3 (the analysis for Eigenvector 1 is done later in the section on maps). We visualize their cumulative distribution in order to study how many of the stations are explained by reconstruction with the top-2 and top-3 eigenvectors.
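
An empirical cumulative distribution of this kind can be computed as in the sketch below. The residual values and the 0.3 cut-off are made up for illustration.

```python
def empirical_cdf(values):
    """Return the sorted values and the cumulative fraction (i + 1) / n at each."""
    ordered = sorted(values)
    n = len(ordered)
    fractions = [(i + 1) / n for i in range(n)]
    return ordered, fractions

def fraction_at_or_below(values, threshold):
    """Fraction of values that are less than or equal to the threshold."""
    return sum(v <= threshold for v in values) / len(values)

# Made-up residuals after reconstruction with the top 2 eigenvectors.
res_2 = [0.05, 0.10, 0.12, 0.30, 0.45, 0.60, 0.80, 0.95]

xs, ps = empirical_cdf(res_2)
well_explained = fraction_at_or_below(res_2, 0.3)
```

Reading the CDF at a chosen residual level gives the fraction of (station, year) pairs that are explained at least that well.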

<center>**Residual 2**</center>
![45_snwd_res2.jpg](45_snwd_res2.jpg)

<center>**Residual 3**</center>
![7_snwd_res3_cdf.jpg](7_snwd_res3_cdf.jpg)

We now are in a position to present the best reconstruction for SNWD using 2 Eigenvectors. Contrast this to the following plot for PRCP, which ends up being very poorly explained.
<center>**Best Reconstruction of SNWD**</center>
![45_snwd_best_reconst.jpg](45_snwd_best_reconst.jpg)

<center>**Best Reconstruction of PRCP**</center>
![45_prcp_best_reconst.jpg](45_prcp_best_reconst.jpg)


### Conclusions:
We conclude that the top 2 eigenvectors do a decent job, capturing approximately 80% of the variance for SNWD. For PRCP, however, 2 eigenvectors are insufficient to explain the variance. We would need many more eigenvectors (around 100!) to explain a good share of it, and we refrain from doing so since that defeats the purpose of dimension reduction using PCA.

## Precipitation Analysis for stations:

We now delve into the PRCP measurements. Specifically, we set out to answer questions like "Do stations close to each other tend to experience similar rainfall?"

Toward this end, we plot a correlation heat map of stations according to their precipitation log normal probability. (The log normal probability was calculated in class as the (joint) probability that two stations experience rainfall together.) For our purposes, this means that the darker the square at index (i,j), the more likely the two stations are to experience similar precipitation behaviour, compared with pairs with lighter shades.
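
To make the idea of "raining together" concrete, here is a simplified stand-in for a pairwise score. It is not the measure computed in class: it compares the observed fraction of days on which both stations report rain with what independence would predict, and takes the log of the ratio. The function name, the `None` convention for missing days, and the sample values are all assumptions.

```python
import math

def rain_cooccurrence_score(prcp_a, prcp_b, threshold=0.0):
    """log( P(both rain) / (P(a rains) * P(b rains)) ) over shared days.

    None marks a missing observation. Returns None when the score is undefined.
    """
    pairs = [(a, b) for a, b in zip(prcp_a, prcp_b) if a is not None and b is not None]
    if not pairs:
        return None
    n = len(pairs)
    p_a = sum(a > threshold for a, _ in pairs) / n
    p_b = sum(b > threshold for _, b in pairs) / n
    p_both = sum(a > threshold and b > threshold for a, b in pairs) / n
    if p_a == 0 or p_b == 0 or p_both == 0:
        return None
    return math.log(p_both / (p_a * p_b))

# Made-up daily precipitation (mm) for two hypothetical stations.
station_x = [0.0, 5.1, 0.0, 2.0, None, 0.0, 7.3, 0.0]
station_y = [0.0, 4.0, 0.0, 1.5, 3.0, 0.0, 0.0, 0.0]

score = rain_cooccurrence_score(station_x, station_y)
```

A positive score means the two stations see rain on the same days more often than chance would suggest; a score near zero means their rainy days look independent.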

<center>**PRCP Heat Map**</center>
![7_seaborn_bw.png](7_seaborn_bw.png)

We next sort the stations by their projection values on eigenvectors 1, 2, 3 and 4, and then plot new heat maps to show how the correlation between stations depends on these eigenvectors.
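
Reordering a station-by-station matrix according to a projection coefficient can be sketched like this. The matrix and coefficient values are made up, and the function name is ours.

```python
import numpy as np

def reorder_by_coefficient(matrix, coeffs):
    """Sort rows and columns of a square matrix by ascending coefficient."""
    order = np.argsort(coeffs)
    return matrix[np.ix_(order, order)], order

# Made-up pairwise scores for three hypothetical stations.
pair_scores = np.array([[1.0, 0.2, 0.9],
                        [0.2, 1.0, 0.1],
                        [0.9, 0.1, 1.0]])
coeff1 = np.array([0.5, -1.0, 0.7])   # made-up projections on one eigenvector

reordered, order = reorder_by_coefficient(pair_scores, coeff1)
```

If stations with similar coefficients also have high pairwise scores, the reordered matrix shows dark blocks along the diagonal.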

<center>**PRCP Heat Map (using eig1 and eig2)**</center>
![7_res_prcp_color_matrix_1.jpg](7_res_prcp_color_matrix_1.jpg)

<center>**PRCP Heat Map (using eig3 and eig4)**</center>
![7_res_prcp_color_matrix_2.jpg](7_res_prcp_color_matrix_2.jpg)

What do all these heat maps mean, and how can we visualize them?
To answer this, we plot the coefficients of the top eigenvectors for the stations which showed high correlation:
![close.png](close.png)

### Conclusion
We find that most of these stations are very close to each other. We thus conclude that in this case geographic proximity explains the high correlation.


## Data on Maps

We now move on to visualizing more data on maps. First, to get a feel for the region of the United States that we are working with, the latitude and longitude of 2 random stations in the dataset are shown below.
![5_latitude_approximate.png](5_latitude_approximate.png)

In the spirit of finding patterns in data both numerically and geographically, we now seek to address the following questions: Which stations' SNWD data tend to depend a lot on Eigenvector 1? Do stations geographically close to each other tend to have the same amount of dependence? Does the latitude, longitude or elevation of the station primarily dictate this dependence?
We will answer these questions using appropriate visualizations and the statistical plots needed to support them.

It is time to address the Coeff1 values. First we (qualitatively) visualize the coeff1 values of different stations in the region. In the following map, the color of each circle corresponds to the average value of coeff1 for the stations encompassed by the circle. The size of the circle is proportional to how many stations there are in that region.
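
The aggregation behind such a map can be approximated by grouping stations into grid cells. This is a stand-in for whatever clustering the map widget actually performs: the one-degree cell size, the function name, and the station tuples are assumptions.

```python
import math
from collections import defaultdict

def bin_stations(stations, cell_degrees=1.0):
    """Group stations into latitude/longitude cells.

    stations: iterable of (latitude, longitude, coeff1).
    Returns {(lat_cell, lon_cell): (mean_coeff1, station_count)}.
    """
    cells = defaultdict(list)
    for lat, lon, coeff in stations:
        key = (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))
        cells[key].append(coeff)
    return {key: (sum(vals) / len(vals), len(vals)) for key, vals in cells.items()}

# Hypothetical stations: (latitude, longitude, coeff1).
stations = [
    (48.2, -101.3, 120.0),
    (48.7, -101.9, 80.0),
    (47.1, -100.2, -40.0),
]

summary = bin_stations(stations)
```

The mean in each cell would drive the circle color and the count would drive the circle size.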


![5_map_coeff1_circles.png](5_map_coeff1_circles.png)
**To the reader: It will be interesting to have a second look at this map after reading the analysis in the next section.**


Next, we do the same for coeff2 and qualitatively visualize the dependence of the stations on the second principal component.
![5_map_coeff2.png](5_map_coeff2.png)
**To the reader: It will be interesting to have a second look at this map after reading the analysis in the next section.**

Now, we simultaneously plot the values of all 4 coefficients in diamond shaped markings, where each triangle (of a given color) corresponds to a specific coefficient.
![snwd_new_maps.png](55_map_4coeff.png)
**To the reader: It will be interesting to have a second look at this map after reading the analysis in the next section.**






## Statistical support for data on maps

What statistical plots can we provide to support the qualitative observations above? One way is to approximate the probability distributions of the variables involved (such as elevation, latitude and longitude) and numerically compute the joint probability density of each of these variables with the coeff1 values. This will help us explain dependence on the statistically most important eigenvector, namely eigenvector 1.

Thankfully, there are Python libraries like seaborn that help us in this process.

We thus present the joint density plots along with the correlation plots side by side for easy comparison:
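
The numbers behind such a plot can be sketched with NumPy alone: a binned joint density and a Pearson correlation coefficient. The data is synthetic, with a weak negative relationship built in purely for illustration; for the plot itself, seaborn's `jointplot` is one option.

```python
import numpy as np

# Synthetic stand-in: station elevation (metres) and coeff1 values.
rng = np.random.default_rng(0)
elevation = rng.uniform(400, 800, size=300)
coeff1 = -0.5 * (elevation - 600) + rng.normal(0, 150, size=300)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(elevation, coeff1)[0, 1]

# Binned estimate of the joint probability density.
density, x_edges, y_edges = np.histogram2d(elevation, coeff1, bins=10, density=True)

# Sanity check: the density integrates to 1 over the binned area.
cell_area = np.outer(np.diff(x_edges), np.diff(y_edges))
total_probability = float((density * cell_area).sum())
```

A correlation near zero shows up as a roughly round, structureless density; a stronger correlation tilts the density along a diagonal.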

<center>**Do stations of a particular latitude tend to be heavily dictated by Principal component 1?**</center>

![latitude_new.png](latitude_new.png)

### Conclusion: The answer is no, not really. 
Observe that the Pearson correlation coefficient is very small, and the density plot reflects this. Specifically, this means that the coeff1 values of stations show no structure with respect to latitude.

<center>**Do stations of a particular elevation tend to be heavily dictated by Principal component 1?**</center>
![elevation_new.png](elevation_new.png)

### Conclusion: The answer is yes, but only just. 
Observe that the Pearson correlation coefficient is still somewhat small, but not negligible, and the density plot reflects this. Specifically, this means that station elevation is (weakly) negatively correlated with the coeff1 value.


<center>**Do stations of a particular longitude tend to be heavily dictated by Principal component 1?**</center>
![longitude_new.png](longitude_new.png)
### Conclusion: The answer is no, not really. 
Observe that the Pearson correlation coefficient is very small, and the density plot reflects this. Specifically, this means that station longitude is (very weakly) positively correlated with the coeff1 value. The correlation is so weak that the relationship can be considered random.


During the analysis, we also came across an interesting bimodal, Gaussian-looking joint distribution for coeff2 and latitude, suggesting that latitude and principal component 2 are more or less independently distributed:

![latitude_stations_system_2.png](latitude_stations_system_2.png)



