# Minnesota Weather Analysis

This report analyzes historical weather patterns in an area that roughly overlaps the northern part of Minnesota.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Determining Locale of Stations
Extracting the minimum and maximum latitudes and longitudes from the data shows that the stations span an area in the northern part of Minnesota, together with some parts of Ontario, Canada.
<p>Latitude Span -> 46.3167 to 49.833</p>
<p>Longitude Span -> -92.5253 to -94.9281</p>
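
A span like the one above can be obtained with a simple min/max over the station coordinates. The following is a minimal sketch, assuming the stations are available as (latitude, longitude) pairs; the `stations` list is made-up illustrative data, not the actual station table.

```python
def bounding_box(stations):
    """Return ((min_lat, max_lat), (min_lon, max_lon)) for (lat, lon) pairs."""
    lats = [lat for lat, _ in stations]
    lons = [lon for _, lon in stations]
    return (min(lats), max(lats)), (min(lons), max(lons))

# Hypothetical coordinates, chosen only to illustrate the call.
stations = [(46.3167, -92.5253), (48.1, -93.7), (49.833, -94.9281)]
lat_span, lon_span = bounding_box(stations)
print("Latitude span:", lat_span)
print("Longitude span:", lon_span)
```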

The following map shows the stations for which we have data available.
<p><img src="Minnesota/allpoints.png" style="height:400px; width:600px" /></p>

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>
<p>Note: the online data is for a single station, whereas our analysis averages over several stations within the area, so some discrepancies are expected when making comparisons.</p>
<p>&nbsp;</p>

<p><img style='height:300px' alt="Minnesota/Minnesota-BigFalls-SanityCheck.jpg" src="Minnesota/Minnesota-BigFalls-SanityCheck.jpg" /></p>

<p>&nbsp;</p>

<p>We see that the min and max daily&nbsp;temperatures agree with the ones we got from our data (over the entire duration of the year), once we translate Fahrenheit to Centigrade. We also see that the region experiences hot/warm summers and cold winters, which is very characteristic of Minnesota.</p>

<p>&nbsp;</p>

<p><img src="Minnesota/TMIN,TMAX.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation&nbsp;we need to translate millimeters/day to inches/month. The precipitation data available online and the data from our analysis clearly follow the same trend. For example, the online data shows about 4 inches of precipitation for the month of July. Our analysis shows nearly 4 mm/day during July, which over 31 days comes to roughly 4 to 5 inches/month (a slight discrepancy is expected, as our data is the average over a number of stations).</p>

<p>&nbsp;<img src="Minnesota/PRCP_1.png" style="height:300px; width:400px" /></p>
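
The unit conversions used in this sanity check are easy to get wrong, so here is a small sketch of them. The function names are ours and are not part of the original analysis code; the only facts used are 1 inch = 25.4 mm and the standard Fahrenheit-to-Celsius formula.

```python
MM_PER_INCH = 25.4

def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

def mm_per_day_to_inches_per_month(mm_per_day, days_in_month):
    """Convert an average daily precipitation rate to a monthly total in inches."""
    return mm_per_day * days_in_month / MM_PER_INCH

july_inches = mm_per_day_to_inches_per_month(4.0, 31)
print(f"4 mm/day over July = {july_inches:.2f} inches/month")
print(f"80 F = {fahrenheit_to_celsius(80.0):.1f} C")
```

An average of exactly 4 mm/day over 31 days works out to about 4.9 inches, so a daily average slightly below 4 mm/day lands between 4 and 5 inches for the month.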


## PCA analysis

For each of the six measurements (TMIN, TMAX, TOBS, PRCP, SNOW, SNWD), we compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](Minnesota/VarExplained1.png)
We see that the top 5 eigenvectors explain 6% of the variance for TMIN, 28% for TOBS and 18.5% for TMAX. The TMIN figure appears to be corrupted, since the eigenvalues cannot be negative for the given data; the right value would have been around 12-13%.

We conclude that of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains 18% of the variance. The eigenvalue vs. rank plot follows a power law (a straight line on a log-log plot), so the percentage of variance explained by each subsequent eigenvector keeps shrinking. Still, 28% of the variance explained is a low figure, so we look at the other measurements below.
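
As a rough illustration of how a "percentage of variance explained" figure is obtained, the sketch below computes it from the eigenvalues of a covariance matrix. The data is random and synthetic, and `variance_explained` is a name introduced here for illustration; the original analysis also has to cope with missing values, which this sketch ignores.

```python
import numpy as np

def variance_explained(eigenvalues, k):
    """Fraction of total variance captured by the k largest eigenvalues."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return ev[:k].sum() / ev.sum()

# Synthetic example: 200 "years" of 365 daily values (random, not NOAA data).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 365))
cov = np.cov(data, rowvar=False)          # 365 x 365 covariance matrix
eigenvalues = np.linalg.eigvalsh(cov)     # real eigenvalues of a symmetric matrix
top5 = variance_explained(eigenvalues, 5)
print(f"Top 5 eigenvectors explain {100 * top5:.1f}% of the variance")
```

A covariance matrix is positive semi-definite, so eigenvalues that are negative by more than rounding error indicate a problem in how the matrix was estimated, which is consistent with the suspicion about TMIN above.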

![VarExplained2.png](Minnesota/VarExplained2.png)

The top 5 eigenvectors explain 7.5% of the variance for PRCP and 12% for SNOW. Both are low values. On the other hand, the top 5 eigenvectors explain 90% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow-depth signals. Based on that, we will dig deeper into the PCA analysis for snow depth.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
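
The "decaying integral" idea can be illustrated with a toy model. In the sketch below the decay factor of 0.9 per day and the synthetic snowfall series are assumptions made purely for illustration; they are not fitted to the NOAA data.

```python
import numpy as np

def decaying_integral(snow, decay=0.9):
    """Toy snow-depth model: depth[t] = decay * depth[t-1] + snow[t]."""
    depth = np.zeros(len(snow))
    level = 0.0
    for t, fall in enumerate(snow):
        level = decay * level + fall
        depth[t] = level
    return depth

def relative_jump(x):
    """Typical day-to-day change relative to the overall spread of the series."""
    return np.std(np.diff(x)) / np.std(x)

rng = np.random.default_rng(1)
# Synthetic daily snowfall: snow on roughly 1 day in 5, otherwise zero.
snow = rng.exponential(10.0, size=365) * (rng.random(365) < 0.2)
depth = decaying_integral(snow)

print("SNOW relative day-to-day change:", round(relative_jump(snow), 2))
print("SNWD relative day-to-day change:", round(relative_jump(depth), 2))
```

In this toy setting the accumulated depth changes much less from one day to the next, relative to its overall spread, than the raw snowfall does.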

To this end, we look at the mean ± std plot for SNWD below.
<p>&nbsp;<img src="Minnesota/SNWD_mean.png" style="height:300px; width:400px" /></p>


## Analysis of snow depth

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain more than 80% of the variance.

Looking at the valid-counts figure below, we see that we have sufficient readings for every day of the year (taking all years into account), so it is reasonable to proceed with the analysis of snow depth.
<p>&nbsp;<img src="Minnesota/validcounts.png" style="height:300px;" /></p>

First, we graph the mean and the top 4 eigen-vectors.


The mean shows that the snow season runs from mid-November to early May, with snow depth peaking in February and March (the peak can be slightly delayed, which is explained by the eigenvectors). Looking at the mean together with the top eigenvectors plotted below, it is clear why a combination of the top eigenvectors explains a majority of the observed variance: there is little fluctuation in the mean and little noise in the eigenvectors.

<p>&nbsp;<img src="Minnesota/Top4Eigen_SNWD.png" style="height:450px;" /></p>

Next we interpret the eigen-functions. The first eigen-function (eig1) follows a trend very similar to the mean function. The main difference is that the eigen-function is almost zero during November and December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months (they are zero during the rest of the year), but they don't change much of the total snow depth.

They can be interpreted as follows:
* **eig2:** very little snow from January to mid-May (oscillatory), more snow from mid-November to January.
* **eig3:** less snow in January-February, more snow from March to mid-May, less snow in November-December.
* **eig4:** more snow in January-February, less snow in March, more snow from April to mid-May, and almost no change in November-December.


## Examples of reconstructions

In all of the graphs below, the reconstruction of the target vector becomes more accurate as more eigenvectors are added. The reconstruction also helps to smooth out the noise in the daily readings.
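
The reconstruction idea can be sketched as follows: project the mean-subtracted vector onto the first k eigenvectors and add the projections back to the mean. This is a minimal sketch on synthetic data, assuming orthonormal eigenvectors stored as columns; the function name `reconstruct` is ours, and missing values (which the real data contains) are not handled.

```python
import numpy as np

def reconstruct(x, mean, eigvecs, k):
    """Approximate x by mean + sum of its projections on the first k eigenvectors.

    eigvecs has one orthonormal eigenvector per column.
    Returns the reconstruction and the k coefficients.
    """
    centered = x - mean
    coeffs = eigvecs[:, :k].T @ centered
    return mean + eigvecs[:, :k] @ coeffs, coeffs

# Synthetic stand-in for (year, station) snow-depth vectors, not NOAA data.
rng = np.random.default_rng(2)
data = rng.normal(size=(300, 50)) @ rng.normal(size=(50, 50))
mean = data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
eigvecs = eigvecs[:, ::-1]                # eigh sorts ascending; put largest first

x = data[0]
errors = []
for k in range(1, 5):
    approx, _ = reconstruct(x, mean, eigvecs, k)
    errors.append(float(np.sum((x - approx) ** 2)))
print(errors)  # squared error never increases as k grows
```

Because the eigenvectors are orthonormal, each additional one can only remove error, which is why the reconstructions improve monotonically with k.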

#### Coeff1
Coeff1: small (negative) values
![c1small.png](Minnesota/c1_small.png)
Coeff1: large (positive) values
![c1large.png](Minnesota/c1_large.png)
From the above figures it is evident that large values of coeff_1 correspond to more than average snow and low values correspond to less than average snow.

#### Coeff2
Coeff2: small values
![c2small.png](Minnesota/c2_small.png)
Coeff2: large values
![c2large.png](Minnesota/c2_large.png)

From the above figures it is evident that large values of coeff_2 correspond to a very short snow season (with a peak in November-December) and low values correspond to a long snow season (with peaks in January-March). An alternative reading is that when the snow season starts early, the snow is cleared at the beginning of the year (probably as a goodwill gesture by the local authorities!).

#### Coeff3
Coeff3: small values
![c3small.png](Minnesota/c3_small.png)
Coeff3: large values
![c3large.png](Minnesota/c3_large.png)

From the above figures it is evident that large values of coeff_3 correspond to a delayed snow season (peaks in March-April) and low values correspond to an early snow season (peaks in January-February).

### Cumulative Distribution of residuals and coefficients
The graph below shows the cumulative distribution function of the first three residuals. From the figure it is evident that residual 2 is better than residual 1, since the first two eigenvectors combined capture more variance than the first eigenvector alone. That is, the residual error left after reconstructing with only the first eigenvector is larger than the one left after using the first two combined, so with two eigenvectors a larger number of instances are explained with a low residual error. Similarly, adding the third eigenvector makes residual 3 better than residual 2. That said, extending this further would not yield similar gains, as the percentage of variance explained by subsequent eigenvectors drops off drastically.<br/>
![resall_cdf.png](Minnesota/resall_cdf.png)
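
For readers unfamiliar with this kind of plot, the sketch below shows how an empirical CDF is built and how two sets of residuals can be compared with it. The residual values are hypothetical numbers chosen for illustration, not values from the analysis.

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values and the fraction of samples <= each value."""
    xs = np.sort(np.asarray(values, dtype=float))
    ps = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps

# Hypothetical residuals (fraction of variance left unexplained), for illustration.
res_1 = [0.60, 0.35, 0.80, 0.20, 0.50]
res_2 = [0.40, 0.15, 0.55, 0.10, 0.30]

for name, res in [("res_1", res_1), ("res_2", res_2)]:
    xs, ps = empirical_cdf(res)
    below = np.mean(np.asarray(res) <= 0.4)
    print(name, "fraction of samples with residual <= 0.4:", below)
```

A better reconstruction shows up as a CDF that rises sooner, meaning that a larger share of the samples has a small residual.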

The next graph shows the cumulative distribution function of the first three coefficients of the eigenvectors. Since the direction of maximum variance is captured by the first eigenvector, it is easy to see that most data points are dominated by a large coeff_1 compared to the other two coefficients. There is not much difference between coefficients 2 and 3 since, as we have seen, the percentage of variance explained by successive eigenvectors shrinks as we go further along.
![coeffall_cdf.png](Minnesota/coeffall_cdf.png)

##  Visualizing data distribution for SNWD

The map below shows the weather stations that collected the SNWD measurement in the northern Minnesota region. The size of each circle is directly proportional to the number of data points contributed by that weather station. The color within each circle is a heatmap of the average value of coeff_1, which tells us whether the station had more or less snow than the mean. A dark red circle indicates more snow than the mean, while a blue circle indicates less snow than the mean.

![ipyleaflet_coeff.png](Minnesota/ipyleaflet_coeff.png)

The plot below shows coeff_1 against the elevation of the weather station. We see that coeff_1 tends to increase with elevation, indicating more snow than average at higher stations (although we do see some anomalies).
![scatter_ipyleaflet.png](Minnesota/scatter_ipyleaflet.png)
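
One way to put a number on the trend seen in the scatter plot is a Pearson correlation together with a least-squares line. The per-station values below are hypothetical and serve only to show the computation; the variable names are ours.

```python
import numpy as np

# Hypothetical per-station values, only to show the computation.
elevation_m = np.array([330.0, 345.0, 360.0, 390.0, 410.0, 430.0])
mean_coeff_1 = np.array([-250.0, -90.0, 40.0, -20.0, 180.0, 310.0])

r = np.corrcoef(elevation_m, mean_coeff_1)[0, 1]             # Pearson correlation
slope, intercept = np.polyfit(elevation_m, mean_coeff_1, 1)  # least-squares line
print(f"Pearson r = {r:.2f}, slope = {slope:.2f} coeff units per metre")
```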

## Temporal Vs Spatial Analysis of Snow Depth
In the previous section we see the variation of Coeff1, which corresponds to the total amount of snow, with respect to location. We now estimate the relative importance of location-to-location variation relative to year-by-year variation.

These are measured by the fraction by which the RMS is reduced when we subtract from each station/year entry the average-per-station or the average-per-year, respectively. Here are the results:

**Coefficient 1**
* total RMS = 1607.08350635
* RMS removing mean-by-station = 1475.61439358, fraction explained = 8.2%
* RMS removing mean-by-year = 852.75670797, fraction explained = 46.9%

**Coefficient 2**
* total RMS = 750.23614434
* RMS removing mean-by-station = 732.635844818, fraction explained = 2.4%
* RMS removing mean-by-year = 396.294511574, fraction explained = 47.2%

**Coefficient 3**
* total RMS = 710.360614167
* RMS removing mean-by-station = 672.647841391, fraction explained = 5.3%
* RMS removing mean-by-year = 457.688315369, fraction explained = 35.63%
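
The "fraction explained" figures above can be reproduced from the RMS values; for example, 1 - 1475.6/1607.1 is about 8.2%. The sketch below shows one way such a computation could be organized, assuming the coefficients are arranged in a stations-by-years table with NaN for missing entries. The table values and function names are hypothetical.

```python
import numpy as np

def rms(a):
    """Root mean square over the non-missing entries."""
    return np.sqrt(np.nanmean(a ** 2))

def fraction_explained(table, axis):
    """Relative RMS reduction after subtracting the mean along `axis`.

    table: 2-D array, rows = stations, columns = years, NaN = no data.
    axis=1 removes each station's mean; axis=0 removes each year's mean.
    """
    means = np.nanmean(table, axis=axis, keepdims=True)
    return 1.0 - rms(table - means) / rms(table)

# Hypothetical coeff_1 table (3 stations x 4 years), not the real values.
table = np.array([
    [ 900.0, -1200.0,  300.0,  np.nan],
    [1100.0, -1000.0,  np.nan, -700.0],
    [ 800.0, -1500.0,  500.0,  -900.0],
])
by_station = fraction_explained(table, axis=1)
by_year = fraction_explained(table, axis=0)
print(f"by station: {100 * by_station:.1f}%   by year: {100 * by_year:.1f}%")
```

In this made-up table the values move together from year to year, so removing the per-year mean reduces the RMS far more than removing the per-station mean, which is the same qualitative pattern as in the results above.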

We see that the variation by year explains more than the variation by station. However, this dominance is weaker for coeff_1, which has to do with the total snowfall and where the station explains 8.2%, than for coeff_2, which, as we saw above, has to do with the length of the snow season and where the station explains only 2.4%.

##  Analysis of Snow Depth

We again look at the mean snow depth and the eigenvectors, where the first eigenvector tells us the amount of snow, the second the length of the snow season and the third the delay. The cumulative distribution function plot shows that we have snow for about 40% of the year (and no snow for about 60% of the year), which is also evident from the mean plot.
<p>&nbsp;<img src="Minnesota/7_mean_eigen_snwd.png" style="height:300px;float:left;" /><img src="Minnesota/cdf_snowdepth.png" style="height:258px;" /></p>

Since the weather stations are all close to one another, it is highly probable that snow at one station coincides with snow, and therefore snow depth, at a nearby station (unless of course that community is more active in removing the snow!). To accept or reject this hypothesis we begin by plotting a correlation matrix of log probabilities, where each value represents the probability that snow at two weather stations coincides by chance.
<p>&nbsp;<img src="Minnesota/pca_correlation.png" style="height:400px;" /></p>

We can see from the above graph that the last 40 weather stations are correlated (darker colors indicate stronger correlation); similarly, the first six appear to be correlated. To find more correlations we apply PCA to this correlation matrix and cluster the weather stations based on the first few principal components. As shown in the graph below, the top 10 eigenvectors of the correlation matrix explain about 90% of the squared magnitude of the matrix.

<p>&nbsp;<img src="Minnesota/7_percent_eigen.png" style="height:300px;" /></p>

For the purpose of clustering, we consider only the first four principal components. We sort the weather stations in the correlation matrix in increasing order of their components in the first eigenvector. The resulting correlation matrix, visualized as a heatmap, shows clusters of weather stations that are correlated based on the first eigenvector. We repeat this process for the 2nd, 3rd and 4th eigenvectors. Below are the new heatmaps obtained after sorting.
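
The sorting step can be sketched as follows: compute the eigenvectors of the symmetric matrix, sort the stations by their component in one eigenvector, and apply the same permutation to rows and columns. The toy matrix and the function name are assumptions for illustration. In this toy example the first eigenvector is constant across stations, so the second one is used to reveal the groups.

```python
import numpy as np

def reorder_by_eigenvector(matrix, which=0):
    """Reorder rows and columns of a symmetric matrix by one of its eigenvectors.

    which=0 uses the eigenvector with the largest eigenvalue.
    Returns the reordered matrix and the permutation that was applied.
    """
    eigvals, eigvecs = np.linalg.eigh(matrix)      # ascending eigenvalues
    vec = eigvecs[:, ::-1][:, which]
    order = np.argsort(vec)
    return matrix[np.ix_(order, order)], order

# Toy symmetric matrix with two hidden groups of stations: {0, 2, 4} and {1, 3, 5}.
groups = np.array([0, 1, 0, 1, 0, 1])
matrix = np.where(groups[:, None] == groups[None, :], 0.9, 0.1)
np.fill_diagonal(matrix, 1.0)

sorted_matrix, order = reorder_by_eigenvector(matrix, which=1)
print(order)
print(sorted_matrix)
```

After reordering, stations from the same group sit next to each other, so the correlated clusters appear as blocks along the diagonal of the heatmap.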

From the upper right heatmap we can see that the last 25 stations are correlated and especially the last 15 are highly correlated.

<p>&nbsp;<img src="Minnesota/eigenvector_correlation.png" style="height:700px;" /></p>

Plotting the first few correlated weather stations on a geo map, we get the below plot. From our analysis, it is evident that the weather stations that are nearby have a good chance of experiencing snowfall on the same day or consecutive days of the year. 

<p>&nbsp;<img src="Minnesota/gmplot_correlated1.png" style="height:300px;" /></p>

An alternate method of visualizing the data is to plot the weather stations along with the coefficients of the first principal components. The geo plot below represents the principal components by triangles, with the size of a triangle representing the magnitude of the coefficient and its fill representing the sign (filled triangles for negative and unfilled triangles for positive coefficients). As we can see, there are a few nearby weather stations with similar triangle structures, that is, similar principal components. Once again, red indicates a high coefficient value and blue indicates a low coefficient value.

<p>&nbsp;<img src="Minnesota/ipyleaflet_terrain.png" style="height:300px;" /></p>

Another way of looking at the snow depth correlation is the significance plot. After running a hypothesis test for every unique pair of stations, the plot below displays a histogram of the resulting p-values. The null hypothesis here is that the probability of snowfall on any given day is independent for the two stations.

The p-value is the probability of seeing a result at least as extreme as the one given by the data, under the null hypothesis. So a low p-value means that it is unlikely to see such results under the null hypothesis. The usual conclusion that one makes is that if the data is extremely unlikely under the null hypothesis, then the null hypothesis is wrong for that dataset.

<p>&nbsp;<img src="Minnesota/Significance.png" style="height:300px;" /></p>

The plot indicates that the null hypothesis cannot be trusted, i.e. we do see a good correlation among the various weather stations in the dataset. The spike at 0.4 is a result of comparing stations with themselves.
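
The report does not state which test was used, so the sketch below shows one plausible choice: a hypergeometric tail probability for the number of days on which both stations have snow, given how many snow days each station has. The function name and the counts are hypothetical.

```python
from math import comb

def coincidence_p_value(n_days, snow_a, snow_b, both):
    """P(overlap >= both) if snow days at the two stations were independent.

    n_days: days on which both stations reported
    snow_a, snow_b: number of those days with snow at station A / station B
    both: number of days with snow at both stations
    """
    total = comb(n_days, snow_b)
    tail = sum(
        comb(snow_a, j) * comb(n_days - snow_a, snow_b - j)
        for j in range(both, min(snow_a, snow_b) + 1)
    )
    return tail / total

# Hypothetical counts for one pair of stations.
p = coincidence_p_value(n_days=100, snow_a=40, snow_b=40, both=35)
print(f"p-value = {p:.3g}")
```

With these made-up counts, independence would predict about 16 shared snow days, so observing 35 gives a very small p-value and the null hypothesis of independence would be rejected for this pair.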

## Eigen Value vs Rank for all measures (Additional)
In the plots below, the eigenvalues of the covariance matrix are plotted against their rank for all six measurements.
The plots use a log-log scale and show that eigenvalue vs. rank obeys a power law, i.e. the eigenvalues decay quickly with rank.
<p><img src="Minnesota/eigen_tmin.png" style="height:300px;float:left;" /><img src="Minnesota/eigen_tmax.png" style="height:300px;" /></p>
<p><img src="Minnesota/eigen_tobs.png" style="height:300px;float:left;" /><img src="Minnesota/eigen_prcp.png" style="height:300px;" /></p>
<p><img src="Minnesota/eigen_snow.png" style="height:300px;float:left;" /><img src="Minnesota/eigen_snwd.png" style="height:300px;" /></p>
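
A power law appears as a straight line on a log-log plot, and its exponent can be estimated with a least-squares fit in log space. The sketch below does this for a synthetic spectrum; the function name and the exponent of -1.5 are assumptions for illustration and say nothing about the exponents in the plots above.

```python
import numpy as np

def loglog_slope(eigenvalues):
    """Least-squares slope of log(eigenvalue) against log(rank)."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    ev = ev[ev > 0]                          # log is undefined for values <= 0
    ranks = np.arange(1, len(ev) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(ev), 1)
    return slope

# Synthetic spectrum that follows eigenvalue = 1000 * rank**-1.5 exactly.
ranks = np.arange(1, 51)
slope = loglog_slope(1000.0 * ranks ** -1.5)
print(f"fitted exponent: {slope:.2f}")
```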