# North Dakota and Minnesota weather analysis

This report presents a historical analysis of weather patterns in a region that roughly covers eastern North Dakota and western Minnesota.

The data comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Geographical location of the region

We first plot the region on a map to visualize its location.

![Map.png](final_report/Map.png)

The region lies in the northern United States, close to the Canadian border.

## Sanity-check

### Comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from <a href="http://www.usclimatedata.com/climate/fargo/north-dakota/united-states/usnd0115" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>


<p><img alt="US_Data" src="final_report/USData.png" /></p>


<p>We see that the daily minimum and maximum temperatures agree with the ones we obtained from our data.</p>


<p><img alt="TMIN,TMAX.png" src="final_report/TMax_TMin.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation we need to convert millimeters/day to inches/month. According to our analysis, the average annual precipitation is 650 mm, whereas US Climate Data reports a figure closer to 573 mm, so our estimate is roughly 13% higher.

However, we can clearly see that it rains much more during the summer months than at other times of the year.</p>
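
The unit conversion mentioned above is simple arithmetic. The following is a minimal sketch of it; the function names are hypothetical and chosen only for this example, and the two annual totals are the figures quoted in the text.

```python
MM_PER_INCH = 25.4  # exact definition of the inch in millimeters


def mm_per_day_to_inches_per_month(mm_per_day, days_in_month):
    """Convert a mean daily precipitation (mm/day) to a monthly total in inches."""
    return mm_per_day * days_in_month / MM_PER_INCH


def annual_mm_to_inches(mm_per_year):
    """Convert an annual precipitation total from millimeters to inches."""
    return mm_per_year / MM_PER_INCH


# The two annual figures quoted in the text, expressed in inches.
ours = annual_mm_to_inches(650)
reference = annual_mm_to_inches(573)
print(round(ours, 1), round(reference, 1))  # 25.6 22.6

# Relative difference between the two estimates.
print(round((650 - 573) / 573 * 100, 1))  # 13.4
```

Converting both sources to the same unit before comparing them makes the size of the discrepancy explicit.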

<p>&nbsp;<img alt="PRCP.png" src="final_report/PRCP_avg.png" style="height:450px; width:600px" /></p>

<p>Our data was not perfect either. As we show later in the analysis, the temperature data was possibly corrupted. Many precipitation readings were also missing, which may have biased the analysis.</p>

### Initial Data Analysis

<p>We wanted to look for overall patterns of rainfall and snowfall in the area assigned. Thus, we plotted a graph for precipitation and snow depth in all the stations, as observed over the years. This helped us in visualizing the pattern and the presence of rainfall and snowfall in the region.</p>

![PRCPForAllStationsForAllYears.png](final_report/PRCPForAllStationsForAllYears.png)

<p>We were not very surprised to note that in some regions snow was present as late as May and June and arrived as early as September. This is understandable, since most of the North Dakota and Minnesota region lies in the northern part of North America, just below the Canadian border. These locations are known for their extreme weather conditions, which is evident from the graph below.</p>

![SNOWForAllStationsForAllYears.png](final_report/SNOWForAllStationsForAllYears.png)

<p>One surprising thing to note, though, was that it rained fairly consistently throughout the year rather than only during specific periods. Each month had a minimum of 30 mm of rainfall, rising to as much as 80 mm during the summer months.</p>

### Missing Observations

<p>Handling missing observations is probably the hardest part of data analysis. No dataset is perfect, and neither was this one.</p>

<p>To investigate the missing data, we plotted the number of valid observations for each measurement over the course of a year. The graphs of the valid counts are shown below.</p>

![Missing1.png](final_report/Missing1.png)

![Missing2.png](final_report/Missing2.png)

![Missing3.png](final_report/Missing3.png)

<p>The striking thing to note is that the graphs take the same shape for all the measurements. We have a low number of valid counts from January to March, followed by a consistently high number from April to October, and finally a low number again in November and December.</p>
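
As an illustration of what a "valid count" means here, the sketch below counts, for each day of the year, how many yearly series have a non-missing value. It assumes that missing readings are stored as NaN and that each station-year is a list of daily values; the function name and the toy data are hypothetical.

```python
import math


def valid_counts_per_day(records):
    """Count non-missing values for each day position.

    records: list of yearly series, each a list of daily values
             in which NaN marks a missing reading.
    """
    n_days = len(records[0])
    counts = [0] * n_days
    for series in records:
        for day, value in enumerate(series):
            if not math.isnan(value):
                counts[day] += 1
    return counts


nan = float("nan")
# Toy example: 3 station-years, 4 "days" each (hypothetical values).
toy = [
    [nan, 1.0, 2.0, nan],
    [nan, 0.5, nan, 3.0],
    [4.0, 1.5, 2.5, nan],
]
counts = valid_counts_per_day(toy)
print(counts)  # [1, 3, 2, 1]
```

Plotting such counts against the day of the year gives curves of the kind shown above.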

<p>These are cold places, where temperatures can drop to -60 degrees Fahrenheit (-51 degrees Celsius). Modern digital equipment makes recording weather data look easy, but in the past it was a manual task. This plausibly explains why we have fewer readings during the winter season, when these places experience harsh conditions.</p>

### Mean and standard deviations

<p>Next we plot the mean and standard deviations of all the six measurements. The graphs for the temperature data are quite standard. It is straightforward to observe that the temperatures are low during winters, gradually increase during the summers, and then finally decline during the end of the year.</p>

<p><img alt="TMIN,TMAX.png" src="final_report/TMax_TMin.png" style="height:300px; width:800px" /></p>

<p>The precipitation graph looks interesting. We can easily observe the large variance in the precipitation data. The mean line is far from the other two curves, which suggests that precipitation has the largest variance of all the measurements.</p>

<p>&nbsp;<img alt="PRCP.png" src="final_report/PRCP_avg.png" style="height:450px; width:600px" /></p>

<p>The snowfall graph is not as smooth as the snow depth graph. This is because it might snow on one day and not on the next, so there are spikes in the graph. Snow depth, however, remains fairly steady even when it does not snow every day.</p>

<p>&nbsp;<img alt="SNOW.png" src="final_report/SNOW_avg.png" style="height:300px; width:800px" /></p>


## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarianceExplainedTemp.png](final_report/VarianceExplainedTemp.png)
We see that the top 5 eigenvectors explain 10% of the variance for TMIN, 30% for TOBS and 17.5% for TMAX.
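
As a rough illustration of how such a curve is computed, the sketch below derives the cumulative fraction of variance explained from the eigenvalues of a covariance matrix. It uses NumPy and synthetic data; the function name `variance_explained` and the toy data are assumptions made for this example, not the code used in the analysis.

```python
import numpy as np


def variance_explained(cov):
    """Cumulative fraction of total variance captured by the top-k eigenvectors."""
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    return np.cumsum(eigvals) / np.sum(eigvals)


# Synthetic data: 200 samples, 6 features with very different scales.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.5, 0.2])
X = rng.normal(size=(200, 6)) * scales
cov = np.cov(X, rowvar=False)

frac = variance_explained(cov)
print(np.round(frac, 3))
```

When all eigenvalues are non-negative, the curve rises monotonically to 1.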

It is quite interesting to see that there is a dip in the variance explained when we add the second eigenvector of TMIN. This means that the eigenvalue associated with the second eigenvector is negative.

This usually happens when the covariance matrix is not positive definite. There are several possible reasons for this. We list some of them below:

<ol>
<li><strong>Linear dependency.</strong> One feature of the input vector may be completely linearly dependent on another feature, which can result in a non-positive-definite matrix.</li>
<li><strong>Missing data.</strong> A large number of missing readings can also result in a non-positive-definite matrix, since too few readings may be available for the input.</li>
<li><strong>Corrupt data.</strong> This is one of the most likely reasons for a non-positive-definite matrix. We generally obtain negative eigenvalues when the data has been corrupted, for example by an error during the data extraction and preprocessing stage.</li>
</ol>

To circumvent this problem, we calculated the nearest positive definite covariance matrix using the technique described in <a href="http://stackoverflow.com/questions/10939213/how-can-i-calculate-the-nearest-positive-semi-definite-matrix" target="_blank">this Stack Overflow question</a>.

After performing the above operation, we obtained a new covariance matrix. We then recalculated the eigenvectors and plotted the variance-explained graph again, as shown below.

![AfterCorrection.png](final_report/AfterCorrection.png)

We can clearly observe that there is no dip in the graph this time, and the variance explained by the top 5 eigenvectors is around 10%.
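
One common way to obtain a nearby positive semidefinite matrix is to symmetrize the matrix and clip its negative eigenvalues to zero. The sketch below shows that approach with NumPy. It is an assumption that this matches the technique actually used in the analysis; the function name `nearest_psd` and the toy matrix are hypothetical.

```python
import numpy as np


def nearest_psd(a):
    """Project a square matrix onto the positive semidefinite cone
    by symmetrizing it and setting negative eigenvalues to zero."""
    sym = (a + a.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, 0.0, None)
    # Rebuild the matrix as V * diag(clipped) * V^T.
    return (eigvecs * clipped) @ eigvecs.T


# Toy "correlation" matrix that is symmetric but not positive semidefinite,
# as can happen when entries are estimated from different subsets of data.
bad = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])
fixed = nearest_psd(bad)

print(np.linalg.eigvalsh(bad).min() < 0)        # True: one eigenvalue is negative
print(np.linalg.eigvalsh(fixed).min() > -1e-9)  # True: none is negative any more
```

After the correction, the cumulative variance-explained curve can no longer decrease, because no eigenvalue contributes a negative amount.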

We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors. This is especially true for the first eigenvector which, by itself, explains 20% of the variance. We also conclude that the data for TMIN might have been corrupted during the preprocessing stage.

![VarianceSnow.png](final_report/VarianceSnow.png)

The top 5 eigenvectors explain 8% of the variance for PRCP and 14% for SNOW. Both are low values. This is understandable given one of the earlier graphs, which showed that precipitation is among the measurements with the most variance.

On the other hand, the top 5 eigenvectors explain 75% of the variance for SNWD. This means that these top 5 eigenvectors capture most of the variation in the snow depth signals.

It makes sense that SNWD would be less noisy than SNOW. That is because SNWD is a decaying integral of SNOW and, as such, varies less between days and between the same date in different years.
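
To make the "decaying integral" idea concrete, the sketch below models snow depth as yesterday's depth multiplied by a decay factor plus today's snowfall. This is a deliberately simplified toy model: the decay factor of 0.9 and the snowfall series are hypothetical values chosen for illustration, not quantities estimated from the data.

```python
def snow_depth(snowfall, decay=0.9):
    """Toy model: depth[t] = decay * depth[t-1] + snowfall[t]."""
    depth = 0.0
    depths = []
    for fall in snowfall:
        depth = decay * depth + fall
        depths.append(depth)
    return depths


# Hypothetical daily snowfall (mm): isolated snow days separated by dry days.
snowfall = [10.0, 0.0, 0.0, 20.0, 0.0, 0.0, 0.0, 5.0]
depths = snow_depth(snowfall)

for fall, depth in zip(snowfall, depths):
    print(f"snowfall={fall:5.1f}  depth={depth:6.2f}")
```

In this model the snowfall series jumps between zero and large values, while the depth series changes gradually between snow days, which is the smoothing effect described above.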

We will first dig deeper into the precipitation analysis and show that it is not very fruitful, because the eigenvectors explain so little of the variance. We will then analyze snow depth, where the eigenvectors explain much of the variance.

## Analysis of precipitation

First, we graph the mean and the top 3 eigen-vectors.

We observe that although there is rain throughout the year, the maximum rain is recorded during the summer months, which is intuitive.
![PRCP_Mean.png](final_report/PRCP_Mean.png)

From the graph, we can observe that, because there is so much variance, none of the eigenvectors resembles the mean. They all oscillate between positive and negative values.

This shows that 3 eigenvectors are not sufficient to capture the variance present in the precipitation data, which was also inferred from the graphs of the previous section.

Next we try to reconstruct the signal using the coefficients c0, c1 and so on. From the graph below, we can clearly see that the curve labeled c3 (the mean plus the contributions of c0, c1 and c2) is nowhere close to the target. This is the best approximation available from the first 3 eigenvector coefficients, and it is clearly not a good representation.

![PRCP_Recon.png](final_report/PRCP_Recon.png)
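
A reconstruction of this kind can be sketched as projecting the mean-centered signal onto the first k eigenvectors and adding the mean back. The code below is a minimal NumPy illustration on synthetic data, assuming no missing values and orthonormal eigenvectors stored as columns; the function names and the data are hypothetical.

```python
import numpy as np


def reconstruct(x, mean, eigvecs, k):
    """Approximate x by the mean plus its projection on the first k eigenvectors.

    eigvecs holds one orthonormal eigenvector per column, sorted by
    decreasing eigenvalue.
    """
    centered = x - mean
    coeffs = eigvecs[:, :k].T @ centered
    return mean + eigvecs[:, :k] @ coeffs


def residual_fraction(x, mean, eigvecs, k):
    """Fraction of the centered signal's energy left unexplained after k terms."""
    approx = reconstruct(x, mean, eigvecs, k)
    return np.sum((x - approx) ** 2) / np.sum((x - mean) ** 2)


# Synthetic data: 50 samples of a 5-dimensional signal.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 5))
mean = data.mean(axis=0)
_, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
vecs = vecs[:, ::-1]  # eigh sorts ascending; put the largest eigenvalue first

x = data[0]
res = [residual_fraction(x, mean, vecs, k) for k in range(6)]
print(np.round(res, 3))
```

The residual starts at 1 with no eigenvectors and falls to 0 when all of them are used. A signal is well captured when the residual is already small after the first few terms.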

### Examples of reconstructions

#### Residual 1
Residual 1: Ascending order
![PRCP_R1Asc.png](final_report/PRCP_R1Asc.png)
Residual 1: Descending order
![PRCP_R1Desc.png](final_report/PRCP_R1Desc.png)

#### Residual 2
Residual 2: Ascending order
![PRCP_R2Asc.png](final_report/PRCP_R2Asc.png)
Residual 2: Descending order
![PRCP_R2Desc.png](final_report/PRCP_R2Desc.png)

#### Residual 3
Residual 3: Ascending order
![PRCP_R3Asc.png](final_report/PRCP_R3Asc.png)
Residual 3: Descending order
![PRCP_R3Desc.png](final_report/PRCP_R3Desc.png)

As can be clearly seen from these reconstruction plots, the eigenvectors do not do a very good job of capturing the variance.

Moreover, when we sort the coefficients in descending order, the plot is nowhere close to the target line. A high residual means that less variance is captured and the reconstruction is poor.



## Analysis of snow depth

We now choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 4 eigen-vectors.

We observe that the snow season is fairly long for the region: it snows from late October until early May, and February marks the peak of the snow season.
![SNWD_Mean.png](final_report/SNWD_Mean.png)

Also, note how closely the first eigenvector follows the shape of the mean. This is consistent with the fact that over 60% of the variance is explained by the first eigenvector and that very few eigenvectors capture most of the variance.

Next we interpret the eigen-functions. The first eigen-function (eig1) has a shape very similar to the mean function. The main difference is that the eigen-function is close to zero during October-December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, but without changing the distribution over time.

**eig2, eig3 and eig4** are similar in the following way. They all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they don't change the total (much).

They can be interpreted as follows:
* **eig2:** less snow in January to mid-February, more snow in mid-February to March and in December.
* **eig3:** less snow in January, slight snow in February, less snow in December.
* **eig4:** more snow in January, less in February, again more snow in March.


### Examples of reconstructions

#### Residual 1
Residual 1: Ascending order
![SNWD_R1Asc.png](final_report/SNWD_R1Asc.png)
Residual 1: Descending order
![SNWD_R1Desc.png](final_report/SNWD_R1Desc.png)

#### Residual 2
Residual 2: Ascending order
![SNWD_R2Asc.png](final_report/SNWD_R2Asc.png)
Residual 2: Descending order
![SNWD_R2Desc.png](final_report/SNWD_R2Desc.png)

#### Residual 3
Residual 3: Ascending order
![SNWD_R3Asc.png](final_report/SNWD_R3Asc.png)
Residual 3: Descending order
![SNWD_R3Desc.png](final_report/SNWD_R3Desc.png)

As we can clearly see, the reconstruction from the first three eigenvectors almost overlaps the target plot. This means that there is very little residual and the top three eigenvectors capture most of the variance.



The intuitive conclusion from this analysis is that snow depth can be captured in very few dimensions. Think of it this way: if we are asked to represent 365 days of snow depth data in 4 dimensions, we can divide the year into 4 quarters, predict the mean for each quarter, and still incur very little error, because snow depth is usually constant for long periods. Remember that for the entire summer both the mean and the data are 0. This is why we can capture snow depth in fewer dimensions.

Now, coming to precipitation, it is possible that it rains on 300 days out of 365. Hence, representing 365 days of precipitation data in fewer dimensions is not possible without a large error, because it rains in different amounts each day, so there is a lot of variance in the precipitation data.

This is what we observe in the graphs above.

## Rainfall correlation between stations

It is very difficult to find the correlation in the amount of rainfall between stations, because the amount varies from day to day even at the same place. It is, however, possible to test whether the occurrence of rain (whether it rained or not) is correlated between stations.

In this case, our null hypothesis is that rainfall at two locations is independent. We then compute p-values for each pair of stations and plot the significance graph as shown below.

![Significance.png](final_report/Significance.png)

Since almost all the values are less than 0.5, we can reject our null hypothesis. Thus, there is some correlation in rainfall between different stations.
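
The report does not state which statistical test produced these p-values. As one possible illustration, the sketch below applies a chi-square test of independence to the 2x2 table of rain/no-rain days at two stations. The choice of test, the function name and the toy data are assumptions made for this example.

```python
import math


def rain_independence_p_value(rain_a, rain_b):
    """P-value of a chi-square test (1 degree of freedom) for independence of
    rain occurrence at two stations.

    rain_a, rain_b: equal-length lists of booleans (True = it rained that day).
    """
    n = len(rain_a)
    both = sum(1 for a, b in zip(rain_a, rain_b) if a and b)
    only_a = sum(1 for a, b in zip(rain_a, rain_b) if a and not b)
    only_b = sum(1 for a, b in zip(rain_a, rain_b) if b and not a)
    neither = n - both - only_a - only_b

    observed = [[both, only_a], [only_b, neither]]
    row = [both + only_a, only_b + neither]
    col = [both + only_b, only_a + neither]
    if 0 in row or 0 in col:
        return 1.0  # a station with no variation carries no evidence

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical stations: B1 always agrees with A, B2 is unrelated to A.
a = [True] * 50 + [False] * 50
b1 = list(a)
b2 = [True, False] * 50

print(rain_independence_p_value(a, b1) < 1e-6)  # True: independence is rejected
print(rain_independence_p_value(a, b2))         # 1.0: no evidence against independence
```

A small p-value means that the observed co-occurrence of rainy days would be very unlikely if the two stations were independent.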

We now dig deeper to find out correlation between stations. To visualize this, we plot a matrix of the p-values for all the stations.

![Mat.png](final_report/Mat.png)

If we look at the matrix, we can clearly see that ~10-12 stations are highly correlated with each other, while the rest are not.

If we reorder the matrix using the top 4 eigenvectors, then the groups of correlated stations become more evident.

![EMat.png](final_report/EMat.png)
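
The exact reordering procedure is not described in this report. As a simplified illustration of the idea, the sketch below sorts the rows and columns of a similarity matrix by the entries of its leading eigenvector only, which is enough to bring two groups of related stations together in a toy example. The function name, the group labels and the similarity values are hypothetical.

```python
import numpy as np


def reorder_by_leading_eigenvector(mat):
    """Sort rows and columns of a symmetric matrix by its leading eigenvector."""
    _, eigvecs = np.linalg.eigh(mat)
    leading = eigvecs[:, -1]  # eigh sorts eigenvalues in ascending order
    order = np.argsort(leading)
    return order, mat[np.ix_(order, order)]


# Toy similarity matrix: stations 0, 2, 4 form one group and 1, 3, 5 another.
groups = np.array([0, 1, 0, 1, 0, 1])
within = {0: 0.9, 1: 0.6}  # hypothetical within-group similarity
n = len(groups)
sim = np.full((n, n), 0.1)  # weak similarity across groups
for i in range(n):
    for j in range(n):
        if i == j:
            sim[i, j] = 1.0
        elif groups[i] == groups[j]:
            sim[i, j] = within[int(groups[i])]

order, reordered = reorder_by_leading_eigenvector(sim)
print(groups[order])  # members of each group are now adjacent
print(np.round(reordered, 1))
```

After reordering, the similarity matrix shows two blocks along the diagonal, which is the kind of structure that makes correlated groups easier to see.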

Now, let us plot these highly correlated stations on the map to check if there is any spatial dependence between these regions.

![MapL.png](final_report/MapL.png)

As we can see from the map, these stations are not located near each other, yet their rainfall is highly correlated. This observation is quite surprising.

## Using data to visualize global warming

Long-term weather data is well suited to visualizing the effects of global warming on the climate. Since we had some data for the past century, we tried plotting the amount of rainfall per year, as well as the amount of snow per year.

It is quite interesting to see that the amount of rainfall has increased gradually and the amount of snow has decreased with time.

![PRCP.png](final_report/PRCP.png)

![SNWD.png](final_report/SNWD.png)

This trend is visible, to some extent, in the graphs above. Note that there are certain sharp dips in the graphs, but they are probably due to missing data.
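
One way to quantify such a trend, rather than reading it off a plot, is to fit a straight line to the yearly totals while skipping years with missing data. The sketch below does this with an ordinary least-squares fit; the function name and the yearly totals are hypothetical and serve only to show the calculation.

```python
def linear_trend(years, totals):
    """Least-squares slope and intercept of totals versus years.

    Years whose total is None (missing) are skipped.
    """
    pairs = [(y, t) for y, t in zip(years, totals) if t is not None]
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_t = sum(t for _, t in pairs) / n
    sxx = sum((y - mean_y) ** 2 for y, _ in pairs)
    sxy = sum((y - mean_y) * (t - mean_t) for y, t in pairs)
    slope = sxy / sxx
    intercept = mean_t - slope * mean_y
    return slope, intercept


# Hypothetical annual precipitation totals (mm) with one missing year.
years = [2000, 2001, 2002, 2003, 2004]
totals = [500.0, None, 520.0, 530.0, 540.0]

slope, intercept = linear_trend(years, totals)
print(round(slope, 2))  # 10.0 (mm per year in this toy series)
```

Skipping missing years, instead of treating them as zero, avoids the artificial dips mentioned above from distorting the fitted slope.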

## Conclusion

As is often the case with big data, the data was not perfect, but it gave good results on closer inspection. The temperature data for this particular region was probably corrupted and thus was not analyzed further. Also, the eigenvectors were not able to explain much of the variance for the temperature measurements.

Precipitation data had high variance, as shown by the residual plots. The variance-explained graph was almost linear, which means that we would need nearly all of the 365 dimensions to represent the data. Also, precipitation was not constant throughout the year: there was heavy rainfall during the summer months and lighter rainfall during the rest of the year.

Snow depth was probably the only measurement well captured by the eigenvectors, which was also visible in the reconstructions. The reconstruction from the mean and the top eigenvectors was close to the target.

Finally, we proposed a null hypothesis stating that rainfall at any two stations is independent, which was rejected on the basis of a significance graph.
We then showed, using the matrix of p-values, that about 10-12 stations are highly correlated. When we plot these stations on the map, we find that they are not spatially close to each other; thus, the correlation between them is not spatial but rather temporal.