# Massachusetts Weather Analysis

This report analyzes the weather patterns of a region overlapping Minnesota (Massachusetts).

The data used here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN:** The daily minimum temperature (tenths of degrees C).
* **TMAX:** The daily maximum temperature (tenths of degrees C).
* **TOBS:** The temperature at the time of observation (tenths of degrees C).
* **PRCP:** The daily precipitation (in mm).
* **SNOW:** The daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow (in mm).
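
Because the three temperature measurements are stored in tenths of degrees C, they need rescaling before they can be compared with outside sources. Below is a minimal sketch of that conversion. The function names and the use of `None` for a missing reading are assumptions made for illustration, not part of the original analysis.

```python
def tenths_to_celsius(raw):
    """Convert a reading stored in tenths of degrees C to degrees C.

    A missing reading (None, an assumed convention here) stays None.
    """
    if raw is None:
        return None
    return raw / 10.0


def celsius_to_fahrenheit(celsius):
    """Convert degrees C to degrees F, for sources that report Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


# Hypothetical raw TMIN readings in tenths of degrees C.
raw_tmin = [-156, 0, None, 211]
tmin_celsius = [tenths_to_celsius(v) for v in raw_tmin]
```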

## Sanity-check: comparison with outside sources

<p>We sanity-checked our data against <a href="http://www.usclimatedata.com/climate/international-falls/minnesota/united-states/usmn0376" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_International_Falls_-_Minnesota_and_Weather_averages_International_Falls.jpg" src="images/Fig1(Temp+prep).jpg" style="width: 400px;"/></p>


<p>&nbsp;</p>

<p>As can be seen below, the daily minimum and maximum temperatures approximately match those obtained from our data.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="images/Fig2(Temp).jpg" style="height:300px; width:800px" /></p>

<p>The precipitation figures also match approximately, as can be seen in the figure below:</p>

<p>&nbsp;<img alt="PRCP.png" src="Images/fig3(prep).jpg" style="width:400px" /></p>


## PCA analysis
Next, we computed the percentage of variance explained by the top 5 eigen-values for each type of measurement:

![Fig5_variance_explained1.jpg](images/Fig5_variance_explained1.jpg)
![Fig6_variance_explained2.jpg](images/Fig6_variance_explained2.jpg)
<i>Note that the dip in the TMIN curve should not be there: the more eigen-values are used, the more variance should be explained.</i>

The graphs show that SNWD is best explained by its top 5 eigen-values, while PRCP is least explained by its top 5 eigen-values. We therefore analyze these two measurements further.
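
The "variance explained" curves can be sketched as a cumulative sum of eigen-values divided by their total. The snippet below is an illustration under assumptions: the function name and the eigen-values are made up and do not come from the weather data. Since the eigen-values of a covariance matrix are non-negative, a curve computed this way cannot decrease.

```python
from itertools import accumulate


def percent_variance_explained(eigenvalues, top_k=5):
    """Cumulative percentage of variance explained by the top_k eigen-values."""
    ordered = sorted(eigenvalues, reverse=True)  # largest eigen-value first
    total = sum(ordered)
    cumulative = list(accumulate(ordered))
    return [100.0 * c / total for c in cumulative[:top_k]]


# Made-up eigen-values, for illustration only.
example_eigenvalues = [6.0, 2.0, 1.0, 0.5, 0.3, 0.2]
explained = percent_variance_explained(example_eigenvalues)
```

With these made-up numbers the first eigen-value alone accounts for 60% of the variance, and each additional eigen-value can only add to the total.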

## Snow Depth Analysis

We choose to analyze the eigen-decomposition for snow-depth because the first 4 eigen-vectors explain 80% of the variance.

First, we graph the mean and the top 3 eigen-vectors.

We observe that the snow season lasts from mid-November to the end of April, with the snow depth peaking in February-March.

<p>&nbsp;<img alt="SNWD.png" src="images/Fig7_snow-depth.jpg" style="width:500px" /></p>

### Possible eigen-function interpretations:
Although the first eigen-function (eig1) appears very similar to the mean function, it is close to zero during November-December while the mean is not. The interpretation of this shape is that eig1 represents the overall amount of snow above/below the mean, without changing the distribution over time.

**eig2 and eig3** are similar in the following way:
Both oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months, but they do not change the total (much). This is consistent with the 'variance explained' figure for 'SNWD', in which eig1 alone explains more than 60% of the variance in the data.

They can be interpreted as follows:
* **eig2:** less snow in Dec - mid Feb, more snow in mid Feb - March.
* **eig3:** less snow in Jan, more snow in Feb, slightly more snow in March.

In the figures below we analyze the extreme values of the coefficients of the top three eigen-vectors when performing reconstructions:

### Examples of reconstructions

![Fig9_coeff1-neg.jpg](images/Fig9_coeff1-neg.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)
![Fig10_coeff1-pos.jpg](images/Fig10_coeff1-pos.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)

Large positive values of coeff1 correspond to more than average snow. Low values correspond to less than average snow.

![Fig11_coeff2-neg.jpg](images/Fig11_coeff2-neg.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)
![Fig12_coeff2-pos.jpg](images/Fig12_coeff2-pos.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)

Large positive values of coeff2 also correspond to the amount of snow that is not explained by coeff_1.


![Fig13_coeff3-neg.jpg](images/Fig13_coeff3-neg.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)
![Fig14_coeff3-pos.jpg](images/Fig14_coeff3-pos.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)

Large positive values of coeff3 correspond to a snow season which has a peak in February. Negative values of coeff3 correspond to a snow season which has a peak in January.
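
The coefficients and reconstructions discussed in this section can be sketched as follows: each coefficient is the projection of the mean-subtracted signal onto one eigen-vector, and a reconstruction adds the weighted eigen-vectors back to the mean. This is a toy illustration, assuming orthonormal eigen-vectors and no missing values. The function names and the 4-point vectors are hypothetical stand-ins for the yearly snow-depth vectors.

```python
def project(signal, mean, eigvecs):
    """Coefficients of (signal - mean) along each orthonormal eigen-vector."""
    centered = [s - m for s, m in zip(signal, mean)]
    return [sum(c * v for c, v in zip(centered, vec)) for vec in eigvecs]


def reconstruct(mean, eigvecs, coeffs):
    """Mean plus the eigen-vectors weighted by their coefficients."""
    out = list(mean)
    for coeff, vec in zip(coeffs, eigvecs):
        out = [o + coeff * v for o, v in zip(out, vec)]
    return out


# Toy data: a 4-point "season" with two orthonormal eigen-vectors.
mean = [1.0, 2.0, 2.0, 1.0]
eigvecs = [
    [0.5, 0.5, 0.5, 0.5],    # raises or lowers the whole season
    [0.5, 0.5, -0.5, -0.5],  # shifts snow between early and late season
]
signal = [3.0, 4.0, 2.0, 1.0]

coeffs = project(signal, mean, eigvecs)
approximation = reconstruct(mean, eigvecs, coeffs)
```

In this toy case a positive first coefficient lifts the whole curve above the mean, while the second coefficient moves snow from the late part of the season to the early part.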



### Geographical distribution of first 4 coefficients
The figure below shows the stations with their corresponding coefficients (shown as triangles).

<img src="images/Fig40_map1.jpg" style="width: 800px;"/>

Where the notation is:
1. coeff_1 - (Blue Color)
2. coeff_2 - (Orange Color)
3. coeff_3 - (Green Color)
4. coeff_4 - (Red Color)
5. Size of triangle corresponds to the magnitude of the coefficient value.
6. Filled triangles correspond to negative values.

#### Zoomed in image of one random station near Kenora:
<img src="images/Fig39_map2.jpg" style="width: 200px;"/>

It is difficult to draw meaningful inferences from the maps above alone. So, we use the correlation matrix to find similarities between different stations.

Correlation matrix for 56 stations for SNWD measurements:
<img src="images/Fig46correl_1_snwd.jpg" alt="Fig46correl_1_snwd.jpg" style="width: 400px;"/>

Correlation matrix after reordering on the basis of most common dimension (characteristic):
<img src="images/Fig47correl_2_snwd.jpg" alt="Fig47correl_2_snwd.jpg" style="width: 400px;"/>
<i>Darker values in the matrix represent smaller p-values, which means stronger evidence against the null-hypothesis.</i>
<p>&nbsp;</p>
Only stations 29 to 56 appear to differ from one another, while most of the other pairs of stations appear to share common features (darker shades signify more similarity).


## The variation in the snow depth is mostly due to year-to-year variation
We now estimate the importance of location-to-location variation relative to year-by-year variation.

These are measured using the fraction by which the variance is reduced when we subtract from each station/year entry the average-per-year or the average-per-station respectively. Here are the results:

** coeff_1 **
total RMS                   =  1607.08350635
RMS removing mean-by-station=  1475.61439358, fraction explained= 8.18060245454
RMS removing mean-by-year   =  852.75670797, fraction explained= 46.9376230543 

 ** coeff_2 **
total RMS                   =  750.23614434
RMS removing mean-by-station=  732.635844818, fraction explained= 2.34596795361
RMS removing mean-by-year   =  396.294511574, fraction explained= 47.177363479 

** coeff_3 **
total RMS                   =  710.360614167
RMS removing mean-by-station=  672.647841391, fraction explained= 5.30896167717
RMS removing mean-by-year   =  457.688315369, fraction explained= 35.5695816686  

We see that the variation by year explains more than the variation by station: for coeff_1, coeff_2 and coeff_3 the station explains 2-9% of the variance, while the year explains 35-50%.
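
The comparison above can be sketched in a few lines. The snippet assumes each entry is a hypothetical `(station, year, coefficient)` tuple and that "fraction explained" is the percentage reduction in RMS, which is consistent with the printed numbers (for example, 100 * (1607.08 - 852.76) / 1607.08 is about 46.94). The toy values are made up, and the function names are not taken from the original code.

```python
import math
from collections import defaultdict


def rms(values):
    """Root mean square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_removing_group_mean(entries, key_index):
    """RMS left after subtracting each group's mean from its entries.

    key_index 0 groups by station, key_index 1 groups by year.
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[key_index]].append(entry[2])
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return rms([entry[2] - means[entry[key_index]] for entry in entries])


def percent_explained(total, reduced):
    """Percentage by which the RMS drops once a group mean is removed."""
    return 100.0 * (total - reduced) / total


# Made-up coefficients: two stations, two years, dominated by the year.
entries = [
    ("A", 2000, 10.0), ("B", 2000, 12.0),
    ("A", 2001, -10.0), ("B", 2001, -12.0),
]
total_rms = rms([entry[2] for entry in entries])
rms_by_station = rms_removing_group_mean(entries, 0)
rms_by_year = rms_removing_group_mean(entries, 1)
pct_station = percent_explained(total_rms, rms_by_station)
pct_year = percent_explained(total_rms, rms_by_year)
```

In this toy table the year accounts for almost all of the variation and the station for none, an exaggerated version of the pattern reported above.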

## PRCP Analysis

We choose to analyze the eigen-decomposition for PRCP because the first 4 eigen-vectors explain just 6% of the variance. Hence, we want to show that it is very difficult to find meaningful patterns for PRCP measurements using eigen-decomposition.
<p>&nbsp;<img alt="Fig32_prcp_3.jpg" src="images/Fig32_prcp_3.jpg" style="width:400px" /></p>

Next, we graph the mean and the top 3 eigen-vectors.

We observe that, unlike the SNWD measurements, the PRCP measurements appear to be very noisy, and hence all the eigen-vectors keep oscillating up and down.

<p>&nbsp;<img alt="Fig31_prcp_2.jpg" src="images/Fig31_prcp_2.jpg" style="width:400px" /></p>

### Examples of reconstructions

#### Best reconstruction
<p>&nbsp;<img alt="Fig33_prcp_4.jpg" src="images/Fig33_prcp_4.jpg" style="width:400px" /></p>


#### Plotting reconstructions with
<i>Graphs when mean explains the most variance:</i>
![Fig34_prcp_5.jpg](images/Fig34_prcp_5.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)

<i>Graphs when mean explains the least variance:</i>
![Fig35_prcp_6.jpg](images/Fig35_prcp_6.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)

<i>Graphs when the mean and first eigen-value explain the most variance:</i>
![Fig36_prcp_7.jpg](images/Fig36_prcp_7.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)

<i>Graphs when the mean and first eigen-value explain the least variance:</i>
![Fig37_prcp_8.jpg](images/Fig37_prcp_8.jpg)
![Fig45month_labels.jpg](images/Fig45month_labels.jpg)

Even the lowest residual remaining after accounting for the variance explained by the mean and the first eigen-value is more than 80 percent. The graph below supports our claim that it is nearly impossible to make sense of PRCP measurements using a small number of dimensions.

<p>&nbsp;<img alt="Fig38_prcp_9.jpg" src="images/Fig38_prcp_9.jpg" style="width:400px" /></p>


### Plotting correlation matrix

Correlation matrix for 92 stations for PRCP measurements:
<img src="images/Fig48correl_1_prcp.jpg" alt="Fig48correl_1_prcp.jpg" style="width: 400px;"/>

Correlation matrix after reordering on the basis of the most similar characteristic:
<img src="images/Fig49correl_2_prcp.jpg" alt="Fig49correl_2_prcp.jpg" style="width: 400px;"/>

Again, the light shades of the matrix favor the null-hypothesis, verifying our initial claim that it is nearly impossible to find patterns in PRCP data using a small number of eigen-values.

## A different kind of Analysis 1
### Is the data skewed?
<img src="images/Fig51_extra_counts.jpg" alt="Fig51_extra_counts.jpg"/>

We can see that for all types of observations the number of valid observations decreases near the start and end of a year.
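
A count like the one plotted above can be sketched as follows, assuming each station-year is stored as an equal-length vector with NaN marking a missing day. That layout, the function name and the toy records are assumptions made for illustration.

```python
import math


def valid_counts_per_day(records):
    """For each day index, count the records that hold a non-NaN value."""
    n_days = len(records[0])
    return [
        sum(1 for record in records if not math.isnan(record[day]))
        for day in range(n_days)
    ]


nan = float("nan")
# Toy records of 4 "days" each; real vectors would cover a full year.
records = [
    [nan, 1.0, 2.0, nan],
    [0.5, 1.5, 2.5, nan],
    [nan, 1.0, nan, nan],
]
counts = valid_counts_per_day(records)
```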
<p>&nbsp;</p>
Below is a snapshot of a small region of Massachusetts:
<img src="images/Fig50_sample_map_extra.jpg" alt="Fig50_sample_map_extra.jpg" style="width: 800px;"/>

This part of the United States of America is known to be very cold and snowy during the winter season. Below is an image taken from [Wikipedia](https://en.wikipedia.org/wiki/Climate_of_Massachusetts) outlining the monthly normal minimum and maximum temperatures in some parts of Massachusetts: <i>(Note the temperature values in Dec, Jan and Feb).</i>
<img src="images/Fig52_extra_mass_temp.jpg" alt="Fig52_extra_mass_temp.jpg" style="width: 700px;"/>

## Possible reasons for the dip in the number of observations during the winter season
1. The extreme climate during the winter season makes it difficult to keep on making observations with the same frequency as that of non-winter seasons.
2. A subset of stations lie in extreme climate regions which are closed during winter seasons.
3. Since the data was recorded for many years, it is possible that for a subset of years it was not possible to make observations for some of the stations during winter season.


## A different kind of Analysis 2
### Is this Global Warming?
Below are some plots for stations which had at least 20 years' worth of data for both 'Snow' and 'TMIN' measurements.
<img src="images/Fig53_gw_1.jpg" alt="Fig53_gw_1.jpg" style="width: 700px;height:300px;"/>
<img src="images/Fig54_gw_2.jpg" alt="Fig54_gw_2.jpg" style="width: 700px;height:300px;"/>
<img src="images/Fig55_gw_3.jpg" alt="Fig55_gw_3.jpg" style="width: 700px;height:300px;"/>

The above graphs show that even though there are many fluctuations (changes in weather) in the plots, the overall effect is an increase in TMIN (or a small decrease in snow), which can be attributed to climate change.

Note: We have only shown graphs for a subset of regions that showed increasing minimum temperatures or decreasing snow (or both). There were stations where the minimum temperatures and snow did not change much with time, but we did not find any evidence of <b>global cooling</b>.
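
One way to put a number on such a trend is the least-squares slope of the yearly values against the year. The sketch below is not necessarily how the plots above were produced: the function name and the yearly mean TMIN values are made up for illustration.

```python
def linear_trend(years, values):
    """Least-squares slope and intercept of values against years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up yearly mean TMIN values (degrees C) for one hypothetical station.
years = [1990, 1991, 1992, 1993, 1994]
yearly_tmin = [-5.0, -4.8, -4.9, -4.5, -4.4]
slope, intercept = linear_trend(years, yearly_tmin)
```

A positive slope means the fitted line rises over the years, and a slope near zero corresponds to the stations where the values did not change much with time.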


## Conclusion
We analysed Massachusetts weather data around the Minnesota region, and we found that, out of all measurements, <i>snow-depth</i> made the most sense when analysed using eigen-decomposition. We used methods such as plotting reconstructions using eigen-vectors, the fraction of variance explained by eigen-values, and correlation matrices (using p-values) to analyze the data.
Later, we repeated this procedure for PRCP measurements and found that eigen-decomposition fails to find any meaningful patterns in this case.

In the end, we extended the analysis by raising questions about the skewness of the measurements themselves and about global warming trends at some stations.