# Weather data analysis for the state of Idaho

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of Idaho.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/boston/massachusetts/united-states/usma0046" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Climate_Idaho_Averages.jpg" src="Figures/Climate_US_Idaho.png" /></p>

<p>&nbsp;</p>

<p>We see that the min and max daily&nbsp;temperature agree with the ones we got from our data.</p>

<p>&nbsp;</p>

<p><img alt="TMIN,TMAX.png" src="Figures/TMIN_MAX.png" style="height:300px; width:800px" /></p>

<p>To compare the precipitation&nbsp;we need to convert the monthly totals in the outside graph (mm/month) to daily averages (mm/day) by dividing by roughly 30 days. In that graph, PRCP goes as high as 90 mm/month, which translates to about 3 mm/day, and as low as 10 mm/month, which translates to about 0.3 mm/day. Our data is broadly in line with this outside source, and there is nothing suspicious about it.</p>

<p>&nbsp;<img alt="PRCP.png" src="Figures/PRCP_single.png" style="height:450px; width:500px" /></p>


## Now let us analyze the distribution of our 6 major measurements throughout the year
In the graphs below, we plot the mean, the mean minus one standard deviation, and the mean plus one standard deviation of each measurement for every day of the year, aggregated across stations.
The 6 measurements are:

a.) TMIN: Minimum temperature recorded that day

b.) TMAX: Maximum temperature recorded that day

c.) TOBS: Observed temperature of that day

d.) PRCP: Precipitation of that day

e.) SNOW: Amount it snowed that day

f.) SNWD: Snow depth that day
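
The mean and one-standard-deviation band described above can be sketched as follows. This is a minimal illustration, assuming the data for one measurement is available as a NumPy array with one row per station-year and `NaN` for missing days. The function name `daily_mean_and_band` and the toy array are hypothetical and not part of the original analysis.

```python
import numpy as np

def daily_mean_and_band(vectors):
    """Return (mean, mean - std, mean + std) for each day of the year.

    `vectors` is assumed to have shape (n_rows, n_days), one row per
    station-year, with NaN wherever a measurement is missing.
    """
    vectors = np.asarray(vectors, dtype=float)
    mean = np.nanmean(vectors, axis=0)  # average over rows, skipping NaN
    std = np.nanstd(vectors, axis=0)    # spread over rows, skipping NaN
    return mean, mean - std, mean + std

# Made-up example with 3 "days" instead of 365.
demo = np.array([[1.0, 2.0, np.nan],
                 [3.0, 2.0, 4.0]])
mean, lower, upper = daily_mean_and_band(demo)
```

Plotting `mean`, `lower`, and `upper` against the day of the year gives one panel of the kind shown in the figures.
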
<p><img style="float: left ; height:400px; width:500px" alt="TMIN_image.png" src="Figures/TMIN.png" /></p>
<p><img style="float: right ; height:400px; width:500px" alt="TOBS_image.png" src="Figures/TOBS.png"/></p>
<p><img style="float: left ; height:400px; width:500px" alt="TMAX_image.png" src="Figures/TMAX.png"/></p>
<p><img style="float: right ; height:400px; width:500px" alt="SNOW_image.png" src="Figures/SNOW.png"/></p>
<p><img style="float: left ; height:400px; width:500px" alt="PRCP_image.png" src="Figures/PRCP.png"/></p>
<p><img style="float: right ; height:400px; width:500px" alt="SNWD_image.png" src="Figures/SNWD.png"/></p>
<div style="height: 1200px;"></div>
<p>Conclusion:</p>
<p>
a.) As we can see from the above figures, the temperature measurements (TMIN, TMAX, and TOBS) all follow a bell-shaped curve over the months. The trend is similar across all stations, and the spread around the mean is fairly small. Hence, knowing the time of the year gives a good amount of information about the temperature.
</p>
<p>
b.) PRCP, the precipitation, looks close to uniformly distributed over the year: the entries appear scattered at random and do not follow a clear upward or downward trend. Hence, knowing the time of the year yields very little information about the amount of precipitation.
</p>
<p>
c.) Finally, SNOW and SNWD show a very sensible trend: the amount of snowfall and the snow depth are fairly consistent across stations through the year, and it does not snow anywhere in the middle of the year. Hence, knowing the time of the year yields a lot of information about the amount of snow, and since the variance is quite high, we expect a good eigendecomposition of the SNWD vectors.
</p>

## Before moving to statistical analysis and decomposition of our vectors, let us analyze the distribution and trend of missing values for each measurement in this region
<p><img alt="TMIN,TMAX.png" src="Figures/Missing_1.png" style="height:300px; width:800px" /></p>
<p><img alt="TMIN,TMAX.png" src="Figures/Missing_2.png" style="height:300px; width:800px" /></p>
<p><img alt="TMIN,TMAX.png" src="Figures/Missing_3.png" style="height:300px; width:800px" /></p>
<p>Conclusion:</p>
<p>
As we can see, there are many more valid entries for the temperature measurements than for the snow-related measurements. A common trend is that all measurements have many more missing entries at the start and end of the year. Precipitation has no sporadic missing entries except for a day or two at the end of September, so we can proceed with our data analysis.
</p>
<p>&nbsp;</p>
<p>The table of measurement frequencies also agrees with the graphs above: the temperature measurements have the largest number of entries.</p>
<p><img alt="distribution_of_measurements.png" src="Figures/Distribution_of_measurements.png" style="height:300px; width:800px" /></p>
<p>&nbsp;</p>
Now let us see the distribution of missing entries per row in the table, where each row holds the observations for a particular station and a particular year. Each row is a 365-dimensional vector, and each entry is the measurement for one day.
<p><img alt="undefined_counts.png" src="Figures/undefined.png" style="height:400px; width:700px" /></p>
<p>&nbsp;</p>
As we can see, most of the rows in our table have very few missing entries (less than 10 days out of 365). Thus, there is a good amount of data available for analysis, for most rows, and our statistical analysis should not be strongly affected by the long-tail missing entries as shown in the figure above.
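
Counting missing entries per row can be sketched as below, assuming missing days are stored as `NaN` in an array with one row per station-year. The helper `missing_per_row` and the small array are hypothetical illustrations, not the code used to produce the figure.

```python
import numpy as np

def missing_per_row(vectors):
    """Number of missing (NaN) days in each station-year row."""
    vectors = np.asarray(vectors, dtype=float)
    return np.isnan(vectors).sum(axis=1)

# Made-up example with 4 "days" per row instead of 365.
demo = np.array([[1.0, np.nan, 3.0, np.nan],
                 [1.0, 2.0, 3.0, 4.0],
                 [np.nan, np.nan, np.nan, 0.0]])
counts = missing_per_row(demo)

# Share of rows with fewer than 2 missing days; for the real data the
# analogous threshold would be 10 days out of 365.
share_mostly_complete = np.mean(counts < 2)
```

A histogram of `counts` is the kind of plot shown in the figure above.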

## Analyzing variance of the measurements for the state of Idaho using PCA

Let us analyze how much of the variance in each measurement can be captured by a few principal components. For this purpose, we compute the eigendecomposition of the covariance matrix of each measurement. We then approximate our original data vectors with the top eigenvectors and find out how much of their variance is explained.
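
A minimal sketch of this computation is given below. It assumes one row per station-year and, as a simplification that may differ from the original pipeline, replaces missing days by the daily mean before forming the covariance matrix. The function name `explained_variance` and the synthetic data are hypothetical.

```python
import numpy as np

def explained_variance(vectors, k=5):
    """Cumulative fraction of variance explained by the top-k eigenvectors."""
    x = np.asarray(vectors, dtype=float)
    mean = np.nanmean(x, axis=0)
    # Assumption: a missing day contributes zero deviation from the mean.
    centered = np.where(np.isnan(x), 0.0, x - mean)
    cov = centered.T @ centered / x.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest first
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    return cumulative[:k], eigvecs[:, :k]

# Synthetic data: one dominant smooth pattern plus small noise.
rng = np.random.default_rng(0)
pattern = np.sin(np.pi * np.linspace(0.0, 1.0, 20))
data = 5.0 * rng.normal(size=(200, 1)) * pattern + 0.1 * rng.normal(size=(200, 20))
data[0, 3] = np.nan  # one missing entry
cumulative, top_vectors = explained_variance(data, k=5)
```

For data like this, nearly all of the variance sits in the first eigenvector, which is the situation described for SNWD and TOBS in the conclusion of this section.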

<p><img alt="variance_1.png" src="Figures/variance_1.png" style="height:300px; width:1000px" /></p>
<p><img alt="variance_1.png" src="Figures/variance_2.png" style="height:300px; width:1000px" /></p>
<p>Conclusion:</p>
<p>
As we can see in the above 6 figures, SNWD is very well explained by the top 5 eigenvectors (almost 90% of the variance), whereas for TOBS, 50% of the variance is explained by the first eigenvector and the contribution of subsequent eigenvectors then fades out. On the other hand, for TMIN, TMAX, SNOW, and PRCP, the eigenvectors do a poor job of explaining the variance in the data.

From this we can conclude that, for SNWD and TOBS, the variance is concentrated in a few directions that PCA is able to recover, whereas for the rest of the measurements the variance is spread over many directions. Hence, for the rest of the measurements the approximation is not as good as for SNWD and TOBS.

</p>

## Analyzing Snow Depth with PCA
### Analyzing the eigenvectors of snow depth data
As we saw in the figure above, almost 90% of the variation in SNWD (snow depth) is explained by the top 5 eigenvectors. This makes snow depth a very suitable choice for further spectral analysis. Moreover, since snow depth accumulates snowfall and grows and decays much more smoothly, we should be able to visualize its trend effectively.

First let us plot the mean vector for snow-depth. This will show us the overall trend of snow-depth in the state of Idaho.
<p><img alt="snowdepth_mean.png" src="Figures/snwd_mean.png" style="height:300px; width:900px" /></p>
<p>
This shows the general trend of snow in the state of Idaho: snow depth is high at the beginning of the year (January, February, March), starts decaying from March, and is negligible from April to November. It starts building up again around December.
</p>
<p>
Now, let us analyze what the eigenvectors look like for the SNWD measurement. Basically, we will try to figure out what trend each eigenvector (principal axis) is trying to capture.</p>
<p><img alt="snowdepth_mean.png" src="Figures/snwd_eigen3.png" style="height:300px; width:900px" /></p>
<p>
Thus, the eigenvectors (principal directions) represent different kinds of trends in snow depth and oscillate between positive and negative values.
Eigenvectors 1, 2, and 3 can be interpreted as follows:</p>

- Eigenvector 1: The first principal component shows the exact opposite trend of the mean snow depth: it has negative values for the months when mean snow depth is high, and positive values when mean snow depth is low. Thus, it signifies a low snow depth trend.
- Eigenvector 2: The second principal component signifies low snow at the beginning of the year and high snow at the end of the year.
- Eigenvector 3: The third principal component signifies low snow at the very beginning of the year (Jan, Feb), high snow in March, April, and May, and then a small amount of snow at the end of the year.

### Analyzing the residuals and coefficients after projecting the snow depth data to eigenspace
Here, we consider the data points that have a low residual error (less than 20%) after approximation and very low (highly negative) values of coefficient_1, the coefficient of the first eigenvector.
<p><img alt="snowdepth_mean.png" src="Figures/coeff_1_res_1.png" style="height:500px; width:1000px" /></p>
As we saw in the diagram of eigenvectors, eigenvector 1 captures the trend that is the exact opposite of the trend the mean captures. Here, we are plotting points that have highly negative coefficients for eigenvector 1. Hence, these data points end up looking a lot like the mean vector, with high values of snow depth.
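
The coefficients and residual error used above can be sketched as follows. This assumes the eigenvectors are stored as orthonormal columns and defines the residual as the fraction of the centered vector's squared norm that the projection leaves unexplained; that definition, the function name `project`, and the toy basis are assumptions made for illustration.

```python
import numpy as np

def project(vector, mean, eigvecs):
    """Coefficients and residual fraction of one vector in the top-k eigenbasis.

    `eigvecs` is assumed to have shape (n_days, k) with orthonormal columns.
    The centered vector must be non-zero, otherwise the ratio is undefined.
    """
    centered = np.asarray(vector, dtype=float) - mean
    coeffs = eigvecs.T @ centered        # one coefficient per eigenvector
    reconstruction = eigvecs @ coeffs    # approximation in the eigenspace
    residual = np.sum((centered - reconstruction) ** 2) / np.sum(centered ** 2)
    return coeffs, residual

# Toy example in 3 dimensions with the first two axes as "eigenvectors".
basis = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
zero_mean = np.zeros(3)
coeffs_a, residual_a = project([3.0, -4.0, 0.0], zero_mean, basis)
coeffs_b, residual_b = project([3.0, 0.0, 4.0], zero_mean, basis)
```

Rows like the first one (small residual, strongly negative first coefficient) are the kind selected for the plot above.
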
<p>&nbsp;</p>
<p>Let us analyze the distribution of coefficient 1 to see if there are a lot of negative values.</p>
<p><img alt="snowdepth_mean.png" src="Figures/coeff_1_distribution.png" style="height:400px; width:600px;" /></p>
<p>And we indeed see that a significant majority (about 70%) of the data points have a negative value of coefficient 1, which explains the relation between the mean vector and eigenvector 1.</p>
<p>&nbsp;</p>
<p>Now let us analyze the data points that have a residual error of less than 20% and very high values of coefficient 2.</p>
<p><img alt="snowdepth_mean.png" src="Figures/coeff_2_res_1.png" style="height:500px; width:1000px" /></p>
As we saw in the diagram of eigenvectors, eigenvector 2 captures the trend of having high snow depth towards the end of the year. The plot above agrees with this interpretation, since all these data points have high snow depth at the end of the year.
Thus, eigenvector 1 captures the amount of snow depth, whereas eigenvector 2 captures the temporal trend of snow depth.

### Analyzing the coefficients of snow depth data and finding if the variation is temporal or spatial or both
Here, we analyze the root mean square (RMS) of coefficients 1, 2, and 3 of the snow depth vectors. We then remove the mean along each of two axes, space and time (i.e. the mean by station and the mean by year), and recompute the RMS. The fraction of the RMS removed in each case tells us whether the data varies more spatially or temporally.
- coeff_1:
  - total RMS = 2765.116
  - RMS removing mean-by-station = 1528.810, fraction explained = 44.71%
  - RMS removing mean-by-year = 2144.767, fraction explained = 22.43%

- coeff_2:
  - total RMS = 848.699
  - RMS removing mean-by-station = 828.268, fraction explained = 2.4%
  - RMS removing mean-by-year = 540.693, fraction explained = 36.29%

- coeff_3:
  - total RMS = 784.218
  - RMS removing mean-by-station = 670.921, fraction explained = 14.4%
  - RMS removing mean-by-year = 588.677, fraction explained = 24.9%
 
Thus, we can see that coefficient 1, which belongs to eigenvector 1, varies more by station than by year. As we saw in the plot, eigenvector 1 carries information about the quantity of snow and follows the same trend as the mean vector, but with the opposite sign.

Coefficient 2, which belongs to eigenvector 2, varies more by year than by station. This matches the plot of eigenvector 2, which captured the timing of snow depth within the year.

For coefficient 3, it follows the same trend as coefficient 2, but the variation weakens.

Thus, we can conclude that there is a decent amount of variation along both the year and station axes.
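
The RMS comparison in this section can be sketched as below. The listed percentages are consistent with computing the fraction explained as one minus the ratio of the RMS after removing the group mean to the total RMS, and that is the formula used here. The function names and the four-row example are hypothetical.

```python
import numpy as np

def rms(values):
    """Root mean square of a 1-D array."""
    return float(np.sqrt(np.mean(np.square(values))))

def fraction_explained(values, groups):
    """Fraction of the RMS removed by subtracting each group's mean.

    `groups` holds one label per value, e.g. the station or the year.
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    residual = values.copy()
    for label in np.unique(groups):
        mask = groups == label
        residual[mask] -= values[mask].mean()
    return 1.0 - rms(residual) / rms(values)

# Made-up coefficient that depends only on the station, not on the year.
coeff = [10.0, 10.0, -10.0, -10.0]
stations = ["A", "A", "B", "B"]
years = [2000, 2001, 2000, 2001]
by_station = fraction_explained(coeff, stations)
by_year = fraction_explained(coeff, years)
```

In this toy case removing the station mean explains everything and removing the year mean explains nothing, which is the extreme version of the pattern seen for coeff_1.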

## Analyzing rainfall/PRCP with statistical testing
As we saw in our plots of explained variance, the eigenvectors of the PRCP covariance matrix leave a large residual. So we are going to approach the analysis of PRCP in a different manner.
First let's see how often it rains in the state of Idaho on average.
<p><img alt="snowdepth_mean.png" src="Figures/rainfall_cdf.png" style="height:300px; width:500px" /></p>
<p>Hence, we can see that it does not rain at all on 70% of the days, so most rainfall entries are zero. We will therefore study the correlation between stations in terms of whether it rained or not, rather than the actual amount of rainfall. We want to test the hypothesis that rainfall is not independent of location.
Thus, we take our null hypothesis to be: rainfall is independent of location, i.e. whether it rains at one station tells us nothing about whether it rains at another.
</p>
<p>&nbsp;</p>
Fix two stations. We restrict our attention to the days for which we have measurements for both stations, and define the following notation:
* $m$ : the total number of days (for which we have measurements for both stations).
* $n_1$ : the number of days on which it rained at station 1.
* $n_2$ : the number of days on which it rained at station 2.
* $l$ : the number of days on which it rained at both stations.

We want to calculate the probability that the number of overlap days is $l$ given $m,n_1,n_2$.

The answer is:
$$
P = \frac{{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l}}{{m \choose n_1}{m \choose n_2}}
$$

Where
$$
{m \choose l,n_1-l,n_2-l,m-n_1-n_2+l} = \frac{m!}{l! (n_1-l)! (n_2-l)! (m-n_1-n_2+l)!}
$$

We use the fact that $\Gamma(n+1) = n!$ and denote $G(n) \doteq \log \Gamma(n+1)$:
$$
\log P = \left[G(m) - G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) \right] - 
\left[G(m)-G(n_1)-G(m-n_1)\right] - \left[G(m)-G(n_2)-G(m-n_2)\right]
$$
Which slightly simplifies to 
$$
\log P = -G(l) -G(n_1-l) -G(n_2-l) -G(m-n_1-n_2+l) - G(m)+G(n_1)+G(m-n_1) +G(n_2)+G(m-n_2)
$$

The log probability scales with $m$, the length of the overlap. So to get a per-day significance we consider $\frac{1}{m} \log P$.
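
The simplified expression for $\log P$ can be evaluated directly with the standard library's log-gamma function. The sketch below is a direct transcription of that formula; the function names are hypothetical, and the arguments are assumed to satisfy $\max(0, n_1+n_2-m) \le l \le \min(n_1, n_2)$.

```python
import math

def G(n):
    """G(n) = log Gamma(n + 1) = log(n!)."""
    return math.lgamma(n + 1)

def log_prob_overlap(m, n1, n2, l):
    """log P of exactly l overlapping rainy days, given m, n1 and n2.

    Arguments outside the valid range of l make lgamma raise ValueError.
    """
    return (-G(l) - G(n1 - l) - G(n2 - l) - G(m - n1 - n2 + l)
            - G(m) + G(n1) + G(m - n1) + G(n2) + G(m - n2))

def per_day_log_prob(m, n1, n2, l):
    """Normalize by the number of shared days to get a per-day score."""
    return log_prob_overlap(m, n1, n2, l) / m

# Small check: with m=4, n1=2, n2=2 the possible overlaps are l=0, 1, 2.
probs = [math.exp(log_prob_overlap(4, 2, 2, l)) for l in (0, 1, 2)]
```

Working in log space avoids overflow in the factorials when $m$ is in the thousands of days.
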
<p>Next, we will use this quantity as our p-value to decide whether we can reject the null hypothesis.</p>
<p>Let us see the distribution of p-value when computed for pairs of stations.</p>
<p><img alt="snowdepth_mean.png" src="Figures/pvalue_hist.png" style="height:300px; width:500px" /></p>
<p>Here, we can see that quite a lot of station pairs have a low p-value (less than 0.1), so for these pairs we can reject our null hypothesis that rainfall is completely independent of the location/station. Now, let us try to visualize these p-values in a square correlation matrix.</p>
<p><img alt="snowdepth_mean.png" src="Figures/pvalue_matrix.png" style="height:500px; width:500px" /></p>
<p>We are not able to see strong correlation in this matrix, so we will perform PCA on the p-value matrix and construct block diagonal plots to see the correlations clearly.</p>
<p><img alt="snowdepth_mean.png" src="Figures/pvalue_diagonal.png" style="height:800px; width:800px" /></p>
<p>In the block diagonal matrices, especially the top 2, we can see strong correlation in the bottom right and top left parts, respectively. Now the clustering of stations makes more sense. For example, from the top-left figure we can say that stations 60-75 are strongly correlated. Similarly, from the top-right figure we can say that the rainfall at stations 0-10 is strongly correlated.</p>
<p>&nbsp;</p>
Now let us analyze the cluster of points located in the bottom right of the top-left plot. For our analysis, we will pick the last 10 stations and plot them on the map. Then we will see if they have any other factor in common that explains the correlation of PRCP between them.
<p><img alt="snowdepth_mean.png" src="Figures/corr_map2.png" style="height:400px; width:1000px" /></p>
<p>Thus, we can see that there is slight spatial clustering of points here: the stations belong to the same region. Now let us analyze the elevation of these stations. Below are box plots of the elevation of the stations drawn on the map versus all stations. We can clearly see that the elevations of the stations on the map are very close to each other (low standard deviation), as opposed to the elevations of all stations (high standard deviation).</p>
<p><img alt="snowdepth_mean.png" src="Figures/boxplot_1.png" style="height:400px; width:600px" /></p>
To check whether the two distributions really differ, i.e. whether the points on our map have a particular set of elevation values, we perform a Kruskal-Wallis test of the null hypothesis that both distributions are equivalent, with a significance threshold of 0.1.
<p>
The p-value of this test turns out to be 0.1, which sits right at our threshold. We therefore treat the difference between the medians as borderline significant, and conclude that similarity in elevation is a plausible explanation for the correlation of rainfall between these stations.</p>
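
For reference, a two-group Kruskal-Wallis test can be sketched with the standard library alone. This is an illustrative implementation under the usual chi-square approximation with one degree of freedom, not the routine used to obtain the p-value above, and the elevation values are made up.

```python
import math

def kruskal_wallis_two_groups(a, b):
    """Kruskal-Wallis H statistic and approximate p-value for two samples.

    Uses average ranks for ties, the standard tie correction, and the
    chi-square approximation with 1 degree of freedom. Assumes the pooled
    values are not all identical.
    """
    pooled = list(a) + list(b)
    n = len(pooled)
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = (12 / (n * (n + 1))
         * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b))
         - 3 * (n + 1))
    h /= 1 - tie_term / (n ** 3 - n)   # tie correction
    h = max(h, 0.0)                    # guard against rounding below zero
    p_value = math.erfc(math.sqrt(h / 2))  # chi-square survival, 1 dof
    return h, p_value

# Made-up elevations (metres): a tight cluster versus a wide spread.
cluster = [1500, 1520, 1510, 1530, 1505]
everything = [600, 900, 1200, 1800, 2100, 2400, 700, 1000]
h_stat, p_value = kruskal_wallis_two_groups(cluster, everything)
```

The null hypothesis is rejected only when `p_value` falls below the chosen threshold, so a value equal to the threshold is a borderline case.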

## At last, an effort to convince President Donald J. Trump of the existence of Global Warming, at least in the state of Idaho (where he won)
We analyze the TOBS measurement because we consider it more reliable than TMIN and TMAX, since it is comparatively more stable than the minimum and maximum temperature. For this purpose, we first take the maximum TOBS value of each data point (one station-year), then group the entries by year and average across all stations for each year. The plot below is created from this data.
<p><img alt="snowdepth_mean.png" src="Figures/tobs_globa_warming.png" style="height:400px; width:600px" /></p>
<p>
Conclusion:&nbsp;
As we can see in the figure above, there is a clear upward trend in the TOBS measurement in the state of Idaho over a period of 100 years, which supports our assumption of global warming.
</p>
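
The aggregation behind this plot can be sketched as follows, assuming one row of daily TOBS values per station-year with `NaN` for missing days. The least-squares slope at the end is an addition for illustration and is not part of the original analysis; the function name and the numbers are hypothetical.

```python
import numpy as np

def yearly_mean_of_max(years, vectors):
    """Average, over stations, of each row's annual maximum, per year."""
    years = np.asarray(years)
    row_max = np.nanmax(np.asarray(vectors, dtype=float), axis=1)
    unique_years = np.unique(years)
    means = np.array([row_max[years == y].mean() for y in unique_years])
    return unique_years, means

# Made-up data: two stations per year, 3 "days" per row.
years = [2000, 2000, 2001, 2001, 2002, 2002]
vectors = [[1.0, 5.0, np.nan],
           [2.0, 7.0, 3.0],
           [4.0, 8.0, 1.0],
           [np.nan, 10.0, 2.0],
           [11.0, 3.0, 2.0],
           [13.0, 1.0, np.nan]]
years_out, means = yearly_mean_of_max(years, vectors)

# Assumption: summarize the trend with a straight-line fit (units per year).
slope, intercept = np.polyfit(years_out, means, 1)
```

A positive `slope` corresponds to the upward trend visible in the figure.
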
May the force be with the state of Idaho!