# Alabama Weather Analysis

This is a report on the historical analysis of weather patterns in an area that approximately overlaps the area of the state of Alabama.

The data we use comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on six measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation, which we treat as a rough proxy for the daily average temperature.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.



## Distribution of readings across the years
<p> As a first step, we plot a histogram of the number of readings per year to see how many of the observations are recent. </p>
<p><img alt="readings.jpg" src="s_figures/readings.png"></p>
We see that there are relatively few observations before 1900 and that most of the observations were made after 1940.
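
The count behind such a histogram can be sketched as follows. This is a minimal illustration, not the code used for the figure: the `records` list, its station names, and its tuple layout are all hypothetical stand-ins for the real data.

```python
from collections import Counter

# Hypothetical records: one (station, year, measurement) tuple per yearly series.
# Station names and years are made up for illustration.
records = [
    ("STATION_A", 1895, "TMAX"),
    ("STATION_A", 1950, "TMAX"),
    ("STATION_A", 1950, "PRCP"),
    ("STATION_B", 1950, "TOBS"),
    ("STATION_B", 1987, "TMIN"),
    ("STATION_C", 1987, "PRCP"),
]


def readings_per_year(rows):
    """Count how many records fall in each year."""
    return Counter(year for _station, year, _measurement in rows)


counts = readings_per_year(records)

# Print a crude text histogram, one row per year in chronological order.
for year in sorted(counts):
    print(year, "#" * counts[year])
```

With the real data, the same counts could be passed to a plotting library to draw the bar chart.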

## Geolocation
<p> Next we plot the geolocation of each station where the readings were taken, which lets us check our observations against the ground truth. </p>
<p>We see that our observations come from locations in the state of Alabama.<img alt="gmap.jpg" src="s_figures/gmap.png"></p>
<p>The diagram below shows the terrain, to check for hilly regions that could affect the climate. The terrain is mostly flat (with an elevation of at most 130m, as seen in the later sections).<img alt="gmap.jpg" src="s_figures/gmap_terrain.png"></p>


## Variation of parameters
<p> We next see how the mean of each parameter (**TMIN**, **TMAX**, **TOBS**, **SNWD**, **SNOW**, **PRCP**) varies across the years. </p>
<p><img alt="mean_across_years.jpg" src="s_figures/mean_across_years.png"></p>
### PRCP
* Most of the peaks in precipitation can be explained by tornado events, which are common in the state of Alabama, as described in the [Wikipedia article on the climate of Alabama](https://en.wikipedia.org/wiki/Climate_of_Alabama).
* For example, the peak at 1974 can be explained by the [1974 Super Outbreak](https://en.wikipedia.org/wiki/1974_Super_Outbreak), and the peak at 2011 by the more recent [2011 Super Outbreak](https://en.wikipedia.org/wiki/2011_Super_Outbreak).
* We see that the precipitation rate stayed almost the same through the years. This agrees with the Wikipedia article on the climate of Alabama (https://en.wikipedia.org/wiki/Climate_of_Alabama).

### TMIN,TMAX,TOBS
* TMIN and TOBS have decreased from the 1940s to the present.
* TMAX has remained more or less the same.

### SNWD,SNOW
* There were periods of high snow before 1900, but the data on snow depth for those years is missing.
* We also see that snow in the region is occasional (not continuous). The peaks of SNOW have also decreased over time (Proof of global warming? Probably not).

## Variation of parameters within a year
<p> The following is an analysis of how the parameters vary within a year (across seasons). </p>
<p> Below is the variation of **TMAX**, **TMIN**, **TOBS** and **avg(TMAX,TMIN)** over the course of the year. </p><p><img alt="vary_temp.png" src="s_figures/vary_temp.png"></p>

### Observations

* We see that the average of TMAX and TMIN lies roughly along TOBS. This agrees with the intuition that TOBS can, at a coarse scale, be regarded as the average temperature of the day, even though it is actually a single reading taken at some point in the day.
* We also see that the temperature peaks sometime after July and stays high until September, which is likely summer for the region (with Alabama (32.5906,-86.2103) being one of the data points).
* Finally, temperatures in Alabama are usually hot compared to the rest of the USA (source: Wikipedia).

<p><img alt="mean_std_tobs_prcp.png" src="s_figures/mean_std_tobs_prcp.png"></p>

### Observations

* We see that temperature across the year roughly follows a bell curve with its peak at the end of July.
* The precipitation, in contrast, is erratic from day to day, and its average stays almost the same throughout the year.
<p><img alt="mean_std_tmax_tmin.png" src="s_figures/mean_std_tmax_tmin.png"></p>

<p><img alt="mean_std_snwd_snow.png" src="s_figures/mean_std_snwd_snow.png"></p>

### Observations

* This plot shows that it rarely snows in the region corresponding to the label 'BSSSBSBB' (around Alabama)

**<i>For the rest of the section, we largely set snow aside as a parameter, since it rarely snows. We focus instead on temperature and precipitation.</i>**

<p> To get a closer look at the mean of SNOW, we next plot just the mean in a separate graph </p>
<p><img alt="vary_snow.png" src="s_figures/vary_snow.png"></p>

## Sanity-check: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/alabama/united-states/3170" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>


<p><img alt="reference.png" src="s_figures/reference.png" /></p>


<p>We see that the min and max daily temperatures agree with the ones we got from our data, once we convert Fahrenheit to Celsius.</p>
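
For reference, the unit conversion used in this comparison is the standard one. The helper name below is our own, and the sample values are illustrative rather than read off the reference graph.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Illustrative values only; they are not taken from the reference graph.
for temp_f in (32.0, 50.0, 90.0):
    print(f"{temp_f:.1f} F = {fahrenheit_to_celsius(temp_f):.1f} C")
```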



<p>Also, there is clear agreement that average precipitation is close to constant throughout the year.</p>



## Eigenvectors
<p> We next plot the top 3 eigenvectors for each parameter and see what we can infer from them. </p>
<p><img src="s_figures/eigen_tobs_prcp.png" /></p>
### Observations
* We see that the first eigenvector of **TOBS** (which is usually a scaled version of the mean) lies above the zero line.
* We see that the eigenvectors of **PRCP** follow no particular trend, which suggests that precipitation has no significant trend across the year, <i>neither going up nor going down</i>.
<p><img src="s_figures/eigen_snwd_snow.png" /></p>
### Observations
* The eigenvectors for SNOW and SNWD have few non-zero values, because the data itself records zero snow on most days (the non-zero values occur only between December and April, as in the data).



## PCA analysis

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.

### Percentage of variance explained.
![VarExplained1.png](s_figures/VarExplained1.png)


![VarExplained2.png](s_figures/VarExplained2.png)

* In each of the cases above, the plot lies above the 45 degree line (X=365*Y, in our case), which is expected. The larger the AUC (area under the curve), the more of the variation in the data the leading eigenvectors explain.
* We see that most of the variance in the snow data can be explained by the first 5 eigenvectors (about 95% of **SNWD** and 60% of **SNOW**).
* About 40% of **TOBS** (our proxy for average temperature) can be explained by the first 5 eigenvectors. This may be high compared to other regions, since the temperature here is warm most of the time and follows a relatively simple bell curve.
* However, the first 5 eigenvectors explain only about 20% of **TMIN** and **TMAX**, half as much as for **TOBS**. This seems reasonable if the daily extremes vary roughly twice as much as the average, a claim we state here without proof.
* **PRCP** is the least well explained by the top eigenvectors. This is due to the almost noise-like behavior of **PRCP** across the year, with no noticeable rise or fall.
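
The variance-explained curve can be computed roughly as in the sketch below. It assumes a complete matrix with one row per station-year and one column per day of the year; the real data has missing values that would need separate handling. The data here is synthetic, and the function name `variance_explained` is our own.

```python
import numpy as np


def variance_explained(data: np.ndarray) -> np.ndarray:
    """Cumulative fraction of variance captured by the top-k eigenvectors."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / (data.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # sort eigenvalues in descending order
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off values
    return np.cumsum(eigvals) / eigvals.sum()


# Synthetic stand-in for a temperature-like measurement (assumption, not real data):
# a per-series offset, a seasonal bump of varying height, and daily noise.
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.sin(2 * np.pi * (days - 110) / 365)
offsets = rng.normal(15.0, 3.0, size=(200, 1))
heights = rng.normal(10.0, 2.0, size=(200, 1))
data = offsets + heights * seasonal + rng.normal(0.0, 1.0, size=(200, 365))

curve = variance_explained(data)
diagonal = np.arange(1, 366) / 365            # the "45 degree line" k/365

print("explained by top 5 eigenvectors:", round(float(curve[4]), 3))
print("curve lies on or above the diagonal:", bool(np.all(curve >= diagonal - 1e-9)))
```

Because the eigenvalues are sorted in descending order, the cumulative curve can never fall below the diagonal, which is why every plot above sits over the 45 degree line.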

### Variation of temperature with latitude, longitude and elevation
<p> We observe how the mean of **TOBS** varies with latitude, longitude and elevation. </p>
<p><img src="s_figures/long_lat_ele.png" /></p>
We observe no clear correlation between a station's mean temperature and its latitude, longitude or elevation in this region.
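
One way to put a number on this observation is the Pearson correlation coefficient between mean **TOBS** and each location attribute. The sketch below uses made-up station values purely for illustration; the field names and numbers are assumptions, not values from our data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical stations; every value below is invented for illustration.
stations = [
    {"latitude": 31.2, "longitude": -85.4, "elevation": 60.0, "mean_tobs": 18.1},
    {"latitude": 31.9, "longitude": -87.7, "elevation": 45.0, "mean_tobs": 17.2},
    {"latitude": 32.4, "longitude": -86.3, "elevation": 70.0, "mean_tobs": 18.4},
    {"latitude": 32.6, "longitude": -85.5, "elevation": 120.0, "mean_tobs": 16.9},
    {"latitude": 33.1, "longitude": -87.1, "elevation": 95.0, "mean_tobs": 17.8},
]

temps = [s["mean_tobs"] for s in stations]
correlations = {
    name: pearson([s[name] for s in stations], temps)
    for name in ("latitude", "longitude", "elevation")
}

for name, r in correlations.items():
    print(f"{name:>9}: r = {r:+.2f}")
```

Values of r close to zero would support the reading of the scatter plots; values near +1 or -1 would contradict it.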


## Analysis of TOBS

We choose to analyze the eigen-decomposition of **TOBS** because the first 3 eigenvectors explain about 40% of the variance.

First, we graph the mean and the top 3 eigenvectors.

![tobs.png](s_figures/tobs.png)

* Next we interpret the eigenfunctions. The first eigenfunction (**eig1**) is shaped almost like a straight line, so its coefficient acts like the base temperature for that year.
* **eig2** is greater in the middle than at the ends. Its coefficient therefore corresponds to how high the bell of the curve is compared to its tails.
* **eig3** has a sharp dip before and after the bell. Its coefficient probably corresponds roughly to how steep the bell curve is. <i>Note that this is only one interpretation of the eigenvectors; other interpretations are certainly possible.</i>

### Examples of reconstructions

#### Good Cases
![good_cases_recon.png](s_figures/good_cases_recon.png)
* Good cases are the instances in the data with the lowest residual after reconstruction using three eigenvectors.
* In the good cases, the graph follows the bell curve closely, apart from jittery, noise-like variations.

#### Bad Cases
![bad_cases_recon.png](s_figures/bad_cases_recon.png)
* Bad cases are the instances in the data with the highest residual after reconstruction using three eigenvectors.
* In the bad cases, the graph misses the bell curve completely.
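
The selection of good and bad cases can be sketched as follows. This is an illustration under assumptions: the data is synthetic and complete (no missing days), and the residual is taken to be the fraction of each centered series' energy left unexplained by the top three eigenvectors, which may differ from the exact normalization behind the figures.

```python
import numpy as np


def fit_basis(data: np.ndarray, k: int = 3):
    """Return the mean series and the top-k eigenvectors of the covariance."""
    mean = data.mean(axis=0)
    centered = data - mean
    cov = centered.T @ centered / (data.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues come back in ascending order
    top = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    return mean, eigvecs[:, top]


def residual_fractions(data, mean, basis):
    """Fraction of each centered series left over after projecting onto the basis."""
    centered = data - mean
    coeffs = centered @ basis                 # one coefficient per eigenvector
    reconstructed = coeffs @ basis.T
    leftover = centered - reconstructed
    return (leftover ** 2).sum(axis=1) / (centered ** 2).sum(axis=1)


# Synthetic yearly series (assumption): offset + seasonal bump + noise of varying strength.
rng = np.random.default_rng(1)
days = np.arange(365)
bell = np.sin(2 * np.pi * (days - 110) / 365)
n_series = 120
offsets = rng.normal(0.0, 3.0, size=(n_series, 1))
heights = rng.normal(10.0, 2.0, size=(n_series, 1))
noise_scale = rng.uniform(0.5, 4.0, size=(n_series, 1))
data = offsets + heights * bell + noise_scale * rng.normal(size=(n_series, 365))

mean, basis = fit_basis(data, k=3)
residuals = residual_fractions(data, mean, basis)

order = np.argsort(residuals)
good_cases, bad_cases = order[:4], order[-4:]
print("lowest residuals: ", np.round(residuals[good_cases], 3))
print("highest residuals:", np.round(residuals[bad_cases], 3))
```

The coefficients computed in `residual_fractions` play the role of coeff_1, coeff_2 and coeff_3 in the sections that follow.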

## The variation in the average temperature is due slightly more to station-to-station variation
<p> Coeff_1 corresponds to the average temperature, because its eigenvector acts as the baseline for the bell curve of the temperature variation. We now estimate the importance of location-to-location variation relative to year-by-year variation in this coefficient. </p>

<p> These are measured by the fraction by which the RMS is reduced when we subtract from each station/year entry the average-per-station or the average-per-year, respectively. Here are the results: </p>

**coeff_1**

* total RMS = 626.076787628
* RMS removing mean-by-station = 476.203195948, fraction explained = 23%
* RMS removing mean-by-year = 485.772902404, fraction explained = 22.5%

<p> We see that year-to-year and station-to-station variation have almost equal effect on the average temperature. </p>
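
The comparison above can be sketched as follows. The table of coefficients is hypothetical (made-up stations, years and values), and we assume that RMS means the root of the mean squared entry and that the fraction explained is the relative drop in RMS; the exact definitions behind the reported numbers may differ.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical coefficients keyed by (station, year); all values are made up.
# One station-year is deliberately missing to mimic gaps in the real data.
coeffs = {
    ("A", 1990): 520.0, ("A", 1991): 610.0, ("A", 1992): 480.0,
    ("B", 1990): -150.0, ("B", 1991): -60.0, ("B", 1992): -200.0,
    ("C", 1990): 90.0, ("C", 1992): 40.0,
}


def rms(values):
    """Root of the mean squared value."""
    values = list(values)
    return sqrt(sum(v * v for v in values) / len(values))


def group_means(table, key_index):
    """Mean coefficient per station (key_index=0) or per year (key_index=1)."""
    groups = defaultdict(list)
    for key, value in table.items():
        groups[key[key_index]].append(value)
    return {group: sum(vs) / len(vs) for group, vs in groups.items()}


def rms_after_removing(table, key_index):
    """RMS of the table after subtracting each entry's group mean."""
    means = group_means(table, key_index)
    return rms(value - means[key[key_index]] for key, value in table.items())


total = rms(coeffs.values())
results = {
    "station": rms_after_removing(coeffs, 0),
    "year": rms_after_removing(coeffs, 1),
}

print(f"total RMS = {total:.1f}")
for label, reduced in results.items():
    fraction = 1.0 - reduced / total
    print(f"RMS removing mean-by-{label} = {reduced:.1f}, fraction explained = {fraction:.1%}")
```

In this toy table the stations differ far more than the years do, so removing the per-station mean reduces the RMS much more than removing the per-year mean.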

## The variation in the relative peak temperature is mostly due to year-to-year variation
<p> By relative peak temperature, we mean how much the peak rises relative to the tails in a particular year, which is **coeff_2** in our case. </p>

**coeff_2**

* total RMS = 192.931081902
* RMS removing mean-by-station = 189.319527345, fraction explained = 2%
* RMS removing mean-by-year = 104.605404872, fraction explained = 45%