# Weather Analysis

This report analyzes a historical record of weather patterns in an area of the US. The dataset ID is BSSSBSBB, which was downloaded from [this link](https://mas-dse-open.s3.amazonaws.com/Weather/small/US_Weather_BSSSBSBB.csv.gz).

The dataset is composed of six major measurements:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The average temperature for each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

This report focuses on three aspects of this specific region (South Alabama):

* Analyze the **snow condition from year to year**
* Analyze the **daily temperature patterns**
* Analyze **historical droughts in this region**



## 0. Sanity check for dataset

From the longitude and altitude of the stations in my dataset, I determined that my region is South Alabama in the U.S.

Below is a report of Alabama's average daily temperature throughout the year, taken from [US Climate Data](http://www.usclimatedata.com/climate/alabama/united-states/3170):

<img src="lelu_r_figures/alabama_gt.png" alt="alabama_gt" style="width: 500px;"/>

From this chart we can see that the daily temperature throughout the year in Alabama ranges from ~2C (min) to ~35C (max), which resembles my dataset in terms of mean TMIN (~3C) and mean TMAX (~37C):

![TMIN_TMAX.png](lelu_r_figures/TMIN_TMAX.png)

Although the daily temperatures in my dataset are slightly higher than in this online report, this makes sense: the online report reflects data from the whole of Alabama, while my dataset only contains data from South Alabama, which is expected to be warmer.





## 1. Analyze the snow condition from year to year

### 1.1 PCA Analysis of all variance explained

For each of the six measurements, I compute the percentage of the variance explained as a function of the number of eigenvectors used. This makes it easy to pick a measurement for eigen-decomposition analysis.

![percentage_of_variance_1.png](lelu_r_figures/percentage_of_variance_1.png)

As we can see, the top 5 eigenvectors explain ~21% of the variance for TMIN, ~43% for TOBS and ~19% for TMAX. Therefore, among these three measurements, TOBS is best explained by the top 5 eigenvectors.

![percentage_of_variance_2.png](lelu_r_figures/percentage_of_variance_2.png)

In addition, the top 5 eigenvectors explain ~58% of the variance for SNOW, ~95% for SNWD and ~7% for PRCP. Therefore, among these three measurements, SNWD is best explained by the top 5 eigenvectors. With such a high percentage of variance explained for SNWD, we can conclude that the top 5 eigenvectors capture almost all of the variation in snow depth. Also note that the first eigenvector alone already explains ~80% of the variance for SNWD.

Therefore, I first dig into PCA analysis for SNWD.
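
As a rough illustration of how such a variance-explained curve can be computed, here is a minimal NumPy sketch. It assumes the data is arranged as a matrix with one row per station-year and one column per day, that missing days are stored as NaN, and that a missing entry can be treated as zero deviation from the daily mean. The function name and the toy data are made up for illustration; this is not the code that produced the figures above.

```python
import numpy as np

def explained_variance_fraction(data):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    data: 2-D array, one row per station-year, one column per day.
    NaN marks a missing day (an assumption about how gaps are stored).
    """
    # Center each day on its mean, ignoring missing values.
    centered = data - np.nanmean(data, axis=0)
    # Simplification: treat missing entries as zero deviation from the mean.
    centered = np.nan_to_num(centered)
    cov = centered.T @ centered / centered.shape[0]
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    return np.cumsum(eigvals) / eigvals.sum()

# Toy data: one shared signal plus a little independent noise per column.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 1)) * np.ones((1, 10)) + 0.1 * rng.normal(size=(200, 10))
frac = explained_variance_fraction(toy)
print(frac[:5])  # the first entry is already close to 1 for this toy data
```

Reading the fifth entry of the returned array for each measurement gives the "top 5 eigenvectors" percentages quoted in this section.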

### 1.2 Analysis of snow depth

I choose to analyze the eigen-decomposition for SNWD because the top 5 eigenvectors explain ~95% of the variance.

Graphs below show the daily SNWD mean and the top 5 eigenvectors for SNWD throughout the year.

![SNWD_mean_eig.png](lelu_r_figures/SNWD_mean_eig.png)

As we can see, although the percentage of variance explained for SNWD is high and the SNWD curves show sharp, drastic peaks, the magnitude itself is actually very small (max 1.3 mm), which means that it does not snow much in this area. Therefore, analyzing this measurement may not give us very useful information.

I then graph the mean TOBS and SNOW to verify that this area (South Alabama) does not snow much:
![TOBS_SNOW.png](lelu_r_figures/TOBS_SNOW.png)

As we can see, the daily mean temperatures are almost all above 0 degrees, and even the lowest daily temperatures (mean-std for TOBS) are mostly above 0 degrees, which explains why SNWD does not have much variance. Also, since the region is relatively warm, snow is expected to melt fast, so snow depth changes drastically despite its low absolute value.


## Conclusion 1
In conclusion, **this place's snow depth does not vary much from year to year**, mainly because it rarely snows here. On most days the daily temperature is above 0 degrees, and it may only drop below 0 for a short period around January.

## 2. Analyze the daily temperature patterns

Since it does not snow much in this region, SNWD and SNOW analysis would not be very meaningful. At this point, it's better to analyze another measurement that has more significant variance. I choose to continue with the eigen-decomposition of TOBS, since its top few eigenvectors explain a good percentage of its variance compared to the other measurements.

### 2.1 PCA Analysis of average daily temperature

First, let's take a look at the mean TOBS and the first 4 eigenvectors.

![TOBS_4eigvecs.png](lelu_r_figures/TOBS_4eigvecs.png)

These eigen-functions all roughly resemble the mean function, each with its own differences. **eig1** does not vary much: it represents an overall temperature offset above/below the mean whose distribution stays constant over time.

The other eigen-functions can be interpreted as:
* **eig2:** colder from Dec to Apr; warmer from mid-Apr to late-Nov
* **eig3:** warmer from late-Jan to Oct; colder from Nov to Jan
* **eig4:** warmer from late-Jan to mid-March; colder from mid-March to late-Apr; warmer from May to Jan

Also, it is clear that although these 4 eigen-functions capture the TOBS mean to some extent, the variation between May and Oct is not strongly reflected in their combination. Let's explore how many eigen-functions are needed to explain the TOBS variance well.

![TOBS_numEigs.png](lelu_r_figures/TOBS_numEigs.png)

As we observe, the first 16 eigenvectors explain ~60% of the variance and the first 45 eigenvectors explain ~80%; this tells us that the variance is spread more evenly across eigenvectors after the first few.
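
Finding how many eigenvectors are needed for a given target amounts to looking up the first point where the cumulative curve crosses that target. A small sketch, using a made-up cumulative curve (not the TOBS values) and a hypothetical helper name:

```python
from bisect import bisect_left

def components_needed(cum_fraction, target):
    """Smallest k such that the top-k eigenvectors explain at least `target`.

    cum_fraction: non-decreasing list of cumulative explained-variance fractions.
    """
    if not cum_fraction or target > cum_fraction[-1]:
        raise ValueError("target is never reached by this curve")
    # bisect_left returns the first index whose value is >= target;
    # add 1 because index 0 corresponds to using one eigenvector.
    return bisect_left(cum_fraction, target) + 1

# Made-up curve for illustration only.
cum = [0.2, 0.35, 0.5, 0.62, 0.8, 1.0]
print(components_needed(cum, 0.6))  # 4
print(components_needed(cum, 0.8))  # 5
```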

### 2.2 TOBS reconstruction

We reconstruct TOBS using the top 16 eigenvectors (which explain ~60% of the variance). The following coefficients form the best reconstruction.

![all_coefficient.png](lelu_r_figures/all_coefficient.png)

![TOBS_best_reconstruction.png](lelu_r_figures/TOBS_best_reconstruction.png)
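
For readers who want to see the mechanics, a reconstruction from the top-k eigenvectors can be sketched as a projection onto an orthonormal basis followed by a weighted sum. The sketch below uses a random orthonormal basis and a synthetic series, so all names and numbers are illustrative assumptions; it also ignores missing values, which real station data would require handling.

```python
import numpy as np

def reconstruct(series, mean, eigvecs, k):
    """Approximate `series` as mean + its projections on the top-k eigenvectors.

    eigvecs: array of shape (n_days, n_components) with orthonormal columns.
    Returns (coefficients, reconstruction).
    """
    basis = eigvecs[:, :k]
    # With an orthonormal basis, each coefficient is a simple dot product.
    coeffs = basis.T @ (series - mean)
    return coeffs, mean + basis @ coeffs

# Synthetic example: a random orthonormal basis stands in for the eigenvectors.
rng = np.random.default_rng(1)
eigvecs, _ = np.linalg.qr(rng.normal(size=(8, 8)))
mean = np.linspace(10.0, 24.0, 8)
series = mean + 3.0 * eigvecs[:, 0] - 2.0 * eigvecs[:, 1] + 0.5 * eigvecs[:, 5]

coeffs, approx = reconstruct(series, mean, eigvecs, k=2)
residual_norm = np.linalg.norm(series - approx)
print(coeffs)         # close to [3, -2]
print(residual_norm)  # close to 0.5, the part carried by the unused eigenvector
```

The residual norm shows what is lost by truncating: whatever lies along eigenvectors beyond the first k is not recovered.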



### 2.3 Does daily temperature vary more from year to year or station to station?

After reconstructing TOBS, we obtained a coefficient for each top eigenvector. 

In order to estimate the effect of time vs. location (i.e. year vs. station) on the first eigenvector coefficient, we compute:

* The average row: `mean-by-station`
* The average column: `mean-by-year`

We then compute the RMS before and after subtracting either the row or the column vector.


```
total RMS                    =  192.931081902
RMS removing mean-by-station =  189.319527345
RMS removing mean-by-year    =  104.605404872
```
As we can see, the RMS does not change much after removing mean-by-station (reduced by ~4), yet changes greatly after removing mean-by-year (reduced by ~88). This difference is large enough to conclude that daily temperature varies more from year to year than from station to station.
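
The comparison can be sketched on a small table. The layout here is an assumption: rows are years and columns are stations, so that the "average row" holds one value per station and the "average column" holds one value per year; NaN marks a missing station-year. The numbers are made up and chosen so that the year effect dominates.

```python
import numpy as np

def rms(a):
    """Root mean square over the non-missing entries."""
    return np.sqrt(np.nanmean(a ** 2))

# Hypothetical coefficient table: rows = years, columns = stations.
coef = np.array([
    [100.0, 110.0,  90.0],
    [-50.0, -60.0, -40.0],
    [200.0, 190.0, np.nan],
])

mean_by_station = np.nanmean(coef, axis=0)  # average row: one value per station
mean_by_year = np.nanmean(coef, axis=1)     # average column: one value per year

rms_total = rms(coef)
rms_no_station = rms(coef - mean_by_station[None, :])
rms_no_year = rms(coef - mean_by_year[:, None])

print(rms_total, rms_no_station, rms_no_year)
```

In this toy table, removing the per-year mean leaves an RMS of 7.5, far below the other two values, which is the same qualitative pattern as in the output above.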

Furthermore, we repeat this reduction for 10 iterations, removing mean-by-year first; the RMS stops changing much after the first iteration:

```
initial RMS= 192.931081902
0 after removing mean by year     = 104.605404872
0 after removing mean by stations = 99.5261923657
1 after removing mean by year     = 99.3453902518
1 after removing mean by stations = 99.3116040689
2 after removing mean by year     = 99.3001023212
2 after removing mean by stations = 99.2953353876
3 after removing mean by year     = 99.2932068539
3 after removing mean by stations = 99.2922238056
4 after removing mean by year     = 99.2917620307
4 after removing mean by stations = 99.2915431789
5 after removing mean by year     = 99.2914389634
5 after removing mean by stations = 99.2913892099
6 after removing mean by year     = 99.2913654244
6 after removing mean by stations = 99.291354045
7 after removing mean by year     = 99.2913485986
7 after removing mean by stations = 99.2913459913
8 after removing mean by year     = 99.291344743
8 after removing mean by stations = 99.2913441453
9 after removing mean by year     = 99.2913438591
9 after removing mean by stations = 99.2913437221
```

As we can see, RMS is reduced by ~88 after removing mean-by-year, yet only by a further ~5 after then removing mean-by-station.
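
A minimal sketch of this alternating procedure is shown below, again assuming rows are years, columns are stations, and NaN marks missing entries (the function name and toy table are hypothetical). Each step subtracts the mean that minimizes the squared error for that row or column, so the RMS can never increase. On a table with no missing entries, one pass already centers both rows and columns; gaps in the table are a plausible reason for the small changes that continue in later iterations.

```python
import numpy as np

def rms(a):
    """Root mean square over the non-missing entries."""
    return np.sqrt(np.nanmean(a ** 2))

def alternate_mean_removal(table, n_iter=10):
    """Alternately subtract per-year (row) and per-station (column) means.

    Returns the residual table and the RMS recorded after every step.
    """
    resid = np.array(table, dtype=float)
    history = [rms(resid)]
    for _ in range(n_iter):
        resid = resid - np.nanmean(resid, axis=1, keepdims=True)  # by year
        history.append(rms(resid))
        resid = resid - np.nanmean(resid, axis=0, keepdims=True)  # by station
        history.append(rms(resid))
    return resid, history

# Made-up table with one missing station-year.
table = np.array([
    [100.0, 110.0,  90.0],
    [-50.0, -60.0, -40.0],
    [200.0, 190.0, np.nan],
])
resid, history = alternate_mean_removal(table, n_iter=10)
for step, value in enumerate(history):
    print(step, value)
```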


## Conclusion 2

In conclusion, this region's **daily temperature varies more from year to year than from station to station.**

## 3. Analyze historical droughts in this region

Drought is a recurring feature of the Earth's climate, and a severe, long-lasting drought often has many negative impacts. Here I want to analyze historical droughts in this region based on the yearly precipitation of the past 140 years from 100 stations. Since some stations have missing data, I focus on historical droughts of the whole region.

First, I graph the evolution of the average yearly precipitation from 1873 to 2012. The x-axis shows the year and the y-axis shows the yearly precipitation in mm.

![yearly_prcp.png](lelu_r_figures/yearly_prcp.png)

From the graph above we can see that the yearly precipitation varies a lot and seems to drop in a recurring pattern (i.e. cyclic drops). To investigate this further, we first need a definition of drought so that drought events can be identified.

### 3.1 Define drought

As described by Zhang et al. [1], the _**Rainfall Variability Index (RVI)**_ can be useful in determining the drought condition of a region. RVI is calculated as follows:

* **RVI = (P_i - mean) / std**

where **P_i** is the annual rainfall for year i, **mean** is the mean annual rainfall and **std** is its standard deviation over the period 1873 to 2012. When **RVI** is negative, the year is a drought year.
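
As a small worked example of this formula, the sketch below computes RVI for five made-up yearly totals (the years and values are invented, not taken from the dataset). It assumes the population standard deviation; the text does not say whether the population or sample version is used.

```python
from statistics import fmean, pstdev

def rainfall_variability_index(annual_prcp):
    """RVI_i = (P_i - mean) / std, computed over the whole record."""
    mean = fmean(annual_prcp)
    std = pstdev(annual_prcp)  # population std: an assumption, see lead-in
    return [(p - mean) / std for p in annual_prcp]

# Made-up yearly precipitation totals in mm.
toy = {2001: 1200.0, 2002: 1500.0, 2003: 900.0, 2004: 1400.0, 2005: 1000.0}
rvi = dict(zip(toy, rainfall_variability_index(list(toy.values()))))

# A year counts as a drought year when its RVI is negative.
drought_years = [year for year, value in rvi.items() if value < 0]
print(drought_years)  # [2003, 2005]
```

Here the mean is 1200 mm, so only the two years below it get a negative RVI; the year that sits exactly on the mean has RVI 0 and is not counted.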

Also, according to WMO [2], rainfall time series can be classified into different climatic regimes [1]:


| Regimes       | Condition     | 
| ------------- |:-------------:|
|extremely dry  | P_i < mean - 2 * std | 
| dry   | mean - 2 * std < P_i < mean - std | 
| normal| mean - std < P_i < mean + std |   
| wet| P_i > mean + std |  


*[[1] Yuhu Zhang, Wanyuan Cai, Qiuhua Chen, Yunjun Yao, and Kaili Liu, “Analysis of Changes in Precipitation and Drought in Aksu River Basin, Northwest China,” Advances in Meteorology, vol. 2015, Article ID 215840, 15 pages, 2015. doi:10.1155/2015/215840](https://www.hindawi.com/journals/amete/2015/215840/cta/)*

*[2] World Meteorological Organization (WMO), Drought and Agriculture, WMO/TN 138, World Meteorological Organization (WMO), Geneva, Switzerland, 1975.*

### 3.2 Recognize historical droughts

After defining drought, I replot the yearly precipitation graph with the y-axis representing the RVI of each year. This graph can also be considered a residual plot in which 0 is the no-drought baseline.

![residual_prcp_rvi.png](lelu_r_figures/residual_prcp_rvi.png)

The graph clearly shows that almost half of the investigated years experienced drought to some extent. Focusing only on the "extremely dry" and "dry" regimes defined in section 3.1, I filter the following years from those considered drought years:

| Regimes       | Years     |  Total | Percentage|
| ------------- |:-------------:| -----:| -----:|
| extremely dry | 1954, 1982, 1999 | 3 years | 2.14% |
| dry | 1883, 1908, 1927, 1930, 1931, 1933, 1951, 1955, 1981, 1987, 1990, 1995, 2000, 2002 | 14 years | 10.00% |

Therefore, among the past 140 years, 17 years (12.14%) are considered to have significant drought.
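
The regime thresholds from section 3.1 and the resulting share of significant drought years can be sketched as follows. The input values are made up, the population standard deviation is assumed, and values that fall exactly on a threshold are assigned to "normal" because the table does not specify how ties are handled.

```python
from statistics import fmean, pstdev

def classify_regimes(annual_prcp):
    """Label each yearly total using the mean and std thresholds of section 3.1."""
    mean, std = fmean(annual_prcp), pstdev(annual_prcp)
    labels = []
    for p in annual_prcp:
        if p < mean - 2 * std:
            labels.append("extremely dry")
        elif p < mean - std:
            labels.append("dry")
        elif p > mean + std:
            labels.append("wet")
        else:
            labels.append("normal")  # includes exact ties (assumption)
    return labels

# Made-up yearly totals in mm, not values from the dataset.
totals = [1250, 1300, 1350, 1300, 1280, 1320, 1300, 1600, 950, 600]
labels = classify_regimes(totals)
significant = sum(label in ("dry", "extremely dry") for label in labels)
share = significant / len(labels)
print(labels)
print(f"{significant} of {len(labels)} years ({share:.2%})")
```

Applied to the real 140-year series, the same counting step is what yields the "Total" and "Percentage" columns of the table above.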

### 3.3 Visualize historical drought years, summers and winters

It could be interesting to visualize historical drought years, summers and winters. The three graphs below show the distribution of droughts among the past 140 years (1873 to 2012) for years, summers (Jun, Jul, Aug) and winters (Dec, Jan, Feb), respectively.

![drought_years_3.png](lelu_r_figures/drought_years_3.png)
![drought_summers_3.png](lelu_r_figures/drought_summers_3.png)
![drought_winters_3.png](lelu_r_figures/drought_winters_3.png)
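
One way to obtain the seasonal series behind such plots is to sum monthly precipitation into summer (Jun, Jul, Aug) and winter (Dec, Jan, Feb) totals. The sketch below is an assumption about the aggregation, not the code used for the figures: in particular, it counts December with the winter of the following year so that each winter is a contiguous three-month block, and the input format and values are made up.

```python
from collections import defaultdict

SUMMER = {6, 7, 8}
WINTER = {12, 1, 2}

def seasonal_totals(monthly_prcp):
    """Sum monthly precipitation into summer and winter totals.

    monthly_prcp: dict mapping (year, month) -> precipitation in mm.
    Returns (summer, winter) dicts mapping season-year -> total in mm.
    December is assigned to the winter of the following year (assumption).
    """
    summer, winter = defaultdict(float), defaultdict(float)
    for (year, month), mm in monthly_prcp.items():
        if month in SUMMER:
            summer[year] += mm
        elif month in WINTER:
            winter[year + 1 if month == 12 else year] += mm
    return dict(summer), dict(winter)

# Made-up monthly values in mm.
monthly = {
    (2000, 12): 100.0, (2001, 1): 80.0, (2001, 2): 90.0,
    (2001, 4): 110.0,
    (2001, 6): 120.0, (2001, 7): 150.0, (2001, 8): 130.0,
    (2001, 12): 60.0,
}
summer, winter = seasonal_totals(monthly)
print(summer)  # {2001: 400.0}
print(winter)  # {2001: 270.0, 2002: 60.0}
```

The seasonal totals can then be passed through the same RVI and regime steps as the yearly totals.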

From these graphs we can learn a lot about historical droughts in South Alabama. For example, we observe that:
* In general, droughts occur more frequently in summer than in winter (which makes sense, since summer temperatures are much higher in South Alabama).
* In the early years (pre-1980), droughts occurred less frequently than in the most recent 33 years (1980-2012), which may be related to global warming.
* Winter droughts are more scattered across the past 140 years than summer droughts; summer droughts appear to be more frequent in the later years.
* In all three analyses (years/summers/winters), there are obvious "drought periods" and "normal periods":
    * During "drought periods", droughts occur every few years/summers/winters.
    * During "normal periods", there are no droughts.
    * In the whole-year analysis, "normal periods" are longer than "drought periods"; in the summer analysis, "drought periods" are longer than "normal periods"; in the winter analysis, "drought periods" and "normal periods" last for equal amounts of time.

## Conclusion 3
In conclusion, **South Alabama has experienced recurring droughts historically, and this pattern continues**.