# East Georgia Weather Analysis
Author: Yuwei Wang

Student ID: A53212966

This report analyzes historical weather patterns using data from [NOAA](https://www.ncdc.noaa.gov/). Specifically, the data was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/). Based on my Student ID, the report covers the **BSBSSSBB** part of the data, which approximately overlaps the eastern part of the state of Georgia.

![stations.png](figures/stations.png)

We focused on six measurements:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TOBS:** The average temperature for each day.
* **PRCP:** Daily precipitation (in mm)
* **SNOW:** Daily snowfall (in mm)
* **SNWD:** The depth of accumulated snow.

## Sanity-check: comparison with outside sources

First, we compare some of the statistics with graphs obtained from <a href="http://www.usclimatedata.com/climate/douglas/georgia/united-states/usga0177" target="_blank">US Climate Data</a>. The figure below shows the minimum and maximum temperatures for each month.

![Douglas Climate.png](figures/Douglas.png)

These temperatures are close to the ones computed from our dataset.

![Static data](figures/tmin-max.png)

## PCA Analysis

We compute the percentage of variance explained as a function of the number of eigen-vectors used.

### Percentage of variance explained.

![VarExplained1.png](figures/variance-explained1.png)
We can see that the top 5 eigen-vectors explain 25% of the variance of TMIN, 55% of the variance of TOBS and 25% of the variance of TMAX. 

Among the three measurements, TOBS is the best explained by the top 5 eigen-vectors. The first eigen-vector alone explains 45% of its variance.
![VarExplained2.png](figures/variance-explained2.png)
We can see that the top 5 eigen-vectors explain 65% of the variance of SNOW, 93% of the variance of SNWD and 6.3% of the variance of PRCP.
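
As an illustration of how such percentages can be obtained, here is a minimal sketch. It assumes a complete (samples x days) matrix with no missing values, which is not true of the real station data, and the function name `variance_explained` and the toy data are hypothetical, not taken from the code behind this report.

```python
import numpy as np

def variance_explained(X, k):
    """Fraction of the total variance captured by the top-k eigen-vectors.

    X is a (samples x days) array without NaNs (an assumption; the real
    data has missing days that need separate handling).
    """
    cov = np.cov(X, rowvar=False)        # days x days covariance matrix
    eigvals = np.linalg.eigvalsh(cov)    # eigenvalues in ascending order
    eigvals = eigvals[::-1]              # largest first
    return eigvals[:k].sum() / eigvals.sum()

# Toy data: 200 "station-years" of 365 days, made of one strong shared
# seasonal pattern with a random amplitude plus independent daily noise.
rng = np.random.default_rng(0)
season = np.sin(np.linspace(0, 2 * np.pi, 365))
X = rng.normal(size=(200, 1)) * season * 10 + rng.normal(size=(200, 365))
frac = variance_explained(X, 5)
print("top 5 eigen-vectors explain {:.1%} of the variance".format(frac))
```

Plotting this fraction for k = 1, 2, 3, ... gives a curve of the same kind as the figures above.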

We inspect the TOBS measurement in more detail, even though the top eigen-vectors of SNOW and SNWD explain more variance. As the following figures show, it snows on only a few days in Georgia, so most of the snow data carries little information. In addition, the eigen-vectors are non-zero on only a few days. Analyzing the TOBS data is therefore more meaningful.

![snow-raw.png](figures/snow-raw.png)
![snow-eigen.png](figures/snow-eigen.png)

## Analysis of average temperature

As mentioned above, we analyze the eigen-decomposition of the average temperature in east Georgia.

First, we plot the mean and the top 3 eigen-vectors of the TOBS. We can see the temperature increases from January to July and then decreases from August to December.

![tobs-mean-eigen.png](figures/tobs-mean-eigen.png)

Next, we interpret the eigen-functions. The first eigen-function is positively correlated with the mean, but its graph is much flatter. We interpret this shape as follows: eig1 shifts the temperature above or below the mean without changing how it is distributed over the year.

The other eigen-functions oscillate between positive and negative values. They describe how the temperature is distributed within a year, but they do not change the mean temperature of the year.

Thus, eig2 and eig3 can be explained as follows:
* eig2: Temperature increasing from December to March, but decreasing from April to November.
* eig3: Temperature increasing in January, and from April to September, but decreasing in February, March, and from October to December.

### Examples of reconstructions

We reconstruct some samples with the highest and lowest coefficients in the eigen-decomposition.
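
The sketch below illustrates what such a reconstruction computes: the mean plus the projection of the sample onto the first k eigen-vectors. The function name `reconstruct` and the random orthonormal basis standing in for the real eigen-vectors are assumptions made for illustration only.

```python
import numpy as np

def reconstruct(x, mean, eigvecs, k):
    """Approximate x by the mean plus its projection on the first k eigen-vectors.

    eigvecs holds one orthonormal eigen-vector per column (days x n_vectors).
    Returns the reconstruction and the k coefficients (coeff1 ... coeffk).
    """
    V = eigvecs[:, :k]
    coeffs = V.T @ (x - mean)
    return mean + V @ coeffs, coeffs

# Toy example with 12 "days" and a random orthonormal basis (hypothetical data).
rng = np.random.default_rng(1)
days = 12
Q, _ = np.linalg.qr(rng.normal(size=(days, days)))
mean = rng.normal(size=days)
x = rng.normal(size=days)

approx3, coeffs3 = reconstruct(x, mean, Q, 3)   # partial reconstruction
full, _ = reconstruct(x, mean, Q, days)         # uses every eigen-vector
print("residual with 3 vectors:", np.linalg.norm(x - approx3))
```

Using all eigen-vectors recovers the sample exactly, and using fewer leaves a residual that is never larger than the residual of the mean alone.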

#### coeff1
![coeff1-high.png](figures/coeff1-high.png)
![coeff1-low.png](figures/coeff1-low.png)
Higher **coeff1** means that the temperature is lower than average in the sample, while lower **coeff1** means that the temperature is higher in the sample.

#### coeff2
![coeff2-high.png](figures/coeff2-high.png)
![coeff2-low.png](figures/coeff2-low.png)
Higher **coeff2** means that the temperature is flatter over the year (lower in summer and higher in winter), while lower **coeff2** means that the temperature changes more dramatically over the year (higher in summer and lower in winter).

#### coeff3
![coeff3-high.png](figures/coeff3-high.png)
![coeff3-low.png](figures/coeff3-low.png)
Higher **coeff3** means that there is an increase in the temperature in February, while lower **coeff3** means that there is a decline in the temperature in February.

## Is average temperature variation spatial or temporal?

To estimate the effect of time and location on the average temperature, we compute the RMS of each coefficient before and after subtracting either the mean by station or the mean by year. Here are the results:

**coeff1**

total RMS                   =  776.024596913

RMS removing mean-by-station =  549.688445392 (29.166105% explained)

RMS removing mean-by-year   =  706.771548021 (8.924079% explained)

**coeff2**

total RMS                   =  212.114913682

RMS removing mean-by-station =  191.675900704 (9.635821% explained)

RMS removing mean-by-year   =  140.912370859 (33.567910% explained)

**coeff3**

total RMS                   =  174.891906272

RMS removing mean-by-station =  170.321790935 (2.613109% explained)

RMS removing mean-by-year   =  101.817543966 (41.782587% explained)

We see that the variation by station explains more of **coeff1**, while the variation by year explains more of **coeff2** and **coeff3**. As mentioned above, **coeff1** corresponds to the average temperature at a station over a year, whereas **coeff2** and **coeff3** mainly capture how the temperature varies within the year. Thus, we can conclude that the station largely determines the average temperature, while the seasonal temperature pattern varies from year to year.
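
The RMS comparison above can be sketched as follows. This is a minimal version under two assumptions: the coefficients are arranged in a table with one row per station and one column per year (NaN where a station has no data), and the percentage explained is computed as 1 - RMS_after / RMS_total, which matches the reported figures (for example, 1 - 549.69 / 776.02 is about 29.17%). The function names are hypothetical.

```python
import numpy as np

def rms(a):
    """Root mean square over all non-missing entries."""
    return np.sqrt(np.nanmean(a ** 2))

def rms_explained(table):
    """table: rows = stations, columns = years, NaN where there is no data."""
    total = rms(table)
    by_station = rms(table - np.nanmean(table, axis=1, keepdims=True))
    by_year = rms(table - np.nanmean(table, axis=0, keepdims=True))
    return {
        "total": total,
        "station": (by_station, 100 * (1 - by_station / total)),
        "year": (by_year, 100 * (1 - by_year / total)),
    }

# Toy table: two stations whose values differ only by a constant offset,
# so removing the mean by station explains everything.
toy = np.array([[1.0, 1.0, 1.0],
                [5.0, 5.0, 5.0]])
result = rms_explained(toy)
print(result)
```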

## Data Visualization on the map

### Precipitation

I used iPyLeaflet to show the data on the map. First, I calculated the average precipitation in each year for all stations. Then, I used iPyLeaflet to display the values.

![rain-mean](figures/rain-mean.png)

In the figure above, warm colors indicate more rain, while cold colors indicate less rain. We see that in this area, the south is rainier than the north. In my opinion, this may be due to the forests in the south.
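
One simple way to turn per-station averages into warm and cold colors is a linear blue-to-red scale, sketched below. This is only an illustration: the helper `value_to_hex`, the station names, and the numbers are all hypothetical, and the actual color scale used for the figure may differ. The map-drawing calls themselves are omitted.

```python
def value_to_hex(value, vmin, vmax):
    """Linearly map a value in [vmin, vmax] to a blue (low) to red (high) hex color."""
    t = 0.0 if vmax == vmin else (value - vmin) / (vmax - vmin)
    t = min(max(t, 0.0), 1.0)                 # clamp out-of-range values
    red = int(round(255 * t))
    blue = int(round(255 * (1 - t)))
    return "#{:02x}00{:02x}".format(red, blue)

# Hypothetical per-station average precipitation (made-up numbers).
station_means = {"station_a": 2.0, "station_b": 3.0, "station_c": 4.0}
lo, hi = min(station_means.values()), max(station_means.values())
colors = {name: value_to_hex(v, lo, hi) for name, v in station_means.items()}
print(colors)
```

The resulting hex strings can then be used as the marker color for each station on the map.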

### Temperature

Then, I plotted the average coefficients of the eigen-decomposition as follows.

![temperature-coeff.png](figures/temperature-coeff.png)

In the figures, the blue triangles show **coeff1**, yellow triangles show **coeff2**, green triangles show **coeff3**, and red triangles show **coeff4**. The size of each triangle is proportional to the absolute value of the coefficient. Filled triangles indicate positive values, while hollow triangles indicate negative values. As mentioned before, the second coefficient reflects the amount of temperature change within a year. Thus, we see that in the forests, the weather is milder (larger filled yellow triangles). In other areas, a lower **coeff2** indicates a higher likelihood of extreme weather.

## Independence Analysis

### Precipitation

Let's focus on the precipitation records. We plot the mean, the top eigen-vectors, and the percentage of variance explained.

![prcp-mean-eigen.png](figures/prcp-mean-eigen.png)
![prcp-explained.png](figures/prcp-explained.png)

We see that the eigen-vectors explain only a very limited part of the variance. There is still plenty of information in the residuals.
![prcp-res3.png](figures/prcp-res.png)

We removed the mean by station and by day, and then calculated the remaining variance.

total RMS = 91.4703129625

RMS removing mean-by-station =  86.0669449859

RMS removing mean-by-day   =  66.8640711071

We can see that time plays a more important role in the precipitation: removing the mean by day lowers the RMS much more than removing the mean by station.

![rain-cdf.png](figures/rain-cdf.png)

To find out whether it tends to rain on the same day at different stations, I ran a statistical test of the null hypothesis that the rainfall at two locations is independent.

I choose two stations and calculate the probability, under the null hypothesis, of observing the number of days on which it rained at both stations. I calculate this normalized log probability for each pair of stations, and then plot the histogram of the significance.
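
One possible formulation of such a test is sketched below. It is an assumption, not necessarily the exact computation behind the figures: it conditions on the number of rainy days at each station and uses the hypergeometric distribution to get the probability of an overlap at least as large as the one observed. The function names and the toy records are hypothetical.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def overlap_p_value(rain_a, rain_b):
    """P(overlap >= observed) if rain at the two stations were independent.

    rain_a, rain_b: equal-length sequences of True / False / None,
    where None marks a day without a record. Returns (p_value, n_days).
    """
    pairs = [(a, b) for a, b in zip(rain_a, rain_b)
             if a is not None and b is not None]
    n = len(pairs)                               # days recorded at both stations
    k1 = sum(1 for a, _ in pairs if a)           # rainy days at station A
    k2 = sum(1 for _, b in pairs if b)           # rainy days at station B
    both = sum(1 for a, b in pairs if a and b)   # days it rained at both
    log_total = log_comb(n, k2)
    p = 0.0
    for l in range(both, min(k1, k2) + 1):       # hypergeometric upper tail
        p += math.exp(log_comb(k1, l) + log_comb(n - k1, k2 - l) - log_total)
    return min(p, 1.0), n

# Toy records over 40 days (made-up data).
rain_a = [True] * 10 + [False] * 30
rain_same = list(rain_a)                         # rains on exactly the same days
rain_alt = [True, False] * 20                    # unrelated alternating pattern

p_same, n_same = overlap_p_value(rain_a, rain_same)
p_alt, n_alt = overlap_p_value(rain_a, rain_alt)
print("normalized log probability:", math.log(p_same) / n_same)
```

Dividing the log probability by the number of shared days gives a normalized value that can be compared across station pairs with different amounts of data.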


![rain-sig.png](figures/rain-sig.png)


As shown in the figure above, many pairs still have a significance larger than 0.1. For these pairs, we cannot reject the null hypothesis, so the test shows no significant dependency between the rain at the two stations.

We plot the normalized log probability that the overlap in rain days is random, with the stations ordered by the coefficients.

![rain-order.png](figures/rain-order.png)

Although the rain at many pairs of stations appears independent, we can see that some eigen-vectors still group the stations. For example, consider the upper-left and bottom-right corners of the second matrix. The stations at positions 0-20 and 80-100 are clearly strongly correlated with each other. Thus, the rain does have some spatial relationship.
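
The reordering behind matrices like the ones above can be sketched as follows: sort the stations by one coefficient and permute both the rows and the columns of the pairwise matrix in that order. The function name and the toy values are hypothetical.

```python
import numpy as np

def reorder_by_coefficient(pair_matrix, coeff):
    """Permute rows and columns of a station-by-station matrix by ascending coeff."""
    order = np.argsort(coeff)
    return pair_matrix[np.ix_(order, order)], order

# Toy example with 4 stations; entry (i, j) encodes the pair as 10 * i + j.
pair_matrix = np.array([[10 * i + j for j in range(4)] for i in range(4)])
coeff = np.array([0.3, -1.0, 2.0, -0.5])
reordered, order = reorder_by_coefficient(pair_matrix, coeff)
print(order)
print(reordered)
```

Stations with similar coefficients end up next to each other, so groups of related stations show up as blocks in the reordered matrix.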