# San Francisco Weather Analysis


AIM: The goal of this report is to unearth hidden patterns and trends in the weather data that might provide valuable information and interesting insights.

We focused on six measurements:
* TMIN, TMAX: the daily minimum and maximum temperature.
* TOBS: the temperature at the time of observation, treated in this report as the average temperature for each day.
* PRCP: Daily precipitation (in mm)
* SNOW: Daily snowfall (in mm)
* SNWD: The depth of accumulated snow.

## A. Preliminary analysis

### 1. Average temperature vs max and min temperature
Here, we examine how far TMAX and TMIN are from the average temperature TOBS on any given day. For example, it would be interesting to see whether the average temperature observed over a day is closer to the minimum or to the maximum temperature observed on the same day. Here are our results:

![Average_Temp.png](r_figures/Average_Temp.png)

We observe that the average temperature on a given day is roughly equidistant from that day's minimum and maximum.
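One way to quantify this comparison is to express TOBS as a relative position between TMIN (0.0) and TMAX (1.0); a value near 0.5 means the average is equidistant from the two extremes. The following is a minimal sketch with made-up numbers. The function name `relative_position` and the toy values are assumptions for illustration, not the code or data behind the figure.

```python
import numpy as np

def relative_position(tmin, tmax, tobs):
    """Return where TOBS sits between TMIN (0.0) and TMAX (1.0) for each day."""
    tmin = np.asarray(tmin, dtype=float)
    tmax = np.asarray(tmax, dtype=float)
    tobs = np.asarray(tobs, dtype=float)
    span = tmax - tmin
    # Days with a zero, negative, or missing range have no defined position.
    with np.errstate(divide="ignore", invalid="ignore"):
        position = np.where(span > 0, (tobs - tmin) / span, np.nan)
    return position

# Hypothetical values, for illustration only.
tmin = [8.0, 10.0, 12.0]
tmax = [18.0, 20.0, 20.0]
tobs = [13.0, 15.0, 16.0]

position = relative_position(tmin, tmax, tobs)
mean_position = np.nanmean(position)  # close to 0.5 means "equidistant"
```

A mean position well below 0.5 would indicate that TOBS tends to sit nearer the daily minimum, and well above 0.5 nearer the maximum.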

### 2. Mean and standard deviation for precipitation and snowfall

Here, we look at the mean and standard deviation of the following measurements for each of the 365 days of the year:

* PRCP - Daily precipitation (mm)
* SNOW - Daily snowfall (mm)
* SNWD - Snow depth

![prcp_mean_std.png](r_figures/prcp_mean_std.png)
![snow_mean_std.png](r_figures/snow_mean_std.png)

* Observation 1: For precipitation, observe that it does not rain from July to October (summer into early fall), but a good amount of rain falls during winter (January to March), peaking in February.
* Observation 2: Interestingly, the snowfall trend moves in tandem with precipitation. Although snowfall averages 0 mm (suggesting that it rarely snows), whenever it does snow, it is around the winter season.
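The per-day statistics behind plots like these can be sketched as follows, assuming the data is arranged as a matrix with one row per station-year and one column per day of the year, with missing readings stored as NaN. The layout, the name `daily_mean_std`, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def daily_mean_std(X):
    """Per-day mean and standard deviation, ignoring missing (NaN) readings.

    X has shape (n_station_years, n_days).
    """
    X = np.asarray(X, dtype=float)
    return np.nanmean(X, axis=0), np.nanstd(X, axis=0)

# Toy example: 3 station-years by 4 days (a real year would have 365 columns).
X = np.array([
    [0.0, 2.0, np.nan, 1.0],
    [0.0, 4.0, 3.0,    1.0],
    [0.0, 6.0, 5.0,    1.0],
])
mean, std = daily_mean_std(X)
```

Using the NaN-aware variants matters here because a plain `mean` would turn any day with a single missing reading into NaN.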

### 3. Eigenvectors and variance explained

For each of the six measurements, we compute the percentage of the variance explained as a function of the number of eigenvectors used.

![VE_1.png](r_figures/VE_1.png)

We see that the top 5 eigenvectors explain 35% of the variance for TMIN, 60% for TOBS and 50% for TMAX. We conclude that, of the three, TOBS is best explained by the top 5 eigenvectors.


![VE2.png](r_figures/VE2.png)

The top 5 eigenvectors explain 16% of the variance for PRCP, 65% for SNOW and 75% for SNWD. This means that the top 5 eigenvectors capture most of the variation in the snow signals, but little of the variation in precipitation.
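The variance-explained curves can be computed from the eigenvalues of the covariance matrix: the fraction explained by the top k eigenvectors is the sum of the k largest eigenvalues divided by the sum of all of them. The sketch below assumes a complete matrix with no missing values, which is a simplification, since real weather data has gaps that need separate handling. The function name and the synthetic data are illustrative.

```python
import numpy as np

def variance_explained(X):
    """Cumulative fraction of variance explained by the top-k eigenvectors.

    X has shape (n_samples, n_days) and is assumed to contain no NaNs.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = centered.T @ centered / X.shape[0]
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic data with one dominant direction plus small noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1)) * 10.0
X = base @ np.ones((1, 5)) + rng.normal(size=(200, 5))

ve = variance_explained(X)
top1 = ve[0]  # fraction explained by the first eigenvector alone
```

On the real data, `ve[4]` would correspond to the "top 5 eigenvectors" percentages quoted above.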

Based on this, we dig deeper into the PCA for:
* **SNWD** 
* **TOBS**



## B. PCA Analysis

### 1. Plot Reconstructions for SNWD

We construct approximations of a time series using the mean and the top $3$ eigenvectors.
First, we plot the mean and the top $3$ eigenvectors.

![SNWD_REC.png](r_figures/SNWD_REC.png)

Observation: The top 3 eigenvectors closely follow the shape of the mean. This is expected, because there is little variation in the mean for SNWD.
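The approximation described in this section is the mean plus the projection of the centered series onto the top eigenvectors. A minimal sketch, assuming a complete station-year by day matrix without missing values; the helper names `top_eigenvectors` and `reconstruct` and the synthetic data are hypothetical:

```python
import numpy as np

def top_eigenvectors(X, k=3):
    """Return the per-day mean and the k leading eigenvectors of the covariance."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = centered.T @ centered / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    return mean, eigvecs[:, order]               # eigvecs has shape (n_days, k)

def reconstruct(x, mean, vecs):
    """Approximate one series x as mean + sum_i coeff_i * eigenvector_i."""
    coeffs = vecs.T @ (x - mean)
    return mean + vecs @ coeffs, coeffs

# Synthetic station-years: a seasonal curve with varying amplitude plus noise.
rng = np.random.default_rng(0)
season = np.cos(2 * np.pi * np.arange(365) / 365)
X = rng.normal(10.0, 3.0, size=(50, 1)) * season + rng.normal(scale=0.5, size=(50, 365))

mean, vecs = top_eigenvectors(X, k=3)
x = X[0]
approx, coeffs = reconstruct(x, mean, vecs)
```

Because the eigenvectors are orthonormal, the coefficients are simple dot products, and the reconstruction can never be further from `x` than the mean alone.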


### 2. Plot Reconstructions for TOBS

We construct approximations of a time series using the mean and the top $3$ eigenvectors. First, we plot the mean and the top $3$ eigenvectors.

![TOBS_REC.png](r_figures/TOBS_REC.png)

Observation: The top 3 eigenvectors closely follow the shape of the mean. **Because there is also much more variance in the TOBS data across the days of the year, we proceed with TOBS instead of SNWD.**


### 3. Plotting approximations of TOBS for different station and year combinations

Here, we plot approximations of TOBS using only the mean and the first 3 eigenvectors.

![3rec_TOB.png](r_figures/3rec_TOB.png)

Observation 1: The first 3 eigenvectors, together with the mean, give a close approximation of the original curve over 365 days. The approximation and the original rise and fall in tandem.

Observation 2: Notice how the noise in the original TOBS curve is largely filtered out by the reconstruction from the mean and the first 3 eigenvectors.
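The filtering effect can be illustrated on synthetic data where the clean signal is known. The sketch below builds noisy seasonal curves, reconstructs them from the mean and the top 3 principal directions (obtained here via SVD, which yields the same directions as the covariance eigenvectors), and compares the error before and after. Everything in it, including the noise level and the Gaussian noise model, is an assumption for illustration and not a property of the TOBS data.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(365)
season = np.cos(2 * np.pi * days / 365)           # smooth yearly cycle

# Synthetic station-years: a scaled seasonal curve plus independent noise.
amplitude = rng.normal(10.0, 3.0, size=(100, 1))
clean = amplitude * season
noisy = clean + rng.normal(scale=1.0, size=clean.shape)

mean = noisy.mean(axis=0)
# Rows of vt are the principal directions of the centered data.
_, _, vt = np.linalg.svd(noisy - mean, full_matrices=False)
top3 = vt[:3]                                     # shape (3, 365)

coeffs = (noisy - mean) @ top3.T                  # shape (100, 3)
smoothed = mean + coeffs @ top3                   # shape (100, 365)

# RMS distance to the known clean signal, before and after reconstruction.
err_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
err_smoothed = np.sqrt(np.mean((smoothed - clean) ** 2))
```

In this setup `err_smoothed` comes out well below `err_noisy`, because the discarded directions carry mostly noise while the seasonal pattern is concentrated in the first one.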

## C. Geographical and Topographical Analysis
### 1. The city/area/metropolitan region corresponding to the file_index assigned

We find that the area corresponding to the assigned file_index is the Bay Area.

![map.png](r_figures/map.png)

### 2. Station locations in the Bay Area and analyzing the number of measurements from each station

The circles on the map mark the locations of the stations.
The area of each circle reflects the number of measurements received from that station.
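If the circle area is to be proportional to the number of measurements, the radius has to scale with the square root of the count. A small sketch of that relationship; `circle_radius` and its `scale` parameter are hypothetical names, and the exact scaling used for the map is not stated in this report:

```python
import math

def circle_radius(count, scale=1.0):
    """Radius such that the circle's area grows linearly with `count`."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return scale * math.sqrt(count)

# A station with 4x the measurements gets 2x the radius, hence 4x the area.
r_small = circle_radius(25)
r_large = circle_radius(100)
area_ratio = (math.pi * r_large ** 2) / (math.pi * r_small ** 2)
```

Scaling the radius directly with the count would instead exaggerate busy stations, since the perceived size grows with the area.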

### 3. Decoding the color code

The red circles mark the stations with the highest positive values of the coefficient of the first eigenvector.

Similarly, the blue circles mark the stations with the largest negative values of the coefficient of the first eigenvector.

In effect, the color shows where a station's first eigenvector coefficient lies relative to the mean TOBS.
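A diverging color scale of this kind can be sketched as a mapping from the coefficient to a hex color: white at zero, fading to red for positive values and to blue for negative ones. The function `coeff_to_hex` and the specific white-centered scale are assumptions for illustration; the report does not say which colormap produced the map.

```python
def coeff_to_hex(coeff, max_abs):
    """Map a coefficient in [-max_abs, max_abs] to a blue-white-red hex color."""
    if max_abs <= 0:
        raise ValueError("max_abs must be positive")
    # Normalize to [-1, 1], clamping anything outside the expected range.
    t = max(-1.0, min(1.0, coeff / max_abs))
    fade = int(round(255 * (1.0 - abs(t))))  # 255 at zero, 0 at the extremes
    if t >= 0:
        r, g, b = 255, fade, fade            # positive: white -> red
    else:
        r, g, b = fade, fade, 255            # negative: white -> blue
    return "#{:02x}{:02x}{:02x}".format(r, g, b)

strong_positive = coeff_to_hex(300.0, 300.0)   # "#ff0000"
strong_negative = coeff_to_hex(-300.0, 300.0)  # "#0000ff"
neutral = coeff_to_hex(0.0, 300.0)             # "#ffffff"
```

Here `max_abs` would be the largest absolute coefficient across all stations, so that the most extreme station gets the fully saturated color.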


### 4. Analyzing different elevation points


![ele.png](r_figures/ele.png)

It is clear from the map that most of the stations are located at altitudes close to sea level. This is expected, as the Bay Area has fairly even topography with few variations in altitude.

## D. Temporal vs Spatial analysis for TOBS

### Root mean square (RMS) by station and by year

**DATA-**
* total RMS                   =  255.753539901
* RMS removing mean-by-station=  132.020692328
* RMS removing mean-by-year   =  236.585402855

Observation: Removing the mean by station from every data point roughly halves the RMS (from about 255.8 to 132.0), whereas removing the mean by year lowers it only slightly (to about 236.6). Station-to-station differences therefore account for much more of the variation than year-to-year differences, so the data is better represented on a per-station (spatial) basis.

In other words, the value for TOBS at any given station does not change much over the years, whereas within a given year it varies widely from station to station.
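The three RMS figures can be reproduced in spirit with a station-by-year table: compute the RMS of the raw values, then again after subtracting each station's mean (row means), and again after subtracting each year's mean (column means). This is a minimal sketch on a made-up table. The layout with stations as rows and years as columns, the function names, and the numbers are all assumptions.

```python
import numpy as np

def rms(table):
    """Root mean square over all non-missing entries."""
    return np.sqrt(np.nanmean(np.asarray(table, dtype=float) ** 2))

def rms_report(table):
    """RMS of the raw table and after removing station or year means."""
    table = np.asarray(table, dtype=float)
    minus_station = table - np.nanmean(table, axis=1, keepdims=True)  # row means
    minus_year = table - np.nanmean(table, axis=0, keepdims=True)     # column means
    return {
        "total": rms(table),
        "minus_station_mean": rms(minus_station),
        "minus_year_mean": rms(minus_year),
    }

# Hypothetical table: 3 stations (rows) by 3 years (columns), one missing value.
table = np.array([
    [100.0, 110.0,  90.0],
    [-50.0, -40.0, -60.0],
    [ 10.0,  20.0, np.nan],
])
report = rms_report(table)
```

In this toy table the stations differ a lot from one another but barely change from year to year, so removing the station mean shrinks the RMS far more than removing the year mean, which is the same pattern as in the figures above.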