## Weather Analysis for Northern California/Bay Area

### A Glance at the Geographical Location of the Data

![geolocmt.png](r_figures/geolocmt.png)


#### The stations covered in this dataset are spread around San Jose and the Bay Area.




### Overview

The San Jose/Bay Area has a typical subtropical Mediterranean climate. The dataset provides six target variables for this analysis: three temperature variables (TMIN, TMAX, TOBS) and three weather variables (PRCP, SNOW, SNWD).

PCA and regression analysis are performed on the weather data for this area in order to find effects and patterns.

Let's start by looking at basic statistics for these six variables.

##### Mean +/- Standard Deviation for TMIN and TMAX:
![TMINTMAXmt.png](r_figures/TMINTMAXmt.png)
##### Mean +/- Standard Deviation for TOBS and PRCP:
![TOBSPRCPmt.png](r_figures/TOBSPRCPmt.png)
##### Mean +/- Standard Deviation for SNOW and SNWD:
![SNOWSNWDmt.png](r_figures/SNOWSNWDmt.png)


In the six figures above, the orange line in the middle is the mean, the lower blue line is the mean minus one standard deviation, and the upper green line is the mean plus one standard deviation.
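As a minimal sketch of how such curves can be computed (not the code used to produce the figures), the snippet below takes a matrix with one row per station-year and one column per day of the year, and returns the three lines. The function name `daily_mean_std`, the array layout and the toy values are assumptions made for illustration.

```python
import numpy as np

def daily_mean_std(yearly):
    """yearly: 2-D array, rows = station-years, columns = days of the year.
    Missing readings are NaN and are ignored.
    Returns (mean, mean - std, mean + std), one value per day."""
    mean = np.nanmean(yearly, axis=0)
    std = np.nanstd(yearly, axis=0)  # population standard deviation (ddof=0)
    return mean, mean - std, mean + std

# Made-up example: 3 station-years x 4 days (values are illustrative only)
toy = np.array([
    [10.0, 12.0, np.nan, 20.0],
    [14.0, 12.0, 18.0, 22.0],
    [12.0, 12.0, 16.0, 24.0],
])
mean, lower, upper = daily_mean_std(toy)
print(mean, lower, upper)
```

Using the NaN-aware functions matters for station data, where many days have no reading; a plain `mean` would turn every such column into NaN.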

SNOW, SNWD and PRCP do not show very obvious seasonal variation, whereas the temperature variables TMIN, TMAX and TOBS show clear seasonal variation.

The plots also make clear that snowfall and snow depth are very low throughout the year. There is some precipitation during the winter, though not much, and precipitation in summer is close to zero. The warmest season starts in July and lasts until October.

Comparing our maximum temperature data with third-party data from US Climate Data, we confirm that the two sources agree on TMAX (taken as an example) for the San Jose area.

![USClimatemt.png](r_figures/USClimatemt.png)
![TMAXmt.png](r_figures/TMAXmt.png)




### PCA Analysis


PCA is performed on each of the six variables, and the percentage of variance explained is plotted as a function of the number of eigen-vectors used.

The results are shown in the plots below. 

![Pctweathermt.png](r_figures/Pctweathermt.png)

Among the three weather variables--SNOW, SNWD and PRCP-- we can see that 30% of variance is explained by the top 5 eigen-vectors for SNOW; around 50% of variance is explained by the top 5 eigen-vectors for SNWD; less than 16% of variance is explained by the top 5 eigen-vectors for PRCP. 

The San Jose/Bay Area has a typical subtropical Mediterranean climate in which snow is rarely seen. (As shown in the previous section, the means of SNOW and SNWD were both approximately zero millimeters throughout the year.) Given this limitation in the data, it is no surprise that a low percentage of variance is explained. Even with so little snow, SNWD (snow depth) varies less across dates and across years for the same date, so a higher percentage of its variance is explained by the eigen-vectors than for SNOW. Meanwhile, precipitation is a complicated weather phenomenon controlled by many factors, so it is not surprising that PCA finds no dominant factor for PRCP.
 
![Pcttempmt.png](r_figures/Pcttempmt.png)

Among the three temperature variables, 35%-40% of the variance is explained by the top 5 eigen-vectors for TMIN, around 45% for TMAX, and 60% for TOBS (50% by the first eigen-vector alone).

TOBS has the highest percentage of variance explained, so we'll look deeper into this variable in the next section.
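The percentage-of-variance curves discussed above come from the eigenvalues of the covariance matrix: the fraction explained by the top k eigen-vectors is the sum of the k largest eigenvalues divided by the sum of all of them. The sketch below illustrates this on synthetic data. It assumes a complete matrix with no missing values, which is a simplification compared with real station data, and the function name and the synthetic seasonal signal are made up for the example.

```python
import numpy as np

def cumulative_variance_explained(X):
    """X: rows = station-years, columns = days; assumed to contain no NaN.
    Returns the cumulative fraction of variance explained by the top
    1, 2, ..., d eigen-vectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                   # center each day
    cov = Xc.T @ Xc / X.shape[0]              # d x d covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic data: one seasonal shape with a random amplitude, plus noise
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = np.cos(2 * np.pi * (days - 200) / 365)
amplitude = rng.normal(10.0, 3.0, size=(200, 1))
X = amplitude * seasonal + rng.normal(0.0, 1.0, size=(200, 365))

frac = cumulative_variance_explained(X)
print(frac[:5])  # fraction explained by the top 1..5 eigen-vectors
```

When a single shape dominates the year-to-year variation, as in this synthetic case, the first eigen-vector carries most of the variance, which is the pattern reported for TOBS.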



### PCA Analysis of TOBS


We'll take a closer look at TOBS, as about 60% of its variance is explained by the top 5 eigen-vectors.
First we plot the mean and the top 4 eigen-vectors. From the plot below we can see that eigen-vectors 1 (with negative sign) and 2 have the shapes most similar to the mean.

![topeigenmt.png](r_figures/topeigenmt.png)

#### Reconstructions:

The smaller the residual, the better the reconstruction. The operations below exclude reconstructions whose residual is higher than 0.4, then sort the coefficients in both ascending and descending order to show the lowest (most negative) and highest (most positive) coefficients.
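The filter-and-sort step can be sketched as follows. The report does not spell out how the residual is defined, so this example assumes it is the fraction of the squared deviation from the mean that is left unexplained after projecting onto the top eigen-vectors. The function names, the toy eigen-vectors and the toy rows are hypothetical.

```python
import numpy as np

def project(X, mean, V):
    """X: (n, d) rows; mean: (d,); V: (d, k) with orthonormal columns.
    Returns the coefficients (n, k) and the residual fraction (n,)."""
    Xc = X - mean
    coeffs = Xc @ V                # coefficient of each row on each eigen-vector
    recon = coeffs @ V.T           # reconstruction of the centered row
    residual = np.sum((Xc - recon) ** 2, axis=1) / np.sum(Xc ** 2, axis=1)
    return coeffs, residual

def extreme_rows(coeffs, residual, j, max_residual=0.4, top=1):
    """Indices of the rows with the most negative and most positive
    coefficient j, among rows whose residual is at most max_residual."""
    keep = np.flatnonzero(residual <= max_residual)
    order = keep[np.argsort(coeffs[keep, j])]   # ascending by coefficient j
    return order[:top], order[::-1][:top]

# Toy setup: 4-dimensional rows, "eigen-vectors" are the first two axes
mean = np.zeros(4)
V = np.eye(4)[:, :2]
X = np.array([
    [ 3.0, 0.0, 0.0, 0.0],
    [-2.0, 0.0, 0.0, 0.0],
    [ 1.0, 0.0, 3.0, 0.0],   # poorly reconstructed
    [ 0.5, 1.0, 0.0, 0.5],
    [ 5.0, 0.0, 4.0, 4.0],   # largest coefficient, but poorly reconstructed
])
coeffs, residual = project(X, mean, V)
most_negative, most_positive = extreme_rows(coeffs, residual, j=0)
print(most_negative, most_positive)
```

Filtering on the residual first means the extreme examples shown are ones the eigen-vectors actually describe well, rather than rows with a large coefficient and a poor fit.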


##### Most negative coefficient 1:

![negcoef1mt.png](r_figures/negcoef1mt.png)

##### Most positive coefficient 1:

![poscoef1mt.png](r_figures/poscoef1mt.png)

##### Review of coefficient 1:
Large negative values of coefficient 1 correspond to TOBS values higher than average, with the difference peaking in the July-August time frame.
Large positive values of coefficient 1 correspond to TOBS values lower than average.


##### Most negative coefficient 2:

![negcoef2mt.png](r_figures/negcoef2mt.png)

##### Most positive coefficient 2:

![poscoef2mt.png](r_figures/poscoef2mt.png)

##### Review of coefficient 2:
A large positive coefficient 2 corresponds to a warmer summer in the July-August time frame, with a curve that is steeper around the peak than the average; a large negative coefficient 2 corresponds to a mild summer, with a curve that is lower than the average around the July-August time frame.


#### Best Reconstruction

The chart below shows the best reconstruction. We can see that the reconstructions using coefficient 1 and coefficient 2 are very close to the target, following the same trend and span.

![bestreconmt.png](r_figures/bestreconmt.png)


##### Finding structure in the dependency matrix


The matrices below show the normalized log probability for each pair of stations. The rows and columns in the plots are ordered using the first four eigen-vectors. As the plots show, there is no evident correlation between paired stations. Some stations show weak correlations: in the upper left plot, for example, the stations at positions 0-5 show some correlation with each other, but it is not very strong.

![corrmt.png](r_figures/corrmt.png)

### Regression Analysis on Elevation and Temperature

#### After the PCA analysis, we move on to regression analysis to see how much effect elevation has on each variable. The model is a linear regression fitted with the OLS (Ordinary Least Squares) method.

#### The explanatory (independent) variable X is elevation in all models below.

##### Regression on Elevation (X) and TMAX (Y)

We start with TMAX as the dependent variable Y. The null hypothesis is that elevation has no impact on TMAX. The regression result is shown in the table below.

The coefficient is -0.002, meaning that a one-unit increase in elevation is associated with a 0.002-unit decrease in maximum temperature.

The t-statistic is -8.54 and the p-value is reported as 0.000, which is strong evidence of statistical significance. We therefore reject the null hypothesis and conclude that elevation has a statistically significant negative impact on maximum temperature.

However, R-squared is 0.0324, meaning that elevation explains only 3.24% of the variance in TMAX.

![tmaxreg.pdf](r_figures/tmaxreg.pdf)
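For reference, the quantities quoted in these regression tables (coefficient, t-statistic, p-value and R-squared) can be computed for a one-predictor OLS model as in the sketch below. This is not the code that produced the tables: the function name `simple_ols` and the toy elevation and temperature values are made up, and the p-value uses a normal approximation to the t distribution, which is reasonable only when the sample is large.

```python
import math

def simple_ols(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot             # share of variance in y explained
    std_err = math.sqrt(ss_res / (n - 2) / sxx)   # standard error of the slope
    t_stat = slope / std_err
    # Two-sided p-value, normal approximation (assumption: large sample)
    p_approx = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return {"slope": slope, "intercept": intercept, "t": t_stat,
            "p_approx": p_approx, "r_squared": r_squared}

# Made-up data: temperature falling slightly with elevation
elevation = [0.0, 100.0, 200.0, 300.0, 400.0]
temperature = [20.1, 19.7, 19.6, 19.2, 19.0]
fit = simple_ols(elevation, temperature)
print(fit)
```

The slope answers "how much does Y change per unit of elevation", the t-statistic and p-value answer "could this slope be zero", and R-squared answers "how much of the variation in Y does elevation account for". The three can disagree, as several of the models here show.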

##### Regression on Elevation (X) and TMIN (Y)

We now move on to TMIN as the dependent variable Y. The null hypothesis is that elevation has no impact on TMIN. The regression result is shown in the table below.

The coefficient is -0.0001, meaning that a one-unit increase in elevation is associated with a 0.0001-unit decrease in minimum temperature.

The t-statistic is -0.66 and the p-value is 0.5069, which is higher than the cut-off of 0.05. We therefore fail to reject the null hypothesis: there is no statistically significant evidence that elevation affects minimum temperature.

Meanwhile, R-squared is 0.0002, meaning that elevation explains only 0.02% of the variance in TMIN.
Overall, this regression model has no practical meaning.

![tminreg.pdf](r_figures/tminreg.pdf)

##### Regression on Elevation (X) and TOBS (Y)

Next we look at the impact of elevation on TOBS, the temperature at the time of observation. The null hypothesis is that elevation has no impact on TOBS.

The coefficient is -0.0016, meaning that a one-unit increase in elevation is associated with a 0.0016-unit decrease in observed temperature.

The t-statistic is -4.02 and the p-value is 0.0001, which is strong evidence of statistical significance. We therefore reject the null hypothesis and conclude that elevation has a statistically significant negative impact on observed temperature.

However, R-squared is 0.0137, meaning that elevation explains only 1.37% of the variance in TOBS. The model describes the data poorly overall.

![tobsreg.pdf](r_figures/tobsreg.pdf)


#### After examining the relationship between elevation and the three temperature variables, let's move on to the three weather variables: PRCP, SNWD and SNOW.

##### Regression on Elevation (X) and PRCP (Y)

We start with PRCP (precipitation) as the dependent variable Y. The null hypothesis is that elevation has no impact on PRCP. The regression result is shown in the table below.

The coefficient is 0.0006, meaning that a one-unit increase in elevation is associated with a 0.0006-unit increase in precipitation.

The t-statistic is 10.0 and the p-value is reported as 0.000, which is strong evidence of statistical significance. We therefore reject the null hypothesis and conclude that elevation has a statistically significant positive impact on precipitation.

However, note that R-squared is 0.0331, meaning that elevation explains only 3.31% of the variance in PRCP.

![prcpreg.pdf](r_figures/prcpreg.pdf)

##### Regression on Elevation (X) and SNWD (Y)

We then use SNWD (snow depth) as the dependent variable Y. The null hypothesis is that elevation has no impact on snow depth. The regression result is shown in the table below.

The coefficient is 0.0001, meaning that a one-unit increase in elevation is associated with a 0.0001-unit increase in snow depth.

The t-statistic is 24.9 and the p-value is reported as 0.000, which is strong evidence of statistical significance. We therefore reject the null hypothesis and conclude that elevation has a statistically significant positive impact on snow depth.

Also note that R-squared is 0.2272, meaning that elevation explains 22.72% of the variance in SNWD, which is much higher than in the temperature models.

![snwdreg.pdf](r_figures/snwdreg.pdf)

##### Regression on Elevation (X) and SNOW (Y)

Finally, we use SNOW as the dependent variable Y. The null hypothesis is that elevation has no impact on snow. The regression result is shown in the table below.

The coefficient is positive but rounds to 0.000, meaning that an increase in elevation is associated with an increase in snow, although the effect is very small.

The t-statistic is 32.87 and the p-value is reported as 0.000, which is strong evidence of statistical significance. We therefore reject the null hypothesis and conclude that elevation has a statistically significant positive impact on snow.

Also note that R-squared is 0.3305, meaning that elevation explains 33.05% of the variance in SNOW.

![snowreg.pdf](r_figures/snowreg.pdf)



### Conclusion 

In general, the San Jose/Bay Area has a Mediterranean climate with rare snow and mild temperatures. PCA and regression analysis were performed on the weather data in order to find effects and patterns.

Six target variables were included in this analysis: three temperature variables (TMIN, TMAX, TOBS) and three weather variables (PRCP, SNOW, SNWD).

PCA is very helpful in analyzing the temperature variables TMIN, TMAX and TOBS. For TOBS in particular, about 60% of the variance is explained by the top 5 eigen-vectors, with the top eigen-vector alone explaining 50%. However, PCA showed little strength in explaining PRCP and SNOW, and among the weather variables only SNWD reached around 50%.

We also used linear regression (OLS) on all six variables to measure the impact of elevation on each of them. Elevation has a statistically significant (positive or negative) impact on five of the six variables; TMIN is the only exception. However, even where the p-values and t-statistics strongly support statistical significance, the temperature models have very low R-squared values, so elevation explains little of the variance in temperature. The three weather variables all show strong statistical significance for the effect of elevation; for SNOW in particular, R-squared indicates that elevation explains about one third of the variance.
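One way to see why a model can be highly significant and still have a very low R-squared: for a regression with one predictor and an intercept, `t^2 = R^2 * (n - 2) / (1 - R^2)`, so for a fixed R-squared the t-statistic grows with the number of observations n. The sketch below inverts that identity using the rounded t and R-squared values quoted in this report. The implied sample sizes are rough, since the inputs are rounded, and are not taken from the regression tables themselves.

```python
def implied_dof(t_stat, r_squared):
    """Residual degrees of freedom (n - 2) implied by a t-statistic and
    R-squared in a one-predictor OLS model with an intercept."""
    return t_stat ** 2 * (1.0 - r_squared) / r_squared

# (t-statistic, R-squared) as quoted in the sections above, rounded
reported = {
    "TMAX": (-8.54, 0.0324),
    "TMIN": (-0.66, 0.0002),
    "TOBS": (-4.02, 0.0137),
    "PRCP": (10.0, 0.0331),
    "SNWD": (24.9, 0.2272),
    "SNOW": (32.87, 0.3305),
}
dof = {name: implied_dof(t, r2) for name, (t, r2) in reported.items()}
for name, value in dof.items():
    print(f"{name}: roughly {value:.0f} residual degrees of freedom")
```

With samples on the order of a couple of thousand observations, even an R-squared of about 0.03 corresponds to a large t-statistic, so statistical significance here says little about how much of the variation elevation explains.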

 