# Montana Weather Analysis

This report presents a historical analysis of weather patterns in an area that approximately overlaps the eastern part of the state of Montana.

The data we use here comes from [NOAA](https://www.ncdc.noaa.gov/). Specifically, it was downloaded from this [FTP site](ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/).

We focused on three measurements of the dataset:
* **TMIN, TMAX:** the daily minimum and maximum temperature.
* **TDIFF:** the daily difference between maximum and minimum temperature.
* **PRCP:** the daily precipitation (in mm).
    
## Sanity-check1: comparison with outside sources

<p>We start by comparing some of the general statistics with graphs that we obtained from a site called <a href="http://www.usclimatedata.com/climate/montana/united-states/919" target="_blank">US Climate Data</a>. The graph below shows the daily minimum and maximum temperatures for each month, as well as the total precipitation for each month.</p>

<p>&nbsp;</p>

<p><img alt="Montana_Climate.png" src="images/Montana_Climate.png" /></p>

<p>&nbsp;</p>

<p>Here is the plot of the mean and standard deviation of TMAX and TMIN computed from our data. </p>

<p>&nbsp;</p>

<p><img alt="TMIN_TMAX_mean+std.png" src="images/TMIN_TMAX_mean+std.png" /></p>

<p>&nbsp;</p>

<p>We see that the minimum and maximum daily temperatures from US Climate Data agree with the ones we got from our data, once we convert Fahrenheit to Celsius. This agreement supports the validity of our data source. </p>
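
The unit conversion used in this comparison can be sketched as follows. This is a minimal illustration, not the code used to produce the plots; the function name `fahrenheit_to_celsius` and the sample values are hypothetical.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Hypothetical temperatures in Fahrenheit, for illustration only
# (these are not values read from the US Climate Data chart).
example_highs_f = [32.0, 50.0, 86.0]
example_highs_c = [round(fahrenheit_to_celsius(t), 1) for t in example_highs_f]
print(example_highs_c)  # [0.0, 10.0, 30.0]
```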

<p>&nbsp;</p>



## Sanity-check2: valid number of days for both TMAX and TMIN

<p>Since we are going to calculate the difference between TMAX and TMIN on a given day, we need to make sure that both measurements have a sufficient number of valid days. </p>

<p>&nbsp;</p>

<p><img alt="TMIN_TMAX_valid.png" src="images/TMIN_TMAX_valid.png" /></p>

<p>&nbsp;</p>

<p>We can clearly see that they both have around 2300 valid entries, which is enough for our analysis. </p>

<p>Later computation shows that there are actually 2368 records, and 18 records were discarded (because of missing data for either TMAX or TMIN).  </p>
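
One way to count valid entries per day is sketched below. It assumes each record is a list of daily values with missing days stored as NaN; the helper name `valid_counts_per_day` and the toy records are hypothetical and much smaller than the real data.

```python
import math

NAN = math.nan


def valid_counts_per_day(records):
    """For each day index, count how many records hold a non-missing value."""
    n_days = len(records[0])
    return [
        sum(1 for rec in records if not math.isnan(rec[day]))
        for day in range(n_days)
    ]


# Toy data: 3 hypothetical station-year records with 4 days each.
tmax_records = [
    [10.0, 12.0, NAN, 15.0],
    [11.0, NAN, NAN, 14.0],
    [9.0, 10.0, 11.0, 13.0],
]
print(valid_counts_per_day(tmax_records))  # [3, 2, 1, 3]
```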



## Analysis of temperature difference

### Data collection on temperature difference

<p>We chose to analyze the temperature difference on a given day, and how eigendecomposition might help us explain and reconstruct temperature difference patterns. </p>

<p>First of all, here is a plot of the average "TDIFF" throughout the year. </p>

<p>&nbsp;</p>

<p><img alt="TDIFF_average.png" src="images/TDIFF_average.png" /></p>

<p>&nbsp;</p>

<p>It is derived from the averages of TMAX and TMIN shown below. </p>

<p>&nbsp;</p>

<p><img alt="TMIN_TMAX_average.png" src="images/TMIN_TMAX_average.png" /></p>

<p>&nbsp;</p>

<p>To calculate TDIFF on a given day, we first go through the data set once, grouping by Station ID and Year.  This combination of keys uniquely identifies an entry in the database.  If we are careless here, we might end up subtracting temperatures from the same station in different years, or from the same year at different stations in different locations (in which case the result would be meaningless). </p>

<p>Here is the TDIFF table after processing: </p>
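
The keying described above can be sketched in plain Python. This is an illustration under assumptions, not the code used in the analysis: records are assumed to be dictionaries keyed by `(station, year)`, and the function name `compute_tdiff`, the station IDs, and the values are all hypothetical.

```python
import math


def compute_tdiff(tmax_by_key, tmin_by_key):
    """Pair TMAX and TMIN by (station, year) and subtract day by day.

    Returns (tdiff_by_key, dropped_keys). A key that lacks either
    measurement is dropped rather than paired with another record.
    """
    tdiff = {}
    dropped = []
    for key in sorted(set(tmax_by_key) | set(tmin_by_key)):
        if key not in tmax_by_key or key not in tmin_by_key:
            dropped.append(key)
            continue
        tmax, tmin = tmax_by_key[key], tmin_by_key[key]
        # A NaN in either input propagates to a NaN in the difference.
        tdiff[key] = [hi - lo for hi, lo in zip(tmax, tmin)]
    return tdiff, dropped


# Toy data with hypothetical station IDs and 3 days per record.
tmax = {
    ("STATION_A", 1990): [10.0, 12.0, math.nan],
    ("STATION_A", 1991): [8.0, 9.0, 11.0],
    ("STATION_B", 1990): [5.0, 6.0, 7.0],
}
tmin = {
    ("STATION_A", 1990): [2.0, 5.0, 1.0],
    ("STATION_B", 1990): [1.0, 1.5, 2.0],
}
tdiff, dropped = compute_tdiff(tmax, tmin)
print(dropped)  # [('STATION_A', 1991)]
```

Because the subtraction only ever happens inside one `(station, year)` key, values from different stations or different years are never mixed.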

<p>&nbsp;</p>

<p><img alt="TDIFF_table.png" src="images/TDIFF_table.png" /></p>

<p>&nbsp;</p>

### PCA analysis on temperature difference

<p>First, let us see the percentage of variance explained in TMIN and TMAX: </p>

<p>&nbsp;</p>

<p><img alt="TMIN_TMAX_var.png" src="images/TMIN_TMAX_var.png" /></p>

<p>&nbsp;</p>

<p>We can see that the top 5 eigenvectors explain about 20% to 25% of the variance.  We would expect the top 5 eigenvectors to explain about the same amount of variance for TDIFF.  Note that if two symmetric matrices A and B commute, that is AB = BA, they share a common set of eigenvectors, and those eigenvectors are also eigenvectors of A+B.  This condition is not satisfied in our case, so the eigenvectors for TDIFF cannot simply be read off from those of TMAX and TMIN, and it is worth performing PCA on TDIFF separately. </p>
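
The commuting condition AB = BA can be checked numerically. The sketch below uses small hand-picked 2x2 matrices, not the actual TMAX and TMIN covariance matrices; the helper names `matmul` and `commute` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def commute(a, b, tol=1e-12):
    """Return True if AB and BA agree entry by entry (within tol)."""
    ab, ba = matmul(a, b), matmul(b, a)
    return all(
        abs(ab[i][j] - ba[i][j]) <= tol
        for i in range(len(ab))
        for j in range(len(ab[0]))
    )


# A and B are symmetric and share the eigenvectors (1, 1) and (1, -1).
A = [[2.0, 1.0], [1.0, 2.0]]
B = [[3.0, -1.0], [-1.0, 3.0]]
# C is symmetric, but its eigenvectors are the coordinate axes.
C = [[1.0, 0.0], [0.0, 5.0]]

print(commute(A, B))  # True
print(commute(A, C))  # False
```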

<p>Here is the percentage of variance explained in TDIFF: </p>

<p>&nbsp;</p>

<p><img alt="TDIFF_var.png" src="images/TDIFF_var.png" /></p>

<p>&nbsp;</p>

<p>We can see that the top 5 eigenvectors explain about 25% of the variance, which meets our expectation. </p>

<p>Next, let us look at the top 3 eigenvectors along with the mean of TDIFF. </p>
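
The percentage of variance explained can be computed from the eigenvalues of the covariance matrix. The following is a minimal NumPy sketch on random toy data; it assumes a data matrix with no missing values, whereas the real data contains missing days that need extra handling. The function name `cumulative_explained_variance` is hypothetical.

```python
import numpy as np


def cumulative_explained_variance(data):
    """Cumulative fraction of variance explained by the top-1, top-2, ... eigenvectors.

    data: 2-D array with one row per record and one column per day.
    Assumes there are no missing values (NaN) in `data`.
    """
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order,
    # so reverse them to put the largest first.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    return np.cumsum(eigvals) / eigvals.sum()


# Random toy data standing in for the real TDIFF matrix.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 10))
fractions = cumulative_explained_variance(toy)
print(f"top 5 eigenvectors explain {fractions[4]:.1%} of the variance")
```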

<p>&nbsp;</p>

<p><img alt="TDIFF_eig.png" src="images/TDIFF_eig.png" /></p>

<p>&nbsp;</p>

<p>From the above graph, we can see that the first eigenfunction stays stable and acts as a baseline.  The second eigenfunction fluctuates more, but it captures the shape of the TDIFF mean from July to October.  The third eigenfunction captures the shape of the TDIFF mean from May to July, as well as the peak around October. </p>

### Reconstruction of temperature difference

<p>Here is the plot of best reconstruction. </p>

<p>&nbsp;</p>

<p><img alt="best_recon.png" src="images/best_recon.png" /></p>

<p>&nbsp;</p>

<p>Here, we chose not to use the mean as a vector to reconstruct TDIFF, because it does not explain much. From the graph, we can see that the top 3 eigenvectors reconstruct part of TDIFF, at least its overall shape. Compare this with the following graph:</p>
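
Reconstruction from the top eigenvectors amounts to projecting a signal onto them and summing the projections. The sketch below shows this on random toy data with NumPy. It is not the `Decompose` function from the analysis code: the names `reconstruct` and `residual_fraction` are hypothetical, the signal is assumed to have no missing values, and using an uncentered matrix (because the mean is not used as a vector) is an assumption of this sketch.

```python
import numpy as np


def reconstruct(signal, eigvecs, k):
    """Project `signal` onto the first k columns of `eigvecs` and rebuild it."""
    basis = eigvecs[:, :k]       # orthonormal columns, shape (n_days, k)
    coeffs = basis.T @ signal    # one coefficient per eigenvector
    return basis @ coeffs


def residual_fraction(signal, approx):
    """Share of the signal's squared norm left unexplained by `approx`."""
    return float(np.sum((signal - approx) ** 2) / np.sum(signal ** 2))


# Random toy data standing in for the real TDIFF matrix (100 records, 6 days).
rng = np.random.default_rng(1)
toy = rng.normal(size=(100, 6))
# Uncentered second-moment matrix: an assumption, since the mean is not removed.
second_moment = toy.T @ toy / toy.shape[0]
_, vecs = np.linalg.eigh(second_moment)  # eigenvalues come back in ascending order
vecs = vecs[:, ::-1]                     # put the top eigenvectors first

signal = toy[0]
residuals = [
    residual_fraction(signal, reconstruct(signal, vecs, k)) for k in range(1, 7)
]
print([round(r, 3) for r in residuals])
```

The residual fraction shrinks as more eigenvectors are added and reaches zero (up to rounding) once all of them are used.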

<p>&nbsp;</p>

<p><img alt="TDIFF.png" src="images/TDIFF.png" /></p>

<p>&nbsp;</p>

<p>Here is the plot of reconstructions using other residual parameters: </p>

<p>&nbsp;</p>

<p><img alt="plot_recon.png" src="images/plot_recon.png" /></p>

<p>&nbsp;</p>

## Conclusion

<p>From the above analysis, we can see that even though the top 5 eigenvectors explain only about 25% of the variance, we can still reconstruct the plot of TDIFF, at least its overall shape.  Also, even given the full TMAX and TMIN data, the variance explained does not improve much, because the TMAX and TMIN matrices do not commute with each other.  During the implementation stage, I noticed a bug in the function Decompose (cell 23) in my code, so I manually rewrote the function in plain Python without Spark features.  This slows down that part of the code, but the code still completes the computation successfully.  </p>