# Minnesota and Ontario Region Weather Analysis

We seek to analyze weather patterns along a section of the border between the United States and Canada, specifically the Minnesota and Ontario region. The region from which observations were analyzed is shown in the figure below.
<p><img alt="map_circles.png" src="report_figures/map_circles.png"/>**Figure 1:** Observations by location. A larger radius indicates more observations in the area.</p>

<p>
The dataset was originally obtained from the [National Oceanic and Atmospheric Administration](https://www.ncdc.noaa.gov/) website. It contains annually reported data from weather stations in the area. Most of our data is concentrated in the middle to southern part of the bounding box, so the remaining data points, which lie farther from the equator, will likely have lower average temperatures. The region is also close to bodies of water, which likely contributes to lower average temperatures.

We focus on six measurements, although others were available:
* **TMIN:** The daily minimum temperature.
* **TMAX:** The daily maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (reported in tenths of a mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.

Notice that temperature accounts for three of the measurements, while the other three concern rain and snow. There is no direct humidity measurement, but the amounts of rain and snow are partly explained by humidity.

## Data Check
As a quick sanity check, we also pulled data from the <a href="http://www.usclimatedata.com/climate/minnesota/united-states/3193" target="_blank">US Climate Data</a> website, which reports minimum and maximum temperatures by month as well as the average precipitation per month. The relevant figure is shown below so that it can be compared to the data from the NOAA dataset.
</p>
<p><img alt="sanity_check.png" src="report_figures/sanity_check.png"/>**Figure 2:** Alternate data to be compared to NOAA data.</p>
<p><img alt="tmin_tmax.png" src="report_figures/tmin_tmax.png"/>**Figure 3:** NOAA maximum and minimum temperature data, for comparison with the usclimatedata.com data.</p>
<p>The NOAA dataset covers a larger area than International Falls alone. Nonetheless, the two datasets occupy very similar ranges. The mean minimum and mean maximum temperatures from both datasets follow the same curve, so we can be more confident that the NOAA dataset is reporting valid data.</p>
<p><img alt="precipitation.png" src="report_figures/precipitation.png"/>**Figure 4:** NOAA precipitation, reported in tenths of a mm per day, converted to a monthly rate in mm.</p>
<p>Precipitation was reported in different units in the two sources. When we convert the NOAA data to a monthly rate in mm, it follows the shape of the average monthly rainfall in mm at International Falls, the geographic center of the observed region. Plotted from January to December, the monthly rainfall resembles a bell curve. The bell shape has no mathematical significance here, since the peak can be shifted by starting the plot at a different month.</p>
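
The unit conversion described above can be sketched as follows. This is a minimal illustration, assuming one non-leap year of daily values in tenths of a mm and assuming that the monthly rate means the total per calendar month. The function name `tenths_mm_to_monthly_mm` and the reference year are hypothetical choices, not taken from the original analysis.

```python
import calendar


def tenths_mm_to_monthly_mm(daily_tenths_mm, year=2001):
    """Sum daily precipitation (tenths of a mm) into 12 monthly totals in mm.

    daily_tenths_mm: sequence of 365 daily values for a non-leap year.
    year: hypothetical non-leap reference year used only for month lengths.
    """
    if len(daily_tenths_mm) != 365:
        raise ValueError("expected 365 daily values")
    monthly_mm = []
    start = 0
    for month in range(1, 13):
        days = calendar.monthrange(year, month)[1]  # number of days in this month
        chunk = daily_tenths_mm[start:start + days]
        monthly_mm.append(sum(chunk) / 10.0)  # tenths of a mm -> mm
        start += days
    return monthly_mm
```

If the intended rate is instead the mean mm per day within each month, dividing each total by the number of days in that month gives that variant.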

# Principal Components
For each of the six measurements of interest, we compute the percentage of the variance in the data that is explained as a function of the number of eigenvectors used.

<p><img alt="var_expl1.png" src="report_figures/var_expl1.png"/>**Figure 5**: Temperature component variance explained based on number of eigenvectors used.</p>
<p>
The curves resemble a logarithmic curve: with few eigenvectors the percentage explained increases quickly, and each additional eigenvector contributes less and less. TOBS is clearly explained better by a small number of eigenvectors than TMIN and TMAX. Intuitively, when little information is available, regression toward the mean tells us that the reconstructed measurement is likely to be somewhat close to the mean.
</p>
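
The variance-explained curves can be illustrated with a short sketch. It assumes a matrix `X` with one row per station-year and one column per day, with no missing values. The function name `cumulative_variance_explained` is hypothetical, and the original analysis may have handled missing data and the covariance computation differently.

```python
import numpy as np


def cumulative_variance_explained(X):
    """Fraction of total variance explained by the top-k eigenvectors, for k = 1..n.

    X: array of shape (n_samples, n_days), assumed to contain no NaN values.
    """
    cov = np.cov(X, rowvar=False)            # (n_days, n_days) covariance of the columns
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order, so reverse
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative round-off
    return np.cumsum(eigvals) / eigvals.sum()
```

Plotting the returned array against `k` would give a curve of the kind described here: steep at first, then flattening as additional eigenvectors add little.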

<p><img alt="var_expl2.png" src="report_figures/var_expl2.png"/>**Figure 6**: Condition component variance explained based on number of eigenvectors used.</p>
<p>
In Figure 6, SNWD is clearly explained by fewer eigenvectors than SNOW and PRCP. This is intuitive because SNWD does not vary much across data points: it is only a meaningful measurement when there is actually snow, which is not every day of the year. Note that when using a large number of eigenvectors, we start to overfit to the observed dataset rather than capture natural patterns explained by the components.
</p>

# Snow and Snow Depth Variance
Snow and snow depth are clearly related, since snow depth can only be nonzero when snow has fallen. A quick calculation shows that the correlation coefficient between the SNOW and SNWD means over a year is 0.78637037. This is quite high, so the two can reasonably be called strongly correlated.
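
A correlation coefficient of this kind can be computed as in the sketch below. It assumes two equal-length sequences, for example the mean SNOW and mean SNWD for each day of the year. The function name `pearson_r` is hypothetical, and the exact inputs used in the original calculation are not stated.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

The result lies between -1 and 1, with values near 1 indicating that the two series rise and fall together.
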
<p><img alt="snow_snwd_coeff.png" src="report_figures/snow_snwd_coeff.png"/>**Figure 7**: Snow and Snow Depth Explained By Eigenvectors and Scalings</p>
<p>
In Figure 7 it is apparent that when there is no snow, in the summer to early fall months, there is no snow depth. Again this makes sense intuitively, and it means that SNWD is explained by relatively few eigenvectors. All of the displayed eigenvectors for SNOW and SNWD capture that there is little to no snow in the middle months, but the SNOW eigenvectors do not display clear patterns through the winter and spring months.


In both plots, the 0-line is the eigenvector explaining the most variance, and each subsequent eigenvector explains the next most variance. We see that later eigenvectors start to explain more of the extremes not captured by earlier ones, but none of them influences the mean all that much.
</p>

# Station Relation
We seek to explore how stations are related.

<p><img alt="block.png" src="report_figures/block.png"/>**Figure 8**: Block diagonal by station index. The left plot corresponds to the component with the highest coefficient and the right plot to the component with the second highest coefficient.</p>
<p>
It makes sense for stations to be related, as they are in the same region and should be reporting related information. For the first component we see this is the case: stations in the same index range are darker colored, which means they are more closely related. The second component shows less of this pattern, as it explains a different source of variation, which may be driven by time rather than location. Notice that the plot is symmetric about the diagonal, so most of the information can be seen by focusing on the dark green blocks.</p>

<p><img alt="dataMap.png" src="report_figures/dataMap.png"/>**Figure 9**: Station components projected onto a geographical map. </p>

In Figure 9 we can directly see which stations are related by location and how much they influence the coefficients. Larger triangles mean larger loadings when the eigenvectors are scaled. Stations that are clustered together look visually similar. This makes sense, as nearby stations should be gathering measurements with little variation between them.

# Variance By Precipitation

We want to see how much of the data can be explained by precipitation as a component, using a couple of eigenvectors. A good reconstruction would mean that precipitation can explain many other factors, such as wind speed, amount of daylight, and ground hardness. This is not expected to be the case.

<p><img alt="varPRCP.png" src="report_figures/varPRCP.png"/>**Figure 10**: Residual variance after subtracting the mean and the top 2 loadings.</p>

The residual is clearly high when only a couple of eigenvectors are used. Using a few more does not help much, and using too many overfits the data. Nonetheless, with many data points we begin to see a trend: about 20% of the variance is explained by precipitation. This is a non-trivial amount, so precipitation clearly plays a meaningful role in the weather. This makes sense because water affects many environments, and many differences between environments are due to the amount of rainfall a region receives.
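
The residual computation can be sketched as below. It assumes a matrix `X` of precipitation values with one row per station-year, one column per day, and no missing values. The function name `residual_variance_fraction` is hypothetical, and the original analysis may have normalized the residual differently, for example per row rather than over the whole dataset.

```python
import numpy as np


def residual_variance_fraction(X, k=2):
    """Fraction of variance left after removing the mean and the top-k projections.

    X: array of shape (n_samples, n_days), assumed to contain no NaN values.
    """
    centered = X - X.mean(axis=0)               # subtract the mean vector
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]               # the k leading eigenvectors as columns
    coeffs = centered @ top                     # loadings of each row on those eigenvectors
    residual = centered - coeffs @ top.T        # what the top-k reconstruction misses
    return float((residual ** 2).sum() / (centered ** 2).sum())
```

Under these assumptions, one minus the returned value is the share of variance captured by the top `k` eigenvectors.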