# Clustering of US Hydropower 
**This project is for GEOG5990 Geospatial Big Data at the University of Colorado Denver  
Spring 2020  
Student contributor: Quinsen Joel**  

![title](2560px-Power_Grid_-_Flickr_-_brewbooks_(1).jpg)




## Introduction

Hydropower, despite its contentious history of environmental and public-acceptance concerns, continues to be touted as a major player in the transition to a renewable energy future. Hydropower has immense potential to provide both utility-scale generation and storage of renewable power, and innovations such as Closed-Loop Pumped Hydro Systems, Small-Scale Hydropower, and Retrofitted Hydropower aim to address these historical concerns. A broad body of research is working to integrate these new solutions into the US energy market, and a key part of that discussion is the study of existing hydropower systems.

##### Segmentation of US Hydropower

One way of studying the existing technology is through *segmentation*, or the division of multi-variable data into distinct clusters or classes. Segmenting existing hydropower sites can help with two broad objectives:  

1. narrowing future studies to existing sites with comparable characteristics (for example, studying more about what makes a particular set of hydropower sites produce a certain level of efficiency) and   

2. identifying gaps in coverage of relevant variables and helping to prioritize new research and development (for example, realizing we'd like to encourage more hydropower sites owned by private companies to generate high capacity).   

If distinct segments exist among current hydropower sites, they would provide a useful starting point for developing new hydropower or improving existing sites. 

##### Clustering Analysis

This project takes an exploratory approach: the goal is to discover segments of hydropower sites using K-Means clustering, with Principal Components Analysis (PCA) applied to the features as a transformation beforehand. Generally, a segmentation is preferred when its clusters exhibit the most "within-cluster" similarity and the most "between-cluster" difference.

## ORNL-EHA Data

The National Hydropower Asset Assessment Program (NHAAP), funded by the Department of Energy (DOE), is the nation's leading effort to assess and expand hydropower technology. Under this program, the Oak Ridge National Laboratory (ORNL) in Tennessee has produced a geospatial dataset of all existing hydropower assets (EHA) currently operating in the US (https://hydrosource.ornl.gov/market-info-and-data/existing-hydropower-assets). This dataset offers the opportunity to perform the type of segmentation needed to understand the state of current US hydropower.

##### Features and Observations
The dataset comes with 34 features and a total of 2310 observations (hydropower assets). The features can be divided into three categories: 
1. **"local"** (features that describe geography, owners, or other highly "localized" qualities), 
2. **"global"** (features that describe widely comparable qualities such as energy capacity, ownership type, or sector), and 
3. **"irrelevant"** (features such as identifiers and license numbers). 

These different feature types have different implications for clustering. Additionally, a natural segmentation follows from the levels of the 'Type' variable: these are the three major technologies in use in the dataset, so the data are split by technology to make the clusters more meaningful. A future study could combine all technologies, create comparable metrics for them, and cluster them together. 

![image.png](../Figures/features.png)

*Table 1. "Global" features highlighted in blue, "local" features highlighted in green, and "irrelevant" features highlighted in orange.*


##### Preprocessing
Some standard preprocessing checks are required to perform the clustering methods in this project. 

1. First, the clustering features should be chosen. All 'global' variables will be considered.

2. Next, the data are manually segmented by splitting on "Type", which yields three datasets.

3. Categorical variables must then be chosen carefully and "one-hot" recoded as numeric: 0/1 'dummy' variables are added to represent their different levels.  

4. Finally, some data transformations must be made: all features are scaled so that distances in the clustering algorithms are represented more accurately, and PCA is applied to reduce the number of features required in the analysis. 
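
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the project's actual code: the column names `capacity_mw` and `owner_type`, the toy values, and the choice of two principal components are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the EHA table; column names and values are hypothetical.
rng = np.random.default_rng(0)
eha = pd.DataFrame({
    "Type": np.repeat(["HY", "PS", "HYPS"], [40, 10, 10]),
    "capacity_mw": rng.gamma(2.0, 50.0, size=60),
    "owner_type": np.tile(["Federal", "Private", "Public"], 20),
})


def preprocess(df, n_components=2):
    """Return (encoded, scaled, PCA-transformed) feature tables for one technology."""
    features = df.drop(columns=["Type"])
    # One-hot recode the categorical column as 0/1 dummy variables.
    encoded = pd.get_dummies(features, columns=["owner_type"], dtype=float)
    # Standardize every column to zero mean and unit variance.
    scaled = pd.DataFrame(
        StandardScaler().fit_transform(encoded),
        index=encoded.index,
        columns=encoded.columns,
    )
    # Project the scaled features onto the leading principal components.
    components = PCA(n_components=n_components).fit_transform(scaled)
    pca = pd.DataFrame(
        components,
        index=encoded.index,
        columns=[f"PC{i + 1}" for i in range(n_components)],
    )
    return encoded, scaled, pca


# Split on "Type" first, then preprocess each technology separately.
datasets = {tech: preprocess(group) for tech, group in eha.groupby("Type")}
```

Keeping the original row index on every derived table is a deliberate choice here: it lets cluster labels computed on the scaled or PCA-transformed data be matched back to the untransformed rows later.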

The resulting datasets after preprocessing are as follows:

1. Hydropower sites (**HY**) - 50 features, 1841 observations. 
2. Hybrid Hydropower-Pumped Hydropower (**HYPS**) - 54 features, 14 observations, and
3. Pumped Storage Hydropower sites (**PS**) - 50 features, 24 observations.

<!-- ![title](../Figures/descriptiveoriginal.png) -->

<img src="../Figures/descriptives.png" alt="Drawing" style="width: 1000px;"/>

*Table 2. Descriptive statistics for original features.*


## Methods

This project specifically excludes 'local' features in order to find the most geographically agnostic clustering possible. Whether a feature should be considered local or global is an interesting discussion in its own right, and future segmentation studies should consider making use of geography, owner name, and other more local features (perhaps with hierarchical clustering). Here, however, geographical features are left out in the hope of discovering globally relevant clusterings. Table 1 shows which features are treated as local and which as global.

##### Sweep Clustering
As this is an exploratory project, unsupervised learning is used to discover clusters without assuming a clustering beforehand. Because K-means requires the number of clusters to be specified before running, a "Sweep" of K-means clustering is performed from k=2 to k=8, and the results are evaluated using the *Pseudo F Score (Calinski-Harabasz Criterion)* and the *Silhouette Score*. 
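
A sweep of this kind might look like the sketch below. It runs on synthetic blob data rather than the EHA features, and the helper name `sweep_kmeans`, the `n_init` value, and the random seed are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in data: three well-separated blobs (not the EHA features).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)


def sweep_kmeans(X, k_values=range(2, 9), random_state=0):
    """Fit K-means for each k and record both evaluation scores."""
    results = []
    for k in k_values:
        labels = KMeans(
            n_clusters=k, n_init=10, random_state=random_state
        ).fit_predict(X)
        results.append({
            "k": k,
            "pseudo_f": calinski_harabasz_score(X, labels),
            "silhouette": silhouette_score(X, labels),
        })
    return results


sweep = sweep_kmeans(X)
# Pick the k with the highest Silhouette Score.
best = max(sweep, key=lambda row: row["silhouette"])
print(best["k"])
```

Both scores need at least two clusters, which is one reason the sweep starts at k=2.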

This scheme is performed for a total of nine datasets: HY, HY-scaled, HY-PCA_transformed, HYPS, HYPS-scaled, HYPS-PCA_transformed, PS, PS-scaled, PS-PCA_transformed. The purpose of these permutations is to illustrate how clustering results differ across technologies and across transformations of the data.

All results are interpreted in terms of the original features; i.e., cluster labels derived from scaled or PCA-transformed versions of the features are applied to the original features and analyzed. The row indices shared by the cluster labels and the data make this possible.
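
As a small illustration of mapping labels back to the untransformed data, the sketch below clusters a scaled copy of a toy table and then attaches the labels to the original rows by index. The table, its column names, and the choice of k=2 are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical original features, indexed by a site identifier.
original = pd.DataFrame(
    {
        "capacity_mw": [1.0, 2.0, 1.5, 900.0, 950.0, 1000.0],
        "units": [1, 2, 1, 6, 8, 7],
    },
    index=["a", "b", "c", "d", "e", "f"],
)

# Scaling returns a NumPy array whose rows are in the same order as `original`.
scaled = StandardScaler().fit_transform(original)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Re-attach the labels using the original index, then summarize each cluster
# in the units of the original features.
labelled = original.assign(cluster=pd.Series(labels, index=original.index))
profile = labelled.groupby("cluster").mean()
print(profile)
```

The cluster averages in `profile` are expressed in megawatts and unit counts rather than in standardized or principal-component units, which is what makes them interpretable.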

The Python package *scikit-learn* was used to perform the analysis.

## Results
After running the analysis, the following was found:

For **HY**, 6 clusters were computed in total, of which 3 were found to be relevant.


<!-- ![title](../Figures/HYresults.png) -->

<img src="../Figures/HYresults.png" alt="Drawing" style="width: 700px;"/>

*Figure 1. Sweep Clustering results for HY. On the left are feature distributions by cluster, and on the right are Sweep clustering metric results per k clusters.*


For **HYPS**, 7 clusters were computed in total, of which 3 were found to be relevant.

<!-- ![title](../Figures/HYPSresults.png) -->

<img src="../Figures/HYPSresults.png" alt="Drawing" style="width: 700px;"/>

*Figure 2. Sweep Clustering results for HYPS. On the left are feature distributions by cluster, and on the right are Sweep clustering metric results per k clusters.*

For **PS**, 6 clusters were computed in total, and all 6 were found to be relevant.
<!-- ![title](../Figures/PSresults.png) -->

<img src="../Figures/PSresults.png" alt="Drawing" style="width: 700px;"/>

*Figure 3. Sweep Clustering results for PS. On the left are feature distributions by cluster, and on the right are Sweep clustering metric results per k clusters.*




<img src="../Figures/clusters_describe.png" alt="Drawing" style="width: 1000px;"/>

*Table 3. Averaged feature values for the final clusters. "Noise" clusters are included.*


## Analysis and Recommendations

The Sweep Clustering analysis supports two broad conclusions. First, scaled data produces more separated clusters, and PCA-transformed data even more so. Second, there is significant variation between the technologies in terms of feature clustering. 

The Pseudo F score did not instill confidence in clustering validity because most plots of the score were non-decreasing in k. It is quite possible that the Pseudo F score simply rises monotonically with the number of clusters on these data. Because of this, the Silhouette score was used to select the final k. Clusters with too few observations are regarded as "Noise" clusters because they are unlikely to generalize to future datasets. 
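
One simple way to separate "Noise" clusters from relevant ones is a minimum-size rule, sketched below. The threshold `min_size=5` and the helper name `flag_noise_clusters` are assumptions for illustration; the cutoff actually used in this project is not stated here.

```python
import numpy as np


def flag_noise_clusters(labels, min_size=5):
    """Map each cluster label to True if it has fewer than `min_size` members.

    `min_size` is a hypothetical threshold chosen for this example.
    """
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): bool(c < min_size) for v, c in zip(values, counts)}


# Toy label vector: two sizeable clusters and one with only two members.
labels = np.array([0] * 40 + [1] * 25 + [2] * 2)
noise = flag_noise_clusters(labels)
relevant = [cluster for cluster, is_noise in noise.items() if not is_noise]
print(noise, relevant)
```

A size rule like this is only a first filter; whether a small cluster is spurious or a genuine group of unusual sites remains a judgment call.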

##### Interpretation
Interpretation of clusters is important for operationalizing them. Descriptions of the original features in terms of their final cluster labels provide insight into each cluster's characteristics. One important note concerns the treatment of "Noise" clusters. These consisted of very few observations and could be regarded as spurious products of the clustering algorithm. However, only further analysis and hydropower domain expertise could decide whether these outliers are worth keeping. 

##### Validation
Unsupervised clustering presents the challenge of validating results against their intended practical use. Here, the Silhouette Score was regarded as the main validating criterion, but other methods exist. Namely, clusters could be compared with a-priori clusters, or "seeds", proposed on the basis of hydropower domain expertise.
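
If a-priori labels were available, one possible way to quantify agreement with the discovered clusters is the adjusted Rand index from scikit-learn. This is a suggestion rather than something done in this project, and the label vectors below are invented for the example.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical expert-assigned groups for six sites (0 and 1 are arbitrary codes).
expert_labels = [0, 0, 0, 1, 1, 1]

# Cluster labels that match the expert grouping up to a renaming of the clusters.
cluster_labels = [1, 1, 1, 0, 0, 0]
agreement = adjusted_rand_score(expert_labels, cluster_labels)

# Cluster labels that place one site in the "wrong" group.
partial = adjusted_rand_score(expert_labels, [1, 1, 0, 0, 0, 0])
print(agreement, partial)
```

The index ignores how clusters are numbered, so a perfect match scores 1.0 even when the label codes differ, and values near 0 indicate agreement no better than chance.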

##### Improvement
To obtain better clustering, we could try different intrinsic methods, such as running other clustering algorithms, hypothesis testing between clusters, and adding more data points to solidify the clustering metrics. Additionally, we could try different feature schemes based on pre-segmentations of the data informed by hydropower domain expertise.

## References

“EXISTING HYDROPOWER ASSETS.” HydroSource, hydrosource.ornl.gov/market-info-and-data/existing-hydropower-assets.

“Scikit-Learn.” Scikit, scikit-learn.org/stable/.

https://qjoel6398.github.io

