# Compile existing and newly digitized solar array datasets and remove rooftop arrays
This is a multi-step script.
1. First, we compile existing and newly digitized array boundaries.
2. Then we use Google Earth Engine (GEE) to remove arrays that intersect building rooftops, leaving only ground-mounted solar arrays.

**Compile datasets**
* This involves pulling in the outputs from `script1` and `script2`.
* Before spatially joining, we'll explode the datasets, remove sub-shapes by intersection and spatial quality, then dissolve by nativeID and Source to return the best available dataset.

**Use Google Earth Engine (GEE) to Remove Rooftop Arrays from Existing Solar Array Datasets**
* Inputs: *existingSolarArrayShapes.shp* local shape file and asset upload to GEE.
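* Uses GEE to pull in a building footprint dataset, calculate its intersection with our newly compiled solar array dataset, and remove arrays that are likely rooftop mounted. The [USA Structures](https://gee-community-catalog.org/projects/usa_structures/?h=ornl) dataset is one option; the code below reads the asset set in `buildingAssetPath` and keeps the USA Structures calls commented out.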
* Output: Ground mounted solar array dataset.

## Import Libraries

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import geopandas as gpd
import os 
import ee
import geemap

# Load config file
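def load_config(filename):
    """Read key=value pairs from a text file, converting numeric values."""
    config = {}
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, which would otherwise break the split below
            if not line:
                continue
            # Split on the first '=' only, so a value may itself contain '='
            key, value = line.split('=', 1)
            key, value = key.strip(), value.strip()
            # Try to convert to numeric values if possible
            try:
                value = float(value) if '.' in value else int(value)
            except ValueError:
                pass  # Leave as string if not a number
            config[key] = value
    return config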
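
The loader expects one `key=value` pair per line. The sketch below shows that format and the numeric conversion on an in-memory string. The values are placeholders rather than the project's settings, and `parse_config_text` is a hypothetical helper that mirrors the conversion rule in `load_config`.

```python
# Hypothetical stand-in for the contents of config.txt; values are placeholders.
sample_config_text = """build_threshold=50
gee_crs=4326
minPanelRowArea=2.5
label=example"""

def parse_config_text(text):
    """Parse key=value lines, converting numeric values as load_config does."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, value = line.split('=', 1)
        try:
            # Values containing '.' become floats, other numbers become ints
            value = float(value) if '.' in value else int(value)
        except ValueError:
            pass  # leave non-numeric values as strings
        config[key.strip()] = value
    return config

config = parse_config_text(sample_config_text)
print(config)
```

Integer-looking values stay integers, which matters for keys such as `gee_crs` that are later passed as an EPSG code.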

## Initialize GEE

In [3]:
# Trigger the GEE authentication
ee.Authenticate()

# Initialize the cloud project
ee.Initialize(project='ee-stidjaco')


Successfully saved authorization token.


## Set paths and variables

In [4]:
# Set folder paths
wd = r'S:\Users\stidjaco\R_files\BigPanel'
downloaded_path = os.path.join(wd, r'Data\Downloaded')
derived_path = os.path.join(wd, r'Data\Derived')
derivedTemp_path = os.path.join(derived_path, r'intermediateProducts')

# Set initial solar array shapefile locations to compile
existingArraysPath = os.path.join(derivedTemp_path, r'existingDatasetArrayShapes.shp')
georectifiedArraysPath = os.path.join(derivedTemp_path, r'georectifiedSolarArrays.geojson')

# Set compiled solar array and panel shapefile locations (asset and local)
arraysLocalPath = os.path.join(derivedTemp_path, r'compiledArrayDataset.shp') # We create this file in this script
panelsLocalPath = os.path.join(derivedTemp_path, r'existingDatasetPanelShapes.shp') # This exists, and is created in script1
arraysAssetPath = r'projects/ee-stidjaco/assets/BigPanel/compiledArrayDataset'

# GM-SEUS initial output paths
gmseusArraysInitPath = os.path.join(derivedTemp_path, r'initialGMSEUS_Arrays.shp')
gmseusPanelsInitPath = os.path.join(derivedTemp_path, r'initialGMSEUS_Panels.shp')
gmseusGetPanelsPath = os.path.join(derivedTemp_path, r'initialGMSEUS_getPanels.shp')

# Get building database path
buildingAssetPath = r"projects/sat-io/open-datasets/VIDA_COMBINED/USA"

# Load the config from the text file
config = load_config('config.txt')

# Set the rooftop threshold: the percent of an array's area overlapping building footprints above which the array is treated as rooftop
build_threshold = config['build_threshold'] # 50% of array area overlapping building footprints

# Other variables
gee_crs = config['gee_crs'] # native projection of Google Earth Engine exports
minPanelRowArea = config['minPanelRowArea'] # minimum area of panel to be considered

# Get US Boundary to subset global/non-CONUS datasets
CONUS_path = os.path.join(wd, r'Data\Downloaded\CONUS_NoGreatLakes\CONUS_No_Great_Lakes.shp')
existingArrays = gpd.read_file(existingArraysPath) # Existing arrays shapefile
US_boundary = gpd.read_file(CONUS_path) # CONUS boundary shapefile
US_boundary = US_boundary.set_crs(epsg=4269) # Native projection of US boundary - NAD83
US_boundary = US_boundary.to_crs(existingArrays.crs) # Transform to projection of existingArrays
US_boundary['geometry'] = US_boundary.buffer(10) # Buffer US boundary by 10 meters to ensure that array bounds are not clipped

## Helper Functions

In [5]:
# Function to check for and remove erroneous geometries in arrays
def checkArrayGeometries(arrays): 
    # For a collection of reasons, array boundaries may contain erroneous geometries that result in a near-zero area, linestrings, or points. 
    # To check for and remove these, we'll explode arrays, calculate a temporary area, remove subarrays that are less than a minimum area, then dissolve by tempID.
    arrays['tempDissolveID'] = (1 + np.arange(len(arrays)))  # Create a temporary ID for dissolving
    arrays = arrays.explode(index_parts=False)
    arrays['tempArea'] = arrays['geometry'].area
    arrays = arrays[arrays['tempArea'] >= minPanelRowArea]
    arrays = arrays.dissolve(by=['tempDissolveID'], as_index=False)
    arrays = arrays.drop(columns=['tempArea', 'tempDissolveID'])
    arrays = arrays.reset_index(drop=True)
    return arrays

## Compile existing and newly georectified array datasets and send to GEE

In [6]:
# Call existing arrays and georectified arrays
existingArrays = gpd.read_file(existingArraysPath)
georectifiedArrays = gpd.read_file(georectifiedArraysPath)

# Set georectifiedArrays projection to gee_crs
georectifiedArrays = georectifiedArrays.set_crs(epsg=gee_crs)

# Transform georectifiedArrays to existingArrays projection
georectifiedArrays = georectifiedArrays.to_crs(existingArrays.crs)

# Subset georectifiedArrays to US boundary
georectifiedArrays = georectifiedArrays[georectifiedArrays.intersects(US_boundary.union_all())]

# Set Source for georectifiedArrays to 'GMSEUSgeorect'
georectifiedArrays['Source'] = 'GMSEUSgeorect'

# As in script1, our georectified dataset may overlap an existing dataset partially, but not fully. Because we want all the information, we need to make an adjustment here.
# To handle this, we'll explode the datasets, remove sub-shapes by intersection and spatial quality, then dissolve by nativeID and Source to return the best available dataset.

# First, explode existing arrays
existingArrays = existingArrays.explode(index_parts=False)

# Remove existingArrays that intersect with georectifiedArrays. We do this because our digitized arrays contain more spatial and value added information. 
existingArrays = existingArrays[~existingArrays.intersects(georectifiedArrays.union_all())]

# Now, dissolve by nativeID and Source to return existing arrays
existingArrays = existingArrays.dissolve(by=['nativeID', 'Source'], as_index=False)
existingArrays = existingArrays.reset_index(drop=True)

# Merge existingArrays and georectifiedArrays
compiledArrays = gpd.GeoDataFrame(pd.concat([existingArrays, georectifiedArrays], ignore_index=True), crs=existingArrays.crs)

# Check for and remove erroneous geometries in compiledArrays
compiledArrays = checkArrayGeometries(compiledArrays)

# Set a temporary ID for indexing
compiledArrays['tempID'] = compiledArrays.index

# Print initial number of arrays prior to removing rooftop arrays
print('Number of arrays prior to removing rooftop arrays: ' + str(len(compiledArrays)))

# Save compiledArrays to local
compiledArrays.to_file(arraysLocalPath)

Number of arrays prior to removing rooftop arrays: 16691
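
The explode, filter, and dissolve sequence above can be illustrated without geopandas. This is a simplified sketch that uses axis-aligned boxes in place of real polygons; the IDs, sources, and coordinates are hypothetical, and `boxes_intersect` is a stand-in for the geometric `intersects` test.

```python
from collections import defaultdict

def boxes_intersect(a, b):
    """True when two (minx, miny, maxx, maxy) boxes touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical existing arrays: multi-part shapes keyed by (nativeID, Source).
existing = {
    ('1', 'OSM'): [(0, 0, 10, 10), (20, 0, 30, 10)],
    ('2', 'CCVPV'): [(50, 50, 60, 60)],
}
# Hypothetical georectified arrays, which take priority where they overlap.
georectified = [(5, 5, 12, 12), (48, 48, 62, 62)]

# 1. "Explode": one row per sub-shape, keeping the parent key.
parts = [(key, box) for key, boxes in existing.items() for box in boxes]

# 2. Drop sub-shapes that intersect any georectified array.
kept = [(key, box) for key, box in parts
        if not any(boxes_intersect(box, g) for g in georectified)]

# 3. "Dissolve": regroup the surviving sub-shapes by (nativeID, Source).
dissolved = defaultdict(list)
for key, box in kept:
    dissolved[key].append(box)

print(dict(dissolved))
```

In this toy case array `2` disappears from the existing set because its only part is covered by a georectified array, while array `1` keeps the part that does not overlap. The georectified arrays are then appended, as in the `pd.concat` call above.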


## STOP: Upload _arraysLocalPath_ to a GEE asset at the location given by arraysAssetPath before continuing

## Create single featureCollection building asset

In [7]:
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Google Global Building Asset Dataset

# Set asset path
buildingsAsset = ee.FeatureCollection(buildingAssetPath)

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ORNL Structures Dataset (Not as accurate as Google Global Building Asset Dataset). Some smaller buildings were slightly misaligned with actual building footprints, resulting in retention of rooftop arrays.

# Get states and building database
#statesAssetPath = r'TIGER/2018/States'
#buildingAssetBasePath = r'projects/sat-io/open-datasets/ORNL/USA-STRUCTURES' # Append with 'USA_ST_MI' where the last two letters are the state abbreviation

# Call states
#states = ee.FeatureCollection(statesAssetPath)

# Get a compiled building asset feature collection
# List all assets in the folder
#assetList = ee.data.listAssets({'parent': buildingAssetBasePath}).get('assets')

# Function to convert asset paths to FeatureCollections
#def assetToFeatureCollection(asset):
#    return ee.FeatureCollection(asset['name'])

# Map over the asset list to get FeatureCollections
#featureCollections = list(map(assetToFeatureCollection, assetList))

# Merge all FeatureCollections into one
#buildingsAsset = ee.FeatureCollection(featureCollections).flatten()

## Get solar array intersections with buildings and save as pandas df

In [8]:
# Function to calculate the intersection and proportional area
def calculate_intersection_area(array):
    # Get the array geometry
    aoi = array.geometry()
    
    # Filter the building vectors to the bounds of the aoi and get unified geometry (This works for all normal featureCollections. The Google Buildings Dataset is an exception)
    #localBuildings = buildingsAsset.filterBounds(aoi).union(1).geometry()

    # Filter the building vectors to the bounds of the aoi. 
    # For this Google Buildings Dataset, this always results in 29 features, only the last of which is the correct shape. We have not ascertained why this is the case. 
    # When we filter and acquire multipolygon geometries (if more than one building intersects), the correct shapes are index 29:numFeatures. 
    localBuildings = buildingsAsset.filterBounds(aoi).geometry().geometries() 
    actualFootprintStart = 29 # First building footprint index for the Google Buildings Dataset. 
    numGeometries = localBuildings.size() # Get total number of polygons
    validGeometries = localBuildings.slice(actualFootprintStart, numGeometries) # Slice the geometries from index 29 to the end
    localBuildings = ee.Geometry.MultiPolygon(validGeometries) # Convert the list of valid geometries back into a MultiPolygon

    # Now, acquire the intersecting area of roof and solar and get the rooftop proportion
    intersectionArea = aoi.intersection(localBuildings, ee.ErrorMargin(1)).area(1)
    rooftopAreaProp = intersectionArea.divide(aoi.area(1)).multiply(100).toInt() 
    
    # Set as new attribute and return the feature
    return array.set({'roofProp': rooftopAreaProp})

# Call asset arrays
#arraysAsset = geemap.shp_to_ee(arraysLocalPath) # Although this works, it causes memory issues if the dataset is too large and has not been uploaded to GEE as an asset first. So we'll use the asset path instead.
arraysAsset = ee.FeatureCollection(arraysAssetPath)

# Create a temporary folder within the derivedTemp_path to store the results
rooftopResultsPath = os.path.join(derivedTemp_path, 'rooftopResults')
if not os.path.exists(rooftopResultsPath):
    os.makedirs(rooftopResultsPath)

# Get asset ID list
assetIDList = arraysAsset.aggregate_array('tempID').getInfo()

# Break the assetIDList into chunks of at most chunkSize IDs to stay within GEE memory limits.
chunkSize = 1000 # This used to be 4999 with the ORNL structures dataset; the Google dataset needs smaller chunks, each exported to its own CSV, to avoid memory issues.
chunks = [assetIDList[i:i + chunkSize] for i in range(0, len(assetIDList), chunkSize)]

# For each chunk, get the corresponding feature collection, apply the calculate_intersection_area function to each feature, and export the results to a CSV.
for chunk in chunks:
    
    # Initialize lists
    id_list = []
    roofProp_list = []

    # Get the feature collection for the chunk
    arraysAssetChunk = arraysAsset.filter(ee.Filter.inList('tempID', ee.List(chunk)))

    # Apply the function to each feature in the arrays collection, and append the results to the lists
    arrays_with_rooftop = arraysAssetChunk.map(calculate_intersection_area)
    for feature in arrays_with_rooftop.getInfo()['features']:
        id_list.append(feature['properties']['tempID'])
        roofProp_list.append(feature['properties']['roofProp'])

    # Create a dictionary and convert to a dataframe
    data = {'tempID': id_list, 'roofProp': roofProp_list}
    arraysRooftopDf = pd.DataFrame(data)

    # Export the result to CSV
    arraysRooftopDf.to_csv(os.path.join(rooftopResultsPath, 'arraysRooftopProp'+str(chunk[0])+'.csv'), index=False)

# Read every CSV in rooftopResultsPath and concatenate them into one dataframe (clear this folder before a re-run so stale chunk files are not included)
arraysRooftopDf = pd.concat([pd.read_csv(os.path.join(rooftopResultsPath, f)) for f in sorted(os.listdir(rooftopResultsPath)) if f.endswith('.csv')], ignore_index=True)

# Check to ensure indexing logic is correct
print("Number of arrays assessed: ", len(arraysRooftopDf))
print("Correct total number of arrays: ", arraysAsset.size().getInfo())

# Export dataframe to derivedTemp_path
arraysRooftopDf.to_csv(os.path.join(derivedTemp_path, 'arraysRooftopProp.csv'), index=False)

Number of arrays assessed:  16691
Correct total number of arrays:  16691
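
As a rough illustration of what `roofProp` measures and of the chunked loop, the sketch below computes the integer percent of each array's area covered by buildings, one chunk of IDs at a time. It uses axis-aligned boxes instead of GEE geometries and assumes the building boxes do not overlap each other; all coordinates and the `chunk_size` value are hypothetical.

```python
def overlap_area(a, b):
    """Area shared by two (minx, miny, maxx, maxy) boxes, 0 if disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)

def box_area(a):
    """Area of a (minx, miny, maxx, maxy) box."""
    return (a[2] - a[0]) * (a[3] - a[1])

def roof_prop(array_box, building_boxes):
    """Integer percent of the array's area covered by non-overlapping buildings."""
    covered = sum(overlap_area(array_box, b) for b in building_boxes)
    return int(100 * covered / box_area(array_box))

# Hypothetical arrays keyed by tempID, and hypothetical building footprints.
arrays = {0: (0, 0, 10, 10), 1: (20, 0, 30, 10), 2: (40, 0, 50, 10)}
buildings = [(0, 0, 10, 6), (25, 0, 48, 10)]

# Split the IDs into fixed-size chunks, as the loop above does with chunkSize.
chunk_size = 2
ids = list(arrays)
chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

results = {}
for chunk in chunks:
    for temp_id in chunk:
        results[temp_id] = roof_prop(arrays[temp_id], buildings)

print(chunks)
print(results)
```

The denominator is the array's area, not the building's, so a small array sitting entirely on a large roof scores 100. Checking that the number of results equals the number of arrays mirrors the indexing check printed above.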


## Remove buildings from existing and digitized solar array and panel databases

### Arrays

In [9]:
# Call local arrays
arraysLocal = gpd.read_file(arraysLocalPath)

# Print length of local arrays
print("Original number of arrays: ", len(arraysLocal))

# Call the CSV
arraysRooftopDf = pd.read_csv(os.path.join(derivedTemp_path, 'arraysRooftopProp.csv'))

# Merge the dataframes on a common identifier
mergedArrays = arraysLocal.merge(arraysRooftopDf[['tempID', 'roofProp']], on='tempID')

# Drop arrays with a roofProp above the threshold (keep arrays at or below it)
mergedArrays = mergedArrays[mergedArrays['roofProp'] <= build_threshold]

# Print length of merged arrays
print("Number of arrays after filtering: ", len(mergedArrays))

# Print total area of ground-mounted arrays in square km 
print("Total area of ground-mounted arrays: ", mergedArrays['area'].sum() / 10**6)

# Print the number of rooftop arrays removed 
print("Number of rooftop arrays removed: ", len(arraysLocal) - len(mergedArrays))

# Drop the temporary ID column
mergedArrays = mergedArrays.drop(columns=['tempID'])

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Set a subset identifier for GEE memory limitations

# Create a 'subset' column for processing panels in GEE.
# This assigns each array a random batch number based on its size. Values range from 0 through 998.
# Larger arrays require more vectorization and more GEE memory, and should be processed in smaller batches.
# Calculate the 90th, 50th, and 10th percentile area values (the 10th percentile is not used below)
percentile_90 = mergedArrays['area'].quantile(0.90)
percentile_50 = mergedArrays['area'].quantile(0.50)
percentile_10 = mergedArrays['area'].quantile(0.10)

# Set the number of subset values available to each size class
percentile_90thSubset = 500 # Number of subset values for arrays larger than the 90th percentile area
percentile_50thSubset = 400 # Number of subset values for arrays between the 50th and 90th percentile area
percentile_10thSubset = 99 # Number of subset values for arrays at or below the 50th percentile area

# Function to assign subset values
def assign_subset(area, percentile_90):
    if area > percentile_90:
        return np.random.randint(0, percentile_90thSubset)
    elif area <= percentile_90 and area > percentile_50:
        return np.random.randint(percentile_90thSubset, percentile_90thSubset + percentile_50thSubset)
    else:
        return np.random.randint(percentile_90thSubset + percentile_50thSubset, percentile_90thSubset + percentile_50thSubset + percentile_10thSubset)

# Apply the function to create the 'subset' column
mergedArrays['subset'] = mergedArrays['area'].apply(assign_subset, percentile_90=percentile_90)

# Print min max subset values
print(f'Minimum subset value: {mergedArrays["subset"].min()}')
print(f'Maximum subset value: {mergedArrays["subset"].max()}')

# Set an arrayID column that is the row index
mergedArrays = mergedArrays.reset_index(drop=True)
mergedArrays['arrayID'] = mergedArrays.index

# Export the merged arrays to shapefile
mergedArrays.to_file(gmseusArraysInitPath)

Original number of arrays:  16691
Number of arrays after filtering:  14905
Total area of ground-mounted arrays:  3055.8286719458915
Number of rooftop arrays removed:  1786
Minimum subset value: 0
Maximum subset value: 998
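
The subset assignment can be read as three half-open ranges of batch numbers. The sketch below reproduces that mapping with the standard library; the bin widths are copied from the cell above, while the percentile areas and the sample areas are hypothetical, and `subset_range` is an illustrative helper, not part of the script.

```python
import random

# Bin widths copied from the cell above.
N_LARGE, N_MID, N_SMALL = 500, 400, 99

def subset_range(area, p50, p90):
    """Half-open [low, high) range of subset values for an array of this area."""
    if area > p90:
        return 0, N_LARGE
    if area > p50:
        return N_LARGE, N_LARGE + N_MID
    return N_LARGE + N_MID, N_LARGE + N_MID + N_SMALL

rng = random.Random(0)  # seeded only so the sketch is repeatable
p50, p90 = 10_000.0, 80_000.0  # hypothetical percentile areas in square meters

for area in (500.0, 20_000.0, 250_000.0):
    low, high = subset_range(area, p50, p90)
    subset = rng.randrange(low, high)  # upper bound excluded, like np.random.randint
    print(area, (low, high), subset)
```

Because the upper bound is excluded, the largest possible value is 998, which matches the printed maximum. The largest 10% of arrays are spread over 500 values and the smallest 50% over only 99, so batches of large arrays contain far fewer arrays.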


### Panels

In [10]:
# Re-call GM-SEUS initial arrays and call the panel-row data (both are already in the correct projection)
gmseusArrays = gpd.read_file(gmseusArraysInitPath)
panelsLocal = gpd.read_file(panelsLocalPath)

# Print the original number of panels
print("Original number of panels: ", len(panelsLocal))

# Spatially join gmseus arrays to panels, copy the arrayID to the panels, and drop the index columns. 
panelsLocal = gpd.sjoin(panelsLocal, gmseusArrays[['arrayID', 'geometry']], how='left', predicate='intersects')
panelsLocal = panelsLocal.reset_index(drop=True)
panelsLocal = panelsLocal.drop(columns=['index_left', 'index_right'], errors='ignore')

# Drop panels that do not have an arrayID
gmPanelsLocal = panelsLocal.dropna(subset=['arrayID'])

# Print the number of panels after filtering
print("Number of panels after filtering: ", len(gmPanelsLocal))

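# Print the number of rooftop panels removed. Note that len(panelsLocal) is now the post-join row count, which can exceed the original panel count if a panel intersects more than one array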
print("Number of rooftop panels removed: ", len(panelsLocal) - len(gmPanelsLocal))

# Print the total area of ground-mounted panels in square km
print("Total area of ground-mounted panels: ", gmPanelsLocal['area'].sum() / 10**6)

# Print the number of unique arrayIDs in the panels
print("Number of GMSEUS arrays with existing panels: ", len(gmPanelsLocal['arrayID'].unique()))

# Drop arrayID column
gmPanelsLocal = gmPanelsLocal.drop(columns=['arrayID'])

# Save a new panelID column that is the row index (after resetting the index)
gmPanelsLocal = gmPanelsLocal.reset_index(drop=True)
gmPanelsLocal['panelID'] = gmPanelsLocal.index

# Export the panels to shapefile
gmPanelsLocal.to_file(gmseusPanelsInitPath)

Original number of panels:  1076800
Number of panels after filtering:  1071181
Number of rooftop panels removed:  5714
Total area of ground-mounted panels:  137.31026341708852
Number of GMSEUS arrays with existing panels:  4470
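
The printed counts do not subtract cleanly: 1076800 − 1071181 = 5619, while the reported number removed is 5714. One possible explanation, which the output alone does not confirm, is that the left join produced 95 extra rows for panels that intersect more than one array. The sketch below shows how a left join with an `intersects` predicate can add rows, using boxes and hypothetical coordinates.

```python
def boxes_intersect(a, b):
    """True when two (minx, miny, maxx, maxy) boxes touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical ground-mounted arrays keyed by arrayID.
arrays = {0: (0, 0, 10, 10), 1: (10, 0, 20, 10)}
# Hypothetical panel rows: the second straddles both arrays, the third matches none.
panels = [(1, 1, 3, 2), (9, 4, 11, 5), (40, 40, 42, 41)]

# Left join: one output row per (panel, intersecting array); None when nothing matches.
joined = []
for panel_index, panel in enumerate(panels):
    matches = [aid for aid, box in arrays.items() if boxes_intersect(panel, box)]
    for aid in matches or [None]:
        joined.append((panel_index, aid))

# Equivalent of dropna(subset=['arrayID'])
kept = [row for row in joined if row[1] is not None]

print(len(panels), len(joined), len(kept))
```

Here three panels become four joined rows, so counting removals against the joined length rather than the original length overstates them by one. Storing the original length before the join, or dropping duplicate panel rows after it, would make the three printed numbers consistent.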


# Prepare Initial GM-SEUS to Get Panel-Rows in GEE
* Because `script4_getPanel` doesn't care about *solar farm* IDs, we'll explode sub-array multi-polygons and treat them individually.
* This drastically improves computational efficiency by reducing the number of large multi-polygon arrays to chunk up and panelize.
* Upload _gmseusGetPanelsPath_ to GEE asset

In [12]:
# Call gmseusArrays initial 
gmseusArraysInit = gpd.read_file(gmseusArraysInitPath)

# Explode gmseusArraysInit
gmseusArraysInit = gmseusArraysInit.explode(index_parts=False)
gmseusArraysInit = gmseusArraysInit.reset_index(drop=True)

# Set a subArrID column that is the row index
gmseusArraysInit['subArrID'] = gmseusArraysInit.index

# Set a temporary area column
gmseusArraysInit['tempArea'] = gmseusArraysInit['geometry'].area

# Export the exploded arrays to shapefile
gmseusArraysInit.to_file(gmseusGetPanelsPath)

### Print the number of exploded array shapes to decide area thresholds for running `script4_getPanels.js`

In [13]:
# Call gmseusArrays getPanels
gmseusGetPanels = gpd.read_file(gmseusGetPanelsPath)

# Print the number of array shapes in gmseusGetPanels
print('Number of array shapes in exploded gmseusArrays: ' + str(len(gmseusGetPanels)))

# Print the number of array shapes with an area less than 20000 m2
print('Number of array shapes with an area less than 2 ha: ' + str(len(gmseusGetPanels[gmseusGetPanels['tempArea'] < 20000])))

# Print the number of array shapes with an area greater than 20000 m2 and less than 100000 m2
print('Number of array shapes with an area greater than 2 and less than 10 ha: ' + str(len(gmseusGetPanels[(gmseusGetPanels['tempArea'] > 20000) & (gmseusGetPanels['tempArea'] < 100000)])))

# Print the number of array shapes with an area greater than 100000 m2
print('Number of array shapes with an area greater than 10 ha: ' + str(len(gmseusGetPanels[gmseusGetPanels['tempArea'] > 100000])))

Number of array shapes in exploded gmseusArrays: 21258
Number of array shapes with an area less than 2 ha: 8824
Number of array shapes with an area greater than 2 and less than 10 ha: 7867
Number of array shapes with an area greater than 10 ha: 4567
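
The three size classes use strict comparisons, so a shape whose area is exactly 20000 or 100000 m2 would fall into none of them. In the output above the three counts sum to the total (8824 + 7867 + 4567 = 21258), so no shape sat on a boundary in that run. The sketch below shows the classification on hypothetical areas; `area_class` is an illustrative helper, not part of the script.

```python
from collections import Counter

def area_class(area_m2):
    """Label an area using the same strict comparisons as the prints above."""
    if area_m2 < 20_000:
        return 'under 2 ha'
    if 20_000 < area_m2 < 100_000:
        return '2 to 10 ha'
    if area_m2 > 100_000:
        return 'over 10 ha'
    return 'unclassified'  # exactly 20,000 or 100,000 square meters

# Hypothetical areas in square meters, including one value exactly on a boundary.
sample_areas = [2_992.6, 14_385.8, 20_000.0, 88_628.8, 250_000.0]
counts = Counter(area_class(a) for a in sample_areas)
print(counts)
```

Using `<=` on one side of each boundary would guarantee that every shape lands in exactly one class.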


In [14]:
# Print head
print(gmseusGetPanels.head())

  nativeID Source  instYr    cap_mw          area modType AVtype  azimuth  \
0        1  CCVPV    2018  1.030127  14385.760279    c-si   None  -9999.0   
1       10  CCVPV    2018  0.177505   2992.634674    c-si   None  -9999.0   
2       10    OSM    2017  7.400000  88628.816881    c-si   None  -9999.0   
3       10    OSM    2017  7.400000  88628.816881    c-si   None  -9999.0   
4       10    OSM    2017  7.400000  88628.816881    c-si   None  -9999.0   

        mount    id  roofProp  subset  arrayID  subArrID      tempArea  \
0  fixed_axis  None         0     992        0         0  14385.760279   
1  fixed_axis  None         0     930        1         1   2992.634674   
2        None  None        13     889        2         2  17720.386654   
3        None  None        13     889        2         3  17715.559753   
4        None  None        13     889        2         4   7352.367665   

                                            geometry  
0  POLYGON Z ((-2046882.826 -33209.73