# Prepare Solar Energy Array and Panel Databases

**Script Notes**
* Calls and preprocesses existing spatiotemporal solar PV databases.
* This is a multi-step process:
    * Process and remove duplicates for existing array polygon shapefiles from the USPVDB, CCVPV, CWSD, OSM, and SAM datasets. Export as existing array shapes.
    * Process and remove duplicates for centroid shapefiles from InSPIRE, LBNL-USS, PV-DAQ, SolarPACES, GSPT, and GPPDB. Export the centroids that still need to be digitized, then run the `script2_digitizeSolarArrays` GEE script.

## Array Polygon-Level Data

**United States Solar Photovoltaic Database (USPVDB)**
* Downloaded from [USPVDB Portal](https://eerscmap.usgs.gov/uspvdb/data/)
* Last Download: 10-11-2024 (Up-to-date as of 12-11-2024)
* Version 2.0

**California's Central Valley Photovoltaic Dataset (CCVPV) Arrays and Panels**
* Downloaded from [figshare](https://doi.org/10.6084/m9.figshare.23629326.v1)
* Last Download: 07-18-2024 (Up-to-date as of 12-11-2024)
* Version 1.0

**Chesapeake Watershed Solar Data (CWSD) Arrays**
* Downloaded from [OSFHOME](https://osf.io/vq7mt/)
* Last Download: 12-01-2024 (Up-to-date as of 12-11-2024)
* We downloaded derived polygons as well as manually annotated training polygons, and preferred training polygons over derived polygons for their completeness and quality
* No Version details

**OpenStreetMap Solar Panels and Arrays (OSM)**
* Array and panel objects were downloaded with the _osmnx_ package in `script0_getOSMdata.ipynb`
* Last OSM scrape: 12-11-2024
* Previously, we used data from:
    * Harmonized Global Wind and Solar Farm Locations (HGLOBS)
    * Downloaded from [figshare](https://doi.org/10.6084/m9.figshare.11310269.v6)
    * Last Downloaded: 07-23-2024
    * Version 6.0

**TransitionZero Global Solar Asset Mapper (SAM)**
* Downloaded from [TZ-SAM Portal](https://zenodo.org/records/11368204)
* Last Download: 12-11-2024 
* Other information: [Website](https://solar.transitionzero.org/), [Viewer](https://solar-map.transitionzero.org/), [SciData Preprint](https://zenodo.org/records/11368204/files/tz-sam_scientific_data.pdf?download=1)
* Version Q3-2024 (Version 2)
* Follow-on project containing all information from [Kruitwagen et al., 2021](https://zenodo.org/records/5005868). 
* NOTE: TZ-SAM also contains *raw_polygons*, which are all overlapping polygon shapefiles from all sources (including prior TZ-SAM versions). Could be useful in the future, or even a pathway that we use to share data. 

## Array Point-Level Data

**NREL Innovative Solar Practices Integrated with Rural Economies and Ecosystems (InSPIRE) Database**
* Downloaded from [InSPIRE Portal](https://openei.org/wiki/InSPIRE/Agrivoltaics_Map)
* Last Download: 12-11-2024 

**LBNL Utility-Scale Solar (USS), 2024 Edition**
* Downloaded from [LBNL Utility-Scale Solar Portal](https://emp.lbl.gov/utility-scale-solar/)
* Last Downloaded: 11-16-2024 (Up-to-date as of 12-11-2024)
* Large Excel report; project-level data were copied from the Individual_Project_Data tab of the original .xlsx report to a new .csv

**NREL PV Data Acquisition (PV-DAQ) Database**
* Downloaded from [PV-DAQ Portal - Available Systems Information](https://data.openei.org/submissions/4568), and [PVDAQ Data Map](https://openei.org/wiki/PVDAQ/PVData_Map)
* Last Downloaded: 07-23-2024 (Up-to-date as of 12-11-2024)

**International Energy Agency (IEA) & NREL Solar Power and Chemical Energy Systems (SolarPACES) Database**
* Downloaded from [Project Page](https://solarpaces.nrel.gov/)
* Last Downloaded: 07-29-2024 (Up-to-date as of 12-11-2024)
* More information at [US CSP Project Pages](https://solarpaces.nrel.gov/by-country/US)
* While SolarPACES is the overarching project (and how we refer to the dataset here), the product is called [CSP.guru](https://csp.guru/)

**Global Solar Power Tracker (GSPT) from Global Energy Monitor (GEM) and TransitionZero**
* Downloaded from [GEM Portal](https://globalenergymonitor.org/download-data-success/)
* Last Downloaded: 07-24-2024 (Up-to-date as of 12-11-2024)
* Access request required

**World Resources Institute (WRI) Global Power Plant Database (GPPDB)**
* Downloaded from [WRI Portal](https://datasets.wri.org/dataset/globalpowerplantdatabase)
* Last Downloaded: 07-30-2024 (Up-to-date as of 12-11-2024)
* Version 1.3.0

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import geopandas as gpd
import os 

# Load config file
def load_config(filename):
    config = {}
    with open(filename, 'r') as f:
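        for line in f:
            line = line.strip()
            # Skip blank lines and lines without a key=value pair
            if not line or '=' not in line:
                continue
            # Split on the first '=' only, so values may themselves contain '='
            key, value = line.split('=', 1)
            key, value = key.strip(), value.strip()
            # Try to convert to numeric values if possible
            try:
                value = float(value) if '.' in value else int(value)
            except ValueError:
                pass  # Leave as string if not a number
            config[key] = value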
    return config
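
As an illustration of the `key=value` format that `load_config` expects, the sketch below writes a hypothetical `config.txt` to a temporary folder and parses it. The loader is restated compactly so the example runs on its own. The numeric values mirror the inline comments in this notebook where they are given; `mostRecentInstallYear=2024` and `gee_crs=EPSG:4326` are placeholders assumed for illustration, not the project's actual settings.

```python
import os
import tempfile

def load_config(filename):
    """Compact restatement of the loader above (blank lines are skipped)."""
    config = {}
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, value = line.split('=', 1)
            try:
                value = float(value) if '.' in value else int(value)
            except ValueError:
                pass  # Leave as string if not a number
            config[key] = value
    return config

# Hypothetical config.txt contents (placeholder values, see the note above)
example_text = "\n".join([
    "mostRecentInstallYear=2024",
    "acre_to_m2=4046.86",
    "gee_crs=EPSG:4326",
    "overlapDist=190",
    "panelArrayBuff=10",
    "minPanelRowArea=15",
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.txt")
    with open(path, "w") as f:
        f.write(example_text)
    config = load_config(path)

# Integers, floats, and strings are distinguished by the try/except conversion
print(config)
```

Values containing a decimal point become floats, other numeric values become integers, and anything that fails conversion (such as a CRS string) is kept as text.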

## Set paths and helper functions

In [2]:
# Set folder paths
wd = r'S:\Users\stidjaco\R_files\BigPanel'
downloaded_path = os.path.join(wd, r'Data\Downloaded')
derived_path = os.path.join(wd, r'Data\Derived')
derivedTemp_path = os.path.join(derived_path, r'intermediateProducts')

# Set CONUS Shapefile path
CONUS_path = os.path.join(wd, r'Data\Downloaded\CONUS_NoGreatLakes\CONUS_No_Great_Lakes.shp')

# Set file paths
uspvdb_path = os.path.join(downloaded_path, r'SolarDB\USPVDB\uspvdb_v2_0_20240801.shp')
ccvpv_ArraysPath = os.path.join(downloaded_path, r'SolarDB\CCVPV\PV_ID_CV.shp')
ccvpv_PanelsPath = os.path.join(downloaded_path, r'SolarDB\CCVPV\PV_ID_panels.shp')
cwsd_ArraysYearPath = os.path.join(downloaded_path, r'SolarDB\CWSD\CPK_solarJun22_firstyear.geojson')
cwsd_ArraysAllYearPath = os.path.join(downloaded_path, r'SolarDB\CWSD\CPK_solarJun22_annual.geojson')
cwsd_NYarraysPath = os.path.join(downloaded_path, r'SolarDB\CWSD\ny_footprints\ny_footprints.shp')
cwsd_DEarraysPath = os.path.join(downloaded_path, r'SolarDB\CWSD\de_footprints\de_footprints.shp')
cwsd_VAarraysPath = os.path.join(downloaded_path, r'SolarDB\CWSD\va_footprints\va_footprints.shp')
cwsd_MDarraysPath = os.path.join(downloaded_path, r'SolarDB\CWSD\md_footprints\md_footprints.shp')
cwsd_PAarraysPath = os.path.join(downloaded_path, r'SolarDB\CWSD\pa_footprints\pa_footprints.shp')
osm_ArraysPath = os.path.join(downloaded_path, r'SolarDB\OSM\OSMSolarArrays.shp')
osm_PanelsPath = os.path.join(downloaded_path, r'SolarDB\OSM\OSMSolarPanels.shp')
sam_path = os.path.join(downloaded_path, r'SolarDB\TZSAM\runs_2024-Q3_outputs_external_analysis_polygons.gpkg')
inspire_path = os.path.join(downloaded_path, r'SolarDB\InSPIRE\result.csv')
lbnlUss_path = os.path.join(downloaded_path, r'SolarDB\LBNLUSS\USS_2024_Individual_Project_Data.csv')
pvdaq_SysPath = os.path.join(downloaded_path, r'SolarDB\PVDAQ\systems.csv')
pvdaq_MapPath = os.path.join(downloaded_path, r'SolarDB\PVDAQ\result.csv')
solarPaces_path = os.path.join(downloaded_path, r'SolarDB\SolarPACES\csp-guru.csv')
gspt_path = os.path.join(downloaded_path, r'SolarDB\GSPT\Global-Solar-Power-Tracker-June-2024.xlsx')
gppdb_path = os.path.join(downloaded_path, r'SolarDB\GPPDB\global_power_plant_database.csv')

# Get US Boundary to subset global/non-CONUS datasets
uspvdb = gpd.read_file(uspvdb_path) # USPVDB shapefile
US_boundary = gpd.read_file(CONUS_path) # CONUS boundary shapefile
US_boundary = US_boundary.set_crs(epsg=4269) # Native projection of US boundary - NAD83
US_boundary = US_boundary.to_crs(uspvdb.crs) # Transform to projection of USPVDB
US_boundary['geometry'] = US_boundary.buffer(10) # Buffer US boundary by 10 meters to ensure that array bounds are not clipped

# Load the config from the text file
config = load_config('config.txt')

# Set variables
mostRecentInstallYear = config['mostRecentInstallYear'] # Most recent installation year of the datasets that we consider due to remote sensing data availability (full year)
acre_to_m2 = config['acre_to_m2'] # 1 acre = 4046.86 m2
gee_crs = config['gee_crs'] # native projection of Google Earth Engine exports
overlapDist = config['overlapDist'] # 190 meters, Set an overlap distance for checking if points/mismatched geometries between Solar PV datasets are duplicates
panelArrayBuff = config['panelArrayBuff'] # 10 meters, Set a distance for checking if points/mismatched geometries between Solar PV datasets are part of the same array
minPanelRowArea = config['minPanelRowArea'] # 15 m2, minimum area for a single panel row from the 1st percentile panel area from Stid et al., 2022. Filter small sub-panel chunks

# Create a function to format the data to the schema
def formatDf(df, nativeIdentifier, installationYear, capacityMWdc, area_m2, moduleType, agrivoltaicType, azimuth, mountTechnology, source):
    # Change column names to match the schema
    df = df.rename(columns={nativeIdentifier: 'nativeID', capacityMWdc: 'cap_mw', area_m2: 'area', installationYear: 'instYr', moduleType: 'modType', azimuth: 'azimuth', mountTechnology: 'mount', agrivoltaicType: 'AVtype'})

    # Set source
    df['Source'] = source

    # Fill empty numeric column rows with -9999, and empty string column rows with an empty string
    df['cap_mw'] = df['cap_mw'].fillna(-9999)
    df['area'] = df['area'].fillna(-9999)
    df['instYr'] = df['instYr'].fillna(-9999)
    df['azimuth'] = df['azimuth'].fillna(-9999)
    df['modType'] = df['modType'].fillna('')
    df['AVtype'] = df['AVtype'].fillna('')
    df['mount'] = df['mount'].fillna('')

    # Force data types to match schema
    df['nativeID'] = df['nativeID'].astype(str)
    df['instYr'] = df['instYr'].astype(int)
    df['cap_mw'] = df['cap_mw'].astype(float)
    df['area'] = df['area'].astype(float)
    df['azimuth'] = df['azimuth'].astype(float)
    df['modType'] = df['modType'].astype(str)
    df['modType'] = df['modType'].str.lower() # Ensure modtype is lowercase
    df['AVtype'] = df['AVtype'].astype(str)
    df['AVtype'] = df['AVtype'].str.lower() # Ensure AVtype is lowercase
    df['mount'] = df['mount'].astype(str)
    df['mount'] = df['mount'].str.lower() # Ensure mount is lowercase

    # As a default, if modType is not c-si or csp or thin-film, set to c-si. We use this information in the image classification.
    df.loc[~df['modType'].isin(['c-si', 'csp', 'thin-film']), 'modType'] = 'c-si'

    # Select schema columns
    df = df[['nativeID', 'instYr', 'cap_mw', 'area', 'modType', 'AVtype', 'azimuth', 'mount', 'Source', 'geometry']]
    return df

Cannot find header.dxf (GDAL_DATA is not defined)


## Prepare United States Solar Photovoltaic Database (USPVDB)

In [4]:
# Call data
uspvdb = gpd.read_file(uspvdb_path)

# Set a column for agrivoltaic type and fill with NaN
uspvdb['AVtype'] = np.nan

# Set a mount column. 
# If p_axis is 'single-axis', set to single_axis. If p_axis is 'fixed-tilt', set to fixed_axis. If p_axis is 'dual-axis', set to dual_axis. If p_axis is 'fixed-tilt,single-axis' or 'fixed-tilt,single-axis,dual-axis', set to mixed
uspvdb['mount'] = np.nan
uspvdb['mount'] = uspvdb['mount'].astype(object)
uspvdb.loc[uspvdb['p_axis'] == 'single-axis', 'mount'] = 'single_axis'
uspvdb.loc[uspvdb['p_axis'] == 'fixed-tilt', 'mount'] = 'fixed_axis'
uspvdb.loc[uspvdb['p_axis'] == 'dual-axis', 'mount'] = 'dual_axis'
uspvdb.loc[uspvdb['p_axis'].isin(['fixed-tilt,single-axis', 'fixed-tilt,single-axis,dual-axis']), 'mount'] = 'mixed'

# Format data
uspvdb = formatDf(df = uspvdb, nativeIdentifier = 'case_id', installationYear = 'p_year', capacityMWdc = 'p_cap_dc', area_m2 = 'p_area', moduleType = 'p_tech_sec', agrivoltaicType = 'AVtype', azimuth = 'p_azimuth', mountTechnology = 'mount', source = 'USPVDB')

# Print the number of unique arrays
print(f'Number of arrays in USPVDB in the US is {len(uspvdb)}')

# Print the total area in km2 of the arrays
print(f'Total area of arrays in USPVDB in the US is {uspvdb["area"].sum() / 1e6} km2')

# Export to shapefile
uspvdb.to_file(os.path.join(derivedTemp_path, r'uspvdb_poly.shp'))

Number of arrays in USPVDB in the US is 4185
Total area of arrays in USPVDB in the US is 1605.532313 km2
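
The `p_axis` to `mount` recoding above can also be expressed as a single lookup table. The following is a minimal sketch on a toy DataFrame, not the USPVDB file itself; the name `mount_lookup` is hypothetical, and the category strings are the ones used in the cell above.

```python
import pandas as pd

# Toy stand-in for the USPVDB 'p_axis' column (the last row is missing)
df = pd.DataFrame({'p_axis': ['single-axis', 'fixed-tilt', 'dual-axis',
                              'fixed-tilt,single-axis', None]})

# Hypothetical lookup table from native tracking labels to schema labels
mount_lookup = {
    'single-axis': 'single_axis',
    'fixed-tilt': 'fixed_axis',
    'dual-axis': 'dual_axis',
    'fixed-tilt,single-axis': 'mixed',
    'fixed-tilt,single-axis,dual-axis': 'mixed',
}

# Labels that are not in the lookup (including missing values) become NaN
df['mount'] = df['p_axis'].map(mount_lookup)
print(df)
```

Unmapped rows stay as NaN, which `formatDf` later replaces with an empty string, so the result should match the repeated `.loc` assignments for the labels listed here.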


## Prepare California's Central Valley Photovoltaic Dataset (CCVPV)

In [26]:
# Call data
ccvpv = gpd.read_file(ccvpv_ArraysPath)

# Transform to projection of uspvdb
ccvpv = ccvpv.set_crs(epsg=4326) # Native projection dataset - WGS84
ccvpv = ccvpv.to_crs(uspvdb.crs)

# Set a column for agrivoltaic type and fill with NaN
ccvpv['AVtype'] = np.nan

# Set a column for moduleType and assume all modules are 'c-si'
ccvpv['modType'] = 'c-si'

# Set an empty column for azimuth filled with -9999
ccvpv['azimuth'] = -9999

# Set a mount column. If Class is 'Si_Fixed_S', set to fixed_axis. If Class is 'Si_Single_E/W', set to single_axis. 
ccvpv['mount'] = np.nan
ccvpv['mount'] = ccvpv['mount'].astype(object)
ccvpv.loc[ccvpv['Class'] == 'Si_Fixed_S', 'mount'] = 'fixed_axis'
ccvpv.loc[ccvpv['Class'] == 'Si_Single_E/W', 'mount'] = 'single_axis'

# Format data
ccvpv = formatDf(df = ccvpv, nativeIdentifier = 'Index', installationYear = 'Yr_inst', capacityMWdc = 'TPVPp', area_m2 = 'Tot_a', moduleType = 'modType', agrivoltaicType = 'AVtype', azimuth = 'azimuth', mountTechnology = 'mount', source = 'CCVPV')

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some corrections specific to CCVPV

# We will omit the negative buffer from the CCVPV array preparation. 
# Although we are highly confident that this dataset contains near-zero commission and omission errors, and we are fully aware of the array generation methods (buffer 5m, dissolve), there is benefit to keeping the boundary.
# The negative buffer provides a more accurate representation of the array boundary, particularly for smaller arrays, but maintaining more area (still bounded by actual panel-row and spacing) allows for edge pixels to be included in the image classification. 
# The unbuffering here can also lead to some erroneous array shape artifacts, that we deal with after getPanels.

# Unbuffer ccvpv arrays by -5 meters (based on derivation methods from Stid et al., 2022)
#ccvpv['geometry'] = ccvpv.buffer(-5)

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Export

# Print the number of unique arrays
print(f'The number of arrays in CCVPV in the US is {len(ccvpv)}')

# Print the total area in km2 of the arrays
print(f'The total area of arrays in CCVPV in the US is {ccvpv["area"].sum() / 1e6} km2')

# For now, export only array shapefile
ccvpv.to_file(os.path.join(derivedTemp_path, r'ccvpv_poly.shp'))

The number of arrays in CCVPV in the US is 1006
The total area of arrays in CCVPV in the US is 58.90118407347751 km2


## Prepare Chesapeake Watershed Solar Data (CWSD)

In [None]:
# Call in CWSD first year and all year data
cwsd_firstYear = gpd.read_file(cwsd_ArraysYearPath)
cwsd_allYear = gpd.read_file(cwsd_ArraysAllYearPath)

# Call in CWSD state training data
cwsd_NY = gpd.read_file(cwsd_NYarraysPath)
cwsd_DE = gpd.read_file(cwsd_DEarraysPath)
cwsd_VA = gpd.read_file(cwsd_VAarraysPath)
cwsd_MD = gpd.read_file(cwsd_MDarraysPath)
cwsd_PA = gpd.read_file(cwsd_PAarraysPath)

# Merge all state training data
cwsdTraining = pd.concat([cwsd_NY, cwsd_DE, cwsd_VA, cwsd_MD, cwsd_PA])

# Set first year as desired cwsd gdf
cwsd = cwsd_firstYear.copy()

# Set native crs (EPSG:4326) to projection of USPVDB
cwsd = cwsd.set_crs(epsg=4326) # Native projection dataset - WGS84
cwsd = cwsd.to_crs(uspvdb.crs)
#cwsdTraining = cwsdTraining.set_crs(epsg=4269) # Native projection dataset - GCS North American 1983
cwsdTraining = cwsdTraining.to_crs(uspvdb.crs)

# Add year_right column from cwsd to cwsdTraining
cwsdTraining = gpd.sjoin(cwsdTraining, cwsd[['geometry', 'year_right']], how='left', predicate='intersects')

# Drop all columns except geometry and year_right for both cwsd and cwsdTraining
cwsd = cwsd[['geometry', 'year_right']]
cwsdTraining = cwsdTraining[['geometry', 'year_right']]

# Drop cwsd that intersects with cwsdTraining
cwsd = cwsd[~cwsd.intersects(cwsdTraining.unary_union)]

# Merge cwsd and cwsdTraining
cwsd = pd.concat([cwsd, cwsdTraining])

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some corrections specific to CWSD

# CWSD contains self intersections due to Evans et al., 2023 methodology. We will remove these self intersections by dissolving the arrays.
cwsdAttributes = cwsd.copy()
cwsd['temp'] = 1
cwsd = cwsd.dissolve(by='temp')
cwsd = cwsd.reset_index(drop=True)

# Drop all columns except geometry
cwsd = cwsd[['geometry']]

# Explode the dissolved arrays and spatially join with the original attributes
cwsd = cwsd.explode(index_parts=False)
cwsd = cwsd.reset_index(drop=True)
cwsd = gpd.sjoin(cwsd, cwsdAttributes, how='left', predicate='intersects')

# Set a column for agrivoltaic type and fill with NaN
cwsd['AVtype'] = np.nan

# Set a column for moduleType and assume all modules are 'c-si'
cwsd['modType'] = 'c-si'

# Set an empty column for azimuth filled with -9999
cwsd['azimuth'] = -9999

# Set a column for mount and fill with NaN
cwsd['mount'] = np.nan

# Set a column for cap_mw and fill with NaN
cwsd['cap_mw'] = np.nan

# If it has not been already, drop initial area column, then set a column for area and set polygon area to this column
cwsd = cwsd.drop(columns=['area'], errors='ignore')
cwsd['area'] = cwsd['geometry'].area

# Set a nativeID which is 1 to the length of the dataframe
cwsd['nativeID'] = range(1, len(cwsd) + 1)

# CWSD year_right is the installation year, but is limited by Sentinel-2 data availability. Therefore, we will set any year_right less than or equal to 2017 (min S2 year) to -9999, and retain the remaining years.
cwsd.loc[cwsd['year_right'] <= 2017, 'year_right'] = -9999

# Format data
cwsd = formatDf(df = cwsd, nativeIdentifier = 'nativeID', installationYear = 'year_right', capacityMWdc = 'cap_mw', area_m2 = 'area', moduleType = 'modType', agrivoltaicType = 'AVtype', azimuth = 'azimuth', mountTechnology = 'mount', source = 'CWSD')

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CWSD contains repeat (overlapping) arrays. Remove duplicates

# Print the number of unique arrays
print(f'Original CWSD arrays pre-overlap filtering: {len(cwsd)}')

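# If any cwsd arrays overlap, merge them into a single array: dissolve everything into one geometry, then explode back into individual polygons.
# With no `by` key, the non-geometry attributes take the dataset-wide maximum.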
cwsdDissolved = cwsd.dissolve(aggfunc='max').reset_index(drop=True).explode().reset_index(drop=True)

# Recalculate area
cwsdDissolved['area'] = cwsdDissolved['geometry'].area

# Set a tempID
cwsdDissolved['tempID'] = range(1, len(cwsdDissolved) + 1)

# Drop nativeID column, spatially join with original cwsd to get nativeID, drop duplicates by tempID
cwsdDissolved = cwsdDissolved.drop(columns=['nativeID'], errors='ignore')
cwsdDissolved = gpd.sjoin(cwsdDissolved, cwsd[['nativeID', 'geometry']], how='left', predicate='intersects')
cwsdDissolved = cwsdDissolved.drop_duplicates(subset='tempID')
cwsdDissolved = cwsdDissolved.drop(columns=['tempID'], errors='ignore')

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Export

# Print the number of unique arrays
print(f'The number of arrays in CWSD in the US is {len(cwsdDissolved)}')

# Print the total area in km2 of the arrays
print(f'The total area of arrays in CWSD in the US is {cwsdDissolved["area"].sum() / 1e6} km2')

# Export to shapefile
cwsdDissolved.to_file(os.path.join(derivedTemp_path, r'cwsd_poly.shp'))

Original CWSD arrays pre-overlap filtering: 1465


The number of arrays in CWSD in the US is 1352
The total area of arrays in CWSD in the US is 59.33720047380677 km2
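
The dissolve-then-explode step merges overlapping footprints into single polygons. Below is a minimal geometry-only sketch of that idea using `shapely` directly (toy squares, not CWSD data). It illustrates the geometric effect only and does not reproduce the attribute handling in the cell above.

```python
from shapely.geometry import box
from shapely.ops import unary_union

# Three toy footprints: the first two overlap, the third is isolated
squares = [box(0, 0, 10, 10), box(5, 5, 15, 15), box(100, 100, 110, 110)]

# Union everything into one geometry (analogous to a dissolve with no grouping key)
merged = unary_union(squares)

# Split the result back into its connected parts (analogous to explode)
parts = list(merged.geoms)
areas = sorted(p.area for p in parts)

print(len(parts))  # two parts: the merged pair and the isolated square
print(areas)
```

The two overlapping 10 x 10 squares share a 5 x 5 corner, so the merged part has an area of 100 + 100 - 25 = 175 rather than 200, which is why area is recalculated after dissolving.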


## Prepare OpenStreetMap Dataset (OSM)

In [28]:
# Call data
osm = gpd.read_file(osm_ArraysPath)

# OSM data is already in USPVDB projection and exclusive to CONUS.

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some corrections specific to OSM

# Print unique Source values and counts
#print(osm['Source'].value_counts())

# Set a column for agrivoltaic type and fill with NaN
osm['AVtype'] = np.nan

# Set an empty column for azimuth filled with -9999
osm['azimuth'] = -9999

# Set an empty column for mount filled with NaN
osm['mount'] = np.nan
osm['mount'] = osm['mount'].astype(object)

# Format data
osm = formatDf(df = osm, nativeIdentifier = 'nativeID', installationYear = 'instYr', capacityMWdc = 'cap_mw', area_m2 = 'area', moduleType = 'modType', agrivoltaicType = 'AVtype', azimuth = 'azimuth', mountTechnology = 'mount', source = 'OSM')

# OSM nativeID is not unique. We took nativeID from osmid, but there was an issue with the data. The OSM download will retain the osmid nativeID, we will save a new one here. 
osm = osm.reset_index()
osm['nativeID'] = osm.index.astype(str)

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Export

# Print the number of unique arrays
print(f'The number of arrays in OSM in the US is {len(osm)}')

# Print the total area in km2 of the arrays
print(f'The total area of arrays in OSM in the US is {osm["area"].sum() / 1e6} km2')

# Export shapefile
osm.to_file(os.path.join(derivedTemp_path, r'osm_poly.shp'))

The number of arrays in OSM in the US is 10531
The total area of arrays in OSM in the US is 2438.020597214556 km2


## Prepare TransitionZero Solar Asset Mapper (SAM)

In [3]:
# Call data
sam = gpd.read_file(sam_path)

# Transform to projection of USPVDB
sam = sam.to_crs(uspvdb.crs)

# For now, subset sam to only include the area of the US
sam = sam[sam.intersects(US_boundary.unary_union)]

# Set a column for agrivoltaic type and fill with NaN
sam['AVtype'] = np.nan

# Calculate area and add as a new column
sam['area'] = sam['geometry'].area

# Set a column for moduleType and assume all modules are 'c-si'
sam['modType'] = 'c-si'

# Set an empty column for azimuth filled with -9999
sam['azimuth'] = -9999

# Set an empty column for mount filled with NaN
sam['mount'] = np.nan
sam['mount'] = sam['mount'].astype(object)

# Format constructed_before and constructed_after columns as dates. Current format is "2017-12-21T15:41:18.469999+00:00" or "None"
sam['constructed_before'] = pd.to_datetime(sam['constructed_before'], errors='coerce')
sam['constructed_after'] = pd.to_datetime(sam['constructed_after'], errors='coerce')

# Get the midpoint (mean) date between constructed_before and constructed_after
sam['instYr'] = sam[['constructed_before', 'constructed_after']].mean(axis=1).dt.year

# If instYr is NaN or less than or equal to 2017, set to -9999. The upper bound (>= 2024) is commented out because we now allow 2024 as the most recent year
sam.loc[sam['instYr'].isnull(), 'instYr'] = -9999
sam.loc[sam['instYr'] <= 2017, 'instYr'] = -9999
#sam.loc[sam['instYr'] >= 2024, 'instYr'] = -9999

# Format data
sam = formatDf(df = sam, nativeIdentifier = 'cluster_id', installationYear = 'instYr', capacityMWdc = 'capacity_mw', area_m2 = 'area', moduleType = 'modType', agrivoltaicType = 'AVtype', azimuth = 'azimuth', mountTechnology = 'mount', source = 'SAM')

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some corrections specific to SAM 

# Don't unbuffer here, as we need to keep the original geometry for the following random forest model

# Some cluster_id values are repeated, so we need to drop duplicates
sam = sam.drop_duplicates(subset='nativeID')

# Explode multipolygons. Many are erroneous, or contain rooftop arrays next to ground mounted. We need to append the 'nativeID' columns with '_n' where n is the index of the polygon in the multipolygon
sam = sam.explode(index_parts=False) # We'll do this manually
sam['nativeID'] = sam['nativeID'] + '_' + sam.groupby('nativeID').cumcount().astype(str)

# Drop rows where the geometry is None (some can arise from the explode function)
sam = sam[~sam["geometry"].isnull()]

# Drop area column, and recalculate area
sam = sam.drop(columns=['area'], errors='ignore')
sam['area'] = sam['geometry'].area

# Reset index
sam = sam.reset_index(drop=True)

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Export

# Print the number of unique arrays
print(f'The number of arrays in SAM in the US is {len(sam)}')

# Print the total area in km2 of the arrays
print(f'The total area of arrays in SAM in the US is {sam["area"].sum() / 1e6} km2')

# Export shapefile
sam.to_file(os.path.join(derivedTemp_path, r'sam_poly.shp'))

The number of arrays in SAM in the US is 12208
The total area of arrays in SAM in the US is 3910.159133517089 km2
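
The `_n` suffix added to `nativeID` after exploding multipolygons relies on `groupby(...).cumcount()`, which numbers rows within each group starting at 0. A minimal sketch on a toy table (the ID values are made up):

```python
import pandas as pd

# Toy result of an explode: rows that came from the same multipolygon share a nativeID
parts = pd.DataFrame({'nativeID': ['101', '101', '102', '101']})

# cumcount() numbers the rows within each nativeID group: 0, 1, 2, ...
suffix = parts.groupby('nativeID').cumcount().astype(str)
parts['nativeID'] = parts['nativeID'] + '_' + suffix

print(parts['nativeID'].tolist())
# ['101_0', '101_1', '102_0', '101_2']
```

Every row receives a suffix, including IDs that appear only once, so the resulting identifiers are unique as long as the original IDs do not already contain a clashing `_n` pattern.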


## Create an Existing Solar PV Array Shapefile with Ordered Importance
NOTE: If this code chunk changes (e.g., reordering preference or a new dataset is included), `script7` **Return GMSEUSgeorect Source Attribute to Original Spatial Source** must also be updated. 

In [33]:
# Call polygon data
uspvdb = gpd.read_file(os.path.join(derivedTemp_path, r'uspvdb_poly.shp'))
ccvpv = gpd.read_file(os.path.join(derivedTemp_path, r'ccvpv_poly.shp'))
cwsd = gpd.read_file(os.path.join(derivedTemp_path, r'cwsd_poly.shp'))
osm = gpd.read_file(os.path.join(derivedTemp_path, r'osm_poly.shp'))
sam = gpd.read_file(os.path.join(derivedTemp_path, r'sam_poly.shp'))

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Merge datasets removing duplicates
 
# Some datasets have multipolygons that add value, and are grouped together either erroneously or with updated data compared to higher-spatial-quality datasets. 
# To navigate this, we'll explode the datasets, remove sub-shapes by intersection and spatial quality, then dissolve by tempID to return the best available dataset.

# First, give each dataset a unique identifier that does not repeat across datasets. We'll use the 'tempID' column for this, which is the row index of the dataset + Source as a string. We will reset this index later.
uspvdb['tempID'] = uspvdb.index.astype(str) + '_USPVDB'
ccvpv['tempID'] = ccvpv.index.astype(str) + '_CCVPV'
cwsd['tempID'] = cwsd.index.astype(str) + '_CWSD'
osm['tempID'] = osm.index.astype(str) + '_OSM'
sam['tempID'] = sam.index.astype(str) + '_SAM'

# Explode each dataset to polygons, except USPVDB (maintain quality of USPVDB)
ccvpv = ccvpv.explode(index_parts=False)
cwsd = cwsd.explode(index_parts=False)
osm = osm.explode(index_parts=False)
sam = sam.explode(index_parts=False)

# Remove arrays with overlap in the following level of priority: USPVDB, CCVPV, CWSD, OSM, SAM
ccvpv = ccvpv[~ccvpv.intersects(uspvdb.unary_union)]
cwsd = cwsd[~cwsd.intersects(uspvdb.unary_union)]
cwsd = cwsd[~cwsd.intersects(ccvpv.unary_union)]
osm = osm[~osm.intersects(uspvdb.unary_union)]
osm = osm[~osm.intersects(ccvpv.unary_union)]
osm = osm[~osm.intersects(cwsd.unary_union)]
sam = sam[~sam.intersects(uspvdb.unary_union)]
sam = sam[~sam.intersects(ccvpv.unary_union)]
sam = sam[~sam.intersects(cwsd.unary_union)]
sam = sam[~sam.intersects(osm.unary_union)] 

# Dissolve by tempID, maintain both columns. 
ccvpv = ccvpv.dissolve(by=['tempID'], as_index=False)
cwsd = cwsd.dissolve(by=['tempID'], as_index=False)
osm = osm.dissolve(by=['tempID'], as_index=False)
sam = sam.dissolve(by=['tempID'], as_index=False)

# Merge all datasets
merged = gpd.GeoDataFrame(pd.concat([uspvdb, ccvpv, cwsd, osm, sam], ignore_index=True), crs=uspvdb.crs)

# Mask the merged dataset with the US boundary (this step is redundant, but it is a good check)
merged = merged[merged.intersects(US_boundary.unary_union)]

# Given the modifications to the datasets, we need to re-calculate the area. Also note that cap_mw is no longer accurate given the possibility of multipolygon alteration.
merged['area'] = merged['geometry'].area

# Depending on the source method, there may be some erroneous geometries created that have an area of 0 or near zero (e.g., unbuffer of CCVPV arrays). 
# The smallest real array/panel-row is ~28 m2. Remove all arrays smaller than the minPanelRowArea threshold set in the config. 
merged = merged[merged['area'] >= minPanelRowArea]

# Drop all columns that are not needed for the analysis
merged = merged[['nativeID', 'instYr', 'cap_mw', 'area', 'modType', 'AVtype', 'azimuth', 'mount', 'Source', 'geometry']]

# Print the number of unique arrays
print(f'The number of unique arrays in the merged dataset is {len(merged)}')

# Print the total area in km2 of the arrays
print(f'The total area of arrays in the merged dataset is {merged["area"].sum() / 1e6} km2')

# Set a tempID that is the row index for the merged dataset
merged = merged.reset_index(drop=True)
merged['tempID'] = merged.index

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Check geometries and export

# Due to the buffering and unbuffering of CCVPV and OSM arrays, and for unknown reasons in other datasets, some multipolygons contain erroneous geometries that result in a near-zero area, linestrings, or points.
# To check for and remove these, we'll explode merged, calculate a temporary area, remove subarrays that are smaller than minPanelRowArea, then dissolve by tempID.
merged = merged.explode(index_parts=False)
merged['tempArea'] = merged['geometry'].area
merged = merged[merged['tempArea'] >= minPanelRowArea]
merged = merged.dissolve(by=['tempID'], as_index=False)
merged = merged.drop(columns=['tempArea'])
merged = merged.reset_index(drop=True)

# Export shapefile
merged.to_file(os.path.join(derivedTemp_path, r'existingDatasetArrayShapes.shp'))

The number of unique arrays in the merged dataset is 16661
The total area of arrays in the merged dataset is 3051.2023025288795 km2
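
The priority filter above keeps a lower-priority footprint only if it does not intersect anything retained from a higher-priority source. The sketch below shows the same idea as a loop over toy `shapely` boxes; the three-source list and its coordinates are hypothetical, and the real cell works on GeoDataFrames with five sources.

```python
from shapely.geometry import box
from shapely.ops import unary_union

# Hypothetical toy footprints, listed from highest to lowest priority
sources = [
    ('USPVDB', [box(0, 0, 10, 10)]),
    ('CCVPV', [box(5, 5, 15, 15), box(50, 50, 60, 60)]),
    ('OSM', [box(55, 55, 65, 65), box(200, 200, 210, 210)]),
]

kept = []        # (source, polygon) pairs that survive the filter
kept_geoms = []  # geometries retained from all higher-priority sources so far

for name, polys in sources:
    # Union of everything already kept; None for the top-priority source
    higher = unary_union(kept_geoms) if kept_geoms else None
    survivors = [p for p in polys if higher is None or not p.intersects(higher)]
    kept.extend((name, p) for p in survivors)
    kept_geoms.extend(survivors)

print([name for name, _ in kept])
# ['USPVDB', 'CCVPV', 'OSM']
```

Here the first CCVPV box is dropped because it overlaps USPVDB, and the first OSM box is dropped because it overlaps the surviving CCVPV box. Polygons within the same source are not compared against each other, which matches the per-dataset filtering in the cell above.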


# Prepare Solar Array Datasets with centroid spatial information, overlay with existing, and manually digitize missing array shapes

## Prepare NREL InSPIRE Database

In [34]:
# Call data
inspire = pd.read_csv(inspire_path)

# Remove rows with NaN in the "Coordinates" column
inspire = inspire.dropna(subset=['Coordinates'])

# Split the "Coordinates" column into two columns: "Latitude" and "Longitude" (base column is "41.843493, -90.036077")
inspire[['Latitude', 'Longitude']] = inspire['Coordinates'].str.split(', ', expand=True)

# Create a centroid shape file from the Latitude and Longitude columns
inspire['geometry'] = gpd.points_from_xy(inspire['Longitude'], inspire['Latitude'])

# Convert to GeoDataFrame
inspire = gpd.GeoDataFrame(inspire, geometry='geometry', crs='EPSG:4326')

# Set (WGS84) and Transform to projection of USPVDB
inspire = inspire.set_crs(epsg=4326)
inspire = inspire.to_crs(uspvdb.crs)

# Mask inspire database to CONUS Boundary 
inspire = inspire[inspire.intersects(US_boundary.unary_union)]

# Remove "InSPIRE/Sites/" from all rows in the "Name" column
inspire['Name'] = inspire['Name'].str.replace('InSPIRE/Sites/', '')

# Site Size is in acres, convert to square meters
inspire['Site Size'] = inspire['Site Size'] * acre_to_m2

# Set a column for moduleType and assume all modules are 'c-si'. This dataset does contain 'PV Technology' column differentiating between Monofacial, Bifacial, and Translucent modules, but we will assume all are c-si for now.
inspire['modType'] = 'c-si'

# Set an empty column for azimuth filled with -9999
inspire['azimuth'] = -9999

# Create a mount column. If 'Type Of Array' is Single-axis Tracking, set to single_axis. If 'Type Of Array' is Fixed, set to fixed_axis.
inspire['mount'] = np.nan
inspire['mount'] = inspire['mount'].astype(object)
inspire.loc[inspire['Type Of Array'] == 'Single-axis Tracking', 'mount'] = 'single_axis'
inspire.loc[inspire['Type Of Array'] == 'Fixed', 'mount'] = 'fixed_axis'

# Format data
inspire = formatDf(df = inspire, nativeIdentifier = 'Name', installationYear = 'Year Installed', capacityMWdc = 'System Size', area_m2 = 'Site Size', moduleType = 'modType', agrivoltaicType = 'Habitat Type', azimuth = 'azimuth', mountTechnology = 'mount', source = 'InSPIRE')

# Print the number of arrays
print(f'The number of arrays in InSPIRE in the US is {len(inspire)}')

# Export shapefile
inspire.to_file(os.path.join(derivedTemp_path, r'inspire_point.shp'))

The number of arrays in InSPIRE in the US is 571
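
The InSPIRE coordinates arrive as a single `"lat, lon"` string. A minimal sketch of the parsing step on toy rows is shown below; it converts the pieces to floats explicitly and tolerates a missing space after the comma, which is an assumption about possible input variation rather than something observed in the source file.

```python
import pandas as pd

# Toy rows in the same "latitude, longitude" layout as the Coordinates column
sites = pd.DataFrame({'Coordinates': ['41.843493, -90.036077', None, '35.0,-120.5']})

# Drop rows without coordinates, then split on the comma
sites = sites.dropna(subset=['Coordinates']).copy()
latlon = sites['Coordinates'].str.split(',', expand=True)

# Strip stray whitespace and convert to numeric values
sites['Latitude'] = pd.to_numeric(latlon[0].str.strip())
sites['Longitude'] = pd.to_numeric(latlon[1].str.strip())

print(sites[['Latitude', 'Longitude']])
```

When building point geometries, longitude is the x coordinate and latitude is the y coordinate, which is why the cell above passes `Longitude` first to `gpd.points_from_xy`.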


## Prepare LBNL-USS Database

In [35]:
# Call LBNL USS data
lbnlUss = pd.read_csv(lbnlUss_path)

# Create a centroid shape file from the Latitude and Longitude columns
lbnlUss['geometry'] = gpd.points_from_xy(lbnlUss['Longitude'], lbnlUss['Latitude'])

# Convert to GeoDataFrame
lbnlUss = gpd.GeoDataFrame(lbnlUss, geometry='geometry', crs='EPSG:4326')

# Set (WGS84) and Transform to projection of USPVDB
lbnlUss = lbnlUss.set_crs(epsg=4326)
lbnlUss = lbnlUss.to_crs(uspvdb.crs)

# Mask lbnlUss database to CONUS Boundary 
lbnlUss = lbnlUss[lbnlUss.intersects(US_boundary.unary_union)]

# Set a new column called modType. Set to lower case. 
# If Solar Tech Sub is trough or tower, set to csp. If Solar Tech Sub is thin-film, keep thin-film. Set all other values (c-si, cpv, combo (c-si, cpv), combo (c-si, thin-film, cpv), and anything unrecognized or missing) to c-si.
lbnlUss['modType'] = lbnlUss['Solar Tech Sub'].str.lower()
# Recode CSP first so the catch-all below does not overwrite trough or tower rows
lbnlUss.loc[lbnlUss['modType'].isin(['trough', 'tower']), 'modType'] = 'csp'
lbnlUss.loc[~lbnlUss['modType'].isin(['thin-film', 'csp']), 'modType'] = 'c-si'

# Set an agrivoltaic type column and fill with NaN
lbnlUss['AVtype'] = np.nan

# Set a column for area in m2 and fill with -9999
lbnlUss['area'] = -9999

# Try to convert Azimuth column string to float. If fails, fill with NaN
lbnlUss['Azimuth'] = pd.to_numeric(lbnlUss['Azimuth'], errors='coerce')

# Set a mount column. If 'Tracking Type' is 'Single Axis', set to single_axis. If 'Tracking Type' is 'Fixed Tilt', set to fixed_axis. If 'Tracking Type' is 'Dual-Axis', set to dual_axis. If 'Tracking Type' is 'Fixed, Single, Double' or 'Fixed, Single', set to mixed. 
lbnlUss['mount'] = np.nan
lbnlUss['mount'] = lbnlUss['mount'].astype(object)
lbnlUss.loc[lbnlUss['Tracking Type'] == 'Single Axis', 'mount'] = 'single_axis'
lbnlUss.loc[lbnlUss['Tracking Type'] == 'Fixed Tilt', 'mount'] = 'fixed_axis'
lbnlUss.loc[lbnlUss['Tracking Type'] == 'Dual-Axis', 'mount'] = 'dual_axis'
lbnlUss.loc[lbnlUss['Tracking Type'].isin(['Fixed, Single, Double', 'Fixed, Single']), 'mount'] = 'mixed'

# Format data
lbnlUss = formatDf(df = lbnlUss, nativeIdentifier = 'Project Name', installationYear = 'Solar COD Year', capacityMWdc = 'Solar Capacity MW-DC', area_m2 = 'area', moduleType = 'modType', agrivoltaicType = 'AVtype', azimuth = 'Azimuth', mountTechnology = 'mount', source = 'LBNLUSS')

# Print the number of arrays
print(f'The number of arrays in LBNLUSS in the US is {len(lbnlUss)}')

# Export shapefile
lbnlUss.to_file(os.path.join(derivedTemp_path, r'lbnlUss_point.shp'))

The number of arrays in LBNLUSS in the US is 1503
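The module-type recoding described in the comments of the cell above depends on the order of the rules: a catch-all that runs first would absorb the thin-film and CSP labels. A minimal plain-Python sketch of the intended mapping follows; `normalize_module_type` and `CSP_LABELS` are hypothetical names used only for illustration.

```python
# Labels that indicate concentrating solar power rather than PV modules.
CSP_LABELS = {"trough", "tower"}


def normalize_module_type(raw):
    """Collapse a raw technology label to 'c-si', 'thin-film', or 'csp'."""
    if raw is None:
        return "c-si"  # missing values default to c-si
    label = str(raw).strip().lower()
    if label in CSP_LABELS:
        return "csp"
    if label == "thin-film":
        return "thin-film"
    # c-si, cpv, combos, and anything unrecognized fall through to c-si
    return "c-si"


for example in ["Thin-Film", "Tower", "Combo (c-Si, CPV)", None]:
    print(example, "->", normalize_module_type(example))
```

Testing the specific labels before the default keeps each rule independent of the others.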


## Prepare NREL PV-DAQ Database

In [36]:
# Call data
pvdaq_sys = pd.read_csv(pvdaq_SysPath)
pvdaq_map = pd.read_csv(pvdaq_MapPath)

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Prepare site and system names for merging

# Remove "PVDAQ/Sites/" from system_public_name in pvdaq_map
pvdaq_map['system_public_name'] = pvdaq_map['system_public_name'].str.replace('PVDAQ/Sites/', '')

# Remove all system ids (in the form of "[####] ", for 1 to 4 numbers) for system_public_name in pvdaq_sys
pvdaq_sys['system_public_name'] = pvdaq_sys['system_public_name'].str.replace(r'\[\d{1,4}\]\s', '', regex=True)

# Replace "_" with " " in system_public_name in pvdaq_sys
pvdaq_sys['system_public_name'] = pvdaq_sys['system_public_name'].str.replace('_', ' ')

# Try to merge on system_public_name
pvdaq = pd.merge(pvdaq_map, pvdaq_sys, on='system_public_name', how='inner')

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Process and prepare PVDAQ data

# Remove all rows with PVDAQArray Configuration = "Fixed Roof"
pvdaq = pvdaq[pvdaq['PVDAQArray Configuration'] != 'Fixed Roof']

# Create a centroid shape file from the Latitude and Longitude columns
pvdaq['geometry'] = gpd.points_from_xy(pvdaq['site_longitude'], pvdaq['site_latitude'])

# Convert to GeoDataFrame and transform to projection of USPVDB
pvdaq = gpd.GeoDataFrame(pvdaq, geometry='geometry', crs=4326)
pvdaq = pvdaq.to_crs(uspvdb.crs)

# Mask the US boundary
pvdaq = pvdaq[pvdaq.intersects(US_boundary.unary_union)]

# To start, assume that 'PVDAQfirst timestamp' is the installation year. In the format '12/29/2010  2:30:00 PM', extract the year. For arrays with timestamps that don't fit this format, set year to -9999
pvdaq['system_year'] = pd.to_datetime(pvdaq['PVDAQfirst timestamp'], errors='coerce').dt.year.fillna(-9999)

# The column 'PVDAQSystemSize' is in kWdc, convert to MWdc
pvdaq['system_size'] = pvdaq['PVDAQSystemSize'] / 1000

# Set a column for agrivoltaic type and fill with NaN
pvdaq['AVtype'] = np.nan

# Set a placeholder installation year column filled with the integer no-data value -9999
pvdaq['Year'] = -9999

# Set a column for area and fill with NaN
pvdaq['area'] = np.nan

# Set a column for moduleType and assume all modules are 'c-si'
pvdaq['modType'] = 'c-si'

# Set an empty column for azimuth filled with -9999
pvdaq['azimuth'] = -9999

# Set an empty column for mount filled with NaN
pvdaq['mount'] = np.nan
pvdaq['mount'] = pvdaq['mount'].astype(object)

# Format data (used to use system_id, but public name gives more context for searches)
pvdaq = formatDf(df = pvdaq, nativeIdentifier = 'system_public_name', installationYear = 'system_year', capacityMWdc = 'system_size', area_m2 = 'area', moduleType = 'modType', agrivoltaicType = 'AVtype', azimuth = 'azimuth', mountTechnology = 'mount', source = 'PVDAQ')

# Print the number of arrays
print(f'The number of arrays in PVDAQ in the US is {len(pvdaq)}')

# Export 
pvdaq.to_file(os.path.join(derivedTemp_path, r'pvdaq_point.shp'))

The number of arrays in PVDAQ in the US is 16
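The merge in the cell above only works if both tables spell `system_public_name` the same way. The sketch below reproduces the two cleaning steps with the standard `re` module so they can be checked on a single string. The example names are invented for illustration and are not taken from the PV-DAQ files; `clean_map_name` and `clean_system_name` are hypothetical helpers.

```python
import re

# Matches a bracketed system id of 1 to 4 digits followed by one whitespace character.
_ID_PREFIX = re.compile(r"\[\d{1,4}\]\s")


def clean_map_name(name):
    """Strip the 'PVDAQ/Sites/' path prefix used in the map table."""
    return name.replace("PVDAQ/Sites/", "")


def clean_system_name(name):
    """Strip the '[####] ' id prefix and replace underscores with spaces."""
    return _ID_PREFIX.sub("", name).replace("_", " ")


# Invented example names, used only to show that both sides converge.
print(clean_map_name("PVDAQ/Sites/Example Site A"))
print(clean_system_name("[1423] Example_Site_A"))
```

Because the join is an inner join, any name that still differs after cleaning is silently dropped, so comparing the row counts before and after the merge is a useful check.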


## Prepare NREL & IEA SolarPaces Database

In [None]:
# Call data
solarPaces = pd.read_csv(solarPaces_path)

# Subset for Country = 'United States' and Status = 'Operational' or 'Currently Non-Operational'
solarPaces = solarPaces[solarPaces['Country'] == 'United States']
solarPaces = solarPaces[solarPaces['Status'].isin(['Operational', 'Currently Non-Operational'])]

# Split Location_coordinates column into Latitude and Longitude (comma separated)
solarPaces[['Latitude', 'Longitude']] = solarPaces['Location_coordinates'].str.split(',', expand=True)

# Create a centroid shape file from the Latitude and Longitude columns
solarPaces['geometry'] = gpd.points_from_xy(solarPaces['Longitude'].astype(float), solarPaces['Latitude'].astype(float))

# Convert to GeoDataFrame (WGS84)
solarPaces = gpd.GeoDataFrame(solarPaces, geometry='geometry', crs=4326)

# Transform to projection of USPVDB
solarPaces = solarPaces.to_crs(uspvdb.crs)

# Mask for US boundary
solarPaces = solarPaces[solarPaces.intersects(US_boundary.unary_union)]

# Set a column for agrivoltaic type and fill with NaN
solarPaces['AVType'] = np.nan

# Set a column for area and fill with -9999
solarPaces['area'] = -9999 # Land_area_whole_station_not_solar_field_km2 column exists, but is not solar field area

# Set a modType column to 'csp' for all rows
solarPaces['modType'] = 'csp'

# Set an empty column for azimuth filled with -9999
solarPaces['azimuth'] = -9999

# Set a mount column. If Technology is 'Parabolic Trough', 'Linear Fresnel', 'Hybrid, Parabolic Trough', or 'Hybrid, Linear Fresnel', set to single_axis. If Technology is 'Power Tower', 'Dish', or 'Beam-Down Tower', set to dual_axis. We won't make assumptions about other hybrids. 
solarPaces['mount'] = np.nan
solarPaces['mount'] = solarPaces['mount'].astype(object)
solarPaces.loc[solarPaces['Technology'].isin(['Parabolic Trough', 'Linear Fresnel', 'Hybrid, Parabolic Trough', 'Hybrid, Linear Fresnel']), 'mount'] = 'single_axis'
solarPaces.loc[solarPaces['Technology'].isin(['Power Tower', 'Dish', 'Beam-Down Tower']), 'mount'] = 'dual_axis'

# Print the number of single_axis arrays
print(f'The number of single_axis arrays in SolarPACES in the US is {len(solarPaces[solarPaces["mount"] == "single_axis"])}')

# Format data (used to be OpenCSP_ID, but Power_station gives more context for searches)
solarPaces = formatDf(df = solarPaces, nativeIdentifier = 'Power_station', installationYear = 'Year_operational', capacityMWdc = 'Capacity_MW', area_m2 = 'area', moduleType = 'modType', agrivoltaicType = 'AVType', azimuth = 'azimuth', mountTechnology = 'mount', source = 'SolarPACES')

# Print the number of arrays
print(f'The number of arrays in SolarPACES in the US is {len(solarPaces)}')

# Export 
solarPaces.to_file(os.path.join(derivedTemp_path, r'solarPaces_point.shp'))

The number of arrays in SolarPACES in the US is 13


In [48]:
# Print the number of single_axis arrays
print(f'The number of single_axis arrays in SolarPACES in the US is {len(solarPaces[solarPaces["mount"] == "single_axis"])}')

The number of single_axis arrays in SolarPACES in the US is 8
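The SolarPACES coordinates arrive as a single comma-separated string in latitude, longitude order, while `gpd.points_from_xy` expects x (longitude) first. A small plain-Python sketch of the parsing step, with a range check added as an assumption (the notebook does not validate ranges), is shown below. `parse_lat_lon` is a hypothetical helper.

```python
def parse_lat_lon(text):
    """Parse 'lat, lon' into a (lat, lon) tuple of floats.

    Raises ValueError if the string does not hold exactly two numbers
    or if either value falls outside valid geographic ranges.
    """
    lat_text, lon_text = text.split(",")
    lat, lon = float(lat_text), float(lon_text)
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {text!r}")
    return lat, lon


lat, lon = parse_lat_lon("35.0, -117.5")
# points_from_xy takes x (longitude) first, then y (latitude).
print((lon, lat))
```

A swapped pair often fails the latitude check, which makes this kind of validation a cheap guard against reversed columns.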


## Prepare GEM GSPT Database

In [4]:
# Read the second and third sheets of the excel file (first is an about page)
gspt2 = pd.read_excel(gspt_path, sheet_name=1)  # Second sheet (<20 MW)
gspt3 = pd.read_excel(gspt_path, sheet_name=2)  # Third sheet (>20 MW)

# Combine the dataframes
gspt = pd.concat([gspt2, gspt3], ignore_index=True)

# Subset for the United States from Country/Area column
gspt = gspt[gspt['Country/Area'] == 'United States']

# Subset for Status = 'operating' and Start year less than or equal to mostRecentInstallYear. The Location accuracy = 'exact' filter is left disabled below.
gspt = gspt[gspt['Status'] == 'operating']
print(f'The number of gspt arrays where location accuracy is not exact is {len(gspt[gspt["Location accuracy"] != "exact"])}') # Print the number of gspt where location accuracy is not exact 
#gspt = gspt[gspt['Location accuracy'] == 'exact'] # 54 potentially lost arrays, although some may be inaccurate centroids of existing arrays
gspt = gspt[gspt['Start year'] <= mostRecentInstallYear]

# Create a centroid shape file from the Latitude and Longitude columns
gspt['geometry'] = gpd.points_from_xy(gspt['Longitude'], gspt['Latitude'])

# Convert to GeoDataFrame and transform to projection of USPVDB
gspt = gpd.GeoDataFrame(gspt, geometry='geometry', crs=4326)
gspt = gspt.to_crs(uspvdb.crs)

# Mask for US boundary
gspt = gspt[gspt.intersects(US_boundary.unary_union)]

# For all rows with capacity rating of 'MWac', convert to 'MWdc' by multiplying by 1.2 (assumed DC-to-AC ratio of 1.2)
gspt.loc[gspt['Capacity Rating'] == 'MWac', 'Capacity (MW)'] = gspt.loc[gspt['Capacity Rating'] == 'MWac', 'Capacity (MW)'] * 1.2

# Set a column for agrivoltaic type and fill with NaN
gspt['AVtype'] = np.nan

# Set a column for area and fill with NaN
gspt['area'] = np.nan

# Set Technology Type as modType
gspt['modType'] = gspt['Technology Type'].str.lower()

# If modType contains 'pv', set to c-si. If modType contains 'thermal', set to csp. Else, set to c-si. 
gspt.loc[gspt['modType'].str.contains('pv', case=False, na=False), 'modType'] = 'c-si'
gspt.loc[gspt['modType'].str.contains('thermal', case=False, na=False), 'modType'] = 'csp'
gspt.loc[~gspt['modType'].isin(['c-si', 'csp']), 'modType'] = 'c-si'

# Set an empty column for azimuth filled with -9999
gspt['azimuth'] = -9999

# Set an empty column for mount filled with NaN
gspt['mount'] = np.nan
gspt['mount'] = gspt['mount'].astype(object)

# Format data
gspt = formatDf(df = gspt, nativeIdentifier = 'Project Name', installationYear = 'Start year', capacityMWdc = 'Capacity (MW)', area_m2 = 'area', moduleType= 'modType', agrivoltaicType = 'AVtype', azimuth = 'azimuth', mountTechnology = 'mount', source = 'GSPT')

# Print the number of arrays
print(f'The number of arrays in GSPT in the US is {len(gspt)}')

# Export
gspt.to_file(os.path.join(derivedTemp_path, r'gspt_point.shp'))

The number of gspt arrays where location accuracy is not exact is 54
The number of arrays in GSPT in the US is 5524
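The AC-to-DC adjustment in the cell above is a single multiplication by the assumed DC-to-AC ratio. The sketch below restates it as a function so the assumption is explicit. `DC_AC_RATIO` and `to_mwdc` are hypothetical names, and every rating label other than `'MWac'` is passed through unchanged, as in the notebook.

```python
# Assumed DC-to-AC ratio taken from the notebook comment, not a measured value.
DC_AC_RATIO = 1.2


def to_mwdc(capacity_mw, rating):
    """Return capacity in MWdc, scaling only values reported as MWac."""
    if rating == "MWac":
        return capacity_mw * DC_AC_RATIO
    return capacity_mw


print(to_mwdc(100.0, "MWac"))
print(to_mwdc(50.0, "unknown"))
```

Rows with an unknown rating keep their reported value, so they may understate DC capacity if they were in fact reported in AC terms.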


## Prepare WRI GPPDB Database

In [39]:
# Call data
gppdb = pd.read_csv(gppdb_path)

# Filter for country = 'USA' and primary_fuel = 'Solar'
gppdb = gppdb[(gppdb['country'] == 'USA') & (gppdb['primary_fuel'] == 'Solar')]

# Create centroid geometry from latitude and longitude
gppdb['geometry'] = gpd.points_from_xy(gppdb['longitude'], gppdb['latitude'])

# Create a GeoDataFrame
gppdb = gpd.GeoDataFrame(gppdb, crs=gee_crs)

# Transform to projection of USPVDB
gppdb = gppdb.to_crs(uspvdb.crs)

# Mask the gppdb dataset with the US boundary
gppdb = gppdb[gppdb.intersects(US_boundary.unary_union)]

# Print unique values for source
# print(gppdb['source'].unique())

# Set a column for agrivoltaic type and fill with NaN
gppdb['AVtype'] = np.nan

# Add modType column and fill with 'c-si'
gppdb['modType'] = 'c-si'

# Add area column and fill with -9999
gppdb['area'] = -9999

# Add azimuth column and fill with -9999
gppdb['azimuth'] = -9999

# Add mount column and fill with NaN
gppdb['mount'] = np.nan
gppdb['mount'] = gppdb['mount'].astype(object)

# Format data (used to use gppd_idnr, but name gives more context for searches)
gppdb = formatDf(df = gppdb, nativeIdentifier = 'name', installationYear = 'commissioning_year', capacityMWdc = 'capacity_mw', area_m2 = 'area', moduleType = 'modType', agrivoltaicType = 'AVtype', azimuth = 'azimuth', mountTechnology = 'mount', source = 'GPPDB')

# Print the number of arrays
print(f'The number of arrays in GPPDB in the US is {len(gppdb)}')

# Export shapefile
gppdb.to_file(os.path.join(derivedTemp_path, r'gppdb_point.shp'))


The number of arrays in GPPDB in the US is 3248


## Export InSPIRE, LBNL-USS, PV-DAQ, SolarPACES, GSPT, and GPPDB Points in Need of Digitization
Prioritized in order of perceived attribute and method-derivation quality

In [40]:
# Call polygon data
existingDatasetArrayShapes = gpd.read_file(os.path.join(derivedTemp_path, r'existingDatasetArrayShapes.shp'))

# Call point data
inspire = gpd.read_file(os.path.join(derivedTemp_path, r'inspire_point.shp'))
lbnlUss = gpd.read_file(os.path.join(derivedTemp_path, r'lbnlUss_point.shp'))
pvdaq = gpd.read_file(os.path.join(derivedTemp_path, r'pvdaq_point.shp'))
solarPaces = gpd.read_file(os.path.join(derivedTemp_path, r'solarPaces_point.shp'))
gspt = gpd.read_file(os.path.join(derivedTemp_path, r'gspt_point.shp'))
gppdb = gpd.read_file(os.path.join(derivedTemp_path, r'gppdb_point.shp'))

# Buffer the point data by 100 meters (overlapDist) to account for potential misalignment of point data
inspire_buffer = inspire.copy()
inspire_buffer['geometry'] = inspire_buffer.buffer(overlapDist)
lbnlUss_buffer = lbnlUss.copy()
lbnlUss_buffer['geometry'] = lbnlUss_buffer.buffer(overlapDist)
pvdaq_buffer = pvdaq.copy()
pvdaq_buffer['geometry'] = pvdaq_buffer.buffer(overlapDist)
solarPaces_buffer = solarPaces.copy()
solarPaces_buffer['geometry'] = solarPaces_buffer.buffer(overlapDist)
gspt_buffer = gspt.copy()
gspt_buffer['geometry'] = gspt_buffer.buffer(overlapDist)
gppdb_buffer = gppdb.copy()
gppdb_buffer['geometry'] = gppdb_buffer.buffer(overlapDist)

# Keep buffered points that do not intersect buffered points from higher-priority datasets. Spatial quality is in theory the same for these, so given perceived dataset quality, use the following order of priority: InSPIRE, LBNL-USS, PVDAQ, SolarPACES, GSPT, GPPDB
inspire_unique = inspire_buffer
lbnlUss_unique = lbnlUss_buffer[~lbnlUss_buffer.intersects(inspire_buffer.unary_union)]
pvdaq_unique = pvdaq_buffer[~pvdaq_buffer.intersects(inspire_buffer.unary_union)]
pvdaq_unique = pvdaq_unique[~pvdaq_unique.intersects(lbnlUss_buffer.unary_union)]
solarPaces_unique = solarPaces_buffer[~solarPaces_buffer.intersects(inspire_buffer.unary_union)]
solarPaces_unique = solarPaces_unique[~solarPaces_unique.intersects(lbnlUss_buffer.unary_union)]
solarPaces_unique = solarPaces_unique[~solarPaces_unique.intersects(pvdaq_buffer.unary_union)]
gspt_unique = gspt_buffer[~gspt_buffer.intersects(inspire_buffer.unary_union)]
gspt_unique = gspt_unique[~gspt_unique.intersects(lbnlUss_buffer.unary_union)]
gspt_unique = gspt_unique[~gspt_unique.intersects(pvdaq_buffer.unary_union)]
gspt_unique = gspt_unique[~gspt_unique.intersects(solarPaces_buffer.unary_union)]
gppdb_unique = gppdb_buffer[~gppdb_buffer.intersects(inspire_buffer.unary_union)]
gppdb_unique = gppdb_unique[~gppdb_unique.intersects(lbnlUss_buffer.unary_union)]
gppdb_unique = gppdb_unique[~gppdb_unique.intersects(pvdaq_buffer.unary_union)]
gppdb_unique = gppdb_unique[~gppdb_unique.intersects(solarPaces_buffer.unary_union)]
gppdb_unique = gppdb_unique[~gppdb_unique.intersects(gspt_buffer.unary_union)]

# Merge the unique point data
mergedPoints = gpd.GeoDataFrame(pd.concat([inspire_unique, lbnlUss_unique, pvdaq_unique, solarPaces_unique, gspt_unique, gppdb_unique], ignore_index=True), crs=uspvdb.crs)

# Keep buffered points that do not intersect the existing array polygons
points_unique = mergedPoints[~mergedPoints.intersects(existingDatasetArrayShapes.unary_union)]

# Print the resulting number of arrays needing manual digitization
print("Unique arrays needing manual digitization: ", len(points_unique), " arrays")

# Export
points_unique.to_file(os.path.join(derivedTemp_path, r'points_toDigitize.shp'))
mergedPoints.to_file(os.path.join(derivedTemp_path, r'points_all.shp'))

Unique arrays needing manual digitization:  1616  arrays
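The priority cascade above can be summarized without any geometry library. Two circular buffers of radius `overlapDist` intersect when their centers are at most `2 * overlapDist` apart, so the sketch below uses plain distances as an approximation of the polygon buffers in the notebook. `prioritized_unique` is a hypothetical helper and the coordinates are invented.

```python
import math


def prioritized_unique(datasets, overlap_dist):
    """Keep points farther than 2 * overlap_dist from all higher-priority points.

    datasets: list of (name, [(x, y), ...]) in descending priority,
    with coordinates in a projected CRS (meters).
    """
    kept = []
    higher = []  # every point from higher-priority datasets, kept or not
    for name, points in datasets:
        for x, y in points:
            if all(math.hypot(x - hx, y - hy) > 2 * overlap_dist
                   for hx, hy in higher):
                kept.append((name, x, y))
        higher.extend(points)
    return kept


# Invented coordinates: B's first point sits 150 m from A's only point.
example = [
    ("A", [(0, 0)]),
    ("B", [(150, 0), (1000, 0)]),
    ("C", [(1100, 0), (5000, 0)]),
]
print(prioritized_unique(example, overlap_dist=100))
```

As in the notebook, each dataset is compared against all points of the higher-priority datasets rather than only the retained ones, and duplicates within a single dataset are not removed.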


# Prepare Solar Panel Datasets

In [41]:
# This cell requires ~200 minutes to run

# Call OSM and CCVPV panel data
osmPanels = gpd.read_file(osm_PanelsPath)
ccvpvPanels = gpd.read_file(ccvpv_PanelsPath)

# Call in OSM and CCVPV array data
osmArrays = gpd.read_file(osm_ArraysPath)
ccvpvArrays = gpd.read_file(ccvpv_ArraysPath)

# Ensure all datasets are in the same projection as USPVDB
osmPanels = osmPanels.to_crs(uspvdb.crs)
osmArrays = osmArrays.to_crs(uspvdb.crs)
ccvpvPanels = ccvpvPanels.to_crs(uspvdb.crs)
ccvpvArrays = ccvpvArrays.to_crs(uspvdb.crs)

# Drop unnecessary columns from OSM panels
osmPanels = osmPanels.drop(columns=['arrayID', 'PnlNum', 'Source', 'ProjName', 'cap_mw', 'instYr'])

# For Class column in CCVPV, if 'Si_Fixed_S' set to 'fixed_axis', if 'Si_Single_E/W' set to 'single_axis'. Rename the Class column to mount
ccvpvPanels.loc[ccvpvPanels['Class'] == 'Si_Fixed_S', 'Class'] = 'fixed_axis'
ccvpvPanels.loc[ccvpvPanels['Class'] == 'Si_Single_E/W', 'Class'] = 'single_axis'
ccvpvPanels = ccvpvPanels.rename(columns={'Class': 'mount'})

# Add a nativeID column to CCVPV panels that is the row index
ccvpvPanels['nativeID'] = ccvpvPanels.index

# Add a modType column to CCVPV panels that is 'c-si'
ccvpvPanels['modType'] = 'c-si'

# Calculate the area of each panel (in square meters)
ccvpvPanels['area'] = ccvpvPanels['geometry'].apply(lambda x: x.area if x.is_valid and x.area > 0 else np.nan)

# Calculate the perimeter-to-area ratio of each panel
ccvpvPanels['PmArRatio'] = ccvpvPanels['geometry'].apply(lambda x: x.length / x.area if x.is_valid and x.area > 0 else np.nan)

# Buffer osm panels by 10 meters (panelArrayBuff), dissolve, and unbuffer by -10 meters to create array geometries, then remove ccvpv panels that intersect the osm panel arrays
osmPanels_buffer = osmPanels.copy()
osmPanels_buffer['geometry'] = osmPanels_buffer.buffer(panelArrayBuff)
osmPanels_dissolved = osmPanels_buffer.dissolve()
osmPanels_dissolved['geometry'] = osmPanels_dissolved.buffer(-panelArrayBuff)

# Remove ccvpv panels that intersect with osm panel arrays. Here we prioritize OSM over CCVPV because CCVPV is derived from imagery, and because Stid et al., 2022 manually digitized row geometries that were missing row-portions, while preserving eCognition geometries (not intersecting digitized geometries).
ccvpvPanels = ccvpvPanels[~ccvpvPanels.intersects(osmPanels_dissolved.unary_union)]

# Save the panel dataset source to each dataset
osmPanels['Source'] = 'OSM'
ccvpvPanels['Source'] = 'CCVPV'

# Merge the panel data
mergedPanels = gpd.GeoDataFrame(pd.concat([osmPanels, ccvpvPanels], ignore_index=True), crs=uspvdb.crs)

# Export (not necessary, but the intersects above is the most intensive process [~182 minutes], so valuable to save out product)
#mergedPanels.to_file(os.path.join(derivedTemp_path, r'mergedPanels.shp')) 

# Print the initial number of mergedPanels to check if we are producing duplicates in the following lines
print(f'The number of initial mergedPanels is {len(mergedPanels)}')

# Add a tempID column to mergedPanels that is the row index
mergedPanels['tempID'] = mergedPanels.index

# For both osmArrays and ccvpvArrays, add an array ID column that is the row index (as a string) + a suffix for the dataset
osmArrays['osmID'] = osmArrays.index.astype(str) + '_OSM'
ccvpvArrays['ccvpvID'] = ccvpvArrays.index.astype(str) + '_CCVPV'

# Copy instYr column from OSM arrays to mergedPanels using spatial join. Also grab osmID column from osmArrays
mergedPanels = gpd.sjoin(mergedPanels, osmArrays[['instYr', 'osmID', 'geometry']], how='left', predicate='intersects')
mergedPanels = mergedPanels.reset_index(drop=True)
mergedPanels = mergedPanels.drop(columns=['index_left', 'index_right'], errors='ignore')

# Copy Yr_inst column from CCVPV arrays to mergedPanels using spatial join and call it instYr. Also grab ccvpvID column from ccvpvArrays
mergedPanels = gpd.sjoin(mergedPanels, ccvpvArrays[['Yr_inst', 'ccvpvID', 'geometry']], how='left', predicate='intersects')
mergedPanels = mergedPanels.reset_index(drop=True)
mergedPanels = mergedPanels.drop(columns=['index_left', 'index_right'], errors='ignore')

# Fill instYr column with Yr_inst where instYr is NaN
mergedPanels['instYr'] = mergedPanels['instYr'].fillna(mergedPanels['Yr_inst'])

# Drop Yr_inst column
mergedPanels = mergedPanels.drop(columns=['Yr_inst'])

# The join above may have duplicated rows, so drop duplicates
mergedPanels = mergedPanels.drop_duplicates(subset='tempID')

# Drop panels with less area than the minimum panel area
mergedPanels = mergedPanels[mergedPanels['area'] >= minPanelRowArea]

# Print number of rows in mergedPanels
print(f'Number of final panel-rows in existing datasets: {len(mergedPanels)}')

# Print the total sum of 'area' in the mergedPanels dataset in km2
print(f'Total area of panels in mergedPanels dataset is {mergedPanels["area"].sum() / 1e6} km2')

# Set an arrayID column. This should be osmID, and if NaN, ccvpvID. If both are NaN, keep NaN. 
# Then, we want to find the number of unique arrayID's in the mergedPanels dataset and print 
mergedPanels['arrayID'] = mergedPanels['osmID'].fillna(mergedPanels['ccvpvID'])
print(f'Number of unique arrays in the mergedPanels dataset is {len(mergedPanels["arrayID"].unique())}')

# Reset the index, drop the temporary ID columns (tempID, arrayID, osmID, ccvpvID), and set panelID to the new row index
mergedPanels = mergedPanels.reset_index(drop=True)
mergedPanels = mergedPanels.drop(columns=['tempID', 'arrayID', 'osmID', 'ccvpvID'])
mergedPanels['panelID'] = mergedPanels.index

# Export
mergedPanels.to_file(os.path.join(derivedTemp_path, r'existingDatasetPanelShapes.shp'))

The number of initial mergedPanels is 1079042
Number of final panel-rows in existing datasets: 1076800
Total area of panels in mergedPanels dataset is 138.13188623671547 km2
Number of unique arrays in the mergedPanels dataset is 5087
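The buffer, dissolve, and negative-buffer sequence used above to turn panel rows into array footprints is a morphological closing. As a rough one-dimensional analogue, the sketch below closes gaps between intervals on a line: rows closer than twice the buffer distance merge, and the merged extent shrinks back to the original outer edges. `close_intervals` is a hypothetical helper and does not reproduce the 2D geometry that geopandas computes.

```python
def close_intervals(intervals, gap):
    """Buffer each interval by gap, merge overlaps, then shrink by gap."""
    buffered = sorted((start - gap, end + gap) for start, end in intervals)
    merged = []
    for start, end in buffered:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous buffered interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Undo the buffer on the outer edges of each merged group.
    return [(start + gap, end - gap) for start, end in merged]


# Two rows 1 m apart merge into one footprint; a row 35 m away stays separate.
print(close_intervals([(0, 2), (3, 5), (40, 42)], gap=10))
```

The same reasoning explains why the buffer distance sets the largest inter-row spacing that still yields a single array footprint.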


### Get Panel Area Percentiles for CCVPV and OSM

In [42]:
# Call ccvpv panels, transform to projection of USPVDB, calculate area, and print the min, 1st, 5th, 95th, 99th percentile, and max panel area
ccvpvPanels = gpd.read_file(ccvpv_PanelsPath)
ccvpvPanels = ccvpvPanels.to_crs(uspvdb.crs)
ccvpvPanels['Pnl_a'] = ccvpvPanels['geometry'].area
print(f'The minimum panel area is {ccvpvPanels["Pnl_a"].min()}')
print(f'The 1st percentile panel area is {ccvpvPanels["Pnl_a"].quantile(0.01)}')
print(f'The 5th percentile panel area is {ccvpvPanels["Pnl_a"].quantile(0.05)}') 
print(f'The 95th percentile panel area is {ccvpvPanels["Pnl_a"].quantile(0.95)}')
print(f'The 99th percentile panel area is {ccvpvPanels["Pnl_a"].quantile(0.99)}')
print(f'The maximum panel area is {ccvpvPanels["Pnl_a"].max()}')

# Now do the same for the OSM panels
osmPanels = gpd.read_file(osm_PanelsPath)
osmPanels = osmPanels.to_crs(uspvdb.crs)
osmPanels['Pnl_a'] = osmPanels['geometry'].area
print(f'The minimum panel area is {osmPanels["Pnl_a"].min()}')
print(f'The 1st percentile panel area is {osmPanels["Pnl_a"].quantile(0.01)}')
print(f'The 5th percentile panel area is {osmPanels["Pnl_a"].quantile(0.05)}')
print(f'The 95th percentile panel area is {osmPanels["Pnl_a"].quantile(0.95)}')
print(f'The 99th percentile panel area is {osmPanels["Pnl_a"].quantile(0.99)}')
print(f'The maximum panel area is {osmPanels["Pnl_a"].max()}')

The minimum panel area is 0.004586387346827453
The 1st percentile panel area is 14.638318945940224
The 5th percentile panel area is 27.900074620621066
The 95th percentile panel area is 254.17212304611706
The 99th percentile panel area is 449.9727782791575
The maximum panel area is 1310.7696045794355
The minimum panel area is 15.000837364512863
The 1st percentile panel area is 15.204260869212009
The 5th percentile panel area is 15.329141866298716
The 95th percentile panel area is 407.12461017643625
The 99th percentile panel area is 866.9789413470319
The maximum panel area is 1980.0579333428238


# END: Digitize Missing Point Location Rough Array Bounds with `script2_digitizeSolarArrays`, call and combine with existing arrays

*END*

In [None]:
# Call existing dataset panel data
existingDatasetPanelShapes = gpd.read_file(os.path.join(derivedTemp_path, r'existingDatasetPanelShapes.shp'))

# Print the minimum area and the 1st percentile area of the existingDatasetPanelShapes dataset
print(f'The minimum area of panels in the existingDatasetPanelShapes dataset is {existingDatasetPanelShapes["area"].min()} m2')
print(f'The 1st percentile area of panels in the existingDatasetPanelShapes dataset is {existingDatasetPanelShapes["area"].quantile(0.01)} m2')

The minimum area of panels in the existingDatasetPanelShapes dataset is 0.004586387346827 m2
The 1st percentile area of panels in the existingDatasetPanelShapes dataset is 35.286073053263934 m2


In [None]:
print(f'The 99.9th percentile area of panels in the existingDatasetPanelShapes dataset is {existingDatasetPanelShapes["area"].quantile(0.999)} m2')

The 99.9th percentile area of panels in the existingDatasetPanelShapes dataset is 1245.953201627507 m2
