## Kingfisher County Well Header Locations
#### Comparing Vendors With Databricks Geospatial

Oil and gas companies in the United States generally get the data their geoscientists use from a handful of vendors. Many states also publish data that may be slightly more current.

Which vendor has more data? Is the data identical among vendors? Which data source is the _best_ and what does that even mean? We're not going to answer most of those questions! 😂

This exercise just compares well surface locations (Lat/Lon) in busy [Kingfisher County](https://kingfisher.okcounties.org/about), Oklahoma.

The two biggest E&P data vendors in the U.S. are: 

| Vendor | Description |
| ----------- | ----------- |
| [S&P Global](https://www.spglobal.com/commodity-insights/en/products-solutions/upstream-midstream-oil-gas) | The large incumbent, supplying data to the industry for decades |
| [Enverus](https://www.enverus.com/products/enverus-core/) | A fast-growing SaaS challenger |

_Both of these vendors have grown through multiple acquisitions; many private equity deals led to their present incarnations. You may have heard of PI/Dwights, IHS, Drilling Info..._

We are targeting Kingfisher County in Oklahoma, so we'll include data from the [OCC](https://gisdata-occokc.opendata.arcgis.com/) (Oklahoma Corporation Commission).


This seemingly simple exercise reveals more complexity than you might think. First, [what is a well](https://ppdm.org/ppdm/PPDM/Standards/What_is_a_Well/PPDM/What_is_a_Well.aspx)? Nobody agrees on common identifiers. How does the geodetic datum affect lat/lon points? Hopefully the [Streamlit](https://streamlit.io/) map will help clarify things.
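To put a number on how far apart two sources place the same well, a great-circle distance is a handy first measure. The sketch below is only an illustration: `haversine_m` is a hypothetical helper, the coordinates are made up, and it assumes both points are already in the same datum. It does not perform any NAD27-to-WGS84 transformation.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points (decimal degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    a = min(1.0, a)  # guard against floating-point overshoot
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical surface locations for the "same" well from two sources
offset_m = haversine_m(35.8600, -97.9300, 35.8601, -97.9304)
print(f"Offset between sources: {offset_m:.1f} m")
```

A spherical model is plenty for spotting offsets of tens of meters; for survey-grade work a proper geodesic library would be the better choice.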






In [0]:

def make_csv_path(county="kingfisher", vendor=""):
    catalog  = "geodata"
    schema   = "staging"
    volume   = "raw"
    file     = f"{county.lower()}_well_header_{vendor.lower()}.csv"

    csv_path = f"/Volumes/{catalog}/{schema}/{volume}/{file}"
    return csv_path

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS geodata.bronze;

DROP TABLE IF EXISTS geodata.bronze.well_header_occ;
DROP TABLE IF EXISTS geodata.bronze.well_header_sp;
DROP TABLE IF EXISTS geodata.bronze.well_header_env;

## OCC (Oklahoma Corporation Commission)
#### Live CSV download using ArcGIS Online REST API
The OCC hosts a free Esri ArcGIS Online site that supports REST queries. We limit the query to Kingfisher County (`county='KINGFISHER'`) and fetch all available fields in 2,000-row pages. The result is written as a CSV to our previously defined S3 external data source, using the path built by `make_csv_path`.
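The offset-based paging is easy to get subtly wrong, so here is a minimal offline sketch of the loop on its own. `fetch_all` and `fake_page` are hypothetical names; the stub stands in for the REST call so the stopping condition can be checked without touching the network.

```python
from typing import Callable


def fetch_all(fetch_page: Callable[[int, int], list[dict]], page_size: int = 2000) -> list[dict]:
    """Collect rows page by page until a short (or empty) page signals the end."""
    rows: list[dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:  # last page reached
            break
        offset += page_size
    return rows


# Stub data source: 4,500 fake rows served in slices, like a paged REST endpoint
fake_rows = [{"id": i} for i in range(4500)]


def fake_page(offset: int, limit: int) -> list[dict]:
    return fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_page)
print(len(all_rows))
```

When the row count is an exact multiple of the page size, the loop makes one extra request that returns an empty page and then stops.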

**Latitude/Longitude is stored in WGS84 (SRID=4326)**

In [0]:
import requests
import pandas as pd


OCC_URL = "https://gis.occ.ok.gov/server/rest/services/Hosted/RBDMS_WELLS/FeatureServer/220/query"


def _query_arcgis(county: str) -> pd.DataFrame | None:
    params = {
        "where": f"county='{county.upper()}'",
        "outFields": "*",
        "returnGeometry": "false",
        "f": "json",
        "resultOffset": 0,
        "resultRecordCount": 2000,  # page size: rows requested per call
    }

    rows = []
    while True:
        print(f"Fetching offset {params['resultOffset']} …")
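        response = requests.get(OCC_URL, params=params, timeout=60)
        response.raise_for_status()  # surface HTTP errors before parsing JSON
        data = response.json()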
        if data.get("error"):
            print("Error received from server:", data["error"])
            return None

        features = data.get("features", [])
        if not features:
            break

        rows.extend(f["attributes"] for f in features)
        if len(features) < params["resultRecordCount"]:
            break
        params["resultOffset"] += params["resultRecordCount"]

    return pd.DataFrame.from_records(rows) if rows else None


def get_occ_well_header_by_county(county="kingfisher") -> str | None:
    volume_path = make_csv_path(county=county, vendor="occ")

    df = _query_arcgis(county)

    if df is None or df.empty:
        print(f"⚠ No data found for {county} county.")
        return None

    df.to_csv(volume_path, index=False)

    return f"✅ {len(df)} records written to {volume_path}"


result = get_occ_well_header_by_county(county="kingfisher")
print(result)

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS geodata.bronze;

CREATE TABLE IF NOT EXISTS geodata.bronze.well_header_occ
USING DELTA
TBLPROPERTIES ('delta.columnMapping.mode' = 'name'); 

COPY INTO geodata.bronze.well_header_occ
FROM 's3://databricks-purrio/raw/kingfisher_well_header_occ.csv' 
FILEFORMAT = CSV 
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true') 
COPY_OPTIONS ('mergeSchema' = 'true');

## Enverus 
#### Live CSV download using Enverus Developer API
Enverus has a great, massive, Swagger-defined API with a Python client (`enverus-developer-api`). We authenticate with a secret key stored using the local Databricks CLI:

```
databricks secrets create-scope kingfisher_secrets
databricks secrets put-secret kingfisher_secrets enverus_secret_key
```

**Latitude/Longitude is stored in WGS84 (SRID=4326)**

In [0]:
%pip install -q enverus-developer-api

from enverus_developer_api import DeveloperAPIv3
import pandas as pd

SECRET_KEY = dbutils.secrets.get(scope="kingfisher_secrets", key="enverus_secret_key")
STATE = "OK"

v3 = DeveloperAPIv3(secret_key=SECRET_KEY)

def get_enverus_well_header_by_county(county="kingfisher") -> str | None:
    volume_path = make_csv_path(county=county, vendor="enverus")

    # The query object is iterable; collect every record it yields
    query = v3.query("well-headers", StateProvince=STATE, County=county.upper())
    records = list(query)

    if not records:
        print(f"⚠ No data found for {county} county.")
        return None

    df = pd.DataFrame.from_records(records)

    df.to_csv(volume_path, index=False)
    return f"✅ {len(df)} records written to {volume_path}"


result = get_enverus_well_header_by_county(county="kingfisher")
print(result)

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS geodata.bronze;

CREATE TABLE IF NOT EXISTS geodata.bronze.well_header_env
USING DELTA
TBLPROPERTIES ('delta.columnMapping.mode' = 'name'); 

COPY INTO geodata.bronze.well_header_env
FROM 's3://databricks-purrio/raw/kingfisher_well_header_enverus.csv' 
FILEFORMAT = CSV 
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true') 
COPY_OPTIONS ('mergeSchema' = 'true');

## S&P Global 
#### CSV download via well header report from Enerdeq (NOT LIVE).

Sorry, I don't have API credentials for this one. The S&P Global Web Services platform is vast and has gone through multiple generations. It's not quite as "modern" as Enverus. Note that in the U.S., data is still delivered in the NAD27 datum. This is a relic from pre-internet days when data was shipped to clients on DVDs!

Also, the S&P Well Header report format has spaces in the column headers. Normally this would cause a `DELTA_INVALID_CHARACTERS_IN_COLUMN_NAMES` error in SQL. You can permit the spaces using this option:
`TBLPROPERTIES ('delta.columnMapping.mode' = 'name');`
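If you would rather rename the columns than enable column mapping, a small helper can do the cleanup before the table is written. This is a sketch under assumptions: `sanitize_column_name` is a hypothetical helper, the sample header names are made up, and the character set reflects my understanding of what Delta rejects without column mapping (space, comma, semicolon, braces, parentheses, newline, tab, equals sign).

```python
import re

# Assumed set of characters Delta rejects in column names without column mapping
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]+")


def sanitize_column_name(name: str) -> str:
    """Replace each run of invalid characters with one underscore and trim the ends."""
    return _INVALID_CHARS.sub("_", name.strip()).strip("_")


# Hypothetical report headers, not the actual S&P column names
headers = ["Surface Latitude", "Total Depth (ft)", "Operator;Name"]
clean = [sanitize_column_name(h) for h in headers]
print(clean)
```

Two different headers can collapse to the same cleaned name, so it is worth checking the result for duplicates before renaming.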

**Latitude/Longitude is stored in NAD27 (SRID=4267)**

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS geodata.bronze;

CREATE TABLE IF NOT EXISTS geodata.bronze.well_header_sp
USING DELTA
TBLPROPERTIES ('delta.columnMapping.mode' = 'name'); 

COPY INTO geodata.bronze.well_header_sp
FROM 's3://databricks-purrio/raw/kingfisher_well_header_sp.csv' 
FILEFORMAT = CSV 
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true') 
COPY_OPTIONS ('mergeSchema' = 'true');

In [0]:
# Alternatively, use this Spark-based import to replace spaces with underscores

# df = spark.read.format("csv") \
#     .option("header", "true") \
#     .option("inferSchema", "true") \
#     .load("s3://databricks-purrio/raw/kingfisher_well_header_sp.csv")

# for col in df.columns:
#     df = df.withColumnRenamed(col, col.replace(" ", "_"))

# df.write.format("delta") \
#     .option("mergeSchema", "true") \
#     .mode("overwrite") \
#     .saveAsTable("geodata.bronze.well_header_sp")


Sanity check...

In [0]:
%sql
SHOW TABLES IN geodata.bronze;

In [0]:
%sql

select 'OCC' as data_source, count(*) as row_count from geodata.bronze.well_header_occ
union all
select 'S&P' as data_source, count(*) as row_count from geodata.bronze.well_header_sp
union all
select 'Enverus' as data_source, count(*) as row_count from geodata.bronze.well_header_env;

