In [1]:
# Imports
import numpy as np
import pandas as pd
import pathlib
import datetime
from data_methods import convert_utc, convert_to_twentyfour

# Load raw data submitted by operators
### Notes on formatting:
- Operators added their own QC indicators, thus not all columns are uniform across reports
- Values left blank in the Excel file are replaced with "nan" when imported with pandas
- Naming convention for dataframes: operator_stage
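
To illustrate the note on blank cells, here is a minimal sketch using a made-up two-row table (not operator data). It uses `pd.read_csv` on an in-memory string for convenience; blank cells read with `pd.read_excel` come through as NaN in the same way.

```python
import io
import pandas as pd

# Made-up example data: the second row leaves the QC column blank
raw = io.StringIO("Overpass,Good Quality (Y/N)\n1,Y\n2,\n")
demo = pd.read_csv(raw)

# pd.isna / Series.isna is the reliable way to detect the blank entry
blank_mask = demo["Good Quality (Y/N)"].isna()
print(blank_mask.tolist())
```

Comparing with `== "nan"` would not find these entries, because the missing value is a float NaN rather than a string.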

### Carbon Mapper Stage 1 and 2 data

#### Submission details
- Stage 1 submitted on 2023-01-03
- Stage 2 submitted on 2023-02-13

#### QC Indicator:
Column: "Good Quality (Y/N)"
- Y = good quality, quantification included for this stage
- N = not good quality, quantification estimate included for potential use in a later stage, but not included in this stage
- nan = left blank by Carbon Mapper


In [None]:
# Carbon Mapper Stage 1
cm_1_path = pathlib.PurePath('00_raw_data', 'CM_Stage1_submitted-2023-01-03.xlsx')
cm_1 = pd.read_excel(cm_1_path, sheet_name='Survey Summary')

# Carbon Mapper Stage 2
cm_2_path = pathlib.PurePath('00_raw_data', 'CM_Stage2_submitted-2023-02-13.xlsx')
cm_2 = pd.read_excel(cm_2_path, sheet_name='Survey Summary')


### GHGSat Stage 1 and 2 data

#### Submission details:
- Stage 1 data submitted on 2022-11-21
- Stage 2 data submitted on 2022-12-23
- Stage 3 data same as Stage 2, submitted 2023-02-17

#### QC Indicator:

Column: "QC Flag"
- 1 = Good conditions
- 2 = Emissions detected and quantified, but suboptimal conditions may affect SR
- 3 = Emissions detected, but not quantified due to suboptimal conditions
- 4 = Diffuse emission visible over site (presumably from previous release, due to low wind)
- 5 = Discarded (Bad weather/conditions, including clouds, cloud shadow, highly irregular aircraft trajectory, etc.)

#### Data processing notes
Unable to open the .xlsx file provided by GHGSat in Python, possibly related to read-only restrictions. I have saved the relevant data sheets as csv files to load instead; the original submissions by GHGSat are included in 00_raw_data as .xlsx files.

In [None]:
# GHGSat Stage 1
ghg_1_path = pathlib.PurePath('00_raw_data', 'GHG_Stage1_submitted-2022-11-21.csv')
ghg_1 = pd.read_csv(ghg_1_path)

# GHGSat Stage 2
ghg_2_path = pathlib.PurePath('00_raw_data', 'GHG_Stage2_submitted-2022-12-23.csv')
ghg_2 = pd.read_csv(ghg_2_path)

### Kairos Stage 1 and 2 data

#### Submission details
- Stage 1 submitted on 2022-11-17
- Stage 2 submitted on 2022-12-20
- Kairos submitted data for two pods, LS23 and LS25. They analyzed the data independently, but did not report this until after testing was complete.

#### QC Indicator:
(I ran the UNIQUE function in Excel to identify values in their original report)
- "Plane deviated from flightline"
- "PARTIAL DETECTION"
- "Cutoff - low confidence quantification"
- "Excessive methane pooling near site"
- "Excessive methane pooling over site" (appears twice - possible extra space at end?)
- "Plane deviation from flightpath"
- "Glare"
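
The duplicated "Excessive methane pooling over site" entry can be checked in pandas rather than Excel. The sketch below uses made-up flag values (not the Kairos report) and assumes the duplicate is caused by trailing whitespace, which is only a guess at this point.

```python
import pandas as pd

# Made-up flag values; the second entry has a trailing space
flags = pd.Series([
    "Excessive methane pooling over site",
    "Excessive methane pooling over site ",
    "Glare",
    None,
])

# Unique values before stripping: the trailing-space variant counts separately
raw_unique = flags.dropna().unique()

# Unique values after stripping leading/trailing whitespace
clean_flags = flags.str.strip()
clean_unique = clean_flags.dropna().unique()

print(len(raw_unique), len(clean_unique))
```

If the count drops after `str.strip()`, the two entries differ only by whitespace and can be treated as the same flag.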

#### Data processing notes
Unable to open the .xlsx file provided by Kairos in Python, possibly related to read-only restrictions. I have saved the relevant data sheets as csv files to load instead; the original submissions by Kairos are included in 00_raw_data as .xlsx files.


In [2]:
# Kairos Stage 1
kairos_ls23_1_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage1_podLS23_submitted-2022-11-17.csv')
kairos_ls25_1_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage1_podLS25_submitted-2022-11-17.csv')

kairos_ls23_1 = pd.read_csv(kairos_ls23_1_path)
kairos_ls25_1 = pd.read_csv(kairos_ls25_1_path)

# Kairos Stage 2
kairos_ls23_2_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage2_podLS23_submitted-2022-12-20.csv')
kairos_ls25_2_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage2_podLS25_submitted-2022-12-20.csv')
kairos_ls23_2 = pd.read_csv(kairos_ls23_2_path)
kairos_ls25_2 = pd.read_csv(kairos_ls25_2_path)

### Scientific Aviation Data

#### Submission details
- Only submitted Stage 1 estimates
- Submitted on 2023-01-21

#### QC Indicator
The initial spreadsheet reports quantification estimates for all releases and indicates that a release does not meet Scientific Aviation criteria by printing the text in grey. All greyed-out releases include a comment explaining why the measurement was not valid.

- Entries in the "Comments" column:
    - "*not enough of plumes captured at low end due to restrictions in altitude (powerlines)"
    - "* same as above" (referring to powerline problem)
    - "*too few laps"
    - "*too few laps; upwind interference from landfill"
    - "*not enough of plume captured near surface"
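
Because grey text formatting is not preserved on import, the comment column is one possible stand-in for the QC indicator. The sketch below uses made-up rows and a hypothetical "Emission Rate" column name; it assumes that a comment is present only on greyed-out releases, which should be checked against the spreadsheet.

```python
import pandas as pd

# Made-up rows; "Comments" mirrors the column described above,
# "Emission Rate" is a hypothetical column name for illustration
demo = pd.DataFrame({
    "Emission Rate": [120.0, 45.0, 300.0],
    "Comments": [None, "*too few laps", None],
})

# Assumption: a comment is present only on greyed-out (invalid) releases
demo["QuantifiedPlume"] = demo["Comments"].isna().astype(int)
print(demo["QuantifiedPlume"].tolist())
```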

In [None]:
# Scientific Aviation data import

sciav_path = pathlib.PurePath('00_raw_data', 'SciAv_Stage1_submitted-2023-02-21.xlsx')
sciav_1 = pd.read_excel(sciav_path)

# Clean Operator Data

### Generate data frame with the following columns:
- Operator: name of operator (kairos, ghgsat, carbonmapper, scientificav)
- Stage: stage of unblinding (1, 2, or 3)
- PerformerExperimentID: Unique alphanumeric ID for each overpass. First two letters indicate the operator being tested (CM = Carbon Mapper, GH = GHGSat, KA = Kairos, SC = SciAv), followed by the overpass number. Overpass number starts at 1 and increments by 1 for each overpass. For example:
    - CM-1 means Carbon Mapper - first overpass
    - KA-34 means Kairos - 34th overpass
- DateOfSurvey: date in YYYY-MM-DD format
- TimestampUTC: timestamp in UTC using 24 hour time
- QuantifiedPlume: boolean input, 1 indicates operator submitted a valid quantification estimate for this overpass (excludes quantification estimates that are provided but fail operator QC standards)
- FacilityEmissionRate: estimated emissions in kg/h
- FacilityEmissionRateUpper: upper bound of uncertainty on quantification estimate
- FacilityEmissionRateLower: lower bound of uncertainty on quantification estimate
- UncertaintyType: type of uncertainty for upper and lower values reported above
- OperatorWindspeed: operator reported windspeed in m/s
- QCFlag: operator-specific QC flag; the leading letters indicate the operator (for example, "GH-2").
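
As a small illustration of the PerformerExperimentID format, the helper below builds an ID from an operator prefix and an overpass number. The function name `make_experiment_id` is hypothetical, and the sketch assumes unpadded overpass numbers, as in the "KA-34" example.

```python
def make_experiment_id(prefix, overpass):
    """Return an ID such as 'KA-34' from an operator prefix and overpass number."""
    return f"{prefix}-{int(overpass)}"

# First three Kairos overpasses
ids = [make_experiment_id("KA", n) for n in range(1, 4)]
print(ids)
```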

In [None]:
# List of column names for clean data

report_col = ['Operator',
              'Stage',
              'PerformerExperimentID',
              'DateOfSurvey',
              'TimestampUTC',
              'QuantifiedPlume',
              'FacilityEmissionRate',
              'FacilityEmissionRateUpper',
              'FacilityEmissionRateLower',
              'UncertaintyType',
              'OperatorWindspeed',
              'QCFlag']

### Clean Carbon Mapper Data

#### Required data cleaning:

- QuantifiedPlume:
    - CarbonMapper reports whether they quantified a plume using two metrics:
        - "CR plume present (Y/N)" indicates whether they detected a plume. "N" indicates that they did not detect a plume, i.e. they are estimating 0 kg/h
        - "Good Quality (Y/N)" indicates if they quantified the plume observed or not.
    - For data cleaning, if "CR plume present (Y/N)" == "Y" AND "Good Quality (Y/N)" == "Y", then QuantifiedPlume = 1; otherwise QuantifiedPlume = 0
    - For now, I will exclude zero values from the quantification plot

- TimestampUTC: CarbonMapper reports in local time (UTC - 7), needs to be adjusted to UTC
- QCFlag: CarbonMapper only uses one QC indicator ("Good Quality (Y/N)"), so I use a QC flag of "CM-1" for all items that fail to pass their QC test. For items that pass the QC test, I use "clear"

#### Notes:
- For detection capabilities, use "CR plume present (Y/N)" column. "Y" indicates detection by Carbon Mapper, "N" indicates no detection by Carbon Mapper
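
The cleaning rules above can be sketched as a small function that takes the two Carbon Mapper columns for one overpass. The name `classify_cm` is hypothetical, and the sketch assumes that a blank "Good Quality (Y/N)" entry is flagged "clear", since only "N" is described as failing QC.

```python
def classify_cm(plume_present, good_quality):
    """Return (QuantifiedPlume, QCFlag) for one Carbon Mapper overpass."""
    # A quantification counts only if a plume was detected AND it passed QC
    quantified = 1 if (plume_present == "Y" and good_quality == "Y") else 0
    # Assumption: anything other than "N" (including blank) is flagged "clear"
    qc_flag = "CM-1" if good_quality == "N" else "clear"
    return quantified, qc_flag

print(classify_cm("Y", "Y"))
print(classify_cm("Y", "N"))
print(classify_cm("N", "Y"))
```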

In [None]:
# Carbon Mapper data cleaning

# Carbon Mapper conducted 121 overpasses
total_overpass = 121
# Set Stage for data analysis (1, 2 or 3)
cm_stage = 1

cm_overpasses = range(1, total_overpass + 1) # integer overpass numbers (1-based) for the for loop
overpass_list = [] # for generating all new rows

for overpass in cm_overpasses:

    # Check if the quantification estimate is valid by passing Carbon Mapper's "Good Quality" criteria. Use quantification estimates if valid, otherwise input nan
    if cm_1.loc[overpass-1, "CR plume present (Y/N)"] == "Y" and cm_1.loc[overpass-1, "Good Quality (Y/N)"] == "Y":
        quantified = 1
        emission_rate = cm_1.loc[overpass-1, "Emission Rate (kg/hr)"]
        emission_upper = cm_1.loc[overpass-1, "FacilityEmissionRateUpper"]
        emission_lower = cm_1.loc[overpass-1, "FacilityEmissionRateLower"]
    else:
        quantified = 0
        emission_rate = float("nan")
        emission_upper = float("nan")
        emission_lower = float("nan")

    # Set QC flag: 'CM-1' if the overpass fails Carbon Mapper's "Good Quality" criteria, otherwise 'clear'
    if cm_1.loc[overpass-1, "Good Quality (Y/N)"] == "N":
        qc_flag = 'CM-1'
    else:
        qc_flag = 'clear'

    # Convert local time to UTC
    local_time = cm_1.loc[overpass-1, "Timestamp (hyperspectral technologies only)"]
    utc_time = convert_utc(local_time, 7)

    new_row = {
            'Operator': 'CarbonMapper',
            'Stage': cm_stage,
            'PerformerExperimentID': f'CM-{overpass:1.0f}',
            'DateOfSurvey': cm_1.loc[overpass-1, "DateOfSurvey"].strftime('%Y-%m-%d'),
            'TimestampUTC': utc_time,
            'QuantifiedPlume': quantified,
            'FacilityEmissionRate': emission_rate,
            'FacilityEmissionRateUpper': emission_upper,
            'FacilityEmissionRateLower': emission_lower,
            'UncertaintyType': '1-sigma',
            'OperatorWindspeed': cm_1.loc[overpass-1, "WindSpeed (m/s)"],
            'QCFlag': qc_flag,
    }
    overpass_list.append(new_row)

cm_clean = pd.DataFrame(overpass_list)

### Clean GHGSat Data

#### Required Data Cleaning:
- PerformerExperimentID:
    - Current ExperimentID is a 54 character string including indicators for date, time, line number and frame. The last entry is an indicator for overpass number that starts at 1 and incrementally increases with each overpass
    - Replace this with GH-1, GH-2, etc.
- TimestampUTC: GHGSat reports in local time (UTC - 7), needs to be adjusted to UTC
- QCFlag: add "GH-" prefix to the QC flags used by GHGSat
- QuantifiedPlume: if QCFlag (defined below in notes) is 1 or 2, I will consider this a valid and submitted quantification.
- UncertaintyType: 1-sigma

#### Notes:
- QC Flags as defined by GHGSat:
    - 1 = "Good conditions"
    - 2 = "Emissions detected and quantified, but suboptimal conditions may affect SR" (what is SR?)
    - 3 = "Emissions detected, but not quantified due to suboptimal conditions"
    - 4 = "Diffuse emissions visible over site (presumably from previous release, due to low wind)"
    - 5 = "Discarded (Bad weather/conditions, including clouds, cloud shadow, highly irregular aircraft trajectory, etc.)"
- It looks like in some instances, GHGSat quantified plumes with QC Flag = 4, and in some places they did not. Will need to look into this more rigorously
- QuantifiedPlume: compare QCFlags 1 and 2 for any significant difference in error estimation
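
A minimal sketch of the GHGSat rules above, using only the standard library. The helper names (`local_to_utc`, `ghgsat_quantified`, `ghgsat_qc_flag`) are hypothetical; in particular, `local_to_utc` is not the `convert_utc` function from `data_methods`, whose implementation is not shown here.

```python
import datetime

def local_to_utc(local_time, offset_hours):
    """Shift a datetime.time forward by offset_hours (local = UTC - offset_hours)."""
    # Attach an arbitrary date so that timedelta arithmetic is possible
    anchor = datetime.datetime.combine(datetime.date(2000, 1, 1), local_time)
    return (anchor + datetime.timedelta(hours=offset_hours)).time()

def ghgsat_quantified(flag):
    """QC flags 1 and 2 count as a valid, submitted quantification."""
    return 1 if flag in (1, 2) else 0

def ghgsat_qc_flag(flag):
    """Add the 'GH-' prefix to a numeric GHGSat QC flag."""
    return f"GH-{int(flag)}"

print(local_to_utc(datetime.time(9, 46, 16), 7))
print(ghgsat_quantified(2), ghgsat_quantified(4), ghgsat_qc_flag(3))
```

Note that this sketch returns only a time of day: a local time late in the evening wraps past midnight in UTC, in which case the survey date would also need to advance by one day.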

In [None]:
# GHGSat Data Cleaning

# GHGSat conducted 192 overpasses
total_overpass = 192
# Set Stage for data analysis (1, 2 or 3)
ghg_stage = 1

# Code variables for iterating in the for loop
gh_overpasses = range(1, total_overpass + 1) # integer overpass numbers (1-based) for the for loop
overpass_list = [] # for generating all new rows

for overpass in gh_overpasses:

    # Convert local time to UTC
    local_time = ghg_1.loc[overpass-1, "Timestamp (hyperspectral technologies only)"]
    local_time = convert_to_twentyfour(local_time)
    utc_time = convert_utc(local_time, 7)

    # Determine if plume was quantified and set emission rate and uncertainty based on the reported QC flags by GHGSat
    if ghg_1.loc[overpass-1, "QC Flag "] == 1 or ghg_1.loc[overpass-1, "QC Flag "] == 2:
        quantified = 1
        emission_rate = ghg_1.loc[overpass-1, "FacilityEmissionRate"]
        emission_upper = ghg_1.loc[overpass-1, "FacilityEmissionRateUpper"]
        emission_lower = ghg_1.loc[overpass-1, "FacilityEmissionRateLower"]
    else:
        quantified = 0
        emission_rate = float("nan")
        emission_upper = float("nan")
        emission_lower = float("nan")

    # Set QC flag:
    ghg_flag = ghg_1.loc[overpass-1, "QC Flag "]
    qc_flag = f'GH-{ghg_flag:1.0f}'


    new_row = {
            'Operator': 'GHGSat-AV',
            'Stage': ghg_stage,
            'PerformerExperimentID': f'GH-{overpass:1.0f}',
            'DateOfSurvey': ghg_1.loc[overpass-1, "DateOfSurvey"],
            'TimestampUTC': utc_time,
            'QuantifiedPlume': quantified,
            'FacilityEmissionRate': emission_rate,
            'FacilityEmissionRateUpper': emission_upper,
            'FacilityEmissionRateLower': emission_lower,
            'UncertaintyType': '1-sigma',
            'OperatorWindspeed': ghg_1.loc[overpass-1, "WindSpeed"],
            'QCFlag': qc_flag,
    }
    overpass_list.append(new_row)

ghg_clean = pd.DataFrame(overpass_list)

### Clean Kairos Data

#### Required Data Cleaning:
QC Flag:
- Kairos uses the following QC flags in their column "Kairos Flag for Dropped Passes or Uncertain Rate Quantification":
    - "Plane deviated from flightline"
    - "PARTIAL DETECTION"
    - "Cutoff - low confidence quantification"
    - "Excessive methane pooling near site"
    - "Excessive methane pooling over site" (appears twice - possible extra space at end?)
    - "Plane deviation from flightpath"
    - "Glare"
- Kairos reported for pods LS23 and LS25

#### Notes:
- Starting with Pod LS23
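
One way to map the Kairos text flags to short codes is a prefix lookup table, sketched below. The KA-1 to KA-5 numbering follows the order of the flag list above (with the two "Plane ..." variants sharing one code); the function name `kairos_qc_flag` and the "unrecognized" fallback are hypothetical.

```python
import pandas as pd

# First five characters of each Kairos flag mapped to a short code
KAIROS_FLAG_PREFIXES = {
    "Plane": "KA-1",  # flightline / flightpath deviation
    "PARTI": "KA-2",  # partial detection
    "Cutof": "KA-3",  # cutoff, low confidence quantification
    "Exces": "KA-4",  # excessive methane pooling near/over site
    "Glare": "KA-5",
}

def kairos_qc_flag(raw_flag):
    """Return 'clear' for a blank flag, else the code matching the flag prefix."""
    if pd.isna(raw_flag):
        return "clear"
    # strip() guards against stray leading/trailing whitespace
    return KAIROS_FLAG_PREFIXES.get(str(raw_flag).strip()[:5], "unrecognized")

print(kairos_qc_flag("Excessive methane pooling over site "))
print(kairos_qc_flag(float("nan")))
```

Keeping the mapping in one dictionary makes it easy to spot a flag that was not anticipated, since it falls through to the fallback value.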


In [5]:
# Kairos Data Cleaning

# Kairos conducted 349 overpasses
total_overpass = 349
# Set Stage for data analysis (1, 2 or 3)
kairos_stage = 1

# Code variables for iterating in the for loop
kairos_overpasses = range(1, total_overpass + 1) # integer overpass numbers (1-based) for the for loop
overpass_list = [] # for generating all new rows

for overpass in kairos_overpasses:

    # Convert local time to UTC
    local_time = kairos_ls23_1.loc[overpass-1, "Timestamp (hyperspectral technologies only)"]
    local_time = datetime.datetime.strptime(local_time, '%H:%M:%S').time()
    utc_time = convert_utc(local_time, 7)

    # Determine if plume was quantified and set emission rate and uncertainty based on the QC flags reported by Kairos
    kairos_quantified = kairos_ls23_1.loc[overpass-1, "Kairos Flag for Dropped Passes or Uncertain Rate Quantification"]
    if pd.isna(kairos_quantified):
        quantified = 1
        emission_rate = kairos_ls23_1.loc[overpass-1, "FacilityEmissionRate"]
        emission_upper = kairos_ls23_1.loc[overpass-1, "FacilityEmissionRateUpper"]
        emission_lower = kairos_ls23_1.loc[overpass-1, "FacilityEmissionRateLower"]
    else:
        quantified = 0
        emission_rate = float("nan")
        emission_upper = float("nan")
        emission_lower = float("nan")

    # Set QC flag:
    kairos_flag = kairos_ls23_1.loc[overpass-1, "Kairos Flag for Dropped Passes or Uncertain Rate Quantification"]
    if pd.isna(kairos_flag):
        qc_flag = 'clear'
    elif kairos_flag[:5] == 'Plane':
        qc_flag = 'KA-1'
    elif kairos_flag[:5] == 'PARTI':
        qc_flag = 'KA-2'
    elif kairos_flag[:5] == 'Cutof':
        qc_flag = 'KA-3'
    elif kairos_flag[:5] == 'Exces':
        qc_flag = 'KA-4'
    elif kairos_flag[:5] == 'Glare':
        qc_flag = 'KA-5'
    else:
        qc_flag = 'ERROR! Identify misclassified QC'


    new_row = {
            'Operator': 'Kairos - LS23',
            'Stage': kairos_stage,
            'PerformerExperimentID': f'KA-{overpass:1.0f}',
            'DateOfSurvey': kairos_ls23_1.loc[overpass-1, "DateOfSurvey"],
            'TimestampUTC': utc_time,
            'QuantifiedPlume': quantified,
            'FacilityEmissionRate': emission_rate,
            'FacilityEmissionRateUpper': emission_upper,
            'FacilityEmissionRateLower': emission_lower,
            'UncertaintyType': 'nan',
            'OperatorWindspeed': kairos_ls23_1.loc[overpass-1, "WindSpeed"],
            'QCFlag': qc_flag,
    }
    overpass_list.append(new_row)

kairos_clean = pd.DataFrame(overpass_list)
print(kairos_clean)

          Operator  Stage PerformerExperimentID DateOfSurvey TimestampUTC  \
0    Kairos - LS23      1                  KA-1     10/24/22     16:46:16   
1    Kairos - LS23      1                  KA-2     10/24/22     16:49:22   
2    Kairos - LS23      1                  KA-3     10/24/22     16:52:33   
3    Kairos - LS23      1                  KA-4     10/24/22     16:55:42   
4    Kairos - LS23      1                  KA-5     10/24/22     16:59:02   
..             ...    ...                   ...          ...          ...   
344  Kairos - LS23      1                KA-345     10/28/22     20:27:06   
345  Kairos - LS23      1                KA-346     10/28/22     20:30:10   
346  Kairos - LS23      1                KA-347     10/28/22     20:33:34   
347  Kairos - LS23      1                KA-348     10/28/22     20:36:39   
348  Kairos - LS23      1                KA-349     10/28/22     20:39:57   

     QuantifiedPlume FacilityEmissionRate  FacilityEmissionRateUpper  \
0  

In [None]:
# Scratch work
kairos_quantified = kairos_ls23_1.loc[2, "Kairos Flag for Dropped Passes or Uncertain Rate Quantification"]
print(kairos_quantified)
# if math.isnan(kairos_quantified):
#     print('Value is nan according to math')
# else:
#     print('Math can take a string and not throw an error!')

if pd.isna(kairos_quantified):
    print('Value is nan according to pandas')
else:
    print('Pandas can take a string and not throw an error!')

In [None]:
import datetime
time = kairos_ls23_1.loc[overpass-1, "Timestamp (hyperspectral technologies only)"]
print(time)
datetime.datetime.strptime(time, '%H:%M:%S').time()

# Load metering data

### Data description
- Data uploaded on 2023-02-22 was generated by Philippine Burdeau on the same date. Sahar manually added a column for "PerformerExperimentID" and checked that the final number matched the "PerformerExperimentID" in the operator reports.
- PerformerExperimentID:
    - Kairos: 1 - 349, incrementing by 1 with each overpass
    - GHGSat: First digits are the date, month, time (apparently in UTC), followed by info on line number and frame number, then incrementally increasing numbers from 1 - 192
    - Carbon Mapper: 1 - 121, incrementing by 1 with each overpass
- Timestamps match those reported by the operators


### To do:
- Compare all operator timestamps to our timestamps and to Flight Radar
- Ultimately, use either our timestamp or FlightRadar for the definitive time
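
A possible starting point for the timestamp comparison, sketched with made-up values. It assumes both tables share a "PerformerExperimentID" column and store "TimestampUTC" as HH:MM:SS strings; the variable names and the data are illustrative only.

```python
import pandas as pd

# Made-up operator-reported and metering timestamps
operator = pd.DataFrame({
    "PerformerExperimentID": ["KA-1", "KA-2"],
    "TimestampUTC": ["16:46:16", "16:49:22"],
})
meter = pd.DataFrame({
    "PerformerExperimentID": ["KA-1", "KA-2"],
    "TimestampUTC": ["16:46:16", "16:49:30"],
})

# Align the two sources on the overpass ID
merged = operator.merge(meter, on="PerformerExperimentID",
                        suffixes=("_operator", "_meter"))

# Difference in seconds (metering minus operator)
delta = (pd.to_timedelta(merged["TimestampUTC_meter"])
         - pd.to_timedelta(merged["TimestampUTC_operator"]))
merged["diff_seconds"] = delta.dt.total_seconds()
print(merged[["PerformerExperimentID", "diff_seconds"]])
```

Overpasses with a large absolute difference would be the ones to check against Flight Radar first.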