# Clean Operator Data Reports
Code author: Sahar H. El Abbadi
Date started: 2022-02-22
Date last edited: 2022-02-23

### Notes
- Currently, this loads and cleans Stage 1 and 2 data for Carbon Mapper and GHGSat, and Stage 1, 2, and 3 data for Kairos pods LS23 and LS25
- Takes operator reports as input from 00_raw_data. No changes were made to operator reported results

### To Do:
- Change code so that cleaning each data file is a function that takes the spreadsheet as input. This way, I can run the function on Stage 1, 2, and 3 data without copying huge chunks of code with only minor changes to syntax along the way
- Consider: do I want to combine all the output clean data into one massive dataframe, which I then filter as needed downstream?
- Need to add Scientific Aviation
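
A minimal sketch of how the first two items might fit together: one helper applies a cleaning function to every stage and stacks the results into a single dataframe that can be filtered downstream. `clean_stages` and `toy_cleaner` are hypothetical names used only for illustration; the toy cleaner stands in for an operator-specific function and is not the logic in data_cleaning_methods.

```python
import pandas as pd


def clean_stages(reports, cleaner, overpasses):
    """Apply one cleaning function to every stage and stack the results.

    reports: dict mapping stage number -> raw report DataFrame
    cleaner: callable(report, overpasses, stage) -> cleaned DataFrame
    """
    cleaned = [cleaner(report, overpasses, stage)
               for stage, report in sorted(reports.items())]
    return pd.concat(cleaned, ignore_index=True)


def toy_cleaner(report, overpasses, stage):
    # Hypothetical stand-in for an operator-specific cleaner:
    # keeps the first `overpasses` rows and tags them with the stage
    out = report.head(overpasses).copy()
    out['Stage'] = stage
    return out


# Toy reports for two stages (values are made up)
raw = {1: pd.DataFrame({'FacilityEmissionRate': [10.0, 20.0]}),
       2: pd.DataFrame({'FacilityEmissionRate': [11.0, 19.0]})}
combined = clean_stages(raw, toy_cleaner, overpasses=2)

# Downstream code can then filter the single dataframe as needed
stage_2 = combined[combined['Stage'] == 2]
```

Whether one combined dataframe is preferable to separate files probably depends on how the downstream plotting code selects operators and stages.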

In [1]:
# Imports
import numpy as np
import pandas as pd
import pathlib
import datetime
from data_manipulation_methods import convert_utc, convert_to_twentyfour
from data_cleaning_methods import clean_cm, clean_ghgsat, clean_kairos


# Load and clean raw data submitted by operators
### Notes on formatting:
- Operators added their own QC indicators, thus not all columns are uniform across reports
- Values left blank in the Excel file are replaced with "nan" during import (pandas reads empty cells as NaN)
- Naming convention for dataframes: operator_stage
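
A small illustration of the blank-cell behavior noted above, using an in-memory CSV rather than a real operator report. The column name is borrowed from the Carbon Mapper report; the rows are made up.

```python
import io

import pandas as pd

# Toy CSV standing in for an operator report; the second row has a blank QC cell
raw_text = "Overpass,Good Quality (Y/N)\n1,Y\n2,\n3,N\n"
report = pd.read_csv(io.StringIO(raw_text))

# Blank cells arrive as NaN, so test for them with isna() rather than == ""
blank_qc = report['Good Quality (Y/N)'].isna()
```

This matters for the QC columns in particular, since a comparison such as `== "N"` will not match cells the operator left blank.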

## Notes on Cleaning Operator Data

### Generate data frame with the following columns:
- Operator: name of operator (Kairos, GHGSat, CarbonMapper, scientificav)
- Stage: stage of unblinding (1, 2, or 3)
- PerformerExperimentID: Overpass number for a specific operator, incrementing by 1 for each overpass
- DateOfSurvey: date in YYYY-MM-DD format
- TimestampUTC: timestamp in UTC using 24 hour time
- QuantifiedPlume: boolean input, 1 indicates operator submitted a valid quantification estimate for this overpass (excludes quantification estimates that are provided but fail operator QC standards)
- FacilityEmissionRate: estimated emissions in kg/h
- FacilityEmissionRateUpper: upper bound of uncertainty on quantification estimate
- FacilityEmissionRateLower: lower bound of uncertainty on quantification estimate
- UncertaintyType: type of uncertainty for upper and lower values reported above
- OperatorWindspeed: operator reported windspeed in m/s
- QCFlag: operator-specific QC flag; the prefix indicates the operator (for example "CM-" or "GH-")
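
As a rough sketch, the column list above could double as a schema check on each cleaned dataframe. `missing_columns` is a hypothetical helper, not something defined in data_cleaning_methods.

```python
import pandas as pd

EXPECTED_COLUMNS = ['Operator',
                    'Stage',
                    'PerformerExperimentID',
                    'DateOfSurvey',
                    'TimestampUTC',
                    'QuantifiedPlume',
                    'FacilityEmissionRate',
                    'FacilityEmissionRateUpper',
                    'FacilityEmissionRateLower',
                    'UncertaintyType',
                    'OperatorWindspeed',
                    'QCFlag']


def missing_columns(clean_df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from clean_df."""
    return [col for col in expected if col not in clean_df.columns]


# An empty dataframe with the full schema passes; dropping a column is caught
template = pd.DataFrame(columns=EXPECTED_COLUMNS)
incomplete = template.drop(columns=['QCFlag'])
```

Calling `missing_columns` on each cleaned dataframe before saving would catch a renamed or forgotten column early.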

### Carbon Mapper: Import and Clean Stage 1 and 2 data

#### Submission details
- Stage 1 submitted on 2023-01-03
- Stage 2 submitted on 2023-02-13

#### QC Indicator:
Column: "Good Quality (Y/N)"
- Y = good quality, quantification included for this stage
- N = not good quality, quantification estimate included for potential use in a later stage, but not included in this stage
- nan = left blank by Carbon Mapper


In [2]:
# Carbon Mapper Stage 1
cm_1_path = pathlib.PurePath('00_raw_data', 'CM_Stage1_submitted-2023-01-03.xlsx')
cm_1 = pd.read_excel(cm_1_path, sheet_name='Survey Summary')

# Carbon Mapper Stage 2
cm_2_path = pathlib.PurePath('00_raw_data', 'CM_Stage2_submitted-2023-02-13.xlsx')
cm_2 = pd.read_excel(cm_2_path, sheet_name='Survey Summary')


### Clean Carbon Mapper Data

#### Required data cleaning:

- QuantifiedPlume:
    - CarbonMapper reports whether they quantified a plume using two metrics:
        - "CR plume present (Y/N)" indicates whether they detected a plume. "N" indicates that they did not detect a plume, i.e. they are estimating 0 kg/h
        - "Good Quality (Y/N)" indicates whether they quantified the observed plume.
    - For data cleaning, if "CR plume present (Y/N)" == "Y" AND "Good Quality (Y/N)" == "Y", then QuantifiedPlume = 1; otherwise QuantifiedPlume = 0
    - For now, I will exclude zero values from the quantification plot

- TimestampUTC: CarbonMapper reports in local time (UTC - 7), needs to be adjusted to UTC
- QCFlag: CarbonMapper uses only one QC indicator ("Good Quality (Y/N)"), so I use a QC flag of "CM-1" for all items that fail their QC test. For items that pass the QC test, I use "clear"
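
The QuantifiedPlume and QCFlag rules above could be sketched as follows. This is an illustration on made-up rows, not the contents of `clean_cm`; in particular, treating a blank quality cell like a failed QC test is an assumption.

```python
import pandas as pd

# Toy rows using the two Carbon Mapper column names quoted above
cm = pd.DataFrame({
    'CR plume present (Y/N)': ['Y', 'Y', 'N', 'Y'],
    'Good Quality (Y/N)': ['Y', 'N', None, None],
})

plume = cm['CR plume present (Y/N)'].eq('Y')
good = cm['Good Quality (Y/N)'].eq('Y')  # blank (NaN) cells compare as False

# QuantifiedPlume = 1 only when both indicators are "Y"
cm['QuantifiedPlume'] = (plume & good).astype(int)

# Assumption: a blank quality cell is flagged the same way as an explicit "N"
cm['QCFlag'] = good.map({True: 'clear', False: 'CM-1'})
```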

#### Notes:
- For detection capabilities, use "CR plume present (Y/N)" column. "Y" indicates detection by Carbon Mapper, "N" indicates no detection by Carbon Mapper
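
For the TimestampUTC adjustment listed under required data cleaning, a minimal sketch assuming a fixed UTC - 7 offset and 24-hour time strings. `local_to_utc` is a hypothetical helper and is not the `convert_utc` function imported from data_manipulation_methods; the date and time are arbitrary example values.

```python
import pandas as pd

LOCAL_UTC_OFFSET_HOURS = -7  # local time = UTC - 7, as stated for these reports


def local_to_utc(date_str, time_str, offset_hours=LOCAL_UTC_OFFSET_HOURS):
    """Combine a local date and 24-hour time, then shift to UTC."""
    local = pd.to_datetime(date_str + ' ' + time_str)
    # Subtracting a negative offset adds 7 hours
    return local - pd.Timedelta(hours=offset_hours)


# Arbitrary example: a late-afternoon local time rolls over to the next UTC date
late = local_to_utc('2022-10-11', '17:30:00')
```

Because the shift can cross midnight, the UTC date of an overpass may differ from the local survey date.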

In [3]:
# Carbon Mapper data cleaning
# Carbon Mapper conducted 121 overpasses
total_overpass = 121

# Import Stage 1 data
stage = 1
cm_1_clean = clean_cm(cm_report=cm_1, cm_overpasses=total_overpass, cm_stage=stage)

# Import Stage 2 data
stage = 2
cm_2_clean = clean_cm(cm_report=cm_2, cm_overpasses=total_overpass, cm_stage=stage)

# Save data
cm_1_clean.to_csv(pathlib.PurePath('01_clean_data', 'cm_1_clean.csv'))
cm_2_clean.to_csv(pathlib.PurePath('01_clean_data', 'cm_2_clean.csv'))


### GHGSat Stage 1 and 2 data

#### Submission details:
- Stage 1 data submitted on 2022-11-21
- Stage 2 data submitted on 2022-12-23
- Stage 3 data same as Stage 2, submitted 2023-02-17

#### QC Indicator:

Column: "QC Flag"
- 1 = Good conditions
- 2 = Emissions detected and quantified, but suboptimal conditions may affect SR
- 3 = Emissions detected, but not quantified due to suboptimal conditions
- 4 = Diffuse emission visible over site (presumably from previous release, due to low wind)
- 5 = Discarded (Bad weather/conditions, including clouds, cloud shadow, highly irregular aircraft trajectory, etc.)

#### Data processing notes
Unable to open the .xlsx file provided by GHGSat in Python, possibly because of read-only restrictions. I saved the relevant data sheets as csv files and load those instead; the original submissions by GHGSat are kept in 00_raw_data as .xlsx files.

In [4]:
# GHGSat Stage 1
ghg_1_path = pathlib.PurePath('00_raw_data', 'GHG_Stage1_submitted-2022-11-21.csv')
ghg_1 = pd.read_csv(ghg_1_path)

# GHGSat Stage 2
ghg_2_path = pathlib.PurePath('00_raw_data', 'GHG_Stage2_submitted-2022-12-23.csv')
ghg_2 = pd.read_csv(ghg_2_path)

### Clean GHGSat Data

#### Required Data Cleaning:
- PerformerExperimentID:
    - Current ExperimentID is a 54 character string including indicators for date, time, line number and frame. The last entry is an indicator for overpass number that starts at 1 and incrementally increases with each overpass
    - Replace this with GH-1, GH-2, etc.
- TimestampUTC: GHGSat reports in local time (UTC - 7), needs to be adjusted to UTC
- QCFlag: add "GH-" prefix to the QC flags used by GHGSat
- QuantifiedPlume: if QCFlag (defined below in notes) is 1 or 2, I will consider this a valid and submitted quantification.
- UncertaintyType: 1-sigma
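
The cleaning steps above could look roughly like this on toy data. It is a sketch rather than the contents of `clean_ghgsat`: it numbers overpasses by row order instead of parsing the 54-character ExperimentID, which assumes the rows are already sorted by overpass.

```python
import pandas as pd

# Toy rows; 'QC Flag' is the GHGSat column named above, the values are made up
ghg = pd.DataFrame({'QC Flag': [1, 2, 3, 5]})

# Overpass counter starting at 1 (assumes rows are in overpass order)
ghg['PerformerExperimentID'] = ['GH-' + str(n) for n in range(1, len(ghg) + 1)]

# Prefix the operator's own QC flag with "GH-"
ghg['QCFlag'] = 'GH-' + ghg['QC Flag'].astype(str)

# Flags 1 and 2 count as a valid, submitted quantification
ghg['QuantifiedPlume'] = ghg['QC Flag'].isin([1, 2]).astype(int)

ghg['UncertaintyType'] = '1-sigma'
```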

#### Notes:
- QC Flags as defined by GHGSat:
    - 1 = "Good conditions"
    - 2 = "Emissions detected and quantified, but suboptimal conditions may affect SR" (what is SR?)
    - 3 = "Emissions detected, but not quantified due to suboptimal conditions"
    - 4 = "Diffuse emissions visible over site (presumably from previous release, due to low wind)"
    - 5 = "Discarded (Bad weather/conditions, including clouds, cloud shadow, highly irregular aircraft trajectory, etc.)"
- It looks like in some instances, GHGSat quantified plumes with QC Flag = 4, and in some places they did not. Will need to look into this more rigorously
- QuantifiedPlume: compare QCFlags 1 and 2 for any significant difference in error estimation
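
One way to look into the QC Flag = 4 question more rigorously is to count, per flag, how many overpasses carry a quantification estimate. The sketch below uses made-up rows, and 'Emission Rate' is a hypothetical column name standing in for whatever GHGSat calls its estimate.

```python
import pandas as pd

# Toy data; 'Emission Rate' is a hypothetical column name
ghg = pd.DataFrame({
    'QC Flag': [1, 2, 4, 4, 3],
    'Emission Rate': [120.0, 80.0, 45.0, None, None],
})

# Per QC flag: number of overpasses and number with a non-blank estimate
summary = (ghg.assign(quantified=ghg['Emission Rate'].notna().astype(int))
              .groupby('QC Flag')['quantified']
              .agg(n_overpasses='size', n_quantified='sum'))

# Flags with a mix of quantified and unquantified rows need a closer look
mixed = summary[(summary['n_quantified'] > 0)
                & (summary['n_quantified'] < summary['n_overpasses'])]
```

On the real report, any flag that shows up in `mixed` is being applied inconsistently with respect to quantification.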

In [5]:
# GHGSat Data Cleaning

# GHGSat conducted 192 overpasses
ghg_overpass = 192

# Import Stage 1 data
ghg_stage = 1
ghg_1_clean = clean_ghgsat(ghg_report=ghg_1, ghg_overpasses=ghg_overpass, ghg_stage=ghg_stage)

# Import Stage 2 data
ghg_stage = 2
ghg_2_clean = clean_ghgsat(ghg_report=ghg_2, ghg_overpasses=ghg_overpass, ghg_stage=ghg_stage)

# Save data
ghg_1_clean.to_csv(pathlib.PurePath('01_clean_data', 'ghg_1_clean.csv'))
ghg_2_clean.to_csv(pathlib.PurePath('01_clean_data', 'ghg_2_clean.csv'))

### Kairos Stage 1, 2, and 3 data

#### Submission details
- Stage 1 submitted on 2022-11-17
- Stage 2 submitted on 2022-12-20
- Stage 3 submitted on 2023-02-23 (date taken from the Stage 3 file names)
- Kairos submitted data for two pods, LS23 and LS25. They analyzed the data independently, but did not report this until after testing was complete.

#### QC Indicator:
(I ran the UNIQUE function in Excel to identify values in their original report)
- "Plane deviated from flightline"
- "PARTIAL DETECTION"
- "Cutoff - low confidence quantification"
- "Excessive methane pooling near site"
- "Excessive methane pooling over site" (appears twice - possible extra space at end?)
- "Plane deviation from flightpath"
- "Glare"
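
The same check can be done in pandas instead of Excel. The sketch below uses made-up rows to show how a trailing space produces an apparent duplicate and how stripping whitespace removes it; the column name is the Kairos flag column from their report.

```python
import pandas as pd

flag_col = 'Kairos Flag for Dropped Passes or Uncertain Rate Quantification'

# Toy rows; the third entry carries a trailing space
kairos = pd.DataFrame({flag_col: [
    'Glare',
    'Excessive methane pooling over site',
    'Excessive methane pooling over site ',
    None,
]})

# Without stripping, the trailing-space variant counts as a separate value
raw_unique = kairos[flag_col].dropna().unique()

# Stripping whitespace first collapses the two spellings into one
clean_unique = kairos[flag_col].str.strip().dropna().unique()
```

If the stripped list on the real report is one entry shorter than the Excel list, the extra-space explanation holds.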

#### Data processing notes
Unable to open the .xlsx file provided by GHGSat in Python, possibly related to read-only restrictions. I have saved the relevant data sheets as csv files to load instead, original submissions by GHGSat are included in 00_raw_data as .xlsx files.


In [6]:
# Kairos Stage 1
kairos_ls23_1_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage1_podLS23_submitted-2022-11-17.csv')
kairos_ls25_1_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage1_podLS25_submitted-2022-11-17.csv')
kairos_ls23_1 = pd.read_csv(kairos_ls23_1_path)
kairos_ls25_1 = pd.read_csv(kairos_ls25_1_path)

# Kairos Stage 2
kairos_ls23_2_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage2_podLS23_submitted-2022-12-20.csv')
kairos_ls25_2_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage2_podLS25_submitted-2022-12-20.csv')
kairos_ls23_2 = pd.read_csv(kairos_ls23_2_path)
kairos_ls25_2 = pd.read_csv(kairos_ls25_2_path)

# Kairos Stage 3
kairos_ls23_3_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage3_podLS23_submitted-2023-02-23.csv')
kairos_ls25_3_path = pathlib.PurePath('00_raw_data', 'Kairos_Stage3_podLS25_submitted-2023-02-23.csv')
kairos_ls23_3 = pd.read_csv(kairos_ls23_3_path)
kairos_ls25_3 = pd.read_csv(kairos_ls25_3_path)

### Clean Kairos Data

#### Required Data Cleaning:
QC Flag:
- Kairos uses the following QC flags in their column "Kairos Flag for Dropped Passes or Uncertain Rate Quantification":
    - "Plane deviated from flightline"
    - "PARTIAL DETECTION"
    - "Cutoff - low confidence quantification"
    - "Excessive methane pooling near site"
    - "Excessive methane pooling over site" (appears twice - possible extra space at end?)
    - "Plane deviation from flightpath"
    - "Glare"
- Kairos reported for pods LS23 and LS25

#### Notes:
- Starting with Pod LS23


In [7]:
# Kairos Data Cleaning

# Kairos conducted 349 overpasses
kairos_overpass = 349

# Stage 1, pods LS23 and LS25
kairos_stage = 1
kairos_1_ls23_clean = clean_kairos(kairos_report=kairos_ls23_1, kairos_overpasses=kairos_overpass, kairos_stage=kairos_stage)
kairos_1_ls25_clean = clean_kairos(kairos_report=kairos_ls25_1, kairos_overpasses=kairos_overpass, kairos_stage=kairos_stage)

# Stage 2, pods LS23 and LS25
kairos_stage = 2
kairos_2_ls23_clean = clean_kairos(kairos_report=kairos_ls23_2, kairos_overpasses=kairos_overpass, kairos_stage=kairos_stage)
kairos_2_ls25_clean = clean_kairos(kairos_report=kairos_ls25_2, kairos_overpasses=kairos_overpass, kairos_stage=kairos_stage)

# Stage 3, pods LS23 and LS25
kairos_stage = 3
kairos_3_ls23_clean = clean_kairos(kairos_report=kairos_ls23_3, kairos_overpasses=kairos_overpass, kairos_stage=kairos_stage)
kairos_3_ls25_clean = clean_kairos(kairos_report=kairos_ls25_3, kairos_overpasses=kairos_overpass, kairos_stage=kairos_stage)

# Save Data
kairos_1_ls23_clean.to_csv(pathlib.PurePath('01_clean_data', 'kairos_1_ls23_clean.csv'))
kairos_1_ls25_clean.to_csv(pathlib.PurePath('01_clean_data', 'kairos_1_ls25_clean.csv'))
kairos_2_ls23_clean.to_csv(pathlib.PurePath('01_clean_data', 'kairos_2_ls23_clean.csv'))
kairos_2_ls25_clean.to_csv(pathlib.PurePath('01_clean_data', 'kairos_2_ls25_clean.csv'))
kairos_3_ls23_clean.to_csv(pathlib.PurePath('01_clean_data', 'kairos_3_ls23_clean.csv'))
kairos_3_ls25_clean.to_csv(pathlib.PurePath('01_clean_data', 'kairos_3_ls25_clean.csv'))

# Clean Operator Data

### Generate data frame with the following columns:
- Operator: name of operator (kairos, ghgsat, carbonmapper, scientificav)
- Stage: stage of unblinding (1, 2, or 3)
- PerformerExperimentID: Overpass number for a specific operator, incrementing by 1 for each overpass
- DateOfSurvey: date in YYYY-MM-DD format
- TimestampUTC: timestamp in UTC using 24 hour time
- QuantifiedPlume: boolean input, 1 indicates operator submitted a valid quantification estimate for this overpass (excludes quantification estimates that are provided but fail operator QC standards)
- FacilityEmissionRate: estimated emissions in kg/h
- FacilityEmissionRateUpper: upper bound of uncertainty on quantification estimate
- FacilityEmissionRateLower: lower bound of uncertainty on quantification estimate
- UncertaintyType: type of uncertainty for upper and lower values reported above
- OperatorWindspeed: operator reported windspeed in m/s
- QCFlag: operator-specific QC flag; the prefix indicates the operator (for example "CM-" or "GH-")

In [8]:
# Array with column names for clean data

report_col = ['Operator',
              'Stage',
              'PerformerExperimentID',
              'DateOfSurvey',
              'TimestampUTC',
              'QuantifiedPlume',
              'FacilityEmissionRate',
              'FacilityEmissionRateUpper',
              'FacilityEmissionRateLower',
              'UncertaintyType',
              'OperatorWindspeed',
              'QCFlag']

### Scientific Aviation Data

#### Submission details
- Only submitted Stage 1 estimates
- Submitted on 2023-01-21

#### QC Indicator
The initial spreadsheet reports quantification estimates for all releases and marks releases that do not meet Scientific Aviation criteria by printing the text in grey. All greyed-out releases include a comment explaining why the measurement was not valid.

- Entries in the "Comments" column:
    - "*not enough of plumes captured at low end due to restrictions in altitude (powerlines)"
    - "* same as above" (referring to powerline problem)
    - "*too few laps"
    - "*too few laps; upwind interference from landfill"
    - "*not enough of plume captured near surface"
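
Cell font colour is not carried into the dataframe by `pd.read_excel`, so the grey text cannot be used directly. A possible workaround, sketched on made-up rows, is to use the "Comments" column as the QC signal. This assumes that only releases failing Scientific Aviation's criteria carry a comment, which should be checked against the spreadsheet; 'Emission Rate' is a hypothetical column name.

```python
import pandas as pd

# Toy rows; 'Comments' is the column named above, 'Emission Rate' is hypothetical
sciav = pd.DataFrame({
    'Emission Rate': [55.0, 30.0, 12.0],
    'Comments': [None, '*too few laps', '   '],
})

# Treat blank and whitespace-only cells as "no comment"
has_comment = sciav['Comments'].fillna('').str.strip().ne('')

# Assumption: a comment means the release failed the operator's QC criteria
sciav['QuantifiedPlume'] = (~has_comment).astype(int)
```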

In [9]:
# Scientific Aviation data import

sciav_path = pathlib.PurePath('00_raw_data', 'SciAv_Stage1_submitted-2023-02-21.xlsx')
sciav_1 = pd.read_excel(sciav_path)