# Comparing CEO labels 🏷️
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/compare_ceo_labels.ipynb)

**Description:** This notebook provides code to compare labels of the same data points from two Collect Earth Online (CEO) projects.

In [None]:
import pandas as pd

In [None]:
# Load the CSV label files
# Replace the example file names below with the paths to your own CSV label files
ceo_set1_path = "ceo-Hawaii-Jan-Dec-2020-(Set-1)-sample-data-2022-08-16.csv"  # first CSV label file
ceo_set2_path = "ceo-Hawaii-Jan-Dec-2020-(Set-2)-sample-data-2022-08-16.csv"  # second CSV label file

ceo_set1 = pd.read_csv(ceo_set1_path)
ceo_set2 = pd.read_csv(ceo_set2_path)

if ceo_set1.shape != ceo_set2.shape:
    print('''ERROR: The size of the two dataframes does not match. 
          Most likely, there is a duplicate in the plotid column 
          resulting from an error in CEO. You need to delete the 
          duplicate manually before continuing.''')
    print(ceo_set1[ceo_set1.duplicated(subset=['plotid'])])
    print(ceo_set2[ceo_set2.duplicated(subset=['plotid'])])
else:
    print('Loaded two dataframes with equal size: {}'.format(ceo_set1.shape))
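
The shape check above reports that the sizes differ and prints the repeated rows, but it only shows the second and later occurrences of each duplicated `plotid`. The following is a minimal sketch of a helper that lists every occurrence side by side, which may make the manual cleanup easier. The helper name `find_duplicate_plots` and the toy data are hypothetical; only the `plotid` column name comes from the cell above.

```python
import pandas as pd


def find_duplicate_plots(df, key="plotid"):
    """Return every row whose key value occurs more than once."""
    # keep=False marks all occurrences, not only the second and later ones
    return df[df.duplicated(subset=[key], keep=False)].sort_values(key)


# Toy data standing in for a CEO export (values are made up for illustration)
toy = pd.DataFrame(
    {"plotid": [1, 2, 2, 3], "label": ["Crop", "Crop", "Non-crop", "Crop"]}
)
dupes = find_duplicate_plots(toy)
print(dupes)
```

Calling `find_duplicate_plots(ceo_set1)` and `find_duplicate_plots(ceo_set2)` would show which rows to inspect before deleting one of them by hand.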

In [None]:
# The wording of the labeling question can vary slightly between projects,
# so read it from the name of the last column instead of hard-coding it
label_question = ceo_set1.columns[-1]

In [None]:
# Rows are compared by position, so both files must list the plots in the same order
ceo_agree = ceo_set1[ceo_set1[label_question] == ceo_set2[label_question]]

print('Number of samples that are in agreement: %d out of %d (%.2f%%)' % 
          (ceo_agree.shape[0], 
           ceo_set1.shape[0], 
           ceo_agree.shape[0]/ceo_set1.shape[0]*100))
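
The comparison above matches rows by their position in each file. If the two exports might list the plots in a different order, one option is to align the labels on `plotid` before comparing. This is a sketch under the assumption that `plotid` is unique within each set; the function name `agreement_by_plotid`, the column name `q`, and the toy frames are hypothetical.

```python
import pandas as pd


def agreement_by_plotid(set1, set2, label_col, key="plotid"):
    """Compare labels after aligning both sets on the key column."""
    labels1 = set1.set_index(key)[label_col]
    labels2 = set2.set_index(key)[label_col]
    # Keep only the plots present in both sets, in one common order
    shared = labels1.index.intersection(labels2.index)
    return labels1.loc[shared] == labels2.loc[shared]


# Toy data: the second set lists the same plots in a different order
a = pd.DataFrame({"plotid": [1, 2, 3], "q": ["Crop", "Non-crop", "Crop"]})
b = pd.DataFrame({"plotid": [3, 1, 2], "q": ["Non-crop", "Crop", "Non-crop"]})
agree = agreement_by_plotid(a, b, "q")
print("Agreement: %d out of %d" % (agree.sum(), len(agree)))
```

The result is a boolean Series indexed by `plotid`, so it can also be used to look up which plots disagree.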

In [None]:
# Compute the disagreement mask once and apply it to both sets
disagree = ceo_set1[label_question] != ceo_set2[label_question]
ceo_disagree_set1 = ceo_set1[disagree]
ceo_disagree_set2 = ceo_set2[disagree]

print('Number of samples that are NOT in agreement: %d out of %d (%.2f%%)' % 
          (ceo_disagree_set1.shape[0], 
           ceo_set1.shape[0], 
           ceo_disagree_set1.shape[0]/ceo_set1.shape[0]*100))

In [None]:
# Show all rows so that no disagreement is truncated in the tables below
pd.set_option('display.max_rows', None)

In [None]:
ceo_disagree_set1[['sampleid', 'email', 'flagged', 'collection_time', 
                   'analysis_duration', 'imagery_title', label_question]]

In [None]:
ceo_disagree_set2[['sampleid', 'email', 'flagged', 'collection_time', 
                   'analysis_duration', 'imagery_title', label_question]]

The tables above list, for each of the two sets, the points on which the labelers disagreed. Review these as a group and decide by consensus which label to assign to each point.
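
For the group review, it can help to see both labels for a point in a single row instead of switching between two tables. Below is a minimal sketch that builds such a table by merging the two sets on `plotid`. The function name `disagreement_table`, the column name `q`, and the toy data are hypothetical, and the sketch assumes `plotid` is unique within each set.

```python
import pandas as pd


def disagreement_table(set1, set2, label_col, key="plotid"):
    """Return one row per plot where the two sets assign different labels."""
    merged = set1[[key, label_col]].merge(
        set2[[key, label_col]], on=key, suffixes=("_set1", "_set2")
    )
    differ = merged[label_col + "_set1"] != merged[label_col + "_set2"]
    return merged[differ].reset_index(drop=True)


# Toy data standing in for the two CEO exports
s1 = pd.DataFrame({"plotid": [1, 2, 3], "q": ["Crop", "Non-crop", "Crop"]})
s2 = pd.DataFrame({"plotid": [1, 2, 3], "q": ["Crop", "Non-crop", "Non-crop"]})
review = disagreement_table(s1, s2, "q")
print(review)
```

With the notebook's variables, the call would be `disagreement_table(ceo_set1, ceo_set2, label_question)`.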