# CleanLab 2nd Round
After updating labels for several CEO surveys, the cleanlab assessment is performed again to identify which plots to remove prior to training.

In order to run the CleanLab experiment, the following inputs are needed:
- x and y training data saved as csv
- list of final plot ids used in training
- trained catboost model in ../../models/model.joblib
- list of selected features
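
Before running the cells below, it can help to confirm that these inputs are in place. The following is a minimal sketch, assuming the file names used later in this notebook (`cleanlab_xy.csv`, `final_plot_ids.json`, `model.joblib`); the helper `missing_inputs` is hypothetical and not part of the original workflow.

```python
from pathlib import Path


def missing_inputs(data_dir, model_path,
                   names=("cleanlab_xy.csv", "final_plot_ids.json")):
    """Return the expected input paths that do not exist on disk."""
    expected = [Path(data_dir) / name for name in names] + [Path(model_path)]
    return [str(p) for p in expected if not p.exists()]
```

Calling `missing_inputs('../../data/cleanlab/round2/', '../../models/model.joblib')` returns an empty list when everything is present.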

In [1]:
# use plantations6 kernel
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import joblib
import ast
import json

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import confusion_matrix

In [2]:
# run cleanlab assessment with cleanlab kernel
from cleanlab.datalab.datalab import Datalab
from cleanlab.classification import CleanLearning


In [3]:
import sys
print(sys.executable)  # Full path to the Python interpreter
print(sys.version) 

/Users/jessica.ertel/miniforge3/envs/plantations6/bin/python
3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:59) [Clang 16.0.6 ]


Background:
- 238 plots labeled "unknown" were dropped.
- 118 plots did not have ARD.
- Training data batch includes: 1040 plots.

## Load x and y

In [12]:
dir = '../../data/cleanlab/round2/'

In [4]:
df = pd.read_csv(f'{dir}cleanlab_xy.csv')
df.shape

(203056, 95)

In [5]:
# import the saved list of plot ids from the combined training batches
with open(f"{dir}final_plot_ids.json", "r") as file:
    id_list = json.load(file)
    
# create array where all ids appear 196 times + assign to df 
ceo_plot_ids = np.repeat(id_list, 196)
df['plot_id'] = ceo_plot_ids
counts = df.plot_id.value_counts()
error = counts[counts != 196]
len(error)

0
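
The assignment above relies on every plot contributing exactly 196 consecutive rows, in the same order as the id list. A minimal sketch of that assumption, using only the standard library; `expand_plot_ids` is a hypothetical helper that fails loudly on a length mismatch before any ids are assigned.

```python
PIXELS_PER_PLOT = 196  # rows per plot, as used with np.repeat above


def expand_plot_ids(id_list, n_rows, pixels_per_plot=PIXELS_PER_PLOT):
    """Repeat each plot id once per pixel row, checking the total row count first."""
    expected = len(id_list) * pixels_per_plot
    if expected != n_rows:
        raise ValueError(
            f"{len(id_list)} ids x {pixels_per_plot} = {expected} rows, "
            f"but the data has {n_rows}"
        )
    return [pid for pid in id_list for _ in range(pixels_per_plot)]


# the 203056 rows loaded above split evenly into 196-row plots
n_plots, remainder = divmod(203056, PIXELS_PER_PLOT)
print(n_plots, remainder)
```

With the shape shown above, `divmod` gives 1036 plots with no leftover rows.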

In [6]:
df.head()

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_86,feature_87,feature_88,feature_89,feature_90,feature_91,feature_92,feature_93,label,plot_id
0,0.046525,0.07097,0.059846,0.303777,0.105798,0.247639,0.305501,0.332181,0.207858,0.111704,...,0.7,0.261911,0.68,1.0,3.05,0.030758,0.268511,14.25,2.0,8001
1,0.04596,0.071916,0.060067,0.304967,0.10705,0.251919,0.309712,0.335264,0.207988,0.111795,...,0.65,0.262012,0.695,0.85,4.05,0.480713,0.201347,23.95,2.0,8001
2,0.04625,0.07042,0.059739,0.289906,0.106706,0.23978,0.289128,0.31857,0.206722,0.111093,...,0.75,0.415425,0.645,0.95,4.1,0.578118,0.192824,24.9,2.0,8001
3,0.043542,0.06553,0.055383,0.267246,0.101976,0.2252,0.270993,0.302373,0.197826,0.105371,...,0.8,0.348029,0.62,1.0,6.0,0.005655,0.066262,48.4,2.0,8001
4,0.043664,0.066476,0.053536,0.289464,0.100282,0.236721,0.28893,0.322209,0.194232,0.101556,...,0.8,0.0,0.62,1.0,5.65,0.141905,0.092988,45.85,2.0,8001


In [7]:
df.label.value_counts()

label
2.0    74992
0.0    52269
1.0    38379
3.0    37416
Name: count, dtype: int64

In [None]:
# filter to selected features that model was trained with 
top_feats = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 18, 20, 21, 22, 24, 27, 32, 42, 50, 52, 60, 64, 65, 70, 71, 72, 73, 75, 76, 77, 81, 85, 89, 90, 93]
feat_cols = ['feature_' + str(i) for i in top_feats]
feats = df.loc[:, df.columns.isin(feat_cols)]
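
Because `isin` builds a boolean mask, a requested column that is missing from `df` is skipped silently rather than raising an error. A small sketch of a pre-check, assuming the `feature_<n>` naming shown in `df.head()`; `missing_features` is a hypothetical helper.

```python
def missing_features(columns, feature_ids, prefix="feature_"):
    """Return requested feature column names that are absent from `columns`."""
    available = set(columns)
    wanted = [f"{prefix}{i}" for i in feature_ids]
    return [name for name in wanted if name not in available]
```

For example, `missing_features(df.columns, top_feats)` should come back empty before `feats` is passed to the model.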

## Find common issues in data
Use cleanlab's Datalab to detect issues in the dataset (label errors, outliers, near duplicates, and non-IID sampling).

In [8]:
cat = joblib.load("../../models/model.joblib")

In [9]:
# generate probabilities
pred_probs = cross_val_predict(estimator=cat, 
                               X=feats, 
                               y=df.label, 
                               cv=5, 
                               method="predict_proba")

lab = Datalab(data=df, label_name="label")
lab.find_issues(features=feats, pred_probs=pred_probs)
lab.report()

Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided features ...
Finding near_duplicate issues ...
Finding non_iid issues ...

Audit complete. 89276 issues found in the dataset.
Here is a summary of the different kinds of issues found in the data:

    issue_type  num_issues
         label       38942
near_duplicate       37723
       outlier       12610
       non_iid           1

Dataset Information: num_examples: 203056, num_classes: 4


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated to be potentially incorrect
    (e.g. due to annotation error) are flagged as having label issues.
    

Number of examples with this issue: 38942
Overall dataset quality in terms of this issue: 0.8028

Examples representing most severe instances of this issue:
        is_label_issue   label_score  given_label  predicted_label
153420            True  5.195775e-07          3.0              2.

## Analysis of Results

In [13]:
# do not run - import csvs from next cell
label_assessment = lab.get_issues("label").sort_values("label_score")
dup_assessment = lab.get_issues("near_duplicate").sort_values("near_duplicate_score")
outlier_assessment = lab.get_issues("outlier").sort_values("outlier_score")
non_iid_assessment = lab.get_issues("non_iid").sort_values("non_iid_score")
for var_name in ["label_assessment", "dup_assessment", "outlier_assessment", "non_iid_assessment"]:
    globals()[var_name].to_csv(f"{dir}{var_name}.csv", index=True)

In [14]:
## imports
label_assessment = pd.read_csv(f"{dir}label_assessment.csv", index_col="Unnamed: 0")
dup_assessment = pd.read_csv(f"{dir}dup_assessment.csv", index_col="Unnamed: 0")
outlier_assessment = pd.read_csv(f"{dir}outlier_assessment.csv", index_col="Unnamed: 0")
non_iid_assessment = pd.read_csv(f"{dir}non_iid_assessment.csv", index_col="Unnamed: 0")

In [15]:
# filter to problem pixels
label_issues = label_assessment[label_assessment.is_label_issue == True]
dup_issues = dup_assessment[dup_assessment.is_near_duplicate_issue == True]
outlier_issues = outlier_assessment[outlier_assessment.is_outlier_issue == True]
non_iid_issues = non_iid_assessment[non_iid_assessment.is_non_iid_issue == True]

# add survey and plot id (plus label for duplicates and outliers)
label_issues = label_issues.assign(
    plot_id = label_issues.index.map(df['plot_id']),
    survey = lambda x: x['plot_id'].str[:2],
)

dup_issues = dup_issues.assign(
    plot_id = dup_issues.index.map(df['plot_id']),
    survey = lambda x: x['plot_id'].str[:2],
    label = dup_issues.index.map(df['label'])
)

outlier_issues = outlier_issues.assign(
    plot_id = outlier_issues.index.map(df['plot_id']),
    survey = lambda x: x['plot_id'].str[:2],
    label = outlier_issues.index.map(df['label'])
)

### Label issues
- Each example gets a numeric label quality score between 0 and 1 that estimates how severely it is affected by this issue. Higher scores (closer to 1) indicate examples that are less likely to be mislabeled.
- Note that the threshold for partial plots was lowered from 190-196 to 180-196
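
As a rough intuition for what such a score captures, one common definition is self-confidence: the model's predicted probability for the label the example was given. The sketch below assumes that definition; whether it matches the exact scoring used in this run is an assumption, and the probabilities are illustrative rather than taken from this dataset.

```python
def self_confidence(pred_probs, labels):
    """Score each example by the predicted probability of its given label."""
    return [probs[int(label)] for probs, label in zip(pred_probs, labels)]


# illustrative probabilities for three pixels over four classes (not real data)
pred_probs = [
    [0.05, 0.05, 0.85, 0.05],
    [0.10, 0.20, 0.69, 0.01],
    [0.70, 0.10, 0.10, 0.10],
]
labels = [2.0, 3.0, 0.0]

scores = self_confidence(pred_probs, labels)
most_suspect = min(range(len(scores)), key=scores.__getitem__)
```

Under this definition the second pixel is the most suspect: it is labeled class 3, but the model puts almost all of its probability on class 2.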

In [18]:
label_issues.head(10)

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label,plot_id,survey
153420,True,5.195775e-07,3.0,2.0,22014,22
135023,True,6.400492e-07,1.0,0.0,21119,21
143818,True,6.885929e-07,3.0,2.0,21194,21
143747,True,6.9496e-07,3.0,2.0,21194,21
143829,True,7.426894e-07,3.0,2.0,21194,21
143843,True,7.94544e-07,3.0,2.0,21194,21
143734,True,9.02411e-07,3.0,2.0,21194,21
143761,True,9.214077e-07,3.0,2.0,21194,21
143746,True,9.742502e-07,3.0,2.0,21194,21
143733,True,1.045339e-06,3.0,2.0,21194,21


In [40]:
# how many pixels in a plot are affected by label issues?
counts = label_issues.plot_id.value_counts()
full_plots_li = counts.loc[counts == 196]
almost_full_li = counts.loc[(counts >= 180) & (counts < 196)]
partial_plots_li = counts.loc[(counts >= 98) & (counts < 196)] 
print(f"full: {len(full_plots_li)}")
print(f"almost full: {len(almost_full_li)}")
print(f"partial: {len(partial_plots_li)}")

full: 28
almost full: 27
partial: 140
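
The bucketing above can be read as a function of how many of a plot's 196 pixels were flagged, with 98 being half a plot. A dependency-free sketch under those thresholds follows; `bucket_plots` is a hypothetical helper and the ids in the usage lines are made up. Note that "partial" spans 98 up to 195 pixels, so it also contains every "almost full" plot.

```python
from collections import Counter


def bucket_plots(flagged_plot_ids, pixels_per_plot=196,
                 almost_full_min=180, partial_min=98):
    """Group plots by how many of their pixels were flagged."""
    counts = Counter(flagged_plot_ids)
    full = {p for p, n in counts.items() if n == pixels_per_plot}
    almost_full = {p for p, n in counts.items()
                   if almost_full_min <= n < pixels_per_plot}
    partial = {p for p, n in counts.items()
               if partial_min <= n < pixels_per_plot}
    return full, almost_full, partial


# one flagged-pixel entry per row; ids are illustrative only
flagged = ["a"] * 196 + ["b"] * 185 + ["c"] * 100 + ["d"] * 10
full, almost_full, partial = bucket_plots(flagged)
```

Because the "partial" bucket already includes the "almost full" plots, the full and partial buckets together cover every plot with at least half of its pixels flagged.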


In [22]:
# identify which survey contains the most full plot (28 total) label errors
tmp = label_issues[label_issues.plot_id.isin(full_plots_li.index)]
tmp.groupby(['survey']).is_label_issue.count()/196

survey
14     1.0
20     1.0
21    25.0
22     1.0
Name: is_label_issue, dtype: float64

In [23]:
# which pixels have the most severe scores?
tmp.sort_values(by=['label_score'])[:20]

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label,plot_id,survey
143818,True,6.885929e-07,3.0,2.0,21194,21
143747,True,6.9496e-07,3.0,2.0,21194,21
143829,True,7.426894e-07,3.0,2.0,21194,21
143843,True,7.94544e-07,3.0,2.0,21194,21
143734,True,9.02411e-07,3.0,2.0,21194,21
143761,True,9.214077e-07,3.0,2.0,21194,21
143746,True,9.742502e-07,3.0,2.0,21194,21
143733,True,1.045339e-06,3.0,2.0,21194,21
143721,True,1.116282e-06,3.0,2.0,21194,21
143778,True,1.133127e-06,3.0,2.0,21194,21


In [24]:
# identify which survey contains the most partial plot label errors
tmp2 = label_issues[label_issues.plot_id.isin(partial_plots_li.index)]
tmp2.groupby(['survey']).is_label_issue.count()

survey
08    5468
14     747
15    1094
19     836
20    1397
21    8637
22    1869
Name: is_label_issue, dtype: int64

### Duplication Issues
A (near) duplicate issue refers to two or more examples in a dataset that are extremely similar to each other, relative to the rest of the dataset, and can potentially cause issues in model training and analytics. When near-duplicates are present, models may unexpectedly emphasize these examples, especially if they were accidentally duplicated. In such cases, it can help to remove (near) duplicate copies from your dataset to ensure accurate and reliable results.  
The choice of which example to keep in each set of near-duplicate examples can be made in a variety of ways. Here, the decision is made at the plot level: the plot with the lowest cumulative near-duplicate score is kept.

In [43]:
# how many pixels in a plot are affected by duplication issues?
counts = dup_issues.plot_id.value_counts()
full_plots_dup = counts.loc[counts == 196]
almost_full_dup = counts.loc[(counts >= 180) & (counts < 196)]
partial_plots_dup = counts.loc[(counts >= 98) & (counts < 196)] 
print(f"full: {len(full_plots_dup)}")
print(f"almost full: {len(almost_full_dup)}")
print(f"partial: {len(partial_plots_dup)}")

full: 6
almost full: 31
partial: 153


In [45]:
# identify which survey contains the most full plot (6 total) dup errors
tmp_dup = dup_issues[dup_issues.plot_id.isin(full_plots_dup.index)]
tmp_dup.groupby(['survey']).is_near_duplicate_issue.count()/196

survey
23    6.0
Name: is_near_duplicate_issue, dtype: float64

In [49]:
# identify the plot with the lowest cumulative near-duplicate score (23048)
# this plot will be kept and the others dropped
tmp_dup.groupby(['plot_id']).near_duplicate_score.sum()

plot_id
23012    0.004496
23017    0.004372
23023    0.004139
23046    0.004446
23047    0.003688
23048    0.002391
Name: near_duplicate_score, dtype: float64
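
The selection rule used here (keep the plot with the lowest cumulative near-duplicate score, drop the rest) can be sketched directly from the printed sums. This is a standalone illustration; plot ids are written as strings, matching how they are removed from the drop list later in the notebook.

```python
# cumulative near-duplicate scores per plot, copied from the output above
score_sums = {
    "23012": 0.004496,
    "23017": 0.004372,
    "23023": 0.004139,
    "23046": 0.004446,
    "23047": 0.003688,
    "23048": 0.002391,
}

keep = min(score_sums, key=score_sums.get)           # plot to retain
drop = sorted(pid for pid in score_sums if pid != keep)  # remaining five plots
```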

### Outlier Issues

In [50]:
# are outliers present in a specific class?
outlier_issues

Unnamed: 0,is_outlier_issue,outlier_score,plot_id,survey,label
128857,True,0.855374,21077,21,1.0
128856,True,0.891947,21077,21,1.0
11968,True,0.895923,08084,08,1.0
37908,True,0.905133,14050,14,2.0
16273,True,0.906014,08114,08,3.0
...,...,...,...,...,...
190922,True,0.982661,22235,22,2.0
184384,True,0.982661,22192,22,0.0
123655,True,0.982661,21046,21,0.0
173330,True,0.982661,22131,22,0.0


In [51]:
outlier_issues.label.value_counts()

label
0.0    5186
2.0    3842
3.0    2009
1.0    1573
Name: count, dtype: int64
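
Raw counts favor the larger classes, so one way to read this output is to normalize by class size. The sketch below uses the class totals from `df.label.value_counts()` earlier in this notebook; the normalization is an added illustration, not a step from the original analysis.

```python
# pixels per class (from df.label.value_counts()) and outliers per class (above)
class_totals = {0.0: 52269, 1.0: 38379, 2.0: 74992, 3.0: 37416}
outlier_counts = {0.0: 5186, 1.0: 1573, 2.0: 3842, 3.0: 2009}

outlier_rate = {
    label: outlier_counts[label] / class_totals[label] for label in class_totals
}

for label, rate in sorted(outlier_rate.items(), key=lambda kv: kv[1], reverse=True):
    print(f"class {label}: {rate:.1%} of pixels flagged as outliers")
```

Viewed this way, class 0 stands out at about 10% of its pixels, compared with roughly 4 to 5% for the other classes.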

## Cleaning Approach
- drop all plots with full or partial label issues, i.e. at least 98 of 196 pixels flagged (168 unique plots)
- drop all plots where every pixel is flagged as a near duplicate, except the one plot that is kept (6 - 1 = 5 plots)

In [42]:
label_issues_to_drop = list(full_plots_li.index) + list(partial_plots_li.index)
len(label_issues_to_drop)

168

In [60]:
dup_issues_to_drop = list(full_plots_dup.index)
dup_issues_to_drop.remove('23048')

In [64]:
final = label_issues_to_drop + dup_issues_to_drop
final = list(set(final))
len(final)

173

In [65]:
with open(f"{dir}cleanlab_id_drops.json", "w") as file:
    json.dump(final, file)

In [10]:
# view a sorted list of ids
with open(f"{dir}cleanlab_id_drops.json", "r") as file:
    drops = json.load(file)
    
sorted_drops = sorted(drops, key=int)
sorted_drops

['08013',
 '08026',
 '08039',
 '08042',
 '08061',
 '08143',
 '08194',
 '14017',
 '14025',
 '14089',
 '14096',
 '14213',
 '14221',
 '15013',
 '15015',
 '15024',
 '15031',
 '15035',
 '19076',
 '19078',
 '19081',
 '19082',
 '19083',
 '20023',
 '20134',
 '20154',
 '21102',
 '21107',
 '21108',
 '21112',
 '21133',
 '21134',
 '21135',
 '21136',
 '21137',
 '21142',
 '21143',
 '21147',
 '21148',
 '21149',
 '21155',
 '21156',
 '21158',
 '21160',
 '21161',
 '21162',
 '21163',
 '21166',
 '21166',
 '21169',
 '21170',
 '21174',
 '21176',
 '21181',
 '21183',
 '21185',
 '21188',
 '21188',
 '21193',
 '21193',
 '21194',
 '21196',
 '21201',
 '21207',
 '21208',
 '21210',
 '21212',
 '21215',
 '21216',
 '21216',
 '21219',
 '21231',
 '21237',
 '21237',
 '22015',
 '22051']