# Correcting GoogleNews labels with Cleanlab
In this notebook, we use [Cleanlab](https://github.com/cgnorthcutt/cleanlab) to assess the quality of our manual labels for the Wikidata occupations found in the GoogleNews embeddings.
We then update the dataset to correct the inconsistencies that Cleanlab flags.

In [41]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import matplotlib.pyplot as plt
from tqdm import tqdm
from random import sample
from sklearn.model_selection import train_test_split
import numpy as np 
import cleanlab
import pandas as pd
from sklearn.linear_model import LogisticRegression

import os
import sys
import inspect
from pathlib import Path 
currentdir = Path.cwd()
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir) 

from mlyoucanuse.embeddings import get_embeddings_index

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [42]:
%%time
gnews_embed = get_embeddings_index('GoogleNews', parent_dir=parentdir, embedding_dimensions=300)
gnews_vocab = {tmp for tmp in tqdm(gnews_embed.keys())} 
sample(list(gnews_vocab), 5)  # random.sample no longer accepts a set on recent Python versions

100%|██████████| 3000000/3000000 [00:01<00:00, 1784789.96it/s]

CPU times: user 2min 15s, sys: 8.05 s, total: 2min 23s
Wall time: 2min 28s





In [43]:
occs_df = pd.read_csv('occupations.wikidata.all.gnews.labeled.csv', sep='\t')
print(f"Number of occupations: {len(occs_df):,}")

Number of occupations: 10,974


In [46]:
occs_df.tail()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
10969,10526680,1,hovkamrerare,,0,sl,wikidata,-1,,
10970,10526703,1,Hovrättspresident,,0,sv,wikidata,-1,,
10971,66486266,1,activista taurí,subclase de activista,0,oc,wikidata,-1,,
10972,360443,1,Escort,Wikimedia disambiguation page,1,oc,wikidata,1,human,
10973,87252988,1,弓道家,,0,Unknown,wikidata,-1,,


In [47]:
all_occupation_names = occs_df['occupation'].tolist()
all_occ_names_gnews_format = [ tmp.replace(' ', '_') for tmp in all_occupation_names ]

print(f"Number of single occupations in GNews and Wikidata: {len(gnews_vocab & set(all_occupation_names)):,}")
print(f"Number of compound occupations in GNews and Wikidata: {len(gnews_vocab & set(all_occ_names_gnews_format)):,}")

good_occs = occs_df.query("in_google_news ==1 and label ==1")['occupation'].tolist()
bad_occs = occs_df.query("in_google_news ==1 and label ==0 ")['occupation'].tolist()
print(f"Number of positive examples: {len(good_occs):,}; negative examples: {len(bad_occs):,}") 

sample(good_occs, 5), sample(bad_occs, 5)

Number of single occupations in GNews and Wikidata: 2,937
Number of compound occupations in GNews and Wikidata: 3,520
Number of positive examples: 1,950; negative examples: 1,570


(['pensioner', 'mascot', 'firefighter', 'gallerist', 'student'],
 ['Nanga', 'newspaper', 'ballet', 'delegate', 'videography'])

In [48]:
y = np.concatenate([np.ones(len(good_occs), dtype=np.int32), np.zeros(len(bad_occs), dtype=np.int32)])

all_words = good_occs + bad_occs
all_words = [tmp.replace(' ', '_') for tmp in all_words]

# We'll save the mapping of index to words for decoding
idx_label_map = {}
for idx, name in enumerate(all_words):
    idx_label_map[idx]= name
    
X = [gnews_embed[tmp] for tmp in all_words]
X = np.array(X)
X.shape, y.shape

((3520, 300), (3520,))
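The cell above looks up every occupation directly in `gnews_embed`, which works because we filtered on `in_google_news == 1`. As a minimal sketch of a more defensive variant, the helper below skips names that have no vector. The `toy_embed` dictionary and the `build_matrix` function are hypothetical stand-ins introduced only for illustration; they are not part of `mlyoucanuse`.

```python
import numpy as np

# Hypothetical stand-in for the 300-dimensional GoogleNews vectors.
toy_embed = {
    "fire_fighter": np.array([0.9, 0.1, 0.0]),
    "teacher": np.array([0.8, 0.2, 0.1]),
    "ballet": np.array([0.1, 0.9, 0.3]),
}


def build_matrix(names, labels, embed):
    """Return (X, y, kept_names), skipping names that have no embedding."""
    rows, kept_labels, kept_names = [], [], []
    for name, label in zip(names, labels):
        key = name.replace(" ", "_")  # same space-to-underscore conversion as above
        vec = embed.get(key)
        if vec is None:
            continue  # not in the vocabulary, so we cannot build a feature row
        rows.append(vec)
        kept_labels.append(label)
        kept_names.append(key)
    return np.vstack(rows), np.array(kept_labels, dtype=np.int32), kept_names


X_toy, y_toy, kept = build_matrix(
    ["fire fighter", "teacher", "ballet", "not in vocab"], [1, 1, 0, 0], toy_embed
)
print(X_toy.shape, y_toy.tolist(), kept)
```

Returning the kept names alongside `X` and `y` keeps the row index and the word list aligned, which is what `idx_label_map` relies on later.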

## Use Cleanlab to find potentially problematic labels
* Get out-of-sample predicted probabilities using cross-validation
* Compute confident joint
* Find label errors

## Step 1: Get out-of-sample predicted probabilities using cross-validation    

In [49]:
# using a simple, non-optimized logistic regression classifier; the cross-validation will expose label weaknesses
psx = cleanlab.latent_estimation.estimate_cv_predicted_probabilities(
    X, y, clf=LogisticRegression(max_iter=1000, multi_class='auto', solver='lbfgs'))
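To make "out-of-sample predicted probabilities" concrete, here is a rough sketch using plain scikit-learn on toy data. It is an assumption that this is comparable to what the Cleanlab helper returns; we are not claiming it matches Cleanlab's internals. Each row's probabilities come from a model that never saw that row during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
# Toy two-class data standing in for the embedding matrix X and the labels y.
X_toy = np.vstack([
    rng.normal(-1.0, 1.0, size=(50, 4)),
    rng.normal(1.0, 1.0, size=(50, 4)),
])
y_toy = np.array([0] * 50 + [1] * 50)

# Every example is predicted by a model trained on the other folds only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
psx_toy = cross_val_predict(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, method="predict_proba"
)
print(psx_toy.shape)  # one row per example, one column per class
```

Because the probabilities are held out, a confidently wrong prediction is evidence against the label rather than a sign of memorization.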

## Step 2: Compute confident joint
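Before the full implementation, a toy example may help show what "confident counting" does. This is a simplified sketch on made-up probabilities (`psx_toy` and `y_toy` are illustrative, not our data): the threshold for class k is the mean predicted probability of class k among examples labeled k, and an example is counted only if it clears at least one threshold.

```python
import numpy as np

psx_toy = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.8, 0.2],  # given label 1, but the model leans strongly towards class 0
])
y_toy = np.array([0, 0, 1, 1, 1])

# Per-class threshold: average self-confidence of the examples given that label.
thresholds = np.array([psx_toy[y_toy == k, k].mean() for k in range(2)])

# Rows are given labels, columns are confidently predicted labels.
joint = np.zeros((2, 2), dtype=int)
for label, row in zip(y_toy, psx_toy):
    confident = row >= thresholds
    if confident.sum() == 1:
        joint[label, confident.argmax()] += 1
    elif confident.sum() > 1:
        joint[label, row.argmax()] += 1  # tie-break on the most probable class

print(thresholds)
print(joint)
```

In this toy case the second example clears neither threshold and is left out of the count, while the last example lands off the diagonal, which is the signal we use to spot suspicious labels.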

In [50]:
def compute_confident_joint(psx, y):
    # Verify inputs
    psx = np.asarray(psx)

    # Find the number of unique classes if K is not given
    K = len(np.unique(y))

    # Estimate the probability thresholds for confident counting
    # You can specify these thresholds yourself if you want
    # as you may want to optimize them using a validation set.
    # By default (and provably so) they are set to the average class prob.
    thresholds = [np.mean(psx[:,k][y == k]) for k in range(K)] # P(s^=k|s=k)
    thresholds = np.asarray(thresholds)

    # Compute confident joint
    confident_joint = np.zeros((K, K), dtype = int)
    for i, row in enumerate(psx):
        y_label = y[i]
        # Find out how many classes each example is confidently labeled as
        confident_bins = row >= thresholds - 1e-6
        num_confident_bins = sum(confident_bins)
        # If more than one conf class, inc the count of the max prob class
        if num_confident_bins == 1:
            confident_joint[y_label][np.argmax(confident_bins)] += 1
        elif num_confident_bins > 1:
            confident_joint[y_label][np.argmax(row)] += 1

    # Calibrate the confident joint so that its row sums match the given label counts
    # (Cleanlab provides this helper)
    confident_joint = cleanlab.latent_estimation.calibrate_confident_joint(
        confident_joint, y)
    cleanlab.util.print_joint_matrix(confident_joint)
    return confident_joint

confident_joint = compute_confident_joint(psx, y)


 Joint Label Noise Distribution Matrix P(s,y) of shape (2, 2)
 p(s,y)	y=0	y=1
	---	---
s=0 |	1469	101
s=1 |	117	1833
	Trace(matrix) = 3302
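To read the matrix: rows are the labels we assigned (s) and columns are the labels Cleanlab estimates to be correct (y), so the off-diagonal cells estimate how many labels disagree. A small sketch of that arithmetic, using the counts printed above:

```python
import numpy as np

# Counts copied from the printed confident joint (rows: given label, cols: estimated label).
joint = np.array([[1469, 101],
                  [117, 1833]])

total = joint.sum()
off_diagonal = total - np.trace(joint)
overall_noise = off_diagonal / total

# Fraction of each given label that is estimated to belong to the other class.
per_label_noise = 1 - np.diag(joint) / joint.sum(axis=1)

print(f"Estimated questionable labels: {off_diagonal} of {total} ({overall_noise:.1%})")
print(f"Per given label: {per_label_noise.round(3)}")
```

Roughly 6% of the labels are estimated to be questionable, split fairly evenly between the two classes.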



## Step 3: Find label errors

In [51]:
def find_label_errors(confident_joint, y):
    # NOTE: this function also reads the notebook-level `psx` computed in Step 1.
    # We arbitrarily choose at least 5 examples left in every class.
    # Regardless of whether some of them might be label errors.
    MIN_NUM_PER_CLASS = 5
    # Leave at least MIN_NUM_PER_CLASS examples per class.
    # NOTE prune_count_matrix is transposed (relative to confident_joint)
    prune_count_matrix = cleanlab.pruning.keep_at_least_n_per_class(
        prune_count_matrix=confident_joint.T,
        n=MIN_NUM_PER_CLASS,
    )
    K = len(np.unique(y)) # number of unique classes
    y_counts = np.bincount(y)
    noise_masks_per_class = []
    # For each row in the transposed confident joint
    for k in range(K):
        noise_mask = np.zeros(len(psx), dtype=bool)
        psx_k = psx[:, k]
        if y_counts[k] > MIN_NUM_PER_CLASS:  # Don't prune if not MIN_NUM_PER_CLASS
            for j in range(K):  # noisy label index (k is the true label index)
                if k != j:  # Only prune for noise rates, not diagonal entries
                    num2prune = prune_count_matrix[k][j]
                    if num2prune > 0:
                        # num2prune'th largest p(classk) - p(class j)
                        # for x with noisy label j
                        margin = psx_k - psx[:, j]
                        y_filter = y == j
                        threshold = -np.partition(
                            -margin[y_filter], num2prune - 1
                        )[num2prune - 1]
                        noise_mask = noise_mask | (y_filter & (margin >= threshold))
            noise_masks_per_class.append(noise_mask)
        else:
            noise_masks_per_class.append(np.zeros(len(y), dtype=bool))

    # Boolean label error mask
    label_errors_bool = np.stack(noise_masks_per_class).any(axis=0)

    # Remove label errors if given label == model prediction
    for i, pred_label in enumerate(psx.argmax(axis=1)):
        # np.all lets this work for multi_label and single label
        if label_errors_bool[i] and np.all(pred_label == y[i]):
            label_errors_bool[i] = False

    # Convert boolean mask to an ordered list of indices for label errors
    label_errors_idx = np.arange(len(y))[label_errors_bool]
    # self confidence is the holdout probability that an example
    # belongs to its given class label
    self_confidence = np.array(
        [np.mean(psx[i][y[i]]) for i in label_errors_idx]
    )
    margin = self_confidence - psx[label_errors_bool].max(axis=1)
    label_errors_idx = label_errors_idx[np.argsort(margin)]

    print('Indices of label errors found by confident learning:')
    print('Note label errors are sorted by likelihood of being an error')
    print('but here we just sort them by index for comparison with above.')
    label_errors_idx.sort()
    print(np.array(label_errors_idx))
    return label_errors_idx

label_errors_idx = find_label_errors(confident_joint, y)

Indices of label errors found by confident learning:
Note label errors are sorted by likelihood of being an error
but here we just sort them by index for comparison with above.
[  84  138  212  251  314  363  375  433  449  468  528  541  595  729
  732  774  781  835  837  838  852  855  856  891  923  932  939  973
  988  999 1009 1041 1091 1102 1155 1188 1193 1196 1206 1218 1224 1228
 1237 1241 1245 1261 1270 1298 1313 1317 1329 1361 1366 1378 1393 1398
 1399 1416 1458 1477 1485 1495 1513 1529 1556 1559 1571 1578 1582 1585
 1586 1591 1595 1597 1603 1621 1658 1660 1665 1675 1701 1703 1709 1715
 1718 1722 1723 1726 1727 1741 1750 1754 1756 1757 1763 1776 1780 1786
 1797 1803 1804 1806 1813 1814 1823 1826 1828 1834 1837 1854 1858 1859
 1892 1901 1906 1912 1933 1952 1954 1959 1961 1963 1972 1973 1974 1983
 1986 1988 1996 2000 2001 2002 2003 2004 2007 2022 2029 2039 2043 2066
 2084 2085 2106 2124 2136 2143 2145 2151 2159 2168 2170 2202 2207 2210
 2219 2299 2300 2301 2319 2344 2354 2381 2
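The function computes a likelihood ordering and then re-sorts by index. If we wanted to review the most suspicious labels first, we could keep the margin ordering instead. A minimal sketch on toy values (the `flagged` indices are assumed for illustration):

```python
import numpy as np

psx_toy = np.array([
    [0.10, 0.90],
    [0.45, 0.55],
    [0.70, 0.30],
    [0.30, 0.70],
])
y_toy = np.array([0, 0, 1, 1])
flagged = np.array([0, 1, 2])  # indices assumed to have been flagged as label errors

# Self-confidence: held-out probability of the label we assigned.
self_confidence = psx_toy[flagged, y_toy[flagged]]
# Margin: how far the given label trails the most probable class (more negative = worse).
margin = self_confidence - psx_toy[flagged].max(axis=1)

ranked = flagged[np.argsort(margin)]
print(ranked)  # most likely errors first
```

Example 0 comes first because the model gives its assigned label only a 0.10 probability, while example 1 is a near coin flip and comes last.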

## Let's print out the questionable occupations and their labels

In [52]:
def print_label_names(idx_label_map, label_errors_idx):
    label_error_names = [idx_label_map.get(tmp) for tmp in label_errors_idx]

    # We'll sort the labels alphabetically and by label value for easier manual review
    label_error_names.sort()
    pos_labels = []
    neg_labels = []
    for name in label_error_names:
        key_name = name.replace('_', ' ')
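        # Take the first matching row; calling int() on a whole Series is deprecated in pandas
        val = int(occs_df.loc[occs_df['occupation'] == key_name, 'label'].iloc[0])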
        if val ==1:
            pos_labels.append((key_name, val))
        else:
            neg_labels.append((key_name, val))
    for key_name, val in pos_labels + neg_labels:
        print('"{}", {}'.format(key_name, val))

print_label_names(idx_label_map, label_errors_idx)        

"Aristocrat", 1
"Bankier", 1
"Bard", 1
"Baru", 1
"Boxer", 1
"Buddhist", 1
"Cacique", 1
"Cadet", 1
"Carabinieri", 1
"Carpenter", 1
"Certified Public Accountant", 1
"Confucian scholar", 1
"Curé", 1
"Derebey", 1
"Dewan", 1
"Dichter", 1
"Drayman", 1
"Equestrian", 1
"Fakir", 1
"Fundi", 1
"Hakim", 1
"Hostess", 1
"Investment Advisor", 1
"Jesuit", 1
"Jihadi", 1
"Juris Doctor", 1
"Mahant", 1
"MasterChef", 1
"Media Composer", 1
"Mohel", 1
"Noble", 1
"Pir", 1
"Public Protector", 1
"Regent", 1
"Regidor", 1
"Roman legionary", 1
"Russian oligarch", 1
"Sexton", 1
"Shah", 1
"Shogun", 1
"Smuggler", 1
"Statesman", 1
"Student", 1
"Stunt", 1
"Swami", 1
"Visor", 1
"Womanizer", 1
"abbey", 1
"academic", 1
"alewife", 1
"augur", 1
"autopilot", 1
"ayatollah", 1
"bicycle motocross", 1
"brazier", 1
"brewer", 1
"burglary", 1
"censor", 1
"chess", 1
"child", 1
"choir", 1
"comic book", 1
"communist", 1
"count", 1
"drugstore operator", 1
"duce", 1
"equestrian", 1
"existentialist", 1
"fashion", 1
"feminist", 1
"finance

In [53]:
occs_df[occs_df['occupation'] =='publican']

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
2277,24729786,43,publican,owner or manager of a pub or public house,1,ga,wikidata,0,human,


In [54]:
occs_df[occs_df['occupation'] =='sumo']

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
8791,40561,1,sumo,full-contact wrestling sport,1,la,wikidata,1,human,


In [55]:
occs_df[occs_df['occupation'] =='amanuensis']

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
2271,499134,33,amanuensis,person employed to write or type what another ...,1,es,wikidata,0,human,


In [56]:
possibly_mislabeled =[]
for idx, row in occs_df.iterrows():
    if row['label'] in (0, -1):
        if 'person employed' in str(row['description']):
            possibly_mislabeled.append(row['occupation'])
len(possibly_mislabeled), possibly_mislabeled

(8,
 ['intelligence officer',
  'security guard',
  'amanuensis',
  'private sector employee',
  'chauffeur / chauffeuse',
  'waste collector',
  'factory employee',
  'supercargo'])
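The same search can be written without `iterrows`, which tends to scale better on larger frames. A sketch on a few toy rows (the descriptions are shortened stand-ins, not the full Wikidata text):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "occupation": ["amanuensis", "sumo", "supercargo", "nurse"],
    "description": [
        "person employed to write or type",
        "full-contact wrestling sport",
        "person employed on board a vessel",
        None,  # missing descriptions are common in the real data
    ],
    "label": [0, 1, -1, 1],
})

# Rows labeled negative (0) or unlabeled (-1) whose description suggests a job.
mask = toy_df["label"].isin([0, -1]) & toy_df["description"].str.contains(
    "person employed", na=False, regex=False
)
possibly_mislabeled_toy = toy_df.loc[mask, "occupation"].tolist()
print(possibly_mislabeled_toy)
```

Passing `na=False` treats a missing description as a non-match, which plays the role of the `str(...)` cast in the loop above.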

In [57]:
occs_df[occs_df['occupation'] =='supercargo']

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
7773,915830,1,supercargo,person employed on board a vessel by the owner...,0,it,wikidata,-1,,


In [58]:
occs_df[occs_df['occupation'] =='detective']

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
1197,842782,344,detective,"investigator, either a member of a police agen...",1,ia,wikidata,0,human,


## Let's update the dataframe with our cleaned labels (after manual review, not shown here)

In [59]:
manual_good_occs =['Archdruid','Attaché','CEO','Gurkha','Hauptmann','Heir apparent','Holocaust survivor','Jedi Master','Lascar','Maker','Mawla','Meister','Official Assignee','Oligarch','Owner','Pictor','Procurator fiscal','Rynda','Seabee','Shihan','Speaker','Stasi','Subedar','Trekker','amanuensis','autodidact','baritone','billionaire','bookworm','boss','carillonneur','commoner','dame','debutante','deserter','detective','disc jockey','drover','druid','employee','employer','empresario','equerry','federalist','fifer','fugitive','godparent','heir','honorary consul','intermediary','jockey','judoka','karateka','keynote speaker','kumu hula','lama','layperson','liar','maestro','mannequin','master','mate','matriarch','millionaire','mystic','naval attaché','nuncio','nurse','nymph','pigeon fancier','pilgrim','pirate','prelate','prima donna','publican','recluse','representative','seminarian','sexual predator','sidekick','smoker','socialite','suffragette','trustee','tsar','villain', 'publican', 'detective', 'intelligence officer',  'security guard',  'amanuensis',  'private sector employee',  'chauffeur / chauffeuse',  'waste collector',  'factory employee', 'supercargo']
manual_bad_occs = ['Baru','Indie','Stunt','Visor','bicycle motocross','burglary','carmen','chess','choir','comic book','fashion','finance','gendarmerie','grips','harp','management','orthopedics','schoolbook publisher','stagehands','streamer','sumo','syndicalist','warfare','woodworkers']

In [60]:
def update_labels(occs_df, manual_good_occs, manual_bad_occs):
    print(f"Dataframe length before update {len(occs_df):,}")
    for occ in tqdm(manual_good_occs, total=len(manual_good_occs)):
        if occ in occs_df['occupation'].tolist():
            the_row = occs_df[occs_df['occupation'] == occ ]
            the_index = the_row.index[0]
            occs_df.loc[the_index, 'label'] = 1
            occs_df.loc[the_index, 'labeled_by'] = 'cleanlab'
    for occ in tqdm(manual_bad_occs, total=len(manual_bad_occs)):
        if occ in occs_df['occupation'].tolist():
            the_row = occs_df[occs_df['occupation'] == occ ]
            the_index = the_row.index[0]
            occs_df.loc[the_index, 'label'] = 0
            occs_df.loc[the_index, 'labeled_by'] = 'cleanlab'    
    print(f"Dataframe length after update {len(occs_df):,}")     
    return occs_df

occs_df = update_labels(occs_df, manual_good_occs, manual_bad_occs)

  0%|          | 0/96 [00:00<?, ?it/s]

Dataframe length before update 10,974


100%|██████████| 96/96 [00:00<00:00, 158.34it/s]
100%|██████████| 24/24 [00:00<00:00, 315.73it/s]

Dataframe length after update 10,974
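For larger correction lists, a vectorized variant of `update_labels` may be more convenient. This is only a sketch under two stated assumptions that differ from the function above: it updates every row with a matching name (not just the first), and it returns a copy instead of modifying the frame in place. The name `update_labels_vectorized` is ours.

```python
import pandas as pd


def update_labels_vectorized(df, good, bad, source="cleanlab"):
    """Set label=1 for names in `good` and label=0 for names in `bad`."""
    df = df.copy()  # leave the caller's frame untouched
    for names, value in ((good, 1), (bad, 0)):
        mask = df["occupation"].isin(names)
        df.loc[mask, "label"] = value
        df.loc[mask, "labeled_by"] = source
    return df


toy_df = pd.DataFrame({
    "occupation": ["publican", "sumo", "nurse"],
    "label": [0, 1, 1],
    "labeled_by": ["human", "human", "human"],
})
updated = update_labels_vectorized(toy_df, good=["publican"], bad=["sumo"])
print(updated)
```

Names that do not occur in the frame are silently ignored, as in the original loop.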





## Let's cycle the Cleanlab processing again and review the suggested corrections

In [61]:
good_occs = occs_df.query("in_google_news ==1 and label ==1")['occupation'].tolist()
bad_occs = occs_df.query("in_google_news ==1 and label ==0 ")['occupation'].tolist()
y = np.concatenate([np.ones(len(good_occs), dtype=np.int32), np.zeros(len(bad_occs), dtype=np.int32)])

all_words = good_occs + bad_occs
all_words = [tmp.replace(' ', '_') for tmp in all_words]

# We'll save the mapping of index to words for decoding
idx_label_map = {}
for idx, name in enumerate(all_words):
    idx_label_map[idx]= name
    
X = [gnews_embed[tmp] for tmp in all_words]
X = np.array(X)

psx = cleanlab.latent_estimation.estimate_cv_predicted_probabilities(
    X, y, clf=LogisticRegression(max_iter=1000, multi_class='auto', solver='lbfgs'))


confident_joint = compute_confident_joint(psx, y)


label_errors_idx = find_label_errors(confident_joint, y)

print_label_names(idx_label_map, label_errors_idx)   


 Joint Label Noise Distribution Matrix P(s,y) of shape (2, 2)
 p(s,y)	y=0	y=1
	---	---
s=0 |	1476	28
s=1 |	64	1952
	Trace(matrix) = 3428

Indices of label errors found by confident learning:
Note label errors are sorted by likelihood of being an error
but here we just sort them by index for comparison with above.
[ 140  216  255  318  370  382  458  538  552  562  746  793  855  859
  872  912  944  952  995 1022 1067 1130 1187 1226 1229 1260 1274 1350
 1408 1420 1442 1523 1537 1566 1567 1627 1632 1643 1648 1722 1727 1738
 1762 1764 1779 1787 1813 1815 1816 1822 1845 1864 1884 1885 1887 1898
 1907 1919 1923 1930 1952 1964 1969 1977 2024 2052 2101 2105 2170 2203
 2208 2289 2331 2374 2406 2421 2526 2531 2641 2659 2696 2714 2758 2759
 2850 2880 3040 3062 3080 3265 3452 3491]
"Aristocrat", 1
"Bankier", 1
"Bard", 1
"Bedel", 1
"Bokor", 1
"Buddhist", 1
"Cacique", 1
"Carabinieri", 1
"Carpenter", 1
"Censor", 1
"Confucian scholar", 1
"Curé", 1
"Derebey", 1
"Dewan", 1
"Dragoman", 1
"Drayman", 1


## Manually adjusting the labels
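Before editing labels by hand, it can help to see which flagged names changed between runs. The sketch below compares two lists of flagged names; `compare_passes` is a hypothetical helper, and the inputs are only the first few names from the two printed lists, not the full output.

```python
def compare_passes(previous, current):
    """Split flagged names into resolved, new, and persistent groups."""
    prev, cur = set(previous), set(current)
    return {
        "resolved": sorted(prev - cur),    # flagged before, not flagged now
        "new": sorted(cur - prev),         # flagged only in the latest run
        "persistent": sorted(prev & cur),  # flagged in both runs
    }


# First six names from each printed (alphabetically sorted) list.
first_run = ["Aristocrat", "Bankier", "Bard", "Baru", "Boxer", "Buddhist"]
second_run = ["Aristocrat", "Bankier", "Bard", "Bedel", "Bokor", "Buddhist"]

diff = compare_passes(first_run, second_run)
for group, names in diff.items():
    print(f"{group}: {names}")
```

Persistent names are the ones worth a second look: either the label really is an edge case, or the embedding for that token carries a different sense than the occupation.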

In [62]:
# from latest multipass
bad_occs_second_pass =[ "Aristocrats", "Carabinieri", "Certified Public Accountants", "Confessors", "Distillers", "Geographers", "Investment Advisors", "Raphaël", "Sextons", "audio", "comic books", "conquistadors", "counts", "fashions", "freemasons", "gentlemans", "geometer", "hair", "house", "humanitarian", "interior", "jewel", "journal", "leather", "lighting", "lyric", "magic", "naturopathic practitioners", "neurology", "parachute", "paralympic athletes", "physical", "playbill", "plenipotentiary", "podiatry", "production", "radiation", "ring", "school", "screen", "shoe", "sing", "social", "solo", "sophists", "sound", "spectrometer", "speech", "station", "steel", "structure", "super", "tamer", "tatoo", "taxi", "technical", "theatrical", "ticketing", "track", "traditional healers", "watercolors", "website", "wedding"]
good_occs_second_pass = [ "Brother", "Chief Executive", "Dempster", "Freedom Fighters", "Jedi", "Met Éireann", "Patrick", "Santa Claus", "Schiffer", "ascetic", "associate", "centenarian", "crank", "delegate", "dilettante", "erudite", "fado singer", "folk hero", "groom", "hero", "inebriate", "madam", "mule", "pastry shop", "patriarch", "pipe organ", "registrar", "stallion", "superhero", "suspect", "virgin", "woodworkers", ]

In [63]:
occs_df = update_labels(occs_df, good_occs_second_pass, bad_occs_second_pass)

100%|██████████| 32/32 [00:00<00:00, 269.20it/s]
100%|██████████| 63/63 [00:00<00:00, 1357.76it/s]

Dataframe length before update 10,974
Dataframe length after update 10,974





## Second Pass Cleanlab

In [64]:
good_occs = occs_df.query("in_google_news ==1 and label ==1")['occupation'].tolist()
bad_occs = occs_df.query("in_google_news ==1 and label ==0 ")['occupation'].tolist()
y = np.concatenate([np.ones(len(good_occs), dtype=np.int32), np.zeros(len(bad_occs), dtype=np.int32)])

all_words = good_occs + bad_occs
all_words = [tmp.replace(' ', '_') for tmp in all_words]

# We'll save the mapping of index to words for decoding
idx_label_map = {}
for idx, name in enumerate(all_words):
    idx_label_map[idx]= name
    
X = [gnews_embed[tmp] for tmp in all_words]
X = np.array(X)

psx = cleanlab.latent_estimation.estimate_cv_predicted_probabilities(
    X, y, clf=LogisticRegression(max_iter=1000, multi_class='auto', solver='lbfgs'))


confident_joint = compute_confident_joint(psx, y)


label_errors_idx = find_label_errors(confident_joint, y)

print_label_names(idx_label_map, label_errors_idx) 


 Joint Label Noise Distribution Matrix P(s,y) of shape (2, 2)
 p(s,y)	y=0	y=1
	---	---
s=0 |	1466	14
s=1 |	56	1984
	Trace(matrix) = 3450

Indices of label errors found by confident learning:
Note label errors are sorted by likelihood of being an error
but here we just sort them by index for comparison with above.
[ 140  217  281  372  384  462  543  558  617  752  861  865  878  918
  957 1018 1029 1074 1137 1194 1275 1283 1306 1418 1430 1447 1452 1453
 1548 1553 1575 1577 1578 1643 1655 1659 1739 1744 1755 1781 1793 1797
 1807 1829 1834 1837 1843 1866 1886 1909 1917 1942 1946 1986 1991 1999
 2058 2096 2123 2220 2346 2389 2421 2540 2564 2766 2999 3069 3294 3454]
"Aristocrat", 1
"Bankier", 1
"Bard", 1
"Bedel", 1
"Buddhist", 1
"Cacique", 1
"Confucian scholar", 1
"Curé", 1
"Derebey", 1
"Dewan", 1
"Dragoman", 1
"Drayman", 1
"Equestrian", 1
"Fakir", 1
"Frigate Captain", 1
"Fundi", 1
"Gendarme", 1
"Jesuit", 1
"Jihadi", 1
"Mahant", 1
"Media Composer", 1
"Pir", 1
"Procurator", 1
"Public Prote

## Third Pass Cleanlab

In [65]:
bad_occs_third_pass = [ "pipe organ", "furniture", "gondoliers", "hat", "laypersons", "paint", "panel", "set", "soap" ]
good_occs_third_pass =[ "Hofmeister", "Pythia", "dragoon", "pasha", "rikishi", "survivor"]

occs_df = update_labels(occs_df, good_occs_third_pass, bad_occs_third_pass)
good_occs = occs_df.query("in_google_news ==1 and label ==1")['occupation'].tolist()
bad_occs = occs_df.query("in_google_news ==1 and label ==0 ")['occupation'].tolist()
y = np.concatenate([np.ones(len(good_occs), dtype=np.int32), np.zeros(len(bad_occs), dtype=np.int32)])

all_words = good_occs + bad_occs
all_words = [tmp.replace(' ', '_') for tmp in all_words]

# We'll save the mapping of index to words for decoding
idx_label_map = {}
for idx, name in enumerate(all_words):
    idx_label_map[idx]= name
    
X = [gnews_embed[tmp] for tmp in all_words]
X = np.array(X)

psx = cleanlab.latent_estimation.estimate_cv_predicted_probabilities(
    X, y, clf=LogisticRegression(max_iter=1000, multi_class='auto', solver='lbfgs'))


confident_joint = compute_confident_joint(psx, y)


label_errors_idx = find_label_errors(confident_joint, y)

print_label_names(idx_label_map, label_errors_idx) 

100%|██████████| 6/6 [00:00<00:00, 240.78it/s]
100%|██████████| 9/9 [00:00<00:00, 1507.30it/s]

Dataframe length before update 10,974
Dataframe length after update 10,974






 Joint Label Noise Distribution Matrix P(s,y) of shape (2, 2)
 p(s,y)	y=0	y=1
	---	---
s=0 |	1457	18
s=1 |	49	1996
	Trace(matrix) = 3453

Indices of label errors found by confident learning:
Note label errors are sorted by likelihood of being an error
but here we just sort them by index for comparison with above.
[ 140  217  337  373  385  545  560  755  869  882  922  961 1004 1032
 1129 1140 1197 1287 1423 1435 1457 1476 1553 1580 1582 1602 1660 1664
 1744 1749 1760 1779 1784 1786 1802 1812 1839 1841 1842 1848 1871 1891
 1911 1945 1947 1951 1991 1996 2004 2222 2240 2348 2390 2421 2433 2540
 2564 2670 2706 2766 2926 3047 3069 3079 3174 3454 3482]
"Advokat", 1
"Aristocrat", 1
"Bard", 1
"Bedel", 1
"Buddhist", 1
"Cacique", 1
"Confucian scholar", 1
"Constructor", 1
"Curé", 1
"Derebey", 1
"Dewan", 1
"Dragoman", 1
"Drayman", 1
"Druggist", 1
"Equestrian", 1
"Fakir", 1
"Frigate Captain", 1
"Jesuit", 1
"Jihadi", 1
"Mahant", 1
"Media Composer", 1
"Noble", 1
"Pir", 1
"Public Protector", 1
"Regi

## Fourth Update
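To judge whether further passes are worthwhile, we can compare the confident joints printed so far. The sketch below copies the four matrices from the outputs above and reports the off-diagonal count and the share of labels on the diagonal for each run.

```python
import numpy as np

# Confident joints copied from the four printed outputs, in order.
joints = {
    "run 1": [[1469, 101], [117, 1833]],
    "run 2": [[1476, 28], [64, 1952]],
    "run 3": [[1466, 14], [56, 1984]],
    "run 4": [[1457, 18], [49, 1996]],
}

summary = {}
for name, matrix in joints.items():
    m = np.array(matrix)
    off_diagonal = int(m.sum() - np.trace(m))
    agreement = np.trace(m) / m.sum()
    summary[name] = (off_diagonal, agreement)
    print(f"{name}: off-diagonal={off_diagonal}, agreement={agreement:.3f}")
```

The off-diagonal count drops from 218 to 92, then to 70 and 67, so most of the gain came from the first round of corrections and later rounds show diminishing returns.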

In [66]:
bad_occs_fourth_pass = [  "furniture",   "hat",   "paint",   "panel",   "set"]
good_occs_fourth_pass =[ "dragoon", "goddess", "santero"]

occs_df = update_labels(occs_df, good_occs_fourth_pass, bad_occs_fourth_pass)

100%|██████████| 3/3 [00:00<00:00, 244.52it/s]
100%|██████████| 5/5 [00:00<00:00, 2823.31it/s]

Dataframe length before update 10,974
Dataframe length after update 10,974





## Another way of spot-checking: random sampling

In [67]:
occs_df.query('in_google_news==1 and (label ==1 or label==0)').sample(9)  

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
442,13418253,5490,philologist,person who practices philology,1,en,wikidata,1,human,
3914,220344,2,Pythia,priestess of the Temple of Apollo at Delphi,1,Unknown,wikidata,1,cleanlab,
9282,212927,1,Una,river in Bosnia and Herzegovina and Croatia,1,Unknown,wikidata,0,human,
5329,1043452,2,maintenance,"operational and functional checks, servicing, ...",1,fr,wikidata,1,human,
9576,37559037,1,Sergent,family name,1,fr,wikidata,1,human,
2215,1124183,25,lumberjack,craftsmen who perform the initial harvesting o...,1,lb,wikidata,1,human,
4258,18362,4,theoretical physics,branch of physics,1,en,wikidata,0,human,
5165,567086,1,Annihilator,Canadian metal band,1,la,wikidata,1,human,
9598,6958747,1,work,"particular form of activity, sold by many peop...",1,en,wikidata,0,human,
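A plain `sample` can return mostly one class by chance. As a sketch, sampling per label guarantees that both positives and negatives show up in every spot check. The toy rows below reuse a few names from the output above; `random_state` is fixed only to make the example repeatable.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "occupation": ["philologist", "Pythia", "lumberjack", "maintenance",
                   "Una", "theoretical physics", "work", "ballet"],
    "in_google_news": [1, 1, 1, 1, 1, 1, 1, 1],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

reviewed = toy_df[(toy_df["in_google_news"] == 1) & toy_df["label"].isin([0, 1])]
# Draw the same number of rows from each label.
spot_check = reviewed.groupby("label").sample(n=2, random_state=0)
print(spot_check)
```

This relies on `DataFrameGroupBy.sample`, which is available in pandas 1.1 and later.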


In [68]:
random_good_occs =['tamer', 'Rangatira' ]
random_bad_occs = ['ethnographers','clarinetists', 'importers', 'models']

In [69]:
occs_df = update_labels(occs_df, random_good_occs, random_bad_occs)

100%|██████████| 2/2 [00:00<00:00, 232.97it/s]
100%|██████████| 4/4 [00:00<00:00, 2404.99it/s]

Dataframe length before update 10,974
Dataframe length after update 10,974





## Once we're satisfied, we'll export and move on

In [71]:
occs_df.to_csv('occupations.wikidata.all.gnews.labeled.final.csv', sep='\t', index=False)
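A quick round-trip check can confirm that the tab-separated export reads back unchanged, which matters here because some occupation names contain spaces and slashes. This sketch writes toy rows to an in-memory buffer rather than to the real file.

```python
import io

import pandas as pd

toy_df = pd.DataFrame({
    "occupation": ["chauffeur / chauffeuse", "sumo"],
    "label": [1, 0],
    "labeled_by": ["cleanlab", "cleanlab"],
})

buffer = io.StringIO()
toy_df.to_csv(buffer, sep="\t", index=False)  # same separator and index setting as above
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="\t")

print(reloaded.to_dict("list") == toy_df.to_dict("list"))
```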

## Next, let's build a classifier and label the rest of the Wikidata occupations; see `training_to_label_with_BERT_and_Cleanlab.ipynb`