# Labeling occupation data with Wikipedia and GoogleNews
In this notebook we will label the occupational data we pulled down from Wikidata in the notebook: [extracting_occupation_and_employer_data_from_wikidata.ipynb](../getting_data/extracting_occupation_and_employer_data_from_wikidata.ipynb)

### Labeling Guidelines:

Occupation: an activity in which one engages; vocation (m-w.com dictionary definition)

Most commonly, an occupation should indicate a person doing a job.  Used in a sentence, a person could be substituted for the occupation.

* I work in finance.
* I work as an accountant.

Saying "finance" is an occupation is substituting the higher level domain for the occupation. Similarly:

* hair - no
* barber - yes
* stylist - yes
 
* oncology - no, that's the field
* oncologist - yes, that's the person specializing in the field

etc.
 
Plurals are invalid because they indicate an abstract group of people, not a single person performing the work:

* teachers - no
* teacher - yes

Some blending of domain is acceptable:

* software - no
* software engineer - yes
* principal engineer - yes (even though it blends Position/title with Occupation)

So, some flexibility in labeling is desirable; it's important to screen out the extreme bad cases.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
from random import sample
from collections import Counter, defaultdict
from itertools import chain
import os
import sys
import inspect
from pathlib import Path 
currentdir = Path.cwd()
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir) 

from mlyoucanuse.language_code_utils import fast_text_prediction_to_two_letter_language_code
from mlyoucanuse.embeddings import get_embeddings_index
import pandas as pd
import fasttext

Using TensorFlow backend.


# Let's load the GoogleNews embeddings we'll be using

In [2]:
%%time
gnews_embed = get_embeddings_index('GoogleNews', parent_dir=parentdir, embedding_dimensions=300)
gnews_vocab = {tmp for tmp in tqdm(gnews_embed.keys())} 
sample(gnews_vocab, 5)

100%|██████████| 3000000/3000000 [00:01<00:00, 1859725.77it/s]


CPU times: user 2min 3s, sys: 4.11 s, total: 2min 7s
Wall time: 2min 8s


In [77]:
occs_df = pd.read_csv('occupations.wikidata.csv', sep='\t')
print(f"number of occs {len(occs_df):,}; number of occs if we remove NaN {len(occs_df.dropna(subset=['en_label'])):,}")
occs_df.dropna(subset=['en_label'], inplace=True)
occs_df.to_csv('occupations.wikidata.cleaned.csv', sep='\t', index=False)

number of occs 12,754; number of occs if we remove NaN 11,280


In [78]:
print(f"Total rows: {len(occs_df):,}")
print(f"Distinct rows: {len(occs_df.drop_duplicates()):,}")
occs_df.drop_duplicates(inplace=True)

Total rows: 11,280
Distinct rows: 11,280


## Closely inspecting the DataFrame reveals that some rows are near duplicates with differing amounts in the occurence counts and description fields. We wish to retain the higher of the two occurence counts, and otherwise coalesce multiple rows.

In [79]:
label_idx = defaultdict(list) 
dupes = []
for idx, row in occs_df.iterrows():
    name = row['en_label']
    if name in label_idx:
        dupes.append( (idx, row['occupation_counts'], row.to_dict()))
        dupes.extend(label_idx[name])
    else:
        label_idx[name ].append((idx, row['occupation_counts'], row.to_dict()))
dupes.sort(key=lambda x:x[1], reverse=True)
to_remove = set()
seen = set()
for idx, item in enumerate(dupes):
    df_idx, cnt, row_dict  = item
    if row_dict['en_label'] in seen:
        to_remove.add(df_idx)
    else:
        seen.add(row_dict['en_label'])
print(f"Number of rows with possible duplicates {len(dupes):,} items to remove {len(to_remove)}, unique items {len(seen):,}")         
occs_df.drop(to_remove, inplace=True)
print(f"Number of rows with distinct occupations {len(occs_df):,} keeping highest occurence counts")

Number of rows with possible duplicates 568 items to remove 306, unique items 243
Number of rows with distinct occupations 10,974 keeping highest occurence counts


## Determine how many occupations are in the GoogleNews embeddings via all words with underscores inserted for spaces 

In [80]:
all_occupation_names = occs_df['en_label'].tolist()
all_occ_names_gnews_format = [tmp.replace(' ', '_') for tmp in all_occupation_names ]
print(f"Number of occupations harvested from Wikidata {len(all_occupation_names):,}" )
print(f"Number of single word occupations in GNews and Wikidata: {len(gnews_vocab & set(all_occupation_names)):,}")
print(f"Number of single word and compound word occupations in GNews and Wikidata: {len(gnews_vocab & set(all_occ_names_gnews_format)):,}")

Number of occupations harvested from Wikidata 10,974
Number of single word occupations in GNews and Wikidata: 2,937
Number of single word and compound word occupations in GNews and Wikidata: 3,520


In [81]:
occs_df[occs_df['en_label'] =='culinary art']

Unnamed: 0,occupation_item_id,occupation_counts,en_label,en_description
7430,2111686,1,culinary art,"art of the preparation, cooking and presentati..."


### Reasonable assumption: if the occupation has membership in an embedding, its probability of truthiness increases with the number of counts observed; corollary, small counts are suspect, or could be new trends in language usage.

#### Let's update our data features, marking if a word occurs in the GoogleNews embeddings.

In [82]:
in_gnews_data = []
for idx, row in occs_df.iterrows():
    if row['en_label'].replace(' ', '_') in gnews_vocab:
        in_gnews_data.append(1)
    else:
        in_gnews_data.append(0)

occs_df['in_google_news'] = in_gnews_data
occs_df.rename(columns={'occupation_counts': 'occupation_count',
                        'en_label': 'occupation', 
                        'en_description': 'description',
                       'occupation_item_id': 'item_id'}, inplace=True)

## Let's detect the language and provide it for each item, as it can be a valuable filter for data users
We'll use the standard FastText language detector see this [FastText blog post](https://fasttext.cc/blog/2017/10/02/blog-post.html) for a how-to create this model.

In [83]:
model = fasttext.load_model('langdetect.ftz')
langs = []
for idx, row in tqdm(occs_df.iterrows()):
    try:
        langs.append(fast_text_prediction_to_two_letter_language_code(model.predict(row['occupation'])))
    except Exception:
        print('fail at', str(row['occupation']))
        langs.append('Unknown')


10974it [00:01, 6490.53it/s]


## Set some defaults

In [84]:
occs_df['language_detected'] = langs
occs_df['source'] = 'wikidata'
occs_df['label'] = -1 # -1 means unset; will eventually be 1 or 0; True or False
occs_df['labeled_by'] = '' # Human, classifier_gnews, classifier_bert
occs_df['label_error_reason'] = '' # plural, position, employer, paraphrase
occs_df.head()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
0,82955,606008,politician,"person involved in politics, person who holds ...",1,en,wikidata,-1,,
1,189290,33481,military officer,member of an armed force or uniformed service ...,0,en,wikidata,-1,,
2,131512,8887,farmer,person that works in agriculture,1,da,wikidata,-1,,
3,1734662,2012,cartographer,person preparing geographical maps,1,ia,wikidata,-1,,
4,294126,323,land surveyor,profession,0,tr,wikidata,-1,,


## Manually Fetch Wikipedia occupations and correct 
https://en.wikipedia.org/wiki/Lists_of_occupations

In [85]:
wiki_occs = ['accessory designer', 'actor', 'administrator', 'advertising designer', 'anesthesiologist', 'anesthesiology fellow', 'animator', 'arborist', 'archimime', 'architect', 'artisan', 'artistic director', 'assistant stage manager', 'athletic trainer', 'audio engineer', 'audiologist', 'author', 'auto mechanic', 'backup dancer', 'baker', 'ballet dancer', 'ballet historian', 'ballet master', 'bariatric surgeon', 'barker', 'bartender', 'beader', 'beatboxer', 'behavior analyst', 'benshi', 'biokineticist', 'blacksmith', 'blogger', 'bobbin boy', 'boilermaker', 'boilerman', 'book coach', 'bookbinder', 'bouffon', 'brakeman', 'bridge inspector', 'bridge tender', 'call boy', 'caller', 'cardiac scientist', 'cardiac surgeon', 'cardiologist', 'cardiothoracic surgeon', 'cardiovascular technologist', 'carmen', 'carpenter', 'cashier', 'ceramics artist', 'chaplain', 'character actor', 'charge artist', 'chemical engineer', 'chief fireman', 'chief mechanical engineer', 'chiropractor', 'choreographer', 'circus arts', 'clown', 'colorist', 'comedian', 'commissioning editor', 'company manager', 'composer', 'concept artist', 'conductor', 'construction worker', 'copy editor', 'cordwainer', 'corsetier', 'costume designer', 'costume director', 'creative consultant', 'crew chief', 'curator', 'dance critic', 'dance historian', 'dance notator', 'dance scholar', 'dance therapist', 'dancer', 'demi-soloist', 'dermatologist', 'design director', 'design strategist', 'director', 'director of audience services', 'director of development', 'director of production', 'director of public relations', 'director of special events', 'dispatcher', 'dog writer', 'drag king', 'drag queen', 'dramaturge', 'draper', 'dresser', 'dressmaker', 'electrician', 'electronic equipment maintainer', 'elevator maintainer', 'embroiderer', 'emcee', 'emergency medical technician', 'endocrinologist', 'essayist', 'event planner', 'exotic dancer', 'factory worker', 'family practice physician', 'farrier', 'fashion designer', 'feller', 'fettler', 'filling station attendant', 'filmmaker', 'fine artist', 'flagman', 'flatulist', 'flight nurse', 'floral designer', 'fly crew', 'foreman', 'founder', 'freelancer', 'freight conductor', 'furniture maker', 'gandy dancer', 'ganger', 'gastroenterologist', 'geisha', 'general doctor', 'general practitioner', 'geriatrician', 'ghostwriter', 'glover', 'graphic designer', 'griot', 'grips', 'gunsmith', 'gynaecologist', 'hack writer', 'haematologist', 'hairstylist', 'harlequin', 'hatter', 'healthcare chaplain', 'hepatic biliary pancreatic surgeon', 'house manager', 'illusionist', 'illustrator', 'impressionist', 'industrial engineer', 'infopreneur', 'intensivist', 'interior designer', 'internist', 'itinerant poet', 'janitor', 'jeweler', 'jewellery designer', 'journalist', 'kobzar', 'lactation consultant', 'leatherworker', 'length runner', 'light board operator', 'lighting designer', 'lighting maintainer', 'lighting technician', 'lirnyk', 'literary editor', 'literary manager', 'locomotive engineer', 'locomotive superintendent', 'lyricist', 'magician', 'majorette', 'make-up artist', 'mammographer', 'manager', 'marine designer', 'marketing director', 'marquetarian', 'master carpenter', 'master electrician', 'master of ceremonies', 'materials engineer', 'mechanic', 'mechanical engineer', 'media designer', 'mental health counselor', 'mental health professional', 'midwife', 'miller', 'milliner', 'millwright', 'mime', 'minstrel', 'modelmaker', 'moldmaker', 'monologist', 'movement director', 'music director', 'musician', 'navvy', 'neonatalologist', 'neonatologist', 'nephrologist', 'neurologist', 'neuroradiographer', 'neurosurgeon', 'novelist', 'nurse practitioner', 'nurse-midwife', 'obstetrician', 'occupational therapist', 'occupational therapy assistant', 'oncologist', 'ophthalmologist', 'optometrist', 'orthopedic physician', 'orthotist', 'otolaryngologist', 'paint crew', 'painter', 'panel beater', 'pantomime dame', 'parachute rigger', 'paramedic', 'party planner', 'party princess', 'patternmaker', 'pediatrician', 'penciller', 'pewterer', 'pharmacist', 'phlebotomist', 'photographer', 'photojournalist', 'physical therapist', 'physical therapy assistant', 'physician', 'physician assistant', 'pipefitter', 'plant operator', 'platelayer', 'playbill writer', 'playwright', 'plumber', 'podiatric surgeon', 'podiatrist', 'poet', 'pointsman', 'pornographic actor', 'porter', 'potter', 'power maintainer', 'principal dancer', 'producer', 'production designer', 'production manager', 'professional counselor', 'promotional model', 'property master', 'prosthetist', 'psychiatrist', 'psychologist', 'psychotherapist', 'publicist', 'pulmonologist', 'quilter', 'radiation therapist', 'radiographer', 'radiologist', 'radiotherapist', 'railroad engineer', 'railway lubricator', 'registered nurse', 'respiratory therapist', 'revenue protection inspector', 'rhapsode', 'ring girl', 'ringmaster', 'running crew', 'répétiteur', 'sailmaker', 'sawfiler', 'scenic artist', 'scenic designer', 'scenographer', 'school counselor', 'scop', 'screenwriter', 'scribe', 'script coordinator', 'script doctor', 'scrivener', 'sculptor', 'seamstress', 'secondman', 'set decorator', 'set designer', 'set dresser', 'sex therapist', 'shamakhi dancers', 'shoemaker', 'shop foreman', 'showgirl', 'showman', 'showrunner', 'signal maintainer', 'signalman', 'silversmith', 'singer', 'skomorokh', 'soaper', 'social worker', 'soloist', 'songwriter', 'sonographer', 'sound designer', 'speech language pathologist', 'speechwriter', 'sport psychologist', 'spotlight operator', 'staff writer', 'stage crew', 'stage manager', 'stagehand', 'stagehands', 'station agent', 'station master', 'station superintendent', 'stationary engineer', 'steel erector', 'structure maintainer', 'stunt performer', 'superintendent', 'tailor', 'tattoo artist', 'taxi dancer', 'taxidermist', 'technical director', 'technical writer', 'telephone maintainer', 'theater manager', 'theatre practitioner', 'theatrical technician', 'ticket controller', 'ticket inspector', 'ticketing agent', 'track inspector', 'train dispatcher', 'traquero', 'turnstiles maintainer', 'upholsterer', 'urologist', 'usher', 'vedette', 'waiter', 'wardrobe supervisor', 'web designer', 'website content writer', 'wedding planner', 'welder', 'wheelwright', 'woodworkers', 'writer', 'yoga instructor']
wiki_neg_ex_occs = ['astronomy', 'biology', 'theater', 'audio', 'ballet', 'nursing', 'cardiology', 'engineering', 'art', 'dancing', 'writing', 'acting', 'factory', 'film', 'furniture', 'hair', 'hat', 'house', 'magic', 'interior', 'jewel', 'journal', 'leather', 'lighting', 'literary', 'locomotive', 'lyric', 'make-up',  'music', 'neurology', 'novel', 'dental', 'paint', 'parachute', 'panel', 'party', 'physical', 'plant', 'playbill', 'podiatry', 'production', 'radiation', 'railway', 'railroad', 'ring', 'running', 'sail', 'school', 'screen', 'script', 'set', 'sex', 'shoe', 'show', 'shop', 'sing', 'soap', 'social', 'solo', 'sound', 'speech', 'sport', 'staff', 'stage', 'station', 'steel', 'structure', 'super', 'tatoo', 'taxi', 'technical', 'theatre', 'theatrical', 'ticketing', 'track', 'train', 'web', 'website', 'wedding', 'yoga'   ]

In [86]:
max_idx = len(occs_df) +1
additions = []
for occ in tqdm(wiki_occs, total=len(wiki_occs)):
    vals = {'source': ['wikipedia'], 'label': [1], 'labeled_by': ['human'], 'occupation': [occ] }    
    in_embedding = [1] if occ.replace(' ', '_') in gnews_vocab else [0]
    vals['in_google_news'] = in_embedding
    if occ in occs_df['occupation'].tolist():
        the_row = occs_df[ occs_df['occupation'] == occ ]
        the_index = the_row.index[0]
        # print('updating', occ, 'at', the_index)
        occs_df.loc[(occs_df.index == the_index), 'label'] = 1
        occs_df.loc[(occs_df.index == the_index), 'labeled_by'] = 'human'
    else:
        ser = pd.Series(vals)
        ser.name = max_idx
        max_idx +=1
        additions.append(ser)
print(f"Good manual labels: {len(additions)}") 

for occ in tqdm(wiki_neg_ex_occs, total=len(wiki_neg_ex_occs)):
    vals = {'source': ['wikipedia'], 'label': [0], 'labeled_by': ['human'], 'occupation': [occ]}    
    in_embedding = [1] if occ.replace(' ', '_') in gnews_vocab else [0]
    vals['in_google_news'] = in_embedding
    if occ in occs_df['occupation'].tolist():
        the_row = occs_df[occs_df['occupation'] == occ]
        the_index = the_row.index[0]
        # print('updating', occ, 'at', the_index)
        occs_df.loc[the_index, 'label'] = 0
        occs_df.loc[the_index, 'labeled_by'] = 'human'
    else:
        ser = pd.Series(vals)
        ser.name = max_idx
        max_idx +=1
        additions.append(ser)
print(f"Good and Negative manual labels total: {len(additions)}") 

100%|██████████| 375/375 [00:00<00:00, 453.32it/s]
100%|██████████| 80/80 [00:00<00:00, 710.15it/s]

Good manual labels: 151
Good and Negative manual labels total: 206





In [87]:
print(f"Dataframe length before update {len(occs_df)}")
for vals in tqdm(additions, total=len(additions)):
    occs_df.append(vals)
print(f"Dataframe length after update {len(occs_df)}")

  0%|          | 0/206 [00:00<?, ?it/s]

Dataframe length before update 10974


100%|██████████| 206/206 [00:01<00:00, 128.61it/s]

Dataframe length after update 10974





## Let's check our dataframe column types and perform some quality corrections

In [88]:
occs_df.dtypes

item_id                int64
occupation_count       int64
occupation            object
description           object
in_google_news         int64
language_detected     object
source                object
label                  int64
labeled_by            object
label_error_reason    object
dtype: object

In [89]:
occs_df['description'].fillna('', inplace = True) 
occs_df['label_error_reason'].fillna('', inplace = True) 
occs_df['labeled_by'].fillna('', inplace = True) 
occs_df['language_detected'].fillna('', inplace = True) 
occs_df['occupation'].fillna('', inplace = True) 
occs_df['source'].fillna('', inplace = True) 
occs_df['item_id'].fillna(0, inplace = True) 
occs_df['occupation_count'].fillna(0, inplace = True) 
occs_df = occs_df.astype({'description': 'str', 'item_id': 'int64', 'occupation_count': 'int64'})
occs_df.head()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
0,82955,606008,politician,"person involved in politics, person who holds ...",1,en,wikidata,-1,,
1,189290,33481,military officer,member of an armed force or uniformed service ...,0,en,wikidata,-1,,
2,131512,8887,farmer,person that works in agriculture,1,da,wikidata,-1,,
3,1734662,2012,cartographer,person preparing geographical maps,1,ia,wikidata,-1,,
4,294126,323,land surveyor,profession,0,tr,wikidata,-1,,


# Strategies for Deciding Which Items to Label
* random selections
* suffix analysis

In [90]:
random_occs_to_label = occs_df.query("in_google_news ==1 and label ==-1 and language_detected=='en'").sample(500)['occupation'].tolist()
# ', '.join(random_occs_to_label)
# Captured here:
# 'choreographer, evolutionary psychologist, historian, fireman, washerwoman, crossword compiler, Scoutmaster, chartered surveyor, Tattooed Lady, developmental psychology, creationism, bioethicist, Borracho, Knights Templar, Geographer, motivational speaker, Warlock, receptionist, human rights, Human Resource, plastic surgeon, footman, Catholicos, extractive metallurgy, teaching, Sepoy, celebrity chef, Therapist, forensic pathologist, scullery maid, philosopher king, discus thrower, officer, computer forensics, cornetist, tracker, oceanographer, evolutionary biologist, tinsmith, Russian oligarch, charcoal burner, ethnographer, commissary, telegraphist, copywriter, orthodontics, mercenary, acoustics, witchcraft, Marathi, bounty hunter, Orthodox priest, chorographer, foley artist, tapestry weaver, heresy, mountaineering, wholesale, freemason, Historia, horologist, subversion, anarchist, Comic Book Artist, orthopedics, Johns Hopkins University, software, Tufts University, freediver, Peace Corps, wholesaler, rock climbing, locksmithing, torturer, Discalced Carmelites, acoustician, bassist, ethnobotanist, monarch, Benedictines, comic book, Cinematography, stenographer, fashion, anthropology, artist, publisher, prophet, apothecary, thoracic surgeon, clinical psychologist, coadjutor bishop, copyright holder, alto saxophone, dishwasher, articled clerk, handicraft, sheep shearer, Polish, metaphysician, sexually transmitted infection, manufactory, chartered accountant, CEO, industry, freelance writer, clavichord, investor, nail technician, Public Protector, benefactor, boss, modern pentathlon, ordination, carpet weaver, Surrealism, censorship, referee, sexual intercourse, crofter, paralympic athlete, cutlery, Cryptography, knighthood, Chief Executive, artist, snowboarder, shortstop, beachcomber, scoutmaster, officer, computing, comedy, wildlife conservation, historic preservation, Christianity, Catholic priest, outdoor enthusiast, abolitionist, Literary Criticism, Juris Doctor, Assistant Secretary, scouting, fiction, witness, official, internship, harpsichordist, intermediary, watercolor, cooking, announcer, Diversity, Historian, oncology, iconography, historiography, holography, lamplighter, Workman, orthodontist, abdominal surgery, coxswain, freemasonry, archdeacon, Claretians, psychotherapy, interior ministry, confectioner, conducting, planetary scientist, forensic anthropologist, catcher, layperson, factor, teamster, organ donation, Companion, charioteer, electrical engineering, aerial photography, anthroposophy, bibliography, phytotherapy, conspiracy theorist, ornithologist, toy, Screen Actors Guild, forecasting, preacher, coach, Assessor, counseling, cryptanalyst, archpriest, bass, trainee, translation, arms trafficker, taekwondo athlete, professor, postdoctoral researcher, phrenologist, calligrapher, bibliophile, psychoanalysis, champion, purchaser, heart disease, Statesman, Fashionista, YouTuber, cultural anthropology, philately, Pictor, phonetics, scout, UNICEF Goodwill Ambassador, weaving, geomorphologist, naturopathic practitioner, trainers, bullfighter, caseworker, witch, Dacoity, Affiliate, counterfeiter, Luthier, Mormon pioneers, metaphysics, archer, wedding officiant, aestheticism, treason, lobbyist, master craftsman, warrior, write, theoretical physics, functionary, coaching, literary executor, documentary filmmaker, anatomical pathologist, athlete, educationalist, freight forwarder, therapist, Investment Advisor, ultramarathon, sophist, Carthusians, Mormon missionary, outlaw, chemistry, Steadicam, Latin Americanist, Whitewash, snowshoe, anaesthesiology, Coach, strongwoman, forensic anthropology, Sofer, ethics, decathlete, glassblower, hagiographer, accounting, sportswriter, scenography, official, supervisory board, Revolutionary, modern pentathlete, ophthalmology, fiction, topographer, mechanical engineering, orchestrator, Landowner, freestyle motocross rider, bicycle motocross, Counselor, revolutionary syndicalism, Notary, chronograph, Journalist, Thinker, archer, Thief, notary, radiation therapy, biographer, geophysicist, Certified Public Accountant, faith healer, electrochemist, conjoined twins, househusband, anthologist, literary critic, health care, researcher, count, Confessor, technologist, Conservationist, hare coursing, heir presumptive, powerlifting, fountain pen, plowman, biathlon, armed robbery, newspaper, architecture, eater, pitcher, screenplay, Fon, lackey, fighter, commander, Chief Constable, trombonist, viscount, Protestant reformer, Creative, Catechesis, infantry, informant, Sexton, Architects, commander, typography, belly dancer, crack, Motorcyclist, troubleshooter, Shaper, traditional healer, heiress, Air Officer Commanding, lottery, Woodcutter, calypsonian, steeplechase, whistleblower'
with open('random.500.to.label.txt', 'wt') as fout:
    for word in random_occs_to_label:
        fout.write(f'"{word}",\n')

# Suffix Analysis
With single string occupation labeling such as we are doing, sampling from character features--such as via suffix analysis, is likely helpful.
A large amount of domain information is typically encoded into the suffix of word. Let's group by the word endings and take a look.

In [91]:
suffixes = Counter([word[-3:] for word in all_occupation_names])
print(suffixes.most_common(71))

[('ist', 1016), ('tor', 460), ('ter', 435), ('ing', 352), ('ker', 225), ('ion', 222), ('ian', 221), ('ent', 218), ('ner', 208), ('her', 204), ('ogy', 195), ('cer', 166), ('der', 149), ('ger', 146), ('yer', 138), ('ant', 137), ('ler', 125), ('eer', 111), ('nce', 108), ('man', 98), ('ics', 92), ('ier', 88), ('ism', 87), ('ity', 83), ('are', 81), ('ver', 73), ('per', 72), ('ate', 72), ('rer', 69), ('ary', 67), ('sor', 65), ('ach', 60), ('ral', 57), ('phy', 56), ('ies', 56), ('mer', 55), ('try', 55), ('ice', 55), ('eur', 49), ('ure', 47), ('ine', 40), ('ard', 39), ('ces', 39), ('ess', 39), ('ive', 38), ('ain', 36), ('ers', 36), ('and', 36), ('ery', 35), ('ser', 34), ('ber', 34), ('lor', 31), ('lar', 30), ('ial', 29), ('ect', 29), ('ory', 29), ('art', 29), ('tic', 28), ('son', 27), ('dor', 27), ('ann', 27), ('all', 27), ('ort', 26), ('del', 25), ('ire', 25), ('hop', 24), ('ign', 23), ('ete', 23), ('wer', 23), ('sta', 23), ('rat', 22)]


## Let's write out the samples in clusters, for easier manual review

In [92]:
suffix_lists = defaultdict(list)
for word in all_occupation_names:
    suffix_lists[word[-3:]].append(word)
output =[]
for word, count in suffixes.most_common(71):
    output.append(sample(suffix_lists[word], 10))
with open('suffixes.to.label.txt', 'wt') as fout:
    for word in chain.from_iterable( output):
        fout.write(f'"{word}",\n')

# Manually label the occupations and Update the dataset with the labels

In [93]:
bad_occs_suffix_groups = ['mythological king', 'art collecting', 'dog breeding', 'radio broadcasting', 'Landscape contracting', 'icon painting', 'Counselor-in-Training', 'knife making', 'poetry reading', 'coin counterfeiting', 'book illustration', 'corporate communication', 'television', 'speculative fiction', 'music organisation', 'medical organization', 'botanical illustration', 'fashion illustration', 'health administration', 'search engine optimization', 'entertainment', 'labour movement', 'investment', 'Marketing management', 'guerrilla movement', 'neurophysiology', 'hydrobiology', 'Veterinary Microbiology', 'anatomical pathology', 'Catholic theology', 'Candidate of Sciences in Pedagogy', 'orchidology', 'malacology', 'cynology', 'traumatology', 'member of the Senate of France', 'headmaster in France', 'insurance', 'Weroance', 'Fire services in France', 'surface science', 'landscape science', 'fence', 'dauphin of France', "turn state's evidence", 'gymnastics', 'computer forensics', 'corrupt politics', 'statistics', 'paralympic athletics', 'pyrotechnics', 'phonetics', 'physical optics', 'Acousto-optics', 'atmospheric physics', 'carabinier', 'cavalier', 'humanism', 'Cameralism', 'shamanism', 'radio journalism', 'revolutionary syndicalism', 'Mexican muralism', 'radical feminism', 'investigative journalism', 'Ministry of Tourism', 'sports journalism', 'Rector of Charles University', 'fertility deity', 'Uppsala University', 'Lipscomb University', 'Ohio State University', 'spirituality', 'principal of Uppsala University', 'herrnhutare', 'samariterhemsföreståndare', 'korkfabriksidkare', 'teckenspråkslärare', 'hemmansägare', 'ärftlighetsforskare', 'kryddkrämare', 'kalkbruksägare', 'fyrmästare', 'lever', 'triumvirate', 'Baku Governorate', 'doctorate', 'Governing Senate', 'Named Professor', 'military', 'canonge lectoral', 'patostreamer', 'choreography', 'Cryptography', 'demography', 'holography', 'aerial photography', 'historiography', 'Agregation of history and geography', 'cinematography', 'philosophy', 'glamour photography', 'member of the Argentine Chamber of Deputies', 'Cultural Institutions Studies', 'Finno-Ugric studies', 'Middle Eastern studies', 'Semitic studies', 'Serbian studies', 'Jewish studies', 'Columbia University Libraries', 'Japanese studies', 'United States Department of Justice', 'Foreign Intelligence Service', 'Federal Police', 'Military Police', 'French high civil service', 'Indian Police Service', 'civil service', 'voluntary service', 'Ministry', 'forestry', 'neuropsychiatry', 'home industry', 'management consulting industry', 'infantry', 'General of the Infantry', 'pharmaceutical industry', 'advertising industry', 'pulp and paper industry', 'person of short stature', 'travel literature', 'Pollet soap manufacture', 'sculpture', 'public figure', 'scholar of the bible as literature', 'caricature', 'sinecure', 'sports figure', 'show business', 'Bangladesh Coast Guard', 'Director-General of the National Heritage Board', 'Captain of the guard', 'Young Guard', 'Member of the Congress of Deputies of Spain', 'mountain', 'District Chief Executive', 'agricultural cooperative', 'Muslim Leadership Initiative', 'emergency medicine', 'Haute cuisine', 'medicine', 'Marta Marín-Dòmine', 'Student medicine', 'space marine', 'transfusion medicine', 'Revolutionary Insurrectionary Army of Ukraine', 'medical sciences', 'archaeology of the Roman provinces', 'Judge Advocate General of the Armed Forces', 'Candidate of Biology Sciences', 'Soviet Armed Forces', 'United States Armed Forces', 'Actrices', 'Norwegian School of Sport Sciences', 'Indonesian National Armed Forces', 'ice cream parlor', 'Category:Composers', 'Shunters', 'racehorse owners and breeders', 'Women Human Rights Defenders', 'United States Army Rangers', 'Islamic religious leaders', 'Lawyers', 'Youtubers', 'Mormon pioneers', 'list of non-fiction writers', 'command', 'Maquis shrubland', 'land surveyor in Poland', 'Historical Dictionary of Switzerland', 'Modstandsmand', 'Member of Parliament in the Parliament of England', 'Kalkulaturvorstand', 'Privy Council of Thailand', 'Deputy Commander, ROK/US Combined Forces Command', 'grave robbery', 'surgery', 'colorectal surgery', 'upholstery', 'nursery', 'embroidery', 'abdominal surgery', 'literary forgery', 'pottery', 'Imagery', 'Bokser', 'Comercial', 'Industrial', 'Cheval territorial', 'infomercial', 'Universal Esperanto Association committee member', 'Varela Project', 'project', 'mystic', 'number theory', 'art history', 'cultural history', 'Princeton Plasma Physics Laboratory', 'sketch story', 'combinatorial group theory', 'International Institute of Social History', 'signatory', 'history', 'prehistory', 'concept art', 'libre art', 'art', 'Go-kart', 'Xerox art', 'new media art', 'conceptual art', 'history of art', 'body art', 'glass art', 'treason', 'Deputy of the National Congress of Ecuador', 'prank call', 'handball', 'Australian rules football', 'ball', 'korfball', 'football', 'climbing wall', 'baseball', 'sport', 'Melbourne Airport', 'cycle sport', 'ice stock sport', 'maritime transport', 'transport', 'Tuatha Dé Danann', 'Fourth State Duma of the Russian Empire', 'satrapy of the Achaemenid Empire', 'Second State Duma of the Russian Empire', 'satire', 'Member of the State Duma of the Russian Empire', 'informal attire', 'graphic design', 'interaction design', 'jewellery design', 'fashion design', 'software design', 'garden design', 'costume design', 'video game design', 'web design', 'game design', 'mayor of Albacete', 'combined track and field event athlete', 'Naczelnik miasta', ]
good_occs_suffix_groups = ['sovietologist', 'music artist', 'labour scientist', 'physiologist',              'sports columnist', 'paleoentomologist', 'radio journalist',              'gemologist', 'venereologist', 'optometrist', 'Liquidator',              "children's illustrator", 'second unit director', 'literary translator',              'labour inspector', 'prosecutor', 'telephone operator', 'boom operator',              'diocesan administrator', 'accident investigator', 'classical trumpeter',              'bounty hunter', 'Steuerberater', 'comics writer', 'importer',              'pipefitter', 'Beamter', 'Proviantmeister', 'Büroangestellter',              'matte painter', 'lace maker', 'tool and die maker', 'lineworker',              'slacker', 'coin weight maker', 'fact checker', 'policymaker',              'Schlagwerker', 'School social worker', 'model maker', 'equestrian',              'funk musician', 'comedian', 'applied statistician', 'fashion historian',              'bluegrass musician', 'naval historian', 'fictional politician',              'Politician', "children's librarian", 'high school student', 'Resident',              'Senior Agent', 'president', 'building superintendent',              'government commissioner', 'gardener', 'urban planner', 'miner',              'judicial scrivener', 'judicial commissioner', 'motorcycle designer',              'cleaner', 'organ tuner', 'managing partner', 'voice teacher',              'Forstaufseher', 'letterpress researcher', 'market researcher',              'high school teacher', 'Gefängnisaufseher', 'Teacher',              'language teacher', 'leading researcher', 'anthroposopher',              'reservofficer', 'Producer', 'dragon boat racer', 'air force officer',              'Communications Officer', 'Japan Coast Guard Officer', 'tango dancer',              'company security officer', 'bicycle police officer', 'police officer',              'mass murder', 'service provider', 'Warder', 'plant breeder',              'activist shareholder', 'religion founder', 'bodybuilder',              'spiritual leader', 'co-founder', 'labor leader', 'hotel manager',              'Schlagersänger', 'general manager', 'pop singer', 'fishmonger', 'luger',              'research data manager', 'folk singer', 'fellmonger', 'Kammersänger',              'Stationsaspirant', 'clerical assistant', 'house servant',              'social and health care assistant', 'consultant',              'environmental consultant', 'Deputy Lieutenant',              'social-media consultant', "magician's assistant",              'international civil servant', 'rugby player', 'cosplayer',              'beach volleyball player', 'administrative lawyer', 'rink hockey player',              'English billiards player', 'cimbalom player', 'pétanque player',              'tennis player', 'oină player', 'Curler', 'storyteller', 'Distiller',              'cobbler', 'offender profiler', 'antiquarian seller', 'driller',              'sjoeler', 'bookseller', 'cobbler', 'engineer', 'Sanitary engineer',              'research engineer', 'ski-orienteer', 'flight engineer',              'Digital marketing engineer', 'pioneer', 'railway engineer',              'software engineer', 'biomedical engineer', 'clubwoman', 'Dragoman',              'horseman', 'rag-and-bone man', 'freedman', 'plowman', 'gentleman',              'spelman', 'creelman', 'frontiersman', 'gondolier', 'furrier', 'cashier',              'Holocaust denier', 'temporary career soldier', 'Zimmerpolier',              'soldier', 'marchand-mercier', 'Judge authority', 'local authority',              'airport security', 'warfare', 'sniper', 'housekeeper', 'bookkeeper',              'sapper', 'storekeeper', 'doorkeeper', 'clone trooper', 'Wheeltapper',              'video game developer', 'United States Marine Corps Scout Sniper',              'motorcycle manufacturer', 'aerospace manufacturer', 'explorer',              'Gödel Lecturer', 'lecturer', 'murderer', 'letterer',              'part-time lecturer', 'automobile manufacturer', 'textile manufacturer',              'glass engraver', 'aerial observer', 'stocking weaver', 'medal engraver',              'taxi driver', 'co-driver', 'altar server', 'observer', 'diver',              'Chief Clerk of the Senate', 'papal legate', 'examining magistrate',              'Parliamentarian of the United States Senate', 'Ph.D. candidate',              'Roman magistrate', 'full professor', 'wardrobe supervisor',              'European Data Protection Supervisor', 'Visor', 'associate professor',              'sponsor', 'supervisor', 'progressor', 'assistant professor',              'functionary', 'fictional secretary', 'Federal Secretary',              'prothonotary', 'apothecary', 'minister plenipotentiary',              'departmental secretary', 'plenipotentiary', 'permanent secretary',              'advocate general', 'Korporal', 'Illinois Attorney General',              'Flottillenadmiral', 'Secretary-general', 'vice admiral',              'Governor-general', 'captain general', 'Massachusetts Attorney General',              'basketball coach', 'Football coach', 'public speaking coach',              'riding coach', 'cricket coach', 'ski jumping coach', 'volleyball coach',              'voice coach', 'cross-country skiing coach', 'dialect coach',              'jazz drummer', 'mind gamer', 'Framer', 'blackface minstrel performer',              'Lebensreformer', 'competitive programmer', 'Sega performer',              'Singer, Songwriter & Legendary Performer', 'amateur astronomer',              'Master of ceremonies', 'cantatrice', 'chief of police', 'répétiteur',              'Oberingenieur', 'Horeca entrepreneur', 'amateur', 'Bohringenieur',              'traceur', 'Sauveteur', 'fashion entrepreneur', 'Entrepreneur',              'Infopreneur', 'legendary figure', 'Hostess', 'Ancient Roman priestess',              'school mistress', 'deaconess', 'fictional princess', 'Countess',              'canoness', 'governess', 'Goliard', 'bard', 'coast guard', 'shipyard',              'data steward', 'dreyfusard', 'supervillain', 'coxswain',              'Papal chamberlain', 'Châtelain', 'captain', 'senior chamberlain',              'military chaplain', 'coxswain', 'broadcasting executive', 'executive',              'permanent representative', 'Healthcare executive', 'representative',              'representative', 'handball executive', 'libertine', 'Maalimine',              'maître de conférences', 'school counselor', 'councillor',              'mental health counselor', 'fashion tailor',              'Licensed professional counselor', 'watercolor', 'Pro-vice-chancellor',              'vice-chancellor', 'chancellor', 'Chancellor of Poland',              'religious studies scholar', 'film scholar', 'Old Testament scholar',              'scholar', 'Russian studies scholar', 'independent scholar',              'gender studies scholar', 'canons regular', 'Judaic scholar',              'Public administration scholar', 'window dresser', 'Katasterlandmeser',              'agricultural adviser', 'vocal composer', 'financial adviser',              'Katasterlandmesser', 'Media Composer', 'Set dresser',              'video game composer', 'official', 'United Nations official',              'Technical Official', 'postal official', 'top official',              'eunuch official', 'YouTuber', 'beachcomber', 'robber',              'Bezirksamtsschreiber', 'board member', 'Hilfsgerichtsschreiber',              'Hilfsschreiber', 'member', 'Proviantschreiber', 'Greek prefect',              'data architect', 'bishop-elect', 'prefect', 'county architect',              'diocesan architect', 'Praetorian prefect', 'City Architect',              'horse-race critic', 'comics critic', 'dance critic', 'music critic',              'social critic', 'bullfighting critic', 'gnostic', 'media critic',              'architecture critic', 'parson', 'layperson', 'fictional businessperson',              'elected person', 'automobile salesperson', 'freemason', 'henchperson',              'dental person', 'businessperson', 'operador', 'Regidor', 'conquistador',              'goodwill ambassador', 'Colecionador', 'rejoneador', 'brand ambassador',              'Colecionador', 'ambassador', 'High Sheriff of Cornwall',              'director of football', 'queen consort', 'Spanish Queen consort',              'Princess consort', 'consort', 'diocesan bishop', 'cardinal-bishop',              'Prince-Bishop', 'Orthodox bishop', 'archbishop', 'bishop',              'military bishop', 'emeritus bishop', 'titular bishop', 'Rennes bishop',              'Bezirkshauptmann', 'Rauchwaren-Großhandels-Kaufmann',              'Tuatha Dé Danann', 'Bezirksamtmann', 'Amtmann', 'Amtshauptmann',              'Landeshauptmann', 'Ältermann', 'Lehnsmann', 'lensmann',              'promotional model', 'padel', 'Model', 'fashion model', 'Bedel', 'model',              'glamour model', 'fetish model', 'runway model', 'pseudo-model',              'Australian rules football umpire', 'fonctionnaire', 'squire',              'rowing umpire', 'Landrat', 'Forstrat', 'bureaucrat', 'Landrat',              'Aristocrat', 'magistrat', 'Geheimer Regierungsrat', 'aristocrat',              'Studienrat', 'eurocrat', 'military athlete', 'jinete',              'extreme sports athlete', 'duathlete', 'pentathlete', 'olympic athlete',              'strength athlete', 'decathlete', 'hammer thrower', 'vegetable grower',              'wire drawer', 'interviewer', 'brewer', 'glassblower', 'knife thrower',              'hewer', 'weight thrower', 'coffee grower', 'naturista', 'sertanista',              'Naturalista', 'Dialogista', 'Internacionalista', 'Kabalista',              'Occitanista', 'repentista', 'Publicista']

random_pos_occs = ['nurse anesthetist', 'body piercer',  'opera singer',  'muralist',  'theoretical physicist',  'Khatib',  'drug lord',  'gunner',  'aeronautical engineer',  'washerwoman',  'hunter',  'necromancer',  'clarinetist',  'ethologist',  'Polyglot',  'choreographer',  'evolutionary psychologist', 'historian', 'fireman',  'crossword compiler', 'Scoutmaster', 'chartered surveyor', 'bioethicist', 'Geographer',  'motivational speaker', 'Warlock', 'receptionist', 'plastic surgeon',  'footman', 'Sepoy',  'celebrity chef', 'Therapist', 'forensic pathologist', 'scullery maid', 'philosopher king', 'discus thrower', 'officer', 'cornetist', 'tracker', 'oceanographer', 'evolutionary biologist', 'tinsmith', 'Russian oligarch', 'ethnographer', 'telegraphist', 'copywriter', 'mercenary', 'bounty hunter', 'Orthodox priest', 'chorographer', 'foley artist', 'tapestry weaver', 'freemason', 'horologist', 'anarchist', 'Comic Book Artist',  'orthopedics', 'freediver', 'wholesaler', 'torturer', 'acoustician',  'bassist', 'ethnobotanist', 'monarch', 'comic book', 'stenographer',  'fashion', 'artist',  'publisher', 'prophet', 'apothecary',  'thoracic surgeon', 'clinical psychologist', 'coadjutor bishop', 'copyright holder', 'dishwasher',  'articled clerk', 'sheep shearer', 'metaphysician', 'chartered accountant', 'freelance writer', 'investor', 'nail technician', 'Public Protector', 'benefactor', 'modern pentathlon', 'carpet weaver', 'referee', 'crofter',  'paralympic athlete', 'artist',  'snowboarder', 'shortstop', 'beachcomber', 'scoutmaster', 'officer', 'Catholic priest', 'outdoor enthusiast', 'abolitionist', 'Juris Doctor', 'Assistant Secretary', 'witness', 'official', 'harpsichordist', 'announcer', 'Historian', 'lamplighter', 'Workman', 'orthodontist', 'coxswain', 'archdeacon', 'confectioner', 'planetary scientist', 'forensic anthropologist', 'catcher', 'teamster', 'charioteer', 'conspiracy theorist', 'ornithologist', 'preacher', 'coach', 'Assessor', 'cryptanalyst', 'archpriest', 'trainee', 'arms trafficker', 'taekwondo athlete', 'professor', 'postdoctoral researcher', 'phrenologist', 'calligrapher', 'bibliophile', 'champion', 'purchaser', 'Statesman', 'Fashionista', 'YouTuber', 'scout', 'UNICEF Goodwill Ambassador', 'geomorphologist', 'naturopathic practitioner', 'bullfighter', 'caseworker', 'witch', 'counterfeiter', 'Luthier', 'archer', 'wedding officiant', 'lobbyist', 'master craftsman', 'warrior', 'functionary', 'literary executor', 'documentary filmmaker', 'anatomical pathologist', 'athlete', 'educationalist', 'freight forwarder', 'therapist', 'Investment Advisor', 'sophist', 'Mormon missionary',  'outlaw', 'Latin Americanist', 'Coach', 'strongwoman', 'Sofer', 'decathlete', 'glassblower', 'hagiographer', 'sportswriter', 'official', 'Revolutionary', 'modern pentathlete', 'topographer', 'orchestrator', 'Landowner', 'freestyle motocross rider', 'bicycle motocross', 'Counselor', 'Notary', 'Journalist', 'Thinker', 'archer', 'Thief', 'notary', 'biographer', 'geophysicist',  'Certified Public Accountant', 'faith healer', 'electrochemist', 'househusband', 'anthologist', 'literary critic', 'researcher', 'count', 'Confessor', 'technologist',  'Conservationist', 'plowman', 'eater', 'pitcher', 'lackey', 'fighter', 'commander', 'Chief Constable', 'trombonist', 'viscount', 'Protestant reformer', 'informant', 'Sexton', 'commander', 'belly dancer', 'Motorcyclist', 'troubleshooter', 'Shaper', 'traditional healer', 'heiress', 'Woodcutter', 'calypsonian', 'whistleblower']
random_neg_occs = ['developmental psychology', 'creationism', 'Borracho', 'Knights Templar', 'human rights', 'Human Resource', 'Catholicos', 'extractive metallurgy', 'teaching', 'computer forensics', 'charcoal burner', 'commissary', 'orthodontics', 'acoustics', 'witchcraft', 'Marathi', 'heresy', 'mountaineering', 'wholesale', 'Historia', 'subversion', 'Johns Hopkins University', 'software', 'Tufts University', 'Peace Corps', 'rock climbing', 'locksmithing', 'Discalced Carmelites', 'Benedictines', 'Cinematography', 'anthropology', 'alto saxophone', 'handicraft', 'Polish', 'sexually transmitted infection', 'manufactory', 'CEO', 'industry', 'clavichord', 'boss', 'modern pentathlon', 'ordination', 'Surrealism', 'censorship', 'sexual intercourse', 'cutlery', 'Cryptography', 'knighthood', 'Chief Executive', 'computing', 'comedy', 'wildlife conservation', 'historic preservation', 'Christianity', 'Literary Criticism', 'scouting', 'fiction', 'internship', 'intermediary', 'watercolor', 'cooking', 'Diversity', 'oncology', 'iconography', 'historiography', 'holography', 'abdominal surgery', 'freemasonry', 'Claretians', 'psychotherapy', 'interior ministry', 'conducting', 'layperson', 'factor', 'organ donation', 'Companion', 'electrical engineering', 'aerial photography', 'anthroposophy', 'bibliography', 'phytotherapy', 'toy', 'Screen Actors Guild', 'forecasting', 'counseling', 'bass', 'translation', 'psychoanalysis', 'heart disease', 'cultural anthropology', 'philately', 'Pictor', 'phonetics', 'weaving', 'trainers', 'Dacoity', 'Affiliate', 'Mormon pioneers', 'metaphysics', 'aestheticism', 'treason', 'write', 'theoretical physics', 'coaching', 'ultramarathon', 'Carthusians', 'chemistry', 'Steadicam', 'Whitewash', 'snowshoe', 'anaesthesiology', 'forensic anthropology', 'ethics', 'accounting', 'scenography', 'supervisory board', 'ophthalmology', 'fiction', 'mechanical engineering', 'revolutionary syndicalism', 'chronograph', 'radiation therapy', 'conjoined twins', 'health care', 'hare coursing', 'heir presumptive', 'powerlifting', 'fountain pen', 'biathlon', 'armed robbery', 'newspaper', 'architecture', 'screenplay', 'Fon', 'Creative', 'Catechesis', 'infantry', 'Architects', 'typography', 'crack', 'Air Officer Commanding', 'lottery', 'steeplechase', 'Tattooed Lady']
                           
pos_occs = random_pos_occs + good_occs_suffix_groups
neg_occs = random_neg_occs + bad_occs_suffix_groups                            
print(f"Positive examples: {len(pos_occs)}")
print(f"Negative examples: {len(neg_occs)}")
negative_plural_labels = [ tmp + 's' for tmp in pos_occs if tmp.replace(' ', '_') + 's' in gnews_vocab]
print(f"Number of negative plural examples {len(negative_plural_labels)}")

Positive examples: 671
Negative examples: 404
Number of negative plural examples 276


In [94]:
additions =[]
print(f"Dataframe length before update {len(occs_df)}")
for occ in tqdm(pos_occs, total=len(pos_occs)):
    if occ in occs_df['occupation'].tolist():
        the_row = occs_df[occs_df['occupation'] == occ]
        the_index = the_row.index[0]
        occs_df.loc[the_index, 'label'] = 1
        occs_df.loc[the_index, 'labeled_by'] = 'human'
for occ in tqdm(neg_occs, total=len(neg_occs)):
    if occ in occs_df['occupation'].tolist():
        the_row = occs_df[occs_df['occupation'] == occ]
        the_index = the_row.index[0]
        occs_df.loc[the_index, 'label'] = 0
        occs_df.loc[the_index, 'labeled_by'] = 'human'    
for occ in tqdm(negative_plural_labels, total=len(negative_plural_labels)):
    if occ in occs_df['occupation'].tolist():
        the_row = occs_df[occs_df['occupation'] == occ]
        the_index = the_row.index[0]
        occs_df.loc[the_index, 'label'] = 0
        occs_df.loc[the_index, 'labeled_by'] = 'human'
        occs_df.loc[the_index, 'label_error_reason'] = 'plural'
    else:        
        ser = pd.Series({'source': 'human', 'label': 0, 
                'labeled_by': 'human', 'occupation': occ, 
                'in_google_news': 1,'label_error_reason': 'plural'})
        ser.name = max_idx
        max_idx +=1
        additions.append(ser)

print(f"Dataframe length before update {len(occs_df):,}")
for vals in tqdm(additions, total=len(additions)):
    occs_df.append(vals) 
occs_df.drop_duplicates(inplace=True)
print(f"Dataframe length after update {len(occs_df):,}")        

  4%|▍         | 29/671 [00:00<00:02, 286.40it/s]

Dataframe length before update 10974


100%|██████████| 671/671 [00:02<00:00, 325.82it/s]
100%|██████████| 404/404 [00:01<00:00, 320.52it/s]
100%|██████████| 276/276 [00:00<00:00, 1876.35it/s]
  5%|▌         | 14/276 [00:00<00:01, 132.83it/s]

Dataframe length before update 10,974


100%|██████████| 276/276 [00:02<00:00, 137.71it/s]

Dataframe length after update 10,974





In [95]:
occs_df['description'].fillna('', inplace = True) 
occs_df['label_error_reason'].fillna('', inplace = True) 
occs_df['labeled_by'].fillna('', inplace = True) 
occs_df['language_detected'].fillna('', inplace = True) 
occs_df['occupation'].fillna('', inplace = True) 
occs_df['source'].fillna('', inplace = True) 
occs_df['item_id'].fillna(0, inplace = True) 
occs_df['occupation_count'].fillna(0, inplace = True) 
occs_df = occs_df.astype({'description': 'str', 'item_id': 'int64', 'occupation_count': 'int64'})
occs_df.head()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
0,82955,606008,politician,"person involved in politics, person who holds ...",1,en,wikidata,-1,,
1,189290,33481,military officer,member of an armed force or uniformed service ...,0,en,wikidata,-1,,
2,131512,8887,farmer,person that works in agriculture,1,da,wikidata,-1,,
3,1734662,2012,cartographer,person preparing geographical maps,1,ia,wikidata,-1,,
4,294126,323,land surveyor,profession,0,tr,wikidata,-1,,


In [96]:
occs_df.tail()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
12749,10526680,1,hovkamrerare,,0,sl,wikidata,-1,,
12750,10526703,1,Hovrättspresident,,0,sv,wikidata,-1,,
12751,66486266,1,activista taurí,subclase de activista,0,oc,wikidata,-1,,
12752,360443,1,Escort,Wikimedia disambiguation page,1,oc,wikidata,-1,,
12753,87252988,1,弓道家,,0,Unknown,wikidata,-1,,


### In an earlier version of this notebook, we labeled 1,000 entries with GoogleNews Embeddings and trained a classifier to predict the rest. This worked pretty well, but we decided to label all the occupations that are in GoogleNews since that will help us sooner reach the goal of labeling all the Wikidata occupations.
# Now let's just label all of the GoogleNews Embeddings

In [97]:
all_occs = occs_df.query("in_google_news ==1 and label ==-1")['occupation'].tolist()
print(f"Number total: {len(all_occs):,}")
all_occs = list(set(all_occs))
print(f"Number distinct: {len(all_occs):,}")
all_occs.sort(key=lambda x: str(list(reversed(list(x)))))
with open('all_occs.to_label.txt', 'wt') as fout:
    for word in all_occs:
        fout.write('"{}",\n'.format(word))

Number total: 2,813
Number distinct: 2,813


In [98]:
all_good_gnews_occs =[ "DJ", "VJ", "ninja", "doula", "ballerina", "Poeta", "barista", "Khatib", "Medic", "combat medic", "medic", "psychic", "academic", "comic", "Cleric", "cleric", "heretic", "critic", "nomad", "retired", "unemployed", "milkmaid", "maid", "child", "husband", "brigand", "gourmand", "Vagabond", "Bard", "lifeguard", "bodyguard", "guard", "Steward", "steward", "wizard", "swineherd", "shepherd", "goatherd", "landlord", "warlord", "lord", "apprentice", "maintenance", "finance", "counterintelligence", "Prince", "crown prince", "prince", "duce", "guide", "alcalde", "Thuggee", "refugee", "referee", "alewife", "housewife", "wife", "sage", "Judge", "judge", "concierge", "Roadie", "Indie", "groupie", "gendarmerie", "duke", "constable", "Noble", "noble", "oracle", "beadle", "mole", "disciple", "apostle", "Gendarme", "gendarme", "concubine", "pope", "consigliere", "stevedore", "Chanteuse", "muse", "spouse", "advocate", "candidate", "poet laureate", "inmate", "Playboy Playmate", "magnate", "magistrate", "curate", "exegete", "Athlete", "biathlete", "triathlete", "heptathlete", "vigilante", "soubrette", "prostitute", "acolyte", "pedagogue", "demagogue", "rogue", "Cacique", "cacique", "reeve", "MasterChef", "Chef", "pastry chef", "chef", "Battalion Chief", "chief", "thief", "bailiff", "sheriff", "Pontiff", "king", "thug", "Shah", "Messiah", "grand ayatollah", "ayatollah", "mullah", "eunuch", "sheikh", "pharaoh", "caliph", "dervish", "homeopath", "osteopath", "goldsmith", "locksmith", "metalsmith", "coppersmith", "samurai", "rabbi", "Jihadi", "Fundi", "sensei", "Sufi", "yogi", "Swami", "Carabinieri", "Grand Mufti", "mufti", "paparazzi", "quarterback", "steeplejack", "lumberjack", "Shock jock", "monk", "cook", "Clerk", "clerk", "cannibal", "paralegal", "marshal", "official", "cardinal", "Criminal", "criminal", "principal", "Postmaster General", "Paymaster General", "Solicitor General", "General", "consul general", "brigadier general", "inspector general", "general", "vice admiral", "rear admiral", "admiral", "corporal", "Général", "vassal", "supermodel", "model", "Mohel", "lieutenant colonel", "colonel", "counsel", "pupil", "Reichsmarschall", "patent troll", "troll", "proconsul", "consul", "Raphaël", "Imam", "Hakim", "medium", "magician", "logician", "academician", "lab technician", "cryptologic technician", "pharmacy technician", "pyrotechnician", "technician", "clinician", "rhetorician", "patrician", "econometrician", "Musician", "jazz musician", "mathematician", "bioinformatician", "tactician", "Politician", "politician", "semiotician", "optician", "biostatistician", "statistician", "Beautician", "Comedian", "Wikipedian", "custodian", "theologian", "utopian", "Barbarian", "Grammarian", "grammarian", "veterinarian", "librarian", "humanitarian", "parliamentarian", "documentarian", "antiquarian", "architectural historian", "Equestrian", "Registered dietitian", "dietitian", "seaman", "shaman", "militiaman", "Foreman", "warehouseman", "strongman", "frogman", "coachman", "watchman", "stockman", "milkman", "signalman", "damage controlman", "yeoman", "radioman", "charwoman", "midshipman", "lumberman", "alderman", "cornerman", "airman", "motorman", "bail bondsman", "swordsman", "ombudsman", "Salesman", "marksman", "helmsman", "Hospital Corpsman", "Businessman", "boatman", "cutman", "Drayman", "highwayman", "railwayman", "handyman", "journeyman", "clergyman", "quarryman", "ferryman", "veteran", "courtesan", "partisan", "charlatan", "sultan", "sacristan", "Dewan", "churchwarden", "beauty queen", "queen", "Chapmen", "citizen", "ensign", "chamberlain", "chieftain", "Frigate Captain", "Captain", "captain", "boatswain", "coxswain", "Paladin", "ronin", "assassin", "rebbetzin", "muezzin", "deacon", "trauma surgeon", "orthopedic surgeon", "pediatric surgeon", "gynecological surgeon", "vascular surgeon", "transplant surgeon", "microsurgeon", "surgeon", "centurion", "robber baron", "baron", "vigneron", "matron", "patron", "Mason", "stonemason", "mason", "chairperson", "tradesperson", "spokesperson", "salesperson", "statesperson", "businessperson", "draftsperson", "intern", "Shogun", "nun", "hobo", "Abogado", "commando", "gaucho", "impresario", "gigolo", "sumo", "lyric soprano", "soprano", "jackaroo", "virtuoso", "castrato", "contralto", "chimney sweep", "tramp", "bellhop", "suffragan bishop", "auxiliary bishop", "bishop", "harp", "Vicar", "patriarchal vicar", "parochial vicar", "episcopal vicar", "vicar", "beggar", "Friar", "burglar", "biblical scholar", "Confucian scholar", "Sanskrit scholar", "scholar", "bursar", "political commissar", "commissar", "hussar", "transcriber", "suicide bomber", "barber", "skeleton racer", "Officer", "commanding officer", "officer", "Dancer", "flamenco dancer", "breakdancer", "Necromancer", "conveyancer", "fencer", "Influencer", "influencer", "bouncer", "announcer", "greengrocer", "grocer", "mob enforcer", "Producer", "producer", "bandleader", "cheerleader", "leader", "reader", "loader", "model railroader", "trader", "Crusader", "bobsledder", "horse breeder", "breeder", "paraglider", "dressage rider", "despatch rider", "enduro rider", "motocross rider", "rider", "midfielder", "outfielder", "Gilder", "gilder", "coachbuilder", "shipbuilder", "Bodybuilder", "shareholder", "Commander", "commander", "defender", "habitual offender", "sex offender", "moneylender", "lender", "pretender", "goaltender", "Bartender", "Bookbinder", "Founder", "carder", "wakeboarder", "skateboarder", "Snowboarder", "bodyboarder", "reindeer herder", "herder", "recorder", "balladeer", "buccaneer", "ski mountaineer", "mountaineer", "Imagineer", "Engineer", "electrical engineer", "acoustical engineer", "astronautical engineer", "structural engineer", "bioengineer", "mutineer", "pioneer", "auctioneer", "cannoneer", "seer", "privateer", "racketeer", "musketeer", "pamphleteer", "puppeteer", "war profiteer", "orienteer", "volunteer", "gaffer", "golfer", "gofer", "roofer", "windsurfer", "kitesurfer", "surfer", "Manager", "manager", "Dowager", "gravedigger", "Blogger", "park ranger", "bushranger", "vocal arranger", "arranger", "ranger", "bicycle messenger", "messenger", "Singer", "stringer", "winger", "Cheesemonger", "ironmonger", "cataloger", "astrologer", "Teacher", "Preacher", "teacher", "cattle rancher", "rancher", "Archer", "Researcher", "archer", "Watcher", "etcher", "butcher", "retoucher", "lexicographer", "videographer", "geographer", "choreographer", "lithographer", "autobiographer", "bibliographer", "crystallographer", "demographer", "iconographer", "typographer", "hydrographer", "cinematographer", "astrophotographer", "cryptographer", "cartographer", "philosopher", "haberdasher", "fisher", "schoolbook publisher", "publisher", "musher", "godfather", "Financier", "financier", "brigadier", "grenadier", "Soldier", "supersoldier", "bombardier", "luthier", "Bankier", "freestyle skier", "alpine skier", "skier", "Hospitalier", "Cavalier", "sommelier", "chansonnier", "croupier", "town crier", "carrier", "currier", "courier", "chocolatier", "rentier", "courtier", "glazier", "brazier", "Grand Vizier", "Vizier", "vizier", "speaker", "winemaker", "cheesemaker", "wigmaker", "kingmaker", "matchmaker", "watchmaker", "brickmaker", "clockmaker", "bookmaker", "dollmaker", "Filmmaker", "boilermaker", "papermaker", "glassmaker", "beatmaker", "cabinetmaker", "printmaker", "shirtmaker", "kayaker", "linebacker", "hacker", "drug trafficker", "rocker", "mountain biker", "motorbiker", "hitchhiker", "hiker", "stalker", "tightrope walker", "racewalker", "Banker", "merchant banker", "investment banker", "banker", "Thinker", "tinker", "inker", "debunker", "stockbroker", "pawnbroker", "shipbroker", "broker", "postal worker", "fieldworker", "woodworker", "farmworker", "ironworker", "worker", "Dealer", "Healer", "coin dealer", "dealer", "healer", "whaler", "stabler", "cobbler", "gambler", "chronicler", "saddler", "peddler", "fiddler", "baggage handler", "hurdler", "modeler", "yodeler", "juggler", "Smuggler", "smuggler", "wrangler", "angler", "retailer", "compiler", "tiler", "Caller", "netballer", "Footballer", "signaller", "reseller", "antiquarian bookseller", "fortune teller", "Storyteller", "storyteller", "teller", "serial killer", "distiller", "Controller", "comptroller", "baton twirler", "Hurler", "curler", "hurler", "wrestler", "settler", "butler", "sutler", "ruler", "bowler", "streamer", "gamer", "tamer", "embalmer", "spammer", "programmer", "synchronized swimmer", "swimmer", "Drummer", "drummer", "astronomer", "pig farmer", "oyster farmer", "dairy farmer", "farmer", "Performer", "circus performer", "perfumer", "cleaner", "kindergartener", "fashion designer", "designer", "trainer", "Entertainer", "entertainer", "refiner", "shoeshiner", "coal miner", "medical examiner", "miner", "joiner", "master mariner", "submariner", "mariner", "scanner", "tanner", "skinner", "spinner", "machine gunner", "gunner", "ultramarathon runner", "runner", "falconer", "pensioner", "Commissioner", "commissioner", "stationer", "confectioner", "Reiki practitioner", "insolvency practitioner", "executioner", "crooner", "coroner", "Poisoner", "prisoner", "wood turner", "turner", "partner", "piano tuner", "shipowner", "landscaper", "reaper", "lighthouse keeper", "beekeeper", "timekeeper", "gatekeeper", "goalkeeper", "innkeeper", "zookeeper", "shopkeeper", "groundskeeper", "keeper", "street sweeper", "bagpiper", "helper", "BASE jumper", "smokejumper", "Developer", "cooper", "Trooper", "paratrooper", "stormtrooper", "Rapper", "Trapper", "trapper", "rapper", "Shipper", "skipper", "flipper", "stripper", "sharecropper", "usurper", "pauper", "stretcher bearer", "pallbearer", "spearer", "seafarer", "mass murderer", "gatherer", "caterer", "charterer", "plasterer", "discoverer", "laborer", "arctic explorer", "polar explorer", "restorer", "procurer", "armourer", "treasurer", "usurer", "lecturer", "adventurer", "chaser", "fundraiser", "appraiser", "improviser", "Composer", "purser", "hairdresser", "debater", "inline skater", "figure skater", "roller skater", "slater", "tweeter", "crocheter", "cricketer", "marketer", "geometer", "spectrometer", "jazz trumpeter", "trumpeter", "interpreter", "drafter", "powerlifter", "weightlifter", "Fechter", "Dichter", "Richter", "firefighter", "bullfighter", "gunfighter", "fighter", "reciter", "signwriter", "recruiter", "copper smelter", "pole vaulter", "planter", "Carpenter", "presenter", "dissenter", "Painter", "fresco painter", "portrait painter", "sprinter", "printer", "vampire hunter", "treasure hunter", "werewolf hunter", "Nazi hunter", "demon hunter", "boxing promoter", "promoter", "adapter", "prompter", "Reporter", "investigative reporter", "reporter", "Importer", "Dungeon Master", "Master", "Broadcaster", "broadcaster", "podcaster", "caster", "Zen master", "webmaster", "bandmaster", "Grandmaster", "yardmaster", "schoolmaster", "ironmaster", "quartermaster", "harbourmaster", "concertmaster", "toastmaster", "postmaster", "brewmaster", "paymaster", "spymaster", "mobster", "jester", "forester", "harvester", "gangster", "songster", "prime minister", "interior minister", "minister", "chorister", "barrister", "trickster", "prankster", "pollster", "Tipster", "claims adjuster", "trendsetter", "typesetter", "fitter", "knitter", "babysitter", "Cotter", "Globetrotter", "Cutter", "blues shouter", "presbyter", "rescuer", "caver", "basket weaver", "weaver", "engraver", "believer", "scuba diver", "springboard diver", "cave diver", "skydiver", "caregiver", "driver", "mover", "wood carver", "stone carver", "carver", "reviewer", "winegrower", "javelin thrower", "rower", "Indexer", "remixer", "Boxer", "kickboxer", "boxer", "Slayer", "player", "assayer", "dyer", "buyer", "Lawyer", "lawyer", "sawyer", "organizer", "Womanizer", "colonizer", "quizzer", "Pir", "Fakir", "Amir", "emir", "choir", "picador", "matador", "vendor", "Author", "cookbook author", "coauthor", "prior", "Bokor", "sailor", "chancellor", "Privy Councillor", "countertenor", "tenor", "Monsignor", "Governor", "lieutenant governor", "governor", "Mughal emperor", "emperor", "Conqueror", "conqueror", "juror", "advisor", "Censor", "censor", "confessor", "Clinical Professor", "Regius Professor", "associate professor", "adjunct professor", "assistant professor", "professor", "Assessor", "assessor", "adjudicator", "communicator", "educator", "Liquidator", "creator", "principal investigator", "investigator", "navigator", "interrogator", "gladiator", "mediator", "negotiator", "Annihilator", "legislator", "Translator", "translator", "Manipulator", "manipulator", "postulator", "supervising animator", "animator", "senator", "defensive coordinator", "offensive coordinator", "coordinator", "illuminator", "Elvis impersonator", "impersonator", "moderator", "Operator", "switchboard operator", "lathe operator", "drugstore operator", "operator", "conspirator", "decorator", "orator", "Narrator", "narrator", "Administrator", "apostolic administrator", "administrator", "Illustrator", "botanical illustrator", "Procurator", "dictator", "agitator", "facilitator", "Evaluator", "Innovator", "innovator", "textile conservator", "conservator", "Actor", "kabuki actor", "Contractor", "contractor", "actor", "Lector", "Rector", "collector", "postal inspector", "inspector", "prospector", "Director", "athletic director", "managing director", "director", "Corrector", "corrector", "rector", "prosector", "Doctor", "witch doctor", "Proctor", "conductor", "instructor", "Constructor", "constructor", "proprietor", "solicitor", "contributing editor", "Expeditor", "editor", "auditor", "janitor", "Grand Inquisitor", "inquisitor", "apostolic visitor", "Compositor", "consultor", "cantor", "mentor", "Inventor", "inventor", "preceptor", "Sculptor", "pastor", "angel investor", "impostor", "Autor", "Tutor", "contributor", "distributor", "executor", "tutor", "mayor", "conveyor", "Surveyor", "quantity surveyor", "Monsieur", "Entrepreneur", "serial entrepreneur", "entrepreneur", "seigneur", "Gouverneur", "masseur", "connoisseur", "administrateur", "restaurateur", "Special Rapporteur", "rapporteur", "auteur", "augur", "troubadour", "martyr", "abbess", "princess", "Burgess", "duchess", "chess", "deaconess", "prioress", "Actress", "actress", "mistress", "priestess", "Hostess", "marquess", "underboss", "Professor Emeritus", "professor emeritus", "acrobat", "Avocat", "Soldat", "Advokat", "Diplomat", "diplomat", "prefect", "Architect", "landscape architect", "convict", "Cadet", "shochet", "pickpocket", "valet", "baronet", "dub poet", "haiku poet", "knight", "wainwright", "pundit", "hermit", "Jesuit", "pedant", "Commandant", "commandant", "defendant", "intendant", "flight attendant", "attendant", "sergeant", "Mahant", "merchant", "Lieutenant", "lieutenant", "vagrant", "immigrant", "tyrant", "combatant", "inhabitant", "militant", "Consultant", "consultant", "Certified Accountant", "forensic accountant", "accountant", "beauty pageant contestant", "contestant", "assistant", "adjutant", "civil servant", "clairvoyant", "docent", "vice president", "president", "dissident", "Superintendent", "superintendent", "Correspondent", "correspondent", "respondent", "Student", "student", "Secret Service agent", "sleeper agent", "newsagent", "agent", "Regent", "regent", "Sergent", "patient", "management", "procurement", "parent", "delinquent", "patron saint", "saint", "Stunt", "abbot", "mascot", "Patriot", "polyglot", "Pilot", "paraglider pilot", "glider pilot", "fighter pilot", "autopilot", "despot", "conscript", "ballistics expert", "expert", "Escort", "prince consort", "royal consort", "Gymnast", "rhythmic gymnast", "gymnast", "parish priest", "Anglican priest", "diocesan priest", "priest", "pharmacist", "bioethicist", "ethicist", "Islamicist", "aerodynamicist", "eugenicist", "harmonicist", "particle physicist", "quantum physicist", "biophysicist", "astrophysicist", "physicist", "molecular geneticist", "geneticist", "roboticist", "exorcist", "jihadist", "propagandist", "parodist", "keyboardist", "sound recordist", "canoeist", "pacifist", "collagist", "suffragist", "strategist", "Druggist", "druggist", "genealogist", "mammalogist", "mineralogist", "pharmacologist", "marine ecologist", "freshwater ecologist", "paleoecologist", "ecologist", "ethnomusicologist", "musicologist", "toxicologist", "Oncologist", "radiation oncologist", "mycologist", "Indologist", "Geologist", "amateur archaeologist", "geoarchaeologist", "archaeologist", "petroleum geologist", "planetary geologist", "hydrogeologist", "geologist", "osteologist", "ufologist", "Psychologist", "parapsychologist", "neuropsychologist", "psephologist", "neuropathologist", "pathologist", "mythologist", "Wildlife biologist", "marine biologist", "computational biologist", "developmental biologist", "conservation biologist", "molecular biologist", "geobiologist", "paleobiologist", "microbiologist", "astrobiologist", "neurobiologist", "biologist", "glaciologist", "sociologist", "epidemiologist", "bacteriologist", "kinesiologist", "exercise physiologist", "plant physiologist", "electrophysiologist", "neurophysiologist", "philologist", "forensic entomologist", "entomologist", "seismologist", "cosmologist", "etymologist", "volcanologist", "oenologist", "arachnologist", "Technologist", "surgical technologist", "biotechnologist", "nanotechnologist", "ethnologist", "criminologist", "sinologist", "limnologist", "demonologist", "immunologist", "cryptozoologist", "zoologist", "apologist", "cultural anthropologist", "paleoanthropologist", "anthropologist", "andrologist", "hydrologist", "serologist", "gastroenterologist", "virologist", "meteorologist", "metrologist", "hematologist", "paleoclimatologist", "climatologist", "primatologist", "traumatologist", "rheumatologist", "perinatologist", "hepatologist", "proctologist", "diabetologist", "cosmetologist", "herpetologist", "parasitologist", "sedimentologist", "paleontologist", "gerontologist", "otologist", "epileptologist", "cryptologist", "histologist", "sexologist", "ichthyologist", "embryologist", "allergist", "Metallurgist", "metallurgist", "liturgist", "catechist", "Buddhist", "Kabbalist", "herbalist", "cruciverbalist", "syndicalist", "vocalist", "socialist", "colonialist", "aerialist", "industrialist", "existentialist", "Constitutionalist", "freelance journalist", "pastoralist", "naturalist", "venture capitalist", "Mentalist", "orientalist", "experimentalist", "environmentalist", "instrumentalist", "mentalist", "spiritualist", "medievalist", "turntablist", "motorcyclist", "cyclist", "Evangelist", "televangelist", "evangelist", "Panelist", "panelist", "philatelist", "pugilist", "cellist", "violist", "fabulist", "wardrobe stylist", "ceramist", "inorganic chemist", "analytical chemist", "computational chemist", "alchemist", "biogeochemist", "geochemist", "biochemist", "chemist", "economist", "ergonomist", "agronomist", "taxonomist", "anatomist", "organist", "Pianist", "classical pianist", "jazz pianist", "fortepianist", "pianist", "Humanist", "humanist", "timpanist", "accompanist", "paleobotanist", "botanist", "dental hygienist", "hygienist", "lutenist", "machinist", "mandolinist", "classical violinist", "jazz violinist", "violinist", "trampolinist", "feminist", "gossip columnist", "columnist", "jazz trombonist", "antagonist", "jazz vibraphonist", "vibraphonist", "xylophonist", "jazz saxophonist", "saxophonist", "Zionist", "accordionist", "trade unionist", "percussionist", "preservationist", "conservationist", "abstractionist", "projectionist", "nutritionist", "contortionist", "zionist", "balloonist", "bassoonist", "editorial cartoonist", "cartoonist", "arsonist", "Internist", "hornist", "communist", "oboist", "banjoist", "Soloist", "soloist", "serial rapist", "aromatherapist", "physiotherapist", "hypnotherapist", "rapist", "microscopist", "philanthropist", "harpist", "Typist", "typist", "diarist", "sitarist", "Guitarist", "guitarist", "gallerist", "lepidopterist", "memoirist", "satirist", "literary theorist", "theorist", "aphorist", "florist", "folklorist", "humorist", "terrorist", "motorist", "forensic psychiatrist", "neuropsychiatrist", "physiatrist", "jurist", "dog behaviourist", "watercolourist", "tourist", "caricaturist", "miniaturist", "acupuncturist", "agriculturist", "floriculturist", "horticulturist", "futurist", "Bassist", "pragmatist", "numismatist", "jazz clarinetist", "portraitist", "occultist", "dentist", "atmospheric scientist", "forensic scientist", "neuroscientist", "scientist", "hypnotist", "Artist", "storyboard artist", "mime artist", "trapeze artist", "mixed martial artist", "con artist", "artist", "librettist", "flautist", "linguist", "Ventriloquist", "ventriloquist", "archivist", "activist", "reservist", "copyist", "ghost", "host", "provost", "Fürst", "Analyst", "psychoanalyst", "analyst", "aquanaut", "astronaut", "marabout", "scout", "mahout", "Sadhu", "Tulku", "Baru", "guru", "widow", "Honorary Fellow", "fellow", "dominatrix", "Dux", "deejay", "runaway", "castaway", "lady", "caddy", "dandy", "abbey", "Derebey", "bey", "nanny", "Playboy Bunny", "cowboy", "playboy", "viceroy", "envoy", "Spy", "counterspy", "spy", "lapidary", "burglary", "Roman legionary", "legionary", "missionary", "revolutionary", "Commissary", "Permanent Secretary", "undersecretary", "secretary", "dignitary", "actuary", "deputy", "émigré", "Curé"]
all_bad_gnews_occs =[ "Warner Bros.", "JAMA", "CBRE", "WRAF", "Real Madrid CF", "BMF", "FI", "World War II", "JK", "GLAM", "STUN", "KVN", "VRP", "Q", "SOCAR", "UWS", "TNT", "LOX", "Oba", "tuba", "ProPublica", "Poetica", "Exotica", "Verkhovna Rada", "Tamada", "Peda", "propaganda", "Rynda", "Krav Maga", "Nanga", "tohunga", "Peshmerga", "Buddha", "Sangha", "Halakha", "Gurkha", "pasha", "Pythia", "pedophilia", "Romania", "pyromania", "Nigeria", "Phantasmagoria", "intarsia", "militia", "Raja", "Maharaja", "Monja", "Moja", "mangaka", "Cheka", "karateka", "balalaika", "judoka", "tabla", "Hola", "viola", "kumu hula", "Mawla", "Ulama", "lama", "drama", "RealD Cinema", "Lima", "Eskrima", "plasma", "Fantasma", "State Duma", "Una", "Bana", "Pertamina", "tsarina", "prima donna", "Euskadi Ta Askatasuna", "Garda Síochána", "Club Deportivo Guadalajara", "Tirthankara", "Capoeira", "Rangatira", "Mandora", "diaspora", "Capra", "Perra", "Maestra", "extra", "Procura", "Segura", "Igoumenitsa", "Lokayukta", "Canta", "Roman Rota", "Patriota", "Basque pelota", "Organista", "Cronista", "operetta", "prima ballerina assoluta", "Petah Tikva", "Curva", "FC Metalurh Zaporizhya", "Kshatriya", "acharya", "Mirza", "yakuza", "Airbnb", "job", "Qutb", "club", "laic", "logic", "ceramic", "Mimic", "Westinghouse Electric", "classical music", "Dramatic", "ascetic", "antibiotic", "plastic", "lip sync", "Road", "licensed", "blessed", "accused", "humanitarian aid", "Maggid", "Mujahid", "Murshid", "Archdruid", "druid", "Vivid", "herald", "Grunewald", "guild", "Berthold", "Vorstand", "Legend", "hedge fund", "God", "Bollywood", "harpsichord", "crossword", "fraud", "oud", "Rebbe", "LibreOffice", "doctor's office", "police", "voice", "justice", "United Parcel Service", "dance", "surveillance", "performance", "Unifrance", "reconnaissance", "Clairvoyance", "clairvoyance", "jurisprudence", "conscience", "cognitive neuroscience", "neuroscience", "Commerce", "brigade", "trade", "regicide", "homicide", "ecocide", "Freeride", "Rende", "demimonde", "Seabee", "Official Assignee", "scree", "master's degree", "trustee", "employee", "Luftwaffe", "image", "espionage", "patronage", "page", "massage", "dressage", "hostage", "wage", "masonic lodge", "Ushaw College", "Singe", "doge", "luge", "Rinpoche", "scrapie", "BMX bike", "karaoke", "female", "male", "Generale", "fable", "Venerable", "chronicle", "bicycle", "Cajun fiddle", "Hardanger fiddle", "fiddle", "Nobile", "cinephile", "Japanophile", "versatile", "textile", "vaudeville", "dairy cattle", "mule", "dame", "Anime", "crime", "nativity scene", "Ukraine", "psychosomatic medicine", "veterinary medicine", "airline", "tambourine", "Sorbonne", "trombone", "vibraphone", "baritone", "commune", "landscape", "animal welfare", "guerrilla warfare", "electronic warfare", "cyberwarfare", "lacquerware", "cadre", "billionaire", "millionaire", "Militaire", "Lahore", "folklore", "grocery store", "drugstore", "bookstore", "mythical creature", "miniature", "comparative literature", "children's literature", "literature", "lecture", "permaculture", "aquaculture", "sericulture", "agriculture", "arboriculture", "viticulture", "horticulture", "silviculture", "adventure", "bronze sculpture", "database", "disease", "striptease", "Japanese", "nose", "prose", "nurse", "lacrosse", "recluse", "delegate", "associate", "inebriate", "oblate", "prelate", "apostolate", "primate", "mate", "karate", "pirate", "real estate", "Private", "erudite", "socialite", "antisemite", "Fante", "Amante", "dilettante", "debutante", "Pilote", "Naturaliste", "suffragette", "vignette", "Cute", "flute", "Karolinska Institute", "proselyte", "barbecue", "rescue", "Rogue", "dialogue", "monologue", "École Polytechnique", "critique", "antique", "Basque", "odalisque", "statue", "galley slave", "Margrave", "cooperative", "narrative", "representative", "detective", "fugitive", "Dijkgraaf", "graf", "Rolf", "roof", "Filozof", "pig", "Leipzig", "Undang", "VJing", "dubbing", "climbing", "beachcombing", "horse racing", "fencing", "proofreading", "reading", "paragliding", "gliding", "Freeriding", "welding", "bodybuilding", "molding", "General Officer Commanding", "reindeer herding", "dyeing", "windsurfing", "blogging", "logging", "singing", "birdwatching", "fishing", "publishing", "Nordic skiing", "alpine skiing", "waterskiing", "skiing", "winemaking", "shoemaking", "steelmaking", "filmmaking", "hatmaking", "printmaking", "backpacking", "human trafficking", "mountain biking", "hiking", "tightrope walking", "racewalking", "walking", "drinking", "woodworking", "whaling", "cycling", "grief counseling", "juggling", "smuggling", "angling", "sailing", "bookselling", "grappling", "hurling", "wrestling", "whistling", "Hustling", "bowling", "Lifestreaming", "programming", "swimming", "dairy farming", "gardening", "coal mining", "demining", "mining", "Freerunning", "woodturning", "bookkeeping", "BASE jumping", "bungee jumping", "ski jumping", "rapping", "sheep shearing", "bouldering", "ski mountaineering", "biomedical engineering", "geotechnical engineering", "bioengineering", "orienteering", "volunteering", "catering", "mastering", "flooring", "monitoring", "manufacturing", "fundraising", "merchandising", "advertising", "Nursing", "ice skating", "figure skating", "roller skating", "powerboating", "scapegoating", "collecting", "telemarketing", "marketing", "weightlifting", "editing", "vomiting", "screenwriting", "underwriting", "Copywriting", "quilting", "consulting", "eventing", "letterpress printing", "sprinting", "printing", "hunting", "Accounting", "podcasting", "casting", "knitting", "babysitting", "engraving", "surf lifesaving", "lifesaving", "scuba diving", "driving", "woodcarving", "drawing", "brewing", "sewing", "glassblowing", "sword swallowing", "kickboxing", "beatboxing", "boxing", "rallying", "qigong", "mahjong", "song", "sled dog", "blog", "vlog", "Georg", "Surg", "Metallurg", "Kabbalah", "dawah", "Muhammadiyah", "Teach", "Rich", "ranch", "research", "Oligarch", "matriarch", "patriarch", "sch", "Dutch", "Faqih", "faqih", "Karabakh", "Gursikh", "epigraph", "cinematograph", "photograph", "polygraph", "nymph", "fiqh", "squash", "Yiddish", "English", "Polymath", "cloth", "Da'i", "Kai", "Gabbai", "Jai alai", "qadi", "Jedi", "magi", "shogi", "Maddahi", "Shishi", "rikishi", "Munshi", "bushi", "Granthi", "Kaji", "Fuji", "Wali", "Poli", "Omi", "Shinigami", "origami", "Rouhani", "Yogini", "Karbhari", "Kumari", "Nasi", "Stasi", "graffiti", "Genti", "Mawlawi", "Nazi", "Basij", "kayak", "blackjack", "Soundtrack", "Cossack", "sidekick", "Patrick", "Posek", "Naik", "tzadik", "Grafik", "Ashik", "Bukovnik", "Politik", "silk", "bank", "crank", "pink", "TikTok", "Look", "housework", "work", "kiosk", "FC Shakhtar Donetsk", "Biomechanical", "musical", "Political", "vocal", "Procurator fiscal", "Vandal", "cereal", "Ubisoft Montreal", "commercial", "Holocaust denial", "Industrial", "Marginal", "professional", "Constitutional", "Lokpal", "bomb disposal", "futsal", "intellectual", "spiritual", "Washington Mutual", "carnival", "Kowal", "Kotwal", "Chogyal", "ghazal", "chancel", "Schutzstaffel", "hotel", "duel", "travel", "retail", "Alguacil", "stencil", "soil", "guardia civil", "basketball", "beach volleyball", "volleyball", "ball", "Royal Dutch Shell", "well", "Idol", "pest control", "Baul", "honorary consul", "madam", "Khatam", "Guillem", "Akim", "pilgrim", "victim", "serfdom", "Diplom", "groom", "bookworm", "asceticism", "Jihadism", "pacifism", "Sufism", "anarchism", "sophism", "syndicalism", "Tamil nationalism", "photojournalism", "journalism", "environmentalism", "mentalism", "spiritualism", "televangelism", "evangelism", "urbanism", "paganism", "sectarianism", "equestrianism", "cosmopolitanism", "feminism", "Zionism", "Interventionism", "anarchist communism", "communism", "vulgarism", "pauperism", "terrorism", "tourism", "separatism", "antisemitism", "spiritism", "patriotism", "transvestism", "activism", "museum", "begum", "Aconitum", "pseudonym", "Jan", "Kan", "Chaban", "Taliban", "ban", "publican", "Franciscan", "dean", "fan", "pipe organ", "4chan", "Shihan", "khan",  "Rosicrucian", "supercentenarian", "centenarian", "seminarian", "vegetarian", "pedestrian", "Russian", "Christian", "Bhajan", "Galan", "Catalan", "Ataman", "Leadman", "Godman", "Iceman",  "human", "Tehran", "Guan", "tai chi chuan", "Padawan", "bayan", "dayan", "Gagik Tsarukyan", "Noyan", "fedayeen", "Ingen", "Kohen", "Fremen", "Minutemen", "Tuskegee Airmen", "seinen", "Glazen", "sovereign", "design", "porcelain", "villain", "Kaptein", "Virgin", "virgin", "violin", "Brahmin", "Mandarin", "Schwerin", "mannequin", "kelvin", "hymn", "Mann", "Met Éireann", "Schulmann", "Hauptmann", "Senn", "Celedon", "accordion", "religion", "battalion", "stallion", "rebellion", "Bullion", "Companion", "Psion", "pension", "profession", "possession", "Percussion", "verification", "publication", "telecommunication", "communication", "education", "mediation", "appropriation", "civil aviation", "installation", "speculation", "animation", "explanation", "illumination", "divination", "usurpation", "emigration", "pankration", "exploration", "Orchestration", "orchestration", "administration", "illustration", "duration", "interpretation", "meditation", "rehabilitation", "imitation", "visitation", "consultation", "kidney transplantation", "transplantation", "sexual orientation", "deportation", "computation", "excavation", "innovation", "conservation", "relaxation", "novelization", "World Meteorological Organization", "organization", "characterization", "defection", "conscientious objection", "protection", "fiction", "Reproduction", "construction", "erudition", "ammunition", "Inquisition", "position", "extrasensory perception", "corruption", "Insertion", "extortion", "distribution", "persecution", "revolution", "prostitution", "Yukon", "duathlon", "carillon", "demon", "backgammon", "sermon", "canon", "dragoon", "maroon", "cartoon", "gridiron", "skeleton", "Comintern", "Cresta Run", "Dukun", "Khatun", "Cape Town", "Ladrón", "Jeet Kune Do", "Sambo", "stucco", "Medico", "Comico", "musico", "Politico", "Francisco Franco", "flamenco", "fado", "Delegado", "aikido", "Lamido", "Sonderkommando", "taekwondo", "El Mundo", "judo", "video", "Valeo", "Vago", "Hidalgo", "DIYbio", "Nuncio", "nuncio", "radio", "empresario", "Modelo", "Angelo", "cello", "caudillo", "Hermosillo", "Alamo", "Generalissimo", "King Momo", "Majordomo", "Rayo Vallecano", "Indiano", "fortepiano", "piano", "casino", "Patrono", "Kapo", "Gestapo", "Compo", "Kenpo", "Saladero", "curandero", "folk hero", "Cachero", "superhero", "hero", "Ingeniero", "Guerrillero", "Novillero", "santero", "Wipro", "charro", "Cuatro", "maestro", "Falso", "Mafioso", "Soldato", "Literato", "manifesto", "libretto", "daimyo", "satrap", "trap", "stewardship", "Internship", "sole proprietorship", "mentorship", "entrepreneurship", "comic strip", "shrimp", "pastry shop", "prosthetic makeup", "Warraq", "Lascar", "car", "Subedar", "Faujdar", "Zamindar", "Khudai Khidmatgar", "liar", "registrar", "Caesar", "tsar", "Muhtar", "acoustic guitar", "bass guitar", "guitar", "Shayar", "Kobzar",   "elder",  "murder",  "Schiffer", "fifer", "schlager", "nigger", "Geiger", "passenger", "revenger", "fado singer", "Schwinger",   "father", "mother", "Brother", "brother", "pigeon fancier",  "Polier",  "Pajer", "Maker", "Speaker", "keynote speaker", "loudspeaker", "streaker", "pacemaker",  "cracker", "Spiker", "Trekker", "wanker", "smoker",   "Bachiller", "rotary tiller",  "Comer", "Boomer", "polymer", "Birkebeiner",  "commoner", "Zehntner", "Owner",  "wiper",   "MV Explorer",    "Pooter", "deserter", "Jedi Master", "reiki master",  "master", "Meister", "Hofmeister", "Bergmeister", "Werkmeister", "Jägermeister", "sister", "monster", "Dempster", "maltster", "filibuster", "matter", "Twitter", "cadaver", "drover", "drawer", "employer", "Winzer", "Schweitzer", "Bundeswehr", "chair", "heir", "memoir", "tafsir", "Pensador", "Bangor", "major", "color", "humor", "donor",  "sexual predator",      "visitor",  "Holocaust survivor", "survivor", "Freiherr", "Ambassadeur", "carillonneur", "lur", "parkour", "Once Caldas", "Periodistas", "aerobics", "graphics", "fluid dynamics", "hydrodynamics", "proteomics", "genomics", "economics", "ergonomics", "mechanics", "geotechnics", "tectonics", "pediatrics", "lyrics", "Musics", "particle physics", "geophysics", "biophysics", "astrophysics", "physics", "acrobatics", "aerobatics", "mathematics", "Christian apologetics", "athletics", "homiletics", "cybernetics", "politics", "robotics", "optics", "rhythmic gymnastics", "acrobatic gymnastics", "artistic gymnastics", "criminalistics", "computational linguistics", "sociolinguistics", "linguistics", "aeronautics", "Human Resources", "humanities", "royalties", "sales", "comes", "marines", "Nadadores", "Pilates", "Carmelites", "Mennonites", "Bandeirantes", "Souliotes", "Goldman Sachs", "Dis", "tennis", "exegesis", "amanuensis", "arteriosclerosis", "cryptanalysis", "St. Louis", "maquis", "Bolsheviks", "bowls", "customs", "Musicians", "Cistercians", "United Nations", "Amazons", "Strategos", "Custos", "Coastwatchers", "bikers", "Freedom Fighters", "Professors", "Voyageurs", "Bass", "Badass", "goddess", "agribusiness", "business", "homelessness", "Freestyle Motocross", "visual effects", "women's rights", "decorative arts", "performing arts", "mixed martial arts", "martial arts", "visual arts", "esports", "Hellenists", "Passionists", "Decembrists", "Scientists", "Artists", "Santa Claus", "circus", "genius", "Religious", "Barclays", "Indian Railways", "El País", "Jesús", "Bhagat", "proletariat", "Heimat", "autodidact", "act", "suspect", "product", "Blomstedt", "racket", "clarinet", "Internet", "cornet", "trumpet", "puppet", "pet", "handicraft", "craft", "theft", "Vogt", "CrossFit", "portrait", "exhibit", "Pandit", "drum kit", "occult",  "restaurant", "peasant",   "Tashkent", "talent", "law enforcement", "arrangement", "Atonement", "retirement", "temperance movement", "accompaniment", "government", "treatment", "recruitment", "document", "percussion instrument", "unemployment", "godparent", "Heir apparent", "Referent", "Peot", "plot", "manuscript", "concert",   "Methodist",         "federalist",         "Assist", "pietist",  "marxist", "Debut", "Arnau", "Urdu", "wushu", "haiku", "sudoku", "sanshou", "Shiatsu", "jujutsu", "Sayadaw", "law", "extramarital sex", "remix", "cosplay", "essay", "baby", "rugby", "advocacy", "diplomacy", "pharmacy", "participatory democracy", "aristocracy", "literacy", "piracy", "privacy", "residency", "regency", "insurgency", "bankruptcy", "Daddy", "parody", "inline hockey", "roller hockey", "disc jockey", "jockey", "Silicon Valley", "foley", "Virrey", "Nagy", "pedagogy", "demagogy", "genealogy", "mammalogy", "pharmacology", "gynecology", "ecology", "ethnomusicology", "musicology", "ecotoxicology", "mycology", "planetary geology", "hydrogeology", "geology", "theology", "phraseology", "museology", "otolaryngology", "parapsychology", "neuropsychology", "psychology", "graphology", "geomorphology", "morphology", "pathology", "marine biology", "radiobiology", "microbiology", "photobiology", "sociology", "radiology", "epidemiology", "bacteriology", "kinesiology", "missiology", "psychophysiology", "pathophysiology", "electrophysiology", "clinical neurophysiology", "philology", "gemology", "epistemology", "entomology", "seismology", "etymology", "volcanology", "oenology", "phrenology", "biotechnology", "nanotechnology", "technology", "criminology", "terminology", "phonology", "pulmonology", "immunology", "cryptozoology", "zoology", "escapology", "hydrology", "serology", "gastroenterology", "nephrology", "virology", "meteorology", "horology", "urology", "hematology", "dermatology", "rheumatology", "cosmetology", "herpetology", "parasitology", "paleontology", "gerontology", "histology", "ichthyology", "eulogy", "clergy", "metallurgy", "dramaturgy", "liturgy", "calligraphy", "lexicography", "videography", "biogeography", "geography", "lithography", "oceanography", "pornography", "photography", "cartography", "Electromyography", "homeopathy", "osteopathy", "psychopathy", "naturopathy", "sulky", "elderly", "enemy", "mummy", "economy", "agronomy", "gastronomy", "taxonomy", "anatomy", "army", "Germany", "company", "paleobotany", "botany", "felony", "Gestalt therapy", "physiotherapy", "philanthropy", "ordinary", "janissary", "Military", "paramilitary", "heraldry", "animal husbandry", "robbery", "midwifery", "cardiac surgery", "thoracic surgery", "orthopedic surgery", "neurosurgery", "adultery", "brewery", "cavalry", "masonry", "equerry", "psychiatry", "rocketry", "optometry", "lyric poetry", "slam poetry", "poetry", "puppetry", "landed gentry", "gentry", "carpentry", "analytical chemistry", "medicinal chemistry", "computational chemistry", "geochemistry", "biochemistry", "palmistry", "dentistry", "jury", "usury", "leprosy", "busy", "Laity", "electricity", "deity", "criminality", "personality", "municipality", "nobility", "insanity", "charity", "celebrity", "authority", "Osmania University", "Eötvös Loránd University", "Åbo Akademi University", "Carnegie Mellon University", "private equity", "royalty", "faculty", "Democratic Party", "Institutional Revolutionary Party", "active duty", "navy", "proxy", "Juez", "Hafiz", "kolkhoz", "jazz", "café", "Attaché", "naval attaché"]

In [99]:
print(f"Dataframe length before update {len(occs_df):,}")
for occ in tqdm(all_good_gnews_occs, total=len(all_good_gnews_occs)):
    if occ in occs_df['occupation'].tolist():
        the_row = occs_df[occs_df['occupation'] == occ ]
        the_index = the_row.index[0]
        occs_df.loc[the_index, 'label'] = 1
        occs_df.loc[the_index, 'labeled_by'] = 'human'
for occ in tqdm(all_bad_gnews_occs, total=len(all_bad_gnews_occs)):
    if occ in occs_df['occupation'].tolist():
        the_row = occs_df[occs_df['occupation'] == occ ]
        the_index = the_row.index[0]
        occs_df.loc[the_index, 'label'] = 0
        occs_df.loc[the_index, 'labeled_by'] = 'human'    
print(f"Dataframe length after update {len(occs_df):,}")        

  2%|▏         | 30/1547 [00:00<00:05, 294.26it/s]

Dataframe length before update 10,974


100%|██████████| 1547/1547 [00:04<00:00, 333.07it/s]
100%|██████████| 1343/1343 [00:04<00:00, 326.72it/s]

Dataframe length after update 10,974





In [100]:
occs_df.tail()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
12749,10526680,1,hovkamrerare,,0,sl,wikidata,-1,,
12750,10526703,1,Hovrättspresident,,0,sv,wikidata,-1,,
12751,66486266,1,activista taurí,subclase de activista,0,oc,wikidata,-1,,
12752,360443,1,Escort,Wikimedia disambiguation page,1,oc,wikidata,1,human,
12753,87252988,1,弓道家,,0,Unknown,wikidata,-1,,


In [101]:
occs_df.head()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
0,82955,606008,politician,"person involved in politics, person who holds ...",1,en,wikidata,1,human,
1,189290,33481,military officer,member of an armed force or uniformed service ...,0,en,wikidata,-1,,
2,131512,8887,farmer,person that works in agriculture,1,da,wikidata,1,human,
3,1734662,2012,cartographer,person preparing geographical maps,1,ia,wikidata,1,human,
4,294126,323,land surveyor,profession,0,tr,wikidata,-1,,


In [102]:
occs_df['description'].fillna('', inplace = True) 
occs_df['label_error_reason'].fillna('', inplace = True) 
occs_df['labeled_by'].fillna('', inplace = True) 
occs_df['language_detected'].fillna('', inplace = True) 
occs_df['occupation'].fillna('', inplace = True) 
occs_df['source'].fillna('', inplace = True) 
occs_df['item_id'].fillna(0, inplace = True) 
occs_df['occupation_count'].fillna(0, inplace = True) 
occs_df = occs_df.astype({'description': 'str', 'item_id': 'int64', 'occupation_count': 'int64'})
occs_df.head()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
0,82955,606008,politician,"person involved in politics, person who holds ...",1,en,wikidata,1,human,
1,189290,33481,military officer,member of an armed force or uniformed service ...,0,en,wikidata,-1,,
2,131512,8887,farmer,person that works in agriculture,1,da,wikidata,1,human,
3,1734662,2012,cartographer,person preparing geographical maps,1,ia,wikidata,1,human,
4,294126,323,land surveyor,profession,0,tr,wikidata,-1,,


In [103]:
occs_df.tail()

Unnamed: 0,item_id,occupation_count,occupation,description,in_google_news,language_detected,source,label,labeled_by,label_error_reason
12749,10526680,1,hovkamrerare,,0,sl,wikidata,-1,,
12750,10526703,1,Hovrättspresident,,0,sv,wikidata,-1,,
12751,66486266,1,activista taurí,subclase de activista,0,oc,wikidata,-1,,
12752,360443,1,Escort,Wikimedia disambiguation page,1,oc,wikidata,1,human,
12753,87252988,1,弓道家,,0,Unknown,wikidata,-1,,


In [154]:
occs_df.to_csv('occupations.wikidata.all.gnews.labeled.csv', sep='\t', index=False)

## Now let's check the label quality, see: `correcting_GoogleNews_labels_with_Cleanlab.ipynb`