# Wikipedia Notable Life Expectancies
# [Notebook 13: Data Cleaning Part 12](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean12_thanak_2022_08_03.ipynb)
### Context

The
### Objective

The
### Data Dictionary
- Feature: Description

### Importing Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To define the maximum column width displayed in a dataframe
pd.set_option("display.max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean11.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean11", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 98056 rows and 38 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,,ballet designer and director,,,,,,,,,86.0,,United Kingdom of Great Britain and Northern Ireland,,,3.091042,0,0,0,0,0,1,0,0,0,0,0,0,1
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,,writer,and academic,,,,,,,,68.0,,Ireland,,,2.564949,0,0,0,0,0,0,0,0,1,0,0,0,1
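
The read cell above leaves `conn` open for the rest of the session. As a minimal sketch, assuming nothing else needs the connection afterwards, the same read can be wrapped in `contextlib.closing` so the connection is closed once the data is loaded. The in-memory database and the `demo` table are stand-ins for illustration only, not the notebook's actual file or table.

```python
import sqlite3 as sql
from contextlib import closing

import pandas as pd

# In-memory database standing in for wp_life_expect_clean11.db
with closing(sql.connect(":memory:")) as conn:
    # Write a toy table so that there is something to read back
    pd.DataFrame({"name": ["A", "B"], "age": [86.0, 68.0]}).to_sql(
        "demo", conn, index=False
    )
    data = pd.read_sql("SELECT * FROM demo", conn)

# Work on a copy so the loaded data stays untouched
df_demo = data.copy()
print(f"There are {df_demo.shape[0]} rows and {df_demo.shape[1]} columns.")
```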



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
98054,9,Aamir Liaquat Hussain,", 50, Pakistani journalist and politician, MNA .",https://en.wikipedia.org/wiki/Aamir_Liaquat_Hussain,99,2022,June,", since",,,MNA,,,,,,,,,50.0,,Pakistan,,"2002 2007, since 2018",4.60517,0,0,0,0,0,1,0,0,1,0,0,0,2
98055,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,,member of the Academy of Engineering,,,,,,,,,86.0,,"China, People's Republic of",,,1.386294,1,0,0,0,0,0,0,0,0,0,0,0,1



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
81885,27,Murilo Melo Filho,", 91, Brazilian writer, lawyer and journalist, multiple organ failure.",https://en.wikipedia.org/wiki/Murilo_Melo_Filho,4,2020,May,,,,lawyer and journalist,multiple organ failure,,,,,,,,91.0,,Brazil,,,1.609438,0,0,0,0,0,1,0,0,0,0,0,0,1
44022,12,Paul Bhattacharjee,", 53, British actor , suicide.",https://en.wikipedia.org/wiki/Paul_Bhattacharjee,22,2013,July,", ,",,,suicide,,,,,,,,,53.0,,United Kingdom of Great Britain and Northern Ireland,,", ,",3.135494,0,0,0,0,0,1,0,0,0,0,0,0,1
21723,11,John Spencer,", 71, British former world champion snooker player, stomach cancer.",https://en.wikipedia.org/wiki/John_Spencer_(snooker_player),106,2006,July,,,,stomach cancer,,,,,,,,,71.0,,United Kingdom of Great Britain and Northern Ireland,,,4.672829,0,0,0,0,0,0,1,0,0,0,0,0,1
2257,10,Duncan McKenzie,", 43, American convicted murderer, execution by lethal injection.",https://en.wikipedia.org/wiki/Duncan_McKenzie_(murderer),12,1995,May,,,,execution by lethal injection,,,,,,,,,43.0,,United States of America,,,2.564949,0,0,0,0,0,0,0,0,0,1,0,0,1
57936,17,Doris Roberts,", 90, American actress , stroke.",https://en.wikipedia.org/wiki/Doris_Roberts,92,2016,April,", ,",,,stroke,,,,,,,,,90.0,,United States of America,,", ,",4.532599,0,0,0,0,0,1,0,0,0,0,0,0,1



### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98056 entries, 0 to 98055
Data columns (total 38 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   day                        98056 non-null  object 
 1   name                       98056 non-null  object 
 2   info                       98056 non-null  object 
 3   link                       98056 non-null  object 
 4   num_references             98056 non-null  int64  
 5   year                       98056 non-null  int64  
 6   month                      98056 non-null  object 
 7   info_parenth               36661 non-null  object 
 8   info_1                     22 non-null     object 
 9   info_2                     98024 non-null  object 
 10  info_3                     48896 non-null  object 
 11  info_4                     10264 non-null  object 
 12  info_5                     1265 non-null   object 
 13  info_6                     181 non-null    obj


#### Observations:
- With the dataset loaded, we can resume extracting `known_for` values by rebuilding `known_for_dict`.

### Extracting `known_for` Continued

#### Finding `known_for` Roles in `info_2`

In [6]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [7]:
# # Code to check each value
# roles_list.pop()


In [8]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "religion" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )
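
The commented comprehension above collects every `info_2` value that contains a keyword. One possible equivalent, sketched here on toy data, uses `Series.str.contains` with `regex=False` for a literal match and `na=False` to skip missing values. The helper name `values_containing` is hypothetical and not part of the notebook.

```python
import pandas as pd


def values_containing(series, keyword):
    """Unique values of series containing keyword, longest first."""
    mask = series.str.contains(keyword, regex=False, na=False)
    unique_values = series[mask].value_counts().index.tolist()
    return sorted(unique_values, key=len, reverse=True)


# Toy stand-in for df["info_2"]
info_2 = pd.Series(
    ["of religion", "religion", None, "religion and apologetics", "religion", "nurse"]
)

specific_roles_list = values_containing(info_2, "religion")
print(specific_roles_list)
```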


In [9]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)
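
The descending-length order matters because the extraction loop later in the notebook removes each matched role from the text with `replace`. If a short role such as "chef" is searched before "pastry chef", only a fragment is left behind and the longer role can no longer match. A minimal sketch of that effect, using a hypothetical helper `strip_roles` and a toy role list:

```python
def strip_roles(text, roles):
    """Remove each role found in text, in the order given (hypothetical helper)."""
    matched = []
    for role in roles:
        if role in text:
            matched.append(role)
            text = text.replace(role, "").strip()
    return text, matched


roles = ["chef", "pastry chef", "chef and"]  # toy subset, unsorted
value = "pastry chef and television personality"

# Short-first order: "chef" is removed first and leaves a fragment behind
leftover_unsorted, matched_unsorted = strip_roles(value, roles)

# Longest-first order, mirroring sorted(..., key=len, reverse=True)
roles_sorted = sorted(roles, key=len, reverse=True)
leftover_sorted, matched_sorted = strip_roles(value, roles_sorted)

print(repr(leftover_unsorted), matched_unsorted)
print(repr(leftover_sorted), matched_sorted)
```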


In [10]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "economics editor"]


#### Creating Lists for Each `known_for` Category

In [11]:
# Creating lists for each category
politics_govt_law = [
    "revolutionary socialist and workers' leader",
    "' leader",  # before business_farming
    "Tlingit elder",
    "Governor of Benue State",
    "veterans advocate",
    "traditional ruler of Ife",
    "ruler of Ras al Khaimah",
    "traditional ruler",
    "ruler of the",
    "ruler",
]

arts = [
    "graphic designer and pioneer in the field of computer graphics",
    "photographic director and videographer",
    "illustrator and graphic designer",
    "graphic designer and typographer",
    "television graphic designer",
    "graphic and type designer",
    "pornographic performer",
    "graphic designer",
    "gay pornographic",
    "pornographic",
    "graphic and",
    "graphic",
    "theatre and opera director and stage designer",
    "theatre director and voice coach",
    "theatre and opera administrator",
    "television and theatre director",
    "and Broadway theatre performer",
    "humorist and theatre director",
    "theatre director and designer",
    "theatre director and theorist",
    "theatre and concert director",
    "opera and theatre director",
    "theatre owner and manager",
    "theatre director and",
    "theatre impresario",
    "theatre publicist",
    "theatre director",
    "theatre designer",
    "theatre promoter",
    "theatre and",
    "theatre",
    "celebrity chef and television personality",
    "pastry chef and television personality",
    "chef and reality show contestant",
    "chef and television personality",
    "Michelin Star winning chef",
    "pioneering television chef",
    "Cajun chef and humorist",
    "famed New Orleans chef",
    "chef and gastronomist",
    "television chef",
    "celebrity chef",
    "internet chef",
    "pastry chef",
    "Creole chef",
    "music  chef",
    "head chef",
    "chef and",
    "chef",
    "stunt performer",
    "movie stuntman",
    "car customizer",
    "customizer",
    "stuntwoman",
    "BBC disc jockey and guru of the independent music scene",  # before sports
    "radio disc jockey and proponent of Pinoy rock",
    "Hall of Fame disc jockey and television host",
    "disc jockey and television personality",
    "disk jockey and sound system operator",
    "disc jockey and music news reporter",
    "disc jockey and record collector",
    "disk jockey known as 'Nightbird'",
    "disc jockey and television host",
    "game show host and disc jockey",
    "former BBC Radio disc jockey",
    "country music disc jockey",
    "BBC Radio disc jockey",
    "footwork disc jockey",
    "reggae disc jockey",
    "radio disc jockey",
    "radio disk jockey",
    "disc jockey and",
    "and disc jockey",
    "disc jockey",
    "disk jockey",
    "children book and magazine illustrator",
    "conceptual designer and illustrator",
    "illustrator for the original books",
    "concept designer and illustrator",
    "printmaker and book illustrator",
    "photo essayist and illustrator",
    "illustrator of children books",
    "illustrator and watercolorist",
    "caricaturist and illustrator",
    "illustrator and caricaturist",
    "science fiction illustrator",
    "magazine cover illustrator",
    "printmaker and illustrator",
    "children book illustrator",
    "illustrator and designer",
    "commercial illustrator",
    "comic book illustrator",
    "botanical illustrator",
    "children illustrator",
    "fashion illustrator",
    "fantasy illustrator",
    "comics illustrator",
    "manga illustrator",
    "comic illustrator",
    "book illustrator",
    "bird illustrator",
    "and illustrator",
    "illustrator and",
    "illustrator",
    "nurseryman",  # before sciences
    "correspondent and editor for United Press International",
    "editor in chief of King Features Syndicate",
    "Pulitzer Prize winning newspaper editor",
    "comic book and pulp magazine editor",
    "newspaper editor of the from until",
    "United Press International editor",
    "founding editor of stomach cancer",
    "wood carver and magazine editor",
    "sound designer and sound editor",
    "science fiction fanzine editor",
    "magazine and newspaper editor",
    "editorial page editor for the",
    "editor of black publications",
    "founding editor of magazine",
    "editor in chief of magazine",
    "Oscar winning sound editor",
    "newspaper editor in chief",
    "Composer and music editor",
    "sound designer and editor",
    "book and magazine editor",
    "science fiction editor",
    "newspaper chief editor",
    "photojournalism editor",
    "visual effects editor",
    "women magazine editor",
    "games magazine editor",
    "and newspaper editor",
    "mystery novel editor",
    "secretary and editor",
    "Disney comics editor",
    "and magazine editor",
    "book review editor",
    "managing editor of",
    "comic book editor",
    "publishing editor",
    "comic  and editor",
    "photo editor and",
    "newspaper editor",
    "magazine editor",
    "literary editor",
    "editor in chief",
    "fashion editor",
    "fiction editor",
    "sound editor",
    "photo editor",
    "music editor",
    "book editor",
    "news editor",
    "CNET editor",
    "and editor",
    "editor and",
    "editor of",
    "editor",
]
sports = [
    "professional road bicycle racer who won two stages of the Tour de",
    "Grand Prix motorcycle and short circuit road racer",
    "short circuit motorcycle road racer",
    "sport sailor and maxi yacht racer",
    "Grand Prix motorcycle road racer",
    "motorcycle and touring car racer",
    "professional road bicycle racer",
    "motor racer and IndyCar driver",
    "motorcycle sidecar road racer",
    "Hall of Fame motorcycle racer",
    "jet car driver and drag racer",
    "automobile racer and designer",
    "professional motocross racer",
    "motorcycle builder and racer",
    "Grand Prix motorcycle racer",
    "Paralympic wheelchair racer",
    "motorcycle speedway racer",
    "drag racer and crew chief",
    "racer and television host",
    "off road motorcycle racer",
    "motorcycle and auto racer",
    "Hall of Fame drag racer",
    "motorcycle rally racer",
    "motorcycle road racer",
    "Moto motorcycle racer",
    "land speed racer and",
    "motorcycle racer and",
    "motorcross racer and",
    "horse harness racer",
    "mountain bike racer",
    "road bicycle racer",
    "disabled ski racer",
    "hillclimbing racer",
    "powerboating racer",
    "cyclo cross racer",
    "touring car racer",
    "motorcycle racer",
    "alpine ski racer",
    "automobile racer",
    "wheelchair racer",
    "Alpine ski racer",
    "sprint car racer",
    "motocross racer",
    "stock car racer",
    "motorbike racer",
    "motorboat racer",
    "ski cross racer",
    "NHRA drag racer",
    "off road racer",
    "bicycle racer",
    "sidecar racer",
    "Air racer and",
    "barrel racer",
    "MotoGP racer",
    "yacht racer",
    "motor racer",
    "rally racer",
    "drag racer",
    "auto racer",
    "air racer",
    "ski racer",
    "and racer",
    "racer",
    "female jockey and pioneer in thoroughbred horse racing",
    "jockey and first woman in to receive a jockey licence",
    "National Hunt jockey and horse trainer",
    "race horse trainer and jockey mentor",
    "jockey in thoroughbred horse racing",
    "jockey in thoroughbred racing",
    "horse trainer and jockey",
    "jockey and horse trainer",
    "National Champion jockey",
    "National Hunt jockey",
    "horse racing jockey",
    "Hall of Fame jockey",
    "jockey and trainer",
    "jockey",
]
sciences = [
    "paleontologist and ornithologist",
    "ichthyologist and ornithologist",
    "ornithologist and",
    "and ornithologist",
    "ornithologist",
    "zoologist and advocate of evolutionary epistemology",
    "zoologist and neurophysiologist",
    "palaeontologist and zoologist",
    "paleontologist and zoologist",
    "soil zoologist and ecologist",
    "immunologist and zoologist",
    "zoologist and ecologist",
    "invertebrate zoologist",
    "zoologist  science",
    "medical zoologist",
    "turtle zoologist",
    "cryptozoologist",
    "and zoologist",
    "zoologist and",
    "zoologist",
    "healthcare advocate and registered nurse",
    "first nurse to earn a master degree",
    "nurse and nurse researcher",
    "nurse and nurse tutor",
    "mental health nurse",
    "nurse and nursing",
    "registered nurse",
    "Navy nurse",
    "Army nurse",
    "nurse and",
    "and nurse",
    "nurse",
    "endocrinologist and medical researcher",
    "pediatrician and medical researcher",
    "immunologist and medical researcher",
    "neurologist and medical researcher",
    "medical researcher in immunology",
    "biomedical researcher",
    "medical researcher",
]

business_farming = [
    "hotelier and casino owner",
    "hotelier and retailer",
    "hotelier and",
    "hotelier",
    "potato farmer and long distance runner",
    "farmer and landowner",
    "farmer and lobbyist",
    "rice farmer",
    "and farmer",
    "farmers'",
    "farmer",
]

academia_humanities = [
    "and Broadway theatre preservationist",  # before arts
]
law_enf_military_operator = [
    "colonel in the Army and of the most decorated women in military history",
    "Army General who commanded military operations in the War from to",
    "paramilitary intelligence chief and clandestine agent",
    "National Liberation Army paramilitary leader",
    "Sandinista guerrilla and military leader",
    "scientific military intelligence expert",
    "Prime Minister of and military leader",
    "Northern Alliance military commander",
    "Serb military commander in the War",
    "military communications listener",
    "military commander and warlord",
    "republican paramilitary leader",
    "military and security official",
    "head of military intelligence",
    "warlord and military figure",
    "Serb paramilitary commander",
    "Biafran military commander",
    "Chetnik military commander",
    "Hutu paramilitary leader",
    "ISIL military commander",
    "and paramilitary leader",
    "and military commander",
    "paramilitary commander",
    "military commander and",
    "era military commander",
    "military intelligence",
    "loyalist paramilitary",
    "military interpreter",
    "military veteran and",
    "military leader and",
    "paramilitary leader",
    "marine and military",
    "military commander",
    "military  designer",
    "military official",
    "military veteran",
    "and paramilitary",
    "military leader",
    "military office",
    "Hamas military",
    "military chief",
    "paramilitary",
    "military man",
    "military",
]

spiritual = ["religion and apologetics", "of religion", "religion"]
social = [
    "animal welfare campaigner",
    "social worker and Righteous Among the Nations",
    "youth social worker",
    "and social worker",
    "social worker",
    "charity fundraiser",
]
crime = ["and forger", "forger", "drug lord"]
event_record_other = [
    "chef and construction worker",  # before arts
    "construction worker",
    "ebola survivor",
    "anthrax attack victim",
]
other_species = []


In [12]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp


In [13]:
# Hard-coding cause_of_death for entry with value in info_2
index = df[df["link"] == "https://en.wikipedia.org/wiki/Hideo_Ogata"].index
df.loc[index, "cause_of_death"] = "stomach cancer"
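
The hard-coded fix above selects the row by its `link`. If the link were mistyped, the index would be empty and the assignment would typically change no rows and raise no error. A small guard can confirm that exactly one row matched before assigning. This is only a sketch: `set_by_link` is a hypothetical helper and the dataframe is toy data.

```python
import pandas as pd


def set_by_link(df, link, column, value):
    """Set df[column] for the single row whose link matches; raise otherwise."""
    index = df[df["link"] == link].index
    if len(index) != 1:
        raise ValueError(f"Expected 1 row for {link}, found {len(index)}")
    df.loc[index, column] = value
    return df


# Toy stand-in for the notebook's dataframe
toy = pd.DataFrame(
    {
        "link": [
            "https://en.wikipedia.org/wiki/Hideo_Ogata",
            "https://en.wikipedia.org/wiki/Raymond_Crotty",
        ],
        "cause_of_death": [None, None],
    }
)

toy = set_by_link(
    toy, "https://en.wikipedia.org/wiki/Hideo_Ogata", "cause_of_death", "stomach cancer"
)
print(toy)
```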


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [14]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "politics_govt_law": politics_govt_law,
    "business_farming": business_farming,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sciences": sciences,
    "sports": sports,
}
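
The inline comments such as `# before arts` in the lists above indicate that the key order of this dictionary carries meaning. Python dictionaries preserve insertion order, and the extraction loop walks the categories in that order, so a full phrase placed in an earlier category is removed before a shorter role in a later category can match. A minimal sketch with a hypothetical helper `assign_categories` and two toy dictionaries:

```python
def assign_categories(text, search_dict):
    """Return categories hit and leftover text, walking the dict in insertion order."""
    hits = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in text:
                hits.append(category)
                text = text.replace(role, "").strip()
    return hits, text


value = "chef and construction worker"

# event_record_other first: the full phrase is consumed before "chef" is tried
ordered = {
    "event_record_other": ["chef and construction worker"],
    "arts": ["chef and", "chef"],
}

# arts first: "chef and" matches, so the entry is also tagged as arts
reordered = {
    "arts": ["chef and", "chef"],
    "event_record_other": ["chef and construction worker", "construction worker"],
}

print(assign_categories(value, ordered))
print(assign_categories(value, reordered))
```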


#### Extracting Category from `info_2`

In [15]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 3min 10s
Wall time: 3min 10s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
22860,11,Lorentz Eldjarn,", 86, Norwegian biochemist.",https://en.wikipedia.org/wiki/Lorentz_Eldjarn,4,2007,February,,,,,,,,,,,,,86.0,,Norway,,,1.609438,1,0,0,0,0,0,0,0,0,0,0,0,1
41506,10,Bob Fenton,", 89, New Zealand politician, MP for Hastings .",https://en.wikipedia.org/wiki/Bob_Fenton,5,2013,January,,,,MP for Hastings,,,,,,,,,89.0,,New Zealand,,1975 1978,1.791759,0,0,0,0,0,0,0,0,1,0,0,0,1
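
The row-by-row loop above reports a wall time of about 3 minutes. One possible alternative, sketched here and not benchmarked on this dataset, applies each role to the whole column at once with `Series.str.contains` and `Series.str.replace`, both with `regex=False` so roles are treated as literal text. The function name `extract_categories` is hypothetical, and the sketch assumes the category columns already exist as integer columns, as they do in the notebook's dataframe.

```python
import pandas as pd


def extract_categories(df, column, search_dict):
    """Flag categories and strip matched roles from df[column], one role at a time."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal substring match; missing values count as no match
            mask = df[column].str.contains(role, regex=False, na=False)
            if mask.any():
                df.loc[mask, category] = 1
                df.loc[mask, column] = (
                    df.loc[mask, column].str.replace(role, "", regex=False).str.strip()
                )
    # Recompute the number of categories per row
    df["num_categories"] = df[list(search_dict.keys())].sum(axis=1)
    return df


# Toy data standing in for the notebook's dataframe
toy = pd.DataFrame(
    {
        "info_2": ["pastry chef", "jockey and trainer", None, "economics editor"],
        "arts": [0, 0, 0, 0],
        "sports": [0, 0, 0, 0],
    }
)
toy_dict = {
    "arts": ["pastry chef", "chef", "editor"],
    "sports": ["jockey and trainer", "jockey"],
}

toy = extract_categories(toy, "info_2", toy_dict)
print(toy)
```

The category and role order is kept exactly as in the loop, so the ordering rules for the lists and the dictionary still apply.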



#### Checking the Number of Rows without a First Category

In [16]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 9762 entries without any known_for category.



#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [17]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [18]:
# # Code to check each value
# roles_list.pop()


In [19]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "Jesuit" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [20]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [21]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "press director"]


#### Creating Lists for Each `known_for` Category

In [22]:
# Creating lists for each category
politics_govt_law = [
    "managing director of the Abu Dhabi Investment Authority",  # before arts
    "director of Office of Telecommunications Policy",
    "director of the Mint from to",
    "Native advocate",
    "Peace",
]

arts = [
    "director of most episodes of Monty Python Flying Circus",
    "theater director who staged plays on and off Broadway",
    "publishing director of Burke Peerage Limited",
    "opera director and set and costume designer",
    "theatrical director and opera librettist",
    "director of John H Johnson Fashion Fair",
    "music director and music group founder",
    "cameraman and director of photography",
    "director and lyricist in the language",
    "operatic baritone and opera director",
    "television and music video director",
    "opera director and administrator",
    "festival director and cover girl",
    "managing director of BBC Radio",
    "organist and musical director",
    "television and radio director",
    "organist and choral director",
    "rock and roll tour director",
    "Emmy Award winning director",
    "short  and casting director",
    "theater and opera director",
    "opera director and manager",
    "opera director and hazzan",
    "public relations director",
    "theater director and mime",
    "radio program director",
    "marching band director",
    "college band director",
    "music video director",
    "operatic ic director",
    "commercial director",
    "theatrical director",
    "assistant director",
    "festival director",
    "Broadway director",
    "and news director",
    "director of Radio",
    "theater director",
    "casting director",
    "company director",
    "B movie director",
    "gallery director",
    "screen director",
    "choral director",
    "design director",
    "opera director",
    "music director",
    "movie director",
    "media director",
    "anime director",
    "radio director",
    "voice director",
    "press director",
    "band director",
    "set director",
    "director and",
    "director",
    "organist of the St Peter Basilica in Rome",
    "harpsichordist and organist",
    "organist and harpsichordist",
    "chorister and organist",
    "classical organist and",
    "cantor and organist",
    "cathedral organist",
    "classical organist",
    "concert organist",
    "stadium organist",
    "organist and",
    "organist",
]
sports = [
    "track and field coach and athletic director",  # before arts
    "manager and director in the Football League",
    "collegiate athletic director",
    "college athletic director",
    "runner and race director",
    "athletic director",
    "sporting director",
]
sciences = [
    "NASA mission director",  # before arts
]

business_farming = [
    "toy manufacturer and managing director of Lego",  # before arts
    "managing director of Ulsterbus and Citybus",
    "business director",
    "funeral director",
]
academia_humanities = [
    "deputy director of the National Air and Space Museum",  # before arts
    "director of the National Gallery of Art from to",
    "director of the Metropolitan Museum of Art",
    "director of the Cleveland Museum of Art",
    "oral history archive director",
    "curator and museum director",
    "library director",
    "museum director",
]
law_enf_military_operator = [
    "Army Lieutenant General and director of the National Security Agency",  # before arts
    "first director of the Coast Guard Women Reserve",
    "former CIA director",
    "director of the FBI",
]
spiritual = [
    "church music director and concert",  # before arts
    "Jesuit monk and",
    "Jesuit cleric",
    "Jesuit",
]
social = [
    "director of the Peace Corps",  # before arts
    "Peace Corps",  # before politics_govt_law
]
crime = []
event_record_other = []
other_species = []


In [23]:
# Hard-coding info_1 value to capture spiritual for entry
index = df[df["link"] == "https://en.wikipedia.org/wiki/Pasquale_Borgomeo"].index
df.loc[index, "info_1"] = "priest"

# Hard-coding info_2 value to capture business for entry
index = df[df["link"] == "https://en.wikipedia.org/wiki/Jim_Service"].index
df.loc[index, "info_2"] = "business director"  # added to dictionary


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [24]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "sports": sports,
    "arts": arts,
}


#### Extracting Category from `info_2`

In [25]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 57.1 s
Wall time: 57.1 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
26516,25,Brian Donnelly,", 59, New Zealand diplomat and politician, MP .",https://en.wikipedia.org/wiki/Brian_Donnelly_(New_Zealand_politician),6,2008,September,,,,MP,,,,,,,,,59.0,,New Zealand,,1996 2008,1.94591,0,0,0,0,0,0,0,0,1,0,0,0,1
43019,23,Norman Jones,", 78, British television actor .",https://en.wikipedia.org/wiki/Norman_Jones_(actor),9,2013,April,",",,,,,,,,,,,,78.0,,United Kingdom of Great Britain and Northern Ireland,,",",2.302585,0,0,0,0,0,1,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [26]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 9538 entries without any known_for category.
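
Comparing this count with the 9762 reported after the previous iteration shows how much this pass contributed. The arithmetic, using only the figures printed in this notebook, can be sketched as:

```python
remaining_before = 9762  # reported after the previous iteration
remaining_after = 9538  # reported after this iteration
total_rows = 98056  # reported when the dataset was loaded

# Entries that gained a first category in this iteration
newly_categorized = remaining_before - remaining_after

# Share of the dataset still without any category
share_remaining = remaining_after / total_rows

print(newly_categorized, f"{share_remaining:.1%}")
```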



#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [27]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [28]:
# # Code to check each value
# roles_list.pop()


In [29]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "bounty hunter" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [30]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [31]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "bounty hunter and reality television personality"]


#### Creating Lists for Each `known_for` Category

In [32]:
# Creating lists for each category
politics_govt_law = [
    "deaf rights advocate",
    "government official and gun control advocate",
    "consumer advocate and government official",
    "local government official",
    "and government official",
    "government official",
]

arts = [
    "radio preacher",
    "entertainer and television personality",
    "and reality television personality",
    "radio and television personality",
    "reality television personality",
    "and television personality",
    "television personality",
]
sports = [
    "national hockey team and Pittsburgh Penguins coach",
    "ice hockey trainer and equipment manager",
    "ice hockey Hall of Fame player and coach",
    "Detroit Red Wings hockey player in the s",
    "women baseball and field hockey player",
    "field hockey player and administrator",
    "professional hockey player and coach",
    "ice hockey administrator and referee",
    "ice hockey referee and administrator",
    "professional ice hockey defenseman",
    "Hall of Fame field hockey player",
    "Hall of Fame ice hockey linesman",
    "field hockey player and manager",
    "roller hockey player and coach",
    "field hockey player and coach",
    "ice hockey coach and manager",
    "field hockey representative",
    "professional hockey player",
    "ice hockey administrator",
    "college ice hockey coach",
    "hockey player and coach",
    "NHL ice hockey referee",
    "ice hockey goaltender",
    "ice hockey defenceman",
    "sledge hockey player",
    "field hockey player",
    "ice hockey referee",
    "ice hockey coach",
    "hockey official",
    "hockey referee",
    "hockey player",
    "field hockey",
    "ice hockey",
    "hockey",
    "competitive figure skater as a teenager",
    "figure skater and figure skating coach",
    "pair skater and figure skating",
    "roller derby skater and coach",
    "short track speed skater",
    "long distance ice skater",
    "figure skater and coach",
    "skate and snowboarder",
    "skateboard innovator",
    "Roller derby skater",
    "roller derby skater",
    "figure skater",
    "speed skater",
    "pair skater",
    "skater",
    "weightlifter and fitness centre owner",
    "champion Paralympic weightlifter",
    "world champion weightlifter",
    "heavyweight weightlifter",
    "weightlifter",
    "sailor and nightclub owner",
    "sailor and adventurer",
    "sailboat designer",
    "sailor and coach",
    "land sailor",
    "sailor and",
    "assailant",
    "sailor",
]
sciences = [
    "sailplane designer and pioneer",  # before sports
    "ichthyologist",
    "yacht designer",
    "sceptic",
    "immunologist and eye tissue transplant researcher",
    "gastroenterologist and immunologist",
    "immunologist and cancer researcher",
    "immunologist and AIDS researcher",
    "pathologist and immunologist",
    "virologist and immunologist",
    "cancer immunologist",
    "immunologist",
    "aerodynamics expert at",
    "aeroplane designer",
    "aerospace pioneer",
    "aerodynamicist",
    "aerospace",
    "aero",
]

business_farming = [
    "diamond merchant",
    "wine collector and dealer",  # before academia_humanities
    "pastoral and tourism pioneer",  # before spiritual
    "financier and venture capitalist",
    "venture capitalist and financier",
    "property developer and financier",
    "billionaire financier",
    "corporate  financier",
    "and a financier",
    "financier and",
    "and financier",
    "financier",
    "pawnbroker",
]
academia_humanities = [
    "ichthyologist and musical instrument collector",
    "collector of Harry Houdini memorabilia",
    "diamond merchant and antique collector",
    "optical illusion collector and sceptic",
    "and music collector",
    "toy car collector",
    "antique collector",
    "record collector",
    "book collector",
    "collector",
]
law_enf_military_operator = [
    "sea captain sailor",  # before sports
    "navy sailor",
    "bounty hunter",
]
spiritual = [
    "evangelist and pastor of the Worldwide Church of God",
    "pastor at the University Baptist Church in Waco",
    "Pentecostal evangelical pastor and",
    "Evangelical Lutheran pastor",
    "Independent Baptist pastor",
    "Baptist megachurch pastor",
    "pastor and evangelist",
    "pastor and exorcist",
    "Pentecostal pastor",
    "evangelical pastor",
    "Protestant pastor",
    "megachurch pastor",
    "pastoral theology",
    "reformist pastor",
    "Christian pastor",
    "Baptist pastor",
    "gospel  pastor",
    "senior pastor",
    "pastor and",
    "and pastor",
    "pastor",
]
social = []
crime = ["assailant", "fugitive from justice", "fugitive"]  # before sports
event_record_other = [
    "potato chip collector",  # before academia_humanities
]
other_species = [
    "skateboarding bulldog",  # before sports
]


In [33]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp
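
The descending-length order matters because the extraction loop removes each matched role from `info_2`. If a short role is searched first, any longer role that contains it can no longer match. Below is a minimal sketch of a check for that situation. The helper name `find_shadowed_roles` and the example list are hypothetical and are not part of the notebook.

```python
def find_shadowed_roles(roles):
    """Return (earlier, later) pairs where an earlier role is a substring of a later one."""
    shadowed = []
    for i, earlier in enumerate(roles):
        for later in roles[i + 1 :]:
            # The earlier role would be stripped first, so the later one never matches
            if earlier != later and earlier in later:
                shadowed.append((earlier, later))
    return shadowed


# Illustrative list with "hockey" placed too early
example = ["hockey", "ice hockey coach", "ice hockey", "skater"]
problems = find_shadowed_roles(example)
print(problems)

# Sorting by descending length removes the problem, because a proper
# substring is always shorter than the string that contains it
fixed = sorted(example, key=len, reverse=True)
print(find_shadowed_roles(fixed))
```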


In [34]:
# Hard-coding info_2 values so the entries below are captured under law_enf_military_operator
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/John_Leake_(NAAFI_manager)"
].index
df.loc[index, "info_2"] = "navy sailor"  # added to dict

index = df[df["link"] == "https://en.wikipedia.org/wiki/Ted_Briggs"].index
df.loc[index, "info_2"] = "navy sailor"

index = df[df["link"] == "https://en.wikipedia.org/wiki/Robert_Walker_(sailor)"].index
df.loc[index, "info_2"] = "navy sailor"

index = df[df["link"] == "https://en.wikipedia.org/wiki/Robert_Stinnett"].index
df.loc[index, "info_2"] = "navy sailor"

index = df[df["link"] == "https://en.wikipedia.org/wiki/Molly_Kool"].index
df.loc[index, "info_2"] = "sea captain sailor"  # added to dict

index = df[df["link"] == "https://en.wikipedia.org/wiki/Susan_Clark_(sailor)"].index
df.loc[index, "info_2"] = "sea captain sailor"
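
The repeated lookup-and-assign pattern above could be wrapped in a small helper that also confirms the link matches exactly one row. This is a sketch under assumptions: `set_value_by_link` is a hypothetical name, and the toy frame below stands in for `df`.

```python
import pandas as pd


def set_value_by_link(frame, link, column, value):
    """Assign value to column for the single row whose link matches."""
    mask = frame["link"] == link
    matches = int(mask.sum())
    # Guard against typos in the link or duplicated entries
    if matches != 1:
        raise ValueError(f"Expected exactly one row for {link}, found {matches}")
    frame.loc[mask, column] = value


# Hypothetical toy frame; links and values are illustrative only
toy = pd.DataFrame(
    {
        "link": ["https://example.org/a", "https://example.org/b"],
        "info_2": ["sailor", "sailor and"],
    }
)
set_value_by_link(toy, "https://example.org/b", "info_2", "navy sailor")
print(toy)
```

A mistyped link then raises an error instead of silently updating no rows.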


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [35]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "sports": sports,
}
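
Because dictionaries preserve insertion order, the key order above decides which category is searched first, which is what the "before ..." comments in the lists rely on. A role listed under two categories is claimed by the earlier key only, since the matched text is then removed from `info_2`. The following sketch flags such roles. `find_cross_category_roles` and `toy_dict` are hypothetical names used for illustration.

```python
from collections import defaultdict


def find_cross_category_roles(search_dict):
    """Return {role: [categories]} for roles listed under more than one category."""
    seen = defaultdict(list)
    for category, roles in search_dict.items():
        # set() ignores duplicates within a single category list
        for role in set(roles):
            seen[role].append(category)
    return {role: cats for role, cats in seen.items() if len(cats) > 1}


# Illustrative dictionary; "crime" comes first, so it would claim "assailant"
toy_dict = {
    "crime": ["assailant", "fugitive"],
    "sports": ["sailor and", "assailant", "sailor"],
}
overlaps = find_cross_category_roles(toy_dict)
print(overlaps)
```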


#### Extracting Category from `info_2`

In [36]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-missing value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier match
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 15s
Wall time: 1min 15s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
88226,6,Chi Shangbin,", 71, Chinese football player , heart attack.",https://en.wikipedia.org/wiki/Chi_Shangbin,25,2021,March,"national team and manager Dalian Wanda, Dalian Aerbin",,,heart attack,,,,,,,,,71.0,,"China, People's Republic of",,"national team and manager Dalian Wanda, Dalian Aerbin",3.258097,0,0,0,0,0,0,1,0,0,0,0,0,1
45277,23,Niall Donohue,", 22, Irish hurler , suicide.",https://en.wikipedia.org/wiki/Niall_Donohue,4,2013,October,Galway,,,suicide,,,,,,,,,22.0,,Ireland,,Galway,1.609438,0,0,0,0,0,0,1,0,0,0,0,0,1
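
The row-by-row loop above takes over a minute per iteration. One possible alternative is one vectorized pass per role, sketched below. This is not the notebook's method. `extract_categories`, `toy`, and `toy_dict` are hypothetical names, and the sketch assumes roles are literal substrings (`regex=False`).

```python
import pandas as pd


def extract_categories(frame, column, search_dict):
    """Flag categories and strip matched roles, one vectorized pass per role."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal substring match; missing values count as no match
            mask = frame[column].str.contains(role, regex=False, na=False)
            if mask.any():
                frame.loc[mask, category] = 1
                frame.loc[mask, column] = (
                    frame.loc[mask, column]
                    .str.replace(role, "", regex=False)
                    .str.strip()
                )
    return frame


# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame(
    {
        "info_2": ["ice hockey coach and pastor", None, "sailor"],
        "sports": 0,
        "spiritual": 0,
    }
)
toy_dict = {"spiritual": ["pastor"], "sports": ["ice hockey coach", "sailor"]}
toy = extract_categories(toy, "info_2", toy_dict)
toy["num_categories"] = toy[list(toy_dict)].sum(axis=1)
print(toy)
```

As in the loop, categories are processed in dictionary order and each matched role is removed before the next role is searched.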



#### Checking the Number of Rows without a First Category

In [37]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 9016 entries without any known_for category.



#### Observations:
- This iteration reduced the entries without a `known_for` category from 9538 to 9016. We will rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [38]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [39]:
# # Code to check each value
# roles_list.pop()


In [40]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "confectioner" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [41]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [42]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "nun and confectioner"]


#### Creating Lists for Each `known_for` Category

In [43]:
# Creating lists for each category
politics_govt_law = [
    "neo fascist",
    "barrister and",
    "barrister",
    "viceroy",
    "princess and a Muhammad Ali Dynasty member",
    "princess and grandmother of King Felipe VI",
    "princess and granddaughter of Mehmed V",
    "princess of the House of Savoy",
    "princess and expatriate",
    "Sealandic princess",
    "and princess",
    "princess",
    "anti torture advocate",
]

arts = [
    "science fiction and multi winner",
    "science fiction bibliographer",
    "science fiction expert",
    "science fiction and",
    "science fiction fan",
    "science fiction",
    "lyricist known for writing musicals with Adolph Green including",
    "country music performer and member of the Grand Ole Opry",
    "music manager and spokesperson for the Bay City Rollers",
    "electronic music programmer and keyboardist",
    "costume designer of Broadway musicals",
    "classical music radio program host",
    "talent manager and music promoter",
    "founder of exotica musical genre",
    "big band and pop music arranger",
    "music venue owner and promoter",
    "calypsonian and music promoter",
    "bandleader and music arranger",
    "country music and rodeo star",
    '"father of bluegrass" music',
    "classical music impresario",
    "country music entertainer",
    "music promoter and agent",
    "electronic music pioneer",
    "country music performer",
    "percussionist and music",
    "music manager and agent",
    "classical music manager",
    "father of Chicano music",
    "pioneer of Celtic music",
    "and musical instrument",
    "musical administrator",
    "folk music researcher",
    "country music fiddler",
    "music hall performer",
    "music  administrator",
    "folk music promoter",
    "rock music promoter",
    "performer of music",
    "country music star",
    "traditional music",
    "cellist and music",
    "musical arranger",
    "music researcher",
    "musical lyricist",
    "music publicist",
    "music video and",
    "classical music",
    "music education",
    "music promoter",
    "music arranger",
    "music manager",
    "music website",
    "musical agent",
    "music expert",
    "music patron",
    "folk music",
    "and music",
    "musical",
    "music",
    "carnival designer",
    "violist and cellist",
    "classical cellist",
    "and cellist",
    "cellist",
    "Hall of Fame talk radio host",
    "television and radio host",
    "Hall of Fame radio host",
    "talk radio host and",
    "talk radio host",
    "and radio host",
    "radio host and",
    "radio host",
    "ballerina and ballet mistress",
    "ballerina and",
    "ballerina",
    "tenor saxophone player",
    "operatic lyric tenor",
    "operatic tenor",
    "countertenor",
    "opera tenor",
    "lyric tenor",
    "heldentenor",
    "tenor and",
    "tenor",
    "microwave cooking consultant",
    "TV cooking show host",
    "cooking show host",
    "television cook",
    "and cook",
    "cook",
]
sports = [
    "world champion bridge player",
    "professional bridge player",
    "contract bridge player",
    "bridge player",
    "thoroughbred horse breeder",
    "and horse breeder",
    "horse breeder",
    "Paralympic wheelchair curler",
    "world champion curler",
    "Hall of Fame curler",
    "curler",
    "middle distance runner and Commonwealth Games gold medallist",
    "middle distance runner and former world record holder",
    "middle distance runner and",
    "middle and long distance runner",
    "orienteer and mountain runner",
    "marathon and triathlon runner",
    "long distance runner and",
    "middle distance runner",
    "long distance runner",
    "steeplechase runner",
    "orienteering runner",
    "runner and coach",
    "distance runner",
    "sprint runner",
    "fell runner",
    "runner",
    "judoka",
    "race car driver and member of the NASCAR Hall of Fame",
    "NASCAR driver and ARCA race car driver owner",
    "Formula One and Grand Prix race car driver",
    "race car driver and hot rod builder",
    "race car driver and team owner",
    "Hall of Fame race car driver",
    "race car driver and mechanic",
    "Formula One race car driver",
    "race car driver and owner",
    "NASCAR race car driver",
    "race car driver and",
    "race car driver",
    "Baseball player who was the first to come out as gay",
    "Baseball player for the Philadelphia Athletics",
    "Baseball player and manager",
    "Baseball player and coach",
    "former Baseball player",
    "Baseball player",
    "Hall of Fame gymnast",
    "gymnastics coach",
    "rhythmic gymnast",
    "gymnast",
    "racecar driver and member of the NASCAR Hall of Fame",
    "professional racecar driver",
    "and racecar driver",
    "racecar driver",
    "baseball umpire and supervisor",
    "Hall of Fame baseball umpire",
    "baseball umpire",
    "Hall of Fame badminton player",
    "badminton player and coach",
    "badminton player and",
    "badminton player",
    "Test cricket umpire",
    "test cricket umpire",
    "cricket umpire",
]
sciences = [
    "paleontologist who revolutionized understanding of dinosaurs",
    "entomologist and paleontologist",
    "malacologist and paleontologist",
    "vertebrate paleontologist",
    "paleontologist",
    "arachnologist and myriapodologist",
    "entomologist and arachnologist",
    "arachnologist and",
    "arachnologist",
    "cardiologist who invented the technique of coronary bypass surgery",
    "cardiologist and expert on hypertension",
    "paediatric cardiologist",
    "pediatric cardiologist",
    "cardiologist",
    "child psychoanalyst",
    "psychoanalyst",
    "gynaecologist who is among the oldest men to have fathered a child",
    "gynaecologist and reproductive medicine researcher",
    "obstetrician and gynaecologist",
    "gynecologist and obstetrician",
    "obstetrician and gynecologist",
    "entomologist and ecologist",
    "statistical ecologist",
    "landscape ecologist",
    "and media ecologist",
    "plant ecologist",
    "paleoecologist",
    "deep ecologist",
    "gynaecologist",
    "geo ecologist",
    "gynecologist",
    "ecologist",
    "virologist credited with eradicating polio in",
    "epidemiologist and virologist",
    "virologist and paediatrician",
    "plant virologist",
    "virologist",
    "criminologist",
    "pharmacologist and biodynamic agriculturalist",
    "Nobel Prize winning pharmacologist",
    "physiologist and pharmacologist",
    "pharmacologist and physiologist",
    "behavioral pharmacologist",
    "clinical pharmacologist",
    "psychopharmacologist",
    "neuropharmacologist",
    "pharmacologist",
    "pathologist who specialized in sickle cell anemia and hematology",
    "pathologist and cancer researcher",
    "neurologist and neuropathologist",
    "pathologist and toxicologist",
    "veterinary pathologist",
    "paediatric pathologist",
    "forensic pathologist",
    "chemical pathologist",
    "cancer pathologist",
    "dermatopathologist",
    "animal pathologist",
    "plant pathologist",
    "phytopathologist",
    "oral pathologist",
    "neuropathologist",
    "pathologist and",
    "pathologist",
    "sexologist and psychotherapist",
    "child psychotherapist",
    "psychotherapist",
]

business_farming = ["founder of PowerBar", "winemaker and", "winemaker", "confectioner"]
academia_humanities = [
    "library",
    "emeritus curator at the Smithsonian Institution National Museum of Natural History",
    "archeologist and former curator at the Smithsonian Institution",
    'museum curator and one of the "Monuments Men"',
    "manuscripts curator at the Museum",
    "Smithsonian Institution curator",
    "egyptologist and curator",
    "educationist and curator",
    "curator of contemporary",
    "curator of paintings",
    "photography curator",
    "and museum curator",
    "museum curator",
    "and curator",
    "curator",
]
law_enf_military_operator = [
    "astronaut in Mercury",
    "candidate astronaut",
    "former astronaut",
    "NASA astronaut",
    "astronaut",
    "resistant",
    "air marshal and Director General of Intelligence",
    "air marshal and George Cross recipient",
    "Air Force air marshal",
    "air marshall",
    "air marshal",
]
spiritual = [
    "Christianity preacher and gospel",
    "prelate from  Catholic Association",
    "prelate of the Catholic Church",
    "prelate in the Catholic Church",
    "catholic prelate and cardinal",
    "Catholic clandestine prelate",
    "Catholic laicized prelate",
    "Catholic Cardinal prelate",
    "Orthodox Old Rite prelate",
    "Eastern Orthodox prelate",
    "United Methodist prelate",
    "Church of South prelate",
    "Episcopalian prelate",
    "Catholic ex prelate",
    "Episcopal prelate",
    "Apostolic prelate",
    "Mar Thoma prelate",
    "episcopal prelate",
    "Church of prelate",
    "Angelican prelate",
    "Orthodox prelate",
    "Lutheran prelate",
    "Maronite prelate",
    "Coptic prelate",
    "Jewish prelate",
    "Mormon prelate",
    "prelate",
    "biblical",
    "Catholic nun and",
    "Apostolic nuncio",
    "Benedictine nun",
    "Poor Clare nun",
    "Catholic nun",
    "and nun",
    "nun",
]
social = [
    "and confectioner",  # before business_farming
]
crime = [
    "terrorist involved in the Glasgow International Airport attack",
    "founder and commander in chief of terrorist organization FARC",
    "terrorist and a commander of Abu Sayyaf",
    "islamist terrorist group leader",
    "Arabian suspected terrorist",
    "terrorist in Bali bombings",
    "al Qaeda terrorist",
    "domestic terrorist",
    "Arabian terrorist",
    "and terrorist",
    "terrorist",
]
event_record_other = []
other_species = []


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [44]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [45]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-missing value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier match
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 2min 46s
Wall time: 2min 46s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
54179,21,Don Randall,", 62, Australian politician, MP for Swan , suspected heart attack.",https://en.wikipedia.org/wiki/Don_Randall_(politician),17,2015,July,and Canning since,,,MP for Swan,suspected heart attack,,,,,,,,62.0,,Australia,,1996 1998 and Canning since 2001,2.890372,0,0,0,0,0,0,0,0,1,0,0,0,1
14844,26,Mamo Wolde,", 69, Ethiopian Olympic long-distance runner , liver cancer.",https://en.wikipedia.org/wiki/Mamo_Wolde,15,2002,May,"gold medal, silver medal, bronze medal",,,liver cancer,,,,,,,,,69.0,,Ethiopia,,"1968 gold medal, 1968 silver medal, 1972 bronze medal",2.772589,0,0,0,0,0,0,1,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [46]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 7672 entries without any known_for category.
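
The counts printed after each pass in this section (9538, 9016, 7672) give a quick measure of progress per iteration. The short calculation below is only a sketch, with the counts copied by hand from the outputs above rather than recomputed from `df`.

```python
# Uncategorized-entry counts copied from the outputs in this section
remaining = [9538, 9016, 7672]

# Entries newly categorized by each pass
reductions = [before - after for before, after in zip(remaining, remaining[1:])]

for before, after, drop in zip(remaining, remaining[1:], reductions):
    print(f"{before} -> {after}: {drop} entries categorized ({drop / before:.1%})")
```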



#### Observations:
- This iteration reduced the entries without a `known_for` category from 9016 to 7672. We will rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [47]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [48]:
# # Code to check each value
# roles_list.pop()


In [49]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "President of" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [50]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [51]:
# Example code to quick-check a specific entry
df[df["info_2"] == "puppeteer and visual effects technician ; multiple sclerosis"]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
54136,18,Brock Winkless,", 55, American puppeteer and visual effects technician ; multiple sclerosis.",https://en.wikipedia.org/wiki/Brock_Winkless,3,2015,July,", ,",,puppeteer and visual effects technician ; multiple sclerosis,,,,,,,,,,55.0,,United States of America,,", ,",1.386294,0,0,0,0,0,0,0,0,0,0,0,0,0



#### Creating Lists for Each `known_for` Category

In [52]:
# Creating lists for each category
politics_govt_law = [
    "advocate for women education",  # before academia_humanities
    "sexual education advocate",
    "radical",
    "Conservative Member of Parliament and former government minister",  # before spiritual
    "president of land and former prime minister of the Republic",
    "minister for social welfare in the Punjab province",
    "opposition leader and former prime minister",
    "minister and Lord of Appeal in Ordinary",
    "minister of labor and social security",
    "Haryana four time chief minister",
    "Conservative government minister",
    "first chief minister of Sikkim",
    "deputy and former minister",
    "chief minister of the ACT",
    "Liberal cabinet minister",
    "and government minister",
    "finance minister since",
    "Nonconformist minister",
    "Prime minister of land",
    "deputy prime minister",
    "n minister of Housing",
    "and deputy minister",
    "government minister",
    "interior minister",
    "highways minister",
    "foreign minister",
    "Cabinet minister",
    "cabinet minister",
    "finance minister",
    "Foreign minister",
    "prime minister",
    "advocate for disadvantaged students",
    "child safety campaigner",
    "WHO official",
    "prince and Head of the House of Hesse",
    "prince and head of the dynasty",
    "Qing dynasty imperial prince",
    "prince and Archduke of Este",
    "Imperial prince",
    "Arabian prince",
    "princeling",
    "prince",
]

arts = [
    "who created the Ladybird books",
    "Hall of Fame lyricist",
    "Broadway lyricist",
    "and lyricist",
    "lyricist",
    "theatrical and television set and costume designer",
    "costume designer and Academy Award winner",
    "Tony Award winning costume designer",
    "and Broadway costume designer",
    "stage and costume designer",
    "set and costume designer",
    "costume designer",
    "keyboardist and a founding member of the progressive rock group Camel",
    "keyboardist for Bruce Springsteen & The E Street Band",
    "keyboardist for Janis Joplin and The Band",
    "keyboardist and sound technician",
    "progressive rock keyboardist",
    "keyboardist and vocalist",
    "Hall of Fame keyboardist",
    "keyboardist of Spirit",
    "blues keyboardist",
    "rock keyboardist",
    "funk keyboardist",
    "keyboard player",
    "keyboardist",
    "diarist",
    "role playing game designer",
    "roleplaying game designer",
    "board game designer",
    "game designer and",
    "wargame designer",
    "game designer",
    "gardening expert and radio presenter",
    "voice over and radio presenter",
    "TV and radio presenter",
    "DJ and radio presenter",
    "BBC radio presenter",
    "radio presenter and",
    "radio presenter",
    "puppeteer and visual effects technician ; multiple sclerosis",
    "ventriloquist and puppeteer",
    "puppeteer and ventriloquist",
    "wayang golek puppeteer",
    "wayang puppeteer",
    "potehi puppeteer",
    "puppeteer",
    "ceramicist and potter",
    "Cochiti Pueblo potter",
    "potter and ceramist",
    "studio potter",
    "potter and",
    "and potter",
    "potter",
    "multi instrumentalist and entertainer",
    "TV entertainer most active in y",
    "talk show host and entertainer",
    "guitar maker and entertainer",
    "trumpeter and entertainer",
    "transvestite entertainer",
    "transsexual entertainer",
    "entertainer and manager",
    "vaudeville entertainer",
    "children entertainer",
    "street entertainer",
    "comic entertainer",
    "drag entertainer",
    "entertainer and",
    "entertainer",
]
sports = [
    "motorcycle speedway rider and Formula One driver",
    "motorcycle speedway rider and coach",
    "speedway and ice speedway rider",
    "speedway rider and promoter",
    "motorcycle speedway rider",
    "speedway rider",
    "handball player and coach",
    "East handball player",
    "handball player",
    "Hall of Fame college baseball coach",
    "college baseball coach and player",
    "basketball and baseball coach",
    "Hall of Fame baseball coach",
    "baseball coach and official",
    "baseball coach and manager",
    "college baseball coach",
    "baseball coach",
]
sciences = [
    "daffodil breeder",
    "NASA space science administrator and a leader in satellite communications",
    "computer science pioneer",
    "materials science expert",
    "mathematics and science",
    "neuroscience researcher",
    "and information science",
    "science and technology",
    "paediatric sciences",
    "cognitive science",
    "chemical sciences",
    "computer science",
    "neuroscience",
    "science",
    "clinical and research pediatrician",
    "pediatrician and founding",
    "pediatrician and",
    "pediatrician",
    "paleoentomologist and coleopterist",
    "entomologist and lepidopterist",
    "entomologist and toxicologist",
    "forensic entomologist",
    "entomologist and",
    "entomologist",
    "agronomist and tea expert",
    "agronomist",
    "neurologist and epileptologist",
    "pediatric neurologist",
    "neurologist and",
    "neurologist",
    "paediatrician and sudden infant death syndrome researcher",
    "paediatrician and",
    "paediatrician",
    "epidemiologist and infectionist",
    "epidemiologist and oncologist",
    "cardiovascular epidemiologist",
    "dental epidemiologist",
    "epidemiologist and",
    "epidemiologist",
    "cosmonaut trainer",
    "first cosmonaut",
    "cosmonaut",
]


business_farming = [
    "wine pioneer and vineyard owner",
    "wine pioneer",
]
academia_humanities = [
    "education reformer and administrator",
    "educationist and intellectual",
    "educational administrator",
    "adult education innovator",
    "physical education expert",
    "education administrator",
    "educational researcher",
    "educational consultant",
    "educational theorist",
    "education consultant",
    "education proponent",
    "educationalist and",
    "education reformer",
    "education official",
    "educational leader",
    "education leader",
    "and educationist",
    "educationalist",
    "educationist",
    "educational",
    "education",
    "classicist and digital humanist",
    "secular humanist",
    "humanist",
]
law_enf_military_operator = [
    "spy and the first chief of intelligence agency",
    "MI agent and spy for the Union",
    "who was a spy for the Union",
    "atomic spy for the Union",
    "spy and double agent",
    "insurgent spy and",
    "spy for the Union",
    "spy for the Stasi",
    "colonel and spy",
    "former KGB spy",
    "Union spy in",
    "alleged spy",
    "Allied spy",
    "spy chief",
    "and spy",
    "spy",
    "deputy minister of intelligence",  # before spiritual
    "railroad",
]
spiritual = [
    "Lingayat spiritual leader",
    "minister of the Church of the Intercession in Harlem",
    "evangelist and Southern Baptist minister",
    "Unitarian Universalist minister",
    "ist and Presbyterian minister",
    "Romani Pentecostal minister",
    "Southern Baptist minister",
    "Protestant minister and",
    "Congregational minister",
    "Christian minister and",
    "Unitarian minister and",
    "Presbyterian minister",
    "Pentecostal minister",
    "evangelical minister",
    "Protestant minister",
    "Methodist minister",
    "Christian minister",
    "Church of minister",
    "religious minister",
    "minister in Harlem",
    "Lutheran minister",
    "Nazarene minister",
    "ordained minister",
    "Baptist minister",
    "baptist minister",
    "Quaker minister",
    "youth minister",
    "minister and",
    "minister",
    "Pentecostal clergyman",
    "Church of clergyman",
    "Episcopal clergyman",
    "Anglican clergyman",
    "Catholic clergyman",
    "Lutheran clergyman",
    "Maronite clergyman",
    "Orthodox clergyman",
    "Anglican clergy",
    "Jewish  clergy",
    "clergyman and",
    "clergyman in",
    "clergyman",
]
social = [
    "community organizer",
]
crime = []
event_record_other = [
    "Hispanic John Jay College of Criminal Justice student",
    "student at South Hadley High School",
    "student and hazing victim",
    "high school student",
    "centenarian student",
    "student and victim",
    "graduate student",
    "exchange student",
    "college student",
    "honors student",
    "Ph D student",
    "law student",
    "student",
]
other_species = []

In [53]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp
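
The descending-length order matters because the extraction loop removes each matching role from the text as soon as it finds it. If a short role such as "spy" is tested first, it is stripped out of a longer phrase and leaves a fragment behind. The sketch below illustrates this on a toy list; the helper `extract_roles` is hypothetical and is not part of this notebook.

```python
def extract_roles(text, roles):
    """Strip each role found in text, in list order; return (matched roles, remainder)."""
    matched = []
    for role in roles:
        if role in text:
            matched.append(role)
            text = text.replace(role, "").strip()
    return matched, text


# Toy list in an unhelpful order: the shortest role comes first
roles = ["spy", "spy chief", "alleged spy"]

# "spy" matches first, so the fragment "alleged" is left behind
unsorted_result = extract_roles("alleged spy", roles)

# Longest-first order consumes the whole phrase in one match
sorted_roles = sorted(set(roles), key=len, reverse=True)
sorted_result = extract_roles("alleged spy", sorted_roles)

print(unsorted_result)
print(sorted_result)
```

With the longest role tested first, nothing is left in the text for a later, shorter role to match by accident.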

In [54]:
# Hard-coding info_2 value for entry to capture event_record_other category only
index = df[df["link"] == "https://en.wikipedia.org/wiki/Eve_Carson"].index
df.loc[index, "info_2"] = "student"

# Hard-coding cause_of_death for entry with value in info_2
index = df[df["link"] == "https://en.wikipedia.org/wiki/Brock_Winkless"].index
df.loc[index, "cause_of_death"] = "multiple sclerosis"

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [55]:
# Combining separate lists into one dictionary
known_for_dict = {
    "politics_govt_law": politics_govt_law,
    "social": social,
    "business_farming": business_farming,
    "sciences": sciences,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "academia_humanities": academia_humanities,
    "spiritual": spiritual,
}
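
The key order of `known_for_dict` sets the search priority across categories, which is what the `# before ...` comments in the lists refer to. Python dictionaries preserve insertion order, so a category listed earlier consumes its roles before a later category can match a substring of them. A minimal sketch with a toy two-category dictionary follows; the helper `categorize` is hypothetical.

```python
def categorize(text, search_dict):
    """Return the categories hit for text, searching in dictionary order."""
    hits = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in text:
                hits.append(category)
                text = text.replace(role, "").strip()
    return hits


entry = "deputy minister of intelligence"

# law_enf_military_operator is searched before spiritual, as in the dictionary above
intended_order = {
    "law_enf_military_operator": ["deputy minister of intelligence"],
    "spiritual": ["minister"],
}

# Reversed order: "minister" is stripped first and the longer role never matches
reversed_order = {
    "spiritual": ["minister"],
    "law_enf_military_operator": ["deputy minister of intelligence"],
}

hits_intended = categorize(entry, intended_order)
hits_reversed = categorize(entry, reversed_order)
print(hits_intended, hits_reversed)
```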

#### Extracting Category from `info_2`

In [56]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 2min 18s
Wall time: 2min 18s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
52556,23,Herberto Hélder,", 84, Portuguese poet.",https://en.wikipedia.org/wiki/Herberto_H%C3%A9lder,3,2015,March,,,,,,,,,,,,,84.0,,Portugal,,,1.386294,0,0,0,0,0,1,0,0,0,0,0,0,1
91676,29,Michel Egloff,", 80, Swiss prehistorian and archeologist.",https://en.wikipedia.org/wiki/Michel_Egloff,6,2021,July,,,,,,,,,,,,,80.0,,Switzerland,,,1.94591,0,0,0,1,0,0,0,0,0,0,0,0,1
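
The loop above reads and writes one cell at a time with `df.loc`, once per role per row, which likely accounts for most of the run time. A possible alternative is one vectorized pass per role using the pandas string methods. The sketch below is only an illustration on toy data: the function `extract_categories` is hypothetical, and we have not timed it on the full dataset.

```python
import pandas as pd


def extract_categories(df, column, search_dict):
    """Flag categories and strip matched roles from column, one vectorized pass per role."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring test; missing values count as no match
            mask = df[column].str.contains(role, regex=False, na=False)
            if mask.any():
                df.loc[mask, category] = 1
                df.loc[mask, column] = (
                    df.loc[mask, column].str.replace(role, "", regex=False).str.strip()
                )
    return df


# Toy data standing in for the real dataframe
toy = pd.DataFrame(
    {
        "info_2": ["alleged spy", None, "Baptist minister and author", ""],
        "law_enf_military_operator": 0,
        "spiritual": 0,
    }
)
toy_dict = {
    "law_enf_military_operator": ["alleged spy", "spy"],
    "spiritual": ["Baptist minister", "minister"],
}

toy = extract_categories(toy, "info_2", toy_dict)
print(toy)
```

The search order is unchanged: categories follow the dictionary order and roles follow the list order, so the longest-first convention still applies.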


#### Checking the Number of Rows without a First Category

In [57]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 6840 entries without any known_for category.


#### Observations:
- 6840 entries still have no `known_for` category, so we will rebuild `known_for_dict` with new role lists for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [58]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [59]:
# # Code to check each value
# roles_list.pop()

In [60]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "UFO" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )
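
The same lookup can be written without the list comprehension by using a boolean mask from `str.contains`. The sketch below runs on a toy dataframe (the values are illustrative only) and is meant as an equivalent form, not a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for df; only the info_2 column is needed here
toy = pd.DataFrame(
    {"info_2": ["UFO researcher", None, "early UFOlogist", "UFO researcher", "painter"]}
)

# Rows whose info_2 contains the search term; missing values are treated as no match
mask = toy["info_2"].str.contains("UFO", regex=False, na=False)

# Unique matching values, most frequent first
specific_roles_list = toy.loc[mask, "info_2"].value_counts().index.tolist()
print(specific_roles_list)
```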

In [61]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)

In [62]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "al theorist"]

#### Creating Lists for Each `known_for` Category

In [63]:
# Creating lists for each category
politics_govt_law = [
    "two term member of County Board of Supervisors",
    "polemicist and conspiracy theorist",
    "RMS Titanic conspiracy theorist",
    "civil rights movement leader",
    "and conspiracy theorist",
    "civil rights campaigner",
    "civil rights pioneer",
    "conspiracy theorist",  # before academia_humanities
    "civil rights worker",
    "law and economics",
    "health economics",
    "economic advisor",
    "Western Marxist",
    "and economic",
    "civil rights",
    "economics",
    "economic",
    "planning",
    "Marxist",
]

arts = [
    "children television host and personality",
    "interior designer and television host",
    "television reporter and interviewer",
    "television and radio talk show host",
    "Bravo television network trainer on",
    "radio and television broadcasting",
    "television  for over four decades",
    "ventriloquist and television and",
    "television host and interviewer",
    "talk radio and television host",
    "state television correspondent",
    "television and  and presenter",
    "television news correspondent",
    "CBC radio and television host",
    "satirist and television host",
    "television reporter and host",
    "television fitness presenter",
    "reporter and television host",
    "television weather presenter",
    "pitchman and television host",
    "television production mogul",
    "public television innovator",
    "television camera operator",
    "television news presenter",
    "radio and television host",
    "television quiz show host",
    "children television host",
    "television color analyst",
    "television station owner",
    "television correspondent",
    "television news reporter",
    "television reality star",
    "television reporter and",
    "reality television star",
    "public television host",
    "television host and DJ",
    "television  presenter",
    "television newsreader",
    "television show host",
    "and television host",
    "television reporter",
    "television designer",
    "television station",
    "television pioneer",
    "television hostess",
    "television newsman",
    "television anchor",
    "television  since",
    "television emcee",
    "television host",
    "television star",
    "television and",
    "urban planning",  # before politics_govt_law
    "television",
]
sports = []
sciences = [
    "founder and principal theorist of Re evaluation Counseling",  # before academia_humanities
    "pioneer in anesthesiology and pain management",  # before business_farming
    "information security specialist",
    "of communication and media",
    "family planning pioneer",  # before politics_govt_law
    "information researcher",
    "who built Alaska first",
    "nursing administrator",
    "telecommunications",
    "nursing assistant",
    "control theorist",
    "early UFOlogist",
    "analytic number",
    "graph theorist",
    "UFO researcher",
    "communications",
    "communication",
    "consciousness",
    "information",
    "nursing",
    "number",
    "UFO",
]


business_farming = [
    "product planning manager",  # before politics_govt_law
    "management consultant",
    "management studies",
    "marketing manager",
    "marketing agent",
    "management guru",
    "organizational",
    "organisational",
    "management",
    "marketing",
]

academia_humanities = [
    "theorist in Arte Povera movement",
    "theorist and archivist",
    "pedagogical theorist",
    "curriculum theorist",
    "literary theorist",
    "cultural theorist",
    "lesbian theorist",
    "design theorist",
    "social theorist",
    "al theorist",
    "theorist",
]
law_enf_military_operator = []
spiritual = [
    "Yup'ik traditional healer",
    "Shia cleric and ayatollah",
    "traditional healer",
    "Shia cleric",
    "Shia",
]
social = []
crime = []
event_record_other = [
    "alleged UFO witness",  # before sciences
]
other_species = []

In [64]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(spiritual)), key=lambda x: len(x), reverse=True)
# temp

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [65]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "event_record_other": event_record_other,
    "sciences": sciences,
    "arts": arts,
    "business_farming": business_farming,
    "politics_govt_law": politics_govt_law,
    "academia_humanities": academia_humanities,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "other_species": other_species,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [66]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 3s
Wall time: 1min 3s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
24979,11,Frank Piasecki,", 88, American aeronautical engineer who invented the tandem rotor placement in helicopter design, stroke.",https://en.wikipedia.org/wiki/Frank_Piasecki,10,2008,February,,,,stroke,,,,,,,,,88.0,,United States of America,,,2.397895,1,0,0,0,0,0,0,0,0,0,0,0,1
92476,2,Josephine Medina,", 51, Filipino table tennis player, Paralympic bronze medallist .",https://en.wikipedia.org/wiki/Josephine_Medina,11,2021,September,,,,Paralympic bronze medallist,,,,,,,,,51.0,,Philippines,,2016.0,2.484907,0,0,0,0,0,0,1,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [67]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 6681 entries without any known_for category.


#### Observations:
- 6681 entries still have no `known_for` category (down from 6840), so we will rebuild `known_for_dict` with new role lists for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [68]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [69]:
# # Code to check each value
# roles_list.pop()

In [70]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "driver" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [71]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)

In [72]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "driver"]

#### Creating Lists for Each `known_for` Category

In [73]:
# Creating lists for each category
politics_govt_law = [
    "assisted suicide advocate",
    "proponent of the single market and Vice President of the Commission",
    "nd President of the Republic and Emperor of Central",
    "first Vice President of the Navajo Nation",
    "President of the Maryland State Senate",
    "President of Federal Reserve Bank of",
    "revolutionary and President of istan",
    "President of the People Republic of",
    "President of the Arab Republic",
    "Vice President of the Republic",
    "President of during The Junta",
    "President of the Bundestag",
    "first President of Senegal",
    "statesman and President of",
    "President of the Republic",
    "former President of South",
    "President of from to",
    "th President of the",
    "Sixth President of",
    "President of the",
    "th President of",
    "nd President of",
    "President of",
    "white supremacist and Holocaust denier",  # before event_record_other
    "solicitor and Holocaust denier",
    "noblewoman and member of the House of Bourbon Two Sicilies",
    "noblewoman and daughter of Victor Emmanuel III of",
    "noblewoman and daughter of King Alfonso XIII",
    "noble and member of the Vanderbilt family",
    "noble and fourth wife of emperor Puyi",
    "noblewoman and monarchist",
    "noble and Infante",
    "noblewoman and",
    "and noblewoman",
    "noblewoman",
    "nobleman",
    "noble",
    "congressman and labor leader",
    "labor leader",
    "bus driver",  # before event_record_other
]

arts = [
    "fashion and bridal wear designer and retailer",
    "designer and the founder of Landor Associates",
    "stone letter carver and typeface designer",
    "designer for Dansk International Designs",
    "production designer and set decorator",
    "Hall of Fame theatrical set designer",
    "designer of Raleigh Chopper bicycle",
    "electric guitar designer and maker",
    "interior decorator and designer",
    "furniture and interior designer",
    "typographer and book designer",
    "interior and product designer",
    "designer of printed textiles",
    "theatrical lighting designer",
    "designer and window dresser",
    "costume and makeup designer",
    "designer of women lingerie",
    "modernist textile designer",
    "ceramic and glass designer",
    "rock album cover designer",
    "Emmy winning set designer",
    "fashion muse and designer",
    "designer of the star flag",
    "costume jewelry designer",
    "and production designer",
    "designer of board games",
    "roller coaster designer",
    "decorator and designer",
    "production designer",
    "embroidery designer",
    "theatrical designer",
    "restaurant designer",
    "scenic designer and",
    "women shoe designer",
    "furniture designer",
    "golf club designer",
    "landscape designer",
    "animation designer",
    "newspaper designer",
    "interior designer",
    "lighting designer",
    "typeface designer",
    "footwear designer",
    "ceramics designer",
    "logotype designer",
    "textile designer",
    "jewelry designer",
    "handbag designer",
    "scenic designer",
    "grotto designer",
    "puzzle designer",
    "puppet designer",
    "stage designer",
    "sound designer",
    "urban designer",
    "glass designer",
    "chief designer",
    "shoe designer",
    "type designer",
    "book designer",
    "flag designer",
    "set designer",
    "web designer",
    "toy designer",
    "TV designer",
    "designer",
    "board shaper",
    "literary guardian and the only child of Agatha Christie",
    "literary executor of James Joyce",
    "talent and literary agent",
    "literary preservationist",
    "literary figure",
    "literary agent",
    "literary",
    "toymaker",
    "news correspondent and essayist",
    "and essayist",
    "essayist",
]
sports = [
    "World surfing champion",
    "professional surfer",
    "surfing legend",
    "windsurfer",
    "surfing",
    "surfer",
    "surf",
    "rodeo cowboy and professional poker player",
    "Hall of Fame poker player and",
    "backgammon and poker player",
    "bookmaker and poker player",
    "professional poker player",
    "Hall of Fame poker player",
    "bridge and poker player",
    "professional poker",
    "poker player and",
    "poker player",
    "former President of FISA and later FIA",  # before politics_govt_law
    "baseball third baseman outfielder who played for the Cincinnati Reds",
    "baseball manager and member of the Baseball Hall of Fame",
    "first Black baseball pitcher to win a World Series game",
    "baseball pitcher and member of the MLB Hall of Fame",
    "baseball relief pitcher for the St Louis Cardinals",
    "baseball first base coach for the Tulsa Drillers",
    "controversial son of baseball great Ted Williams",
    "longtime minor league baseball record holder",
    "baseball second basemen and shortstop",
    "baseball pitcher and pitching coach",
    "baseball pitcher for Boston Red Sox",
    "baseball infielder and outfielder",
    "baseball outfielder and manager",
    "Hall of Fame baseball manager",
    "baseball and softball player",
    "college baseball head coach",
    "baseball manager and coach",
    "baseball clubhouse manager",
    "AAGPBL baseball pitcher",
    "baseball pitcher player",
    "baseball second baseman",
    "baseball college coach",
    "baseball right fielder",
    "baseball third baseman",
    "baseball administrator",
    "baseball  researcher",
    "baseball researcher",
    "baseball club owner",
    "baseball outfielder",
    "baseball shortstop",
    "baseball promoter",
    "baseball pitcher",
    "baseball manager",
    "baseball catcher",
    "baseball figure",
    "baseball batboy",
    "baseball owner",
    "baseball fan",
    "baseball",
    "stock car race driver and member of the NASCAR Hall of Fame",  # before event_record_other
    "rally driver and principal of the Toyota F racing team",
    "NASCAR driver and member of the NASCAR Hall of Fame",
    "chief test driver for Toyota Motor Company",
    "Hall of Fame NASCAR driver and owner",
    "retired stock car and NASCAR driver",
    "car racing team owner and driver",
    "USAC champion midget car driver",
    "sulky driver and horse trainer",
    "Alfa Romeo works' test driver",
    "motor racing and rally driver",
    "NASCAR driver and team owner",
    "professional drifting driver",
    "rally driver and team owner",
    "off road race truck driver",
    "Hall of Fame NASCAR driver",
    "NASCAR Busch Series driver",
    "world tour rally driver",
    "NASCAR driver and owner",
    "and Formula Two driver",
    "racing and test driver",
    "champion NASCAR driver",
    "IndyCar Series driver",
    "former NASCAR driver",
    "Formula One driver",
    "NASCAR race driver",
    "racing car driver",
    "V Supercar driver",
    "Grand Prix driver",
    "rally raid driver",
    "stock car driver",
    "rally co driver",
    "IndyCar driver",
    "NASCAR driver",
    "rally driver",
    "race driver",
    "test driver",
    "NHRA driver",
]
sciences = [
    "designer of the Pitts Special and other aircraft",  # before arts
    "aircraft designer and rocketry pioneer",
    "automobile designer and constructor",
    "designer of the Uzi submachine gun",
    "designer of hi fi audio equipment",
    "racing car and engine designer",
    "racecar and aircraft designer",
    "engine designer and builder",
    "integrated circuit designer",
    "marine equipment designer",
    "boat builder and designer",
    "mechanic and car designer",
    "boat designer and builder",
    "and automobile designer",
    "and aircraft designer",
    "synthesizer designer",
    "and weapons designer",
    "automobile designer",
    "automotive designer",
    "motorcycle designer",
    "custom car designer",
    "spacecraft designer",
    "and rocket designer",
    "motorboat designer",
    "powerboat designer",
    "aircraft designer",
    "computer designer",
    "firearms designer",
    "and boat designer",
    "product designer",
    "weapons designer",
    "car designer and",
    "rocket designer",
    "glider designer",
    "automotive and",
    "boat designer",
    "auto designer",
    "tank designer",
    "car designer",
    "gun designer",
    "pinball",
    "physiologist and sleep researcher",
    "physiologist and nutritionist",
    "nutritional physiologist",
    "plant physiologist",
    "fetal physiologist",
    "psychophysiologist",
    "neurophysiologist",
    "physiologist",
    "pharmacist and scientific researcher",
    "pharmacist and vilification victim",
    "pharmacist and",
    "pharmacist",
]

business_farming = [
    "billionaire automotive manufacturer",
    "automotive products manufacturer",
    "billionaire drug manufacturer",
    "knife maker and manufacturer",
    "roller coaster manufacturer",
    "wind turbine manufacturer",
    "generic drug manufacturer",
    "bicycle tool manufacturer",
    "racing car manufacturer",
    "sunscreen manufacturer",
    "clothing manufacturer",
    "chemical manufacturer",
    "footwear manufacturer",
    "bushwear manufacturer",
    "textile manufacturer",
    "cigar manufacturer",
    "food manufacturer",
    "oven manufacturer",
    "toy manufacturer",
    "manufacturer",
    "pioneer in the field of industrial",
    "industrial relations",
    "industrial manager",
    "industrial",
    "carpet distributor",
    "nightclub owner and property developer",
    "billionaire property developer",
    "property developer and",
    "property developer",
]
academia_humanities = [
    "former President of Hillsdale College",  # before politics_govt_law
    "President of Gadjah Mada University",
    "President of Princeton University",
    "President of University of Beirut",
    "President of Hebrew Union College",
    "lector",
    "archivist who led the National Archives and Records Administration",
    "archivist for the Rudolf Nureyev Foundation",
    "archivist and administrator",
    "archivist",
]
law_enf_military_operator = [
    "anti communist resistance fighter",
    "Jewish resistance fighter",
    "WWII resistance fighter",
    "resistance fighter",
    "taxi driver",  # before event_record_other
]
spiritual = [
    "President of The Church of Jesus Christ of Latter day Saints",  # before politics_govt_law
    "President of the Church of",
    "founder and leader of religious movement The Family International",
    "religious sect founder and former",
    "Catholic religious brother",
    "Catholic religious sister",
    "religious cult leader",
    "religious sect leader",
    "religious extremist",
    "religious studies",
    "religious advisor",
    "religious figure",
    "religious mystic",
    "religious sister",
    "religious books",
    "religious chief",
    "religious",
    "evangelist and alleged doomsday predictor",
    "wife of evangelist Billy Graham",
    "fundamentalist evangelist and",
    "gospel  Christian evangelist",
    "Pentecostal evangelist",
    "Christian evangelist",
    "Lutheran evangelist",
    "evangelist",
]
social = [
    "President of Orbis International",  # before politics_govt_law
]
crime = [
    "and Holocaust perpetrator",  # before event_record_other
]
event_record_other = ["and Holocaust", "vilification victim", "driver"]
other_species = []
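
Because these lists are assembled by hand, an ordering slip is easy to miss: a role placed before a longer role that contains it will strip part of the longer phrase first. A small check like the one below could flag such pairs before the extraction runs. The helper `find_shadowed_roles` is hypothetical, and the two short lists are toy examples rather than the full lists above.

```python
def find_shadowed_roles(roles):
    """Return (earlier, later) pairs where an earlier role is a substring of a later one."""
    shadowed = []
    for position, earlier in enumerate(roles):
        for later in roles[position + 1 :]:
            if earlier != later and earlier in later:
                shadowed.append((earlier, later))
    return shadowed


# Longest-first order: no role is hidden by an earlier, shorter one
well_ordered = ["noblewoman and", "noblewoman", "noble"]

# "noble" comes too early and would be stripped out of "noblewoman and"
badly_ordered = ["noblewoman", "noble", "noblewoman and"]

print(find_shadowed_roles(well_ordered))
print(find_shadowed_roles(badly_ordered))
```

An empty result means every role is tested before any of its substrings, which is the property the descending-length sort is meant to guarantee within a list.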

In [74]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [75]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "sports": sports,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "other_species": other_species,
    "arts": arts,
    "event_record_other": event_record_other,
}

#### Extracting Category from `info_2`

In [76]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 2min 51s
Wall time: 2min 51s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
75173,8,Bob Henderson,", 85, Australian footballer .",https://en.wikipedia.org/wiki/Bob_Henderson_(Australian_footballer),5,2019,June,Fitzroy,,,,,,,,,,,,85.0,,Australia,,Fitzroy,1.791759,0,0,0,0,0,0,1,0,0,0,0,0,1
42817,10,Sir Robert Edwards,", 87, British physiologist, Nobel Prize laureate .",https://en.wikipedia.org/wiki/Robert_Edwards_(physiologist),41,2013,April,,,,Nobel Prize laureate,,,,,,,,,87.0,,United Kingdom of Great Britain and Northern Ireland,,2010,3.73767,1,0,0,0,0,0,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [77]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 5932 entries without any known_for category.
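
To put the iteration counts in context, the short sketch below computes how many entries each rebuild of `known_for_dict` categorized and what share of rows remains uncategorized. It assumes the row count printed when the data was loaded (98056) still applies at this point in the notebook.

```python
# Remaining uncategorized entries reported after each iteration in this section
remaining = [6840, 6681, 5932]

# Assumption: the row count printed at load time is unchanged
total_rows = 98056

# Entries newly categorized by each rebuild of known_for_dict
gains = [before - after for before, after in zip(remaining, remaining[1:])]

# Share of all rows still lacking a category after each iteration
shares = [count / total_rows for count in remaining]

for count, share in zip(remaining, shares):
    print(f"{count} uncategorized entries ({share:.2%} of rows)")
print(f"Newly categorized per iteration: {gains}")
```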


#### Observations:
- 5932 entries still have no `known_for` category (down from 6681), so we will rebuild `known_for_dict` with new role lists for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [78]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [79]:
# # Code to check each value
# roles_list.pop()

In [80]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "Orthodox Jewish" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [81]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)

In [82]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "Orthodox Jewish intellectual and polymath"]

#### Creating Lists for Each `known_for` Category

In [83]:
# Creating lists for each category
politics_govt_law = [
    "Christian boy who became a symbol",  # before spiritual
    "widow of Chiang Ching kuo and First Lady of the Republic of on",
    "former First Lady of Montana",
    "First Lady of New Hampshire",
    "First Lady of West Virginia",
    "First Lady of North Dakota",
    "gubernatorial First Lady",
    "First Lady of Maryland",
    "First Lady of Illinois",
    "First Lady of Michigan",
    "First Lady of Alabama",
    "First Lady of the",
    "South First Lady",
    "First Lady and",
    "First Lady of",
    "First Lady",
    "and animal welfare advocate",
    "Deputy Prime Minister and Mayor of Bucharest during the Communist era",
    "widow of former Prime Minister and President Georges Pompidou",
    "Prime Minister of the and first female speaker of the House",
    "Prime Minister of the Socialist Republic of",
    "Prime Minister of under Mobutu Sese Seko",
    "wife of Prime Minister Morgan Tsvangirai",
    "Prime Minister and Member of Lok Sabha",
    "former President and Prime Minister of",
    "wife of Prime Minister Yitzhak Rabin",
    "former Prime Minister of the Union",
    "Prime Minister of the Republic",
    "former Deputy Prime Minister",
    "Prime Minister of Southern",
    "first Prime Minister of",
    "First Prime Minister of",
    "Prime Minister of the",
    "former Prime Minister",
    "th Prime Minister of",
    "Prime Minister since",
    "nd Prime Minister of",
    "st Prime Minister of",
    "Prime Minister wife",
    "Prime Minister of",
    "Prime Minister",
    "iconoclast",
    "public official",
    "labour union leader",
    "union leader and",
    "union leader",
    "Lieutenant Governor of Arkansas since",
    "Communist revolutionary figure",
    "communist and revolutionary",
    "communist revolutionary",
    "Communist revolutionary",
    "marxist revolutionary",
    "revolutionary and",
    "revolutionary",
    "statesman and founding father of the Republic",
    "General of Internal Service and statesman",
    "and statesman",
    "statesman",
    "chairman of the Jewish Defence League",  # before spiritual
    "Jewish leader",
]

arts = [
    "Tony Award winning publicist",
    "publicist and talent manager",
    "publicist",
    "percussionist and band leader",
    "percussionist and bandleader",
    "salsa percussionist",
    "percussionist",
    "trumpeter and flugelhorn player",
    "trumpeter and bandleader",
    "bandleader and trumpeter",
    "trumpeter and vocalist",
    "trumpeter and arranger",
    "classical trumpeter",
    "highlife trumpeter",
    "trumpet player",
    "trumpeter",
    "clarinetist and bandleader",
    "big band clarinetist",
    "clarinetist",
    "mezzo",
    "drum and bass DJ and co founder of record label Metalheadz",
    "DJ and founder of The Loft",
    "deep house and trance DJ",
    "radio DJ and presenter",
    "radio DJ and TV host",
    "R&B and rock DJ",
    "hip hop DJ",
    "reggae DJ",
    "radio DJ",
    "BBC DJ",
    "DJ",
    "founder of DX Ball",
    "radio programmer",  # before sciences
    "newspaper reporter and Pulitzer Prize winner",
    "Pulitzer Prize winning newspaper reporter",
    "BBC reporter and radio innovator",
    "Pulitzer Prize winning reporter",
    "reporter and war correspondent",
    "BBC reporter for Radio Kent",
    "broadcast news reporter",
    "investigative reporter",
    "anchor and reporter",
    "Indymedia reporter",
    "freelance reporter",
    "newspaper reporter",
    "crime reporter",
    "news reporter",
    "and reporter",
    "ITN reporter",
    "reporter",
    "humorist Georges Bernier",
    "humorist and",
    "and humorist",
    "humorist",
    'known as "the Cosmic Muffin"',
    "operatic bass baritone at the Vienna State Opera and Metropolitan Opera",
    "bass baritone at the Metropolitan Opera",
    "baritone with the Metropolitan Opera",
    "cantor and operatic baritone",
    "operatic bass baritone",
    "operatic baritone",
    "opera baritone",
    "bass baritone",
    "baritone",
    "tightrope walker",
]
sports = [
    "mountain climber and adventurer",
    "rock climber and adventurer",
    "kayaker and adventurer",
    "ocean kayak adventurer",
    "and adventurer",
    "adventurer",
    "Hall of Fame water polo player",
    "water polo player and coach",
    "polo player and coach",
    "water polo player",
    "polo player",
    "Thoroughbred horse trainer and owner",
    "thoroughbred horse trainer",
    "horse trainer and gambler",
    "champion horse trainer",
    "race horse trainer",
    "horse trainer",
    "international volleyball player",
    "volleyball player and coach",
    "and volleyball player",
    "era volleyball player",
    "volleyball player",
    "volleyball coach",
    "cricket and netball international",
    "cricket administrator and manager",
    "cricket and horse racing",
    "South cricket official",
    "cricket administrator",
    "cricket team captain",
    "cricket official",
    "cricket chairman",
    "cricket coach",
    "cricket",
    "high jumper and world record holder",
    "ski jumper who competed for USSR",
    "triple jumper and long jumper",
    "and base jumper",
    "triple jumper",
    "high jumper",
    "long jumper",
    "BASE jumper",
    "ski jumper",
    "showjumper",
]
sciences = [
    "research mycologist",
    "mycologist",
    "paleoceanographer",
    "oceanographer",
    "gemologist",
    "computer programmer known for GNU Debugger",
    "cryptographer and programmer",
    "Computer programmer and",
    "video game programmer",
    "computer programmer",
    "software programmer",
    "game programmer",
    "programmer",
    "medical oncologist and researcher",
    "biotechnologist and oncologist",
    "oncologist",
    "veterinarian and animal behaviorist",
    "veterinarian and equine specialist",
    "veterinarian and parasitologist",
    "veterinarian",
    "herpetologist",
    "fitness and nutritional expert",
    "biology and nutrition expert",
    "nutritionist",
    "palaeontologist and palaeo",
    "vertebrae palaeontologist",
    "palaeontologist",
]

business_farming = [
    "stamp printer",
    "estranged wife of billionaire investment guru Warren Buffett",
    "billionaire advertiser and hotel developer",
    "billionaire mining tycoon",
    "billionaire advertiser",
    "billionaire and",
    "billionaire",
]
academia_humanities = [
    "Professor and Mother of Kanye West",
    "Professor of Sanskrit",
    "Professor of",
    "Egyptologist",
    "intellectual and polymath",
    "Jewish ethnographer",  # before spiritual
]
law_enf_military_operator = [
    "combatant",
    "Jewish Parachutists of Mandate member",  # before spiritual
]
spiritual = [
    "evangelical Christian and founder of Campus Crusade for Christ",
    "Christian leader and monk",
    "Christian worship renewal",
    "evangelical Christian",
    "and Christian leader",
    "Christian apologist",  # before sports
    "Christian counselor",
    "Christian Doctrine",
    "Christian preacher",
    "Christian  leader",
    "Christian leader",
    "Christianity",
    "Christian",
    "spiritual leader of the Ahmadiyya Muslim movement",
    "self styled Muslim cleric and",
    "leader of the Muslim Brotherhood",
    "Muslim preacher and polygamist",
    "Muslim community leader",
    "Black Muslim  leader",
    "Shiite Muslim cleric",
    "Shi'ite Muslim marja",
    "Muslim Sufi leader",
    "Muslim  preacher",
    "Muslim preacher",
    "Muslim leader",
    "Muslim cleric",
    "Muslim",
    "Catholic Cardinal and Grand Master Emeritus of the Equestrian Order of the Holy Sepulchre of",
    "Catholic Bishop and advocate of liberation theology",
    "Patriarch of the Chaldean Catholic Church from to",
    "Catholicos Patriarch of the Church of the East",
    "founder of Catholic journal L Brent Bozell Jr",
    "Catholic Bishop of Maitland Newcastle",
    "head of the True Catholic Church",
    "Cardinal of the Catholic Church",
    "cardinal of the Catholic Church",
    "Cardinal of the Catholic church",
    "Syro Malabar Catholic hierarch",
    "Bishop of the Catholic Church",
    "Catholic Bishop of Wilmington",
    "Catholic Bishop of Santa Rosa",
    "Catholic Bishop of Hyderabad",
    "Catholic Bishop of Nashville",
    "Ruthenian Catholic hierarch",
    "Catholic Bishop of Shanghai",
    "Traditional Catholic Bishop",
    "Maronite Catholic hierarch",
    "Chaldean Catholic hierarch",
    "Catholic fraternity leader",
    "Melkite Catholic hierarch",
    "Eastern Catholic hierarch",
    "Catholic Bishop of Bauchi",
    "Catholic Bishop of Gallup",
    "Catholic Bishop of Albany",
    "Catholic Bishop of Masaka",
    "Byzantine Catholic friar",
    "Maronite Catholic eparch",
    "traditionalist Catholic",
    "c Catholic hierarch",
    "Catholic liturgical",
    "Catholic Bishop of",
    "Catholic monsignor",
    "Catholic hierarch",
    "Catholic Cardinal",
    "Catholic official",
    "Catholic Prelate",
    "Catholic pilgrim",
    "Catholic Church",
    "Catholic Bishop",
    "Catholic leader",
    "Catholic deacon",
    "Catholic layman",
    "Catholic laity",
    "Catholic abbot",
    "Catholic monk",
    "Catholic seer",
    "Catholicism",
    "Catholic",
    "astrologer",
    "Jewish chaplain at liberation of Bergen Belsen",
    "Orthodox Jewish",
    "Jewish",
]
social = [
    'of abusive child labour in "',
    "co founder of Betty Ford Center",
    "animal welfare advocate",
    "convert to the",
    "president and co founder of the Gathering of Jewish Holocaust Survivors",  # before spiritual
    "Jewish community leader",
]
crime = [
    "hostage taker",
    "Jewish woman and Gestapo collaborator during WorldWar II",  # before spiritual
]
event_record_other = []
other_species = []


In [84]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp
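
The descending-length order matters because the extraction loop uses plain substring matching and removes each role as soon as it matches. A minimal sketch of the effect, using two role strings from the `sports` list above; the helper `extract_roles` is hypothetical and only mirrors the match-and-remove step:

```python
def extract_roles(text, roles):
    """Remove each role found in text, in list order; return (matched, remainder)."""
    matched = []
    for role in roles:
        if role in text:
            matched.append(role)
            text = text.replace(role, "").strip()
    return matched, text


# Shorter role listed first: it matches inside the longer one and leaves a fragment
roles_unsorted = ["polo player", "water polo player"]

# Same roles sorted by descending length, as in the commented example above
roles_sorted = sorted(set(roles_unsorted), key=len, reverse=True)

print(extract_roles("water polo player", roles_unsorted))
print(extract_roles("water polo player", roles_sorted))
```

With the unsorted list, the leftover fragment `water` stays in the column for a later iteration to deal with; with the sorted list, the full role is consumed in one pass.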


In [85]:
# Clearing info_2 for entries that would otherwise be misclassified as spiritual
index = df[df["link"] == "https://en.wikipedia.org/wiki/Herbert_Freudenberger"].index
df.loc[index, "info_2"] = ""

index = df[df["link"] == "https://en.wikipedia.org/wiki/Leon_Uris"].index
df.loc[index, "info_2"] = ""

index = df[df["link"] == "https://en.wikipedia.org/wiki/Amnon_Netzer"].index
df.loc[index, "info_2"] = ""

index = df[df["link"] == "https://en.wikipedia.org/wiki/Liliane_Atlan"].index
df.loc[index, "info_2"] = ""

index = df[df["link"] == "https://en.wikipedia.org/wiki/Israel_Getzler"].index
df.loc[index, "info_2"] = ""

index = df[df["link"] == "https://en.wikipedia.org/wiki/Eva_J._Engel"].index
df.loc[index, "info_2"] = ""

index = df[df["link"] == "https://en.wikipedia.org/wiki/Shalom_Yoran"].index
df.loc[index, "info_2"] = ""

index = df[df["link"] == "https://en.wikipedia.org/wiki/David_Cesarani"].index
df.loc[index, "info_2"] = ""
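
The assignments above all follow one pattern, so a more compact equivalent is possible with `Series.isin`. A minimal sketch on a toy frame: `df_demo` and its `info_2` values are made-up placeholders, and only two of the links are listed for brevity:

```python
import pandas as pd

# Toy stand-in for df; info_2 values are placeholders, not the real data
df_demo = pd.DataFrame(
    {
        "link": [
            "https://en.wikipedia.org/wiki/Leon_Uris",
            "https://en.wikipedia.org/wiki/Amnon_Netzer",
            "https://en.wikipedia.org/wiki/William_Steig",
        ],
        "info_2": ["value to clear", "value to clear", "value to keep"],
    }
)

# Links whose info_2 should be blanked before extraction
links_to_clear = [
    "https://en.wikipedia.org/wiki/Leon_Uris",
    "https://en.wikipedia.org/wiki/Amnon_Netzer",
]

# One boolean mask replaces the repeated index lookups
mask = df_demo["link"].isin(links_to_clear)
df_demo.loc[mask, "info_2"] = ""

print(df_demo["info_2"].tolist())
```

Keeping the links in one list also makes it easier to see at a glance which entries were overridden by hand.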


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [86]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sciences": sciences,
    "spiritual": spiritual,
    "sports": sports,
}
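
The order of keys in `known_for_dict` is significant: Python dictionaries preserve insertion order, so categories listed earlier claim their roles first. That is what the `# before spiritual` comments in the lists refer to. A minimal sketch with two roles from the lists above; the helper `categorize` is hypothetical and only mimics the match-and-remove logic of the extraction loop:

```python
def categorize(text, search_dict):
    """Return (categories hit, leftover text) using ordered substring removal."""
    hits = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in text:
                if category not in hits:
                    hits.append(category)
                text = text.replace(role, "").strip()
    return hits, text


social_demo = ["Jewish community leader"]
spiritual_demo = ["Jewish"]

# Order used above: social is searched before spiritual
intended = {"social": social_demo, "spiritual": spiritual_demo}

# Hypothetical reversed order, for comparison only
swapped = {"spiritual": spiritual_demo, "social": social_demo}

print(categorize("Jewish community leader", intended))
print(categorize("Jewish community leader", swapped))
```

In the swapped order, the shorter spiritual role consumes part of the text, so the social role can no longer match and the entry lands in the wrong category.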


#### Extracting Category from `info_2`

In [87]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 2min 34s
Wall time: 2min 34s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
56826,6,Milton V. Backman,", 88, American historian.",https://en.wikipedia.org/wiki/Milton_V._Backman,3,2016,February,,,,,,,,,,,,,88.0,,United States of America,,,1.386294,0,0,0,1,0,0,0,0,0,0,0,0,1
53689,14,Phil Judd,", 81, English rugby union player .",https://en.wikipedia.org/wiki/Phil_Judd_(rugby_union),4,2015,June,Coventry,,,,,,,,,,,,81.0,,United Kingdom of Great Britain and Northern Ireland,,Coventry,1.609438,0,0,0,0,0,0,1,0,0,0,0,0,1
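
The loop above visits every row once per role, which likely explains the multi-minute run time. A vectorized variant is sketched below, assuming the same match-and-remove semantics are wanted. The helper name `extract_categories` and the toy frame are hypothetical, and the speed-up has not been measured on the real data:

```python
import pandas as pd


def extract_categories(df, column, search_dict):
    """Vectorized variant of the extraction loop (hypothetical helper)."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring match; missing values count as no match
            mask = df[column].str.contains(role, regex=False, na=False)
            if mask.any():
                df.loc[mask, category] = 1
                df.loc[mask, column] = (
                    df.loc[mask, column].str.replace(role, "", regex=False).str.strip()
                )
    return df


# Toy data using role strings from the lists above
demo = pd.DataFrame(
    {
        "info_2": ["operatic baritone", "hostage taker", None],
        "arts": [0, 0, 0],
        "crime": [0, 0, 0],
    }
)
demo_dict = {"arts": ["operatic baritone", "baritone"], "crime": ["hostage taker"]}

demo = extract_categories(demo, "info_2", demo_dict)
demo["num_categories"] = demo[list(demo_dict)].sum(axis=1)
print(demo)
```

Passing `regex=False` matters here because some roles contain characters such as `&` or quotation marks that should be matched literally.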



#### Checking the Number of Rows without a First Category

In [88]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 5163 entries without any known_for category.



#### Observations:
- 5163 entries still lack a `known_for` category, so we will rebuild `known_for_dict` and run another iteration.

#### Finding `known_for` Roles in `info_2`

In [89]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [90]:
# # Code to check each value
# roles_list.pop()


In [91]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "alleged child sex offender" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [92]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [93]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "federal prosecutor and alleged child sex offender"]
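
The commented cells above form a small review workflow: take the most frequent remaining value, collect every `info_2` value that contains it, and sort the results by descending length before copying them into a category list. A minimal sketch on toy data, using `Series.str.contains` in place of the list comprehension; the frame `demo` is a made-up stand-in for `df`:

```python
import pandas as pd

# Toy stand-in for df["info_2"], using role strings from the crime and arts lists
demo = pd.DataFrame(
    {"info_2": ["spree killer", "and spree killer", "spree killer", "blogger", None]}
)

# Sorted ascending by count, so pop() returns the most frequent value
roles_list = demo["info_2"].value_counts(ascending=True).index.tolist()
candidate = roles_list.pop()

# All distinct values that contain the candidate as a literal substring
mask = demo["info_2"].str.contains(candidate, regex=False, na=False)
specific_roles_list = demo.loc[mask, "info_2"].value_counts().index.tolist()

# Longest first, ready to paste into a category list
print(sorted(specific_roles_list, key=len, reverse=True))
```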


#### Creating Lists for Each `known_for` Category

In [94]:
# Creating lists for each category
politics_govt_law = [
    "advocate for disabled people",
    "campaigner for the medical use of cannabis",  # before sciences
    "revolutionary and",
    "revolutionary",
    "physical fitness advocate",
    "victim of videotaped police beating that sparked the riots",
    "urbanist",
    "urban",
    "town",
    "city",
    "simultaneous interpreter",
    "neo",
    "breast cancer advocate",
    "Law Lord and first chairman of the Committee on Standards in Public Life",
    "chairman of the Presidential Advisory Committee on the Arts",
    "chairman of the Crow Nation since",
    "chief prosecutor at the Nuremberg Trials",
    "prosecutor and gun control advocate",
    "federal prosecutor",
    "public prosecutor",
    "prosecutor",
]

arts = [
    "stage hypnotist and illusionist",
    "stage manager",
    "stage and",
    "stage",
    "trombone player for Suburban Legends",  # before politics_govt_law
    "event planner",
    "planner",
    "woman believed to be the world oldest blogger",
    "and blogger",
    "blogger and subject of",
    "blogger and influencer",
    "journalism  blogger",
    "blogger in",
    "blogger",
    "string band fiddler and mandolinist and",
    "old time fiddler and banjo player",
    "bluegrass fiddler",
    "Cajun fiddler",
    "fiddler",
    "record label chairman",
    "chairman of ABC",
]
sports = [
    "fastest centenarian over metres",  # before event_record_other
    "rock climber and",
    "mountain climber",
    "sport climber",
    "rock climber",
    "free climber",
    "climber",
    "All Girls Professional Baseball League player",
    "former Baseball pitcher and pitching coach",
    "Baseball pitcher and pitching coach",
    "Baseball pitcher and umpire",
    "Baseball All Star pitcher",
    "former Baseball All Star",
    "former Baseball pitcher",
    "former Baseball umpire",
    "Baseball Hall of Famer",
    "Baseball commissioner",
    "Baseball team owner",
    "umpire in Baseball",
    "Baseball infielder",
    "Baseball pitcher",
    "Baseball catcher",
    "Baseball manager",
    "Baseball umpire",
    "hurdler and track coach",
    "sprint hurdler",
    "hurdler",
    "former chairman of S L Benfica",
    "chairman of the AFL Commission",
    "longtime chairman of F C",
]
sciences = [
    "medical examiner and Shroud of Turin investigator",
    "and medical aid developer",
    "medical  administrator",
    "medical hypnotist",
    "medical examiner",
    "medical pioneer",
    "medical",
    "pioneer in laser technology",
    "technology pioneer",
    "biotechnology",
    "on technology",
    "technology",
]

business_farming = [
    "homesteading leader",
    "CEO and chairman of United Technologies Corporation",
    "chairman of the Volkswagen automobile company",
    "co founder and former chairman of Amway",
    "chairman of the illy coffee company",
    "chairman and CEO of General Motors",
    "marine services company chairman",
    "chairman of Alliance & Leicester",
    "chairman of the Stock Exchange",
    "chairman of Tobacco Company",
    "chairman of Hunt Petroleum",
    "chairman of Instruments",
    "chairman of Metlife",
]
academia_humanities = []
law_enf_military_operator = [
    "hostage negotiator",  # before arts
    "and former police chief",
    "federal police anti drug coordinator",
    "police chief and guerilla commander",
    "chief of police of San Francisco",
    "Francoist police inspector",
    "NYPD police detective and",
    "police radio dispatcher",
    "police superintendent",
    "police commissioner",
    "senior policewoman",
    "police detective",
    "police informant",
    "police commander",
    "police inspector",
    "police official",
    "police sergeant",
    "police advisor",
    "police chief",
    "policeman",
    "police",
    "transport planner",  # before arts
]
spiritual = [
    "cleric at the Red Mosque in Islamabad",
    "Twelver Shi'a cleric",
    "Anglican cleric",
    "Shiite cleric",
    "and cleric",
    "cleric and",
    "cleric",
    "spiritual leader and head of the Sikh Dharma in the western hemisphere",
    "Chief Rabbi and spiritual leader of the Republic of from to",
    "spiritual leader and founder of Hamas",
    "Gaudiya Vaishnava spiritual leader",
    "Dvaita Vedanta spiritual leader",
    "yogi and spiritual leader",
    "Mi'kmaq spiritual leader",
    "hippy spiritual leader",
    "Hindu spiritual leader",
    "spiritual leader",
    "neocharismatic preacher",  # before politics_govt_law
    "neopagan",
]
social = [
    "healthcare and",
]
crime = [
    "organized crime operative",
    "Salt Lake City spree killer",
    "spree killer and",
    "and spree killer",
    "spree killer",
    "alleged child sex offender",
]
event_record_other = [
    "anencephalic baby who became the center of a medical controversy",  # before sciences
    "medical cannabidiol patient and reform figure",
    "medical research subject",  # before sciences
    "medical patient",
    "hostage killed in",  # before arts
    "Taliban hostage",
    "hostage in",
    "hostage",
    "centenarian and assault victim",
    "centenarian",
    "pedestrian allegedly assaulted by police at G summit protests",  # before politics_govt_law
    "man who died in police custody",
    "victim of police shooting",
    "police detainee",
    "police reformer",
    "police suspect",
    "police victim",
]
other_species = [
    "Thoroughbred hurdler",
]


In [95]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [96]:
# Combining separate lists into one dictionary
known_for_dict = {
    "sports": sports,
    "event_record_other": event_record_other,
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "other_species": other_species,
    "arts": arts,
    "politics_govt_law": politics_govt_law,
    "sciences": sciences,
}


#### Extracting Category from `info_2`

In [97]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 24s
Wall time: 1min 24s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
46281,9,Michael Hemmingson,", 47, American writer, cardiac arrest.",https://en.wikipedia.org/wiki/Michael_Hemmingson,35,2014,January,,,,cardiac arrest,,,,,,,,,47.0,,United States of America,,,3.583519,0,0,0,0,0,1,0,0,0,0,0,0,1
17149,3,William Steig,", 95, American cartoonist and children's author; creator of Shrek.",https://en.wikipedia.org/wiki/William_Steig,22,2003,October,,,,,,,,,,,,,95.0,,United States of America,,,3.135494,0,0,0,0,0,1,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [98]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 4852 entries without any known_for category.



#### Observations:
- 4852 entries remain uncategorized (down from 5163), so we will rebuild `known_for_dict` and run another iteration.

#### Finding `known_for` Roles in `info_2`

In [99]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [100]:
# # Code to check each value
# roles_list.pop()


In [101]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "Māori leader" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [102]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [103]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "submariner"]


#### Creating Lists for Each `known_for` Category

In [104]:
# Creating lists for each category
politics_govt_law = [
    "first administrator of the Autonomous Region of Bougainville",
    "administrator and Secretary of Labor under Gerald Ford",
    "and Caicos Islands administrator",
    "and public health administrator",
    "coal industry administrator",
    "public health administrator",
    "credit union administrator",
    "corrections administrator",
    "scientific administrator",
    "government administrator",
    "college administrator",
    "public administrator",
    "health administrator",
    "administrator",
    "son of the national President and former presidential advisor",
    "wife of Senator and Presidential candidate George McGovern",
    "labor law expert and member of Presidential commissions",
    "widow of former President Rudolf Kirchschläger",
    "daughter of former President Raúl Cubas Grau",
    "possible heir of President Hastings Banda",
    "son in law of former President Nasser",
    "spokesman for President Richard Nixon",
    "wife of acting President Raúl Castro",
    "son of President Calvin Coolidge",
    "wife of President Omar Bongo",
    "Vice Presidential candidate",
    "fourth Vice President",
    "acting President",
    "Senate President",
    "first President",
    "Vice President",
    "for President",
    "Presidential",
    "President",
    "First Deputy Chairman of the Central Bank of",
    "Chairman of the Assembly of Experts",
    "Chairman of the Presidency of",
    "Chairman of the",
    "health official and tribal leader",
    "tribal leader in Waziristan and",
    "Balochistan rebel tribal leader",
    "Athabaskan tribal leader",
    "Native tribal leader",
    "tribal leader",
    "Māori leader and women advocate",
    "Māori leader",
]

arts = [
    "impresario and opera administrator",  # before politics_govt_law
    "opera administrator",
    'co creator of President Johnson "Daisy ad"',
    "motorcycle stuntman",
    "stuntman and",
    "and stuntman",
    "stunt rider",
    "stuntman",
    "talent manager",
    "flautist and soloist for Sydney Symphony for years",
    "classical flautist",
    "flautist",
    "trombonist and bandleader",
    "trombonist",
    "ceramics and glazing master",
    "ceramicist and printmaker",
    "studio ceramicist",
    "ceramicist",
    "ceramic",
    "memoirist",
    "documenter of blues and folk songs",
    "songmaker",
    "song",
    "ceramist",
    "Tewa storyteller",
    "storyteller",
    "classical contralto",
    "operatic contralto",
    "contralto",
]
sports = [
    "triathlon competitor and administrator",  # before politics_govt_law
    "Hall of Fame basketball administrator",
    "college athletics administrator",
    "club basketball administrator",
    "rugby league administrator",
    "softball administrator",
    "squash administrator",
    "sport administrator",
    "golf administrator",
    "State Athletic Commission",
    "Chairman of Chelsea Football Club",
    "Chairman and CEO of corporation",
    "motorcycle tuner and race team owner",
    "motorcycle speedway world champion",
    "Dakar Rally motorcycle rider",
    "motorcycle racing promoter",
    "speedway motorcycle rider",
    "motorcycle trials rider",
    "motorcycle rider",
    "motorcycle race",
    "motorcycle",
    "national athletics coach",
    "athletics coach",
    "born sport shooter",
    "sport shooter",
    "swimming coach",
]
sciences = [
    "cancer researcher and Nobel Prize in Physiology or Medicine laureate",
    "polio and cancer researcher",
    "cancer researcher",
    "motorcycle maker",  # before sports
    "dentist",
    "clinical radiologist and radiation treatment pioneer",
    "neuroradiologist",
    "radiologist",
    "dermatologist",
    "marine mammal expert",  # before law_enf_military_operator
    "nephrologist and endocrinologist",
    "endocrinologist and hematologist",
    "neuroendocrinologist",
    "endocrinologist",
]

business_farming = [
    "hospital administrator",  # before politics_govt_law
    "President and CEO of",
    "Chairman and CEO of Johnson & Johnson",
    "stockbroker and",
    "stockbroker",
]
academia_humanities = [
    "university administrator",  # before politics_govt_law
    "intellectual and foremost encyclopedist",
    "intellectual",
    "lexicographer",
    "and numismatist",
    "numismatist",
    "classicist whose cataloging of Linear B led to its decipherment",
    "classicist",
]
law_enf_military_operator = [
    "KGB interim Chairman",  # before politics_govt_law
    "key Taliban ally",
    "air chief marshal and Chief of Air Staff",
    "air chief marshal",
    "Medal of Honor recipient and Commandant of the Marine Corps",
    "Army Special Forces member and Medal of Honor recipient",
    "Medal of Honor recipient during the War",
    "indian and Medal of Honor recipient",
    "Army Medal of Honor recipient",
    "Medal of Honor recipient",
    "last surviving marine who raised the first flag on Mount Suribachi during the Battle of Iwo Jima",
    "captain of the nuclear powered submarine HMS Superb during the Cold War",
    "submarine commander and prisoner of war",
    "submarine lieutenant commander",
    "submarine captain in the Navy",
    "oldest living submariner",
    "submarine commander",
    "submarine captain",
    "submariner",
    "marine",
]
spiritual = [
    "church administrator",  # before politics_govt_law
    "gospel preacher",
    "gospel",
    "self help  motivational speaker",
    "motivational speaker",
    "Orthodox hierarch",
]
social = [
    "UNICEF Committee President",  # before politics_govt_law
]
crime = ["outlaw biker and gangster", "female gangster", "gangster"]
event_record_other = [
    'inspiration for The Beatles song "Lucy in the Sky with Diamonds"',  # before arts
]
other_species = [
    "Presidential cat of the Clinton family",  # before politics_govt_law
]


In [105]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp


In [106]:
# Hard-coding info_2 values to correctly categorize
index = df[
    df["link"] == "https://en.wikipedia.org/wiki/John_Blackburn_(educator)"
].index
df.loc[index, "info_2"] = "university administrator"  # added to dict

index = df[df["link"] == "https://en.wikipedia.org/wiki/Edward_W._Crosby"].index
df.loc[index, "info_2"] = "university administrator"

# Hard-coding info_2 value to capture both sports and sciences
index = df[df["link"] == "https://en.wikipedia.org/wiki/Len_Vale-Onslow"].index
df.loc[index, "info_2"] = "motorcycle rider motorcycle maker"
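
Joining two roles in one `info_2` value lets a single entry pick up two categories in the same pass. A minimal sketch of why this works, using roles from the `sciences` and `sports` lists above; the helper `categorize` is hypothetical and only mimics the extraction loop:

```python
def categorize(text, search_dict):
    """Return (categories hit, leftover text) using ordered substring removal."""
    hits = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in text:
                if category not in hits:
                    hits.append(category)
                text = text.replace(role, "").strip()
    return hits, text


# sciences is searched before sports, matching the "# before sports" comment
demo_dict = {
    "sciences": ["motorcycle maker"],
    "sports": ["motorcycle rider", "motorcycle"],
}

print(categorize("motorcycle rider motorcycle maker", demo_dict))
```

The `sciences` role is removed first, which leaves `motorcycle rider` intact for the `sports` pass, so the entry ends with both flags set and an empty remainder.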


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [107]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "politics_govt_law": politics_govt_law,
}


#### Extracting Category from `info_2`

In [108]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 20s
Wall time: 1min 20s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
30950,26,Ahmed bin Zayed Al Nahyan,", 41, Emirati managing director of the Abu Dhabi Investment Authority, glider crash.",https://en.wikipedia.org/wiki/Ahmed_bin_Zayed_Al_Nahyan,18,2010,March,,,,glider crash,,,,,,,,,41.0,,United Arab Emirates,,,2.944439,0,0,0,0,0,0,0,0,1,0,0,0,1
94996,2,Jean-Guy Couture,", 92, Canadian Roman Catholic prelate, bishop of Chicoutimi .",https://en.wikipedia.org/wiki/Jean-Guy_Couture,4,2022,January,,,,bishop of Chicoutimi,,,,,,,,,92.0,,Canada,Italy,1979 2004,1.609438,0,0,1,0,0,0,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [109]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 4457 entries without any known_for category.



#### Observations:
- 4457 entries remain uncategorized (down from 4852), so we will rebuild `known_for_dict` and run another iteration.

#### Finding `known_for` Roles in `info_2`

In [110]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [111]:
# # Code to check each value
# roles_list.pop()


In [112]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "Hindi" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [113]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [114]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "Hindi"]


#### Creating Lists for Each `known_for` Category

In [115]:
# Creating lists for each category
politics_govt_law = [
    "researcher and  capital punishment",  # before sciences
    "social researcher",
    "Fianna Fáil senator and son of W B Yeats",
    "Progressive Democrat senator",
    "senator and congressman",
    "senator from Amazonas",
    "former senator",
    "state senator",
    "ALP senator",
    "and senator",
    "senator",
    "leader and women rights advocate",
    "leader and top official in the rebel government",
]

arts = [
    "fiction researcher",  # before sciences
    "author researcher",
    "big band vocalist and recording star of the s and s",
    "vocalist and original member of R&B girl group",
    "vocalist of new wave group the Waitresses",
    "lead vocalist of the R&B group Hi Five",
    "blues vocalist and harmonica player",
    "lead vocalist for The Chi Lites",
    "Hindustani classical vocalist",
    "klezmer and Yiddish vocalist",
    "rhythm & blues vocalist",
    "heavy metal vocalist",
    "classical vocalist",
    "punk rock vocalist",
    "carnatic vocalist",
    "big band vocalist",
    "dhrupad vocalist",
    "reggae vocalist",
    "dance vocalist",
    "vocal coach",
    "vocal  and",
    "vocalist",
    "banjo player and bluegrass band leader",
    "five string banjo player",
    "banjo player",
    "banjoist",
    "harpsichord builder",
    "harpsichord maker",
    "harpsichordist",
    "Grammy Award winning arranger",
    "bandleader and arranger",
    "choral arranger",
    "arranger",
    "and internet personality",
    "and variety show personality",
    "and social media personality",
    "radio  host and personality",
    "reality show personality",
    "radio and TV personality",
    "social media personality",
    "reality TV personality",
    "talk show personality",
    "internet personality",
    "Internet personality",
    "webcast personality",
    "street personality",
    "media personality",
    "TV personality",
    "personality",
    "performer and celebrity",
    "celebrity hairdresser",
    "Internet celebrity",
    "internet celebrity",
    "YouTube celebrity",
    "celebrity",
]
sports = [
    "lawn bowls competitor",
    "professional bowler",
    "Hall of Fame bowler",
    "lawn bowls player",
    "ten pin bowler",
    "bowls player",
    "lawn bowler",
    "road bowler",
    "bowler",
]
sciences = [
    "biophysics researcher at Lawrence Berkeley National Laboratory",
    "computer specialist and security researcher",
    "epidemiology and public health researcher",
    "ethologist and behavioral researcher",
    "researcher in child psychology",
    "computer security researcher",
    "NASA researcher and manager",
    "animal breeding researcher",
    "cardiovascular researcher",
    "HIV prevention researcher",
    "agricultural researcher",
    "psychedelic researcher",
    "researcher of genetics",
    "orthodontic researcher",
    "paranormal researcher",
    "psychology researcher",
    "HIV AIDS researcher",
    "chemical researcher",
    "genetic researcher",
    "science researcher",
    "autism researcher",
    "sleep researcher",
    "AIDS researcher",
    "researcher",
    "glaciologist",
]

business_farming = [
    "accounting researcher",  # before sciences
    "sharebroker",
    "realtor",
    "head of based financial giant Lehman Brothers",
    "financial consultant",
    "financial analyst",
    "financial pioneer",
    "financial adviser",
    "financial",
    "public accountant",
    "accountant",
]
academia_humanities = [
    "into underwater exploration and shipwrecks",
    "digital humanities",
    "archaeological researcher",  # before sciences
    "intelligence researcher",
    "anthropology researcher",
    "folklore researcher",
    "cultural researcher",
    "academic researcher",
]
law_enf_military_operator = [
    "George Medal winning airman",
    "RAF airman",
    "airman",
    "national security researcher",  # before sciences
]
spiritual = [
    "psychic",
    "founder of the Holy Spirit Movement",
]
social = [
    "women rights advocate and community leader",
    "indigenous community leader",
    "Hmong community leader",
    "Inuit community leader",
    "Kongu community leader",
    "and community leader",
    "community leader",
]
crime = []
event_record_other = [
    "financial advisor",  # before business_farming
    "transsexual sex worker",
    "sex worker",
]
other_species = [
    "Shih Tzu dog and social media celebrity",  # before arts
    "internet celebrity cat",
    "celebrity sheep",
    "celebrity cat",
]

In [116]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp
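
Within a list, a role has to be searched before any shorter role it contains; otherwise the shorter string is stripped first and the longer one can no longer match. The sketch below checks a list for that problem. The function name `misordered_pairs` is hypothetical, and the sample reuses three roles from the `law_enf_military_operator` list in a deliberately wrong order.

```python
def misordered_pairs(roles):
    """Return (earlier, later) pairs where `earlier` is a substring of `later`."""
    return [
        (short, longer)
        for i, short in enumerate(roles)
        for longer in roles[i + 1:]
        if short != longer and short in longer
    ]


# Deliberately wrong order: the generic role comes first
sample = ["airman", "RAF airman", "George Medal winning airman"]
problems = misordered_pairs(sample)
print(problems)

# Sorting by descending length, as in the commented-out cell, removes the conflicts
fixed = sorted(set(sample), key=len, reverse=True)
print(misordered_pairs(fixed))
```

An empty result means no role in the list can pre-empt a longer role that appears after it.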

In [117]:
# Hard-coding info_2 values to correctly categorize entries
index = df[df["link"] == "https://en.wikipedia.org/wiki/Catherine_Filene_Shouse"].index
df.loc[index, "info_2"] = "public researcher"  # added to dict

index = df[df["link"] == "https://en.wikipedia.org/wiki/W._O._G._Lofts"].index
df.loc[index, "info_2"] = "fiction researcher"  # added to dict

index = df[df["link"] == "https://en.wikipedia.org/wiki/Ghamar_Ariyan"].index
df.loc[index, "info_2"] = "academic researcher"  # added to dict

index = df[df["link"] == "https://en.wikipedia.org/wiki/Rafael_Castillejo"].index
df.loc[index, "info_2"] = "author researcher"  # added to dict

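# Note: same link as the "academic researcher" assignment above, so the value set below replaces it
index = df[df["link"] == "https://en.wikipedia.org/wiki/Ghamar_Ariyan"].index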
df.loc[index, "info_2"] = "academic researcher accounting researcher"  # added to dict
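
The hard-coded assignments above all follow one pattern: find the rows for a link, then overwrite `info_2`. A minimal sketch of that pattern driven by a mapping is given below. The helper `apply_overrides`, the `example.org` links, and the toy frame are assumptions for illustration; the notebook itself uses the explicit assignments.

```python
import pandas as pd


def apply_overrides(frame, overrides, key="link", target="info_2"):
    """Set `target` for rows whose `key` matches each override.

    Returns the override keys that matched no rows, so typos surface early.
    """
    unmatched = []
    for url, value in overrides.items():
        mask = frame[key] == url
        if mask.any():
            frame.loc[mask, target] = value
        else:
            unmatched.append(url)
    return unmatched


# Toy frame with hypothetical links
toy = pd.DataFrame(
    {
        "link": ["https://example.org/a", "https://example.org/b"],
        "info_2": ["researcher", "researcher"],
    }
)
missing = apply_overrides(
    toy,
    {
        "https://example.org/a": "fiction researcher",
        "https://example.org/zzz": "author researcher",  # matches nothing
    },
)
print(toy["info_2"].tolist())
print(missing)
```

Returning the unmatched links is a design choice of this sketch: an assignment through an empty index changes nothing and raises no error, so a mistyped link would otherwise go unnoticed.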

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [118]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "event_record_other": event_record_other,
    "business_farming": business_farming,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "sciences": sciences,
}

#### Extracting Category from `info_2`

In [119]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 11s
Wall time: 1min 11s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
26312,27,Mark Priestley,", 32, Australian actor , suicide.",https://en.wikipedia.org/wiki/Mark_Priestley,5,2008,August,,,,suicide,,,,,,,,,32.0,,Australia,,,1.791759,0,0,0,0,0,1,0,0,0,0,0,0,1
83237,30,Helen Sanger,", 96, American librarian.",https://en.wikipedia.org/wiki/Helen_Sanger,4,2020,July,,,,,,,,,,,,,96.0,,United States of America,,,1.609438,0,0,0,1,0,0,0,0,0,0,0,0,1
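
The loop above checks rows one at a time for every role, which accounts for the run time of roughly a minute. A vectorized sketch of the same find-flag-strip logic is shown below, assuming literal (non-regex) substring matching as in the loop. The function name `extract_categories` and the toy frame are hypothetical; the roles in the toy dictionary are taken from the lists above.

```python
import pandas as pd


def extract_categories(frame, column, search_dict):
    """Flag each matched category with 1 and strip the matched role from `column`."""
    for category, roles in search_dict.items():  # dict order sets search precedence
        for role in roles:
            # Literal match; missing values never match
            mask = frame[column].str.contains(role, regex=False, na=False)
            if mask.any():
                frame.loc[mask, category] = 1
                frame.loc[mask, column] = (
                    frame.loc[mask, column]
                    .str.replace(role, "", regex=False)
                    .str.strip()
                )
    return frame


# Toy frame (hypothetical rows)
toy = pd.DataFrame(
    {
        "info_2": ["state senator and vocalist", None, "bowler"],
        "politics_govt_law": 0,
        "arts": 0,
        "sports": 0,
    }
)
toy_dict = {
    "politics_govt_law": ["state senator", "senator"],
    "arts": ["vocalist"],
    "sports": ["bowler"],
}
extract_categories(toy, "info_2", toy_dict)
print(toy)
```

The two outer loops are kept so that category order and role order still decide which match wins; only the per-row loop is replaced by string methods on the whole column.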


#### Checking the Number of Rows without a First Category

In [120]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 4216 entries without any known_for category.


#### Observations:
- This iteration reduced the entries without a `known_for` category from 4457 to 4216.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [121]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [122]:
# # Code to check each value
# roles_list.pop()

In [123]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "farming" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [124]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)

In [125]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "leader and"]

#### Creating Lists for Each `known_for` Category

In [126]:
# Creating lists for each category
politics_govt_law = [
    "leader and Secretary General of the Popular Front for the Liberation of",
    "public leader from the North Eastern state of",
    "leader of the nazi National Socialist Women League",
    "leader in the disability rights movement",
    "wife of late Communist leader Liu Shaoqi",
    "longtime Democratic leader of Queens",
    "Knights of the Ku Klux Klan leader",
    "wife of leader Leonid Brezhnev",
    "founder and leader of the FNLA",
    "a resistance movement leader",
    "Founder and first leader of",
    "revisionist Zionist leader",
    "Nisga'a indigenous leader",
    "co leader of coup d'état",
    "disability rights leader",
    "white supremacist leader",
    "white separatist leader",
    "separatist rebel leader",
    "Sikh separatist leader",
    "guerrilla leader and",
    "Alaska Native leader",
    "Ku Klux Klan leader",
    "independence leader",
    "Hitler Youth leader",
    "traditional leader",
    "leader of Naxalite",
    "coup d'état leader",
    "Maoist leader and",
    "separatist leader",
    "indigenous leader",
    "opposition leader",
    "Republican leader",
    "Makah leader and",
    "communist leader",
    "Qadiriyya leader",
    "socialist leader",
    "Zionist leader",
    "Native leader",
    "Maoist leader",
    "social leader",
    "civil leader",
    "Inuit leader",
    "Hamas leader",
    "abor leader",
    "PLO leader",
    "leader and",
    "n  leader",
    "Ga leader",
    "leader",
]

arts = [
    "reed player and band leader",  # before politics_govt_law
    "leader of The Prisonaires",
    "Māori kapa haka leader",
    "brass band leader",
    "dance band leader",
    "orchestra leader",
    "swing bandleader",
    "big band leader",
    "cultural leader",
    "and bandleader",
    "band leader",
    "bandleader",
]
sports = [
    "Baltimore Orioles cheerleader of the s and s",  # before politics_govt_law
    "cheerleader",
]
sciences = ["psychiatry and psychology", "psychology in", "psychology"]

business_farming = [
    "peasant",
    "farming",
]
academia_humanities = [
    "president of Ner",
]
law_enf_military_operator = [
    "WWII Army leader of the Kokoda Track campaign",  # before politics_govt_law
    "leader of the Izz ad Din al Qassam Brigades",
    "OSS agent and leader of Operation Halyard",
    "former leader of the Intelligence Service",
    "leader of the Anbar Salvation Council",
    "leader of Anbar Salvation Council",
    "deputy leader of al Qaeda",
    "leader of Jabhat al Nusra",
    "Bougainville rebel leader",
    "Lashkar e Jhangvi leader",
    "internal security leader",
    "Maoist guerrilla leader",
    "border militia leader",
    "and Al Shabaab leader",
    "leader of Al Qaeda in",
    "Darfuri rebel leader",
    "senior Hamas leader",
    "Al Qaeda official",
    "Resistance leader",
    "resistance leader",
    "guerrilla leader",
    "rebel leader and",
    "mercenary leader",
    "Air Force leader",
    "leader of Tigers",
    "al Qaeda leader",
    "Al Qaeda leader",
    "militia leader",
    "Taliban leader",
    "rebel leader",
    "army leader",
]
spiritual = [
    "leader in The Church of Jesus Christ of Latter day Saints",  # before politics_govt_law
    "leader of the Apostolic United Brethren",
    "leader of the Branch Davidian sect",
    "influential Baptist preacher and",
    "Rabbinical College for over years",
    "Seventh day Adventist leader",
    "sect leader and polygamist",
    "Latter day Saints leader",
    "Protestant church leader",
    "leader of the cult group",
    "leader in the LDS Church",
    "Mormon women leader",
    "Presbyterian leader",
    "Baháʼí Faith leader",
    "theosophist leader",
    "evangelical leader",
    "Hindu sect leader",
    "Mormon leader and",
    "Neopagan leader",
    "Mormon leader",
    "church leader",
    "Hindu leader",
    "cult leader",
    "sect leader",
    "Sikh leader",
]
social = [
    "leader of the Muscular Dystrophy Association for years and persuaded Jerry Lewis to undertake a yearly telethon to raise money for muscular dystrophy",  # before politics_govt_law
    "leader of the international Scouting movement",  # before politics_govt_law
    "Girl Guides leader",
    "Scout leader",
]
crime = [
    "suspected ringleader of the November Paris attacks",  # before politics_govt_law
    'founder and nominal leader of the " Mafia"',
    "mafia gang leader",
    "drug smuggler",
    "sex offender",
    "mafia leader",
    "gang leader",
]
event_record_other = []
other_species = []

In [127]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [128]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "politics_govt_law": politics_govt_law,
}
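
The key order of `known_for_dict` is the order in which categories are searched, which is why some roles in the lists carry comments such as `# before politics_govt_law`. The sketch below flags cross-category conflicts, where a role searched earlier is a substring of a role in a category searched later. The function name `shadowed_roles` and the toy dictionaries are assumptions for illustration.

```python
def shadowed_roles(search_dict):
    """Return (earlier_cat, earlier_role, later_cat, later_role) conflicts."""
    # Flatten to the exact order in which roles are searched
    flat = [(cat, role) for cat, roles in search_dict.items() for role in roles]
    return [
        (cat_1, role_1, cat_2, role_2)
        for i, (cat_1, role_1) in enumerate(flat)
        for cat_2, role_2 in flat[i + 1:]
        if cat_1 != cat_2 and role_1 in role_2
    ]


# Wrong order: the generic "leader" would be stripped before "band leader" is tried
wrong_order = {"politics_govt_law": ["leader"], "arts": ["band leader"]}
conflicts = shadowed_roles(wrong_order)
print(conflicts)

# Order used in the cell above: arts is searched before politics_govt_law
right_order = {"arts": ["band leader"], "politics_govt_law": ["leader"]}
print(shadowed_roles(right_order))
```

Running a check like this on the full dictionary before the extraction cell would list any role that can never match because an earlier category consumes part of it.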

#### Extracting Category from `info_2`

In [129]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 10s
Wall time: 1min 10s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
94990,1,Calisto Tanzi,", 83, Italian food industry executive and convicted fraudster, founder of Parmalat and owner of Parma Calcio , lung infection.",https://en.wikipedia.org/wiki/Calisto_Tanzi,16,2022,January,,,,founder of Parmalat and owner of Parma Calcio,lung infection,,,,,,,,83.0,,Italy,,1989 2003,2.833213,0,0,0,0,1,0,0,0,0,1,0,0,2
80821,12,Caro Fraser,", 67, British novelist, cancer.",https://en.wikipedia.org/wiki/Caro_Fraser,4,2020,April,,,,cancer,,,,,,,,,67.0,,United Kingdom of Great Britain and Northern Ireland,,,1.609438,0,0,0,0,0,1,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [130]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 4067 entries without any known_for category.


#### Observations:
- This iteration reduced the entries without a `known_for` category from 4216 to 4067.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [131]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [132]:
# # Code to check each value
# roles_list.pop()

In [133]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "bushman" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [134]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)

In [135]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "documentary  and FAMU"]

#### Creating Lists for Each `known_for` Category

In [136]:
# Creating lists for each category
politics_govt_law = [
    "white supremacist and Klansman",
    "white supremacist",
    "anarcho syndicalist and anarchist",
    "anarcho pacifist",
    "anarchist and",
    "anarchist",
    "anarchism",
    "human rights solicitor",
    "solicitor  denier",
    "solicitor",
    "Secretary of Veterans Affairs",
]

arts = [
    "rock band manager for The Yardbirds and Led Zeppelin",  # before business_farming
    "policy specialist and public relations",
    "former manager of the Ramones",
    "comedy agent and manager",
    "manager of Elvis Presley",
    "manager of rock groups",
    "club owner and manager",
    "entertainment manager",
    "broadcasting manager",
    "publishing manager",
    "night club manager",
    "production manager",
    "opera manager",
    "label manager",
    "band manager",
    "rock manager",
    "Hall of Fame radio talk show host",
    "playmate and talk show host",
    "radio talk show host and",
    "travel documentary host",
    "radio talk show host",
    "talk radio show host",
    "and game show host",
    "and talk show host",
    "children TV host",
    "infomercial host",
    "radio show host",
    "talk show host",
    "game show host",
    "quiz show host",
    "radio TV host",
    "podcast host",
    "air hostess",
    "TV host",
    "modern dance pioneer and",
    "dance therapy pioneer",
    "advocate of dance",
    "dancehall reggae",
    "dance notator",
    "modern dance",
    "dance guru",
    "dance",
    "publishing magnate",  # before business_farming
    "newspaper magnate",
    "media magnate",
    "children bookseller",
    "bookseller",
    "drag queen",
    "typographer and printing",
    "printer and typographer",
    "typographer and",
    "typographer",
    "sound mixer",
    "radio ventriloquist",
    "ventriloquist",
    "sideshow attractions and documentary subjects",
    "documentary  and news correspondent",
    "travel documentary host",
    "documentary  and FAMU",
    "and documentary maker",
    "documentary maker",
    "documentary",
]
sports = [
    "equipment manager for Wildcats men basketball since",  # before business_farming
    "speedway promoter and national team manager",
    "professional wrestling manager",
    "hurling manager and player",
    "lacrosse coach and manager",
    "basketball team manager",
    "MLB catcher and manager",
    "horse racing manager",
    "handball manager",
    "hurling manager",
    "former owner of the Toronto Maple Leafs",
    "strongman and powerlifter",
    "professional strongman",
    "strongman",
    "shot putter",
    "bobsledder",
    "PGA Tour golf player",
    "pro golf caddy",
    "golf champion",
    "golf promoter",
    "golf pioneer",
    "golf player",
    "golf",
    "sprint canoer",
    "slalom canoer",
    "bushman",
]
sciences = ["nephrologist", "paleoclimatologist", "climatologist", "web developer"]

business_farming = [
    "motor company manager",
    "investment manager",
    "hedge fund manager",
    "insurance manager",
    "fund manager and",
    "fund manager",
    "manager",
    "grocery store magnate and a",
    "and shipping magnate",
    "natural resource magnate",
    "pharmaceuticals magnate",
    "food industry magnate",
    "real estate magnate",
    "shipping magnate",
    "bottling magnate",
    "poultry magnate",
    "brewing magnate",
    "timber magnate",
    "mining magnate",
    "hotel magnate",
    "beer magnate",
    "oil magnate",
    "magnate",
    "heiress to Mellon family fortune",
    "heiress and",
    "heiress",
]
academia_humanities = [
    "medievalist",
    "archeologist and conservator",
    "archeologist",
    "orientalist",
]
law_enf_military_operator = [
    "codebreaker at Park",
    "codebreaker",
    "intelligence agent",
    "Marine master gunnery sergeant and recipient of the Medal of Honor",
    "Marine and recipient of the Medal of Honor",
    "Marine with the rank of Brigadier General",
    "Marine and recipient of the Navy Cross",
    "Brigadier General in the Marine Corps",
    "Sea lieutenant of the Royal Marines",
    "Sergeant Major in the Marine Corps",
    "Marine Corps lieutenant colonel",
    "Marine Corps Lieutenant General",
    "Marine Corps Major General",
    "Royal Marine fighter ace",
    "Marine Corps colonel",
    "Marine Corps sniper",
    "Marine war hero",
    "Royal Marine",
    "Marine Corps",
    "Marine and",
    "Marine",
]
spiritual = ["atheist", "archdeacon of Cheltenham", "archdeacon"]
social = []
crime = [
    "founder of the Bandidos Motorcycle Club",
]
event_record_other = [
    "gas station manager",  # before business_farming
    "documentary subject and sexual assault survivor",  # before arts
    "outback mailman and documentary subject",
]
other_species = [
    "Marine service dog",  # before law_enf_military_operator
]

In [137]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [138]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "sports": sports,
    "arts": arts,
    "law_enf_military_operator": law_enf_military_operator,
    "business_farming": business_farming,
}

#### Extracting Category from `info_2`

In [139]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 26s
Wall time: 1min 25s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
90267,24,Dan Frank,", 67, American editor .",https://en.wikipedia.org/wiki/Dan_Frank,6,2021,May,Pantheon Books,,,,,,,,,,,,67.0,,United States of America,,Pantheon Books,1.94591,0,0,0,0,0,1,0,0,0,0,0,0,1
92094,16,Adedayo Omolafe,", 57, Nigerian politician, member of the House of Representatives.",https://en.wikipedia.org/wiki/Adedayo_Omolafe,8,2021,August,,,,member of the House of Representatives,,,,,,,,,57.0,,Nigeria,,,2.197225,0,0,0,0,0,0,0,0,1,0,0,0,1


#### Checking the Number of Rows without a First Category

In [140]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 3783 entries without any known_for category.
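
The counts printed by the checks in this part of the notebook (4457, 4216, 4067, 3783) show how many entries each iteration categorized. A short arithmetic sketch follows; it assumes the dataframe still holds the 98056 rows reported when the data was loaded, and the variable names are illustrative.

```python
# Entries without a known_for category, as printed by the successive checks
remaining = [4457, 4216, 4067, 3783]
total_rows = 98056  # assumption: row count unchanged since loading

# Entries newly categorized by each iteration
newly_categorized = [before - after for before, after in zip(remaining, remaining[1:])]
share_left = remaining[-1] / total_rows

print(newly_categorized)
print(sum(newly_categorized))
print(f"{share_left:.2%} of entries remain uncategorized")
```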


#### Observations:
- This iteration reduced the entries without a `known_for` category from 4067 to 3783.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [141]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [142]:
# # Code to check each value
# roles_list.pop()

In [143]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "IRA member" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [144]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)

In [145]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "public"]

#### Creating Lists for Each `known_for` Category

In [146]:
# Creating lists for each category
politics_govt_law = [
    "campaigner on behalf of Guildford Four and Maguire Seven",
    "campaigner against female genital mutilation",
    "Senator & disability rights campaigner",
    "sexual assault awareness campaigner",
    "asbestosis compensation campaigner",
    "campaigner against permissiveness",
    "campaigner for assisted suicide",
    "disabilities rights campaigner",
    "anti drunk driving campaigner",
    "transgender rights campaigner",
    "disability rights campaigner",
    "green investment campaigner",
    "AIDS awareness campaigner",
    "fishing safety campaigner",
    "anti abortion campaigner",
    "animal rights campaigner",
    "public health campaigner",
    "human rights campaigner",
    "equal rights campaigner",
    "presidential campaigner",
    "anti poverty campaigner",
    "campaigner for the deaf",
    "broadcasting campaigner",
    "gun control campaigner",
    "disability campaigner",
    "gay rights campaigner",
    "pro choice campaigner",
    "anti rape campaigner",
    "progeria campaigner",
    "equality campaigner",
    "textbook campaigner",
    "heritage campaigner",
    "rights campaigner",
    "health campaigner",
    "peace campaigner",
    "AIDS campaigner",
    "campaigner",
    "representative and advocate",
    "magistrate",
    "Secretary General of the National Security Council of the Republic of",  # before business_farming
    "Republic personal assistant of Jacqueline Kennedy Onassis",
    "former term Republican Congressman from Minnesota",
    "Republican member of the House of Representatives",
    "Republican Representative from Washington",
    "lobbyist and public relations official",
    "negotiator with the People Republic of",
    "Republican Representative from state",
    "Republican representative from since",
    "Republican congressman for Virginia",
    "Republican Representative for Idaho",
    "republican and founder of NORAID",
    "first president of the Republic",
    "Republic human rights advocate",
    "Republican representative from",
    "public digital media promoter",
    "public relations consultant",
    "public affairs professional",
    "public affairs consultant",
    "public relations expert",
    "public health innovator",
    "Provisional Republican",
    "public health pioneer",
    "republican and",
    "public speaker",
    "public figure",
    "public health",
    "republican",
]

arts = [
    "Oscar winning set decorator",
    "interior decorator and",
    "interior decorator",
    "set decorator",
    "veteran CBS News cameraman",  # before law_enf_military_operator
    "tabla player",
    "operatic bass",
    "electric blues and Chicago blues harmonica player",
    "electric blues harmonica player",
    "Austin blues club owner",
    "blues harmonica player",
    "blues harmonicist",
    "blues",
]
sports = ["squash player"]
sciences = [
    "alternative therapy campaigner",  # before politics_govt_law
    "hematologist and forensic DNA expert",
    "hematologist",
    "ufologist",
]

business_farming = [
    "publican",
]
academia_humanities = [
    "public speaker",
]
law_enf_military_operator = [
    "veteran of WWI and WWII who claimed to have been the inspiration for",
    "last veteran of the War of Independence",
    "fourth to last veteran of World War I",
    "female anti veteran of the Civil War",
    "last official veteran of World War I",
    "highly decorated veteran of the War",
    "penultimate veteran of World War I",
    "last World War I veteran living in",
    "second to last World War I veteran",
    "oldest War of Independence veteran",
    "veteran of the Liberation War",
    "decorated veteran of the War",
    "veteran of the Civil War",
    "World War I era veteran",
    "Army veteran and",
    "World War I veteran",
    "Resistance veteran",
    "Civil War veteran",
    "Army War veteran",
    "WWII veteran",
    "army veteran",
    "Army veteran",
    "WW I veteran",
    "war veteran",
    "War veteran",
    "WW veteran",
    "veterans'",
    "veteran",
    "Army founder member and Northern Assembly member",
    "member of the National Liberation Army",
    "volunteer in the Provisional Republican Army",  # before politics_govt_law  # before business_farming
    "Minister of Defence of the Republic of",
    "Provisional Republican Army member",
    "Republican Army volunteer",
    "IRA volunteer",
    "IRA member",
]
spiritual = []
social = [
    "natural childbirth campaigner",  # before politics_govt_law
    "organ donation campaigner",
    "cancer support campaigner",
    "literacy campaigner",
    "charity campaigner",
    "ALS",
]
crime = ["mafia hitman", "hitman"]
event_record_other = []
other_species = [
    "gorilla who didn't see his kind for years",
    "only albino gorilla in the world",
    "mountain gorilla",
    "western gorilla",
    "gorilla",
]


In [147]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp

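The commented helper above sorts each role list by descending length. Why that order matters can be sketched on a toy example (the `strip_roles` helper and the sample value are hypothetical, not part of the notebook): if a short role such as `veteran` is searched before `World War I veteran`, the short one is removed first and the longer phrase can no longer match.

```python
def strip_roles(text, roles):
    """Remove each role from text in list order, returning (matched_roles, leftover)."""
    matched = []
    for role in roles:
        if role in text:
            matched.append(role)
            text = text.replace(role, "").strip()
    return matched, text


# Hypothetical sample value and roles, for illustration only
value = "World War I veteran"
roles = ["veteran", "World War I veteran", "war veteran"]

# Short-first order: "veteran" is removed first, leaving a fragment behind
unsorted_result = strip_roles(value, roles)

# Longest-first order: the full phrase is consumed in one step
ordered = sorted(set(roles), key=len, reverse=True)
sorted_result = strip_roles(value, ordered)

print(unsorted_result)
print(sorted_result)
```

In the short-first run the leftover text is `World War I`, which would then need another pass to clean up; in the longest-first run nothing is left over.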

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [148]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "sciences": sciences,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "law_enf_military_operator": law_enf_military_operator,
    "politics_govt_law": politics_govt_law,
    "business_farming": business_farming,
}

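The `# before politics_govt_law` style comments in the lists seem to record which category has to be searched first. Python dictionaries preserve insertion order (guaranteed since Python 3.7), so the key order of `known_for_dict` decides which category sees a value first. A small sketch with made-up role lists (`categories_hit` is a hypothetical helper):

```python
def categories_hit(value, search_dict):
    """Return the categories whose roles match, consuming matched text in dict order."""
    hits = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in value:
                hits.append(category)
                value = value.replace(role, "").strip()
    return hits


# Made-up role lists; only the key order differs between the two dictionaries
sciences = ["alternative therapy campaigner"]
politics_govt_law = ["campaigner"]

sciences_first = {"sciences": sciences, "politics_govt_law": politics_govt_law}
politics_first = {"politics_govt_law": politics_govt_law, "sciences": sciences}

value = "alternative therapy campaigner"
hits_sciences_first = categories_hit(value, sciences_first)
hits_politics_first = categories_hit(value, politics_first)

print(hits_sciences_first)
print(hits_politics_first)
```

With `sciences` first the whole phrase is claimed by `sciences`; with `politics_govt_law` first the generic `campaigner` wins and the more specific role never matches.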

#### Extracting Category from `info_2`

In [149]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 10s
Wall time: 1min 10s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
43285,16,Maurice Marshall,", 86, New Zealand Olympic middle-distance athlete .",https://en.wikipedia.org/wiki/Maurice_Marshall,8,2013,May,,,,,,,,,,,,,86.0,,New Zealand,,1952.0,2.197225,0,0,0,0,0,0,1,0,0,0,0,0,1
47672,14,Nina Cassian,", 89, Romanian poet, heart attack.",https://en.wikipedia.org/wiki/Nina_Cassian,15,2014,April,,,,heart attack,,,,,,,,,89.0,,Romania,,,2.772589,0,0,0,0,0,1,0,0,0,0,0,0,1


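The extraction cell is repeated unchanged in every iteration, so it could be wrapped in a function. The following is a sketch only, not the notebook's code: `extract_categories` is a hypothetical name, and it swaps the row-by-row loop for pandas string methods with `regex=False`, which perform the same literal substring test. It is shown on a toy dataframe.

```python
import pandas as pd


def extract_categories(df, column, search_dict):
    """Flag category columns and strip matched roles from `column`, in dict and list order."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring match; missing values count as no match
            mask = df[column].str.contains(role, regex=False, na=False)
            if mask.any():
                df.loc[mask, category] = 1
                df.loc[mask, column] = (
                    df.loc[mask, column].str.replace(role, "", regex=False).str.strip()
                )
    return df


# Toy data standing in for the real dataframe
toy = pd.DataFrame(
    {
        "info_2": ["mafia hitman", "squash player and publican", None],
        "crime": [0, 0, 0],
        "sports": [0, 0, 0],
        "business_farming": [0, 0, 0],
    }
)
toy_dict = {
    "crime": ["mafia hitman", "hitman"],
    "sports": ["squash player"],
    "business_farming": ["publican"],
}

toy = extract_categories(toy, "info_2", toy_dict)
toy["num_categories"] = toy[list(toy_dict)].sum(axis=1)
print(toy)
```

Because the loop order is unchanged, the dictionary and list ordering still decide which role and category match first.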

#### Checking the Number of Rows without a First Category

In [150]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 3586 entries without any known_for category.



#### Observations:
- 3586 entries remain without a `known_for` category, so we will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [151]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [152]:
# # Code to check each value
# roles_list.pop()


In [153]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "fisherman" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [154]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [155]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "radio"]

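The commented cells above outline the manual search: take the most frequent remaining `info_2` value, collect every value containing that keyword, and list the variants longest first. A compact sketch of that workflow on toy data (the sample values are borrowed from role lists in this notebook; the variable names mirror the commented code, and `ordered_roles` is added for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"info_2": ["fisherman", "shark fisherman", "fisherman", None, "luger"]}
)

# Values ordered from least to most frequent, so pop() returns the most common one
roles_list = toy["info_2"].value_counts(ascending=True).index.tolist()
most_common = roles_list.pop()

# Every distinct value containing the keyword, literal match, skipping missing values
mask = toy["info_2"].str.contains(most_common, regex=False, na=False)
specific_roles_list = toy.loc[mask, "info_2"].unique().tolist()

# Longest first, ready to paste into a category list
ordered_roles = sorted(specific_roles_list, key=len, reverse=True)
print(ordered_roles)
```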

#### Creating Lists for Each `known_for` Category

In [156]:
# Creating lists for each category
politics_govt_law = [
    "communist official",
    "anti communist",
    "communist and",
    "communist",
    "funeral home insurance agent",  # before arts
    "insurance agent",
    'wife of the nd Governor of Edmund "Pat" Brown and the mother of the th and th Governor of',
    "segregationist Governor of the State of",
    "Taliban recognized Governor of Kunduz",
    "Lieutenant Governor of Missouri",
    "widow of Governor John Connally",
    "Governor of the Central Bank",
    "Senator from and Governor of",
    "former Governor General of",
    "last Governor of Northern",
    "Governor of Massachusetts",
    "Governor of Bauchi State",
    "Governor of the Bank of",
    "Governor General of the",
    "Governor of New South",
    "Governor of Oklahoma",
    "Governor General of",
    "Governor of Montana",
    "Governor of Arizona",
    "Governor of Hawaii",
    "Governor General",
    "Governor of",
    "Governor",
    "landowner",
    "farm preservationist",  # before academia_humanities
    "union representative",
    "trade union official",
    "union organizer",
    "labor unionist",
    "union official",
    "union worker",
    "unionist",
    "union",
    "countess and",
    "countess",
    "socialist",
    "women rights advocate",
]

arts = [
    "theatrical press agent",
    "motion picture agent",
    "theatrical agent",
    "press agent",
    "agent",
    "harpist with the Cleveland Orchestra",
    "harpist and harp maker",
    "harpist",
    "beauty pageant and reality show contestant",
    "beauty pageant contestant",
    "beauty pageant winner",
    "beauty pageant queen",
    "beauty influencer",
    "nightclub owner and promoter",
    "nightclub owner",
    "upright bass player",
    "double bass player",
    "bass player",
    "wood engraver and printmaker",
    "printmaker",
    "hairdresser and wife of beatle Ringo starr",
    "hairdresser",
    "creator of the Brenda Starr comic strip",
    "who inspired husband Bil comic strip",
    "popular culture and comic book",
    "Golden Age comic book letterer",
    "penciler of comic books",
    "comic book store owner",
    "comic book and comic",
    "n comic book creator",
    "comic book letterer",
    "comic book colorist",
    "comic book creator",
    "comics creator",
    "stand up comic",
    "comic letterer",
    "comic books",
    "comic",
    "senior correspondent and substitute anchor for WNBC",
    "retired senior BBC correspondent",
    "space travel news correspondent",
    "correspondent for NBC News",
    "foreign correspondent for",
    "defence correspondent for",
    "news radio correspondent",
    "NBC News correspondent",
    "foreign correspondent",
    "news correspondent",
    "correspondent",
    "IMAX documentarian",
    "documentarian and",
    "documentarian",
    "news radio correspondent",
    "radio play performer",
    "radio newsreader",
    "radio performer",
    "radio anchor",
    "radio mogul",
    "radio and",
    "radio",
    "violist",
    "classical oboist",
    "oboist",
    "satirist and Richard Nixon impersonator",
    "satirist",
    "entertainment",
]
sports = [
    "basketball agent",  # before arts
    "race walker",
    "sprint dog musher",  # before other_species
    "dog breeder",
    "dog musher",
    "CFL player",
    "balloonist",
    "Thoroughbred horse racing trainer and breeder",
    "thoroughbred horse racing trainer",
    "Thoroughbred horse racing",
    "horse racing official",
    "horse racing",
    "professional pool player",
    "pool player",
    "netball player and coach",
    "netball player",
    "Paralympic snowboarder",
    "snowboarder",
    "NFL defensive lineman and member of the College Football Hall of Fame",
    "former NFL player for the Oakland Raiders and Tampa Bay Buccaneers",
    "former NFL player and assistant coach",
    "NFL player with the San Francisco ers",
    "NFL player and coach",
    "NFL wide receiver",
    "NFL fullback",
    "NFL player",
    "Hall of Fame lacrosse player and coach",
    "lacrosse player and college coach",
    "basketball and lacrosse player",
    "Hall of Fame lacrosse player",
    "lacrosse player and coach",
    "college lacrosse player",
    "lacrosse player",
    "lacrosse coach",
    "Hall of Fame professional wrestling interviewer",
    "professional wrestling promoter",
    "professional wrestling referee",
    "professional wrestling",
    "wrestling champion",
    "wrestling promoter",
    "wrestling coach",
    "sumo wrestling",
    "wrestling",
    "treasure hunter",
    "deer hunter",
    "hunter",
]
sciences = [
    "embryologist",
    "pioneer of radio astronomy",  # before arts
    "haematologist",
    "mammalogist",
]

business_farming = ["brewer for Pabst and Karl Strauss Brewing Company", "brewer"]
academia_humanities = [
    "historical preservationist",
    "historic preservationist",
    "railway preservationist",
    "preservationist",
    "Indologist",
    "schoolmaster",
    "headmaster",
]
law_enf_military_operator = [
    "anti communist fighter",  # before politics_govt_law
    "agent who led Resistance saboteurs after the Normandy Invasion",  # before arts
    "FBI agent who created the Abscam sting operation",
    "agent of th Federal Bureau of Investigation",
    "CIA agent who armed the mujaheddin of istan",
    "Secret Intelligence Service agent",
    "KGB agent who defected to",
    "WWII resistance agent",
    "law enforcement agent",
    "Secret Service agent",
    "undercover agent",
    "security agent",
    "secret agent",
    "Mossad agent",
    "DINA agent",
    "FBI agent",
    "CIA agent",
    "SOE agent",
    "and agent",
    "Army lieutenant colonel and recipient of the Medal of Honor",
    "Lieutenant colonel in the Air Defense Forces",
    "Army lieutenant colonel and one of the",
    "Civil Guard lieutenant colonel",
    "colonel in the Air Force",
    "Army lieutenant colonel",
    "Air Force colonel and",
    "Army Rangers colonel",
    "lieutenant colonel",
    "Air Force colonel",
    "air force colonel",
    "Army colonel",
    "colonel",
    "Park cryptographer and",
    "cryptographer",
    "radio operator and gunner during WW II",  # before arts
    "amateur radio operator",
    "Island radio operator",
    "radio operator",
    "Air Chief Marshal",
    "Resistance member and Legion of Honour recipient",
    "Resistance member",
]
spiritual = [
    "cardinal and prefect of the Congregation for Divine Worship",
    "catholic cardinal",
    "Maronite cardinal",
    "cardinal",
    "imam",
    "self help",
    "christian preacher and hymnist",
    "wife of preacher Oral Roberts",
    "Ayatollah and preacher",
    "evangelical preacher",
    "street preacher",
    "preacher and",
    "preacher",
]
social = []
crime = [
    "jailed for human rights abuses under Pinochet",
    "who defected to the Union",
    "and arms dealer",
]
event_record_other = []
other_species = [
    "Corgi belonging to Governor of Jerry Brown",  # before politics_govt_law
    "rescue dog with the Miskolc Spider Special Rescue Team",
    "search and rescue dog for September",
    "Flat Coated Retriever show dog",
    "Golden Retriever rescue dog",
    "Guinness World Record dog",
    "Great Dane therapy dog",
    "crested chihuahua dog",
    "world oldest dog",
    "beagle show dog",
    "performing dog",
    "Pomeranian dog",
    "service dog",
    "dog",
]


In [157]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [158]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "business_farming": business_farming,
    "sciences": sciences,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "sports": sports,
    "other_species": other_species,
    "politics_govt_law": politics_govt_law,
    "academia_humanities": academia_humanities,
    "arts": arts,
}


#### Extracting Category from `info_2`

In [159]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 2min 3s
Wall time: 2min 3s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
30801,11,Charles Moore,", 79, American photographer.",https://en.wikipedia.org/wiki/Charles_Moore_(photographer),12,2010,March,,,,,,,,,,,,,79.0,,United States of America,,,2.564949,0,0,0,0,0,1,0,0,0,0,0,0,1
49022,19,Norberto Odebrecht,", 93, Brazilian engineer, founder of Odebrecht and the Odebrecht Foundation, cardiac complications.",https://en.wikipedia.org/wiki/Norberto_Odebrecht,4,2014,July,,,,founder of Odebrecht and the Odebrecht Foundation,cardiac complications,,,,,,,,93.0,,Brazil,,,1.609438,1,0,0,0,0,0,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [160]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 3162 entries without any known_for category.



#### Observations:
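- 3162 entries remain without a `known_for` category (424 fewer than after the previous iteration), so we will proceed to rebuild `known_for_dict` for the next iteration.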
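
Short roles in the lists above, such as `union`, `agent`, or `dog`, are matched with a plain substring test, so they can also fire inside a longer word. Whether that happens in the actual data is not checked here; the sketch below only illustrates the risk and one possible guard using a word-boundary regular expression (`matches_whole_word` is a hypothetical helper, and the sample strings are invented).

```python
import re


def matches_whole_word(role, text):
    """True if role occurs in text delimited by word boundaries."""
    return re.search(rf"\b{re.escape(role)}\b", text) is not None


# Invented sample values for illustration
samples = ["trade union", "reunion organizer"]

substring_hits = [text for text in samples if "union" in text]
whole_word_hits = [text for text in samples if matches_whole_word("union", text)]

print(substring_hits)
print(whole_word_hits)
```

Roles that begin or end with punctuation, such as `veterans'`, would need a different boundary rule, so this guard is not a drop-in replacement for the substring test.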

#### Finding `known_for` Roles in `info_2`

In [161]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [162]:
# # Code to check each value
# roles_list.pop()


In [163]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "Libertarian Studies" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [164]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [165]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "president of"]


#### Creating Lists for Each `known_for` Category

In [166]:
# Creating lists for each category
politics_govt_law = [
    "official and",
    "senior member of Hezbollah",
    "senior negotiator",
    "senior official",
    "senior",
    "chief negotiator of the Free Trade Agreement",
    "labor negotiator",
    "international human rights",
    "human rights advocate",
    "human rights",
    "monarch and th Omukama of the Kingdom of Toro",
    "monarch",
    "children rights advocate",  # before event_record_other
    "child welfare expert",
    "children advocate",
    "widow of the Nationalist president Chiang Kai shek",
    "president of the Center for",
    "president of the National Union of Students",
    "presidential advisor and Postmaster General",
    "president of the Guam Chamber of Commerce",
    "president of the United Auto Workers",
    "president of the Labour Congress",
    "president of United Steelworkers",
    "presidential private secretary",
    "former presidential candidate",
    "president of the Congress",
    "presidential candidate",
    "wife of the president",
    "presidential advisor",
    "presidential adviser",
    "former president of",
    "presidential aide",
    "presidential son",
    "th president of",
    "president of",
    "president",
]

arts = [
    "TV journalism pioneer and former NBC News president",  # before politics_govt_law
    "president of the Motion Picture Association of",
    "vice president of Def Jam Recordings",
    "president of CBS",
]
sports = [
    "vice president of community relations for St Louis Cardinals",  # before politics_govt_law
    "president of the International Ski Federation",
    "president of the International Skating Union",
    "president of the Western Hockey League",
    "vice president of FIFA",
    "NCAA president",
]
sciences = []

business_farming = [
    "vice president of Hills Bank and Trust Company",  # before politics_govt_law
    "president of Pressman Toy Corporation",
    "president of the Atchison",
    "president and CEO of WD",
    "president of Walgreens",
]
academia_humanities = [
    "former president of the University of at Austin and Rice University",  # before politics_govt_law
    "former vice president of the Metropolitan Museum of Art",
    "Libertarian Studies",
    "president of the Royal College of Psychiatrists",
    "president of the University of Michigan",
    "president of Drexel University",
    "president of Biola University",
    "president of Tech",
    "president of MIT",
]
law_enf_military_operator = [
    "senior commander of the SiPo and SD",
    "senior member of al Qaeda",  # before politics_govt_law
    "al Nusra Front senior official",
    "senior member of Hezbollah",
    "LTTE",
    "national president of the Fraternal Order of Police",
]
spiritual = [
    "president of Universal Life Church",
]
social = [
    "president of the Shafeek Nader Trust for the Community Interest",
    "president of Refugees International",
]
crime = []
event_record_other = [
    "godchild of C S Lewis and eponym for Lucy Pevensie from",
    "child who was mistreated by his grandparents",
    "child victim of the NATO bombing of",
    "migrant child in custody",
    "child homicide victims",
    "victim of child abuse",
    "child cancer victim",
    "child abuse victim",
    "children advocate",
    "children",
    "child",
    "presidential mother",  # before politics_govt_law
]
other_species = []


In [167]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [168]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "politics_govt_law": politics_govt_law,
}

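Some roles appear in more than one list in this iteration: `senior member of Hezbollah` is in both `politics_govt_law` and `law_enf_military_operator`, and `children advocate` is in both `politics_govt_law` and `event_record_other`. Only the category that comes first in `known_for_dict` can match such a role, because the text is removed on the first hit. A quick check for such overlaps could look like this (`find_shared_roles` is a hypothetical helper, shown with shortened lists):

```python
from collections import defaultdict


def find_shared_roles(search_dict):
    """Map each role listed under more than one category to those categories."""
    seen = defaultdict(list)
    for category, roles in search_dict.items():
        for role in set(roles):
            seen[role].append(category)
    return {role: cats for role, cats in seen.items() if len(cats) > 1}


# Shortened lists from this iteration, in the same relative key order
shortened_dict = {
    "law_enf_military_operator": [
        "senior member of al Qaeda",
        "senior member of Hezbollah",
    ],
    "event_record_other": ["children advocate", "child"],
    "politics_govt_law": [
        "senior member of Hezbollah",
        "children advocate",
        "president",
    ],
}

shared = find_shared_roles(shortened_dict)
print(shared)
```

Each shared role maps to its categories in dictionary order, so the first category listed is the one that will actually be assigned.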

#### Extracting Category from `info_2`

In [169]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 42.2 s
Wall time: 42.2 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
57198,29,Nihal Ahmed Maulavi Mohammed Usman,", 90, Indian politician, Maharashtra MLA , mayor of Malegaon.",https://en.wikipedia.org/wiki/Nihal_Ahmed_Maulavi_Mohammed_Usman,5,2016,February,,,,Maharashtra MLA,mayor of Malegaon,,,,,,,,90.0,,India,,1960 1999,1.791759,0,0,0,0,0,0,0,0,1,0,0,0,1
1027,2,Walter Chyzowych,", 57, Ukrainian-American soccer player.",https://en.wikipedia.org/wiki/Walter_Chyzowych,3,1994,September,,,,,,,,,,,,,57.0,,Ukraine,United States of America,,1.386294,0,0,0,0,0,0,1,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [170]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 3064 entries without any known_for category.



#### Observations:
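- 3064 entries remain without a `known_for` category (98 fewer than after the previous iteration), so we will proceed to rebuild `known_for_dict` for the next iteration.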

#### Finding `known_for` Roles in `info_2`

In [171]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [172]:
# # Code to check each value
# roles_list.pop()


In [173]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "pollster" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [174]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [175]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "pollster"]


#### Creating Lists for Each `known_for` Category

In [176]:
# Creating lists for each category
politics_govt_law = [
    "Holocaust denier",
    "legislator in the Massachusetts House",
    "legislator for New",
    "state legislator",
    "legislator",
    "woman and first female parliamentarian in the Arab world",
    "parliamentary candidate",
    "parliamentarian",
    "Federal District Judge who overturned the conviction of Lt William Calley",  # before crime
    "immigrant and anti Castro lobbyist",
    "labor lobbyist",
    "lobbyist",
    "pacifist",
    "Saramaka chieftain",
    "chieftain",
    "Lower Elwha Klallam Tribe elder",
    "Great Andamanese elder",
    "Klallam elder",
    "Koyukon elder",
    "tribal elder",
    "Inuit elder",
    "first lady",
    "disability advocate",
]

arts = [
    "and builder",
    "origami master",
    "caricaturist of the s and s",
    "caricaturist",
    "auctioneer",
    "gardening expert and presenter",
    "circus owner and presenter",
    "ITN news presenter",
    "news presenter",
    "BBC  presenter",
    "presenter",
    "voice coach",
    "voice",
    "jeweler and silversmith",
    "jeweller",
    "jeweler",
    "YouTuber",
    "etiquette expert",
    "media mogul",
    "calypsonian",
    "illusionist and escapologist",
    "optical illusion  and",
    "illusionist",
    "media owner",
    "textile printer",
    "printer",
    "impressionist",
    "sound system operator and club owner",
    "classical bass in opera and concert",
    "switchboard operator",
    "Yue opera performer",
    "opera impresario",
]
sports = [
    "marathon former world record holder",
    "meter hurdles world record holder",  # before event_record_other
    "master of karate who founded Ashihara karate",
    "master of Goju ryu karate",
    "founder of karate pioneer",
    "karate master and trainer",
    "karate world champion",
    "master of karate",
    "karate master",
    "karateka",
    "orienteer and ski orienteer",
    "orienteering competitor",
    "orienteer",
    "pole vaulter",
    "powerlifter and strength coach",
    "world champion powerlifter",
    "Paralympic powerlifter",
    "powerlifter",
    "sprint and marathon canoeist",
    "sprint canoeist and coach",
    "canoeist and coach",
    "sprint canoeist",
    "slalom canoeist",
    "canoeist",
    "shark fisherman",
    "fisherman",
    "taekwondo champion and coach",
    "taekwondo practitioner",
    "former outfielder for the Dodgers",
    "corner infielder",
    "luger",
    "master of aikido",
    "aikido master",
    "aikidoka",
    "outdoorsman",
]
sciences = [
    "gastroenterologist",
    "seismologist",
    "micropaleontology",  # before academia_humanities
    "microcomputer pioneer",
    "computer pioneer",
    "mathematics and logic",
    "mathematics",
    "nanotechnologist and crystallographer",
    "crystallographer",
    "immunotoxicologist",
    "toxicologist",
    "paranormal investigator",
    "urologist who developed a cure for renal tuberculosis",
    "paediatric urologist",
    "futurologist",
    "urologist",
    "bacteriologist",
    "volcanologist",
]

business_farming = [
    "shipping and",
    "couturier and",
    "couturier",
    "casino operator",
    "tour operator",
]
academia_humanities = ["paleographer", "paleo", "museum founder"]
law_enf_military_operator = [
    "Sharia advocate and insurgent",
    "Warsaw Uprising insurgent",
    "insurgent commander",
    "insurgent",
    "guerrilla member of FARC",
    "Communist guerrilla",
    "guerrilla fighter",
    "guerrilla member",
    "guerrilla",
    "original Navajo code talker",
    "WWII Navajo code talker",
    "Navajo code talker",
    "CIA operative during the Cold War",
    "Arabian external operations chief",
    "intelligence operative",
    "al Qaeda operative",
    "CIA operative",
    "SOE operative",
]
spiritual = [
    "sedevacantism",
    "beautified catholic teenager",  # before event_record_other
    "Benedictine monk and Zen master",
    "Benedictine monk and liturgist",
    "Eastern Orthodox monk",
    "Coptic Orthodox monk",
    "buddhist monk and",
    "Benedictine monk",
    "breatharian monk",
    "Orthodox monk",
    "Trappist monk",
    "Hindu monk",
    "Jain monk",
    "monk",
    "Biblical",
    "Karma Kagyu lama",
    "Dzogchen lama",
    "lama",
    "evangelical church",
    "church consultant",
    "LDS church elder",
    "church",
    "Orthodox elder",
    "Mormon elder",
    "chaplain",
]
social = ["community worker"]
crime = [
    "and murder convict",
    "death row convict",
    "juvenile convict",
    "murder convict",
    "convict",
    "suspect in the Boston Marathon bombings",
    "suspected assassin",
    "terrorism suspect",
    "murder suspect",
    "crime suspect",
    "welder and muffler repair shop owner",
]
event_record_other = [
    "record holder",
    "teenager who was brutally murdered and made national headlines",
    "teenager",
    "victim of videotaped  beating that sparked the riots",
    "teenage car crash victim from Orange County",
    "notable gender reassignment victim",
    "victim of racially motivated crime",
    "'body in the boot' crime victim",
    "scrotal elephantiasis victim",
    "microhydranencephaly victim",
    "victim of unlawful killing",
    "Watergate break in victim",
    "West immigrant and victim",
    "school shooting victim",
    "domestic abuse victim",
    "cyberbullying victim",
    "sexual abuse victim",
    "manslaughter victim",
    "acid attack victim",
    "false rape victim",
    "hate crime victim",
    "homicide victim",
    "bullying victim",
    "shooting victim",
    "HIV aids victim",
    "suicide victim",
    "torture victim",
    "racism victim",
    "polio victim",
    "rape victim",
    "victim",
    "concentration camp prisoner",
]
other_species = []

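Entries such as `BBC  presenter`, `optical illusion  and`, and `victim of videotaped  beating` contain a double space. This is presumably what an earlier pass left behind when it removed a word from the middle of a value, since `.strip()` only trims the ends. The sketch below reproduces the effect on an invented value and shows one optional way to collapse inner whitespace; adopting it would mean writing the role lists with single spaces instead.

```python
# Invented value; "radio" was a role in an earlier arts list
value = "BBC radio presenter"

# Same operation as the extraction loop: remove the role, trim the ends only
after_removal = value.replace("radio", "").strip()

# Optional normalisation: collapse any run of whitespace to one space
collapsed = " ".join(after_removal.split())

print(repr(after_removal))
print(repr(collapsed))
```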

In [177]:
# # Example code to quickly sort list in correct descending length search order to copy to dictionary
# temp = sorted(list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True)
# temp


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [178]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "business_farming": business_farming,
    "sciences": sciences,
    "academia_humanities": academia_humanities,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "crime": crime,
    "event_record_other": event_record_other,
}


#### Extracting Category from `info_2`

In [179]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['num_categories']!=0].sample(2)

CPU times: total: 1min 43s
Wall time: 1min 43s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
40627,12,Mario Murillo,", 85, Costa Rican footballer.",https://en.wikipedia.org/wiki/Mario_Murillo_(footballer),6,2012,November,,,,,,,,,,,,,85.0,,Costa Rica,,,1.94591,0,0,0,0,0,0,1,0,0,0,0,0,1
4013,2,"Douglas Houghton, Baron Houghton of Sowerby",", 97, British politician.","https://en.wikipedia.org/wiki/Douglas_Houghton,_Baron_Houghton_of_Sowerby",6,1996,May,,,,,,,,,,,,,97.0,,United Kingdom of Great Britain and Northern Ireland,,,1.94591,0,0,0,0,0,0,0,0,1,0,0,0,1



#### Checking the Number of Rows without a First Category

In [180]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 2727 entries without any known_for category.



#### Observations:
- 2727 entries remain without a `known_for` category, so `known_for_dict` will need to be rebuilt for further iterations.
- Those iterations will continue in a new notebook, so it is time to export our dataframe.
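
The four counts printed in this part of the notebook show how much each iteration reduced the uncategorized entries. The arithmetic, as a short sketch using only those printed values (the share calculation assumes the row count has not changed since the dataset was loaded):

```python
# Uncategorized counts printed after each of the four iterations above
remaining = [3586, 3162, 3064, 2727]

# Entries newly categorized by each subsequent iteration
reductions = [before - after for before, after in zip(remaining, remaining[1:])]

total_rows = 98056  # row count reported when the dataset was loaded
share_remaining = remaining[-1] / total_rows

print(reductions)
print(f"{share_remaining:.1%} of rows still lack a category")
```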

### Exporting Dataset to SQLite Database [wp_life_expect_clean12.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean12.db)

In [181]:
# Exporting dataframe

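# Saving dataset in a SQLite database
# (if_exists="replace" lets the cell be re-run without raising on an existing table)
conn = sql.connect("wp_life_expect_clean12.db")
df.to_sql("wp_life_expect_clean12", conn, index=False, if_exists="replace")
conn.close()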

# Chime notification when cell executes
chime.success()

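After exporting, reading the table back is a cheap sanity check. The sketch below runs the same `to_sql` / `read_sql` round trip against an in-memory database with a toy dataframe, so it does not touch the real file; the table name `toy_table` is made up.

```python
import sqlite3 as sql

import pandas as pd

toy = pd.DataFrame({"name": ["A", "B"], "num_categories": [1, 0]})

# In-memory database so nothing is written to disk
conn = sql.connect(":memory:")
toy.to_sql("toy_table", conn, index=False)

# Read the table back and compare it with what was written
restored = pd.read_sql("SELECT * FROM toy_table", conn)
conn.close()

print(restored.shape == toy.shape)
print(restored["num_categories"].tolist())
```

The same comparison against `wp_life_expect_clean12.db` would confirm that the exported table has the expected number of rows and columns.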

# [Proceed to Data Cleaning Part 13](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean13_thanak_2022_08_07.ipynb)