# Wikipedia Notable Life Expectancies
# [Notebook 11: Data Cleaning Part 10](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean10_thanak_2022_08_01.ipynb)
### Context

The
### Objective

The
### Data Dictionary
- Feature: Description

### Importing Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean9.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean9", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 98057 rows and 38 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,dancer,ballet designer and director,,,,,,,,,86.0,,United Kingdom of Great Britain and Northern Ireland,,,3.091042,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,,writer,and academic,,,,,,,,68.0,,Ireland,,,2.564949,0,0,0,0,0,0,0,0,1,0,0,0,1



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
98055,9,Aamir Liaquat Hussain,", 50, Pakistani journalist and politician, MNA .",https://en.wikipedia.org/wiki/Aamir_Liaquat_Hussain,99,2022,June,", since",,,MNA,,,,,,,,,50.0,,Pakistan,,"2002 2007, since 2018",4.60517,0,0,0,0,0,1,0,0,1,0,0,0,2
98056,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,,member of the Academy of Engineering,,,,,,,,,86.0,,"China, People's Republic of",,,1.386294,1,0,0,0,0,0,0,0,0,0,0,0,1



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
3152,31,Lou Levy,", 84, American music publisher.",https://en.wikipedia.org/wiki/Lou_Levy_(publisher),4,1995,October,,,music publisher,,,,,,,,,,84.0,,United States of America,,,1.609438,0,0,0,0,0,0,0,0,0,0,0,0,0
9960,9,Milt Jackson,", 76, American jazz vibraphonist, liver cancer.",https://en.wikipedia.org/wiki/Milt_Jackson,10,1999,October,,,jazz vibraphonist,liver cancer,,,,,,,,,76.0,,United States of America,,,2.397895,0,0,0,0,0,0,0,0,0,0,0,0,0
31517,29,Akinpelu Oludele Adesola,", 82, Nigirean academic.",https://en.wikipedia.org/wiki/Akinpelu_Oludele_Adesola,3,2010,May,,,,,,,,,,,,,82.0,,Nigeria,,,1.386294,0,0,0,1,0,0,0,0,0,0,0,0,1
68231,22,Lyn Lott,", 67, American golfer, complications from brain surgery.",https://en.wikipedia.org/wiki/Lyn_Lott,4,2018,March,,,golfer,complications from brain surgery,,,,,,,,,67.0,,United States of America,,,1.609438,0,0,0,0,0,0,0,0,0,0,0,0,0
50715,16,Dessie Hughes,", 71, Irish racehorse trainer.",https://en.wikipedia.org/wiki/Dessie_Hughes,3,2014,November,,,racehorse trainer,,,,,,,,,,71.0,,Ireland,,,1.386294,0,0,0,0,0,0,0,0,0,0,0,0,0



### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98057 entries, 0 to 98056
Data columns (total 38 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   day                        98057 non-null  object 
 1   name                       98057 non-null  object 
 2   info                       98057 non-null  object 
 3   link                       98057 non-null  object 
 4   num_references             98057 non-null  int64  
 5   year                       98057 non-null  int64  
 6   month                      98057 non-null  object 
 7   info_parenth               36661 non-null  object 
 8   info_1                     22 non-null     object 
 9   info_2                     98025 non-null  object 
 10  info_3                     48897 non-null  object 
 11  info_4                     10264 non-null  object 
 12  info_5                     1265 non-null   object 
 13  info_6                     181 non-null    obj


#### Observations:
- With the dataset loaded, we can pick up where we left off: extracting `known_for` values by rebuilding `known_for_dict`.
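
To make the repeated cycle below easier to follow, here is a minimal sketch of one iteration on a toy dataframe. The `toy` frame, its values, and `toy_dict` are illustrative assumptions, not the notebook's data; the loop follows the same substring logic as the extraction cells that come later.

```python
import pandas as pd

# Toy stand-in for the real data; values are illustrative only
toy = pd.DataFrame(
    {
        "info_2": ["linguist and translator", "translator", "golfer", None],
        "academia_humanities": [0, 0, 0, 0],
        "sports": [0, 0, 0, 0],
    }
)

# Role lists per category, longest phrase first
toy_dict = {
    "academia_humanities": ["linguist and translator", "translator"],
    "sports": ["golfer"],
}

# Find each role in the column, flag its category, and strip the role text
for category, roles in toy_dict.items():
    for role in roles:
        for index in toy[toy["info_2"].notna()].index:
            item = toy.loc[index, "info_2"]
            if item and role in item:
                toy.loc[index, category] = 1
                toy.loc[index, "info_2"] = item.replace(role, "").strip()

# Count categories per row and report the rows still uncategorized
toy["num_categories"] = toy[list(toy_dict)].sum(axis=1)
remaining = len(toy[toy["num_categories"] == 0])
print(f"{remaining} toy entries without any category.")
```

Each pass in the notebook repeats these steps with a fresh set of role lists, then checks how many entries still have no category.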

### Extracting `known_for` Continued

#### Finding `known_for` Roles in `info_2`

In [6]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [7]:
# # Code to check each value
# roles_list.pop()


In [8]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "translator" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [9]:
# # Code to check each specific value
# specific_roles_list.pop()


In [10]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "and Bible translator" in df.loc[index, "info"]]]


In [11]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "linguist and bible translator"]


#### Creating Lists for Each `known_for` Category

In [12]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = []
sciences = []

business_farming = []
academia_humanities = [
    "and translator of philosophy and literature",
    "scholar and translator of literature",
    "translator and literature scholar",
    "translator and literary scholar",
    "translator of modern literature",
    "language scholar and translator",
    "linguist and bible translator",
    "medievalist and translator",
    "litterateur and translator",
    "sinologist and translator",
    "translator of literature",
    "translator and linguist",
    "linguist and translator",
    "teacher and translator",
    "scholar and translator",
    "and Bible translator",
    "literary translator",
    "translator and",
    "and translator",
    "translator",
]
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [13]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [14]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['academia_humanities'] == 1].sample(2)

CPU times: total: 10.2 s
Wall time: 10.2 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
77116,9,Yang Enze,", 99, Chinese telecommunications engineer and academic, cerebral hemorrhage.",https://en.wikipedia.org/wiki/Yang_Enze,4,2019,October,,,,cerebral hemorrhage,,,,,,,,,99.0,,"China, People's Republic of",,,1.609438,1,0,0,1,0,0,0,0,0,0,0,0,2
87676,14,Christopher Lee,", 79, British writer and historian, COVID-19.",https://en.wikipedia.org/wiki/Christopher_Lee_(historian),9,2021,February,,,,COVID,,,,,,,,,79.0,,United Kingdom of Great Britain and Northern Ireland,,,2.302585,0,0,0,1,0,1,0,0,0,0,0,0,2



#### Checking the Number of Rows without a First Category

In [15]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 28392 entries without any known_for category.



#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.
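
Because the extraction loop visits every row once per role, a vectorized variant may be worth considering for later iterations. The sketch below is an assumption on our part, not the code used in this notebook: `extract_roles` is a hypothetical helper that applies the same literal substring test and removal through pandas string methods, and the `demo` frame is illustrative.

```python
import pandas as pd


def extract_roles(frame, column, search_dict):
    """Hypothetical helper: flag categories and strip matched roles in place."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring test; missing values count as no match
            mask = frame[column].str.contains(role, regex=False, na=False)
            if mask.any():
                frame.loc[mask, category] = 1
                frame.loc[mask, column] = (
                    frame.loc[mask, column]
                    .str.replace(role, "", regex=False)
                    .str.strip()
                )
    return frame


# Illustrative data only
demo = pd.DataFrame(
    {"info_2": ["film critic and producer", "filmmaker", None], "arts": [0, 0, 0]}
)
extract_roles(demo, "info_2", {"arts": ["film critic and producer", "filmmaker", "film"]})
print(demo)
```

We have not timed this variant against the loop, so any speed difference on the full dataset is untested.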

#### Finding `known_for` Roles in `info_2`

In [16]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [17]:
# # Code to check each value
# roles_list.pop()


In [18]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "film" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [19]:
# # Code to check each specific value
# specific_roles_list.pop()


In [20]:
# Example code to quick-screen values that may overlap categories
df.loc[[index for index in df.index if "cultural researcher" in df.loc[index, "info"]]]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
74555,30,Robert R. Spitzer,", 96, American agricultural researcher and educator.",https://en.wikipedia.org/wiki/Robert_R._Spitzer,3,2019,April,,,agricultural researcher,,,,,,,,,,96.0,,United States of America,,,1.386294,0,0,0,1,0,0,0,0,0,0,0,0,1
93001,26,Kirill Razlogov,", 75, Russian film critic and cultural researcher.",https://en.wikipedia.org/wiki/Kirill_Razlogov,6,2021,September,,,film critic and cultural researcher,,,,,,,,,,75.0,,Russia,,,1.94591,0,0,0,0,0,0,0,0,0,0,0,0,0



In [21]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "censor" in df.loc[index, "info"]]]


In [22]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "film subject and domestic abuse symbol"]


#### Creating Lists for Each `known_for` Category

In [23]:
# Creating lists for each category
politics_govt_law = [
    "censor",
]

arts = [
    "documentary filmmaker and pioneer of public access television",
    "television documentary director and filmmaker",
    "film and television special effects designer",
    "Bollywood filmmaker and brother of Dev Anand",
    "rock tour organiser and film studio manager",
    "film musical arranger musical orchestrator",
    "film and television editor and director",
    "filmmaker and children book illustrator",
    "film critic and film festival director",
    "experimental filmmaker and glass maker",
    "film studio executive and talent agent",
    "film editor and Academy Award winner",
    "film and television costume designer",
    "BBC Northern broadcaster & filmmaker",
    "film critic and television presenter",
    "director and producer in film and TV",
    "television executive and filmmaker",
    "music director for Bollywood films",
    "film studio executive and producer",
    "filmmaker and television producer",
    "wildlife film maker and producer",
    "director for film and television",
    "underwater documentary filmmaker",
    "film documentarian and producer",
    "filmmaker and festival promoter",
    "graphic designer and filmmaker",
    "film and advertising executive",
    "film critic and radio producer",
    "film distributor and producer",
    "ʼNamgis documentary filmmaker",
    "cinematographer and filmmaker",
    "and film and theater director",
    "television and film executive",
    "film and television executive",
    "film and television producer",
    "Arabian film and TV director",
    "film and stage choreographer",
    "stage director and filmmaker",
    "choreographer and filmmaker",
    "music documentary filmmaker",
    "film critic and researcher",
    "independent film executive",
    "film and television editor",
    "columnist and film critic",
    "Oscar winning film editor",
    "and documentary filmmaker",
    "film and theater director",
    "and documentary filmmaker",
    "documentary filmmaker and",
    "film and theatre producer",
    "film and theater producer",
    "film and theatre director",
    "film production designer",
    "music and film executive",
    "producer of horror films",
    "and underwater filmmaker",
    "film critic and essayist",
    "film television producer",
    "film marketing publicist",
    "film critic and producer",
    "film editor and producer",
    "and aerial film operator",
    "film editor and director",
    "film critic and director",
    "film industry executive",
    "film and opera director",
    "theater and film critic",
    "film and stage director",
    "filmmaker and cameraman",
    "Emmy Award winning film",
    "documentary film editor",
    "Tony Award winning film",
    "film and theatre critic",
    "theatre and film critic",
    "documentary film maker",
    "experimental filmmaker",
    "filmmaker and director",
    "animator and filmmaker",
    "film critic for on ABC",
    "independent film maker",
    "pornographic filmmaker",
    "filmmaker and designer",
    "producer and filmmaker",
    "filmmaker and producer",
    "documentary filmmaker",
    "film studio executive",
    "advertising filmmaker",
    "avant garde filmmaker",
    "music and film critic",
    "film costume designer",
    "independent filmmaker",
    "film  television host",
    "film and TV director",
    "surrealist filmmaker",
    "film and TV producer",
    "filmmaker and editor",
    "film camera operator",
    "film music director",
    "film location scout",
    "film stunt director",
    "wildlife filmmaker",
    "film prop designer",
    "film choreographer",
    "film distributor",
    "of film studies",
    "film programmer",
    "South filmmaker",
    "and film critic",
    "adult film star",
    "film editor and",
    "film critic and",
    "film executive",
    "film trumpeter",
    "and filmmaker",
    "filmmaker and",
    "film lyricist",
    "film pioneer",
    "film editor",
    "film critic",
    "film maker",
    "filmmaker",
    "film star",
    "film and",
    "and film",
    "film",
]
sports = []
sciences = [
    "restorer",
    "virtual reality technology pioneer and",
]

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = ["Anglican prelate and theologian", "Anglican prelate"]
social = []
crime = []
event_record_other = [
    "ALD patient portrayed in the film",  # before arts
    "filmgoer",
    "film subject and domestic abuse symbol",
]
other_species = []


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [24]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "sports": sports,
    "arts": arts,
}


#### Extracting Category from `info_2`

In [25]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['spiritual'] == 1].sample(2)

CPU times: total: 1min 6s
Wall time: 1min 6s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
75257,13,Pierre DuMaine,", 87, American Roman Catholic prelate, Bishop of San Jose .",https://en.wikipedia.org/wiki/Pierre_DuMaine,3,2019,June,,,,Bishop of San Jose,,,,,,,,,87.0,,United States of America,Italy,1981 1999,1.386294,0,0,1,0,0,0,0,0,0,0,0,0,1
53024,24,Thomas Joseph Connolly,", 92, American Roman Catholic prelate, Bishop of Baker .",https://en.wikipedia.org/wiki/Thomas_Joseph_Connolly,4,2015,April,,,,Bishop of Baker,,,,,,,,,92.0,,United States of America,Italy,1971 1999,1.609438,0,0,1,0,0,0,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [26]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 27795 entries without any known_for category.



#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.
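
The order of entries matters because matching is by substring and each match is removed from the value. The role lists above run from longest phrase to shortest, and `event_record_other` is placed ahead of `arts` in the dictionary so that a phrase such as "ALD patient portrayed in the film" is consumed before the bare "film" role is tried. The sketch below illustrates the within-list effect; `strip_roles` is a hypothetical helper written only for this demonstration.

```python
def strip_roles(text, roles):
    """Return (matched roles, leftover text) using in-order substring matching."""
    matched = []
    for role in roles:
        if text and role in text:
            matched.append(role)
            text = text.replace(role, "").strip()
    return matched, text


# Longest phrase first: the whole value is consumed by a single role
longest_first = strip_roles(
    "documentary filmmaker", ["documentary filmmaker", "filmmaker", "film"]
)
print(longest_first)  # (['documentary filmmaker'], '')

# Shortest first: "film" matches early and leaves a fragment behind
shortest_first = strip_roles(
    "documentary filmmaker", ["film", "filmmaker", "documentary filmmaker"]
)
print(shortest_first)  # (['film'], 'documentary maker')
```

In the shortest-first case the leftover "documentary maker" would remain in `info_2` and no longer match any of the longer phrases.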

#### Finding `known_for` Roles in `info_2`

In [27]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [28]:
# # Code to check each value
# roles_list.pop()


In [29]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "professor" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [30]:
# # Code to check each specific value
# specific_roles_list.pop()


In [31]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "and communication professor" in df.loc[index, "info"]
#     ]
# ]


In [32]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "professor and World War II researcher"]


#### Creating Lists for Each `known_for` Category

In [33]:
# Creating lists for each category
politics_govt_law = [
    "who helped uncover the Bay of Pigs Invasion plan",
    "child protection expert",
]

arts = []
sports = []
sciences = [
    "who made critical contributions to the development of radar",
]

business_farming = []
academia_humanities = [
    "professor and official pronouncer of the Scripps National Spelling Bee from to",
    "geographer and Alexander von Humboldt professor of geography at UCLA",
    "professor and twice interim president of the University of Missouri",
    "professor of and Islamic Studies at the University of Edinburgh",
    "linguistics professor and Pacific Islands language specialist",
    "professor of education and commentator on education topics",
    "professor at Columbia University and scholar of literature",
    "professor of education at the University of Washington",
    "professor and leading researcher into category theory",
    "professor of aesthetics at University of Strasbourg",
    "and drama professor at the Academy of Theatre Arts",
    "professor of Assyriology and Babylonian literature",
    "professor at Princeton Theological Seminary and",
    "professor of history at Indiana University",
    "professor of History at University College",
    "classical scholar and history professor",
    "ist and professor of ancient languages",
    "and professor at Seton Hall University",
    "professor at the University of Chicago",
    "professor and World War II researcher",
    "professor at Brigham Young University",
    "professor and folklorist of cultures",
    "and professor of clinical psychology",
    "professor of comparative literature",
    "scholar and professor of literature",
    "professor specialized in turbulence",
    "and political philosophy professor",
    "professor of modern Jewish history",
    "professor and daughter of Zhu De",
    "professor at Stanford University",
    "professor at Columbia University",
    "professor of Ancient Philosophy",
    "professor of Early Christianity",
    "professor of Jewish literature",
    "professor of Hebrew Literature",
    "and professor at University of",
    "professor at the University of",
    "Stanford University professor",
    "professor emeritus of history",
    "professor of at University of",
    "anthropologist and professor",
    "professor of and runologist",
    "and communication professor",
    "and professor of philosophy",
    "researcher and professor of",
    "Assyriologist and professor",
    "emeritus professor at Yale",
    "professor at University of",
    "professor of Asian studies",
    "professor of philosophy of",
    "professor of women studies",
    "library science professor",
    "ethnologist and professor",
    "folklorist and professor",
    "and philosophy professor",
    "geographer and professor",
    "A&M University professor",
    "professor and sinologist",
    "and University professor",
    "and university professor",
    "librarian and professor",
    "University of professor",
    "professor of Egyptology",
    "women studies professor",
    "pedagogue and professor",
    "professor of philosophy",
    "professor of literature",
    "linguist and professor",
    "professor of geography",
    "and Emeritus professor",
    "and professor emeritus",
    "assistant professor of",
    "professor of Classics",
    "professor emeritus of",
    "scholar and professor",
    "professor of rhetoric",
    "professor of classics",
    "and college professor",
    "and professor of law",
    "literature professor",
    "professor of Studies",
    "professor of studies",
    "professor of history",
    "philosophy professor",
    "university professor",
    "professor of Hebrew",
    "associate professor",
    "professor emeritus",
    "language professor",
    "and law professor",
    "law professor and",
    "college professor",
    "history professor",
    "and a professor",
    "law professor",
    "MIT professor",
    "and professor",
    "professor and",
    "professor of",
    "professor in",
    "professor",
]
law_enf_military_operator = []
spiritual = [
    "expert on biblical manuscripts",
]
social = []
crime = []
event_record_other = []
other_species = []


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [34]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [35]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['academia_humanities'] == 1].sample(2)

CPU times: total: 53.5 s
Wall time: 53.5 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
16171,21,Karel Kosík,", 76, Czech Marxist philosopher.",https://en.wikipedia.org/wiki/Karel_Kos%C3%ADk,5,2003,February,,,Marxist,,,,,,,,,,76.0,,Czech Republic,,,1.791759,0,0,0,1,0,0,0,0,0,0,0,0,1
94973,1,Ramiz Abutalibov,", 84, Azerbaijani diplomat and historian.",https://en.wikipedia.org/wiki/Ramiz_Abutalibov,3,2022,January,,,,,,,,,,,,,84.0,,Azerbaijan,,,1.386294,0,0,0,1,0,0,0,0,1,0,0,0,2



#### Checking the Number of Rows without a First Category

In [36]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 27546 entries without any known_for category.



#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.
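
Rebuilding the role lists starts with finding every distinct `info_2` value that contains a keyword. As one possible shortcut, the sketch below gathers those values and sorts them longest first, ready to paste into a category list. `roles_containing` is a hypothetical helper and the `toy` frame is illustrative; neither appears in the original workflow.

```python
import pandas as pd


def roles_containing(frame, column, keyword):
    """Hypothetical helper: distinct values of `column` containing `keyword`, longest first."""
    # Literal substring test; missing values count as no match
    mask = frame[column].str.contains(keyword, regex=False, na=False)
    values = frame.loc[mask, column].unique().tolist()
    # Sort by length (descending), then alphabetically for a stable order
    return sorted(values, key=lambda value: (-len(value), value))


# Illustrative data only
toy = pd.DataFrame(
    {"info_2": ["theologian", "priest and theologian", "golfer", None, "theologian"]}
)
candidates = roles_containing(toy, "info_2", "theologian")
print(candidates)  # ['priest and theologian', 'theologian']
```

Each candidate still needs a manual check against the full `info` text before it is assigned to a category, as the quick-screen cells do.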

#### Finding `known_for` Roles in `info_2`

In [37]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [38]:
# # Code to check each value
# roles_list.pop()


In [39]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "theologian" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [40]:
# # Code to check each specific value
# specific_roles_list.pop()


In [41]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "founder of magazine" in df.loc[index, "info"]]]


In [42]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "liberation theologian"]


#### Creating Lists for Each `known_for` Category

In [43]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = [
    "evangelical Protestant pastor and theologian",
    "Christian Protestant Ecumenical theologian",
    "congregationalist minister and theologian",
    "evangelical theologian and missiologist",
    "Eastern Orthodox priest and theologian",
    "Catholic Jesuit priest and theologian",
    "Presbyterian minister and theologian",
    "theologian and Catholic lay leader",
    "Protestant theologian and biblical",
    "Anglican clergyman and theologian",
    "Evangelical Christian theologian",
    "theologian and Dead Sea Scrolls",
    "Franciscan friar and theologian",
    "theologian and religious leader",
    "Baptist minister and theologian",
    "Anglican priest and theologian",
    "Anglican bishop and theologian",
    "Lutheran theologian and bishop",
    "theologian and Bishop of Medak",
    "Catholic priest and theologian",
    "Jesuit priest and theologian",
    "theologian and Old Testament",
    "theologian and New Testament",
    "dispensationalist theologian",
    "Southern Baptist theologian",
    "Eastern Orthodox theologian",
    "Evangelical theologian and",
    "missionary and theologian",
    "theologian and missionary",
    "theologian and ecumenist",
    "minister and theologian",
    "chaplain and theologian",
    "theologian and biblical",
    "theologian and exegete",
    "evangelical theologian",
    "priest and theologian",
    "Jesuit theologian and",
    "liberation theologian",
    "bishop and theologian",
    "theologian and pastor",
    "theologian and priest",
    "protestant theologian",
    "cleric and theologian",
    "Protestant theologian",
    "Methodist theologian",
    "Church of theologian",
    "Christian theologian",
    "Catholic theologian",
    "Lutheran theologian",
    "Qur'anic theologian",
    "Anglican theologian",
    "Islamic theologian",
    "Jewish theologian",
    "Jesuit theologian",
    "Sunni theologian",
    "Queer theologian",
    "lay theologian",
    "and theologian",
    "theologian and",
    "theologian",
]
social = []
crime = []
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [44]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [45]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column (filters df rather than building a boolean mask)
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['spiritual'] == 1].sample(2)

CPU times: total: 30.9 s
Wall time: 30.9 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
62558,7,Edmond La Beaume Cherbonnier,", 99, American theologian.",https://en.wikipedia.org/wiki/Edmond_La_Beaume_Cherbonnier,14,2017,March,,,,,,,,,,,,,99.0,,United States of America,,,2.70805,0,0,1,0,0,0,0,0,0,0,0,0,1
18026,1,Francis James Harrison,", 91, American Roman Catholic prelate, Bishop of Syracuse .",https://en.wikipedia.org/wiki/Francis_James_Harrison,5,2004,May,,,,Bishop of Syracuse,,,,,,,,,91.0,,United States of America,Italy,1977 1987,1.791759,0,0,1,0,0,0,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [46]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 27340 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.
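
Because the same extraction cell is re-run after every rebuild of `known_for_dict`, the step could be wrapped in a function. The following is a minimal sketch, not the notebook's own code: `extract_roles` is a hypothetical helper name, and it swaps the row-by-row loop for pandas' vectorized `str.contains` and `str.replace` with `regex=False`. It should flag the same rows, assuming the column holds only strings and missing values.

```python
import pandas as pd


def extract_roles(df, column, search_dict):
    """Flag categories and strip matched roles from `column` (hypothetical helper)."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal substring match; missing values count as no match
            mask = df[column].str.contains(role, regex=False, na=False)
            if mask.any():
                df.loc[mask, category] = 1
                df.loc[mask, column] = (
                    df.loc[mask, column].str.replace(role, "", regex=False).str.strip()
                )
    # Recompute the number of categories assigned to each row
    df["num_categories"] = df[list(search_dict)].sum(axis=1)
    return df


# Toy data standing in for the notebook's dataframe
toy = pd.DataFrame(
    {
        "info_2": ["Jesuit theologian", "banker", None],
        "spiritual": [0, 0, 0],
        "business_farming": [0, 0, 0],
    }
)
toy_dict = {
    "spiritual": ["Jesuit theologian", "theologian"],
    "business_farming": [],
}
toy = extract_roles(toy, "info_2", toy_dict)
print(toy)
```

With a helper like this, each iteration would reduce to rebuilding the lists and calling the function once.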

#### Finding `known_for` Roles in `info_2`

In [47]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [48]:
# # Code to check each value
# roles_list.pop()

In [49]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "linguist" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [50]:
# # Code to check each specific value
# specific_roles_list.pop()

In [51]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "computational" in df.loc[index, "info"]]]

In [52]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "psycholinguist"]

#### Creating Lists for Each `known_for` Category

In [53]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = []
sciences = [
    "neuro linguistic programming expert",  # before academia_humanities
    "psycholinguist",
]

business_farming = []
academia_humanities = [
    "linguist and leading scholar of Mon and Khmer languages",
    "linguist and classical scholar who deciphered Linear B",
    "sociolinguist and linguistic anthropologist",
    "linguistic anthropologist and semiotician",
    "linguist specialized in Romance languages",
    "scholar of literature and linguistics",
    "Yiddish linguist and lexicographer",
    "linguist and literature scholar",
    "linguist and anthropologist",
    "anthropologist and linguist",
    "linguist and Hittitologist",
    "lexicographer and linguist",
    "linguist and lexicographer",
    "musicologist and linguist",
    "linguistic anthropologist",
    "albanologist and linguist",
    "linguist and ethnologist",
    "philologist and linguist",
    "linguist and philologist",
    "linguist and celtologist",
    "linguist and Iranologist",
    "grammarian and linguist",
    "sinologist and linguist",
    "linguistics expert and",
    "Santhali linguist and",
    "scholar and linguist",
    "linguist and scholar",
    "linguist and slavist",
    "linguist and teacher",
    "linguist of descent",
    "linguistics scholar",
    "historical linguist",
    "linguistics expert",
    "classical linguist",
    "Creole linguist",
    "sociolinguist",
    "and linguist",
    "linguist and",
    "linguist",
]
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [54]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "academia_humanities": academia_humanities,
}

#### Extracting Category from `info_2`

In [55]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column (filters df rather than building a boolean mask)
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['academia_humanities'] == 1].sample(2)

CPU times: total: 21.2 s
Wall time: 21.2 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
61127,11,James Gair,", 88, American linguist.",https://en.wikipedia.org/wiki/James_Gair,7,2016,December,,,,,,,,,,,,,88.0,,United States of America,,,2.079442,0,0,0,1,0,0,0,0,0,0,0,0,1
51591,14,Jerzy Holzer,", 84, Polish historian.",https://en.wikipedia.org/wiki/Jerzy_Holzer,3,2015,January,,,,,,,,,,,,,84.0,,Poland,,,1.386294,0,0,0,1,0,0,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [56]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 27147 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.
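
The `# before academia_humanities` comment and the reordered dictionary in this iteration matter because the loop removes each matched role from `info_2`, so whichever category is processed first consumes the text. A small self-contained sketch of that effect, using a hypothetical `categorize` helper on a single string rather than on the dataframe:

```python
def categorize(text, search_dict):
    """Apply roles in dictionary order; return (flagged categories, leftover text)."""
    flagged = []
    for category, roles in search_dict.items():
        for role in roles:
            if text and role in text:
                if category not in flagged:
                    flagged.append(category)
                text = text.replace(role, "").strip()
    return flagged, text


sciences = ["psycholinguist"]
academia_humanities = ["linguist"]

# sciences first: the full role is matched and nothing is left over
sciences_first = categorize(
    "psycholinguist",
    {"sciences": sciences, "academia_humanities": academia_humanities},
)
print(sciences_first)

# academia_humanities first: "linguist" is consumed and "psycho" is left behind
academia_first = categorize(
    "psycholinguist",
    {"academia_humanities": academia_humanities, "sciences": sciences},
)
print(academia_first)
```

Since Python dictionaries preserve insertion order, moving a key to the end of `known_for_dict` is enough to make that category run last.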

#### Finding `known_for` Roles in `info_2`

In [57]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [58]:
# # Code to check each value
# roles_list.pop()

In [59]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "anthropologist" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [60]:
# # Code to check each specific value
# specific_roles_list.pop()

In [61]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "griot" in df.loc[index, "info"]]]

In [62]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "physical anthropologist"]

#### Creating Lists for Each `known_for` Category

In [63]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "griot",
]
sports = []
sciences = [
    "palaeontologist and palaeoanthropologist",
    "geologist and paleoanthropologist",
    "forensic anthropologist",
    "physical anthropologist",
    "palaeoanthropologist",  # before academia_humanities
    "paleoanthropologist",
]

business_farming = [
    "banker and member of the Rothschild family",
    "investment banker and financier",
    "beverage executive and banker",
    "banker and venture capitalist",
    "Arabian billionaire banker",
    "investment banker and",
    "financier and banker",
    "banker and executive",
    "banker and chairman",
    "merchant banker and",
    "and merchant banker",
    "billionaire banker",
    "investment banker",
    "mortgage banker",
    "merchant banker",
    "banker and",
    "and banker",
    "banker",
]
academia_humanities = [
    "anthropologist specializing in Aztec culture",
    "social anthropologist ands ethnographer",
    "social anthropologist and musicologist",
    "anthropologist and cryptozoologist",
    "anthropologist and museum director",
    "anthropologist and ethnographer",
    "ethnologist and anthropologist",
    "indigenous Hopi anthropologist",
    "Africanist and anthropologist",
    "anthropologist and folklorist",
    "anthropologist and ethicist",
    "anthropologist and scholar",
    "social anthropologist and",
    "culinary anthropologist",
    "cultural anthropologist",
    "social anthropologist",
    "anthropologist and",
    "and anthropologist",
    "anthropologist",
]
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [64]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "academia_humanities": academia_humanities,  # last, so sciences roles are matched first
}

#### Extracting Category from `info_2`

In [65]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column (filters df rather than building a boolean mask)
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['business_farming'] == 1].sample(2)

CPU times: total: 22.6 s
Wall time: 22.6 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
58625,7,Thomas Perkins,", 84, American businessman .",https://en.wikipedia.org/wiki/Thomas_Perkins_(businessman),34,2016,June,Perkins Caufield & Byers,,,,,,,,,,,,84.0,,United States of America,United States of America,Perkins Caufield & Byers,3.555348,0,0,0,0,1,0,0,0,0,0,0,0,1
88323,9,"Sir Richard Pease, 3rd Baronet",", 98, British banker.","https://en.wikipedia.org/wiki/Sir_Richard_Pease,_3rd_Baronet",6,2021,March,,,,,,,,,,,,,98.0,,United Kingdom of Great Britain and Northern Ireland,,,1.94591,0,0,0,0,1,0,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [66]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 26828 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.
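
Rebuilding the dictionary starts from the values still left in `info_2`. The commented-out cells collect them with a list comprehension over the index; a roughly equivalent sketch using `str.contains` is shown here on toy data. The helper name `roles_containing` is hypothetical, and the result is ordered most frequent first, as `value_counts` does by default.

```python
import pandas as pd


def roles_containing(series, term):
    """Return distinct values containing `term`, most frequent first (hypothetical helper)."""
    # Literal substring match; missing values count as no match
    mask = series.str.contains(term, regex=False, na=False)
    return series[mask].value_counts().index.tolist()


# Toy column standing in for df["info_2"]
info_2 = pd.Series(["banker", "merchant banker", "banker", "linguist", None])
specific_roles_list = roles_containing(info_2, "banker")
print(specific_roles_list)
```

Calling `specific_roles_list.pop()` on such a list then yields the least frequent value first, which matches how the notebook steps through candidates.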

#### Finding `known_for` Roles in `info_2`

In [67]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [68]:
# # Code to check each value
# roles_list.pop()

In [69]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "racing cyclist" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [70]:
# # Code to check each specific value
# specific_roles_list.pop()

In [71]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "racing cyclist and manager"]

#### Creating Lists for Each `known_for` Category

In [72]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = [
    "track and road racing cyclist",
    "triathlete and racing cyclist",
    "Olympic road racing cyclist",
    "professional racing cyclist",
    "racing cyclist and manager",
    "Paralympic racing cyclist",
    "racing cyclist and sports",
    "Olympic racing cyclist",
    "road racing cyclist",
    "racing cyclist and",
    "and racing cyclist",
    "racing cyclist",
]
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [73]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [74]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column (filters df rather than building a boolean mask)
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['sports'] == 1].sample(2)

CPU times: total: 6.64 s
Wall time: 6.63 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
40575,8,Bobby Gilfillan,", 74, Scottish footballer , prostate cancer.",https://en.wikipedia.org/wiki/Bobby_Gilfillan_(footballer_born_1938),3,2012,November,Doncaster Rovers,,,prostate cancer,,,,,,,,,74.0,,Scotland,,Doncaster Rovers,1.386294,0,0,0,0,0,0,1,0,0,0,0,0,1
35056,30,Preston Carpenter,", 77, American football player .",https://en.wikipedia.org/wiki/Preston_Carpenter,3,2011,June,"Cleveland Browns, Pittsburgh Steelers, Washington Redskins",,,,,,,,,,,,77.0,,United States of America,,"Cleveland Browns, Pittsburgh Steelers, Washington Redskins",1.386294,0,0,0,0,0,0,1,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [75]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 26640 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.
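
Within each list the roles run from longest to shortest, so a specific phrase such as `racing cyclist and manager` is removed before the generic `racing cyclist` can consume part of it. A hedged sketch of a check for that ordering follows; `find_shadowed` is a hypothetical helper that reports any role that an earlier, shorter role in the same list would already have partly consumed.

```python
def find_shadowed(roles):
    """Return (earlier, later) pairs where `earlier` is a substring of `later`."""
    shadowed = []
    for i, earlier in enumerate(roles):
        for later in roles[i + 1 :]:
            if earlier != later and earlier in later:
                shadowed.append((earlier, later))
    return shadowed


good_order = ["racing cyclist and manager", "racing cyclist and", "racing cyclist"]
bad_order = ["racing cyclist", "racing cyclist and manager"]

print(find_shadowed(good_order))  # no pairs reported
print(find_shadowed(bad_order))  # the generic role shadows the specific one
```

Sorting a list with `sorted(roles, key=len, reverse=True)` is one way to rule out such pairs, since a substring can never be longer than the string that contains it.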

#### Finding `known_for` Roles in `info_2`

In [76]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [77]:
# # Code to check each value
# roles_list.pop()

In [78]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "sports" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [79]:
# # Code to check each specific value
# specific_roles_list.pop()

In [80]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "car builder" in df.loc[index, "info"]]]

In [81]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "motorsports manager and car builder"]

#### Creating Lists for Each `known_for` Category

In [82]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "publicist for the University of Maryland and the Washington Redskins",
    "marketing executive and sports imposter",
    "fashion designer and sportswear pioneer",
    "sportswear and fashion designer",  # before sports
]
sports = [
    "IOC sports administrator and Olympic sport shooter",
    "hunting and fishing specialist and outdoor sports",
    "Olympic gold medal winning wrestler and sports",
    "sportsman and college athletics administrator",
    "Olympic silver medal winning sports shooter",
    "athletics coach and sports administrator",
    "netball player and sports administrator",
    "sports shooter and Olympic champion",
    "Olympic sprinter and sports coach",
    "athlete and sports administrator",
    "fencer and sports administrator",
    "sports baseball official scorer",
    "ski jumper and sports official",
    "triple international sportsman",
    "professional sports team owner",
    "basketball referee and sports",
    "orienteer and sports official",
    "football executive and sports",
    "Hall of Fame sports executive",
    "college sports administrator",
    "Baseball Hall of Fame sports",
    "Negro league baseball sports",
    "cricket player and sportsman",
    "cyclist and sports director",
    "fencer and sports executive",
    "Olympic sports commissioner",
    "and Olympic sports shooter",
    "wheelchair sports athlete",
    "sports franchise co owner",
    "high school sports coach",
    "and sports administrator",
    "sports administrator and",
    "athlete and sports coach",
    "weightlifter and sports",
    "sports  sport executive",
    "sports player and coach",
    "motorsports manager and",
    "Olympic sports shooter",
    "and sports team owner",
    "sports administrator",
    "and sports executive",
    "motorsports director",
    "and sports official",
    "Hall of Fame sports",
    "sports shooter and",
    "sports manager and",
    "and sports shooter",
    "sports team owner",
    "golfer and sports",
    "sports club owner",
    "sports coach and",
    "sports executive",
    "cricket  sports",
    "baseball sports",
    "and sportswoman",
    "Republic sports",
    "sportswoman and",
    "sports director",
    "sports promoter",
    "sports official",
    "sports shooter",
    "sportsman and",
    "sportsperson",
    "sports diver",
    "sports coach",
    "sports agent",
    "motorsports",
    "sportsman",
    "and sports",
    "sports and",
    "sports",
]
sciences = []

business_farming = [
    "chairman and owner of Ellen Tracy sportswear",  # before sports
]
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = [
    "thoroughbred racehorse involved in sports betting substitution scandal",  # before sports
]

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [83]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [84]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column (filters df rather than building a boolean mask)
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['sports'] == 1].sample(2)

CPU times: total: 39.9 s
Wall time: 39.9 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
16315,28,Kadri Aytaç,", 71, Turkish football player and then manager, Alzheimer's disease.",https://en.wikipedia.org/wiki/Kadri_Ayta%C3%A7,6,2003,March,,,,Alzheimer disease,,,,,,,,,71.0,,Turkey,,,1.94591,0,0,0,0,0,0,1,0,0,0,0,0,1
49287,8,Red Wilson,", 85, American baseball player .",https://en.wikipedia.org/wiki/Red_Wilson,7,2014,August,Detroit Tigers,,,,,,,,,,,,85.0,,United States of America,,Detroit Tigers,2.079442,0,0,0,0,0,0,1,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [85]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 26306 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.
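
Very short roles such as `sports` or `and sports` can match many values, so it may help to see what a list would consume before changing `df`. A minimal sketch, assuming a hypothetical `preview_matches` helper that works on a copy of the column and counts how many values each role would match when the roles are applied in list order:

```python
import pandas as pd


def preview_matches(series, roles):
    """Count matches per role, simulating sequential removal on a copy."""
    remaining = series.dropna().copy()
    counts = {}
    for role in roles:
        mask = remaining.str.contains(role, regex=False)
        counts[role] = int(mask.sum())
        # Remove the role from the copy, as the extraction loop would
        remaining.loc[mask] = (
            remaining.loc[mask].str.replace(role, "", regex=False).str.strip()
        )
    return counts


# Toy column standing in for df["info_2"]
info_2 = pd.Series(["sports coach", "sports coach and", "sportsman", None])
roles = ["sports coach and", "sports coach", "sportsman", "sports"]
preview = preview_matches(info_2, roles)
print(preview)
```

A role with an unexpectedly high count is a candidate for a quick-screen of the full `info` text before it is added to a category list.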

#### Finding `known_for` Roles in `info_2`

In [86]:
# # Obtaining values for column and their counts
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [87]:
# # Code to check each value
# roles_list.pop()

In [88]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "broadcaster" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [89]:
# # Code to check each specific value
# specific_roles_list.pop()

In [90]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [index for index in df.index if "scientific divulgator" in df.loc[index, "info"]]
# ]

In [91]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "broadcaster and secretary to Joseph Goebbels"]

In [92]:
# Dropping entry with link that points to husband's page
index = df[df["link"] == "https://en.wikipedia.org/wiki/Ramona_Bell"].index
df.drop(index, inplace=True)
df.reset_index(inplace=True, drop=True)

#### Creating Lists for Each `known_for` Category

In [93]:
# Creating lists for each category
politics_govt_law = [
    "secretary to Joseph Goebbels",
]

arts = [
    "button accordion player and radio and television broadcaster",
    "broadcaster and radio and television administrator",
    "BBC broadcaster and transatlantic commentator",
    "first regular broadcaster on CBC Television",
    "news broadcaster and television presenter",
    "broadcaster and public address announcer",
    "and broadcaster for the Cincinnati Reds",
    "broadcaster and cultural administrator",
    "radio broadcaster and television host",
    "broadcaster and television executive",
    "radio broadcaster and documentarian",
    "radio broadcaster and food critic",
    "radio and television broadcaster",
    "newsreader and radio broadcaster",
    "television and radio broadcaster",
    "broadcaster and theatre producer",
    "newspaper editor and broadcaster",
    "broadcaster and television host",
    "radio broadcaster and announcer",
    "radio broadcaster and executive",
    "broadcaster and music arranger",
    "correspondent and broadcaster",
    "music critic and broadcaster",
    "Papua New radio broadcaster",
    "news reader and broadcaster",
    "broadcaster and television",
    "biographer and broadcaster",
    "and television broadcaster",
    "TV and radio broadcaster",
    "Hall of Fame broadcaster",
    "pirate radio broadcaster",
    "and broadcaster known as",
    "broadcaster and theatre",
    "television broadcaster",
    "broadcaster and anchor",
    "cultural administrator",
    "radio news broadcaster",
    "BBC radio broadcaster",
    "and radio broadcaster",
    "radio broadcaster and",
    "outdoors broadcaster",
    "broadcaster for CBC",
    "broadcaster and CEO",
    "radio  broadcaster",
    "radio broadcaster",
    "Māori broadcaster",
    "news broadcaster",
    "BBC broadcaster",
    "broadcaster for",
    "broadcaster and",
    "and broadcaster",
    "broadcaster",
]
sports = [
    "for the Pittsburgh Steelers",
    "yo yo world champion",
]
sciences = [
    "scientific divulgator",
]

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [94]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [95]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column (filters df rather than building a boolean mask)
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['arts'] == 1].sample(2)

CPU times: total: 28.6 s
Wall time: 28.6 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
62235,16,Ali Osman,", 58–59, Sudanese composer and conductor.",https://en.wikipedia.org/wiki/Ali_Osman_(composer),4,2017,February,,,,,,,,,,,,,58.5,,Sudan,,,1.609438,0,0,0,0,0,1,0,0,0,0,0,0,1
21378,10,Val Guest,", 94, British film writer and director .",https://en.wikipedia.org/wiki/Val_Guest,12,2006,May,,,,,,,,,,,,,94.0,,United Kingdom of Great Britain and Northern Ireland,,,2.564949,0,0,0,0,0,1,0,0,0,0,0,0,1
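
The triple loop above visits every row once per role, which is why the cell takes about half a minute. A minimal sketch of a faster variant, assuming the same literal-substring semantics, lets pandas string methods scan all rows for each role. The function name `extract_categories` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def extract_categories(frame, column, search_dict):
    """Flag each category whose role appears in `column`, then strip the role.

    Hypothetical helper: mirrors the loop's substring matching, but scans
    all rows for one role at a time with vectorized string methods.
    """
    frame = frame.copy()
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring test; missing values never match
            mask = frame[column].str.contains(role, regex=False, na=False)
            if mask.any():
                frame.loc[mask, category] = 1
                frame.loc[mask, column] = (
                    frame.loc[mask, column]
                    .str.replace(role, "", regex=False)
                    .str.strip()
                )
    return frame


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "info_2": ["composer and conductor", None, "film writer and director"],
        "arts": [0, 0, 0],
    }
)
toy_dict = {"arts": ["composer and conductor", "writer"]}
result = extract_categories(toy, "info_2", toy_dict)
print(result)
```

As in the loop, removing a role from the middle of a string can leave a double space behind, so later role lists may need entries that account for it.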


#### Checking the Number of Rows without a First Category

In [96]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 26099 entries without any known_for category.
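
`num_categories` is the row-wise sum of the twelve category flags, so a value of 0 marks an entry that no role list has matched yet. A toy sketch of the same bookkeeping, which also tallies matches per category (the two-column frame is made up for illustration):

```python
import pandas as pd

# Toy stand-in for the category flag columns
category_cols = ["arts", "sports"]
toy = pd.DataFrame({"arts": [1, 0, 0, 1], "sports": [0, 0, 1, 1]})

# Row-wise sum of flags, as in the notebook's num_categories update
toy["num_categories"] = toy[category_cols].sum(axis=1)

# Entries with no category yet, and matches per category
uncategorized = int((toy["num_categories"] == 0).sum())
per_category = toy[category_cols].sum().to_dict()

print(f"There are {uncategorized} entries without any known_for category.")
print(per_category)
```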


#### Observations:
- 26,099 entries still have no `known_for` category, so we will rebuild `known_for_dict` with new role lists for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [97]:
# Obtaining unique values for column, least frequent first, so pop() returns the most frequent
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [98]:
# # Code to check each value
# roles_list.pop()

In [99]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "scholar" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [100]:
# # Code to check each specific value
# specific_roles_list.pop()

In [101]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "biblical scholar" in df.loc[index, "info"]]]

In [102]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "feminist scholar"]

#### Creating Lists for Each `known_for` Category

In [103]:
# Creating lists for each category
politics_govt_law = [
    "advocate for education reform",
]

arts = []
sports = []
sciences = [
    "Ayurvedic",
    "mass communications",
]

business_farming = []
academia_humanities = [
    "literary scholar and founder of Reader response criticism",
    "and scholar in medieval studies and palaeography",
    "classical scholar who specialized in mythology",
    "scholar known for writings on the Iroquois",
    "philologist and religious studies scholar",
    "librarian and scholar of library science",
    "literary scholar of Native literature",
    "musicologist and Shakespeare scholar",
    "and folklorist and literary scholar",
    "literary scholar and media theorist",
    "and folklorist and literary scholar",
    "media scholar and cultural theorist",
    "scholar and historical revisionist",
    "museum director and Judaic scholar",
    "literary scholar and social critic",
    "language scholar and lexicographer",
    "scholar of the Caucasian cultures",
    "scholar of continental philosophy",
    "literary scholar and medievalist",
    "folklorist and literary scholar",
    "scholar of renaissance humanism",
    "literary scholar and redologist",
    "librarian and Tolkien scholar",
    "and Dead Sea Scrolls scholar",
    "scholar of Semitic languages",
    "scholar in Buddhist studies",
    "Sufism prelate and scholar",
    "scholar and educationalist",
    "scholar of Asian languages",
    "disability studies scholar",
    "dance scholar and curator",
    "and New Testament scholar",
    "communication scholar and",
    "scholar of gender studies",
    "scholar of historiography",
    "scholar and educationist",
    "scholar and bioethicist",
    "orientalist and scholar",
    "scholar and Perak mufti",
    "scholar of ancient law",
    "and scholar of Judaism",
    "scholar and specialist",
    "gender studies scholar",
    "researcher and scholar",
    "scholar of literature",
    "Shakespearean scholar",
    "and theology scholar",
    "scholar and preacher",
    "literary scholar and",
    "and Napoleon scholar",
    "and literary scholar",
    "biblical scholar and",
    "and biblical scholar",
    "and Biblical scholar",
    "medievalist scholar",
    "Tlingit scholar and",
    "scholar of medieval",
    "Chaucer scholar and",
    "scholar and curator",
    "Renaissance scholar",
    "scholar of history",
    "folk music scholar",
    "scholar of studies",
    "manuscript scholar",
    "literature scholar",
    "Torah scholar and",
    "rare book scholar",
    "Holocaust scholar",
    "holocaust scholar",
    "education scholar",
    "classical scholar",
    "Mayanist scholar",
    "language scholar",
    "medieval scholar",
    "scholar and Sufi",
    "oriental scholar",
    "Semitics scholar",
    "religion scholar",
    "classics scholar",
    "literary scholar",
    "Sanskrit scholar",
    "theatre scholar",
    "library scholar",
    "Judaica scholar",
    "Yolngu scholar",
    "comics scholar",
    "Bible scholar",
    "Saxon scholar",
    "Urdu scholar",
    "scholar of",
    "and scholar",
    "scholar and",
    "scholar",
]
law_enf_military_operator = []
spiritual = [
    "Jainist and Buddhist",
    "talmudic",
    "Talmudic",
    "Talmud",
    "Salafi",
    "hadith",
    "Vedic",
]
social = []
crime = []
event_record_other = []
other_species = []
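
Because matching is by substring and roles are tried in list order, a short role placed before a longer phrase that contains it would be stripped first and prevent the longer match. That is why the lists are ordered roughly from longest to shortest. A small check, sketched here with a hypothetical helper `find_shadowed_roles`, flags such pairs along with exact duplicates:

```python
def find_shadowed_roles(roles):
    """Return (earlier, later) pairs where `earlier` is a substring of `later`.

    Hypothetical helper: an earlier, shorter role is removed from the text
    first, so the later, longer phrase can no longer match in full.
    Exact duplicates are reported too, since the second copy never matches.
    """
    problems = []
    for i, earlier in enumerate(roles):
        for later in roles[i + 1 :]:
            if earlier in later:
                problems.append((earlier, later))
    return problems


# Toy list for illustration only
sample = [
    "literary scholar and",
    "scholar of",
    "scholar",
    "scholar of history",  # shadowed by "scholar of" and "scholar"
    "scholar",  # duplicate of an earlier entry
]
print(find_shadowed_roles(sample))
```

An empty result means every phrase is tried before any of its own substrings.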

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [104]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [105]:
%%time

# Dictionary of category keys and role-list values to search
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df["academia_humanities"] == 1].sample(2)

CPU times: total: 52.2 s
Wall time: 52.2 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
29365,26,David Underdown,", 84, British historian, author of definitive work on Pride's Purge.",https://en.wikipedia.org/wiki/David_Underdown,3,2009,September,,,,author of definitive work on Pride Purge,,,,,,,,,84.0,,United Kingdom of Great Britain and Northern Ireland,,,1.386294,0,0,0,1,0,0,0,0,0,0,0,0,1
85862,11,Irena Veisaitė,", 92, Lithuanian theatre scholar and human rights activist, COVID-19.",https://en.wikipedia.org/wiki/Irena_Veisait%C4%97,3,2020,December,,,,COVID,,,,,,,,,92.0,,Lithuania,,,1.386294,0,0,0,1,0,0,0,0,1,0,0,0,2


#### Checking the Number of Rows without a First Category

In [106]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 25648 entries without any known_for category.


#### Observations:
- 25,648 entries still have no `known_for` category, so we will rebuild `known_for_dict` with new role lists for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [107]:
# # Obtaining unique values for column, least frequent first, so pop() returns the most frequent
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [108]:
# # Code to check each value
# roles_list.pop()

In [109]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "teacher" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [110]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)
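
The commented cells above outline the screening workflow: pick a keyword, collect the distinct `info_2` values containing it, and list them longest first so they can be pasted into a category list in a safe matching order. A compact sketch of that workflow, with a hypothetical helper name `roles_containing` and toy data:

```python
import pandas as pd


def roles_containing(frame, column, keyword):
    """Distinct non-null values of `column` containing `keyword`, longest first."""
    values = frame[column].dropna()
    matches = values[values.str.contains(keyword, regex=False)].unique().tolist()
    # Longest first; ties broken alphabetically for a stable order
    return sorted(matches, key=lambda x: (-len(x), x))


# Toy data for illustration only
toy = pd.DataFrame(
    {"info_2": ["teacher", "yoga teacher", None, "school teacher", "teacher", "poet"]}
)
print(roles_containing(toy, "info_2", "teacher"))
```

Each value in the result still needs a manual look, since a phrase such as "yoga teacher" may belong to a different category than the bare keyword.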

In [111]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "hoaxer" in df.loc[index, "info"]]]

In [112]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "suspected serial killer"]

#### Creating Lists for Each `known_for` Category

In [113]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "fashion designer and founder of the Versace fashion house",
    "fashion designer and costume designer",
    "fashion designer and costumier",
    "Quapaw Osage fashion designer",
    "stylist and fashion designer",
    "fashion designer and dress",
    "fashion designer and model",
    "royal and fashion designer",
    "hanbok fashion designer",
    "batik fashion designer",
    "fashion designer and",
    "fashion designer",
    "and designer of golf clubs and gear",
    "and golf course designer",
    "golf course designer",
    "and course designer",
    "Bigfoot hoaxer",
    "and hoaxer",
    "hoaxer",
]
sports = [
    "longtime caddy for legendary golfer Jack Nicklaus",
    "golfer and BC Sports Hall of Fame inductee",
    "Hall of Fame professional golfer",
    "golfer and Masters winner",
    "soccer coach and golfer",
    "international golfer and",
    "professional golfer and",
    "golfer and executive",
    "professional golfer",
    "Hall of Fame golfer",
    "PGA and Tour golfer",
    "golfer and coach",
    "PGA Tour golfer",
    "amateur golfer and",
    "amateur golfer",
    "golfer and",
    "golfer",
]
sciences = [
    "astronomer and space exploration pioneer",
    "astronomer at the Hayden Planetarium",
    "astronomer and paranormal expert",
    "research astronomer",
    "optical astronomer",
    "amateur astronomer",
    "radio astronomer",
    "solar astronomer",
    "and astronomer",
    "astronomer",
]

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = [
    'convicted serial killer nicknamed "The Granny Killer"',
    "serial killer and sex offender known as the",
    "serial killer and rapist and last executee",
    "murderer and self confessed serial killer",
    "murderer and suspected serial killer",
    "fugitive and suspected serial killer",
    "convicted serial killer and rapist",
    "serial killer and mass murderer",
    "serial killer and sex offender",
    "serial killer and necrophiliac",
    "slave owner and serial killer",
    "serial killer and kidnapper",
    "murderer and serial killer",
    "drifter and serial killer",
    "serial killer and rapist",
    "robber and serial killer",
    "rapist and serial killer",
    "convicted serial killer",
    "suspected serial killer",
    "thief and serial killer",
    "serial killer and thief",
    "alleged serial killer",
    "and serial killer",
    "serial killer and",
    "serial killer",
]
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [114]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [115]:
%%time

# Dictionary of category keys and role-list values to search
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df["arts"] == 1].sample(2)

CPU times: total: 35.7 s
Wall time: 35.7 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
74435,24,Saleh Ahmed,", 83, Bangladeshi actor .",https://en.wikipedia.org/wiki/Saleh_Ahmed_(actor),3,2019,April,",",,,,,,,,,,,,83.0,,Bangladesh,,",",1.386294,0,0,0,0,0,1,0,0,0,0,0,0,1
45064,5,Daud Rahbar,", 86, Pakistani author and academic.",https://en.wikipedia.org/wiki/Daud_Rahbar,16,2013,October,,,,,,,,,,,,,86.0,,Pakistan,,,2.833213,0,0,0,1,0,1,0,0,0,0,0,0,2


#### Checking the Number of Rows without a First Category

In [116]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 24990 entries without any known_for category.


#### Observations:
- 24,990 entries still have no `known_for` category, so we will rebuild `known_for_dict` with new role lists for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [117]:
# # Obtaining unique values for column, least frequent first, so pop() returns the most frequent
# roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [118]:
# # Code to check each value
# roles_list.pop()

In [119]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_2"].notna()].index
#             if "music" in df.loc[index, "info_2"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [120]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)

In [121]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if '"yoga teacher"' in df.loc[index, "info"]]]

In [122]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == '"yoga teacher"']

#### Creating Lists for Each `known_for` Category

In [123]:
# Creating lists for each category
politics_govt_law = [
    "who defected from the Bolshoi Opera in",
    "and exiled prince of Yawnghwe",
    "and public bookcase proponent",
    "exiled anti Castro militant",
    "language campaigner",
    "and trade unionist",
    "trade unionist and",
    "trade unionist",
    "public servant",
    "exiled",
]

arts = [
    "half of the singing duo the Righteous Brothers",
    "acting at the Royal Academy of Dramatic Arts",
    "mezzo soprano at the City Opera for years",
    "soprano and mezzo soprano and librettist",
    "operatic soprano and a People Artist of",
    "rock and roll violinist and guitarist",
    "operatic contralto and mezzo soprano",
    "operatic mezzo soprano and contralto",
    "violinist with the Beaux Arts Trio",
    "dramaturge and newspaper columnist",
    "beauty queen and operatic soprano",
    "ballet dancer and ballet mistress",
    "and inspirational music teacher",  # before academia_humanities
    "horn player and brass instrument",
    "drama and comparative literature",
    "stage director and drama teacher",
    "violinist and orchestra leader",
    "dramatist and theater director",
    "jazz violinist and bass player",
    "classical violinist and music",
    "dramatist and literary critic",
    "ballet dancer and City Ballet",
    "Konkani language litterateur",
    "violinist and fashion model",
    "violinist and concertmaster",
    "ballet dancer and executive",
    "dramatic coloratura soprano",
    "traditional music performer",
    "drama critic and biographer",
    "creator of the book series",
    "operatic soprano and music",
    "operatic soprano and music",
    "violinist and mandolinist",
    "drama and literary critic",
    "ballet dancer and master",
    "ballet dancer and ballet",
    "coloratura mezzo soprano",
    "classical ballet dancer",
    "mezzo soprano and voice",
    "modern dancer and dance",
    "literature and theater",
    "operatic mezzo soprano",
    "Jewish literary critic",
    "Tulu Kannada dramatist",
    "clarinetist and music",
    "soprano and presenter",
    "television drama and",
    "television dramatist",
    "etiquette instructor",
    "violinist and music",
    "classical violinist",
    "ballerina and dance",
    "based ballet dancer",
    "literary critic and",
    "and literary critic",
    "instrument designer",
    "soprano and a voice",
    "classical violinist",
    "classical guitarist",
    "flautist and music",
    "horticulturist and",
    "and horticulturist",
    "theatrical advisor",
    "dancer and acting",
    "violist and music",
    "ballet dancer and",
    "cultural promoter",
    "coloratura soprano",
    "Carnatic violinist",
    "Amateur violinist",
    "classical soprano",
    "tic mezzo soprano",
    "concert violinist",
    "operatic soprano",
    "dramatic soprano",
    "dancer and dance",
    "female violinist",
    "blues violinist",
    "jazz and ballet",
    "radio dramatist",
    "literary critic",
    "concert cellist",
    "ballet director",
    "fashion design",
    "horticulturist",
    "stage director",
    "choir director",
    "violin soloist",
    "jazz violinist",
    "opera soprano",
    "lyric soprano",
    "choreographer",
    "ballet dancer",
    "drama critic",
    "tic soprano",
    "choirmaster",
    "of design",
    "dramatist",
    "soprano",
    "singing",
    "drama",
]
sports = [
    "aikido instructor and Aikikai teacher",  # before academia_humanities
    "golf player and instructor",
    "shodo and aikido teacher",
    "Iyengar Yoga instructor",
    "pioneer judo teacher",
    "karateka and teacher",
    "taekwondo instructor",
    "and yoga instructor",
    "dressage instructor",
    "Pilates instructor",
    "pilates instructor",
    "pilates teacher",
    '"yoga teacher "',
    "aikido teacher",
    "Aikido teacher",
    "yoga teacher",
]
sciences = []

business_farming = []
academia_humanities = [
    "teacher and director of the Willie Clancy Summer School",
    "teacher who popularized speed reading",
    "musicologist and university teacher",
    "teacher and quiz show contestant",
    "teacher and photograph subject",
    "and university teacher",
    "physical education teacher",
    "musicologist and teacher",
    "Lakota language teacher",
    "teacher who named Pluto",
    "folklorist and teacher",
    "transgender teacher",
    "mathematics teacher",
    "university teacher",
    "pioneering teacher",
    "University teacher",
    "volunteer teacher",
    "academy founder",
    "cooking teacher",
    "school teacher",
    "civics teacher",
    "schoolteacher and",
    "schoolteacher",
    "Métis teacher",
    "teacher aide",
    "head teacher",
    "and instructor",
    "instructor and",
    "instructor",
    "headteacher",
    "pedagogue",
    "teacher and",
    "and teacher",
    "teacher",
]


law_enf_military_operator = [
    "Jewish resistance member during the Holocaust",
    "resistance member during World War II",
    "glider pilot and flight instructor",  # before academia_humanities
    "World War II resistance member",
    "military drill instructor",
    "Jewish resistance member",
    "WWII resistance member",
    "flight instructor and",
    "police officer and",
    "and police officer",
    "WAAF aircraftwoman",
    "resistance member",
    "police officer",
    "FBI instructor",
]
spiritual = [
    "Buddhist monk and meditation teacher",  # before academia_humanities
    "Hindu spiritual leader and teacher",
    "Buddhist monk and Dzogchen teacher",
    "Zen Buddhist teacher and rōshi",
    "Zen Buddhist priest and teacher",
    "Qāriʾ and Qira'at teacher",
    "Hindu spiritual teacher",
    "Hindu monk and teacher",
    "metaphysical teacher",
    "Zen Buddhist teacher",
    "teacher of Rinzai Zen",
    "teacher of Buddhism",
    "Rinzai Zen teacher",
    "meditation teacher",
    "spiritual teacher",
    "religious teacher",
    "Vipassana teacher",
    "Buddhist  teacher",
    "Buddhist teacher",
    "Sister of Mercy",
    "Sufi  teacher",
    "Bible teacher",
    "Sufi teacher",
]
social = [
    "socialite",
]
crime = [
    "convicted rapist",
    "child molester",
]
event_record_other = [
    "whose oxygen machine failed after power cut for unpaid account",
    "and Holocaust survivor",
    "substitute teacher and census worker",
    "school teacher and heroine",  # before academia_humanities
]
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [124]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
    "academia_humanities": academia_humanities,
}
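
Note that `academia_humanities` now comes last in the dictionary. Python dictionaries preserve insertion order, so categories are searched in the order listed, and the generic "teacher" and "instructor" roles run only after the more specific phrases in other categories (those marked `# before academia_humanities`) have been removed. A toy illustration of why the order matters, using a hypothetical single-string function `categorize`:

```python
def categorize(text, search_dict):
    """Apply the substring extraction to one string.

    Hypothetical helper: returns the matched categories and what is left
    of the text, following the dictionary's insertion order.
    """
    found = []
    for category, roles in search_dict.items():
        for role in roles:
            if text and role in text:
                found.append(category)
                text = text.replace(role, "").strip()
    return found, text


# Same role lists, different category order
specific_first = {
    "event_record_other": ["school teacher and heroine"],
    "academia_humanities": ["school teacher", "teacher"],
}
generic_first = {
    "academia_humanities": ["school teacher", "teacher"],
    "event_record_other": ["school teacher and heroine"],
}

print(categorize("school teacher and heroine", specific_first))
print(categorize("school teacher and heroine", generic_first))
```

With the generic category first, the entry is labelled `academia_humanities` and the leftover "and heroine" no longer matches the specific phrase.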

#### Extracting Category from `info_2`

In [125]:
%%time

# Dictionary of category keys and role-list values to search
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df["arts"] == 1].sample(2)

CPU times: total: 1min 44s
Wall time: 1min 44s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
36551,9,Myra Taylor,", 94, American jazz singer.",https://en.wikipedia.org/wiki/Myra_Taylor_(singer),24,2011,December,,,,,,,,,,,,,94.0,,United States of America,,,3.218876,0,0,0,0,0,1,0,0,0,0,0,0,1
7231,26,Shirley Ardell Mason,", 75, American psychiatric patient and art teacher, breast cancer.",https://en.wikipedia.org/wiki/Shirley_Ardell_Mason,20,1998,February,,,,breast cancer,,,,,,,,,75.0,,United States of America,,,3.044522,0,0,0,1,0,1,0,0,0,0,0,0,2


#### Checking the Number of Rows without a First Category

In [126]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 23944 entries without any known_for category.


#### Observations:
- With 23,944 entries still lacking a `known_for` category, we will export the dataframe here and continue the extraction in a new notebook.

### Exporting Dataset to SQLite Database [wp_life_expect_clean10.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean10.db)

In [127]:
# Exporting dataframe

# Saving dataset in a SQLite database
conn = sql.connect("wp_life_expect_clean10.db")
df.to_sql("wp_life_expect_clean10", conn, index=False)

# Chime notification when cell executes
chime.success()
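
`DataFrame.to_sql` defaults to `if_exists="fail"`, so re-running the export cell raises a `ValueError` once the table already exists. A minimal sketch of a re-runnable export with a read-back check follows. It uses an in-memory database and a toy frame so nothing on disk is touched, and the table name `demo` is hypothetical.

```python
import sqlite3 as sql

import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({"name": ["A", "B"], "num_categories": [1, 0]})

conn = sql.connect(":memory:")
try:
    # if_exists="replace" overwrites the table instead of raising on a re-run
    toy.to_sql("demo", conn, index=False, if_exists="replace")
    toy.to_sql("demo", conn, index=False, if_exists="replace")

    # Read the table back to confirm that the export round-trips
    check = pd.read_sql("SELECT * FROM demo", conn)
finally:
    conn.close()

print(check.shape == toy.shape)
```

Whether overwriting is desirable depends on the workflow; keeping the default is a reasonable guard against clobbering an earlier export by accident.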

# [Proceed to Data Cleaning Part 11](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean11_thanak_2022_07_26.ipynb)