# Wikipedia Notable Life Expectancies
# [Notebook 8: Data Cleaning Part 7](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean7_thanak_2022_07_26.ipynb)
### Context

The
### Objective

The
### Data Dictionary
- Feature: Description

### Importing Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")


## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean6.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean6", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 98060 rows and 38 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,dancer,ballet designer and director,,,,,,,,,86.0,,United Kingdom of Great Britain and Northern Ireland,,,3.091042,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,economist,writer,and academic,,,,,,,,68.0,,Ireland,,,2.564949,0,0,0,0,0,0,0,0,0,0,0,0,0
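
The read cell above leaves `conn` open. As a minimal sketch (not the notebook's own code), the same read can be wrapped so the connection is closed once the data is loaded. The in-memory database and the `toy_table` name are assumptions for illustration only.

```python
import sqlite3 as sql
from contextlib import closing

import pandas as pd

# Hypothetical in-memory database standing in for wp_life_expect_clean6.db
with closing(sql.connect(":memory:")) as conn:
    # Toy rows written first so that there is something to read back
    pd.DataFrame({"name": ["A", "B"], "age": [86.0, 68.0]}).to_sql(
        "toy_table", conn, index=False
    )
    toy = pd.read_sql("SELECT * FROM toy_table", conn)

# The connection is closed here; the dataframe remains usable
print(f"There are {toy.shape[0]} rows and {toy.shape[1]} columns.")
```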



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
98058,9,Aamir Liaquat Hussain,", 50, Pakistani journalist and politician, MNA .",https://en.wikipedia.org/wiki/Aamir_Liaquat_Hussain,99,2022,June,", since",,,MNA,,,,,,,,,50.0,,Pakistan,,"2002 2007, since 2018",4.60517,0,0,0,0,0,1,0,0,1,0,0,0,2
98059,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,engineer,member of the Academy of Engineering,,,,,,,,,86.0,,"China, People's Republic of",,,1.386294,0,0,0,0,0,0,0,0,0,0,0,0,0



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
32150,6,Cacilda Borges Barbosa,", 96, Brazilian electronic musician.",https://en.wikipedia.org/wiki/Cacilda_Borges_Barbosa,7,2010,August,,,,,,,,,,,,,96.0,,Brazil,,,2.079442,0,0,0,0,0,1,0,0,0,0,0,0,1
72620,12,Batton Lash,", 65, American comic book writer and artist , brain cancer.",https://en.wikipedia.org/wiki/Batton_Lash,15,2019,January,",",,,brain cancer,,,,,,,,,65.0,,United States of America,,",",2.772589,0,0,0,0,0,1,0,0,0,0,0,0,1
89352,19,Nasir Durrani,", 64, Pakistani police officer, inspector general of the Khyber Pakhtunkhwa Police , COVID-19.",https://en.wikipedia.org/wiki/Nasir_Durrani,5,2021,April,,,police officer,inspector general of the Khyber Pakhtunkhwa Police,COVID,,,,,,,,64.0,,Pakistan,,2013 2017,1.791759,0,0,0,0,0,0,0,0,0,0,0,0,0
63046,6,Michael McPartland,", 77, British Roman Catholic priest.",https://en.wikipedia.org/wiki/Michael_McPartland,4,2017,April,,,Catholic priest,,,,,,,,,,77.0,,United Kingdom of Great Britain and Northern Ireland,Italy,,1.609438,0,0,0,0,0,0,0,0,0,0,0,0,0
35587,26,Patrick C. Fischer,", 75, American computer scientist and Unabomber target.",https://en.wikipedia.org/wiki/Patrick_C._Fischer,13,2011,August,,,computer scientist and Unabomber target,,,,,,,,,,75.0,,United States of America,,,2.639057,0,0,0,0,0,0,0,0,0,0,0,0,0



### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98060 entries, 0 to 98059
Data columns (total 38 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   day                        98060 non-null  object 
 1   name                       98060 non-null  object 
 2   info                       98060 non-null  object 
 3   link                       98060 non-null  object 
 4   num_references             98060 non-null  int64  
 5   year                       98060 non-null  int64  
 6   month                      98060 non-null  object 
 7   info_parenth               36661 non-null  object 
 8   info_1                     22 non-null     object 
 9   info_2                     98028 non-null  object 
 10  info_3                     48896 non-null  object 
 11  info_4                     10264 non-null  object 
 12  info_5                     1265 non-null   object 
 13  info_6                     181 non-null    obj


#### Observations:
- With the dataset loaded, we can resume extracting `known_for` values by rebuilding `known_for_dict`.

### Extracting `known_for` Continued

#### Finding `known_for` Roles in `info_2`

In [6]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [7]:
# # Code to check each value
# roles_list.pop()
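
The review cells rely on the list being sorted by ascending count, so that each `pop()` returns the most frequent value still waiting to be checked. A minimal sketch on a toy Series (the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for df["info_2"]
toy_roles = pd.Series(
    ["priest", "physicist", "physicist", "architect", "physicist", "architect"]
)

# Ascending counts put the most frequent value at the end of the list
roles_list = toy_roles.value_counts(ascending=True).index.tolist()

# pop() removes and returns the last item, i.e. the most frequent remaining value
most_common = roles_list.pop()
print(most_common, roles_list)
```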


In [8]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "Catholic prelate" in df.loc[index, "info"]],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [9]:
# # Code to check each specific value
# specific_roles_list.pop()


In [10]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "defrocked" in df.loc[index, "info"]]]


In [11]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "defrocked Catholic prelate"]
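
The screening cells above select rows with a list comprehension over `df.index`. A possible alternative, sketched here on toy rows (the data and the `toy` name are assumptions), is a boolean mask from `Series.str.contains` with `regex=False`, which performs the same substring test without a Python-level loop:

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe
toy = pd.DataFrame(
    {
        "info": [
            ", 77, Ugandan Roman Catholic prelate.",
            ", 80, defrocked priest.",
            None,
        ],
        "info_2": ["Catholic prelate", "defrocked priest", None],
    }
)

# Quick-screen rows whose info text contains a search term
screened = toy[toy["info"].str.contains("defrocked", regex=False, na=False)]

# Specific info_2 values for rows mentioning a broader role
mask = toy["info"].str.contains("Catholic prelate", regex=False, na=False)
specific_roles_list = toy.loc[mask, "info_2"].value_counts().index.tolist()
print(specific_roles_list)
```

Passing `na=False` treats missing text as a non-match, so null rows are skipped rather than raising an error.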


#### Creating Lists for Each `known_for` Category

In [12]:
# Creating lists for each category
politics_govt_law = [
    "Patriotic",
]

arts = []
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = [
    "Syro Malabar Catholic prelate",
    "Eastern Catholic prelate",
    "clandestine Catholic prelate",
    "Old Catholic prelate",
    "Catholic prelate and theologian",
    "Catholic prelate and first cardinal",
    "Maronite Catholic prelate",
    "Catholic prelate and Cardinal",
    "Coptic Catholic prelate",
    "Catholic prelate and bishop",
    "Catholic prelate and cardinal",
    "Catholic prelate and",
    "and Catholic prelate",
    "Catholic prelate",
]
social = []
crime = [
    "defrocked",
]
event_record_other = []
other_species = []
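
The role lists appear to be ordered from the most specific string to the most generic one. Because the extraction step matches by substring and removes whatever it matched, that order affects what is left behind in `info_2`. A minimal sketch, where `strip_roles` is a hypothetical helper and not part of the notebook:

```python
def strip_roles(text, roles):
    """Remove each role found in text, in list order, as the extraction loop does."""
    for role in roles:
        if role in text:
            text = text.replace(role, "").strip()
    return text


value = "Catholic prelate and theologian"

# Specific role first: the whole value is consumed
specific_first = strip_roles(
    value, ["Catholic prelate and theologian", "Catholic prelate"]
)

# Generic role first: a fragment is left over for later cleanup
generic_first = strip_roles(
    value, ["Catholic prelate", "Catholic prelate and theologian"]
)

print(repr(specific_first), repr(generic_first))
```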


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [13]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [14]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['spiritual'] ==1].sample(2)

CPU times: total: 8.33 s
Wall time: 8.33 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
86950,20,John Baptist Kaggwa,", 77, Ugandan Roman Catholic prelate, bishop of Masaka , COVID-19.",https://en.wikipedia.org/wiki/John_Baptist_Kaggwa,9,2021,January,,,,bishop of Masaka,COVID,,,,,,,,77.0,,Uganda,Italy,1998 2019,2.302585,0,0,1,0,0,0,0,0,0,0,0,0,1
10945,3,John Joseph O'Connor,", 80, American Roman Catholic prelate.",https://en.wikipedia.org/wiki/John_O%27Connor_(cardinal),86,2000,May,,,,,,,,,,,,,80.0,,United States of America,Italy,,4.465908,0,0,1,0,0,0,0,0,0,0,0,0,1
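
The nested loop above visits every row once per role, which is likely why the longer role lists later in this notebook take over a minute. The sketch below is a vectorized variant using `Series.str.contains` and `Series.str.replace` with `regex=False`. The function name `extract_categories` and the toy data are assumptions; it is intended to mirror the loop's logic but has not been run against the full dataset.

```python
import pandas as pd


def extract_categories(frame, column, search_dict):
    """Flag categories by substring match and strip matched roles from the column."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring match; missing values never match
            mask = frame[column].str.contains(role, regex=False, na=False)
            if mask.any():
                frame.loc[mask, category] = 1
                frame.loc[mask, column] = (
                    frame.loc[mask, column]
                    .str.replace(role, "", regex=False)
                    .str.strip()
                )
    # Recompute the number of categories per row
    frame["num_categories"] = frame[list(search_dict)].sum(axis=1)
    return frame


# Toy data standing in for the notebook's dataframe
toy = pd.DataFrame(
    {
        "info_2": ["Catholic prelate", "defrocked Catholic prelate", None, "economist"],
        "spiritual": [0, 0, 0, 0],
        "crime": [0, 0, 0, 0],
    }
)
toy = extract_categories(
    toy, "info_2", {"spiritual": ["Catholic prelate"], "crime": ["defrocked"]}
)
print(toy)
```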



#### Checking the Number of Rows without a First Category

In [15]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 44902 entries without any known_for category.



#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [16]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [17]:
# # Code to check each value
# roles_list.pop()


In [18]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "physicist" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [19]:
# # Code to check each specific value
# specific_roles_list.pop()


In [20]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [index for index in df.index if "physicist and science" in df.loc[index, "info"]]
# ]


In [21]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "health physicist"]


#### Creating Lists for Each `known_for` Category

In [22]:
# Creating lists for each category
politics_govt_law = ["arms control expert", "refusenik"]

arts = []
sports = []
sciences = [
    "physicist and group leader in the Manhattan Project",
    "physicist and molecular biologist",
    "physicist and computer research executive",
    "Nobel Prize winning biophysicist",
    "physicist and grandson of Yuan Shikai",
    "physicist and polymer researcher",
    "physicist and hydrodynamicist",
    "nuclear physicist and inventor",
    "physicist and a leader in controlled fusion research",
    "computational physicist and the father of plasma based acceleration techniques",
    "biophysicist and theoretical ecologist",
    "physicist and co winner of Nobel Prize in Physics in",
    "leading physicist in the study of waves",
    "physicist and Nobel laureate",
    "physicist ane engineer",
    "astrophysicist and radio astronomer",
    "physicist and artificial intelligence pioneer",
    "physicist and civil engineer",
    'physicist who coined the term "black hole"',
    "physicist who co discovered the Wigner Seitz cell",
    "physicist and former director of SLAC",
    "physicist at Uppsala University",
    "physicist who won the Nobel Prize for Physics in",
    "physicist who built the first laser",
    "physicist who was a pioneer of solid state physics",
    "molecular biophysicist and crystallographer",
    "physicist and member of the Manhattan Project",
    "physicist and color scientist",
    "marine geologist and geophysicist",
    "theoretical physicist and astronomer",
    "physicist and electronics engineer",
    "physicist and inventor of the first digital computer",
    "pioneering biophysicist and virologist",
    "physicist Nobel Prize in Physics laureate",
    "theoretical physicist and magneto ionic theory pioneer",
    "theoretical physicist and Nobel Prize laureate",
    "physicist and jet engine designer",
    "geophysicist and oceanographer",
    "physicist and winner of the Nobel Prize in Physics",
    "physicist known for the Casimir effect",
    "nuclear physicist who worked at the Manhattan Project Metallurgical Laboratory",
    "physicist and team member of the Manhattan Project",
    "biophysicist and biochemist",
    "experimental physicist and scientist",
    "physicist and radiation health physics pioneer",
    "physicist and co inventor of the laser with Charles Townes",
    "nuclear engineer and physicist",
    'physicist known as "the father of Pulsed Power"',
    "physicist and physical chemist",
    "chemist and nuclear physicist",
    "physicist and recipient of the Nobel Prize in Physics",
    "differential geometer and mathematical physicist",
    "physicist and statistician",
    "physicist and father of Joan Baez and Mimi Fariña",
    "nuclear physicist and engineer",
    "physicist and microbiologist",
    "nuclear physicist and ufologist",
    "chemist and biophysicist",
    "condensed matter physicist",
    "physicist and researcher",
    "geophysicist and structural geologist",
    "mesoscopic physicist",
    "physicist and specialist in solid state laser",
    "physicist and aircraft designer",
    "physicist specialized in theoretical catalysis",
    "biologist and biophysicist",
    "thermal physicist",
    "atomic physicist",
    "biophysicist and science",
    "research physicist",
    "theoretical physicist and nuclear engineer",
    "neurophysicist",
    "experimental nuclear physicist",
    "health physicist",
    "physicist and parapsychologist",
    "physicist and skeptic",
    "solid state physicist",
    "biophysicist and virologist",
    "atmospheric physicist",
    "physicist and geneticist",
    "electrical engineer and physicist",
    "climate physicist",
    "nuclear and particle physicist",
    "explosives engineer and physicist",
    "physicist and neurobiologist",
    "mathematical physicist and cosmologist",
    "metallurgist and physicist",
    "mathematical geophysicist and seismologist",
    "East physicist",
    "theoretical physicist and astrophysicist",
    "Nobel Prize winning physicist",
    "optical physicist",
    "metal physicist",
    "metal and detonation physicist",
    "solar physicist",
    "oceanographic physicist",
    "geophysicist and planetary scientist",
    "astroparticle physicist",
    "accelerator physicist",
    "engineer and physicist",
    "molecular biophysicist",
    "physicist and radio astronomer",
    "physicist and meteorologist",
    "physicist and computer scientist",
    "astronomer and physicist",
    "physicist and chemist",
    "chemical physicist",
    "physicist and electrical engineer",
    "physicist and astronomer",
    "astronomer and astrophysicist",
    "medical physicist",
    "space physicist",
    "plasma physicist",
    "chemist and physicist",
    "physicist and inventor",
    "experimental physicist",
    "physicist and engineer",
    "mathematical physicist",
    "particle physicist",
    "biophysicist",
    "geophysicist",
    "and nuclear physicist",
    "nuclear physicist and",
    "nuclear physicist",
    "and astrophysicist",
    "astrophysicist and",
    "astrophysicist",
    "and theoretical physicist",
    "theoretical physicist and",
    "theoretical physicist",
    "physicist and",
    "and physicist",
    "physicist",
]

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [23]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [24]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['sciences'] ==1].sample(2)

CPU times: total: 1min 21s
Wall time: 1min 21s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
75685,10,Reinhard Bortfeld,", 92, German geophysicist.",https://en.wikipedia.org/wiki/Reinhard_Bortfeld,10,2019,July,,,,,,,,,,,,,92.0,,Germany,,,2.397895,1,0,0,0,0,0,0,0,0,0,0,0,1
10678,4,Xie Xide,", 78, Chinese physicist.",https://en.wikipedia.org/wiki/Xie_Xide,9,2000,March,,,,,,,,,,,,,78.0,,"China, People's Republic of",,,2.302585,1,0,0,0,0,0,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [25]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 44009 entries without any known_for category.



#### Observations:
- This pass reduced the entries without any `known_for` category from 44902 to 44009.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [26]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [27]:
# # Code to check each value
# roles_list.pop()


In [28]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "architect" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [29]:
# # Code to check each specific value
# specific_roles_list.pop()


In [30]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "architect and art" in df.loc[index, "info"]]]


In [31]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "naval architect"]


#### Creating Lists for Each `known_for` Category

In [32]:
# Creating lists for each category
politics_govt_law = ["asylum seeker"]

arts = [
    "architectural and interior designer",
    "photographer and architect",
    "Southwestern style architect",
    "architectural lighting designer",
    "church architect and Gothic Revival designer",
    "medieval architectural",
    "architect and organ designer",
    "landscape and garden architect",
    "set costume designer and architect",
    "architect and raconteur",
    "architect in Oregon",
    "architect of perestroika",
    "architect and acoustician",
    "architecture critic for",
    "interior designer and architect",
    "architect and historic",
    "architect and photographer",
    "furniture designer and architect",
    "architect and interior designer",
    "architect and graphic designer",
    "architect and art collector",
    "and course architect",
    "horticultural architect",
    "architect and designer of the flag of",
    "architect and furniture designer",
    "furniture designer and interior architect",
    "architect and landscape architect",
    "architecture and blues",
    "temple architect and sculptor",
    "bridge architect",
    "architectural critic",
    "industrial designer and architect",
    "architect and caveman",
    "literature and architecture",
    "space architect and spaceport planner",
    "architect and industrial designer",
    "town planner and architect",
    "architectural photographer",
    "architect and sculptor",
    "potter and architect",
    "architecture critic",
    "naval architect",
    "architect and urban designer",
    "golf course architect",
    "designer and architect",
    "architect and architectural",
    "sculptor and architect",
    "architect and town planner",
    "modernist architect",
    "architect and urban planner",
    "architect and designer",
    "landscape architect and",
    "landscape architect",
    "and architectural",
    "architectural",
    "of architecture",
    "and restoration architect",
    "architecture",
    "architect and",
    "and architect",
    "architect",
]
sports = []
sciences = [
    "computer architect and high tech",  # before arts
]

business_farming = []
academia_humanities = [
    "antique and architecture preservationist",  # before arts
]
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []
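
The `# before arts` comments above depend on the order in which categories are processed. Python dictionaries preserve insertion order, and `known_for_dict` lists `academia_humanities` and `sciences` ahead of `arts`, so those longer strings are matched and removed before the generic `"architect"` is tried. A minimal sketch of that effect, where `categorize` is a hypothetical helper and not part of the notebook:

```python
def categorize(text, search_dict):
    """Return the categories hit, in dictionary order, and the leftover text."""
    hits = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in text:
                hits.append(category)
                text = text.replace(role, "").strip()
    return hits, text


value = "computer architect and high tech"

# sciences is processed before arts, as in known_for_dict
sciences_first = {"sciences": [value], "arts": ["architect"]}
# Reversed order, shown only for contrast
arts_first = {"arts": ["architect"], "sciences": [value]}

hits_sciences_first, _ = categorize(value, sciences_first)
hits_arts_first, _ = categorize(value, arts_first)
print(hits_sciences_first, hits_arts_first)
```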


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [33]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [34]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['arts'] ==1].sample(2)

CPU times: total: 34.9 s
Wall time: 34.9 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
19742,20,J. D. Cannon,", 83, American actor.",https://en.wikipedia.org/wiki/J._D._Cannon,4,2005,May,,,,,,,,,,,,,83.0,,United States of America,,,1.609438,0,0,0,0,0,1,0,0,0,0,0,0,1
10227,4,John Douglas Pringle,", 87, Australian journalist.",https://en.wikipedia.org/wiki/John_Douglas_Pringle,11,1999,December,,,,,,,,,,,,,87.0,,Australia,,,2.484907,0,0,0,0,0,1,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [35]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 43450 entries without any known_for category.



#### Observations:
- This pass reduced the entries without any `known_for` category from 44009 to 43450.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [36]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [37]:
# # Code to check each value
# roles_list.pop()


In [38]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "photographer" in df.loc[index, "info"]],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [39]:
# # Code to check each specific value
# specific_roles_list.pop()


In [40]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "photographer of the" in df.loc[index, "info"]]]


In [41]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "public relations executive and photographer"]


#### Creating Lists for Each `known_for` Category

In [42]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "crime photographer",
    "photographer and illustrator",
    "adult photographer",
    "photographer of children",
    'photographer who pioneered "environmental portraiture"',
    "photographer at the fall of Saigon",
    "photographer of indigenous peoples in",
    "underwater photographer and filmmaker",
    "photographer and founder of",
    "photographer and editor",
    "photographer and camera operator",
    "underwater nature photographer",
    "photographer and news executive",
    "fashion and portrait photographer",
    "photographer based in San Francisco",
    "photographer born in Mérida",
    "photographer and secret FBI",
    "photographer during World War II",
    "glamour photographer and director of pornographic films",
    "photographer and war correspondent",
    "newspaper photographer",
    "double bassist and photographer",
    "photographer and photo essayist",
    "portrait photographer",
    "wilderness photographer",
    "press photographer",
    "music producer and photographer",
    "publisher and photographer",
    "photographer and art critic",
    "photographer and ballet dancer",
    "photographer and publicist",
    "photographer and theatre director",
    "engraver and photographer",
    "advertising photographer",
    "erotic photographer",
    "graphic designer and photographer",
    "photographer and art director",
    "public relations executive and photographer",
    "punk rock and art photographer",
    "fine art photographer",
    "photographer and documentary filmmaker",
    "environmental photographer",
    "photographer and blogger",
    "photographer and cinematographer",
    "photographer and biographer",
    "printmaker and photographer",
    "jazz and blues photographer",
    "commercial photographer",
    "aerial photographer and director",
    "newspaper and magazine photographer",
    "photographer and film maker",
    "filmmaker and photographer",
    "photographer and graphic designer",
    "street photographer",
    "photographer and model",
    "model and photographer",
    "news photographer",
    "and wildlife photographer",
    "wildlife photographer",
    "jazz photographer",
    "aerial photographer",
    "documentary photographer",
    "celebrity photographer",
    "photographer and filmmaker",
    "art photographer",
    "Pulitzer Prize winning photographer",
    "fashion photographer and",
    "fashion photographer",
    "photographer of the",
    "and Holocaust photographer",
    "photographer and",
    "and photographer",
    "photographer",
]
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [43]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [44]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe of rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['arts'] ==1].sample(2)

CPU times: total: 40 s
Wall time: 40 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
76355,23,Mary Abbott,", 98, American painter.",https://en.wikipedia.org/wiki/Mary_Abbott_(artist),12,2019,August,,,,,,,,,,,,,98.0,,United States of America,,,2.564949,0,0,0,0,0,1,0,0,0,0,0,0,1
60409,18,Anthony Addabbo,", 56, American actor .",https://en.wikipedia.org/wiki/Anthony_Addabbo,35,2016,October,", ,",,,,,,,,,,,,56.0,,United States of America,,", ,",3.583519,0,0,0,0,0,1,0,0,0,0,0,0,1
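
The nested loop above visits every row once for every role, which likely accounts for most of the run time reported for this cell. The following is a minimal sketch of the same substring logic using pandas string methods, with one vectorized pass per role. The helper name `extract_categories` and the toy frame are hypothetical and not part of the notebook; the sketch assumes the column holds only strings or missing values.

```python
import pandas as pd


def extract_categories(df, column, search_dict):
    """Flag categories and strip matched roles, one vectorized pass per role."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring match; missing values never match
            mask = df[column].str.contains(role, regex=False, na=False)
            if mask.any():
                df.loc[mask, category] = 1
                df.loc[mask, column] = (
                    df.loc[mask, column].str.replace(role, "", regex=False).str.strip()
                )
    return df


# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame(
    {
        "info_2": ["fashion photographer and", "painter", None],
        "arts": [0, 0, 0],
    }
)
toy = extract_categories(
    toy, "info_2", {"arts": ["fashion photographer and", "photographer"]}
)
print(toy)
```

As in the loop, roles are applied in list order, so the more specific role still has to come before the generic one.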


#### Checking the Number of Rows without a First Category

In [45]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 42922 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [46]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [47]:
# # Code to check each value
# roles_list.pop()
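
Because `value_counts(ascending=True)` sorts from least to most frequent, `roles_list.pop()` returns the most common remaining value first. A toy illustration with made-up values standing in for `df["info_2"]`:

```python
import pandas as pd

# Hypothetical toy column standing in for df["info_2"]
toy = pd.Series(["economist", "judge", "economist", "novelist", "economist", "judge"])

# Ascending counts put the most frequent value at the end of the list
roles_list = toy.value_counts(ascending=True).index.tolist()

most_common = roles_list.pop()  # removes and returns the last item
next_most_common = roles_list.pop()
print(most_common, next_most_common)
```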

In [48]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "economist" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )
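
The commented list comprehension above tests each row's `info` string in plain Python. A possible vectorized equivalent uses `Series.str.contains` with `regex=False`; the toy frame below is hypothetical and only mimics the `info` and `info_2` columns.

```python
import pandas as pd

# Hypothetical toy frame with the two columns the commented cell uses
toy = pd.DataFrame(
    {
        "info": [
            ", 70, American economist.",
            ", 81, British health economist and author.",
            ", 64, French painter.",
            ", 59, Indian economist.",
        ],
        "info_2": ["economist", "health economist and author", "painter", "economist"],
    }
)

# Boolean mask of rows whose raw `info` text mentions the search term
mask = toy["info"].str.contains("economist", regex=False, na=False)

# Remaining `info_2` values for those rows, most frequent first
specific_roles_list = toy.loc[mask, "info_2"].value_counts().index.tolist()
print(specific_roles_list)
```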

In [49]:
# # Code to check each specific value
# specific_roles_list.pop()

In [50]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "health economist" in df.loc[index, "info"]]]

In [51]:
# # Example code to quick-check a specific entry
# df[
#     df["info_2"]
#     == "economist who did pioneering research in linear programming and environmental economics"
# ]

#### Creating Lists for Each `known_for` Category

In [52]:
# Creating lists for each category
politics_govt_law = [
    "jurist and economist",
    "Marxian economist and founding editor of the",
    "economist specializing in public economics and information economics",
    "supply side economist",
    "libertarian economist",
    "monetarist and free market economist",
    "economist and government adviser",
    "economist and banking official",
    "economist and government advisor",
    "Marxian economist and a Trotskyist activist and",
    "economist and government minister",
    "economist and Nobel laureate",
    "macroeconomist",
    "monetary economist",
    "economist and government official",
    "public servant and economist",
    "labor economist",
    "economist and communist",
    "economist who did pioneering research in linear",
    "development economist and",
    "Marxist economist",
    "political scientist and economist",
    "Gandhian economist",
    "aristocrat and economist",
    "economist and PZPR activist",
    "economist and political scientist",
    "economist and politologist",
    "economist and taxpayer activist",
    "administrator and economist",
    "economist and government policy advisor",
    "economist and policy adviser",
    "economist and social activist",
    "economist and political adviser",
    "economist and lobbyist",
    "economist and laureate of the Nobel Memorial Prize in Economic Sciences",
    "political economist and activist",
    "lawyer and economist",
    "economist and Nobel Prize laureate",
    "Marxian economist",
    "development economist",
    "civil servant and economist",
    "economist and political activist",
    "feminist economist",
    "economist and public servant",
    "Nobel Prize winning economist",
    "health economist",
    "agricultural economist",
    "and political economist",
    "political economist",
    "economist and an",
    "and economist",
    "economist and",
    "economist",
]

arts = []
sports = []
sciences = [
    "home economist",  # before politics_govt_law
]

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = ["convicted embezzler"]
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [53]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [54]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df["politics_govt_law"] == 1].sample(2)

CPU times: total: 34.6 s
Wall time: 34.6 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
32384,31,John Rowswell,", 55, Canadian politician, Mayor of Sault Ste. Marie, Ontario.",https://en.wikipedia.org/wiki/John_Rowswell,6,2010,August,,,,Mayor of Sault Ste Marie,,,,,,,,,55.0,,Canada,Canada,,1.94591,0,0,0,0,0,0,0,0,1,0,0,0,1
5227,29,Ken Harada,", 77, Japanese politician.",https://en.wikipedia.org/wiki/Ken_Harada_(politician),13,1997,January,,,,,,,,,,,,,77.0,,Japan,,,2.639057,0,0,0,0,0,0,0,0,1,0,0,0,1


#### Checking the Number of Rows without a First Category

In [55]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 42456 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [56]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [57]:
# # Code to check each value
# roles_list.pop()

In [58]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[[index for index in df.index if "judge" in df.loc[index, "info"]], "info_2",]
#     .value_counts()
#     .index.tolist()
# )

In [59]:
# # Code to check each specific value
# specific_roles_list.pop()

In [60]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "judge and legal" in df.loc[index, "info"]]]

In [61]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "circuit judge and tabloid columnist"]

#### Creating Lists for Each `known_for` Category

In [62]:
# Creating lists for each category
politics_govt_law = [
    "military judge and",
    "judge on the Supreme Court of Queensland",
    "judge on the Oregon Supreme Court",
    "senior judge for the Central District Court",
    "judge and Representative from Alabama",
    "appeals court judge",
    "intellectual property lawyer and High Court judge",  # before "High Court judge"
    "High Court judge",
    "judge of the ACT Supreme Court",
    "judge and hereditary peer",
    "judge and Law Lord",
    "circuit judge for the Court of Appeals for the Ninth Circuit",
    "judge and feminist",
    "judge of the District Court for the Western District of Missouri",
    "judge and public servant",
    "District Court judge",
    "Bankruptcy Court judge",
    "who was the first female Supreme Court judge",
    "State judge and prosecutor at the Nuremberg war crimes trials",
    "judge and influential patent attorney",
    "district judge overseeing desegregation in the South",
    "judge and former Lord Chief Justice",
    "judge and peer",
    "civil rights lawyer and the first female federal judge",
    "federal judge who crafted the mass settlement of asbestos lawsuits",
    "senior federal judge and the first black federal prosecutor in history",
    "judge on the Court of Appeals for the Third Circuit",
    "prominent judge sitting in highest court",
    "former chief judge of the Court of Appeals for the Third Circuit",
    "Superior Court judge who presided over the Charles Manson trial",
    "and Ohio judge for years",
    "judge and Vice Chancellor of the Supreme Court",
    "City family court judge and first female judge",
    "senior judge of the Family Division of the High Court",
    "former chief judge",
    "senior judge of the District Court for the Southern District of Alabama and judge for the Middle District of Alabama",
    "senior judge of the District Court for the Southern District of",  # after the longer Alabama entry
    "senior federal appellate judge",
    "civil rights activist and judge",
    "first female judge of",
    "judge in the",
    "judge and chairperson of the Electoral Commission",
    "judge and political activist",
    "judge and anti apartheid activist",
    "lawyer and Supreme Court judge",
    "judge and disability rights campaigner",
    "senior judge of the Court of Appeals for the Ninth Circuit",
    "jurist and judge",
    "judge and independence activist",
    "senior judge of the District Court for the District of New",
    "attorney and tribal judge",
    "judge and prosecutor",
    "judge and civil servant",
    "judge and ombudsman",
    "jurist and Supreme Court judge",
    "senior federal judge",
    "lawyer and state judge",
    "colonial official and judge",
    "judge of the High Court of and",
    "Navajo judge",
    "senior and chief judge",
    "legislator and federal judge",
    "senior circuit judge",
    "judge and legal",
    "judge and barrister",
    "judge and law lord",
    "district judge and",
    "district judge",
    "judge and life peer",
    "chief judge",
    "senior judge of the District Court for the Eastern District of",
    "attorney and judge",
    "district court judge",
    "judge and lawyer",
    "judge and jurist",
    "state judge",
    "Supreme Court judge",
    "barrister and judge",
    "senior judge",
    "lawyer and judge",
    "federal judge and",
    "federal judge",
    "circuit judge and",
    "judge and",
    "and judge",
    "judge",
]

arts = []
sports = [
    "dog show judge",  # before politics_govt_law
    "boxing judge and",
    "draughts player and judge",
]
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []
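
Roles are matched as substrings and removed in list order, so a short role placed ahead of a longer role that contains it strips part of the longer string first, and the longer entry can then never match. A small, hypothetical helper (`find_shadowed_roles`, not part of the notebook) could flag such pairs before a list is run; this is a sketch of one possible check, not a step the notebook performs.

```python
def find_shadowed_roles(roles):
    """Return (earlier, later) pairs where the earlier role is a substring of a later one."""
    shadowed = []
    for i, earlier in enumerate(roles):
        for later in roles[i + 1 :]:
            if earlier != later and earlier in later:
                shadowed.append((earlier, later))
    return shadowed


# Illustrative lists only; the role strings are borrowed from this notebook
good_order = ["district judge and", "district judge", "judge"]
bad_order = ["judge", "district judge"]

print(find_shadowed_roles(good_order))
print(find_shadowed_roles(bad_order))
```

An empty result means every specific role precedes the generic roles it contains.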

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [63]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "sports": sports,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
}
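
Python dictionaries iterate in insertion order (guaranteed since Python 3.7), so placing `sports` ahead of `politics_govt_law` in `known_for_dict` is what lets "dog show judge" be claimed by `sports` before the generic "judge" role is searched. A toy illustration, using a hypothetical helper `first_matching_category` that is not part of the notebook:

```python
def first_matching_category(text, search_dict):
    """Return the first category whose role list contains a substring of text."""
    for category, roles in search_dict.items():
        for role in roles:
            if role in text:
                return category
    return None


# Same roles, two different key orders
sports_first = {"sports": ["dog show judge"], "politics_govt_law": ["judge"]}
politics_first = {"politics_govt_law": ["judge"], "sports": ["dog show judge"]}

print(first_matching_category("dog show judge", sports_first))
print(first_matching_category("dog show judge", politics_first))
```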

#### Extracting Category from `info_2`

In [64]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df["politics_govt_law"] == 1].sample(2)

CPU times: total: 49.7 s
Wall time: 49.7 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
93706,31,James Jemut Masing,", 72, Malaysian politician, Sarawak MLA , complications from COVID-19.",https://en.wikipedia.org/wiki/James_Jemut_Masing,34,2021,October,since,,,Sarawak MLA,complications from COVID,,,,,,,,72.0,,Malaysia,,since 1983,3.555348,0,0,0,0,0,0,0,0,1,0,0,0,1
44704,4,Raphael Dinyando,", 53, Namibian politician and diplomat, Ambassador to Austria , cancer.",https://en.wikipedia.org/wiki/Raphael_Dinyando,3,2013,September,"since , MP for Rundu , Mayor of Rundu",,,Ambassador to,cancer,,,,,,,,53.0,,Namibia,,"since 2010, MP for Rundu 2000 2010, Mayor of Rundu 1993 1998",1.386294,0,0,0,0,0,0,0,0,1,0,0,0,1


#### Checking the Number of Rows without a First Category

In [65]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 41858 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [66]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [67]:
# # Code to check each value
# roles_list.pop()

In [68]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "military officer" in df.loc[index, "info"]],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [69]:
# # Code to check each specific value
# specific_roles_list.pop()

In [70]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "coup leader" in df.loc[index, "info"]]]

In [71]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "Karen military officer"]

#### Creating Lists for Each `known_for` Category

In [72]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = [
    "WWII military officer",
    "military officer and intelligence official",
    "Karen military officer",
    "Air Force military officer",
    "military officer and war veteran",
    "Resistance member and military officer",
    "military officer and resistance fighter",
    "military officer and National Hero of",
    "military officer of World War I and World War II",
    "and later military officer",
    "military officer and veteran affairs",
    "CIA paramilitary officer",
    "military officer and Hero of the Union",
    "military officer and coup leader",
    "military officer and pilot",
    "and military officer",
    "military officer and",
    "military officer",
]
spiritual = []
social = []
crime = [
    "human trafficker",
]
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [73]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [74]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df["law_enf_military_operator"] == 1].sample(2)

CPU times: total: 11.7 s
Wall time: 11.7 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
68569,14,Michael D. Healy,", 91, American military officer.",https://en.wikipedia.org/wiki/Michael_D._Healy,11,2018,April,,,,,,,,,,,,,91.0,,United States of America,,,2.484907,0,0,0,0,0,0,0,1,0,0,0,0,1
64949,19,Pyotr Deynekin,", 79, Russian military officer, commander of the Russian Air Force .",https://en.wikipedia.org/wiki/Pyotr_Deynekin,7,2017,August,", Hero of the Federation",,,commander of the Air Force,,,,,,,,,79.0,,Russia,,"1992 1998, Hero of the Russian Federation 1997",2.079442,0,0,0,0,0,0,0,1,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [75]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 41548 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [76]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [77]:
# # Code to check each value
# roles_list.pop()

In [78]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "jurist" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [79]:
# # Code to check each specific value
# specific_roles_list.pop()

In [80]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "jurist and lecturer" in df.loc[index, "info"]]]

#### Creating Lists for Each `known_for` Category

In [81]:
# Creating lists for each category
politics_govt_law = [
    "federal jurist",
    "barrister and jurist",
    "jurist and legal analyst",
    "jurist and blind rights campaigner",
    "jurist and last known World War I",
    "jurist and law lord",
    "jurist and Judge of the International Criminal Tribunal for the Former",
    "jurist and Chief Judge",
    "jurist and Chief Justice",
    "jurist and statesman",
    "jurist and women rights activist",
    "jurist and first female chief justice of the North Carolina Supreme Court",
    "jurist and former Governor of Sindh province",
    "jurist and Governor General",
    "jurist and human rights advocate",
    "Virgin Islander jurist",
    "jurist and human rights activist",
    "jurist and legal",
    "civil rights lawyer and jurist",
    "jurist and magistrate",
    "civil rights activist and jurist",
    "jurist and prosecutor",
    "jurist and public servant",
    "jurist and lawyer",
    "jurist and life peer",
    "attorney and jurist",
    "lawyer and jurist",
    "and jurist",
    "jurist and",
    "jurist",
]

arts = []
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [82]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [83]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df["arts"] == 1].sample(2)

CPU times: total: 16.8 s
Wall time: 16.8 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
85045,2,Baron Wolman,", 83, American photographer , complications from amyotrophic lateral sclerosis.",https://en.wikipedia.org/wiki/Baron_Wolman,9,2020,November,,,,complications from amyotrophic lateral sclerosis,,,,,,,,,83.0,,United States of America,,,2.302585,0,0,0,0,0,1,0,0,0,0,0,0,1
97505,2,Đuro Seder,", 94, Croatian painter.",https://en.wikipedia.org/wiki/%C4%90uro_Seder,3,2022,May,,,,,,,,,,,,,94.0,,Croatia,,,1.386294,0,0,0,0,0,1,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [84]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 41124 entries without any known_for category.


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [85]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [86]:
# # Code to check each value
# roles_list.pop()

In [87]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "novelist" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [88]:
# # Code to check each specific value
# specific_roles_list.pop()

In [89]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "n novelist"]

#### Creating Lists for Each `known_for` Category

In [90]:
# Creating lists for each category
politics_govt_law = [
    "peer and son of novelist John Buchan",  # before arts
]

arts = [
    "novelist and editor",
    "novelist under pseudonyms",
    "Booker Prize winning novelist",
    "novelist whose books were popular in the s and s",
    "novelist and memoirist",
    "novelist who wrote a book shortlisted for the Booker Prize in",
    "food  novelist",
    "and mystery novelist",
    "Western novelist",
    "experimental novelist",
    "novelist and Pullizer Prize winner",
    "horror novelist and playwright",
    "young adult novelist",
    "dancer and novelist",
    "Nisei novelist and playwright",
    "novelist and critic",
    "wuxia novelist",
    "literary editor and novelist",
    "science fiction novelist",
    "editor and novelist",
    "novelist and filmmaker",
    "novelist and literary columnist",
    "adaption novelist",
    "book publisher and novelist",
    "publisher and novelist",  # after "book publisher and novelist"
    "novelist and literary agent",
    "children novelist and illustrator",
    "children novelist",  # after "children novelist and illustrator"
    "giallo novelist",
    "Konkani litterateur and novelist",
    "literary and jazz critic",
    "film producer and novelist",
    "novelist and essayist",
    "playwright and novelist",
    "romance novelist",
    "biographer and novelist",
    "spy novelist",
    "novelist and television producer",
    "fantasy novelist",
    "novelist and dramatist",
    "mystery novelist",
    "romantic novelist",
    "historical novelist",
    "literary critic and novelist",
    "critic and novelist",
    "novelist and biographer",
    "novelist and literary critic",
    "and crime novelist",
    "crime novelist",
    "novelist and playwright",
    "adventure novelist and",
    "n novelist",
    "and novelist",
    "novelist and",
    "novelist",
]
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [91]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "politics_govt_law": politics_govt_law,  # searched before arts for the John Buchan entry
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [92]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skip values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df["arts"] == 1].sample(2)

CPU times: total: 30 s
Wall time: 30 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
3704,16,Roger Bowen,", 63, American actor , heart attack.",https://en.wikipedia.org/wiki/Roger_Bowen,3,1996,February,", ,",,,heart attack,,,,,,,,,63.0,,United States of America,,", ,",1.386294,0,0,0,0,0,1,0,0,0,0,0,0,1
52722,3,Paul Grigoriu,", 70, Romanian radio personality .",https://en.wikipedia.org/wiki/Paul_Grigoriu,4,2015,April,SRR,,,,,,,,,,,,70.0,,Romania,,SRR,1.609438,0,0,0,0,0,1,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [93]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 40662 entries without any known_for category.
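
The counts printed after each iteration make it easy to see how much each pass contributes. The short sketch below recomputes the per-iteration gains from the six counts printed in this section; the `terms` labels are shorthand for the main search term of each pass (each pass also included a few other roles), and the variable names are illustrative only.

```python
# Counts printed by the "rows without a first category" cells in this section
remaining = [42922, 42456, 41858, 41548, 41124, 40662]

# Main search term of each iteration that followed the first count
terms = ["economist", "judge", "military officer", "jurist", "novelist"]

# Entries newly categorized by each iteration
gains = [before - after for before, after in zip(remaining, remaining[1:])]

for term, gain in zip(terms, gains):
    print(f"{term}: {gain} entries categorized")

print(f"Total: {sum(gains)}")
```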


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [94]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [95]:
# # Code to check each value
# roles_list.pop()

In [96]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "lawyer" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [97]:
# # Code to check each specific value
# specific_roles_list.pop()

In [98]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [index for index in df.index if "labor leader and lawyer" in df.loc[index, "info"]]
# ]

In [99]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "lawyer and administrator"]


In [100]:
# Dropping entry that points to game show page rather than page for individual
index = df[df["link"] == "https://en.wikipedia.org/wiki/Vivienne_Nearing"].index
df.drop(index, inplace=True)
df.reset_index(inplace=True, drop=True)

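An equivalent way to drop such an entry is to keep every row whose link differs from the unwanted one and then reset the index. The sketch below uses toy data with placeholder `example.org` links, which are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the working dataframe; names and links are placeholders
toy = pd.DataFrame(
    {
        "name": ["Person A", "Game Show Page", "Person B"],
        "link": [
            "https://example.org/wiki/Person_A",
            "https://example.org/wiki/Some_Game_Show",
            "https://example.org/wiki/Person_B",
        ],
    }
)

bad_link = "https://example.org/wiki/Some_Game_Show"

# Keep all rows except the one with the unwanted link, then renumber the index
toy = toy[toy["link"] != bad_link].reset_index(drop=True)

print(toy)
```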

#### Creating Lists for Each `known_for` Category

In [101]:
# Creating lists for each category
politics_govt_law = [
    "public interest lawyer",
    "lawyer and anti Apartheid activist",
    "lawyer and Chief Justice of",
    "lawyer for left wing clients and causes",
    "lawyer and key witness in the trial of Willem Holleeder",
    "Alabama state senator and lawyer",
    "labor lawyer and women rights advocate",
    "lawyer who fought for equitable access to legal services",
    "lawyer who defended dissidents",
    "defense lawyer for Saddam Hussein",
    "lawyer and civil liberties activist",
    "lawyer and former Lord of Appeal",
    "trial lawyer and political fundraiser",
    "trial lawyer and political power broker",
    "activist and lawyer",
    "pro life activist and lawyer",
    "South lawyer",
    "lawyer and partner in Jacoby & Meyers",
    "lawyer and political insider",
    "lawyer and white nationalist",
    "First Amendment lawyer",
    "lawyer and clan chief",
    "labor lawyer",
    "jailhouse lawyer",
    "lawyer and royal commissioner",
    "immigration lawyer",
    "lawyer and founder of Amnesty International",
    "libel lawyer",
    "lawyer and first Alabama Supreme Court justice",
    "lawyer and Postmaster General",
    "lawyer and longtime advisor to Jimmy Carter",
    "lawyer and Solicitor General",
    'lawyer and "fixer" for the Chicago Mafia',
    "lawyer who co founded the National Lawyers Guild",
    "lawyer and chief legal counsel to the RNC",
    "lawyer and communications executive and ambassador",
    "prosecuting lawyer who was the first attorney in the to achieve a murder conviction with exclusively circumstantial evidence",
    "lawyer who defended pacifist Ezra Pound",
    "lawyer and movie studio chairman",
    "civil rights and human rights lawyer",
    "lawyer and political advisor",
    "divorce lawyer",
    "lawyer and official",
    "criminal defense lawyer",
    "lawyer and dissident",
    "lawyer political advisor",
    "lawyer and Associate Justice of the Supreme Court of the",
    "lawyer civil rights activist",
    "lawyer and president of the ICTR",
    "lawyer and Governor of Kansas",
    "lawyer and Governor of South",
    "lawyer at the Department of Justice",
    "lawyer and a key figure in the Watergate investigation",
    "lawyer and civil rights advocate",
    "and criminal lawyer",
    "criminal lawyer and",
    "criminal lawyer",
    "lawyer and environmental activist",
    "human rights lawyer and anti apartheid activist",
    "lawyer and civil servant",
    "lawyer and civic activist",
    "lawyer and human rights advocate",
    "defense lawyer and prosecutor",
    "lawyer and presidential adviser",
    "lawyer and parliamentary draftsman",
    "landowner and lawyer",
    "lawyer and social activist",
    "lawyer and government official",
    "Māori lawyer",
    "lawyer and community leader",
    "arbitration lawyer",
    "disability activist and lawyer",
    "lawyer and legal analyst",
    "lawyer and labor activist",
    "lawyer and eminent domain",
    "lawyer and magistrate",
    "bankruptcy lawyer",
    "pro lawyer",
    "lawyer and prisoner rights activist",
    "lawyer and anti apartheid activist",
    "lawyer and right to die campaigner",
    "lawyer and prosecutor",
    "intellectual property lawyer and",
    "intellectual property lawyer",
    "lawyer and policy adviser",
    "environmental lawyer",
    "abor lawyer",
    "human rights lawyer and life peer",
    "animal welfare lawyer",
    "LGBT rights lawyer",
    "reproductive rights activist and lawyer",
    "lawyer and political strategist",
    "lawyer and LGBT activist",
    "defense lawyer",
    "trial lawyer",
    "civil servant and lawyer",
    "lawyer and public official",
    "lawyer and activist",
    "lawyer and political activist",
    "lawyer and life peer",
    "lawyer and lobbyist",
    "lawyer and feminist",
    "civil rights lawyer and activist",
    "lawyer and women rights activist",
    "human rights activist and lawyer",
    "constitutional lawyer",
    "human rights lawyer",
    "lawyer and public servant",
    "and civil rights lawyer",
    "civil rights lawyer",
    "lawyer and civil rights activist",
    "lawyer and human rights activist",
    "tax lawyer and",
    "and Native rights lawyer",
    "anti segregation lawyer and",
    "lawyer and",
    "and lawyer",
    "lawyer",
]

arts = ["for the Grateful Dead", "to the stars"]  # before politics_govt_law
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = ["gangland criminal", "disbarred"]
event_record_other = []
other_species = []

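Because each role is removed with a plain substring replacement, the order within a list matters: a shorter entry that appears earlier can strip part of a longer entry before the longer one is tested. In the list above, for example, `"activist and lawyer"` precedes `"pro life activist and lawyer"`, so the longer value would presumably leave `pro life` behind in `info_2` (the category flag is still set either way). The helper below, `find_shadowed_roles`, is a hypothetical name and not part of the notebook; it is a minimal sketch for spotting such pairs.

```python
def find_shadowed_roles(roles):
    """Return (earlier, later) pairs where an earlier entry is a substring of a later one."""
    shadowed = []
    for i, earlier in enumerate(roles):
        for later in roles[i + 1 :]:
            if earlier in later:
                shadowed.append((earlier, later))
    return shadowed


# A few entries in the same relative order as in the list above
sample_roles = [
    "activist and lawyer",
    "pro life activist and lawyer",
    "human rights lawyer",
    "lawyer",
]

print(find_shadowed_roles(sample_roles))
```

An empty result means that no entry can be partially consumed by an earlier one.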

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [102]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "arts": arts,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [103]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skipping values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['politics_govt_law'] == 1].sample(2)

CPU times: total: 1min 11s
Wall time: 1min 11s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
44058,15,Tom Greenwell,", 57, American judge, suicide by gunshot.",https://en.wikipedia.org/wiki/Tom_Greenwell,8,2013,July,,,,suicide by gunshot,,,,,,,,,57.0,,United States of America,,,2.197225,0,0,0,0,0,0,0,0,1,0,0,0,1
97467,30,Ricardo Alarcón,", 84, Cuban politician, minister of foreign affairs .",https://en.wikipedia.org/wiki/Ricardo_Alarc%C3%B3n,20,2022,April,and president of the National Assembly of People Power,,,minister of foreign affairs,,,,,,,,,84.0,,Cuba,,1992 1993 and president of the National Assembly of People Power 1993 2013,3.044522,0,0,0,0,0,0,0,0,1,0,0,0,1


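The extraction cell visits every row once per role, which accounts for the run time of about a minute. A vectorized variant is sketched below as an alternative, not as the notebook's method. The function name `extract_roles` and the toy data are assumptions. It applies the same rule in the same order (flag the category, remove the role, strip whitespace), but handles all matching rows for a role at once.

```python
import pandas as pd


def extract_roles(frame, column, search_dict):
    """Flag each category and strip matched roles from `column`, one role at a time."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal substring match; missing values count as non-matches
            mask = frame[column].str.contains(role, regex=False, na=False)
            if mask.any():
                frame.loc[mask, category] = 1
                frame.loc[mask, column] = (
                    frame.loc[mask, column]
                    .str.replace(role, "", regex=False)
                    .str.strip()
                )
    return frame


# Toy stand-in for the working dataframe
toy = pd.DataFrame(
    {
        "info_2": ["civil rights lawyer", "lawyer", None, "sculptor"],
        "politics_govt_law": 0,
        "arts": 0,
    }
)
search = {"politics_govt_law": ["civil rights lawyer", "lawyer"], "arts": []}

toy = extract_roles(toy, "info_2", search)
toy["num_categories"] = toy[list(search)].sum(axis=1)
print(toy)
```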

#### Checking the Number of Rows without a First Category

In [104]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 40126 entries without any known_for category.



#### Observations:
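- The lawyer-related roles reduced the entries without a `known_for` category from 40,662 to 40,126.
- We will proceed to rebuild `known_for_dict` for the next iteration.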

#### Finding `known_for` Roles in `info_2`

In [105]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [106]:
# # Code to check each value
# roles_list.pop()


In [107]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df.index
#             if "professional wrestler" in df.loc[index, "info"]
#         ],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [108]:
# # Code to check each specific value
# specific_roles_list.pop()


In [109]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "professional wrestler and valet"]


In [110]:
# # Hard-coding cause_of_death for entry with value in info_2
# index = df[df["link"] == "https://en.wikipedia.org/wiki/Tony_Parisi_(wrestler)"].index
# df.loc[index, "cause_of_death"] = "aneurysm"


#### Creating Lists for Each `known_for` Category

In [111]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = [
    "professional wrestler and kickboxer",
    "women professional wrestler",
    "professional wrestler known as liver transplant",
    "professional wrestler known as aneurysm",
    "professional wrestler of descent",
    'professional wrestler best known as "Hercules Hernandez" or simply just "Hercules"',
    'professional wrestler best known as "Moondog King"',
    'professional wrestler known as "The Crusher"',
    "professional wrestler and booker",
    "professional wrestler for the World Wrestling Federation",
    'Olympic judo bronze medalist and professional wrestler known as "Bad News Brown"',
    'professional wrestler known as "The Black Shadow"',
    "professional wrestler during the Great Depression era",
    "WWE professional wrestler",
    'professional wrestler known as "Biff Wellington"',
    "sumo and professional wrestler",
    "professional wrestler and talent agent",
    "professional wrestler of the s s famous for feuds with Buddy Rogers",
    "bodybuilder and professional wrestler",
    "professional wrestler and wrestling manager",
    "professional wrestler and valet",
    "professional wrestler and referee",
    "Hall of Fame professional wrestler and trainer",
    "female professional wrestler",
    "lucha libre professional wrestler",
    "professional wrestler and trainer",
    "professional wrestler and promoter",
    "professional wrestler and manager",
    "Hall of Fame professional wrestler",
    "professional wrestler known as",
    "professional wrestler and",
    "and professional wrestler",
    "professional wrestler",
]
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

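The entries in each list run from most specific to most general, ending with the bare role. The sketch below illustrates why, using three of the wrestler strings: `strip_roles` is a hypothetical helper that mirrors the substring replacement on a single value.

```python
def strip_roles(text, roles):
    """Remove each role from text in list order and report which roles matched."""
    matched = []
    for role in roles:
        if role in text:
            matched.append(role)
            text = text.replace(role, "").strip()
    return text, matched


specific_first = [
    "professional wrestler and trainer",
    "professional wrestler and",
    "professional wrestler",
]
general_first = list(reversed(specific_first))

value = "professional wrestler and trainer"

# Specific entries first: the whole value is consumed by a single match
print(strip_roles(value, specific_first))

# General entry first: a fragment is left behind in the column
print(strip_roles(value, general_first))
```

With the general entry first, the leftover `and trainer` would remain in `info_2` and need handling in a later pass.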

In [112]:
# Hard-coding cause_of_death for entry with value in info_2
index = df[df["link"] == "https://en.wikipedia.org/wiki/Jumbo_Tsuruta"].index
df.loc[index, "cause_of_death"] = "complications from liver transplant"

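When hard-coding a value by link, it can help to confirm that the link matches exactly one row before assigning. The following sketch assumes a hypothetical helper, `set_cause_of_death`, with placeholder links and a placeholder cause string. It is not part of the notebook.

```python
import pandas as pd


def set_cause_of_death(frame, link, cause):
    """Assign cause_of_death for the single row with the given link."""
    mask = frame["link"] == link
    if mask.sum() != 1:
        raise ValueError(f"Expected exactly one row for {link}, found {mask.sum()}")
    frame.loc[mask, "cause_of_death"] = cause


# Toy stand-in for the working dataframe; links are placeholders
toy = pd.DataFrame(
    {
        "link": [
            "https://example.org/wiki/Person_A",
            "https://example.org/wiki/Person_B",
        ],
        "cause_of_death": [None, None],
    }
)

set_cause_of_death(toy, "https://example.org/wiki/Person_A", "example cause")
print(toy)
```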

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [113]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [114]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skipping values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['sports'] == 1].sample(2)

CPU times: total: 22.1 s
Wall time: 22.2 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
12353,25,Sir Donald Bradman,", 92, Australian cricketer.",https://en.wikipedia.org/wiki/Don_Bradman,274,2001,February,,,,,,,,,,,,,92.0,,Australia,,,5.616771,0,0,0,0,0,0,1,0,0,0,0,0,1
14702,29,Liam O'Sullivan,", 20, Scottish footballer, drugs overdose. [1]",https://en.wikipedia.org/wiki/Liam_O%27Sullivan,3,2002,April,,,,drugs overdose [],,,,,,,,,20.0,,Scotland,,,1.386294,0,0,0,0,0,0,1,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [115]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 39727 entries without any known_for category.



#### Observations:
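- The professional wrestler roles reduced the entries without a `known_for` category from 40,126 to 39,727.
- We will proceed to rebuild `known_for_dict` for the next iteration.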

#### Finding `known_for` Roles in `info_2`

In [116]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [117]:
# # Code to check each value
# roles_list.pop()


In [118]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "philosopher" in df.loc[index, "info"]],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [119]:
# # Code to check each specific value
# specific_roles_list.pop()


In [120]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [index for index in df.index if "religious philosopher" in df.loc[index, "info"]]
# ]


In [121]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "and humanist philosopher"]


#### Creating Lists for Each `known_for` Category

In [122]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = []
sciences = []

business_farming = []
academia_humanities = [
    "philosopher and education worker in Eastern",
    "Jewish philosopher and Reform",
    "and cultural philosopher",
    "classicist and philosopher",
    "philosopher and teacher",
    "postmodernist philosopher and",
    "idealist philosopher",
    "North philosopher",
    "sinologist and philosopher",
    "philosopher and the founder of the School of Economic Science",
    "existentialist philosopher",
    "East philosopher",
    "philosopher and Vice Chancellor of Oxford University",
    "philosopher and social anthropologist",
    "educator and philosopher",
    "philosopher and professor in",
    "philosopher and professor of philosophy",
    "and philosopher of social science",
    "philosopher and professor of",
    "anthropologist and philosopher of",
    "cultural theorist and philosopher",
    "philosopher and Holocaust denier",
    "philosopher and anthropologist",
    "Islamologist and philosopher",
    "New Confucian philosopher",
    "philosopher and cultural theorist",
    "philosopher and epistemologist",
    "philosopher of language",
    "philosopher and humanist",
    "philosopher and translator",
    "educational philosopher",
    "translator and philosopher",
    "philosopher of religion",
    "anthropologist and philosopher",
    "philosopher and linguist",
    "philosopher and scholar",
    "social philosopher",
    "philosopher and professor",
    "moral philosopher",
    "philosopher of",
    "and humanist philosopher",
    "and philosopher and",
    "and philosopher",
    "philosopher and",
    "philosopher",
]
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [123]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting Category from `info_2`

In [124]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skipping values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['academia_humanities'] == 1].sample(2)

CPU times: total: 26.2 s
Wall time: 26.2 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
63881,4,Patrick G. Johnston,", 58, Northern Irish scientist and academic administrator, Vice-Chancellor of Queen's University, Belfast .",https://en.wikipedia.org/wiki/Patrick_G._Johnston,14,2017,June,,,scientist,Vice Chancellor of Queen University,Belfast,,,,,,,,58.0,,United Kingdom of Great Britain and Northern Ireland,,2014 2017,2.70805,0,0,0,1,0,0,0,0,0,0,0,0,1
94449,5,Michel Rouche,", 87, French historian and academic.",https://en.wikipedia.org/wiki/Michel_Rouche,4,2021,December,,,,,,,,,,,,,87.0,,France,,,1.609438,0,0,0,1,0,0,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [125]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 39291 entries without any known_for category.



#### Observations:
- The philosopher roles reduced the entries without a `known_for` category from 39,727 to 39,291.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [126]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [127]:
# # Code to check each value
# roles_list.pop()


In [128]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "sculptor" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [129]:
# # Code to check each specific value
# specific_roles_list.pop()


In [130]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "sculptor and art" in df.loc[index, "info"]]]


In [131]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]


In [132]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "ice hockey player and general manager"]


#### Creating Lists for Each `known_for` Category

In [133]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "sculptor and puzzle creator",
    "steel sculptor",
    "sculptor and studio potter",
    "modernist sculptor",
    "abstract expressionist sculptor",
    "figure sculptor",
    "sculptor and computer art pioneer",
    "Apache sculptor",
    "portrait sculptor",
    "surrealist sculptor",
    "ceramic sculptor",
    "sculptor and metalworker",
    "sculptor and medallist",
    "monumental sculptor",
    "sculptor and art impresario",
    "ceramicist and sculptor",
    "sculptor and publisher",
    "muralist and sculptor",
    "minimalist sculptor",
    "folk sculptor",
    "sculptor and muralist",
    "portrait and bust sculptor",
    "sculptor and ceramicist",
    "sculptor and illustrator",
    "kinetic sculptor",
    "sculptor and coin designer",
    "abstract sculptor",
    "potter and sculptor",
    "sculptor and printmaker",
    "sculptor and arts",
    "sculptor and art",
    "sculptor and",
    "and sculptor",
    "sculptor",
]
sports = [
    "ice hockey player and coach in the National Hockey League",
    "ice hockey player and referee",
    "Olympic ice hockey player and sports administrator",
    "professional ice hockey player and coach",
    "ice hockey player and head coach",
    "ice hockey player and general manager",
    "ice hockey player and speed skater",
    "college ice hockey player",
    "Hall of Fame ice hockey player and official",
    "Olympic bronze medalist ice hockey player",
    "ice hockey player and manager",
    "ice hockey player and goaltending coach",
    "ice hockey player and scout",
    "Olympic silver medalist ice hockey player",
    "ice hockey player and Hall of Fame member",
    "Olympic ice hockey player and coach",
    "Hall of Fame ice hockey player and coach",
    "football and ice hockey player",
    "ice hockey player and executive",
    "Olympic silver medallist ice hockey player",
    "Olympic champion ice hockey player",
    "Olympic ice hockey player and sailor",
    "professional ice hockey player and",
    "professional ice hockey player",
    "ice hockey player and coach",
    "Hall of Fame ice hockey player",
    "Olympic ice hockey player",
    "ice hockey player and",
    "and ice hockey player",
    "ice hockey player",
]
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

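The lists above were ordered by hand. One heuristic for producing a safe order automatically is to sort by length, longest first: a string can never be a proper substring of a shorter one, so no entry can be partially consumed by an earlier entry. This is a sketch under that assumption. `order_specific_first` is a hypothetical helper, and the sort does not reproduce any deliberate hand ordering among entries of similar length.

```python
def order_specific_first(roles):
    """Sort roles longest first; ties keep their original relative order."""
    return sorted(roles, key=len, reverse=True)


unordered = [
    "ice hockey player",
    "Olympic ice hockey player",
    "ice hockey player and coach",
]

ordered = order_specific_first(unordered)
print(ordered)
```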

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [134]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "sports": sports,
    "arts": arts,
}


#### Extracting Category from `info_2`

In [135]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            # Skipping values already emptied by an earlier role
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['arts'] == 1].sample(2)

CPU times: total: 34.8 s
Wall time: 34.8 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
12283,12,Johnny Warangkula Tjupurrula,", 75/76, Australian aboriginal painter.",https://en.wikipedia.org/wiki/Johnny_Warangkula_Tjupurrula,13,2001,February,,,,,,,,,,,,,75.5,,Australia,,,2.639057,0,0,0,0,0,1,0,0,0,0,0,0,1
75166,8,Milan Asadurov,", 69, Bulgarian science fiction writer.",https://en.wikipedia.org/wiki/Milan_Asadurov,3,2019,June,,,,,,,,,,,,,69.0,,Bulgaria,,,1.386294,0,0,0,0,0,1,0,0,0,0,0,0,1



#### Checking the Number of Rows without a First Category

In [136]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 38540 entries without any known_for category.



#### Observations:
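- The sculptor and ice hockey player roles reduced the entries without a `known_for` category from 39,291 to 38,540.
- We will proceed to rebuild `known_for_dict` for the next iteration.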

#### Finding `known_for` Roles in `info_2`

In [137]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()


In [138]:
# # Code to check each value
# roles_list.pop()


In [139]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "physician" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [140]:
# # Code to check each specific value
# specific_roles_list.pop()


In [141]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "physician and medical unionist" in df.loc[index, "info"]
#     ]
# ]


In [142]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "billionare physician"]


#### Creating Lists for Each `known_for` Category

In [143]:
# Creating lists for each category
politics_govt_law = [
    "former head of the WHO AIDS program",
    "head of the F D A",
    "faithless elector",
    "medical unionist",  # before sciences
]

arts = ["chairwoman of the Bromsgrove Festival"]
sports = []
sciences = [
    "biochemist and physician",
    "physician and cytologist",
    "physician and pioneer in sex reassignment surgery",
    "physician who delivered first test tube baby",
    "medical researcher and physician",
    "physician and abortion provider",
    "physician and pioneer in AIDS detection",
    "physician and inventor",
    "attending physician to President John F Kennedy after his assassination",
    "physician and expert on international health",
    "physician and geneticist",
    "physician and histologist",
    "transplant physician and immunologist",
    "physician and toxicologist",
    "physician and haematologist",
    "physician and physiologist",
    "respiratory physician",
    "physician and family therapy pioneer",
    "orthopedic physician",
    "physician and sexologist",
    "physician and medical scientist",
    "physician and virologist",
    "physician and pioneer in prolotherapy",
    "physician and research scientist",
    "physician and suspect in several murders",
    "physician and psychiatrist",
    "pediatric surgeon",
    "physician and a NASA",
    "physician and son of Ernest Hemingway",
    "tropical physician",
    "EMS physician",
    "research physician",
    "physician and toxicology researcher",
    "physician and leukemia researcher",
    "family health physician",
    "physician and organ transplant expert",
    "aerospace physician",
    "physician and cancer genetic researcher",
    "physician and molecular geneticist",
    "cardiologist and emergency physician",
    "physician and chemist",
    "physician and stem cell researcher",
    "physician and scientist",
    "physician and cancer researcher",
    "Ayurvedic physician",
    "physician and gastroenterologist",
    "physician and radiologist",
    "pharmacologist and physician",
    "pediatric cardiac surgeon and physician",
    "physician and neuropathologist",
    "physician and scientific",
    "ayurvedic physician",
    "physician and HIV AIDS researcher",
    "physician and pharmacologist",
    "physician and thoracic specialist",
    "physician and anatomist",
    "chief physician",
    "pediatrician and family physician",
    "physician known for alternative cancer treatments",
    "nutritionist and physician",
    "physician and clinical researcher",
    "physician and cardiologist",
    "physician and psychoanalyst",
    "physician and epidemiologist",
    "physician and pathologist",
    "consultant physician",
    "physician and medical researcher",
    "physician and medical",
    "physician and",
    "and physician",
    "physician",
]

business_farming = ["founder of Janssen Pharmaceutica"]
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = ["hospice founder", "in chief of Children Hospital"]
crime = []
event_record_other = []
other_species = []


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [144]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "politics_govt_law": politics_govt_law,
    "sciences": sciences,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

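In this dictionary `politics_govt_law` is placed ahead of `sciences`, in line with the `# before sciences` note on `"medical unionist"`. Python dictionaries preserve insertion order, so categories are processed in the order they are listed. The sketch below shows the effect on the value `physician and medical unionist`, using a hypothetical helper `categorize` and shortened versions of the two lists.

```python
def categorize(text, search_dict):
    """Apply the substring extraction to one value; return flagged categories and leftover text."""
    flagged = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in text:
                if category not in flagged:
                    flagged.append(category)
                text = text.replace(role, "").strip()
    return flagged, text


# Shortened lists, keeping the relative order used in the notebook
politics = ["medical unionist"]
sciences = ["physician and medical", "physician and", "physician"]

politics_first = {"politics_govt_law": politics, "sciences": sciences}
sciences_first = {"sciences": sciences, "politics_govt_law": politics}

value = "physician and medical unionist"

# Both categories are flagged and nothing is left over
print(categorize(value, politics_first))

# "physician and medical" consumes part of "medical unionist", so politics is missed
print(categorize(value, sciences_first))
```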

#### Extracting Category from `info_2`

In [145]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Dataframe
dataframe = df[column].notna()

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df['sciences'] == 1].sample(2)

CPU times: total: 40.9 s
Wall time: 40.9 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
88820,30,Francis Otunta,", 62, Nigerian mathematician and academic administrator, vice-chancellor of Michael Okpara University of Agriculture , traffic collision.",https://en.wikipedia.org/wiki/Francis_Otunta,5,2021,March,,,,vice chancellor of Michael Okpara University of Agriculture,traffic collision,,,,,,,,62.0,,Nigeria,,2016 2021,1.791759,1,0,0,1,0,0,0,0,0,0,0,0,2
8078,30,Irving Segal,", 79, American mathematician.",https://en.wikipedia.org/wiki/Irving_Segal,12,1998,August,,,,,,,,,,,,,79.0,,United States of America,,,2.564949,1,0,0,0,0,0,0,0,0,0,0,0,1


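The extraction loop above visits every row once per role, which is likely why the cell takes tens of seconds. As a possible alternative, the sketch below applies each role to the whole column at once with the pandas string methods `str.contains` and `str.replace`, both with `regex=False` so that roles are matched literally, as with Python's `in`. The function name `extract_categories` and the toy data are hypothetical. It is intended to give the same flags and remaining text as the loop, but that has not been verified against the full dataset.

```python
import pandas as pd


def extract_categories(df, column, search_dict):
    """Flag categories and strip matched roles from `column`, one role at a time."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring match; missing values count as no match
            mask = df[column].str.contains(role, regex=False, na=False)
            if mask.any():
                df.loc[mask, category] = 1
                df.loc[mask, column] = (
                    df.loc[mask, column].str.replace(role, "", regex=False).str.strip()
                )
    return df


# Toy data (hypothetical rows, not taken from the dataset)
toy = pd.DataFrame(
    {
        "info_2": ["physician and cardiologist", "hospice founder", None, "dancer"],
        "sciences": [0, 0, 0, 0],
        "social": [0, 0, 0, 0],
    }
)
toy_dict = {
    "sciences": ["physician and cardiologist", "physician"],
    "social": ["hospice founder"],
}

toy = extract_categories(toy, "info_2", toy_dict)
toy["num_categories"] = toy[list(toy_dict)].sum(axis=1)
print(toy)
```

Roles are still processed in list order, so the longest-first ordering of each list remains important.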

#### Checking the Number of Rows without a First Category

In [146]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 38171 entries without any known_for category.



#### Observations:
- 38,171 of the 98,060 entries still have no `known_for` category.
- Before performing the next iteration, we will export the dataframe to a new database and start a new notebook.

### Exporting Dataset to SQLite Database [wp_life_expect_clean7.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean7.db)

In [147]:
# Exporting dataframe

# Saving dataset in a SQLite database
conn = sql.connect("wp_life_expect_clean7.db")
df.to_sql("wp_life_expect_clean7", conn, index=False)

# Chime notification when cell executes
chime.success()

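After an export like the one above, a quick read-back can confirm that the table was written as expected. The sketch below uses an in-memory SQLite database and a toy dataframe, so the table name `toy_table` and the values are hypothetical. It also passes `if_exists="replace"`, an optional `to_sql` argument that the notebook does not use, so that the cell can be re-run without a "table already exists" error.

```python
import sqlite3 as sql

import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({"name": ["A", "B"], "age": [86.0, 79.0], "sciences": [0, 1]})

# In-memory database so the sketch leaves no file behind
conn = sql.connect(":memory:")
try:
    # if_exists="replace" overwrites the table if it is already present
    toy.to_sql("toy_table", conn, index=False, if_exists="replace")

    # Read the table back and compare it with what was written
    check = pd.read_sql("SELECT * FROM toy_table", conn)
finally:
    conn.close()

print(check.shape == toy.shape)
print(list(check.columns) == list(toy.columns))
```

Note that `if_exists="replace"` silently overwrites an existing table, which may not be wanted when each notebook is meant to write its own database file.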

# [Proceed to Data Cleaning Part 8](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean8_thanak_2022_07_26.ipynb)