# Wikipedia Notable Life Expectancies
# [Notebook 7 : Data Cleaning Part 6](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean6_thanak_2022_07_24.ipynb)
### Context

The
### Objective

The
### Data Dictionary
- Feature: Description

### Importing Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To define the maximum column width displayed in a dataframe
pd.set_option("display.max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")

## Data Overview

### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean5.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean5", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 98060 rows and 38 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,dancer,ballet designer and director,,,,,,,,,86.0,,United Kingdom of Great Britain and Northern Ireland,,,3.091042,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,economist,writer,and academic,,,,,,,,68.0,,Ireland,,,2.564949,0,0,0,0,0,0,0,0,0,0,0,0,0
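
The read step above leaves the SQLite connection open. A minimal sketch of the same read-and-report pattern with the connection closed automatically, assuming a hypothetical in-memory table (`toy_table`) in place of `wp_life_expect_clean5.db`:

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical in-memory database standing in for wp_life_expect_clean5.db
with closing(sqlite3.connect(":memory:")) as conn:
    toy = pd.DataFrame({"name": ["A", "B"], "age": [86.0, 68.0]})
    # index=False keeps the dataframe index out of the stored table
    toy.to_sql("toy_table", conn, index=False)
    loaded = pd.read_sql("SELECT * FROM toy_table", conn)

# The connection is closed here; the dataframe remains usable
print(f"There are {loaded.shape[0]} rows and {loaded.shape[1]} columns.")
```

`contextlib.closing` is used because a `sqlite3` connection used directly as a context manager only manages transactions and does not close the connection.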


In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
98058,9,Aamir Liaquat Hussain,", 50, Pakistani journalist and politician, MNA .",https://en.wikipedia.org/wiki/Aamir_Liaquat_Hussain,99,2022,June,", since",,,MNA,,,,,,,,,50.0,,Pakistan,,"2002 2007, since 2018",4.60517,0,0,0,0,0,1,0,0,1,0,0,0,2
98059,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,engineer,member of the Academy of Engineering,,,,,,,,,86.0,,"China, People's Republic of",,,1.386294,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
59486,7,Bryan Clauson,", 27, American racing driver, race collision.",https://en.wikipedia.org/wiki/Bryan_Clauson,23,2016,August,,,racing driver,race collision,,,,,,,,,27.0,,United States of America,,,3.178054,0,0,0,0,0,0,0,0,0,0,0,0,0
54846,14,Adam Purple,", 84, American environmental activist, heart attack.",https://en.wikipedia.org/wiki/Adam_Purple,21,2015,September,,,environmental activist,heart attack,,,,,,,,,84.0,,United States of America,,,3.091042,0,0,0,0,0,0,0,0,0,0,0,0,0
59185,18,Nikolaus Messmer,", 61, Kazakh-born Kyrgyz Roman Catholic prelate, Apostolic Administrator of Kyrgyzstan .",https://en.wikipedia.org/wiki/Nikolaus_Messmer,3,2016,July,since,,Catholic prelate,Apostolic Administrator of,,,,,,,,,61.0,,Kazakhstan,Kyrgyzstan,since 2006,1.386294,0,0,0,0,0,0,0,0,0,0,0,0,0
31181,19,Dylan Meier,", 26, American college football player, climbing accident.",https://en.wikipedia.org/wiki/Dylan_Meier,8,2010,April,,,,climbing accident,,,,,,,,,26.0,,United States of America,,,2.197225,0,0,0,0,0,0,1,0,0,0,0,0,1
49319,11,Vladimir Beara,", 85, Yugoslav football player .",https://en.wikipedia.org/wiki/Vladimir_Beara,12,2014,August,"national team and manager, Olympic silver medalist",,,,,,,,,,,,85.0,,Serbia,,"national team and manager, Olympic silver medalist 1952",2.564949,0,0,0,0,0,0,1,0,0,0,0,0,1


### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98060 entries, 0 to 98059
Data columns (total 38 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   day                        98060 non-null  object 
 1   name                       98060 non-null  object 
 2   info                       98060 non-null  object 
 3   link                       98060 non-null  object 
 4   num_references             98060 non-null  int64  
 5   year                       98060 non-null  int64  
 6   month                      98060 non-null  object 
 7   info_parenth               36661 non-null  object 
 8   info_1                     22 non-null     object 
 9   info_2                     98028 non-null  object 
 10  info_3                     48896 non-null  object 
 11  info_4                     10264 non-null  object 
 12  info_5                     1265 non-null   object 
 13  info_6                     181 non-null    obj

#### Observations:
- With our dataset loaded, we can pick up where we left off extracting `known_for` values by rebuilding `known_for_dict`.

### Extracting `known_for` Continued

#### Finding `known_for` Roles in `info_2`

In [6]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [7]:
# # Code to check each value
# roles_list.pop()
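
Because `roles_list` is sorted by ascending count, each `pop()` returns the most frequent remaining value. A small sketch of that review loop on made-up values (the `toy` series is an assumption, not the notebook's data):

```python
import pandas as pd

# Hypothetical stand-in for df["info_2"]
toy = pd.Series(["poet", "poet", "poet", "Urdu poet", "painter", "painter"])

# Ascending counts put the most frequent value at the end of the list
roles_list = toy.value_counts(ascending=True).index.tolist()

# pop() removes and returns the last item, i.e. the most frequent value
most_common = roles_list.pop()
print(most_common)
print(roles_list)
```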

In [8]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[[index for index in df.index if "poet" in df.loc[index, "info"]], "info_2",]
#     .value_counts()
#     .index.tolist()
# )
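
The row-by-row list comprehension above can also be written as a boolean mask. A sketch of that alternative on a hypothetical `toy` frame, using `Series.str.contains` with `regex=False` for a plain substring match and `na=False` so missing values count as non-matches:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df
toy = pd.DataFrame(
    {
        "info": [
            ", 86, British poet and painter.",
            ", 68, Irish economist.",
            None,
            ", 70, Urdu poet.",
        ],
        "info_2": ["poet and painter", "economist", None, "Urdu poet"],
    }
)

# Plain substring match; rows with a missing info value are treated as False
mask = toy["info"].str.contains("poet", regex=False, na=False)

# Same shape of result as the commented cell: unique info_2 values for matching rows
specific_roles_list = toy.loc[mask, "info_2"].value_counts().index.tolist()
print(specific_roles_list)
```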

In [9]:
# # Code to check each specific value
# specific_roles_list.pop()

In [10]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "poet and poetry" in df.loc[index, "info"]]]

In [11]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [12]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "n poet"]

#### Creating Lists for Each `known_for` Category

In [13]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "Pulitzer Prize winning poet and former poet laureate",
    "poet of the School",
    "poet and dramatist",
    "Nuyorican poet and playwright",
    "poet and memoirist",
    "poet and BBC producer",
    "Mi'kmaq poet",
    "poet and Pulitzer Prize winner",
    "poet and arts critic",
    "poet and radio host",
    "novelist and poetry promoter",
    "poet and diarist",
    "poet and broadcaster",
    "surrealist poet",
    "Chicano poet",
    "Martiniquan poet",
    "Māori poet",
    "poet who wrote about the Dust Bowl",
    "vocalist and poet",
    "satirist and humorist poet of",
    "Movement poet",
    "avant garde poet and visual artist",
    "poet and literary book publisher",
    "poet of the Beat Generation",
    "Latino poet",
    "contemporary poet",
    "film maker and poet",
    "beat poet",
    "biographer and poet",
    "poet and jazz musician",
    "poet and architecture critic",
    "K'iche' Maya poet",
    "visual artist and poet",
    "poet and biographer",
    "jazz pianist and poet",
    "rhythm poet and musician",
    "dub poet",
    "percussionist and poet",
    "poet and jazz pianist",
    "poet and spoken word musician",
    "poet and sculptor",
    "experimental poet",
    "poet and radio broadcaster",
    "Marathi ghazal poet",
    "poet and co founder of interstitial lung disease",
    "avant garde composer and poet",
    "magazine publisher and poet",
    "literary critic and poet",
    "sculptor and poet",
    "poet and proofreader",
    "Odia poet",
    "folk musician and poet",
    "Ulster Scots poet",
    "vernacular poet",
    "Pashto poet",
    "poet and recording artist",
    "poet and disc jockey",
    "Kannada language poet",
    "and spoken word poet",
    "poet and film producer",
    "Nuyorican poet",
    "Kannada poet",
    "surrealist poet and art critic",
    "director and poet",
    "Kashubian poet",
    "poet and poetry",
    "art critic and poet",
    "poet and art critic",
    "jazz musician and poet",
    "Native poet",
    "poet and composer",
    "poet and cartoonist",
    "poet and filmmaker",
    "Pulitzer Prize winning poet",
    "photographer and poet",
    "poet and visual artist",
    "musician and poet",
    "poet and performance artist",
    "poet and musician",
    "lyricist and poet",
    "painter and poet",
    "poet and artist",
    "poet and publisher",
    "playwright and poet",
    "poet and painter",
    "poet and lyricist",
    "Urdu poet and",
    "Urdu poet",
    "poet and essayist",
    "artist and poet",
    "poet and critic",
    "poet and editor",
    "poet and literary critic",
    "novelist and poet",
    "poet and playwright",
    "poet and novelist",
    "Occitan language poet and",
    "language poet",
    "Arabian poet and",
    "poetess",
    "poet and literary",
    "Beat generation poet and",
    "Beat Generation poet",
    "Beat poet",
    "prize winning poet and",
    "n poet",
    "poet laureate",
    "poet and",
    "and poet",
    "poets",
    "poet",
]
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []
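
The lists above run from the most specific strings down to the bare `"poet"`. That order matters because the extraction step removes each matched role from the text before trying the next one. A minimal sketch of the effect, using a hypothetical helper `extract` that mimics the per-row logic of the extraction loop:

```python
def extract(text, roles):
    """Strip each role found in text, in list order; return (leftover, matched roles)."""
    matched = []
    for role in roles:
        if role in text:
            matched.append(role)
            text = text.replace(role, "").strip()
    return text, matched


specific_first = ["poet and playwright", "poet"]
generic_first = ["poet", "poet and playwright"]

# Specific role first: the whole value is consumed
left_a, hit_a = extract("poet and playwright", specific_first)

# Generic role first: a fragment is left behind in the column
left_b, hit_b = extract("poet and playwright", generic_first)

print(repr(left_a), hit_a)
print(repr(left_b), hit_b)
```

With the generic role first, the fragment `"and playwright"` stays in the column and would need its own entry in a later iteration.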

In [14]:
# Hard-coding cause_of_death value found in info_2
index = df[df["link"] == "https://en.wikipedia.org/wiki/Lawrence_Ferlinghetti"].index
df.loc[index, "cause_of_death"] = "interstitial lung disease"

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [15]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [16]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column (the bare notna() mask would keep every index)
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

                    
# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df["arts"] == 1].sample(2)

CPU times: total: 57.4 s
Wall time: 57.4 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
62046,4,Georgy Taratorkin,", 72, Russian stage and film actor.",https://en.wikipedia.org/wiki/Georgy_Taratorkin,5,2017,February,,,,,,,,,,,,,72.0,,Russia,,,1.791759,0,0,0,0,0,1,0,0,0,0,0,0,1
13989,20,Joan Wheeler,", 88, American actress.",https://en.wikipedia.org/wiki/Joan_Wheeler,9,2001,December,,,,,,,,,,,,,88.0,,United States of America,,,2.302585,0,0,0,0,0,1,0,0,0,0,0,0,1
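
The triple loop above visits every row once per role, which accounts for most of the run time. A sketch of one possible vectorized variant that keeps the same role-by-role order but replaces the inner row loop with string methods. The `toy` frame and its values are assumptions, and this is not the notebook's own implementation:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df
toy = pd.DataFrame(
    {
        "info_2": ["poet and novelist", "economist", None, "Urdu poet"],
        "arts": [0, 0, 0, 0],
    }
)
search_dict = {"arts": ["poet and novelist", "Urdu poet", "poet"]}
column = "info_2"

for category, roles in search_dict.items():
    for role in roles:
        # Plain substring match; missing values never match
        mask = toy[column].str.contains(role, regex=False, na=False)
        if mask.any():
            # Flag the category, then strip the matched role from the column
            toy.loc[mask, category] = 1
            toy.loc[mask, column] = (
                toy.loc[mask, column].str.replace(role, "", regex=False).str.strip()
            )

print(toy)
```

Roles are still applied one at a time in list order, so the specific-before-generic ordering of each list has the same effect as in the loop version.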


#### Checking the Number of Rows without a First Category

In [17]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 53857 entries without any known_for category.
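
The same count can be taken by summing a boolean mask, and a per-category total shows how the flags are distributed after each iteration. A short sketch on a hypothetical three-category `toy` frame:

```python
import pandas as pd

# Hypothetical stand-in with three of the category columns
toy = pd.DataFrame({"arts": [1, 0, 0], "sports": [0, 0, 1], "crime": [0, 0, 0]})
category_cols = ["arts", "sports", "crime"]

# Row-wise total of category flags
toy["num_categories"] = toy[category_cols].sum(axis=1)

# True counts as 1, so the sum is the number of uncategorized rows
n_uncategorized = int((toy["num_categories"] == 0).sum())

# Column-wise totals: entries flagged per category
per_category = toy[category_cols].sum()

print(f"There are {n_uncategorized} entries without any known_for category.")
print(per_category)
```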


#### Observations:
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [18]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [19]:
# # Code to check each value
# roles_list.pop()

In [20]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "artist" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [21]:
# # Code to check each specific value
# specific_roles_list.pop()

In [22]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "artist and art" in df.loc[index, "info"]]]

In [23]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [24]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "n artist"]

#### Creating Lists for Each `known_for` Category

In [25]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "based graffiti artist whose works were included in the Venice Biennale",
    "game artist",
    "Oscar winning make up artist",
    "choreographer appointed as Sydney Dance Company artistic director",
    "based pop artist",
    "artist and wife of Joaquín Torres García",
    "botanical artist",
    "television make up artist",
    "movie poster artist",
    "courtroom artist",
    "artist and engraver",
    "graphic designer and psychedelic artist",
    "special effects and make up artist",
    "comic book artist and publisher",
    "artist and theatre designer",
    "pop artist and director",
    "artistic draughtsman",
    "artist known for his role in the Conceptualism and Minimalism movements",
    "aboriginal artist",
    "native artist",
    "musician and artist",
    "ballet dancer and artistic director of Ballet",
    "master potter and artist",
    "drag queen music artist",
    "internationally recognized graphic artist",
    "fantasy and science fiction artist and illustrator",
    "internationally exhibited Navajo artist",
    "comic book artist and co founder of",
    "artist who was a member of the Regina Five",
    "animator and layout artist",
    "colorist and cover production artist for DC Comics",
    "special effects artist and pedal steel guitarist",
    "comic book colourist and artist",
    "based architect and artist",
    "installation artist and assemblage sculptor",
    "artist and television presenter",
    "movie artist and illustrator",
    "artist and watercolourist",
    "naïve artist",
    "transgender artist",
    "special effects artist and producer",
    "artist and musical performer",
    "and butter sculpture artist",
    "artist and print maker",
    "pop artist and sculptor",
    "experimental music artist",
    "blues musician and artist",
    "comic book artist and reputed creator of",
    "artist and composer",
    "ceramic artist and designer",
    "video game concept artist",
    "artist and reporter",
    "performance artist and playwright",
    "artist and muralist",
    "animation artist and character designer",
    "blues artist",
    "Academy Award winning visual effects artist",
    "vocalist and bassist and solo artist",
    "artist of origin",
    "neo mannerist artist",
    "shadow play artist",
    "sculptor and conceptual artist",
    "film make up artist",
    "tapestry and textile artist",
    "Golden Age comic book artist",
    "rock musician and artist",
    "strip artist",
    "wood carving artist",
    "plastic artist and illustrator",
    "artist and banknote designer",
    "textile artist and printmaker",
    "fantasy and science fiction artist",
    "figurative expressionist artist",
    "hip hop musician and graffiti artist",
    "Navajo artist",
    "wet folding origami artist",
    "Route artist",
    "animator and comic book artist",
    "R&B artist",
    "mural artist",
    "makeup artist and tenor",
    "artist and lecturer",
    "artist and ceramicist",
    "experimental visual artist",
    "vocalist and session artist",
    "artists' model and memoirist",
    "comic book and comic strip artist",
    "comic strip artist and editor",
    "Native artist and potter",
    "West Coast artist",
    "artist and comic book creator",
    "graphic artist and sculptor",
    "painter and visual artist",
    "New Realist artist",
    "motion picture matte artist",
    "artist and novelist",
    "abstract and representational artist",
    "broadcaster and comic book artist",
    "neo conceptual artist",
    "painter and pioneering manhua artist",
    "hillbilly and bluegrass artist",
    "artist and member of the Fluxus movement",
    "artist and member of the Ultra Lettrist movement",
    "graphic artist and printmaker",
    "musician and recording artist",
    "textile artist who specialized in embroidery",
    "ink artist and wife of Walt Disney",
    "artist and art collector",
    "n artist from Utopia",
    "Chicano artist",
    "artist of origins",
    "artist and doll maker",
    "conceptual and performance artist",
    "studio potter and ceramic artist",
    "tenor and artist",
    "country music artist",
    "visual artist known for her still lives and landscapes",
    "film and video artist",
    "artist and watercolor master",
    "painter and comics artist",
    "vocalist and recording artist",
    "comic book artist born",
    "science fiction and fantasy artist",
    "novelist and artist",
    "conductor and recording artist",
    "botanical artist and art critic",
    "carving artist",
    "visual artist and conceptual sculptor",
    "contemperary artist",
    'sculptor and "one of the nation most accomplished medallic artists"',
    "cartoonist and comic artist",
    "rock artist",
    "psychedelic artist",
    "animation and comic book artist",
    "textile artist and embroiderer",
    "artist and experimental photographer",
    "cartoonist and comic book artist",
    "potter and ceramic artist",
    "artist anddesigner",
    "born artist",
    "sculptor and stained glass artist",
    "equine artist",
    "typographer and graphic artist",
    "visual artist and jewelry and fashion designer",
    "visual artist and protégé of Salvador Dalí",
    "reggae artist",
    "artist and Army art correspondent",
    "film special effects artist",
    "vocalist and artist",
    "experimental filmmaker and artist",
    "fine artist and art editor",
    "comic book artist and book illustrator",
    "Les Automatistes artist and a member of",
    "comic book artist and co creator of Jonah Hex and Black Orchid",
    "Papunya Tula artist",
    "graphic artist and postage stamp designer",
    "artist and metalsmith",
    "manhua artist",
    "illustrator and storyboard artist",
    "Western artist",
    "artist and architectural designer",
    "graphic artist and banknote designer",
    "artist in wood",
    "Yakshagana artist",
    "trapeze artist",
    "newspaper artist and cartoonist",
    "artist and puppeteer",
    "ceramic artist and sculptor",
    "furniture designer and artist",
    "rock album cover artist",
    "stateless auto destructive artist",
    "indigenous artist and printmaker",
    "computer artist",
    "watercolour artist",
    "media artist and designer",
    "artist and landscape architect",
    "Eurodance artist",
    "musician and comic book artist",
    "graphic artist and game designer",
    "comics artist and graphic novelist",
    "hologram artist",
    "cartoonist and comics artist",
    "comics artist and animator",
    "reggae artist and comedian",
    "sound installation artist and musician",
    "artist and potter",
    "artist and film production illustrator",
    "multimedia artist and painter",
    "film animator",
    "artist photographer",
    "experimental filmmaker and visual artist",
    "artist and jewelry designer",
    "graphic designer and film poster artist",
    "fetish artist",
    "avant garde installation artist and sculptor",
    "courtroom sketch artist",
    "sketch artist",
    "caricaturist and comics artist",
    "Iñupiat artist",
    "hand shadow artist",
    "Indigenous artist",
    "musical Thavil artist",
    "ceramist and textile artist",
    "radio presenter and artist",
    "comics artist and cartoonist",
    "lianhuanhua artist",
    "multi media artist",
    "animator and comics artist",
    "artist and cultural figure",
    "typographer and visual artist",
    "graphic designer and album artist",
    "pencil artist",
    "theatre artist and playwright",
    "comic artist and illustrator",
    "motorsports artist",
    "voice over and recording artist",
    "bassist and artist",
    "comic creator and cover artist",
    "tattoo artist and reality show personality",
    "cartoonist and street artist",
    "sand artist",
    "architect and light artist",
    "media producer and makeup artist",
    "Kiowa artist",
    "Peking opera artist",
    "electronic music artist and MC",
    "graphic design artist",
    "impressionist artist",
    "graphic designer and street artist",
    "illustrator of children books and cartoon artist",
    "artist and animator",
    "record producer and artist",
    "rock concert graphic poster artist",
    "land artist",
    "video and visualization artist",
    "artist and weaver",
    "painter and plastic artist",
    "fantasy gaming artist",
    "Madhubani painter and artist",
    "animation director and storyboard artist",
    "conceptual artist and photographer",
    "fantasy artist and album cover designer",
    "design artist and painter",
    "theatre artiste",
    "thangka artist",
    "artist and Alghoza player",
    "fantasy coffin artist",
    "film concept artist",
    "wildlife artist and illustrator",
    "theatre personality and artist",
    "visual artist and fashion designer",
    "artist and fashion designer",
    "comedic artist",
    "sculptor and artistic director",
    "graphic designer and album cover artist",
    "artist and trading card illustrator",
    "geometric artist",
    "founder and artistic director of the Melbourne Theatre Company",
    "indigenous artist",
    "artist and book cover illustrator",
    "punk graphic designer and artistic director",
    "television host and recording artist",
    "artistic director and live performance organizer",
    "weaving artist",
    "pop and minimalist artist",
    "dubbing artist",
    "monumentalist artist",
    "sound artist and electronic music composer",
    "highwire artiste",
    "pixel artist",
    "conceptual and digital artist",
    "artist and inventor of the plastic pink flamingo",
    "batik artist",
    "enamel artist",
    "operatic tenor and artistic director",
    "sound artist and radio presenter",
    "sound artist",
    "conductor and artistic director",
    "graphic artist and designer",
    "quilt artist",
    "comic strip artist and cartoonist",
    "painter and installation artist",
    "graphic artist and cartoonist",
    "recording artist and vocalist",
    "circus artist and animal trainer",
    "Kunqu artist",
    "surrealist artist",
    "harmonism artist",
    "graphic designer and poster artist",
    "marine artist",
    "commercial artist and illustrator",
    "commercial artist",
    "plastic artist",
    "color abstract artist",
    "special effects make up artist",
    "underground graffiti artist",
    "ballet master and artistic director",
    "animation background artist",
    "hymnist and visual artist",
    "recording artist",
    "visual artist and photographer",
    "World War II artist",
    "painter and kinetic artist",
    "muralist and pictorial artist",
    "cartoonist and artist",
    "sculptor and graphic artist",
    "visual artist and musician",
    "artists' model",
    "abstract expressionist artist and",
    "abstract expressionist artist",
    "artist and printmaker",
    "sculptor and installation artist",
    "installation artist",
    "filmmaker and artist",
    "comic book artist and painter",
    "sculptor and performance artist",
    "artist and set designer",
    "comic book artist and editor",
    "cabaret artist",
    "environmental artist",
    "expressionist artist",
    "comic book and advertising artist",
    "kinetic artist",
    "concept artist",
    "fantasy artist",
    "heraldic artist",
    "Ojibwe artist",
    "First Nations artist",
    "ballet dancer and artistic director",
    "mime artist",
    "contemporary visual artist",
    "psychedelic poster artist",
    "tattoo artist",
    "artist and art critic",
    "hip hop artist",
    "forensic artist",
    "illustrator and comics artist",
    "Hall of Fame comic book artist",
    "Inuk artist",
    "figurative artist",
    "folk artist",
    "artist and filmmaker",
    "voice over artist",
    "minimalist artist",
    "artist and graphic designer",
    "nonconformist artist",
    "portrait artist",
    "pottery artist",
    "film poster artist",
    "fiber artist",
    "Kathakali artist",
    "illustrator and comic book artist",
    "science fiction artist",
    "watercolor artist",
    "comic book artist and cartoonist",
    "artist and cartoonist",
    "avant garde artist",
    "realist painter and graphic artist",
    "comics artist and illustrator",
    "ceramics artist",
    "outsider artist",
    "architect and artist",
    "street artist",
    "video artist and",
    "video artist",
    "contemporary artist",
    "Inuit artist",
    "painter and artist",
    "voice artist",
    "rap artist",
    "graffiti artist",
    "special effects artist",
    "visual effects artist and",
    "visual effects artist",
    "and landscape artist",
    "landscape artist",
    "stained glass artist",
    "graphic designer and artist",
    "designer and artist",
    "war artist",
    "comic strip artist",
    "artistic director",
    "pop artist",
    "make up artist and",
    "make up artist",
    "comic artist",
    "wildlife artist",
    "artist and photographer",
    "sculptor and artist",
    "artist and musician",
    "makeup artist",
    "artist and architect",
    "ceramic artist and",
    "ceramic artist",
    "glass artist",
    "textile artist",
    "artist and designer",
    "artist and painter",
    "artist and sculptor",
    "performance artist and",
    "and performance artist",
    "performance artist",
    "artist and illustrator",
    "painter and graphic artist",
    "and graphic artist",
    "graphic artist and",
    "graphic artist",
    "conceptual artist",
    "abstract artist",
    "manga artist and",
    "manga artist",
    "comics artist and",
    "and comics artist",
    "comics artist",
    "visual artist and art",
    "and visual artist",
    "visual artist and",
    "visual artist",
    "comic book artist and",
    "comic book artist",
    "woodcut artist and",
    "poster artist and",
    "poster artist",
    "and storyboard artist",
    "storyboard artist",
    "modern artist and",
    "and fish skin artist",
    "and artist",
    "artist and",
    "n artist",
    "music artist",
    "artist",
]
sports = [
    "Muay martial artist",  # before arts
    "martial artist and Isshinryu karate pioneer",
    "professional wrestler and mixed martial artist",
    "martial artist and founder of Modern Arnis",
    "Olympic artistic gymnast",
    "professional wrestler and martial artist",
    "artistic gymnastics coach",
    "mixed martial artist and kickboxer",
    "artistic gymnast and Olympian",
    "mixed martial artist and grappler",
    "super heavyweight kickboxing champion and mixed martial artist",
    "Olympic taekwondo martial artist",
    "mixed martial artist and professional wrestler",
    "martial artist and teacher",
    "kickboxer and mixed martial artist",
    "artistic gymnast",
    "mixed martial artist",
    "martial artist and",
    "and martial artist",
    "martial artist",
]
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = [
    "first civilian to receive the Intelligence Medal of Merit"
]
spiritual = ["modern primitive proponent"]
social = []
crime = ["scam artist", "con artist and", "art forger"]  # before arts
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [26]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "sports": sports,
    "arts": arts,
}
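
In this iteration `sports` is listed before `arts` in the dictionary, in line with the `# before arts` comments on the `sports` and `crime` lists. Dictionaries preserve insertion order, so categories are searched in the order written here. A minimal sketch of why that matters for overlapping strings such as "martial artist" and "artist", using a hypothetical helper `categorize` that mimics the per-row logic of the extraction loop:

```python
def categorize(text, search_dict):
    """Apply categories in dictionary order; return (categories found, leftover text)."""
    found = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in text:
                found.append(category)
                text = text.replace(role, "").strip()
    return found, text


sports_first = {"sports": ["martial artist"], "arts": ["artist"]}
arts_first = {"arts": ["artist"], "sports": ["martial artist"]}

# sports searched first: "martial artist" is consumed before arts sees "artist"
found_a, left_a = categorize("mixed martial artist", sports_first)

# arts searched first: "artist" is stripped and the entry is flagged as arts
found_b, left_b = categorize("mixed martial artist", arts_first)

print(found_a, repr(left_a))
print(found_b, repr(left_b))
```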

#### Extracting Category from `info_2`

In [27]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-null value in the column (the bare notna() mask would keep every index)
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item:
                if role in item:
                    df.loc[index, category] = 1
                    df.loc[index, column] = item.replace(role, '').strip()

                    
# Updating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking a sample of rows
df[df['arts'] ==1].sample(2)

CPU times: total: 3min 57s
Wall time: 3min 57s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
27042,4,Richard Van Allan,", 73, British opera singer, lung cancer.",https://en.wikipedia.org/wiki/Richard_Van_Allan,5,2008,December,,,,lung cancer,,,,,,,,,73.0,,United Kingdom of Great Britain and Northern Ireland,,,1.791759,0,0,0,0,0,1,0,0,0,0,0,0,1
74919,22,Surya Prakash,", 79, Indian artist.",https://en.wikipedia.org/wiki/Surya_Prakash_(artist),4,2019,May,,,,,,,,,,,,,79.0,,India,,,1.609438,0,0,0,0,0,1,0,0,0,0,0,0,1
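
The row-by-row loop above takes minutes because it visits every row once per role. A possible alternative, sketched here under the assumption that `info_2` holds only strings or missing values, applies each role to the whole column at once with pandas string methods. The helper name `extract_categories` and the toy data are hypothetical, and the sketch has not been checked against the full dataset:

```python
import pandas as pd


def extract_categories(df, column, search_dict):
    """Flag categories and strip matched roles from `column`, one vectorized pass per role."""
    df = df.copy()
    for category, roles in search_dict.items():
        for role in roles:
            # Literal (non-regex) substring test; missing values count as no match
            mask = df[column].str.contains(role, regex=False, na=False)
            if mask.any():
                df.loc[mask, category] = 1
                df.loc[mask, column] = (
                    df.loc[mask, column].str.replace(role, "", regex=False).str.strip()
                )
    return df


# Toy data (hypothetical) using the notebook's column names
toy = pd.DataFrame(
    {"info_2": ["opera singer", "con artist and writer", None], "arts": 0, "crime": 0}
)
toy_dict = {"crime": ["con artist and"], "arts": ["opera singer", "writer"]}

result = extract_categories(toy, "info_2", toy_dict)
print(result)
```

The category and role order is preserved, so earlier categories still claim a phrase before later ones.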


#### Checking the Number of Rows without a First Category

In [28]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 51956 entries without any known_for category.


#### Observations:
- 51,956 entries still have no `known_for` category, so we will rebuild `known_for_dict` with new role lists for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [29]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [30]:
# # Code to check each value
# roles_list.pop()
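
The two cells above work together: sorting the counts in ascending order places the most frequent value last, so each `pop()` returns the most common role still in the list. A small sketch with made-up values (the toy series is hypothetical):

```python
import pandas as pd

# Toy column (hypothetical values)
toy = pd.Series(["painter", "painter", "painter", "musician", "musician", "diplomat"])

# Ascending order puts the most frequent value last
roles_list = toy.value_counts(ascending=True).index.tolist()

# list.pop() removes and returns the last item, i.e. the most frequent remaining role
most_common = roles_list.pop()
print(most_common)
print(roles_list)
```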

In [31]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "painter" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [32]:
# # Code to check each specific value
# specific_roles_list.pop()

In [33]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "provocateur" in df.loc[index, "info"]]]

In [34]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [35]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "painter and chess composer"]

#### Creating Lists for Each `known_for` Category

In [36]:
# Creating lists for each category
politics_govt_law = [
    "provocateur",
]

arts = [
    "painter and drawer",
    "Native painter",
    "rock musician and painter",
    "naïve painter",
    "Puerto Ricanpainter and art",
    "playwright and painter",
    "born painter",
    "architect and painter",
    "engraver and painter",
    "outback painter",
    "Cubist painter",
    "cityscape painter",
    "painter and Cultural Medallion winner",
    "film critic and painter",
    "art critic and surrealist painter",
    "painter of mystical Jewish works",
    "color field painter",
    "painter and lecturer",
    "Andalusian painter and illustrator",
    "painter and glass sculptor",
    "aboriginal painter",
    "Delftware painter",
    "barn painter",
    "abstract painter and printmaker",
    "figurative painter and arts",
    "television painter",
    "draughtsman and painter",
    "mixed media painter",
    "enameller and painter",
    "landscape painter and watercolorist",
    "painter and watercolorist",
    "painter and television host",
    "folk art painter",
    "portrait painter and sculptor",
    "Realist painter",
    "Yōga painter",
    "photographer and portrait painter",
    "portrait and landscape painter",
    "figurative painter and draftsman",
    "painter and protégé of Pablo Picasso",
    "painter and calligrapher",
    "painter and ceramicist",
    "painter and scenographer",
    "watercolour painter",
    "painter and draftsman",
    "animator and painter",
    "painter and weaver",
    "painter and lithographer",
    "painter and cartoonist",
    "Ojibwe painter",
    "cartoonist and painter",
    "stage designer and painter",
    "surrealistic painter",
    "painter and cinematographer",
    "painter and draughtsman",
    "painter and theatre director",
    "icon painter",
    "Coptic painter",
    "fashion designer and painter",
    "designer and painter",
    "composer and painter",
    "signpainter",
    "painter and television presenter",
    "representational painter",
    "painter and print maker",
    "figurative painter",
    "painter and animator",
    "abstract painter and print maker",
    "painter and composer",
    "painter and children book illustrator",
    "art critic and painter",
    "painter of themes",
    "yodeler and painter",
    "geometric abstractionist painter",
    "scenographer and painter",
    "impressionist painter",
    "Yolngu painter",
    "painter and ceramist",
    "magic realist painter",
    "painter and dancer",
    "photorealist painter",
    "watercolor painter",
    "jazz musician and painter",
    "musician and painter",
    "painter and graphic designer",
    "modernist painter",
    "muralist and painter",
    "avant garde painter",
    "COBRA painter",
    "painter and muralist",
    "abstract painter and sculptor",
    "painter and novelist",
    "painter and musician",
    "photographer and painter",
    "portrait painter",
    "painter and designer",
    "and realist painter",
    "realist painter and",
    "surrealist painter",
    "realist painter",
    "illustrator and painter",
    "abstract expressionist painter",
    "expressionist painter",
    "landscape painter",
    "painter and engraver",
    "painter and photographer",
    "painter and illustrator",
    "abstract painter",
    "sculptor and painter",
    "painter and printmaker",
    "painter and sculptor",
    "painter and",
    "and painter",
    "painter",
]
sports = ["chess composer"]
sciences = []

business_farming = []
academia_humanities = ["theatre school director"]
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []
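
Within each list the longer, more specific phrases come before generic ones such as `"painter"`, because a short role that is processed first strips part of a longer phrase and leaves a fragment behind. A quick order check could look like the sketch below; `find_shadowed_roles` is a hypothetical helper, not part of the notebook:

```python
def find_shadowed_roles(roles):
    """Return (earlier, later) pairs where an earlier role is a substring of a later one."""
    shadowed = []
    for i, earlier in enumerate(roles):
        for later in roles[i + 1 :]:
            if earlier != later and earlier in later:
                shadowed.append((earlier, later))
    return shadowed


# Toy lists (hypothetical): the first is ordered specific-to-general, the second is not
well_ordered = ["portrait painter and sculptor", "painter and sculptor", "painter"]
badly_ordered = ["painter", "portrait painter"]

print(find_shadowed_roles(well_ordered))
print(find_shadowed_roles(badly_ordered))
```

An empty result means no earlier entry can pre-empt a later one. Any pair that is returned marks a later entry that will never match in full.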

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [37]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [38]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-missing value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['arts'] == 1].sample(2)

CPU times: total: 1min 4s
Wall time: 1min 4s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
93345,12,Srikanth,", 81, Indian actor .","https://en.wikipedia.org/wiki/Srikanth_(Tamil_actor,_born_1940)",9,2021,October,", ,",,,,,,,,,,,,81.0,,India,,", ,",2.302585,0,0,0,0,0,1,0,0,0,0,0,0,1
720,13,K. T. Stevens,", 74, American actress, lung cancer.",https://en.wikipedia.org/wiki/K._T._Stevens,3,1994,June,,,,lung cancer,,,,,,,,,74.0,,United States of America,,,1.386294,0,0,0,0,0,1,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [39]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 51002 entries without any known_for category.
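
The count above relies on `num_categories` being the row-wise sum of the 0/1 category flags. A compact sketch of that bookkeeping on toy data, with a shortened, hypothetical column list:

```python
import pandas as pd

# Shortened category list and toy flags (hypothetical)
category_cols = ["arts", "sports", "crime"]
toy = pd.DataFrame({"arts": [1, 0, 0], "sports": [0, 0, 1], "crime": [0, 0, 1]})

# Row-wise sum of the 0/1 flags gives the number of categories per entry
toy["num_categories"] = toy[category_cols].sum(axis=1)

# Comparing to 0 yields booleans; summing them counts the uncategorized rows
n_uncategorized = int(toy["num_categories"].eq(0).sum())
print(n_uncategorized)
```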


#### Observations:
- The painter-focused iteration reduced the uncategorized entries from 51,956 to 51,002.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [40]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [41]:
# # Code to check each value
# roles_list.pop()

In [42]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "musician" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [43]:
# # Code to check each specific value
# specific_roles_list.pop()

In [44]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "Shinshu" in df.loc[index, "info"]]]

In [45]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [46]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "stalker of musician Björk"]

#### Creating Lists for Each `known_for` Category

In [47]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "musician known for his work with the Beatles",
    "musician and host of",
    "musician with the Bothy Band",
    "award winning blues musician",
    "shehnai musician and Bharat Ratna winner",
    "musician known for his work on the theme",
    "musician who played at the Johannesburg Live concert",
    "jazz musician and widow of John Coltrane",
    "musician and husband of Celia Cruz",
    "musician and founding conductor of Brooklyn Philharmonic Orchestra",
    "musician and entertainer",
    'musician known as "Saint Thomas"',
    "theme song composer and jazz musician",
    "composer and jazz musician",
    "Benga musician",
    "musician credited with creating mambo",
    "saxophone and clarinet musician",
    "musician and composer from the band Savage Rose",
    "musician and founder of the band Bathory",
    "musician and playwright",
    "Celtic musician",
    "Rockabilly musician",
    "R&B and country & western musician",
    "Tony nominated orchestrator and musician",
    "fiddle player and bluegrass musician",
    "musician with the hardcore punk band Big Boys",
    "musician and founding member of The Vandals",
    "musician for acoustic rock band Plush",
    "steel pan musician",
    "musician and bagpiper",
    "musician and essayist with the stage name Buddy Blue",
    "vocal jazz musician",
    "dancehall garage musician",
    "Romani musician",
    "cartoonist and musician",
    "soca musician",
    "musician and one man band",
    "cajun musician",
    "rapper and musician",
    "country musician and producer",
    "soul and disco musician",
    "street musician",
    "Grammy Award winning jazz producer",
    "Indorock musician",
    "musician and festival organiser",
    "comedian and musician",
    "musician and band manager",
    "jazz musician and jazz critic",
    "traditional jazz musician",
    "country musician and trumpet player",
    "soul and funk musician",
    "country musician and comedian",
    "musician and comedian",
    "hip hop DJ and musician",
    "musician and patron of the arts",
    "club disc jockey and musician",
    "Reggae musician and composer",
    "musician and hula expert",
    "jazz and rock musician",
    "garage punk musician",
    "jazz musician and accordionist",
    "musician and sound recordist",
    "jazz guitarist and studio musician",
    "Classical musician",
    "country musician and television and radio host",
    "keyboardist and session musician",
    "musician and radio and TV personality",
    "Country & Western musician",
    "rock and R&B musician",
    "R&B and jazz musician",
    "rock 'n roll musician",
    "musician and music business executive",
    "country music and rockabilly musician",
    "dixieland jazz musician",
    "guitarist and studio musician",
    "musician and inventor of board game Cluedo",
    "classical musician and tabla player",
    "soul musician and a guitarist",
    "reggae musician and Rastafarian",
    "concert and easy listening musician",
    "blues and folk musician",
    "Jùjú musician",
    "country blues musician",
    "musician and music publisher",
    "kadongo kamu musician",
    "reggae musician and producer",
    "musician and creator of the seggae genre",
    'jazz musician known as "Mr Swing"',
    "musician of music",
    "children musician",
    "musician and DJ of Run DMC",
    "skiffle musician",
    "musician and star of the Buena Vista Social Club",
    "string band fiddler and mandolinist and country blues musician",
    "musician composer and arranger",
    "rockabilly composer and musician",
    "soul musician and brother of Marvin Gaye",
    "musician and serial",
    "Volksmusik musician and collector",
    "musician and pioneer of the musical genres of and",
    "bluegrass and folk musician",
    "bandleader and musician",
    "big band style musician",
    "Jibaro musician",
    "folk musician and composer",
    "musician and vocal session arranger",
    "musician and composer of electronic music",
    "musician and former member of The Beatles",
    "Tejano musician",
    "jazz fusion musician",
    "entertainer and musician",
    "outsider musician",
    "trumpeter and session musician",
    "architect and musician",
    "electronic musician and radio host",
    "sound designer and musician",
    "electronic dance musician and record producer",
    "jazz musician and arranger",
    "Nazrul Sangeet musician",
    "musician and publisher",
    "Indigenous musician",
    "folk and bluegrass musician",
    "punk and new wave musician",
    "trance music producer and musician",
    "Carnatic musician and composer",
    "Hindustani musician",
    "jazz musician and band leader",
    "shehnai musician",
    "rap musician",
    "avant garde musician",
    "ska and mento musician",
    "synthesizer musician",
    "musician and choral conductor",
    "samba musician",
    "khyal musician",
    "kwaito musician",
    "acid house musician",
    "photographer and musician",
    "minimalist musician",
    "trumpeter and brass band musician",
    "soundtrack composer and musician",
    "grime musician",
    "folk rock musician and composer",
    "disc jockey and musician",
    "funk and R&B musician",
    "Grammy award winning musician",
    "musician and organist",
    "blues and rock musician",
    "musician and radio presenter",
    "Hall of Fame instrumental and surf rock musician",
    "surf rock musician",
    "Bubu musician",
    "musician and dancer",
    "Andean cumbia musician",
    "ashik musician",
    "Hall of Fame blues musician",
    "sculptor and musician",
    "experimental musician",
    "jazz musician and Shinshu Buddhist",
    "mbira musician",
    "theatre director and musician",
    "jazz musician and producer",
    "Steelpan musician and arranger",
    "country musician and radio broadcaster",
    "musician and arranger",
    "punk rock musician and",
    "punk rock musician",
    "carnatic musician and music director",
    "musician and music director",
    "steelpan musician and designer",
    "steelpan musician",
    "Tuvan musician",
    "bossa nova musician",
    "vallenato musician",
    "music manager and musician",
    "contra dance musician",
    "bluegrass musician and banjo player",
    "country and rockabilly musician",
    "roots musician and entertainment critic",
    "roots musician",
    "jazz musician and manager",
    "jazz musician and radio show host",
    "electronic musician and television producer",
    "jazz musician and vocal coach involved in the Wrong Door Raid",
    "Detroit blues musician",
    "folk musician and yueqin player",
    "blues and country musician",
    "heavy metal musician",
    "rock and roll session musician",
    "jazz musician and architect",
    "musician and radio and television personality",
    "Hall of Fame bluegrass musician",
    "oud musician",
    "music  musician",
    "cumbia musician",
    "book and album cover designer and jazz musician",
    "salsa musician and composer",
    "musician and percussion mallet manufacturer",
    "Gnawa musician",
    "blues rock musician",
    "musician and music publishing executive",
    "musician and television show host",
    "composer and electronic musician",
    "Māori musician",
    "and Manager of Jazz musician Erroll Garner",
    "musician and house music producer",
    "mariachi musician and",
    "mariachi musician",
    "merengue and salsa musician",
    "pop musician and producer",
    "reggae cross over musician",
    "Grammy Award winning jazz and new age musician",
    "classical dancer and musician",
    "dancer and musician",
    "musician and film composer",
    "rhythm and blues musician",
    "jazz and pop musician",
    "musician and music producer",
    "electric blues musician",
    "musician and DJ",
    "tango musician",
    "Igbo highlife musician",
    "dancehall musician",
    "ska musician",
    "calypso musician",
    "soul musician",
    "session musician",
    "ambient musician",
    "Chicago blues musician",
    "Hall of Fame musician and record producer",
    "big band musician",
    "electronic musician and composer",
    "folk rock musician",
    "Cajun musician",
    "Grammy Award winning musician",
    "jazz and blues musician",
    "traditional musician",
    "reggae musician and record producer",
    "jazz musician and bandleader",
    "musician and bandleader",
    "pop musician",
    "polka musician",
    "musician and conductor",
    "Carnatic musician",
    "punk musician",
    "broadcaster and musician",
    "rock and roll musician",
    "R&B musician",
    "highlife musician",
    "hip hop musician",
    "record producer and musician",
    "musician and producer",
    "classical musician and",
    "classical musician",
    "Hall of Fame musician",
    "rockabilly musician",
    "electronic musician",
    "jazz musician and composer",
    "reggae musician",
    "musician and record producer",
    "country musician",
    "and bluegrass musician",
    "bluegrass musician",
    "composer and musician",
    "musician and composer",
    "and traditional folk musician",
    "folk musician and",
    "and folk musician",
    "folk musician",
    "and rock musician",
    "rock musician",
    "blues musician and",
    "blues musician",
    "and jazz musician",
    "jazz musician",
    "benga musician and",
    "musician and television",
    "Native musician and",
    "musician from",
    "and musician",
    "musician and",
    "musician",
]
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = [
    "stalker of musician Björk",  # before arts
]
event_record_other = []
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [48]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [49]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-missing value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['arts'] == 1].sample(2)

CPU times: total: 2min 28s
Wall time: 2min 28s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
19808,5,Bele Bachem,", 89, German graphic artist, book illustrator and writer.",https://en.wikipedia.org/wiki/Bele_Bachem,4,2005,June,,,,book illustrator and writer,,,,,,,,,89.0,,Germany,,,1.609438,0,0,0,0,0,1,0,0,0,0,0,0,1
15894,1,Cyril Shaps,", 79, British actor .",https://en.wikipedia.org/wiki/Cyril_Shaps,9,2003,January,", ,",,,,,,,,,,,,79.0,,United Kingdom of Great Britain and Northern Ireland,,", ,",2.302585,0,0,0,0,0,1,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [50]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 49640 entries without any known_for category.


#### Observations:
- The musician-focused iteration reduced the uncategorized entries from 51,002 to 49,640.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [51]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [52]:
# # Code to check each value
# roles_list.pop()

In [53]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "diplomat" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [54]:
# # Code to check each specific value
# specific_roles_list.pop()

In [55]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [index for index in df.index if "and kidnapping survivor" in df.loc[index, "info"]]
# ]

In [56]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [57]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "outlaw country music singer songwriter"]

#### Creating Lists for Each `known_for` Category

In [58]:
# Creating lists for each category
politics_govt_law = [
    "environmentalist and diplomat",
    "husband of Queen Beatrix of the and diplomat",
    "diplomat and employee of Agency for International Development",
    "Foreign Service Officer and diplomat",
    "diplomat and Apostolic Nuncio to the Republic of",
    "diplomat and Cold War arms negotiator",
    "diplomat and foreign minister",
    "diplomat and ambassador to Arabia",
    "federal judge and diplomat",
    "first female diplomat and ambassador",
    "aristocrat and diplomat",
    "diplomatic clerk",
    "diplomat and Governor of New South",
    "diplomatand ambassador",
    "diplomat and State Department official",
    "diplomat and High Commissioner of",
    "diplomat with the Department of State",
    "diplomat and Minister of Foreign Affairs",
    "diplomat and st Secretary General of UNCTAD",
    "diplomat and defector",
    "and diplomat for the Holy See",
    "diplomat and Medal of Freedom recipient",
    "diplomat and attorney",
    "career diplomat",
    "government diplomat and governor",
    "social advocate and diplomat",
    "diplomat and government official",
    "political aide and diplomat",
    "peer and diplomat",
    "diplomat and courtier",
    "diplomat and life peer",
    "diplomat and independentism leader",
    "political scientist and diplomat",
    "ambassador and diplomat",
    "diplomat and diplomatic analyst",
    "Chickasaw Nation diplomat",
    "diplomat and economist",
    "diplomat and adviser",
    "political  diplomat",
    "diplomat and legal",
    "political figure and diplomat",
    "diplomat and peer",
    "diplomat and activist",
    "attorney and diplomat",
    "diplomat and political analyst",
    "jurist and diplomat",
    "diplomat and jurist",
    "diplomat and civil servant",
    "diplomat and aristocrat",
    "diplomat and colonial administrator",
    "diplomat and political scientist",
    "diplomat and lawyer",
    "civil servant and diplomat",
    "diplomat and public servant",
    "economist and diplomat",
    "lawyer and diplomat",
    "public servant and diplomat",
    "diplomat and ambassador",
    "diplomat serving in",
    "diplomat Ambassador to",
    "and diplomatic advisor",
    "and diplomat in the",
    "diplomat to",
    "diplomat and",
    "and diplomat",
    "diplomat",
]

arts = []
sports = []
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = ["and kidnapping survivor"]
other_species = []
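
Plain substring matching means a short role such as `"diplomat"` also matches inside `"diplomatic clerk"`, which is presumably why the longer variants are listed ahead of it. A different technique, not the one this notebook uses, is to require that the role is not embedded in a longer word. The sketch below uses the standard `re` module; `contains_whole_phrase` is a hypothetical helper:

```python
import re


def contains_whole_phrase(text, role):
    """True if `role` occurs in `text` without a word character directly before or after it."""
    # Lookarounds are used instead of \b so roles such as "R&B musician" also work
    pattern = r"(?<!\w)" + re.escape(role) + r"(?!\w)"
    return re.search(pattern, text) is not None


print("diplomat" in "diplomatic clerk")
print(contains_whole_phrase("diplomatic clerk", "diplomat"))
print(contains_whole_phrase("lawyer and diplomat", "diplomat"))
```

The trade-off is that fused values in the data, such as a role glued to a neighboring word, would no longer match and would still need explicit list entries.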

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [59]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [60]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = 'info_2'

# Rows with a non-missing value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict.keys())].sum(axis=1)

# Checking a sample of rows
df[df['politics_govt_law'] == 1].sample(2)

CPU times: total: 35.7 s
Wall time: 35.7 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
93906,9,Aldo Rizzo,", 86, Italian politician and magistrate, deputy .",https://en.wikipedia.org/wiki/Aldo_Rizzo,6,2021,November,and mayor of Palermo,,,deputy,,,,,,,,,86.0,,Italy,,1979 1992 and mayor of Palermo 1992,1.94591,0,0,0,0,0,0,0,0,1,0,0,0,1
84316,22,Adamu Daramani Sakande,", 58, Ghanaian politician, MP .",https://en.wikipedia.org/wiki/Adamu_Daramani_Sakande,8,2020,September,,,,MP,,,,,,,,,58.0,,Ghana,,2009 2013,2.197225,0,0,0,0,0,0,0,0,1,0,0,0,1


#### Checking the Number of Rows without a First Category

In [61]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 48883 entries without any known_for category.


#### Observations:
- The diplomat-focused iteration reduced the uncategorized entries from 49,640 to 48,883.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [62]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [63]:
# # Code to check each value
# roles_list.pop()

In [64]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "composer" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [65]:
# # Code to check each specific value
# specific_roles_list.pop()

In [66]:
# # Example code to quick-screen values that may overlap categories
# df.loc[[index for index in df.index if "composer and sound" in df.loc[index, "info"]]]

In [67]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [68]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "Filin composer and interpreter"]

#### Creating Lists for Each `known_for` Category

In [69]:
# Creating lists for each category
politics_govt_law = []

arts = [
    "composer and free jazz violinist",
    "Academy Award winning film score composer",
    "experimental music composer",
    "composer and lyricist of",
    "Emmy award winning composer and lyricist",
    "jazz pianist composer",
    "film composer and music supervisor",
    "Broadway composer and TV producer",
    "composer best known for work on Robotech",
    "blues and rock n' roll guitarist and composer",
    "composer of Broadway musicals",
    "composer of classic film music such as",
    "movie and television composer",
    "jazz soprano saxophonist and composer",
    "composer of feature films and television movie scores",
    "composer and organist choirmaster Washington National Cathedral",
    "orchestrator and composer of film and television scores",
    "composer of film and television scores",
    "composer and choirmaster",
    "guitarist and composer and best known for his work with Van Morrison",
    "composer and musical director",
    "musical theater composer",
    "violist and composer of electronic music",
    "composer and television host",
    "contemporary classical music composer",
    "contemporary classical composer",
    "hip hop composer",
    "jingle composer",
    "composer of background music for and",
    "experimental composer",
    "sitarist and composer",
    "fiddler and composer",
    "progressive rock bassist and composer",
    "Academy Award winning film composer",
    "composer and jazz trombonist",
    "architect and composer",
    "composer of film and television theme music",
    "orchestrator and film composer",
    "composer of Catholic liturgical songs",
    "jazz keyboardist and composer",
    "film music composer and conductor",
    "swing and hard bop trumpeter and composer",
    "film composer and conductor",
    "composer and bassoonist",
    "composer and playwright",
    "composer prolific in film music",
    "composer of orchestral",
    "concert band conductor and composer",
    "composer of classical music and conductor",
    "composer and librettist",
    "violist and composer of classical music",
    "music composer and arranger",
    "composer of orchestral and choral works",
    "guitarist and composer in classical",
    "composer and clarinetist",
    "experimental composer and pianist",
    "bandleader and composer for film and television",
    "jazz composer and saxophonist",
    "classical guitar composer",
    "composer and pioneer of electronic and computer music",
    "migrant composer",
    "jazz trumpet player and composer",
    "producer and composer of radio jingles",
    "television and film composer",
    "composer and music administrator",
    "composer of music and film scores",
    "composer and winner of the Pulitzer Prize",
    "composer and filmmaker",
    "born jazz pianist and composer",
    "jazz composer and alto saxophonist",
    "tango composer and pianist",
    "composer and bandleader",
    "composer and television producer",
    "composer and orchestrator",
    "composer and sculptor",
    "video game composer",
    "composer and sound",
    "music composer and film scorer",
    "jazz and R&B composer",
    "composer and big band leader",
    "composer and clarinet player",
    "composer and musical producer",
    "keyboard player and composer",
    "jazz trombonist and film composer",
    "saxophonist and film composer",
    "orchestral and choral composer",
    "orchestra conductor and composer",
    "composer and accordionist",
    "jazz composer and bandleader",
    "film score composer and music director",
    "microtonal composer",
    "composer and flutist",
    "composer and founding general director of Michigan Opera Theatre",
    "video game music composer",
    "brass band arranger and composer",
    "playwright and composer",
    "jazz alto saxophonist and composer",
    "composer and music producer",
    "violist and composer",
    "sitar player and composer",
    "jazz clarinetist and composer",
    "composer and jazz saxophonist",
    "Māori composer",
    "Filin composer and interpreter",
    "composer and biographer",
    "film composer and pianist",
    "composer and keyboard player",
    "choirmaster and composer",
    "country music composer",
    "musical composer and performer",
    "composer and performer",
    "new age pianist and composer",
    "Tony Award winning producer and composer",
    "composer and sound editor",
    "jazz pianist and film composer",
    "orchestra leader and composer",
    "classical and flamenco guitarist and composer",
    "composer and music industry executive",
    "television score composer",
    "erhu master and composer",
    "rock and jazz drummer and composer",
    "jazz drummer and composer",
    "composer and essayist",
    "classical oboist and composer",
    "Tony Award winning composer",
    "ragtime pianist and composer",
    "composer and jazz trumpeter",
    "classical composer and flautist",
    "bluesman and composer",
    "composer and fiddler",
    "composer and choir director",
    "jazz cellist and composer",
    "composer and novelist",
    "jazz guitarist",
    "composer and music critic",
    "music director and composer",
    "jazz trumpeter and composer",
    "trumpeter and composer",
    "composer and music director",
    "percussionist and composer",
    "composer and choral director",
    "cellist and composer",
    "electronic music composer",
    "Pulitzer Prize winning composer",
    "composer of film scores",
    "composer and orchestra leader",
    "jazz composer and arranger",
    "harpist and composer",
    "flautist and composer",
    "classical composer and conductor",
    "composer and jazz pianist",
    "choral conductor and composer",
    "classical guitarist and composer",
    "avant garde composer and pianist",
    "avant garde composer",
    "film and television music composer",
    "concert pianist and composer",
    "composer and cellist",
    "composer and guitarist",
    "musical director and composer",
    "composer and record producer",
    "composer of contemporary classical music",
    "composer and violist",
    "bassist and composer",
    "composer and violinist",
    "classical music composer",
    "jazz composer",
    "opera composer",
    "composer and arranger",
    "jazz guitarist and composer",
    "composer of classical music",
    "lyricist and composer",
    "film music composer",
    "and computer music composer",
    "music composer and",
    "music composer",
    "jazz saxophonist and composer",
    "saxophonist and composer",
    "classical composer",
    "classical pianist and composer",
    "composer and organist",
    "violinist and composer",
    "film and television composer",
    "television composer",
    "composer and lyricist",
    "film score composer",
    "organist and composer",
    "guitarist and composer",
    "film composer",
    "conductor and composer",
    "jazz pianist and composer",
    "pianist and composer",
    "composer and pianist",
    "composer and conductor",
    "composer of musicals",
    "musical composer and",
    "musical composer",
    "composer and",
    "and composer",
    "composer",
]
sports = []
sciences = []

business_farming = []
academia_humanities = [
    "ethnomusicologist and wife of composer Henry Cowell",  # before arts
]
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []
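
Within each list, roles are removed in the order listed, so an earlier entry that is a substring of a later one (for example `"composer and sound"` ahead of `"composer and sound editor"`) is stripped first and leaves a residue such as `editor` in `info_2`. The category is still assigned, but the leftover text may be worth a look. The sketch below is one way to spot such pairs; `find_shadowed` is a hypothetical helper and the excerpt is a small subset of the `arts` list.

```python
def find_shadowed(roles):
    """Return (earlier, later) pairs where the earlier role is a substring of the later one."""
    pairs = []
    for i, earlier in enumerate(roles):
        for later in roles[i + 1:]:
            if earlier in later:
                pairs.append((earlier, later))
    return pairs


# Small excerpt of the arts list, kept in its original relative order
excerpt = [
    "composer and sound",
    "composer and sound editor",
    "film composer",
    "composer",
]

shadowed = find_shadowed(excerpt)
```

An empty result would mean every role can still match in full by the time it is reached.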

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [70]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}
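
The `# before arts` comment depends on dictionary order: Python dictionaries preserve insertion order, so listing `academia_humanities` ahead of `arts` lets the longer role be consumed before the generic `"composer"` can match. A simplified sketch on a single string, using a hypothetical `categorize` helper that mirrors the replace-and-strip logic of the extraction loop used in this notebook:

```python
def categorize(item, search_dict):
    """Return the categories matched in `item` and the text left over."""
    matched = []
    for category, roles in search_dict.items():
        for role in roles:
            if role in item:
                matched.append(category)
                item = item.replace(role, "").strip()
    return matched, item


value = "ethnomusicologist and wife of composer Henry Cowell"
academia_roles = ["ethnomusicologist and wife of composer Henry Cowell"]
arts_roles = ["composer"]

# Order used in the notebook: academia_humanities is checked before arts
academia_first, _ = categorize(
    value, {"academia_humanities": academia_roles, "arts": arts_roles}
)

# Reversed order: the generic "composer" role matches first
arts_first, _ = categorize(
    value, {"arts": arts_roles, "academia_humanities": academia_roles}
)
```

With the reversed order, the entry would be tagged as `arts` only, which is why the key order is adjusted whenever a role must be matched before another category's generic term.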

#### Extracting Category from `info_2`

In [71]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df["arts"] == 1].sample(2)

CPU times: total: 1min 49s
Wall time: 1min 49s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
45920,14,C. N. Karunakaran,", 73, Indian painter.",https://en.wikipedia.org/wiki/C._N._Karunakaran,22,2013,December,,,,,,,,,,,,,73.0,,India,,,3.135494,0,0,0,0,0,1,0,0,0,0,0,0,1
69350,5,Daša Drndić,", 71, Croatian radio playwright and author, cancer.",https://en.wikipedia.org/wiki/Da%C5%A1a_Drndi%C4%87,9,2018,June,Radio Belgrade,,,cancer,,,,,,,,,71.0,,Croatia,,Radio Belgrade,2.302585,0,0,0,0,0,1,0,0,0,0,0,0,1
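
The nested loop above visits every row once per role, which likely accounts for much of the run time reported. As a possible alternative, the same replace-and-strip logic can be written with vectorized string methods. This is a sketch under assumptions: `extract_roles` is a hypothetical helper, the toy frame stands in for `df`, and it has not been benchmarked or run against the real dataset.

```python
import pandas as pd


def extract_roles(frame, column, search_dict):
    """Flag categories and strip matched roles, one vectorized pass per role."""
    for category, roles in search_dict.items():
        for role in roles:
            # Literal substring match; missing values never match
            mask = frame[column].str.contains(role, regex=False, na=False)
            if mask.any():
                frame.loc[mask, category] = 1
                frame.loc[mask, column] = (
                    frame.loc[mask, column]
                    .str.replace(role, "", regex=False)
                    .str.strip()
                )
    # Recount categories per row
    frame["num_categories"] = frame[list(search_dict)].sum(axis=1)
    return frame


# Toy data standing in for df (values are illustrative only)
toy = pd.DataFrame(
    {
        "info_2": ["jazz composer", "film composer and conductor", None, "painter"],
        "arts": [0, 0, 0, 0],
        "sports": [0, 0, 0, 0],
    }
)
toy_dict = {"arts": ["film composer and conductor", "composer"], "sports": []}
toy = extract_roles(toy, "info_2", toy_dict)
```

Roles are still processed in list order and categories in dictionary order, so the precedence rules of the loop are unchanged.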


#### Checking the Number of Rows without a First Category

In [72]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 47693 entries without any known_for category.


#### Observations:
- This iteration assigned a first category to 1190 entries, leaving 47693 uncategorized.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [73]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [74]:
# # Code to check each value
# roles_list.pop()

In [75]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "academic" in df.loc[index, "info"]], "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [76]:
# # Code to check each specific value
# specific_roles_list.pop()

In [77]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [index for index in df.index if "and academic fraudster" in df.loc[index, "info"]]
# ]

In [78]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [79]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "academician and Judaica scholar"]

#### Creating Lists for Each `known_for` Category

In [80]:
# Creating lists for each category
politics_govt_law = ["intelligence theorist"]

arts = [
    "and academic fraudster",  # before academia_humanities
]
sports = []
sciences = ["and shoe expert", "cyber security expert"]

business_farming = []
academia_humanities = [
    "Three Affiliated Tribes academic",
    "grammar academic",
    "academic and librarian",
    "university administrator and academic",
    "academic and redologist",
    "academic and professor",
    "educationalist and academic",
    "Druze academic",
    "academic and sinologist",
    "academic and educationalist",
    "academic director",
    "Blackfoot academic administrator",
    "medievalist and academic",
    "academic and college president;",
    "Creole academic",
    "anthropologist who founded the academic journal",
    "professor and academic",
    "academic and university president",
    "interpreter and academic",
    "university academic and administrator",
    "academic and administrator",
    "and medical academic",
    "law expert and academic",
    "academic of descent",
    "teacher and academic",
    "academic and philosopher",
    "ans academic",
    "academic leader and",
    "academic philosopher",
    "and academic Parkinson disease",
    "fat studies academic and",
    "and tax academic",
    "theorist and academic",
    "linguist and academic administrator",
    "classics and ancient history academic",
    "Africanist and academic",
    "archivist and academic administrator",
    "Orientalist and academic",
    "and religious academic",
    "scholar and academic administrator",
    "and nursing academic",
    "curator and academic administrator",
    "professor and academic administrator",
    "Meso epigraphist and academic",
    "academic and Quechua translator",
    "and culture academic",
    "academician and educator",
    "philosopher and acedemician",
    "Opaskwayak academic",
    "geographer and academic",
    "academic and university administrator",
    "translator and academic",
    "academic and linguist",
    "academic and scholar",
    "folklorist and academic",
    "educator and academic administrator",
    "classicist and academic",
    "archaeologist and academic",
    "Māori language academic",
    "academic and translator",
    "librarian and academic",
    "educator and academic",
    "linguist and academic",
    "academic and educator",
    "philosopher and academic",
    "scholar and academic",
    "and an academic teacher",
    "academician and",
    "academic and professor of",
    "and academic teacher",
    "and an academic",
    "and academic administrator",
    "and academician",
    "academician",
    "academic professor and",
    "academic administrator and",
    "academic administrator",
    "and academic",
    "academic and",
    "academic",
]
law_enf_military_operator = []
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []

In [81]:
# Hard-coding cause_of_death for entry with value in info_2
index = df[df["link"] == "https://en.wikipedia.org/wiki/Rutherford_Aris"].index
df.loc[index, "cause_of_death"] = "Parkinson disease"
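
A hard-coded update like the one above silently does nothing if the link matches no row, and touches several rows if the link is duplicated. One way to guard against both cases is to check the match count first. The sketch below assumes a hypothetical helper, `set_value_by_link`, and uses toy data in place of `df`; the second link is a placeholder.

```python
import pandas as pd


def set_value_by_link(frame, link, column, value):
    """Set `column` to `value` for the single row whose `link` matches exactly."""
    mask = frame["link"] == link
    n_matches = int(mask.sum())
    if n_matches != 1:
        raise ValueError(f"Expected exactly 1 row for {link!r}, found {n_matches}")
    frame.loc[mask, column] = value
    return frame


# Toy data standing in for df
toy = pd.DataFrame(
    {
        "link": [
            "https://en.wikipedia.org/wiki/Rutherford_Aris",
            "https://en.wikipedia.org/wiki/Example_Person",  # placeholder link
        ],
        "cause_of_death": [None, None],
    }
)

toy = set_value_by_link(
    toy,
    "https://en.wikipedia.org/wiki/Rutherford_Aris",
    "cause_of_death",
    "Parkinson disease",
)
```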

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [82]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "arts": arts,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [83]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df["academia_humanities"] == 1].sample(2)

CPU times: total: 49.2 s
Wall time: 49.2 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
57825,9,Patrick J. O'Donnell,", 68, Scottish academic.",https://en.wikipedia.org/wiki/Patrick_J._O%27Donnell,3,2016,April,,,,,,,,,,,,,68.0,,Scotland,,,1.386294,0,0,0,1,0,0,0,0,0,0,0,0,1
40930,4,Tony Sweeney,", 81, Irish sports writer and historian, heart attack.",https://en.wikipedia.org/wiki/Tony_Sweeney,3,2012,December,,,sports,heart attack,,,,,,,,,81.0,,Ireland,,,1.386294,0,0,0,1,0,1,0,0,0,0,0,0,2


#### Checking the Number of Rows without a First Category

In [84]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 46805 entries without any known_for category.


#### Observations:
- This iteration assigned a first category to 888 entries, leaving 46805 uncategorized.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [85]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [86]:
# # Code to check each value
# roles_list.pop()

In [87]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "mathematician" in df.loc[index, "info"]],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [88]:
# # Code to check each specific value
# specific_roles_list.pop()

In [89]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "mathematician and journal editor" in df.loc[index, "info"]
#     ]
# ]

In [90]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [91]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "mathematician of ancestry"]

#### Creating Lists for Each `known_for` Category

In [92]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = []
sciences = [
    "mathematician of ancestry",
    'mathematician and physicist known as "The Voice of JPL"',
    "mathematician & cosmologist; co advocate of the Steady State theory",
    "mathematician at the University of Chicago",
    "mathematician and computer pioneer",
    "physicist and applied mathematician",
    "mathematician and computer scientist developed diehard tests",
    "mathematician and systems engineer",
    "mathematician  mathematics",
    "mathematician and nuclear scientist",
    "geophysicist and mathematician",
    "mathematician specialising in group theory",
    "mathematician and aerodynamicist",
    "mathematician and theoretical astronomer",
    "mathematician known for his contribution to graph theory",
    "mathematical physicist and mathematician",
    "and later mathematician",
    "mathematician and Doctor of Medicine",
    "civil engineer and mathematician",
    "mathematician and pioneering computer scientist",
    "statistician and mathematician",
    "theoretical physicist and mathematician",
    "born mathematician and",
    "mathematician and engineering",
    "engineer and mathematician",
    "mathematician and astronomer",
    "pure mathematician",
    "mathematician and computer programmer",
    "amateur mathematician",
    "scientist and mathematician",
    "mathematician and astrophysicist",
    "mathematician and futurist",
    "mathematician and scientist",
    "mathematician and theoretical computer scientist",
    "mathematician and inventor",
    "mathematician and rheologist",
    "research chemist and mathematician",
    "mathematician and journal editor",
    "mathematician and logician",
    "mathematician and oceanographer",
    "mathematician and engineer",
    "mathematician and mathematics",
    "mathematician and statistician",
    "astronomer and mathematician",
    "physicist and mathematician",
    "mathematician and physicist",
    "mathematician and computer scientist",
    "applied mathematician and",
    "applied mathematician",
    "mathematician and",
    "and mathematician",
    "mathematician",
]

business_farming = []
academia_humanities = []
law_enf_military_operator = [
    "cryptanalyst",
    "radar engineer",
]
spiritual = []
social = []
crime = []
event_record_other = []
other_species = []
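
The lists above run roughly from specific to general so that long phrases are removed before the shorter ones they contain. One way to get that property automatically, offered here as an assumption rather than the notebook's method, is to sort a list by length in descending order: a longer string can never be a substring of a shorter one. The sketch uses a few entries from `sciences`.

```python
# A few entries from the sciences list, deliberately shuffled
roles = [
    "mathematician",
    "applied mathematician",
    "and mathematician",
    "applied mathematician and",
]

# Longest first, so no earlier role can be contained in a later one
ordered = sorted(roles, key=len, reverse=True)

# Confirm that no earlier role is a substring of a later role
no_shadowing = all(
    earlier not in later
    for i, earlier in enumerate(ordered)
    for later in ordered[i + 1:]
)
```

Sorting by length only handles order within one list; precedence between categories is still set by the key order of `known_for_dict`, and a hand-curated order may be preferable where it encodes other intent.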

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [93]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [94]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df["sciences"] == 1].sample(2)

CPU times: total: 42.9 s
Wall time: 43.1 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
5278,5,Jürgen Neukirch,", 59, German mathematician.",https://en.wikipedia.org/wiki/J%C3%BCrgen_Neukirch,5,1997,February,,,,,,,,,,,,,59.0,,Germany,,,1.791759,1,0,0,0,0,0,0,0,0,0,0,0,1
12159,20,Crispin Nash-Williams,", 68, British mathematician.",https://en.wikipedia.org/wiki/Crispin_Nash-Williams,3,2001,January,,,,,,,,,,,,,68.0,,United Kingdom of Great Britain and Northern Ireland,,,1.386294,1,0,0,0,0,0,0,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [95]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 46140 entries without any known_for category.


#### Observations:
- This iteration assigned a first category to 665 entries, leaving 46140 uncategorized.
- We will proceed to rebuild `known_for_dict` for the next iteration.

#### Finding `known_for` Roles in `info_2`

In [96]:
# Obtaining values for column and their counts
roles_list = df["info_2"].value_counts(ascending=True).index.tolist()

In [97]:
# # Code to check each value
# roles_list.pop()

In [98]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [index for index in df.index if "basketball player" in df.loc[index, "info"]],
#         "info_2",
#     ]
#     .value_counts()
#     .index.tolist()
# )

In [99]:
# # Code to check each specific value
# specific_roles_list.pop()

In [100]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "involved in Bob Knight controversy" in df.loc[index, "info"]
#     ]
# ]

In [101]:
# # Example code to quick-screen values that may overlap categories
# df.loc[
#     [
#         index
#         for index in df.index
#         if "outlaw country music singer songwriter" in df.loc[index, "info"]
#     ]
# ]

In [102]:
# # Example code to quick-check a specific entry
# df[df["info_2"] == "basketball player involved in Bob Knight controversy"]

#### Creating Lists for Each `known_for` Category

In [103]:
# Creating lists for each category
politics_govt_law = []

arts = []
sports = [
    "Hall of Fame Olympic basketball player and coach",
    "baseball and basketball player",
    "college baseball and basketball player and",
    "beach volleyball and basketball player",
    "NBA basketball player",
    "basketball player for Arabian Al Ittihad team",
    "basketball player of Seton Hall",
    "Olympic bronze medal winning basketball player",
    "All Star basketball player and coach",
    "basketball player and second highest scorer in Iowa State University history",
    "basketball player with the Boston Celtics and North Carolina State University",
    "Atlanta Hawks basketball player",
    "basketball player and former member of the National Basketball Team",
    "NBA and Ohio State University basketball player",
    "professional basketball player and sports",
    "Spaniard basketball player",
    "basketball player and Olympic athlete",
    "All basketball player for the Oklahoma Sooners and the AAU Phillips ers",
    "track athlete and basketball player",
    "women basketball player",
    "water polo and basketball player and coach",
    "basketball player and college coach",
    "basketball player and professional wrestler",
    "track and field athlete and basketball player",
    "Olympic basketball player and contributor",
    "basketball player and tennis coach",
    "softball and basketball player",
    "basketball player and athlete",
    "and wheelchair basketball player",
    "basketball player and Hall of Fame coach",
    "college basketball player and head coach",
    "Olympic champion basketball player",
    "gold medal winning Olympic basketball player",
    "Collegiate Hall of Fame basketball player",
    "Olympic silver medal winning basketball player",
    "Olympic gold medallist basketball player",
    "Olympic basketball player and coach",
    "football and basketball player and coach",
    "wheelchair basketball player",
    "Olympic gold medal winning basketball player",
    "college basketball player and coach",
    "Hall of Fame basketball player and coach",
    "Hall of Fame basketball player and",
    "Hall of Fame basketball player",
    "Olympic basketball player",
    "and basketball player and coach",
    "basketball player and coach",
    "college basketball player for the University of",
    "and college basketball player",
    "college basketball player",
    "professional basketball player",
    "basketball player and",
    "and basketball player",
    "basketball player",
]
sciences = []

business_farming = []
academia_humanities = []
law_enf_military_operator = []
spiritual = []
social = []
crime = ["involved in point shaving scandal"]
event_record_other = [
    "involved in Bob Knight controversy",
]
other_species = []

#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [104]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}

#### Extracting Category from `info_2`

In [105]:
%%time

# Dictionary version
search_dict = known_for_dict

# Column to check
column = "info_2"

# Rows with a non-null value in the column
dataframe = df[df[column].notna()]

# For loop to find role in column and extract it as category
for category, category_lst in search_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, "").strip()

# Updating num_categories
df["num_categories"] = df[list(known_for_dict)].sum(axis=1)

# Checking a sample of rows
df[df["sports"] == 1].sample(2)

CPU times: total: 38.4 s
Wall time: 38.4 s


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death,place_1,place_2,info_parenth_copy,log_num_references,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
79351,14,Jimmy Conway,", 73, Irish footballer .",https://en.wikipedia.org/wiki/Jimmy_Conway_(footballer),4,2020,February,"Fulham, Portland Timbers, national team",,,,,,,,,,,,73.0,,Ireland,,"Fulham, Portland Timbers, national team",1.609438,0,0,0,0,0,0,1,0,0,0,0,0,1
25475,25,Sonny Grandelius,", 79, American football player and coach.",https://en.wikipedia.org/wiki/Sonny_Grandelius,16,2008,April,,,,,,,,,,,,,79.0,,United States of America,,,2.833213,0,0,0,0,0,0,1,0,0,0,0,0,1


#### Checking the Number of Rows without a First Category

In [106]:
# Checking the number of rows without a first category
print(
    f'There are {len(df[df["num_categories"]==0])} entries without any known_for category.'
)

There are 45393 entries without any known_for category.


#### Observations:
- This iteration assigned a first category to 747 entries, leaving 45393 uncategorized.
- It is time to export our dataframe and start a new notebook.

### Exporting Dataset to SQLite Database [wp_life_expect_clean6.db](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean6.db)

In [107]:
# Exporting dataframe

# Saving dataset in a SQLite database
conn = sql.connect("wp_life_expect_clean6.db")
df.to_sql("wp_life_expect_clean6", conn, index=False)

# Chime notification when cell executes
chime.success()
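
By default, `DataFrame.to_sql` uses `if_exists="fail"`, so re-running the export cell raises an error once the table exists. The sketch below shows one possible pattern for a repeatable export followed by a read-back check. It uses an in-memory database and a toy frame as stand-ins, and `toy_table` is a hypothetical table name; whether overwriting is appropriate for the real file is a judgment call.

```python
import sqlite3 as sql

import pandas as pd

# Toy data standing in for df
toy = pd.DataFrame({"name": ["A", "B"], "num_categories": [1, 0]})

# In-memory database so the sketch leaves no file behind
conn = sql.connect(":memory:")

# if_exists="replace" overwrites the table instead of raising on a re-run
toy.to_sql("toy_table", conn, index=False, if_exists="replace")

# Read the table back and compare it with what was written
round_trip = pd.read_sql("SELECT * FROM toy_table", conn)
conn.close()

same_shape = round_trip.shape == toy.shape
same_columns = list(round_trip.columns) == list(toy.columns)
```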

# [Proceed to Data Cleaning Part 7](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean7_thanak_2022_07_26.ipynb)