# Wikipedia Notable Life Expectancy
## Notebook 9: Data Cleaning Part 8
### Context

The
### Objective

The
### Data Dictionary
- Feature: Description

### Importing Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")


## Data Overview

### [Reading](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_clean7.db), Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean7.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean7", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 98040 rows and 26 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,info_parenth_copy,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,86.0,,United Kingdom of Great Britain and Northern Ireland,,,0,0,0,0,0,1,0,0,0,0,0,0,1
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,68.0,,Ireland,,,0,0,0,1,0,1,0,0,1,0,0,0,3



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,info_parenth_copy,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
98038,9,Aamir Liaquat Hussain,", 50, Pakistani journalist and politician, MNA .",https://en.wikipedia.org/wiki/Aamir_Liaquat_Hussain,99,2022,June,"2002 2007, since 2018",50.0,,Pakistan,,", since",0,0,0,0,0,1,0,0,1,0,0,0,2
98039,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,86.0,,"China, People's Republic of",,,1,0,0,0,0,0,0,0,0,0,0,0,1



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,info_parenth_copy,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
43328,21,Peter Ellis,", 66, Australian football player.",https://en.wikipedia.org/wiki/Peter_Ellis_(footballer),6,2013,May,,66.0,,Australia,,,0,0,0,0,0,0,1,0,0,0,0,0,1
55224,14,Paul W. Taylor,", 91, American philosopher.",https://en.wikipedia.org/wiki/Paul_W._Taylor,3,2015,October,,91.0,,United States of America,,,0,0,0,1,0,0,0,0,0,0,0,0,1
45196,18,Bum Phillips,", 90, American football coach .",https://en.wikipedia.org/wiki/Bum_Phillips,23,2013,October,"Houston Oilers, New Orleans Saints",90.0,,United States of America,,"Oilers, New Orleans Saints",0,0,0,0,0,0,1,0,0,0,0,0,1
27442,20,Stan Hagen,", 68, Canadian politician, member of the Legislative Assembly of British Columbia , heart attack.",https://en.wikipedia.org/wiki/Stan_Hagen,7,2009,January,"1986 1991, since 2001",68.0,heart attack,Canada,,", since",0,0,0,0,0,0,0,0,1,0,0,0,1
84821,24,Nzamba Kitonga,", 64, Kenyan lawyer and politician.",https://en.wikipedia.org/wiki/Nzamba_Kitonga,4,2020,October,,64.0,,Kenya,,,0,0,0,0,0,0,0,0,1,0,0,0,1



### Checking Data Types and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98040 entries, 0 to 98039
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   day                        98040 non-null  object 
 1   name                       98040 non-null  object 
 2   info                       98040 non-null  object 
 3   link                       98040 non-null  object 
 4   num_references             98040 non-null  int64  
 5   year                       98040 non-null  int64  
 6   month                      98040 non-null  object 
 7   info_parenth               36660 non-null  object 
 8   age                        98040 non-null  float64
 9   cause_of_death             33335 non-null  object 
 10  place_1                    97885 non-null  object 
 11  place_2                    5897 non-null   object 
 12  info_parenth_copy          36660 non-null  object 
 13  sciences                   98040 non-null  int


#### Observations:
- With our dataset loaded, we can pick up where we left off with extracting `known_for` and `cause_of_death` values.
- As all of the numbered `info_` columns have been searched and dropped, we are left with `info_parenth` (and its copy).  
- By definition, we would expect `info_parenth` to contain non-essential values.  The column contains a lot of values, so we will begin by looking only for `known_for` information for the few entries that do not yet have a `known_for` category.
- Then we can consider an approach to searching for any `cause_of_death` information in `info_parenth`, followed by a limited search for missing additional `known_for` categories.

### Extracting Remaining `known_for` for Entries Still Lacking a `known_for` Category

#### Checking Entries Lacking a `known_for` Category

In [6]:
# Checking entries with num_categories == 0
df[df["num_categories"] == 0]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,info_parenth_copy,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
278,4,Aníbal,", 53, Mexican , brain cancer.",https://en.wikipedia.org/wiki/An%C3%ADbal_(wrestler),20,1994,March,professional wrestler,53.0,brain cancer,Mexico,,professional wrestler,0,0,0,0,0,0,0,0,0,0,0,0,0
11490,10,Chandra Khonnokyoong,", 91, Thai .",https://en.wikipedia.org/wiki/Chandra_Khonnokyoong,25,2000,September,,91.0,,Thailand,,,0,0,0,0,0,0,0,0,0,0,0,0,0
12052,3,Kung Fu,", 49, Mexican , arterial hyper tension.",https://en.wikipedia.org/wiki/Kung_Fu_(wrestler),11,2001,January,,49.0,arterial hyper tension,Mexico,,,0,0,0,0,0,0,0,0,0,0,0,0,0
16376,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"née Eva Narcissus Boyd, , American pop singer",59.0,,United States of America,,"née Eva Narcissus Boyd, , pop singer",0,0,0,0,0,0,0,0,0,0,0,0,0
36930,12,Natalee Holloway,", 18",https://en.wikipedia.org/wiki/Natalee_Holloway,198,2012,January,"in 2005, American student, missing since 2005 declared legally dead on this date",18.0,,United States of America,,"in , student, missing since declared legally dead on this date",0,0,0,0,0,0,0,0,0,0,0,0,0
79603,27,Sudhakar Chaturvedi,", 122 .",https://en.wikipedia.org/wiki/Sudhakar_Chaturvedi,37,2020,February,"claimed, Indian Vedic scholar and courier Mahatma Gandhi",122.0,,India,,"claimed, Vedic scholar and courier Mahatma Gandhi",0,0,0,0,0,0,0,0,0,0,0,0,0



#### Observations:
- We can see some additional information in `info_parenth` for some of the values.
- Since we previously separated the information contained in parentheses from the original `info` column, we will keep `info_parenth` intact and use `info_parenth_copy` for any value extraction.
- Two of these entries have no value in `info_parenth_copy`. We will hard-code their missing `known_for` information, since it is either apparent in the link value or easy to find by following the link.
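
The hard-coding step amounts to a label-based assignment on the row whose `link` matches. Below is a minimal sketch on a toy frame. The two-row `toy` DataFrame is made up for illustration, while the link and the replacement value are the ones used later in this notebook.

```python
import pandas as pd

# Toy stand-in for df; only the columns needed for the illustration
toy = pd.DataFrame(
    {
        "link": [
            "https://en.wikipedia.org/wiki/Kung_Fu_(wrestler)",
            "https://en.wikipedia.org/wiki/Raymond_Crotty",
        ],
        "info_parenth_copy": [None, None],
    }
)

# Boolean mask selects the row(s) for the entry; .loc assigns in place
mask = toy["link"] == "https://en.wikipedia.org/wiki/Kung_Fu_(wrestler)"
toy.loc[mask, "info_parenth_copy"] = "wrestler"

print(toy["info_parenth_copy"].tolist())
```

Only the matching row is changed; the other row keeps its missing value.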

#### Finding `known_for` Roles in `info_parenth_copy` for Entries Lacking any Category

In [7]:
# # Obtaining unique values for column and their counts
# roles_list = (
#     df[df["num_categories"] == 0]["info_parenth_copy"]
#     .value_counts(ascending=True)
#     .index.tolist()
# )


In [8]:
# # Code to check each value
# value = roles_list.pop()
# value


In [9]:
# # Create specific_roles_list for above popped value
# specific_roles_list = (
#     df.loc[
#         [
#             index
#             for index in df[df["info_parenth_copy"].notna()].index
#             if value in df.loc[index, "info_parenth_copy"]
#         ],
#         "info_parenth_copy",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [10]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_list, key=lambda x: len(x), reverse=True)


In [11]:
# # Example code to quick-check a specific entry
# df[df["info_parenth_copy"] == value]


#### Creating Lists for Each `known_for` Category

In [12]:
# Creating lists for each category and sorting by decreasing length and removing duplicates

politics_govt_law = ["and courier Mahatma Gandhi"]
politics_govt_law = sorted(
    list(set(politics_govt_law)), key=lambda x: len(x), reverse=True
)

arts = ["née Eva Narcissus Boyd, , pop singer"]
arts = sorted(list(set(arts)), key=lambda x: len(x), reverse=True)

sports = ["professional wrestler", "wrestler"]
sports = sorted(list(set(sports)), key=lambda x: len(x), reverse=True)

sciences = []
sciences = sorted(list(set(sciences)), key=lambda x: len(x), reverse=True)

business_farming = []
business_farming = sorted(
    list(set(business_farming)), key=lambda x: len(x), reverse=True
)

academia_humanities = ["scholar"]
academia_humanities = sorted(
    list(set(academia_humanities)), key=lambda x: len(x), reverse=True
)

law_enf_military_operator = []
law_enf_military_operator = sorted(
    list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True
)

spiritual = ["claimed,  Vedic", "spiritual teacher"]
spiritual = sorted(list(set(spiritual)), key=lambda x: len(x), reverse=True)

social = []
social = sorted(list(set(social)), key=lambda x: len(x), reverse=True)

crime = []
crime = sorted(list(set(crime)), key=lambda x: len(x), reverse=True)

event_record_other = ["in , student, missing since declared legally dead on this date"]
event_record_other = sorted(
    list(set(event_record_other)), key=lambda x: len(x), reverse=True
)

other_species = []
other_species = sorted(list(set(other_species)), key=lambda x: len(x), reverse=True)

cause_of_death = []
cause_of_death = sorted(list(set(cause_of_death)), key=lambda x: len(x), reverse=True)


In [13]:
# Hard-coding info_parenth_copy for entry lacking known_for values
df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Chandra_Khonnokyoong"].index,
    "info_parenth_copy",
] = "spiritual teacher"


# Hard-coding info_parenth_copy for entry lacking known_for values
df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Kung_Fu_(wrestler)"].index,
    "info_parenth_copy",
] = "wrestler"


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [14]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting `known_for` Category Values from `info_parenth_copy` for Entries without a Category

In [15]:
%%time

# Column to check
column = 'info_parenth_copy'

# Start dataframe
dataframe = df[(df[column].notna()) & (df['num_categories']==0)]

# For loop to find role in column and extract it as category
for category, category_lst in known_for_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Calculating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

CPU times: total: 31.2 ms
Wall time: 28.6 ms



#### Checking `num_categories` Value Counts

In [16]:
# Checking num_categories Value Counts
df["num_categories"].value_counts()

1    84131
2    12788
3     1082
4       36
5        3
Name: num_categories, dtype: int64


#### Observations:
- All entries now have at least one `known_for` category.
- Next, we will proceed to examine the values in `cause_of_death` to potentially guide finding that information in `info_parenth_copy` for entries that lack a value for it.
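
The category extraction above sorts each role list longest-first before searching. The sketch below, on a made-up three-row frame, suggests why: a longer role such as "professional wrestler" is consumed whole before the shorter "wrestler" is tried, so no fragment like "professional" is left behind. The `toy` frame and `roles` set are illustrative assumptions, not data from the notebook.

```python
import pandas as pd

# Toy stand-in for df with a single category column
toy = pd.DataFrame(
    {
        "info_parenth_copy": ["professional wrestler", "wrestler", "pop singer"],
        "sports": [0, 0, 0],
    }
)

# Longest role first, so "professional wrestler" is consumed before "wrestler"
roles = sorted({"wrestler", "professional wrestler"}, key=len, reverse=True)

for role in roles:
    for index in toy.index:
        item = toy.loc[index, "info_parenth_copy"]
        if item and role in item:
            # Flag the category and remove the matched role from the text
            toy.loc[index, "sports"] = 1
            toy.loc[index, "info_parenth_copy"] = item.replace(role, "").strip()

print(toy)
```

With the order reversed, the first row would be flagged correctly but would keep the leftover text "professional".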

### Searching for Remaining `cause_of_death` Values in `info_parenth_copy`

In [17]:
# # Creating list of cause_of_death values
# cause_list = df["cause_of_death"].value_counts(ascending=True).index.tolist()


In [18]:
# # Updating cause_list to contain only causes that are in info_parenth_copy values
# cause_list = [
#     item
#     for item in cause_list
#     if any(
#         item in value
#         for value in df[df["info_parenth_copy"].notna()]["info_parenth_copy"]
#     )
# ]


In [19]:
# # Checking the cause_of_death values starting with most frequent
# value = cause_list.pop()
# value


In [20]:
# # Creating list of info_parenth_copy values that contain cause_of_death value
# df.loc[
#     [
#         index
#         for index in df[df["info_parenth_copy"].notna()].index
#         if value in df.loc[index, "info_parenth_copy"]
#     ],
#     "info_parenth_copy",
# ].value_counts().index.tolist()


In [21]:
# # Checking specific entries
# df[
#     df["info_parenth_copy"]
#     == "1969 1974 and Foreign Affairs 1974 1982; 1982 1992, Vice Chancellor 1974 1982; 1982 1992"
# ]


#### Creating List for `cause_of_death`

In [22]:
# Creating list for cause_of_death
cause_of_death = [
    "cancer",
    "pancreatic cancer",
    "adrenal cancer",
    "endometrial cancer",
    "nasopharynx cancer",
    "parotid cancer",
    "prostate cancer",
    "multiple myeloma, blood cancer",
    "bowel cancer",
    "oesophageal cancer",
    "liver cancer",
    "lung cancer",
    "cancer",
    "breast cancer",
    "testicular cancer",
    "ovarian cancer",
    "peritoneal cancer",
    "heart attack",
    "COVID",
    "congestive heart failure",
    "heart failure",
    "ischemic heart failure",
    "pneumonia",
    "AIDS, pneumonia",
    "pneumonia, infarctions",
    "bronchial pneumonia",
    "stroke",
    "heat stroke",
    "shot",
    "gunshot wounds",
    "traffic collision",
    "natural causes disease",
    "natural causes",
    "suicide",
    "suspected suicide",
    "suicide by drowning",
    "suicide by hydrogen sulfide",
    "suicide by hanging",
    "Alzheimer disease",
    "leukemia",
    "Parkinson disease",
    "Parkinson’s disease",
    "Creutzfeldt Jakob disease",
    "kidney disease",
    "Pick disease",
    "heart disease",
    "car accident",
    "injuries due to a fall",
    "fall",
    "subdural hematoma, fall",
    "multiple organ failure",
    "AIDS, lymphoma",
    "Hodgkin lymphoma",
    "gastric lymphoma",
    "plane crash",
    "amyotrophic lateral sclerosis",
    "euthanized",
    "uveal melanoma",
    "emphysema",
    "pulmonary emphysema",
    "emphysema, bronchitis",
    "Lewy body dementia",
    "renal failure",
    "intracerebral hemorrhage",
    "liver failure",
    "pulmonary embolism",
    "homicide",
    "pulmonary fibrosis",
    "idiopathic pulmonary fibrosis",
    "abdominal aortic aneurysm",
    "sepsis",
    "glioblastoma multiforme",
    "Jordanian bombings",
    "accidental shooting",
    "pulmonary edema",
    "septic infection",
    "myelodysplastic syndrome",
    "locked in syndrome",
    "multiple organ dysfunction syndrome",
    "superior vena cava syndrome",
    "Marfan syndrome",
    "Guillain Barré syndrome",
    "multiple sclerosis",
    "AIDS",
    "multiple organ failure",
    "pulmonary emphysema",
    "emphysema",
    "emphysema, bronchitis",
    "aortic dissection",
    "progressive supranuclear palsy",
    "Hodgkin lymphoma",
    "COPD",
    "pancreatitis",
    "cerebral haemorrhage",
    "ALS",
    "AL amyloidosis",
    "car accident",
    "accidental shooting",
    "epilepsy",
    "dilated cardiomyopathy",
    "thrombosis",
    "rheumatoid arthritis",
    "beheading",
    "leiomyosarcoma",
    "Ewing sarcoma",
    "sarcoma",
    "leptomeningeal carcinomatosis",
    "nasopharyngeal carcinoma",
    "small cell carcinoma",
    "myelodysplasia",
    "pulmonary embolism",
    "embolism",
    "suffocated",
    "cerebral haemorrhage",
    "assassination",
    "gastrointestinal hemorrhage",
    "intracerebral hemorrhage",
    "anaphylaxis",
    "progressive supranuclear palsy",
    "shelling",
    "pulmonary edema",
    "Jordanian bombings",
    "posterior cortical atrophy",
    "emphysema, bronchitis",
    "West Nile virus",
    "corticobasal degeneration",
    "heat stroke",
    "glioblastoma multiforme",
    "acute endocarditis",
    "arrhythmogenic right ventricular dysplasia",
    "alcoholism",
    "plane crash",
    "normal pressure hydrocephalus",
    "primary progressive aphasia",
    "dilated cardiomyopathy",
    "subdural haematoma",
    "arrhythmia",
    "thrombus",
    "thrombosis",
    "essential thrombocytosis",
    "thrombotic thrombocytopenic purpura",
    "vasculitis",
    "self defenestration",
    "ventricular tachycardia",
]

# Clearing out duplicate values and sorting in descending length order to use for extracting values
cause_of_death = sorted(list(set(cause_of_death)), key=lambda x: len(x), reverse=True)


In [23]:
# Dropping info_parenth_copy value for entries to avoid incorrect cause_of_death
df.loc[
    [
        index
        for index in df[df["info_parenth_copy"].notna()].index
        if "breaststroke" in df.loc[index, "info_parenth_copy"]
        or "backstroke" in df.loc[index, "info_parenth_copy"]
    ],
    "info_parenth_copy",
] = ""

# Dropping info_parenth_copy value for entries to avoid incorrect cause_of_death
df.loc[
    [
        index
        for index in df[df["info_parenth_copy"].notna()].index
        if "shot put" in df.loc[index, "info_parenth_copy"]
        or "Aldershot" in df.loc[index, "info_parenth_copy"]
    ],
    "info_parenth_copy",
] = ""

# Dropping info_parenth_copy value for entry to avoid incorrect cause_of_death
df.loc[
    df[df["info_parenth_copy"] == "fallout shelter sign"].index, "info_parenth_copy"
] = ""

# Dropping info_parenth_copy value for entry to avoid incorrect cause_of_death
df.loc[
    df[
        df["info_parenth_copy"]
        == "HIV, President of the International AIDS Society 1994 1998"
    ].index,
    "info_parenth_copy",
] = ""

# Dropping info_parenth_copy value for entry to avoid incorrect cause_of_death
df.loc[
    df[df["info_parenth_copy"] == "assassination of Orlando Letelier"].index,
    "info_parenth_copy",
] = ""

# Dropping info_parenth_copy value for entries to avoid incorrect cause_of_death
df.loc[
    [
        index
        for index in df[df["info_parenth_copy"].notna()].index
        if "Suicide" in df.loc[index, "info_parenth_copy"]
    ],
    "info_parenth_copy",
] = ""

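
The values blanked above are needed because a plain substring test treats "stroke" inside "breaststroke" as a hit. As a point of comparison only, the sketch below contrasts the substring test with a hypothetical whole-word test built on `re`. The helper names `contains_plain` and `contains_word` and the sample `values` are assumptions for illustration and are not part of this notebook.

```python
import re

values = ["100 m breaststroke", "stroke", "shot put", "shot"]


def contains_plain(cause, text):
    # Plain substring test, the kind of check used in this notebook
    return cause in text


def contains_word(cause, text):
    # Hypothetical stricter test: cause must appear as a whole word
    return re.search(rf"\b{re.escape(cause)}\b", text) is not None


plain = [v for v in values if contains_plain("stroke", v)]
strict = [v for v in values if contains_word("stroke", v)]

print(plain)
print(strict)
print(contains_word("shot", "shot put"))
```

A word boundary would rule out "breaststroke", but "shot put" still contains "shot" as a whole word, so some manual screening of this kind would likely remain necessary either way.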

#### Extracting `cause_of_death` Values from `info_parenth_copy`

In [24]:
%%time

# Column to search
column = "info_parenth_copy"

# Dataframe to search
dataframe = df[df[column].notna()]

# For loop to extract cause from column to cause_of_death
for cause in cause_of_death:
    for index in dataframe.index:
        item = df.loc[index, column]
        if item and cause in item:
            existing = df.loc[index, "cause_of_death"]
            # Append to an existing cause, otherwise set it
            # (pd.notna guards against NaN as well as None)
            if pd.notna(existing) and existing:
                df.loc[index, "cause_of_death"] = existing + "/" + cause
            else:
                df.loc[index, "cause_of_death"] = cause
            df.loc[index, column] = item.replace(cause, "").strip()

# Checking number of cause_of_death values
print(
    f'There are {df["cause_of_death"].notna().sum()} values in cause_of_death column.\n'
)

There are 33460 values in cause_of_death column.

CPU times: total: 1min 33s
Wall time: 1min 34s



#### Observations:
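- Our last search filled `cause_of_death` for 125 entries that previously had no value (33,335 before, 33,460 after). Causes appended to an existing value with `/` do not change this count.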
- There are additional category values in `info_parenth_copy` that were not previously captured.  The challenge of searching this column is that it has a very high proportion of unique values, so the cost of capturing the additional categories may be too high.
- Let us attempt to narrow the search by restricting it to most frequent key words, such as "politician", etc., then only search for them in `info_parenth_copy` values for entries that do not already have the associated category.
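
The constrained search described in the last bullet can be sketched as two filters: first keep rows that have a value and do not already carry the category, then look for the keyword among those rows only. The `toy` frame and the choice of "politician" as `keyword` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "info_parenth_copy": [
            "politician since",
            "and politician",
            "national team",
            None,
        ],
        "politics_govt_law": [1, 0, 0, 0],
    }
)

keyword = "politician"

# Only rows with a value AND without the category already assigned
candidates = toy[
    (toy["info_parenth_copy"].notna()) & (toy["politics_govt_law"] == 0)
]

# Among those, keep rows whose text contains the keyword (literal match)
hits = candidates[candidates["info_parenth_copy"].str.contains(keyword, regex=False)]

print(hits.index.tolist())
```

Row 0 also contains the keyword but is skipped because it already has the category, which is what keeps the number of values to screen manageable.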

### Search of `info_parenth_copy` for Additional `known_for` Categories with Constraints

#### Checking Initial Value Counts for `info_parenth_copy`

In [25]:
# Checking info_parenth_copy initial value counts
df["info_parenth_copy"].value_counts()

                                           12570
, ,                                         3293
,                                           1999
since                                       1202
national team                                167
                                           ...  
Silver Oak Cellars                             1
, Foreign Secretary                            1
UCLA, Milwaukee Bucks                          1
Brownsville Station                            1
and minister of industry and technology        1
Name: info_parenth_copy, Length: 13248, dtype: int64


#### Observations:
- `info_parenth_copy` holds 13,248 distinct values across roughly 36,660 non-null entries (a little over 1/3), so we will aim to streamline the search: it should surface information that adds a new category to an entry without taking excessive time.  By definition, information in parentheses is expected to add detail that is non-essential to the primary information.

#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match

In [26]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list

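
For comparison, the same index list can likely be obtained without an explicit loop by using the pandas string accessor. The function name `rows_with_pattern_vectorized` is hypothetical and the `toy` frame is made up. `Series.str.contains` with `regex=True` matches anywhere in the value, much like `re.search`, and `na=False` treats missing values as non-matches.

```python
import pandas as pd


def rows_with_pattern_vectorized(dataframe, column, pattern):
    """
    Hypothetical vectorized variant of rows_with_pattern: returns the
    index labels of rows whose column value contains a match for pattern.
    """
    mask = dataframe[column].str.contains(pattern, regex=True, na=False)
    return dataframe.index[mask].tolist()


toy = pd.DataFrame({"info_parenth_copy": ['"Baker Street"', "national team", None]})

print(rows_with_pattern_vectorized(toy, "info_parenth_copy", r'".*"'))
```

Unlike the loop version, this variant accepts a column that still contains missing values, so the frame does not need to be filtered with `notna()` first.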

#### Function to Use `rows_with_pattern` Function for Multiple Regular Expression Patterns

In [27]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined


#### Checking a Sample of `info_parenth_copy` Unique Values

In [28]:
# Checking a sample of info_parenth_copy Unique Values
pd.Series(df["info_parenth_copy"].value_counts().index.tolist()).sample(100)

9589                                                    Norwich City, Tottenham Hotspur
4629                                                 Entombed, Entombed A D , Firespawn
2458                                                                     Dublin, Celtic
11638                                          Donald Trump and publicist New Generals,
4197                                                                               Elia
70                                                                   Montreal Canadiens
4106                                                   Corinthians, women national team
3649                                                                   CR Vasco da Gama
11495                                                                          Helix SF
11633                                                 Chicago White Sox, Boston Red Sox
12869                                                                  Them, Thin Lizzy
4845                            


#### Observations:
- We can see that many of the values are proper nouns of places, people, or titles (some in quotations).  
- First, we can drop the titles in quotations using regular expressions.
- Then we will take the approach of combining all of the values into a single list, converting that list to a single string, then back to a list of single word elements, which we can reduce to a set of prioritized words for which to search.
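
The word-frequency idea in the last bullet can be sketched compactly with `collections.Counter` from the standard library. The sample `values`, the `stop_words` set, and the threshold of 2 are assumptions chosen to keep the toy example small.

```python
from collections import Counter

values = ["Norwich City, Tottenham Hotspur", "national team", "women national team"]
stop_words = {"and", "of", "the"}

# Join all values, drop separator punctuation, and split into single words
words = " ".join(values).replace(",", "").replace(";", "").split()

# Count each word, ignoring obvious extraneous words
counts = Counter(word for word in words if word not in stop_words)

# Keep words that occur at least twice in this toy example
frequent = [word for word, n in counts.most_common() if n >= 2]

print(frequent)
```

The words that survive the threshold become the short list of candidates to screen, most frequent first.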

#### Checking and Dropping Titles in Quotations from `info_parenth_copy`

In [29]:
# Column to check
column = "info_parenth_copy"

# Dataframe to check
dataframe = df[df[column].notna()]

# Pattern for re (everything from the first to the last double quote in the value)
pattern = '".*"'

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(dataframe, column, pattern)

# Checking a sample of rows
df.loc[rows_to_check, :].sample(2)

There are 676 rows with matching pattern in column 'info_parenth_copy'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,info_parenth_copy,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
33423,4,Gerry Rafferty,", 63, Scottish singer-songwriter , liver failure.",https://en.wikipedia.org/wiki/Gerry_Rafferty,80,2011,January,"""Baker Street""",63.0,liver failure,Scotland,,"""Baker Street""",0,0,0,0,0,1,0,0,0,0,0,0,1
95605,25,Hardev Dilgir,", 82, Indian lyricist , heart attack.",https://en.wikipedia.org/wiki/Hardev_Dilgir,11,2022,January,"""Tere Tille Ton""",82.0,heart attack,India,,"""Tere Tille Ton""",0,0,0,0,0,1,0,0,0,0,0,0,1



In [30]:
# For loop to extract quotations and characters within from info_parenth_copy
for index in rows_to_check:
    item = df.loc[index, column]
    match = re.search(pattern, item)
    if match:
        df.loc[index, column] = re.sub(pattern, "", df.loc[index, column]).strip()

# Recheck a sample of treated rows
df.loc[rows_to_check, :].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,info_parenth_copy,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
34028,11,Hugh Martin,", 96, American songwriter , natural causes.",https://en.wikipedia.org/wiki/Hugh_Martin,18,2011,March,"""Have Yourself a Merry Little Christmas"" and film composer",96.0,natural causes,United States of America,,and film composer,0,0,0,0,0,1,0,0,0,0,0,0,1
50465,2,Acker Bilk,", 85, British jazz clarinetist .",https://en.wikipedia.org/wiki/Acker_Bilk,24,2014,November,"""Stranger on the Shore""",85.0,,United Kingdom of Great Britain and Northern Ireland,,,0,0,0,0,0,1,0,0,0,0,0,0,1



In [31]:
# Rechecking info_parenth_copy value counts
df["info_parenth_copy"].value_counts()

                                            13061
, ,                                          3293
,                                            2000
since                                        1202
national team                                 167
                                            ...  
, President of the Governing Council            1
Gloucestershire, Worcestershire,                1
North Sydney, Eastern Suburbs, New South        1
neuroendocrine, ,                               1
and minister of industry and technology         1
Name: info_parenth_copy, Length: 12723, dtype: int64


#### Observations:
- Dropping song and other titles in quotations simplified the remaining values.
- Next, we will create an abbreviated single list of all of the values, then use the values to extract additional categories.
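
One property of the pattern `".*"` used above is that it is greedy: when a value holds two quoted titles, everything from the first quotation mark to the last is removed, including any text in between. The sketch below contrasts it with the non-greedy form `".*?"` on a made-up string. Whether any value in this dataset actually contains two quoted titles is not checked here.

```python
import re

text = '"Song A" and film composer, "Song B"'

# Greedy: matches from the first quote to the last quote
greedy = re.sub(r'".*"', "", text).strip()

# Non-greedy: matches each quoted title separately
lazy = re.sub(r'".*?"', "", text).strip()

print(repr(greedy))
print(repr(lazy))
```

For values with a single quoted title, both forms give the same result.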

#### Extracting Additional `known_for` Category from `info_parenth_copy` Using `roles_list`

In [32]:
## Combining `info_parenth_copy` Values into a Single List of Unique Values for Searching

# Creating a single list of info_parenth_copy values
roles_list = df["info_parenth_copy"].value_counts().index.tolist()

# Converting to a single string and removing commas, semicolons, and extra whitespace
roles_list = (
    " ".join(roles_list).replace(",", "").replace(";", "").replace("  ", " ").strip()
)

# Splitting into a list of individual words and converting to a Series to easily check value counts
roles_list = roles_list.split()

# Converting to a series for value_counts in ascending order for use of pop() on most frequent values first
# and dropping obvious extraneous values
roles_list = (
    pd.Series(roles_list)
    .value_counts(ascending=True)
    .drop(
        [
            "and",
            "of",
            "the",
            "The",
            "since",
            "on",
            "nd",
            "th",
            "for",
            "to",
            "&",
            "de",
            "winner",
            "in",
            "at",
            "this",
        ]
    )
)

# Dropping values that occur fewer than 3 times
roles_list = roles_list[roles_list > 2]

# Converting back to list
roles_list = roles_list.index.tolist()

print(f"There are {len(roles_list)} remaining unique individual words in roles_list.\n")

There are 2497 remaining unique individual words in roles_list.




In [33]:
# # Example code to check each value in roles_list in descending order of frequency
# value = roles_list.pop()
# value


In [34]:
# # Create specific_roles_cause_list for above popped value
# # only checking entries not already in category associated with popped value
# specific_roles_cause_list = (
#     df.loc[
#         [
#             index
#             for index in df[
#                 (df["info_parenth_copy"].notna())
#                 & (df["politics_govt_law"] == 0)
#                 #                 & (df["law_enf_military_operator"] == 0)
#                 #                 & (df["spiritual"] == 0)
#                 #                 & (df["sports"] == 0)
#                 #                 & (df["academia_humanities"] == 0)
#                 #                 & (df["arts"] == 0)
#                 #                 & (df["business_farming"] == 0)
#             ].index
#             if value in df.loc[index, "info_parenth_copy"]
#         ],
#         "info_parenth_copy",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [35]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_cause_list, key=lambda x: len(x), reverse=True)


In [36]:
# # Checking individual entries as needed
# df[df[column] == "The Cardinals"]
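
The commented-out search cells above look up every `info_parenth_copy` value that contains a popped word. As a hedged alternative sketch, the same lookup can be written with a boolean mask and `Series.str.contains`. The frame `toy` and the word `value` below are hypothetical examples, not data from the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame(
    {
        "info_parenth_copy": ["and politician, MP", "Random House", None, "MP, , ,"],
        "politics_govt_law": [0, 0, 0, 1],
    }
)

value = "MP"  # stands in for a word popped from roles_list

# Rows that have a value, are not yet in the category, and contain the word
mask = (
    toy["info_parenth_copy"].notna()
    & (toy["politics_govt_law"] == 0)
    & toy["info_parenth_copy"].str.contains(value, regex=False, na=False)
)

# Unique matching strings, longest first, ready for manual screening
candidates = sorted(toy.loc[mask, "info_parenth_copy"].unique(), key=len, reverse=True)
print(candidates)
```

Passing `regex=False` treats the word as a literal substring, which matches the behavior of the `in` test used in the list comprehension.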


#### Creating Lists for Each `known_for` Category and for `cause_of_death`

In [37]:
# Creating lists for each category and sorting by decreasing length and removing duplicates

politics_govt_law = [
    "Middletown and politician, Senator and Deputy First Minister",
    "Deputy Minister of Foreign Affairs and politician,",
    ", Minister of Military Production",
    "and Minister of Education",
    "and politician, member of the Arizona House of Representatives and Senate",
    "and politician, member of the Tennessee House of Representatives",
    "and politician, member of the House of Representatives",
    "House of Representatives and politician, member of the Senate and",
    "Dewan Negara, director and politician, member of the",
    "Landtag of Bavaria, , and politician, member of the",
    "House of Councillors and singer, member of the",
    "Sejm, member of the",
    "John Paul II Catholic University of Lublin and member of the Senate",
    "Michigan Wolverines, Eastern Michigan Eagles and politician, member of the Michigan House of Representatives",
    "Detroit Tigers, Philadelphia Phillies and politician, member of the House of Representatives and Senator",
    "Montreal Alouettes, Duke Blue Devils, Baltimore Colts and politician, member of the Maryland Senate",
    "Winnipeg Blue Bombers and politician, member of the Washington House of Representatives and Senate",
    "San Francisco ers and politician, member of the State Assembly and County Board of Supervisors",
    "Tennessee Volunteers and politician, member of the House of Representatives and Senate",
    "Detroit Lions, businessman and politician, member of the House of Representatives",
    "St Louis Cardinals, Dolphins and politician, member of the Nebraska Legislature",
    "Giants and politician, member of the Minnesota House of Representatives",
    "WBRE TV and politician, member of the House of Representatives since",
    "Augusta National Golf Club, member of the House of Representatives",
    "Oglethorpe and politician, member of the House of Representatives",
    "Oakland Raiders and politician, member of the Minnesota Senate",
    "Gators and politician, member of the House of Representatives",
    "Cleveland Browns and politician, member of the Ohio Senate",
    "and member of the House of Lords since",
    "Random House and human rights activist Helsinki Watch",
    "and member of the House of Lords since",
    "Royal Opera House, and conductor",
    "Parliament House, Canberra",
    "Sydney Opera House",
    "Cairo Opera House",
    "Ballymaloe House",
    "Crowded House",
    "Random House",
    "Feral House",
    "Chaos theory, Chief Scientific Advisor to the Government",
    "Journalists' Union of the Athens Daily Newspapers and politician, MEP and Vice President",
    "National Rifle Association of America, Oscar, , President of the , winner",
    "Pulitzer Prize for Presidential Medal of Freedom, Order of the South,",
    "caretaker minister of information & broadcasting and politician,",
    "minister of culture, , and politician,",
    "minister of culture, ,",
    "Toronto Maple Leafs, commentator MP and politician, Stanley Cup",
    "MP and politician, Liberal People Party , and leader of the",
    "Detroit Red Wings, Toronto Maple Leafs and politician, MP",
    "Venizelos SA and politician, MP , and Deputy Speaker",
    "Interacting boson model and politician, MP",
    "MP, human rights activist and politician,",
    "WF and politician, MP for Stirling",
    "MP, civic activist and politician,",
    ", , economist, and politician, MP",
    "LRT televizija and politician, MP",
    "MP, , , composer and politician,",
    "Partex Group and politician, MP",
    "MP, MEP, and politician, , , ,",
    "MLA, MP, and politician, and",
    ", coach and politician, MP",
    "MP, and politician, since",
    "Rustavi Ensemble, MP",
    "MP, and politician,",
    "MP and politician,",
    "and politician, MP",
    "MP, , ,",
    ", MP, ,",
    "Chicago Cardinals St Louis Cardinals and politician, mayor of Starkville,",
    "Cleveland Browns, Baltimore Ravens and politician, mayor of Tangipahoa,",
    "Brainerd International Raceway and politician, mayor of Flint, Michigan",
    "Lézignan, national team and politician, mayor of Lézignan Corbières",
    "PGA Tour, Tour and politician, mayor of Villa Allende since",
    "and politician, mayor of Ouray, Colorado",
]
politics_govt_law = sorted(
    list(set(politics_govt_law)), key=lambda x: len(x), reverse=True
)

arts = [
    "Minister Frederick Gray in the James Bond films",
    "WCGA and TV WTVC broadcaster",
    "The Wrecking Crew and original member of Herb Alpert Tijuana Brass",
    "Agence Presse, member",
    "Rams, member of Pro Football Hall of Fame, and actor ,",
    "Sydney Opera House",
    "News Corporation, President of the Academy of Television Arts & Sciences since",
    "Royal Court Theatre",
]
arts = sorted(list(set(arts)), key=lambda x: len(x), reverse=True)

sports = [
    "and sports team owner Islanders",
    "and rugby union player national team, Wellington",
    "and owner of the New Patriots football team",
    "and sports team owner Dolphins, Panthers",
    "and sports team owner Orlando Magic",
    "and Baseball team owner Mets",
    "and college football coach Columbia University",
    "ThyssenKrupp, member of IOC",
    "Plugged Nickle, member of National Museum of Racing and Hall of Fame",
    "FC Köln, member of World Cup winning team",
    "since and member of FIFA Council since",
    "TVS, CBS Sports, Sportsvision and baseball Chicago White Sox",
    "and baseball player Memphis Red Sox",
    "baseball beat and San Francisco Giants writer J G Taylor Spink Award, recipient of the",
    "Yankees, Giants",
    ", Senator for Minnesota , Olympic silver medalist in ice hockey",
    ", and Olympic eventing chef d'équipe ,",
    ", President of Olympic Committee",
    ", Olympic speed skater",
    ", and Olympic swimmer ,",
    "Olympic medallist,",
    "Little Caesars, Detroit Red Wings, Detroit Tigers",
]
sports = sorted(list(set(sports)), key=lambda x: len(x), reverse=True)

sciences = [
    "operated on President John Fitzgerald Kennedy and Lee Harvey Oswald",
]
sciences = sorted(list(set(sciences)), key=lambda x: len(x), reverse=True)

business_farming = [
    "Uni President Enterprises Corporation death announced on this date",
    "and Air , President of Delta Air Lines",
]
business_farming = sorted(
    list(set(business_farming)), key=lambda x: len(x), reverse=True
)

academia_humanities = [
    "Northern Arizona University",
    "and President of Royal Society",
    "University of Chicago, President of Physical Society",
    "Graham number, President of the Mathematical Society",
    "HIV, President of the International  Society",
    "National Gypsum and academic administrator, President of the UNC",
    "Clemson Tigers and academic administrator, President of Clemson University",
]
academia_humanities = sorted(
    list(set(academia_humanities)), key=lambda x: len(x), reverse=True
)

law_enf_military_operator = [
    "People Army, Minister of Defence",
    ", first Defence Minister of ia",
    "and Minister of Defence of",
    "and Minister of Defense",
    "Flying Tigers",
]
law_enf_military_operator = sorted(
    list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True
)

spiritual = [
    "and dean of the College of Cardinals",
]
spiritual = sorted(list(set(spiritual)), key=lambda x: len(x), reverse=True)

social = []
social = sorted(list(set(social)), key=lambda x: len(x), reverse=True)

crime = [
    "'Manson Family' member",
]
crime = sorted(list(set(crime)), key=lambda x: len(x), reverse=True)

event_record_other = []
event_record_other = sorted(
    list(set(event_record_other)), key=lambda x: len(x), reverse=True
)

other_species = []
other_species = sorted(list(set(other_species)), key=lambda x: len(x), reverse=True)

cause_of_death = []
cause_of_death = sorted(list(set(cause_of_death)), key=lambda x: len(x), reverse=True)
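
Each list above is sorted by decreasing length, which matters because the later extraction removes matched substrings from each entry. A minimal sketch of the effect, using a hypothetical entry `item`, hypothetical overlapping `roles`, and an illustrative helper `strip_roles` that is not part of the notebook:

```python
# Hypothetical entry and overlapping role strings
item = "MP and politician, since"
roles = ["politician", "MP and politician,"]


def strip_roles(text, role_list):
    """Remove each role found in text, in list order, and return the remainder."""
    for role in role_list:
        if role in text:
            text = text.replace(role, "").strip()
    return text


# Shortest first: "politician" is removed early, so the longer role no longer matches
shortest_first = strip_roles(item, sorted(set(roles), key=len))

# Longest first: the full role string is removed in one step
longest_first = strip_roles(item, sorted(set(roles), key=len, reverse=True))

print(repr(shortest_first))
print(repr(longest_first))
```

Searching longest strings first leaves fewer fragments behind in `info_parenth_copy` for later iterations to clean up.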


#### Creating `known_for_dict` Dictionary of Category Keys and Specific Role Lists of Values

In [38]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting `known_for` Categories and `cause_of_death` Values from `info_parenth_copy`

In [39]:
%%time

# Column to check
column = 'info_parenth_copy'

# Start dataframe
dataframe = df[df[column].notna()]

# For loop to find cause in column and extract it to cause_of_death
for cause in cause_of_death:
    for index in dataframe.index:
        item = df.loc[index, column]
        if item and cause in item:
            existing = df.loc[index, 'cause_of_death']
            # Appending to an existing cause, treating None, NaN, or empty as no existing cause
            if pd.notna(existing) and existing:
                df.loc[index, 'cause_of_death'] = existing + '/' + cause
            else:
                df.loc[index, 'cause_of_death'] = cause
            df.loc[index, column] = item.replace(cause, '').strip()


# For loop to find role in column and extract it as category
for category, category_lst in known_for_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Calculating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking number of cause_of_death values
print(f'There are {df["cause_of_death"].notna().sum()} values in cause_of_death column.\n')

There are 33460 values in cause_of_death column.

CPU times: total: 1min 15s
Wall time: 1min 16s
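
The row-by-row loops above take over a minute per iteration. As a sketch of one possible faster alternative, the role extraction could use pandas string methods on whole columns. The frame `toy` and dictionary `toy_dict` below are hypothetical stand-ins for `df` and `known_for_dict`, and this approach has not been timed against the full dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df and known_for_dict
toy = pd.DataFrame(
    {
        "info_parenth_copy": [
            "Flying Tigers",
            "and actor ,",
            None,
            "Sydney Opera House, 1973",
        ],
        "arts": [0, 0, 0, 0],
        "law_enf_military_operator": [0, 0, 0, 0],
    }
)
toy_dict = {
    "arts": ["Sydney Opera House", "and actor ,"],
    "law_enf_military_operator": ["Flying Tigers"],
}
column = "info_parenth_copy"

for category, roles in toy_dict.items():
    for role in roles:
        # Literal (non-regex) substring match; missing values count as no match
        mask = toy[column].str.contains(role, regex=False, na=False)
        toy.loc[mask, category] = 1
        # Remove the matched role from the matching rows only
        toy.loc[mask, column] = (
            toy.loc[mask, column].str.replace(role, "", regex=False).str.strip()
        )

# Recalculating the number of categories per entry
toy["num_categories"] = toy[list(toy_dict)].sum(axis=1)
print(toy)
```

The loop over roles remains, but the inner loop over rows is replaced by one masked assignment per role.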



#### Checking `num_categories` Value Counts

In [40]:
# Checking num_categories Value Counts
df["num_categories"].value_counts()

1    84032
2    12879
3     1090
4       36
5        3
Name: num_categories, dtype: int64


#### Observations:
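- Each of the 98040 entries has at least one `known_for` category, and 84032 entries have exactly one.
- We will proceed to rebuild `known_for_dict` and `cause_of_death` for the next iteration.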

#### Extracting Additional `known_for` Category from `info_parenth_copy` Using `roles_list`

In [41]:
## Combining `info_parenth_copy` Values into a Single List of Unique Values for Searching

# Creating a single list of info_parenth_copy values
roles_list = df["info_parenth_copy"].value_counts().index.tolist()

# Converting to a single string and removing commas, semicolons, and extra whitespace
roles_list = (
    " ".join(roles_list).replace(",", "").replace(";", "").replace("  ", " ").strip()
)

# Splitting into a list of individual words
roles_list = roles_list.split()

# Converting to a Series and counting values in ascending order, so that pop() returns the most frequent value first,
# and dropping obvious extraneous values
roles_list = (
    pd.Series(roles_list)
    .value_counts(ascending=True)
    .drop(
        [
            "and",
            "of",
            "the",
            "The",
            "since",
            "on",
            "nd",
            "th",
            "for",
            "to",
            "&",
            "de",
            "winner",
            "in",
            "at",
            "this",
            "that",
        ]
    )
)

# Dropping values that occur fewer than 3 times
roles_list = roles_list[roles_list > 2]

# Converting back to list
roles_list = roles_list.index.tolist()

print(f"There are {len(roles_list)} remaining unique individual words in roles_list.\n")

There are 2485 remaining unique individual words in roles_list.




In [42]:
# # Slicing roles_list for search
# roles_list_sliced = roles_list[2000:]


In [43]:
# # Example code to check each value in slice of roles_list in descending order of frequency
# value = roles_list_sliced.pop()
# value


In [44]:
# # Create specific_roles_cause_list for above popped value
# # only checking entries not already in category associated with popped value
# specific_roles_cause_list = (
#     df.loc[
#         [
#             index
#             for index in df[
#                 (df["info_parenth_copy"].notna())
#                 #                 & (df["politics_govt_law"] == 0)
#                 #                 & (df["law_enf_military_operator"] == 0)
#                 #                 & (df["spiritual"] == 0)
#                 #                 & (df["sports"] == 0)
#                 #                 & (df["academia_humanities"] == 0)
#                 & (df["arts"] == 0)
#                 #                 & (df["business_farming"] == 0)
#                 #                 & (df['crime']==0)
#                 #                 & (df["sciences"] == 0)
#                 & (df["other_species"] == 0)
#             ].index
#             if value in df.loc[index, "info_parenth_copy"]
#         ],
#         "info_parenth_copy",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [45]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_cause_list, key=lambda x: len(x), reverse=True)


In [46]:
# # Checking individual entries as needed
# df[df[column] == "and Permanent Secretary"]


#### Creating Lists for Each `known_for` Category and for `cause_of_death`

In [47]:
# Creating lists for each category and sorting by decreasing length and removing duplicates

politics_govt_law = [
    "CFRB, Senator and TV panelist",
    "FARC and politician, deputy",
    ", Chairwoman of PLA",
    "Winston Churchill",
    ", prefect of the CICLSAL and apostolic nuncio to",
    "Richard Nixon, Gerald Ford",
    "Jyllands Posten Muhammad cartoons controversy",
    "Koch Industries and political financier Americans for Prosperity",
    "Holocaust, political activist",
    "national team and politician, National Councillor",
    "Race Relations Conciliator",
    "Quezon City Pride Council, and LGBT activist",
    "since and Apostolic Nuncio to since",
    "since , Apostolic Nuncio to",
    "since , Apostolic Nuncio",
    "and Apostolic Nuncio to",
    "Church of the Universe and political candidate Marijuana Party",
    "Education Commission of the States",
    "Assistant Secretary of State and government official,",
    "Merrill Lynch, Assistant Secretary of the Treasury",
    "Assistant Attorney General,",
    ", Prefect of the Congregation for the Clergy and President of the Pontifical Commission",
    ", President of the Pontifical Commission for City State",
    "and president of the Pastoral Care of Health Care Workers",
    ", prefect of the CICLSAL and apostolic nuncio to",
    "since and Apostolic nuncio",
    "MBM Arquitectes and urban planner, Barcelona city councilor and president of Fundació Joan Miró",
    "national team, coach Magic, and politician, Hamilton city councillor , since",
    "and Knoxville, Tennessee city councilman",
    "Radio Free Radio Liberty",
    ", chief of the general staff , and governor of",
    "and chairman joint chiefs of staff committee",
    "civil rights activist, National Medal of Arts and , laureate",
    "Goldman Sachs and civil servant, Deputy Secretary of State",
    "national team, coach Magic, and politician, Hamilton city councillor , since",
    "national team and politician, Ballymena Borough councillor",
    "and political activist Black Panthers",
    "Moro Islamic Liberation Front and politician, Speaker of the Bangsamoro Parliament since",
    "Ambassador",
    "University of Arkansas at Little Rock and child development campaigner NAEYC",
    "Pulitzer Prize and political scientist, recipient of the",
    ", political commentator and actor",
    "Swimming and defense lawyer Gary Ridgway",
    "Apollo , assistant secretary of state for public affairs",
    ", Vice President , Director of Central Intelligence",
    "External Intelligence of the Defence Force",
    "The Tragically Hip and activist Lake Waterkeeper, residential school reconciliation",
    "Que Sera, Sera, Golden Globe, singer  and animal welfare activist, winner , , ,",
    "Nobel Prize laureate in Literature, , and anti apartheid activist,",
    "civil rights activist, National Medal of Arts and , laureate",
    "phone hacking scandal, whistleblower of the ,  body found on this date",
    "Lehman Brothers, co founder of The Blackstone Group, Secretary of Commerce",
    "Legislative Yuan, , director and politician, member of the",
    "and chairman of the Atomic Energy Commission",
    "and chairman of the UGC",
    "Randwick, New South Wales, national team, Lord Mayor of Sydney",
    "and politician, chair of the Federal Reserve Bank of Kansas City",
    "Anarchist Federation",
    "Edmonton Eskimos and politician, Premier of Alberta",
    ", political commentator and actor",
    ", Qualifications Authority CEO",
    "Matson, Inc , conservationist Ghirardelli Square and diplomat, Trade Representative",
    "Central News Agency and diplomat, Representative to",
    "Ambassador and diplomat, to Czechoslovakia ;",
    "Cox Enterprises and diplomat, Ambassador to",
    "Ambassador to Mexico, , and diplomat,",
    "Ambassador to and diplomat,",
    "civil rights activist",
    "Lok Sabha, Gurdaspur, and politician, member of the for ; since",
    "Idaho House of Representatives and politician, member of the",
    "Legislative Yuan, , director and politician, member of the",
    "House of Representatives and politician, member of the",
    "Del Rosario University and politician, Commandant of the Army , Governor of Magdalena",
    ", Governor of Maharashtra and Rajasthan",
    ", Governor of Nagaland and Manipur",
    "and Lieutenant Governor of",
    "and Governor of West",
    "Calgary Stampeders, Edmonton Eskimos and politician, Lieutenant Governor of Alberta",
    "and Governor of Bank of Scotland,",
    ", Governor General",
    "Idaho House of Representatives and politician, member of the",
    "House of Representatives and politician, member of the",
    "Randwick, New South Wales, national team, Lord Mayor of Sydney",
    "Napster and politician, Mayor of San Carlos,",
    "Henderson, Nevada, Mayor of",
    ", Mayor of the Gold Coast",
    "The Invincibles, rules footballer and politician, Victorian MLA for Prahran",
    "Saurashtra and politician, MLA , ,",
    "Andhra Pradesh, MLA and politician,",
    "Lehman Brothers, co founder of The Blackstone Group, Secretary of Commerce",
    "Goldman Sachs and civil servant, Deputy Secretary of State",
    "Assistant Secretary of State and government official,",
]
politics_govt_law = sorted(
    list(set(politics_govt_law)), key=lambda x: len(x), reverse=True
)

arts = [
    "Conseil supérieur de l'audiovisuel, Centre national du cinéma et de l'image animée",
    "South Melbourne and radio broadcaster ABC Local Radio",
    "and fight choreographer , ,",
    "Canterbury region and biographer Richard Pearse, Denis Glover",
    "Green Bay Packers, Denver Broncos and stuntman",
    ", stuntman and actor",
    "WWF and stuntman ,",
    ", , and stuntman",
    "NJPW, WCW and reality show contestant , Olympic bronze medalist",
    "WWE, model and reality show contestant",
    "Tate LaBianca murders case and author ,",
    "Milwaukee Brewers, Yankees and announcer Cleveland Indians",
    "Pittsburgh Pirates and announcer Mets",
    "WCCO TV, Fox Sports North and sports commentator Minnesota Timberwolves",
    "WMAQ, WSNS TV and sports talk personality Sporting News Radio",
    "WHDH, The Weather Channel",
    "The Weather Channel",
    "media Prime Television",
    "Shaw Communications and mass media executive Corus Entertainment",
    "Tupac Amaru Shakur Center for the Arts, Amaru Entertainment, Makaveli Branded",
    "Playboy",
    "and lyricist national anthem",
    "child actor",
    "Cardinals, Cubs, Pirates and Hall of Fame sportscaster MLB GOTW, World Series champion",
    "Philadelphia Eagles, coach and sportscaster CBS Sports, WCAU",
    "Oakland Raiders and sportscaster Super Bowl, , champion",
    "Toronto Argonauts, Dallas Cowboys and sportscaster TSN",
    "Cleveland Browns, Dallas Cowboys and sportscaster CBS",
    "Philadelphia Eagles, Rams and sportscaster",
    "Film Society of Lincoln Center",
    "CNN, MSNBC",
    "Kingston Kings, national team and presenter Sky Sports",
    "Münnich Motorsport and television presenter",
    ", and television presenter",
    ", journalist and editor AGERPRES,",
    "and journalist",
    "national team and sports commentator ABC Sports, ESPN",
    "Dodgers, Chicago Cubs, Yankees and commentator, World Series champion ,",
    "Kaizer Chiefs, national team and commentator SuperSport",
    "national team and sports commentator ABC Sports, ESPN",
    "Dallas Cowboys and commentator",
    "WWA, WWWF and commentator",
    "chairman of ITV plc",
    "since , Ambassador to and child actor",
    ", actor ,",
    "World War II and actor Oscar, winner",
    "Bible Black, Geezer Butler Band, actor",
    "Queen Mary University of and child actor kg schoolboy champion, BUCS champion ,",
    "Chicago Blackhawks, Toronto Maple Leafs and actor Stanley Cup, champion",
    "Philadelphia San Francisco Warriors, Lakers and actor",
    "Cleveland Browns, Philadelphia Eagles and actor",
    "San Francisco ers, Baltimore Colts and actor",
    "Pittsburgh Steelers and actor ,",
    "national team and child actor ,",
    "Bellator, UFC, boxer and actor",
    "Yankees, writer , and actor",
    "WWE, NWA, WCW and actor , ,",
    ", stunt double and actor ,",
    "Detroit Lions and actor ,",
    "Oakland Raiders and actor",
    "Baltimore Colts and actor",
    "Dallas Cowboys and actor",
    "Stampede and actor ,",
    ", stuntman and actor",
    "IWE, NJPW and actor",
    "CMLL, WCW and actor",
    "WCCW, WWF and actor",
    "WCCW and actor ,",
    "Rams and actor ,",
    "Mets and actor ,",
    "NWA and actor ,",
    "WWE and actor ,",
    "WWF and actor",
    "and actor , ,",
    "WBO and actor",
    ", and actor",
    "and actor ,",
    ", actor",
]
arts = sorted(list(set(arts)), key=lambda x: len(x), reverse=True)

sports = [
    "Philadelphia Eagles, and football player Baltimore Colts,",
    "Hendrick Motorsports",
    "Back Bay Restaurant Group and dog racetrack owner Wonderland Greyhound Park",
    "General Electric, owner of Albany River Rats",
    ", co founder of the AFL and owner of the Chargers",
    "Cyclone Tracy and AFL player St Kilda",
    "cricketer Otago",
    "First Allied Corporation and sports franchise owner Manchester United, Tampa Bay Buccaneers",
    "A Bank , broadcast executive TV and sport administrator Cricket",
    "and sports executive",
    "Oakland Athletics, Golden State Warriors, Golden Bears football",
    "Chicago Bears, and football player",
    "football Sydney FC",
    "Mawarid Holding and horse stable owner Juddmonte",
    "Donald Trump and publicist New Generals,",
    "deputy, Olympique de Marseille and president of",
    "Baseball Hall of Fame, Yankees, Pittsburgh Pirates",
    "WCCO TV, Fox Sports North and sports commentator Minnesota Timberwolves",
    "WMAQ, WSNS TV and sports talk personality Sporting News Radio",
    "Fox Sports Media Group, CBC Sports",
    "NBC, NBC Sports, ABC Sports",
    "The Sports and composer ,",
    "CBS Sports",
    "Automobiles Gonfaronnaises Sportives",
    "Cerner, Sporting Kansas City",
    "Portland Trail Blazers and journalist",
    "NFL Films, ,",
    "NFL Films",
    "WCCO TV, Fox Sports North and sports commentator Minnesota Timberwolves",
    "WMAQ, WSNS TV and sports talk personality Sporting News Radio",
    "First Allied Corporation and sports franchise owner Manchester United, Tampa Bay Buccaneers",
    "ESPN, and reporter",
    "ESPN,",
    "since , president of the National Football Federation and minister of the interior",
    "WCCO TV, Fox Sports North and sports commentator Minnesota Timberwolves",
    "Victoria Day and footballer Djurgårdens IF",
    "Essendon and footballer Hawthorn",
    "First Allied Corporation and sports franchise owner Manchester United, Tampa Bay Buccaneers",
    "PizzaExpress and football club owner Peterborough United",
    "United Dairy Farmers, Cincinnati Reds",
    "Cotton Factory Club, national team",
    "Partizan and coach OKK Beograd, Cantù, Secretary General of FIBA",
]
sports = sorted(list(set(sports)), key=lambda x: len(x), reverse=True)

sciences = [
    "Atari, , , Director of Engineering for",
    "Bud Moore Engineering",
    "co discovered the blood test for the Rh blood factor",
]
sciences = sorted(list(set(sciences)), key=lambda x: len(x), reverse=True)

business_farming = [
    "Playboy Enterprises, businessman  and reality television personality",
    "Jimmy Dean Foods, actor and businessman",
    "Petro Canada, businessman Abu Sayyaf",
    "CA Technologies, ",
    "Nintendo, , president and CEO of since",
]
business_farming = sorted(
    list(set(business_farming)), key=lambda x: len(x), reverse=True
)

academia_humanities = [
    "Conseil supérieur de l'audiovisuel, Télévisions, International Francophone Press Union",
    "and Chancellor of University of Southampton since",
    "Canterbury, national team, sports administrator, and educator Auckland Grammar School",
    "Saskatchewan Wheat Pool and educator, chancellor of the University of Saskatchewan",
    "Eddie Harris, Elvin Jones and educator Conservatory of Music",
    "High Sheriff of Nottinghamshire and",
    "Warsaw Ghetto Uprising survivor",
    "University of Glasgow, Regius Professor of Zoology",
    "Māori and coach Counties,",
    "and Rector of Pontifical Academy of Theology",
    "University of Otago and",
    "Eddie Harris, Elvin Jones and educator Conservatory of Music",
    "Division of Musical Instruments at the National Museum of History and Technology",
    "Chancellor of Queen University Belfast since",
    ", Chairwoman of PLA and Chancellor of University of Southampton since",
    "Chancellor of McGill University and",
    "National Council of Jewish Women and philanthropist Cooper Hewitt, Smithsonian Design Museum",
    "Triangle Publications, TelVue and philanthropist Columbia University",
    "Washington University School of Medicine",
    "Johns Hopkins School of Medicine",
    "NSW Supreme Court, chancellor of the University of Sydney",
    ", academic and chancellor of Carleton University",
    "and chancellor of the University of since",
    "Connecticut Bank and Trust Company and academic administrator, president of Trinity College Connecticut",
    "Deseret Management Corporation and academic administrator Weber State University",
    "General Electric and academic administrator Rensselaer Polytechnic Institute",
    "Chengdu J and academic Academy of Engineering",
    "Hongdu JL and academic Academy of Engineering",
    ", academic and chancellor of Carleton University",
    "Ball State University, and academic",
    "Connecticut House, , judge Connecticut Superior Court, ; professor Quinnipiac",
    "All Girls Professional Baseball League and professor Suffolk University",
    "University of North Carolina Wilmington and professor",
    "CBS News and professor University of Southern",
    "Seoul Institute of the Arts and professor",
    "Medtronic and museum founder Bakken Museum",
]
academia_humanities = sorted(
    list(set(academia_humanities)), key=lambda x: len(x), reverse=True
)

law_enf_military_operator = [
    ", Acting Director of FBI and Administrator of EPA ,",
    ", Secretary of State for War and Minister for the Armed Forces",
    "ISIL in",
    ", Sheriff of Middlesex County",
    "Colorado A&M Aggies and World War II RAF officer Tuskegee Airmen",
    ", and aviator Tuskegee Airmen",
    "Tuskegee Airmen",
    "NRK and military officer Commander of Operation Gunnerside",
    "Police Department,  and police officer",
    "Assistant Secretary of Defense, Ambassador to",
    "minister of defence",
    "Al Qaeda",
    "Great Eastern Islamic Raiders' Front",
    "Islamic Jihad of",
    "Islamic State",
    "Assistant Secretary of Defense, Ambassador to",
    ", Minister of Defense",
    ", Defense Minister",
    "Defense Forces,",
    "Colorado A&M Aggies and World War II RAF officer Tuskegee Airmen",
    "FK Liepājas Metalurgs, RAF Jelgava, national team",
    "NRK and military officer Commander of Operation Gunnerside",
    ", minister of defense , and provisional president of the Senate",
    ", since , minister of national defense and interior",
    ", foreign affairs and defense since",
    "since and minister of defense",
    ", secretary of the Navy",
    ", , Navy Minister",
    "and Bishop to the Forces",
    "Defense Forces,",
    ", Secretary of State for War and Minister for the Armed Forces",
    "and Director General of the National Police",
    "Police Department,  and police officer",
    "and Halland , Police Commissioner",
    "and Minister of Police Affairs",
    "Police Sports Club",
    ", minister of defence and transport and public works since",
    ", minister of the interior , and national defence",
    ", minister of national defence and transport",
    ", minister of defence and foreign affairs ,",
    ", minister of foreign affairs and defence",
    ", minister of defence and the interior",
    "since and minister of defence since",
    ", deputy , and minister of defence",
    "Red Brigades",
    "Team B and professor University",
    "United Freedom Front",
]
law_enf_military_operator = sorted(
    list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True
)

spiritual = [", auxiliary bishop of Port Moresby since"]
spiritual = sorted(list(set(spiritual)), key=lambda x: len(x), reverse=True)

social = [
    "National Urban League, UNCF",
    "Bear Stearns, St Vincent Catholic Medical Center and philanthropist Al Smith Dinner",
    "Tesla, Wikimedia Foundation",
    "Mutual of and humanitarian Concern Worldwide,",
    "Chase Manhattan, globalist Trilateral Commission and philanthropist Rockefeller Brothers Fund",
    "James B Nutter & Company, philanthropist Children Mercy Hospital and power broker Missouri",
    "Bear Stearns, St Vincent Catholic Medical Center and philanthropist Al Smith Dinner",
    "Berkshire Hathaway and philanthropist Maine Medical Center",
    "Barrick Gold and philanthropist Toronto General Hospital",
    "philanthropist Smile Train",
    "Mencap and activist",
]
social = sorted(list(set(social)), key=lambda x: len(x), reverse=True)

crime = [
    "Richard Nixon, convicted for conspiracy to obstruct justice and wiretapping Watergate scandal",
    "CIA, Brigade and convicted criminal Watergate burglary",
    "Seattle Seahawks, Oakland Raiders, Jets and mass murderer Rock Hill shooting",
    "of Vincennes scandal",
    "Angry Brigade",
]
crime = sorted(list(set(crime)), key=lambda x: len(x), reverse=True)

event_record_other = [
    "coup d'état",
    "Hezbollah, FBI most wanted terrorist death announced on this date",
    "and Nobel survivor, Laureate",
    "Oscar, , and Holocaust survivor, winner ,",
]
event_record_other = sorted(
    list(set(event_record_other)), key=lambda x: len(x), reverse=True
)

other_species = []
other_species = sorted(list(set(other_species)), key=lambda x: len(x), reverse=True)

cause_of_death = []
cause_of_death = sorted(list(set(cause_of_death)), key=lambda x: len(x), reverse=True)


In [48]:
# Dropping entry with link that points to another individual's page
df.drop(
    df[df["link"] == "https://en.wikipedia.org/wiki/Larry_Fisher_(murderer)"].index,
    inplace=True,
)
df.reset_index(inplace=True, drop=True)

# Hard-coding cause_of_death for entry with partial value in info_parenth_copy
df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Moustapha_Akkad"].index,
    "cause_of_death",
] = "injuries sustained in Jordanian bombings"

# Hard-coding cause_of_death for entry with value converted from apparent suicide to homicide
df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Alberto_Nisman"].index,
    "cause_of_death",
] = "homicide"


#### Creating known_for_dict Dictionary of Category Keys and Specific Role Lists of Values

In [49]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting `known_for` Categories and `cause_of_death` Values from `info_parenth_copy`

In [50]:
%%time

# Column to check
column = 'info_parenth_copy'

# Start dataframe
dataframe = df[df[column].notna()]

# For loop to find cause in column and extract it to cause_of_death
for cause in cause_of_death:
    for index in dataframe.index:
        item = df.loc[index, column]
        if item and cause in item:
            # Append to an existing cause with "/" or set the cause outright
            existing = df.loc[index, 'cause_of_death']
            df.loc[index, 'cause_of_death'] = f'{existing}/{cause}' if existing else cause
            df.loc[index, column] = item.replace(cause, '').strip()
                
                
# For loop to find role in column and extract it as category
for category, category_lst in known_for_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Calculating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking number of cause_of_death values
print(f'There are {df["cause_of_death"].notna().sum()} values in cause_of_death column.\n')

There are 33460 values in cause_of_death column.

CPU times: total: 3min 24s
Wall time: 3min 27s



#### Checking num_categories Value Counts

In [51]:
# Checking num_categories Value Counts
df["num_categories"].value_counts()

1    83780
2    13109
3     1109
4       38
5        3
Name: num_categories, dtype: int64


#### Observations:
- We will proceed to rebuild known_for_dict and cause_of_death for the next iteration.
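
The nested loops above visit every non-null row once per role, which is why the cell reports a wall time of about three and a half minutes. As a minimal sketch only (not the notebook's method), the same flag-and-strip step could be written with pandas string methods. The helper name `extract_roles` and the toy rows are hypothetical; the role phrases are taken from the lists above.

```python
import pandas as pd


def extract_roles(frame, column, roles_by_category):
    """Flag each category whose role text appears in `column`, then strip that text."""
    frame = frame.copy()
    for category, roles in roles_by_category.items():
        # Longest phrases first, mirroring the sorted lists in the notebook
        for role in sorted(set(roles), key=len, reverse=True):
            # Literal (non-regex) substring match; missing values never match
            mask = frame[column].str.contains(role, regex=False, na=False)
            if not mask.any():
                continue
            frame.loc[mask, category] = 1
            frame.loc[mask, column] = (
                frame.loc[mask, column].str.replace(role, "", regex=False).str.strip()
            )
    return frame


# Toy data (hypothetical rows, not taken from the dataset)
toy = pd.DataFrame(
    {
        "info_parenth_copy": ["Tuskegee Airmen", None, "Red Brigades and writer"],
        "law_enf_military_operator": [0, 0, 0],
        "arts": [0, 0, 0],
    }
)
toy_roles = {
    "law_enf_military_operator": ["Tuskegee Airmen", "Red Brigades"],
    "arts": ["and writer"],
}
result = extract_roles(toy, "info_parenth_copy", toy_roles)
print(result)
```

Each role is still processed in turn, but the per-row work is delegated to `str.contains` and `str.replace` with `regex=False`, so punctuation in the phrases is matched literally.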

#### Extracting Additional `known_for` Category from `info_parenth_copy` Using `roles_list`

In [52]:
# # Slicing roles_list for search
# roles_list_sliced = roles_list[2000:]


In [53]:
# # Example code to check each value in slice of roles_list in descending order of frequency
# value = roles_list_sliced.pop()
# value


In [54]:
# # Create specific_roles_cause_list for above popped value
# # only checking entries not already in category associated with popped value
# specific_roles_cause_list = (
#     df.loc[
#         [
#             index
#             for index in df[
#                 (df["info_parenth_copy"].notna())
#                 #                 & (df["politics_govt_law"] == 0)
#                 #                 & (df["law_enf_military_operator"] == 0)
#                 #                 & (df["spiritual"] == 0)
#                 #                 & (df["sports"] == 0)
#                 #                 & (df["academia_humanities"] == 0)
#                 #                 & (df["arts"] == 0)
#                 #                 & (df["business_farming"] == 0)
#                 #                 & (df['crime']==0)
#                 #                 & (df["sciences"] == 0)
#                 #                 & (df["social"] == 0)
#                 & (df["other_species"] == 0)
#             ].index
#             if value in df.loc[index, "info_parenth_copy"]
#         ],
#         "info_parenth_copy",
#     ]
#     .value_counts()
#     .index.tolist()
# )


In [55]:
# # Viewing list sorted by descending length to copy to dictionary below and screen values
# sorted(specific_roles_cause_list, key=lambda x: len(x), reverse=True)


In [56]:
# # Checking individual entries as needed
# df[df[column] == "City Film Commissioner and executive,"]
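
The commented-out screening cells above pop a value from `roles_list`, collect the distinct `info_parenth_copy` strings that contain it, and list them longest first for manual review. A rough sketch of that step as one helper follows. The function name `candidate_values`, its parameters, and the toy rows are assumptions for illustration.

```python
import pandas as pd


def candidate_values(frame, value, column="info_parenth_copy", exclude_category=None):
    """Return distinct strings in `column` containing `value`, longest first."""
    # Literal substring match; rows with missing text are skipped
    mask = frame[column].str.contains(value, regex=False, na=False)
    if exclude_category is not None:
        # Only screen entries not already flagged for this category
        mask &= (frame[exclude_category] == 0)
    values = frame.loc[mask, column].value_counts().index.tolist()
    return sorted(values, key=len, reverse=True)


# Toy data (hypothetical rows)
toy = pd.DataFrame(
    {
        "info_parenth_copy": [
            "City Film Commissioner and executive,",
            "and executive,",
            None,
            "and executive,",
        ],
        "arts": [0, 1, 0, 0],
    }
)
candidates = candidate_values(toy, "executive", exclude_category="arts")
print(candidates)
```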


#### Creating Lists for Each `known_for` Category and for `cause_of_death`

In [57]:
# Creating lists for each category and sorting by decreasing length and removing duplicates

politics_govt_law = [
    "Dassault Group and politician, Senator",
    "Mercury Atlas and politician, Senator from Ohio",
    "Senator and politician, MEP and",
    "Senator, , and politician,",
    "Senator, , and",
    "Senator and ,",
    "ZB, Radio Dunedin and politician Dunedin City Council",
    "Arts Council and executive, chair of",
    "Detroit City Council and politician",
    "Council for Secular Humanism",
    "deputy, , television presenter and politician,",
    "deputy, , and politician,",
    "Chernobyl disaster, Hero of the Union, deputy since",
    "ZB, Radio Dunedin and politician Dunedin City Council",
    "deputy, , television presenter and politician,",
    "Detroit City Council and politician",
    "Senator and politician, MEP and",
    "Senator, , and politician,",
    "deputy, , and politician,",
    "MEP, and politician,",
    "Dassault Group and politician, Senator",
    "Martini & Rossi and politician, Deputy",
    "Mercury Atlas and politician, Senator from Ohio",
    "Juventus, national team and politician, MEP",
    "Montreal Alouettes and politician, MNA",
    "Punjab, Lahore and politician, senator",
    "Vasco da Gama and politician, Deputy",
    "Laois and politician, TD , MEP",
    "and politician, MHA for Bass",
    "South Centre and economist, executive director of the",
    "Commissioner and executive,",
    "Arts Council and executive, chair of",
    ", chairman of COSC and governor of Punjab",
]
politics_govt_law = sorted(
    list(set(politics_govt_law)), key=lambda x: len(x), reverse=True
)

arts = [
    "Fulham and manager Coventry City, trade union leader PFA and TV presenter",
    "Stardom, Wrestle and reality TV personality",
    "and chairman of RTVE",
    "WLS TV, WMAQ TV",
    "WUSA, WJLA TV",
    "WBZ TV",
    "owner and CEO of Fabergé Inc and songwriter two time nominee for Academy Award for Best Original Song",
    "McCann Erickson and songwriter",
    "Tate murders, songwriter",
    "Family Radio",
    "Pragati Maidan, Institute of Management Bangalore, Salar Jung Museum",
    "Simon Property Group, producer",
    "Mann Theatres and film producer ,",
    "Univision and producer",
    "Portland Trail Blazers, Phoenix Suns and reality television personality",
    "Lakers and television broadcaster Utah Jazz",
    "County Superior Court and television personality ,",
    "County Superior Court and television personality",
    ", Mayor of City , television judge",
    ", and television personality",
    "Pulitzer Prize and writer , , winner ,",
    "Pulitzer Prize, winner",
    "World Trade Center, Shanghai World Financial Center, Bank of Tower",
    "Giants and broadcaster ABC",
    "War Resisters League and magazine editor",
    "DSW, NWWL, bodybuilder and actress",
    "Billy Rose Aquacade and actress ,",
    "WWF and actress , ,",
    "GLOW and actress ,",
    "and actress , ,",
    "Ford Motor Company, Chrysler and writer",
    "McCann Erickson and songwriter",
    "Yankees, Mets and writer",
    ", , and writer",
    "Pulitzer Prize and writer , , winner ,",
    "City Film Commissioner and executive,",
    "and writer, ambassador to",
    "Pulitzer Prize team at Gannett and filmmaker",
    "Pulitzer Prize, , editor for team that won",
]
arts = sorted(list(set(arts)), key=lambda x: len(x), reverse=True)

sports = [
    "Seventh Judicial Circuit Court of and football player State Seminoles",
    "City Council and college football player USC",
    "San Francisco ers, and football player",
    "Nets, and basketball player",
    "Pittsburgh Steelers and football executive",
    "Indiana Pacers",
    "King Power and football club owner Leicester City",
    ", owner of the Kansas City Royals",
    "HEICO and basketball franchise owner Memphis Grizzlies",
]
sports = sorted(list(set(sports)), key=lambda x: len(x), reverse=True)

sciences = []
sciences = sorted(list(set(sciences)), key=lambda x: len(x), reverse=True)

business_farming = []
business_farming = sorted(
    list(set(business_farming)), key=lambda x: len(x), reverse=True
)

academia_humanities = [
    "Botanical Museum of the National University of Córdoba",
    "High Desert Museum",
    "National Civil Rights Museum",
    "since and the National Museum",
    "City Film",
    "Symphony Orchestra, Royal Philharmonic Orchestra and teacher Canberra School of Music",
    "MIT, Medical School",
    "University and the Polaroid Corporation",
    "University of Cambridge and writer",
]
academia_humanities = sorted(
    list(set(academia_humanities)), key=lambda x: len(x), reverse=True
)

law_enf_military_operator = [
    ", Chairman of the Joint Chiefs of Staff",
    "Parliament attack",
    "Trans Airways and evacuator Operation Moses",
    "Minister for the Environment, Minister for Defence, Member of the Parliament",
    ", Minister of Defence and Chief Minister of Goa , , since",
    ", Minister of National Defence and Foreign Affairs and",
    "since , Minister of Defence , and Finance",
    ", Minister for Defence , Ceann Comhairle",
    ", Minister for Defence and Territories",
    "and Minister of National Defence ,",
    ", , Defence and External Affairs",
    ", Minister of National Defence",
    ", Minister of Defence and MP",
    ", , Minister for Defence",
    ", Minister of Defence ,",
    ", Minister for Defence",
    "; Minister for Defence",
    ", and Defence Minister",
    ", Minister of Defence",
    "and Defence Minister",
    "and Defence , , MP",
    "and Defence",
]
law_enf_military_operator = sorted(
    list(set(law_enf_military_operator)), key=lambda x: len(x), reverse=True
)

spiritual = ["Manson Family and cult leader"]
spiritual = sorted(list(set(spiritual)), key=lambda x: len(x), reverse=True)

social = ["and President of the WWF", "Mission Alliance"]
social = sorted(list(set(social)), key=lambda x: len(x), reverse=True)

crime = []
crime = sorted(list(set(crime)), key=lambda x: len(x), reverse=True)

event_record_other = []
event_record_other = sorted(
    list(set(event_record_other)), key=lambda x: len(x), reverse=True
)

other_species = []
other_species = sorted(list(set(other_species)), key=lambda x: len(x), reverse=True)

cause_of_death = []
cause_of_death = sorted(list(set(cause_of_death)), key=lambda x: len(x), reverse=True)
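
Each list above is de-duplicated with `set` and sorted by descending length before the search. A likely reason is that a short phrase can be a substring of a longer one, so removing the short phrase first would leave fragments of the longer phrase behind. The sketch below illustrates this with two phrases from the earlier `law_enf_military_operator` list. The sample text and the helper `strip_roles` are hypothetical.

```python
def strip_roles(text, roles):
    """Remove each role phrase from text in the order given."""
    for role in roles:
        if role in text:
            text = text.replace(role, "").strip()
    return text


# The list contains a duplicate, and one phrase is a substring of another
roles = ["Tuskegee Airmen", ", and aviator Tuskegee Airmen", "Tuskegee Airmen"]
text = "pilot , and aviator Tuskegee Airmen"  # hypothetical sample text

ordered = sorted(set(roles), key=len, reverse=True)
longest_first = strip_roles(text, ordered)
shortest_first = strip_roles(text, sorted(set(roles), key=len))

print(repr(longest_first))
print(repr(shortest_first))
```

With the longest phrase first, the whole phrase is removed in one pass. With the shortest first, the text `, and aviator` is left over and the longer phrase no longer matches.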


#### Creating known_for_dict Dictionary of Category Keys and Specific Role Lists of Values

In [58]:
# Combining separate lists into one dictionary
known_for_dict = {
    "social": social,
    "spiritual": spiritual,
    "academia_humanities": academia_humanities,
    "business_farming": business_farming,
    "sciences": sciences,
    "politics_govt_law": politics_govt_law,
    "law_enf_military_operator": law_enf_military_operator,
    "crime": crime,
    "event_record_other": event_record_other,
    "other_species": other_species,
    "arts": arts,
    "sports": sports,
}


#### Extracting `known_for` Categories and `cause_of_death` Values from `info_parenth_copy`

In [59]:
%%time

# Column to check
column = 'info_parenth_copy'

# Start dataframe
dataframe = df[df[column].notna()]

# For loop to find cause in column and extract it to cause_of_death
for cause in cause_of_death:
    for index in dataframe.index:
        item = df.loc[index, column]
        if item and cause in item:
            # Append to an existing cause with "/" or set the cause outright
            existing = df.loc[index, 'cause_of_death']
            df.loc[index, 'cause_of_death'] = f'{existing}/{cause}' if existing else cause
            df.loc[index, column] = item.replace(cause, '').strip()
                
                
# For loop to find role in column and extract it as category
for category, category_lst in known_for_dict.items():
    for role in category_lst:
        for index in dataframe.index:
            item = df.loc[index, column]
            if item and role in item:
                df.loc[index, category] = 1
                df.loc[index, column] = item.replace(role, '').strip()

# Calculating num_categories
df["num_categories"] = df[known_for_dict.keys()].sum(axis=1)

# Checking number of cause_of_death values
print(f'There are {df["cause_of_death"].notna().sum()} values in cause_of_death column.\n')

There are 33460 values in cause_of_death column.

CPU times: total: 1min 21s
Wall time: 1min 22s



#### Checking num_categories Value Counts

In [60]:
# Checking num_categories Value Counts
df["num_categories"].value_counts()

1    83686
2    13193
3     1119
4       38
5        3
Name: num_categories, dtype: int64


#### Dropping `info_parenth_copy`

In [61]:
# Dropping info_parenth_copy
df.drop("info_parenth_copy", axis=1, inplace=True)

# Checking a sample
df.sample()

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
14521,23,Richard Sylbert,", 73, American film production designer and art director , cancer.",https://en.wikipedia.org/wiki/Richard_Sylbert,4,2002,March,"Oscar, 1967, 1991, winner ,",73.0,cancer,United States of America,,0,0,0,0,0,1,0,0,0,0,0,0,1



#### Observations:
- We have searched all of the values in `roles_list`.
- No additional `cause_of_death` values turned up, so our earlier searches for `cause_of_death` values were adequate.
- Our search of `info_parenth_copy` extracted ~900 additional category values.
- The column has been dropped, as we are finished with it and with our search for `known_for` categories. Recall that the original `info_parenth` column is still available.
- Next, we will add any overlooked `cause_of_death` information for entries containing "victim".
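
The "victim" check described in the last bullet can also be expressed as a boolean mask rather than a row-by-row list comprehension. This is a minimal sketch on hypothetical toy rows. It assumes `info` has no missing values and uses `isna()`, which treats both `None` and `NaN` as missing.

```python
import pandas as pd

# Toy data (hypothetical rows shaped like the notebook's info column)
toy = pd.DataFrame(
    {
        "info": [
            ", 22, American murder victim.",
            ", 86, British dancer.",
            ", 51, businessman and murder victim.",
        ],
        "cause_of_death": [None, None, "shot"],
    }
)

# Rows that mention "victim" but have no recorded cause_of_death
missed = toy[
    toy["info"].str.contains("victim", regex=False) & toy["cause_of_death"].isna()
]
print(missed)
```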

### Checking for Missed `cause_of_death` Values

#### Examining Entries with Missing `cause_of_death` that Have 'victim' in `info`

In [62]:
# Checking info column for missed cause_of_death values for entries with "victim" in info
df.loc[
    [
        index
        for index in df.index
        if "victim" in df.loc[index, "info"] and df.loc[index, "cause_of_death"] is None
    ],
    :,
]

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
8003,11,Nicky Verstappen,", 11, Dutch homicide victim",https://en.wikipedia.org/wiki/Death_of_Nicky_Verstappen,40,1998,August,,11.0,,Netherlands,,0,0,0,0,0,0,0,0,0,0,1,0,1
8755,17,Samantha Reid,", 15, American manslaughter victim.",https://en.wikipedia.org/wiki/Death_of_Samantha_Reid,6,1999,January,,15.0,,United States of America,,0,0,0,0,0,0,0,0,0,0,1,0,1
9198,17,Milica Rakić,", 3, child victim of the NATO bombing of Yugoslavia.",https://en.wikipedia.org/wiki/Milica_Raki%C4%87,16,1999,April,,3.0,,Serbia,,0,0,0,0,0,0,0,0,0,0,1,0,1
11445,31,Brian Murphy,", 18, Irish victim of unlawful killing.",https://en.wikipedia.org/wiki/Death_of_Brian_Murphy,5,2000,August,,18.0,,Ireland,,0,0,0,0,0,0,0,0,0,0,1,0,1
12239,6,Gus Boulis,", 51, Greek-born American businessman and murder victim.",https://en.wikipedia.org/wiki/Gus_Boulis,21,2001,February,,51.0,,Greece,United States of America,0,0,0,0,1,0,0,0,0,0,1,0,2
13555,5,Robert Stevens,", 63, American photo editor and anthrax attack victim.",https://en.wikipedia.org/wiki/Death_of_Robert_Stevens,16,2001,October,,63.0,,United States of America,,0,0,0,0,0,1,0,0,0,0,1,0,2
13710,5,Andrew Bagby,", 28, American doctor and murder victim whose killing was documented in the movie: Dear Zachary",https://en.wikipedia.org/wiki/Murder_of_Zachary_Turner,23,2001,November,,28.0,,United States of America,,1,0,0,0,0,0,0,0,0,0,1,0,2
14907,13,Stanley L. Greigg,", 71, American Watergate break-in victim.",https://en.wikipedia.org/wiki/Stanley_L._Greigg,7,2002,June,,71.0,,United States of America,,0,0,0,0,0,0,0,0,0,0,1,0,1
17167,7,Ryan Halligan,", 13, American suicide victim.",https://en.wikipedia.org/wiki/Suicide_of_Ryan_Halligan,9,2003,October,,13.0,,United States of America,,0,0,0,0,0,0,0,0,0,0,1,0,1
17382,22,Dru Sjodin,", 22, American murder victim.",https://en.wikipedia.org/wiki/Murder_of_Dru_Sjodin,27,2003,November,,22.0,,United States of America,,0,0,0,0,0,0,0,0,0,0,1,0,1



#### Examining Entries with Missing `cause_of_death` that Have `event_record_other` as Only Category

In [63]:
# Checking info column for missed cause_of_death for entries with event_record_other as only category
df.loc[
    [
        index
        for index in df[df["cause_of_death"].isna()].index
        if df.loc[index, "event_record_other"] == 1
        and df.loc[index, "num_categories"] == 1
        and "victim" not in df.loc[index, "info"]
    ],
    :,
]


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,age,cause_of_death,place_1,place_2,sciences,social,spiritual,academia_humanities,business_farming,arts,sports,law_enf_military_operator,politics_govt_law,crime,event_record_other,other_species,num_categories
2495,26,Jason Callahan,", 19, American man and unidentified body and missing person case.",https://en.wikipedia.org/wiki/Death_of_Jason_Callahan,22,1995,June,,19.0,,United States of America,,0,0,0,0,0,0,0,0,0,0,1,0,1
4244,24,Grace Ho,", 89, Mother of Bruce Lee.",https://en.wikipedia.org/wiki/Grace_Ho,8,1996,June,,89.0,,,,0,0,0,0,0,0,0,0,0,0,1,0,1
4758,25,Solveig Gunbjørg Jacobsen,", 83, First person born on South Georgia.",https://en.wikipedia.org/wiki/Solveig_Gunbj%C3%B8rg_Jacobsen,4,1996,October,,83.0,,Georgia,,0,0,0,0,0,0,0,0,0,0,1,0,1
4824,9,Alvin Straight,", 76, American lawn mower traveler.",https://en.wikipedia.org/wiki/Alvin_Straight,9,1996,November,,76.0,,United States of America,,0,0,0,0,0,0,0,0,0,0,1,0,1
4995,17,Sun Yaoting,", 92, last imperial Chinese eunuch.",https://en.wikipedia.org/wiki/Sun_Yaoting,5,1996,December,,92.0,,"China, People's Republic of",,0,0,0,0,0,0,0,0,0,0,1,0,1
5239,31,Eugenia Smith,", 98, American Romanov impostor.",https://en.wikipedia.org/wiki/Eugenia_Smith,15,1997,January,,98.0,,United States of America,,0,0,0,0,0,0,0,0,0,0,1,0,1
6756,9,Lucy Jane Askew,", 114, British supercentenarian, oldest person in the United Kingdom.",https://en.wikipedia.org/wiki/List_of_British_supercentenarians,48,1997,December,,114.0,,United Kingdom of Great Britain and Northern Ireland,,0,0,0,0,0,0,0,0,0,0,1,0,1
7466,16,Marie-Louise Meilleur,", 117, Canadian supercentenarian, oldest living person at the time of her death.",https://en.wikipedia.org/wiki/Marie-Louise_Meilleur,7,1998,April,,117.0,,Canada,,0,0,0,0,0,0,0,0,0,0,1,0,1
9261,29,Denzo Ishizaki,", 112, Japan's oldest man.",https://en.wikipedia.org/wiki/Supercentenarians_from_Japan,42,1999,April,,112.0,,Japan,,0,0,0,0,0,0,0,0,0,0,1,0,1
10347,30,Sarah Knauss,", 119, American supercentenarian, oldest person in the world and oldest-ever US citizen.",https://en.wikipedia.org/wiki/Sarah_Knauss,17,1999,December,,119.0,,United States of America,,0,0,0,0,0,0,0,0,0,0,1,0,1



#### Hard-coding `cause_of_death` for Missed Values Identified in `info`

In [64]:
# Hard-coding identified missed cause_of_death values from above search for entries with "victim" in info and
# for entries with event_record_other as only category
df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Death_of_Nicky_Verstappen"].index,
    "cause_of_death",
] = "homicide"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Milica_Raki%C4%87"].index,
    "cause_of_death",
] = "NATO bombing of Yugoslavia"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Death_of_Samantha_Reid"].index,
    "cause_of_death",
] = "manslaughter"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Death_of_Brian_Murphy"].index,
    "cause_of_death",
] = "unlawful killing"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Gus_Boulis"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Death_of_Robert_Stevens"].index,
    "cause_of_death",
] = "anthrax attack"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Murder_of_Zachary_Turner"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Suicide_of_Ryan_Halligan"].index,
    "cause_of_death",
] = "suicide subsequent to cyber-bullying"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Murder_of_Dru_Sjodin"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Cecilia_Cubas"].index,
    "cause_of_death",
] = "kidnapped and murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Death_of_Maria_Korp"].index,
    "cause_of_death",
] = "oxygen starvation, head injuries, and severe dehydration from being kidnapped"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Ian_Bush"].index, "cause_of_death",
] = "shot by police"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Lim_Hock_Soon"].index,
    "cause_of_death",
] = "shot/murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Chris_and_Cru_Kahui"].index,
    "cause_of_death",
] = "homicide"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Chanel_Petro_Nixon"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Jessie_Davis"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Emily_Sander"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Maria_Lauterbach"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Ben_Bowen"].index, "cause_of_death",
] = "cancer"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Caylee_Anthony"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Sandra_Cantu_homicide"].index,
    "cause_of_death",
] = "homicide"

df.loc[
    df[
        df["link"] == "https://en.wikipedia.org/wiki/Murder_of_Jennifer_Daugherty"
    ].index,
    "cause_of_death",
] = "torture/murder"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Murder_of_Oksana_Makar"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Death_of_Sigrid_Schjetne"].index,
    "cause_of_death",
] = "homicide"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Murder_of_Gabriel_Fernandez"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Iana_Kasian"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Jaxon_Buell"].index,
    "cause_of_death",
] = "microhydranencephaly"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Murder_of_Adrianne_Reynolds"].index,
    "cause_of_death",
] = "murdered"

df.loc[
    df[
        df["link"] == "https://en.wikipedia.org/wiki/Shooting_of_Rigoberto_Alpizar"
    ].index,
    "cause_of_death",
] = "fatally shot by U.S. Air Marshals after allegedly claiming he had placed a bomb aboard"

df.loc[
    df[df["link"] == "https://en.wikipedia.org/wiki/Killing_of_Justine_Damond"].index,
    "cause_of_death",
] = "murdered by a Minneapolis Police officer"


#### Dropping Entry with Link that Points to Event Page Rather than Individual's Page

In [65]:
# Dropping entry that points to event page rather than individual's page
df.drop(
    df[df["link"] == "https://en.wikipedia.org/wiki/Thor_Hesla"].index, inplace=True
)
df.reset_index(inplace=True, drop=True)

# Checking number of cause_of_death values
print(
    f'There are {df["cause_of_death"].notna().sum()} values in cause_of_death column.\n'
)

There are 33490 values in cause_of_death column.




#### Observations:
- We found missed `cause_of_death` values for ~30 entries (the count of values in the column rose from 33460 to 33490) and identified one entry to drop because its link points to an event page rather than the individual's page.
- At this point, we have completed separating the various features within the original scraped data and we have our complete dataset.
- Any further feature extraction/engineering will be done as needed during EDA or pre-processing before modeling.
- We will now save our dataset and pick back up in a new notebook.
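
The hard-coded assignments above repeat one `df.loc` pattern per link. As a sketch of a more compact alternative (not what the notebook ran), the same corrections could be stored in a dictionary keyed by link and applied in one step. The name `cause_by_link` and the toy frame are hypothetical; the two link and cause pairs are copied from the assignments above.

```python
import pandas as pd

# Corrections keyed by link (two pairs copied from the hard-coded cell)
cause_by_link = {
    "https://en.wikipedia.org/wiki/Death_of_Nicky_Verstappen": "homicide",
    "https://en.wikipedia.org/wiki/Gus_Boulis": "murdered",
}

# Toy data (hypothetical frame with only the columns needed here)
toy = pd.DataFrame(
    {
        "link": [
            "https://en.wikipedia.org/wiki/Death_of_Nicky_Verstappen",
            "https://en.wikipedia.org/wiki/Gus_Boulis",
            "https://en.wikipedia.org/wiki/William_Chappell_(dancer)",
        ],
        "cause_of_death": [None, None, None],
    }
)

# Overwrite cause_of_death only for rows whose link has a correction
mask = toy["link"].isin(list(cause_by_link))
toy.loc[mask, "cause_of_death"] = toy.loc[mask, "link"].map(cause_by_link)
print(toy)
```

Keeping the corrections in one mapping makes them easier to audit and avoids repeating the lookup code for each entry.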

### Exporting Dataset to SQLite Database `wp_life_expect_clean8.db`

In [66]:
# Exporting dataframe

# Saving dataset in a SQLite database
conn = sql.connect("wp_life_expect_clean8.db")
df.to_sql("wp_life_expect_clean8", conn, index=False)

98038
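
The export cell leaves the connection open, and rerunning it would raise an error because `to_sql` defaults to `if_exists="fail"` when the table already exists. A hedged sketch of a rerunnable export with a round-trip check is shown below. It uses an in-memory database and a hypothetical toy table rather than the real file.

```python
import sqlite3 as sql
from contextlib import closing

import pandas as pd

# Toy data (hypothetical, standing in for the cleaned dataframe)
toy = pd.DataFrame({"name": ["A", "B"], "age": [86.0, 73.0]})

# closing() guarantees the connection is closed when the block exits
with closing(sql.connect(":memory:")) as conn:
    # if_exists="replace" lets the cell be rerun without a "table exists" error
    toy.to_sql("toy_table", conn, index=False, if_exists="replace")
    # Read the table back to confirm the export round-trips
    restored = pd.read_sql("SELECT * FROM toy_table", conn)

print(restored.shape)
```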


In [67]:
print("Complete")

# Chime notification when cell executes
chime.success()

Complete



# [Proceed to Exploratory Data Analysis](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_EDA_thanak_2022_09_30.ipynb)

[Return to README](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/README.md#explore-the-project)