# Wikipedia Notable Life Expectancies

# [Notebook 3 of 4: Data Cleaning](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_data_clean2_thanak_2022_06_17.ipynb)

## Context

The


## Objective

The

### Data Dictionary

- Feature: Description

## Importing Necessary Libraries

In [1]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
import sqlite3 as sql

# To help with reading and manipulating data
import pandas as pd
import numpy as np
import re

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 200)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some visualization attributes
pd.set_option("max_colwidth", 150)


## Data Overview


### Reading, Sampling, and Checking Data Shape

In [2]:
# Reading the dataset
conn = sql.connect("wp_life_expect_clean1.db")
data = pd.read_sql("SELECT * FROM wp_life_expect_clean1", conn)

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

There are 132445 rows and 20 columns.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
0,1,William Chappell,", 86, British dancer, ballet designer and director.",https://en.wikipedia.org/wiki/William_Chappell_(dancer),21,1994,January,,,British dancer,ballet designer and director,,,,,,,,,86.0
1,1,Raymond Crotty,", 68, Irish economist, writer, and academic.",https://en.wikipedia.org/wiki/Raymond_Crotty,12,1994,January,,,Irish economist,writer,and academic,,,,,,,,68.0



In [3]:
# Checking last 2 rows of the data
df.tail(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
132443,9,Oleg Moliboga,", 69, Russian volleyball player, Olympic champion and coach.",https://en.wikipedia.org/wiki/Oleg_Moliboga,2,2022,June,(),,Russian volleyball player,Olympic champion and coach,,,,,,,,,69.0
132444,9,Zou Jing,", 86, Chinese engineer, member of the Chinese Academy of Engineering.",https://en.wikipedia.org/wiki/Zou_Jing_(engineer),3,2022,June,,,Chinese engineer,member of the Chinese Academy of Engineering,,,,,,,,,86.0



In [4]:
# Checking a sample of the data
df.sample(5)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
119284,15,Kōjirō Akagi,", 87, Japanese painter.",https://en.wikipedia.org/wiki/K%C5%8Djir%C5%8D_Akagi,1,2021,February,,,Japanese painter,,,,,,,,,,87.0
14674,30,Roger Kimpton,", 83, Australian cricketer.",https://en.wikipedia.org/wiki/Roger_Kimpton,14,1999,November,,,Australian cricketer,,,,,,,,,,83.0
81848,7,John Krish,", 92, British film director.",https://en.wikipedia.org/wiki/John_Krish,2,2016,May,,,British film director,,,,,,,,,,92.0
35867,1,Raúl Reyes,", 59, Colombian FARC second-in-command, airstrike.",https://en.wikipedia.org/wiki/Ra%C3%BAl_Reyes,28,2008,March,,,Colombian FARC second-in-command,airstrike,,,,,,,,,59.0
37747,8,Ahn Jae-hwan,", 36, South Korean actor, body found on this date after suicide by carbon monoxide poisoning.",https://en.wikipedia.org/wiki/Ahn_Jae-hwan,4,2008,September,,,South Korean actor,body found on this date after suicide by carbon monoxide poisoning,,,,,,,,,36.0



#### Observations:
- There are currently 132,445 rows and 20 columns.
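
The read step above can be tried end to end without the real database file. The following is a minimal sketch, assuming an in-memory SQLite database and a hypothetical two-row table named `toy_table`; neither is part of the original notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical two-row stand-in for the real table
toy = pd.DataFrame(
    {
        "name": ["William Chappell", "Spiro Agnew"],
        "info_2": ["British dancer", "77"],
        "age": [86.0, None],
    }
)

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
toy.to_sql("toy_table", conn, index=False)

# Read the table back, then release the connection
restored = pd.read_sql("SELECT * FROM toy_table", conn)
conn.close()

print(restored.shape)
print(restored["age"].isna().sum())
```

Closing the connection once the data is in memory is a small habit that avoids leaving the database file locked.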

### Checking Data Types, Duplicates, and Null Values

In [5]:
# Checking data types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132445 entries, 0 to 132444
Data columns (total 20 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132445 non-null  object 
 1   name            132445 non-null  object 
 2   info            132445 non-null  object 
 3   link            132445 non-null  object 
 4   num_references  132445 non-null  object 
 5   year            132445 non-null  int64  
 6   month           132445 non-null  object 
 7   info_parenth    49788 non-null   object 
 8   info_1          132445 non-null  object 
 9   info_2          132421 non-null  object 
 10  info_3          62570 non-null   object 
 11  info_4          12587 non-null   object 
 12  info_5          1505 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   


#### Observations:
- Our dataset was saved to and read from the database without any hiccups.
- Picking up where we left off, we will track down the remaining missing values for `age`, starting with a search for digits in `info_2`.

### Remaining Missing Values for Age

In [6]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

There are 79 missing values for age.



#### Function to Save Indices of Rows Matching Regular Expressions Pattern to a List and Print Number of Rows with Match

In [7]:
# Define a function that takes dataframe, column name, and re pattern as arguments and returns list of indices
# for which column value matches re pattern
def rows_with_pattern(dataframe, column, pattern):
    """
    Takes input of dataframe, column name, and re pattern 
    and returns list of indices for rows that contain match
    for pattern anywhere within value for given column.
    
    dataframe: dataframe
    column: column name
    pattern: re pattern
    """
    index_list = []

    for i in dataframe.index:
        item = dataframe.loc[i, column]
        match = re.search(pattern, item)
        if match:
            index_list.append(i)
    print(
        f"There are {len(index_list)} rows with matching pattern in column '{column}'."
    )
    return index_list
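
For comparison, the same lookup can be written with pandas string methods instead of a Python loop. This is a sketch only: `rows_with_pattern_vectorized` is a hypothetical name, and the toy dataframe is made up for illustration.

```python
import pandas as pd


def rows_with_pattern_vectorized(dataframe, column, pattern):
    """
    Hypothetical vectorized variant: returns list of index labels for rows
    whose value in the given column contains a match for the re pattern.
    """
    # na=False treats missing values as non-matches instead of raising
    mask = dataframe[column].str.contains(pattern, regex=True, na=False)
    return dataframe.loc[mask].index.tolist()


toy = pd.DataFrame(
    {"info_2": ["77", "British actor", None, "74–75"]}, index=[10, 11, 12, 13]
)
print(rows_with_pattern_vectorized(toy, "info_2", r"\d"))
```

Unlike the loop version, this variant tolerates missing values in the column, so the caller does not need to filter on `notna()` first. Note that `str.contains` may emit a warning when the pattern contains capture groups.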


#### Function to Use rows_with_pattern Function for Multiple Regular Expression Patterns

In [8]:
# Define a function that calls rows_with_pattern function for multiple re patterns
# returning a single list of indices for all rows with any pattern match


def multiple_patterns(dataframe, column, patterns):
    """
    Takes input dataframe, column, and list of re patterns and returns single list 
    of indices for rows in which a match for any pattern is found with re.search
    
    dataframe: dataframe
    column: column name
    patterns: list of re patterns
    """
    rows_combined = []

    # For loop to check each pattern
    for pattern in patterns:

        # List and number of rows matching each pattern
        print(pattern)
        rows_to_check = rows_with_pattern(dataframe, column, pattern)
        print("")

        # Add list for each pattern to combined list
        rows_combined += rows_to_check

    return rows_combined
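
Because `multiple_patterns` concatenates the per-pattern lists, a row that matches more than one pattern appears in the result more than once. If that matters (for example, when displaying rows with `df.loc`), the combined list can be de-duplicated while keeping its order. A minimal sketch with made-up index lists:

```python
# Hypothetical index lists, as two overlapping pattern searches might return them
rows_pattern_a = [6846, 12861, 19017]
rows_pattern_b = [19017, 23752]

rows_combined = rows_pattern_a + rows_pattern_b

# dict.fromkeys keeps the first occurrence of each label and preserves order
rows_unique = list(dict.fromkeys(rows_combined))

print(rows_combined)
print(rows_unique)
```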


### `info_2`

#### Rows Missing `age` with Digits in `info_2`

In [9]:
# Pattern for re
pattern = r"\d"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Examining the rows directly
df.loc[rows_to_check, :]

There are 43 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
6846,17,Spiro Agnew,", American politician, 77, 39th Vice President of the United States, leukemia.",https://en.wikipedia.org/wiki/Spiro_Agnew,207,1996,September,,American politician,77,39th Vice President of the United States,leukemia,,,,,,,,
12861,14,Muslimgauze,", , 37, British electronic musician, fungal infection.",https://en.wikipedia.org/wiki/Muslimgauze,26,1999,January,(Bryn Jones),,37,British electronic musician,fungal infection,,,,,,,,
14993,15,Željko Ražnatović,", , 47, Serbian mobster and paramilitary leader.",https://en.wikipedia.org/wiki/Arkan,55,2000,January,(aka Arkan),,47,Serbian mobster and paramilitary leader,,,,,,,,,
15062,28,Sarah Caudwell,", , 60, British detective story writer and barrister, cancer.",https://en.wikipedia.org/wiki/Sarah_Caudwell,9,2000,January,(aka Sarah Cockburn),,60,British detective story writer and barrister,cancer,,,,,,,,
16111,30,Max Showalter,", , 83, American actor, composer, pianist, singer, cancer.",https://en.wikipedia.org/wiki/Max_Showalter,14,2000,July,(aka Casey Adams),,83,American actor,composer,pianist,singer,cancer,,,,,
17792,15,Joey Ramone,", , 49, American musician, lead singer for The Ramones, lymphoma.",https://en.wikipedia.org/wiki/Joey_Ramone,25,2001,April,(b. Jeffrey Hyman),,49,American musician,lead singer for The Ramones,lymphoma,,,,,,,
18170,4,Dipendra,", King of Nepal, 29, suicide.",https://en.wikipedia.org/wiki/Dipendra_of_Nepal,8,2001,June,,King of Nepal,29,suicide,,,,,,,,,
19017,28,Mohammad Khalequzzaman,", member of the then National Assembly of Pakistan and Union Minister of Labor, died in 28 September .",https://en.wikipedia.org/wiki/Mohammad_Khalequzzaman,3,2001,September,,member of the then National Assembly of Pakistan and Union Minister of Labor,died in 28 September,,,,,,,,,,
19117,12,Lord Hailsham of St Marylebone,", , 94, British lawyer and politician.","https://en.wikipedia.org/wiki/Quintin_Hogg,_Baron_Hailsham_of_St_Marylebone",17,2001,October,(Quintin Hogg),,94,British lawyer and politician,,,,,,,,,
23752,6,Jules Engel,", Jules Engel, 94, American filmmaker, visual artist, and film director.",https://en.wikipedia.org/wiki/Jules_Engel,10,2003,September,,Jules Engel,94,American filmmaker,visual artist,and film director,,,,,,,



#### Observations:
- Several entries give age as a single integer (in years), and two give it as a range between two integers.
- The remaining entries are missing age but still contain digits, so the order of processing matters here.
- We can safely remove any of these rows that contains a letter of the alphabet, taking care to select rows only from those that are missing `age` and have a digit in `info_2`.
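
The rule in the last bullet can be illustrated on a few `info_2` values that appear in this notebook's outputs. This is a sketch of the reasoning, not the notebook's own code; the labels "no age" and "age candidate" are made up for the example.

```python
import re

# Sample info_2 values from rows that are missing age
values = ["77", "died in 28 September", "74–75", "COVID-19"]

classified = {}
for value in values:
    has_digit = re.search(r"\d", value) is not None
    has_letter = re.search(r"[a-zA-Z]", value) is not None
    # A digit alongside a letter marks text that is not an age
    classified[value] = "no age" if (has_digit and has_letter) else "age candidate"

for value, label in classified.items():
    print(f"{value!r}: {label}")
```

Filtering out the letter-bearing values first is what keeps the "28" in a date or the "19" in "COVID-19" from being read as an age later.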

In [10]:
# Pattern for re: any ASCII letter (a comma inside the brackets would also match literal commas)
pattern = r"[a-zA-Z]"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(df.loc[rows_to_check, :], "info_2", pattern)

# Checking a sample of the rows
df.loc[rows_to_check, :].sample(2)

There are 17 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
19017,28,Mohammad Khalequzzaman,", member of the then National Assembly of Pakistan and Union Minister of Labor, died in 28 September .",https://en.wikipedia.org/wiki/Mohammad_Khalequzzaman,3,2001,September,,member of the then National Assembly of Pakistan and Union Minister of Labor,died in 28 September,,,,,,,,,,
127156,17,R. N. R. Manohar,", Indian film director , COVID-19.",https://en.wikipedia.org/wiki/R._N._R._Manohar,8,2021,November,"(, ) and actor ()",Indian film director,COVID-19,,,,,,,,,,



#### Observations:
- We can drop these rows, as they are missing the data for age.
- Extraction of age for single integer value and two integer ranges can follow.

#### Dropping Additional Rows with Age Data Absent

In [11]:
# Dropping rows, resetting index, and checking new shape of df
df.drop(rows_to_check, inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132428, 20)
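
Resetting the index after a drop renumbers the rows, so any index labels collected before the drop no longer point at the same entries. That is why the search is repeated in the next cell rather than reusing the earlier list. A toy illustration, with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"info_2": ["77", "died in 28 September", "74–75"]})

# Before the drop, "74–75" sits at label 2
toy.drop([1], inplace=True)
toy.reset_index(inplace=True, drop=True)

# After the reset, "74–75" lives at label 1 and label 2 no longer exists
print(toy.index.tolist())
print(toy.loc[1, "info_2"])
print(2 in toy.index)
```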


#### Remaining Rows with `age` values in `info_2`

In [12]:
# Pattern for re: any digit
pattern = r"\d"

# Finding indices of rows that have pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Checking unique values
df.loc[rows_to_check, :]["info_2"].unique()

There are 26 rows with matching pattern in column 'info_2'.


array(['77', '37', '47', '60', '83', '49', '29', '94', '55', '62', '69',
       '80', '32', '70', '81', '24', '84', '95', '76', '86', '61',
       '74–75', '79–80'], dtype=object)


#### Extracting `age` for Ranges with Two Values

In [13]:
# Pattern for re
pattern = r"(\d{1,3})(-|–|/| or )(\d{1,3})"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

# Checking sample of rows
df.loc[rows_to_check, :].sample(2)

There are 2 rows with matching pattern in column 'info_2'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,74–75,,,,,,,,,,
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,79–80,President of Puntland,,,,,,,,,



In [14]:
# For loop to find rows with values and pattern and calculate and extract age to age column and remove age from info_2
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_2"]
    match = re.search(pattern, item)
    if match:
        age = (int(match.group(1)) + int(match.group(3))) / 2
        df.loc[i, "age"] = age
        df.loc[i, "info_2"] = re.sub(pattern, "", df.loc[i, "info_2"])

# Checking example rows
pd.concat([df[df["name"] == "Mohamud Muse Hersi"], df[df["name"] == "Basil Soper"]])

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
86907,8,Mohamud Muse Hersi,", Somali politician, 79–80, President of Puntland .",https://en.wikipedia.org/wiki/Mohamud_Muse_Hersi,12,2017,February,(–),Somali politician,,President of Puntland,,,,,,,,,79.5
61915,1,Basil Soper,", British actor, 74–75.",https://en.wikipedia.org/wiki/Basil_Soper,0,2013,June,,British actor,,,,,,,,,,,74.5
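
The midpoint calculation from the loop above can be isolated in a small function, which makes it easy to check on individual strings. `age_from_range` is a hypothetical helper name; the pattern is the one defined in the preceding cell.

```python
import re

# Two integers joined by a hyphen, en dash, slash, or " or "
RANGE_PATTERN = r"(\d{1,3})(-|–|/| or )(\d{1,3})"


def age_from_range(text):
    """Hypothetical helper: midpoint of an age range, or None if no range is found."""
    match = re.search(RANGE_PATTERN, text)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(3))
    return (low + high) / 2


print(age_from_range("74–75"))
print(age_from_range("79–80"))
print(age_from_range("British actor"))
```

Taking the midpoint means a one-year range always yields a half-year value (74.5, 79.5), which is why `age` is stored as a float.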



#### Extracting `age` as Single Integer

In [15]:
# Pattern for age given as a single integer
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

There are 24 rows with matching pattern in column 'info_2'.



In [16]:
# For loop to extract age pattern to age column
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_2"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age
        df.loc[i, "info_2"] = re.sub(pattern, "", df.loc[i, "info_2"])

# Re-checking number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_2"].notna())], "info_2", pattern
)

There are 0 rows with matching pattern in column 'info_2'.
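
The single-integer extraction can also be sketched with `Series.str.extract`, which applies the pattern to a whole column at once. This is an alternative under stated assumptions, not the notebook's method: the toy dataframe is made up, and unlike the loop above, the sketch does not remove the digits from `info_2`.

```python
import pandas as pd

toy = pd.DataFrame({"info_2": ["77", "British actor", None]})

# expand=False returns a Series holding the first capture group (missing if no match)
extracted = toy["info_2"].str.extract(r"\b(\d{1,3})\b", expand=False)

# Convert the captured strings to numbers; unmatched rows stay missing
toy["age"] = pd.to_numeric(extracted)

print(toy)
```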



In [17]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')
df[df["age"].isna()].sample(2)

There are 36 missing values for age.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
112423,14,Tawfiq al-Yasiri,", Iraqi politician, member of the IRDC, COVID-19.",https://en.wikipedia.org/wiki/Tawfiq_al-Yasiri,1,2020,June,,Iraqi politician,member of the IRDC,COVID-19,,,,,,,,,
117639,1,Toabur Rahim,", Bangladeshi politician, MP , COVID-19.",https://en.wikipedia.org/wiki/Toabur_Rahim,2,2021,January,(–),Bangladeshi politician,MP,COVID-19,,,,,,,,,



#### Observations:
- At this point, we could look through the list to manually extract any remaining `age` info, which would likely be the fastest approach.
- We will instead take a programmatic approach, for the sake of the exercise.
- The remaining columns to search are `info_parenth` and `info_3` through `info_11`.
- We see that COVID-19 appears often, and the number 19 could be mistakenly extracted as an age.
- Let us start by moving it to a new column `cause_of_death`.
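
The concern about "COVID-19" is easy to confirm with the single-integer age pattern used earlier in this notebook. A short check:

```python
import re

AGE_PATTERN = r"\b(\d{1,3})\b"

# The hyphen is a non-word character, so "19" stands alone between word boundaries
match = re.search(AGE_PATTERN, "COVID-19")
print(match.group(1))

# Once "COVID-19" is removed from the text, nothing is left to mistake for an age
remaining = re.sub(r"COVID-19", "", "COVID-19")
print(re.search(AGE_PATTERN, remaining))
```

Moving the term out first therefore protects the later digit search from a false positive.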

#### Extracting "COVID-19" from Remaining `info` Sub-columns where `age` Missing

In [18]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Pattern for re
pattern = r"(COVID-19)"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        df[(df["age"].isna()) & (df[column].notna())], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Checking sample of rows
df.loc[comb_rows_to_check, :].sample(2)

There are 1 rows with matching pattern in column 'info_parenth'.
There are 30 rows with matching pattern in column 'info_3'.
There are 2 rows with matching pattern in column 'info_4'.
There are 1 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age
117312,23,Loyiso Mpumlwana,", South African politician, member of the National Assembly , COVID-19.",https://en.wikipedia.org/wiki/Loyiso_Mpumlwana,9,2020,December,"(–, since )",South African politician,member of the National Assembly,COVID-19,,,,,,,,,
116218,14,Abu Hena,", Bangladeshi politician, MP , COVID-19.",https://en.wikipedia.org/wiki/Abu_Hena_(Bangladeshi_politician),4,2020,November,(–),Bangladeshi politician,MP,COVID-19,,,,,,,,,



In [21]:
# For loop to extract COVID-19 from remaining info columns for entries with missing age
for column in cols_to_check:
    for i in df.loc[comb_rows_to_check, :].index:
        item = df.loc[i, column]
        # Skip missing values (None or NaN), which re.search cannot process
        if isinstance(item, str):
            match = re.search(pattern, item)
            if match:
                cause = match.group(1)
                df.loc[i, "cause_of_death"] = cause
                df.loc[i, column] = re.sub(pattern, "", df.loc[i, column])

# Checking sample of rows
df.loc[comb_rows_to_check, :].sample(2)

Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
122356,20,Tshoganetso Tongwane,", South African politician, MP , COVID-19.",https://en.wikipedia.org/wiki/Tshoganetso_Tongwane,8,2021,May,"(–, –, since )",South African politician,MP,,,,,,,,,,,COVID-19
116496,24,Hussein Al-Zuhairi,", Iraqi politician, MP, COVID-19.",https://en.wikipedia.org/wiki/Hussein_Al-Zuhairi,3,2020,November,,Iraqi politician,MP,,,,,,,,,,,COVID-19



#### Observations:
- With "COVID-19" put aside, we can check for any remaining digits in the same columns for these entries.
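
The step of moving "COVID-19" into `cause_of_death` could also be written with boolean masks instead of nested loops. The following is a sketch on a made-up dataframe; it omits the notebook's restriction to rows that are missing `age`, and it uses plain substring matching rather than a regular expression.

```python
import pandas as pd

# Hypothetical toy data with the term in different columns
toy = pd.DataFrame(
    {
        "info_3": ["COVID-19", "MP", None],
        "info_4": [None, "COVID-19", None],
    }
)
toy["cause_of_death"] = None

for column in ["info_3", "info_4"]:
    # Rows whose value in this column contains the literal term
    mask = toy[column].str.contains("COVID-19", regex=False, na=False)
    toy.loc[mask, "cause_of_death"] = "COVID-19"
    # Remove the term from its source column
    toy.loc[mask, column] = toy.loc[mask, column].str.replace(
        "COVID-19", "", regex=False
    )

print(toy)
```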

#### Checking Remaining `info` Columns for Remaining Digits

In [23]:
# List of columns to check
cols_to_check = [
    "info_parenth",
    "info_3",
    "info_4",
    "info_5",
    "info_6",
    "info_7",
    "info_8",
    "info_9",
    "info_10",
    "info_11",
]

# Pattern for re
pattern = r"\d"

# For loop to collect indices of all rows with pattern
comb_rows_to_check = []
for column in cols_to_check:
    rows_to_check = rows_with_pattern(
        df[(df["age"].isna()) & (df[column].notna())], column, pattern
    )
    comb_rows_to_check += rows_to_check

# Examining the rows directly
df.loc[comb_rows_to_check, :]

There are 2 rows with matching pattern in column 'info_parenth'.
There are 0 rows with matching pattern in column 'info_3'.
There are 0 rows with matching pattern in column 'info_4'.
There are 0 rows with matching pattern in column 'info_5'.
There are 0 rows with matching pattern in column 'info_6'.
There are 0 rows with matching pattern in column 'info_7'.
There are 0 rows with matching pattern in column 'info_8'.
There are 0 rows with matching pattern in column 'info_9'.
There are 0 rows with matching pattern in column 'info_10'.
There are 0 rows with matching pattern in column 'info_11'.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
22855,10,Little Eva,", .",https://en.wikipedia.org/wiki/Little_Eva,14,2003,April,"(née Eva Narcissus Boyd), 59, American pop singer ()",,,,,,,,,,,,,
126710,31,Simon Young,", Irish radio presenter .",https://en.wikipedia.org/wiki/Simon_Young_(presenter),9,2021,October,(RTÉ 2fm),Irish radio presenter,,,,,,,,,,,,



#### Observations:
- There are only two entries that still potentially contain digits for the age data.
- Here, we see that there is one entry we can preserve that has an age value in `info_parenth`.
- The other entry has a radio station identification value and is missing age data.
- After we collect this last age, we will drop the remaining entries missing `age`.
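
The two entries behave differently under the word-boundary pattern `\b(\d{1,3})\b` used for single-integer ages in this notebook, which is what lets the age be collected without picking up the station identifier. A quick check on the two `info_parenth` values shown above:

```python
import re

AGE_PATTERN = r"\b(\d{1,3})\b"

little_eva = "(née Eva Narcissus Boyd), 59, American pop singer ()"
simon_young = "(RTÉ 2fm)"

# "59" is bounded by a space and a comma, so it matches
print(re.search(AGE_PATTERN, little_eva).group(1))

# "2" is followed directly by the letter "f": no word boundary, so no match
print(re.search(AGE_PATTERN, simon_young))
```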

#### Extracting `age` from `info_parenth`

In [24]:
# Pattern for age given as a single integer
pattern = r"\b(\d{1,3})\b"

# List and number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_parenth"].notna())], "info_parenth", pattern
)

There are 1 rows with matching pattern in column 'info_parenth'.



In [27]:
# For loop to extract age pattern to age column
for i in df.loc[rows_to_check, :].index:
    item = df.loc[i, "info_parenth"]
    match = re.search(pattern, item)
    if match:
        age = int(match.group(1))
        df.loc[i, "age"] = age
        df.loc[i, "info_parenth"] = re.sub(pattern, "", df.loc[i, "info_parenth"])

# Re-checking number of rows matching pattern
rows_to_check = rows_with_pattern(
    df[(df["age"].isna()) & (df["info_parenth"].notna())], "info_parenth", pattern
)

There are 0 rows with matching pattern in column 'info_parenth'.



#### Dropping the Last Entries with Missing `age` Values

In [28]:
# Checking number of remaining missing values
print(f'There are {df["age"].isna().sum()} missing values for age.')

df

There are 35 missing values for age.


Unnamed: 0,day,name,info,link,num_references,year,month,info_parenth,info_1,info_2,info_3,info_4,info_5,info_6,info_7,info_8,info_9,info_10,info_11,age,cause_of_death
126710,31,Simon Young,", Irish radio presenter .",https://en.wikipedia.org/wiki/Simon_Young_(presenter),9,2021,October,(RTÉ 2fm),Irish radio presenter,,,,,,,,,,,,
118147,14,Dinesh Chandra Yadav,", Nepalese politician, member of the Constituent Assembly , COVID-19.",https://en.wikipedia.org/wiki/Dinesh_Chandra_Yadav_(Nepali_politician),4,2021,January,(–),Nepalese politician,member of the Constituent Assembly,,,,,,,,,,,COVID-19



In [30]:
# Dropping rows, resetting index, and checking new shape of df
df.dropna(subset=["age"], inplace=True)
df.reset_index(inplace=True, drop=True)
df.shape

(132393, 21)
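
The new shape is consistent with the earlier counts: 132,428 rows minus the 35 with missing `age` leaves 132,393. The same sanity check can be built into the drop step itself. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B", "C"], "age": [86.0, None, 59.0]})
rows_before = len(toy)
n_missing = toy["age"].isna().sum()

# Drop rows with missing age and renumber the index
toy = toy.dropna(subset=["age"]).reset_index(drop=True)

# Row count should fall by exactly the number of missing ages
assert len(toy) == rows_before - n_missing
print(toy.shape)
```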


In [32]:
# Checking current info status
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132393 entries, 0 to 132392
Data columns (total 21 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   day             132393 non-null  object 
 1   name            132393 non-null  object 
 2   info            132393 non-null  object 
 3   link            132393 non-null  object 
 4   num_references  132393 non-null  object 
 5   year            132393 non-null  int64  
 6   month           132393 non-null  object 
 7   info_parenth    49752 non-null   object 
 8   info_1          132393 non-null  object 
 9   info_2          132371 non-null  object 
 10  info_3          62537 non-null   object 
 11  info_4          12584 non-null   object 
 12  info_5          1504 non-null    object 
 13  info_6          217 non-null     object 
 14  info_7          33 non-null      object 
 15  info_8          7 non-null       object 
 16  info_9          1 non-null       object 
 17  info_10   
