**This notebook scrapes the tables from the CDC report "Pregnancy-Related Deaths: Data From Maternal Mortality Review Committees in 36 U.S. States, 2017–2019". Link: https://www.cdc.gov/maternal-mortality/php/data-research/mmrc-2017-2019.html#cdc_research_or_data_summary_explore_more-data-sources-and-methods** The purpose of this notebook is to extract the maternal mortality data, broken down by cause of death and other key characteristics. The data will be used for a master's project on maternal mortality in the US.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

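# Step 1: Request the page (the '#...' fragment is a browser-side anchor and is not sent to the server)
url = 'https://www.cdc.gov/maternal-mortality/php/data-research/mmrc-2017-2019.html#cdc_research_or_data_summary_explore_more-data-sources-and-methods'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.content, 'html.parser')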

# Step 2: Find the first div that wraps the embedded tables
# (a space-separated class_ string matches only this exact class attribute value)
raw_html_div = soup.find('div', class_='dfe-block dfe-block--cdcmodule cdc_raw_html')
if raw_html_div is None:
    raise ValueError('Target div not found; the page layout may have changed')

# Step 3: Re-parse the div's inner markup as its own document
inner_html = BeautifulSoup(raw_html_div.decode_contents(), 'html.parser')

# Step 4: Find the first table inside the inner HTML
table = inner_html.find('table')

# Step 5: Extract rows and convert to DataFrame
rows = table.find_all('tr')
data = []
for row in rows:
    cells = row.find_all(['td', 'th'])
    data.append([cell.get_text(strip=True) for cell in cells])

df = pd.DataFrame(data)
df.columns = df.iloc[0]  # Set first row as header
df = df.drop(0).reset_index(drop=True)

# Step 6: Display the table
df.head()


Unnamed: 0,Unnamed: 1,"Number of pregnancy-related deaths (N = 1,018)",%
0,Race and ethnicity,,
1,Hispanic,144.0,14.4
2,non-Hispanic American Indian or Alaska Native,9.0,0.9
3,non-Hispanic Asian,34.0,3.4
4,non-Hispanic Black,315.0,31.4
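
The `class_` argument in the cell above is a single space-separated string, which Beautiful Soup compares against the whole class attribute, so the lookup depends on the exact class order. The sketch below contrasts that with a CSS selector on one class. The HTML snippet is hypothetical and only stands in for the CDC page; whether the real page ever reorders its classes is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: both divs carry cdc_raw_html, but the class order differs.
sample_html = """
<div class="dfe-block dfe-block--cdcmodule cdc_raw_html"><table><tr><td>A</td></tr></table></div>
<div class="cdc_raw_html dfe-block"><table><tr><td>B</td></tr></table></div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match: only the div whose class attribute has this exact order
exact = soup.find_all('div', class_='dfe-block dfe-block--cdcmodule cdc_raw_html')

# CSS selector: any div that has the cdc_raw_html class, in any order
by_selector = soup.select('div.cdc_raw_html')

print(len(exact), len(by_selector))
```

If the page layout is stable, both approaches return the same divs; the selector is simply less sensitive to cosmetic changes in the markup.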


In [2]:
# Step 1: Get the page
url = 'https://www.cdc.gov/maternal-mortality/php/data-research/mmrc-2017-2019.html#cdc_research_or_data_summary_explore_more-data-sources-and-methods'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Step 2: Find the container div
container = soup.find('div', class_='dfe-block dfe-block--cdcmodule cdc_raw_html')

# Step 3: Parse the inner raw HTML
inner_html = BeautifulSoup(container.decode_contents(), 'html.parser')

# Step 4: Find all tables
tables = inner_html.find_all('table')

# Step 5: Extract the first table (tables[0]), which holds the data
rows_0 = tables[0].find_all('tr')
data_0 = [[cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])] for row in rows_0]
df0 = pd.DataFrame(data_0)
df0.columns = df0.iloc[0]
df0 = df0.drop(0).reset_index(drop=True)

# Step 6: Extract the second table (tables[1]), which holds the footnotes
rows_1 = tables[1].find_all('tr')
data_1 = [[cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])] for row in rows_1]
df1 = pd.DataFrame(data_1)
df1.columns = df1.iloc[0]  # Note: this promotes footnote "a" to the header row
df1 = df1.drop(0).reset_index(drop=True)

# Optional: View both
print("First table:")
display(df0)

print("Second table:")
display(df1)


First table:


Unnamed: 0,Unnamed: 1,"Number of pregnancy-related deaths (N = 1,018)",%
0,Race and ethnicity,,
1,Hispanic,144.0,14.4
2,non-Hispanic American Indian or Alaska Native,9.0,0.9
3,non-Hispanic Asian,34.0,3.4
4,non-Hispanic Black,315.0,31.4
5,non-Hispanic Native Hawaiian and Other Pacific...,6.0,0.6
6,non-Hispanic White,467.0,46.6
7,non-Hispanic other/multiple races,27.0,2.7
8,Age at death (years),,
9,15–19,29.0,2.9


Second table:


Unnamed: 0,a,Race or ethnicity was missing for 16 (1.6%) pregnancy-related deaths; age was missing for 5 (0.5%) pregnancy-related deaths; education was missing for 30 (2.9%) pregnancy-related deaths.
0,b,Percentages might not sum to 100 because of ro...
1,,
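
The second table holds footnotes rather than data, so promoting its first row to a header turns footnote "a" into the column names, as the output above shows. One possible alternative is to read the footnotes into a dictionary keyed by their letter. The sketch below assumes each footnote is a two-cell row and uses hypothetical placeholder text; `footnotes_to_dict` is an illustrative helper, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical footnote markup with placeholder text; the trailing empty row
# mirrors the blank last row seen in the notebook output.
sample_html = """
<table>
  <tr><td>a</td><td>Placeholder text for footnote a.</td></tr>
  <tr><td>b</td><td>Placeholder text for footnote b.</td></tr>
  <tr><td></td><td></td></tr>
</table>
"""

def footnotes_to_dict(table):
    """Map each footnote letter to its text, skipping rows without a letter."""
    notes = {}
    for tr in table.find_all('tr'):
        cells = [c.get_text(' ', strip=True) for c in tr.find_all(['th', 'td'])]
        if len(cells) >= 2 and cells[0]:
            notes[cells[0]] = cells[1]
    return notes

table = BeautifulSoup(sample_html, 'html.parser').find('table')
notes = footnotes_to_dict(table)
print(notes)
```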


In [5]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Load the page
url = 'https://www.cdc.gov/maternal-mortality/php/data-research/mmrc-2017-2019.html#cdc_research_or_data_summary_explore_more-data-sources-and-methods'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Step 2: Find the div with all tables inside
container = soup.find('div', class_='dfe-block dfe-block--cdcmodule cdc_raw_html')

# Step 3: Parse the inner HTML inside that div
inner_html = BeautifulSoup(container.decode_contents(), 'html.parser')

# Step 4: Find all tables
tables = inner_html.find_all('table')

# Step 5: Convert each table into a pandas DataFrame
dataframes = []
for i, table in enumerate(tables):
    rows = table.find_all('tr')
    data = [[cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])] for row in rows]
    
    # Handle empty or malformed tables safely
    if data:
        df = pd.DataFrame(data)
        df.columns = df.iloc[0]  # Use first row as header
        df = df.drop(0).reset_index(drop=True)
        dataframes.append(df)
        print(f"\n✅ Table {i+1} extracted with shape {df.shape}")
    else:
        print(f"\n⚠️ Table {i+1} was empty or invalid.")

# Step 6: Preview tables
for i, df in enumerate(dataframes):
    print(f"\n📊 Preview of Table {i+1}:")
    display(df)




✅ Table 1 extracted with shape (22, 3)

✅ Table 2 extracted with shape (2, 2)

📊 Preview of Table 1:


Unnamed: 0,Unnamed: 1,"Number of pregnancy-related deaths (N = 1,018)",%
0,Race and ethnicity,,
1,Hispanic,144.0,14.4
2,non-Hispanic American Indian or Alaska Native,9.0,0.9
3,non-Hispanic Asian,34.0,3.4
4,non-Hispanic Black,315.0,31.4
5,non-Hispanic Native Hawaiian and Other Pacific...,6.0,0.6
6,non-Hispanic White,467.0,46.6
7,non-Hispanic other/multiple races,27.0,2.7
8,Age at death (years),,
9,15–19,29.0,2.9



📊 Preview of Table 2:


Unnamed: 0,a,Race or ethnicity was missing for 16 (1.6%) pregnancy-related deaths; age was missing for 5 (0.5%) pregnancy-related deaths; education was missing for 30 (2.9%) pregnancy-related deaths.
0,b,Percentages might not sum to 100 because of ro...
1,,


In [6]:
# Save each table
for i, df in enumerate(dataframes):
    df.to_csv(f'table_{i+1}.csv', index=False)


In [11]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Load the page
url = 'https://www.cdc.gov/maternal-mortality/php/data-research/mmrc-2017-2019.html#cdc_research_or_data_summary_explore_more-data-sources-and-methods'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Step 2: Find all matching divs with the correct class
target_divs = soup.find_all('div', class_='dfe-block dfe-block--cdcmodule cdc_raw_html')

# Step 3: Loop through each matching div and extract tables
for div_index, div in enumerate(target_divs):
    inner_html = BeautifulSoup(div.decode_contents(), 'html.parser')
    tables = inner_html.find_all('table')

    for i, table in enumerate(tables):
        rows = table.find_all('tr')
        data = [[cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])] for row in rows]

        # Save only if there's data
        if data:
            df = pd.DataFrame(data)
            df.columns = df.iloc[0]  # First row as header
            df = df.drop(0).reset_index(drop=True)
            filename = f"div{div_index+1}_table{i+1}.csv"
            df.to_csv(filename, index=False)
            print(f"✅ Saved {filename}")
        else:
            print(f"⚠️ Skipped empty table {i+1} in div {div_index+1}")


✅ Saved div1_table1.csv
✅ Saved div1_table2.csv
✅ Saved div2_table1.csv
✅ Saved div2_table2.csv
✅ Saved div3_table1.csv
✅ Saved div3_table2.csv
✅ Saved div4_table1.csv
✅ Saved div4_table2.csv
✅ Saved div5_table1.csv
✅ Saved div5_table2.csv
✅ Saved div6_table1.csv
✅ Saved div6_table2.csv


In [12]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Load the webpage
url = 'https://www.cdc.gov/maternal-mortality/php/data-research/mmrc-2017-2019.html#cdc_research_or_data_summary_explore_more-data-sources-and-methods'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Step 2: Find all divs with the target class
target_divs = soup.find_all('div', class_='dfe-block dfe-block--cdcmodule cdc_raw_html')

# Step 3: Extract and display tables
for div_index, div in enumerate(target_divs):
    inner_html = BeautifulSoup(div.decode_contents(), 'html.parser')
    tables = inner_html.find_all('table')

    for i, table in enumerate(tables):
        rows = table.find_all('tr')
        data = [[cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])] for row in rows]

        if data:
            df = pd.DataFrame(data)
            df.columns = df.iloc[0]  # Use first row as header
            df = df.drop(0).reset_index(drop=True)
            print(f"\nTable {i+1} from div {div_index+1}:")
            display(df)
        else:
            print(f"\nTable {i+1} from div {div_index+1} is empty or malformed.")



Table 1 from div 1:


Unnamed: 0,Unnamed: 1,"Number of pregnancy-related deaths (N = 1,018)",%
0,Race and ethnicity,,
1,Hispanic,144.0,14.4
2,non-Hispanic American Indian or Alaska Native,9.0,0.9
3,non-Hispanic Asian,34.0,3.4
4,non-Hispanic Black,315.0,31.4
5,non-Hispanic Native Hawaiian and Other Pacific...,6.0,0.6
6,non-Hispanic White,467.0,46.6
7,non-Hispanic other/multiple races,27.0,2.7
8,Age at death (years),,
9,15–19,29.0,2.9



Table 2 from div 1:


Unnamed: 0,a,Race or ethnicity was missing for 16 (1.6%) pregnancy-related deaths; age was missing for 5 (0.5%) pregnancy-related deaths; education was missing for 30 (2.9%) pregnancy-related deaths.
0,b,Percentages might not sum to 100 because of ro...
1,,



Table 1 from div 2:


Unnamed: 0,Unnamed: 1,Number of pregnancy-related deaths,%
0,Urban,714,81.8
1,Rural,159,18.2



Table 2 from div 2:


Unnamed: 0,a,"Among pregnancy-related deaths, 14.2% lacked geographic information based on county of last residence (n = 145); most were missing data (n = 144) or undetermined (n = 1). Urban classification includes metropolitan division (≥2,500,000) and metropolitan (≥50,000–2,499,999); Rural classification includes micropolitan (10,000–49,999) and rural (<10,000) as captured in MMRIA."
0,b,Percentages might not sum to 100 because of ro...
1,,
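
As a quick sanity check on the scraped values, the urban/rural percentages appear to use only the deaths with known geography as the denominator. The sketch below reproduces them from numbers shown above (N = 1,018 from the first table, 145 deaths without geographic information from footnote a). That this is how the CDC computed the percentages is an inference from the figures, not something the page output states.

```python
total_deaths = 1018        # N from the header of the first table
missing_geography = 145    # from footnote a of the urban/rural table
counts = {'Urban': 714, 'Rural': 159}

# Deaths with a known urban/rural classification
denominator = total_deaths - missing_geography

# Recompute each percentage to one decimal place
percentages = {k: round(100 * v / denominator, 1) for k, v in counts.items()}

print(denominator, percentages)  # 873 {'Urban': 81.8, 'Rural': 18.2}
```

The recomputed values match the table, and the two counts sum to the same denominator, which suggests the scrape captured the rows correctly.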



Table 1 from div 3:


Unnamed: 0,Unnamed: 1,Number of pregnancy-related deaths,%
0,During pregnancy,216,21.6
1,Day of delivery,132,13.2
2,1–6 days postpartum,120,12.0
3,7–42 days postpartum,233,23.3
4,43–365 days postpartum,301,30.0



Table 2 from div 3:


Unnamed: 0,a,Specific timing information is missing (n = 2) orunknown(n = 14) for 16 (1.6%) pregnancy-related deaths.
0,b,Percentages might not sum to 100 because of ro...
1,,
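
The fused words in footnote a above ("orunknown(n = 14)") are consistent with how `get_text(strip=True)` works: it strips each text fragment and joins them with an empty separator, so spaces next to inline tags are lost. The sketch below shows the effect and one way to avoid it by passing a space as the separator. The `<em>` tag is an assumption about the page markup, used only to reproduce the symptom.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an inline tag around "unknown" is assumed, not confirmed.
cell_html = (
    '<table><tr><td>Specific timing information is missing (n = 2) or '
    '<em>unknown</em> (n = 14)</td></tr></table>'
)
cell = BeautifulSoup(cell_html, 'html.parser').find('td')

# Default separator is '', so stripped fragments are glued together
fused = cell.get_text(strip=True)

# A single space as separator keeps the words apart
spaced = cell.get_text(' ', strip=True)

print(fused)
print(spaced)
```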



Table 1 from div 4:


Unnamed: 0,Unnamed: 1,Total,Hispanic,Non-Hispanic,None,None.1,None.2,None.3,None.4,None.5,None.6,None.7,None.8,None.9,None.10
0,Condition,,,,,AI/AN,Asian,Black,NHOPI,White,,,,,
1,Number of pregnancy-related deaths,%,Number of pregnancy-related deaths,%,Number of pregnancy-related deaths,%,Number of pregnancy-related deaths,%,Number of pregnancy-related deaths,%,Number of pregnancy-related deaths,%,Number of pregnancy-related deaths,%,
2,Mental health conditionsc,224,22.7,34,24.1,2,-,1,3.1,21,7.0,0,-,159,34.8
3,Hemorrhaged,135,13.7,30,21.3,2,-,10,31.3,33,10.9,1,-,53,11.6
4,Cardiac and coronary conditionse,126,12.8,15,10.6,1,-,7,21.9,48,15.9,0,-,49,10.7
5,Infection,91,9.2,15,10.6,1,-,0,0.0,23,7.6,0,-,49,10.7
6,Embolism-thrombotic,86,8.7,9,6.4,0,-,2,6.3,36,11.9,0,-,34,7.4
7,Cardiomyopathy,84,8.5,5,3.6,0,-,2,6.3,42,13.9,0,-,33,7.2
8,Hypertensive disorders of pregnancy,64,6.5,7,5.0,0,-,1,3.1,30,9.9,1,-,22,4.8
9,Amniotic fluid embolism,37,3.8,6,4.3,1,-,7,21.9,10,3.3,2,-,9,2.0
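
Several condition labels above end with a footnote letter fused onto the name ("Mental health conditionsc", "Hemorrhaged", "Cardiac and coronary conditionse"). A minimal sketch for separating the label from its marker follows, assuming the page renders footnote letters in `<sup>` tags; the markup below is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the footnote letter is assumed to sit in a <sup> tag.
sample_html = '<table><tr><td>Hemorrhage<sup>d</sup></td><td>135</td></tr></table>'
cell = BeautifulSoup(sample_html, 'html.parser').find('td')

# Without special handling, the marker is glued to the label
fused = cell.get_text(strip=True)

# Record the markers, then remove the <sup> tags before reading the label
markers = [sup.get_text(strip=True) for sup in cell.find_all('sup')]
for sup in cell.find_all('sup'):
    sup.decompose()
label = cell.get_text(strip=True)

print(fused, markers, label)
```

Keeping the markers in a separate list preserves the link between each condition and its footnote.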



Table 2 from div 4:


Unnamed: 0,a,"Specific cause of death was missing (n = 10) or listed as ""unknown"" (n = 21) for a total of 31 (3.0%) pregnancy-related deaths. Only underlying causes with at least 10 pregnancy-related deaths total are included in the table; therefore, the causes in the table may not reflect all causes of death overall or for each race ethnicity category. Percentages are not presented when the denominator is <10."
0,b,Race or ethnicity was missing for 16 (1.6%) pr...
1,c,Mental health conditions include deaths of sui...
2,d,Excludes aneurysms or cerebrovascular accident.
3,e,Cardiac and coronary conditions include deaths...
4,f,"Injury includes intentional injury (homicide),..."
5,,
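
All scraped cells are strings, and footnote a notes that percentages are not presented when the denominator is below 10, which is why some cells contain "-". Before analysis, the columns can be converted with `pd.to_numeric`, turning the dashes into missing values. The sketch below uses two rows copied from the output above; the column names and the cleaned condition labels are illustrative assumptions.

```python
import pandas as pd

# Two rows from the cause-of-death table (Total and NHOPI columns only);
# column names here are hypothetical.
df = pd.DataFrame({
    'Condition': ['Hemorrhage', 'Infection'],
    'Total n': ['135', '91'],
    'Total %': ['13.7', '9.2'],
    'NHOPI n': ['1', '0'],
    'NHOPI %': ['-', '-'],
})

# errors='coerce' converts anything non-numeric, such as '-', to NaN
numeric = df.drop(columns='Condition').apply(pd.to_numeric, errors='coerce')
numeric.insert(0, 'Condition', df['Condition'])

print(numeric)
```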



Table 1 from div 5:


Unnamed: 0,Unnamed: 1,Suicide,Homicide,None,None.1
0,,Number of pregnancy-related deaths,%,Number of pregnancy-related deaths,%
1,No,880,90.6,971,97.0
2,Yes,82,8.4,29,2.9
3,Probably,9,0.9,1,0.1



Table 2 from div 5:


Unnamed: 0,a,"A suicide manner of death determination was missing (n = 5) or listed as ""unknown"" (n = 42) for a total of 47 (4.6%) pregnancy-related deaths. A homicide manner of death determination was missing (n = 10) or listed as ""unknown"" (n = 7) for a total of 17 (1.7%) of pregnancy-related deaths."
0,b,Percentages might not sum to 100 because of ro...
1,,
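
The suicide/homicide table has a two-row header, so using only the first row as column names leaves `None` columns and pushes the second header row into the data, as shown above. One way to handle this is to repeat each header cell across the columns it spans and then join the two levels. The sketch below assumes the page uses `colspan` attributes; the markup is a hypothetical reconstruction from the output, and "Response" is an invented label for the blank first column.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup modelled on the output above; colspan usage is assumed.
sample_html = """
<table>
  <tr><th></th><th colspan="2">Suicide</th><th colspan="2">Homicide</th></tr>
  <tr><th></th><th>Number of pregnancy-related deaths</th><th>%</th>
      <th>Number of pregnancy-related deaths</th><th>%</th></tr>
  <tr><td>No</td><td>880</td><td>90.6</td><td>971</td><td>97.0</td></tr>
  <tr><td>Yes</td><td>82</td><td>8.4</td><td>29</td><td>2.9</td></tr>
  <tr><td>Probably</td><td>9</td><td>0.9</td><td>1</td><td>0.1</td></tr>
</table>
"""

def expand_row(tr):
    """Return the row's cell texts, repeating each cell once per spanned column."""
    texts = []
    for cell in tr.find_all(['th', 'td']):
        span = int(cell.get('colspan', 1))
        texts.extend([cell.get_text(' ', strip=True)] * span)
    return texts

rows = [expand_row(tr) for tr in BeautifulSoup(sample_html, 'html.parser').find_all('tr')]
top, sub, body = rows[0], rows[1], rows[2:]

# Join the two header levels, skipping empty parts; 'Response' is a made-up label
columns = [' | '.join(part for part in pair if part) or 'Response' for pair in zip(top, sub)]
df = pd.DataFrame(body, columns=columns)

print(df.columns.tolist())
```

The same idea would extend to the three-row header of the cause-of-death table, although that table's exact span layout would need to be checked against the page source first.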



Table 1 from div 6:


Unnamed: 0,Unnamed: 1,Number of pregnancy-related deaths,%
0,Preventable,839,84.2
1,Not Preventable,157,15.8



Table 2 from div 6:


Unnamed: 0,a,A preventability determination was missing (n = 4) or unable to be determined (n = 18) for a total of 22 (2.2%) pregnancy-related deaths.
0,b,Percentages might not sum to 100 because of ro...
1,,
