<a href="https://colab.research.google.com/github/soymlk94/datavis_sp24/blob/main/ps1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Title:** Analyzing HPV Vaccination, Prevalence, and Cancer Rates in the United States

**Abstract**

This analysis explores the relationship between HPV vaccination rates, prevalence, and HPV-related cancer rates in the United States. By merging datasets from multiple sources, we aim to evaluate the effectiveness of vaccination programs and identify disparities based on demographic and socioeconomic factors. The research focuses on how HPV prevalence and vaccination rates correlate with HPV-related cancer cases and whether external factors (e.g., insurance coverage and race) influence these trends.

**Research Question**

How do HPV vaccination rates impact HPV prevalence and HPV-related cancer rates, and what role do demographic factors (age, gender, socioeconomic status) play in this relationship?

**Hypotheses**

Higher HPV vaccination rates correlate with lower HPV prevalence.

Groups with lower vaccination coverage (e.g., uninsured, lower-income populations) have higher HPV-related cancer rates.

HPV-related cancer rates are higher in populations with historically lower vaccination uptake.

**Data Sources & Justification**

**HPV Vaccination Data (USA)** - Provides vaccination coverage rates by age, sex, and race.

**HPV Prevalence Data** - Shows the prevalence of different HPV types in the population.

**HPV Cancer Data** - Contains HPV-related cancer rates and total cases by demographic factors.

**WHO Vaccination Data (USA)** - Offers additional insights into national HPV vaccine coverage trends.

By merging these datasets, we can analyze trends over time, compare vaccination rates against cancer cases, and identify potential policy improvements.

**Data Manipulation & Methods**

To ensure data consistency and usability, the following steps were performed:

**Renamed Variables** - Standardized column names across datasets.

**Replaced Values** - Missing values in the cancer data's `Cases_x` column were filled with 0.

**Dropped/Kept Variables** - Removed irrelevant columns and kept necessary ones for analysis.

**Collapsed Data (Aggregated Data by Groups)** - Summarized vaccination, prevalence, and cancer rates using groupby and aggregation.

**Merged Datasets** - Integrated vaccination, prevalence, and cancer datasets using demographic variables (age, sex, race, and insurance status).
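
The collapse step is not shown in the code cells, so here is a minimal sketch of what a groupby aggregation over the WHO coverage table might look like. The five rows are copied from the WHO table displayed in this notebook; averaging across antigen codes within a year is an assumption made for illustration, not the notebook's confirmed method.

```python
import pandas as pd

# Five rows copied from the WHO coverage table shown in this notebook.
who_sample = pd.DataFrame({
    "YEAR": [2023, 2023, 2023, 2010, 2010],
    "ANTIGEN": ["15HPV1_F", "15HPVC_M", "15HPV1_M", "15HPVC_F", "PRHPVC_F"],
    "COVERAGE": [80.0, 63.0, 78.0, 32.0, 23.0],
})

# Collapse to one row per year (assumed grouping): mean coverage and row count.
coverage_by_year = (
    who_sample.groupby("YEAR")["COVERAGE"]
    .agg(mean_coverage="mean", n_rows="count")
    .reset_index()
)
print(coverage_by_year)
```

Mixing first-dose and last-dose codes in one mean is only for demonstration; a real summary would likely group by antigen code as well.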

**Conclusion & Future Work**

This study provides an evidence-based look at HPV vaccination effectiveness. Future research could expand by incorporating longitudinal data and analyzing the impact of new vaccination policies. Addressing disparities in vaccination rates may help reduce HPV-related cancer incidence in vulnerable populations.




In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [123]:
hpv_vaccination_df = pd.read_excel("/content/HPV Vaccination_Data.xlsx")
hpv_vaccination_2_df = pd.read_excel("/content/HPV Vaccinations 2(USA).xlsx")
hpv_cancer_df = pd.read_csv("/content/HPV Cancer.csv")
hpv_prevalence_df = pd.read_excel("/content/HPV Prevalence(Final).xlsx")

In [None]:
# Set Pandas display options for better visualization
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)


In [101]:
# Display initial datasets as tables
print("HPV Vaccination Data:")
display(hpv_vaccination_df)

print("HPV Vaccination 2 Data:")
display(hpv_vaccination_2_df)

print("HPV Cancer Data:")
display(hpv_cancer_df)

print("HPV Prevalence Data:")
display(hpv_prevalence_df)


HPV Vaccination Data:


Unnamed: 0,category,hpv vaccination rate (%)
0,Total,38.6
1,Age 9-10,7.3
2,Age 11-12,30.9
3,Age 13-14,48.8
4,Age 15-17,56.9
5,Girls,42.9
6,Boys,34.6
7,"White, non-Hispanic",39.9
8,Hispanic,34.4
9,"Black, non-Hispanic",


HPV Vaccination 2 Data:


Unnamed: 0,GROUP,CODE,NAME,YEAR,ANTIGEN,ANTIGEN_DESCRIPTION,COVERAGE_CATEGORY,COVERAGE_CATEGORY_DESCRIPTION,TARGET_NUMBER,DOSES,COVERAGE
0,COUNTRIES,USA,United States of America,2023.0,HPV_MALE,"HPV Male, final dose",ADMIN,Administrative coverage,,,
1,COUNTRIES,USA,United States of America,2023.0,HPV_FEM,"HPV Female, final dose",ADMIN,Administrative coverage,,,
2,COUNTRIES,USA,United States of America,2023.0,15HPV1_F,"HPV Vaccination coverage by age 15, first dose...",HPV,HPV Estimates,,,80.0
3,COUNTRIES,USA,United States of America,2023.0,15HPVC_M,"HPV Vaccination coverage by age 15, last dose,...",HPV,HPV Estimates,,,63.0
4,COUNTRIES,USA,United States of America,2023.0,15HPV1_M,"HPV Vaccination coverage by age 15, first dose...",HPV,HPV Estimates,,,78.0
...,...,...,...,...,...,...,...,...,...,...,...
132,COUNTRIES,USA,United States of America,2010.0,PRHPV1_M,"HPV Vaccination program coverage, first dose, ...",HPV,HPV Estimates,,,
133,COUNTRIES,USA,United States of America,2010.0,15HPVC_F,"HPV Vaccination coverage by age 15, last dose,...",HPV,HPV Estimates,,,32.0
134,COUNTRIES,USA,United States of America,2010.0,PRHPVC_M,"HPV Vaccination program coverage, last dose, m...",HPV,HPV Estimates,,,
135,COUNTRIES,USA,United States of America,2010.0,PRHPVC_F,"HPV Vaccination program coverage, last dose, f...",HPV,HPV Estimates,,,23.0


HPV Cancer Data:


Unnamed: 0,Sex,cancer type,Rates,Cases_x,Percentage,HPV Type,Cases_y,Race,Rate
0,All,Total,12.6,,,,,,
1,All,Anus*,2.0,,,,,,
2,All,Oropharynx,5.2,,,,,,
3,Females,Total,14.0,,,,,Non-Hispanic White,14.6
4,Females,Total,14.0,,,,,Non-Hispanic Black,13.0
...,...,...,...,...,...,...,...,...,...
88,Male,Anus*,,,,HPV-negative,291.0,,
89,Male,Oropharynx,,,,Caused by HPV types 16 and 18,11300.0,,
90,Male,Oropharynx,,,,Caused by HPV types 31/33/45/52/58,800.0,,
91,Male,Oropharynx,,,,Caused by other HPV types,800.0,,


HPV Prevalence Data:


Unnamed: 0,category,prevelance (%),Confidence Interval (95%)
0,Any Oral HPV (Total),7.3,6.1-8.5
1,Any Oral HPV (Men),11.5,9.9-13.1
2,Any Oral HPV (Women),3.3,2.5-4.1
3,Any Oral HPV (Non-Hispanic Asian),2.9,2.1-3.7
4,Any Oral HPV (Non-Hispanic Black),9.7,8.1-11.3
5,Any Oral HPV (Non-Hispanic White),7.3,6.1-8.5
6,Any Oral HPV (Hispanic),7.0,5.8-8.2
7,High-Risk Oral HPV (Total),4.0,3.2-4.8
8,High-Risk Oral HPV (Men),6.8,5.4-8.2
9,High-Risk Oral HPV (Women),1.2,0.8-1.6


In [102]:
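# Rename variables (rename var)
# Note: rename() ignores keys that match no existing column. The loaded columns are
# 'category' / 'hpv vaccination rate (%)' and 'cancer type ' / 'Rates', so none of the
# keys below match and both frames come back unchanged (see the output that follows).
hpv_vaccination_df = hpv_vaccination_df.rename(columns={
    "Category": "HPV_Vaccine_Type", "Prevalence (%)": "Vaccine_Coverage_Rate"
})
hpv_cancer_df = hpv_cancer_df.rename(columns={
    "Cancer_Type": "Cancer_Type_Category", "Cancer Rate (%)": "Cancer_Prevalence_Rate"
})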

In [103]:
# Display after renaming
print("\nRenamed Columns in HPV Vaccination Data:")
display(hpv_vaccination_df)

print("\nRenamed Columns in HPV Cancer Data:")
display(hpv_cancer_df)


Renamed Columns in HPV Vaccination Data:


Unnamed: 0,category,hpv vaccination rate (%)
0,Total,38.6
1,Age 9-10,7.3
2,Age 11-12,30.9
3,Age 13-14,48.8
4,Age 15-17,56.9
5,Girls,42.9
6,Boys,34.6
7,"White, non-Hispanic",39.9
8,Hispanic,34.4
9,"Black, non-Hispanic",



Renamed Columns in HPV Cancer Data:


Unnamed: 0,Sex,cancer type,Rates,Cases_x,Percentage,HPV Type,Cases_y,Race,Rate
0,All,Total,12.6,,,,,,
1,All,Anus*,2.0,,,,,,
2,All,Oropharynx,5.2,,,,,,
3,Females,Total,14.0,,,,,Non-Hispanic White,14.6
4,Females,Total,14.0,,,,,Non-Hispanic Black,13.0
...,...,...,...,...,...,...,...,...,...
88,Male,Anus*,,,,HPV-negative,291.0,,
89,Male,Oropharynx,,,,Caused by HPV types 16 and 18,11300.0,,
90,Male,Oropharynx,,,,Caused by HPV types 31/33/45/52/58,800.0,,
91,Male,Oropharynx,,,,Caused by other HPV types,800.0,,


In [104]:
print(hpv_cancer_df.columns)

Index(['Sex', 'cancer type ', 'Rates', 'Cases_x', 'Percentage', 'HPV Type',
       'Cases_y', 'Race', 'Rate'],
      dtype='object')
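
The index printed above still shows the original labels, including a trailing space in `'cancer type '`, because `DataFrame.rename` skips mapping keys that are not present. Below is a small sketch, on a stand-in frame with a few of the same column labels, of how `errors="raise"` surfaces that. The target name `cancer_type` is a hypothetical choice.

```python
import pandas as pd

# Stand-in frame with column labels from the printout above (note the trailing space).
cancer_cols = pd.DataFrame(columns=["Sex", "cancer type ", "Rates"])

# By default, keys that match no column are ignored and nothing changes.
unchanged = cancer_cols.rename(columns={"Cancer_Type": "Cancer_Type_Category"})
print(list(unchanged.columns))

# errors="raise" turns the silent no-op into a KeyError.
try:
    cancer_cols.rename(columns={"Cancer_Type": "Cancer_Type_Category"}, errors="raise")
    missing_key_detected = False
except KeyError:
    missing_key_detected = True

# Renaming with the exact existing label (hypothetical new name) does take effect.
renamed = cancer_cols.rename(columns={"cancer type ": "cancer_type"}, errors="raise")
print(list(renamed.columns))
```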


In [105]:
# Replace missing values (replace vals)
hpv_cancer_df["Cases_x"] = hpv_cancer_df["Cases_x"].fillna(0)
# Note: GROUP holds "COUNTRIES"; the country name is stored in NAME, so this replace matches nothing.
hpv_vaccination_2_df["GROUP"] = hpv_vaccination_2_df["GROUP"].replace({"United States of America": "USA"})


In [106]:
# Display after replacing values
print("\nHPV Cancer Data after Replacing Missing Values:")
display(hpv_cancer_df)



HPV Cancer Data after Replacing Missing Values:


Unnamed: 0,Sex,cancer type,Rates,Cases_x,Percentage,HPV Type,Cases_y,Race,Rate
0,All,Total,12.6,0.0,,,,,
1,All,Anus*,2.0,0.0,,,,,
2,All,Oropharynx,5.2,0.0,,,,,
3,Females,Total,14.0,0.0,,,,Non-Hispanic White,14.6
4,Females,Total,14.0,0.0,,,,Non-Hispanic Black,13.0
...,...,...,...,...,...,...,...,...,...
88,Male,Anus*,,0.0,,HPV-negative,291.0,,
89,Male,Oropharynx,,0.0,,Caused by HPV types 16 and 18,11300.0,,
90,Male,Oropharynx,,0.0,,Caused by HPV types 31/33/45/52/58,800.0,,
91,Male,Oropharynx,,0.0,,Caused by other HPV types,800.0,,


In [107]:
print(hpv_vaccination_2_df.columns)

Index(['GROUP', 'CODE', 'NAME', 'YEAR', 'ANTIGEN', 'ANTIGEN_DESCRIPTION',
       'COVERAGE_CATEGORY', 'COVERAGE_CATEGORY_DESCRIPTION', 'TARGET_NUMBER',
       'DOSES', 'COVERAGE'],
      dtype='object')


In [108]:
# Drop or keep variables (drop or keep vars)
hpv_vaccination_2_df = hpv_vaccination_2_df.drop(columns=["COVERAGE_CATEGORY"])
hpv_cancer_df = hpv_cancer_df.drop(columns=["Race"])


In [109]:
# Display after dropping variables
print("\nHPV Vaccination 2 Data after Dropping Columns:")
display(hpv_vaccination_2_df)


HPV Vaccination 2 Data after Dropping Columns:


Unnamed: 0,GROUP,CODE,NAME,YEAR,ANTIGEN,ANTIGEN_DESCRIPTION,COVERAGE_CATEGORY_DESCRIPTION,TARGET_NUMBER,DOSES,COVERAGE
0,COUNTRIES,USA,United States of America,2023.0,HPV_MALE,"HPV Male, final dose",Administrative coverage,,,
1,COUNTRIES,USA,United States of America,2023.0,HPV_FEM,"HPV Female, final dose",Administrative coverage,,,
2,COUNTRIES,USA,United States of America,2023.0,15HPV1_F,"HPV Vaccination coverage by age 15, first dose...",HPV Estimates,,,80.0
3,COUNTRIES,USA,United States of America,2023.0,15HPVC_M,"HPV Vaccination coverage by age 15, last dose,...",HPV Estimates,,,63.0
4,COUNTRIES,USA,United States of America,2023.0,15HPV1_M,"HPV Vaccination coverage by age 15, first dose...",HPV Estimates,,,78.0
...,...,...,...,...,...,...,...,...,...,...
132,COUNTRIES,USA,United States of America,2010.0,PRHPV1_M,"HPV Vaccination program coverage, first dose, ...",HPV Estimates,,,
133,COUNTRIES,USA,United States of America,2010.0,15HPVC_F,"HPV Vaccination coverage by age 15, last dose,...",HPV Estimates,,,32.0
134,COUNTRIES,USA,United States of America,2010.0,PRHPVC_M,"HPV Vaccination program coverage, last dose, m...",HPV Estimates,,,
135,COUNTRIES,USA,United States of America,2010.0,PRHPVC_F,"HPV Vaccination program coverage, last dose, f...",HPV Estimates,,,23.0


In [110]:
print(hpv_vaccination_df.columns)

Index(['category', 'hpv vaccination rate (%)'], dtype='object')


In [111]:
# Standardize column names (strip spaces, lowercase)
def clean_columns(df):
    df.columns = df.columns.str.lower().str.strip()
    return df

In [132]:
# Standardize category names for merging
hpv_vaccination_df["category"] = hpv_vaccination_df["category"].str.lower().str.strip()
hpv_prevalence_df["category"] = hpv_prevalence_df["category"].str.lower().str.strip()
hpv_cancer_df["cancer type"] = hpv_cancer_df["cancer type"].str.lower().str.strip()
hpv_vaccination_2_df["antigen"] = hpv_vaccination_2_df["antigen"].str.lower().str.strip()


In [133]:
print("Columns in hpv_vaccination_df:", hpv_vaccination_df.columns)
print("Columns in hpv_prevalence_df:", hpv_prevalence_df.columns)
print("Columns in hpv_cancer_df:", hpv_cancer_df.columns)
print("Columns in hpv_vaccination_2_df:", hpv_vaccination_2_df.columns)

Columns in hpv_vaccination_df: Index(['category', 'hpv vaccination rate (%)'], dtype='object')
Columns in hpv_prevalence_df: Index(['category', 'prevelance (%)', 'Confidence Interval (95%)'], dtype='object')
Columns in hpv_cancer_df: Index(['Sex', 'cancer type', 'Rates', 'Cases_x', 'Percentage', 'HPV Type',
       'Cases_y', 'Race', 'Rate'],
      dtype='object')
Columns in hpv_vaccination_2_df: Index(['GROUP', 'CODE', 'NAME', 'YEAR', 'antigen', 'ANTIGEN_DESCRIPTION',
       'COVERAGE_CATEGORY', 'COVERAGE_CATEGORY_DESCRIPTION', 'TARGET_NUMBER',
       'DOSES', 'COVERAGE'],
      dtype='object')


In [134]:
print("Unique Categories in hpv_vaccination_df:")
print(hpv_vaccination_df["category"].unique())

print("\nUnique Categories in hpv_prevalence_df:")
print(hpv_prevalence_df["category"].unique())

print("\nUnique Categories in hpv_cancer_df:")
print(hpv_cancer_df["cancer type"].unique())

print("\nUnique Categories in hpv_vaccination_2_df:")
print(hpv_vaccination_2_df["antigen"].unique())


Unique Categories in hpv_vaccination_df:
['total' 'age 9-10' 'age 11-12' 'age 13-14' 'age 15-17' 'girls' 'boys'
 'white, non-hispanic' 'hispanic' 'black, non-hispanic'
 'asian, non-hispanic' 'private insurance' 'medicaid' 'other government'
 'uninsured' 'high school or less' 'associate’s degree or some college'
 'bachelor’s degree or higher' 'less than 100% of fpl'
 '100% to less than 200% of fpl' '200% to less than 400% of fpl'
 '400% or more of fpl' 'with disability' 'without disability'
 'large central metro' 'large fringe metro' 'medium & small metro'
 'non-metro']

Unique Categories in hpv_prevalence_df:
['any oral hpv (total)' 'any oral hpv (men)' 'any oral hpv (women)'
 'any oral hpv (non-hispanic asian)' 'any oral hpv (non-hispanic black)'
 'any oral hpv (non-hispanic white)' 'any oral hpv (hispanic)'
 'high-risk oral hpv (total)' 'high-risk oral hpv (men)'
 'high-risk oral hpv (women)' 'high-risk oral hpv (non-hispanic asian)'
 'high-risk oral hpv (non-hispanic black)'
 'high-

In [135]:
category_mapping = {
    # Keys use a plain hyphen so they match the stored labels (e.g. 'age 9-10')
    "age 9-10": "age group: 9-10",
    "age 11-12": "age group: 11-12",
    "age 13-14": "age group: 13-14",
    "age 15-17": "age group: 15-17",
    "white, non-hispanic": "non-hispanic white",
    "black, non-hispanic": "non-hispanic black",
    "hispanic": "hispanic",
    "uninsured": "no insurance",
    "private insurance": "insured",
    "cervix": "hpv-related cervical cancer",
    "penis": "hpv-related penile cancer",
    "vagina": "hpv-related vaginal cancer",
    "anus": "hpv-related anal cancer",
    "oropharynx": "oropharyngeal cancer",
    "hpv_male": "male",
    "hpv_fem": "female",
}
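
`Series.replace` with a dict matches whole cell values exactly, so a key only applies when it equals the stored label character for character. A hyphen typed as an en dash, or a trailing `*` as in `Anus*`, is enough to prevent a match. The following is a sketch of a label normalizer that could be applied to both the mapping keys and the data before mapping; `normalize_label` is a hypothetical helper, not part of the notebook.

```python
import re

def normalize_label(label: str) -> str:
    """Lowercase, trim, unify dashes, and drop footnote asterisks (hypothetical helper)."""
    text = str(label).strip().lower()
    text = text.replace("\u2013", "-").replace("\u2014", "-")  # en/em dash -> hyphen
    text = text.rstrip("*").strip()                            # 'anus*' -> 'anus'
    return re.sub(r"\s+", " ", text)                           # collapse repeated spaces

print(normalize_label("Age 9\u201310"))
print(normalize_label("Anus*"))
```

On a DataFrame this could be applied with something like `df["category"].map(normalize_label)`.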

In [124]:
hpv_vaccination_df["category"] = hpv_vaccination_df["category"].replace(category_mapping)
hpv_prevalence_df["category"] = hpv_prevalence_df["category"].replace(category_mapping)
hpv_cancer_df["cancer type"] = hpv_cancer_df["cancer type"].replace(category_mapping)
hpv_vaccination_2_df["antigen"] = hpv_vaccination_2_df["antigen"].replace(category_mapping)

In [144]:
hpv_vaccination_df["category"] = hpv_vaccination_df["category"].astype(str).str.lower().str.strip()
hpv_prevalence_df["category"] = hpv_prevalence_df["category"].astype(str).str.lower().str.strip()
hpv_cancer_df["cancer type"] = hpv_cancer_df["cancer type"].astype(str).str.lower().str.strip()



In [74]:
merged_debug = hpv_vaccination_df.merge(
    hpv_prevalence_df, on="Category", how="outer", indicator=True
)

print("\n Merge Indicator Counts:")
print(merged_debug["_merge"].value_counts())  # Shows which rows matched or didn't

display(merged_debug)  # Show the full merged table



🔹 Merge Indicator Counts:
_merge
left_only     28
right_only    28
both           0
Name: count, dtype: int64


Unnamed: 0,Category,HPV Vaccination Rate (%),Prevalence (%),Confidence Interval (95%),_merge
0,100% to less than 200% of fpl,35.5,,,left_only
1,200% to less than 400% of fpl,36.5,,,left_only
2,400% or more of fpl,45.7,,,left_only
3,age 11-12,30.9,,,left_only
4,age 13-14,48.8,,,left_only
5,age 15-17,56.9,,,left_only
6,age 9-10,7.3,,,left_only
7,any genital hpv (hispanic),,41.4,39.1-43.7,right_only
8,any genital hpv (men),,45.2,42.7-47.7,right_only
9,any genital hpv (non-hispanic asian),,23.8,21.5-26.1,right_only


In [148]:
# Merge 1: HPV Vaccination with Prevalence Data
merged_vaccination_prevalence = hpv_vaccination_df.merge(
    hpv_prevalence_df, on="category", how="inner"
)
print("\n🔹 Merged HPV Vaccination and Prevalence Data:")
display(merged_vaccination_prevalence)

# Merge 2: Adding HPV Cancer Data
merged_vaccination_prevalence_cancer = merged_vaccination_prevalence.merge(
    hpv_cancer_df, left_on="category", right_on="cancer type", how="left"
).drop(columns=["cancer type"])

print("\n Merged with HPV Cancer Data:")
display(merged_vaccination_prevalence_cancer)

# Merge 3: Adding WHO HPV Vaccination Data
final_merged_dataset = merged_vaccination_prevalence_cancer.merge(
    hpv_vaccination_2_df, left_on="category", right_on="antigen", how="left"
).drop(columns=["antigen"])

print("\n Final Merged Dataset with WHO HPV Vaccination Data:")
display(final_merged_dataset)



🔹 Merged HPV Vaccination and Prevalence Data:


Unnamed: 0,category,hpv vaccination rate (%),prevelance (%),Confidence Interval (95%)



 Merged with HPV Cancer Data:


Unnamed: 0,category,hpv vaccination rate (%),prevelance (%),Confidence Interval (95%),Sex,Rates,Cases_x,Percentage,HPV Type,Cases_y,Race,Rate



 Final Merged Dataset with WHO HPV Vaccination Data:


Unnamed: 0,category,hpv vaccination rate (%),prevelance (%),Confidence Interval (95%),Sex,Rates,Cases_x,Percentage,HPV Type,Cases_y,Race,Rate,GROUP,CODE,NAME,YEAR,ANTIGEN_DESCRIPTION,COVERAGE_CATEGORY,COVERAGE_CATEGORY_DESCRIPTION,TARGET_NUMBER,DOSES,COVERAGE


In [80]:
print("Dataset Shape:", final_merged_dataset.shape)  # Check if DataFrame is empty
print(final_merged_dataset.head())  # Show the first few rows

Dataset Shape: (0, 22)
Empty DataFrame
Columns: [Category, HPV Vaccination Rate (%), Prevalence (%), Confidence Interval (95%), Sex, Rates, Cases_x, Percentage, HPV Type, Cases_y, Race, Rate, GROUP, CODE, NAME, YEAR, ANTIGEN_DESCRIPTION, COVERAGE_CATEGORY, COVERAGE_CATEGORY_DESCRIPTION, TARGET_NUMBER, DOSES, COVERAGE]
Index: []
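
The empty result is consistent with the first merge being an inner join on keys that share no values; the later left joins cannot add rows to an empty frame. Here is a hedged sketch of a guard that checks key overlap before merging. `merge_with_overlap_check` is a hypothetical helper, and the two small frames reuse values shown earlier in this notebook.

```python
import pandas as pd

def merge_with_overlap_check(left, right, key, how="inner"):
    """Merge on `key`, raising early if the two key columns share no values."""
    shared = set(left[key].dropna()) & set(right[key].dropna())
    if not shared:
        raise ValueError(f"no overlapping values in {key!r}; an inner merge would be empty")
    return left.merge(right, on=key, how=how)

vacc = pd.DataFrame({"category": ["total", "girls", "boys"],
                     "hpv vaccination rate (%)": [38.6, 42.9, 34.6]})
prev = pd.DataFrame({"category": ["any oral hpv (total)", "any oral hpv (men)"],
                     "prevelance (%)": [7.3, 11.5]})

try:
    merge_with_overlap_check(vacc, prev, "category")
    guard_raised = False
except ValueError as err:
    guard_raised = True
    print(err)
```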


In [81]:
print(final_merged_dataset.columns)


Index(['Category', 'HPV Vaccination Rate (%)', 'Prevalence (%)',
       'Confidence Interval (95%)', 'Sex', 'Rates', 'Cases_x', 'Percentage',
       'HPV Type', 'Cases_y', 'Race', 'Rate', 'GROUP', 'CODE', 'NAME', 'YEAR',
       'ANTIGEN_DESCRIPTION', 'COVERAGE_CATEGORY',
       'COVERAGE_CATEGORY_DESCRIPTION', 'TARGET_NUMBER', 'DOSES', 'COVERAGE'],
      dtype='object')


In [137]:
# Find categories that do not overlap
set1 = set(hpv_vaccination_df["category"].unique())
set2 = set(hpv_prevalence_df["category"].unique())
set3 = set(hpv_cancer_df["cancer type"].unique())
set4 = set(hpv_vaccination_2_df["antigen"].unique())

# Print non-matching categories
print(" Categories in hpv_vaccination_df NOT in hpv_prevalence_df:", set1 - set2)
print(" Categories in hpv_prevalence_df NOT in hpv_vaccination_df:", set2 - set1)
print("Categories in hpv_vaccination_df NOT in hpv_cancer_df:", set1 - set3)
print("Categories in hpv_vaccination_2_df NOT in hpv_vaccination_df:", set4 - set1)


 Categories in hpv_vaccination_df NOT in hpv_prevalence_df: {'age 13-14', 'with disability', 'uninsured', 'age 9-10', 'medium & small metro', 'white, non-hispanic', 'boys', 'bachelor’s degree or higher', 'large central metro', 'without disability', 'hispanic', '200% to less than 400% of fpl', 'girls', 'non-metro', 'asian, non-hispanic', 'black, non-hispanic', '400% or more of fpl', 'age 11-12', 'associate’s degree or some college', 'age 15-17', 'other government', 'private insurance', 'high school or less', 'less than 100% of fpl', '100% to less than 200% of fpl', 'large fringe metro', 'medicaid', 'total'}
 Categories in hpv_prevalence_df NOT in hpv_vaccination_df: {'any genital hpv (total)', 'high-risk genital hpv (hispanic)', 'high-risk oral hpv (non-hispanic black)', 'any oral hpv (non-hispanic asian)', 'any oral hpv (hispanic)', 'any genital hpv (women)', 'any oral hpv (total)', 'any genital hpv (non-hispanic white)', 'high-risk oral hpv (men)', 'high-risk genital hpv (total)', 'an
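
The prevalence labels combine a measure and a demographic group in one string, for example `any oral hpv (men)`, which is one reason they never equal a vaccination category. A sketch of splitting them with a regular expression follows. `split_prevalence_label` is a hypothetical helper, and treating the parenthesized part as the merge key is an assumption.

```python
import re

LABEL_PATTERN = re.compile(r"^(?P<measure>.+?)\s*\((?P<group>[^()]+)\)$")

def split_prevalence_label(label: str):
    """Return (measure, group) for labels shaped like 'measure (group)'."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        return label.strip(), None  # no parenthesized group found
    return match.group("measure"), match.group("group")

print(split_prevalence_label("any oral hpv (men)"))
print(split_prevalence_label("high-risk oral hpv (non-hispanic black)"))
```

Even after splitting, groups such as `men`/`women` and `boys`/`girls` would still need an explicit, justified mapping before they could serve as a shared key.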

In [139]:
category_mapping = {
    # Age groups (keys use a plain hyphen to match the stored labels)
    "age 9-10": "children 9-10",
    "age 11-12": "children 11-12",
    "age 13-14": "teenagers 13-14",
    "age 15-17": "teenagers 15-17",

    # Demographics
    "white, non-hispanic": "non-hispanic white",
    "black, non-hispanic": "non-hispanic black",

    # Insurance
    "private insurance": "insured",
    "uninsured": "no insurance",

    # HPV types (replace() matches whole labels, so keys include the "(total)" suffix)
    "any oral hpv (total)": "oral hpv",
    "high-risk oral hpv (total)": "oral hpv high risk",
    "any genital hpv (total)": "genital hpv",

    # Cancer types
    "cervix": "hpv-related cervical cancer",
    "penis": "hpv-related penile cancer",
    "vagina": "hpv-related vaginal cancer",
    "anus": "hpv-related anal cancer",
    "oropharynx": "oropharyngeal cancer",

    # Coded variables
    "hpv_male": "male",
    "hpv_fem": "female",
}

# Apply mapping to datasets
hpv_vaccination_df["category"] = hpv_vaccination_df["category"].replace(category_mapping)
hpv_prevalence_df["category"] = hpv_prevalence_df["category"].replace(category_mapping)
hpv_cancer_df["cancer type"] = hpv_cancer_df["cancer type"].replace(category_mapping)
hpv_vaccination_2_df["antigen"] = hpv_vaccination_2_df["antigen"].replace(category_mapping)
