<h1><center>Neglected Tropical Diseases in Africa and their eradication</center></h1>
<h4><center>
<a style="text-decoration:none" href="https://github.com/zainabkapadia52">Zainab Kapadia</a>, IIT Gandhinagar, <a style="text-decoration:none" href="mailto:<author1>@iitgn.ac.in">23110373@iitgn.ac.in</a>
<br><br>
<a style="text-decoration:none" href="https://github.com/nikhil-405">Nikhil Goyal</a>, IIT Gandhinagar, <a style="text-decoration:none" href="mailto:<author2>@iitgn.ac.in">.g@iitgn.ac.in</a>
<br><br>
<a style="text-decoration:none" href="https://github.com/Vishal1309">Vishal Soni</a>, IIT Gandhinagar, <a style="text-decoration:none" href="mailto:<author3>@iitgn.ac.in">jayesh.s@iitgn.ac.in</a>
</center></h4>


### Introduction

Neglected tropical diseases (NTDs) affect more than one billion of the world’s poorest people, trapping them in a vicious cycle of infection, disability, and poverty. In sub‑Saharan Africa, diseases such as lymphatic filariasis (LF) and soil‑transmitted helminthiases (STH) burden rural communities with chronic illness, impaired child development, and lost economic productivity. Despite decades of mass drug administration campaigns and global partnerships led by the World Health Organization, progress toward eradication remains uneven, both across countries and from year to year.

A rigorous, data‑driven approach can show where interventions have succeeded and where gaps persist. By combining prevalence and treatment records with socioeconomic indicators (GDP per capita, health expenditure per capita, and access to improved sanitation), we can quantify five‑year changes in disease burden, assess the impact of mass drug administration coverage, and identify the underlying drivers of progress. In the sections that follow, we summarize our cleaned datasets, state hypotheses about programme coverage and socioeconomic context, and use statistical tests and visualizations to chart a path toward the lasting elimination of NTDs in Africa.

In [1]:
from IPython.display import HTML
HTML('''<button type="button" class="btn btn-outline-danger"  onclick="codeToggle();">Toggle Code</button>''')

### Analysis of the Spread of Lymphatic Filariasis and Soil‑Transmitted Helminthiases in Africa

In [4]:
# Importing the libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import widgets, interactive
import plotly.io as pio
import plotly.express as px
import warnings
import os

data_dir = "data"

warnings.filterwarnings('ignore')
pio.renderers.default='notebook'
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

In [11]:
df= pd.read_excel("data/STH_data.xlsx")
df= df[df["region"] == "AFR"].copy()
df.sort_values(by="year", inplace=True)
df.drop(columns=["region", "country"], inplace=True)
df["country_code"]= df["country_code"].str.upper()
df.rename(columns={
    "year": "Year",
    "country_code": "CountryCode"
}, inplace=True)
df.to_csv("data/STH_summary.csv", index=False)


df= pd.read_excel("data/LF_data.xlsx")
df= df[df["Region"] == "AFR"].copy()
df.sort_values(by="Year", inplace=True)
df.drop(columns=["Region", "Country"], inplace=True)
df["country_code"]= df["country_code"].str.upper()
df.rename(columns={
    "country_code": "CountryCode"
}, inplace=True)
df.to_csv( "data/LF_summary.csv", index=False)


df = pd.read_csv("data/COUNTRY.csv")
african_countries_df = df[df["ParentCode"] == "AFR"]
african_country_codes = african_countries_df["Code"].tolist()


gdp_df = pd.read_csv("data/GDP.csv") 
gdp_africa = gdp_df[gdp_df["Country Code"].isin(african_country_codes)].copy()
gdp_africa.drop(columns=["Country Name", "Indicator Name", "Indicator Code"], inplace=True)
gdp_africa.rename(columns={"Country Code": "CountryCode"}, inplace=True)
if 'Unnamed: 0' in df.columns:
    df.drop(columns=['Unnamed: 0'], inplace=True)
gdp_africa.to_csv("data/gdp_per_capita_africa.csv", index=False)


df = pd.read_csv("data/health_expenditure_per_capita.csv")  
df = df[df["Country Code"].isin(african_country_codes)].copy()
columns_to_drop = ["Country Name", "Unnamed: 0"]
df.drop(columns=[col for col in columns_to_drop if col in df.columns], inplace=True)
df.rename(columns={"Country Code": "CountryCode"}, inplace=True)
df.to_csv("data/health_expenditure_per_capita_africa.csv", index=False)


df = pd.read_csv("data/edited_API_SH.STA.BASS.ZS_DS2_EN_CSV_v2_65890.csv")
df = df[df["Country Code"].isin(african_country_codes)].copy()
columns_to_drop = ["Country Name", "Unnamed: 0"]
df.drop(columns=[col for col in columns_to_drop if col in df.columns], inplace=True)
df.rename(columns={"Country Code": "CountryCode"}, inplace=True)
df.to_csv("data/sanitation_africa.csv", index=False)


available_country_codes = [
    'AGO', 'BDI', 'BEN', 'BFA', 'BWA', 'CAF', 'CIV', 'CMR', 'COD', 'COG',
    'COM', 'CPV', 'DZA', 'ERI', 'ETH', 'GAB', 'GHA', 'GIN', 'GMB', 'GNB',
    'GNQ', 'KEN', 'LBR', 'LSO', 'MDG', 'MLI', 'MOZ', 'MRT', 'MUS', 'MWI',
    'NAM', 'NER', 'NGA', 'RWA', 'SEN', 'SLE', 'SSD', 'STP', 'SWZ', 'SYC',
    'TCD', 'TGO', 'TZA', 'UGA', 'ZAF', 'ZMB', 'ZWE'
]

# List of disease names
diseases= ["STH","LF"]
for disease_name in diseases:
    file_path= f"data/{disease_name}_summary.csv"
    df= pd.read_csv(file_path)
    df_filtered= df[df["CountryCode"].isin(available_country_codes)].copy()
    filtered_path= f"data/{disease_name}_filtered.csv"
    df_filtered.to_csv(filtered_path, index=False)

In [12]:
lf= pd.read_csv("data/LF_filtered.csv")
sth= pd.read_csv("data/STH_filtered.csv")

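# Dropping columns with >70% missing values
def drop_high_missing(df, threshold=0.7):
    # .copy() returns an independent frame, so later assignments do not act on a view
    return df.loc[:, df.isna().mean() <= threshold].copy()

lf_clean= drop_high_missing(lf)
sth_clean= drop_high_missing(sth)
# Treat the literal string "No data" as missing (still applied after the column drop)
lf_clean= lf_clean.replace("No data", pd.NA)
sth_clean= sth_clean.replace("No data", pd.NA)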

# Imputing numeric columns by country using transform 
for df in (lf_clean, sth_clean):
    num_cols = df.select_dtypes(include='number').columns.drop('Year', errors='ignore')
    df[num_cols] = df.groupby("CountryCode")[num_cols] \
                     .transform(lambda grp: grp.interpolate(method='linear')
                                            .ffill().bfill())
    cat_cols = [c for c in df.columns if c not in num_cols and c not in ["CountryCode", "Year"]]
    for col in cat_cols:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode().iloc[0])


lf_clean.to_csv("data/LF_cleaned.csv", index=False)
sth_clean.to_csv("data/STH_cleaned.csv", index=False)
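
The per-country imputation above can be checked on a small example. The sketch below uses a hypothetical two-country panel (the codes `AAA` and `BBB`, the `value` column, and the helper `impute_within_country` are illustrative, not part of the datasets) to show that interpolation fills interior gaps, that `ffill`/`bfill` pad the edges, and that values never cross from one country to another.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: two countries, one numeric column with gaps.
toy = pd.DataFrame({
    "CountryCode": ["AAA"] * 4 + ["BBB"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "value": [np.nan, 10.0, np.nan, 30.0, 5.0, np.nan, np.nan, np.nan],
})


def impute_within_country(frame, cols):
    """Interpolate linearly inside each country, then pad leading/trailing gaps."""
    filled = frame.groupby("CountryCode")[cols].transform(
        lambda s: s.interpolate(method="linear").ffill().bfill()
    )
    out = frame.copy()
    out[cols] = filled
    return out


imputed = impute_within_country(toy, ["value"])
print(imputed)
```

Note that a country with a single observation (`BBB` here) ends up with that one value repeated for every year, which may flatten real trends in sparsely reported countries.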

In [16]:
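# Read without index_col and drop a stray index column if one exists,
# so CountryCode is guaranteed to remain a regular column for melt()
gdp_wide= pd.read_csv("data/gdp_per_capita_africa.csv").drop(columns=["Unnamed: 0"], errors="ignore")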
healthexp= pd.read_csv("data/health_expenditure_per_capita_africa.csv")
sanit= pd.read_csv("data/sanitation_africa.csv")

gdp_long= (
    gdp_wide
      .melt(id_vars="CountryCode", 
            var_name="Year", 
            value_name="GDP_per_capita")
      .assign(Year=lambda df: df["Year"].astype(int))
)

healthexp_long= (
    healthexp
      .melt(id_vars="CountryCode",
            var_name="Year",
            value_name="HealthExp_per_capita")
      .assign(Year=lambda df: df["Year"].astype(int))
)

sanit_long= (
    sanit
      .melt(id_vars="CountryCode",
            var_name="Year",
            value_name="Pct_with_improved_sanitation")
      .assign(Year=lambda df: df["Year"].astype(int))
)

cov= (
    gdp_long
      .merge(healthexp_long, on=["CountryCode","Year"], how="outer")
      .merge(sanit_long,     on=["CountryCode","Year"], how="outer")
)

lf_full= lf_clean.merge(cov, on=["CountryCode","Year"], how="left")
sth_full= sth_clean.merge(cov, on=["CountryCode","Year"], how="left")

lf_final= lf_full.drop(columns=[
    'Mapping status',
    'Type of MDA',
    'Current status of MDA',
    'Number of IUs covered',
    'Geographical coverage (%)',
    'Total population of IUs'
])

sth_final= sth_full.drop(columns=[
    'Number of Pre-SAC targeted',
    'Drug combination, Pre-SAC'
])

lf_final.to_csv('data/LF_final.csv', index=False)

cov_cols= ["Programme coverage, Pre-SAC (%)", "National coverage, Pre-SAC (%)"]
for col in cov_cols:
    sth_final[col]= pd.to_numeric(sth_final[col], errors="coerce")

missing_pct= sth_final[cov_cols].isna().mean() * 100
sth_final[cov_cols]= (
    sth_final
    .groupby("CountryCode")[cov_cols]
    .transform(lambda grp: grp.interpolate(method="linear")
                                .ffill()
                                .bfill())
)

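for col in cov_cols:
    median_val = sth_final[col].median()
    # Assign back rather than calling fillna(inplace=True) on a column selection,
    # which may not update the parent frame in recent pandas versions
    sth_final[col] = sth_final[col].fillna(median_val)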

sth_final.to_csv("data/STH_final.csv", index=False)
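
The reshaping step in this cell turns World Bank style tables (one column per year) into country-year rows before joining them to the disease panels. A minimal sketch on hypothetical data follows; the frames `wide` and `disease` and their values are made up for illustration. It also passes `validate="many_to_one"`, an optional pandas check that is not used in the cell above, to confirm that each country-year key appears only once in the covariate table.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "CountryCode": ["AAA", "BBB"],
    "2000": [100.0, 200.0],
    "2001": [110.0, 210.0],
})

# Wide -> long: one row per (country, year); year labels become integers.
long = (
    wide.melt(id_vars="CountryCode", var_name="Year", value_name="GDP_per_capita")
        .assign(Year=lambda d: d["Year"].astype(int))
)

# Hypothetical disease panel keyed by the same two columns.
disease = pd.DataFrame({
    "CountryCode": ["AAA", "AAA", "BBB"],
    "Year": [2000, 2001, 2002],
    "cases": [5, 4, 9],
})

# Left join keeps every disease row; unmatched country-years get NaN covariates.
merged = disease.merge(
    long, on=["CountryCode", "Year"], how="left", validate="many_to_one"
)
print(merged)
```

Rows without a matching covariate (here `BBB` in 2002) survive the left join with missing values, so any later statistics on the merged frame need to handle `NaN` explicitly.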

In [19]:
import pandas as pd
import plotly.express as px

lf = pd.read_csv('data/LF_final.csv')
sth = pd.read_csv('data/STH_final.csv')

lf['Population requiring PC for LF'] = pd.to_numeric(
    lf['Population requiring PC for LF'], errors='coerce')
sth['Population requiring PC for STH, Pre-SAC'] = pd.to_numeric(
    sth['Population requiring PC for STH, Pre-SAC'], errors='coerce')

lf_year = (
    lf.groupby('Year')['Population requiring PC for LF']
      .sum()
      .reset_index(name='Pop_PC')
      .assign(Disease='LF')
)
sth_year = (
    sth.groupby('Year')['Population requiring PC for STH, Pre-SAC']
       .sum()
       .reset_index(name='Pop_PC')
       .assign(Disease='STH')
)

df = pd.concat([lf_year, sth_year], ignore_index=True)
fig = px.line(
    df,
    x='Year',
    y='Pop_PC',
    color='Disease',
    labels={'Pop_PC':'Population Requiring PC'},
    title='Trend of Population Requiring Preventive Chemotherapy: LF and STH'
)
fig.update_layout(xaxis=dict(dtick=1), yaxis_title='Population Requiring PC')
fig.show()


In [20]:
fig_lf = px.choropleth(
    lf,
    locations='CountryCode',           
    color='Population requiring PC for LF',
    hover_name='CountryCode',
    animation_frame='Year',
    color_continuous_scale='Viridis',
    range_color=[0, lf['Population requiring PC for LF'].max()],
    scope='africa',
    labels={'Population requiring PC for LF':'Pop. requiring PC'},
    title='LF: Population Requiring Preventive Chemotherapy Over Time'
)
fig_lf.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
fig_lf.show()

In [21]:
# -- STH choropleth --
fig_sth = px.choropleth(
    sth,
    locations='CountryCode',
    color='Population requiring PC for STH, Pre-SAC',
    hover_name='CountryCode',
    animation_frame='Year',
    color_continuous_scale='Plasma',
    range_color=[0, sth['Population requiring PC for STH, Pre-SAC'].max()],
    scope='africa',
    labels={'Population requiring PC for STH, Pre-SAC':'Pop. requiring PC'},
    title='STH: Population Requiring Preventive Chemotherapy Over Time'
)
fig_sth.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
fig_sth.show()


In [23]:
# --- H1: ANOVA on LF delta_5y by coverage bins ---
# Create coverage bins (<50%, 50-75%, >75%)

from scipy.stats import f_oneway, pearsonr
import statsmodels.formula.api as smf

baseline_year = 2000
future_year = baseline_year + 5

# Baseline and future series
base = (lf[lf['Year'] == baseline_year]
        .groupby('CountryCode')['Population requiring PC for LF']
        .mean()
        .rename('baseline_PC'))

future = (lf[lf['Year'] == future_year]
          .groupby('CountryCode')['Population requiring PC for LF']
          .mean()
          .rename('future_PC'))

# Average coverage across all years
avg_cov = (lf.groupby('CountryCode')['Programme (drug) coverage (%)']
           .mean()
           .rename('avg_coverage'))

# Combine
summary = pd.concat([base, future, avg_cov], axis=1).dropna()
summary['delta_5y'] = summary['future_PC'] - summary['baseline_PC']

# Create coverage bins; include_lowest=True keeps a coverage of exactly 0 in the first bin
summary['coverage_bin'] = pd.cut(
    summary['avg_coverage'],
    bins=[0, 50, 75, 100],
    labels=['<50%', '50-75%', '>75%'],
    include_lowest=True
)

# Prepare groups for ANOVA, skipping coverage bins that contain no countries
groups = [grp['delta_5y'].values
          for _, grp in summary.groupby('coverage_bin', observed=True)]

# Run one-way ANOVA
F_stat, p_val = f_oneway(*groups)

print("\nH₁ ANOVA Results (Δ5y by avg_coverage bin):")
print(f"  F-statistic = {F_stat:.2f}")
print(f"  p-value     = {p_val:.4f}")


H₁ ANOVA Results (Δ5y by avg_coverage bin):
  F-statistic = 9.01
  p-value     = 0.2292
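
To make the test above easier to audit, the sketch below reproduces its two ingredients on hypothetical numbers: how `pd.cut` treats the bin edges, and how a one-way ANOVA F-statistic is assembled from between-group and within-group sums of squares. The helpers `one_way_f` and `p_value_two_numerator_df` are illustrative stand-ins written with NumPy, not the SciPy implementation; the closed-form p-value is valid only when there are exactly three groups (two numerator degrees of freedom).

```python
import numpy as np
import pandas as pd

# Hypothetical average-coverage values, including the edge cases 0 and 100.
coverage = pd.Series([0.0, 30.0, 50.0, 60.0, 75.0, 100.0])
edges = [0, 50, 75, 100]
names = ["<50%", "50-75%", ">75%"]

# Default bins are right-closed, (0, 50], so a coverage of exactly 0 is left unbinned.
default_bins = pd.cut(coverage, bins=edges, labels=names)
closed_bins = pd.cut(coverage, bins=edges, labels=names, include_lowest=True)


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups if len(g) > 0]
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def p_value_two_numerator_df(f_stat, df_within):
    """Survival function of F(2, df_within); only valid for three groups."""
    return (1.0 + 2.0 * f_stat / df_within) ** (-df_within / 2.0)


# Hypothetical five-year changes for three coverage groups.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f_stat, df_b, df_w, p_value_two_numerator_df(f_stat, df_w))
```

Under the three-group assumption, plugging the reported F = 9.01 into the same closed form with one within-group degree of freedom gives p ≈ 0.229, which matches the printed p-value. That would correspond to only four countries remaining after `dropna()`, so the ANOVA result should probably be read as inconclusive rather than as evidence of no effect.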


In [24]:
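import statsmodels.api as sm

# baseline_PC is the country-level series `base`, not a column of `lf`
# (hence the KeyError). One plausible repair: pair it with the covariates
# observed in the baseline year, one row per country.
lf['log_GDP'] = np.log(lf['GDP_per_capita'])
lf['log_HE']  = np.log(lf['HealthExp_per_capita'])
covars = (lf[lf['Year'] == baseline_year]
          .groupby('CountryCode')[['log_GDP', 'log_HE']]
          .mean())
reg = (pd.concat([base, covars], axis=1)
         .replace([np.inf, -np.inf], np.nan)
         .dropna())
y = reg['baseline_PC']

r_gdp, p_gdp = pearsonr(reg['log_GDP'], y)
r_he,  p_he  = pearsonr(reg['log_HE'],  y)
print(f"Correlation: log(GDP) vs baseline_PC: r={r_gdp:.2f}, p={p_gdp:.4f}")
print(f"Correlation: log(HealthExp) vs baseline_PC: r={r_he:.2f}, p={p_he:.4f}")

X = sm.add_constant(reg[['log_GDP', 'log_HE']])
model = sm.OLS(y, X).fit()
print(model.summary())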
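
A regression on `baseline_PC` needs that column in the frame being modelled. It is a country-level quantity, whereas `lf` is a country-year panel, so the two have to be aligned on `CountryCode` first. The sketch below shows that alignment on a hypothetical panel (all names and numbers are made up), using `np.corrcoef` and `np.linalg.lstsq` as dependency-light stand-ins for `pearsonr` and `sm.OLS`; it does not reproduce the p-values or summary table those functions provide.

```python
import numpy as np
import pandas as pd

# Hypothetical country-year panel with one covariate and one burden measure.
panel = pd.DataFrame({
    "CountryCode": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "DDD", "DDD"],
    "Year": [2000, 2005] * 4,
    "GDP_per_capita": [500.0, 600.0, 1000.0, 1200.0, 2000.0, 2500.0, 4000.0, 4200.0],
    "pop_requiring_pc": [8e6, 7e6, 4e6, 3e6, 2e6, 1e6, 1e6, 5e5],
})

# Collapse to one row per country using the baseline year only.
baseline = panel[panel["Year"] == 2000].set_index("CountryCode")
country = pd.DataFrame({
    "baseline_PC": baseline["pop_requiring_pc"],
    "log_GDP": np.log(baseline["GDP_per_capita"]),
}).dropna()

# Pearson correlation between log GDP and baseline burden.
r = np.corrcoef(country["log_GDP"], country["baseline_PC"])[0, 1]

# Ordinary least squares with an intercept: baseline_PC ~ 1 + log_GDP.
X = np.column_stack([np.ones(len(country)), country["log_GDP"].to_numpy()])
coef, *_ = np.linalg.lstsq(X, country["baseline_PC"].to_numpy(), rcond=None)
intercept, slope = coef
print(f"r = {r:.2f}, intercept = {intercept:.0f}, slope = {slope:.0f}")
```

With one row per country, the number of observations equals the number of countries that have both a baseline burden and covariates, so the degrees of freedom of any fitted model should be checked before interpreting coefficients.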