# Task 1 — Data Exploration and Enrichment
This notebook loads the starter dataset, explores the unified schema, reviews events and impact links, adds example enrichment records (observations, events, and impact links), saves the enriched dataset to `data/processed/`, and writes an enrichment log to `reports/data_enrichment_log.md`.

**Before running:**
1. Ensure these files exist (paths are relative to the project root; the notebook reads them via `../data/raw`):
   - `data/raw/ethiopia_fi_unified_data.csv`
   - `data/raw/reference_codes.csv`
2. Update `COLLECTOR` in the enrichment cell.
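
The file check in step 1 can be done in code before loading anything. This is a minimal sketch: `missing_inputs` and `REQUIRED_FILES` are hypothetical names introduced here, and the `../data/raw` path assumes the notebook runs from a subfolder of the project root, as the loading cell does.

```python
from pathlib import Path

def missing_inputs(raw_dir, filenames):
    """Return the expected input files that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in filenames if not (raw_dir / name).is_file()]

# File names taken from the checklist above.
REQUIRED_FILES = ["ethiopia_fi_unified_data.csv", "reference_codes.csv"]

# Assumption: the notebook lives one level below the project root.
missing = missing_inputs("../data/raw", REQUIRED_FILES)
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found.")
```

Running this first gives a readable message instead of a `FileNotFoundError` from `pd.read_csv`.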

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)


In [3]:
raw_path = Path("../data/raw")

df = pd.read_csv(raw_path / "ethiopia_fi_unified_data.csv")
ref = pd.read_csv(raw_path / "reference_codes.csv")

print("Unified data shape:", df.shape)
print("Reference codes shape:", ref.shape)
df.head()


Unified data shape: (43, 34)
Reference codes shape: (71, 4)


Unnamed: 0,record_id,record_type,category,pillar,indicator,indicator_code,indicator_direction,value_numeric,value_text,value_type,unit,observation_date,period_start,period_end,fiscal_year,gender,location,region,source_name,source_type,source_url,confidence,related_indicator,relationship_type,impact_direction,impact_magnitude,impact_estimate,lag_months,evidence_basis,comparable_country,collected_by,collection_date,original_text,notes
0,REC_0001,observation,,ACCESS,Account Ownership Rate,ACC_OWNERSHIP,higher_better,22.0,,percentage,%,2014-12-31,,,2014,all,national,,Global Findex 2014,survey,https://www.worldbank.org/en/publication/globa...,high,,,,,,,,Example_Trainee,2025-01-20,,Baseline year,
1,REC_0002,observation,,ACCESS,Account Ownership Rate,ACC_OWNERSHIP,higher_better,35.0,,percentage,%,2017-12-31,,,2017,all,national,,Global Findex 2017,survey,https://www.worldbank.org/en/publication/globa...,high,,,,,,,,Example_Trainee,2025-01-20,,,
2,REC_0003,observation,,ACCESS,Account Ownership Rate,ACC_OWNERSHIP,higher_better,46.0,,percentage,%,2021-12-31,,,2021,all,national,,Global Findex 2021,survey,https://www.worldbank.org/en/publication/globa...,high,,,,,,,,Example_Trainee,2025-01-20,,,
3,REC_0004,observation,,ACCESS,Account Ownership Rate,ACC_OWNERSHIP,higher_better,56.0,,percentage,%,2021-12-31,,,2021,male,national,,Global Findex 2021,survey,https://www.worldbank.org/en/publication/globa...,high,,,,,,,,Example_Trainee,2025-01-20,,Gender disaggregated,
4,REC_0005,observation,,ACCESS,Account Ownership Rate,ACC_OWNERSHIP,higher_better,36.0,,percentage,%,2021-12-31,,,2021,female,national,,Global Findex 2021,survey,https://www.worldbank.org/en/publication/globa...,high,,,,,,,,Example_Trainee,2025-01-20,,Gender disaggregated,


In [4]:
df.columns.tolist()

['record_id',
 'record_type',
 'category',
 'pillar',
 'indicator',
 'indicator_code',
 'indicator_direction',
 'value_numeric',
 'value_text',
 'value_type',
 'unit',
 'observation_date',
 'period_start',
 'period_end',
 'fiscal_year',
 'gender',
 'location',
 'region',
 'source_name',
 'source_type',
 'source_url',
 'confidence',
 'related_indicator',
 'relationship_type',
 'impact_direction',
 'impact_magnitude',
 'impact_estimate',
 'lag_months',
 'evidence_basis',
 'comparable_country',
 'collected_by',
 'collection_date',
 'original_text',
 'notes']

In [5]:
df.dtypes

record_id                  str
record_type                str
category                   str
pillar                     str
indicator                  str
indicator_code             str
indicator_direction        str
value_numeric          float64
value_text                 str
value_type                 str
unit                       str
observation_date           str
period_start               str
period_end                 str
fiscal_year                str
gender                     str
location                   str
region                 float64
source_name                str
source_type                str
source_url                 str
confidence                 str
related_indicator      float64
relationship_type      float64
impact_direction       float64
impact_magnitude       float64
impact_estimate        float64
lag_months             float64
evidence_basis         float64
comparable_country         str
collected_by               str
collection_date            str
original

In [6]:
df["record_type"].value_counts(dropna=False)

record_type
observation    30
event          10
target          3
Name: count, dtype: int64

In [7]:
# record_type x pillar
pd.crosstab(df["record_type"], df["pillar"], dropna=False)

pillar       ACCESS  AFFORDABILITY  GENDER  USAGE  NaN
record_type
event             0              0       0      0   10
observation      14              1       4     11    0
target            2              0       1      0    0


In [8]:
df["confidence"].value_counts(dropna=False)

confidence
high      40
medium     3
Name: count, dtype: int64

In [9]:
df["source_type"].value_counts(dropna=False)

source_type
operator      15
survey        10
regulator      7
research       4
policy         3
calculated     2
news           2
Name: count, dtype: int64

In [10]:
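# reference_codes.csv may not have a "field_name" column; check before indexing.
# (The original run of this cell failed with KeyError: 'field_name'.)
if "field_name" in ref.columns:
    display(ref["field_name"].value_counts())
else:
    print("No 'field_name' column; available columns:", ref.columns.tolist())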
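
Because the lookup on `field_name` failed with a `KeyError`, the reference file evidently uses a different name for that column. The sketch below shows one way to resolve it from a list of candidates. The helper `find_column`, the toy frame `ref_demo`, and the candidate names are all assumptions for illustration; the real column names in `reference_codes.csv` are not shown in this notebook.

```python
import pandas as pd

def find_column(frame, candidates):
    """Return the first candidate name present in frame.columns, else None."""
    for name in candidates:
        if name in frame.columns:
            return name
    return None

# Toy stand-in for reference_codes.csv; these column names are assumptions.
ref_demo = pd.DataFrame({
    "field": ["pillar", "pillar", "confidence"],
    "code": ["ACCESS", "USAGE", "high"],
})

field_col = find_column(ref_demo, ["field_name", "field", "column"])
if field_col is None:
    print("No field column found; columns are:", ref_demo.columns.tolist())
else:
    print(ref_demo[field_col].value_counts())
```

In the notebook, the same helper could be called on `ref` after inspecting `ref.columns.tolist()` to choose sensible candidates.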

In [None]:
# Look up valid values for key fields (if present in reference_codes.csv)
for field in ["record_type", "pillar", "category", "confidence", "impact_direction", "impact_magnitude"]:
    subset = ref[ref["field_name"] == field] if "field_name" in ref.columns else pd.DataFrame()
    print("\nFIELD:", field)
    if not subset.empty:
        display(subset[[c for c in ["field_name","code","description"] if c in subset.columns]]
                .drop_duplicates()
                .sort_values("code" if "code" in subset.columns else subset.columns[0]))
    else:
        print("No reference codes found for this field (check reference_codes.csv schema).")


In [None]:
obs = df[df["record_type"] == "observation"].copy()
obs["observation_date"] = pd.to_datetime(obs["observation_date"], errors="coerce")

obs["observation_date"].min(), obs["observation_date"].max()


In [None]:
obs["indicator_code"].nunique(), obs["indicator_code"].value_counts().head(30)

In [None]:
indicator_coverage = (
    obs.groupby("indicator_code")["observation_date"]
    .agg(["min", "max", "count"])
    .reset_index()
    .sort_values("count", ascending=False)
)
indicator_coverage.head(50)


In [None]:
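events = df[df["record_type"] == "event"].copy()
# The loaded schema has no event_date column; observation_date is assumed to hold the event date.
date_col = "event_date" if "event_date" in events.columns else "observation_date"
events["event_date"] = pd.to_datetime(events[date_col], errors="coerce")

# event_name is also absent from the schema, so `indicator` is shown as the label.
cols = [c for c in ["record_id","event_name","indicator","category","event_date"] if c in events.columns]
events[cols].sort_values("event_date")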


In [None]:
links = df[df["record_type"] == "impact_link"].copy()
key_cols = [c for c in ["record_id","parent_id","pillar","related_indicator","impact_direction","impact_magnitude","lag_months","evidence_basis"] if c in links.columns]
links[key_cols].head(25)


In [None]:
event_ids = set(events["record_id"].dropna().astype(str)) if "record_id" in events.columns else set()
if "parent_id" in links.columns:
    links["parent_id"] = links["parent_id"].astype(str)
    missing_parents = links[~links["parent_id"].isin(event_ids)]
    print("Impact links with missing parent event:", missing_parents.shape[0])
    display(missing_parents.head(10))
else:
    print("No parent_id column found in impact_link records.")


## Why events have no pillar
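Events are **pillar-agnostic by design**: assigning an event to a single pillar would bias the analysis toward that pillar, which is why all 10 events in the crosstab above have a `NaN` pillar. The effect of an event on **Access** or **Usage** is instead recorded in `impact_link` records, which connect an event (`parent_id`) to a `related_indicator` and `pillar` and quantify the direction, magnitude, and lag of the effect.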
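
To make the relationship concrete, here is a small sketch that joins impact links back to their parent event. The three records, their IDs (`EVT_A`, `LNK_A1`, `LNK_A2`), and the `USG_EXAMPLE` indicator code are made up for illustration. The sketch also assumes that `parent_id` exists as a column, which is not the case in the column list printed earlier.

```python
import pandas as pd

# Hypothetical records: one pillar-less event and two impact links pointing at it.
records = pd.DataFrame([
    {"record_id": "EVT_A", "record_type": "event", "indicator": "Example launch",
     "pillar": None, "parent_id": None, "related_indicator": None},
    {"record_id": "LNK_A1", "record_type": "impact_link", "indicator": None,
     "pillar": "ACCESS", "parent_id": "EVT_A", "related_indicator": "ACC_OWNERSHIP"},
    {"record_id": "LNK_A2", "record_type": "impact_link", "indicator": None,
     "pillar": "USAGE", "parent_id": "EVT_A", "related_indicator": "USG_EXAMPLE"},
])

events_demo = records.loc[records["record_type"] == "event", ["record_id", "indicator"]]
links_demo = records.loc[
    records["record_type"] == "impact_link",
    ["record_id", "parent_id", "pillar", "related_indicator"],
]

# Each link row picks up the name of its parent event; the pillar comes from the link.
event_effects = links_demo.merge(
    events_demo.rename(columns={"record_id": "parent_id", "indicator": "event"}),
    on="parent_id",
    how="left",
)
print(event_effects)
```

One event ends up with two rows, one per pillar, which is the reason the pillar lives on the link rather than on the event.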

In [None]:
# Inspect schema columns to ensure you fill the right fields for enrichment
df.columns.tolist()


In [None]:
from datetime import date

TODAY = "2026-02-03"
COLLECTOR = "Your Name"  # <-- change this

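# The rows below use parent_id, event_name, and event_date, which the loaded schema lacks.
# They are appended here so make_row does not silently drop them.
EXTRA_COLS = ["parent_id", "event_name", "event_date"]
cols = df.columns.tolist() + [c for c in EXTRA_COLS if c not in df.columns]

def make_row(**kwargs):
    """Create a row dict keyed by `cols`; reject fields that are not in the schema."""
    unknown = sorted(set(kwargs) - set(cols))
    if unknown:
        raise KeyError(f"Fields not in schema: {unknown}")
    row = {c: np.nan for c in cols}
    row.update(kwargs)
    return row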

new_rows = []

# ----------------------------
# NEW OBSERVATIONS (EXAMPLES)
# ----------------------------
new_rows.append(make_row(
    record_id="obs_interop_p2p_surpass_atm_2025",
    record_type="observation",
    pillar="USAGE",
    indicator="Interoperable P2P transfers surpass interoperable ATM withdrawals",
    indicator_code="interop_p2p_gt_atm_flag",
    value_numeric=1,
    observation_date="2025-11-01",
    source_name="EthSwitch (reported via media)",
    source_type="news",
    source_url="https://capitalethiopia.com/2025/11/02/ethswitch-reports-historic-growth-as-p2p-payments-surpass-atm-withdrawals/",
    confidence="medium",
    original_text="EthSwitch reported P2P payments surpassed ATM withdrawals (interoperable transfers).",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Usage proxy: shift toward digital transfers; helps nowcast 2025 usage when Findex is sparse."
))

new_rows.append(make_row(
    record_id="obs_telebirr_users_2024",
    record_type="observation",
    pillar="ACCESS",
    indicator="Telebirr registered users",
    indicator_code="telebirr_registered_users",
    value_numeric=54000000,
    observation_date="2024-12-31",
    source_name="Ethio Telecom (reported via media)",
    source_type="news",
    source_url="https://techpression.com/ethio-telecoms-telebirr-surpasses-54-million-users-processes-2-4-trillion-etb-in-transactions/",
    confidence="medium",
    original_text="Telebirr surpassed 54 million users.",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Access proxy: Findex account definition includes mobile money usage; scale supports bridging years."
))

new_rows.append(make_row(
    record_id="obs_mpesa_customers_2025",
    record_type="observation",
    pillar="USAGE",
    indicator="M-Pesa Ethiopia customers",
    indicator_code="mpesa_customers",
    value_numeric=10000000,
    observation_date="2025-08-01",
    source_name="Safaricom Ethiopia (press release)",
    source_type="operator",
    source_url="https://www.safaricom.co.ke/media-center-landing/press-releases/safaricom-ethiopia-hits-10-million-customers-demonstrates-strong-performance-investment-and-job-creation",
    confidence="medium",
    original_text="Safaricom Ethiopia hit 10 million customers.",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Usage proxy: market competition and network effects can accelerate digital payment adoption."
))

# ----------------------------
# NEW EVENTS (EXAMPLES)
# ----------------------------
new_rows.append(make_row(
    record_id="event_telebirr_launch_2021",
    record_type="event",
    event_name="Telebirr launch",
    category="product_launch",
    event_date="2021-05-11",
    source_name="Ethio Telecom",
    source_type="operator",
    source_url="https://www.ethiotelecom.et/",
    confidence="low",
    original_text="Telebirr launched in 2021 (exact date to verify if not in dataset).",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Major product launch; event is pillar-agnostic; link impacts via impact_link."
))

new_rows.append(make_row(
    record_id="event_mpesa_entry_2023",
    record_type="event",
    event_name="M-Pesa Ethiopia market entry",
    category="market_entry",
    event_date="2023-08-01",
    source_name="Safaricom Ethiopia",
    source_type="operator",
    source_url="https://www.safaricom.co.ke/",
    confidence="low",
    original_text="M-Pesa launched in Ethiopia in 2023 (date to verify if not in dataset).",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Competition event; affects usage and agent network expansion; link via impact_links."
))

new_rows.append(make_row(
    record_id="event_interop_p2p_milestone_2025",
    record_type="event",
    event_name="Interoperable P2P transactions surpass interoperable ATM withdrawals",
    category="milestone",
    event_date="2025-11-01",
    source_name="EthSwitch (reported via media)",
    source_type="news",
    source_url="https://capitalethiopia.com/2025/11/02/ethswitch-reports-historic-growth-as-p2p-payments-surpass-atm-withdrawals/",
    confidence="medium",
    original_text="EthSwitch reports P2P transactions surpassed ATM withdrawals.",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Important market milestone; impacts usage via linked indicator."
))

# ----------------------------
# NEW IMPACT LINKS (EXAMPLES)
# ----------------------------
new_rows.append(make_row(
    record_id="link_telebirr_launch_to_users",
    record_type="impact_link",
    parent_id="event_telebirr_launch_2021",
    pillar="ACCESS",
    related_indicator="telebirr_registered_users",
    impact_direction="positive",
    impact_magnitude="high",
    lag_months=3,
    evidence_basis="Operator product launch drives mobile money registrations.",
    source_url="https://www.ethiotelecom.et/",
    original_text="Product launch introduces mobile money to users; early adoption expected within months.",
    confidence="medium",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Connects pillar-agnostic event to Access indicator."
))

new_rows.append(make_row(
    record_id="link_mpesa_entry_to_customers",
    record_type="impact_link",
    parent_id="event_mpesa_entry_2023",
    pillar="USAGE",
    related_indicator="mpesa_customers",
    impact_direction="positive",
    impact_magnitude="medium",
    lag_months=6,
    evidence_basis="Market entry increases digital payment options and campaigns; adoption ramps over quarters.",
    source_url="https://www.safaricom.co.ke/",
    original_text="New entrant expands use cases and agent footprint; usage impact lags launch.",
    confidence="medium",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Models competitor-driven adoption effect on usage."
))

new_rows.append(make_row(
    record_id="link_interop_milestone_to_usage",
    record_type="impact_link",
    parent_id="event_interop_p2p_milestone_2025",
    pillar="USAGE",
    related_indicator="interop_p2p_gt_atm_flag",
    impact_direction="positive",
    impact_magnitude="medium",
    lag_months=0,
    evidence_basis="Interoperability reduces friction and shifts transactions to digital rails.",
    source_url="https://capitalethiopia.com/2025/11/02/ethswitch-reports-historic-growth-as-p2p-payments-surpass-atm-withdrawals/",
    original_text="P2P surpassing ATM indicates digital rail preference; immediate usage shift.",
    confidence="medium",
    collected_by=COLLECTOR,
    collection_date=TODAY,
    notes="Captures structural shift from cash withdrawals to digital transfers."
))

new_rows_df = pd.DataFrame(new_rows, columns=cols)
print("New rows shape:", new_rows_df.shape)
new_rows_df.head(10)


In [None]:
# Check if any new record_id already exists in the dataset
existing_ids = set(df["record_id"].astype(str)) if "record_id" in df.columns else set()
dupe_new_ids = [rid for rid in new_rows_df["record_id"].astype(str) if rid in existing_ids]
print("Duplicate new record_ids found:", len(dupe_new_ids))
dupe_new_ids[:10]


In [None]:
# Completeness check for new observations
req_obs = [c for c in ["pillar","indicator_code","value_numeric","observation_date","source_name","source_url","confidence"] if c in cols]
missing_obs = new_rows_df[new_rows_df["record_type"]=="observation"][req_obs].isna().sum()
missing_obs
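
The value counts earlier in this notebook show that the starter data uses uppercase pillar codes (`ACCESS`, `USAGE`, ...) and a small set of `source_type` values. A quick vocabulary check can flag new rows that introduce values the existing data never uses. This is a sketch only: `unseen_values` is a hypothetical helper, and the two demo frames are toy data rather than the real dataset.

```python
import pandas as pd

def unseen_values(existing, new, columns):
    """Map each column to the non-null values in `new` that never occur in `existing`."""
    report = {}
    for col in columns:
        known = set(existing[col].dropna())
        extra = sorted(set(new[col].dropna()) - known)
        if extra:
            report[col] = extra
    return report

# Toy frames standing in for df and new_rows_df.
existing_demo = pd.DataFrame({"pillar": ["ACCESS", "USAGE"], "source_type": ["survey", "news"]})
new_demo = pd.DataFrame({"pillar": ["usage", "ACCESS"], "source_type": ["media", "survey"]})

print(unseen_values(existing_demo, new_demo, ["pillar", "source_type"]))
```

In the notebook this could be called as `unseen_values(df, new_rows_df, ["pillar", "source_type", "confidence"])`. A flagged value is not necessarily wrong (a new category may be legitimate), but case mismatches such as `usage` versus `USAGE` would otherwise split one pillar into two groups.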


In [None]:
# Validate that new impact_links point to an event (existing or newly added)
all_event_ids = set(df[df["record_type"]=="event"]["record_id"].astype(str)) if "record_id" in df.columns else set()
all_event_ids |= set(new_rows_df[new_rows_df["record_type"]=="event"]["record_id"].astype(str))

new_links = new_rows_df[new_rows_df["record_type"]=="impact_link"].copy()
if "parent_id" in new_links.columns and not new_links.empty:
    missing_parent = new_links[~new_links["parent_id"].astype(str).isin(all_event_ids)]
    print("New links missing parent event:", missing_parent.shape[0])
    display(missing_parent[[c for c in ["record_id","parent_id"] if c in missing_parent.columns]].head(10))
else:
    print("No new impact links or no parent_id column.")


In [None]:
# Append and save enriched dataset
df_enriched = pd.concat([df, new_rows_df], ignore_index=True)

processed_path = Path("../data/processed")
processed_path.mkdir(parents=True, exist_ok=True)

out_file = processed_path / "ethiopia_fi_unified_data_enriched.csv"
df_enriched.to_csv(out_file, index=False)

print("Saved:", out_file, "shape:", df_enriched.shape)


In [None]:
# Write enrichment log (Task 1 required)
reports_path = Path("../reports")
reports_path.mkdir(parents=True, exist_ok=True)

log_path = reports_path / "data_enrichment_log.md"

log_text = f"""# Data Enrichment Log – Task 1

## Dataset Exploration Summary
- Unified schema confirmed (interpretation depends on `record_type`).
- Events are pillar-agnostic by design; impacts are defined through `impact_link` records.
- Impact links connect events to indicators using `parent_id`.

---

## Added Records (Examples)

### Added (Observation): interop_p2p_gt_atm_flag (2025-11-01)
- Record type: observation
- Pillar: usage
- Indicator code: interop_p2p_gt_atm_flag
- Value: 1
- Source URL: https://capitalethiopia.com/2025/11/02/ethswitch-reports-historic-growth-as-p2p-payments-surpass-atm-withdrawals/
- Original text: "EthSwitch reported P2P payments surpassed ATM withdrawals (interoperable transfers)."
- Confidence: medium
- Collected by: {COLLECTOR}
- Collection date: {TODAY}
- Notes: Usage proxy capturing shift from cash withdrawals to digital transfers.

### Added (Observation): telebirr_registered_users (2024-12-31)
- Record type: observation
- Pillar: access
- Indicator code: telebirr_registered_users
- Value: 54,000,000
- Source URL: https://techpression.com/ethio-telecoms-telebirr-surpasses-54-million-users-processes-2-4-trillion-etb-in-transactions/
- Original text: "Telebirr surpassed 54 million users."
- Confidence: medium
- Collected by: {COLLECTOR}
- Collection date: {TODAY}
- Notes: Access proxy supporting bridging between sparse Findex survey years.

### Added (Observation): mpesa_customers (2025-08-01)
- Record type: observation
- Pillar: usage
- Indicator code: mpesa_customers
- Value: 10,000,000
- Source URL: https://www.safaricom.co.ke/media-center-landing/press-releases/safaricom-ethiopia-hits-10-million-customers-demonstrates-strong-performance-investment-and-job-creation
- Original text: "Safaricom Ethiopia hit 10 million customers."
- Confidence: medium
- Collected by: {COLLECTOR}
- Collection date: {TODAY}
- Notes: Usage proxy for competitive entry scaling digital payments.

### Added (Event): Telebirr launch (2021-05-11)
- Record type: event
- Category: product_launch
- Pillar: (blank by design)
- Source URL: https://www.ethiotelecom.et/
- Original text: "Telebirr launched in 2021 (exact date to verify if not in dataset)."
- Confidence: low
- Notes: Major product launch; impacts modeled via impact_link.

### Added (Event): M-Pesa Ethiopia market entry (2023-08-01)
- Record type: event
- Category: market_entry
- Pillar: (blank by design)
- Source URL: https://www.safaricom.co.ke/
- Original text: "M-Pesa launched in Ethiopia in 2023 (date to verify if not in dataset)."
- Confidence: low
- Notes: Market entry event; impacts modeled via impact_link.

### Added (Event): Interoperable P2P surpasses ATM milestone (2025-11-01)
- Record type: event
- Category: milestone
- Pillar: (blank by design)
- Source URL: https://capitalethiopia.com/2025/11/02/ethswitch-reports-historic-growth-as-p2p-payments-surpass-atm-withdrawals/
- Original text: "EthSwitch reports P2P transactions surpassed ATM withdrawals."
- Confidence: medium
- Notes: Usage-shift milestone; impacts modeled via impact_link.

### Added (Impact Links)
- link_telebirr_launch_to_users: Telebirr launch → telebirr_registered_users (access), positive, high, lag 3 months
- link_mpesa_entry_to_customers: M-Pesa entry → mpesa_customers (usage), positive, medium, lag 6 months
- link_interop_milestone_to_usage: Interop milestone → interop_p2p_gt_atm_flag (usage), positive, medium, lag 0 months
"""

with open(log_path, "w", encoding="utf-8") as f:
    f.write(log_text)

print("Wrote:", log_path)
