# DS701: District Slot Revenue & NAFI Reimbursement FY20-FY24 Parser

This notebook automates the extraction of structured data from the District Revenues Reports (FY2020–FY2024).  
It processes the consolidated PDF, extracts tabular content using spatial alignment, classifies records by service branch, region, and revenue type, and outputs a cleaned CSV file ready for SQL, datasette, or BI dashboards.

The notebook is designed to run directly inside this repository.  
All input and output files are read/written relative to the `fa25-team-b` project directory.


## What you'll get

*   A filtered PDF containing only relevant tables (for auditability)
*   A parsed DataFrame and a CSV with columns:
["Service","Category","Region","Base","Location","Month","Year","Amount"]
*   A small manual patch applied for "Sasebo Navy – Hario (FY2020 Feb–Sep)".
*   Canonicalized base names (e.g., Camp Hanson USMC → Camp Hansen USMC, Hohenfels normalization).
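
Because the CSV is meant to feed SQL tools, here is a minimal sketch of loading rows in the output schema into SQLite using only the standard library. The table name `district_revenue` and the in-memory database are assumptions made for illustration; the two sample rows are copied from the output preview at the end of this notebook.

```python
import csv
import io
import sqlite3

# Two sample rows in the output schema, copied from the notebook's output preview.
# In practice, read the real file instead: open(OUT_CSV, newline="").
sample_csv = io.StringIO(
    "Service,Category,Region,Base,Location,Month,Year,Amount\n"
    "Army,Revenue,Europe,Ansbach,Action Lanes,October,2019,59201.0\n"
    "Army,Revenue,Europe,Ansbach,Action Lanes,November,2019,35460.0\n"
)
rows = list(csv.DictReader(sample_csv))

conn = sqlite3.connect(":memory:")
# "district_revenue" is a hypothetical table name chosen for this sketch.
conn.execute(
    "CREATE TABLE district_revenue ("
    "Service TEXT, Category TEXT, Region TEXT, Base TEXT, "
    "Location TEXT, Month TEXT, Year INTEGER, Amount REAL)"
)
# Named placeholders are filled from each row dict produced by csv.DictReader.
conn.executemany(
    "INSERT INTO district_revenue VALUES "
    "(:Service, :Category, :Region, :Base, :Location, :Month, :Year, :Amount)",
    rows,
)

total = conn.execute(
    "SELECT SUM(Amount) FROM district_revenue WHERE Base = 'Ansbach'"
).fetchone()[0]
print(total)
conn.close()
```

The same table could also be written with `DataFrame.to_sql` once the parsed DataFrame is available; the standard-library version is shown only to keep the sketch dependency-free.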









# 1. Environment Setup
This section installs required Python packages (`PyMuPDF`, `pandas`, `numpy`) and configures filesystem paths using `pathlib`.

All file paths now depend on the repository structure.


## **Inputs Used by the Notebook**

These three paths are automatically configured:

- **SRC_PDF**  
  Path to the original PDF report located in:  
  `fa25-team-b/pdf/District Revenues FY20-FY24.pdf`

- **FILTERED_PDF**  
  Output path for the PDF containing only relevant extracted pages.

- **OUT_CSV**  
  Final processed CSV saved to:  
  `fa25-team-b/CSVs/District Revenue/`

You do **not** need to manually specify any Drive paths.  
All paths are resolved relative to the `fa25-team-b` folder based on where the notebook is executed.
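
The setup cell assumes the notebook runs from a folder directly inside `fa25-team-b`. As a hedged alternative, the sketch below walks up from the current directory until it finds a folder with that name. `find_project_root` is a hypothetical helper, not part of this notebook.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, name: str = "fa25-team-b") -> Optional[Path]:
    """Return `start` or its nearest ancestor whose folder name equals `name`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    return None  # not inside the project tree


root = find_project_root(Path.cwd())
if root is None:
    print("fa25-team-b not found above the current directory")
else:
    # Same relative location the notebook uses for the source PDF.
    src_pdf = root / "pdf" / "District Revenues FY20-FY24.pdf"
    print(src_pdf, "exists:", src_pdf.exists())
```

Checking `src_pdf.exists()` before parsing gives a clearer failure than an error raised later while opening the PDF.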



In [1]:
%pip install pymupdf pandas numpy

Note: you may need to restart the kernel to use updated packages.





In [3]:
import fitz, re, pandas as pd, numpy as np
from typing import List, Dict, Optional, Tuple, Any
from pathlib import Path

# === Project root directory: fa25-team-b ===
# Assuming this notebook is executed inside:
# ds-muckrock-liberation/fa25-team-b/PDF Extraction/
PROJECT_ROOT = Path.cwd().resolve().parents[0]   # Parent directory of "PDF Extraction" = fa25-team-b

# 1) PDF source + FILTERED_PDF output: fa25-team-b/pdf/
PDF_DIR = PROJECT_ROOT / "pdf"

SRC_PDF = PDF_DIR / "District Revenues FY20-FY24.pdf"
FILTERED_PDF = PDF_DIR / "District Revenues FY20-FY24_FILTERED.pdf"

# 2) Output directory for CSV: fa25-team-b/CSVs/District Revenue/
OUT_CSV_DIR = PROJECT_ROOT / "CSVs" / "District Revenue"
OUT_CSV_DIR.mkdir(parents=True, exist_ok=True)

OUT_CSV = OUT_CSV_DIR / "District_Revenue_filtered_FY20-FY24_final.csv"


# 2. Page Parser

How the parser works:

*   Page filtering: keeps pages that contain recognizable region/service headers or totals lines, plus directly following pages that show a 12-month "bar".
*   Month detection: reads all word boxes, finds month tokens and their x-centers, computes a typical gap, and builds 12 tolerant x-bins for Oct→Sep (FY order).
*   Row assembly: for each line, tokens left of the first month value form the Location (hyphen-aware joiner). Numeric tokens to the right are snapped into month bins; right-most wins on collisions.
*   FY vs Calendar: months Oct–Dec map to Year = FY-1; Jan–Sep map to Year = FY.
*   Service assignment: when a "totals" line appears (e.g., "Total Europe Slot Revenue Army"), the parser retro-fills Service for the block of rows preceding it.
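
The month-detection and row-assembly steps above can be sketched in isolation. This is a simplified, standard-library illustration of the idea, not the notebook's own functions: `make_bins` and `snap` are hypothetical names, `statistics.median` stands in for `np.median`, and the x-coordinates are made up.

```python
from statistics import median
from typing import List, Optional, Tuple

FY_MONTHS = ["October", "November", "December", "January", "February", "March",
             "April", "May", "June", "July", "August", "September"]


def make_bins(centers: List[float], min_tol: float = 8.0) -> List[Tuple[float, float]]:
    """Turn column centers into (left, right) x-ranges, widened by a tolerance."""
    gaps = [b - a for a, b in zip(centers, centers[1:])]
    gap = median(gaps)
    tol = max(min_tol, 0.10 * gap)
    # Edges sit halfway between neighbouring centers, plus half a gap on the outer sides.
    edges = [centers[0] - gap / 2]
    edges += [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    edges += [centers[-1] + gap / 2]
    return [(edges[i] - tol, edges[i + 1] + tol) for i in range(len(centers))]


def snap(tokens: List[Tuple[float, str]],
         bins: List[Tuple[float, float]]) -> List[Optional[str]]:
    """Put each (x_center, text) token in the first bin containing it; rightmost token wins."""
    values: List[Optional[str]] = [None] * len(bins)
    best_x = [float("-inf")] * len(bins)
    for xc, text in tokens:
        for i, (left, right) in enumerate(bins):
            if left <= xc <= right:
                if xc > best_x[i]:
                    values[i], best_x[i] = text, xc
                break
    return values


# Made-up layout: 12 evenly spaced columns, 60 units apart, starting at x = 100.
centers = [100.0 + 60.0 * i for i in range(12)]
bins = make_bins(centers)
row = [(101.5, "59,201"), (158.0, "35,460"), (223.0, "46,220")]
values = snap(row, bins)
for month, value in zip(FY_MONTHS, values):
    if value is not None:
        print(month, value)
```

Because the tolerance widens every bin on both sides, neighbouring bins overlap slightly. A token in the overlap goes to the left bin, since the search stops at the first match.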

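
The fiscal-year rule and the service retro-fill can be illustrated the same way. The sketch below is a simplified model under stated assumptions: `calendar_year` and `assign_services` are hypothetical names, events are already in reading order, and "Example Club" and "Unlabelled Row" are invented locations.

```python
from typing import Dict, List, Tuple


def calendar_year(fiscal_year: int, month: str) -> int:
    """October to December fall in the calendar year before the fiscal year."""
    return fiscal_year - 1 if month in ("October", "November", "December") else fiscal_year


def assign_services(events: List[Tuple[str, dict]]) -> List[dict]:
    """A 'total' event labels every pending row that shares its (Region, Category)."""
    pending: Dict[Tuple[str, str], List[dict]] = {}
    done: List[dict] = []
    for kind, payload in events:
        key = (payload["Region"], payload["Category"])
        if kind == "row":
            pending.setdefault(key, []).append({**payload, "Service": None})
        else:  # "total": flush the block collected so far for this key
            for row in pending.pop(key, []):
                row["Service"] = payload["Service"]
                done.append(row)
    # Rows never followed by a totals line keep Service = None.
    for rows in pending.values():
        done.extend(rows)
    return done


events = [
    ("row",   {"Region": "Europe", "Category": "Revenue", "Location": "Action Lanes"}),
    ("row",   {"Region": "Europe", "Category": "Revenue", "Location": "Example Club"}),
    ("total", {"Region": "Europe", "Category": "Revenue", "Service": "Army"}),
    ("row",   {"Region": "Europe", "Category": "Revenue", "Location": "Unlabelled Row"}),
]
labelled = assign_services(events)
print([(r["Location"], r["Service"]) for r in labelled])
print(calendar_year(2020, "October"), calendar_year(2020, "January"))
```

Rows left without a Service after such a pass are worth reviewing by hand, since they point to a totals line the patterns did not recognize.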

In [4]:
# ---------------- Month helpers ----------------
MONTH_TOKEN_RE = re.compile(
    r"^(?:\d{2,4}[-/])?(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|"
    r"May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t|tember)?|Oct(?:ober)?|"
    r"Nov(?:ember)?|Dec(?:ember)?)(?:[-/]\d{2,4})?$", re.I
)
MONTH_NAMES_FULL = ["October","November","December","January","February","March",
                    "April","May","June","July","August","September"]
FISCAL_RE = re.compile(r"Fiscal\s+Year\s+(\d{4})", re.I)

def normalise_month(name: str) -> Optional[str]:
    if not name: return None
    s = name.strip().rstrip('.').lower()
    m = {
        "jan":"January","january":"January",
        "feb":"February","february":"February",
        "mar":"March","march":"March",
        "apr":"April","april":"April",
        "may":"May",
        "jun":"June","june":"June",
        "jul":"July","july":"July",
        "aug":"August","august":"August",
        "sep":"September","sept":"September","september":"September",
        "oct":"October","october":"October",
        "nov":"November","november":"November",
        "dec":"December","december":"December",
    }
    for k,v in m.items():
        if s.startswith(k): return v
    return None

def extract_fiscal_year(text: str) -> Optional[int]:
    m = FISCAL_RE.search(text)
    return int(m.group(1)) if m else None

# Detect month column centers and column bins (with tolerance) for coordinate-based projection
def extract_month_columns(page: fitz.Page) -> Tuple[List[str], List[float], List[Tuple[float,float]]]:
    words = page.get_text("words")  # [x0,y0,x1,y1,text,block_no,line_no,word_no]
    lines: Dict[float, List[Tuple[float,float,str]]] = {}
    for x0,y0,x1,y1,txt, *_ in words:
        y = round(y0, 1)
        xc = (x0 + x1) / 2.0
        lines.setdefault(y, []).append((xc, x1 - x0, txt))

    for y, toks in sorted(lines.items(), key=lambda kv: kv[0]):
        month_hits = []
        for xc, w, t in sorted(toks, key=lambda z: z[0]):
            tt = t.strip().replace('.', '')
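            mt = MONTH_TOKEN_RE.match(tt)
            if mt:
                # Normalise the captured month name so prefixed tokens such as "19-Oct" also resolve
                m = normalise_month(mt.group(1))
                if m: month_hits.append((xc, m))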
        if len(month_hits) >= 8:
            month_hits = sorted(month_hits, key=lambda z: z[0])
            centers, names = [], []
            for xc, m in month_hits:
                if m in MONTH_NAMES_FULL and m not in names:
                    names.append(m); centers.append(xc)
            if centers:
                # Fill up to 12 columns
                gap = np.median(np.diff(centers)) if len(centers) >= 2 else 60.0
                first_x = centers[0] - gap*(MONTH_NAMES_FULL.index(names[0]))
                centers_full = [first_x + i*gap for i in range(12)]
                bins = _to_bins_with_tolerance(centers_full)
                return MONTH_NAMES_FULL, centers_full, bins
    # fallback
    centers = [100 + i*60 for i in range(12)]
    return MONTH_NAMES_FULL, centers, _to_bins_with_tolerance(centers)

def _to_bins_with_tolerance(centers: List[float]) -> List[Tuple[float,float]]:
    centers = list(centers)
    gaps = np.diff(centers) if len(centers) >= 2 else np.array([60.0]*11)
    gap_med = float(np.median(gaps)) if len(gaps) else 60.0
    tol = max(8.0, 0.10 * gap_med)  # tolerance: expand left/right
    edges = [centers[0] - gap_med/2] + [(centers[i]+centers[i+1])/2 for i in range(11)] + [centers[-1] + gap_med/2]
    return [(edges[i]-tol, edges[i+1]+tol) for i in range(12)]

# More permissive numeric detection/parsing to avoid zeros from decimals or stray symbols
def is_value_token(tok: str) -> bool:
    s = tok.strip()
    s = re.sub(r"[,\s$]+", "", s)
    s = re.sub(r"[^\d().-]", "", s)
    return bool(re.fullmatch(r"-?\(?\d+(?:\.\d+)?\)?", s))

def parse_number(tok: str) -> Optional[float]:
    s = tok.strip()
    s = re.sub(r"[,\s$]+", "", s)
    s = re.sub(r"[^\d().-]", "", s)
    if s.startswith("(") and s.endswith(")"): s = "-" + s[1:-1]
    try: return float(s)
    except ValueError: return None

# ---------------- Common regex pieces ----------------
TOTAL_WORD   = r"(?:T?otal)"
REGION_NORM  = r"(Europe|Korea|Japan)"  # Far East is excluded from totals
SERVICE_NORM = r"(Army|Navy|USMC)"
REIMB_WORD   = r"(?:NAFI\s+Reimb(?:\.|ursement)?)"
REIMB_TOKEN  = re.compile(r"\bNAFI\s+Reimb(?:\.|ursement)?\b", re.I)

# ---------------- Page pre-filter ----------------
HEADER_PATTERNS = [
    rf"Slot\s+Revenue\s+({REGION_NORM}|Far\s*East)",
    rf"Slot\s+NAFI\s+Reimb(?:\.|ursement)?\s+({REGION_NORM}|Far\s*East)",
    rf"Slot\s+Revenue\s*&\s*NAFI\s+Reimb(?:\.|ursement)?\s+by\s+Month\s*-\s*({REGION_NORM}|Far\s*East)",
    rf"Reimbursement\s+by\s+Month\s*-\s*({REGION_NORM}|Far\s*East)",
    rf"\bnue\s+({REGION_NORM}|Far\s*East)\b",
]
HEADER_RES = [re.compile(p, re.I) for p in HEADER_PATTERNS]

TOTAL_PATTERNS = [
    rf"{TOTAL_WORD}\s+{REGION_NORM}\s+Slot\s+Revenue\s+{SERVICE_NORM}",
    rf"{TOTAL_WORD}\s+{REGION_NORM}\s+{REIMB_WORD}\s+{SERVICE_NORM}",
    rf"Slot\s+Revenue\s+{REGION_NORM}\s+{SERVICE_NORM}",
    rf"Slot\s+NAFI\s+Reimb(?:\.|ursement)?\s+{REGION_NORM}\s+{SERVICE_NORM}",
]
TOTAL_RES = [re.compile(p, re.I) for p in TOTAL_PATTERNS]

MONTH_BAR_RE = re.compile(
    r"October.*November.*December.*January.*February.*March.*April.*May.*June.*July.*August.*September",
    re.I | re.S
)

def page_has_header(text: str) -> bool:
    t = " ".join(text.split())
    return any(p.search(t) for p in HEADER_RES)

def page_has_total(text: str) -> bool:
    t = " ".join(text.split())
    return any(p.search(t) for p in TOTAL_RES)

def page_has_month_bar(text: str) -> bool:
    if MONTH_BAR_RE.search(text): return True
    cnt = sum(1 for m in MONTH_NAMES_FULL if re.search(rf"\b{m}\b", text, re.I))
    return cnt >= 8

def preselect_table_pages(doc: fitz.Document) -> List[int]:
    keep, in_block = [], False
    for i in range(doc.page_count):
        text = doc[i].get_text()
        if page_has_header(text) or page_has_total(text):
            in_block = True; keep.append(i); continue
        if in_block and page_has_month_bar(text):
            keep.append(i); continue
        in_block = False
    return keep

# ---------------- Line-level regex ----------------
HEADER_REV_LINE_RE    = re.compile(rf"Slot\s+Revenue\s+{REGION_NORM}", re.I)
HEADER_REIMB_LINE_RE  = re.compile(rf"Slot\s+NAFI\s+Reimb(?:\.|ursement)?\s+{REGION_NORM}", re.I)
TRUNC_REV_LINE_RE     = re.compile(rf"\b(nue|enue)\s+{REGION_NORM}\b", re.I)

TOP_TITLE_FE_RE = re.compile(
    r"Slot\s+Revenue\s*&\s*NAFI\s+Reimb(?:\.|ursement)?\s+by\s+Month\s*-\s*Far\s*East",
    re.I
)

TOTAL_LINE_A = re.compile(rf"Slot\s+Revenue\s+{REGION_NORM}\s+{SERVICE_NORM}\s*:?", re.I)
TOTAL_LINE_D = re.compile(rf"Slot\s+NAFI\s+Reimb(?:\.|ursement)?\s+{REGION_NORM}\s+{SERVICE_NORM}\s*:?", re.I)
TOTAL_LINE_B = re.compile(rf"{TOTAL_WORD}\s+{REGION_NORM}\s+Slot\s+Revenue\s+{SERVICE_NORM}\s*:?", re.I)
TOTAL_LINE_C = re.compile(rf"{TOTAL_WORD}\s+{REGION_NORM}\s+{REIMB_WORD}\s+{SERVICE_NORM}\s*:?", re.I)
TOTAL_LINE_E = re.compile(
    rf"Slot\s+(?:Revenue|NAFI\s+Reimb(?:\.|ursement)?)\s+{REGION_NORM}\b.*?\b{SERVICE_NORM}\b",
    re.I
)

def _service_norm(s: str) -> str:
    s = s.strip().strip(":).(").upper()
    if s.startswith("USMC"): return "USMC"
    if s.startswith("NAVY"): return "Navy"
    if s.startswith("ARMY"): return "Army"
    return s.title()

def _category_from_line(line: str) -> Optional[str]:
    if re.search(r"\bSlot\s+Revenue\b", line, re.I):
        return "Revenue"
    if REIMB_TOKEN.search(line):
        return "Reimbursement"
    return None

def _pick_total_match(line: str) -> Optional[Tuple[str, str, str]]:
    for pat, cat in [
        (TOTAL_LINE_B, "Revenue"), (TOTAL_LINE_A, "Revenue"),
        (TOTAL_LINE_C, "Reimbursement"), (TOTAL_LINE_D, "Reimbursement"),
        (TOTAL_LINE_E, None),
    ]:
        m = pat.search(line)
        if m:
            region = m.group(1).title()
            service = _service_norm(m.group(2))
            category = cat if cat else _category_from_line(line) or "Revenue"
            return (region, service, category)
    low = line.lower()
    if (("slot revenue" in low) or ("reimb" in low)) and any(r in low for r in ("europe","korea","japan")):
        ms = re.search(r"\b(Army|Navy|USMC)\b[:)]?", line, re.I)
        rg = re.search(rf"\b{REGION_NORM}\b", line, re.I)
        if ms and rg:
            region = rg.group(1).title()
            service = _service_norm(ms.group(1))
            category = "Reimbursement" if "reimb" in low else "Revenue"
            return (region, service, category)
    return None

# ---------------- Slot name builder (hyphen-aware) ----------------
# Merge ["K","-","16"] into "K-16"; drop standalone dashes and pure-symbol tokens
_DASHES = ("-", "\u2013", "\u2014")  # hyphen, en dash, em dash

def build_slot_name(tokens: List[str]) -> str:
    clean = []
    for t in tokens:
        tt = t.strip()
        if not tt: continue
        if tt in _DASHES:  # keep dash token; merging is handled below
            clean.append(tt)
            continue
        # Drop tokens with no letters or digits (pure symbols)
        if not re.search(r"[A-Za-z0-9]", tt):
            continue
        clean.append(tt)

    # Merge: token, dash, token -> "token-token" (absorb repeated patterns)
    out = []
    i = 0
    while i < len(clean):
        cur = clean[i]
        if cur in _DASHES:
            # dash with no neighbors -> skip
            i += 1
            continue
        # If next is a dash and the next-next has alphanum -> merge
        if (i + 2 < len(clean)) and clean[i+1] in _DASHES and re.search(r"[A-Za-z0-9]", clean[i+2]):
            merged = f"{cur}-{clean[i+2]}"
            i += 3
            # absorb successive "- X" patterns
            while (i + 1 < len(clean)) and clean[i] in _DASHES and re.search(r"[A-Za-z0-9]", clean[i+1]):
                merged += f"-{clean[i+1]}"
                i += 2
            out.append(merged)
            continue
        # Otherwise keep as-is
        out.append(cur)
        i += 1

    # Join with spaces
    return " ".join(out).strip()

# ---------------- Page parser ----------------
def parse_page(page: fitz.Page,
               default_fy: Optional[int] = None,
               default_region: Optional[str] = None,
               default_category: Optional[str] = None,
               default_subregion: Optional[str] = None
               ) -> Tuple[List[Dict[str, Any]], Optional[str], Optional[str], Optional[int], List[Tuple[int,str,str,str]], Optional[str]]:

    text = page.get_text()
    fy = extract_fiscal_year(text) or default_fy

    # 12 month columns by coordinates
    header_months, month_centers, month_bins = extract_month_columns(page)
    month_left_edge = month_bins[0][0]          # left boundary of month area
    sep_right_edge  = month_bins[-1][1]         # right boundary of September column (ignore anything to the right)

    words = page.get_text("words")
    # Keep x_center and original boxes per line
    lines: Dict[float, List[Tuple[float, float, float, str]]] = {}
    for x0,y0,x1,y1,txt, *_ in words:
        y = round(y0, 1)
        xc = (x0 + x1) / 2.0
        lines.setdefault(y, []).append((xc, x0, x1, txt.strip()))

    records: List[Dict[str, Any]] = []
    totals_in_order: List[Tuple[int,str,str,str]] = []

    current_country   = default_region
    current_subregion = default_subregion
    current_category  = default_category

    line_no = 0
    for _, toks in sorted(lines.items(), key=lambda x: x[0]):
        line_no += 1
        toks = [(xc,x0,x1,t) for (xc,x0,x1,t) in toks if t and t != "$"]
        if not toks: continue

        # Loc#
        if re.fullmatch(r'\d{3,5}', toks[0][3]):
            toks = toks[1:]
            if not toks: continue

        line = " ".join(t for *_, t in toks)
        low  = line.lower()

        if TOP_TITLE_FE_RE.search(line): continue

        m = HEADER_REV_LINE_RE.search(line)
        if m: current_country, current_category = m.group(1).title(), "Revenue";        current_subregion = None
        m2 = HEADER_REIMB_LINE_RE.search(line)
        if m2: current_country, current_category = m2.group(1).title(), "Reimbursement"; current_subregion = None
        m3 = TRUNC_REV_LINE_RE.search(line)
        if m3: current_country, current_category = m3.group(2).title(), "Revenue";       current_subregion = None

        ts = _pick_total_match(line)
        if ts:
            region, service, category_from_line = ts
            totals_in_order.append((line_no, region, category_from_line, service))
            continue

        if low.startswith("fiscal year"): continue
        if any(k in low for k in (" avg ", "avg slots", "slots avg", "average", " ytd ")): continue
        if "as400" in low or "f&a" in low: continue

        # Treat only numbers to the RIGHT of the left month boundary as month values;
        # numbers on the left are not months
        numeric_positions = [i for i,(xc,_,_,t) in enumerate(toks) if is_value_token(t) and xc >= month_left_edge]
        if not numeric_positions:
            # likely a subregion / section title line
            if any(re.search(r"[A-Za-z]", t) for *_,t in toks):
                month_bar = sum(1 for *_,t in toks if MONTH_TOKEN_RE.match(t)) >= 8
                if month_bar: continue
                if "total" in low or "otal" in low: continue
                current_subregion = " ".join(t for *_,t in toks)
            continue

        if not current_country or not current_category or fy is None:
            continue

        first_num_idx = min(numeric_positions)
        first_num_xc  = toks[first_num_idx][0]

        # slot name = all tokens LEFT of the first month value's center (keep hyphens)
        slot_tokens_raw = [t for (xc,_,_,t) in toks if xc < first_num_xc - 1.0]
        slot_name = build_slot_name(slot_tokens_raw)
        if not slot_name: continue
        if slot_name.isdigit() and len(slot_name) in (3,4): continue
        if slot_name.lower() in ("fiscal year", "fiscal"): continue
        if slot_name.lower().startswith(("total ","otal ")): continue

        # Value projection: ignore numbers to the right of the September boundary;
        # if multiple values fall into the same column, keep the rightmost one
        values_by_col = ["-"] * 12
        best_xc = [-1e9] * 12
        for xc, x0, x1, t in toks[first_num_idx:]:
            if xc > sep_right_edge:
                continue
            if not is_value_token(t):
                continue
            for col_idx, (xL, xR) in enumerate(month_bins):
                if xL <= xc <= xR:
                    if values_by_col[col_idx] == "-" or xc > best_xc[col_idx]:
                        values_by_col[col_idx] = t
                        best_xc[col_idx] = xc
                    break

        for i, raw_val in enumerate(values_by_col):
            mname = MONTH_NAMES_FULL[i]
            yval = (fy - 1) if fy and mname in ("October", "November", "December") else fy
            amount = 0.0
            if raw_val != "-":
                parsed = parse_number(raw_val)
                if parsed is not None: amount = parsed
            records.append({
                "__ord": line_no,
                "Category": current_category,
                "Region": current_country,
                "Base": current_subregion,
                "Location": slot_name,
                "Month": mname,
                "Year": yval,
                "Amount": amount
            })

    return records, current_country, current_category, fy, totals_in_order, current_subregion

# ---------------- Build filtered PDF ----------------
def build_filtered_pdf(src_pdf: str, dst_pdf: str) -> List[int]:
    doc = fitz.open(src_pdf)
    keep_pages = preselect_table_pages(doc)
    new_doc = fitz.open()
    for i in keep_pages:
        new_doc.insert_pdf(doc, from_page=i, to_page=i)
    new_doc.save(dst_pdf)
    new_doc.close()
    doc.close()
    return keep_pages

# ---------------- Parse filtered PDF ----------------
def parse_pdf(pdf_path: str, max_pages: Optional[int] = None) -> pd.DataFrame:
    doc = fitz.open(pdf_path)
    page_indices = list(range(doc.page_count)) if max_pages is None else list(range(min(max_pages, doc.page_count)))

    segments: Dict[Tuple[str,str], List[Dict[str,Any]]] = {}
    final_rows: List[Dict[str,Any]] = []

    last_fy: Optional[int] = None
    last_region: Optional[str]   = None
    last_category: Optional[str] = None
    last_subregion: Optional[str] = None

    for i in page_indices:
        page = doc[i]
        recs, end_region, end_category, end_fy, totals_in_order, end_subregion = parse_page(
            page,
            default_fy=last_fy,
            default_region=last_region,
            default_category=last_category,
            default_subregion=last_subregion
        )
        if end_fy is not None: last_fy = end_fy
        last_region, last_category, last_subregion = end_region, end_category, end_subregion

        events: List[Tuple[int,str,Any]] = []
        for r in recs: events.append((int(r["__ord"]), "row", r))
        for (ord_no, region, category, service) in totals_in_order:
            events.append((ord_no, "total", (region, category, service)))
        events.sort(key=lambda x: x[0])

        for _, etype, payload in events:
            if etype == "row":
                r = payload
                r.pop("__ord", None)  # drop the internal ordering key set in parse_page
                r.setdefault("Service", None)
                key = (r["Region"], r["Category"])
                segments.setdefault(key, []).append(r)
            else:
                region, category, service = payload
                key = (region, category)
                block = segments.get(key, [])
                if block:
                    for rr in block:
                        rr["Service"] = service
                        final_rows.append(rr)
                    segments[key] = []

    doc.close()
    for block in segments.values():
        final_rows.extend(block)

    df = pd.DataFrame(final_rows)
    desired_cols = ["Service","Category","Region","Base","Location","Month","Year","Amount"]
    if not df.empty:
        df = df.reindex(columns=desired_cols).dropna(how="all").reset_index(drop=True)
    else:
        df = pd.DataFrame(columns=desired_cols)
    return df


# ---------------- Run ----------------
kept = build_filtered_pdf(SRC_PDF, FILTERED_PDF)
print(f"✅ Kept pages after filtering: {len(kept)} → {FILTERED_PDF}")

df = parse_pdf(FILTERED_PDF)
print("✅ Parsed rows:", len(df))
display(df.head(50))

# =========================
# Manual patch: Sasebo Navy - Hario (FY2020 Feb–Sep)
# =========================

def _upsert_row(df, base_row, month, year, amount):
    mask = (
        df["Base"].fillna("").str.strip().str.casefold().eq(base_row["Base"].strip().casefold()) &
        df["Location"].fillna("").str.strip().str.casefold().eq(base_row["Location"].strip().casefold()) &
        df["Category"].fillna("").str.strip().str.casefold().eq(base_row["Category"].strip().casefold()) &
        df["Month"].eq(month) &
        df["Year"].eq(year)
    )
    if mask.any():
        df.loc[mask, "Amount"] = float(amount)
    else:
        newr = {
            "Service":   base_row.get("Service", "Navy"),
            "Category":  base_row.get("Category", "Revenue"),
            "Region":   base_row.get("Region", "Japan"),
            "Base": base_row.get("Base", "Sasebo Navy"),
            "Location": base_row.get("Location", "Hario"),
            "Month":     month,
            "Year":      int(year),
            "Amount":    float(amount),
        }
        df.loc[len(df)] = newr

# 1) Find one Hario row as a template; if not found, use a conservative template
_tmpl = df[
    df["Base"].fillna("").str.contains(r"\bSasebo\s+Navy\b", case=False, regex=True) &
    df["Location"].fillna("").str.contains(r"\bHario\b", case=False, regex=True)
].head(1)

if _tmpl.empty:
    _base = {
        "Service":   "Navy",
        "Category":  "Revenue",
        "Region":   "Japan",
        "Base": "Sasebo Navy",
        "Location": "Hario",
    }
else:
    _base = _tmpl.iloc[0].to_dict()

# 2) Patch 2020/Feb–Sep according to the provided figure
hario_2020_fixes = {
    "February":  7568,
    "March":     17708,
    "April":     0,
    "May":       0,
    "June":      0,
    "July":      27862,
    "August":    195,
    "September": 41098,
}
for _m, _amt in hario_2020_fixes.items():
    _upsert_row(df, _base, _m, 2020, _amt)

print("✅ Patched: Sasebo Navy - Hario (FY2020 Feb–Sep)")

df['Base'] = df['Base'].replace('Camp Hanson USMC', 'Camp Hansen USMC')
df['Base'] = df['Base'].replace('Tori Station Army', 'Torii Station Army')
df['Base'] = df['Base'].replace('Garmisch -AFRC', 'Garmisch-AFRC')
df['Base'] = df['Base'].replace(
    to_replace=r'(?i)^\s*hohenfels[\s\-\u2013\u2014]*\s*$',  # trailing hyphen, en dash, or em dash
    value='Hohenfels',
    regex=True
)

df.to_csv(OUT_CSV, index=False)
print("Saved CSV:", OUT_CSV)


✅ Kept pages after filtering: 35 → C:\Users\User\Desktop\BU\25 fall class\DS 701 Tools for Data Science\DS_701_Project\ds-muckrock-liberation\fa25-team-b\pdf\District Revenues FY20-FY24_FILTERED.pdf
✅ Parsed rows: 9876


|   | Service | Category | Region | Base | Location | Month | Year | Amount |
|---|---------|----------|--------|------|----------|-------|------|--------|
| 0 | Army | Revenue | Europe | Ansbach | Action Lanes | October | 2019 | 59201.0 |
| 1 | Army | Revenue | Europe | Ansbach | Action Lanes | November | 2019 | 35460.0 |
| 2 | Army | Revenue | Europe | Ansbach | Action Lanes | December | 2019 | 46220.0 |
| 3 | Army | Revenue | Europe | Ansbach | Action Lanes | January | 2020 | -3084.0 |
| 4 | Army | Revenue | Europe | Ansbach | Action Lanes | February | 2020 | 44116.0 |
| 5 | Army | Revenue | Europe | Ansbach | Action Lanes | March | 2020 | 37220.0 |
| 6 | Army | Revenue | Europe | Ansbach | Action Lanes | April | 2020 | 0.0 |
| 7 | Army | Revenue | Europe | Ansbach | Action Lanes | May | 2020 | 0.0 |
| 8 | Army | Revenue | Europe | Ansbach | Action Lanes | June | 2020 | 0.0 |
| 9 | Army | Revenue | Europe | Ansbach | Action Lanes | July | 2020 | 18656.0 |


✅ Patched: Sasebo Navy - Hario (FY2020 Feb–Sep)
Saved CSV: C:\Users\User\Desktop\BU\25 fall class\DS 701 Tools for Data Science\DS_701_Project\ds-muckrock-liberation\fa25-team-b\CSVs\District Revenue\District_Revenue_filtered_FY20-FY24_final.csv
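
One possible sanity check on the saved CSV: the parser emits 12 monthly records per table row, so each (Category, Region, Base, Location, fiscal year) group would normally hold exactly 12 rows. The sketch below flags groups that do not. `fiscal_year` and `incomplete_groups` are hypothetical helpers, and the two-row sample is a stand-in for the real file.

```python
import csv
import io
from collections import Counter


def fiscal_year(year: int, month: str) -> int:
    """Inverse of the parser's rule: Oct-Dec of calendar year Y belong to FY Y+1."""
    return year + 1 if month in ("October", "November", "December") else year


def incomplete_groups(rows) -> dict:
    """Return groups whose row count is not 12, keyed by (Category, Region, Base, Location, FY)."""
    counts = Counter(
        (r["Category"], r["Region"], r["Base"], r["Location"],
         fiscal_year(int(float(r["Year"])), r["Month"]))
        for r in rows
    )
    return {key: n for key, n in counts.items() if n != 12}


# Tiny stand-in for the real file; in practice use open(OUT_CSV, newline="").
sample = io.StringIO(
    "Service,Category,Region,Base,Location,Month,Year,Amount\n"
    "Army,Revenue,Europe,Ansbach,Action Lanes,October,2019,59201.0\n"
    "Army,Revenue,Europe,Ansbach,Action Lanes,January,2020,-3084.0\n"
)
problems = incomplete_groups(csv.DictReader(sample))
print(problems)
```

A flagged group is not necessarily an error. A location listed twice in the source, or a manually patched row, can also change the count, so treat the result as a list of rows to inspect.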
