**Author:** Dorys Trujillo  
**Project:** Legal Uncertainty Index (IIJ)   
**Data Source:** Ministry of Commerce, Industry and Tourism (MinCIT)  
**Period:** 2018–2025  

#### Environment Setup and Configuration

This block initializes the computational environment required for the OCR preprocessing pipeline. It imports all necessary libraries for PDF handling, image processing, optical character recognition, and data management. It also defines the project directory structure (input, output, and manifest locations), establishes OCR and table-detection parameters, and ensures that required output directories exist.

Keeping configuration parameters and path definitions in a single cell makes runs easier to reproduce, keeps the output layout consistent, and keeps raw inputs separate from processed outputs.

In [14]:
##### IMPORTS ######
from pathlib import Path
from datetime import datetime
import hashlib
#import json
import re
import shutil

import numpy as np
import pandas as pd

import fitz  # PyMuPDF
import cv2
import pytesseract

In [15]:
# Paths
PROJECT_ROOT = Path(r"C:\Users\dtruj\Documentos\proyectos\legal-uncertainty-index")
RAW_ROOT = PROJECT_ROOT / "data_raw" / "mincit"
OUT_ROOT = PROJECT_ROOT / "data_processed" / "mincit" / "pdf_ocr"

# Master manifest (under OCR output root)
MANIFEST_DIR = OUT_ROOT / "manifests"
MANIFEST_MASTER = MANIFEST_DIR / "manifest_ocr.csv"

# OCR configuration
TESSERACT_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
OCR_LANG = "spa+eng"
OCR_DPI = 300
QUICK_DPI = 120  # fast rendering for annex/table detection

# Annex/table detection thresholds
LD_THRESHOLD = 0.020
ANNEX_MIN_CONSEC = 2
ANNEX_MIN_FRAC_REST = 0.60
ANNEX_MIN_START_FRAC = 0.50

# Setup
pytesseract.pytesseract.tesseract_cmd = TESSERACT_EXE
OUT_ROOT.mkdir(parents=True, exist_ok=True)
MANIFEST_DIR.mkdir(parents=True, exist_ok=True)

if not RAW_ROOT.exists():
    raise FileNotFoundError(f"RAW_ROOT not found: {RAW_ROOT}")

# Master manifest schema
MANIFEST_COLUMNS = [
    "timestamp_utc","year","input_rel","input_pdf","sha256_in","output_dir",
    "output_pdf_final","sha256_out","output_text","pages_total","cutoff_page_1indexed",
    "pages_kept","pages_removed","action_pdf","ocr_lang","ocr_dpi","status","error"
]

print("RAW_ROOT:", RAW_ROOT)
print("OUT_ROOT:", OUT_ROOT)
print("MANIFEST_MASTER:", MANIFEST_MASTER)

RAW_ROOT: C:\Users\dtruj\Documentos\proyectos\legal-uncertainty-index\data_raw\mincit
OUT_ROOT: C:\Users\dtruj\Documentos\proyectos\legal-uncertainty-index\data_processed\mincit\pdf_ocr
MANIFEST_MASTER: C:\Users\dtruj\Documentos\proyectos\legal-uncertainty-index\data_processed\mincit\pdf_ocr\manifests\manifest_ocr.csv


#### Utility Functions for File Integrity and Structure Management

This block defines auxiliary functions that support file traceability, structural consistency, and systematic document processing. It includes routines to compute SHA-256 hashes for integrity verification, extract year identifiers from file paths for structured output organization, and recursively enumerate all PDF files within the raw data directory.

These utilities support full document coverage and a transparent link between each input file and its processed outputs. They never modify file contents.

In [16]:
# Utility functions: hashing, year extraction, and file listing

YEAR_RE = re.compile(r"(20(1[8-9]|2[0-5]))")

def sha256_file(path: Path) -> str:
    # Compute SHA-256 hash for file integrity verification
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def find_year(pdf_path: Path) -> str:
    # Extract year from file path using predefined pattern
    m = YEAR_RE.search(str(pdf_path))
    return m.group(1) if m else "unknown_year"

def list_pdfs(root: Path) -> list[Path]:
    # Recursively list all PDF files under the given root directory
    return sorted(root.rglob("*.pdf"))
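
As a quick illustration of how `find_year` behaves, the sketch below re-declares the pattern and applies it to a few hypothetical paths (the file names are made up for this example). It also shows that `search` returns the first match anywhere in the path string, so a year-like digit run inside a file name can be picked up when no year folder is present.

```python
import re
from pathlib import PurePosixPath

# Same pattern as in the cell above: matches 2018 through 2025
YEAR_RE = re.compile(r"(20(1[8-9]|2[0-5]))")

def find_year(pdf_path: PurePosixPath) -> str:
    m = YEAR_RE.search(str(pdf_path))
    return m.group(1) if m else "unknown_year"

# Hypothetical paths, for illustration only
samples = [
    PurePosixPath("2019/DECRETO-10.pdf"),        # year taken from the folder
    PurePosixPath("2018/Proyecto-2020-VF.pdf"),  # the first match in the path wins
    PurePosixPath("misc/DECRETO-10.pdf"),        # no year anywhere in the path
    PurePosixPath("misc/Rad-120215.pdf"),        # digits inside a longer number
]
results = {str(p): find_year(p) for p in samples}
for path, year in results.items():
    print(f"{path} -> {year}")
```

Because the pipeline later takes the year from the first folder under `RAW_ROOT`, this fallback matters mainly for the error rows written by the batch runner.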

#### Table-Like Page Detection

This block performs image-based structural analysis to identify table-like pages within each PDF. Pages are rendered at low resolution, processed to detect horizontal and vertical line patterns, and classified using a line-density threshold. The resulting page-level indicators are later used to determine whether a table-based annex exists at the end of the document.

In [17]:
# Table-like page detection based on image analysis

def render_page_gray(page: fitz.Page, dpi: int = 120) -> np.ndarray:
    # Render a PDF page to a grayscale image at the specified resolution
    zoom = dpi / 72.0
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat, alpha=False)
    img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    if img.shape[2] == 4:
        img = img[:, :, :3]
    # PyMuPDF pixmaps store samples in RGB order, not OpenCV's default BGR
    gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    return gray

def compute_line_density(gray: np.ndarray) -> float:
    # Estimate structural line density to identify table-like layouts
    bw = cv2.adaptiveThreshold(
        gray, 255,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY_INV,
        25, 15
    )
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    h = cv2.morphologyEx(bw, cv2.MORPH_OPEN, h_kernel, iterations=1)
    v = cv2.morphologyEx(bw, cv2.MORPH_OPEN, v_kernel, iterations=1)
    lines = cv2.bitwise_or(h, v)
    return float((lines.sum() / 255.0) / (lines.shape[0] * lines.shape[1]))

def detect_table_like_pages(pdf_path: Path, quick_dpi=120, ld_threshold=0.02):
    # Identify pages exhibiting high line density consistent with tabular structures
    doc = fitz.open(str(pdf_path))
    flags, densities = [], []
    for i in range(doc.page_count):
        page = doc.load_page(i)
        gray = render_page_gray(page, dpi=quick_dpi)
        ld = compute_line_density(gray)
        flags.append(ld >= ld_threshold)
        densities.append(ld)
    n_pages = doc.page_count
    doc.close()
    return flags, densities, n_pages
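
To make the line-density measure concrete, the sketch below reproduces the idea without OpenCV on small synthetic binary images. It relies on the fact that, for a binary image, opening with a 1 x k rectangular kernel keeps the foreground pixels that lie in horizontal runs of at least k pixels (border handling aside); the vertical case is the same applied to columns. The helper names `long_run_mask` and `line_density`, the image size, and the synthetic pages are assumptions for illustration, not the notebook's implementation.

```python
def long_run_mask(rows, min_run):
    """Mark pixels that belong to a horizontal run of >= min_run foreground pixels."""
    mask = [[False] * len(r) for r in rows]
    for y, row in enumerate(rows):
        x = 0
        while x < len(row):
            if not row[x]:
                x += 1
                continue
            start = x
            while x < len(row) and row[x]:
                x += 1
            if x - start >= min_run:
                for k in range(start, x):
                    mask[y][k] = True
    return mask

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def line_density(img, min_run):
    """Fraction of pixels lying on long horizontal or vertical runs."""
    h = long_run_mask(img, min_run)
    v = transpose(long_run_mask(transpose(img), min_run))
    height, width = len(img), len(img[0])
    hits = sum(h[y][x] or v[y][x] for y in range(height) for x in range(width))
    return hits / (height * width)

SIZE, MIN_RUN = 100, 40  # MIN_RUN mirrors the 40-pixel kernels used above

# Synthetic "table" page: grid lines every 20 pixels
table = [[(x % 20 == 0) or (y % 20 == 0) for x in range(SIZE)] for y in range(SIZE)]

# Synthetic "text" page: short 5-pixel dashes on every fourth row
text = [[(y % 4 == 0) and (x % 8 < 5) for x in range(SIZE)] for y in range(SIZE)]

ld_table = line_density(table, MIN_RUN)
ld_text = line_density(text, MIN_RUN)
print(f"table-like: {ld_table:.4f}, text-like: {ld_text:.4f}")
```

On these toy inputs the grid page lands well above the 0.02 threshold while the dashed page scores zero, which is the separation the threshold is meant to exploit. Real scans are noisier, so the threshold would still need checking against actual documents.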

#### Annex Cutoff Detection

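This block analyzes the sequence of table-like page flags to decide whether the document ends with an annex made up mostly of tables. The criteria are conservative. Documents with fewer than six pages are never cut. A candidate start page must lie in the second half of the document (ANNEX_MIN_START_FRAC = 0.50), must open a run of at least two consecutive table-like pages (ANNEX_MIN_CONSEC = 2), and at least 60% of the pages from that point to the end must be table-like (ANNEX_MIN_FRAC_REST = 0.60). The output is the 1-indexed page number at which the annex starts, or None if no annex is detected.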

In [18]:
# Detect annex cutoff point based on table-like patterns at the end of the document

def detect_annex_cutoff(
    table_flags: list[bool],
    min_consec=2,
    min_frac_rest=0.60,
    min_start_frac=0.50
):
    # Determine the page index where annex-like content begins
    n = len(table_flags)
    if n < 6:
        return None

    # Restrict cutoff search to the second half of the document
    start_min = int(n * min_start_frac)

    for i in range(start_min, n):
        if not table_flags[i]:
            continue

        # Count consecutive table-like pages starting at position i
        consec = 1
        j = i + 1
        while j < n and table_flags[j]:
            consec += 1
            j += 1

        # Validate minimum consecutive pages and overall table density
        if consec >= min_consec:
            rest = table_flags[i:]
            frac = sum(rest) / len(rest)
            if frac >= min_frac_rest:
                return i + 1  # Return 1-indexed page number

    return None
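
The rule above can be traced on a toy document. The following sketch re-runs the same three checks and records why each candidate page is accepted or rejected. The helper `explain_cutoff` and the 10-page flag sequence are hypothetical and exist only for this illustration.

```python
def explain_cutoff(flags, min_consec=2, min_frac_rest=0.60, min_start_frac=0.50):
    """Re-run the cutoff rule and record why each candidate page passes or fails."""
    n = len(flags)
    notes = []
    if n < 6:
        return None, [f"only {n} pages: documents under 6 pages are never cut"]
    for i in range(int(n * min_start_frac), n):
        if not flags[i]:
            notes.append(f"page {i + 1}: not table-like")
            continue
        # Length of the table-like run that starts at page i
        consec = 0
        while i + consec < n and flags[i + consec]:
            consec += 1
        # Share of table-like pages from page i to the end
        frac = sum(flags[i:]) / (n - i)
        if consec < min_consec:
            notes.append(f"page {i + 1}: run of {consec} is too short")
        elif frac < min_frac_rest:
            notes.append(f"page {i + 1}: only {frac:.0%} of remaining pages are table-like")
        else:
            notes.append(f"page {i + 1}: cutoff (run={consec}, tail fraction={frac:.0%})")
            return i + 1, notes
    return None, notes

# Hypothetical 10-page document: T = table-like page, . = ordinary page
flags = [c == "T" for c in "...T..T.TT"]
cutoff, notes = explain_cutoff(flags)
print("cutoff:", cutoff)
print("\n".join(notes))
```

In this example the table on page 4 is ignored because it sits in the first half, the isolated table on page 7 fails the run-length check, and the cutoff falls on page 9.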

#### Truncated PDF Generation

This block creates a cleaned version of the document when an annex cutoff has been detected. It generates a new PDF containing only the pages before the cutoff page (from the first page up to, but not including, the cutoff page), which removes the table-based annex at the end. This truncated document becomes the standardized input for subsequent OCR processing.

In [19]:
# Save truncated PDF up to the detected annex cutoff

def save_truncated_pdf(input_pdf: Path, output_pdf: Path, cutoff_page_1indexed: int):
    # Create a new PDF containing pages prior to the cutoff
    src = fitz.open(str(input_pdf))
    dst = fitz.open()

    last_keep = cutoff_page_1indexed - 1
    if last_keep < 1:
        src.close()
        dst.close()
        raise ValueError("Invalid cutoff: would remove entire document.")

    # Insert pages from the beginning up to the last page to retain
    dst.insert_pdf(src, from_page=0, to_page=last_keep - 1)

    # Ensure destination directory exists before saving
    output_pdf.parent.mkdir(parents=True, exist_ok=True)
    dst.save(str(output_pdf))

    dst.close()
    src.close()
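
The index arithmetic is easy to get wrong, because the cutoff is 1-indexed while PyMuPDF page ranges are 0-indexed and inclusive. The sketch below spells out the mapping in a small pure-Python helper. `truncation_plan` is a hypothetical name, and its upper-bound check is stricter than the function above, which only rejects cutoffs that would leave no pages.

```python
def truncation_plan(n_pages: int, cutoff_page_1indexed: int) -> dict:
    """Translate a 1-indexed cutoff page into the 0-indexed page range to keep."""
    pages_kept = cutoff_page_1indexed - 1
    # Assumption for this sketch: a valid cut keeps at least one page
    # and removes at least one page.
    if not 1 <= pages_kept < n_pages:
        raise ValueError("cutoff must keep at least one page and remove at least one")
    return {
        "from_page": 0,             # 0-indexed, inclusive
        "to_page": pages_kept - 1,  # 0-indexed, inclusive
        "pages_kept": pages_kept,
        "pages_removed": n_pages - pages_kept,
    }

# Hypothetical 10-page document whose annex starts on page 7
plan = truncation_plan(n_pages=10, cutoff_page_1indexed=7)
print(plan)
```

Here pages 1 to 6 are kept (0-indexed 0 to 5) and pages 7 to 10 are removed.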

#### Page-Level OCR Processing

This block performs optical character recognition (OCR) on the cleaned PDF document, processing each page individually. Pages are rendered as images, lightly preprocessed (contrast normalization followed by adaptive thresholding), and passed to Tesseract to extract text. The page-level outputs are concatenated into a single text file in which each page starts with a "### Page N" header.

In [20]:
# Perform page-level OCR on the processed PDF

def ocr_pdf_to_text(pdf_path: Path, dpi: int, lang: str) -> str:
    # Extract text from each page using Tesseract OCR
    doc = fitz.open(str(pdf_path))
    parts = []

    for i in range(doc.page_count):
        page = doc.load_page(i)
        gray = render_page_gray(page, dpi=dpi)

        # Basic image preprocessing to enhance OCR accuracy
        gray = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)
        bw = cv2.adaptiveThreshold(
            gray, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            31, 11
        )

        # Tesseract configuration: default engine mode (--oem 3) and
        # page segmentation as a single uniform block of text (--psm 6)
        config = "--oem 3 --psm 6"
        txt = pytesseract.image_to_string(bw, lang=lang, config=config).strip()

        # Append page header for traceability
        parts.append(f"\n### Page {i+1}\n{txt}")

    doc.close()
    return "\n".join(parts).strip()
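
Because every page is prefixed with a `### Page N` header, the concatenated text can be split back into pages downstream. The sketch below shows one possible way to do that with a regular expression. The helper `split_pages` and the sample string are assumptions for illustration, and the approach would misfire if the OCR text itself contained a line of the form `### Page N`.

```python
import re

# Matches the page headers written by ocr_pdf_to_text, one per line
PAGE_HEADER_RE = re.compile(r"^### Page (\d+)$", re.MULTILINE)

def split_pages(text: str) -> dict[int, str]:
    """Map page number -> page text, using the '### Page N' markers."""
    pieces = PAGE_HEADER_RE.split(text)
    # With one capture group, split returns
    # [preamble, "1", text_1, "2", text_2, ...]
    return {int(num): body.strip() for num, body in zip(pieces[1::2], pieces[2::2])}

# Hypothetical OCR output in the same layout (page 2 is empty)
sample = "### Page 1\nDECRETO NUMERO\n\n### Page 2\n\n\n### Page 3\nARTICULO 1."
pages = split_pages(sample)
for number, body in pages.items():
    print(number, repr(body))
```

Empty pages are kept as empty strings so that page numbers stay aligned with the PDF.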

#### Single-Document Processing Pipeline

This block orchestrates the complete processing workflow for a single PDF file. It detects potential annexes, generates a cleaned version of the document (if required), applies OCR to the resulting PDF, and produces both page-level text output and detailed metadata records. The block ensures structural consistency, reproducibility, and full traceability for each processed document.

In [21]:
# Process a single PDF end-to-end (annex detection, optional truncation, OCR, and per-file manifest)

import json  # Path is already imported in the setup cell

def process_one_pdf(pdf_path: Path) -> dict:
    ts = datetime.utcnow().isoformat(timespec="seconds") + "Z"
    sha_in = sha256_file(pdf_path)

    # Derive year and relative input path from the raw directory structure
    rel_path = pdf_path.relative_to(RAW_ROOT)          # e.g., 2018/Document.pdf
    input_rel = rel_path.as_posix()                    # e.g., "2018/Document.pdf"
    year_dir = rel_path.parts[0]                       # e.g., "2018"
    base_name = pdf_path.stem                          # Filename without extension

    # Define per-year output layout
    out_year_dir = OUT_ROOT / year_dir
    out_pdf_dir = out_year_dir / "pdf"
    out_txt_dir = out_year_dir / "txt"
    out_json_dir = out_year_dir / "json"

    out_pdf_dir.mkdir(parents=True, exist_ok=True)
    out_txt_dir.mkdir(parents=True, exist_ok=True)
    out_json_dir.mkdir(parents=True, exist_ok=True)

    # Define output file paths (no per-document subdirectories)
    out_pdf_final = out_pdf_dir / f"{base_name}.pdf"
    out_txt = out_txt_dir / f"{base_name}.txt"
    out_manifest = out_json_dir / f"{base_name}.json"

    # Detect table-like pages to infer an annex cutoff near the document end
    table_flags, line_densities, n_pages = detect_table_like_pages(
        pdf_path, quick_dpi=QUICK_DPI, ld_threshold=LD_THRESHOLD
    )
    cutoff = detect_annex_cutoff(
        table_flags,
        min_consec=ANNEX_MIN_CONSEC,
        min_frac_rest=ANNEX_MIN_FRAC_REST,
        min_start_frac=ANNEX_MIN_START_FRAC
    )

    # Produce the final PDF (either copied as-is or truncated before annex pages)
    if cutoff is None:
        shutil.copy2(pdf_path, out_pdf_final)
        pages_kept = n_pages
        pages_removed = 0
        action_pdf = "copied_no_cut"
    else:
        save_truncated_pdf(pdf_path, out_pdf_final, cutoff_page_1indexed=cutoff)
        pages_kept = cutoff - 1
        pages_removed = n_pages - pages_kept
        action_pdf = "cut_annex_tables"

    # Run OCR on the final PDF and persist the extracted text
    text = ocr_pdf_to_text(out_pdf_final, dpi=OCR_DPI, lang=OCR_LANG)
    out_txt.write_text(text, encoding="utf-8", errors="ignore")

    sha_out = sha256_file(out_pdf_final)

    # Persist a per-document JSON manifest for traceability and diagnostics
    manifest = {
        "timestamp_utc": ts,
        "year": year_dir,
        "input_pdf": str(pdf_path),
        "input_rel": input_rel,
        "sha256_in": sha_in,
        "output_dir": str(out_year_dir),
        "output_pdf_final": str(out_pdf_final),
        "sha256_out": sha_out,
        "output_text": str(out_txt),
        "pages_total": n_pages,
        "cutoff_page_1indexed": cutoff,
        "pages_kept": pages_kept,
        "pages_removed": pages_removed,
        "action_pdf": action_pdf,
        "ocr": {"lang": OCR_LANG, "dpi": OCR_DPI},
        "table_detection": {
            "quick_dpi": QUICK_DPI,
            "ld_threshold": LD_THRESHOLD,
            "line_densities": line_densities,
            "table_flags": table_flags
        }
    }
    out_manifest.write_text(
        json.dumps(manifest, ensure_ascii=False, indent=2),
        encoding="utf-8"
    )

    # Return a single-row summary for the master manifest
    row = {
        "timestamp_utc": ts,
        "year": year_dir,
        "input_rel": input_rel,
        "input_pdf": str(pdf_path),
        "sha256_in": sha_in,
        "output_dir": str(out_year_dir),
        "output_pdf_final": str(out_pdf_final),
        "sha256_out": sha_out,
        "output_text": str(out_txt),
        "pages_total": n_pages,
        "cutoff_page_1indexed": cutoff,
        "pages_kept": pages_kept,
        "pages_removed": pages_removed,
        "action_pdf": action_pdf,
        "ocr_lang": OCR_LANG,
        "ocr_dpi": OCR_DPI,
        "status": "ok",
        "error": ""
    }
    return row

#### Batch Processing and Master Manifest Generation

This block executes the full processing pipeline across all PDF files in the raw data directory. It iteratively applies the single-document workflow, handles errors without interrupting execution, and consolidates results into a master manifest file. The output ensures full corpus coverage, traceability, and structured documentation of all processing outcomes.

In [22]:
# Runner to process all PDFs under RAW_ROOT without skipping any files

def run_all_pdfs():
    pdfs = list_pdfs(RAW_ROOT)
    print("PDFs found:", len(pdfs))

    rows = []
    for i, pdf in enumerate(pdfs, 1):
        print(f"[{i}/{len(pdfs)}] Processing: {pdf.name}")
        try:
            row = process_one_pdf(pdf)
            rows.append(row)
        except Exception as e:
            # Record a minimal error row to keep full coverage in the master manifest
            rows.append({
                "timestamp_utc": datetime.utcnow().isoformat(timespec="seconds") + "Z",
                "year": find_year(pdf),
                # Same relative POSIX form as in successful rows
                "input_rel": pdf.relative_to(RAW_ROOT).as_posix(),
                "input_pdf": str(pdf),
                "sha256_in": "",
                "output_dir": "",
                "output_pdf_final": "",
                "sha256_out": "",
                "output_text": "",
                "pages_total": "",
                "cutoff_page_1indexed": "",
                "pages_kept": "",
                "pages_removed": "",
                "action_pdf": "",
                "ocr_lang": OCR_LANG,
                "ocr_dpi": OCR_DPI,
                "status": "error",
                "error": repr(e)
            })

    # Enforce the declared manifest schema and column order
    df = pd.DataFrame(rows, columns=MANIFEST_COLUMNS)

    # Write the master manifest to disk
    MANIFEST_MASTER.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(MANIFEST_MASTER, index=False, encoding="utf-8")

    print("Master manifest written to:", MANIFEST_MASTER)
    print(df["status"].value_counts())
    return df
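
The hashes stored in the manifest can later be used to check that an output file has not changed since it was processed. The sketch below illustrates one way to do this for a single manifest row. `check_row` is a hypothetical helper, and the demonstration uses a throwaway file rather than a real processed PDF.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path: Path) -> str:
    # Same chunked hashing approach as the utility defined earlier
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def check_row(row: dict) -> str:
    """Compare the recorded output hash with the file currently on disk."""
    out_pdf = Path(row["output_pdf_final"])
    if not out_pdf.is_file():
        return "missing"
    return "ok" if sha256_file(out_pdf) == row["sha256_out"] else "changed"

# Demonstration on a placeholder file standing in for a processed PDF
with tempfile.TemporaryDirectory() as tmp:
    fake_pdf = Path(tmp) / "example.pdf"
    fake_pdf.write_bytes(b"%PDF-1.4 placeholder")
    row = {"output_pdf_final": str(fake_pdf), "sha256_out": sha256_file(fake_pdf)}

    status_before = check_row(row)   # file untouched
    fake_pdf.write_bytes(b"%PDF-1.4 modified")
    status_after = check_row(row)    # contents changed
    fake_pdf.unlink()
    status_missing = check_row(row)  # file deleted

print(status_before, status_after, status_missing)
```

Applied to the master manifest, the same check could be run per row, for example with `df.apply(lambda r: check_row(r.to_dict()), axis=1)`.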

#### Execution and Sanity Check

This block runs the batch pipeline and keeps the consolidated manifest DataFrame in memory. The runner prints the status counts as a basic check, and the preview call is left commented out. It serves as the notebook entry point for execution.

In [23]:
# Execute the full OCR pipeline and load the resulting master manifest into memory
df_manifest = run_all_pdfs()
# Display the first rows for verification
#df_manifest.head()

PDFs found: 329
[1/329] Processing: 1-Proyecto-de-Decreto-de-Tamano-Empresarial-VF.pdf
[2/329] Processing: DECRETO-10.pdf
[3/329] Processing: DECRETO-11.pdf
[4/329] Processing: DECRETO-12.pdf
[5/329] Processing: DECRETO-13.pdf
[6/329] Processing: DECRETO-14.pdf
[7/329] Processing: DECRETO-15.pdf
[8/329] Processing: DECRETO-16.pdf
[9/329] Processing: DECRETO-17.pdf
[10/329] Processing: DECRETO-19.pdf
[11/329] Processing: DECRETO-2.pdf
[12/329] Processing: DECRETO-20.pdf
[13/329] Processing: DECRETO-21.pdf
[14/329] Processing: DECRETO-22.pdf
[15/329] Processing: DECRETO-23.pdf
[16/329] Processing: DECRETO-24.pdf
[17/329] Processing: DECRETO-25.pdf
[18/329] Processing: DECRETO-26.pdf
[19/329] Processing: DECRETO-27.pdf
[20/329] Processing: DECRETO-28.pdf
[21/329] Processing: DECRETO-29.pdf
[22/329] Processing: DECRETO-3.pdf
[23/329] Processing: DECRETO-30.pdf
[24/329] Processing: DECRETO-31.pdf
[25/329] Processing: DECRETO-32.pdf
[26/329] Processing: DECRETO-5.pdf
[27/329] Processing: DEC

### Post-Processing Review and Validation of OCR Outputs


In [24]:
# Load the master OCR manifest from disk for inspection or post-processing
import pandas as pd

# The runner writes the manifest to MANIFEST_MASTER with the default comma separator
df = pd.read_csv(MANIFEST_MASTER)

# Preview the first rows if needed
# print(df.head())

In [25]:
# Count non-null error entries in the master manifest
num_errors = df["error"].notna().sum()
print("Number of errors:", num_errors)

Number of errors: 0


In [26]:
# Compute frequency distribution of processing status values

status_counts = df["status"].value_counts(dropna=False)
print(status_counts)

status
ok    329
Name: count, dtype: int64


In [27]:
# Compute frequency distribution of PDF processing actions

action_counts = df["action_pdf"].value_counts(dropna=False)
print(action_counts)

action_pdf
copied_no_cut       310
cut_annex_tables     19
Name: count, dtype: int64


In [28]:
# Identify documents where annex tables were removed

cut_annex_tables = (
    df.loc[df["action_pdf"] == "cut_annex_tables", ["input_rel", "action_pdf"]]
    .sort_values(by="input_rel")
    .reset_index(drop=True)
)

print(cut_annex_tables.to_string(index=False))

                                                                                  input_rel       action_pdf
                                                                         2018/DECRETO-2.pdf cut_annex_tables
           2019/Proyecto-decreto-desgravacion-arancelaria-TLC-Col-ISRAEL-Sept-20-2019-P.pdf cut_annex_tables
                                                          2019/Pyto-Dcto-ATI-Rev-040619.pdf cut_annex_tables
                                           2019/Pyto-Decreto-Profundizacion-COL-GT-VDef.pdf cut_annex_tables
                                                           2019/PytoD-ActualizacionD272.pdf cut_annex_tables
                                                                   2020/19-11-20-PD-ZFs.pdf cut_annex_tables
                                   2020/PD-Desgravacion-Arancelaria-TLC-Israel-22-01-20.pdf cut_annex_tables
                                        2020/Proy-Modif-Dec-2147-16-ZFs-18112020-ingles.pdf cut_annex_tables
                   

In [29]:
# Compute the number of documents with and without removed pages

no_pages_removed_count = (df["pages_removed"] == 0).sum()
pages_removed_count = (df["pages_removed"] > 0).sum()

print("Observations without pages_removed:", no_pages_removed_count)
print("Observations with pages_removed:", pages_removed_count)

Observations without pages_removed: 310
Observations with pages_removed: 19


In [30]:
# Identify documents with one or more pages removed during processing

removed_pages_table = (
    df.loc[df["pages_removed"] > 0, ["input_rel", "pages_removed"]]
    .sort_values(by="pages_removed", ascending=False)
    .reset_index(drop=True)
)

print(removed_pages_table.to_string(index=False))

                                                                                  input_rel  pages_removed
                             2022/18-02-2022-Proyecto-Decreto-Implementacion-tlc-Col-UK.pdf             38
                                   2020/PD-Desgravacion-Arancelaria-TLC-Israel-22-01-20.pdf             23
           2019/Proyecto-decreto-desgravacion-arancelaria-TLC-Col-ISRAEL-Sept-20-2019-P.pdf             18
                            2024/10-05-2024-PD-Certificado-de-Reembolso-Tributario-CERT.pdf             16
              2024/26-12-2024-PD-reglamenta-el-Certificado-de-Reembolso-Tributario-CERT.pdf             15
                                        2020/Proy-Modif-Dec-2147-16-ZFs-18112020-ingles.pdf             14
                                                                   2020/19-11-20-PD-ZFs.pdf             11
                                           2019/Pyto-Decreto-Profundizacion-COL-GT-VDef.pdf              8
                                     

In [31]:
# Compare the document sets in cut_annex_tables and removed_pages_table
import pandas as pd

# --- Build name sets from the two tables ---
names_cut = set(cut_annex_tables["input_rel"].dropna().astype(str).str.strip())
names_removed = set(removed_pages_table["input_rel"].dropna().astype(str).str.strip())

# --- Quick equality check ---
are_identical = (names_cut == names_removed)
print("Are the document names identical?:", are_identical)
print("Count (cut_annex_tables):", len(names_cut))
print("Count (pages_removed > 0):", len(names_removed))
print("Intersection:", len(names_cut & names_removed))

# --- Detailed comparison table ---
comparison = (
    pd.DataFrame({"input_rel": sorted(names_cut | names_removed)})
    .assign(
        in_cut=lambda d: d["input_rel"].isin(names_cut),
        in_removed=lambda d: d["input_rel"].isin(names_removed),
    )
    .assign(
        match=lambda d: d["in_cut"] & d["in_removed"]
    )
    .sort_values(["match", "input_rel"], ascending=[True, True])
    .reset_index(drop=True)
)

print(comparison.to_string(index=False))

# --- Differences (optional, but useful) ---
only_in_cut = sorted(names_cut - names_removed)
only_in_removed = sorted(names_removed - names_cut)

print("\nOnly in cut_annex_tables:", len(only_in_cut))
print("\n".join(only_in_cut[:50]) if only_in_cut else "None")

print("\nOnly in removed_pages_table:", len(only_in_removed))
print("\n".join(only_in_removed[:50]) if only_in_removed else "None")

Are the document names identical?: True
Count (cut_annex_tables): 19
Count (pages_removed > 0): 19
Intersection: 19
                                                                                  input_rel  in_cut  in_removed  match
                                                                         2018/DECRETO-2.pdf    True        True   True
           2019/Proyecto-decreto-desgravacion-arancelaria-TLC-Col-ISRAEL-Sept-20-2019-P.pdf    True        True   True
                                                          2019/Pyto-Dcto-ATI-Rev-040619.pdf    True        True   True
                                           2019/Pyto-Decreto-Profundizacion-COL-GT-VDef.pdf    True        True   True
                                                           2019/PytoD-ActualizacionD272.pdf    True        True   True
                                                                   2020/19-11-20-PD-ZFs.pdf    True        True   True
                                   2020/PD-Desgrava

In [32]:
# Inspect top documents with annex truncation by number of removed pages

df.loc[df["action_pdf"] == "cut_annex_tables",
       ["input_rel", "pages_total", "pages_removed", "cutoff_page_1indexed"]
      ] \
  .sort_values("pages_removed", ascending=False) \
  .head(10)

Unnamed: 0,input_rel,pages_total,pages_removed,cutoff_page_1indexed
206,2022/18-02-2022-Proyecto-Decreto-Implementacio...,76,38,39.0
125,2020/PD-Desgravacion-Arancelaria-TLC-Israel-22...,45,23,23.0
73,2019/Proyecto-decreto-desgravacion-arancelaria...,35,18,18.0
259,2024/10-05-2024-PD-Certificado-de-Reembolso-Tr...,31,16,16.0
278,2024/26-12-2024-PD-reglamenta-el-Certificado-d...,30,15,16.0
134,2020/Proy-Modif-Dec-2147-16-ZFs-18112020-ingle...,43,14,30.0
108,2020/19-11-20-PD-ZFs.pdf,44,11,34.0
85,2019/Pyto-Decreto-Profundizacion-COL-GT-VDef.pdf,15,8,8.0
10,2018/DECRETO-2.pdf,15,8,8.0
81,2019/Pyto-Dcto-ATI-Rev-040619.pdf,14,7,8.0


In [33]:
# Compute descriptive statistics for documents with annex truncation

df_cut = df[df["action_pdf"] == "cut_annex_tables"].copy()
# Calculate the proportion of pages removed per document
df_cut["pct_removed"] = df_cut["pages_removed"] / df_cut["pages_total"]
# Generate summary statistics for removed pages and removal ratio
summary = df_cut[["pages_removed", "pct_removed"]].describe()
summary

Unnamed: 0,pages_removed,pct_removed
count,19.0,19.0
mean,10.210526,0.466204
std,8.991224,0.095799
min,3.0,0.25
25%,3.0,0.464286
50%,7.0,0.5
75%,14.5,0.524731
max,38.0,0.555556


In [34]:
# Rank documents with annex truncation by proportion of pages removed

df_cut[["pages_total", "pages_removed", "pct_removed"]] \
    .sort_values("pct_removed", ascending=False) \
    .head(10)

Unnamed: 0,pages_total,pages_removed,pct_removed
161,9,5,0.555556
267,11,6,0.545455
86,13,7,0.538462
10,15,8,0.533333
85,15,8,0.533333
259,31,16,0.516129
73,35,18,0.514286
125,45,23,0.511111
81,14,7,0.5
216,6,3,0.5


In [35]:
# Compute mean and standard deviation of the proportion of pages removed

df_cut["pct_removed"].mean(), df_cut["pct_removed"].std()

(np.float64(0.4662040838828457), np.float64(0.09579932079026335))