In [1]:
from dotenv import load_dotenv
import os
from openai import OpenAI
import re

In [2]:
load_dotenv()

True

## Setup Model

In [3]:
api_key = os.getenv("OPENAI_API_KEY")

In [4]:
client = OpenAI(api_key=api_key)
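
`os.getenv` returns `None` when the variable is unset. A minimal sketch of an explicit check that names the missing variable, assuming a hypothetical helper `require_env` (not part of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is missing or empty.
    `require_env` is a hypothetical helper introduced for illustration.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file."
        )
    return value
```

With this in place, the key lookup could read `api_key = require_env("OPENAI_API_KEY")`.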

In [5]:
prompt = """You are analyzing an academic research paper. Provide a structured summary:

1. RESEARCH QUESTION: What problem does this paper address?
2. METHODOLOGY: How did they approach it?
3. DATA/MATERIALS: What did they use?
4. KEY FINDINGS: What did they discover? (3-5 points)
5. CONTRIBUTIONS: What's novel about this work?
6. LIMITATIONS: What are the weaknesses or limitations?

Be specific and use the paper's own terminology.

Paper text: """

In [7]:
paper_text = ""
with open("data/output_PyPDF2.txt", "r", encoding="utf8") as f:
    paper_text = f.read()

In [8]:
response = client.responses.create(
    model="gpt-4.1-mini",
    input=prompt + paper_text,
    temperature=0.2,
)
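
The request above appends the full extracted text to the prompt. For long papers it may be worth bounding the input first. A rough sketch, assuming a hypothetical helper `build_input` and an arbitrary character budget `max_chars` (real model limits are counted in tokens, which this does not measure):

```python
def build_input(prompt: str, paper_text: str, max_chars: int = 200_000) -> str:
    """Concatenate the prompt and the paper text.

    The paper text is cut to at most `max_chars` characters. The default
    of 200_000 is a placeholder chosen for illustration, not a documented
    limit of any model.
    """
    if len(paper_text) > max_chars:
        paper_text = paper_text[:max_chars]
    return prompt + paper_text
```

The result can be passed as the `input` argument in place of the plain concatenation.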

In [9]:
response_text = response.output_text

In [10]:
response_text

'**Structured Summary**\n\n1. **RESEARCH QUESTION**  \n   The paper addresses the problem of how changes in external data sources impact the production of official statistics when using machine learning (ML). Specifically, it investigates the risks, liabilities, and uncertainties associated with changing data sources and their repercussions on the integrity, reliability, and neutrality of ML-driven official statistics.\n\n2. **METHODOLOGY**  \n   The authors conduct a conceptual and analytical overview rather than empirical experimentation. They:  \n   - Review the nature and challenges of using external data sources in ML for official statistics.  \n   - Categorize and analyze types and causes of data source changes (technical, legal, ethical, ownership-related).  \n   - Discuss the consequences of such changes on statistical production and ML model performance.  \n   - Propose a checklist of risks and mitigation strategies based on literature and practical examples.  \n   - Provide r

## Clean result

In [11]:
def clean_llm_output(text: str) -> str:
    """Strip numbering, markdown, and LaTeX remnants from an LLM response."""

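    # Remove unformatted section headers like "1. RESEARCH QUESTION"; bold headers
    # such as "1. **RESEARCH QUESTION**" do not match here and keep their title text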
    text = re.sub(r'\n?\d+\.\s+[A-Z/ ]+\s*\n', '\n', text)

    # Remove inline numbered list items
    text = re.sub(r'\n?\d+\.\s+', ' ', text)

    # Remove leftover markdown artifacts
    text = re.sub(r'\*\*(.*?)\*\*', r'\1', text)
    text = re.sub(r'#+\s*', '', text)
    text = re.sub(r'\n?---\n?', '\n', text)

    # Clean LaTeX remnants
    text = re.sub(r'\\\((.*?)\\\)', r'\1', text)
    text = re.sub(r'\\approx', '≈', text)
    text = re.sub(r'\\times', '×', text)

    # Remove bullet dashes
    text = re.sub(r'^\s*-\s+', '', text, flags=re.MULTILINE)

    # Join wrapped lines into paragraphs
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

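    # Normalize spacing. Note that \s also matches newlines, so the second
    # substitution collapses paragraph breaks into single spaces as well.
    text = re.sub(r'\n{2,}', '\n\n', text)
    text = re.sub(r'\s{2,}', ' ', text)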

    return text.strip()


In [12]:
cleaned_response = clean_llm_output(response_text)

In [13]:
print(cleaned_response)

Structured Summary RESEARCH QUESTION The paper addresses the problem of how changes in external data sources impact the production of official statistics when using machine learning (ML). Specifically, it investigates the risks, liabilities, and uncertainties associated with changing data sources and their repercussions on the integrity, reliability, and neutrality of ML-driven official statistics. METHODOLOGY The authors conduct a conceptual and analytical overview rather than empirical experimentation. They: Review the nature and challenges of using external data sources in ML for official statistics. Categorize and analyze types and causes of data source changes (technical, legal, ethical, ownership-related). Discuss the consequences of such changes on statistical production and ML model performance. Propose a checklist of risks and mitigation strategies based on literature and practical examples. Provide recommendations for statistical agencies to manage these challenges effectivel

In [14]:
from IPython.display import display, Markdown

display(Markdown(cleaned_response))


Structured Summary RESEARCH QUESTION The paper addresses the problem of how changes in external data sources impact the production of official statistics when using machine learning (ML). Specifically, it investigates the risks, liabilities, and uncertainties associated with changing data sources and their repercussions on the integrity, reliability, and neutrality of ML-driven official statistics. METHODOLOGY The authors conduct a conceptual and analytical overview rather than empirical experimentation. They: Review the nature and challenges of using external data sources in ML for official statistics. Categorize and analyze types and causes of data source changes (technical, legal, ethical, ownership-related). Discuss the consequences of such changes on statistical production and ML model performance. Propose a checklist of risks and mitigation strategies based on literature and practical examples. Provide recommendations for statistical agencies to manage these challenges effectively. DATA/MATERIALS The paper does not use original datasets but draws on: Existing literature and technical reports on ML and official statistics. Case examples such as the Twitter API changes. References to various data types (social media, administrative records, surveys) and ML methodologies (supervised, unsupervised learning). Regulatory frameworks (e.g., GDPR) and ethical considerations relevant to data sourcing. KEY FINDINGS Types and Causes of Data Changes: Changes in data types/schemas, sharing/collection technologies, concept drift, frequency interruptions, ownership changes, legal regulations, and ethical/public perception shifts all contribute to data source variability. Consequences: Changing data sources can cause concept drift, model staleness, bias, loss of neutrality, data unavailability, integration challenges, increased labor and costs, breaking changes, and degradation of quality metrics (timeliness, validity, accuracy, completeness). 
Risks of External Data: Unlike traditional controlled data (surveys, administrative records), external data sources expose statistical agencies to powerlessness and lack of control, increasing vulnerability. Mitigation Strategies: Risk analysis, continuous monitoring, diversification of data sources, building technical robustness (data normalization, validation, testing), and establishing legal agreements (SLAs) are critical to managing changing data sources. Tradeoffs: Mitigation requires significant resources, effort, and long-term planning; no single solution fits all cases. CONTRIBUTIONS Provides a comprehensive checklist and taxonomy of risks related to changing external data sources in ML for official statistics, covering technical, legal, ethical, and operational dimensions. Highlights the underexposed issue of data source change and its critical impact on ML-driven official statistics. Offers practical guidance and best practices for statistical agencies to maintain integrity, reliability, and neutrality despite data source volatility. Emphasizes the importance of data control and legal robustness in the era of external data dependency. Bridges the gap between ML technical challenges and official statistics production requirements. LIMITATIONS The paper is conceptual and does not provide empirical validation or quantitative evaluation of the proposed mitigation strategies. Recommendations are high-level and may lack detailed implementation guidance tailored to specific use cases or domains. The complexity and resource intensity of mitigation efforts may limit practical adoption, especially for smaller statistical agencies. The dynamic nature of data ecosystems and regulatory environments means that some risks and solutions may evolve rapidly beyond the scope of this paper. Does not deeply explore the tradeoffs between model interpretability and performance in the context of changing data sources, though it mentions related ethical concerns. 
This summary captures the essence of the paper’s problem framing, analytical approach, key insights, novel contributions, and acknowledged limitations as presented by the authors.
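
The cleaned text above comes out as one block because `\s{2,}` in `clean_llm_output` also matches the blank lines between paragraphs. A minimal sketch of a variant that keeps paragraph breaks, assuming a hypothetical function name `clean_keep_paragraphs` and the response layout shown earlier (numbered bold headers, indented `-` bullets):

```python
import re


def clean_keep_paragraphs(text: str) -> str:
    """Strip markdown markers but keep blank lines between paragraphs."""
    # Remove bold markers and heading hashes at line starts
    text = re.sub(r'\*\*(.*?)\*\*', r'\1', text)
    text = re.sub(r'^#+\s*', '', text, flags=re.MULTILINE)

    # Remove "1. " numbering and "- " bullets at line starts.
    # [ \t] is used instead of \s so the match never crosses a newline.
    text = re.sub(r'^[ \t]*\d+\.[ \t]+', '', text, flags=re.MULTILINE)
    text = re.sub(r'^[ \t]*-[ \t]+', '', text, flags=re.MULTILINE)

    # Drop trailing spaces (markdown hard line breaks)
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)

    # Join wrapped lines inside a paragraph, leaving blank lines alone
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

    # Collapse extra blank lines and runs of spaces or tabs only
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r'[ \t]{2,}', ' ', text)

    return text.strip()
```

Calling `clean_keep_paragraphs(response_text)` should give one paragraph per section, which also renders as separate paragraphs in `Markdown(...)`.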

In [15]:
with open("data/structured_summary.txt", "w", encoding="utf8") as f:
    f.write(cleaned_response)
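
If the sections are needed individually rather than as one text file, the raw response can be split on its numbered headers. A sketch under the assumption that headers look like `1. **RESEARCH QUESTION**` on their own line, as in the output shown earlier; `split_sections` and `HEADER` are hypothetical names:

```python
import re

# An all-caps header line such as "1. **RESEARCH QUESTION**" or "3. DATA/MATERIALS"
HEADER = re.compile(
    r'^[ \t]*\d+\.[ \t]+\*{0,2}([A-Z][A-Z/ ]+?)\*{0,2}[ \t]*$',
    re.MULTILINE,
)


def split_sections(raw: str) -> dict[str, str]:
    """Map each header title to the text that follows it."""
    sections: dict[str, str] = {}
    matches = list(HEADER.finditer(raw))
    for i, match in enumerate(matches):
        # A section body runs until the next header, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        sections[match.group(1).strip()] = raw[match.end():end].strip()
    return sections
```

For example, `split_sections(response_text)` could then be written out with `json.dump` instead of a flat text file. Numbered items that contain lowercase letters, such as the individual key findings, are not treated as headers.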