# Introduction

In this notebook, we demonstrate how Retrieval-Augmented Generation (RAG) can help generate more accurate responses by integrating information from external sources, such as news articles. With RAG, we retrieve relevant documents and cite them alongside the answer, which reduces hallucination and makes the generated content easier to verify.

To use the New York Times (NYT) API, I first signed up for an account on the NYT Developer Portal. After logging in, I generated an API key that grants access to the Article Search API, which allows me to query articles on various topics.
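
One way to keep the generated key out of the notebook itself is to read it from an environment variable. The sketch below is only an illustration: the variable name `NYT_API_KEY` and the helper `load_api_key` are assumptions made for this example, not something the NYT Developer Portal prescribes.

```python
import os

def load_api_key(var_name="NYT_API_KEY"):
    """Return the API key stored in the environment variable `var_name`.

    The variable name is an assumption for this sketch; use whatever name
    the key was exported under.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your NYT API key.")
    return key
```

The returned value can then be passed wherever the notebook expects `api_key`.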

## Article Search API

- The Article Search API allows querying for articles based on keywords, date ranges, and other parameters.
- It's well-suited for finding articles on a particular topic (for example, "AI revolution").
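
The keyword and date-range parameters described above can be assembled before any request is sent. The following is a minimal sketch, assuming the `YYYYMMDD` date strings used for `begin_date` and `end_date` in this notebook; `build_search_params` is a hypothetical helper name, not part of the NYT API.

```python
from datetime import date

def build_search_params(query, begin, end):
    """Build Article Search query parameters from a keyword and a date range.

    `begin` and `end` are `datetime.date` objects. They are formatted as
    YYYYMMDD strings, the format this notebook uses for `begin_date` and
    `end_date`.
    """
    if begin > end:
        raise ValueError("begin date must not be after end date")
    return {
        "q": query,
        "begin_date": begin.strftime("%Y%m%d"),
        "end_date": end.strftime("%Y%m%d"),
    }

params = build_search_params("AI revolution", date(2024, 1, 1), date(2025, 12, 31))
print(params)
```

The `api-key` entry is left out here on purpose, so the parameters can be inspected or logged without exposing the key.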

import requests
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Step 1: Retrieve articles using the NYTimes API
def fetch_articles(query, api_key, begin_date, end_date):
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
    params = {
        'q': query,
        'api-key': api_key,
        'begin_date': begin_date,
        'end_date': end_date
    }
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json().get('response', {}).get('docs', [])
    else:
        raise Exception(f"Failed to fetch articles: {response.status_code} - {response.text}")

# Step 2: Extract relevant content and citations
def prepare_context(articles):
    context = ""
    citations = []
    for idx, article in enumerate(articles):
        abstract = article.get('abstract', 'No abstract available.')
        url = article.get('web_url', '')
        if abstract and url:
            context += f"Article {idx+1}: {abstract}\n"
            citations.append(f"Article {idx+1} - {url}")
    return context.strip(), citations

# Step 3: Generate response using retrieved context
def generate_response(query, context, model, tokenizer, max_new_tokens=200):
    model_input = f"Query: {query}\nContext: {context}\nAnswer:"
    # GPT-2 has a 1024-token context window, so leave room for the generated tokens
    max_prompt_length = 1024 - max_new_tokens
    inputs = tokenizer(model_input, return_tensors="pt", truncation=True, max_length=max_prompt_length)
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],  # Mask returned by the tokenizer
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.2  # Adjust to reduce redundant phrases
    )

    # Decode only the newly generated tokens, not the echoed prompt
    prompt_length = inputs['input_ids'].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response

# Main Execution
if __name__ == "__main__":
    # Configuration
    api_key = 'YOUR_NYT_API_KEY'  # Replace with your own key; never commit a real key
    query = "Social justice movements 2025"
    begin_date = "20240101"
    end_date = "20251231"

    # Initialize the model and tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token  # Avoid padding warnings

    # Fetch articles and prepare context
    articles = fetch_articles(query, api_key, begin_date, end_date)
    context, citations = prepare_context(articles)

    # Generate response using RAG
    if context:
        print("Retrieved Context:\n", context)
        response = generate_response(query, context, model, tokenizer)
        print("\nGenerated Response:\n", response)
        print("\nCitations:")
        for citation in citations:
            print(citation)
    else:
        print("No articles found for the given query.")


Retrieved Context:
 Article 1: Sublime Sienese art at the Met, Pan-African art throughout Chicago, a 200th anniversary at the Brooklyn Museum: These extravaganzas are not to be missed.
Article 2: Fourteen teachers shared recommendations for students at all levels.
Article 3: Exhibitions around the world are celebrating the art movement’s centennial and asking whether our crazy dreams can still set us free.
Article 4: Sahra Wagenknecht, a former Communist, has founded her own party to respond to German grievances about migrants, crime and the dangers of the war in Ukraine.
Article 5: Right-wing parties may be in the ascendant, but overall, democracy is not at risk.
Article 6: The sprawling PST festival of more than 70 exhibitions doesn’t quite live up to its theme of art and science colliding. But there is a handful of impressive entries.
Article 7: This week in Newly Reviewed, Travis Diehl covers Samuel Hindolo’s bohemian atmospheres, Kristin Walsh’s shiny engines and Janiva Ellis’s ca

## How RAG Reduces Hallucination

1. Content Anchoring:

Answers are tied to specific retrieved documents, which reduces the likelihood of speculative content.
If no relevant context is retrieved, the pipeline can skip generation and return a disclaimer such as "I could not find relevant information." (the script above prints "No articles found for the given query.").

2. Traceability:

The response is accompanied by links to the source articles, making it easy to fact-check.

3. Controlled Retrieval:

By tuning the retrieval step (query keywords, date range, and other search parameters), you can limit the context to high-quality, domain-relevant data.
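
The three points above can be sketched in a few lines. This is a minimal illustration and not the notebook's own method: it assumes article dictionaries with the `abstract` and `web_url` fields used earlier, uses a crude keyword-overlap check as a stand-in for a real relevance model, and the helper names `filter_relevant` and `grounded_context` are hypothetical. The sample articles are made up for the example.

```python
import re

def _words(text):
    """Lowercase word set, used for a crude keyword-overlap check."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def filter_relevant(query, articles, min_overlap=1):
    """Controlled retrieval: keep articles whose abstract shares query words."""
    query_words = _words(query)
    kept = []
    for article in articles:
        abstract = article.get("abstract", "")
        url = article.get("web_url", "")
        if abstract and url and len(query_words & _words(abstract)) >= min_overlap:
            kept.append(article)
    return kept

def grounded_context(query, articles):
    """Return (context, citations), or a disclaimer when nothing relevant is found."""
    relevant = filter_relevant(query, articles)
    if not relevant:
        # Content anchoring: return a disclaimer rather than let the model speculate.
        return "I could not find relevant information.", []
    lines, citations = [], []
    for number, article in enumerate(relevant, start=1):
        lines.append(f"Article {number}: {article['abstract']}")
        # Traceability: every context line has a matching source URL.
        citations.append(f"Article {number} - {article['web_url']}")
    return "\n".join(lines), citations

# Made-up sample data for illustration only.
sample = [
    {"abstract": "A new movement for housing justice grows.", "web_url": "https://example.com/a"},
    {"abstract": "Museum exhibitions to see this fall.", "web_url": "https://example.com/b"},
]
context, citations = grounded_context("social justice movements", sample)
print(context)
print(citations)
```

Exact-word overlap is a deliberately simple filter; it would miss synonyms and inflected forms, so a production retriever would likely rely on something stronger, such as embedding similarity.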