# NYT Data Journalism Tutorial

This tutorial demonstrates how to perform comprehensive text analysis on New York Times articles using modern NLP techniques. We'll cover:

1. **Data Loading & Preprocessing** - Loading and preparing NYT article data
2. **Topic Modeling** - Discovering themes using LDA and BERTopic
3. **Sentiment Analysis** - Analyzing sentiment with FinBERT and PoliBERT
4. **Embeddings & Similarity Search** - Creating and using text embeddings
5. **Book Extraction** - Extracting structured data with LLMs
6. **Visualizations** - Creating insightful visual representations

Each section demonstrates features that are available in the modular implementation located in the `src/` directory.

## 1. Data Loading & Preprocessing

First, we'll load the NYT dataset and prepare it for analysis. The modular implementation provides this functionality in `src/ingest/load_nyt.py` and `src/preprocess/text.py`.

In [None]:
import pandas as pd
import os

# This tutorial assumes the dataset has already been downloaded.
# In production, use: from src.ingest.load_nyt import load_nyt_data

# Load the dataset (adjust the path as needed). The cells below expect a
# DataFrame named `df`, so uncomment this line before running them.
# df = pd.read_csv('path_to_nyt_data.csv')

# Columns the rest of the tutorial relies on
print("Expected columns: _id, pub_date, headline, abstract, lead_paragraph, section_name, document_type, web_url")

In [None]:
# Convert pub_date to datetime and create working dataframe
df['pub_date'] = pd.to_datetime(df['pub_date'], errors='coerce')
df.dropna(subset=['pub_date'], inplace=True)

# Create a clean working copy with essential columns
articles_df = df[[
    '_id', 'pub_date', 'headline', 'web_url', 'abstract', 
    'lead_paragraph', 'section_name', 'document_type'
]].copy()

print(f"Total articles: {len(articles_df):,}")
print("\nTop sections:")
print(articles_df['section_name'].value_counts().head(10))

### Text Preprocessing

The `src/preprocess/text.py` module provides utilities for cleaning and combining text fields.

In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords if needed
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

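def combine_text_fields(df, columns=('headline', 'abstract', 'lead_paragraph')):
    """Combine multiple text columns into one, treating missing values as empty."""
    # Fill NaN before casting so missing fields never become the string 'nan';
    # stripping 'nan' afterwards would also corrupt words such as 'finance'.
    combined = df[columns[0]].fillna('').astype(str)
    for col in columns[1:]:
        combined = combined + ' ' + df[col].fillna('').astype(str)
    # Collapse the extra spaces left behind by empty fields
    return combined.str.replace(r'\s+', ' ', regex=True).str.strip()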

# Build the stopword set once rather than on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase text and drop non-letter characters, stopwords, and words of one or two letters."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    words = [w for w in text.split() if w not in STOP_WORDS and len(w) > 2]
    return ' '.join(words)

print("Preprocessing functions ready")
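
To see what `clean_text` does to a single sentence, here is a dependency-free sketch of the same steps. The small `DEMO_STOP_WORDS` set is a stand-in for NLTK's English list (an assumption made so the snippet runs without a download), so the real function will drop more words than this one.

```python
import re

# Hypothetical mini stopword list; NLTK's English list is much longer.
DEMO_STOP_WORDS = {"the", "in", "and", "of", "a", "to", "by"}

def clean_text_demo(text):
    """Mirror clean_text: lowercase, keep letters only, drop stopwords and short words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(
        w for w in text.split() if w not in DEMO_STOP_WORDS and len(w) > 2
    )

cleaned = clean_text_demo("The U.S. economy grew by 3% in 2001, and markets rallied!")
print(cleaned)
```

Punctuation is deleted rather than replaced by a space, so `U.S.` becomes `us` and is then removed by the length filter. Keep this in mind when abbreviations matter for the analysis.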

## 2. Topic Modeling

Topic modeling helps discover hidden themes in large text collections. We'll demonstrate both LDA (Latent Dirichlet Allocation) and BERTopic, both available in `src/models/topic_models.py`.

### 2.1 LDA Topic Modeling

Let's analyze World section articles from 2001 to discover major themes.

In [None]:
# Filter World articles from 2001
world_2001 = articles_df[
    (articles_df['pub_date'].dt.year == 2001) &
    (articles_df['section_name'] == 'World')
].copy()

# Combine and clean text
world_2001['combined_text'] = combine_text_fields(world_2001)
world_2001['cleaned_text'] = world_2001['combined_text'].apply(clean_text)

print(f"World articles in 2001: {len(world_2001):,}")
print(f"Sample: {world_2001['combined_text'].iloc[0][:200]}...")

In [None]:
# Install gensim for LDA
import sys
!{sys.executable} -m pip install -q gensim

In [None]:
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Tokenize the cleaned text so stopwords do not dominate the topics
world_2001['tokens'] = world_2001['cleaned_text'].apply(
    lambda x: simple_preprocess(x, deacc=True)
)

# Create dictionary and corpus
dictionary = Dictionary(world_2001['tokens'])
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(tokens) for tokens in world_2001['tokens']]

print(f"Dictionary size: {len(dictionary)} unique tokens")
print(f"Corpus size: {len(corpus)} documents")
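
The dictionary and corpus built above can be pictured with a small pure-Python sketch. It is not gensim's implementation; it only illustrates the two ideas involved: a document-frequency filter in the spirit of `filter_extremes(no_below, no_above)` and a bag-of-words encoding in the spirit of `doc2bow`. The toy documents and thresholds are made up for the example.

```python
from collections import Counter

docs = [
    ["election", "vote", "president"],
    ["election", "court", "ruling"],
    ["election", "vote", "recount"],
    ["election", "court", "vote"],
]

# Document frequency: in how many documents each token appears
doc_freq = Counter(token for doc in docs for token in set(doc))

no_below, no_above = 2, 0.75  # toy thresholds, not the tutorial's 5 and 0.5
kept = sorted(
    token for token, n in doc_freq.items()
    if n >= no_below and n / len(docs) <= no_above
)
token2id = {token: i for i, token in enumerate(kept)}

def to_bow(doc):
    """Encode a token list as sorted (token_id, count) pairs, skipping filtered tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bows = [to_bow(d) for d in docs]
print(kept)
print(bows)
```

Here `election` is dropped because it appears in every document, and the one-off tokens are dropped because they are too rare, which is the same trade-off the `no_above` and `no_below` arguments control.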

In [None]:
# Train LDA model
num_topics = 10
lda_model = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    passes=10,
    per_word_topics=True
)

# Display topics
print("\nTop 10 Topics from LDA Model:\n")
for idx, topic in lda_model.print_topics(num_words=8):
    print(f"Topic {idx}: {topic}\n")
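
Once the model is trained, a common next step is to label each article with its most probable topic. The helper below is a minimal sketch with a hypothetical name (`dominant_topic`); it assumes its input is a list of `(topic_id, probability)` pairs, which is the shape `lda_model.get_document_topics(bow)` returns. The example distribution is made up.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up distribution in the shape returned by lda_model.get_document_topics(bow)
example = [(0, 0.05), (3, 0.72), (7, 0.23)]
best = dominant_topic(example)
print(best)
```

With the trained model, this could be applied per document, for instance `dominant_topic(lda_model.get_document_topics(bow))` for each `bow` in `corpus`.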

### 2.2 BERTopic Modeling

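BERTopic clusters transformer sentence embeddings and then extracts representative terms for each cluster, so its topics reflect semantic similarity rather than word co-occurrence alone.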

In [None]:
# Install BERTopic and dependencies
!{sys.executable} -m pip install -q bertopic umap-learn hdbscan sentence-transformers

In [None]:
from bertopic import BERTopic

# Prepare documents
documents = world_2001['combined_text'].tolist()

# Train BERTopic model
topic_model = BERTopic(
    language='english',
    calculate_probabilities=True,
    verbose=True
)

topics, probs = topic_model.fit_transform(documents)

# Display topic info
topic_info = topic_model.get_topic_info()
print("\nBERTopic Results:")
print(topic_info.head(10))

In [None]:
# Visualize topics
fig = topic_model.visualize_topics(width=1200, height=800)
fig.show()
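
BERTopic assigns topic `-1` to documents it treats as outliers, so it helps to check how much of the corpus ended up there before interpreting the topics. The sketch below uses a made-up `topics_demo` list in place of the `topics` returned by `fit_transform`.

```python
from collections import Counter

# Made-up assignments in the shape of the `topics` list returned by fit_transform
topics_demo = [0, 1, -1, 0, 2, -1, 0, 1]

counts = Counter(topics_demo)
outlier_share = counts[-1] / len(topics_demo)

# Largest real topics, ignoring the outlier bucket
largest = [(topic, n) for topic, n in counts.most_common() if topic != -1][:3]

print(f"Outlier share: {outlier_share:.0%}")
print(largest)
```

A large outlier share usually suggests the clustering settings are too strict for the corpus, which is worth knowing before reading much into the topic table.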

## 3. Sentiment Analysis

We'll use domain-specific sentiment models: PoliBERT for opinion pieces and FinBERT for business articles. These are implemented in `src/models/sentiment.py`.

### 3.1 PoliBERT for Opinion Articles

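PoliBERT (RoBERTa-based) is designed for political sentiment analysis. The code below loads `cardiffnlp/twitter-roberta-base-sentiment-latest` as a stand-in, so the results reflect that Twitter-trained sentiment model rather than PoliBERT itself.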

In [None]:
# Install transformers
!{sys.executable} -m pip install -q transformers torch

In [None]:
# Filter Opinion articles from 2001
opinion_2001 = articles_df[
    (articles_df['pub_date'].dt.year == 2001) &
    (articles_df['section_name'] == 'Opinion')
].copy()

opinion_2001['combined_text'] = combine_text_fields(opinion_2001)

print(f"Opinion articles in 2001: {len(opinion_2001):,}")

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load the sentiment model. Twitter-RoBERTa is used here as a proxy for PoliBERT.
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

print("Sentiment model loaded (Twitter-RoBERTa proxy for PoliBERT)")

In [None]:
def analyze_sentiment_polibert(text):
    """Analyze sentiment using PoliBERT."""
    if not isinstance(text, str) or not text.strip():
        return 'Neutral', {'negative': 0.0, 'neutral': 1.0, 'positive': 0.0}
    
    # Tokenize and get predictions
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get probabilities
    probs = F.softmax(outputs.logits, dim=1)[0]
    scores = {'negative': probs[0].item(), 'neutral': probs[1].item(), 'positive': probs[2].item()}
    label = max(scores, key=scores.get).capitalize()
    
    return label, scores

# Test on a sample
sample_text = opinion_2001['combined_text'].iloc[0]
label, scores = analyze_sentiment_polibert(sample_text)
print(f"Sample sentiment: {label}")
print(f"Scores: {scores}")
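
The step from logits to a label can be sketched without torch. The snippet below reimplements softmax in plain Python and picks the highest-scoring label; the logits are made up, and the label order is the one assumed in the cell above.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["negative", "neutral", "positive"]  # order assumed in the cell above
logits = [-1.2, 0.3, 2.1]                     # made-up model outputs

probs = softmax(logits)
scores = dict(zip(labels, probs))
label = max(scores, key=scores.get).capitalize()
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Because the scores are probabilities, the gap between the top two values gives a rough sense of how confident the model is in the chosen label.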

In [None]:
import numpy as np

# Analyze a sample (first 100 articles for speed)
sample = opinion_2001.head(100).copy()
results = sample['combined_text'].apply(analyze_sentiment_polibert)
sample['sentiment_label'] = results.apply(lambda x: x[0])
sample['sentiment_scores'] = results.apply(lambda x: x[1])

print("Sentiment distribution:")
print(sample['sentiment_label'].value_counts())

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize sentiment distribution
plt.figure(figsize=(8, 6))
sns.barplot(x=sample['sentiment_label'].value_counts().index, 
            y=sample['sentiment_label'].value_counts().values)
plt.title('Opinion Article Sentiment Distribution (PoliBERT)')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

### 3.2 FinBERT for Business Articles

FinBERT is specialized for financial sentiment analysis.

In [None]:
# Filter Business Day articles from 2001
business_2001 = articles_df[
    (articles_df['pub_date'].dt.year == 2001) &
    (articles_df['section_name'] == 'Business Day')
].copy()

business_2001['combined_text'] = combine_text_fields(business_2001)

print(f"Business Day articles in 2001: {len(business_2001):,}")

In [None]:
# Load FinBERT
finbert_model_name = "ProsusAI/finbert"
finbert_tokenizer = AutoTokenizer.from_pretrained(finbert_model_name)
finbert_model = AutoModelForSequenceClassification.from_pretrained(finbert_model_name)

print("FinBERT model loaded")

In [None]:
def analyze_sentiment_finbert(text):
    """Analyze financial sentiment using FinBERT."""
    if not isinstance(text, str) or not text.strip():
        return 'Neutral', {'negative': 0.0, 'neutral': 1.0, 'positive': 0.0}
    
    # Tokenize and get predictions
    inputs = finbert_tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = finbert_model(**inputs)
    
    # Get probabilities (FinBERT: positive=0, negative=1, neutral=2)
    probs = F.softmax(outputs.logits, dim=1)[0]
    scores = {'positive': probs[0].item(), 'negative': probs[1].item(), 'neutral': probs[2].item()}
    label = max(scores, key=scores.get).capitalize()
    
    return label, scores

# Test on a sample
sample_text = business_2001['combined_text'].iloc[0]
label, scores = analyze_sentiment_finbert(sample_text)
print(f"Sample sentiment: {label}")
print(f"Scores: {scores}")
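
Hard-coding the index-to-label order is fragile, because different checkpoints order their classes differently. One alternative, sketched here with a hypothetical helper name (`scores_from_probs`), is to build the scores dictionary from an id-to-label mapping. The demo mapping mirrors the order noted in the comment above, and the probabilities are made up.

```python
def scores_from_probs(probs, id2label):
    """Map class probabilities to lowercase label names using an id -> label dict."""
    return {id2label[i].lower(): float(p) for i, p in enumerate(probs)}

# Made-up probabilities; the mapping mirrors the order noted in the comment above
demo_id2label = {0: "positive", 1: "negative", 2: "neutral"}
demo_scores = scores_from_probs([0.1, 0.7, 0.2], demo_id2label)
demo_label = max(demo_scores, key=demo_scores.get).capitalize()
print(demo_label, demo_scores)
```

With the loaded model, the mapping can be read from `finbert_model.config.id2label` and the probabilities passed as `probs.tolist()`, so the code keeps working if the checkpoint changes.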

In [None]:
# Analyze a sample (first 100 articles for speed)
business_sample = business_2001.head(100).copy()
results = business_sample['combined_text'].apply(analyze_sentiment_finbert)
business_sample['finbert_sentiment'] = results.apply(lambda x: x[0])
business_sample['finbert_scores'] = results.apply(lambda x: x[1])

print("FinBERT sentiment distribution:")
print(business_sample['finbert_sentiment'].value_counts())

In [None]:
# Visualize FinBERT sentiment distribution
plt.figure(figsize=(8, 6))
sns.barplot(x=business_sample['finbert_sentiment'].value_counts().index, 
            y=business_sample['finbert_sentiment'].value_counts().values)
plt.title('Business Article Sentiment Distribution (FinBERT)')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

## 4. Embeddings & Similarity Search

Text embeddings enable semantic similarity search. This functionality is in `src/models/embeddings.py` and `src/models/similarity.py`.

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print("Embedding model loaded")

In [None]:
# Create embeddings for a sample of articles
sample_articles = world_2001.head(50).copy()
texts = sample_articles['combined_text'].tolist()

# Generate embeddings
embeddings = embedding_model.encode(texts, show_progress_bar=True)

print(f"Created {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")

In [None]:
# Find similar articles to a query
query = "terrorism and security measures"
query_embedding = embedding_model.encode([query])

# Calculate similarities
similarities = cosine_similarity(query_embedding, embeddings)[0]

# Get top 5 most similar articles
top_indices = np.argsort(similarities)[-5:][::-1]

print(f"Query: '{query}'\n")
print("Top 5 most similar articles:\n")
for i, idx in enumerate(top_indices, 1):
    print(f"{i}. Similarity: {similarities[idx]:.3f}")
    print(f"   {sample_articles.iloc[idx]['combined_text'][:150]}...\n")
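
The ranking step above boils down to a cosine similarity followed by a sort. The dependency-free sketch below shows that computation on made-up three-dimensional vectors; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and `cosine` and `top_k` are illustrative names rather than functions from the project.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up "embeddings" for three documents and one query
doc_vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]

ranking = top_k(query_vec, doc_vecs)
print(ranking)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common default for comparing sentence embeddings.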

## 5. Book Extraction

Extract structured book and author information using LLMs. This is implemented in `src/models/extraction.py`.

In [None]:
# Install dependencies
!{sys.executable} -m pip install -q pydantic openai instructor

In [None]:
# Filter Books section articles
books_2001 = articles_df[
    (articles_df['pub_date'].dt.year == 2001) &
    (articles_df['section_name'] == 'Books')
].copy()

books_2001['combined_text'] = combine_text_fields(books_2001)

print(f"Books articles in 2001: {len(books_2001):,}")
print(f"\nSample: {books_2001['combined_text'].iloc[0][:300]}...")

In [None]:
from pydantic import BaseModel
import json

# Define schema for book extraction
class BookAuthor(BaseModel):
    book_title: str
    author_name: str

print("Pydantic schema defined:")
print(json.dumps(BookAuthor.model_json_schema(), indent=2))
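
The schema is useful mainly because it rejects malformed output. The sketch below, assuming Pydantic v2 (the same version that provides `model_json_schema`), validates one complete and one incomplete payload. Both payloads are made up and stand in for LLM responses.

```python
from pydantic import BaseModel, ValidationError

class BookAuthor(BaseModel):
    book_title: str
    author_name: str

# Made-up payloads standing in for LLM output
good = BookAuthor.model_validate(
    {"book_title": "Example Title", "author_name": "Jane Doe"}
)
print(good.book_title, "by", good.author_name)

try:
    BookAuthor.model_validate({"book_title": "Example Title"})  # author_name missing
    missing_field_rejected = False
except ValidationError as exc:
    missing_field_rejected = True
    print(f"Rejected: {exc.error_count()} validation error(s)")
```

Catching `ValidationError` at this boundary lets a pipeline skip or retry an article instead of storing a partial record.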

In [None]:
# Note: This requires an OpenAI API key
# For demonstration purposes, we'll show the structure without actual API calls

print("""
Book Extraction Process:

1. The text is sent to an LLM (e.g., GPT-4) with structured output requirements
2. The LLM extracts book titles and author names
3. Results are validated against the Pydantic schema
4. Structured data is stored for further analysis

Example usage:
```python
import openai
import instructor

client = instructor.from_openai(openai.OpenAI(api_key="your-key"))

def extract_book_info(text):
    return client.chat.completions.create(
        model="gpt-4",
        response_model=BookAuthor,
        messages=[{
            "role": "user",
            "content": f"Extract book title and author from: {text}"
        }]
    )
```
""")

## 6. Visualizations

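This section visualizes the 2001 articles in three ways: a word cloud of World coverage, a breakdown of the top sections, and a monthly count of published articles.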

In [None]:
# Install wordcloud
!{sys.executable} -m pip install -q wordcloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Create word cloud for World articles
text_data = ' '.join(world_2001['cleaned_text'].dropna())

wordcloud = WordCloud(
    width=1200,
    height=600,
    background_color='white',
    colormap='viridis',
    max_words=100
).generate(text_data)

plt.figure(figsize=(15, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud: World Articles 2001', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Create section distribution pie chart
section_counts = articles_df[
    articles_df['pub_date'].dt.year == 2001
]['section_name'].value_counts().head(10)

plt.figure(figsize=(10, 8))
plt.pie(section_counts, labels=section_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Top 10 Sections Distribution (2001)', fontsize=14, fontweight='bold')
plt.axis('equal')
plt.show()

In [None]:
# Time series of article counts
articles_2001 = articles_df[articles_df['pub_date'].dt.year == 2001]
monthly_counts = articles_2001.groupby(articles_2001['pub_date'].dt.month).size()

plt.figure(figsize=(12, 6))
plt.plot(monthly_counts.index, monthly_counts.values, marker='o', linewidth=2, markersize=8)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Number of Articles', fontsize=12)
plt.title('NYT Articles Published per Month in 2001', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                           'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.tight_layout()
plt.show()

## Conclusion

This tutorial demonstrated the key capabilities of the NYT Data Journalism analysis system:

1. **Data Loading & Preprocessing** - Efficiently loading and cleaning large-scale article data
2. **Topic Modeling** - Discovering themes with both classical (LDA) and modern (BERTopic) approaches
3. **Sentiment Analysis** - Domain-specific sentiment analysis with FinBERT and PoliBERT
4. **Embeddings & Similarity** - Semantic search and article similarity using transformer embeddings
5. **Book Extraction** - Structured data extraction using LLM-based parsing
6. **Visualizations** - Creating insightful visualizations of text data

### Modular Implementation

All these features are available in the production-ready modular implementation:

```
src/
├── ingest/
│   └── load_nyt.py          # Data loading utilities
├── preprocess/
│   └── text.py              # Text preprocessing functions
├── models/
│   ├── topic_models.py      # LDA and BERTopic implementations
│   ├── sentiment.py         # Multi-model sentiment analysis
│   ├── embeddings.py        # Text embedding generation
│   ├── similarity.py        # Similarity search functionality
│   └── extraction.py        # LLM-based extraction
└── api/
    ├── app.py               # Flask API application
    └── main.py              # API entry point
```

### Next Steps

- Explore the API endpoints in `src/api/app.py` for integrating these features into applications
- Scale analysis to multiple years and sections
- Experiment with different topic counts and model parameters
- Build custom visualizations for specific use cases
- Extend the extraction module for other entity types (people, organizations, etc.)

For questions or contributions, refer to the project README.