# News Article Bias Detection with Semantic Classification

This notebook demonstrates how to use fenic's semantic classification capabilities to detect editorial bias and analyze news articles. We'll walk through:

- Language Analysis using `semantic.extract()` to find biased, emotional, or sensationalist language.
- Political Bias Classification using `semantic.classify()`, grounded in the extracted data.
- News Topic Classification using `semantic.classify()`.
- Merging the information using `semantic.reduce()` to create a 'Media Profile' summary for each analyzed news source.

This is a practical example of how semantic classification can provide insights into media content.

## Initial Setup

First, let's configure our fenic session with semantic capabilities, using an OpenAI model for our language processing tasks. Alternatively, uncomment one of the additional configurations to use a Gemini or Anthropic model.

In [None]:
import fenic as fc
from pydantic import BaseModel, Field
from fenic import col, lit

# Configure session with semantic capabilities
print("🔧 Configuring fenic session...")

config = fc.SessionConfig(
    app_name="news_analysis",
    semantic=fc.SemanticConfig(
        language_models={ 
            "openai": fc.OpenAIModelConfig(
                model_name="gpt-4o-mini",
                rpm=500,
                tpm=200_000
            ),
            # "anthropic": fc.AnthropicModelConfig(
            #     model_name="claude-sonnet-4-0",
            #     rpm=500,
            #     input_tpm=80_000,
            #     output_tpm=32_000,
            # ),
            #  "gemini": fc.GoogleGLAModelConfig(
            #     model_name="gemini-2.0-flash",
            #     rpm=500,
            #     tpm=1_000_000
            # ),
        }
    )
)

# Create session
session = fc.Session.get_or_create(config)
print("✅ Session configured successfully!")
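
Each provider's client needs an API key before the session can issue requests. The sketch below is a minimal pre-flight check, assuming keys are supplied through the conventional environment variables `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `GOOGLE_API_KEY`. The variable mapping and the helper `missing_api_keys` are assumptions for illustration and are not part of fenic, so confirm the names against your provider setup.

```python
import os

# Assumed mapping from the model aliases used in the config above to the
# environment variable each provider conventionally reads its key from.
ASSUMED_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def missing_api_keys(providers, env=None):
    """Return the env var names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [
        ASSUMED_KEY_VARS[name]
        for name in providers
        if not env.get(ASSUMED_KEY_VARS[name])
    ]


# Only "openai" is enabled in the configuration above.
missing = missing_api_keys(["openai"])
if missing:
    print(f"Set these variables before creating the session: {missing}")
```

Running a check like this before `Session.get_or_create` surfaces a missing key early, rather than at the first semantic operation.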

## Sample News Articles Dataset

We'll work with a curated dataset of news articles from different sources covering the same stories. This allows us to analyze how different outlets report on identical events and detect bias patterns.

Our dataset includes articles from:
- **A neutral source** (Global Wire Service)
- **A left-leaning source** (Progressive Voice)
- **A right-leaning source** (Liberty Herald)

Each source has three articles, one per story, to demonstrate consistency in bias patterns.

In [None]:
# Sample news articles - multiple articles per source to show bias patterns
news_articles = [
    # Global Wire Service (Neutral source, Reuters-style) - 3 articles
    {
        "source": "Global Wire Service",
        "headline": "Federal Reserve Raises Interest Rates by 0.25 Percentage Points",
        "content": "The Federal Reserve announced a quarter-point increase in interest rates Wednesday, bringing the federal funds rate to 5.5%. The decision was unanimous among voting members. Fed Chair Jerome Powell cited persistent inflation concerns and a robust labor market as key factors. The rate hike affects borrowing costs for consumers and businesses. Economic analysts had predicted the move following recent inflation data showing prices remained above the Fed's 2% target."
    },
    {
        "source": "Global Wire Service",
        "headline": "OpenAI Launches GPT-4 Turbo with 128K Context Window",
        "content": "OpenAI today announced GPT-4 Turbo, featuring a 128,000 token context window and updated training data through April 2024. The model offers improved instruction following and reduced likelihood of generating harmful content. Pricing is set at $0.01 per 1K input tokens and $0.03 per 1K output tokens. The release includes enhanced support for JSON mode and function calling. Developer early access begins this week, with general availability planned for December."
    },
    {
        "source": "Global Wire Service",
        "headline": "Climate Summit Reaches Agreement on Fossil Fuel Transition",
        "content": "Delegates at the COP28 climate summit in Dubai reached a consensus agreement calling for a transition away from fossil fuels in energy systems. The deal, approved by nearly 200 countries, marks the first time a COP agreement explicitly mentions fossil fuels. However, the agreement uses the phrase 'transitioning away' rather than 'phasing out,' reflecting compromises necessary to secure broad support. Environmental groups expressed mixed reactions, with some praising the historic mention while others criticized the lack of binding timelines."
    },
    
    # Progressive Voice (Left-leaning source) - 3 articles  
    {
        "source": "Progressive Voice",
        "headline": "Fed's Rate Hike Threatens Working Families as Corporate Profits Soar",
        "content": "Once again, the Federal Reserve has chosen to burden working families with higher borrowing costs while Wall Street celebrates record profits. Wednesday's rate hike to 5.5% will make mortgages, credit cards, and student loans more expensive for millions of Americans already struggling with housing costs. Meanwhile, corporate executives continue awarding themselves massive bonuses. This regressive monetary policy prioritizes the wealthy elite over middle-class families who desperately need relief."
    },
    {
        "source": "Progressive Voice", 
        "headline": "Big Tech's AI Surveillance Threatens Democratic Values",
        "content": "OpenAI's latest AI release represents another troubling escalation in Silicon Valley's surveillance capitalism model. These systems hoover up personal data and creative content without meaningful consent from users. Artists, writers, and creators see their work exploited to train AI systems that directly compete with human creativity. Meanwhile, users surrender intimate conversations to corporate servers with little transparency. We need immediate regulation to protect digital rights and prevent tech giants from privatizing human knowledge for profit."
    },
    {
        "source": "Progressive Voice",
        "headline": "Climate Summit's Weak Language Betrays Future Generations", 
        "content": "The COP28 agreement represents a devastating failure to confront the climate emergency with the urgency science demands. By choosing vague 'transition' language over concrete 'phase out' commitments, world leaders have once again capitulated to fossil fuel lobbying and corporate interests. Young climate activists who traveled to Dubai seeking real action have been betrayed by politicians who prioritize industry profits over planetary survival. We cannot afford more empty promises while the climate crisis accelerates."
    },
    
    # Liberty Herald (Right-leaning source) - 3 articles
    {
        "source": "Liberty Herald",
        "headline": "Fed's Prudent Rate Decision Reinforces Economic Stability",
        "content": "The Federal Reserve's measured quarter-point rate increase demonstrates responsible monetary policy that will preserve long-term economic prosperity. By raising rates to 5.5%, Fed officials are taking necessary steps to prevent runaway inflation that would devastate savings and fixed incomes. This disciplined approach protects the purchasing power that American families have worked hard to build. Free market principles and sound fiscal management require tough decisions that ensure sustainable growth for job creators and investors."
    },
    {
        "source": "Liberty Herald",
        "headline": "American AI Innovation Leads Global Technology Revolution",
        "content": "OpenAI's breakthrough demonstrates why American innovation continues to lead the world in transformative technology. This achievement showcases the power of free enterprise and competitive markets to deliver solutions that benefit humanity. While other nations impose heavy-handed regulations that stifle innovation, American companies are unleashing AI capabilities that will create jobs, boost productivity, and solve complex problems. America's technological superiority depends on supporting pioneering companies through pro-growth policies and reduced government interference."
    },
    {
        "source": "Liberty Herald",
        "headline": "Pragmatic Climate Deal Balances Environmental Goals with Economic Reality",
        "content": "The COP28 agreement demonstrates mature leadership by acknowledging environmental concerns while protecting economic stability and energy security. The careful 'transition away' language recognizes that abrupt fossil fuel elimination would devastate working families and developing nations that depend on affordable energy. American energy producers have already reduced emissions through innovation and cleaner technologies, proving that market solutions work better than government mandates. This balanced approach protects jobs while investing in alternatives."
    }
]

print(f"📰 Loaded {len(news_articles)} news articles from various sources")
print(f"🔍 Sources: {len(set(article['source'] for article in news_articles))} unique news outlets")

## Create DataFrame and Dataset Overview

Let's convert our news articles into a fenic DataFrame and examine the composition of our dataset.

In [None]:
# Create DataFrame from news articles
df = session.create_dataframe(news_articles)

print("📰 News Bias Detection Pipeline")
print("=" * 70)
print(f"Analyzing {df.count()} news articles from {df.select('source').drop_duplicates(['source']).count()} sources")

# Show dataset composition
print("\n📊 Dataset Composition:")
source_counts = df.group_by("source").agg(fc.count("*").alias("articles")).order_by("source")
source_counts.show()
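
As a cross-check on the `group_by` count above, the same tally can be computed in plain Python before any DataFrame is created. This is a minimal sketch using `collections.Counter`; the three-row `sample_articles` list is a hypothetical stand-in for `news_articles`, and `articles_per_source` is an illustrative helper.

```python
from collections import Counter

# Hypothetical stand-in rows; in the notebook, pass `news_articles` instead.
sample_articles = [
    {"source": "Global Wire Service", "headline": "A"},
    {"source": "Progressive Voice", "headline": "B"},
    {"source": "Global Wire Service", "headline": "C"},
]


def articles_per_source(articles):
    """Count articles per source, sorted by source name like order_by("source")."""
    counts = Counter(article["source"] for article in articles)
    return sorted(counts.items())


for source, n in articles_per_source(sample_articles):
    print(f"{source}: {n}")
```

If the two tallies disagree, the input list and the DataFrame have diverged somewhere between loading and `create_dataframe`.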

## Define Analysis Schema

Before we perform semantic operations, let's define a Pydantic model that will help us extract structured information about bias indicators, emotional language, and opinion markers from each article.

In [None]:
# Define Pydantic model for detailed article analysis
class ArticleAnalysis(BaseModel):
    """Comprehensive analysis of news article content and bias"""
    bias_indicators: str = Field(..., description="Key words or phrases that indicate political bias")
    emotional_language: str = Field(..., description="Emotionally charged words or neutral descriptive language")
    opinion_markers: str = Field(..., description="Words or phrases that signal opinion vs. factual reporting")

print("✅ Analysis schema defined - ready for semantic extraction!")

## Stage 1: Content Preprocessing and Information Extraction

First, we'll combine headlines and content for richer context, then extract key information about bias indicators, emotional language, and opinion markers from each article. This sets up our data for the classification stage.

In [None]:
print("🔍 Performing semantic bias detection...")
print("First, we extract key information from each article.\n")

# Create combined text for context-aware analysis
combined_content = fc.text.concat(
    fc.col("headline"), 
    fc.lit(" | "), 
    fc.col("content")
)

# Classify the primary topic and extract structured metadata.
# `.cache()` stores the results so these expensive LLM operations are not
# re-run each time we build on the materialized DataFrame.
enriched_df = df.with_column("combined_content", combined_content).select(
    fc.col("source"),
    fc.col("headline"),
    fc.col("content"),
    # Primary topic classification
    fc.semantic.classify(
        fc.col("combined_content"),
        ["politics", "technology", "business", "climate", "healthcare"]
    ).alias("primary_topic"),
    # Content Metadata using semantic.extract
    fc.semantic.extract(
        fc.col("combined_content"),
        ArticleAnalysis,
        max_output_tokens=512,
    ).alias("analysis_metadata"),
).unnest("analysis_metadata").cache()
enriched_df.collect()
print("✅ Information extraction completed!")
print("\n📊 Sample extracted information:")
enriched_df.select("source", "headline", "primary_topic", "bias_indicators", "emotional_language", "opinion_markers").show(3)
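
`unnest` promotes the fields of the extracted struct to top-level columns, which is why `bias_indicators`, `emotional_language`, and `opinion_markers` can be selected directly above. The plain-Python sketch below illustrates the idea on a single row. `unnest_row` is a hypothetical helper, the row values are made up, and fenic's actual implementation may differ.

```python
def unnest_row(row, struct_key):
    """Return a copy of `row` with the dict at `struct_key` flattened into it."""
    flat = {key: value for key, value in row.items() if key != struct_key}
    for field, value in row[struct_key].items():
        if field in flat:
            raise ValueError(f"column name collision: {field}")
        flat[field] = value
    return flat


# Made-up example row shaped like the output of the select() above.
row = {
    "source": "Example Source",
    "primary_topic": "business",
    "analysis_metadata": {
        "bias_indicators": "none detected",
        "emotional_language": "neutral descriptive language",
        "opinion_markers": "attributed statements only",
    },
}

flat_row = unnest_row(row, "analysis_metadata")
print(sorted(flat_row))
```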

## Stage 2: Political Bias Classification

Now we'll use the extracted information to classify the political bias of each article. We combine the topic, bias indicators, emotional language, and opinion markers to give the model rich context for accurate bias detection.

In [None]:
# Combine extracted information for bias classification
combined_extracts = fc.text.concat(
    lit("Primary Topic: "),
    fc.col("primary_topic"),
    # Delimiter keeps the topic value from running into the next label
    lit("||||||||||||"),
    lit("Political Bias Indicators: "),
    fc.col("bias_indicators"),
    lit("||||||||||||"),
    lit("Emotional Language Summary: "),
    fc.col("emotional_language"),
    lit("||||||||||||"),
    lit("Opinion Markers: "),
    fc.col("opinion_markers")
)

enriched_df = enriched_df.with_column("combined_extracts", combined_extracts)

# Classify political bias and journalistic style
results_df = enriched_df.select(
    "*",
    fc.semantic.classify(
        col("combined_extracts"), 
        ["far_left", "left_leaning", "neutral", "right_leaning", "far_right"]
    ).alias("content_bias"),
    fc.semantic.classify(
        col("combined_extracts"), 
        ["sensationalist", "informational"]
    ).alias("journalistic_style")
).cache()
results_df.collect()
print("✅ Bias classification completed!")
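
The classifier only sees the concatenated string, so it helps to inspect what that string looks like for one row. The sketch below builds a comparable string in plain Python. `build_classifier_input` and the sample values are hypothetical, and a delimiter is placed between every pair of sections so that adjacent fields cannot run together.

```python
DELIMITER = "||||||||||||"


def build_classifier_input(row):
    """Join labeled fields, with a delimiter between consecutive sections."""
    sections = [
        ("Primary Topic", row["primary_topic"]),
        ("Political Bias Indicators", row["bias_indicators"]),
        ("Emotional Language Summary", row["emotional_language"]),
        ("Opinion Markers", row["opinion_markers"]),
    ]
    return DELIMITER.join(f"{label}: {value}" for label, value in sections)


# Made-up values standing in for one row of `enriched_df`.
example_row = {
    "primary_topic": "climate",
    "bias_indicators": "betrays, capitulated",
    "emotional_language": "emotionally charged",
    "opinion_markers": "we cannot afford",
}

text = build_classifier_input(example_row)
print(text)
```

Note that the source name is deliberately absent from the input, so the label has to come from the language itself.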

## Results: Complete Bias Detection Analysis

Let's examine our complete results, showing how each article was classified for topic, bias, and journalistic style.

In [None]:
print("📊 Complete Bias Detection Results:")
print("=" * 70)

# Show key results for each article
summary_results = results_df.select(
    "source",
    "headline",
    "primary_topic",
    "content_bias",
    "journalistic_style"
)
summary_results.show()
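
To check whether a source is classified consistently across its articles, the per-article labels can be tallied by source. The sketch below is plain Python over hypothetical rows; in the notebook, the rows could come from exporting `summary_results` to a list of dicts, assuming your fenic version provides such an export. `label_consistency` is an illustrative helper, not a fenic function.

```python
from collections import Counter, defaultdict


def label_consistency(rows, label_key="content_bias"):
    """Per source: the most common label and the share of articles carrying it."""
    labels_by_source = defaultdict(list)
    for row in rows:
        labels_by_source[row["source"]].append(row[label_key])

    summary = {}
    for source, labels in labels_by_source.items():
        label, count = Counter(labels).most_common(1)[0]
        summary[source] = (label, count / len(labels))
    return summary


# Hypothetical classification results, not actual model output.
rows = [
    {"source": "Source A", "content_bias": "neutral"},
    {"source": "Source A", "content_bias": "neutral"},
    {"source": "Source B", "content_bias": "left_leaning"},
    {"source": "Source B", "content_bias": "far_left"},
    {"source": "Source B", "content_bias": "left_leaning"},
    {"source": "Source B", "content_bias": "left_leaning"},
]

for source, (label, share) in label_consistency(rows).items():
    print(f"{source}: {label} ({share:.0%} of articles)")
```

A share well below 100% suggests either genuinely mixed coverage or unstable classifications, and either case is worth a closer look at the individual articles.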

## Bias Language Analysis

Let's examine the specific language patterns that indicate bias versus neutral reporting. This helps us understand what linguistic markers the model identified.

In [None]:
# Bias Indicators Analysis
bias_indicators_df = results_df.select(
    "source",
    "headline",
    "content_bias",
    "bias_indicators",
    "emotional_language",
    "opinion_markers"
)

print("🔍 Bias Language Analysis:")
print("=" * 70)

# Show examples of neutral vs biased language
print("📰 Neutral Articles - Language Patterns:")
neutral_examples = bias_indicators_df.filter(
    fc.col("content_bias") == "neutral"
).select("source", "headline", "bias_indicators", "opinion_markers")
neutral_examples.show(5)

print("\n📰 Biased Articles - Language Patterns:")
biased_examples = bias_indicators_df.filter(
    (fc.col("content_bias") != "neutral")
).select("source", "headline", "content_bias", "bias_indicators", "emotional_language", "opinion_markers")
biased_examples.show(6)

## AI-Generated Media Profiles

Finally, let's use fenic's semantic reduction capabilities to generate media profiles for each news source based on all the information we've extracted.

In [None]:
# Generate semantic summaries of language patterns for each source
source_language_profiles = results_df.group_by("source").agg(
    # Use semantic.reduce to produce a media profile for each source, without including the entire original articles.
    fc.semantic.reduce(
        """
           Create a concise (3-5 sentence) media profile for {source} based on the following information we have extracted from its articles:
           Detected Political Bias: {content_bias}
           Detected Bias Indicators: {bias_indicators}
           Opinion Indicators: {opinion_markers}
           Emotional Language: {emotional_language}
           Journalistic Style: {journalistic_style}
           
           Summarize the information provided and limit your use of direct quotes from the text.
        """,
        max_output_tokens=512,
    ).alias("source_profile"),
).select(col("source"), col("source_profile"))

print("🏢 AI-Generated Media Profiles:")
print("-" * 50)
source_language_profiles.show()

# Clean up session
session.stop()

print("\n✅ News Bias Detection Complete!")
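
The reduce prompt refers to columns through `{placeholder}` names, and the aggregation runs once per source group. The sketch below is only a conceptual illustration of grouping rows and filling placeholders with `str.format`; it makes no claim about how fenic batches or combines rows internally. The shortened `TEMPLATE`, the helper `render_rows_by_source`, and the rows are all hypothetical.

```python
from collections import defaultdict

# Shortened, hypothetical template using the same placeholder style.
TEMPLATE = "Bias: {content_bias}; Style: {journalistic_style}"


def render_rows_by_source(rows, template=TEMPLATE):
    """Group rows by source and fill the template once per row."""
    rendered = defaultdict(list)
    for row in rows:
        # Extra keys in `row` (such as "source") are ignored by str.format.
        rendered[row["source"]].append(template.format(**row))
    return dict(rendered)


# Hypothetical rows, not actual model output.
rows = [
    {"source": "Source A", "content_bias": "neutral", "journalistic_style": "informational"},
    {"source": "Source A", "content_bias": "neutral", "journalistic_style": "informational"},
    {"source": "Source B", "content_bias": "right_leaning", "journalistic_style": "sensationalist"},
]

rendered = render_rows_by_source(rows)
for source, lines in rendered.items():
    print(source, lines)
```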

## Key Insights and Applications

This analysis demonstrates several powerful capabilities of semantic classification:

### 🎯 Key Insights Demonstrated:
- **Content-based bias detection** that never shows the source name to the classifier
- **Source consistency analysis** across multiple articles
- **Language pattern identification** for bias indicators
- **Topic-agnostic bias detection** (the same source leans the same way across different topics)
- **Journalistic style assessment** (sensationalist vs. informational)

### 🔍 Practical Applications:
- **Media literacy education** showing how bias manifests in language
- **Content moderation** for balanced information presentation
- **News aggregation** with bias awareness
- **Research on editorial patterns** and media analysis

### 🚀 Next Steps:
- Try analyzing your own news articles or text data
- Experiment with different classification categories
- Combine with other semantic operations like extraction and mapping
- Build automated content analysis pipelines