# Test New Topic Types and Prompt

This notebook tests the new topic-extraction prompt and entity types without saving anything to the database.

## New Entity Types:
- PERSON: People, scientists, researchers
- ORG: Organizations, companies, institutions
- LOCATION: Geographical places, countries, cities
- PRODUCT: Specific software, hardware, or services
- PROGRAMMING_LANGUAGE: Programming languages
- SCIENTIFIC_TERM: Specific scientific concepts, theories, species
- FIELD_OF_STUDY: Broader domains of knowledge
- EVENT: Specific named events, conferences
- WORK_OF_ART: Named creative works
- LAW_OR_POLICY: Named laws, regulations, policies


In [6]:
import os
import json
import time
from typing import List, Dict
from dotenv import load_dotenv
import google.generativeai as genai

# Load environment variables
load_dotenv()

# Configure Gemini
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if GOOGLE_API_KEY:
    genai.configure(api_key=GOOGLE_API_KEY)
    model = genai.GenerativeModel('gemini-2.0-flash-lite')
    print("✅ Gemini configured")
else:
    model = None
    print("❌ GOOGLE_API_KEY not set")

# New entity types
NEW_ENTITY_TYPES = [
    'PERSON', 'ORG', 'LOCATION', 'PRODUCT', 'PROGRAMMING_LANGUAGE', 
    'SCIENTIFIC_TERM', 'FIELD_OF_STUDY', 'EVENT', 'WORK_OF_ART', 'LAW_OR_POLICY'
]

print(f"📋 New entity types: {NEW_ENTITY_TYPES}")


✅ Gemini configured
📋 New entity types: ['PERSON', 'ORG', 'LOCATION', 'PRODUCT', 'PROGRAMMING_LANGUAGE', 'SCIENTIFIC_TERM', 'FIELD_OF_STUDY', 'EVENT', 'WORK_OF_ART', 'LAW_OR_POLICY']


In [2]:
def extract_topics_new_prompt(title: str, content: str) -> List[Dict]:
    """Extract topics using the new prompt and entity types"""
    prompt = f"""You are an expert NLP system. Your task is to extract the 5-10 most important named entities and key concepts from the following news article.

Focus on identifying specific and relevant items. Use the following entity types:
- **PERSON**: People, scientists, researchers.
- **ORG**: Organizations, companies, institutions (e.g., "NASA", "Google").
- **LOCATION**: Geographical places, countries, cities.
- **PRODUCT**: Specific software, hardware, or services (e.g., "iPhone 17", "GitHub Copilot").
- **PROGRAMMING_LANGUAGE**: Programming languages (e.g., "Python", "Rust").
- **SCIENTIFIC_TERM**: Specific scientific concepts, theories, species, or astronomical bodies (e.g., "black hole", "CRISPR").
- **FIELD_OF_STUDY**: Broader domains of knowledge (e.g., "Machine Learning", "Astrophysics").
- **EVENT**: Specific named events, conferences, or historical periods (e.g., "WWDC 2025", "The Renaissance").
- **WORK_OF_ART**: Named creative works like books, films, or paintings.
- **LAW_OR_POLICY**: Named laws, regulations, or policies (e.g., "GDPR").

**Article to Analyze:**
Title: {title}
Content: {content[:1500]}

**Instructions:**
1.  Analyze the title and content to find the most significant topics.
2.  Do not extract generic or overly broad terms (e.g., "science", "research").
3.  Return **ONLY** a raw JSON array with the specified format. Do not add any introductory text, explanations, or markdown formatting like ```json.

**JSON Output Format:**
[
  {{"text": "entity name", "type": "TYPE_FROM_LIST_ABOVE"}},
  {{"text": "another entity", "type": "TYPE_FROM_LIST_ABOVE"}}
]
"""

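    try:
        response = model.generate_content(
            prompt,
            generation_config=genai.types.GenerationConfig(
                temperature=0.3,
                max_output_tokens=1000,
            )
        )

        # Locate the JSON array in the response, ignoring any surrounding text
        text = response.text
        json_start = text.find('[')
        json_end = text.rfind(']') + 1

        if json_start >= 0 and json_end > json_start:
            topics = json.loads(text[json_start:json_end])
            return [
                {
                    'text': t['text'],
                    # Fall back to SCIENTIFIC_TERM when the model omits the type
                    'type': t.get('type', 'SCIENTIFIC_TERM'),
                    # Fixed placeholder scores; the model does not return them
                    'confidence': 0.8,
                    'tfidf_score': 0.5
                }
                for t in topics
                # Skip malformed items instead of discarding the whole response
                if isinstance(t, dict) and 'text' in t
            ]
        return []
    except Exception as e:
        # Covers API failures, invalid JSON, and an unconfigured (None) model
        print(f"❌ Error extracting topics: {e}")
        return []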

print("✅ New extraction function ready")


✅ New extraction function ready
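
The prompt asks for 5-10 entities drawn from a fixed list of types, but the function itself does not enforce either constraint (the first test article below returns 11 topics). The following is a minimal sketch of a stricter post-processing step, not part of the notebook: `parse_topics`, `ALLOWED_TYPES`, and `max_topics` are hypothetical names. It assumes that unknown types and case-insensitive duplicates should simply be dropped.

```python
import json
from typing import Dict, List

# Same type names as NEW_ENTITY_TYPES in the setup cell
ALLOWED_TYPES = {
    'PERSON', 'ORG', 'LOCATION', 'PRODUCT', 'PROGRAMMING_LANGUAGE',
    'SCIENTIFIC_TERM', 'FIELD_OF_STUDY', 'EVENT', 'WORK_OF_ART', 'LAW_OR_POLICY',
}


def parse_topics(raw: str, max_topics: int = 10) -> List[Dict[str, str]]:
    """Hypothetical helper: parse and validate a raw model response."""
    # Find the outermost JSON array, tolerating text before and after it
    start, end = raw.find('['), raw.rfind(']') + 1
    if start < 0 or end <= start:
        return []
    try:
        items = json.loads(raw[start:end])
    except json.JSONDecodeError:
        return []

    topics, seen = [], set()
    for item in items:
        if not isinstance(item, dict):
            continue
        text = str(item.get('text') or '').strip()
        entity_type = str(item.get('type') or '').strip().upper()
        key = text.lower()
        # Drop empty names, types outside the list, and repeated entities
        if not text or entity_type not in ALLOWED_TYPES or key in seen:
            continue
        seen.add(key)
        topics.append({'text': text, 'type': entity_type})
        if len(topics) == max_topics:
            break
    return topics


sample = (
    'Here you go: [{"text": "NASA", "type": "ORG"}, '
    '{"text": "nasa", "type": "org"}, '
    '{"text": "science", "type": "TOPIC"}]'
)
print(parse_topics(sample))  # [{'text': 'NASA', 'type': 'ORG'}]
```

Dropping items with an unknown type is a stricter choice than the notebook's fallback to `SCIENTIFIC_TERM`; either policy could be swapped in at the marked check.
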


In [3]:
# Test with sample articles
test_articles = [
    {
        "title": "OpenAI Releases GPT-5 with Revolutionary AI Capabilities",
        "content": "OpenAI has announced the release of GPT-5, their most advanced language model yet. The new model, built on Python and PyTorch frameworks, demonstrates significant improvements in reasoning, coding, and creative writing. CEO Sam Altman revealed that GPT-5 was trained on a massive dataset including scientific papers, legal documents, and creative works. The model shows particular strength in machine learning applications and can now generate complex Python code with 95% accuracy. Researchers at Stanford University have already begun testing the model's capabilities in various fields of study including computer science and artificial intelligence. The release comes just weeks after Google's announcement of their competing Gemini Pro model, marking a new era in the AI arms race."
    },
    {
        "title": "NASA's James Webb Telescope Discovers New Exoplanet in Habitable Zone",
        "content": "NASA's James Webb Space Telescope has discovered a potentially habitable exoplanet orbiting a red dwarf star 40 light-years from Earth. The planet, designated TOI-715b, is located in the constellation of Ursa Major and shows signs of atmospheric water vapor. Dr. Sarah Johnson, lead astronomer at the Space Telescope Science Institute in Baltimore, Maryland, announced the discovery at the American Astronomical Society conference. The planet's atmosphere contains traces of methane and carbon dioxide, suggesting possible biological activity. This discovery represents a major breakthrough in the field of astrobiology and exoplanet research. The findings were published in the journal Nature and have implications for the search for extraterrestrial life."
    },
    {
        "title": "EU Implements New AI Act Regulating Artificial Intelligence Systems",
        "content": "The European Union has officially implemented the AI Act, a comprehensive regulation governing artificial intelligence systems across all 27 member states. The legislation, which took effect on January 1st, 2025, requires companies to conduct risk assessments for high-risk AI applications. Tech giants like Google, Microsoft, and Meta must comply with strict transparency requirements for their AI systems. The law specifically targets machine learning models used in healthcare, finance, and autonomous vehicles. Companies developing AI products must now provide detailed documentation about their algorithms and training data. The regulation is expected to impact the global AI industry, with many companies already adapting their development practices to meet the new standards."
    }
]

print(f"🧪 Testing with {len(test_articles)} sample articles")


🧪 Testing with 3 sample articles


In [4]:
# Test extraction on each article
if model:
    for i, article in enumerate(test_articles, 1):
        print(f"\n{'='*60}")
        print(f"📰 Article {i}: {article['title']}")
        print(f"Content preview: {article['content'][:100]}...")
        
        # Extract topics
        topics = extract_topics_new_prompt(article['title'], article['content'])
        
        if topics:
            print(f"\n✅ Extracted {len(topics)} topics:")
            for j, topic in enumerate(topics, 1):
                print(f"  {j:2d}. {topic['text']} ({topic['type']}) - Confidence: {topic['confidence']}")
        else:
            print("❌ No topics extracted")
        
        # Rate limiting
        time.sleep(2)
else:
    print("❌ Gemini model not configured. Set GOOGLE_API_KEY to test.")



📰 Article 1: OpenAI Releases GPT-5 with Revolutionary AI Capabilities
Content preview: OpenAI has announced the release of GPT-5, their most advanced language model yet. The new model, bu...


E0000 00:00:1760561776.471881 2741769 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.



✅ Extracted 11 topics:
   1. OpenAI (ORG) - Confidence: 0.8
   2. GPT-5 (PRODUCT) - Confidence: 0.8
   3. Sam Altman (PERSON) - Confidence: 0.8
   4. Python (PROGRAMMING_LANGUAGE) - Confidence: 0.8
   5. PyTorch (PROGRAMMING_LANGUAGE) - Confidence: 0.8
   6. machine learning (FIELD_OF_STUDY) - Confidence: 0.8
   7. Stanford University (ORG) - Confidence: 0.8
   8. computer science (FIELD_OF_STUDY) - Confidence: 0.8
   9. artificial intelligence (FIELD_OF_STUDY) - Confidence: 0.8
  10. Gemini Pro (PRODUCT) - Confidence: 0.8
  11. Google (ORG) - Confidence: 0.8

📰 Article 2: NASA's James Webb Telescope Discovers New Exoplanet in Habitable Zone
Content preview: NASA's James Webb Space Telescope has discovered a potentially habitable exoplanet orbiting a red dw...

✅ Extracted 10 topics:
   1. James Webb Space Telescope (ORG) - Confidence: 0.8
   2. Exoplanet (SCIENTIFIC_TERM) - Confidence: 0.8
   3. TOI-715b (SCIENTIFIC_TERM) - Confidence: 0.8
   4. Ursa Major (LOCATION) - Confidence: 0.8
   [output truncated]
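
The loop above paces requests with a fixed `time.sleep(2)`. One alternative is to retry a failed call with exponentially growing delays. The sketch below is illustrative only: `call_with_backoff`, `retries`, and `base_delay` are hypothetical names, and it assumes every exception is worth retrying, which real code would likely narrow.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar('T')


def call_with_backoff(fn: Callable[[], T], retries: int = 3, base_delay: float = 2.0) -> T:
    """Hypothetical helper: call fn, waiting base_delay * 2**attempt between retries."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            # Assumption: any exception is retryable; re-raise after the last attempt
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
    # Only reachable if retries is negative
    raise ValueError("retries must be >= 0")


# Simulated flaky call: fails twice, then succeeds
attempts: List[int] = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = call_with_backoff(flaky, retries=3, base_delay=0.0)
print(result, len(attempts))  # ok 3
```

Because `extract_topics_new_prompt` catches exceptions and returns an empty list, a wrapper like this would have to go around the `model.generate_content` call inside the function rather than around the function itself.
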

In [5]:
# Analyze type distribution
if model:
    all_topics = []
    for article in test_articles:
        topics = extract_topics_new_prompt(article['title'], article['content'])
        all_topics.extend(topics)
        time.sleep(1)  # Rate limiting
    
    if all_topics:
        print(f"\n📊 Analysis of {len(all_topics)} extracted topics:")
        
        # Count by type
        type_counts = {}
        for topic in all_topics:
            type_counts[topic['type']] = type_counts.get(topic['type'], 0) + 1
        
        print(f"\n📈 Type Distribution:")
        for entity_type, count in sorted(type_counts.items(), key=lambda x: x[1], reverse=True):
            print(f"  {entity_type}: {count} topics")
        
        # Show examples of each type
        print(f"\n🔍 Examples by Type:")
        for entity_type in NEW_ENTITY_TYPES:
            examples = [t['text'] for t in all_topics if t['type'] == entity_type]
            if examples:
                print(f"  {entity_type}: {', '.join(examples[:3])}{'...' if len(examples) > 3 else ''}")
            else:
                print(f"  {entity_type}: (no examples)")
    else:
        print("❌ No topics extracted for analysis")
else:
    print("❌ Gemini model not configured. Set GOOGLE_API_KEY to test.")



📊 Analysis of 31 extracted topics:

📈 Type Distribution:
  ORG: 11 topics
  FIELD_OF_STUDY: 8 topics
  PRODUCT: 3 topics
  PERSON: 2 topics
  PROGRAMMING_LANGUAGE: 2 topics
  SCIENTIFIC_TERM: 2 topics
  LOCATION: 2 topics
  LAW_OR_POLICY: 1 topics

🔍 Examples by Type:
  PERSON: Sam Altman, Dr. Sarah Johnson
  ORG: OpenAI, Stanford University, Google...
  LOCATION: Ursa Major, Baltimore, Maryland
  PRODUCT: GPT-5, Gemini Pro, autonomous vehicles
  PROGRAMMING_LANGUAGE: Python, PyTorch
  SCIENTIFIC_TERM: Exoplanet, TOI-715b
  FIELD_OF_STUDY: Machine Learning, computer science, artificial intelligence...
  EVENT: (no examples)
  WORK_OF_ART: (no examples)
  LAW_OR_POLICY: AI Act
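
The type tally in the analysis cell can also be written with `collections.Counter`, which handles the counting and the sort by frequency. This is a small sketch rather than the notebook's code: `summarize_types` is a hypothetical helper, and the sample list is a hand-picked subset of the entities shown in the output above. It also prints "1 topic" instead of "1 topics".

```python
from collections import Counter
from typing import Dict, List


def summarize_types(topics: List[Dict[str, str]]) -> List[str]:
    """Hypothetical helper: one summary line per type, most frequent first."""
    counts = Counter(topic['type'] for topic in topics)
    return [
        f"{entity_type}: {count} topic{'' if count == 1 else 's'}"
        for entity_type, count in counts.most_common()
    ]


# Hand-picked subset of the extracted topics, for illustration only
sample_topics = [
    {'text': 'OpenAI', 'type': 'ORG'},
    {'text': 'GPT-5', 'type': 'PRODUCT'},
    {'text': 'Sam Altman', 'type': 'PERSON'},
    {'text': 'Stanford University', 'type': 'ORG'},
    {'text': 'Google', 'type': 'ORG'},
]

for line in summarize_types(sample_topics):
    print(line)
```

Collecting the topics once in the extraction loop and passing that list to a helper like this would also avoid the second round of API calls that the analysis cell makes.
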
