# Structured Text Insights Extraction Demo

This notebook demonstrates the **Structured Text Insights Flow** using the Bloomberg Financial News dataset. 

## What You'll Learn
- How to use the structured insights flow for comprehensive text analysis
- Extract summaries, keywords, entities, and sentiment from financial news
- Inspect the structured results and scale the analysis to larger samples
- Extend the flow with custom blocks for domain-specific analysis

## Flow Capabilities
The structured insights flow performs **4 key analyses** on any text:
1. **📝 Summary**: Concise 2-3 sentence summaries
2. **🔑 Keywords**: Top 10 most important terms
3. **🏷️ Entities**: Named entities (people, organizations, locations)
4. **😊 Sentiment**: Emotional tone analysis (positive/negative/neutral)

All results are combined into a **structured JSON output** for easy processing and analysis.
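
As a rough illustration of what one row of that output might look like, here is a minimal sketch. The field names match the four analyses above, but the example values and their types (lists versus strings) are assumptions made for illustration; the real output depends on the flow and the model.

```python
import json

# Hypothetical example of one row's structured output; real values come from the LLM.
example_row = json.dumps({
    "summary": "Acme Corp reported higher quarterly profit. Shares rose in early trading.",
    "keywords": ["earnings", "profit", "shares"],
    "entities": ["Acme Corp", "New York"],
    "sentiment": "positive",
})

# The flow returns the insights as a JSON string, so parse it before use.
insights = json.loads(example_row)
for field in ("summary", "keywords", "entities", "sentiment"):
    print(f"{field}: {insights[field]}")
```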

## Setup and Installation

In [None]:
%load_ext autoreload
%autoreload 2

# pip install sdg_hub[examples]

In [None]:
import json
import random
import warnings

from datasets import load_dataset
import nest_asyncio

from sdg_hub import Flow, FlowRegistry

warnings.filterwarnings('ignore')

# Required for async execution in notebooks
nest_asyncio.apply()

## 1. Flow Discovery and Loading

SDG Hub automatically discovers all available flows. Let's find our structured insights flow:

In [None]:
# Auto-discover all available flows
FlowRegistry.discover_flows()

# List all flows and report how many were found
flows = FlowRegistry.list_flows()
print(f"Discovered {len(flows)} flows")

In [None]:
# Search for text analysis flows
text_flows = FlowRegistry.search_flows(tag="text-analysis")
print(f"Text analysis flows: {text_flows}")

# Load our structured insights flow
flow_id = "green-clay-812" 
flow_path = FlowRegistry.get_flow_path(flow_id)
flow = Flow.from_yaml(flow_path)

print(f"\n✅ Loaded flow: {flow_id}") 

flow.print_info()

## 2. Model Configuration

The flow supports multiple LLM models. Let's configure it:

In [None]:
# Check recommended models
print("Default model:", flow.get_default_model())
print("Model recommendations:", flow.get_model_recommendations())

In [None]:
# Configure the flow to use a specific model
# Option 1: Use a local vLLM server
flow.set_model_config(
    model="hosted_vllm/openai/gpt-oss-20b",
    api_base="http://localhost:8201/v1",
    api_key="EMPTY",
    # this only works with models which support reasoning_effort
    # if your model does not support it, you can remove this parameter
    extra_body={"reasoning_effort": "low"}
)

# Option 2: Use OpenAI (requires API key)
# flow.set_model_config(
#     model="gpt-4o-mini",
#     api_key="your-openai-api-key"
# )

# Option 3: Use Anthropic Claude (requires API key)
# flow.set_model_config(
#     model="anthropic/claude-3-haiku",
#     api_key="your-anthropic-api-key"
# )

print("✅ Model configuration ready")
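
For the hosted options, it may be preferable to read the API key from the environment rather than pasting it into the notebook. The helper below is a hypothetical sketch, not part of SDG Hub: `build_model_config` and the environment variable name are assumptions, and the returned dictionary simply mirrors the keyword arguments used in the cell above.

```python
import os

def build_model_config(model, api_key_env, api_base=None):
    """Assemble keyword arguments for flow.set_model_config from the environment."""
    api_key = os.environ.get(api_key_env)
    if not api_key:
        raise RuntimeError(f"Set the {api_key_env} environment variable first")
    config = {"model": model, "api_key": api_key}
    if api_base:
        # Only include api_base when pointing at a custom endpoint
        config["api_base"] = api_base
    return config

# Usage in the notebook (the variable name is an assumption):
# flow.set_model_config(**build_model_config("gpt-4o-mini", "OPENAI_API_KEY"))
```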

## 3. Dataset Loading and Exploration

We'll use the **Bloomberg Financial News dataset** - 447k financial news articles from 2006-2013:

In [None]:
# Load the Bloomberg Financial News dataset
print("Loading Bloomberg Financial News dataset...")
dataset = load_dataset("danidanou/Bloomberg_Financial_News", split="train")

print(f"📊 Dataset size: {len(dataset):,} articles")
print(f"📅 Columns: {dataset.column_names}")
print(f"💾 Dataset features: {dataset.features}")

In [None]:
# Explore the dataset structure
sample = dataset[0]
print("=== Sample Article ===")
print(f"Headline: {sample['Headline']}")
print(f"Date: {sample['Date']}")
print(f"Journalists: {sample['Journalists']}")
print(f"Article length: {len(sample['Article'])} characters")
print(f"Article preview: {sample['Article'][:300]}...")

In [None]:
# Select a small sample for demonstration (start with 50 articles)
# For production, you can process thousands of articles
sample_size = 50
demo_dataset = dataset.shuffle(seed=42).select(range(sample_size))

print(f"📝 Demo dataset prepared: {len(demo_dataset)} articles")
print(f"📊 Average article length: {sum(len(article['Article']) for article in demo_dataset) / len(demo_dataset):.0f} characters")

In [None]:
# Discover what dataset schema is expected by the flow

schema_dataset = flow.get_dataset_schema() 
print(f"Required columns: {schema_dataset.column_names}")
print(f"Schema: {schema_dataset.features}")

In [None]:
# The flow expects a 'text' column, so we rename the 'Article' column to 'text'
demo_dataset = demo_dataset.rename_column("Article", "text")
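
Before running the flow, a quick check that every required column is present can catch a missed rename early. The sketch below uses plain lists so it runs on its own; `missing_columns` is a hypothetical helper, and the column lists include only the columns referenced in this notebook.

```python
def missing_columns(required, available):
    """Return required column names that are absent, preserving order."""
    available = set(available)
    return [name for name in required if name not in available]

# Columns referenced in this notebook, before and after the rename
before = ["Headline", "Date", "Journalists", "Article"]
after = ["Headline", "Date", "Journalists", "text"]

print(missing_columns(["text"], before))  # ['text']
print(missing_columns(["text"], after))   # []
```

In the notebook, the same helper could be called as `missing_columns(flow.get_dataset_schema().column_names, demo_dataset.column_names)`.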

## 4. Running the Structured Insights Flow

Now let's extract structured insights from our financial news articles:

In [None]:
# Generate structured insights
print("🚀 Running structured insights extraction...")
print("⏱️ This may take a few minutes depending on your model setup...")

# Run the flow
results = flow.generate(demo_dataset)

print("✅ Processing complete!")
print(f"📊 Generated insights for {len(results)} articles")
print(f"📋 Result columns: {results.column_names}")

In [None]:
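# Display a randomly chosen result
sample_result = results[random.randint(0, len(results) - 1)]

# Metadata is read from the sampled row itself so it matches the insights below
# (assumes the flow carries input columns through, as it does for 'text')
print("=== Sample Article Analysis ===")
print(f"📰 Original headline: {sample_result['Headline']}")
print(f"📅 Date: {sample_result['Date']}")
print(f"✍️ Journalists: {sample_result['Journalists']}")
print(f"📄 Article length: {len(sample_result['text'])} characters")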
print()

# Parse and display the structured insights
insights = json.loads(sample_result["structured_insights"])
print("🔍 EXTRACTED INSIGHTS:")
print(json.dumps(insights, indent=2, ensure_ascii=False))
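
To look beyond a single article, the per-row JSON strings can be aggregated. The following is a minimal sketch that counts sentiment labels; it assumes each row has a `sentiment` field holding a bare label, which may not hold if the model adds explanation text. `sentiment_counts` is a hypothetical helper, not an SDG Hub function.

```python
import json
from collections import Counter

def sentiment_counts(raw_insights):
    """Count sentiment labels across JSON strings, tracking rows that cannot be used."""
    counts = Counter()
    for raw in raw_insights:
        try:
            insights = json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            counts["<unparseable>"] += 1
            continue
        if not isinstance(insights, dict):
            counts["<unparseable>"] += 1
            continue
        # Normalize case and whitespace so "Positive" and "positive" are merged
        label = str(insights.get("sentiment", "")).strip().lower() or "<missing>"
        counts[label] += 1
    return counts

# Hypothetical rows standing in for results["structured_insights"]
rows = ['{"sentiment": "Positive"}', '{"sentiment": "negative"}', "not json", '{"summary": "x"}']
print(sentiment_counts(rows))
```

In the notebook, `sentiment_counts(results["structured_insights"])` would apply the same logic to the generated column.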

## 5. Dynamic Flow Extension: Adding Stock Ticker Extraction

Now we'll demonstrate SDG Hub's **dynamic flow modification** capabilities. Instead of creating separate flow files, we can extend flows at runtime by adding custom processing blocks using existing SDG Hub components.

### What We'll Add:
We'll extend our structured insights flow to extract **stock ticker symbols** from financial news articles. This is perfect for Bloomberg financial news analysis!

### Approach:
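We'll use five existing SDG Hub blocks:
1. **PromptBuilderBlock** - Create a prompt to extract stock tickers
2. **LLMChatBlock** - Process the extraction using the LLM
3. **LLMParserBlock** - Extract the text content from the LLM response
4. **TextParserBlock** - Parse the tagged output to a clean list
5. **JSONStructureBlock** - Combine all fields, including the tickers, into the final JSON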

Let's see how to modify flows at runtime!

In [None]:
# We'll modify the existing flow by adding our ticker extraction blocks
# First, let's examine the current flow structure
flow.print_info()

In [None]:
# Import the blocks we need
from sdg_hub.core.blocks.llm import LLMChatBlock, PromptBuilderBlock, TextParserBlock, LLMParserBlock
from sdg_hub.core.blocks.transform import JSONStructureBlock

# Step 1: Add stock ticker extraction blocks to the flow
print("🚀 Adding stock ticker extraction blocks to the flow...")

# Create the stock ticker extraction blocks
ticker_prompt_block = PromptBuilderBlock(
    block_name="stock_ticker_prompt",
    input_cols=["text"],
    output_cols=["ticker_prompt"],
    prompt_config_path="extract_stock_tickers.yaml"
)

ticker_llm_block = LLMChatBlock(
    block_name="extract_stock_tickers",
    input_cols=["ticker_prompt"],
    output_cols=["raw_stock_tickers"],
    max_tokens=512,
    temperature=0.1  # Low temperature for more consistent extraction
)

ticker_llm_parser_block = LLMParserBlock(
    block_name="extract_stock_tickers",
    input_cols=["raw_stock_tickers"],
    extract_content=True,
    expand_lists=True
)

ticker_parser_block = TextParserBlock(
    block_name="parse_stock_tickers",
    input_cols=["extract_stock_tickers_content"],
    output_cols=["stock_tickers"],
    start_tags=["[STOCK_TICKERS]"],
    end_tags=["[/STOCK_TICKERS]"]
)

print("✅ Created ticker extraction blocks:")
print(f"  1. {ticker_prompt_block.block_name} - Builds extraction prompt")
print(f"  2. {ticker_llm_block.block_name} - Extracts tickers via LLM")
print(f"  3. {ticker_llm_parser_block.block_name} (LLMParserBlock) - Extracts response content")
print(f"  4. {ticker_parser_block.block_name} - Parses LLM output")

# Step 2: Update the JSONStructureBlock to include stock tickers
print("🔧 Updating JSON structure to include stock ticker field...")

# Create a new JSONStructureBlock configuration that includes our new stock_tickers field
enhanced_json_block = JSONStructureBlock(
    block_name="create_enhanced_structured_insights",
    input_cols=["summary", "keywords", "entities", "sentiment", "stock_tickers"],
    output_cols=["enhanced_structured_insights"]
)

print("✅ Enhanced JSON structure will include:")
print("  📝 summary - Article summary")
print("  🔑 keywords - Important keywords")
print("  🏷️ entities - Named entities")
print("  😊 sentiment - Emotional tone")
print("  📈 stock_tickers - Stock ticker symbols (NEW!)")
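
To make the role of the start and end tags concrete, here is a rough standalone approximation of tag-based parsing. It is not `TextParserBlock`'s implementation, and the comma-separated ticker format is an assumption, since the contents of `extract_stock_tickers.yaml` are not shown here.

```python
import re

def extract_tagged(text, start_tag="[STOCK_TICKERS]", end_tag="[/STOCK_TICKERS]"):
    """Return the stripped text found between each start/end tag pair."""
    # re.escape is needed because the tags contain regex metacharacters
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    return [match.strip() for match in re.findall(pattern, text, flags=re.DOTALL)]

def split_tickers(block):
    """Split a comma-separated string into upper-case ticker symbols."""
    return [part.strip().upper() for part in block.split(",") if part.strip()]

# Hypothetical LLM response
response = "Here are the tickers: [STOCK_TICKERS]AAPL, msft , GOOG[/STOCK_TICKERS]"
blocks = extract_tagged(response)
tickers = split_tickers(blocks[0]) if blocks else []
print(tickers)  # ['AAPL', 'MSFT', 'GOOG']
```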

In [None]:

# Collect the new blocks in execution order
ticker_blocks = [
    ticker_prompt_block,
    ticker_llm_block,
    ticker_llm_parser_block,
    ticker_parser_block,
    enhanced_json_block
]

# Drop the original JSONStructureBlock (assumed to be the last block in the flow),
# then append the ticker blocks and the enhanced JSON block
flow.blocks.pop()
flow.blocks.extend(ticker_blocks)
flow.print_info()


In [None]:
# Configure the new LLM blocks with our model settings
flow.set_model_config(
    model="hosted_vllm/openai/gpt-oss-20b",
    api_base="http://localhost:8201/v1", 
    api_key="EMPTY",
    # this only works with models which support reasoning_effort
    # if your model does not support it, you can remove this parameter
    extra_body={"reasoning_effort": "low"}
)

print("\n🎯 Ready to run enhanced flow with stock ticker extraction!")

In [None]:
# Generate structured insights
print("🚀 Running structured insights extraction...")
print("⏱️ This may take a few minutes depending on your model setup...")

# Run the flow
results2 = flow.generate(demo_dataset)

print("✅ Processing complete!")
print(f"📊 Generated insights for {len(results2)} articles")
print(f"📋 Result columns: {results2.column_names}")

In [None]:
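# Display a randomly chosen result
sample_result2 = results2[random.randint(0, len(results2) - 1)]

# Metadata is read from the sampled row itself so it matches the insights below
# (assumes the flow carries input columns through, as it does for 'text')
print("=== Sample Article Analysis ===")
print(f"📰 Original headline: {sample_result2['Headline']}")
print(f"📅 Date: {sample_result2['Date']}")
print(f"✍️ Journalists: {sample_result2['Journalists']}")
print(f"📄 Article length: {len(sample_result2['text'])} characters")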
print()

# Parse and display the structured insights
insights2 = json.loads(sample_result2["enhanced_structured_insights"])
print("🔍 EXTRACTED INSIGHTS:")
print(json.dumps(insights2, indent=2, ensure_ascii=False))

## Next Steps

### 🧪 **Experiment Further**
1. **Scale up**: Process 100+ articles to see larger patterns
2. **Time analysis**: Filter by date ranges to see trends over time
3. **Model comparison**: Try different LLMs and compare results
4. **Custom prompts**: Modify the prompt templates for your domain

### 🔧 **Customize for Your Use Case**
1. **Domain adaptation**: Modify prompts for your specific industry
2. **Additional insights**: Add blocks for topic classification, urgency scoring, etc.
3. **Output format**: Customize JSON structure for your applications
4. **Quality filters**: Add validation and quality checks
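
As one possible starting point for the quality filters mentioned above, the sketch below checks a single structured-insights JSON string for missing fields and unexpected sentiment labels. `validate_insights` is a hypothetical helper; the allowed labels come from the flow description (positive/negative/neutral) and assume the model returns a bare label.

```python
import json

REQUIRED_FIELDS = ("summary", "keywords", "entities", "sentiment")
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def validate_insights(raw):
    """Return a list of problems found in one structured-insights JSON string."""
    try:
        insights = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return ["not valid JSON"]
    if not isinstance(insights, dict):
        return ["top-level value is not an object"]
    # Empty strings and empty lists count as missing
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not insights.get(name)]
    sentiment = str(insights.get("sentiment", "")).strip().lower()
    if sentiment and sentiment not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {sentiment}")
    return problems

# Hypothetical row with an out-of-vocabulary sentiment label
row = '{"summary": "s", "keywords": ["k"], "entities": ["e"], "sentiment": "mixed"}'
print(validate_insights(row))  # ['unexpected sentiment: mixed']
```

Rows that return an empty list pass the check; the rest can be dropped or sent back through the flow.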

### 🚀 **Build Your Own Model**
- Leverage the generated structured insights as high-quality training data for your own machine learning models.
- Fine-tune LLMs or train classifiers to automate similar analyses at scale.
- Refer to Training Hub (https://github.com/Red-Hat-AI-Innovation-Team/training_hub) to set up your own training pipeline.

### 📚 **Learn More**
- Explore other SDG Hub flows in the repository
- Check the documentation for advanced configuration options
- Join the community for questions and contributions