[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/finance/01_Financial_Data_Integration_MCP.ipynb)

# Financial Data Integration (MCP) - Real-Time Market Data

## Overview

This notebook demonstrates **financial data integration with MCP (Model Context Protocol) servers**, covering **real-time data ingestion** and **multi-source financial knowledge graph (KG) construction**. The pipeline is designed to ingest market data, stock prices, and metrics from Python/FastMCP servers, APIs, and RSS feeds into a financial knowledge graph.

### Key Features

- **MCP Integration**: Showcases MCP (Model Context Protocol) server integration capability
- **Seed Data Management**: Uses foundation market data for entity resolution
- **Real-Time Data Ingestion**: Ingests live market data from MCP servers and APIs
- **Multi-Source Financial KG**: Builds comprehensive financial knowledge graphs from multiple sources
- **Market Network Analysis**: Analyzes market structure using graph analytics
- **Comprehensive Data Sources**: Multiple financial APIs, RSS feeds, and databases
- **Modular Architecture**: Direct use of Semantica modules without core orchestrator

### Learning Objectives

- Ingest financial data from MCP servers, APIs, and RSS feeds
- Use seed data for foundation market information
- Extract financial entities (Companies, Stocks, Prices, Metrics, Markets, Sectors)
- Build financial knowledge graphs with seed data integration
- Analyze market network structure using graph analytics
- Store and query financial data using vector stores

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Seed Data Loading]
    B --> C[Document Parsing]
    C --> D[Text Processing]
    D --> E[Entity Extraction]
    E --> F[Relationship Extraction]
    F --> G[Deduplication]
    G --> H[Knowledge Graph]
    H --> I[Embeddings]
    I --> J[Vector Store]
    H --> K[Graph Analytics]
    K --> L[GraphRAG Queries]
    J --> L
    L --> M[Visualization]
    M --> N[Export]
```


## Installation


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


## Configuration & Setup


In [None]:
import os

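# Read the Groq API key from the environment instead of hardcoding it in the notebook.
# Set GROQ_API_KEY beforehand (for example through Colab secrets or your shell).
if not os.getenv("GROQ_API_KEY"):
    print("Warning: GROQ_API_KEY is not set; the GraphRAG query cell needs it.")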

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200


## Ingesting Financial Data from Multiple Sources


In [None]:
from semantica.ingest import MCPIngestor, WebIngestor, FeedIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

feed_sources = [
    # Financial RSS Feeds - More reliable sources
    ("Yahoo Finance", "https://feeds.finance.yahoo.com/rss/2.0/headline"),
    ("Financial Times", "https://www.ft.com/?format=rss"),
    ("Bloomberg", "https://feeds.bloomberg.com/markets/news.rss"),
    ("MarketWatch", "https://feeds.marketwatch.com/marketwatch/markets"),
    ("Seeking Alpha", "https://seekingalpha.com/feed.xml"),
    ("Investing.com", "https://www.investing.com/rss/news.rss"),
    ("Financial News", "https://www.fnlondon.com/rss"),
    ("Wall Street Journal", "https://feeds.a.dj.com/rss/RSSMarketsMain.xml"),
]

feed_ingestor = FeedIngestor()
all_documents = []

print(f"Ingesting from {len(feed_sources)} feed sources...")
for i, (feed_name, feed_url) in enumerate(feed_sources, 1):
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        feed_count = 0
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                if not hasattr(item, 'metadata'):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
                feed_count += 1
        
        if feed_count > 0:
            print(f"  [{i}/{len(feed_sources)}] {feed_name}: {feed_count} documents")
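    except Exception as exc:
        # Feeds are often unavailable or rate-limited; report the failure and move on
        print(f"  [{i}/{len(feed_sources)}] {feed_name}: skipped ({type(exc).__name__})")
        continue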

# Example: ingest a quote from the Alpha Vantage API (replace `demo` with your own API key)
alpha_vantage_api = "https://www.alphavantage.co/query?function=GLOBAL_QUOTE&symbol=AAPL&apikey=demo"
try:
    web_ingestor = WebIngestor()
    with redirect_stderr(StringIO()):
        api_documents = web_ingestor.ingest(alpha_vantage_api, method="url")
    for doc in api_documents:
        if not hasattr(doc, 'metadata'):
            doc.metadata = {}
        doc.metadata['source'] = 'Alpha Vantage API'
        all_documents.append(doc)
except Exception:
    pass

# MCP Server connection example (commented for demo)
# mcp_ingestor = MCPIngestor()
# mcp_ingestor.connect("financial_server", url="http://localhost:8000/mcp")
# resources = mcp_ingestor.list_available_resources("financial_server")
# mcp_data = mcp_ingestor.ingest_resources("financial_server", resource_uris=["resource://market_data"])

if not all_documents:
    market_data = """
    AAPL stock price: $150.25, market cap: $2.4T, volume: 50M shares, sector: Technology
    MSFT stock price: $380.50, market cap: $2.8T, volume: 30M shares, sector: Technology
    GOOGL stock price: $140.75, market cap: $1.8T, volume: 25M shares, sector: Technology
    JPM stock price: $145.30, market cap: $420B, volume: 15M shares, sector: Financial
    """
    with open("data/market_data.txt", "w") as f:
        f.write(market_data)
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/market_data.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")
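
The commented-out MCP calls in the cell above talk to a server over the Model Context Protocol, which frames requests as JSON-RPC 2.0 messages. As a rough illustration of what listing and reading a resource look like on the wire, the sketch below only builds the request payloads and does not contact any server. The helper name `build_mcp_request` is hypothetical, and `resource://market_data` is the URI from the commented example.

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)  # each JSON-RPC request carries its own id


def build_mcp_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope (hypothetical helper)."""
    request = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request


# MCP method names for discovering and fetching resources
list_request = build_mcp_request("resources/list")
read_request = build_mcp_request("resources/read", {"uri": "resource://market_data"})

print(json.dumps(read_request, indent=2))
```

A real client also performs an initialization handshake before these calls and handles the transport; an ingestor class or MCP client library would normally take care of both.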


In [None]:
from semantica.seed import SeedDataManager

seed_manager = SeedDataManager()

# Load foundation market data (exchanges, indices, sectors)
seed_data = [
    {"type": "Market", "text": "NASDAQ", "description": "Stock exchange"},
    {"type": "Market", "text": "NYSE", "description": "Stock exchange"},
    {"type": "Market", "text": "S&P 500", "description": "Stock market index"},
    {"type": "Sector", "text": "Technology", "description": "Market sector"},
    {"type": "Sector", "text": "Financial", "description": "Market sector"},
    {"type": "Sector", "text": "Healthcare", "description": "Market sector"},
]

# Add seed data as entities
for item in seed_data:
    entity = {
        "id": item.get("text", "").lower().replace(" ", "_"),
        "text": item.get("text", ""),
        "name": item.get("text", ""),
        "type": item.get("type", ""),
        "description": item.get("description", ""),
        "source": "seed_data"
    }
    seed_manager.seed_data.entities.append(entity)

print(f"Loaded {len(seed_data)} seed data items for market foundation")
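
The loop above derives each seed entity's `id` by lowercasing the text and replacing spaces with underscores. A minimal sketch of that rule, plus a case-insensitive lookup index built on it, might look like this. The names `sample_seed`, `seed_id`, and `lookup_seed` are illustrative and not part of `SeedDataManager`.

```python
sample_seed = [
    {"type": "Market", "text": "NASDAQ", "description": "Stock exchange"},
    {"type": "Market", "text": "S&P 500", "description": "Stock market index"},
    {"type": "Sector", "text": "Technology", "description": "Market sector"},
]


def seed_id(text: str) -> str:
    """Same id rule as the seed loop: lowercase, spaces to underscores."""
    return text.lower().replace(" ", "_")


# Index seed items by id so later lookups avoid scanning the whole list
seed_index = {seed_id(item["text"]): item for item in sample_seed}


def lookup_seed(name: str):
    """Return the seed item matching `name` case-insensitively, or None."""
    return seed_index.get(seed_id(name))


print(seed_id("S&P 500"))                    # s&p_500
print(lookup_seed("nasdaq")["description"])  # Stock exchange
print(lookup_seed("Energy"))                 # None
```

Note that punctuation such as `&` is kept as-is by this rule, so ids are not guaranteed to be URL-safe.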


## Parsing Financial Documents


In [None]:
from semantica.parse import DocumentParser

parser = DocumentParser()

print(f"Parsing {len(documents)} documents...")
parsed_documents = []
for i, doc in enumerate(documents, 1):
    try:
        parsed = parser.parse(
            doc.content if hasattr(doc, 'content') else str(doc),
            content_type="text"
        )
        parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc)
    if i % 50 == 0 or i == len(documents):
        print(f"  Parsed {i}/{len(documents)} documents...")

documents = parsed_documents


## Normalizing and Chunking Financial Data


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
# Use recursive chunking for financial documents
splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

print(f"Normalizing {len(documents)} documents...")
normalized_documents = []
for i, doc in enumerate(documents, 1):
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        normalize_numbers=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)
    if i % 50 == 0 or i == len(documents):
        print(f"  Normalized {i}/{len(documents)} documents...")

print(f"Chunking {len(normalized_documents)} documents...")
chunked_documents = []
for i, doc_text in enumerate(normalized_documents, 1):
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    except Exception:
        # Retry without stderr suppression so a persistent splitter error is visible
        chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    if i % 50 == 0 or i == len(normalized_documents):
        print(f"  Chunked {i}/{len(normalized_documents)} documents ({len(chunked_documents)} chunks so far)")

print(f"Created {len(chunked_documents)} chunks from {len(normalized_documents)} documents")
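
To make the effect of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a plain sliding-window chunker. It is a deliberately simpler technique than the recursive method configured above (it ignores sentence and paragraph boundaries), and the function name is hypothetical; it only shows how size and overlap interact.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split text into fixed-size windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_text = "".join(str(i % 10) for i in range(2500))
pieces = sliding_window_chunks(sample_text)
print([len(p) for p in pieces])  # [1000, 1000, 900]
```

With a size of 1000 and an overlap of 200, each new chunk starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.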


## Extracting Financial Entities


In [None]:
from semantica.semantic_extract import NERExtractor

entity_extractor = NERExtractor(
    method="ml",
    model="en_core_web_sm"
)

all_entities = []
print(f"Extracting entities from {len(chunked_documents)} chunks...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(chunk_text)
        all_entities.extend(entities)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_entities)} entities found)")

# Categorize entities using spaCy's standard types (ORG, GPE, MONEY, etc.)
companies = [e for e in all_entities if e.label in ["ORG", "ORGANIZATION"]]
markets = [e for e in all_entities if e.label in ["GPE", "LOCATION", "LOC"]]
prices = [e for e in all_entities if e.label in ["MONEY", "CURRENCY"]]
metrics = [e for e in all_entities if e.label in ["CARDINAL", "QUANTITY", "PERCENT", "PERCENTAGE"]]

print(f"Extracted {len(companies)} companies/organizations, {len(markets)} markets/locations, {len(prices)} prices, {len(metrics)} metrics")
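
The four list comprehensions above can also be expressed as a single label-to-category mapping, which makes the bucketing rules easier to audit. The sketch below uses a small stand-in dataclass rather than the extractor's real entity type, and the category names are illustrative. Keep in mind that spaCy's `GPE` and `LOC` labels mark geopolitical and geographic names, so treating them as "markets" is only a loose proxy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SampleEntity:
    """Stand-in for the extractor's entity objects (text plus label)."""
    text: str
    label: str


# Assumed mapping from spaCy-style labels to the notebook's categories
LABEL_TO_CATEGORY = {
    "ORG": "company",
    "GPE": "market",
    "LOC": "market",
    "MONEY": "price",
    "CARDINAL": "metric",
    "QUANTITY": "metric",
    "PERCENT": "metric",
}


def bucket_entities(entities):
    """Group entities by category; unmapped labels land in 'other'."""
    buckets = defaultdict(list)
    for ent in entities:
        buckets[LABEL_TO_CATEGORY.get(ent.label, "other")].append(ent)
    return dict(buckets)


sample_entities = [
    SampleEntity("Apple", "ORG"),
    SampleEntity("$150.25", "MONEY"),
    SampleEntity("50M", "CARDINAL"),
    SampleEntity("New York", "GPE"),
    SampleEntity("Monday", "DATE"),
]
sample_buckets = bucket_entities(sample_entities)
print({category: len(items) for category, items in sample_buckets.items()})
```

Keeping an explicit `other` bucket makes it visible how many extracted entities fall outside the four categories instead of silently dropping them.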


## Extracting Financial Relationships


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method="dependency",
    model="en_core_web_sm"
)

all_relationships = []
print(f"Extracting relationships from {len(chunked_documents)} chunks...")
for i, chunk in enumerate(chunked_documents, 1):
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["trades_on", "has_price", "belongs_to", "correlates_with", "has_metric", "in_sector"]
        )
        all_relationships.extend(relationships)
    except Exception:
        continue
    
    if i % 20 == 0 or i == len(chunked_documents):
        print(f"  Processed {i}/{len(chunked_documents)} chunks ({len(all_relationships)} relationships found)")

print(f"Extracted {len(all_relationships)} relationships")


## Resolving Duplicate Companies and Stocks

### Conflict Detection

- **Temporal Conflict Detection**: Detects time-sensitive conflicts in financial data from multiple sources
- **Most Recent Strategy**: Resolves conflicts by prioritizing the latest market data


In [None]:
from semantica.conflicts import ConflictDetector, ConflictResolver

conflict_detector = ConflictDetector()
conflict_resolver = ConflictResolver()

# Convert Entity objects to dictionaries for conflict detection
entity_dicts = [
    {
        "id": e.text if hasattr(e, 'text') else str(e),
        "text": e.text if hasattr(e, 'text') else str(e),
        "name": e.text if hasattr(e, 'text') else str(e),
        "type": e.label if hasattr(e, 'label') else "ENTITY",
        "confidence": e.confidence if hasattr(e, 'confidence') else 1.0,
        "metadata": e.metadata if hasattr(e, 'metadata') else {},
        "source": e.metadata.get("source", "unknown") if hasattr(e, 'metadata') and isinstance(e.metadata, dict) else "unknown"
    }
    for e in all_entities
]

print(f"Detecting temporal conflicts in {len(entity_dicts)} entities...")
conflicts = conflict_detector.detect_temporal_conflicts(entity_dicts)

print(f"Detected {len(conflicts)} temporal conflicts")

if conflicts:
    print(f"Resolving conflicts using most_recent strategy...")
    resolved = conflict_resolver.resolve_conflicts(
        conflicts,
        strategy="most_recent"
    )
    print(f"Resolved {len(resolved)} conflicts")
else:
    print("No conflicts detected")


In [None]:
from semantica.kg import EntityResolver
from semantica.semantic_extract import Entity

# Convert Entity objects to dictionaries for EntityResolver
print(f"Converting {len(all_entities)} entities to dictionaries...")
entity_dicts = [{"name": e.text, "type": e.label, "start_char": getattr(e, 'start_char', 0), "end_char": getattr(e, 'end_char', 0), "confidence": e.confidence} for e in all_entities]

# Use EntityResolver class to resolve duplicates
entity_resolver = EntityResolver(strategy="fuzzy", similarity_threshold=0.85)

print(f"Resolving duplicates in {len(entity_dicts)} entities...")
resolved_entities = entity_resolver.resolve_entities(entity_dicts)

# Convert back to Entity objects
print(f"Converting {len(resolved_entities)} resolved entities back to Entity objects...")
merged_entities = [
    Entity(text=e["name"], label=e["type"], start_char=e.get("start_char", 0), end_char=e.get("end_char", 0), confidence=e.get("confidence", 1.0))
    for e in resolved_entities
]

# Enhance entities with seed data information
for entity in merged_entities:
    for seed_item in seed_data:
        if entity.text.lower() == seed_item["text"].lower():
            entity.description = seed_item.get("description", "")
            break

print(f"Deduplicated {len(entity_dicts)} entities to {len(merged_entities)} unique entities")
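
To get a feel for what a fuzzy threshold of 0.85 does, the sketch below deduplicates names with the standard library's `difflib.SequenceMatcher`. This is a stand-in technique chosen for illustration; the notebook does not state which similarity measure `EntityResolver` uses internally, and the function names here are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe_names(names, threshold=0.85):
    """Keep the first spelling of each name; drop later near-duplicates."""
    canonical = []
    for name in names:
        if not any(similarity(name, kept) >= threshold for kept in canonical):
            canonical.append(name)
    return canonical


sample_names = ["Microsoft", "microsoft", "Microsoft Corp", "Goldman Sachs", "Goldman Sachs."]
print(dedupe_names(sample_names))
# ['Microsoft', 'Microsoft Corp', 'Goldman Sachs']
```

In this measure "Microsoft Corp" scores about 0.78 against "Microsoft" and stays separate, while a trailing period is absorbed. Company suffixes such as "Corp" or "Inc." may therefore need explicit normalization or a lower threshold.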


## Building Financial Knowledge Graph


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder()

print(f"Building knowledge graph...")
kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence} for e in merged_entities],
    "relationships": [{"source": r.subject.text, "target": r.object.text, "type": r.predicate, "confidence": r.confidence} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")


## Generating Embeddings for Companies and Stocks


In [None]:
from semantica.embeddings import EmbeddingGenerator

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

print(f"Generating embeddings for {len(companies)} companies and {len(markets)} markets...")
company_texts = [c.text for c in companies]
company_embeddings = embedding_gen.generate_embeddings(company_texts)

market_texts = [m.text for m in markets]
market_embeddings = embedding_gen.generate_embeddings(market_texts)

print(f"Generated {len(company_embeddings)} company embeddings and {len(market_embeddings)} market embeddings")


## Populating Vector Store


In [None]:
from semantica.vector_store import VectorStore

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

print(f"Storing {len(company_embeddings)} company vectors and {len(market_embeddings)} market vectors...")
company_ids = vector_store.store_vectors(
    vectors=company_embeddings,
    metadata=[{"type": "company", "name": c.text, "label": c.label} for c in companies]
)

market_ids = vector_store.store_vectors(
    vectors=market_embeddings,
    metadata=[{"type": "market", "name": m.text, "label": m.label} for m in markets]
)

print(f"Stored {len(company_ids)} company vectors and {len(market_ids)} market vectors")
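
Conceptually, querying a vector store means scoring the query embedding against every stored vector and returning the best matches. The sketch below does this by brute force with cosine similarity on toy three-dimensional vectors (the real embeddings here have 384 dimensions). It is not how `VectorStore` or FAISS is implemented; the function names and sample data are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, vectors, metadata, k=2):
    """Return the k (score, name) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = []
    for vec, meta in zip(vectors, metadata):
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, meta["name"]))
    return sorted(scored, reverse=True)[:k]


toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_metadata = [{"name": "Apple"}, {"name": "Microsoft"}, {"name": "NYSE"}]

toy_hits = top_k([1.0, 0.0, 0.0], toy_vectors, toy_metadata)
print([name for _, name in toy_hits])  # ['Apple', 'Microsoft']
```

Storing metadata alongside each vector, as the cell above does, is what lets a hit be mapped back to a company or market name.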


## Analyzing Market Network Structure


In [None]:
from semantica.kg import GraphAnalyzer, CentralityCalculator

graph_analyzer = GraphAnalyzer()
centrality_calc = CentralityCalculator()

analysis = graph_analyzer.analyze_graph(kg)

degree_centrality = centrality_calc.calculate_degree_centrality(kg)
betweenness_centrality = centrality_calc.calculate_betweenness_centrality(kg)
closeness_centrality = centrality_calc.calculate_closeness_centrality(kg)

# Identify central entities in the market network
central_entities = []
for entity in kg.get("entities", []):
    entity_id = entity.get("id")
    if entity_id in degree_centrality:
        central_entities.append({
            "name": entity.get("text", "Unknown"),
            "type": entity.get("type", "Unknown"),
            "degree": degree_centrality[entity_id]
        })

central_entities.sort(key=lambda x: x['degree'], reverse=True)

print(f"Graph analytics:")
print(f"  - Graph density: {analysis.get('density', 0):.3f}")
print(f"  - Nodes with a degree centrality score: {len(degree_centrality)}")
print(f"  - Total entities: {len(kg.get('entities', []))}")
print(f"  - Total relationships: {len(kg.get('relationships', []))}")
if central_entities:
    print(f"\nTop 5 central entities:")
    for i, ent in enumerate(central_entities[:5], 1):
        print(f"  {i}. {ent['name']} ({ent['type']}) - Degree: {ent['degree']:.3f}")
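
For reference, degree centrality and density have simple closed forms on an undirected graph: a node's degree centrality is its degree divided by `n - 1`, and density is `2m / (n(n - 1))` for `n` nodes and `m` edges. The sketch below computes both from an edge list in plain Python. The toy edges are illustrative, and the notebook does not say whether `CentralityCalculator` uses exactly this normalization.

```python
from collections import defaultdict


def degree_centrality_and_density(edges):
    """Compute normalized degree centrality per node and overall graph density."""
    neighbors = defaultdict(set)
    for source, target in edges:
        if source != target:  # ignore self-loops
            neighbors[source].add(target)
            neighbors[target].add(source)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}, 0.0
    centrality = {node: len(adj) / (n - 1) for node, adj in neighbors.items()}
    edge_count = sum(len(adj) for adj in neighbors.values()) // 2
    density = 2 * edge_count / (n * (n - 1))
    return centrality, density


toy_edges = [
    ("AAPL", "NASDAQ"), ("MSFT", "NASDAQ"), ("GOOGL", "NASDAQ"),
    ("JPM", "NYSE"), ("AAPL", "Technology"), ("MSFT", "Technology"),
]
toy_centrality, toy_density = degree_centrality_and_density(toy_edges)

for node, score in sorted(toy_centrality.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{node}: {score:.3f}")
print(f"density: {toy_density:.3f}")
```

In this toy graph the exchange node has the highest score because three tickers attach to it, which is the kind of hub the "Top 5 central entities" listing is meant to surface. Nodes with no edges do not appear in an edge list and are left out of this calculation.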


## GraphRAG: Hybrid Vector + Graph Queries


In [None]:
from semantica.context import AgentContext
from semantica.llms import Groq
import os

context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=kg,
    max_expansion_hops=3,
    hybrid_alpha=0.7
)

# Initialize Groq LLM
llm = Groq(model="llama-3.1-8b-instant", api_key=os.getenv("GROQ_API_KEY"))

query = "What technology companies are in the market and what are their key relationships?"

print(f"{'='*80}")
print(f"GraphRAG Query: {query}")
print(f"{'='*80}\n")

# Use multi-hop reasoning with LLM generation
result = context.query_with_reasoning(
    query=query,
    llm_provider=llm,
    max_results=15,
    max_hops=3,
    min_score=0.2
)

print("=" * 80)
print("Generated Answer (with Multi-hop Reasoning):")
print("=" * 80)
response = result.get('response', 'No response generated')
print(response)
print("\n" + "=" * 80)

print(f"\nReasoning Details:")
print(f"- Confidence: {result.get('confidence', 0):.3f}")
print(f"- Sources: {result.get('num_sources', 0)}")
print(f"- Reasoning Paths: {result.get('num_reasoning_paths', 0)}")
print(f"- Total entities in graph: {len(kg.get('entities', []))}")
print(f"- Total relationships in graph: {len(kg.get('relationships', []))}")

if result.get('sources'):
    print(f"\nTop Sources:")
    for i, source in enumerate(result['sources'][:5], 1):
        content = source.get('content', '')[:200] if isinstance(source, dict) else str(source)[:200]
        score = source.get('score', 0) if isinstance(source, dict) else 0
        print(f"  {i}. Score: {score:.3f}")
        print(f"     {content}...")
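
The notebook does not spell out how `AgentContext` applies `hybrid_alpha`. One common convention for hybrid retrieval, shown here purely as an assumption, is a weighted sum in which `alpha` weights the vector-similarity score and `1 - alpha` weights a graph-based score. The candidate names and scores below are made up for illustration.

```python
def hybrid_score(vector_score: float, graph_score: float, alpha: float = 0.7) -> float:
    """Blend two scores; alpha=1.0 is vector-only, alpha=0.0 is graph-only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1 - alpha) * graph_score


# (vector_score, graph_score) per candidate, illustrative values only
toy_candidates = {
    "Apple": (0.82, 0.40),
    "NASDAQ": (0.55, 0.90),
    "JPMorgan": (0.30, 0.20),
}


def rank(candidates, alpha):
    """Order candidate names by blended score, best first."""
    return sorted(candidates, key=lambda name: hybrid_score(*candidates[name], alpha=alpha), reverse=True)


print(rank(toy_candidates, alpha=0.7))  # ['Apple', 'NASDAQ', 'JPMorgan']
print(rank(toy_candidates, alpha=0.3))  # ['NASDAQ', 'Apple', 'JPMorgan']
```

Under this reading, lowering the weight shifts the ranking toward well-connected graph nodes, while raising it favors candidates whose text is closest to the query.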


## Exporting Results


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="financial_data_kg.json", format="json")
exporter.export(kg, output_path="financial_data_kg.graphml", format="graphml")

print("Exported financial knowledge graph to JSON and GraphML formats")
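
The exporter calls above write JSON and GraphML. If a tabular copy of the relationships is also useful, a minimal sketch with the standard library's `csv` module could look like the following. It assumes the graph is a dict with a `relationships` list whose items carry `source`, `target`, `type`, and `confidence` keys, mirroring the structure passed to `GraphBuilder` earlier; the actual output of `GraphBuilder` may differ, and the function name is hypothetical.

```python
import csv
import io


def relationships_to_csv(graph: dict) -> str:
    """Flatten a graph's relationships into CSV text, one row per edge."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["source", "target", "type", "confidence"],
        extrasaction="ignore",  # skip any extra keys on a relationship
    )
    writer.writeheader()
    for rel in graph.get("relationships", []):
        writer.writerow(rel)
    return buffer.getvalue()


toy_kg = {
    "relationships": [
        {"source": "AAPL", "target": "NASDAQ", "type": "trades_on", "confidence": 0.9},
        {"source": "AAPL", "target": "Technology", "type": "in_sector", "confidence": 0.8},
    ]
}
csv_text = relationships_to_csv(toy_kg)
print(csv_text)
```

Writing the returned text to a file such as `financial_data_kg.csv` would complete the export.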
