[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/finance/01_Financial_Data_Integration_MCP.ipynb)

# Financial Data Integration (MCP) - Real-Time Market Data

## Overview

This notebook demonstrates **financial data integration with MCP servers**, covering **real-time data ingestion** and **multi-source financial knowledge graph (KG) construction**. The pipeline connects to Python/FastMCP servers to ingest market data, stock prices, and metrics into a financial knowledge graph.

### Key Features

- **MCP Integration**: Shows how to connect to MCP (Model Context Protocol) servers for data ingestion
- **Seed Data Management**: Uses foundation market data for entity resolution
- **Real-Time Data Ingestion**: Ingests live market data from MCP servers and APIs
- **Multi-Source Financial KG**: Builds comprehensive financial knowledge graphs from multiple sources
- **Market Network Analysis**: Analyzes market structure using graph analytics
- **Comprehensive Data Sources**: Multiple financial APIs, RSS feeds, and databases
- **Modular Architecture**: Uses Semantica modules directly, without the core orchestrator

### Learning Objectives

- Ingest financial data from MCP servers, APIs, and RSS feeds
- Use seed data for foundation market information
- Extract financial entities (Companies, Stocks, Prices, Metrics, Markets, Sectors)
- Build financial knowledge graphs with seed data integration
- Analyze market network structure using graph analytics
- Store and query financial data using vector stores

### Pipeline Flow

```mermaid
graph TD
    A[Data Ingestion] --> B[Seed Data Loading]
    B --> C[Document Parsing]
    C --> D[Text Processing]
    D --> E[Entity Extraction]
    E --> F[Relationship Extraction]
    F --> G[Deduplication]
    G --> H[Knowledge Graph]
    H --> I[Embeddings]
    I --> J[Vector Store]
    H --> K[Graph Analytics]
    K --> L[GraphRAG Queries]
    J --> L
    L --> M[Visualization]
    M --> N[Export]
```
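
The diagram above is a dependency graph, so one valid execution order can be derived mechanically. The sketch below encodes the diagram's edges by hand (stage names copied from the diagram) and uses the standard-library `graphlib` module; it is an illustration of the flow, not part of Semantica.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on
# (the arrows of the diagram, read in reverse).
PIPELINE_DEPENDENCIES = {
    "Seed Data Loading": {"Data Ingestion"},
    "Document Parsing": {"Seed Data Loading"},
    "Text Processing": {"Document Parsing"},
    "Entity Extraction": {"Text Processing"},
    "Relationship Extraction": {"Entity Extraction"},
    "Deduplication": {"Relationship Extraction"},
    "Knowledge Graph": {"Deduplication"},
    "Embeddings": {"Knowledge Graph"},
    "Vector Store": {"Embeddings"},
    "Graph Analytics": {"Knowledge Graph"},
    "GraphRAG Queries": {"Graph Analytics", "Vector Store"},
    "Visualization": {"GraphRAG Queries"},
    "Export": {"Visualization"},
}


def execution_order(dependencies):
    """Return one valid stage order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


stage_order = execution_order(PIPELINE_DEPENDENCIES)
print(" -> ".join(stage_order))
```

The two branches that leave "Knowledge Graph" (embeddings and graph analytics) are independent of each other and only rejoin at "GraphRAG Queries", so their relative order may vary.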

### Data Sources

#### Financial APIs
- **Alpha Vantage**: https://www.alphavantage.co/documentation/
- **Yahoo Finance API**: https://query1.finance.yahoo.com/v8/finance/
- **IEX Cloud**: https://iexcloud.io/docs/api/
- **Polygon.io**: https://polygon.io/docs
- **Finnhub**: https://finnhub.io/docs/api
- **Quandl**: https://www.quandl.com/tools/api

#### Financial RSS Feeds
- **Bloomberg RSS**: https://www.bloomberg.com/feeds/
- **Reuters Finance**: https://www.reuters.com/rssFeed/finance
- **Financial Times**: https://www.ft.com/?format=rss
- **MarketWatch**: https://www.marketwatch.com/rss
- **CNBC Finance**: https://www.cnbc.com/id/100003114/device/rss/rss.html
- **WSJ Markets**: https://feeds.a.dj.com/rss/RSSMarketsMain.xml

#### MCP Servers
- **Financial Data Servers**: MCP servers for market data
- **Market Data Servers**: Real-time market data via MCP
- **Stock Price Servers**: Live stock prices via MCP protocol
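
MCP is built on JSON-RPC 2.0, and resources are listed and read with the `resources/list` and `resources/read` methods. The sketch below only builds the request payloads; the helper name `mcp_request` is hypothetical, and the URI `resource://market_data` is the example URI used later in this notebook, not a resource any particular server is known to expose.

```python
import json
from itertools import count

# Monotonic request ids; JSON-RPC uses the id to pair a response with its request.
_request_ids = count(1)


def mcp_request(method, params=None):
    """Build a JSON-RPC 2.0 request dict for an MCP method (hypothetical helper)."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which resources it exposes, then read one by URI.
list_request = mcp_request("resources/list")
read_request = mcp_request("resources/read", {"uri": "resource://market_data"})

print(json.dumps(read_request, indent=2))
```

A real session starts with an `initialize` handshake before any resource calls, and a client library normally handles both the handshake and the transport.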

#### Database Links
- **SEC EDGAR**: https://www.sec.gov/edgar/searchedgar/companysearch.html
- **FRED (Federal Reserve)**: https://fred.stlouisfed.org/
- **Quandl**: https://www.quandl.com/
- **World Bank Data**: https://data.worldbank.org/
- **IMF Data**: https://www.imf.org/en/Data

---

## Installation


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


## Configuration & Setup


In [None]:
import os

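# Keep an exported GROQ_API_KEY if one exists. Otherwise replace the placeholder
# with a real key: the LLM extraction cells cannot authenticate with "your-key-here".
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key-here")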

# Configuration constants
EMBEDDING_DIMENSION = 384
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
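
Because the setup cell falls back to a placeholder key, a missing key only surfaces later as a failed LLM call. A minimal sketch of an early check is shown below; the helper name `require_api_key` is hypothetical and not part of Semantica.

```python
import os

PLACEHOLDER = "your-key-here"  # the placeholder default used in the setup cell


def require_api_key(name, env=None):
    """Return the key stored under `name`, or raise if it is missing or a placeholder."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value or value == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; export it before running the LLM cells")
    return value


# Usage (not executed here): groq_key = require_api_key("GROQ_API_KEY")
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.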


## Ingesting Financial Data from Multiple Sources


In [None]:
from semantica.ingest import MCPIngestor, WebIngestor, FeedIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

feed_sources = [
    # Financial RSS Feeds
    ("Reuters Finance", "https://www.reuters.com/rssFeed/finance"),
    ("MarketWatch", "https://www.marketwatch.com/rss"),
    ("CNBC Finance", "https://www.cnbc.com/id/100003114/device/rss/rss.html"),
]

feed_ingestor = FeedIngestor()
all_documents = []

for feed_name, feed_url in feed_sources:
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                if not hasattr(item, 'metadata'):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
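    except Exception as exc:
        # Report the failure so an empty result set can be traced back to its feed
        print(f"Skipped feed '{feed_name}': {exc}")
        continue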

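# Example: Ingest from the Alpha Vantage API. The "demo" key below is meant for
# Alpha Vantage's documentation examples; substitute your own API key for real use.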
alpha_vantage_api = "https://www.alphavantage.co/query?function=GLOBAL_QUOTE&symbol=AAPL&apikey=demo"
try:
    web_ingestor = WebIngestor()
    with redirect_stderr(StringIO()):
        api_documents = web_ingestor.ingest(alpha_vantage_api, method="url")
    for doc in api_documents:
        if not hasattr(doc, 'metadata'):
            doc.metadata = {}
        doc.metadata['source'] = 'Alpha Vantage API'
        all_documents.append(doc)
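except Exception as exc:
    # Report the failure instead of hiding it; the pipeline continues with other sources
    print(f"Skipped Alpha Vantage API: {exc}")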

# MCP server connection example (commented out because it needs a running MCP server at this URL)
# mcp_ingestor = MCPIngestor()
# mcp_ingestor.connect("financial_server", url="http://localhost:8000/mcp")
# resources = mcp_ingestor.list_available_resources("financial_server")
# mcp_data = mcp_ingestor.ingest_resources("financial_server", resource_uris=["resource://market_data"])

if not all_documents:
    market_data = """
    AAPL stock price: $150.25, market cap: $2.4T, volume: 50M shares, sector: Technology
    MSFT stock price: $380.50, market cap: $2.8T, volume: 30M shares, sector: Technology
    GOOGL stock price: $140.75, market cap: $1.8T, volume: 25M shares, sector: Technology
    JPM stock price: $145.30, market cap: $420B, volume: 15M shares, sector: Financial
    """
    with open("data/market_data.txt", "w") as f:
        f.write(market_data)
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/market_data.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")
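
The fallback sample text follows a fixed line format, so it can also be turned into structured records without an LLM. The sketch below assumes exactly that format (two sample lines are copied from the cell above); the regular expression and the field names are illustrative choices, not a Semantica API.

```python
import re

SAMPLE_MARKET_DATA = """
AAPL stock price: $150.25, market cap: $2.4T, volume: 50M shares, sector: Technology
JPM stock price: $145.30, market cap: $420B, volume: 15M shares, sector: Financial
"""

# One named group per field of the sample line format.
LINE_PATTERN = re.compile(
    r"(?P<ticker>[A-Z]+) stock price: \$(?P<price>[\d.]+), "
    r"market cap: \$(?P<cap>[\d.]+)(?P<cap_unit>[TBM]), "
    r"volume: (?P<volume>[\d.]+)(?P<volume_unit>[BM]) shares, "
    r"sector: (?P<sector>\w+)"
)
# Suffix multipliers: trillion, billion, million.
MULTIPLIERS = {"T": 1e12, "B": 1e9, "M": 1e6}


def parse_market_lines(text):
    """Extract one record per line that matches the sample format."""
    records = []
    for match in LINE_PATTERN.finditer(text):
        records.append({
            "ticker": match["ticker"],
            "price": float(match["price"]),
            "market_cap": float(match["cap"]) * MULTIPLIERS[match["cap_unit"]],
            "volume": float(match["volume"]) * MULTIPLIERS[match["volume_unit"]],
            "sector": match["sector"],
        })
    return records


records = parse_market_lines(SAMPLE_MARKET_DATA)
for record in records:
    print(record["ticker"], record["price"], record["sector"])
```

Lines that do not match the pattern are skipped silently, which is acceptable for a fixed sample but would need logging for real feeds.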


In [None]:
from semantica.seed import SeedDataManager

seed_manager = SeedDataManager()

# Load foundation market data (exchanges, indices, sectors)
seed_data = [
    {"type": "Market", "text": "NASDAQ", "description": "Stock exchange"},
    {"type": "Market", "text": "NYSE", "description": "Stock exchange"},
    {"type": "Market", "text": "S&P 500", "description": "Stock market index"},
    {"type": "Sector", "text": "Technology", "description": "Market sector"},
    {"type": "Sector", "text": "Financial", "description": "Market sector"},
    {"type": "Sector", "text": "Healthcare", "description": "Market sector"},
]

seed_manager.load_seed_data(seed_data)

print(f"Loaded {len(seed_data)} seed data items for market foundation")
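
Seed records are matched against extracted entities by name later in the notebook. One way to make that match cheap is a case-insensitive index, sketched below with two records copied from the cell above; the helper names are hypothetical and independent of `SeedDataManager`.

```python
SEED_DATA = [
    {"type": "Market", "text": "NASDAQ", "description": "Stock exchange"},
    {"type": "Sector", "text": "Technology", "description": "Market sector"},
]


def build_seed_index(seed_items):
    """Map case-folded seed text to its record for constant-time lookups."""
    return {item["text"].casefold(): item for item in seed_items}


def lookup_description(name, seed_index, default=""):
    """Return the seed description for `name`, ignoring case and outer whitespace."""
    item = seed_index.get(name.strip().casefold())
    return item.get("description", default) if item else default


seed_index = build_seed_index(SEED_DATA)
print(lookup_description("nasdaq", seed_index))  # prints: Stock exchange
```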


## Parsing Financial Documents


In [None]:
from semantica.parse import DocumentParser

parser = DocumentParser()

parsed_documents = []
for doc in documents:
    try:
        parsed = parser.parse(
            doc.content if hasattr(doc, 'content') else str(doc),
            content_type="text"
        )
        parsed_documents.append(parsed)
    except Exception:
        parsed_documents.append(doc)

documents = parsed_documents


## Normalizing and Chunking Financial Data


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
# Use recursive chunking for financial documents
splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

normalized_documents = []
for doc in documents:
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        normalize_numbers=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)

chunked_documents = []
for doc_text in normalized_documents:
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
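    except Exception:
        # Fallback: fixed-size character windows with the same size and overlap.
        # Retrying with an identically configured TextSplitter would fail the same way.
        step = CHUNK_SIZE - CHUNK_OVERLAP
        chunked_documents.extend(
            doc_text[start:start + CHUNK_SIZE] for start in range(0, len(doc_text), step)
        )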
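
How `CHUNK_SIZE` and `CHUNK_OVERLAP` interact can be sketched with a plain sliding window. This is a simplification, not the `recursive` method used above (which presumably also respects text boundaries), but the size and overlap arithmetic is the same idea.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "x" * 2500
chunks = sliding_window_chunks(sample)
print([len(chunk) for chunk in chunks])  # prints: [1000, 1000, 900]
```

With a size of 1000 and an overlap of 200, each window advances 800 characters, so a 2500-character document yields three chunks.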


## Extracting Financial Entities


In [None]:
from semantica.semantic_extract import NERExtractor

entity_extractor = NERExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

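all_entities = []
failed_chunks = 0
for chunk in chunked_documents:
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(
            chunk_text,
            entity_types=["Company", "Stock", "Price", "Metric", "Market", "Sector"]
        )
        all_entities.extend(entities)
    except Exception:
        # Count failures (for example, an invalid API key) instead of hiding them
        failed_chunks += 1
        continue

if failed_chunks:
    print(f"Entity extraction failed for {failed_chunks} of {len(chunked_documents)} chunks")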

def has_label(entity, *keywords):
    """Case-insensitive substring match on the entity label."""
    label = entity.label.lower()
    return any(keyword in label for keyword in keywords)

companies = [e for e in all_entities if has_label(e, "company", "stock")]
markets = [e for e in all_entities if has_label(e, "market")]
sectors = [e for e in all_entities if has_label(e, "sector")]

print(f"Extracted {len(companies)} companies/stocks, {len(markets)} markets, {len(sectors)} sectors")


## Extracting Financial Relationships


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_relationships = []
failed_chunks = 0
for chunk in chunked_documents:
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["trades_on", "has_price", "belongs_to", "correlates_with", "has_metric", "in_sector"]
        )
        all_relationships.extend(relationships)
    except Exception:
        # Count failures instead of hiding them
        failed_chunks += 1
        continue

if failed_chunks:
    print(f"Relationship extraction failed for {failed_chunks} of {len(chunked_documents)} chunks")

print(f"Extracted {len(all_relationships)} relationships")


## Resolving Duplicate Companies and Stocks


In [None]:
from semantica.deduplication import DuplicateDetector

duplicate_detector = DuplicateDetector(
    similarity_threshold=0.85,
    method="semantic"
)

# Detect groups of near-duplicate entities, then merge each group into one entity
duplicate_groups = duplicate_detector.detect_duplicates(all_entities)
merged_entities = duplicate_detector.merge_duplicates(duplicate_groups)

# Enrich merged entities with descriptions from matching seed data
for entity in merged_entities:
    for seed_item in seed_data:
        if entity.text.lower() == seed_item["text"].lower():
            # getattr avoids an AttributeError when the entity has no description yet
            entity.description = seed_item.get("description", getattr(entity, "description", None))
            break

print(f"Deduplicated {len(all_entities)} entities to {len(merged_entities)} unique entities")
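
The role of `similarity_threshold=0.85` can be illustrated with a much simpler stand-in. The sketch below swaps the semantic similarity used above for character-level similarity from the standard-library `difflib`, and groups names greedily against the first member of each group; both choices are assumptions made for illustration, not how `DuplicateDetector` works internally.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_duplicates(names, threshold=0.85):
    """Greedily assign each name to the first group whose representative is similar enough."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no group matched: start a new one
    return groups


names = ["Apple Inc.", "Apple Inc", "Microsoft", "Microsoft Corp", "JPMorgan"]
groups = group_duplicates(names)
for group in groups:
    print(group)
```

"Apple Inc." and "Apple Inc" are merged, while "Microsoft" and "Microsoft Corp" score about 0.78 and stay separate. Gaps like this are one reason to prefer embedding-based similarity for company names.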


## Building Financial Knowledge Graph


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=True,
    resolve_conflicts=True,
    entity_resolution_strategy="fuzzy"
)

kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence} for e in merged_entities],
    "relationships": [{"source": r.source, "target": r.target, "type": r.label, "confidence": r.confidence} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")


## Generating Embeddings for Companies and Stocks


In [None]:
from semantica.embeddings import EmbeddingGenerator

embedding_gen = EmbeddingGenerator(
    provider="sentence_transformers",
    model=EMBEDDING_MODEL
)

company_texts = [f"{c.text} {getattr(c, 'description', '')}" for c in companies]
company_embeddings = embedding_gen.generate_embeddings(company_texts)

market_texts = [f"{m.text} {getattr(m, 'description', '')}" for m in markets]
market_embeddings = embedding_gen.generate_embeddings(market_texts)

print(f"Generated {len(company_embeddings)} company embeddings and {len(market_embeddings)} market embeddings")


## Populating Vector Store


In [None]:
from semantica.vector_store import VectorStore

vector_store = VectorStore(backend="faiss", dimension=EMBEDDING_DIMENSION)

company_ids = vector_store.store_vectors(
    vectors=company_embeddings,
    metadata=[{"type": "company", "name": c.text, "label": c.label} for c in companies]
)

market_ids = vector_store.store_vectors(
    vectors=market_embeddings,
    metadata=[{"type": "market", "name": m.text, "label": m.label} for m in markets]
)

print(f"Stored {len(company_ids)} company vectors and {len(market_ids)} market vectors")
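
Conceptually, a similarity search over stored vectors ranks them against a query vector. The sketch below does this by brute force with cosine similarity in pure Python, using toy 3-dimensional vectors instead of the 384-dimensional embeddings above; it illustrates the idea only and says nothing about how the FAISS backend is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, vectors, metadata, k=2):
    """Return the k (score, metadata) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, v), m) for v, m in zip(vectors, metadata)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors and names, for illustration only.
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
metadata = [{"name": "AAPL"}, {"name": "MSFT"}, {"name": "JPM"}]

hits = search([1.0, 0.0, 0.0], vectors, metadata)
for score, item in hits:
    print(f"{item['name']}: {score:.3f}")
```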


## Analyzing Market Network Structure


In [None]:
from semantica.kg import GraphAnalyzer, CentralityCalculator

graph_analyzer = GraphAnalyzer()
centrality_calc = CentralityCalculator()

analysis = graph_analyzer.analyze_graph(kg)

degree_centrality = centrality_calc.calculate_degree_centrality(kg)
betweenness_centrality = centrality_calc.calculate_betweenness_centrality(kg)
closeness_centrality = centrality_calc.calculate_closeness_centrality(kg)

# Identify central companies/stocks in the market network
central_companies = []
for entity in kg.get("entities", []):
    if entity.get("type") in ["Company", "Stock"]:
        entity_id = entity.get("id")
        if entity_id in degree_centrality:
            central_companies.append({
                "name": entity.get("text", "Unknown"),
                "degree": degree_centrality[entity_id]
            })

central_companies.sort(key=lambda x: x['degree'], reverse=True)

print("Graph analytics:")
print(f"  - Graph density: {analysis.get('density', 0):.3f}")
print(f"  - Nodes scored by degree centrality: {len(degree_centrality)}")
print(f"  - Top central companies: {[c['name'] for c in central_companies[:5]]}")
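
The two headline metrics have simple definitions for an undirected graph without self-loops: degree centrality is a node's degree divided by `n - 1`, and density is `2m / (n(n - 1))` for `n` nodes and `m` edges. The sketch below computes both on a toy edge list (tickers and markets chosen for illustration); whether `GraphAnalyzer` treats the graph as directed or undirected is not shown in this notebook, so the undirected case is an assumption.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets built from (source, target) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree, n - 1."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def density(adjacency):
    """Fraction of possible undirected edges that are present."""
    n = len(adjacency)
    edge_count = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    return 2 * edge_count / (n * (n - 1)) if n > 1 else 0.0


edges = [
    ("AAPL", "NASDAQ"), ("MSFT", "NASDAQ"), ("JPM", "NYSE"),
    ("AAPL", "Technology"), ("MSFT", "Technology"),
]
adjacency = build_adjacency(edges)
centrality = degree_centrality(adjacency)

print(f"density: {density(adjacency):.3f}")  # prints: density: 0.333
print(f"AAPL degree centrality: {centrality['AAPL']:.2f}")  # prints: AAPL degree centrality: 0.40
```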


## GraphRAG: Hybrid Vector + Graph Queries


In [None]:
from semantica.context import AgentContext

context = AgentContext(vector_store=vector_store, knowledge_graph=kg)

query = "What technology companies are in the market?"
results = context.retrieve(
    query,
    max_results=10,
    use_graph=True,
    expand_graph=True,
    include_entities=True,
    include_relationships=True
)

print(f"GraphRAG query: '{query}'")
print(f"\nRetrieved {len(results)} results:\n")
for i, result in enumerate(results[:5], 1):
    print(f"{i}. Score: {result.get('score', 0):.3f}")
    print(f"   Content: {result.get('content', '')[:200]}...")
    if result.get('related_entities'):
        print(f"   Related entities: {len(result['related_entities'])}")
    print()
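
The hybrid idea behind GraphRAG can be sketched in two steps: take the top vector hits, then attach each hit's graph neighbors as context. The example below assumes the hits and the adjacency map are already available; the scores are made-up numbers, and the result keys mirror the ones read in the cell above (`score`, `content`, `related_entities`). How `AgentContext.retrieve` combines the two sources internally is not shown here.

```python
def expand_with_graph(hits, adjacency, max_neighbors=3):
    """Attach up to max_neighbors one-hop graph neighbors to each vector hit."""
    results = []
    for name, score in hits:
        neighbors = sorted(adjacency.get(name, ()))[:max_neighbors]
        results.append({"content": name, "score": score, "related_entities": neighbors})
    return results


# Toy graph and illustrative similarity scores.
adjacency = {
    "AAPL": {"NASDAQ", "Technology"},
    "MSFT": {"NASDAQ", "Technology"},
    "JPM": {"NYSE", "Financial"},
}
vector_hits = [("AAPL", 0.91), ("MSFT", 0.88)]

results = expand_with_graph(vector_hits, adjacency)
for result in results:
    print(result["content"], result["score"], result["related_entities"])
```

Sorting the neighbors keeps the output deterministic; a real system would more likely rank them by edge type or confidence.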


## Visualizing the Financial Knowledge Graph


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()
visualizer.visualize(
    kg,
    output_path="financial_data_kg.html",
    layout="spring",
    node_size=20
)

print("Visualization saved to financial_data_kg.html")


## Exporting Results


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="financial_data_kg.json", format="json")
exporter.export(kg, output_path="financial_data_kg.graphml", format="graphml")
exporter.export(kg, output_path="financial_data_market.csv", format="csv")

print("Exported financial knowledge graph to JSON, GraphML, and CSV formats")
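
For downstream tools that only need plain files, the same information can be serialized with the standard library. The sketch below assumes the graph is a dict with `entities` and `relationships` lists, as the `kg.get(...)` calls earlier suggest; the sample records and field names are illustrative, and the output layout is not claimed to match what `GraphExporter` writes.

```python
import csv
import io
import json

# Illustrative graph in the assumed dict-of-lists shape.
sample_kg = {
    "entities": [
        {"id": "e1", "text": "AAPL", "type": "Stock"},
        {"id": "e2", "text": "NASDAQ", "type": "Market"},
    ],
    "relationships": [
        {"source": "e1", "target": "e2", "type": "trades_on"},
    ],
}


def to_json(graph):
    """Serialize the whole graph as pretty-printed JSON."""
    return json.dumps(graph, indent=2, sort_keys=True)


def relationships_to_csv(graph):
    """Serialize relationships as an edge list with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["source", "target", "type"], extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(graph["relationships"])
    return buffer.getvalue()


json_text = to_json(sample_kg)
csv_text = relationships_to_csv(sample_kg)
print(csv_text)
```

`extrasaction="ignore"` drops fields such as confidence scores that are not in the header, which keeps the edge list stable if relationships gain extra attributes.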
