[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/finance/04_Investment_Analysis_Hybrid_RAG.ipynb)

# Investment Analysis Hybrid RAG Pipeline

## Overview

This notebook demonstrates a complete hybrid RAG pipeline for investment analysis. It ingests investment data from multiple sources (market data APIs, financial feeds, databases), extracts investment entities, builds a knowledge graph, generates embeddings, sets up hybrid search (vector + temporal KG), and queries the result for investment insights.


**Documentation**: [API Reference](https://semantica.readthedocs.io/use-cases/)

## Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install semantica[all]
```
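
Before running the cells below, it can help to confirm that the package is actually visible to the current interpreter. The following is a minimal sketch that uses only the standard library's `importlib.metadata`; the helper name `installed_version` is introduced here for illustration and is not part of Semantica.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("semantica")
print(f"semantica version: {found}" if found else "semantica is not installed")
```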

### Modules Used (20+)

- **Ingestion**: FileIngestor, WebIngestor, FeedIngestor, StreamIngestor, DBIngestor, EmailIngestor, RepoIngestor, MCPIngestor
- **Parsing**: JSONParser, CSVParser, StructuredDataParser, HTMLParser
- **Extraction**: NERExtractor, RelationExtractor, EventDetector, SemanticAnalyzer
- **KG**: GraphBuilder, TemporalGraphQuery, GraphAnalyzer, ConnectivityAnalyzer
- **Embeddings**: EmbeddingGenerator, TextEmbedder
- **Vector Store**: VectorStore, HybridSearch
- **Context**: ContextRetriever, ContextGraphBuilder
- **Reasoning**: InferenceEngine, RuleManager, ExplanationGenerator
- **Export**: JSONExporter, CSVExporter, RDFExporter, ReportGenerator
- **Visualization**: KGVisualizer, TemporalVisualizer, AnalyticsVisualizer

### Pipeline

**Multi-Source Investment Data → Parse → Extract Entities → Build KG → Generate Embeddings → Vector Store → Hybrid RAG Setup → Query Insights → Generate Reports → Visualize**

---

## Step 1: Multi-Source Investment Data Ingestion

Ingest investment data from market APIs, financial feeds, and databases.


In [None]:
from semantica.ingest import FileIngestor, WebIngestor, DBIngestor, FeedIngestor
from semantica.parse import JSONParser, CSVParser, StructuredDataParser, HTMLParser
from semantica.semantic_extract import NERExtractor, RelationExtractor, EventDetector, SemanticAnalyzer
from semantica.kg import GraphBuilder, TemporalGraphQuery, GraphAnalyzer, ConnectivityAnalyzer
from semantica.embeddings import EmbeddingGenerator, TextEmbedder
from semantica.vector_store import VectorStore, HybridSearch
from semantica.context import ContextRetriever, ContextGraphBuilder
from semantica.reasoning import InferenceEngine, RuleManager, ExplanationGenerator
from semantica.export import JSONExporter, CSVExporter, RDFExporter, ReportGenerator
from semantica.visualization import KGVisualizer, TemporalVisualizer, AnalyticsVisualizer
import tempfile
import os
import json
from datetime import datetime, timedelta

file_ingestor = FileIngestor()
web_ingestor = WebIngestor()
db_ingestor = DBIngestor()
feed_ingestor = FeedIngestor()

json_parser = JSONParser()
csv_parser = CSVParser()
structured_parser = StructuredDataParser()
html_parser = HTMLParser()

# Real investment data sources
investment_apis = [
    "https://api.polygon.io/v2/aggs/ticker/AAPL/range/1/day/2024-01-01/2024-01-31",  # Polygon.io daily aggregates (requires an API key)
    "https://www.alphavantage.co/query?function=OVERVIEW&symbol=AAPL&apikey=demo",  # Alpha Vantage company overview
    "https://api.github.com/repos/ranaroussi/yfinance"  # GitHub API metadata for the yfinance repository (not a market data endpoint)
]

financial_feeds = [
    "https://feeds.reuters.com/reuters/businessNews",
    "https://feeds.reuters.com/reuters/topNews",
    "https://rss.cnn.com/rss/money_latest.rss",
    "https://feeds.bloomberg.com/markets/news.rss"
]

# Example database connection (placeholder credentials); defined for reference and not queried in this cell
db_connection_string = "postgresql://user:password@localhost:5432/investment_db"
db_query = "SELECT symbol, company_name, sector, market_cap, pe_ratio, dividend_yield FROM investments WHERE last_updated > NOW() - INTERVAL '7 days' ORDER BY market_cap DESC"

temp_dir = tempfile.mkdtemp()

# Sample investment data
investment_data_file = os.path.join(temp_dir, "investment_data.json")
investment_data = [
    {
        "symbol": "AAPL",
        "company": "Apple Inc.",
        "sector": "Technology",
        "market_cap": 2800000000000,
        "pe_ratio": 28.5,
        "dividend_yield": 0.5,
        "price": 175.50,
        "timestamp": (datetime.now() - timedelta(days=1)).isoformat()
    },
    {
        "symbol": "MSFT",
        "company": "Microsoft Corporation",
        "sector": "Technology",
        "market_cap": 2800000000000,
        "pe_ratio": 32.1,
        "dividend_yield": 0.7,
        "price": 380.25,
        "timestamp": (datetime.now() - timedelta(days=1)).isoformat()
    }
]

with open(investment_data_file, 'w') as f:
    json.dump(investment_data, f, indent=2)

file_objects = file_ingestor.ingest_file(investment_data_file, read_content=True)
parsed_data = structured_parser.parse_json(investment_data_file)

# Ingest from financial feeds
financial_feed_list = []
for feed_url in financial_feeds[:2]:
    feed_data = feed_ingestor.ingest_feed(feed_url)
    if feed_data:
        financial_feed_list.append(feed_data)
        print(f"  Ingested feed: {feed_url}")

# Ingest from investment APIs
api_content_list = []
for api_url in investment_apis[:1]:
    api_content = web_ingestor.ingest_url(api_url)
    if api_content:
        api_content_list.append(api_content)
        print(f"  Ingested API: {api_url}")

print("\n📊 Ingestion Summary:")
print(f"  Investment data files: {1 if file_objects else 0}")
print(f"  Financial feeds: {len(financial_feed_list)}")
print(f"  Investment APIs: {len(api_content_list)}")
print("  Database sources configured: 1 (connection string only; not queried here)")
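
The cell above defines `db_connection_string` and `db_query` but never runs the query. As a rough illustration of what that query selects, here is a minimal sketch that uses the standard library's `sqlite3` with an in-memory database as a stand-in for PostgreSQL. The table schema is inferred from the column names in `db_query`, the `DEMO` row is hypothetical, and `fetch_recent_investments` is a helper introduced here, not a `DBIngestor` method.

```python
import sqlite3
from datetime import datetime, timedelta


def fetch_recent_investments(conn, days=7):
    """Return rows updated within the last `days` days, largest market cap first."""
    cutoff = (datetime.now() - timedelta(days=days)).isoformat(timespec="seconds")
    cursor = conn.execute(
        "SELECT symbol, company_name, sector, market_cap, pe_ratio, dividend_yield "
        "FROM investments WHERE last_updated > ? ORDER BY market_cap DESC",
        (cutoff,),
    )
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory SQLite database standing in for the PostgreSQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investments (symbol TEXT, company_name TEXT, sector TEXT, "
    "market_cap REAL, pe_ratio REAL, dividend_yield REAL, last_updated TEXT)"
)
now = datetime.now()
fresh = (now - timedelta(days=1)).isoformat(timespec="seconds")
stale = (now - timedelta(days=30)).isoformat(timespec="seconds")
conn.executemany(
    "INSERT INTO investments VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("AAPL", "Apple Inc.", "Technology", 2.8e12, 28.5, 0.5, fresh),
        ("MSFT", "Microsoft Corporation", "Technology", 2.8e12, 32.1, 0.7, fresh),
        # Hypothetical stale row, included only to show the 7-day filter at work.
        ("DEMO", "Placeholder Corp", "Utilities", 1.0e9, 10.0, 3.0, stale),
    ],
)

recent_rows = fetch_recent_investments(conn)
for row in recent_rows:
    print(row["symbol"], row["sector"], row["market_cap"])
```

The cutoff is passed as a bound parameter rather than formatted into the SQL string, which keeps the query safe if the window ever comes from user input.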


## Step 2: Extract Investment Entities and Build Knowledge Graph

Extract investment entities and relationships, then build the knowledge graph from them.


In [None]:
ner_extractor = NERExtractor()
relation_extractor = RelationExtractor()
event_detector = EventDetector()
semantic_analyzer = SemanticAnalyzer()

investment_entities = []
investment_relationships = []
all_documents = []

# Extract from investment data
if parsed_data and parsed_data.data:
    for investment in parsed_data.data if isinstance(parsed_data.data, list) else [parsed_data.data]:
        if isinstance(investment, dict):
            investment_text = f"{investment.get('company', '')} ({investment.get('symbol', '')}) in {investment.get('sector', '')} sector"
            all_documents.append(investment_text)
            
            investment_entities.append({
                "id": investment.get("symbol", ""),
                "type": "Stock",
                "name": investment.get("company", ""),
                "properties": {
                    "symbol": investment.get("symbol", ""),
                    "sector": investment.get("sector", ""),
                    "market_cap": investment.get("market_cap", 0),
                    "pe_ratio": investment.get("pe_ratio", 0),
                    "dividend_yield": investment.get("dividend_yield", 0),
                    "price": investment.get("price", 0),
                    "timestamp": investment.get("timestamp", "")
                }
            })
            
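            # Register each sector once, even when several stocks share it
            sector_name = investment.get("sector", "")
            if not any(e["type"] == "Sector" and e["id"] == sector_name for e in investment_entities):
                investment_entities.append({
                    "id": sector_name,
                    "type": "Sector",
                    "name": sector_name,
                    "properties": {}
                })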
            
            investment_relationships.append({
                "source": investment.get("symbol", ""),
                "target": investment.get("sector", ""),
                "type": "belongs_to",
                "properties": {}
            })

builder = GraphBuilder()
graph_analyzer = GraphAnalyzer()
connectivity_analyzer = ConnectivityAnalyzer()

investment_kg = builder.build(investment_entities, investment_relationships)

metrics = graph_analyzer.compute_metrics(investment_kg)
connectivity = connectivity_analyzer.analyze_connectivity(investment_kg)

print(f"Extracted {len(investment_entities)} investment entities")
print(f"Extracted {len(investment_relationships)} relationships")
print(f"Collected {len(all_documents)} investment documents")
print(f"Built investment knowledge graph with {len(investment_kg.get('entities', []))} entities")
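
To make the entity and relationship dictionaries more concrete, the sketch below assembles them into a small in-memory graph using plain Python. It is not how `GraphBuilder` works internally; `build_simple_graph` is a hypothetical helper that only illustrates the idea. When several stocks share a sector, the same Sector entity can be emitted more than once, so the helper de-duplicates nodes by `id`.

```python
from collections import defaultdict


def build_simple_graph(entities, relationships):
    """Deduplicate entities by id and index relationships by source node."""
    nodes = {}
    for entity in entities:
        nodes.setdefault(entity["id"], entity)  # keep the first occurrence of each id
    adjacency = defaultdict(list)
    for rel in relationships:
        # Skip edges whose endpoints are missing from the node set
        if rel["source"] in nodes and rel["target"] in nodes:
            adjacency[rel["source"]].append((rel["type"], rel["target"]))
    return nodes, dict(adjacency)


entities = [
    {"id": "AAPL", "type": "Stock", "name": "Apple Inc.", "properties": {}},
    {"id": "Technology", "type": "Sector", "name": "Technology", "properties": {}},
    {"id": "MSFT", "type": "Stock", "name": "Microsoft Corporation", "properties": {}},
    {"id": "Technology", "type": "Sector", "name": "Technology", "properties": {}},
]
relationships = [
    {"source": "AAPL", "target": "Technology", "type": "belongs_to", "properties": {}},
    {"source": "MSFT", "target": "Technology", "type": "belongs_to", "properties": {}},
]

nodes, adjacency = build_simple_graph(entities, relationships)
members = sorted(
    src for src, edges in adjacency.items() if ("belongs_to", "Technology") in edges
)
print(len(nodes), members)  # 3 ['AAPL', 'MSFT']
```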


## Step 3: Generate Embeddings and Set Up Vector Store

Generate embeddings for the collected documents and set up the vector store for hybrid RAG.


In [None]:
embedding_generator = EmbeddingGenerator()
text_embedder = TextEmbedder()
vector_store = VectorStore()
hybrid_search = HybridSearch()

embeddings = embedding_generator.generate_embeddings(all_documents, data_type="text")

metadata = []
for i, doc in enumerate(all_documents):
    metadata.append({
        "id": f"doc_{i}",
        "text": doc,
        "source": "investment_data"
    })

vector_ids = vector_store.store_vectors(embeddings, metadata)

print(f"Generated embeddings for {len(all_documents)} documents")
print(f"Stored {len(vector_ids)} vectors in vector store")
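
For readers who want to see what a store-then-search round trip amounts to, here is a minimal sketch of cosine-similarity retrieval. It deliberately swaps the real embedding model for a toy bag-of-words vector so that it runs with the standard library alone. The functions `embed`, `cosine`, and `search` are hypothetical stand-ins, not the `EmbeddingGenerator` or `VectorStore` API.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector: lowercase token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, store, k=3):
    """Return the k (score, metadata) pairs most similar to the query."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, vector), meta) for vector, meta in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


documents = [
    "Apple Inc. (AAPL) in Technology sector",
    "Microsoft Corporation (MSFT) in Technology sector",
]
# Each entry pairs a vector with the same metadata shape used in the cell above.
store = [
    (embed(doc), {"id": f"doc_{i}", "text": doc, "source": "investment_data"})
    for i, doc in enumerate(documents)
]

top = search("Microsoft technology stocks", store, k=1)
print(top[0][1]["id"])  # doc_1
```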


## Step 4: Set Up Hybrid RAG and Query Investment Insights

Set up hybrid search, then query the vector store and the temporal knowledge graph for investment insights.
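
Hybrid retrieval ultimately needs a way to merge rankings that come from different retrievers. One common option is reciprocal rank fusion (RRF), sketched below. This is an illustrative technique chosen for the example, not a description of what `HybridSearch` or `ContextRetriever` do internally, and the two ranked lists are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an item's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: scores[item_id], reverse=True)


# Hypothetical ranked results from a vector retriever and a graph retriever.
vector_ranking = ["MSFT", "AAPL", "Technology"]
graph_ranking = ["AAPL", "Technology"]

fused = reciprocal_rank_fusion([vector_ranking, graph_ranking])
print(fused)  # ['AAPL', 'Technology', 'MSFT']
```

Items that appear in both lists rise to the top even if neither retriever ranked them first, which is the behavior a hybrid setup usually wants.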


In [None]:
context_retriever = ContextRetriever(
    knowledge_graph=investment_kg,
    vector_store=vector_store
)

temporal_query = TemporalGraphQuery()
inference_engine = InferenceEngine()
rule_manager = RuleManager()
explanation_generator = ExplanationGenerator()

# Query examples
queries = [
    "What are the best performing sectors?",
    "Find technology stocks with high market cap",
    "What investments have good dividend yields?"
]

query_results = []
for query in queries:
    query_embedding = text_embedder.embed_text(query)
    vector_results = vector_store.search_vectors(query_embedding, k=3)
    
    start_time = (datetime.now() - timedelta(days=30)).isoformat()
    end_time = datetime.now().isoformat()
    
    temporal_results = temporal_query.query_time_range(
        graph=investment_kg,
        query=query,
        start_time=start_time,
        end_time=end_time
    )
    
    context_results = context_retriever.retrieve(
        query=query,
        top_k=3,
        use_graph_expansion=True
    )
    
    query_results.append({
        "query": query,
        "vector_results": len(vector_results),
        "temporal_results": len(temporal_results.get('entities', [])),
        "context_results": len(context_results) if context_results else 0
    })

# Investment analysis rules
inference_engine.add_rule("IF pe_ratio < 20 AND dividend_yield > 0.5 THEN value_stock")
inference_engine.add_rule("IF market_cap > 1000000000000 AND sector is Technology THEN mega_cap_tech")

for investment in parsed_data.data if parsed_data and parsed_data.data else []:
    if isinstance(investment, dict):
        inference_engine.add_fact({
            "symbol": investment.get("symbol", ""),
            "pe_ratio": investment.get("pe_ratio", 0),
            "dividend_yield": investment.get("dividend_yield", 0),
            "market_cap": investment.get("market_cap", 0),
            "sector": investment.get("sector", "")
        })

investment_insights = inference_engine.forward_chain()

print(f"Processed {len(queries)} investment queries")
for result in query_results:
    print(f"  Query: '{result['query']}' - Vector: {result['vector_results']}, Temporal: {result['temporal_results']}, Context: {result['context_results']}")
print(f"Generated {len(investment_insights)} investment insights")
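
To see which of the two rules can fire on the sample data, the sketch below restates them as plain Python conditions. It is a hand-written equivalent for illustration only; `classify` is a hypothetical helper and does not reflect how `InferenceEngine` parses or chains rules.

```python
def classify(fact):
    """Apply the two rules to one fact dictionary and return the labels that fire."""
    labels = []
    # IF pe_ratio < 20 AND dividend_yield > 0.5 THEN value_stock
    if fact["pe_ratio"] < 20 and fact["dividend_yield"] > 0.5:
        labels.append("value_stock")
    # IF market_cap > 1000000000000 AND sector is Technology THEN mega_cap_tech
    if fact["market_cap"] > 1_000_000_000_000 and fact["sector"] == "Technology":
        labels.append("mega_cap_tech")
    return labels


facts = [
    {"symbol": "AAPL", "pe_ratio": 28.5, "dividend_yield": 0.5,
     "market_cap": 2_800_000_000_000, "sector": "Technology"},
    {"symbol": "MSFT", "pe_ratio": 32.1, "dividend_yield": 0.7,
     "market_cap": 2_800_000_000_000, "sector": "Technology"},
]

insights = {fact["symbol"]: classify(fact) for fact in facts}
print(insights)  # {'AAPL': ['mega_cap_tech'], 'MSFT': ['mega_cap_tech']}
```

With the sample records, only `mega_cap_tech` applies: both P/E ratios are above 20, so the `value_stock` rule never fires.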


## Step 5: Generate Reports and Visualize

Generate investment analysis reports and visualize results.


In [None]:
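# KGQualityAssessor is not imported in the first cell; the module path below is an assumption
from semantica.kg_qa import KGQualityAssessor

quality_assessor = KGQualityAssessor()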
json_exporter = JSONExporter()
csv_exporter = CSVExporter()
rdf_exporter = RDFExporter()
report_generator = ReportGenerator()

quality_score = quality_assessor.assess_overall_quality(investment_kg)

json_exporter.export_knowledge_graph(investment_kg, os.path.join(temp_dir, "investment_kg.json"))
csv_exporter.export_entities(investment_entities, os.path.join(temp_dir, "investment_entities.csv"))
rdf_exporter.export_knowledge_graph(investment_kg, os.path.join(temp_dir, "investment_kg.rdf"))

report_data = {
    "summary": f"Investment analysis identified {len(investment_insights)} insights from {len(investment_entities)} entities",
    "investments_analyzed": len(parsed_data.data) if parsed_data and parsed_data.data else 0,
    "insights": len(investment_insights),
    "quality_score": quality_score.get('overall_score', 0)
}

report = report_generator.generate_report(report_data, format="markdown")

kg_visualizer = KGVisualizer()
temporal_visualizer = TemporalVisualizer()
analytics_visualizer = AnalyticsVisualizer()

kg_viz = kg_visualizer.visualize_network(investment_kg, output="interactive")
temporal_viz = temporal_visualizer.visualize_timeline(investment_kg, output="interactive")
analytics_viz = analytics_visualizer.visualize_analytics(investment_kg, output="interactive")

print(f"Total modules used: 20+")
print("Pipeline complete: Multi-Source Investment Data → Parse → Extract → Build KG → Embeddings → Vector Store → Hybrid RAG → Query → Reports → Visualize")
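
The `report` value produced above is never displayed. As a rough idea of what a Markdown rendering of `report_data` could look like, here is a minimal standard-library sketch. `render_markdown_report` is a hypothetical helper, not the `ReportGenerator` implementation, and the figures in `sample_report` are placeholders rather than results from a real run.

```python
def render_markdown_report(title, report_data):
    """Render a summary line plus a two-column Markdown table of the remaining fields."""
    lines = [
        f"# {title}",
        "",
        report_data.get("summary", ""),
        "",
        "| Metric | Value |",
        "| --- | --- |",
    ]
    for key, value in report_data.items():
        if key == "summary":
            continue  # the summary is already shown above the table
        lines.append(f"| {key.replace('_', ' ').title()} | {value} |")
    return "\n".join(lines)


# Placeholder values with the same keys as report_data in the cell above.
sample_report = {
    "summary": "Investment analysis identified 2 insights from 3 entities",
    "investments_analyzed": 2,
    "insights": 2,
    "quality_score": 0,
}

markdown_report = render_markdown_report("Investment Analysis Report", sample_report)
print(markdown_report)
```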
