# RAG Data Analytics & Visualization Notebook

This notebook provides an end-to-end workflow for analyzing and visualizing data extracted by the Retrieval-Augmented Generation (RAG) pipeline. It is intended for data analytics and visualization only. All extraction, ingestion, and deployment logic remains in the original Python files.


In [1]:
# --- Imports & Setup ---
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pymongo import MongoClient
import json
import os

# Set up plotting
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)


## 1. Data Loading
Connect to MongoDB or load exported data files.


In [2]:
# --- MongoDB Connection ---
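mongo_uri = 'mongodb://mongodb:27017/'  # Host 'mongodb' only resolves where that name is defined (e.g. a container network); use 'mongodb://localhost:27017/' when running locally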
db_name = 'pdf_rag'
client = MongoClient(mongo_uri)
db = client[db_name]
invoices = list(db.invoices.find())
qa_pairs = list(db.qa_pairs.find())

# Convert to DataFrame for analysis
invoices_df = pd.DataFrame(invoices)
qa_pairs_df = pd.DataFrame(qa_pairs)

print(f'Loaded {len(invoices_df)} invoices and {len(qa_pairs_df)} QA pairs.')

ServerSelectionTimeoutError: mongodb:27017: [Errno 8] nodename nor servname provided, or not known (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 680bdc2e243c88559c172100, topology_type: Unknown, servers: [<ServerDescription ('mongodb', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('mongodb:27017: [Errno 8] nodename nor servname provided, or not known (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>
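When MongoDB is not reachable, as in the error above, the same collections can be read from exported files instead. The following is a minimal sketch, not part of the original pipeline: `load_exported` is a hypothetical helper, and the file paths in the comments are placeholders. It assumes the export is either a single JSON array (what `mongoexport --jsonArray` produces) or one JSON document per line (the `mongoexport` default).

```python
import json
from pathlib import Path


def load_exported(path):
    """Read a JSON export: either one JSON array or one JSON document per line."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    if text.startswith("["):
        # The whole file is a single JSON array
        return json.loads(text)
    # Otherwise treat it as JSON Lines: one document per non-empty line
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical usage inside the notebook (paths are placeholders):
# invoices_df = pd.DataFrame(load_exported("exports/invoices.json"))
# qa_pairs_df = pd.DataFrame(load_exported("exports/qa_pairs.json"))
```

The rest of the notebook only depends on `invoices_df` and `qa_pairs_df`, so it should run unchanged whichever way the DataFrames are built.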

## 2. Data Exploration
Preview the data and basic statistics.


In [None]:
# --- Preview Data ---
invoices_df.head()

In [None]:
# --- Summary Statistics ---
invoices_df.describe(include='all')

## 3. Analytics
Analyze extracted fields (amounts, dates, customer names, etc.).


In [None]:
# --- Example: Invoice Amount Distribution ---
if 'amount' in invoices_df.columns:
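    # Strip '$' and ',' then convert; values that still fail to parse become NaN
    cleaned = invoices_df['amount'].astype(str).str.replace(r'[$,]', '', regex=True)
    invoices_df['amount'] = pd.to_numeric(cleaned, errors='coerce')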
    sns.histplot(invoices_df['amount'], bins=20, kde=True)
    plt.title('Invoice Amount Distribution')
    plt.xlabel('Amount ($)')
    plt.ylabel('Count')
    plt.show()
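Amounts extracted from PDFs are not guaranteed to be clean strings such as `$1,234.50`. As a sketch of a more defensive conversion, the hypothetical helper `parse_amount` below (not part of the original code) removes `$`, commas, and whitespace, and returns `None` for anything that still does not parse. It assumes US-style number formatting, so a value like `1.234,50` is not handled.

```python
import re


def parse_amount(value):
    """Return a float for inputs like '$1,234.50' or 1234.5; None if unparseable."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Remove dollar signs, thousands separators, and whitespace only
    cleaned = re.sub(r"[$,\s]", "", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical usage inside the notebook:
# invoices_df["amount"] = pd.to_numeric(invoices_df["amount"].map(parse_amount))
```

Returning `None` rather than raising keeps one malformed invoice from aborting the whole cell; the unparsed rows can then be inspected separately.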


## 4. Time Series or Trends
Visualize trends over time (e.g., invoice counts or totals by month).


In [None]:
# --- Example: Invoices Over Time ---
if 'date' in invoices_df.columns:
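    invoices_df['date'] = pd.to_datetime(invoices_df['date'], errors='coerce')
    # Drop rows whose date failed to parse (NaT) before resampling by month.
    # 'M' means month end; pandas >= 2.2 prefers the alias 'ME'.
    monthly_counts = invoices_df.dropna(subset=['date']).set_index('date').resample('M').size()
    monthly_counts.plot(kind='bar')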
    plt.title('Invoices per Month')
    plt.xlabel('Month')
    plt.ylabel('Number of Invoices')
    plt.show()
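The section description also mentions totals by month, which the cell above does not compute. One possible sketch follows; `monthly_totals` is a hypothetical helper, and the column names `date` and `amount` are assumed to match the fields used earlier in this notebook. It groups by calendar month with `dt.to_period('M')`, so months without any invoices are simply absent from the result.

```python
import pandas as pd


def monthly_totals(df, date_col="date", amount_col="amount"):
    """Sum invoice amounts per calendar month, skipping unparseable rows."""
    data = df[[date_col, amount_col]].copy()
    data[date_col] = pd.to_datetime(data[date_col], errors="coerce")
    data[amount_col] = pd.to_numeric(data[amount_col], errors="coerce")
    # Rows with a missing date or amount cannot be assigned to a month total
    data = data.dropna(subset=[date_col, amount_col])
    return data.groupby(data[date_col].dt.to_period("M"))[amount_col].sum()


# Hypothetical usage inside the notebook:
# monthly_totals(invoices_df).plot(kind="bar", title="Invoice Total per Month")
```

The amount column is expected to be numeric already (or numeric-looking strings); values such as `$1,234.50` need the currency cleanup from the previous section first.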


## 5. QA Pairs Analytics
Analyze and visualize QA pairs if relevant.


In [None]:
# --- Preview QA Pairs ---
qa_pairs_df.head()
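Beyond a preview, one simple analysis is the length of each question and answer. The sketch below assumes the QA documents have `question` and `answer` fields, which is a guess: the actual schema of `qa_pairs` is not shown in this notebook. The hypothetical helper `word_counts` skips any column that is missing.

```python
import pandas as pd


def word_counts(df, columns=("question", "answer")):
    """Return whitespace-separated word counts for each text column that exists."""
    present = [c for c in columns if c in df.columns]
    return pd.DataFrame({
        # Missing values count as zero words
        f"{c}_words": df[c].fillna("").astype(str).str.split().str.len()
        for c in present
    })


# Hypothetical usage inside the notebook:
# word_counts(qa_pairs_df).describe()
```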

## 6. Custom Visualizations & Analysis
Add more analytics or visualizations as needed.
