<a href="https://colab.research.google.com/github/zhan5555/LLMRAGProject/blob/main/10KDoc_RAG_LLM_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objectives

**Business use case**: As a product manager, I can use a chatbot system to conduct a quick analysis of a company during my market research. For example, the analysis will include the key elements of a SWOT in a predefined structure, covering key points from financial documents such as a company's 10-K annual report, and can incorporate real-time market news.

**Technical Implementation of the Prototype:**
Phase I. Users can store 10-K annual reports. We will experiment with the gte-Qwen2-7B-instruct text embedding model from Hugging Face (a top performer on the leaderboard, open source, and comparatively small) to convert text into vectors, then use the lightweight TinyLlama LLM for the decoding / chatbot functionality.

**TinyLlama** is a small-scale large language model (LLM) based on Meta’s Llama 2 architecture. Despite its modest size (1.1 billion parameters), it was trained on an enormous dataset (~3 trillion tokens) to achieve surprising competency in language tasks (source: hub.athina.ai). This model strikes a balance between capability and resource footprint, making it suitable for scenarios where larger models are impractical. Below, we explore key use cases of TinyLlama, from chatbots to code generation, and discuss prompt engineering strategies, performance optimizations, and limitations relative to bigger LLMs.

TinyLlama’s efficiency comes with **trade-offs in raw performance metrics**. On many NLP benchmarks, it will score lower than larger models in things like accuracy, reasoning, or code generation capability. It’s optimized to be good enough for many tasks with limited resources, not to be the best model on absolute performance.

Use cases for TinyLlama:
**Mobile and IoT scenarios, real-time analysis on edge devices.**
One common pattern is using TinyLlama in a retrieval-augmented generation (RAG) pipeline: external documents are broken into chunks and fed to the model along with a query, so that TinyLlama can base its answer on the provided context. For example, Red Hat’s OpenShift AI tutorial demonstrates a RAG chatbot that uses TinyLlama to answer questions about a 400+ page manual by combining it with document loaders, text splitters, and a vector database for context retrieval (source: cloud.redhat.com). In this setup, TinyLlama reads the relevant documentation snippets and generates an answer, effectively functioning as a lightweight Q&A system for large knowledge bases.
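
The RAG pattern described above can be sketched in a few lines. This is a toy illustration, not the pipeline built in this notebook: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the Python list stands in for a vector database, and all function names are made up for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list, top_k: int = 2) -> list:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def build_prompt(query: str, context_chunks: list) -> str:
    """Assemble what the LLM would receive: retrieved context, then the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "Adobe reported revenue growth in its Digital Media segment.",
    "The company faces competition in creative software.",
    "Office leases are located in San Jose, California.",
]
question = "What competition does the company face?"
top_chunks = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top_chunks)
```

In a real pipeline the prompt would then be passed to the LLM; only the retrieval and prompt-assembly steps are shown here.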

**Limitations:**


### Notes:


*   Not scaled for production - not optimized for data loading and embedding process
*   Colab setup: choose an A100 GPU (40 GB VRAM; Llama 2 7B requires at least 24 GB VRAM)
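
As a rough sanity check on the VRAM note above, the memory needed for model weights alone can be estimated as parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic under that assumption; activations, the KV cache, and framework overhead come on top, so real usage is higher. The function name is made up for illustration.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)

# Compare precisions for the model sizes mentioned in this notebook
for label, params in [("TinyLlama 1.1B", 1.1e9), ("7B model", 7e9)]:
    for bits in (32, 16, 4):
        print(f"{label} @ {bits}-bit: ~{weight_memory_gib(params, bits):.1f} GiB")
```

By this estimate a 7B model loaded in 32-bit precision needs roughly 26 GiB for weights, which is in the same range as the 24 GB figure above, while 4-bit quantization reduces the weights to a few GiB.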



Mount Google Drive. To check GPU info if needed, see the GPU check cell in section 2.

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Install Required Packages

In [None]:
# CODE CELL 1
!pip install --upgrade pip
!pip install pdfplumber pypdf pymupdf                     # for reading PDFs
!pip install pinecone                                     # for Pinecone vector DB
!pip install transformers sentencepiece huggingface_hub   # for huggingface transformers
!pip install accelerate                                   # helps accelerate HF models on GPU
!pip install bitsandbytes                                 # for 8-bit / 4-bit inference
!pip install torch torchvision torchaudio
!pip install -U langchain langchain-community langgraph
!pip install openai requests

Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.0.1
Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)

## 2. Import

In [None]:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader  # loaders live in langchain-community in recent LangChain releases
import os
import gc
import uuid
import pinecone
import torch
import pickle
from transformers import AutoTokenizer, AutoModel, pipeline, AutoModelForCausalLM
import openai
import numpy as np
from huggingface_hub import login
from concurrent.futures import ThreadPoolExecutor
from langgraph.graph import StateGraph
from langchain.schema import Document

# from langgraph.graph.message import Message
#from langgraph.prebuilt import Chains, ChainState

In [None]:
# check GPU details
import tensorflow as tf
import torch

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        gpu_details = tf.config.experimental.get_device_details(gpus[0])
        print("TensorFlow GPU Details:")
        for key, value in gpu_details.items():
            print(f"  {key}: {value}")

        if torch.cuda.is_available():
            gpu_index = 0
            torch_gpu_properties = torch.cuda.get_device_properties(gpu_index)
            print("\nPyTorch GPU Details:")
            for property_name in dir(torch_gpu_properties):
                if not property_name.startswith('_'):  # skip private attributes
                    property_value = getattr(torch_gpu_properties, property_name)
                    print(f"  {property_name}: {property_value}")

            vram_bytes = torch_gpu_properties.total_memory
            vram_gb = vram_bytes / (1024**3)
            print(f"  total_memory (GB): {vram_gb:.2f}")

        else:
            print("\nPyTorch CUDA not available.")

        print(f"\nDefault GPU Device: {tf.test.gpu_device_name()}")

    except Exception as e:
        print(f"Error retrieving GPU information: {e}")

else:
    print("No GPU found.")

TensorFlow GPU Details:
  compute_capability: (8, 0)
  device_name: NVIDIA A100-SXM4-40GB

PyTorch GPU Details:
  L2_cache_size: 41943040
  gcnArchName: NVIDIA A100-SXM4-40GB
  is_integrated: 0
  is_multi_gpu_board: 0
  major: 8
  max_threads_per_multi_processor: 2048
  minor: 0
  multi_processor_count: 108
  name: NVIDIA A100-SXM4-40GB
  regs_per_multiprocessor: 65536
  total_memory: 42474471424
  uuid: 0c2516ed-7173-ef9c-d5e8-8ca905b25b36
  warp_size: 32
  total_memory (GB): 39.56

Default GPU Device: /device:GPU:0


## 3. Authentication

In [None]:
# hugging face authentication

login(token="xxxx")  # Replace with your actual token
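
Rather than pasting tokens into the notebook, one option is to read them from environment variables (Colab secrets are another). The sketch below is a minimal helper under that assumption; `get_secret` and the variable name `HF_TOKEN` are illustrative choices, not part of any library.

```python
import os

def get_secret(name: str) -> str:
    """Read a secret from an environment variable; fail loudly if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value

# Example usage (assumes HF_TOKEN was exported before the notebook started):
# login(token=get_secret("HF_TOKEN"))
```

Failing early with a clear message avoids the less obvious authentication errors that appear later when a placeholder key is sent to an API.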

In [None]:
from pinecone import Pinecone, ServerlessSpec

# ✅ Create Pinecone instance

pc = Pinecone(
    api_key="xxxx"  # your key here
)

index_name = "documentvectorstore"

# ✅ Check if index exists; if not, create it
all_indexes = pc.list_indexes().names()  # Returns a list of index names

if index_name not in all_indexes:
    pc.create_index(
        name=index_name,
        dimension=3584,  # Match your embedding dimension
        metric="cosine",
        # You can specify cloud/region if you want a non-default location
        spec=ServerlessSpec(
            cloud="aws",
            region="us-west-1"
        )
    )

# ✅ Access your index
index = pc.Index(index_name)

print(f"✅ Successfully connected to index: {index_name}")


✅ Successfully connected to index: documentvectorstore


Delete data in Pinecone if needed:

In [None]:
# Delete all data from the index
index.delete(delete_all=True)
print("✅ All vectors deleted from Pinecone index.")

✅ All vectors deleted from Pinecone index.


Check if any data is in Pinecone: https://app.pinecone.io/organizations/-OJBMJJ-o4-9ICTel_0c/projects/8d23e6c4-96cf-4fae-af86-a291c5d4fe9d/indexes/documentvectorstore/browser

## 4. Define the text embedding function (used to convert document text to vectors for indexing and to convert user input to vectors for vector-based retrieval)

Pick an open-source model from the Hugging Face leaderboard: https://huggingface.co/spaces/mteb/leaderboard

Selected a relatively lightweight but high-performance model: gte-Qwen2-7B-instruct, an open-source embedding model from Alibaba, ranked #3 as of March 2025.


*  Zero-shot: NA (on the MTEB leaderboard this column appears to indicate how much of the benchmark the model had not seen during training; "NA" most likely means that this information is not available for the model, rather than describing a classification behavior)
* No. parameters:7B
* 3584 embedding dimensions
* max tokens 32768
* Mean (task) 62.51
* Mean task type 56
* Pair classification 85.13
* Retrieval 60.08






In [None]:
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np
from transformers import BitsAndBytesConfig  # Import for quantization

# CODE CELL 3 - Optimized Batch Embedding Function for Qwen2-7B (Quantized)

EMBED_MODEL_NAME = "Alibaba-NLP/gte-Qwen2-7B-instruct"  # Adjust as needed
EMBEDDING_DIM = 3584  # Model-specific embedding size
BATCH_SIZE = 8  # Adjust based on available VRAM

# Quantization Config - Load Model in 4-bit Precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,    # Enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # Use FP16 for computation
    bnb_4bit_quant_type="nf4",  # Use NF4 quantization (better for accuracy)
    bnb_4bit_use_double_quant=True  # Double quantization for efficiency
)

# Load tokenizer and model with quantization
tokenizer_embed = AutoTokenizer.from_pretrained(EMBED_MODEL_NAME)
model_embed = AutoModel.from_pretrained(EMBED_MODEL_NAME, quantization_config=bnb_config, device_map="auto")

def mean_pooling(outputs, attention_mask):
    """
    Computes mean pooling by averaging token embeddings,
    ignoring padding tokens using the attention mask.
    """
    token_embeddings = outputs.last_hidden_state  # Shape: [batch_size, seq_len, hidden_dim]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, dim=1) / torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)

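def embed_text_batch(texts: list, batch_size: int = BATCH_SIZE) -> np.ndarray:
    """
    Generates vector embeddings for a list of text inputs.

    Texts are processed in mini-batches of `batch_size` so that a whole
    document's chunks are never sent through the model in one forward pass.

    Args:
        texts (list of str): List of text inputs to embed.
        batch_size (int): Number of texts per forward pass.

    Returns:
        np.ndarray: NumPy array of shape (len(texts), EMBEDDING_DIM)
    """
    if not texts:
        return np.empty((0, EMBEDDING_DIM), dtype=np.float32)

    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]

        # Tokenize one mini-batch
        inputs = tokenizer_embed(batch, return_tensors="pt", padding=True, truncation=True, max_length=512).to(model_embed.device)

        with torch.no_grad():
            outputs = model_embed(**inputs)

        # Apply mean pooling to get sentence-level embeddings
        embeddings = mean_pooling(outputs, inputs['attention_mask'])
        all_embeddings.append(embeddings.float().cpu().numpy())  # Convert to float32 NumPy for storage & retrieval

    return np.concatenate(all_embeddings, axis=0)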
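
To make the masking in `mean_pooling` concrete, here is a dependency-free sketch of the same idea for a single sequence: padded positions (mask 0) are excluded from the average. The function name and the toy numbers are illustrative only. Whether mean pooling is the best choice for this particular model is a separate question; it is worth checking the model card for the pooling the model was trained with.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sequence, skipping positions whose mask is 0."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        return [0.0] * len(token_vectors[0])
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Three positions with hidden size 2; the last position is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)  # the padding vector does not affect the result
```

Without the mask, the padding vector would dominate the average, which is why the batched function multiplies by the expanded attention mask before summing.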

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# make sure the function is working

test_vec = embed_text_batch(["This is a test."])
print(type(test_vec), test_vec.shape)


<class 'numpy.ndarray'> (1, 3584)


In [None]:
# Test the batch embedding function with sample text
sample_texts = [
    "Adobe is a global software company specializing in creative tools.",
    "The company launched Acrobat AI Assistant in 2024.",
    "Adobe Experience Cloud is used for marketing and customer analytics."
]

# Generate embeddings
embeddings = embed_text_batch(sample_texts)

# Print shape to confirm correct output
print(f"Generated embeddings shape: {embeddings.shape}")  # Expected: (3, 3584)


Generated embeddings shape: (3, 3584)


In [None]:
# make sure my pinecone index is configured to accept the same embedding dimension
index.describe_index_stats()

{'dimension': 3584,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 3419}},
 'total_vector_count': 3419,
 'vector_type': 'dense'}
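
The dimension check above is done by eye. A small guard like the following could make it explicit; this is a sketch that assumes the index stats have been converted to a plain dict, and `check_dimension` is a made-up helper name.

```python
def check_dimension(index_stats: dict, embedding_dim: int) -> None:
    """Raise if the index dimension differs from the embedding size."""
    index_dim = index_stats.get("dimension")
    if index_dim != embedding_dim:
        raise ValueError(
            f"Index dimension {index_dim} does not match embedding dimension {embedding_dim}"
        )

# Values copied from the describe_index_stats() output above
stats = {"dimension": 3584, "metric": "cosine", "total_vector_count": 3419}
check_dimension(stats, 3584)  # passes silently when the sizes agree
```

Running such a check before ingestion surfaces a mismatch immediately instead of at the first upsert.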

## 5. PDF Document Processing (content-aware chunking with LangChain, gte-Qwen2 for tokenizing and embedding, Pinecone ingestion) - run when a new PDF is uploaded

*   **Content-aware chunking**: ensures natural splits at sentence/paragraph levels.
*   **Batch embedding**: processes multiple chunks efficiently.
*   **Pre-summarization**: reduces retrieval noise and improves answer clarity.
*   **Stores both full and summarized text**: provides flexibility for retrieval.
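
To illustrate what chunk size and chunk overlap mean, here is a simplified sliding-window splitter. It is only a sketch: unlike LangChain's `RecursiveCharacterTextSplitter`, it does not look for paragraph or sentence separators, and `split_with_overlap` is a hypothetical name used for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
# Each chunk repeats the last 3 characters of the previous one
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of storing some text twice.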



To improve the output quality of TinyLlama, I first define a `summarize_text` function that uses GPT-4 to summarize each content chunk from the PDFs before the chunks are embedded and stored; the summary is saved with each vector as metadata. GPT-4 provides high-quality summaries.

### Option 1. Use a GPT-4 model (paid API) - don't use

In [None]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-1.65.4-py3-none-any.whl.metadata (27 kB)
Downloading openai-1.65.4-py3-none-any.whl (473 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.61.1
    Uninstalling openai-1.61.1:
      Successfully uninstalled openai-1.61.1
Successfully installed openai-1.65.4


In [None]:
!pip install requests



In [None]:
import requests

# ✅ Replace this with your actual OpenAI API key

OPENAI_API_KEY = "insert API key here"

# OpenAI API endpoint
url = "https://api.openai.com/v1/models"

# Set headers
headers = {
    "Authorization": f"Bearer {OPENAI_API_KEY}",
    "Content-Type": "application/json"
}

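# Send request (the timeout keeps the notebook from hanging on network issues)
response = requests.get(url, headers=headers, timeout=30)
response_data = response.json()  # Parse the body once and reuse it

# ✅ Debugging Output
print("🔹 Full API Response:", response_data)

# ✅ Check for errors
if "error" in response_data:
    print("❌ ERROR: Invalid API Key or Access Issue.")
else:
    print("✅ API Key is working!")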


🔹 Full API Response: {'error': {'message': 'Incorrect API key provided: insert A*******here. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
❌ ERROR: Invalid API Key or Access Issue.


In [None]:
import requests
import os

OPENAI_API_KEY = "xxxx"

def summarize_text(text, max_length=300):
    """
    Uses OpenAI's REST API to summarize text with GPT-4o.

    Args:
        text (str): The text to summarize.
        max_length (int): Maximum number of tokens for the summary.

    Returns:
        str: The summarized text.
    """
    url = "https://api.openai.com/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "gpt-4o",  # ✅ Correct model from your available list
        "messages": [
            {"role": "system", "content": "You are an AI assistant that summarizes financial reports concisely."},
            # max_length is a token budget, so ask for fewer words (~0.75 words per token is a rough heuristic)
            {"role": "user", "content": f"Summarize this in at most {int(max_length * 0.75)} words:\n\n{text}"}
        ],
        "temperature": 0.3,
        "max_tokens": max_length
    }

    try:
        response = requests.post(url, headers=headers, json=payload)
        response_data = response.json()

        if "choices" in response_data:
            return response_data["choices"][0]["message"]["content"].strip()
        else:
            print("⚠️ OpenAI API Error:", response_data)
            return f"❌ GPT-4 API Error: {response_data}"

    except Exception as e:
        print("Error in GPT-4 summarization:", str(e))
        return "❌ GPT-4 failed to generate summary"
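
Note that `summarize_text` above returns an error string (prefixed with "❌") when the API call fails, and that string would then be stored as the chunk's summary. One way to avoid this is a thin wrapper that falls back to the start of the original text. This is a sketch under that assumption; `summarize_or_fallback` and `fallback_chars` are hypothetical names.

```python
from typing import Callable

ERROR_PREFIX = "❌"  # summarize_text above prefixes its error strings with this marker

def summarize_or_fallback(text: str, summarizer: Callable[[str], str], fallback_chars: int = 300) -> str:
    """Return the summary, or the first fallback_chars of the text if summarization failed."""
    summary = summarizer(text)
    if not summary or summary.startswith(ERROR_PREFIX):
        return text[:fallback_chars]
    return summary

# Stub summarizers stand in for the real API call so the sketch runs offline
failing = lambda t: "❌ GPT-4 failed to generate summary"
working = lambda t: "Adobe launched AI tools in 2024."

fallback_result = summarize_or_fallback("Adobe launched AI-powered tools.", failing, fallback_chars=5)
normal_result = summarize_or_fallback("Adobe launched AI-powered tools.", working)
```

With this wrapper, a quota or key error degrades the stored summary to a text excerpt instead of polluting the metadata with error messages.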


In [None]:
print(summarize_text("Adobe launched AI-powered tools in 2024 to enhance digital content creation and enterprise solutions."))

⚠️ OpenAI API Error: {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
❌ GPT-4 API Error: {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}


In [None]:
# code cell 4 - chunk PDF documents at natural splits and ingest into Pinecone

INDEX_NAME = "documentvectorstore"
index = pc.Index(INDEX_NAME)

pdf_folder = "/content/drive/MyDrive/DataScienceProjects/10Kdocuments"
batch_size = 8  # Matches embedding batch size
chunk_size = 1500  # Adjust for optimal retrieval
chunk_overlap = 200  # Avoid cutting off key details

# Use content-aware chunking to respect sentence structure
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", ".", " ", ""]
)

vectors_to_upsert = []

for filename in os.listdir(pdf_folder):
    if filename.lower().endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder, filename)

        loader = PyPDFLoader(pdf_path)
        documents = loader.load()
        chunks = text_splitter.split_documents(documents)

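        # Assumes filenames follow the pattern "<company>-<year>[-...].pdf"
        name_parts = os.path.splitext(filename)[0].split('-')
        company_name = name_parts[0].lower()
        year = name_parts[1] if len(name_parts) > 1 else "unknown"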

        # Summarize and embed every non-empty chunk of this PDF
        chunk_texts = [chunk.page_content for chunk in chunks if chunk.page_content.strip()]
        chunk_summaries = [summarize_text(text) for text in chunk_texts]
        # Batch embedding function from code cell 3 (gte-Qwen2-7B), producing 3584-dimensional embeddings
        embeddings = embed_text_batch(chunk_texts)

        for i in range(len(chunk_texts)):
            vector_id = str(uuid.uuid4())

            metadata = {
                "summary": chunk_summaries[i],  # Store summarized chunk
                "truncated_text": chunk_texts[i][:2000],  # Store original text
                "source_pdf": filename,
                "company": company_name,
                "year": str(year)
            }

            vectors_to_upsert.append((vector_id, embeddings[i].tolist(), metadata))  # send values as a plain list of floats

            # Upsert in batches
            if len(vectors_to_upsert) >= batch_size:
                index.upsert(vectors=vectors_to_upsert)
                vectors_to_upsert = []

if vectors_to_upsert:
    index.upsert(vectors=vectors_to_upsert)

print("✅ Documents chunked, summarized, and upserted into Pinecone.")
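
The accumulate-and-flush logic in the cell above can also be expressed with a small batching helper, which applies equally to upserts and to embedding calls. The sketch below is a generic helper, not part of Pinecone or LangChain; `batched` is a name chosen for this example. Sending every chunk of a PDF through the embedding model in one call is a plausible cause of the `OutOfMemoryError` shown in the output below.

```python
from typing import Iterator, List, Sequence

def batched(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

groups = list(batched(["c1", "c2", "c3", "c4", "c5"], batch_size=2))

# Hypothetical usage with the objects from the cell above:
# for batch in batched(vectors_to_upsert, 8):
#     index.upsert(vectors=batch)
```

Because the final partial batch is yielded automatically, no separate "flush the remainder" step is needed after the loop.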

Error in GPT-4 summarization: 'choices'


OutOfMemoryError: CUDA out of memory. Tried to allocate 16.57 GiB. GPU 0 has a total capacity of 39.56 GiB of which 10.04 GiB is free. Process 25928 has 29.51 GiB memory in use. Of the allocated memory 23.98 GiB is allocated by PyTorch, and 5.03 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
import requests
import os

# Set OpenAI API key
OPENAI_API_KEY = "sk-..."  # Replace with your actual API key

def summarize_text(text, max_length=300):
    """
    Uses OpenAI's REST API to summarize text with GPT-4.
    """
    url = "https://api.openai.com/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "gpt-4-turbo",  # Use "gpt-4" or "gpt-4o" if needed
        "messages": [
            {"role": "system", "content": "You are an AI assistant that summarizes financial reports concisely."},
            # max_length is a token budget, so ask for fewer words (~0.75 words per token is a rough heuristic)
            {"role": "user", "content": f"Summarize this in at most {int(max_length * 0.75)} words:\n\n{text}"}
        ],
        "temperature": 0.3,
        "max_tokens": max_length
    }

    try:
        response = requests.post(url, headers=headers, json=payload)
        response_data = response.json()

        # ✅ Print full response for debugging
        print("Full API Response:", response_data)

        # ✅ Check if "choices" exists in response
        if "choices" in response_data:
            return response_data["choices"][0]["message"]["content"].strip()
        else:
            return f"❌ GPT-4 API Error: {response_data}"

    except Exception as e:
        print("Error in GPT-4 summarization:", str(e))
        return "❌ GPT-4 failed to generate summary"


In [None]:
print(summarize_text("Adobe launched AI-powered tools in 2024 to enhance digital content creation and enterprise solutions."))


Full API Response: {'error': {'message': 'Incorrect API key provided: sk-.... You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
❌ GPT-4 API Error: {'error': {'message': 'Incorrect API key provided: sk-.... You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}


In [None]:
# test GPT-4 Summarization
sample_text = "Adobe is a leading software company with products like Photoshop and Acrobat. In 2024, Adobe launched AI-powered tools to improve digital content creation and enterprise solutions."
print(summarize_text(sample_text))


In [None]:
# confirm Pinecone contains new vectors
index.describe_index_stats()

### Option 2. Use the free Llama 2 7B model to summarize - don't use

In [None]:
# CODE CELL 4 finalized summarization function

# ✅ Load the optimized Llama-2 model for summarization (set device properly)
summarizer = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device=0 if torch.cuda.is_available() else -1  # ✅ Use GPU if available, else CPU
)

def summarize_text(text: str, max_new_tokens: int = 300) -> str:
    """
    Summarizes input text using Llama-2.

    Args:
        text (str): The input text to summarize.
        max_new_tokens (int): Maximum number of tokens for the summary output.

    Returns:
        str: The summarized text.
    """
    prompt = f"""
    ### Instruction:
    You are a professional summarization assistant.
    Your task is to **SUMMARIZE the following text into a concise version**, keeping the most important details.

    **Rules:**
    - Do **NOT** rewrite the full text.
    - Keep the summary within **{max_new_tokens} tokens**.
    - Make the summary clear, structured, and easy to read.

    ### Text:
    {text}

    ### Summary:
    """

    result = summarizer(
        prompt,
        max_new_tokens=max_new_tokens,  # ✅ Ensures output is constrained
        num_return_sequences=1,
        do_sample=False,  # ✅ Ensures deterministic output
        return_full_text=False,  # Return only the generated summary, not the echoed prompt
    )

    return result[0]['generated_text'].strip()


config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Device set to use cuda:0


In [None]:
# free up RAM

def free_memory():
    """Frees up RAM and GPU memory in Colab."""
    torch.cuda.empty_cache()  # ✅ Clears unused GPU memory
    gc.collect()  # ✅ Forces Python garbage collection

# ✅ Run cleanup before re-processing PDFs
free_memory()

print("✅ RAM and GPU memory cleared successfully.")

✅ RAM and GPU memory cleared successfully.


### Find max token length for chunking  (one time run)

Experiment with max_new_tokens to find the ideal summary length, then set the default in the summarize_text function based on the observed results.

In [None]:
from transformers import pipeline

# ✅ Load the optimized Llama-2 model for summarization
summarizer = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device_map="auto")

# Adjust max new token value based on experimented results

def summarize_text(text: str, max_new_tokens: int = 250) -> str:
    """
    Summarizes input text using Llama-2.

    Args:
        text (str): The input text to summarize.
        max_new_tokens (int): Maximum number of tokens for the summary output.

    Returns:
        str: The summarized text.
    """
    prompt = f"""
    ### Instruction:
    You are a professional summarization assistant.
    Your task is to **SUMMARIZE the following text into a concise version**, while keeping the most important details.

    **Rules:**
    - Do **NOT** rewrite the full text.
    - Keep the summary within **{max_new_tokens} tokens**.
    - Make the summary clear, structured, and easy to read.

    ### Text:
    {text}

    ### Summary:
    """

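    result = summarizer(
        prompt,
        max_new_tokens=max_new_tokens,  # ✅ Ensures output is constrained
        num_return_sequences=1,
        do_sample=False,  # ✅ Ensures deterministic output
        truncation=True,  # Truncates the prompt only if the tokenizer defines a maximum length
        return_full_text=False,  # Return only the generated summary, not the echoed prompt
    )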

    return result[0]['generated_text'].strip()



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cpu


Try different max_new_tokens values, e.g. `for tokens in [300, 350, 400, 450, 500]`. Start from 100, 200, 300.
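
When comparing token budgets, it can help to record the size of each summary instead of judging printed output by eye. The sketch below assumes a summarizer with the same call shape as `summarize_text(text, max_new_tokens=...)`; `compare_summary_lengths` and `fake_summarize` are hypothetical names, and the fake summarizer exists only so the example runs without loading a model.

```python
def compare_summary_lengths(text, summarize, token_budgets):
    """Run summarize(text, max_new_tokens=n) per budget and record output size in words."""
    source_words = max(len(text.split()), 1)
    results = []
    for budget in token_budgets:
        summary_words = len(summarize(text, max_new_tokens=budget).split())
        results.append({
            "max_new_tokens": budget,
            "summary_words": summary_words,
            "compression": summary_words / source_words,  # fraction of the source length
        })
    return results

def fake_summarize(text, max_new_tokens):
    """Stand-in summarizer: keeps the first max_new_tokens words of the text."""
    return " ".join(text.split()[:max_new_tokens])

rows = compare_summary_lengths("one two three four five six", fake_summarize, [2, 4])
```

A budget beyond which `summary_words` stops growing suggests the model finishes its summary earlier, so a larger `max_new_tokens` adds no value.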

In [None]:
# ✅ Use the full Adobe Experience Cloud text for summarization
test_text = """
Our goal is to be a leading provider of cloud-based solutions for delivering digital experiences and enabling digital
transformation. Adobe Experience Cloud Apps and services are designed to manage customer journeys, enable personalized
experiences at scale and deliver intelligence for businesses of any size in any industry. Adobe Experience Platform further
strengthens our differentiation by offering a way to connect our comprehensive set of solutions. Further descriptions of our
Digital Experience products are included below under “Principal Products, Services and Solutions.”

Adobe Experience Cloud delivers solutions for our customers across the following strategic growth pillars:
• Data insights and audiences. Our products deliver actionable data to our customers in real time to enable highly
tailored and adaptive experiences across platforms.
• Content, commerce and workflows. Our products help our customers manage, deliver, personalize, and optimize
content delivery; build multi-channel commerce experiences for B2B and B2C customers; strategically plan,
manage, collaborate and execute on workflows for marketing campaigns and other projects at speed and scale; and
leverage self-serve capabilities to deliver on-brand content.
• Customer journeys. Our products help businesses manage, test, target and personalize customer journeys delivered
as campaigns across B2B and B2C use cases.

Our goal is to deploy our capabilities to help enterprises generate highly engaging experiences, enable content creators to
enhance creativity and scale content production, and provide marketing strategists with recommendations to improve marketing
strategy and enhance the delivery of personalized customer journeys. In our Adobe Experience Cloud, we believe our
innovations, natively embedded AI services, and end-to-end content supply chain solutions enhance the delivery and
personalization of digital experiences by allowing our customers to gain insights and understand their data, simulate outcomes,
automate tasks, generate audiences, create content, optimize marketing campaigns and enable the delivery of more relevant and
personalized customer journeys. Adobe Experience Cloud offers domain-specific AI services in areas such as attribution and
automated insights, customer journey management, lead management, sentiment analysis, one-click personalization, enhanced
anomaly detection and more that work with Adobe Experience Platform to augment our Experience Cloud product offerings.

By building on these features and capabilities, we increase the value we provide our customers and create a competitive
differentiation in the market.

Adobe Experience Cloud offers an open platform and ecosystem through Adobe Experience Platform. Adobe Experience
Platform’s open system transforms businesses’ customer data from across their Adobe solutions and third-party software into
robust customer profiles. These profiles are updated in real time and include AI-driven insights, allowing a brand to deliver the
right customer experiences across channels and deliver true one-to-one personalization. This open architecture offers scalability
with a wide variety of supporting products and services and empowers businesses to quickly develop innovative Apps to
interact with their customers and enables a broad industry ecosystem.

To drive growth of Adobe Experience Cloud, we focus on delivering on-brand, customer engagement, growth within
existing customer accounts, product differentiation and the best customer experience management solutions for B2B and B2C
buyers across enterprise and mid-market segments. We intend to pursue growth through a scaled go-to-market approach
focused on C-suite partnerships, transformational accounts, continued customer acquisition, customer value realization and
solution expansion. We utilize a direct sales force to market and license our Experience Cloud solutions, as well as an extensive
ecosystem of partners, including marketing agencies, systems integrators and independent software vendors that help license
and deploy our solutions to their customers. We also maintain several strategic partnerships with other technology companies
that allow us to increase our market reach. We have made significant investments to broaden the scale and size of all these
routes to market and believe these investments will result in continued growth in revenue in our Digital Experience segment in
fiscal 2025 and beyond.
"""

# Test different lengths
for tokens in [300, 350, 400, 450, 500]:
    print(f"\n🔹 Summarization with max_new_tokens={tokens}:\n")
    print(summarize_text(test_text, max_new_tokens=tokens))


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



🔹 Summarization with max_new_tokens=300:





### Instruction:
    You are a professional summarization assistant. 
    Your task is to **SUMMARIZE the following text into a concise version**, while keeping the most important details.

    **Rules:**
    - Do **NOT** rewrite the full text. 
    - Keep the summary within **300 tokens**.
    - Make the summary clear, structured, and easy to read.
    
    ### Text:
    
Our goal is to be a leading provider of cloud-based solutions for delivering digital experiences and enabling digital 
transformation. Adobe Experience Cloud Apps and services are designed to manage customer journeys, enable personalized 
experiences at scale and deliver intelligence for businesses of any size in any industry. Adobe Experience Platform further 
strengthens our differentiation by offering a way to connect our comprehensive set of solutions. Further descriptions of our 
Digital Experience products are included below under “Principal Products, Services and Solutions.”

Adobe Experience Cloud delivers so

**Test run on 100, 200, 300: Observation**

✅ Issue: Llama-2 only summarizes at max_new_tokens=300

This means that Llama-2 does not produce a meaningful summary unless it is given a larger token budget.

At 100 or 200 tokens, generation is likely cut off before a useful summary is complete.

🔹 Why is This Happening?

Llama-2 needs more space to process and rewrite text concisely.
When max_new_tokens is too small (100 or 200), it may truncate output before completing the summary.
When set to 300, it has enough space to produce a structured summary.

✅ Solution: Adjust max_new_tokens & Improve Prompt

Set a minimum of max_new_tokens=250 (since 300 worked well).
Make the prompt stricter so Llama-2 is forced to summarize even at lower values.

**Test run on 300, 350, 400, 450, 500:**

Set max_new_tokens=300 as the default.

No need to test 550, since 500 already proved excessive.

Llama-2 summarizes efficiently within 300 tokens. Increasing the limit beyond 300 brings no benefit, because the model naturally stops at the same point.
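
The "stops at the same point" observation can be checked mechanically by comparing the outputs produced under each token budget. The following is a minimal sketch, not part of the original notebook: `find_plateau` and `fake_summarize` are hypothetical names, and the stand-in summarizer only imitates a model that emits its end-of-sequence token after a fixed length.

```python
from typing import Callable, Iterable, Optional


def find_plateau(
    summarize: Callable[[str, int], str],
    text: str,
    budgets: Iterable[int],
) -> Optional[int]:
    """Return the smallest budget from which all larger budgets give the same output.

    Returns None if the output is still changing at the largest budget.
    """
    ordered = sorted(budgets)
    outputs = [summarize(text, b) for b in ordered]
    # Walk from the smallest budget and stop at the first one whose output
    # matches every output produced with a larger budget.
    for i in range(len(ordered) - 1):
        if all(out == outputs[i] for out in outputs[i + 1:]):
            return ordered[i]
    return None


# Stand-in summarizer for illustration only (an assumption): it "stops
# naturally" after 300 characters regardless of the budget it is given.
def fake_summarize(text: str, max_new_tokens: int) -> str:
    return text[: min(max_new_tokens, 300)]


plateau = find_plateau(fake_summarize, "x" * 1000, [300, 350, 400, 450, 500])
print("Output stops changing at budget:", plateau)
```

With the real model, `summarize_text` could be passed in place of `fake_summarize`, since it accepts the text and the token budget as its first two arguments.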


### Current version: PDF processing code cell

Add new PDFs, then re-run pipeline:

✅ Step 1: Parsing and chunking new PDFs (Existing ones are skipped)
chunk_pdfs("/content/drive/MyDrive/DataScienceProjects/10Kdocuments")

✅ Step 2: Summarize new PDFs (Existing summaries are skipped)
parallel_summarize_chunks(batch_size=50)

✅ Step 3: Embed summaries into Pinecone (Already embedded ones are skipped)
embed_summaries_to_pinecone()
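
The "existing ones are skipped" behaviour shared by the three steps can be sketched generically as below. This is an illustration under assumptions: `run_incremental`, `progress_path`, and `step` are hypothetical names, not functions from this notebook, and the demo uses a temporary file rather than Google Drive.

```python
import os
import pickle
import tempfile
from typing import Callable, Iterable, List, Set


def run_incremental(
    items: Iterable[str],
    progress_path: str,
    step: Callable[[str], None],
) -> List[str]:
    """Run `step` only on items not yet recorded in the progress file."""
    done: Set[str] = set()
    if os.path.exists(progress_path):
        with open(progress_path, "rb") as f:
            done = pickle.load(f)

    newly_done = []
    for item in items:
        if item in done:
            continue  # already handled in an earlier run
        step(item)
        done.add(item)
        newly_done.append(item)
        # Persist after every item so an interrupted run can resume.
        with open(progress_path, "wb") as f:
            pickle.dump(done, f)
    return newly_done


# Demo with a temporary progress file and a no-op step.
with tempfile.TemporaryDirectory() as tmp:
    progress = os.path.join(tmp, "processed.pkl")
    first = run_incremental(["a.pdf", "b.pdf"], progress, lambda name: None)
    second = run_incremental(["a.pdf", "b.pdf", "c.pdf"], progress, lambda name: None)

print("First run processed:", first)
print("Second run processed:", second)
```

Saving progress after each item trades a little extra I/O for the ability to resume after a Colab disconnect.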

In [None]:
pip install pymupdf pytesseract pdf2image

Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pytesseract, pdf2image
Successfully installed pdf2image-1.17.0 pytesseract-0.3.13


In [None]:
import os
import gc
import pickle
from typing import Dict, Any, List
import re
import fitz  # PyMuPDF
import pytesseract
from pdf2image import convert_from_path
from langchain.text_splitter import RecursiveCharacterTextSplitter

# ✅ Path to the Google Drive folder for storing chunked data
SAVE_DIR = "/content/drive/MyDrive/DataScienceProjects/DocumentSummarizedChunks/"
os.makedirs(SAVE_DIR, exist_ok=True)  # Ensure the directory exists

CHUNKS_FILE = os.path.join(SAVE_DIR, "pdf_chunks.pkl")
PROCESSED_CHUNKS_FILE = os.path.join(SAVE_DIR, "processed_chunks.pkl")

def load_progress(file: str) -> set:
    """
    Loads previously processed PDFs from disk,
    or returns an empty set if file not found.
    """
    try:
        with open(file, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return set()

def save_progress(processed_files: set, file: str):
    """
    Saves processed PDF records (e.g., filenames) to disk.
    """
    with open(file, "wb") as f:
        pickle.dump(processed_files, f)

# ✅ Track which PDFs are already processed
processed_files = load_progress(PROCESSED_CHUNKS_FILE)

def extract_year(filename: str) -> str:
    """
    Extracts a valid four-digit year (19XX or 20XX) from the filename using regex.
    If no match is found, returns 'Unknown'.
    """
    match = re.search(r"\b(19|20)\d{2}\b", filename)
    return match.group(0) if match else "Unknown"

def extract_text_from_pdf(pdf_path: str) -> List[Dict[str, Any]]:
    """
    Extracts text and metadata from a PDF file using PyMuPDF (fitz) and OCR fallback.
    """
    pdf_text_data = []
    try:
        with fitz.open(pdf_path) as doc:
            for page_num in range(len(doc)):
                page = doc[page_num]
                text = page.get_text("text")

                if not text.strip():  # If no text is extracted, try OCR
                    print(f"⚠️ No text extracted from {pdf_path}, Page {page_num + 1}. Running OCR...")
                    images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1)
                    text = pytesseract.image_to_string(images[0]) if images else ""

                print(f"Extracted text from {pdf_path}, Page {page_num + 1}: {text[:200]}")  # Debug first 200 chars
                pdf_text_data.append({
                    "page_num": page_num + 1,
                    "text": text
                })
    except Exception as e:
        print(f"⚠️ Failed to extract text from {pdf_path} due to: {e}")
    return pdf_text_data

def content_aware_chunking(
    text: str,
    chunk_size: int = 500,
    chunk_overlap: int = 100,
    separators: List[str] = None
) -> List[str]:
    """
    Splits text into meaningful chunks using LangChain's RecursiveCharacterTextSplitter,
    ensuring overlap for context continuity and logical sentence boundaries.
    """
    if separators is None:
        separators = ["\n\n", ". ", "? ", "! "]

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=separators
    )
    chunks = text_splitter.split_text(text)
    print(f"Generated {len(chunks)} chunks.")
    return chunks

def chunk_pdfs(
    pdf_folder: str,
    max_tokens_per_chunk: int = 500
) -> None:
    """
    Processes PDFs in 'pdf_folder' by:
        1. Extracting text from PDFs using PyMuPDF and OCR fallback
        2. Splitting page text into content-aware chunks
        3. Storing chunk metadata (company, year, PDF path, page number, chunk_text)
        4. Saving all chunk data into 'pdf_chunks.pkl' under Google Drive
        5. Tracking processed PDFs in 'processed_chunks.pkl'
    """
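    # Start from previously saved chunks so that a re-run (which skips already
    # processed PDFs) does not overwrite CHUNKS_FILE with only the new files.
    all_chunks: Dict[str, List[Dict[str, Any]]] = {}
    if os.path.exists(CHUNKS_FILE):
        with open(CHUNKS_FILE, "rb") as f:
            all_chunks = pickle.load(f)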

    for filename in os.listdir(pdf_folder):
        if filename.lower().endswith(".pdf") and filename not in processed_files:
            pdf_path = os.path.join(pdf_folder, filename)
            print(f"🔹 Processing {filename} ...")

            # ✅ Extract text from PDF
            pages_data = extract_text_from_pdf(pdf_path)

            # ✅ Extract metadata (Company Name & Year)
            parts = filename.split('-')
            company_name = parts[0] if len(parts) > 1 else "Unknown"
            year = extract_year(filename)

            # Prepare storage for chunk data
            pdf_chunks = []

            for page in pages_data:
                page_text = page["text"]
                page_num = page["page_num"]

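                # Content-aware chunking. Note: chunk_size is measured in
                # characters (the splitter's default length function is len),
                # not in model tokens, despite the parameter name.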
                chunks = content_aware_chunking(
                    text=page_text,
                    chunk_size=max_tokens_per_chunk,
                    chunk_overlap=50
                )

                # Build chunk records
                for chunk_text in chunks:
                    chunk_record = {
                        "company": company_name,
                        "year": year,
                        "pdf_path": pdf_path,
                        "page_num": page_num,
                        "chunk_text": chunk_text
                    }
                    pdf_chunks.append(chunk_record)

            # Store chunk data under filename
            all_chunks[filename] = pdf_chunks

            # Mark PDF as processed
            processed_files.add(filename)
            save_progress(processed_files, PROCESSED_CHUNKS_FILE)

    # ✅ Save all chunk data to pdf_chunks.pkl in Google Drive
    with open(CHUNKS_FILE, "wb") as f:
        pickle.dump(all_chunks, f)

    print(f"\n✅ PDF chunking complete. Chunk data saved to '{CHUNKS_FILE}'.")
    gc.collect()

# ✅ Example usage
if __name__ == "__main__":
    pdf_folder_path = "/content/drive/MyDrive/DataScienceProjects/10Kdocuments"
    chunk_pdfs(pdf_folder_path, max_tokens_per_chunk=500)


🔹 Processing Workday-2024-10K.pdf ...
Extracted text from /content/drive/MyDrive/DataScienceProjects/10Kdocuments/Workday-2024-10K.pdf, Page 1: Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
FORM 10-K
 
(Mark One)
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 193
Extracted text from /content/drive/MyDrive/DataScienceProjects/10Kdocuments/Workday-2024-10K.pdf, Page 2: Table of Contents
TABLE OF CONTENTS
 
PART I
 
Item 1.
Business
1
Item 1A.
Risk Factors
8
Item 1B.
Unresolved Staff Comments
28
Item 1C.
Cybersecurity
28
Item 2.
Properties
30
Item 3.
Legal Proceeding
Extracted text from /content/drive/MyDrive/DataScienceProjects/10Kdocuments/Workday-2024-10K.pdf, Page 3: Table of Contents
PART I
As used in this report, the terms “Workday,” “registrant,” “we,” “us,” and “our” mean Workday, Inc. and its subsidiaries unless the context indicates
otherwise.
Our fiscal yea
Extracted text from /content/drive/MyDriv
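
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is a sketch for intuition only and is not the algorithm used by LangChain's `RecursiveCharacterTextSplitter`, which splits on separators and then merges pieces; `sliding_window_chunks` is a hypothetical helper name.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 500, chunk_overlap: int = 50
) -> List[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk consists purely of repeated overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

Each chunk begins with the last `chunk_overlap` characters of the previous one, which is what preserves context across chunk boundaries.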

In [None]:
import os
pdf_folder_path = "/content/drive/MyDrive/DataScienceProjects/10Kdocuments"
print(os.listdir(pdf_folder_path))


['Workday-2024-10K.pdf', 'Paypal-2024-10K.pdf']
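
The company and year metadata depend on the `Company-Year-10K.pdf` naming pattern seen in this folder. The sketch below mirrors the filename logic of `chunk_pdfs` and `extract_year` in a single helper; `parse_filename` is a hypothetical name, and the third filename is an invented example showing what happens when the pattern is not followed.

```python
import re
from typing import Tuple


def parse_filename(filename: str) -> Tuple[str, str]:
    """Return (company, year) using the same rules as the chunking cell."""
    parts = filename.split("-")
    company = parts[0] if len(parts) > 1 else "Unknown"
    # \b needs a non-word character (such as '-') next to the year, so a
    # year surrounded by underscores is not matched.
    match = re.search(r"\b(19|20)\d{2}\b", filename)
    year = match.group(0) if match else "Unknown"
    return company, year


for name in ["Workday-2024-10K.pdf", "Paypal-2024-10K.pdf", "annual_report_2024.pdf"]:
    print(name, "->", parse_filename(name))
```

Files that use underscores instead of hyphens fall back to `"Unknown"` for both fields, so renaming them before ingestion keeps the metadata usable for filtering.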


In [None]:
import pickle

CHUNKS_FILE = "/content/drive/MyDrive/DataScienceProjects/DocumentSummarizedChunks/pdf_chunks.pkl"

with open(CHUNKS_FILE, "rb") as f:
    chunked_data = pickle.load(f)

print(f"🔹 Found {len(chunked_data)} PDFs processed.")
for file, chunk_list in chunked_data.items():
    print(f"📄 File: {file}, Total Chunks: {len(chunk_list)}")
    # Optionally view sample chunk
    if chunk_list:
        print("Sample chunk:", chunk_list[0])

🔹 Found 2 PDFs processed.
📄 File: Workday-2024-10K.pdf, Total Chunks: 2120
Sample chunk: {'company': 'Workday', 'year': '2024', 'pdf_path': '/content/drive/MyDrive/DataScienceProjects/10Kdocuments/Workday-2024-10K.pdf', 'page_num': 1, 'chunk_text': 'Table of Contents\nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C'}
📄 File: Paypal-2024-10K.pdf, Total Chunks: 1299
Sample chunk: {'company': 'Paypal', 'year': '2024', 'pdf_path': '/content/drive/MyDrive/DataScienceProjects/10Kdocuments/Paypal-2024-10K.pdf', 'page_num': 1, 'chunk_text': 'UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C'}
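
The sample chunks above are well under the 500-character budget, so it may be worth looking at the length distribution before embedding. A minimal sketch, assuming the same `{filename: [chunk records]}` layout as `pdf_chunks.pkl`; `chunk_length_stats` is a hypothetical helper and the demo data is made up.

```python
from statistics import mean
from typing import Any, Dict, List


def chunk_length_stats(
    chunked_data: Dict[str, List[Dict[str, Any]]]
) -> Dict[str, Dict[str, float]]:
    """Summarize chunk_text lengths (in characters) per PDF."""
    stats = {}
    for filename, records in chunked_data.items():
        lengths = [len(r["chunk_text"]) for r in records]
        if not lengths:
            continue  # nothing was extracted for this file
        stats[filename] = {
            "count": len(lengths),
            "min": min(lengths),
            "mean": mean(lengths),
            "max": max(lengths),
        }
    return stats


# Toy data for illustration only.
demo = {
    "A-2024-10K.pdf": [{"chunk_text": "abc"}, {"chunk_text": "abcde"}],
    "B-2024-10K.pdf": [],
}
demo_stats = chunk_length_stats(demo)
print(demo_stats)
```

In the notebook, `chunk_length_stats(chunked_data)` could be called right after loading the pickle.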


In [None]:
# free up RAM

def free_memory():
    """Frees up RAM and GPU memory in Colab."""
    torch.cuda.empty_cache()  # ✅ Clears unused GPU memory
    gc.collect()  # ✅ Forces Python garbage collection

# ✅ Run cleanup before re-processing PDFs
free_memory()

print("✅ RAM and GPU memory cleared successfully.")

✅ RAM and GPU memory cleared successfully.


In [None]:
# check GPU utilization
!nvidia-smi

Tue Mar 11 00:12:37 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             51W /  400W |   38417MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

### Old code cells (summarize using Llama 2)

In [None]:
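# need enhancements - old

import gc
import pickle
from concurrent.futures import ThreadPoolExecutor

import torch
from transformers import pipeline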

# ✅ File Paths
CHUNKS_FILE = "pdf_chunks.pkl"
PROCESSED_SUMMARIES_FILE = "processed_summaries.pkl"
SUMMARIZED_CHUNKS_FILE = "/content/drive/MyDrive/DataScienceProjects/DocumentSummarizedChunks.pkl"

# ✅ Check GPU availability
device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32  # Use lower precision on GPU

try:
    summarizer = pipeline(
        "text-generation",
        model="meta-llama/Llama-2-7b-chat-hf",
        device=device,
        torch_dtype=torch_dtype
    )
    print(f"✅ Model loaded successfully on {'GPU' if device == 0 else 'CPU'}.")
except Exception as e:
    print(f"⚠️ Error loading model: {e}")

# ✅ Load Chunked Documents (Includes Metadata)
def load_chunked_documents():
    try:
        with open(CHUNKS_FILE, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        print("⚠️ No chunked documents found! Run chunking first.")
        return {}

# ✅ Load Already Processed Summaries
def load_progress(file):
    try:
        with open(file, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return set()

# ✅ Save Progress
def save_progress(processed_files, file_path):
    with open(file_path, "wb") as f:
        pickle.dump(processed_files, f)

# ✅ Summarization Function
def summarize_chunk(chunk):
    """Summarizes a chunk while ensuring large chunks are processed in smaller segments."""
    try:
        if len(chunk) > 1000:
            split_chunks = [chunk[i:i+300] for i in range(0, len(chunk), 300)]
            summaries = []
            for sub_chunk in split_chunks:
                result = summarizer(
                    sub_chunk,
                    max_new_tokens=300,
                    do_sample=False
                )[0]['generated_text'].strip()
                summaries.append(result)
            return " ".join(summaries)  # ✅ Merge summarized sections
        return chunk  # ✅ Return as is if small
    except Exception as e:
        print(f"⚠️ Error in summarizing chunk: {e}")
        return "⚠️ Summarization failed for this chunk"

# ✅ Summarize All Chunks in Parallel
def parallel_summarize_chunks(batch_size=50):
    """
    Summarizes document chunks batch by batch through a ThreadPoolExecutor.
    With max_workers=1 the chunks are processed one at a time, so this is
    effectively sequential; memory is cleared between batches.
    """
    chunked_documents = load_chunked_documents()
    processed_files = load_progress(PROCESSED_SUMMARIES_FILE)
    summaries = {}

    torch.cuda.empty_cache()
    gc.collect()

    for filename, data in chunked_documents.items():
        if filename not in processed_files:
            company = data["company"]
            year = data["year"]
            chunks = data["chunks"]

            print(f"\n🔹 Summarizing {filename} (Company: {company}, Year: {year})...")
            total_chunks = len(chunks)
            print(f"   📌 Found {total_chunks} chunks to summarize.")

            if total_chunks == 0:
                print(f"⚠️ No chunks found for {filename}. Skipping.")
                continue

            summaries[filename] = {
                "company": company,
                "year": year,
                "summarized_chunks": []
            }

            for i in range(0, total_chunks, batch_size):
                batch_chunks = chunks[i:i + batch_size]
                print(f"   📌 Summarizing batch {i // batch_size + 1} of {filename}...")

                with ThreadPoolExecutor(max_workers=1) as executor:
                    batch_summaries = list(executor.map(summarize_chunk, batch_chunks))
                    summaries[filename]["summarized_chunks"].extend(batch_summaries)

                torch.cuda.empty_cache()
                gc.collect()

            # ✅ Save Summarized Chunks with Metadata
            with open(SUMMARIZED_CHUNKS_FILE, "wb") as f:
                pickle.dump(summaries, f)

            processed_files.add(filename)
            save_progress(processed_files, PROCESSED_SUMMARIES_FILE)

            print(f"✅ {filename} summarization **completed** ({total_chunks} chunks summarized).")

    print("\n🎯 ✅ **Summarization complete for all documents.** ✅ 🎯")

# ✅ Run the Summarization
parallel_summarize_chunks(batch_size=50)


Validate that summarization succeeded and the results were saved

In [None]:
print(summaries.keys())

dict_keys(['Workday-2024-10K.pdf', 'Paypal-2024-10K.pdf'])


In [None]:
# ✅ Save Summarized Chunks
with open(PROCESSED_SUMMARIES_FILE, "wb") as f:
    pickle.dump(summaries, f)

print("✅ Summaries successfully saved to disk.")

✅ Summaries successfully saved to disk.


In [None]:
PROCESSED_SUMMARIES_FILE = "/content/drive/MyDrive/DataScienceProjects/DocumentSummarizedChunks/processed_summaries.pkl"

try:
    with open(PROCESSED_SUMMARIES_FILE, "rb") as f:
        processed_summaries = pickle.load(f)
    print(f"✅ {len(processed_summaries)} PDFs summarized already: {processed_summaries.keys()}")
except FileNotFoundError:
    print("⚠️ No summaries found. Check if they were saved correctly.")

✅ 2 PDFs summarized already: dict_keys(['Workday-2024-10K.pdf', 'Paypal-2024-10K.pdf'])


### Embedding chunks to vector database

With the updated code cell, it took 7.5 minutes to embed the chunks of 2 PDFs.

In [None]:
# optimized for memory

import gc
import os
import pickle
import uuid

import numpy as np
import torch
from transformers import BitsAndBytesConfig
from tqdm import tqdm  # Progress bar for better monitoring

# Make sure you have:
# 1. Defined `embed_text_batch` in cell #3
# 2. Initialized Pinecone client & index (with dimension=3584)

CHUNKS_FILE = "/content/drive/MyDrive/DataScienceProjects/DocumentSummarizedChunks/pdf_chunks.pkl"
PROCESSED_EMBEDDINGS_FILE = "/content/drive/MyDrive/DataScienceProjects/DocumentSummarizedChunks/processed_embeddings.pkl"

# ✅ Load or initialize the set of PDFs that were already embedded
try:
    with open(PROCESSED_EMBEDDINGS_FILE, "rb") as f:
        processed_embeddings = pickle.load(f)
    print(f"✅ Loaded {len(processed_embeddings)} already embedded PDFs.")
except FileNotFoundError:
    processed_embeddings = set()
    print("⚠️ No embedded PDFs found yet. Starting from scratch.")

def embed_and_upsert_chunks(batch_size: int = 10, upsert_batch_size: int = 100):
    """
    Memory-conscious function to:
    - Load the chunked text data from disk (the pickle is read fully into RAM).
    - Generate embeddings in small batches to limit peak memory use.
    - Upsert in mini-batches to **avoid high RAM spikes**.

    Parameters:
        batch_size (int): Number of texts to embed at once.
        upsert_batch_size (int): Number of embeddings to store before sending to Pinecone.
    """
    if not os.path.exists(CHUNKS_FILE):
        print(f"⚠️ '{CHUNKS_FILE}' not found. Run the PDF chunking step first.")
        return

    # ✅ Load chunked data (pickle.load reads the whole file into RAM)
    with open(CHUNKS_FILE, "rb") as f:
        chunked_docs = pickle.load(f)

    print(f"✅ Loaded metadata for {len(chunked_docs)} PDFs.")

    torch.cuda.empty_cache()
    gc.collect()

    vectors_to_upsert = []

    # Loop through all PDFs
    for filename, chunk_records in chunked_docs.items():
        if filename in processed_embeddings:
            print(f"⚠️ Already embedded '{filename}'. Skipping.")
            continue

        print(f"\n🔹 Embedding chunks from {filename} ...")

        # Process PDF in memory-efficient sub-batches
        for start_idx in tqdm(range(0, len(chunk_records), batch_size), desc=f"Processing {filename}"):
            sub_batch = chunk_records[start_idx: start_idx + batch_size]
            texts = [cr["chunk_text"] for cr in sub_batch]

            # ✅ Embed sub-batch using quantized model
            with torch.no_grad():
                embeddings = embed_text_batch(texts)  # Output shape: (len(sub_batch), 3584)

            # ✅ Convert embeddings to FP16 NumPy (low RAM usage)
            embeddings = embeddings.astype(np.float16)

            # ✅ Prepare vectors for Pinecone
            for i, emb_vec in enumerate(embeddings):
                chunk_info = sub_batch[i]
                vector_id = str(uuid.uuid4())
                metadata = {
                    "company": chunk_info["company"],
                    "year": chunk_info["year"],
                    "pdf_path": chunk_info["pdf_path"],
                    "page_num": chunk_info["page_num"],
                    "chunk_text": chunk_info["chunk_text"][:2000]  # Truncate to save space
                }

                vectors_to_upsert.append((vector_id, emb_vec.tolist(), metadata))

            # ✅ Free memory after embedding batch
            del embeddings
            torch.cuda.empty_cache()
            gc.collect()

            # ✅ Upsert in smaller batches to avoid memory spikes
            if len(vectors_to_upsert) >= upsert_batch_size:
                index.upsert(vectors_to_upsert)
                vectors_to_upsert.clear()
                torch.cuda.empty_cache()
                gc.collect()

        # ✅ Final upsert for remaining vectors
        if vectors_to_upsert:
            index.upsert(vectors_to_upsert)
            vectors_to_upsert.clear()

        processed_embeddings.add(filename)
        with open(PROCESSED_EMBEDDINGS_FILE, "wb") as f:
            pickle.dump(processed_embeddings, f)

        print(f"✅ Finished embedding & upsert for '{filename}'.")

    print("\n🎯 ✅ All PDFs have been embedded & upserted to Pinecone efficiently. 🎯")
    gc.collect()

# ✅ Example usage (lower batch size to fit within 25GB RAM)
if __name__ == "__main__":
    embed_and_upsert_chunks(batch_size=10, upsert_batch_size=50)


⚠️ No embedded PDFs found yet. Starting from scratch.
✅ Loaded metadata for 2 PDFs.

🔹 Embedding chunks from Workday-2024-10K.pdf ...


Processing Workday-2024-10K.pdf: 100%|██████████| 212/212 [03:55<00:00,  1.11s/it]


✅ Finished embedding & upsert for 'Workday-2024-10K.pdf'.

🔹 Embedding chunks from Paypal-2024-10K.pdf ...


Processing Paypal-2024-10K.pdf: 100%|██████████| 130/130 [02:29<00:00,  1.15s/it]


✅ Finished embedding & upsert for 'Paypal-2024-10K.pdf'.

🎯 ✅ All PDFs have been embedded & upserted to Pinecone efficiently. 🎯
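
The progress bars are consistent with simple batch arithmetic: the number of iterations is the chunk count divided by `batch_size`, rounded up. The sketch below reproduces that calculation from the figures printed in this notebook; the helper names are hypothetical and the per-batch times are the averages tqdm reported, not independent measurements.

```python
import math


def num_batches(n_items: int, batch_size: int) -> int:
    """Number of embedding calls needed to cover n_items."""
    return math.ceil(n_items / batch_size)


def estimated_seconds(n_items: int, batch_size: int, seconds_per_batch: float) -> float:
    """Rough embedding time, ignoring model loading and other overhead."""
    return num_batches(n_items, batch_size) * seconds_per_batch


# Chunk counts and s/it values taken from the outputs in this notebook.
workday_batches = num_batches(2120, 10)
paypal_batches = num_batches(1299, 10)
total_seconds = estimated_seconds(2120, 10, 1.11) + estimated_seconds(1299, 10, 1.15)

print(f"Workday batches: {workday_batches}, Paypal batches: {paypal_batches}")
print(f"Estimated embedding time: {total_seconds / 60:.1f} minutes")
```

The same helpers give a quick estimate of how long a new batch of 10-K filings would take to embed at the current `batch_size`.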


## RAG Pipeline

In [None]:
# check vectors stored in Pinecone

index.describe_index_stats()

{'dimension': 3584,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 3419}},
 'total_vector_count': 3419,
 'vector_type': 'dense'}
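
The reported `total_vector_count` can be cross-checked against the number of chunks produced earlier (2120 + 1299). A minimal sketch, assuming the index stats are available as a plain dict with a `total_vector_count` key; whether the object returned by the Pinecone client supports key access directly may depend on the client version, and the helper names here are hypothetical.

```python
from typing import Any, Dict, List


def expected_vector_count(chunked_data: Dict[str, List[Any]]) -> int:
    """One vector is upserted per chunk record."""
    return sum(len(records) for records in chunked_data.values())


def index_matches_chunks(stats: Dict[str, Any], chunked_data: Dict[str, List[Any]]) -> bool:
    """True when the index holds exactly one vector per chunk."""
    return stats["total_vector_count"] == expected_vector_count(chunked_data)


# Placeholder records sized like the chunk counts printed earlier.
demo_chunks = {
    "Workday-2024-10K.pdf": [{}] * 2120,
    "Paypal-2024-10K.pdf": [{}] * 1299,
}
demo_stats = {"total_vector_count": 3419}

print("Index matches chunk count:", index_matches_chunks(demo_stats, demo_chunks))
```

Because vector IDs are random UUIDs, re-running the embedding cell for an already embedded PDF would add duplicates, and a check like this would surface the mismatch.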

🔹 Step 1: Define Required Imports & Initialize Models

Load Llama 3.2 and define the summarization function

In [None]:
import torch
import gc
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from difflib import SequenceMatcher

######################
#    MODEL LOADING   #
######################

# ✅ 4-bit quantization config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=False
)

LLAMA3B_MODEL_NAME = "meta-llama/Llama-3.2-3B"  # 3B-parameter base (non-instruct) model

# ✅ Load the tokenizer
tokenizer_llama3b = AutoTokenizer.from_pretrained(LLAMA3B_MODEL_NAME)

# ✅ Load the Llama-3.2-3B model with 4-bit quantization
model_llama3b = AutoModelForCausalLM.from_pretrained(
    LLAMA3B_MODEL_NAME,
    device_map="auto",  # rely on accelerate for device assignment
    quantization_config=quant_config
)

# ✅ Create the text-generation pipeline with EOS token for stopping
summarizer_pipeline = pipeline(
    "text-generation",
    model=model_llama3b,
    tokenizer=tokenizer_llama3b,
    eos_token_id=tokenizer_llama3b.eos_token_id,  # Stop at EOS token
    pad_token_id=tokenizer_llama3b.eos_token_id  # Avoids unwanted padding
)

print("✅ Llama-3.2-3B loaded in 4-bit quantization successfully!")
gc.collect()
torch.cuda.empty_cache()

######################
# SUMMARIZATION CODE #
######################

def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """
    Determines whether two strings are similar based on a similarity threshold.

    Args:
        a (str): First string.
        b (str): Second string.
        threshold (float): Similarity ratio threshold (0 to 1).

    Returns:
        bool: True if strings are similar, False otherwise.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold  # Lowercase for better match


def summarize_text(text: str, max_new_tokens: int = 300) -> str:
    """
    Summarizes the given text using the loaded Llama-3.2-3B model in 4-bit quantization.

    Args:
        text (str): The text to be summarized.
        max_new_tokens (int): Maximum tokens for the summary.

    Returns:
        str: A concise summary with unique bullet points.
    """
    prompt = f"""
    ### Instruction:
    You are a financial analysis expert. Summarize the following text in a structured manner using bullet points.
    Ensure that each bullet point is unique and does not repeat the same information.
    Keep your summary concise (3-5 bullet points) and focus only on the key insights.
    Do not provide duplicated sentences.

    ### Text:
    {text}

    ### Summary:
    """

    result = summarizer_pipeline(
        prompt,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        do_sample=False
    )[0]["generated_text"].strip()

    # Ensure we only return content after the ### Summary: header
    if "### Summary:" in result:
        result = result.split("### Summary:")[-1].strip()

    # Post-processing: remove duplicate bullet points using similarity check
    lines = result.splitlines()
    unique_lines = []

    for line in lines:
        line = line.strip()
        # Consider non-empty bullet points only
        if line and not any(is_similar(line, u_line) for u_line in unique_lines):
            unique_lines.append(line)

    final_summary = "\n".join(unique_lines)
    return final_summary

######################
# QUICK TEST        #
######################

if __name__ == "__main__":
    test_text = (
        "Adobe Experience Cloud provides AI-powered solutions for digital content "
        "management and customer engagement. It helps businesses personalize experiences, "
        "optimize workflows, and improve automation. Adobe also uses AI-driven insights "
        "to enhance content delivery and customer satisfaction."
    )

    summary_output = summarize_text(test_text, max_new_tokens=300)
    print("🔹 **Test Summary:**\n", summary_output)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


✅ Llama-3.2-3B loaded in 4-bit quantization successfully!
🔹 **Test Summary:**
 - AI-powered solutions for digital content management and customer engagement
- Helps businesses personalize experiences, optimize workflows, and improve automation
- Adobe also uses AI-driven insights to enhance content delivery and customer satisfaction
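
The duplicate-removal step in `summarize_text` can be exercised on its own, without loading a model, to see what the 0.8 threshold does. The sketch below reuses the same `SequenceMatcher` comparison; `dedupe_lines` is a hypothetical wrapper around the loop in the cell above, and the bullet points are invented examples.

```python
from difflib import SequenceMatcher
from typing import List


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Case-insensitive similarity check, as in the summarization cell."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold


def dedupe_lines(lines: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the first occurrence of each line, dropping near-duplicates."""
    unique: List[str] = []
    for line in lines:
        line = line.strip()
        if line and not any(is_similar(line, kept, threshold) for kept in unique):
            unique.append(line)
    return unique


bullets = [
    "- Revenue grew 10% in fiscal 2024",
    "- Revenue grew 10% in fiscal 2024.",  # near-duplicate of the first line
    "",
    "- Operating margin declined",
]
deduped = dedupe_lines(bullets)
print("\n".join(deduped))
```

Each new line is compared against every line kept so far, so the cost grows quadratically with the number of lines; for summaries of 3-5 bullet points this is negligible.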


In [None]:
# EXAMPLE USAGE:
paragraph = """At Workday, innovation is a core value. Our culture encourages out-of-the-box thinking and creativity, which enables us to create applications designed to
change the way people work. Our architecture enables us to deploy our solutions rapidly to meet evolving business needs. We invest a significant percentage of our
resources in product development and are committed to rapidly building and/or acquiring new applications and solutions. Our product development organization is
responsible for product design, development, testing, and certification. We focus our efforts on developing new applications and core technologies, as well as
further enhancing the usability, functionality, reliability, security, performance, and flexibility of existing applications. To grow our unified suite of Workday
applications, we primarily invest in research and development, but we also selectively acquire companies that are consistent with our design principles, existing
product set, corporate strategy, and company culture. We also manage a portfolio of strategic investments through Workday Ventures, our strategic investment arm.
We invest primarily in enterprise cloud technology companies that we believe are digitally transforming their industries, improving customer experiences, helping
us expand our solution ecosystem or supporting other corporate initiatives. We plan to continue making these types of strategic investments as opportunities arise
 that we find attractive.
"""

summary_output = summarize_text(paragraph, max_new_tokens=150)

print("🔹 **Paragraph:**\n", paragraph)
print("\n🔹 **Generated Summary:**\n", summary_output)

🔹 **Paragraph:**
 At Workday, innovation is a core value. Our culture encourages out-of-the-box thinking and creativity, which enables us to create applications designed to
change the way people work. Our architecture enables us to deploy our solutions rapidly to meet evolving business needs. We invest a significant percentage of our
resources in product development and are committed to rapidly building and/or acquiring new applications and solutions. Our product development organization is
responsible for product design, development, testing, and certification. We focus our efforts on developing new applications and core technologies, as well as
further enhancing the usability, functionality, reliability, security, performance, and flexibility of existing applications. To grow our unified suite of Workday
applications, we primarily invest in research and development, but we also selectively acquire companies that are consistent with our design principles, existing
product set, corpora

Load TinyLlama and define the query expansion function

In [None]:
import torch
import gc
from transformers import (
    pipeline,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

#######################
# TINYLLAMA 4-BIT SETUP
#######################
# If you absolutely must keep TinyLlama on CPU, remove "quant_config" and set device=-1
# (but then no 4-bit quant.)

TINYLAMA_MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# ✅ Use bitsandbytes for 4-bit quant (designed for GPU usage)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=False  # No CPU offload; all modules stay on the GPU
)

# Load tokenizer
tokenizer_tiny = AutoTokenizer.from_pretrained(TINYLAMA_MODEL_NAME)

# Load TinyLlama 4-bit on GPU with accelerate
# This uses very little VRAM compared to full precision
model_tiny = AutoModelForCausalLM.from_pretrained(
    TINYLAMA_MODEL_NAME,
    device_map="auto",        # Let accelerate place layers on GPU
    quantization_config=quant_config
)

# ✅ Create pipeline (no device arg, to avoid accelerate conflict)
tinyllama_pipeline = pipeline(
    "text-generation",
    model=model_tiny,
    tokenizer=tokenizer_tiny,
)

print("✅ TinyLlama loaded in 4-bit quantization (GPU) successfully.")
gc.collect()
torch.cuda.empty_cache()

##############################
# QUERY EXPANSION FUNCTION
##############################

def expand_query_tinyllama(query: str, max_new_tokens: int = 50) -> str:
    """
    Expands the user's query for better retrieval or searching,
    using the 4-bit quantized TinyLlama model.

    Args:
        query (str): The original user query.
        max_new_tokens (int): Max tokens for the expansion output.

    Returns:
        str: The expanded query text.
    """
    prompt = f"""
    ### Instruction:
    You are an AI assistant that refines and expands user queries for improved search.
    Take the user’s query and add synonyms, related terms, or clarifications,
    ensuring the original intent is preserved but adding coverage for potential
    relevant concepts.

    **User Query:**
    {query}

    **Expanded Query:**
    """

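    # Use the pipeline to generate text.
    # return_full_text=False keeps only the newly generated tokens; without it
    # the text-generation pipeline echoes the whole prompt back in the output.
    result = tinyllama_pipeline(
        prompt,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        do_sample=False,  # greedy decoding, so the expansion is deterministic
        return_full_text=False
    )

    expanded_text = result[0]["generated_text"].strip()
    return expanded_text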


##############################
# EXAMPLE USAGE
##############################
if __name__ == "__main__":
    user_query = "What are the top AI trends for enterprise software?"
    expanded_q = expand_query_tinyllama(user_query, max_new_tokens=60)
    print("🔹 Original Query:", user_query)
    print("\n🔹 Expanded Query:\n", expanded_q)


Device set to use cuda:0


✅ TinyLlama loaded in 4-bit quantization (GPU) successfully.
🔹 Original Query: What are the top AI trends for enterprise software?

🔹 Expanded Query:
 ### Instruction:
    You are an AI assistant that refines and expands user queries for improved search.
    Take the user’s query and add synonyms, related terms, or clarifications,
    ensuring the original intent is preserved but adding coverage for potential
    relevant concepts.

    **User Query:**
    What are the top AI trends for enterprise software?

    **Expanded Query:**
    1. Top AI trends for enterprise software
    2. Enterprise AI trends
    3. Enterprise AI trends for software
    4. Enterprise AI trends for software development
    5. Enterprise AI trends for software development process
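
The recorded output above contains the whole instruction prompt followed by the numbered expansions, which is what a text-generation pipeline produces when it returns the full text. If that string were embedded for retrieval, the instruction wording would dilute the query. Below is a minimal sketch of a post-processing step that keeps only what follows the `**Expanded Query:**` marker. `extract_expansion` and its fallback to the original query are hypothetical and not part of the notebook.

```python
def extract_expansion(generated_text: str, original_query: str,
                      marker: str = "**Expanded Query:**") -> str:
    """Return only the text after the last marker; fall back to the original query."""
    # rpartition splits on the LAST occurrence of the marker. If the marker is
    # absent, the whole input comes back as the tail, so nothing is lost.
    _, _, tail = generated_text.rpartition(marker)
    # Collapse the indented, numbered lines into one search-friendly string.
    lines = [line.strip() for line in tail.splitlines() if line.strip()]
    expansion = " ".join(lines)
    return expansion or original_query


raw = """### Instruction:
    You are an AI assistant that refines and expands user queries for improved search.

    **User Query:**
    What are the top AI trends for enterprise software?

    **Expanded Query:**
    1. Top AI trends for enterprise software
    2. Enterprise AI trends
"""
print(extract_expansion(raw, "What are the top AI trends for enterprise software?"))
```

The fallback matters for small models: if generation produces nothing after the marker, retrieval runs on the user's original query instead of an empty string.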


In [None]:
def clean_memory():
    """Frees up GPU memory before running models."""
    torch.cuda.empty_cache()
    gc.collect()

clean_memory()  # ✅ Run before loading models

## RAG: Basic RAG pipeline (query expansion -> retrieval -> summary or SWOT response)

-> Define LangGraph state schema, define pipeline steps, compile pipeline and invoke pipeline
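
It may help to see the same flow without the library first: each step is a function that takes the state dict and returns an updated copy, and the steps run in a fixed order. The sketch below is plain Python, not LangGraph, and every step body is a placeholder (no model or Pinecone call). Names such as `run_linear_pipeline` are made up for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Step = Callable[[State], State]


def expand_query(state: State) -> State:
    # Placeholder: the notebook calls TinyLlama here.
    return {**state, "expanded_query": state["query"] + " (expanded)"}


def retrieve_context(state: State) -> State:
    # Placeholder: the notebook queries Pinecone here.
    return {**state, "context": f"chunks for {state['company']} {state['year']}"}


def generate_response(state: State) -> State:
    # Placeholder: the notebook calls an LLM here.
    return {**state, "final_answer": f"answer based on: {state['context']}"}


def run_linear_pipeline(state: State, steps: List[Step]) -> State:
    """Apply each step in order, threading the state through."""
    for step in steps:
        state = step(state)
    return state


result = run_linear_pipeline(
    {"query": "SWOT for Workday", "company": "Workday", "year": "2024"},
    [expand_query, retrieve_context, generate_response],
)
print(result["final_answer"])
```

LangGraph builds on this idea with named nodes, explicit edges, and the option of conditional branching between steps.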

In [None]:
import torch
import gc
import uuid
import numpy as np
from typing import TypedDict, Optional
from langgraph.graph import StateGraph, END

###########################################
# 1. Define the RAG State Schema
###########################################
class RAGState(TypedDict):
    query: str
    expanded_query: str
    company: str
    year: str
    context: Optional[str]
    swot_analysis: Optional[str]
    validated_response: str
    response_type: Optional[str]
    final_answer: Optional[str]

###########################################
# 2. Wrapper for Single-Text Embedding (using embed_text_batch)
###########################################
def embed_text(text: str) -> np.ndarray:
    """
    Embeds a single text string using embed_text_batch.
    Returns a NumPy array of shape (1, EMBEDDING_DIM).
    """
    embeddings = embed_text_batch([text])
    return embeddings

###########################################
# 3. Retrieval: Query Pinecone for Context
###########################################
def retrieve_context(state: RAGState) -> RAGState:
    try:
        query = state["expanded_query"]
        company = state["company"]
        year = state["year"]
        print(f"\n🔹 Retrieving summarized context for {company} ({year})...")

        query_embedding = embed_text(query)  # (1, EMBEDDING_DIM)
        query_vector = query_embedding[0].tolist()

        response = index.query(
            vector=query_vector,
            top_k=5,
            include_metadata=True,
            filter={"company": {"$eq": company}, "year": {"$eq": year}}
        )

        retrieved_texts = [
            match["metadata"].get("chunk_text", "")
            for match in response["matches"]
            if "metadata" in match
        ]
        deduped_texts = list(dict.fromkeys([t for t in retrieved_texts if t.strip()]))

        if not deduped_texts:
            print(f"⚠️ No relevant context found for {company} ({year}).")
            context = "No relevant context found."
        else:
            context = "\n\n".join(deduped_texts)
            print(f"✅ Retrieved {len(deduped_texts)} unique text chunk(s).")

        state["context"] = context
        torch.cuda.empty_cache()
        gc.collect()
        return state

    except Exception as e:
        print(f"⚠️ Error retrieving context: {e}")
        state["context"] = "Error retrieving context."
        return state

###########################################
# 4. Determine Response Type
###########################################
def determine_response_type(state: RAGState) -> RAGState:
    query_lower = state["query"].lower()
    if "swot" in query_lower or "company analysis" in query_lower:
        state["response_type"] = "swot"
    elif any(word in query_lower for word in ["explain", "analyze", "describe", "reason", "how", "why"]):
        state["response_type"] = "reasoning"
    else:
        state["response_type"] = "simple"
    print(f"🔹 Determined response type: {state['response_type']}")
    return state

###########################################
# 5. Response Generation Functions
###########################################
def generate_simple_answer(state: RAGState) -> RAGState:
    prompt = f"""Answer concisely based on the context below. Keep your answer to 2–3 sentences.

Context: {state['context']}

Question: {state['query']}

Answer:"""
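    # return_full_text=False drops the echoed prompt so only the answer is stored
    answer = tinyllama_pipeline(
        prompt, max_new_tokens=150, do_sample=False, return_full_text=False
    )[0]["generated_text"].strip()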
    state["final_answer"] = answer
    return state

def generate_reasoning_answer(state: RAGState) -> RAGState:
    prompt = f"""Based on the context below, provide a detailed yet concise answer to the following question.
Keep your answer to 2–3 sentences.

Context: {state['context']}

Question: {state['query']}

Answer:"""
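    # return_full_text=False drops the echoed prompt so only the answer is stored
    # (assumes summarizer_pipeline is a "text-generation" pipeline)
    answer = summarizer_pipeline(
        prompt, max_new_tokens=200, do_sample=False, return_full_text=False
    )[0]["generated_text"].strip()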
    state["final_answer"] = answer
    return state

def generate_swot_answer(state: RAGState) -> RAGState:
    prompt = f"""Please carefully scan the entire context provided below and extract all key points.
Generate a structured SWOT analysis for the company with the following five distinct sections:

1. **Strengths**: Summarize the unique strengths mentioned in the document as bullet points.
2. **Weaknesses**: Summarize the distinct weaknesses as bullet points, ensuring these do not overlap with the strengths.
3. **Opportunities**: Summarize all potential opportunities for growth and improvement as bullet points.
4. **Threats**: Summarize the risks or threats mentioned as bullet points.
5. **Summary**: Provide a concise summary (2–3 sentences) highlighting the potential growth area for the company. You can be creative.

Make sure each section is clear and does not repeat content from the other sections.

Context:
{state['context']}

Question: {state['query']}

SWOT Analysis:"""

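    # return_full_text=False drops the echoed prompt (instructions + context),
    # so only the generated SWOT text is passed on to validation.
    answer = summarizer_pipeline(
        prompt,
        max_new_tokens=1000,
        do_sample=False,
        return_full_text=False
    )[0]["generated_text"].strip()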

    state["final_answer"] = answer
    return state


def generate_response(state: RAGState) -> RAGState:
    rt = state.get("response_type", "simple")
    if rt == "swot":
        return generate_swot_answer(state)
    elif rt == "reasoning":
        return generate_reasoning_answer(state)
    else:
        return generate_simple_answer(state)

###########################################
# 6. Validate Final Response using Llama-3.2
###########################################
def validate_response(state: RAGState) -> RAGState:
    prompt = f"""Validate and refine the following response for clarity, accuracy, and coherence.
Ensure duplicate content is removed and the answer is concise (2–3 sentences for non-SWOT answers).

Response to validate:
{state['final_answer']}

Validated Response:"""
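    inputs = tokenizer_llama3b(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model_llama3b.device)
    with torch.no_grad():
        outputs = model_llama3b.generate(**inputs, max_new_tokens=150, do_sample=False)
    # generate() returns the prompt tokens followed by the new tokens; decode only
    # the new ones so the stored response does not start with the validation prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    validated_response = tokenizer_llama3b.decode(new_tokens, skip_special_tokens=True).strip()
    state["validated_response"] = validated_response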
    return state

###########################################
# 7. Build the LangGraph Pipeline
###########################################
rag_graph = StateGraph(RAGState)
rag_graph.add_node("expand_query", lambda state: {**state, "expanded_query": expand_query_tinyllama(state["query"])})
rag_graph.add_node("retrieve_context", retrieve_context)
rag_graph.add_node("determine_response", determine_response_type)
rag_graph.add_node("generate_response", generate_response)
rag_graph.add_node("validate_response", validate_response)

rag_graph.add_edge("expand_query", "retrieve_context")
rag_graph.add_edge("retrieve_context", "determine_response")
rag_graph.add_edge("determine_response", "generate_response")
rag_graph.add_edge("generate_response", "validate_response")
rag_graph.add_edge("validate_response", END)

rag_graph.set_entry_point("expand_query")
rag_pipeline = rag_graph.compile()

###########################################
# 8. Test the RAG Pipeline
###########################################
if __name__ == "__main__":
    initial_state: RAGState = {
        "query": "Provide a SWOT analysis of Workday's approach to product development.",
        "expanded_query": "",
        "company": "Workday",
        "year": "2024",
        "context": None,
        "swot_analysis": None,
        "validated_response": "",
        "response_type": None,
        "final_answer": None
    }
    result_state = rag_pipeline.invoke(initial_state)
    print("\n🎯 Final Validated Response:\n", result_state["validated_response"])



🔹 Retrieving summarized context for Workday (2024)...
✅ Retrieved 5 unique text chunk(s).
🔹 Determined response type: swot


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



🎯 Final Validated Response:
 Validate and refine the following response for clarity, accuracy, and coherence.
Ensure duplicate content is removed and the answer is concise (2–3 sentences for non-SWOT answers).

Response to validate:
Please carefully scan the entire context provided below and extract all key points.
Generate a structured SWOT analysis for the company with the following five distinct sections:

1. **Strengths**: Summarize the unique strengths mentioned in the document as bullet points.
2. **Weaknesses**: Summarize the distinct weaknesses as bullet points, ensuring these do not overlap with the strengths.
3. **Opportunities**: Summarize all potential opportunities for growth and improvement as bullet points.
4. **Threats**: Summarize the risks or threats mentioned as bullet points.
5. **Summary**: Provide a concise summary (2–3 sentences) highlighting the potential growth area for the company. You can be creative.

Make sure each section is clear and does not repeat cont

clean up memory

In [None]:
def clean_memory():
    """Frees up GPU memory before running models."""
    torch.cuda.empty_cache()
    gc.collect()

clean_memory()  # ✅ Run before loading models


test SWOT response

In [None]:
query = "Provide a SWOT analysis of Paypal's 2024 financial performance."
company_name = "Paypal"
year_to_search = "2024"

# ✅ Run the pipeline
result = rag_pipeline.invoke({"query": query, "company": company_name, "year": year_to_search})

print("\n🎯 **Final Validated Response:**\n", result["validated_response"])


🔹 Retrieving summarized context for Paypal (2024)...
✅ Retrieved 5 unique text chunk(s).
🔹 Determined response type: swot


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



🎯 **Final Validated Response:**
 Validate and refine the following response for clarity, accuracy, and coherence.
Ensure duplicate content is removed and the answer is concise (2–3 sentences for non-SWOT answers).

Response to validate:
Please carefully scan the entire context provided below and extract all key points.
Generate a structured SWOT analysis for the company with the following five distinct sections:

1. **Strengths**: Summarize the unique strengths mentioned in the document as bullet points.
2. **Weaknesses**: Summarize the distinct weaknesses as bullet points, ensuring these do not overlap with the strengths.
3. **Opportunities**: Summarize all potential opportunities for growth and improvement as bullet points.
4. **Threats**: Summarize the risks or threats mentioned as bullet points.
5. **Summary**: Provide a concise summary (2–3 sentences) highlighting the potential growth area for the company. You can be creative.

Make sure each section is clear and does not repeat 

test general response
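
For general (non-SWOT) questions, the pipeline's router chooses between the `simple` and `reasoning` paths with substring checks, so a word like "show" matches the keyword "how". Below is a small sketch of a whole-word variant of that routing rule. `classify_query` is a hypothetical stand-in for `determine_response_type` that reuses the same keywords; it is not the notebook's code.

```python
import re

REASONING_WORDS = ["explain", "analyze", "describe", "reason", "how", "why"]
# \b anchors each keyword to word boundaries, so "show" no longer matches "how".
REASONING_PATTERN = re.compile(r"\b(" + "|".join(REASONING_WORDS) + r")\b")


def classify_query(query: str) -> str:
    """Return 'swot', 'reasoning', or 'simple' for a user query."""
    query_lower = query.lower()
    if "swot" in query_lower or "company analysis" in query_lower:
        return "swot"
    if REASONING_PATTERN.search(query_lower):
        return "reasoning"
    return "simple"


for q in [
    "Show me Workday's total revenue.",
    "Why did Workday's operating margin change?",
    "Provide a SWOT analysis of Paypal.",
]:
    print(f"{classify_query(q):9} <- {q}")
```

The trade-off is that whole-word matching also skips inflected forms such as "explains" or "reasons", so the keyword list would need those variants if they should count.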

### troubleshoot retrieval issue

In [None]:
index_stats = index.describe_index_stats()
print(index_stats)

{'dimension': 3584,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 2592}},
 'total_vector_count': 2592,
 'vector_type': 'dense'}
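
One thing the stats above make easy to verify is that a query vector matches the index: it has to be a flat list of 3584 floats. A wrapper like `embed_text` returns a 2-D array of shape (1, dim), so forgetting to take row 0 yields a nested list instead. Below is a minimal pre-flight check in plain Python. `check_query_vector` is a hypothetical helper, and it expects Python lists (a NumPy array should be converted with `.tolist()` first).

```python
from typing import List, Sequence


def check_query_vector(vector: Sequence, expected_dim: int) -> List[float]:
    """Validate a query vector and return it as a flat list of floats."""
    values = list(vector)
    # A single nested row such as [[0.1, 0.2, ...]] is unwrapped; anything
    # with more rows is rejected because one query needs exactly one vector.
    if values and isinstance(values[0], (list, tuple)):
        if len(values) != 1:
            raise ValueError(f"expected one vector, got {len(values)} rows")
        values = list(values[0])
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    return [float(v) for v in values]


INDEX_DIM = 3584  # 'dimension' from describe_index_stats() above

flat = check_query_vector([[0.0] * INDEX_DIM], INDEX_DIM)
print(len(flat))
```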


Testing code cells

Define the local text-generation function using the tokenizer and model objects you already downloaded/saved.

**Explanation of each step**

- **Function signature:** `tinyllama_generate_local` takes a prompt and optional parameters like `max_new_tokens`, `temperature`, and `top_p`. These control how the model responds.
- **Tokenizing (`inputs`):** We use `tokenizer_llama` to turn the prompt string into numerical tokens, then move these tokens to the same device (`model_llama.device`) that the model is on.
- **Model generation (`model_llama.generate`):**
  - `max_new_tokens` determines how many tokens the model can add beyond the prompt.
  - `temperature` and `top_p` affect the sampling creativity.
  - `do_sample=True` allows stochastic sampling instead of purely greedy predictions.
  - `eos_token_id` signals the model to stop when it predicts the end-of-sequence token.
- **Decoding (`tokenizer_llama.decode`):** Converts the generated token IDs back into a natural language string.
- **Return value:** The function returns the final text output for further use in your pipeline (e.g., display to the user or feed into other processes).

This code cell completes the local generation step for your RAG or general LLM workflow using TinyLlama/TinyLlama_v1.1.
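
The function described above is not shown in this part of the notebook. Below is a minimal sketch of what it might look like, following the listed steps. Two things here are assumptions: the tokenizer and model are passed in as arguments rather than read from the globals `tokenizer_llama` and `model_llama`, and the default parameter values are illustrative.

```python
def tinyllama_generate_local(prompt, tokenizer, model,
                             max_new_tokens=128, temperature=0.7, top_p=0.9):
    """Generate text from a prompt with a locally loaded causal LM.

    `tokenizer` and `model` are expected to be Hugging Face objects, e.g. the
    notebook's tokenizer_llama and model_llama. Defaults are illustrative.
    """
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Sample a continuation; generation stops at the end-of-sequence token
    # or after max_new_tokens, whichever comes first.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode the first (only) sequence; it contains the prompt plus the new text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the notebook's objects loaded, a call would look like `tinyllama_generate_local("Summarize Workday's R&D strategy.", tokenizer_llama, model_llama)`. Because sampling is enabled, repeated calls can return different text.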

In [None]:
# Quick check to verify Adobe 2024 data exists in Pinecone
check_results = index.query(
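    # arbitrary vector to trigger the search; embed_text returns shape (1, dim),
    # so take row 0 and convert it to a flat list, as retrieve_context does
    vector=embed_text("test")[0].tolist(),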
    top_k=5,
    include_metadata=True,
    filter={
        "company": {"$eq": "adobe"},
        "year": {"$eq": "2024"}
    }
)

print(f"Found {len(check_results.matches)} matches for Adobe 2024:")
for match in check_results.matches:
    print(match.metadata)


Found 5 matches for Adobe 2024:
{'company': 'adobe', 'source_pdf': 'Adobe-2024-10k.pdf', 'truncated_text': 'train and customize models to support brand consistency amongst creative and marketing teams ; firefly video model ( public beta ), which enables generative text - to - video and image - to - video ; and firefly image 3 foundation model, which allows for faster and higher - quality image generations. developing our own foundation models enables us to design firefly to be commercially safe and in line with our ai ethics principles of accountability, responsibility and transparency. we continue to pursue ways to inspire, empower and connect the creative community and support live, interactive tutorials with creators. additionally, with adobe genstudio and adobe genstudio for performance marketing, a generative ai - first product that natively integrates digital media and digital experience offerings, we help enterprises to quickly create on - brand content variations and accelerate