<a href="https://colab.research.google.com/github/vera-lovelace/GenAI-final/blob/graphRAG/Extended_RAG_Model_GraphRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Mini Project
## Milestone #2 : Vectorise and store Chunks

Use the embedding code from Assignment A1 to create embeddings for the text chunks that were generated in Milestone #1 and saved in a pickle file.

- Create a Python dictionary as a vector database, using the embedding vector as the key (note: convert the list of embeddings to a tuple, since a list is not hashable) and the text as the value.
- Experiment with some queries and use cosine similarity to get the most similar text from your vector database.
- If the results are not satisfactory, you may want to refactor your code by:
  - changing the embedding technique
  - modifying the chunking technique from Milestone #1. Your code should be modular enough to make this fairly straightforward if needed. It is what software development is all about.
- When satisfied, store your Python dict (vector db) in a pickle file.
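
The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: the three-dimensional vectors are made-up stand-ins for real model embeddings, and the names `cosine`, `toy_chunks`, `vector_db` and `most_similar` are hypothetical.

```python
import pickle
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (assumed values, not model output)
toy_chunks = {
    "GDPR applies to organisations processing EU residents' data.": [0.9, 0.1, 0.0],
    "HIPAA protects health information in the USA.": [0.1, 0.9, 0.0],
    "CCPA grants privacy rights to California consumers.": [0.0, 0.2, 0.9],
}

# Vector database: embedding tuple -> chunk text.
# Lists cannot be dictionary keys, so each vector is converted to a tuple.
vector_db = {tuple(vec): text for text, vec in toy_chunks.items()}

def most_similar(query_vec, db):
    """Return (score, text) for the stored chunk closest to query_vec."""
    best_key = max(db, key=lambda key: cosine(query_vec, key))
    return cosine(query_vec, best_key), db[best_key]

score, text = most_similar([0.8, 0.2, 0.1], vector_db)
print(f"{score:.3f}  {text}")

# Round-trip through pickle to confirm the dictionary survives serialisation
restored = pickle.loads(pickle.dumps(vector_db))
print(restored == vector_db)
```

In the notebook itself the vectors would come from the embedding model, and the dictionary would be written to a file with `pickle.dump` rather than kept in memory.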


### Deliverables: Zip file with

- Jupyter Notebook
- Summary of your efforts (issues, success in matching chunks to queries based on embeddings, …)
- Pickle file with the Python vector database for use in the final Mini Project Deliverable

In [None]:
# Imports

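# python-docx provides the `docx` module imported below; the separate
# `docx` package on PyPI is an older project that conflicts with it,
# so only python-docx is installed.
!pip install python-docx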
!pip install rdflib

from docx import Document
from io import BytesIO
import re
import os
from pathlib import Path

from google.colab import files
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer

import spacy
from rdflib import Graph, Literal

from torch.nn.functional import cosine_similarity
import torch



In [None]:
# Extract Chunks using document paragraphs
# Chunk size is controlled by parameter
def extract_fixed_chunks(file_path, chunk_size=1000):
    """
    Extract fixed-size chunks from a Word document.

    Args:
        file_path (str or bytes): Path to Word document or binary content
        chunk_size (int): Target size of each chunk in characters

    Returns:
        list: List of text chunks of approximately chunk_size characters
    """
    try:
        # Handle both file path and binary content
        if isinstance(file_path, bytes):
            doc = Document(BytesIO(file_path))
        else:
            doc = Document(file_path)

        # Extract and clean all text
        full_text = ""
        for para in doc.paragraphs:
            text = para.text.strip()
            if text:  # Skip empty paragraphs
                # Normalise the text: collapse every run of whitespace
                # (including newlines) into a single space
                text = re.sub(r'\s+', ' ', text)
                full_text += text + " "  # Add space between paragraphs

        # Split text into sentences: break on spaces that follow '.', '!', '?' or '-'
        sentences = re.split(r'(?<=[.!?-]) +', full_text)

        chunks = []
        current_chunk = ""

        for sentence in sentences:
            # If adding this sentence would exceed chunk_size
            if len(current_chunk) + len(sentence) > chunk_size:
                # If current chunk is not empty, add it to chunks
                if current_chunk:
                    chunks.append(current_chunk.strip())
                    current_chunk = ""

                # Handle sentences longer than chunk_size
                if len(sentence) > chunk_size:
                    # Split long sentence into fixed-size chunks
                    words = sentence.split()
                    temp_chunk = ""

                    for word in words:
                        if len(temp_chunk) + len(word) + 1 <= chunk_size:
                            temp_chunk += (" " + word if temp_chunk else word)
                        else:
                            if temp_chunk:  # Avoid emitting an empty chunk
                                chunks.append(temp_chunk.strip())
                            temp_chunk = word

                    if temp_chunk:
                        current_chunk = temp_chunk
                else:
                    current_chunk = sentence
            else:
                # Add sentence to current chunk
                current_chunk += (" " + sentence if current_chunk else sentence)

        # Add the last chunk if not empty
        if current_chunk:
            chunks.append(current_chunk.strip())

        return chunks

    except Exception as e:
        # Chain the original exception so the underlying traceback is kept
        raise RuntimeError(f"Error processing document: {e}") from e

# Find cosine similarity of sentences
def find_similar_sentences(query, embeddings):
    """
    Finds stored embeddings whose cosine similarity to the query exceeds
    a fixed threshold of 0.55.

    Args:
        query: Embedding vector of the query
        embeddings: List of text-chunk embedding vectors

    Returns:
        List of the embeddings that passed the threshold, in input order
    """
    similar_sentences = []
    for i in range(len(embeddings)):
        similarity = np.dot(query, embeddings[i]) / (
            np.linalg.norm(query) * np.linalg.norm(embeddings[i]))
        if similarity > 0.55:
            similar_sentences.append(embeddings[i])
    return similar_sentences
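
The threshold filter above can also be written as a single matrix operation that returns ranked indices instead of embeddings. The sketch below is one possible variant, not the notebook's method: `rank_by_cosine`, its `top_k` parameter and the toy two-dimensional vectors are assumptions introduced for illustration.

```python
import numpy as np

def rank_by_cosine(query, embeddings, threshold=0.55, top_k=None):
    """Return (index, score) pairs above threshold, best match first."""
    matrix = np.asarray(embeddings, dtype=float)   # shape (n, d)
    query = np.asarray(query, dtype=float)         # shape (d,)
    # One cosine score per stored embedding, computed in a single pass
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)                    # indices sorted by descending score
    hits = [(int(i), float(scores[i])) for i in order if scores[i] > threshold]
    return hits[:top_k] if top_k is not None else hits

# Toy 2-dimensional embeddings (assumed values, not model output)
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
hits = rank_by_cosine([1.0, 0.0], toy_embeddings)
for index, score in hits:
    print(index, round(score, 3))
```

Returning indices lets the caller look up chunk text in a parallel list, which avoids converting each embedding back to a tuple for the dictionary lookup, and the ranking puts the strongest match first.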

In [None]:
# Main - Note that chunk size to use is set here in main and overrides default
def main():
    try:
        # Directory containing Word documents
        directory = "content/docs"

        # Get all .docx files in the directory
        docx_files = list(Path(directory).glob("*.docx"))
        print(f"Found files: {docx_files}")

        if not docx_files:
            print(f"No Word documents found in {directory}")
            return

        print(f"Found {len(docx_files)} Word documents")

        vectors_dict = {}
        vectors = []
        # Initialize the model
        model = SentenceTransformer('all-MiniLM-L6-v2')

        # Process each document
        for doc_path in docx_files:
          try:
              print(f"\nProcessing: {doc_path.name}")

              # Extract chunks of approximately 1500 characters
              chunks = extract_fixed_chunks(str(doc_path), chunk_size=1500)

              # get chunk embeddings and save to vector dictionary
              print(f"\nGenerating embeddings for next {len(chunks)} chunks...\n")
              for chunk in chunks:
                  embeddings = model.encode(chunk)
                  vectors_dict[tuple(embeddings)] = chunk
                  vectors.append(embeddings)

          except Exception as e:
              print(f"Error processing {doc_path.name}: {str(e)}")
              continue


        # run queries to find similarity in chunks and graphs
        queries = ["When was the Tor network released?",
          "List where the GDPR approach was applied.",
          "How major data breaches impacted Apple and Microsoft?",
          "How privacy regulations affect various industries in the USA?",
          "When was the TRW Credit Data breach and how many credit records were exposed?",
          "How have approaches to data breach notification evolved since 2000, and what are the key differences between jurisdictions?",
          "What kind of data is protected by privacy acts?",
          "Summarize how GDPR is applicable to international organizations using only articles tagged with GDPR  (EU GDPR paper)",
          "What privacy protection is applicable in California?",
          "Who is covered by privacy protection?",
          "What are the key differences between the articles tagged with PrivacyLaw?"]

        # === Querying RAG ===
        print("\n=== Querying RAG ===\n")
        print("\nExtracting relevant chunks to queries...\n")
        for query in queries:
          query_embedding = model.encode(query)
          similar_sentences = find_similar_sentences(query_embedding, vectors)

          print(f"Query: {query}")
          print("Similar Sentences:")
          for sentence in similar_sentences:
            chunk = vectors_dict[tuple(sentence)]
            print(chunk)
            print('\n')


        # === Querying GraphRAG ===

        print("\n=== Querying GraphRAG ===\n")
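        # Load and process triples. The GraphRAG helper functions used here
        # are defined in the next cell, which must be run before calling main().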
        triples = load_triples_from_ttl("content/privacy_and_security.ttl")
        triple_texts = [triple_to_text(str(s), str(p), str(o)) for s, p, o in triples]

        # Embed all triple texts
        triple_embeddings = embed_texts(triple_texts, model)

        for query in queries:
          # Retrieve top 20 relevant triples
          top_triples = retrieve_top_k(query, triple_texts, triple_embeddings, model, k=20)

          # Output results
          print(f"Query: {query}")
          print("Top Relevant Triples:")
          for t in top_triples:
              print("-", t)
          print("\n")

    except Exception as e:
        # This handler wraps the whole pipeline, not only the directory lookup
        print(f"Error running pipeline: {str(e)}")

# Call main to start creating the embeddings and running the queries
main()



Found files: [PosixPath('content/docs/CCPA.docx'), PosixPath('content/docs/EU GDPR.docx'), PosixPath('content/docs/HIPAA.docx'), PosixPath('content/docs/3.Major Data Breaches and Their Impact on Privacy Regulation.docx'), PosixPath('content/docs/1.The Evolution of Privacy.docx'), PosixPath('content/docs/2.DevelopmentPrivacyProtectionUSA.docx'), PosixPath('content/docs/CPRA.docx'), PosixPath('content/docs/4.The Evolution of European Data Protection.docx'), PosixPath('content/docs/5.Global Approaches to Data Protection.docx')]
Found 9 Word documents

Processing: CCPA.docx

Generating embeddings for next 126 chunks...


Processing: EU GDPR.docx

Generating embeddings for next 276 chunks...


Processing: HIPAA.docx

Generating embeddings for next 58 chunks...


Processing: 3.Major Data Breaches and Their Impact on Privacy Regulation.docx

Generating embeddings for next 3 chunks...


Processing: 1.The Evolution of Privacy.docx

Generating embeddings for next 3 chunks...


Processing: 2.Deve

/content/sample_data/mydata

In [None]:

# Load RDF triples from a Turtle file
def load_triples_from_ttl(file_path):
    g = Graph()
    g.parse(file_path, format="ttl")
    return list(g)

# Convert a triple to a readable sentence
def triple_to_text(s, p, o):
    s = s.split("#")[-1] if "#" in s else s.split("/")[-1]
    p = p.split("#")[-1] if "#" in p else p.split("/")[-1]
    o = o.split("#")[-1] if "#" in o else o.split("/")[-1]

    return f"{s} {p.replace('_', ' ')} {o}".replace('"', '')

# Embed a list of sentences
def embed_texts(texts, model):
    return model.encode(texts, convert_to_tensor=True)

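# Retrieve top-k most similar triples
def retrieve_top_k(query, triple_texts, triple_embeddings, model, k=4):
    query_embedding = embed_texts([query], model)
    # (1, d) against (n, d) broadcasts to one score per triple, shape (n,)
    scores = cosine_similarity(query_embedding, triple_embeddings)
    k = min(k, len(triple_texts))  # torch.topk raises if k exceeds the number of triples
    top_k_indices = torch.topk(scores, k).indices.tolist()
    return [triple_texts[i] for i in top_k_indices]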

# === Main Pipeline ===

# Load and process triples
triples = load_triples_from_ttl("content/privacy_and_security.ttl")
triple_texts = [triple_to_text(str(s), str(p), str(o)) for s, p, o in triples]

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed all triple texts
triple_embeddings = embed_texts(triple_texts, model)

# run queries to find similarity in graph
queries = ["When was the Tor network released?",
  "List where the GDPR approach was applied.",
  "How major data breaches impacted Apple and Microsoft?",
  "How privacy regulations affect various industries in the USA?",
  "When was the TRW Credit Data breach and how many credit records were exposed?",
  "How have approaches to data breach notification evolved since 2000, and what are the key differences between jurisdictions?",
  "What kind of data is protected by privacy acts?",
  "Summarize how GDPR is applicable to international organizations using only articles tagged with GDPR  (EU GDPR paper)",
  "What privacy protection is applicable in California?",
  "Who is covered by privacy protection?",
  "What are the key differences between the articles tagged with PrivacyLaw?"]


for query in queries:
  # Retrieve top 20 relevant triples
  top_triples = retrieve_top_k(query, triple_texts, triple_embeddings, model, k=20)

  # Output results
  print(f"Query: {query}")
  print("Top Relevant Triples:")
  for t in top_triples:
      print("-", t)
  print("\n")

Query: When was the Tor network released?
Top Relevant Triples:
- Tor name Tor Network
- Tor type Software
- Tor functionality Anonymous internet browsing
- Tor influenced ModernAnonymityTools
- CDUniverseBreach1999 name CD Universe Breach (1999)
- TJXBreach2007 name TJX Companies Breach (2007)
- CapitalOneBreach2019 resultedIn CloudSecurityEnhancement
- EggheadBreach2000 name Egghead.com Breach (2000)
- TLS name Transport Layer Security
- PhilZimmermann jobTitle Creator of Pretty Good Privacy (PGP)
- TLS type Software
- CapitalOneBreach2019 name Capital One Breach (2019)
- TRWBreach1984 name TRW Credit Data Breach (1984)
- MicrosoftExchangeBreach2021 name Microsoft Exchange Breach (2021)
- PGP creator PhilZimmermann
- MicrosoftExchangeBreach2021 resultedIn CloudSecurityEnhancement
- TJXCompanies name TJX Companies
- Encryption supportsCompliance GDPR
- TRWBreach1984 resultedIn EnhancedCreditAgencySecurity
- EuropeanCommission name European Commission


Query: List where the GDPR appro