<a href="https://colab.research.google.com/github/vera-lovelace/GenAI-final/blob/graphRAG/Extended_RAG_Model_GraphRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Mini Project
## Milestone #2 : Vectorise and store Chunks

Use the embedding code from Assignment A1 to create embeddings from the text chunks that were generated in Milestone #1 and saved in a pickle file.

- Create a Python dictionary as a vector database, using the embedding vector as the key (note: convert the list of embeddings to a tuple so that it is hashable) and the text as the value.
- Experiment with some queries and use cosine similarity to get the most similar text from your vector database.
- If the results are not satisfactory, you may want to refactor your code by:
  - changing the embedding technique
  - modifying the chunking technique from Milestone #1. Your code should be modular enough to make this fairly straightforward if needed. It is what software development is all about.
- When satisfied, store your Python dict (vector db) in a pickle file.
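
As a minimal sketch of the dictionary-based vector database described above, the example below uses tiny hand-written 3-dimensional vectors in place of real embeddings (an assumption, so that it runs without downloading a model). The names `toy_db`, `cosine`, and `best_match` are illustrative only and are not part of the assignment.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings; real keys would be tuple(model.encode(chunk)).
toy_db = {
    (1.0, 0.0, 0.0): "chunk about data breaches",
    (0.0, 1.0, 0.0): "chunk about encryption",
    (0.0, 0.0, 1.0): "chunk about GDPR",
}

def best_match(query_vec, db):
    """Return (similarity, text) for the stored vector closest to query_vec."""
    return max((cosine(query_vec, key), text) for key, text in db.items())

score, text = best_match((0.9, 0.1, 0.0), toy_db)
print(f"{score:.3f} {text}")
```

Lists are not hashable in Python, which is why each embedding has to be converted to a tuple before it can serve as a dictionary key.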


### Deliverables: Zip file with

- Jupyter Notebook
- Summary of your efforts (issues, success in matching chunks to queries based on embeddings, …)
- Pickle file with the Python vector database for use in the final Mini Project Deliverable

In [6]:
# Imports

!pip install python-docx  # provides the `docx` module; the separate `docx` package on PyPI is a different, older library

from docx import Document
from io import BytesIO
import re
import os
from pathlib import Path

from google.colab import files
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer




In [19]:
from google.colab import drive
drive.mount('/content/drive') # Mount Google Drive


Mounted at /content/drive


In [20]:
# Extract Chunks using document paragraphs
# Chunk size is controlled by parameter

def extract_fixed_chunks(file_path, chunk_size=1000):
    """
    Extract fixed-size chunks from a Word document.

    Args:
        file_path (str or bytes): Path to Word document or binary content
        chunk_size (int): Target size of each chunk in characters

    Returns:
        list: List of text chunks of approximately chunk_size characters
    """
    try:
        # Handle both file path and binary content
        if isinstance(file_path, bytes):
            doc = Document(BytesIO(file_path))
        else:
            doc = Document(file_path)

        # Extract and clean all text
        full_text = ""
        for para in doc.paragraphs:
            text = para.text.strip()
            if text:  # Skip empty paragraphs
                # Clean and normalise the text
                text = re.sub(r'\n{3,}', '\n\n', text)
                text = re.sub(r'\s+', ' ', text)  # Remove multiple spaces
                full_text += text + " "  # Add space between paragraphs

        # Split text into sentences (the pattern also splits after a hyphen followed by a space)
        sentences = re.split(r'(?<=[.!?-]) +', full_text)

        chunks = []
        current_chunk = ""

        for sentence in sentences:
            # If adding this sentence would exceed chunk_size
            if len(current_chunk) + len(sentence) > chunk_size:
                # If current chunk is not empty, add it to chunks
                if current_chunk:
                    chunks.append(current_chunk.strip())
                    current_chunk = ""

                # Handle sentences longer than chunk_size
                if len(sentence) > chunk_size:
                    # Split long sentence into fixed-size chunks
                    words = sentence.split()
                    temp_chunk = ""

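                    for word in words:
                        if len(temp_chunk) + len(word) + 1 <= chunk_size:
                            temp_chunk += (" " + word if temp_chunk else word)
                        else:
                            # Flush the partial chunk; skip it when empty
                            # (the first word alone is longer than chunk_size)
                            if temp_chunk:
                                chunks.append(temp_chunk.strip())
                            temp_chunk = word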

                    if temp_chunk:
                        current_chunk = temp_chunk
                else:
                    current_chunk = sentence
            else:
                # Add sentence to current chunk
                current_chunk += (" " + sentence if current_chunk else sentence)

        # Add the last chunk if not empty
        if current_chunk:
            chunks.append(current_chunk.strip())

        return chunks

    except Exception as e:
        # Chain the original exception so the full traceback is preserved
        raise RuntimeError(f"Error processing document: {e}") from e
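
The greedy packing idea used in `extract_fixed_chunks` can be tried on a plain string without a Word document. The sketch below is a simplified variant, not the function above: it assumes sentences end in `.`, `!` or `?`, counts the joining space against the limit, and keeps an over-long sentence whole instead of splitting it by words. The name `pack_sentences` and the sample text are made up for illustration.

```python
import re

def pack_sentences(text, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    sentences = re.split(r'(?<=[.!?]) +', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            # Either it fits, or a lone over-long sentence is kept whole
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

sample = "Privacy matters. Breaches are costly! Who is covered? Regulators differ."
for chunk in pack_sentences(sample, chunk_size=40):
    print(repr(chunk))
```

With a limit of 40 characters the four sample sentences end up in two chunks of two sentences each, which makes it easy to see how the chunk size drives the grouping.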



In [21]:
# prompt: a function that reads triples from a .txt file, vectorises the triples and stores them in a dictionary with the embedding as key and the triple as value

import pickle
import numpy as np
from sentence_transformers import SentenceTransformer

def vectorise_triples(file_path):
    """
    Reads triples from a text file, vectorizes them, and stores them in a dictionary.

    Args:
        file_path (str): The path to the .txt file containing the triples.

    Returns:
        dict: A dictionary where keys are embedding vectors (as tuples) and values are the corresponding triples.
              Returns an empty dictionary if the file does not exist or if an error occurs during processing.
    """

    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            triples = [line.strip() for line in file if line.strip()]
    except FileNotFoundError:
        print(f"Error: File not found at '{file_path}'")
        return {}
    except Exception as e:
        print(f"An error occurred while reading the file: {e}")
        return {}

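    # Note: main() below uses 'all-MiniLM-L6-v2' (384-dim). These 768-dim vectors
    # can only be compared with queries encoded by this same model.
    model = SentenceTransformer('all-mpnet-base-v2')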
    vector_database = {}

    for triple in triples:
        try:
            embedding = model.encode(triple)
            vector_database[tuple(embedding)] = triple
        except Exception as e:
            print(f"Error processing triple '{triple}': {e}")

    return vector_database

# Example usage
file_path = 'privacy_and_security_knowledge_base.ttl.txt' # file path updated
vector_db = vectorise_triples(file_path)


# Save the vector database to a pickle file
with open('vector_database.pickle', 'wb') as handle:
    pickle.dump(vector_db, handle, protocol=pickle.HIGHEST_PROTOCOL)
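
To check that a saved database can be read back, a round trip like the following may help. It is a sketch under assumptions: the two entries are made-up toy data rather than real triples or embeddings, and the file is written to a temporary directory instead of `vector_database.pickle`.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real database: tuple "embeddings" mapped to made-up triples.
toy_db = {
    (0.1, 0.2, 0.3): "GDPR appliesTo EU",
    (0.4, 0.5, 0.6): "CCPA appliesTo California",
}

path = os.path.join(tempfile.mkdtemp(), "toy_vector_database.pickle")
with open(path, "wb") as handle:
    pickle.dump(toy_db, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as handle:
    loaded_db = pickle.load(handle)

# Dicts preserve insertion order, so vectors and texts stay aligned by index.
vectors = [list(key) for key in loaded_db]
texts = list(loaded_db.values())
print(len(vectors), texts[0])
```

Because `pickle.load` can execute arbitrary code, it should only be used on files from a trusted source.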


In [22]:
# Find embeddings whose cosine similarity to the query exceeds a threshold
def find_similar_sentences(query, plain_embeddings=None, graph_embeddings=None, threshold=0.55):
    """
    Finds embeddings similar to the query based on a cosine-similarity threshold.

    Args:
        query: Embedding vector of the query
        plain_embeddings: Optional list of plain-text embeddings
        graph_embeddings: Optional list of graph-based embeddings
        threshold: Minimum cosine similarity for a match (default 0.55)

    Returns:
        List of embeddings whose similarity to the query exceeds the threshold
    """
    similar_sentences = []
    query_norm = np.linalg.norm(query)
    # Apply the same check to both sources; a source left as None is skipped
    for embeddings in (plain_embeddings, graph_embeddings):
        if embeddings is None:
            continue
        for embedding in embeddings:
            similarity = np.dot(query, embedding) / (
                query_norm * np.linalg.norm(embedding))
            if similarity > threshold:
                similar_sentences.append(embedding)
    return similar_sentences
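
If a fixed threshold returns too many or too few chunks, ranking by similarity and keeping the top `k` is a common alternative. The sketch below is not part of the notebook's pipeline: `top_k_similar` is a hypothetical helper, and the 2-dimensional vectors are toy stand-ins for real embeddings. It computes all similarities with one matrix product instead of a Python loop.

```python
import numpy as np

def top_k_similar(query, embeddings, k=3):
    """Return (index, similarity) pairs for the k rows most similar to query."""
    matrix = np.asarray(embeddings, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity of the query against every row in one matrix product
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Toy 2-D vectors standing in for real embeddings
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_similar([1.0, 0.1], toy_embeddings, k=2))
```

The returned indices can be used to look up the matching chunks in a list kept in the same order as the embeddings.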

In [23]:
# Main - Note that chunk size to use is set here in main and overrides default
def main():
    try:
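        # Directory containing Word documents. This is a relative path, resolved against
        # the current working directory; adjust it if no files are found.
        directory = "content/docs"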

        # Get all .docx files in the directory
        docx_files = list(Path(directory).glob("*.docx"))
        print(f"Found files: {docx_files}")

        if not docx_files:
            print(f"No Word documents found in {directory}")
            return

        print(f"Found {len(docx_files)} Word documents")

        vectors_dict = {}
        vectors = []
        # Initialize the model
        model = SentenceTransformer('all-MiniLM-L6-v2')

        # Process each document
        for doc_path in docx_files:
          try:
              print(f"\nProcessing: {doc_path.name}")

              # Extract chunks of approximately 1500 characters
              chunks = extract_fixed_chunks(str(doc_path), chunk_size=1500)

              # get chunk embeddings and save to vector dictionary
              print(f"\nGenerating embeddings for next {len(chunks)} chunks...\n")
              for chunk in chunks:
                  embeddings = model.encode(chunk)
                  vectors_dict[tuple(embeddings)] = chunk
                  vectors.append(embeddings)

          except Exception as e:
              print(f"Error processing {doc_path.name}: {str(e)}")
              continue

        # run queries to find similarity in chunks
        queries = [
            'When was the TRW Credit Data breach and how many credit records were exposed?',
            'How have major data breaches influenced the development of privacy regulations in both the EU and US? Provide specific examples.',
            'Compare and contrast how encryption technologies have evolved to meet different regional privacy requirements. Include specific examples from the EU, US, and Asia.',
            'What role have tech companies played in shaping privacy standards globally, and how have different regions responded to their influence?',
            'How have approaches to data breach notification evolved since 2000, and what are the key differences between jurisdictions?',
            'What kind of data is protected by privacy acts?',
            'Summarize how GDPR is applicable to international organizations',
            'What privacy protection is applicable in California?',
            'Who is covered by privacy protection?',
            'What are the key differences between privacy acts?',
        ]
        print("\nExtracting relevant chunks to queries...\n")
        for query in queries:
          query_embedding = model.encode(query)
          similar_sentences = find_similar_sentences(query_embedding, vectors)

          print(f"Query: {query}")
          print("Similar Sentences:")
          for sentence in similar_sentences:
            chunk = vectors_dict[tuple(sentence)]
            print(chunk)
            print('\n')

    except Exception as e:
        # Catches any failure in main(), not only directory access
        print(f"Error in main: {str(e)}")

# Call main to start creating the embeddings
main()



Found files: []
No Word documents found in content/docs
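
The empty result above does not reveal whether `content/docs` is missing or merely contains no `.docx` files, because globbing a missing directory simply yields nothing. One possible way to tell the two cases apart is sketched below; `find_docx_files` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

def find_docx_files(directory):
    """Return sorted .docx paths in directory, failing loudly if it is missing."""
    root = Path(directory).resolve()  # show the absolute path actually searched
    if not root.is_dir():
        raise FileNotFoundError(f"Directory does not exist: {root}")
    return sorted(root.glob("*.docx"))

try:
    print(find_docx_files("content/docs"))
except FileNotFoundError as err:
    print(err)
```

Printing the resolved path makes it clear which directory the relative path pointed to at run time.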


/content/sample_data/mydata