<a href="https://colab.research.google.com/github/vera-lovelace/GenAI-final/blob/miniRAG/RAG_model_Baseline_miniRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install python-docx
!pip install docx
!pip install nltk
!pip install sentence-transformers

Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2
Collecting docx
  Downloading docx-0.2.4.tar.gz (54 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docx
  Building wheel for docx (setup.py) ... done
  Created wheel for docx: filename=docx-0.2.4-py3-none-

In [None]:
import docx
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from typing import List, Dict, Tuple
import os
from pathlib import Path
import re
import pickle
from transformers import AutoTokenizer
import openai

# Download required NLTK resources
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# 1. Extract Text and Create Chunks

class DocumentChunker:
    def __init__(self, min_chunk_size=50, max_chunk_size=1000,
                 overlap=20, min_sentence_similarity=0.3):
        self.min_chunk_size = min_chunk_size
        self.max_chunk_size = max_chunk_size
        self.overlap = overlap
        self.min_sentence_similarity = min_sentence_similarity
        self.vectorizer = TfidfVectorizer()
        # A set gives constant-time membership checks during stop word removal
        self.stop_words = set(nltk.corpus.stopwords.words('english'))

    def extract_text_from_docx(self, file_path: str) -> str:
        doc = docx.Document(file_path)
        full_text = []
        for paragraph in doc.paragraphs:
            if paragraph.text.strip():
                full_text.append(paragraph.text.strip())
        return "\n\n".join(full_text)

    def preprocess_text(self, text: str) -> str:
        # Replace anything other than word characters, whitespace, and basic punctuation
        text = re.sub(r'[^\w\s.,!?;:-]', ' ', text)
        # Collapse every whitespace run (including paragraph breaks) to a single space
        text = re.sub(r'\s+', ' ', text).strip()
        # Remove English stop words
        return ' '.join(word for word in text.split()
                        if word.lower() not in self.stop_words)

    def calculate_sentence_similarities(self, sentences: List[str]) -> np.ndarray:
        if not sentences:
            return np.array([])
        tfidf_matrix = self.vectorizer.fit_transform(sentences)
        return cosine_similarity(tfidf_matrix)

    def create_semantic_chunks(self, text: str) -> List[str]:
        sentences = sent_tokenize(text)
        if not sentences:
            return []
        # NOTE: the similarity matrix is computed but not consulted below;
        # chunk boundaries are currently decided by character length alone.
        similarity_matrix = self.calculate_sentence_similarities(sentences)
        chunks = []
        current_chunk = []
        current_length = 0

        for sentence in sentences:
            current_chunk.append(sentence)
            current_length += len(sentence)

            if current_length >= self.min_chunk_size:
                # Buffers longer than max_chunk_size are discarded, not split
                if current_length <= self.max_chunk_size:
                    chunks.append(' '.join(current_chunk))

                # Drops the first `overlap` sentences; a buffer with fewer
                # sentences than that (the usual case at overlap=20) is emptied
                current_chunk = current_chunk[self.overlap:]
                current_length = sum(len(sent) for sent in current_chunk)

        if current_chunk and current_length >= self.min_chunk_size:
            chunks.append(' '.join(current_chunk))

        return chunks

    def process_document(self, file_path: str) -> List[str]:
        raw_text = self.extract_text_from_docx(file_path)
        processed_text = self.preprocess_text(raw_text)
        return self.create_semantic_chunks(processed_text)

def create_chunks_from_docx(directory):
    chunker = DocumentChunker()
    all_chunks = []
    docx_files = list(Path(directory).glob("*.docx"))

    for doc_path in docx_files:
        try:
            # process_document is annotated to take a str path, so convert the Path
            chunks = chunker.process_document(str(doc_path))
            all_chunks.extend(chunks)
        except Exception as e:
            print(f"Error processing {doc_path}: {e}")

    return all_chunks

# 2. Create Embeddings and Vector Database
def create_vector_database(chunks):
    # Embedding Technique: Sentence-BERT (all-mpnet-base-v2)
    model = SentenceTransformer('all-mpnet-base-v2')
    embeddings = model.encode(chunks)
    # Store each chunk next to its embedding as a (chunk, embedding) tuple
    vector_database = list(zip(chunks, embeddings))
    return vector_database


# 3. Search with Query, Embedding, and Context Window
def search_database(query, vector_database, top_k=10):
    # The query must be embedded with the same model used to build the database
    model = SentenceTransformer('all-mpnet-base-v2')
    query_embedding = model.encode(query)

    similarities = [cosine_similarity(query_embedding.reshape(1, -1),
                                      embedding.reshape(1, -1))[0][0]
                    for _, embedding in vector_database]

    # np.argsort sorts ascending, so the last top_k indices are the best
    # matches, ordered from least to most similar
    top_indices = np.argsort(similarities)[-top_k:]
    top_chunks = [vector_database[i][0] for i in top_indices]

    return top_chunks


if __name__ == "__main__":
    # Use Google Colab's /content directory
    directory = '/content/'
    chunks = create_chunks_from_docx(directory)
    vector_database = create_vector_database(chunks)

    query = input("Enter your query: ")
    results = search_database(query, vector_database)

    print("\nTop results (context window = 10):")
    for chunk in results:
        print(chunk)
        print("-" * 20)

    # Save the vector database
    with open('rag_vector_database.pkl', 'wb') as f:
        pickle.dump(vector_database, f)
    # Report success only after the file has been written and closed
    print("Vector database saved successfully!")

Enter your query: What privacy protection is applicable in California? 

Top results (context window = 10):
TITLE 1.81.5. California Consumer Privacy Act 2018 1798.100 - 1798.199.100 Title 1.81.5 added Stats.
--------------------
response measure qualification, Legislature enacted California Consumer Privacy Act 2018 CCPA law.
--------------------
5 Disclose following information online privacy policy policies business online privacy policy policies California specific description consumers privacy rights, business maintain policies, internet website, update information least every 12 months: description consumer rights pursuant Sections 1798.100, 1798.105, 1798.106, 1798.110, 1798.115, 1798.125 one two designated methods submitting requests, except provided subparagraph paragraph 1 subdivision .
--------------------
5 Disclose following information online privacy policy policies business online privacy policy policies California-specific description consumers privacy rights, business 

In [None]:
# Reuse the vector database built in the first cell instead of
# re-chunking and re-embedding every document for each query.
query = input("Enter your query: ")
results = search_database(query, vector_database)

print("\nTop results (context window = 10):")
for chunk in results:
    print(chunk)
    print("-" * 20)

Enter your query: Who is covered by privacy protection? 

Top results (context window = 10):
document examines different regions approach privacy data protection.
--------------------
covered entity must designate privacy official responsible developing implementing privacy policies procedures, contact person contact office responsible receiving complaints providing individuals information covered entity privacy practices.65 Workforce Training Management.
--------------------
covered entity must mitigate, extent practicable, harmful effect learns caused use disclosure protected health information workforce business associates violation privacy policies procedures Privacy Rule.69 Data Safeguards.
--------------------
Consumers Right Know Personal Information Collected.
--------------------
1 Individual. covered entity may disclose protected health information individual subject information.
--------------------
covered entity must make reasonable efforts use, disclose, request minimum a

In [None]:
# Reuse the vector database built in the first cell instead of
# re-chunking and re-embedding every document for each query.
query = input("Enter your query: ")
results = search_database(query, vector_database)

print("\nTop results (context window = 10):")
for chunk in results:
    print(chunk)
    print("-" * 20)

Enter your query: What are the key differences between the articles tagged with PrivacyLaw? 

Top results (context window = 10):
summary key elements Privacy Rule complete comprehensive guide compliance.
--------------------
21 Review existing Insurance Code provisions regulations relating consumer privacy, except relating insurance rates pricing, determine whether provisions Insurance Code provide greater protection consumers provisions title.
--------------------
Recent Developments: - Growing number state privacy laws Virginia, Colorado, Utah - Increased focus biometric privacy protection - Emerging regulations artificial intelligence - Ongoing debates federal privacy legislation U.S. approach continues evolve, calls comprehensive federal privacy legislation growing stronger.
--------------------
20 Review existing Insurance Code provisions regulations relating consumer privacy, except relating insurance rates pricing, determine whether provisions Insurance Code provide greater prot

In [None]:
# Reuse the vector database built in the first cell instead of
# re-chunking and re-embedding every document for each query.
query = input("Enter your query: ")
results = search_database(query, vector_database)

print("\nTop results (context window = 10):")
for chunk in results:
    print(chunk)
    print("-" * 20)

Enter your query:  When was the TRW Credit Data breach and how many credit records were exposed?

Top results (context window = 10):
law established important principles including: - Consumer right access credit reports - Requirement accurate reporting - Time limits negative information - Procedures disputing incorrect information Privacy Act 1974 represented another significant step, though limited government agencies.
--------------------
2 Paragraph 1 shall apply extent activity involving collection, maintenance, disclosure, sale, communication use information agency, furnisher, user subject regulation Fair Credit Reporting Act, section 1681 et seq., Title 15 United States Code information collected, maintained, used, communicated, disclosed, sold except authorized Fair Credit Reporting Act.
--------------------
document examines key breaches lasting impact privacy protection.
--------------------
1798.150. Personal Information Security Breaches 1 consumer whose nonencrypted nonreda

In [None]:
# Reuse the vector database built in the first cell instead of
# re-chunking and re-embedding every document for each query.
query = input("Enter your query: ")
results = search_database(query, vector_database)

print("\nTop results (context window = 10):")
for chunk in results:
    print(chunk)
    print("-" * 20)

Enter your query:  How major data breaches impacted Apple and Microsoft?

Top results (context window = 10):
Federal Trade Commission FTC became de facto privacy regulator, using authority to: - Enforce company privacy promises - Investigate data breaches - Issue privacy guidelines - Impose fines privacy violations Notable FTC actions included: - 2011 Facebook settlement requiring privacy audits - 2012 Google privacy violation fine 22.5 million - 2019 Facebook fine 5 billion privacy violations State-Level Innovation: absence comprehensive federal legislation, states taken lead: California Leadership: - 2003 Security Breach Notification Law first nation - 2018 California Consumer Privacy Act CCPA - 2020 California Privacy Rights Act CPRA laws influenced states national privacy discussions.
--------------------
4.5.2016 L 119 52 Official Journal European Union EN 2.The communication data subject referred paragraph 1 Article shall describe clear plain language nature personal data breach 

In [None]:
# Reuse the vector database built in the first cell instead of
# re-chunking and re-embedding every document for each query.
query = input("Enter your query: ")
results = search_database(query, vector_database)

print("\nTop results (context window = 10):")
for chunk in results:
    print(chunk)
    print("-" * 20)

Enter your query: List where the GDPR approach was applied

Top results (context window = 10):
4.5.2016 L 119 47 Official Journal European Union EN Article 25 Data protection design default 1.Taking account state art, cost implementation nature, scope, context purposes processing well risks varying likelihood severity rights freedoms natural persons posed processing, controller shall, time determination means processing time processing itself, implement appropriate technical organisational measures, pseudonymisation, designed implement data-protection principles, data minimisation, effective manner integrate necessary safeguards processing order meet requirements Regulation protect rights data subjects.
--------------------
processing personal data private bodies falls within scope Regulation, Regulation provide possibility Member States specific conditions restrict law certain obligations rights restriction constitutes necessary proportionate measure democratic society safeguard speci

In [None]:
# Reuse the vector database built in the first cell instead of
# re-chunking and re-embedding every document for each query.
query = input("Enter your query: ")
results = search_database(query, vector_database)

print("\nTop results (context window = 10):")
for chunk in results:
    print(chunk)
    print("-" * 20)

Enter your query: How privacy regulations affect various industries in the USA?

Top results (context window = 10):
processing personal data private bodies falls within scope Regulation, Regulation provide possibility Member States specific conditions restrict law certain obligations rights restriction constitutes necessary proportionate measure democratic society safeguard specific important interests including public security prevention, investigation, detection prosecution criminal offences execution criminal penalties, including safeguarding prevention threats public security.
--------------------
7 Businesses held accountable violate consumers privacy rights, penalties higher violation affects children.
--------------------
C Implementation Law 1 rights consumers responsibilities businesses implemented goal strengthening consumer privacy, giving attention impact business innovation.
--------------------
Major Data Breaches Impact Privacy Regulation history data protection signific

Improvements for Relevance and Accuracy:

Advanced Chunking:

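Stop Word Removal: English stop words are removed during preprocessing so that chunks and embeddings focus on content-bearing terms.
TF-IDF Enhancement (Optional): The chunker already computes TF-IDF cosine similarities between sentences but does not yet use them; chunk boundaries are currently set by character length alone. Consider using these similarities, together with min_sentence_similarity, to keep related sentences in the same chunk.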
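
One way the optional TF-IDF idea could be wired into chunking is to start a new chunk whenever two adjacent sentences fall below a similarity threshold. The sketch below is illustrative only: `split_on_similarity_drop` is a hypothetical helper that is not part of the notebook, and its default threshold of 0.3 simply mirrors the `min_sentence_similarity` default.

```python
from typing import List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def split_on_similarity_drop(sentences: List[str],
                             min_similarity: float = 0.3,
                             max_chunk_size: int = 1000) -> List[str]:
    """Group consecutive sentences, breaking where adjacent similarity drops.

    Hypothetical helper: a sketch of similarity-driven chunking, not the
    notebook's DocumentChunker.
    """
    if not sentences:
        return []
    if len(sentences) == 1:
        return list(sentences)

    # Pairwise TF-IDF cosine similarities; only adjacent pairs are consulted.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sims = cosine_similarity(tfidf)

    chunks: List[str] = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        current_size = sum(len(s) for s in current)
        too_long = current_size + len(sentences[i]) > max_chunk_size
        # Close the chunk on a topic shift or when it would grow too large.
        if sims[i - 1, i] < min_similarity or too_long:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


demo = split_on_similarity_drop([
    "privacy law protects consumers data",
    "consumers data privacy law applies",
    "bananas grow tropical climates",
])
print(demo)
```

Unlike the length-based loop, this variant never drops a sentence: every sentence ends up in exactly one chunk.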

Powerful Embeddings:

Sentence-BERT (all-mpnet-base-v2): Switched to the all-mpnet-base-v2 Sentence-BERT model, which is used to embed both chunks and queries and was chosen for its strong performance on semantic similarity tasks.

Context Window:

Limited to 10: Retrieval returns the 10 chunks most similar to the query (top_k=10), which keeps the context window focused.
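
If the retrieved chunks should be listed best match first, with their scores, the ranking step could look like the sketch below. `top_k_descending` is a hypothetical helper introduced for illustration; it relies only on `np.argsort`, which sorts in ascending order and therefore has to be reversed.

```python
import numpy as np


def top_k_descending(similarities, top_k=10):
    """Return (index, score) pairs for the top_k scores, best match first."""
    scores = np.asarray(similarities, dtype=float)
    # argsort is ascending, so reverse it before truncating to top_k.
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]


ranked = top_k_descending([0.12, 0.87, 0.45, 0.66], top_k=2)
print(ranked)
```

Returning the scores alongside the indices makes it easier to spot queries where even the best chunk is only weakly related.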

Tuples for Vector Database:

(Chunk, Embedding): The vector database is created as a list of tuples, where each tuple contains the chunk and its embedding for easy access.
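
A list of (chunk, embedding) tuples can be unpacked into a single matrix so that all similarities are computed in one vectorized step. The sketch below uses toy 3-dimensional vectors as stand-ins (real all-mpnet-base-v2 embeddings have 768 dimensions); the chunk labels and the `cosine_scores` helper are assumptions made for illustration.

```python
import pickle

import numpy as np

# Toy stand-in for the list of (chunk, embedding) tuples.
vector_database = [
    ("chunk about CCPA", np.array([1.0, 0.0, 0.0])),
    ("chunk about GDPR", np.array([0.0, 1.0, 0.0])),
    ("chunk about HIPAA", np.array([0.6, 0.0, 0.8])),
]

# Round-trip through pickle, the same format used when saving to disk.
restored = pickle.loads(pickle.dumps(vector_database))

# Unzip once: a tuple of chunk strings and an (n_chunks, dim) matrix.
chunks, embeddings = zip(*restored)
matrix = np.vstack(embeddings)


def cosine_scores(query_embedding: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every row of matrix."""
    query_unit = query_embedding / np.linalg.norm(query_embedding)
    row_units = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return row_units @ query_unit


scores = cosine_scores(np.array([0.9, 0.1, 0.0]), matrix)
best = chunks[int(np.argmax(scores))]
print(best)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.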

How it Works:

create_chunks_from_docx(): Processes every .docx file in the directory: extracts the text, preprocesses it, splits it into chunks, and returns one combined list of chunks.
create_vector_database(): Encodes the chunks with Sentence-BERT (all-mpnet-base-v2) and stores them as a list of (chunk, embedding) tuples.
search_database(): Embeds the query, computes its cosine similarity with every stored embedding, and returns the top_k (default 10) most similar chunks.

Remember to adjust parameters such as min_chunk_size, max_chunk_size, and overlap to suit the characteristics of your documents. Note that min_sentence_similarity is stored but does not yet affect chunking.
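
To get a feel for how a minimum chunk size and a sentence overlap interact, a small stand-alone experiment can help. The sketch below is not the notebook's `DocumentChunker`: `chunk_with_overlap` is a hypothetical helper in which `overlap` means "carry the last N sentences into the next chunk", and the toy sentences are placeholders.

```python
from typing import List


def chunk_with_overlap(sentences: List[str], min_chunk_size: int = 50,
                       overlap: int = 1) -> List[str]:
    """Emit a chunk once it reaches min_chunk_size characters, then carry
    the last `overlap` sentences into the next chunk (hypothetical helper)."""
    chunks: List[str] = []
    current: List[str] = []
    fresh = 0  # sentences added since the last emitted chunk

    for sentence in sentences:
        current.append(sentence)
        fresh += 1
        if sum(len(s) for s in current) >= min_chunk_size:
            chunks.append(" ".join(current))
            # Keep only the trailing sentences as shared context.
            current = current[-overlap:] if overlap > 0 else []
            fresh = 0

    # Emit leftover sentences that never reached the minimum size.
    if fresh:
        chunks.append(" ".join(current))
    return chunks


toy = ["aaaa", "bbbb", "cccc", "dddd", "eeee"]
with_overlap = chunk_with_overlap(toy, min_chunk_size=8, overlap=1)
without_overlap = chunk_with_overlap(toy, min_chunk_size=8, overlap=0)
print(with_overlap)
print(without_overlap)
```

Comparing the two results shows the trade-off: overlap repeats context across chunk boundaries at the cost of more chunks to embed and search.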