<a href="https://colab.research.google.com/github/swathisl170499/Retrieval_Augmented-Generation-RAG-Solution/blob/main/VectorDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install requests tqdm faiss-cpu transformers tensorflow sentence-transformers textblob gensim



In [None]:
import os
import requests
import zipfile
from pathlib import Path
from tqdm import tqdm

# Directory to store downloaded and extracted data
DATA_DIR = Path("./mimic_textbooks")

# Download and extract the dataset zip file
def download_and_extract_zip(url, extract_to=DATA_DIR):
    # Ensure the directory exists
    extract_to.mkdir(parents=True, exist_ok=True)

    # Download the zip file
    zip_path = extract_to / "textbooks.zip"
    print("Downloading dataset...")
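    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
    with open(zip_path, "wb") as file:
        # Each chunk is 1024 bytes, so the progress counter is in KB
        for chunk in tqdm(response.iter_content(chunk_size=1024), unit='KB'):
            if chunk:
                file.write(chunk)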

    # Extract the zip file
    print("Extracting dataset...")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_to)
    print("Dataset downloaded and extracted.")

# Download and extract textbooks
dataset_url = "https://www.dropbox.com/scl/fi/54p9kkx5n93bffyx08eba/textbooks.zip?rlkey=2y2c5x8y0uncnddichn9cmd7n&st=m290nmkk&dl=1"
download_and_extract_zip(dataset_url)


Downloading dataset...


88121KB [00:06, 13357.59KB/s]


Extracting dataset...
Dataset downloaded and extracted.
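
Because the archive is fetched over HTTP, it can be worth confirming that the saved file really is a zip before extracting it. The following is a minimal sketch, not part of the original notebook: the helper name `looks_like_zip` is hypothetical, and it relies only on the standard-library `zipfile` module.

```python
import tempfile
import zipfile
from pathlib import Path


def looks_like_zip(path):
    """Return True if `path` is a readable zip archive whose members pass the CRC check."""
    path = Path(path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path, "r") as zip_ref:
        # testzip() returns the name of the first corrupt member, or None if all are fine
        return zip_ref.testzip() is None


# Small self-check on throwaway files (no network access needed)
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.zip"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("en/sample.txt", "heart failure notes")

    bad = Path(tmp) / "bad.zip"
    bad.write_text("<html>not a zip</html>")

    good_ok = looks_like_zip(good)
    bad_ok = looks_like_zip(bad)

print(good_ok, bad_ok)
```

In the notebook, such a check could be run on `zip_path` between the download and extraction steps.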


In [None]:
import re
from gensim.utils import simple_preprocess
from textblob import TextBlob

# Load text files
def load_text_files(directory):
    texts = []
    for file_path in Path(directory).glob("*.txt"):
        with open(file_path, "r", encoding="utf-8") as file:
            texts.append(file.read())
    return texts

# Cleaning and preprocessing function
def clean_and_tokenize(text):
    # Basic regex cleaning
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space
    text = text.lower()  # Lowercase all text
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Drop everything except letters, digits, and whitespace

    # Tokenize with gensim; by default simple_preprocess keeps only
    # alphabetic tokens that are 2 to 15 characters long
    tokens = simple_preprocess(text)
    return ' '.join(tokens)

# Spell correction
def correct_spelling(text):
    return str(TextBlob(text).correct())

# Chunk text into fixed-size chunks
def chunk_text(text, chunk_size=200):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Load, clean, correct, and chunk documents
documents = load_text_files(DATA_DIR / "textbooks/en")
cleaned_documents = [clean_and_tokenize(doc) for doc in documents]
# corrected_documents = [correct_spelling(doc) for doc in cleaned_documents]
chunked_documents = []
for doc in cleaned_documents:
    chunked_documents.extend(chunk_text(doc))

print(f"Total document chunks created: {len(chunked_documents)}")


Total document chunks created: 60061
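
Fixed-size chunks split the text at arbitrary word positions, so a sentence can be cut between two chunks. One common variation, shown here only as a sketch and not used in the notebook, is to let consecutive chunks share some words. The function name `chunk_text_with_overlap` and the `overlap` parameter are assumptions made for illustration; with `overlap=0` the result matches plain fixed-size chunking.

```python
def chunk_text_with_overlap(text, chunk_size=200, overlap=0):
    """Split `text` into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words; overlap=0 gives plain fixed-size chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # The last chunk already reaches the end of the text
    return chunks


sample = "a b c d e f g"
print(chunk_text_with_overlap(sample, chunk_size=3))
print(chunk_text_with_overlap(sample, chunk_size=3, overlap=1))
```

Overlap increases the number of chunks, and therefore the number of embeddings to compute and index, so it trades storage and compute for better context at chunk edges.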


In [None]:
from transformers import TFAutoModel, AutoTokenizer
import tensorflow as tf
import numpy as np

# Verify that TensorFlow detects the GPU
print("Available devices:", tf.config.list_physical_devices('GPU'))

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = TFAutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Function to generate embeddings for all chunks in batches
def get_embeddings_in_batch(texts, batch_size=16):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]

        # Tokenize the batch of texts
        inputs = tokenizer(batch_texts, return_tensors="tf", truncation=True, padding=True, max_length=512)

        # Run the model (TensorFlow places it on the GPU when one is available)
        outputs = model(inputs).last_hidden_state  # [batch_size, sequence_length, hidden_size]

        # Mean pooling over real tokens only: padded positions are masked out
        # so they do not dilute the embeddings of shorter texts in the batch
        mask = tf.cast(tf.expand_dims(inputs["attention_mask"], -1), outputs.dtype)
        batch_embeddings = (tf.reduce_sum(outputs * mask, axis=1) / tf.reduce_sum(mask, axis=1)).numpy()

        # Append batch embeddings to the list
        all_embeddings.extend(batch_embeddings)

    return np.array(all_embeddings)

# Generate embeddings for all document chunks in batches
embeddings = get_embeddings_in_batch(chunked_documents, batch_size=128)
print(f"Generated embeddings for {len(embeddings)} document chunks.")



Available devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Generated embeddings for 60061 document chunks.
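
Mask-aware mean pooling, which excludes padding positions from the average, can be illustrated on toy arrays without loading the model. The sketch below uses NumPy only; the function name `masked_mean_pool` and the toy values are assumptions for illustration, and the comparison with a plain mean shows how a padded slot can shift the result.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0.

    token_embeddings: [batch, seq_len, hidden]; attention_mask: [batch, seq_len] of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # Guard against empty sequences
    return summed / counts


# Toy batch: two sequences of length 3 with hidden size 2; the second has one padded slot
token_embeddings = np.array(
    [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
     [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]]]  # last vector sits at a padding position
)
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(token_embeddings, attention_mask)
naive = token_embeddings.mean(axis=1)
print(pooled)
print(naive)
```

For the second sequence the masked average is `[3, 3]`, while the plain mean is pulled to `[5, 5]` by the vector at the padded position.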


In [None]:
import faiss
import numpy as np

# Define the dimension of embeddings
dimension = 384  # Embedding size from MiniLM model
index = faiss.IndexFlatL2(dimension)

# FAISS expects a contiguous float32 matrix of shape [n_vectors, dimension]
embedding_matrix = np.ascontiguousarray(embeddings, dtype='float32')

# Add embeddings to FAISS index
index.add(embedding_matrix)
print(f"Total embeddings indexed: {index.ntotal}")


Total embeddings indexed: 60061
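
`IndexFlatL2` performs an exact search: each query is compared with every stored vector, and the reported distances are squared Euclidean distances. The NumPy sketch below mimics that behavior on toy data so the `distances` and `indices` returned by a search are easier to interpret. The function name `brute_force_l2_search` is hypothetical, and the sketch is only an illustration of the idea, not a description of how FAISS is implemented internally.

```python
import numpy as np


def brute_force_l2_search(database, queries, k):
    """Return (distances, indices) of the k nearest database rows per query.

    Distances are squared L2 distances, sorted from nearest to farthest.
    """
    # Pairwise differences via broadcasting: [n_queries, n_database, dim]
    diffs = queries[:, None, :] - database[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    indices = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices


database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 4.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = brute_force_l2_search(database, queries, k=2)
print(indices)
print(distances)
```

The broadcasting approach materializes every pairwise difference, so it is only practical for small examples like this one.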


In [None]:
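# Example query for testing

def get_embedding(text):
    # Embed a single query string; with one text there is no padding,
    # so the plain mean over the token axis covers only real tokens
    inputs = tokenizer(text, return_tensors="tf", truncation=True, padding=True, max_length=512)
    outputs = model(inputs).last_hidden_state
    return tf.reduce_mean(outputs, axis=1).numpy()  # Mean pooling for sentence embedding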



query_text = "What are causes of heart failure?"
query_embedding = get_embedding(query_text)
query_embedding = np.array(query_embedding).reshape(1, -1).astype('float32')

# Search FAISS for the most similar documents
k = 5  # Number of closest documents to retrieve
distances, indices = index.search(query_embedding, k)

# Retrieve and print the most similar chunks
print("Top similar document chunks:")
for idx in indices[0]:
    print(chunked_documents[idx])


Top similar document chunks:
down to six principal mechanisms failure of the pump in the most common situation the cardiac muscle contracts weakly and the chambers cannot empty systolic dysfunction in some cases the muscle cannot relax sufficiently to permit ventricular filling resulting in diastolic dysfunction obstruction to flow lesions that prevent valve opening eg calcific aortic valve stenosis or cause increased ventricular chamber pressures eg systemic hypertension or aortic coarctation can overwork the myocardium which has to pump against the obstruction regurgitant flow valve pathology that allows backward flow of blood results in increased volume workload and may overwhelm the pumping capacity of the affected chambers shunted flow defects congenital or acquired that divert blood inappropriately from one chamber to another or from one vessel to another lead to pressure and volume overloads disorders of cardiac conduction uncoordinated cardiac impulses or blocked conduction pat
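
The search also returns `distances`, which the loop above does not display. A small helper can pair each hit with its distance and shorten long chunks for readability. This is a sketch only: `format_hits` and `max_chars` are hypothetical names, and skipping index `-1` reflects the FAISS convention of filling result slots with `-1` when fewer than `k` neighbors are found.

```python
def format_hits(distances, indices, chunks, max_chars=80):
    """Build one display line per hit for a single query.

    `distances` and `indices` are the rows for one query (e.g. distances[0], indices[0]).
    """
    lines = []
    for rank, (dist, idx) in enumerate(zip(distances, indices), start=1):
        idx = int(idx)
        if idx < 0:
            continue  # No neighbor found for this slot
        text = chunks[idx]
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        lines.append(f"{rank}. (distance={float(dist):.4f}) {text}")
    return lines


# Toy data standing in for the FAISS output and the chunk list
toy_chunks = ["failure of the pump", "obstruction to flow", "regurgitant flow"]
toy_distances = [0.25, 0.5, 3.0]
toy_indices = [2, 0, -1]

hits = format_hits(toy_distances, toy_indices, toy_chunks)
for line in hits:
    print(line)
```

In the notebook, this could be called as `format_hits(distances[0], indices[0], chunked_documents)`.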