<a href="https://colab.research.google.com/github/victusfate/rag_demos/blob/main/rag_phi2_7_22_24.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install requests beautifulsoup4 torch transformers accelerate sentence-transformers numpy scikit-learn




In [2]:

import requests
from bs4 import BeautifulSoup
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np



In [3]:
# Web scraping and text processing
def fetch_and_process_text(url):
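    response = requests.get(url, timeout=30)  # avoid hanging forever on a stalled connection
    response.raise_for_status()  # fail loudly instead of parsing an HTTP error page
    soup = BeautifulSoup(response.content, 'html.parser')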

    # Extract text from paragraphs
    paragraphs = soup.find_all('p')
    text = ' '.join([p.get_text() for p in paragraphs])

    # Simple text cleaning
    text = text.replace('\n', ' ').replace('\r', '')
    return text
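
The cleaning step above swaps newlines for spaces, which can leave runs of several spaces between words. As a minimal sketch (the helper name `normalize_whitespace` is hypothetical and not part of the notebook), every run of whitespace could be collapsed to one space before the text is chunked:

```python
import re

def normalize_whitespace(text):
    """Collapse each run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("In the beginning\r\n\r\n   God created\tthe heaven  "))
```

If adopted, this would replace the two `replace` calls in `fetch_and_process_text`; whether the tidier text improves retrieval here has not been measured.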

In [4]:
# Vector store
class SimpleVectorStore:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.encoder = SentenceTransformer(model_name)
        self.vectors = []
        self.texts = []

    def add_text(self, text, chunk_size=256):
        chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
        embeddings = self.encoder.encode(chunks)
        self.vectors.extend(embeddings)
        self.texts.extend(chunks)

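    def search(self, query, k=2):
        if not self.texts:
            return []  # nothing has been indexed yet
        query_vector = self.encoder.encode([query])
        similarities = cosine_similarity(query_vector, np.asarray(self.vectors))[0]
        # Indices of the k most similar chunks, best match first
        top_k_indices = np.argsort(similarities)[-k:][::-1]
        return [self.texts[i] for i in top_k_indices]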
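
`add_text` cuts the text into fixed 256-character windows, so a sentence that straddles a boundary is split between two chunks. One common mitigation is to let neighbouring windows overlap. The sketch below is an assumption, not the notebook's method: `chunk_with_overlap` and its `overlap` parameter are hypothetical names.

```python
def chunk_with_overlap(text, chunk_size=256, overlap=32):
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2))
```

With `overlap=0` this reduces to the same windows that `add_text` produces. A larger overlap means more chunks to embed, so the value is a trade-off worth tuning on the actual corpus.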

In [5]:
# Initialize models
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu"
)

vector_store = SimpleVectorStore()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [6]:
def generate_response(query, context):
    system_prompt = "You are a helpful AI assistant. Provide concise and accurate answers based on the given context."
    query_wrapper_prompt = "Context: {context}\nQuestion: {query}\nAnswer:"

    prompt = query_wrapper_prompt.format(context=" ".join(context), query=query)
    full_prompt = f"{system_prompt}\n\n{prompt}"

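    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        # Greedy decoding; temperature only applies when do_sample=True, so it is omitted
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()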
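
The prompt passed to the model is the system prompt, the retrieved chunks joined with spaces, the question, and a trailing `Answer:` cue. The sketch below isolates that assembly so it can be inspected without loading the model. The `build_prompt` helper and its `max_context_chars` budget are hypothetical additions; a character budget is only a rough stand-in for a real token count.

```python
SYSTEM_PROMPT = (
    "You are a helpful AI assistant. Provide concise and accurate "
    "answers based on the given context."
)

def build_prompt(query, context_chunks, max_context_chars=1500):
    """Assemble the prompt, keeping whole chunks until the character budget is spent."""
    kept, used = [], 0
    for chunk in context_chunks:
        cost = len(chunk) + (1 if kept else 0)  # +1 for the joining space
        if used + cost > max_context_chars:
            break  # drop this chunk and all later, lower-ranked ones
        kept.append(chunk)
        used += cost
    context = " ".join(kept)
    return f"{SYSTEM_PROMPT}\n\nContext: {context}\nQuestion: {query}\nAnswer:"

print(build_prompt("Who built the ark?", ["Noah built an ark.", "It rained forty days."]))
```

Printing the assembled prompt is a quick way to check what the model actually sees when an answer looks off.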

In [7]:
url = 'https://www.gutenberg.org/cache/epub/8001/pg8001-images.html'
text = fetch_and_process_text(url)
vector_store.add_text(text)

In [9]:
# Example usage
query = "List the ten commandments"
relevant_context = vector_store.search(query)
response = generate_response(query, relevant_context)
print(f"Query: {query}")
print(f"Response: {response}")



Query: List the ten commandments
Response: 1. Thou shalt not kill.
2. Thou shalt not commit adultery.
3. Thou shalt not steal.
4. Thou shalt not bear false witness against thy neighbour.
5. Thou shalt not covet thy neighbour's house, thou shalt not covet thy neighbour's wife, nor thy neighbour's manservant, nor thy neighbour's ox, nor thy neighbour's ass, nor thy neighbour's sheep, nor thy neighbour's oxen, nor thy neighbour's donkey, nor thy neighbour's mule, nor thy neighbour's ass, nor thy neighbour's oxen, nor thy neighbour's donkey, nor thy neighbour's mule.
6. Thou shalt not covet thy neighbour's wife, nor thy neighbour's manservant, nor thy neighbour's ox, nor thy neighbour's ass, nor thy neighbour's sheep, nor thy neighbour's oxen, nor thy neighbour's donkey, nor thy neighbour's mule, nor thy neighbour's ass, nor thy neighbour's oxen, nor thy neighbour's donkey, nor thy neighbour's mule.
7. Thou shalt not covet thy neighbour's house, thou shalt not covet thy neighbour's wife, n