<a href="https://colab.research.google.com/github/sam4410/RAG-Technique-based-models/blob/main/RAG_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q beautifulsoup4==4.12.3 requests==2.31.0 deeplake==3.9.18 openai

Components involved in Building RAG Pipeline

* Data collection and preparation
* Data embedding and storage
* Augmented generation

In [2]:
import os
from google.colab import userdata
import openai
from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore
import deeplake.util
from google.colab import drive

# Connect this Colab to my Google Drive
drive.mount("/content/drive")

#Retrieving and setting OpenAI API key
f = open("drive/MyDrive/Colab Notebooks/key_files/openai_api_key.txt", "r")
API_KEY=f.readline().strip()
f.close()

#The OpenAI API key
os.environ['OPENAI_API_KEY'] =API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")

#Retrieving and setting Activeloop API token
f = open("drive/MyDrive/Colab Notebooks/key_files/activeloop_token.txt", "r")
API_token=f.readline().strip()
f.close()
ACTIVELOOP_TOKEN=API_token
os.environ['ACTIVELOOP_TOKEN'] =ACTIVELOOP_TOKEN

# signing to hugging face hub
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=False)



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# # Creating a subprocess to download files from GitHub
# # program contains a function to download files using curl, with the option to add a private token if necessary
# import subprocess
# url = "https://raw.githubusercontent.com/Denis2054/RAG-Driven-Generative-AI/main/commons/grequests.py"
# output_file = "grequests.py"

# # Prepare the curl command using the private token
# curl_command = [
# "curl",
# "-o", output_file,
# url
# ]

# # Execute the curl command
# try:
#   subprocess.run(curl_command, check=True)
#   print("Download successful.")
# except subprocess.CalledProcessError:
#   print("Failed to download the file.")

Download successful.


In [4]:
# # add a private token after the filename if necessary
# def download(directory, filename):
# # The base URL of the image files in the GitHub repository
#     base_url = 'https://raw.githubusercontent.com/Denis2054/RAG-Driven-Generative-AI/main/'
#     # Complete URL for the file
#     file_url = f"{base_url}{directory}/{filename}"
#     # Use curl to download the file, including an Authorization header for the private token
#     try:
#         # Prepare the curl command with the Authorization header
#         #curl_command = f'curl -H "Authorization: token {private_token}" -o{filename} {file_url}'
#         curl_command = f'curl -H -o {filename} {file_url}'
#         # Execute the curl command
#         subprocess.run(curl_command, check=True, shell=True)
#         print(f"Downloaded '{filename}' successfully.")
#     except subprocess.CalledProcessError:
#         print(f"Failed to download '{filename}'. Check the URL, your internet connection, and if the token is correct and has appropriate permissions.")

In [3]:
# For Google Colab and Activeloop(Deeplake library)
#This line writes the string "nameserver 8.8.8.8" to the file. This is specifying that the DNS server the system
#should use is at the IP address 8.8.8.8, which is one of Google's Public DNS servers.
with open('/etc/resolv.conf', 'w') as file:
    file.write("nameserver 8.8.8.8")

## Component 1: Data Collection and Preprocessing

We will retrieve and process 10 Wikipedia articles that provide a comprehensive view of various aspects of space exploration:
* Space exploration: Overview of the history, technologies, missions, and plans involved in the
exploration of space (https://en.wikipedia.org/wiki/Space_exploration)
* Apollo program: Details about the NASA program that landed the first humans on the Moon
and its significant missions (https://en.wikipedia.org/wiki/Apollo_program)
* Hubble Space Telescope: Information on one of the most significant telescopes ever built,
which has been crucial in many astronomical discoveries (https://en.wikipedia.org/wiki/
Hubble_Space_Telescope)
* Mars rover: Insight into the rovers that have been sent to Mars to study its surface and environment
(https://en.wikipedia.org/wiki/Mars_rover)
* International Space Station (ISS): Details about the ISS, its construction, international collaboration,
and its role in space research (https://en.wikipedia.org/wiki/International_
Space_Station)
* SpaceX: Covers the history, achievements, and goals of SpaceX, one of the most influential
private spaceflight companies (https://en.wikipedia.org/wiki/SpaceX)
* Juno (spacecraft): Information about the NASA space probe that orbits and studies Jupiter, its
structure, and moons (https://en.wikipedia.org/wiki/Juno_(spacecraft))
42 RAG Embedding Vector Stores with Deep Lake and OpenAI
* Voyager program: Details on the Voyager missions, including their contributions to our understanding
of the outer solar system and interstellar space (https://en.wikipedia.org/wiki/Voyager_program)
* Galileo (spacecraft): Overview of the mission that studied Jupiter and its moons, providing
valuable data on the gas giant and its system (https://en.wikipedia.org/wiki/Galileo_
(spacecraft))
* Kepler space telescope: Information about the space telescope designed to discover Earth-size
planets orbiting other stars (https://en.wikipedia.org/wiki/Kepler_Space_Telescope)

These articles cover a wide range of topics in space exploration, from historical programs to modern
technological advances and missions.

In [4]:
# Collecting the data
import requests
from bs4 import BeautifulSoup
import re

# select the URLs we need
urls = [
"https://en.wikipedia.org/wiki/Space_exploration",
"https://en.wikipedia.org/wiki/Apollo_program",
"https://en.wikipedia.org/wiki/Hubble_Space_Telescope",
"https://en.wikipedia.org/wiki/Mars_over",
"https://en.wikipedia.org/wiki/International_Space_Station",
"https://en.wikipedia.org/wiki/SpaceX",
"https://en.wikipedia.org/wiki/Juno_(spacecraft)",
"https://en.wikipedia.org/wiki/Voyager_program",
"https://en.wikipedia.org/wiki/Galileo_(spacecraft)",
"https://en.wikipedia.org/wiki/Kepler_Space_Telescope"
]

# Preparing the data
# remove numerical references such as [1] [2] from a given text string, using regular expressions
def clean_text(content):
# Remove references that usually appear as [1], [2], etc.
    content = re.sub(r'\[\d+\]', '', content)
    return content

# fetch and clean function, which will return a nice and clean text by extracting the content we need from the documents
def fetch_and_clean(url):
  # Fetch the content of the URL
  response = requests.get(url)
  soup = BeautifulSoup(response.content, 'html.parser')

  # Find the main content of the article, ignoring side boxes and headers
  content = soup.find('div', {'class': 'mw-parser-output'})

  # Remove the bibliography section, which generally follows a header like "References", "Bibliography"
  for section_title in ['References', 'Bibliography', 'External links', 'See also']:
    section = content.find('span', id=section_title)
    if section:
      # Remove all content from this section to the end of the document
      for sib in section.parent.find_next_siblings():
        sib.decompose()
      section.parent.decompose()

  # Extract and clean the text
  text = content.get_text(separator=' ', strip=True)
  text = clean_text(text)
  return text

In [5]:
# write the content in llm.txt file
with open('llm.txt', 'w', encoding = 'utf-8') as f:
  for url in urls:
    f.write(fetch_and_clean(url))

print("Content written to llm.txt")

Content written to llm.txt


In [6]:
# verify the content written to file
with open('llm.txt', 'r', encoding = 'utf-8') as file:
  lines = file.readlines()
  # Print the first 20 lines
  for line in lines[:20]:
    print(line.strip())

Exploration of space, planets, and moons For broader coverage of this topic, see Exploration . Buzz Aldrin taking a core sample of the Moon during the Apollo 11 mission Self-portrait of Curiosity rover on Mars 's surface Part of a series on Spaceflight History History of spaceflight Space Race Timeline of spaceflight Space probes Lunar missions Mars missions Applications Communications Earth observation Exploration Espionage Military Navigation Colonization Habitation Exploration Telescopes Tourism Spacecraft Robotic spacecraft Satellite Space probe Cargo spacecraft Crewed spacecraft Apollo Lunar Module Space capsules Space Shuttle Space stations Spaceplanes Vostok Space launch Spaceport Launch pad Expendable and reusable launch vehicles Escape velocity Non-rocket spacelaunch Spaceflight types Sub-orbital Orbital Interplanetary Interstellar Intergalactic List of space organizations Space agencies Space forces Companies Spaceflight portal v t e Space exploration is the use of astronomy 

## Component 2: Data embedding and Storage

In [7]:
source_text = "llm.txt"

# creating chunks
with open(source_text, 'r') as f:
  text = f.read()

CHUNK_SIZE = 1000
chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0,len(text), CHUNK_SIZE)]

# vectorize data
vector_store_path = "hub://sam4410/space_exploration_v1"

In [8]:
try:
  # attempt to load the vectorstore
  vector_store = VectorStore(path=vector_store_path)
  print("Vector store exists")
except FileNotFoundError:
  print("Vector store does not exist. You can create it.")
  # Code to create the vector store
  create_vector_store = True

Deep Lake Dataset in hub://sam4410/space_exploration_v1 already exists, loading from the storage
Vector store exists


In [9]:
# The embedding function will transform the chunks of data we created into vectors to enable vector-based search
# will use "text-embedding-3-small" to embed the documents
def embedding_function(texts, model="text-embedding-3-small"):
  if isinstance(texts, str):
    texts = [texts]
  texts = [t.replace("\n", " ") for t in texts]
  return [data.embedding for data in openai.embeddings.create(input = texts, model=model).data]

In [10]:
# Adding Data to Vector store
add_to_vector_store = True
if add_to_vector_store == True:
  with open(source_text, 'r') as f:
    text = f.read()
    CHUNK_SIZE = 1000
    chunked_text = [text[i:i+1000] for i in range(0, len(text), CHUNK_SIZE)]

vector_store.add(text = chunked_text,
                 embedding_function = embedding_function,
                 embedding_data = chunked_text,
                 metadata = [{"source": source_text}]*len(chunked_text))

Creating 348 embeddings in 1 batches of size 348:: 100%|██████████| 1/1 [00:08<00:00,  8.04s/it]

Dataset(path='hub://sam4410/space_exploration_v1', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (696, 1536)  float32   None   
    id        text      (696, 1)      str     None   
 metadata     json      (696, 1)      str     None   
   text       text      (696, 1)      str     None   





Dataset contains 4 tensors:
1. embedding: Each chunk of data is embedded in a vector
2. id: The ID is a string of characters and is unique
3. metadata: The metadata contains the source of the data—in this case, the llm.txt file.
4. text: The content of a chunk of text in the dataset

In [11]:
# visualize how the dataset is organized to verify the structure
print(vector_store.summary())

Dataset(path='hub://sam4410/space_exploration_v1', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
 embedding  embedding  (696, 1536)  float32   None   
    id        text      (696, 1)      str     None   
 metadata     json      (696, 1)      str     None   
   text       text      (696, 1)      str     None   
None


In [12]:
# Vector store information
# Activeloop's API reference provides us with all the information we need to manage our datasets: https://docs.deeplake.ai/en/latest/.
# We can visualize our datasets once we sign in at https://app.activeloop.ai/datasets/mydatasets/
# we can load our data using 1 line of code
ds = deeplake.load("hub://sam4410/space_exploration_v1")

|

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/sam4410/space_exploration_v1



/

hub://sam4410/space_exploration_v1 loaded successfully.



 

In [13]:
# display the estimated size of a dataset
ds_size = ds.size_approx()
# Convert bytes to megabytes and limit to 5 decimal places
ds_size_mb = ds_size / 1048576
print(f"Dataset size in megabytes: {ds_size_mb:.5f} MB")
ds_size_gb = ds_size / 1073741824
print(f"Dataset size in gigabytes: {ds_size_gb:.5f} GB")

Dataset size in megabytes: 55.31311 MB
Dataset size in gigabytes: 0.05402 GB


## Step 3: Augmented input generation

In [14]:
# Augmented generation is the third pipeline component. We will use the data we retrieved to augment the user input. This component processes the user input, queries the vector store,
#augments the input, and calls gpt-4-turbo model
# load dataset from vector store
vector_store_path = "hub://sam4410/space_exploration_v1"
ds = deeplake.load(vector_store_path)

-

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/sam4410/space_exploration_v1



/

hub://sam4410/space_exploration_v1 loaded successfully.



 

In [15]:
# Input and query retrieval
# will need the same embedding function, defined above to embed the user input ensure full compatibility between the input and the vector dataset
# now either use an interactive prompt for an input or process user inputs in batches
def get_user_prompt():
  # Request user input for the search prompt
  return input("Enter your search query: ")

In [16]:
user_prompt = get_user_prompt()

Enter your search query: Tell me about space exploration on the Moon and Mars.


In [17]:
# plug the prompt into the search query and store the output
search_results = vector_store.search(embedding_data=user_prompt,
                                     embedding_function=embedding_function)

In [18]:
# wrap the retrieved text to obtain a formatted output
def wrap_text(text, width=80):
  lines = []
  while len(text) > width:
    split_index = text.rfind(' ', 0, width)
    if split_index == -1:
      split_index = width
    lines.append(text[:split_index])
    text = text[split_index:].strip()
  lines.append(text)
  return '\n'.join(lines)

In [19]:
# Assuming the search results are ordered with the top result first
top_score = search_results['score'][0]
top_text = search_results['text'][0]
top_metadata = search_results['metadata'][0]['source']

# Print the top search result
print("Top Search Result:")
print(f"Score: {top_score}")
print(f"Source: {top_metadata}")
print("Text:")
print(wrap_text(top_text))

Top Search Result:
Score: 0.6073837280273438
Source: llm.txt
Text:
Exploration of space, planets, and moons For broader coverage of this topic,
see Exploration . Buzz Aldrin taking a core sample of the Moon during the
Apollo 11 mission Self-portrait of Curiosity rover on Mars 's surface Part of a
series on Spaceflight History History of spaceflight Space Race Timeline of
spaceflight Space probes Lunar missions Mars missions Applications
Communications Earth observation Exploration Espionage Military Navigation
Colonization Habitation Exploration Telescopes Tourism Spacecraft Robotic
spacecraft Satellite Space probe Cargo spacecraft Crewed spacecraft Apollo
Lunar Module Space capsules Space Shuttle Space stations Spaceplanes Vostok
Space launch Spaceport Launch pad Expendable and reusable launch vehicles
Escape velocity Non-rocket spacelaunch Spaceflight types Sub-orbital Orbital
Interplanetary Interstellar Intergalactic List of space organizations Space
agencies Space forces Companies 

In [20]:
# Augmented input by adding the top retrieved text to the user input (pronmpt)
augmented_input = user_prompt+" "+ top_text
print(augmented_input)

Tell me about space exploration on the Moon and Mars. Exploration of space, planets, and moons For broader coverage of this topic, see Exploration . Buzz Aldrin taking a core sample of the Moon during the Apollo 11 mission Self-portrait of Curiosity rover on Mars 's surface Part of a series on Spaceflight History History of spaceflight Space Race Timeline of spaceflight Space probes Lunar missions Mars missions Applications Communications Earth observation Exploration Espionage Military Navigation Colonization Habitation Exploration Telescopes Tourism Spacecraft Robotic spacecraft Satellite Space probe Cargo spacecraft Crewed spacecraft Apollo Lunar Module Space capsules Space Shuttle Space stations Spaceplanes Vostok Space launch Spaceport Launch pad Expendable and reusable launch vehicles Escape velocity Non-rocket spacelaunch Spaceflight types Sub-orbital Orbital Interplanetary Interstellar Intergalactic List of space organizations Space agencies Space forces Companies Spaceflight p

In [23]:
from openai import OpenAI
client = OpenAI()
import time

gpt_model = "gpt-4o-mini"
start_time = time.time()

# write the generative AI call, adding roles to the message we create for the model
def call_gpt4_with_full_text(itext):
  # Join all lines to form a single string
  text_input = '\n'.join(itext)
  prompt = f"Please summarize or elaborate on the following content:\n{text_input}"

  try:
    response = client.chat.completions.create(
      model=gpt_model,
      messages=[
        {"role": "system", "content": "You are a space exploration expert."},
        {"role": "assistant", "content": "You can read the input and answer in detail." },
        {"role": "user", "content": prompt}
        ],
      temperature=0.1
    )
    return response.choices[0].message.content.strip()
  except Exception as e:
    return str(e)

In [24]:
# call the generative model giving aumented input
gpt4_response = call_gpt4_with_full_text(augmented_input)

response_time = time.time() - start_time # Measure response time
print(f"Response Time: {response_time:.2f} seconds")
print(gpt_model, "Response:", gpt4_response)

Response Time: 13.34 seconds
gpt-4o-mini Response: Space exploration refers to the investigation and study of outer space through the use of various technologies, including spacecraft, satellites, and telescopes. This field encompasses a wide range of activities, including the exploration of celestial bodies such as the Moon and Mars, as well as the broader study of planets, moons, and other astronomical phenomena.

### Moon Exploration
The Moon has been a focal point of human space exploration, particularly during the Apollo missions. Notably, Buzz Aldrin famously took a core sample of the lunar surface during the Apollo 11 mission, marking a significant achievement in lunar exploration. The Moon serves as a valuable site for scientific research, offering insights into the early solar system and the geological history of celestial bodies.

### Mars Exploration
Mars is another primary target for exploration due to its potential for past or present life and its similarities to Earth. Ro

In [26]:
# format the output with textwrap and print the result. First, checks if the response returned contains Markdown features
import textwrap
import re
from IPython.display import display, Markdown, HTML
import markdown

def print_formatted_response(response):
# Check for markdown by looking for patterns like headers, bold, lists, etc.
  markdown_patterns = [
  r"^#+\s", # Headers
  r"^\*+", # Bullet points
  r"\*\*", # Bold
  r"_", # Italics
  r"\[.+\]\(.+\)", # Links
  r"-\s", # Dashes used for lists
  r"\`\`\`" # Code blocks
  ]

  # If any pattern matches, assume the response is in markdown
  if any(re.search(pattern, response, re.MULTILINE) for pattern in markdown_patterns):
    # Markdown detected, convert to HTML for nicer display
    html_output = markdown.markdown(response)
    display(HTML(html_output)) # Use display(HTML()) to render HTML in Colab
  else:
    # No markdown detected, wrap and print as plain text
    wrapper = textwrap.TextWrapper(width=80)
    wrapped_text = wrapper.fill(text=response)
    print("Text Response:")
    print("--------------------")
    print(wrapped_text)
    print("--------------------\n")

In [27]:
print_formatted_response(gpt4_response)

## Evaluating the output with cosine similarity

In [28]:
# implement cosine similarity to measure the similarity between user input and the generative AI model's output
# will also measure the augmented user input with the generative AI model's output. Let's first define a cosine similarity function:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
  vectorizer = TfidfVectorizer()
  tfidf = vectorizer.fit_transform([text1, text2])
  similarity = cosine_similarity(tfidf[0:1], tfidf[1:2])
  return similarity[0][0]

In [29]:
# let’s calculate a score that measures the similarity between the user prompt and GPT-4’s response
cosine_similarity_score = calculate_cosine_similarity(user_prompt, gpt4_response)
print(f"Cosine Similarity Score: {cosine_similarity_score}")

Cosine Similarity Score: 0.3852138472642536


In [30]:
# calculate the similarity between the augmented input and GPT-4’s response
cosine_similarity_score_augmented = calculate_cosine_similarity(augmented_input, gpt4_response)
print(f"Cosine Similarity Score (Augmented Input): {cosine_similarity_score_augmented}")

Cosine Similarity Score (Augmented Input): 0.5904078875104394


In [32]:
# TF-IDF relies heavily on exact vocabulary overlap and takes into account important language features such as semantic meanings, synonyms, or contextual usage
# this method may produce lower similarity scores for texts that are conceptually similar but differ in word choice
# In contrast, using Sentence Transformers to calculate similarity involves embeddings that capture deeper semantic relationships between words and phrases
# This approach is more effective in recognizing the contextual and conceptual similarity between texts
# !pip install sentence-transformers

In [33]:
# use the MiniLM architecture to perform the task with all-MiniLM-L6-v2. This model is available through the Hugging Face Model Hub.
# It's part of the sentence-transformers library, which is an extension of the Hugging Face Transformers library.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_cosine_similarity_with_embeddings(text1, text2):
  embeddings1 = model.encode(text1)
  embeddings2 = model.encode(text2)
  similarity = cosine_similarity([embeddings1], [embeddings2])
  return similarity[0][0]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [34]:
# calculate the similarity between the augmented user input and GPT-4's response
similarity_score = calculate_cosine_similarity_with_embeddings(augmented_input,
                                                               gpt4_response)
print(f"Cosine Similarity Score: {similarity_score:.3f}")

Cosine Similarity Score: 0.784


In [None]:
"""
The output shows that the Sentence Transformer captures semantic similarities between the texts more effectively, resulting in a high cosine similarity score.
"""