The hand-in exercise for this topic is in the notebook named ‘rag_task.ipynb’. Complete all 4 
tasks in this notebook. For task 2, try at least 3 types of chunking, such as 
chunking by paragraph, by sentence, or even by punctuation mark; you are welcome to 
choose your own chunking strategy. For task 4, try at least one other 
similarity or distance function to calculate the similarity. 

Task
1. Create a RAG pipeline that can take the following text and answer the following questions.
2. Try different types of chunking. Do they give better answers?
3. Does asking questions differently give better answers? Why?
4. Try a different similarity search instead of cosine similarity - do the answers improve?



In [1]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import re

In [2]:
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

In [3]:
sample_text = """
The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activities.

Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents. The Amazon River, which flows through the rainforest, is the second longest river in the world and carries more water than any other river.

Efforts to protect the Amazon include international agreements, conservation programs, and sustainable development projects that aim to balance economic growth with environmental protection. Many organizations and governments are working to reduce illegal logging and promote reforestation initiatives.
"""

In [4]:
questions = [
    "What is the Amazon rainforest?",
    "Which countries does the Amazon span across?",
    "Why is deforestation a problem in the Amazon?",
    "How does the Amazon rainforest affect global weather patterns?",
    "What role do indigenous tribes play in the Amazon?",
    "What is the importance of the Amazon River?",
    "What types of wildlife can be found in the Amazon?",
    "How does deforestation contribute to climate change?",
    "What efforts are being made to protect the Amazon?",
    "Why is the Amazon considered a major carbon sink?"
]

In [5]:
# Function to generate embeddings
def get_embeddings(texts):
    return model.encode(texts, convert_to_numpy=True, batch_size=32)

# Function to retrieve relevant passages using cosine similarity
def retrieve_passages_cosine(query, stored_texts, stored_embeddings, top_k=1):
    query_embedding = get_embeddings([query])
    similarities = cosine_similarity(query_embedding, stored_embeddings)[0]
    
    # Indices of the top_k most similar passages, best match first
    top_indices = np.argsort(-similarities)[:top_k]
    
    # Join the passages so the function returns a single string (one passage when top_k=1)
    return "\n".join(stored_texts[i] for i in top_indices)

# Function to answer questions
def answer_question(query, stored_texts, stored_embeddings):
    return retrieve_passages_cosine(query, stored_texts, stored_embeddings)
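
For reference, the score that `cosine_similarity` assigns to each stored chunk is the dot product of the query and chunk vectors divided by the product of their lengths. The following is a minimal NumPy sketch of that computation. `cosine_scores` is a hypothetical helper name, and the vectors are toy values rather than real embeddings.

```python
import numpy as np

def cosine_scores(query_vec, chunk_matrix):
    """Cosine similarity between one query vector and each row of chunk_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    chunk_matrix = np.asarray(chunk_matrix, dtype=float)
    dots = chunk_matrix @ query_vec
    norms = np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms

toy_query = [1.0, 0.0]
toy_chunks = [[2.0, 0.0],   # same direction as the query
              [0.0, 3.0],   # orthogonal to the query
              [1.0, 1.0]]   # 45 degrees from the query
scores = cosine_scores(toy_query, toy_chunks)
best_index = int(np.argmax(scores))  # index of the most similar toy chunk
```

Because the score depends only on direction, the first toy chunk scores highest even though it is twice as long as the query.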

Create a RAG pipeline

In [6]:
# Task 1: Create a RAG pipeline with paragraph chunking
def split_by_paragraphs(text):
    return [para.strip() for para in re.split("\n+", text) if para.strip()]

# Split text into paragraphs and generate embeddings
paragraphs = split_by_paragraphs(sample_text)
paragraph_embeddings = get_embeddings(paragraphs)

# Test the basic RAG pipeline
print("--- Basic RAG Pipeline with Paragraph Chunking ---")
for question in questions[:3]:  # Testing with first 3 questions
    answer = answer_question(question, paragraphs, paragraph_embeddings)
    print(f"Q: {question}")
    print(f"A: {answer}\n")

--- Basic RAG Pipeline with Paragraph Chunking ---
Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Q: Which countries does the Amazon span across?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Q: Why is deforestation a problem in the Amazon?
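A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.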
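
`split_by_paragraphs` splits on any run of newlines, which works here because each paragraph of `sample_text` sits on a single line. If the text were hard-wrapped, every line would become its own chunk. The following is a minimal sketch that splits only on blank lines. The name `split_on_blank_lines` and the wrapped sample are assumptions for illustration.

```python
import re

def split_on_blank_lines(text):
    """Split text into paragraphs separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text)
    # Collapse line breaks inside a paragraph into single spaces
    return [" ".join(block.split()) for block in blocks if block.strip()]

# Hypothetical hard-wrapped input
wrapped_text = """
The Amazon rainforest is the largest
tropical rainforest in the world.

Deforestation is a significant threat
to the Amazon.
"""
wrapped_paragraphs = split_on_blank_lines(wrapped_text)
```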

Try different types of chunking

In [7]:
# Chunking Strategy 1: Paragraph-based chunking (already implemented above)
print(f"Paragraph chunking - Number of chunks: {len(paragraphs)}")

Paragraph chunking - Number of chunks: 5


In [8]:
# Chunking Strategy 2: Sentence-based chunking
def split_by_sentences(text):
    sentences = re.split(r'(?<![\w\.])(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

sentences = split_by_sentences(sample_text)
sentence_embeddings = get_embeddings(sentences)

print(f"Sentence chunking - Number of chunks: {len(sentences)}")
print("\nExample sentences:")
for i, sent in enumerate(sentences[:3]):
    print(f"{i+1}: {sent}")

Sentence chunking - Number of chunks: 1

Example sentences:
1: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activitie
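
The count of 1 reported above indicates that the pattern in `split_by_sentences` never matched. The first lookbehind, `(?<![\w\.])`, rejects any position preceded by a period, while `(?<=\.|\?)` requires a preceding period or question mark. Only whitespace after a `?` could therefore trigger a split, and `sample_text` contains no question marks. The following is a minimal sketch of a simpler splitter, assuming sentences end in `.`, `?` or `!` followed by whitespace. `split_sentences_simple` is a hypothetical name, and the pattern makes no attempt to handle abbreviations such as "Dr.".

```python
import re

def split_sentences_simple(text):
    """Split after '.', '?' or '!' when the punctuation is followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text)
    return [part.strip() for part in parts if part.strip()]

demo_text = (
    "The Amazon rainforest covers approximately 5.5 million square kilometers. "
    "It spans across nine countries.\n\n"
    "Deforestation is a significant threat to the Amazon."
)
demo_sentences = split_sentences_simple(demo_text)
```

The decimal point in "5.5" is not followed by whitespace, so it does not trigger a split. If this splitter were swapped into the notebook, `sentence_embeddings` would need to be regenerated from the new chunks before the sentence-chunking answers are compared.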

In [9]:
# Chunking Strategy 3: Word-based chunking (simpler approach)
def split_by_words(text, words_per_chunk=50):
    words = text.split()
    chunks = []
    
    for i in range(0, len(words), words_per_chunk):
        chunk = ' '.join(words[i:i + words_per_chunk])
        chunks.append(chunk)
    
    return chunks

# Generate word-based chunks
word_chunks = split_by_words(sample_text)
word_chunk_embeddings = get_embeddings(word_chunks)

print(f"Word chunking - Number of chunks: {len(word_chunks)}")
print("\nExample word chunks:")
for i, chunk in enumerate(word_chunks[:2]):
    print(f"{i+1}: {chunk[:100]}...")

Word chunking - Number of chunks: 5

Example word chunks:
1: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 mi...
2: birds. Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost...
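
The second example chunk starts with "birds.", the tail of a sentence that began in the previous chunk, because fixed-size windows ignore sentence boundaries. One common mitigation is to let consecutive chunks overlap, so that text near a boundary also appears with its neighbouring context in the next chunk. The following is a minimal sketch of that idea. `split_by_words_overlap` and its `overlap` parameter are hypothetical additions, not part of the notebook.

```python
def split_by_words_overlap(text, words_per_chunk=50, overlap=10):
    """Fixed-size word chunks; each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < words_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than words_per_chunk")
    words = text.split()
    step = words_per_chunk - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Toy input: twelve placeholder words w0 ... w11
demo_words = " ".join(f"w{i}" for i in range(12))
demo_chunks = split_by_words_overlap(demo_words, words_per_chunk=5, overlap=2)
```

Overlap increases the number of chunks and therefore the number of embeddings to compute, so the overlap size is a trade-off rather than a free improvement.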


In [10]:
# Test word-based chunking
print("\n--- Answers with Word Chunking ---")
for question in questions[:3]:
    answer = answer_question(question, word_chunks, word_chunk_embeddings)
    print(f"Q: {question}")
    print(f"A: {answer}\n")


--- Answers with Word Chunking ---
Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and

Q: Which countries does the Amazon span across?
A: in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activities. Scientists believe that the Amazon plays a crucial role in global

Q: Why is deforestation a problem in the Amazon?
A: birds. Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

In [11]:
# Test different chunking strategies
print("\n--- Answers with Sentence Chunking ---")
for question in questions[:3]:
    answer = answer_question(question, sentences, sentence_embeddings)
    print(f"Q: {question}")
    print(f"A: {answer}\n")

print("\n--- Answers with Word Chunking ---")
for question in questions[:3]:
    answer = answer_question(question, word_chunks, word_chunk_embeddings)
    print(f"Q: {question}")
    print(f"A: {answer}\n")


--- Answers with Sentence Chunking ---
Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and indust

Does asking questions differently give better answers?

In [12]:
# Alternative formulations of the same questions
alternative_questions = [
    "Tell me about the Amazon rainforest.",
    "Name the countries where the Amazon is located.",
    "What are the negative effects of Amazon deforestation?",
    "Explain how the Amazon influences weather patterns globally.",
    "How do indigenous people interact with the Amazon ecosystem?"
]

print("\n--- Testing Different Question Formulations ---")
for i, alt_q in enumerate(alternative_questions):
    original_q = questions[i]
    
    original_answer = answer_question(original_q, paragraphs, paragraph_embeddings)
    alt_answer = answer_question(alt_q, paragraphs, paragraph_embeddings)
    
    print(f"Original Q: {original_q}")
    print(f"Original A: {original_answer}")
    print(f"Alternative Q: {alt_q}")
    print(f"Alternative A: {alt_answer}")
    print()


--- Testing Different Question Formulations ---
Original Q: What is the Amazon rainforest?
Original A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.
Alternative Q: Tell me about the Amazon rainforest.
Alternative A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Original Q: Which countries does the Amazon span across?
Original A: The Amazon rainforest is the largest tropical rainforest in the world, covering appro

Try a different similarity search instead of cosine similarity

In [13]:
# Function to retrieve passages using Euclidean distance
def retrieve_passages_euclidean(query, stored_texts, stored_embeddings):
    query_embedding = get_embeddings([query])
    distances = euclidean_distances(query_embedding, stored_embeddings)[0]
    
    # Get index of closest passage (smallest distance)
    best_match_idx = np.argmin(distances)
    
    return stored_texts[best_match_idx]

# Test with Euclidean distance
print("\n--- Answers with Euclidean Distance ---")
for question in questions[:5]:
    cosine_answer = retrieve_passages_cosine(question, paragraphs, paragraph_embeddings)
    euclidean_answer = retrieve_passages_euclidean(question, paragraphs, paragraph_embeddings)
    
    print(f"Q: {question}")
    print(f"Cosine Similarity A: {cosine_answer}")
    print(f"Euclidean Distance A: {euclidean_answer}")
    print()


--- Answers with Euclidean Distance ---
Q: What is the Amazon rainforest?
Cosine Similarity A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.
Euclidean Distance A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Q: Which countries does the Amazon span across?
Cosine Similarity A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans acro
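
Whether Euclidean distance can change the retrieved passage depends on the vector lengths. For unit-length vectors, ||a - b||^2 = 2 - 2 cos(a, b), so the nearest vector by Euclidean distance is also the most similar by cosine and the two rankings coincide. The sketch below checks this on random toy vectors. Whether the embeddings returned by `get_embeddings` are unit-length is an assumption that can be checked with `np.linalg.norm(paragraph_embeddings, axis=1)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for embeddings: 6 "chunks" and 1 "query" in 8 dimensions
chunks = rng.normal(size=(6, 8))
query = rng.normal(size=8)

# Scale every vector to unit length
unit_chunks = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)

cosine = unit_chunks @ unit_query                           # higher is more similar
euclid = np.linalg.norm(unit_chunks - unit_query, axis=1)   # lower is closer

# For unit vectors the squared distance equals 2 - 2 * cosine
identity_holds = np.allclose(euclid ** 2, 2 - 2 * cosine)
# Sorting by descending cosine and by ascending distance gives the same order
same_ranking = np.array_equal(np.argsort(-cosine), np.argsort(euclid))
```

If the norms are not all 1, the two metrics can disagree, and only then does comparing their answers say something about the metric rather than about the chunks.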

In [14]:
# Analysis of results
def analyze_results():
    print("\n--- Analysis of Results ---")
    
    print("\nTask 2 - Different Chunking Strategies:")
    print("- Paragraph chunking: Preserves context but may include irrelevant information")
    print("- Sentence chunking: More precise but may lose context between sentences")
    print("- Word chunking: Can break text at arbitrary points, potentially splitting concepts")
    
    print("\nTask 3 - Different Question Formulations:")
    print("- More specific questions tend to retrieve more relevant passages")
    print("- Questions with keywords from the text often perform better")
    print("- The way a question is phrased can significantly impact the retrieved passage")
    
    print("\nTask 4 - Different Similarity Metrics:")
    print("- Cosine similarity focuses on the direction of vectors (semantic similarity)")
    print("- Euclidean distance considers both direction and magnitude")
    print("- Different metrics may be better suited for different types of questions")

analyze_results()


--- Analysis of Results ---

Task 2 - Different Chunking Strategies:
- Paragraph chunking: Preserves context but may include irrelevant information
- Sentence chunking: More precise but may lose context between sentences
- Word chunking: Can break text at arbitrary points, potentially splitting concepts

Task 3 - Different Question Formulations:
- More specific questions tend to retrieve more relevant passages
- Questions with keywords from the text often perform better
- The way a question is phrased can significantly impact the retrieved passage

Task 4 - Different Similarity Metrics:
- Cosine similarity focuses on the direction of vectors (semantic similarity)
- Euclidean distance considers both direction and magnitude
- Different metrics may be better suited for different types of questions
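
The observations above are qualitative. One way to compare chunking strategies or metrics more directly is to record, for each question, a keyword that a correct passage must contain, and then count how often the retrieved passage contains it. The following is a minimal sketch under that assumption. `hit_rate`, the keyword choices, and the stand-in retriever are all hypothetical. In the notebook, `retrieve` would wrap `answer_question` with a fixed set of chunks and embeddings.

```python
def hit_rate(labelled_questions, retrieve):
    """Fraction of questions whose retrieved passage contains the expected keyword."""
    hits = 0
    for question, keyword in labelled_questions:
        passage = retrieve(question)
        if keyword.lower() in passage.lower():
            hits += 1
    return hits / len(labelled_questions)

# Hypothetical labels: each keyword occurs only in the paragraph that answers the question
labelled = [
    ("Which countries does the Amazon span across?", "Brazil"),
    ("Why is the Amazon considered a major carbon sink?", "carbon dioxide"),
]

# Stand-in retriever that always returns the same passage, for demonstration only
def always_first_sentence(question):
    return "It spans across nine countries, including Brazil, Peru, and Colombia."

score = hit_rate(labelled, always_first_sentence)  # one of the two keywords is found
```

With the notebook's objects, a call might look like `hit_rate(labelled, lambda q: answer_question(q, paragraphs, paragraph_embeddings))`, repeated for each chunking strategy.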
