In [1]:
pip install PyPDF2 pdfminer.six


Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting pdfminer.six
  Downloading pdfminer.six-20240706-py3-none-any.whl.metadata (4.1 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Downloading pdfminer.six-20240706-py3-none-any.whl (5.6 MB)
Installing collected packages: PyPDF2, pdfminer.six
Successfully installed PyPDF2-3.0.1 pdfminer.six-20240706


In [27]:
from PyPDF2 import PdfReader

# Example of extracting text from a PDF (using PyPDF2)
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    # Guard against pages with no extractable text (e.g. scanned images)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "".join(pages)


# Example usage: map each output file to its source PDF (replace with your own files)
pdf_files = {
    "textbookE.txt": "/content/Environmental Science Earth as a Living Planet by Daniel B. Botkin and Edward A. Keller.pdf",
    "textbookD.txt": "/content/Data Structures and Algorithms in Java Fourth Edition.pdf",
    "textbookB.txt": "/content/best book Principles of Artificial Intelligence ( PDFDrive ).pdf",
}

# Extract each PDF and save its text to a file
for output_path, pdf_path in pdf_files.items():
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(extract_text_from_pdf(pdf_path))
print("Text extraction complete!")


Text extraction complete!


In [3]:
import re
import json

def organize_textbook(file_path, textbook_title):
    """
    Organizes a textbook's text into chapters and sections.

    Args:
        file_path (str): Path to the .txt file containing textbook content.
        textbook_title (str): Title of the textbook for reference.

    Returns:
        dict: Hierarchical structure with chapters, sections, and paragraphs.
    """
    # Read the file content
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()

    # Split text into chapters using a regex pattern for "Chapter X"
    chapters = re.split(r"\nChapter\s+\d+[:\s]", text, flags=re.IGNORECASE)
    hierarchy = {"textbook_title": textbook_title, "chapters": []}

    # Process each chapter; any text before the first "Chapter X" heading becomes the first entry
    for chapter_num, chapter_text in enumerate(chapters):
        if chapter_text.strip():  # Ignore empty splits
            chapter_data = {
                "chapter_number": chapter_num + 1,
                "sections": []
            }

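            # Split chapter into sections using "Section X.Y"; the group is non-capturing
            # so re.split does not insert the captured text (or None) into the results
            sections = re.split(r"\nSection\s+\d+(?:\.\d+)*[:\s]", chapter_text, flags=re.IGNORECASE)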

            for section_num, section_text in enumerate(sections):
                if section_text.strip():  # Ignore empty splits
                    section_data = {
                        "section_number": section_num + 1,
                        "content": [p.strip() for p in section_text.split("\n\n") if p.strip()]
                    }
                    chapter_data["sections"].append(section_data)

            hierarchy["chapters"].append(chapter_data)

    return hierarchy

# Process all three books
books = [
    {"file": "textbookE.txt", "title": "Book E"},
    {"file": "textbookB.txt", "title": "Book B"},
    {"file": "textbookD.txt", "title": "Book D"}
]

organized_books = []

for book in books:
    organized_books.append(organize_textbook(book["file"], book["title"]))

# Save the organized structure for each book as JSON
for book, data in zip(books, organized_books):
    output_file = f"{book['title'].replace(' ', '_').lower()}_organized.json"
    with open(output_file, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=4)

print("Books organized and saved as JSON!")


Books organized and saved as JSON!
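
The chapter and section splitting above depends on how `re.split` treats its pattern. The following is a minimal sketch on a made-up sample string (the sample text is an assumption, not taken from the textbooks). It shows that the first element of a split is whatever precedes the first heading, and that a capturing group adds the captured text to the result list, while a non-capturing group `(?:...)` does not.

```python
import re

# Hypothetical sample text, used only to illustrate the splitting behavior
sample = (
    "Preface text\n"
    "Chapter 1: Basics\nintro\n"
    "Section 1.1: Setup\nbody A\n"
    "Section 1.2: Use\nbody B"
)

# Element 0 is the text that precedes the first "Chapter X" heading
chapters = re.split(r"\nChapter\s+\d+[:\s]", sample, flags=re.IGNORECASE)

# With a capturing group, re.split also returns the captured text as list items
with_capture = re.split(r"\nSection\s+\d+(\.\d+)*[:\s]", chapters[1], flags=re.IGNORECASE)

# With a non-capturing group, only the text between headings is returned
without_capture = re.split(r"\nSection\s+\d+(?:\.\d+)*[:\s]", chapters[1], flags=re.IGNORECASE)

print(chapters[0])
print(with_capture)
print(without_capture)
```

When a capturing group does not take part in a match (for example a heading such as "Section 2:" with no ".Y" part), `re.split` places `None` in the result, and calling `.strip()` on that item would fail.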


In [6]:
import re
import json
import networkx as nx

def sanitize_text(text):
    """
    Sanitizes text to ensure it is XML-compatible.
    Removes control characters and ensures UTF-8 compliance.

    Args:
        text (str): Text to sanitize.

    Returns:
        str: Sanitized text.
    """
    if not text:
        return ""
    # Remove ASCII control characters (this also strips newlines and tabs)
    text = re.sub(r"[\x00-\x1F\x7F]", "", text)
    # No manual escaping of &, <, > here: the GraphML writer escapes them on output,
    # so pre-escaping would store double-escaped text such as "&amp;amp;"
    return text

def build_hierarchical_tree(hierarchy):
    """
    Builds a hierarchical tree (graph) from the organized textbook data.

    Args:
        hierarchy (dict): Organized textbook data with chapters, sections, and paragraphs.

    Returns:
        nx.DiGraph: A directed graph representing the hierarchical tree.
    """
    graph = nx.DiGraph()
    textbook_title = hierarchy["textbook_title"]
    graph.add_node(textbook_title, node_type="root")

    # Add chapters and sections as nodes
    for chapter in hierarchy["chapters"]:
        chapter_id = f"Chapter {chapter['chapter_number']}"
        graph.add_node(chapter_id, node_type="chapter")
        graph.add_edge(textbook_title, chapter_id)  # Connect chapter to root

        for section in chapter["sections"]:
            section_id = f"{chapter_id} - Section {section['section_number']}"
            graph.add_node(section_id, node_type="section")
            graph.add_edge(chapter_id, section_id)  # Connect section to chapter

            # Add paragraphs as leaf nodes
            for i, paragraph in enumerate(section["content"], 1):
                sanitized_paragraph = sanitize_text(paragraph)
                paragraph_id = f"{section_id} - Paragraph {i}"
                graph.add_node(paragraph_id, node_type="paragraph", content=sanitized_paragraph)
                graph.add_edge(section_id, paragraph_id)  # Connect paragraph to section

    return graph

# Process all three books and save hierarchical trees
organized_files = [
    {"json_file": "book_e_organized.json", "title": "Book E"},
    {"json_file": "book_b_organized.json", "title": "Book B"},
    {"json_file": "book_d_organized.json", "title": "Book D"}
]

for book in organized_files:
    # Load the organized JSON
    with open(book["json_file"], "r", encoding="utf-8") as json_file:
        organized_hierarchy = json.load(json_file)

    # Build the hierarchical tree
    tree = build_hierarchical_tree(organized_hierarchy)

    # Save the tree as a GraphML file
    output_file = f"{book['title'].replace(' ', '_').lower()}_tree.graphml"
    nx.write_graphml(tree, output_file)
    print(f"Hierarchical tree for {book['title']} saved as {output_file}")


Hierarchical tree for Book E saved as book_e_tree.graphml
Hierarchical tree for Book B saved as book_b_tree.graphml
Hierarchical tree for Book D saved as book_d_tree.graphml
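
On whether `&`, `<` and `>` need to be escaped by hand before a graph is written to GraphML: XML serializers generally escape text themselves. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in (an assumption: `nx.write_graphml` is not called here, and the `to_xml` helper is hypothetical) to show what happens when already-escaped text is serialized.

```python
import xml.etree.ElementTree as ET

def to_xml(text):
    """Serialize text inside a <data> element, as an XML writer would."""
    node = ET.Element("data")
    node.text = text
    return ET.tostring(node, encoding="unicode")

raw = "Supply & demand: x < y"

# Manual escaping before serialization
pre_escaped = raw.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

print(to_xml(raw))          # the serializer escapes the raw text once
print(to_xml(pre_escaped))  # the pre-escaped text is escaped a second time
```

If the GraphML writer behaves the same way, paragraphs that were escaped in advance would read back with literal `&amp;` and `&lt;` sequences in their content.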


In [35]:
!pip install nltk
!pip install rank-bm25
!pip install sentence-transformers




In [36]:
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')  # WordNet for multilingual support

def expand_query(query):
    """
    Expands a user query by adding synonyms for each word in the query.

    Args:
        query (str): User query.

    Returns:
        list: Expanded list of query terms.
    """
    expanded_terms = set()
    tokens = word_tokenize(query)

    for token in tokens:
        expanded_terms.add(token)  # Add the original word

        # Add synonyms using WordNet
        for syn in wordnet.synsets(token):
            for lemma in syn.lemmas():
                expanded_terms.add(lemma.name().replace('_', ' '))  # Add synonym

    return list(expanded_terms)

# Example usage
user_query = "machine learning techniques"
expanded_query = expand_query(user_query)
print("Expanded Query Terms:", expanded_query)


Expanded Query Terms: ['machine', 'larn', 'watch', 'learnedness', 'instruct', 'motorcar', 'techniques', 'teach', 'pick up', 'car', 'get a line', 'see', 'get wind', 'political machine', 'learning', 'take', 'automobile', 'read', 'erudition', 'eruditeness', 'con', 'determine', 'scholarship', 'acquisition', 'check', 'technique', 'get word', 'ascertain', 'learn', 'memorize', 'study', 'encyclopedism', 'auto', 'hear', 'encyclopaedism', 'acquire', 'proficiency', 'memorise', 'discover', 'simple machine', 'find out']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
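
The expanded list above mixes case and contains multiword entries such as 'pick up' and 'get a line', while the BM25 corpus in the next cell is tokenized into lowercase single words. A minimal sketch of a bridging step, assuming the expanded terms are meant to be used as a BM25 query (the helper name `to_bm25_tokens` is hypothetical):

```python
def to_bm25_tokens(expanded_terms):
    """Lowercase, split multiword terms, and drop duplicates while keeping order."""
    seen = set()
    tokens = []
    for term in expanded_terms:
        for word in term.lower().split():
            if word not in seen:
                seen.add(word)
                tokens.append(word)
    return tokens

print(to_bm25_tokens(["machine", "pick up", "Machine", "get a line"]))
```

The resulting list could be passed to `bm25.get_scores` in place of a tokenized query. Since WordNet returns synonyms for every sense of a word (here "machine" also yields "motorcar" and "political machine"), some filtering of the expansion is probably worthwhile before retrieval.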


In [31]:
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize

# Example paragraphs for retrieval (You should replace this with actual data from the textbooks)
all_paragraphs = [
    {"id": 1, "content": "Machine learning is a branch of artificial intelligence."},
    {"id": 2, "content": "Deep learning is a subfield of machine learning."},
    {"id": 3, "content": "Natural language processing involves machine learning techniques."}
]

# Tokenize paragraphs
tokenized_corpus = [word_tokenize(paragraph['content'].lower()) for paragraph in all_paragraphs]
bm25 = BM25Okapi(tokenized_corpus)

def bm25_retrieve(query, top_k=5):
    """
    Retrieves top-k paragraphs using BM25.

    Args:
        query (str): User query.
        top_k (int): Number of top results to return.

    Returns:
        list: Top-k matching paragraphs.
    """
    tokenized_query = word_tokenize(query.lower())
    scores = bm25.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [all_paragraphs[i] for i in top_indices]

# Example usage
query = "neural networks"
bm25_results = bm25_retrieve(query)
print("BM25 Results:", bm25_results)


BM25 Results: [{'id': 1, 'content': 'Machine learning is a branch of artificial intelligence.'}, {'id': 2, 'content': 'Deep learning is a subfield of machine learning.'}, {'id': 3, 'content': 'Natural language processing involves machine learning techniques.'}]
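
The query "neural networks" shares no tokens with the three sample paragraphs, so all BM25 scores are presumably 0 and the stable sort simply returns the paragraphs in their original order. A minimal sketch of one way to avoid returning non-matches, assuming a plain list of scores (the helper `top_k_nonzero` and its `min_score` parameter are hypothetical):

```python
def top_k_nonzero(scores, top_k=5, min_score=0.0):
    """Return indices of the highest-scoring items, skipping scores <= min_score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > min_score][:top_k]

print(top_k_nonzero([0.0, 0.0, 0.0]))           # no paragraph matches the query
print(top_k_nonzero([0.2, 0.0, 0.9], top_k=2))  # only matching paragraphs are kept
```

Inside `bm25_retrieve`, the same filter could be applied to the array returned by `bm25.get_scores` before indexing into `all_paragraphs`.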


In [32]:
from sentence_transformers import SentenceTransformer, util

# Load the Sentence-BERT model and embed the corpus once so every query can reuse the vectors
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
paragraph_embeddings = model.encode(
    [paragraph["content"] for paragraph in all_paragraphs], convert_to_tensor=True
)

def dense_retrieve(query, top_k=5):
    """
    Retrieves top-k paragraphs using Sentence-BERT for dense retrieval.

    Args:
        query (str): User query.
        top_k (int): Number of top results to return.

    Returns:
        list: Top-k matching paragraphs.
    """
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, paragraph_embeddings)[0]

    # Ensure top_k doesn't exceed the number of available paragraphs
    top_k = min(top_k, len(all_paragraphs))

    # Convert the index tensor to plain ints before indexing the Python list
    top_indices = scores.topk(k=top_k).indices.tolist()
    return [all_paragraphs[i] for i in top_indices]

# Example usage
query = "deep learning applications"
dense_results = dense_retrieve(query)
print("Dense Results:", dense_results)


In [41]:
def hybrid_retrieve(query, bm25_weight=0.5, top_k=5):
    """
    Combines BM25 and Sentence-BERT results for hybrid retrieval.

    Args:
        query (str): User query.
        bm25_weight (float): Weight for BM25 scores in the hybrid method.
        top_k (int): Number of top results to return.

    Returns:
        list: Top-k matching paragraphs, re-ranked by hybrid scores.
    """
    # BM25 scores
    tokenized_query = word_tokenize(query.lower())
    bm25_scores = bm25.get_scores(tokenized_query)

    # Dense retrieval scores
    query_embedding = model.encode(query, convert_to_tensor=True)
    dense_scores = util.cos_sim(query_embedding, paragraph_embeddings)[0].cpu().numpy()

    # Hybrid scores
    hybrid_scores = bm25_weight * bm25_scores + (1 - bm25_weight) * dense_scores
    top_indices = sorted(range(len(hybrid_scores)), key=lambda i: hybrid_scores[i], reverse=True)[:top_k]
    return [all_paragraphs[i] for i in top_indices]

# Example usage
query = "neural networks and backpropagation"
hybrid_results = hybrid_retrieve(query)
print("Hybrid Results:", hybrid_results)


Hybrid Results: [{'id': 2, 'content': 'Deep learning is a subfield of machine learning.'}, {'id': 1, 'content': 'Machine learning is a branch of artificial intelligence.'}, {'id': 3, 'content': 'Natural language processing involves machine learning techniques.'}]
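
The hybrid score above adds raw BM25 scores to cosine similarities, and the two live on different scales: cosine similarity is bounded by 1, while BM25 scores are not. A minimal sketch of one common remedy, min-max normalizing each score list before weighting, written in plain Python (the helper names `min_max` and `hybrid_scores` are hypothetical, and this is not necessarily the scheme the notebook intends):

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_scores(bm25_scores, dense_scores, bm25_weight=0.5):
    """Weighted sum of the two score lists after putting them on the same scale."""
    sparse = min_max(list(bm25_scores))
    dense = min_max(list(dense_scores))
    return [bm25_weight * s + (1 - bm25_weight) * d for s, d in zip(sparse, dense)]

print(hybrid_scores([0.0, 5.0, 10.0], [0.2, 0.8, 0.5]))
```

With normalization, `bm25_weight` expresses the relative influence of the two retrievers more directly, since neither score list can dominate purely because of its magnitude.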


In [42]:
#Step4

In [43]:
!pip install openai




In [47]:
!pip install --upgrade openai


Collecting openai
  Downloading openai-1.59.7-py3-none-any.whl.metadata (27 kB)
Downloading openai-1.59.7-py3-none-any.whl (454 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.59.6
    Uninstalling openai-1.59.6:
      Successfully uninstalled openai-1.59.6
Successfully installed openai-1.59.7


In [13]:
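import json

def preprocess_text(book_file, output_file):
    """Preprocess a text file into a flat list of paragraphs and save it as JSON."""
    with open(book_file, 'r', encoding='utf-8') as f:
        content = f.read()

    # Split on blank lines and drop empty paragraphs
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    hierarchical_tree = {"paragraphs": paragraphs}  # Simple structure, could be expanded

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(hierarchical_tree, f)

# Process your books. Note: the extraction cell above writes textbookE.txt, textbookB.txt
# and textbookD.txt, so adjust these paths to match your files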
preprocess_text("book_e.txt", "book_e_tree.json")
preprocess_text("book_b.txt", "book_b_tree.json")
preprocess_text("book_d.txt", "book_d_tree.json")


In [14]:
from sentence_transformers import SentenceTransformer, util
import json

# Load model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def retrieve_relevant_content(query, top_k=3):
    """Retrieve the top-k paragraphs most similar to the query across all books."""
    paragraphs = []
    # Load the paragraph lists saved for each book
    for book_file in ["book_e_tree.json", "book_b_tree.json", "book_d_tree.json"]:
        with open(book_file, "r", encoding="utf-8") as f:
            hierarchical_tree = json.load(f)
        paragraphs.extend(hierarchical_tree["paragraphs"])

    # Create embeddings for the paragraphs (recomputed on every call)
    paragraph_embeddings = model.encode(paragraphs, convert_to_tensor=True)

    # Create query embedding
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Calculate cosine similarity between the query and every paragraph
    scores = util.cos_sim(query_embedding, paragraph_embeddings)[0]

    # Get the top-k paragraphs, never asking for more than exist
    top_k = min(top_k, len(paragraphs))
    top_k_indices = scores.topk(k=top_k).indices.tolist()
    relevant_paragraphs = [paragraphs[i] for i in top_k_indices]

    return relevant_paragraphs

# Example usage:
query = "What are the applications of deep learning?"
retrieved_content = retrieve_relevant_content(query)
print("Retrieved Content:", retrieved_content)


Retrieved Content: ['Introduction to Deep Learning.', 'Deep learning is a subfield of machine learning...', 'Neural networks form the backbone of deep learning...']
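
`retrieve_relevant_content` reloads the JSON files and re-embeds every paragraph on each call, which becomes slow for full textbooks. A minimal sketch of embedding once and reusing the vectors follows. To keep it self-contained it uses a toy keyword-count encoder (`toy_encode` and `VOCAB` are hypothetical stand-ins); in the notebook, the `encode` argument would presumably wrap `model.encode`.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class ParagraphIndex:
    """Embeds paragraphs once and reuses the vectors for every query."""

    def __init__(self, paragraphs, encode):
        self.paragraphs = list(paragraphs)
        self.encode = encode                    # maps a list of strings to a list of vectors
        self.vectors = encode(self.paragraphs)  # computed a single time

    def search(self, query, top_k=3):
        query_vector = self.encode([query])[0]
        scored = [(_cosine(query_vector, v), p) for v, p in zip(self.vectors, self.paragraphs)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [p for _, p in scored[:top_k]]

# Toy stand-in encoder: counts a few keywords and records how many texts it was given
VOCAB = ["deep", "learning", "sorting", "climate"]
calls = []

def toy_encode(texts):
    calls.append(len(texts))
    return [[t.lower().count(w) for w in VOCAB] for t in texts]

index = ParagraphIndex(
    [
        "Deep learning is a subfield of machine learning.",
        "Sorting algorithms order the elements of a list.",
        "Climate shapes every ecosystem.",
    ],
    toy_encode,
)
print(index.search("deep learning applications", top_k=1))
print(index.search("climate change", top_k=1))
print(calls)  # the corpus is encoded once; each query encodes a single text
```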


In [19]:
#Step4.3

In [20]:
pip install transformers torch




In [21]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"  # You can switch to "gpt2-medium", "gpt2-large", or "gpt2-xl" for more power
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

def generate_answer_hf(query, retrieved_content):
    """
    Generate an answer using Hugging Face GPT-2 model.

    Parameters:
        query (str): The user's question.
        retrieved_content (str): The relevant content retrieved from the database.

    Returns:
        str: The generated answer.
    """
    # Prepare the input prompt
    prompt = f"Given the following context, answer the query.\n\nContext:\n{retrieved_content}\n\nQuery: {query}\n\nAnswer:"

    # Tokenize input
    inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=1024, truncation=True)

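    # Generate a response. As written this is greedy decoding: top_p and temperature
    # only take effect when do_sample=True, and max_length counts the prompt tokens too
    output = model.generate(inputs, max_length=150, num_return_sequences=1, no_repeat_ngram_size=2, top_p=0.95, temperature=0.7)

    # Decode the full sequence, which contains the prompt followed by the continuation
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer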

# Example usage
query = "What are the applications of deep learning?"
retrieved_content = "Deep learning is widely used in computer vision, natural language processing, and robotics."
answer = generate_answer_hf(query, retrieved_content)
print("Generated Answer:", answer)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated Answer: Given the following context, answer the query.

Context:
Deep learning is widely used in computer vision, natural language processing, and robotics.

Query: What are the applications of deep learning?

Answer: Deep learning has been used to improve the accuracy of human speech recognition. It has also been applied to the detection of disease and to detect disease-related behaviors. Deep neural networks are used for the processing of complex data. They are also used as a means of learning and for learning about the world around us. In this context we will focus on the application of Deep Learning to human language recognition and the use of neural network to learn about human behavior. The following examples illustrate the neural connections between the two.


Example
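
The generated answer above starts with the prompt itself, because the decoded sequence includes the input tokens as well as the continuation. A minimal sketch of a post-processing step that keeps only the continuation (the helper `extract_answer` is hypothetical and works on plain strings, so it does not depend on the model):

```python
def extract_answer(generated_text, prompt, marker="Answer:"):
    """Strip the echoed prompt; fall back to the text after the first marker."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()

prompt = "Context:\nSome context.\n\nQuery: What is it?\n\nAnswer:"
decoded = prompt + " Deep learning is used in vision and language tasks."
print(extract_answer(decoded, prompt))
```

An alternative, assuming the token tensors are at hand, is to decode only the tokens that follow the prompt length in the generated sequence.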


In [22]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import json

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Function to generate answer using GPT-2
def generate_answer_hf(query, retrieved_content):
    prompt = f"Given the following context, answer the query.\n\nContext:\n{retrieved_content}\n\nQuery: {query}\n\nAnswer:"
    inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=1024, truncation=True)
    output = model.generate(inputs, max_length=150, num_return_sequences=1, no_repeat_ngram_size=2, top_p=0.95, temperature=0.7)
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer

# Simulate retrieving content (replace with your actual retrieval logic)
retrieved_content = "Deep learning is a subset of machine learning in which algorithms learn from large amounts of data."
query = "What is deep learning?"

# Generate the answer
answer = generate_answer_hf(query, retrieved_content)
print("Generated Answer:", answer)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Answer: Given the following context, answer the query.

Context:
Deep learning is a subset of machine learning in which algorithms learn from large amounts of data.

Query: What is deep learning?

Answer: Deep learning refers to the process of learning from a large set of information. It is the most common form of deep neural network training. Deep neural networks are a type of neural net that can be trained on a set number of inputs. The inputs are the same as the outputs, and the training process is similar to that of a traditional neural nets. In this case, the input data is an image, a text, or a video. A deep network can learn a lot from the data, but it can also learn more
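
The cell above uses a hard-coded string for `retrieved_content`, whereas `retrieve_relevant_content` returns a list of paragraphs. A minimal sketch of one way to connect the two: join the top passages into a single context under a size budget and fill the same prompt template. The helpers `build_context` and `build_prompt` are hypothetical, and the character budget is only a rough proxy for GPT-2's token limit.

```python
def build_context(passages, max_chars=2000):
    """Join passages in rank order, stopping before the character budget is exceeded."""
    selected = []
    used = 0
    for passage in passages:
        passage = passage.strip()
        if not passage:
            continue
        cost = len(passage) + (1 if selected else 0)  # +1 for the joining newline
        if used + cost > max_chars:
            break
        selected.append(passage)
        used += cost
    return "\n".join(selected)

def build_prompt(query, passages, max_chars=2000):
    """Fill the notebook's prompt template with the joined passages."""
    context = build_context(passages, max_chars)
    return (
        "Given the following context, answer the query.\n\n"
        f"Context:\n{context}\n\nQuery: {query}\n\nAnswer:"
    )

passages = [
    "Deep learning is a subfield of machine learning.",
    "Neural networks form the backbone of deep learning.",
]
print(build_prompt("What is deep learning?", passages))
```

Passages are assumed to arrive sorted by relevance, so the budget drops the least relevant ones first.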


In [24]:
pip install nltk rouge-score


Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=529f7783fc716354121a00ef5703761efecfb3f91e32cf91cc3c5057a550d9cd
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [25]:
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

# Define reference and generated answer (for testing)
reference = ["Deep learning is a subset of machine learning that teaches computers to learn from large amounts of data."]
generated_answer = "Deep learning refers to the process of learning from a large set of information."

# BLEU score
bleu_score = sentence_bleu([reference], generated_answer.split())
print(f"BLEU score: {bleu_score}")

# ROUGE score
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(' '.join(reference), generated_answer)
print(f"ROUGE score: {scores}")


BLEU score: 0
ROUGE score: {'rouge1': Score(precision=0.6428571428571429, recall=0.5, fmeasure=0.5625000000000001), 'rouge2': Score(precision=0.15384615384615385, recall=0.11764705882352941, fmeasure=0.13333333333333333), 'rougeL': Score(precision=0.5, recall=0.3888888888888889, fmeasure=0.43750000000000006)}
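
The BLEU score of exactly 0 above is most likely an artifact of the input shape rather than a measure of quality: `sentence_bleu` expects a list of references, each a list of tokens, but the call passes the whole reference sentence as a single token. A minimal sketch with tokenized inputs follows. Lowercasing, whitespace tokenization and the choice of `method1` smoothing are assumptions made for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Deep learning is a subset of machine learning that teaches computers to learn from large amounts of data."
generated_answer = "Deep learning refers to the process of learning from a large set of information."

# sentence_bleu expects a list of references, each one a list of tokens
references = [reference.lower().split()]
hypothesis = generated_answer.lower().split()

# Smoothing avoids a collapse to zero when some higher-order n-grams have no overlap
smooth = SmoothingFunction().method1
bleu = sentence_bleu(references, hypothesis, smoothing_function=smooth)
print(f"BLEU score: {bleu:.4f}")
```

Short answers rarely share 3-grams or 4-grams with a reference, so sentence-level BLEU tends to be low here even with smoothing; the ROUGE figures above are probably the more informative of the two.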


In [26]:
queries = [
    "What is machine learning?",
    "Explain the applications of deep learning in healthcare.",
    "What is the difference between supervised and unsupervised learning?",
    "What are the limitations of neural networks?"
]

for query in queries:
    answer = generate_answer_hf(query, retrieved_content)
    print(f"Query: {query}")
    print(f"Answer: {answer}\n")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What is machine learning?
Answer: Given the following context, answer the query.

Context:
Deep learning is a subset of machine learning in which algorithms learn from large amounts of data.

Query: What is machine learning?

Answer: Machine learning refers to the process of learning from data to a set of algorithms. Machine Learning is the processing of information in a way that is similar to human intelligence. It is an approach to machine intelligence that has been developed by many researchers and is used by some of the world's most advanced AI companies. The term is often used to describe the way in-depth knowledge of a system is developed. In this context machine-learning refers not only to learning but also to how the system learns. This is why machine learners are





Query: Explain the applications of deep learning in healthcare.
Answer: Given the following context, answer the query.

Context:
Deep learning is a subset of machine learning in which algorithms learn from large amounts of data.

Query: Explain the applications of deep learning in healthcare.

Answer: The following is an example of a deep neural network. It is based on a neural net, but it is not a machine. The neural nets are based upon a set of algorithms that are trained on the data and then used to perform a task. In this case, the neural networks are a combination of the two. They are not the same as the machine, and they are different from each other. For example, a human can learn to read a book by reading a sentence. A machine can also





Query: What is the difference between supervised and unsupervised learning?
Answer: Given the following context, answer the query.

Context:
Deep learning is a subset of machine learning in which algorithms learn from large amounts of data.

Query: What is the difference between supervised and unsupervised learning?

Answer: The difference is that supervised learning involves a large amount of training data, whereas unstructured data is much more difficult to train. Unstructuring data can be very difficult, and it is often difficult for a trained algorithm to learn. In this context the term supervised is used to refer to the process of learning from a set of large data sets. The term un-super-vised is also used in this sense. It refers to a process that involves training a small set, but not

Query: What are the limitations of neural networks?
Answer: Given the following context, answer the query.

Context:
Deep learning is a subset of machine learning in which algorithms learn from la