# Google Colab Configuration

**Execute these steps to configure the Google Colab environment before running this notebook. They are not required if you are running it locally and have already configured your local environment as explained in the GitHub repository.**
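As a minimal sketch of how to decide whether the setup cell is needed, the check below looks for the `google.colab` module among the modules already loaded. The helper name `running_in_colab` is hypothetical, and the check assumes that a Colab kernel preloads `google.colab`, which is a common idiom rather than a documented guarantee.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab module is already loaded."""
    return "google.colab" in sys.modules


if running_in_colab():
    print("Colab detected: run the setup cell.")
else:
    print("Local environment detected: the setup cell can be skipped.")
```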

In [None]:
# @title Colab Setup

repository_name = "NLP-MBD-EN"
repository_url = 'https://github.com/acastellanos-ie/' + repository_name

print("### Cloning the Repository ###")
! git clone $repository_url
print()

print("### Installing requirements ###")
! pip3 install -Uqqr $repository_name/requirements.txt

%cd $repository_name/qa_practice_dl

### Cloning the Repository ###
Cloning into 'NLP-MBD-EN'...
remote: Enumerating objects: 4524, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 4524 (delta 47), reused 7 (delta 7), pack-reused 4447 (from 3)
Receiving objects: 100% (4524/4524), 14.84 MiB | 14.50 MiB/s, done.
Resolving deltas: 100% (177/177), done.

### Installing requirements ###

# Building an End-to-End Question-Answering System With BERT



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import ktrain
from ktrain import text

### STEP 1:  Index the Documents

We will first index the documents into a search engine that will be used to quickly retrieve documents that are likely to contain answers to a question. To do so, we must choose an index location, which must be a folder that does not already exist.
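The requirement that the index folder must not already exist can be sketched as a small helper. `fresh_index_path` is a hypothetical name, not part of ktrain; it only prepares a path, and the index itself is still created by the library.

```python
import os
import shutil
import tempfile


def fresh_index_path(parent: str, name: str = "myindex") -> str:
    """Return a path under `parent` that does not exist when the call returns."""
    path = os.path.join(parent, name)
    if os.path.exists(path):
        # Discard a stale index directory left behind by an earlier run
        shutil.rmtree(path)
    return path


# Usage with a throwaway parent directory
parent_dir = tempfile.mkdtemp()
index_path = fresh_index_path(parent_dir)
print(os.path.exists(index_path))
```

Removing the stale directory is a deliberate choice for a notebook that is re-run often; for an index that is expensive to build, raising an error instead may be safer.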

In [None]:
import shutil
import os
!pip install textract==1.6.3


Collecting textract==1.6.3
  Downloading textract-1.6.3-py3-none-any.whl.metadata (2.5 kB)
Collecting argcomplete==1.10.0 (from textract==1.6.3)
  Downloading argcomplete-1.10.0-py2.py3-none-any.whl.metadata (16 kB)
Collecting beautifulsoup4==4.8.0 (from textract==1.6.3)
  Downloading beautifulsoup4-4.8.0-py3-none-any.whl.metadata (3.0 kB)
Collecting chardet==3.0.4 (from textract==1.6.3)
  Downloading chardet-3.0.4-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting docx2txt==0.8 (from textract==1.6.3)
  Downloading docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done
Collecting EbookLib==0.17.1 (from textract==1.6.3)
  Downloading EbookLib-0.17.1.tar.gz (111 kB)
  Preparing metadata (setup.py) ... done
Collecting extract-msg==0.23.1 (from textract==1.6.3)
  Downloading extract_msg-0.23.1-py2.py3-none-any.whl.metadata (7.

In [None]:
# Clear the directory if it exists
index_dir = '/tmp/myindex'
if os.path.exists(index_dir):
    shutil.rmtree(index_dir)


text.SimpleQA.initialize_index(index_dir)
# Define the folder containing documents to be indexed
docs_folder = '/content/drive/Shared drives/NLP/RAG_Documents/QA/' # Replace with the path to your folder with text files

# Index documents
text.SimpleQA.index_from_folder(
    docs_folder,       # Folder containing the documents to index
    index_dir,         # Directory where the index will be stored
    commit_every=1,    # Commit after each document
    use_text_extraction=True  # Extract text from non-plain-text files such as .docx
)

1 docs indexed
2 docs indexed


### STEP 2: Create a QA instance

Next, we create a QA instance.  This step will automatically download the BERT SQuAD model if it does not already exist on your system.

In [None]:
qa = text.SimpleQA(index_dir)

That's it!  In roughly **3 lines of code**, we have built an end-to-end QA system that can now be used to generate answers to questions.  Let's ask our system some questions.

### STEP 3:  Ask Questions


In [None]:
answers = qa.ask('when are you open?')

# qa.ask can return an empty list, so check before indexing into it
if answers:
    # Access the 'full_answer' key of the top-ranked answer
    full_answer = answers[0]['full_answer']

    # The indexed passages look like "q : ... a : ...", so keep the text after 'a :'
    extracted_answer = full_answer.split('a :', 1)[-1].strip().lower()

    # Print the extracted answer
    print(extracted_answer)
else:
    print("No answer found.")

yes, we have a patio area that is open during the spring and summer months, weather permitting.
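The splitting step above relies on the indexed passages following a `q : ... a : ...` layout. A minimal sketch of a more defensive parser, assuming that layout, is shown here. `parse_qa_passage` is a hypothetical helper; it uses a word boundary so that a word ending in "a" followed by " :" (for example "pasta :") is not mistaken for the answer marker.

```python
import re

# Matches the standalone token "a" followed by " :", not the end of a longer word
_ANSWER_MARKER = re.compile(r"\ba :")


def parse_qa_passage(passage: str):
    """Split a 'q : ... a : ...' passage into (question, answer).

    Returns (None, passage) when no answer marker is present.
    """
    text = passage.strip()
    parts = _ANSWER_MARKER.split(text, maxsplit=1)
    if len(parts) == 1:
        return None, text
    question = parts[0].strip()
    if question.startswith("q :"):
        question = question[len("q :"):].strip()
    return question, parts[1].strip()


question, answer = parse_qa_passage(
    "q : do you have a patio ? a : yes, we have a patio area."
)
print(question)
print(answer)
```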


In [None]:
answers = qa.ask('where are you located')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | at calle de maria de molina 31 bis, chamartin, madrid 28006 | q : where are you located ? a : we are located at calle de maria de molina 31 bis, chamartin, madrid 28006 , close to avenida de america metro station. | 1.0 | RAG_Q&A.docx |
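The confidence column suggests one way to avoid returning weak answers to questions the documents do not cover. The sketch below filters answers by a threshold. It assumes each answer is a dictionary with a numeric `confidence` key, which is not shown in this notebook. The helper name `confident_answers` and the default of 0.5 are illustrative choices, not library defaults.

```python
def confident_answers(answers, min_confidence=0.5):
    """Keep only answers whose 'confidence' value reaches the threshold."""
    return [a for a in answers if a.get("confidence", 0.0) >= min_confidence]


# Stand-in data shaped like the assumed answer dictionaries
sample_answers = [
    {"full_answer": "q : where are you located ? a : ...", "confidence": 1.0},
    {"full_answer": "q : do you have a patio ? a : ...", "confidence": 0.1},
]
kept = confident_answers(sample_answers)
print(len(kept))
```

With the real system, the same function could be applied to the list returned by `qa.ask(...)` before displaying or extracting the top answer.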


In [None]:
import pickle

# Save the SimpleQA object to a pickle file
pickle_file_path = "simpleqa_model.pkl"

with open(pickle_file_path, "wb") as f:
    pickle.dump(qa, f)

print(f"SimpleQA model saved as '{pickle_file_path}'")




SimpleQA model saved as 'simpleqa_model.pkl'
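To reuse the saved object later, it has to be unpickled again. The sketch below shows the round trip with a plain dictionary standing in for the `qa` object, since loading the real one requires ktrain to be installed. It is an assumption that the restored `SimpleQA` object also needs the original index directory (`/tmp/myindex` here) to still exist, because the pickle file holds the Python object and not the files on disk. `save_object` and `load_object` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize `obj` to `path` with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object. Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the real `qa` object
stand_in = {"index_dir": "/tmp/myindex"}
pickle_path = os.path.join(tempfile.mkdtemp(), "simpleqa_model.pkl")
save_object(stand_in, pickle_path)
restored = load_object(pickle_path)
print(restored == stand_in)
```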


In [None]:
from google.colab import files

# Download the pickle file
files.download('simpleqa_model.pkl')



# Evaluate the model

In [None]:
from google.colab import files
import json

# Upload the file
uploaded = files.upload()  # This will prompt you to upload files from your local machine

# Access the file name (key of the uploaded dictionary)
file_name = list(uploaded.keys())[0]

# Load the JSON file
with open(file_name, "r") as f:
    test_set = json.load(f)

# Print the loaded data to verify
print(test_set)


Saving paraphrased_questions_with_original.json to paraphrased_questions_with_original.json
[{'original_question': 'What are your hours of operation?', 'paraphrased_question': 'When are you open?'}, {'original_question': 'What are your hours of operation?', 'paraphrased_question': 'Can you tell me your opening hours?'}, {'original_question': 'What are your hours of operation?', 'paraphrased_question': 'What time do you open and close?'}, {'original_question': 'What are your hours of operation?', 'paraphrased_question': 'When is the restaurant operating?'}, {'original_question': 'What are your hours of operation?', 'paraphrased_question': 'What is your schedule during the week?'}, {'original_question': 'What are your hours of operation?', 'paraphrased_question': 'What time can I visit on weekdays?'}, {'original_question': 'What are your hours of operation?', 'paraphrased_question': 'At what hours can we dine in?'}, {'original_question': 'What are your hours of operation?', 'paraphrased_

In [None]:
for item in test_set:
    print(item.keys())
    break  # Print only the first item's keys


dict_keys(['original_question', 'paraphrased_question'])
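Checking only the first item does not guarantee that every item has both keys. As a minimal sketch, the hypothetical helper `find_invalid_items` scans the whole list and reports which items are malformed, so a bad entry surfaces before the evaluation loop runs.

```python
REQUIRED_KEYS = {"original_question", "paraphrased_question"}


def find_invalid_items(test_set):
    """Return (index, missing_keys) pairs for items lacking a required key."""
    problems = []
    for i, item in enumerate(test_set):
        if not isinstance(item, dict):
            # A non-dictionary item is missing every required key
            problems.append((i, sorted(REQUIRED_KEYS)))
            continue
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems


sample_test_set = [
    {"original_question": "What are your hours of operation?",
     "paraphrased_question": "When are you open?"},
    {"original_question": "What are your hours of operation?"},
]
print(find_invalid_items(sample_test_set))
```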


In [None]:
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_simpleqa_with_df(qa, test_set):
    """
    Evaluates the SimpleQA system by comparing answers for both original and paraphrased questions.

    Parameters:
        qa: The SimpleQA instance.
        test_set: A list of dictionaries with keys `original_question` and `paraphrased_question`.

    Returns:
        results_df: A DataFrame containing evaluation results.
        metrics: A dictionary with evaluation metrics (accuracy, precision, recall, f1_score).
    """
    # Initialize an empty list to store the results
    results = []

    def extract_answer(answers):
        """Return the text after 'a :' in the top answer, or "" if unavailable."""
        if not answers:
            return ""
        full_answer = answers[0].get('full_answer', '')
        if 'a :' not in full_answer:
            return ""
        return full_answer.split('a :', 1)[1].strip().lower()

    for item in test_set:
        # Extract questions
        original_question = item["original_question"].strip().lower()
        paraphrased_question = item["paraphrased_question"].strip().lower()

        # Get the predicted answers for both questions
        original_answers = qa.ask(original_question)
        paraphrased_answers = qa.ask(paraphrased_question)

        # Extract the predicted answers (empty string if none is available)
        original_predicted_answer = extract_answer(original_answers)
        paraphrased_predicted_answer = extract_answer(paraphrased_answers)

        # Check if the answers for both questions match
        is_correct = original_predicted_answer == paraphrased_predicted_answer

        # Append the data to the results list
        results.append({
            "original_question": original_question,
            "paraphrased_question": paraphrased_question,
            "original_predicted_answer": original_predicted_answer,
            "paraphrased_predicted_answer": paraphrased_predicted_answer,
            "is_correct": is_correct
        })

    # Create a DataFrame
    results_df = pd.DataFrame(results)

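    # Calculate metrics
    # Note: "is_correct" measures agreement between the two predicted answers,
    # not correctness against a gold answer. Because every expected label is 1,
    # recall equals accuracy and precision is 1.0 whenever any pair agrees.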
    total_questions = len(results_df)
    correct_answers = results_df["is_correct"].sum()
    all_expected = [1] * total_questions
    all_predictions = results_df["is_correct"].astype(int).tolist()

    accuracy = correct_answers / total_questions
    precision = precision_score(all_expected, all_predictions, zero_division=0)
    recall = recall_score(all_expected, all_predictions, zero_division=0)
    f1 = f1_score(all_expected, all_predictions, zero_division=0)

    # Add metrics to the output
    metrics = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }

    return results_df, metrics
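The evaluation above counts a pair as matching only when the two answer strings are identical. A softer comparison, sketched here, is a token-overlap F1 in the style commonly used for extractive QA. This is not part of the notebook's method; `token_f1` is a hypothetical helper, and tokenization by whitespace is a simplifying assumption.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings, using whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings agree fully; one empty string agrees with nothing
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("we open at nine", "we open at ten"))
```

Applied to `original_predicted_answer` and `paraphrased_predicted_answer`, a score like this would give partial credit to near-identical answers instead of treating them as complete mismatches.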


In [None]:
results_df, metrics = evaluate_simpleqa_with_df(qa, test_set)

# Display Results
print("Results DataFrame:")
print(results_df)

print("\nEvaluation Metrics:")
print(metrics)


Results DataFrame:
                         original_question  \
0        what are your hours of operation?   
1        what are your hours of operation?   
2        what are your hours of operation?   
3        what are your hours of operation?   
4        what are your hours of operation?   
..                                     ...   
75  can i cancel or modify my reservation?   
76  can i cancel or modify my reservation?   
77  can i cancel or modify my reservation?   
78  can i cancel or modify my reservation?   
79  can i cancel or modify my reservation?   

                                 paraphrased_question  \
0                                  when are you open?   
1                 can you tell me your opening hours?   
2                    what time do you open and close?   
3                   when is the restaurant operating?   
4              what is your schedule during the week?   
..                                                ...   
75    what's the process for 
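Because each original question appears with several paraphrases, a per-question breakdown can show which topics are answered inconsistently. The sketch below is one way to compute it from a list of row dictionaries, for example the result of `results_df.to_dict("records")`. `consistency_by_question` is a hypothetical helper and the sample rows are stand-ins, not results from this notebook.

```python
from collections import defaultdict


def consistency_by_question(rows):
    """Return the share of matching paraphrases for each original question."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for row in rows:
        question = row["original_question"]
        totals[question] += 1
        matches[question] += int(bool(row["is_correct"]))
    return {question: matches[question] / totals[question] for question in totals}


# Stand-in rows shaped like the results DataFrame
sample_rows = [
    {"original_question": "q1", "is_correct": True},
    {"original_question": "q1", "is_correct": False},
    {"original_question": "q2", "is_correct": True},
]
print(consistency_by_question(sample_rows))
```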

In [None]:
from google.colab import files

# Save the file in the Colab environment
results_df.to_excel("evaluation_results.xlsx", index=False)

# Download the file
files.download("evaluation_results.xlsx")



In [None]:
answers = qa.ask('recommend me dish with pasta and fresh')
print(answers)

# qa.ask returns an empty list when nothing is found, so check before indexing
if answers:
    qa.display_answers(answers[:5])
    # 'Candidate Answer' is only a column header in display_answers;
    # the answer dictionaries expose 'full_answer', as used in STEP 3
    best_answer = answers[0]['full_answer']
    print(best_answer)
else:
    print("No answer found for this question.")

[]
No answer found for this question.

In [None]:
answers = qa.ask('where is the restaurant located?')
qa.display_answers(answers[:1])

In [None]:
answers = qa.ask('WHAT IS MY NAME?')
qa.display_answers(answers[:1])