# Introduction
## About Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.

In its simplest form, RAG requires these steps:

1. Extract knowledge base passages from documents (once).
2. Create vector embedding representations of each passage in the knowledge base.
3. Receive a question from the end user and generate a vector embedding for it.
4. Retrieve relevant passage(s) from the knowledge base (for every user query) using vector similarity search.
5. Generate a response by feeding the retrieved passage(s) into a large language model (for every user query).
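
The steps above can be sketched end to end in a few lines. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the two passages are made up for the example, and `build_prompt` is a hypothetical helper. The final call to a language model is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Extract passages and embed each of them (done once)
passages = [
    "Boise is the capital of Idaho.",
    "Shaquille O'Neal played center for the Orlando Magic.",
]
index = [(p, embed(p)) for p in passages]

def retrieve(question, index, k=1):
    """Per query: embed the question and return the k most similar passages."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]

def build_prompt(context, question):
    """Per query: assemble the text that would be sent to the language model."""
    return f"{context}\n\nAnswer the question using the text above.\nQuestion: {question}"

question = "What is the capital of Idaho?"
top_passages = retrieve(question, index, k=1)
prompt = build_prompt("\n\n".join(top_passages), question)
print(prompt)
```

In the rest of this notebook the toy `embed` is replaced by a real embedding model and the in-memory `index` by a vector database.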

## Embeddings and Vector Databases
The current state-of-the-art in RAG is to create dense vector representations of the knowledge base in order to calculate the semantic similarity to a given user query.

We can generate dense vector representations using embedding models. In this notebook, we use SentenceTransformers all-MiniLM-L6-v2 to embed both the knowledge base passages and user queries. all-MiniLM-L6-v2 is a performant open-source model that is small enough to run locally.

A vector database is optimized for dense vector indexing and retrieval. This notebook uses Chroma, a user-friendly open-source vector database, licensed under Apache 2.0, which offers good speed and performance with the all-MiniLM-L6-v2 embedding model.

To generate the final response to a query based on the retrieved passage, we leverage an open-source model, Flan-UL2 (20B), and include a prompt

### About the example dataset
The dataset used in this cookbook is a subset of nq_open, an open-source question answering dataset based on contents from Wikipedia. The selected subset includes the gold standard passages to answer the queries in the dataset, which enables evaluating the retrieval quality.

You can select one of the two datasets available:

1. *nq910* - an information retrieval (a.k.a. search) dataset extracted from Google's Natural Questions dataset.
2. *LongNQ* - an end-to-end retrieval and answer dataset extracted from the same NQ dataset, but focused more on abstractive, longer-form question answering. The answers were modified for fluency by IBM Research.

These datasets are available in the data assets.

### Limitations
Because we use a locally hosted embedding model, data ingestion and querying can be slow.

### Cookbook Structure
1. Set-up dependencies
2. Index knowledge base
3. Generate a retrieval-augmented response
4. Evaluate RAG performance on your data


#### Disclaimer
The IBM GenAI Python library used in this notebook is currently in Beta and will change in the future.

#### 1.1. Install the required dependencies

Note that `ibm-generative-ai` requires `python>=3.9`. Ensure these prerequisites are met before using this notebook.
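
One way to check the Python prerequisite before installing anything is sketched below. The helper name `meets_minimum` and the constant `MIN_PYTHON` are hypothetical, introduced only for this illustration.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated above

def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

if not meets_minimum():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
          f"found {sys.version.split()[0]}")
```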

In [1]:
!pip install chromadb==0.4.5
!pip install ibm-watson-machine-learning==1.0.311
!pip install langchain==0.0.261
!pip install rouge==1.0.1
!pip install sentence-transformers==2.2.2
!pip install wget



In [8]:
import os
from getpass import getpass
from typing import Optional, Any, Iterable, List

import wget
import pandas as pd
import chromadb
from langchain.vectorstores import Chroma
from sentence_transformers import SentenceTransformer
from chromadb.api.types import EmbeddingFunction
from rouge import Rouge

from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams

#### 1.3. Load credentials for `ibm-watson-machine-learning`


```
API_KEY=<your-api_key>
IBM_CLOUD_URL=<your-url>
PROJECT_ID=<your-project_id>
```
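
If the three values above are exported as environment variables, they could be read and validated along the following lines. This is a sketch, not part of the library: `load_credentials` and `REQUIRED_KEYS` are hypothetical names, and the returned dictionary simply mirrors the `url`/`apikey` layout used in the next cell.

```python
import os

REQUIRED_KEYS = ("API_KEY", "IBM_CLOUD_URL", "PROJECT_ID")

def load_credentials(env=os.environ):
    """Read the three settings from `env` and fail early if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    creds = {"url": env["IBM_CLOUD_URL"], "apikey": env["API_KEY"]}
    return creds, env["PROJECT_ID"]
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the model client.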

In [5]:
IBM_CLOUD_API_KEY = getpass("Enter your IBM Cloud API Key: ")
IBM_CLOUD_URL = os.getenv("RUNTIME_ENV_APSX_URL", "https://us-south.ml.cloud.ibm.com")
PROJECT_ID = os.getenv("PROJECT_ID")

# Credentials for the Watson Machine Learning client
wml_creds = {
    "url": IBM_CLOUD_URL,
    "apikey": IBM_CLOUD_API_KEY
}

Enter your IBM Cloud API Key: ········


## 2. Index knowledge base

### 2.1. Load data

Select one of the two datasets available:
1. *nq910* - an Information Retrieval (a.k.a. search) data set extracted from Google's Natural Questions dataset.
2. *LongNQ* - an end-to-end retrieval and answer dataset extracted from the same NQ dataset, but focused more on abstractive question answering.

These datasets are provided under the /data directory.

In [6]:
data_download_paths = {
    'output.csv':'https://raw.githubusercontent.com/rich-nieto-ibm/techexchange_watsonx_workshop/main/output.csv',
    'questions.csv':'https://raw.githubusercontent.com/rich-nieto-ibm/techexchange_watsonx_workshop/main/questions.csv'
}

for file, url in data_download_paths.items():
    # Download each file only if it is not already present locally
    if not os.path.isfile(file):
        wget.download(url, out=file)
    if not os.path.isfile(file):
        raise IOError(f"Failed to download {file}")

In [21]:
questions = pd.read_csv("./questions.csv").head(3000)
documents = pd.read_csv("./output.csv").head(3000)

In [22]:
dataset = 'LongNQ'

In [23]:
documents['indextext'] = documents['title'].astype(str) + "\n" + documents['text']

#### 2.2. Create an embedding function

Note that you can feed a custom embedding function to be used by chromadb. The performance of chromadb may differ depending on the embedding model used.
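
To illustrate what a custom embedding function looks like, here is a toy sketch. It assumes the interface is simply a callable that maps a list of strings to a list of equal-length float vectors; the exact parameter name and base class expected by chromadb depend on the installed version, so check its documentation before plugging this in. `HashingEmbedder` is a hypothetical class and is not a useful semantic model.

```python
import hashlib
import math

class HashingEmbedder:
    """Hypothetical toy embedder: hashes tokens into a fixed-size, L2-normalised vector."""

    def __init__(self, dim=16):
        self.dim = dim

    def __call__(self, texts):
        # One vector per input text, in the same order
        return [self._embed(text) for text in texts]

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % self.dim] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

embedder = HashingEmbedder(dim=16)
vectors = embedder(["History of Idaho", "Ranjit Singh"])
```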

In [58]:
from langchain.embeddings import HuggingFaceEmbeddings

In [59]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

#### 2.3. Set up Chroma upsert
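Upserting a document means inserting it if it is new and updating it if it already exists in the database. A plain insert, by contrast, throws an error when the document ID is already present. Upserting is therefore convenient for experimentation, when the same cells are re-run several times.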
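
The difference between inserting and upserting can be shown with a small in-memory stand-in. `TinyStore` is hypothetical and is not the Chroma API; it only mimics the behaviour described above.

```python
class TinyStore:
    """Hypothetical in-memory stand-in for a document collection (not the Chroma API)."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, text):
        # Plain insert: refuses to overwrite an existing ID
        if doc_id in self._docs:
            raise ValueError(f"ID {doc_id!r} already exists")
        self._docs[doc_id] = text

    def upsert(self, doc_id, text):
        # Insert when new, update when the ID is already present
        self._docs[doc_id] = text

    def get(self, doc_id):
        return self._docs[doc_id]

store = TinyStore()
store.add("0", "first version")
store.upsert("0", "second version")  # re-running is safe
```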

In [60]:
PERSIST_DIR = './storage'
os.makedirs(PERSIST_DIR, exist_ok=True)
chroma_client = chromadb.PersistentClient(PERSIST_DIR)

In [26]:
documents_ = documents.indextext.tolist()
ids = [str(x) for x in documents.index.tolist()]

In [29]:
from langchain.document_loaders.csv_loader import CSVLoader

In [39]:
documents

Unnamed: 0,id,text,title,indextext
0,1,History of Idaho - wikipedia History of Idaho ...,History of Idaho,History of Idaho\nHistory of Idaho - wikipedia...
1,2,"1957 . Location Cataldo , Idaho Built 1848 Arc...",History of Idaho,"History of Idaho\n1957 . Location Cataldo , Id..."
2,3,"of the Columbia was created in June 1816 , and...",History of Idaho,History of Idaho\nof the Columbia was created ...
3,4,"Canyon , he concluded that water transport was...",History of Idaho,"History of Idaho\nCanyon , he concluded that w..."
4,5,"1842 , Father Pierre - Jean De Smet , with Fr....",History of Idaho,"History of Idaho\n1842 , Father Pierre - Jean ..."
...,...,...,...,...
2995,2996,"by Punjabi University , ISBN 81 - 7380 - 778 -...",Ranjit Singh,"Ranjit Singh\nby Punjabi University , ISBN 81 ..."
2996,2997,"National Book Shop , 1994 . ISBN 81 - 7116 - 1...",Ranjit Singh,"Ranjit Singh\nNational Book Shop , 1994 . ISBN..."
2997,2998,e-Punjab Maharaja Ranjit Singh '' . External l...,Ranjit Singh,Ranjit Singh\ne-Punjab Maharaja Ranjit Singh '...
2998,2999,Guru Angad Guru Amar Das Guru Ram Das Guru Arj...,Ranjit Singh,Ranjit Singh\nGuru Angad Guru Amar Das Guru Ra...


In [55]:
from langchain.document_loaders import DataFrameLoader

loader = DataFrameLoader(documents, page_content_column="indextext")
documents = loader.load()

In [62]:
# Build the vector index: Chroma embeds each document with the given
# embedding function and inserts the vectors, together with the document
# text, into the collection.
vectordb = Chroma.from_documents(documents=documents, 
                                 embedding=embeddings,
                                 collection_name=dataset,
                                 persist_directory=PERSIST_DIR,
                                 collection_metadata=None,
                                 ids=ids)
# Save vector database as persistent files in the output folder
vectordb.persist()

In [66]:
# Retrieve the 3 passages most similar to the query (plain similarity search;
# pass search_type="mmr" instead to use maximal marginal relevance)
retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k":3})
doc_hits = retriever.get_relevant_documents("michael jackson")
for doc in doc_hits:
    print(doc.page_content)

Shaquille O'Neal
Shaquille O'Neal - wikipedia Shaquille O'Neal Jump to : navigation , search `` Shaquille '' redirects here . For other people called Shaquille , see Shaquille ( disambiguation ) . Shaquille O'Neal O'Neal in 2011 ( 1972 - 03 - 06 ) March 6 , 1972 ( age 45 ) Newark , New Jersey Nationality American Listed height 7 ft 1 in ( 2.16 m ) Listed weight 325 lb ( 147 kg ) Career information High school Robert G. Cole ( San Antonio , Texas ) College LSU ( 1989 -- 1992 ) NBA draft 1992 / Round : 1 / Pick : 1st overall Selected by the Orlando Magic Playing career 1992 -- 2011 Position Center Number 32 , 34 , 33 , 36 Career history 1992 -- 1996 Orlando Magic 1996 -- 2004 Los Angeles Lakers 2004 -- 2008 Miami Heat 2008 -- 2009 Phoenix Suns 2009 -- 2010 Cleveland Cavaliers 2010 -- 2011 Boston Celtics Career highlights and awards 4 × NBA champion ( 2000 -- 2002 , 2006 ) 3 × NBA Finals MVP ( 2000 -- 2002 ) NBA Most Valuable Player ( 2000 ) 15 × NBA All - Star ( 1993 -- 1998 , 2000 -- 20

#### 2.4 Embed and index documents with Chroma
You will now generate embeddings for the passages.

However, if you want the full experience, delete these files and rebuild them yourself. Note that creating the embeddings and indexes can take a long time. For example, on a 2021 MacBook Pro it took 45 minutes to generate these files for the LongNQ dataset.

### 3. Generate a retrieval-augmented response to a question
#### 3.1. Instantiate watsonx model

In [None]:
params = {
    GenParams.DECODING_METHOD: "greedy",
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 100,
    GenParams.TEMPERATURE: 0,
}
# Use the credentials and project ID loaded in section 1.3
model = Model(model_id='google/flan-ul2', params=params, credentials=wml_creds, project_id=PROJECT_ID)

#### 3.2. Select a question

In [None]:
question_index = 65
question_text = questions.text[question_index].strip("?") + "?"
print(question_text)

#### 3.3. Retrieve relevant context


In [None]:
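from chromadb.utils import embedding_functions

# NOTE: `chroma` is not defined earlier in this notebook. We assume it refers to
# the collection indexed in section 2, opened with the same embedding model.
chroma = chroma_client.get_collection(
    name=dataset,
    embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2"),
)
relevant_chunks = chroma.query(
    query_texts=[question_text],
    n_results=5,
)
for i, chunk in enumerate(relevant_chunks['documents'][0]):
    print("=========")
    print("Paragraph index : ", relevant_chunks['ids'][0][i])
    print("Paragraph : ", chunk)
    print("Distance : ", relevant_chunks['distances'][0][i])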

#### 3.4. Feed the context and the question to `watsonx` model.

In [None]:
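def make_prompt(context, question_text):
    # Context first, then the instruction, then the question.
    # Note the space before "Question:" so the two sentences do not run together.
    return (f"{context}\n\nPlease answer a question using this "
            "text. "
            "If the question is unanswerable, say \"unanswerable\". "
            f"Question: {question_text}")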

In [None]:
context = "\n\n\n".join(relevant_chunks["documents"][0])
prompt = make_prompt(context, question_text)

In [None]:
response = model.generate_text(prompt)


In [None]:
print("Question = ", question_text)
print("Answer = ", response)
print("Expected Answer(s) (may not appear with exact wording in the dataset) = ", questions.answers[question_index])

### 4. Evaluate RAG performance on your data
Evaluating the performance of your Generative AI system is critical to ensuring happy end users. However, evaluation also requires a test dataset: in this case, the top passages that should be returned for each question.

Note that we want to evaluate the performance of both (1) the embedding function and (2) how well the GenAI model summarizes the results.

So our test set must contain:

1. The indexes of the passage(s) that contain the answer, i.e. the gold-standard passages (if the question is answerable by the knowledge base).
2. The question's gold-standard answer (this can be short or long-form).
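
As a small illustration, a test record could be parsed as follows. The field formats are inferred from the scoring code later in this notebook (gold passage IDs separated by commas, alternative gold answers separated by `::`); the helper names and the sample row are hypothetical.

```python
def parse_relevant(relevant):
    """Gold passage IDs: a comma-separated string such as "12,15", or a single number."""
    if isinstance(relevant, str):
        return [int(i) for i in relevant.split(",")]
    return [int(relevant)]

def parse_answers(answers):
    """Alternative gold-standard answers are separated by '::'."""
    return str(answers).split("::")

# Hypothetical row, for illustration only
row = {"qid": "q1", "text": "example question", "relevant": "12,15",
       "answers": "first wording::second wording"}
gold_passages = parse_relevant(row["relevant"])
gold_answers = parse_answers(row["answers"])
```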


### 4.1. Evaluate the retrieval quality
Were the correct passages returned by the similarity search?

There are many ways to measure retrieval quality, that is, how relevant the information in the returned documents is to the question being asked. Here we focus on success at a given number of returns (a.k.a. recall at given levels): given a fixed number of returned documents (e.g., 1, 3, 5), is the question's answer contained in them? The scores can only increase or stay the same as the recall level increases.
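
A minimal sketch of success at k on made-up data is shown below. `success_at_k` is a hypothetical helper, simpler than the `compute_score` function used in this notebook, and the IDs are invented for the example.

```python
def success_at_k(retrieved_ids, gold_ids, ks=(1, 3, 5)):
    """Fraction of questions with at least one gold passage among the top k results.

    retrieved_ids: one ranked list of passage IDs per question
    gold_ids: one set of gold-standard passage IDs per question
    """
    totals = {k: 0 for k in ks}
    for ranked, gold in zip(retrieved_ids, gold_ids):
        for k in ks:
            if any(doc_id in gold for doc_id in ranked[:k]):
                totals[k] += 1
    n = len(retrieved_ids)
    return {k: totals[k] / n for k in ks}

# Question 1: gold passage at rank 1; question 2: at rank 3; question 3: never retrieved
retrieved = [[4, 9, 2, 7, 1], [3, 8, 5, 6, 0], [10, 11, 12, 13, 14]]
gold = [{4}, {5, 6}, {99}]
scores = success_at_k(retrieved, gold)
print(scores)
```

Here one of three questions succeeds at rank 1 and two of three at ranks 3 and 5, which shows why the score cannot decrease as k grows.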

In [None]:
def compute_score(questions, answers, ranks=[1, 3, 5, 10], use_rouge=False, rouge_threshold=0.7):
    """
    Computes the success at different levels of recall, given the gold-standard passage indexes per query.
    It computes two scores:
       * Success at rank i, defined as
                sum_q 1_{top i answers for question q contain a gold-standard passage} / #questions
       * Lenient success at rank i, defined as
                sum_q 1_{one of the documents in the top i for question q contains a gold-standard answer} / #questions
    Note that a document that contains the actual textual answer does not necessarily answer the question, hence it's a
    more lenient evaluation. Any gold-standard passage will contain a gold-standard answer text, by definition.
    Args:
        :param questions: pandas.DataFrame
           - the input queries, with the columns 'qid', 'text', 'relevant' and 'answers'.
        :param answers: List[List[Dict['id': AnyStr, 'text': AnyStr]]]
           - the retrieved passages (ID and text) for each query, in rank order
        :param ranks: List[int]
           - the ranks at which to compute success, in increasing order
        :param use_rouge: Boolean
           - turns on the use of rouge as a scorer
        :param rouge_threshold: float, default=0.7
           - defines the minimum rouge-l recall or precision needed to accept the answer as a match
    Returns:
        A dictionary with the number of evaluated queries, the 'success' and 'lenient_success'
        rates per rank, and the corresponding raw counts ('counts', 'lcounts').
    """
    # if "relevant" not in input_queries[0] or input_queries[0]['relevant'] is None:
    #     print("The input question file does not contain answers. Please fix that and restart.")
    #     sys.exit(12)

    scores = {r: 0 for r in ranks}
    lscores = {r: 0 for r in ranks}

    gt = {}
    for q_relevant, q_qid in zip(questions.relevant, questions.qid):
        if isinstance(q_relevant, str):
            rel = [int(i) for i in q_relevant.split(",")]
        else:
            rel = [q_relevant]
        gt[q_qid] = rel

    def update_scores(ranks, rnk, scores):
        j = 0
        while j < len(ranks) and ranks[j] < rnk:
            j += 1
        for k in ranks[j:]:
            scores[k] += 1

    scorer = None
    if use_rouge:
        from rouge import Rouge
        scorer = Rouge()

    num_eval_questions = 0

    for qi, (qid, q_answers) in enumerate(zip(questions.qid, questions.answers)):
        tmp_scores = {r: 0 for r in ranks}

        text_answers = str(q_answers).split("::")
        if "-" in text_answers:
            # The question does not have answers, skip it for retrieval score purposes.
            continue
        num_eval_questions += 1
        # Compute scores based on the goldstandard annotation
        for ai, ans in enumerate(answers[qi]):
            if int(ans['id']) in gt[qid]:  # Great, we found a match.
                update_scores(ranks, ai + 1, tmp_scores)
                break

        # Compute score on approximate match - either answer inclusion in the text or
        # minimum rouge score alignment.
        tmp_lscores = tmp_scores.copy()  # making sure we're actually lenient
        #inputq = questions[qi]
        for ai, ans in enumerate(answers[qi]):
            txt = ans['text'].lower()
            found = False
            for text_answer in text_answers:
                if use_rouge:
                    score = scorer.get_scores(text_answer.lower(), txt)
                    if max(score[0]['rouge-l']['r'], score[0]['rouge-l']['p']) > rouge_threshold:
                        update_scores(ranks, ai + 1, tmp_lscores)
                        break
                else:
                    if not isinstance(text_answer, str):
                        print(f"Error on text_answer {text_answer}, question {qi}, answer {ai}-{ans}")
                    # find() returns 0 for a match at the very start of the passage, so test >= 0
                    if txt.find(text_answer.lower()) >= 0:
                        update_scores(ranks, ai + 1, tmp_lscores)
                        break

        for r in ranks:
            scores[r] += int(tmp_scores[r] >= 1)
            lscores[r] += int(tmp_lscores[r] >= 1)

    res = {"num_ranked_queries": num_eval_questions,
           "num_judged_queries": num_eval_questions,
           "success":
               {r: int(1000 * scores[r] / num_eval_questions) / 1000.0 for r in ranks},
           "lenient_success":
               {r: int(1000 * lscores[r] / num_eval_questions) / 1000.0 for r in ranks},
           "counts": {r: scores[r] for r in ranks},
           'lcounts': {r: lscores[r] for r in ranks}
           }

    return res

#### Compute the retrieval score over all the documents
This can take up to a minute.

In [None]:
k = 5
retrieved_docs = []
for q in questions.text:
    answers = chroma.query(query_texts=[q], n_results=k)

    retrieved_docs.append([{'id': doc_id, 'text': text}
                           for (doc_id, text) in zip(answers['ids'][0], answers['documents'][0])])

# NOTE: the original condition referenced an undefined `data_dir`; we assume
# ROUGE matching is intended for the long-form LongNQ answers.
res = compute_score(questions, retrieved_docs,
                    ranks=[1, 3, 5], use_rouge=(dataset == 'LongNQ'))
print(res)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def plot(res):
    fig, ax = plt.subplots()
    scores = res['success'].values()
    keys = [f'R@{i}' for i in res['success'].keys()]
    x_pos = np.arange(len(keys))
    ax.bar(x_pos, scores, align='center', alpha=0.5)
    ax.set_ylabel('Success Rate')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(keys)
    ax.set_title('Success rates at different recall rates.')
    ax.yaxis.grid(True)

    # Save the figure and show
    plt.tight_layout()
    plt.savefig('bar_plot.png')
    plt.show()

In [None]:
plot(res)


### 4.2. Evaluate quality of generated responses
That is, how well did the GenAI model summarize and extract the correct answer to the user's question from the passages returned by the similarity search? If the returned passages are not relevant, performance at this phase suffers too.

Automatically evaluating the quality of answers is difficult, as many factors come into play, such as fluency, helpfulness, coverage, etc. One simplified way of computing this quality is the ROUGE metric, in particular ROUGE-L. To compute this metric, for every answer returned for a question, we find the longest common subsequence of words between the system answer and the gold-standard answer. Given this subsequence, we compute the precision of the answer as the length of the subsequence divided by the length of the system answer, and the recall as the length of the subsequence divided by the length of the gold-standard answer (all lengths are in words).
$$ P_{ROUGE-L} = \frac{|lcs(system,gold)|}{|system|} \\ R_{ROUGE-L} = \frac{|lcs(system,gold)|}{|gold|} $$

where $lcs(system, gold)$ is the longest common subsequence between $system$ and $gold$.

ROUGE was devised in the NLP community to evaluate summarization, and is commonly used to also evaluate abstractive question answering.
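
The two formulas can be sketched directly from the definition. This is a minimal illustration that tokenizes on whitespace; it is not guaranteed to match the output of the `rouge` package used in this notebook, and the helper names and example strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(system, gold):
    """Return (precision, recall) of the system answer against one gold answer."""
    sys_tokens = system.lower().split()
    gold_tokens = gold.lower().split()
    lcs = lcs_length(sys_tokens, gold_tokens)
    precision = lcs / len(sys_tokens) if sys_tokens else 0.0
    recall = lcs / len(gold_tokens) if gold_tokens else 0.0
    return precision, recall

precision, recall = rouge_l("the capital is boise", "boise is the capital of idaho")
print(f"P = {precision:.2f}, R = {recall:.2f}")
```

In this example the longest common subsequence is "the capital" (2 words), so precision is 2/4 and recall is 2/6.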

In [None]:
def score_answers(_answers, _reference, score_type="rouge-l", val="r", use_rouge=True):
    """
    Compute the score of a set of answers, given a set of references, using Rouge score.
    :param _answers: Union[List[str], str]
       - the returned answer/answers.
    :param _reference: str
        - the reference answers, in a single string. Answers are separated by '::'
    :param score_type: str
        - the rouge variant to use, e.g. "rouge-l"
    :param val: str
        - which value to return: 'r' (recall) or 'p' (precision)
    :param use_rouge: Boolean
        - if true, then use rouge for scoring, otherwise use substring.
    :return:
       - The maximum score over the cartesian product of answers/references, and the closest reference
    """
    if isinstance(_answers, str):
        _answers = [_answers]
    _references = _reference.lower().split("::")
    max_score = -1
    scorer = Rouge()
    closest_ref = ""
    for ref in _references:
        for _answer in _answers:
            if use_rouge:
                scores = scorer.get_scores(_answer.lower(), ref)
                score = scores[0][score_type][val]
            else:
                score = int(ref.find(_answer.lower()) >= 0)
            if score > max_score:
                max_score = score
                closest_ref = ref

    return max_score, closest_ref

In [None]:
print("Question = ", question_text)
print("Answer = ", response)
score, closest_ref = score_answers(response, questions.answers[question_index], val='r')
print(f"Closest reference: \"{closest_ref}\"")
print(f"Recall:\t\t{100*score:5.2f}%")
score, _ = score_answers(response, questions.answers[question_index], val='p')
print(f"Precision:\t{100*score:5.2f}%")


#### Compute (Rouge-based) precision and recall for the entire collection.
It takes about 1-2 seconds per question. For a corpus of ~1000 questions, this can take up to 30 minutes.

In [None]:
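def is_answerable(relevant):
    # Assumption: a relevant-passage ID of -1 marks a question with no answer in the knowledge base
    return "-1" not in str(relevant)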

In [None]:
rscore = 0
pscore = 0
import tqdm
num_eval_questions = 50
eval_questions = questions[:num_eval_questions]
count = {"11": 0, "10": 0, "01": 0, "00": 0}
seq = []
for (question_text, answers, relevant) in tqdm.tqdm(zip(eval_questions.text, eval_questions.answers, eval_questions.relevant), total=len(eval_questions)):
    # ans = qa(question.question)
    relevant_chunks = chroma.query(
        query_texts=[question_text],
        n_results=5,
    )
    context = "\n\n\n".join(relevant_chunks["documents"][0])
    prompt = make_prompt(context, question_text)
    ans = model.generate_text(prompt)
    q_answerable = is_answerable(relevant)
    if ans == "unanswerable":
        res = "10" if q_answerable else "00"
        count[res] += 1
        if not q_answerable:
            rscore += 1
            pscore += 1
    else:
        # The system answered: "11" if a gold answer exists, "01" if it does not
        res = "11" if q_answerable else "01"
        count[res] += 1
        if q_answerable:
            qrscore, _ = score_answers(ans, answers, val='r')
            rscore += qrscore
            qpscore, _ = score_answers(ans, answers, val='p')
            pscore += qpscore
    seq.append(res)


In [None]:
from IPython.display import HTML, display
def displayHTMLTables(*tables):
    def htmlTable(table):
        return '<table border="2"><tr>{}</tr></table>'.format(
                    '</tr><tr>'.join(
                        '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in table)
                )

    display(HTML('<table><tr><td>{}</td></tr></table>'.format(
                "</td><td>".join(htmlTable(table) for table in tables))
))

In [None]:
res = [['', 'Overall', 'Answerable questions'],
       ['Precision', f"{100*pscore/len(eval_questions):5.2f}", f"{100*(pscore-count['00'])/(count['10']+count['11']):5.2f}"],
       ['Recall',    f"{100*rscore/len(eval_questions):5.2f}", f"{100*(rscore-count['00'])/(count['10']+count['11']):5.2f}"],
       ]
counts = [['Gold/System', 'No Answer', 'Answered'],
        ['No Answer', count["00"], count["01"]],
        ['With Answer', count["10"], count["11"]]]

displayHTMLTables(res, [], [], counts)