# Build a Q&A App with PyTorch

In this notebook, we will build a Question and Answer (Q&A) application using PyTorch. The Q&A application will take a given question and a context paragraph as input, and it will provide the answer to the question based on the information in the context paragraph.

Let's get started!

## Dataset

We will use SQuAD (the Stanford Question Answering Dataset) to train and evaluate our Q&A model. SQuAD consists of question-answer pairs, where each question is associated with a context paragraph. Our goal is to build a model that can accurately answer questions based on the given context. The dataset is available at https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

To load and read data from a JSON file in Python, you can use the `json` module. Here's an example of how to do it:


In [1]:
import json

with open("train-v2.0.json", 'r') as f:
  data = json.load(f)

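The loaded object is a nested dictionary: a top-level `data` list of articles, each with a `title` and a list of `paragraphs`; every paragraph holds a `context` string and a list of `qas` entries with `question`, `answers`, and `is_impossible` fields. From this data we will focus on the `question` and `answers` fields of the article whose title is `Premier_League`. This gives us exact answers to a fixed set of questions.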
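
As a rough illustration of that layout, the sketch below walks a tiny hand-written dictionary that mimics the SQuAD v2.0 structure and counts the answerable questions per title. The sample content and the helper name `count_answerable` are assumptions made for this example, not part of the dataset or the notebook.

```python
# Tiny hand-written sample that mimics the SQuAD v2.0 layout (not real data)
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Topic",
            "paragraphs": [
                {
                    "context": "The league has 20 clubs.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "How many clubs are there?",
                            "answers": [{"text": "20", "answer_start": 15}],
                            "is_impossible": False,
                        },
                        {
                            "id": "q2",
                            "question": "Who founded the league?",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ],
}


def count_answerable(data):
    """Return {title: number of answerable questions} for a SQuAD-style dict."""
    counts = {}
    for article in data["data"]:
        n = 0
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable questions have an empty 'answers' list
                if not qa["is_impossible"]:
                    n += 1
        counts[article["title"]] = n
    return counts


print(count_answerable(sample))
```

Calling `count_answerable(data)` on the loaded file should list every available title, which can help when picking a topic.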

To obtain the questions and answers, define and run the following function `get_qa`. For the `Premier_League` topic it should return 357 question-answer pairs.

In [2]:
# get the available questions and answers for a given topic
def get_qa(topic, data):
    q = []
    a = []
    for d in data['data']:
        if d['title']==topic:
            for paragraph in d['paragraphs']:
                for qa in paragraph['qas']:
                    if not qa['is_impossible']:
                        q.append(qa['question'])
                        a.append(qa['answers'][0]['text'])
    # return after scanning all articles, so an unknown topic yields empty lists
    return q, a

questions, answers = get_qa(topic='Premier_League', data=data)

print("Number of available questions: {}".format(len(questions)))

Number of available questions: 357


## QA Embedding Model

We will use a pre-trained transformer-based model for our Q&A application. The QA embedding model converts each question into a numeric vector (an embedding). To answer a new question, we compare its vector with the vectors of the known questions, pick the closest match, and return the answer stored for that match.

Training a model from scratch is time-consuming and expensive, so we use a pretrained model hosted on Hugging Face instead. The shell script below downloads the model files with `curl`. On Windows, `curl` can be used to fetch the same files.

~~~
#!/bin/bash
# defines the qa model
MODEL_DIR="https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2/resolve/main"
MODEL_NAME="paraphrase-MiniLM-L6-v2"
# downloads the qa model. To make this image more general one can use curl
# with the "-O" argument to download the necessary files defined
# in "require.txt".
mkdir ${MODEL_NAME} &&\
    curl -o ${MODEL_NAME}/vocab.txt ${MODEL_DIR}/vocab.txt &&\
    curl -o ${MODEL_NAME}/tokenizer_config.json ${MODEL_DIR}/tokenizer_config.json &&\
    curl -o ${MODEL_NAME}/tokenizer.json ${MODEL_DIR}/tokenizer.json &&\
    curl -o ${MODEL_NAME}/special_tokens_map.json ${MODEL_DIR}/special_tokens_map.json &&\
    curl -o ${MODEL_NAME}/sentence_bert_config.json ${MODEL_DIR}/sentence_bert_config.json &&\
    curl -o ${MODEL_NAME}/pytorch_model.bin ${MODEL_DIR}/pytorch_model.bin &&\
    curl -o ${MODEL_NAME}/modules.json ${MODEL_DIR}/modules.json &&\
    curl -o ${MODEL_NAME}/config_sentence_transformers.json ${MODEL_DIR}/config_sentence_transformers.json &&\
    curl -o ${MODEL_NAME}/config.json ${MODEL_DIR}/config.json
~~~
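
For readers who prefer not to depend on a shell, the same downloads could be sketched in Python with the standard library. The base URL and file names are copied from the script above; the helper names `download_plan` and `download_model` are hypothetical, and the actual download call is left commented out because it needs network access.

```python
import os
import urllib.request

# Base URL and file list copied from the shell script above
MODEL_URL = "https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2/resolve/main"
MODEL_NAME = "paraphrase-MiniLM-L6-v2"
MODEL_FILES = [
    "vocab.txt",
    "tokenizer_config.json",
    "tokenizer.json",
    "special_tokens_map.json",
    "sentence_bert_config.json",
    "pytorch_model.bin",
    "modules.json",
    "config_sentence_transformers.json",
    "config.json",
]


def download_plan(base_url=MODEL_URL, target_dir=MODEL_NAME, files=MODEL_FILES):
    """Return (url, local_path) pairs without touching the network."""
    return [(f"{base_url}/{name}", os.path.join(target_dir, name)) for name in files]


def download_model(plan):
    """Fetch every file in the plan (hypothetical helper, requires network access)."""
    for url, path in plan:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(url, path)


plan = download_plan()
print(len(plan), "files to fetch, e.g.", plan[0][0])
# download_model(plan)  # uncomment to actually download the files
```

Unlike the `mkdir` call in the script, this sketch does not fail when the target directory already exists.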

If you don’t have the `transformers` package, start by installing it with pip.

`pip install transformers`

Then, run the following `get_model` function. If all files were downloaded properly and all dependencies are met, this should run without issues.

In [5]:
from transformers import AutoModel, AutoTokenizer
def get_model(model_name):
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
  
model, tokenizer = get_model(model_name="sentence-transformers/paraphrase-MiniLM-L6-v2")

Let’s now run our embedding model over a sample of the context questions. To do this, run the following instructions.

The code should print the shape of our new embeddings vector.

`Embeddings shape: torch.Size([3, 384])`

In [7]:
import torch

# Mean Pooling - Take attention mask into account for correct averaging
# source: https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    
    input_mask_expanded = (
      attention_mask
      .unsqueeze(-1)
      .expand(token_embeddings.size())
      .float()
    )
    
    pool_emb = (
      torch.sum(token_embeddings * input_mask_expanded, 1) 
      / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    )
    
    return pool_emb

def get_embeddings(questions, tokenizer, model):
  # Tokenize sentences
  encoded_input = tokenizer(questions, padding=True, truncation=True, return_tensors='pt')

  # Compute token embeddings
  with torch.no_grad():
      model_output = model(**encoded_input)

  # Average pooling
  embeddings = mean_pooling(model_output, encoded_input['attention_mask']) 
  
  return embeddings

embeddings = get_embeddings(questions[:3], tokenizer, model)
print("Embeddings shape: {}".format(embeddings.shape))

Embeddings shape: torch.Size([3, 384])
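
To see why the attention mask matters in `mean_pooling`, here is a minimal pure-Python sketch of the same idea on made-up numbers. It is only an illustration of masked averaging, not the tensor code used above; the helper name `masked_mean` and the toy vectors are assumptions.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        for j in range(dim):
            totals[j] += vector[j] * keep
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Two real tokens followed by one padding token (made-up numbers)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(tokens, mask))       # [2.0, 3.0]
print(masked_mean(tokens, [1, 1, 1]))  # including the padding skews the mean
```

Because `padding=True` pads shorter questions to the length of the longest one in the batch, skipping the mask would let padding positions distort the sentence embedding.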


Let’s start by checking our previous sample questions:

~~~
[
  'How many club members are there?',
  'How many matches does each team play?',
  'What days are most games played?'
]
~~~

Then, paraphrase the last one to:

'Which days have the most events played at?'

Finally, let’s embed our new question and calculate the squared Euclidean distance between `new_embedding` and each row of `embeddings`.

The code should output distances close to the following (the last digits may differ slightly between runs or library versions), indicating that the last question in our sample is indeed the closest (smallest distance) to our new question.

`tensor([71.4029, 59.8726, 23.9430])`

In [8]:
new_question = 'Which days have the most events played at?'
new_embedding = get_embeddings([new_question], tokenizer, model)

# squared Euclidean distance between sample questions and new_question
((embeddings - new_embedding)**2).sum(axis=1)

tensor([71.4030, 59.8726, 23.9431])
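
The cell above skips the square root. As a small check, the sketch below takes the squared distances printed above and shows that applying the square root does not change which question is closest, since the square root is monotonic. The variable names are chosen for this example only.

```python
import math

# Squared distances printed by the cell above
squared = [71.4030, 59.8726, 23.9431]

# Euclidean distance is the square root of the squared distance
euclidean = [math.sqrt(d) for d in squared]

# Index of the smallest value in each list
closest_squared = min(range(len(squared)), key=squared.__getitem__)
closest_euclid = min(range(len(euclidean)), key=euclidean.__getitem__)

print([round(d, 2) for d in euclidean])
print(closest_squared, closest_euclid)  # 2 2
```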

## Model Deployment

QAEmbedder

In [9]:
class QAEmbedder:
  def __init__(self, model_name="paraphrase-MiniLM-L6-v2"):
    """
    Defines a QA embedding model. This is, given a set of questions,
    this class returns the corresponding embedding vectors.
    
    Args:
      model_name (`str`): Directory containing the necessary tokenizer
        and model files.
    """
    self.model = None
    self.tokenizer = None
    self.model_name = model_name
    self.set_model(model_name)
  
  
  def get_model(self, model_name):
    """
    Loads a general tokenizer and model using the `transformers`
    classes 'AutoTokenizer' and 'AutoModel'
    
    Args:
      model_name (`str`): Directory containing the necessary tokenizer
        and model files.
    """
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
  
  
  def set_model(self, model_name):
    """
    Sets a general tokenizer and model using the 'self.get_model'
    method.
    
    Args:
      model_name (`str`): Directory containing the necessary tokenizer
        and model files.
    """
    # use the argument (not the old attribute) so the model can be swapped
    self.model_name = model_name
    self.model, self.tokenizer = self.get_model(model_name)
  
  
  def _mean_pooling(self, model_output, attention_mask):
    """
    Internal method that takes a model output and an attention
    mask and outputs a mean pooling layer.
    
    Args:
      model_output: output of the QA model; its first element holds the token embeddings
      attention_mask (`torch.Tensor`): attention mask defined in the QA tokenizer
      
    Returns:
      The averaged tensor.
    """
    token_embeddings = model_output[0]
    
    input_mask_expanded = (
      attention_mask
      .unsqueeze(-1)
      .expand(token_embeddings.size())
      .float()
    )
    
    pool_emb = (
      torch.sum(token_embeddings * input_mask_expanded, 1) 
      / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    )
    
    return pool_emb
  
  
  def get_embeddings(self, questions, batch=32):
    """
    Gets the corresponding embeddings for a set of input 'questions'.
    
    Args:
      questions (`list` of `str`): List of strings defining the questions to be embedded
      batch (`int`): Performs the embedding job 'batch' questions at a time
      
    Returns:
      The embedding vectors.
    """
    question_embeddings = []
    for i in range(0, len(questions), batch):
    
        # Tokenize sentences
        encoded_input = self.tokenizer(questions[i:i+batch], padding=True, truncation=True, return_tensors='pt')

        # Compute token embeddings
        with torch.no_grad():
            model_output = self.model(**encoded_input)

        # Perform mean pooling
        batch_embeddings = self._mean_pooling(model_output, encoded_input['attention_mask'])
        question_embeddings.append(batch_embeddings)
    
    question_embeddings = torch.cat(question_embeddings, dim=0)
    return question_embeddings
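
The loop in `get_embeddings` processes the questions in slices of `batch` items. A minimal sketch of how those slices fall, assuming the 357 `Premier_League` questions and the default batch size of 32 (the helper name `batch_slices` is hypothetical):

```python
def batch_slices(n_items, batch):
    """Yield (start, stop) index pairs covering n_items in steps of `batch`."""
    for start in range(0, n_items, batch):
        yield start, min(start + batch, n_items)


# 357 questions with the default batch of 32
slices = list(batch_slices(357, 32))
print(len(slices))   # 12 batches
print(slices[-1])    # (352, 357): the last batch holds the remaining 5
```

Python slicing clamps the upper bound on its own, which is why `questions[i:i+batch]` in the class needs no explicit `min`.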

QASearcher

In [10]:
class QASearcher:
  def __init__(self, model_name="paraphrase-MiniLM-L6-v2"):
    """
    Defines a QA Search model. This is, given a new question it searches
    the most similar questions in a set 'context' and returns both the best
    question and associated answer.
    
    Args:
      model_name (`str`): Directory containing the necessary tokenizer
        and model files.
    """
    self.answers = None
    self.questions = None
    self.question_embeddings = None
    self.embedder = QAEmbedder(model_name=model_name)
  
  
  def set_context_qa(self, questions, answers):
    """
    Sets the QA context to be used during search.
    
    Args:
      questions (`list` of `str`):  List of strings defining the questions to be embedded
      answers (`list` of `str`): Best answer for each question in 'questions'
    """
    self.answers = answers
    self.questions = questions
    self.question_embeddings = self.get_q_embeddings(questions)
  
  
  def get_q_embeddings(self, questions):
    """
    Gets the embeddings for the questions in 'context'.
    
    Args:
      questions (`list` of `str`):  List of strings defining the questions to be embedded
    
    Returns:
      The embedding vectors.
    """
    question_embeddings = self.embedder.get_embeddings(questions)
    question_embeddings = torch.nn.functional.normalize(question_embeddings, p=2, dim=1)
    return question_embeddings.transpose(0,1)
  
  
  def cosine_similarity(self, questions, batch=32):
    """
    Gets the cosine similarity between the new questions and the 'context' questions.
    
    Args:
      questions (`list` of `str`):  List of strings defining the questions to be embedded
      batch (`int`): Performs the embedding job 'batch' questions at a time
    
    Returns:
      The cosine similarity
    """
    question_embeddings = self.embedder.get_embeddings(questions, batch=batch)
    question_embeddings = torch.nn.functional.normalize(question_embeddings, p=2, dim=1)
    
    cosine_sim = torch.mm(question_embeddings, self.question_embeddings)
    
    return cosine_sim
  
  
  def get_answers(self, questions, batch=32):
    """
    Gets the best answers in the stored 'context' for the given new 'questions'.
    
    Args:
      questions (`list` of `str`):  List of strings defining the questions to be embedded
      batch (`int`): Performs the embedding job 'batch' questions at a time
    
    Returns:
      A `list` of `dict`'s containing the original question ('orig_q'), the most similar
      question in the context ('best_q') and the associated answer ('best_a').
    """
    similarity = self.cosine_similarity(questions, batch=batch)
    
    response = []
    for i in range(similarity.shape[0]):
      # convert the 0-dim tensor to a plain int before indexing the lists
      best_ix = similarity[i].argmax().item()
      best_q = self.questions[best_ix]
      best_a = self.answers[best_ix]
      
      response.append(
        {
          'orig_q':questions[i],
          'best_q':best_q,
          'best_a':best_a,
        }
      )
    
    return response
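
`QASearcher` normalizes both sets of embeddings and multiplies them, which yields cosine similarities; the best match is the column with the highest score in each row. The following pure-Python sketch reproduces that logic on made-up 2-D vectors. The helper names and the toy data are assumptions for illustration, not the class itself.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def best_matches(new_vectors, context_vectors):
    """For each new vector, return the index of the most similar context vector."""
    context_unit = [l2_normalize(v) for v in context_vectors]
    results = []
    for vector in new_vectors:
        unit = l2_normalize(vector)
        # Dot product of unit vectors equals cosine similarity
        scores = [sum(a * b for a, b in zip(unit, c)) for c in context_unit]
        results.append(max(range(len(scores)), key=scores.__getitem__))
    return results


# Made-up 2-D "embeddings" standing in for the 384-D model output
context = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
answers = ["answer A", "answer B", "answer C"]

new = [[0.1, 5.0], [2.0, 1.9]]
for ix in best_matches(new, context):
    print(answers[ix])
```

Note that vector length plays no role here: `[0.0, 2.0]` and `[0.1, 5.0]` match because they point in nearly the same direction.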

Define the FastAPI app


In [18]:
%pip install fastapi
%pip install uvicorn

In [21]:
import uvicorn
from fastapi import FastAPI, Request
from app.utils import QASearcher

app = FastAPI()

@app.post("/set_context")
async def set_context(data:Request):
  """
  FastAPI POST method that sets the QA context for search.
  
  Args:
    data(`dict`): Two fields required 'questions' (`list` of `str`)
      and 'answers' (`list` of `str`)
  """
  data = await data.json()
  
  qa_search.set_context_qa(
    data['questions'], 
    data['answers']
  )
  return {"message": "Search context set"}


@app.post("/get_answer")
async def get_answer(data:Request):
  """
  FastAPI POST method that gets the best question and answer 
  in the set context.
  
  Args:
    data(`dict`): One field required 'questions' (`list` of `str`)
  
  Returns:
    A `list` of `dict`s, each containing the original question ('orig_q'), the most
    similar question in the context ('best_q') and the associated answer ('best_a').
  """
  data = await data.json()
  
  response = qa_search.get_answers(data['questions'], batch=1)
  return response


# initialises the QA model at module level so the endpoints can always see it
qa_search = QASearcher()

# starts the uvicorn app when the file is run as a script
if __name__ == "__main__":
  uvicorn.run(app, host="0.0.0.0", port=8000)

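Note: the import at the top of this cell assumes that the `QAEmbedder` and `QASearcher` classes defined above have been saved to `app/utils.py`. If they are not importable from there, the cell fails with an `ImportError` such as `cannot import name 'QASearcher'`.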
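
Once the server is running, a client could talk to the two endpoints roughly as sketched below. The JSON field names `questions` and `answers` come from the endpoint docstrings; the base URL assumes the server runs locally on port 8000, and the helper names `build_request` and `send` as well as the sample payloads are hypothetical. The calls that need a live server are commented out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server running locally on port 8000


def build_request(path, payload):
    """Create a JSON POST request for one of the API endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON reply (needs the server to be up)."""
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))


# Placeholder context: one known question with its stored answer
set_context_req = build_request(
    "/set_context",
    {
        "questions": ["How many matches does each team play?"],
        "answers": ["placeholder answer"],
    },
)
get_answer_req = build_request(
    "/get_answer", {"questions": ["How many games per team?"]}
)

print(set_context_req.full_url, set_context_req.get_method())
# print(send(set_context_req))  # run these two lines with the server running
# print(send(get_answer_req))
```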