# Dataset Exploration


### Structure
The Quora question-answering dataset consists of a single CSV file with two columns, `question` and `answer`, and about 56k rows of QA pairs. The dataset was constructed by scraping Quora and needs cleaning before it can be used for fine-tuning. There are no additional documents that could be used for information retrieval, and there is no additional context column. This makes the task "closed-book question answering": the model is expected to answer questions using only the knowledge stored in its parameters.
As there is no separate test split, we will create one later, after the analysis.


In [2]:
from datasets import load_dataset

dataset = load_dataset("toughdata/quora-question-answer-dataset")
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 56402
    })
})

### Content
The dataset is in English. There are many exact duplicate questions, and only ~3k questions are unique. From the dataset viewer on Hugging Face, we can observe that questions are up to 250 characters long, with most between 35 and 59 characters. Answers can be very long, up to 411k characters.

In [3]:
question_set = {}
for i, q in enumerate(dataset["train"]["question"]):
    if q not in question_set:
        question_set[q] = []
    question_set[q].append(i)
print("total questions: ", len(question_set))
print("max answers for a single question: ", max(len(question_set[q]) for q in question_set))
print("min answers for a single question: ", min(len(question_set[q]) for q in question_set))


total questions:  3234
max answers for a single question:  106
min answers for a single question:  1
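
The length figures quoted above come from the dataset viewer. As a rough local cross-check, a small helper like the one below could summarise character lengths. `length_summary` is a hypothetical name that is not part of the original notebook, and the sample strings are made up for illustration.

```python
from statistics import median

def length_summary(texts):
    """Return the min, median and max character length of a list of strings."""
    lengths = [len(t) for t in texts]
    return {"min": min(lengths), "median": median(lengths), "max": max(lengths)}

# Toy input; on the real data one would pass dataset["train"]["question"]
# or dataset["train"]["answer"] instead.
sample = ["What is AI?", "How do I learn Python quickly?", "Why?"]
print(length_summary(sample))
```

Running the same helper on both columns would show how much longer answers are than questions, which matters for the scoring step later on.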


Only a few of the answers to each question are likely to be relevant and of high quality. This is especially true for Quora, where there is very little moderation compared to forums like Stack Overflow. We would like to keep only the good answers. This also reduces the dataset size and, correspondingly, the training time.
However, judging an answer is a complex task, as we don't have voting information. Without relying on human evaluators, we can use language models to score the answers. This should work better than heuristics such as text similarity.

#### Zero-shot answer scoring
`MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli` is a popular model based on Microsoft's DeBERTa-v3-base. It was trained on multiple Natural Language Inference (NLI) datasets and is suitable for zero-shot classification. It has ~200M parameters, so inference time is reasonable.

For each question, we keep the answers whose scores fall in the top $20\%$, rounding the count up so that at least one answer is always kept.
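
The per-question selection rule can be illustrated in isolation. The sketch below assumes the scores are already available as plain floats. `top_fraction` is a hypothetical helper, and the row indices and scores are invented for the example.

```python
from math import ceil

def top_fraction(row_indices, scores, fraction=0.20):
    """Return the row indices whose scores are in the top `fraction`.

    The count is rounded up, so a non-empty input always keeps
    at least one index.
    """
    if not row_indices:
        return []
    keep = ceil(len(row_indices) * fraction)
    # Rank positions by score, highest first, then map back to row indices.
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    return [row_indices[p] for p in ranked[:keep]]

# Hypothetical scores for six answers stored at these dataset rows.
rows = [10, 11, 12, 13, 14, 15]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.4]
print(top_fraction(rows, scores))
```

Returning dataset row indices rather than positions in the score list keeps the result directly usable for selecting rows afterwards.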

In [4]:
from transformers import pipeline
import numpy as np
from math import ceil
from tqdm import tqdm
import json

threshold = 0.20  # fraction of answers to keep per question
default_max_length = 512  # intended token limit; not passed to the pipeline in this cell

answer_judge = pipeline("zero-shot-classification", model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", device=0)

for i, q in tqdm(enumerate(question_set)):
    row_indices = question_set[q]  # dataset row indices of all answers to this question
    scores = []
    for index in row_indices:
        answer = dataset["train"][index]['answer']
        prompt = f"""
        Does the following answer reliably answer the question
        {q}
        Provide a yes/no response.
        Answer: {answer}
        """
        candidate_labels = ["yes"]
        res = answer_judge(prompt, candidate_labels)
        scores.append(res['scores'][0])
    # argsort returns positions within `scores`, ordered by ascending score
    sorted_positions = np.argsort(scores)
    total_selects = ceil(len(sorted_positions) * threshold)
    # map the top-scoring positions back to dataset row indices
    selected = [row_indices[p] for p in sorted_positions[-total_selects:]]
    question_set[q] = selected

    # log the selection for this question
    with open("../log/q_{}.json".format(i), "w") as f:
        json.dump({q: selected}, f)



0it [00:00, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
26it [01:30,  3.47s/it]


OutOfMemoryError: CUDA out of memory. Tried to allocate 24.96 GiB. GPU 0 has a total capacity of 5.80 GiB of which 3.66 GiB is free. Process 8125 has 870.00 MiB memory in use. Including non-PyTorch memory, this process has 1.28 GiB memory in use. Of the allocated memory 883.76 MiB is allocated by PyTorch, and 324.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
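
The warning in the output says that no truncation is applied, and some answers run to hundreds of thousands of characters, so the out-of-memory error is most likely triggered by a single very long input. One simple mitigation, sketched below as an assumption rather than as the notebook's actual fix, is to cap the answer length before building the prompt. `build_prompt` and the 2,000-character cap are hypothetical choices, and a character cap is only a rough proxy for the model's token limit.

```python
def build_prompt(question, answer, max_answer_chars=2000):
    """Build the scoring prompt, cutting the answer to at most `max_answer_chars`."""
    truncated = answer[:max_answer_chars]
    return (
        "Does the following answer reliably answer the question\n"
        f"{question}\n"
        "Provide a yes/no response.\n"
        f"Answer: {truncated}"
    )

# A very long answer is cut down before it would reach the model.
prompt = build_prompt("What is AI?", "x" * 500_000)
print(len(prompt))
```

A token-level limit applied through the tokenizer would be more precise than a character cap, but it is not shown here.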