# Dataset Exploration


### Structure
The Quora question-answering dataset consists of a single CSV file with two columns, `question` and `answer`, and ~56k rows of QA pairs. It was constructed by scraping Quora and needs cleaning before it can be used for fine-tuning. There are no additional documents that could be used for information retrieval, and there is no additional context column. The task is therefore "closed-book question answering": the model is expected to answer questions using only the knowledge stored in its parameters.
As there is no separate test split, we will create one later, after the analysis.
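
Since the same question can appear in many rows, one way to build that test split is to split on unique questions rather than on rows, so that no question leaks from train into test. The sketch below is only an illustration on toy data; `split_by_question`, `test_fraction` and `seed` are hypothetical names, not part of this notebook.

```python
import random

def split_by_question(questions, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) so that no question appears in both splits."""
    unique = sorted(set(questions))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out a fraction of the unique questions (at least one)
    n_test = max(1, round(len(unique) * test_fraction))
    test_questions = set(unique[:n_test])
    train_idx = [i for i, q in enumerate(questions) if q not in test_questions]
    test_idx = [i for i, q in enumerate(questions) if q in test_questions]
    return train_idx, test_idx

# Toy data standing in for the `question` column
toy_questions = ["q1", "q2", "q1", "q3", "q2", "q4", "q1", "q5"]
train_idx, test_idx = split_by_question(toy_questions, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

The returned index lists could then be used to select rows from the dataset.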


In [1]:
!pip install pytextrank



In [2]:
from datasets import load_dataset

dataset = load_dataset("toughdata/quora-question-answer-dataset")
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 56402
    })
})

### Content
The dataset is in English. Many questions are exact duplicates: there are only ~3k unique questions, each paired with one or more answers.

In [3]:
question_set = {}
for i, q in enumerate(dataset["train"]["question"]):
    if q not in question_set:
        question_set[q] = []
    question_set[q].append(i)
print("total questions: ", len(question_set))
print("max answers for a single question: ", max(len(question_set[q]) for q in question_set))
print("min answers for a single question: ", min(len(question_set[q]) for q in question_set))


total questions:  3234
max answers for a single question:  106
min answers for a single question:  1


Questions are up to about 300 characters long, but answers can be very long, up to 450k characters. As the majority of answers are within 50k characters, we discard the longer ones. On inspection, I also found that the very long answers have little to no relevance to the question, so we should be able to ignore them safely.
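
Before fixing a cutoff, it can help to check how many rows it would keep. The helper below is a minimal sketch on toy strings; `share_within` is a hypothetical name and the toy lengths are made up for illustration.

```python
def share_within(lengths, cutoff):
    """Fraction of items whose length is at most `cutoff`."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

# Toy answers standing in for the `answer` column
toy_answers = ["short", "a" * 120, "b" * 60_000, "c" * 4_000]
toy_lengths = [len(a) for a in toy_answers]
print(share_within(toy_lengths, 50_000))  # 3 of the 4 toy answers are within the cutoff
```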


In [4]:
import plotly.express as px

q_lens = {"length": [len(q) for q in dataset['train']['question']]}

fig = px.histogram(q_lens, nbins=10, x='length')

fig.update_layout(title='Distribution of length of Questions',
                  xaxis_title='Length',
                  yaxis_title='Frequency',
                  bargap=0.2,
                  bargroupgap=0.1,)
		
fig.show()

In [5]:
import plotly.express as px

ans_lens = {"length": [len(ans) for ans in dataset['train']['answer']]}

fig = px.histogram(ans_lens, nbins=10, x='length')

fig.update_layout(title='Distribution of length of Answers',
                  xaxis_title='Length',
                  yaxis_title='Frequency',
                  bargap=0.2,
                  bargroupgap=0.1,)
		
fig.show()

For the answers within the 50k-character limit, we summarize the longer ones (say, those with more than 300 words) for practical reasons.
We want to extract the most important sentences and phrases from each answer. TextRank can rank sentences by importance. Because it scores each part of the text by how it relates to the rest, it is useful for extractive summarization.
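
To make the idea concrete, here is a simplified, pure-Python sketch of graph-based sentence ranking: sentences are nodes, edges are weighted by word overlap (Jaccard similarity, chosen here for simplicity), and a PageRank-style iteration scores the nodes. This is an illustration only; `pytextrank` differs in its details, and `textrank_sentences` and its parameters are hypothetical names.

```python
import re

def textrank_sentences(text, top_k=2, damping=0.85, iterations=50):
    """Rank sentences with PageRank over a word-overlap similarity graph."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    bags = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Edge weight: Jaccard overlap between the word sets of two sentences
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and bags[i] | bags[j]:
                weights[i][j] = len(bags[i] & bags[j]) / len(bags[i] | bags[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n if n else []
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, proportional to edge weight
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    # Return the selected sentences in their original order
    return [sentences[i] for i in sorted(ranked)]

toy_text = (
    "Cats sleep a lot. Cats and dogs sleep in the house. "
    "The stock market fell today. Dogs sleep a lot too."
)
print(textrank_sentences(toy_text, top_k=2))
```

In the toy text, the sentence about the stock market shares almost no words with the others, so it receives the lowest score and is left out of the summary.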

To further improve data quality, we could use an LLM to generate a coherent abstractive summary from the extracted text. For now, I use the TextRank summary directly. It gives good results and is practical, because it has to run over almost every row of the dataset. Also, since we fine-tune a pretrained LLM, we can assume that the model will not forget its grammar and will mainly pick up the facts.


In [6]:

import spacy
import pytextrank  # importing registers the "textrank" component with spaCy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

MAX_LENGTH = 50_000  # maximum answer length, in characters


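def process_answer(ans, max_tokens=300, num_sentences=5):
    """Return `ans` unchanged if it is short, else an extractive TextRank summary."""
    # "Tokens" here are whitespace-separated words
    if len(ans.split()) <= max_tokens:
        return ans
    doc = nlp(ans)
    top_sentences = [sent.text for sent in doc._.textrank.summary(limit_sentences=num_sentences)]
    summary = []
    n_words = 0
    for sent in top_sentences:
        sent_words = len(sent.split())
        # Stop once adding the next sentence would exceed the word budget
        if n_words + sent_words > max_tokens:
            break
        summary.append(sent)
        n_words += sent_words
    # Join without the leading space the string concatenation used to produce
    return " ".join(summary)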
   


clean_data = {
    "question": [],
    "answer": []
}

for i, data in tqdm(enumerate(dataset["train"])):
    if len(data['answer']) > MAX_LENGTH:
        continue
    clean_data['question'].append(data['question'])
    clean_data['answer'].append(process_answer(data['answer']))


56402it [06:32, 143.76it/s]


The number of unique questions is the same as before. The dataset is still large, however, because most questions have multiple answers.

In [7]:
from datasets import Dataset

clean_dataset = Dataset.from_dict(clean_data)
new_set = {}
for i, q in enumerate(clean_dataset["question"]):
    if q not in new_set:
        new_set[q] = []
    new_set[q].append(i)
clean_dataset.save_to_disk('clean_dataset')

print("dataset size: ", len(clean_dataset))

print("total questions: ", len(new_set))

Saving the dataset (0/1 shards):   0%|          | 0/56382 [00:00<?, ? examples/s]

dataset size:  56382
total questions:  3234


Only a few of the answers to each question are likely to be relevant and of high quality. This is especially true for Quora, where there is little moderation compared to forums like Stack Overflow. We would like to keep only the good answers, which also reduces the dataset size and, correspondingly, the training time.
Judging an answer can be a complex task, though, as we do not have the voting information. Rather than relying on human evaluators, we can use LLMs to score the answers. This should work better than heuristics such as text similarity.

## Answer filtering

#### Zero shot answer classification
We can score each answer by asking a model whether it reliably answers the question, and then sort the answers by the model's confidence score. I am using a popular model based on Microsoft's DeBERTa-v3-base. It was trained on multiple Natural Language Inference (NLI) datasets and is suitable for zero-shot classification. It has ~200M parameters, and its inference time is reasonable.
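
For intuition on where the confidence score comes from: an NLI model outputs logits for entailment, neutral and contradiction between a premise (our prompt) and a hypothesis built from the candidate label. The sketch below assumes that, with a single candidate label, the score is a softmax over just the entailment and contradiction logits. That is my understanding of the zero-shot pipeline's behaviour and should be treated as an assumption; `entailment_confidence` is a hypothetical helper and the logits are made up.

```python
import math

def entailment_confidence(entailment_logit, contradiction_logit):
    """Softmax over the entailment and contradiction logits, returning P(entailment)."""
    # Subtract the max for numerical stability
    m = max(entailment_logit, contradiction_logit)
    e = math.exp(entailment_logit - m)
    c = math.exp(contradiction_logit - m)
    return e / (e + c)

# Made-up logits: the first pair favours entailment, the second contradiction
print(entailment_confidence(2.0, -1.0))
print(entailment_confidence(-0.5, 1.5))
```

Only the relative order of these scores within a question matters for the filtering step, not their absolute values.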

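We keep the answers with the top $20\%$ of scores for each question. This results in a roughly 4-5x reduction in dataset size: the count is rounded up per question, so every question keeps at least one answer.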
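
The per-question selection can be sketched on toy scores as follows. `top_fraction` is a hypothetical helper, and the (index, score) pairs are made up; the rounding with `ceil` mirrors the selection rule described above.

```python
from math import ceil

def top_fraction(scored_answers, percent=20):
    """Keep the highest-scoring `percent` % of (index, score) pairs, rounded up."""
    if not scored_answers:
        return []
    keep = ceil(len(scored_answers) * percent / 100)
    ranked = sorted(scored_answers, key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in ranked[:keep]]

# Toy (row index, score) pairs for the answers to one question
toy = [(10, 0.91), (11, 0.15), (12, 0.60), (13, 0.88), (14, 0.42), (15, 0.07)]
print(top_fraction(toy))  # 6 answers -> ceil(1.2) = 2 kept: [10, 13]
```

Because of the rounding, a question with a single answer always keeps that answer, so the overall share kept ends up somewhat above 20%.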


In [9]:
for i, question in tqdm(enumerate(new_set)):
    print(new_set[question]) 

3234it [00:00, 166215.88it/s]

[0, 4178, 5192, 6222, 6382, 7733, 8881, 9201, 10114, 11578, 13516, 19179, 19784, 22059, 26857, 27400, 28558, 29126, 34306, 34660, 38214, 39425, 41377, 43238, 43987, 45787, 49532, 52170, 54749]
[1, 26, 251, 1616, 2145, 2750, 3422, 3424, 3628, 4092, 5505, 5976, 6825, 6975, 8652, 9403, 11224, 11317, 11887, 12340, 12706, 13018, 13308, 13623, 13672, 13888, 14160, 14798, 15684, 16098, 16380, 16599, 16965, 17127, 19249, 19343, 19696, 20532, 21935, 22341, 22454, 22943, 23680, 23903, 23917, 24765, 24777, 25503, 25764, 26721, 27802, 28080, 28130, 28229, 28391, 28805, 30503, 30973, 31605, 32608, 32649, 34253, 34662, 34960, 35639, 36039, 36788, 37047, 37144, 37196, 37420, 38721, 38810, 38856, 40773, 43389, 43445, 43622, 44688, 44711, 45005, 46179, 46645, 47656, 48346, 48555, 49933, 49978, 50047, 51081, 52184, 53306, 53599, 53716, 54046, 54200, 54423, 54581, 55143, 56083]
[2, 1642, 3126, 3777, 4589, 4675, 5296, 5396, 9543, 11942, 14889, 15231, 16851, 18935, 19261, 20132, 20334, 20571, 22779, 24195,




In [12]:
from transformers import pipeline
import numpy as np
from math import ceil
from tqdm import tqdm
import json
import torch

threshold = 20  # keep the top `threshold` % of answers for each question
answer_judge = pipeline("zero-shot-classification", model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", device=0)

shortlisted_indices = []

for i, question in tqdm(enumerate(new_set)):
    scores = [] 
    prompts = [
        f"""
        Does the following answer reliably answer the question 
        {question}
        Provide a yes/no response. 
        Answer: {clean_dataset['answer'][ans_index]}
        """
        for ans_index in new_set[question]
    ]
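    # A single candidate label: the returned score is the model's confidence that "yes" applies
    candidate_labels = ["yes"]
    try:
        with torch.no_grad():
            res = answer_judge(prompts, candidate_labels)
        scores.extend([item['scores'][0] for item in res])
        del res  # `res` exists only if inference succeeded
    except Exception as e:
        # On failure `scores` stays empty, so no answers are selected for this question
        print(e)
    torch.cuda.empty_cache()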
    sorted_indices = np.argsort(scores)
    total_selects = ceil(len(sorted_indices) * threshold/100)
    sorted_indices = sorted_indices[-total_selects:].tolist()
    select_indices = [new_set[question][index] for index in sorted_indices]
    shortlisted_indices.extend(select_indices)
    with open(f"../log/q_{i}.json", "w") as f:
        json.dump(select_indices, f)

print("remaining dataset size: ", len(shortlisted_indices))

0it [00:00, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
10it [00:40,  5.00s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
3234it [1:08:30,  1.27s/it]

remaining dataset size:  12710





In [16]:
# Keep only the shortlisted rows
reduced_dataset = clean_dataset.select(shortlisted_indices)
reduced_dataset.save_to_disk('reduced_dataset')

Saving the dataset (0/1 shards):   0%|          | 0/12710 [00:00<?, ? examples/s]

# Question Analysis

We now study the question text to find patterns. We preserve case so as not to lose information about acronyms and proper nouns, and we keep digits because a question may involve years or numbers. The steps are:
1. Tokenize the questions into words
2. Stop word removal
3. Lemmatization, which gives more meaningful words than stemming. As there are only ~3k questions, the higher computational cost is still practical.
4. Topic modelling
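
One detail worth keeping in mind for steps 1 and 2: a stop-word list is usually lowercase, so if case is preserved, the comparison has to lowercase each token to catch words like "What". The sketch below shows one possible variant on a toy stop-word list. It is an illustration, not the code used in this notebook, and `TOY_STOP_WORDS` and `clean_tokens` are hypothetical names.

```python
import re

# Tiny stand-in for a lowercase English stop-word list (assumption for the demo)
TOY_STOP_WORDS = {"what", "how", "is", "the", "a", "in", "of", "to"}

def clean_tokens(question, stop_words=TOY_STOP_WORDS):
    """Strip punctuation, drop empty tokens and stop words, keep original casing."""
    tokens = [re.sub(r"\W+", "", tok) for tok in question.split()]
    # Compare in lowercase, but return the token with its original case
    return [tok for tok in tokens if tok and tok.lower() not in stop_words]

print(clean_tokens("What is the GDP of India in 2020 ?"))  # ['GDP', 'India', '2020']
```

The acronym and the year survive with their original form, while the capitalized question word and the empty token left by the stray "?" are dropped.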

In [50]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from collections import Counter



lemmatizer = WordNetLemmatizer()
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = stopwords.words('english')

questions = list(new_set.keys())
docs = [word_tokenize(question) for question in questions]
# Remove non-alphanumeric characters from each token
docs = [[re.sub(r'\W+', '', token) for token in doc ] for doc in docs]

no_stop_doc = [[token for token in doc if token not in stop_words] for doc in docs]

lemmatized = [[lemmatizer.lemmatize(token) for token in doc] for doc in no_stop_doc ]
flatten_tokens = [token for doc in lemmatized for token in doc if len(token) > 1]

# Count how often each token occurs
unique_tokens = Counter(flatten_tokens)

print("num unique tokens: ", len(unique_tokens))

[nltk_data] Downloading package stopwords to /home/a/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/a/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


num unique tokens:  8905


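The ten most frequent tokens are dominated by question words (`What`, `How`, `Why`, `Is`, `Can`). They survive stop-word removal because NLTK's stop-word list is lowercase while the tokens keep their original case.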

In [53]:
sorted(unique_tokens, key=unique_tokens.get, reverse=True)[:10]

['What', 'How', 'Why', 'Is', 'Can', 'would', 'best', 'get', 'like', 'If']

In [56]:
from gensim.models import LdaModel
from gensim.corpora import Dictionary

# Create the dictionary and corpus
dictionary = Dictionary(lemmatized)
corpus = [dictionary.doc2bow(text) for text in lemmatized]

# Choose the number of topics
num_topics = 5

# Train the LDA model
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}: {topic}")

Topic 0: 0.124*"" + 0.038*"What" + 0.020*"How" + 0.015*"I" + 0.006*"best" + 0.005*"get" + 0.005*"Is" + 0.004*"Why" + 0.003*"one" + 0.003*"company"
Topic 1: 0.125*"" + 0.037*"What" + 0.008*"Why" + 0.006*"I" + 0.006*"would" + 0.004*"thing" + 0.004*"difference" + 0.004*"people" + 0.003*"nt" + 0.003*"How"
Topic 2: 0.174*"" + 0.024*"I" + 0.014*"What" + 0.013*"Why" + 0.010*"How" + 0.009*"Is" + 0.005*"like" + 0.005*"Do" + 0.004*"get" + 0.004*"Which"
Topic 3: 0.103*"" + 0.041*"How" + 0.022*"I" + 0.011*"Can" + 0.006*"Why" + 0.005*"use" + 0.005*"get" + 0.004*"many" + 0.004*"Which" + 0.003*"language"
Topic 4: 0.111*"" + 0.037*"What" + 0.010*"Why" + 0.005*"Is" + 0.004*"difference" + 0.004*"best" + 0.003*"like" + 0.003*"way" + 0.003*"Who" + 0.003*"Are"


In [14]:
for _ in range(total):
    with torch.no_grad():
        res = answer_judge(prompt, candidate_labels)
    del res
    torch.cuda.empty_cache()

answer

In [None]:
# url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
# filtered_tokens = [token for token in tokens if not re.match(url_pattern, token)]