## Question-and-answer retrieval using similarity with spacy-sentence-bert

https://spacy.io/universe/project/spacy-sentence-bert

Being able to automatically answer questions accurately remains a difficult problem in natural language processing. This dataset has everything you need to try your own hand at this task. Can you correctly generate the answer to questions given the Wikipedia article text the question was originally generated from?

Content:
There are three question files, one for each year of students: S08, S09, and S10, as well as 690,000 words worth of cleaned text from Wikipedia that was used to generate the questions.

The `Sxx_question_answer_pairs.txt` files contain both the questions and the answers. The columns in these files are as follows:

- ArticleTitle is the name of the Wikipedia article from which the questions and answers initially came.
- Question is the question.
- Answer is the answer.
- DifficultyFromQuestioner is the prescribed difficulty rating for the question as given to the question-writer.
- DifficultyFromAnswerer is a difficulty rating assigned by the individual who evaluated and answered the question, which may differ from the difficulty in field 4.
- ArticleFile is the name of the file with the relevant article.

Questions that were judged to be poor were discarded from this data set.
There are frequently multiple lines with the same question, which appear if those questions were answered by multiple individuals.

Source: https://www.kaggle.com/rtatman/questionanswer-dataset
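
To make the column layout concrete, here is a minimal sketch that parses a small tab-separated sample with the standard `csv` module. The two rows and the file name `example_file` are invented for illustration and are not taken from the dataset. The tab delimiter is an assumption based on the notebook loading the files with `pd.read_table`.

```python
import csv
import io

# Hypothetical sample in the six-column layout described above; the rows are invented.
SAMPLE = (
    "ArticleTitle\tQuestion\tAnswer\tDifficultyFromQuestioner\t"
    "DifficultyFromAnswerer\tArticleFile\n"
    "Example_Article\tIs this an example?\tyes\teasy\teasy\texample_file\n"
    "Example_Article\tIs this an example?\tYes.\teasy\tmedium\texample_file\n"
)

# Parse the tab-separated text into one dict per row, keyed by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

# The same question can appear on several lines, once per answerer.
questions = [row["Question"] for row in rows]
answers = [row["Answer"] for row in rows]
print(len(rows), len(set(questions)), answers)
```

The repeated question with slightly different answer strings is the situation the cleaning and de-duplication cells later in the notebook deal with.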

In [None]:
!python -m pip uninstall -y spacy

In [None]:
!python -m pip install -U pip setuptools wheel
!python -m pip install -U spacy[cuda102]
!python -m pip install spacy-sentence-bert

In [None]:
import pandas as pd
import re
import random
import string

In [None]:
import spacy
print(spacy.__version__)
import spacy_sentence_bert

In [None]:
nlp = spacy_sentence_bert.load_model('en_stsb_roberta_large')

#### Create the QA dataframes from Sxx_question_answer_pairs.txt

In [None]:
# Load the data (tab-separated files)
df_08 = pd.read_table('../input/questionanswer-dataset/S08_question_answer_pairs.txt/S08_question_answer_pairs.txt')
df_09 = pd.read_table('../input/questionanswer-dataset/S08_question_answer_pairs.txt/S09_question_answer_pairs.txt')
# Skip malformed rows in S10; on_bad_lines (pandas >= 1.3) replaces error_bad_lines, which was removed in pandas 2.0
df_10 = pd.read_table('../input/questionanswer-dataset/S08_question_answer_pairs.txt/S10_question_answer_pairs.txt', engine='python', on_bad_lines='skip')

In [None]:
# Drop the columns that will not be used
df_08.drop(['DifficultyFromQuestioner', 'DifficultyFromAnswerer', 'ArticleTitle'], axis = 1, inplace=True)
df_09.drop(['DifficultyFromQuestioner', 'DifficultyFromAnswerer', 'ArticleTitle'], axis = 1, inplace=True)
df_10.drop(['DifficultyFromQuestioner', 'DifficultyFromAnswerer', 'ArticleTitle'], axis = 1, inplace=True)

In [None]:
# Count the missing values in each column before removing them
print('-' * 15, df_08.isna().sum(), sep='\n')
print('-' * 15, df_09.isna().sum(), sep='\n')
print('-' * 15, df_10.isna().sum(), sep='\n')

In [None]:
# Drop rows with missing values, then confirm that none remain
df_08.dropna(inplace=True)
df_09.dropna(inplace=True)
df_10.dropna(inplace=True)
print('-' * 15, df_08.isna().sum(), sep='\n')
print('-' * 15, df_09.isna().sum(), sep='\n')
print('-' * 15, df_10.isna().sum(), sep='\n')

In [None]:
print('-' * 15, df_08.shape, sep='\n')
print('-' * 15, df_09.shape, sep='\n')
print('-' * 15, df_10.shape, sep='\n')

In [None]:
# Clean the "Answer" column: lowercase it and strip one trailing punctuation mark
def strip_last_punctuation(s):
  if s and s[-1] in string.punctuation:
    return s[:-1]
  else:
    return s


df_08['answer_clean'] = df_08['Answer'].str.lower().map(strip_last_punctuation)
df_09['answer_clean'] = df_09['Answer'].str.lower().map(strip_last_punctuation)
df_10['answer_clean'] = df_10['Answer'].str.lower().map(strip_last_punctuation)
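
As a quick illustration of what this cleaning step does, the following sketch applies the same helper to a few made-up answer strings (the inputs are hypothetical, not dataset rows). It shows that only a single trailing punctuation character is removed, so variants such as "Yes." and "yes" collapse to the same value while punctuation inside the string is kept.

```python
import string


def strip_last_punctuation(s):
    # Same logic as the notebook's helper: drop at most one trailing punctuation mark.
    if s and s[-1] in string.punctuation:
        return s[:-1]
    return s


# Hypothetical answers, lowercased first as in the notebook.
raw_answers = ["Yes.", "yes", "In 1865!", "U.S.", ""]
cleaned = [strip_last_punctuation(a.lower()) for a in raw_answers]
print(cleaned)
```

Because normalized answers are compared during de-duplication, rows that differ only in case or a final period are treated as the same question-answer pair.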

In [None]:
# Remove duplicate question-answer pairs
df_08.drop_duplicates(subset=['answer_clean', 'Question'], keep='last', inplace = True)
df_09.drop_duplicates(subset=['answer_clean', 'Question'], keep='last', inplace = True)
df_10.drop_duplicates(subset=['answer_clean', 'Question'], keep='last', inplace = True)

print('-' * 15, df_08.shape, sep='\n')
print('-' * 15, df_09.shape, sep='\n')
print('-' * 15, df_10.shape, sep='\n')

### S08_question_answer_pairs.txt

In [None]:
# Create a spaCy Doc for each question
df_08['question_doc'] = [nlp(text) for text in df_08.Question]

In [None]:
# Pick a random question from the S08 dataset
# (.tolist() avoids a KeyError: the index has gaps after rows were dropped)
query_08 = random.choice(df_08.Question.tolist())
doc_08 = nlp(query_08)

In [None]:
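# Find the answer: print every stored question whose similarity to the query exceeds 0.99
for index, r in df_08.iterrows():
  # Compare against the Doc (doc_08), not the raw string (query_08)
  if r['question_doc'].similarity(doc_08) > 0.99:
    print('Question: ', r.Question, '\n', 'Answer: ', r.Answer)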
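
The loop above keeps only near-exact matches (similarity above 0.99), which works here because the query is drawn from the same dataframe. The default estimate behind spaCy's `Doc.similarity` is a cosine similarity over document vectors, so the retrieval step can be sketched without loading any model. The following is a minimal sketch using toy 3-dimensional vectors in place of sentence embeddings. The names `cosine`, `best_match`, and `INDEX` are hypothetical, and returning the single best match instead of thresholding is a swapped-in variation, not what the notebook does.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_u * norm_v)


# Toy index of (question, answer, vector); the vectors are invented.
INDEX = [
    ("Question A", "Answer A", [1.0, 0.0, 0.0]),
    ("Question B", "Answer B", [0.0, 1.0, 0.0]),
    ("Question C", "Answer C", [0.6, 0.8, 0.0]),
]


def best_match(query_vector, index):
    """Return (score, question, answer) for the most similar stored question."""
    return max(
        (cosine(query_vector, vec), question, answer)
        for question, answer, vec in index
    )


score, question, answer = best_match([0.5, 0.9, 0.1], INDEX)
print(question, answer, round(score, 3))
```

Taking the best-scoring question is more forgiving than a fixed 0.99 threshold when the query is a paraphrase rather than an exact copy of a stored question, at the cost of always returning something even for unrelated queries.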

### S09_question_answer_pairs.txt

In [None]:
# Create a spaCy Doc for each question
df_09['question_doc'] = [nlp(text) for text in df_09.Question]

In [None]:
# Pick a random question from the S09 dataset
# (.tolist() avoids a KeyError: the index has gaps after rows were dropped)
query_09 = nlp(random.choice(df_09.Question.tolist()))

In [None]:
# Find the answer
for index, r in df_09.iterrows():
  if r['question_doc'].similarity(query_09) > 0.99:
    print('Question: ', r.Question, '\n', 'Answer: ', r.Answer )

### S10_question_answer_pairs.txt

In [None]:
# Create a spaCy Doc for each question
df_10['question_doc'] = [nlp(text) for text in df_10.Question]

In [None]:
# Pick a random question from the S10 dataset
# (.tolist() avoids a KeyError: the index has gaps after rows were dropped)
query_10 = nlp(random.choice(df_10.Question.tolist()))

In [None]:
# Find the answer
for index, r in df_10.iterrows():
  if r['question_doc'].similarity(query_10) > 0.99:
    print('Question: ', r.Question, '\n', 'Answer: ', r.Answer )
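
The S08, S09, and S10 sections repeat the same lookup for each dataframe. One possible refactoring, sketched under assumptions, wraps that lookup in a single function. `find_answers` and `FakeDoc` are hypothetical names: `FakeDoc` is a stand-in exposing a `similarity` method so the sketch runs without loading a model, whereas in the notebook the objects would be the spaCy Docs stored in `question_doc`.

```python
def find_answers(records, query_doc, threshold=0.99):
    """Return (question, answer) pairs whose stored doc is similar enough to the query.

    `records` is any iterable of (question, answer, doc) triples, where `doc`
    exposes a `similarity(other)` method returning a float.
    """
    return [
        (question, answer)
        for question, answer, doc in records
        if doc.similarity(query_doc) > threshold
    ]


class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc: identical text scores 1.0, anything else 0.0."""

    def __init__(self, text):
        self.text = text

    def similarity(self, other):
        return 1.0 if self.text == other.text else 0.0


records = [
    ("Question A", "Answer A", FakeDoc("Question A")),
    ("Question B", "Answer B", FakeDoc("Question B")),
]
matches = find_answers(records, FakeDoc("Question B"))
print(matches)
```

With the notebook's dataframes, the triples could be built with something like `zip(df_08.Question, df_08.Answer, df_08.question_doc)`, which also avoids the per-row overhead of `iterrows()`.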