<a href="https://colab.research.google.com/github/rrfsantos/Desafios-NLP/blob/main/Desafio-NLP--Question%26Answer/spaCy_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Question-and-answer retrieval using similarity with spacy-sentence-bert

https://spacy.io/universe/project/spacy-sentence-bert

Being able to automatically answer questions accurately remains a difficult problem in natural language processing. This dataset has everything you need to try your own hand at this task. Can you correctly generate the answer to questions given the Wikipedia article text the question was originally generated from?

Content:
There are three question files, one for each year of students: S08, S09, and S10, as well as 690,000 words worth of cleaned text from Wikipedia that was used to generate the questions.

The `question_answer_pairs.txt` files (one per year, e.g. `S08_question_answer_pairs.txt`) contain both the questions and the answers. The columns in these files are as follows:

- `ArticleTitle` is the name of the Wikipedia article from which questions and answers initially came.
- `Question` is the question.
- `Answer` is the answer.
- `DifficultyFromQuestioner` is the prescribed difficulty rating for the question as given to the question-writer.
- `DifficultyFromAnswerer` is a difficulty rating assigned by the individual who evaluated and answered the question, which may differ from the difficulty in field 4.
- `ArticleFile` is the name of the file with the relevant article.

Questions that were judged to be poor were discarded from this data set.
There are frequently multiple lines with the same question, which appear if those questions were answered by multiple individuals.

Source: https://www.kaggle.com/rtatman/questionanswer-dataset
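The column layout described above can be tried out without the real files. The following is a minimal sketch that parses a tab-separated sample with the standard `csv` module and groups the answers given for each question. The column names come from the description; the sample rows, the difficulty labels, the `placeholder_file` value, and the helper name `group_answers` are assumptions made for illustration only.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the column layout described above (tab-separated).
# The difficulty labels and the ArticleFile values are placeholders.
SAMPLE = (
    "ArticleTitle\tQuestion\tAnswer\tDifficultyFromQuestioner\t"
    "DifficultyFromAnswerer\tArticleFile\n"
    "Canada\tIs Canada bilingual?\tYes.\teasy\teasy\tplaceholder_file\n"
    "Canada\tIs Canada bilingual?\tYes, it is.\teasy\tmedium\tplaceholder_file\n"
)


def group_answers(text):
    """Map each question to the list of answers given for it."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    answers = defaultdict(list)
    for row in reader:
        answers[row["Question"]].append(row["Answer"])
    return dict(answers)


grouped = group_answers(SAMPLE)
for question, given in grouped.items():
    print(question, "->", given)
```

A question that was answered by several people shows up as one key with several answers, which is the repetition the description mentions.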

In [1]:
!python -m pip uninstall -y spacy

Found existing installation: spacy 3.0.6
Uninstalling spacy-3.0.6:
  Successfully uninstalled spacy-3.0.6


In [2]:
!python -m pip install -U pip setuptools wheel
!python -m pip install -U spacy[cuda102]
!python -m pip install spacy-sentence-bert

Collecting spacy[cuda102]
  Using cached spacy-3.0.6-cp37-cp37m-manylinux2014_x86_64.whl (12.8 MB)
Installing collected packages: spacy
Successfully installed spacy-3.0.6


In [3]:
import pandas as pd
import re
import random
import string

In [4]:
import spacy
print(spacy.__version__)
import spacy_sentence_bert

3.0.6


In [5]:
nlp = spacy_sentence_bert.load_model('en_stsb_roberta_large')

In [6]:
from google.colab import drive
drive.mount('/content/drive')

import os
workdir_path = '/content/drive/My Drive/desafio 3/'  # Path to the folder that holds the input files (train and test)
os.chdir(workdir_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Create QA dataframes from the Sxx_question_answer_pairs.txt files

In [7]:
# Load the data
df_08 = pd.read_table('S08_question_answer_pairs.txt')
df_09 = pd.read_table('S09_question_answer_pairs.txt')
# error_bad_lines=False skips malformed rows; pandas >= 1.3 replaces it with on_bad_lines='skip'
df_10 = pd.read_table('S10_question_answer_pairs.txt', engine='python', error_bad_lines=False)

Skipping line 765: '	' expected after '"'
Skipping line 876: '	' expected after '"'
Skipping line 1219: '	' expected after '"'


In [8]:
# Drop the columns that will not be used
df_08.drop(['DifficultyFromQuestioner', 'DifficultyFromAnswerer', 'ArticleTitle'], axis = 1, inplace=True)
df_09.drop(['DifficultyFromQuestioner', 'DifficultyFromAnswerer', 'ArticleTitle'], axis = 1, inplace=True)
df_10.drop(['DifficultyFromQuestioner', 'DifficultyFromAnswerer', 'ArticleTitle'], axis = 1, inplace=True)

In [9]:
# Count the missing values per column in each dataframe
print('-' * 15, df_08.isna().sum(), sep='\n')
print('-' * 15, df_09.isna().sum(), sep='\n')
print('-' * 15, df_10.isna().sum(), sep='\n')

---------------
Question        19
Answer         240
ArticleFile      2
dtype: int64
---------------
Question         0
Answer         100
ArticleFile      0
dtype: int64
---------------
Question        18
Answer         236
ArticleFile      0
dtype: int64


In [10]:
# Drop rows with missing values, then confirm that none remain
df_08.dropna(inplace=True)
df_09.dropna(inplace=True)
df_10.dropna(inplace=True)
print('-' * 15, df_08.isna().sum(), sep='\n')
print('-' * 15, df_09.isna().sum(), sep='\n')
print('-' * 15, df_10.isna().sum(), sep='\n')

---------------
Question       0
Answer         0
ArticleFile    0
dtype: int64
---------------
Question       0
Answer         0
ArticleFile    0
dtype: int64
---------------
Question       0
Answer         0
ArticleFile    0
dtype: int64


In [11]:
print('-' * 15, df_08.shape, sep='\n')
print('-' * 15, df_09.shape, sep='\n')
print('-' * 15, df_10.shape, sep='\n')

---------------
(1473, 3)
---------------
(725, 3)
---------------
(1219, 3)


In [12]:
# Clean the "Answer" column: lowercase it and strip one trailing punctuation character
def strip_last_punctuation(s):
  if s and s[-1] in string.punctuation:
    return s[:-1]
  else:
    return s


df_08['answer_clean'] = df_08['Answer'].str.lower().map(strip_last_punctuation)
df_09['answer_clean'] = df_09['Answer'].str.lower().map(strip_last_punctuation)
df_10['answer_clean'] = df_10['Answer'].str.lower().map(strip_last_punctuation)
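To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same lowercase-then-strip logic to a few strings. The function body repeats the one in the cell above; the wrapper name `normalize` and the example strings are illustrative assumptions.

```python
import string


def strip_last_punctuation(s):
    # Same logic as the cell above: drop one trailing punctuation character.
    if s and s[-1] in string.punctuation:
        return s[:-1]
    return s


def normalize(answer):
    # Lowercase first, then strip, mirroring .str.lower().map(...).
    return strip_last_punctuation(answer.lower())


examples = ["Yes.", "yes", "Yes!", "Yes...", "Yes. "]
normalized = [normalize(raw) for raw in examples]
for raw, clean in zip(examples, normalized):
    print(repr(raw), "->", repr(clean))
```

Only a single trailing character is removed, so `"Yes..."` keeps two dots and an answer ending in a space is left unchanged. Answers that differ only in those ways would still count as distinct in a later de-duplication.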

In [13]:
# Remove duplicated question/answer pairs
df_08.drop_duplicates(subset=['answer_clean', 'Question'], keep='last', inplace = True)
df_09.drop_duplicates(subset=['answer_clean', 'Question'], keep='last', inplace = True)
df_10.drop_duplicates(subset=['answer_clean', 'Question'], keep='last', inplace = True)

print('-' * 15, df_08.shape, sep='\n')
print('-' * 15, df_09.shape, sep='\n')
print('-' * 15, df_10.shape, sep='\n')

---------------
(1148, 4)
---------------
(593, 4)
---------------
(1046, 4)
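The de-duplication above keys on the pair (`answer_clean`, `Question`) and keeps the last occurrence. The sketch below mimics that behaviour in plain Python on a handful of rows, as an illustration of the idea rather than of how pandas implements it. The helper name `drop_duplicates_keep_last` and the second row are hypothetical.

```python
rows = [
    {"Question": "Is Canada bilingual?", "Answer": "Yes.", "answer_clean": "yes"},
    # Hypothetical extra row: same question, same cleaned answer as the row above.
    {"Question": "Is Canada bilingual?", "Answer": "yes", "answer_clean": "yes"},
    {"Question": "Is Canada bilingual?", "Answer": "Yes, it is.", "answer_clean": "yes, it is"},
]


def drop_duplicates_keep_last(rows, subset):
    """Keep the last row for each distinct key, preserving the original order."""
    seen = set()
    kept = []
    # Walk backwards so the first row seen for a key is the last occurrence.
    for row in reversed(rows):
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    kept.reverse()
    return kept


deduped = drop_duplicates_keep_last(rows, subset=["answer_clean", "Question"])
for row in deduped:
    print(row["Question"], "|", row["Answer"])
```

Because the key includes the cleaned answer, one question can still appear more than once after de-duplication when its answers differ in wording.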


### S08_question_answer_pairs.txt

In [14]:
# Create a spaCy Doc for each question
df_08['question_doc'] = [nlp(text) for text in df_08.Question]

In [15]:
# Pick a random question from the S08 dataset
# (tolist() avoids label-based lookups on the index, which has gaps after the drops above)
query_08 = random.choice(df_08.Question.tolist())
doc_08 = nlp(query_08)

In [16]:
# Find the answer: print every row whose question has similarity 1 with the query
for index, r in df_08.iterrows():
  if r['question_doc'].similarity(doc_08) == 1:
    print('Question: ', r.Question, '\n', 'Answer: ', r.Answer )

Question:  Is Canada bilingual? 
 Answer:  Yes.
Question:  Is Canada bilingual? 
 Answer:  Yes, it is.
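The lookup above only reports rows whose similarity with the query is exactly 1, which works here because the query is drawn from the dataset itself. For a query that is not in the dataset, one option is to rank candidates by cosine similarity and compare against 1 with a tolerance. The sketch below shows that idea in plain Python; the three-dimensional vectors are made-up stand-ins for sentence embeddings, the second candidate question is invented, and the names `cosine_similarity` and `rank_by_similarity` are assumptions, not part of spaCy or spacy-sentence-bert.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates, top_k=3):
    """Return the top_k (score, text) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for sentence embeddings (made up for illustration).
candidates = [
    ("Is Canada bilingual?", [0.9, 0.1, 0.2]),
    ("Does Canada have two official languages?", [0.8, 0.2, 0.3]),
    ("Drums are usually played by what?", [0.1, 0.9, 0.1]),
]
query_vec = [0.9, 0.1, 0.2]

ranked = rank_by_similarity(query_vec, candidates, top_k=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")

# Compare against 1 with a tolerance instead of exact float equality.
best_score, best_text = ranked[0]
is_exact = math.isclose(best_score, 1.0, abs_tol=1e-6)
print("exact match:", is_exact)
```

Ranking returns the closest questions even when none matches exactly, and the tolerance guards against floating-point rounding in the similarity value.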


### S09_question_answer_pairs.txt

In [17]:
# Create a spaCy Doc for each question
df_09['question_doc'] = [nlp(text) for text in df_09.Question]

In [18]:
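# Pick a random question from the S09 dataset
# (tolist() avoids label-based lookups on the index, which has gaps after the drops above)
query_09 = random.choice(df_09.Question.tolist())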
doc_09 = nlp(query_09)

In [19]:
# Find the answer: print every row whose question has similarity 1 with the query
for index, r in df_09.iterrows():
  if r['question_doc'].similarity(doc_09) == 1:
    print('Question: ', r.Question, '\n', 'Answer: ', r.Answer )

Question:  What happened to Copenhagen between 1251 and 1255? 
 Answer:  a bunch of things


### S10_question_answer_pairs.txt

In [20]:
# Create a spaCy Doc for each question
df_10['question_doc'] = [nlp(text) for text in df_10.Question]

In [21]:
# Pick a random question from the S10 dataset
# (tolist() avoids label-based lookups on the index, which has gaps after the drops above)
query_10 = random.choice(df_10.Question.tolist())
doc_10 = nlp(query_10)

In [22]:
# Find the answer: print every row whose question has similarity 1 with the query
for index, r in df_10.iterrows():
  if r['question_doc'].similarity(doc_10) == 1:
    print('Question: ', r.Question, '\n', 'Answer: ', r.Answer )

Question:  Drums are usually played by what? 
 Answer:  the hands, or by one or two sticks
Question:  Drums are usually played by what? 
 Answer:  the hands
