<a href="https://colab.research.google.com/github/rgvictor03/rankqa/blob/master/QA%20with%20transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install sentence-transformers torch pinecone-client datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting pinecone-client
  Downloading pinecone_client-2.2.1-py3-none-any.whl (177 kB)
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.29.1-py3-none-any.whl (7.1 MB)

We load the data with Hugging Face `datasets`.

In [3]:
import datasets

qa = datasets.load_dataset('squad', split='validation')
qa

Downloading and preparing dataset squad/plain_text to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})

In [4]:
qa[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


We remove duplicate contexts. Each context has several questions, so to reduce the dataset size we keep only one question per context and drop the repeated contexts.

In [5]:
# sets give constant-time membership checks (the original lists were scanned on every row)
unique_contexts = set()
unique_ids = set()

# collect the IDs of the first example seen for each context
for row in qa:
  if row['context'] not in unique_contexts:
    unique_contexts.add(row['context'])
    unique_ids.add(row['id'])

# now drop every example whose ID is not among the unique IDs
qa = qa.filter(lambda x: x['id'] in unique_ids)
qa

Filter:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 2067
})

`qa` is now a dataset with the columns `['id', 'title', 'context', 'question', 'answers']` and 2067 rows.
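
As a minimal sketch of the deduplication idea, here is the same "keep the first row per context" logic in plain Python on made-up rows (not the SQuAD data). The `rows` list and the `first_id_per_context` helper are hypothetical names used only for illustration.

```python
# Toy rows standing in for the dataset (hypothetical data, not from SQuAD).
rows = [
    {"id": "a1", "context": "ctx A", "question": "q1"},
    {"id": "a2", "context": "ctx A", "question": "q2"},
    {"id": "b1", "context": "ctx B", "question": "q3"},
    {"id": "a3", "context": "ctx A", "question": "q4"},
]

def first_id_per_context(rows):
    """Return the ids of the first row seen for each distinct context."""
    seen_contexts = set()
    kept_ids = []
    for row in rows:
        if row["context"] not in seen_contexts:
            seen_contexts.add(row["context"])
            kept_ids.append(row["id"])
    return kept_ids

kept_ids = first_id_per_context(rows)
kept_id_set = set(kept_ids)
deduplicated = [row for row in rows if row["id"] in kept_id_set]

print(kept_ids)           # ['a1', 'b1']
print(len(deduplicated))  # 2
```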

We create context vectors with the retriever model.

In [8]:
from sentence_transformers import SentenceTransformer
# We need to encode each context into a context vector.
# For that we use the following retrieval model.
# It is not the best one, but it is among the fastest.
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

First layer: BERT-based transformer
Second layer: mean pooling over the token embeddings (384 dimensions)
Third layer: normalization
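
To make the last two layers concrete, the following is a simplified sketch of mean pooling followed by L2 normalization in plain Python. The token vectors are hypothetical (dimension 2 instead of 384), and the sketch ignores padding, which a real pooling layer has to account for.

```python
import math

def mean_pool(token_vectors):
    """Average the token vectors element-wise into one sentence vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Hypothetical token embeddings for a 3-token sentence.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

pooled = mean_pool(token_vectors)        # [3.0, 4.0]
sentence_vector = l2_normalize(pooled)   # [0.6, 0.8]
print(pooled, sentence_vector)
```

Because the output has unit length, the dot product of two such vectors equals their cosine similarity.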



We encode the context vectors.

`model.encode().tolist()` creates a vector that represents the sentence we pass in.

In [9]:
qa = qa.map(lambda x: {
    'encoding': model.encode(x['context']).tolist()
}, batched=True, batch_size=32)
qa

Map:   0%|          | 0/2067 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers', 'encoding'],
    num_rows: 2067
})

We have added to `qa` the `encoding` column, which contains the encoded context.
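
With `batched=True`, the function passed to `map` receives a dictionary of columns in which each value is a list covering the whole batch, and it returns new columns of the same length. The following is a rough pure-Python sketch of that contract; `map_batched` and `fake_encode` are hypothetical stand-ins, not the `datasets` or `sentence-transformers` implementation.

```python
def fake_encode(texts):
    """Stand-in encoder: maps each text to a tiny 2-number 'vector'."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

def map_batched(columns, fn, batch_size):
    """Apply fn to column-wise batches and merge the new columns back in."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}

columns = {"id": ["1", "2", "3"], "context": ["a b", "c", "d e f"]}
result = map_batched(
    columns,
    lambda batch: {"encoding": fake_encode(batch["context"])},
    batch_size=2,
)
print(result["encoding"])  # [[3.0, 1.0], [1.0, 0.0], [5.0, 2.0]]
```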

### Create Vector Database (and index context vectors)

We can use either Faiss or Pinecone. Pinecone requires an API key, which can be obtained at `app.pinecone.io`.

The client also needs to be installed:
`!pip install pinecone-client`

In [10]:
!pip install pinecone-client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [11]:
API_KEY = "YOUR_API_KEY"  # replace with your own key from app.pinecone.io; do not commit real keys

In [15]:
import pinecone

pinecone.init(API_KEY, environment='asia-southeast1-gcp-free')

We create an index with Pinecone.

In [16]:
# the index dimension must match the embedding size of the model
pinecone.create_index('qa-index', dimension=model.get_sentence_embedding_dimension())


In [17]:
index = pinecone.Index('qa-index')

Now we upload the vectors in batches.

In [33]:
from tqdm.auto import tqdm  # progress bar

# upserts holds, for each row of qa, an (id, encoding) pair
upserts = [(v['id'], v['encoding']) for v in qa]
# upload to the index in batches of 50 vectors
for i in tqdm(range(0, len(upserts), 50)):
  i_end = min(i + 50, len(upserts))  # the last batch may be shorter
  index.upsert(vectors=upserts[i:i_end])



  0%|          | 0/42 [00:00<?, ?it/s]

In [34]:
i, i_end

(2050, 2067)
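
The batch boundaries can be checked without touching the index. As a small sketch, the hypothetical helper `batch_bounds` below generates the same `(i, i_end)` pairs as the loop above for 2067 items and a batch size of 50.

```python
def batch_bounds(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

bounds = list(batch_bounds(2067, 50))
print(len(bounds))  # 42
print(bounds[-1])   # (2050, 2067)
```

This matches the 42 iterations shown by the progress bar and the final values of `i` and `i_end`.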

-----

### QA Inference

In [35]:
query = "Which NFL team represented the AFC at Super Bowl 50?"
xq = model.encode([query]).tolist()

In [36]:
xc = index.query(xq, top_k=5)
xc

{'matches': [{'id': '56be4db0acb8001400a502ec',
              'score': 0.685847461,
              'values': []},
             {'id': '56be53b8acb8001400a50314',
              'score': 0.586465836,
              'values': []},
             {'id': '56be4e1facb8001400a502f6',
              'score': 0.54540956,
              'values': []},
             {'id': '56becb823aeaaa14008c948b',
              'score': 0.538328886,
              'values': []},
             {'id': '56bec0dd3aeaaa14008c9357',
              'score': 0.520058692,
              'values': []}],
 'namespace': ''}
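
Conceptually, the query ranks the stored vectors by their similarity to the query vector and returns the `top_k` ids with their scores. The sketch below is a brute-force illustration of that idea on hypothetical unit-length vectors, where the dot product equals the cosine similarity. It is not how Pinecone is implemented.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vector, stored, k):
    """Return the k (id, score) pairs with the highest dot product."""
    scored = [(vec_id, dot(query_vector, vec)) for vec_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical unit-length vectors keyed by id.
stored = {
    "ctx-1": [1.0, 0.0],
    "ctx-2": [0.6, 0.8],
    "ctx-3": [0.0, 1.0],
}
query_vector = [0.8, 0.6]

matches = top_k(query_vector, stored, k=2)
print([vec_id for vec_id, _ in matches])  # ['ctx-2', 'ctx-1']
```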

In [38]:
xc['matches']

[{'id': '56be4db0acb8001400a502ec', 'score': 0.685847461, 'values': []},
 {'id': '56be53b8acb8001400a50314', 'score': 0.586465836, 'values': []},
 {'id': '56be4e1facb8001400a502f6', 'score': 0.54540956, 'values': []},
 {'id': '56becb823aeaaa14008c948b', 'score': 0.538328886, 'values': []},
 {'id': '56bec0dd3aeaaa14008c9357', 'score': 0.520058692, 'values': []}]

In [39]:
ids = [x['id'] for x in xc['matches']] 
ids

['56be4db0acb8001400a502ec',
 '56be53b8acb8001400a50314',
 '56be4e1facb8001400a502f6',
 '56becb823aeaaa14008c948b',
 '56bec0dd3aeaaa14008c9357']

In [40]:
contexts = qa.filter(lambda x: x['id'] in ids)
contexts

Filter:   0%|          | 0/2067 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers', 'encoding'],
    num_rows: 5
})
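
Note that `filter` keeps rows in dataset order, so the five contexts are not necessarily sorted by retrieval score. If the ranking matters, one option is to look rows up by id. A minimal sketch on hypothetical rows (the names `rows`, `ranked_ids` and `by_id` are illustrative only):

```python
# Hypothetical rows in dataset order and ids in retrieval (score) order.
rows = [
    {"id": "x", "context": "context x"},
    {"id": "y", "context": "context y"},
    {"id": "z", "context": "context z"},
]
ranked_ids = ["z", "x"]

# Filtering keeps dataset order ...
filtered = [row["context"] for row in rows if row["id"] in ranked_ids]

# ... while an id lookup follows the ranking.
by_id = {row["id"]: row for row in rows}
ranked = [by_id[i]["context"] for i in ranked_ids]

print(filtered)  # ['context x', 'context z']
print(ranked)    # ['context z', 'context x']
```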

In [41]:
contexts['context']

['Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'The Panthers finished the regular season with a 15–1 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP). They defeated the Arizona Cardinals 49–15 in the NFC Championship Game and advanced

### Extractive pipeline

In [42]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
nlp = pipeline(tokenizer=model_name, model=model_name, task='question-answering')

Downloading (…)lve/main/config.json:   0%|          | 0.00/635 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [43]:
# run the reader on each retrieved context and print its best answer span
for context in contexts['context']:
  print(nlp(question=query, context=context))

{'score': 0.999852180480957, 'start': 177, 'end': 191, 'answer': 'Denver Broncos'}
{'score': 6.596079629161977e-07, 'start': 525, 'end': 539, 'answer': 'Dallas Cowboys'}
{'score': 1.1174681276315823e-05, 'start': 15, 'end': 93, 'answer': 'NFL Commissioner Roger Goodell stated that the league planned to make the 50th'}
{'score': 2.3440297012428113e-12, 'start': 564, 'end': 579, 'answer': 'Super Bowl XXXV'}
{'score': 0.00967063196003437, 'start': 68, 'end': 74, 'answer': 'Denver'}
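
Each retrieved context yields its own candidate answer, so a simple way to pick a final answer is to keep the candidate with the highest reader score. The sketch below reuses the five outputs printed above; choosing by maximum score is one possible strategy, not necessarily the only sensible one.

```python
# Reader outputs copied from the printed results, one per retrieved context.
results = [
    {'score': 0.999852180480957, 'start': 177, 'end': 191, 'answer': 'Denver Broncos'},
    {'score': 6.596079629161977e-07, 'start': 525, 'end': 539, 'answer': 'Dallas Cowboys'},
    {'score': 1.1174681276315823e-05, 'start': 15, 'end': 93,
     'answer': 'NFL Commissioner Roger Goodell stated that the league planned to make the 50th'},
    {'score': 2.3440297012428113e-12, 'start': 564, 'end': 579, 'answer': 'Super Bowl XXXV'},
    {'score': 0.00967063196003437, 'start': 68, 'end': 74, 'answer': 'Denver'},
]

# Keep the candidate the reader is most confident about.
best = max(results, key=lambda r: r['score'])
print(best['answer'])  # Denver Broncos
```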


### Abstractive pipeline

In [44]:
from transformers import pipeline

model_name = 'yjernite/bart_eli5'
nlp = pipeline(tokenizer=model_name, model=model_name, 
               task='text2text-generation') #seq2seq

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [52]:
for context in contexts['context']:
   print(nlp(
      f"question: {query}, context: {context}",
      num_beams=4,
      do_sample=True,
      temperature=1.2,
      max_length=64
   ), query, context)
   break

[{'generated_text': ' It was the AFC at Super Bowl 50. The AFC and NFC were two conferences. The AFC and NFC were the two conferences for the 2014 season. The AFC and NFC were the two conferences for the 2013 season. So the AFC and NFC were the two conferences for the 2014 season. So the AFC and NFC'}] Which NFL team represented the AFC at Super Bowl 50? Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numeral

### Closed book

In [53]:
from transformers import pipeline

model_name = 'EleutherAI/gpt-neo-125M'
nlp = pipeline(tokenizer=model_name, model=model_name, 
               task='text-generation')

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.01k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/526M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/560 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/357 [00:00<?, ?B/s]

Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [55]:
nlp(query, max_length=32)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Which NFL team represented the AFC at Super Bowl 50?\n\nThe NFL is a team of the NFL. The NFL is a team of the NFL. The'}]