# TAPAS





Google has recently open-sourced TAPAS (short for TAble PArSing), a model that lets you ask questions about tabular data in natural language.

TAPAS is a BERT-based approach to question answering over tables. Conventional approaches treat a natural language question as a semantic parsing task: the question is translated into a logical form (a precisely specified, executable representation of its meaning), which is then run against the table. TAPAS skips the logical form. It is trained with weak supervision from denotations (i.e. the answers to the questions, rather than annotated logical forms). It predicts the denotation by selecting table cells, optionally applies an aggregation operator to that selection, and is trained end to end.
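The "select cells, then optionally aggregate" idea can be illustrated with a toy sketch. This is not the TAPAS model: in TAPAS the cell selection and the operator are predicted by the network, whereas here both are passed in by hand. The function name `denotation` is hypothetical, and the operator names (NONE, COUNT, SUM, AVERAGE) follow the aggregation operators described for TAPAS.

```python
# Toy table: header row first, then data rows; every cell is a string.
table = [
    ["Player", "Team", "Runs"],
    ["Sachin Tendulkar", "India", "18426"],
    ["Virat Kohli", "India", "11867"],
    ["Ricky Ponting", "Australia", "13704"],
]


def denotation(table, coordinates, operator="NONE"):
    """Turn selected cells plus an aggregation operator into an answer.

    `coordinates` are (row, column) pairs that index the data rows,
    so row 0 is the first row after the header.
    """
    cells = [table[row + 1][col] for row, col in coordinates]
    if operator == "NONE":
        return cells
    if operator == "COUNT":
        return len(cells)
    values = [float(cell) for cell in cells]
    if operator == "SUM":
        return sum(values)
    if operator == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"unknown operator: {operator}")


# Hand-picked selection: the "Runs" cells of the two Indian players.
india_runs = [(0, 2), (1, 2)]
print(denotation(table, india_runs))           # ['18426', '11867']
print(denotation(table, india_runs, "COUNT"))  # 2
print(denotation(table, india_runs, "SUM"))    # 30293.0
```

With the operator set to NONE the answer is simply the selected cells, which is the only case exercised by the SQA checkpoint used later in this walkthrough.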

To read more, please refer to [this article](https://analyticsindiamag.com/guide-to-tapas-table-parsing-a-technique-to-retrieve-information-from-tabular-data-using-nlp/).

# Practical implementation of TAPAS

Here’s a demonstration of TAPAS applied to a table of international cricketers, with columns such as the team they belong to, career span, runs scored and number of innings played.

Clone the GitHub repository:

In [1]:
! git clone https://github.com/google-research/tapas.git

Cloning into 'tapas'...
remote: Enumerating objects: 582, done.
remote: Counting objects: 100% (140/140), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 582 (delta 42), reused 104 (delta 28), pack-reused 442
Receiving objects: 100% (582/582), 619.76 KiB | 7.47 MiB/s, done.
Resolving deltas: 100% (318/318), done.


Installation:

In [2]:
!pip install tapas-table-parsing

Collecting tapas-table-parsing
  Downloading https://files.pythonhosted.org/packages/b6/e6/c54cf34698048594962130555dddd6cd568fbfdb2e5af551f5ac9ad29e90/tapas_table_parsing-0.0.1.dev0-py3-none-any.whl (195kB)
     |████████████████████████████████| 204kB 4.4MB/s 
Collecting apache-beam[gcp]==2.20.0
  Downloading https://files.pythonhosted.org/packages/a7/0a/68469f3afd084ab085a2a3f000dbf010c7a94566242051e760c9702bb66c/apache_beam-2.20.0-cp37-cp37m-manylinux1_x86_64.whl (3.5MB)
     |████████████████████████████████| 3.5MB 35.5MB/s 
Collecting tensorflow~=2.2.0
  Downloading https://files.pythonhosted.org/packages/95/73/855c34dc46c7a28c3964a20592eae750daa4fd1aa07d93d14549d1432f31/tensorflow-2.2.3-cp37-cp37m-manylinux2010_x86_64.whl (516.4MB)
     |████████████████████████████████| 516.4MB 26kB/s 
Collecting pandas~=1.0.0
  Downloading https://files.pythonhosted.org/packages/af/f3/683bf2547a3eaeec15b39cef86f61e921b3b187f250fcd2b5c5fb438636

Restart the runtime so that the newly installed package versions are picked up.

Download the pre-trained checkpoint from Google Storage. For the sake of speed, a base-sized model trained on SQA (Sequential Question Answering) is used here. However, the best results in the paper were obtained with a larger model that has 24 layers instead of 12.

In [None]:
!gsutil cp gs://tapas_models/2020_04_21/tapas_sqa_base.zip . && unzip tapas_sqa_base.zip

Copying gs://tapas_models/2020_04_21/tapas_sqa_base.zip...
| [1 files][  1.0 GiB/  1.0 GiB]   51.8 MiB/s                                   
Operation completed over 1 objects/1.0 GiB.                                      
Archive:  tapas_sqa_base.zip
   creating: tapas_sqa_base/
  inflating: tapas_sqa_base/model.ckpt.data-00000-of-00001  
  inflating: tapas_sqa_base/model.ckpt.index  
  inflating: tapas_sqa_base/README.txt  
  inflating: tapas_sqa_base/vocab.txt  
  inflating: tapas_sqa_base/bert_config.json  
  inflating: tapas_sqa_base/model.ckpt.meta  
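Since the later steps read several of these files by name, a quick check that the archive unpacked completely can save a confusing failure further down. The following is a minimal sketch; the helper name `missing_checkpoint_files` is hypothetical, and the file list is copied from the unzip listing above.

```python
import os

# File names taken from the unzip listing above (README.txt is not needed).
EXPECTED_FILES = [
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.txt",
    "bert_config.json",
]


def missing_checkpoint_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# An empty list means every expected file is present.
print(missing_checkpoint_files("tapas_sqa_base"))
```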


Import the necessary modules

In [None]:
import tensorflow.compat.v1 as tf
import os 
import shutil
import csv
import pandas as pd
import IPython
tf.get_logger().setLevel('ERROR')
from tapas.utils import tf_example_utils
from tapas.protos import interaction_pb2
from tapas.utils import number_annotation_utils
from tapas.scripts import prediction_utils 

Copy the downloaded checkpoint into the output directory, so that the prediction script finds it as the latest checkpoint:

In [None]:
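# Create the directories the SQA task uses under the output directory.
os.makedirs('results/sqa/tf_examples', exist_ok=True)
os.makedirs('results/sqa/model', exist_ok=True)
# Write a checkpoint index file that points at the copied checkpoint.
with open('results/sqa/model/checkpoint', 'w') as f:
  f.write('model_checkpoint_path: "model.ckpt-0"')
# Copy the pre-trained weights under the name referenced in the index file.
for suffix in ['.data-00000-of-00001', '.index', '.meta']:
  shutil.copyfile(f'tapas_sqa_base/model.ckpt{suffix}',
                  f'results/sqa/model/model.ckpt-0{suffix}')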

Load the tabular dataset

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/bhattbhavesh91/tapas-demo/master/data.csv')

All columns must hold string values before the table is passed as input, so convert every column to the string type.

In [None]:
df = df.astype(str)

Display the converted data frame to inspect it:

In [None]:
df

Convert the data frame into a list of lists. The first element of the list of lists should be the column names of df.

In [None]:
list_of_list = [[]]
list_of_list[0] = list(df.columns)
list_of_list.extend(df.values.tolist()) 
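To make the target shape concrete, here is a small sketch that builds the same kind of structure (header row first, every cell a string) using only the standard library. The helper name `csv_to_list_of_lists` is hypothetical, and the two data rows are copied from the cricket table used in this walkthrough. Because `csv.reader` always yields strings, no separate type conversion is needed in this variant.

```python
import csv
import io

# Header plus two rows from the cricket table, trimmed to four columns.
csv_text = """Pos,Player,Team,Runs
1,Sachin Tendulkar,India,18426
6,Virat Kohli,India,11867
"""


def csv_to_list_of_lists(text):
    """Parse CSV text into rows of strings, header row first."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


rows = csv_to_list_of_lists(csv_text)
print(rows[0])  # ['Pos', 'Player', 'Team', 'Runs']
print(rows[1])  # ['1', 'Sachin Tendulkar', 'India', '18426']
```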

Note: Since TAPAS is essentially a BERT model, it has the same hardware requirements: training the large model with a sequence length of 512 requires a TPU. You can use the `max_seq_length` option to create shorter sequences. This reduces accuracy but makes it possible to train the model on GPUs.
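To get a feel for how quickly a table uses up the sequence budget, the sketch below computes a rough lower bound on the input length. It is only an estimate under stated assumptions: it counts whitespace-separated words in the question and in every cell and adds two for the `[CLS]` and `[SEP]` markers, while the real WordPiece tokenizer usually splits words further and so produces more tokens. The function names are hypothetical.

```python
def rough_token_count(table, query):
    """Lower-bound estimate of the flattened input length.

    Counts whitespace-separated words in the query and in every cell
    (header included), plus 2 for the [CLS] and [SEP] markers.
    """
    words = len(query.split())
    for row in table:
        for cell in row:
            words += len(cell.split())
    return words + 2


def fits_budget(table, query, max_seq_length=512):
    """True if the rough estimate does not exceed max_seq_length."""
    return rough_token_count(table, query) <= max_seq_length


small_table = [["Player", "Team"], ["Sachin Tendulkar", "India"]]
question = "which team did Sachin Tendulkar play for?"
print(rough_token_count(small_table, question))  # 14
print(fits_budget(small_table, question))        # True
```

If the estimate is already close to the limit, the true token count will very likely exceed it, which is a hint to trim rows or columns before conversion.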

In [None]:
max_seq_length = 512
vocab_file = "tapas_sqa_base/vocab.txt"
config = tf_example_utils.ClassifierConversionConfig(
    vocab_file=vocab_file,
    max_seq_length=max_seq_length,
    max_column_id=max_seq_length,
    max_row_id=max_seq_length,
    strip_column_names=False,
    add_aggregation_candidates=False,
)
converter = tf_example_utils.ToClassifierTensorflowExample(config)

def convert_interactions_to_examples(tables_and_queries):
  """Calls Tapas converter to convert interaction to example."""
  for idx, (table, queries) in enumerate(tables_and_queries):
    interaction = interaction_pb2.Interaction()
    for position, query in enumerate(queries):
      question = interaction.questions.add()
      question.original_text = query
      question.id = f"{idx}-0_{position}"
    for header in table[0]:
      interaction.table.columns.add().text = header
    for line in table[1:]:
      row = interaction.table.rows.add()
      for cell in line:
        row.cells.add().text = cell
    number_annotation_utils.add_numeric_values(interaction)
    for i in range(len(interaction.questions)):
      try:
        yield converter.convert(interaction, i)
      except ValueError as e:
        print(f"Can't convert interaction: {interaction.id} error: {e}")
        
def write_tf_example(filename, examples):
  with tf.io.TFRecordWriter(filename) as writer:
    for example in examples:
      writer.write(example.SerializeToString())

def predict(table_data, queries):
  table = table_data
  examples = convert_interactions_to_examples([(table, queries)])
  write_tf_example("results/sqa/tf_examples/test.tfrecord", examples)
  write_tf_example("results/sqa/tf_examples/random-split-1-dev.tfrecord", [])
  
  !python tapas/tapas/run_task_main.py \
    --task="SQA" \
    --output_dir="results" \
    --noloop_predict \
    --test_batch_size={len(queries)} \
    --tapas_verbosity="ERROR" \
    --compression_type= \
    --init_checkpoint="tapas_sqa_base/model.ckpt" \
    --bert_config_file="tapas_sqa_base/bert_config.json" \
    --mode="predict" 2> error


  results_path = "results/sqa/model/test_sequence.tsv"
  all_coordinates = []
  df = pd.DataFrame(table[1:], columns=table[0])
  display(IPython.display.HTML(df.to_html(index=False)))
  print()
  with open(results_path) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    for row in reader:
      coordinates = prediction_utils.parse_coordinates(row["answer_coordinates"])
      all_coordinates.append(coordinates)
      # Coordinates index the data rows, so add 1 to skip the header row.
      answers = ', '.join([table[r + 1][c] for r, c in coordinates])
      position = int(row['position'])
      print(">", queries[position])
      print(answers)
  return all_coordinates
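The last part of `predict()` maps predicted cell coordinates back to cell text. The sketch below reproduces that step in isolation with the standard library only. The TSV content is made up for illustration (a real prediction file has more columns), and the assumed format of `answer_coordinates`, a stringified list of `'(row, column)'` strings, is an assumption about what `prediction_utils.parse_coordinates` consumes. The local `parse_coordinates` is a stand-in, not the library function.

```python
import ast
import csv
import io

table = [
    ["Player", "Team"],
    ["Sachin Tendulkar", "India"],
    ["Kumar Sangakkara", "Sri Lanka"],
]

# Hypothetical prediction file content: one answer row for question 0.
tsv_text = "position\tanswer_coordinates\n0\t['(0, 0)', '(1, 0)']\n"


def parse_coordinates(raw):
    """Parse "['(0, 0)', '(1, 0)']" into a list of (row, column) tuples."""
    return [ast.literal_eval(item) for item in ast.literal_eval(raw)]


def answers_from_tsv(text, table):
    """Map each question position to the text of its predicted cells."""
    results = {}
    for record in csv.DictReader(io.StringIO(text), delimiter="\t"):
        coordinates = parse_coordinates(record["answer_coordinates"])
        # Coordinates index the data rows, so add 1 to skip the header.
        results[int(record["position"])] = [
            table[r + 1][c] for r, c in coordinates
        ]
    return results


answers = answers_from_tsv(tsv_text, table)
print(answers)  # {0: ['Sachin Tendulkar', 'Kumar Sangakkara']}
```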

In [20]:
!pwd

/content


In [17]:
%cd /content/

/content


In [19]:
table = list_of_list
queries = ["what were the players names?",
      "of these, which team did Sachin Tendulkar play for?",
      "what is his highest score?",
      "how many runs has Virat Kohli scored?"]
examples = convert_interactions_to_examples([(table, queries)])
write_tf_example("results/sqa/tf_examples/test.tfrecord", examples)
write_tf_example("results/sqa/tf_examples/random-split-1-dev.tfrecord", [])

!python tapas/tapas/run_task_main.py \
  --task="SQA" \
  --output_dir="results" \
  --noloop_predict \
  --test_batch_size={len(queries)} \
  --tapas_verbosity="ERROR" \
  --compression_type= \
  --init_checkpoint="tapas_sqa_base/model.ckpt" \
  --bert_config_file="tapas_sqa_base/bert_config.json" \
  --mode="predict"

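Traceback (most recent call last):
  File "tapas/tapas/run_task_main.py", line 32, in <module>
    from tapas.retrieval import e2e_eval_utils
ModuleNotFoundError: No module named 'tapas.retrieval'

This error suggests a version mismatch: the script comes from the freshly cloned repository, while the importable `tapas` package is the older PyPI release (0.0.1.dev0 in the pip output above), which apparently lacks the `tapas.retrieval` module. Installing the package from the cloned source instead (for example with `pip install ./tapas`) and restarting the runtime should keep the two in sync.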


Make predictions

In [13]:
result = predict(list_of_list, ["what were the players names?",
      "of these, which team did Sachin Tendulkar play for?",
      "what is his highest score?",
      "how many runs has Virat Kohli scored?"]) 

| Pos | Player | Team | Span | Innings | Runs | Highest Score | Average | Strike Rate |
|---|---|---|---|---|---|---|---|---|
| 1 | Sachin Tendulkar | India | 1989-2012 | 452 | 18426 | 200 | 44.83 | 86.23 |
| 2 | Kumar Sangakkara | Sri Lanka | 2000-2015 | 380 | 14234 | 169 | 41.98 | 78.86 |
| 3 | Ricky Ponting | Australia | 1995-2012 | 365 | 13704 | 164 | 42.03 | 80.39 |
| 4 | Sanath Jayasuriya | Sri Lanka | 1989-2011 | 433 | 13430 | 189 | 32.36 | 91.2 |
| 5 | Mahela Jayawardene | Sri Lanka | 1998-2015 | 418 | 12650 | 144 | 33.37 | 78.96 |
| 6 | Virat Kohli | India | 2008-2020 | 236 | 11867 | 183 | 59.85 | 93.39 |
| 7 | Inzamam-ul-Haq | Pakistan | 1991-2007 | 350 | 11739 | 137 | 39.52 | 74.24 |
| 8 | Jacques Kallis | South Africa | 1996-2014 | 314 | 11579 | 139 | 44.36 | 72.89 |
| 9 | Saurav Ganguly | India | 1992-2007 | 300 | 11363 | 183 | 41.02 | 73.7 |
| 10 | Rahul Dravid | India | 1996-2011 | 318 | 10889 | 153 | 39.16 | 71.24 |





The `predict()` function takes two arguments: the table as a list of lists (header row first) and the list of questions to be answered.
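Because the table must be a header row followed by equally long rows of strings, it can help to validate the input before handing it over. The following is a minimal sketch of such a check; the helper name `check_table` is hypothetical and not part of the TAPAS code base.

```python
def check_table(table):
    """Raise ValueError unless table is a header plus equal-length string rows."""
    if not table or not table[0]:
        raise ValueError("table needs a non-empty header row")
    width = len(table[0])
    for index, row in enumerate(table):
        if len(row) != width:
            raise ValueError(
                f"row {index} has {len(row)} cells, expected {width}"
            )
        for cell in row:
            if not isinstance(cell, str):
                raise ValueError(
                    f"row {index} contains a non-string cell: {cell!r}"
                )
    return True


good_table = [["Player", "Runs"], ["Virat Kohli", "11867"]]
print(check_table(good_table))  # True

# A numeric cell is rejected, which mirrors the need for astype(str).
try:
    check_table([["Player", "Runs"], ["Virat Kohli", 11867]])
except ValueError as error:
    print(error)
```

In the notebook, calling `check_table(list_of_list)` just before `predict(list_of_list, queries)` would surface a malformed table early.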