<a href="https://colab.research.google.com/github/rii92/Datmin/blob/main/013_practical_example_tableqa_part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Let's compile everything that we've learned so far

Let's look at another machine learning pipeline called "table question answering": a question answering system that answers questions about tables. The best way to understand it is to read the code below and run it:


In [1]:
%pip install transformers
%pip install pandas



In [2]:
from transformers import pipeline

data = [
    {
        "House": "House 1",
        "City": "Jakarta",
        "Price (in million IDR)": "5000",
        "Rooms": "2 bed, 2 bath",
        "Facilities": "Pool, Garage, Gym, Garden",
        "Furnished": "Yes",
    },
    {
        "House": "House 2",
        "City": "Surabaya",
        "Price (in million IDR)": "2000",
        "Rooms": "3 bed, 2 bath",
        "Facilities": "Garage, Library",
        "Furnished": "No",
    },
    {
        "House": "House 3",
        "City": "Malang",
        "Price (in million IDR)": "1500",
        "Rooms": "2 bed, 1 bath",
        "Facilities": "Pool, Gym, Sauna",
        "Furnished": "No",
    },
    {
        "House": "House 4",
        "City": "Jakarta",
        "Price (in million IDR)": "5300",
        "Rooms": "1 bed, 1 bath",
        "Facilities": "Gym, Rooftop",
        "Furnished": "Yes",
    },
    {
        "House": "House 5",
        "City": "Surabaya",
        "Price (in million IDR)": "2200",
        "Rooms": "3 bed, 2 bath",
        "Facilities": "Pool, Garage, Gym, Library",
        "Furnished": "No",
    },
    {
        "House": "House 6",
        "City": "Malang",
        "Price (in million IDR)": "1600",
        "Rooms": "2 bed, 2 bath",
        "Facilities": "Garage, Gym, Garden",
        "Furnished": "Yes",
    },
    {
        "House": "House 7",
        "City": "Jakarta",
        "Price (in million IDR)": "4900",
        "Rooms": "4 bed, 3 bath",
        "Facilities": "Pool, Garage, Sauna",
        "Furnished": "Yes",
    },
    {
        "House": "House 8",
        "City": "Surabaya",
        "Price (in million IDR)": "2100",
        "Rooms": "3 bed, 3 bath",
        "Facilities": "Gym, Garden",
        "Furnished": "No",
    },
    {
        "House": "House 9",
        "City": "Malang",
        "Price (in million IDR)": "1400",
        "Rooms": "4 bed, 2 bath",
        "Facilities": "Garage, Rooftop",
        "Furnished": "Yes",
    },
    {
        "House": "House 10",
        "City": "Jakarta",
        "Price (in million IDR)": "5100",
        "Rooms": "3 bed, 1 bath",
        "Facilities": "Pool, Garden, Sauna",
        "Furnished": "No",
    },
]

# Set up the pipeline:

table_qa = pipeline("table-question-answering", model="google/tapas-large-finetuned-wtq")


# Ask a question:

query = "Which house is the most expensive?"
table_qa({"table": data, "query": query})

{'answer': 'House 4',
 'coordinates': [(3, 0)],
 'cells': ['House 4'],
 'aggregator': 'NONE'}

As you can see, the table question answering ("TableQA") pipeline lets us ask a machine learning model a question about a set of tabular data. The table above contains dummy data about houses in several cities.

The pipeline returns different kinds of output depending on the question. Try the questions below:

- What is the average price of houses in Jakarta?
- I want to buy house 7 and house 10, what's the total price?
- How many houses in Surabaya that has gym as one of the facility?

In [3]:
query = "What is the average price of houses in Jakarta?"
table_qa({"table": data, "query": query})

{'answer': 'AVERAGE > 5000, 5300, 4900, 5100',
 'coordinates': [(0, 2), (3, 2), (6, 2), (9, 2)],
 'cells': ['5000', '5300', '4900', '5100'],
 'aggregator': 'AVERAGE'}

# Quick Note: Tuple

A tuple is a data structure similar to a list, but it is immutable: once a tuple is created, you cannot change any of its values. You'll see tuples in the `coordinates` output of the code above.

The syntax of a tuple is similar to that of a list, but it uses round brackets (parentheses) instead of square brackets:

In [4]:
print((5, 0))
print((5, 0)[0])
(5, 0)[0] = 3 # Error

(5, 0)
5


TypeError: 'tuple' object does not support item assignment
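
Since each entry in `coordinates` is a `(row, column)` tuple, it can be unpacked into two names and used to look a cell up in the original table. The following is a minimal sketch, assuming the column index follows the order of the dictionary keys in each row (which is consistent with the outputs shown above). The names `rows`, `column_names`, and `lookup_cells` are illustrative and not part of the `transformers` API.

```python
# A trimmed copy of the table: only two columns, to keep the sketch short.
rows = [
    {"House": "House 1", "City": "Jakarta"},
    {"House": "House 2", "City": "Surabaya"},
    {"House": "House 3", "City": "Malang"},
    {"House": "House 4", "City": "Jakarta"},
]

# Assumption: column index N refers to the N-th key of each row dictionary.
column_names = list(rows[0].keys())


def lookup_cells(coordinates):
    """Return the cell values addressed by a list of (row, column) tuples."""
    cells = []
    for row_index, column_index in coordinates:  # tuple unpacking
        cells.append(rows[row_index][column_names[column_index]])
    return cells


print(lookup_cells([(3, 0)]))          # ['House 4']
print(lookup_cells([(0, 1), (2, 1)]))  # ['Jakarta', 'Malang']
```

The first call uses the same coordinate, `(3, 0)`, that the pipeline returned for the "most expensive house" question, and it resolves to the same cell.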

# Different questions can lead to different outputs

We're using Google's `TAPAS` model, fine-tuned on the WTQ (WikiTableQuestions) dataset. We won't go deeper into what each of those means because it's outside the scope of this lesson, but have a look at the image below:

![Tapas for QnA](https://storage.googleapis.com/rg-ai-bootcamp/python-prep/tapas-for-qa-min.png)

> This case is also called weak supervision, since the model itself must learn the appropriate aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer to the question as supervision.

The image and quote above come from https://huggingface.co/docs/transformers/v4.35.0/en/model_doc/tapas#transformers.TFTapasForSequenceClassification; feel free to read it if you're interested (warning: it's quite technical). To summarize, TAPAS fine-tuned on WTQ can return four different aggregators:

- `NONE`: No aggregation is needed; the answer is read directly from the selected cell(s).
- `SUM`: We should sum the values given in the output.
- `COUNT`: We should count the number of values given in the output.
- `AVERAGE`: We should average the values given in the output.

# Challenge: Start with `count`

Let's start by implementing the `count` aggregation. One question that is answered with the `COUNT` aggregator is: "How many houses in Surabaya that has gym as one of the facility?". The output looks like this:

```python
{'answer': 'COUNT > House 5, House 6, House 8',
 'coordinates': [(4, 0), (5, 0), (7, 0)],
 'cells': ['House 5', 'House 6', 'House 8'],
 'aggregator': 'COUNT'}
```

We'll create two functions for now:

- `aggregate_data`: Takes the dictionary above as input, chooses which aggregator to use based on the `aggregator` key, and returns the aggregated value.
- `count_aggregator`: Takes the dictionary above as input, counts the number of items under the `cells` key, and returns that count.

Keep in mind that:

- For the `NONE` aggregator, no aggregation is needed; just return the value of the `answer` key as it is.
- For the `COUNT` aggregator, call `count_aggregator` inside `aggregate_data` and return its result.

In [9]:
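def count_aggregator(data):
    # The count is simply the number of cells the model selected.
    return len(data['cells'])


def aggregate_data(data):
    # Dispatch on the aggregator name reported by the pipeline.
    if data['aggregator'] == 'COUNT':
        return count_aggregator(data)
    else:
        # 'NONE' (and anything not handled yet): return the answer unchanged.
        return data['answer']


aggregate_data({'answer': 'COUNT > House 5, House 6, House 8',
 'coordinates': [(4, 0), (5, 0), (7, 0)],
 'cells': ['House 5', 'House 6', 'House 8'],
 'aggregator': 'COUNT'})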

3

When you are done with the above challenge, then:

1. Run the code block by pressing the play button.

In [10]:
!pip install rggrader

from rggrader import submit

# @title #### Student Identity
student_id = "REAFCDNE" # @param {type:"string"}
name = "Riofebri Prasetia" # @param {type:"string"}

# Submit Method
assignment_id = "013_practical_example_tableqa_part_1"
question_id = "01_count_aggregator"

data_count = {'answer': 'COUNT > House 1, House 2',
 'coordinates': [( 4, 0), (5, 0), (7, 0)],
 'cells': ['House 1', 'House 2'],
 'aggregator': 'COUNT'}

submit(student_id, name, assignment_id, str(count_aggregator(data_count)), question_id)


'Assignment successfully submitted'

# Challenge: Create `sum` and `average` aggregator

For the `sum` and `average` aggregators, use the `cells` key again: convert each string to an `int`, then sum or average the values. Feel free to use a `for` loop or the `map` function.

In [13]:
def count_aggregator(data):
    return len(data['cells'])

def aggregate_data(data):
    # Map each aggregator name to the function that handles it.
    mapper = {
        'COUNT': count_aggregator,
        'SUM': sum_aggregator,
        'AVERAGE': average_aggregator,
    }
    if data['aggregator'] in mapper:
        return mapper[data['aggregator']](data)
    # 'NONE' (or anything unrecognized): return the answer unchanged.
    return data['answer']

    # Alternative 1: structural pattern matching (requires Python 3.10+).
    # match data['aggregator']:
    #     case 'COUNT':
    #         return count_aggregator(data)
    #     case 'SUM':
    #         return sum_aggregator(data)
    #     case 'AVERAGE':
    #         return average_aggregator(data)
    #     case _:
    #         return data['answer']

    # Alternative 2: a chain of plain `if` statements.
    # if data['aggregator'] == 'COUNT':
    #     return count_aggregator(data)
    # if data['aggregator'] == 'SUM':
    #     return sum_aggregator(data)
    # if data['aggregator'] == 'AVERAGE':
    #     return average_aggregator(data)
    # return data['answer']

def sum_aggregator(data):
    return sum(map(int, data['cells']))

def average_aggregator(data):
    return sum_aggregator(data) / count_aggregator(data)

print(aggregate_data({'answer': 'SUM > 4900, 5100',
 'coordinates': [(6, 2), (9, 2)],
 'cells': ['4900', '5100'],
 'aggregator': 'SUM'}))

print(aggregate_data({'answer': 'AVERAGE > 5000, 5300, 4900, 5100',
 'coordinates': [(0, 2), (3, 2), (6, 2), (9, 2)],
 'cells': ['5000', '5300', '4900', '5100'],
 'aggregator': 'AVERAGE'}))

10000
5075.0
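
The `map`-based version above assumes every cell is a whole number and that at least one cell was selected. As a hedged alternative, the sketch below uses an explicit `for` loop, parses cells with `float` so that decimal values also work, and returns `None` for an empty selection instead of raising `ZeroDivisionError`. The names `sum_cells` and `average_cells` are hypothetical, chosen so that they do not replace the graded functions, and returning `None` for an empty selection is an assumption rather than anything the pipeline prescribes.

```python
def sum_cells(result):
    """Add up the selected cells using an explicit for loop."""
    total = 0.0
    for cell in result["cells"]:
        total += float(cell)  # float() accepts "4900" as well as "4900.5"
    return total


def average_cells(result):
    """Average the selected cells, or return None when nothing was selected."""
    count = len(result["cells"])
    if count == 0:
        return None  # assumption: an empty selection has no meaningful average
    return sum_cells(result) / count


sample = {"cells": ["5000", "5300", "4900", "5100"], "aggregator": "AVERAGE"}
print(sum_cells(sample))      # 20300.0
print(average_cells(sample))  # 5075.0
print(average_cells({"cells": [], "aggregator": "AVERAGE"}))  # None
```

Note that this variant always returns a `float` for sums (`20300.0` rather than `20300`), which may or may not be what a grader expects.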


When you are done with the above challenge, then:

1. Run the code block by pressing the play button.

In [14]:
# Submit Method
data_sum = {'answer': 'SUM > 4900, 5300',
 'coordinates': [(6, 2), (9, 2)],
 'cells': ['4900', '5300'],
 'aggregator': 'SUM'}

assignment_id = "013_practical_example_tableqa_part_1"
question_id = "02_sum_aggregator"
submit(student_id, name, assignment_id, str(sum_aggregator(data_sum)), question_id)

#####################################################################################
# Submit Method
data_avg = {'answer': 'AVERAGE > 5000, 2000, 1500, 5300',
 'coordinates': [(0, 2), (3, 2), (6, 2), (9, 2)],
 'cells': ['5000', '2000', '1500', '5300'],
 'aggregator': 'AVERAGE'}

assignment_id = "013_practical_example_tableqa_part_1"
question_id = "03_avg_aggregator"
submit(student_id, name, assignment_id, str(average_aggregator(data_avg)), question_id)

'Assignment successfully submitted'

# Challenge: Connecting with the pipeline

We've built all the functionality we need; now we just need to connect it with the pipeline. See if you can connect the `aggregate_data` function from before with the `table_qa` pipeline!

In [15]:
# Let's try to connect `table_qa` with `aggregate_data`:

def answer_table_question(question):
    raw_aggregate = table_qa({"table": data, "query": question})
    return aggregate_data(raw_aggregate)


print(answer_table_question("What is the average price of the houses in Jakarta?")) # The answer should be 5075
print(answer_table_question("How many houses are there in Malang?")) # The answer should be 3
print(answer_table_question("How many houses in Jakarta has a pool as a facility?")) # The answer should be 3

5075.0
3
3
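
Loading the TAPAS checkpoint downloads more than a gigabyte, so it can be handy to check the wiring between the pipeline and the aggregation step without the real model. One way to do that, sketched below, is to pass the question-answering callable in as a parameter and substitute a stand-in while testing. Everything here is illustrative: `make_answerer`, `fake_table_qa`, and the simplified `aggregate` helper are hypothetical names, and the stand-in simply replays the `AVERAGE` output shown earlier rather than running any model.

```python
def aggregate(result):
    """Simplified stand-in for aggregate_data, kept local so the sketch runs on its own."""
    aggregator = result["aggregator"]
    if aggregator == "COUNT":
        return len(result["cells"])
    if aggregator in ("SUM", "AVERAGE"):
        total = sum(int(cell) for cell in result["cells"])
        return total if aggregator == "SUM" else total / len(result["cells"])
    return result["answer"]  # 'NONE': return the answer unchanged


def make_answerer(qa_callable, table):
    """Build a question -> answer function around any pipeline-like callable."""
    def answer(question):
        raw_result = qa_callable({"table": table, "query": question})
        return aggregate(raw_result)
    return answer


def fake_table_qa(inputs):
    """Hypothetical stand-in: replays a fixed AVERAGE result and ignores the question."""
    return {
        "answer": "AVERAGE > 5000, 5300, 4900, 5100",
        "coordinates": [(0, 2), (3, 2), (6, 2), (9, 2)],
        "cells": ["5000", "5300", "4900", "5100"],
        "aggregator": "AVERAGE",
    }


answer_with_fake = make_answerer(fake_table_qa, table=[])
print(answer_with_fake("What is the average price of houses in Jakarta?"))  # 5075.0
```

With the real model loaded, calling `make_answerer(table_qa, data)` should behave like `answer_table_question`, assuming `aggregate` and `aggregate_data` implement the same rules.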


When you are done with the above challenge, then:

1. Run the code block by pressing the play button.

In [16]:
# Submit Method
assignment_id = "013_practical_example_tableqa_part_1"
question_id = "04_answer_table"
submit(student_id, name, assignment_id, str(answer_table_question("What is the average price of the houses in Surabaya?")), question_id)

'Assignment successfully submitted'