Queries regarding ChartQA dataset #8

Open
shabbie opened this issue Dec 22, 2022 · 3 comments

Comments

@shabbie

shabbie commented Dec 22, 2022

We have a few questions and would appreciate your help.

  1. In Section 4.1, you mention that gold data tables are not available, so you set up an extraction mechanism to recover the underlying tables. We downloaded the dataset from (https://drive.google.com/file/d/17-aqtiq_KJ16PIGOp30W0y6OJNax6SVT/view), which includes table annotations. Are these extracted tables the ones referred to as 'Gold Data Tables' in the paper?

  2. If there is a separate set of 'Gold Data Tables' not available from the link mentioned above, can you also share those for reproducibility purposes?

  3. If the extracted tables are the same as the gold data tables, what do the results in Table 5 of the paper imply? How can TaPas predict answers if the tables themselves are not provided?

@AhmedMasryKU
Collaborator

Hi @shabbie
Gold Data Tables refer to the ground-truth data tables that we crawled together with the chart images from the different sources. These gold tables are provided in the dataset (the "tables" folder) in this repo. In our experiments, we considered two scenarios:

  1. We used the gold tables as input to our model. However, the main issue with this setup is that it's not end-to-end: in general, chart images won't come with their data tables. That's why we also considered the second setup.
  2. We automatically extracted the data tables from the chart images using the ChartOCR model, and used these extracted data tables as inputs to our models.
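
The two setups differ only in where the table fed to the model comes from. A minimal sketch of that choice, assuming each chart has a CSV named after its image file; `extracted_tables` is a hypothetical folder name for saved ChartOCR output, and `table_path` is an illustrative helper, not a function from this repo:

```python
from pathlib import Path


def table_path(dataset_root, image_name, setup="gold"):
    """Return the CSV path to use as model input for one chart.

    setup="gold"      -> crawled table from the "tables" folder
    setup="extracted" -> table produced by an extraction model
                         ("extracted_tables" is an assumed folder name)
    """
    folders = {"gold": "tables", "extracted": "extracted_tables"}
    if setup not in folders:
        raise ValueError(f"unknown setup: {setup!r}")
    stem = Path(image_name).stem  # "chart_1.png" -> "chart_1"
    return Path(dataset_root) / folders[setup] / f"{stem}.csv"
```

Everything downstream (question answering over the table) would stay the same in both cases; only the path returned here changes.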

Let me know if you have additional questions.

@shabbie
Author

shabbie commented Dec 24, 2022

Thanks for the reply @AhmedMasryKU.

The Gold Data Tables in the tables folder have many extraction issues: all or most of the numerical values are zero, column names are wrong or incomplete, and floating-point values are not detected correctly.

If these are also extracted tables (considering the noise present in the data), what are the gold or noise-free data tables?
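
One way to quantify the first issue above is to scan the CSV files and flag tables whose numeric cells are mostly zero. A minimal sketch, assuming the tables are plain comma-separated UTF-8 files in a single folder; the function names and the 0.5 threshold are illustrative choices, not part of the ChartQA repo:

```python
import csv
from pathlib import Path


def zero_fraction(rows):
    """Fraction of numeric cells that are zero, or None if no cell is numeric."""
    numeric = []
    for row in rows:
        for cell in row:
            try:
                # Drop thousands separators such as "1,200" before parsing.
                numeric.append(float(cell.strip().replace(",", "")))
            except ValueError:
                continue  # header, label, empty cell, or value with a unit
    if not numeric:
        return None
    return sum(1 for v in numeric if v == 0.0) / len(numeric)


def flag_noisy_tables(tables_dir, threshold=0.5):
    """List (file name, zero fraction) for tables at or above the threshold."""
    flagged = []
    for path in sorted(Path(tables_dir).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        frac = zero_fraction(rows)
        if frac is not None and frac >= threshold:
            flagged.append((path.name, frac))
    return flagged
```

Note that label columns holding years or ranks also count as numeric cells here, so the fraction is only a rough signal; wrong or incomplete column names would need a separate check against the chart image.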

@AhmedMasryKU
Collaborator

Hi @shabbie,
Yes, the tables for the Pew chart images are not 100% clean due to issues when crawling the data. However, the OWID, OECD, and Statista tables (the majority of the dataset) are very clean. For reproducibility, the "Gold Data Table" mentioned in the paper refers to the CSV files in the tables folder in the dataset.

Let me know if you have any questions.
