Improved CSV export - feedback welcome #900

Open
PhilippWendler opened this issue Feb 10, 2023 · 4 comments

@PhilippWendler
Member

The CSV tables exported by table-generator have a layout that is inspired by the HTML tables, but this sometimes makes them hard to use programmatically in other tools. We should improve this.

Open points:

  • Right now the header has 3 lines, which several people have reported as unexpected. We should reduce it to one line. Open questions:
    • What should the content of the header cells be? If we put only the column name (e.g., status), it is no longer unique. Some concatenation of run set name, timestamp, and column name? Having stuff like the timestamp there would be highly inconvenient for those where it is not needed. Maybe keep only the column name as long as it is unique?
    • Should it have a # in front of the line?
  • What should the separator be? Right now we call it CSV but it is tab-separated. Tab has the advantage that it cannot occur in our data except in extremely rare cases, whereas comma appears regularly in some columns. This makes tab-separated files easier to handle with tools like cut. Should we change the name and extension to TSV instead? Will people understand that abbreviation?
  • What should we do with the task-id columns at the left? Right now, we show only those columns where not all values are equal. We should at least change this to show all columns for which data exists. But should we always add all columns, e.g., have expectedVerdict even if it is always empty?
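One of the header options above ("keep only the column name as long as it is unique") can be sketched as follows. This is only an illustration, not table-generator code: the function name `flatten_header`, the `(run_set, column_name)` input pairs, and the `_` separator are assumptions made for the example.

```python
from collections import Counter


def flatten_header(columns):
    """Build a one-line header from (run_set, column_name) pairs.

    A column keeps its plain name if that name occurs only once;
    otherwise it is prefixed with the run set name (hypothetical scheme).
    """
    counts = Counter(name for _, name in columns)
    return [
        name if counts[name] == 1 else f"{run_set}_{name}"
        for run_set, name in columns
    ]


# Hypothetical run sets and columns, for illustration only.
header = flatten_header([
    ("toolA", "status"),
    ("toolA", "cputime"),
    ("toolB", "status"),
    ("toolB", "memory"),
])
print(header)  # ['toolA_status', 'cputime', 'toolB_status', 'memory']
```

Note that the prefix alone may still not be unique if the same run set appears twice (e.g., with different timestamps), which is exactly the case where a longer name would be needed.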

In general, there is a trade-off between tables that always have exactly the same format (all task-id columns, header content with full information), even if redundant or not applicable, and tables that are tailored to the specific use case (keeping column names short and easy to handle when they are unique anyway, hiding the expected verdict if empty, etc.). The latter can be much more convenient in many use cases, but is more difficult to use when data from many different scenarios are combined.

Maybe we also need to add some options to the table definitions to make it possible for users to choose among them (e.g., which columns should be shown for the task id).
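To make the three task-id choices concrete, here is a minimal sketch of the selection policies discussed above: only columns whose values differ (the current behavior as described), all columns for which data exists, or always all columns. The function name, the policy names, and the `name` and `property` columns are assumptions for illustration; only `expectedVerdict` is taken from the discussion.

```python
def select_task_id_columns(rows, policy):
    """Return the task-id column names to export under a given policy.

    rows: list of dicts mapping column name -> value ('' if no data).
    policy: 'differing', 'non-empty', or 'all' (hypothetical names).
    """
    names = list(rows[0]) if rows else []
    if policy == "all":
        return names
    if policy == "non-empty":
        # Keep every column that has data in at least one row.
        return [n for n in names if any(r[n] for r in rows)]
    if policy == "differing":
        # Keep only columns where not all values are equal.
        return [n for n in names if len({r[n] for r in rows}) > 1]
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical task ids, for illustration only.
rows = [
    {"name": "a.yml", "property": "unreach-call", "expectedVerdict": ""},
    {"name": "b.yml", "property": "unreach-call", "expectedVerdict": ""},
]
print(select_task_id_columns(rows, "differing"))  # ['name']
print(select_task_id_columns(rows, "non-empty"))  # ['name', 'property']
print(select_task_id_columns(rows, "all"))        # ['name', 'property', 'expectedVerdict']
```

With "differing", the set of columns depends on the data, so two exports can have different layouts; "all" is stable but carries empty columns.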

Any feedback, whether about the general goal or concrete ideas, is highly welcome!
@s-winter ping

@PhilippWendler PhilippWendler added this to the Release 4.0 milestone Feb 10, 2023
@PhilippWendler PhilippWendler changed the title Improved CSV export Improved CSV export - feedback welcome Feb 10, 2023
@PhilippWendler PhilippWendler pinned this issue Feb 10, 2023
@Po-Chun-Chien
Member

I would vote for using tab as separator, but changing the file extension to .tsv.
I was confused the first time when trying to parse the file.
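For reference, a minimal sketch of reading such a tab-separated file with Python's standard csv module. The sample data and its single header line are hypothetical (the current export has three header lines); the point is only that the delimiter has to be set explicitly and that commas inside values then need no special handling.

```python
import csv
import io

# Hypothetical sample with one header line and a comma inside a value.
sample = (
    "task\tstatus\tcputime\n"
    "a.yml\terror (parsing failed, line 3)\t1.23\n"
)

# delimiter="\t" is required because the csv module defaults to commas;
# QUOTE_NONE keeps any quote characters in the data as literal text.
reader = csv.DictReader(
    io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE
)
rows = list(reader)
print(rows[0]["status"])  # error (parsing failed, line 3)
```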

@PhilippWendler
Member Author

> 1. Header lines: It's a good idea to reduce the header to one line to make it more compact. Regarding the content of header cells, you could use concatenation of run set name and column name to make it unique, and keep the timestamp optional. For example, "runSetName_status" or "runSetName_columnName".

Yes, this is the idea mentioned in the original post, but it has the disadvantage that it would make the column names really long and complex to use. For example, they would need to include a timestamp, and thus after importing into some third-party software, you would have to use these long and unique column names instead of, for example, just cputime. So we are looking for arguments for and against each of these possible choices.

> 2. Separator: Since comma appears regularly in some columns, it might be better to switch to tab-separated values (TSV) instead of comma-separated values (CSV). However, it's important to inform users about this change and explain the TSV abbreviation.

Note that we already use tabs in our current "CSV" format. Notifying users is not difficult: we can likely add the new format as an option in addition to the existing format and then delete the previous format in a new major version.

> 3. Task-ID columns: Instead of showing only those columns where not all values are equal, it might be better to show all columns for which data exists. This way, users can easily identify the relevant columns.

Hm, I am not sure I follow this argument. How would it make it easy to identify the relevant columns, if all columns are shown?

> 4. Table options: It would be helpful to add some options to the table definitions so that users can customize the table layout based on their specific use case. For example, users could choose which columns should be shown for the task ID.

Note that we do have this feature already (cf. documentation).

@DrMichaelPetter

How about keeping a raw/master TSV around and some postprocessing scripts based on CLI tools like sed/head/tail/grep and csvkit?
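As a rough illustration of what such a postprocessing step could look like, here is a small sketch that combines a cut-like column selection with a grep-like row filter. It is written in Python rather than with the CLI tools named above; the function name `select`, its parameters, and the sample data are all assumptions for the example.

```python
import csv
import io


def select(tsv_text, columns, where=None):
    """Keep only the given columns, and only rows for which where(row) holds."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = ["\t".join(columns)]
    for row in reader:
        if where is None or where(row):
            lines.append("\t".join(row[c] for c in columns))
    return "\n".join(lines) + "\n"


# Hypothetical one-line-header table, for illustration only.
tsv = (
    "task\tstatus\tcputime\n"
    "a.yml\ttrue\t1.5\n"
    "b.yml\tfalse\t2.0\n"
)
result = select(tsv, ["task", "cputime"], where=lambda r: r["status"] == "true")
print(result, end="")
```

Selecting columns by name like this only works if the header is a single line with unique names, which ties back to the header question in the original post.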

@PhilippWendler
Member Author

The raw/master files that BenchExec uses are the result XML files. We cannot use CSV/TSV for this, because these files contain important meta information about the whole benchmark run, which we need to keep together with the measurement data. This is important for example for creating the HTML tables (which contain both) and also makes archiving results easier.
