
different number of queries in Academic, IMDB, etc. (compared to the original distribution) #1

rizar opened this issue Nov 10, 2020 · 11 comments

@rizar
Contributor

rizar commented Nov 10, 2020

Thank you for releasing this, great work!

I have noticed that the "classical datasets" in your release always have fewer examples than the original release. For example:

Academic: 185 vs 189
Yelp: 122 vs 128
IMDB: 97 vs 131

Can you comment on what specific examples were left out? And maybe add a few words to README.md?

@ruiqi-zhong
Collaborator

Which original release are you referring to?

We took the data from https://github.com/jkkummerfeld/text2sql-data . Note that this repo groups all the SQLs that share the same template, so there might be fewer SQLs than in the original release.

(We then created separate data points for each SQL; in retrospect, it might have been a better decision to create separate data points for each natural language question rather than for each SQL query.)
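For concreteness, the gap in counts can be seen directly from that repo's JSON layout (a rough sketch, not our conversion code):

import json

with open('imdb.json') as f:  # any of the per-database files from jkkummerfeld/text2sql-data
    entries = json.load(f)

n_templates = len(entries)                               # one entry per SQL template
n_questions = sum(len(e['sentences']) for e in entries)  # each entry has several questions
print(n_templates, n_questions)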

@rizar
Contributor Author

rizar commented Dec 10, 2020

Thanks for your response!

The original release I'm referring to is indeed https://github.com/jkkummerfeld/text2sql-data. Having read your response, my understanding now is that you only took 1 query per template when you created the distilled test-suite. Does this imply that I can't use the test-suite databases to validate the correctness of the other queries? This would significantly reduce the dataset size. For example, there are 131 examples in the original IMDB dataset for 89 templates. Losing 42 out of 131 examples is not great. Do you plan to create a test-suite for the complete classical test sets? It would be a very valuable resource for the community.

I have also found other anomalies in your dataset release:

  1. In some IMDB queries the variables are not substituted. For example, I have found a query SELECT BIRTH_CITY FROM DIRECTOR WHERE NAME = "director_name0". Is that by design, or should such queries be discarded?

  2. Looking at the texts field of the dataset entries, I see that some questions are associated with semantically different queries. For example, the question "How many reviews does " business_name0 " have ?" is in texts for both

SELECT REVIEW_COUNT FROM BUSINESS WHERE NAME = "Acacia Cafe" ;

and

SELECT COUNT( DISTINCT ( T1.TEXT ) ) FROM REVIEW AS T1 JOIN BUSINESS AS T2 ON T1.BUSINESS_ID = T2.BUSINESS_ID WHERE T2.NAME = "Acacia Cafe";

  3. Two IMDB queries are not covered:

SELECT MOVIEalias0.TITLE FROM CLASSIFICATION AS CLASSIFICATIONalias0 , GENRE AS GENREalias0 , MOVIE AS MOVIEalias0 WHERE GENREalias0.GID = CLASSIFICATIONalias0.GID AND MOVIEalias0.MID = CLASSIFICATIONalias0.MSID GROUP BY MOVIEalias0.TITLE ORDER BY COUNT( DISTINCT ( GENREalias0.GENRE ) ) DESC LIMIT 1 ;

... that corresponds to "Find the movie which is classified in the most number of genres", and

SELECT PRODUCERalias0.NAME FROM DIRECTED_BY AS DIRECTED_BYalias0 , DIRECTOR AS DIRECTORalias0 , MADE_BY AS MADE_BYalias0 , MOVIE AS MOVIEalias0 , PRODUCER AS PRODUCERalias0 WHERE DIRECTORalias0.DID = DIRECTED_BYalias0.DID AND MOVIEalias0.MID = DIRECTED_BYalias0.MSID AND MOVIEalias0.MID = MADE_BYalias0.MSID AND PRODUCERalias0.PID = MADE_BYalias0.PID GROUP BY PRODUCERalias0.NAME ORDER BY COUNT( DISTINCT ( DIRECTORalias0.NAME ) ) DESC LIMIT 1

... that corresponds to "Which producer has worked with the most number of directors ?"

@ruiqi-zhong
Collaborator

Thanks for carefully examining the dataset! I think these recommendations are very helpful. Here are my responses to each of your points:

  1. Each template is associated with potentially multiple texts and SQLs. I mistakenly created one data point for each SQL template, which in retrospect was a bad decision. We will fix this. That said, all the texts are still kept in the .pkl file, so this should be a relatively easy fix for us.

Another teammate rewrote the IMDB queries into Spider style; I will check with him.

The unsubstituted variable name in the SQL query is not intentional. Our code assumed that variable examples were provided for all SQLs, hence this artifact. Below is the original data dictionary:

{'difficulty': 1,
 'query-split': '3',
 'sentences': [{'question-split': '8',
   'text': 'When was " writer_name0 " born ?',
   'variables': {'writer_name0': 'Kevin Spacey'}},
  {'question-split': '9',
   'text': 'In what year was " writer_name0 " born ?',
   'variables': {'writer_name0': 'Kevin Spacey'}}],
 'sql': ['SELECT WRITERalias0.BIRTH_YEAR FROM WRITER AS WRITERalias0 WHERE WRITERalias0.NAME = "writer_name0" ;',
  'SELECT ACTORalias0.BIRTH_YEAR FROM ACTOR AS ACTORalias0 WHERE ACTORalias0.NAME = "actor_name0" ;',
  'SELECT DIRECTORalias0.BIRTH_YEAR FROM DIRECTOR AS DIRECTORalias0 WHERE DIRECTORalias0.NAME = "director_name0" ;',
  'SELECT PRODUCERalias0.BIRTH_YEAR FROM PRODUCER AS PRODUCERalias0 WHERE PRODUCERalias0.NAME = "producer_name0" ;'],
 'variables': [{'example': 'Kevin Spacey',
   'location': 'both',
   'name': 'writer_name0',
   'type': 'writer_name'}]}

Here, the second SQL template has an "actor_name0" variable, but no example is provided.
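A small sanity check (a sketch, not the actual pipeline code) makes the artifact visible:

import re

def unsubstitutable_sqls(entry):
    # Placeholders that come with an example value at the entry level.
    provided = {v['name'] for v in entry['variables'] if v.get('example')}
    flagged = []
    for sql in entry['sql']:
        # Heuristically find quoted placeholders such as "actor_name0".
        placeholders = set(re.findall(r'"([a-z_]+\d+)"', sql))
        missing = placeholders - provided
        if missing:
            flagged.append((sql, missing))
    return flagged

# For the dictionary above, this flags the actor/director/producer variants,
# since only writer_name0 comes with an example value.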

  3. It seems that another teammate made a mistake when rewriting these two queries, and I threw them away, thinking that the data points contained an error.

Again, thanks for all the comments. I will add a note in the README indicating that the current version has a couple of issues and is useful only for preliminary exploration. Additionally, in our next update, we will release the complete provenance of the data transformation process such that it becomes transparent and maintainable.

@rizar
Contributor Author

rizar commented Dec 10, 2020

I really appreciate that you responded so quickly, @ruiqi-zhong. I am also happy to hear that you want to release a better version of the dataset. My team and I are really looking forward to it!

Meanwhile, please find more comments below:

Regarding the missing questions.

That said, all the texts are still kept in the .pkl file, so this should be a relatively easy fix for us.

Hmm, I'm afraid the pkl file does not contain all the information you'd need to fix everything. Consider this query template from the IMDB dataset. Your pkl file contains the following entry:

{'db_id': 'imdb',
 'query': 'SELECT RELEASE_YEAR FROM MOVIE WHERE TITLE = "The Imitation Game" ;',
 'texts': ['What year is the movie " movie_title0 " from ?',
  'What year was the movie " movie_title0 " produced'],
 'variables': [{'example': 'The Imitation Game',
   'location': 'both',
   'name': 'movie_title0',
   'type': 'movie_title',
   'ancestor_of_occuring_column': {('movie', 'TITLE')}}],
 'testsuite': {'database/imdb/imdbv9211stround0group128.sqlite',
  'database/imdb/imdbv9229thround0.sqlite',
  'database/imdb/imdbv9229thround1.sqlite',
  'database/imdb/imdbv9229thround2.sqlite',
  'database/imdb/imdbv9229thround3.sqlite',
  'database/imdb/imdbv9229thround4.sqlite'}}

Meanwhile, the original entry in the data release by University of Michigan researchers is:

{'difficulty': 1,
 'query-split': '5',
 'sentences': [{'question-split': '9',
   'text': 'What year is the movie " movie_title0 " from ?',
   'variables': {'movie_title0': 'Dead Poets Society'}},
  {'question-split': '6',
   'text': 'What year was the movie " movie_title0 " produced',
   'variables': {'movie_title0': 'The Imitation Game'}}],
 'sql': ['SELECT MOVIEalias0.RELEASE_YEAR FROM MOVIE AS MOVIEalias0 WHERE MOVIEalias0.TITLE = "movie_title0" ;'],
 'variables': [{'example': 'The Imitation Game',
   'location': 'both',
   'name': 'movie_title0',
   'type': 'movie_title'}]}

It appears that your pkl file does not include the per-sentence variable assignments from the original entry (e.g. Dead Poets Society for the first question), which would be required to reconstruct all the original question-query pairs. I also imagine that the test-suite databases you constructed probably do not contain the content that would be required to properly evaluate the corresponding queries against model predictions. Is my understanding correct? It would imply that the databases would need to be regenerated to fix the issue once and for all.

Regarding the queries that were not deanonymized

Thank you for your clarification, I think I understand now what happened. Note that the documentation for text2sql-data says that they "only use the first query, but retain the variants for completeness (e.g. using joins vs. conditions)." Queries other than the first one often do not even have the variable assignments provided, as you correctly noted in your analysis.

I think it would make sense for you to use only the first query when you regenerate the data.

This also explains why the same question text can be associated with semantically different queries (i.e. the review count example in my previous post).

Regarding the missing queries

Thanks for making clear what happened here.

@rizar
Contributor Author

rizar commented Dec 11, 2020

I think I have found another issue. While I can see that a lot of effort was put into rewriting the original SQL queries from text2sql-data into Spider SQL, there are still some issues:

In [10]: from process_sql import Schema, get_sql, get_schema

In [11]: get_sql(Schema(get_schema('/home/dzmitry/dist/duorat/data/database/imdb_test/imdb_test.sqlite')),
    ...:         'SELECT COUNT( DISTINCT ( T3.TITLE ) ) FROM DIRECTOR AS T1 JOIN DIRECTED_BY AS T2 ON T1.DID = T2.DID JOIN MOVIE AS T3 ON T3.MID = T2.MSID WHERE T1.NAME = "Quentin Tarantino" AND T3.RELEASE_YEAR < 2010')

...

~/dist/test-suite-sql-eval/process_sql.py in parse_col(toks, start_idx, tables_with_alias, schema, default_tables)
    186             return start_idx+1, schema.idMap[key]
    187
--> 188     assert False, "Error col: {}".format(tok)
    189
    190

AssertionError: Error col: (
It appears that the query should be written as SELECT COUNT(DISTINCT T3.TITLE) instead of SELECT COUNT( DISTINCT ( T3.TITLE ) ) for the Spider-style parser to function correctly.
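For illustration, a pre-parsing normalization along these lines (just a sketch, not something from your codebase) could rewrite such queries into the form the parser accepts:

import re

def strip_distinct_parens(sql):
    # Rewrite COUNT( DISTINCT ( expr ) ) into COUNT(DISTINCT expr) so that the
    # Spider-style process_sql parser does not stop at the inner "(".
    pattern = re.compile(r'\(\s*DISTINCT\s*\(\s*([^()]+?)\s*\)\s*\)', re.IGNORECASE)
    return pattern.sub(r'(DISTINCT \1)', sql)

print(strip_distinct_parens('SELECT COUNT( DISTINCT ( T3.TITLE ) ) FROM MOVIE AS T3'))
# SELECT COUNT(DISTINCT T3.TITLE) FROM MOVIE AS T3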

@ruiqi-zhong
Collaborator

Thanks for all the feedback! I just pushed an update.

  1. Missing queries from Jonathan's original repo.

I created one data point for each natural language query. The data conversion process is now in classical_provenance.ipynb. For each data point in classical_test.pkl, we can now trace back to the original data point in Jonathan's repo using the orig_id field.
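For example, tracing a data point back could look roughly like this (a sketch: it assumes orig_id indexes into the per-database JSON file from Jonathan's repo; classical_provenance.ipynb has the exact mapping):

import json
import pickle as pkl

test = pkl.load(open('classical_test.pkl', 'rb'))
d = test[0]  # any data point

# Assumption for this sketch: orig_id indexes into the JSON file of the
# corresponding database in jkkummerfeld/text2sql-data.
original = json.load(open('%s.json' % d['db_id']))
entry = original[d['orig_id']]
print(d['query'])
print(entry['sql'][0])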

  2. Incorrect value replacement

I fixed it, and the details can be seen in classical_provenance.ipynb. Specifically, I used the variable values for the question-SQL pairs to replace the placeholders such that 1) every placeholder gets replaced by a value and 2) the natural language and the SQL use the same value for each placeholder. I also fixed some queries that resulted in execution errors.
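In sketch form (the notebook has the actual code), the replacement rule amounts to applying one placeholder-to-value mapping to both sides of a pair:

def substitute(text, sql, values):
    # The same placeholder -> value mapping is applied to the question and the
    # SQL, so the two sides stay consistent (property 2 above).
    for name, value in values.items():
        text = text.replace(name, value)
        sql = sql.replace(name, value)
    return text, sql

text, sql = substitute(
    'What year is the movie " movie_title0 " from ?',
    'SELECT RELEASE_YEAR FROM MOVIE WHERE TITLE = "movie_title0" ;',
    {'movie_title0': 'The Imitation Game'},
)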

  3. Semantically inequivalent query

In this version, I decided to back off and directly use the SQLs from Jonathan's repo.

I changed some details of how neighbor queries are created: 1) instead of dropping every subspan to create neighbor queries, I now only drop subspans that correspond to the derivation of an AST intermediate symbol (i.e. a subtree). This is because SQLs from Jonathan's repo are usually very long, and dropping every possible subspan would create a quadratic number of neighbor queries, which is too many. 2) I forbid dropping any predicate of a join statement (e.g. WHERE child.id = parent.id), because dropping those subspans would drastically increase computation time: the query would then take the Cartesian product of two tables.
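Conceptually, the subtree-dropping step looks something like this toy sketch (the real code operates on SQL ASTs and is more involved):

def neighbors(tree, is_join_predicate=lambda node: False):
    # Yield copies of `tree` (a nested list standing in for a parse tree) with
    # exactly one subtree removed, never removing a join predicate.
    for i, child in enumerate(tree):
        if isinstance(child, list):
            if not is_join_predicate(child):
                yield tree[:i] + tree[i + 1:]              # drop this subtree
            for variant in neighbors(child, is_join_predicate):
                yield tree[:i] + [variant] + tree[i + 1:]  # or recurse into it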

I also added two features: evaluating on a subset of the classical datasets, and caching results for SQLs that have already been evaluated. This is because queries on ATIS can sometimes take a while to run.
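The caching part, in sketch form (keyed on the database file and the SQL string; the released implementation may differ in details):

import sqlite3
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_execute(db_path, sql):
    # A (database, SQL) pair is executed once and its result reused, which
    # matters for slow ATIS queries that are evaluated many times.
    with sqlite3.connect(db_path) as conn:
        return tuple(conn.execute(sql).fetchall())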

Again, I really appreciate your time examining the queries closely. I am closing this issue. Please feel free to open a new issue if you have any comments on the updated version.

@rizar
Contributor Author

rizar commented Mar 19, 2021

Hi Ruiqi, thanks a lot for your continued effort to make this test-suite work correctly for classical datasets. Unfortunately, I think I have found another issue, this time probably the last one.

In Jonathan's repo, examples are grouped by SQL query. Each query can have an arbitrary number of associated sentences, and every sentence has its own set of values that should be substituted for the variables. E.g., in geography.json, in the first query the city is Arizona, in the second it is Texas, etc.

Your code in classical_provenance.ipynb takes values for variables from another source:

instance_name2examples = {d['name']: d['example'] for d in instance_variables}

As a result, many queries in classical.pkl have the wrong literals. For example, in classical.pkl you have the question 'What year is the movie The Imitation Game from ?', which should instead be 'What year is the movie Dead Poets Society from ?'.

The consequence of this mismatch is that the generated test suite cannot be used to evaluate the correctness of the original queries from the classical datasets. I noticed this when the evaluation returned some really surprising false positives.

I would be extremely grateful if you could fix this issue and regenerate the test suite. This resource makes a big difference in making meaningful evaluation reproducible, and it would be really great to get it right.

@ruiqi-zhong
Collaborator

(I'm re-opening the issue)

Thanks for your feedback! Would you mind clarifying which one of the following cases is happening? Suppose the dictionary associated with each text-SQL pair is d:

  1. There is a misalignment between d['text'] and d['query']. If so, would you mind telling me the index (in the .pkl file) where this misalignment happens?
  2. The test suite cannot distinguish the gold SQL query (as provided in the .pkl) from another query. If so, can you let me know which gold query has the problem, and what the false positive looks like?
  3. The classical.pkl is missing certain combinations of natural language + literals from Jonathan's repo. This was in fact intentional: we only keep one way to insert literals for each SQL (template) and change the corresponding natural language text, so that we only need to create one test suite for each SQL template. However, I can also see the advantage of keeping the original variety of literals (especially for evaluating models where literals in the input are not anonymized, or models that need to predict values). If you think we should add back those combinations, I will be happy to do that soon.

I ran the following code

import pickle as pkl
test = pkl.load(open('classical_test.pkl', 'rb'))
for idx, d in enumerate(test):
    t = d['text']
    if 'The Imitation Game' in t:
        print(t)
        print(d['query'])
        print(d)

and did not find "dead poet" in the output. Looks like it is problem 3 rather than 1?

This will help me pin down the problem with the current evaluation, and I will definitely fix it if there is an issue. Thanks!

@ruiqi-zhong reopened this Mar 19, 2021
@rizar
Contributor Author

rizar commented Mar 20, 2021

It is case 3 that is happening, namely:

The classical.pkl is missing certain combinations of natural language + literals from Jonathans' repo.

I would argue this is a significant change compared to the original data, because linking literals to the right column in the schema is in general a non-trivial issue. For example, in "How big is New Mexico?" New Mexico is clearly a state, because there is no city called New Mexico, whereas in "How big is New York?" New York can refer to either a city or a state.

You can also check out the preprocessing code for this data that was released by Google: https://github.com/google-research/language/blob/89e5b53a0e7f9f3e2da25a5da71ce5bd466acabd/language/xsp/data_preprocessing/michigan_preprocessing.py#L30 . It is very well documented, and does variable substitution in what I think is the right way.
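To spell out what I mean by per-sentence substitution, a minimal sketch (assuming the text2sql-data entry layout quoted above, not your actual pipeline) would be:

def expand(entry):
    sql_template = entry['sql'][0]  # text2sql-data documents using only the first SQL
    for sent in entry['sentences']:
        text, sql = sent['text'], sql_template
        for name, value in sent['variables'].items():  # per-sentence values
            text = text.replace(name, value)
            sql = sql.replace(name, value)
        yield {'text': text, 'query': sql}

# For the imdb.json entry quoted earlier, this yields both the
# "Dead Poets Society" and the "The Imitation Game" questions.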

This was in fact intentional: we only keep one way to insert literal for each SQL (template) and change the corresponding natural language text, such that we only need to create one test suite for each SQL template.

I understand the computational effort considerations, but technically this simplification changes the original question texts in the datasets.

if you think that we should add back those combinations, I will also be happy to do that soon in the future.

It would greatly help me in my work, thanks!

Here is the "Dead Poets" line in the original data: https://github.com/jkkummerfeld/text2sql-data/blob/master/data/imdb.json#L10

@ruiqi-zhong
Collaborator

Thanks for the clarification. I will probably make this update by the end of this month. Thanks!

@rizar
Contributor Author

rizar commented Mar 22, 2021

Great, thank you so much! I am looking forward to the new release.
