Wikidata refactor #813

sileix · 2021-10-10T08:46:18Z

No description provided.

lgtm-com · 2021-10-10T08:53:39Z

This pull request introduces 1 alert when merging c6436bd into 5783646 - view on LGTM.com

new alerts:

1 for Missing space in string concatenation

gcampax

Ok, overall looks like it's a lot of good cleanups. There's quite a few new scripts here though, and I have some questions on those.

lib/pos-parser/index.ts

lib/i18n/english.ts

tool/autoqa/wikidata/preprocess-wikidata.ts

tool/autoqa/auto-annotate.ts

lib/dataset-tools/evaluation/sentence_evaluator.ts

lib/dataset-tools/augmentation/replace_parameters.ts

gcampax · 2021-10-11T20:49:04Z

tool/autoqa/wikidata/postprocess-data.ts

+                           bootlegTypes : fs.ReadStream,
+                           bootlegTypeCanonincals : fs.ReadStream,
+                           input : [fs.ReadStream, fs.ReadStream], 
+                           output : [fs.WriteStream, fs.WriteStream]) {


Why are you using arrays here?

I need to update both the dataset file and the bootleg output file, so it reads both and outputs an updated version of both.

But like, you don't update the arrays themselves. You update the files. You could do with two input parameters (inputDataset and inputBootleg or similar) and two output parameters.

ugh... yeah, but I think this is a cleaner interface?

I mean not really, you need to remember which one is input[0] and which one is input[1]...

hmm, you are right. I switched to an object with examples and bootleg field for input and output.
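The object-based interface agreed on above might look roughly like the sketch below. The names `FilePair` and `dropEmptyExamples` are hypothetical, and plain string arrays stand in for the `fs.ReadStream` and `fs.WriteStream` values the real script passes around.

```typescript
// Hypothetical sketch: named fields instead of positional arrays.
// In the PR the field values are file streams; string arrays stand in here.
interface FilePair<T> {
    examples : T; // lines of the dataset file
    bootleg : T;  // lines of the Bootleg output file, aligned with `examples`
}

// Illustrative stand-in for the post-processing step: drop every row whose
// example line is blank, from both files, so the two stay aligned.
function dropEmptyExamples(input : FilePair<string[]>) : FilePair<string[]> {
    const output : FilePair<string[]> = { examples: [], bootleg: [] };
    input.examples.forEach((line, i) => {
        if (line.trim().length === 0)
            return;
        output.examples.push(line);
        output.bootleg.push(input.bootleg[i]);
    });
    return output;
}
```

With named fields, a caller writes `input.bootleg` rather than having to remember whether the Bootleg file was `input[0]` or `input[1]`.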

tool/autoqa/wikidata/postprocess-data.ts

lib/templates/projections.genie

Data source:
- Use CSQA preprocessed files as the source of data; only query the Wikidata service endpoint when information is missing from the CSQA dumps
- Use Bootleg data as a reference when determining the type of an entity

Type system: three options are provided
(1) entity-plain: one entity type per property, based on the property name
(2) entity-hierarchical: one entity type for each value, and the property type is the supertype of all types of its values; the property type has the prefix `p_`
(3) string: everything is a string except id
When entity-hierarchical is selected,

ThingTalk:
- Add the option to include the entity value (QID) in ThingTalk

Other tricks:
- Remove properties that share the same label as the domain: in Wikidata, each country has a `country` property pointing to itself, which is confusing
- Remove trailing QIDs in Bootleg type canonicals: when multiple types share the same canonical, Bootleg appends the QID to the end of the canonical for all types except one. Sometimes it also appends a QID to a type that does not share its canonical with other types, such as "nation" and "designation"; not sure why. In our case, the type information is there to assist parsing and the actual QID is not important, so we drop all the appended QIDs
- If two types share the same canonical, they are considered the same type in natural language.
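As an illustration of the trailing-QID cleanup described above, here is a minimal sketch. It assumes the appended QID is the last whitespace-separated token of the canonical; the actual Bootleg format may differ. `stripTrailingQid` and `groupByCanonical` are hypothetical names, and the QIDs in the usage note are placeholders.

```typescript
// Assumption: an appended QID looks like "<canonical> Q123".
function stripTrailingQid(canonical : string) : string {
    return canonical.replace(/\s+Q\d+$/, '').trim();
}

// Types whose cleaned canonicals are identical end up in the same group,
// i.e. they are treated as one type in natural language.
function groupByCanonical(types : Map<string, string>) : Map<string, string[]> {
    const groups = new Map<string, string[]>();
    types.forEach((canonical, qid) => {
        const key = stripTrailingQid(canonical);
        const members = groups.get(key) ?? [];
        members.push(qid);
        groups.set(key, members);
    });
    return groups;
}
```

For example, `groupByCanonical(new Map([['Q1', 'river'], ['Q2', 'river Q2']]))` puts both placeholder types under the single canonical `river`.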

(1) when Bootleg produces the correct entity QID, do nothing
(2) when Bootleg produces the wrong entity QID but the correct type, drop the example
(3) when Bootleg produces the wrong entity QID and the wrong type, remove the QID in ThingTalk
(4) when Bootleg produces nothing, remove the QID in ThingTalk
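The four cases can be written as a small decision function. This is only a sketch: the `EntityInfo` shape, the `Action` labels, and the name `decide` are assumptions made for illustration, not the script's actual code.

```typescript
// Hypothetical shapes for the gold annotation and the Bootleg prediction.
interface EntityInfo {
    qid : string;
    type : string;
}

type Action = 'keep' | 'drop_example' | 'remove_qid';

function decide(gold : EntityInfo, predicted : EntityInfo|null) : Action {
    if (predicted === null)
        return 'remove_qid';      // case 4: Bootleg produced nothing
    if (predicted.qid === gold.qid)
        return 'keep';            // case 1: correct QID
    if (predicted.type === gold.type)
        return 'drop_example';    // case 2: wrong QID, correct type
    return 'remove_qid';          // case 3: wrong QID, wrong type
}
```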

We include the type of the property and its subtypes. E.g., property `p_sister_city` has `city` as its subtype, so we add `city` as a `base_projection` and generate questions like "which city is the sister city of x?"
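A rough sketch of collecting the projection types for a property follows. The subtype table is a hypothetical `Map`, the function name `baseProjections` is invented here, and following subtypes transitively is an assumption; the description above only says that the property type and its subtypes are included.

```typescript
// Collect the property type plus every type reachable through the
// (hypothetical) subtype table, in breadth-first order.
function baseProjections(propertyType : string, subtypes : Map<string, string[]>) : string[] {
    const result : string[] = [];
    const seen = new Set<string>();
    const queue : string[] = [propertyType];
    while (queue.length > 0) {
        const current = queue.shift() as string;
        if (seen.has(current))
            continue;
        seen.add(current);
        result.push(current);
        for (const child of subtypes.get(current) ?? [])
            queue.push(child);
    }
    return result;
}
```

With a table that maps `p_sister_city` to `['city']`, the result contains both `p_sister_city` and `city`, which is what lets a template produce "which city is the sister city of x?".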

sileix force-pushed the wip/wikidata-refactor branch 3 times, most recently from ae90db0 to 021f2b5 Compare October 10, 2021 20:24

sileix marked this pull request as ready for review October 10, 2021 21:44

sileix requested a review from gcampax October 10, 2021 21:44

sileix force-pushed the wip/wikidata-refactor branch from 021f2b5 to b5a0256 Compare October 11, 2021 21:43

gcampax reviewed Oct 11, 2021

sileix force-pushed the wip/wikidata-refactor branch 3 times, most recently from a0e43cb to 56ebbfc Compare October 13, 2021 00:34

sileix and others added 14 commits October 12, 2021 17:41

Auto-annotate: avoid punctuation & pronouns in annotations

5e9929f

Cache wikidata end point queries

f38389c

Fix a bug in the pos tagger

174e11b

Bug fix in pos-parser: avoid inherit from Object.prototype

55452d8

Similar to the bug discovered in the POS tagger: when the key is "constructor", an object created with literal syntax returns the inherited constructor function instead of undefined, even though no 'constructor' key was ever set.
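The behavior behind this fix can be reproduced in a few lines. The variable names are illustrative only; the snippet shows the general JavaScript behavior rather than the parser's actual code.

```typescript
// An object literal inherits from Object.prototype, so looking up the key
// "constructor" finds the inherited Object function even when nothing was stored.
const fromLiteral : Record<string, unknown> = {};

// An object created with a null prototype has no inherited keys at all.
const withoutPrototype : Record<string, unknown> = Object.create(null);

const inheritedLookup = typeof fromLiteral['constructor'];    // 'function'
const cleanLookup = typeof withoutPrototype['constructor'];   // 'undefined'
```

A `Map<string, T>` avoids the same problem, since `map.get('constructor')` only ever returns values that were explicitly set.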

Add a hack to include common_name as human type for CSQA

ca72d2b

Fix JSONStream dependency

346c22c

Use npm alias to avoid the bug caused by the capitalized file name

Move duplicate loadClassDef to utils

acb3872

Add unit test for CSQA converter

b9e14a7

Cleanup unused auto annotate options

ce8948c

augment/typecheck/eval: add the option to include entity id in thingtalk

8fa5bdf

Separate option soft_match_id from entity_id

10158b9

`soft_match_id` indicates that we will do a string filter on the id property; `entity_id` indicates that we will include the entity id (value) in the ThingTalk annotation. They should not be mixed into one option.

Synthesis: allow filtering on values of subtypes

8fdef28

augment: allow augmenting with all subtypes for a given entity

dc7438f

sileix force-pushed the wip/wikidata-refactor branch from 56ebbfc to d57841f Compare October 13, 2021 00:44

sileix added 4 commits October 12, 2021 18:16

Improve genie templates for projections

0678a17

Add templates to use "property", "reverse_property" in projection. Also introduce a new category: "reverse_passive_verb"

Do not exclude `property` type when adding projection annotations

8fbc9e1

Now we have templates for `property_projection`

sileix added 13 commits October 12, 2021 18:16

By default, limit the number of canonicals for each POS to be 10

df8c4ad

Avoid generating too many canonicals (possibly caused by type-based projection), which might lead to OOM during synthesis
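A cap like this could be sketched as follows. Keeping the first entries of each list is an assumption; the actual change may rank or select canonicals differently, and `limitCanonicals` is a hypothetical name.

```typescript
// Keep at most `maxPerPos` canonicals for each part-of-speech category.
function limitCanonicals(canonicals : Map<string, string[]>, maxPerPos = 10) : Map<string, string[]> {
    const limited = new Map<string, string[]>();
    canonicals.forEach((phrases, pos) => {
        limited.set(pos, phrases.slice(0, maxPerPos));
    });
    return limited;
}
```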

If paraphraser output is invalid Json, output plain file for debug

d0a0e6d

Add the option to ignore entity type when evaluating

06feb44

In wikidata, the actual type of an entity is not important, we can rely on the QID

Fix a typo in auto annotator

612a958

Update Wikidata starter code with a simplified instruction

0f81337

Update Wikidata starter test for a faster testing

a167a64

Add manual annotation for the city domain

5082b7b

Rename entity-id option to include-entity-value throughout

08ce20e

This is a boolean option to include the entity value in the ThingTalk annotation. `entity-id` doesn't sound like a boolean option, and we use the term `value` instead of `id` for entities outside of Wikidata.

Augment: cache entity descendants to improve efficiency

847a2c0

Add the option to include entity value in tt for dialogue-to-contextual

826b640

Fix wikidata starter test

f340812

Fix schemaorg starter test

7db8b1b

Fix lgtm

f84267b

sileix force-pushed the wip/wikidata-refactor branch from d57841f to f84267b Compare October 13, 2021 01:16

fix lint

a50c2c5

sileix merged commit 068465c into master Oct 13, 2021

sileix deleted the wip/wikidata-refactor branch October 13, 2021 05:45

sileix commented Oct 10, 2021

lgtm-com bot commented Oct 10, 2021

gcampax left a comment

gcampax Oct 11, 2021

sileix Oct 12, 2021

gcampax Oct 13, 2021

sileix Oct 13, 2021

gcampax Oct 13, 2021

sileix Oct 13, 2021
