Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wikidata refactor #813

Merged
merged 32 commits into from
Oct 13, 2021
Merged

Wikidata refactor #813

merged 32 commits into from
Oct 13, 2021

Conversation

sileix
Copy link
Member

@sileix sileix commented Oct 10, 2021

No description provided.

@lgtm-com
Copy link

lgtm-com bot commented Oct 10, 2021

This pull request introduces 1 alert when merging c6436bd into 5783646 - view on LGTM.com

new alerts:

  • 1 for Missing space in string concatenation

@sileix sileix force-pushed the wip/wikidata-refactor branch 3 times, most recently from ae90db0 to 021f2b5 Compare October 10, 2021 20:24
@sileix sileix marked this pull request as ready for review October 10, 2021 21:44
@sileix sileix requested a review from gcampax October 10, 2021 21:44
Copy link
Contributor

@gcampax gcampax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, overall looks like it's a lot of good cleanups. There's quite a few new scripts here though, and I have some questions on those.

lib/pos-parser/index.ts Show resolved Hide resolved
lib/i18n/english.ts Show resolved Hide resolved
tool/autoqa/wikidata/preprocess-wikidata.ts Outdated Show resolved Hide resolved
tool/autoqa/auto-annotate.ts Outdated Show resolved Hide resolved
bootlegTypes : fs.ReadStream,
bootlegTypeCanonincals : fs.ReadStream,
input : [fs.ReadStream, fs.ReadStream],
output : [fs.WriteStream, fs.WriteStream]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using arrays here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need update both the dataset file and the bootleg output file, so it reads both and outputs an updated version of both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But like, you don't update the arrays themselves. You update the files. You could do with two input parameters (inputDataset and inputBootleg or similar) and two output parameters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh... yeah, but I think this is a cleaner interface?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean not really, you need to remember which one is input[0] and which one is input[1]...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, you are right. I switched to an object with examples and bootleg field for input and output.

tool/autoqa/wikidata/postprocess-data.ts Show resolved Hide resolved
lib/templates/projections.genie Show resolved Hide resolved
lib/templates/projections.genie Outdated Show resolved Hide resolved
@sileix sileix force-pushed the wip/wikidata-refactor branch 3 times, most recently from a0e43cb to 56ebbfc Compare October 13, 2021 00:34
sileix and others added 14 commits October 12, 2021 17:41
Similar to the bug discovered in the pos tagger, when the key is
"constructor", an obejct created from literal syntax will return a
function instead of undefined when 'constructor' key is missing
Data source:
- Use CSQA preprocessed files as a source of data, only query wikidata
  service end point when information is missing in CSQA dumps
- Use Bootleg data as a reference when determine the type of an entity

Type system:
In total, 3 difference option is provided
(1) entity-plain: one entity type per property based on property name
(2) entity-hierarchical: one entity type for each value, and the
property type is the supertype of all types of its values; the property
type has a prefix `p_`
(3) string: everthing string except id
When entity-heirarchical is selected,

ThingTalk:
- Add the option to include Entity value (QID) in ThingTalk

Other tricks:
- Remove properties that share the same label as the domain: In
  wikidata, each country has a property country, poiting to itself, and
  it's confusing
- Remove trailing QIDs in bootleg type canonical: When there are
  multiple types share the same canonical, bootleg will
  append QID at the end of the canonical for all types except one.
  Sometimes, it also append QID to some type that does not share
  canonical with other types, such as "nation", and "designation", not
  sure why.
  In our case, the type information is to assist the parsing, the actual
  QID is not important, so we drop all the appended QIDs - if two types
  share the same canonical, they are considered as the same type in
  natural language.
Use npm alias to avoid the bug caused by the captilized file name
`soft_match_id` indicates that we will do string filter on id property;
`entity_id` indicates that we will include entity id (value) in the
thingtlak annotation.

they should not be mixed into one option
(1) when bootleg produces the correct entity qid, do nothing;
(2) when bootleg produces the wrong entity qid, but correct type, drop
the example
(3) when bootleg produces the wrong entity qid and wrong type, remove the
qid in thingtalk
(4) when bootleg produces nothing, remove the qid in thingtalk
We include the type of the property and its subtypes.
E.g., property `p_sister_city` will have `city` as its subtype, thus, we
will add `city` as a `base_projection`, and generate questions like:
"which city is the sister city of x?"
Add templates to use "property", "reverse_property" in projection.
Also introduce a new category: "reverse_passive_verb"
Now we have templates for `property_projection`
@sileix sileix merged commit 068465c into master Oct 13, 2021
@sileix sileix deleted the wip/wikidata-refactor branch October 13, 2021 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants