Wikidata refactor #813
This pull request introduces 1 alert when merging c6436bd into 5783646 - view on LGTM.com new alerts:
ae90db0 to 021f2b5
021f2b5 to b5a0256
Ok, overall looks like it's a lot of good cleanups. There's quite a few new scripts here though, and I have some questions on those.
bootlegTypes : fs.ReadStream,
bootlegTypeCanonincals : fs.ReadStream,
input : [fs.ReadStream, fs.ReadStream],
output : [fs.WriteStream, fs.WriteStream]) {
Why are you using arrays here?
I need to update both the dataset file and the bootleg output file, so it reads both and outputs an updated version of each.
But like, you don't update the arrays themselves. You update the files. You could do with two input parameters (inputDataset and inputBootleg or similar) and two output parameters.
ugh... yeah, but I think this is a cleaner interface?
I mean not really, you need to remember which one is input[0] and which one is input[1]...
hmm, you are right. I switched to an object with `examples` and `bootleg` fields for both input and output.
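For readers following the thread, the shape agreed on here might look roughly like the sketch below. The names `DatasetAndBootleg` and `mapBoth` and the file names are hypothetical, not taken from the PR; only the `examples` and `bootleg` field names come from the comment above.

```typescript
// Hypothetical shape: named fields replace positional input[0] / input[1].
// With T = fs.ReadStream this would describe the inputs,
// with T = fs.WriteStream the outputs.
interface DatasetAndBootleg<T> {
    examples : T;
    bootleg : T;
}

// Apply the same function to both fields, keeping the field names.
function mapBoth<T, U>(pair : DatasetAndBootleg<T>,
                       fn : (value : T, field : keyof DatasetAndBootleg<T>) => U) : DatasetAndBootleg<U> {
    return {
        examples: fn(pair.examples, 'examples'),
        bootleg: fn(pair.bootleg, 'bootleg')
    };
}

// Illustration with plain path strings (made-up file names).
const inputPaths : DatasetAndBootleg<string> = {
    examples: 'dataset.tsv',
    bootleg: 'bootleg.jsonl'
};
const outputPaths = mapBoth(inputPaths, (path) => path + '.updated');
```

With named fields the call site no longer depends on remembering which array index holds which file, which was the objection raised in the review.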
a0e43cb to 56ebbfc
Similar to the bug discovered in the POS tagger: when the key is "constructor", an object created with literal syntax returns a function instead of undefined, even though no 'constructor' key was ever set on it.
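The behavior described in this commit message can be reproduced in isolation. The sketch below is not the fix applied in the PR (which is not shown here); `safeGet` is a hypothetical helper, and a prototype-less object or an own-property check are two common ways to avoid the inherited lookup.

```typescript
// A plain object literal inherits from Object.prototype, so looking up a key
// that was never set can still find an inherited member.
const literal : Record<string, unknown> = {};

// An object created without a prototype has no inherited members.
const bare : Record<string, unknown> = Object.create(null);

// Hypothetical helper: only return values that are the object's own properties.
function safeGet(table : Record<string, unknown>, key : string) : unknown {
    return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

const inherited = literal['constructor'];         // the inherited Object function
const missing = bare['constructor'];              // undefined
const guarded = safeGet(literal, 'constructor');  // undefined
```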
Data source:
- Use CSQA preprocessed files as the source of data; only query the Wikidata service endpoint when information is missing from the CSQA dumps
- Use Bootleg data as a reference when determining the type of an entity

Type system: in total, 3 different options are provided
(1) entity-plain: one entity type per property, based on the property name
(2) entity-hierarchical: one entity type for each value, and the property type is the supertype of all types of its values; the property type has a prefix `p_`
(3) string: everything is a string except id
When entity-hierarchical is selected,

ThingTalk:
- Add the option to include the entity value (QID) in ThingTalk

Other tricks:
- Remove properties that share the same label as the domain: in Wikidata, each country has a property country, pointing to itself, which is confusing
- Remove trailing QIDs in Bootleg type canonicals: when multiple types share the same canonical, Bootleg appends the QID to the end of the canonical for all types except one. Sometimes it also appends a QID to a type that does not share its canonical with other types, such as "nation" and "designation"; not sure why. In our case, the type information is there to assist parsing and the actual QID is not important, so we drop all the appended QIDs
- If two types share the same canonical, they are considered the same type in natural language.
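The trailing-QID cleanup and the merging of types by canonical could be sketched as below. This is an illustration only: the exact way Bootleg appends the QID is not stated in the commit message, so the separator handled by `TRAILING_QID` (whitespace or underscore) is an assumption, and `stripTrailingQid` and `groupByCanonical` are hypothetical names.

```typescript
// Assumption: the QID is appended after whitespace or an underscore,
// e.g. "city Q2". The real Bootleg format may differ.
const TRAILING_QID = /[\s_]+Q\d+$/;

// Drop an appended QID, if any, from a type canonical.
function stripTrailingQid(canonical : string) : string {
    return canonical.replace(TRAILING_QID, '').trim();
}

// Group type QIDs by their cleaned canonical: types that end up with the
// same canonical are treated as one type in natural language.
function groupByCanonical(types : Record<string, string>) : Map<string, string[]> {
    const groups = new Map<string, string[]>();
    for (const qid of Object.keys(types)) {
        const key = stripTrailingQid(types[qid]);
        const members = groups.get(key) || [];
        members.push(qid);
        groups.set(key, members);
    }
    return groups;
}
```

Because the pattern requires a separator before the `Q`, canonicals such as "designation" that merely contain other words are left untouched.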
Use npm alias to avoid the bug caused by the capitalized file name
`soft_match_id` indicates that we will do a string filter on the id property; `entity_id` indicates that we will include the entity id (value) in the ThingTalk annotation. They should not be mixed into one option
56ebbfc to d57841f
(1) when Bootleg produces the correct entity QID, do nothing;
(2) when Bootleg produces the wrong entity QID but the correct type, drop the example;
(3) when Bootleg produces the wrong entity QID and the wrong type, remove the QID in ThingTalk;
(4) when Bootleg produces nothing, remove the QID in ThingTalk
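The four cases above amount to a small decision function. The sketch below is one reading of that commit message, not the PR's code: `EntityInfo`, `Action`, and `reconcile` are hypothetical names, and representing "Bootleg produces nothing" as `null` is an assumption.

```typescript
// Hypothetical record for an entity mention: its QID and its type.
interface EntityInfo {
    qid : string;
    type : string;
}

type Action = 'keep' | 'drop_example' | 'remove_qid';

// Compare the gold entity with Bootleg's prediction (null when Bootleg
// produced nothing) and pick the action for the four cases.
function reconcile(gold : EntityInfo, predicted : EntityInfo | null) : Action {
    if (predicted === null)
        return 'remove_qid';      // case (4): nothing predicted
    if (predicted.qid === gold.qid)
        return 'keep';            // case (1): correct QID
    if (predicted.type === gold.type)
        return 'drop_example';    // case (2): wrong QID, correct type
    return 'remove_qid';          // case (3): wrong QID, wrong type
}
```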
We include the type of the property and its subtypes. E.g., property `p_sister_city` will have `city` as its subtype, so we add `city` as a `base_projection` and generate questions like: "which city is the sister city of x?"
Add templates to use "property", "reverse_property" in projection. Also introduce a new category: "reverse_passive_verb"
Now we have templates for `property_projection`
Avoid generating too many canonicals (possibly caused by type-based projection), which might lead to OOM during synthesis
In wikidata, the actual type of an entity is not important, we can rely on the QID
This is a boolean option to include the entity value in the ThingTalk annotation. `entity-id` doesn't sound like a boolean option, and we use the term `value` instead of `id` outside of Wikidata for entities.
d57841f to f84267b