dictionary keys can be non-unique #959

kbenoit · 2017-09-12T12:09:01Z

Example:

my_dict <- dictionary(list(
    use = "firstuse",
    document = "document*",
    use      = c("use", "using")
))
# Dictionary object with 3 key entries.
# - use:
#     - firstuse
# - document:
#     - document*
# - use:
#     - use, using

We have tested whether this creates a problem for the *_lookup() functions, and it does not:

toks <- tokens("firstuse documenting using word word2")
tokens_lookup(toks, my_dict)
# tokens from 1 document.
# text1 :
# [1] "use"      "document" "use"

dfm_lookup(dfm(toks), my_dict)
# Document-feature matrix of: 1 document, 2 features (0% sparse).
# 1 x 2 sparse Matrix of class "dfm"
#        features
# docs    use document
#   text1   2        1

but it does create a problem for indexing.

my_dict[which(names(my_dict)=="use")]
# Dictionary object with 1 key entry.
# - use:
#   - use, using

my_dict[c(1:3)]
# Dictionary object with 2 key entries.
# - use:
#   - use, using
# - document:
#   - document*
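
For comparison, here is a minimal base R sketch of how duplicated names behave on a plain list and how they could be detected. The `keys` list is only a hypothetical stand-in for the dictionary above, not a quanteda object, so it says nothing about how the dictionary's own `[` method is implemented.

```r
# Hypothetical plain-list stand-in for my_dict (assumption: not a quanteda dictionary)
keys <- list(
    use      = "firstuse",
    document = "document*",
    use      = c("use", "using")
)

# Names that occur more than once on this level
dup_keys <- unique(names(keys)[duplicated(names(keys))])

# Positional indexing on a plain list keeps both "use" entries
by_position <- keys[which(names(keys) == "use")]

# Name-based indexing only reaches the first matching entry
by_name <- keys["use"]
```

On a plain list, positional indexing preserves both entries, which is what one would expect the dictionary indexing above to do as well.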

koheiw · 2017-09-12T16:51:30Z

Let's raise an error when duplicated keys are discovered on the same level. I can make a function to merge duplicated keys, but it can be fairly complex, because merging keys can cause new duplication in their sub-keys.
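
To illustrate why merging is recursive, here is a rough sketch on plain nested lists. `merge_keys()` is a hypothetical helper, not a quanteda function, and it assumes each key holds either a character vector or a named list of sub-keys, never both.

```r
# Hypothetical helper (not part of quanteda): merge duplicated keys in a
# nested named list. Values under the same key are combined and deduplicated;
# sub-lists under the same key are concatenated and merged again, because
# joining them can create new duplicates one level down.
merge_keys <- function(x) {
    if (!is.list(x)) return(x)
    merged <- list()
    for (key in unique(names(x))) {
        entries <- unname(x[names(x) == key])
        is_sub <- vapply(entries, is.list, logical(1))
        if (all(is_sub)) {
            # Concatenate the sub-lists, then recurse to resolve new duplicates
            merged[[key]] <- merge_keys(do.call(c, entries))
        } else if (!any(is_sub)) {
            # Plain values: keep each distinct value once
            merged[[key]] <- unique(unlist(entries))
        } else {
            stop("key '", key, "' mixes values and sub-keys")
        }
    }
    merged
}

dict <- list(
    US = list(Economy = c("word1", "word2")),
    US = list(Economy = c("word2", "word3"), Law = "word4")
)
merged <- merge_keys(dict)
```

Merging the two `US` entries produces two `Economy` sub-keys, which is exactly the new duplication on the next level that the recursion has to resolve.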

kbenoit · 2017-09-12T17:07:33Z

Actually, we might want to allow duplicated keys across different levels of hierarchy. Currently it seems to be working as expected:

dfm_lookup(dfm(txt), dic1)
# Document-feature matrix of: 2 documents, 3 features (16.7% sparse).
# 2 x 3 sparse Matrix of class "dfm"
#       features
# docs   A.X B.Y A.Y
#   doc1   3   2   2
#   doc2   0   2   2

dfm_lookup(dfm(txt), dic1, levels = 1)
# Document-feature matrix of: 2 documents, 2 features (0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   A B
#   doc1 5 2
#   doc2 2 2

dfm_lookup(dfm(txt), dic1, levels = 2)
# Document-feature matrix of: 2 documents, 2 features (25% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   X Y
#   doc1 3 4
#   doc2 0 4

A use-case? Could be:

level 1: Country
level 2: News category: sport, foreign policy, economy, etc.

(or could be reversed.) If you wanted an aggregate at level 2, to combine the values for sport across different countries, you want the values pooled for the duplicate key at level 2.

So maybe we should only issue a warning, and only if the flattened key is non-unique? For a level-1-only dictionary, we would simply combine the keys by union()?
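
A sketch of what such a check could look like on a plain nested list. `flatten_keys()` is a hypothetical helper and the "." separator is an assumption that mirrors the `A.X` feature names in the output above.

```r
# Hypothetical helper (not part of quanteda): build flattened key names such
# as "A.X" for every leaf of a nested named list.
flatten_keys <- function(x, prefix = NULL) {
    out <- character()
    for (i in seq_along(x)) {
        key <- paste(c(prefix, names(x)[i]), collapse = ".")
        if (is.list(x[[i]])) {
            out <- c(out, flatten_keys(x[[i]], key))
        } else {
            out <- c(out, key)
        }
    }
    out
}

nested <- list(
    A = list(X = "a1", Y = "a2"),
    B = list(Y = "b1"),
    A = list(X = "a3")
)

flat <- flatten_keys(nested)
dup_flat <- unique(flat[duplicated(flat)])

# Warn only when a flattened key is repeated; "Y" under both A and B is fine
if (length(dup_flat) > 0) {
    warning("non-unique flattened keys: ", paste(dup_flat, collapse = ", "))
}
```

Here `A.Y` and `B.Y` stay distinct even though `Y` is repeated at level 2, so only the repeated `A.X` would trigger the warning.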

koheiw · 2017-09-12T18:44:48Z

It is true that keys can be unique when the dictionary is flattened, but I cannot think of a reason to write

- [US]:
   - [Economy]:
      - word1, word2, word3
- [US]:
   - [Law]:
      - word4, word5, word6

instead of

- [US]:
   - [Economy]:
      - word1, word2, word3
   - [Law]:
      - word4, word5, word6

Can you give me an example? That said, we can allow duplicated keys, because identical keys are merged anyway, either during recompilation in tokens_lookup() or during compression in dfm_lookup().

kbenoit · 2017-09-12T19:16:16Z

I fully agree with that. But I was thinking more of the case where the top-level keys are different and the level 2 keys are identical. That's not really a problem either, since we have the levels option in the *_lookup() functions (thanks to you).

kbenoit · 2017-09-12T19:20:42Z

Note also that although identical keys are already merged, and therefore may not cause any problem for lookup, we still have a problem with indexing (example above).

kbenoit added bug dictionary labels Sep 12, 2017

kbenoit assigned koheiw Sep 12, 2017

kbenoit added this to the v0.99.x refresh milestone Sep 18, 2017

koheiw mentioned this issue Sep 21, 2017

Issue 959 #984

Merged

koheiw closed this as completed Sep 21, 2017
