Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dictionary keys can be non-unique #959

Closed
kbenoit opened this issue Sep 12, 2017 · 5 comments
Closed

dictionary keys can be non-unique #959

kbenoit opened this issue Sep 12, 2017 · 5 comments
Assignees

Comments

@kbenoit
Copy link
Collaborator

kbenoit commented Sep 12, 2017

Example:

my_dict <- dictionary(list(
    use = "firstuse",
    document = "document*",
    use      = c("use", "using")
))
# Dictionary object with 3 key entries.
# - use:
#     - firstuse
# - document:
#     - document*
#     - use:
#     - use, using

We have tested whether this creates a problem for the *l_lookup functions, and it does not:

toks <- tokens("firstuse documenting using word word2")
# tokens from 1 document.
# text1 :
# [1] "use"      "document" "use"    
# Document-feature matrix of: 1 document, 2 features (0% sparse).
# 1 x 2 sparse Matrix of class "dfm"
#        features
# docs    use document
#   text1   2        1

but it does create a problem for indexing.

my_dict[which(names(my_dict)=="use")]
# Dictionary object with 1 key entry.
# - use:
#   - use, using

my_dict[c(1:3)]
# Dictionary object with 2 key entries.
# - use:
#   - use, using
# - document:
#   - document*
@koheiw
Copy link
Collaborator

koheiw commented Sep 12, 2017

Let's raise an error when duplicated keys are discovered on the same level. I can make a function to merge duplicated keys, but it can be fairly complex, because merging keys can cause new duplication in their sub-keys.

@kbenoit
Copy link
Collaborator Author

kbenoit commented Sep 12, 2017

Actually, we might want to allow duplicated keys across different levels of hierarchy. Currently it seems to be working as expected:

dfm_lookup(dfm(txt), dic1)
# Document-feature matrix of: 2 documents, 3 features (16.7% sparse).
# 2 x 3 sparse Matrix of class "dfm"
#       features
# docs   A.X B.Y A.Y
#   doc1   3   2   2
#   doc2   0   2   2

dfm_lookup(dfm(txt), dic1, levels = 1)
Document-feature matrix of: 2 documents, 2 features (0% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   A B
#   doc1 5 2
#   doc2 2 2

dfm_lookup(dfm(txt), dic1, levels = 2)
# Document-feature matrix of: 2 documents, 2 features (25% sparse).
# 2 x 2 sparse Matrix of class "dfm"
#       features
# docs   X Y
#   doc1 3 4
#   doc2 0 4

A use-case? Could be:

  • level 1: Country
  • level 2: News category: sport, foreign policy, economy, etc.

(or could be reversed.) If you wanted an aggregate at level 2, to combine the values for sport across different countries, you want the values pooled for the duplicate key at level 2.

So maybe we should only issue a warning, and only if the flattened key is non-unique? For a level-1-only dictionary, we would simply combine the keys by union()?

@koheiw
Copy link
Collaborator

koheiw commented Sep 12, 2017

It is true that keys can be unique when dictionary is flattened, but I cannot think of reasons to do

- [US]:
   - [Economy]:
      - word1, word2, word3
- [US]:
   - [Law]:
      - word4, word5, word6 

instead of

- [US]:
   - [Economy]:
      - word1, word2, word3
   - [Law]:
      - word4, word5, word6 

Can you give me an example? That said, we can allow to use duplicated keys, because identical keys are merged in recompilation in tokens_lookup() or compression in dfm_lookup() anyway.

@kbenoit
Copy link
Collaborator Author

kbenoit commented Sep 12, 2017

I fully agree with that. But I was thinking more of the example where the top level keys were different, and the level 2 keys were identical. But that's not really a problem either, since we have the levels option in the *_lookup() functions (thanks to you).

@kbenoit
Copy link
Collaborator Author

kbenoit commented Sep 12, 2017

Note also that although identical keys are already merged, and therefore there may not be any problem identical keys, we still have a problem with the indexing (example above).

@kbenoit kbenoit added this to the v0.99.x refresh milestone Sep 18, 2017
@koheiw koheiw mentioned this issue Sep 21, 2017
@koheiw koheiw closed this as completed Sep 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants