
Lexeme.similarity and AnchorText #107

Closed
mapmeld opened this issue Jul 2, 2019 · 4 comments

Comments


mapmeld commented Jul 2, 2019

Hi!
I'm using the AnchorText movie reviews example as the starting point for a blog post on explainable AI. I've run into two minor issues, but I'd be interested in understanding them and maybe improving on them.

  1. When I am in synonym / use_proba=True mode, my CLI prints thousands of lines of this warning, at least once for every movie review:
    <stdin>:1: UserWarning: [W008] Evaluating Lexeme.similarity based on empty vectors.

  2. When I made a super-easy classifier (sentences beginning with Apples, Oranges, or neither), the "neither" category is more an absence of anchors, so predictions for it return an almost empty object. Could there be a better way to represent this 'null' category?

{'names': [], 'precision': 1.0, 'coverage': 1, 'raw': {'feature': [], 'mean': [], 'precision': [], 'coverage': [], 'examples': [], 'all_precision': 1.0, 'num_preds': 101, 'names': [], 'positions': [], 'instance': 'This is a good book .', 'prediction': 2}}
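As a stop-gap, the flood of W008 warnings can be silenced with Python's standard warnings filter. This is only a workaround, not a fix: lexemes with empty vectors are still sampled.

```python
import warnings

# Suppress spaCy's W008 warning about similarity on empty vectors.
# Workaround only: zero-vector lexemes are still considered when
# sampling synonyms, so the underlying issue remains.
warnings.filterwarnings(
    "ignore",
    message=r".*\[W008\].*",
    category=UserWarning,
)
```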


jklaise commented Jul 3, 2019

Hi, thanks for your interest in the library! I look forward to reading the blog post.

  1. I've seen the same issue, so I'm going to investigate. The warning suggests that we are looking up synonyms for empty vectors; this could be an alibi bug or intended behaviour on spacy's part.

  2. This is interesting. First, I want to note that this is an example of an empty anchor, not a "no anchor": a "no anchor" result would be a partial anchor that did not satisfy the precision threshold no matter how many features were added. I agree that it looks like the anchor algorithm is picking up on the model's decision to classify as "neither" when apples/oranges are absent, but it's not entirely clear. Can I ask a bit more about what the sentences in the "neither" category look like?
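One way to make the 'null' category more readable downstream is to special-case the empty anchor when presenting results. A minimal sketch, using a hypothetical helper (not part of alibi's API) over the explanation dict shown above:

```python
def describe_anchor(explanation: dict) -> str:
    """Render an anchor explanation, special-casing the empty anchor."""
    names = explanation.get("names", [])
    if not names:
        # An empty anchor: the prediction holds regardless of which
        # words are perturbed -- here, the absence of apples/oranges.
        return ("empty anchor (precision %.2f): prediction is stable "
                "under any perturbation" % explanation["precision"])
    return " AND ".join(names)

print(describe_anchor({"names": [], "precision": 1.0}))
# -> empty anchor (precision 1.00): prediction is stable under any perturbation
```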


jklaise commented Jul 3, 2019

@mapmeld I believe the cause of 1. is using the small sm model: the spacy docs say that the sm models don't ship with word vectors. I would suggest trying the lg model instead to see if the warnings go away (the quality of the synonyms should also improve!).
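If the larger model isn't installed yet, it can be fetched with spaCy's standard download command (a one-off setup step):

```shell
# Fetch the large English model, which ships with word vectors,
# then load it in place of the default md model.
python -m spacy download en_core_web_lg
```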


jklaise commented Jul 3, 2019

@mapmeld I was wrong: we use the md model by default, which does have word vectors. The issue arises because there are a lot of lexemes for which the word vector is identically zero. I think we should prune these from the corpus before finding the synonyms. Alternatively, we could bump up the default w_prob=-20 to something higher to exclude more words based on rarity.
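The pruning idea can be sketched with plain numpy (hypothetical words and vectors, not alibi's actual corpus handling): drop every entry whose vector is identically zero before the synonym search.

```python
import numpy as np

# Hypothetical vocabulary: two entries have identically-zero vectors,
# as happens for many lexemes in spacy's md model.
words = np.array(["good", "xqzt", "book", "zzyz"])
vectors = np.array([
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.4, 0.5, 0.6],
    [0.0, 0.0, 0.0],
])

# Keep only entries with a non-zero vector norm, so similarity is
# never evaluated against an empty vector.
mask = np.linalg.norm(vectors, axis=1) > 0
words, vectors = words[mask], vectors[mask]
print(words.tolist())  # -> ['good', 'book']
```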

Edit: I've submitted a PR to fix this #110

Edit2: This is now merged and fixed in v0.2.2


jklaise commented Aug 7, 2019

@mapmeld I will close this issue now, as the warnings have been fixed and it's hard to debug the Anchor output without knowing the details of your model. Feel free to open a new issue if you have more details.

@jklaise jklaise closed this as completed Aug 7, 2019