
Lexeme.similarity and AnchorText #107

Closed
mapmeld opened this issue Jul 2, 2019 · 4 comments

Comments


mapmeld commented Jul 2, 2019

Hi!
I'm using the AnchorText movie reviews example as the starting point for a blog post on explainable AI. I've run into two minor issues, but I'd be interested in understanding them and maybe improving on them.

  1. When I am in synonym / use_proba=True mode, my CLI prints thousands of lines of this warning, at least once for every movie review:
    <stdin>:1: UserWarning: [W008] Evaluating Lexeme.similarity based on empty vectors.

  2. When I made a super-easy classifier (sentences beginning with Apples, Oranges, or neither), the "neither" category is more an absence of anchors, so predictions for it return an almost empty object. Could there be a better way to represent this 'null' category?

{'names': [], 'precision': 1.0, 'coverage': 1, 'raw': {'feature': [], 'mean': [], 'precision': [], 'coverage': [], 'examples': [], 'all_precision': 1.0, 'num_preds': 101, 'names': [], 'positions': [], 'instance': 'This is a good book .', 'prediction': 2}}
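As a stop-gap, the flood of W008 warnings can be silenced with Python's standard warnings filter. This is only a workaround, not a fix: lexemes with empty vectors are still sampled.

```python
import warnings

# Suppress spaCy's W008 warning about similarity on empty vectors.
# Workaround only: zero-vector lexemes are still considered when
# sampling synonyms, so the underlying issue remains.
warnings.filterwarnings(
    "ignore",
    message=r".*\[W008\].*",
    category=UserWarning,
)
```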


jklaise commented Jul 3, 2019

Hi, thanks for your interest in the library! I look forward to reading the blog post.

  1. I've seen the same issue, so I'm going to investigate. The warning suggests that we are looking up synonyms for empty vectors; this could be an alibi bug or intended behaviour on spacy's part.

  2. This is interesting. First, I want to note that this is an example of an empty anchor, not a "no anchor": a "no anchor" result would be a partial anchor that did not satisfy the precision threshold no matter how many features were added. I agree that it looks like the anchor algorithm is picking up on the model's decision to classify as "neither" when apples/oranges are absent, but it's not entirely clear. Can I ask a bit more about what the sentences in the "neither" category look like?
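One way to make the 'null' category more readable downstream is to special-case the empty anchor when presenting results. A minimal sketch, using a hypothetical helper (not part of alibi's API) over the explanation dict shown above:

```python
def describe_anchor(explanation: dict) -> str:
    """Render an anchor explanation, special-casing the empty anchor."""
    names = explanation.get("names", [])
    if not names:
        # An empty anchor: the prediction holds regardless of which
        # words are perturbed -- here, the absence of apples/oranges.
        return ("empty anchor (precision %.2f): prediction is stable "
                "under any perturbation" % explanation["precision"])
    return " AND ".join(names)

print(describe_anchor({"names": [], "precision": 1.0}))
# -> empty anchor (precision 1.00): prediction is stable under any perturbation
```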


jklaise commented Jul 3, 2019

@mapmeld I believe the cause of 1. is using the small sm model: the spacy docs say that the sm models don't ship with word vectors. I would suggest trying the lg model instead to see if the warnings go away (the quality of the synonyms should also improve!).
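If the larger model isn't installed yet, it can be fetched with spaCy's standard download command (a one-off setup step):

```shell
# Fetch the large English model, which ships with word vectors,
# then load it in place of the default md model.
python -m spacy download en_core_web_lg
```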


jklaise commented Jul 3, 2019

@mapmeld I was wrong: we use the md model by default, which does have word vectors. The issue arises because there are a lot of lexemes for which the word vector is identically zero. I think we should prune these from the corpus before finding the synonyms. Alternatively, we could bump up the default w_prob=-20 to something higher to exclude more words based on rarity.
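The pruning idea can be sketched with plain numpy (hypothetical words and vectors, not alibi's actual corpus handling): drop every entry whose vector is identically zero before the synonym search.

```python
import numpy as np

# Hypothetical vocabulary: two entries have identically-zero vectors,
# as happens for many lexemes in spacy's md model.
words = np.array(["good", "xqzt", "book", "zzyz"])
vectors = np.array([
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.4, 0.5, 0.6],
    [0.0, 0.0, 0.0],
])

# Keep only entries with a non-zero vector norm, so similarity is
# never evaluated against an empty vector.
mask = np.linalg.norm(vectors, axis=1) > 0
words, vectors = words[mask], vectors[mask]
print(words.tolist())  # -> ['good', 'book']
```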

Edit: I've submitted a PR to fix this #110

Edit2: This is now merged and fixed in v0.2.2


jklaise commented Aug 7, 2019

@mapmeld I will close this issue now, as the warnings have been fixed and it's hard to debug the Anchor output without knowing the details of your model. Feel free to open a new issue if you have more details.

@jklaise jklaise closed this as completed Aug 7, 2019