Add unsupervised keyphrase extraction #75

bartdegoede · 2018-09-26T11:59:37Z

Follow-up to #73.

This adds an extract_keyphrases method rather than a property, because there are three different algorithms we can use; this way, it's up to the user which one to pick. The sgrank algorithm takes some additional arguments (selecting which ngrams to consider, for example), so I've added **kwargs for users to pass those arguments on.
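
A minimal sketch of the dispatch described above, not the code in this PR. The three ranker functions are hypothetical stand-ins (plain n-gram counting) so the example runs without textacy, the free-standing `extract_keyphrases(tokens, ...)` form and the `RANKERS` table are illustrative, and the third ranker name, `singlerank`, is an assumption: the thread only names textrank and sgrank.

```python
from collections import Counter


def _count_ngrams(tokens, ngrams=(1,)):
    """Count n-grams of the given sizes; a toy stand-in for a real ranker."""
    counts = Counter()
    for n in ngrams:
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts


# Hypothetical stand-ins: the real rankers score terms very differently.
def textrank(tokens, n_terms=10):
    return _count_ngrams(tokens).most_common(n_terms)


def singlerank(tokens, n_terms=10):
    return _count_ngrams(tokens, ngrams=(1, 2)).most_common(n_terms)


def sgrank(tokens, n_terms=10, ngrams=(1, 2, 3)):
    # Only this ranker takes an extra argument, hence the forwarded **kwargs.
    return _count_ngrams(tokens, ngrams=ngrams).most_common(n_terms)


RANKERS = {'textrank': textrank, 'singlerank': singlerank, 'sgrank': sgrank}


def extract_keyphrases(tokens, ranker='textrank', n_terms=10, **kwargs):
    """Look up the ranker by name and forward any ranker-specific kwargs."""
    if ranker not in RANKERS:
        raise ValueError(f'Unknown ranker {ranker!r}; choose from {sorted(RANKERS)}')
    return RANKERS[ranker](tokens, n_terms=n_terms, **kwargs)
```

Passing a keyword argument that the chosen ranker does not accept (for example `ngrams` with `ranker='textrank'`) raises a `TypeError` in this sketch, which makes misuse visible rather than silently ignored.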

anneschuth · 2018-09-26T12:02:49Z

Nice! Additionally I'd be in favour of adding a property that selects the best (up to you, or us).

anneschuth · 2018-09-26T12:03:57Z

textpipe/doc.py

@@ -314,3 +315,39 @@ def sentiment(self):
            return sentiment_it(self.clean)

        raise TextpipeMissingModelException(f'No sentiment model for {self.language}')
+
+    def extract_keyphrases(self, ranker='textrank', n_terms=10, **kwargs):


Can you add an LRU cache to this method?
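
A sketch of what caching such a method could look like with `functools.lru_cache`, assuming the method keeps its `**kwargs` signature. The `Doc` class, the `calls` counter, the placeholder body and `maxsize=128` are all hypothetical and only there to make the cache behaviour observable.

```python
from functools import lru_cache


class Doc:
    """Hypothetical stand-in for the class being patched in this PR."""

    def __init__(self, raw):
        self.raw = raw
        self.calls = 0  # counts real computations, for demonstration only

    @lru_cache(maxsize=128)
    def extract_keyphrases(self, ranker='textrank', n_terms=10, **kwargs):
        # Placeholder for the expensive ranking step.
        self.calls += 1
        return tuple(self.raw.split()[:n_terms])


doc = Doc('unsupervised keyphrase extraction with textrank')
doc.extract_keyphrases(n_terms=2)
doc.extract_keyphrases(n_terms=2)                  # served from the cache
doc.extract_keyphrases(n_terms=2, ngrams=(1, 2))   # different key, computed again
print(doc.calls)  # 2
```

Two things to keep in mind with this approach: every argument becomes part of the cache key, so keyword arguments must be hashable (a tuple works, a list raises `TypeError`), and because `self` is part of the key, the cache holds a reference to each instance until its entries are evicted.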

anneschuth · 2018-09-26T12:07:26Z

Don't forget to bump the version to 0.6.0

bartdegoede · 2018-09-26T12:11:25Z

In terms of choosing the "best" ranker: textrank is the default because it's the easiest to understand (I'm having a hard time deciding what the "best" would be for which use case without some extensive tests 😁). Any suggestions?

anneschuth · 2018-09-26T12:26:43Z

I think going with textrank is fine, if that's the default. No strong preferences.
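
One way to offer both the configurable method and a convenience property, sketched under assumptions: the `Doc` class and the placeholder ranking logic are hypothetical, and the property simply delegates to the method with its defaults (textrank).

```python
class Doc:
    """Hypothetical stand-in; the ranking logic here is a placeholder."""

    def __init__(self, raw):
        self.raw = raw

    def extract_keyphrases(self, ranker='textrank', n_terms=10, **kwargs):
        # Placeholder: tag each of the first n_terms words with the ranker used.
        return [(word, ranker) for word in self.raw.split()[:n_terms]]

    @property
    def keyphrases(self):
        """Keyphrases from the default ranker with default settings."""
        return self.extract_keyphrases()
```

Users who want a different ranker or extra arguments call the method directly; the property covers the common case.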

anneschuth · 2018-09-27T07:02:08Z

Cool, one last thing: can you add a matching operation (in operation.py)?

bartdegoede · 2018-09-27T09:14:00Z

FYI, I rebased my changes on your latest ones in master :-)

anneschuth · 2018-09-27T09:15:19Z

VERSION

@@ -1 +1 @@
-0.5.2
+0.5.3


could be 0.6.0 bump

or should, I think

Is the change that major? ;-)

No, minor :) A major change would increment the first digit.

The minor (middle) digit is for added functionality;
the last digit is for patches / bug fixes.
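
The convention above can be sketched as a small helper. The `bump` function is hypothetical and not part of this PR; it assumes a plain MAJOR.MINOR.PATCH string like the one in the VERSION file.

```python
def bump(version, part):
    """Return `version` (MAJOR.MINOR.PATCH) with `part` incremented and lower parts reset."""
    major, minor, patch = (int(x) for x in version.split('.'))
    if part == 'major':
        return f'{major + 1}.0.0'
    if part == 'minor':
        return f'{major}.{minor + 1}.0'
    if part == 'patch':
        return f'{major}.{minor}.{patch + 1}'
    raise ValueError(f'Unknown part: {part!r}')


print(bump('0.5.2', 'patch'))  # 0.5.3, the bump in the diff
print(bump('0.5.2', 'minor'))  # 0.6.0, the bump the review asks for
```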

msappelli · 2018-09-27T09:33:28Z

Does textrank only extract 1-grams, or also actual phrases? I only see a multi-word term in the sgrank example. If the function is called 'extract_keyphrases', I expect more (or only) multi-word terms. Otherwise, use a name like extract_keyterms to be clear about the functionality.

bartdegoede · 2018-09-27T09:39:08Z

@msappelli It doesn't extract phrases: textacy hardcodes the parameter for joining terms to False for textrank. The other two rankers do return multi-word terms, and I went for "phrases" because I felt that would communicate better to users what would be returned (1+ grams, depending on the ranker).

That's totally subjective though, so if you (the maintainers) feel keyterms is more appropriate, I'm more than happy to change that :-)

msappelli · 2018-09-27T10:38:20Z

@bartdegoede I don't think it's subjective. In the IR community, keyterm / keyword / keyphrase are often used interchangeably. But in NLP/linguistics there is a difference: a 'keyterm' is typically a 1-, 2-, or 3-gram (so a 'term' is not necessarily a 1-gram), whereas a (key)phrase is a group of words (>1) that has a certain grammatical or semantic function.

Additionally, keyterms would be consistent with what textacy calls it, and since we are using their function under the hood, I would keep it consistent.

bartdegoede · 2018-09-27T10:40:13Z

Awesome, makes total sense :-) Changing it now.

EDIT: Just noticed I put keyterms in the docstring too 🙈

anneschuth reviewed Sep 26, 2018

bartdegoede force-pushed the feature/keyphrases branch from c5c085b to cc1a175 on September 26, 2018 12:07

bartdegoede force-pushed the feature/keyphrases branch from 6f92b07 to 67816f4 on September 26, 2018 12:19

textpipe deleted a comment Sep 26, 2018

bartdegoede force-pushed the feature/keyphrases branch from dcfee29 to 2776a10 on September 26, 2018 14:58

bartdegoede added 5 commits September 27, 2018 11:03

Add keyphrase extraction

8b89e39

Add LRU cache to extract_keyphrases()

387a51b

Bump version

a41169d

Satisfy TravisCI

21bba0b

Add keyphrases property

91e504d

bartdegoede force-pushed the feature/keyphrases branch from 2776a10 to 91e504d on September 27, 2018 09:04

anneschuth reviewed Sep 27, 2018

anneschuth approved these changes Sep 27, 2018

Add keyphrases operation

1f1b225

bartdegoede force-pushed the feature/keyphrases branch from 809bf64 to 1f1b225 on September 27, 2018 09:17

babakx approved these changes Sep 27, 2018

Rename keyphrases to keyterms

8729087

msappelli approved these changes Sep 28, 2018

anneschuth merged commit 7325dae into textpipe:master Sep 28, 2018
