Skip to content
This repository has been archived by the owner on Jun 9, 2021. It is now read-only.

Add unsupervised keyphrase extraction #75

Merged
merged 7 commits into from Sep 28, 2018

Conversation

bartdegoede
Copy link
Contributor

Follow up on #73

It adds an extract_keyphrases method rather than a property, because there's three different algorithms that we can use. This way, it's up to the user what to pick. The sgrank algorithm provides some additional arguments (selecting which ngrams to consider, for example), so I've added a **kwargs for users to pass these arguments on.

@anneschuth
Copy link
Member

Nice! Additionally I'd be in favour of adding a property that selects the best (up to you, or us).

textpipe/doc.py Outdated
@@ -314,3 +315,39 @@ def sentiment(self):
return sentiment_it(self.clean)

raise TextpipeMissingModelException(f'No sentiment model for {self.language}')

def extract_keyphrases(self, ranker='textrank', n_terms=10, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a LRU cache to this method?

@anneschuth
Copy link
Member

Don't forget to bump the version to 0.6.0

@bartdegoede
Copy link
Contributor Author

In terms of choosing the "best" ranker, the reason why textrank is the default is that it's easiest to understand (I'm having a hard time deciding what the "best" would be in which usecase without some extensive tests 😁). Any suggestions?

@anneschuth
Copy link
Member

I think going with textrank is fine, if that's the default. No strong preferences.

@anneschuth
Copy link
Member

Cool, one last thing: can you add a matching operation (in operation.py)?

@bartdegoede
Copy link
Contributor Author

bartdegoede commented Sep 27, 2018

FYI, I rebased my changes n your latest ones in master :-)

VERSION Outdated
@@ -1 +1 @@
0.5.2
0.5.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be 0.6.0 bump

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or should, I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change is that major? ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, minor :) major would increment the first digit.

minor is for added functionality
the last digit is for patches / bug fixes

@msappelli
Copy link

Does textrank only extract 1-grams or also actual phrases? I only see a multi word term in the sgrank example. If the function is called 'extract_keyphrases' I expect more/only multi-word terms. otherwise use something like extract_keyterms as a name to be clear about the functionality.

@bartdegoede
Copy link
Contributor Author

@msappelli It doesn't. Textacy hardcodes the parameter for joining terms for textrank to False. The other two do, and I went for phrases because I felt that would communicate better to users what would be returned (1+ grams, depending on the ranker).

That's totally subjective though, so if you (the maintainers) feel keyterms is more appropriate, I'm more than happy to change that :-)

@msappelli
Copy link

@bartdegoede I don't think it's subjective. I see that in the IR community keyterm/ keyword / keyphrase is often used interchangeably. But in NLP/linguistics there is a difference, where 'keyterm' is typically 1-2 or 3 gram (so the interpretation of 'term' is not necessarily a 1-gram, but ), whereas a (key) phrase is a group of words (>1) that have a certain grammatical or semantic function.

Additionally keyterms would be consistent with how textacy calls it, and since we are using their function underlying I would keep it consistent.

@bartdegoede
Copy link
Contributor Author

bartdegoede commented Sep 27, 2018

Awesome, makes total sense :-) Changing it now.

EDIT: Just noticed I put keyterms in the docstring too 🙈

@anneschuth anneschuth merged commit 7325dae into textpipe:master Sep 28, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants