Word Attraction Model for Unsupervised Key Word Extraction

This is an unofficial implementation of the keyword extraction method proposed in this paper:

Wang, R., Liu, W., & McDonald, C. (2014, July). Corpus-independent generic keyphrase extraction using word embedding vectors. In Software Engineering Research Conference (Vol. 39).

Briefly, the method is a variation of TextRank that takes the semantic similarity of words into consideration. The pairwise semantic similarity is measured by the cosine distance between the word vectors of the two words.
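To make the idea concrete, here is a minimal sketch of a similarity-weighted TextRank. It is not the code in word_attraction.py and not the exact scoring from the paper: the function names (cosine_similarity, rank_words), the co-occurrence window, the damping factor and the iteration count are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_words(tokens, vectors, window=3, damping=0.85, iterations=50):
    """Rank words with a weighted PageRank over a co-occurrence graph.

    Hypothetical helper: two words are linked when they appear within
    `window` consecutive tokens, and each co-occurrence adds their
    (non-negative) cosine similarity to the edge weight.
    """
    words = [t for t in tokens if t in vectors]

    # Accumulate undirected edge weights.
    weights = {}
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other == word:
                continue
            sim = max(cosine_similarity(vectors[word], vectors[other]), 0.0)
            key = frozenset((word, other))
            weights[key] = weights.get(key, 0.0) + sim

    # Build an adjacency map from the edge weights.
    neighbours = {w: {} for w in set(words)}
    for key, weight in weights.items():
        a, b = tuple(key)
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    # Iterate the weighted PageRank update a fixed number of times.
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        updated = {}
        for w in neighbours:
            incoming = 0.0
            for n, weight in neighbours[w].items():
                total = sum(neighbours[n].values())
                if total > 0:
                    incoming += weight / total * scores[n]
            updated[w] = (1 - damping) + damping * incoming
        scores = updated

    return sorted(scores, key=scores.get, reverse=True)
```

With plain TextRank every edge would carry the same weight; here, semantically close neighbours pass more of their score to each other.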

Prerequisite:

You need a pretrained word2vec model before using this method to extract keywords. A demo model, trained on very little data and with a very low embedding dimension, is available under /data/. A custom embedding model should work just as well if it is trained with https://github.com/dav/word2vec with -binary set to 0. Equivalently, use the word2vec submodule linked in this repository.
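For reference, word2vec writes a plain-text file when -binary is 0: a header line with the vocabulary size and the vector dimension, followed by one line per word with the word and its components. The sketch below parses that layout. The function name load_text_vectors is hypothetical, and this is not the loader in word2vecc_loader.py.

```python
def load_text_vectors(lines):
    """Parse word2vec text-format lines into a {word: [floats]} dict.

    Hypothetical helper for illustration. `lines` is any iterable of
    strings, for example an open file object.
    """
    lines = iter(lines)
    header = next(lines).split()
    dim = int(header[1])  # header is "<vocab_size> <dim>"

    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim
```

A file can be passed in directly, for example `with open(path, encoding="utf-8") as f: vectors, dim = load_text_vectors(f)`, where path points at your own model.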

Besides, NLTK is used for stemming words, but it is not required. Stemming is good practice for English, but it matters little if the target language does not change the form of words. If NLTK is not installed, stemming is skipped automatically.
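A common way to make a dependency like this optional is to guard the import, as in the sketch below. It uses NLTK's PorterStemmer; the helper name stem_words is an assumption, and the repository's own fallback may be written differently.

```python
try:
    from nltk.stem import PorterStemmer
    _stemmer = PorterStemmer()
except ImportError:
    # NLTK is not installed, so stemming is skipped.
    _stemmer = None


def stem_words(words):
    """Return stemmed words, or the words unchanged without NLTK."""
    if _stemmer is None:
        return list(words)
    return [_stemmer.stem(w) for w in words]
```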

Run

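from word2vecc_loader import Word2Vec
from word_attraction import WordAttraction

# Load the pretrained embeddings (here, the demo model under data/).
w2v = Word2Vec("data/text8-vector.bin")
wa = WordAttraction(w2v)

# Any input string works here; this one is only a placeholder.
text = "Keyword extraction picks the most informative words from a document."
print(wa.extract_main(text, max_words=3, stem=False))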

Try playing with demo.py.

Further

English stopwords are hard-coded in stopwords.py. When working with another language, it is better to modify this list accordingly. Leaving it unchanged will not cause software problems, but the extracted keywords may be filled with very frequent, less informative words.
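As an illustration of what a replacement list does, the sketch below filters tokens against a small German stopword set. The names GERMAN_STOPWORDS and remove_stopwords are hypothetical and are not taken from stopwords.py; a real list would be much longer.

```python
# A tiny, illustrative stopword set; a real one would be much larger.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist"}


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


print(remove_stopwords(["Das", "Haus", "ist", "alt"], GERMAN_STOPWORDS))
```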
