Spark Phrase Extraction: Automated phrase mining from huge text corpora using Apache Spark

This library is similar to the phrase extraction implementation in Gensim (found here), but it targets huge text corpora at scale using Apache Spark.

Target audience: Spark-Scala ML applications that need collocation phrase detection for their natural language processing (NLP) and information retrieval (IR) tasks.

spark-phrase-extraction provides:

Basic building blocks to create ML applications using Gensim-style APIs to:
- Train a distributed corpus vocabulary that automatically detects common phrases (multiword expressions) from a stream of sentences. The learnt corpus is based on frequently occurring collocated phrases.
- Save the trained model.
- Load the saved model and use its corpus knowledge to predict collocated n-gram phrases in input sentences.
- Scoring:
  - Supports the default python-gensim scorers: Original Scoring and NPMI Scoring
  - Enables a config-based approach to plug in custom scorers
  - Adds contingency-based scorers for use with Phraser, such as ChiSq, Jaccard, LogLikelyHood, etc.
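
For reference, the two default scorers can be sketched from the formulas used by Gensim's original_scorer and npmi_scorer, with a Jaccard score over a bigram contingency table as one example of a contingency-based measure. This is plain Scala for illustration, not this library's scorer code: the object ScoringSketch and every method and parameter name in it are assumptions.

```scala
// Illustrative sketch only; names here are hypothetical, not the library's API.
object ScoringSketch {

  // Gensim's "original" score:
  // (bigramCount - minCount) / wordACount / wordBCount * vocabSize
  def originalScore(wordACount: Long, wordBCount: Long, bigramCount: Long,
                    vocabSize: Long, minCount: Long): Double =
    (bigramCount - minCount).toDouble / wordACount / wordBCount * vocabSize

  // Normalized pointwise mutual information, in the range [-1, 1]:
  // ln(p(a,b) / (p(a) * p(b))) / -ln(p(a,b))
  def npmiScore(wordACount: Long, wordBCount: Long, bigramCount: Long,
                corpusWordCount: Long): Double = {
    val pa  = wordACount.toDouble / corpusWordCount
    val pb  = wordBCount.toDouble / corpusWordCount
    val pab = bigramCount.toDouble / corpusWordCount
    math.log(pab / (pa * pb)) / -math.log(pab)
  }

  // Jaccard over the 2x2 contingency table of the bigram (a, b):
  // n_ii / (n_ii + n_io + n_oi), where n_io counts "a without b"
  // and n_oi counts "b without a".
  def jaccardScore(wordACount: Long, wordBCount: Long, bigramCount: Long): Double = {
    val nII = bigramCount.toDouble
    val nIO = (wordACount - bigramCount).toDouble
    val nOI = (wordBCount - bigramCount).toDouble
    nII / (nII + nIO + nOI)
  }
}
```

A pair of words that only ever appear together gets an NPMI and a Jaccard score of 1, while the original score grows with the vocabulary size and is therefore compared against a corpus-dependent threshold.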

How to Run

Start Training

    import scala.collection.mutable

    // init config: words listed in commonWords are treated as common (stop) words
    val common_words = mutable.HashSet[String]("of", "with", "without", "and", "or", "the", "a")
    val config: PhrasesConfig = new SimplePhrasesConfig()
                    .copy(minCount = 1,
                          threshold = 1.0f,
                          commonWords = Some(common_words))
    // broadcast the config so every executor reads the same settings
    val configBc = spark.sparkContext.broadcast(config)

    // init scorer
    val scorer = BigramScorer.getScorer(BigramScorer.DEFAULT)

    // read sentences and start learning phrases
    // (outputPath is the location where the trained model will be saved)
    val sentencesDf = spark.read...
    CorpusHolder.learnAndSave(spark, sentencesDf, configBc, scorer, outputPath)

Supported config params are documented here.
CorpusHolder.learnAndSave takes the config and scorer, applies them to the input sentences, and saves the model at outputPath.
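
To make the training step more concrete, here is a minimal, non-distributed sketch of what learning a vocabulary can look like: count every word and every pair of adjacent words. It is an assumption-laden illustration rather than the code behind CorpusHolder.learnAndSave; the object VocabSketch, the method learnVocab and the "_" delimiter are hypothetical, and common-word handling is left out.

```scala
// Illustrative sketch only; not the library's implementation.
object VocabSketch {

  /** Counts unigrams and adjacent-word bigrams across all sentences.
    * Bigram keys are the two words joined with "_".
    */
  def learnVocab(sentences: Seq[Seq[String]]): Map[String, Int] = {
    val unigrams = sentences.flatten
    val bigrams = sentences.flatMap { sentence =>
      // sliding(2) may yield a shorter window for one-word sentences;
      // the pattern match keeps only complete pairs
      sentence.sliding(2).collect { case Seq(a, b) => s"${a}_$b" }
    }
    (unigrams ++ bigrams).groupBy(identity).map { case (k, v) => k -> v.size }
  }
}

val vocab = VocabSketch.learnVocab(Seq(
  Seq("new", "york", "city"),
  Seq("new", "york")
))
// vocab("new_york") == 2, vocab("york_city") == 1, vocab("city") == 1
```

These counts are exactly what a scorer needs to decide whether a pair of words occurs together often enough to count as a phrase.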

Start Predicting

Prediction involves two steps: loading the model and using it to predict.

    // needed for the Dataset encoders used by map below
    import spark.implicits._

    // load the model from the output path used in training above
    val phrases = Util.load[Phrases]("/tmp/gensim-model")
    val phraserBc = spark.sparkContext.broadcast(new Phraser(phrases))

    // read input sentences as dataframe
    // (the map below assumes each record is a plain String sentence)
    val sentencesDf = spark.read....

    // predict
    val sentenceBigramsDf = sentencesDf
                  .map(sentence => phraserBc.value.apply(sentence.split(" ")))

    // write bigrams to file
    sentenceBigramsDf.write.json("/tmp/gensim-output/")
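
The idea behind the predict step can be sketched in plain Scala: walk through the tokens of a sentence and join two neighbours whenever their score exceeds the threshold. This is only a sketch under assumptions, not the Phraser class itself; PhraserSketch, applyPhrases, the score function and the "_" delimiter are hypothetical names chosen for the example.

```scala
// Illustrative sketch only; not the library's Phraser.
object PhraserSketch {

  /** Greedy left-to-right merge: joins tokens(i) and tokens(i + 1)
    * when score(tokens(i), tokens(i + 1)) > threshold.
    */
  def applyPhrases(tokens: Seq[String],
                   score: (String, String) => Double,
                   threshold: Double,
                   delimiter: String = "_"): Seq[String] = {
    val indexed = tokens.toVector
    val out = Vector.newBuilder[String]
    var i = 0
    while (i < indexed.length) {
      if (i + 1 < indexed.length && score(indexed(i), indexed(i + 1)) > threshold) {
        out += indexed(i) + delimiter + indexed(i + 1)
        i += 2 // both tokens were consumed by the merged phrase
      } else {
        out += indexed(i)
        i += 1
      }
    }
    out.result()
  }
}

// Hypothetical scores standing in for a trained model
val scores = Map(("new", "york") -> 8.0, ("love", "new") -> 0.5)
val merged = PhraserSketch.applyPhrases(
  "i love new york".split(" ").toSeq,
  (a, b) => scores.getOrElse((a, b), 0.0),
  threshold = 1.0
)
// merged == Vector("i", "love", "new_york")
```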

Working Examples:

Please refer to the Predictor implementation and the Trainer implementation, where I've demoed a working example of training and predicting phrases.

How to train for Trigrams?

The steps above train bigrams.
If we want trigrams, repeat the training step with sentenceBigramsDf instead of sentencesDf, as shown below:

    CorpusHolder.learnAndSave(spark, sentenceBigramsDf, configBc, scorer, outputPath)

And so on for each of the higher n-grams we want.
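
Why a second pass yields trigrams can be seen in a small plain-Scala sketch: after the first pass a detected bigram is a single token, so the next pass can pair it with its neighbour. The object NgramPassSketch, the method mergePass and the phrase sets below are hypothetical and stand in for trained models; this is not the library's code.

```scala
// Illustrative sketch only; phrase sets stand in for trained models.
object NgramPassSketch {

  /** One merge pass: joins adjacent tokens whose pair is a known phrase. */
  def mergePass(tokens: List[String], phrases: Set[(String, String)]): List[String] =
    tokens match {
      case a :: b :: rest if phrases((a, b)) => s"${a}_$b" :: mergePass(rest, phrases)
      case a :: rest                         => a :: mergePass(rest, phrases)
      case Nil                               => Nil
    }
}

val sentence = List("new", "york", "city", "is", "big")

// Pass 1: bigram phrases learnt from the raw sentences
val bigramPass = NgramPassSketch.mergePass(sentence, Set(("new", "york")))
// bigramPass == List("new_york", "city", "is", "big")

// Pass 2: phrases learnt from the output of pass 1, where "new_york" is one token
val trigramPass = NgramPassSketch.mergePass(bigramPass, Set(("new_york", "city")))
// trigramPass == List("new_york_city", "is", "big")
```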

References

Heavily inspired by the good work that Python-Gensim has done here and here
Python-Gensim Github
Python-Gensim.ipynb on How to use it
Custom Scoring support via contingency-based scoring for collocations and statistical analysis of n-grams
