This library is similar to the phrase extraction implementation in Gensim (found here), but it is built for huge text corpora at scale using Apache Spark.
Target audience: Spark-Scala ML applications that need collocation phrase detection for their natural language processing (NLP) and information retrieval (IR) tasks.
- Basic building blocks for creating ML applications that use Gensim APIs to:
  - Train a distributed corpus vocabulary that automatically detects common phrases (multiword expressions) from a stream of sentences. The learned corpus is based on frequently occurring collocated phrases.
  - Save the trained model.
  - Load the saved model and use its corpus knowledge to predict collocated n-gram phrases in input sentences.
- Scoring:
  - Supports the default python-gensim scorers: Original Scoring and NPMI Scoring.
  - Provides a config-based approach for plugging in custom scorers.
  - Adds contingency-based scorers for use with Phraser, such as ChiSq, Jaccard, LogLikelyHood, etc.
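To make the scorers listed above more concrete, here is a minimal plain-Scala sketch of the formulas they are usually based on. The first two follow the definitions documented for Gensim's `original_scorer` and `npmi_scorer`; Jaccard and chi-square follow the standard 2x2 contingency-table definitions. The object and function names are hypothetical and are not this library's API; its actual scorers may differ in signature and edge-case handling.

```scala
// Illustrative only: names below are assumptions, not this library's classes.
object BigramScoringSketch {

  // Gensim's default ("original") score:
  // (count(a,b) - minCount) / count(a) / count(b) * vocabSize
  def originalScore(countA: Long, countB: Long, countAB: Long,
                    vocabSize: Long, minCount: Long): Double =
    (countAB - minCount).toDouble / countA / countB * vocabSize

  // Normalized pointwise mutual information, ranging over [-1, 1].
  def npmiScore(countA: Long, countB: Long, countAB: Long,
                corpusWordCount: Long): Double = {
    val pa  = countA.toDouble / corpusWordCount
    val pb  = countB.toDouble / corpusWordCount
    val pab = countAB.toDouble / corpusWordCount
    math.log(pab / (pa * pb)) / -math.log(pab)
  }

  // 2x2 contingency table for a bigram (a, b):
  // n11 = a followed by b, n12 = a followed by something else,
  // n21 = something else followed by b, n22 = neither.
  final case class Contingency(n11: Long, n12: Long, n21: Long, n22: Long) {
    def total: Long = n11 + n12 + n21 + n22
  }

  def contingency(countA: Long, countB: Long, countAB: Long,
                  totalBigrams: Long): Contingency =
    Contingency(countAB, countA - countAB, countB - countAB,
                totalBigrams - countA - countB + countAB)

  // Jaccard index: co-occurrences over all bigrams containing a or b.
  def jaccard(t: Contingency): Double =
    t.n11.toDouble / (t.n11 + t.n12 + t.n21)

  // Pearson's chi-square statistic for the 2x2 table.
  def chiSq(t: Contingency): Double = {
    val diff = t.n11.toDouble * t.n22 - t.n12.toDouble * t.n21
    val den  = (t.n11 + t.n12).toDouble * (t.n11 + t.n21) *
               (t.n12 + t.n22) * (t.n21 + t.n22)
    t.total * diff * diff / den
  }
}
```

In all four cases a higher score means the pair behaves more like a fixed phrase; a pair is typically accepted when its score exceeds the configured threshold.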
import scala.collection.mutable
// init config
val common_words = mutable.HashSet[String]("of", "with", "without", "and", "or", "the", "a")
val config: PhrasesConfig = new SimplePhrasesConfig()
  .copy(minCount = 1,
        threshold = 1.0f,
        commonWords = Some(common_words))
// broadcast the config so that every executor can read it
val configBc = spark.sparkContext.broadcast(config)
// init scorer
val scorer = BigramScorer.getScorer(BigramScorer.DEFAULT)
// read sentences and start learning phrases
val sentencesDf = spark.read...
CorpusHolder.learnAndSave(spark, sentencesDf, configBc, scorer, outputPath)
- Supported config params are documented here.
- CorpusHolder.learnAndSave takes the config and scorer, applies them to the input sentences, and saves the model at outputPath.
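As a rough illustration of what a training step of this kind has to collect, the sketch below counts unigrams and adjacent-token bigrams over a small in-memory corpus using plain Scala collections. It is not the library's implementation: `VocabCountSketch` and `countVocab` are hypothetical names, there is no Spark involved, and the special treatment of commonWords is left out.

```scala
// Illustrative only: a local, non-distributed stand-in for vocabulary counting.
object VocabCountSketch {

  // Returns (token -> count, adjacent token pair -> count) over all sentences.
  def countVocab(sentences: Seq[Seq[String]])
      : (Map[String, Int], Map[(String, String), Int]) = {
    val unigrams = sentences.flatten
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

    val bigrams = sentences
      .flatMap(s => s.zip(s.drop(1)))   // pair each token with its successor
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

    (unigrams, bigrams)
  }

  def main(args: Array[String]): Unit = {
    val corpus = Seq(
      "new york city".split(" ").toSeq,
      "new york".split(" ").toSeq
    )
    val (unigrams, bigrams) = countVocab(corpus)
    println(unigrams)
    println(bigrams)
  }
}
```

On a cluster the same idea is usually expressed as a keyed aggregation over the sentences, so that counts from different partitions are summed per token and per pair.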
Prediction involves two steps: loading the model and using it to predict.
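// brings the Dataset encoders used by .map below into scope
import spark.implicits._
// load model from output path in step3 above
val phrases = Util.load[Phrases]("/tmp/gensim-model")
val phraserBc = spark.sparkContext.broadcast(new Phraser(phrases))
// read input sentences; .split below assumes a Dataset[String], so a DataFrame needs .as[String] first
val sentencesDf = spark.read....
// predict: tokenize each sentence on spaces and apply the broadcast phraser
val sentenceBigramsDf = sentencesDf
  .map(sentence => phraserBc.value.apply(sentence.split(" ")))
// write bigrams to file
sentenceBigramsDf.write.json("/tmp/gensim-output/")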
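For intuition about what applying a phraser to a tokenized sentence does, here is a minimal sketch of a greedy left-to-right pass that joins adjacent tokens when the pair is a known phrase. It is an assumption-laden stand-in, not the library's Phraser: `PhraseApplySketch`, `applyPhrases`, the precomputed `phrases` set, and the `_` delimiter (Gensim's default) are all chosen here for illustration.

```scala
// Illustrative only: a toy replacement for the prediction step.
object PhraseApplySketch {

  // Walk the tokens once; when (current, next) is a known phrase,
  // emit the joined token and skip both, otherwise emit the current token.
  def applyPhrases(tokens: IndexedSeq[String],
                   phrases: Set[(String, String)],
                   delimiter: String = "_"): Vector[String] = {
    val out = Vector.newBuilder[String]
    var i = 0
    while (i < tokens.length) {
      if (i + 1 < tokens.length && phrases.contains((tokens(i), tokens(i + 1)))) {
        out += tokens(i) + delimiter + tokens(i + 1)
        i += 2
      } else {
        out += tokens(i)
        i += 1
      }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val tokens = "i love new york city".split(" ").toVector
    println(applyPhrases(tokens, Set(("new", "york"))).mkString(" "))
  }
}
```

A real phraser would decide membership from the learned counts, the scorer, and the threshold rather than from a fixed set.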
Please refer to the Predictor implementation and the Trainer implementation, where I've demoed a working example of training and predicting phrases.
- The step above trains bigrams.
- If we want trigrams, repeat step3 with sentenceBigramsDf instead of sentencesDf, as shown below:
CorpusHolder.learnAndSave(spark, sentenceBigramsDf, configBc, scorer, outputPath)
- Repeat the same way for each higher-order n-gram we want.
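The reason a second round yields trigrams is that the first round's output already contains joined tokens, so a pair in round two can span three original words. A toy sketch of that idea follows; the pair sets are hand-written stand-ins for what two training rounds might learn, and `NgramPassesSketch` and `pass` are hypothetical names rather than library calls.

```scala
// Illustrative only: shows why repeated bigram passes produce longer n-grams.
object NgramPassesSketch {

  // One pass: join adjacent tokens that form a known pair, left to right.
  def pass(tokens: List[String], pairs: Set[(String, String)]): List[String] =
    tokens match {
      case a :: b :: rest if pairs.contains((a, b)) => (a + "_" + b) :: pass(rest, pairs)
      case a :: rest                                => a :: pass(rest, pairs)
      case Nil                                      => Nil
    }

  def main(args: Array[String]): Unit = {
    val sentence = List("i", "love", "new", "york", "city")
    // assumed result of the first (bigram) training round
    val round1Pairs = Set(("new", "york"))
    // assumed result of training again on the bigram output
    val round2Pairs = Set(("new_york", "city"))

    val afterRound1 = pass(sentence, round1Pairs)
    val afterRound2 = pass(afterRound1, round2Pairs)
    println(afterRound1.mkString(" "))
    println(afterRound2.mkString(" "))
  }
}
```

Each additional round can extend phrases by at most one more joined unit, which is why the step is repeated once per extra n-gram order.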
- Heavily inspired by the good work that Python-Gensim has done here and here
- Python-Gensim Github
- Python-Gensim.ipynb on how to use it
- Custom Scoring support via contingency-based scoring for collocations and statistical analysis of n-grams