
Spark Phrase Extraction: Automated phrase mining from a huge text corpus using Apache Spark

This library is similar to the phrase extraction implementation in Gensim (found here), but it scales to huge text corpora using Apache Spark.


Target audience: Spark-Scala ML applications that need collocation phrase detection for their natural language processing (NLP) and information retrieval (IR) tasks.

spark-phrase-extraction provides:

  • Basic building blocks to create ML applications with Gensim-like APIs to:
    • Train a distributed corpus vocabulary that automatically detects common phrases (multiword expressions) from a stream of sentences. The learnt corpus captures frequently occurring collocated phrases.
    • Save the trained model
    • Load the saved model and use its corpus knowledge to predict collocated n-gram phrases in input sentences
    • Scoring:
      • Supports the default python-gensim scorers: Original Scoring and NPMI Scoring (the formulas are sketched after this list)
      • A config-based approach to plug in custom scorers
      • Contingency-based scorers for use with Phraser, such as ChiSq, Jaccard, and LogLikelihood
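
For reference, the two default scorers mirror Gensim's original and NPMI formulas. The standalone sketch below shows what they compute from raw bigram counts; it is illustrative only and is not the library's BigramScorer interface, whose exact signature lives in the source.

    // Standalone illustration of the two default scoring formulas (mirrors
    // Gensim's original_scorer and npmi_scorer; not the library's BigramScorer trait)
    object ScoringFormulas {

      // Original (Mikolov et al.) scoring:
      // (count(a,b) - minCount) / (count(a) * count(b)) * |V|
      def originalScore(wordACount: Long, wordBCount: Long, bigramCount: Long,
                        minCount: Long, vocabSize: Long): Double =
        (bigramCount - minCount).toDouble / wordACount / wordBCount * vocabSize

      // Normalized PMI: ln(P(a,b) / (P(a) * P(b))) / -ln(P(a,b)), bounded in [-1, 1]
      def npmiScore(wordACount: Long, wordBCount: Long, bigramCount: Long,
                    corpusWordCount: Long): Double = {
        val pa  = wordACount.toDouble / corpusWordCount
        val pb  = wordBCount.toDouble / corpusWordCount
        val pab = bigramCount.toDouble / corpusWordCount
        math.log(pab / (pa * pb)) / -math.log(pab)
      }
    }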

How to Run

Start Training

    // init config
    val common_words = mutable.HashSet[String]("of", "with", "without", "and", "or", "the", "a")
    val config: PhrasesConfig = new SimplePhrasesConfig()
                    .copy(minCount = 1, 
                          threshold = 1.0f, 
                          commonWords = Some(common_words))
    val configBc = spark.sparkContext.broadcast(config)
    
    // init scorer
    val scorer = BigramScorer.getScorer(BigramScorer.DEFAULT)

    // read sentences and start learning phrases
    val sentencesDf = spark.read...
    CorpusHolder.learnAndSave(spark, sentencesDf, configBc, scorer, outputPath)
  • Supported config params are documented here
  • CorpusHolder.learnAndSave takes the config and scorer, applies them to the input sentences, and saves the trained model at outputPath. A sketch of preparing the input sentences follows.
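
If the corpus is stored as plain text with one sentence per line, something like the following can produce the sentencesDf used above. This is a sketch: the path is a placeholder, and the exact schema learnAndSave expects is defined in the Trainer implementation referenced below.

    // Sketch: read a plain-text corpus with one sentence per line into a
    // Dataset[String]; "/path/to/corpus.txt" is a placeholder path
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("phrase-training")
      .getOrCreate()

    val sentencesDf = spark.read.textFile("/path/to/corpus.txt")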

Start Predicting

Prediction involves two steps: loading the model and using it to predict.

    // load the model saved at outputPath during training above
    val phrases = Util.load[Phrases]("/tmp/gensim-model")
    val phraserBc = spark.sparkContext.broadcast(new Phraser(phrases))
    
    // read input sentences as dataframe
    val sentencesDf = spark.read....
    
    // predict
    val sentenceBigramsDf = sentencesDf
                  .map(sentence => phraserBc.value.apply(sentence.split(" ")))

    // write bigrams to file
    sentenceBigramsDf.write.json("/tmp/gensim-output/")
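
As a quick sanity check, the broadcast Phraser can also be applied to a single tokenized sentence on the driver. The sentence below is made up, and the output format assumes the Gensim convention of joining detected collocations with an underscore (e.g. new_york); verify against your trained model.

    // Hypothetical spot check on the driver; output assumes the Gensim-style "_" join
    val tokens = "i love new york in the fall".split(" ")
    val merged = phraserBc.value.apply(tokens)
    println(merged.mkString(" "))   // e.g. "i love new_york in the fall"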

Working Examples:

Please refer to the Predictor implementation and the Trainer implementation, where I've demoed working examples of training and predicting phrases.

How to train for Trigrams?

  • The steps above train bigrams.
  • If we want trigrams, repeat the training step on sentenceBigramsDf instead of sentencesDf, as shown below:
  CorpusHolder.learnAndSave(spark, sentenceBigramsDf, configBc, scorer, outputPath)
  • And so on for each higher n-gram we want (a fuller chaining sketch follows).
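
Concretely, chaining to trigrams means applying the trained bigram Phraser to the raw sentences first, then training a second model on the merged output. A hedged sketch, where bigramOutputPath and trigramOutputPath are placeholder paths and the exact input types are defined by learnAndSave in the Trainer implementation:

    // 1. load the bigram model trained earlier and broadcast a Phraser for it
    val bigramPhrases = Util.load[Phrases](bigramOutputPath)
    val bigramPhraserBc = spark.sparkContext.broadcast(new Phraser(bigramPhrases))

    // 2. merge detected bigrams into the training sentences
    val sentenceBigramsDf = sentencesDf
                  .map(sentence => bigramPhraserBc.value.apply(sentence.split(" ")))

    // 3. a second pass now learns (bigram, word) collocations, i.e. trigrams
    CorpusHolder.learnAndSave(spark, sentenceBigramsDf, configBc, scorer, trigramOutputPath)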

