Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. This package makes Snowball stemmers available as part of the Spark ML Pipeline API.
Link against this library using SBT:
libraryDependencies += "com.github.master" %% "spark-stemming" % "0.2.0"
Using Maven:
<dependency>
    <groupId>com.github.master</groupId>
    <artifactId>spark-stemming_2.10</artifactId>
    <version>0.2.0</version>
</dependency>
Or include it when starting the Spark shell:
$ bin/spark-shell --packages com.github.master:spark-stemming_2.10:0.2.0
Currently implemented algorithms:
- Arabic
- English
- English (Porter)
- Romance stemmers:
  - French
  - Spanish
  - Portuguese
  - Italian
  - Romanian
- Germanic stemmers:
  - German
  - Dutch
- Scandinavian stemmers:
  - Swedish
  - Norwegian (Bokmål)
  - Danish
- Russian
- Finnish
- Greek
More details are available on the Snowball stemming algorithms page.
Stemmer
The Stemmer Transformer can be used directly or as part of an ML Pipeline. In particular, it combines nicely with Tokenizer.
import org.apache.spark.mllib.feature.Stemmer

val data = sqlContext
  .createDataFrame(Seq(("мама", 1), ("мыла", 2), ("раму", 3)))
  .toDF("word", "id")

val stemmed = new Stemmer()
  .setInputCol("word")
  .setOutputCol("stemmed")
  .setLanguage("Russian")
  .transform(data)

stemmed.show
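As a sketch of how Stemmer might be combined with Tokenizer inside a Pipeline (the column names `text`, `words`, and `stemmed` are arbitrary, and this assumes Stemmer accepts the token column produced by Tokenizer):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.mllib.feature.Stemmer

// Split raw text into tokens, then stem each token
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val stemmer = new Stemmer()
  .setInputCol("words")
  .setOutputCol("stemmed")
  .setLanguage("English")

// Chain the two stages; fit/transform as with any ML Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, stemmer))
val model = pipeline.fit(data)
model.transform(data).show
```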
- Build a jar using the SBT build tool.
- Include it in the driver classpath, for example via the --driver-class-path argument to the PySpark shell / spark-submit. Depending on the exact code, you may have to pass it via --jars as well.
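For example, a spark-submit invocation might look like the following (the jar and script names are placeholders; substitute the path to the jar produced by your SBT build):

```shell
$ bin/spark-submit \
    --driver-class-path spark-stemming_2.10-0.2.0.jar \
    --jars spark-stemming_2.10-0.2.0.jar \
    my_app.py
```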
def stemmer(sc, df):
    # Reach into the JVM via the Python SparkContext and create the Stemmer class
    trans = sc._jvm.org.apache.spark.mllib.feature.Stemmer()
    # Pass the underlying Java DataFrame (df._jdf) to the Java transformer;
    # the result is a Java DataFrame
    return (trans
        .setInputCol("word")
        .setOutputCol("stemmed")
        .setSplitToken(True)
        .setMinLen(4)
        .transform(df._jdf))