Dirty Cat: Dealing with dirty categorical variables (strings).

DirtyCat (Scala) is a package that leverages Spark ML to perform large-scale machine learning and provides an alternative way to encode string variables. It is largely based on the original Python code: https://github.com/dirty-cat

Documentation

https://github.com/dirty-cat
Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning journal, Springer. 2018.

Getting started: How to use it

The DirtyCat project is built for Scala 2.11.x against Spark v2.3.0.

This package is provided as is, so you will have to build and install it yourself. The following instructions should help you get started.

Build it by yourself: Installation

This project can be built with SBT 1.1.x.

Edit build.sbt to match your Scala and Spark installations. Then run the following on the command line:

sbt clean

sbt compile

sbt package

This will generate a .jar file in: target/scala_VERSION/PACKAGE.jar, where PACKAGE = com.rakuten.dirty_cat_VERSION-0.1-SNAPSHOT.jar
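
As a rough illustration of the build.sbt adjustment mentioned above, the settings most likely to need changing are the Scala version and the Spark dependency versions. The fragment below is a sketch, not the project's actual build.sbt. The versions follow the ones stated in "Getting started" (Scala 2.11.x, Spark 2.3.0); the patch version 2.11.12, the `sparkVersion` name and the `provided` scope are assumptions to adapt to your setup.

```scala
// build.sbt fragment (illustrative; compare with the project's real file)

// Assumption: any 2.11.x release matching your Spark distribution.
scalaVersion := "2.11.12"

// Hypothetical helper value so the Spark version is set in one place.
val sparkVersion = "2.3.0"

// "provided" keeps Spark out of the packaged jar, since spark-submit
// and Toree already supply it at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)
```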

If you are using Jupyter notebooks (Scala), you can add this file to the Toree Spark options (__TOREE_SPARK_OPTS__) of your Jupyter kernel.

Find your available kernels by running:

jupyter kernelspec list

Open the kernel.json file of your Scala kernel and add:

"env": {
    "DEFAULT_INTERPRETER": "Scala",
    "__TOREE_SPARK_OPTS__": "--conf spark.driver.memory=2g --conf spark.executor.cores=4 --conf spark.executor.memory=1g --jars PATH/target/scala_VERSION/PACKAGE.jar"
}

To submit your Spark application, run:

spark-submit --master local[3] --jars target/scala-2.11/dirty_cat_2.11-1.0.jar YOUR_APPLICATION

Create local package

make publish

Usage with Spark ML

Declaration

import com.rakuten.dirty_cat.feature.SimilarityEncoder

val encoder = (new SimilarityEncoder()
  .setInputCol("devices")
  .setOutputCol("devicesEncoded")
  .setSimilarityType("nGram")
  .setVocabSize(1000))
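
This README does not describe what the encoder computes. In the paper cited under Documentation, similarity encoding represents a string by its similarity to a set of reference categories. The sketch below illustrates that idea in plain Scala, without Spark. It assumes a set-based 3-gram Jaccard similarity and a vocabulary made of the most frequent values. The names `NGramSimilaritySketch`, `ngrams`, `similarity`, `vocabulary` and `encode` are hypothetical and not part of this package. How the package itself interprets `setSimilarityType("nGram")` and `setVocabSize` may differ.

```scala
// Illustrative sketch only: NOT the package's implementation.
object NGramSimilaritySketch {

  // Set of character n-grams of a string (3-grams by default).
  // A string no longer than n is treated as a single n-gram.
  def ngrams(s: String, n: Int = 3): Set[String] =
    if (s.length <= n) Set(s) else s.sliding(n).toSet

  // Jaccard-style similarity: shared n-grams over all distinct n-grams.
  def similarity(a: String, b: String, n: Int = 3): Double = {
    val (ga, gb) = (ngrams(a, n), ngrams(b, n))
    (ga intersect gb).size.toDouble / (ga union gb).size
  }

  // Keep the `vocabSize` most frequent values as the reference vocabulary
  // (ties are broken alphabetically so the result is deterministic).
  def vocabulary(values: Seq[String], vocabSize: Int): Seq[String] =
    values.groupBy(identity).toSeq
      .sortBy { case (value, occurrences) => (-occurrences.size, value) }
      .take(vocabSize)
      .map(_._1)

  // Encode a value as its similarity to each vocabulary entry.
  def encode(value: String, vocab: Seq[String]): Seq[Double] =
    vocab.map(v => similarity(value, v))
}
```

With `import NGramSimilaritySketch._`, a call such as `encode("iphone", vocabulary(devices, 1000))` would give one similarity score per retained category, where `devices` is a hypothetical `Seq[String]` of raw values. Unlike one-hot encoding, near-duplicate spellings then receive similar vectors.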

Using it in a pipeline

import org.apache.spark.ml.Pipeline

val pipeline = (new Pipeline().setStages(Array(encoder, YOUR_ESTIMATOR)))
val pipelineModel = pipeline.fit(dataframe)

Serialization

pipelineModel.write.overwrite().save("pipeline.parquet")

History

Andrés Hoyos-Idrobo started this implementation of DirtyCat as a way to improve his Spark/Scala skills.

Contributions from:

Andrés Hoyos-Idrobo

Corporate (Code) Contributors:

Rakuten Institute of Technology
