spark-corenlp

Spark DataFrame wrapper methods for CoreNLP Simple API annotators. These methods were tested with Spark 2.3.1 and Stanford CoreNLP 3.9.1.

To import the methods, use import static com.ziad.spark.nlp.functions.*;. The following annotators are available (a short sketch applying several of them appears after the note below):

  • tokenize: Splits the text into roughly “words”, using rules or methods suitable for the language being processed.
  • ssplit: Splits a sequence of tokens into sentences.
  • lemmas: Generates the word lemmas for all tokens in the corpus.
  • ner: Generates the named entity tags of the text.
  • sentiment: Measures the sentiment of an input sentence on a scale of 0 (strong negative) to 4 (strong positive).

Note: you need to add the CoreNLP models JAR for your target language to your classpath. Models are published for Arabic, Chinese, English, English (KBP), French, German, and Spanish.
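
The sketch below applies several of the annotators listed above to one column. It assumes tokenize, lemmas, and ner are exposed as UserDefinedFunction values and applied the same way as sentiment in the example that follows; the input text and the local SparkSession setup are illustrative, not part of the library.

import static com.ziad.spark.nlp.functions.*;
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AnnotatorsSketch {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("spark-corenlp-sketch")
                .master("local[*]")
                .getOrCreate();

        List<String> data = Arrays.asList("Stanford University is located in California.");
        Dataset<Row> df = session.createDataset(data, Encoders.STRING()).toDF();

        // Each annotator takes the text column and returns a new column.
        df.select(
                col("value"),
                tokenize.apply(col("value")).as("tokens"),  // words
                lemmas.apply(col("value")).as("lemmas"),    // word lemmas
                ner.apply(col("value")).as("ner"))          // named entity tags
          .show(false);

        session.stop();
    }
}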

Example of usage:

import static com.ziad.spark.nlp.functions.*;
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Collection of strings (text) to analyze.
List<String> data = Arrays.asList("first text",
        "second text");

/*
 1. Create a Dataset from the string collection (session is an existing SparkSession).
 2. Apply the UserDefinedFunction named "sentiment", which scores each text on the 0-4 scale above.
 3. Print a table with the results.
 */
Dataset<Row> df = session.createDataset(data, Encoders.STRING()).toDF();
df.select(col("value"), sentiment.apply(col("value")).as("sentiment"))
  .show();

Output:

+-----------+----------+
|      value| sentiment|
+-----------+----------+
| first text|         2|
|second text|         2|
+-----------+----------+
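
The original comments also mention turning the numeric score into a sentiment type, a step the example itself does not show. A minimal sketch of that mapping, continuing from the example above (reusing df and the sentiment function) and using Spark's built-in when/otherwise column expressions; the label thresholds are an illustration, not part of the library.

// Requires: import static org.apache.spark.sql.functions.col;
//           import static org.apache.spark.sql.functions.when;
// Continues from the example above, reusing df and the sentiment function.
df.select(col("value"), sentiment.apply(col("value")).as("sentiment"))
  .withColumn("sentimentType",
      when(col("sentiment").leq(1), "negative")        // 0-1: negative
      .when(col("sentiment").equalTo(2), "neutral")    // 2: neutral
      .otherwise("positive"))                          // 3-4: positive
  .show();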
