# Spark NLP Test
This notebook explores Spark NLP, a library of built-in, parallelized NLP methods accessible via Apache Spark. Spark NLP includes many basic functionalities such as tokenizers, lemmatizers, and stemmers. It also has pre-built pipelines that bundle many of these functionalities into a single pipeline function.

This notebook was created with a Spark cluster running on AWS EMR.

Inputs: Random set of Twitter data (500K+ rows)
Outputs: None

## Initial Setup & Configuration

In [1]:
# Install some initial packages
import os
os.system("sudo /usr/bin/pip-3.4 install findspark pandas")

0

In [2]:
# Use findspark package to connect Jupyter to Spark shell
import findspark
findspark.init('/usr/lib/spark')

In [3]:
# Load SparkSession object
import pyspark
from pyspark.sql import SparkSession

# Load other libraries
from datetime import datetime
import pyspark.sql.functions as F
from pyspark.sql.types import DateType
import pandas as pd

In [4]:
# Set jupyter display options
pd.set_option('display.max_colwidth', 1000)

### Spark NLP Setup & Configuration
First, we have to download Spark NLP. For the Python version of Spark NLP, installation is currently somewhat involved. We start with Spark's --packages option, which is the standard way of installing a package because it also downloads all of the package's dependencies.
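
As a rough illustration of how the submit arguments are put together, the sketch below builds the `PYSPARK_SUBMIT_ARGS` string from lists of package coordinates and jar paths. `build_submit_args` is a hypothetical helper written for this note, not part of PySpark or Spark NLP; the only assumptions are that both options take comma-separated values and that the string ends with `pyspark-shell`.

```python
import os


def build_submit_args(packages=(), jars=()):
    """Assemble a PYSPARK_SUBMIT_ARGS string (hypothetical helper)."""
    parts = []
    if packages:
        # Package coordinates are passed as one comma-separated value
        parts += ["--packages", ",".join(packages)]
    if jars:
        # Local jar paths are passed the same way
        parts += ["--jars", ",".join(jars)]
    # The value is expected to end with "pyspark-shell"
    parts.append("pyspark-shell")
    return " ".join(parts)


submit_args = build_submit_args(packages=["JohnSnowLabs:spark-nlp:1.5.3"])
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```

The environment variable only has an effect if it is set before the first Spark session starts the JVM.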

In [5]:
# Install spark nlp using the --packages method
pyspark_submit_args = '--packages JohnSnowLabs:spark-nlp:1.5.3 pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

In [6]:
# Initiate a spark session to kick off the package download
spark = SparkSession\
    .builder\
    .getOrCreate()

Once the dependencies have been downloaded automatically, we load the package itself using the --jars option. This is required because Python support in Spark NLP is still in early development: the .jar file has to be included explicitly so that Python can access the Python wrappers bundled inside it.
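
Why a .jar can serve Python code at all may not be obvious. A jar is a zip archive, and Python can import modules from a zip archive that appears on `sys.path`. The sketch below shows this mechanism with a stand-in archive; `demo.jar` and `demo_wrapper` are made-up names for illustration and have nothing to do with the real Spark NLP jar.

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny archive that stands in for a jar bundling a Python package.
# "demo_wrapper" is a made-up package name used only for this illustration.
tmp_dir = tempfile.mkdtemp()
fake_jar = os.path.join(tmp_dir, "demo.jar")
with zipfile.ZipFile(fake_jar, "w") as archive:
    archive.writestr(
        "demo_wrapper/__init__.py",
        "GREETING = 'hello from inside the jar'\n",
    )

# With the archive on sys.path, the import system can read modules from it
sys.path.append(fake_jar)
import demo_wrapper

print(demo_wrapper.GREETING)
```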

In [7]:
# Download the jar file to provide python the appropriate wrapper for operation
jar_source = u'http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.5.3/spark-nlp_2.11-1.5.3.jar'
jar_target = u'/home/hadoop/spark_nlp_test/data/spark-nlp_2.11-1.5.3.jar'
os.system('wget -O {} {}'.format(jar_target, jar_source))

0

In [8]:
# Load spark nlp a second time using the --jars method
pyspark_submit_args = ' --jars /home/hadoop/spark_nlp_test/data/spark-nlp_2.11-1.5.3.jar pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
os.environ["PYTHONPATH"] = jar_target

In [9]:
# Add the jars downloaded to ~/.ivy2/jars to sys.path so python can import the wrapper bundled inside them
import sys
import glob
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))

In [10]:
# Initiate SparkSession as "spark"
spark = SparkSession\
    .builder\
    .getOrCreate()

In [11]:
# Load spark nlp processing libraries
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher

## Load Data
Let's load 500K+ #metoo tweets (random data set for the purpose of testing).

In [12]:
# Read tweet data
tweets_df = spark.read.csv(
    "s3n://2017edmfasatb/spark_nlp_test/metootweets.csv", 
    header = True, 
    inferSchema = True
)

In [13]:
# Preview schema
tweets_df.printSchema()

root
 |-- id: string (nullable = true)
 |-- insertdate: string (nullable = true)
 |-- twitterhandle: string (nullable = true)
 |-- followers: string (nullable = true)
 |-- hashtagsearched: string (nullable = true)
 |-- tweetid: string (nullable = true)
 |-- dateoftweet: string (nullable = true)
 |-- text: string (nullable = true)
 |-- lastcontactdate: string (nullable = true)
 |-- lasttimelinepull: string (nullable = true)
 |-- lasttimetweetsanalyzed: string (nullable = true)
 |-- numberoftweetsanalysed: string (nullable = true)
 |-- numberoftweetsabouthash: string (nullable = true)
 |-- actualtwitterdate: string (nullable = true)



In [14]:
# Count number of rows
tweets_df.count()

552180

In [15]:
# Preview tweets
tweets_df.limit(5).toPandas()[['text']]

Unnamed: 0,text
0,RT @IxAmandaDelgado: @Navegaciones @FelipeCalderon @comsatori Cuando esta se?ora habla es como leer los twits de Ivanka Trump con el HT #Me?
1,"RT @alexwitze: .@NSF will require institutions that receive grant funds to tell them if PIs, co-PIs or anyone on the grant is found to have?"
2,Listening to the awesome feminist scholar Cynthia Enloe speaking about the relationship between the #metoo movement? https://t.co/aeoOhchgwA
3,???????????????????????????????????????????????????????????? https://t.co/gWAWGlKa36
4,"RT @AlbertoBernalLe: ?A ver, donde est?n todas las voceras colombianas del #MeToo? ?No van a decir nada ante esto? ?De verdad se van a qued?"


## Pre-Processing
Let's remove "RT" labels for retweets, @ mentions, hashtags, and URLs, then strip any remaining characters that are not letters, digits, or spaces.

In [16]:
# RT label (matched as a whole word so that words containing "RT" are left intact)
tweets_df = tweets_df.withColumn('text_clean', F.regexp_replace(
    F.col('text'),
    r'\bRT\b',
    ''
))

# @ mentions (Twitter handles may contain underscores)
tweets_df = tweets_df.withColumn('text_clean', F.regexp_replace(
    F.col('text_clean'),
    '@[A-Za-z0-9_]+',
    ''
))

# Hashtags
tweets_df = tweets_df.withColumn('text_clean', F.regexp_replace(
    F.col('text_clean'),
    '#[A-Za-z0-9]+',
    ''
))

# URLs (raw string so the backslash reaches the regex engine unchanged)
tweets_df = tweets_df.withColumn('text_clean', F.regexp_replace(
    F.col('text_clean'),
    r'https?\S+',
    ''
))

# Remove any remaining characters that are not letters, digits, or spaces
tweets_df = tweets_df.withColumn('text_clean', F.regexp_replace(
    F.col('text_clean'),
    '[^A-Za-z0-9 ]+',
    ''
))

In [17]:
# Preview tweets
tweets_df.limit(5).toPandas()[['text', 'text_clean']]

Unnamed: 0,text,text_clean
0,RT @IxAmandaDelgado: @Navegaciones @FelipeCalderon @comsatori Cuando esta se?ora habla es como leer los twits de Ivanka Trump con el HT #Me?,Cuando esta seora habla es como leer los twits de Ivanka Trump con el HT
1,"RT @alexwitze: .@NSF will require institutions that receive grant funds to tell them if PIs, co-PIs or anyone on the grant is found to have?",will require institutions that receive grant funds to tell them if PIs coPIs or anyone on the grant is found to have
2,Listening to the awesome feminist scholar Cynthia Enloe speaking about the relationship between the #metoo movement? https://t.co/aeoOhchgwA,Listening to the awesome feminist scholar Cynthia Enloe speaking about the relationship between the movement
3,???????????????????????????????????????????????????????????? https://t.co/gWAWGlKa36,
4,"RT @AlbertoBernalLe: ?A ver, donde est?n todas las voceras colombianas del #MeToo? ?No van a decir nada ante esto? ?De verdad se van a qued?",A ver donde estn todas las voceras colombianas del No van a decir nada ante esto De verdad se van a qued
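
The cleanup can also be tried on a single string without a cluster. The following is a minimal plain-Python sketch of the same kind of chained substitutions; the pattern list and the `clean_tweet` helper are this sketch's own choices (for example, the retweet label is matched as a whole word), and the sample tweet is invented.

```python
import re

# Substitutions applied in order; each match is replaced by an empty string
CLEANING_PATTERNS = [
    r'\bRT\b',           # retweet label, as a whole word
    r'@[A-Za-z0-9_]+',   # @ mentions
    r'#[A-Za-z0-9]+',    # hashtags
    r'https?\S+',        # URLs
    r'[^A-Za-z0-9 ]+',   # anything that is not a letter, digit, or space
]


def clean_tweet(text):
    """Apply every cleaning pattern in turn (hypothetical helper)."""
    for pattern in CLEANING_PATTERNS:
        text = re.sub(pattern, '', text)
    return text


sample = "RT @someone: Read this #metoo thread https://t.co/abc123"
print(clean_tweet(sample).split())  # ['Read', 'this', 'thread']
```

Note that the last pattern also drops accented letters, which is why non-English tweets lose characters during cleaning.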


## Spark NLP
Let's try to run some basic NLP functions using Spark NLP.

### Document Assembler

In [18]:
# Initialize document assembler
document_assembler = DocumentAssembler() \
    .setInputCol("text_clean")
        
# Transform
tweets_df = document_assembler.transform(tweets_df)

In [19]:
# Preview dataset
tweets_df.limit(5).toPandas()[['text_clean', 'document']]

Unnamed: 0,text_clean,document
0,Cuando esta seora habla es como leer los twits de Ivanka Trump con el HT,"[(document, 0, 77, Cuando esta seora habla es como leer los twits de Ivanka Trump con el HT , {})]"
1,will require institutions that receive grant funds to tell them if PIs coPIs or anyone on the grant is found to have,"[(document, 0, 118, will require institutions that receive grant funds to tell them if PIs coPIs or anyone on the grant is found to have, {})]"
2,Listening to the awesome feminist scholar Cynthia Enloe speaking about the relationship between the movement,"[(document, 0, 109, Listening to the awesome feminist scholar Cynthia Enloe speaking about the relationship between the movement , {})]"
3,,"[(document, 0, 0, , {})]"
4,A ver donde estn todas las voceras colombianas del No van a decir nada ante esto De verdad se van a qued,"[(document, 0, 106, A ver donde estn todas las voceras colombianas del No van a decir nada ante esto De verdad se van a qued, {})]"


### Tokenizer

In [20]:
# Initialize tokenizer
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Transform
tweets_df = tokenizer.transform(tweets_df)

In [21]:
# Preview dataset
tweets_df.limit(5).toPandas()[['document', 'token']]

Unnamed: 0,document,token
0,"[(document, 0, 77, Cuando esta seora habla es como leer los twits de Ivanka Trump con el HT , {})]","[(token, 5, 10, Cuando, {'sentence': '1'}), (token, 12, 15, esta, {'sentence': '1'}), (token, 17, 21, seora, {'sentence': '1'}), (token, 23, 27, habla, {'sentence': '1'}), (token, 29, 30, es, {'sentence': '1'}), (token, 32, 35, como, {'sentence': '1'}), (token, 37, 40, leer, {'sentence': '1'}), (token, 42, 44, los, {'sentence': '1'}), (token, 46, 50, twits, {'sentence': '1'}), (token, 52, 53, de, {'sentence': '1'}), (token, 55, 60, Ivanka, {'sentence': '1'}), (token, 62, 66, Trump, {'sentence': '1'}), (token, 68, 70, con, {'sentence': '1'}), (token, 72, 73, el, {'sentence': '1'}), (token, 75, 76, HT, {'sentence': '1'})]"
1,"[(document, 0, 118, will require institutions that receive grant funds to tell them if PIs coPIs or anyone on the grant is found to have, {})]","[(token, 3, 6, will, {'sentence': '1'}), (token, 8, 14, require, {'sentence': '1'}), (token, 16, 27, institutions, {'sentence': '1'}), (token, 29, 32, that, {'sentence': '1'}), (token, 34, 40, receive, {'sentence': '1'}), (token, 42, 46, grant, {'sentence': '1'}), (token, 48, 52, funds, {'sentence': '1'}), (token, 54, 55, to, {'sentence': '1'}), (token, 57, 60, tell, {'sentence': '1'}), (token, 62, 65, them, {'sentence': '1'}), (token, 67, 68, if, {'sentence': '1'}), (token, 70, 72, PIs, {'sentence': '1'}), (token, 74, 78, coPIs, {'sentence': '1'}), (token, 80, 81, or, {'sentence': '1'}), (token, 83, 88, anyone, {'sentence': '1'}), (token, 90, 91, on, {'sentence': '1'}), (token, 93, 95, the, {'sentence': '1'}), (token, 97, 101, grant, {'sentence': '1'}), (token, 103, 104, is, {'sentence': '1'}), (token, 106, 110, found, {'sentence': '1'}), (token, 112, 113, to, {'sentence': '1'}), (token, 115, 118, have, {'sentence': '1'})]"
2,"[(document, 0, 109, Listening to the awesome feminist scholar Cynthia Enloe speaking about the relationship between the movement , {})]","[(token, 0, 8, Listening, {'sentence': '1'}), (token, 10, 11, to, {'sentence': '1'}), (token, 13, 15, the, {'sentence': '1'}), (token, 17, 23, awesome, {'sentence': '1'}), (token, 25, 32, feminist, {'sentence': '1'}), (token, 34, 40, scholar, {'sentence': '1'}), (token, 42, 48, Cynthia, {'sentence': '1'}), (token, 50, 54, Enloe, {'sentence': '1'}), (token, 56, 63, speaking, {'sentence': '1'}), (token, 65, 69, about, {'sentence': '1'}), (token, 71, 73, the, {'sentence': '1'}), (token, 75, 86, relationship, {'sentence': '1'}), (token, 88, 94, between, {'sentence': '1'}), (token, 96, 98, the, {'sentence': '1'}), (token, 101, 108, movement, {'sentence': '1'})]"
3,"[(document, 0, 0, , {})]",[]
4,"[(document, 0, 106, A ver donde estn todas las voceras colombianas del No van a decir nada ante esto De verdad se van a qued, {})]","[(token, 2, 2, A, {'sentence': '1'}), (token, 4, 6, ver, {'sentence': '1'}), (token, 8, 12, donde, {'sentence': '1'}), (token, 14, 17, estn, {'sentence': '1'}), (token, 19, 23, todas, {'sentence': '1'}), (token, 25, 27, las, {'sentence': '1'}), (token, 29, 35, voceras, {'sentence': '1'}), (token, 37, 47, colombianas, {'sentence': '1'}), (token, 49, 51, del, {'sentence': '1'}), (token, 54, 55, No, {'sentence': '1'}), (token, 57, 59, van, {'sentence': '1'}), (token, 61, 61, a, {'sentence': '1'}), (token, 63, 67, decir, {'sentence': '1'}), (token, 69, 72, nada, {'sentence': '1'}), (token, 74, 77, ante, {'sentence': '1'}), (token, 79, 82, esto, {'sentence': '1'}), (token, 84, 85, De, {'sentence': '1'}), (token, 87, 92, verdad, {'sentence': '1'}), (token, 94, 95, se, {'sentence': '1'}), (token, 97, 99, van, {'sentence': '1'}), (token, 101, 101, a, {'sentence': '1'}), (token, 103, 106, qued, {'sentence': '1'})]"
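
Each annotation in the preview above is a tuple of five values: a type, a start offset, an end offset, the text, and a metadata dictionary. The offsets appear to be inclusive character positions ("Listening" spans 0 to 8). The sketch below reproduces that layout with a plain whitespace split; it is not Spark NLP's tokenizer, and the field names in `Annotation` are labels chosen for this illustration.

```python
import re
from collections import namedtuple

# Field names are this sketch's own labels for the five values in each tuple
Annotation = namedtuple(
    "Annotation", ["kind", "begin", "end", "result", "metadata"]
)


def whitespace_tokens(text):
    """Split on whitespace and record inclusive character offsets."""
    return [
        Annotation("token", m.start(), m.end() - 1, m.group(), {"sentence": "1"})
        for m in re.finditer(r"\S+", text)
    ]


for annotation in whitespace_tokens("Listening to the awesome feminist scholar"):
    print(annotation.begin, annotation.end, annotation.result)
```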


### Normalizer

In [22]:
# Initialize normalizer
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")
    
# Transform
tweets_df = normalizer.transform(tweets_df)

In [23]:
# Preview dataset
tweets_df.limit(5).toPandas()[['token', 'normal']]

Unnamed: 0,token,normal
0,"[(token, 5, 10, Cuando, {'sentence': '1'}), (token, 12, 15, esta, {'sentence': '1'}), (token, 17, 21, seora, {'sentence': '1'}), (token, 23, 27, habla, {'sentence': '1'}), (token, 29, 30, es, {'sentence': '1'}), (token, 32, 35, como, {'sentence': '1'}), (token, 37, 40, leer, {'sentence': '1'}), (token, 42, 44, los, {'sentence': '1'}), (token, 46, 50, twits, {'sentence': '1'}), (token, 52, 53, de, {'sentence': '1'}), (token, 55, 60, Ivanka, {'sentence': '1'}), (token, 62, 66, Trump, {'sentence': '1'}), (token, 68, 70, con, {'sentence': '1'}), (token, 72, 73, el, {'sentence': '1'}), (token, 75, 76, HT, {'sentence': '1'})]","[(token, 5, 10, cuando, {'sentence': '1'}), (token, 12, 15, esta, {'sentence': '1'}), (token, 17, 21, seora, {'sentence': '1'}), (token, 23, 27, habla, {'sentence': '1'}), (token, 29, 30, es, {'sentence': '1'}), (token, 32, 35, como, {'sentence': '1'}), (token, 37, 40, leer, {'sentence': '1'}), (token, 42, 44, los, {'sentence': '1'}), (token, 46, 50, twits, {'sentence': '1'}), (token, 52, 53, de, {'sentence': '1'}), (token, 55, 60, ivanka, {'sentence': '1'}), (token, 62, 66, trump, {'sentence': '1'}), (token, 68, 70, con, {'sentence': '1'}), (token, 72, 73, el, {'sentence': '1'}), (token, 75, 76, ht, {'sentence': '1'})]"
1,"[(token, 3, 6, will, {'sentence': '1'}), (token, 8, 14, require, {'sentence': '1'}), (token, 16, 27, institutions, {'sentence': '1'}), (token, 29, 32, that, {'sentence': '1'}), (token, 34, 40, receive, {'sentence': '1'}), (token, 42, 46, grant, {'sentence': '1'}), (token, 48, 52, funds, {'sentence': '1'}), (token, 54, 55, to, {'sentence': '1'}), (token, 57, 60, tell, {'sentence': '1'}), (token, 62, 65, them, {'sentence': '1'}), (token, 67, 68, if, {'sentence': '1'}), (token, 70, 72, PIs, {'sentence': '1'}), (token, 74, 78, coPIs, {'sentence': '1'}), (token, 80, 81, or, {'sentence': '1'}), (token, 83, 88, anyone, {'sentence': '1'}), (token, 90, 91, on, {'sentence': '1'}), (token, 93, 95, the, {'sentence': '1'}), (token, 97, 101, grant, {'sentence': '1'}), (token, 103, 104, is, {'sentence': '1'}), (token, 106, 110, found, {'sentence': '1'}), (token, 112, 113, to, {'sentence': '1'}), (token, 115, 118, have, {'sentence': '1'})]","[(token, 3, 6, will, {'sentence': '1'}), (token, 8, 14, require, {'sentence': '1'}), (token, 16, 27, institutions, {'sentence': '1'}), (token, 29, 32, that, {'sentence': '1'}), (token, 34, 40, receive, {'sentence': '1'}), (token, 42, 46, grant, {'sentence': '1'}), (token, 48, 52, funds, {'sentence': '1'}), (token, 54, 55, to, {'sentence': '1'}), (token, 57, 60, tell, {'sentence': '1'}), (token, 62, 65, them, {'sentence': '1'}), (token, 67, 68, if, {'sentence': '1'}), (token, 70, 72, pis, {'sentence': '1'}), (token, 74, 78, copis, {'sentence': '1'}), (token, 80, 81, or, {'sentence': '1'}), (token, 83, 88, anyone, {'sentence': '1'}), (token, 90, 91, on, {'sentence': '1'}), (token, 93, 95, the, {'sentence': '1'}), (token, 97, 101, grant, {'sentence': '1'}), (token, 103, 104, is, {'sentence': '1'}), (token, 106, 110, found, {'sentence': '1'}), (token, 112, 113, to, {'sentence': '1'}), (token, 115, 118, have, {'sentence': '1'})]"
2,"[(token, 0, 8, Listening, {'sentence': '1'}), (token, 10, 11, to, {'sentence': '1'}), (token, 13, 15, the, {'sentence': '1'}), (token, 17, 23, awesome, {'sentence': '1'}), (token, 25, 32, feminist, {'sentence': '1'}), (token, 34, 40, scholar, {'sentence': '1'}), (token, 42, 48, Cynthia, {'sentence': '1'}), (token, 50, 54, Enloe, {'sentence': '1'}), (token, 56, 63, speaking, {'sentence': '1'}), (token, 65, 69, about, {'sentence': '1'}), (token, 71, 73, the, {'sentence': '1'}), (token, 75, 86, relationship, {'sentence': '1'}), (token, 88, 94, between, {'sentence': '1'}), (token, 96, 98, the, {'sentence': '1'}), (token, 101, 108, movement, {'sentence': '1'})]","[(token, 0, 8, listening, {'sentence': '1'}), (token, 10, 11, to, {'sentence': '1'}), (token, 13, 15, the, {'sentence': '1'}), (token, 17, 23, awesome, {'sentence': '1'}), (token, 25, 32, feminist, {'sentence': '1'}), (token, 34, 40, scholar, {'sentence': '1'}), (token, 42, 48, cynthia, {'sentence': '1'}), (token, 50, 54, enloe, {'sentence': '1'}), (token, 56, 63, speaking, {'sentence': '1'}), (token, 65, 69, about, {'sentence': '1'}), (token, 71, 73, the, {'sentence': '1'}), (token, 75, 86, relationship, {'sentence': '1'}), (token, 88, 94, between, {'sentence': '1'}), (token, 96, 98, the, {'sentence': '1'}), (token, 101, 108, movement, {'sentence': '1'})]"
3,[],[]
4,"[(token, 2, 2, A, {'sentence': '1'}), (token, 4, 6, ver, {'sentence': '1'}), (token, 8, 12, donde, {'sentence': '1'}), (token, 14, 17, estn, {'sentence': '1'}), (token, 19, 23, todas, {'sentence': '1'}), (token, 25, 27, las, {'sentence': '1'}), (token, 29, 35, voceras, {'sentence': '1'}), (token, 37, 47, colombianas, {'sentence': '1'}), (token, 49, 51, del, {'sentence': '1'}), (token, 54, 55, No, {'sentence': '1'}), (token, 57, 59, van, {'sentence': '1'}), (token, 61, 61, a, {'sentence': '1'}), (token, 63, 67, decir, {'sentence': '1'}), (token, 69, 72, nada, {'sentence': '1'}), (token, 74, 77, ante, {'sentence': '1'}), (token, 79, 82, esto, {'sentence': '1'}), (token, 84, 85, De, {'sentence': '1'}), (token, 87, 92, verdad, {'sentence': '1'}), (token, 94, 95, se, {'sentence': '1'}), (token, 97, 99, van, {'sentence': '1'}), (token, 101, 101, a, {'sentence': '1'}), (token, 103, 106, qued, {'sentence': '1'})]","[(token, 2, 2, a, {'sentence': '1'}), (token, 4, 6, ver, {'sentence': '1'}), (token, 8, 12, donde, {'sentence': '1'}), (token, 14, 17, estn, {'sentence': '1'}), (token, 19, 23, todas, {'sentence': '1'}), (token, 25, 27, las, {'sentence': '1'}), (token, 29, 35, voceras, {'sentence': '1'}), (token, 37, 47, colombianas, {'sentence': '1'}), (token, 49, 51, del, {'sentence': '1'}), (token, 54, 55, no, {'sentence': '1'}), (token, 57, 59, van, {'sentence': '1'}), (token, 61, 61, a, {'sentence': '1'}), (token, 63, 67, decir, {'sentence': '1'}), (token, 69, 72, nada, {'sentence': '1'}), (token, 74, 77, ante, {'sentence': '1'}), (token, 79, 82, esto, {'sentence': '1'}), (token, 84, 85, de, {'sentence': '1'}), (token, 87, 92, verdad, {'sentence': '1'}), (token, 94, 95, se, {'sentence': '1'}), (token, 97, 99, van, {'sentence': '1'}), (token, 101, 101, a, {'sentence': '1'}), (token, 103, 106, qued, {'sentence': '1'})]"
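
Judging only from the preview above, the normalizer lowercases every token while keeping its offsets. A minimal sketch of that step in plain Python follows. The character filter in `normalize_token` is an assumption of this sketch: the preview cannot confirm it, because the text had already been stripped of non-alphanumeric characters before tokenization.

```python
import re


def normalize_token(token):
    """Drop non-letters and lowercase (an approximation for illustration)."""
    return re.sub(r"[^A-Za-z]", "", token).lower()


tokens = ["Cuando", "Ivanka", "Trump", "HT", "coPIs"]
print([normalize_token(t) for t in tokens])
# ['cuando', 'ivanka', 'trump', 'ht', 'copis']
```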


### Lemmatizer

In [24]:
# Initialize pre-trained lemmatizer
from sparknlp.annotator import LemmatizerModel
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(['normal']) \
    .setOutputCol('lemma')

# Transform
tweets_df = lemmatizer.transform(tweets_df)

In [25]:
# Preview dataset
tweets_df.limit(5).toPandas()[['normal', 'lemma']]

Unnamed: 0,normal,lemma
0,"[(token, 5, 10, cuando, {'sentence': '1'}), (token, 12, 15, esta, {'sentence': '1'}), (token, 17, 21, seora, {'sentence': '1'}), (token, 23, 27, habla, {'sentence': '1'}), (token, 29, 30, es, {'sentence': '1'}), (token, 32, 35, como, {'sentence': '1'}), (token, 37, 40, leer, {'sentence': '1'}), (token, 42, 44, los, {'sentence': '1'}), (token, 46, 50, twits, {'sentence': '1'}), (token, 52, 53, de, {'sentence': '1'}), (token, 55, 60, ivanka, {'sentence': '1'}), (token, 62, 66, trump, {'sentence': '1'}), (token, 68, 70, con, {'sentence': '1'}), (token, 72, 73, el, {'sentence': '1'}), (token, 75, 76, ht, {'sentence': '1'})]","[(token, 5, 10, cuando, {'sentence': '1'}), (token, 12, 15, esta, {'sentence': '1'}), (token, 17, 21, seora, {'sentence': '1'}), (token, 23, 27, habla, {'sentence': '1'}), (token, 29, 30, es, {'sentence': '1'}), (token, 32, 35, como, {'sentence': '1'}), (token, 37, 40, leer, {'sentence': '1'}), (token, 42, 44, los, {'sentence': '1'}), (token, 46, 50, twit, {'sentence': '1'}), (token, 52, 53, de, {'sentence': '1'}), (token, 55, 60, ivanka, {'sentence': '1'}), (token, 62, 66, trump, {'sentence': '1'}), (token, 68, 70, con, {'sentence': '1'}), (token, 72, 73, el, {'sentence': '1'}), (token, 75, 76, ht, {'sentence': '1'})]"
1,"[(token, 3, 6, will, {'sentence': '1'}), (token, 8, 14, require, {'sentence': '1'}), (token, 16, 27, institutions, {'sentence': '1'}), (token, 29, 32, that, {'sentence': '1'}), (token, 34, 40, receive, {'sentence': '1'}), (token, 42, 46, grant, {'sentence': '1'}), (token, 48, 52, funds, {'sentence': '1'}), (token, 54, 55, to, {'sentence': '1'}), (token, 57, 60, tell, {'sentence': '1'}), (token, 62, 65, them, {'sentence': '1'}), (token, 67, 68, if, {'sentence': '1'}), (token, 70, 72, pis, {'sentence': '1'}), (token, 74, 78, copis, {'sentence': '1'}), (token, 80, 81, or, {'sentence': '1'}), (token, 83, 88, anyone, {'sentence': '1'}), (token, 90, 91, on, {'sentence': '1'}), (token, 93, 95, the, {'sentence': '1'}), (token, 97, 101, grant, {'sentence': '1'}), (token, 103, 104, is, {'sentence': '1'}), (token, 106, 110, found, {'sentence': '1'}), (token, 112, 113, to, {'sentence': '1'}), (token, 115, 118, have, {'sentence': '1'})]","[(token, 3, 6, will, {'sentence': '1'}), (token, 8, 14, require, {'sentence': '1'}), (token, 16, 27, institution, {'sentence': '1'}), (token, 29, 32, that, {'sentence': '1'}), (token, 34, 40, receive, {'sentence': '1'}), (token, 42, 46, grant, {'sentence': '1'}), (token, 48, 52, fund, {'sentence': '1'}), (token, 54, 55, to, {'sentence': '1'}), (token, 57, 60, tell, {'sentence': '1'}), (token, 62, 65, they, {'sentence': '1'}), (token, 67, 68, if, {'sentence': '1'}), (token, 70, 72, pi, {'sentence': '1'}), (token, 74, 78, copis, {'sentence': '1'}), (token, 80, 81, or, {'sentence': '1'}), (token, 83, 88, anyone, {'sentence': '1'}), (token, 90, 91, on, {'sentence': '1'}), (token, 93, 95, the, {'sentence': '1'}), (token, 97, 101, grant, {'sentence': '1'}), (token, 103, 104, be, {'sentence': '1'}), (token, 106, 110, find, {'sentence': '1'}), (token, 112, 113, to, {'sentence': '1'}), (token, 115, 118, have, {'sentence': '1'})]"
2,"[(token, 0, 8, listening, {'sentence': '1'}), (token, 10, 11, to, {'sentence': '1'}), (token, 13, 15, the, {'sentence': '1'}), (token, 17, 23, awesome, {'sentence': '1'}), (token, 25, 32, feminist, {'sentence': '1'}), (token, 34, 40, scholar, {'sentence': '1'}), (token, 42, 48, cynthia, {'sentence': '1'}), (token, 50, 54, enloe, {'sentence': '1'}), (token, 56, 63, speaking, {'sentence': '1'}), (token, 65, 69, about, {'sentence': '1'}), (token, 71, 73, the, {'sentence': '1'}), (token, 75, 86, relationship, {'sentence': '1'}), (token, 88, 94, between, {'sentence': '1'}), (token, 96, 98, the, {'sentence': '1'}), (token, 101, 108, movement, {'sentence': '1'})]","[(token, 0, 8, listen, {'sentence': '1'}), (token, 10, 11, to, {'sentence': '1'}), (token, 13, 15, the, {'sentence': '1'}), (token, 17, 23, awesome, {'sentence': '1'}), (token, 25, 32, feminist, {'sentence': '1'}), (token, 34, 40, scholar, {'sentence': '1'}), (token, 42, 48, cynthia, {'sentence': '1'}), (token, 50, 54, enloe, {'sentence': '1'}), (token, 56, 63, speak, {'sentence': '1'}), (token, 65, 69, about, {'sentence': '1'}), (token, 71, 73, the, {'sentence': '1'}), (token, 75, 86, relationship, {'sentence': '1'}), (token, 88, 94, between, {'sentence': '1'}), (token, 96, 98, the, {'sentence': '1'}), (token, 101, 108, movement, {'sentence': '1'})]"
3,[],[]
4,"[(token, 2, 2, a, {'sentence': '1'}), (token, 4, 6, ver, {'sentence': '1'}), (token, 8, 12, donde, {'sentence': '1'}), (token, 14, 17, estn, {'sentence': '1'}), (token, 19, 23, todas, {'sentence': '1'}), (token, 25, 27, las, {'sentence': '1'}), (token, 29, 35, voceras, {'sentence': '1'}), (token, 37, 47, colombianas, {'sentence': '1'}), (token, 49, 51, del, {'sentence': '1'}), (token, 54, 55, no, {'sentence': '1'}), (token, 57, 59, van, {'sentence': '1'}), (token, 61, 61, a, {'sentence': '1'}), (token, 63, 67, decir, {'sentence': '1'}), (token, 69, 72, nada, {'sentence': '1'}), (token, 74, 77, ante, {'sentence': '1'}), (token, 79, 82, esto, {'sentence': '1'}), (token, 84, 85, de, {'sentence': '1'}), (token, 87, 92, verdad, {'sentence': '1'}), (token, 94, 95, se, {'sentence': '1'}), (token, 97, 99, van, {'sentence': '1'}), (token, 101, 101, a, {'sentence': '1'}), (token, 103, 106, qued, {'sentence': '1'})]","[(token, 2, 2, a, {'sentence': '1'}), (token, 4, 6, ver, {'sentence': '1'}), (token, 8, 12, donde, {'sentence': '1'}), (token, 14, 17, estn, {'sentence': '1'}), (token, 19, 23, todas, {'sentence': '1'}), (token, 25, 27, la, {'sentence': '1'}), (token, 29, 35, voceras, {'sentence': '1'}), (token, 37, 47, colombianas, {'sentence': '1'}), (token, 49, 51, del, {'sentence': '1'}), (token, 54, 55, no, {'sentence': '1'}), (token, 57, 59, van, {'sentence': '1'}), (token, 61, 61, a, {'sentence': '1'}), (token, 63, 67, decir, {'sentence': '1'}), (token, 69, 72, nada, {'sentence': '1'}), (token, 74, 77, ante, {'sentence': '1'}), (token, 79, 82, esto, {'sentence': '1'}), (token, 84, 85, de, {'sentence': '1'}), (token, 87, 92, verdad, {'sentence': '1'}), (token, 94, 95, se, {'sentence': '1'}), (token, 97, 99, van, {'sentence': '1'}), (token, 101, 101, a, {'sentence': '1'}), (token, 103, 106, qued, {'sentence': '1'})]"
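
The preview suggests that English tokens are mapped to a base form ("found" becomes "find", "is" becomes "be") while unknown tokens, such as most of the Spanish words, pass through unchanged. One way to picture this is a dictionary lookup with a fallback, sketched below. This assumes a lookup-style lemmatizer; `LEMMA_DICT` is a tiny made-up dictionary built from the examples above, not the pre-trained model's data.

```python
# A tiny, made-up lemma dictionary taken from the preview rows above
LEMMA_DICT = {
    "institutions": "institution",
    "funds": "fund",
    "them": "they",
    "is": "be",
    "found": "find",
    "listening": "listen",
    "speaking": "speak",
}


def lemmatize(tokens, lemma_dict=LEMMA_DICT):
    """Replace each token by its dictionary lemma; keep unknown tokens as-is."""
    return [lemma_dict.get(token, token) for token in tokens]


print(lemmatize(["institutions", "is", "found", "voceras"]))
# ['institution', 'be', 'find', 'voceras']
```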


### Finisher

In [26]:
# Initialize finisher
finisher = Finisher() \
    .setInputCols(["lemma"]) \
    .setAnnotationSplitSymbol(' ') 
    
# Transform
tweets_df = finisher.transform(tweets_df)

In [27]:
# Preview dataset
tweets_df.limit(5).toPandas()[['text', 'finished_lemma']]

Unnamed: 0,text,finished_lemma
0,RT @IxAmandaDelgado: @Navegaciones @FelipeCalderon @comsatori Cuando esta se?ora habla es como leer los twits de Ivanka Trump con el HT #Me?,cuando esta seora habla es como leer los twit de ivanka trump con el ht
1,"RT @alexwitze: .@NSF will require institutions that receive grant funds to tell them if PIs, co-PIs or anyone on the grant is found to have?",will require institution that receive grant fund to tell they if pi copis or anyone on the grant be find to have
2,Listening to the awesome feminist scholar Cynthia Enloe speaking about the relationship between the #metoo movement? https://t.co/aeoOhchgwA,listen to the awesome feminist scholar cynthia enloe speak about the relationship between the movement
3,???????????????????????????????????????????????????????????? https://t.co/gWAWGlKa36,
4,"RT @AlbertoBernalLe: ?A ver, donde est?n todas las voceras colombianas del #MeToo? ?No van a decir nada ante esto? ?De verdad se van a qued?",a ver donde estn todas la voceras colombianas del no van a decir nada ante esto de verdad se van a qued
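
In the preview above, the finisher has turned each list of annotations back into a plain string, with the results separated by the chosen split symbol. A minimal sketch of that idea follows, assuming annotations shaped like the tuples shown earlier; `finish` is a hypothetical helper, not the Finisher's actual implementation.

```python
def finish(annotations, split_symbol=" "):
    """Join the result field (index 3) of each annotation into one string."""
    return split_symbol.join(annotation[3] for annotation in annotations)


lemmas = [
    ("token", 0, 8, "listen", {"sentence": "1"}),
    ("token", 10, 11, "to", {"sentence": "1"}),
    ("token", 13, 15, "the", {"sentence": "1"}),
]
print(finish(lemmas))  # listen to the
```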
