# Environment Setup

This notebook and its Spark environment are set up from this Docker image:
https://hub.docker.com/r/johnsnowlabs/spark-nlp-workshop

I extended that image via the Dockerfile so that it also includes pandas.

pandas is used purely for aesthetic purposes: after processing with Spark, I convert the DataFrame to a local pandas DataFrame via toPandas() because it renders more readably than the Spark DataFrame show() method.

Steps to set up this environment:
1. Have Docker installed locally
2. `git clone git@github.com:ronteo/big-data-project.git`
3. `docker build -t ron/mysparknlp .`
4. `docker run -it --rm -p 8888:8888 -p 4040:4040 -v <ABSOLUTEPATH>/big-data-project/notebook:/home/jovyan/ ron/mysparknlp`
5. Go to http://localhost:8888

# Key Motivation Of Spark NLP

<img src="img/spark-nlp.png">

1. Runtime Performance
    1. Although spaCy is fast, its runtime performance is subpar when running on Spark, because the NLP pipeline is just one component of a bigger data processing pipeline.
    2. Splitting the data processing framework (Spark) from the NLP framework means most of the processing time is spent on serialization and copying strings.
    3. DataFrame integration: no need to copy data out of the Tungsten-optimized format.

2. Ecosystem
    1. Seamless integration with Spark ML
    
3. Enterprise Grade
    

Reference:

https://databricks.com/blog/2017/10/19/introducing-natural-language-processing-library-apache-spark.html



# Start spark

```python
import sparknlp
from pyspark.sql import SparkSession

spark = sparknlp.start()

# sparknlp.start() is a shortcut for:

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('MY APP') \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "6g") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.3") \
    .getOrCreate()
```

Reference:

https://nlp.johnsnowlabs.com/quickstart.html

In [1]:
# from pyspark.sql import SparkSession

# spark = SparkSession.builder \
#     .master('local[*]') \
#     .appName('MY APP') \
#     .config("spark.driver.memory", "8g") \
#     .config("spark.executor.memory", "8g") \
#     .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.3") \
#     .getOrCreate()

In [1]:
import sparknlp
spark = sparknlp.start()
spark

# Load data / Data Pre-Processing

Read the data into a DataFrame instead of an RDD, because Spark NLP integrates well with DataFrames and the [dataset](https://github.com/BobAdamsEE/SouthParkData/blob/master/All-seasons.csv) taken from https://github.com/BobAdamsEE/SouthParkData is already semi-structured (CSV).


Reference:

https://github.com/databricks/spark-csv

# Clean / Input verification

**Pre-Processing**
1. Read in data as multiline, and handle commas within text as well as double quotes
2. Filter out duplicate headers
3. Cast episode and season to int
4. Lowercase column names


**Issues**

| Symptom | Description | Remedy |
|--------|------------|--------|
| Rows with null | The input file contains entries that span multiple lines. | .option("multiLine", True) |
| select("Season").distinct() produced invalid values such as "10,5,Geraldo,"A.." | CSV parsing issue caused by spark-csv escaping with "\". This breaks entries containing quoted text such as "Okay, there's another one.  Aw, man! Look at that!  Can you believe this?! An SUV with a V8 engine, makes me sick!  ""Ticket for driving a gas-guzzler"" | .option('escape', '"') |
| select("Season").distinct() contains the entry "Season" | The header was repeated for each season in the All-seasons dataset. df.filter(col("Season") == "Season").count() yielded 17, the number of seasons minus 1, and df.filter(col("Season") == "Season").show() produced 17 rows reading the header, e.g. "Season,Episode, Character, Line" | df.filter(col("Season") != "Season") |
| orderBy on distinct episode and season was not producing the expected ordering | Although the inferSchema option was set to True, the fields were of type String instead of int. Therefore the order was lexicographical instead of numeric | df.withColumn("Season", df.Season.cast('int')) |
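
The three remedies in the table can be reproduced outside Spark. The following is a minimal sketch using Python's standard `csv` module rather than spark-csv; the `raw` string is a made-up miniature of the file (one repeated header, one field with doubled quotes and an embedded newline), not real rows from the dataset.

```python
import csv
import io

# Hypothetical miniature of the raw file: a repeated header row and a quoted
# field containing commas, doubled quotes ("") and an embedded newline.
raw = (
    'Season,Episode,Character,Line\n'
    '10,5,Geraldo,"An SUV with a V8 engine, makes me sick! '
    '""Ticket for driving a gas-guzzler""\n"\n'
    'Season,Episode,Character,Line\n'
    '9,1,Stan,"Forever.\n"\n'
)

# By default the csv module reads "" inside a quoted field as a literal quote
# and keeps newlines that occur inside quoted fields.
rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]

# Drop the header rows that were repeated inside the data.
data = [row for row in data if row[0] != "Season"]

# As strings, "10" sorts before "9"; as integers, 9 sorts before 10.
seasons_as_text = sorted(row[0] for row in data)
seasons_as_int = sorted(int(row[0]) for row in data)

print(data[0][3])
print(seasons_as_text, seasons_as_int)
```

The last two lines show why the cast to int matters: the same two seasons come out in a different order depending on the column type.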


**Any bad data issues?**

As I skimmed through the data, I did not find any bad data. This may be because the data has already been cleaned, as stated in the GitHub README.


In [17]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

df = spark.read.option("header", True) \
                .option("multiLine", True) \
                .option("inferSchema", True) \
                .option('escape', '"') \
                .csv("data/All-seasons.csv") \
                .filter(col("Season") != "Season")
 
df = df.withColumn("Season", df.Season.cast('int')) \
        .withColumn("Episode", df.Episode.cast('int'))
        
df = df.toDF(*[c.lower() for c in df.columns])


In [4]:
df.select("season").distinct().orderBy("season").show(10000)

+------+
|season|
+------+
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
+------+



In [5]:
df.select("episode").distinct().orderBy("episode").show(100000)

+-------+
|episode|
+-------+
|      1|
|      2|
|      3|
|      4|
|      5|
|      6|
|      7|
|      8|
|      9|
|     10|
|     11|
|     12|
|     13|
|     14|
|     15|
|     16|
|     17|
|     18|
+-------+



## Data Description and Schema

The CSV data consists of South Park scripts.\* There were 70,896 lines/rows before removing headers and **70,879** after removing 17 headers. Each column represents:

1. Season
2. Episode
3. Character
4. Line (Can contain multiple sentences)


Each season consists of **~2,500 to ~6,500** lines (avg: **3,937** lines), for a total of 70,879 lines.

The cast consists of **3,949** characters.
Cartman tops the number of lines spoken (**9,774**), followed by Stan (**7,680**) and Kyle (**7,099**).

\* See the "Clean" section for an explanation of why 17 header rows appear inside the data.
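
The row counts and the average quoted above are consistent with each other, which can be checked with a few lines of arithmetic. The variable names are only illustrative; the numbers are the ones reported in this section (18 seasons, as listed by the distinct-season query).

```python
raw_rows = 70_896        # rows read before filtering
repeated_headers = 17    # header rows repeated inside the data
seasons = 18             # seasons 1 through 18

clean_rows = raw_rows - repeated_headers
avg_per_season = clean_rows / seasons

print(clean_rows)                 # 70879
print(round(avg_per_season, 2))   # 3937.72
```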

In [6]:
print("Total rows: %d \n" % df.count())

df.printSchema()
df.toPandas()[:10]

Total rows: 70879 

root
 |-- season: integer (nullable = true)
 |-- episode: integer (nullable = true)
 |-- character: string (nullable = true)
 |-- line: string (nullable = true)



Unnamed: 0,season,episode,character,line
0,10,1,Stan,"You guys, you guys! Chef is going away. \n"
1,10,1,Kyle,Going away? For how long?\n
2,10,1,Stan,Forever.\n
3,10,1,Chef,I'm sorry boys.\n
4,10,1,Stan,"Chef said he's been bored, so he joining a gro..."
5,10,1,Chef,Wow!\n
6,10,1,Mrs. Garrison,Chef?? What kind of questions do you think adv...
7,10,1,Chef,What's the meaning of life? Why are we here?\n
8,10,1,Mrs. Garrison,I hope you're making the right choice.\n
9,10,1,Cartman,I'm gonna miss him. I'm gonna miss Chef and I...


In [7]:
print("Num unique characters % d" % df.select("character").distinct().count())
df.select("character").distinct().toPandas()[:10]

Num unique characters  3949


Unnamed: 0,character
0,Dougie
1,Lesbian 11
2,Pedo 5
3,Transient Man 6
4,Kip's Dad
5,Bookie
6,Miss Stevens
7,Elder 1
8,Boy in Green Shirt
9,Man on Porto Potty


In [8]:
df.groupBy("character").count().orderBy(desc("count")).toPandas()[:10]

Unnamed: 0,character,count
0,Cartman,9774
1,Stan,7680
2,Kyle,7099
3,Butters,2602
4,Randy,2467
5,Mr. Garrison,1002
6,Chef,917
7,Kenny,881
8,Sharon,862
9,Mr. Mackey,633


In [9]:
df.groupBy("season").count().orderBy("season").toPandas()

Unnamed: 0,season,count
0,1,4170
1,2,6416
2,3,5798
3,4,5680
4,5,4414
5,6,5131
6,7,4236
7,8,3601
8,9,3526
9,10,3471


In [10]:
df.groupBy("season").count().agg(avg(col("count"))).show()

+-----------------+
|       avg(count)|
+-----------------+
|3937.722222222222|
+-----------------+



# NLP Data Processing


**Terminology**

1. Annotation - Basic form of the result of a Spark NLP Operation
    1. begin
    2. end
    3. result
    4. metadata
    5. embeddings
  
2. Annotators
    1. Annotator Approaches - Represent a Spark ML Estimator; fitting one produces an Annotator Model (a Transformer).
    2. Annotator Models - Spark ML Models or Transformers, meaning they have a transform(data) function.

3. Common Functions
    1. setInputCols(column_names) - column names of annotations required by this annotator
    2. setOutputCol(column_name) - output column name
    

**Reference:**

https://nlp.johnsnowlabs.com/quickstart.html
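
To make the annotation fields listed above concrete, here is a plain-Python stand-in. It is only an illustration of the shape of the data, not Spark NLP's own class; the name `annotator_type` for the leading field is my assumption, based on the first element printed in the outputs later in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Plain-Python stand-in for a Spark NLP annotation (illustrative only)."""
    annotator_type: str          # e.g. "document" or "token" (assumed name)
    begin: int                   # index of the first character
    end: int                     # index of the last character (inclusive)
    result: str                  # the annotated text
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


text = "Forever."
doc = Annotation("document", 0, len(text) - 1, text)
print(doc)
```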

## NLP Pipeline

**Steps:**
1. **DocumentAssembler** - Prep data for annotation
2. **SentenceDetector** - Split sentences in each line
3. **Tokenizer** - Tokenize sentences
4. **Finisher** - Convert annotations to plain results

Reference:

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/vivekn-sentiment/VivekNarayanSentimentApproach.ipynb

### Step 1: DocumentAssembler

A Transformer that creates annotations of type Document, to be used by annotators further down the pipeline. We need this to have our lines annotated.

e.g.


**line:**

"You guys, you guys! Chef is going away."

**out:**
```
[
    [document, 0, 38, You guys, you guys! Chef is going away., [], [], []]
]
        
```
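
As a rough mental model of this step, the sketch below wraps a raw line the same way in plain Python. `assemble` is a hypothetical helper, not the DocumentAssembler itself; I am assuming, from the output shown in this notebook, that surrounding whitespace is trimmed and that the end offset is inclusive.

```python
def assemble(line):
    """Wrap one raw line as a document-style tuple (hypothetical helper)."""
    text = line.strip()  # drop the trailing " \n" seen in the raw lines
    # The end offset is inclusive, hence len(text) - 1.
    return ("document", 0, len(text) - 1, text, {})


print(assemble("You guys, you guys! Chef is going away. \n"))
```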

In [18]:
from sparknlp.base import DocumentAssembler, Finisher

document_assembler = DocumentAssembler() \
            .setInputCol("line")\
            .setOutputCol("document")


In [19]:
assembled = document_assembler.transform(df)
assembled.toPandas()[:10]


Unnamed: 0,season,episode,character,line,document
0,10,1,Stan,"You guys, you guys! Chef is going away. \n","[(document, 0, 38, You guys, you guys! Chef is going away., {}, [], [])]"
1,10,1,Kyle,Going away? For how long?\n,"[(document, 0, 24, Going away? For how long?, {}, [], [])]"
2,10,1,Stan,Forever.\n,"[(document, 0, 7, Forever., {}, [], [])]"
3,10,1,Chef,I'm sorry boys.\n,"[(document, 0, 14, I'm sorry boys., {}, [], [])]"
4,10,1,Stan,"Chef said he's been bored, so he joining a group called the Super Adventure Club. \n","[(document, 0, 80, Chef said he's been bored, so he joining a group called the Super Adventure Club., {}, [], [])]"
5,10,1,Chef,Wow!\n,"[(document, 0, 3, Wow!, {}, [], [])]"
6,10,1,Mrs. Garrison,Chef?? What kind of questions do you think adventuring around the world is gonna answer?!\n,"[(document, 0, 88, Chef?? What kind of questions do you think adventuring around the world is gonna answer?!, {}, [], [])]"
7,10,1,Chef,What's the meaning of life? Why are we here?\n,"[(document, 0, 43, What's the meaning of life? Why are we here?, {}, [], [])]"
8,10,1,Mrs. Garrison,I hope you're making the right choice.\n,"[(document, 0, 37, I hope you're making the right choice., {}, [], [])]"
9,10,1,Cartman,I'm gonna miss him. I'm gonna miss Chef and I...and I don't know how to tell him! \n,"[(document, 0, 80, I'm gonna miss him. I'm gonna miss Chef and I...and I don't know how to tell him!, {}, [], [])]"


In [20]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)

assembled.select("line","document").toPandas()[:3]

Unnamed: 0,line,document
0,"You guys, you guys! Chef is going away. \n","[(document, 0, 38, You guys, you guys! Chef is going away., {}, [], [])]"
1,Going away? For how long?\n,"[(document, 0, 24, Going away? For how long?, {}, [], [])]"
2,Forever.\n,"[(document, 0, 7, Forever., {}, [], [])]"


### Step 2: Split sentences within each line

e.g.

**line:**
"You guys, you guys! Chef is going away."

**out:**
```
[
    [document, 0, 18, You guys, you guys!, [sentence -> 0], [], []],
    [document, 20, 38, Chef is going away., [sentence -> 1], [], []]
]
```
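
The offsets in the example above can be reproduced with a naive regex splitter. This is only a sketch of the idea and not how the SentenceDetector is implemented: it treats any run of '.', '!' or '?' as a sentence end, so it would wrongly split abbreviations such as "Mrs." and drops text that has no closing punctuation.

```python
import re


def split_sentences(text):
    """Naive splitter: a sentence ends at a run of '.', '!' or '?'."""
    sentences = []
    for i, match in enumerate(re.finditer(r"\S[^.!?]*[.!?]+", text)):
        # match.end() is exclusive, while the annotation's end is inclusive.
        sentences.append(
            ("document", match.start(), match.end() - 1, match.group(), {"sentence": str(i)})
        )
    return sentences


for sentence in split_sentences("You guys, you guys! Chef is going away."):
    print(sentence)
```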

In [22]:
from sparknlp.annotator import *
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

In [23]:
sentence_data = sentence_detector.transform(assembled)
sentence_data.toPandas()[:5]

Unnamed: 0,season,episode,character,line,document,sentence
0,10,1,Stan,"You guys, you guys! Chef is going away. \n","[(document, 0, 38, You guys, you guys! Chef is going away., {}, [], [])]","[(document, 0, 18, You guys, you guys!, {'sentence': '0'}, [], []), (document, 20, 38, Chef is going away., {'sentence': '1'}, [], [])]"
1,10,1,Kyle,Going away? For how long?\n,"[(document, 0, 24, Going away? For how long?, {}, [], [])]","[(document, 0, 10, Going away?, {'sentence': '0'}, [], []), (document, 12, 24, For how long?, {'sentence': '1'}, [], [])]"
2,10,1,Stan,Forever.\n,"[(document, 0, 7, Forever., {}, [], [])]","[(document, 0, 7, Forever., {'sentence': '0'}, [], [])]"
3,10,1,Chef,I'm sorry boys.\n,"[(document, 0, 14, I'm sorry boys., {}, [], [])]","[(document, 0, 14, I'm sorry boys., {'sentence': '0'}, [], [])]"
4,10,1,Stan,"Chef said he's been bored, so he joining a group called the Super Adventure Club. \n","[(document, 0, 80, Chef said he's been bored, so he joining a group called the Super Adventure Club., {}, [], [])]","[(document, 0, 80, Chef said he's been bored, so he joining a group called the Super Adventure Club., {'sentence': '0'}, [], [])]"


In [24]:
sentence_data.select("line", "sentence").toPandas()[:3]

Unnamed: 0,line,sentence
0,"You guys, you guys! Chef is going away. \n","[(document, 0, 18, You guys, you guys!, {'sentence': '0'}, [], []), (document, 20, 38, Chef is going away., {'sentence': '1'}, [], [])]"
1,Going away? For how long?\n,"[(document, 0, 10, Going away?, {'sentence': '0'}, [], []), (document, 12, 24, For how long?, {'sentence': '1'}, [], [])]"
2,Forever.\n,"[(document, 0, 7, Forever., {'sentence': '0'}, [], [])]"


### Step 3:  Tokenize

**line**
You guys, you guys! Chef is going away.

**out**
```
[
	[token, 0, 2, You, [sentence -> 0], [], []],
	[token, 4, 7, guys, [sentence -> 0], [], []], 
	[token, 8, 8, ,, [sentence -> 0], [], []], 
	[token, 10, 12, you, [sentence -> 0], [], []],
	[token, 14, 17, guys, [sentence -> 0], [], []], 
	[token, 18, 18, !, [sentence -> 0], [], []],
	[token, 20, 23, Chef, [sentence -> 1], [], []], 
	[token, 25, 26, is, [sentence -> 1], [], []], 
	[token, 28, 32, going, [sentence -> 1], [], []], 
	[token, 34, 37, away, [sentence -> 1], [], []],
	[token, 38, 38, ., [sentence -> 1], [], []]
]
```
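
The token offsets above are relative to the whole line, not to the sentence. A minimal regex-based sketch of that bookkeeping follows; `tokenize` and its `offset` parameter are my own illustration, and the real Tokenizer has richer rules (for example, it splits "I'm" into "I" and "'m", which this regex does not).

```python
import re


def tokenize(sentence, offset=0, sentence_id=0):
    """Split into word runs and single punctuation marks, keeping offsets."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", sentence):
        tokens.append((
            "token",
            offset + match.start(),
            offset + match.end() - 1,  # inclusive end offset
            match.group(),
            {"sentence": str(sentence_id)},
        ))
    return tokens


# The second sentence starts at character 20 of the original line.
first = tokenize("You guys, you guys!", offset=0, sentence_id=0)
second = tokenize("Chef is going away.", offset=20, sentence_id=1)
print([token[3] for token in first + second])
```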

In [25]:
tokenizer = Tokenizer() \
            .setInputCols(["sentence"]) \
            .setOutputCol("token")

In [26]:
tokenized = tokenizer.transform(sentence_data)
tokenized.select("line","token").show(3, False)

+----+-----+
|line|token|
+----+-----+

### Step 4: Create finisher to convert annotations to results

e.g.


**token**
```
[
    [token, 0, 2, You, [sentence -> 0], [], []],
    [token, 4, 7, guys, [sentence -> 0], [], []], 
    [token, 8, 8, ,, [sentence -> 0], [], []], 
    [token, 10, 12, you, [sentence -> 0], [], []],
    [token, 14, 17, guys, [sentence -> 0], [], []], 
    [token, 18, 18, !, [sentence -> 0], [], []],
    [token, 20, 23, Chef, [sentence -> 1], [], []], 
    [token, 25, 26, is, [sentence -> 1], [], []], 
    [token, 28, 32, going, [sentence -> 1], [], []], 
    [token, 34, 37, away, [sentence -> 1], [], []],
    [token, 38, 38, ., [sentence -> 1], [], []]
]
```

**out**

```
[You, guys, ,, you, guys, !, Chef, is, going, away, .]

```
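
Conceptually, the finisher keeps only the result field of each annotation and drops the offsets and metadata. A plain-Python sketch of that idea, with `finish` as a hypothetical helper operating on tuples shaped like the ones above:

```python
def finish(annotations):
    """Keep only the result field (index 3) of each annotation tuple."""
    return [annotation[3] for annotation in annotations]


tokens = [
    ("token", 0, 6, "Forever", {"sentence": "0"}),
    ("token", 7, 7, ".", {"sentence": "0"}),
]
print(finish(tokens))  # ['Forever', '.']
```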




In [27]:
finisher = Finisher() \
    .setInputCols(["token","sentence"]) \
    .setIncludeMetadata(False)

In [28]:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    finisher
])

output_data = pipeline.fit(df).transform(df)
output_data.limit(3).toPandas()


Unnamed: 0,season,episode,character,line,finished_token,finished_sentence
0,10,1,Stan,"You guys, you guys! Chef is going away. \n","[You, guys, ,, you, guys, !, Chef, is, going, away, .]","[You guys, you guys!, Chef is going away.]"
1,10,1,Kyle,Going away? For how long?\n,"[Going, away, ?, For, how, long, ?]","[Going away?, For how long?]"
2,10,1,Stan,Forever.\n,"[Forever, .]",[Forever.]


# Use Pipeline Output for Exploration/Analysis

## Use pretrained pipeline to obtain tokens, lemma, and pos
Earlier on, I built my own custom pipeline for the purpose of learning; this time, however, I am going to use a pretrained pipeline.

The "explain_document_ml" predefined pipeline is used.

### Issues:
1. I was not able to get the code to automatically download the pretrained model using explainPipeline = PretrainedPipeline('explain_document_dl').model,
so I manually loaded the pretrained model as described in the "offline" section of https://github.com/JohnSnowLabs/spark-nlp
2. I was not able to explode the table so that each token in the array occupies its own row, due to the hardware limitations of my MacBook. At first I was also unable to write the data out to a Parquet file so that it could be exported and processed on the cluster; limiting the number of rows written solved that. I exported the file "tokens_pos" and continued processing on the cluster.



In [None]:
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_2.0.2_2.4_1556661821108.zip
!unzip explain_document_ml_en_2.0.2_2.4_1556661821108.zip -d /home/jovyan/notebook/explain_document_ml_en_2.0.2_2.4_1556661821108

--2019-05-14 00:33:23--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_2.0.2_2.4_1556661821108.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.65.51
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.65.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9903335 (9.4M) [application/zip]
Saving to: 'explain_document_ml_en_2.0.2_2.4_1556661821108.zip.1'


2019-05-14 00:33:31 (1.46 MB/s) - 'explain_document_ml_en_2.0.2_2.4_1556661821108.zip.1' saved [9903335/9903335]

Archive:  explain_document_ml_en_2.0.2_2.4_1556661821108.zip
replace /home/jovyan/notebook/explain_document_ml_en_2.0.2_2.4_1556661821108/stages/6_POS_e31ec9eb50dc/fields/POS Model/.part-00000.crc? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [3]:
#Note: Not sure why running the entire notebook from the top will cause an error 
from sparknlp.pretrained import PretrainedPipeline
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.base import *

# The pretrained pipeline expects the input column to be named "text"
df = df.withColumnRenamed("line","text")

#explainPipeline = PretrainedPipeline('explain_document_dl').model
# Load the downloaded copy of the pipeline instead of calling the line
# above, because that call is unable to download the pretrained model
explainPipeline = PipelineModel.load("explain_document_ml_en_2.0.2_2.4_1556661821108/")

finisher =  Finisher() \
    .setInputCols(["token","lemmas", "pos"]) \
    .setIncludeMetadata(False)
pipeline = Pipeline(stages=[
        explainPipeline,
        finisher
])

pipeline_output = pipeline.fit(df).transform(df)

pipeline_output.limit(5).toPandas()


Unnamed: 0,season,episode,character,text,finished_token,finished_lemmas,finished_pos
0,10,1,Stan,"You guys, you guys! Chef is going away. \n","[You, guys, ,, you, guys, !, Chef, is, going, ...","[You, guy, ,, you, guy, !, Chef, be, go, away, .]","[PRP, NNS, ,, PRP, NNS, ., NNP, VBZ, VBG, RB, .]"
1,10,1,Kyle,Going away? For how long?\n,"[Going, away, ?, For, how, long, ?]","[Going, away, ?, For, how, long, ?]","[VBG, RB, ., IN, WRB, JJ, .]"
2,10,1,Stan,Forever.\n,"[Forever, .]","[Forever, .]","[RB, .]"
3,10,1,Chef,I'm sorry boys.\n,"[I, 'm, sorry, boys, .]","[I, be, sorry, boy, .]","[PRP, VBP, JJ, NNS, .]"
4,10,1,Stan,"Chef said he's been bored, so he joining a gro...","[Chef, said, he, 's, been, bored, ,, so, he, j...","[Chef, say, he, have, be, bored, ,, so, he, jo...","[NNP, VBD, PRP, VBZ, VBN, VBN, ,, IN, PRP, VBG..."


In [5]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)


pipeline_result = pipeline_output.limit(300)[["character","text","finished_token","finished_pos"]]
pipeline_result.limit(10).toPandas()

Unnamed: 0,character,text,finished_token,finished_pos
0,Stan,"You guys, you guys! Chef is going away. \n","[You, guys, ,, you, guys, !, Chef, is, going, away, .]","[PRP, NNS, ,, PRP, NNS, ., NNP, VBZ, VBG, RB, .]"
1,Kyle,Going away? For how long?\n,"[Going, away, ?, For, how, long, ?]","[VBG, RB, ., IN, WRB, JJ, .]"
2,Stan,Forever.\n,"[Forever, .]","[RB, .]"
3,Chef,I'm sorry boys.\n,"[I, 'm, sorry, boys, .]","[PRP, VBP, JJ, NNS, .]"
4,Stan,"Chef said he's been bored, so he joining a group called the Super Adventure Club. \n","[Chef, said, he, 's, been, bored, ,, so, he, joining, a, group, called, the, Super, Adventure, Club, .]","[NNP, VBD, PRP, VBZ, VBN, VBN, ,, IN, PRP, VBG, DT, NN, VBD, DT, NNP, NNP, NNP, .]"
5,Chef,Wow!\n,"[Wow, !]","[UH, .]"
6,Mrs. Garrison,Chef?? What kind of questions do you think adventuring around the world is gonna answer?!\n,"[Chef, ?, ?, What, kind, of, questions, do, you, think, adventuring, around, the, world, is, gonna, answer, ?, !]","[NNP, ., ., WP, NN, IN, NNS, VBP, PRP, VBP, VBG, IN, DT, NN, VBZ, VBG, VBG, ., .]"
7,Chef,What's the meaning of life? Why are we here?\n,"[What, 's, the, meaning, of, life, ?, Why, are, we, here, ?]","[WP, VBZ, DT, NN, IN, NN, ., WRB, VBP, PRP, RB, .]"
8,Mrs. Garrison,I hope you're making the right choice.\n,"[I, hope, you, 're, making, the, right, choice, .]","[PRP, VBP, PRP, VBP, VBG, DT, JJ, NN, .]"
9,Cartman,I'm gonna miss him. I'm gonna miss Chef and I...and I don't know how to tell him! \n,"[I, 'm, gonna, miss, him, ., I, 'm, gonna, miss, Chef, and, I., ., ., and, I, do, n't, know, how, to, tell, him, !]","[PRP, VBP, VBG, VBG, PRP, ., PRP, VBP, VBG, VBG, NNP, CC, NN, ., ., CC, PRP, VBP, RB, VB, WRB, TO, VB, PRP, .]"


In [8]:
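from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

# Pair each token with its POS tag. Every element is itself a [token, pos]
# list, so the return type is an array of string arrays.
zip_lists = udf(lambda x, y: [list(z) for z in zip(x, y)], ArrayType(ArrayType(StringType())))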


tokens_pos_df = pipeline_result.withColumn('tokens_pos', zip_lists(col('finished_token'), col('finished_pos')))
tokens_pos_df.select("tokens_pos").show()

+--------------------+
|          tokens_pos|
+--------------------+
|[[You, PRP], [guy...|
|[[Going, VBG], [a...|
|[[Forever, RB], [...|
|[[I, PRP], ['m, V...|
|[[Chef, NNP], [sa...|
| [[Wow, UH], [!, .]]|
|[[Chef, NNP], [?,...|
|[[What, WP], ['s,...|
|[[I, PRP], [hope,...|
|[[I, PRP], ['m, V...|
|[[Dude, NNP], [,,...|
|[[And, CC], [we, ...|
|[[Bye-bye, NN], [...|
|[[Good-bye, NN], ...|
|[[So, RB], [long,...|
|[[So, RB], [long,...|
|[[Good-bye, JJ], ...|
|[[Good-bye, JJ], ...|
|[[Good-bye, NN], ...|
|[[Draw, NNP], [tw...|
+--------------------+
only showing top 20 rows



POS Labels


7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative

Reference:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
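
The adjective count itself ran on the cluster and is not part of this notebook. As a hedged sketch of the idea in plain Python (not the cluster code), the snippet below pairs tokens with their POS tags and counts those tagged JJ, JJR or JJS. The three rows are copied from the pipeline output shown earlier; `count_adjectives` is a hypothetical helper.

```python
from collections import Counter

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

# (finished_token, finished_pos) pairs copied from three rows printed above.
rows = [
    (["Going", "away", "?", "For", "how", "long", "?"],
     ["VBG", "RB", ".", "IN", "WRB", "JJ", "."]),
    (["I", "'m", "sorry", "boys", "."],
     ["PRP", "VBP", "JJ", "NNS", "."]),
    (["I", "hope", "you", "'re", "making", "the", "right", "choice", "."],
     ["PRP", "VBP", "PRP", "VBP", "VBG", "DT", "JJ", "NN", "."]),
]


def count_adjectives(rows):
    """Count lowercased tokens whose POS tag is an adjective tag."""
    counts = Counter()
    for tokens, tags in rows:
        for token, tag in zip(tokens, tags):
            if tag in ADJECTIVE_TAGS:
                counts[token.lower()] += 1
    return counts


print(count_adjectives(rows))
```

On a DataFrame the same logic would be an explode of the token/POS pairs followed by a filter and a groupBy, which is exactly the step that ran out of memory locally.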


In [9]:
!rm -r /home/jovyan/notebook/tokens_pos

rm: cannot remove '/home/jovyan/notebook/tokens_pos': No such file or directory


In [10]:
tokens_pos_df.write.parquet("./tokens_pos")

## Results of running on cluster due to limitation of running locally on my Mac

Based on the 300 sample rows processed on the cluster, the most used adjective is "right", with a count of 14.

<img src="img/cluster-1.png">

<img src="img/cluster-2.png">

<img src="img/cluster-3.png">



# How did you verify that your output is correct?
I verified the output by taking samples of the DataFrame and checking them by hand. I also checked the row counts of the DataFrame.

# Performance/scale characteristics
The processing runs in the order of seconds, both locally and on the cluster. The limitation of my Mac hardware is memory: the issue arises when exploding each word of a sentence into its own row. Based on the sample of 300 rows, we can see that the explode increases the number of rows by a factor of about 10.
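
The growth factor can be estimated before exploding anything, by averaging the number of tokens per line. The sketch below does this for the ten sample rows printed in the pretrained-pipeline output earlier (the counts were tallied by hand from that output); it is a back-of-the-envelope check, not a measurement of the full dataset.

```python
# Number of tokens in each of the ten sample rows shown earlier.
tokens_per_row = [11, 7, 2, 5, 18, 2, 19, 12, 9, 25]

rows_before = len(tokens_per_row)
rows_after = sum(tokens_per_row)   # one row per token after the explode
growth_factor = rows_after / rows_before

print(rows_before, rows_after, growth_factor)
```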

# What would you have done differently if you did this again?
I spent a lot of my time trying to solve a problem by converting the DataFrame to an RDD, but had issues with the Row object. In the end, I solved the entire problem with DataFrames. I learned that sometimes it is easier to switch to an alternative than to keep slogging away at getting something to work.

# Conclusions
I am very pleased with this exercise. I've always wanted to learn spark-nlp, and this project has given me the chance to do so. Using spark-nlp, I was able to do more than a "word count": I was able to do an "adjective count". The pipeline can be customized and also extended with an ML model, e.g. for sentiment analysis.