# Environment Setup

This notebook and its Spark environment are set up using this Docker image:
https://hub.docker.com/r/johnsnowlabs/spark-nlp-workshop

Steps to set up this environment:
1. Have Docker installed locally
2. Clone this Git repo
3. Run `docker run -it --rm -p 8888:8888 -p 4040:4040 -v <ABSOLUTEPATH>/big-data-project/notebook:/home/jovyan/ johnsnowlabs/spark-nlp-workshop`, replacing `<ABSOLUTEPATH>` with the absolute path on your machine
4. Go to http://localhost:8888

# Key Motivation Of Spark NLP

<img src="img/spark-nlp.png">

1. Runtime Performance
    1. Although spaCy is fast on its own, its runtime performance is subpar when running on Spark, because the NLP pipeline is just one component of a bigger data processing pipeline.
    2. Splitting the data processing framework (Spark) from the NLP framework means most of the processing time is spent on serialization and copying strings.
    3. DataFrame integration: no need to copy data out of the Tungsten-optimized format.

2. Ecosystem
    1. Seamless integration with Spark ML

3. Enterprise Grade
    

Reference:

https://databricks.com/blog/2017/10/19/introducing-natural-language-processing-library-apache-spark.html



# Start spark

```python
import sparknlp
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session preconfigured for Spark NLP
spark = sparknlp.start()

# The call above is a shortcut for:
spark = SparkSession.builder \
    .master('local[4]') \
    .appName('OCR Eval') \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "6g") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.3") \
    .getOrCreate()
```

Reference:

https://nlp.johnsnowlabs.com/quickstart.html

In [6]:
import sparknlp
spark = sparknlp.start()
spark

# Load data / Data Pre-Processing

Read the data into a DataFrame instead of an RDD, because Spark NLP integrates well with DataFrames and the [dataset](https://github.com/BobAdamsEE/SouthParkData/blob/master/All-seasons.csv) taken from https://github.com/BobAdamsEE/SouthParkData is already semi-structured (CSV).


Reference:

https://github.com/databricks/spark-csv

# Clean / Input verification

**Pre-Processing**
1. Read in the data as multiline, and handle commas as well as double quotes within text
2. Filter out duplicate headers
3. Cast episode and season to int
4. Lowercase column names


**Issues**

| Symptom | Description | Remedy |
|---------|-------------|--------|
| Rows with null | The input file contains entries that span multiple lines. | `.option("multiLine", True)` |
| `select("Season").distinct()` produced invalid values such as `"10,5,Geraldo,"A.."` | CSV parsing issue caused by spark-csv escaping with `\`. This breaks entries containing quoted text such as `"Okay, there's another one.  Aw, man! Look at that!  Can you believe this?! An SUV with a V8 engine, makes me sick!  ""Ticket for driving a gas-guzzler""` | `.option('escape', '"')` |
| `select("Season").distinct()` contains the entry "Season" | The header was repeated for each season in the All-seasons dataset. `df.filter(col("Season") == "Season").count()` returned 17, which is the number of seasons minus 1 (the first header is consumed by the `header` option), and `df.filter(col("Season") == "Season").show()` displayed 17 rows repeating the header, e.g. "Season,Episode, Character, Line" | `df.filter(col("Season") != "Season")` |
| `orderBy` on distinct episode and season did not produce the expected ordering | Although the `inferSchema` option was set to True, the fields were of type String instead of int. Therefore the order was lexicographic instead of numeric. | `df.withColumn("Season", df.Season.cast('int'))` |
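
The quoting issue in the table can be reproduced without Spark. The sketch below uses Python's standard `csv` module on a small, made-up record (the `Speaker` row is hypothetical and only mimics the shape of the real file) to show how a field that spans several lines and contains doubled double quotes parses into a single value once `""` is treated as an escaped quote.

```python
import csv
import io

# Hypothetical two-record sample: a header, then one entry whose "Line" field
# spans several physical lines and embeds commas and doubled double quotes.
raw = (
    'Season,Episode,Character,Line\n'
    '1,1,Speaker,"An SUV with a V8 engine, makes me sick!\n'
    '""Ticket for driving a gas-guzzler""\n'
    '"\n'
)

# csv.reader treats "" inside a quoted field as one literal quote
# (doublequote=True is the default) and keeps embedded newlines.
rows = list(csv.reader(io.StringIO(raw, newline="")))
header, record = rows

print(len(rows))    # 2: the header plus one multi-line record
print(record[:3])   # ['1', '1', 'Speaker']
print(record[3])
```

If the parser instead expected a backslash escape, the doubled quotes would end the field early and the remaining text would spill into other columns, which matches the symptom seen in the `Season` column.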


**Any bad data issues?**

As I skimmed through the data, I did not find any bad data. This may be because the data has already been cleaned, as stated in the GitHub README.


In [219]:
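from pyspark.sql.types import *
from pyspark.sql.functions import col, desc, avg  # col is needed by the filter; desc and avg by later cells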


df = spark.read.option("header", True) \
                .option("multiLine", True) \
                .option("inferSchema", True) \
                .option('escape', '"') \
                .csv("data/All-seasons.csv") \
                .filter(col("Season") != "Season")
 
df = df.withColumn("Season", df.Season.cast('int')) \
        .withColumn("Episode", df.Episode.cast('int'))
        
df = df.toDF(*[c.lower() for c in df.columns])


In [223]:
df.select("season").distinct().orderBy("season").show(100000)

+------+
|season|
+------+
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
|    11|
|    12|
|    13|
|    14|
|    15|
|    16|
|    17|
|    18|
+------+
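
The reason the cast to int matters for this ordering can be shown in plain Python. This is a minimal sketch, independent of Spark, comparing how the same season numbers sort as strings and as integers.

```python
# Season numbers as they would look if the column had stayed a string.
seasons_as_text = [str(n) for n in range(1, 19)]

lexicographic = sorted(seasons_as_text)       # compares character by character
numeric = sorted(seasons_as_text, key=int)    # compares by integer value

print(lexicographic[:5])  # ['1', '10', '11', '12', '13']
print(numeric[:5])        # ['1', '2', '3', '4', '5']
```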



In [222]:
df.select("episode").distinct().orderBy("episode").show(100000)

+-------+
|episode|
+-------+
|      1|
|      2|
|      3|
|      4|
|      5|
|      6|
|      7|
|      8|
|      9|
|     10|
|     11|
|     12|
|     13|
|     14|
|     15|
|     16|
|     17|
|     18|
+-------+



## Data Description and Schema

The CSV data consists of South Park scripts. There were 70,896 lines/rows before removing the repeated headers\* and **70,879** after removing 17 headers. The columns are:

1. Season
2. Episode
3. Character
4. Line (Can contain multiple sentences)


Each season consists of **~2,300 to ~6,400** lines (avg: **3,937** lines), for a total of 70,879 lines.

The cast consists of **3,949** distinct characters.
Cartman has the most lines (**9,774**), followed by Stan (**7,680**) and Kyle (**7,099**).

\* See the "Clean / Input verification" section for an explanation of why 17 header rows had to be removed.

In [165]:
print("Total rows: %d \n" % df.count())

df.printSchema()
df.show()

Total rows: 70879 

root
 |-- Season: integer (nullable = true)
 |-- Episode: integer (nullable = true)
 |-- Character: string (nullable = true)
 |-- Line: string (nullable = true)

+------+-------+---------------+--------------------+
|Season|Episode|      Character|                Line|
+------+-------+---------------+--------------------+
|    10|      1|           Stan|You guys, you guy...|
|    10|      1|           Kyle|Going away? For h...|
|    10|      1|           Stan|           Forever.
|
|    10|      1|           Chef|    I'm sorry boys.
|
|    10|      1|           Stan|Chef said he's be...|
|    10|      1|           Chef|               Wow!
|
|    10|      1|  Mrs. Garrison|Chef?? What kind ...|
|    10|      1|           Chef|What's the meanin...|
|    10|      1|  Mrs. Garrison|I hope you're mak...|
|    10|      1|        Cartman|I'm gonna miss hi...|
|    10|      1|           Stan|Dude, how are we ...|
|    10|      1|Mayor McDaniels|And we will all m...|
|    10|

In [225]:
print("Num unique characters % d" % df.select("character").distinct().count())
df.select("character").distinct().show(10000)

Num unique characters  3949
+--------------------+
|           character|
+--------------------+
|              Dougie|
|          Lesbian 11|
|              Pedo 5|
|     Transient Man 6|
|           Kip's Dad|
|              Bookie|
|        Miss Stevens|
|             Elder 1|
|  Boy in Green Shirt|
|  Man on Porto Potty|
|           Student 2|
|          Snoop Dogg|
|            Player 4|
|               Tyler|
|               Crips|
|           Scientist|
|               Crowd|
|       Boy Announcer|
|Secret Service Le...|
|           Old-Timer|
|                Chad|
|        Gerald, Kyle|
|          Bus Driver|
|           Comrade 1|
|          Director 1|
|         Priest Maxi|
|             Solders|
|                 Rod|
|            Patron 2|
|         Elderly man|
|       John Travolta|
|       Rob Schneider|
|              Hare 2|
|           Lesbian 7|
|              A Girl|
|    Butters, Cartman|
|              Russia|
|             Diner 4|
|            A caller|
|     

In [226]:
df.groupBy("character").count().orderBy(desc("count")).show()

+------------+-----+
|   character|count|
+------------+-----+
|     Cartman| 9774|
|        Stan| 7680|
|        Kyle| 7099|
|     Butters| 2602|
|       Randy| 2467|
|Mr. Garrison| 1002|
|        Chef|  917|
|       Kenny|  881|
|      Sharon|  862|
|  Mr. Mackey|  633|
|      Gerald|  626|
|       Jimmy|  597|
|       Wendy|  585|
|       Liane|  582|
|      Sheila|  566|
|       Jimbo|  556|
|   Announcer|  407|
|     Stephen|  357|
|       Craig|  326|
|       Clyde|  317|
+------------+-----+
only showing top 20 rows



In [227]:
df.groupBy("season").count().orderBy("season").show()

+------+-----+
|season|count|
+------+-----+
|     1| 4170|
|     2| 6416|
|     3| 5798|
|     4| 5680|
|     5| 4414|
|     6| 5131|
|     7| 4236|
|     8| 3601|
|     9| 3526|
|    10| 3471|
|    11| 3478|
|    12| 3307|
|    13| 3257|
|    14| 3346|
|    15| 3101|
|    16| 3120|
|    17| 2305|
|    18| 2522|
+------+-----+



In [228]:
df.groupBy("season").count().agg(avg(col("count"))).show()

+-----------------+
|       avg(count)|
+-----------------+
|3937.722222222222|
+-----------------+
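
As a sanity check, the summary figures can be recomputed by hand from the per-season counts printed by the `groupBy("season")` cell. The sketch below copies those counts into a plain Python list; nothing here touches Spark.

```python
from statistics import mean

# Per-season line counts copied from the groupBy("season") output (seasons 1 to 18).
lines_per_season = [4170, 6416, 5798, 5680, 4414, 5131, 4236, 3601, 3526,
                    3471, 3478, 3307, 3257, 3346, 3101, 3120, 2305, 2522]

total = sum(lines_per_season)
average = mean(lines_per_season)

print(total)                                          # 70879
print(round(average, 2))                              # 3937.72
print(min(lines_per_season), max(lines_per_season))   # 2305 6416
```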



# NLP Data Processing


**Terminology**

1. Annotation - The basic form of the result of a Spark NLP operation
    1. begin
    2. end
    3. result
    4. metadata
    5. embeddings

2. Annotators
    1. Annotator Approaches - Represent a Spark ML Estimator; fitting one produces an annotator model (a transformer)
    2. Annotator Models - Spark ML models or transformers, meaning they have a transform(data) function

3. Common Functions
    1. setInputCols(column_names) - Column names of the annotations required by this annotator
    2. setOutputCol(column_name) - Output column name
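
To make the annotation fields concrete, here is a simplified stand-in written as a plain Python dataclass. `AnnotationSketch` and its field names are illustrative assumptions that mirror the list above; this is not the Spark NLP class itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotationSketch:
    """Simplified stand-in for a Spark NLP annotation (not the library class)."""
    annotator_type: str    # e.g. "document" or "token"
    begin: int             # index of the first character
    end: int               # index of the last character (inclusive)
    result: str            # the annotated text
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


doc = AnnotationSketch("document", 0, 38, "You guys, you guys! Chef is going away.")

# begin and end are inclusive, so the span length equals the text length.
print(doc.end - doc.begin + 1 == len(doc.result))  # True
```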
    

**Reference:**

https://nlp.johnsnowlabs.com/quickstart.html

## NLP Pipeline

**Steps:**
1. **DocumentAssembler** - Prep data for annotation
2. **SentenceDetector** - Split sentences in each line
3. **Tokenizer** - Tokenize sentences
4. **Normalizer** - Normalize tokens
5. **NorvigSweetingApproach** - Spell-check normalized tokens

Reference:

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/vivekn-sentiment/VivekNarayanSentimentApproach.ipynb

### Step 1: DocumentAssembler

A transformer that creates an annotation of type Document, to be used by annotators further down the pipeline. We need this to have our lines annotated.

e.g.


**line:**

"You guys, you guys! Chef is going away."

**out:**
```
[
    [document, 0, 38, You guys, you guys! Chef is going away., [], [], []]
]
        
```
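
The offsets in that output can be reproduced with a few lines of plain Python. This is only a sketch of the idea: `assemble_document` is a hypothetical helper, and the assumption that surrounding whitespace and newlines are dropped is inferred from the printed offsets, not taken from the library's documentation.

```python
def assemble_document(line: str) -> dict:
    """Wrap a raw line in a single document-style record (illustrative only)."""
    text = line.strip()  # assumption: surrounding whitespace/newlines are dropped
    return {
        "type": "document",
        "begin": 0,
        "end": len(text) - 1,  # inclusive index of the last character
        "result": text,
    }


doc = assemble_document("You guys, you guys! Chef is going away. \n")
print(doc["begin"], doc["end"])  # 0 38
```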

In [229]:
from sparknlp.base import DocumentAssembler

document_assembler = DocumentAssembler() \
            .setInputCol("line")\
            .setOutputCol("document")


In [230]:
assembled = document_assembler.transform(df)
assembled.show()


+------+-------+---------------+--------------------+--------------------+
|season|episode|      character|                line|            document|
+------+-------+---------------+--------------------+--------------------+
|    10|      1|           Stan|You guys, you guy...|[[document, 0, 38...|
|    10|      1|           Kyle|Going away? For h...|[[document, 0, 24...|
|    10|      1|           Stan|           Forever.
|[[document, 0, 7,...|
|    10|      1|           Chef|    I'm sorry boys.
|[[document, 0, 14...|
|    10|      1|           Stan|Chef said he's be...|[[document, 0, 80...|
|    10|      1|           Chef|               Wow!
|[[document, 0, 3,...|
|    10|      1|  Mrs. Garrison|Chef?? What kind ...|[[document, 0, 88...|
|    10|      1|           Chef|What's the meanin...|[[document, 0, 43...|
|    10|      1|  Mrs. Garrison|I hope you're mak...|[[document, 0, 37...|
|    10|      1|        Cartman|I'm gonna miss hi...|[[document, 0, 80...|
|    10|      1|         

In [231]:
assembled.select("line","document").show(3,False)

+-----------------------------------------+------------------------------------------------------------------------+
|line                                     |document                                                                |
+-----------------------------------------+------------------------------------------------------------------------+
|You guys, you guys! Chef is going away. 
|[[document, 0, 38, You guys, you guys! Chef is going away., [], [], []]]|
|Going away? For how long?
               |[[document, 0, 24, Going away? For how long?, [], [], []]]              |
|Forever.
                                |[[document, 0, 7, Forever., [], [], []]]                                |
+-----------------------------------------+------------------------------------------------------------------------+
only showing top 3 rows



### Step 2: Split sentences within each line

e.g.

**line:**
"You guys, you guys! Chef is going away."

**out:**
```
[
    [document, 0, 18, You guys, you guys!, [sentence -> 0], [], []],
    [document, 20, 38, Chef is going away., [sentence -> 1], [], []]
]
```
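
A rough way to see where these sentence offsets come from is a regex-based splitter. The rule used here (a sentence is a run of text ending in `.`, `!` or `?`) is a deliberate simplification and an assumption of this sketch; the real SentenceDetector handles cases such as abbreviations ("Mrs.") that this pattern would split incorrectly.

```python
import re

# Hypothetical rule: start at a non-space character, run up to ., ! or ?
SENTENCE_PATTERN = re.compile(r"[^.!?\s][^.!?]*[.!?]+")


def split_sentences(text: str) -> list:
    """Return (begin, end, sentence) triples with inclusive end offsets."""
    return [(m.start(), m.end() - 1, m.group())
            for m in SENTENCE_PATTERN.finditer(text)]


sentences = split_sentences("You guys, you guys! Chef is going away.")
for span in sentences:
    print(span)
# (0, 18, 'You guys, you guys!')
# (20, 38, 'Chef is going away.')
```

Note that a trailing fragment without a terminating punctuation mark is ignored by this pattern.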

In [232]:
from sparknlp.annotator import *
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

In [233]:
sentence_data = sentence_detector.transform(assembled)
sentence_data.show(5)

+------+-------+---------+--------------------+--------------------+--------------------+
|season|episode|character|                line|            document|            sentence|
+------+-------+---------+--------------------+--------------------+--------------------+
|    10|      1|     Stan|You guys, you guy...|[[document, 0, 38...|[[document, 0, 18...|
|    10|      1|     Kyle|Going away? For h...|[[document, 0, 24...|[[document, 0, 10...|
|    10|      1|     Stan|           Forever.
|[[document, 0, 7,...|[[document, 0, 7,...|
|    10|      1|     Chef|    I'm sorry boys.
|[[document, 0, 14...|[[document, 0, 14...|
|    10|      1|     Stan|Chef said he's be...|[[document, 0, 80...|[[document, 0, 80...|
+------+-------+---------+--------------------+--------------------+--------------------+
only showing top 5 rows



In [240]:
sentence_data.select("line", "sentence").show(3, False)

+-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
|line                                     |sentence                                                                                                                           |
+-----------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
|You guys, you guys! Chef is going away. 
|[[document, 0, 18, You guys, you guys!, [sentence -> 0], [], []], [document, 20, 38, Chef is going away., [sentence -> 1], [], []]]|
|Going away? For how long?
               |[[document, 0, 10, Going away?, [sentence -> 0], [], []], [document, 12, 24, For how long?, [sentence -> 1], [], []]]              |
|Forever.
                                |[[document, 0, 7, Forever., [sentence -> 0], [], []]]                        

### Step 3:  Tokenize

**line**
You guys, you guys! Chef is going away.

**out**
```
[
	[token, 0, 2, You, [sentence -> 0], [], []],
	[token, 4, 7, guys, [sentence -> 0], [], []], 
	[token, 8, 8, ,, [sentence -> 0], [], []], 
	[token, 10, 12, you, [sentence -> 0], [], []],
	[token, 14, 17, guys, [sentence -> 0], [], []], 
	[token, 18, 18, !, [sentence -> 0], [], []],
	[token, 20, 23, Chef, [sentence -> 1], [], []], 
	[token, 25, 26, is, [sentence -> 1], [], []], 
	[token, 28, 32, going, [sentence -> 1], [], []], 
	[token, 34, 37, away, [sentence -> 1], [], []],
	[token, 38, 38, ., [sentence -> 1], [], []]
]
```
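
The token offsets above follow the same inclusive begin/end convention and are relative to the whole line, not to the sentence. A minimal sketch, assuming a simple "word characters or single punctuation mark" rule (the real Tokenizer has its own rules, and this pattern would split contractions such as "I'm"):

```python
import re

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence: str, offset: int = 0) -> list:
    """Return (begin, end, token) triples; `offset` is where the sentence starts in the line."""
    return [(offset + m.start(), offset + m.end() - 1, m.group())
            for m in TOKEN_PATTERN.finditer(sentence)]


# Tokenize each sentence, shifting the second one by its start position in the line.
tokens = tokenize("You guys, you guys!") + tokenize("Chef is going away.", offset=20)

print(len(tokens))  # 11
print(tokens[2])    # (8, 8, ',')
print(tokens[-1])   # (38, 38, '.')
```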

In [241]:
tokenizer = Tokenizer() \
            .setInputCols(["sentence"]) \
            .setOutputCol("token")

In [245]:
tokenized = tokenizer.transform(sentence_data)
tokenized.select("line","token").show(3,False)

+-----------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|line                                     |token                                                                                                                                                                                                                                                                                                                                                                                                             

### Step 4: Normalize tokens

In [251]:
normalizer = Normalizer() \
            .setInputCols(["token"]) \
            .setOutputCol("normal")
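
As a rough illustration of what normalizing tokens means here, the sketch below strips non-letter characters and drops tokens that end up empty. The ASCII-only pattern and the optional `lowercase` flag are assumptions of this example; the Normalizer's actual cleanup pattern and defaults may differ.

```python
import re

NON_LETTERS = re.compile(r"[^A-Za-z]")  # assumption: keep ASCII letters only


def normalize(tokens, lowercase=False):
    """Strip non-letter characters from each token and drop tokens left empty."""
    cleaned = (NON_LETTERS.sub("", t) for t in tokens)
    return [t.lower() if lowercase else t for t in cleaned if t]


print(normalize(["You", "guys", ",", "you", "guys", "!"]))
# ['You', 'guys', 'you', 'guys']
```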

### Step 5: Use spell checker to correct normalized tokens

In [260]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/spell/words.txt -P /tmp

--2019-05-05 22:56:42--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/spell/words.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.144.93
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.144.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4862966 (4.6M) [text/plain]
Saving to: '/tmp/words.txt'


2019-05-05 22:56:43 (3.24 MB/s) - '/tmp/words.txt' saved [4862966/4862966]



In [262]:
spell_checker = NorvigSweetingApproach() \
            .setInputCols(["normal"]) \
            .setOutputCol("spell") \
            .setDictionary("/tmp/words.txt")
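
Because `NorvigSweetingApproach` is an annotator approach, it has to be fitted before it can correct anything. The sketch below shows that fit-then-transform shape with a much simplified, Norvig-style corrector in plain Python: it only considers candidates one edit away and uses a tiny made-up word list instead of `/tmp/words.txt`. `fit`, `SpellModelSketch`, and `edits1` are hypothetical names for this illustration and not the Spark NLP implementation.

```python
from collections import Counter
import string


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


class SpellModelSketch:
    """Plays the role of the fitted model: it only corrects words."""

    def __init__(self, counts):
        self.counts = counts

    def correct(self, word):
        if word in self.counts:
            return word
        known = [w for w in edits1(word) if w in self.counts]
        # Prefer the most frequent known candidate; sort first for determinism.
        return max(sorted(known), key=self.counts.__getitem__) if known else word


def fit(words):
    """Plays the role of the approach (estimator): learn counts, return a model."""
    return SpellModelSketch(Counter(words))


model = fit(["spelling", "checker", "corrected", "token", "token"])
print(model.correct("speling"))  # spelling
print(model.correct("tokn"))     # token
print(model.correct("zzzz"))     # zzzz (no known candidate, left unchanged)
```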