# Spark NLP

In this notebook, we will walk through some of the basic functionality of Spark NLP, which can be used to perform more advanced text processing operations than are possible with `pyspark.ml` alone.

Note that this notebook is intended to be run on an AWS EMR cluster using EMR Release 6.2 (with Spark 3.0 installed). For advanced configuration options, [follow these instructions](https://nlp.johnsnowlabs.com/docs/en/install#emr-support) to bootstrap and configure the cluster via the AWS CLI.

-----

First, let's load our packages:

In [1]:
%%configure -f
{
    "conf": {
        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1",
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
    }
}

In [None]:
sc.install_pypi_package('spark-nlp', 'https://pypi.org/simple')

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
5,application_1683728831077_0007,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Collecting spark-nlp
  Using cached https://files.pythonhosted.org/packages/65/19/c439d42f7afd75d6c9c20207db8ee0c95d7c82177b759303c7601120e91a/spark_nlp-4.4.1-py2.py3-none-any.whl
Installing collected packages: spark-nlp
Successfully installed spark-nlp-4.4.1

In [3]:
from pyspark.ml import Pipeline, PipelineModel
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import *

Next, let's create a sample DataFrame with a few text entries to work with (drawn from the first several paragraphs of the [University of Chicago's Wikipedia page](https://en.wikipedia.org/wiki/University_of_Chicago)). This data is purposefully small so that we can easily see all of the operations being performed. Because we're running this notebook on a Spark cluster, though, we can perform the same operations on much larger DataFrames using the same approach, whether that is the Amazon Customer Reviews dataset that we've been working with or a text corpus as big as the Common Crawl.

In [4]:
sample = [
    [0, 'The University of Chicago was incorporated as a coeducational institution in 1890 by the American Baptist Education Society, using $400,000 donated to the ABES to match a $600,000 donation from Baptist oil magnate and philanthropist John D. Rockefeller, and including land donated by Marshall Field. While the Rockefeller donation provided money for academic operations and long-term endowment, it was stipulated that such money could not be used for buildings. The Hyde Park campus was financed by donations from wealthy Chicagoans like Silas B. Cobb who provided the funds for the campus first building, Cobb Lecture Hall, and matched Marshall Fields pledge of $100,000. Other early benefactors included businessmen Charles L. Hutchinson (trustee, treasurer and donor of Hutchinson Commons), Martin A. Ryerson (president of the board of trustees and donor of the Ryerson Physical Laboratory) Adolphus Clay Bartlett and Leon Mandel, who funded the construction of the gymnasium and assembly hall, and George C. Walker of the Walker Museum, a relative of Cobb who encouraged his inaugural donation for facilities.'],
    [1, 'The Hyde Park campus continued the legacy of the original university of the same name, which had closed in the 1880s after its campus was foreclosed on. What became known as the Old University of Chicago had been founded by a small group of Baptist educators in 1856 through a land endowment from Senator Stephen A. Douglas. After a fire, it closed in 1886. Alumni from the Old University of Chicago are recognized as alumni of the present University of Chicago. The university depiction on its coat of arms of a phoenix rising from the ashes is a reference to the fire, foreclosure, and demolition of the Old University of Chicago campus. As an homage to this pre-1890 legacy, a single stone from the rubble of the original Douglas Hall on 34th Place was brought to the current Hyde Park location and set into the wall of the Classics Building. These connections have led the dean of the college and University of Chicago and professor of history John Boyer to conclude that the University of Chicago has, a plausible genealogy as a pre–Civil War institution']
]

data = spark.createDataFrame(sample) \
            .toDF("id", "text")

data.show(truncate=100)

+---+----------------------------------------------------------------------------------------------------+
| id|                                                                                                text|
+---+----------------------------------------------------------------------------------------------------+
|  0|The University of Chicago was incorporated as a coeducational institution in 1890 by the American...|
|  1|The Hyde Park campus continued the legacy of the original university of the same name, which had ...|
+---+----------------------------------------------------------------------------------------------------+

You'll remember from `pyspark.ml` that pipelines can be a useful approach for combining various estimators and transformers into a single workflow. Spark NLP extends this idea by introducing so-called "annotators" that can perform NLP-related estimation tasks (i.e. things that can be trained through `.fit()`) and transformation tasks (i.e. things that transform one DataFrame into another DataFrame in some way).

For instance, below, we transform raw text into a document, transform that document into tokens, and then identify the "part of speech" for each token using a pre-trained POS tagger. We can chain these transformers and estimators together into a single reproducible pipeline that can then be fit and used to transform data. Note as well that we're using the `Pipeline` class from `pyspark.ml`, so it's also easy to plug these annotators into our existing ML workflows.

In [5]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_anc", 'en')\
        .setInputCols("document", "token")\
        .setOutputCol("pos")

my_pipeline = Pipeline(
      stages = [
          documentAssembler,
          tokenizer,
          pos
      ])

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]

Once we transform our data, we can see that the pipeline has produced a separate column for each of its steps:

In [6]:
pipelineModel = my_pipeline.fit(data)

# transform data
result = pipelineModel.transform(data)

result.show()

+---+--------------------+--------------------+--------------------+--------------------+
| id|                text|            document|               token|                 pos|
+---+--------------------+--------------------+--------------------+--------------------+
|  0|The University of...|[[document, 0, 11...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|
|  1|The Hyde Park cam...|[[document, 0, 10...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|
+---+--------------------+--------------------+--------------------+--------------------+
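
Each of the annotation columns above holds an array of structs, and the truncated output shows the first few fields of each struct (annotator type, begin offset, end offset, result). As a rough plain-Python illustration (this is not Spark NLP code; the `Annotation` tuple and `covered_text` helper are hypothetical stand-ins), note that the `end` offset appears to be inclusive, since the first token `Th...` spans 0 to 2. Slicing the original text therefore needs `end + 1`:

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical stand-in for the leading fields of one annotation struct."""
    annotator_type: str
    begin: int   # offset of the first character
    end: int     # offset of the last character (assumed inclusive)
    result: str


def covered_text(text: str, ann: Annotation) -> str:
    """Recover the substring that an annotation points at."""
    return text[ann.begin:ann.end + 1]


text = "The University of Chicago was incorporated"
tokens = [
    Annotation("token", 0, 2, "The"),
    Annotation("token", 4, 13, "University"),
    Annotation("token", 15, 16, "of"),
    Annotation("token", 18, 24, "Chicago"),
]

# Each slice should match the annotation's own `result` field.
for ann in tokens:
    print(ann.begin, ann.end, covered_text(text, ann))
```

Keeping the offsets alongside the result is what lets later annotators (such as the POS tagger) line their output up with the tokens they describe.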

If we take a closer look at the token-level data, we can see the parts of speech for each of the words in our DataFrame:

In [7]:
pos_df = result.select('id', F.explode(F.arrays_zip('token.result',
                                              'token.begin',
                                              'token.end', 
                                              'pos.result', 
                                          )).alias("cols")) \
               .select("id",
                       F.expr("cols['0']").alias("chunk"),
                       F.expr("cols['1']").alias("begin"),
                       F.expr("cols['2']").alias("end"),
                       F.expr("cols['3']").alias("pos"),
                      )
pos_df.show(truncate=False)

+---+-------------+-----+---+---+
|id |chunk        |begin|end|pos|
+---+-------------+-----+---+---+
|0  |The          |0    |2  |DT |
|0  |University   |4    |13 |NNP|
|0  |of           |15   |16 |IN |
|0  |Chicago      |18   |24 |NNP|
|0  |was          |26   |28 |VBD|
|0  |incorporated |30   |41 |VBN|
|0  |as           |43   |44 |IN |
|0  |a            |46   |46 |DT |
|0  |coeducational|48   |60 |JJ |
|0  |institution  |62   |72 |NN |
|0  |in           |74   |75 |IN |
|0  |1890         |77   |80 |CD |
|0  |by           |82   |83 |IN |
|0  |the          |85   |87 |DT |
|0  |American     |89   |96 |JJ |
|0  |Baptist      |98   |104|NNP|
|0  |Education    |106  |114|NNP|
|0  |Society      |116  |122|NNP|
|0  |,            |123  |123|,  |
|0  |using        |125  |129|VBG|
+---+-------------+-----+---+---+
only showing top 20 rows
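
If the `arrays_zip` and `explode` combination above feels opaque, the following plain-Python sketch mimics what it does to a single row: the parallel arrays are zipped element by element, and each zipped element then becomes its own output row. The `zip_and_explode` helper and the shortened `row` dictionary are hypothetical and only serve as an illustration; Spark performs the equivalent work in a distributed fashion.

```python
from itertools import zip_longest


def zip_and_explode(row, array_cols, id_col="id"):
    """Mimic F.explode(F.arrays_zip(...)) for one row held as a dict.

    zip_longest pads shorter arrays with None, in the same spirit as
    arrays_zip filling in nulls for missing elements.
    """
    zipped = zip_longest(*(row[col] for col in array_cols))
    return [
        {id_col: row[id_col], **dict(zip(array_cols, values))}
        for values in zipped
    ]


# One input row with parallel arrays, shortened to three tokens.
row = {
    "id": 0,
    "token": ["The", "University", "of"],
    "begin": [0, 4, 15],
    "end": [2, 13, 16],
    "pos": ["DT", "NNP", "IN"],
}

exploded = zip_and_explode(row, ["token", "begin", "end", "pos"])
for out_row in exploded:
    print(out_row)
```

One input row with three tokens becomes three output rows, each carrying the original `id`, which is what makes later per-row aggregations (for example, grouping by `id`) possible.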

Part-of-speech tagging is only [one of many available annotators](https://nlp.johnsnowlabs.com/docs/en/annotators) in the Spark NLP ecosystem, and you're encouraged to look through the documentation. Note, for instance, that there are many pre-trained annotators (using state-of-the-art training procedures) that can be used out of the box and inserted directly into your pipelines.

Spark NLP also provides many predefined pipelines that perform a common series of transformations on your data using pre-trained models (e.g. performing NER with various embedding models). Here, we'll load a pre-trained pipeline that produces NER labels (generated by pre-trained neural network models) for each of our words to demonstrate how this works on our mini dataset.

In [8]:
pipeline = PretrainedPipeline('explain_document_dl', lang='en')
result = pipeline.transform(data)

explain_document_dl download started this may take some time.
Approx size to download 169.4 MB
[OK!]

And we can then take a look at the results; not bad for a single line of code!

In [12]:
ner_df = result.select('id', F.explode(F.arrays_zip('lemma.result',
                                 'stem.result', 
                                 'ner.result'
                                         )).alias("cols")) \
               .select("id",
                       F.expr("cols['0']").alias("lemma"),
                       F.expr("cols['1']").alias("stem"),
                       F.expr("cols['2']").alias("ner"),
                      )

ner_df.show(truncate=False)

+---+-------------+--------+-----+
|id |lemma        |stem    |ner  |
+---+-------------+--------+-----+
|0  |The          |the     |O    |
|0  |University   |univers |B-ORG|
|0  |of           |of      |I-ORG|
|0  |Chicago      |chicago |I-ORG|
|0  |be           |wa      |O    |
|0  |incorporate  |incorpor|O    |
|0  |as           |a       |O    |
|0  |a            |a       |O    |
|0  |coeducational|coeduc  |O    |
|0  |institution  |institut|O    |
|0  |in           |in      |O    |
|0  |1890         |1890    |O    |
|0  |by           |by      |O    |
|0  |the          |the     |O    |
|0  |American     |american|B-ORG|
|0  |Baptist      |baptist |I-ORG|
|0  |Education    |educ    |I-ORG|
|0  |Society      |societi |I-ORG|
|0  |,            |,       |O    |
|0  |use          |us      |O    |
+---+-------------+--------+-----+
only showing top 20 rows

In [13]:
ner_df.groupBy('ner') \
      .count() \
      .show()

+------+-----+
|   ner|count|
+------+-----+
|B-MISC|    3|
| I-ORG|   24|
| I-PER|   12|
| I-LOC|    7|
| B-PER|   16|
|     O|  305|
| B-ORG|   14|
| B-LOC|    8|
+------+-----+
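
The `B-` and `I-` prefixes in these labels mark the beginning and the inside (continuation) of an entity, while `O` marks tokens outside any entity. As a minimal plain-Python sketch of how such labels could be grouped back into whole entities (the `group_entities` helper is hypothetical and not part of Spark NLP), using the first tokens and labels from the output above:

```python
def group_entities(tokens, labels):
    """Group B-/I- labelled tokens into (entity_text, entity_type) pairs.

    Assumption for this sketch: an I- label that does not continue an
    entity of the same type is ignored rather than starting a new entity.
    """
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_tokens and label[2:] == current_type:
            current_tokens.append(token)
        else:
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["The", "University", "of", "Chicago", "was", "incorporated",
          "by", "the", "American", "Baptist", "Education", "Society", ","]
labels = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O",
          "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]

entities = group_entities(tokens, labels)
print(entities)
```

Because every entity starts with exactly one `B-` label, counting the `B-ORG` rows gives the number of organization mentions without having to reassemble the full names.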

---------------------


That's all we'll cover with regard to Spark NLP, but you're encouraged to play around with it further (perhaps [training your own NER model on GPUs](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/blogposts/3.NER_with_BERT.ipynb)!) and read the [excellent documentation](https://nlp.johnsnowlabs.com/docs/en/concepts) and [tutorials](https://nlp.johnsnowlabs.com/classify_documents) in more depth.

## Activity

Suppose you are engineering features that correspond to each row in the provided sample DataFrame `data` (i.e. each line of text) to use as predictors in a machine learning model, based on the results of your Spark NLP annotations. Specifically, you want to:

1. Engineer a feature that computes the number of adjectives (JJ) in each row of the DataFrame.
2. Engineer another feature that computes the number of organizations mentioned in each row of the DataFrame (as counted by the B-ORG NER label -- i.e. the beginning of an organization name).
3. Merge these counts back into the original DataFrame `data` as columns (i.e. an `adj_count` column and an `org_count` column) so that you can use them to train your ML model. Note that you can perform inner joins on DataFrames in Spark much like you can in Pandas via `df1.join(df2, on='shared_col_name')`.