<a href="https://nlp.johnsnowlabs.com"><img src="https://nlp.johnsnowlabs.com/assets/images/logo.png"/></a>

# How to Use Pretrained Pipelines

## Introduction

In this notebook we will learn how to use one of Spark NLP's pretrained pipelines. 

In NLP, there is no "one size fits all". To get the best performance you will need to gather data, build your own pipelines, and train your own models. So why would you want to use a pretrained pipeline? There are a number of reasons:

1. You are learning about Spark NLP.
  - If you are learning a library, you want to be able to quickly run code to see what development might look like.
2. You want to establish a baseline.
  - In order to begin experimenting, you need to establish a baseline. You don't want to have to re-invent well understood techniques and best practices. You can use a pretrained pipeline to establish a good baseline, and then measure the improvements you get from tuning for your data and application.
3. You are exploring a data set.
  - Perhaps your application requires you to work with a large corpus with which you are unfamiliar. Apache Spark is a great tool for exploring large data sets. When you combine it with Spark NLP, you can easily gather some basic understanding of the vocabulary of your text. Additionally, you can get a first pass at seeing which steps you may need to customize. 

## This notebook

1. How to start with Spark NLP
2. The `explain_document_ml` pipeline
    - How to load pretrained pipelines
    - How to use the light pipeline
3. The `match_chunks` pipeline
    - Phrase chunking

## How to start with Spark NLP

When you use Spark NLP, you can quickly get started by importing `sparknlp` and calling `sparknlp.start()`. This will return a `SparkSession`.

In [1]:
import sparknlp

spark = sparknlp.start()

In [2]:
spark

Let's load our data into Spark. For the purpose of exploring these pipelines we will use just a small piece of text.

In [3]:
text = 'French author who helped pioner the science-fiction genre. ' \
        'Verne wrate about space, air, and underwater travel before ' \
        'navigable aircrast and practical submarines were invented ' \
        'and before any means of space travel had been devised.' 
text

'French author who helped pioner the science-fiction genre. Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented and before any means of space travel had been devised.'

In [4]:
test_docs = [text]

In [5]:
test_df = spark.createDataFrame([(text,)], ['text'])

Now that we have Spark up and running and our data loaded, we can explore pipelines. We will start by looking at the `explain_document_ml` pipeline.

## The `explain_document_ml` pipeline

The `explain_document_ml` pipeline is a good choice if you want to try the classic text processing techniques. Let's download this pipeline.

### How to load pretrained pipelines

In [6]:
from sparknlp.pretrained import PretrainedPipeline

In [7]:
explain_document_ml = PretrainedPipeline(
    'explain_document_ml',  # the name of the pipeline
    lang='en' # the language of the pipeline
)

explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
[OK!]


Now that we have our model, let's look at the stages of this pipeline.

#### `explain_document_ml` stages

1. Document Assembler
  - This creates a document from the given text field(s). In an annotator framework, the document is where the text and annotations are stored.
2. Sentence Detector
  - This sentence detector uses an algorithm based on Kevin Dias' `pragmatic_segmenter` (https://github.com/diasks2/pragmatic_segmenter).
3. Tokenizer
  - The tokenizer in Spark NLP is more than just a simple regex based splitter. Check the documentation at https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer
4. Lemmatizer
  - A lemmatizer tags a token with its "lemma". The "lemma" is the entry you would find it under in a dictionary. For example, "cats" -> "cat", "geese" -> "goose"
5. Stemmer
  - A stemmer tags each token with its stem. The stem is a word stripped of inflection, and possibly of some or all derivational affixes. In English, many words are identical to their stem, but others are not. Examples of stemming: "cats" -> "cat", "geese" -> "gees", "manager" -> "manag"
6. Part-of-Speech Tagger
  - A POS tagger tags tokens with a lexical category. Most people are familiar with nouns, verbs, adjectives, adverbs, prepositions, and pronouns. The acronyms are almost always based on the Penn Treebank tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Spark NLP uses a perceptron model for this.
7. Spell Checker (Norvig)
  - This tags words that may be misspelled with a potential correction. This algorithm is based on the algorithm described by Peter Norvig (http://norvig.com/spell-correct.html)
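
To make the spell-checking stage less of a black box, here is a minimal sketch of the idea in Peter Norvig's article: generate every string one edit away from a word and keep the most frequent one that appears in a known vocabulary. The `VOCAB` table, its counts, and the function names are assumptions made for illustration; this is not Spark NLP's implementation or its trained model.

```python
import string

# Hypothetical toy vocabulary with made-up frequencies (not Spark NLP's model).
VOCAB = {"pioneer": 3, "wrote": 10, "aircraft": 5, "the": 100, "space": 8}


def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    if word in VOCAB:
        return word
    candidates = [w for w in edits1(word) if w in VOCAB]
    return max(candidates, key=VOCAB.get) if candidates else word


for typo in ["pioner", "wrate", "aircrast"]:
    print(typo, "->", correct(typo))
```

Words with no known neighbour one edit away are returned unchanged, so the quality of the corrections depends heavily on the vocabulary.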
  
Let's try annotating with this pipeline.

In [8]:
annotated_df = explain_document_ml.transform(test_df)

It is super easy to run this model against the data frame. If you had a large corpus in Spark, you could process it with this pipeline, for storage, downstream consumption, or further exploration in Spark.

Now let's see what the data looks like.

In [9]:
annotated_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |    |    |-- sentence_embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metad

This seems pretty complicated. As you use more complex aspects of Spark NLP, the purpose of the elements of the schema will become clear. If we had a large data set, we could process everything and store it, but what if we only want to process a small amount of data, as in the current situation? How long does it take to annotate a single document?

In [10]:
%%timeit
annotations = annotated_df.first().asDict()

448 ms ± 37.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


448 ms, on my machine, seems a little long for one short document.

### How to use the light pipeline

Fortunately, Spark NLP has light pipelines that let you process small pieces of text without the overhead of Spark.

In [11]:
%%timeit
annotations = explain_document_ml.annotate(text)

32.7 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


32.7 ms is more like it! Now that we can process our data more quickly, let's look at what these annotators are actually giving us.

In [12]:
annotations = explain_document_ml.annotate(text)
annotations.keys()

dict_keys(['stem', 'checked', 'lemma', 'document', 'pos', 'token', 'sentence'])

Let's look at the document first.

In [13]:
for i, document in enumerate(annotations['document']):
    print(i, document)

0 French author who helped pioner the science-fiction genre. Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented and before any means of space travel had been devised.


That is the original text that we passed in.

Let's look at the sentences.

In [14]:
for i, sentence in enumerate(annotations['sentence']):
    print('{:2d} {}'.format(i, sentence))

 0 French author who helped pioner the science-fiction genre.
 1 Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented and before any means of space travel had been devised.


It seems that the sentence detector has correctly split the sentences.
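
For contrast, here is a deliberately naive splitter that breaks after sentence-final punctuation followed by whitespace. It is an illustrative sketch only (the function name and sample strings are assumptions), not the algorithm used by the sentence detector, and it shows why a rule-based segmenter is worth having: the naive version handles our text but trips over a simple abbreviation.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = ("French author who helped pioner the science-fiction genre. "
          "Verne wrate about space, air, and underwater travel.")
for i, sentence in enumerate(naive_sentences(sample)):
    print(i, sentence)

# An abbreviation is wrongly treated as a sentence boundary.
print(naive_sentences("Dr. Verne wrote novels."))
```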

Now let's look at the tokens.

In [15]:
print('{:>2s} {:20s}'.format('i', 'token'))
print('=' * 25)
for i, token in enumerate(annotations['token']):
    print('{:2d} {:20s}'.format(i, token))

 i token               
 0 French              
 1 author              
 2 who                 
 3 helped              
 4 pioner              
 5 the                 
 6 science-fiction     
 7 genre               
 8 .                   
 9 Verne               
10 wrate               
11 about               
12 space               
13 ,                   
14 air                 
15 ,                   
16 and                 
17 underwater          
18 travel              
19 before              
20 navigable           
21 aircrast            
22 and                 
23 practical           
24 submarines          
25 were                
26 invented            
27 and                 
28 before              
29 any                 
30 means               
31 of                  
32 space               
33 travel              
34 had                 
35 been                
36 devised             
37 .                   


There are plainly some spelling errors, so let's see if the spell checker caught them. Here we will print the tokens and the spell checker annotations side by side.

In [16]:
print('{:>2s} {:20s} {:20s}'.format('i', 'token', 'checked'))
print('=' * 45)
for i, token in enumerate(annotations['token']):
    checked = annotations['checked'][i]
    point = '<'*10 if token != checked else '' 
    print('{:2d} {:20s} {:20s} {}'.format(i, token, checked, point))

 i token                checked             
 0 French               French               
 1 author               author               
 2 who                  who                  
 3 helped               helped               
 4 pioner               pioneer              <<<<<<<<<<
 5 the                  the                  
 6 science-fiction      sciencefiction       <<<<<<<<<<
 7 genre                genre                
 8 .                    .                    
 9 Verne                Verne                
10 wrate                wrote                <<<<<<<<<<
11 about                about                
12 space                space                
13 ,                    ,                    
14 air                  air                  
15 ,                    ,                    
16 and                  and                  
17 underwater           underwater           
18 travel               travel               
19 before               before               
20 na

It has caught all the mistakes, although "science-fiction" was _mis_-corrected to "sciencefiction".

Now, let's look at the lemmas.

In [17]:
print('{:>2s} {:20s} {:20s} {:20s}'.format('i', 'token', 'checked', 'lemma'))
print('=' * 65)
for i, token in enumerate(annotations['token']):
    checked = annotations['checked'][i]
    lemma = annotations['lemma'][i]
    point = '<'*10 if checked != lemma else '' 
    print('{:2d} {:20s} {:20s} {:20s} {}'.format(i, token, checked, lemma, point))

 i token                checked              lemma               
 0 French               French               French               
 1 author               author               author               
 2 who                  who                  who                  
 3 helped               helped               help                 <<<<<<<<<<
 4 pioner               pioneer              pioneer              
 5 the                  the                  the                  
 6 science-fiction      sciencefiction       sciencefiction       
 7 genre                genre                genre                
 8 .                    .                    .                    
 9 Verne                Verne                Verne                
10 wrate                wrote                write                <<<<<<<<<<
11 about                about                about                
12 space                space                space                
13 ,                    ,                  

Notice that most of the changes made by the lemmatizer could be expressed as removing common suffixes, as in stemming. However, irregular forms like "had" and "wrote" cannot be handled by suffix removal; the lemmatizer has to map "wrote" to "write" by lookup.

Let's look at the stems.

In [18]:
print('{:>2s} {:20s} {:20s} {:20s} {:20s}'.format('i', 'token', 'checked', 'lemma', 'stem'))
print('=' * 85)
for i, token in enumerate(annotations['token']):
    checked = annotations['checked'][i]
    lemma = annotations['lemma'][i]
    stem = annotations['stem'][i]
    print('{:2d} {:20s} {:20s} {:20s} {:20s}'.format(i, token, checked, lemma, stem))

 i token                checked              lemma                stem                
 0 French               French               French               french              
 1 author               author               author               author              
 2 who                  who                  who                  who                 
 3 helped               helped               help                 help                
 4 pioner               pioneer              pioneer              pioneer             
 5 the                  the                  the                  the                 
 6 science-fiction      sciencefiction       sciencefiction       sciencefict         
 7 genre                genre                genre                genr                
 8 .                    .                    .                    .                   
 9 Verne                Verne                Verne                vern                
10 wrate                wrote              

Notice that a lot of the stems are not real words, e.g. "devised" -> "devis". That is because a _stem_ is not technically a word. Also, notice that "wrote" has not been modified. That is because it has no affixes to remove, so stemming cannot alter it.
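
The difference between the two approaches can be sketched in a few lines. The lookup table, suffix list, and function names below are hypothetical and far cruder than the lemmatizer and stemmer in the pipeline; the point is only that a lemmatizer looks words up, while a stemmer strips endings without knowing whether the result is a word.

```python
# Hypothetical lookup table; a real lemmatizer uses a large dictionary.
LEMMAS = {"helped": "help", "wrote": "write", "had": "have", "geese": "goose"}

# Hypothetical suffix list; a real stemmer applies many ordered rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_lemma(token):
    """Return the dictionary form if we know it, else the token itself."""
    return LEMMAS.get(token, token)


def toy_stem(token):
    """Lowercase the token and strip the first matching suffix, if any."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for token in ["helped", "devised", "wrote", "had"]:
    print('{:10s} lemma={:10s} stem={}'.format(token, toy_lemma(token), toy_stem(token)))
```

Suffix stripping turns "devised" into the non-word "devis" and leaves "wrote" alone, while the lookup maps "wrote" to "write".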

Now, let's look at the last annotation we have not yet examined, part-of-speech tagging.

In [19]:
print('{:>2s} {:20s} {:20s} {:20s} {:20s} {:20s}'.format('i', 'token', 'checked', 'lemma', 'stem', 'pos'))
print('=' * 105)
for i, token in enumerate(annotations['token']):
    checked = annotations['checked'][i]
    lemma = annotations['lemma'][i]
    stem = annotations['stem'][i]
    pos = annotations['pos'][i]
    print('{:2d} {:20s} {:20s} {:20s} {:20s} {:20s}'.format(i, token, checked, lemma, stem, pos))

 i token                checked              lemma                stem                 pos                 
 0 French               French               French               french               JJ                  
 1 author               author               author               author               NN                  
 2 who                  who                  who                  who                  WP                  
 3 helped               helped               help                 help                 VBD                 
 4 pioner               pioneer              pioneer              pioneer              NN                  
 5 the                  the                  the                  the                  DT                  
 6 science-fiction      sciencefiction       sciencefiction       sciencefict          NN                  
 7 genre                genre                genre                genr                 NN                  
 8 .                    .   

This is a lot of lexical information. In a real situation, as we look at the data we would find issues that we may want to address with custom processing or custom models. But we may also want to combine these tokens into phrases.

To do that, we need to look at the next pipeline, `match_chunks`.
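
As an aside on working with the `explain_document_ml` output: the cells above index several lists in parallel, which assumes the token-level lists are aligned. A small helper can make that assumption explicit and turn the columns into one record per token. The helper name `token_records` is hypothetical, and the dictionary below is a hand-copied slice of the first four rows printed earlier rather than a live call to the pipeline.

```python
# Hand-copied slice of the annotations printed above (first four tokens only).
annotations = {
    "token":   ["French", "author", "who", "helped"],
    "checked": ["French", "author", "who", "helped"],
    "lemma":   ["French", "author", "who", "help"],
    "stem":    ["french", "author", "who", "help"],
    "pos":     ["JJ", "NN", "WP", "VBD"],
}


def token_records(annotations, columns=("token", "checked", "lemma", "stem", "pos")):
    """Zip aligned token-level columns into a list of per-token dicts."""
    lengths = {len(annotations[c]) for c in columns}
    if len(lengths) != 1:
        raise ValueError("token-level columns are not aligned")
    rows = zip(*(annotations[c] for c in columns))
    return [dict(zip(columns, row)) for row in rows]


records = token_records(annotations)
# Keep only the tokens whose lemma differs from the spell-checked form.
changed = [r for r in records if r["checked"] != r["lemma"]]
print(changed)
```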

## The `match_chunks` pipeline

The `match_chunks` pipeline is simpler than the `explain_document_ml` pipeline. It has 4 of the same stages, but it does not have the spell checker, lemmatizer, or stemmer. It does have a _chunker_.

### Phrase Chunking

In NLP, a _chunker_ is an algorithm that takes tagged tokens and combines them into phrases. In constituency grammars, words (morphemes, to be technical) are combined into phrases based on rules controlling how phrase structures can be built. A phrase structure, like a noun phrase, has a head of the same type. So a _noun_ phrase has a _noun_ head.

In Spark NLP, the `chunker` lets you define regular expressions for chunking. This will not let you do arbitrary parsing, because human language is much more complicated than that, but it will let you quickly and easily create the phrase structures you may be interested in.

This chunker uses the following regular expression.

> `<DT>?<JJ>*<NN>+`

This can be read as an optional determiner ("the", "an", ...), followed by 0 or more adjectives, and 1 or more nouns. In English, this will recognize a large portion of noun phrases.
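
To see how such a rule behaves, here is a minimal pure-Python sketch that applies the same pattern to a list of `(token, tag)` pairs. It encodes the tag sequence as a string such as `<JJ><NN>...`, runs an ordinary regular expression over it, and maps each match back to tokens. The function name and the encoding trick are assumptions made for illustration, not a description of how Spark NLP's chunker is implemented; the tags are copied from the first sentence of the `explain_document_ml` output above.

```python
import re

# Tags copied from the first sentence of the POS output shown earlier.
tagged = [
    ("French", "JJ"), ("author", "NN"), ("who", "WP"), ("helped", "VBD"),
    ("pioner", "NN"), ("the", "DT"), ("science-fiction", "NN"),
    ("genre", "NN"), (".", "."),
]

# <DT>?<JJ>*<NN>+ written with explicit groups so each tag is one unit.
NOUN_PHRASE = r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+"


def chunk(tagged, rule=NOUN_PHRASE):
    """Return the token spans whose tag sequence matches the rule."""
    starts, encoded = [], ""
    for _, tag in tagged:
        starts.append(len(encoded))  # character offset where this tag begins
        encoded += "<" + tag + ">"
    chunks = []
    for match in re.finditer(rule, encoded):
        # Tokens whose tag starts inside the matched character span.
        indices = [i for i, s in enumerate(starts) if match.start() <= s < match.end()]
        chunks.append(" ".join(tagged[i][0] for i in indices))
    return chunks


for i, phrase in enumerate(chunk(tagged)):
    print('{:2d} {}'.format(i, phrase))
```

Because every tag is wrapped in angle brackets, `<NN>` matches only the exact tag `NN` and not longer tags that merely begin with those letters, such as `NNP`.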

Let's try it out on our text.

In [20]:
match_chunks = PretrainedPipeline('match_chunks', lang='en')

match_chunks download started this may take some time.
Approx size to download 4.3 MB
[OK!]


In [21]:
chunked_text = match_chunks.annotate(text)

First, let's remind ourselves what is in the text.

In [22]:
text

'French author who helped pioner the science-fiction genre. Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented and before any means of space travel had been devised.'

Let's look at the tokens so we can see why certain things are being marked.

In [23]:
print('{:>2s} {:20s} {:20s}'.format('i', 'token', 'pos'))
print('=' * 45)
for i, token in enumerate(chunked_text['token']):
    pos = chunked_text['pos'][i]
    print('{:2d} {:20s} {:20s}'.format(i, token, pos))

 i token                pos                 
 0 French               JJ                  
 1 author               NN                  
 2 who                  WP                  
 3 helped               VBD                 
 4 pioner               NN                  
 5 the                  DT                  
 6 science-fiction      NN                  
 7 genre                NN                  
 8 .                    .                   
 9 Verne                NNP                 
10 wrate                VB                  
11 about                IN                  
12 space                NN                  
13 ,                    ,                   
14 air                  NN                  
15 ,                    ,                   
16 and                  CC                  
17 underwater           JJ                  
18 travel               NN                  
19 before               IN                  
20 navigable            JJ                  
21 aircras

It looks like the POS tagger has misidentified "pioner" as a noun. Let's look at the chunked noun phrases.

In [24]:
for i, chunk in enumerate(chunked_text['chunk']):
    print('{:2d} {}'.format(i, chunk))

 0 French author
 1 pioner
 2 the science-fiction genre
 3 space
 4 air
 5 underwater travel
 6 navigable aircrast
 7 space travel


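This has caught most of the noun phrases. It has missed the plural noun phrase "practical submarines" and the proper noun "Verne". The pattern only matches the `NN` tag, and "Verne" is tagged `NNP` in the table above; plural nouns likewise carry a different tag (`NNS` in the Penn Treebank tag set).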

## Summary

In this tutorial you learned how to load a pretrained pipeline and run it on a Spark data frame, as well as how to run it on a single piece of text using a light pipeline. You can check out more pipelines here (https://nlp.johnsnowlabs.com/docs/en/pipelines).

You also learned a little about lemmatizers, stemmers, and phrase chunkers. Lemmatization and stemming in particular can be used with great success to reduce the size of your vocabulary. This especially helps in classification and regression tasks where you might otherwise deal with excessively high dimensionality and sparsity.
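
As a rough illustration of the vocabulary-reduction point, the sketch below counts distinct surface forms before and after mapping them to lemmas. The token list and lemma table are hypothetical and tiny; on a real corpus you would take both columns from the pipeline output instead.

```python
# Hypothetical token list and lemma table, for illustration only.
tokens = ["cat", "cats", "goose", "geese", "help", "helped", "helps"]
LEMMAS = {"cats": "cat", "geese": "goose", "helped": "help", "helps": "help"}


def vocabulary(items):
    """Return the sorted set of distinct items."""
    return sorted(set(items))


lemmas = [LEMMAS.get(token, token) for token in tokens]
print(len(vocabulary(tokens)), "distinct tokens:", vocabulary(tokens))
print(len(vocabulary(lemmas)), "distinct lemmas:", vocabulary(lemmas))
```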