<a href="https://colab.research.google.com/github/stevegabriel1/spark_nlp/blob/main/coding_session_spark_nlp_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook closely follows the ODSC East 2021 session 'Introduction to Spark NLP' by Veysel Kocaman of John Snow Labs.**

In [1]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
!bash colab_setup.sh

# Install spark-nlp-display
!pip install spark-nlp-display

--2021-04-22 04:31:22--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1594 (1.6K) [text/plain]
Saving to: ‘colab_setup.sh’


2021-04-22 04:31:22 (23.9 MB/s) - ‘colab_setup.sh’ saved [1594/1594]

setup Colab for PySpark 3.0.2 and Spark NLP 3.0.2
     |████████████████████████████████| 204.8MB 74kB/s 
     |████████████████████████████████| 153kB 23.8MB/s 
     |████████████████████████████████| 204kB 24.5MB/s 
  Building wheel for pyspark (setup.py) ... done
Collecting spark-nlp-display
  Downloading https://files.pythonhosted.org/packages/0a/d7/bda1c504e36f7a544c40e0a3de108bfe7907e77ab7eb7f188dd3915bcad4/spark_nlp_display-1.6-py3-none-an

In [2]:
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import *
import pyspark.sql.functions as F
from sparknlp.annotator import *

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

Spark NLP version 3.0.2
Apache Spark version: 3.0.2


## Using Pretrained Pipelines

For a more detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

In [3]:
from sparknlp.pretrained import PretrainedPipeline

In [4]:
pipeline_dl = PretrainedPipeline('explain_document_dl', lang='en')

explain_document_dl download started this may take some time.
Approx size to download 169.3 MB
[OK!]


**Stages**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- NER (NER with GloVe 100D embeddings, CoNLL2003 dataset)
- Lemmatizer
- Stemmer
- Part of Speech
- SpellChecker (Norvig)


In [15]:
testDoc = '''
Blue and green should be seen.
It seems that Coca Cola Amitil will be bought out very soon.
Do you prefer sparkling or regular minerall water?
Their vehicle was quick off the mark compared to most other cars.
'''

result = pipeline_dl.annotate(testDoc)


In [16]:
result.keys()

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])

In [17]:
result['entities']

['Blue', 'Coca Cola Amitil']

In [18]:
import pandas as pd

# Each selected key holds one value per token, so the lists align row by row.
df = pd.DataFrame({'token': result['token'], 'ner_label': result['ner'],
                   'spell_corrected': result['checked'], 'POS': result['pos'],
                   'lemmas': result['lemma'], 'stems': result['stem']})

df

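| | token | ner_label | spell_corrected | POS | lemmas | stems |
|---|---|---|---|---|---|---|
| 0 | Blue | B-ORG | Blue | NNP | Blue | blue |
| 1 | and | O | and | CC | and | and |
| 2 | green | O | green | NN | green | green |
| 3 | should | O | should | MD | should | should |
| 4 | be | O | be | VB | be | be |
| 5 | seen | O | seen | VBN | see | seen |
| 6 | . | O | . | . | . | . |
| 7 | It | O | It | PRP | It | it |
| 8 | seems | O | seems | VBZ | seem | seem |
| 9 | that | O | that | IN | that | that |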
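
The `ner_label` column uses IOB-style tags (`B-ORG` marks the first token of an organization, `O` a token outside any entity), while `result['entities']` holds whole chunks. The sketch below shows one way such tags can be merged into chunks. It is not Spark NLP's own implementation: `group_iob` is a hypothetical helper, and the tags for the `Coca Cola Amitil` tokens are assumed for illustration, since the table shows only the first ten rows.

```python
def group_iob(tokens, tags):
    """Merge IOB-tagged tokens into (text, label) chunks."""
    chunks = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and current and tag_label == label:
            current.append(token)  # continue the open chunk
            continue
        if current:  # any other tag closes the open chunk
            chunks.append((" ".join(current), label))
            current, label = [], None
        if prefix in ("B", "I"):  # start a new chunk (a stray I- is treated like B-)
            current, label = [token], tag_label
    if current:  # flush a chunk that runs to the end of the input
        chunks.append((" ".join(current), label))
    return chunks


# First ten tokens and tags come from the table; the last three tags are assumed.
tokens = ["Blue", "and", "green", "should", "be", "seen", ".",
          "It", "seems", "that", "Coca", "Cola", "Amitil"]
tags = ["B-ORG", "O", "O", "O", "O", "O", "O",
        "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"]

print(group_iob(tokens, tags))
```

Under these assumed tags, the grouped chunk texts are the same two strings that `result['entities']` returned earlier.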


In [19]:
detailed_result = pipeline_dl.fullAnnotate(testDoc)

detailed_result[0]['entities']

[Annotation(chunk, 1, 4, Blue, {'entity': 'ORG', 'sentence': '0', 'chunk': '0'}),
 Annotation(chunk, 46, 61, Coca Cola Amitil, {'entity': 'ORG', 'sentence': '1', 'chunk': '1'})]
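
Each `Annotation` printed above lists the annotator type, the begin and end character offsets, the matched text, and a metadata dict. The sketch below uses a hypothetical `Span` namedtuple as a stand-in for the real class, with values copied from that output, to check how the offsets relate to the input string. It suggests that they index directly into `testDoc` and that the end offset is inclusive.

```python
from collections import namedtuple

# Hypothetical stand-in for the fields of interest; not the Spark NLP class.
Span = namedtuple("Span", ["begin", "end", "result"])

test_doc = '''
Blue and green should be seen.
It seems that Coca Cola Amitil will be bought out very soon.
Do you prefer sparkling or regular minerall water?
Their vehicle was quick off the mark compared to most other cars.
'''

# Offsets and text copied from the fullAnnotate output above.
spans = [Span(1, 4, "Blue"), Span(46, 61, "Coca Cola Amitil")]

for span in spans:
    # The end offset is inclusive, so add 1 for Python's exclusive slice end.
    recovered = test_doc[span.begin:span.end + 1]
    print(repr(span.result), "matches slice:", recovered == span.result)
```

The leading newline in the string occupies index 0, which is why `Blue` begins at offset 1.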