# Apache Spark NLP

https://github.com/JohnSnowLabs/spark-nlp/releases/tag/5.3.3

New Home 
https://sparknlp.org/


## Install
https://sparknlp.org/docs/en/install#python

```bash
pip install spark-nlp==5.3.3
```

Before launching Jupyter, set the Java and Spark paths:
```bash
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/"
export SPARK_HOME="/home/tap/spark"
```
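Once Jupyter is running, a quick check from Python can confirm that both variables are visible to the notebook process. This is only a minimal sketch: `missing_env` and `REQUIRED_VARS` are hypothetical names introduced here and are not part of findspark or Spark NLP.

```python
import os

# Variables exported in the shell step above
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME")


def missing_env(names=REQUIRED_VARS, environ=None):
    """Return the variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env()
print("Missing variables:", ", ".join(missing) if missing else "none")
```

If the list is not empty, export the missing variables and restart Jupyter from that same shell.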

In [1]:
import findspark
import pyspark
import sparknlp
from pyspark.sql import SparkSession


In [None]:
# Locate the Spark installation (checks SPARK_HOME first)
findspark.find()
findspark

In [None]:
spark = (
    SparkSession.builder
    .appName("Spark NLP")
    .master("local[8]")  # run locally with 8 worker threads
    .config("spark.driver.memory", "16G")
    .config("spark.driver.maxResultSize", "0")  # 0 = no limit on collected results
    .config("spark.kryoserializer.buffer.max", "2000M")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
    .getOrCreate()
)

In [None]:
sparknlp.start(aarch64=True)

In [None]:
print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

In [None]:
%%bash 
java -version

## Named Entity Recognition
https://www.johnsnowlabs.com/visualizing-named-entities-with-spark-nlp/

### Pretrained Pipeline

In [None]:
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('entity_recognizer_lg', lang = 'it')

In [None]:
text = """
Nel corso di "Technologies for Advanced Programming" del Corso di Laurea in Informatica (L31)
all'Università di Catania si studiano un sacco di tecnologie.
Il professore Salvatore Nicotra usa Linux con CPU AMD,
ma grazie al supporto di John Snow Labs ed è un pò di pazienza
nel 2024 mostrerà questo esempio ai suoi studenti, il 6 Giugno 2023
durante la lezione al Dipartimento di Matematica e Informatica
"""

annotations = pipeline.fullAnnotate(text)[0]

In [None]:
annotations
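
The raw `annotations` dictionary is verbose, so it can help to reduce it to a list of (text, label) pairs. The sketch below assumes that the pipeline output has an `entities` key (the same column name passed to the visualizer in this notebook) and that each annotation exposes `result` and a `metadata` dictionary with an `entity` entry. `entity_pairs` is a hypothetical helper, not a Spark NLP function.

```python
def entity_pairs(annotations, column="entities"):
    """Return (chunk text, entity label) pairs from one fullAnnotate() result.

    Assumes each item under `column` has a `result` string and a `metadata`
    dict whose "entity" key holds the label; missing labels become "".
    """
    return [
        (ann.result, ann.metadata.get("entity", ""))
        for ann in annotations.get(column, [])
    ]


# Usage in the notebook, after the cell that defines `annotations`:
# for chunk, label in entity_pairs(annotations):
#     print(label, chunk)
```

Returning an empty list when the column is absent keeps the helper safe to call on pipelines that do not produce entity chunks.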

In [None]:
# Import the NER visualizer (from the spark-nlp-display package)
from sparknlp_display import NerVisualizer

# Render the recognized entities and save the rendering as an HTML file
visualiser = NerVisualizer()
visualiser.display(
    annotations,
    label_col='entities',
    document_col='document',
    save_path="display_recognize_entities.html",
)


## Docker

Build the image:

docker build sparknlp --tag tap:sparknlp

Run `standalone_ner.py` with `spark-submit`, mounting the `sparknlplibs` volume for Ivy:

docker run -v sparknlplibs:/ivy2/.ivy2 -v /home/tap/tap-workspace/tap2024/sparknlp/code/:/code --network tap --rm -it tap:sparknlp /opt/spark/bin/spark-submit --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/ivy2" --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3 /code/standalone_ner.py

Alternative: the same run without the volume, with the Ivy home and the Spark NLP pretrained-model cache both set to `/tmp`:

docker run -v /home/tap/tap-workspace/tap2024/sparknlp/code/:/code --network tap --rm -it tap:sparknlp /opt/spark/bin/spark-submit --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" --conf spark.jsl.settings.pretrained.cache_folder="/tmp" --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3 /code/standalone_ner.py