## Annotation libraries 

There are three objects to keep in mind in annotation libraries: Document, Annotation, Annotator. The idea behind annotation libraries is to augment the incoming data with the results of our NLP functions.

<br>***Document***
The document is the representation of the piece of text we wish to process. Naturally, the document must contain the text. Additionally, we often want to have an identifier associated with each document so that we can store our augmented data as structured data. This identifier will often be a title if the texts we are processing have titles.
<br>***Annotation***
The annotation is the representation of the output of our NLP functions. Each annotation needs a type so that later processing knows how to interpret it. Annotations also need to store their location within the document. For example, let’s say “pacing” occurs 134 characters into the document. Its token annotation will have 134 as the start and 140 as the end, and the lemma annotation for “pacing” will have the same location. Some annotation libraries also have a concept of document-level annotations, which do not have a location. There will be additional fields depending on the type. Simple annotations like tokens generally don’t have extra fields. Stem annotations usually have the stem that was extracted for the range of the text.
<br>***Annotator***
The annotator is the object that contains the logic for using the NLP function. The annotator will often require configuration or external data sets. Additionally, there are model-based annotators. One of the benefits of an annotation library is that annotators can take advantage of the work done by previous annotators. This naturally creates a notion of a pipeline of annotators.
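
To make the three objects concrete, here is a minimal pure-Python sketch. It is not taken from any particular library: the class names, the `fields` dictionary, the whitespace tokenizer, and the toy suffix-stripping stemmer are all assumptions made for illustration. It shows a document that carries its annotations, annotations with a type and a location, and a second annotator that builds on the output of the first.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    type: str   # e.g. "token" or "stem"; tells later stages how to read it
    start: int  # offset of the first character in the document text
    end: int    # offset just past the last character
    fields: Dict[str, str] = field(default_factory=dict)  # type-specific extras


@dataclass
class Document:
    id: str
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def of_type(self, type_: str) -> List[Annotation]:
        return [a for a in self.annotations if a.type == type_]


class Tokenizer:
    """Annotator that marks each whitespace-separated word as a token."""

    def annotate(self, doc: Document) -> None:
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            pos = start + len(word)
            doc.annotations.append(Annotation("token", start, pos))


class SuffixStemmer:
    """Toy stemmer (illustrative only) that reuses earlier token annotations."""

    def __init__(self, suffixes=("ing", "ed", "s")):
        self.suffixes = suffixes  # configuration for the annotator

    def annotate(self, doc: Document) -> None:
        for tok in doc.of_type("token"):
            word = doc.text[tok.start:tok.end].lower()
            stem = word
            for suffix in self.suffixes:
                if word.endswith(suffix) and len(word) > len(suffix) + 2:
                    stem = word[: -len(suffix)]
                    break
            # The stem annotation shares the token's location and adds one field.
            doc.annotations.append(
                Annotation("stem", tok.start, tok.end, {"stem": stem})
            )


def run_pipeline(doc: Document, annotators) -> Document:
    # Each annotator sees everything the previous annotators added.
    for annotator in annotators:
        annotator.annotate(doc)
    return doc


doc = run_pipeline(Document("doc-1", "He kept pacing"), [Tokenizer(), SuffixStemmer()])
```

Running the stemmer before the tokenizer would produce no stem annotations, which is why the order of annotators in the pipeline matters.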

spacy<br>
gensim<br>
spark nlp

# Spark NLP
Spark NLP has the same concepts as any other annotation library, but differs in how it stores annotations. Most annotation libraries store the annotations in the document object, but Spark NLP creates columns for the different types of annotations.
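
The difference in storage can be pictured with plain Python dictionaries. This is only a sketch of the idea, not Spark NLP code: the helper `to_columns` and the field names `type`, `start`, and `end` are assumptions made for illustration. A flat list of annotations, as a document object might hold it, is regrouped so that each annotation type becomes its own column of a row.

```python
from collections import defaultdict


def to_columns(doc_id, text, annotations):
    """Group a flat list of annotations into one column per annotation type."""
    columns = defaultdict(list)
    for ann in annotations:
        columns[ann["type"]].append(ann)
    row = {"id": doc_id, "text": text}
    row.update(columns)
    return row


# Annotations as a document-centric library might store them: one mixed list.
annotations = [
    {"type": "token", "start": 0, "end": 2},
    {"type": "token", "start": 3, "end": 7},
    {"type": "stem", "start": 3, "end": 7, "stem": "kept"},
]

# Column-oriented view: one row per document, one column per annotation type.
row = to_columns("doc-1", "He kept", annotations)
```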

The annotators are implemented as Transformers, Estimators, and Models. Let’s take a look at some examples.

### Stages
One of the design principles of Spark NLP is easy interoperability with the existing algorithms in MLlib. Since there is no notion of documents or annotations in MLlib, there are transformers for turning text columns into documents and converting annotations into vanilla SparkSQL data types. The usual usage pattern is:

1. load data with SparkSQL
2. create document column
3. process with Spark NLP
4. convert annotations of interest into SparkSQL data types
5. run additional MLlib stages

We have already looked at how to load data with SparkSQL, and how to use MLlib stages in the standard Spark library, so we will look at the middle three stages now. First, we will look at the DocumentAssembler (stage 2).
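
Put together, the stages after data loading might be wired up roughly as follows. This is a sketch under assumptions rather than a recommended pipeline: the helper name `build_pipeline` and the column names are hypothetical, `Tokenizer` stands in for whatever Spark NLP annotators are needed, and `CountVectorizer` stands in for whatever MLlib stage comes next. It also assumes a Spark session that has the Spark NLP jar available.

```python
def build_pipeline(text_col="text"):
    """Assemble stages 2-5 of the usage pattern as one Spark ML Pipeline."""
    # Imported lazily so that defining the helper does not require a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import CountVectorizer
    from sparknlp.annotator import Tokenizer
    from sparknlp.base import DocumentAssembler, Finisher

    # Stage 2: wrap the raw text column in a document column.
    assembler = DocumentAssembler().setInputCol(text_col).setOutputCol("document")
    # Stage 3: a Spark NLP annotator; a tokenizer is used here only as an example.
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Stage 4: turn token annotations into a plain array-of-strings column.
    finisher = Finisher().setInputCols(["token"]).setOutputCols(["tokens"])
    # Stage 5: any MLlib stage that accepts an array of strings.
    vectorizer = CountVectorizer(inputCol="tokens", outputCol="features")
    return Pipeline(stages=[assembler, tokenizer, finisher, vectorizer])
```

Given a DataFrame `texts` with a `text` column, `build_pipeline().fit(texts).transform(texts)` would then add the `features` column alongside the intermediate ones.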

### Transformers 

In [5]:
from sparknlp import DocumentAssembler, Finisher

In [6]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)

In [7]:
# Import only the types used to build the schema below
from pyspark.sql.types import StructType, StructField, StringType

###### Exploring the mini_newsgroup data set

In [11]:
! ls ../datasets/mini_newsgroup/mini_newsgroups/

alt.atheism		  rec.autos	      sci.space
comp.graphics		  rec.motorcycles     soc.religion.christian
comp.os.ms-windows.misc   rec.sport.baseball  talk.politics.guns
comp.sys.ibm.pc.hardware  rec.sport.hockey    talk.politics.mideast
comp.sys.mac.hardware	  sci.crypt	      talk.politics.misc
comp.windows.x		  sci.electronics     talk.religion.misc
misc.forsale		  sci.med


In [18]:
path='../datasets/mini_newsgroup/mini_newsgroups/alt.atheism/51121'

In [19]:
# RDD containing filepath-text pairs
texts = spark.sparkContext.wholeTextFiles(path)

schema = StructType(
[
    StructField('path', StringType()),
    StructField('text', StringType())
])

texts = spark.createDataFrame(texts, schema=schema)

In [20]:
texts.limit(5).toPandas()

| | path | text |
|---|---|---|
| 0 | file:/home/jovyan/projects/spark-nlp/datasets/... | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51... |


## DocumentAssembler

The DocumentAssembler takes 5 parameters.

1. inputCol: the name of the column containing the text of the document
2. outputCol: the name of the column containing the newly constructed document
3. idCol: the name of the column containing the identifier (optional)
4. metadataCol: the name of a Map-type column that represents document metadata (optional)
5. trimAndClearNewLines: whether to remove new line characters and trim strings (optional, default=True)

In [23]:
document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document').setIdCol('path')

TypeError: 'JavaPackage' object is not callable
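
This `TypeError: 'JavaPackage' object is not callable` usually means that the JVM behind the Spark session does not have the Spark NLP jar on its classpath. That is the case when the session is built directly with `SparkContext('local')`, as in cell 6. A minimal sketch of one way around it, assuming the `spark-nlp` Python package is installed and that no other Spark context is still running in the kernel (the helper name `start_spark_nlp_session` is hypothetical):

```python
def start_spark_nlp_session():
    """Create a SparkSession that has the Spark NLP jar on its classpath."""
    # Imported lazily so that defining the helper does not require Spark NLP.
    import sparknlp

    # sparknlp.start() builds a SparkSession configured with the Spark NLP package.
    return sparknlp.start()
```

After restarting the kernel (or stopping the existing context with `sc.stop()`), `spark = start_spark_nlp_session()` would take the place of the cell that creates `sc` and `spark`, and the `DocumentAssembler()` call can be retried.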