**spaCy**<br>
spaCy is an open-source Python library for parsing and analyzing large volumes of text. Pretrained models are available for specific languages.

Here we'll set up spaCy to work with Python and explore some of its features.

**Installation and Setup**<br>
It's a two-step process: first install spaCy using conda or pip, then download the model for the language you need.

Note: I generally use pip (or pip3 for Python 3.x versions).

1. Installation from command line or terminal<br>
   `pip3 install spacy`<br><br>
2. Downloading the model from command line or terminal<br>
    `python -m spacy download "model_name"`  (where `model_name` can be `en_core_web_sm`, `en_core_web_md`, `en`, etc.; the `en` shortcut only works in spaCy 2.x)<br><br>
    For models and more details: https://spacy.io/usage/
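
Once both steps have finished, a quick check from Python can confirm that the library and the model are visible. This is only a minimal sketch: `en_core_web_sm` is used as an example model name, and `spacy.util.is_package` simply reports whether a package of that name is installed.

```python
import spacy

# Example model name; substitute whichever model you downloaded
model_name = "en_core_web_sm"

print("spaCy version:", spacy.__version__)

# is_package() returns True when the model was installed as a Python package
model_installed = spacy.util.is_package(model_name)

if model_installed:
    nlp = spacy.load(model_name)
    print("Loaded pipeline components:", nlp.pipe_names)
else:
    print(f"Model not found. Try: python -m spacy download {model_name}")
```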

## Working with spaCy and Python

Below are the typical steps for importing and working with spaCy. Loading a model can take a bit of time because spaCy has a fairly large library to load.

In [11]:
# Importing spaCy and loading the language library
import spacy
nlp = spacy.load('en_core_web_sm')

# Creating a document object
document = nlp(u'Salesforce announced its acquisition of Tableau at an enterprise value of approximately $16 billion.')

# Printing tokens/parts of the document created
for token in document:
    print(f"{token.text:{15}} {token.pos_:{8}} {token.dep_:{8}}")

Salesforce      PROPN    nsubj   
announced       VERB     ROOT    
its             DET      poss    
acquisition     NOUN     dobj    
of              ADP      prep    
Tableau         PROPN    pobj    
at              ADP      prep    
an              DET      det     
enterprise      NOUN     compound
value           NOUN     pobj    
of              ADP      prep    
approximately   ADV      advmod  
$               SYM      quantmod
16              NUM      compound
billion         NUM      pobj    
.               PUNCT    punct   


The above output isn't very user-friendly, but we can already see some interesting things.

a. `Salesforce` is recognized as a proper noun (`PROPN`)<br>
b. `16` is recognized as a number (`NUM`)

As we go further we'll see how combined tokens such as `$16 billion` can be recognized as **money**.

## Pipeline
When we run `nlp`, our text goes through a *pipeline* that breaks down the text and then performs a series of operations to tag, parse and describe the data.

Pipeline information: https://spacy.io/usage/spacy-101#pipelines

In [12]:
# checking the nlp pipeline
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x11368c9e8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x11893eb88>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x11893eac8>)]

In [13]:
nlp.pipe_names

['tagger', 'parser', 'ner']

As we can see, the document goes through a tagger, a parser and a named entity recognizer (`ner`).
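
To make the idea of a pipeline more concrete, the sketch below applies the steps by hand: the tokenizer first builds a `Doc`, and each component in `nlp.pipeline` then receives that `Doc` and returns it with extra annotations. The fallback to a blank English pipeline (which has no components) is an assumption added here so the snippet also runs when `en_core_web_sm` is not installed.

```python
import spacy

# Assumption: use the small English model if present, else a blank pipeline
model_name = "en_core_web_sm"
nlp = spacy.load(model_name) if spacy.util.is_package(model_name) else spacy.blank("en")

text = "Salesforce announced its acquisition of Tableau."

# Step 1: the tokenizer turns raw text into a Doc
doc = nlp.make_doc(text)

# Step 2: each component annotates the Doc in turn and hands it on
for component_name, component in nlp.pipeline:
    doc = component(doc)
    print(f"applied {component_name}")

print([token.text for token in doc])
```

Calling `nlp(text)` directly performs the same sequence in one step.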

## Tokenization

The first step in processing text is to split up all the component parts (words & punctuation) into "tokens". These tokens are annotated inside the Document object to contain descriptive information.<br>
Let's understand this with the help of an example.

In [24]:
document2 = nlp(u"Harry isn't looking  for jobs anymore.")

for token in document2:
    print(f"{token.text:{10}} {token.pos_:{6}} {token.dep_:{6}}")

Harry      PROPN  nsubj 
is         VERB   aux   
n't        ADV    neg   
looking    VERB   ROOT  
           SPACE        
for        ADP    prep  
jobs       NOUN   pobj  
anymore    ADV    advmod
.          PUNCT  punct 


Notice that `isn't` has been split into two tokens: spaCy recognizes both the auxiliary verb `is` and the negation `n't` attached to it. The extended whitespace and the period at the end of the sentence are also assigned their own tokens.

Note that although `document2` holds processed information, it also retains the original text.

In [25]:
document2

Harry isn't looking  for jobs anymore.

In [26]:
document2[0]

Harry

In [27]:
type(document2)

spacy.tokens.doc.Doc
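
The way the original text is retained can be illustrated with a short sketch. It assumes a blank English pipeline (`spacy.blank("en")`), which is enough here because tokenization is rule-based and does not need a downloaded model. Each token remembers its trailing whitespace, so the input string can be rebuilt exactly.

```python
import spacy

# Assumption: a blank pipeline, since only the tokenizer is needed
nlp = spacy.blank("en")
doc = nlp("Harry isn't looking  for jobs anymore.")

for token in doc:
    # repr() makes the whitespace-only token visible
    print(f"{token.i:<3} {token.text!r:<10} "
          f"space_after={bool(token.whitespace_)} is_space={token.is_space}")

# Joining every token with its trailing whitespace reproduces the input
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == doc.text)
```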

___
## Part-of-Speech Tagging (POS)
The next step after splitting the text into tokens is to assign parts of speech. In the example above, `Harry` was recognized as a *proper noun*. This step requires statistical modeling: for example, words that follow "the" are typically nouns.

For a full list of POS tags: https://spacy.io/api/annotation#pos-tagging

In [28]:
document2[0].pos_

'PROPN'

___
## Dependencies
Each token is also assigned a syntactic dependency. `Harry` is identified as an `nsubj`, the *nominal subject* of the sentence.

For a full list of syntactic dependencies: https://spacy.io/api/annotation#dependency-parsing

In [29]:
document2[0].dep_

'nsubj'

**To see the description of the tags we can use `spacy.explain(tag)`**

In [31]:
spacy.explain('PROPN')

'proper noun'

In [32]:
spacy.explain('nsubj')

'nominal subject'
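
Since `spacy.explain` needs no loaded model, it can be used to build a small glossary for the labels seen so far. The list of labels below is just a selection from the outputs printed earlier in this notebook.

```python
import spacy

# Labels taken from the token tables printed earlier
labels = ["PROPN", "VERB", "ADP", "VBG", "nsubj", "pobj", "advmod"]

# spacy.explain() returns None for labels it has no description for
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(f"{label:<8} {description}")
```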

## Additional token attributes

|Attribute|Description|Value for `document2[0]`|
|:------|:------|:------|
|`.text`|The original word text|`Harry`|
|`.lemma_`|The base form of the word|`harry`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [44]:
# Lemmas (the base form of the word):
print(f"{document2[3].text} ---> {document2[3].lemma_}")

looking ---> look


In [48]:
# Simple Parts-of-Speech & Detailed Tags:
print(f"{document2[3].pos_} ---> {document2[3].tag_} ---> {spacy.explain(document2[3].tag_)}")


VERB ---> VBG ---> verb, gerund or present participle


In [50]:
print(document2[0].text+' : '+document2[0].shape_)

Harry : Xxxxx


In [52]:
# Boolean Values:
print(document2[0].is_alpha)

True


In [53]:
print(document2[0].is_stop)

False
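
Some of these attributes (`.text`, `.shape_`, `.is_alpha`, `.is_stop`) are lexical and do not depend on the tagger or parser. As a rough sketch, the loop below prints them for every token using a blank English pipeline; that choice is an assumption made here to keep the example independent of a downloaded model, and it means `.pos_`, `.tag_` and `.dep_` would be empty.

```python
import spacy

# Assumption: blank pipeline with tokenizer and vocabulary only
nlp = spacy.blank("en")
doc = nlp("Harry isn't looking  for jobs anymore.")

print(f"{'text':<10}{'shape_':<10}{'is_alpha':<10}{'is_stop':<10}")
for token in doc:
    print(f"{token.text:<10}{token.shape_:<10}"
          f"{str(token.is_alpha):<10}{str(token.is_stop):<10}")
```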


___
## Spans
A **span** is a slice of a Document object in the form `Document[start:stop]`, where `stop` is exclusive, as with ordinary Python slices.

In [55]:
document3 = nlp(u"A paragraph is a group of sentences that fleshes out a single idea. \
In order for a paragraph to be effective, it must begin with a topic sentence, have sentences \
that support the main idea of that paragraph, and maintain a consistent flow.")

In [60]:
span_ex = document3[14:29]
print(span_ex)

In order for a paragraph to be effective, it must begin with a topic


In [61]:
type(span_ex)

spacy.tokens.span.Span
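
A span also knows where it sits inside its parent document. The sketch below repeats the slice from above on a blank English pipeline (an assumption, since slicing only needs the tokenizer) and inspects a few `Span` attributes.

```python
import spacy

nlp = spacy.blank("en")  # assumption: only the tokenizer is needed for slicing
doc = nlp(
    "A paragraph is a group of sentences that fleshes out a single idea. "
    "In order for a paragraph to be effective, it must begin with a topic sentence, "
    "have sentences that support the main idea of that paragraph, "
    "and maintain a consistent flow."
)

span = doc[14:29]

# start and end are token offsets into the parent Doc; end is exclusive
print(span.start, span.end, len(span))
print(span.text)

# A Span keeps a reference to the Doc it was sliced from
print(span.doc is doc)

# Indexing a Span is relative to the Span; token.i is relative to the Doc
print(span[0].text, span[0].i)
```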

___
## Sentences
Certain tokens inside a Document object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through `Document.sents`.

In [62]:
document4 = nlp(u'This is the 1st sentence. \
This is 2nd sentence. \
This is the last sentence.')


In [63]:
for sent in document4.sents:
    print(sent)

This is the 1st sentence.
This is 2nd sentence.
This is the last sentence.


In [66]:
document4[6].is_sent_start

True
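
To show how "start of sentence" tags drive `Document.sents`, the sketch below sets the tags by hand on a blank English pipeline instead of relying on the parser. The rule used (a token starts a sentence if it follows a period) is a deliberately naive assumption for illustration, not how the statistical model decides.

```python
import spacy

nlp = spacy.blank("en")  # assumption: no parser, so boundaries are set by hand
doc = nlp("This is the 1st sentence. This is 2nd sentence. This is the last sentence.")

# Naive rule: a token starts a sentence if it is the first token
# or if the token before it is a period
for token in doc:
    token.is_sent_start = token.i == 0 or doc[token.i - 1].text == "."

# doc.sents is generated from the is_sent_start flags
sentences = list(doc.sents)
for sent in sentences:
    print(sent.start, sent.text)
```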