# Summary

#### SPACY FEATURES

- ***Tokenization:*** Segmenting text into words, punctuation marks, etc.
- ***Part-of-speech (POS) Tagging:*** Assigning word types to tokens, like verb or noun.
- ***Dependency Parsing:*** Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
- ***Lemmatization:*** Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
- ***Sentence Boundary Detection (SBD):*** Finding and segmenting individual sentences.
- ***Named Entity Recognition (NER):*** Labelling named "real-world" objects, like persons, companies or locations.
- ***Similarity:*** Comparing words, text spans and documents to measure how similar they are to each other.
- ***Text Classification:*** Assigning categories or labels to a whole document, or parts of a document.
- ***Rule-based Matching:*** Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
- ***Training:*** Updating and improving a statistical model's predictions.
- ***Serialization:*** Saving objects to files or byte strings.

spaCy is also fast: it is written in Cython and was designed specifically to be as fast as possible.

In [4]:
# start
import spacy
nlp = spacy.load('en_core_web_sm')  # small English pipeline; the 'en' shortcut no longer works in spaCy v3
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"')

"Hello"
"    "
"World"
"!"


Notice the index-preserving tokenization in action. Rather than keeping only the words, spaCy keeps the whitespace too. This is helpful when you need to replace words in the original text or add annotations. NLTK’s `word_tokenize` returns plain strings, with no record of where each token sits in the original raw text. spaCy preserves this “link” between the word and its place in the raw text. Here’s how to get the exact index of a word:

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')  # small English pipeline; the 'en' shortcut no longer works in spaCy v3
doc = nlp('Hello     World!')
for token in doc:
    print('"' + token.text + '"', token.idx)

"Hello" 0
"    " 6
"World" 10
"!" 15

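Because each token carries its character offset, a word can be swapped in the raw string without disturbing the surrounding whitespace. The following is a minimal sketch in plain Python that works on the `(text, idx)` pairs printed above rather than on a live `Doc`; the helper name `replace_token` is hypothetical and not part of spaCy.

```python
# Original text and the (token text, character offset) pairs printed above.
text = 'Hello     World!'
tokens = [('Hello', 0), ('    ', 6), ('World', 10), ('!', 15)]


def replace_token(text, tokens, target, replacement):
    """Replace every token equal to `target`, leaving all other characters as they are.

    `replace_token` is a hypothetical helper for illustration only.
    """
    pieces = []
    cursor = 0  # position in `text` up to which characters have been copied
    for tok_text, idx in tokens:
        if tok_text == target:
            pieces.append(text[cursor:idx])   # everything before the token, untouched
            pieces.append(replacement)
            cursor = idx + len(tok_text)      # skip over the replaced token
    pieces.append(text[cursor:])              # remainder of the text
    return ''.join(pieces)


result = replace_token(text, tokens, 'World', 'spaCy')
print(repr(result))
```

The five spaces between the two words survive the edit, which is what plain string tokens without offsets cannot guarantee.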

The Token class exposes a lot of word-level attributes. Here are a few examples:

In [6]:
doc = nlp("Next week I'll   be in France.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

Next	0	next	False	False	Xxxx	ADJ	JJ
week	5	week	False	False	xxxx	NOUN	NN
I	10	-PRON-	False	False	X	PRON	PRP
'll	11	will	False	False	'xx	VERB	MD
  	15	  	False	True	  	SPACE	SP
be	17	be	False	False	xx	VERB	VB
in	20	in	False	False	xx	ADP	IN
France	23	france	False	False	Xxxxx	PROPN	NNP
.	29	.	True	False	.	PUNCT	.
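
The `shape_` column is the least self-explanatory of these attributes. Roughly, it maps uppercase letters to `X`, lowercase letters to `x` and digits to `d`, keeps other characters as they are, and cuts runs of the same symbol off after four. The function below is a simplified approximation of that idea written for illustration; `word_shape` here is a local helper and should not be read as spaCy's exact implementation.

```python
def word_shape(text):
    """Approximate the orthographic shape of a token (illustrative sketch)."""
    shape = []
    last = ''
    run = 0  # how many times the current shape symbol has repeated
    for char in text:
        if char.isalpha():
            symbol = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            symbol = 'd'
        else:
            symbol = char  # punctuation and whitespace are kept verbatim
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        if run < 4:  # keep at most four of the same symbol in a row
            shape.append(symbol)
    return ''.join(shape)


for word in ['Next', 'week', "'ll", 'France']:
    print(word, word_shape(word))
```

For the four words tried here, the sketch reproduces the shapes shown in the table above.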


## The spaCy toolbox

Let’s now explore the models bundled with spaCy.

### Sentence detection
Here’s how to achieve one of the most common NLP tasks with spaCy:

In [7]:
doc = nlp("These are apples. These are oranges.")
 
for sent in doc.sents:
    print(sent)
 

These are apples.
These are oranges.


### Part-of-speech tagging
We’ve already seen how this works but let’s have another look:

In [8]:

doc = nlp("Next week I'll be in India.")
print([(token.text, token.tag_) for token in doc])

[('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('in', 'IN'), ('India', 'NNP'), ('.', '.')]
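
Once the tags are available as plain tuples, filtering by word class takes only a few lines. A small sketch, using the list printed above and relying on the Penn Treebank convention that all noun tags start with `NN`; the variable names are illustrative.

```python
from collections import defaultdict

# The (text, tag) pairs printed above.
tagged = [('Next', 'JJ'), ('week', 'NN'), ('I', 'PRP'), ("'ll", 'MD'),
          ('be', 'VB'), ('in', 'IN'), ('India', 'NNP'), ('.', '.')]

# Group token texts by their fine-grained tag.
by_tag = defaultdict(list)
for text, tag in tagged:
    by_tag[tag].append(text)

# Penn Treebank noun tags (NN, NNS, NNP, NNPS) all share the 'NN' prefix.
nouns = [text for text, tag in tagged if tag.startswith('NN')]

print(dict(by_tag))
print(nouns)
```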


### Named Entity Recognition
Doing NER with spaCy is super easy and the pretrained model performs pretty well:

In [11]:
doc = nlp("Next week, I'll be in India.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Next week DATE
India GPE


You can also view the IOB-style tagging of the sentence (B marks the beginning of an entity, I a token inside one, and O a token outside any entity) like this:

In [13]:
from nltk.chunk import conlltags2tree
 
 
doc = nlp("Next week I'll be in Madrid.")
iob_tagged = [
    (
        token.text, 
        token.tag_, 
        "{0}-{1}".format(token.ent_iob_, token.ent_type_) if token.ent_iob_ != 'O' else token.ent_iob_
    ) for token in doc
]
 
print(iob_tagged)
 
# In case you like the nltk.Tree format
print(conlltags2tree(iob_tagged))

[('Next', 'JJ', 'O'), ('week', 'NN', 'O'), ('I', 'PRP', 'O'), ("'ll", 'MD', 'O'), ('be', 'VB', 'O'), ('in', 'IN', 'O'), ('Madrid', 'NNP', 'B-GPE'), ('.', '.', 'O')]
(S Next/JJ week/NN I/PRP 'll/MD be/VB in/IN (GPE Madrid/NNP) ./.)
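
Going the other way, from IOB tags back to entity spans, is a common follow-up step. Below is a minimal sketch in plain Python; `iob_to_spans` is a hypothetical helper, and joining tokens with a single space is a simplification that ignores the original whitespace. The first input is the list printed above. The second is hand-written for illustration (it mirrors the "Next week" and "India" entities from the earlier example) and is not model output.

```python
def iob_to_spans(iob_tagged):
    """Collapse (text, tag, iob) triples into (entity text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Close the entity being built, if there is one.
        if current_tokens:
            spans.append((' '.join(current_tokens), current_label))

    for text, _tag, iob in iob_tagged:
        if iob == 'O':
            flush()
            current_tokens, current_label = [], None
            continue
        prefix, label = iob.split('-', 1)
        if prefix == 'B' or label != current_label:
            flush()  # a new entity starts here
            current_tokens, current_label = [text], label
        else:
            current_tokens.append(text)  # 'I-' tag continuing the same entity
    flush()
    return spans


from_output = [('Next', 'JJ', 'O'), ('week', 'NN', 'O'), ('I', 'PRP', 'O'),
               ("'ll", 'MD', 'O'), ('be', 'VB', 'O'), ('in', 'IN', 'O'),
               ('Madrid', 'NNP', 'B-GPE'), ('.', '.', 'O')]

# Hand-written illustration of a multi-token entity, not model output.
hand_written = [('Next', 'JJ', 'B-DATE'), ('week', 'NN', 'I-DATE'),
                ('in', 'IN', 'O'), ('India', 'NNP', 'B-GPE')]

print(iob_to_spans(from_output))
print(iob_to_spans(hand_written))
```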


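### Entity types

Besides places and dates, the model labels other entity types, such as numbers and times. Its predictions are not always accurate, as the overly long TIME span in the output below shows:
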
In [14]:
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)

2 CARDINAL
9 a.m. because the stock went up 30% in just 2 days according to the WSJ TIME


### Chunking
spaCy automatically detects noun phrases as well:

In [17]:

doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)

Wall Street Journal NP Journal
an interesting piece NP piece
crypto currencies NP currencies


### Dependency Parsing
This is what makes spaCy really stand out. Let’s see the dependency parser in action:

In [18]:

doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Wall/NNP <--compound-- Journal/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--amod-- currencies/NNS
currencies/NNS <--pobj-- on/IN
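
Each line above is one arc pointing from a token to its head, so the whole parse is a tree that can be walked. The sketch below rebuilds that tree in plain Python from the printed arcs and pulls out the subject and object phrases of the root verb. It keys tokens by their text, which only works because no word repeats in this sentence; with a live `Doc` you would key on token positions instead. The helper `subtree` is hypothetical.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
arcs = [
    ('Wall', 'compound', 'Journal'),
    ('Street', 'compound', 'Journal'),
    ('Journal', 'nsubj', 'published'),
    ('just', 'advmod', 'published'),
    ('published', 'ROOT', 'published'),
    ('an', 'det', 'piece'),
    ('interesting', 'amod', 'piece'),
    ('piece', 'dobj', 'published'),
    ('on', 'prep', 'piece'),
    ('crypto', 'amod', 'currencies'),
    ('currencies', 'pobj', 'on'),
]

# Sentence position of each word, used to restore word order later.
order = {token: i for i, (token, _, _) in enumerate(arcs)}

children = defaultdict(list)  # head -> [(label, child), ...]
root = None
for token, dep, head in arcs:
    if dep == 'ROOT':
        root = token  # the root is its own head
    else:
        children[head].append((dep, token))


def subtree(word):
    """Return `word` and all of its descendants in sentence order."""
    found, stack = [word], [word]
    while stack:
        for _dep, child in children[stack.pop()]:
            found.append(child)
            stack.append(child)
    return sorted(found, key=order.get)


subject = next(child for dep, child in children[root] if dep == 'nsubj')
obj = next(child for dep, child in children[root] if dep == 'dobj')

print(root)
print(' '.join(subtree(subject)))
print(' '.join(subtree(obj)))
```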
