# spaCy

spaCy is a *free*, *open-source library* for Natural Language Processing (NLP) in Python.

Unlike [NLTK](https://github.com/nltk/nltk), spaCy is not research software. It is designed to be "ready to use" and easy to understand, whereas NLTK was created as a platform for teaching and research.

# Features
spaCy features can be related to:
- linguistic concepts
- more general machine learning functionality

## Tokenization
Segmenting text into words, punctuation marks, etc.

## Lemmatization
Assigning the base forms of words. For example, the lemma of "playing" is "play" and the lemma of "is" is "be".

## Similarity
Comparing words, text spans and documents to determine how similar they are to each other.

## Part-of-speech (POS) tagging
Assigning word types to tokens, like verb or noun.

## Named entity recognition (NER)
Labelling named real-world objects, such as persons, companies and locations.


In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md # medium model; includes word vectors

# Linguistic annotations (example of use)
Linguistic annotations give you insights into a *text's grammatical structure*.
They include:
1. the *word types* (parts of speech),
2. *how the words are related to each other*.

For example, it makes a big difference whether a noun is the subject or the object of a sentence.

Once we have downloaded and installed a "trained pipeline", we can load it with the function `spacy.load()`.

This returns a `Language` object, which is named `nlp` by convention.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
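
The labels printed by `token.pos_` and `token.dep_` are short codes. As a small sketch of how to decode them, `spacy.explain()` returns a description from spaCy's built-in glossary. The list of labels below is only an illustrative selection, not the full tag set.

```python
import spacy

# Illustrative selection of part-of-speech and dependency labels.
labels = ["PROPN", "AUX", "nsubj", "dobj"]

# spacy.explain() looks the label up in spaCy's glossary;
# no trained pipeline needs to be loaded for this.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:>6}: {description}")
```

Here `nsubj` and `dobj` are the dependency labels that distinguish a subject from an object, as discussed above.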

# Tokenization (example of use)

Tokenization segments a text into words, punctuation marks, etc.

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
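
Tokenization is rule-based, so it can also be tried without a trained pipeline. The sketch below assumes a blank English pipeline created with `spacy.blank("en")`; the names `nlp_blank` and `blank_doc` are chosen here only to avoid overwriting `nlp` and `doc`.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no model download is required for this sketch.
nlp_blank = spacy.blank("en")
blank_doc = nlp_blank("Apple is looking at buying U.K. startup for $1 billion")

token_texts = [token.text for token in blank_doc]
print(token_texts)

# Lexical attributes are available even without trained components.
for token in blank_doc:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)
```

Note that "U.K." is kept as a single token, while "$1" is split into the currency symbol and the number.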

# Similarity (example of use)
Similarity can be computed between words and also over more complex data, like documents.

Similarity between words is determined by comparing their word vectors, also called "word embeddings", which are *multi-dimensional meaning representations of a word*.
Word embeddings can be generated with an algorithm such as word2vec.
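
To make "comparing word vectors" concrete: by default spaCy's `similarity()` is based on the cosine similarity of the vectors. The following is a minimal sketch of that measure in plain Python. The three-dimensional vectors in `toy_vectors` are made up for illustration; real embeddings have many more dimensions and are learned from data.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, NOT taken from any spaCy model.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

dog_cat = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
dog_banana = cosine_similarity(toy_vectors["dog"], toy_vectors["banana"])
print("dog <-> cat:", round(dog_cat, 3))
print("dog <-> banana:", round(dog_banana, 3))
```

Vectors that point in a similar direction get a score close to 1, while unrelated directions get a score close to 0.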

In [None]:
nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")

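for token in tokens:
    # has_vector: does the token have a vector? vector_norm: L2 norm of the vector
    # is_oov: is the token out-of-vocabulary?
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
    # print(token.vector)  # uncomment to inspect the raw vector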

In [None]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

# Similarity of tokens and spans
french_fries = doc1[2:4]  # the span "salty fries"
burgers = doc1[5]  # the token "hamburgers"
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

# Named Entity Recognition (example of use)

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
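
To clarify what the printed fields mean, the sketch below builds one entity by hand on a blank pipeline instead of relying on a trained model. The annotation of "Apple" as `ORG` is set manually here for illustration; a trained pipeline predicts such spans itself, and its output may differ.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained entity recognizer.
nlp_blank = spacy.blank("en")
ner_doc = nlp_blank("Apple is looking at buying U.K. startup for $1 billion")

# Hand-made annotation: mark token 0 ("Apple") as an organization.
manual_ent = Span(ner_doc, 0, 1, label="ORG")
ner_doc.ents = [manual_ent]

for ent in ner_doc.ents:
    # start_char and end_char are character offsets into the original text.
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    print(repr(ner_doc.text[ent.start_char:ent.end_char]))
```

Slicing the original text with `start_char` and `end_char` gives back the entity text, which is useful for highlighting entities in the source string.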