## Data preparation

In [1]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

In [34]:
text = "Our NLP team just released the Electra model for Japanese"

### Word tokenization

 - Word tokenization divides a text into smaller meaningful segments, called tokens
 - It is the first step in almost every natural language processing task, because later steps need to capture both the syntax and the semantics of each word

 <img src="https://user-images.githubusercontent.com/52401767/87949843-beb89e80-cad0-11ea-8a25-d6f74785542c.jpg" width="600">

In [35]:
doc = nlp(text)
for token in doc:
    print(token.text)

Our
NLP
team
just
released
the
Electra
model
for
Japanese
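To make the idea concrete, here is a minimal sketch of tokenization using only the standard library. It is not how spaCy tokenizes: the regular expression and the `simple_tokenize` helper are assumptions made for illustration, whereas spaCy applies language-specific rules and exceptions.

```python
import re

text = "Our NLP team just released the Electra model for Japanese"

# Hypothetical pattern: a run of word characters, or a single
# non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(s):
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(s)

tokens = simple_tokenize(text)
print(tokens)
```

For this sentence the sketch yields the same ten tokens as above, but it breaks down quickly on real text: it splits "don't" into `don`, `'`, `t`, while spaCy produces `do` and `n't`.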


### Word tagging
- The process of classifying words into their parts of speech is known as part-of-speech (POS) tagging, or simply tagging
- After tokenization, spaCy can parse and tag a given document

 <img src="https://user-images.githubusercontent.com/52401767/87952495-160c3e00-cad4-11ea-8d66-52a4a298ce4a.jpg" width="600">


In [44]:
for token in doc:
    print(token.text, '->', token.pos_)

Our -> DET
NLP -> PROPN
team -> NOUN
just -> ADV
released -> VERB
the -> DET
Electra -> PROPN
model -> NOUN
for -> ADP
Japanese -> PROPN
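One common use of the tags is to filter or group tokens by part of speech. The sketch below hard-codes the `(token, tag)` pairs printed above instead of calling spaCy, so it runs on its own; the `tagged` and `by_pos` names are illustrative.

```python
from collections import defaultdict

# (token, tag) pairs copied from the output above
tagged = [
    ("Our", "DET"), ("NLP", "PROPN"), ("team", "NOUN"), ("just", "ADV"),
    ("released", "VERB"), ("the", "DET"), ("Electra", "PROPN"),
    ("model", "NOUN"), ("for", "ADP"), ("Japanese", "PROPN"),
]

# Group the tokens by their part-of-speech tag, keeping sentence order
by_pos = defaultdict(list)
for word, pos in tagged:
    by_pos[pos].append(word)

for pos, words in by_pos.items():
    print(pos, words)
```

With a real `doc`, the same grouping could be built from `token.text` and `token.pos_`.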


### Word lemmatization
- In linguistics, a lemma is the canonical (dictionary) form of a word
- Word lemmatization groups together the various forms of a word that share the same lemma, so that they can be treated as a single word
 
 <img src="https://user-images.githubusercontent.com/52401767/87954535-b4999e80-cad6-11ea-8b2b-5450e94008b6.jpg" width="600">


In [45]:
for token in doc:
    print(token.text, '->', token.lemma_)

Our -> -PRON-
NLP -> NLP
team -> team
just -> just
released -> release
the -> the
Electra -> Electra
model -> model
for -> for
Japanese -> Japanese
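The grouping effect of lemmatization can be sketched with a toy lookup table. The table and the `toy_lemma` helper below are hypothetical and cover only a handful of forms; spaCy's lemmatizer is considerably more complete.

```python
from collections import Counter

# Hypothetical lookup table, for illustration only
LEMMA_TABLE = {
    "released": "release",
    "releases": "release",
    "releasing": "release",
    "models": "model",
}

def toy_lemma(word):
    """Return the lemma from the table, or the lowercased word itself."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)

words = ["release", "released", "releases", "releasing", "model", "models"]

# Counting lemmas instead of surface forms merges the variants
counts = Counter(toy_lemma(w) for w in words)
print(counts)
```

Six surface forms collapse into two lemmas, which is why lemmatization is often applied before counting or comparing words.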


### Dependency parsing
- Dependency parsing extracts a dependency parse of a sentence, which represents its grammatical structure as a set of relationships between "head" words and the words that depend on them
- For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are describing the subject

In [39]:
displacy.render(doc, style="dep", jupyter=True, options={'distance': 70})

In [46]:
for token in doc:
    print(token.text, token.dep_, token.head.text,
          list(token.children))

Our poss team []
NLP compound team []
team nsubj released [Our, NLP]
just advmod released []
released ROOT released [team, just, model, for]
the det model []
Electra compound model []
model dobj released [the, Electra]
for prep released [Japanese]
Japanese pobj for []
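As a rough illustration of how the parse answers "who did what", the sketch below stores the triples printed above as plain tuples and looks up the subject and direct object of the root verb. Matching tokens by their text is an assumption that only works because no word repeats in this sentence; with spaCy you would use `token.head` and `token.children` instead.

```python
# (token, dependency label, head) triples copied from the output above
parse = [
    ("Our", "poss", "team"),
    ("NLP", "compound", "team"),
    ("team", "nsubj", "released"),
    ("just", "advmod", "released"),
    ("released", "ROOT", "released"),
    ("the", "det", "model"),
    ("Electra", "compound", "model"),
    ("model", "dobj", "released"),
    ("for", "prep", "released"),
    ("Japanese", "pobj", "for"),
]

def dependents(head, label):
    """Return the tokens attached to `head` with the given dependency label."""
    return [tok for tok, dep, h in parse if h == head and dep == label]

# The root is the token labelled ROOT (it is its own head)
root = next(tok for tok, dep, h in parse if dep == "ROOT")
subjects = dependents(root, "nsubj")
objects = dependents(root, "dobj")
print(root, subjects, objects)
```

The same helper also recovers modifiers, for example `dependents("team", "compound")` returns `["NLP"]`.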


### Named entity recognition
- Named entity recognition (NER) is the task of tagging entities in text with their corresponding type
- spaCy's pretrained NER model doesn't always label entities correctly and may need fine-tuning, depending on your use case

In [41]:
displacy.render(doc, style="ent", jupyter=True)

In [40]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

NLP 4 7 ORG
Electra 31 38 PRODUCT
Japanese 49 57 NORP
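The `start_char` and `end_char` values are character offsets into the original string, with the end offset exclusive. A minimal sketch, using the offsets printed above and a hypothetical `annotate` helper, shows how they map back onto the text:

```python
text = "Our NLP team just released the Electra model for Japanese"

# (entity text, start_char, end_char, label) copied from the output above
entities = [
    ("NLP", 4, 7, "ORG"),
    ("Electra", 31, 38, "PRODUCT"),
    ("Japanese", 49, 57, "NORP"),
]

# Slicing with the offsets recovers each entity's text
slices = [text[start:end] for _, start, end, _ in entities]

def annotate(s, ents):
    """Wrap each entity as [text|LABEL], working right to left
    so that earlier offsets stay valid while the string grows."""
    out = s
    for _, start, end, label in sorted(ents, key=lambda e: e[1], reverse=True):
        out = out[:start] + "[" + out[start:end] + "|" + label + "]" + out[end:]
    return out

annotated = annotate(text, entities)
print(slices)
print(annotated)
```

Storing offsets instead of copies of the text makes it easy to highlight or replace entities in the source string.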
