In [52]:
import spacy

>***The `nlp` object is backed by a trained model and holds a pipeline of components that process English-language text***

In [53]:
nlp = spacy.load('en_core_web_sm')

In [54]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
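The list above names the components that run, in order, on every text passed to `nlp`. As a rough mental model only (this is not spaCy's implementation), a pipeline can be pictured as a tokenizer followed by a list of named functions that each take a document and return it with more annotations. Every name below (`tokenize`, `add_lowercase`, `add_lengths`, `run_pipeline`) is hypothetical and uses a plain dict in place of spaCy's `Doc`.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for spaCy's Doc: a plain dict of annotations.
ToyDoc = Dict[str, list]


def tokenize(text: str) -> ToyDoc:
    """First step: split raw text into tokens (whitespace only, for illustration)."""
    return {"tokens": text.split()}


def add_lowercase(doc: ToyDoc) -> ToyDoc:
    """Toy component: annotate each token with its lowercase form."""
    doc["lower"] = [tok.lower() for tok in doc["tokens"]]
    return doc


def add_lengths(doc: ToyDoc) -> ToyDoc:
    """Toy component: annotate each token with its character length."""
    doc["length"] = [len(tok) for tok in doc["tokens"]]
    return doc


# Components are (name, function) pairs applied in order.
PIPELINE: List[Tuple[str, Callable[[ToyDoc], ToyDoc]]] = [
    ("lowercase", add_lowercase),
    ("lengths", add_lengths),
]


def run_pipeline(text: str) -> ToyDoc:
    """Tokenize, then pass the document through every component in turn."""
    doc = tokenize(text)
    for _name, component in PIPELINE:
        doc = component(doc)
    return doc


toy_doc = run_pipeline("Wipro earns money")
print([name for name, _ in PIPELINE])
print(toy_doc)
```

The point of the sketch is the ordering: each component sees the annotations added by the ones before it, which is why the order of the names in `nlp.pipe_names` matters.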

***Part-of-speech tag and lemma (base/root form) of each token***

In [55]:
doc = nlp("Tata Consultancy Services found the wayout and Wipro earns 2 $. TMSL students innovate the wayout but Ratan Tata with the help of Muskesh Ambani stole the plan.")
for token in doc:
    print(token, ": ", token.pos_, ", ", token.lemma_)

Tata :  PROPN ,  Tata
Consultancy :  PROPN ,  Consultancy
Services :  PROPN ,  Services
found :  VERB ,  find
the :  DET ,  the
wayout :  PROPN ,  wayout
and :  CCONJ ,  and
Wipro :  PROPN ,  Wipro
earns :  VERB ,  earn
2 :  NUM ,  2
$ :  SYM ,  $
. :  PUNCT ,  .
TMSL :  PROPN ,  TMSL
students :  NOUN ,  student
innovate :  VERB ,  innovate
the :  DET ,  the
wayout :  NOUN ,  wayout
but :  CCONJ ,  but
Ratan :  PROPN ,  Ratan
Tata :  PROPN ,  Tata
with :  ADP ,  with
the :  DET ,  the
help :  NOUN ,  help
of :  ADP ,  of
Muskesh :  PROPN ,  Muskesh
Ambani :  PROPN ,  Ambani
stole :  VERB ,  steal
the :  DET ,  the
plan :  NOUN ,  plan
. :  PUNCT ,  .


***`NER (Named Entity Recognition)`: the `ner` pipe extracts named entities from the given piece of text and assigns each one a predefined entity label***

In [56]:
for ent in doc.ents:
    print(ent, ": ", ent.label_, ", ", spacy.explain(ent.label_))

Tata Consultancy Services :  ORG ,  Companies, agencies, institutions, etc.
Wipro :  ORG ,  Companies, agencies, institutions, etc.
2 $ :  MONEY ,  Monetary values, including unit
TMSL :  ORG ,  Companies, agencies, institutions, etc.
Ratan Tata :  ORG ,  Companies, agencies, institutions, etc.
Muskesh Ambani :  ORG ,  Companies, agencies, institutions, etc.


### ***Stemming vs. Lemmatization***

In [57]:
words = ['eating', 'ate', 'ability', 'available', 'looked', 'stolen', 'eats', 'saw']

**Stemming**: `Applies hard-coded rules to each word (e.g. stripping suffixes such as able, ing, ed)`

***Has no knowledge of the language***
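To make the "hard-coded rules" idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a deliberately simplified toy and not the Porter algorithm; the suffix list, the `min_stem_len` parameter, and the name `toy_stem` are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer: a simplified sketch, NOT the Porter algorithm.
# The suffix list is a hypothetical rule set, checked in the order given.
SUFFIXES = ("able", "ing", "ed", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied: the word is returned unchanged.
    return word


for w in ["eating", "looked", "eats", "available", "ate", "saw"]:
    print(w, ": ", toy_stem(w))
```

Regular forms such as `eating` and `looked` are reduced correctly, while irregular forms such as `ate` and `saw` pass through untouched because no suffix rule matches them. That is the limitation a rule-only approach runs into.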

In [58]:
from nltk.stem import PorterStemmer

In [59]:
stemmer = PorterStemmer()

In [60]:
for word in words:
    print(word, ": ", stemmer.stem(word))

eating :  eat
ate :  ate
ability :  abil
available :  avail
looked :  look
stolen :  stolen
eats :  eat
saw :  saw


**Lemmatization**: `Applies linguistic knowledge to determine the root / base form (lemma) of each word`

***Has knowledge of the language***
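One way to picture "knowledge of the language" is a table of irregular forms consulted before any suffix rule is tried. The sketch below is a toy under that assumption: the `IRREGULAR` table, the suffix list, and the name `toy_lemma` are hypothetical, and a real lemmatizer such as spaCy's relies on much larger tables and on part-of-speech information.

```python
# Toy lemmatizer: irregular-form lookup first, regular suffix rules as fallback.
# Both tables are hypothetical illustration data, not taken from any library.
IRREGULAR = {"ate": "eat", "stolen": "steal", "saw": "see"}
REGULAR_SUFFIXES = ("ing", "ed", "s")


def toy_lemma(word: str, min_stem_len: int = 3) -> str:
    """Return a base form using the exception table, then simple suffix rules."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in REGULAR_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # Words with no matching rule are assumed to already be in base form.
    return word


for w in ["eating", "ate", "ability", "available", "looked", "stolen", "eats", "saw"]:
    print(w, ": ", toy_lemma(w))
```

Unlike a pure suffix stripper, this keeps `ability` and `available` intact and maps `ate`, `stolen`, and `saw` to their base verbs. A lookup alone cannot tell whether `saw` is the verb or the tool, which is why real lemmatizers also use the word's part of speech.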

In [62]:
# Customisation: add a rule to the attribute_ruler that sets the lemma of the exact token text "ate" to "eat"
ar = nlp.get_pipe('attribute_ruler')
ar.add([[{"TEXT": "ate"}]], {"LEMMA": "eat"})

In [64]:
text = ' '.join(words)
word_doc = nlp(text)

In [65]:
for word in word_doc:
    print(word, ": ", word.lemma_)

eating :  eat
ate :  eat
ability :  ability
available :  available
looked :  look
stolen :  steal
eats :  eat
saw :  see
