

## NLP Techniques

- Rules and heuristics
- Machine Learning
- Deep Learning

## NLP Use Cases

- Text Classification  
- Text Similarity  
- Information Extraction  
- Information Retrieval  
- Chatbots  
- Language Modeling  
- Text Summarization
- Topic Modeling
- Voice Assistants  

## NLP Pipeline  

- Data Acquisition
- Text Extraction and Cleanup
    - Discarding irrelevant information
    - Spelling correction
- Preprocessing
    - Sentence Tokenization
    - Word Tokenization
    - Stemming & Lemmatization
- Feature Engineering (converting text to numbers)
    - TF-IDF
    - One-Hot Encoding
    - Bag of Words (BoW)
    - Word Embeddings
- Machine Learning Model
    - Hyperparameter tuning
    - Evaluation
- Model Deployment
    - Monitor
    - Update
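
To make the "converting text to numbers" step concrete, here is a minimal sketch of Bag of Words and TF-IDF in plain Python. The toy corpus, the whitespace tokenization, and the helper names `bow_vector` and `tfidf_vector` are assumptions made for illustration. The TF-IDF here is the plain unsmoothed textbook form, so libraries such as scikit-learn, which smooth the IDF by default, will give different numbers.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "hulk loves pav bhaji",
    "thor loves samosa",
    "hulk loves samosa and pav bhaji",
]

# Whitespace tokenization keeps the sketch dependency-free.
tokenized = [d.split() for d in docs]

# Fixed, sorted vocabulary so every vector has the same layout.
vocab = sorted({word for tokens in tokenized for word in tokens})


def bow_vector(tokens):
    """Bag of Words: raw count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tfidf_vector(tokens):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for word in vocab:
        tf = counts[word] / len(tokens)
        df = sum(1 for other in tokenized if word in other)
        vector.append(tf * math.log(n_docs / df))
    return vector


bow = [bow_vector(tokens) for tokens in tokenized]
tfidf = [tfidf_vector(tokens) for tokens in tokenized]

print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[1]])
```

A word that occurs in every document ("loves" here) gets a TF-IDF weight of zero, which is the point of the IDF term: it down-weights words that do not help tell documents apart.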


## spaCy vs NLTK

- spaCy is object-oriented, whereas NLTK is mainly a string-processing library.
- NLTK offers a lot of customization, which suits researchers, whereas spaCy suits application developers because it chooses the algorithm for each task for you.

In [1]:
import spacy

In [2]:
# Sentence tokenization with spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Strange loves pav bhaji in mumbai. Hulk loves chat of delhi")

# doc.sents yields one Span per detected sentence
for sentence in doc.sents:
    print(sentence)

Dr. Strange loves pav bhaji in mumbai.
Hulk loves chat of delhi


In [3]:
import nltk
from nltk.tokenize import sent_tokenize

# Sentence tokenization with NLTK (its Punkt tokenizer data must be downloaded beforehand)
sent_tokenize("Dr. Strange loves pav bhaji in mumbai. Hulk loves chat of delhi")

['Dr.', 'Strange loves pav bhaji in mumbai.', 'Hulk loves chat of delhi']

## Tokenization in spaCy

Tokenization is the process of splitting text into meaningful segments, called tokens.

In [4]:
nlp = spacy.blank("en")  # blank English pipeline: tokenizer only, no trained components

doc = nlp("Dr. Strange loves pav bhaji in mumbai as it costs only 2$ per plate")

for token in doc:
    print(token)

Dr.
Strange
loves
pav
bhaji
in
mumbai
as
it
costs
only
2
$
per
plate


In [5]:
doc[3]

pav

In [6]:
type(nlp)

spacy.lang.en.English

In [7]:
type(doc)


spacy.tokens.doc.Doc

In [8]:
doc[1:4]

Strange loves pav

In [9]:
doc = nlp("Tony gave two $ to Peter.")

token0 = doc[0]
token0

Tony

In [10]:
dir(token0)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang

In [11]:
type(token0)

spacy.tokens.token.Token

In [12]:
token0.like_num

False

In [13]:
token2 = doc[2]
token2

two

In [15]:
token2.like_num

True

In [17]:
token3 = doc[3]
token3.is_currency

True
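
The cells above inspect one attribute at a time. As a rough sketch, the same checks can be run over every token at once. This assumes only a blank English pipeline, so no model download is needed, and the choice of columns is arbitrary.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("Tony gave two $ to Peter.")

# One row per token: index, text, and a few boolean lexical flags.
rows = [
    (token.i, token.text, token.is_alpha, token.like_num, token.is_currency, token.is_punct)
    for token in doc
]

print("i | text | is_alpha | like_num | is_currency | is_punct")
for row in rows:
    print(" | ".join(str(value) for value in row))
```

Note that `like_num` is a lexical check on the token text, which is why the spelled-out word "two" counts as number-like.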

In [18]:
with open('students.txt') as f:
    text = f.readlines()

text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n']

In [19]:
text = ' '.join(text)
text



In [20]:
doc = nlp(text)

emails = []
for token in doc:
    if token.like_email:
        emails.append(token.text)
emails

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']
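
The extraction loop can be wrapped in a small reusable helper. This is a sketch only: `extract_emails` is a hypothetical name, and the sample lines are copied from the `students.txt` output above so the example does not depend on the file being present.

```python
import spacy

nlp = spacy.blank("en")  # like_email is a lexical attribute, so the tokenizer alone is enough


def extract_emails(text):
    """Return the text of every token that spaCy flags as email-like."""
    return [token.text for token in nlp(text) if token.like_email]


# Inline stand-in for the file contents, joined the same way as in the notebook.
lines = [
    "Virat   5 June, 1882    virat@kohli.com\n",
    "Maria\t12 April, 2001  maria@sharapova.com\n",
    "Serena  24 June, 1998   serena@williams.com \n",
]
emails = extract_emails(" ".join(lines))
print(emails)
```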

In [21]:
doc = nlp("gimme double cheese extra large pizza")

token = [token.text for token in doc]
token

['gimme', 'double', 'cheese', 'extra', 'large', 'pizza']

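Adding a special-case rule to the tokenizer. A special case may split a token, but the pieces must concatenate back to the original string, so mapping "gimme" to "give" + "me" raises an error, while "gim" + "me" works.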

In [22]:
from spacy.symbols import ORTH
nlp.tokenizer.add_special_case("gimme",[
    {ORTH: "give"},
    {ORTH: "me"}
])

doc = nlp("gimme double cheese extra large pizza")

token = [token.text for token in doc]
token

ValueError: [E997] Tokenizer special cases are not allowed to modify the text. This would map 'gimme' to 'giveme' given token attributes '[{65: 'give'}, {65: 'me'}]'.

In [23]:
from spacy.symbols import ORTH
nlp.tokenizer.add_special_case("gimme",[
    {ORTH: "gim"},
    {ORTH: "me"}
])

doc = nlp("gimme double cheese extra large pizza")

token = [token.text for token in doc]
token

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'pizza']
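
If the goal is still to treat "gimme" as "give me", one possible approach is to keep the surface text intact and set the `NORM` attribute on the first piece. This is a sketch, assuming a blank English pipeline. It relies on special cases accepting `NORM` alongside `ORTH`, which is how spaCy's own English tokenizer exceptions are written.

```python
import spacy
from spacy.attrs import NORM, ORTH

nlp = spacy.blank("en")

# ORTH values must still join back to "gimme"; NORM carries the normalized form.
nlp.tokenizer.add_special_case(
    "gimme",
    [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}],
)

doc = nlp("gimme double cheese extra large pizza")
texts = [token.text for token in doc]
norms = [token.norm_ for token in doc]

print(texts)
print(norms)
```

Downstream code can then read `token.norm_` instead of `token.text` wherever the normalized wording matters.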

In [24]:
doc = nlp("Dr. Strange loves pav bhaji in mumbai. Hulk loves chat of delhi")

for sentence in doc.sents:
    print(sentence)

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

In [25]:
nlp.pipeline

[]

In [26]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x343e1a750>

In [27]:
nlp.pipe_names

['sentencizer']

In [28]:
doc = nlp("Dr. Strange loves pav bhaji in mumbai. Hulk loves chat of delhi")

for sentence in doc.sents:
    print(sentence)

Dr. Strange loves pav bhaji in mumbai.
Hulk loves chat of delhi
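
The E030 error can be avoided by checking whether sentence boundaries have been set before iterating over `doc.sents`. A minimal sketch, assuming spaCy v3, where `Doc.has_annotation` is available:

```python
import spacy

text = "Dr. Strange loves pav bhaji in mumbai. Hulk loves chat of delhi"

nlp = spacy.blank("en")

# A blank pipeline sets no sentence boundaries.
before = nlp(text).has_annotation("SENT_START")

# Add the rule-based sentencizer only if it is not already in the pipeline.
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")

doc = nlp(text)
after = doc.has_annotation("SENT_START")
sentences = [sent.text for sent in doc.sents]

print(before, after)
print(sentences)
```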


## spaCy Language Processing Pipeline

- Blank NLP Pipeline
- Download pre-trained pipeline
- Named Entity Recognition
- Trained processing pipeline in French
- Add component to the blank pipeline 

In [30]:
nlp = spacy.blank("en")

doc = nlp("Dr. Strange loves pav bhaji in mumbai as it costs only 2$ per plate")

for token in doc:
    print(token)

Dr.
Strange
loves
pav
bhaji
in
mumbai
as
it
costs
only
2
$
per
plate


In [31]:
nlp.pipe_names

[]

In [32]:
nlp = spacy.load("en_core_web_sm")

In [33]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [34]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x344f43170>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x345331f70>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x3450065e0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x3440fd850>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x3440bf250>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x3450066c0>)]

In [35]:
doc = nlp("Captain America ate $100 of samosa. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Captain  |  PROPN  |  Captain
America  |  PROPN  |  America
ate  |  VERB  |  eat
$  |  SYM  |  $
100  |  NUM  |  100
of  |  ADP  |  of
samosa  |  NOUN  |  samosa
.  |  PUNCT  |  .
Then  |  ADV  |  then
he  |  PRON  |  he
said  |  VERB  |  say
I  |  PRON  |  I
can  |  AUX  |  can
do  |  VERB  |  do
this  |  PRON  |  this
all  |  DET  |  all
day  |  NOUN  |  day
.  |  PUNCT  |  .


In [36]:
doc = nlp("Tesla Inc acquires twitter for $45 billion.")
for ent in doc.ents:
    print(ent.text," | " ,ent.label_, " | ",spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [37]:
# Render the entities as highlighted HTML with displaCy
from spacy import displacy

displacy.render(doc,style="ent")

'<div class="entities" style="line-height: 2.5; direction: ltr">\n<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    Tesla Inc\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">ORG</span>\n</mark>\n acquires twitter for \n<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    $45 billion\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">MONEY</span>\n</mark>\n.</div>'

In [38]:
nlp=spacy.load('fr_core_news_sm')

In [39]:
doc = nlp("Tesla Inc acquiert Twitter pour 45 milliards de dollars.")
for ent in doc.ents:
    print(ent.text," | " ,ent.label_, " | ",spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


Adding a component to a blank pipeline. Passing `source=` copies the trained `ner` component from the loaded English pipeline.

In [40]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")

nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']