# spaCy and NLTK Tutorial

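## Sentence & Word Tokenization in spaCy
- spaCy is object-oriented: calling `nlp` on a text returns a `Doc` object, and its sentences and tokens are objects as well.
- It provides the most efficient NLP algorithm for a given task. Hence, if we care mainly about the end result, go with spaCy.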

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")

for sentence in doc.sents:
  print(sentence)

Dr. Strange loves pav bhaji of mumbai.
Hulk loves chat of delhi


In [3]:
for sentence in doc.sents:
  for word in sentence:
    print(word)

Dr.
Strange
loves
pav
bhaji
of
mumbai
.
Hulk
loves
chat
of
delhi


## NLTK
- NLTK is mainly a string-processing library.
- It provides access to many algorithms. If we care about a specific algorithm and customization, go with NLTK.

In [11]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [9]:
from nltk.tokenize import sent_tokenize

In [12]:
sent_tokenize("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")

['Dr.', 'Strange loves pav bhaji of mumbai.', 'Hulk loves chat of delhi']

Notice that the tokenizer did not work well here: it treated `Dr.` as a sentence of its own. NLTK is still a good choice in many cases because it offers more tokenizers to choose from.

In [13]:
from nltk.tokenize import word_tokenize
word_tokenize("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")

['Dr',
 '.',
 'Strange',
 'loves',
 'pav',
 'bhaji',
 'of',
 'mumbai',
 '.',
 'Hulk',
 'loves',
 'chat',
 'of',
 'delhi']
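As a small illustration of the "many algorithms" point, the sketch below tries two other tokenizers from `nltk.tokenize`. Both are rule-based, so they should not need the `punkt_tab` download. The choice of these two classes and the variable names are only for illustration.

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

text = "Dr. Strange loves pav bhaji of mumbai."

# Split on whitespace only: punctuation stays attached to the words.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print(whitespace_tokens)
# ['Dr.', 'Strange', 'loves', 'pav', 'bhaji', 'of', 'mumbai.']

# Keep only runs of word characters: punctuation is dropped entirely.
word_only_tokens = RegexpTokenizer(r"\w+").tokenize(text)
print(word_only_tokens)
# ['Dr', 'Strange', 'loves', 'pav', 'bhaji', 'of', 'mumbai']
```

Each tokenizer makes a different trade-off, which is the kind of control NLTK is chosen for.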

# spaCy
- Tokenization: the process of splitting text into meaningful segments.
- It is not just about splitting sentences at full stops or splitting words at spaces.
- [more practice](https://www.firstlanguage.ai/)
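To see why naive splitting falls short, the sketch below uses only Python's built-in `str.split` on two sentences from this tutorial. It illustrates the problem and is not how spaCy works internally.

```python
# Naive splitting with plain Python, to contrast with a real tokenizer.
text = "Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi"

# Splitting on ". " wrongly treats the abbreviation "Dr." as a sentence end.
naive_sentences = text.split(". ")
print(naive_sentences)
# ['Dr', 'Strange loves pav bhaji of mumbai', 'Hulk loves chat of delhi']

# Splitting on whitespace leaves quotes and punctuation glued to the words.
naive_words = '"Let\'s go to N.Y.!"'.split()
print(naive_words)
# ['"Let\'s', 'go', 'to', 'N.Y.!"']
```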

In [14]:
import spacy

In [18]:
nlp = spacy.blank("en") # create language object with tokenizer. 'en' stands for english.

doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.") # provide text to the object and do word tokenization

for token in doc:
    print(token)

Dr.
Strange
loves
pav
bhaji
of
mumbai
as
it
costs
only
2
$
per
plate
.


In [17]:
doc2 = nlp('''"Let's go to N.Y.!"''')

for token in doc2:
    print(token)

"
Let
's
go
to
N.Y.
!
"


In [19]:
type(nlp)

In [20]:
type(doc)

spacy.tokens.doc.Doc

In [21]:
type(doc[1:5])

spacy.tokens.span.Span

In [22]:
type(doc[0])

spacy.tokens.token.Token
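The three types above nest inside each other: a `Doc` is a sequence of `Token` objects, and slicing it gives a `Span`. A minimal sketch of how they relate, reusing the same sentence (the variable names `span` and `first_token` are only for illustration):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate.")

span = doc[1:5]        # slicing a Doc gives a Span
first_token = doc[0]   # indexing a Doc gives a Token

print(len(doc))                  # number of tokens in the Doc
print(span.text)                 # Strange loves pav bhaji
print(span.start, span.end)      # token offsets of the Span within the Doc
print(first_token.text, first_token.i)

# Tokens inside a Span keep their position in the parent Doc.
print(span[0].i)
```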

In [23]:
dir(doc[0]) # show all attributes in doc[0]

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang

In [27]:
for token in doc:
    print(token, "==>", "index: ", token.i, "is_alpha:", token.is_alpha,
          "is_punct:", token.is_punct,
          "like_num:", token.like_num,
          "is_currency:", token.is_currency,
         )

Dr. ==> index:  0 is_alpha: False is_punct: False like_num: False is_currency: False
Strange ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
loves ==> index:  2 is_alpha: True is_punct: False like_num: False is_currency: False
pav ==> index:  3 is_alpha: True is_punct: False like_num: False is_currency: False
bhaji ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
of ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
mumbai ==> index:  6 is_alpha: True is_punct: False like_num: False is_currency: False
as ==> index:  7 is_alpha: True is_punct: False like_num: False is_currency: False
it ==> index:  8 is_alpha: True is_punct: False like_num: False is_currency: False
costs ==> index:  9 is_alpha: True is_punct: False like_num: False is_currency: False
only ==> index:  10 is_alpha: True is_punct: False like_num: False is_currency: False
2 ==> index:  11 is_alpha: False is_punct: False like_num: True

## Use cases
- Situation: we have student data in a .txt file and want to extract the email address of every student.

In [28]:
cd /content/drive/MyDrive/Study/NLP/codebasics

/content/drive/MyDrive/Study/NLP/codebasics


In [29]:
with open("students.txt") as f:
  text = f.readlines()
text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n',
 '\n',
 '\n']

In [30]:
text = ' '.join(text)
text



In [32]:
students_doc = nlp(text)
emails = []
for token in students_doc:
  if token.like_email:
    emails.append(token.text)

emails

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']
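The same idea can be tried without the `students.txt` file. The sketch below copies four rows from the output above into an inline string and wraps the loop in a helper; `extract_emails` is a hypothetical name introduced here, not part of spaCy.

```python
import spacy

nlp = spacy.blank("en")

# Inline sample rows copied from the students.txt output shown above.
text = (
    "Virat   5 June, 1882    virat@kohli.com\n"
    "Maria\t12 April, 2001  maria@sharapova.com\n"
    "Serena  24 June, 1998   serena@williams.com \n"
    "Joe      1 May, 1997    joe@root.com\n"
)

def extract_emails(nlp, text):
    """Hypothetical helper: return the text of every token that looks like an email."""
    return [token.text for token in nlp(text) if token.like_email]

emails = extract_emails(nlp, text)
print(emails)
```

Because `like_email` is a lexical attribute, a blank pipeline with only a tokenizer is enough here.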

## Tokenization in Hindi

In [33]:
nlp = spacy.blank("hi")
doc = nlp("भैया जी! 5000 ₹ उधार थे वो वापस देदो")

for token in doc:
  print(token, token.is_currency)

भैया False
जी False
! False
5000 False
₹ True
उधार False
थे False
वो False
वापस False
देदो False


## Customizing tokenizer

In [34]:
nlp = spacy.blank("en")
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

Notice that the sentence contains a slang word (gimme). We can customize the tokenizer with a special case so that it is split into two tokens, `gim` and `me`.

In [35]:
from spacy.symbols import ORTH

nlp.tokenizer.add_special_case("gimme",[
    {ORTH:"gim"},
    {ORTH:"me"}
])

doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
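The pieces of a special case must add up to the original string, so the tokenizer cannot literally rewrite `gimme` as `give me`. As a minimal sketch, assuming the `NORM` attribute is accepted next to `ORTH` in special cases (as it is in spaCy v3), a normalized form can be attached to the token instead:

```python
import spacy
from spacy.symbols import NORM, ORTH

nlp = spacy.blank("en")

# The ORTH values must concatenate back to "gimme";
# NORM stores a normalized form without changing the token text.
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim", NORM: "give"},
    {ORTH: "me"},
])

doc = nlp("gimme double cheese extra large healthy pizza")
texts = [token.text for token in doc]
norms = [token.norm_ for token in doc]

print(texts)   # token text is still 'gim', 'me', ...
print(norms)   # the first entry is the normalized form 'give'
```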

## Sentence Tokenization or Segmentation

In [36]:
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")

for sentence in doc.sents:
  print(sentence)

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

The sentence boundaries are unset because we created a blank `nlp` object, which has only a tokenizer. To solve this, we can add the `sentencizer` component to the pipeline.

In [38]:
nlp.pipe_names

[]

In [39]:
nlp.add_pipe('sentencizer')
nlp.pipe_names

['sentencizer']

In [40]:
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")

for sentence in doc.sents:
  print(sentence)

Dr. Strange loves pav bhaji of mumbai.
Hulk loves chat of delhi


As the previous cells show, we can modify the pipeline by adding components to it.

## Language processing pipeline

### Blank NLP pipeline

In [41]:
import spacy

nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


In [43]:
nlp.pipe_names

[]
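An empty `pipe_names` list means that only the tokenizer runs. A short sketch of what that implies: attributes that trained components would normally fill in stay empty on a blank pipeline.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Captain america ate 100$ of samosa.")

# No tagger has run, so every part-of-speech tag is an empty string.
pos_tags = [token.pos_ for token in doc]
print(pos_tags)

# No named entity recognizer has run, so no entities are found.
print(len(doc.ents))
```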

At the start of this tutorial, we used `spacy.load("en_core_web_sm")`, which loads a pre-trained pipeline.

In [44]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [46]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token, ' | ', token.pos_, " | ", token.lemma_)  # pos = part of speech, lemma = base form of the word

Captain  |  PROPN  |  Captain
america  |  PROPN  |  america
ate  |  VERB  |  eat
100  |  NUM  |  100
$  |  NUM  |  $
of  |  ADP  |  of
samosa  |  PROPN  |  samosa
.  |  PUNCT  |  .
Then  |  ADV  |  then
he  |  PRON  |  he
said  |  VERB  |  say
I  |  PRON  |  I
can  |  AUX  |  can
do  |  VERB  |  do
this  |  PRON  |  this
all  |  DET  |  all
day  |  NOUN  |  day
.  |  PUNCT  |  .


### Named Entity Recognition

In [47]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY


In [49]:
from spacy import displacy

displacy.render(doc, style="ent")

### Trained processing pipeline in French

In [52]:
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 37.0 MB/s eta 0:00:00
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [53]:
nlp = spacy.load("fr_core_news_sm")

In [54]:
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  PER  |  Named person or family.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [55]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  VERB  |  twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


### Adding a component to a blank pipeline

In [56]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source = source_nlp)  # create a pipeline with only 'ner'
nlp.pipe_names

['ner']

In [59]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY
