## Tokenization

Read the spaCy documentation on tokenization first:
[Tokenization](https://spacy.io/usage/linguistic-features#tokenization)

In [1]:
import spacy

In [2]:
nlp = spacy.blank('en')

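The tokenizer is a language-aware component: it does not merely split the text on a delimiter (as `.split(" ")` would) but applies language-specific rules. As a result, abbreviations such as `Dr.` stay in one piece and sentence-final punctuation becomes its own token, so we get meaningful terms.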
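
For comparison, here is a minimal pure-Python sketch of what a naive whitespace split does to the sentence used in the next cell. It involves no spaCy at all; the variable name `naive_words` is only illustrative.

```python
input_text = "Dr. Strange visits Mumbai and he loves Pav Bhaji a lot as it costs just 100 Rs a plate."

# Plain string splitting knows nothing about language: it only cuts on spaces
naive_words = input_text.split(" ")

print(len(naive_words))
print(naive_words[-1])  # the final period stays glued to the last word
```

The last element comes out as `plate.` with the period attached, whereas the spaCy tokenizer emits `plate` and `.` as separate tokens.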

In [10]:
input_text = "Dr. Strange visits Mumbai and he loves Pav Bhaji a lot as it costs just 100 Rs a plate."

doc = nlp(input_text)

# Iterating over a Doc yields the Token objects produced by the tokenizer
for token in doc:
    print(token)

# By contrast, iterating over a plain Python string yields one character at a time:
# for word in input_text:
#     print(word)

Dr.
Strange
visits
Mumbai
and
he
loves
Pav
Bhaji
a
lot
as
it
costs
just
100
Rs
a
plate
.


In [13]:
print(doc[0])

print(type(doc[0]) is list)

Dr.
False


In [None]:
type(doc[0])  # it's a Token, not a list

spacy.tokens.token.Token

In [16]:
span = doc[1:5]
print(span)
print(type(span))

Strange visits Mumbai and
<class 'spacy.tokens.span.Span'>


In [None]:
dir(doc[0])

In [19]:
doc[0]

Dr.

In [20]:
doc[0].is_alpha

False

In [21]:
doc[0].like_num

False

In [22]:
doc[0].text

'Dr.'

In [23]:
doc[15]

100

In [24]:
doc[15].like_num

True

In [27]:
doc[15].is_currency

False
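
The attribute checks above can be gathered into one pass over the document. The following is a minimal sketch on a made-up sentence (the `$2` is added purely for illustration and is not from the lecture). It suggests why `is_currency` was `False` for `100`: the flag describes currency symbols such as `$`, not amounts, and `Rs` consists of ordinary letters.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Pav Bhaji costs 100 Rs or about $2 a plate.")

# One row per token: index, text, and the lexical flags explored above
rows = [(t.i, t.text, t.is_alpha, t.like_num, t.is_currency) for t in doc]
for row in rows:
    print(row)

number_tokens = [t.text for t in doc if t.like_num]
currency_tokens = [t.text for t in doc if t.is_currency]
print("numbers:", number_tokens)
print("currency symbols:", currency_tokens)
```

Note that the tokenizer splits `$2` into `$` and `2`, so each piece gets its own flags.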

### Reading from a text file and finding some info

In [29]:
with open('students.txt', "r") as f:
    text = f.readlines()
text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com']

In [30]:
text = ' '.join(text)
text



In [31]:
doc = nlp(text)

emails = []

for token in doc:
    if token.like_email:
        emails.append(token.text)  # store the text, not the Token object

print('List of all emails from text file:', emails)

List of all emails from text file: ['virat@kohli.com', 'maria@sharapova.com', 'serena@williams.com', 'joe@root.com']


### Let's try Hindi

In [32]:
nlp_hindi = spacy.blank('hi')

In [34]:
my_text = 'नमस्ते, मैं एक उत्कृष्ट एआई इंजीनियर बनने जा रहा हूँ!'

doc = nlp_hindi(my_text)

for token in doc:
    print(token, token.like_num, token.is_alpha)

नमस्ते False False
, False False
मैं False False
एक True True
उत्कृष्ट False False
एआई False True
इंजीनियर False False
बनने False False
जा False False
रहा False False
हूँ False False
! False False
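
Most of the Hindi words above report `is_alpha` as `False`, which can look surprising. A likely explanation, assuming spaCy's `is_alpha` follows Python's `str.isalpha()`, is that Devanagari vowel signs and the virama are Unicode combining marks rather than letters. The sketch below uses only the standard library; `mark_characters` is a hypothetical helper written for this illustration.

```python
import unicodedata

def mark_characters(word):
    """Return (Unicode name, category) for every non-letter character in word."""
    return [
        (unicodedata.name(ch), unicodedata.category(ch))
        for ch in word
        if not ch.isalpha()
    ]

for word in ["एक", "एआई", "नमस्ते", "मैं"]:
    # Words built only from independent letters are alphabetic;
    # words containing combining marks (categories Mn/Mc) are not
    print(word, word.isalpha(), mark_characters(word))
```

Words such as `एक` and `एआई` contain only independent letters, which matches the `True` values in the output above.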


### Customize Tokenizer Rule

In [35]:
doc = nlp('gimme double cheese large healthy pizza')

tokens = [token.text for token in doc]

tokens

['gimme', 'double', 'cheese', 'large', 'healthy', 'pizza']

We want to customize the tokenizer, e.g. so that `gimme` is split into two tokens, `gim` and `me` (informally, "give me").

In [None]:
from spacy.symbols import ORTH

# A special case cannot change the text: the ORTH values must join back to
# exactly "gimme", so "give" + "me" is not allowed here.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp('gimme double cheese large healthy pizza')

tokens = [token.text for token in doc]

tokens

['gim', 'me', 'double', 'cheese', 'large', 'healthy', 'pizza']
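
If the goal is to recover "give" for the first piece, one option is to leave the surface text alone and set a normalized form instead. This is a minimal sketch assuming a recent spaCy v3 release, where special cases may specify `ORTH` and `NORM`; it first shows the text-changing rule being rejected, then registers a rule that keeps the text and attaches the norm.

```python
import spacy
from spacy.symbols import NORM, ORTH

nlp = spacy.blank("en")

# Attempt a rule whose pieces do not join back to "gimme"
try:
    nlp.tokenizer.add_special_case("gimme", [{ORTH: "give"}, {ORTH: "me"}])
except ValueError as err:
    print("rejected:", type(err).__name__)

# Keep the original text in ORTH and put the intended word in NORM
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}])

doc = nlp("gimme double cheese large healthy pizza")
pairs = [(token.text, token.norm_) for token in doc]
print(pairs)
```

Downstream code can then read `token.norm_` when it wants the normalized word and `token.text` when it needs the original characters.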

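What we used in this lecture is a blank English pipeline, `spacy.blank('en')`, which contains only the tokenizer, so its features are limited. When we use a full-fledged trained pipeline such as `en_core_web_sm`, we also get sentence segmentation and other pipeline steps (for example the tagger, parser, and named entity recognizer).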
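
To see the difference without downloading a trained pipeline, a blank pipeline can be given a single rule-based component. This is a minimal sketch assuming spaCy v3, where `nlp.add_pipe` accepts a component name; the example sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
blank_components = list(nlp.pipe_names)
print(blank_components)  # no components are listed for a blank pipeline

# Add the rule-based sentence splitter so that doc.sents becomes available
nlp.add_pipe("sentencizer")

doc = nlp("I love Pav Bhaji. It costs just 100 Rs a plate!")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

A trained pipeline such as `en_core_web_sm` has to be installed separately (for example with `python -m spacy download en_core_web_sm`) before `spacy.load` can use it.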