<a href="https://colab.research.google.com/github/sidharth178/Natural-Language-Processing-Tutorial/blob/master/1_Tokenization_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tokenization in Python**

- **Tokenization splits a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords.**
- **More broadly, tokens may be words, keywords, phrases, symbols, or even whole sentences. Depending on the tokenizer, some characters such as punctuation marks may be discarded in the process.**

In [None]:
# Split on runs of non-word characters (\W+), not just whitespace
import re
text = 'I\'m with you for the entire life in U.K.!'
words = re.split(r'\W+', text)
print(words[:100])

['I', 'm', 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'U', 'K', '']


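Splitting on `\W+` treats every run of non-word characters as a separator, so the apostrophe breaks "I'm" into "I" and "m", the periods break "U.K." into "U" and "K", and the trailing "!" leaves an empty string at the end of the list.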
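
To make the difference concrete, the sketch below compares plain whitespace splitting with the `\W+` split on the same sentence and filters out the empty string produced by the trailing "!". The variable names are illustrative.

```python
import re

text = "I'm with you for the entire life in U.K.!"

# str.split() with no arguments splits on whitespace only,
# so punctuation stays attached to the words.
whitespace_tokens = text.split()

# re.split(r'\W+') splits on every run of non-word characters;
# the list comprehension drops the empty string left by the trailing "!".
nonword_tokens = [tok for tok in re.split(r'\W+', text) if tok]

print(whitespace_tokens)
print(nonword_tokens)
```

Neither result is ideal on its own: the first keeps "U.K.!" with its punctuation, the second splits the contraction and the abbreviation.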

**Remove punctuation and split into words**

In [None]:
import string
import re
# split into words by white space
words = text.split()
# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in words]
print(stripped[:100])

['Im', 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'UK']
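
As an alternative to the compiled regex, the following sketch removes the same `string.punctuation` characters with `str.translate` and a deletion table built by `str.maketrans`. It is one possible variant, not a replacement for the cell above.

```python
import string

text = "I'm with you for the entire life in U.K.!"

# The third argument of str.maketrans lists the characters to delete.
delete_punct = str.maketrans('', '', string.punctuation)

# Split on whitespace first, then strip punctuation from each word.
stripped = [word.translate(delete_punct) for word in text.split()]
print(stripped)
```

Both approaches only cover ASCII punctuation, because `string.punctuation` contains ASCII characters only; typographic quotes, for example, are left in place.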


In [None]:
# string.printable holds all printable ASCII characters; this pattern removes everything else
re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', w) for w in words]
print(result)


["I'm", 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'U.K.!']
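
The output is unchanged because the sample sentence contains only printable ASCII characters. A small sketch with a made-up input that does contain other characters shows what the filter removes. Note that it also drops accented letters, since `string.printable` covers ASCII only.

```python
import re
import string

re_print = re.compile('[^%s]' % re.escape(string.printable))

# Hypothetical input: an accented letter (U+00E9) and a zero-width space (U+200B).
sample = ['caf\u00e9', 'zero\u200bwidth', 'plain']

# Delete every character that is not in string.printable.
cleaned = [re_print.sub('', w) for w in sample]
print(cleaned)
```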


In [None]:
# Normalizing Case

# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

["i'm", 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'u.k.!']
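
`str.lower()` is enough for English text. For caseless matching across languages, Python also offers `str.casefold()`, which is more aggressive. The sketch below uses a made-up word list to show one case where the two differ.

```python
# The second word is the German "Strasse" written with a sharp s (U+00DF).
words = ['U.K.!', 'Stra\u00dfe']

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)
print(folded)
```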


# **spaCy**

spaCy is an open-source Python library for advanced natural language processing. It can be used to build information extraction and natural language understanding systems, and to pre-process text for deep learning.

In [None]:
# ----------------------------- Working with spaCy -----------------------------

# Install spaCy (reference: https://spacy.io/usage/facts-figures#benchmarks):
# conda install -c conda-forge spacy
# or
# !pip install -U spacy

# Alternatively, create a dedicated conda environment:
# conda create -n spacyenv python=3 spacy=2

Collecting spacy
  Downloading spacy-3.0.5-cp38-cp38-win_amd64.whl (11.8 MB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.5-cp38-cp38-win_amd64.whl (21 kB)
Collecting catalogue<2.1.0,>=2.0.1
  Downloading catalogue-2.0.1-py3-none-any.whl (9.6 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.5-cp38-cp38-win_amd64.whl (112 kB)
Collecting spacy-legacy<3.1.0,>=3.0.0
  Downloading spacy_legacy-3.0.1-py2.py3-none-any.whl (7.0 kB)
Collecting srsly<3.0.0,>=2.4.0
  Downloading srsly-2.4.0-cp38-cp38-win_amd64.whl (451 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.4.0-py3-none-any.whl (36 kB)
Collecting pydantic<1.8.0,>=1.7.1
  Downloading pydantic-1.7.3-cp38-cp38-win_amd64.whl (1.8 MB)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.4-cp38-cp38-win_amd64.whl (6.5 MB)
Collecting thinc<8.1.0,>=8.0.2
  Downloading thinc-8.0.2-cp38-cp38-win_amd64.whl (1.0 MB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting wasab

In [None]:
# Download the small English pipeline en_core_web_sm:
# !python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)

Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Naming convention: en = English, core = general-purpose pipeline,
# web = trained on web text, sm = small model size.
# spacy.load returns the pipeline object that turns text into Doc objects.

In [None]:
# Named `sentence` to avoid shadowing the `string` module imported earlier
sentence = '"I\'m with you for the entire life in P.K.!"'
print(sentence)

"I'm with you for the entire life in P.K.!"


In [None]:
# Tokenize the text and print each token's text
doc = nlp(sentence)
for token in doc:
    print(token.text, end=' | ')

" | I | 'm | with | you | for | the | entire | life | in | P.K. | ! | " | 
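
spaCy's tokenization is non-destructive: each token remembers its trailing whitespace, so the original text can be rebuilt from the tokens. The sketch below assumes only that spaCy is installed. It uses `spacy.blank('en')`, a tokenizer-only English pipeline that needs no model download, and a made-up sentence. The names `nlp_blank` and `demo_doc` are illustrative.

```python
import spacy

# Tokenizer-only English pipeline (no trained components required).
nlp_blank = spacy.blank('en')

text = 'Hello, world! Tokens keep their whitespace.'
demo_doc = nlp_blank(text)

# text_with_ws is the token text plus any trailing whitespace.
rebuilt = ''.join(token.text_with_ws for token in demo_doc)

# is_punct is a lexical attribute, so it works without a trained model.
punct = [token.text for token in demo_doc if token.is_punct]

print(rebuilt == text)
print(punct)
```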

In [None]:
doc

"I'm with you for the entire life in P.K.!"

In [None]:
# The e-mail address and URL stay whole, while contractions, hyphens, and punctuation are split off
doc2 = nlp(u"We're here to help! Send snail-mail, email fahad@gmail.com or visit us at https://fahadhussaincs.blogspot.com/!")
for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
fahad@gmail.com
or
visit
us
at
https://fahadhussaincs.blogspot.com/
!


In [None]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')
for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


In [None]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")
for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


**Count the tokens in the Doc**

In [None]:
len(doc4)

11

**Count the entries in the pipeline's vocabulary**

In [None]:
# Number of lexemes currently stored in the vocabulary shared by the pipeline (not just this Doc)
len(doc.vocab)

798

**Retrieve the third token**

In [None]:
doc5 = nlp(u'It is better to give than to receive.')
# Retrieve the third token:
doc5[2]

better

**Retrieve the third through fifth tokens**

In [None]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give

In [None]:
# Retrieve the last four tokens:
doc5[-4:]

than to receive.
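
Indexing a `Doc` with an integer returns a `Token`, while slicing returns a `Span`, a view over consecutive tokens. A short sketch, assuming `spacy.blank('en')` so that no model download is needed (the names `nlp_blank` and `demo_doc` are illustrative):

```python
import spacy
from spacy.tokens import Span, Token

nlp_blank = spacy.blank('en')
demo_doc = nlp_blank('It is better to give than to receive.')

third = demo_doc[2]      # a single Token
middle = demo_doc[2:5]   # a Span covering tokens 2, 3 and 4
tail = demo_doc[-4:]     # a Span covering the last four tokens

print(type(third).__name__, type(middle).__name__)
print(middle.text, '|', tail.text, '|', len(tail))
```

The final period counts as a token of its own, which is why the last four tokens read "than to receive.".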

In [None]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')

**A Doc does not support item assignment, so a token cannot be overwritten with a token from another Doc**

In [None]:
# Try to change "My dinner was horrible" to "My dinner was delicious"
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
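
Because a `Doc` cannot be edited in place, one workaround is to copy the token texts into a list, change the list, and build a new `Doc` from it. This is a sketch rather than the only approach. It uses `spacy.blank('en')` so it runs without a model, and the new `Doc` carries no annotations from the original. The names `doc_a`, `doc_b`, and `new_doc` are illustrative.

```python
import spacy
from spacy.tokens import Doc

nlp_blank = spacy.blank('en')
doc_a = nlp_blank('My dinner was horrible.')
doc_b = nlp_blank('Your dinner was delicious.')

# Copy the token texts and whitespace flags, then edit the copy.
words = [token.text for token in doc_a]
spaces = [bool(token.whitespace_) for token in doc_a]
words[3] = doc_b[3].text

# Build a fresh Doc from the edited word list; doc_a itself is unchanged.
new_doc = Doc(nlp_blank.vocab, words=words, spaces=spaces)
print(new_doc.text)
```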

### **Named entities and what their labels mean**

In [None]:
# Print the tokens, then each named entity with its label and a description of the label
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')

print('\n----')
# doc8.ents holds the named entities recognized in the Doc
for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


Here spaCy labels "Apple" as ORG (companies, agencies, institutions), "Hong Kong" as GPE (countries, cities, states), and "$6 million" as MONEY. `spacy.explain` returns the human-readable description of each label.

In [None]:
len(doc8.ents)

3

In [None]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")
# noun_chunks yields the base noun phrases in the Doc (a noun plus the words describing it)
for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [None]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates


In [None]:
doc11 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")

for chunk in doc11.noun_chunks:
    print(chunk.text)

He
purple people-eater


In [None]:
# displacy.render with style='dep' draws each token's part-of-speech tag
# and the dependency arcs between tokens
from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [None]:
# style='ent' highlights the named entities and their labels inline in the text
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# Outside Jupyter, displacy.serve starts a local web server that hosts the visualization
doc = nlp(u'This is a sentence.')
displacy.serve(doc, style='dep')




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


# **Tokenization - KN**

In [None]:
!pip install nltk



In [None]:
# Tokenization of paragraphs/sentences
import nltk
# nltk.download("popular")  # smaller alternative: downloads only the popular NLTK packages
nltk.download('all')  # downloads every NLTK corpus and model (large)

In [None]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""
               


In [None]:
# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)

# Tokenizing words
words = nltk.word_tokenize(paragraph)


In [None]:
sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my \n               first vision is that of freedom.',
 'I believe that India got its first vision of \n               this in 1857, when we started the War of Independence.',
 'It is this freedom that\n               we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s developme
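
Two things stand out in the sentence list: the newlines and indentation of the triple-quoted string survive inside the sentences, and "others.That" was not split because no space follows the period. A minimal pre-processing sketch with the standard library, applied before tokenizing, addresses both. The regex is an assumption that suits this paragraph and can misfire on text such as dotted names, so treat it as illustrative.

```python
import re

# Excerpt of the paragraph, including the line break and indentation.
raw = """Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom."""

# Collapse newlines and runs of spaces into single spaces.
flat = ' '.join(raw.split())

# Insert a space after a period that sits between a lowercase and an uppercase letter.
fixed = re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', flat)

print(fixed)
```

Passing the cleaned string to `nltk.sent_tokenize` should then give the tokenizer a chance to separate the two sentences.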

In [None]:
words

['I',
 'have',
 'three',
 'visions',
 'for',
 'India',
 '.',
 'In',
 '3000',
 'years',
 'of',
 'our',
 'history',
 ',',
 'people',
 'from',
 'all',
 'over',
 'the',
 'world',
 'have',
 'come',
 'and',
 'invaded',
 'us',
 ',',
 'captured',
 'our',
 'lands',
 ',',
 'conquered',
 'our',
 'minds',
 '.',
 'From',
 'Alexander',
 'onwards',
 ',',
 'the',
 'Greeks',
 ',',
 'the',
 'Turks',
 ',',
 'the',
 'Moguls',
 ',',
 'the',
 'Portuguese',
 ',',
 'the',
 'British',
 ',',
 'the',
 'French',
 ',',
 'the',
 'Dutch',
 ',',
 'all',
 'of',
 'them',
 'came',
 'and',
 'looted',
 'us',
 ',',
 'took',
 'over',
 'what',
 'was',
 'ours',
 '.',
 'Yet',
 'we',
 'have',
 'not',
 'done',
 'this',
 'to',
 'any',
 'other',
 'nation',
 '.',
 'We',
 'have',
 'not',
 'conquered',
 'anyone',
 '.',
 'We',
 'have',
 'not',
 'grabbed',
 'their',
 'land',
 ',',
 'their',
 'culture',
 ',',
 'their',
 'history',
 'and',
 'tried',
 'to',
 'enforce',
 'our',
 'way',
 'of',
 'life',
 'on',
 'them',
 '.',
 'Why',
 '?',
 '