# Day 1:

1. NLTK
2. Sentence Tokenization (nltk.tokenize.sent_tokenize)
3. Word Tokenization (nltk.tokenize.word_tokenize)
4. Remove punctuation and numbers ("str".isalpha())
5. Removing stopwords (from nltk.corpus import stopwords)
6. Stemming (from nltk.stem import PorterStemmer)
7. Lemmatization (from nltk.stem import WordNetLemmatizer)
8. N-Grams
9. Part of Speech Tagging
10. Using spaCy

In [2]:
#!pip install nltk
#nltk.download('all')

In [3]:
#!pip install -U spacy
#python -m spacy download en_core_web_sm

In [4]:
import nltk

In [5]:
t_text = "North America is the third largest continent, and is also a portion of the third largest supercontinent if North and South America are combined into the Americas and Africa, Europe, and Asia are considered to be part of one supercontinent called Afro-Eurasia. With an estimated population of 580 million and an area of 24,709,000 km2 (9,540,000 mi2), the northernmost of the two continents of the Western Hemisphere is bounded by the Pacific Ocean on the west; the Atlantic Ocean on the east; the Caribbean Sea on the south; and the Arctic Ocean on the north. The northern half of North America is sparsely populated and covered mostly by Canada, except for the northeastern portion, which is occupied by Greenland, and the northwestern portion, which is occupied by Alaska, the largest state of the United States. The central and southern portions of the continent are occupied by the contiguous United States, Mexico, and numerous smaller states in Central America and in the Caribbean. The continent is delimited on the southeast by most geographers at the Darién watershed along the Colombia-Panama border, placing all of Panama within North America. Alternatively, a less common view would end North America at the man-made Panama Canal. Islands generally associated with North America include Greenland, the world's largest island, and archipelagos and islands in the Caribbean. The terminology of the Americas is complex, but 'Anglo-America' can describe Canada and the U.S., while 'Latin America' comprises Mexico and the countries of Central America and the Caribbean, as well as the entire continent of South America."

In [6]:
print(f"Length of the text {len(t_text)} characters")

Length of the text 1627 characters


## Sentence Tokenization

In [7]:
from nltk.tokenize import sent_tokenize

In [8]:
sentences = sent_tokenize(t_text)

In [9]:
for sent in sentences:
    print(sent)
    print("\n")

North America is the third largest continent, and is also a portion of the third largest supercontinent if North and South America are combined into the Americas and Africa, Europe, and Asia are considered to be part of one supercontinent called Afro-Eurasia.


With an estimated population of 580 million and an area of 24,709,000 km2 (9,540,000 mi2), the northernmost of the two continents of the Western Hemisphere is bounded by the Pacific Ocean on the west; the Atlantic Ocean on the east; the Caribbean Sea on the south; and the Arctic Ocean on the north.


The northern half of North America is sparsely populated and covered mostly by Canada, except for the northeastern portion, which is occupied by Greenland, and the northwestern portion, which is occupied by Alaska, the largest state of the United States.


The central and southern portions of the continent are occupied by the contiguous United States, Mexico, and numerous smaller states in Central America and in the Caribbean.


The

## Word Tokenization

In [10]:
from nltk.tokenize import word_tokenize

In [11]:
words = word_tokenize(t_text)
words[0:10]

['North',
 'America',
 'is',
 'the',
 'third',
 'largest',
 'continent',
 ',',
 'and',
 'is']

In [12]:
len(words)

292

## Convert to Lowercase

In [13]:
words = list(map(str.lower, words))

In [14]:
words[0:10]

['north',
 'america',
 'is',
 'the',
 'third',
 'largest',
 'continent',
 ',',
 'and',
 'is']

## Remove Duplicates

In [15]:
words = list(set(words))

In [16]:
len(words)

132

## Remove Punctuation and Numbers

In [17]:
words_in_para = []
for word in words:
    if word.isalpha():
        words_in_para.append(word)

In [18]:
words_in_para[0:10]

['west',
 'million',
 'island',
 'part',
 'east',
 'an',
 'asia',
 'watershed',
 'while',
 'north']

In [19]:
len(words_in_para)

114

## Remove Stopwords

In [20]:
from nltk.corpus import stopwords
stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [21]:
stopwords.words('spanish')[0:10]

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se']

In [22]:
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
new_words = []
for word in words_in_para:
    if word not in stop_words:
        new_words.append(word)

In [23]:
len(new_words)

90

In [24]:
new_words[0:10]

['west',
 'million',
 'island',
 'part',
 'east',
 'asia',
 'watershed',
 'north',
 'bounded',
 'populated']
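The cleaning steps so far (lowercasing, dropping non-alphabetic tokens, removing stopwords and duplicates) can also be folded into one pass that keeps the original word order, which `set()` does not. The sketch below is plain Python; the tiny `STOP_WORDS` set and the `clean_tokens` helper are illustrative assumptions, not NLTK's stopword list or API.

```python
# Tiny hand-written stop list for illustration only (an assumption,
# not the list returned by nltk.corpus.stopwords).
STOP_WORDS = {"is", "the", "and", "a", "of", "an"}


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic tokens, drop stop words and duplicates,
    preserving the order in which words first appear."""
    seen = set()
    cleaned = []
    for token in tokens:
        word = token.lower()
        if word.isalpha() and word not in stop_words and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


# First ten tokens from the word_tokenize output shown earlier.
tokens = ['North', 'America', 'is', 'the', 'third',
          'largest', 'continent', ',', 'and', 'is']
print(clean_tokens(tokens))
# ['north', 'america', 'third', 'largest', 'continent']
```

Passing `set(stopwords.words('english'))` as `stop_words` should give the same behaviour with the full NLTK list.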

## Stemming and Lemmatization

- Stemming:
    - It reduces a word to its root (stem).
    - It strips the last few characters (a suffix) of a word, so the result can be a misspelt or non-dictionary word.
    - Example: Plays, Playing and Played are all reduced to "play"
- Lemmatization:
    - It takes the part of speech into account.
    - It converts a word to a meaningful base form (lemma) by considering its context.
    - Example: Studying -> Study, Cycling -> Cycle
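To make the contrast concrete, here is a toy sketch in plain Python. It is not the Porter algorithm and does not use WordNet: `naive_stem` blindly strips a suffix, while `naive_lemmatize` looks the word up in a small hand-written table keyed by part of speech. Both helpers and the `LEMMAS` table are assumptions made for illustration.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-written lemma table keyed by (word, part of speech).
LEMMAS = {
    ("cycling", "v"): "cycle",
    ("playing", "v"): "play",
    ("studying", "v"): "study",
}


def naive_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get((word.lower(), pos), word.lower())


for w in ["cycling", "playing", "studying"]:
    print(w, naive_stem(w), naive_lemmatize(w, "v"))
# cycling cycl cycle
# playing play play
# studying study study
```

The stem `cycl` is not a real word, while the lemma `cycle` is; that is the practical difference between the two approaches.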

In [25]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [26]:
ps.stem('cycling')

'cycl'

In [27]:
ps.stem('Emptiness')

'empti'

In [28]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [29]:
wnl.lemmatize('playing', 'v')

'play'

In [30]:
wnl.lemmatize('cycling', 'v')

'cycle'

### Stemming

In [31]:
stemmed_words = []
for word in new_words:
    stem = ps.stem(word)  # stem once and reuse the result
    print(f"Original: Word {word}, Stem: {stem}")
    if stem not in stemmed_words:
        stemmed_words.append(stem)

Original: Word west, Stem: west
Original: Word million, Stem: million
Original: Word island, Stem: island
Original: Word part, Stem: part
Original: Word east, Stem: east
Original: Word asia, Stem: asia
Original: Word watershed, Stem: watersh
Original: Word north, Stem: north
Original: Word bounded, Stem: bound
Original: Word populated, Stem: popul
Original: Word atlantic, Stem: atlant
Original: Word darién, Stem: darién
Original: Word one, Stem: one
Original: Word northwestern, Stem: northwestern
Original: Word southeast, Stem: southeast
Original: Word geographers, Stem: geograph
Original: Word portion, Stem: portion
Original: Word sparsely, Stem: spars
Original: Word countries, Stem: countri
Original: Word greenland, Stem: greenland
Original: Word describe, Stem: describ
Original: Word entire, Stem: entir
Original: Word portions, Stem: portion
Original: Word panama, Stem: panama
Original: Word third, Stem: third
Original: Word ocean, Stem: ocean
Original: Word complex, Stem: complex
O

In [32]:
len(stemmed_words)

84

In [33]:
stemmed_words[0:10]

['west',
 'million',
 'island',
 'part',
 'east',
 'asia',
 'watersh',
 'north',
 'bound',
 'popul']

## N-Grams

N-grams are contiguous sequences of n adjacent words or characters in the source text.
1. A group (contiguous sequence) of n words or characters.
2. P(w | h): the probability of word w given its history h (the preceding words).
3. A probabilistic model of word sequences.
4. It assigns probabilities to sequences of words.
- Unigram: n = 1
- Bigram: n = 2
- Trigram: n = 3
- ...
- n-gram: n
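
The definition and the P(w | h) idea can be sketched in plain Python. The `ngrams` and `bigram_prob` helpers below are illustrative names, not NLTK functions, and the probability is a simple maximum-likelihood estimate from counts with no smoothing.

```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous windows of length n as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


tokens = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(tokens, 2)          # word bigrams
char_trigrams = ngrams("quick", 3)   # character trigrams

# Maximum-likelihood estimate:
# P(w | h) = count(h followed by w) / count(h followed by anything)
bigram_counts = Counter(bigrams)
history_counts = Counter(h for h, _ in bigrams)


def bigram_prob(history, word):
    """Estimate P(word | history) from bigram counts; 0.0 if history is unseen."""
    if history_counts[history] == 0:
        return 0.0
    return bigram_counts[(history, word)] / history_counts[history]


print(char_trigrams)                # [('q', 'u', 'i'), ('u', 'i', 'c'), ('i', 'c', 'k')]
print(bigram_prob("the", "quick"))  # 0.5
```

Here "the" is followed once by "quick" and once by "lazy", so each continuation gets probability 0.5.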

### Applications of N-Grams
- Spelling Error Detection / Spelling Error Correction
- Text Comparison
- Information Retrieval
- Automatic Text Categorization
- AutoComplete
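
As one possible illustration of the spelling-correction and text-comparison use cases, the sketch below compares words by their shared character bigrams using the Dice coefficient. The helper names and the small `vocabulary` list are assumptions for the example; real spell checkers usually combine this with edit distance and word frequencies.

```python
def char_bigrams(word):
    """Set of adjacent character pairs in a lowercased word."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over character bigrams: 2 * |X & Y| / (|X| + |Y|)."""
    x, y = char_bigrams(a), char_bigrams(b)
    total = len(x) + len(y)
    if total == 0:
        return 0.0
    return 2 * len(x & y) / total


def suggest(word, vocabulary):
    """Return the vocabulary entry most similar to the (possibly misspelt) word."""
    return max(vocabulary, key=lambda candidate: dice_similarity(word, candidate))


# Hypothetical vocabulary taken from words in the sample text.
vocabulary = ["continent", "contiguous", "countries", "canada"]
print(suggest("contnent", vocabulary))  # continent
```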

In [34]:
ng = nltk.ngrams("The quick brown fox jumps over the lazy dog".split(), 3)

In [35]:
for a in ng:
    print(a)

('The', 'quick', 'brown')
('quick', 'brown', 'fox')
('brown', 'fox', 'jumps')
('fox', 'jumps', 'over')
('jumps', 'over', 'the')
('over', 'the', 'lazy')
('the', 'lazy', 'dog')


## Part-of-Speech Tagging

- CC coordinating conjunction
- CD cardinal number
- DT determiner
- EX existential there (like: "there is" ... think of it like "there exists")
- FW foreign word
- IN preposition/subordinating conjunction
- JJ adjective 'big'
- JJR adjective, comparative 'bigger'
- JJS adjective, superlative 'biggest'
- LS list marker 1)
- MD modal could, will
- NN noun, singular 'desk'
- NNS noun plural 'desks'
- NNP proper noun, singular 'Harrison'
- NNPS proper noun, plural 'Americans'
- PDT predeterminer 'all the kids'
- POS possessive ending parent's
- PRP personal pronoun I, he, she
- PRP$ possessive pronoun my, his, hers
- RB adverb very, silently,
- RBR adverb, comparative better
- RBS adverb, superlative best
- RP particle give up
- TO to go 'to' the store.
- UH interjection errrrrrrrm
- VB verb, base form take
- VBD verb, past tense took
- VBG verb, gerund/present participle taking
- VBN verb, past participle taken
- VBP verb, sing. present, non-3rd person take
- VBZ verb, 3rd person sing. present takes
- WDT wh-determiner which
- WP wh-pronoun who, what
- WP$ possessive wh-pronoun whose
- WRB wh-adverb where, when
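
These Penn Treebank tags are often mapped to the single-letter part-of-speech codes that `WordNetLemmatizer.lemmatize` accepts (`'n'`, `'v'`, `'a'`, `'r'`), so that a tagger's output can drive lemmatization. The sketch below shows one common way to do that mapping; `penn_to_wordnet` is a hypothetical helper name, and the tagged sentence is written out by hand so the block runs without NLTK.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter.

    Falls back to 'n' (noun), which is also the default pos of
    WordNetLemmatizer.lemmatize.
    """
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun and everything else


# Hand-written (word, tag) pairs in the format nltk.pos_tag returns.
tagged = [("Ram", "NNP"), ("and", "CC"), ("Shyam", "NNP"), ("went", "VBD"),
          ("to", "TO"), ("the", "DT"), ("School", "NNP")]

for word, tag in tagged:
    print(f"{word:10} {tag:5} {penn_to_wordnet(tag)}")
```

With NLTK available, each pair could then be passed on as `wnl.lemmatize(word, penn_to_wordnet(tag))`.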

In [36]:
nltk.pos_tag(new_words)

[('west', 'JJS'),
 ('million', 'CD'),
 ('island', 'JJ'),
 ('part', 'NN'),
 ('east', 'JJ'),
 ('asia', 'NN'),
 ('watershed', 'VBD'),
 ('north', 'RB'),
 ('bounded', 'VBN'),
 ('populated', 'VBN'),
 ('atlantic', 'JJ'),
 ('darién', 'NN'),
 ('one', 'CD'),
 ('northwestern', 'JJ'),
 ('southeast', 'NN'),
 ('geographers', 'NNS'),
 ('portion', 'NN'),
 ('sparsely', 'RB'),
 ('countries', 'NNS'),
 ('greenland', 'VBP'),
 ('describe', 'NN'),
 ('entire', 'JJ'),
 ('portions', 'NNS'),
 ('panama', 'VBP'),
 ('third', 'JJ'),
 ('ocean', 'NN'),
 ('complex', 'JJ'),
 ('half', 'NN'),
 ('western', 'JJ'),
 ('southern', 'JJ'),
 ('considered', 'VBN'),
 ('also', 'RB'),
 ('islands', 'VBZ'),
 ('within', 'IN'),
 ('called', 'VBN'),
 ('estimated', 'JJ'),
 ('northern', 'RB'),
 ('along', 'IN'),
 ('america', 'RB'),
 ('largest', 'JJS'),
 ('area', 'NN'),
 ('mostly', 'RB'),
 ('less', 'RBR'),
 ('south', 'JJ'),
 ('states', 'NNS'),
 ('except', 'IN'),
 ('common', 'JJ'),
 ('canal', 'NN'),
 ('alternatively', 'RB'),
 ('central', 'JJ'),

In [37]:
nltk.pos_tag("Ram and Shyam went to the School".split())

[('Ram', 'NNP'),
 ('and', 'CC'),
 ('Shyam', 'NNP'),
 ('went', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('School', 'NNP')]

## Named Entity Recognition

In [38]:
print(nltk.ne_chunk(nltk.pos_tag(word_tokenize(t_text)), binary=True))

(S
  (NE North/NNP America/NNP)
  is/VBZ
  the/DT
  third/JJ
  largest/JJS
  continent/NN
  ,/,
  and/CC
  is/VBZ
  also/RB
  a/DT
  portion/NN
  of/IN
  the/DT
  third/JJ
  largest/JJS
  supercontinent/NN
  if/IN
  (NE North/NNP)
  and/CC
  (NE South/NNP America/NNP)
  are/VBP
  combined/VBN
  into/IN
  the/DT
  (NE Americas/NNPS)
  and/CC
  (NE Africa/NNP)
  ,/,
  (NE Europe/NNP)
  ,/,
  and/CC
  (NE Asia/NNP)
  are/VBP
  considered/VBN
  to/TO
  be/VB
  part/NN
  of/IN
  one/CD
  supercontinent/NN
  called/VBN
  Afro-Eurasia/NNP
  ./.
  With/IN
  an/DT
  estimated/JJ
  population/NN
  of/IN
  580/CD
  million/CD
  and/CC
  an/DT
  area/NN
  of/IN
  24,709,000/CD
  km2/NNS
  (/(
  9,540,000/CD
  mi2/NN
  )/)
  ,/,
  the/DT
  northernmost/NN
  of/IN
  the/DT
  two/CD
  continents/NNS
  of/IN
  the/DT
  (NE Western/NNP Hemisphere/NNP)
  is/VBZ
  bounded/VBN
  by/IN
  the/DT
  (NE Pacific/NNP Ocean/NNP)
  on/IN
  the/DT
  west/NN
  ;/:
  the/DT
  (NE Atlantic/NNP Ocean/NNP)
  on/IN
  th

### Chunking and Chinking
- Chunking:
    - It is the simplest technique used for entity detection.
    - It is the process of extracting meaningful short phrases from sentences by analyzing the parts of speech.
    - Chunk patterns are regular expressions that are designed and modified to match part-of-speech tags.
- Chinking:
    - Chinking is a way to remove a sequence of tokens from an existing chunk.
    - Words or patterns that should not be part of a chunk can also be defined; such words are known as chinks.
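
A simplified sketch of both ideas in plain Python follows. NLTK offers a grammar-based version through `nltk.RegexpParser`; the code here is only a stand-in that groups tokens by tag prefix, and the helper names `chunk` and `chink` and the tag choices are assumptions made for illustration.

```python
from itertools import groupby

# Hand-written (word, tag) pairs in the format nltk.pos_tag returns.
tagged = [("Ram", "NNP"), ("and", "CC"), ("Shyam", "NNP"), ("went", "VBD"),
          ("to", "TO"), ("the", "DT"), ("School", "NNP")]


def chunk(tagged_tokens, chunk_tags=("DT", "JJ", "NN")):
    """Chunking: keep maximal runs of tokens whose tag starts with a chunk prefix."""
    chunks = []
    for in_chunk, run in groupby(tagged_tokens,
                                 key=lambda t: t[1].startswith(chunk_tags)):
        if in_chunk:
            chunks.append([word for word, _ in run])
    return chunks


def chink(tagged_tokens, chink_tags=("VB", "TO", "CC", "IN")):
    """Chinking: treat the sentence as one chunk and cut out the chink tags."""
    chunks = []
    for is_chink, run in groupby(tagged_tokens,
                                 key=lambda t: t[1].startswith(chink_tags)):
        if not is_chink:
            chunks.append([word for word, _ in run])
    return chunks


print(chunk(tagged))  # [['Ram'], ['Shyam'], ['the', 'School']]
print(chink(tagged))  # [['Ram'], ['Shyam'], ['the', 'School']]
```

Both routes reach the same phrases here: chunking says what to keep, chinking says what to cut.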

## Using spaCy

In [40]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [41]:
doc = nlp(t_text)

In [42]:
for token in doc:
    print(f"{token.text:{15}} {token.tag_:{10}} {token.pos_:{10}} {spacy.explain(token.tag_)}")

North           NNP        PROPN      noun, proper singular
America         NNP        PROPN      noun, proper singular
is              VBZ        AUX        verb, 3rd person singular present
the             DT         DET        determiner
third           RB         ADV        adverb
largest         JJS        ADJ        adjective, superlative
continent       NN         NOUN       noun, singular or mass
,               ,          PUNCT      punctuation mark, comma
and             CC         CCONJ      conjunction, coordinating
is              VBZ        AUX        verb, 3rd person singular present
also            RB         ADV        adverb
a               DT         DET        determiner
portion         NN         NOUN       noun, singular or mass
of              IN         ADP        conjunction, subordinating or preposition
the             DT         DET        determiner
third           RB         ADV        adverb
largest         JJS        ADJ        adjective, superlative
supe

In [43]:
new_words_dict = {}
stop_words = set(stopwords.words('english'))  # build the set once, outside the loop
for token in doc:
    if token.text.lower() not in stop_words:
        if token.text.isalpha():
            new_words_dict[token.text.lower()] = token.pos_
            print(f"{token.text:{15}} {token.tag_:{10}} {token.pos_:{10}} {spacy.explain(token.tag_)}")

North           NNP        PROPN      noun, proper singular
America         NNP        PROPN      noun, proper singular
third           RB         ADV        adverb
largest         JJS        ADJ        adjective, superlative
continent       NN         NOUN       noun, singular or mass
also            RB         ADV        adverb
portion         NN         NOUN       noun, singular or mass
third           RB         ADV        adverb
largest         JJS        ADJ        adjective, superlative
supercontinent  NN         NOUN       noun, singular or mass
North           NNP        PROPN      noun, proper singular
South           NNP        PROPN      noun, proper singular
America         NNP        PROPN      noun, proper singular
combined        VBN        VERB       verb, past participle
Americas        NNPS       PROPN      noun, proper plural
Africa          NNP        PROPN      noun, proper singular
Europe          NNP        PROPN      noun, proper singular
Asia            NNP   

In [44]:
new_words_dict

{'north': 'PROPN',
 'america': 'PROPN',
 'third': 'ADV',
 'largest': 'ADJ',
 'continent': 'NOUN',
 'also': 'ADV',
 'portion': 'NOUN',
 'supercontinent': 'NOUN',
 'south': 'PROPN',
 'combined': 'VERB',
 'americas': 'PROPN',
 'africa': 'PROPN',
 'europe': 'PROPN',
 'asia': 'PROPN',
 'considered': 'VERB',
 'part': 'NOUN',
 'one': 'NUM',
 'called': 'VERB',
 'afro': 'PROPN',
 'eurasia': 'PROPN',
 'estimated': 'VERB',
 'population': 'NOUN',
 'million': 'NUM',
 'area': 'NOUN',
 'northernmost': 'NOUN',
 'two': 'NUM',
 'continents': 'NOUN',
 'western': 'PROPN',
 'hemisphere': 'PROPN',
 'bounded': 'VERB',
 'pacific': 'PROPN',
 'ocean': 'PROPN',
 'west': 'NOUN',
 'atlantic': 'PROPN',
 'east': 'NOUN',
 'caribbean': 'PROPN',
 'sea': 'PROPN',
 'arctic': 'PROPN',
 'northern': 'ADJ',
 'half': 'NOUN',
 'sparsely': 'ADV',
 'populated': 'VERB',
 'covered': 'VERB',
 'mostly': 'ADV',
 'canada': 'PROPN',
 'except': 'SCONJ',
 'northeastern': 'ADJ',
 'occupied': 'VERB',
 'greenland': 'PROPN',
 'northwestern':
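
A word-to-tag dictionary like this is easy to summarise. The sketch below counts how often each coarse POS tag occurs and groups the words by tag; `word_pos` holds only the first few entries copied from the output above, so the counts are for that excerpt, not the whole text.

```python
from collections import Counter, defaultdict

# First few entries copied from the new_words_dict output.
word_pos = {
    'north': 'PROPN',
    'america': 'PROPN',
    'third': 'ADV',
    'largest': 'ADJ',
    'continent': 'NOUN',
    'also': 'ADV',
    'portion': 'NOUN',
    'supercontinent': 'NOUN',
}

# How many distinct words carry each tag.
pos_counts = Counter(word_pos.values())

# Words grouped under their tag, in insertion order.
words_by_pos = defaultdict(list)
for word, pos in word_pos.items():
    words_by_pos[pos].append(word)

print(pos_counts.most_common(1))  # [('NOUN', 3)]
print(words_by_pos['PROPN'])      # ['north', 'america']
```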

In [45]:
from spacy import displacy

In [46]:
options = {'compact': True}  # use a boolean, not the string 'True'

In [47]:
if doc.ents:
    for ent in doc.ents:
        print(f"{ent.text:{15}} {ent.label_:{10}} {spacy.explain(ent.label_)}")

North America   LOC        Non-GPE locations, mountain ranges, bodies of water
third           ORDINAL    "first", "second", etc.
third           ORDINAL    "first", "second", etc.
North and South America GPE        Countries, cities, states
Americas        LOC        Non-GPE locations, mountain ranges, bodies of water
Africa          LOC        Non-GPE locations, mountain ranges, bodies of water
Europe          LOC        Non-GPE locations, mountain ranges, bodies of water
Asia            LOC        Non-GPE locations, mountain ranges, bodies of water
one             CARDINAL   Numerals that do not fall under another type
Afro-Eurasia    ORG        Companies, agencies, institutions, etc.
580 million     CARDINAL   Numerals that do not fall under another type
24,709,000      CARDINAL   Numerals that do not fall under another type
9,540,000       CARDINAL   Numerals that do not fall under another type
two             CARDINAL   Numerals that do not fall under another type
the Western Hem

In [48]:
displacy.render(doc, style='ent', jupyter=True)

In [49]:
doc.ents

(North America,
 third,
 third,
 North and South America,
 Americas,
 Africa,
 Europe,
 Asia,
 one,
 Afro-Eurasia,
 580 million,
 24,709,000,
 9,540,000,
 two,
 the Western Hemisphere,
 the Pacific Ocean,
 the Atlantic Ocean,
 the Caribbean Sea,
 the Arctic Ocean,
 half,
 North America,
 Canada,
 Greenland,
 Alaska,
 the United States,
 United States,
 Mexico,
 Central America,
 Caribbean,
 Darién,
 Colombia,
 Panama,
 North America,
 North America,
 Panama Canal,
 North America,
 Greenland,
 Caribbean,
 Americas,
 Anglo-America',
 Canada,
 U.S.,
 Latin America',
 Mexico,
 Central America,
 Caribbean,
 South America)