**Table of contents**<a id='toc0_'></a>    
- [Install and download packages from NLTK](#toc1_1_)
  - [Load the packages](#toc1_2_)
  - [Load some corpus and check these packages](#toc1_3_)    
    - [tokenize the paragraph](#toc1_3_1_)    
    - [word tokenize](#toc1_3_2_)    
    - [Remove the stop words, lower case](#toc1_3_3_)    
    - [Apply stemming and lemmatization](#toc1_3_4_)    
  - [Apply all of them together and prepare corpus](#toc1_4_)    
  - [Finally, apply `Bag of Words`](#toc1_5_)
  - [Bi-gram, tri-gram](#toc1_6_)    
  - [TF - IDF](#toc1_7_)    


## <a id='toc1_1_'></a>[Install and download packages from NLTK](#toc0_)

In [2]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/soumen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/soumen/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /home/soumen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/soumen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## <a id='toc1_2_'></a>[Load the packages](#toc0_)

In [39]:
# stemming 
from nltk.stem import PorterStemmer
# lemmatization
from nltk.stem import WordNetLemmatizer
# stop words
from nltk.corpus import stopwords
# word tokenize
from nltk import word_tokenize

## <a id='toc1_3_'></a>[Load some corpus and check these packages](#toc0_)

In [4]:
paragraph = """The World Economic Forum
said an investigation into
founder Klaus Schwab found
minor expense irregularities
but no material wrongdoing.
The Davos conference orga-
nizer also said it was shuffling
the leadership of its board in
the wake of the probe.
The board’s interim chair-
man, Peter Brabeck-Letmathe,
a former chief executive of
Nestlé, resigned after a meet-
ing of trustees earlier this
week to discuss the probe’s
conclusions, according to a
letter reviewed by The Wall
Street Journal. He raised con-
cerns about a “toxic” work en-
vironment in his resignation
letter, which hasn’t been pre-
viously reported.
At a meeting on Friday, the
Forum appointed two new in-
terim co-chairs to steer the
board of trustees: BlackRock
boss Larry Fink and Swiss bil-
lionaire André Hoffmann.
The Forum said Friday that
the board wants to move on
from a dispute with its
founder that has roiled the or-
ganization. The board said it
has concluded “there is no ev-
idence of material wrongdo-
ing” by Schwab or his wife,
Hilde Schwab."""

### <a id='toc1_3_1_'></a>[tokenize the paragraph](#toc0_)

* Split the whole paragraph into sentences, then rejoin words that were hyphenated across line breaks and replace the remaining newlines with spaces.

In [20]:
sentences = nltk.sent_tokenize(paragraph)
sentences = [sentence.replace('-\n', '').replace('\n', ' ') for sentence in sentences]
sentences

['The World Economic Forum said an investigation into founder Klaus Schwab found minor expense irregularities but no material wrongdoing.',
 'The Davos conference organizer also said it was shuffling the leadership of its board in the wake of the probe.',
 'The board’s interim chairman, Peter Brabeck-Letmathe, a former chief executive of Nestlé, resigned after a meeting of trustees earlier this week to discuss the probe’s conclusions, according to a letter reviewed by The Wall Street Journal.',
 'He raised concerns about a “toxic” work environment in his resignation letter, which hasn’t been previously reported.',
 'At a meeting on Friday, the Forum appointed two new interim co-chairs to steer the board of trustees: BlackRock boss Larry Fink and Swiss billionaire André Hoffmann.',
 'The Forum said Friday that the board wants to move on from a dispute with its founder that has roiled the organization.',
 'The board said it has concluded “there is no evidence of material wrongdoing” by S
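The clean-up step after `sent_tokenize` can be checked in isolation. Below is a minimal sketch on a short excerpt of the paragraph; the names `excerpt` and `cleaned` are illustrative and not part of the notebook. One caveat worth keeping in mind: a genuine hyphenated compound that happens to break at the end of a line would be joined as well.

```python
# Short excerpt of the paragraph, with the newspaper-style line breaks kept.
excerpt = (
    "The Davos conference orga-\n"
    "nizer also said it was shuffling\n"
    "the leadership of its board."
)

# 1) drop "-\n" so that words split across two lines are rejoined
# 2) turn every remaining newline into a single space
cleaned = excerpt.replace('-\n', '').replace('\n', ' ')
print(cleaned)
```

The order of the two `replace` calls matters: replacing newlines first would leave `orga- nizer` in the text.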

### <a id='toc1_3_2_'></a>[word tokenize](#toc0_)

* Split each sentence into word tokens.

In [21]:
tokenized_words = [nltk.word_tokenize(sentence) for sentence in sentences]
tokenized_words[0][:5]

['The', 'World', 'Economic', 'Forum', 'said']

### <a id='toc1_3_3_'></a>[Remove the stop words, lower case](#toc0_)

In [22]:
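# build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
punctuation = {'.', ',', '"'}
removed_sw = []
for tokenwor in tokenized_words:
    # the check is case-sensitive and runs before lowercasing, so a capitalized 'The' is kept as 'the'
    sen = [w.lower() for w in tokenwor if w not in stop_words and w not in punctuation]
    removed_sw.append(sen)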

In [23]:
for i in range(len(tokenized_words)):
    print(f"Number of words reduced from {len(tokenized_words[i])} to {len(removed_sw[i])} in sentence {i}")

Number of words reduced from 20 to 15 in sentence 0
Number of words reduced from 21 to 11 in sentence 1
Number of words reduced from 45 to 27 in sentence 2
Number of words reduced from 23 to 13 in sentence 3
Number of words reduced from 30 to 21 in sentence 4
Number of words reduced from 23 to 11 in sentence 5
Number of words reduced from 24 to 13 in sentence 6


In [24]:
print(tokenized_words[1],'\n',removed_sw[1])

['The', 'Davos', 'conference', 'organizer', 'also', 'said', 'it', 'was', 'shuffling', 'the', 'leadership', 'of', 'its', 'board', 'in', 'the', 'wake', 'of', 'the', 'probe', '.'] 
 ['the', 'davos', 'conference', 'organizer', 'also', 'said', 'shuffling', 'leadership', 'board', 'wake', 'probe']
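Note that `'the'` survives at the start of the filtered sentence: the capitalized token `'The'` is not in the lowercase stop-word list, and it is lowercased only after the check. A possible variation, sketched below, compares the lowercased token instead. The `stop_words` set here is a small hand-picked subset for illustration, not NLTK's full list, and the punctuation test (keep a token only if it contains a letter or digit) is an added assumption rather than what the notebook does.

```python
# Tiny illustrative stop-word subset (an assumption, not NLTK's full English list).
stop_words = {"the", "it", "was", "of", "its", "in"}

tokens = ['The', 'Davos', 'conference', 'organizer', 'also', 'said', 'it', 'was',
          'shuffling', 'the', 'leadership', 'of', 'its', 'board', 'in', 'the',
          'wake', 'of', 'the', 'probe', '.']

filtered = [
    w.lower()
    for w in tokens
    # compare the lowercased form so 'The' is treated like 'the'
    if w.lower() not in stop_words
    # drop pure punctuation tokens, but keep hyphenated words such as 'co-chairs'
    and any(ch.isalnum() for ch in w)
]
print(filtered)
```

With this variation the leading `'the'` is removed along with the other stop words.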


### <a id='toc1_3_4_'></a>[Apply stemming and lemmatization](#toc0_)

In [25]:
stemmer=PorterStemmer()
lematizer = WordNetLemmatizer()

In [26]:
# simple example of stemming
stemmer.stem('historical')

'histor'

In [27]:
# simple example of lemmatization
lematizer.lemmatize('historical')

'historical'

In [28]:
for tokenwor in removed_sw:
    print(tokenwor)

['the', 'world', 'economic', 'forum', 'said', 'investigation', 'founder', 'klaus', 'schwab', 'found', 'minor', 'expense', 'irregularities', 'material', 'wrongdoing']
['the', 'davos', 'conference', 'organizer', 'also', 'said', 'shuffling', 'leadership', 'board', 'wake', 'probe']
['the', 'board', '’', 'interim', 'chairman', 'peter', 'brabeck-letmathe', 'former', 'chief', 'executive', 'nestlé', 'resigned', 'meeting', 'trustees', 'earlier', 'week', 'discuss', 'probe', '’', 'conclusions', 'according', 'letter', 'reviewed', 'the', 'wall', 'street', 'journal']
['he', 'raised', 'concerns', '“', 'toxic', '”', 'work', 'environment', 'resignation', 'letter', '’', 'previously', 'reported']
['at', 'meeting', 'friday', 'forum', 'appointed', 'two', 'new', 'interim', 'co-chairs', 'steer', 'board', 'trustees', ':', 'blackrock', 'boss', 'larry', 'fink', 'swiss', 'billionaire', 'andré', 'hoffmann']
['the', 'forum', 'said', 'friday', 'board', 'wants', 'move', 'dispute', 'founder', 'roiled', 'organization'

In [29]:
stemmed = []
lemmatized = []
for tokenwor in removed_sw:
    senstem = []
    senlemma = []

    for w in tokenwor:
        print(f"{w}({stemmer.stem(w)}, {lematizer.lemmatize(w)})", end = ', ')
        senstem.append(stemmer.stem(w))
        senlemma.append(lematizer.lemmatize(w))
    stemmed.append(senstem)
    lemmatized.append(senlemma)
    print('\n')

the(the, the), world(world, world), economic(econom, economic), forum(forum, forum), said(said, said), investigation(investig, investigation), founder(founder, founder), klaus(klau, klaus), schwab(schwab, schwab), found(found, found), minor(minor, minor), expense(expens, expense), irregularities(irregular, irregularity), material(materi, material), wrongdoing(wrongdo, wrongdoing), 

the(the, the), davos(davo, davos), conference(confer, conference), organizer(organ, organizer), also(also, also), said(said, said), shuffling(shuffl, shuffling), leadership(leadership, leadership), board(board, board), wake(wake, wake), probe(probe, probe), 

the(the, the), board(board, board), ’(’, ’), interim(interim, interim), chairman(chairman, chairman), peter(peter, peter), brabeck-letmathe(brabeck-letmath, brabeck-letmathe), former(former, former), chief(chief, chief), executive(execut, executive), nestlé(nestlé, nestlé), resigned(resign, resigned), meeting(meet, meeting), trustees(truste, trustee), 

## <a id='toc1_4_'></a>[Apply all of them together and prepare corpus](#toc0_)

In [31]:
import re
corpus = []
for i in range(len(sentences)):
    # clean the text + lower case
    review = re.sub(r'[^a-zA-Z\s]','', sentences[i]).lower()
    # remove stop words + lemmatize
    review = [lematizer.lemmatize(word) for word in nltk.word_tokenize(review)
              if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
corpus


['world economic forum said investigation founder klaus schwab found minor expense irregularity material wrongdoing',
 'davos conference organizer also said shuffling leadership board wake probe',
 'board interim chairman peter brabeckletmathe former chief executive nestl resigned meeting trustee earlier week discus probe conclusion according letter reviewed wall street journal',
 'raised concern toxic work environment resignation letter hasnt previously reported',
 'meeting friday forum appointed two new interim cochairs steer board trustee blackrock bos larry fink swiss billionaire andr hoffmann',
 'forum said friday board want move dispute founder roiled organization',
 'board said concluded evidence material wrongdoing schwab wife hilde schwab']
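In the corpus above, `Nestlé` and `André` come out as `nestl` and `andr` because the pattern `[^a-zA-Z\s]` deletes every character outside ASCII letters and whitespace, accented letters included. One possible remedy, sketched here as an assumption rather than as part of the original pipeline, is to fold accents to their base letters before applying the regular expression. The helper name `strip_accents` is hypothetical.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sentence = "a former chief executive of Nestlé, and Swiss billionaire André Hoffmann."

# Same cleaning pattern as in the notebook, applied after accent folding.
cleaned = re.sub(r'[^a-zA-Z\s]', '', strip_accents(sentence)).lower()
print(cleaned)
```

Whether folding `é` to `e` is desirable depends on the task; for a bag-of-words model over English news text it usually keeps names recognizable.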

## <a id='toc1_5_'></a>[Finally, apply `Bag of Words`](#toc0_)

In [32]:
from  sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus)

`cv.vocabulary_` maps each word to its column index in the count matrix `X`. For example, dimension 74 of every sentence vector represents the word 'world', and dimension 20 represents the word 'economic'.

In [36]:
vocab = cv.vocabulary_
vocab

{'world': 74,
 'economic': 20,
 'forum': 27,
 'said': 59,
 'investigation': 35,
 'founder': 29,
 'klaus': 38,
 'schwab': 60,
 'found': 28,
 'minor': 44,
 'expense': 24,
 'irregularity': 36,
 'material': 42,
 'wrongdoing': 75,
 'davos': 16,
 'conference': 15,
 'organizer': 49,
 'also': 1,
 'shuffling': 61,
 'leadership': 40,
 'board': 6,
 'wake': 68,
 'probe': 52,
 'interim': 34,
 'chairman': 9,
 'peter': 50,
 'brabeckletmathe': 8,
 'former': 26,
 'chief': 10,
 'executive': 23,
 'nestl': 46,
 'resigned': 56,
 'meeting': 43,
 'trustee': 66,
 'earlier': 19,
 'week': 71,
 'discus': 17,
 'conclusion': 14,
 'according': 0,
 'letter': 41,
 'reviewed': 57,
 'wall': 69,
 'street': 63,
 'journal': 37,
 'raised': 53,
 'concern': 12,
 'toxic': 65,
 'work': 73,
 'environment': 21,
 'resignation': 55,
 'hasnt': 31,
 'previously': 51,
 'reported': 54,
 'friday': 30,
 'appointed': 3,
 'two': 67,
 'new': 47,
 'cochairs': 11,
 'steer': 62,
 'blackrock': 5,
 'bos': 7,
 'larry': 39,
 'fink': 25,
 'swiss':

Let's pick one sentence and cross-check its vector against the vocabulary indices.

In [35]:
corpus[1]

'davos conference organizer also said shuffling leadership board wake probe'

In [49]:
# vocabulary indices of the words in corpus[1]

import numpy as np
x = np.array([vocab[w] for w in  nltk.word_tokenize(corpus[1])])
np.sort(x)

array([ 1,  6, 15, 16, 40, 49, 52, 59, 61, 68])

In [52]:
# the vector
X[1].toarray()[0]

array([0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

In [51]:
row = X[1].toarray()[0]  # convert the sparse row to a dense array once
for i in range(len(row)):
    if row[i] != 0:
        print(i)

1
6
15
16
40
49
52
59
61
68
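To make the link between vocabulary indices and vector positions concrete, here is a minimal pure-Python sketch of a bag-of-words model on two sentences from the corpus. It only loosely mimics `CountVectorizer`: it splits on whitespace and assigns indices in alphabetical order over these two sentences alone, so the indices differ from the 76-word vocabulary above. The names `docs`, `vocabulary`, and `to_vector` are illustrative.

```python
from collections import Counter

docs = [
    "davos conference organizer also said shuffling leadership board wake probe",
    "forum said friday board want move dispute founder roiled organization",
]

# Vocabulary: every distinct word, indexed in alphabetical order.
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: index for index, word in enumerate(all_words)}


def to_vector(doc):
    """Return the word counts of `doc`, one entry per vocabulary index."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]


vectors = [to_vector(doc) for doc in docs]
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Each vector has one position per vocabulary word, and a position is nonzero exactly when the sentence contains that word, which is the same check performed on `X[1]` above.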


## <a id='toc1_6_'></a>[Bi-gram, tri-gram](#toc0_)

In [24]:
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from  sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary=True, ngram_range=(2, 3))
X = cv.fit_transform(corpus)

In [26]:
vocab = cv.vocabulary_
vocab

{'world economic': 165,
 'economic forum': 47,
 'forum said': 63,
 'said investigation': 131,
 'investigation founder': 83,
 'founder klaus': 68,
 'klaus schwab': 87,
 'schwab found': 135,
 'found minor': 66,
 'minor expense': 103,
 'expense irregularity': 55,
 'irregularity material': 85,
 'material wrongdoing': 97,
 'world economic forum': 166,
 'economic forum said': 48,
 'forum said investigation': 65,
 'said investigation founder': 132,
 'investigation founder klaus': 84,
 'founder klaus schwab': 69,
 'klaus schwab found': 88,
 'schwab found minor': 136,
 'found minor expense': 67,
 'minor expense irregularity': 104,
 'expense irregularity material': 56,
 'irregularity material wrongdoing': 86,
 'davos conference': 39,
 'conference organizer': 37,
 'organizer also': 111,
 'also said': 2,
 'said shuffling': 133,
 'shuffling leadership': 139,
 'leadership board': 91,
 'board wake': 17,
 'wake probe': 154,
 'davos conference organizer': 40,
 'conference organizer also': 38,
 'organiz

In [27]:
corpus[1]

'davos conference organizer also said shuffling leadership board wake probe'

In [63]:
import numpy as np
from nltk.util import bigrams, trigrams
print("Bigrams:",[' '.join(w) for w in  list((bigrams(word_tokenize(corpus[1]))))])
print("Trigrams",[' '.join(w) for w in  list((trigrams(word_tokenize(corpus[1]))))])

Bigrams: ['davos conference', 'conference organizer', 'organizer also', 'also said', 'said shuffling', 'shuffling leadership', 'leadership board', 'board wake', 'wake probe']
Trigrams ['davos conference organizer', 'conference organizer also', 'organizer also said', 'also said shuffling', 'said shuffling leadership', 'shuffling leadership board', 'leadership board wake', 'board wake probe']


In [None]:
# vocabulary indices of the bigrams and trigrams in corpus[1]
x1 = np.array([vocab[' '.join(w)] for w in  list((bigrams(word_tokenize(corpus[1]))))]) 
x2 = np.array([vocab[' '.join(w)] for w in  list((trigrams(word_tokenize(corpus[1]))))])
x = np.concatenate((x1, x2))
np.sort(x)

array([  2,   3,  17,  18,  37,  38,  39,  40,  91,  92, 111, 112, 133,
       134, 139, 140, 154])

In [58]:
X[1].toarray()

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [64]:
# check non zero dimensions
row = X[1].toarray()[0]  # convert the sparse row to a dense array once
for i in range(len(row)):
    if row[i] != 0:
        print(i, end=', ')

2, 3, 17, 18, 37, 38, 39, 40, 91, 92, 111, 112, 133, 134, 139, 140, 154, 
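The count of 17 nonzero dimensions follows directly from the sentence length: a sequence of 10 tokens has 9 bigrams and 8 trigrams. The sketch below generates n-grams with plain slicing to show this; the helper name `ngrams` is illustrative and is not the `nltk.util` function used above.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "davos conference organizer also said shuffling leadership board wake probe".split()

bigram_list = ngrams(tokens, 2)
trigram_list = ngrams(tokens, 3)

print(len(tokens), "tokens ->", len(bigram_list), "bigrams and", len(trigram_list), "trigrams")
print(bigram_list[0], "|", trigram_list[-1])
```

In general a sentence of `k` tokens produces `k - n + 1` n-grams, so `ngram_range=(2, 3)` contributes `(k - 1) + (k - 2)` features per sentence when no n-gram repeats.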

## <a id='toc1_7_'></a>[TF - IDF](#toc0_)

In [68]:
corpus

['world economic forum said investigation founder klaus schwab found minor expense irregularity material wrongdoing',
 'davos conference organizer also said shuffling leadership board wake probe',
 'board interim chairman peter brabeckletmathe former chief executive nestl resigned meeting trustee earlier week discus probe conclusion according letter reviewed wall street journal',
 'raised concern toxic work environment resignation letter hasnt previously reported',
 'meeting friday forum appointed two new interim cochairs steer board trustee blackrock bos larry fink swiss billionaire andr hoffmann',
 'forum said friday board want move dispute founder roiled organization',
 'board said concluded evidence material wrongdoing schwab wife hilde schwab']

In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X=cv.fit_transform(corpus)

In [71]:
corpus[0]

'world economic forum said investigation founder klaus schwab found minor expense irregularity material wrongdoing'
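The TF-IDF weights for this sentence can also be worked out by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (raw term counts, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row) and uses plain whitespace splitting, which is adequate for this already-cleaned corpus. The function name `tfidf` and the variable names are illustrative.

```python
import math
from collections import Counter

corpus = [
    'world economic forum said investigation founder klaus schwab found minor expense irregularity material wrongdoing',
    'davos conference organizer also said shuffling leadership board wake probe',
    'board interim chairman peter brabeckletmathe former chief executive nestl resigned meeting trustee earlier week discus probe conclusion according letter reviewed wall street journal',
    'raised concern toxic work environment resignation letter hasnt previously reported',
    'meeting friday forum appointed two new interim cochairs steer board trustee blackrock bos larry fink swiss billionaire andr hoffmann',
    'forum said friday board want move dispute founder roiled organization',
    'board said concluded evidence material wrongdoing schwab wife hilde schwab',
]

tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many sentences each word occurs at least once.
df = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """TF-IDF weights for one sentence: count * smoothed idf, then L2-normalized."""
    counts = Counter(tokens)
    raw = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, c in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {w: value / norm for w, value in raw.items()}


weights = tfidf(tokenized[0])
for word in ("world", "founder", "forum", "said"):
    print(f"{word}: {weights[word]:.8f} (appears in {df[word]} of {n_docs} sentences)")
```

Words that occur in only one sentence, such as 'world', receive the largest weight, while 'said', which occurs in four of the seven sentences, receives the smallest.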

In [72]:
X[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.29311674, 0.        , 0.        , 0.        , 0.29311674,
        0.        , 0.        , 0.20797509, 0.29311674, 0.24331207,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.29311674, 0.29311674, 0.        , 0.29311674, 0.        ,
        0.        , 0.        , 0.24331207, 0.        , 0.29311674,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.1805656 ,
        0.24331207, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  