# Text Pre-processing Step II

In [16]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


## Creating corpus

In [51]:
paragraph = '''Narendra Damodardas Modi[a] (born 17 September 1950)[b] is an Indian politician who has been serving as the prime minister of India since 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.[4] Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998.[c] In 2001, Modi was appointed Chief Minister of Gujarat and elected to the legislative assembly soon after. His administration is considered complicit in the 2002 Gujarat riots,[d] and has been criticised for its management of the crisis. According to official records, a little over 1,000 people were killed, three-quarters of whom were Muslim; independent sources estimated 2,000 deaths, mostly Muslim.[13] A Special Investigation Team appointed by the Supreme Court of India in 2012 found no evidence to initiate prosecution proceedings against him.[e] While his policies as chief minister were credited for encouraging economic growth, his administration was criticised for failing to significantly improve health, poverty and education indices in the state.[f]'''

In [52]:
paragraph

'Narendra Damodardas Modi[a] (born 17 September 1950)[b] is an Indian politician who has been serving as the prime minister of India since 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.[4] Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998

## Tokenization

Splitting the paragraph into sentences.

In [14]:
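import nltk # needed for nltk.download and the tokenizers used below

nltk.download('punkt_tab') # downloading the data used by the sentence tokenizer

sentences = nltk.sent_tokenize(paragraph) # tokenizing paragraph into sentences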

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/saikiran/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [13]:
sentences

['Narendra Damodardas Modi[a] (born 17 September 1950)[b] is an Indian politician who has been serving as the prime minister of India since 2014.',
 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi.',
 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation.',
 'He is the longest-serving prime minister outside the Indian National Congress.',
 '[4]\n\nModi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education.',
 'He was introduced to the RSS at the age of eight.',
 'At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so.',
 'Modi became a full-time worker for the RSS in Gujarat in 1971.',
 'The RSS assigned him to the BJP in 1985 and he rose through the party hierarch

In [53]:
print(sentences)

print(type(sentences))

['Narendra Damodardas Modi[a] (born 17 September 1950)[b] is an Indian politician who has been serving as the prime minister of India since 2014.', 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the member of parliament (MP) for Varanasi.', 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right-wing Hindu nationalist paramilitary volunteer organisation.', 'He is the longest-serving prime minister outside the Indian National Congress.', '[4]\n\nModi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education.', 'He was introduced to the RSS at the age of eight.', 'At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so.', 'Modi became a full-time worker for the RSS in Gujarat in 1971.', 'The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becom

## Initializing PorterStemmer

In [58]:
from nltk.stem import PorterStemmer # class that implements the Porter stemming algorithm
from nltk.corpus import stopwords # corpus reader that provides the stopword lists

stemmer = PorterStemmer() # creating object of PorterStemmer

### Examples

In [59]:
stemmer.stem('history')

'histori'

In [60]:
stemmer.stem('going')

'go'
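
The two calls above show that a stem need not be a dictionary word. As a small sketch, the same `PorterStemmer` can be applied to a handful of words from the paragraph. The word list is chosen only for illustration; the resulting stems are the ones that also appear in the stemming output further down in this notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words taken from the paragraph above
words = ['september', 'minister', 'party', 'serving', 'publicly']

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f'{word} -> {stem}')
```

Stems such as `septemb` and `parti` are useful as grouping keys, but they are not meant to be readable words.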

## Lemmatization

In [21]:
from nltk.stem import WordNetLemmatizer

In [22]:
lemmatizer = WordNetLemmatizer()

### Examples

In [23]:
lemmatizer.lemmatize('history')

'history'

## Cleaning the corpus

Replacing every non-letter character (digits, punctuation, brackets) with a space and converting each sentence to lowercase.

In [54]:
import re # importing regular expression library

corpus = []
for sentence in sentences:

    review = re.sub('[^a-zA-Z]', ' ', sentence) # replacing every non-letter character (digits, punctuation) with a space.
    review = review.lower() # converting all the characters of the sentence to lower case.

    corpus.append(review) # appending the cleaned sentence to the corpus list.

In [55]:
corpus

['narendra damodardas modi a   born    september       b  is an indian politician who has been serving as the prime minister of india since      ',
 'modi was the chief minister of gujarat from      to      and is the member of parliament  mp  for varanasi ',
 'he is a member of the bharatiya janata party  bjp  and of the rashtriya swayamsevak sangh  rss   a right wing hindu nationalist paramilitary volunteer organisation ',
 'he is the longest serving prime minister outside the indian national congress ',
 '     modi was born and raised in vadnagar in northeastern gujarat  where he completed his secondary education ',
 'he was introduced to the rss at the age of eight ',
 'at the age of     he was married to jashodaben modi  whom he abandoned soon after  only publicly acknowledging her four decades later when legally required to do so ',
 'modi became a full time worker for the rss in gujarat in      ',
 'the rss assigned him to the bjp in      and he rose through the party hierarchy 
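
The cleaned sentences above still contain stray single letters such as `a` and `b`. These are what remains of the footnote markers `[a]` and `[b]` once the brackets are replaced by spaces. One possible refinement, sketched here on a shortened copy of the first sentence, is to strip bracketed markers before the letter-only substitution and then collapse the leftover runs of spaces. The variable names are illustrative and not part of the notebook.

```python
import re

# Shortened copy of the first sentence, used only for illustration
sentence = 'Narendra Damodardas Modi[a] (born 17 September 1950)[b] is an Indian politician.'

# Drop bracketed footnote markers such as [a], [b] or [4]
without_markers = re.sub(r'\[[^\]]*\]', '', sentence)

# Same cleaning as above: non-letters become spaces, then lowercase
cleaned = re.sub('[^a-zA-Z]', ' ', without_markers).lower()

# Collapse the runs of spaces left behind by the removed characters
cleaned = ' '.join(cleaned.split())

print(cleaned)
# narendra damodardas modi born september is an indian politician
```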

## Stemming the words in the whole corpus

Downloading the stopwords package

In [56]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/saikiran/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [57]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english')) # building the stopword set once instead of once per word

for document in corpus:
    words = nltk.word_tokenize(document) # tokenizing document/sentence into words

    for word in words:

        if word not in stop_words: # skipping stopwords
            print(stemmer.stem(word)) # printing the stem of every remaining word



narendra
damodarda
modi
born
septemb
b
indian
politician
serv
prime
minist
india
sinc
modi
chief
minist
gujarat
member
parliament
mp
varanasi
member
bharatiya
janata
parti
bjp
rashtriya
swayamsevak
sangh
rss
right
wing
hindu
nationalist
paramilitari
volunt
organis
longest
serv
prime
minist
outsid
indian
nation
congress
modi
born
rais
vadnagar
northeastern
gujarat
complet
secondari
educ
introduc
rss
age
eight
age
marri
jashodaben
modi
abandon
soon
publicli
acknowledg
four
decad
later
legal
requir
modi
becam
full
time
worker
rss
gujarat
rss
assign
bjp
rose
parti
hierarchi
becom
gener
secretari
c
modi
appoint
chief
minist
gujarat
elect
legisl
assembl
soon
administr
consid
complicit
gujarat
riot
criticis
manag
crisi
accord
offici
record
littl
peopl
kill
three
quarter
muslim
independ
sourc
estim
death
mostli
muslim
special
investig
team
appoint
suprem
court
india
found
evid
initi
prosecut
proceed
e
polici
chief
minist
credit
encourag
econom
growth
administr
criticis
fail
significantli
impro
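
Printing one stem per line makes it hard to see which sentence a stem came from. A minimal sketch of the same filter-then-stem step that keeps the stems of one sentence together in a list is given below. To stay self-contained it uses a hand-picked subset of the English stopword list (an assumption made for illustration) and `str.split` in place of `nltk.word_tokenize`, which is adequate only because the cleaned text holds nothing but letters and spaces.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked subset of NLTK's English stopword list, for illustration only
stop_words = {'he', 'was', 'to', 'the', 'at', 'of'}

# One cleaned sentence from the corpus above
document = 'he was introduced to the rss at the age of eight '

# Keep the non-stopwords and stem each of them
stems = [stemmer.stem(word) for word in document.split() if word not in stop_words]

print(stems)
# ['introduc', 'rss', 'age', 'eight']
```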

## Printing the English stopwords used above

In [37]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

## Lemmatizing all the words in the corpus

In [38]:
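stop_words = set(stopwords.words('english')) # building the stopword set once instead of once per word

for document in corpus:

    words = nltk.word_tokenize(document) # tokenizing document/sentence into words.

    for word in words:

        if word not in stop_words: # skipping stopwords
            print(lemmatizer.lemmatize(word)) # lemmatizing as a noun, the default part of speech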

narendra
damodardas
modi
born
september
b
indian
politician
serving
prime
minister
india
since
modi
chief
minister
gujarat
member
parliament
mp
varanasi
member
bharatiya
janata
party
bjp
rashtriya
swayamsevak
sangh
rss
right
wing
hindu
nationalist
paramilitary
volunteer
organisation
longest
serving
prime
minister
outside
indian
national
congress
modi
born
raised
vadnagar
northeastern
gujarat
completed
secondary
education
introduced
rss
age
eight
age
married
jashodaben
modi
abandoned
soon
publicly
acknowledging
four
decade
later
legally
required
modi
became
full
time
worker
rss
gujarat
rss
assigned
bjp
rose
party
hierarchy
becoming
general
secretary
c
modi
appointed
chief
minister
gujarat
elected
legislative
assembly
soon
administration
considered
complicit
gujarat
riot
criticised
management
crisis
according
official
record
little
people
killed
three
quarter
muslim
independent
source
estimated
death
mostly
muslim
special
investigation
team
appointed
supreme
court
india
found
evidence
in

# Bag of Words

In [39]:
from sklearn.feature_extraction.text import CountVectorizer # library used for creating bag of words model

cv = CountVectorizer() # creating object of CountVectorizer

## Removing stopwords and special characters, then lemmatizing

In [46]:
import re
corpus = []
stop_words = set(stopwords.words('english')) # building the stopword set once

for sentence in sentences:

    review = re.sub('[^a-zA-Z]', ' ', sentence) # keeping letters only
    review = review.lower()
    review = review.split() # splitting on whitespace into words
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)

    corpus.append(review)

## Training the bag of words model with our corpus

In [47]:
X = cv.fit_transform(corpus) # learning the vocabulary and building the sentence-by-word count matrix

In [48]:
cv.vocabulary_ # printing the vocabulary of the bag of words model with the column index of each word

# E.g., 'narendra' is stored in the bag of words model at column index 66.

{'narendra': 66,
 'damodardas': 22,
 'modi': 62,
 'born': 12,
 'september': 96,
 'indian': 46,
 'politician': 78,
 'serving': 97,
 'prime': 80,
 'minister': 61,
 'india': 45,
 'since': 99,
 'chief': 13,
 'gujarat': 38,
 'member': 60,
 'parliament': 74,
 'mp': 64,
 'varanasi': 110,
 'bharatiya': 10,
 'janata': 50,
 'party': 75,
 'bjp': 11,
 'rashtriya': 86,
 'swayamsevak': 105,
 'sangh': 93,
 'rss': 92,
 'right': 89,
 'wing': 112,
 'hindu': 41,
 'nationalist': 68,
 'paramilitary': 73,
 'volunteer': 111,
 'organisation': 71,
 'longest': 57,
 'outside': 72,
 'national': 67,
 'congress': 16,
 'raised': 85,
 'vadnagar': 109,
 'northeastern': 69,
 'completed': 14,
 'secondary': 94,
 'education': 26,
 'introduced': 48,
 'age': 4,
 'eight': 27,
 'married': 59,
 'jashodaben': 51,
 'abandoned': 0,
 'soon': 100,
 'publicly': 83,
 'acknowledging': 2,
 'four': 34,
 'decade': 24,
 'later': 53,
 'legally': 54,
 'required': 88,
 'became': 8,
 'full': 35,
 'time': 108,
 'worker': 113,
 'assigned': 7,
 

## Checking BOW for one sentence

In [49]:
corpus[0]

'narendra damodardas modi born september b indian politician serving prime minister india since'

In [50]:
X[0].toarray() # printing the bag of words vector of the first sentence as a dense array

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0]])
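
The row above has a 1 in each column whose word occurs in `corpus[0]`. Note that `corpus[0]` contains the token `b`, yet the vocabulary has no entry for it: with its default settings `CountVectorizer` keeps only tokens of two or more characters. To make the mapping from words to columns concrete, here is a rough pure-Python sketch of the same idea. It is not scikit-learn's implementation, the helper `tokenize` is hypothetical, and the second document is simply the list of non-stopword lemmas of the second sentence.

```python
import re
from collections import Counter

# corpus[0] from above, plus the non-stopword lemmas of the second sentence
docs = [
    'narendra damodardas modi born september b indian politician serving prime minister india since',
    'modi chief minister gujarat member parliament mp varanasi',
]

def tokenize(doc):
    # Lowercase, then keep tokens of 2+ word characters (drops the lone 'b')
    return re.findall(r'\b\w\w+\b', doc.lower())

# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index
all_terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(all_terms)}

# One count vector per document, with columns in vocabulary order
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[term] for term in all_terms])

print(vocabulary)
print(vectors[0])
```

Because this toy vocabulary is built from two sentences only, its indices differ from those in `cv.vocabulary_`, but the principle is the same: one column per distinct word, one row per sentence.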