# Text Preprocessing


Stop words in NLP are common words—such as articles, prepositions, and conjunctions—that carry minimal semantic content and are often removed to reduce noise and improve processing efficiency.

There is no universal stop-word list, but many tools and libraries ship default lists that can be customized for a specific domain or task. Removing stop words can speed up text processing, reduce dimensionality, and sharpen the focus on meaningful terms. In some applications (e.g., sentiment analysis, named-entity recognition), however, it may discard useful information, such as the negation in "not good". In Python, libraries like NLTK and spaCy offer built-in stop-word lists and simple ways to filter with them. Practitioners should also watch for context-specific and document-level stop words, that is, terms so frequent in a particular corpus that they carry little information there.
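The trade-off between a default list and a customized one can be seen in a toy example. This is a minimal sketch, not NLTK's or spaCy's API: `DEFAULT_STOP_WORDS` is a small hand-picked stand-in for a library default, and `remove_stop_words` is a hypothetical helper that tokenizes by whitespace only.

```python
# Hypothetical miniature stop list, standing in for a library default.
DEFAULT_STOP_WORDS = {"a", "the", "is", "was", "not", "at", "all"}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop tokens in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


review = "The food was not good at all"

# With the default list the negation disappears, so the review looks positive.
default_tokens = remove_stop_words(review, DEFAULT_STOP_WORDS)

# Customized list for sentiment analysis: keep "not" because it carries polarity.
sentiment_stop_words = DEFAULT_STOP_WORDS - {"not"}
custom_tokens = remove_stop_words(review, sentiment_stop_words)

print(default_tokens)  # ['food', 'good']
print(custom_tokens)   # ['food', 'not', 'good']
```

The same set-difference idea should carry over to a real library list: convert it to a `set`, then subtract or add the words that matter for the task.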

In [45]:
pargraph= """
I am indeed delighted to participate in the 21st Convocation of Sri Sathya Sai Institute of Higher Learning.
I take this opportunity to congratulate the young graduates for their achievement.
I greet the Vice Chancellor, Professors, teachers and staff for the excellent contribution in shaping young minds to contribute to the nation in multiple fields.
It is a great honour for me that the Chancellor, Swamiji, has given me this opportunity to share my thoughts at this Convocation.

Is value based education possible? Sri Sathya Sai Institute of Higher Learning has given an answer in the affirmative.
Our ultimate goal is: all human beings should be prosperous and should have all forms of security like food security, social security and future security of their children.
How to achieve them? How can a nation be secured from external and internal problems?
National security and economic prosperity are interconnected.

Sathyam, Dharma, Shanti and Prema are the eternal human values.
Efforts and endeavour are man's duty. Success or failure is God's domain.
I can see in this campus high calibre graduates bubbling with creativity.
There is a virtual presence of divine blessings all around.
I could sense intervention to alleviate the people's pain, difficulties and problems.
The integrated effect of this place is how a Guru can integrate both spiritual and material wealth.

When I was thinking what thoughts I could share with you, young graduates, a beautiful divine message was ringing in me:
"Where there is righteousness in the heart
 There is a beauty in the character.
 When there is beauty in the character,
 There is harmony in the home.
 When there is harmony in the home,
 There is order in the nation.
 When there is order in the nation,
 There is peace in the world."

Thinking is progress. Non-thinking is destruction to the individual, organization and the country.
Thinking leads to action. Knowledge without action is useless and irrelevant.
Knowledge with action brings prosperity.

I would like you, dear youth, to have a mind to explore every aspect of human life.
Look at the sky. We are not alone. The whole universe is friendly to us and conspires to give the best to those who dream.
Like Chandrasekhar Subramaniam discovered the black hole using Chandrasekhar's limit.
Like Sir C.V. Raman looked at the sea and questioned why it is blue, leading to the Raman Effect.
Like Albert Einstein, armed with the complexity of the universe, asked questions about its nature and arrived at E = mc².

To become a developed India, the essential needs are:
(a) India has to be economically and commercially powerful, aiming for 9% annual GDP growth and near-zero poverty.
(b) Near self-reliance in defence equipment with no umbilical attached to the outside world.
(c) India should have a right place in world forums.

Technology Vision 2020 is a pathway to realise this cherished mission.
We have identified five areas for integrated action:
  1. Agriculture and food processing
  2. Reliable and quality electric power for all parts of the country
  3. Education and Healthcare
  4. Information Communication Technology
  5. Strategic sectors (nuclear, space, defence, advanced sensors and materials)

These five areas are closely inter-related and will lead to national, food, and economic security.
A strong partnership among R&D, academia, industry, the community, and government will be essential to accomplish the vision.
"""

In [37]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [38]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [34]:
stopwords.words('english')


['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [22]:
# Plan:
# 1. Tokenize the paragraph into sentences, then each sentence into words.
# 2. Drop every word that is in the stop-word list and stem the remaining words.
# 3. Join the stemmed words back into a sentence.

In [23]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

stemmer = PorterStemmer()

# Split the paragraph into sentences (needs NLTK's Punkt tokenizer data).
sentences = sent_tokenize(pargraph)
print(sentences)



['\nI am indeed delighted to participate in the 21st Convocation of Sri Sathya Sai Institute of Higher Learning.', 'I take this opportunity to congratulate the young graduates for their achievement.', 'I greet the Vice Chancellor, Professors, teachers and staff for the excellent contribution in shaping young minds to contribute to the nation in multiple fields.', 'It is a great honour for me that the Chancellor, Swamiji, has given me this opportunity to share my thoughts at this Convocation.', 'Is value based education possible?', 'Sri Sathya Sai Institute of Higher Learning has given an answer in the affirmative.', 'Our ultimate goal is: all human beings should be prosperous and should have all forms of security like food security, social security and future security of their children.', 'How to achieve them?', 'How can a nation be secured from external and internal problems?', 'National security and economic prosperity are interconnected.', 'Sathyam, Dharma, Shanti and Prema are the 

In [25]:
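stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
    # Tokenize the sentence into words.
    words = word_tokenize(sentences[i])
    # Drop stop words, then stem the rest. The membership test is case-sensitive,
    # so capitalised tokens such as 'I' or 'The' are not filtered out.
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)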

In [26]:
sentences

['i inde delight particip 21st convoc sri sathya sai institut higher learn .',
 'i take opportun congratul young graduat achiev .',
 'i greet vice chancellor , professor , teacher staff excel contribut shape young mind contribut nation multipl field .',
 'it great honour chancellor , swamiji , given opportun share thought convoc .',
 'is valu base educ possibl ?',
 'sri sathya sai institut higher learn given answer affirm .',
 'our ultim goal : human be prosper form secur like food secur , social secur futur secur children .',
 'how achiev ?',
 'how nation secur extern intern problem ?',
 'nation secur econom prosper interconnect .',
 'sathyam , dharma , shanti prema etern human valu .',
 "effort endeavour man 's duti .",
 "success failur god 's domain .",
 'i see campu high calibr graduat bubbl creativ .',
 'there virtual presenc divin bless around .',
 "i could sens intervent allevi peopl 's pain , difficulti problem .",
 'the integr effect place guru integr spiritu materi wealth .',
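The stemmed sentences above still begin with words like 'i', 'it', 'is', 'our', and 'the'. The likely cause is that the NLTK list is all lower case, while the tokens are compared in their original casing before the stemmer lowercases them. The sketch below reproduces the effect without NLTK; `STOP_WORDS` is a small hand-picked stand-in for the real list, and both filter functions are hypothetical helpers.

```python
# Stand-in for the lower-case NLTK stop-word list (assumption: tiny subset).
STOP_WORDS = {"i", "the", "is", "it", "in", "am", "to", "of"}


def filter_case_sensitive(tokens, stop_words):
    """Compare tokens as-is, mirroring the loop above."""
    return [tok for tok in tokens if tok not in stop_words]


def filter_case_insensitive(tokens, stop_words):
    """Lowercase each token only for the comparison; keep the original token."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = "I am indeed delighted to participate".split()

kept_sensitive = filter_case_sensitive(tokens, STOP_WORDS)
kept_insensitive = filter_case_insensitive(tokens, STOP_WORDS)

print(kept_sensitive)    # ['I', 'indeed', 'delighted', 'participate']
print(kept_insensitive)  # ['indeed', 'delighted', 'participate']
```

In the notebook loop, the equivalent change would be to test `word.lower()` against the stop-word set instead of `word`.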

In [40]:
sentences2=sent_tokenize(pargraph)
print(sentences2)

['\nI am indeed delighted to participate in the 21st Convocation of Sri Sathya Sai Institute of Higher Learning.', 'I take this opportunity to congratulate the young graduates for their achievement.', 'I greet the Vice Chancellor, Professors, teachers and staff for the excellent contribution in shaping young minds to contribute to the nation in multiple fields.', 'It is a great honour for me that the Chancellor, Swamiji, has given me this opportunity to share my thoughts at this Convocation.', 'Is value based education possible?', 'Sri Sathya Sai Institute of Higher Learning has given an answer in the affirmative.', 'Our ultimate goal is: all human beings should be prosperous and should have all forms of security like food security, social security and future security of their children.', 'How to achieve them?', 'How can a nation be secured from external and internal problems?', 'National security and economic prosperity are interconnected.', 'Sathyam, Dharma, Shanti and Prema are the 

In [41]:
from nltk.stem import SnowballStemmer

snowstem = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences2)):
    # Tokenize the sentence into words.
    words = word_tokenize(sentences2[i])
    # Drop stop words (case-sensitive check), then stem the rest with Snowball.
    words = [snowstem.stem(word) for word in words if word not in stop_words]
    sentences2[i] = ' '.join(words)

In [42]:
sentences2

['i inde delight particip 21st convoc sri sathya sai institut higher learn .',
 'i take opportun congratul young graduat achiev .',
 'i greet vice chancellor , professor , teacher staff excel contribut shape young mind contribut nation multipl field .',
 'it great honour chancellor , swamiji , given opportun share thought convoc .',
 'is valu base educ possibl ?',
 'sri sathya sai institut higher learn given answer affirm .',
 'our ultim goal : human be prosper form secur like food secur , social secur futur secur children .',
 'how achiev ?',
 'how nation secur extern intern problem ?',
 'nation secur econom prosper interconnect .',
 'sathyam , dharma , shanti prema etern human valu .',
 "effort endeavour man 's duti .",
 "success failur god 's domain .",
 'i see campus high calibr graduat bubbl creativ .',
 'there virtual presenc divin bless around .',
 "i could sens intervent allevi peopl 's pain , difficulti problem .",
 'the integr effect place guru integr spiritu materi wealth .'
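The Porter and Snowball outputs look almost identical, so a token-by-token comparison helps to spot where they disagree. The sketch below uses three lines copied from each output above; `stem_differences` is a hypothetical helper, and it assumes both outputs have the same number of tokens per sentence.

```python
# Lines copied from the Porter (In [26]) and Snowball (In [42]) outputs above.
porter_out = [
    "i take opportun congratul young graduat achiev .",
    "i see campu high calibr graduat bubbl creativ .",
    "there virtual presenc divin bless around .",
]
snowball_out = [
    "i take opportun congratul young graduat achiev .",
    "i see campus high calibr graduat bubbl creativ .",
    "there virtual presenc divin bless around .",
]


def stem_differences(a_lines, b_lines):
    """Return (sentence_index, a_token, b_token) wherever the tokens differ."""
    diffs = []
    for idx, (a, b) in enumerate(zip(a_lines, b_lines)):
        for tok_a, tok_b in zip(a.split(), b.split()):
            if tok_a != tok_b:
                diffs.append((idx, tok_a, tok_b))
    return diffs


differences = stem_differences(porter_out, snowball_out)
print(differences)  # [(1, 'campu', 'campus')]
```

In this sample the only difference is 'campu' (Porter) versus 'campus' (Snowball).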

# Let's do the same with lemmatization

In [43]:
from nltk.stem import WordNetLemmatizer

lemma=WordNetLemmatizer()


In [44]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [46]:
sentences3=sent_tokenize(pargraph)
print(sentences3)


['\nI am indeed delighted to participate in the 21st Convocation of Sri Sathya Sai Institute of Higher Learning.', 'I take this opportunity to congratulate the young graduates for their achievement.', 'I greet the Vice Chancellor, Professors, teachers and staff for the excellent contribution in shaping young minds to contribute to the nation in multiple fields.', 'It is a great honour for me that the Chancellor, Swamiji, has given me this opportunity to share my thoughts at this Convocation.', 'Is value based education possible?', 'Sri Sathya Sai Institute of Higher Learning has given an answer in the affirmative.', 'Our ultimate goal is: all human beings should be prosperous and should have all forms of security like food security, social security and future security of their children.', 'How to achieve them?', 'How can a nation be secured from external and internal problems?', 'National security and economic prosperity are interconnected.', 'Sathyam, Dharma, Shanti and Prema are the 

In [51]:
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences3)):
    # Tokenize the sentence into words.
    words = word_tokenize(sentences3[i])
    # The lemmatizer does not lowercase its input, so lowercase each word first.
    # pos='v' treats every word as a verb. The stop-word check runs on the
    # original casing, so capitalised tokens such as 'I' are not filtered out.
    words = [lemma.lemmatize(word.lower(), pos='v') for word in words if word not in stop_words]
    sentences3[i] = ' '.join(words)

In [52]:
sentences3

['i indeed delight participate 21st convocation sri sathya sai institute higher learn .',
 'i take opportunity congratulate young graduate achievement .',
 'i greet vice chancellor , professors , teacher staff excellent contribution shape young mind contribute nation multiple field .',
 'it great honour chancellor , swamiji , give opportunity share think convocation .',
 'be value base education possible ?',
 'sri sathya sai institute higher learn give answer affirmative .',
 'our ultimate goal : human prosperous form security like food security , social security future security child .',
 'how achieve ?',
 'how nation secure external internal problem ?',
 'national security economic prosperity interconnect .',
 'sathyam , dharma , shanti prema eternal human value .',
 "efforts endeavour man 's duty .",
 "success failure god 's domain .",
 'i see campus high calibre graduate bubble creativity .',
 'there virtual presence divine bless around .',
 "i could sense intervention alleviate pe