In [1]:
import spacy
# spaCy supports many languages; for each one, smaller models favor speed and larger models favor accuracy

In [2]:
nlp = spacy.load('en_core_web_sm')  # Load the small English language model for spaCy

In [3]:
# Define a sample text (corpus) to work with
s='''
Data science is the study of data. Like biological sciences is a study of biology, physical sciences, it’s the study of physical reactions. Data is real, data has real properties, and we need to study them if we’re going to work on them. Data Science involves data and some signs.

It is a process, not an event. It is the process of using data to understand too many different things, to understand the world. Let Suppose when you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data.

It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.

We can also define data science as a field that is about processes and systems to extract data of various forms and from various resources whether the data is unstructured or structured.
The definition and the name came up in the 1980s and 1990s when some professors, IT Professionals, scientists were looking into the statistics curriculum, and they thought it would be better to call it data science and then later on data analytics derived.
'''

#1.Printing all stop words

In [4]:
# Print the default set of stop words in the loaded spacy model
nlp.Defaults.stop_words
# Stop words carry little meaning on their own, so a sentence usually still makes sense after they are removed

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [5]:
# Print the number of default stop words
len(nlp.Defaults.stop_words)

326

#2.Checking whether a word is a stop word or not


In [6]:
# Check if the word 'is' is a stop word
nlp.vocab['is'].is_stop

True

In [7]:
# Check if the word 'word' is a stop word
nlp.vocab['word'].is_stop

False

#3.Adding custom words into the list of StopWords

In [8]:
# Add the custom word 'word' to the set of stop words
nlp.Defaults.stop_words.add('word')

In [9]:
# Re-check 'word' after adding it. The result is still False: adding to the set
# does not update the is_stop flag already stored on the lexeme in nlp.vocab
nlp.vocab['word'].is_stop

False
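
The `False` above shows that editing `nlp.Defaults.stop_words` alone does not change what `is_stop` reports for a word already in the vocabulary. A minimal sketch of setting the flag on the lexeme directly, assuming spaCy is installed. It uses `spacy.blank("en")` (an assumption made here so that no model download is needed) in place of `en_core_web_sm`:

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm.
nlp = spacy.blank("en")

lexeme = nlp.vocab["word"]
before = lexeme.is_stop

# Set the flag on the lexeme itself; tokens read is_stop from their lexeme.
lexeme.is_stop = True
doc = nlp("this word matters")
flags = {token.text: token.is_stop for token in doc}
print(before, flags["word"])

# Undo the change so the default behaviour is restored.
lexeme.is_stop = False
```

Setting the flag back to `False` at the end mirrors the removal step in the next section.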

In [10]:
# Print the updated number of stop words
len(nlp.Defaults.stop_words)

327

#4.Removing custom words from the list of StopWords

In [11]:
# Check the status of 'word' as a stop word before removing it
nlp.vocab['word'].is_stop

False

In [12]:
# Remove the custom word 'word' from the set of stop words
nlp.Defaults.stop_words.remove('word')

In [13]:
# Check if 'word' is a stop word after removing it
nlp.vocab['word'].is_stop  # 'word' is not a stop word any more

False

#5.Removing StopWords from Corpus

In [14]:
# Define the text (corpus) again for removing stop words
txt='''
Data science is the study of data. Like biological sciences is a study of biology, physical sciences, it’s the study of physical reactions. Data is real, data has real properties, and we need to study them if we’re going to work on them. Data Science involves data and some signs.

It is a process, not an event. It is the process of using data to understand too many different things, to understand the world. Let Suppose when you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data.

It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.

We can also define data science as a field that is about processes and systems to extract data of various forms and from various resources whether the data is unstructured or structured.
The definition and the name came up in the 1980s and 1990s when some professors, IT Professionals, scientists were looking into the statistics curriculum, and they thought it would be better to call it data science and then later on data analytics derived.
'''
# Display the original text
txt

'\nData science is the study of data. Like biological sciences is a study of biology, physical sciences, it’s the study of physical reactions. Data is real, data has real properties, and we need to study them if we’re going to work on them. Data Science involves data and some signs.\n\nIt is a process, not an event. It is the process of using data to understand too many different things, to understand the world. Let Suppose when you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data.\n\nIt is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.\n\nWe can also define data science as a field that is about processes and systems to extract data of various forms and from various resources whether the data is uns

In [15]:
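# Clean the text: delete newline characters and double spaces, then strip leading/trailing whitespace
# Note: deleting '\n' outright (instead of replacing it with a space) fuses paragraphs, e.g. "signs.It" in the output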
txt=txt.replace('\n','')
txt=txt.replace('  ','')
txt=txt.strip()
# Display the cleaned text
txt

'Data science is the study of data. Like biological sciences is a study of biology, physical sciences, it’s the study of physical reactions. Data is real, data has real properties, and we need to study them if we’re going to work on them. Data Science involves data and some signs.It is a process, not an event. It is the process of using data to understand too many different things, to understand the world. Let Suppose when you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data.It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.We can also define data science as a field that is about processes and systems to extract data of various forms and from various resources whether the data is unstructured or s
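
In the cleaned text above, paragraphs run together ("signs.It") because each newline is deleted rather than replaced by a space. A small sketch of an alternative that collapses every run of whitespace into one space. The `raw` string is a shortened, hypothetical excerpt of the corpus, used only for illustration:

```python
# Hypothetical short excerpt with newlines and a double space.
raw = "\nData Science involves data and some signs.\n\nIt is a process,  not an event.\n"

# Original approach: deleting newlines outright fuses neighbouring sentences.
fused = raw.replace("\n", "").strip()

# Alternative: split on any whitespace run and re-join with single spaces.
normalized = " ".join(raw.split())

print(fused)
print(normalized)
```

`str.split()` with no arguments treats spaces, tabs and newlines alike, so no separate `strip()` call is needed.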

In [16]:
# Process the cleaned text using the loaded spacy model
corp=nlp(txt)

In [17]:
# Display the processed spacy document
corp

Data science is the study of data. Like biological sciences is a study of biology, physical sciences, it’s the study of physical reactions. Data is real, data has real properties, and we need to study them if we’re going to work on them. Data Science involves data and some signs.It is a process, not an event. It is the process of using data to understand too many different things, to understand the world. Let Suppose when you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data.It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.We can also define data science as a field that is about processes and systems to extract data of various forms and from various resources whether the data is unstructured or st

#5.1-Finding StopWords from the Corpus

In [18]:
# Find and list all stop words present in the corpus
stop_words = []
for token in corp:
  if token.is_stop:
    stop_words.append(token.text)

# Print the list of stop words found
print(stop_words)
# Print the count of stop words found
print(len(stop_words))

['is', 'the', 'of', 'is', 'a', 'of', 'it', '’s', 'the', 'of', 'is', 'has', 'and', 'we', 'to', 'them', 'if', 'we', '’re', 'to', 'on', 'them', 'and', 'some', 'It', 'is', 'a', 'not', 'an', 'It', 'is', 'the', 'of', 'using', 'to', 'too', 'many', 'to', 'the', 'when', 'you', 'have', 'a', 'or', 'of', 'a', 'and', 'you', 'to', 'that', 'or', 'with', 'your', 'It', 'is', 'the', 'of', 'the', 'and', 'that', 'are', 'or', 'behind', 'It', '’s', 'when', 'you', 'into', 'a', 'So', 'to', 'And', 'with', 'these', 'you', 'can', 'make', 'for', 'a', 'or', 'an', 'We', 'can', 'also', 'as', 'a', 'that', 'is', 'about', 'and', 'to', 'of', 'various', 'and', 'from', 'various', 'whether', 'the', 'is', 'or', 'The', 'and', 'the', 'name', 'up', 'in', 'the', 'and', 'when', 'some', 'IT', 'were', 'into', 'the', 'and', 'they', 'it', 'would', 'be', 'to', 'call', 'it', 'and', 'then', 'on']
125


In [19]:
# Print the number of unique stop words found in the corpus
len(set(stop_words))

55
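
The gap between 125 occurrences and 55 unique stop words means a few words repeat often. One way to see which ones is `collections.Counter`. This is a sketch on a short, hypothetical excerpt of the list rather than the full output, and lower-casing before counting is a choice made here, not something the notebook does:

```python
from collections import Counter

# Hypothetical excerpt of a stop-word list such as the one collected above.
sample_stop_words = ["is", "the", "of", "is", "a", "of", "It", "it", "the", "of"]

# Lower-case first so 'It' and 'it' are counted as the same word.
counts = Counter(word.lower() for word in sample_stop_words)

print(len(sample_stop_words))   # total occurrences
print(len(counts))              # unique words after lower-casing
print(counts.most_common(1))    # the most frequent stop word and its count
```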

In [20]:
# Display the list of stop words found in the corpus
stop_words

['is',
 'the',
 'of',
 'is',
 'a',
 'of',
 'it',
 '’s',
 'the',
 'of',
 'is',
 'has',
 'and',
 'we',
 'to',
 'them',
 'if',
 'we',
 '’re',
 'to',
 'on',
 'them',
 'and',
 'some',
 'It',
 'is',
 'a',
 'not',
 'an',
 'It',
 'is',
 'the',
 'of',
 'using',
 'to',
 'too',
 'many',
 'to',
 'the',
 'when',
 'you',
 'have',
 'a',
 'or',
 'of',
 'a',
 'and',
 'you',
 'to',
 'that',
 'or',
 'with',
 'your',
 'It',
 'is',
 'the',
 'of',
 'the',
 'and',
 'that',
 'are',
 'or',
 'behind',
 'It',
 '’s',
 'when',
 'you',
 'into',
 'a',
 'So',
 'to',
 'And',
 'with',
 'these',
 'you',
 'can',
 'make',
 'for',
 'a',
 'or',
 'an',
 'We',
 'can',
 'also',
 'as',
 'a',
 'that',
 'is',
 'about',
 'and',
 'to',
 'of',
 'various',
 'and',
 'from',
 'various',
 'whether',
 'the',
 'is',
 'or',
 'The',
 'and',
 'the',
 'name',
 'up',
 'in',
 'the',
 'and',
 'when',
 'some',
 'IT',
 'were',
 'into',
 'the',
 'and',
 'they',
 'it',
 'would',
 'be',
 'to',
 'call',
 'it',
 'and',
 'then',
 'on']

#5.2-Finding the words that don't belong to stopwords

In [21]:
# Iterate through the tokens in the corpus and print tokens that are not stop words
for token in corp:
  if not token.is_stop:
    print(token)

Data
science
study
data
.
Like
biological
sciences
study
biology
,
physical
sciences
,
study
physical
reactions
.
Data
real
,
data
real
properties
,
need
study
going
work
.
Data
Science
involves
data
signs
.
process
,
event
.
process
data
understand
different
things
,
understand
world
.
Let
Suppose
model
proposed
explanation
problem
,
try
validate
proposed
explanation
model
data
.
skill
unfolding
insights
trends
hiding
(
abstract
)
data
.
translate
data
story
.
use
storytelling
generate
insight
.
insights
,
strategic
choices
company
institution
.
define
data
science
field
processes
systems
extract
data
forms
resources
data
unstructured
structured
.
definition
came
1980s
1990s
professors
,
Professionals
,
scientists
looking
statistics
curriculum
,
thought
better
data
science
later
data
analytics
derived
.


In [22]:
# Join the non-stop words back into a string
" ".join([token.text for token in corp if not token.is_stop])

'Data science study data . Like biological sciences study biology , physical sciences , study physical reactions . Data real , data real properties , need study going work . Data Science involves data signs . process , event . process data understand different things , understand world . Let Suppose model proposed explanation problem , try validate proposed explanation model data . skill unfolding insights trends hiding ( abstract ) data . translate data story . use storytelling generate insight . insights , strategic choices company institution . define data science field processes systems extract data forms resources data unstructured structured . definition came 1980s 1990s professors , Professionals , scientists looking statistics curriculum , thought better data science later data analytics derived .'
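
The joined string above still contains punctuation tokens, each padded with spaces (" . ", " , "). If punctuation is not wanted either, `token.is_punct` can be added to the filter. A minimal sketch, assuming spaCy is installed; `spacy.blank("en")` is used here as an assumption to avoid a model download, since `is_stop` and `is_punct` are lexical attributes that need no trained components:

```python
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("Data science is the study of data.")

# Keep only tokens that are neither stop words nor punctuation.
kept = [token.text for token in doc if not token.is_stop and not token.is_punct]
cleaned = " ".join(kept)
print(cleaned)
```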

# Explanation of the code to find unique stopwords in the corpus
stop_words = set(): Initializes an empty set to store stopwords.

for token in corp:: Loops through each token in corp.

if token.is_stop:: Checks if the token is a stopword.

stop_words.add(token.text): Adds stopword to the set.

print(stop_words): Prints the unique stopwords found.

print(len(stop_words)): Prints the number of unique stopwords.
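
Put together, the steps described above might look like the following sketch. It assumes spaCy is installed, and it uses `spacy.blank("en")` with two sentences from the corpus as a stand-in for the full `corp` object (both are assumptions made for a self-contained example):

```python
import spacy

# Assumption: a blank English pipeline and a two-sentence excerpt of the corpus.
nlp = spacy.blank("en")
corp = nlp("It is a process, not an event. It is the process of using data.")

stop_words = set()                  # a set keeps each stop word only once
for token in corp:                  # loop through every token in the document
    if token.is_stop:               # keep only tokens flagged as stop words
        stop_words.add(token.text)

print(sorted(stop_words))
print(len(stop_words))
```

Because the set stores `token.text`, differently cased forms such as 'It' and 'it' would count as separate entries.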