## Stop Word Removal

Stop words are the most commonly used words in a language. Examples in English are “a,” “the,” “is,” and “are.” Text Mining and Natural Language Processing (NLP) pipelines often remove stop words because they occur so frequently that they carry very little useful information.
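As a rough illustration of the idea before turning to a library, the sketch below filters a sentence against a tiny hand-picked stop word set. The names `TOY_STOP_WORDS` and `remove_stop_words` are made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
# A tiny, hand-picked stop word set (an assumption for illustration only;
# real lists such as spaCy's contain hundreds of entries).
TOY_STOP_WORDS = {"a", "the", "is", "are"}


def remove_stop_words(sentence, stop_words=TOY_STOP_WORDS):
    """Return the sentence with stop words dropped (case-insensitive match)."""
    # Whitespace splitting is a crude stand-in for proper tokenization.
    kept = [word for word in sentence.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("The sky is blue and the birds are loud"))
# sky blue and birds loud
```

Note that "and" survives here only because the toy set is so small; a real stop word list would normally include it.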

In [1]:
import spacy

In [4]:
from spacy.lang.en.stop_words import STOP_WORDS

In [6]:
len(STOP_WORDS)

326

In [9]:
STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 ...

In [7]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("We just opened our wings, the flying part is coming soon")

for token in doc:
    print(token)

We
just
opened
our
wings
,
the
flying
part
is
coming
soon


In [8]:
for token in doc:
    if token.is_stop:
        print(token)

We
just
our
the
part
is


In [16]:
def preprocess(text):
    doc = nlp(text)
    no_stop_words = []

    # keep only the tokens that spaCy does not flag as stop words
    for token in doc:
        if not token.is_stop:
            no_stop_words.append(token.text)

    return " ".join(no_stop_words)

In [17]:
preprocess("Musk wants time to prepare for a trial over his")


'Musk wants time prepare trial'

In [19]:
preprocess("The other is not other but your divine brother")

'divine brother'

## Remove stop words from pandas dataframe text column

In [20]:
import pandas as pd

In [21]:
df = pd.read_csv("Spam.csv")

In [22]:
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [23]:
df.shape

(5572, 2)

In [24]:
df["Measage_new"] = df.Message.apply(preprocess)

In [25]:
df

Unnamed: 0,Category,Message,Measage_new
0,ham,"Go until jurong point, crazy.. Available only ...","jurong point , crazy .. Available bugis n grea..."
1,ham,Ok lar... Joking wif u oni...,Ok lar ... Joking wif u oni ...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry 2 wkly comp win FA Cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,U dun early hor ... U c ...
4,ham,"Nah I don't think he goes to usf, he lives aro...","Nah think goes usf , lives"
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,2nd time tried 2 contact u. U won £ 750 Pound ...
5568,ham,Will ü b going to esplanade fr home?,ü b going esplanade fr home ?
5569,ham,"Pity, * was in mood for that. So...any other s...","Pity , * mood . ... suggestions ?"
5570,ham,The guy did some bitching but I acted like i'd...,guy bitching acted like interested buying week...


In [30]:
df.Message[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [31]:
df.Measage_new[0]

'jurong point , crazy .. Available bugis n great world la e buffet ... Cine got amore wat ...'

## Exercise

In [32]:
text = '''
Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and 
distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).
The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,
Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),
Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.
'''

doc = nlp(text)

In [33]:
stop_word_count = 0
total_word = 0
for token in doc:
    if token.is_stop:
        stop_word_count+=1
    total_word+=1

In [34]:
stop_word_count

40

In [35]:
total_word

160

In [36]:
stop_word_count / total_word * 100

25.0
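The percentage calculation can be wrapped in a small reusable helper. The sketch below is one possible shape, with a hypothetical function name `stop_word_percentage`; it takes plain booleans so that it runs without spaCy. Keep in mind that the loop above counts every token, including punctuation and newline tokens, so the denominator is the number of tokens rather than the number of words.

```python
def stop_word_percentage(is_stop_flags):
    """Percentage of True values in a sequence of booleans."""
    flags = list(is_stop_flags)
    if not flags:  # avoid dividing by zero on empty input
        return 0.0
    return sum(flags) / len(flags) * 100


# With spaCy this could be called as
#   stop_word_percentage(token.is_stop for token in doc)
# Here plain booleans stand in for the token flags (40 stop words out of 160 tokens).
print(stop_word_percentage([True] * 40 + [False] * 120))
# 25.0
```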

spaCy's default stop word list includes "not". In some scenarios, however, removing "not" completely changes the meaning of the text. For example, consider these two statements:

- this is a good movie       ----> Positive Statement
- this is not a good movie   ----> Negative Statement

After stop word removal, both statements reduce to "good movie", so the polarity (sentiment) of the original text is lost.

Now, your task is to stop spaCy from treating "not" as a stop word, so that the two texts remain distinguishable.

Hint: GOOGLE IT! Google is your friend.

In [42]:
def preprocess(text):
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return " ".join(no_stop_words)       


#step1: stop treating 'not' as a stop word in spaCy
nlp.vocab['not'].is_stop = False


#step2: preprocess both texts
positive_text = preprocess('this is a good movie')
negative_text = preprocess('this is not a good movie')


#step3: finally print the 2 transformed texts
print(f"Text1: {positive_text}")
print(f"Text2: {negative_text}")

Text1: good movie
Text2: not good movie
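The same effect can be reasoned about without spaCy: keeping a word amounts to taking it out of the stop word set. The sketch below uses a hypothetical toy stop list (`TOY_STOP_WORDS`) and a made-up helper (`strip_stop_words`) to show the two sentences collapsing to the same string, then separating once "not" is kept.

```python
TOY_STOP_WORDS = {"this", "is", "a", "not"}  # hypothetical toy list


def strip_stop_words(text, stop_words):
    """Drop every whitespace-separated word found in stop_words."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


positive = "this is a good movie"
negative = "this is not a good movie"

# With "not" treated as a stop word, both sentences collapse to the same text.
assert strip_stop_words(positive, TOY_STOP_WORDS) == strip_stop_words(negative, TOY_STOP_WORDS)

# Keeping the negation word preserves the difference in polarity.
sentiment_stop_words = TOY_STOP_WORDS - {"not"}
print(strip_stop_words(positive, sentiment_stop_words))  # good movie
print(strip_stop_words(negative, sentiment_stop_words))  # not good movie
```

Depending on the task, other negation words may deserve the same treatment before the text is passed to a sentiment model.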


In [49]:
text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.
It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,
One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the 
first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be
granted test cricket status.
'''


#step1: Create the object 'doc' for the given text using nlp()
doc = nlp(text)


#step2: remove all stop words and punctuation, and store the remaining tokens in a new list
remaining_tokens = []
for token in doc:
  if token.is_stop or token.is_punct:    #skip tokens that are stop words or punctuation
    continue
  remaining_tokens.append(token.text)

In [50]:
remaining_tokens

[' ',
 'India',
 'men',
 'national',
 'cricket',
 'team',
 'known',
 'Team',
 'India',
 'Men',
 'Blue',
 'represents',
 'India',
 'men',
 'international',
 'cricket',
 '\n',
 'governed',
 'Board',
 'Control',
 'Cricket',
 'India',
 'BCCI',
 'Member',
 'International',
 'Cricket',
 'Council',
 'ICC',
 'Test',
 '\n',
 'Day',
 'International',
 'ODI',
 'Twenty20',
 'International',
 'T20I',
 'status',
 'Cricket',
 'introduced',
 'India',
 'British',
 'sailors',
 '18th',
 'century',
 '\n',
 'cricket',
 'club',
 'established',
 '1792',
 'India',
 'national',
 'cricket',
 'team',
 'played',
 'Test',
 'match',
 '25',
 'June',
 '1932',
 'Lord',
 'sixth',
 'team',
 '\n',
 'granted',
 'test',
 'cricket',
 'status',
 '\n']

In [51]:
#step3: create a new dictionary and count how often each stored token occurs
frequency_tokens = {}
for token in remaining_tokens:
  if token != '\n' and token != ' ':      #spaCy treats newlines and spaces as separate tokens, so ignore them
    if token not in frequency_tokens:     #first occurrence of a token: initialise its count to 1
      frequency_tokens[token] = 1
    else:
      frequency_tokens[token] += 1

In [56]:
import numpy as np


In [59]:
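# np.argmax expects an array; given a dict it sees a single object, so the result (0) is not meaningful
np.argmax(frequency_tokens)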

0

In [67]:
# pick the key with the highest count, i.e. the most frequent token
max(frequency_tokens, key=frequency_tokens.get)

'India'
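The manual dictionary and the `max` call can also be replaced by `collections.Counter` from the standard library. The sketch below uses a short made-up token list as a stand-in for `remaining_tokens`, so the counts shown are for that toy list only.

```python
from collections import Counter

# A short stand-in for the remaining_tokens list built earlier.
tokens = ["India", "cricket", "\n", "team", "India", " ", "cricket", "India"]

# Skip whitespace-only tokens, then let Counter do the tallying.
frequency = Counter(token for token in tokens if token.strip())

print(frequency.most_common(2))
# [('India', 3), ('cricket', 2)]
```

`most_common(n)` returns the `n` highest counts in descending order, which also makes it easy to inspect more than just the single top token.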

In [68]:
frequency_tokens.keys()

dict_keys(['India', 'men', 'national', 'cricket', 'team', 'known', 'Team', 'Men', 'Blue', 'represents', 'international', 'governed', 'Board', 'Control', 'Cricket', 'BCCI', 'Member', 'International', 'Council', 'ICC', 'Test', 'Day', 'ODI', 'Twenty20', 'T20I', 'status', 'introduced', 'British', 'sailors', '18th', 'century', 'club', 'established', '1792', 'played', 'match', '25', 'June', '1932', 'Lord', 'sixth', 'granted', 'test'])