### **STOP Words**
- Very common words such as "of", "not", and "by" are called stop words.
- In count vectorisation, these words often **enlarge the vocabulary size** without adding useful information.
- They can also **mislead the predictions**:
    - Because they occur very frequently in almost every text, documents from different classes can end up with **similar counts, so some outliers may fall into the wrong class**.
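
To make the vocabulary-size point concrete, here is a minimal sketch of count vectorisation in plain Python. The two sample sentences, the `count_vectors` helper, and the small `SAMPLE_STOP_WORDS` set are made up for illustration; spaCy's real stop word list is much larger.

```python
from collections import Counter

# Hypothetical mini stop list for illustration only (not spaCy's list).
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "by", "not", "this"}

corpus = [
    "this is the story of a hero",
    "the hero is loved by the crowd",
]

def count_vectors(docs, stop_words=frozenset()):
    """Return (vocabulary, count vectors) for whitespace-tokenised docs."""
    tokenised = [
        [w for w in doc.lower().split() if w not in stop_words]
        for doc in docs
    ]
    # The vocabulary is every distinct word that survives filtering.
    vocabulary = sorted({w for tokens in tokenised for w in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors

vocab_all, vectors_all = count_vectors(corpus)
vocab_filtered, vectors_filtered = count_vectors(corpus, SAMPLE_STOP_WORDS)

print(len(vocab_all), vocab_all)
print(len(vocab_filtered), vocab_filtered)
```

In this toy corpus, most of the unfiltered vocabulary consists of stop words, so each count vector carries columns that say little about the topic of the document.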

In [11]:
import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS


In [12]:
# Total available stop words
len(STOP_WORDS)

326

In [14]:
nlp = spacy.load('en_core_web_sm')

***Counting the stop words in a text***

In [15]:
text = '''
Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and 
distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).
The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,
Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),
Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.
'''

In [16]:
doc1 = nlp(text)

In [None]:
# Tokens that are NOT stop words (the content-bearing tokens)
tokens = [token.text for token in doc1 if not token.is_stop]

In [21]:
stop_tokens = [token.text for token in doc1 if token.is_stop]
stop_tokens

['and',
 'is',
 'a',
 'on',
 'the',
 'by',
 'and',
 'by',
 'It',
 'is',
 'the',
 'to',
 'and',
 'the',
 'in',
 'the',
 'The',
 'is',
 'by',
 'who',
 'the',
 'with',
 'and',
 'as',
 'and',
 'In',
 'the',
 'to',
 'but',
 'must',
 'to',
 'and',
 'and',
 'is',
 'now',
 'the',
 'to',
 'the',
 'from',
 'all']

In [22]:
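# Note: len(text) counts characters, not words; use len(doc1) for the number of spaCy tokens.
print(f"Total stop words present: {len(stop_tokens)} in a text of {len(text)} characters.")

Total stop words present: 40 in a text of 767 characters.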
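
Calling `len()` on a Python string counts characters, not words. The sketch below computes a stop word ratio on a word basis instead. It uses whitespace splitting and a hypothetical `MINI_STOP_WORDS` set as stand-ins for spaCy's tokenizer and `token.is_stop`, so its numbers will not match spaCy's.

```python
# Hypothetical mini stop list; with spaCy you would check token.is_stop instead.
MINI_STOP_WORDS = {"is", "the", "to", "and", "in", "it"}

sample = "It is the sequel to Thor: Ragnarok and the 29th film in the MCU."

words = sample.split()  # rough word-level tokens, punctuation stays attached
stop_words_found = [w for w in words if w.lower() in MINI_STOP_WORDS]

n_chars = len(sample)   # characters, including spaces and punctuation
n_words = len(words)    # whitespace-separated words
stop_ratio = len(stop_words_found) / n_words

print(f"{n_chars} characters, {n_words} words")
print(f"{len(stop_words_found)} stop words ({stop_ratio:.0%} of the words)")
```

With a spaCy `Doc`, the analogous denominators would be `len(doc1)` for all tokens, or a count of tokens that are neither punctuation nor whitespace.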


******
-    spaCy's default stop word list includes "not". In some scenarios, removing "not" completely changes the meaning of the text. For example, consider these two statements:

    - this is a good movie       ----> Positive statement
    - this is not a good movie   ----> Negative statement
-    After stop word removal, both texts reduce to "good movie", so the polarity (sentiment) of the second statement is lost.


***Removing "not" from spaCy's stop words***

In [24]:
text1 = 'this is a good movie'
text2 = 'this is not a good movie'

In [25]:
doc2 = nlp(text1)
doc3 = nlp(text2)

In [31]:
# discard() is safe to re-run; remove() raises KeyError once "not" is already gone
STOP_WORDS.discard("not")
len(STOP_WORDS)

325

In [33]:
# Also clear the stop-word flag on the vocab entry that token.is_stop checks
nlp.vocab["not"].is_stop = False

In [34]:
tokens1 = [token.text for token in doc2 if not token.is_stop]
_text1 = " ".join(tokens1)
tokens2 = [token.text for token in doc3 if not token.is_stop]
_text2 = " ".join(tokens2)
_text1, _text2

('good movie', 'not good movie')

******
***Finding the most frequently used token after removing stop words and punctuation***

In [35]:
text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.
It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,
One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the 
first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be
granted test cricket status.
'''

In [36]:
doc4 = nlp(text)

In [46]:
filtered_tokens = [token.text for token in doc4 if not token.is_stop and not token.is_punct]
filtered_text = " ".join(filtered_tokens)
filtered_text

'  India men national cricket team known Team India Men Blue represents India men international cricket \n governed Board Control Cricket India BCCI Member International Cricket Council ICC Test \n Day International ODI Twenty20 International T20I status Cricket introduced India British sailors 18th century \n cricket club established 1792 India national cricket team played Test match 25 June 1932 Lord sixth team \n granted test cricket status \n'

In [43]:
# Count occurrences of each token in a single pass
token_dict = {}
for token in filtered_tokens:
    token_dict[token] = token_dict.get(token, 0) + 1

In [44]:
token_dict

{' ': 1,
 'India': 6,
 'men': 2,
 'national': 2,
 'cricket': 5,
 'team': 3,
 'known': 1,
 'Team': 1,
 'Men': 1,
 'Blue': 1,
 'represents': 1,
 'international': 1,
 '\n': 5,
 'governed': 1,
 'Board': 1,
 'Control': 1,
 'Cricket': 3,
 'BCCI': 1,
 'Member': 1,
 'International': 3,
 'Council': 1,
 'ICC': 1,
 'Test': 2,
 'Day': 1,
 'ODI': 1,
 'Twenty20': 1,
 'T20I': 1,
 'status': 2,
 'introduced': 1,
 'British': 1,
 'sailors': 1,
 '18th': 1,
 'century': 1,
 'club': 1,
 'established': 1,
 '1792': 1,
 'played': 1,
 'match': 1,
 '25': 1,
 'June': 1,
 '1932': 1,
 'Lord': 1,
 'sixth': 1,
 'granted': 1,
 'test': 1}
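
The dictionary above holds the counts but does not yet pick out the most frequent token. One possible way to finish is sketched below with `collections.Counter`, run on a short hand-copied slice of the filtered tokens, so its counts differ from the full dictionary. Dropping whitespace-only entries and folding case are optional assumptions here, not something the original cells do.

```python
from collections import Counter

# A short slice of filtered tokens, copied by hand for illustration.
sample_tokens = [" ", "India", "men", "national", "cricket", "team", "known",
                 "Team", "India", "Men", "Blue", "represents", "India", "\n",
                 "Cricket", "cricket"]

# Drop whitespace-only tokens (with spaCy, `not token.is_space` does the same job).
clean_tokens = [t for t in sample_tokens if t.strip()]

# Case-sensitive counts, in the same spirit as the dictionary above.
counts = Counter(clean_tokens)
top_token, top_count = counts.most_common(1)[0]

# Optional: fold case so 'Cricket' and 'cricket' are counted together.
folded_counts = Counter(t.lower() for t in clean_tokens)

print(top_token, top_count)
print(folded_counts.most_common(3))
```

Whether to fold case depends on the task: it merges `Cricket` and `cricket`, but it also merges `Test` (the match format) with `test`.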