### Stop Words

- Words filtered out of text during preprocessing, before further analysis
- Typically the most common words in a language (e.g. "the", "of", "and")

Uses

- Improve performance in applications such as search engines and sentiment analysis
- Eliminate noise in sentiment classification
- Make ML models faster to train because there are fewer features
- Can make predictions more accurate by reducing noise
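
The "fewer features" point can be illustrated with a minimal pure-Python sketch. The `DEMO_STOP_WORDS` set, the `bag_of_words` helper, and the sample sentence are all made up for illustration and are not part of spaCy; spaCy's own list is much larger.

```python
from collections import Counter

# Tiny hand-picked stop list, for illustration only (an assumption, not spaCy's list).
DEMO_STOP_WORDS = {"the", "of", "a", "is", "and", "to", "in"}


def bag_of_words(text, stop_words=frozenset()):
    """Count lowercase whitespace-separated words, skipping stop words."""
    words = text.lower().split()
    return Counter(w for w in words if w not in stop_words)


text = "the plot of the film is thin and the acting is a joy to watch"
all_features = bag_of_words(text)
kept_features = bag_of_words(text, DEMO_STOP_WORDS)

print(len(all_features), "distinct words before filtering")  # 12
print(len(kept_features), "distinct words after filtering")  # 6
```

In this toy sentence half of the distinct words are stop words, so a bag-of-words model built on it would have half as many features.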


In [1]:
import spacy
nlp = spacy.load('en_core_web_md')

In [2]:
from spacy.lang.en.stop_words import STOP_WORDS

In [3]:
len(STOP_WORDS)

326

In [6]:
print(STOP_WORDS)

{'either', 'something', 'should', 'always', 'hundred', 'serious', 'unless', 'your', 'get', 'both', 'whereby', 'whole', '‘ll', 'again', 'other', 'five', 'just', 'this', '’ve', 'becoming', 'otherwise', 'herself', 'more', 'less', 'whence', 'between', 'doing', 'own', 'often', 'whenever', 'four', 'throughout', 'several', 'ourselves', 'out', 'from', 'him', 'due', 'am', 'further', 'front', 'somewhere', 'myself', 'ever', 'in', 'his', 'who', 'for', 'ten', 'perhaps', 'eleven', 'third', 'make', 'beside', 'being', 'part', 'already', 'me', 'elsewhere', 'he', 'made', 'move', 'toward', 'or', 'will', 'below', 'themselves', 'sixty', 'when', 'whereafter', 'these', 'rather', 'off', 'except', 'that', 'go', 'though', 'if', 'top', 're', 'the', 'n’t', 'least', 'my', 'us', 'which', 'afterwards', 'almost', 'as', 'you', 'such', 'really', 'hereafter', 'of', 'everyone', 'they', 'its', 'towards', 'per', 'now', 'few', 'hereupon', 'did', 'mostly', 'six', 'where', 'since', 'still', 'does', 'whereupon', 'at', 'whether

In [44]:
# Checking if a word is a stop word

nlp.vocab['the'].is_stop

True

### Filtering Non-Stop Words

In [13]:
# Read the article and close the file once it has been parsed
with open('covid19.txt') as f:
    doc_covid = nlp(f.read())
doc_covid

Through the International Food Safety Authorities Network (INFOSAN),
national food safety authorities are seeking more information on the
potential for persistence of SARS-CoV-2, which causes COVID-19, on foods
traded internationally as well as the potential role of food in the transmission
of the virus. Currently, there are investigations conducted to evaluate the
viability and survival time of SARS-CoV-2. As a general rule, the consumption
of raw or undercooked animal products should be avoided. Raw meat, raw
milk or raw animal organs should be handled with care to avoid crosscontamination with uncooked foods.

In [23]:
# Keep only the tokens that are not stop words
non_stop = [token for token in doc_covid if not token.is_stop]
print(non_stop)

[International, Food, Safety, Authorities, Network, (, INFOSAN, ), ,, 
, national, food, safety, authorities, seeking, information, 
, potential, persistence, SARS, -, CoV-2, ,, causes, COVID-19, ,, foods, 
, traded, internationally, potential, role, food, transmission, 
, virus, ., Currently, ,, investigations, conducted, evaluate, 
, viability, survival, time, SARS, -, CoV-2, ., general, rule, ,, consumption, 
, raw, undercooked, animal, products, avoided, ., Raw, meat, ,, raw, 
, milk, raw, animal, organs, handled, care, avoid, crosscontamination, uncooked, foods, .]
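
The output above still contains punctuation and newline tokens, because `is_stop` only looks at stop words. A minimal sketch of a stricter filter, using the token attributes `is_punct` and `is_space` alongside `is_stop`, could look like this. The helper name `content_tokens` is hypothetical.

```python
def content_tokens(doc):
    """Return the text of tokens that are not stop words, punctuation, or whitespace."""
    return [
        token.text
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]

# In the notebook: print(content_tokens(doc_covid))
```

Whether punctuation should be dropped depends on the task, so treat this as one option and not a required step.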


In [24]:
# Keep only the stop words
only_stop = [token for token in doc_covid if token.is_stop]
print(only_stop)

[Through, the, are, more, on, the, for, of, which, on, as, well, as, the, of, in, the, of, the, there, are, to, the, and, of, As, a, the, of, or, should, be, or, should, be, with, to, with]
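
To see which stop words dominate a text, the same `is_stop` filter can feed a frequency count. This is a small sketch; the helper name `stop_word_counts` is hypothetical, and lowercasing is an assumption made so that "As" and "as" are counted together.

```python
from collections import Counter


def stop_word_counts(doc):
    """Count how often each stop word occurs in a doc, ignoring case."""
    return Counter(token.text.lower() for token in doc if token.is_stop)

# In the notebook: stop_word_counts(doc_covid).most_common(5)
```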


### Adding Your Own Stop Words

In [77]:
STOP_WORDS.add("lol")

In [78]:
len(STOP_WORDS)

329

In [79]:
# Also set the flag on the vocabulary entry so that token.is_stop is True for this word
nlp.vocab['lol'].is_stop = True

In [80]:
# Check that the word was added to STOP_WORDS

'lol' in STOP_WORDS

True

### Removing an Added Stop Word

In [81]:
STOP_WORDS.remove('lol')

In [82]:
len(STOP_WORDS)

328

In [89]:
nlp.vocab['lol'].is_stop

False

In [85]:
len(STOP_WORDS)

328
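
The cells above touch two places: the stop-word set and the `is_stop` flag on the vocabulary entry. A small helper can update both together so they do not drift apart. This is a sketch under the assumption that both need updating, as the cells suggest; the name `set_stop_words` is hypothetical and is not a spaCy function.

```python
def set_stop_words(nlp, words, is_stop=True):
    """Add or remove words in the stop-word set and mirror the change on the vocab flags."""
    for word in words:
        if is_stop:
            nlp.Defaults.stop_words.add(word)
        else:
            # discard() does not raise if the word is already absent
            nlp.Defaults.stop_words.discard(word)
        nlp.vocab[word].is_stop = is_stop

# In the notebook:
# set_stop_words(nlp, ["lol"])                 # add
# set_stop_words(nlp, ["lol"], is_stop=False)  # remove again
```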

### Adding Several Stop Words at Once


In [86]:
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2"}

In [87]:
len(STOP_WORDS)

328
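
The `|=` operator used above is Python's in-place set union. The sketch below uses plain sets with made-up contents to show that it mutates the existing set object, so every name bound to that object sees the new words. Whether `STOP_WORDS` and `nlp.Defaults.stop_words` refer to the same object in a given spaCy version is not shown here and should be checked with `is`.

```python
stop_words = {"the", "of"}
alias = stop_words            # second name for the same set object

stop_words |= {"lol", "btw"}  # in-place union, same effect as stop_words.update(...)

print(alias is stop_words, sorted(alias))  # True ['btw', 'lol', 'of', 'the']
```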

### Second Method: Setting `is_stop` on the Vocabulary

In [71]:
customize_stop_words = ['uncooked', 'undercooked', 'raw']

# Flag each custom word as a stop word in the vocabulary
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

# Re-filter the document using the updated flags
tokens = [token.text for token in doc_covid if not token.is_stop]
print(f'Original Article: {doc_covid}')
print()
print(tokens)

Original Article: Through the International Food Safety Authorities Network (INFOSAN),
national food safety authorities are seeking more information on the
potential for persistence of SARS-CoV-2, which causes COVID-19, on foods
traded internationally as well as the potential role of food in the transmission
of the virus. Currently, there are investigations conducted to evaluate the
viability and survival time of SARS-CoV-2. As a general rule, the consumption
of raw or undercooked animal products should be avoided. Raw meat, raw
milk or raw animal organs should be handled with care to avoid crosscontamination with uncooked foods.

['International', 'Food', 'Safety', 'Authorities', 'Network', '(', 'INFOSAN', ')', ',', '\n', 'national', 'food', 'safety', 'authorities', 'seeking', 'information', '\n', 'potential', 'persistence', 'SARS', '-', 'CoV-2', ',', 'causes', 'COVID-19', ',', 'foods', '\n', 'traded', 'internationally', 'potential', 'role', 'food', 'transmission', '\n', 'virus', '.',
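
As an alternative to changing vocabulary flags, extra words can be filtered at list-building time, which leaves `nlp.vocab` untouched. This is a sketch of that alternative, not the method used in the cell above; the helper name `filter_with_extra` is hypothetical, and the case-insensitive comparison is an assumption.

```python
def filter_with_extra(doc, extra_stop_words):
    """Drop built-in stop words plus extra words, without modifying the vocabulary."""
    # Compare in lowercase so that 'Raw' is dropped along with 'raw'
    extra = {word.lower() for word in extra_stop_words}
    return [
        token.text
        for token in doc
        if not token.is_stop and token.text.lower() not in extra
    ]

# In the notebook: filter_with_extra(doc_covid, ['uncooked', 'undercooked', 'raw'])
```

Because nothing global is modified, the extra words apply only to this call and do not affect later cells.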

In [88]:
len(STOP_WORDS)

328

In [91]:
# set.pop() removes and returns an arbitrary element of the set
STOP_WORDS.pop()

'either'

In [92]:
len(STOP_WORDS)

327

In [96]:
nlp.Defaults.stop_words |= {'either'}

In [97]:
len(STOP_WORDS)

328

In [99]:
STOP_WORDS.remove('my_new_stopword1')

In [100]:
len(STOP_WORDS)

327