# spaCy

Import Library

In [1]:
import spacy
print('spaCy Version: %s' % (spacy.__version__))
spacy_nlp = spacy.load('en_core_web_sm')

spaCy Version: 3.4.4


Step 3: Check pre-defined stop words

In [2]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print('Number of stop words: %d' % len(spacy_stopwords))
print('First ten stop words: %s' % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: ['its', 'another', 'thereafter', 'be', 'anything', 'enough', 'if', 'throughout', 'part', 'not']


In [3]:
article = 'Original Article: In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removin/g these stop words to support phrase search.'

Step 4: Remove stop words

In [4]:
doc = spacy_nlp(article)
tokens = [token.text for token in doc if not token.is_stop]
print('Original Article: %s' % (article))
print()
print(tokens)

Original Article: Original Article: In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removin/g these stop words to support phrase search.

['Original', 'Article', ':', 'computing', ',', 'stop', 'words', 'words', 'filtered', 'processing', 'natural', 'language', 'data', '(', 'text).[1', ']', '"', 'stop', 'words', '"', 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'natural', 'language', 'processing', 'tools', ',', 'tools', 'use', 'list', '.', 'tools', 'specifically', 'avoid', 'removin', '/', 'g', 'stop', 'words', 'support', 'phrase', 'search', '.']


Step 5: Add custom stop words

In [5]:
customize_stop_words = [
    'computing', 'filtered','language'
]
for w in customize_stop_words:
    spacy_nlp.vocab[w].is_stop = True
doc = spacy_nlp(article)
tokens = [token.text for token in doc if not token.is_stop]
print('Original Article: %s' % (article))
print()
print(tokens)

Original Article: Original Article: In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removin/g these stop words to support phrase search.

['Original', 'Article', ':', ',', 'stop', 'words', 'words', 'processing', 'natural', 'data', '(', 'text).[1', ']', '"', 'stop', 'words', '"', 'usually', 'refers', 'common', 'words', ',', 'single', 'universal', 'list', 'stop', 'words', 'natural', 'processing', 'tools', ',', 'tools', 'use', 'list', '.', 'tools', 'specifically', 'avoid', 'removin', '/', 'g', 'stop', 'words', 'support', 'phrase', 'search', '.']


After “computing”, “filtered” and “language” are added as custom stop words, they are removed from the output as well.

# NLTK

Import Library

In [6]:
import nltk 
print('NLTK Version: %s' % (nltk.__version__))
nltk.download('stopwords')

NLTK Version: 3.8.1


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\RAHUL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Step 3: Check pre-defined stop words

In [7]:
nltk_stopwords = nltk.corpus.stopwords.words('english')
print('Number of stop words: %d' % len(nltk_stopwords))
print('First ten stop words: %s' % list(nltk_stopwords)[:10])

Number of stop words: 179
First ten stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
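
The two pre-defined lists differ in size (326 entries in spaCy versus 179 in NLTK above), so it can help to see exactly where they disagree. Here is a minimal sketch using plain Python sets. The two short sets are hypothetical samples standing in for `spacy_stopwords` and `nltk_stopwords`, not their real contents, and `compare_stop_lists` is an illustrative helper name:

```python
# Hypothetical samples standing in for the real stop word lists.
list_a = {'the', 'are', 'indeed', 'used', 'not'}    # e.g. set(spacy_stopwords)
list_b = {'the', 'are', 'not', 'myself', "you're"}  # e.g. set(nltk_stopwords)

def compare_stop_lists(a, b):
    """Return (shared, only_in_a, only_in_b) as sorted lists."""
    a, b = set(a), set(b)
    return sorted(a & b), sorted(a - b), sorted(b - a)

shared, only_a, only_b = compare_stop_lists(list_a, list_b)
print('Shared:', shared)
print('Only in A:', only_a)
print('Only in B:', only_b)
```

Passing the real lists instead of the samples shows which words one library would filter and the other would keep.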


Step 4: Remove stop words

In [8]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\RAHUL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
tokens = nltk.tokenize.word_tokenize(article)
tokens = [token for token in tokens if token not in nltk_stopwords]
print('Original Article: %s' % (article))
print()
print(tokens)

Original Article: Original Article: In computing, stop words are words which are filtered out before or after processing of natural language data (text).[1] Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removin/g these stop words to support phrase search.

['Original', 'Article', ':', 'In', 'computing', ',', 'stop', 'words', 'words', 'filtered', 'processing', 'natural', 'language', 'data', '(', 'text', ')', '.', '[', '1', ']', 'Though', '``', 'stop', 'words', "''", 'usually', 'refers', 'common', 'words', 'language', ',', 'single', 'universal', 'list', 'stop', 'words', 'used', 'natural', 'language', 'processing', 'tools', ',', 'indeed', 'tools', 'even', 'use', 'list', '.', 'Some', 'tools', 'specifically', 'avoid', 'removin/g', 'stop', 'words', 'support', 'phrase', 'search', '.']


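Common words such as “are” and “the” are removed here as well, but the two libraries do not agree on every word. For example, “indeed”, “even” and “used” are removed by spaCy but kept by NLTK. The NLTK membership test above is also case-sensitive, so the capitalized “In” is kept even though the lowercase “in” is removed.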
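
In the NLTK output above, the capitalized token 'In' survives while the lowercase 'in' is dropped, because the membership test compares each token exactly as written. Here is a minimal sketch of a case-insensitive filter, assuming the stop list itself is lowercase. `sample_stopwords` is a small illustrative subset and `remove_stop_words` is a hypothetical helper, not an NLTK function:

```python
# Small illustrative subset; in practice this could be set(nltk_stopwords).
sample_stopwords = {'in', 'are', 'the', 'some', 'to'}

def remove_stop_words(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stop word set."""
    stopwords = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ['In', 'computing', ',', 'stop', 'words', 'are',
          'filtered', '.', 'Some', 'tools']
filtered = remove_stop_words(tokens, sample_stopwords)
print(filtered)
```

Converting the list to a set also makes each lookup constant-time, which matters for longer documents.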

# jieba

For Chinese text, we use the same idea: segment the text into words, then filter out every word that appears in the stop word list.

In [10]:
pip install jieba

Note: you may need to restart the kernel to use updated packages.


In [11]:
import jieba
print('jieba Version: %s' % jieba.__version__)

jieba Version: 0.42.1


In [12]:
jieba_stop_words = [
    '的', '了', '和', '是', '就', '都', '而', '及', '與', 
    '著', '或', '一個', '沒有', '我們', '你們', '妳們', 
    '他們', '她們', '是否'
]

In [16]:
article2 = ' 在信息檢索中，為節省存儲空間和提高搜索效率，在處理自然語言數據（或文本）之前或之後會自動過濾掉某些字或詞，這些字或詞即被稱為Stop Words(停用詞)。不要把停用詞與安全口令混淆。 這些停用詞都是人工輸入、非自動化生成的，生成後的停用詞會形成一個停用詞表。但是，並沒有一個明確的停用詞表能夠適用於所有的工具。甚至有一些工具是明確地避免使用停用詞來支持短語搜索的。'

In [17]:
print('Original Article: %s' % (article2))
print()
words = jieba.cut(article2, cut_all=False)
words = [word for word in words if word not in jieba_stop_words]
print(words)

Building prefix dict from the default dictionary ...


Original Article:  在信息檢索中，為節省存儲空間和提高搜索效率，在處理自然語言數據（或文本）之前或之後會自動過濾掉某些字或詞，這些字或詞即被稱為Stop Words(停用詞)。不要把停用詞與安全口令混淆。 這些停用詞都是人工輸入、非自動化生成的，生成後的停用詞會形成一個停用詞表。但是，並沒有一個明確的停用詞表能夠適用於所有的工具。甚至有一些工具是明確地避免使用停用詞來支持短語搜索的。



Dumping model to file cache C:\Users\RAHUL\AppData\Local\Temp\jieba.cache
Loading model cost 0.599 seconds.
Prefix dict has been built successfully.


[' ', '在', '信息', '檢索', '中', '，', '為節', '省存', '儲空間', '提高', '搜索', '效率', '，', '在', '處理', '自然', '語言數', '據', '（', '文本', '）', '之前', '之後會', '自動', '過濾', '掉', '某些', '字', '詞', '，', '這些', '字', '詞', '即', '被', '稱', '為', 'Stop', ' ', 'Words', '(', '停用', '詞', ')', '。', '不要', '把', '停用', '詞', '安全', '口令', '混淆', '。', ' ', '這些', '停用', '詞', '人工', '輸入', '、', '非自動', '化生成', '，', '生成', '後', '停用', '詞會', '形成', '停用', '詞表', '。', '但是', '，', '並沒有', '明確', '停用', '詞表能夠', '適用', '於', '所有', '工具', '。', '甚至', '有', '一些', '工具', '明確', '地', '避免', '使用', '停用', '詞來', '支持', '短語', '搜索', '。']
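
The jieba result above still contains whitespace tokens (' ') and punctuation such as '，' and '。', because the stop list only covers words. Here is a minimal sketch of a post-filter built on the standard `unicodedata` module. `is_noise` and `drop_noise` are hypothetical helper names, and the sample list is a handful of tokens taken from the output above:

```python
import unicodedata

def is_noise(token):
    """True if the token is empty, whitespace, or made up only of punctuation."""
    if not token.strip():
        return True
    # Unicode general categories starting with 'P' are punctuation.
    return all(unicodedata.category(ch).startswith('P') for ch in token)

def drop_noise(tokens):
    """Keep only tokens that carry some non-punctuation content."""
    return [t for t in tokens if not is_noise(t)]

sample = [' ', '在', '信息', '檢索', '中', '，', 'Stop', ' ', 'Words',
          '(', '停用', '詞', ')', '。']
cleaned = drop_noise(sample)
print(cleaned)
```

Because the check relies on Unicode categories, it covers both ASCII and full-width punctuation without listing every symbol by hand.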


Conclusion

The procedure for removing stop words is similar across libraries, so the most important part is defining your own stop words. In the initial phase, a pre-defined stop word list can be adopted, but more words should be added to the list over time as you learn which ones carry no signal for your data.
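
One way to grow the list over time is to keep project-specific stop words in a plain text file, one word per line, and merge them into the pre-defined set at load time. Here is a minimal sketch, assuming that file layout. `load_stop_words` and the sample contents are hypothetical, and `io.StringIO` stands in for a real file opened with `open(...)`:

```python
import io

def load_stop_words(base, extra_file):
    """Merge a base stop word collection with words read from a file-like object.

    Blank lines and lines starting with '#' are skipped; words are lowercased.
    """
    merged = {w.lower() for w in base}
    for line in extra_file:
        word = line.strip().lower()
        if word and not word.startswith('#'):
            merged.add(word)
    return merged

base_words = ['the', 'are', 'of']  # e.g. a pre-defined list from a library
custom_file = io.StringIO('# domain words\ncomputing\nFiltered\n\nlanguage\n')
stop_words = load_stop_words(base_words, custom_file)
print(sorted(stop_words))
```

Keeping the custom words outside the code means the list can be reviewed and extended without touching the filtering logic.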

Besides the pre-defined stop words in spaCy or NLTK, we can use lists defined by other parties such as Stanford NLP and Rank NL. You may check out the stop list from