## Stop Words
__What are Stop Words?__
Stop Words are commonly occurring words that may carry little informative value in NLP tasks such as topic modeling or document classification.

__Why do we care?__
Because Stop Words often contribute very little to the understanding of a document, removing them can simplify things for us, for example by reducing the dimensionality of matrix representations of textual data. However, what was once common practice has since become less standard, especially when one wants to use deep learning methods. In some cases, we may want to forgo Stop Word removal altogether!
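
To see why removal is not always safe, consider a small sentiment example. The sketch below uses a tiny hand-picked stop word set (an assumption for illustration, not any library's actual list) and a hypothetical helper called remove_stop_words. Because "not" is treated as a stop word here, as it is in many English lists, two reviews with opposite meanings reduce to the same tokens.

```python
# A tiny, hand-picked stop word set for illustration only (not a library's list).
TOY_STOP_WORDS = {"the", "was", "not", "a", "is"}


def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop any token found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


positive = remove_stop_words("The movie was good", TOY_STOP_WORDS)
negative = remove_stop_words("The movie was not good", TOY_STOP_WORDS)

print(positive)
print(negative)
# After filtering, the two reviews can no longer be told apart.
print(positive == negative)
```

This is one reason models that rely on word order and context are often fed the unfiltered text.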

__Getting a list of Stop Words__ 
We can get a list of English Stop Words from libraries like SciKit Learn and SpaCy. SciKit Learn may change how these methods are imported and used in future releases, so check the documentation if errors or deprecation warnings are reported. As of January 2021, the code in the SciKit Learn section below should be sufficient.

__Be Aware:__ 
There is no universal list of Stop Words!  What SciKit Learn considers to be Stop Words won't match exactly with what another library, like SpaCy, considers to be Stop Words. 

### Getting Stop Words from SciKit Learn
Note that we don't have to get the Stop Word list from CountVectorizer specifically. get_stop_words() is a method available on the text vectorizer classes that build feature vectors from text documents. So, for example, we could have replaced CountVectorizer with TfidfVectorizer and still retrieved the same list of English Stop Words.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = "english") 

Print the first 20 Stop Words in SciKit Learn's List of English Stop Words:

In [2]:
sorted(list(cv.get_stop_words()))[:20]

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst']
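
The same list can be reached without going through CountVectorizer. A minimal sketch, assuming a scikit-learn version that exposes ENGLISH_STOP_WORDS in sklearn.feature_extraction.text (check the documentation for your installed version); the variable names are illustrative:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# TfidfVectorizer offers the same get_stop_words() method as CountVectorizer.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_stop_words = tfidf.get_stop_words()

# The built-in English list can also be imported directly.
print(len(ENGLISH_STOP_WORDS))
print(set(tfidf_stop_words) == set(ENGLISH_STOP_WORDS))
```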

### Getting Stop Words from SpaCy 
NOTE (SpaCy vs NLTK):  SpaCy is opinionated.  When you want to do something in SpaCy, it makes you do it "the SpaCy way", which tends to be very well optimized.  NLTK, in contrast, gives you a lot more variety in terms of getting the same thing done.  

SpaCy can be set to use different 'statistical models' that contain information about the language you want to work with. In this case, we're using an English model (en) of type core (general-purpose) that was trained on web data (web) and is small in size (sm), so it takes up less disk space.

In [3]:
import spacy 
nlp = spacy.load("en_core_web_sm") 

In [4]:
# Skip the first 6 entries (contraction fragments such as "'d" that sort before 'a') to get a comparable list
sorted(list(nlp.Defaults.stop_words))[6:26]

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount']
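
If only the word list is needed, it can also be read without loading a trained model. A minimal sketch, assuming the STOP_WORDS set in spacy.lang.en.stop_words is available in your SpaCy version; it also prints the apostrophe-prefixed entries that sort ahead of the letters:

```python
from spacy.lang.en.stop_words import STOP_WORDS

spacy_sorted = sorted(STOP_WORDS)

# Entries that begin with an ASCII apostrophe sort ahead of the letters.
leading_fragments = [w for w in spacy_sorted if w.startswith("'")]
print(leading_fragments)

# Ordinary words are present as well.
print("the" in STOP_WORDS)
```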

### Comparing SciKit Learn Stop Words to SpaCy Stop Words

In [5]:
sklearn_stop_words = sorted(list(cv.get_stop_words()))
spacy_stop_words = sorted(list(nlp.Defaults.stop_words))

__Get the common stop words between SciKit Learn and SpaCy__

In [6]:
# Keep each SciKit Learn stop word that also appears in SpaCy's list
common_stop_words = [word for word in sklearn_stop_words if word in spacy_stop_words]
print(common_stop_words[:20])

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount']


__What stop words are in sklearn but not spacy?__

In [7]:
sklearn_not_spacy = []
for i in sklearn_stop_words:
    if i not in spacy_stop_words:
        sklearn_not_spacy.append(i)
print(sklearn_not_spacy)

['amoungst', 'bill', 'cant', 'co', 'con', 'couldnt', 'cry', 'de', 'describe', 'detail', 'eg', 'etc', 'fill', 'find', 'fire', 'found', 'hasnt', 'ie', 'inc', 'interest', 'ltd', 'mill', 'sincere', 'system', 'thick', 'thin', 'un']


__What stop words are in spacy but not sklearn?__

In [8]:
spacy_not_sklearn = [] 
for i in spacy_stop_words:
    if i not in sklearn_stop_words:
        spacy_not_sklearn.append(i)
print(spacy_not_sklearn)

["'d", "'ll", "'m", "'re", "'s", "'ve", 'ca', 'did', 'does', 'doing', 'just', 'make', "n't", 'n‘t', 'n’t', 'quite', 'really', 'regarding', 'say', 'unless', 'used', 'using', 'various', '‘d', '‘ll', '‘m', '‘re', '‘s', '‘ve', '’d', '’ll', '’m', '’re', '’s', '’ve']
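
The three comparisons above map directly onto Python set operations. A minimal sketch using two short made-up lists (stand-ins for the real library lists, chosen only for illustration):

```python
# Short stand-in lists for illustration; the real lists are much longer.
list_a = ["a", "about", "amoungst", "bill", "the"]
list_b = ["a", "about", "'d", "really", "the"]

set_a, set_b = set(list_a), set(list_b)

common = sorted(set_a & set_b)  # in both lists
only_a = sorted(set_a - set_b)  # in list_a but not in list_b
only_b = sorted(set_b - set_a)  # in list_b but not in list_a

print(common)
print(only_a)
print(only_b)
```

With the real data, set(sklearn_stop_words) and set(spacy_stop_words) would take the place of the stand-in lists.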


### Removing Stop Words in One Line with SpaCy

In [18]:
text = """
Dave watched as the forest burned up on the hill,
only a few miles from his house. The car had
been hastily packed and Marta was inside trying to round
up the last of the pets. "Where could she be?" he wondered
as he continued to wait for Marta to appear with the pets.
"""

doc = nlp(text)
# Tokenize and put into a list in one line
token_list = [token for token in doc]
print(token_list[:10])

[
, Dave, watched, as, the, forest, burned, up, on, the]


In [17]:
# Remove stop words in one line using the .is_stop attribute
filtered_tokens = [token for token in doc if not token.is_stop]
print(filtered_tokens[:10])

[
, Dave, watched, forest, burned, hill, ,, 
, miles, house]
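
The filtered output above still contains newline and punctuation tokens. If those are unwanted too, the is_punct and is_space token attributes can be added to the filter. A minimal sketch, assuming spacy.blank("en") (a tokenizer-only English pipeline, used here so no trained model is needed) and a shortened version of the sample text:

```python
import spacy

# Assumption: a blank English pipeline is enough here, since is_stop, is_punct
# and is_space are lexical attributes that do not need a trained model.
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp(
    "Dave watched as the forest burned up on the hill,\n"
    "only a few miles from his house."
)

# Keep the text of tokens that are not stop words, punctuation, or whitespace.
content_words = [
    token.text
    for token in sample_doc
    if not (token.is_stop or token.is_punct or token.is_space)
]
print(content_words)
```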
