# Stop Words

Stop words are common words that are often filtered out during the preprocessing of natural language text in various applications, including natural language processing (NLP) and information retrieval. These words are typically the most frequently occurring words in a language but often do not carry significant meaning on their own. Examples of stop words include "the," "and," "is," "in," and "to".

### Role of stop words

1. Reducing Dimensionality
2. Improving Computational Efficiency
3. Focusing on Content-bearing Words
4. Enhancing Information Retrieval
5. Improving Analysis and Visualization
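As a rough illustration of the first point, the sketch below counts the vocabulary of three made-up sentences before and after filtering. The `STOP` set, the sample sentences, and the `vocabulary` helper are all assumptions made for this example; spaCy's own list, used in the cells that follow, is much larger.

```python
# Hypothetical, tiny stop list for illustration only.
STOP = {"the", "and", "is", "in", "to", "a", "of"}

# Made-up sample corpus.
docs = [
    "the cat is in the garden",
    "a dog and a cat went to the park",
    "the park is full of dogs",
]

def vocabulary(texts, stop=frozenset()):
    """Return the set of distinct whitespace-separated words not in `stop`."""
    return {word for text in texts for word in text.split() if word not in stop}

full = vocabulary(docs)
reduced = vocabulary(docs, STOP)
print(len(full), len(reduced))
```

In a bag-of-words model each distinct word is one feature, so a smaller vocabulary means fewer dimensions.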

In [1]:
import spacy

from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS)

326

In [2]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("We just opened our wings, the flying part is coming soon")

In [3]:
for token in doc:
    if token.is_stop:
        print(token)

We
just
our
the
part
is


In [4]:
def preprocess(text):
    doc = nlp(text)
    
    not_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(not_stop_words)

In [5]:
preprocess(doc.text)

'opened wings flying coming soon'

In [6]:
preprocess("The other is not other but your divine brother")

'divine brother'

In [7]:
preprocess("Musk wants time to prepare for a trial over his new business project")

'Musk wants time prepare trial new business project'

### Remove stop words from pandas dataframe text column

In [8]:
import pandas as pd

df = pd.read_json("doj_press.json", lines=True)
df.shape

(13087, 6)

In [9]:
df.head()

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 0 | | Convicted Bomb Plotter Sentenced to 30 Years | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01T00:00:00-04:00 | [] | [National Security Division (NSD)] |
| 1 | 12-919 | $1 Million in Restitution Payments Announced t... | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 2 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | BOSTON– A $1-million settlement has been... | 2011-08-03T00:00:00-04:00 | [] | [Environment and Natural Resources Division] |
| 3 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08T00:00:00-05:00 | [] | [Environment and Natural Resources Division] |
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13087 entries, 0 to 13086
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          12810 non-null  object
 1   title       13087 non-null  object
 2   contents    13087 non-null  object
 3   date        13087 non-null  object
 4   topics      13087 non-null  object
 5   components  13087 non-null  object
dtypes: object(6)
memory usage: 613.6+ KB


In [11]:
df.describe()

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| count | 12810 | 13087 | 13087 | 13087 | 13087 | 13087 |
| unique | 12672 | 12887 | 13080 | 2400 | 253 | 810 |
| top | 13-526 | Northern California Real Estate Investor Agree... | WASHINGTON – ING Bank N.V., a financial inst... | 2018-04-13T00:00:00-04:00 | [] | [Criminal Division] |
| freq | 3 | 8 | 2 | 20 | 8399 | 2680 |


In [12]:
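# Keep only rows with at least one topic; .copy() avoids a SettingWithCopyWarning
# when a new column is added to the filtered frame later.
df = df[df.topics.str.len() != 0].copy()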
df.head()

| | id | title | contents | date | topics | components |
|---|---|---|---|---|---|---|
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] |
| 7 | 14-1412 | 14 Indicted in Connection with New England Com... | A 131-count criminal indictment was unsealed t... | 2014-12-17T00:00:00-05:00 | [Consumer Protection] | [Civil Division] |
| 19 | 17-1419 | 2017 Southeast Regional Animal Cruelty Prosecu... | The United States Attorney’s Office for the Mi... | 2017-12-14T00:00:00-05:00 | [Environment] | [Environment and Natural Resources Division, U... |
| 22 | 15-1562 | 21st Century Oncology to Pay $19.75 Million to... | 21st Century Oncology LLC, has agreed to pay $... | 2015-12-18T00:00:00-05:00 | [False Claims Act, Health Care Fraud] | [Civil Division] |
| 23 | 17-1404 | 21st Century Oncology to Pay $26 Million to Se... | 21st Century Oncology Inc. and certain of its ... | 2017-12-12T00:00:00-05:00 | [Health Care Fraud, False Claims Act] | [Civil Division, USAO - Florida, Middle] |


In [13]:
df.shape

(4688, 6)

In [14]:
len(df['contents'].iloc[4])

5504

In [15]:
df["contents_new"] = df["contents"].apply(preprocess)
df.head()

| | id | title | contents | date | topics | components | contents_new |
|---|---|---|---|---|---|---|---|
| 4 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09T00:00:00-04:00 | [Environment] | [Environment and Natural Resources Division] | U.S. Department Justice U.S. Environmental Pro... |
| 7 | 14-1412 | 14 Indicted in Connection with New England Com... | A 131-count criminal indictment was unsealed t... | 2014-12-17T00:00:00-05:00 | [Consumer Protection] | [Civil Division] | 131 count criminal indictment unsealed today B... |
| 19 | 17-1419 | 2017 Southeast Regional Animal Cruelty Prosecu... | The United States Attorney’s Office for the Mi... | 2017-12-14T00:00:00-05:00 | [Environment] | [Environment and Natural Resources Division, U... | United States Attorney Office Middle District ... |
| 22 | 15-1562 | 21st Century Oncology to Pay $19.75 Million to... | 21st Century Oncology LLC, has agreed to pay $... | 2015-12-18T00:00:00-05:00 | [False Claims Act, Health Care Fraud] | [Civil Division] | 21st Century Oncology LLC agreed pay $ 19.75 m... |
| 23 | 17-1404 | 21st Century Oncology to Pay $26 Million to Se... | 21st Century Oncology Inc. and certain of its ... | 2017-12-12T00:00:00-05:00 | [Health Care Fraud, False Claims Act] | [Civil Division, USAO - Florida, Middle] | 21st Century Oncology Inc. certain subsidiarie... |


In [16]:
len(df['contents_new'].iloc[4])

4217
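The two lengths printed above (5504 characters before, 4217 after) give the relative reduction for this one document. A quick check of the arithmetic, using only those two numbers:

```python
before, after = 5504, 4217  # character counts printed by the cells above

removed = before - after
reduction = removed / before
print(removed, f"{reduction:.1%}")
```

This is the saving for a single press release; other rows will differ.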

In [18]:
df.contents[4][:300]

'The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin conta'

In [19]:
df.contents_new[4][:300]

'U.S. Department Justice U.S. Environmental Protection Agency EPA Rhode Island Department Environmental Management RIDEM announced today subsidiaries Stanley Black Decker Inc.—Emhart Industries Inc. Black Decker Inc.—have agreed clean dioxin contaminated sediment soil Centredale Manor Restoration Pro'
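Calling `preprocess` through `apply` runs the pipeline on one document at a time. For a few thousand long texts, batching with `nlp.pipe` is one option worth trying. The sketch below is an assumption-laden variant, not the notebook's method: `preprocess_many` is a hypothetical helper, and `spacy.blank("en")` is swapped in as a tokenizer-only pipeline so the example runs without a model download. `is_stop` and `is_punct` are lexical attributes, so they should not need the tagger, parser, or NER.

```python
import spacy

# Assumption: a tokenizer-only English pipeline is enough for stop-word filtering.
nlp_light = spacy.blank("en")

def preprocess_many(texts, batch_size=64):
    """Remove stop words and punctuation from many texts using nlp.pipe batching."""
    cleaned = []
    for doc in nlp_light.pipe(texts, batch_size=batch_size):
        kept = [token.text for token in doc if not token.is_stop and not token.is_punct]
        cleaned.append(" ".join(kept))
    return cleaned

# Two sentences taken from the earlier cells.
samples = [
    "We just opened our wings, the flying part is coming soon",
    "Musk wants time to prepare for a trial over his new business project",
]
print(preprocess_many(samples))
```

With the DataFrame above, something like `df["contents_new"] = preprocess_many(df["contents"])` would take the place of the `apply` call. How much time it saves depends on the pipeline and the batch size.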

### Examples where removing stop words can create a problem

**(1) Sentiment detection: In some cases, depending on your dataset, removing stop words can change the sentiment of a sentence. Here "not" is a stop word, so both sentences below reduce to the same text.**

In [20]:
preprocess("this is a good movie")

'good movie'

In [21]:
preprocess("this is not a good movie")

'good movie'
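One possible workaround is to exempt negation words from the filter. The sketch below is only an illustration under stated assumptions: the `NEGATIONS` set and the `preprocess_keep_negations` helper are hypothetical names chosen here, the choice of which words to protect depends on the task, and `spacy.blank("en")` is used so the example runs without a model download.

```python
import spacy

# Assumption: a tokenizer-only English pipeline; is_stop and is_punct are lexical attributes.
nlp_blank = spacy.blank("en")

# Hypothetical list of words to keep even though spaCy marks them as stop words.
NEGATIONS = {"not", "no", "never", "n't"}

def preprocess_keep_negations(text):
    """Drop punctuation and stop words, but keep tokens listed in NEGATIONS."""
    doc = nlp_blank(text)
    kept = [
        token.text
        for token in doc
        if not token.is_punct and (not token.is_stop or token.lower_ in NEGATIONS)
    ]
    return " ".join(kept)

print(preprocess_keep_negations("this is a good movie"))
print(preprocess_keep_negations("this is not a good movie"))
```

The two sentences now produce different outputs, so a downstream sentiment model can still see the negation.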

**(2) Language translation: Say you want to translate the following sentence from English to Telugu. If you remove stop words before translating, the result will be poor, because almost nothing of the sentence is left.**

In [22]:
preprocess("how are you doing dhaval?")

'dhaval'

**(3) Chatbot or any Q&A system: removing stop words drops the negation in "don't", so the user's complaint loses its meaning.**

In [23]:
preprocess("I don't find yoga mat on your website. Can you help?")

'find yoga mat website help'