# Stop Words

- Stop words are commonly used words that are often removed from texts during natural language processing (NLP) tasks such as information retrieval, text mining, and sentiment analysis.
- These words are considered to have little or no significance in determining the meaning or sentiment of a sentence, and their removal can help improve processing efficiency and accuracy.

- Examples of stop words include common words like "a," "an," "the," "in," "is," "and," "of," "on," and so on. These words are frequently used in the English language but do not carry much semantic value on their own.

While stop words are commonly removed in many NLP tasks, there are certain scenarios where **it is not advisable to remove them**. Here are a few cases where retaining stop words can be beneficial:

1. **Contextual Analysis:** Stop words can carry important context. In sentiment analysis, for example, negations such as "not" and "no" appear on many stop-word lists, and removing them can turn "not good" into "good", inverting the sentiment of the sentence.

2. **Language Generation:** If you are generating natural language text, removing stop words can alter the grammatical structure and coherence of the generated sentences. Stop words often play a role in sentence formation and can be necessary for generating grammatically correct and coherent output.

3. **Named Entity Recognition:** Stop words may be part of named entities such as person names, organizations, or locations (for example, "The Who" or "Bank of America"). Removing them can break the entity apart and reduce the accuracy of named entity recognition systems.

4. **Information Retrieval:** In some search queries, stop words carry the user's intent. A query such as "to be or not to be" consists almost entirely of stop words, so removing them may leave nothing to search for or lead to imprecise results.

5. **Linguistic Analysis:** In linguistic research or language studies, the analysis of stop words can provide insights into language patterns, usage frequencies, and syntactic structures. Retaining stop words in these cases allows for a more comprehensive analysis of the language.

In general, it's crucial to consider the specific task, dataset, and context when deciding whether to remove stop words. Understanding the impact of removing stop words on the particular analysis you are conducting is important to ensure accurate and meaningful results.
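
The cases above can be made concrete with a small sketch. It uses plain Python and a tiny hand-picked stop list (`TOY_STOP_WORDS`, a hypothetical stand-in, not spaCy's list) together with a hypothetical helper `remove_stop_words`, to show how filtering can drop a negation or erase an entire query.

```python
# Hypothetical mini stop list for illustration only; real lists (e.g. spaCy's) are much longer.
TOY_STOP_WORDS = {"a", "an", "the", "is", "to", "be", "or", "not", "of", "this"}


def remove_stop_words(sentence):
    """Lowercase, split on whitespace, and drop words found in TOY_STOP_WORDS."""
    return [w for w in sentence.lower().split() if w not in TOY_STOP_WORDS]


# Sentiment: the negation disappears, so both reviews reduce to the same tokens.
print(remove_stop_words("This movie is good"))      # ['movie', 'good']
print(remove_stop_words("This movie is not good"))  # ['movie', 'good']

# Information retrieval: a query made only of stop words becomes empty.
print(remove_stop_words("to be or not to be"))      # []
```

If a downstream model only sees the filtered tokens, it cannot tell the two reviews apart, which is the kind of context loss described in the list above.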

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
len(STOP_WORDS)

326

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("We just opened our wings, the flying part is coming soon")

# Print every token that spaCy flags as a stop word
for token in doc:
    if token.is_stop:
        print(token)

We
just
our
the
part
is


In [None]:
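# Return the non-stop-word tokens of `text` using a list comprehension

def preprocess(text):
    # Tokenize the argument; reusing the global `doc` would return the same tokens for every input
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop]
    return no_stop_words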

In [None]:
preprocess("We just opened our wings, the flying part is coming soon")

['opened', 'wings', ',', 'flying', 'coming', 'soon']

- Let's modify the list comprehension above to remove punctuation as well.

In [None]:
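def preprocess(text):
    # Tokenize the argument; reusing the global `doc` would return the same tokens for every input
    doc = nlp(text)
    no_stop_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return no_stop_words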

In [None]:
preprocess("We just opened our wings, the flying part is coming soon")

['opened', 'wings', 'flying', 'coming', 'soon']

In [None]:
!pip install kaggle



In [None]:
!mkdir -p ~/.kaggle


In [None]:
!cp kaggle.json ~/.kaggle

In [None]:
! chmod 600 ~/.kaggle/kaggle.json
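
The same credential setup can also be done from Python, in a way that does not fail if the directory already exists. This is a minimal sketch using only the standard library; the helper name `install_kaggle_token` and its parameters are hypothetical, and it assumes `kaggle.json` has already been uploaded to the working directory.

```python
import shutil
from pathlib import Path


def install_kaggle_token(token_path="kaggle.json", config_dir=None):
    """Copy the API token into the Kaggle config directory with owner-only permissions."""
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = config_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    target.chmod(0o600)  # read/write for the owner only, like `chmod 600`
    return target


# install_kaggle_token()  # call once kaggle.json is in the working directory
```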


In [None]:
!kaggle datasets download -d jbencina/department-of-justice-20092018-press-releases

department-of-justice-20092018-press-releases.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
!ls /content/department-of-justice-20092018-press-releases.zip

/content/department-of-justice-20092018-press-releases.zip


In [None]:
!unzip /content/department-of-justice-20092018-press-releases.zip

Archive:  /content/department-of-justice-20092018-press-releases.zip
  inflating: combined.json           


In [None]:
import pandas as pd

# lines=True: the file is in JSON Lines format, one JSON object per line
df = pd.read_json("combined.json", lines=True)
df.shape

FileNotFoundError: ignored
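
To see what `lines=True` expects, here is a small standard-library sketch that parses the same kind of layout by hand. The two records are made-up placeholders with only some of the dataset's fields, and `read_json_lines` is a hypothetical helper name.

```python
import io
import json


def read_json_lines(handle):
    """Parse one JSON object per non-empty line and return them as a list of dicts."""
    return [json.loads(line) for line in handle if line.strip()]


# Made-up records in JSON Lines layout: each line is a complete JSON object.
sample = io.StringIO(
    '{"id": "x-1", "title": "First release", "topics": []}\n'
    '{"id": "x-2", "title": "Second release", "topics": ["Tax"]}\n'
)

records = read_json_lines(sample)
print(len(records))          # 2
print(records[1]["topics"])  # ['Tax']
```

A file laid out this way is not a single valid JSON document, which is why `pd.read_json` needs `lines=True` to read it.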

In [None]:
df.head()

Unnamed: 0,id,title,contents,date,topics,components
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,[],[National Security Division (NSD)]
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,[],[Environment and Natural Resources Division]
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,[],[Environment and Natural Resources Division]
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,[],[Environment and Natural Resources Division]
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]


- Several rows have an empty `topics` list, so we will remove the press releases that have no topics.

In [None]:
type(df.topics[0])

list

- Since each `topics` value is a list, we can filter on its length.

- Filter out the rows that have no topics associated with the case.

In [None]:
# .str.len() gives the length of each list, so this keeps rows with at least one topic
df = df[df["topics"].str.len() != 0]
df.head()

Unnamed: 0,id,title,contents,date,topics,components
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division]
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division]
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U..."
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division]
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]"
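
The filter relies on `.str.len()` working element-wise on list values, not only on strings. The sketch below checks this on a made-up three-row frame (`toy` is a hypothetical stand-in for `df`) and compares it with an equivalent mask built with `map(len)`.

```python
import pandas as pd

# Made-up frame with a list-valued column, mirroring the shape of `topics`.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "topics": [[], ["Tax"], ["Environment", "Tax"]],
})

lengths = toy["topics"].str.len()  # element-wise len(), works on lists as well as strings
print(lengths.tolist())            # [0, 1, 2]

kept = toy[lengths != 0]
print(kept["title"].tolist())      # ['B', 'C']

# Equivalent mask without the .str accessor.
same = toy[toy["topics"].map(len) > 0]
print(same.equals(kept))           # True
```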


In [None]:
df.shape

(4688, 6)

In [None]:
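# Keep the first 100 rows; .copy() avoids a SettingWithCopyWarning when a column is added later
df = df.head(100).copy()
df.shape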

(100, 6)

In [None]:
df["contents_new"] = df.contents.apply(preprocess)

In [None]:
df

Unnamed: 0,id,title,contents,date,topics,components,contents_new
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,[Environment],[Environment and Natural Resources Division],"[opened, wings, flying, coming, soon]"
7,14-1412,14 Indicted in Connection with New England Com...,A 131-count criminal indictment was unsealed t...,2014-12-17T00:00:00-05:00,[Consumer Protection],[Civil Division],"[opened, wings, flying, coming, soon]"
19,17-1419,2017 Southeast Regional Animal Cruelty Prosecu...,The United States Attorney’s Office for the Mi...,2017-12-14T00:00:00-05:00,[Environment],"[Environment and Natural Resources Division, U...","[opened, wings, flying, coming, soon]"
22,15-1562,21st Century Oncology to Pay $19.75 Million to...,"21st Century Oncology LLC, has agreed to pay $...",2015-12-18T00:00:00-05:00,"[False Claims Act, Health Care Fraud]",[Civil Division],"[opened, wings, flying, coming, soon]"
23,17-1404,21st Century Oncology to Pay $26 Million to Se...,21st Century Oncology Inc. and certain of its ...,2017-12-12T00:00:00-05:00,"[Health Care Fraud, False Claims Act]","[Civil Division, USAO - Florida, Middle]","[opened, wings, flying, coming, soon]"
...,...,...,...,...,...,...,...
316,15-1359,Alaska Plastic Surgeon Convicted of Wire Fraud...,Doctor Hid Millions in Secret Accounts in Pana...,2015-11-04T00:00:00-05:00,[Tax],[Tax Division],"[opened, wings, flying, coming, soon]"
318,16-396,Alaska Plastic Surgeon Sentenced to Prison for...,Defendant Concealed Bank Accounts in Panama an...,2016-04-04T00:00:00-04:00,[Tax],[Tax Division],"[opened, wings, flying, coming, soon]"
321,17-736,Alaskan Commercial Fishing Couple Charged with...,An Alaskan couple was charged in federal court...,2017-07-26T00:00:00-04:00,[Tax],"[Tax Division, USAO - Alaska]","[opened, wings, flying, coming, soon]"
322,18-717,Alaskan Husband And Wife Plead Guilty To Willf...,A husband and wife pleaded guilty yesterday to...,2018-06-01T00:00:00-04:00,[Tax],[Tax Division],"[opened, wings, flying, coming, soon]"


In [None]:
len(df.contents[4])

6286

In [None]:
len(df.contents_new[4])

5

In [None]:
df.contents[4][:300]

'The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin conta'

In [None]:
df.contents_new[4][:300]

['opened', 'wings', 'flying', 'coming', 'soon']
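
In the output recorded here, every row of `contents_new` holds the same five tokens, and `len(df.contents_new[4])` is 5 for a 6,286-character press release. That pattern suggests the preprocessing function returned the tokens of one fixed sentence rather than of each row's text. A cheap guard is to check that different inputs produce different outputs. The sketch below is plain Python; `check_uses_input`, `ignores_input`, and `uses_input` are hypothetical names, and the whitespace tokenizer merely stands in for spaCy.

```python
def check_uses_input(func, samples):
    """Return True if `func` gives more than one distinct result across `samples`."""
    results = {tuple(func(s)) for s in samples}
    return len(results) > 1


FIXED_TOKENS = "opened wings flying coming soon".split()


def ignores_input(text):
    # Mimics a function that reads a global object instead of its argument.
    return FIXED_TOKENS


def uses_input(text):
    # Stand-in tokenizer: split the argument itself on whitespace.
    return text.split()


samples = ["two subsidiaries agreed to clean up", "a federal grand jury in Las Vegas"]
print(check_uses_input(ignores_input, samples))  # False
print(check_uses_input(uses_input, samples))     # True
```

Running a check like this on a handful of rows before calling `df.contents.apply(...)` on the whole frame makes this kind of mistake visible early.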