# Recipe 2-3. Removing Stop Words
In this recipe, we discuss how to remove stop words. Stop words
are very common words that carry little or no meaning compared
to other keywords. If we remove these very common words,
we can focus on the important keywords instead. For example, in the
context of a search engine, suppose your search query is “How to develop chatbot
using python.” If the search engine tries to find web pages that contain the
terms “how,” “to,” “develop,” “chatbot,” “using,” and “python,” it
is going to find far more pages that contain the terms “how” and “to” than
pages that contain information about developing a chatbot, because the terms
“how” and “to” are so commonly used in the English language. If we
remove such terms, the search engine can focus on retrieving pages
that contain the keywords “develop,” “chatbot,” and “python,” which would
more closely bring up pages that are of real interest. Similarly, we can remove
other very common words, and rare words as well.
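
To make the search example concrete, here is a minimal sketch that filters the query against a tiny stop list. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names written for this illustration, and the list is hand-picked so that it matches the keywords named above; it is not a standard stop list.

```python
# Hypothetical, hand-written stop list for illustration only.
STOP_WORDS = {"how", "to", "using"}

def remove_stop_words(query, stop_words=STOP_WORDS):
    """Return the query's tokens, lowercased, with stop words dropped."""
    tokens = query.lower().split()
    return [tok for tok in tokens if tok not in stop_words]

keywords = remove_stop_words("How to develop chatbot using python")
print(keywords)  # ['develop', 'chatbot', 'python']
```

Storing the stop words in a set keeps each membership check cheap, which matters once the list grows to a few hundred entries.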

# Problem

     You want to remove stop words.

# Solution

     The simplest way to do this is by using the NLTK library, or you can build your own stop words file.
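
If you build your own stop words file, one plausible layout is a plain text file with one word per line. The sketch below assumes that layout; the file name `my_stopwords.txt` and the helper `load_stop_words` are hypothetical, and the file is written to a temporary directory only so the example is self-contained.

```python
import os
import tempfile

# Write a small, hypothetical stop words file: one word per line.
stop_path = os.path.join(tempfile.mkdtemp(), "my_stopwords.txt")
with open(stop_path, "w", encoding="utf-8") as f:
    f.write("is\nthe\nto\nthis\n")

def load_stop_words(path):
    """Read one stop word per line, ignoring blank lines and case."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

custom_stop = load_stop_words(stop_path)
sentence = "this is introduction to nlp"
filtered = " ".join(w for w in sentence.split() if w not in custom_stop)
print(filtered)  # introduction nlp
```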

# How It Works
     Let’s follow the steps in this section to remove stop words from the text data.

# Step 1 Read/create the text data

In [1]:
import pandas as pd

text = ['This is introduction to NLP',
        'It is likely to be useful, to people ',
        'Machine learning is the new electrcity',
        'There would be less hype around AI and more action going forward',
        'python is the best tool!',
        'R is good langauage',
        'I like this book',
        'I want more books like this']
# Convert the list to a DataFrame
df = pd.DataFrame({'tweet': text})
df

|   | tweet |
|---|-------|
| 0 | This is introduction to NLP |
| 1 | It is likely to be useful, to people |
| 2 | Machine learning is the new electrcity |
| 3 | There would be less hype around AI and more ac... |
| 4 | python is the best tool! |
| 5 | R is good langauage |
| 6 | I like this book |
| 7 | I want more books like this |


# Step 2 Execute the commands below on the text data

In [6]:
# Lowercase every word; split/join also collapses extra whitespace
df["tweet_lower"] = df["tweet"].apply(lambda x: " ".join(w.lower() for w in x.split()))

In [7]:
df["tweet_lower"]

0                          this is introduction to nlp
1                 it is likely to be useful, to people
2               machine learning is the new electrcity
3    there would be less hype around ai and more ac...
4                             python is the best tool!
5                                  r is good langauage
6                                     i like this book
7                          i want more books like this
Name: tweet_lower, dtype: object

In [8]:
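# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['remove_punc'] = df["tweet_lower"].str.replace(r"[^\w\s]", "", regex=True)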


In [9]:
df

|   | tweet | tweet_lower | remove_punc |
|---|-------|-------------|-------------|
| 0 | This is introduction to NLP | this is introduction to nlp | this is introduction to nlp |
| 1 | It is likely to be useful, to people | it is likely to be useful, to people | it is likely to be useful to people |
| 2 | Machine learning is the new electrcity | machine learning is the new electrcity | machine learning is the new electrcity |
| 3 | There would be less hype around AI and more ac... | there would be less hype around ai and more ac... | there would be less hype around ai and more ac... |
| 4 | python is the best tool! | python is the best tool! | python is the best tool |
| 5 | R is good langauage | r is good langauage | r is good langauage |
| 6 | I like this book | i like this book | i like this book |
| 7 | I want more books like this | i want more books like this | i want more books like this |
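
To see what the lowercasing and punctuation steps do outside of pandas, here is a small sketch that uses only the standard library. The `normalize` helper is a hypothetical name; it applies the same `[^\w\s]` pattern as the recipe, which removes every character that is neither a word character nor whitespace. Note that `\w` includes the underscore, so underscores survive.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    # Drop anything that is not a word character or whitespace.
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(no_punct.split())

samples = ["It is likely to be useful, to people ", "python is the best tool!"]
cleaned = [normalize(s) for s in samples]
print(cleaned)
# ['it is likely to be useful to people', 'python is the best tool']
```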


# Step 3 Remove stop words

In [10]:
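import nltk
from nltk.corpus import stopwords
# Download the stop word corpus on first use
nltk.download("stopwords")
# The corpus file is named "english"; "English" fails on case-sensitive file systems
stop = stopwords.words("english")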

In [11]:
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]
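
A general-purpose list like this is rarely a perfect fit, so it is common to adjust it for the task. The sketch below is one way to do that under stated assumptions: `base_stop` is a short stand-in containing a few entries copied from the list above (not the full NLTK list), and `extra_words` and `keep_words` are hypothetical choices, for example keeping question words when the query type matters.

```python
# A few entries copied from the list above; a stand-in for the full list.
base_stop = ["i", "this", "is", "the", "to", "be", "it", "and", "why", "how"]

# Hypothetical task-specific choices.
extra_words = {"using"}       # treat as noise for this task
keep_words = {"why", "how"}   # question words we want to preserve

custom_stop = (set(base_stop) | extra_words) - keep_words

text = "how to develop chatbot using python"
result = " ".join(w for w in text.split() if w not in custom_stop)
print(result)  # how develop chatbot python
```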

In [13]:
df['remove_punc'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))

Each tweet is returned with its stop words removed; for example, “this is introduction to nlp” becomes “introduction nlp”.
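
The introduction also mentions removing very common and rare words. One way to approach that, sketched here with the standard library only, is to count token frequencies across the corpus and pick cutoffs. The cutoffs below (the two most frequent tokens, and tokens seen only once) are arbitrary assumptions for this toy corpus; on a corpus this small almost every token is rare, so the rare set is only computed, not applied.

```python
from collections import Counter

docs = [
    "this is introduction to nlp",
    "machine learning is the new electrcity",
    "python is the best tool",
    "i like this book",
    "i want more books like this",
]

# Count how often each token occurs across the whole corpus.
counts = Counter(word for doc in docs for word in doc.split())

# Assumed cutoffs: the 2 most frequent tokens, and tokens seen only once.
most_common = {word for word, _ in counts.most_common(2)}
rare = {word for word, n in counts.items() if n == 1}

# Remove only the most frequent tokens from each document.
common_removed = [
    " ".join(w for w in doc.split() if w not in most_common) for doc in docs
]
print(sorted(most_common))  # ['is', 'this']
print(common_removed[0])    # introduction to nlp
```

In practice the thresholds are tuned on the real corpus, since a fixed cutoff that works for eight short sentences will not transfer to a large collection.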