
## Tokenization


### Document

We’ll be using a text narrated by Steve Jobs in the “Think Different” Apple commercial.

In [1]:
text = """Here’s to the crazy ones, the misfits, the rebels, the troublemakers,
the round pegs in the square holes. The ones who see things differently — they’re not fond of
rules. You can quote them, disagree with them, glorify
or vilify them, but the only thing you can’t do is ignore them because they
change things. They push the human race forward, and while some may see them
as the crazy ones, we see genius, because the ones who are crazy enough to think
that they can change the world, are the ones who do."""

## 1. Simple tokenization with `.split`



In [2]:

# word tokenization
text.split()
# sentence tokenizer
# text.split('.') #splitting sentence by sentence

['Here’s',
 'to',
 'the',
 'crazy',
 'ones,',
 'the',
 'misfits,',
 'the',
 'rebels,',
 'the',
 'troublemakers,',
 'the',
 'round',
 'pegs',
 'in',
 'the',
 'square',
 'holes.',
 'The',
 'ones',
 'who',
 'see',
 'things',
 'differently',
 '—',
 'they’re',
 'not',
 'fond',
 'of',
 'rules.',
 'You',
 'can',
 'quote',
 'them,',
 'disagree',
 'with',
 'them,',
 'glorify',
 'or',
 'vilify',
 'them,',
 'but',
 'the',
 'only',
 'thing',
 'you',
 'can’t',
 'do',
 'is',
 'ignore',
 'them',
 'because',
 'they',
 'change',
 'things.',
 'They',
 'push',
 'the',
 'human',
 'race',
 'forward,',
 'and',
 'while',
 'some',
 'may',
 'see',
 'them',
 'as',
 'the',
 'crazy',
 'ones,',
 'we',
 'see',
 'genius,',
 'because',
 'the',
 'ones',
 'who',
 'are',
 'crazy',
 'enough',
 'to',
 'think',
 'that',
 'they',
 'can',
 'change',
 'the',
 'world,',
 'are',
 'the',
 'ones',
 'who',
 'do.']
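The output above shows that `.split()` leaves punctuation attached to the neighbouring word (`'ones,'`, `'holes.'`). As a minimal sketch, assuming a plain regular expression is acceptable for the task, the standard-library `re` module can separate words from punctuation. The `sample` string and the pattern are illustrative choices, not part of the original notebook.

```python
import re

# Short sample modelled on the speech above (illustrative only).
sample = "The ones who see things differently. They change things."

# Whitespace splitting keeps punctuation glued to the preceding word.
whitespace_tokens = sample.split()

# Regex alternative: a run of word characters, or a single symbol
# that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With this pattern, `'differently.'` becomes the two tokens `'differently'` and `'.'`, which is closer to what the library tokenizers in the following sections produce.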

## 2. Tokenization with **NLTK**

NLTK stands for Natural Language Toolkit. It is a suite of libraries and programs, written in Python, for statistical natural language processing of English text.

NLTK contains a module called ***tokenize*** with a ***word_tokenize()*** function that splits a text into tokens. Once you have installed NLTK, you can tokenize text with the following code.

In [3]:
!pip install nltk



In [4]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

word_tokenize(text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Here',
 '’',
 's',
 'to',
 'the',
 'crazy',
 'ones',
 ',',
 'the',
 'misfits',
 ',',
 'the',
 'rebels',
 ',',
 'the',
 'troublemakers',
 ',',
 'the',
 'round',
 'pegs',
 'in',
 'the',
 'square',
 'holes',
 '.',
 'The',
 'ones',
 'who',
 'see',
 'things',
 'differently',
 '—',
 'they',
 '’',
 're',
 'not',
 'fond',
 'of',
 'rules',
 '.',
 'You',
 'can',
 'quote',
 'them',
 ',',
 'disagree',
 'with',
 'them',
 ',',
 'glorify',
 'or',
 'vilify',
 'them',
 ',',
 'but',
 'the',
 'only',
 'thing',
 'you',
 'can',
 '’',
 't',
 'do',
 'is',
 'ignore',
 'them',
 'because',
 'they',
 'change',
 'things',
 '.',
 'They',
 'push',
 'the',
 'human',
 'race',
 'forward',
 ',',
 'and',
 'while',
 'some',
 'may',
 'see',
 'them',
 'as',
 'the',
 'crazy',
 'ones',
 ',',
 'we',
 'see',
 'genius',
 ',',
 'because',
 'the',
 'ones',
 'who',
 'are',
 'crazy',
 'enough',
 'to',
 'think',
 'that',
 'they',
 'can',
 'change',
 'the',
 'world',
 ',',
 'are',
 'the',
 'ones',
 'who',
 'do',
 '.']

In [5]:
from nltk.tokenize import sent_tokenize
sent_tokenize(text)

['Here’s to the crazy ones, the misfits, the rebels, the troublemakers, \nthe round pegs in the square holes.',
 'The ones who see things differently — they’re not fond of \nrules.',
 'You can quote them, disagree with them, glorify\nor vilify them, but the only thing you can’t do is ignore them because they\nchange things.',
 'They push the human race forward, and while some may see them\nas the crazy ones, we see genius, because the ones who are crazy enough to think\nthat they can change the world, are the ones who do.']
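The sentences returned above still contain the `\n` line breaks from the triple-quoted string. A small sketch of one way to normalise them, using only built-in string methods, is shown here. The `sentences` list is a shortened, hand-typed stand-in for the real `sent_tokenize` output.

```python
# Hand-typed stand-in for the list returned by sent_tokenize (illustrative only).
sentences = [
    "They are not fond of \nrules.",
    "You can quote them, disagree with them, glorify\nor vilify them.",
]

# str.split() with no argument splits on any run of whitespace (spaces, tabs,
# newlines), so re-joining with a single space collapses the line breaks.
normalized = [" ".join(sentence.split()) for sentence in sentences]

for sentence in normalized:
    print(sentence)
```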

## 3. Tokenize text in different languages with **spaCy**

When you need to tokenize text written in a language other than English, you can use spaCy. This is a library for advanced natural language processing, written in Python and Cython, that supports tokenization for more than 65 languages.

Let’s tokenize the same Steve Jobs text, now translated into Spanish.

Note that spaCy treats each punctuation mark as a separate token.

In [6]:

text_spanish = """Por los locos. Los marginados. Los rebeldes. Los problematicos.
Los inadaptados. Los que ven las cosas de una manera distinta. A los que no les gustan
las reglas. Y a los que no respetan el “status quo”. Puedes citarlos, discrepar de ellos,
ensalzarlos o vilipendiarlos. Pero lo que no puedes hacer es ignorarlos… Porque ellos
cambian las cosas, empujan hacia adelante la raza humana y, aunque algunos puedan
considerarlos locos, nosotros vemos en ellos a genios. Porque las personas que están
lo bastante locas como para creer que pueden cambiar el mundo, son las que lo logran."""

In [7]:
from spacy.lang.es import Spanish
spac = Spanish()

doc = spac(text_spanish)
tokens = [token.text for token in doc]
print(tokens)

['Por', 'los', 'locos', '.', 'Los', 'marginados', '.', 'Los', 'rebeldes', '.', 'Los', 'problematicos', '.', '\n', 'Los', 'inadaptados', '.', 'Los', 'que', 'ven', 'las', 'cosas', 'de', 'una', 'manera', 'distinta', '.', 'A', 'los', 'que', 'no', 'les', 'gustan', '\n', 'las', 'reglas', '.', 'Y', 'a', 'los', 'que', 'no', 'respetan', 'el', '“', 'status', 'quo', '”', '.', 'Puedes', 'citarlos', ',', 'discrepar', 'de', 'ellos', ',', '\n', 'ensalzarlos', 'o', 'vilipendiarlos', '.', 'Pero', 'lo', 'que', 'no', 'puedes', 'hacer', 'es', 'ignorarlos', '…', 'Porque', 'ellos', '\n', 'cambian', 'las', 'cosas', ',', 'empujan', 'hacia', 'adelante', 'la', 'raza', 'humana', 'y', ',', 'aunque', 'algunos', 'puedan', '\n', 'considerarlos', 'locos', ',', 'nosotros', 'vemos', 'en', 'ellos', 'a', 'genios', '.', 'Porque', 'las', 'personas', 'que', 'están', '\n', 'lo', 'bastante', 'locas', 'como', 'para', 'creer', 'que', 'pueden', 'cambiar', 'el', 'mundo', ',', 'son', 'las', 'que', 'lo', 'logran', '.']
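The token list above also contains `'\n'` entries and punctuation marks. If only words are wanted, they can be filtered out afterwards. The sketch below works on plain strings so that it runs without spaCy. The `tokens` list is a shortened, hand-typed sample rather than the real output.

```python
# Shortened, hand-typed sample of token strings (illustrative only).
tokens = ["Por", "los", "locos", ".", "\n", "Los", "inadaptados", ",", "\n", "Los", "rebeldes", "."]

# Drop tokens that consist only of whitespace, such as "\n".
no_space = [t for t in tokens if not t.isspace()]

# Keep only tokens that contain at least one letter or digit.
words_only = [t for t in no_space if any(ch.isalnum() for ch in t)]

print(no_space)
print(words_only)

# When working with a spaCy Doc directly, the token attributes `is_space`
# and `is_punct` allow the same kind of filtering per token.
```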


## 4. Tokenization with **Gensim**

Gensim (short for "Generate Similar") is a library for unsupervised topic modeling and natural language processing that also contains a tokenizer. Once you have installed Gensim, tokenizing text is as simple as the following code.

Gensim is quite strict with punctuation: it splits wherever a punctuation mark is encountered and drops the mark itself.

In [10]:
from gensim.utils import tokenize
#word tokenization
print(list(tokenize(text)))

['Here', 's', 'to', 'the', 'crazy', 'ones', 'the', 'misfits', 'the', 'rebels', 'the', 'troublemakers', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', 'The', 'ones', 'who', 'see', 'things', 'differently', 'they', 're', 'not', 'fond', 'of', 'rules', 'You', 'can', 'quote', 'them', 'disagree', 'with', 'them', 'glorify', 'or', 'vilify', 'them', 'but', 'the', 'only', 'thing', 'you', 'can', 't', 'do', 'is', 'ignore', 'them', 'because', 'they', 'change', 'things', 'They', 'push', 'the', 'human', 'race', 'forward', 'and', 'while', 'some', 'may', 'see', 'them', 'as', 'the', 'crazy', 'ones', 'we', 'see', 'genius', 'because', 'the', 'ones', 'who', 'are', 'crazy', 'enough', 'to', 'think', 'that', 'they', 'can', 'change', 'the', 'world', 'are', 'the', 'ones', 'who', 'do']
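The behaviour seen above (split at every punctuation mark, keep only the letter runs) can be approximated with a regular expression. This is a sketch of the idea using the standard-library `re` module, not Gensim's actual implementation. The pattern and the `sample` string are assumptions made for illustration.

```python
import re

# Short sample with an ASCII apostrophe and commas (illustrative only).
sample = "Here's to the crazy ones, the misfits, the rebels."

# Keep runs of word characters that are not digits; everything else
# (punctuation, whitespace, digits) acts as a separator and is dropped.
alpha_tokens = re.findall(r"[^\W\d]+", sample)

print(alpha_tokens)
```

As in the Gensim output, the contraction `Here's` ends up as the two tokens `'Here'` and `'s'`.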


# **Stop Words**

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. Stop-word lists are commonly used in text mining and NLP to eliminate words that are so common that they carry very little useful information.


Why do we remove stop words?

Removing them strips low-information words from the text so that more focus falls on the important ones. Stop words rarely contribute significant information to a model.


Do we always remove stop words?

Not always. It depends heavily on the use case. For example, tasks such as text classification generally do not need stop words, because the other words in the dataset are more important and convey the general idea of the text. So we generally remove stop words in such tasks.

However, in tasks such as sentiment analysis, you might want to keep the stop words. Consider a model trained for sentiment analysis and the following review.

Movie review: “The movie was not good at all.”

Text after removal of stop words: “movie good”

The original review is clearly negative. After the stop words are removed, however, it reads as positive, which is not what the reviewer said. Removing stop words can therefore be problematic here.
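The movie-review example can be reproduced in a few lines. This is a minimal sketch that uses a tiny hand-picked stop-word set (an assumption for illustration, not any library's list) to show how dropping "not" flips the apparent sentiment and how keeping it preserves the negation.

```python
# Hypothetical, hand-picked stop-word set used only for this illustration.
stop_words = {"the", "was", "not", "at", "all"}

review = "The movie was not good at all."

# Lower-case each word and strip surrounding punctuation before the lookup.
tokens = [word.strip(".,!?").lower() for word in review.split()]

# Removing every stop word loses the negation.
filtered = [word for word in tokens if word not in stop_words]

# Taking "not" out of the stop-word set keeps the negation in the text.
kept_negation = [word for word in tokens if word not in stop_words - {"not"}]

print(" ".join(filtered))       # movie good
print(" ".join(kept_negation))  # movie not good
```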

## Removing Stop words with Natural Language Toolkit (NLTK)

In [1]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [20]:
sw_nltk = stopwords.words('english')
print(sw_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]


In [21]:
print(len(sw_nltk))

179


In [4]:
text2 = "Here's to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes."

In [7]:
words = [word for word in text2.split() if word.lower() not in sw_nltk]
new_text2 = " ".join(words)
print(new_text2)
print("Old length: ", len(text2))
print("New length: ", len(new_text2))

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length:  105
New length:  75


## Removing Stop words with spaCy


In [9]:
import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words
print(sw_spacy)

{'thereupon', 'toward', 'amount', 'take', 'their', 'really', 'beside', 'on', "'m", 'us', 'somewhere', 'may', 'forty', 'under', 'along', 'by', 'towards', 'be', 'hereupon', 'ca', 'during', 'herself', 'upon', 'nobody', 'himself', 'without', 'get', 'whereafter', 'into', 'however', 'itself', 'i', 'it', 'thus', 'now', 'beyond', 'out', 'whom', 'we', 'down', 'alone', 'show', 'from', 'most', 'every', 'of', 'once', 'side', 'he', 'do', 'have', 'either', 'quite', 'as', 'front', 'cannot', 'yourselves', 'somehow', 'thence', "'d", 'amongst', '‘ve', 'off', 'became', 'anything', 'seem', 'hundred', 'nine', 'you', 'hereby', 'much', 'our', 'fifty', 'say', 'using', 'other', 'further', 'more', 'unless', 'all', 'hereafter', 'becomes', 'yours', 'never', 'fifteen', 'someone', 'whoever', 'at', 'can', 'formerly', 'made', 'third', 'several', 'being', 'already', 'are', 'meanwhile', 'throughout', 'each', 'what', 'no', 'regarding', 'least', 'a', 'then', 'own', 'whither', 'across', 'around', 'via', 'always', 'only', 'any', 'sixty', 'whereby', 'even', 'twenty', 'who', 'become', 'else', 'she', 'must', 'top', 'where', 'used', 'since', 'among', 'themselves', 'seemed', 'wherein', 'because', 'eleven', 'just', 'others', 'various', 'doing', 'seems', 'or', 'done', 'six', 'ours', 'therein', 'whereupon', "'ll", 'thereafter', 'therefore', 'twelve', 'otherwise', 'has', 'anyway', 'two', 'part', 'nothing', 'besides', 'within', 'ten', 'whence', 'before', 'behind', 'hence', 'again', 'between', "'re", '’re', 'perhaps', 'some', 'so', '‘m', 'until', 'would', 'yourself', 'also', 'anyhow', 'they', 'very', 'should', 'through', 'although', 'empty', 'last', 'one', 're', 'neither', 'many', 'moreover', 'serious', 'were', 'his', 'seeming', 'mostly', 'up', '’d', '‘re', 'nevertheless', 'anyone', "'ve", 'three', 'still', 'wherever', 'had', 'whereas', 'ever', 'below', '‘s', 'but', 'him', '‘ll', 'after', 'an', 'sometime', 'in', 'does', '’s', '’ve', 'why', 'namely', 'whether', 'well', 'is', 'please', 'whose', 'those', 'her', 
'these', 'for', 'nowhere', 'anywhere', 'my', 'over', 'though', 'will', 'few', 'been', 'mine', 'call', 'see', 'name', '’m', 'four', 'five', 'afterwards', 'everywhere', 'to', 'about', 'none', 'this', "'s", 'not', 'against', 'enough', 'here', 'myself', 'your', 'latterly', 'them', 'elsewhere', 'per', '‘d', 'and', 'thru', 'give', 'almost', 'together', 'was', 'with', 'same', 'if', 'go', 'eight', 'that', 'both', 'indeed', 'noone', 'yet', 'herein', 'when', 'ourselves', '’ll', 'whenever', 'how', 'its', 'next', 'whatever', 'former', 'above', "n't", 'becoming', 'there', 'thereby', 'did', 'the', 'less', 'another', 'might', 'n‘t', 'me', 'first', 'often', 'keep', 'back', 'nor', 'too', 'am', 'except', 'bottom', 'whole', 'n’t', 'everyone', 'everything', 'could', 'while', 'full', 'rather', 'latter', 'make', 'than', 'onto', 'something', 'sometimes', 'move', 'hers', 'beforehand', 'due', 'such', 'put', 'which'}


In [10]:
print(len(sw_spacy))

326


In [12]:
words = [word for word in text2.split() if word.lower() not in sw_spacy]
new_text2 = " ".join(words)
print(new_text2)
print(f"Old length {len(text2)}")
print(f"New length {len(new_text2)}")

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length 105
New length 75


## Removing Stop words with Gensim


In [14]:
import gensim
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS
print(STOPWORDS)

frozenset({'thereupon', 'toward', 'amount', 'take', 'their', 'really', 'beside', 'on', 'us', 'etc', 'somewhere', 'may', 'forty', 'under', 'along', 'by', 'towards', 'be', 'hereupon', 'during', 'herself', 'upon', 'nobody', 'himself', 'without', 'get', 'whereafter', 'into', 'however', 'itself', 'i', 'it', 'thus', 'now', 'beyond', 'out', 'whom', 'sincere', 'we', 'down', 'alone', 'show', 'from', 'most', 'every', 'of', 'once', 'side', 'he', 'do', 'have', 'either', 'quite', 'as', 'mill', 'front', 'cannot', 'yourselves', 'somehow', 'thence', 'thick', 'amongst', 'off', 'became', 'anything', 'seem', 'bill', 'hundred', 'nine', 'you', 'kg', 'hereby', 'much', 'our', 'fifty', 'say', 'using', 'other', 'further', 'more', 'unless', 'all', 'hereafter', 'becomes', 'yours', 'never', 'fifteen', 'found', 'someone', 'whoever', 'at', 'can', 'formerly', 'made', 'third', 'several', 'being', 'already', 'are', 'meanwhile', 'throughout', 'each', 'what', 'no', 'regarding', 'least', 'a', 'then', 'detail', 'own', 'whither', 'across', 'around', 'via', 'always', 'only', 'sixty', 'any', 'whereby', 'even', 'twenty', 'ie', 'who', 'become', 'else', 'system', 'she', 'must', 'top', 'amoungst', 'cry', 'where', 'used', 'since', 'among', 'themselves', 'seemed', 'wherein', 'computer', 'because', 'various', 'just', 'others', 'eleven', 'doing', 'seems', 'or', 'done', 'six', 'ours', 'therein', 'whereupon', 'thereafter', 'therefore', 'twelve', 'otherwise', 'has', 'anyway', 'don', 'two', 'part', 'nothing', 'besides', 'within', 'ten', 'whence', 'before', 'behind', 'hence', 'again', 'between', 'perhaps', 'some', 'so', 'until', 'would', 'yourself', 'km', 'anyhow', 'also', 'they', 'very', 'should', 'through', 'although', 'eg', 'empty', 'thin', 'find', 'hasnt', 'last', 'one', 're', 'cant', 'describe', 'neither', 'couldnt', 'many', 'moreover', 'serious', 'were', 'his', 'seeming', 'mostly', 'up', 'nevertheless', 'anyone', 'ltd', 'three', 'con', 'still', 'wherever', 'had', 'whereas', 'ever', 'below', 'but', 'him', 
'after', 'an', 'sometime', 'in', 'does', 'un', 'why', 'namely', 'whether', 'well', 'whose', 'please', 'is', 'those', 'her', 'these', 'for', 'nowhere', 'inc', 'anywhere', 'my', 'over', 'will', 'though', 'fire', 'few', 'been', 'mine', 'call', 'see', 'name', 'four', 'five', 'afterwards', 'everywhere', 'to', 'co', 'about', 'none', 'this', 'not', 'against', 'enough', 'here', 'myself', 'your', 'latterly', 'them', 'elsewhere', 'per', 'and', 'thru', 'give', 'almost', 'together', 'was', 'with', 'same', 'if', 'go', 'eight', 'that', 'both', 'indeed', 'noone', 'yet', 'herein', 'de', 'when', 'ourselves', 'whenever', 'didn', 'how', 'fill', 'its', 'next', 'whatever', 'former', 'above', 'becoming', 'there', 'thereby', 'did', 'the', 'less', 'another', 'might', 'interest', 'me', 'first', 'often', 'keep', 'back', 'nor', 'too', 'doesn', 'except', 'am', 'bottom', 'whole', 'everything', 'everyone', 'could', 'while', 'full', 'rather', 'latter', 'make', 'than', 'onto', 'something', 'sometimes', 'move', 'hers', 'due', 'beforehand', 'such', 'put', 'which'})


In [15]:
print(len(STOPWORDS))

337


In [16]:
new_text = remove_stopwords(text2)
print(new_text)
print(f"Old length {len(text2)}")
print(f"New Length {len(new_text)}")

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length 105
New Length 75


## Removing Stop words with Scikit-Learn

In [17]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

frozenset({'thereupon', 'toward', 'amount', 'take', 'their', 'beside', 'on', 'us', 'etc', 'somewhere', 'may', 'forty', 'under', 'along', 'by', 'towards', 'be', 'hereupon', 'during', 'herself', 'upon', 'nobody', 'himself', 'without', 'get', 'whereafter', 'into', 'however', 'itself', 'i', 'it', 'thus', 'now', 'beyond', 'out', 'whom', 'sincere', 'we', 'down', 'alone', 'show', 'from', 'most', 'every', 'of', 'once', 'side', 'he', 'do', 'have', 'either', 'as', 'mill', 'front', 'cannot', 'yourselves', 'somehow', 'thence', 'thick', 'amongst', 'off', 'became', 'anything', 'seem', 'bill', 'hundred', 'nine', 'you', 'hereby', 'much', 'our', 'fifty', 'other', 'further', 'more', 'all', 'hereafter', 'becomes', 'yours', 'never', 'fifteen', 'found', 'someone', 'whoever', 'at', 'can', 'formerly', 'made', 'third', 'several', 'being', 'already', 'are', 'meanwhile', 'throughout', 'each', 'what', 'no', 'least', 'a', 'then', 'detail', 'own', 'whither', 'across', 'around', 'via', 'always', 'only', 'any', 'sixty', 'whereby', 'even', 'twenty', 'ie', 'who', 'become', 'else', 'system', 'she', 'must', 'top', 'amoungst', 'cry', 'where', 'since', 'among', 'themselves', 'seemed', 'wherein', 'because', 'eleven', 'others', 'seems', 'or', 'done', 'six', 'ours', 'therein', 'whereupon', 'thereafter', 'therefore', 'twelve', 'otherwise', 'has', 'anyway', 'two', 'part', 'nothing', 'besides', 'within', 'ten', 'whence', 'before', 'behind', 'hence', 'again', 'between', 'perhaps', 'some', 'so', 'until', 'would', 'yourself', 'also', 'anyhow', 'they', 'very', 'should', 'through', 'although', 'eg', 'empty', 'thin', 'find', 'hasnt', 'last', 'cant', 'one', 're', 'describe', 'neither', 'couldnt', 'many', 'moreover', 'serious', 'were', 'his', 'seeming', 'mostly', 'up', 'nevertheless', 'anyone', 'ltd', 'three', 'con', 'still', 'wherever', 'had', 'whereas', 'ever', 'below', 'but', 'him', 'after', 'an', 'sometime', 'in', 'un', 'why', 'namely', 'whether', 'well', 'is', 'please', 'whose', 'those', 'her', 'these', 'for', 
'nowhere', 'inc', 'anywhere', 'my', 'over', 'though', 'will', 'fire', 'few', 'been', 'mine', 'call', 'see', 'name', 'four', 'five', 'afterwards', 'everywhere', 'to', 'co', 'about', 'none', 'this', 'not', 'against', 'enough', 'here', 'myself', 'your', 'latterly', 'them', 'elsewhere', 'per', 'and', 'thru', 'give', 'almost', 'together', 'was', 'with', 'same', 'if', 'go', 'eight', 'that', 'both', 'indeed', 'noone', 'yet', 'herein', 'de', 'when', 'ourselves', 'whenever', 'how', 'fill', 'its', 'next', 'whatever', 'former', 'above', 'becoming', 'there', 'thereby', 'the', 'less', 'another', 'might', 'interest', 'me', 'first', 'often', 'keep', 'back', 'nor', 'too', 'am', 'except', 'bottom', 'whole', 'everyone', 'everything', 'could', 'while', 'full', 'rather', 'latter', 'than', 'onto', 'something', 'sometimes', 'move', 'hers', 'beforehand', 'due', 'such', 'put', 'which'})


In [18]:
print(len(ENGLISH_STOP_WORDS))

318


In [19]:
words = [word for word in text2.split() if word.lower() not in ENGLISH_STOP_WORDS]
new_text2 = " ".join(words)
print(new_text2)
print(f"Old length {len(text2)}")
print(f"New length {len(new_text2)}")

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length 105
New length 75


## Adding custom Stop Words

You can also add custom stop words to the lists these libraries provide, to suit your own purpose.

Here is the code to add some custom stop words to NLTK’s stop words list:

In [22]:
sw_nltk.extend(['first', 'second','third','me'])
print(len(sw_nltk))

183


## Removing Stop Words

You can also remove stop words from the list available in these libraries.

Here is the code using the NLTK library:

In [23]:
sw_nltk.remove('not')
print(len(sw_nltk))

182
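One detail in the counts above is worth noting: `'me'` is already in NLTK's list, so `extend` adds it a second time (179 + 4 = 183). Because the stop words are held in a plain Python list, duplicates are allowed and each membership test scans the list. The sketch below compares that with a set. The `base_stop_words` list is a small hypothetical stand-in for a library list.

```python
# Small hypothetical stand-in for a library-provided stop-word list.
base_stop_words = ["i", "me", "my", "the", "not"]

# List: extend() appends every item, so "me" ends up in the list twice.
as_list = list(base_stop_words)
as_list.extend(["first", "second", "third", "me"])

# Set: update() ignores items that are already present,
# and discard() removes an item without raising if it is missing.
as_set = set(base_stop_words)
as_set.update(["first", "second", "third", "me"])
as_set.discard("not")

print(len(as_list), as_list.count("me"))  # 9 2
print(len(as_set), "not" in as_set)       # 7 False
```

A set also makes the `word.lower() not in ...` check used in the filtering cells a constant-time lookup on average, which may matter for larger texts.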


## Create Custom Stop Words

In [24]:
text2 = "Here's to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes."

In [27]:
#create your custom stop words list

my_stop_words = ['to','the','in']
words = [word for word in text2.split() if word.lower() not in my_stop_words]
new_text = " ".join(words)
print(new_text)
print(f"Old length {len(text2)}")
print(f"New length {len(new_text)}")

Here's crazy ones, misfits, rebels, troublemakers, round pegs square holes.
Old length 105
New length 75
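The filtering cells above compare whole whitespace-separated tokens against the stop-word list, so a stop word with punctuation attached (for example `them,`) would not match. The helper below is a hedged sketch of one way to handle that. `remove_stop_words`, the sample sentence, and the stop-word list are all hypothetical names and data introduced for illustration.

```python
import string


def remove_stop_words(text, stop_words):
    """Drop tokens whose punctuation-stripped, lower-cased form is a stop word."""
    stop_set = {word.lower() for word in stop_words}
    kept = [
        token
        for token in text.split()
        # strip() only removes punctuation at the edges, so "can't" stays intact.
        if token.strip(string.punctuation).lower() not in stop_set
    ]
    return " ".join(kept)


sentence = "You can quote them, disagree with them, but you can't ignore them."
my_stop_words = ["you", "can", "them", "with", "but"]

# Naive comparison: "them," and "them." do not equal "them", so they survive.
naive = " ".join(w for w in sentence.split() if w.lower() not in my_stop_words)

# Punctuation-aware comparison removes them as well.
cleaned = remove_stop_words(sentence, my_stop_words)

print(naive)
print(cleaned)
```

Note that dropping a token such as `them,` also drops the comma attached to it. Tokenizing with one of the library tokenizers first avoids that side effect.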
