**Basic text preprocessing steps for converting human-language text into a machine-readable format for further processing**



**Convert text to lowercase**

In [21]:
input_text = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_text = input_text.lower()
print(input_text)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.
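For caseless matching beyond plain English text, Python also provides `str.casefold()`, a more aggressive form of lowercasing. The sketch below is only an illustration; the two German sample words are assumptions chosen for the demo and do not come from the original notebook.

```python
# Sample words are illustrative, not from the original text.
word_a = "Straße"
word_b = "STRASSE"

# lower() keeps the German sharp s, so the two spellings still differ
lower_match = word_a.lower() == word_b.lower()

# casefold() expands the sharp s to "ss", so the two spellings compare equal
casefold_match = word_a.casefold() == word_b.casefold()

print(lower_match, casefold_match)
```

For ASCII-only text, `lower()` and `casefold()` give the same result.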


**Remove numbers**

Remove numbers if they are not relevant to your analysis. Regular expressions are the usual tool for this: the pattern \d+ matches each run of digits.

In [22]:
import re
input_text = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'
result = re.sub(r'\d+', '', input_text)
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
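Note the doubled spaces left in the output where each number used to be. One possible variation, sketched here as an assumption rather than as part of the original notebook, lets the pattern also consume the whitespace in front of each number.

```python
import re

input_text = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'

# \s* also matches any whitespace directly before the digits,
# so no doubled spaces remain after the substitution
result = re.sub(r'\s*\d+', '', input_text)
print(result)
```

This works for numbers in the middle of a sentence; a number at the very start of the text would still leave a leading space behind.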


**Remove punctuation**

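The following code removes the punctuation characters listed in string.punctuation, [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]:

In [23]:
import string
input_text = "This &is [an] example? {of} string. with.? punctuation!!!!"
# Build a translation table that deletes every character in string.punctuation
result = input_text.translate(str.maketrans('', '', string.punctuation))
print(result)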
print(result)

This is an example of string with punctuation
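Two common ways to strip punctuation are a regular expression such as `[^\w\s]` and a translation table built from `string.punctuation`. They are not equivalent: the underscore counts as a word character, so the regular expression keeps it. The sample string below is an assumption made up for this comparison.

```python
import re
import string

sample = "snake_case, isn't it?"  # illustrative input, not from the original

# Regex approach: drops everything that is not a word character or whitespace.
# "_" counts as a word character, so it survives.
regex_result = re.sub(r'[^\w\s]', '', sample)

# Translation-table approach: deletes exactly the characters in
# string.punctuation, which includes "_".
table = str.maketrans('', '', string.punctuation)
translate_result = sample.translate(table)

print(regex_result)
print(translate_result)
```

Which behavior is preferable depends on whether identifiers such as `snake_case` should stay intact in your data.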


**Remove whitespace**

To remove leading and trailing whitespace (spaces, tabs, newlines), you can use the strip() method:



In [24]:
input_text = " \t a string example\t "
input_text = input_text.strip()
input_text

'a string example'
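`strip()` only touches the two ends of a string. As a small sketch (the messier input string is an assumption for illustration), `lstrip()` and `rstrip()` trim one side each, and a `split()`/`join()` round trip also collapses whitespace runs inside the text.

```python
input_text = " \t a   string \t example\t "

# lstrip()/rstrip() remove whitespace from one end only
left_only = input_text.lstrip()
right_only = input_text.rstrip()

# split() with no arguments splits on any whitespace run and drops empty
# strings, so joining with a single space normalizes internal whitespace too
normalized = " ".join(input_text.split())
print(repr(normalized))
```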

**Remove stop words**

“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words carry little meaning on their own and are often removed from texts. One way to remove them is with the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.

In [25]:
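import nltk
# 'all' fetches every NLTK data package, which is a large download;
# the cells below need only a handful of them.
nltk.download('all')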

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    |   Package bcp47 is already up-to-dat

True

In [26]:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "NLTK is a leading platform for building Python programs to work with human language data."

stop_words = set(stopwords.words('english'))

# Split the sentence into word and punctuation tokens
tokens = word_tokenize(text)

# Keep only the tokens that are not in the stop word set
filtered_tokens = [token for token in tokens if token not in stop_words]

print(tokens)
print(filtered_tokens)


['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']
['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
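NLTK's English stop word list is all lowercase, so a capitalized stop word such as a sentence-initial "The" passes a case-sensitive membership test. The sketch below shows the effect with a tiny hand-written stop word set and naive whitespace tokenization; both are assumptions made to keep the example self-contained.

```python
# Tiny hand-written stop word set for illustration only;
# in practice the set would come from a library such as NLTK.
stop_words = {"the", "a", "on", "is", "all"}

text = "The cat is on the mat"
tokens = text.split()  # naive whitespace tokenization, an assumption of this sketch

# Case-sensitive test: "The" is not in the lowercase set, so it is kept
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing each token before the test removes it as well
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only inside the comparison keeps the original casing of the tokens that survive.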


**Stemming**

Stemming reduces words to their stem, base, or root form (for example, books → book, looked → look). The two main algorithms are the Porter stemming algorithm, which removes common morphological and inflectional endings from words, and the Lancaster stemming algorithm, which is more aggressive. A stem need not be a dictionary word: in the example below, "several" becomes "sever".

In [27]:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
input_text = "There are several types of stemming algorithms."
# Tokenize first, then stem each token independently
tokens = word_tokenize(input_text)
for word in tokens:
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
.
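To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter or Lancaster algorithm; the function name `naive_stem`, its three suffixes, and the minimum-length rule are all assumptions invented for illustration.

```python
def naive_stem(word):
    """Strip a few common English suffixes. A toy, not the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["books", "looked", "stemming", "is"]:
    print(w, "->", naive_stem(w))
```

The toy turns "stemming" into "stemm", whereas the Porter output above gives "stem"; real stemmers carry extra rules, such as undoubling consonants, to handle cases like this.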


**Lemmatization**

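Like stemming, lemmatization reduces inflected forms to a common base form. Unlike stemming, it does not simply chop off endings; it uses lexical knowledge bases to look up the correct base form of each word. NLTK's WordNetLemmatizer treats every word as a noun unless a part of speech is passed through its pos argument, which is why the verb forms "been", "had", and "done" come back unchanged in the example below.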

In [28]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
input_text = "been had done languages cities mice"
# Tokenize first, then lemmatize each token independently
tokens = word_tokenize(input_text)
for word in tokens:
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse
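The difference between chopping endings and consulting a lexical resource can be sketched in a few lines. The miniature lexicon and the helper names `lookup_lemma` and `strip_s` below are hypothetical; a real lemmatizer consults a resource such as WordNet instead of a three-entry dictionary.

```python
# Hypothetical miniature lexicon; real lemmatizers consult a resource such as WordNet.
LEMMA_LEXICON = {"mice": "mouse", "cities": "city", "languages": "language"}

def lookup_lemma(word):
    """Return the lexicon entry for a word, or the word itself if unknown."""
    return LEMMA_LEXICON.get(word.lower(), word)

def strip_s(word):
    """Suffix chopping for comparison: remove a trailing 's' only."""
    return word[:-1] if word.endswith("s") else word

for w in ["languages", "cities", "mice"]:
    print(w, strip_s(w), lookup_lemma(w))
```

Chopping handles "languages" but produces "citie" and leaves "mice" alone, while the lookup returns "city" and "mouse".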


**Part-of-speech (POS) tagging**

Part-of-speech tagging assigns a part of speech (noun, verb, adjective, and so on) to each word of a given text, based on the word's definition and its context.

In [29]:
from textblob import TextBlob

input_text = "Parts of speech examples: an article, to write, interesting, easily, and, of"
result = TextBlob(input_text)
# .tags returns a list of (word, tag) pairs
print(result.tags)

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
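The tags in this output follow the Penn Treebank convention (NN for singular nouns, NNS for plural nouns, VB for base-form verbs, and so on). Once text is tagged, the (word, tag) pairs can be filtered or grouped with ordinary Python. The sketch below copies the pairs from the output above, so it does not need TextBlob to run.

```python
from collections import defaultdict

# Tagged tokens copied from the output above
tags = [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'),
        ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'),
        ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tags if tag.startswith('NN')]

# Group every word under its tag
by_tag = defaultdict(list)
for word, tag in tags:
    by_tag[tag].append(word)

print(nouns)
print(dict(by_tag))
```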


**Chunking (shallow parsing)**

Chunking identifies the constituent parts of a sentence (nouns, verbs, adjectives, etc.) and links them into higher-order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.). The cell below performs only the first step, part-of-speech tagging, whose output is the input to a chunker.

In [30]:

from textblob import TextBlob

input_text = "A black television and a white stove were bought for the new apartment of John."
result = TextBlob(input_text)
# POS tags are the input to the chunking step
print(result.tags)

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]
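The grouping step itself can be sketched in plain Python. The example below is an assumption-laden illustration, not the notebook's method: it defines a noun phrase as an optional determiner (DT), any number of adjectives (JJ), and one singular noun (NN), and the helper name `noun_phrases` is made up. Libraries such as NLTK provide grammar-driven chunkers for real use.

```python
# Tagged tokens copied from the output above
tagged = [('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'),
          ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'),
          ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'),
          ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

def noun_phrases(tagged_tokens):
    """Collect chunks matching: optional DT, any number of JJ, then one NN."""
    chunks, i, n = [], 0, len(tagged_tokens)
    while i < n:
        j = i
        if tagged_tokens[j][1] == 'DT':
            j += 1
        while j < n and tagged_tokens[j][1] == 'JJ':
            j += 1
        if j < n and tagged_tokens[j][1] == 'NN':
            # Pattern matched: join the words from i through j into one chunk
            chunks.append(' '.join(w for w, _ in tagged_tokens[i:j + 1]))
            i = j + 1
        else:
            # No match starting at i: move on by one token
            i += 1
    return chunks

print(noun_phrases(tagged))
```

Because the pattern requires the tag NN exactly, the proper noun "John" (NNP) is not chunked; widening the pattern is a design choice that depends on the task.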


**Named entity recognition**

Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).

In [31]:
from nltk import word_tokenize, pos_tag, ne_chunk
input_text = "Bill works for Apple so he went to Boston for a conference!."
print(ne_chunk(pos_tag(word_tokenize(input_text))))

(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  Apple/NNP
  so/IN
  he/PRP
  went/VBD
  to/TO
  (GPE Boston/NNP)
  for/IN
  a/DT
  conference/NN
  !/.
  ./.)
