# Text Cleaning and Pre-processing


In Natural Language Processing (NLP), most text and documents contain many words that are redundant for text classification, such as stop words, misspellings, and slang. This section covers techniques for text cleaning and pre-processing that remove such noise and unnecessary features, which can otherwise hurt overall model performance.


## Tokenization 

Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or any other meaningful elements called tokens.



In [None]:
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

text = "Computers don't speak English. So, we've to learn C, C++, Java, Python and the like! Yay!"

In [76]:
words = word_tokenize(text)  # uses the function imported above
print(words)

['Computers', 'do', "n't", 'speak', 'English', '.', 'So', ',', 'we', "'ve", 'to', 'learn', 'C', ',', 'C++', ',', 'Java', ',', 'Python', 'and', 'the', 'like', '!', 'Yay', '!']
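
For comparison, here is a minimal regex-only sketch of tokenization using just the standard library. It is not the algorithm NLTK uses; `simple_tokenize` is a hypothetical helper introduced for illustration. It shows why a dedicated tokenizer is useful: the naive pattern splits "don't" at the apostrophe instead of producing 'do' and "n't".

```python
import re

def simple_tokenize(text):
    # \w+ matches runs of letters, digits, and underscores;
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace (punctuation and symbols).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Computers don't speak English.")
print(tokens)
# ['Computers', 'don', "'", 't', 'speak', 'English', '.']
```

A pattern like this is fine for quick experiments, but it has no knowledge of contractions or of tokens such as "C++", which it would also break apart.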




## Noise cleaning

Remove extra whitespace, special characters, and punctuation.


In [29]:
import string

In [77]:
punctuations = list(string.punctuation)
print(punctuations)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [78]:
# keep only the tokens that are not a single punctuation character
words = [word for word in words if word not in punctuations]
print(len(words), "words without punctuation:", words)

18 words without punctuation: ['Computers', 'do', "n't", 'speak', 'English', 'So', 'we', "'ve", 'to', 'learn', 'C', 'C++', 'Java', 'Python', 'and', 'the', 'like', 'Yay']
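
The cell above filters punctuation tokens after tokenization. The whitespace part of noise cleaning can be sketched on the raw string instead, using only the standard library. `clean_noise` is a hypothetical helper, and the choice to delete every character in `string.punctuation` is an assumption that will not suit every task.

```python
import re
import string

def clean_noise(text):
    # Delete every ASCII punctuation character listed in string.punctuation
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space
    # and trim the ends
    return re.sub(r"\s+", " ", no_punct).strip()

cleaned = clean_noise("  Hey!!   that's a   great\tdeal...  ")
print(cleaned)
# Hey thats a great deal
```

Unlike the token-level filter, this character-level approach also strips punctuation inside tokens, so "C++" would become "C". Filtering after tokenization avoids that.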



## Spell checking

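An optional pre-processing step is correcting misspelled words. Automatic correction is not always reliable: in the output below, 'caaaar' is changed to 'aaaaaa' rather than 'car'.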


In [48]:
!pip install autocorrect
from autocorrect import Speller
check = Speller(lang='en')





In [79]:
print(check('caaaar'))
print(check('mussage'))
print(check('survice'))
print(check('hte'))

aaaaaa
message
service
the


## Contractions mapping
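Expanding contractions (for example, "wasn't" to "was not") standardizes the text so that each word appears in a single form.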



In [None]:
!pip install contractions
import contractions

In [80]:
corpus = ["The brown fox wasn't that quick and he couldn't win the race",
          "Hey that's a great deal! I just bought a phone for $199",
          "@@You'll (learn) a **lot** in the book. Python is an amazing language!@@"]

expand_contraction = [contractions.fix(c) for c in corpus]
print(expand_contraction)


['The brown fox was not that quick and he could not win the race', 'Hey that is a great deal! I just bought a phone for $199', '@@you will (learn) a **lot** in the book. Python is an amazing language!@@']
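
The idea behind a contractions mapping can be sketched with a plain dictionary and one regular expression. This is not how the `contractions` package is implemented; `CONTRACTION_MAP` and `expand_contractions` are hypothetical names, and the mapping is deliberately tiny.

```python
import re

# Hypothetical, deliberately small mapping; a real one covers many more forms
CONTRACTION_MAP = {
    "wasn't": "was not",
    "couldn't": "could not",
    "that's": "that is",
    "you'll": "you will",
}

# One alternation over all keys, matched case-insensitively on word boundaries
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Look up the lowercased match; the original capitalization is not preserved
    return _pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

expanded = expand_contractions(
    "The brown fox wasn't that quick and he couldn't win the race"
)
print(expanded)
# The brown fox was not that quick and he could not win the race
```

A fixed mapping cannot resolve ambiguous forms: "that's" may stand for "that is" or "that has", and the sketch always picks the first.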


## Stemming/lemmatization

Stemming and lemmatization reduce the inflected forms of a word to a common base form.

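**Stemming:** the simpler approach. It strips suffixes with heuristic rules, so the result is not always a real word (for example, "programm" in the output below).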



In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()


In [81]:
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]
 
for w in words:
    print(w, " : ", ps.stem(w))

program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm


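**Lemmatization**: returns the dictionary form (lemma) of a word, which is always a real word. Note that `lemmatize` treats each word as a noun unless a `pos` argument is passed, which is why the verb forms below come back unchanged.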



In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
  
# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()


In [82]:
# single word lemmatization examples
list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling', 
         'driving', 'died', 'tried', 'feet']
for word in list1:
    print(word + " ---> " + wnl.lemmatize(word))

kites ---> kite
babies ---> baby
dogs ---> dog
flying ---> flying
smiling ---> smiling
driving ---> driving
died ---> died
tried ---> tried
feet ---> foot


## Stop words identification 
In natural language processing, stop words are very common words (such as “the”, “a”, “an”, “in”) that carry little information for tasks like text classification and are therefore often removed.


In [None]:
import nltk
nltk.download('stopwords')

In [83]:
stop_word_list = nltk.corpus.stopwords.words('english')
print(stop_word_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [84]:
def remove_stop_words(tokens):
    # The comparison is case-sensitive: the stop word list is lowercase,
    # so capitalized tokens such as 'The' and 'I' are kept
    return [t for t in tokens if t not in stop_word_list]


def tokenize_text(text):
    return nltk.word_tokenize(text)

corpus = ['The brown fox was not that quick and he could not win the race', 
'Hey that is a great deal! I just bought a phone for 199',
 'you will (learn) a lot  in the book. Python is an amazing language']


# Store the result under a new name so the tokenize_text function is not overwritten
tokenized_corpus = [tokenize_text(c) for c in corpus]
filtered_text = [remove_stop_words(t) for t in tokenized_corpus]

print(filtered_text)

[['The', 'brown', 'fox', 'quick', 'could', 'win', 'race'], ['Hey', 'great', 'deal', '!', 'I', 'bought', 'phone', '199'], ['(', 'learn', ')', 'lot', 'book', '.', 'Python', 'amazing', 'language']]
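
In the output above, 'The' and 'I' survive because the stop word list is lowercase and the membership test is case-sensitive. One possible fix is to compare the lowercased token, as in the sketch below. `STOP_WORDS` is a hypothetical, tiny set used here so the example runs without NLTK; in the notebook it could be built with `set(stop_word_list)`.

```python
# Hypothetical, tiny stop-word set for illustration only
STOP_WORDS = {"the", "a", "i", "for", "that", "is", "just"}

def remove_stop_words_ci(tokens, stop_words=STOP_WORDS):
    # Compare the lowercased token, but keep the original spelling
    # of the tokens that are retained
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["Hey", "that", "is", "a", "great", "deal", "!", "I",
          "just", "bought", "a", "phone", "for", "199"]
filtered = remove_stop_words_ci(tokens)
print(filtered)
# ['Hey', 'great', 'deal', '!', 'bought', 'phone', '199']
```

Using a set rather than a list also makes each membership test constant time on average, which matters for large corpora.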


## Capitalization

Sentences can contain a mixture of uppercase and lowercase letters. To reduce the problem space, the most common approach is to convert everything to lowercase.

In [85]:
text = "The United States of America (USA) or America, is a federal republic composed of 50 states"
print(text)
print(text.lower())

The United States of America (USA) or America, is a federal republic composed of 50 states
the united states of america (usa) or america, is a federal republic composed of 50 states
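
For text that is not purely English, `str.lower()` may not be enough to make equivalent spellings compare equal. As a small sketch (the word list is an assumption chosen for illustration), Python's built-in `str.casefold()` applies a more aggressive normalization intended for caseless matching:

```python
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps the German sharp s, so two distinct forms remain
lowered = {w.lower() for w in words}
# casefold() maps the sharp s to "ss", so all three forms collapse into one
folded = {w.casefold() for w in words}

print(len(lowered), len(folded))
# 2 1
```

For the English example above, `lower()` and `casefold()` give the same result.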
