
## Feature Engineering on Text Data
#### There are many ways to clean and pre-process text data. The sections below cover some of the most important ones, which are used heavily in Natural Language Processing (NLP) pipelines.

Reference: https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41


### **Removing tags:** Our text often contains unnecessary content like HTML tags, which do not add much value when analyzing text. The BeautifulSoup library does an excellent job in providing necessary functions for this.

In [80]:
def remove_html_tags (text):
  from bs4 import BeautifulSoup
  soup = BeautifulSoup(text, "html.parser")
  html_stripped_text = soup.get_text()
  print("1. html tags removed")
  return html_stripped_text

remove_html_tags(document)

1. html tags removed


"Héllo! Héllo! can you hear me! I just heard about Python!\r\n \n              It's an amazing language which can be used for Scripting, Web development,\r\n\r\n\n              Information Retrieval, Natural Language Processing, Machine Learning & Artificial Intelligence!\n\n              What are you waiting for? Go and get started. He's learning, she's learning, they've already\n\n\n              got a headstart!\n"

### **Removing accented characters**: In any text corpus, especially one in English, you will often encounter accented characters/letters. We therefore need to make sure that these characters are converted and standardized into ASCII characters. A simple example is converting é to e.

In [101]:
def remove_accented_chars(text):
  import unicodedata
  text = unicodedata.normalize ('NFKD', text).encode('ascii','ignore').decode('utf-8','ignore') # https://docs.python.org/3/howto/unicode.html
  print("2. accented chars removed")
  return text

remove_accented_chars(document)


2. accented chars removed


"<p>Hello! Hello! can you hear me! I just heard about <b>Python</b>!<br/>\r\n \n              It's an amazing language which can be used for Scripting, Web development,\r\n\r\n\n              Information Retrieval, Natural Language Processing, Machine Learning & Artificial Intelligence!\n\n              What are you waiting for? Go and get started.<br/> He's learning, she's learning, they've already\n\n\n              got a headstart!</p>\n           "

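To see why NFKD normalization makes this work, here is a minimal sketch using only the standard library. The helper name `strip_accents` is an assumption for illustration, not part of the notebook. It drops the combining marks directly instead of encoding to ASCII, so unlike the function above it keeps non-ASCII characters that have no decomposition (for example ß).

```python
import unicodedata

def strip_accents(text):
    # NFKD splits each accented character into a base character
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping the combining marks leaves only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# 'é' decomposes into a plain 'e' plus a combining acute accent.
decomposed = unicodedata.normalize("NFKD", "é")
print([unicodedata.name(ch) for ch in decomposed])

print(strip_accents("Héllo, café, naïve"))
```

The first print lists the two code points that é decomposes into; the second shows the text with the accents stripped.
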
### **Expanding contractions**: In the English language, contractions are shortened versions of words or syllables, created by removing specific letters and sounds. Examples are shortening do not to don’t and I would to I’d. Converting each contraction to its expanded, original form often helps with text standardization.

In [None]:
import re
from contractions import CONTRACTION_MAP # I think this contractions module is a separate Python script on GitHub

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    print("3. expanding contractions")
    return expanded_text
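
Because `CONTRACTION_MAP` lives in an external script, the cell above cannot run on its own. The following is a self-contained sketch of the same idea, assuming a tiny hand-written mapping (`SMALL_CONTRACTION_MAP` and `expand_contractions_sketch` are hypothetical names, and the real map is much larger). It also escapes the keys and adds word boundaries, which the original does not do.

```python
import re

# Hypothetical, deliberately tiny mapping; the real CONTRACTION_MAP is much larger.
SMALL_CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "i'd": "i would",
    "they've": "they have",
}

def expand_contractions_sketch(text, mapping=SMALL_CONTRACTION_MAP):
    # Longest keys first so a longer contraction wins over a shorter one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )

    def expand(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # Preserve the capitalisation of the first character.
        return found[0] + expanded[1:]

    return pattern.sub(expand, text)

print(expand_contractions_sketch("It's late and I'd say they've left, don't wait."))
```

Looking up the lowercased match means mixed-case input such as "It's" still resolves, while the first character of the original text is kept as written.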



### **Removing special characters**: Special characters and symbols, which are usually non-alphanumeric characters, often add extra noise to unstructured text. More often than not, simple regular expressions (regexes) can be used to remove them.

In [95]:
def remove_special_chars (text):
  import re
  text = re.sub(r'[^a-zA-Z0-9\s]', '', text) # identify all non-alphanumeric, non-whitespace characters and replace with '' i.e. remove them.
  print("5. special characters removed")
  return text

remove_special_chars(document)

5. special characters removed


'pHllo Hllo can you hear me I just heard about bPythonbbr\r\n \n              Its an amazing language which can be used for Scripting Web development\r\n\r\n\n              Information Retrieval Natural Language Processing Machine Learning  Artificial Intelligence\n\n              What are you waiting for Go and get startedbr Hes learning shes learning theyve already\n\n\n              got a headstartp\n           '
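
One pitfall with character classes of this kind is worth a short sketch: a range written as `A-z` instead of `A-Z` also spans the ASCII characters that sit between `Z` and `a`, namely `[`, `\`, `]`, `^`, `_` and the backtick, so those symbols survive the clean-up. The sample string and variable names below are made up for illustration.

```python
import re

loose = re.compile(r"[^a-zA-z0-9\s]")   # A-z also covers [ \ ] ^ _ and the backtick
strict = re.compile(r"[^a-zA-Z0-9\s]")  # letters, digits and whitespace only

sample = "snake_case [tag] costs 5^2 dollars!"

print(loose.sub("", sample))   # underscores, brackets and the caret are kept
print(strict.sub("", sample))  # every non-alphanumeric symbol is removed
```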

### **Stemming and lemmatization**: Word stems are the base forms to which affixes (prefixes and suffixes) are attached to create new words. This is known as inflection. The reverse process of obtaining the base form of a word is known as stemming. A simple example is the words WATCHES, WATCHING, and WATCHED, which share the root stem WATCH as their base form. Lemmatization is very similar to stemming, in that we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word rather than the root stem. The difference is that the root word is always a lexicographically correct word (present in the dictionary), whereas the root stem may not be.

#### There are multiple ways to do stemming, but it is not a good way to clean the data. Lemmatization is the better approach, but its drawback is that it is time-consuming.
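
The difference between a root stem and a root word can be sketched without any NLP library. This is a toy illustration only: `toy_stem` is a naive suffix stripper (not the Porter algorithm), and `TOY_LEMMAS` is a hypothetical hand-written table standing in for a real dictionary lookup.

```python
# Toy suffix stripper: NOT the Porter algorithm, just enough to show the idea.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma table standing in for a real dictionary.
TOY_LEMMAS = {
    "watches": "watch",
    "watching": "watch",
    "watched": "watch",
    "studies": "study",
    "was": "be",
}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table.
    return TOY_LEMMAS.get(word, word)

for w in ["watches", "watching", "watched", "studies", "was"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

For "studies" the stripper yields "studi", which is not a dictionary word, while the lookup yields "study". Irregular forms such as "was" are out of reach for suffix stripping altogether, which is part of why lemmatization needs more resources and time.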

In [97]:
def lemmatize(text):
  import spacy
  sp_nlp = spacy.load('en_core_web_sm') # loading the model on every call is slow; load it once outside the function when processing many documents
  doc = sp_nlp(text) # creating spacy document. SpaCy automatically creates tokens.
  # spaCy 2.x returns '-PRON-' as the lemma of every pronoun, so keep the original token text in that case
  lemma_text = " ".join(word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc)
  print("6. Lemmatization is done")
  return lemma_text

lemmatize(document)

6. Lemmatization is done


'< p > héllo ! Héllo ! can you hear me ! I just hear about < b > python</b>!<br/ > \r\n \n               It be an amazing language which can be use for Scripting , web development , \r\n\r\n\n               Information Retrieval , Natural Language Processing , Machine Learning & Artificial Intelligence ! \n\n               what be you wait for ? go and get started.<br/ > He be learn , she be learn , they have already \n\n\n               get a headstart!</p > \n           '

### **Removing stopwords**: Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords or stop words. These are usually the words with the highest frequency if you compute a simple term or word frequency over a corpus. Words like a, an, the, and so on are considered to be stopwords. There is no universal stopword list, but we use a standard English language stopwords list from nltk. You can also add your own domain-specific stopwords as needed.

In [99]:
def remove_stopwords (text, is_lower_case = False):
  import nltk
  nltk.download('stopwords')
  from nltk.tokenize.toktok import ToktokTokenizer
  stopword_list = nltk.corpus.stopwords.words('english')
  stopword_list.remove('no') # one can remove certain stopwords as per use-case
  stopword_list.append('karthik') # one can add certain stopwords as per use-case
  
  tokenizer = ToktokTokenizer()
  words = tokenizer.tokenize(text)
  print('# of words before removing stopwords: ',len(words))
  #if isinstance(words, list): print('it is a list') # to check if an object is a list
  words = [w.strip() for w in words]  # removing additional whitespaces at the start or end of a word
  if is_lower_case:
    filtered_words = [w for w in words if w not in stopword_list]
  else:
    filtered_words = [w for w in words if w.lower() not in stopword_list]
  filtered_text = ' '.join(filtered_words)
  print('# of words after removing stopwords: ',len(filtered_words))
  return filtered_text

remove_stopwords(document)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
# of words before removing stopwords:  74
# of words after removing stopwords:  49


"<p>Héllo ! Héllo ! hear ! heard <b>Python</b> ! <br/> ' amazing language used Scripting , Web development , Information Retrieval , Natural Language Processing , Machine Learning &amp; Artificial Intelligence ! waiting ? Go get started.<br/> ' learning , ' learning , ' already got headstart ! </p>"

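The customization step (dropping some stopwords, adding domain-specific ones) can be sketched on its own with plain Python. `BASE_STOPWORDS`, `build_stopwords` and `filter_stopwords` are hypothetical names, and the mini list here is only a stand-in for NLTK's much longer English list. Using a set rather than a list is a design choice for faster membership tests.

```python
# Hypothetical mini stopword list; NLTK's English list is far longer.
BASE_STOPWORDS = {"a", "an", "the", "is", "no", "and", "also"}

def build_stopwords(base, keep=(), extra=()):
    # Remove words we want to keep in the text, then add domain-specific ones.
    return (set(base) - set(keep)) | {w.lower() for w in extra}

def filter_stopwords(tokens, stopwords):
    # Set membership is O(1) on average, unlike scanning a list.
    return [t for t in tokens if t.lower() not in stopwords]

stopwords = build_stopwords(BASE_STOPWORDS, keep=["no"], extra=["Karthik"])
tokens = "Karthik is no longer a good boy".split()
print(filter_stopwords(tokens, stopwords))
```

Keeping "no" matters for tasks such as sentiment analysis, where negation changes the meaning of a sentence.
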
### **Combine all the functions together to create a single text pre-processing function**

In [118]:
def text_pre_processing (corpus,html_tags_removal = True, accented_chars_removal = True, 
                         expanding_contractions = False, text_lower_case = True,
                         special_chars_removal = True, lemmatization = True,
                         stopwords_removal = True ):
  processed_corpus = []
  # process each document in the corpus
  for doc in corpus:
    if html_tags_removal:
      doc = remove_html_tags(doc)
    if accented_chars_removal:
      doc  = remove_accented_chars(doc)
    if expanding_contractions:
      doc = expand_contractions(doc)
    # converting the text into lowercase
    if text_lower_case:
      doc = doc.lower()
    # collapse runs of newlines (\r, \n, \r\n) and spaces into a single space
    import re
    doc = re.sub(r'[\r\n ]+', ' ', doc)
    print("4. converted to lowercase and extra newlines & whitespaces removed")
    if special_chars_removal:
      doc = remove_special_chars(doc)
    if lemmatization:
      doc = lemmatize(doc)
    if stopwords_removal:
      doc = remove_stopwords(doc)


    processed_corpus.append(doc)
  return processed_corpus

doc2 = "Karthik is a good boy"
doc3 = "Peter's brother is also a good boy"
text_pre_processing([doc2, doc3, document])


1. html tags removed
2. accented chars removed
4. converted to lowercase and extra newlines & whitespaces removed
5. special characters removed
6. Lemmatization is done
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
# of words before removing stopwords:  5
# of words after removing stopwords:  2
1. html tags removed
2. accented chars removed
4. converted to lowercase and extra newlines & whitespaces removed
5. special characters removed
6. Lemmatization is done
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
# of words before removing stopwords:  7
# of words after removing stopwords:  5
1. html tags removed
2. accented chars removed
4. converted to lowercase and extra newlines & whitespaces removed
5. special characters removed
6. Lemmatization is done
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords 

['good boy',
 'peters brother also good boy',
 'hello hello hear hear python amazing language use script web development information retrieval natural language processing machine learn artificial intelligence wait go get start learn learn already get headstart']

# A word cloud can be created from the processed text. Try it out tomorrow.
Try using Counter from collections as a word frequency counter.
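
A minimal sketch of the `Counter` idea, using the first two processed strings from the pipeline output as input. Splitting on whitespace is an assumption that holds here because the pipeline has already normalized the text.

```python
from collections import Counter

# First two processed documents from the pipeline output.
processed = ["good boy", "peters brother also good boy"]

# Count every whitespace-separated word across all documents.
word_counts = Counter(word for doc in processed for word in doc.split())

print(word_counts.most_common(2))  # → [('good', 2), ('boy', 2)]
```

The resulting counts could also serve as input for a word cloud, since most word cloud tools accept a word-to-frequency mapping.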


In [1]:
document = """<p>Héllo! Héllo! can you hear me! I just heard about <b>Python</b>!<br/>\r\n 
              It's an amazing language which can be used for Scripting, Web development,\r\n\r\n
              Information Retrieval, Natural Language Processing, Machine Learning & Artificial Intelligence!\n
              What are you waiting for? Go and get started.<br/> He's learning, she's learning, they've already\n\n
              got a headstart!</p>
           """
document

"<p>Héllo! Héllo! can you hear me! I just heard about <b>Python</b>!<br/>\r\n \n              It's an amazing language which can be used for Scripting, Web development,\r\n\r\n\n              Information Retrieval, Natural Language Processing, Machine Learning & Artificial Intelligence!\n\n              What are you waiting for? Go and get started.<br/> He's learning, she's learning, they've already\n\n\n              got a headstart!</p>\n           "