**NLP - Session 18 - Building Text Cleanup and PreProcessing Pipeline**




**Removing HTML Tags**

In [2]:
from bs4 import BeautifulSoup

def remove_html_tags(text):
  return BeautifulSoup(text, "html.parser").get_text()

remove_html_tags(
    "<html> \
      <h1>Article Heading</h1> \
      <p>First sentence of some important article. And another one. And then the last one</p></html>"
)


' Article Heading First sentence of some important article. And another one. And then the last one'

**Removing Accented Characters**

In [3]:
import unicodedata

def remove_accented_characters(text):
  return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")
   
remove_accented_characters("Sómě Áccěntěd těxt. Some words such as résumé, café, prótest, divorcé, coördinate, exposé, latté.")

'Some Accented text. Some words such as resume, cafe, protest, divorce, coordinate, expose, latte.'

**Expanding Contractions**

Contractions are shortened forms of words or phrases, created by dropping one or more letters and usually marking the gap with an apostrophe (for example, "don't" for "do not"). Expanding them restores the full form.

In [None]:
!pip install contractions

In [None]:
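from contractions import contractions_dict
import re

def expand_contractions(text, mapping=contractions_dict):
  # Normalize curly apostrophes to the straight form
  text = text.replace("’", "'")
  # Lowercase the keys so case-insensitive matches can always be looked up
  lookup = {key.lower(): value for key, value in mapping.items()}
  # Escape keys so characters such as "." match literally; try longer keys first
  keys = sorted(lookup, key=len, reverse=True)
  pattern = re.compile(r"(?<!\w)({})(?!\w)".format("|".join(re.escape(k) for k in keys)), flags=re.IGNORECASE)

  def get_match(contraction):
    match = contraction.group(0)
    expanded = lookup[match.lower()]
    # Keep the casing of the first character of the original token
    return match[0] + expanded[1:]

  new_text = pattern.sub(get_match, text)
  # Drop any apostrophes left over after expansion
  new_text = re.sub("'", "", new_text)
  return new_text

expand_contractions("Y’all i’d contractions you’re expanded don’t think.")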
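
The cell above depends on the third-party `contractions` package and shows no output. As a minimal, dependency-free sketch of the same idea, the example below uses a tiny hand-written mapping. The names `SAMPLE_CONTRACTIONS` and `expand_sample_contractions` are hypothetical and exist only for illustration; they are not part of any library.

```python
import re

# Hypothetical, hand-written mapping used only for illustration;
# a real pipeline would use a much larger dictionary.
SAMPLE_CONTRACTIONS = {
    "don't": "do not",
    "you're": "you are",
    "i'd": "i would",
}

def expand_sample_contractions(text, mapping=SAMPLE_CONTRACTIONS):
    # Normalize typographic apostrophes to the straight form used by the keys
    text = text.replace("\u2019", "'")
    pattern = re.compile(
        "|".join(re.escape(key) for key in mapping), flags=re.IGNORECASE
    )

    def replace(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # Preserve the casing of the first character of the original token
        return found[0] + expanded[1:]

    return pattern.sub(replace, text)

print(expand_sample_contractions("Don\u2019t worry, you\u2019re fine and I'd agree."))
# prints: Do not worry, you are fine and I would agree.
```

Lowercasing the matched text before the lookup is what lets a case-insensitive pattern work with a mapping whose keys are all lowercase.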

**Removing Special Characters**

In [23]:
import re

def remove_special_characters(text):
  # Raw string avoids invalid-escape warnings for \s
  pat = r"[^a-zA-Z0-9.,!?/:;\"\'\s]"
  return re.sub(pat, "", text)

remove_special_characters("007 Not sure@ if this % was #fun! 558923 What do# you think** of it.? $500USD!") 

'007 Not sure if this  was fun! 558923 What do you think of it.? 500USD!'

**Removing Numbers**

In [24]:
# imports
import re

# function to remove numbers
def remove_numbers(text):
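    # match every character that is not a letter, listed punctuation, or whitespace
    # (A-Z, not A-z: the range A-z would also keep [ \ ] ^ _ and `)
    pattern = r'[^a-zA-Z.,!?/:;\"\'\s]' 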
    return re.sub(pattern, '', text)
 
# call function
remove_numbers("007 Not sure@ if this % was #fun! 558923 What do# you think** of it.? $500USD!")

' Not sure if this  was fun!  What do you think of it.? USD!'

**Removing Punctuation**

In [25]:
import string

def remove_punctuation(text):
  text = "".join([c for c in text if c not in string.punctuation])
  return text

remove_punctuation('Article: @First sentence of some, {important} article having lot of ~ punctuations. And another one;!')

'Article First sentence of some important article having lot of  punctuations And another one'

**Stemming**

In [27]:
from nltk.stem import PorterStemmer

def get_stem(text):
  # Import the stemmer from its documented location, nltk.stem
  stemmer = PorterStemmer()
  text = " ".join([stemmer.stem(word) for word in text.split()])
  return text

get_stem("we are eating and swimming ; we have been eating and swimming ; he eats and swims ; he ate and swam ")

'we are eat and swim ; we have been eat and swim ; he eat and swim ; he ate and swam'

**Lemmatization**

In [28]:
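import spacy

# spaCy v3 loads models by their full package name; install the model first with
# `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")

def get_lem(text):
  doc = nlp(text)
  # spaCy v2 lemmatized pronouns to "-PRON-"; keep the original token in that case
  text = " ".join([word.lemma_ if word.lemma_ != "-PRON-" else word.text for word in doc])
  return text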

get_lem("we are eating and swimming ; we have been eating and swimming ; he eats and swims ; he ate and swam ")

'we be eat and swim ; we have be eat and swim ; he eat and swim ; he eat and swam'

**Removing Stopwords**

In [33]:
import nltk
from nltk.tokenize import ToktokTokenizer

nltk.download("stopwords")

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words("english")

def remove_stopwords(text):
  tokens = tokenizer.tokenize(text)
  tokens = [token.strip() for token in tokens]
  t = [token for token in tokens if token.lower() not in stopword_list]
  text = " ".join(t)
  return text

remove_stopwords("i am myself you the stopwords list and this article is not should removed")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'stopwords list article removed'

**Removing Extra Whitespace and Tabs**

In [35]:
import re

def remove_extra_whitespace_tabs(text):
  # Collapse every run of whitespace (spaces, tabs, newlines) into a single space
  pattern = r"\s+"
  return re.sub(pattern, " ", text).strip()

remove_extra_whitespace_tabs('  This web line  has \t some extra  \t   tabs and whitespaces  ')

'This web line has some extra tabs and whitespaces'

**Lowercase**

In [36]:
def to_lowercase(text):
    return text.lower()

to_lowercase('ConVert THIS string to LOWER cASe.')

'convert this string to lower case.'
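
**Combining the Steps into a Pipeline**

The cells above define each cleanup step separately. One possible way to chain them is sketched below, using only standard-library steps so that it runs on its own. The function names (`strip_accents`, `strip_special_characters`, `collapse_whitespace`, `preprocess`) and the order in `DEFAULT_STEPS` are assumptions made for this illustration, not something prescribed by the notebook.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose accented characters, then drop the non-ASCII combining marks
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def strip_special_characters(text):
    # Keep letters, digits, basic punctuation, and whitespace
    return re.sub(r"[^a-zA-Z0-9.,!?/:;\"'\s]", "", text)

def collapse_whitespace(text):
    # Replace each run of whitespace with a single space and trim the ends
    return re.sub(r"\s+", " ", text).strip()

# Assumed ordering for this sketch: normalize characters first, tidy spacing last
DEFAULT_STEPS = (strip_accents, strip_special_characters, str.lower, collapse_whitespace)

def preprocess(text, steps=DEFAULT_STEPS):
    # Apply each step to the output of the previous one
    for step in steps:
        text = step(text)
    return text

print(preprocess("  Caf\u00e9  #1:\t R\u00e9sum\u00e9   REVIEW!! "))
# prints: cafe 1: resume review!!
```

Because `steps` is just a sequence of functions that take and return a string, any of the functions defined earlier (for example `remove_html_tags` or `remove_stopwords`) could be added to it, provided their libraries are installed.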