# Text preprocessing pipeline

<a target="_blank" href="https://colab.research.google.com/github/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs/blob/liors_branch/Chapter4_notebooks/Ch4_Preprocessing_Pipeline.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**The purpose of this notebook:**  
As demonstrated, text preprocessing is one of the most fundamental practices in NLP.  
In this notebook we walk through a variety of preprocessing functions and show how they come together into a single pipeline.  

**Requirements:**  
* When running in Colab, use this runtime notebook setting: `Python 3, CPU`

>*```Disclaimer: The content and ideas presented in this notebook are solely those of the authors and do not represent the views or intellectual property of the authors' employers.```*

Install:

In [1]:
# REMARK:
# If the code below errors out because of a Python package version discrepancy, newer package versions may be the cause.
# In that case, set "default_installations" to False to revert to the original image:
default_installations = True

if default_installations:
  !pip install num2words autocorrect
else:
  import requests
  text_file_path = "preprocessing_pipeline.txt"
  url = "https://raw.githubusercontent.com/python-devops-sre/nlp/master/requirements/" + text_file_path
  res = requests.get(url)
  with open(text_file_path, "w") as f:
    f.write(res.text)

  !pip install -r preprocessing_pipeline.txt



In [2]:
# Imports:
import re
from num2words import num2words
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from autocorrect import Speller

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\laven\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\laven\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\laven\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Preprocessing functions:
def decode(text):
    """
    The function takes in a string of text as input
    and extracts the subject line and body text from the text
    using regular expressions. It then formats the extracted
    text into a single string and returns it as output.

    Input: str
    Output: str
    """
    text = re.sub("\\n|\\r|\\t|-", " ", text)
    subject_line_search = re.search(r"<SUBJECT LINE>(.*?)<END>", text, flags=re.S)
    body_text_search = re.search(r"<BODY TEXT>(.*?)<END>", text, flags=re.S)

    formatted_output = ""
    if subject_line_search:
      formatted_output = formatted_output + subject_line_search.group(1) + ". "
    if body_text_search:
      formatted_output = formatted_output + body_text_search.group(1) + "."
    return formatted_output


def digits_to_words(match):
    """
    Convert a matched run of digits to English words. The function distinguishes
    between cardinal and ordinal numbers.
    E.g. "2" becomes "two", while "2nd" becomes "second"

    Input: re.Match (as passed by re.sub to a replacement function)
    Output: str
    """
    suffixes = ('st', 'nd', 'rd', 'th')
    # Lower-case here so we do not rely on earlier pipeline steps:
    string = match[0].lower()
    if string.endswith(suffixes):
      # Avoid shadowing the built-in "type":
      number_type = 'ordinal'
      string = string[:-2]
    else:
      number_type = 'cardinal'

    return num2words(int(string), to=number_type)


def spelling_correction(text):
    """
    Replace misspelled words with the correct spelling.

    Input: str
    Output: str
    """
    corrector = Speller()
    spells = [corrector(word) for word in text.split()]
    return " ".join(spells)


def remove_stop_words(text):
    """
    Remove stopwords.

    Input: str
    Output: str
    """
    stopwords_set = set(stopwords.words('english'))
    return " ".join([word for word in text.split() if word not in stopwords_set])


def stemming(text):
    """
    Perform stemming of each word individually.

    Input: str
    Output: str
    """
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in text.split()])


def lemmatizing(text):
    """
    Perform lemmatization for each word individually.

    Input: str
    Output: str
    """
    lemmatizer = WordNetLemmatizer()
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])
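
The replacement function passed to `re.sub` receives a match object rather than a plain string, which is why `digits_to_words` reads `match[0]`. The sketch below illustrates that mechanism with the standard library only. The lookup tables `CARDINALS` and `ORDINALS` and the function `toy_digits_to_words` are hypothetical stand-ins for `num2words` and cover just the digits 1 to 3; anything else is left unchanged.

```python
import re

# Hypothetical lookup tables standing in for num2words (digits 1-3 only).
CARDINALS = {"1": "one", "2": "two", "3": "three"}
ORDINALS = {"1": "first", "2": "second", "3": "third"}

# Digits, optionally followed by exactly one ordinal suffix.
NUMBER_PATTERN = re.compile(r"(\d+)(st|nd|rd|th)?", flags=re.IGNORECASE)


def toy_digits_to_words(match):
    """Return the word form of the matched number, or the match unchanged if unknown."""
    digits, suffix = match.group(1), match.group(2)
    # A captured suffix means the number is an ordinal; otherwise it is a cardinal.
    table = ORDINALS if suffix else CARDINALS
    return table.get(digits, match[0])


converted = NUMBER_PATTERN.sub(toy_digits_to_words, "2 files, 1st one, 3rd copy, 40 pages")
print(converted)
```

Here "40" falls outside the toy tables and is returned as-is, whereas `num2words` would handle arbitrary integers.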

In [4]:
# Preprocessing pipeline:
def preprocessing(input_text, printing=False):
    """
    This function represents a complete pipeline for text preprocessing.

    Code design note: The fact that we update variable "output" instead of
    creating new variables with new names as we go, allows us to change the
    order of the actions or add/remove actions easily.

    Input: str
    Output: str
    """
    def log(title, text):
        # Print the intermediate result only when requested:
        if printing:
            print(f"\n{title}\n        ", text)

    output = input_text
    # Decode/remove encoding:
    output = decode(output)
    log("Decode/remove encoding:", output)

    # Lower casing:
    output = output.lower()
    log("Lower casing:", output)

    # Convert digits to words:
    # The regex matches consecutive digits optionally followed by a single ordinal suffix:
    output = re.sub(r'\d+(?:st|nd|rd|th)?', digits_to_words, output, flags=re.IGNORECASE)
    log("Digits to words", output)

    # Remove punctuation and other special characters:
    output = re.sub('[^ A-Za-z0-9]+', '', output)
    log("Remove punctuations and other special characters", output)

    # Spelling corrections:
    output = spelling_correction(output)
    log("Spelling corrections:", output)

    # Remove stop words:
    output = remove_stop_words(output)
    log("Remove stop words:", output)

    # Stemming:
    output = stemming(output)
    log("Stemming:", output)

    # Lemmatizing:
    output = lemmatizing(output)
    log("Lemmatizing:", output)

    return output
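
The design note in the docstring (reusing one `output` variable so that steps can be reordered, added, or removed) can be taken one step further by holding the steps in a list. The following is a minimal sketch of that idea, not the notebook's implementation: `run_steps`, `lower_case`, `strip_punctuation`, and `collapse_spaces` are hypothetical names, and the three steps use only the standard library so the example runs without NLTK or `num2words`.

```python
import re


def lower_case(text):
    return text.lower()


def strip_punctuation(text):
    # Same character class as the pipeline above: keep letters, digits and spaces.
    return re.sub(r"[^ A-Za-z0-9]+", "", text)


def collapse_spaces(text):
    return " ".join(text.split())


def run_steps(text, steps, printing=False):
    """Apply each (name, function) pair in order, optionally printing intermediate results."""
    output = text
    for name, step in steps:
        output = step(output)
        if printing:
            print(f"{name}: {output}")
    return output


# Reordering or dropping a step only requires editing this list.
steps = [
    ("Lower casing", lower_case),
    ("Remove punctuation", strip_punctuation),
    ("Collapse spaces", collapse_spaces),
]
result = run_steps("Attached are  2 FILES, see below!", steps, printing=True)
```

With this layout, the functions defined earlier in the notebook could be appended to the list in the same way, since each takes a string and returns a string.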

In [5]:
# Applying preprocessing:
raw_text_input = """
"<SUBJECT LINE> Employees details<END><BODY TEXT>Attached are 2 files,\n1st one is pairoll, 2nd is healtcare!<END>"
"""
print(f"This is the input raw text:\n{raw_text_input}")

print(f"\n----------------------------\nThis is the preprocessed text:\n        {preprocessing(raw_text_input, printing=True)}")

This is the input raw text:

"<SUBJECT LINE> Employees details<END><BODY TEXT>Attached are 2 files,
1st one is pairoll, 2nd is healtcare!<END>"


Decode/remove encoding:
          Employees details. Attached are 2 files, 1st one is pairoll, 2nd is healtcare!.

Lower casing:
          employees details. attached are 2 files, 1st one is pairoll, 2nd is healtcare!.

Digits to words
          employees details. attached are two files, first one is pairoll, second is healtcare!.

Remove punctuations and other special characters
          employees details attached are two files first one is pairoll second is healtcare

Spelling corrections:
         employees details attached are two files first one is payroll second is healthcare

Remove stop words:
         employees details attached two files first one payroll second healthcare

Stemming:
         employe detail attach two file first one payrol second healthcar

Lemmatizing:
         employe detail attach two file first one payrol second
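
One interaction between steps is worth keeping in mind when feeding other inputs through the pipeline: `num2words` spells compound numbers with a hyphen (42 becomes "forty-two"), and the punctuation step deletes characters rather than replacing them with a space. The sketch below reproduces that effect with the standard library only. The input string is typed by hand rather than generated by `num2words`, and `hyphens_to_spaces` is a hypothetical helper that is not part of the notebook.

```python
import re

# Same character class as the pipeline's punctuation-removal step.
PUNCTUATION = re.compile(r"[^ A-Za-z0-9]+")


def remove_punctuation(text):
    return PUNCTUATION.sub("", text)


def hyphens_to_spaces(text):
    # Hypothetical extra step: turn hyphens into spaces before punctuation is removed.
    return text.replace("-", " ")


text = "attached are forty-two files"

# Removing punctuation directly fuses the two halves of the number.
fused = remove_punctuation(text)

# Replacing hyphens first keeps them as separate tokens.
separated = remove_punctuation(hyphens_to_spaces(text))

print(fused)
print(separated)
```

Whether the fused or the separated form is preferable depends on the downstream task; the point is only that the order of steps changes the tokens that reach the spelling, stop-word, and stemming stages.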