<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/Text_pre_processing_for_NLP/text_pre_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1> Text Pre Processing for NLP </h1> </center>



![](https://images.datacamp.com/image/upload/v1669223212/Text_Mining_6eeff5cb7c.png)

## Housekeeping
1. Check that the recording is on
2. Check audio and screenshare
3. Share link to notebook in chat
4. Light mode and readable font size

## What is Text Pre-processing?

Natural Language Processing (NLP) has become an integral part of many applications, from chatbots to sentiment analysis. One of the foundational steps in NLP is text preprocessing: cleaning and preparing raw text data for further analysis or model training. Proper text preprocessing can significantly affect the performance and accuracy of NLP models. This notebook walks through the essential steps involved in text preprocessing for NLP tasks.



## Why Is Text Preprocessing Important?

Raw text data is often noisy and unstructured, containing various inconsistencies such as typos, slang, abbreviations, and irrelevant information. Preprocessing helps in:

- ### Improving Data Quality:
Removing noise and irrelevant information ensures that the data fed into the model is clean and consistent.

- ### Enhancing Model Performance:
Well-preprocessed text can lead to better feature extraction, improving the performance of NLP models.

- ### Reducing Complexity:
Simplifying the text data can reduce the computational complexity and make the models more efficient.



## 3 main threads of today's text pre-processing class
How to pre-process:
- ### Plain text
- ### Web pages
- ### PDF files



# Plain text: Standard Text Preprocessing Techniques in NLP

### 1. Basic Text Cleaning
#### Clean the text by:
- converting it to lowercase,
- removing punctuation,
- removing numbers,
- removing special characters, and
- stripping HTML tags.

In [61]:
corpus = [
    "I can't wait for the new season of my favorite show!",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "<html><body>Welcome to the website!</body></html>",
    "Python is a great programming language!!! ??"
]


In [62]:
import re
import string
from bs4 import BeautifulSoup

def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\W', ' ', text)  # Remove special characters
    text = BeautifulSoup(text, "html.parser").get_text()  # Remove HTML tags (too late here: the angle brackets were already removed as punctuation)
    return text

cleaned_corpus = [clean_text(doc) for doc in corpus]
print(cleaned_corpus)


['i cant wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'htmlbodywelcome to the websitebodyhtml', 'python is a great programming language ']
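
Note the fourth document in the output above: it came out as `htmlbodywelcome to the websitebodyhtml`. The punctuation step deletes `<`, `>` and `/` before the HTML step runs, so the tag names survive as ordinary text. A minimal sketch of one possible fix is to strip the tags first. To stay dependency-free, this sketch swaps BeautifulSoup for a simple regular expression (`<[^>]+>`), which is an assumption that only holds for simple, well-formed tags. The name `clean_text_html_first` is hypothetical.

```python
import re
import string

def clean_text_html_first(text):
    # Hypothetical variant of clean_text: remove tags BEFORE touching punctuation
    text = re.sub(r'<[^>]+>', ' ', text)   # crude tag removal (stand-in for BeautifulSoup)
    text = text.lower()                    # lowercase
    text = re.sub(r'\d+', '', text)        # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\W+', ' ', text)       # replace leftover special characters and extra spaces
    return text.strip()

print(clean_text_html_first("<html><body>Welcome to the website!</body></html>"))
# prints: welcome to the website
```

In the notebook itself, the same effect can probably be had by moving the `BeautifulSoup(...).get_text()` line to the top of `clean_text`.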





### 2. Regular Expressions


- Regular expressions are a powerful tool in text preprocessing for Natural Language Processing (NLP). They allow for efficient and flexible pattern matching and text manipulation.
- Already covered in [lecture 2](https://github.com/ua-datalab/NLP-Speech/blob/main/Introduction_to_Regular_Expressions/Introduction_to_Regular_Expressions.ipynb)
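
As a small reminder of what pattern matching looks like in a preprocessing pipeline, the sketch below uses Python's built-in `re` module on a made-up sentence. The sample string and the patterns are illustrative assumptions and are not taken from lecture 2.

```python
import re

sample = "Contact us at help@example.com   or visit https://example.com/docs  today!"

# Replace URLs and e-mail addresses with placeholder tokens
no_urls = re.sub(r'https?://\S+', '<URL>', sample)
no_emails = re.sub(r'\b[\w.+-]+@[\w-]+\.[\w.-]+\b', '<EMAIL>', no_urls)

# Collapse repeated whitespace into a single space
normalized = re.sub(r'\s+', ' ', no_emails).strip()

print(normalized)
# prints: Contact us at <EMAIL> or visit <URL> today!
```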


### 3. Tokenization


- Tokenization is the process of breaking text down into smaller units, such as words or sentences. It is a crucial step in NLP because it turns raw text into a structured format that can be analyzed further.
- Sample code below. More details in [lecture 1](https://github.com/ua-datalab/NLP-Speech/blob/main/Natural_Language_Processing_Text_Mining_and_Sentiment_Analysis/Natural_Language_Processing_Text_Mining_and_Sentiment_Analysis.ipynb)



In [63]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

tokenized_corpus = [word_tokenize(doc) for doc in cleaned_corpus]
print(tokenized_corpus)


[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['htmlbodywelcome', 'to', 'the', 'websitebodyhtml'], ['python', 'is', 'a', 'great', 'programming', 'language']]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
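
The cell above splits text into words. To illustrate the other unit mentioned earlier (sentences), here is a rough sketch using only regular expressions. It is deliberately naive: it would wrongly split after abbreviations such as "U.S.", which is one reason a trained tokenizer like NLTK's `sent_tokenize` is usually preferred. The sample text is made up.

```python
import re

text = "Python is great. It is widely used in NLP! Have you tried it?"

# Naive sentence splitter: break after ., ! or ? when followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text)

# Naive word tokenizer: runs of letters, digits, or apostrophes
words = [re.findall(r"[A-Za-z0-9']+", s) for s in sentences]

print(sentences)
print(words)
```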


### Stop Words Removal

Stop words are very common words (such as "the", "of", and "for") that carry little meaning on their own. Here we remove them from the tokens.






In [64]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_corpus = [[word for word in doc if word not in stop_words] for doc in tokenized_corpus]
print(filtered_corpus)


[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'millions', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation'], ['htmlbodywelcome', 'websitebodyhtml'], ['python', 'great', 'programming', 'language']]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
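
One thing to watch: NLTK's English stop list includes negations such as "not", and dropping them can flip the meaning of a sentence in tasks like sentiment analysis. The sketch below shows one way to keep selected words. The tiny `stop_words` set here is a hand-written stand-in for illustration only, and `keep` is a hypothetical name.

```python
# Small hand-written stop list for illustration only (NLTK's list is much longer)
stop_words = {"i", "the", "of", "for", "my", "is", "a", "this", "not", "do"}

# Words we want to keep even though they are on the stop list
keep = {"not"}
custom_stop_words = stop_words - keep

tokens = ["i", "do", "not", "like", "this", "show"]

filtered_default = [t for t in tokens if t not in stop_words]
filtered_custom = [t for t in tokens if t not in custom_stop_words]

print(filtered_default)  # ['like', 'show']
print(filtered_custom)   # ['not', 'like', 'show']
```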


### 4. Lemmatization and Stemming
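- Stemming and lemmatization are techniques used in NLP to reduce words to their base or root forms. Stemming trims word endings by rule and can produce non-words (for example `favorit`), while lemmatization maps a word to its dictionary form. This process is important for tasks like text normalization, information retrieval, and text mining.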


In [65]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_corpus = [[stemmer.stem(word) for word in doc] for doc in filtered_corpus]
lemmatized_corpus = [[lemmatizer.lemmatize(word) for word in doc] for doc in filtered_corpus]
print(stemmed_corpus)
print(lemmatized_corpus)


[['cant', 'wait', 'new', 'season', 'favorit', 'show'], ['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid'], ['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat'], ['htmlbodywelcom', 'websitebodyhtml'], ['python', 'great', 'program', 'languag']]
[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'million', 'people', 'worldwide'], ['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation'], ['htmlbodywelcome', 'websitebodyhtml'], ['python', 'great', 'programming', 'language']]


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
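
In the lemmatized output above, `rising` is left unchanged and `us` becomes `u`. That happens because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed. To make the difference between the two techniques concrete without any downloads, here is a toy sketch: a rule-based suffix stripper standing in for a stemmer, and a tiny hand-written lookup table standing in for a lemmatizer. Both `toy_stem` and `TOY_LEMMAS` are hypothetical and far cruder than `PorterStemmer` and `WordNetLemmatizer`.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written dictionary forms for a few words from the corpus
TOY_LEMMAS = {"rising": "rise", "affected": "affect", "millions": "million", "fell": "fall"}

def toy_lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged
    return TOY_LEMMAS.get(word, word)

for w in ["rising", "affected", "millions", "fell"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based version turns `rising` into the non-word `ris` and cannot handle the irregular form `fell`, while the lookup version handles both but only for words it knows.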


### 5. Handling Contractions
Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe, such as *I'll* for *I will*.

This step expands the contractions in the text.

For example: `I’ll be there within 5 min. Are u not gng there? Am I mssng out on smthng? I’d like to see u near d park.`

In [66]:
!pip install contractions

In [67]:
# import library
import contractions
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too?
		I'd love to see u there my dear. It's awesome to meet new friends.
		We've been waiting for this day for so long.'''

# creating an empty list
expanded_words = []
for word in text.split():
  # using contractions.fix to expand the shortened words
  expanded_words.append(contractions.fix(word))

expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('\n\nExpanded_text: ' + expanded_text)


Original text: I'll be there within 5 min. Shouldn't you be there too? 
		I'd love to see u there my dear. It's awesome to meet new friends.
		We've been waiting for this day for so long.


Expanded_text: I will be there within 5 min. Should not you be there too? I would love to see you there my dear. It is awesome to meet new friends. We have been waiting for this day for so long.
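
To show roughly what happens under the hood, here is a minimal dictionary-based sketch. It is not how the `contractions` package is implemented. `CONTRACTION_MAP` and `expand` are hypothetical names, the mapping is tiny, and only straight apostrophes (`'`) are handled.

```python
import re

# Tiny hand-written mapping for illustration; real libraries ship a far larger one
CONTRACTION_MAP = {
    "i'll": "I will",
    "i'd": "I would",
    "it's": "it is",
    "we've": "we have",
    "shouldn't": "should not",
}

# One alternation pattern that matches any key, ignoring case
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    re.IGNORECASE,
)

def expand(text):
    def replace(match):
        word = match.group(0)
        expansion = CONTRACTION_MAP[word.lower()]
        # Preserve a leading capital letter from the original word
        if word[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion
    return pattern.sub(replace, text)

print(expand("I'll be there. Shouldn't you be there too? We've been waiting."))
# prints: I will be there. Should not you be there too? We have been waiting.
```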



### 6. Parts of Speech (POS)

Parts of Speech (POS) tagging is a fundamental task in NLP that involves **labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective**, etc. This information is crucial for many NLP applications, including parsing, information retrieval, and text analysis.



In [68]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [69]:
# Importing the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('punkt')

# Sample text
text = "NLTK is a powerful library for natural language processing."
words = word_tokenize(text)


# Performing PoS tagging
pos_tags = pos_tag(words)

# Displaying the PoS tagged result in separate lines
print("Original Text:")
print(text)

print("\nPoS Tagging Result:")
for word, tag in pos_tags:  # use `tag` so the imported pos_tag function is not overwritten
    print(f"{word}: {tag}")


Original Text:
NLTK is a powerful library for natural language processing.

PoS Tagging Result:
NLTK: NNP
is: VBZ
a: DT
powerful: JJ
library: NN
for: IN
natural: JJ
language: NN
processing: NN
.: .


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
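
Once words are tagged, a common next step is to keep only certain parts of speech. The sketch below hardcodes the `(word, tag)` pairs from the result above so that it runs without NLTK. The tags follow the Penn Treebank convention, where noun tags start with `NN` and adjective tags with `JJ`.

```python
# (word, tag) pairs copied from the tagging result above
pos_tags = [
    ("NLTK", "NNP"), ("is", "VBZ"), ("a", "DT"), ("powerful", "JJ"),
    ("library", "NN"), ("for", "IN"), ("natural", "JJ"),
    ("language", "NN"), ("processing", "NN"), (".", "."),
]

# Keep nouns (NN, NNP, ...) and adjectives (JJ, ...) by tag prefix
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
adjectives = [word for word, tag in pos_tags if tag.startswith("JJ")]

print(nouns)       # ['NLTK', 'library', 'language', 'processing']
print(adjectives)  # ['powerful', 'natural']
```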



#### 3 main threads of today's text pre-processing class
How to pre-process:
- ### Plain text
- # Web pages
- ### PDF files


Extracting text from HTML pages using Beautiful Soup

In [70]:
# Extract text from an HTML page using Beautiful Soup

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the text content of the page
text = soup.get_text()

print(text)



The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
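
`get_text()` keeps the visible text but throws away the link targets. When the URLs matter, they have to be collected from the `<a>` tags. Beautiful Soup can do this with `find_all("a")`. The sketch below shows the same idea with Python's built-in `html.parser` so that it runs without extra installs. `LinkCollector` is a hypothetical helper name, and the snippet is a shortened copy of the HTML above.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        # Remember the href while we are inside an <a> tag
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # The first text seen inside the <a> tag is treated as the link text
        if self._current_href is not None:
            self.links.append((data.strip(), self._current_href))
            self._current_href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

snippet = '''<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>'''

collector = LinkCollector()
collector.feed(snippet)
collector.close()
print(collector.links)
```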



Scrape and extract text from [this](https://en.wikisource.org/wiki/Moral_letters_to_Lucilius) Wikisource page

# Note: the code below prints a huge HTML output, so be ready to scroll

## Also remember to right-click and clear the output before moving to the next code block

In [71]:
import requests
from bs4 import BeautifulSoup

# import page containing links to all of Seneca's letters
# get web address
src = "https://en.wikisource.org/wiki/Moral_letters_to_Lucilius"

html_doc = requests.get(src).text  # pull html as text
print(html_doc)


<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Moral letters to Lucilius - Wikisource, the free online library</title>
<script>(function(){var className="client-js";var cookie=document.cookie.match(/(?:^|; )enwikisourcemwclientpreferences=([^;]+)/);if(cookie){cookie[1].split('%2C').forEach(function(pref){className=className.replace(new RegExp('(^| )'+pref.replace(/-clientpref-\w+$|[^\w-]+/g,'')+'-clientpref-\\w+( |$)'),'$1'+pref+'$2');});}document.documentElement.className=className;}());RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e086ff8b-1838-454b-86f5-5e5538d68192","wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Moral_letters_to_Lucilius","wgTitle":"Moral lett

#### Now let's extract plain text from that page

## Note: the code below prints a huge output; scroll up if you want to see the top section of the Wikisource page

## Also remember to right-click and clear the output before moving to the next code block

In [72]:
soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the text content of the page
text = soup.get_text()

print(text)






Moral letters to Lucilius - Wikisource, the free online library
































Download

Moral letters to Lucilius

From Wikisource



Jump to navigation
Jump to search
←Moral letters to Lucilius (Epistulae morales ad Lucilium) (1917/1920/1925)by Seneca, translated by Richard Mott Gummere→sister projects: Wikipedia article, Commons category, Wikidata item.
A Loeb Classical Library edition; volume 1 published 1917; volume 2 published 1920; volume 3 published 1925

482782Moral letters to Lucilius (Epistulae morales ad Lucilium)Richard Mott GummereSeneca


 SENECA
 AD LUCILIUM
 EPISTULAE MORALES

WITH AN ENGLISH TRANSLATION BY
RICHARD M. GUMMERE, PH.D.
OF HAVERFORD COLLEGE
IN THREE VOLUMES




LONDON : WILLIAM HEINEMANN
NEW YORK : G. P. PUTNAM'S SONS


 




CONTENTS




Volume 1


Introduction




Letter  1

On saving time


Letter  2

On discursiveness in reading


Letter  3

On true and false friendship


Letter  4

On the terrors of death


Letter  5

On the philo
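
The extracted text above is full of blank lines and double spaces left over from the page layout. A minimal sketch of tidying it up follows. It uses a short stand-in string (`raw_text`, an assumption for illustration) rather than the full page, and `compact_text` is a hypothetical name.

```python
import re

# Short stand-in for the scraped text, with the same kind of blank-line runs
raw_text = "\n\n\nMoral letters to Lucilius\n\n\n\nFrom Wikisource\n\n\n\nLetter  1\n\nOn saving time\n"

# Collapse runs of spaces inside each line and trim the ends
lines = [re.sub(r'\s+', ' ', line).strip() for line in raw_text.splitlines()]

# Drop the lines that are now empty
clean_lines = [line for line in lines if line]
compact_text = "\n".join(clean_lines)

print(compact_text)
```

In the notebook, the same steps could be applied to the `text` variable returned by `soup.get_text()`.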


#### 3 main threads of today's text pre-processing class
How to pre-process:
- ### Plain text
- ### Web pages
- # PDF files


In [73]:
!pip install PyPDF2
!pip install nltk

# Goal: Extract the names of individuals from the Municipal Corporation of Greater Mumbai (MCGM) from page 2 of [this](http://www.udri.org/pdf/02%20working%20paper%201.pdf) PDF file

i.e., extract the names (with their titles) from this paragraph
```
We wish to especially thank MCGM officers, Mr. Jagdish Talreja, Mr. Dinesh Naik, Mr. Hiren
Daftardar, Ms. Anita Naik for their continual support since the beginning of the project and their
help towards familiarization and data collection. They have been instrumental in helping to
contact various MCGM departments as well as in helping to establish contact with personnel from
other government departments and organizations. Many thanks for the MCGM team, for
deploying personnel, particularly Mr. Prasad Gharat, on extensive field visits that have helped in
understanding actual ground conditions.
```

### i.e., the expected answer is:

['Mr.Jagdish Talreja', 'Mr.Dinesh Naik', 'Mr.Hiren Daftardar', 'Ms.Anita Naik', 'Mr.Prasad Gharat']


# Note to instructor: the code below will ask you to restart the runtime. You don't have to; you can move on.

In [74]:
!pip install pip==24.0
!pip install textract --no-cache-dir

In [75]:
import PyPDF2, urllib.request , nltk , textract
from io import BytesIO
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [76]:

#Reading the PDF
wFile = urllib.request.urlopen('http://www.udri.org/pdf/02%20working%20paper%201.pdf')
pdfreader = PyPDF2.PdfReader(BytesIO(wFile.read()))

In [77]:
# Extract the page used in this exercise
# (pages is zero-indexed, so index 2 is the third page of the file)
pageObj = pdfreader.pages[2]
page2 = pageObj.extract_text()
# Cleaning the text
punctuations = ['(',')',';',':','[',']',',','...','.']
tokens = word_tokenize(page2)
stop_words = stopwords.words('english')
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]

# Collect each title plus the two tokens that follow it (first name, last name)
name_list = list()
check = ('Mr.', 'Mrs.', 'Ms.')
for idx, token in enumerate(tokens):
    # idx + 2 must be a valid index, so stop two tokens before the end
    if token.startswith(check) and idx < (len(tokens) - 2):
        name = token + tokens[idx+1] + ' ' + tokens[idx+2]
        name_list.append(name)

print(name_list)

['Mr.Jagdish Talreja', 'Mr.Dinesh Naik', 'Mr.Hiren Daftardar', 'Ms.Anita Naik', 'Mr.Prasad Gharat']
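
As an alternative to walking the token list, the same names can be pulled out with a single regular expression. The sketch below runs on an excerpt of the paragraph quoted earlier (pasted in as a string, so no PDF download is needed). The pattern assumes each name is a title followed by exactly two capitalized words, which holds for this paragraph but not for names in general.

```python
import re

# Excerpt of the acknowledgement paragraph, including its original line breaks
paragraph = """We wish to especially thank MCGM officers, Mr. Jagdish Talreja, Mr. Dinesh Naik, Mr. Hiren
Daftardar, Ms. Anita Naik for their continual support since the beginning of the project and their
help towards familiarization and data collection. Many thanks for the MCGM team, for
deploying personnel, particularly Mr. Prasad Gharat, on extensive field visits that have helped in
understanding actual ground conditions."""

# Title, then two capitalized words; \s+ also matches the line break inside "Hiren\nDaftardar"
name_pattern = re.compile(r'\b(Mr|Mrs|Ms)\.\s+([A-Z][a-z]+)\s+([A-Z][a-z]+)')

names = [f"{title}. {first} {last}" for title, first, last in name_pattern.findall(paragraph)]
print(names)
```

Unlike the token-based version, this output has a space after the title (`'Mr. Jagdish Talreja'` rather than `'Mr.Jagdish Talreja'`).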

