In [1]:
import nltk
from nltk.stem import PorterStemmer,LancasterStemmer,WordNetLemmatizer
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\omars\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\omars\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\omars\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True


## Introduction to Regular Expressions for NLP

### What are Regular Expressions (Regex)?

Regular expressions (regex or regexp) are powerful tools for pattern matching and text manipulation. A regular expression is a sequence of characters that defines a search pattern, which lets you search, match, and manipulate strings according to specific rules.

### Why Use Regular Expressions in NLP?

In Natural Language Processing, regular expressions are useful for:

- Text cleaning and preprocessing.
- Extracting specific information from text.
- Validating and parsing text data.
- Pattern matching for entity recognition, tokenization, and more.

### Basic Syntax

Regex patterns are constructed using a combination of regular characters and metacharacters.

- Regular characters (e.g., letters, numbers) are matched literally.
- Metacharacters (e.g., . ^ $ * + ? { } [ ] | ( ) \) have special meanings.
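
To make the distinction concrete, here is a minimal sketch (the sample strings are made up for illustration). It shows an unescaped `.` acting as a metacharacter, the same dot matched literally once escaped, and `re.escape()` escaping every metacharacter in a string for you.

```python
import re

sample = "3.5 3x5 3-5"

# '.' is a metacharacter: it matches any single character except a newline.
any_char = re.findall(r"3.5", sample)
print(any_char)        # ['3.5', '3x5', '3-5']

# Escaping the dot with a backslash makes it match a literal '.' only.
literal_dot = re.findall(r"3\.5", sample)
print(literal_dot)     # ['3.5']

# re.escape() escapes all metacharacters in an arbitrary string.
text = "Price: 3.50 (approx.)"
found = re.search(re.escape("(approx.)"), text).group()
print(found)           # (approx.)
```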

### Python `re` Module

Python's `re` module provides functions and classes for working with regular expressions.

## Importing the `re` module

In [2]:
import re

In [3]:
# Regular expressions are compiled into pattern objects, which have methods for
# operations such as searching for pattern matches or performing string substitutions.
p = re.compile('ab*')

# re.compile also accepts optional flags that control the behavior of the pattern.
# For example, to make the matching case-insensitive, pass re.IGNORECASE (or its short form re.I):
p = re.compile('ab*', re.I)


| Flag               | Description                                   |
|--------------------|-----------------------------------------------|
| re.IGNORECASE or re.I | Makes matching case-insensitive. |
| re.DOTALL or re.S  | Allows `.` to match any character, including a newline. |
| re.MULTILINE or re.M | Allows `^` and `$` to match at the start and end of each line. |
| re.VERBOSE or re.X  | Allows whitespace and comments inside the pattern. |
| re.ASCII or re.A     | Makes \w, \W, \b, \B, \d, \D, \s, \S match only ASCII characters. |
| re.LOCALE or re.L     | Makes \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only). |
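
The following sketch shows the effect of the most common flags on a small two-line string. The sample text and variable names are chosen only for illustration.

```python
import re

text = "First line\nsecond LINE"

# re.I: case-insensitive matching.
ignore_case = re.findall(r"line", text, re.I)
print(ignore_case)          # ['line', 'LINE']

# re.M: ^ matches at the start of every line, not only at the start of the string.
line_starts = re.findall(r"^\w+", text, re.M)
print(line_starts)          # ['First', 'second']

# re.S: '.' also matches a newline, so one match can span both lines.
without_dotall = re.search(r"First.*LINE", text)
with_dotall = re.search(r"First.*LINE", text, re.S)
print(without_dotall)       # None
print(with_dotall.group() == text)  # True

# Flags are combined with the bitwise OR operator.
combined = re.findall(r"^\w+ line$", text, re.I | re.M)
print(combined)             # ['First line', 'second LINE']
```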

In [4]:
# Example of using the verbose flag to make the pattern more readable
pattern = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
# The equivalent pattern written without the verbose flag
pattern = re.compile(r"\d+\.\d*")

In regular expressions, a backslash (`\`) either escapes a metacharacter or introduces a special sequence. For example, `\d` is a special sequence that matches any digit. To match the literal two-character text `\d`, escape the backslash with another backslash and write `\\d`.


In [5]:
pattern = re.compile(r"\\d")
# The raw string keeps "\d" in the text as a literal backslash followed by "d"
print(pattern.search(r"abc123\def"))

<re.Match object; span=(6, 8), match='\\d'>


How to match a backslash character?

In a regular expression, a literal backslash must itself be escaped with a backslash, so the regular expression that matches a single `\` is `\\`.

In a regular Python string literal, each of those backslashes must also be escaped, so we need to write `"\\\\"` to match just one backslash.

To avoid this, use Python's raw string notation, in which backslashes are not handled in any special way. Raw strings are prefixed with `r`, so `r"\\"` matches a single backslash.
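
A quick way to convince yourself is to compare the two string literals directly. This is only an illustrative sketch, and the Windows-style path is a made-up example.

```python
import re

regular = "\\\\"   # Python collapses each "\\" into one backslash
raw = r"\\"        # raw string: the two backslashes are kept as typed

print(len(regular), len(raw))   # 2 2
print(regular == raw)           # True

# Both literals are the same two-character regex, which matches one backslash.
found = re.findall(raw, r"C:\Users\name")
print(len(found))               # 2
```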

In [6]:
pattern = re.compile("\\\\")
print(pattern.search(r"abc123\def"))

pattern = re.compile(r"\\")
print(pattern.search(r"abc123\def"))


<re.Match object; span=(6, 7), match='\\'>
<re.Match object; span=(6, 7), match='\\'>


## Performing matches

The pattern object has several methods for performing matches; the table covers the most important ones.

| Method               | Purpose                                   |
|--------------------|-----------------------------------------------|
| match() | Determine if the RE matches at the beginning of the string.|
| search()  | Scan through a string, looking for any location where this RE matches.|
| findall() | Find all substrings where the RE matches, and returns them as a list.|
| finditer() | Find all substrings where the RE matches, and returns them as an iterator.|
| sub() | Find all substrings where the RE matches, and replace them with a different string.|
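
Before the longer examples, here is a compact sketch that runs each method from the table on the same input. The pattern and text are arbitrary choices for illustration.

```python
import re

digits = re.compile(r"\d+")
text = "abc123def45"

at_start = digits.match(text)                       # None: the string does not start with a digit
first = digits.search(text).group()                 # first match anywhere in the string
all_found = digits.findall(text)                    # every match, as a list of strings
spans = [m.span() for m in digits.finditer(text)]   # match objects give positions too
masked = digits.sub("#", text)                      # replace every match

print(at_start)    # None
print(first)       # 123
print(all_found)   # ['123', '45']
print(spans)       # [(3, 6), (9, 11)]
print(masked)      # abc#def#
```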

Example 1: matching for a Pattern

In [7]:
# match for pattern
text = "abbaaabbbbaaaaa"
# use the pattern object p (compiled earlier from 'ab*' with re.I) to match the text
match=p.match(text)
if match:
    # match object has many attributes and methods
    print("Match:", match)
    print("Match found:", match.group())
    print("Starting index:", match.start())
    print("Ending index:", match.end())
    print("Number of characters matched:", match.end() - match.start())
    print("Matched text:", match.string)  # note: match.string is the full input string
    
else:
    print("No match found.")

Match: <re.Match object; span=(0, 3), match='abb'>
Match found: abb
Starting index: 0
Ending index: 3
Number of characters matched: 3
Matched text: abbaaabbbbaaaaa


Example 2: Searching for a Pattern

In [8]:
# Search for a pattern in text
text = "The price of the product is $2 $25.99."
pattern = r'\$\d+(\.\d{2})?'
# Note: You can use the top-level functions provided by re without creating  pattern objects
match = re.search(pattern, text) # match is None if no match is found
if match:
    # match object has many attributes and methods
    print("Match found:", match.group())
    print("Starting index:", match.start())
    print("Ending index:", match.end())
    print("Number of characters matched:", match.end() - match.start())
    print("Matched text:", match.string)
    
else:
    print("No match found.")

Match found: $2
Starting index: 28
Ending index: 30
Number of characters matched: 2
Matched text: The price of the product is $2 $25.99.
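
One detail worth knowing about the pattern above: it contains a capturing group, `(\.\d{2})?`. `match.group()` still returns the whole match, but `re.findall()` returns the group's contents instead. The sketch below reuses the same text and contrasts this with a non-capturing group `(?:...)`.

```python
import re

text = "The price of the product is $2 $25.99."

# With a capturing group, findall() returns the text of the group, not the whole match.
capturing = re.findall(r"\$\d+(\.\d{2})?", text)

# A non-capturing group (?:...) keeps the grouping but returns whole matches.
non_capturing = re.findall(r"\$\d+(?:\.\d{2})?", text)

print(capturing)      # ['', '.99']
print(non_capturing)  # ['$2', '$25.99']
```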


Example 3: Extracting Email Addresses with findall()

In [9]:
# Extract email addresses from text
text = "Contact us at john@example.com or mary@example.org."
# Inside a character class, | is a literal pipe, so the class is written [A-Za-z]
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'
matches = re.findall(pattern, text)
print("Email addresses found:", matches)

Email addresses found: ['john@example.com', 'mary@example.org']


Example 4: Extracting Email Addresses with finditer()

In [10]:
# Extract email addresses from text
text = "Contact us at john@example.com or mary@example.org."
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'
matches = re.finditer(pattern, text)
for match in matches:
    print("Match found:", match.group())
    print("Starting index:", match.start())
    print("Ending index:", match.end())
    print("Number of characters matched:", match.end() - match.start())
    print("Matched text:", match.string)
    print("-----------------------------")

Match found: john@example.com
Starting index: 14
Ending index: 30
Number of characters matched: 16
Matched text: Contact us at john@example.com or mary@example.org.
-----------------------------
Match found: mary@example.org
Starting index: 34
Ending index: 50
Number of characters matched: 16
Matched text: Contact us at john@example.com or mary@example.org.
-----------------------------


Example 5: Text Cleaning


In [11]:
# Clean text by removing every character that is not a letter, digit, or whitespace
text = "This is an example sentence with $pecial character$."
cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print("Cleaned text:", cleaned_text)


Cleaned text: This is an example sentence with pecial character
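
In practice, several substitutions are often chained into one cleaning step. The function below is a hypothetical sketch, not part of the original notebook. The name `clean_text` and the specific rules (lowercasing, dropping URLs, replacing punctuation with spaces) are assumptions chosen for illustration.

```python
import re

def clean_text(text):
    """Hypothetical cleaning pipeline built from re.sub() calls."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text

cleaned = clean_text("Visit https://example.com NOW!!!   It's   great.")
print(cleaned)  # visit now it s great
```

Replacing punctuation with a space rather than an empty string keeps neighboring words from being glued together, at the cost of splitting contractions such as "It's".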


| Stemming      | Lemmatization |
|---------------|--------------|
| Strips the last few characters from a word using heuristic rules, which often yields stems that are misspelled or not real words. | Considers the context (such as the part of speech) and converts a word to its meaningful base form, known as the lemma. |
| For example, an aggressive stemmer may reduce 'Caring' to 'Car'. | For example, lemmatizing 'Caring' would return 'Care'. |
| Stemming is faster. | Lemmatization is slower. |
| Looks at the form of the word. | Looks at the meaning of the word. |


## Stemming


NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases.
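
To see why a hand-rolled regex stemmer falls short, consider this deliberately naive sketch. The function `naive_stem` and its suffix list are hypothetical and exist only to illustrate the problem.

```python
import re

def naive_stem(word):
    """Toy stemmer: strip one common suffix if at least 3 characters precede it."""
    return re.sub(r"(?<=\w{3})(ing|ed|es|s|ly)$", "", word.lower())

for word in ["programs", "programming", "caring", "news", "ran"]:
    print(word, "->", naive_stem(word))
# programs -> program
# programming -> programm
# caring -> car
# news -> new
# ran -> ran
```

The rule handles "programs" but leaves a doubled consonant in "programm", over-stems "caring" and "news", and does nothing for the irregular form "ran". The off-the-shelf stemmers include extra rules for many of these cases.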

Two popular stemmers in NLTK:
- Porter stemmer
- Lancaster stemmer


|                    | PorterStemmer                         | LancasterStemmer                   |
|--------------------|--------------------------------------|------------------------------------|
| Use Cases          | Useful when you want to preserve word meanings to some extent. | Suitable when you want a more aggressive stemming to reduce words to their base form.|


In [12]:

# Initialize the Porter stemmer
ps = PorterStemmer()
# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed", "happy", "happily","Happier"]
# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, ps.stem(word)))

--Word--            --Stem--            
program             program             
programming         program             
programer           program             
programs            program             
programmed          program             
happy               happi               
happily             happili             
Happier             happier             

In [13]:

# Initialize the Lancaster stemmer
ls = LancasterStemmer()
# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed", "happy", "happily","Happier"]
# Perform stemming
print("{0:20}{1:20}".format("--Word--","--Stem--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, ls.stem(word)))

--Word--            --Stem--            
program             program             
programming         program             
programer           program             
programs            program             
programmed          program             
happy               happy               
happily             happy               
Happier             happy               

## Lemmatization

In [13]:
# Initialize the WordNet lemmatizer
wnl = WordNetLemmatizer()
# Example inflections to reduce
example_words = ["program","programming","programer","programs","programmed", "happy", "happily", "Happier"]
# Perform lemmatization; pos="v" asks WordNet for verb lemmas,
# so words that are not known verb forms are returned unchanged
print("{0:20}{1:20}".format("--Word--","--Lemma--"))
for word in example_words:
    print("{0:20}{1:20}".format(word, wnl.lemmatize(word, pos="v")))

--Word--            --Lemma--           
program             program             
programming         program             
programer           programer           
programs            program             
programmed          program             
happy               happy               
happily             happily             
Happier             Happier             

## Tokenization

Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data.

In [14]:
#The simplest approach is to split text on white space
text = "This is an example sentence.     We will split this into words."

words=re.split(r' ', text)
print("split on :", words)

# What about tabs and newlines?  What about double, triple white spaces?

words=re.split(r'[ \t\n]+', text)
print("split on space/tabs/newlines :", words)
#Or
words=re.split(r'\s+', text)
print("split on spaces :", words)


split on : ['This', 'is', 'an', 'example', 'sentence.', '', '', '', '', 'We', 'will', 'split', 'this', 'into', 'words.']
split on space/tabs/newlines : ['This', 'is', 'an', 'example', 'sentence.', 'We', 'will', 'split', 'this', 'into', 'words.']
split on spaces : ['This', 'is', 'an', 'example', 'sentence.', 'We', 'will', 'split', 'this', 'into', 'words.']


In [15]:
# tokenize text into words
words = nltk.word_tokenize(text)
print("Tokenized text:", words)


Tokenized text: ['This', 'is', 'an', 'example', 'sentence', '.', 'We', 'will', 'split', 'this', 'into', 'words', '.']
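
A regular expression can approximate this behavior for simple text. The sketch below is an assumption-laden simplification rather than a reimplementation of NLTK's tokenizer: it treats every run of word characters as one token and every punctuation mark as its own token.

```python
import re

text = "This is an example sentence.     We will split this into words."

# \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark separately.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['This', 'is', 'an', 'example', 'sentence', '.', 'We', 'will', 'split', 'this', 'into', 'words', '.']

# Contractions show the limits of this simple rule.
contraction = re.findall(r"\w+|[^\w\s]", "don't")
print(contraction)  # ['don', "'", 't']
```

For this sentence the result matches the tokenizer output above, but the two diverge on harder input such as contractions, which is one reason to prefer a dedicated tokenizer.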


### BPE tokenization (Byte Pair Encoding)

In [None]:
from tokenizers import ByteLevelBPETokenizer

def byte_pair_tokenizer(corpus, vocab_size=1000, min_frequency=2):
    # Initialize a tokenizer
    tokenizer = ByteLevelBPETokenizer()

    # Train the tokenizer on the corpus
    tokenizer.train_from_iterator(corpus, vocab_size=vocab_size, min_frequency=min_frequency)

    # Return the tokenizer
    return tokenizer


In [None]:
import re 
from collections import defaultdict 

def get_vocab(data): 
	""" 
	Given a list of strings, returns a dictionary of words mapping to their frequency 
	count in the data. 

	Parameters:
	data: list of strings

	Returns:
	dictionary of words mapping to their frequency count in the data

	Example:
	If the input 'data' is ['low', 'lower'], the function returns:
	{'l o w </w>': 1, 'l o w e r </w>': 1}
	"""
	vocab = defaultdict(int) 
	for line in data: 
		for word in line.split(): 
			vocab[' '.join(list(word)) + ' </w>'] += 1
	return vocab 


def get_stats(vocab): 
	"""
	Given a vocabulary (a dictionary mapping words to frequency counts), this function returns a dictionary of tuples representing the frequency count of pairs of characters in the vocabulary.

	Parameters:
		vocab (dict[str, int]): A dictionary where keys are words, and values are their frequency counts.

	Returns:
		dict[tuple[str, str], int]: A dictionary where keys are tuples of two characters, and values are the frequency count of those character pairs.

	Example:
		If the input 'vocab' is {'l o w </w>': 5, 'l o w e r </w>': 2}, the function returns:
		{('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2}
	"""
	pairs = defaultdict(int) 
	for word, freq in vocab.items(): 
		symbols = word.split() 
		for i in range(len(symbols)-1): 
			pairs[symbols[i],symbols[i+1]] += freq 
	return pairs 

def merge_vocab(pair, v_in): 
	""" 
	Given a pair of characters and a vocabulary, returns a new vocabulary with the 
	pair of characters merged together wherever they appear. 

	Parameters:
		pair: tuple of two characters
		v_in: dictionary of words mapping to their frequency count in the data
	
	Returns:
		v_out: dictionary of words mapping to their frequency count in the data with the pair merged

	Example:
		If the input 'pair' is ('e', 'r') and the input 'v_in' is {'l o w </w>': 5, 'l o w e r </w>': 2},
		the function returns:
		{'l o w </w>': 5, 'l o w er </w>': 2}


	"""
	v_out = {} 
	bigram = re.escape(' '.join(pair)) 
	#  Negative Lookbehind (?<!\S) => it matches a position where the character before it is a whitespace character or the beginning of the string
	#  Negative Lookahead (?!\S)  => it matches a position where the character after it is a whitespace character or the end of the string
	# \S => non white space [^\s]
	p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)') 
	for word in v_in: 
		w_out = p.sub(''.join(pair), word) 
		v_out[w_out] = v_in[word] 
	return v_out 


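def byte_pair_encoding(data, n): 
	""" 
	Given a list of strings and an integer n, performs up to n merges of the most 
	frequent adjacent symbol pair and returns the resulting vocabulary 
	(a dictionary mapping space-separated symbols to word frequency counts). 
	"""
	vocab = get_vocab(data) 
	for i in range(n): 
		pairs = get_stats(vocab) 
		if not pairs: 
			# Every word is already a single symbol, so nothing is left to merge
			break 
		best = max(pairs, key=pairs.get) 
		vocab = merge_vocab(best, vocab) 
	return vocab 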

# Example usage: 
corpus = '''Tokenization is the process of breaking down 
a sequence of text into smaller units called tokens, 
which can be words, phrases, or even individual characters. 
Tokenization is often the first step in natural languages processing tasks 
such as text classification, named entity recognition, and sentiment analysis. 
The resulting tokens are typically used as input to further processing steps, 
such as vectorization, where the tokens are converted 
into numerical representations for machine learning models to use.'''
data = corpus.split('.') 

n = 230
bpe_pairs = byte_pair_encoding(data, n) 
bpe_pairs


{'Tokenization</w>': 2,
 'is</w>': 2,
 'the</w>': 3,
 'process</w>': 1,
 'of</w>': 2,
 'breaking</w>': 1,
 'down</w>': 1,
 'a</w>': 1,
 'sequence</w>': 1,
 'text</w>': 2,
 'into</w>': 2,
 'smaller</w>': 1,
 'units</w>': 1,
 'called</w>': 1,
 'tokens,</w>': 1,
 'which</w>': 1,
 'can</w>': 1,
 'be</w>': 1,
 'words,</w>': 1,
 'phrases,</w>': 1,
 'or</w>': 1,
 'even</w>': 1,
 'individual</w>': 1,
 'characters</w>': 1,
 'often</w>': 1,
 'first</w>': 1,
 'step</w>': 1,
 'in</w>': 1,
 'natural</w>': 1,
 'languages</w>': 1,
 'processing</w>': 2,
 'tasks</w>': 1,
 'such</w>': 2,
 'as</w>': 3,
 'classification,</w>': 1,
 'named</w>': 1,
 'entity</w>': 1,
 'recognition,</w>': 1,
 'and</w>': 1,
 'sentiment</w>': 1,
 'analysis</w>': 1,
 'The</w>': 1,
 'resulting</w>': 1,
 'tokens</w>': 2,
 'are</w>': 2,
 'typically</w>': 1,
 'used</w>': 1,
 'input</w>': 1,
 'to</w>': 2,
 'further</w>': 1,
 'steps,</w>': 1,
 'vectorization,</w>': 1,
 'where</w>': 1,
 'converted</w>': 1,
 'numerical</w>': 1,
 'repres
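
With a large number of merges on a small corpus, every word ends up as a single symbol, which hides how the algorithm proceeds. The sketch below is a compact, self-contained variant of the same procedure that also records the order of the merges. The function name `learn_bpe_merges` and its word-count input are assumptions made for illustration, and the toy counts come from the docstring examples above.

```python
import re
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Return (ordered list of merges, final vocabulary) for a {word: count} dict."""
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for left, right in zip(symbols, symbols[1:]):
                pairs[left, right] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Lookarounds make sure only whole symbols are merged.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        merged = "".join(best)
        # A function replacement avoids interpreting backslashes in the merged symbol.
        vocab = {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}
    return merges, vocab

merges, vocab = learn_bpe_merges({"low": 5, "lower": 2}, num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
print(vocab)   # {'low</w>': 5, 'low e r </w>': 2}
```

After three merges the frequent word "low" is a single token, while the rarer "lower" is still split into smaller symbols. Ties between equally frequent pairs are broken here by first occurrence, which is an arbitrary choice.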

## Resources
1. https://www.geeksforgeeks.org/byte-pair-encoding-bpe-in-nlp/
2. https://docs.python.org/3/library/re.html