# Regular Expressions and RE Module 

## **Submitted by : Pinaki Shaw**


Regular expressions are character sequences that describe patterns to be matched within strings. They are widely used in natural language processing and information retrieval (IR) applications.
Also known as REs or regexes, regular expressions are available in Python through the standard library's **'re'** module.

**Motivation**: For any information retrieval system, preprocessing is a crucial step that directly affects the performance of the system. Common preprocessing tasks include punctuation removal, tokenization, stemming and lemmatization. Simple versions of all of these can be built with the 're' module. Other libraries make such tasks a lot easier, but the advantage of the 're' module is the flexibility to design our own rules using regexes. For example, instead of using a predefined Porter stemmer from an existing library, we can build our own stemmer after studying the data relevant to our IR task. This notebook demonstrates, through simple examples, the functionality provided by the re module and how we can customize it for an IR-specific task.

**Getting ready**

The re module is part of Python's standard library, so we only need to import it and we are good to go!

In [1]:
import re

For checking out the cool features of the re module, we will use the following text:

In [2]:
text= "One of the classic IR textbooks says that \"Information retrieval (IR) is finding material \
(usually documents) of an unstructured nature (usually text) that satisfies an information need from \
within large collections (usually stored on computers).\" \
Gerard Salton \"the father of Information Retrieval\" said that \
\"Information retrieval is a field concerned with the structure, analysis, organization, \
storage, searching, and retrieval of information.\""

print(text)

One of the classic IR textbooks says that "Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)." Gerard Salton "the father of Information Retrieval" said that "Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information."


**(I) re.sub(pattern,repl,string)**

**Task 1: Punctuation Removal**

First, let us remove the punctuation from the given text. For our purposes, any character other than a letter, a digit or whitespace is treated as punctuation.
The *re.sub()* function looks for the *pattern* in the *string* and replaces every occurrence of the pattern with the *repl* string.
For our task, we can replace every punctuation character with the empty string, as demonstrated below:

In [3]:
text=re.sub(r'[^\w\s\d]','',text)
print(text)

One of the classic IR textbooks says that Information retrieval IR is finding material usually documents of an unstructured nature usually text that satisfies an information need from within large collections usually stored on computers Gerard Salton the father of Information Retrieval said that Information retrieval is a field concerned with the structure analysis organization storage searching and retrieval of information
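
As a small self-contained check of the same idea, the sketch below applies a similar pattern to a made-up sentence (the *sample* string is an assumption for illustration and is not part of the notebook's text). It also shows that \w already covers digits and the underscore, so the \d in the pattern above is redundant and underscores survive unless they are removed explicitly.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = 'Salton said: "IR_101 (intro) costs $20!"'

# Drop everything that is not a word character or whitespace.
no_punct = re.sub(r"[^\w\s]", "", sample)
print(no_punct)          # Salton said IR_101 intro costs 20

# Variant that also removes the underscore, which \w would otherwise keep.
no_punct_strict = re.sub(r"[^\w\s]|_", "", sample)
print(no_punct_strict)   # Salton said IR101 intro costs 20
```

Whether the underscore counts as punctuation depends on the collection, so the second variant is a design choice rather than a fix.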


**Task 2: Stemming**

Having cleaned our text, we can define our own rules for stemming using the re module. For demonstration purposes, let us consider the following two stemming rules:
- **plural to singular as in**: *textbooks -> textbook*
- **-ing form to base form as in**: *finding -> find*

We can again use *re.sub()* for this task. In addition, we can capture groups of characters with parentheses () and refer back to them in the replacement string with the backreferences \1, \2, etc., as shown below:

In [4]:
text_stemmed1=re.sub(r"(\w*)ing\b",r'\1',text)  #Rule (*)ing -> (*)

text_stemmed2=re.sub(r"(\w*)([^ei])s\b",r'\1\2',text_stemmed1) #Rule (*)s -> (*), except after "e" or "i", as in "flies", "satisfies"

text=text_stemmed2

print(text)

One of the classic IR textbook say that Information retrieval IR is find material usually document of an unstructured nature usually text that satisfies an information need from within large collection usually stored on computer Gerard Salton the father of Information Retrieval said that Information retrieval is a field concerned with the structure analysis organization storage search and retrieval of information
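
To see how the captured groups and backreferences behave on individual words, here is a minimal sketch that wraps the two toy rules in a helper. The name *simple_stem* and the word list are hypothetical, and the first rule uses \w+ instead of \w* so that a bare "ing" is left alone, which is a small assumption on top of the rules above.

```python
import re

def simple_stem(word):
    """Hypothetical helper: apply the two toy stemming rules to one word."""
    # Rule 1: strip a trailing "ing" and keep the captured stem (group 1).
    word = re.sub(r"(\w+)ing\b", r"\1", word)
    # Rule 2: strip a trailing "s" unless the letter before it is "e" or "i".
    # Group 1 is the start of the word, group 2 the letter before the "s".
    word = re.sub(r"(\w*)([^ei])s\b", r"\1\2", word)
    return word

for w in ["textbooks", "finding", "satisfies", "analysis", "is"]:
    print(w, "->", simple_stem(w))
```

Rules this simple also over-stem: a word such as "thing" would be reduced to "th", so real rules need exceptions derived from the data.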


**Task 3: Lemmatization**

Let us convert the verb forms "is" and "are" to their base form (lemma), i.e., "be". The \b anchors make sure that only whole words are replaced.

In [5]:
text=re.sub(r"\b(is|are)\b",r'be',text)
print(text)

One of the classic IR textbook say that Information retrieval IR be find material usually document of an unstructured nature usually text that satisfies an information need from within large collection usually stored on computer Gerard Salton the father of Information Retrieval said that Information retrieval be a field concerned with the structure analysis organization storage search and retrieval of information
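
When more than a couple of forms have to be mapped, one option is to keep them in a lookup table and pass a function as the *repl* argument of *re.sub()*. The sketch below assumes a small hypothetical table called *LEMMAS*; its entries and the helper name *lemmatize* are illustrative, not part of the notebook.

```python
import re

# Hypothetical lookup table of irregular forms; extend it after studying the data.
LEMMAS = {"is": "be", "are": "be", "was": "be", "said": "say"}

# One alternation built from the table keys, anchored on word boundaries.
lemma_pattern = re.compile(r"\b(" + "|".join(map(re.escape, LEMMAS)) + r")\b")

def lemmatize(text):
    # repl may be a function; it receives the match object for each match.
    return lemma_pattern.sub(lambda m: LEMMAS[m.group(1)], text)

print(lemmatize("retrieval is a field and documents are stored"))
# retrieval be a field and documents be stored
```

The pattern is case-sensitive, so it should be applied after lowercasing or compiled with re.IGNORECASE if capitalized forms matter.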


**(II) re.findall(pattern,string)**

**Task 4: Count of query terms in a document**

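Traditionally, the more frequent the query terms are in a document, the more relevant the document is considered for that query. For such bag-of-words models, we need to find the number of times each query term appears in the document.
The *re.findall()* function returns a list containing all non-overlapping matches of the *pattern* in the given *string*. Because a plain term is matched anywhere in the string, it is also found inside longer words (here, "the" inside "father"); wrapping the term in \b...\b restricts the matches to whole words.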

In [6]:
text=text.lower()  #convert entire text to lowercase
query_terms=["the","information"]  #let the query be "the information"
for term in query_terms:
    print(re.findall(term,text))

['the', 'the', 'the', 'the']
['information', 'information', 'information', 'information', 'information']


We can obtain the frequency of each term by taking the length of the returned list, as demonstrated below:

In [7]:
for term in query_terms:
    print(term," : ",len(re.findall(term,text)))

the  :  4
information  :  5
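
The count of 4 for "the" includes the match inside "father". A minimal sketch of whole-word counting is given below; the helper name *count_term* and the short *doc* string are assumptions made for illustration.

```python
import re

def count_term(term, text):
    """Hypothetical helper: count whole-word occurrences of term in text."""
    # re.escape guards against regex metacharacters in the query term;
    # \b on both sides stops "the" from matching inside "father".
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text))

doc = "the father of information retrieval wrote the book"
print(count_term("the", doc))        # 2
print(len(re.findall("the", doc)))   # 3, because of "father"
```

Applied to the notebook's text, the same helper would report the number of whole-word occurrences of each query term instead of the substring counts shown above.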


**(III) re.split(pattern,string)**

**Task 5: Tokenization**

Breaking up a large text into individual tokens is another important task in IR. The *re.split()* function provides an easy way to split a *string* at a specified character or *pattern*. It returns a list in which the string has been split at each match of the pattern. In our case, we will split the text on whitespace.

In [8]:
words_list=re.split(r"\s+",text)  #raw string, so that \s reaches the regex engine unchanged
print(words_list)

['one', 'of', 'the', 'classic', 'ir', 'textbook', 'say', 'that', 'information', 'retrieval', 'ir', 'be', 'find', 'material', 'usually', 'document', 'of', 'an', 'unstructured', 'nature', 'usually', 'text', 'that', 'satisfies', 'an', 'information', 'need', 'from', 'within', 'large', 'collection', 'usually', 'stored', 'on', 'computer', 'gerard', 'salton', 'the', 'father', 'of', 'information', 'retrieval', 'said', 'that', 'information', 'retrieval', 'be', 'a', 'field', 'concerned', 'with', 'the', 'structure', 'analysis', 'organization', 'storage', 'search', 'and', 'retrieval', 'of', 'information']


*re.split()* has an advantage over Python's built-in *str.split()* when the delimiter does not have a fixed length. For example, to split our text on any run of one or more occurrences of the letter s, we can do the following using *re.split()*:

In [9]:
words_on_s=re.split("s+",text) #split on one or more consecutive occurrences of the character 's'
print(words_on_s)

['one of the cla', 'ic ir textbook ', 'ay that information retrieval ir be find material u', 'ually document of an un', 'tructured nature u', 'ually text that ', 'ati', 'fie', ' an information need from within large collection u', 'ually ', 'tored on computer gerard ', 'alton the father of information retrieval ', 'aid that information retrieval be a field concerned with the ', 'tructure analy', 'i', ' organization ', 'torage ', 'earch and retrieval of information']
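
A more typical use of a variable-length delimiter is text with mixed separators. The sketch below compares *str.split()* and *re.split()* on a made-up line; the *line* string and the choice of delimiters are assumptions for illustration.

```python
import re

# Hypothetical sample with mixed delimiters.
line = "structure,analysis;  organization ,storage"

# str.split only accepts one fixed separator string.
print(line.split(","))
# ['structure', 'analysis;  organization ', 'storage']

# re.split can treat any run of commas, semicolons and whitespace as one delimiter.
tokens = re.split(r"[,;\s]+", line)
print(tokens)
# ['structure', 'analysis', 'organization', 'storage']
```

If the string starts or ends with a delimiter, *re.split()* returns empty strings at the corresponding ends of the list, so it is common to strip the text first or filter out empty tokens.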


**Task 6: Bag of Words model**

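Let us now see how we can build a bag of words model for our text using the re module. Each unique token is counted with *re.findall()*; since the token is matched as a plain substring, short tokens such as "a" and "on" are also counted inside longer words.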

In [10]:
unique_words_list=set(words_list) #build a set of all unique tokens in the text

vocab={}
for word in unique_words_list:
    vocab[word]=len(re.findall(word,text)) #building dictionary of unique tokens with their counts

for key,value in vocab.items():
    print(key,":",value)  


usually : 3
within : 1
structure : 2
information : 5
retrieval : 4
one : 1
say : 1
collection : 1
said : 1
with : 2
computer : 1
stored : 1
document : 1
salton : 1
on : 11
textbook : 1
of : 4
text : 2
that : 3
search : 1
material : 1
classic : 1
ir : 2
large : 1
and : 1
satisfies : 1
find : 1
need : 1
be : 2
organization : 1
an : 5
the : 4
from : 1
field : 1
concerned : 1
gerard : 1
storage : 1
nature : 1
father : 1
unstructured : 1
analysis : 1
a : 36
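
To count whole words rather than substrings (the reason "a" reaches 36 and "on" reaches 11 above), one option is to tokenize once and tally the tokens. The sketch below uses *collections.Counter* from the standard library on a short made-up string; the *doc* string and the variable name *bag* are assumptions for illustration.

```python
import re
from collections import Counter

doc = "an analysis of information and an organization of information"

# Substring counting: "an" is also found inside "analysis", "and", "organization".
print(len(re.findall("an", doc)))                 # 5

# Whole-word counting: tokenize once with \w+ and let Counter tally the tokens.
bag = Counter(re.findall(r"\w+", doc))
print(bag["an"], bag["information"], bag["of"])   # 2 2 2
```

This also scans the text only once, instead of once per unique token.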


**Wrapping it up**

Python's re module provides other useful functions, such as re.subn() and re.search(), as well as match objects with their own methods. In this notebook, I have highlighted the functions which are most relevant to IR-specific tasks.
As shown in the examples above, with the correct combination of regular expressions and re functions, we can design strong and flexible customized functions for preprocessing text for IR applications.
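
For completeness, here is a brief sketch of the functions mentioned above but not demonstrated in this notebook: *re.subn()*, *re.search()* and a few match object methods. The *sentence* string is a shortened excerpt chosen for illustration.

```python
import re

sentence = "Information retrieval (IR) is finding material (usually documents)"

# re.subn works like re.sub but also returns the number of replacements made.
cleaned, n_removed = re.subn(r"[()]", "", sentence)
print(cleaned, n_removed)

# re.search returns a match object for the first match, or None if there is none.
m = re.search(r"\b(\w+)ing\b", cleaned)
if m:
    # group(0) is the whole match, group(1) the first captured group,
    # span() the (start, end) offsets of the match in the string.
    print(m.group(0), m.group(1), m.span())
```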

**References:**
1. https://docs.python.org/3/library/re.html
2. https://regexone.com/lesson/capturing_groups
3. https://www.regular-expressions.info/python.html
4. https://www.guru99.com/python-regular-expressions-complete-tutorial.html