<table align="left" width=100%>
    <tr>
        <td width="10%">
            <img src="../images/RA_Logo.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> Regular Expression </b>
                </font>
            </div>
        </td>
    </tr>
</table>

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/vidyadharbendre/learn_nlp_using_examples/blob/main/notebooks/00_regular_expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/vidyadharbendre/learn_nlp_using_examples/blob/main/notebooks/regular_expressions.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

# Regular Expressions

Regular expressions are a way to search for patterns in text.

The concept is close to universal: most programming languages and text editors support regular expressions in some form.

We're going to learn the concepts while we learn the syntax Python uses.

The goal of regular expressions is to search for a specific kind of text inside a string. For example, if a form on our web page asks for an email address, can we check whether the submitted string actually has the shape of an email address? That shape is: some letters, digits, or special characters, then an `@` sign, then more letters, digits, or special characters, then a `.`, then a few more letters.
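The email check described above could be sketched roughly as follows. The pattern and the helper name `looks_like_email` are illustrative assumptions, not a complete validator; real address rules are far more permissive than this.

```python
import re

# Hypothetical, deliberately simple shape: local part, "@", domain, ".", letters.
EMAIL_SHAPE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def looks_like_email(candidate: str) -> bool:
    """Return True if the whole string has the rough shape of an email address."""
    # fullmatch requires the pattern to cover the entire string, not just part of it.
    return EMAIL_SHAPE.fullmatch(candidate) is not None

print(looks_like_email("support@example.com"))  # True
print(looks_like_email("not-an-email"))         # False
```

`fullmatch` is used here because a form field should contain the address and nothing else.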

# Text Preprocessing with Regular Expressions (Regex)

Regular expressions (regex) are sequences of characters that define search patterns, primarily used for string matching and manipulation. In text preprocessing, regex can be used to clean and normalize text data by removing unwanted characters, extracting specific patterns, and more.

## Table of Contents
1. [Introduction to Regular Expressions](#introduction-to-regular-expressions)
2. [Installation](#installation)
3. [Basic Regex Operations](#basic-regex-operations)
    - [Pattern Matching](#pattern-matching)
    - [Replacing Patterns](#replacing-patterns)
    - [Splitting Strings](#splitting-strings)
4. [Text Preprocessing Examples](#text-preprocessing-examples)
    - [Removing Punctuation](#removing-punctuation)
    - [Removing Digits](#removing-digits)
    - [Extracting Email Addresses](#extracting-email-addresses)
    - [Normalizing Whitespace](#normalizing-whitespace)
5. [Conclusion](#conclusion)

## Introduction to Regular Expressions

Regular expressions are used to identify patterns in text. They are supported by most programming languages, including Python, and are very powerful for text processing tasks.

## Installation

In Python, the `re` module provides support for working with regular expressions. No additional installation is required as it is part of the standard library.

## Basic Regex Operations

### Pattern Matching

To search for patterns in text, use the `re.search` function.

In [118]:
import re

text = "Hello, my name is Radhika."
pattern = r"Radhika"
match = re.search(pattern, text)

if match:
    print("Pattern found:", match.group())
else:
    print("Pattern not found")

Pattern found: Radhika
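As a small complement, the sketch below contrasts `re.search` with `re.match`, which only tries to match at the start of the string. The variable names are illustrative.

```python
import re

text = "Hello, my name is Radhika."

# re.search scans the whole string; re.match only tries position 0.
found_anywhere = re.search(r"Radhika", text)
found_at_start = re.match(r"Radhika", text)

print(found_anywhere.span())  # (18, 25)
print(found_at_start)         # None
```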


### Replacing Patterns
To replace patterns in text, use the `re.sub` function.

In [120]:
text = "Hello, my name is Radhika."
pattern = r"Radhika"
replacement = "Pandit"
new_text = re.sub(pattern, replacement, text)
print(new_text)


Hello, my name is Pandit.
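By default `re.sub` replaces every occurrence. A minimal sketch of two related options, using a made-up sentence: the `count` argument limits the number of replacements, and `re.subn` also reports how many were made.

```python
import re

text = "Radhika met Radhika's friend."

# count=1 limits the substitution to the first occurrence.
first_only = re.sub(r"Radhika", "Pandit", text, count=1)

# re.subn returns the new string together with the number of replacements.
all_replaced, n_replaced = re.subn(r"Radhika", "Pandit", text)

print(first_only)    # Pandit met Radhika's friend.
print(all_replaced)  # Pandit met Pandit's friend.
print(n_replaced)    # 2
```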


### Splitting Strings
To split strings based on patterns, use the `re.split` function.

In [121]:
text = "Hello, my name is Radhika."
pattern = r"\s+"  # Split by whitespace
tokens = re.split(pattern, text)
print(tokens)

['Hello,', 'my', 'name', 'is', 'Radhika.']
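Splitting on whitespace leaves punctuation attached to the tokens (`'Hello,'`, `'Radhika.'`). One possible variation, shown as a sketch, splits on runs of non-word characters instead and drops the empty string that a trailing delimiter produces.

```python
import re

text = "Hello, my name is Radhika."

# \W+ matches runs of non-word characters (spaces and punctuation here).
# The trailing "." yields an empty final piece, so empty strings are filtered out.
tokens = [tok for tok in re.split(r"\W+", text) if tok]
print(tokens)  # ['Hello', 'my', 'name', 'is', 'Radhika']
```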


## Text Preprocessing Examples
### Removing Punctuation
Use regex to remove punctuation from text.

In [122]:
import re

text = "Hello, my name is Radhika!"
pattern = r"[^\w\s]"
clean_text = re.sub(pattern, "", text)
print(clean_text)

Hello my name is Radhika
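One detail worth knowing about `[^\w\s]`: `\w` includes the underscore, so underscores survive this cleanup. The sketch below, on a made-up string, shows the difference when `_` is added as an extra alternative.

```python
import re

text = "snake_case, kebab-case & more!"

# \w covers letters, digits and "_", so the underscore is kept.
kept_underscore = re.sub(r"[^\w\s]", "", text)

# Adding "_" as an alternative removes it as well.
no_underscore = re.sub(r"[^\w\s]|_", "", text)

print(repr(kept_underscore))
print(repr(no_underscore))
```

Note that removing `&` leaves two adjacent spaces behind; whitespace normalization takes care of that.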


### Removing Digits
Use regex to remove digits from text.

In [123]:
text = "I have 2 apples and 3 bananas."
pattern = r"\d"
clean_text = re.sub(pattern, "", text)
print(clean_text)

I have  apples and  bananas.
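Notice the double spaces left where the digits used to be. A minimal sketch of one way to tidy this up is to remove whole runs of digits with `\d+` and then collapse the leftover whitespace.

```python
import re

text = "I have 2 apples and 3 bananas."

# Remove runs of digits, then collapse the gaps they leave behind.
without_digits = re.sub(r"\d+", "", text)
tidy = re.sub(r"\s+", " ", without_digits).strip()

print(tidy)  # I have apples and bananas.
```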


### Extracting Email Addresses
Use regex to extract email addresses from text.

In [127]:
text = "Please contact us at support@example.com for further information."
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"  # no "|" inside [...]: it would match a literal pipe
emails = re.findall(pattern, text)
print(emails)

['support@example.com']
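`re.findall` returns every non-overlapping match, so the same idea works on text with several addresses. The sample sentence below is made up for illustration, and the pattern is a simplified sketch rather than a full email grammar.

```python
import re

text = "Write to support@example.com or sales@example.org. Not an address: user@localhost"
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

emails = re.findall(pattern, text)
print(emails)  # ['support@example.com', 'sales@example.org']
```

The sentence-ending period after `sales@example.org` is not captured, and `user@localhost` is skipped because it has no dot followed by letters.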


### Normalizing Whitespace
Use regex to normalize whitespace by replacing each run of whitespace (spaces, tabs, newlines) with a single space.

In [128]:
text = "This  is    an example  text."
pattern = r"\s+"
normalized_text = re.sub(pattern, " ", text).strip()
print(normalized_text)

This is an example text.


## Conclusion
Regular expressions are a versatile tool for text preprocessing, allowing for complex pattern matching and manipulation. With regex, you can clean and prepare your text data efficiently, making it suitable for various NLP tasks.
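The individual steps above can be chained into one helper. This is only a sketch: the function name `preprocess` and the order of the steps are assumptions, and a real pipeline would depend on the task.

```python
import re

def preprocess(text: str) -> str:
    """Apply the cleaning steps from this section in sequence."""
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    text = re.sub(r"\d+", "", text)           # drop digits
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(preprocess("Hello!!  I have 2 apples,   and 3 bananas."))
# Hello I have apples and bananas
```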

In [112]:
import re

text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
123abc

Hello HelloHello

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

utexas.edu

321-555-4321
123.555.1234

daniel-mitchell@utexas.edu

Mr. Vidyadhar
Mr Bendre
Ms Akshara
Mrs. Poornima
Mr. Atharva
'''

## Searching literals

In [114]:
pattern = re.compile(r'abc')
pattern

re.compile(r'abc', re.UNICODE)

In [115]:
matches = pattern.finditer(text_to_search)
matches

<callable_iterator at 0x107f3b4c0>

In [116]:
for mat in matches:
    print(mat)

<re.Match object; span=(1, 4), match='abc'>
<re.Match object; span=(69, 72), match='abc'>


In [117]:
print(text_to_search[69:72])

abc


In [111]:
pattern = re.compile(r'cba')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)
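The cell above prints nothing because a literal pattern must appear in exactly that order, and `cba` never occurs in the text. A small self-contained sketch on a made-up sample string makes the empty result explicit.

```python
import re

sample = "abcdef 123abc"

# Literal patterns are order-sensitive: "abc" is present, "cba" is not.
forward = list(re.finditer(r"abc", sample))
backward = list(re.finditer(r"cba", sample))

print(len(forward))   # 2
print(backward)       # []
```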

## Searching special characters

In [67]:
pattern = re.compile(r'.')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
<re.Match object; span=(23, 24), match='w'>
... (output truncated)

In [68]:
pattern = re.compile(r'\.')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(129, 130), match='.'>
<re.Match object; span=(164, 165), match='.'>
<re.Match object; span=(186, 187), match='.'>
<re.Match object; span=(190, 191), match='.'>
<re.Match object; span=(219, 220), match='.'>
<re.Match object; span=(227, 228), match='.'>
<re.Match object; span=(263, 264), match='.'>
<re.Match object; span=(276, 277), match='.'>


In [69]:
pattern = re.compile(r'\d')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(60, 61), match='6'>
<re.Match object; span=(61, 62), match='7'>
<re.Match object; span=(62, 63), match='8'>
<re.Match object; span=(63, 64), match='9'>
<re.Match object; span=(64, 65), match='0'>
<re.Match object; span=(66, 67), match='1'>
<re.Match object; span=(67, 68), match='2'>
<re.Match object; span=(68, 69), match='3'>
<re.Match object; span=(170, 171), match='3'>
<re.Match object; span=(171, 172), match='2'>
<re.Match object; span=(172, 173), match='1'>
<re.Match object; span=(174, 175), match='5'>
<re.Match object; span=(175, 176), match='5'>
<re.Match object; span=(176, 177), match='5'>
<re.Match object; span=(178, 179), match='4'>
<re.Match object; span=(179, 180), match='3'>
<re.Match object; span=(180, 181), match='2'>
... (output truncated)

In [70]:
pattern = re.compile(r'\D')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='b'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='d'>
<re.Match object; span=(5, 6), match='e'>
<re.Match object; span=(6, 7), match='f'>
<re.Match object; span=(7, 8), match='g'>
<re.Match object; span=(8, 9), match='h'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='j'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='m'>
<re.Match object; span=(14, 15), match='n'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match='p'>
<re.Match object; span=(17, 18), match='q'>
<re.Match object; span=(18, 19), match='u'>
<re.Match object; span=(19, 20), match='r'>
<re.Match object; span=(20, 21), match='t'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(22, 23), match='v'>
... (output truncated)

In [71]:
pattern = re.compile(r'\d\w')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(55, 57), match='12'>
<re.Match object; span=(57, 59), match='34'>
<re.Match object; span=(59, 61), match='56'>
<re.Match object; span=(61, 63), match='78'>
<re.Match object; span=(63, 65), match='90'>
<re.Match object; span=(66, 68), match='12'>
<re.Match object; span=(68, 70), match='3a'>
<re.Match object; span=(170, 172), match='32'>
<re.Match object; span=(174, 176), match='55'>
<re.Match object; span=(178, 180), match='43'>
<re.Match object; span=(180, 182), match='21'>
<re.Match object; span=(183, 185), match='12'>
<re.Match object; span=(187, 189), match='55'>
<re.Match object; span=(191, 193), match='12'>
<re.Match object; span=(193, 195), match='34'>


In [72]:
pattern = re.compile(r'\d\s')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(64, 66), match='0\n'>
<re.Match object; span=(181, 183), match='1\n'>
<re.Match object; span=(194, 196), match='4\n'>


## Word boundary

In [74]:
# Hello HelloHello
pattern = re.compile(r'Hello')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(74, 79), match='Hello'>
<re.Match object; span=(80, 85), match='Hello'>
<re.Match object; span=(85, 90), match='Hello'>


In [75]:
pattern = re.compile(r'Hello\b')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(74, 79), match='Hello'>
<re.Match object; span=(85, 90), match='Hello'>


In [76]:
pattern = re.compile(r'\bHello\b')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(74, 79), match='Hello'>


In [77]:
pattern = re.compile(r'\BHello\b')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(85, 90), match='Hello'>
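To see `\b` and `\B` on a smaller example, here is a sketch using a made-up sentence. `\b` matches at the edge between a word character and a non-word character; `\B` matches everywhere else.

```python
import re

sample = "cat concat category cat."

whole_word = re.findall(r"\bcat\b", sample)   # "cat" standing alone
word_start = re.findall(r"\bcat\w*", sample)  # words beginning with "cat"
word_end = re.findall(r"\Bcat\b", sample)     # "cat" ending a longer word

print(whole_word)  # ['cat', 'cat']
print(word_start)  # ['cat', 'category', 'cat']
print(word_end)    # ['cat']
```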


In [78]:
pattern = re.compile(r'\b\d')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(66, 67), match='1'>
<re.Match object; span=(170, 171), match='3'>
<re.Match object; span=(174, 175), match='5'>
<re.Match object; span=(178, 179), match='4'>
<re.Match object; span=(183, 184), match='1'>
<re.Match object; span=(187, 188), match='5'>
<re.Match object; span=(191, 192), match='1'>


In [79]:
pattern = re.compile(r'^\s')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(0, 1), match='\n'>
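By default `^` only matches at the very start of the string, which is why the cell above finds a single match. With the `re.MULTILINE` flag, `^` and `$` also match at the start and end of each line. A minimal sketch on a made-up three-line string:

```python
import re

sample = "first line\nsecond line\nthird line"

without_flag = re.findall(r"^\w+", sample)
with_flag = re.findall(r"^\w+", sample, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", sample, flags=re.MULTILINE)

print(without_flag)  # ['first']
print(with_flag)     # ['first', 'second', 'third']
print(line_ends)     # ['line', 'line', 'line']
```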


## Character sets

In [81]:
pattern = re.compile(r'[123]\w')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(55, 57), match='12'>
<re.Match object; span=(57, 59), match='34'>
<re.Match object; span=(66, 68), match='12'>
<re.Match object; span=(68, 70), match='3a'>
<re.Match object; span=(170, 172), match='32'>
<re.Match object; span=(179, 181), match='32'>
<re.Match object; span=(183, 185), match='12'>
<re.Match object; span=(191, 193), match='12'>
<re.Match object; span=(193, 195), match='34'>


In [82]:
pattern = re.compile(r'[a-z][a-z]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(1, 3), match='ab'>
<re.Match object; span=(3, 5), match='cd'>
<re.Match object; span=(5, 7), match='ef'>
<re.Match object; span=(7, 9), match='gh'>
<re.Match object; span=(9, 11), match='ij'>
<re.Match object; span=(11, 13), match='kl'>
<re.Match object; span=(13, 15), match='mn'>
<re.Match object; span=(15, 17), match='op'>
<re.Match object; span=(17, 19), match='qu'>
<re.Match object; span=(19, 21), match='rt'>
<re.Match object; span=(21, 23), match='uv'>
<re.Match object; span=(23, 25), match='wx'>
<re.Match object; span=(25, 27), match='yz'>
<re.Match object; span=(69, 71), match='ab'>
<re.Match object; span=(75, 77), match='el'>
<re.Match object; span=(77, 79), match='lo'>
<re.Match object; span=(81, 83), match='el'>
<re.Match object; span=(83, 85), match='lo'>
<re.Match object; span=(86, 88), match='el'>
<re.Match object; span=(88, 90), match='lo'>
<re.Match object; span=(93, 95), match='et'>
<re.Match object; span=(97, 99), match='ha'>
... (output truncated)

In [83]:
pattern = re.compile(r'[a-zA-Z0-9][a-zA-Z-]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(1, 3), match='ab'>
<re.Match object; span=(3, 5), match='cd'>
<re.Match object; span=(5, 7), match='ef'>
<re.Match object; span=(7, 9), match='gh'>
<re.Match object; span=(9, 11), match='ij'>
<re.Match object; span=(11, 13), match='kl'>
<re.Match object; span=(13, 15), match='mn'>
<re.Match object; span=(15, 17), match='op'>
<re.Match object; span=(17, 19), match='qu'>
<re.Match object; span=(19, 21), match='rt'>
<re.Match object; span=(21, 23), match='uv'>
<re.Match object; span=(23, 25), match='wx'>
<re.Match object; span=(25, 27), match='yz'>
<re.Match object; span=(28, 30), match='AB'>
<re.Match object; span=(30, 32), match='CD'>
<re.Match object; span=(32, 34), match='EF'>
<re.Match object; span=(34, 36), match='GH'>
<re.Match object; span=(36, 38), match='IJ'>
<re.Match object; span=(38, 40), match='KL'>
<re.Match object; span=(40, 42), match='MN'>
<re.Match object; span=(42, 44), match='OP'>
<re.Match object; span=(44, 46), match='QR'>
... (output truncated)

In [84]:
pattern = re.compile(r'[a-zA-Z][^a-zA-Z]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(26, 28), match='z\n'>
<re.Match object; span=(53, 55), match='Z\n'>
<re.Match object; span=(71, 73), match='c\n'>
<re.Match object; span=(78, 80), match='o '>
<re.Match object; span=(89, 91), match='o\n'>
<re.Match object; span=(105, 107), match='s '>
<re.Match object; span=(111, 113), match='d '>
<re.Match object; span=(114, 116), match='o '>
<re.Match object; span=(117, 119), match='e '>
<re.Match object; span=(125, 127), match='d)'>
<re.Match object; span=(163, 165), match='s.'>
<re.Match object; span=(167, 169), match='u\n'>
<re.Match object; span=(202, 204), match='l-'>
<re.Match object; span=(211, 213), match='l@'>
<re.Match object; span=(218, 220), match='s.'>
<re.Match object; span=(222, 224), match='u\n'>
<re.Match object; span=(226, 228), match='r.'>
<re.Match object; span=(237, 239), match='r\n'>
<re.Match object; span=(240, 242), match='r '>
<re.Match object; span=(247, 249), match='e\n'>
<re.Match object; span=(250, 252), match='s '>
... (output truncated)
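A common slip when writing character sets is typing `A-z` instead of `A-Z`. Ranges follow character codes, and in ASCII six symbols sit between `Z` and `a`, so `[A-z]` quietly matches them too. The sketch below demonstrates this on a string made up of exactly those symbols.

```python
import re

symbols = "[ \\ ] ^ _ `"  # the six ASCII characters between "Z" and "a"

too_wide = re.findall(r"[A-z]", symbols)
letters_only = re.findall(r"[A-Za-z]", symbols)

print(too_wide)      # ['[', '\\', ']', '^', '_', '`']
print(letters_only)  # []
```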

## Character groups

In [86]:
pattern = re.compile(r'(abc|edu|texas)\b')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(69, 72), match='abc'>
<re.Match object; span=(159, 164), match='texas'>
<re.Match object; span=(165, 168), match='edu'>
<re.Match object; span=(214, 219), match='texas'>
<re.Match object; span=(220, 223), match='edu'>


In [87]:
pattern = re.compile(r'([A-Z]|llo)[a-zA-Z]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(28, 30), match='AB'>
<re.Match object; span=(30, 32), match='CD'>
<re.Match object; span=(32, 34), match='EF'>
<re.Match object; span=(34, 36), match='GH'>
<re.Match object; span=(36, 38), match='IJ'>
<re.Match object; span=(38, 40), match='KL'>
<re.Match object; span=(40, 42), match='MN'>
<re.Match object; span=(42, 44), match='OP'>
<re.Match object; span=(44, 46), match='QR'>
<re.Match object; span=(46, 48), match='ST'>
<re.Match object; span=(48, 50), match='UV'>
<re.Match object; span=(50, 52), match='WX'>
<re.Match object; span=(52, 54), match='YZ'>
<re.Match object; span=(74, 76), match='He'>
<re.Match object; span=(80, 82), match='He'>
<re.Match object; span=(82, 86), match='lloH'>
<re.Match object; span=(92, 94), match='Me'>
<re.Match object; span=(96, 98), match='Ch'>
<re.Match object; span=(108, 110), match='Ne'>
<re.Match object; span=(225, 227), match='Mr'>
<re.Match object; span=(229, 231), match='Vi'>
<re.Match object; span=(239, 241), match='Mr'>


## Quantifiers

In [89]:
pattern = re.compile(r'Mr\.?\s[A-Z]')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(225, 230), match='Mr. V'>
<re.Match object; span=(239, 243), match='Mr B'>
<re.Match object; span=(274, 279), match='Mr. A'>


In [90]:
pattern = re.compile(r'Mr\.?\s[A-Z][a-z]*')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(225, 238), match='Mr. Vidyadhar'>
<re.Match object; span=(239, 248), match='Mr Bendre'>
<re.Match object; span=(274, 285), match='Mr. Atharva'>


In [91]:
pattern = re.compile(r'M(s|rs)\.?\s[A-Z][a-z]*')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(249, 259), match='Ms Akshara'>
<re.Match object; span=(260, 273), match='Mrs. Poornima'>


In [92]:
pattern = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(170, 182), match='321-555-4321'>
<re.Match object; span=(183, 195), match='123.555.1234'>


In [93]:
pattern = re.compile(r'[a-zA-Z0-9_]+\.[a-z]{3}')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(158, 168), match='utexas.edu'>
<re.Match object; span=(213, 223), match='utexas.edu'>


In [94]:
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat)

<re.Match object; span=(197, 223), match='daniel-mitchell@utexas.edu'>


## Accessing information in the Match object

In [96]:

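pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,4}')
matches = pattern.finditer(text_to_search)
for mat in matches:
    print(mat.span(0))   # (start, end) indices of the whole match
    print(mat.group(0))  # the matched text itself
    start, end = mat.span(0)
    print(text_to_search[start:end])  # slicing with the span gives the same text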

(197, 223)
daniel-mitchell@utexas.edu
daniel-mitchell@utexas.edu
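Besides `span()` and `group(0)`, a match object also exposes `start()`, `end()`, and named groups. The sketch below uses a simplified email pattern with the hypothetical group names `user` and `domain`, on a made-up sample string.

```python
import re

sample = "Contact: daniel-mitchell@utexas.edu"

# (?P<name>...) gives a group a name that can be used instead of its number.
pattern = re.compile(
    r"(?P<user>[a-zA-Z0-9_.+-]+)@(?P<domain>[a-zA-Z0-9-]+\.[a-zA-Z]{2,4})"
)
mat = pattern.search(sample)

print(mat.start(), mat.end())  # 9 35
print(mat.group("user"))       # daniel-mitchell
print(mat.group("domain"))     # utexas.edu
print(mat.groupdict())         # {'user': 'daniel-mitchell', 'domain': 'utexas.edu'}
```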


In [97]:
urls = r'''
https://www.google.com
http://yahoo.com
https://www.whitehouse.gov
https://craigslist.org
'''

In [98]:
pattern = re.compile(r'https?://(www\.)?\w+\.\w+')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat)

<re.Match object; span=(1, 23), match='https://www.google.com'>
<re.Match object; span=(24, 40), match='http://yahoo.com'>
<re.Match object; span=(41, 67), match='https://www.whitehouse.gov'>
<re.Match object; span=(68, 90), match='https://craigslist.org'>


In [99]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat.group(2)+mat.group(3))

google.com
yahoo.com
whitehouse.gov
craigslist.org


In [100]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for mat in matches:
    print(mat.group(0))
    print(urls[mat.span(2)[0]:mat.span(2)[1]]+urls[mat.span(3)[0]:mat.span(3)[1]])

https://www.google.com
google.com
http://yahoo.com
yahoo.com
https://www.whitehouse.gov
whitehouse.gov
https://craigslist.org
craigslist.org
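The same groups can also be used in a substitution. In the replacement string, `\2` and `\3` refer back to the second and third groups, so the sketch below rewrites each URL as just its domain. The `urls` string is repeated so that the block runs on its own.

```python
import re

urls = """
https://www.google.com
http://yahoo.com
https://www.whitehouse.gov
https://craigslist.org
"""

pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

# Replace each full URL with group 2 (name) followed by group 3 (top-level domain).
domains = pattern.sub(r"\2\3", urls)
print(domains)
```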
