<span style="font-size:30px; font-family:Courier;"> Project: Text Pattern Matching with Regular Expressions </span>


<font size="5">**RegEx Quick Reference**</font>

<table>
  <tr>
    <th style="font-size:20px">Pattern</th>
    <th style="font-size:20px">Description</th>
  </tr>
  <tr>
    <td style="font-size:18px"><b>\d</b></td>
    <td style="font-size:18px">Any numeric digit from 0 to 9</td>
  </tr>
  <tr>
    <td style="font-size:18px"><b>\D</b></td>
    <td style="font-size:18px">Any character that is not a digit</td>
  </tr>
    <tr>
    <td style="font-size:18px"><b>\w</b></td>
    <td style="font-size:18px">Any letter, numeric digit, or the underscore character (think of this as matching “word” characters)</td>
  </tr>
  <tr>
    <td style="font-size:18px"><b>\W</b></td>
    <td style="font-size:18px">Any character that is not a letter, numeric digit, or the underscore character</td>
  </tr>
   <tr>
    <td style="font-size:18px"><b>\s</b></td>
    <td style="font-size:18px">Any space, tab, or newline character (Think of this as matching “space” characters)</td>
  </tr>
     <tr>
    <td style="font-size:18px"><b>\S</b></td>
    <td style="font-size:18px">Any character that is not a space, tab, or newline character</td>
  </tr>
</table>
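
As a quick check of the table above, the sketch below runs a few of the shorthand classes against one sample string with `re.findall`. The sample string and the variable names are illustrative assumptions, not part of the original notebook.

```python
import re

sample = 'Room_7 is 2 floors up!'  # hypothetical sample text

digits = re.findall(r'\d', sample)     # single numeric digits
non_word = re.findall(r'\W', sample)   # characters that are not letters, digits, or underscore
words = re.findall(r'\w+', sample)     # runs of word characters
chunks = re.findall(r'\S+', sample)    # runs of non-space characters

print(digits)
print(non_word)
print(words)
print(chunks)
```

Note that `\w+` keeps the underscore inside `Room_7` but drops the trailing `!`, while `\S+` keeps both.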

<span style="font-size:30px; font-family:Courier;"> 9. The Syntax of Regular Expressions </span>

<span style="font-size:30px; font-family:Courier;"> 9.1 Finding Text Patterns with Regular Expressions </span>

In [13]:
import re
pattern = re.compile(r'\d{3}-\d{3}-\d{3}')
match = pattern.search('My number is 415-555-424')
match.group()

'415-555-424'

<span style="font-size:30px; font-family:Courier;"> 9.2.1. Grouping with Parentheses </span>

In [21]:
import re
pattern = re.compile (r'(\d{5})\s(\d\d)\s(\d{3})')
match = pattern.search ("My phone number is 98493 92 897")
print('First part of the phone number %s' %(match.group(1)))
print('Second part of the phone number %s' % (match.group(2)))
print(match.group(3))

First part of the phone number 98493
Second part of the phone number 92
897


In [95]:
import re
pattern = re.compile (r'(\d{5})\s\d\d\s\d{3}')
match = pattern.findall ("My phone number is 98493 92 897")
print(match)

['98493']
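
When a pattern contains more than one group, `findall` returns a list of tuples (one tuple per match) rather than a list of strings. A minimal sketch reusing the three-group phone pattern from above; the second number in the text is made up for illustration.

```python
import re

pattern = re.compile(r'(\d{5})\s(\d\d)\s(\d{3})')
text = 'Call 98493 92 897 or 12345 67 890'  # second number is a made-up example

matches = pattern.findall(text)  # one tuple of three strings per match
print(matches)
for first, second, third in matches:
    print(first, second, third)
```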


<span style="font-size:30px; font-family:Courier;"> 9.2.2. Using Escape Characters </span>

In [1]:
import re
pattern = re.compile (r'(\(\d{5}\))\s(\(\d\d\))\s(\(\d{3}\))')
match = pattern.search ("My phone number is (98493) (92) (897)")
print('First part of the phone number %s' %(match.group(1)))
print('Second part of the phone number %s' % (match.group(2)))
print(match.group(3))

First part of the phone number (98493)
Second part of the phone number (92)
(897)


<span style="font-size:30px; font-family:Courier;"> 9.2.3. Matching Characters from Alternate Groups </span>

In [15]:
import re
pattern = re.compile (r'(Beijing|New York|Paris|Moscow|London)')
match = pattern.search('I am going to Paris tomorrow. Tom is arriving from London')
print(match.group(1))

Paris


<span style="font-size:30px; font-family:Courier;"> 9.2.4. Returning All Matches </span>

In [11]:
import re
pattern = re.compile (r'(Beijing|New York|Paris|Moscow|London)')
match = pattern.findall('Xin is flying from Beijing. I am going to Paris tomorrow. Tom is arriving from London. Pavel is from Moscow')
for city in match:
    print (city)

Beijing
Paris
London
Moscow


<span style="font-size:30px; font-family:Courier;"> 10. Qualifier Syntax: What Characters to Match </span>

<span style="font-size:30px; font-family:Courier;"> 10.1. Using Character Classes and Negative Character Classes </span>

In [17]:
import re
vowel_pattern = re.compile(r'[aeiouAEIOU]')
match = vowel_pattern.findall('Humpty Dumpty sat on a wall. Humpty Dumpty had a great fall. All the king\'s horses and all the king\'s men')
# match = vowel_pattern.findall('XXbgt')
if (len(match) == 0):
    print ("No vowels found")
else:
    print(match)

['u', 'u', 'a', 'o', 'a', 'a', 'u', 'u', 'a', 'a', 'e', 'a', 'a', 'A', 'e', 'i', 'o', 'e', 'a', 'a', 'e', 'i', 'e']


In [29]:
non_vowel_pattern = re.compile(r'[^aeiouAEIOU ]') # any character that is not a vowel or a space
# match = non_vowel_pattern.findall('Humpty Dumpty sat on a wall. Humpty Dumpty had a great fall. All the king\'s horses and all the king\'s men')
match = non_vowel_pattern.findall('I LovE YOu')
if (len(match) == 0):
    print ("No consonant found")
else:
    print(match)

['L', 'v', 'Y']
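
One caveat with the negated class above: it keeps every character that is not a vowel or a space, so digits and punctuation slip through as well. The sketch below contrasts it with a class that lists consonant ranges explicitly. The sample text and the choice to treat `y` as a consonant are assumptions made for illustration.

```python
import re

text = "It's 5 o'clock!"  # hypothetical sample text

# Negated class: everything except vowels and spaces, including ' 5 and !
not_vowel = re.findall(r'[^aeiouAEIOU ]', text)

# Explicit ranges: letters only, skipping a, e, i, o, u (y is counted as a consonant here)
consonants = re.findall(r'[b-df-hj-np-tv-z]', text, re.IGNORECASE)

print(not_vowel)
print(consonants)
```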


<span style="font-size:30px; font-family:Courier;"> 10.2. Using Shorthand Character Classes </span>

In [72]:
pattern = re.compile(r'\w+\s+(\d+)')
match = pattern.findall('Miller  56, Arthur 66, Lewis 88, John#33')
print("Age values are: %s" % match)

Age values are: ['56', '66', '88']


In [57]:
pattern = re.compile (r'First Name: (\w+)')
match = pattern.findall ('First Name: Sabyasachi Age: 48')
print(match)

['Sabyasachi']


In [None]:
pattern = re.compile (r'Last Name: (\w+\s\D+)')
match = pattern.findall('Last Name: Sinéad O’Connor')
print(match)
pattern = re.compile (r'Last Name: (\S+\s\D+)')
match = pattern.findall('Last Name: Jean-Paul Sartre')
print(match)

['Sinéad O’Connor']
['Jean-Paul Sartre']


<span style="font-size:30px; font-family:Courier;"> 10.3. Matching Everything with the Dot Character </span>

In [55]:
at_re = re.compile(r'.at')
match = at_re.findall('cat on the hat sat on the mat')
print(match)


['cat', 'hat', 'sat', 'mat']


<span style="font-size:30px; font-family:Courier;"> 10.4. Matching an Optional Pattern </span>

In [None]:
import re
pattern = re.compile(r'(\+\d{2}\s)?\d{5}\s\d{5}')
match = pattern.search("My phone number +91 98493 92897") # matches with the optional country code
print(match.group())
match = pattern.search("My phone number 98493 92897") # also matches without the country code
print(match.group())

+91 98493 92897
98493 92897
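
When the optional group does not take part in the match, `match.group(1)` returns `None` rather than an empty string, which is easy to trip over. A small sketch, assuming a second capturing group is added around the local number purely for illustration:

```python
import re

# Second group added here for illustration; the original pattern has only the optional group.
pattern = re.compile(r'(\+\d{2}\s)?(\d{5}\s\d{5})')

results = []
for text in ('My phone number +91 98493 92897', 'My phone number 98493 92897'):
    match = pattern.search(text)
    # group(1) is None when the optional country code is absent
    results.append((match.group(1), match.group(2)))

print(results)
```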


<span style="font-size:30px; font-family:Courier;"> 10.5. Matching Zero or More Qualifiers </span>

In [None]:
import re
pattern = re.compile(r'Its about you(, you and you)*')
match = pattern.search('Its about you')
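print(match.group())
match = pattern.search('Its about you, you and you, you and you')
print(match.group())
print(pattern.findall('Its about you, you and you, you and you')) # findall returns only the last repetition captured by the group

Its about you
Its about you, you and you, you and you
[', you and you']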


<span style="font-size:30px; font-family:Courier;"> 10.5. Matching One or More Qualifiers </span>

In [31]:
import re
pattern = re.compile (r'Eggs( and spam)+')
match = pattern.search('Eggs and spam and spam')
print(match.group())


Eggs and spam and spam


<span style="font-size:30px; font-family:Courier;"> 10.6. Matching a Specific Number of Qualifiers </span>

In [None]:
import re
pattern = re.compile(r'(Ha\s?){1,3}')
match = pattern.search('Ha Ha Ha')
print(match.group()) # the whole match: all three Ha
print(pattern.findall('Ha Ha Ha')) # only the last repetition captured by the group: one Ha

Ha Ha Ha
['Ha']

<span style="font-size:30px; font-family:Courier;"> 10.7. Non-capturing Groups </span>

<span style="font-size:30px; font-family:Courier;"> A non-capturing group tells the regex engine: "I need these parentheses to group my logic together, but do not capture the text inside them as a separate result." </span>

In [12]:
import re
pattern = re.compile(r'(?:Ha\s?){1,3}')
match = pattern.search('Ha Ha Ha')
pattern.findall('Ha Ha Ha') # returns the whole match, since the group captures nothing

['Ha Ha Ha']
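
Non-capturing groups are also handy when parentheses are needed for alternation but only part of the match should be returned. In the sketch below, the titles are grouped with `(?:...)` so that `findall` returns just the captured surname. The sample sentence is a made-up example.

```python
import re

text = 'Dr. Watson met Mr. Holmes and Ms. Adler'  # made-up sample sentence

# (?:...) groups the alternatives without capturing; (\w+) captures the surname
pattern = re.compile(r'(?:Mr|Ms|Dr)\.\s(\w+)')
surnames = pattern.findall(text)
print(surnames)
```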

<span style="font-size:30px; font-family:Courier;">11. Greedy and Non-greedy Matching </span>

<span style="font-size:30px; font-family:Courier;"> 11.1. Lazy pattern </span>

In [4]:
import re
pattern = re.compile(r'(Ha\s?){1,3}?')
match = pattern.search('Ha Ha Ha')
match.group() # returns a single 'Ha ': the lazy quantifier stops at the minimum of one repetition
pattern.findall('Ha Ha Ha') # returns three separate matches, one Ha each

['Ha ', 'Ha ', 'Ha']
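
The difference between greedy and lazy matching is often easiest to see with delimiters. A minimal sketch, using a made-up string with angle-bracket tags: the greedy `.+` runs to the last `>`, while the lazy `.+?` stops at the first one.

```python
import re

text = '<b>bold</b> and <i>italic</i>'  # made-up sample

greedy = re.findall(r'<.+>', text)   # one match spanning the first '<' to the last '>'
lazy = re.findall(r'<.+?>', text)    # one match per tag

print(greedy)
print(lazy)
```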

<span style="font-size:30px; font-family:Courier;"> 11.2. Matching Everything </span>

In [18]:
import re
pattern = re.compile(r'First Name: (.*) Age: (.*)')
match = pattern.search('First Name: Sabyasachi Age: 48')
print(match.group(1))
print(match.group(2))

Sabyasachi
48


<span style="font-size:30px; font-family:Courier;">  11.3. Matching any character including Newline Characters </span>

In [35]:
import re
pattern = re.compile(r'.*', re.DOTALL)
match = pattern.search('This is an incredible success\n The team worked very hard\n Client is happy.')
print(match.group())

This is an incredible success
 The team worked very hard
 Client is happy.


In [None]:
import re
text = '''<!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>
<p>
This tutorial is provided by CodeSpeedy.
Hope you like this.
</p>
</html>'''
pattern = "<p>.*</p>"
match = re.findall(pattern,text)
print(match) # returns an empty list because . does not match newline characters by default

[]


In [None]:
import re
text = '''<!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>
<p>
This tutorial is provided by CodeSpeedy.
Hope you like this.
</p>
</html>'''
pattern = "<p>.*</p>"
match = re.findall(pattern,text, re.DOTALL) # will match due to re.DOTALL
print(match)

['<p>\nThis tutorial is provided by CodeSpeedy.\nHope you like this.\n</p>']
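
With `re.DOTALL` the greedy `.*` can run across several paragraphs at once. If the text held more than one `<p>` block, a lazy `.*?` would be one way to get each block separately. A sketch under that assumption, with a made-up two-paragraph string:

```python
import re

text = '<p>\nfirst\n</p>\n<p>\nsecond\n</p>'  # made-up sample with two paragraphs

greedy = re.findall(r'<p>.*</p>', text, re.DOTALL)   # one match covering both paragraphs
lazy = re.findall(r'<p>.*?</p>', text, re.DOTALL)    # one match per paragraph

print(len(greedy), len(lazy))
print(lazy)
```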


<span style="font-size:30px; font-family:Courier;">  12. Matching at the Start and End of a String </span>

In [10]:
import re
pattern = re.compile(r'^Hello .*')
match = pattern.search('students Hello')
if match is not None:
    print(match.group())
else:
    print ('Not found!')


Not found!


In [11]:
import re
pattern = re.compile(r'.*Bye$')
match = pattern.search('students Bye')
if match is not None:
    print(match.group())
else:
    print ('Not found!')

students Bye


In [13]:
import re
alpha_num = re.compile(r'^(\D+)(\d+)$')
match = alpha_num.search('Abcd4667')
if match is None:
    print ('Not Alpha Numeric')
else:
    print(match.group(1))
    print(match.group(2))
  


Abcd
4667
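
Anchoring both ends turns the pattern into a whole-string check, which makes it usable as a small validator. The helper below is a hypothetical wrapper (its name and return convention are assumptions) around the same `^(\D+)(\d+)$` pattern.

```python
import re

ALPHA_NUM = re.compile(r'^(\D+)(\d+)$')

def split_alpha_num(text):
    """Return (prefix, digits) if text is non-digits followed by digits, else None."""
    match = ALPHA_NUM.search(text)
    return match.groups() if match else None

for candidate in ('Abcd4667', '4667Abcd', 'Abcd', 'Ab12cd34'):
    print(candidate, '->', split_alpha_num(candidate))
```

Only the first candidate passes; the others either start with a digit, have no digits, or have non-digits after the digit run.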


<span style="font-size:30px; font-family:Courier;"> 13. Word Boundary</span>

In [None]:
import re
wb_pattern = re.compile(r'\bcat\b')
wb_pattern.findall('catastrophe') # no match - cat is followed by a word character, so there is no boundary on the right
wb_pattern = re.compile(r'\bcat')
wb_pattern.findall('catastrophe') # match - there is a word boundary before cat; the right side is unconstrained
wb_pattern = re.compile(r'cat\b')
wb_pattern.findall('pussycat') # match - there is a word boundary after cat; the left side is unconstrained

['cat']

<span style="font-size:30px; font-family:Courier;"> 13. Non-Word Boundary</span>

In [10]:
import re
wnb_pattern = re.compile(r'\Bcat\B')
match = wnb_pattern.findall('Ccatcertificate') # both occurrences match: each cat has a word character on either side (C and c, i and e)
print(match)

['cat', 'cat']
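
To see the two boundary markers side by side, the sketch below counts occurrences of `cat` in one made-up sentence three ways: anywhere, as a whole word (`\b` on both sides), and strictly inside a longer word (`\B` on both sides).

```python
import re

text = 'The cat sat near a catalog; concatenate a cat.'  # made-up sample

anywhere = re.findall(r'cat', text)        # every occurrence
whole_word = re.findall(r'\bcat\b', text)  # cat as a standalone word
inside = re.findall(r'\Bcat\B', text)      # cat with word characters on both sides

print(len(anywhere), len(whole_word), len(inside))
```

Here `catalog` is counted only by the first pattern, because `cat` sits at the start of the word: it has a boundary on the left but none on the right.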


<span style="font-size:30px; font-family:Courier;"> 14. Case-Insensitive Matching </span>

In [None]:
# case-sensitive search
import re
case_pattern = re.compile(r'SABYASACHI\s\w+') # pattern is case sensitive
if (case_pattern.search('Sabyasachi Mitra') != None):
    match = case_pattern.search('Sabyasachi Mitra') # not reached: the mixed-case string does not match the upper-case pattern
    print(match.group())
else:
    print('will not match since search lower case')
#
if (case_pattern.search('SABYASACHI MITRA') != None):
    match = case_pattern.search('SABYASACHI MITRA') # matches since the search string is upper case
    print('will match since search string is upper case')
    print(match.group())


will not match since search lower case
will match since search string is upper case
SABYASACHI MITRA


In [21]:
# case-insensitive search
import re
case_pattern = re.compile(r'SABYASACHI\s\w+', re.IGNORECASE) # pattern is case insensitive
if (case_pattern.search('Sabyasachi Mitra') != None):
    print('Will match since pattern is case insensitive')
    match = case_pattern.search('Sabyasachi Mitra') # matches regardless of letter case
    print(match.group())
else:
    print('will not match since search lower case')

Will match since pattern is case insensitive
Sabyasachi Mitra
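
Flags can be combined with the bitwise OR operator when more than one is needed. A minimal sketch, using a made-up two-line string, that needs both `re.IGNORECASE` and `re.DOTALL` before the pattern matches:

```python
import re

text = 'Hello\nWORLD'  # made-up sample with a newline in the middle

plain = re.search(r'hello.*world', text)                                # fails: case differs
ignore_case_only = re.search(r'hello.*world', text, re.IGNORECASE)      # fails: . stops at newline
both = re.search(r'hello.*world', text, re.IGNORECASE | re.DOTALL)      # succeeds

print(plain, ignore_case_only)
print(repr(both.group()))
```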


<span style="font-size:30px; font-family:Courier;"> 15. Substituting Strings </span>

In [69]:
import re
text = 'My Credit Card number: 765432222'
result = re.sub(r'\d', '#',text)
print(result)

My Credit Card number: #########


In [12]:
import re
text = 'My Credit Card number: 7654-322-22'
pattern = r"(\d{4})(-)(\d{3})(-)(\d{2})"
result = re.sub(pattern, r'XXXX\2XXX\4\5', text)
print(result)

My Credit Card number: XXXX-XXX-22


In [13]:
import re
pattern = r'(.*-)(\d+)' # the greedy .* also consumes the 'My Credit Card number: ' prefix, so the prefix is replaced too
text = 'My Credit Card number: 7654-322-22'
result = re.sub(pattern, r'XXXX-XXX-\2', text)
print(result)
                

XXXX-XXX-22


In [None]:
import re
text = 'Email: sabyasachikgp@gmail.com'
pattern = r'(\w+)(@)(.*)'
result = re.sub(pattern,r'XXXXXXXX\2\3', text)
print(result)

Email: XXXXXXXX@gmail.com
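
The replacement above always writes eight `X` characters regardless of how long the name is. `re.sub` also accepts a function as the replacement, which allows the mask to follow the length of the matched text. A sketch, where `mask_local_part` is a hypothetical helper name:

```python
import re

def mask_local_part(match):
    # One X per character of the part before '@'; keep '@' and the domain unchanged
    return 'X' * len(match.group(1)) + match.group(2) + match.group(3)

text = 'Email: sabyasachikgp@gmail.com'
result = re.sub(r'(\w+)(@)(.*)', mask_local_part, text)
print(result)
```

The masked string has the same length as the original, which may or may not be desirable depending on how much the mask should reveal.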


<span style="font-size:30px; font-family:Courier;"> 16. Substituting Strings - Backreferences </span>

In [None]:
import re
date = '3/24/2025'
pattern = r'(\d{1,2})/(\d{1,2})/(\d{4})'
result = re.sub(pattern, r'\2/\1/\3', date) # backreferences \2 and \1 swap the month and day positions
print(result)

24/3/2025
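
Numbered backreferences become hard to read once a pattern has several groups. One alternative is to name the groups with `(?P<name>...)` and refer to them as `\g<name>` in the replacement. The sketch below applies that idea to the same date pattern; the group names and the sample text are assumptions chosen for illustration.

```python
import re

date_pattern = re.compile(r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})')
text = 'Start: 3/24/2025, end: 12/1/2025'  # made-up sample with two dates

# \g<name> refers back to the named group, here swapping month and day
result = date_pattern.sub(r'\g<day>/\g<month>/\g<year>', text)
print(result)
```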

<style>
    .regex-container {
        font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
        margin: 20px 0;
        background-color: #1a252f; /* Background for the title area */
        padding: 10px;
        border-radius: 8px 8px 0 0;
    }
    .regex-table {
        border-collapse: collapse;
        width: 100%;
        background-color: white;
        border: 1px solid #ccc;
    }
    .regex-table th {
        background-color: #2c3e50;
        color: white;
        text-align: left;
        padding: 12px;
        font-size: 1.1em;
    }
    .regex-table td {
        padding: 12px;
        border-bottom: 1px solid #ddd;
        vertical-align: middle;
        color: #000000 !important; 
        font-size: 16px;
        line-height: 1.5;
    }
    .regex-table tr:nth-child(even) {
        background-color: #f2f2f2;
    }
    .symbol {
        font-family: 'Courier New', monospace;
        font-weight: bold;
        color: #b11b5e;
        background-color: #ffe8f0;
        padding: 3px 8px;
        border-radius: 4px;
        white-space: nowrap;
        border: 1px solid #ffcada;
    }
    /* Specifically targeting the heading to be white */
    .white-heading {
        color: #ffffff !important;
        margin: 10px;
        font-weight: bold;
    }
</style>

<div class="regex-container">
    <h2 class="white-heading" style="text-align: center;">A Review of RegEx Symbols</h2>
    <table class="regex-table">
        <thead>
            <tr>
                <th style="width: 35%;">Symbol</th>
                <th>Description</th>
            </tr>
        </thead>
        <tbody>
            <tr><td><span class="symbol">?</span></td><td>Matches <b>zero or one</b> instance of the preceding qualifier.</td></tr>
            <tr><td><span class="symbol">*</span></td><td>Matches <b>zero or more</b> instances of the preceding qualifier.</td></tr>
            <tr><td><span class="symbol">+</span></td><td>Matches <b>one or more</b> instances of the preceding qualifier.</td></tr>
            <tr><td><span class="symbol">{n}</span></td><td>Matches <b>exactly n</b> instances of the preceding qualifier.</td></tr>
            <tr><td><span class="symbol">{n,}</span></td><td>Matches <b>n or more</b> instances of the preceding qualifier.</td></tr>
            <tr><td><span class="symbol">{,m}</span></td><td>Matches <b>0 to m</b> instances of the preceding qualifier.</td></tr>
            <tr><td><span class="symbol">{n,m}</span></td><td>Matches at least <b>n</b> and at most <b>m</b> instances.</td></tr>
            <tr><td><span class="symbol">{n,m}?</span> or <span class="symbol">*?</span> or <span class="symbol">+?</span></td><td>Performs a <b>non-greedy</b> match of the preceding qualifier.</td></tr>
            <tr><td><span class="symbol">^spam</span></td><td>The string must <b>begin</b> with "spam".</td></tr>
            <tr><td><span class="symbol">spam$</span></td><td>The string must <b>end</b> with "spam".</td></tr>
            <tr><td><span class="symbol">.</span></td><td>Matches <b>any character</b>, except newline characters.</td></tr>
            <tr><td><span class="symbol">\d, \w, \s</span></td><td>Matches a <b>digit</b>, <b>word</b>, or <b>space</b> character, respectively.</td></tr>
            <tr><td><span class="symbol">\D, \W, \S</span></td><td>Matches anything <b>except</b> a digit, word, or space character.</td></tr>
            <tr><td><span class="symbol">[abc]</span></td><td>Matches <b>any character</b> between the square brackets (a, b, or c).</td></tr>
            <tr><td><span class="symbol">[&Hat;abc]</span></td><td>Matches any character that <b>isn’t</b> between the brackets.</td></tr>
            <tr><td><span class="symbol">(Hello)</span></td><td><b>Groups</b> 'Hello' together as a single qualifier.</td></tr>
            <tr><td><span class="symbol">[]</span></td><td>Matches a <b>single character</b> from the set listed inside the brackets.</td></tr>
            <tr><td><span class="symbol">\s</span></td><td>Matches a <b>whitespace character</b>: space, tab, newline, and similar ([ \t\r\n\f\v]).</td></tr>
            <tr><td><span class="symbol">\S</span></td><td>Matches a <b>non-whitespace character</b>: anything outside [ \t\r\n\f\v].</td></tr>
        </tbody>
    </table>
</div>
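
The quantifier rows in the table above can be checked with a small table-driven sketch. It uses `re.fullmatch` so that each pattern must account for the entire test string. The test strings and expected values are made up for illustration.

```python
import re

# (pattern, test string, expected result of a full match)
checks = [
    (r'colou?r', 'color', True),   # ? : zero or one
    (r'ab*', 'a', True),           # * : zero or more
    (r'ab+', 'a', False),          # + : one or more
    (r'a{3}', 'aaa', True),        # {n} : exactly n
    (r'a{3}', 'aa', False),
    (r'a{2,}', 'aaaa', True),      # {n,} : n or more
    (r'a{2,}', 'a', False),
    (r'a{,2}', '', True),          # {,m} : 0 to m
    (r'a{,2}', 'aaa', False),
    (r'a{1,2}', 'aa', True),       # {n,m} : between n and m
    (r'a{1,2}', 'aaa', False),
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in checks]
for (pattern, text, expected), actual in zip(checks, results):
    print(f'{pattern!r:12} {text!r:8} expected={expected} actual={actual}')
```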