# Introduction
Regular expressions (RegEx or RegExp) are powerful tools used for pattern matching and searching within text. They are widely used in various programming languages and tools to perform tasks like validation, data extraction and text manipulation.

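RegEx is loosely comparable to the `LIKE` operator in SQL: both match text against a pattern, but `LIKE` supports only the `%` and `_` wildcards, whereas regular expressions offer a much richer pattern syntax.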

### Key concepts
- Pattern: A sequence of characters and special symbols that defines what to search for within a given text.
- Matching: The process of finding occurrences of a pattern within a string.
- Special characters: These characters have special meanings within regex patterns, allowing you to define complex patterns concisely.
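
The three concepts above can be seen together in a minimal sketch. The sample string and the names `text` and `date_pattern` are made up for illustration and are not part of the dataset used later.

```python
import re

# Made-up sample string for illustration
text = "Order 66 shipped on 2024-05-17"

# Pattern: four digits, a hyphen, two digits, a hyphen, two digits.
# \d and {4} are special characters; the hyphens are matched literally.
date_pattern = r"\d{4}-\d{2}-\d{2}"

# Matching: look for the first occurrence of the pattern in the text.
match = re.search(date_pattern, text)
print(match.group())  # 2024-05-17
```

Note that `66` is skipped because it does not have the four-digit shape the pattern asks for.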

### Common use cases of regular expressions
- Data validation: Validating email addresses, phone numbers and other data formats.
- Text extraction: Extracting specific information from text, such as URLs, email addresses or dates.
- Text replacement: Replacing specific patterns of text with other text.
- Search and find: Finding specific patterns within large datasets.
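
As a rough sketch, each of the use cases listed above maps onto one call from Python's `re` module. The sample text is made up (modelled on the records in `data.txt`), and the email pattern here is deliberately simplified, so treat it as an assumption rather than a production-ready validator.

```python
import re

# Made-up sample text, modelled on the records in data.txt
text = "Call 615-555-7164 or write to davemartin@bogusemail.com"

phone_pattern = r"\d{3}-\d{3}-\d{4}"
# Deliberately simplified email pattern, for illustration only
email_pattern = r"[\w.-]+@[\w.-]+\.\w+"

# Data validation: does the whole string have the phone-number shape?
print(bool(re.fullmatch(phone_pattern, "615-555-7164")))  # True
print(bool(re.fullmatch(phone_pattern, "615-555-71")))    # False

# Text extraction: pull every email address out of the text
print(re.findall(email_pattern, text))  # ['davemartin@bogusemail.com']

# Text replacement: hide the phone number
print(re.sub(phone_pattern, "XXX-XXX-XXXX", text))

# Search and find: locate where the phone number sits in the text
print(re.search(phone_pattern, text).span())  # (5, 17)
```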

In [1]:
# reading the data and printing the first 500 characters in the data
with open("data.txt", "r") as f:
    data = f.read()
print(data[:500])

Dave Martin
615-555-7164
173 Main St., Springfield RI 55924
davemartin@bogusemail.com

Charles Harris
800-555-5669
969 High St., Atlantis VA 34075
charlesharris@bogusemail.com

Eric Williams
560-555-5153
806 1st St., Faketown AK 86847
laurawilliams@bogusemail.com

Corey Jefferson
900-555-9340
826 Elm St., Epicburg NE 10671
coreyjefferson@bogusemail.com

Jennifer Martin-White
714-555-7405
212 Cedar St., Sunnydale CT 74983
jenniferwhite@bogusemail.com

Erick Davis
800-555-6771
519 Washington St., 


In [2]:
# defining a function to mask an email address
# sample input: user_name@email_service.com
# sample output: u*****e@email_service.com
def mask_email(email):
    if "@" in email:
        user_name, domain_name = email.split("@")
        return f"{user_name[0]}*****{user_name[-1]}@{domain_name}"
    else:
        return "Invalid email address"

In [3]:
mask_email("sirdesaividish@gmail.com")

's*****h@gmail.com'

In [4]:
mask_email("sirdesaividish.gmail.com")

'Invalid email address'

In [5]:
# blackbox implementation of regular expression

# import the package for regular expression
import re

# using regex to check the validity of an email address
def is_valid_email(email):
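    # raw string (r"...") so that backslashes reach the regex engine unchanged
    email_pattern = r"^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$"
    # re.search() scans through the email string looking for the first location where the pattern is found
    result = re.search(email_pattern, email)
    # a Match object is truthy and None is falsy, so bool() gives True or False
    return bool(result)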

In [6]:
is_valid_email("sirdesaividish@gmail.com")

True

In [7]:
is_valid_email("sirdesaividish_gmail.com")

False

In [8]:
is_valid_email("sirdesai@vidish.com")

True

# Writing Regular Expressions

Tools for testing regular expressions:
- https://regex101.com is a tool that helps you build, test, and debug regular expressions.
- https://pythex.org is a similar tool.

### Meta characters
Meta characters are characters with a special meaning:

| character | description | example |
| :-: | :-: | :-: |
| `[]` | A set of characters | `"[a-m]"` |
| `\` | Signals a special character (can also be used to escape special characters) | `"\d"` |
| `.` | A character (except a newline character) | `"he..o"` |
| `^` | Starts with | `"^hello"` |
| `$` | Ends with | `"planet$"` |
| `*` | Zero or more occurrences | `"he.*o"` |
| `+` | One or more occurrences | `"he.+o"` |
| `?` | Zero or one occurrences | `"he.?o"` |
| `{}` | Exactly the specified number of occurrences | `"he.{2}o"` |
| `\|` | Either or | `"falls\|stays"` |
| `()` | Capture and group | `"(ha)+"` |
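
A few of the meta characters from the table can be tried out directly. This is a small sketch on made-up strings; the expected results are shown as comments.

```python
import re

# "." matches any single character except a newline
print(re.findall(r"he..o", "hello planet"))          # ['hello']

# "^" and "$" anchor the pattern to the start and end of the string
print(bool(re.search(r"^hello", "hello planet")))    # True
print(bool(re.search(r"planet$", "hello planet")))   # True

# "*" allows zero characters in between, "+" needs at least one
print(bool(re.search(r"he.*o", "heo")))              # True
print(bool(re.search(r"he.+o", "heo")))              # False

# "?" allows zero or one, "{2}" demands exactly two
print(re.findall(r"he.?o", "heo helo hello"))        # ['heo', 'helo']
print(re.findall(r"he.{2}o", "heo helo hello"))      # ['hello']

# "|" is either/or, "()" captures the parts inside the parentheses
print(re.findall(r"falls|stays", "The rain falls or stays"))  # ['falls', 'stays']
print(re.findall(r"(\d+)-(\d+)", "10-20 and 30-40"))          # [('10', '20'), ('30', '40')]
```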

### Special sequences
A special sequence is a `\` (backslash) followed by one of the characters in the list below; each sequence has a special meaning:

| character | description | example |
| :-: | :-: | :-: |
| `\A` | Returns a match if the specified characters are at the beginning of the string | `"\AThe"` |
| `\b` | Returns a match where the specified characters are at the beginning or at the end of a word (the `r` prefix makes sure that the string is treated as a "raw string") | `r"\bain"`, `r"ain\b"` |
| `\B` | Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (again written as a "raw string") | `r"\Bain"`, `r"ain\B"` |
| `\d` | Returns a match where the string contains digits (numbers from 0-9) | `\d` |
| `\D` | Returns a match where the string DOES NOT contain digits | `\D` |
| `\s` | Returns a match where the string contains a white space character | `\s` |
| `\S` | Returns a match where the string DOES NOT contain a white space character | `\S` |
| `\w` | Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character) | `\w` |
| `\W` | Returns a match where the string DOES NOT contain any word characters | `\W` |
| `\Z` | Returns a match if the specified characters are at the end of the string | `"Spain\Z"` |
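
The special sequences above can be checked with a short sketch on a made-up sentence. The first two lines also show why the `r` prefix matters for `\b`: without it, Python turns `"\b"` into a single backspace character before the regex engine ever sees it.

```python
import re

# Without the r prefix, "\b" is one backspace character, not a word boundary
print(len("\b"), len(r"\b"))          # 1 2

# Made-up sample string for illustration
text = "The rain in Spain"

print(re.findall(r"\AThe", text))     # ['The']
print(re.findall(r"\bain", text))     # []  (no word starts with "ain")
print(re.findall(r"ain\b", text))     # ['ain', 'ain']  (end of "rain" and "Spain")
print(re.findall(r"\Bain", text))     # ['ain', 'ain']  ("ain" not at a word start)
print(re.findall(r"\d", "Room 42"))   # ['4', '2']
print(re.findall(r"\s", text))        # [' ', ' ', ' ']
print(re.findall(r"\w+", text))       # ['The', 'rain', 'in', 'Spain']
print(re.findall(r"Spain\Z", text))   # ['Spain']
```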

### Sets
A set is a group of characters inside a pair of square brackets `[]`; it matches any single character from that group:

| set | description |
| :-: | :-: |
| `[arn]` | Returns a match where one of the specified characters (`a`, `r` or `n`) is present |
| `[a-n]` | Returns a match for any lowercase character, alphabetically between `a` and `n` |
| `[^arn]` | Returns a match for any character EXCEPT, `a`, `r` and `n` |
| `[0123]` | Returns a match where any of the specified digits (`0`, `1`, `2`, or `3`) are present |
| `[0-9]` | Returns a match for any digits between 0 and 9 |
| `[0-5][0-9]` | Returns a match for any two-digit number from `00` to `59` |
| `[a-zA-Z]` | Returns a match for any character alphabetically between `a` and `z` , lowercase OR uppercase |
| `[+]` | In sets, `+`, `*`, `.`, `\|`, `()`, `$`, `{}` have no special meaning, so `[+]` means: return a match for any `+` character in the string |
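
A brief sketch of the sets above, run against made-up strings:

```python
import re

# Made-up sample string for illustration
text = "Flight BA249 departs at 08:45 + 10 min"

print(re.findall(r"[arn]", "rain"))      # ['r', 'a', 'n']
print(re.findall(r"[^arn]", "rain"))     # ['i']
print(re.findall(r"[0-5][0-9]", text))   # ['24', '08', '45', '10']
print(re.findall(r"[a-zA-Z]+", text))    # ['Flight', 'BA', 'departs', 'at', 'min']
print(re.findall(r"[+]", text))          # ['+']
```

The `'24'` in the third result comes from inside `249`: a set matches wherever it fits, so anchors or `\b` are needed when only standalone two-digit numbers should count.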

# Decoding The Regular Expression

In [9]:
# the regex used earlier, written as a raw string so the backslashes are kept as-is
regex = r"^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$"

### Meaning
- `^`: Matches the beginning of the string.
- `\w+`: Matches one or more word characters (letters, digits or underscores).
- `([\.-]?\w+)*`: Matches zero or more occurrences of an optional dot or hyphen followed by one or more word characters.
- `@`: Matches the literal `@` symbol.
- `\w+`: Matches one or more word characters.
- `([\.-]?\w+)*`: Matches zero or more occurrences of an optional dot or hyphen followed by one or more word characters.
- `(\.\w{2,3})`: Matches a dot followed by 2 or 3 word characters.
- `+`: Matches one or more occurrences of the preceding group, so a single suffix (`.com`) and a multi-part suffix (`.ac.in`) are both accepted.
- `$`: Matches the end of the string.
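
To connect the breakdown above to behaviour, the sketch below runs the same pattern against a handful of made-up addresses. It also surfaces one consequence of `\w{2,3}`: a suffix longer than three characters, such as `.info`, is rejected, so the pattern is best treated as a teaching example rather than a complete email validator.

```python
import re

email_pattern = r"^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$"

# Made-up addresses, each chosen to exercise one part of the pattern
samples = [
    "abcd@iisc.ac.in",      # two short suffixes: accepted
    "first.last@mail.com",  # dot inside the user part: accepted
    "user@example.info",    # four-letter suffix: rejected by \w{2,3}
    "user@@example.com",    # two @ signs: rejected
    "user@example",         # no suffix at all: rejected
]

results = {s: bool(re.search(email_pattern, s)) for s in samples}
for address, is_valid in results.items():
    print(f"{address:22} {is_valid}")
```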

In [10]:
import re
regex = r"^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$"

# the same pattern, compiled in verbose mode so it can be split up and commented
reg_pattern = re.compile(
    r"""
    ^\w+([\.-]?\w+)*    # user name: starts with \w+
    @                   # single @ sign
    \w+([\.-]?\w+)*     # domain name
    (\.\w{2,3})+$       # suffix such as .com, .ac, .in, .co
    """,
    re.VERBOSE | re.IGNORECASE
)

result = reg_pattern.match("abcd@iisc.ac.in")
print(result.string)
print(result)

abcd@iisc.ac.in
<re.Match object; span=(0, 15), match='abcd@iisc.ac.in'>


# RegEx Methods To Look Into
- `re.compile()`
- `re.match()`
- `re.search()`
- `re.findall()`
- `re.finditer()`
- `re.split()`
- `re.sub()`

In [11]:
# re.match()
pattern = r"\d{3}-\d{3}-\d{4}"
print(re.match(pattern, data))

None


In [12]:
# re.search()
print(re.search(pattern, data))

<re.Match object; span=(12, 24), match='615-555-7164'>
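
The two results above differ because `re.match()` only tries the pattern at the very start of the string, while `re.search()` scans forward until it finds a match. The sketch below reproduces this on a short stand-in for `data.txt` (the first record from the printed sample) and adds `re.fullmatch()`, which requires the entire string to match.

```python
import re

pattern = r"\d{3}-\d{3}-\d{4}"

# First record of the printed sample, used as a stand-in for data.txt
sample = "Dave Martin\n615-555-7164\n173 Main St., Springfield RI 55924\n"

# match(): the text starts with "Dave", not with digits, so there is no match
print(re.match(pattern, sample))                 # None

# search(): scans ahead and finds the number on the second line
print(re.search(pattern, sample).span())         # (12, 24)

# match() succeeds when the string begins with the pattern...
print(re.match(pattern, "615-555-7164 ext").group())   # 615-555-7164

# ...but fullmatch() fails because of the trailing " ext"
print(re.fullmatch(pattern, "615-555-7164 ext"))       # None
```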


In [13]:
# re.findall()
print(re.findall(pattern, data))

['615-555-7164', '800-555-5669', '560-555-5153', '900-555-9340', '714-555-7405', '800-555-6771', '783-555-4799', '516-555-4615', '127-555-1867', '608-555-4938', '568-555-6051', '292-555-1875', '900-555-3205', '614-555-1166', '530-555-2676', '470-555-2750', '800-555-6089', '880-555-8319', '777-555-8378', '998-555-7385', '800-555-7100', '903-555-8277', '196-555-5674', '900-555-5118', '905-555-1630', '203-555-3475', '884-555-8444', '904-555-8559', '889-555-7393', '195-555-2405', '321-555-9053', '133-555-1711', '900-555-5428', '760-555-7147', '391-555-6621', '932-555-7724', '609-555-7908', '800-555-8810', '149-555-7657', '130-555-9709', '143-555-9295', '903-555-9878', '574-555-3194', '496-555-7533', '210-555-3757', '900-555-9598', '866-555-9844', '669-555-7159', '152-555-7417', '893-555-9832', '217-555-7123', '786-555-6544', '780-555-2574', '926-555-8735', '895-555-3539', '874-555-3949', '800-555-2420', '936-555-6340', '372-555-9809', '890-555-5618', '670-555-3005', '509-555-5997', '721-55

In [14]:
# re.finditer()
numbers = re.finditer(pattern, data)
for i, num in enumerate(numbers):
    print(num)
    if i == 5:
        break

<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(191, 203), match='560-555-5153'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(378, 390), match='714-555-7405'>
<re.Match object; span=(467, 479), match='800-555-6771'>


In [15]:
# re.finditer(): reading the matched text and its position from each Match object
numbers = re.finditer(pattern, data)
for i, num in enumerate(numbers):
    print(num.group(), num.start(), num.end())
    if i == 5:
        break

615-555-7164 12 24
800-555-5669 102 114
560-555-5153 191 203
900-555-9340 281 293
714-555-7405 378 390
800-555-6771 467 479
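
The remaining two methods from the list, `re.split()` and `re.sub()`, are not demonstrated above, so here is a minimal sketch. It uses a two-record stand-in for `data.txt` and assumes that records are separated by blank lines, as the printed sample suggests. The masking format (`***` for the middle block) is an arbitrary choice for illustration.

```python
import re

# Two-record stand-in for data.txt; blank-line separation is an assumption
records_text = "Dave Martin\n615-555-7164\n\nCharles Harris\n800-555-5669\n"

# re.split(): cut the text wherever two or more newlines occur in a row
records = re.split(r"\n{2,}", records_text.strip())
print(records)
# ['Dave Martin\n615-555-7164', 'Charles Harris\n800-555-5669']

# re.sub(): replace every match; \1 and \2 refer back to the captured groups,
# so the first and last blocks of digits are kept and the middle one is masked
masked = re.sub(r"(\d{3})-\d{3}-(\d{4})", r"\1-***-\2", records_text)
print(masked)
```

Here the printed `masked` text contains `615-***-7164` and `800-***-5669` in place of the original numbers.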
