<h2 align="center"><font color='black'>Introduction to Regular Expressions</font></h2>

Regular expressions (regex or regexp) are powerful tools used for pattern matching and searching within text. They are widely used in various programming languages and tools to perform tasks like validation, data extraction, and text manipulation. Regex patterns are composed of a combination of characters and symbols that define a specific pattern to search for within a given text.

Learning regular expressions *(regex)* is a valuable skill that can greatly enhance your ability to work with text data efficiently and effectively. 

<center>
<img src="https://www.oreilly.com/content/wp-content/uploads/sites/2/2019/06/email-regex_crop-ae942dc427c8cebd3a83c52d17389123.jpg" width=300>
</center>

### <font color='black'>Table of contents</font><a class='anchor' id='top'></a>
1. [Importance of Regular Expressions](#importance)
2. [`re` library](#re)
3. [Pattern 1: Matching Phone Numbers](#pattern1)
4. [Pattern 2: Matching Names with Phone Numbers](#pattern2)
5. [Pattern 3: Matching Email Addresses](#pattern3)
6. [But why is there an `r` before the pattern?](#whyr)
7. [Pattern 4: Matching URLs](#pattern4)
8. [Pattern 5: Matching Dates](#pattern5)
9. [Pattern 6: Matching Any Character](#pattern6)
10. [Where to practice?](#practice)
11. [Conclusion](#conclusion)

## Importance of Regular Expressions <a class='anchor' id='importance'></a> [↑](#top)

Regular expressions are a versatile and powerful tool used for text processing, pattern matching, and data manipulation. Here are some key reasons why learning regular expressions is important:

1. **Efficient Text Processing:** Regular expressions provide a concise and flexible way to search, extract, and manipulate text data. They allow you to perform complex operations with just a few lines of code.

2. **Pattern Matching:** One of the primary uses of regex is to find patterns within text. This ability is crucial for tasks like data validation, searching for keywords, and extracting specific information from large datasets.

3. **Data Validation:** Regular expressions are commonly used to validate user input in applications such as form validation, password validation, and email validation. This ensures that the data adheres to a certain format or structure.

4. **Text Extraction and Manipulation:** With regex, you can easily extract specific pieces of information from unstructured text, such as phone numbers, email addresses, URLs, and more. You can also replace or transform text based on patterns, saving time and effort.

5. **Automating Tedious Tasks:** Regular expressions can help automate tasks that involve text manipulation. For example, you can use regex to clean up messy data, reformat text, or extract information from logs.

6. **Text Parsing:** Many programming languages and tools use regex to pull simple fields out of text formats like JSON, XML, HTML, and log files. For deeply nested structures a dedicated parser is usually more reliable, but regex works well for quick, targeted extraction.

7. **Programming and Scripting Languages:** Regular expressions are supported in a wide range of programming languages, including Python, Java, JavaScript, Perl, and more. Learning regex enhances your programming skills and makes you more versatile as a developer.

8. **Data Analysis:** In data analysis and data science, regex can help preprocess and clean text data before further analysis. This is particularly useful for sentiment analysis, text categorization, and natural language processing.

9. **Search and Replace in Editors:** Many text editors and IDEs support regex-based search and replace. This can be incredibly helpful when you're working with large codebases or documents.

10. **Problem-Solving:** Regular expressions challenge your problem-solving skills and logical thinking. They encourage you to think critically about patterns and come up with efficient solutions.

## `re` library <a class='anchor' id='re'></a> [↑](#top)

We will be using the `re` library in Python, which is the only import that we need for this notebook. "re" stands for "regular expression," and the library lets you search, manipulate, and process text data based on specific patterns. Whether you're validating user input, extracting information from text, or performing complex text transformations, the `re` library is your go-to tool.

Key features and aspects of the `re` library include:

1. **Pattern Matching:** The library enables you to define regular expressions that describe specific patterns you're looking for in text data.

2. **Searching and Matching:** You can use the `re.search()` function to find the first occurrence of a pattern within a string.

3. **Global Search:** The `re.findall()` function helps you find all occurrences of a pattern within a string, returning a list of matches.

4. **Pattern Compilation:** The `re.compile()` function allows you to compile a regular expression pattern into a regular expression object. This can improve efficiency when you need to use the same pattern repeatedly.

5. **Replacing Text:** The `re.sub()` function is used to replace occurrences of a pattern in a string with specified text.

6. **Grouping and Capturing:** Parentheses `( )` in your regular expression patterns allow you to capture specific parts of a match. This is useful for extracting specific information.

7. **Modifiers:** The `re.IGNORECASE` flag can be passed to functions to make pattern matching case-insensitive.

8. **Metacharacters:** The library supports a range of metacharacters, such as `.` (any character except newline), `*` (zero or more occurrences), `+` (one or more occurrences), and more.

9. **Anchors:** Anchors like `^` (start of string) and `$` (end of string) help you match patterns at specific positions within the text.

10. **Escape Sequences:** Special characters, such as `.` or `\`, can be escaped with a backslash (`\.`, `\\`) to match them literally.

11. **Greedy and Non-Greedy Matching:** Quantifiers like `*` and `+` are greedy by default, but you can use `*?` and `+?` for non-greedy matches.

The `re` library is an essential tool for working with text data, enabling you to manipulate and process text effectively using the power of regular expressions. It's widely used across various domains, including web development, data processing, text analysis, and more. Mastering the `re` library can greatly enhance your ability to work with textual information and solve a wide range of text-related challenges.
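
As a rough tour of the functions listed above, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not part of the notebook's data.

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "Order A12 shipped; order b34 pending; ORDER C56 cancelled."

# re.search returns a match object for the first occurrence (or None).
first = re.search(r"order (\w+)", sample, re.IGNORECASE)
print(first.group(0))  # the whole match
print(first.group(1))  # only the text captured by the parentheses

# re.findall returns every non-overlapping match; with a single
# capturing group it returns the captured text of each match.
codes = re.findall(r"order (\w+)", sample, re.IGNORECASE)
print(codes)

# re.compile builds a reusable pattern object.
order_re = re.compile(r"order (\w+)", re.IGNORECASE)
print(len(order_re.findall(sample)))

# re.sub replaces each match; \1 refers back to the first group.
masked = order_re.sub(r"order <\1>", sample)
print(masked)
```

Compiling the pattern once is mostly a readability choice here; it matters more when the same pattern is applied to many strings.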

Let's go over some regex patterns and see what they mean and how they can be used in our workflows to get the results we are after. Let's begin.

## Pattern 1: Matching Phone Numbers<a class='anchor' id='pattern1'></a> [↑](#top)

Pattern: 

```python
'\d{3}-\d{3}-\d{4}'
```

Explanation:
- `\d`: Matches any digit (equivalent to [0-9]).
- `{3}`: Specifies that the preceding element (digit) should occur exactly 3 times.
- `-`: Matches the hyphen character literally.
- Combining these, `\d{3}-\d{3}-\d{4}` matches phone numbers in the format XXX-XXX-XXXX.

In [17]:
import re

text = """
Alice: 555-123-4567
Bob: 333-987-6543
Charlie: 777-555-88
"""
pattern = r'\d{3}-\d{3}-\d{4}'
matches = re.findall(pattern, text)
print("Pattern 1 Matches:", matches)

Pattern 1 Matches: ['555-123-4567', '333-987-6543']


<div class="alert alert-block alert-info">
Notice that Charlie's number wasn't matched because it ends with only two digits, while <code>\d{4}</code> requires exactly four. Real phone numbers rarely look like this, but it shows a limitation of fixed-length quantifiers: when the number of digits can vary, you need a different pattern, for example one that uses a range quantifier such as <code>{2,4}</code>.
</div>
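
One possible way to handle the varying length described above is a range quantifier. This is only a sketch: the bounds `{2,4}` are an assumption chosen to fit the sample data, not a rule for real phone numbers.

```python
import re

text = """
Alice: 555-123-4567
Bob: 333-987-6543
Charlie: 777-555-88
"""

# Assumption: the last block may hold 2 to 4 digits; adjust the bounds to your data.
# The \b anchors keep the match from starting or stopping in the middle of a longer number.
flexible_pattern = r'\b\d{3}-\d{3}-\d{2,4}\b'
flexible_matches = re.findall(flexible_pattern, text)
print(flexible_matches)
```

The trade-off is that a looser pattern also accepts more malformed input, so how loose to go depends on whether you are extracting or validating.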

## Pattern 2: Matching Names with Phone Numbers<a class='anchor' id='pattern2'></a> [↑](#top)
Pattern: 

```python 
"([A-Za-z]+):\s(\d{3}-\d{3}-\d{4})"
```

### Explanation:

- `([A-Za-z]+)`: Capturing group for matching names composed of alphabetic characters.
- `:\s`: Matches a colon followed by a single whitespace character (here, a space).
- `(\d{3}-\d{3}-\d{4})`: Capturing group for matching phone numbers.

In [3]:
text = """
Alice: 555-123-4567
Bob: 333-987-6543
Charlie: 777-555-88
"""
pattern = r'([A-Za-z]+):\s(\d{3}-\d{3}-\d{4})'
matches = re.findall(pattern, text)
print(matches)
print('Pattern 2 Matches: \n')
for match in matches:
    print("Name:", match[0], "\tPhone:", match[1])

[('Alice', '555-123-4567'), ('Bob', '333-987-6543')]
Pattern 2 Matches: 

Name: Alice 	Phone: 555-123-4567
Name: Bob 	Phone: 333-987-6543


<div class="alert alert-block alert-info">
Charlie is left out again because his number ends with only two digits, so the second group cannot match and the whole pattern fails for that line. Also note that when a pattern contains more than one capturing group, <code>re.findall</code> returns a list of tuples with one element per group.
</div>
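
Indexing tuples with `match[0]` and `match[1]` works, but named groups can make the intent easier to read. The sketch below is an alternative formulation, not the notebook's original approach; the group names `name` and `phone` are chosen here for illustration.

```python
import re

text = """
Alice: 555-123-4567
Bob: 333-987-6543
Charlie: 777-555-88
"""

# (?P<label>...) gives a capturing group a name.
contact_re = re.compile(r'(?P<name>[A-Za-z]+):\s(?P<phone>\d{3}-\d{3}-\d{4})')

# re.finditer yields match objects, so groups can be read by name.
contacts = {m.group('name'): m.group('phone') for m in contact_re.finditer(text)}
print(contacts)
```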

## Pattern 3: Matching Email Addresses<a class='anchor' id='pattern3'></a> [↑](#top)
Pattern: 

```python
"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b"
```

### Explanation:

- `\b`: Matches a word boundary to ensure the entire email address is matched.
- `[A-Za-z0-9._%+-]+`: Matches one or more characters from the allowed set in an email username.
- `@`: Matches the "@" symbol.
- `[A-Za-z0-9.-]+`: Matches one or more characters in the domain name.
- `\.`: Matches a literal period (dot).
- `[A-Za-z]{2,7}`: Matches the top-level domain (TLD) with 2 to 7 alphabetic characters. Note that `|` is not needed inside square brackets: there it would match a literal pipe character rather than mean "or".

In [4]:
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'
emails = """
Emails: john.doe@example.com, jane_smith@gmail.com, info@company.net
"""
matches = re.findall(pattern, emails, re.IGNORECASE)
print("Pattern 3 Matches:", matches)

Pattern 3 Matches: ['john.doe@example.com', 'jane_smith@gmail.com', 'info@company.net']


If you only want the username, that is, the text before the `@` symbol, you can use the pattern below.

In [5]:
pattern = r'\b([A-Za-z0-9._]+)@\b'
emails = """
john.doe@example.com, jane_smith@gmail.com, info@company.net
"""
matches = re.findall(pattern, emails, re.IGNORECASE)
print("Pattern 3 Matches:", matches)

Pattern 3 Matches: ['john.doe', 'jane_smith', 'info']


Let's break down the regular expression pattern `r'\b([A-Za-z0-9._]+)@\b'` used above, step by step:

1. `\b`: Word Boundary - `\b` is a word boundary anchor. It matches the position between a word character (a letter, digit, or underscore) and a non-word character (such as a space or punctuation), or the start/end of the string. Here it anchors the match at the edge of a word.

2. `([A-Za-z0-9._]+)`: Capturing Group for Email Username - `(` and `)` define a capturing group, so `re.findall` returns only this part of the match. `[A-Za-z0-9._]+` matches one or more letters (uppercase or lowercase), digits, dots, or underscores.

3. `@`: Literal '@' Symbol - The `@` character matches the "@" symbol exactly. It sits outside the group, so it must be present but is not returned.

4. `\b`: Word Boundary - Another word boundary anchor, which requires a word character right after the `@`, i.e. the start of the domain.

If you want to fetch the domain names instead, move the `@` in front of the capturing group so that the pattern matches the `@` symbol and captures what follows it.

In [6]:
pattern = r'\b@([A-Za-z0-9._]+)\b'
emails = """
john.doe@example.com, jane_smith@gmail.com, info@company.net
"""
matches = re.findall(pattern, emails, re.IGNORECASE)
print("Pattern 3 Matches:", matches)

Pattern 3 Matches: ['example.com', 'gmail.com', 'company.net']


In [7]:
pattern = r'\b@([A-Za-z0-9._]+)\b'
emails = """
john.doe@example.com, jane_smith@gmail.com, info@Company.net, info@company.net
"""
matches = re.findall(pattern, emails,  re.IGNORECASE)
print("Pattern 3 Matches:", matches)

Pattern 3 Matches: ['example.com', 'gmail.com', 'Company.net', 'company.net']
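
The username and the domain can also be captured in a single pass by using two groups in one pattern. The following is a sketch under that idea; the name `pair_pattern` is introduced here for illustration only.

```python
import re

emails = "john.doe@example.com, jane_smith@gmail.com, info@company.net"

# Two capturing groups: the part before "@" and the part after it.
pair_pattern = r'\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,7})\b'

# With two groups, re.findall returns a list of (username, domain) tuples.
pairs = re.findall(pair_pattern, emails)
for user, domain in pairs:
    print("User:", user, "\tDomain:", domain)
```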


https://www.w3schools.com/python/python_regex.asp

### But why is there an `r` before the pattern?<a class='anchor' id='whyr'></a> [↑](#top)

The `r` prefix, as in `r"\b[^@]+\b"`, turns the string into a "raw string literal" in Python. Let's explain why it's used:

In regular expressions, backslashes `\` are used both to escape special characters and to introduce sequences such as `\d` or `\b`. However, in ordinary Python strings, backslashes also start escape sequences of their own. For instance, `\n` represents a newline character, `\t` a tab character, and `\b` a backspace character. This leads to conflicts: written without the `r` prefix, `"\b"` reaches the regex engine as a backspace character rather than as a word boundary.

To address this, you can use a raw string literal by prefixing the string with `r`. In a raw string literal, backslashes are treated as literal characters and not as escape characters. This is particularly useful when working with regular expressions because it ensures that backslashes are interpreted only by the regex engine and not by Python's string handling.

So, the `r"\b[^@]+\b"` pattern uses the raw string literal `r` to make sure that the backslashes are interpreted as part of the regular expression and not as escape characters for Python's string processing.
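
The difference is easy to see with `\b`, which Python itself interprets as a backspace character in an ordinary string. A small sketch, using a made-up sentence:

```python
import re

# Without the r prefix, Python converts "\b" into a backspace character (ASCII 8).
plain = "\bcat\b"
# With the r prefix, the backslash and the "b" are kept as two separate characters.
raw = r"\bcat\b"

print(len(plain))  # backspace + "cat" + backspace
print(len(raw))    # backslash, b, c, a, t, backslash, b

sentence = "the cat sat"
plain_matches = re.findall(plain, sentence)  # searches for literal backspace characters
raw_matches = re.findall(raw, sentence)      # searches for the whole word "cat"
print(plain_matches)
print(raw_matches)
```

Sequences such as `\d` happen to survive in ordinary strings because Python does not define them as escapes, but relying on that is fragile, which is why the `r` prefix is used consistently in this notebook.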

## Pattern 4: Matching URLs<a class='anchor' id='pattern4'></a> [↑](#top)

Pattern: 
```python
'(https?://\S+)'
```
### Explanation:

This pattern captures URLs starting with "http://" or "https://", followed by one or more non-whitespace characters. Let's break down the regular expression `(https?://\S+)` step by step:

1. `(https?://\S+)` -  is the regular expression pattern enclosed in parentheses. The parentheses are used to create a capturing group that allows us to extract the matched content.

2. `http` - `http` matches the literal characters "http" exactly.

3. `s?` - `s?` is a quantifier that matches the character "s" zero or one time. This allows the regular expression to match both "http" and "https". The question mark `?` makes the preceding character (in this case, "s") optional.

4. `://` - `://` matches the literal characters "://", which separate the protocol (http or https) from the rest of the URL.

5. `\S+`- `\S+` matches one or more non-whitespace characters. It's used to match the rest of the URL after the protocol.

Below is a cheatsheet that might help you navigate the world of Regular Expressions

<center>
<img src="https://pbs.twimg.com/media/DBNN9XQXcAA8L-Q.jpg" width=600>
</center>

In [8]:
text = """
    Check this out: http://www.kaggle.com and also https://web.mit.edu/
"""

pattern = r'(https?://\S+)'
matches = re.findall(pattern, text)
print("\nPattern 4 Matches:", matches)


Pattern 4 Matches: ['http://www.kaggle.com', 'https://web.mit.edu/']
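
Because `\S+` accepts any non-whitespace character, it also swallows punctuation that directly follows a URL in running text. One possible workaround is sketched below; the set of characters to strip is an assumption and may not suit every input.

```python
import re

text = "See https://web.mit.edu/, or http://www.kaggle.com."

raw_urls = re.findall(r'https?://\S+', text)
print(raw_urls)  # the trailing comma and period are included

# Assumption: a trailing . , ; : ! ? ) belongs to the sentence, not to the URL.
clean_urls = [url.rstrip('.,;:!?)') for url in raw_urls]
print(clean_urls)
```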


## Pattern 5: Matching Dates<a class='anchor' id='pattern5'></a> [↑](#top)
Pattern: 

```python
'(\d{2}-\d{2}-\d{4})'
```
### Explanation:

This pattern captures dates in the format DD-MM-YYYY. Let's break down the regular expression `\d{2}-\d{2}-\d{4}` step by step:

1. `\d{2}` - `\d` matches any digit (0-9), and `{2}` is a quantifier that requires the preceding element (in this case, `\d`) to occur exactly 2 times. Together, `\d{2}` matches two consecutive digits.

2. `-` - `-` matches the hyphen character "-" literally.

3. `\d{2}` - Similarly, this part `\d{2}` matches two consecutive digits again.

4. `-` - Another hyphen character "-" matches here.

5. `\d{4}` - This part `\d{4}` matches four consecutive digits.

In [9]:
pattern = r'(\d{2}-\d{2}-\d{4})'
dates = """
Important dates: 05-12-2022 and 2010-08-23
"""
matches = re.findall(pattern, dates)
print("\nPattern 5 Matches:", matches)


Pattern 5 Matches: ['05-12-2022']



<p class="alert alert-block alert-info">
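Notice that the second date, <code>2010-08-23</code>, is not matched because it is written year-first, while the pattern expects two digits, two digits, and then four digits. We can also write a more flexible pattern, <code>\b\d{1,2}[-/]\d{1,2}[-/]\d{4}\b</code>, that accepts either a hyphen or a slash as the delimiter and days and months with or without leading zeroes.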
</p>

### Explanation:

- `\b`: Word boundary to ensure complete date matches.
- `\d{1,2}`: Matches one or two digits for the day.
- `[-/]`: Matches either a hyphen or a forward slash as the delimiter.
- `\d{1,2}`: Matches one or two digits for the month.
- `[-/]`: Another delimiter.
- `\d{4}`: Matches exactly four digits for the year.
- `\b`: Word boundary to ensure the end of the date.

This pattern accommodates both single and double-digit days and months, as well as four-digit years, making it more robust for capturing various date formats. Below is the code for the above pattern we just went over:

In [10]:
pattern = r'\b\d{1,2}[-/]\d{1,2}[-/]\d{4}\b'
dates = """
Important dates: 05-12-2022 and 2010-08-23 and 7/6/1998
"""
matches = re.findall(pattern, dates)
print("Pattern 5 Matches:", matches)

Pattern 5 Matches: ['05-12-2022', '7/6/1998']




<div class="alert alert-block alert-info">
Notice that <code>2010-08-23</code> is still not matched because the year comes first, and this pattern expects the four-digit year at the end.
</div>
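
To also pick up year-first dates, one option is an alternation (`|`) between the two layouts. Since a regex only checks the shape of the text, the sketch below additionally runs each candidate through `datetime.strptime` to reject impossible dates. The helper `is_real_date`, the extra sample date `31-02-2020`, and the two assumed formats are illustrative choices, not part of the original notebook.

```python
import re
from datetime import datetime

dates = "Important dates: 05-12-2022 and 2010-08-23 and 7/6/1998 and 31-02-2020"

# Alternation: day-first (D-M-YYYY) or year-first (YYYY-M-D), with - or / as delimiter.
date_pattern = r'\b(?:\d{1,2}[-/]\d{1,2}[-/]\d{4}|\d{4}[-/]\d{1,2}[-/]\d{1,2})\b'
candidates = re.findall(date_pattern, dates)
print(candidates)

def is_real_date(value):
    """Return True if value parses as a calendar date in one of the assumed formats."""
    normalized = value.replace('/', '-')
    for fmt in ('%d-%m-%Y', '%Y-%m-%d'):
        try:
            datetime.strptime(normalized, fmt)
            return True
        except ValueError:
            continue
    return False

# The regex accepts 31-02-2020 by shape; the calendar check filters it out.
valid_dates = [d for d in candidates if is_real_date(d)]
print(valid_dates)
```

The `(?:...)` group is non-capturing, so `re.findall` still returns whole matches instead of group contents.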

## Pattern 6: Matching Any Character<a class='anchor' id='pattern6'></a> [↑](#top)
Pattern: 
```python
'.*'
```
### Explanation:

1. `.`: The dot (period) character in regex matches any character except for a newline character (`\n`). It represents a wildcard that can stand for any single character.

2. `*`: The asterisk is a quantifier that indicates "zero or more occurrences" of the preceding element. In this case, the preceding element is the dot (`.`), which means "zero or more occurrences of any character."

*For example*, if you have the text "Hello, World!", the pattern `.*` would match the entire string "Hello, World!". If used within a larger regex pattern, it would capture everything between other elements or patterns.

In [11]:
text = "The quick brown fox jumps over the lazy dog"
pattern = "brown.*"
matches = re.findall(pattern, text)
print("Pattern 6 Matches:", matches)

Pattern 6 Matches: ['brown fox jumps over the lazy dog']


Placing a word after `.*` returns everything from the beginning of the string up to and including that word (its last occurrence, since `.*` is greedy).

In [12]:
text = "The quick brown fox jumps over the lazy dog"
pattern = ".*brown"
matches = re.findall(pattern, text)
print("Pattern 6 Matches:", matches)

Pattern 6 Matches: ['The quick brown']
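
The greedy behaviour of `.*` becomes visible when the word occurs more than once. The sketch below compares it with the non-greedy form `.*?`, using a made-up sentence that contains "brown" twice.

```python
import re

text = "The quick brown fox and the brown dog"

# Greedy: .* grabs as much as possible, so the match runs to the last "brown".
greedy = re.findall(r".*brown", text)
# Non-greedy: .*? grabs as little as possible, so each match stops at the next "brown".
lazy = re.findall(r".*?brown", text)

print(greedy)
print(lazy)
```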


If you only want to match a specific word, use that word as the pattern and it will be returned wherever it occurs. Let's loop through a list of sentences to check whether the word appears in each one.

In [13]:
text = ["The quick brown fox jumps over the lazy dog", "The lazy fox slept on the brown branch of a tree"]
pattern = "dog"
for idx, match in enumerate(text):
    matches = re.findall(pattern, match) 
    if matches:
        print(f"Pattern 6 Matches the input no. {idx+1}:", matches[0])
    else:
        print(f"Pattern 6 Match was not found in the input no. {idx+1}.")

Pattern 6 Matches the input no. 1: dog
Pattern 6 Match was not found in the input no. 2.
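
A plain pattern like `dog` also matches inside longer words such as "dogs" or "dogma". If only the standalone word should count, word boundaries can help. This is a sketch with made-up sentences; `re.search` is used because only a yes/no answer is needed.

```python
import re

sentences = ["The dog barked", "Dogma is a 1999 film", "Hot dogs are popular"]

# \b on both sides requires "dog" to stand alone as a word.
whole_word = re.compile(r"\bdog\b", re.IGNORECASE)

# re.search stops at the first match, which is enough for a presence check.
results = [bool(whole_word.search(s)) for s in sentences]
print(results)
```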


Suppose you have an invoice number and you want to extract a specific part from the same. You could do that too.

In [14]:
text = "TCS/2023-24/006"
pattern = r"\d{3}$"
match = re.findall(pattern, text)[0]
print(f'The Pattern returned {match}')

The Pattern returned 006


Let's say you want to get the year part from an invoice number. You could do it like this

In [15]:
text = "TCS/2023-24/006"
pattern = r"\d{4}"
match = re.findall(pattern, text)[0]
print(f'Year : {match}')

Year : 2023
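
Instead of running one pattern per field, the whole invoice number can be described once and split into named parts. The layout assumed below (prefix, four-digit start year, two-digit end year, three-digit serial) is inferred from the single example `TCS/2023-24/006` and may not hold for other invoices.

```python
import re

invoice = "TCS/2023-24/006"

# Assumed layout: <prefix>/<4-digit start year>-<2-digit end year>/<3-digit serial>
invoice_re = re.compile(
    r'^(?P<prefix>[A-Z]+)/(?P<start_year>\d{4})-(?P<end_year>\d{2})/(?P<serial>\d{3})$'
)

m = invoice_re.match(invoice)
# groupdict() maps each group name to the text it captured.
parts = m.groupdict() if m else {}
print(parts)
```

Anchoring with `^` and `$` means a string that deviates from the assumed layout produces no match at all, rather than a partial one.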


## Where to practice?<a class='anchor' id='practice'></a> [↑](#top)

Enhance your mastery of regular expressions by utilizing platforms like https://regexr.com/ or https://regex101.com/ for hands-on practice. These websites offer an excellent environment to solidify your grasp of regular expressions. Once you've gained a solid foundation, you can further test your knowledge using Python or leverage the REGEXEXTRACT function in Google Sheets. These practical applications allow you to validate your learning and identify areas for improvement, if necessary.

Or you can practice using the code below.

```python

import re
import ipywidgets as widgets
from IPython.display import display

text_input = widgets.Text(
    placeholder='Enter text here',
    description='Text:',
    disabled=False
)
pattern_input = widgets.Text(
    placeholder='Enter pattern here',
    description='Pattern:',
    disabled=False
)
display(text_input, pattern_input)

# Create a button widget
button = widgets.Button(description="Find Matches")
display(button)

# Output widget to display matches
output = widgets.Output()
display(output)

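# Function to handle button click event
def find_matches(button_click):
    text = text_input.value
    pattern = pattern_input.value

    with output:
        # Clear the previous result so outputs do not pile up
        output.clear_output()
        try:
            matches = re.findall(pattern, text)
        except re.error as err:
            # An incomplete or invalid pattern raises re.error
            print("Invalid pattern:", err)
        else:
            print("Pattern Matches:", matches)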

# Attach the function to the button's click event
button.on_click(find_matches)
```

## Conclusion<a class='anchor' id='conclusion'></a> [↑](#top)

- Regular expressions serve as versatile tools that empower you to efficiently search for and manipulate text based on specific patterns. By grasping the fundamental concepts and patterns, you gain the ability to carry out diverse tasks involving text processing and data extraction.

- In our data-driven world, the capability to effectively process, search, and manipulate text data holds immense significance. Acquiring proficiency in regular expressions empowers you to unlock the full potential of text data and proficiently address a broad spectrum of challenges.

- Whether you're a programmer, data analyst, content creator, or anyone engaging with text, mastering regular expressions significantly amplifies your productivity and ushers in fresh opportunities for innovative solutions.

- While regular expressions can appear complex, particularly when unfamiliar, comprehending them offers you not only the capacity to create expressions tailored to your needs but also the ability to decipher expressions crafted by others.

<div class="alert alert-block alert-warning">
I trust this notebook has provided a solid foundation in grasping the essentials of regex, leaving you confident in crafting your own expressions.
</div>