# Regular Expressions

## Introduction

Regular expressions (regex or regexp) are patterns used to match character combinations in strings. They act as a compact "mini-language" for describing text patterns, which makes them useful for searching, manipulating, and validating text in programs.

## Python regular expression methods


* `re.match(pattern, string, flags=0)` # match from the beginning of the string
* `re.search(pattern, string, flags=0)` # search the entire string
* `re.fullmatch(pattern, string, flags=0)` # match the entire string with the pattern

> `re.match` is anchored at the start, roughly like `^pattern`
> * Ensures the string begins with the pattern
> 
> `re.fullmatch` is anchored at both the start and the end, roughly like `^(?:pattern)$`
> * Ensures the full string matches the pattern (especially useful with alternations, where `^a|b$` would anchor each alternative separately)
> 
> `re.search` is not anchored: `pattern`
> * Ensures the string contains the pattern
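
To make the difference between the three functions concrete, here is a minimal sketch. The sample strings (`"cat and dog"`, `"catalog"`) are made up for illustration.

```python
import re

text = "cat and dog"

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match(r"cat", text)        # matches "cat"
not_at_start = re.match(r"dog", text)    # None: "dog" is not at the start

# re.search scans the string and returns the first match anywhere.
anywhere = re.search(r"dog", text)       # matches "dog" at span (8, 11)

# re.fullmatch requires the pattern to consume the entire string.
whole = re.fullmatch(r"cat", text)             # None: text remains after "cat"
whole_ok = re.fullmatch(r"cat and dog", text)  # matches the full string

# With alternation, fullmatch anchors the alternation as a whole,
# while ^cat|dog$ anchors each alternative separately.
loose = re.search(r"^cat|dog$", "catalog")     # matches "cat"
strict = re.fullmatch(r"cat|dog", "catalog")   # None

print(at_start, not_at_start, anywhere, whole, whole_ok, loose, strict, sep="\n")
```

The last two lines show why `re.fullmatch` is handy for validation: the hand-anchored pattern accepts `"catalog"` because `^cat` alone is enough to match.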


* `re.findall(pattern, string, flags=0)` # find all matches in the string
* `re.sub(pattern, repl, string, count=0, flags=0)` # replace all matches in the string
* `re.split(pattern, string, maxsplit=0, flags=0)` # split string by the occurrences of pattern
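
As a quick illustration of these three functions, the sketch below applies them to a made-up string of dates. The variable names are only for this example.

```python
import re

text = "2023-08-26, 2024-01-05"

# findall returns every non-overlapping match as a list of strings.
years = re.findall(r"\d{4}", text)
print(years)        # ['2023', '2024']

# sub replaces every match by default.
masked = re.sub(r"\d", "#", text)
print(masked)       # ####-##-##, ####-##-##

# count=1 limits the replacement to the first match.
first_only = re.sub(r"\d{4}", "YYYY", text, count=1)
print(first_only)   # YYYY-08-26, 2024-01-05

# split cuts the string wherever the pattern matches.
parts = re.split(r",\s*", text)
print(parts)        # ['2023-08-26', '2024-01-05']
```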


* `re.compile(pattern, flags=0)` # compile a regular expression pattern into a regular expression object

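> `re.compile` is convenient when you need to use the same pattern repeatedly: the compiled object can be stored, passed around, and reused. The `re` module also caches recently used patterns internally, so the speed gain is often modest.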
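
A small sketch of reusing a compiled pattern is shown below. The `lines` list and the name `date_re` are invented for this example.

```python
import re

# Compile once, then call methods on the resulting pattern object.
date_re = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = ["released 2023-08-26", "no date here", "patched 2024-01-05"]

found = []
for line in lines:
    m = date_re.search(line)   # same as re.search(pattern, line)
    if m:
        found.append(m.group())

print(found)  # ['2023-08-26', '2024-01-05']
```

The pattern object offers the same methods as the module (`search`, `match`, `fullmatch`, `findall`, `sub`, `split`), just without the `pattern` argument.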






## 1. Character Classes

Character classes let you define specific groups of characters to match within strings. 

   - `\d` matches any digit from 0 to 9.
   - `\w` matches any alphanumeric character including underscores.
   - `\s` matches any whitespace character such as space, tab, newline, etc.
   - `.`  matches any character except a newline (with the `re.S` flag it matches newlines too)
   - These classes can help find patterns based on specific types of characters.

> Use the `.` operator carefully: a character class or a negated character class (covered in the bracket expressions section) is often faster and more precise.
>
> `\d`, `\w` and `\s` also have negated forms: `\D`, `\W` and `\S` respectively.
>
> For example, `\D` performs the inverse of the match obtained with `\d`.
> 
> * `\D`         matches a single non-digit character
>
> To be taken literally, the characters `^ . [ $ ( ) | * + ? { \` must be escaped with a backslash `\`, because they have special meaning.
>
> * `\$\d`       matches a string that has a $ before one digit
>
> Notice that you can also match non-printable characters such as tabs `\t`, newlines `\n`, and carriage returns `\r`.
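
The escaping and non-printable notes above can be tried with a short sketch like the following. The sample text is made up.

```python
import re

text = "Cost:\t$5\nTax:\t$1"

# \$ matches a literal dollar sign; an unescaped $ would be an end anchor.
prices = re.findall(r"\$\d", text)
print(prices)   # ['$5', '$1']

# \t and \n in a pattern match the tab and newline characters themselves.
tabs = re.findall(r"\t", text)
lines = re.split(r"\n", text)
print(len(tabs))  # 2
print(lines)      # ['Cost:\t$5', 'Tax:\t$1']
```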

In [1]:
import re

# Define one pattern per character class shorthand
patterns = [r"\d",  # \d matches any digit (0-9)
            r"\D",  # \D matches any non-digit
            r"\w",  # \w matches any alphanumeric character
            r"\W",  # \W matches any non-alphanumeric character
            r"\s",  # \s matches any whitespace character
            r"\S"]  # \S matches any non-whitespace character

# Define a string to search
text = "Hello 123."
#       0123456789  # ^^^ Indices of string characters

# Search the string using our pattern
for pattern in patterns:
    print(f"For pattern {pattern}:")
    match = re.search(pattern, text)

    # Print the match object
    print(f"Match: {match}")
    print(f"Group: {match.group()}")
    print(f"Span: {match.span()}")
    print(f"Start: {match.start()}")
    print(f"End: {match.end()}")
    print()

For pattern \d:
Match: <re.Match object; span=(6, 7), match='1'>
Group: 1
Span: (6, 7)
Start: 6
End: 7

For pattern \D:
Match: <re.Match object; span=(0, 1), match='H'>
Group: H
Span: (0, 1)
Start: 0
End: 1

For pattern \w:
Match: <re.Match object; span=(0, 1), match='H'>
Group: H
Span: (0, 1)
Start: 0
End: 1

For pattern \W:
Match: <re.Match object; span=(5, 6), match=' '>
Group:  
Span: (5, 6)
Start: 5
End: 6

For pattern \s:
Match: <re.Match object; span=(5, 6), match=' '>
Group:  
Span: (5, 6)
Start: 5
End: 6

For pattern \S:
Match: <re.Match object; span=(0, 1), match='H'>
Group: H
Span: (0, 1)
Start: 0
End: 1



## 2. Quantifiers

Quantifiers dictate how many times the preceding element must appear.

   - `*` matches 0 or more instances.
   - `+` matches 1 or more instances.
   - `?` matches 0 or 1 instance.
   - `{3}` matches exactly 3 instances.
   - Quantifiers enable flexible matching of repeated elements.
   - Examples:
     - `abc*`        matches a string that has ab followed by zero or more c 
     - `abc+`        matches a string that has ab followed by one or more c
     - `abc?`        matches a string that has ab followed by zero or one c
     - `abc{2}`      matches a string that has ab followed by 2 c
     - `abc{2,}`     matches a string that has ab followed by 2 or more c
     - `abc{2,5}`    matches a string that has ab followed by 2 up to 5 c
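
One way to check the examples above is to test each pattern against a few candidate strings with `re.fullmatch`. This is only a sketch, and the candidate strings are chosen for illustration.

```python
import re

candidates = ["ab", "abc", "abcc", "abcccccc"]  # the last one has six c's
patterns = [r"abc*", r"abc+", r"abc?", r"abc{2}", r"abc{2,}", r"abc{2,5}"]

# For each pattern, collect the candidates that it matches in full.
results = {
    p: [s for s in candidates if re.fullmatch(p, s)]
    for p in patterns
}

for p, matched in results.items():
    print(f"{p:10} -> {matched}")
```

`re.fullmatch` is used here instead of `re.search` because `re.search` would report a match for any string that merely contains a matching substring.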


In [2]:
import re

# Define patterns that use different quantifiers
patterns = [r"\d*",     
            r"\d+", 
            r"\d?", 
            r"\d{2}", 
            r"\d{1,3}", 
            r"l*", 
            r"l+"]

# Define a string to search
text = "Hello 123 Yellow Mellow Mellowyellow laundry."
#       0123456789  # ^^^ Indices of the first ten characters

# Search the string using our pattern
for pattern in patterns:
    print(f"For pattern {pattern}:")
    match = re.findall(pattern, text)

    # Print the number of matches and the list of matches
    print(f"{len(match)} matches found: {match}")
    print(f"Match: {match}")
    print()

For pattern \d*:
44 matches found: ['', '', '', '', '', '', '123', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
Match: ['', '', '', '', '', '', '123', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

For pattern \d+:
1 matches found: ['123']
Match: ['123']

For pattern \d?:
46 matches found: ['', '', '', '', '', '', '1', '2', '3', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
Match: ['', '', '', '', '', '', '1', '2', '3', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

For pattern \d{2}:
1 matches found: ['12']
Match: ['12']

For pattern \d{1,3}:
1 matches found: ['123']
Match: ['123']



## 3. Grouping

Parentheses `()` are used for grouping parts of a pattern together. 

   - This allows for applying quantifiers to groups.
   - It also enables the extraction of matched groups later, which can be essential for complex pattern extraction.
   - Examples: 
     - `a(bc)`           parentheses create a capturing group with value bc
     - `a(?:bc)*`        using ?: we disable the capturing group 
     - `a(?P<foo>bc)`    using ?P<foo> we put a name to the group 
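
The three group types listed above behave as follows on a small made-up input:

```python
import re

text = "abcbc"

capturing = re.search(r"a(bc)", text)
non_capturing = re.search(r"a(?:bc)*", text)
named = re.search(r"a(?P<foo>bc)", text)

# group(0) is the whole match; groups() lists the captured groups.
print(capturing.group(0), capturing.groups())          # abc ('bc',)
print(non_capturing.group(0), non_capturing.groups())  # abcbc ()
print(named.group("foo"), named.groupdict())           # bc {'foo': 'bc'}
```

The non-capturing group still groups `bc` for the `*` quantifier, but it leaves `groups()` empty.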

### Basic Syntax for Grouping

The basic syntax for a group is to enclose the regex pattern you want to group in parentheses `( )`.

```python
import re

pattern = r'(ab)+'
string = 'ababab'
result = re.search(pattern, string)
print(result.group())  # prints: ababab
```

### Why Use Groups?

- **Capture substrings**: Extract specific parts from the text.
- **Apply quantifiers to multiple characters**: For example, `(ab){2}` will match `'abab'`.
- **Logical OR matching**: Use `|` to match one of many expressions.

### Practical Examples

#### Example 1: Extracting Dates

Suppose we want to extract dates formatted as "MM/DD/YYYY" from a text.

```python
pattern = r'(\d{2})/(\d{2})/(\d{4})'
text = "Today's date is 08/26/2023"
match = re.search(pattern, text)
if match:
    print(f"Month: {match.group(1)}, Day: {match.group(2)}, Year: {match.group(3)}")
```

#### Example 2: Simplifying Nested Groups

Grouping can also be nested. Consider the pattern for matching email addresses.

```python
pattern = r'((\w+\.)*\w+)@((\w+\.)+\w+)'
text = "Contact us at info@example.com"
match = re.search(pattern, text)
if match:
    print(f"User: {match.group(1)}, Domain: {match.group(3)}")
```

#### Example 3: Using Quantifiers with Groups

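Groups can be combined with quantifiers. For example, the following pattern will match one or more occurrences of 'ab'. A non-capturing group `(?:...)` is used here because `re.findall` returns the captured group instead of the whole match when the pattern contains a capturing group; with `(ab)+` it would return only `['ab']`.

```python
pattern = r'(?:ab)+'
text = "The sequence 'ababab' shows repetitive patterns."
match = re.findall(pattern, text)
if match:
    print(f"Found patterns: {match}")  # Found patterns: ['ababab']
```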

#### Example 4: Logical OR in Groups

Use the `|` symbol to specify that only one pattern among multiple patterns should match.

```python
pattern = r'(apple|orange|banana)'
text = "I like to eat apple and orange."
match = re.findall(pattern, text)
if match:
    print(f"Fruits found: {match}")
```

In the following cell, we provide a more complex example of using groups to extract data from a multi-line string; in particular, it demonstrates named groups.

In [3]:
import re

# Define a string to search
text = """
Today's Temp:
85.4
Price #2
$12.34
Price #3 $1.23
"""

print("example 1:")
match = re.search(r"\$(?P<dollar>\d+)\.\d+", text) # this will allow us to name the group
print(match)
print(match.group('dollar'))  # here, we can call the group by name
print()

print("example 2:")
for match in re.finditer(r"\$\d+\.\d+", text):  # here, we create an iterator that will return all matches, one at a time
    print("match",  match.group(), "start index", match.start(), "End index", match.end())
print()

print("example 3:")
for match in re.finditer(r"\$(\d+)\.(\d+)", text): # here, we create an iterator that will return all matches, one at a time
    print("match",  match.group(), "start index", match.start(), "End index", match.end())
print()

print("example 4:")
print(re.findall(r"\$(\d+)\.(\d+)", text))
print()

print("example 5:")
for match in re.finditer(r"\$(?P<dollars>\d+)\.(?P<cents>\d+)", text): # here, we create an iterator that will return all matches, one at a time
    print(match.group('dollars'), "dollars", match.group("cents"),  "cents", )
print()

example 1:
<re.Match object; span=(29, 35), match='$12.34'>
12

example 2:
match $12.34 start index 29 End index 35
match $1.23 start index 45 End index 50

example 3:
match $12.34 start index 29 End index 35
match $1.23 start index 45 End index 50

example 4:
[('12', '34'), ('1', '23')]

example 5:
12 dollars 34 cents
1 dollars 23 cents



## 4. Bracket expressions 

Bracket expressions are enclosed in square brackets `[ ]`, and they will match any single character contained within the brackets.

In [4]:
pattern = '[aeiou]'
string = 'hello world'
result = re.findall(pattern, string)
print(result)  # Output: ['e', 'o', 'o']

['e', 'o', 'o']


### Use Cases for Bracket Expressions

- **Character classes**: `[aeiou]` will match any vowel.
- **Ranges**: `[0-9]` will match any digit.
- **Negation**: `[^aeiou]` will match anything other than a vowel.

### Practical Examples

#### Example 1: Extracting Vowels from a String

To extract all vowels from a string, we can use the following code.

In [5]:
pattern = '[aeiou]'
text = "Regular expressions are powerful."
matches = re.findall(pattern, text)
print(f"Vowels found: {matches}")

Vowels found: ['e', 'u', 'a', 'e', 'e', 'i', 'o', 'a', 'e', 'o', 'e', 'u']


#### Example 2: Finding Digits in a Text

We can find all digits in a string using the pattern `[0-9]`.

In [6]:
pattern = '[0-9]'
text = "The year is 2023."
matches = re.findall(pattern, text)
print(f"Digits found: {matches}")

Digits found: ['2', '0', '2', '3']


#### Example 3: Extracting Specific Words

To find all instances of specific words in a text, you can use bracket expressions to handle case variations.

In [7]:
pattern = '[Tt]he'
text = "The code, the comment, and The readme."
matches = re.findall(pattern, text)
print(f"Instances found: {matches}")

Instances found: ['The', 'the', 'The']


#### Example 4: Negation in Bracket Expressions

We can also negate a class by using `^` as the first character in the bracket expression.

In [8]:
pattern = '[^aeiou]'
text = "hello"
matches = re.findall(pattern, text)
print(f"Consonants found: {matches}")

Consonants found: ['h', 'l', 'l']


#### More advanced patterns

- `[abc]`            matches a string that has either an a or a b or a c (the same as `a|b|c`)
- `[a-c]`          same as previous
- `[a-fA-F0-9]`      a string that represents a single hexadecimal digit, case insensitively 
- `[0-9]%`           a string that has a character from 0 to 9 before a % sign
- `[^a-zA-Z]`        a string that has a character that is not a letter from a to z or from A to Z. In this case the ^ negates the expression 
- Greedy and Lazy match
    - The quantifiers `*`, `+`, `?` and `{}` are greedy operators, so they expand the match as far as they can through the provided text.
    - For example, `<.+>` matches `<div> simple div</div>` in `This is a <div> simple div</div> test.` 
    - To catch only the opening `<div>` tag, we can append a `?` to the quantifier to make it lazy:
        - `<.+?>`            matches any character one or more times between `<` and `>`, expanding only as needed
        - Notice that we could also avoid `.` in favor of a stricter regex:
            - `<[^<>]+>`         matches any character except < or > one or more times between < and >
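
The bracket patterns in the first part of this list can be tried with a short sketch like this one. The input strings are made up.

```python
import re

# [a-fA-F0-9] accepts one hexadecimal digit in either case.
hex_digits = re.findall(r"[a-fA-F0-9]", "0x1F, g7")
print(hex_digits)    # ['0', '1', 'F', '7']

# [0-9]% matches a single digit immediately followed by a percent sign.
percents = re.findall(r"[0-9]%", "5% of 120% is 6")
print(percents)      # ['5%', '0%']

# [^a-zA-Z] matches any single character that is not an ASCII letter.
non_letters = re.findall(r"[^a-zA-Z]", "a1-b2")
print(non_letters)   # ['1', '-', '2']
```

Note that `[0-9]%` finds `'0%'` inside `120%`, because the bracket expression matches exactly one character; `[0-9]+%` would capture the whole number.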

In [9]:
import re

# Define a string to search
text = """
`This is a <div> simple div</div> test.`
"""

match = re.search(r"<.+>", text)
print(match)

match = re.search(r"<.+?>", text)
print(match)

match = re.search(r"<[^<>]+>", text)
print(match)

<re.Match object; span=(12, 34), match='<div> simple div</div>'>
<re.Match object; span=(12, 17), match='<div>'>
<re.Match object; span=(12, 17), match='<div>'>


## 5. Anchors

These determine the position of the match within the string.

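   - `^` matches the start of the string (and the start of each line when the `re.M` flag is set).
   - `$` matches the end of the string (and the end of each line when the `re.M` flag is set).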
   - Anchors help in scenarios where positioning is vital.

   - Examples:
     - `^The`        matches any string that starts with The 
     - `end$`        matches a string that ends with end
     - `^The end$`   exact string match (starts and ends with The end)
     - `roar`        matches any string that has the text roar in it
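
The four example patterns above can be compared side by side. The sample strings in this sketch are invented for illustration.

```python
import re

samples = ["The end", "The beginning of the end", "An end", "The roar"]

# Keep the samples that each pattern matches.
starts_with_the = [s for s in samples if re.search(r"^The", s)]
ends_with_end = [s for s in samples if re.search(r"end$", s)]
exact = [s for s in samples if re.search(r"^The end$", s)]
contains_roar = [s for s in samples if re.search(r"roar", s)]

print(starts_with_the)  # ['The end', 'The beginning of the end', 'The roar']
print(ends_with_end)    # ['The end', 'The beginning of the end', 'An end']
print(exact)            # ['The end']
print(contains_roar)    # ['The roar']
```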

In [10]:
# Matches 'hello' only if it is at the start of the string
pattern = '^hello'
string = 'hello world'
result = re.search(pattern, string)
print(result.group() if result else "No match")  # Output: 'hello'

hello


### Why Use Anchor Expressions?

- **Precision**: Ensure that the pattern appears exactly where you want it in the text.
- **Efficiency**: Reduce the number of overall matches by specifying where to look.
- **Logical Matching**: Makes it easier to perform complex matches based on logical conditions.

### Practical Examples

#### Example 1: Validating Usernames

Let's validate usernames to ensure they start with a letter and end with a number.

In [11]:
pattern = '^[a-zA-Z][a-zA-Z0-9]*[0-9]$'
text = "JohnDoe9"
match = re.fullmatch(pattern, text)
print("Valid" if match else "Invalid")

Valid



#### Example 2: Extracting Sentences

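To extract complete sentences from a text, you can define sentence boundaries with a negated bracket expression: a run of characters that are not `.`, `!` or `?`, followed by one of those terminators.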

In [12]:
pattern = r'[^.!?]*[.!?]'
text = "Hello, world! How are you? Fine."
matches = re.findall(pattern, text)
print(matches)

['Hello, world!', ' How are you?', ' Fine.']


#### Example 3: Line-by-Line Parsing

For a multi-line string, you can combine the multiline flag `re.M` with the start-of-line anchor to match the beginning of each line.

In [13]:
pattern = r'^[a-z]+'
text = """apple
Banana
cherry"""
matches = re.findall(pattern, text, re.M)
print(matches)  # Output: ['apple', 'cherry']

['apple', 'cherry']


#### Example 4: Matching File Extensions

To match the file extensions at the end of filenames, use the `$` anchor.

In [14]:
pattern = r'\.(txt|csv|json)$'
text = "data.csv"
match = re.search(pattern, text)
if match:
    print(f"File type: {match.group(1)}")

File type: csv


## 6. Alternation

The `|` operator allows matching different options.

   - `cat|dog` would match either "cat" or "dog".
   - This operator is useful when there are multiple valid pattern options.
   - `a(b|c)`     matches a string that has a followed by b or c (and captures b or c)
   - `a[bc]`     same as previous, but without capturing b or c
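
The difference between `a(b|c)` and `a[bc]` shows up in what `re.findall` returns. A minimal sketch with made-up inputs:

```python
import re

text = "ab ac ad"

# With a capturing group, findall returns the group contents.
with_group = re.findall(r"a(b|c)", text)
print(with_group)   # ['b', 'c']

# With a bracket expression there is no group, so full matches are returned.
with_class = re.findall(r"a[bc]", text)
print(with_class)   # ['ab', 'ac']

# A non-capturing group keeps the alternation but returns full matches.
spellings = re.findall(r"gr(?:a|e)y", "gray grey griy")
print(spellings)    # ['gray', 'grey']
```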

In [15]:
pattern = 'apple|orange'
string = 'I have an apple'
result = re.search(pattern, string)
print(result.group() if result else "No match")  # Output: 'apple'

apple


### Why Use Alternation Expressions?

- **Flexibility**: Allows you to match multiple patterns with a single regex.
- **Simplification**: Condenses multiple `re.search()` or `re.findall()` calls into one.
- **Enhanced Logic**: Enables complex pattern recognition in text.

### Practical Examples

#### Example 1: Matching Multiple Date Formats

Different documents may represent dates in different formats. Alternation can be useful for capturing all of them.

In [16]:
pattern = r'\d{2}-\d{2}-\d{4}|\d{2}/\d{2}/\d{4}'
text = "The dates are 08-26-2023 and 08/26/2023."
matches = re.findall(pattern, text)
print(f"Dates found: {matches}")

Dates found: ['08-26-2023', '08/26/2023']


#### Example 2: Extracting Phone Numbers

We can use alternation to match different phone number formats.

In [17]:
pattern = r'\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}|\d{3}[-.\s]\d{2}[-.\s]\d{4}'
text = "Contact numbers are (123) 456-7890 and 123-45-6789."
matches = re.findall(pattern, text)
print(f"Phone numbers found: {matches}")

Phone numbers found: ['(123) 456-7890', '123-45-6789']


#### Example 3: Parsing Logs for Errors and Warnings

We can use alternation to search for both "Error" and "Warning" labels in a log file.

In [18]:
pattern = r'Error:|Warning:'
text = "Error: File not found. Warning: Low disk space."
matches = re.findall(pattern, text)
print(f"Labels found: {matches}")

Labels found: ['Error:', 'Warning:']


#### Example 4: Finding Synonyms

To search for synonyms in a text, you can use alternation expressions.

In [19]:
pattern = r'quick|fast|speedy'
text = "The quick fox jumps over the lazy dog."
matches = re.findall(pattern, text)
print(f"Synonyms found: {matches}")

Synonyms found: ['quick']


## 7. Escaping

The backslash `\` is used to escape special regex characters, treating them as literal characters.

   - For example, `\.` matches a literal period.
   - Escaping is needed when you want to match characters that are special in regex syntax.
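
When the text to match comes from a variable, escaping by hand is error-prone. The standard library function `re.escape` does it for you; the sketch below uses made-up strings to show the effect.

```python
import re

literal = "3.14 (approx.)"

# re.escape adds backslashes so that every character is matched literally.
pattern = re.escape(literal)

text = "pi is 3.14 (approx.), not 3x14 (approx!)"
exact = re.findall(pattern, text)
print(exact)   # ['3.14 (approx.)']

# Without escaping, "." matches any character.
loose = re.findall(r"3.14", text)
print(loose)   # ['3.14', '3x14']
```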

In [20]:
import re

# Define a string to search
text = """
 This is a <div> simple div</div> test.`
"""

match = re.search(r"\.", text)
print(match)

match = re.search(r".", text)
print(match)

<re.Match object; span=(39, 40), match='.'>
<re.Match object; span=(1, 2), match=' '>


## 8. Boundaries

- `\b`: Matches the empty string, but only at the beginning or end of a word.
- `\B`: Matches the empty string, but not at the beginning or end of a word.

In [21]:
# Matches 'cat' only when it appears as a whole word
pattern = r'\bcat\b'
string = 'catapult the cat to the cathedral'
result = re.findall(pattern, string)
print(result)  # Output: ['cat']

['cat']


### Importance of Boundary Expressions

- **Precision**: Enhance the accuracy of text matching.
- **Context Sensitivity**: Locate words based on their position within the string.
- **Data Cleaning**: Useful for tasks that require very precise string manipulation.

### Practical Examples

#### Example 1: Precise Word Matching

Sometimes you want to match a specific word without catching substrings that contain it.

In [22]:
pattern = r'\bthe\b'
text = "the theme is on the table."
matches = re.findall(pattern, text)
print(f"Instances of 'the': {matches}")

Instances of 'the': ['the', 'the']


#### Example 2: Excluding Sub-Strings

In this example, we will use `\B` to exclude matches that form part of another word.

In [23]:
pattern = r'\Bion\B'
text = "The operation is an optional solution."
matches = re.findall(pattern, text)
print(f"Instances found: {matches}")

Instances found: ['ion']


In [24]:
pattern = r'\Bion\b'
text = "The operation is an optional solution."
matches = re.findall(pattern, text)
print(f"Instances found: {matches}")

Instances found: ['ion', 'ion']


In [25]:
pattern = r'ion'
text = "The operation is an optional solution."
matches = re.findall(pattern, text)
print(f"Instances found: {matches}")

Instances found: ['ion', 'ion', 'ion']


#### Example 3: Email Validation

We can use word boundaries to validate email addresses more accurately.

In [26]:
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'
text = "My email is example@gmail.com."
matches = re.findall(pattern, text)
print(f"Emails found: {matches}")

Emails found: ['example@gmail.com']


#### Example 4: Hashtag Extraction from Social Media Text

Word boundaries can be helpful to extract hashtags precisely from a text corpus.

In [27]:
pattern = r'\B#\w+\b'
text = "Let's #explore Regex! #Python3"
matches = re.findall(pattern, text)
print(f"Hashtags found: {matches}")

Hashtags found: ['#explore', '#Python3']


## 9. Backreferences

- `\1`, `\2`, ...: Refers back to a previously captured group.

In [28]:
# A simple example of using a backreference to match repeated words
pattern = r'\b(\w+)\b\s+\1\b'
string = 'hello hello world'
result = re.search(pattern, string)
print(result.group() if result else "No match")  # Output: 'hello hello'

hello hello




### The Utility of Backreferences

- **Pattern Reusability**: Use previously matched text within the same regex.
- **Validation**: Match strings with repeated or symmetrical patterns.
- **Data Transformation**: Advanced string replacements based on what was matched.
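
As an illustration of the data transformation point, backreferences can also be used in the replacement string of `re.sub`. The sketch below reorders made-up dates from MM/DD/YYYY to YYYY-MM-DD.

```python
import re

text = "Due 08/26/2023, renewed 01/05/2024"

# \1, \2 and \3 in the replacement refer to the captured month, day and year.
iso = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", text)
print(iso)        # Due 2023-08-26, renewed 2024-01-05

# Named groups can be referenced in the replacement with \g<name>.
iso_named = re.sub(
    r"(?P<m>\d{2})/(?P<d>\d{2})/(?P<y>\d{4})",
    r"\g<y>-\g<m>-\g<d>",
    text,
)
print(iso_named)  # Due 2023-08-26, renewed 2024-01-05
```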

### Practical Examples

#### Example 1: Finding Repeated Words

The aim is to locate repeated words within a sentence.

In [29]:
pattern = r'\b(\w+)\b\s+\1\b'
text = "This is is a test test string."
matches = re.findall(pattern, text)
print(f"Repeated words: {matches}")

Repeated words: ['is', 'test']


#### Example 2: Validating Palindromic Words

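Here, we aim to find short palindromic words (up to five letters) using backreferences. `re.finditer` is used to collect the whole words, because `re.findall` would return tuples of the captured groups instead.

In [30]:
pattern = r'\b(\w)(\w?)\w?\2\1\b'
text = "Level, civic, radar, rotor, deed."
matches = [m.group() for m in re.finditer(pattern, text, re.IGNORECASE)]
print(f"Palindromes: {matches}")

Palindromes: ['Level', 'civic', 'radar', 'rotor', 'deed']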


#### Example 3: HTML Tag Pair Matching

This example shows how backreferences can be used to match paired HTML tags.

In [31]:
pattern = r'<(\w+)>(.*?)<\/\1>'
text = "<title>This is a title</title>"
matches = re.search(pattern, text)
print(f"Tag: {matches.group(1)}, Content: {matches.group(2)}" if matches else "No match")

Tag: title, Content: This is a title


#### Example 4: Validating Date Formats

Backreferences can help in validating symmetrical date formats.

In [32]:
pattern = r'(\d{2})-(\d{2})-\1'
text = "The date is 12-21-12."
match = re.search(pattern, text)
print(f"Symmetrical date found: {match.group()}" if match else "No match")

Symmetrical date found: 12-21-12


## 10. Lookahead/Behind

These specify conditions that must follow (lookahead) or precede (lookbehind) the main pattern.

Look-ahead and look-behind: `(?=...)` and `(?<=...)`

   - `?=` for positive lookahead.
   - `?!` for negative lookahead.
   - `?<=` for positive lookbehind.
   - `?<!` for negative lookbehind.
   - They allow for complex conditional matching scenarios.
   - Examples:
     - **Look-Ahead**:
       - `X(?=Y)` matches X only if X is followed by Y.
       - `d(?=r)`       matches a d only if it is followed by r, but r will not be part of the overall regex match
     - **Look-Behind**:
       - `(?<=Y)X` matches X only if X is preceded by Y.
       - `(?<=r)d`      matches a d only if it is preceded by an r, but r will not be part of the overall regex match
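
The negative forms `(?!...)` and `(?<!...)` work the same way but require that the condition does not hold. A minimal sketch on a made-up string:

```python
import re

text = "3px 4py 5px $7 8"

# Negative lookahead: a digit that is NOT followed by "px".
not_px = re.findall(r"\d(?!px)", text)
print(not_px)       # ['4', '7', '8']

# Negative lookbehind: a digit that is NOT preceded by "$".
not_dollar = re.findall(r"(?<!\$)\d", text)
print(not_dollar)   # ['3', '4', '5', '8']
```

In both cases only the digit is returned, since lookarounds check the surrounding text without consuming it.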


In [33]:
# Look-Ahead Example
pattern = r'\d(?=px)'
string = '3px 4py 5px'
result = re.findall(pattern, string)
print(result)  # Output: ['3', '5']

['3', '5']


In [34]:
# Look-Behind Example
pattern = r'(?<=@)\w+'
string = 'user@email.com'
result = re.findall(pattern, string)
print(result)  # Output: ['email']

['email']


### Significance of Look-Ahead and Look-Behind

- **Complex Matching**: Achieve intricate pattern recognition.
- **Conditional Logic**: Enables more conditional matching.
- **Conciseness**: Keeps the condition inside the pattern, so less post-processing code is needed.

### Practical Examples

#### Example 1: Extracting Currency Amounts

We may want to find all amounts in a text that are specified in USD.

In [35]:
pattern = r'(?<=\$)\d+(?:\.\d{2})?'  # non-capturing group, so findall returns the full amount
text = "The price is $20.99 and $31."
matches = re.findall(pattern, text)
print(f"Amounts in USD: {matches}")

Amounts in USD: ['20.99', '31']


#### Example 2: Validating Passwords

Check that a password contains at least one digit and consists only of letters and digits.

In [36]:
pattern = r'^(?=.*[0-9])[A-Za-z0-9]+$'
text = "ValidPass1"
match = re.search(pattern, text)
print(f"Valid password: {bool(match)}")

Valid password: True


#### Example 3: Email Domain Extraction

To extract the domain of an email without the top-level domain (.com, .org, etc.)

In [37]:
pattern = r'(?<=@)[^.]+'
text = "My email is example@gmail.com."
matches = re.findall(pattern, text)
print(f"Email domain: {matches}")

Email domain: ['gmail']


#### Example 4: Identifying Modified Variables in Code

Suppose you want to find all variables in a code snippet that are incremented.

In [38]:
pattern = r'\b\w+(?=\+\+)'
text = "x++ y++ z"
matches = re.findall(pattern, text)
print(f"Incremented variables: {matches}")

Incremented variables: ['x', 'y']


## 11. Flags

Flags in Python regular expressions modify how a given pattern is interpreted.

   - `re.I` makes the matching of the pattern case-insensitive.
   - `re.M` makes the `^` and `$` anchors match at the start and end of each line, not only at the start and end of the string.
   - `re.S` makes a period (dot) match any character, including a newline.
   - `re.U` makes \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. In Python 3 this is already the default for `str` patterns, so the flag is redundant; `re.A` restricts these classes to ASCII instead.
   - `re.X` lets you write more readable regular expressions by allowing whitespace to visually separate logical sections of the pattern and by allowing comments.
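
Several flags can be combined with the bitwise OR operator `|`, or written inline at the start of the pattern. The sketch below uses a made-up log string.

```python
import re

text = "Error: disk full\nerror: retry\nwarning: ok"

# re.I ignores case; re.M lets ^ and $ match at every line.
lines = re.findall(r"^error:.*$", text, re.I | re.M)
print(lines)    # ['Error: disk full', 'error: retry']

# The same flags written inline: (?im) at the start of the pattern.
inline = re.findall(r"(?im)^error:.*$", text)
print(inline)   # ['Error: disk full', 'error: retry']
```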

In [39]:
# Example with re.IGNORECASE
pattern = r'python'
string = 'Python is a programming language.'
result = re.findall(pattern, string, re.I)
print(result)  # Output: ['Python']

['Python']


### Importance of Flags

- **Flexibility**: Flags offer greater control over the matching behavior.
- **Readability**: Some flags can make regular expressions more understandable.
- **Efficiency**: Certain flags can help simplify complex regular expressions.

### Practical Examples

#### Example 1: Case-Insensitive Matching

Use `re.IGNORECASE` to find all case variants of a particular word.

In [40]:
pattern = r'\bword\b'
text = "Word word WoRd wORd"
matches = re.findall(pattern, text, re.I)
print(f"Case-insensitive matches: {matches}")

Case-insensitive matches: ['Word', 'word', 'WoRd', 'wORd']


#### Example 2: Multi-Line Matching

Use `re.MULTILINE` so that `^` matches at the start of every line, not only at the start of the string.

In [41]:
pattern = r'^\w+'
text = """apple
banana
cherry"""
matches = re.findall(pattern, text, re.M)
print(f"Words at the beginning of lines: {matches}")

Words at the beginning of lines: ['apple', 'banana', 'cherry']


#### Example 3: Matching Across Newlines

Here, we will use `re.DOTALL` to match content across multiple lines.

In [42]:
pattern = r'<tag>.*?</tag>'
text = "<tag>Content across \nmultiple lines</tag>"
match = re.search(pattern, text, re.S)
print(f"Tag content: {match.group() if match else 'No match'}")

Tag content: <tag>Content across 
multiple lines</tag>


#### Example 4: Improved Readability with `re.VERBOSE`

Improving regex readability using comments enabled by `re.VERBOSE`.


In [43]:
pattern = r"""
    \b      # Word boundary
    \d{1,3} # 1 to 3 digits
    \b      # Word boundary
"""
text = "100 200 300 4000"
matches = re.findall(pattern, text, re.VERBOSE)
print(f"1 to 3 digit numbers: {matches}")

1 to 3 digit numbers: ['100', '200', '300']


## Tips and Practice

Here are some tips to help you practice using regular expressions (regex) in Python:

- Import the re module. This contains functions for working with regex in Python:

```python
import re
```

- Use raw strings for the regex patterns. This avoids having to double the backslashes in sequences such as `\d` or `\b`:

```python 
pattern = r'\d{4}-\d{2}-\d{2}'
```

- Use the re.compile() function to compile the regex pattern into a regex object. This can be reused:

```python
regex = re.compile(pattern)
```

- Use the regex methods like search(), findall(), etc to apply the pattern to strings:

```python
matches = regex.search(some_string)
all_matches = regex.findall(some_string)
```

- Test regex patterns with the re.fullmatch() function. This requires a full string match: 

```python
if re.fullmatch(pattern, some_string):
   print("Full match!")
```

- Use regex groups to extract parts of a matched pattern. Define groups with ():

```python
pattern = r'(\d{4})-(\d{2})-(\d{2})' 
date_parts = re.search(pattern, a_date).groups()
```

- Practice with some common patterns like email addresses, phone numbers, etc. Test strings to find matches.

- Use sites like regex101.com to test patterns live and to see an explanation of how they work.

- Look at regex examples in the Python documentation and tutorials. Adapt them for your own uses.

The best way to get better is to practice! Start simple and slowly try more complex regular expression patterns.

## Resources

* https://regexr.com/

* https://regex101.com/

* https://www.regular-expressions.info/index.html

* https://regexlearn.com/learn

* https://www.w3schools.com/python/python_regex.asp
