# Regex Tips Notes

Reference: [10 Tips to Get More From Regex](https://pybit.es/articles/mastering-regex/)

## Regular expression for matching repeating words

---


In [69]:
# Import the regular expression module
import re

In [70]:
# Define the string to search
s = 'Paris in the the spring'

In [71]:
# Compile a regular expression pattern
p = re.compile(
    r'''
    (       # Open first capturing group
    \b      # Assert a word boundary
    \w+     # One or more word characters
    )       # Close first capturing group
    \s+     # One or more whitespace characters
    \1      # Backreference: the same text captured by group 1
    ''',
    re.VERBOSE
)

In [72]:
# Search for repeating words and display the first match
p.search(s).group()

'the the'
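
Calling `.group()` directly on the result of `search` raises `AttributeError` when nothing matches, because `search` returns `None` in that case. A minimal sketch of a guarded version follows; the helper name `first_repeat` is hypothetical and not part of the original notes.

```python
import re

# Same idea as the verbose pattern above, written on one line.
pattern = re.compile(r'(\b\w+)\s+\1')

def first_repeat(s):
    """Return the first repeated-word match in s, or None if there is none."""
    match = pattern.search(s)
    return match.group() if match else None

print(first_repeat('Paris in the the spring'))
print(first_repeat('Paris in the spring'))
```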

---

## Tip 5: Greediness

By default, regex quantifiers such as `+` and `*` are _greedy_, meaning they match as much text as possible:


In [73]:
# Define a text string
text = 'This is one group of words. This is another group of words.'

In [74]:
# Import the 're' module and run a search for words followed by a period.
import re
re.search(r'.+\.', text).group()

'This is one group of words. This is another group of words.'

The previous example returns the entire string because the _greedy_ `.+` consumes as much text as it can and backtracks only as far as the last period.


To match only the first sentence, make the search pattern non-greedy:

In [75]:
# Define a text string
text = 'This is one group of words. This is another group of words.'

In [76]:
# Import the 're' module
import re

In [77]:
# Include the ? character after the .+ or .* to match the shortest possible string
re.search(r'.+?\.', text).group()

'This is one group of words.'
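
To compare the two behaviors side by side, the sketch below runs `findall` with the greedy and non-greedy patterns on the same text. The third pattern, a negated character class, is an alternative that is not in the original notes; it is included only as an assumption about another common way to stop at the first period.

```python
import re

text = 'This is one group of words. This is another group of words.'

# Greedy: .+ runs to the end of the string, then backtracks to the last period.
greedy = re.findall(r'.+\.', text)

# Non-greedy: .+? stops at the first period that lets the match succeed.
lazy = re.findall(r'.+?\.', text)

# Alternative (assumption, not from the notes): [^.]+ never crosses a period.
negated = re.findall(r'[^.]+\.', text)

print(greedy)
print(lazy)
print(negated)
```

Note that the second sentence is returned with its leading space, since nothing in these patterns skips whitespace between sentences.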

---

## Word Boundaries

- The regex syntax `\b` indicates a _word boundary_.
    - A word boundary match is zero-length: it does not consume any characters.
    - It matches at the boundary between a word character and a non-word character.
    - `\b` enables "whole word only" matching with the syntax `\bword\b`.

- `\b` matches:
    - Before the first character in a string, if the first character is a word character (`\w`).
    - After the last character in a string, if the last character is a word character (`\w`).
    - Between two characters in a string, where one is a word character (`\w`) and the other is not a word character (`\W`).
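
The rules above can be sketched with a short comparison. The sample string and variable names are illustrative only.

```python
import re

sample = 'cat concatenate cat. bobcat'

# Without boundaries, 'cat' is also found inside longer words.
anywhere = re.findall(r'cat', sample)

# With \b on both sides, only the standalone word matches.
whole_word = re.findall(r'\bcat\b', sample)

# \b on its own matches an empty string: start and end offsets are equal.
boundary = re.search(r'\b', sample)

print(len(anywhere))
print(whole_word)
print(boundary.span())
```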


### Example #1


In [78]:
# Import the regular expression module
import re

In [79]:
# Define a string to search
text = 'This is a thrilling episode to thrash about in the bath through Thursday.'

In [80]:
# Match every instance of 'th' and 'Th' at the start of a word.
r = re.compile(
    r'''
    \b         # Start a word boundary
    [tT]h      # Literal th or Th
    ''',
    re.VERBOSE
)

In [81]:
# Display a list of matches
r.findall(text)

['Th', 'th', 'th', 'th', 'th', 'Th']

### Example #2


In [82]:
# Import the regular expression module
import re

In [83]:
# Define a string to search
text = 'This is a thrilling episode to thrash about in the bath through Thursday.'

In [84]:
# Find the first 'th' or 'Th' at the end of a word (re.search returns only the first match).
re.search(r'[tT]h\b', text)

<re.Match object; span=(53, 55), match='th'>

---

## Backreferences

Backreferences are helpful for tasks such as locating duplicate words:


In [85]:
# Import the regular expression module
import re

In [86]:
# Define a string to search
text = 'This is the song song that never never ends'

# Match any word that appears immediately after itself.
# Capture a word in match group 1, then search for the same text
# again right after the whitespace that follows it.

In [87]:
r = re.compile(
    r'''
    (          # Start match group 1
    \b         # Start a word boundary
    \w+        # Match one or more word characters
    )          # End match group 1
    \s+        # Match one or more whitespace characters
    \1         # Match an instance of match group 1
    ''',
    re.VERBOSE
)

In [88]:
# Display a list of matches
r.findall(text)

['song', 'never']
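
A backreference can also be used in a replacement string. The sketch below collapses each duplicated word into a single copy with `sub`. The trailing `\b` is an addition not present in the pattern above; it is assumed here so that a pair such as "the theory" is not treated as a duplicate.

```python
import re

text = 'This is the song song that never never ends'

# A word, whitespace, then the same word again as a whole word.
duplicate = re.compile(r'\b(\w+)\s+\1\b')

# In the replacement string, \1 refers to the text captured by group 1.
cleaned = duplicate.sub(r'\1', text)
print(cleaned)
```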

---

## `re.subn`

The `re.subn` function performs the same **string replacement** as `re.sub`, and also reports how many replacements it made.

It returns a 2-tuple: the first value is the string after replacement, and the second value is the number of replacements performed.


In [89]:
html = '''
<html>
    <head>
        <title>This is a sample page</title>
    </head>
    <body>
        <h1>This is the Sample Page Title</h1>
        <ul>
            <li>Point #1</li>
            <li>Point #2</li>
            <li>Point #3</li>
        </ul>
    </body>
</html>
'''

In [90]:
# Import the regular expression module
import re

In [91]:
# Define a function to strip HTML tags, leaving only the text
def strip_html(html: str = html) -> tuple[str, int]:

    ''' Replace each HTML tag and its surrounding whitespace with a space.
        Return the (new_string, replacement_count) tuple from re.subn.
        The non-greedy quantifier (?) after the + character indicates
        the search will find the shortest possible match.'''

    return re.subn(
        pattern=r'\n?\s*<[^<]+?>\n?\s*',
        repl=' ',
        string=html
    )


In [92]:
# Import the Pretty Print module
from pprint import pprint

In [93]:
# Call the function
text = strip_html()

# Assign the tuple indices to their own variables.
string = text[0].strip()
num_replacements = text[1]

In [94]:
# Display the results
print(f'\nString result: {string}\n')

print(f'Total replacements: {num_replacements}\n')


String result: This is a sample page    This is the Sample Page Title   Point #1  Point #2  Point #3

Total replacements: 18
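
Because `re.subn` returns a tuple, the result can also be unpacked in one step instead of indexed. A minimal sketch on a shorter, made-up snippet, with an extra step (an assumption, not part of the original function) that collapses the leftover runs of spaces:

```python
import re

snippet = '<li>Point #1</li><li>Point #2</li>'

# re.subn returns (new_string, number_of_substitutions), so unpack directly.
stripped, count = re.subn(r'<[^<]+?>', ' ', snippet)

# Collapse the runs of spaces left behind by adjacent tags.
tidy = ' '.join(stripped.split())

print(tidy)
print(count)
```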



---

## Compilation Flags

Compilation flags modify some aspects of how regular expressions work.

- [Compilation Flags Reference](https://docs.python.org/3.6/howto/regex.html#compilation-flags).
- Passed to `re.compile`, or as the `flags` argument of module-level functions such as `re.search`, `re.findall`, and `re.sub`.
- Multiple flags can be specified by bitwise OR-ing them; `re.I | re.M` sets both the `I` and `M` flags, for example.

| Flag | Meaning |
| :--- | :--- |
| ASCII, A | Makes several escapes like `\w`, `\b`, `\s` and `\d` match only on ASCII characters with the respective property. |
| DOTALL, S | Make `.` match any character, including newlines. |
| IGNORECASE, I | Do case-insensitive matches. |
| LOCALE, L | Do a locale-aware match. |
| MULTILINE, M | Multi-line matching, affecting `^` and `$`. |
| VERBOSE, X (for ‘extended’) | Enable verbose REs, which can be organized more cleanly and understandably. |
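
A short sketch of a few flags from the table in action. The sample string and variable names are illustrative only.

```python
import re

lines = 'first line\nSecond line\nthird LINE'

# Flags are combined with | and passed to re.compile ...
starts = re.compile(r'^\w+', re.MULTILINE | re.IGNORECASE)
first_words = starts.findall(lines)

# ... or through the flags argument of a module-level function.
endings = re.findall(r'line$', lines, flags=re.MULTILINE | re.IGNORECASE)

# DOTALL lets . match newlines; without it, . stops at the end of the line.
with_dotall = re.search(r'first.+third', lines, flags=re.DOTALL)
without_dotall = re.search(r'first.+third', lines)

print(first_words)
print(endings)
print(with_dotall is not None, without_dotall is not None)
```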

### Example Compilation Flag Usage:

In [95]:
# Import the regular expression module
import re

In [96]:
# Define a string to search
text = 'My friend "Oliver" is an absolutely fantastic and friendly squirrel. Oliver is very sweet.'

In [97]:
# Define a case-insensitive pattern for a run of letters directly followed by a double quote.
regex = re.compile(
    r'''
    [a-z]   # Character class a-z (also matches A-Z because of re.IGNORECASE)
    +       # Match one or more repetitions of the character class
    (?=")   # Lookahead: require a literal " next, without consuming it
    ''',
    re.IGNORECASE | re.VERBOSE
)

In [98]:
# Perform and display regex search results
search = regex.search(text)
print(search)

# Perform and display regex findall results
findall = regex.findall(text)
print(findall)

<re.Match object; span=(11, 17), match='Oliver'>
['Oliver']
