<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**Regular Expressions and the `re` Module**

&copy; Dr. Yves J. Hilpisch

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Regular Expressions

From ChatGPT:

"Regular expressions, often abbreviated as "regex" or "regexp," are a powerful tool for working with text. Think of them as a special language used for searching, matching, and manipulating strings based on specific patterns. Here's an attempt to explain regular expressions in a very simple and relatable way:

Imagine you have a magic notebook that can find, organize, or even alter sentences according to rules you write down. For example, you could ask it to find every sentence that mentions "cat" or change every "cat" to "dog". Regular expressions are like the set of rules you write in this magical notebook.

Here are some key points about regular expressions, broken down into simpler concepts:

1. **Pattern Matching:** At their core, regular expressions are about finding strings that match a pattern. For instance, if you're looking for any mention of a cat in a book, the regular expression for "cat" helps you find every place where "cat" appears.

2. **Wildcards:** Sometimes, you don't want to look for a specific word, but for a type of word or pattern. Regular expressions have special symbols that act like wildcards. For example, the dot `.` can represent any character. So, if you're looking for a three-letter word where the first letter is 'c' and the last is 't', `c.t` would match "cat", "cot", "cut", etc.

3. **Quantifiers:** These let you specify how many times something should appear. For example, if you want to find places where a character laughs in a text, you might look for "ha" followed by any number of additional "ha"s. In regex, you could write this pattern as `(ha)+`, where the parentheses group "ha" into one unit and the `+` says "the thing before me appears one or more times." Without the parentheses, `ha+` would match an "h" followed by one or more "a"s, such as "haaa".

4. **Character Classes:** These are like character wildcards on steroids. They let you search for any character in a specific set. For example, `[aeiou]` matches any vowel.

5. **Boundaries:** Regular expressions can also understand the context. For example, you might only want to find "cat" when it's a whole word, not part of another word like "catalog". There are special symbols in regex to specify these kinds of boundaries.

6. **Groups and Capture:** You can group parts of your pattern together, and even extract or manipulate parts of the matched strings based on these groups. It's like saying, "find sentences with 'cat' and tell me what words come before it."

7. **Lookahead and Lookbehind:** These are advanced features that let you include the context before or after your match in your search criteria. It's a bit like saying, "find 'cat' but only if 'dog' comes somewhere after it in the sentence."

In essence, regular expressions are a concise way to describe complex patterns for searching and manipulating text. They're a bit like a puzzle or a mini-programming language dedicated to text processing: initially challenging, but incredibly powerful once you get the hang of them."
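The seven concepts above can be tried out in a few lines of Python. The following is a minimal sketch, not a complete reference; the example sentence and all variable names are made up for illustration.

```python
import re

text = "The cat sat on a catalog while the dog cut a cot. hahaha!"

# 1. Literal pattern: every "cat", including the one inside "catalog"
literal = re.findall(r"cat", text)              # ['cat', 'cat']

# 2. Wildcard: "c", any single character, "t"
wildcard = re.findall(r"c.t", text)             # ['cat', 'cat', 'cut', 'cot']

# 3. Quantifier on a group: one or more repetitions of "ha"
laugh = re.search(r"(?:ha)+", text).group()     # 'hahaha'

# 4. Character class: the vowels in "catalog"
vowels = re.findall(r"[aeiou]", "catalog")      # ['a', 'a', 'o']

# 5. Word boundaries: "cat" only as a whole word
whole_word = re.findall(r"\bcat\b", text)       # ['cat']

# 6. Capturing group: the word right before "cat"
before_cat = re.search(r"(\w+) cat\b", text).group(1)   # 'The'

# 7. Lookahead: "cat" only if "dog" appears later in the string
cat_then_dog = re.search(r"cat(?=.*dog)", text) is not None   # True
```

The patterns are written as raw strings (`r"..."`) so that backslashes such as `\b` and `\w` reach the regex engine unchanged.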

## `re` Module 

From ChatGPT:

"Building on the previous explanation of regular expressions as a magical set of rules for finding, organizing, or altering text based on patterns, let's introduce the Python `re` module.

The `re` module in Python is like a magic wand that lets you use the power of regular expressions directly in your Python programs. It turns the abstract concept of searching and manipulating text using patterns into something you can easily do with code. Here's a quick overview of how it works and what it offers:

1. **Searching:** You can use `re.search()` to look through a string for a pattern. If the pattern is found, `re.search()` returns a Match object; otherwise, it returns `None`. It's like asking, "Is this pattern anywhere in the text?"

2. **Matching:** The `re.match()` function is similar to `re.search()`, but it only checks at the beginning of the string. It's like saying, "Does the text start with this pattern?"

3. **Finding All Matches:** With `re.findall()`, you can find all non-overlapping matches of a pattern in a string, not just the first one. It's like collecting all instances where the pattern appears in the text.

4. **Splitting:** The `re.split()` function splits a string by occurrences of a pattern, much like breaking a sentence into words using spaces as the delimiter, but more powerful because you can define any pattern as the delimiter.

5. **Replacing:** `re.sub()` allows you to replace parts of a string that match a pattern with another string. It's like saying, "Everywhere this pattern appears, put something else instead."

6. **Compiling Patterns:** If you're working with a pattern frequently, you can compile it into a regular expression object using `re.compile()`. The pattern is then parsed once up front and can be reused by name. Because the `re` module also caches recently used patterns internally, the speed gain is often modest; the main benefit is usually cleaner code.

7. **Groups and Capturing:** By using parentheses in your pattern, you can define groups that allow you to extract specific parts of the matched text for further processing or examination.

8. **Flags:** The `re` module also supports a variety of flags that modify the behavior of the patterns, like making the search case-insensitive with `re.IGNORECASE`, or allowing dot `.` to match newlines with `re.DOTALL`.

In summary, the Python `re` module equips you with a toolkit for wielding the power of regular expressions in your Python programs, making it easier to perform complex text processing tasks. Whether you're validating user input, scraping data from a webpage, or searching through large text files, the `re` module provides a flexible and efficient way to get the job done."
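As a rough illustration of the functions listed above, the following sketch applies each of them to one short string. The string `line` and the variable names are hypothetical and not part of the course material.

```python
import re

line = "Order 66: 3 apples, 12 pears"

# search: first match anywhere in the string (Match object or None)
first_number = re.search(r"\d+", line).group()       # '66'

# match: succeeds only if the pattern matches at the start of the string
starts_with_digits = re.match(r"\d+", line)          # None ("Order" comes first)

# findall: all non-overlapping matches as a list of strings
numbers = re.findall(r"\d+", line)                   # ['66', '3', '12']

# split: a comma or colon plus optional whitespace as the delimiter
parts = re.split(r"[,:]\s*", line)                   # ['Order 66', '3 apples', '12 pears']

# sub: replace every match with another string
masked = re.sub(r"\d+", "#", line)                   # 'Order #: # apples, # pears'

# compile + groups: reuse a pattern object and capture parts of each match
item = re.compile(r"(\d+) (\w+)")
pairs = item.findall(line)                           # [('3', 'apples'), ('12', 'pears')]

# flags: case-insensitive search
has_order = re.search(r"ORDER", line, flags=re.IGNORECASE) is not None   # True
```

Note that `re.findall()` returns tuples instead of plain strings as soon as the pattern contains more than one capturing group.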

## First Steps

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
s = '''This is a Python multiline string object.
Python is pretty good at processing such string objects.
If we add regular expressions to the Python mix, it becomes
even more powerful.'''

In [None]:
s.find('Python')

In [None]:
s.lower().find('python')

In [None]:
s.replace('Python', 'C++')

In [None]:
s.rfind('Python')

In [None]:
s[11:].find('Python')

In [None]:
import re

In [None]:
m = re.search('Python', s)
m

In [None]:
s[10:16]

In [None]:
m.start()

In [None]:
m.end()

In [None]:
s[m.start():m.end()]
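The match object also exposes the same information directly, which avoids slicing by hand. A small sketch, using only the first line of the string from above as an assumed stand-in for `s`:

```python
import re

# Shortened stand-in for the multiline string s used above
s_short = 'This is a Python multiline string object.'

m_short = re.search('Python', s_short)

span = m_short.span()       # (start, end) as one tuple: (10, 16)
matched = m_short.group()   # the matched text itself: 'Python'
```

`m.group()` returns the same text as `s[m.start():m.end()]`.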

In [None]:
m = re.search('pYThon', s, flags=re.IGNORECASE)
m

In [None]:
m = re.search('p....n', s, flags=re.IGNORECASE)
m

In [None]:
re.search('.at', 'The cat wears a hat.')

In [None]:
re.findall('Python', s)

In [None]:
re.findall('pYTHOn', s, flags=re.IGNORECASE)

In [None]:
re.findall('.at', 'The cat wears a hat.')

In [None]:
s.split()[:6]

In [None]:
re.split(r'\s', s)[:6]  # raw string, so \s reaches the regex engine unchanged
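The two splits above return the same first six tokens because those words are separated by exactly one whitespace character each. The results differ once whitespace comes in runs. A minimal sketch with a made-up string `sample`:

```python
import re

sample = "one  two\n\nthree"   # two spaces, then two newlines

plain = sample.split()              # ['one', 'two', 'three']
single = re.split(r"\s", sample)    # ['one', '', 'two', '', 'three']
runs = re.split(r"\s+", sample)     # ['one', 'two', 'three']
```

`str.split()` without arguments treats any run of whitespace as one delimiter, whereas `\s` matches exactly one whitespace character; `\s+` brings the regex version closer to the string method.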

## Important Patterns 

In [None]:
t = '100 plus 50 minus 20 equals 130 times 10 equals 1300.'

In [None]:
re.search(r'\d', t)  # searches the whole string (stops at the first hit)

In [None]:
re.search(r'\d+', t)

In [None]:
t.startswith('100')

In [None]:
re.match(r'\d', t)  # "searches" only at the start of the string

In [None]:
re.match(r'\d+', t)

In [None]:
re.findall(r'\d', t)[:6]

In [None]:
re.findall(r'\d+', t)[:6]

In [None]:
re.findall(r'\d{2}', t)[:6]

In [None]:
re.findall(r'\d{3}', t)[:6]

In [None]:
re.findall(r'\d{4}', t)[:6]

In [None]:
re.findall(r'\d{3,4}', t)[:6]

In [None]:
re.findall(r'\d{2,3}', t)[:6]

In [None]:
re.findall(r'\d{2,4}', t)[:6]
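One detail worth noting in the cells above: `\d{3}` also finds three digits inside a longer number, because `re.findall()` scans from left to right and does not care what surrounds a match. The sketch below contrasts this with a version anchored by word boundaries (`\b`); the variable names are chosen for illustration only.

```python
import re

t = '100 plus 50 minus 20 equals 130 times 10 equals 1300.'

# Three digits anywhere, so '1300' contributes '130'
loose = re.findall(r'\d{3}', t)        # ['100', '130', '130']

# Exactly three digits as a stand-alone number
exact = re.findall(r'\b\d{3}\b', t)    # ['100', '130']

# {2,3} is greedy: it takes three digits where available, otherwise two
greedy = re.findall(r'\d{2,3}', t)     # ['100', '50', '20', '130', '10', '130']
```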

## Use Case

In [None]:
url = 'https://hilpisch.com/walden.txt'

In [None]:
import requests

In [None]:
text = requests.get(url).text

In [None]:
print(text[200:800])

In [None]:
re.search(r'\d{3}', text)

In [None]:
%time re.findall(r'\d{4}', text)
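A plain `\d{4}` also picks up the first four digits of longer numbers. If the goal is to find candidate years, a compiled pattern with word boundaries is one possible refinement. The sketch below works on a short, hypothetical `sample` string instead of the downloaded text, and the year range 1500 to 2099 is an assumption made for illustration.

```python
import re

# Hypothetical stand-in for the downloaded text; any string works here
sample = "Published in 1854. Printed 12345 copies; reissued in 1910 and 2004."

# Naive pattern: any four consecutive digits
naive = re.findall(r"\d{4}", sample)   # ['1854', '1234', '1910', '2004']

# Compiled pattern: stand-alone four-digit numbers from 1500 to 2099 (assumed range)
year = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")
years = year.findall(sample)           # ['1854', '1910', '2004']
```

The compiled object `year` can then be applied to the full text with `year.findall(text)`.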
