# Regular Expression

## What is it?

>A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
<span style="margin-top: 0em; float: right;">-- <a href="https://en.wikipedia.org/wiki/Regular_expression">Wikipedia</a></span>

## Why do we need it?

<span style="font-size: large">Task: find all valid IP addresses in a 10 GB text file.</span>

How do you plan to do that?

How would you even describe the rules that make an IP address valid?

<span style="font-size: large">A more relevant task: process the movie names in the IMDb data provided in project 4 so that Python can tell when two movie names refer to the same movie.</span>

Why is this not trivial? Here is a tiny sample from `actress_movies.txt`:

|                      |                                 |                                |                                                                                    |                                                                               |                                                                               |     |
|----------------------|---------------------------------|--------------------------------|------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|-----|
| ***Marman, Lexi***         | **Baby I Try for You**              | Finding June (2013)            | **Modus Operandi (2010)  {{SUSPENDED}}**                                               | **Raw (2011/I)**                                                                  | This Is Normal (2013)                                                         | ... |
| ***Orlova, Marina (III)*** | 2-Assa-2 (2008)                 | **Ironiya lyubvi (2010)  (voice)** | Naturshchitsa (2007)                                                               | Not So Young (2013)                                                           | **Stilyagi (2008)  (uncredited)**                                                 | ... |
| ***Piesse, Bonnie***       |Like Ingrid Bergman (2012)| **Love Eterne**                    | My Saga (2016)                                                                     | **Star Wars: Episode II - Attack of the Clones (2002)  (as Bonnie Maree Piesse)** | **Star Wars: Episode III - Revenge of the Sith (2005)  (as Bonnie Maree Piesse)** | ... |
| ***Portman, Natalie***     | **Isabella V (2010) {{SUSPENDED}}** | **Jackie (????)**                  | **Lego Star Wars: The Video Game (2005) (VG)  (voice) (archive footage) (uncredited)** | **Live Free or Die Hard (Project 12, 8/12) (2011)  (archive footage)**            | Star Wars: Episode II - Attack of the Clones (2002)                           | ... |

Most movies in the dataset have names of the form `title (year)`, but there are too many exceptions to ignore; they are shown in bold in the table.

**With regular expressions, we can address such problems.**

## How does it work?

Let's set the tasks aside for a moment and learn some basics of regular expressions (regex).

https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

Now try to write a regular expression that matches four numbers of one to three digits each, separated by `.`
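
One possible answer to the exercise, as a minimal sketch: it assumes each number has one to three digits, and the names `naive_ip` and `candidates` are made up for illustration. `re.fullmatch` requires the whole string to match the pattern.

```python
import re

# Hypothetical first attempt: four runs of 1-3 digits separated by literal dots.
naive_ip = re.compile(r'\d{1,3}(?:\.\d{1,3}){3}')

candidates = ['192.168.1.1', '999.999.999.999', '1.2.3']
for text in candidates:
    # fullmatch returns a match object or None; bool() turns that into True/False
    print(text, bool(naive_ip.fullmatch(text)))
```

This pattern accepts `999.999.999.999`, so counting digits alone does not constrain the range of each number.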

However, a valid IP address does not allow any of the four numbers to be larger than 255. One pattern that enforces this:

`((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)`
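
To check the pattern above, a small sketch can run it against a few hand-picked strings. The names `OCTET`, `IP_PATTERN`, and `is_valid_ip` are hypothetical helpers, and the groups are written as non-capturing `(?:...)` only to keep the example tidy.

```python
import re

# One number in the range 0-255, copied from the pattern above:
#   2[0-4]\d    -> 200-249
#   25[0-5]     -> 250-255
#   [01]?\d\d?  -> 0-199 (leading zeros such as "01" are also accepted)
OCTET = r'(?:2[0-4]\d|25[0-5]|[01]?\d\d?)'
IP_PATTERN = re.compile(rf'(?:{OCTET}\.){{3}}{OCTET}')

def is_valid_ip(text):
    """Return True if the whole string is four dot-separated numbers, each <= 255."""
    return IP_PATTERN.fullmatch(text) is not None

for text in ['192.168.1.1', '255.255.255.255', '256.1.1.1', '1.2.3']:
    print(text, is_valid_ip(text))
```

Using `fullmatch` here is a design choice for validating a single string; scanning a large file for embedded addresses would need `search` or `finditer` plus some care about digits adjacent to the match.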

### `re` module in Python

https://docs.python.org/3/howto/regex.html#regex-howto

In [4]:
import re
print(f"re.match(r'[ab]', 'cbb'):\t{re.match(r'[ab]', 'cbb')}")     # Check matching from the beginning of the string
print(f"re.match(r'[ab]', 'bb'):\t{re.match(r'[ab]', 'bb')}")     # Matches the 1st letter and done
print(f"re.match(r'[ab]', 'ab'):\t{re.match(r'[ab]', 'ab')}")     # Same as above
print(f"re.match(r'[ab]+', 'ab'):\t{re.match(r'[ab]+', 'ab')}")   # The pattern can match multiple letters

re.match(r'[ab]', 'cbb'):	None
re.match(r'[ab]', 'bb'):	<_sre.SRE_Match object; span=(0, 1), match='b'>
re.match(r'[ab]', 'ab'):	<_sre.SRE_Match object; span=(0, 1), match='a'>
re.match(r'[ab]+', 'ab'):	<_sre.SRE_Match object; span=(0, 2), match='ab'>


In [3]:
print(f"re.search(r'[ab]', 'cbb'):\t{re.search(r'[ab]', 'cbb')}")   # Find the first occurrence

re.search(r'[ab]', 'cbb'):	<_sre.SRE_Match object; span=(1, 2), match='b'>


In [36]:
print(f"re.findall(r'[ab]', 'ab'):\t{re.findall(r'[ab]', 'ab')}") # Find all occurrences

re.findall(r'[ab]', 'ab'):	['a', 'b']
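
`re.findall` returns plain strings. When the position of every occurrence matters as well, `re.finditer` yields one match object per occurrence. A minimal sketch, with a sample string made up for illustration:

```python
import re

text = 'cabbac b'
# Each match object carries both the matched text and its (start, end) span
for m in re.finditer(r'[ab]+', text):
    print(m.group(), m.span())

# → abba (1, 5)
# → b (7, 8)
```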


#### Grouping

In [27]:
p = re.compile('(a(b)c)d')
m = p.match('abcde')
m.group(0)

'abcd'

In [19]:
m.group(1)

'abc'

In [20]:
m.group(2)

'b'

In [29]:
# named group
p = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
p.search('Paris in the the spring').group()

'the the'

In [32]:
p.search('Paris in the the spring').group('word')

'the'
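
Named groups are a natural fit for the movie-name task from earlier. The sketch below handles only the common `title (year)` form; the pattern and the name `TITLE_YEAR` are assumptions for illustration, not the project's required solution.

```python
import re

# Hypothetical pattern for the common `title (year)` form only
TITLE_YEAR = re.compile(r'(?P<title>.+?)\s*\((?P<year>\d{4})\)')

for name in ['Finding June (2013)',
             'Stilyagi (2008)  (uncredited)',
             'Baby I Try for You']:
    m = TITLE_YEAR.match(name)
    # groupdict() maps each group name to the text it captured
    print(m.groupdict() if m else None)
```

Names without a four-digit year, such as `Baby I Try for You` or `Jackie (????)`, do not match this pattern and would need their own handling.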

#### Substitution

In [None]:
# re.VERBOSE ignores whitespace inside the pattern, so it can be spaced out for readability
p = re.compile(r'section{ (?P<name> [^}]* ) }', re.VERBOSE)

In [11]:
p.match('section{First}').groupdict()

{'name': 'First'}

In [3]:
p.sub(r'subsection{\1}','section{First}')

'subsection{First}'

In [4]:
p.sub(r'subsection{\g<1>}','section{First}')

'subsection{First}'

In [5]:
p.sub(r'subsection{\g<name>}','section{First}')

'subsection{First}'
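
Substitution can also be applied to the movie names shown earlier. The following is only a sketch under stated assumptions: it treats `(2010)`, `(????)`, and `(2011/I)` as the year token, keeps that token as-is, and drops everything after it. The names `YEAR_AND_REST` and `normalize_title` are hypothetical, and the real dataset may contain forms this pattern misses.

```python
import re

# Hypothetical pattern: a year token such as (2010), (????) or (2011/I),
# followed by whatever trailing annotations the name carries.
YEAR_AND_REST = re.compile(r'(?P<year>\((?:\d{4}|\?{4})(?:/[IVXLCDM]+)?\)).*$')

def normalize_title(name):
    """Drop everything after the first year token; names without one pass through."""
    return YEAR_AND_REST.sub(r'\g<year>', name).strip()

samples = [
    'Modus Operandi (2010)  {{SUSPENDED}}',
    'Raw (2011/I)',
    'Lego Star Wars: The Video Game (2005) (VG)  (voice) (archive footage) (uncredited)',
    'Live Free or Die Hard (Project 12, 8/12) (2011)  (archive footage)',
    'Baby I Try for You',
]
for name in samples:
    print(normalize_title(name))
```

With this normalization, `Star Wars: Episode II - Attack of the Clones (2002)  (as Bonnie Maree Piesse)` and `Star Wars: Episode II - Attack of the Clones (2002)` reduce to the same string, which is what the comparison task needs.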