### 1. What exactly is a Regular Expression?

A regular expression, often called a pattern, is an expression used to specify a set of strings required for a particular purpose.

A simple way to specify a finite set of strings is to list its elements or members.
For example {file, file1, file2}.

However, there are often more concise ways to specify the desired set of strings.
For example, the set {file, file1, file2} can be specified by the pattern file(1|2)?.
We say that this pattern matches each of the three strings. Wanna check?

In most formalisms, if there exists at least one regular expression that matches a particular set then there exists an infinite number of other regular expressions that also match it, i.e. the specification is not unique.
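For example, each string in the set {file, file1, file2} is also matched by the pattern file\d?, although that pattern is broader and additionally matches file0, file3, and so on. An exactly equivalent pattern is file[12]?.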
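To check the claims above, here is a minimal sketch using Python's `re` module. The variable names (`strings`, `narrow`, `broad`) are illustrative only, and `fullmatch()` is used so that a pattern must cover the entire string.

```python
import re

strings = ["file", "file1", "file2"]

# fullmatch() succeeds only if the whole string matches the pattern.
narrow = re.compile(r"file(1|2)?")
broad = re.compile(r"file\d?")

narrow_ok = all(narrow.fullmatch(s) for s in strings)
broad_ok = all(broad.fullmatch(s) for s in strings)
print(narrow_ok, broad_ok)  # True True

# file\d? accepts any single trailing digit, so it also matches "file7".
print(narrow.fullmatch("file7") is None)     # True
print(broad.fullmatch("file7") is not None)  # True
```

Both patterns match all three strings, but only file(1|2)? describes exactly that set.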

### 2. Uses of Regular Expressions

Some important uses of regular expressions are:

Check whether an input conforms to a given pattern; for example, check whether a value entered in an HTML form is a valid e-mail address.

Look for occurrences of a pattern in a piece of text; for example, check in a single scan whether either the word "color" or the word "colour" appears in a document.

Extract specific portions of a text; for example, extract the postal code from an address.

Replace portions of a text; for example, replace every occurrence of "color" or "colour" with "red".

Split a larger text into smaller pieces; for example, split a text at every dot, comma, or newline character.
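The five uses listed above can be sketched with Python's `re` module as follows. The sample strings, the simplistic e-mail pattern, and the five-digit postal code format are assumptions chosen for illustration; real-world validation usually needs more care.

```python
import re

# 1. Validate: a deliberately simplistic e-mail check (real validation is harder).
is_email = re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", "user@example.com") is not None

# 2. Search: "colou?r" covers both spellings in one scan.
found = re.search(r"colou?r", "My favourite colour is blue") is not None

# 3. Extract: pull a five-digit postal code out of an address.
postal = re.search(r"\b\d{5}\b", "221 Main St, Springfield 62704").group()

# 4. Replace: swap either spelling for "red".
replaced = re.sub(r"colou?r", "red", "color and colour")

# 5. Split: break the text at every dot, comma, or newline.
parts = re.split(r"[.,\n]", "a.b,c\nd")

print(is_email, found)  # True True
print(postal)           # 62704
print(replaced)         # red and red
print(parts)            # ['a', 'b', 'c', 'd']
```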


### 3. Understanding the Regular Expression Syntax
A regex pattern is a sequence of characters. Its components are:

literals (ordinary characters): these characters carry no special meaning and are matched as is.

metacharacters (special characters): these characters carry a special meaning and are processed in a special way.



Let's start with a simple example.

Consider that we have got the list of several filenames in a folder.

file1.xml

file1.txt

file2.txt

file15.xml

file5.docx

file60.txt

file5.txt

We want to keep only those filenames that follow a specific pattern, i.e. file<one or more digits>.txt.

Let's try this in RegExr, an online tool for learning, building, and testing regular expressions.

So, the regular expression we need here is:  file\d+\.txt

This expression can be understood as follows:

file is a substring of literals, which is matched against the input as is.

\d is a metacharacter that instructs the engine to match a digit (0-9) at this position.

+ is also a metacharacter; it instructs the engine to match one or more repetitions of the preceding element (\d in this case).

\. is a literal dot. On its own, . is a metacharacter, but here we want to use it as a literal, so we escape it with the \ character.

txt is a substring of literals, which is matched against the input as is.
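The same filtering can be tried in Python. This is a minimal sketch: the list `filenames` copies the names shown earlier, `matching` is an illustrative name, and `fullmatch()` is used so that the whole filename has to fit the pattern.

```python
import re

filenames = [
    "file1.xml", "file1.txt", "file2.txt", "file15.xml",
    "file5.docx", "file60.txt", "file5.txt",
]

# Raw string, so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"file\d+\.txt")

# Keep only the names that match the pattern from start to end.
matching = [name for name in filenames if pattern.fullmatch(name)]
print(matching)  # ['file1.txt', 'file2.txt', 'file60.txt', 'file5.txt']
```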

In [1]:
import re

## 1. Compiling Regular Expressions
Regular expressions are compiled into Pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

re.compile(pattern, flags=0)

The regular expression is passed to re.compile() as a string.
Regular expressions are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them.

In [2]:
pattern = re.compile("hello")

In [4]:
pattern

re.compile(r'hello', re.UNICODE)

In [5]:
type(pattern)

re.Pattern

In [8]:
pattern = re.compile("hello", flags=re.I)

In [9]:
pattern

re.compile(r'hello', re.IGNORECASE|re.UNICODE)
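To see what the flags argument changes, the following sketch compares a case-sensitive and a case-insensitive version of the same pattern. The names `case_sensitive` and `case_insensitive` are illustrative only.

```python
import re

case_sensitive = re.compile("hello")
case_insensitive = re.compile("hello", flags=re.I)  # re.I is short for re.IGNORECASE

strict_result = case_sensitive.match("HELLO world")
relaxed_result = case_insensitive.match("HELLO world")

print(strict_result)           # None
print(relaxed_result.group())  # HELLO
```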

## 2. Performing Matches
So far, we have created a Pattern object representing a compiled regular expression using the re.compile() function.

Pattern objects have several methods and attributes.

Here is the list of different methods used for performing matches:



match() - Determine if the RE matches at the beginning of the string.

search() - Scan through a string, looking for any location where this RE matches.

findall() - Find all substrings where the RE matches, and return them as a list.

finditer() - Find all substrings where the RE matches, and return them as an iterator.

In [10]:
pattern = re.compile("hello")
match = pattern.match("hello world")

In [11]:
type(match)

re.Match

In [12]:
match.span()

(0, 5)

match(string[, pos[, endpos]])

By default, a match is checked only at the beginning of the string.

Checking starts at index pos of the string (default is 0).

Checking stops just before index endpos. By default, endpos is a very large integer, so the whole string is checked.

Returns None if no match is found.

If a match is found, a Match object is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In [13]:
match = pattern.match("ello hello world")

In [16]:
type(match)

NoneType

In [17]:
match = pattern.match("ello hello world" , pos=5)

In [18]:
match

<re.Match object; span=(5, 10), match='hello'>

In [19]:
type(match)

re.Match

In [20]:
match.end()

10

In [21]:
pattern.match("hello", endpos=4) is None

True
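Since match() may return None, it helps to test the result before using it. The sketch below reads the main pieces of information from a Match object through its group(), start(), end(), and span() methods; the variable names are illustrative.

```python
import re

pattern = re.compile("hello")
match = pattern.match("ello hello world", pos=5)

if match is not None:      # match() returns None when nothing matches
    matched_text = match.group()  # the substring that matched
    print(matched_text)    # hello
    print(match.start())   # 5
    print(match.end())     # 10
    print(match.span())    # (5, 10)
```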


search(string[, pos[, endpos]])

A match is looked for throughout the string, not only at the beginning.

pos and endpos behave the same way as in the match() method.

Returns None if no match is found.

If a match is found, a Match object for the first (leftmost) match is returned.

In [23]:
pattern.search("say hello")

<re.Match object; span=(4, 9), match='hello'>

In [24]:

pattern.search("say hello hello")

<re.Match object; span=(4, 9), match='hello'>


findall(string[, pos[, endpos]])

Finds all non-overlapping matches of the pattern in the string and returns them as a list.

pos and endpos behave the same way as in the match() and search() methods.

In [26]:
pattern.findall("say hello hello")

['hello', 'hello']

In [27]:
pattern = re.compile(r"\d")

In [28]:
pattern.findall("1,2,3,4,5,5")

['1', '2', '3', '4', '5', '5']

finditer(string[, pos[, endpos]])

Finds all non-overlapping matches of the pattern in the string and returns them as an iterator of Match objects.

pos and endpos behave the same way as in the match(), search() and findall() methods.

In [32]:
pattern = re.compile("hello")
match_iter = pattern.finditer("say hello hello")

In [33]:
type(match_iter)

callable_iterator

In [34]:
next(match_iter)

<re.Match object; span=(4, 9), match='hello'>

In [35]:
next(match_iter)

<re.Match object; span=(10, 15), match='hello'>

In [36]:
next(match_iter)

StopIteration: 

In [39]:
matches = pattern.finditer("say hello hello")
for match in matches:
    print(match)

<re.Match object; span=(4, 9), match='hello'>
<re.Match object; span=(10, 15), match='hello'>
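The practical difference between findall() and finditer() is what you get back: plain strings versus Match objects that also carry positions. A small sketch, with a made-up sample text and illustrative variable names:

```python
import re

pattern = re.compile(r"\d+")
text = "order 12, item 345"

# findall() returns only the matched substrings.
numbers = pattern.findall(text)
print(numbers)  # ['12', '345']

# finditer() yields Match objects, so the positions are available too.
spans = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(spans)    # [('12', (6, 8)), ('345', (15, 18))]
```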


It is not mandatory to create a Pattern object explicitly with re.compile() in order to perform a regex operation.

You can directly use module-level functions such as:

re.match(pattern, string, flags=0)

re.search(pattern, string, flags=0)

re.findall(pattern, string, flags=0)

re.finditer(pattern, string, flags=0)

and so on.

In a module-level function, you simply pass the regex as a string.

In [40]:

re.match("hello", "hello")

<re.Match object; span=(0, 5), match='hello'>

In [41]:

re.findall("hello", "say hello hello")

['hello', 'hello']
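As a quick check that both routes behave the same, the sketch below runs one search through a compiled Pattern object and through the module-level function, passing the same flag each time. The sample text and variable names are illustrative.

```python
import re

text = "Say HELLO, then hello again"

# Route 1: compile once, then call the method on the Pattern object.
compiled = re.compile("hello", flags=re.I)
via_object = compiled.findall(text)

# Route 2: pass the pattern string and flags to the module-level function.
via_module = re.findall("hello", text, flags=re.I)

print(via_module)                # ['HELLO', 'hello']
print(via_object == via_module)  # True
```

Note that the module-level functions take flags but not the pos and endpos arguments, which are available only on the Pattern methods.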

In [42]:

txt = "This book costs $15."

pattern = re.compile("$15")

pattern.search(txt)

The search above returns None: $ is a metacharacter with a special meaning for the regex engine (it matches at the end of the string), but here we want to treat it as a literal.

To treat a metacharacter as a literal, escape it with the \ character.

In [43]:
pattern = re.compile(r"\$15")

pattern.search(txt)

<re.Match object; span=(16, 19), match='$15'>
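The sketch below puts the unescaped and escaped searches side by side. It also uses two things not covered so far, so treat them as additions: a raw string (the r prefix), which keeps the backslash intact for the regex engine, and re.escape(), which builds an escaped pattern from a literal string.

```python
import re

txt = "This book costs $15."

# Unescaped, $ anchors the end of the string, so "$15" can never match.
unescaped = re.search("$15", txt)
print(unescaped)  # None

# A raw string keeps the backslash intact for the regex engine.
escaped = re.search(r"\$15", txt)
print(escaped.span())  # (16, 19)

# re.escape() adds the backslashes for you.
auto_escaped = re.search(re.escape("$15"), txt)
print(auto_escaped.group())  # $15
```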

In regular expressions, there are twelve metacharacters that should be escaped if they are to be used with their literal meaning:

Backslash \

Caret ^

Dollar sign $

Dot .

Pipe symbol |

Question mark ?

Asterisk *

Plus sign +

Opening parenthesis (

Closing parenthesis )

Opening square bracket [

Opening curly brace {
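As a rough check of the list above, the following sketch prefixes each of the twelve characters with a backslash and confirms that the result matches only that character. The names `metacharacters` and `results` are illustrative.

```python
import re

metacharacters = ["\\", "^", "$", ".", "|", "?", "*", "+", "(", ")", "[", "{"]

# Prefix each character with one backslash and match it against itself.
results = {ch: re.fullmatch("\\" + ch, ch) is not None for ch in metacharacters}
print(all(results.values()))  # True

# Unescaped, the dot matches any character; escaped, it matches only a dot.
dot_unescaped = re.fullmatch(".", "x") is not None
dot_escaped = re.fullmatch(r"\.", "x") is not None
print(dot_unescaped, dot_escaped)  # True False
```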
