In [2]:
import re

Raw string literal

A Python raw string (a literal prefixed with `r`) treats the backslash character (\\) as a literal character instead of the start of an escape sequence.
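
As a rough illustration of why raw strings matter for regular expressions, the sketch below compares an ordinary literal with a raw one. The variable names are made up for this example.

```python
import re

# In an ordinary string literal, "\n" is a single newline character.
normal = "a\nb"
# In a raw string literal, r"\n" is two characters: a backslash and 'n'.
raw = r"a\nb"

print(len(normal))  # 3
print(len(raw))     # 4

# A raw string passes r"\d" to the regex engine unchanged.
digits = re.findall(r"\d", "a1b2")
print(digits)       # ['1', '2']
```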

## Forming Patterns

### Basic Patterns: Ordinary Characters

In [2]:
pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")

Match!


The `search` function scans the string and returns a match object for the first location where the pattern matches.

The `group` method of that match object returns the string matched by the regular expression.
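
Because `search` returns `None` when nothing matches, calling `group` directly on the result can raise an `AttributeError`. A minimal sketch of a guarded call, using a made-up sentence:

```python
import re

sentence = "Let's eat cake!"

# search() returns a match object when the pattern is found...
match = re.search(r"cake", sentence)
# ...and None when it is not, so test the result before calling group().
no_match = re.search(r"pie", sentence)

found = match.group() if match else None
missing = no_match.group() if no_match else None

print(found)    # cake
print(missing)  # None
```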

### Wild Card Characters: Special Characters
. - A period. Matches any single character except the newline character.

In [11]:
re.search(r'Co.k.e', 'Cookie').group()

'Cookie'

^ - A caret. Matches the start of the string.

In [13]:
re.search(r'^Eat', "Eat cake!").group()

## However, the search below finds no match, because 'eat' is not at the start of the string. Try it for yourself:
# re.search(r'^eat', "Let's eat cake!").group()

'Eat'

$ - Matches the end of string.

In [14]:
re.search(r'cake$', "Cake! Let's eat cake").group()

## The next search returns None because 'cake' is not at the end of the string, so .group() raises an AttributeError. Try it:
# re.search(r'cake$', "Let's get some cake on our way home!").group()

'cake'

[abc] - Matches a or b or c.

[a-zA-Z0-9] - Matches any single letter (a to z or A to Z) or digit (0 to 9).

If the first character of the set is ^, the set is negated and matches any character that is not in it. For example, [^5-9] matches any character except 5, 6, 7, 8, or 9.

In [27]:
print(re.findall(r'[a-z0-6]', 'Number: 5'))
print(re.findall(r'[a-zA-Z0-9]+','Vinayak 2019110067'))

['u', 'm', 'b', 'e', 'r', '5']
['Vinayak', '2019110067']


In [33]:
print(re.search(r'Number: [^5-9]', 'Number: 10').group())

Number: 1


\ - Backslash

In [34]:
## (Scenario 1) '\s' is a special sequence that matches a single whitespace character, here the space
re.search(r'Not a\sregular character', 'Not a regular character').group()

'Not a regular character'

In [45]:
## (Scenario 2) '\n' is a recognized escape: in the raw pattern it matches a newline, and in the ordinary string literal it is a newline
re.search(r'Just a \negular character', 'Just a \negular character').group()

'Just a \negular character'

In [47]:
## (Scenario 3) '\s' is escaped with an extra `\`, so it is interpreted as a literal backslash followed by 's'
re.search(r'Just a \\sregular character', r'Just a \sregular character').group()

'Just a \\sregular character'

\w - Lowercase 'w'. Matches any single letter, digit, or underscore.

\W - Uppercase 'W'. Matches any character not part of \w (lowercase w).

In [49]:
print("Lowercase w:", re.search(r'Co\wk\we', 'Cookie').group())

## Matches any character except single letter, digit or underscore
print("Uppercase W:", re.search(r'C\Wke', 'C@ke').group())

## Uppercase W won't match single letter, digit
print("Uppercase W won't match, and return:", re.search(r'Co\Wk\We', 'Cookie'))

Lowercase w: Cookie
Uppercase W: C@ke
Uppercase W won't match, and return: None


\s - Lowercase 's'. Matches a single whitespace character such as a space, newline, tab, or carriage return.

\S - Uppercase 'S'. Matches any character not part of \s (lowercase s).

In [50]:
print("Lowercase s:", re.search(r'Eat\scake', 'Eat cake').group())
print("Uppercase S:", re.search(r'cook\Se', "Let's eat cookie").group())

Lowercase s: Eat cake
Uppercase S: cookie


\d - Lowercase d. Matches a decimal digit 0-9.

\D - Uppercase D. Matches any character that is not a decimal digit.

In [54]:
# Example for \d
print(re.search(r'\d+', '100 cookies').group())

100
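
The cell above only exercises `\d`. A small sketch of the complementary `\D` follows. The phone string is a made-up sample, not data from this notebook.

```python
import re

# \D+ matches a run of non-digit characters.
non_digits = re.findall(r'\D+', '100 cookies')
print(non_digits)   # [' cookies']

# Removing every non-digit keeps only the digits (hypothetical phone string).
digits_only = re.sub(r'\D', '', 'Phone: 555-0199')
print(digits_only)  # 5550199
```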


\t - Lowercase t. Matches a tab.

\n - Lowercase n. Matches a newline.

\r - Lowercase r. Matches a carriage return.

\A - Uppercase A. Matches only at the start of the string, even in MULTILINE mode.

\Z - Uppercase Z. Matches only at the end of the string.

TIP: ^ and \A are effectively the same, and so are $ and \Z (one difference: $ also matches just before a newline at the very end of the string). In MULTILINE mode, however, ^ and $ also match at the start and end of each line. Learn more about it in the Compilation Flags section.

\b - Lowercase b. Matches at a word boundary, that is, at the beginning or end of a word.

In [65]:
# Example for \t
print("\\t (TAB) example: ", re.search(r'Eat\tcake', 'Eat\tcake').group())

# Example for \b
print("\\b match gives: ",re.findall(r'\b[A-E]ookie', 'Cookie Bookie'))

\t (TAB) example:  Eat	cake
\b match gives:  ['Cookie', 'Bookie']
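
The tip about `^` versus `\A` (and `$` versus `\Z`) is easiest to see on a two-line string. This is a minimal sketch with a made-up sample text:

```python
import re

text = "first line\nsecond line"

# In MULTILINE mode ^ matches at the start of every line...
caret_hits = re.findall(r'^\w+', text, re.MULTILINE)
# ...while \A still matches only at the very start of the string.
a_hits = re.findall(r'\A\w+', text, re.MULTILINE)

# Likewise $ matches at the end of every line, \Z only at the end of the string.
dollar_hits = re.findall(r'\w+$', text, re.MULTILINE)
z_hits = re.findall(r'\w+\Z', text, re.MULTILINE)

print(caret_hits)   # ['first', 'second']
print(a_hits)       # ['first']
print(dollar_hits)  # ['line', 'line']
print(z_hits)       # ['line']
```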


### Repetitions

The `+` and `*` quantifiers are said to be greedy: they match as many repetitions as possible.

\+ - Checks if the preceding character appears one or more times starting from that position.

In [11]:
re.search(r'Co+kie', 'Cookie').group()

'Cookie'

\* - Checks if the preceding character appears zero or more times starting from that position.

In [12]:
# Matches 'C', then zero or more 'a', then zero or more 'o', then 'kie'
re.search(r'Ca*o*kie', 'Cookie').group()

'Cookie'

? - Checks if the preceding character appears exactly zero or one time starting from that position.

In [22]:
# Checks for zero or one occurrence of 'u' between 'Colo' and 'r'
print(re.findall(r'Colou?r', 'Colour'))
print(re.findall(r'Colou?r', 'Color'))
print(re.findall(r'Colou?r', 'Colouur'))

['Colour']
['Color']
[]


{x} - Repeat exactly x number of times.

{x,} - Repeat at least x times or more.

{x,y} - Repeat at least x times but no more than y times. In a pattern, write it without a space after the comma.

In [38]:
re.findall(r'\d{1,10}', '0987654321')

['0987654321']
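
The three brace forms behave differently on the same kind of input. A short sketch with made-up digit strings:

```python
import re

# {3}: exactly three digits per match; the leftover '78' is too short to match.
exact = re.findall(r'\d{3}', '12345678')
print(exact)     # ['123', '456']

# {2,}: two or more digits; the lone '1' is skipped.
at_least = re.findall(r'\d{2,}', '1 22 333')
print(at_least)  # ['22', '333']

# {2,3}: between two and three digits, taking three when possible (greedy).
bounded = re.findall(r'\d{2,3}', '1234567')
print(bounded)   # ['123', '456']
```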

### Grouping in Regular Expressions
The group feature of regular expressions lets you pick out parts of the matching text. Parts of a regular expression pattern enclosed in parentheses () are called groups.

In [44]:
statement = 'Please contact us at: support@datacamp.com,vinayak@vt.in'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', statement)
# Notice the parentheses: each pair creates a group
if match:
    print("Email address:", match.group()) # The whole matched text
    print("Username:", match.group(1)) # The username (group 1)
    print("Host:", match.group(2)) # The host (group 2)

Email address: support@datacamp.com
Username: support
Host: datacamp.com


In [45]:
re.findall(r'([\w\.-]+)@([\w\.-]+)', statement)

[('support', 'datacamp.com'), ('vinayak', 'vt.in')]

Angle brackets (<>) let you create named groups.<br>
The syntax for creating a named group is: (?P\<name>...)

In [47]:
statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'(?P<email>(?P<username>[\w\.-]+)@(?P<host>[\w\.-]+))', statement)
if match:
    print("Email address:", match.group('email'))
    print("Username:", match.group('username'))
    print("Host:", match.group('host'))

Email address: support@datacamp.com
Username: support
Host: datacamp.com


In [48]:
re.findall(r'(?P<email>(?P<username>[\w\.-]+)@(?P<host>[\w\.-]+))', statement)

[('support@datacamp.com', 'support', 'datacamp.com')]
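
Named groups can also be read back all at once with the match object's `groupdict()` method. A minimal sketch that reuses the group names from the example above:

```python
import re

statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'(?P<username>[\w.-]+)@(?P<host>[\w.-]+)', statement)

# groupdict() maps each group name to the text it captured.
parts = match.groupdict() if match else {}
print(parts)  # {'username': 'support', 'host': 'datacamp.com'}
```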

### Greedy vs. Non-Greedy Matching

When a quantifier matches as much of the search sequence (string) as possible, it is said to perform a "greedy match".

In [52]:
heading = r'<h1>TITLE</h1>'
re.match(r'<.*>', heading).group()
# The desired output is <h1>,
# but the pattern <.*> matched the whole
# string, right up to the last occurrence of >.

'<h1>TITLE</h1>'

Adding ? after a quantifier makes it match in a non-greedy (minimal) fashion.

In [55]:
heading  = r'<h1>TITLE</h1>'
print(re.match(r'<.*?>', heading).group())
print(re.findall(r'<.*?>', heading))

<h1>
['<h1>', '</h1>']


### Summary

<table>
<thead>
<tr>
<th style="text-align: left;">Character(s)</th>
<th style="text-align: left;">What it does</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">.</td>
<td style="text-align: left;">A period. Matches any single character except the newline character.</td>
</tr>
<tr>
<td style="text-align: left;">^</td>
<td style="text-align: left;">A caret. Matches a pattern at the start of the string.</td>
</tr>
<tr>
<td style="text-align: left;">\A</td>
<td style="text-align: left;">Uppercase A. Matches only at the start of the string.</td>
</tr>
<tr>
<td style="text-align: left;">$</td>
<td style="text-align: left;">Dollar sign. Matches the end of the string.</td>
</tr>
<tr>
<td style="text-align: left;">\Z</td>
<td style="text-align: left;">Uppercase Z. Matches only at the end of the string.</td>
</tr>
<tr>
<td style="text-align: left;">[ ]</td>
<td style="text-align: left;">Matches the set of characters you specify within it.</td>
</tr>
<tr>
<td style="text-align: left;">\</td>
<td style="text-align: left;">∙ If the character following the backslash forms a recognized escape sequence, the special meaning of that sequence is used. <br>∙ Otherwise the backslash (\) is treated like any other character and passed through. <br>∙ It can be placed in front of any metacharacter to remove its special meaning.</td>
</tr>
<tr>
<td style="text-align: left;">\w</td>
<td style="text-align: left;">Lowercase w. Matches any single letter, digit, or underscore.</td>
</tr>
<tr>
<td style="text-align: left;">\W</td>
<td style="text-align: left;">Uppercase W. Matches any character not part of <code>\w</code> (lowercase w).</td>
</tr>
<tr>
<td style="text-align: left;">\s</td>
<td style="text-align: left;">Lowercase s. Matches a single whitespace character like: space, newline, tab, return.</td>
</tr>
<tr>
<td style="text-align: left;">\S</td>
<td style="text-align: left;">Uppercase S. Matches any character not part of <code>\s</code> (lowercase s).</td>
</tr>
<tr>
<td style="text-align: left;">\d</td>
<td style="text-align: left;">Lowercase d. Matches decimal digit 0-9.</td>
</tr>
<tr>
<td style="text-align: left;">\D</td>
<td style="text-align: left;">Uppercase D. Matches any character that is not a decimal digit.</td>
</tr>
<tr>
<td style="text-align: left;">\t</td>
<td style="text-align: left;">Lowercase t. Matches tab.</td>
</tr>
<tr>
<td style="text-align: left;">\n</td>
<td style="text-align: left;">Lowercase n. Matches newline.</td>
</tr>
<tr>
<td style="text-align: left;">\r</td>
<td style="text-align: left;">Lowercase r. Matches return.</td>
</tr>
<tr>
<td style="text-align: left;">\b</td>
<td style="text-align: left;">Lowercase b. Matches only the beginning or end of the word.</td>
</tr>
<tr>
<td style="text-align: left;">+</td>
<td style="text-align: left;">Checks if the preceding character appears one or more times.</td>
</tr>
<tr>
<td style="text-align: left;">*</td>
<td style="text-align: left;">Checks if the preceding character appears zero or more times.</td>
</tr>
<tr>
<td style="text-align: left;">?</td>
<td style="text-align: left;">∙ Checks if the preceding character appears exactly zero or one time. <br>∙ Specifies a non-greedy version of +, *</td>
</tr>
<tr>
<td style="text-align: left;">{ }</td>
<td style="text-align: left;">Checks for an explicit number of times.</td>
</tr>
<tr>
<td style="text-align: left;">( )</td>
<td style="text-align: left;">Creates a group when performing matches.</td>
</tr>
<tr>
<td style="text-align: left;">(?P&lt;name&gt;...)</td>
<td style="text-align: left;">Creates a named group when performing matches.</td>
</tr>
</tbody>
</table>

## Functions Provided by 're'

### compile(pattern, flags=0)
compile() compiles a regular expression pattern into a regular expression object, which can then be reused for matching.

In [56]:
pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
pattern.search(sequence).group()

'cookie'

In [57]:
# This is equivalent to:
re.search(pattern, sequence).group()

'cookie'

### search(pattern, string, flags=0)
Scans through the given string/sequence, looking for the first location where the regular expression produces a match.<br>
Returns a match object if one is found, else None.

In [60]:
pattern = "cookie"
sequence = "Cake and cookie"

re.search(pattern, sequence)

<re.Match object; span=(9, 15), match='cookie'>

### match(pattern, string, flags=0)
Returns a corresponding match object if zero or more characters at the beginning of string match the pattern. Else it returns None

In [75]:
pattern = "C"
sequence1 = "IceCream"
sequence2 = "CCake"

# No match since "C" is not at the start of "IceCream"
print("Sequence 1: ", re.match(pattern, sequence1))
print("Sequence 2: ", re.match(pattern,sequence2).group())

Sequence 1:  None
Sequence 2:  C


#### search() versus match()
The match() function checks for a match only at the beginning of the string (by default), whereas the search() function checks for a match anywhere in the string.

In [76]:
print("Sequence 1: ", re.search(pattern, sequence1).group())
print("Sequence 2: ", re.search(pattern,sequence2).group())

Sequence 1:  C
Sequence 2:  C


### findall(pattern, string, flags=0)
Finds all non-overlapping matches in the entire sequence.<br>Returns a list of matched strings. When the pattern contains groups, the list holds the group contents instead.

In [77]:
statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# 'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address)

support@datacamp.com
xyz@datacamp.com


### finditer(pattern, string, flags=0)
Similar to findall(): it finds all the matches in the entire sequence, but returns them as an iterator of match objects.<br>
Each match object holds not only the text that matched but also its position in the original text.

In [78]:
statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# 'addresses' is an iterator that yields a match object for each match
addresses = re.finditer(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address)

<re.Match object; span=(22, 42), match='support@datacamp.com'>
<re.Match object; span=(44, 60), match='xyz@datacamp.com'>


### sub(pattern, repl, string, count=0, flags=0)
sub() is the substitute function. It returns the string obtained by replacing or substituting the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern is not found, then the string is returned unchanged.

### subn(pattern, repl, string, count=0, flags=0)
subn() is similar to sub(). However, it returns a tuple containing the new string and the number of replacements that were performed.

In [79]:
statement = "Please contact us at: xyz@datacamp.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', statement)
print(new_email_address)

Please contact us at: support@datacamp.com


In [81]:
statement = "Please contact us at: xyz@datacamp.com, vinayak@vt.in"
new_email_address = re.subn(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', statement)
print(new_email_address)

('Please contact us at: support@datacamp.com, support@datacamp.com', 2)
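
The replacement string can refer back to captured groups, and `count` limits how many matches are replaced. A minimal sketch, where `example.org` is a placeholder domain chosen for illustration:

```python
import re

statement = "Please contact us at: xyz@datacamp.com, vinayak@vt.in"
pattern = r'([\w.-]+)@([\w.-]+)'

# \1 in the replacement re-inserts group 1 (the username), so only the host changes.
all_replaced = re.sub(pattern, r'\1@example.org', statement)
print(all_replaced)  # Please contact us at: xyz@example.org, vinayak@example.org

# count=1 limits the substitution to the leftmost match.
first_only = re.sub(pattern, r'\1@example.org', statement, count=1)
print(first_only)    # Please contact us at: xyz@example.org, vinayak@vt.in
```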


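### split(pattern, string, maxsplit=0, flags=0)
Splits the string wherever the pattern matches and returns a list. The example below calls the equivalent method on a compiled pattern: pattern.split(string, maxsplit=0).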

In [93]:
statement = "Please contact us at: xyz@datacamp.com, support@datacamp.com"
pattern = re.compile(r'[:,]')

address = pattern.split(statement,maxsplit=0)
print(address)

['Please contact us at', ' xyz@datacamp.com', ' support@datacamp.com']
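
With `maxsplit=0` there is no limit on the number of splits. The sketch below shows a non-zero `maxsplit`, plus a variant pattern (an assumption of this example, not part of the cell above) that also consumes the whitespace after each separator:

```python
import re

statement = "Please contact us at: xyz@datacamp.com, support@datacamp.com"

# maxsplit=1 stops after the first split; the rest of the string stays intact.
head_and_rest = re.split(r'[:,]', statement, maxsplit=1)
print(head_and_rest)  # ['Please contact us at', ' xyz@datacamp.com, support@datacamp.com']

# Consuming the whitespace after each separator avoids the leading spaces.
trimmed = re.split(r'[:,]\s*', statement)
print(trimmed)        # ['Please contact us at', 'xyz@datacamp.com', 'support@datacamp.com']
```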


### start(), end(), span()
* `start()` - Returns the starting index of the match.
* `end()` - Returns the index where the match ends.
* `span()` - Returns a tuple containing the (start, end) positions of the match.

In [95]:
pattern = re.compile('COOKIE', re.IGNORECASE)
match = pattern.search("I am not a cookie monsterCOOKIE")

print("Start index:", match.start())
print("End index:", match.end())
print("Tuple:", match.span())

Start index: 11
End index: 17
Tuple: (11, 17)


In [99]:
for match in pattern.finditer("I am not a cookie monsterCOOKIE"):
    print("Start index:", match.start())
    print("End index:", match.end())
    print("Tuple:", match.span())
    print('----')

Start index: 11
End index: 17
Tuple: (11, 17)
----
Start index: 25
End index: 31
Tuple: (25, 31)
----


### Compilation Flags
An expression's behavior can be modified by specifying a flag value.
* IGNORECASE (I) - Allows case-insensitive matches.<br>
* DOTALL (S) - Allows . to match any character, including newline.<br>
* MULTILINE (M) - Allows the start-of-string (^) and end-of-string ($) anchors to also match at the start and end of each line.<br>
* VERBOSE (X) - Allows you to write whitespace and comments within a regular expression to make it more readable.

In [100]:
statement = "Please contact us at: support@DataCamp.com, xyz@DATACAMP.com"

# Using the VERBOSE flag helps understand complex regular expressions
pattern = re.compile(r"""
[\w\.-]+ #First part
@ #Matches @ sign within email addresses
datacamp\.com #Domain (the dot is escaped so it matches a literal '.')
""", re.X | re.I)

addresses = re.findall(pattern, statement)                       
for address in addresses:
    print("Address: ", address)

Address:  support@DataCamp.com
Address:  xyz@DATACAMP.com
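
The DOTALL flag from the list above is not exercised by that cell. A small sketch of its effect, using a made-up two-line string:

```python
import re

text = "Eat\ncake"

# By default '.' does not match a newline, so this search fails.
without_flag = re.search(r'Eat.cake', text)
print(without_flag)        # None

# With re.DOTALL (re.S), '.' matches the newline as well.
with_flag = re.search(r'Eat.cake', text, re.DOTALL)
print(repr(with_flag.group()))  # 'Eat\ncake'
```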


## Case Study: Working with Regular Expressions

In [108]:
import re
import requests
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Discards the text from Part 2 of the book onward
    stop = re.search(r"II\.", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
#print(processed_book)

Exercise: Find the number of occurrences of the word "the" in the corpus. Hint: Use the len() function.

In [110]:
# Note: r'the' also counts "the" inside longer words such as "there" or "other"
len(re.findall(r'the', processed_book))

302
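
A bare `the` pattern also matches inside longer words, so the count above may overstate the number of stand-alone occurrences. The sketch below shows the difference on a short made-up sentence rather than on the book text:

```python
import re

sample = "the other day they met the weather man"

# Substring count: also hits 'other', 'they', and 'weather'.
substring_count = len(re.findall(r'the', sample))
print(substring_count)   # 5

# Whole-word count: \b anchors the match at word boundaries.
whole_word_count = len(re.findall(r'\bthe\b', sample))
print(whole_word_count)  # 2
```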

Exercise: Try to convert every single stand-alone instance of 'i' to 'I' in the corpus. Make sure not to change the 'i' occurring within a word:

In [112]:
processed_book = re.sub(r'\si\s', " I ", processed_book)
# print(processed_book)
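
The `\si\s` pattern needs whitespace on both sides, so it misses an 'i' at the very start or end of the text or next to punctuation. One possible alternative, shown on a made-up sample string, uses word boundaries instead:

```python
import re

sample = "i think i, too, said i"

# Requires whitespace on both sides: nothing in this sample qualifies.
whitespace_version = re.sub(r'\si\s', ' I ', sample)
print(whitespace_version)  # i think i, too, said i

# \b matches at word boundaries, so edges and punctuation are handled.
boundary_version = re.sub(r'\bi\b', 'I', sample)
print(boundary_version)    # I think I, too, said I
```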

Exercise: Find the number of times anyone was quoted ("") in the corpus.

In [113]:
len(re.findall(r'”', book))

0

Exercise: What are the words connected by '--' in the corpus?

In [114]:
# The + quantifier belongs after each character class, not inside it
len(re.findall(r'[A-Za-z0-9]+--[A-Za-z0-9]+', book))

0