# Regular Expressions

- a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. 

- widely used in the Information Technology world.

- the Python module **re** provides full support for Perl-like regular expressions in Python. 

- it raises the exception re.error if an error occurs while compiling or using a regular expression.


**Contents**

- Basic Patterns: Ordinary Characters
- Wild Card Characters: Special Characters
- Repetitions
- Groups and Grouping using Regular Expressions
- Greedy vs Non-Greedy Matching
- More on the re Python Library
- `search()` versus `match()`
- Regular Expression Modifiers: Option Flags
- Cheat Sheet


**IMPORTANT**<br>
Several characters have a special meaning when they are used in a regular expression. <br>
To avoid confusion with Python's own string escapes, this tutorial writes patterns as raw strings: `r'expression'`.

## Import the library
In Python, regular expressions are supported by the re module. <br>
That means that if you want to start using them in your Python scripts, you have to import this module with the help of import:

In [1]:
import re


## Basic Patterns: Ordinary Characters
You can easily tackle many basic patterns in Python using the ordinary characters. <br>
Ordinary characters are the simplest regular expressions. <br>
They match themselves exactly and do not have a special meaning in their regular expression syntax.

Examples are 'A', 'a', 'X', '5'.

In [3]:
pattern = r"Cooker"
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else: 
    print("Not a match!")
    

Not a match!


The `match()` function returns a match object if the text matches the pattern. Otherwise it returns None. The `re` module also contains several other functions and you will learn some of them later on in the tutorial. For now, though, let's focus on ordinary characters.


Do you notice the `r` at the start of the pattern `Cooker`?

This is called a raw string literal. It changes how the string literal is interpreted. Such literals are stored as they appear.

For example, `\` is just a backslash when the literal is prefixed with `r`, rather than the start of an escape sequence. You will see what this means with special characters. Regular expression syntax often involves backslash-escaped characters, and the raw `r` prefix prevents Python from interpreting them as string escape sequences first. You don't actually need it for this example, but it is good practice to use it for consistency.
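
To make the effect of the `r` prefix concrete, here is a small sketch. The sample strings are made up for illustration; the point is only to compare what Python stores for a normal literal and for a raw one, and what the `re` module then receives.

```python
import re

# A normal literal turns the two characters \t into one tab character;
# the raw literal keeps the backslash and the letter t as they appear.
normal = "\t"
raw = r"\t"
print(len(normal), len(raw))  # 1 2

# \b is where the difference bites: in a normal literal it is a backspace
# character, in a raw literal it reaches re as the word-boundary sequence.
no_prefix = re.search("\bcake", "Eat cake")
with_prefix = re.search(r"\bcake", "Eat cake")
print(no_prefix)            # None
print(with_prefix.group())  # cake
```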

## Wild Card Characters: Special Characters

Special characters are characters which do not match themselves as seen but actually have a special meaning when used in a regular expression.

The most widely used special characters are:

`.` - A period. Matches any single character except a newline character.

In [8]:
re.search(r'Co.k.e', 'Cookie').group()

'Cookie'

The `group()` function returns the string matched by the re. You will see this function in more detail later.

`\w` - Lowercase w. Matches any single letter, digit or underscore.

In [18]:
re.search(r'Co\wk\we', 'Cookie').group()

'Cookie'

`\W` - Uppercase w. Matches any character not part of `\w` (lowercase w).

In [21]:
re.search(r'C\Wke', 'C@ke').group()

'C@ke'

`\s` - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.

re.search(r'Eat\scake', 'Eat cake').group()

`\S` - Uppercase s. Matches any character not part of `\s` (lowercase s).

In [26]:
re.search(r'Cook\Se', 'Cookie').group()

'Cookie'

`\t` - Lowercase t. Matches tab.

In [None]:
re.search(r'Eat\tcake', 'Eat	cake').group()

`\n` - Lowercase n. Matches newline.

`\r` - Lowercase r. Matches carriage return.

`\d` - Lowercase d. Matches decimal digit 0-9.

In [None]:
re.search(r'c\d\dkie', 'c00kie').group()

`^` - Caret. Matches a pattern at the start of the string.

In [9]:
re.search(r'^Eat', 'Eat cake').group()

'Eat'

`$` - Matches a pattern at the end of string.

In [10]:
re.search(r'cake$', 'Eat cake').group()

'cake'

`[abc]`<br>
- Matches a or b or c.

`[a-zA-Z0-9]` <br>
- Matches any single letter (a to z or A to Z) or digit (0 to 9). <br>
- Characters that are not within a range can be matched by complementing the set. <br>
- If the first character of the set is ^, all the characters that are not in the set will be matched.

In [11]:
re.search(r'Number: [0-6]', 'Number: 5').group()

'Number: 5'

In [12]:
# Matches any character except 5
re.search(r'Number: [^5]', 'Number: 0').group()

'Number: 0'

`\A` - Uppercase a. Matches only at the start of the string. Unlike `^`, it keeps this meaning in `MULTILINE` mode, where `^` also matches at the start of each line.

In [13]:
re.search(r'\A[A-E]ookie', 'Cookie').group()

'Cookie'

`\b` - Lowercase b. Matches only the beginning or end of the word.

In [16]:
re.search(r'\b[A-E]ookie', 'this is a Cookie').group()

'Cookie'

`\` - Backslash. In a regular expression, a backslash followed by certain letters forms a special sequence with its own meaning; for example, `\n` matches a newline and `\d` matches a digit. A backslash followed by a special character removes that character's special meaning so that it is matched literally; for example, `\.` matches a period and `\\` matches a single backslash.


In [40]:
# '\\' in the pattern matches one literal backslash, so '\\+' matches one or more of them
print(re.search(r'Back\\+stail', r'Back\\\stail').group())

# Here the backslash is not escaped, so '\s' is the whitespace class and matches the space
print(re.search(r'Back\stail', 'Back tail').group())


Back\\\stail
Back tail


## Repetitions
It becomes quite tedious if you are looking to find long patterns in a sequence. Fortunately, the `re` module handles repetitions using the following special characters:

`+` - Matches one or more repetitions of the element to its left.

In [40]:
re.search(r'Co+kie', 'Cooookie').group()

'Cooookie'

`*` - Matches zero or more repetitions of the element to its left.


In [44]:
re.search(r'Ca*o*kie', 'Cakie').group()

'Cakie'

`?` - Matches zero or one occurrence of the element to its left.

In [49]:
re.search(r'Ca?o?kie', 'Caokie').group()

'Caokie'

But what if you want to check for an exact number of repetitions?

For example, checking the validity of a phone number in an application. The `re` module handles this as well, using the following syntax:

`{x}` - Repeat exactly x number of times.

`{x,}` - Repeat at least x times or more.

`{x,y}` - Repeat at least x times but no more than y times. Do not put a space after the comma; with a space, Python treats the braces as literal text.

In [47]:
re.search(r'^\d{9,10}$', '9161845789').group()

'9161845789'
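
The three forms can be compared side by side. The sketch below uses made-up digit strings and `re.fullmatch()`, a function this tutorial has not introduced; it succeeds only when the whole string fits the pattern, which saves writing `^` and `$`.

```python
import re

samples = ["12", "123", "1234", "12345"]

# fullmatch() succeeds only when the entire string matches the pattern.
exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
at_least_three = [s for s in samples if re.fullmatch(r"\d{3,}", s)]
three_or_four = [s for s in samples if re.fullmatch(r"\d{3,4}", s)]

print(exactly_three)   # ['123']
print(at_least_three)  # ['123', '1234', '12345']
print(three_or_four)   # ['123', '1234']
```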

The `+`, `*`, `?` and `{x,y}` quantifiers are said to be **greedy**: they match as many repetitions as possible.

## Groups and Grouping using Regular Expressions


Suppose that you are validating email addresses and want to check the user name and host separately.

This is when the group feature of regular expression comes in handy. It allows you to pick up parts of the matching text.

Parts of a regular expression pattern bounded by parentheses `()` are called **groups**. 

The parentheses do not change what the expression matches, but rather form groups within the matched sequence. You have been using the `group()` function all along in this tutorial's examples. The plain `match.group()` without any argument still returns the whole matched text as usual.



In [72]:
email_address = 'Please contact us at: support@somewhere.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)

print(match.group()) # The whole matched text
print(match.group(1)) # The username (group 1)
print(match.group(2)) # The host (group 2)

support@somewhere.com
support
somewhere.com
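
If you want all groups at once, the match object also offers `groups()`, which this tutorial has not covered so far. A minimal sketch, reusing the email example and guarding against the case where nothing matches:

```python
import re

email_address = 'Please contact us at: support@somewhere.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)

if match:
    # groups() returns every captured group as a tuple, in order.
    username, host = match.groups()
    print(username)  # support
    print(host)      # somewhere.com
else:
    print("No email address found")
```

Checking `if match:` first matters because `search()` returns None when nothing is found, and calling `groups()` on None would raise an error.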


## Greedy vs Non-Greedy Matching

When a quantifier matches as much of the search sequence (string) as possible, it is said to be a **"Greedy Match"**. 

It is the normal behaviour of a regular expression but sometimes this behaviour is not desired:

In [65]:
heading  = r'<h1>TITLE</h1>'
re.match(r'<.*>', heading).group()

'<h1>TITLE</h1>'

The pattern `<.*>` matched the whole string, right up to the last occurrence of `>`.

However, if you only wanted to match the first `<h1>` tag, you could have used the non-greedy quantifier `*?`, which matches as little text as possible.

Adding `?` after the quantifier makes it perform the match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched. When you run `<.*?>`, you will only get a match with `<h1>`.

In [66]:
heading  = r'<h1>TITLE</h1>'
re.match(r'<.*?>', heading).group()

'<h1>'
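
The difference is easier to see when a string contains several tags. The sketch below uses a made-up HTML snippet and `findall()`, which is described later in this tutorial, to list everything each pattern matches.

```python
import re

html = '<h1>TITLE</h1><p>body</p>'

# Greedy: .* runs to the last '>' in the string, so there is one big match.
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first '>' it can, so each tag is matched alone.
lazy = re.findall(r'<.*?>', html)

print(greedy)  # ['<h1>TITLE</h1><p>body</p>']
print(lazy)    # ['<h1>', '</h1>', '<p>', '</p>']
```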

## More on the re Python Library

The `re` library in Python provides several functions that make it a skill worth mastering. You have already seen some of them, such as `re.search()` and `re.match()`. 

### Search
`search(pattern, string, flags=0)`<br>
With this function, you scan through the given string/sequence looking for the first location where the regular expression produces a match. It returns a corresponding match object if found, else returns None if no position in the string matches the pattern. Note that None is different from finding a zero-length match at some point in the string.

In [50]:
pattern = "cookie"
sequence = "Cake and cookie"

re.search(pattern, sequence)

<re.Match object; span=(9, 15), match='cookie'>

### Match 
`match(pattern, string, flags=0)`<br>
Returns a corresponding match object if zero or more characters at the beginning of string match the pattern.<br>
Else it returns None, if the string does not match the given pattern.



In [68]:
pattern = "C"
sequence1 = "IceCream"

# No match since "C" is not at the start of "IceCream"
print(re.match(pattern, sequence1))


sequence2 = "Cake"
print(re.match(pattern,sequence2).group())

None
C


### search() versus match()

The `match()` function checks for a match only at the beginning of the string (by default) whereas the `search()` function checks for a match anywhere in the string.
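
A short sketch of the difference, running both functions on the same string. The last call shows that anchoring a `search()` pattern with `^` gives the same result as `match()` on this single-line input.

```python
import re

text = "IceCream"

at_start = re.match(r"Cream", text)       # "Cream" is not at position 0
anywhere = re.search(r"Cream", text)      # found further along the string
anchored = re.search(r"^Cream", text)     # ^ forces the start of the string

print(at_start)          # None
print(anywhere.group())  # Cream
print(anchored)          # None
```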

### findall
`findall(pattern, string, flags=0)`<br>
Finds all non-overlapping matches in the entire sequence and returns them as a list of strings, one per match. If the pattern contains groups, the list holds the group contents instead of the whole matches.

In [69]:
email_address = "Please contact us at: support@somewhere.com, xyz@somewhere.com"

# 'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)
for address in addresses:
    print(address)

support@somewhere.com
xyz@somewhere.com


In [70]:
text = '09, Oct 2019'
print(re.findall(r'\d+', text))  # \d  Any digit. The + mandates at least 1 digit.

['09', '2019']
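
One behaviour of `findall()` that often surprises: when the pattern contains groups, the result holds the groups rather than the whole matches. A small sketch with the email pattern from earlier, once without and once with parentheses:

```python
import re

text = "Please contact us at: support@somewhere.com, xyz@somewhere.com"

# No groups: each list item is the whole match.
whole = re.findall(r'[\w\.-]+@[\w\.-]+', text)

# Two groups: each list item is a (username, host) tuple.
parts = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)

print(whole)  # ['support@somewhere.com', 'xyz@somewhere.com']
print(parts)  # [('support', 'somewhere.com'), ('xyz', 'somewhere.com')]
```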


### Substitution
`sub(pattern, repl, string, count=0, flags=0)`<br>
This is the **substitute** function. It returns the string obtained by replacing or substituting the leftmost non-overlapping occurrences of pattern in string by the replacement **repl**. If the pattern is not found then the string is returned unchanged.

In [48]:
email_address = "Please contact us at: xyz@somewhere.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@somewhere.com', email_address)
print(new_email_address)

Please contact us at: support@somewhere.com
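
The replacement string can refer back to groups in the pattern, and `count` limits how many occurrences are replaced. The sketch below uses made-up addresses; `\2` stands for whatever group 2 (the host) matched.

```python
import re

text = "Please contact us at: xyz@somewhere.com or abc@other.org"
pattern = r'([\w\.-]+)@([\w\.-]+)'

# Replace every user name but keep each host, using the \2 backreference.
all_replaced = re.sub(pattern, r'support@\2', text)

# count=1 replaces only the leftmost occurrence.
first_replaced = re.sub(pattern, r'support@\2', text, count=1)

print(all_replaced)
# Please contact us at: support@somewhere.com or support@other.org
print(first_replaced)
# Please contact us at: support@somewhere.com or abc@other.org
```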


### Compile
`compile(pattern, flags=0)`<br>

Compiles a regular expression pattern into a **regular expression object**. When you need to use an expression several times in a single program, using the `compile()` function to save the resulting regular expression object for reuse is more efficient. The gain is usually small, though, because the compiled versions of the most recent patterns passed to `compile()` and the module-level matching functions are cached.
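
A minimal sketch of how a compiled pattern might be reused in a loop. The variable name `email_re` and the sample lines are made up for illustration; the compiled object has its own `search()` method that takes only the string.

```python
import re

# Compile once, then reuse the pattern object's own search() method.
email_re = re.compile(r'([\w\.-]+)@([\w\.-]+)')

lines = [
    "Please contact us at: support@somewhere.com",
    "No address on this line",
    "Sales: xyz@somewhere.com",
]

hosts = []
for line in lines:
    m = email_re.search(line)
    if m:  # skip lines without an address
        hosts.append(m.group(2))

print(hosts)  # ['somewhere.com', 'somewhere.com']
```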

## Modifier flags
An expression's behaviour can be modified by specifying a flags value.  
You can add a flag as an extra argument to the various functions that you have seen in this tutorial.  
Some of the flags used are: `IGNORECASE`, `DOTALL`, `MULTILINE`, `VERBOSE`, etc.
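
The sketch below shows these four flags on a made-up two-line string. Flags are combined with `|`, and the phone-style pattern at the end is only an illustration of `VERBOSE`, not a real validation rule.

```python
import re

text = "Eat cake\neat cookie"

plain = re.findall(r'eat', text)                                   # case-sensitive
ignore_case = re.findall(r'eat', text, re.IGNORECASE)              # both spellings
start_only = re.findall(r'^eat', text, re.IGNORECASE)              # ^ = start of string
each_line = re.findall(r'^eat', text, re.IGNORECASE | re.MULTILINE)  # ^ = start of each line

no_dotall = re.search(r'cake.eat', text)             # . does not match the newline
dotall = re.search(r'cake.eat', text, re.DOTALL)     # now it does

# VERBOSE lets you spread a pattern over lines and comment each part.
phone = re.compile(r"""
    ^\d{3}   # three digits
    -        # a literal hyphen
    \d{4}$   # four digits
""", re.VERBOSE)

print(plain)        # ['eat']
print(ignore_case)  # ['Eat', 'eat']
print(start_only)   # ['Eat']
print(each_line)    # ['Eat', 'eat']
print(no_dotall)    # None
print(dotall.group() == 'cake\neat')     # True
print(bool(phone.search("555-1234")))    # True
```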

## Regular Expression Modifiers: Option Flags

<table class="table table-bordered">
<tr>
<th style="text-align:center;width:10%">Sr.No.</th>
<th style="text-align:center;">Modifier &amp; Description</th>
</tr>
<tr>
<td class="ts">1</td>
<td><p><b>re.I</b></p>
<p>Performs case-insensitive matching.</p></td>
</tr>
<tr>
<td class="ts">2</td>
<td><p><b>re.L</b></p>
<p>Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behaviour (\b and \B). In Python 3 it can only be used with bytes patterns.</p></td>
</tr>
<tr>
<td class="ts">3</td>
<td><p><b>re.M</b></p>
<p>Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).</p></td>
</tr>
<tr>
<td class="ts">4</td>
<td><p><b>re.S</b></p>
<p>Makes a period (dot) match any character, including a newline.</p></td>
</tr>
<tr>
<td class="ts">5</td>
<td><p><b>re.U</b></p>
<p>Interprets letters according to the Unicode character set. This flag affects the behaviour of \w, \W, \b, \B. In Python 3 this is already the default for string patterns, so the flag is redundant there.</p></td>
</tr>
<tr>
<td class="ts">6</td>
<td><p><b>re.X</b></p>
<p>Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.</p></td>
</tr>
</table>

## Cheat Sheet

Here is a quick cheat sheet for various rules in regular expressions:

**Identifiers:**

`\d` ==> any digit<br>
`\D` ==> anything but a digit<br>
`\s` ==> any whitespace character<br>
`\S` ==> anything but a whitespace character<br>
`\w` ==> any letter, digit or underscore<br>
`\W` ==> anything but a letter, digit or underscore<br>
`.` ==> any character, except for a new line<br>
`\b` ==> word boundary: the position at the start or end of a word<br>
`\.` ==> period. You must use a backslash, because `.` normally means any character.<br>

**Modifiers:**

`{1,3}` ==> match 1 to 3 repetitions of the preceding element, e.g. `\d{1,3}` matches 1 to 3 digits<br>
`+` ==> match 1 or more repetitions<br>
`?` ==> match 0 or 1 repetitions<br>
`*` ==> match 0 or more repetitions<br>
`$` ==> matches at the end of the string<br>
`^` ==> matches at the start of the string<br>
`|` ==> matches either/or. Example: `x|y` will match either x or y<br>
`[]` ==> character set: matches any one of the characters listed inside<br>
`{x}` ==> expect exactly x repetitions of the preceding element<br>
`{x,y}` ==> expect between x and y repetitions of the preceding element<br>
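
The `|` operator is the only entry in this list without an example earlier in the tutorial, so here is a short sketch with made-up text. It also uses `re.fullmatch()`, which succeeds only when the whole string matches.

```python
import re

text = "Eat cake, bake cookies, buy candy"

# Either alternative may match at each position.
found = re.findall(r'cake|cookies', text)
print(found)  # ['cake', 'cookies']

# Parentheses limit how far the alternation reaches: only the middle
# letter varies here, not the whole word.
results = {word: bool(re.fullmatch(r'gr(a|e)y', word))
           for word in ["grey", "gray", "groy"]}
print(results)  # {'grey': True, 'gray': True, 'groy': False}
```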

**White Space Charts:**<br>

`\n` ==> new line<br>
`\s` ==> any whitespace character<br>
`\t` ==> tab<br>
`\e` ==> escape (not recognised by Python's `re` module; use `\x1b` instead)<br>
`\f` ==> form feed<br>
`\r` ==> carriage return<br>


**Characters to REMEMBER TO ESCAPE IF USED!**<br>

. + * ? [ ] $ ^ ( ) { } | \
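
A small sketch of what goes wrong when one of these characters is left unescaped, using a made-up string. `re.escape()` is a helper from the `re` module that this tutorial has not covered; it adds the backslashes for you when the text to match comes from somewhere else.

```python
import re

text = "1+1=2"

# Unescaped, + is a quantifier: the pattern asks for "11", which is not there.
unescaped = re.search(r'1+1', text)

# Escaped, \+ matches the plus sign itself.
escaped = re.search(r'1\+1', text)

# re.escape() builds the escaped pattern from the literal text.
auto = re.search(re.escape("1+1"), text)

print(unescaped)        # None
print(escaped.group())  # 1+1
print(auto.group())     # 1+1
```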


**Brackets:**

`[]` ==> `quant[ia]tative` will find either quantitative or quantatative.<br>
`[a-z]` ==> matches any lowercase letter a-z<br>
`[1-5a-qA-Z]` ==> matches any digit 1-5, lowercase letter a-q or uppercase letter A-Z<br>

# Lab: Extracting data using regular expressions

* Create a new notebook, call it "Regular Expression Lab", or something suitably descriptive
* Your task is to process the file "data/http-accesslog.txt"

This is an http (webserver) log file. The lines look like this:
```
46.119.125.179 - - [31/Dec/2015:05:22:37 +0100] "GET /administrator/index.php HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 5.1; rv:29.0) Gecko/20100101 Firefox/29.0" "-"
180.76.15.150 - - [31/Dec/2015:05:34:44 +0100] "GET / HTTP/1.1" 200 10479 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
157.55.39.200 - - [31/Dec/2015:06:17:42 +0100] "GET /robots.txt HTTP/1.1" 200 304 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
66.249.69.112 - - [31/Dec/2015:06:44:45 +0100] "GET /robots.txt HTTP/1.1" 200 304 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
```

| Field | Purpose |
|:--|:--|
|IP address| The address of the requester |
|Access Time| The date and time that the request occurred|
|Request|The http request (method URL PROTOCOL)|
|Response code|200 = OK [Good reference for HTTP codes](https://www.restapitutorial.com/httpstatuscodes.html)|
|Length|How many bytes were sent|
|Browser ID|includes browser name and version, may include operating system|
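
As a possible starting point, the sketch below splits one of the sample lines into the fields from the table. The pattern and the name `LOG_LINE` are assumptions based only on the four lines shown above; lines in the real file may need a more forgiving pattern. The quoted field just before the browser ID is not described in the table, so it is skipped.

```python
import re

# Assumed layout, derived from the sample lines only.
LOG_LINE = re.compile(
    r'(\S+) \S+ \S+ '      # IP address, then two fields shown as "-"
    r'\[([^\]]+)\] '       # access time inside [...]
    r'"([^"]*)" '          # request inside double quotes
    r'(\d{3}) (\d+|-) '    # response code and length
    r'"[^"]*" "([^"]*)"'   # skipped quoted field, then browser ID
)

line = ('46.119.125.179 - - [31/Dec/2015:05:22:37 +0100] '
        '"GET /administrator/index.php HTTP/1.1" 200 4263 "-" '
        '"Mozilla/5.0 (Windows NT 5.1; rv:29.0) Gecko/20100101 Firefox/29.0" "-"')

m = LOG_LINE.match(line)
if m:
    ip, when, request, status, length, browser = m.groups()
    print(ip)       # 46.119.125.179
    print(when)     # 31/Dec/2015:05:22:37 +0100
    print(request)  # GET /administrator/index.php HTTP/1.1
    print(status)   # 200
    print(length)   # 4263
else:
    print("Line did not fit the assumed layout")
```

In the notebook you would apply the same pattern to each line of the file and collect the fields you need for the tasks below.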

Tasks
* Were there any failed requests?
* Were there any redirects?
* How many days does the log cover?
* What was the busiest day?
* How many versions of Firefox were used?
* How many different operating systems did those Firefox users use?
    