# Basic usage of Regular Expressions (aka _regex_)
## [Dr. Tirthajyoti Sarkar](https://www.linkedin.com/in/tirthajyoti-sarkar-2127aa7/), Sunnyvale, CA, Nov 2018

Regular expressions or regex are used to identify whether a pattern exists in a given sequence of characters (string) or not. They help in manipulating textual data, which is often a pre-requisite for data science projects that involve text mining.

Regex is like a mini programming language in itself, and the same core ideas are used not only in Python but also in most widely used languages for web applications, such as JavaScript, PHP, and Perl. Python's regex module, `re`, is part of the standard library, so you only need to import it:

In [2]:
import re

### Use the `match` method to check whether a pattern matches a string. It is case-sensitive
One of the most common regex methods is `match`. By default, it checks whether the pattern matches at the beginning of the string; the pattern does not have to cover the whole string.
Let’s define a string and a pattern.

In [3]:
string1 = 'Python'
pattern = r"Python"

Let’s write a conditional expression to check for a match.

In [4]:
if re.match(pattern,string1):
    print("Matches!")
else:
    print("Doesn't match.")

Matches!


Now, let’s test with a string that differs only in its first letter, which is lowercase.

In [5]:
string2 = 'python'

In [6]:
if re.match(pattern,string2):
    print("Matches!")
else:
    print("Doesn't match.")

Doesn't match.
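
If case should not matter, one option is the `re.IGNORECASE` flag. The following is a minimal sketch; the variable names `case_sensitive` and `case_insensitive` are only illustrative and are not used elsewhere in this notebook.

```python
import re

pattern = r"Python"

# Default behaviour: matching is case-sensitive, so 'python' does not match
case_sensitive = re.match(pattern, "python")

# With the re.IGNORECASE flag, letter case is ignored
case_insensitive = re.match(pattern, "python", flags=re.IGNORECASE)

print(case_sensitive)            # None
print(case_insensitive.group())  # python
```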


### Instead of repeating the code, we can use `compile` to create a regex program and use methods
In a program or module, if we are making heavy use of a particular pattern, then it is better to use the `compile` method and create a regex program and then call methods on this program. Here is how you compile a regex program.

In [7]:
prog = re.compile(pattern)
prog.match(string1)

<_sre.SRE_Match object; span=(0, 6), match='Python'>

This code produced a `Match` object (displayed as `SRE_Match` in older Python versions, as here) which has a `span` of (0,6) and the matched string ‘Python’. The `span` denotes the start index and the end index of the matched text, where the end index is exclusive, as in a Python slice.

These indices may come in handy in a text mining program where subsequent code uses them for further searching or decision-making. We will see some examples of that later.
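
As a small illustration of how the span can be used, the sketch below slices the original string with the indices returned by `span()`. The sample string and the names `matched` and `rest` are assumptions made for this example only.

```python
import re

text = "Python is fun"
m = re.match(r"Python", text)

start, end = m.span()      # same values as m.start() and m.end()
matched = text[start:end]  # the end index is exclusive, like a slice
rest = text[end:]          # everything after the match

print(start, end)  # 0 6
print(matched)     # Python
print(rest)        #  is fun
```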

### Compiled programs return special objects, e.g. `Match` objects. If the pattern does not match, they return `None`, so we can still use our conditional check!
Methods of a compiled program behave like the module-level functions in that they return `None` if the pattern does not match.

Here, we check that by writing a simple conditional. This concept will come in handy later when we write a small utility function that checks the type of the object returned by a compiled regex program and acts accordingly. We cannot be sure whether a pattern will match a given string or appear somewhere in a corpus of text (if we are searching for the pattern anywhere within the text). Depending on the situation, we may get a `Match` object or `None` as the returned value, and we have to handle both gracefully.

In [8]:
prog = re.compile(pattern)
if prog.match(string1) is not None:
    print("Matches!")
else:
    print("Doesn't match.")

Matches!


In [9]:
if prog.match(string2) is not None:
    print("Matches!")
else:
    print("Doesn't match.")

Doesn't match.


### Use additional parameters in `match` to check for positional matching
By default, `match` looks for the pattern at the beginning of the given string. But sometimes we need to check for a match at a specific location in the string.

The following example matches `y` at the 2nd position (index/pos 1).

In [10]:
prog = re.compile(r'y')

In [11]:
prog.match('Python',pos=1)

<_sre.SRE_Match object; span=(1, 2), match='y'>

In [12]:
prog = re.compile(r'thon')

In [13]:
prog.match('Python',pos=2)

<_sre.SRE_Match object; span=(2, 6), match='thon'>

Continuing with the same program `prog`, the following example looks for a match in a different string:

In [14]:
prog.match('Marathon',pos=4)

<_sre.SRE_Match object; span=(4, 8), match='thon'>

### Let's see a use case. Find out how many words in a list have 'ing' as their last three letters
Suppose we want to find out whether a given string has ***‘ing’*** as its last three letters.

What is a possible use?

This kind of query may come up in a text analytics/text mining program where somebody is interested in finding present continuous verb forms, which end with ‘ing’. However, other words such as nouns may also end with ‘ing’ (as we will see in the example).

In [15]:
prog = re.compile(r'ing')
words = ['Spring','Cycling','Ringtone']
for w in words:
    if prog.match(w,pos=len(w)-3) is not None:
        print("{} has last three letters 'ing'".format(w))
    else:
        print("{} does not have last three letters 'ing'".format(w))

Spring has last three letters 'ing'
Cycling has last three letters 'ing'
Ringtone does not have last three letters 'ing'


It looks plain and simple, and you might well wonder what the purpose of using a special regex module is here. A simple string method would have been sufficient.

Yes, a string method would have been fine for this particular example, but the whole point of regex is to handle complex string patterns that are not at all obvious to express with simple string methods. We will shortly see the real power of regex compared to string methods.
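
For comparison, here is a minimal sketch of the string-method alternative next to the regex version used above. It relies on the built-in `str.endswith`; the list names `by_method` and `by_regex` are illustrative only.

```python
import re

words = ['Spring', 'Cycling', 'Ringtone']

# Plain string method: no regex needed for a fixed suffix
by_method = [w for w in words if w.endswith('ing')]

# Regex version, using match with pos as in the cell above
prog = re.compile(r'ing')
by_regex = [w for w in words if prog.match(w, pos=len(w) - 3) is not None]

print(by_method)  # ['Spring', 'Cycling']
print(by_regex)   # ['Spring', 'Cycling']
```

For a fixed suffix like this, both approaches give the same answer; regex starts to pay off once the pattern is no longer a literal string.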

But before that, let’s explore another commonly used method called `search`.

### We could have used a simple string method. What's powerful about regex? The answer is that it can match very complex patterns. But to see such examples, let's first explore the `search` method.

`search` and `match` are related, and both return the same kind of `Match` object. The real difference between them is that `match` only succeeds if the pattern matches at the beginning of the string (or at a specified position, as we saw in the previous exercises), whereas `search` scans through the string and returns the first location where the pattern matches.

See the following example:

In [16]:
prog = re.compile('ing')

In [17]:
prog.search('Spring')

<_sre.SRE_Match object; span=(3, 6), match='ing'>

In [18]:
prog.search('Ringtone')

<_sre.SRE_Match object; span=(1, 4), match='ing'>

The `match` method would return `None` for the input ‘Spring’, because that string does not start with ‘ing’. `search`, however, returns a `Match` object with `span=(3,6)`, as it finds the pattern ‘ing’ spanning those positions.

Similarly, for the string ‘Ringtone’, it finds the correct position of the match and returns `span=(1,4)`.
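
To make the contrast between the two methods concrete, the following short sketch calls both on the same input. The word ‘ingot’ is an extra example chosen here only to show a case where `match` succeeds.

```python
import re

prog = re.compile(r'ing')

match_result = prog.match('Spring')    # None: 'Spring' does not start with 'ing'
search_result = prog.search('Spring')  # finds 'ing' later in the string
match_at_start = prog.match('ingot')   # succeeds: the pattern is at the start

print(match_result)           # None
print(search_result.span())   # (3, 6)
print(match_at_start.span())  # (0, 3)
```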

### Use the `span()` method of the `match` object, returned by `search`, to locate the position of the matched pattern

As you can understand by now, the `span` contained in the `Match` object is useful for locating the exact position of the pattern as it appears in the string.

The following code demonstrates this:

In [19]:
prog = re.compile(r'ing')
words = ['Spring','Cycling','Ringtone']
for w in words:
    mt = prog.search(w)
    # Span returns a tuple of start and end positions of the match
    start_pos = mt.span()[0] # Starting position of the match
    end_pos = mt.span()[1] # Ending position of the match
    print("The word '{}' contains 'ing' in the position {}-{}".format(w,start_pos,end_pos))

The word 'Spring' contains 'ing' in the position 3-6
The word 'Cycling' contains 'ing' in the position 4-7
The word 'Ringtone' contains 'ing' in the position 1-4


### Examples of various single character pattern matching with `search`. Here we will also use `group()` method, which essentially returns the string matched

Now, we will start getting into the real usage of regex with examples of various useful pattern matching. 

First, we will explore single character matching. We will also use the group method, which essentially returns the matched pattern in a string format so that we can print and process it easily.

#### Dot `.` matches any single character except newline character

In [20]:
prog = re.compile(r'py.')
print(prog.search('pygmy').group())
print(prog.search('Jupyter').group())

pyg
pyt


#### `\w` (lowercase w) matches any single letter, digit or underscore

In [21]:
prog = re.compile(r'c\wm')
print(prog.search('comedy').group())
print(prog.search('camera').group())
print(prog.search('pac_man').group())
print(prog.search('pac2man').group())

com
cam
c_m
c2m


#### `\W` (uppercase W) matches anything not covered by `\w`

In [22]:
prog = re.compile(r'9\W11')
print(prog.search('9/11 was a terrible day!').group())
print(prog.search('9-11 was a terrible day!').group())
print(prog.search('9.11 was a terrible day!').group())
print(prog.search('Remember the terrible day 09/11?').group())

9/11
9-11
9.11
9/11


#### `\s` (lowercase s) matches a single whitespace character such as space, newline, tab, or carriage return.

In [23]:
prog = re.compile(r'Data\swrangling')

print(prog.search("Data wrangling is cool").group())
print("-"*80)
print("Data\twrangling is the full string")
print(prog.search("Data\twrangling is the full string").group())
print("-"*80)

print("Data\nwrangling is the full string")
print(prog.search("Data\nwrangling").group())

Data wrangling
--------------------------------------------------------------------------------
Data	wrangling is the full string
Data	wrangling
--------------------------------------------------------------------------------
Data
wrangling is the full string
Data
wrangling


#### `\d` matches numerical digits 0 - 9

In [24]:
prog = re.compile(r"score was \d\d")

print(prog.search("My score was 67").group())
print(prog.search("Your score was 73").group())

score was 67
score was 73


### Examples of pattern matching either at the start or end of the string

First, let us write a small function to handle cases where no match is found, i.e. where the regex method returns `None`.

In [25]:
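def print_match(s):
    # Uses whichever compiled program is currently bound to the global name `prog`
    m = prog.search(s)  # search once and reuse the result
    if m is None:
        print("No match")
    else:
        print(m.group())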

#### `^` (Caret) matches a pattern at the start of the string

In [26]:
prog = re.compile(r'^India')

print_match("Russia implemented this law")
print_match("India implemented that law")
print_match("This law was implemented by India")

No match
India
No match


#### `$` (dollar sign) matches a pattern at the end of the string

In [27]:
prog = re.compile(r'Apple$')

print_match("Patent no 123456 belongs to Apple")
print_match("Patent no 345672 belongs to Samsung")
print_match("Patent no 987654 belongs to Apple")

Apple
No match
Apple


### Pattern matching with multiple characters

Now, we turn to more exciting and useful pattern matching with examples of multiple characters matching. You should start seeing and appreciating the real power of regex by now.

For these examples and exercises, also try to think how you would implement them without regex i.e. by using simple string methods and any other logic that you can think of. Then, compare that solution to the ones implemented with regex for brevity and efficiency.

#### `*` matches 0 or more repetitions of the preceding RE

In [28]:
prog = re.compile(r'ab*')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")

a
ab
abbb
No match
ab
abb


#### `+` causes the resulting RE to match 1 or more repetitions of the preceding RE

In [29]:
prog = re.compile(r'ab+')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")

No match
ab
abbb
No match
ab
abb


#### `?` causes the resulting RE to match precisely 0 or 1 repetitions of the preceding RE

In [30]:
prog = re.compile(r'ab?')

print_match("a")
print_match("ab")
print_match("abbb")
print_match("b")
print_match("bbab")
print_match("something_abb_something")

a
ab
ab
No match
ab
ab


### Greedy vs. non-greedy matching
The default behavior of regex quantifiers is to be greedy. That means a quantifier consumes as much of the text as it can while still allowing the overall pattern to match, even when a shorter match would have satisfied the pattern. Sometimes this behavior is natural, but in some cases you may want to match minimally.

Lazy (non-greedy) matching, on the other hand, ‘takes as little as possible’. It is enabled by adding a `?` immediately after a quantifier (for example `*?`, `+?` or `??`). The following code illustrates the difference.

In [31]:
prog = re.compile(r'<.*>')
print_match('<a> b <c>')

<a> b <c>


In [32]:
prog = re.compile(r'<.*?>')
print_match('<a> b <c>')

<a>
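
The difference becomes even clearer when all matches are collected. The sketch below uses `re.findall`, which returns every non-overlapping match as a list, and adds a second illustration with `\d+` versus `\d+?`; the variable names are assumptions for this example only.

```python
import re

html = '<a> b <c>'

greedy = re.findall(r'<.*>', html)   # one long match from the first '<' to the last '>'
lazy = re.findall(r'<.*?>', html)    # each tag separately

digits_greedy = re.search(r'\d+', '12345').group()  # takes all the digits
digits_lazy = re.search(r'\d+?', '12345').group()   # takes only one digit

print(greedy)         # ['<a> b <c>']
print(lazy)           # ['<a>', '<c>']
print(digits_greedy)  # 12345
print(digits_lazy)    # 1
```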


### Controlling how many repetitions to match

In many situations, we want precise control over how many repetitions of a pattern to match in a text. This can be done in a few ways, shown in the examples below.

#### `{m}` specifies exactly `m` copies of RE to match. Fewer copies cause a non-match, and `None` is returned

In [33]:
prog = re.compile(r'A{3}')

print_match("ccAAAdd")
print_match("ccAAAAdd")
print_match("ccAAdd")

AAA
AAA
No match


#### `{m,n}` specifies from `m` to `n` copies of RE to match. Omitting `m` specifies a lower bound of zero, and omitting `n` specifies an infinite upper bound.

In [34]:
prog = re.compile(r'A{2,4}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")

AAAB
No match
AAB
AAAAB


In [35]:
prog = re.compile(r'A{,3}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")

AAAB
AB
AAB
AAAB


In [36]:
prog = re.compile(r'A{3,}B')

print_match("ccAAABdd")
print_match("ccABdd")
print_match("ccAABBBdd")
print_match("ccAAAAAAABdd")

AAAB
No match
No match
AAAAAAAB


#### `{m,n}?` specifies `m` to `n` copies of RE to match in a non-greedy fashion.

In [37]:
prog = re.compile(r'A{2,4}')
print_match("AAAAAAA")

prog = re.compile(r'A{2,4}?')
print_match("AAAAAAA")

AAAA
AA


### Sets of matching characters

To match arbitrarily complex patterns, we need to be able to treat a logical combination of characters as a single unit. Regex gives us that capability through character sets. The following examples demonstrate such uses of regex.

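#### `[xyz]` matches x, y, or z
Characters in a set are listed without separators; a comma inside the brackets would itself be matched as a literal character.

In [38]:
prog = re.compile(r'[AB]')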
print_match("ccAd")
print_match("ccABd")
print_match("ccXdB")
print_match("ccXdZ")

A
A
B
No match
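
One detail worth checking for yourself: members of a set are written one after another, and a comma placed inside the brackets becomes a member of the set too. The following sketch compares the two spellings; the names `with_comma` and `without_comma` are illustrative.

```python
import re

with_comma = re.compile(r'[A,B]')    # matches 'A', ',' or 'B'
without_comma = re.compile(r'[AB]')  # matches 'A' or 'B' only

comma_hit = with_comma.search('x,y')      # the comma itself is matched
comma_miss = without_comma.search('x,y')  # nothing to match here
letter_hit = without_comma.search('ccXdB')

print(comma_hit.group())   # ,
print(comma_miss)          # None
print(letter_hit.group())  # B
```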


#### A range of characters can be matched inside the set. This is one of the most widely used regex techniques!
Suppose we want to pick out an email address from a text. Email addresses are generally of the form `{some_name}@{some_domain_name}.{some_domain_identifier}`.

In [39]:
prog = re.compile(r'[a-zA-Z]+@+[a-zA-Z]+\.com')

print_match("My email is coolguy@xyz.com")
print_match("My email is coolguy12@xyz.com")

coolguy@xyz.com
No match


What is going on here?

Look at the regex pattern inside the [ … ]. It is `a-zA-Z`. This covers all ASCII letters, lowercase and uppercase! The **`+`** that follows the set is a quantifier meaning ‘one or more’, so `[a-zA-Z]+` matches any (pure) alphabetical string for that part of the email. The next pattern is **`@`**. Patterns are combined simply by writing them one after another; this is the way to build up a complex regex by stacking individual regex patterns. Note that the **`+`** written after `@` is again a quantifier, so `@+` means ‘one or more `@`’; a plain `@` would be enough for a well-formed address. We also use the same `[a-zA-Z]+` for the email domain name and add a **.com** at the end to complete the pattern as a valid email address. Why `\.`? Because, by itself, DOT (.) is a special character in regex that matches any character, but here we want to use DOT (.) just as a literal DOT (.). So, we need to precede it by a ‘\’.
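
The two points about `+` and the escaped dot can be checked with a small experiment. This is only a sketch: the test strings ‘coolguy@@xyz.com’ and ‘xyz-com’ and the pattern named `strict` are made up for illustration.

```python
import re

# '@+' is a quantifier: it accepts one or more '@' characters
loose = re.compile(r'[a-zA-Z]+@+[a-zA-Z]+\.com')
strict = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')

loose_hit = loose.search('coolguy@@xyz.com')    # matches despite the double '@'
strict_hit = strict.search('coolguy@@xyz.com')  # None: exactly one '@' is required

# An unescaped dot matches any character; an escaped dot matches only '.'
unescaped = re.search(r'xyz.com', 'xyz-com')
escaped = re.search(r'xyz\.com', 'xyz-com')

print(loose_hit.group())  # coolguy@@xyz.com
print(strict_hit)         # None
print(unescaped.group())  # xyz-com
print(escaped)            # None
```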

So, with this regex, we could extract the first email address perfectly but got ‘No match’ with the second one.

What happened with the second email ID?

The regex could not capture it because it had a number ‘12’ in the name! That pattern is not captured by the expression [a-zA-Z].

Let’s change that and add the digits as well,

In [40]:
prog = re.compile(r'[a-zA-Z0-9]+@+[a-zA-Z]+\.com')

print_match("My email is coolguy12@xyz.com")
print_match("My email is coolguy12@xyz.org")

coolguy12@xyz.com
No match


Now we catch the first email ID perfectly. But what’s going on with the second one? We again got no match. The reason is that we changed **.com** to **.org** in that email, and in our regex that portion was hard-coded as .com, so it did not find a match.

Let’s try to address this in the following regex,

In [41]:
prog = re.compile(r'\w+@+\w+\.+[a-z]{2,4}')
print_match("My email is coolguy12@xy2z.org")
print_match("My email is coolguy12[AT]xyz[DOT]org")

coolguy12@xy2z.org
No match


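In this regex, we used the fact that most domain identifiers have 2 to 4 lowercase characters, so we used `[a-z]{2,4}` to capture that. We also replaced the explicit character sets with `\w`, which covers letters, digits and the underscore.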

What happened with the second email ID? This is an example of the small tweaks you can make to stay ahead of telemarketers who scrape online forums or other text corpora to extract email IDs. If you do not want your email to be found, you can change @ to `[AT]` and . to `[DOT]`, which hopefully beats some regex techniques (but not all)!
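
To see why such obfuscation does not beat every regex, here is a sketch of a pattern that accepts both spellings. It uses non-capturing groups `(?:...)` together with alternation `|`, neither of which has been introduced above, so treat it as an illustration rather than a recommended email matcher.

```python
import re

# Either a literal '@' or the text '[AT]', then either '.' or '[DOT]'
prog = re.compile(r'\w+(?:@|\[AT\])\w+(?:\.|\[DOT\])[a-z]{2,4}')

plain = prog.search("My email is coolguy12@xy2z.org")
obfuscated = prog.search("My email is coolguy12[AT]xyz[DOT]org")

print(plain.group())       # coolguy12@xy2z.org
print(obfuscated.group())  # coolguy12[AT]xyz[DOT]org
```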


### Combining forces - OR-ing of regex using `|` - let us try extracting various types of phone numbers

Because regex patterns are themselves compact logical constructs, it makes sense to combine them into even more complex patterns when needed. We can do that using the `|` operator. The following example demonstrates the point.

In [42]:
prog = re.compile(r'[0-9]{10}')

print_match("3124567897")
print_match("312-456-7897")

3124567897
No match


So, here we are trying to extract 10-digit numbers, which could be phone numbers. Note the use of `{10}` to require exactly 10 digits in the pattern. The second number could not be matched because it has **`-`** symbols inserted between groups of digits. We can tackle this by writing multiple smaller regexes and combining them with `|`.

In [43]:
prog = re.compile(r'[0-9]{10}|[0-9]{3}-[0-9]{3}-[0-9]{4}')

print_match("3124567897")
print_match("312-456-7897")

3124567897
312-456-7897


Phone numbers are written in myriad ways, and if you search the web you will find very complex regexes (written not only in Python but also in other widely used languages such as JavaScript, C++, PHP, and Perl) for capturing phone numbers. Here we show a few more examples just to give you a flavor:

In [47]:
p0=r'\+*\d*\s[0-9]{3}-[0-9]{3}-[0-9]{4}'
p1= r'[0-9]{10}'
p2=r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
p3 = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
p4 = r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'
pattern= p0+'|'+p1+'|'+p2+'|'+p3+'|'+p4
prog = re.compile(pattern)

print_match("A phone number 3124567897")
print_match("Another phone number 312-456-7897")
print_match("(312)456-7897 is my phone number")
print_match("I gave him 312.456.7897 as the phone number")
print_match("An international number +22 312-456-7897")

3124567897
 312-456-7897
(312)456-7897
312.456.7897
+22 312-456-7897
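
Note the leading space in the second output above: it comes from the `\s` in `p0`, which can match even when the optional `\+*\d*` part in front of it is empty. One possible way to avoid this, sketched below, is to make the whole international prefix optional as a unit with a non-capturing group. The list name `patterns` and the modified first pattern are assumptions for illustration, not part of the original cell.

```python
import re

patterns = [
    r'(?:\+\d+\s)?[0-9]{3}-[0-9]{3}-[0-9]{4}',  # optional '+<country code> ' prefix
    r'[0-9]{10}',
    r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}',
    r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}',
]
# Join the alternatives with '|' instead of concatenating them by hand
prog = re.compile('|'.join(patterns))

local = prog.search("Another phone number 312-456-7897").group()
international = prog.search("An international number +22 312-456-7897").group()

print(repr(local))          # '312-456-7897'
print(repr(international))  # '+22 312-456-7897'
```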


### `findall` method finds all the occurrences of the pattern and returns them as a list of strings
The last regex method that we will learn in this lesson is `findall`. Essentially, this is a **search-and-aggregate** method, i.e. it collects all the instances in a given text that match the regex pattern and returns them in a list. This is extremely useful, as we can just take the length of the returned list to count the number of occurrences, or pick and use the returned pattern-matched words one by one as we see fit.

Note that although we are giving short examples (single sentences) in this notebook, you will often deal with a large corpus of text when using regex. In that case, you are likely to get many matches from a single regex pattern search, and the `findall` method is going to be most useful.

In [3]:
# A multi-line string
ph_numbers = '''Here are some phone numbers. Pick out the numbers with 312 area code: 
312-423-3456, 456-334-6721, 312-5478-9999, 312-Not-a-Number,777.345.2317, 312.331.6789
'''

print(ph_numbers)
re.findall(r'312[-.][0-9.-]+',ph_numbers)

Here are some phone numbers. Pick out the numbers with 312 area code: 
312-423-3456, 456-334-6721, 312-5478-9999, 312-Not-a-Number,777.345.2317, 312.331.6789



['312-423-3456', '312-5478-9999', '312.331.6789']
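
The pattern above is deliberately loose, which is why ‘312-5478-9999’ is also returned. The following sketch counts the matches with `len` and compares the result with a stricter pattern that insists on the 3-3-4 digit grouping; the stricter pattern and the names `loose` and `strict` are assumptions added for illustration.

```python
import re

ph_numbers = ("312-423-3456, 456-334-6721, 312-5478-9999, "
              "312-Not-a-Number,777.345.2317, 312.331.6789")

# Loose: '312', a separator, then any run of digits, dots and hyphens
loose = re.findall(r'312[-.][0-9.-]+', ph_numbers)

# Stricter: '312', separator, 3 digits, separator, 4 digits
strict = re.findall(r'312[-.][0-9]{3}[-.][0-9]{4}', ph_numbers)

print(len(loose), loose)    # 3 ['312-423-3456', '312-5478-9999', '312.331.6789']
print(len(strict), strict)  # 2 ['312-423-3456', '312.331.6789']
```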

### Use `split()` method to extract meaningful pieces of textual data from a string

In [49]:
text = "Some File.Ver10.Rev2.txt"
re.split(r'\.Ver\d+\.Rev\d+',text)

['Some File', '.txt']

In [50]:
sentence = """A, very   very; irregular_sentence"""
" ".join(re.split(r'[;,\s_]+', sentence))

'A very very irregular sentence'
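
The split-then-join step above can also be written as a single substitution. The sketch below uses `re.sub`, a function of the `re` module that is not covered elsewhere in this notebook, and checks that both routes give the same cleaned sentence.

```python
import re

sentence = "A, very   very; irregular_sentence"

# Route 1: split on runs of separators, then join with single spaces
via_split = " ".join(re.split(r'[;,\s_]+', sentence))

# Route 2: replace each run of separators with a single space
via_sub = re.sub(r'[;,\s_]+', ' ', sentence)

print(via_split)             # A very very irregular sentence
print(via_sub == via_split)  # True
```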