# Introduction to Regular Expressions

**Regular expressions** give us a way to search strings for specific patterns. A regular expression, or simply **regex**, is a special string that describes a specific pattern that you would like to match in another string.


### Examples of questions that regexes can answer

It may help to see a few examples of tasks that a regex pattern can accomplish:

* Match all words that begin with 'S' and end in 'y'
* Match the word 'friend' or 'freind'
* Match a word with at least 3 digits in it
* Match all Gmail email addresses
* Capture the word immediately following the word 'Author'
* Capture the word immediately following the third occurrence of the word 'coffee'

## Regular expressions in Python

Many programming languages provide the capability to use regular expressions and Python is no exception. The [re module][1] is part of the Python standard library and gives Python programmers all the necessary tools to use regular expressions. The official documentation has a good tutorial for beginners on [how to use regular expressions in Python][2] that I recommend reading through in addition to this text.

## Mini-Programming Language

Regular expressions are a miniature programming language with their own strict set of rules, just like any other language. A regular expression is written as a string that mixes **literal** and **special** characters.

### Literal vs Special Characters

There are two distinct categories of characters within a regex string: **literal** and **special**.

* **Literal** - these characters don't have any special meaning. They simply represent themselves. They are also referred to as **regular** characters.
* **Special** - these characters do have a special meaning and do not represent themselves literally. It is these special characters that provide the power in regular expressions. They are also referred to as **metacharacters**.


### Different flavors of regex

Different programming languages (Python, Perl, Java, etc.) have their own "flavor" of regex syntax. The vast majority of regex syntax overlaps between programming languages, but there are some differences. You might not be able to copy and paste a regex solution you find on Stack Overflow into your Python program if it comes from another language. This text only covers regex for the Python programming language.


## Regular expressions in pandas

While learning how to use the `re` module is a valuable addition to your toolset, it is not necessary in order to use regular expressions in pandas. The **regex pattern** is what is key and the focus of these chapters. You will learn the fundamentals of how to create regex patterns to match particular parts of text.

### Matching patterns in pandas string Series

Because we are using pandas, we will be matching patterns within Series containing strings (those with data type object or string). Several methods accept regex pattern strings as input, with `contains` and `extract` being among the most common.

### The `contains` and `extract` string Series methods

Most of our work in this part of the book will be with the `contains` and `extract` string Series methods available from the `str` accessor. The `contains` method accepts a regex pattern and returns a boolean for each value informing us whether or not the string matched the given pattern. The resulting boolean Series can be used to filter the data.

The `extract` method also accepts a regex pattern, but allows us to extract any number of matched patterns from each value in the Series. For instance, we may want to extract the domain name from each email address.
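As a quick preview of both methods, here is a minimal sketch on a small, made-up Series of email addresses (the data is hypothetical and not part of the movie dataset). The parentheses in the second pattern form a capture group, a piece of syntax this chapter has not introduced yet; it marks the part of the match that `extract` returns.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
emails = pd.Series(['ana@gmail.com', 'bob@yahoo.com', 'not an email'])

# contains returns one boolean per value
is_gmail = emails.str.contains('gmail')

# extract returns a DataFrame with one column per capture group;
# values that do not match the pattern become missing
domain = emails.str.extract('@(.+)')

print(is_gmail.tolist())
print(domain[0].tolist())
```

The first result can be used directly for boolean indexing, while the second holds the text captured from each value.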

### Movie titles

Let's get started with regexes by reading in the movie dataset and selecting the `'title'` column. We'll use it to match a variety of different patterns.

[1]: https://docs.python.org/3/library/re.html
[2]: https://docs.python.org/3/howto/regex.html

In [None]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv')
title = movie['title']
title.head(3)

## Matching with only literal characters

The simplest regex patterns you can write contain only literal characters. These strings will look like any ordinary string. Let's search for movies that have the word `'World'` in them.

`'World'` is a valid regular expression. Each of the five characters (`'W'`, `'o'`, `'r'`, `'l'`, and `'d'`) has no special meaning. We will use the `contains` Series string method which accepts a regular expression as its first argument and returns a boolean Series.

In [None]:
pattern = 'World'
title.str.contains(pattern).head()

### Filter for movies containing `'World'`

Let's take this resulting Series and use it for boolean indexing. The result should be the movie titles that have `'World'` in them.

In [None]:
pattern = 'World'
filt = title.str.contains(pattern)
title[filt].head()

### Defining a function to filter regexes

Instead of repeatedly running the same three lines above for every regex, we define the following function. It accepts the Series and the regex pattern and returns the matches. Additional keyword arguments forwarded to the `contains` method will be captured by `**kwargs`. These keyword arguments will be explained as needed. For now, no extra arguments will be passed.

In [None]:
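def find_pattern(s, pattern, **kwargs):
    """Return the values of Series s that match the regex pattern."""
    # Extra keyword arguments are forwarded to str.contains
    filt = s.str.contains(pattern, **kwargs)
    return s[filt]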
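To see how the forwarding works, the sketch below passes `na=False`, a real parameter of `contains` that treats missing values as non-matches. The sample Series is hypothetical and the function is repeated so the cell runs on its own.

```python
import pandas as pd

def find_pattern(s, pattern, **kwargs):
    filt = s.str.contains(pattern, **kwargs)
    return s[filt]

# Hypothetical data with one missing title
sample = pd.Series(['World War Z', None, 'Hello World', 'Up'])

# na=False travels through **kwargs to contains, so the missing
# value is treated as False in the boolean filter
matches = find_pattern(sample, 'World', na=False)
print(matches.tolist())
```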

### Regular expressions are case sensitive

Regular expressions are case sensitive by default. `'World'` only matches movie titles with an uppercase `'W'` followed immediately by lowercase `'orld'`. Searching for lowercase `'world'` returns a different set of results.

In [None]:
find_pattern(title, 'world')

### Find all movies containing exact string `'Star Wars'`

Let's complete one more regular expression that only has literal characters (no special characters) and find all the movies that have the exact phrase `'Star Wars'` in them.

In [None]:
find_pattern(title, 'Star Wars')

## Special Characters

The following characters are the **special characters** or **metacharacters**:

`. ^ $ * + ? { } [ ] \ | ( )`

Each of these characters has a special meaning. They do not represent their literal character value. For instance, the `?` character does not match a literal question mark in the string. The rest of this chapter and the ones that follow in this part are devoted to examples that explain each of the special characters above.
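The following sketch uses the standard library `re` module to illustrate the point about `?`. In regex syntax, `?` makes the preceding character optional, so the pattern `'Why?'` describes `'Wh'` or `'Why'` and not the four-character string ending in a question mark. The `re.escape` function is one way to turn every special character in a string into its literal form.

```python
import re

# 'Why?' as a regex means: 'Wh' followed by an optional 'y'
no_match = re.fullmatch('Why?', 'Why?')   # the literal '?' is not matched
short_match = re.fullmatch('Why?', 'Wh')  # 'y' is optional, so this matches

# re.escape places a backslash before each special character
escaped = re.escape('Why?')
literal_match = re.fullmatch(escaped, 'Why?')

print(no_match)
print(short_match.group())
print(escaped)
print(literal_match.group())
```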

## The dot metacharacter `.`

The **dot** or **period** is a special character that matches any single character (except a newline, by default). For example, the regex `'m.le'` matches any string that has an `m` followed by any character followed by `le`. It matches `'male'`, `'mile'`, `'mole'`, `'thimble'`, `'tumble'`, and so on. Let's see the movie titles that have this pattern:

In [None]:
find_pattern(title, 'm.le')
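The same pattern can be checked against a handful of hand-picked words, which makes it easier to see exactly what the dot does and does not allow. The word list below is made up for illustration.

```python
import pandas as pd

# Hypothetical words chosen to show matches and non-matches
words = pd.Series(['male', 'mole', 'thimble', 'mle', 'm le', 'meal'])

# The dot must consume exactly one character between 'm' and 'le'
matches = words.str.contains('m.le')
print(words[matches].tolist())
```

Note that `'mle'` fails because there is no character for the dot to match, while `'m le'` succeeds because a space counts as a character.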

## Using raw Python strings for regexes

Before we get too far into regexes, it's best to get in the habit of using raw strings when defining a pattern. Normal Python strings use the backslash character as an **escape character**, which changes the meaning of the very next character (or characters). For instance, when `\n` is found in a normal Python string, the Python interpreter translates it as "new line". Only a few characters are meaningful after the backslash and are listed in the [official documentation][0]. The following string, `s1`, is defined with this newline character. Printing it to the screen reveals how Python interprets it.

[0]: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

In [None]:
s1 = 'First line.\nNext line'
print(s1)

Raw strings treat each character as a literal value. So `\n` in a raw string translates as two characters, the backslash character followed by the character `n`. To create a raw string, you must precede the string with `r`. This `r` is not part of the string. It informs the Python interpreter that it is a raw string. Below, the same set of characters is placed within a raw string. Printing out the string does not output a new line.

In [None]:
s2 = r'First line.\nNext line'
print(s2)

To further show the difference between a normal Python string and a raw string, we can take the length of each of the strings above. In the normal Python string, `\n` is interpreted as a single character, while in the raw string it is two separate characters.

In [None]:
len(s1)

In [None]:
len(s2)

### Use raw strings from now on

Because the backslash character is a regex metacharacter, we will only use raw strings to define our regexes from here on out. If we don't use raw strings, Python will interpret the backslash character as an escape character used in conjunction with the next character. Many regex patterns will not have backslashes in them, so the raw string will be interpreted exactly the same as a normal string. But, to help with consistency and ensure that there are no surprises when using backslashes in regexes, I advise writing the patterns with raw strings.
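Here is a small sketch of the kind of surprise that raw strings prevent. It uses `\b`, which in regex syntax means a word boundary but in a normal Python string means the backspace character. Word boundaries are not covered in this chapter; the sequence is used here only because both Python and the regex engine assign it a meaning.

```python
import re

text = 'The War Room'

# Normal string: Python converts each \b into a single backspace
# character before the regex engine ever sees the pattern
normal = '\bWar\b'

# Raw string: the backslash and the 'b' both reach the regex engine,
# which reads them as a word boundary
raw = r'\bWar\b'

print(len(normal), len(raw))
print(re.search(normal, text))
print(re.search(raw, text).group())
```

The normal string finds nothing because the text contains no backspace characters, while the raw string finds `'War'`.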

## The caret metacharacter `^`

The caret, `^`, is a special character that forces the pattern to match from the beginning of the string. Let's take a look at the difference between the regexes `War` and `^War`. The first matches the characters `'War'` anywhere in the string. The second matches `'War'` only at the beginning. Let's output the differences:

In [None]:
find_pattern(title, r'War').head()

In [None]:
find_pattern(title, r'^War').head()

## The dollar sign metacharacter `$`

The dollar sign metacharacter, `$`, works analogously to the caret but instead forces a match to the **end** of the string. Let's find all the movies that end in 'War':

In [None]:
find_pattern(title, r'War$').head()

### Start and end anchors

The caret and dollar sign metacharacters are also known as **anchors** since they anchor the pattern to either the start or end.
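The effect of each anchor, and of using both together, can be seen on a few made-up titles. The Series below is hypothetical and is only meant to make the three cases easy to compare.

```python
import pandas as pd

# Hypothetical titles for illustration
sample = pd.Series(['War Horse', 'Star Wars', 'Civil War', 'War'])

starts = sample.str.contains(r'^War')   # 'War' at the start
ends = sample.str.contains(r'War$')     # 'War' at the end
both = sample.str.contains(r'^War$')    # the whole string is 'War'

print(sample[starts].tolist())
print(sample[ends].tolist())
print(sample[both].tolist())
```

With both anchors present, the pattern has to account for the entire string, so only the title that is exactly `'War'` remains.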

## Combining special characters

A regex can have any number of literal and special characters. The following regex matches movies that begin with `S`, followed by any character, followed by `n`.

In [None]:
find_pattern(title, r'^S.n').head()

## Setting regex options with flags

There are a number of options, each with a pre-defined default value, that you may change when performing regex searches in Python. These options are known as **flags** and are set with the `flags` keyword argument. Perhaps the most common flag is the one that ignores the case of the characters. The simple regex `t.s` matches the lowercase letters `'t'` and `'s'` separated by any character, returning 72 matches.

In [None]:
find_pattern(title, 't.s').size

The flag objects are found in the `re` module of the standard library, which we import below and use to make case-insensitive matches with `re.I`, which is an alias of `re.IGNORECASE`. Most flags can also be accessed through a [single-character variable name][0] in the `re` module. Ignoring the case finds many more matches.

[0]: https://docs.python.org/3/howto/regex.html#compilation-flags

In [None]:
import re
find_pattern(title, 't.s', flags=re.I).size

All flags are essentially integers and are subclassed from the built-in `int`.

In [None]:
re.I.value

In [None]:
issubclass(type(re.I), int)
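Because flags behave like integers, several of them can be combined with the bitwise OR operator `|`. The sketch below shows this along with the `case` parameter of `contains`, which offers another way to ignore case without importing `re`. The sample Series is hypothetical.

```python
import re
import pandas as pd

# Hypothetical titles for illustration
sample = pd.Series(['tis true', 'Tis the Season', 'HOT SHOTS', 'cats'])

sensitive = sample.str.contains(r't.s')
with_flag = sample.str.contains(r't.s', flags=re.I)
with_case = sample.str.contains(r't.s', case=False)

print(sensitive.tolist())
print(with_flag.tolist())
print(with_case.tolist())

# Flags combine with bitwise OR; the result is still an int subclass
combined = re.I | re.M
print(isinstance(combined, int), int(combined))
```

Passing `flags=re.I` and passing `case=False` give the same result here; the `flags` argument is the more general option since it accepts any combination of flags.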

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Find all movies that have 2 consecutive z's in them.</span>

### Exercise 2

<span style="color:green; font-size:16px">Find all movies that begin with 9.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find all movies that have a `b` as their third character.</span>

### Exercise 4

<span style="color:green; font-size:16px">Find all movies with a fourth-to-last character of `M` and a last character of `e`.</span>

### Exercise 5

<span style="color:green; font-size:16px">Use a regular expression to find movies that are exactly 6 characters in length.</span>

### Exercise 6

<span style="color:green; font-size:16px">Complete exercise 5 using a different string-only Series method that does not require a regex.</span>

### Exercise 7

<span style="color:green; font-size:16px">Find all movies containing the letter `'q'` ignoring case.</span>