# Introduction to Regular Expressions

**Regular expressions** give us a far more powerful way to search and manipulate strings than ordinary string methods. A regular expression, or simply **regex**, is a special string that describes a pattern you would like to match in another string.

### Examples of questions that regexes can answer
It might be helpful to see a list of tasks that a regex pattern can accomplish:

* Match all words that begin with 'S' and end in 'y'
* Match the word 'friend' or 'freind'
* Match a word with at least 3 digits in it
* Match all Gmail email addresses
* Capture the word immediately following the word 'Author'
* Capture the word immediately following the third occurrence of the word 'coffee'

## The `contains` and `extract` methods
We will be primarily concerned with finding matching patterns within string values of a Pandas Series. We will then select all values within the Series that match the pattern via boolean indexing. The `contains` string Series method will be used for this.

Eventually, we will use the `extract` string Series method to extract particular substrings from the strings within the Series.

### A simple example without regular expressions
Let's match all movie titles that contain either an 'x', 'y', or 'z'. Without a regex, we would call the `contains` string method several times and combine the results with the logical **or** operator, `|`:

In [None]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv')
title = movie['title']
title.head(3)

In [None]:
has_xyz = title.str.contains('x') | title.str.contains('y') | title.str.contains('z')
title[has_xyz].head()

We can sum up this boolean Series to determine the number of values that have either an 'x', 'y', or 'z' in them.

In [None]:
has_xyz.sum()

### Use a regex instead
Instead, we can use the regex `'[xyz]'`, which matches any string that contains an 'x', 'y', or 'z'. We can verify that we get the same total. This regex and many more will be covered in detail below.

In [None]:
title.str.contains('[xyz]').sum()
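
The equivalence can also be checked outside of Pandas. The following is a minimal sketch that uses Python's built-in `re` module on a small, made-up list of titles (the `titles` list is an assumption for illustration, not the movie dataset). `re.search` looks for the pattern anywhere in a string, which is roughly what `contains` does for each value in a Series.

```python
import re

# Hypothetical titles standing in for the movie dataset
titles = ['Avatar', 'Spectre', 'Toy Story 3', 'Tangled', 'X-Men', 'Jazz']

# Three separate literal checks joined with "or"
has_xyz_plain = ['x' in t or 'y' in t or 'z' in t for t in titles]

# One regex; re.search returns a match object or None
has_xyz_regex = [re.search('[xyz]', t) is not None for t in titles]

# Both approaches flag the same titles
print(has_xyz_plain == has_xyz_regex)

# Summing booleans counts the True values
print(sum(has_xyz_regex))
```

Note that `'X-Men'` is not flagged by either approach, since the pattern only lists lowercase letters.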

## Mini-Programming Language
Regular expressions are a miniature programming language with its own strict set of rules, just like any other language. The syntax is written as a string mixing both **literal** and **special** characters.

### Literal vs Special Characters
There are two distinct categories of characters within a regex string: **Literal** and **Special**.

* **Literal** - these characters don't have any special meaning. They simply represent themselves. They are also referred to as **regular** characters.
* **Special** - these characters do have a special meaning. Each special character represents something very specific. They are also referred to as **metacharacters**.

### Matching with only Literal Characters
The simplest regex patterns you can write contain only literal characters. These strings will look like any ordinary string. Let's search for movies that have the word `'Star'` in them.

`'Star'` is a valid regular expression. We will use the `contains` Series string method which accepts a regular expression as its first argument. It returns a boolean Series.

In [None]:
pattern = 'Star'
title.str.contains(pattern).head()

### Filter for only movies containing `Star`
Let's take this resulting Series and use it for boolean indexing. The result should be the movie titles that have **`Star`** in them.

In [None]:
pattern = 'Star'
filt = title.str.contains(pattern)
title[filt].head()
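
To make the two steps explicit (build a boolean mask, then keep only the values where the mask is true), here is a rough plain-Python sketch. It assumes a made-up `titles` list in place of the Series and uses `re.search` as a stand-in for `contains`.

```python
import re

# Hypothetical titles standing in for the movie dataset
titles = ['Star Wars: Episode IV', 'Avatar', 'Star Trek',
          'A Star Is Born', 'Stardust', 'Titanic']

pattern = 'Star'

# Step 1: one boolean per title, True when the pattern is found anywhere
mask = [re.search(pattern, t) is not None for t in titles]

# Step 2: keep only the titles whose mask value is True
matches = [t for t, keep in zip(titles, mask) if keep]

print(mask)
print(matches)
```

Because the pattern can match anywhere in the string, `'Stardust'` is kept even though `Star` is not a separate word in it.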

### Regular Expressions are case sensitive
Regexes are case sensitive by default. `'Star'` only matches movie titles with an uppercase `'S'` followed immediately by lowercase `'tar'`. Let's search for lowercase `'star'`:

In [None]:
pattern = 'star'
filt = title.str.contains(pattern)
title[filt]
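
If case should be ignored, the `re` module offers the `re.IGNORECASE` flag, and the Pandas `contains` method accepts `case=False` for the same purpose. The sketch below uses `re` on a made-up `titles` list (an assumption for illustration) to show the difference.

```python
import re

# Hypothetical titles standing in for the movie dataset
titles = ['Star Wars', 'Superstar', 'Stardust', 'Avatar']

# Case-sensitive search: only a lowercase 'star' counts
lower_only = [t for t in titles if re.search('star', t)]

# Case-insensitive search: 'Star', 'star', 'STAR', ... all count
any_case = [t for t in titles if re.search('star', t, flags=re.IGNORECASE)]

print(lower_only)
print(any_case)
```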

### Find all movies containing the exact string `'Star Wars'`

In [None]:
pattern = 'Star Wars'
filt = title.str.contains(pattern)
title[filt]

Find all movies containing the exact string `'hine'`:

In [None]:
pattern = 'hine'
filt = title.str.contains(pattern)
title[filt].head()

## Special Characters
The following characters are the **special** characters, or **metacharacters**:

`. ^ $ * + ? { } [ ] \ | ( )`
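
One consequence of this list is worth a quick illustration: a special character stops being special when it is preceded by a backslash. The sketch below is only an aside, using `re.escape` from Python's standard library, which inserts those backslashes automatically. The example strings are made up.

```python
import re

specials = '.^$*+?{}[]\\|()'

# re.escape puts a backslash in front of every special character
escaped = [re.escape(ch) for ch in specials]
print(all(e == '\\' + ch for e, ch in zip(escaped, specials)))

# A pattern made only of letters needs no escaping
print(re.escape('Star') == 'Star')

# Unescaped, the dot matches any character, so '3.0' is found in '300'
print(re.search('3.0', '300') is not None)

# Escaped, the dot must be a literal period, so '300' no longer matches
print(re.search(re.escape('3.0'), '300') is None)
```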

#### Details and examples with special characters
The rest of this notebook is devoted to examples that explain each of the special characters above. This will not be an exhaustive coverage of regular expressions as they can get quite complex. There are even entire books written on the subject.

## The dot metacharacter `.`
The **dot** or **period** is a special character that matches any single character (by default, any character except a newline). For example, the regex `'m.le'` will match any string that has an **`m`** followed by any character followed by `le`. It will match 'male', 'mile', 'mole', 'thimble', 'tumble', etc. Let's see which movie titles contain this pattern:

In [None]:
pattern = 'm.le'
filt = title.str.contains(pattern)
title[filt]
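
The words listed above can be checked directly. This is a small sketch with the `re` module. The `words` list is made up for illustration and includes a few strings that should not match.

```python
import re

# Words from the text plus a few extra test cases
words = ['male', 'mile', 'mole', 'thimble', 'tumble', 'mle', 'meal', 'Smile']

pattern = 'm.le'

# Keep words containing 'm', then exactly one character, then 'le'
matched = [w for w in words if re.search(pattern, w)]

print(matched)
```

`'mle'` fails because the dot must consume exactly one character, and `'meal'` fails because no `le` follows the character after the `m`.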

## The caret metacharacter `^`
The caret, `^`, is a special character that forces the pattern to match from the beginning of the string. Let's take a look at the difference between the regexes `War` and `^War`. The first matches the word 'War' anywhere in the string. The second matches the word 'War' only at the beginning. Let's output the differences:

In [None]:
pattern = 'War'
filt = title.str.contains(pattern)
title[filt].head()

In [None]:
pattern = '^War'
filt = title.str.contains(pattern)
title[filt]
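
The same comparison on a tiny, made-up list of titles (an assumption for illustration) can be sketched with the `re` module:

```python
import re

# Hypothetical titles standing in for the movie dataset
titles = ['War Horse', 'Star Wars', 'The War of the Worlds',
          'Warrior', 'Civil War', 'Avatar']

# 'War' may appear anywhere in the title
anywhere = [t for t in titles if re.search('War', t)]

# '^War' requires the title to start with 'War'
at_start = [t for t in titles if re.search('^War', t)]

print(anywhere)
print(at_start)
```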

## The dollar sign metacharacter `$`
The dollar sign metacharacter, **`$`**, works analogously to the caret but instead anchors the match to the **end** of the string. Let's find all the movies that end in 'War':

In [None]:
pattern = 'War$'
filt = title.str.contains(pattern)
title[filt]
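
As a quick check of the end-of-string behavior, here is a minimal sketch using the `re` module on a made-up list of titles (an assumption for illustration):

```python
import re

# Hypothetical titles standing in for the movie dataset
titles = ['Civil War', 'War Horse', 'Star Wars', 'This Means War', 'Warrior']

# 'War$' requires the title to end with 'War'
at_end = [t for t in titles if re.search('War$', t)]

print(at_end)
```

`'Star Wars'` is excluded because the trailing `s` comes between `War` and the end of the string.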

### Start and End Anchors
The caret and dollar metacharacters are also known as **anchors** since they anchor the pattern to either the beginning or the end of the string.

## Combining special characters
A regex can have any number of literal and special characters. The following regex matches movies that begin with `S`, followed by any character, followed by `n`.

In [None]:
pattern = r'^S.n'
filt = title.str.contains(pattern)
title[filt].head(3)
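
To see how the three parts of this pattern interact, here is a small sketch with the `re` module. The `titles` list is made up for illustration and includes near misses.

```python
import re

# Hypothetical titles standing in for the movie dataset
titles = ['Sin City', 'Sanctum', 'Sunshine', 'Snatch',
          'Spider-Man', 'The Sun', 'Signs']

# '^' anchors to the start, 'S' and 'n' are literals, '.' is any one character
pattern = r'^S.n'

matched = [t for t in titles if re.search(pattern, t)]

print(matched)
```

`'Snatch'` fails because its third character is `a` rather than `n`, and `'The Sun'` fails because the caret requires the `S` to be the first character.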

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Find all movies that have 2 consecutive z's in them.</span>

### Exercise 2
<span  style="color:green; font-size:16px">Find all movies that begin with 9.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Find all movies that have a `b` as their third character.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Find all movies with a fourth-to-last character of `M` and a last character of `e`.</span>

### Exercise 5
<span  style="color:green; font-size:16px">Could you use a regular expression to find a movie that was exactly 6 characters in length?</span>

### Exercise 6
<span  style="color:green; font-size:16px">What is a more natural way to complete exercise 5 without a regex?</span>