# 3. Regular Expressions part 1

### Objectives
* Understand what regular expressions are and what kind of questions they can answer
* Know that regular expressions consist of **literal** and **special** characters
* Know the basic functionality of three special characters: `. ^ $`
* Know how to combine multiple special characters together

# Regular Expressions for more Powerful String Manipulations
**Regular Expressions** give us a way to do much more powerful string manipulations. A regular expression, or simply **regex**, is a special string that describes a specific pattern that you would like to match in another string.

### Examples of questions that regexes can answer
It may help to see the kinds of questions that a regex pattern can answer:
* Match all words that begin with 'S' and end in 'y'
* Match the word 'friend' or 'freind'
* Match a word with at least 3 digits in it
* Match all Gmail email addresses
* Capture the word immediately following the word 'Author'
* Capture the word immediately following the third occurrence of the word 'coffee'

# Primarily use `contains` and `extract`
We will be primarily concerned with finding matching patterns within string values of a Pandas Series. We will then select all values within the Series that match the pattern via boolean indexing. The **`contains`** string Series method will be used for this.

Eventually, we will use the **`extract`** string Series method to extract particular substrings from the strings within the Series.

### A simple example without regular expressions
Let's match all movie titles that contain either an 'x', 'y', or 'z'. Without using a regex, we would make multiple calls to the **`contains`** string method and join them with the logical **or** operator, `|`:

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

In [2]:
has_xyz = title.str.contains('x') | title.str.contains('y') | title.str.contains('z')
title[has_xyz].head()

9     Harry Potter and the Half-Blood Prince
21                    The Amazing Spider-Man
30                                   Skyfall
35                       Monsters University
37           Transformers: Age of Extinction
Name: title, dtype: object

We can sum up this boolean Series to determine the number of values that have either an 'x', 'y', or 'z' in them.

In [3]:
has_xyz.sum()

1193

### Use a regex instead
Instead, we can use the regex **`'[xyz]'`**, which matches any string that contains an 'x', 'y', or 'z'. We can verify that we get the same total. This regex and many more will be covered in detail later.

In [4]:
title.str.contains('[xyz]').sum()

1193
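
To make the equivalence concrete without needing `movie.csv`, here is a minimal sketch on a small, made-up Series. The titles in `sample` are illustrative assumptions, not rows from the dataset. It checks that chaining three `contains` calls and using the single regex `'[xyz]'` produce the same boolean Series.

```python
import pandas as pd

# Hypothetical sample titles, not taken from movie.csv
sample = pd.Series(['Skyfall', 'Star Trek', 'Extinction', 'Zoolander', 'Jazz'])

# Three separate literal searches combined with the logical or operator
chained = (sample.str.contains('x')
           | sample.str.contains('y')
           | sample.str.contains('z'))

# One regex that matches any single lowercase x, y, or z
with_regex = sample.str.contains('[xyz]')

print(chained.equals(with_regex))
print(sample[with_regex].tolist())
```

Notice that 'Zoolander' is not selected, since the pattern lists only lowercase letters.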

## Regular Expressions are a Mini-Programming Language
Regular expressions are a miniature programming language with its own strict set of rules, just like any other language. The syntax is written as a string that mixes both **literal** and **special** characters.

## Literal vs Special Characters
There are two distinct categories of characters within a regex string, **literal** and **special**:
* **Literal** - these characters don't have any special meaning. They simply represent themselves. They are also referred to as **regular** characters.
* **Special** - these characters do have a special meaning. Each special character represents something very specific. They are also referred to as **metacharacters**.

## Matching with only Literal Characters
The simplest regex patterns you can write contain only literal characters. These patterns look like any ordinary string. Let's search for movies that have **`'Star'`** in their title.

**`'Star'`** is a valid regular expression. We will use the **`contains`** Series string method which accepts a regular expression as its first argument. It returns a boolean Series.

In [5]:
pattern = 'Star'
title.str.contains(pattern).head()

0    False
1    False
2    False
3    False
4     True
Name: title, dtype: bool

### Filter for only movies containing `Star`
Let's take this resulting Series and use it for boolean indexing. The result should be the movie titles that have **`Star`** in them.

In [6]:
pattern = 'Star'
filt = title.str.contains(pattern)
title[filt].head(5)

4        Star Wars: Episode VII - The Force Awakens
48                          Star Trek Into Darkness
57                                 Star Trek Beyond
159                                       Star Trek
233    Star Wars: Episode III - Revenge of the Sith
Name: title, dtype: object

## Regular Expressions are case sensitive
Regexes are case sensitive by default. **`'Star'`** only matches movie titles with an uppercase **`'S'`** followed immediately by lowercase **`'tar'`**. Let's search for lowercase **`'star'`**:

In [7]:
pattern = 'star'
filt = title.str.contains(pattern)
title[filt]

2641    Firestarter
2737      Superstar
Name: title, dtype: object
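
If you want to ignore case, one option is the `case` parameter of `str.contains`. The sketch below uses a small, made-up Series (the titles in `sample` are assumptions for illustration) to compare the default case-sensitive search with `case=False`.

```python
import pandas as pd

# Hypothetical sample titles, not taken from movie.csv
sample = pd.Series(['Star Trek', 'Firestarter', 'Superstar', 'Skyfall'])

# Default behavior: lowercase 'star' does not match 'Star'
sensitive = sample.str.contains('star')

# case=False asks pandas to ignore the difference between 'S' and 's'
insensitive = sample.str.contains('star', case=False)

print(sensitive.tolist())
print(insensitive.tolist())
```

With `case=False`, 'Star Trek' is matched in addition to 'Firestarter' and 'Superstar'.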

### Find all movies containing exact string `'Star Wars'`

In [8]:
pattern = 'Star Wars'
filt = title.str.contains(pattern)
title[filt]

4           Star Wars: Episode VII - The Force Awakens
233       Star Wars: Episode III - Revenge of the Sith
234       Star Wars: Episode II - Attack of the Clones
237          Star Wars: Episode I - The Phantom Menace
1521        Star Wars: Episode VI - Return of the Jedi
2031    Star Wars: Episode V - The Empire Strikes Back
2973                Star Wars: Episode IV - A New Hope
3271                         Star Wars: The Clone Wars
Name: title, dtype: object

### Find all movies containing exact string `'hine'`

In [9]:
pattern = 'hine'
filt = title.str.contains(pattern)
title[filt].head()

94      Terminator 3: Rise of the Machines
475                       The Time Machine
1302                              Sunshine
1372                  Hot Tub Time Machine
1710                  Machine Gun Preacher
Name: title, dtype: object

## Special Characters
The following characters are the **special** characters, or **metacharacters**:

`. ^ $ * + ? { } [ ] \ | ( )`

#### Details and examples with special characters
The rest of this notebook is devoted to examples that explain the first three special characters above: `.`, `^`, and `$`. This will not be an exhaustive coverage of regular expressions, as they can get quite complex. There are even entire books written on the subject.

## The dot metacharacter `.`
The **dot** or **period** is a special character that matches any single character (except a newline, by default). For example, the regex **`'m.le'`** will match any string that has an **`m`** followed by any one character followed by **`le`**. It will match 'male', 'mile', 'mole', 'thimble', 'tumble', and so on.

Let's see how many movie titles have this pattern:

In [10]:
pattern = 'm.le'
filt = title.str.contains(pattern)
title[filt]

661                          Mona Lisa Smile
1465                               Wimbledon
1733    Indiana Jones and the Temple of Doom
1770                           A Simple Wish
1923                             The Gambler
2019                         Ready to Rumble
2326              The Baader Meinhof Complex
2476                           A Simple Plan
3374                     Rumble in the Bronx
4711                             Tumbleweeds
Name: title, dtype: object
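
The dot stands for exactly one character, not zero and not several. Here is a minimal sketch using Python's built-in `re` module on a made-up word list. Using `re.search` here is an assumption that it behaves like `str.contains` for this pattern, since both look for the pattern anywhere in the string.

```python
import re

# Hypothetical words chosen to show which ones fit the pattern
words = ['male', 'mile', 'mole', 'thimble', 'tumble', 'mle', 'meal']

pattern = 'm.le'

# Keep the words where 'm', one character, then 'le' appears somewhere
matched = [w for w in words if re.search(pattern, w)]
print(matched)
```

'mle' is left out because nothing sits between the `m` and `le`, and 'meal' is left out because its letters are in the wrong order.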

## The caret metacharacter `^`
The caret, **`^`**, is a special character that forces the pattern to match at the beginning of the string. Let's take a look at the difference between the regexes **`War`** and **`^War`**. The first matches the characters 'War' anywhere in the string. The second matches 'War' only at the very beginning, so it finds 'Warcraft' as well as 'War Horse'.

Let's output the differences:

In [11]:
pattern = 'War'
filt = title.str.contains(pattern)
title[filt].head()

4             Star Wars: Episode VII - The Force Awakens
27                            Captain America: Civil War
46                                           World War Z
64     The Chronicles of Narnia: The Lion, the Witch ...
108                                             Warcraft
Name: title, dtype: object

In [12]:
pattern = '^War'
filt = title.str.contains(pattern)
title[filt]

108                    Warcraft
187           War of the Worlds
597                   War Horse
1483         Warriors of Virtue
1598                Warm Bodies
1934                        War
1961                    Warrior
2878                   WarGames
3160                  War, Inc.
3431                    Warlock
3536                War & Peace
4012    Warlock: The Armageddon
Name: title, dtype: object
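
The same comparison can be tried on a handful of strings. This is a minimal sketch with the built-in `re` module and a made-up list of titles (an assumption for illustration), with `re.search` standing in for `str.contains`.

```python
import re

# Hypothetical sample titles
titles = ['War Horse', 'World War Z', 'Warcraft', 'Civil War']

# 'War' may appear anywhere in the title
anywhere = [t for t in titles if re.search('War', t)]

# '^War' requires the title to start with 'War'
at_start = [t for t in titles if re.search('^War', t)]

print(anywhere)
print(at_start)
```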

## The dollar sign metacharacter `$`
The dollar sign metacharacter, **`$`**, works analogously to the caret but instead forces the match to occur at the **end** of the string. Let's find all the movies that end in 'War':

In [13]:
pattern = 'War$'
filt = title.str.contains(pattern)
title[filt]

27              Captain America: Civil War
241             The Huntsman: Winter's War
323                     The Flowers of War
534                   Charlie Wilson's War
611                             Hart's War
666                         This Means War
1160                           Lord of War
1261                        The Art of War
1549                    Dragon Wars: D-War
1934                                   War
2867    Tae Guk Gi: The Brotherhood of War
2962                         5 Days of War
3577                            Men of War
3742                           Born of War
Name: title, dtype: object
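
A quick way to check what `$` accepts and rejects is to run it on a few strings. The sketch below uses the built-in `re` module and a made-up list of titles (an assumption for illustration).

```python
import re

# Hypothetical sample titles
titles = ['Civil War', 'War Horse', 'Star Wars', 'War']

# 'War$' requires the title to finish with 'War'
at_end = [t for t in titles if re.search('War$', t)]
print(at_end)
```

'Star Wars' is rejected because the trailing `s` comes between 'War' and the end of the string.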

## Start and End Anchors
The caret and dollar metacharacters are also known as **anchors** since they anchor the pattern to either the beginning or the end of the string.
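
Both anchors can appear in the same pattern, in which case the whole string has to fit between them. A minimal sketch with the built-in `re` module and a made-up list of titles (an assumption for illustration):

```python
import re

# Hypothetical sample titles
titles = ['War', 'War Horse', 'Civil War', 'Warrior']

# '^War$' must start and end with the same three characters 'War'
exact = [t for t in titles if re.search('^War$', t)]
print(exact)
```

This is consistent with the outputs above, where the movie titled 'War' shows up for both `^War` and `War$`.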

# Combining special characters
A regex can have any number of literal and special characters. The following regex matches movies that begin with **`S`**, followed by any character, followed by **`n`**.

In [14]:
pattern = r'^S.n'
filt = title.str.contains(pattern)
title[filt].head()

248                          San Andreas
315                      Son of the Mask
683         Sin City: A Dame to Kill For
784     Sinbad: Legend of the Seven Seas
1203                            Sin City
Name: title, dtype: object
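
To see how each piece of `^S.n` contributes, here is a minimal sketch with the built-in `re` module on a made-up list of titles (an assumption for illustration). Each rejected title fails on a different part of the pattern.

```python
import re

# Hypothetical sample titles
titles = ['San Andreas', 'Sin City', 'Son of the Mask',
          'Skyfall', 'Sunshine', 'The Sandlot']

# Start of string, then 'S', then any one character, then 'n'
pattern = r'^S.n'

matched = [t for t in titles if re.search(pattern, t)]
print(matched)
```

'Skyfall' fails because its third character is not `n`, and 'The Sandlot' fails because the caret does not allow 'San' to appear in the middle of the string.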

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Find all movies that have 2 consecutive z's in them.</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Find all movies that begin with 9.</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Find all movies that have a `b` as their third character.</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Find all movies with a fourth-to-last character of `M` and a last character of `e`.</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Could you use a regular expression to find movies whose titles are exactly 6 characters in length?</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">What is a more natural way to complete problem 5 without a regex?</span>

In [None]:
# your code here