# 4. Regular Expressions part 2

### Objectives
* Know the basic functionality of the special characters - `* + ? { } [ ] \ |`
* Know how to combine multiple special characters together
* Know how to change operator precedence with parentheses
* Use `contains` to select entire values and `extract` to select substrings

This notebook continues coverage of the special characters. In the previous notebook, the dot, caret, and dollar sign metacharacters were covered.

## The asterisk metacharacter `*`
The **asterisk** or **star** metacharacter matches the previous character 0 or more times. For instance, the regex **`'Ah* No'`** looks for strings that have an uppercase 'A', followed by 0 or more lowercase 'h' characters, followed by ' No'.

Let's see how this works on a Series of fake data:

In [1]:
# Create Series of fake data
import pandas as pd
s = pd.Series(['Ouch', 'Ah No', 'Ahh', 'Nooo', 'Ahhhhhhh No', 'A No', 'A'])
s

0           Ouch
1          Ah No
2            Ahh
3           Nooo
4    Ahhhhhhh No
5           A No
6              A
dtype: object

In [2]:
pattern = 'Ah* No'
filt = s.str.contains(pattern)
s[filt]

1          Ah No
4    Ahhhhhhh No
5           A No
dtype: object

Without the ' No' at the end, it would match two more values:

In [3]:
pattern = 'Ah*'
filt = s.str.contains(pattern)
s[filt]

1          Ah No
2            Ahh
4    Ahhhhhhh No
5           A No
6              A
dtype: object
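
`contains` only reports whether a match exists, not which text was matched. As a rough illustration, the sketch below uses Python's built-in `re` module instead of pandas to show the text that `'Ah*'` matches in each value. The helper name `matched_text` is made up for this example; `re.search` scans the whole string for the pattern, which is the same kind of search that `str.contains` performs.

```python
import re

def matched_text(pattern, text):
    """Return the substring matched by pattern, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

values = ['Ouch', 'Ah No', 'Ahh', 'Nooo', 'Ahhhhhhh No', 'A No', 'A']

# Map each value to the text that 'Ah*' matched inside it
star_matches = {v: matched_text('Ah*', v) for v in values}

for value, match in star_matches.items():
    print(f'{value!r:>15} -> {match!r}')
```

A lone 'A' still counts as a match because the star allows zero 'h' characters.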

## The plus metacharacter `+`
The **plus** metacharacter is very similar to the asterisk, except that it matches 1 or more of the previous character. Thus for the regex **`'Ah+ No'`**, the 'h' must appear at least once.

In [4]:
pattern = 'Ah+ No'
filt = s.str.contains(pattern)
s[filt]

1          Ah No
4    Ahhhhhhh No
dtype: object
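
The difference between the star and the plus can be checked side by side. This is a small sketch using the standard-library `re` module on three of the values from the Series; the variable names `star` and `plus` are chosen only for this illustration.

```python
import re

values = ['Ah No', 'Ahhhhhhh No', 'A No']

# True when the pattern is found anywhere in the string
star = [bool(re.search('Ah* No', v)) for v in values]
plus = [bool(re.search('Ah+ No', v)) for v in values]

for v, s_hit, p_hit in zip(values, star, plus):
    print(f'{v!r:>15}  star={s_hit}  plus={p_hit}')
```

Only 'A No' differs between the two: it has zero 'h' characters, which the star accepts and the plus rejects.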

## The question mark metacharacter `?`
The question mark is similar to both the asterisk and the plus, except that it matches the previous character exactly 0 or 1 times.

In [5]:
pattern = 'Ah? No'
filt = s.str.contains(pattern)
s[filt]

1    Ah No
5     A No
dtype: object

Using another example, the regex **`'Sec?r'`** will match both 'Secret' and 'Serving'. Basically, the character before the question mark is **optional**.

In [6]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

In [7]:
pattern = 'Sec?r'
filt = title.str.contains(pattern)
title[filt].head(10)

198     Night at the Museum: Secret of the Tomb
282     Harry Potter and the Chamber of Secrets
355             The Secret Life of Walter Mitty
513                     The Secret Life of Pets
1229                              Secret Window
1321                                   Serenity
1410                                Secretariat
1725                               Serving Sara
1750                                Serendipity
1788     Divine Secrets of the Ya-Ya Sisterhood
Name: title, dtype: object
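
To see the optional 'c' at work, the following sketch prints the text that `'Sec?r'` matches in a few titles. It uses the standard-library `re` module rather than pandas, and 'Second Chance' is a made-up string added only to show a non-match.

```python
import re

titles = ['Secret Window', 'Serenity', 'Serving Sara', 'Second Chance']

# Record the matched text for each title, or None when nothing matches
optional_c = {}
for t in titles:
    m = re.search('Sec?r', t)
    optional_c[t] = m.group() if m else None

for t, match in optional_c.items():
    print(f'{t!r:>17} -> {match!r}')
```

'Second Chance' does not match because the 'c' is followed by 'o' rather than 'r'.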

## The curly braces metacharacter `{m,n}`
The curly braces metacharacter matches the previous character a given number of times. There are three different ways to use the curly braces:

* a single integer **`a{3}`**
* a single integer followed by a comma **`a{3,}`**
* two integers separated by a comma **`a{3,5}`**

**`a{3}`** matches exactly three consecutive a's. **`a{3,}`** matches 3 or more consecutive a's. **`a{3,5}`** matches between 3 and 5 consecutive a's.

Let's create another Series by hand and match all the strings that have an 'A', followed by the letter 'h' repeated between 2 and 5 times, followed by ' No'.

In [8]:
s = pd.Series(['Ouch', 'Ahhh No', 'Ahh No', 'Nooo', 'Ahhhhhhh No', 'A No', 'A', 'Ahhh'])
s

0           Ouch
1        Ahhh No
2         Ahh No
3           Nooo
4    Ahhhhhhh No
5           A No
6              A
7           Ahhh
dtype: object

In [9]:
pattern = 'Ah{2,5} No'
filt = s.str.contains(pattern)
s[filt]

1    Ahhh No
2     Ahh No
dtype: object
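
The three forms of the curly braces can be compared directly. The sketch below is only an illustration and uses `re.fullmatch` from the standard library, which requires the entire string to match. That is stricter than `str.contains`, but it makes the repeat counts easy to see.

```python
import re

strings = ['aa', 'aaa', 'aaaa', 'aaaaaa']
patterns = ['a{3}', 'a{3,}', 'a{3,5}']

# For each pattern, test whether each whole string is a match
results = {p: [bool(re.fullmatch(p, s)) for s in strings] for p in patterns}

for p, hits in results.items():
    print(f'{p:>7}: {hits}')
```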

## The pipe metacharacter `|`
The pipe metacharacter is equivalent to an **or** condition. It matches the entire expression on either side of the pipe. The regex **`'Friend|Enemy'`** matches any string with 'Friend' or 'Enemy' in it.

In [10]:
pattern = 'Friend|Enemy'
filt = title.str.contains(pattern)
title[filt]

403                            Enemy of the State
408                            Enemy at the Gates
1055                     My Best Friend's Wedding
1214                           Behind Enemy Lines
1413                        Friends with Benefits
1775        How to Lose Friends & Alienate People
2216                        My Best Friend's Girl
3116    Seeking a Friend for the End of the World
3495                           Friends with Money
4184                          We Are Your Friends
4279                        Dysfunctional Friends
4670                               Mutual Friends
Name: title, dtype: object

You can add as many pipes as you please:

In [11]:
pattern = 'Friend|Enemy|Good|Evil'
filt = title.str.contains(pattern)
title[filt].head(10)

55               The Good Dinosaur
343         A Good Day to Die Hard
403             Enemy of the State
408             Enemy at the Gates
670     Resident Evil: Retribution
672        The Long Kiss Goodnight
815       Resident Evil: Afterlife
923             As Good as It Gets
976      Resident Evil: Apocalypse
1055      My Best Friend's Wedding
Name: title, dtype: object
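
Because the pipe applies to everything on either side of it, parentheses are needed to limit its reach. The sketch below is a minimal illustration with the standard-library `re` module on three titles from the output above; the patterns `'Good|Evil Day'` and `'(Good|Evil) Day'` are examples invented for this comparison.

```python
import re

titles = ['A Good Day to Die Hard', 'As Good as It Gets', 'Resident Evil: Retribution']

# Without parentheses: 'Good' OR 'Evil Day'
ungrouped = [bool(re.search('Good|Evil Day', t)) for t in titles]

# With parentheses: ('Good' OR 'Evil') followed by ' Day'
grouped = [bool(re.search('(Good|Evil) Day', t)) for t in titles]

for t, u, g in zip(titles, ungrouped, grouped):
    print(f'{t!r:>30}  ungrouped={u}  grouped={g}')
```

When a pattern containing parentheses is passed to `str.contains`, pandas issues a warning about match groups. The boolean result is still computed.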

## The brackets metacharacter `[ ]`
The brackets metacharacter allows you to match one of several characters at a single particular position. As we saw with the very first example, **`'[xyz]'`** matches any single 'x', 'y', or 'z'.

Another example, **`'T[aeiou]d'`** matches any words that begin with 'T', followed by exactly one vowel and then 'd'. The brackets contain all the possible matches for a single character.

Specifically, it matches the following: 'Tad', 'Ted', 'Tid', 'Tod', and 'Tud'.

In [12]:
pattern = 'T[aeiou]d'
filt = title.str.contains(pattern)
title[filt]

18      Pirates of the Caribbean: On Stranger Tides
628                                           Ted 2
841                                    Crimson Tide
922                                             Ted
1594                            The Prince of Tides
2016                  Win a Date with Tad Hamilton!
2171                     Bill & Ted's Bogus Journey
2576                                     Tidal Wave
3053               Bill & Ted's Excellent Adventure
4283        Living Dark: The Story of Ted the Caver
4809                                        Tadpole
Name: title, dtype: object
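
Two details of `'T[aeiou]d'` are easy to overlook: the match is case-sensitive, and the brackets stand for exactly one character. The sketch below uses the standard-library `re` module; 'Studio 54' and 'Toad Road' are sample strings chosen for illustration and are not taken from the results above.

```python
import re

titles = ['Crimson Tide', 'Tadpole', 'Studio 54', 'Toad Road']

# Record the matched text for each title, or None when nothing matches
vowel_matches = {}
for t in titles:
    m = re.search('T[aeiou]d', t)
    vowel_matches[t] = m.group() if m else None

for t, match in vowel_matches.items():
    print(f'{t!r:>15} -> {match!r}')
```

'Studio 54' fails because its 't' is lowercase, and 'Toad Road' fails because two vowels sit between the 'T' and the 'd'.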

### Entire character classes within the brackets
Let's say you want to match all the lowercase letters 'a' through 'z'. You could write each letter within the brackets. Thankfully, there is a much easier way with **character classes**.

Character classes are special notation within the brackets that can be used to denote entire subsets of characters. Take the following:
* **`'[0-9]'`** represents all digits 0 through 9
* **`'[a-z]'`** represents all lowercase letters
* **`'[A-Z]'`** represents all uppercase letters
* **`'[a-zA-Z]'`** represents all lowercase and uppercase letters

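Remember, this notation only works within the brackets. Outside of them, the hyphen is an ordinary character, so **`'a-z'`** matches the literal text 'a-z'.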
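
As a quick check of the notation above, this sketch tests each bracket pattern against a handful of single characters using the standard-library `re` module. The names `range_hits` and `literal_hit` are introduced only for this example.

```python
import re

samples = ['7', 'q', 'Q', '_']
patterns = ['[0-9]', '[a-z]', '[A-Z]', '[a-zA-Z]']

# For each pattern, keep the sample characters that it matches
range_hits = {p: [s for s in samples if re.search(p, s)] for p in patterns}

# Without brackets, 'a-z' is just the three literal characters a, -, z
literal_hit = bool(re.search('a-z', 'q'))

for p, hits in range_hits.items():
    print(f'{p:>10}: {hits}')
print('a-z matches q without brackets:', literal_hit)
```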

### Digits in movies
Let's match all movies with a digit in them.

In [13]:
pattern = '[0-9]'
filt = title.str.contains(pattern)
title[filt].head()

6                 Spider-Man 3
19              Men in Black 3
31                Spider-Man 2
32                  Iron Man 3
39    The Amazing Spider-Man 2
Name: title, dtype: object

### Matching movies with 2 digits in a row
We can match movies with two digits in a row by using the digits character class twice.

In [14]:
pattern = '[0-9][0-9]'
filt = title.str.contains(pattern)
title[filt].head()

60                            2012
85                        47 Ronin
212               The 13th Warrior
258         300: Rise of an Empire
268    Around the World in 80 Days
Name: title, dtype: object

## Combining Special Characters
You are allowed to combine any number of literal and special characters within your regex. For instance, matching movies with two or more digits in a row could have been done by using the curly braces for repeats like this:

In [15]:
pattern = '[0-9]{2,}'
filt = title.str.contains(pattern)
title[filt].head()

60                            2012
85                        47 Ronin
212               The 13th Warrior
258         300: Rise of an Empire
268    Around the World in 80 Days
Name: title, dtype: object
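
Both `'[0-9][0-9]'` and `'[0-9]{2,}'` select the same titles with `contains`, but they do not match the same text. The sketch below, written with the standard-library `re` module as an illustration, shows the matched text for each pattern on a few titles from this notebook.

```python
import re

titles = ['2012', '47 Ronin', 'Spider-Man 3', 'The 13th Warrior']

def matched_text(pattern, text):
    """Return the substring matched by pattern, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

two_digits = [matched_text('[0-9][0-9]', t) for t in titles]
two_or_more = [matched_text('[0-9]{2,}', t) for t in titles]

for t, a, b in zip(titles, two_digits, two_or_more):
    print(f'{t!r:>20}  [0-9][0-9] -> {a!r}  [0-9]{{2,}} -> {b!r}')
```

The two patterns agree on which titles match. They differ only for '2012', where the first pattern stops after two digits and the second consumes all four.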

#### Find all movies that begin with 4 digits in a row
We can use the caret to anchor the digits to the start and the curly braces to match 4 digits. Because nothing restricts the character that follows, a title that begins with 5 or more digits would also match.

In [16]:
pattern = '^[0-9]{4}'
filt = title.str.contains(pattern)
title[filt].head()

60                         2012
697     3000 Miles to Graceland
1541                       1941
1707                       1911
2056                       1408
Name: title, dtype: object
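
The pattern `'^[0-9]{4}'` checks only the first four characters, so a title that starts with five digits also matches. One possible way to require exactly four is sketched below with the standard-library `re` module. It relies on `[^0-9]`, a negated set meaning "any character that is not a digit", which this notebook has not introduced, and '20000 Leagues Under the Sea' is a made-up string for illustration.

```python
import re

titles = ['2012', '3000 Miles to Graceland', '20000 Leagues Under the Sea', 'The 13th Warrior']

# Four digits at the start, with no restriction on what follows
loose = [bool(re.search('^[0-9]{4}', t)) for t in titles]

# Four digits at the start, followed by a non-digit or the end of the string
strict = [bool(re.search('^[0-9]{4}([^0-9]|$)', t)) for t in titles]

for t, l_hit, s_hit in zip(titles, loose, strict):
    print(f'{t!r:>32}  loose={l_hit}  strict={s_hit}')
```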

#### Find all movies that begin with 'The' and end with 'Movie'
We anchor 'The' to the beginning with the caret and 'Movie' to the end with the dollar symbol. We use **`.*`** in the middle to represent any character repeated 0 or more times.

In [17]:
pattern = '^The .* Movie$'
filt = title.str.contains(pattern)
title[filt].head()

319                   The Peanuts Movie
561               The Angry Birds Movie
569                  The Simpsons Movie
759                      The Lego Movie
1586    The SpongeBob SquarePants Movie
Name: title, dtype: object
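
The spaces in `'^The .* Movie$'` are literal characters, so the pattern needs a space after 'The' and another before 'Movie'. The sketch below uses the standard-library `re` module to show a consequence that is easy to miss; 'The Movie', 'The Lego Movie 2', and 'Scary Movie' are sample strings chosen only to illustrate non-matches.

```python
import re

titles = ['The Lego Movie', 'The Movie', 'The Lego Movie 2', 'Scary Movie']

# True when the title starts with 'The ' and ends with ' Movie'
the_movie = [bool(re.search('^The .* Movie$', t)) for t in titles]

for t, hit in zip(titles, the_movie):
    print(f'{t!r:>20} -> {hit}')
```

'The Movie' does not match: although `.*` may match zero characters, the string has only one space and the pattern requires two.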

#### Find all movies that are exactly 10 characters long
**`.{10}`** matches any 10 characters in a row. We must anchor it to the beginning and end to ensure that the string is exactly 10 characters in length.

In [18]:
pattern = '^.{10}$'
filt = title.str.contains(pattern)
title[filt].head()

22    Robin Hood
28    Battleship
32    Iron Man 3
76    Waterworld
78    Inside Out
Name: title, dtype: object
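
A simple way to sanity-check this regex is to compare it with a plain length test. The sketch below does so with the standard-library `re` module and Python's `len` on a few titles from this notebook; the variable names are chosen only for this example.

```python
import re

titles = ['Robin Hood', 'Battleship', 'Iron Man 3', 'Ted', 'The Lego Movie']

# Titles selected by the anchored regex
by_regex = [t for t in titles if re.search('^.{10}$', t)]

# Titles selected by counting characters directly
by_length = [t for t in titles if len(t) == 10]

print(by_regex)
print(by_regex == by_length)
```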

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Find all movies that begin with 'The' followed by a word that begins with a digit.</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Find all movies that have three consecutive capital letters in them.</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Find all movies that begin and end with a capital letter.</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Find all the movies that have a digit followed by a comma followed by a digit.</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Find all the movies that have either an ampersand or a question mark in them.</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Which movie has the most ampersands, question marks, and periods in it?</span>

In [None]:
# your code here

# Continue using regexes to find movies

# Many other str methods use regexes. Experiment with them