# Quantifiers

This notebook continues coverage of the special characters by focusing on those that are used to match repeated characters. This group of metacharacters is called **quantifiers**, because they quantify the type of repetition you desire.

## The asterisk metacharacter `*`
The **asterisk** or **star** metacharacter matches the previous character 0 or more times. For instance, the regex `'Ah* No'` looks for strings that have an uppercase 'A' followed by 0 or more lowercase 'h' characters followed by ' No'. Let's see how this works on a Series of sample data:

In [None]:
# Create Series of fake data
import pandas as pd
s = pd.Series(['Ouch', 'Ah No', 'Ahh', 'Nooo', 'Ahhhhhhh No', 'A No', 'A'])
s

In [None]:
pattern = 'Ah* No'
filt = s.str.contains(pattern)
s[filt]

Without the ' No' at the end, the pattern matches two more values, 'Ahh' and 'A':

In [None]:
pattern = 'Ah*'
filt = s.str.contains(pattern)
s[filt]
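
`str.contains` only reports whether a match exists somewhere in the string. To see which characters the pattern actually consumed, the sketch below uses `re.search` from Python's standard library on a plain list holding the same values as the Series. The names `samples` and `matched` are illustrative and not part of the notebook's data.

```python
import re

# Same values as the Series above, kept in a plain list for this sketch
samples = ['Ouch', 'Ah No', 'Ahh', 'Nooo', 'Ahhhhhhh No', 'A No', 'A']

# Map each string to the text matched by 'Ah*' (None when there is no match)
matched = {}
for text in samples:
    m = re.search('Ah*', text)
    matched[text] = m.group() if m else None

for text, piece in matched.items():
    print(repr(text), '->', repr(piece))
```

The asterisk is greedy: it takes as many 'h' characters as are available, so 'Ahhhhhhh No' yields 'Ahhhhhhh'. It still succeeds on a lone 'A' because zero repetitions are allowed.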

## The plus metacharacter `+`
The **plus** metacharacter is very similar to the asterisk, except that it matches 1 or more of the previous character. Thus for the regex `'Ah+ No'`, the 'h' must appear at least once.

In [None]:
pattern = 'Ah+ No'
filt = s.str.contains(pattern)
s[filt]

## The question mark metacharacter `?`
The **question mark** metacharacter is similar to both the asterisk and the plus, except that it matches the previous character either 0 or 1 times.

In [None]:
pattern = 'Ah? No'
filt = s.str.contains(pattern)
s[filt]
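
The three quantifiers covered so far differ only in how many 'h' characters they allow. As a rough side-by-side comparison, the sketch below runs all three patterns over the same values with `re.search`. It uses a plain list instead of the Series, and `samples`, `patterns`, and `results` are names chosen for this illustration.

```python
import re

# Same values as the first Series in this notebook
samples = ['Ouch', 'Ah No', 'Ahh', 'Nooo', 'Ahhhhhhh No', 'A No', 'A']
patterns = ['Ah* No', 'Ah+ No', 'Ah? No']

# For each pattern, collect the strings in which it is found
results = {p: [t for t in samples if re.search(p, t)] for p in patterns}

for pattern, hits in results.items():
    print(pattern, '->', hits)
# Ah* No -> ['Ah No', 'Ahhhhhhh No', 'A No']
# Ah+ No -> ['Ah No', 'Ahhhhhhh No']
# Ah? No -> ['Ah No', 'A No']
```

Only `*` accepts every count of 'h'. The `+` rejects 'A No' because it has no 'h', and the `?` rejects 'Ahhhhhhh No' because it has more than one.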

Using another example, the regex `'Sec?r'` will match both 'Secret' and 'Serving'. Basically, the character before the question mark is **optional**.

In [None]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']
pattern = 'Sec?r'
filt = title.str.contains(pattern)
title[filt].head()
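
The same idea can be checked without the movie dataset. The sketch below applies `'Sec?r'` to a few hand-written words with `re.search`. The list `words` is hypothetical and only serves this example.

```python
import re

# Hypothetical words, not taken from the movie data
words = ['Secret', 'Serving', 'Second']

# Record the matched text for each word (None when the pattern is not found)
found = {}
for word in words:
    m = re.search('Sec?r', word)
    found[word] = m.group() if m else None

print(found)
```

'Secret' matches with the optional 'c' present and 'Serving' matches with it absent. 'Second' does not match, since no 'r' follows 'Se' or 'Sec'.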

## The curly braces metacharacter `{m,n}`
The curly braces metacharacter matches the previous character a given number of times. There are four different ways to use the curly braces:

* a single integer `a{3}` - matches exactly three 'a' characters in a row
* a single integer followed by a comma `a{3,}` - matches three or more 'a' characters in a row
* a comma followed by a single integer `a{,3}` - matches zero to three 'a' characters in a row
* two integers separated by a comma `a{3,5}` - matches between 3 and 5 'a' characters in a row
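
The four forms listed above can be compared with a small sketch. It assumes Python's `re` module and uses `re.fullmatch`, which requires the entire string to match, so that each pattern is tested against an exact number of 'a' characters. The names `patterns` and `lengths` are illustrative.

```python
import re

patterns = ['a{3}', 'a{3,}', 'a{,3}', 'a{3,5}']

# For each pattern, record which run lengths (0 through 6) match in full
lengths = {}
for p in patterns:
    lengths[p] = [n for n in range(7) if re.fullmatch(p, 'a' * n)]

for p, ns in lengths.items():
    print(p, '->', ns)
# a{3}   -> [3]
# a{3,}  -> [3, 4, 5, 6]
# a{,3}  -> [0, 1, 2, 3]
# a{3,5} -> [3, 4, 5]
```

`re.fullmatch` is used here rather than a search because a search for `a{3}` would also succeed inside a longer run such as 'aaaaa'.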

Let's create another Series by hand and match all the strings that have an 'A' followed by the letter 'h' repeated between 2 and 5 times and then ' No'.

In [None]:
s = pd.Series(['Ouch', 'Ahhh No', 'Ahh No', 'Nooo', 'Ahhhhhhh No', 'A No', 'A', 'Ahhh'])
s

In [None]:
pattern = 'Ah{2,5} No'
filt = s.str.contains(pattern)
s[filt]
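
The trailing ' No' is what enforces the upper limit of five here. As a sketch of why, the code below compares the pattern with and without the ending, using `re.search` on a plain list with the same values as the Series. The names `samples`, `with_ending`, and `without_ending` are illustrative.

```python
import re

# Same values as the Series above
samples = ['Ouch', 'Ahhh No', 'Ahh No', 'Nooo', 'Ahhhhhhh No', 'A No', 'A', 'Ahhh']

with_ending = [t for t in samples if re.search('Ah{2,5} No', t)]
without_ending = [t for t in samples if re.search('Ah{2,5}', t)]

print(with_ending)
print(without_ending)

# Without the ending, the pattern is found inside the longer run of 'h'
print(re.search('Ah{2,5}', 'Ahhhhhhh No').group())
```

Without the ending, 'Ahhhhhhh No' is matched as well: the pattern stops after five 'h' characters and nothing requires the next character to be a space. Text placed after a quantifier is therefore part of what limits the repetition.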

## Exercises
Use the `title` column of the `movie` DataFrame for these exercises.

In [None]:
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

### Exercise 1
<span  style="color:green; font-size:16px">Find all movies that have a 'z' as their 15th character.</span>

### Exercise 2
<span  style="color:green; font-size:16px">Find all movies that have the word 'Friend' or 'Friends' in them.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Find all movies that have between 40 and 43 characters in them. Can you verify the results with another `str` accessor method?</span>

### Exercise 4
<span  style="color:green; font-size:16px">Find all movies that begin with 'The' and end in 'Movie'.</span>

### Exercise 5
<span  style="color:green; font-size:16px">Create your own Series and make a regular expression that uses the `+` metacharacter. Is this character necessary?</span>