# Character Sets and Grouping

## Special Characters lose their meaning within the brackets
Special characters lose their special meaning within the brackets. For instance, `[.]` matches a literal dot and `[()*$]` matches a single literal parenthesis, asterisk, or dollar sign. Let's match movies with an asterisk in them:

In [None]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv')
title = movie['title']

pattern = '[*]'
filt = title.str.contains(pattern)
title[filt]

Match movies with either an asterisk or dollar sign.

In [None]:
pattern = '[*$]'
filt = title.str.contains(pattern)
title[filt]
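
To see the effect of the brackets without the movie dataset, here is a minimal sketch using Python's standard-library `re` module and a few hypothetical titles. It assumes `contains` searches anywhere in the string the way `re.search` does, which is what the pandas documentation indicates.

```python
import re

# Hypothetical titles, used here instead of the movie dataset
titles = ['M*A*S*H', 'Ocean$ Eleven', 'Up', 'E.T.']

# Inside brackets, * and $ are ordinary literal characters
matches = [t for t in titles if re.search('[*$]', t)]
print(matches)

# Outside brackets, $ is the end-of-string anchor, so every title matches
anchored = [t for t in titles if re.search('$', t)]
print(anchored)
```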

### Excluding character classes with the caret
It is possible to negate a character set by putting a caret as the first character inside the brackets. For instance, **`Z[^aeiou]`** matches strings that contain a 'Z' followed by any character other than a lowercase vowel.

In [None]:
pattern = 'Z[^aeiou]'
filt = title.str.contains(pattern)
title[filt]

The following finds all movies that have an uppercase 'T' followed by any character that is not a lowercase letter, such as an uppercase letter, digit, space, or punctuation mark.

In [None]:
pattern = 'T[^a-z]'
filt = title.str.contains(pattern)
title[filt]
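
A negated set still has to consume one character. The sketch below, which uses the standard-library `re` module and hypothetical titles, shows that a 'T' at the very end of a string does not match `T[^a-z]` because nothing follows it.

```python
import re

# Hypothetical titles chosen to exercise the pattern
titles = ['Toy Story', 'T2 Trilogy', 'TRON', 'Mr. T']

pattern = 'T[^a-z]'
negated = [t for t in titles if re.search(pattern, t)]

# 'Toy Story' fails because 'o' is lowercase.
# 'Mr. T' fails because no character follows the 'T'.
print(negated)
```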

## The backslash `\` metacharacter
The backslash metacharacter is used in conjunction with the very **next** character to change its meaning. Many of the following refer to character classes as seen above.

* `\d` - any digit. Equivalent to `[0-9]` for ASCII text
* `\D` - any non-digit
* `\s` - any single whitespace character, including normal spaces, tabs, and newlines
* `\S` - any non-whitespace character
* `\w` - any 'word' character, which is any upper or lowercase letter, digit, or underscore. Equivalent to `[A-Za-z0-9_]` for ASCII text
* `\W` - any non-word character
* `\b` - [word boundary][1]
* `\B` - non-word boundary

For instance, `^\W` matches all strings that begin with a non-word character.
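
The following sketch shows what several of these classes pick out of one hypothetical string. It uses `re.findall` from the standard library, and the patterns are written with an `r` prefix so that Python leaves the backslashes untouched.

```python
import re

text = 'Apollo 13: rated_PG (1995)'   # hypothetical string

digits = re.findall(r'\d', text)        # one match per digit
digit_runs = re.findall(r'\d+', text)   # runs of one or more digits
spaces = re.findall(r'\s', text)        # one match per whitespace character
words = re.findall(r'\w+', text)        # runs of word characters
non_word = re.findall(r'\W', text)      # everything that is not a word character
word_starts = re.findall(r'\b\w', text) # first character after each word boundary

print(digits)
print(digit_runs)
print(spaces)
print(words)
print(non_word)
print(word_starts)
```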

### Prefix the string with `r` to make it a raw string

The backslash is a special character in normal Python strings. `\n` represents a newline character, `\t` represents a tab. To be sure your regex is exactly what you see, it's best to use **raw** Python strings. Prepend the string with an r **outside** of the quotation marks to make it a raw string. Python will treat the backslash as a literal backslash without any special meaning.

[1]: https://www.regular-expressions.info/wordboundaries.html

In [None]:
pattern = r'^\W+'
filt = title.str.contains(pattern)
title[filt]
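
The difference matters most for sequences that Python itself interprets. As a small sketch with a hypothetical title, compare `'\b'`, which Python turns into a single backspace character, with `r'\b'`, which reaches the regex engine as a word boundary.

```python
import re

normal = '\bIn\b'    # Python converts each \b into a backspace character
raw = r'\bIn\b'      # backslashes are kept, so the regex sees word boundaries

print(len(normal), len(raw))

normal_match = re.search(normal, 'In Bruges')   # looks for backspace characters
raw_match = re.search(raw, 'In Bruges')         # looks for the word 'In'

print(normal_match)
print(raw_match.group())
```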

### Backslash escapes special characters
As we just saw, special characters lose their special meaning within the brackets. Preceding a special character with a backslash has the same effect. For instance, `\*` represents a literal asterisk and is the same as **`[*]`**.

In [None]:
pattern = r'\*'
filt = title.str.contains(pattern)
title[filt]
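
The sketch below checks that the two spellings select the same hypothetical titles, and shows why the escape is needed at all: a bare `*` is not a valid pattern. It also uses `re.escape`, a standard-library helper that adds the backslashes for you.

```python
import re

titles = ['M*A*S*H', 'E.T.', 'Up']   # hypothetical titles

with_backslash = [t for t in titles if re.search(r'\*', t)]
with_brackets = [t for t in titles if re.search('[*]', t)]
print(with_backslash == with_brackets)

# An unescaped * with nothing before it cannot be compiled
try:
    re.compile('*')
    bare_star_is_valid = True
except re.error:
    bare_star_is_valid = False
print(bare_star_is_valid)

# re.escape backslash-escapes the special characters in a string
escaped = re.escape('E.T.')
print(escaped)
```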

## The parentheses metacharacters `( )`
The parentheses metacharacters are used to **group** together parts of the regular expression. For instance, let's say we want to find all movies that begin with the word 'In' or 'My'. You might think about using `'^In|My'`:

In [None]:
pattern = '^In|My'
filt = title.str.contains(pattern)
title[filt].head()

### The meaning of `^In|My`
There are a couple of things wrong with this regex. First, as seen above, it returns movies that merely begin with the letters 'In', such as 'Indiana' or 'Inside', instead of just the word 'In'.

Second, the movie, 'Journey 2: The Mysterious Island' has 'My' within the name and not at the beginning. This mistake is happening because of **operator precedence** within the regex. 

`^In|My` matches movies that begin with the letters 'In' **OR** have 'My' anywhere inside them. The caret is only anchoring 'In'.
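
One way to confirm this reading is to look at where each match starts. The sketch below uses the standard-library `re` module with hypothetical titles and records the matched text and its starting position.

```python
import re

# Hypothetical titles
titles = ['In Bruges', 'Indiana Jones', 'Oh My God', 'Mystic River']

results = []
for t in titles:
    m = re.search('^In|My', t)
    # m.start() is the index where the match begins
    results.append((t, m.group(), m.start()))

for row in results:
    print(row)
```

Every title matches, and 'Oh My God' matches at index 3, which shows that the caret does not apply to the 'My' alternative.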

## Using parentheses to change operator precedence
We can use parentheses to change the operator precedence, just as we do in mathematical expressions. Let's modify our expression to `'^(In|My)'`. Ignore the warning for now. We will take care of it below.

In [None]:
pattern = '^(In|My)'
filt = title.str.contains(pattern)
title[filt].head(15)

### Getting closer
We grouped `In|My` together so the movie must begin with one of them. We'd like to make sure that 'In' and 'My' are not part of a larger word. To do that we can use the word boundary, `\b`, which matches the position where a word ends, so no further word character can follow.
* `'^(In|My)\b'`


In [None]:
pattern = r'^(In|My)\b'
filt = title.str.contains(pattern)
title[filt].head(15)
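
The effect of the grouping and of the word boundary can be compared side by side. This is a minimal sketch with the standard-library `re` module and hypothetical titles rather than the movie dataset.

```python
import re

# Hypothetical titles
titles = ['In Bruges', 'Indiana Jones', 'Oh My God', 'My Girl', 'Mystic River']

# The group anchors both alternatives to the start of the string
grouped = [t for t in titles if re.search('^(In|My)', t)]
print(grouped)

# Adding \b rejects titles where 'In' or 'My' continues into a longer word
bounded = [t for t in titles if re.search(r'^(In|My)\b', t)]
print(bounded)
```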

### Why are we getting `UserWarning: This pattern has match groups`?
Besides operator precedence, parentheses have an alternative function and that is to extract specific text from a string. In regex terminology, we call this a **capturing group**. This warning is alerting us that we have used the syntax for a capture group but are not using a method to do extraction. It tells us to use the `extract` method if in fact we were interested in extracting this group.

### Specifying a non-capturing group
Our regular expression is valid in its current state. We can signal that this is a **non-capturing group** by placing `?:` as the first two characters inside of the parentheses. This eliminates the warning.

In [None]:
pattern = r'^(?:In|My)\b'
filt = title.str.contains(pattern)
title[filt].head()
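
The difference between the two kinds of group is visible on a compiled pattern. In this sketch, which uses the standard-library `re` module and a hypothetical title, the `groups` attribute reports how many capture groups a pattern contains.

```python
import re

capturing = re.compile(r'^(In|My)\b')
non_capturing = re.compile(r'^(?:In|My)\b')

# Number of capture groups in each compiled pattern
print(capturing.groups)
print(non_capturing.groups)

# Both patterns match the same text; only the capturing one remembers it
title = 'My Girl'   # hypothetical title
captured = capturing.search(title).groups()
not_captured = non_capturing.search(title).groups()
print(captured)
print(not_captured)
```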

## Using capture groups with the `extract` string method
We can use the capturing version of the same pattern, `^(In|My)\b`, with the **`extract`** string method to extract the group.

In [None]:
pattern = r'^(In|My)\b'
title.str.extract(pattern).head()

### Why are all the values missing?
Only a small fraction of the movie titles begin with 'In' or 'My'. Let's drop the missing values and see the extracted text:

In [None]:
pattern = r'^(In|My)\b'
title.str.extract(pattern).dropna().head()

### Extracting the fourth word of movie titles that begin with 'In' or 'My'
Let's try something a bit more complex and extract the fourth word of all movies that begin with the words 'In' or 'My'. For instance, the movie, 'In the Heart of the Sea' meets our criteria. The word 'of' would be extracted from it. We will make the assumption that words are separated by spaces.

To accomplish this, we need to match movies that begin with 'In' or 'My' and then match two words, before capturing the fourth word. We already saw that `^(?:In|My)` completes the first part of this task. 

We now need to match a space followed by a word. We can use `\s` to match a space and follow this with `\S+` to match one or more non-space characters. Combining these, `\s\S+` is what we can use to match our definition of a "word".

We want to match this pattern exactly twice. We can do so with `{2}`, but we must ensure that it applies to all of `\s\S+`, so we wrap it in parentheses to control operator precedence, like this: `(\s\S+){2}`. We must also signal that this is not a capturing group, which gives us `(?:\s\S+){2}` to match two consecutive words.

We still need to match a space after the third word and then capture the fourth word. Finally, we have a workable regex with the following:

In [None]:
pattern = r'^(?:In|My)(?:\s\S+){2}\s(\S+)'
title.str.extract(pattern).dropna().head()
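
To trace the pattern on individual strings, here is a sketch using `re.search` from the standard library. Apart from 'In the Heart of the Sea', the titles are hypothetical test cases, and `None` marks a title that does not match.

```python
import re

pattern = r'^(?:In|My)(?:\s\S+){2}\s(\S+)'

titles = [
    'In the Heart of the Sea',
    'My Big Fat Greek Wedding',
    'In Bruges',                      # too few words
    'Into the Woods Again Tonight',   # 'In' is not followed by a space
]

fourth_words = {}
for t in titles:
    m = re.search(pattern, t)
    # group(1) is the only capture group: the fourth word
    fourth_words[t] = m.group(1) if m else None

print(fourth_words)
```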

### `extract` must have capture groups
The regex used with the **`extract`** string method must have capture groups. If not, an error will be raised.


### Multiple capture groups for `extract`
You can capture more than one group with `extract`. Take a look at the following regex. For movies that begin with 'The', it captures the word that follows 'The' and the word that follows 'of'.

In [None]:
pattern = r'^The (\S+) .*of (\S+)'
title.str.extract(pattern).dropna().head()
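
With the standard-library `re` module, the captured groups come back as a tuple, one entry per pair of parentheses. The sketch below uses hypothetical titles. Note that `.*` is greedy, so when 'of' appears more than once the second group holds the word after the last 'of'.

```python
import re

pattern = r'^The (\S+) .*of (\S+)'

# Hypothetical titles
titles = [
    'The Lord of the Rings',
    'The Return of the King of Kings',   # two occurrences of 'of'
    'The Matrix',                        # no 'of', so no match
]

captured = {}
for t in titles:
    m = re.search(pattern, t)
    captured[t] = m.groups() if m else None

print(captured)
```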

## Many other string methods take regexes
You can use regular expressions in several other Series string methods such as **`count`**, **`replace`** and **`split`**. For instance, the following counts the number of times two consecutive lowercase vowels appear in each string. We then find the maximum count among the movie titles.

In [None]:
pattern = r'[aeiou]{2}'
title.str.count(pattern).max()
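
The same three operations exist in the standard-library `re` module as `findall`, `sub`, and `split`, which makes them easy to try on a single string. This is only a sketch of the regex behavior; the pandas methods have their own parameters, and depending on the pandas version `replace` may need `regex=True` to treat the pattern as a regex.

```python
import re

title = 'Beauty and the Beast'

# Counting: matches do not overlap, so 'eau' contributes one pair ('ea')
vowel_pairs = len(re.findall(r'[aeiou]{2}', title))
print(vowel_pairs)

# Replacing: swap each run of whitespace for an underscore
replaced = re.sub(r'\s+', '_', title)
print(replaced)

# Splitting: break the string on runs of whitespace
pieces = re.split(r'\s+', title)
print(pieces)
```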

## Other Dialects of Regex
Regular expressions are not quite standardized for every single programming language, so you will need to ensure you are implementing the right 'dialect' for each language.

## More to Regex
There is more to regular expressions not covered in these notebooks. 

* [Official Python Documentation][1]
* [Thorough Online Tutorial][2]
* [Practice with explanations][3] - make sure to choose Python

[1]: https://docs.python.org/3/howto/regex.html
[2]: https://www.regular-expressions.info/
[3]: https://regex101.com/

## Regex Summary
* Literal characters represent themselves
* Special or metacharacters represent something entirely different
* The primary use of regex is to either match a particular string or extract a substring
* Many Pandas string methods accept regular expressions
* You will often use `contains` and `extract`
* Use raw Python strings when writing regex. Raw strings have 'r' prepended to them.

### Metacharacter Summary
* All metacharacters - `. ^ $ * + ? { } [ ] \ | ( )`

### The dot `.`
* `.` - Matches any character except line breaks

### Anchors - `^, $`
* `^` - Anchors next characters to beginning
    * `^My` matches strings that begin with 'My'
* `$` - Anchors previous characters to end
    * `Movie$` matches strings that end with 'Movie'

### Quantifiers - `*, +, ?, {}`
* `*` - Matches 0 or more occurrences of the previous character, set, or group
* `+` - Matches 1 or more occurrences of the previous character, set, or group
* `?` - Matches 0 or 1 occurrences of the previous character, set, or group
* `{m}` - Matches exactly m of the previous character, set, or group
* `{m,}` - Matches m or more of the previous character, set, or group
* `{,n}` - Matches up to n of the previous character, set, or group
* `{m,n}` - Matches between m and n repeats of the previous character, set, or group
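
The quantifiers above can be checked against a few short strings. In this sketch, `matching` is a hypothetical helper that uses `re.fullmatch` so the whole string must fit the pattern, and the candidate strings are made up for illustration.

```python
import re

candidates = ['ct', 'cat', 'caat', 'caaat']

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matching('ca*t'))      # zero or more 'a'
print(matching('ca+t'))      # one or more 'a'
print(matching('ca?t'))      # zero or one 'a'
print(matching('ca{2}t'))    # exactly two 'a'
print(matching('ca{2,}t'))   # two or more 'a'
print(matching('ca{,1}t'))   # up to one 'a'
print(matching('ca{1,2}t'))  # one or two 'a'

# A quantifier applies only to the item directly before it,
# so a group is needed to repeat more than one character
group_repeat = re.fullmatch('(?:ab){2}', 'abab') is not None
char_repeat = re.fullmatch('ab{2}', 'abab') is not None
print(group_repeat, char_repeat)
```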

### Character Sets
* `[]` - A character set to match one out of many characters. `[aeiou]` matches a single vowel
* `[a-z]`, `[A-Z]`, `[0-9]` - Character sets for lowercase, uppercase, and digits
* `[^abc]` - Use caret at beginning of bracket to match anything but these characters
* `\` - backslash changes meaning of next character
* `\s` - whitespace - single space, tab, new-line
* `\S` - non-whitespace
* `\w` - word character - lower/uppercase, digits, and underscore
* `\W` - non-word-character
* `\d` - digits
* `\D` - non-digits
* `\b` - word boundary - matches empty string between words, that is between `\w` and `\W`
* `\B` - non-word boundary
* `\.` - Backslash escapes a special character. `\.` matches a literal dot and `\*` matches a literal asterisk

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">For all movies that begin with 'The' and are followed by the next word that begins with a digit, extract just the digits part of this word.</span>

### Exercise 2
<span  style="color:green; font-size:16px">Find all movies that have two separate numbers in them. An example would be, '7 days and 7 nights'.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Find all the movies that have 6 or more non-vowel and non-space characters in a row.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Extract the very next character after 't' or 'T' for each movie.</span>

### Exercise 5
<span  style="color:green; font-size:16px">What is the most common character after 't' or 'T'?</span>

### Exercise 6
<span style="color:green; font-size:16px">Extract all the words that begin with 'T' or 't' and end in 'e', then find their frequency. Research the word boundary special character.</span>