# Or Conditions and Character Classes

In regular Python, the `or` operator tests whether at least one of two conditions evaluates as `True`. Regular expressions provide two separate special characters for testing multiple conditions: the pipe, `|`, and the square brackets, `[]`. This chapter covers these two types of or operations along with **character classes**, a compact way to represent spans of characters using either the square brackets or the backslash metacharacter.

In [None]:
import re
import pandas as pd

def find_pattern(s, pattern, **kwargs):
    filt = s.str.contains(pattern, **kwargs)
    return s[filt]

movie = pd.read_csv('../data/movie.csv')
title = movie['title']
title.head(3)

## The pipe metacharacter `|`

The pipe metacharacter, `|`, allows you to match two or more regex patterns within a single regex. Take a look at the regex `'Friend|Enemy'`. It matches any text containing the exact character sequence `Friend` or `Enemy`.

In [None]:
find_pattern(title, r'Friend|Enemy').head()

You can add as many pipes as you wish to search for any number of patterns.

In [None]:
find_pattern(title, r'Friend|Enemy|Good|Evil').head(10)
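
As a minimal sketch of the same idea outside of pandas, the block below applies a pattern with several pipes through Python's `re` module. The titles in `sample_titles` are made up for illustration and are not taken from the movie dataset.

```python
import re

# Hypothetical sample titles, not taken from the movie dataset
sample_titles = ['My Best Friend', 'Enemy at the Gates', 'Good Will Hunting', 'Avatar']

pattern = re.compile(r'Friend|Enemy|Good|Evil')

# re.search scans the whole string for a match anywhere, much like str.contains
matched = [t for t in sample_titles if pattern.search(t)]
print(matched)
```

Only `'Avatar'` is left out, since it contains none of the four alternatives.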

### Each section is a separate and independent regex

The pipe character creates independent patterns that are each matched separately against the entire text. It is no different than calling the `contains` method once per pattern and combining the results with the `or` Series operator (which happens to be the same pipe character used in regexes).

In [None]:
filt1 = title.str.contains(r'Friend')
filt2 = title.str.contains(r'Enemy')
filt3 = title.str.contains(r'Good')
filt4 = title.str.contains(r'Evil')
filt = filt1 | filt2 | filt3 | filt4
filt.sum()

The single regex with pipes returns the same number of titles as the sum above, which is consistent with both approaches matching the same titles.

In [None]:
find_pattern(title, r'Friend|Enemy|Good|Evil').size
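
The equivalence can also be checked element by element rather than by count alone. The sketch below does this with plain Python lists and the `re` module. The sample titles are hypothetical, and the comparison assumes that pandas applies the same regex semantics as `re`.

```python
import re

# Hypothetical sample titles used only for illustration
sample_titles = ['Friends with Benefits', 'Public Enemies', 'The Good Shepherd',
                 'Resident Evil', 'Titanic']
words = ['Friend', 'Enemy', 'Good', 'Evil']

# One regex with pipes: 'Friend|Enemy|Good|Evil'
combined = [bool(re.search('|'.join(words), t)) for t in sample_titles]

# Four separate searches joined with a logical or
separate = [any(re.search(w, t) for w in words) for t in sample_titles]

print(combined)
print(combined == separate)
```

Note that `'Public Enemies'` is not matched by either approach because it does not contain the exact sequence `Enemy`.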

### Each independent regex can be complex

There is no restriction for the independent regexes separated by the pipe character. Below, we match titles that begin with `G` and end in `n` or begin with `D` and end in `s`, or begin with `F` and end in `n`.

In [None]:
find_pattern(title, r'^G.*n$|^D.*s$|^F.*n$').head()
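
Because each alternative is an independent regex, anchors only apply to the alternative they are written in. The sketch below contrasts a pattern that anchors every alternative with one that anchors only part of each. The titles and variable names are illustrative assumptions.

```python
import re

# Hypothetical sample titles
samples = ['Green Lantern', 'Despicable Minions', 'Gone Girl', 'Mad Dogs']

# Every alternative carries its own start and end anchor
per_branch = re.compile(r'^G.*n$|^D.*s$')

# Here ^ belongs only to the first alternative and $ only to the second
loose = re.compile(r'^G.*n|D.*s$')

strict_hits = [bool(per_branch.search(t)) for t in samples]
loose_hits = [bool(loose.search(t)) for t in samples]
print(strict_hits)
print(loose_hits)
```

The second pattern accepts `'Gone Girl'` (starts with `G` and contains an `n`) and `'Mad Dogs'` (contains a `D` and ends in `s`), which the fully anchored pattern rejects.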

## The brackets metacharacter `[ ]`

The brackets metacharacter allows you to match one of several characters at a single position. To use, place all of the possible characters you would like to match for a single position inside the brackets. The pattern `'S[aeiou]n'` matches text that has an `'S'`, followed by exactly one vowel and then `'n'`. Specifically, it matches `'San'`, `'Sen'`, `'Sin'`, `'Son'`, and `'Sun'`. The brackets are a single character **or** condition.

In [None]:
find_pattern(title, r'S[aeiou]n').head(7)
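
To see that the brackets stand for exactly one character, here is a small sketch using the `re` module on a handful of made-up titles.

```python
import re

pattern = re.compile(r'S[aeiou]n')

# Hypothetical sample titles
samples = ['San Andreas', 'Sunshine', 'Sing', 'Soon', 'Snatch', 'Poison']

results = [bool(pattern.search(t)) for t in samples]
print(dict(zip(samples, results)))
```

`'Soon'` fails because two vowels sit between `S` and `n`, `'Snatch'` fails because there is no vowel at all, and `'Poison'` fails because its `s` is lowercase.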

### Entire character classes within the brackets

Let's say you want to match exactly one of the lowercase letters 'a' through 'z'. You could do so by writing each of the 26 letters within the brackets, creating a very long pattern. Thankfully, there is a much easier way with **character classes**.

Character classes represent entire subsets of characters and are written within the square brackets with a hyphen separating the start and end character. This is the only part of the regex syntax covered here where hyphens have a meaning. Outside of the brackets, they are literal characters and have no special meaning. There are three common character classes: one for lowercase letters, one for uppercase letters, and one for the digits 0 through 9.

* `'[a-z]'` - all lowercase letters
* `'[A-Z]'` - all uppercase letters
* `'[0-9]'` - all digits 0 through 9

### Combining character classes

Each of these character classes may be combined within a single set of square brackets.

* `'[a-zA-Z]'` - all lowercase and uppercase letters
* `'[a-zA-Z0-9]'` - all lowercase and uppercase letters and digits

### Different start and end characters

It's possible to choose the start and end of the character class.

* `'[d-s]'` lowercase letters d through s
* `'[H-K]'` uppercase letters H through K
* `'[5-7]'` digits 5 through 7
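
The ranges listed above can be enumerated directly. This sketch tests every ASCII letter and digit against each class with `re.fullmatch` and collects the characters that match. The variable names are illustrative.

```python
import re
import string

# Collect the characters accepted by each custom range
d_to_s = ''.join(c for c in string.ascii_lowercase if re.fullmatch(r'[d-s]', c))
h_to_k = ''.join(c for c in string.ascii_uppercase if re.fullmatch(r'[H-K]', c))
five_to_seven = ''.join(c for c in string.digits if re.fullmatch(r'[5-7]', c))

# A combined class accepts every ASCII letter and digit, but not the underscore
all_alnum = all(re.fullmatch(r'[a-zA-Z0-9]', c)
                for c in string.ascii_letters + string.digits)
underscore = re.fullmatch(r'[a-zA-Z0-9]', '_')

print(d_to_s, h_to_k, five_to_seven, all_alnum, underscore)
```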

### Character classes technically use Unicode code points

The start and end characters of a character class do not have to strictly be either both lowercase, both uppercase, or both digits. Any characters may be used as the start and end as long as the Unicode code point of the start character is not greater than that of the end character. Let's examine the character class `'[7-C]'`. First, we'll get the Unicode code point of the start and end with the builtin `ord` function.

In [None]:
ord('7'), ord('C')

We print the character mapped to each of the code points between the two numbers above.

In [None]:
for i in range(55, 68):
    print(i, chr(i))

The character class `[7-C]` represents each of the characters above, which includes several non-digit, non-uppercase ones. Using character classes like this that overlap different regions of Unicode is atypical, but shows exactly how Python processes the regex. As another example, the character class `'[%-~]'` matches all characters with Unicode code point 37 through 126.

In [None]:
ord('%'), ord('~')
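
The code point rule can be verified with a short sketch: build every character from `'7'` through `'C'`, confirm that each one matches `[7-C]`, and confirm that the neighbors just outside the range do not. The last part shows that a reversed range is rejected by the `re` module when the pattern is compiled.

```python
import re

start, end = ord('7'), ord('C')

# Every character whose code point lies in the closed range
in_class = ''.join(chr(i) for i in range(start, end + 1))
all_match = all(re.fullmatch(r'[7-C]', c) for c in in_class)

# The characters immediately outside the range do not match
below = re.fullmatch(r'[7-C]', chr(start - 1))
above = re.fullmatch(r'[7-C]', chr(end + 1))

# A range whose start has a higher code point than its end is an error
try:
    re.compile(r'[C-7]')
    reversed_allowed = True
except re.error:
    reversed_allowed = False

print(in_class, all_match, below, above, reversed_allowed)
```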

### Character class examples

Here, we find all movies that begin with at least two capital letters.

In [None]:
find_pattern(title, r'^[A-Z]{2}').head()

Here, we find all movies that end in two consecutive digits preceded by a space character.

In [None]:
find_pattern(title, r' [0-9]{2}$').head()

### Excluding characters within the brackets

Let's say you would like to find all titles that do NOT begin with the letter `A`, `B`, `G`, `S`, or `T`. To exclude one or more characters, place a caret, `^`, as the first character within the brackets followed by all the characters you want to exclude. The caret has a completely different meaning when it is used as the first character within the brackets. The first caret below is used as its normal start anchor special character. The second caret is used to exclude each of the other characters in the brackets. The brackets have their own special set of syntax and this is one of the rules.

In [None]:
find_pattern(title, r'^[^ABGST]').head()

Here we find all titles that do not begin with B through Y. These are mostly going to be movies that begin with A or Z, but will include any movies that begin with lowercase letters, digits, or any of the thousands of other non-uppercase Unicode characters.

In [None]:
find_pattern(title, r'^[^B-Y]').head()
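
A negated class still has to match one character; it just matches any character outside the listed set. The sketch below runs the same pattern over a few made-up titles, including an empty string.

```python
import re

pattern = re.compile(r'^[^B-Y]')

# Hypothetical sample titles, plus an empty string
samples = ['Avatar', 'Zodiac', 'Batman', '21 Jump Street', 'xXx', '']

results = [bool(pattern.search(t)) for t in samples]
print(dict(zip(samples, results)))
```

The empty string does not match, because there is no first character for `[^B-Y]` to match.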

### Special characters lose their meaning within the brackets

Most special characters are treated as literal characters when placed within the brackets. The exceptions are the caret as the first character, the hyphen between two characters, the backslash, and the closing bracket. Here, titles containing an asterisk or left parenthesis are returned.

In [None]:
find_pattern(title, r'[*(]').head()
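
The difference is easiest to see with the dot, which matches any character outside the brackets but only a literal period inside them. This is a small sketch with hypothetical titles.

```python
import re

# Hypothetical sample titles
samples = ['M*A*S*H', '(500) Days of Summer', 'Up', 'E.T.']

# The asterisk and parenthesis are literal inside the brackets
star_or_paren = [t for t in samples if re.search(r'[*(]', t)]

# Inside brackets the dot is a literal period
dot_literal = [t for t in samples if re.search(r'[.]', t)]

# Outside brackets the dot matches any character, so every title matches
dot_any = [t for t in samples if re.search(r'.', t)]

print(star_or_paren)
print(dot_literal)
print(len(dot_any))
```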

### Combining character classes with single characters

It's possible to combine character classes with single characters inside the brackets. Here, we find movies that begin with a digit 2 through 6 or start with `Y` or `@`.

In [None]:
find_pattern(title, r'^[2-6Y@]').head()

## Character classes with the backslash metacharacter `\`

More character classes may be represented with the backslash special character, `\`. The backslash character is used in conjunction with the **very next** character to identify a specific class. There are three main character classes that can be defined with the backslash.

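* `\d` - any single digit - equivalent to `[0-9]` for ASCII text (by default, Python 3 also matches digits from other scripts)
* `\s` - any single whitespace character, such as a space, newline, or tab
* `\w` - any single 'word' character, which is any uppercase or lowercase letter, digit, or underscore - equivalent to `[A-Za-z0-9_]` for ASCII text (by default, Python 3 also matches letters and digits from other scripts)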

Let's use each of these character classes to search for movies that begin with three digits, are followed by a space, and then have at least 8 consecutive word characters.

In [None]:
find_pattern(title, r'^\d{3}\s\w{8,}')

Without these new character classes provided by the backslash, the regex pattern would have been `^[0-9]{3} [A-Za-z0-9_]{8,}`. While the backslash isn't necessary for these classes, it does shorten the syntax.
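
The sketch below compares the backslash version of the pattern with the explicit bracket version on a few made-up titles. It also illustrates one difference worth knowing: for Python 3 string patterns, `\w` matches Unicode letters such as an accented `e` unless the `re.ASCII` flag is passed, whereas `[A-Za-z0-9_]` never does.

```python
import re

# Hypothetical sample titles
samples = ['300 Spartans_Rise', '300: Rise', '12 Monkeys', '101 Dalmatians']

short_form = [bool(re.search(r'^\d{3}\s\w{8,}', t)) for t in samples]
long_form = [bool(re.search(r'^[0-9]{3} [A-Za-z0-9_]{8,}', t)) for t in samples]
print(short_form, long_form)

accented = '\u00e9'  # the letter e with an acute accent
unicode_word = re.fullmatch(r'\w', accented)
ascii_word = re.fullmatch(r'\w', accented, flags=re.ASCII)
explicit_class = re.fullmatch(r'[A-Za-z0-9_]', accented)
print(bool(unicode_word), ascii_word, explicit_class)
```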

### Complement character classes with uppercase letters

All three character classes above use lowercase letters following the backslash. To refer to the complement (all characters that are NOT in the class), use the uppercase version of the letter.

* `\D` - any single non-digit - equivalent to `[^0-9]`
* `\S` - a single non-space character
* `\W` - any single non-word character - equivalent to `[^A-Za-z0-9_]`

Here, we select all movies that are exactly 10 characters in length and have no spaces in them.

In [None]:
find_pattern(title, r'^\S{10}$').head()
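
As a quick check of the complement class, this sketch applies the same pattern to four made-up titles with the `re` module.

```python
import re

pattern = re.compile(r'^\S{10}$')

# Hypothetical sample titles
samples = ['Terminator', 'Spider-Man', 'Robin Hood', 'Gladiator']

results = [bool(pattern.search(t)) for t in samples]
print(dict(zip(samples, results)))
```

`'Robin Hood'` is 10 characters long but contains a space, and `'Gladiator'` has only 9 characters, so neither matches. The hyphen in `'Spider-Man'` counts as a non-space character.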

## Word boundaries with `\b`

Occasionally, you'll want to ensure that you are matching at the beginning or end of a particular word. Let's say we are interested in finding words that begin with `'A'` and end in `'r'`. We could naively try the regex below, which matches `'Avatar'`, `'Avengers'`, and `'America'`.

In [None]:
find_pattern(title, r'A\w+r').head()

Only `'Avatar'` matches the desired pattern. We could try specifying the end of a word by adding a space to our pattern, but this eliminates single word movies such as `'Avatar'`.

In [None]:
find_pattern(title, r'A\w+r ').head()

Use the **word boundary**, `\b`, to help match the end of a word. It matches only at the start and end of a word. Here, we make sure that `'A'` is the first letter of a word and `'r'` is the end.

In [None]:
find_pattern(title, r'\bA\w+r\b').head()
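
The effect of the word boundary can be compared side by side with the naive pattern. This is a minimal sketch on hypothetical titles using the `re` module.

```python
import re

# Hypothetical sample titles
samples = ['Avatar', 'Avengers', 'Captain America', 'Alter Ego']

naive = [t for t in samples if re.search(r'A\w+r', t)]
bounded = [t for t in samples if re.search(r'\bA\w+r\b', t)]

print(naive)
print(bounded)
```

The naive pattern accepts all four titles. With the boundaries in place, `'Avengers'` and `'Captain America'` drop out because the `r` is followed by another word character.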

The `\b` matches at a position that has a word character on one side and a non-word character on the other. Above, it checks for a non-word character (such as a space) immediately before `'A'` and immediately after `'r'`. The beginning and end of the string also count as boundaries when the adjacent character is a word character.

It is a **zero-width** match that does not actually represent a single character; it just checks for the word boundary. For instance, if we try to find movies that have a word that begins with `'A'` and ends in `'r'`, with the following word beginning with `'T'`, then the following would NOT work and will not match any string.

In [None]:
find_pattern(title, r'\bA\w+r\bT')

The above attempts to match `'T'` immediately after `'r'`, but also match a word boundary between them. That is impossible. An alternative way of describing `\b` as zero-width is to say it does not **consume** any characters. You can analogize it to stopping at the border between countries to show your passport. Your car doesn't move any further, but there is some kind of validation that happens. The word boundary is the validation that happens here. To match a following word that begins with capital `'T'`, you will need to add a pattern for the whitespace between the words (`'\s'` in this case) to the regex.

In [None]:
find_pattern(title, r'\bA\w+r\b\sT').head()
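
The zero-width behavior can be inspected directly through the match object. In this sketch, the title `'Alter Things'` is a made-up example.

```python
import re

# A bare \b matches at position 0 and consumes nothing
m = re.search(r'\b', 'Avatar')
print(m.span(), repr(m.group()))

sample = 'Alter Things'  # hypothetical title

# No position can be both a word boundary and sit directly between 'r' and 'T'
no_space = re.search(r'\bA\w+r\bT', sample)

# Consuming the whitespace between the words makes the match possible
with_space = re.search(r'\bA\w+r\b\sT', sample)

print(no_space)
print(with_space.group())
```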

Non-word boundaries are represented by `\B`. Here, we match movies containing a word that begins with `'The'` but does not end immediately after the `'e'`.

In [None]:
find_pattern(title, r'\bThe\B').head()
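
A small sketch of the non-word boundary on hypothetical titles:

```python
import re

# Hypothetical sample titles
samples = ['The Matrix', 'Them', 'Theatre of Blood', 'Bathe']

longer_words = [t for t in samples if re.search(r'\bThe\B', t)]
print(longer_words)
```

`'The Matrix'` is excluded because a word boundary follows the `e`, and `'Bathe'` is excluded because it never contains a capitalized `The` at the start of a word.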

### Backslashes escape special characters

A backslash followed by a metacharacter escapes its special meaning and reduces it to being treated as a literal character. Here, we find movies that have a capital letter followed by a literal period, `'.'`, and then another capital letter.

In [None]:
find_pattern(title, r'[A-Z]\.[A-Z]').head()
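
To show what the escape changes, the sketch below runs the pattern with and without the backslash on a few made-up titles.

```python
import re

# Hypothetical sample titles
samples = ['L.A. Confidential', 'LA Story', 'Mr. Brooks', 'G.I. Jane']

# Escaped: the dot must be a literal period
escaped = [t for t in samples if re.search(r'[A-Z]\.[A-Z]', t)]

# Unescaped: the dot matches any single character
unescaped = [t for t in samples if re.search(r'[A-Z].[A-Z]', t)]

print(escaped)
print(unescaped)
```

Without the escape, `'LA Story'` also matches, because the space between `A` and `S` satisfies the unescaped dot.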

### Backslash character classes are valid within the square brackets

While special characters do lose their meaning within the square brackets, the backslash character classes do not. The pattern `[\d\s]` matches a single digit or whitespace character. The word boundary, `\b`, which is not a character class, represents the backspace character (a very rare character in modern computing) when inside the square brackets.
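
Both behaviors can be confirmed with the `re` module. In this sketch, `'Apollo 13'` is just a sample string, and the string `'a\bb'` is written without the `r` prefix so that it contains an actual backspace character.

```python
import re

# [\d\s] matches one character that is either a digit or whitespace
found = re.findall(r'[\d\s]', 'Apollo 13')
print(found)

# Inside brackets, \b stands for the backspace character (code point 8)
backspace_hit = re.search(r'[\b]', 'a\bb')
no_hit = re.search(r'[\b]', 'a b')
print(ord(backspace_hit.group()), no_hit)
```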

## More methods that accept regexes

Thus far we've only used the `contains` string-only Series method with our regexes. This section will cover several other methods that also accept regexes.

### The `count` method

The `count` string-only Series method counts non-overlapping occurrences of the given pattern. Below, we count the number of uppercase letters in each title.

In [None]:
title.str.count(r'[A-Z]').head()
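
Counting non-overlapping matches can be sketched in plain Python as the length of the list returned by `re.findall`. The titles are hypothetical, and the assumption is that the `count` string method follows the same semantics.

```python
import re

# Hypothetical sample titles
samples = ['Avatar', 'WALL-E', 'The Dark Knight']

capital_counts = [len(re.findall(r'[A-Z]', t)) for t in samples]
print(capital_counts)

# Non-overlapping: 'aa' is found twice in 'aaaaa', not four times
pairs = len(re.findall(r'aa', 'aaaaa'))
print(pairs)
```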

### The `split` method

The `split` method splits the string at each match of the pattern, producing one more substring than there are matches, and returns a Series of lists of strings. To help motivate this example, we'll search for the first five movies that contain either a hyphen surrounded by a space on each side or a colon, and assign the result to a new variable.

In [None]:
filt = title.str.contains(r' - |:')
t = title[filt].head()
t

Calling the `split` method with the same pattern splits the string at every match, returning a list of the substrings.

In [None]:
t.str.split(r' - |:')

Series with lists as values are difficult to work with, so set `expand` to `True` to return a DataFrame. The number of columns will be equal to the largest number of substrings produced by any single value.

In [None]:
t.str.split(r' - |:', expand=True)
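
The relationship between matches and substrings can be seen on a single string with `re.split`. This is a sketch on a sample title and does not reproduce the DataFrame output of `expand=True`.

```python
import re

sample = 'Star Wars: Episode I - The Phantom Menace'  # sample title for illustration

pieces = re.split(r' - |:', sample)
print(pieces)

# A string with no match comes back as a single piece
unsplit = re.split(r' - |:', 'Avatar')
print(unsplit)
```

Two matches (the colon and the spaced hyphen) produce three pieces. The second piece keeps its leading space because the colon alternative does not consume the space that follows it.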

### The `replace` method

A `replace` method is available directly from the Series as well as from the `str` accessor. The Series `replace` method is more flexible and was first covered in the Essential Pandas parts, so it is the one used below. To use a regex with it, you must set the `regex` parameter to `True`. Here, we replace words that are between one and three characters long with a period.

In [None]:
title.replace(r'\b\w{1,3}\b', '.', regex=True).head()
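
The same substitution on a single string can be sketched with `re.sub`. The title is a sample, and the assumption is that the pandas method applies the pattern in the same way.

```python
import re

sample = 'The Lord of the Rings'  # sample title for illustration

# Words of one to three word characters are replaced by a period
shortened = re.sub(r'\b\w{1,3}\b', '.', sample)
print(shortened)
```

`'Lord'` and `'Rings'` survive because the word boundaries prevent the pattern from matching only part of a longer word.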

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Find all movies that either start with `'C'` or end with `'c'`.</span>

### Exercise 2

<span style="color:green; font-size:16px">Find all movies that are 6 or fewer characters in length or between 30 and 33 characters in length.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find all movies that have the word `'and'` followed by the word `'the'`.</span>

### Exercise 4

<span style="color:green; font-size:16px">Find all movies that have the word `'and'` followed by two words and then followed by the word `'the'`. In this exercise, words are defined as 1 or more consecutive "word" characters.</span>

### Exercise 5

<span style="color:green; font-size:16px">Find all movies that begin with `'The'` followed by the next word that begins with digits.</span>

### Exercise 6

<span style="color:green; font-size:16px">Find all movies that have three consecutive capital letters in them.</span>

### Exercise 7

<span style="color:green; font-size:16px">Find all movies that begin and end with a capital letter.</span>

### Exercise 8

<span style="color:green; font-size:16px">Find all the movies that have a digit followed by a comma followed by a digit.</span>

### Exercise 9

<span style="color:green; font-size:16px">Find all the movies that have either an ampersand or a question mark in them.</span>

### Exercise 10

<span style="color:green; font-size:16px">Which movie has the most ampersands, question marks, and periods in it?</span>

### Exercise 11

<span style="color:green; font-size:16px">Find all the movies with exactly three words with each word no more than 6 characters in length. For this exercise, a word is defined as consecutive non-space characters followed by exactly one space (or end of string).</span>

### Exercise 12

<span style="color:green; font-size:16px">Find all movies that have four consecutive non-word characters.</span>

### Exercise 13

<span style="color:green; font-size:16px">Find all movies that have at least one word that ends in `'ats'`.</span>

### Exercise 14

<span style="color:green; font-size:16px">Find all the movies containing, but not ending in `'tes'`.</span>

### Exercise 15

<span style="color:green; font-size:16px">Find all movies containing a word that is at least 7 lowercase letters in length.</span>

### Exercise 16

<span style="color:green; font-size:16px">Find all movies that have a word that contains, but does not start with `'Z'`.</span>

### Exercise 17

<span style="color:green; font-size:16px">Find all movies containing a word starting with `'W'` and ending with `'w'`.</span>

### Exercise 18

<span style="color:green; font-size:16px">Count the total number of digit characters between 7 and 9 in all of the movies.</span>

### Exercise 19

<span style="color:green; font-size:16px">Use the `count` method to count the number of words in each title. Use consecutive word characters as the definition of a word.</span>