# Matching Literals and Character Ranges

Arguably the two basic building blocks of regular expressions are literals and ranges. Literals are characters that have no function other than to represent themselves *literally*. For example, alphanumeric characters like "a" and "4" represent "a" and "4" in a regular expression. They do not have any special functions. Character ranges allow us to qualify multiple characters in a given position. We will also learn how to use the escaping backslash `\` to access other characters like whitespace. 

To streamline things a little bit, we are going to import the `fullmatch` function directly. 

In [None]:
from re import fullmatch 

## Literals

There is an array of characters that do not have any special function in regular expressions other than the literal characters themselves. The most common literals of course are the alphanumeric characters. 

```
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789
```

So if we provided a regular expression of "Anaconda" and compared it to the string "Anaconda", that of course would provide a match. 

In [None]:
fullmatch(pattern="Anaconda", string="Anaconda")

Notice that this did not return a `True` or `False` value, which you might have expected. It instead returns a `Match` object. If there is no match, `fullmatch()` returns `None`, as in this example showing the string "Python" does not match the regular expression "Anaconda". 

Since notebooks provide no output for `None`, we will reveal it using a `print()`. 

In [None]:
print(fullmatch(pattern="Anaconda", string="Python"))
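To make the `Match` object a little less mysterious, here is a small sketch that inspects one. The variable name `m` is just for illustration; `group()` and `span()` are standard methods on `Match` objects.

```python
from re import fullmatch

m = fullmatch(pattern="Anaconda", string="Anaconda")

# group() returns the text that was matched
matched_text = m.group()
print(matched_text)   # Anaconda

# span() returns the (start, end) positions of the match in the string
matched_span = m.span()
print(matched_span)   # (0, 8)
```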

If we wanted a simple `True` or `False`, we can write an `if` statement or simply check whether the output of `fullmatch()` is not `None`. We will take this approach for the remainder of the notebook. 

In [None]:
fullmatch(pattern="Anaconda", string="Python") != None
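If you find yourself repeating that comparison a lot, one option is to wrap it in a tiny helper. The name `is_match` below is hypothetical and not part of the `re` module. It uses `is not None`, which is the more idiomatic way to test for `None` in Python, although `!= None` gives the same result here.

```python
from re import fullmatch

def is_match(pattern: str, string: str) -> bool:
    """Hypothetical helper: True when the WHOLE string matches the pattern."""
    # fullmatch() returns a Match object or None, so test identity against None
    return fullmatch(pattern, string) is not None

print(is_match("Anaconda", "Anaconda"))  # True
print(is_match("Anaconda", "Python"))    # False
```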

### Metacharacters

There are a handful of characters that have special functionality in regular expressions, which we will learn about in this course. These metacharacters are the following: 

```
.[](){}\^$|*+?
```

To use the literal counterpart of one of these characters (suppose I wanted to match a dollar sign `$`), I have to precede it with an escaping backslash `\`. Watch what happens when I try to match the string "$151" with a regular expression "$151". 

In [None]:
fullmatch(pattern="$151", string="$151") != None

In [None]:
fullmatch(pattern="\$151", string="$151") != None

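The unescaped `$` fails because it is treated as a metacharacter rather than a literal dollar sign. The `"\$151"` version does match, but only because Python happens to leave unrecognized escapes like `\$` untouched, and recent Python versions emit a warning for it. Python has its own escaping with the backslash outside the regular expression. Incredibly annoying! Rather than having to backslash the backslash like this...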

In [None]:
fullmatch(pattern="\\$151", string="$151") != None

I highly recommend you just use a raw string in Python, which treats the backslash as a literal backslash and not an escape character to the Python compiler. 

In [None]:
fullmatch(pattern=r"\$151", string="$151") != None

That's much better! Precede the regular expression string with an `r` when using backslashes `\`, and that will prevent clashing with Python's treatment of strings. 
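As a quick sanity check, the sketch below confirms that the doubled-backslash string and the raw string are the same characters. It also shows `re.escape()`, a standard library function that adds the backslashes for you when the text you want to match is purely literal. The variable names are just for illustration.

```python
from re import escape, fullmatch

# "\\$151" (regular string) and r"\$151" (raw string) hold the same five characters
same_text = "\\$151" == r"\$151"
print(same_text)          # True
print(len(r"\$151"))      # 5

# re.escape() backslashes any metacharacters in a literal string
escaped = escape("$151")
print(escaped)            # \$151
print(fullmatch(pattern=escaped, string="$151") != None)  # True
```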

## Character Ranges

Matching literals is not that interesting. After all, we could use Python equality checks for that and not have to mess with those pesky escapes. But this is where things start to get interesting with regular expressions. 

A **character range** is a pair of square brackets containing a list of valid characters for a single position. For example, here is a regular expression that matches only characters `7`, `8`, `9`, `D`, and `J`.

```
[789DJ]
```

It will only match a string containing just one of those characters.

In [None]:
fullmatch(pattern="[789DJ]", string="7") != None

In [None]:
fullmatch(pattern="[789DJ]", string="A") != None

Let's say I wanted to match product codes that are 5 characters in length. The first, third, and fifth characters must be `T`, `A` and `B` respectively. However, the second position can be either `H` or `B`. The fourth position can be a `7` or a `9`. This might be tedious to do with substring operations, but it is simple with a regular expression. 

```
T[HB]A[79]B
```

In [None]:
fullmatch(pattern="T[HB]A[79]B", string="THA7B") != None

In [None]:
fullmatch(pattern="T[HB]A[79]B", string="THA2B") != None
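To see the pattern against several inputs at once, here is a small sketch that loops over a handful of made-up product codes. The candidate codes are invented for illustration only.

```python
from re import fullmatch

pattern = "T[HB]A[79]B"

# Made-up product codes: some valid, some broken in a single position
candidates = ["THA7B", "TBA9B", "THA2B", "TXA7B", "THA7"]

results = {code: fullmatch(pattern=pattern, string=code) != None for code in candidates}
for code, ok in results.items():
    print(code, ok)
# THA7B True
# TBA9B True
# THA2B False  (2 is not in [79])
# TXA7B False  (X is not in [HB])
# THA7 False   (missing the final B)
```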

Character ranges can also be expressed with a span of characters, like `[A-Z]` to match any uppercase alphabetic character. Below we match airport codes that are three uppercase letters. 

```
[A-Z][A-Z][A-Z]
```

In [None]:
fullmatch(pattern="[A-Z][A-Z][A-Z]", string="ABQ") != None

In [None]:
fullmatch(pattern="[A-Z][A-Z][A-Z]", string="DFW") != None

In [None]:
fullmatch(pattern="[A-Z][A-Z][A-Z]", string="JFK") != None

In [None]:
fullmatch(pattern="[A-Z][A-Z][A-Z]", string="9DK") != None

In [None]:
fullmatch(pattern="[A-Z][A-Z][A-Z]", string="KDAL") != None
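The "KDAL" case fails because `fullmatch()` requires the entire string to fit the pattern. As a contrast, the sketch below uses `re.match()`, which only requires the pattern to fit the beginning of the string. We do not use `match()` elsewhere in this notebook; it is shown here only to highlight what `fullmatch()` is doing.

```python
from re import fullmatch, match

pattern = "[A-Z][A-Z][A-Z]"

# fullmatch() must consume the whole string, so four letters is one too many
whole = fullmatch(pattern=pattern, string="KDAL")
print(whole)            # None

# match() is satisfied with a matching prefix and ignores the trailing "L"
prefix = match(pattern=pattern, string="KDAL")
print(prefix.group())   # KDA
```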

If coding `[A-Z]` three times feels repetitive, we will learn about quantifiers later. We can also use a span to specify digits. Below we match airline codes where the first character is an uppercase letter and the second character is a numeric digit. 

In [None]:
fullmatch(pattern="[A-Z][0-9]", string="F9") != None

In [None]:
fullmatch(pattern="[A-Z][0-9]", string="DL") != None

In [None]:
fullmatch(pattern="[A-Z][0-9]", string="WN") != None

We can also specify arbitrary ranges, including lowercase ones, such as `[g-j]` and `[4-7]`. 

In [None]:
fullmatch(pattern="[g-j][4-7]", string="i6") != None

In [None]:
fullmatch(pattern="[g-j][4-7]", string="c4") != None

In [None]:
fullmatch(pattern="[g-j][4-7]", string="j3") != None

We can also merge several ranges as being valid in one character range, like `[A-Za-z0-3]`.

In [None]:
fullmatch(pattern="[A-Za-z0-3]", string="j") != None

In [None]:
fullmatch(pattern="[A-Za-z0-3]", string="4") != None

In [None]:
fullmatch(pattern="[A-Za-z0-3]", string="2") != None

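You may also see `[A-z]` used to match uppercase and lowercase letters. It does match them, but be careful: the span follows character code order, so it also matches the six ASCII characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_`, and the backtick). Use `[A-Za-z]` when you want letters only. 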

In [None]:
fullmatch(pattern="[A-z]", string="d") != None

In [None]:
fullmatch(pattern="[A-z]", string="D") != None
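The sketch below probes `[A-z]` with the six ASCII characters that sit between `Z` and `a`, and compares it with the explicit `[A-Za-z]`. The variable names are just for illustration.

```python
from re import fullmatch

# The ASCII characters with codes 91 through 96, between "Z" (90) and "a" (97)
in_between = ["[", "\\", "]", "^", "_", "`"]

wide = [fullmatch(pattern="[A-z]", string=ch) != None for ch in in_between]
strict = [fullmatch(pattern="[A-Za-z]", string=ch) != None for ch in in_between]

print(wide)    # [True, True, True, True, True, True]
print(strict)  # [False, False, False, False, False, False]
```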

Many metacharacters can be put inside a character range, where they will be treated as literals. To treat a dash `-` literally, just make it the first character in the character range as shown below. We also throw in a dollar sign as a valid character. 

In [None]:
fullmatch(pattern="[-$A-Za-z][0-9]", string="-5") != None

In [None]:
fullmatch(pattern="[-$A-Za-z][0-9]", string="$5") != None

In [None]:
fullmatch(pattern="[-$A-Za-z][0-9]", string="V5") != None

Finally, you can negate a set of characters, matching anything BUT the specified characters, by starting with a caret `^`. To match anything but uppercase vowels, you would use the regex `[^AEIOU]`. 

In [None]:
fullmatch(pattern="[^AEIOU]", string="I") != None

In [None]:
fullmatch(pattern="[^AEIOU]", string="C") != None

If you happen to want to match a caret literally in the character range, just don't put it at the beginning of the range as shown below. 

In [None]:
fullmatch(pattern="[AEIOU^]", string="^") != None
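One thing worth checking is how broad a negated range really is: `[^AEIOU]` accepts any single character that is not in the list, including lowercase letters, digits, and spaces. The sketch below tries a few inputs, and also shows that escaping the caret with a backslash is another way to match it literally, even in the first position.

```python
from re import fullmatch

# A negated range matches ANY other single character, not only consonants
samples = ["C", "a", "7", " ", "E"]
negated = [fullmatch(pattern="[^AEIOU]", string=s) != None for s in samples]
print(negated)  # [True, True, True, True, False]

# An escaped caret is a literal caret, even at the start of the range
escaped_caret = fullmatch(pattern=r"[\^AEIOU]", string="^") != None
print(escaped_caret)  # True
```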

## Digit, Word, and Whitespace Characters

There are a handful of special characters that are enabled with a backslash `\` followed by a letter. 

|Pattern|Description|
|---|----|
|`\s` |Whitespace (space, newline, tab)|
|`\S` |Not whitespace|
|`\d` |Digit 0-9|
|`\D` |Not a digit 0-9|
|`\w` |Word character (alphas, digits, and underscore)|
|`\W` |Not a word character (anything other than alphas, digits, and underscore)|

I'm not exactly a fan of the last four, as I prefer to use character ranges which in my opinion are easier to read. I could match a letter followed by two digits using ranges like this. 

```
[A-Za-z][0-9][0-9]
```

In [None]:
fullmatch(pattern="[A-Za-z][0-9][0-9]", string="A15") != None

In [None]:
fullmatch(pattern="[A-Za-z][0-9][0-9]", string="115") != None

But you may see folks using `\w` for the letter and `\d` for the digits. Don't forget to use raw strings because we are using backslashes! 

```
\w\d\d
```

In [None]:
fullmatch(pattern=r"\w\d\d", string="A15") != None

In [None]:
fullmatch(pattern=r"\w\d\d", string="115") != None

Now strangely, you may notice that `\w` also matches digits and not just alphabetic letters. It also matches underscores. That's just how it works. `¯\_(ツ)_/¯`

In [None]:
fullmatch(pattern=r"\w\d\d", string="_15") != None

Like I said, I personally don't like using the `\d` and `\w` and their negated counterparts `\D` and `\W`. I prefer to use character ranges as I find them easier to read and interpret. But don't be surprised by these when you encounter them. 
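There is one more practical difference worth knowing about. For ordinary Python strings, `\d` and `\w` are Unicode-aware by default, so they accept more than the ASCII characters you might have in mind. The sketch below is only an illustration: it uses the Arabic-Indic digit three (`\u0663`) and an accented letter (`\u00e9`) as sample inputs, and the standard `re.ASCII` flag to restrict the shorthand back to ASCII.

```python
import re
from re import fullmatch

arabic_three = "\u0663"   # Arabic-Indic digit three
e_acute = "\u00e9"        # the letter e with an acute accent

# By default, \d and \w accept Unicode digits and letters
unicode_digit = fullmatch(pattern=r"\d", string=arabic_three) != None
unicode_word = fullmatch(pattern=r"\w", string=e_acute) != None
print(unicode_digit, unicode_word)   # True True

# An explicit range only accepts what you list
range_digit = fullmatch(pattern="[0-9]", string=arabic_three) != None
print(range_digit)                   # False

# The re.ASCII flag limits \d to 0-9
ascii_digit = fullmatch(pattern=r"\d", string=arabic_three, flags=re.ASCII) != None
print(ascii_digit)                   # False
```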

I find the `\s` and `\S` to be useful though. We can use them to match whitespace including spaces, tabs, and newlines. Below we match a lowercase letter and a digit separated by a space. 

In [None]:
fullmatch(pattern=r"[a-z]\s[0-9]", string="a 3") != None

In [None]:
fullmatch(pattern=r"[a-z]\s[0-9]", string="2 3") != None

You can also use a space to match a space. It just will not match tabs or newlines. 

In [None]:
fullmatch(pattern=r"[a-z] [0-9]", string="a 3") != None
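To compare the two approaches side by side, the sketch below tries a space, a tab, and a newline as the separator. The variable names are just for illustration.

```python
from re import fullmatch

separators = [" ", "\t", "\n"]

# \s accepts any of the three whitespace separators
with_s = [fullmatch(pattern=r"[a-z]\s[0-9]", string="a" + sep + "3") != None
          for sep in separators]

# A literal space accepts only the space
with_space = [fullmatch(pattern=r"[a-z] [0-9]", string="a" + sep + "3") != None
              for sep in separators]

print(with_s)      # [True, True, True]
print(with_space)  # [True, False, False]
```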

## Exercise 

An airport in the United States must have an ICAO code that starts with "K" and is typically followed by three uppercase letters. Create a regex that qualifies an airport code based on this convention by completing the code (replacing the question mark "?") below. 

In [None]:
fullmatch(pattern=?, string="KDFW") != None 

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

Use the pattern `K[A-Z][A-Z][A-Z]` to match the ICAO airport convention. 

In [None]:
fullmatch(pattern="K[A-Z][A-Z][A-Z]", string="KDFW") != None 
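As a final check, here is a sketch that runs the answer against a few more codes. The extra inputs are chosen only to exercise the pattern: a code that is too short, a non-US code, a lowercase code, and a code that is too long.

```python
from re import fullmatch

pattern = "K[A-Z][A-Z][A-Z]"
codes = ["KDFW", "KJFK", "KDF", "EGLL", "Kdfw", "KDFW1"]

checks = {code: fullmatch(pattern=pattern, string=code) != None for code in codes}
for code, ok in checks.items():
    print(code, ok)
# KDFW True
# KJFK True
# KDF False    (only three characters)
# EGLL False   (does not start with K)
# Kdfw False   (lowercase letters are not in [A-Z])
# KDFW1 False  (five characters)
```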