Regular expressions (regexes) are, when you understand them, one of the most fun things you can work with in programming. They are a mini-language for matching text.

The first thing to know is that non-special characters match themselves in text.

In [None]:
import re
help(re.search)

In [None]:
re.search(r"e", "hello")

In [None]:
re.search(r"l", "hello")

You can match more than one letter, of course.

In [None]:
sentence = ("A symmetry of a pattern is, loosely speaking, a way of transforming "
            "the pattern so that the pattern looks exactly the same after the "
            "transformation.")

In [None]:
re.search(r"pattern", sentence)

`re.search` returns a match object, which has many useful methods, but it only finds the first match.

`re.findall` gives us a list of all matches.

In [None]:
re.findall(r"pattern", sentence)

In [None]:
re.findall(r"at", sentence)

## Matching anything

The `.` (period) character matches any single character (except a newline). We can use this to find strings that match wildcards, like "a double-o followed by any character."

In [None]:
re.search(r"oo.", sentence)

See how the match is "oos".

In [None]:
re.findall(r".at.", sentence)

In [None]:
re.search(r"\.", sentence)
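
The "except a newline" caveat is easy to miss, so here is a minimal sketch of it. The sample `text` string is made up for this illustration, and `re.DOTALL` is a standard flag in the `re` module that this notebook does not otherwise cover.

```python
import re

# Hypothetical two-line string, used only for this illustration.
text = "first\nsecond"

# By default, "." will not cross the line break between "t" and "s".
default_match = re.search(r"t.s", text)
print(default_match)  # None

# With re.DOTALL, "." matches the newline as well.
dotall_match = re.search(r"t.s", text, flags=re.DOTALL)
print(repr(dotall_match.group()))  # 't\ns'
```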

In [None]:
# Case-insensitive matching
print(re.findall(r"h", "Hello there! How may I help you?"))
print(re.findall(r"h", "Hello there! How may I help you?", flags=re.IGNORECASE))

## What can I do with a match object?

In [None]:
match = re.search("pattern", sentence)
help(match)
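
The `help` output is long, so here is a short sketch of the match object methods you will probably reach for most often. The `text` string is a made-up sample, not the `sentence` defined earlier.

```python
import re

# Hypothetical sample text, chosen only for this illustration.
text = "the pattern looks the same"
match = re.search(r"pattern", text)

print(match.group())  # the matched text: 'pattern'
print(match.start())  # index where the match begins: 4
print(match.end())    # index just past the end of the match: 11
print(match.span())   # (start, end) as a tuple: (4, 11)

# The span can be used to slice the original string.
print(text[match.start():match.end()])  # 'pattern'
```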

## Start and end matches

You often want to match something if and only if it is at the beginning or end of a string.

`^` matches the beginning of a string.

`$` matches the end of a string.

In [None]:
re.search(r"^A ", sentence)

In [None]:
print(re.search(r"^pattern", sentence))

If I want to match the end of this string, I have to match a period.

In [None]:
re.search(r"n.$", sentence)

In [None]:
re.search(r"n.$", "I like singing")

What happened here? `.` matches anything, so I have to _escape_ it to match just a period.

In [None]:
re.search(r"n\.$", sentence)

In [None]:
re.search(r"n\.$", "I like singing")
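
If the text you want to match literally comes from somewhere else, escaping it by hand gets tedious. As a sketch, the standard `re.escape` function adds the backslashes for you. The `literal` and `pattern` names are just for this example.

```python
import re

literal = "n."                      # text we want to match exactly
pattern = re.escape(literal) + "$"  # re.escape backslash-escapes the period
print(pattern)                      # n\.$

print(re.search(pattern, "I like singing"))       # None
print(re.search(pattern, "the transformation."))  # matches 'n.' at the end
```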

## Matching multiples

Often, you want to match something a certain number of times. Whether it's 0 or more, 1 or more, 0 or 1, or something else, we've got you covered.

* `*` matches 0 or more of the preceding item.
* `+` matches 1 or more.
* `?` matches 0 or 1.
* `{n}` matches exactly `n` repetitions.
* `{m,n}` matches `m` to `n` repetitions. You can leave out `m` to match 0 to `n`, or leave out `n` to match `m` or more.

In [None]:
re.findall(r"o+", sentence)

In [None]:
re.findall(r"ng? ", sentence)

In [None]:
no_a = "b"
one_a = "ab"
lots_of_a = "aaaaaaaaaaaab"

In [None]:
print(re.search(r"a*b", no_a))
print(re.search(r"a*b", one_a))
print(re.search(r"a*b", lots_of_a))

In [None]:
print(re.search(r"a+b", no_a))
print(re.search(r"a+b", one_a))
print(re.search(r"a+b", lots_of_a))

In [None]:
print(re.search("a?b", no_a))
print(re.search("a?b", one_a))
print(re.search("a?b", lots_of_a))

In [None]:
print(re.search("a{2}b", no_a))
print(re.search("a{2}b", one_a))
print(re.search("a{2}b", lots_of_a))

In [None]:
print(re.search("a{1,2}b", no_a))
print(re.search("a{1,2}b", one_a))
print(re.search("a{1,2}b", lots_of_a))

In [None]:
print(re.search("a{1,}b", no_a))
print(re.search("a{1,}b", one_a))
print(re.search("a{1,}b", lots_of_a))

In [None]:
print(re.search("a{,2}b", no_a))
print(re.search("a{,2}b", one_a))
print(re.search("a{,2}b", lots_of_a))

In [None]:
re.findall(r"a?b", "ababb")

In [None]:
# Find 2 consecutive runs of "one or more a's followed by a b"
re.search(r"(a+b){2}", "abaaaabaab")

## Matching sets of things

All of the above is good, but not that useful by itself. Being able to match any one of a set of characters is super-useful.

We use square brackets to do this.

* `[abz]` will match an a, b, or z.
* `[A-Z]` matches a range of letters from A to Z.
* `[^A-Z]` matches anything that _isn't_ A to Z.

In [None]:
# Get words three to five letters long
re.findall(r" [A-Za-z]{3,5} ", sentence)

In [None]:
# Find the first number in a string
re.search(r"[0-9]+", "I ate 130 ghost peppers")

In [None]:
# Find all punctuation
re.findall(r"[\.,;?!]", sentence)

In [None]:
# or
re.findall(r"[^A-Za-z0-9 ]", sentence)

In [None]:
# Find a phone number
re.search(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", "My phone number is 919-555-1212.")

## Character classes

That last match was pretty wordy. Luckily, we have something called _character classes_ for commonly used groups of characters.

* `\d` matches digits.
* `\D` matches _non_-digits.
* `\w` matches "word characters": basically `[a-zA-Z0-9_]`, plus all other valid Unicode characters that can be in words.
* `\W` matches _non_-word-characters.
* `\s` matches space characters -- `[ \t\n\r\f\v]`.
* `\S` matches non-space characters.

In [None]:
# Find a phone number
re.search(r"\d{3}-\d{3}-\d{4}", "My phone number is 919-555-1212.")

In [None]:
# Find all punctuation
re.findall(r"[^\w\s]", sentence)

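There are a few odder ones, which match positions rather than characters:

* `\A` matches the beginning of the string. This is a lot like `^`, but with the `re.MULTILINE` flag `^` also matches at the start of each line, while `\A` never does.
* `\Z` matches the end of the string. This is a lot like `$`, but with the `re.MULTILINE` flag `$` also matches at the end of each line, while `\Z` never does.
* `\b` matches a word boundary. This means it matches an empty string at the beginning or end of a word.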
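
The difference between `^`/`$` and `\A`/`\Z` only shows up on multi-line strings with the `re.MULTILINE` flag turned on. Here is a small sketch of that, using a made-up two-line `text`.

```python
import re

# Hypothetical two-line string, used only for this illustration.
text = "first line\nsecond line"

# Without re.MULTILINE, ^ only matches at the very start of the string.
print(re.findall(r"^\w+", text))                       # ['first']

# With re.MULTILINE, ^ also matches at the start of each line...
print(re.findall(r"^\w+", text, flags=re.MULTILINE))   # ['first', 'second']

# ...but \A still only matches at the start of the whole string.
print(re.findall(r"\A\w+", text, flags=re.MULTILINE))  # ['first']

# The same goes for $ and \Z at the end.
print(re.findall(r"\w+$", text, flags=re.MULTILINE))   # ['line', 'line']
print(re.findall(r"\w+\Z", text, flags=re.MULTILINE))  # ['line']
```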

In [None]:
# Get words three to five letters long
re.findall(r"\b\w{3,5}\b", sentence)

In [None]:
# Pick out email addresses
possible_emails = ["clinton", "clinton@dreisbach.us", "beanguy@example.org", 
                   "Email help@example.org for more information",
                   "terry@example.org", "@carmen", "what@what", "hi@example.org"]
[possibility 
 for possibility in possible_emails 
 if re.search(r"\A\w+@\w+\.\w{2,3}\Z", possibility)]

Note that a realistic regex for email addresses is more complex than this. It's not that hard, though:

```
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
```
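
As a sketch, this is one way to use that longer pattern from Python. The pattern only lists lowercase letters, so passing `re.IGNORECASE` is an assumption on my part, as are the `EMAIL_RE` name and the sample `candidates` list.

```python
import re

# The two lines of the pattern above, joined into one compiled regex.
# re.IGNORECASE is an assumption: the pattern itself only lists a-z.
EMAIL_RE = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@"
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
    re.IGNORECASE,
)

# Hypothetical inputs for illustration.
candidates = ["clinton@dreisbach.us", "what@what", "@carmen", "Hi@Example.org"]

# fullmatch requires the whole string to be an email address.
valid = [c for c in candidates if EMAIL_RE.fullmatch(c)]
print(valid)  # ['clinton@dreisbach.us', 'Hi@Example.org']
```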

## Capturing matches

We often want to capture part of a match for later use. You can use parentheses to mark part of your regex as something you will capture.

In [None]:
# city and state
possibilities = ["Decatur, GA", "Wilkesboro, NC", "Seattle", "Wichita Falls, TX", "DC"]
for possibility in possibilities:
    match = re.search(r"^([\w\s]+), ([A-Z]{2})", possibility)
    if match:
        city, state = match.groups()
        print("City:", city, "| State:", state)

In [None]:
# Re-format phone numbers for later
phone_nums = ["999-555-1212", "(703) 555-9999", "800.555.7341", "3145558286"]
cleaned = []
for num in phone_nums:
    match = re.search(r"\(?(\d{3})\)?[-.]?\s*(\d{3})[\-\.]?(\d{4})", num)
    cleaned.append("{}-{}-{}".format(*match.groups()))
print(cleaned)
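
Capturing groups also change what `re.findall` gives back: when the pattern has more than one group, each result is a tuple of the captured pieces instead of the whole match. A minimal sketch, with a made-up `text` string and a simplified phone pattern:

```python
import re

# Hypothetical sample text for illustration.
text = "Call 999-555-1212 or 800.555.7341."

# Three capturing groups, so findall returns a list of 3-tuples.
numbers = re.findall(r"(\d{3})[-.](\d{3})[-.](\d{4})", text)
print(numbers)  # [('999', '555', '1212'), ('800', '555', '7341')]

# Each tuple can be unpacked just like match.groups().
for area_code, prefix, suffix in numbers:
    print("{}-{}-{}".format(area_code, prefix, suffix))
```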

## Non-capturing group

Use `(?:...)` to group part of a regex without capturing it.

In [None]:
phone_num_with_possible_area_code = r"(?:\(?(\d{3})\)?[-.]?\s*)?(\d{3})[\-\.]?(\d{4})"
phone_nums = ["999-555-1212", "(703) 555-9999", "800.555.7341", "3145558286", "555-1212"]
cleaned = []
for num in phone_nums:
    match = re.search(phone_num_with_possible_area_code, num)
    cleaned.append("{}-{}-{}".format(*match.groups()))
print(cleaned)

In [None]:
re.search(r"(?:ab)+", "ccccababababcccccab")
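
To see what "not capturing" changes in practice, here is a short sketch that runs the same pattern with and without a capturing group. The `text` string is a made-up sample.

```python
import re

# Hypothetical sample text for illustration.
text = "ccabababccab"

capturing = re.search(r"(ab)+", text)
non_capturing = re.search(r"(?:ab)+", text)

# Both match the same text...
print(capturing.group(0))      # 'ababab'
print(non_capturing.group(0))  # 'ababab'

# ...but only the capturing version records a group.
print(capturing.groups())      # ('ab',)
print(non_capturing.groups())  # ()

# findall returns the group if there is one, and the whole match if not.
print(re.findall(r"(ab)+", text))    # ['ab', 'ab']
print(re.findall(r"(?:ab)+", text))  # ['ababab', 'ab']
```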

## Scratching the surface

This is just the beginning with regular expressions. You can go really deep down this hole.

* [Python regex docs](https://docs.python.org/3/library/re.html)
* [Regexr](http://www.regexr.com/)
* [Regex One](http://regexone.com/)
* [Regular-Expressions.info](http://www.regular-expressions.info/)


In [None]:
# Pick out email addresses
possible_emails = ["clinton", "clinton@dreisbach.us", "beanguy@example.org", 
                   "Email help@example.org for more information",
                   "terry@example.org", "@carmen", "what@what", "hi@example.org"]
emails = []
for possibility in possible_emails:
    match = re.search(r"\w+@\w+\.\w{2,3}", possibility)
    if match:
        emails.append(match.group(0))
emails

In [None]:
for match in re.finditer(r"a*b", "ccccabaabcccaaaaaababccb"):
    print(match)

In [None]:
phone_nums = """456-111-4567
(919) 444-9721
(123) 456 7890
313.424.5353
1-800-987-2345
+1 (424) 979-3333
555-1212"""

phone_nums = phone_nums.split("\n")
phone_nums

In [None]:
phone_num_regex = r"(?:\(?(\d{3})\)?[\-\.]?\s*)?(\d{3})[\-\.]?\s*(\d{4})"

In [None]:
default_area_code = "919"
for num in phone_nums:
    match = re.search(phone_num_regex, num)
    if match:
        area_code, prefix, suffix = match.groups()
        if area_code is None:
            area_code = default_area_code
        print("{}\t{}-{}-{}".format(num, area_code, prefix, suffix))

In [None]:
date_str = """9/4/1976
09/30/77
20111103
Nov 30, 2014
5 Oct 1995
1999-10-04"""

dates = date_str.split("\n")
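
The date patterns in the next cell use named groups, written `(?P<name>...)`, which have not come up yet. As a minimal sketch, a named group works like a normal capturing group, but you can also look it up by name. The sample date is one of the strings above.

```python
import re

match = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "1999-10-04")

# Look up a single group by name...
print(match.group("year"))  # '1999'

# ...or get every named group at once as a dictionary.
print(match.groupdict())    # {'year': '1999', 'month': '10', 'day': '04'}

# Named groups are still numbered, so groups() works too.
print(match.groups())       # ('1999', '10', '04')
```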

In [None]:
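def extract_date(date_str):
    """Return a match with year, month, and day groups, or None if no format fits."""
    date_regex = [r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4}|\d{2})",  # 9/4/1976, 09/30/77
                  r"(?P<year>\d{4})-?(?P<month>\d{2})-?(?P<day>\d{2})",  # 20111103, 1999-10-04
                  r"(?P<day>\d{1,2})\s*(?P<month>[A-Za-z]{3})\s*(?P<year>\d{4})",  # 5 Oct 1995
                  r"(?P<month>[A-Za-z]{3})\s*(?P<day>\d{1,2})\s*,?\s*(?P<year>\d{4})"]  # Nov 30, 2014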
    
    for regex in date_regex:
        match = re.match(regex, date_str)
        if match:
            return match
        
def clean_date(year, month, day):
    months = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
              "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

    try:
        month = int(month)
    except ValueError:
        month = months[month]
    day = int(day)
    year = int(year)
    # Expand two-digit years: 00-14 become 2000-2014, 15-99 become 1915-1999.
    if year < 15:
        year += 2000
    elif year < 100:
        year += 1900
    
    return {"year": year, "month": month, "day": day}
        

for date in dates:
    match = extract_date(date)
    if match:
        ddict = match.groupdict()
        ddict = clean_date(**ddict)
        ddict['orig'] = date
            
        print("{orig}\t{month:02d}/{day:02d}/{year:d}".format(**ddict))