# Regex 1

## Reading

- New text: "Principles and Techniques of Data Science", by Sam Lau, Joey Gonzalez, and Deb Nolan
- Used for Berkeley's DS100 Course.
- Read Chapter 13: https://www.textbook.ds100.org/ch/13/text_regex.html

In [None]:
# import statements
import re

In [None]:
# Example strings
# from DS100 book...
def reg(regex, text):
    """
    Prints the string with the regex match highlighted.
    """
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))
    
s1 = " ".join(["A DAG is a directed graph without cycles.",
               "A tree is a DAG where every node has one parent (except the root, which has none).",
               "To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\\_(ツ)_/¯"])
print(s1)

s2 = """1-608-123-4567
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
608-123-4567
123-4567
1-123-4567 (not a phone number)
"""
print(s2)

s3 = "In CS 320, there are 8 quizzes, 7 projects, 38 lectures, and 1000 things to learn.  CS 320 is awesome!"
print(s3)

s4 = """In CS 320,  there are 8 quizzes,    7 projects,
38 lectures, and 1000 things to learn.  CS 320 is awesome!"""
print(s4)

In [None]:
print(s1)

### double escaping (use case for raw strings)

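- The regex engine applies its own layer of escape processing (special sequences like `\t`, `\n`, etc.) after Python has already processed the string literal, so a literal backslash has to be escaped twice.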

#### Find the right arm "\".

- `reg(<PATTERN>, <STRING>)`

In [None]:
# Python will be unhappy:
# in a normal string literal, \ starts an escape sequence, so \" is read as
# a literal " character and the string is never closed
# reg("\", s1) # uncomment to see error

In [None]:
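# Regex will be unhappy:
# Python turns "\\" into a single \, which the regex engine treats as
# the start of an escape, not as a literal backslash
# reg("\\", s1) # uncomment to see error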

In [None]:
# Correct but cumbersome way to do this
reg("\\\\", s1)

In [None]:
# A better way is to use a raw string to avoid double escaping
reg(r"\\", s1)
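A minimal sketch of why the two spellings above are equivalent. The `sample` string is a made-up stand-in, not one of the lecture strings.

```python
import re

# Two spellings of the same regex for one literal backslash.
plain = "\\\\"   # Python collapses each \\ to \, leaving the two characters \\
raw = r"\\"      # a raw string keeps the two characters exactly as typed

sample = "left\\right"  # hypothetical string containing one backslash

print(len(plain), len(raw))     # both patterns are two characters long
print(plain == raw)             # same pattern, different spelling
print(re.findall(raw, sample))  # a list holding the single backslash
```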

### Regex is case sensitive

#### Find all occurrences of "a".

In [None]:
reg(r"a", s1)

#### Find all occurrences of "A".

In [None]:
reg(r"A", s1)

### Character classes

- Character classes are written inside `[...]`
- `^` as the first character inside the brackets means `NOT` (match anything that is not listed)
- `-` lets us specify a range of characters, for example `[A-Z]`
- `|` (used outside the brackets) lets us perform `OR` between alternatives
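The rules above can be sketched with `re.findall` on short made-up strings (the `text` value is an assumption for illustration, not one of the lecture strings):

```python
import re

text = "A DAG has A-Z labels"  # hypothetical sample string

either_a = re.findall(r"[aA]", text)       # any one of the listed characters
capitals = re.findall(r"[A-Z]", text)      # a range of characters
a_z_dash = re.findall(r"[AZ-]", text)      # a trailing - is a literal dash, not a range
not_vowel = re.findall(r"[^aeiou]", "tree")  # ^ at the start negates the class
words = re.findall(r"DAG|labels", text)    # | sits outside the brackets

print(either_a)
print(capitals)
print(a_z_dash)
print(not_vowel)
print(words)
```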

#### Find both "a" and "A".

In [None]:
# Doesn't work: this looks for the literal two-character sequence "aA"
reg("aA", s1)

In [None]:
reg("???", s1)

#### Find all the vowels.

In [None]:
reg("???", s1)

#### Find everything except vowels.

In [None]:
reg("???", s1)

#### Find all capital letters.

In [None]:
reg("???", s1)

#### What if we want to find "A", "Z", and "-"?

In [None]:
# How can we change this to do that?
reg(r"???", s1)

#### Invalid ranges don't work. For example: `[Z-A]`.

In [None]:
# reg("[Z-A]", s1) # uncomment to see error

#### Find all words related to graphs.

In [None]:
# | means OR
reg(r"tree|directed|undirected|graph|DAG|node|child|parent|root|cycles", s1)

### Metacharacters

- predefined character classes
    - `\d` => digits
    - `\s` => whitespace (space, tab, newline)
    - `\w` => "word" characters (letters, digits, and underscores); helpful for variable name matches and whole word matches, since it doesn't match whitespace (`\s`)
    - `.` => wildcard: anything except newline
- the capitalized version of a predefined character class means `NOT`, for example `\D` => everything except digits
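As a rough illustration of the predefined classes, here is a sketch on made-up strings (the `text` value is hypothetical). It also shows that a literal dot has to be escaped as `\.`.

```python
import re

text = "Room 320\tat 9.30"  # hypothetical sample: contains a tab and a dot

digits = re.findall(r"\d", text)         # single digits
spaces = re.findall(r"\s", text)         # space, tab, space
words = re.findall(r"\w+", text)         # runs of word characters
non_digits = re.findall(r"\D", "a1b2")   # capitalized class = NOT digits
wildcard = re.findall(r".", "a\nb")      # . skips the newline
literal_dot = re.findall(r"\.", text)    # escaped dot matches only "."

print(digits, spaces, words, non_digits, wildcard, literal_dot)
```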

#### Find all digits.

In [None]:
# v1
reg(r"???", s1)

In [None]:
# v2 - with metacharacters
reg(r"???", s1)

#### Find all whitespaces.

In [None]:
reg(r"???", s1)

#### Find everything except whitespaces.

In [None]:
reg(r"???", s1)

#### Find anything except newline.

In [None]:
reg(r"???", s1)

#### What if we want to actually match "."?

In [None]:
# How can we change this to do that?
reg(r"???", s1)

### Repetition

- `<character>{<num matches>}` - for example: `w{3}`
- matches cannot overlap
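A small sketch of fixed-count repetition and the no-overlap rule, using made-up input strings:

```python
import re

# w{3} means exactly three consecutive "w" characters.
in_url = re.findall(r"w{3}", "www.example.com")

# Four w's contain two overlapping "www" windows, but only one is reported:
# the search resumes after the end of the previous match.
four_ws = re.findall(r"w{3}", "wwww")

# Seven w's give two matches, with one "w" left over.
seven_ws = re.findall(r"w{3}", "wwwwwww")

print(in_url, four_ws, seven_ws)
```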

#### Find all "www".

In [None]:
# v1
reg(r"www", s1)

In [None]:
# v2 - repetition
reg(r"???", s1)

In [None]:
# Lesson: matches cannot overlap
reg(r"???", s1) 

### Variable length repetition operators

- `*` => 0 or more (greedy: match as many characters as possible)
- `+` => 1 or more (greedy: match as many characters as possible)
- `?` => 0 or 1
- `*?` => 0 or more (non-greedy: match as few characters as possible)
- `+?` => 1 or more (non-greedy: match as few characters as possible)
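The difference between greedy and non-greedy operators is easiest to see side by side. This is a sketch on made-up strings, with the parentheses escaped so they are matched literally:

```python
import re

text = "(one) and (two)"  # hypothetical sample string

greedy = re.findall(r"\(.*\)", text)    # * runs to the LAST closing parenthesis
lazy = re.findall(r"\(.*?\)", text)     # *? stops at the FIRST closing parenthesis
optional = re.findall(r"colou?r", "color colour")  # ? makes the "u" optional
plus_greedy = re.findall(r"\d+", "a1b22")   # + takes as many digits as possible
plus_lazy = re.findall(r"\d+?", "a1b22")    # +? takes as few digits as possible

print(greedy)
print(lazy)
print(optional, plus_greedy, plus_lazy)
```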

#### Find everything inside of parentheses.

In [None]:
# this doesn't work
# it captures everything because () have special meaning (coming up)
reg(r"???", s1)

In [None]:
# How can we change this to not use special meaning of ()?
# * is greedy: match as many characters as possible
reg(r"???", s1)

In [None]:
# non-greedy: stop at the first possible spot instead of the last possible spot
reg(r"???", s1)

### Anchor characters
- `^` => start of string
    - `^` is overloaded --- what was the other usage?
- `$` => end of string
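A sketch of both anchors on a made-up three-sentence string; the last line shows the other meaning of `^`, inside brackets:

```python
import re

text = "First one. Second one. Third one."  # hypothetical sample string

# Without an anchor, the non-greedy pattern matches every sentence in turn.
every_sentence = re.findall(r".*?\.", text)

# ^ pins the match to the start of the string, so only the first sentence qualifies.
first_sentence = re.findall(r"^.*?\.", text)

# $ pins the match to the end of the string.
last_word = re.findall(r"\w+\.$", text)

# Inside [...], ^ means negation instead of "start of string".
not_a = re.findall(r"[^a]", "aba")

print(every_sentence)
print(first_sentence, last_word, not_a)
```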

#### Find everything in the first sentence.

In [None]:
# doesn't work: remember that regex finds all (non-overlapping) matches,
# so it matches every single sentence
# (even though we are doing a non-greedy match)
reg(r"???", s1)

In [None]:
reg(r"???", s1)

#### Find everything in the first two sentences.

In [None]:
reg(r"???", s1)

#### Find last "word" in the sentence.

In [None]:
reg(r"???", s1)

### Case study: find all phone numbers.

In [None]:
print(s2)
# The country code (1) in the front is optional
# The area code (608) is also optional
# Doesn't make sense to match country code without area code though!

In [None]:
# Full US phone numbers
reg(r"???", s2)

In [None]:
# The country code (1) in the front is optional
reg(r"???", s2)

In [None]:
# The area code (608) is also optional
# Doesn't make sense to have country code without area code though!
reg(r"???", s2)

In [None]:
# This is good enough for 320 quizzes/tests
# But clearly, the last match is not correct
reg(r"???", s2)

Regex documentation link: https://docs.python.org/3/library/re.html.

In [None]:
# BONUS: negative lookbehind (I won't test this)
reg(r"???", s2)

There is also a negative lookahead. For example, it can be used to avoid matching "608-123-4569" inside "608-123-4569999", which the pattern below still does. You can explore this if you are interested.

In [None]:
reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", "608-123-4569999")
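For readers who want to explore the lookahead, here is a minimal sketch. It assumes a simplified pattern (full 10-digit numbers only, without the optional groups used above), and `(?!\d)` means "not followed by another digit".

```python
import re

# Simplified, assumed pattern: area code, exchange, and line number,
# rejected if a further digit follows.
pattern = r"\d{3}-\d{3}-\d{4}(?!\d)"

too_long = re.findall(pattern, "608-123-4569999")  # extra trailing digits
valid = re.findall(pattern, "call 608-123-4567 now")

print(too_long)
print(valid)
```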

### Testing your regex
- you could use the `reg(...)` function
- another useful resource: https://regex101.com/

### `re` module
- `re.findall(<PATTERN>, <SEARCH STRING>)`: regular expression matches
    - returns a list of strings 
- `re.sub(<PATTERN>, <REPLACEMENT>, <SEARCH STRING>)`: regular expression match + substitution
    - returns a new string with the substitutions (remember strings are immutable)
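A quick sketch of both functions on a made-up string, showing that `re.sub` leaves the original string untouched:

```python
import re

text = "8 quizzes and 7 projects"  # hypothetical sample string

found = re.findall(r"\d", text)     # list of matched strings
masked = re.sub(r"\d", "#", text)   # new string with each match replaced

print(found)
print(masked)
print(text)  # unchanged: strings are immutable
```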

In [None]:
print(s3)

#### Find all numbers.

In [None]:
re.findall(r"\d+", s3)

### Groups
- we can capture matches using `()` => this is the special meaning of `()`
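- with groups, `re.findall` returns only the captured text: a list of strings for one group, or a list of tuples for two or more groups, where the length of each tuple is the number of groups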
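How the return value of `re.findall` changes with the number of groups can be sketched on a made-up string:

```python
import re

text = "8 quizzes, 7 projects"  # hypothetical sample string

no_groups = re.findall(r"\d+ \w+", text)       # whole matches
one_group = re.findall(r"(\d+) \w+", text)     # only the captured part
two_groups = re.findall(r"(\d+) (\w+)", text)  # one tuple per match

print(no_groups)
print(one_group)
print(two_groups)
```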

#### Find all numbers and the word that comes after each.

In [None]:
re.findall(r"(\d+) (\w+)", s3)

### Unlike matches, groups can overlap

#### Find and group all numbers and the word that comes after each.

In [None]:
re.findall(r"((\d+) (\w+))", s3)

#### Substitute every number with "###".

In [None]:
re.sub(r"\d+", "###", s3)

#### Replace each run of whitespace with a single space.

In [None]:
print(s4)

In [None]:
re.sub(r"\s+", " ", s4)

### How to use groups in substitution?
- in the replacement string, `\g<N>` gives you the text captured by the N'th group.
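A short sketch of group references in a replacement, on a made-up string. The second call shows one reason to prefer `\g<1>` over the shorter `\1`: it stays unambiguous when a digit follows the reference.

```python
import re

text = "7 projects"  # hypothetical sample string

# Swap the two captured pieces around.
swapped = re.sub(r"(\d+) (\w+)", r"\g<2>: \g<1>", text)

# Append a literal 0 right after group 1; writing \10 would mean group 10.
times_ten = re.sub(r"(\d+)", r"\g<1>0", text)

print(swapped)
print(times_ten)
```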

#### Wrap every number in `<b>` tags.

In [None]:
print(re.sub(r"(\d+)", r"<b>\g<1></b>", s3))

In CS <b>320</b>, there are <b>8</b> quizzes, <b>7</b> projects, <b>38</b> lectures, and <b>1000</b> things to learn.  CS <b>320</b> is awesome!