In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)


# Lecture 11 - Working with Text

## DSC 80, Fall 2022

## Today, in DSC 80...

- Much of the data we get comes in the form of *text*.
- We'll review Python's string methods and see their limitations.
- We'll learn a powerful (but arcane) method for pattern-matching: **regular expressions**.

## Basic string methods

### String methods

- Python's `str` type comes with a bunch of useful methods attached:

In [None]:
def print_methods(cls):
    print('\n'.join(m for m in dir(cls) if not m.startswith('_')))

In [None]:
print_methods(str)

### Most used string methods

- Some of the most commonly-used string methods are:
    - `.split()`
    - `.strip()` (along with `.lstrip()` and `.rstrip()`)
    - `.startswith()` and `.endswith()`
    - `.replace()`
    - `.lower()` and `.upper()`
    - `.join()`

In [None]:
'break this into words, pls'.split()

In [None]:
'break\nthis\ninto\nlines and words'.split()

In [None]:
'     this is surrounded by spaces, and I do not like that    '.strip()

In [None]:
'WHY AM I SHOUTING?'.lower()

In [None]:
'-- this starts with double dashes'.startswith('--')

In [None]:
'the answer to everything is 42'.replace('42', '(REDACTED)')

In [None]:
' '.join(['these', 'are', 'words', 'in', 'a', 'sentence'])

In [None]:
print('\n'.join(['this', 'is', 'how', 'to', 'print', 'lines']))

In [None]:
import itertools
s = 'this is the spongebob meme'
memify = itertools.cycle([str.upper, str.lower])
''.join(f(l) for l, f in zip(s, memify))

### `pd.Series` string methods

- pandas' `Series` provides many of the same string methods as Python's `str` class, through the `.str` accessor.
- These methods work elementwise.

In [None]:
print_methods(pd.Series.str)

### String methods are pretty powerful

- We can do *much* of our string processing with Python's string methods

### Example 1: Cleaning up text files

- **Goal**: get a set containing all of the words in the text file (without attached punctuation)

In [None]:
!head ./data/lorem.txt -n 3

In [None]:
with open('./data/lorem.txt') as fh:
    words = set(w.lower().strip('.,?') for w in fh.read().split())

In [None]:
words

### Example 2: *Canonicalization*

Consider the following two DataFrames (see [this presentation](https://docs.google.com/presentation/d/1xQsqa7e3xDZ9nBiekbSBOecwvQm8pSVGa-FBoV6aJ7E/edit#slide=id.g11197671c7e_0_813) for inspiration).

In [1]:
import os
codes = pd.read_csv(os.path.join('data', 'codes.csv'))
programs = pd.read_csv(os.path.join('data', 'programs.csv'))

display(codes)
display(programs)

What would happen if we try to merge the two DataFrames on `'department'`?

In [None]:
codes.merge(programs, on='department')

### String canonicalization

- One solution is to **canonicalize** both `'department'` columns, so that there is just a single way to format each department's name **in both DataFrames**. 
- We can do this by implementing a `canonicalize_department` function, which takes in a department's name as a string and reformats it.
- `canonicalize_department` should:
    - Fix cases (upper vs. lower).
    - Standardize variants of words – e.g. `'eng.'` vs `'engineering'`.
    - Fix punctuation – e.g. `'&'` vs. `'and'`.

In [None]:
display(codes)
display(programs)

In [None]:
def canonicalize_department(d):
    return (d
           .lower()
           .replace('sci.', 'science')
           .replace('stud.', 'studies')
           .replace('eng.', 'engineering')
           .replace('&', 'and')
           .replace('(', '- ')
           .replace(')', '')
           )

In [None]:
codes['department_clean'] = codes['department'].apply(canonicalize_department)
programs['department_clean'] = programs['department'].apply(canonicalize_department)

display(codes)
display(programs)

Now, we can join `codes` with `programs` on `'department_clean'`.

In [None]:
codes.merge(programs, on='department_clean')

### Reflection

The process of **string canonicalization** is very brittle. 
- `canonicalize_department` was hyper-specific to the four department names we had access to. 
- We don't know if it'll work for other departments.

### Example 3: Get all the phone numbers

- **Goal**: extract all phone numbers from a piece of text
- Assume they are of the form `(###) ###-####`
- How do we do this with string methods?
- Strategy:
    - Split by spaces.
    - Find "words" that look like area codes: `(330)`.
    - Check if the following "word" looks like `867-5309` 

In [None]:
contact = '''
Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.
'''

In [None]:
def is_possibly_area_code(s):
    return (
        len(s) == 5
        and
        s.startswith('(') and s.endswith(')')
        and
        s[1:4].isnumeric()
    )

In [None]:
is_possibly_area_code('(123)')

In [None]:
is_possibly_area_code('(99)')

In [None]:
def is_phone_number(s):
    """T/F: does `s` look like 867-5309?"""
    return (
        len(s) == 8
        and
        s[0:3].isnumeric()
        and
        s[3] == '-'
        and
        s[4:].isnumeric()
    )

In [None]:
is_phone_number('867-5309')

In [None]:
# remove punctuation
pieces = [s.rstrip('.,?;"\'') for s in contact.split()]

for i, piece in enumerate(pieces):
    if is_possibly_area_code(piece):
        if is_phone_number(pieces[i+1]):
            print(piece, pieces[i+1])

- The above could result in an `IndexError`: if the last piece looks like an area code, `pieces[i+1]` does not exist.
- This is better (`itertools.pairwise` only exists in Python 3.10+):

```python
for first, second in itertools.pairwise(pieces):
    if is_possibly_area_code(first) and is_phone_number(second):
        print(first, second)
```
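
On Python versions older than 3.10, `zip(pieces, pieces[1:])` yields the same adjacent pairs as `itertools.pairwise`. Below is a minimal sketch that restates the two helper functions so it runs on its own; `find_phone_numbers` and `sample` are names made up for this illustration.

```python
def is_possibly_area_code(s):
    """T/F: does `s` look like (800)?"""
    return len(s) == 5 and s.startswith('(') and s.endswith(')') and s[1:4].isnumeric()

def is_phone_number(s):
    """T/F: does `s` look like 867-5309?"""
    return len(s) == 8 and s[:3].isnumeric() and s[3] == '-' and s[4:].isnumeric()

def find_phone_numbers(text):
    # strip trailing punctuation from each whitespace-separated piece
    pieces = [s.rstrip('.,?;"\'') for s in text.split()]
    # zip stops at the shorter input, so we never index past the last piece
    return [
        f'{first} {second}'
        for first, second in zip(pieces, pieces[1:])
        if is_possibly_area_code(first) and is_phone_number(second)
    ]

sample = 'Call (800) 867-5309. Allow one-hundred (100) business days, or try (800) 123-4567;'
found = find_phone_numbers(sample)
print(found)
```

Even when the text ends with something that looks like an area code, no `IndexError` is raised, because the last piece never appears as the first element of a pair.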

### Is there a better way?

- This was an example of **pattern matching**.
- It can be done with string methods...
- ...but there is often a better approach: **regular expressions**.

In [None]:
contact = '''
Thank you for buying our expensive product!

If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.

If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!

Due to high demand, please allow one-hundred (100) business days for a response.
'''

In [None]:
import re
re.findall(r'\(\d{3}\) \d{3}-\d{4}', contact)

## Basic regular expressions

### Regular expressions

- A regular expression, or **regex** for short, is a sequence of characters used to **match patterns in strings**.
    - For example, `\d{3} \d{3}-\d{4}` describes a **pattern** that matches US phone numbers of the form `'XXX XXX-XXXX'`.
    - Think of regex as a "mini-language" (formally: they are a grammar for describing a language)
- **Pros**:
    - They are very powerful.
    - Widely used (virtually every programming language, text editor, etc.)
- **Cons**:
    - Hard to read
    - Lots of different "dialects"

### Regular expressions in Python

- Python comes with the `re` module
- We'll explore it later.
- For now, meet `re.fullmatch(pattern, text_to_match)`:
    - if the `pattern` matches *all* of `text_to_match`, a `re.Match` object is returned
    - else `None` is returned

In [None]:
import re

In [None]:
re.fullmatch(pattern='foo', string='foo')

In [None]:
# not matched!
re.fullmatch(pattern='foo', string='bar')

In [None]:
# also doesn't match!
re.fullmatch(pattern='foo', string='foo bar')
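
Because `re.fullmatch` returns either a `re.Match` object or `None`, its result is easy to turn into a boolean. A small sketch follows; the helper name `is_exact_match` is hypothetical and not part of `re`.

```python
import re

def is_exact_match(pattern, text):
    """True if `pattern` matches all of `text`, False otherwise."""
    return re.fullmatch(pattern, text) is not None

# the same three strings used in the cells above
results = {text: is_exact_match('foo', text) for text in ['foo', 'bar', 'foo bar']}
print(results)
```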

### Literal patterns

- The most basic pattern consists only of literal characters.
    - i.e., characters that have no special meaning
- Not very useful *by itself*, since plain Python already handles literals:
    - `==` checks for a "full" match
    - `str.find` searches for a literal pattern

In [None]:
re.fullmatch('foo', 'foo')

In [None]:
re.fullmatch('foo', 'bar')

### Regex's special characters

- The power of regexes comes from **special characters**
    - i.e., characters that have a special meaning
- There are many.
- To start with, we'll look at
    - `*`: the "closure" operator (repeat 0 or more times)
    - `|`: the "or" operator
    - `()`: grouping

### The `*` operator: *zero or more*

- `*` means: match the previous character (or group) zero or more times
- **Note**: applies only to immediately preceding character (or group)
    - "High precedence"

In [None]:
re.fullmatch('woo*', 'wo')

In [None]:
re.fullmatch('woo*', 'woo')

In [None]:
re.fullmatch('woo*', 'woooooooooo')

In [None]:
re.fullmatch('whoaa*h', 'whoaaaaaah')

In [None]:
# matches *zero* or more times!
re.fullmatch('z*oo', 'oo')

### The `|` operator: *or*

- `|` means: match if the previous pattern matches or the next pattern matches
- **Note**: unlike `*`, has very low precedence

In [315]:
re.fullmatch('this|that', 'this')

<re.Match object; span=(0, 4), match='this'>

In [316]:
re.fullmatch('this|that', 'that')

<re.Match object; span=(0, 4), match='that'>

In [317]:
re.fullmatch('dsc 40b|80', 'dsc 40b')

<re.Match object; span=(0, 7), match='dsc 40b'>

In [318]:
# was this what you expected? this is the low precedence!
re.fullmatch('dsc 40b|80', 'dsc 80')
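
One way to see the low precedence: `|` splits the *whole* pattern into the two alternatives `dsc 40b` and `80`. The sketch below checks a few strings against the same pattern; the variable names are only for illustration.

```python
import re

pattern = 'dsc 40b|80'

# the pattern reads as "dsc 40b" OR "80", not "dsc " followed by "40b" or "80"
matches = {
    text: re.fullmatch(pattern, text) is not None
    for text in ['dsc 40b', '80', 'dsc 80']
}
print(matches)
```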

### Grouping `()`

- Parens can be used to make **groups**

In [319]:
re.fullmatch('dsc 40b|80', 'dsc 80')

In [320]:
re.fullmatch('dsc (40b|80)', 'dsc 80')

<re.Match object; span=(0, 6), match='dsc 80'>

In [None]:
re.fullmatch('blah*', 'blahblahblah')

In [None]:
re.fullmatch('(blah)*', 'blahblahblah')

### Building regular expressions

- These operators can be used in combination.

In [312]:
# match strings like ababab... or cdcdcd...
pattern = '(ab)*|(cd)*'

re.fullmatch(pattern, 'ababab')

<re.Match object; span=(0, 6), match='ababab'>

In [313]:
re.fullmatch(pattern, 'cdcdcd')

<re.Match object; span=(0, 6), match='cdcdcd'>

In [314]:
re.fullmatch(pattern, 'abcdab')

### Regex building blocks 🧱

The four main building blocks for all regexes are shown below ([table source](https://www.cs.princeton.edu/courses/archive/spring17/cos226/lectures/54RegularExpressions.pdf), [inspiration](https://docs.google.com/presentation/d/1xQsqa7e3xDZ9nBiekbSBOecwvQm8pSVGa-FBoV6aJ7E/edit#slide=id.g11197671c7e_0_919)).

| operation | order of op. | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|:---|
| <span style='color:purple'><b>concatenation</b></span> | 3 | `AABAAB` | `'AABAAB'` | every other string |
| <span style='color:purple'><b>or</b></span> | 4 | `AA\|BAAB` | `'AA'`, `'BAAB'` | every other string |
| <span style='color:purple'><b>closure</b><br>(zero or more)</span> | 2 | `AB*A` | `'AA'`, `'ABBBBBBA'` | `'AB'`, `'ABABA'` |
| <span style='color:purple'><b>parentheses</b></span> | 1 | `A(A\|B)AAB` <hr style="height:1px"> `(AB)*A` | `'AAAAB'`, `'ABAAB'`<hr style="height:1px">`'A'`, `'ABABABABA'` | every other string<hr style="height:1px">`'AA'`, `'ABBA'` |

Note that `|`, `(`, `)`, and `*` are **special characters**, not literals. They manipulate the characters around them.

***Example:*** `AB*A` matches strings with an `'A'`, followed by zero or more `'B'`s, and then an `'A'`. 

✅ `'AA'`, `'ABA'`, `'ABBBBBBBBBBBBBBA'`<br>
❌ `'AB'`, `'ABAB'`

***Example:*** `(AB)*A` matches strings with zero or more `'AB'`s, followed by an `'A'`.

✅ `'A'`, `'ABA'`, `'ABABABABA'`<br>
❌ `'AA'`, `'ABBBBBBBA'`, `'ABAB'`
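
The two examples above can be checked mechanically with `re.fullmatch`. This is only a sketch; the `examples` dictionary and the `agrees` helper are made up for illustration.

```python
import re

# pattern -> (strings that should match, strings that should not)
examples = {
    'AB*A': (['AA', 'ABA', 'ABBBBBBBBBBBBBBA'], ['AB', 'ABAB']),
    '(AB)*A': (['A', 'ABA', 'ABABABABA'], ['AA', 'ABBBBBBBA', 'ABAB']),
}

def agrees(pattern, should_match, should_not_match):
    """True if `pattern` fully matches every string in the first list and none in the second."""
    return (
        all(re.fullmatch(pattern, s) is not None for s in should_match)
        and all(re.fullmatch(pattern, s) is None for s in should_not_match)
    )

all_agree = all(agrees(p, yes, no) for p, (yes, no) in examples.items())
print(all_agree)
```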

### Exercise

Write a regular expression that matches `'billy'`, `'billlly'`, `'billlllly'`, etc.
- First, think about how to match strings with any even number of `'l'`s, including zero `'l'`s (i.e. `'biy'`).
- Then, think about how to match only strings with a **positive even** number of `'l'`s.

<br><br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>
<code>bi(ll)*y</code> will match any even number of <code>'l'</code>s, including 0.
    
To match only a positive even number of <code>'l'</code>s, we'd need to first "fix into place" two <code>'l'</code>s, and then follow that up with zero or more pairs of <code>'l'</code>s. This specifies the regular expression <code>bill(ll)*y</code>.
    </details>

### Exercise

Write a regular expression that matches `'billy'`, `'billlly'`, `'biggy'`, `'biggggy'`, etc.

Specifically, it should match any string with a **positive even** number of `'l'`s in the middle, or a **positive even** number of `'g'`s in the middle.

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

Possible answers: <code>bi(ll(ll)\*|gg(gg)\*)y</code> or <code>bill(ll)\*y|bigg(gg)\*y</code>.
 
<br>

Note, <code>bill(ll)\*|gg(gg)\*y</code> is <b>not</b> a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match <code>bill(ll)\*</code>, like <code>'billll'</code>, OR strings that match <code>gg(gg)\*y</code>, like <code>'ggy'</code>.

    
</details>

## Intermediate regex

### More operators

- There's (much) more!
- Let's look at:
    - `+`: repeat one or more times
    - `{i,j}`: between i and j times
    - `.`: wildcard
    - `[A-Za-z]`: character classes

### The `.` operator: *wildcard*

- `.` matches any single character

In [None]:
re.fullmatch('.', 'a')

In [None]:
re.fullmatch('.', '7')

In [None]:
re.fullmatch('.', '!')

In [None]:
re.fullmatch('.', 'aa')

In [None]:
re.fullmatch('..', 'a4')

### The `+` operator: one or more times

- `+` matches one or more repetitions of the previous pattern
- Precedence like `*`

In [None]:
re.fullmatch('123+', '12')

In [None]:
re.fullmatch('123+', '123')

In [None]:
re.fullmatch('123+', '1233')

In [None]:
# equivalently
re.fullmatch('1233*', '1233')

### `{i,j}`: repeat between i and j times

- `{i,j}` matches between i and j repetitions of the previous pattern

In [None]:
re.fullmatch('wo{2,4}', 'wo')

In [None]:
re.fullmatch('wo{2,4}', 'woo')

In [None]:
re.fullmatch('wo{2,4}', 'wooo')

In [None]:
re.fullmatch('wo{2,4}', 'woooo')

In [None]:
re.fullmatch('wo{2,4}', 'wooooo')
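
The five cells above can be condensed into one loop that reports how many `'o'`s are accepted. A small sketch, with `matching_counts` as an illustrative name:

```python
import re

# try 'wo', 'woo', ..., 'woooooo' and keep the number of o's that match
matching_counts = [
    n for n in range(1, 7)
    if re.fullmatch('wo{2,4}', 'w' + 'o' * n)
]
print(matching_counts)
```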

### Character classes: `[A-Za-z]`, etc.

- Square brackets like `[]` can be used to make **character classes**
- A character class matches anything inside the class
- Ranges can be specified

In [None]:
# matches 0, 1, or 2
re.fullmatch('[012]', '2')

In [None]:
# 3 is not in the class, so this doesn't match
re.fullmatch('[012]', '3')

In [None]:
# match 0 through 8, but not 9 (sorry 9)
re.fullmatch('[0-8]', '7')

In [None]:
re.fullmatch('[0-8]', '9')

In [None]:
# works for letters, too.
# this matches all uppercase letters
re.fullmatch('[A-Z]', 'Q')

In [None]:
# this matches all letters
re.fullmatch('[A-Za-z]', 'q')

In [None]:
# this matches all letters and numbers
re.fullmatch('[A-Za-z0-9]', '3')
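
A common slip is to write `[A-z]` for "all letters". Ranges follow character codes, and in ASCII a handful of punctuation characters sit between `'Z'` and `'a'`, so `[A-z]` accepts those too. The sketch below lists them; the variable names are illustrative.

```python
import re

# every character from 'A' through 'z' (in code order) that is not a letter
non_letters = [chr(c) for c in range(ord('A'), ord('z') + 1) if not chr(c).isalpha()]
print(non_letters)

# '[A-z]' accepts all of them, while '[A-Za-z]' accepts none
matched_by_wide_range = [ch for ch in non_letters if re.fullmatch('[A-z]', ch)]
matched_by_letter_ranges = [ch for ch in non_letters if re.fullmatch('[A-Za-z]', ch)]
```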

### More regex syntax

| operation | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|
| <span style='color:purple'><b>wildcard</b></span> | `.U.U.U.` | `'CUMULUS'`<br>`'JUGULUM'` | `'SUCCUBUS'`<br>`'TUMULTUOUS'` |
| <span style='color:purple'><b>character class</b></span>  | `[A-Za-z][a-z]*` | `'word'`<br>`'Capitalized'` | `'camelCase'`<br>`'4illegal'` |
| <span style='color:purple'><b>at least one</b></span> | `bi(ll)+y` | `'billy'`<br>`'billlllly'` | `'biy'`<br>`'bily'` |
| <span style='color:purple'><b>between a and b occurrences</b></span> | `m[aeiou]{1,2}m` | `'mem'`<br>`'maam'`<br>`'miem'` | `'mm'`<br>`'mooom'`<br>`'meme'` |

`.`, `[`, `]`, `+`, `{`, and `}` are also special characters, in addition to `|`, `(`, `)`, and `*`.

***Example:*** `[A-E]+` is just shorthand for `(A|B|C|D|E)(A|B|C|D|E)*`.

### Exercise

Write a regular expression that matches any lowercase string that has a repeated vowel, such as `'noon'`, `'peel'`, `'festoon'`, or `'zeebraa'`.

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>[a-z]\*(aa|ee|ii|oo|uu)[a-z]\*</code>
 
<br>

This regular expression matches strings of lowercase characters that have <code>'aa'</code>, <code>'ee'</code>, <code>'ii'</code>, <code>'oo'</code>, or <code>'uu'</code> in them anywhere. <code>[a-z]\*</code> means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.

    
</details>

### Exercise

Write a regular expression that matches any string that contains **both** a lowercase letter and a number, in any order. Examples include `'billy80'`, `'80!!billy'`, and `'bil8ly0'`.

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>(.\*[a-z].\*[0-9].\*)|(.\*[0-9].\*[a-z].\*)</code>
 
<br>

We can break the above regex into two parts – everything before the `|`, and everything after the `|`.

The first part, <code>.\*[a-z].\*[0-9].\*</code>, matches strings in which there is at least one lowercase character and at least one digit, with the lowercase character coming first.

The second part, <code>.\*[0-9].\*[a-z].\*</code>, matches strings in which there is at least one lowercase character and at least one digit, with the digit coming first.
    
Note, the <code>.\*</code> between the digit and letter classes is needed in the event the string has non-digit and non-letter characters.

    
</details>

### Escaping special characters

- `.`, `+`, `?`, etc. are **special characters**. But what if we want a literal `.`?
- To match a special character (e.g. `.` or `*`) as a literal, place a `\` right before it to **escape** it.

In [None]:
re.fullmatch('.', '.')

In [None]:
re.fullmatch('.', 'a')

In [None]:
re.fullmatch(r'\.', '.')

In [None]:
re.fullmatch(r'\.', 'a')
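
When the text to match literally comes from a variable, the standard library's `re.escape` can add the backslashes for us. A brief sketch, using `'ucsd.edu'` as an arbitrary example string:

```python
import re

literal = 'ucsd.edu'
pattern = re.escape(literal)   # backslash-escapes the '.'

# unescaped, '.' is a wildcard, so 'ucsd!edu' slips through
unescaped_matches_bang = re.fullmatch(literal, 'ucsd!edu') is not None
# escaped, only a real '.' is accepted
escaped_matches_bang = re.fullmatch(pattern, 'ucsd!edu') is not None
escaped_matches_dot = re.fullmatch(pattern, 'ucsd.edu') is not None
```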

### Anchors ⚓️

- Place `^` at the start of a regex to require that the match is **at the start** of the string (or at the start of each line, in `re.MULTILINE` mode).
- Place `$` at the end of a regex to require that the match is **at the end** of the string (or at the end of each line, in `re.MULTILINE` mode).
- `re.fullmatch` behaves roughly as if a `^` were added to the start of the pattern and a `$` to the end.
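
Anchors matter most with functions like `re.search`, which would otherwise accept a match anywhere in the string. A minimal sketch; the `candidates` list is an illustrative choice:

```python
import re

candidates = ['ark two', 'ark o ark', 'dark']

# no anchor: 'ark' may appear anywhere
contains_ark = [s for s in candidates if re.search('ark', s)]
# '^' requires the match to begin at the start of the string
starts_with_ark = [s for s in candidates if re.search('^ark', s)]
# '$' requires the match to finish at the end of the string
ends_with_ark = [s for s in candidates if re.search('ark$', s)]
```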

### Builtin character classes

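- Python's `re` comes with some builtin character classes:
    - `\d`: digits
    - `\s`: whitespace
    - `\w`: alphanumeric "word" characters (`[A-Za-z0-9_]` for ASCII text)
- `\b` is related, but it is not a character class: it matches a word boundary, i.e. a position rather than a character.
- When using these, it is a good idea to use "raw strings" (`r"this is a raw string"`) for the pattern string, so that Python does not interpret the backslashes itself.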

In [321]:
re.fullmatch(r'\d{3} \d{3}-\d{4}', '123 456-7890')

<re.Match object; span=(0, 12), match='123 456-7890'>

In [322]:
re.findall('cat', 'my cat is catatonic : (')

['cat', 'cat']

In [323]:
# not a raw string: Python reads '\b' as a backspace character, so nothing matches
re.findall('\bcat\b', 'my cat is catatonic : (')

[]

In [324]:
re.findall(r'\bcat\b', 'my cat is catatonic : (')

['cat']
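
The difference between the last two cells comes from how Python itself reads `'\b'` before `re` ever sees the pattern. A short sketch that makes this visible; the variable names are illustrative:

```python
import re

plain = '\bcat\b'    # Python turns each '\b' into a single backspace character
raw = r'\bcat\b'     # backslashes are kept, so `re` sees the word-boundary token

print(len(plain), len(raw))
print(plain[0] == '\x08')   # '\x08' is the backspace character

text = 'my cat is catatonic : ('
plain_result = re.findall(plain, text)
raw_result = re.findall(raw, text)
```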

### Even more regex syntax

| operation | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|
| <span style='color:purple'><b>escape character</b></span> | `ucsd\.edu` | `'ucsd.edu'` | `'ucsd!edu'` |
| <span style='color:purple'><b>beginning of line</b></span> | `^ark` | `'ark two'`<br>`'ark o ark'` | `'dark'` |
| <span style='color:purple'><b>end of line</b></span>  | `ark$` | `'dark'`<br>`'ark o ark'` | `'ark two'` |
| <span style='color:purple'><b>zero or one</b></span> | `cat?` | `'ca'`<br>`'cat'` | `'cart'` (matches `'ca'` only) |
| <span style='color:purple'><b>built-in character classes*</b></span> | `\w+` <br> `\d+` | `'billy'`<br>`'231231'` | `'this person'`<br>`'858 people'` |
| <span style='color:purple'><b>character class negation</b></span> | `[^a-z]+` | `'KINGTRITON551'`<br>`'1721$$'` | `'porch'`<br>`'billy.edu'` |

***Note:*** 
- `\d` refers to digits
- `\w` refers to alphanumeric characters and underscores (`[A-Za-z0-9_]` for ASCII text)
- `\s` refers to whitespace
- `\b` is a word boundary

### Exercise

Write a regular expression that matches any string that:
- is between 5 and 10 characters long, and
- is made up of only vowels (either uppercase or lowercase, including `'Y'` and `'y'`), periods, and spaces.

Examples include `'yoo.ee.IOU'` and `'AI.I oey'`.

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>^[aeiouyAEIOUY. ]{5,10}$</code>
 
<br>

<b>Key idea:</b> Within a character class (i.e. <code>[...]</code>), special characters do not generally need to be escaped.


    
</details>

## Regex in Python

### `re` in Python

The `re` package is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

In [325]:
import re

`re.search` takes in a string `regex` and a string `text` and returns the location and substring corresponding to the **first** match of `regex` in `text`.

In [326]:
re.search('AB*A', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')

<re.Match object; span=(26, 31), match='ABBBA'>
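
The returned `re.Match` object can be queried for the matched text and its position. A minimal sketch using the same string; note that `re.search` returns `None` when nothing matches, so it is worth checking before calling methods on the result.

```python
import re

text = 'here is a string for you: ABBBA. here is another: ABBBBBBBA'
m = re.search('AB*A', text)

if m is not None:
    matched_text = m.group()   # the substring that matched
    start, end = m.span()      # text[start:end] is the match

no_match = re.search('AB*A', 'nothing to find here')
```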

`re.findall` takes in a string `regex` and a string `text` and returns a list of all matches of `regex` in `text`.

In [327]:
re.findall('AB*A', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')

['ABBBA', 'ABBBBBBBA']

`re.sub` takes in a string `regex`, a string `repl`, and a string `text`, and replaces all matches of `regex` in `text` with `repl`.

In [328]:
re.sub('AB*A', 'billy', 'here is a string for you: ABBBA. here is another: ABBBBBBBA')

'here is a string for you: billy. here is another: billy'

### Capture groups

- Surround a regex with `(` and `)` to define a **capture group** within a pattern.
- Capture groups are useful for extracting relevant parts of a string.

In [329]:
re.findall(r'\w+@(\w+)\.edu', 'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')

['notucsd', 'ucsd']

- Notice what happens if we remove the `(` and `)`!

In [330]:
re.findall(r'\w+@\w+\.edu', 'my old email was billy@notucsd.edu, my new email is notbilly@ucsd.edu')

['billy@notucsd.edu', 'notbilly@ucsd.edu']

- Earlier, we also saw that parentheses can be used to group parts of a regex together. When using `re.findall`, all groups are treated as capturing groups.

In [331]:
# A regex that matches strings with two of the same vowel followed by 3 digits
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')

[('oo', '124')]
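
If a group is needed only for grouping, Python's `re` also supports non-capturing groups, written `(?:...)`. The sketch below uses one so that only the digits are returned; this goes slightly beyond the syntax covered so far.

```python
import re

text = 'eeoo124'

# both groups capture, so findall returns tuples
both_captured = re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', text)

# (?:...) groups the vowel alternatives without capturing them
digits_only = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', text)
```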

## Example: Log parsing

Web servers typically record every request they receive in their logs.

In [333]:
s = '''132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''

Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string `s`.

In [334]:
exp = r'\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'
re.findall(exp, s)

[('05', 'May', '2022', '14', '26', '15')]

While the regex above works, it is not very **specific**: it also matches incorrectly formatted log strings.

In [335]:
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)

[('adr', 'jduy', 'wffsdffs', 'r4s4', '4wsgdfd', 'asdf')]

### The more specific, the better!
* Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.
    - `.*` matches every possible string, but we don't use it very often.
    
* A better date extraction regex:
```
\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
```

    * `\d{2}` matches any 2-digit number.
    * `[A-Z]{1}` matches any single occurrence of any uppercase letter.
    * `[a-z]{2}` matches any 2 consecutive occurrences of lowercase letters.
    * Remember, the special characters `[` and `]` need to be escaped with `\`. The `/` is not special in Python's `re`, so escaping it is optional.

In [336]:
s

'132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

In [337]:
new_exp = r'\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
re.findall(new_exp, s)

[('05', 'May', '2022', '14', '26', '15')]

A benefit of `new_exp` over `exp` is that it doesn't capture anything when the string doesn't follow the format we specified.

In [338]:
other_s

'[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

In [339]:
re.findall(new_exp, other_s)

[]
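
One possible way to package the stricter pattern is a small function that returns the captured pieces by name, or `None` for malformed lines. This is a sketch under the same format assumptions as `new_exp`; `LOG_TIME`, `FIELDS`, and `parse_timestamp` are hypothetical names.

```python
import re

# same structure as new_exp, compiled once; '/' is left unescaped
LOG_TIME = re.compile(r'\[(\d{2})/([A-Z][a-z]{2})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]')
FIELDS = ('day', 'month', 'year', 'hour', 'minute', 'second')

def parse_timestamp(line):
    """Return the timestamp parts of a log line as a dict, or None if absent."""
    m = LOG_TIME.search(line)
    if m is None:
        return None
    return dict(zip(FIELDS, m.groups()))

good_line = '132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
bad_line = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

parsed = parse_timestamp(good_line)
rejected = parse_timestamp(bad_line)
```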

## Limitations

### Limitations of regexes

Writing a regular expression is like writing a program.
* You need to know the syntax well.
* They can be easier to write than to read.
* They can be difficult to debug.

Regular expressions are terrible at certain types of problems. Examples:
* Anything involving counting (same number of instances of a and b).
* Anything involving complex structure (palindromes).
* Parsing highly complex text structure ([HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), for instance).

Below is a regular expression that validates email addresses in Perl. See [this article](http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html) for more details.



<center><img src="imgs/image_8.png" width=700></center>

Stack Overflow once went down because of a regular expression! See [this article](https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016) for the details.

<center><img src='imgs/so_regex.png' width=60%></center>

### Summary

- Regular expressions are used to match and extract patterns from text.
- You don't need to force yourself to "memorize" regex syntax – refer to the resources in the [Resources](#Resources) section of the lecture and on the [Resources](https://dsc80.com/resources#regular-expressions) tab of the course website.
- Also refer to the three tables of syntax in the lecture:
    - [Regex building blocks](#Regex-building-blocks-🧱).
    - [More regex syntax](#More-regex-syntax).
    - [Even more regex syntax](#Even-more-regex-syntax).
- **Note:** You don't always have to use regular expressions! If Python/`pandas` string methods work for your task, you can still use those.
- **Play [Regex Golf](https://alf.nu/RegexGolf?world=regex&level=r00) to practice!** 🏌️
- **Next time:** Using regular expressions in `pandas` (through `.str`). Describing text data quantitatively.

### Resources

Lots and lots of regular expressions! Good resources:
- [regex101.com](https://regex101.com), a helpful site to have open while writing regular expressions.
- Python [`re` library documentation](https://docs.python.org/3/library/re.html) and [how-to](https://docs.python.org/3/howto/regex.html).
    - The "how-to" is great, read it!
- [regex "cheat sheet"](https://dsc80.com/resources/other/berkeley-regex-reference.pdf) (taken from [here](https://ds100.org/sp22/resources/)).

See [dsc80.com/resources/#regular-expressions](https://dsc80.com/resources/#regular-expressions).