# Regular Expressions - Day 1

## When _NOT_ to Use Regular Expressions

- Determining whether a `str` starts or ends with a substring.
- Determining whether a `str` contains a substring.
- Simple substring replacements.
    - `str` objects are immutable, so a replacement returns a new `str`.


In [1]:
# Define a string
text = 'Awesome! I am still working on 100DaysOfCode'

In [2]:
# Determine if the string starts with 'Awesome'
text.startswith('Awesome!')

True

In [3]:
# Determine if the string ends with '100DaysOfCode'
text.endswith('100DaysOfCode')

True

In [4]:
# Determine if the string contains the word 'working'
'working' in text

True

In [5]:
# Determine if the string contains '100DaysOfCode', ignoring case
'100DaysOfCode'.lower() in text.lower()

True

In [6]:
# Replace '100DaysOfCode' with '200DaysOfCode'
text.replace('100', '200')

'Awesome! I am still working on 200DaysOfCode'
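
The immutability point can be checked directly: `str.replace` returns a new string and leaves the original untouched. A minimal sketch, reusing the same sample string; the names `original` and `updated` are introduced here only for illustration.

```python
# Illustrative names: `original` and `updated` are not part of the cells above
original = 'Awesome! I am still working on 100DaysOfCode'

# str.replace builds and returns a new string
updated = original.replace('100', '200')

print(updated)              # Awesome! I am still working on 200DaysOfCode
print(original)             # Awesome! I am still working on 100DaysOfCode
print(updated is original)  # False
```

To keep the change, assign the result back to a name, for example `text = text.replace('100', '200')`.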


---

# Regular Expressions - Day 1a

## Regular Expressions Overview

- Regular expressions are a meta language for describing text patterns.
- The `re` module has two main functions, `search` and `match`.
    - `match` only matches at the start of the string.
    - `search` scans the string and can match a substring anywhere in it.


In [7]:
# Create a string object
text = 'Awesome! I am still working on the 100DaysOfCode challenge'

In [8]:
# Import the regular expressions module
import re

In [9]:
# Use a search method to match part of the string
re.search(
    r'I am',
    text
)

<re.Match object; span=(9, 13), match='I am'>

In [10]:
# Using match with a pattern that occurs mid-string returns None, because match only tries to match at the start of the string
re.match(
    r'I am',
    text
)

In [11]:
# Use match with a pattern that begins at the start of the string (here it also spans to the end)
re.match(
    r'Awesome!.+challenge',
    text
)

<re.Match object; span=(0, 58), match='Awesome! I am still working on the 100DaysOfCode >
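
To make the anchoring behavior concrete, the sketch below compares `re.search`, `re.match`, and `re.fullmatch` on the same string. `re.fullmatch` does not appear elsewhere in these notes; it is the standard `re` function that requires the pattern to cover the entire string.

```python
import re

text = 'Awesome! I am still working on the 100DaysOfCode challenge'

# search scans the whole string for the first place the pattern matches
print(re.search(r'I am', text) is not None)               # True

# match only tries at position 0, so a pattern found mid-string fails
print(re.match(r'I am', text) is None)                    # True

# match succeeds when the pattern fits at the start, even if text remains after it
print(re.match(r'Awesome!', text).span())                 # (0, 8)

# fullmatch requires the pattern to consume the entire string
print(re.fullmatch(r'Awesome!', text) is None)            # True
print(re.fullmatch(r'Awesome!.+challenge', text).span())  # (0, 58)
```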

---

# Regular Expressions - Day 2

## String Capturing Parentheses

- Use parentheses in a pattern to define **capturing groups**, then access the text each group matched.


In [12]:
# Create two string objects
hundred = 'Awesome! I am still working on the #100DaysOfCode challenge'
two_hundred = 'Awesome! I am still working on the #200DaysOfCode challenge'

In [13]:
# Match using a capturing group
result_1 = re.search(
    r'(\#\d+DaysOfCode)',
    hundred
)

In [14]:
# List the matching groups (returned in a tuple object)
result_1.groups()

('#100DaysOfCode',)

In [15]:
# Index the matching group, to get the precise search match
result_1.groups()[0]

'#100DaysOfCode'

In [16]:
# Repeat the process with the two_hundred string
result_2 = re.search(
    pattern=r'(\#\d+DaysOfCode)',
    string=two_hundred
)

In [17]:
# Display the matching groups
result_2.groups()

('#200DaysOfCode',)

In [18]:
# Index the matching group, to get the precise search match
result_2.groups()[0]

'#200DaysOfCode'
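
One detail worth keeping in mind: `re.search` returns `None` when nothing matches, so calling `.groups()` on the result without a check raises `AttributeError`. The sketch below guards against that and uses `Match.group(1)`, which is equivalent to `groups()[0]`. The helper name `extract_hashtag` is hypothetical.

```python
import re

def extract_hashtag(message):
    """Return the '#<digits>DaysOfCode' tag in message, or None if absent."""
    result = re.search(r'(#\d+DaysOfCode)', message)
    if result is None:
        return None
    # group(1) is the first capturing group; group(0) is the whole match
    return result.group(1)

print(extract_hashtag('Awesome! I am still working on the #100DaysOfCode challenge'))
# #100DaysOfCode
print(extract_hashtag('No hashtag here'))
# None
```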

---

## `findall` is Your Friend

- The `re.findall` function finds every non-overlapping match of a pattern in a string.
    - Returns a `list` of the matching substrings.
- Without `findall`, `re.search` and `re.match` stop after the first occurrence.
- Search the string `text` below for all instances of a three-digit number.


In [19]:
# Define the string to search
text = '''
$ python module_index.py |grep ^re
re                 | stdlib | 005, 007, 009, 015, 021, 022, 068, 080, 081, 086, 095
'''

In [20]:
# Use findall to match all instances
result_1 = re.findall(
    pattern=r'\d{3}',
    string=text
)

In [21]:
# Display the list of matches
result_1

['005', '007', '009', '015', '021', '022', '068', '080', '081', '086', '095']
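
The contrast between `search` and `findall` can be sketched on a shorter string. The example also shows a caveat: `\d{3}` matches any three consecutive digits, including the first three digits of a longer number, so word boundaries (`\b`) are one way to require exactly three. The string `line` is made up for illustration.

```python
import re

# Illustrative input: the trailing '12345' is not in the original data
line = 're | stdlib | 005, 007, 009, 12345'

# search stops at the first occurrence
print(re.search(r'\d{3}', line).group())  # 005

# findall returns every non-overlapping occurrence, including '123' from '12345'
print(re.findall(r'\d{3}', line))         # ['005', '007', '009', '123']

# word boundaries restrict matches to numbers that are exactly three digits long
print(re.findall(r'\b\d{3}\b', line))     # ['005', '007', '009']
```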

## `findall` Continued

- Use `findall` and `Counter` to determine the most common word that starts with an uppercase letter in the string `text` below.


In [22]:
# Define the text to search
text = '''Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been 
the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of
Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus
PageMaker including versions of Lorem Ipsum'''


In [23]:
# Use `findall` to create a list of all words that have an uppercase first character
# re.VERBOSE allows inline comments to describe the purpose of each piece of a regex
upper_words = re.findall(
    r'''
    [A-Z]       # Use a character class to match any uppercase letter in the first position
    [a-z0-9]+   # Use a character class to match one or more lowercase letters or numbers that follow
    ''',
    text,
    re.VERBOSE
)

In [24]:
# Display the resulting list of words
upper_words

['Lorem',
 'Ipsum',
 'Lorem',
 'Ipsum',
 'It',
 'It',
 'Letraset',
 'Lorem',
 'Ipsum',
 'Aldus',
 'Page',
 'Maker',
 'Lorem',
 'Ipsum']
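
Two side effects of `re.VERBOSE` are easy to miss: whitespace inside the pattern is ignored, and a bare `#` starts a comment, so a literal `#` has to be escaped (or placed in a character class). The sketch below shows a verbose and a compact pattern that behave the same; the names `verbose_pattern`, `compact_pattern`, and `message` are illustrative.

```python
import re

verbose_pattern = re.compile(
    r'''
    \#          # a literal hash sign, escaped so it is not read as a comment
    \d+         # one or more digits
    DaysOfCode  # the literal text that follows
    ''',
    re.VERBOSE,
)

# The same pattern written without VERBOSE; here '#' needs no escape
compact_pattern = re.compile(r'#\d+DaysOfCode')

message = 'working on the #100DaysOfCode challenge'
print(verbose_pattern.search(message).group())  # #100DaysOfCode
print(compact_pattern.search(message).group())  # #100DaysOfCode
```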

In [25]:
# Use Counter to determine the most common words in the upper_words list
from collections import Counter

In [26]:
# Define a Counter object with the upper_words list
word_count = Counter(upper_words)

In [27]:
# Display the word_count object
word_count

Counter({'Lorem': 4,
         'Ipsum': 4,
         'It': 2,
         'Letraset': 1,
         'Aldus': 1,
         'Page': 1,
         'Maker': 1})

In [28]:
# Get the three most common words from the word_count object
word_count.most_common(3)

[('Lorem', 4), ('Ipsum', 4), ('It', 2)]
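
Note that the pattern `[A-Z][a-z0-9]+` splits `PageMaker` into `Page` and `Maker`, and it skips single-letter words. If whole capitalized words are wanted instead, one possible variation, offered as a sketch rather than the method used above, starts at a word boundary and takes the rest of the word with `\w*`. The string `sample` and the variable names are illustrative.

```python
import re
from collections import Counter

sample = 'Lorem Ipsum like Aldus PageMaker including versions of Lorem Ipsum'

# Original pattern: an uppercase letter followed by lowercase letters or digits
split_words = re.findall(r'[A-Z][a-z0-9]+', sample)
print(split_words)
# ['Lorem', 'Ipsum', 'Aldus', 'Page', 'Maker', 'Lorem', 'Ipsum']

# Variation: start at a word boundary and take the rest of the word
whole_words = re.findall(r'\b[A-Z]\w*', sample)
print(whole_words)
# ['Lorem', 'Ipsum', 'Aldus', 'PageMaker', 'Lorem', 'Ipsum']

print(Counter(whole_words).most_common(2))
# [('Lorem', 2), ('Ipsum', 2)]
```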

---

# Regular Expressions - Day 3