<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Regular-Expressions-(Regex)-in-Python--🐍" data-toc-modified-id="Regular-Expressions-(Regex)-in-Python--🐍-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Regular Expressions (Regex) in Python  🐍</a></span><ul class="toc-item"><li><span><a href="#Python's-re-Module" data-toc-modified-id="Python's-re-Module-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Python's <code>re</code> Module</a></span><ul class="toc-item"><li><span><a href="#Common-Regex-Functions" data-toc-modified-id="Common-Regex-Functions-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Common Regex Functions</a></span></li><li><span><a href="#Regex-Patterns" data-toc-modified-id="Regex-Patterns-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Regex Patterns</a></span></li><li><span><a href="#Examples-of-Regex-Functions-and-Patterns" data-toc-modified-id="Examples-of-Regex-Functions-and-Patterns-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Examples of Regex Functions and Patterns</a></span><ul class="toc-item"><li><span><a href="#1.-re.search(pattern,-string):" data-toc-modified-id="1.-re.search(pattern,-string):-1.1.3.1"><span class="toc-item-num">1.1.3.1&nbsp;&nbsp;</span>1. <code>re.search(pattern, string)</code>:</a></span></li><li><span><a href="#2.-re.match(pattern,-string):" data-toc-modified-id="2.-re.match(pattern,-string):-1.1.3.2"><span class="toc-item-num">1.1.3.2&nbsp;&nbsp;</span>2. <code>re.match(pattern, string)</code>:</a></span></li><li><span><a href="#3.-re.findall(pattern,-string):" data-toc-modified-id="3.-re.findall(pattern,-string):-1.1.3.3"><span class="toc-item-num">1.1.3.3&nbsp;&nbsp;</span>3. <code>re.findall(pattern, string)</code>:</a></span></li><li><span><a href="#4.-re.finditer(pattern,-string):" data-toc-modified-id="4.-re.finditer(pattern,-string):-1.1.3.4"><span class="toc-item-num">1.1.3.4&nbsp;&nbsp;</span>4. 
<code>re.finditer(pattern, string)</code>:</a></span></li><li><span><a href="#5.-re.sub(pattern,-replacement,-string):" data-toc-modified-id="5.-re.sub(pattern,-replacement,-string):-1.1.3.5"><span class="toc-item-num">1.1.3.5&nbsp;&nbsp;</span>5. <code>re.sub(pattern, replacement, string)</code>:</a></span></li><li><span><a href="#6.-re.split(pattern,-string):" data-toc-modified-id="6.-re.split(pattern,-string):-1.1.3.6"><span class="toc-item-num">1.1.3.6&nbsp;&nbsp;</span>6. <code>re.split(pattern, string)</code>:</a></span></li></ul></li><li><span><a href="#Note" data-toc-modified-id="Note-1.1.4"><span class="toc-item-num">1.1.4&nbsp;&nbsp;</span>Note</a></span></li></ul></li></ul></li></ul></div>

# Regular Expressions (Regex) in Python  🐍


Regex, short for Regular Expression, is a **sequence of characters** that **defines a search pattern**. 

Regex is commonly used to **search** for patterns, **match** text against them, and **manipulate data** in strings.

This makes it invaluable for tasks such as **data validation**, **text parsing**, and **data extraction**. Typical kinds of text it is applied to include:

1. Emails
2. URLs
3. Phone Numbers
4. Dates and Times
5. Social Security Numbers
6. Credit Card Numbers
7. File Paths
8. HTML Tags
9. Log Files
10. Natural Language Processing

## Python's `re` Module

Python provides the `re` module, which allows you to work with regular expressions. Before using `re`, you need to import it:

```python
import re
```

In [1]:
import re

### Common Regex Functions

Regex functions in Python allow you to find and work with patterns in strings:

1. To find the pattern:
   - `re.search(pattern, string)`: Searches the string for a match to the pattern and returns a match object if found.
   - `re.match(pattern, string)`: Checks for a match only at the beginning of the string and returns a match object if found.
   - `re.findall(pattern, string)`: Returns all non-overlapping occurrences of the pattern in the string as a list of strings (or a list of groups if the pattern contains capturing groups).
   - `re.finditer(pattern, string)`: Returns an iterator yielding match objects for all occurrences of the pattern in the string.

2. To work with the pattern:
   - `re.sub(pattern, replacement, string)`: Replaces all occurrences of the pattern in the string with the replacement string.
   - `re.split(pattern, string)`: Splits the string by occurrences of the pattern and returns a list of substrings.



### Regex Patterns

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/regex.png?raw=true)

### Examples of Regex Functions and Patterns

#### 1. `re.search(pattern, string)`:
   - Searches the `string` for a match to the `pattern`.
   - Returns a match object if the pattern is found, otherwise returns `None`.


 We can use the `re.search()` function to find specific patterns within a string. Let's explore some regex patterns and how they work:

In [3]:
text = 'I have 10 apples   and  2 bananas.'
pattern = r'\d+'    # one or more digits (raw string, so the backslash reaches the regex engine unchanged)

result = re.search(pattern, text)

if result:
    print(f"Match found: {result.group()}")
else:
    print("No match found.")

Match found: 10
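
The match object returned by `re.search()` carries more than the matched text. A minimal sketch, reusing the same sentence, of the `group()`, `start()`, `end()` and `span()` methods:

```python
import re

text = 'I have 10 apples   and  2 bananas.'
result = re.search(r'\d+', text)

if result:
    print(result.group())  # matched text: '10'
    print(result.start())  # index of the first matched character: 7
    print(result.end())    # index just past the match: 9
    print(result.span())   # (start, end) as a tuple: (7, 9)
```

Slicing the original string with `text[result.start():result.end()]` gives back the same text as `result.group()`.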


In [6]:
pattern = r'\w' # matches any word character: letters, digits, and the underscore (_). For ASCII text it is equivalent to [a-zA-Z0-9_]; by default it also matches Unicode letters and digits.

result = re.search(pattern, text) # Returns first match, 'I'
if result:
    print(f"Match found: {result.group()}")
else:
    print("No match found.")

Match found: I
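
For `str` patterns, `\w` is Unicode-aware by default, so it matches more than `[a-zA-Z0-9_]`. A small sketch of the difference, using the `re.ASCII` flag to restrict the match (the sample string is made up for illustration):

```python
import re

sample = 'ß_9!'  # hypothetical sample: a non-ASCII letter, an underscore, a digit, punctuation

unicode_hits = re.findall(r'\w', sample)                # default Unicode matching
ascii_hits = re.findall(r'\w', sample, flags=re.ASCII)  # \w limited to [a-zA-Z0-9_]

print(unicode_hits)  # ['ß', '_', '9']
print(ascii_hits)    # ['_', '9']
```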


In [7]:
pattern = r"apple|banana" # Alternation allows matching one of several patterns separated by |.
text = "I have a banana and an apple."

result = re.search(pattern, text)
if result:
    print("Fruit found:", result.group())
else:
    print("No fruit found.")


Fruit found: banana
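
`re.search()` reports `banana` even though `apple` is listed first in the pattern, because the search returns the leftmost match in the text, not the first alternative. A short sketch that makes this visible:

```python
import re

text = "I have a banana and an apple."

first = re.search(r"apple|banana", text)        # leftmost match in the text
all_fruits = re.findall(r"apple|banana", text)  # every match, in text order

print(first.group())  # banana
print(all_fruits)     # ['banana', 'apple']
```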


In [8]:
pattern = r"co.kie" # The dot (.) in regex represents any character (except newline).
text = "I love my cookie and coke."

result = re.search(pattern, text)
if result:
    print("Match found:", result.group())
else:
    print("No match found.")


Match found: cookie
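
The dot accepts any single character except a newline. The sketch below shows that limit, how the `re.DOTALL` flag lifts it, and how `\.` matches only a literal dot (the test strings are made up for illustration):

```python
import re

hyphen_match = re.search(r"co.kie", "co-kie")    # the dot accepts the hyphen
newline_match = re.search(r"co.kie", "co\nkie")  # None: the dot rejects the newline
dotall_match = re.search(r"co.kie", "co\nkie", flags=re.DOTALL)  # now the newline is accepted
literal_dot = re.search(r"co\.kie", "cookie")    # None: \. only matches a real dot

print(hyphen_match.group())      # co-kie
print(newline_match)             # None
print(dotall_match is not None)  # True
print(literal_dot)               # None
```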


In [9]:
pattern = r"[aeiou]" # Character classes allow matching any one of several characters at a specific position.
text = "The quick brown fox jumps over the lazy dog."

result = re.search(pattern, text)
if result:
    print("Vowel found:", result.group())
else:
    print("No vowel found.")


Vowel found: e
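
Character classes also support ranges (`[A-Z]`) and negation (`[^...]`). A brief sketch on the same sentence:

```python
import re

text = "The quick brown fox jumps over the lazy dog."

vowels = re.findall(r"[aeiou]", text)       # every lowercase vowel, in order
first_upper = re.search(r"[A-Z]", text)     # range: any uppercase ASCII letter
leftovers = re.findall(r"[^a-zA-Z ]", text) # negation: neither a letter nor a space

print(len(vowels))          # 11
print(first_upper.group())  # T
print(leftovers)            # ['.']
```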


In [10]:
# Quantifiers specify how many times a character or group should repeat.
pattern = r"\d{3}-\d{2}-\d{4}" # 3 digits - 2 digits - 4 digits
text = "My social security number is 123-45-6789."

result = re.search(pattern, text)
if result:
    print("SSN found:", result.group())
else:
    print("No SSN found.")


SSN found: 123-45-6789
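
Besides the exact count `{n}`, the most common quantifiers are `?` (zero or one), `*` (zero or more), `+` (one or more) and `{m,n}` (between m and n). A small sketch with made-up sample strings:

```python
import re

samples = ["color", "colour", "colouur"]

# re.fullmatch succeeds only if the whole string fits the pattern
optional_u = [s for s in samples if re.fullmatch(r"colou?r", s)]    # zero or one 'u'
one_or_more_u = [s for s in samples if re.fullmatch(r"colou+r", s)] # at least one 'u'
any_number_u = [s for s in samples if re.fullmatch(r"colou*r", s)]  # any number of 'u'

# {2,4} is greedy: it takes up to 4 digits, then continues with what is left
digit_runs = re.findall(r"\d{2,4}", "7 42 2024 123456")

print(optional_u)     # ['color', 'colour']
print(one_or_more_u)  # ['colour', 'colouur']
print(any_number_u)   # ['color', 'colour', 'colouur']
print(digit_runs)     # ['42', '2024', '1234', '56']
```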


#### 2. `re.match(pattern, string)`:
   - Searches the `string` for a match only at the beginning.
   - Returns a match object if the pattern is found at the start, otherwise returns `None`.

In [5]:
pattern = r'\w' # matches any word character: letters, digits, and the underscore (_). For ASCII text it is equivalent to [a-zA-Z0-9_]; by default it also matches Unicode letters and digits.
text = ' I have an apple and a banana.'

result = re.match(pattern, text)
if result:
    print(f"Match found: {result.group()}")
else:
    print("No match found.")

No match found.


The text starts with a space, which is not a word character, so `re.match()` finds nothing at the beginning of the string and returns `None`.
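
To make the contrast concrete, here is a short sketch that runs `re.match()` and `re.search()` on the same text, plus two possible ways to get past the leading space (both are illustrative choices, not the only options):

```python
import re

text = ' I have an apple and a banana.'

at_start = re.match(r'\w', text)              # None: position 0 is a space
anywhere = re.search(r'\w', text)             # scans forward and finds 'I'
skip_space = re.match(r'\s*\w', text)         # let the pattern absorb leading whitespace
stripped = re.match(r'\w', text.lstrip())     # or strip the whitespace before matching

print(at_start)            # None
print(anywhere.group())    # I
print(skip_space.group())  # ' I' (includes the leading space)
print(stripped.group())    # I
```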

#### 3. `re.findall(pattern, string)`:
   - Returns all occurrences of the `pattern` in the `string` as a list of strings.


In [11]:
pattern = r'\d+' # one or more digits
text = 'I have 3 apples and 5 bananas.'

result = re.findall(pattern, text)
print(f"Occurrences: {result}")

Occurrences: ['3', '5']
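
When the pattern contains capturing groups, `re.findall()` returns the groups rather than the whole match. A minimal sketch on the same sentence:

```python
import re

text = 'I have 3 apples and 5 bananas.'

whole = re.findall(r'\d+ \w+', text)      # no groups: full matches
one_group = re.findall(r'\d+ (\w+)', text)  # one group: list of that group's text
two_groups = re.findall(r'(\d+) (\w+)', text)  # several groups: list of tuples

print(whole)       # ['3 apples', '5 bananas']
print(one_group)   # ['apples', 'bananas']
print(two_groups)  # [('3', 'apples'), ('5', 'bananas')]
```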


Let's look at the regex pattern `[\w\.]+@\w+\.\w+`, designed to match email addresses:

1. `[\w\.]+`: Matches one or more occurrences of word characters or dots (`.`).
   - `\w` represents word characters (letters, digits, and underscores).
   - `\.` matches a literal dot (period).

2. `@`: Matches the `@` symbol.

3. `\w+`: Matches one or more occurrences of word characters after the `@` symbol.
   - `\w` represents word characters (letters, digits, and underscores).

4. `\.`: Matches a literal dot (period).

5. `\w+`: Matches one or more occurrences of word characters after the dot.
   - `\w` represents word characters (letters, digits, and underscores).


In [13]:
emails_text = """
Here are some made-up email addresses:
john.doe@example.com
mary_smith123@gmail.com
theodore@example.co.uk
contact_us@company.net
info123@yahoo.com
alice.bob@example.org
support@website.io
sales.department@example.com
test.email@domain.com
random.email@subdomain.co
"""

pattern = r'[\w\.]+@\w+\.\w+'

re.findall(pattern, emails_text)

['john.doe@example.com',
 'mary_smith123@gmail.com',
 'theodore@example.co',
 'contact_us@company.net',
 'info123@yahoo.com',
 'alice.bob@example.org',
 'support@website.io',
 'sales.department@example.com',
 'test.email@domain.com',
 'random.email@subdomain.co']

What do you observe in the result? How would you fix it?

In [None]:
# Your answer here
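
One possible answer, offered as a sketch rather than the intended solution: the pattern allows only one `.something` after the domain name, so `theodore@example.co.uk` is cut off at `.co`. Allowing the dot-plus-word part to repeat, here with a non-capturing group `(?:...)+`, captures multi-part domains. The shortened address list below is an assumption made to keep the example small.

```python
import re

emails_text = """
john.doe@example.com
theodore@example.co.uk
random.email@subdomain.co
"""

original = re.findall(r'[\w.]+@\w+\.\w+', emails_text)
# (?:\.\w+)+ repeats ".word" one or more times without creating a capturing group,
# so findall still returns the full match
revised = re.findall(r'[\w.]+@\w+(?:\.\w+)+', emails_text)

print(original)  # ['john.doe@example.com', 'theodore@example.co', 'random.email@subdomain.co']
print(revised)   # ['john.doe@example.com', 'theodore@example.co.uk', 'random.email@subdomain.co']
```

This is still far from a full email validator; for example, it does not accept hyphens in domain names.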

#### 4. `re.finditer(pattern, string)`:
   - Returns an iterator yielding match objects for all occurrences of the `pattern` in the `string`.

In [15]:
pattern = r'\d+'
text = 'I have 3 apples and 5 bananas.'

matches = re.finditer(pattern, text)
for match in matches:
    print(f"Match found: {match.group()}")

Match found: 3
Match found: 5
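
Because `re.finditer()` yields match objects instead of plain strings, each result also knows where it was found. A short sketch that collects the text and position of every match:

```python
import re

text = 'I have 3 apples and 5 bananas.'

# Each match object exposes both the matched text and its (start, end) span
positions = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

print(positions)  # [('3', (7, 8)), ('5', (20, 21))]
```

This is the main reason to prefer `re.finditer()` over `re.findall()` when the location of each match matters.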


#### 5. `re.sub(pattern, replacement, string)`:
   - Replaces all occurrences of the `pattern` in the `string` with the `replacement` string.

In [17]:
pattern = r'apples'
text = 'I have 3 apples and apples.'

result = re.sub(pattern, 'oranges', text)
print(f"Updated text: {result}")

Updated text: I have 3 oranges and oranges.


In [18]:
re.sub(r'\d+', '', text)   # replaces each run of digits with an empty string

'I have  apples and apples.'
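
`re.sub()` has a few more options worth knowing: the `count` argument limits how many replacements are made, the replacement string can refer back to captured groups with `\1`, `\2`, and the replacement can also be a function. A minimal sketch on the same sentence:

```python
import re

text = 'I have 3 apples and apples.'

only_first = re.sub(r'apples', 'oranges', text, count=1)  # replace just the first occurrence
swapped = re.sub(r'(\d+) (\w+)', r'\2: \1', text)         # reorder the two captured groups
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)  # compute each replacement

print(only_first)  # I have 3 oranges and apples.
print(swapped)     # I have apples: 3 and apples.
print(doubled)     # I have 6 apples and apples.
```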

#### 6. `re.split(pattern, string)`:
   - Splits the `string` by occurrences of the `pattern` and returns a list of substrings.


In [19]:
pattern = r'\s+' # matches one or more whitespace characters


result = re.split(pattern, text)
print(f"Split text: {result}")

Split text: ['I', 'have', '3', 'apples', 'and', 'apples.']
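
`re.split()` is most useful when the separator varies. The sketch below splits on several delimiters at once, limits the number of splits with `maxsplit`, and keeps the separators by wrapping the pattern in a capturing group (the input strings are made up for illustration):

```python
import re

# Split on a comma or semicolon followed by optional whitespace
fruits = re.split(r'[,;]\s*', 'apples, bananas;cherries,  dates')

# maxsplit=1 stops after the first split
head_tail = re.split(r'\s+', 'I have 3 apples', maxsplit=1)

# A capturing group makes re.split keep the separators in the result
with_separators = re.split(r'(\s+)', 'a  b')

print(fruits)           # ['apples', 'bananas', 'cherries', 'dates']
print(head_tail)        # ['I', 'have 3 apples']
print(with_separators)  # ['a', '  ', 'b']
```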


### Note

If you don't need a regex pattern, you can use Python's built-in string methods instead: for example `replace()` instead of `re.sub()`, or `split()` instead of `re.split()`.
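
For fixed text, the built-in string methods and the `re` functions give the same result. A short sketch comparing them on the sentence used above:

```python
import re

text = 'I have 3 apples and apples.'

# Fixed substring: str.replace does the same job as re.sub
plain_replace = text.replace('apples', 'oranges')
regex_replace = re.sub(r'apples', 'oranges', text)

# Whitespace splitting: str.split() with no arguments matches re.split(r'\s+', ...) here,
# because the text has no leading or trailing whitespace
plain_split = text.split()
regex_split = re.split(r'\s+', text)

print(plain_replace == regex_replace)  # True
print(plain_split == regex_split)      # True
```

The string methods are simpler to read and avoid having to escape regex metacharacters such as `.` or `+` in the search text.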