## Regex notes

- https://docs.python.org/3/library/re.html
- https://docs.python.org/3/howto/regex.html
- https://regexr.com/
- https://regex101.com/
- https://regex-generator.olafneumann.org/
- https://github.com/PacktPublishing/2018-Python-Regular-Expressions---Real-World-Projects/tree/master
- https://www.w3resource.com/python-exercises/re/ 
- https://github.com/codebasics/py/blob/master/Advanced/regex/regex_tutorial_exercise_questions.ipynb (exercises)
- https://github.com/Asabeneh/30-Days-Of-Python/blob/master/18_Day_Regular_expressions/18_regular_expressions.md
- https://github.com/CoreyMSchafer/code_snippets/tree/master/Python-Regular-Expressions
- https://regex-generator.olafneumann.org/?sampleText=2020-03-12T13%3A34%3A56.123Z%20INFO%20%20%5Borg.example.Class%5D%3A%20This%20is%20a%20&flags=i


# Python Regular Expressions (Regex): An In-Depth Guide

## **1. Introduction to Regular Expressions**

### **1.1 What is a Regular Expression?**
A **Regular Expression (Regex)** is a sequence of characters that defines a search pattern. This pattern can be used for searching, replacing, and manipulating strings in a flexible and efficient way. Regex is widely used in text processing tasks like validation, parsing, and string manipulation.

### **1.2 Why Use Regex?**
- **Text Searching:** Finding specific patterns in text data.
- **Validation:** Ensuring strings conform to a required format (e.g., email addresses).
- **Replacement:** Modifying or replacing parts of strings that match a pattern.
- **Parsing:** Extracting specific information from a text.

## **2. Basic Syntax of Python Regex**

### **2.1 Importing the `re` Module**
In Python, the `re` module provides support for working with regular expressions.

```python
import re
```

### **2.2 Common Metacharacters**
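- **`.`**: Matches any single character except a newline (with `re.DOTALL`, it matches newlines too).
- **`^`**: Matches the start of the string (and the start of each line with `re.MULTILINE`).
- **`$`**: Matches the end of the string or just before a trailing newline (and the end of each line with `re.MULTILINE`).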
- **`*`**: Matches 0 or more repetitions of the preceding element.
- **`+`**: Matches 1 or more repetitions of the preceding element.
- **`?`**: Matches 0 or 1 repetition of the preceding element.
- **`[]`**: Matches any one of the characters inside the square brackets.
- **`|`**: Logical OR, matches the expression before or after the `|`.
- **`()`**: Groups expressions and captures the matched sub-expression.
- **`\`**: Escapes a special character, or denotes a special sequence.

### **2.3 Basic Functions**
- **`re.match()`**: Checks for a match only at the beginning of the string.
- **`re.search()`**: Searches for the first location where the pattern matches.
- **`re.findall()`**: Finds all substrings where the pattern matches and returns them as a list.
- **`re.sub()`**: Replaces occurrences of the pattern with a replacement string.
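
As a rough side-by-side of the four functions listed above, the sketch below runs each one against the same string. The sample text and variable names are made up for illustration.

```python
import re

text = "cat bat cat"  # illustrative sample string

# re.match only tries the pattern at position 0, so "bat" is not found.
at_start = re.match(r"bat", text)
print(at_start)            # None

# re.search scans the string and returns the first match object.
first = re.search(r"bat", text)
print(first.span())        # (4, 7)

# re.findall returns every non-overlapping match as a list of strings.
all_hits = re.findall(r"[cb]at", text)
print(all_hits)            # ['cat', 'bat', 'cat']

# re.sub returns a new string with each match replaced.
replaced = re.sub(r"cat", "dog", text)
print(replaced)            # dog bat dog
```

Note that `re.match()` and `re.search()` return a match object (or `None`), while `re.findall()` and `re.sub()` return plain Python values.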

## **3. Understanding Regex Patterns**

### **3.1 Character Classes**
- **`[abc]`**: Matches any one of `a`, `b`, or `c`.
- **`[^abc]`**: Matches any character except `a`, `b`, or `c`.
- **`[a-z]`**: Matches any lowercase letter.
- **`[0-9]`**: Matches any digit.
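
A small sketch of the character classes above, using made-up sample strings:

```python
import re

sample = "Order A7 shipped to bay c3"  # illustrative sample string

# [aeiou] matches one vowel at a time.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)            # ['e', 'e']

# [^aeiou] negates the set: any character that is not a listed vowel.
non_vowels = re.findall(r"[^aeiou]", "regex")
print(non_vowels)        # ['r', 'g', 'x']

# Ranges can be chained: one lowercase letter followed by one digit.
lower_digit = re.findall(r"[a-z][0-9]", sample)
print(lower_digit)       # ['c3']

# Several ranges can share one set: any ASCII letter followed by one digit.
letter_digit = re.findall(r"[A-Za-z][0-9]", sample)
print(letter_digit)      # ['A7', 'c3']
```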

### **3.2 Special Sequences**
- **`\d`**: Matches any decimal digit. For `str` patterns this includes Unicode digits; it is equivalent to `[0-9]` only with the `re.ASCII` flag.
- **`\D`**: Matches any non-digit character, the opposite of `\d`.
- **`\s`**: Matches any whitespace character (space, tab, newline, and other Unicode whitespace).
- **`\S`**: Matches any non-whitespace character.
- **`\w`**: Matches any word character: letters, digits, and the underscore. It is equivalent to `[a-zA-Z0-9_]` only with the `re.ASCII` flag.
- **`\W`**: Matches any non-word character, the opposite of `\w`.
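
In Python 3, `\d` and `\w` are Unicode-aware by default for `str` patterns, so they can match more than the ASCII ranges. The sketch below contrasts the default behavior with the `re.ASCII` flag; the sample strings are illustrative and use escape sequences so the non-ASCII characters are unambiguous.

```python
import re

# "\u0664\u0662" is the number 42 written with Arabic-Indic digits.
text = "room 42, room \u0664\u0662"

unicode_digits = re.findall(r"\d+", text)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)
print(len(unicode_digits), len(ascii_digits))   # 2 1

# "\u00ef" is i-diaeresis and "\u00e9" is e-acute.
words = "na\u00efve caf\u00e9"

unicode_words = re.findall(r"\w+", words)
ascii_words = re.findall(r"\w+", words, flags=re.ASCII)
print(ascii_words)                              # ['na', 've', 'caf']
```

When only ASCII digits should be accepted (for example in ID or ZIP validation), writing `[0-9]` explicitly or passing `re.ASCII` is one way to make that intent clear.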

### **3.3 Quantifiers**
- **`*`**: Matches 0 or more repetitions.
- **`+`**: Matches 1 or more repetitions.
- **`?`**: Matches 0 or 1 repetition.
- **`{n}`**: Matches exactly `n` repetitions.
- **`{n,}`**: Matches `n` or more repetitions.
- **`{n,m}`**: Matches between `n` and `m` repetitions.
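
Quantifiers are greedy by default: they consume as much text as they can. Appending `?` (as in `*?` or `+?`) makes them match as little as possible. A minimal sketch with made-up input:

```python
import re

html = "<b>one</b> and <b>two</b>"  # illustrative sample string

# Greedy: .* runs to the last </b>, so there is one long match.
greedy = re.findall(r"<b>.*</b>", html)
print(greedy)    # ['<b>one</b> and <b>two</b>']

# Lazy: .*? stops at the first </b>, so each tag pair matches separately.
lazy = re.findall(r"<b>.*?</b>", html)
print(lazy)      # ['<b>one</b>', '<b>two</b>']

# {n,m} with word boundaries: whole numbers that are 2 to 4 digits long.
bounded = re.findall(r"\b\d{2,4}\b", "7 42 2024 123456")
print(bounded)   # ['42', '2024']
```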


## **5. Practical Examples**

### **5.1 Email Validation**
```python
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
email = "test.email@example.com"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
```

### **5.2 Extracting Phone Numbers**
```python
text = "Call me at 123-456-7890 or 987.654.3210"
pattern = r'\b\d{3}[-.]\d{3}[-.]\d{4}\b'
phone_numbers = re.findall(pattern, text)
print(phone_numbers)  # Output: ['123-456-7890', '987.654.3210']
```

### **5.3 Removing HTML Tags**
```python
html = "<p>This is a <b>bold</b> paragraph.</p>"
pattern = r'<.*?>'
clean_text = re.sub(pattern, '', html)
print(clean_text)  # Output: This is a bold paragraph.
```

### **5.4 Password Validation**
```python
pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$%^&*]).{8,}$'
password = "StrongPassw0rd!"
if re.match(pattern, password):
    print("Valid password")
else:
    print("Invalid password")
```

### **5.5 Parsing Log Files**
```python
log_entry = "ERROR 2024-08-17 12:34:56 - Something went wrong"
pattern = r'^(ERROR|WARNING|INFO)\s(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*)$'
match = re.search(pattern, log_entry)
if match:
    level, date, time, message = match.groups()
    print(f"Level: {level}, Date: {date}, Time: {time}, Message: {message}")
```

## **6. Real-World Use Cases**

### **6.1 Data Cleaning in Data Science**
Regex can be used to clean and preprocess data, such as removing unwanted characters, correcting formats, or extracting specific fields.
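
As one possible illustration, the sketch below collapses stray whitespace and pulls a numeric amount out of a messy field. The input string and variable names are hypothetical.

```python
import re

raw = "  Price:   $1,299.00 \t(USD)\n"  # hypothetical messy field

# Collapse every run of whitespace to a single space, then trim the ends.
collapsed = re.sub(r"\s+", " ", raw).strip()
print(collapsed)     # Price: $1,299.00 (USD)

# Capture the amount after the dollar sign, including thousands separators.
amount_text = re.search(r"\$([\d,]+\.\d{2})", collapsed).group(1)
print(amount_text)   # 1,299.00

# Drop the separators with a plain string method and convert to a number.
amount = float(amount_text.replace(",", ""))
print(amount)        # 1299.0
```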

### **6.2 Web Scraping**
In web scraping, regex is useful for extracting data from HTML or JSON responses, such as extracting product prices or titles from e-commerce sites.
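
A minimal sketch of this idea, assuming a small and predictable HTML fragment. The markup, class names, and products are invented for the example.

```python
import re

# Hypothetical fragment of a product listing page.
page = (
    '<div class="product"><h2>Mug</h2><span class="price">$8.50</span></div>'
    '<div class="product"><h2>Lamp</h2><span class="price">$24.99</span></div>'
)

# Group 1 captures the title, group 2 the price without the dollar sign.
item_re = re.compile(r'<h2>(.*?)</h2><span class="price">\$(\d+\.\d{2})</span>')

# With two groups, findall returns a list of (title, price) tuples.
products = [(name, float(price)) for name, price in item_re.findall(page)]
print(products)   # [('Mug', 8.5), ('Lamp', 24.99)]
```

This approach depends on the exact markup, so for larger or less regular pages a dedicated HTML parser is usually the more robust choice.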

### **6.3 Input Validation**
Forms in web applications often require regex to validate inputs like email addresses, phone numbers, ZIP codes, etc.
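
For example, a US ZIP code check might look like the sketch below. The helper name `is_us_zip` is hypothetical, and the pattern only checks the shape of the value (five digits, optionally followed by a hyphen and four more), not whether the ZIP code exists.

```python
import re

# [0-9] is used instead of \d so that only ASCII digits are accepted.
ZIP_RE = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")

def is_us_zip(value: str) -> bool:
    """Return True if the whole string has the shape of a US ZIP code."""
    # fullmatch requires the pattern to cover the entire string.
    return ZIP_RE.fullmatch(value) is not None

print(is_us_zip("12345"))        # True
print(is_us_zip("12345-6789"))   # True
print(is_us_zip("1234"))         # False
print(is_us_zip("12345\n"))      # False
```

Using `fullmatch()` rather than `match()` with `^...$` avoids one subtle issue: `$` also matches just before a trailing newline, so `"12345\n"` would otherwise be accepted.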

### **6.4 Log File Analysis**
System administrators and developers often use regex to parse log files and extract meaningful data, like timestamps or error messages.
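
The sketch below extends the single-line pattern from section 5.5 to a multi-line log using named groups and `re.MULTILINE`. The log lines are invented sample data in the same format as that example.

```python
import re
from collections import Counter

# Invented sample log in the "LEVEL date time - message" format.
log_text = """\
INFO 2024-08-17 12:00:01 - Service started
ERROR 2024-08-17 12:34:56 - Something went wrong
WARNING 2024-08-17 12:35:10 - Retrying request
ERROR 2024-08-17 12:36:02 - Retry failed
"""

# re.MULTILINE lets ^ and $ anchor at every line, not just the whole string.
line_re = re.compile(
    r"^(?P<level>ERROR|WARNING|INFO)\s(?P<date>\d{4}-\d{2}-\d{2})\s"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s-\s(?P<message>.*)$",
    re.MULTILINE,
)

# Count entries per level and collect the messages of all ERROR lines.
matches = list(line_re.finditer(log_text))
level_counts = Counter(m["level"] for m in matches)
error_messages = [m["message"] for m in matches if m["level"] == "ERROR"]

print(level_counts["ERROR"])   # 2
print(error_messages)          # ['Something went wrong', 'Retry failed']
```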

### **6.5 Natural Language Processing (NLP)**
Regex is used in NLP tasks for tokenization, text normalization, and entity extraction.
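
A very small tokenization and normalization sketch follows. The patterns are deliberately simple and are assumptions for illustration; real NLP pipelines typically rely on dedicated tokenizers.

```python
import re

sentence = "Hello, world! It's 2024."  # illustrative sample sentence

# Tokenization: a word (optionally with an inner apostrophe or hyphen part)
# or any single character that is neither a word character nor whitespace.
tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)
print(tokens)   # ['Hello', ',', 'world', '!', "It's", '2024', '.']

# Normalization: lowercase the text and keep only alphabetic words.
words = re.findall(r"[a-z']+", sentence.lower())
print(words)    # ['hello', 'world', "it's"]
```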

## **7. Performance Considerations**

### **7.1 Efficiency**
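- Use raw strings (e.g., `r'pattern'`) so that backslashes reach the regex engine unchanged. This is about correctness and readability rather than speed.
- Pre-compile a pattern with `re.compile()` when the same pattern is used many times, for example inside a loop.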
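
Both points can be sketched briefly. The names `WORD_RE` and `lines` are made up for the example.

```python
import re

# A raw string keeps the backslash; in a normal string "\b" is a backspace.
assert r"\bcat\b" == "\\bcat\\b"
assert "\b" == "\x08"

# Compile once, then reuse the pattern object for every line.
WORD_RE = re.compile(r"\b\w+\b")

lines = ["alpha beta", "gamma", "delta epsilon zeta"]
word_counts = [len(WORD_RE.findall(line)) for line in lines]
print(word_counts)   # [2, 1, 3]
```

The `re` module also caches recently compiled patterns internally, so the speed gain from `re.compile()` is often modest. Naming a pattern once and reusing it is still a readability benefit.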

### **7.2 Alternatives**
For simple string searches or replacements, consider using Python’s built-in string methods like `str.find()`, `str.replace()`, or `str.split()` instead of regex.
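
As a rough comparison, the sketch below parses the same made-up `key=value` string with string methods and with a regex. Both produce the same result here, and the string-method version needs no pattern at all.

```python
import re

text = "id=42;name=mug;price=8.50"  # illustrative sample string

# Plain string methods: split into pairs, then split each pair once on "=".
fields = dict(pair.split("=", 1) for pair in text.split(";"))
print(fields)        # {'id': '42', 'name': 'mug', 'price': '8.50'}

# Substring tests and simple replacements also need no regex.
has_name = "name=" in text
as_query = text.replace(";", "&")
print(as_query)      # id=42&name=mug&price=8.50

# Regex version of the same parse, for comparison.
fields_re = dict(re.findall(r"(\w+)=([^;]+)", text))
print(fields_re == fields)   # True
```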

## **8. Conclusion**

Regular expressions are powerful tools for text processing and can handle a wide range of tasks, from simple searches to complex text manipulations. Mastering regex will greatly enhance your ability to work with strings and text data in Python.

---

Configuring the root logger with `logging.basicConfig()`:

```python
import logging

# Configure the root logger
logging.basicConfig(
    level=logging.DEBUG,               # Minimum level to emit (DEBUG, INFO, WARNING, ERROR, CRITICAL)
    format='%(asctime)s - %(levelname)s - %(message)s',  # Log message format
    datefmt='%Y-%m-%d %H:%M:%S'        # Date and time format
)

# Example usage of logging
logging.debug('This is a debug message')
logging.info('This is an info message')
logging.warning('This is a warning message')
logging.error('This is an error message')
logging.critical('This is a critical message')
```

Output:

```
2024-08-20 12:19:28 - DEBUG - This is a debug message
2024-08-20 12:19:28 - INFO - This is an info message
2024-08-20 12:19:28 - WARNING - This is a warning message
2024-08-20 12:19:28 - ERROR - This is an error message
2024-08-20 12:19:28 - CRITICAL - This is a critical message
```


A named logger with separate file and console handlers:

```python
import logging

# Create a logger object
logger = logging.getLogger('my_logger')
logger.setLevel(logging.DEBUG)  # Set the logging level

# Create a file handler that logs messages to a file
file_handler = logging.FileHandler('app.log')
file_handler.setLevel(logging.DEBUG)

# Create a console handler that logs messages to the console
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.DEBUG)

# Define a formatter for the log messages
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s', datefmt='%Y-%m-%d %H:%M:%S')

# Set the formatter for both handlers
file_handler.setFormatter(formatter)
console_handler.setFormatter(formatter)

# Add the handlers to the logger
logger.addHandler(file_handler)
logger.addHandler(console_handler)

# Example usage of logging
logger.debug('This is a debug message')
logger.info('This is an info message')
logger.warning('This is a warning message')
logger.error('This is an error message')
logger.critical('This is a critical message')
```

Console output (each message appears twice, most likely because `my_logger` also propagates records to the root logger that was configured with `basicConfig()` above; setting `logger.propagate = False` would prevent that):

```
2024-08-20 12:20:45 - DEBUG - This is a debug message
2024-08-20 12:20:45 - DEBUG - This is a debug message
2024-08-20 12:20:45 - INFO - This is an info message
2024-08-20 12:20:45 - INFO - This is an info message
2024-08-20 12:20:45 - WARNING - This is a warning message
2024-08-20 12:20:45 - WARNING - This is a warning message
2024-08-20 12:20:45 - ERROR - This is an error message
2024-08-20 12:20:45 - ERROR - This is an error message
2024-08-20 12:20:45 - CRITICAL - This is a critical message
2024-08-20 12:20:45 - CRITICAL - This is a critical message
```


---

# **Python Regex Examples**

## **1. Basic Syntax Examples**

### **1.1 `.` (Dot) Metacharacter**
- **Definition**: Matches any single character except a newline.

```python
import re

pattern = r'.'
text = "abc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'b', 'c']
```

### **1.2 `^` and `$` Anchors**
- **Definition**: `^` matches the start of a string, and `$` matches the end.

```python
# `^` example
pattern = r'^Hello'
text = "Hello World"
match = re.match(pattern, text)
print(bool(match))  # Output: True

# `$` example
pattern = r'World$'
match = re.search(pattern, text)
print(bool(match))  # Output: True
```

### **1.3 `*` (Zero or More)**
- **Definition**: Matches 0 or more repetitions of the preceding element.

```python
pattern = r'ab*c'
text = "ac abc abbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['ac', 'abc', 'abbc']
```

### **1.4 `+` (One or More)**
- **Definition**: Matches 1 or more repetitions of the preceding element.

```python
pattern = r'ab+c'
text = "ac abc abbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['abc', 'abbc']
```

### **1.5 `?` (Zero or One)**
- **Definition**: Matches 0 or 1 repetition of the preceding element.

```python
pattern = r'ab?c'
text = "ac abc abbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['ac', 'abc']
```

### **1.6 `[]` (Character Set)**
- **Definition**: Matches any one of the characters inside the square brackets.

```python
pattern = r'[aeiou]'
text = "hello"
matches = re.findall(pattern, text)
print(matches)  # Output: ['e', 'o']
```

### **1.7 `|` (Alternation)**
- **Definition**: Matches the expression before or after the `|`.

```python
pattern = r'cat|dog'
text = "I have a cat and a dog."
matches = re.findall(pattern, text)
print(matches)  # Output: ['cat', 'dog']
```

### **1.8 `()` (Grouping)**
- **Definition**: Groups expressions and captures the matched sub-expression.

```python
pattern = r'(ab)+'
text = "ababab"
match = re.match(pattern, text)
print(match.group())  # Output: ababab
```

### **1.9 `\` (Escape)**
- **Definition**: Escapes a special character or denotes a special sequence.

```python
pattern = r'\.'
text = "file.txt"
match = re.search(pattern, text)
print(match.group())  # Output: .
```

## **2. Special Sequences Examples**

### **2.1 `\d` (Digit)**
- **Definition**: Matches any decimal digit; equivalent to `[0-9]` when the `re.ASCII` flag is set.

```python
pattern = r'\d+'
text = "There are 123 apples"
matches = re.findall(pattern, text)
print(matches)  # Output: ['123']
```

### **2.2 `\D` (Non-Digit)**
- **Definition**: Matches any non-digit; equivalent to `[^0-9]` when the `re.ASCII` flag is set.

```python
pattern = r'\D+'
text = "There are 123 apples"
matches = re.findall(pattern, text)
print(matches)  # Output: ['There are ', ' apples']
```

### **2.3 `\s` (Whitespace)**
- **Definition**: Matches any whitespace character (space, tab, newline).

```python
pattern = r'\s+'
text = "Hello World"
matches = re.findall(pattern, text)
print(matches)  # Output: [' ']
```

### **2.4 `\S` (Non-Whitespace)**
- **Definition**: Matches any non-whitespace character.

```python
pattern = r'\S+'
text = "Hello World"
matches = re.findall(pattern, text)
print(matches)  # Output: ['Hello', 'World']
```

### **2.5 `\w` (Word Character)**
- **Definition**: Matches any word character (letters, digits, and the underscore); equivalent to `[a-zA-Z0-9_]` when the `re.ASCII` flag is set.

```python
pattern = r'\w+'
text = "Python_3.8"
matches = re.findall(pattern, text)
print(matches)  # Output: ['Python_3', '8']
```

### **2.6 `\W` (Non-Word Character)**
- **Definition**: Matches any non-word character.

```python
pattern = r'\W+'
text = "Python_3.8!"
matches = re.findall(pattern, text)
print(matches)  # Output: ['.', '!']
```

## **3. Quantifiers Examples**

### **3.1 `*` (Zero or More)**
```python
pattern = r'ab*c'
text = "ac abc abbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['ac', 'abc', 'abbc']
```

### **3.2 `+` (One or More)**
```python
pattern = r'ab+c'
text = "ac abc abbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['abc', 'abbc']
```

### **3.3 `?` (Zero or One)**
```python
pattern = r'ab?c'
text = "ac abc abbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['ac', 'abc']
```

### **3.4 `{n}` (Exactly `n` Times)**
```python
pattern = r'ab{2}c'
text = "abc abbc abbbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['abbc']
```

### **3.5 `{n,}` (At Least `n` Times)**
```python
pattern = r'ab{2,}c'
text = "abc abbc abbbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['abbc', 'abbbc']
```

### **3.6 `{n,m}` (Between `n` and `m` Times)**
```python
pattern = r'ab{2,3}c'
text = "abc abbc abbbc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['abbc', 'abbbc']
```
