## Regular Expressions (Regex)
- A regular expression (regex) is a sequence of characters that defines a search pattern.
- Regexes are commonly used for text matching, manipulation, and validation.
- Regex provides a powerful and flexible way to work with strings, allowing you to search, match, split, and replace text based on specific patterns.

### Regex Syntax and Patterns

### Basic Functions in re

| Function       | Description                                             |
| -------------- | ------------------------------------------------------- |
| `re.search()`  | Searches for a match anywhere in the string             |
| `re.match()`   | Checks for a match **only at the beginning**            |
| `re.findall()` | Returns all non-overlapping matches as a list           |
| `re.sub()`     | Replaces occurrences of the pattern with another string |
| `re.split()`   | Splits a string at each match of the pattern            |
| `re.compile()` | Compiles regex pattern for reuse                        |
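
The difference between `re.match()` and `re.search()`, and the use of a compiled pattern, can be sketched as follows. The sample string and variable names are illustrative assumptions, not part of the original notebook.

```python
import re

text = "Hello World"

# re.match only succeeds if the pattern matches at position 0.
match_at_start = re.match(r"World", text)     # None: "World" is not at the start
# re.search scans the whole string for the first match.
search_anywhere = re.search(r"World", text)   # match object covering "World"

# A compiled pattern exposes the same operations as methods and can be reused.
word = re.compile(r"\w+")
words = word.findall(text)

print(match_at_start)           # None
print(search_anywhere.span())   # (6, 11)
print(words)                    # ['Hello', 'World']
```

Compiling is mostly a readability and reuse convenience when the same pattern is applied many times.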


### Special Characters and Patterns in Regex

| Symbol   | Meaning                        | Example                                          |
|----------|--------------------------------|--------------------------------------------------|
| `.`      | Any character except newline   | `a.c` matches `"abc"`, `"a3c"`                   |
| `^`      | Start of string (or of each line with `re.MULTILINE`) | `^Hello` matches `"Hello World"`                |
| `$`      | End of string (or of each line with `re.MULTILINE`)   | `world$` matches `"Hello world"`                |
| `*`      | 0 or more occurrences of the preceding character or group        | `lo*` matches `"l"`, `"lo"`, `"loo"`            |
| `+`      | 1 or more repetitions          | `lo+` matches `"lo"`, `"loo"`                   |
| `?`      | 0 or 1 occurrence              | `lo?` matches `"l"`, `"lo"`                     |
| `{n}`    | Exactly n repetitions          | `a{3}` matches `"aaa"`                          |
| `{n,}`   | At least n repetitions         | `a{2,}` matches `"aa"`, `"aaa"`, `"aaaa"`, etc. |
| `{n,m}`  | Between n and m repetitions    | `a{2,4}` matches `"aa"`, `"aaa"`, `"aaaa"`      |
| `[]`     | Matches any character within the brackets      | `[aeiou]` matches `"a"`, `"e"`, etc.            |
| `\|`     | OR operator                    | `cat\|dog` matches `"cat"` or `"dog"`           |
| `()`     | Grouping                       | `(ab)+` matches `"ab"`, `"abab"`, etc.          |
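
A few rows of the table can be checked directly. This is a minimal sketch: `matches` is a hypothetical helper built on `re.fullmatch`, which succeeds only when the whole string fits the pattern, and the test strings are made up for illustration.

```python
import re

def matches(pattern, s):
    """Hypothetical helper: True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches(r"a.c", "a3c"))        # True: "." stands for any one character
print(matches(r"lo*", "l"))          # True: "*" allows zero "o"s
print(matches(r"lo+", "l"))          # False: "+" needs at least one "o"
print(matches(r"a{2,4}", "aaaaa"))   # False: five "a"s exceed the upper bound
print(matches(r"(ab)+", "abab"))     # True: the group "ab" repeats
print(matches(r"cat|dog", "dog"))    # True: either alternative may match

# "^" anchors at the start of the string, or of every line with re.MULTILINE.
lines = "one two\nthree four"
first_only = re.findall(r"^\w+", lines)
each_line = re.findall(r"^\w+", lines, flags=re.MULTILINE)
print(first_only)   # ['one']
print(each_line)    # ['one', 'three']
```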

### Character Classes

| Pattern | Description                                  |
|---------|----------------------------------------------|
| `\d`    | Digit (0–9; also other Unicode digits in `str` patterns) |
| `\D`    | Non-digit                                    |
| `\w`    | Word character (a-z, A-Z, 0-9, _; also other Unicode letters and digits in `str` patterns) |
| `\W`    | Non-word character                           |
| `\s`    | Whitespace (space, tab, newline)             |
| `\S`    | Non-whitespace                               |
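
As a quick illustration of these classes, the sketch below applies them to a made-up sample string (an assumption for demonstration only).

```python
import re

sample = "Order #42: 3 items\tshipped"

digits = re.findall(r"\d+", sample)          # runs of digits
words = re.findall(r"\w+", sample)           # runs of word characters
spaces = re.findall(r"\s", sample)           # each whitespace character
symbols = re.findall(r"[^\w\s]", sample)     # neither word nor whitespace

print(digits)    # ['42', '3']
print(words)     # ['Order', '42', '3', 'items', 'shipped']
print(spaces)    # [' ', ' ', ' ', '\t']
print(symbols)   # ['#', ':']
```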


### Data Extraction and Matching

In [None]:
import re


#### re.search()
Searches for the first occurrence of the pattern anywhere in the string.

Example 1: Extraction of Phone Numbers

In [27]:
text = "My phone number is 123-456-7890."
pattern = r"\d{3}-\d{3}-\d{4}"
match = re.search(pattern, text)
if match:
    phone_number = match.group()
    print("Phone number found:", phone_number)
else:
    print("Phone number not found.")

Phone number found: 123-456-7890
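
The match object returned by `re.search()` carries more than the matched text. As a small sketch, wrapping parts of the same pattern in parentheses makes each part available separately; the names `first`, `middle`, and `last` are illustrative assumptions.

```python
import re

text = "My phone number is 123-456-7890."

# Parentheses create capture groups, numbered from 1.
match = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)
if match:
    first, middle, last = match.groups()
    print(match.group(0))         # whole match: 123-456-7890
    print(first, middle, last)    # 123 456 7890
    print(match.span())           # (19, 31): start and end index in text
```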


#### re.findall()
Returns all matches as a list.

In [5]:
emails = "melodybonareri@gmail.com, alice@yahoo.com, johnyahoo.com"
matches = re.findall(r'\w+@\w+\.\w+', emails)
print(matches)

['melodybonareri@gmail.com', 'alice@yahoo.com']


| Part  | Meaning                                                              |
| ----- | -------------------------------------------------------------------- |
| `\w+` | One or more **word characters** (letters, numbers, or underscore)    |
| `@`   | The **@ symbol** (as-is)                                             |
| `\w+` | One or more word characters again (for the domain name like "gmail") |
| `\.`  | A **literal dot `.`** (escaped because `.` means "any character")    |
| `\w+` | One or more word characters (for the domain suffix like "com")       |


Example 2: Extracting Dates

To extract dates in the format “DD/MM/YYYY”, you can use:

In [29]:
text = "The event will be held on 25/12/2024."
pattern = r"\b\d{2}/\d{2}/\d{4}\b"
dates = re.findall(pattern, text)
print("Dates found:", dates)

Dates found: ['25/12/2024']
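
When the pattern contains capture groups, `re.findall()` returns a tuple of groups for each match instead of the whole string. The sketch below uses that to turn the extracted text into `datetime.date` objects; the second date in the sample text is an assumption added for illustration.

```python
import re
from datetime import date

text = "The event will be held on 25/12/2024, with a follow-up on 05/01/2025."

# Three groups: day, month, year.
parts = re.findall(r"\b(\d{2})/(\d{2})/(\d{4})\b", text)
print(parts)    # [('25', '12', '2024'), ('05', '01', '2025')]

# Convert the string pieces to real date objects.
dates = [date(int(y), int(m), int(d)) for d, m, y in parts]
print(dates)    # [datetime.date(2024, 12, 25), datetime.date(2025, 1, 5)]
```

Note that the regex checks only the shape of the text; `date(...)` would raise `ValueError` for an impossible date such as 31/02/2024.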


Example 3: Extracting URLs

To extract URLs from a block of text, you can use the following regex pattern:

In [30]:
text = "Visit us at https://www.example.com or http://example.org for more information."
pattern = r"https?://[^\s]+"
urls = re.findall(pattern, text)
print("URLs found:", urls)

URLs found: ['https://www.example.com', 'http://example.org']


| Part     | Meaning                                                                |
| -------- | ---------------------------------------------------------------------- |
| `https?` | Matches `http` or `https` (`s?` means the **"s" is optional**)         |
| `://`    | Matches the literal `://` that follows in a URL                        |
| `[^\s]+` | Matches **one or more characters that are NOT whitespace**             |
| `[^...]` | A **negated character class**: match anything **except** what's inside |
| `\s`     | Any **whitespace character** (space, tab, newline)                     |
| `+`      | Match **one or more** of the previous pattern (i.e., non-whitespace)   |


#### re.sub()
Replaces matched patterns in a string.

In [6]:
text = "Python is awesome!"
new_text = re.sub(r'awesome', 'great', text)
print(new_text)

Python is great!
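
`re.sub()` can also reuse parts of the match in the replacement. As a minimal sketch, backreferences such as `\1` refer to capture groups, and the `count` argument limits how many replacements are made; the sample strings are assumptions for illustration.

```python
import re

text = "The event will be held on 25/12/2024."

# Reorder DD/MM/YYYY into YYYY-MM-DD using backreferences to the three groups.
iso = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", text)
print(iso)      # The event will be held on 2024-12-25.

# Replace only the first four digits.
masked = re.sub(r"\d", "X", "123-456-7890", count=4)
print(masked)   # XXX-X56-7890
```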


#### re.split()
Splits the string at every match of the pattern.

In [None]:
data = "apple,banana;grape orange"
parts = re.split(r'[;, ]+', data)
print(parts)

['apple', 'banana', 'grape', 'orange']


Using ```re.split(r'[;, ]+', data)``` will split at:

- ```,``` between apple and banana
- ```;``` between banana and grape
- ```' '``` (space) between grape and orange
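
Two variations on the same split are sketched here: `maxsplit` stops after a given number of splits, and a capture group around the delimiter keeps the delimiters in the result. The last line shows that a leading delimiter produces an empty first element.

```python
import re

data = "apple,banana;grape orange"

# Split only once, at the first delimiter.
first_cut = re.split(r"[;, ]+", data, maxsplit=1)
print(first_cut)     # ['apple', 'banana;grape orange']

# A capturing group keeps the delimiters in the output list.
with_delims = re.split(r"([;, ]+)", data)
print(with_delims)   # ['apple', ',', 'banana', ';', 'grape', ' ', 'orange']

# A delimiter at the very start yields an empty string first.
leading = re.split(r"[;, ]+", ", apple")
print(leading)       # ['', 'apple']
```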

In [13]:
messy_data = """
Customer Name: Jane Doe, Email: janedoe@gmail.com, Phone: +254-712-345678
Name: John Smith; Email: john_smith@company.co.ke; Tel: 0712345678
Customer=Alice; Email=alice@my-mail.org, Contact=+254712345679
Username: mike.1990! Email --> mike1990@yahoo.com  Number: 0722 334 445
random text here, no useful data

Steve | steve@outlook.com | 0700-112-233 | Nairobi
"""


#### Practice Tasks with Regex

##### 1. Extract All Email Addresses

In [15]:
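emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', messy_data)
print("Emails:", emails)


Emails: ['janedoe@gmail.com', 'john_smith@company.co.ke', 'alice@my-mail.org', 'mike1990@yahoo.com', 'steve@outlook.com']


- `[\w.+-]+` matches the part before the `@`: letters, digits, underscores, and the characters `.`, `+`, `-`
- `[\w-]+` matches the first part of the domain, allowing hyphens (as in `my-mail`)
- `(?:\.[\w-]+)+` matches one or more dot-separated suffixes, so multi-part domains such as `company.co.ke` are captured whole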

##### 2. Extract All Phone Numbers

In [38]:
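phones = re.findall(r'\+\d[\d\s-]{7,}\d|07\d[\d\s-]{6,}\d', messy_data)
print("Phone Numbers:", phones)

Phone Numbers: ['+254-712-345678', '0712345678', '+254712345679', '0722 334 445', '0700-112-233']


The pattern has two alternatives separated by `|`:
- `\+\d[\d\s-]{7,}\d` matches international numbers: a literal plus sign (for +254), a digit, at least 7 characters from the class below, and a final digit
- `07\d[\d\s-]{6,}\d` matches local numbers that start with `07`
- `[\d\s-]` matches any one of:
  - `\d`: any digit (0–9)
  - `\s`: any whitespace (space, tab, newline)
  - `-`: a hyphen
- Ending each alternative with `\d` keeps trailing spaces and newlines out of the match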

##### 3. Extract Only Names with Letters

In [20]:
names = re.findall(r'[A-Z][a-z]+\s[A-Z][a-z]+', messy_data)
print("Names:", names)

Names: ['Customer Name', 'Jane Doe', 'John Smith']


| Component | Explanation                                                         |
| --------- | ------------------------------------------------------------------- |
| `[A-Z]`   | Matches **one uppercase letter** (A to Z) — start of the first name |
| `[a-z]+`  | Matches **one or more lowercase letters** — rest of the first name  |
| `\s`      | Matches a **single whitespace character** (like a space)            |
| `[A-Z]`   | Matches **one uppercase letter** — start of the last name           |
| `[a-z]+`  | Matches **one or more lowercase letters** — rest of the last name   |
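
Because the pattern accepts any two capitalized words, the output above also contains the label `'Customer Name'`. One possible refinement, sketched here on a shortened copy of the data, is a lookbehind `(?<=Name: )` that requires the text `Name: ` immediately before the match without including it. This is an assumption about how the names are labeled, not a general solution.

```python
import re

# Shortened sample taken from the first two lines of messy_data.
sample = (
    "Customer Name: Jane Doe, Email: janedoe@gmail.com\n"
    "Name: John Smith; Email: john_smith@company.co.ke"
)

# (?<=Name: ) is a lookbehind: it checks the preceding text but does not consume it.
names = re.findall(r"(?<=Name: )[A-Z][a-z]+ [A-Z][a-z]+", sample)
print(names)   # ['Jane Doe', 'John Smith']
```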


##### 4. Extract All Non-Word Characters

In [21]:
non_word_chars = re.findall(r'\W', messy_data)
print("Non-word characters:", set(non_word_chars))

Non-word characters: {'-', '|', '.', '!', '+', '\n', ',', '=', ';', ':', '>', '@', ' '}


## Applications of Regex
Regular expressions (regex) have a wide range of applications across various fields, particularly in programming, data science, and text processing. Here are some key applications of regex:

### 1. Data Validation

Regex is commonly used to validate user input, ensuring that it conforms to expected formats. For example, regex can check if an email address, phone number, or credit card number is formatted correctly. This is crucial in web forms and applications to prevent invalid data entry.
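
A minimal validation sketch, assuming a deliberately simplified email pattern (real-world address rules are broader than this). `EMAIL_RE` and `is_valid_email` are hypothetical names; `re.fullmatch` is used so that the whole input, not just a part of it, must fit the pattern.

```python
import re

# Simplified pattern: local part, "@", then a domain with at least one dot.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def is_valid_email(value):
    """Return True if the entire string looks like an email address."""
    return EMAIL_RE.fullmatch(value) is not None

for candidate in ["alice@yahoo.com", "johnyahoo.com", "alice@yahoo"]:
    print(candidate, is_valid_email(candidate))
# alice@yahoo.com True
# johnyahoo.com False
# alice@yahoo False
```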

### 2. Data Extraction

In data science and web scraping, regex is invaluable for extracting specific information from unstructured text. It can be used to pull out email addresses, URLs, or any other patterns from large datasets. For instance, regex can efficiently extract links from HTML content or find specific data points in log files.

### 3. Text Manipulation

Regex allows for powerful text manipulation capabilities, such as replacing substrings, splitting strings, and reformatting text. This is useful in scenarios where you need to clean or transform data, such as changing date formats or removing unwanted characters from strings.
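
As a small cleaning sketch, the following collapses runs of whitespace and removes repeated punctuation. The input string is made up for illustration.

```python
import re

raw = "  Hello,,   world!!  \t This   is   messy.  "

# Collapse every run of whitespace into one space, then trim the ends.
collapsed = re.sub(r"\s+", " ", raw).strip()
print(collapsed)   # Hello,, world!! This is messy.

# ([,!])\1+ finds a comma or exclamation mark repeated one or more times
# and keeps a single copy.
cleaned = re.sub(r"([,!])\1+", r"\1", collapsed)
print(cleaned)     # Hello, world! This is messy.
```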

### 4. Search and Replace

In text editors and programming environments, regex is often used for search-and-replace functionality. This allows users to find complex patterns in the text and replace them with new content, which can save significant time compared to manual editing.

### 5. Natural Language Processing (NLP)

In NLP, regex is employed for tasks such as tokenization, where text is split into meaningful components. It can also be used for identifying specific patterns in text data, such as hashtags in social media posts or keywords in documents, aiding in text analysis and sentiment analysis.
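
Both ideas can be sketched in a few lines. The sample post and sentence are assumptions, and the tokenizer is intentionally naive: it splits text into word runs and single punctuation marks, which breaks contractions apart.

```python
import re

post = "Loving the new #Python release! #regex #NLP_101"

# A hashtag is "#" followed by one or more word characters.
hashtags = re.findall(r"#\w+", post)
print(hashtags)   # ['#Python', '#regex', '#NLP_101']

# Naive tokenizer: a run of word characters, or one non-space symbol.
tokens = re.findall(r"\w+|[^\w\s]", "Regex isn't magic, is it?")
print(tokens)     # ['Regex', 'isn', "'", 't', 'magic', ',', 'is', 'it', '?']
```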

### 6. Log Analysis

Regex is frequently used in log file analysis to filter and extract relevant information from large volumes of log data. This can help in monitoring system performance, identifying errors, and troubleshooting issues by searching for specific patterns in the logs.
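
A minimal sketch of this idea, assuming a hypothetical log format of `date time LEVEL message` (real log formats vary). Named groups (`?P<name>...`) make the extracted fields easier to read, and `re.MULTILINE` lets `^` and `$` anchor on each line.

```python
import re
from collections import Counter

# Hypothetical log excerpt; the format is an assumption for illustration.
log = """\
2024-05-01 12:00:01 INFO Service started
2024-05-01 12:00:05 ERROR Database connection failed
2024-05-01 12:00:09 WARNING Disk usage at 91%
2024-05-01 12:01:13 ERROR Timeout while calling payment API
"""

line_re = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$",
    re.MULTILINE,
)

# Count entries per level and collect the ERROR messages.
levels = Counter(m.group("level") for m in line_re.finditer(log))
errors = [m.group("message") for m in line_re.finditer(log)
          if m.group("level") == "ERROR"]

print(levels["ERROR"])   # 2
print(errors)            # ['Database connection failed', 'Timeout while calling payment API']
```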