  
  
# Chapter 3 - Part II: Text Preprocessing for NLP

<img src="https://user-images.githubusercontent.com/7065401/55025843-7d99a280-4fe0-11e9-938a-4879d95c4130.png"
    style="width:150px; float: right; margin: 0 40px 40px 40px;"></img>
    
<img src="https://www.searchenginejournal.com/wp-content/uploads/2020/08/an-introduction-to-natural-language-processing-with-python-for-seos-5f3519eeb8368-1520x800.webp" style="width:300px; float: left; margin: 0 40px 40px 40px;"></img>

    


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Part II: Information Extraction - A glance at Regex for NLP

### 1- What is a Regular Expression?

A regular expression (regex) is a sequence of characters that defines a search pattern. It can be used to match strings or parts of strings based on specific criteria. Regular expressions are highly flexible and allow you to define complex patterns for text manipulation.
You can think of regex as a more powerful version of the find (Ctrl+F) and find-and-replace features of a text editor.

### 2- Common Use Cases:

**Validation:** Regex is used to validate user input, such as email addresses, phone numbers, and passwords.

**Search and Replace:** It is used for searching and replacing text within a document or a string.

**Data Extraction:** Regex can extract specific data from structured or unstructured text.

**Text Parsing:** It helps in parsing and analyzing text data, such as log files and configuration files.

**Pattern Matching:** You can use regex to find patterns in text, like dates, URLs, or IP addresses.
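
The use cases above can be illustrated with Python's `re` module. This is only a minimal sketch: the sample log line and the 4-digit PIN rule are made up for the illustration and do not come from the course material.

```python
import re

# Hypothetical sample text, invented for this illustration.
log_line = "2024-09-24 ERROR user=amina ip=192.168.1.10 msg='disk  full'"

# Validation: does the whole string consist of exactly 4 digits?
is_valid_pin = re.fullmatch(r"\d{4}", "1234") is not None

# Search and replace: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", log_line)

# Data extraction: capture the value that follows "user=".
user = re.search(r"user=(\w+)", log_line).group(1)

# Pattern matching: find anything shaped like a date or an IPv4 address.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", log_line)
ips = re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", log_line)

print(is_valid_pin, user, dates, ips)
print(cleaned)
```

Note that these patterns only check the shape of the text. For instance, the IP pattern would also accept `999.999.999.999`.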


### 3- Basic Syntax:

**Literals:** Most characters in a regex pattern are treated as literals and match themselves. For example, the pattern "abc" matches the string "abc" exactly.

**Metacharacters:** Characters with a special meaning, such as:

- **`.`** : Matches any single character except a newline.
  - Example: `a.b` matches "aab", "acb", "a1b", etc.
  
- **`*`** : Matches zero or more occurrences of the preceding element (a character, class, or group).
  - Example: `ab*` matches "a", "ab", "abb", "abbb", etc.

- **`+`** : Matches one or more occurrences of the preceding element.
  - Example: `ab+` matches "ab", "abb", "abbb", but not "a".

- **`?`** : Matches zero or one occurrence of the preceding element.
  - Example: `ab?` matches "a" or "ab".

**Character Classes:** Square brackets **`[ ]`** define character classes. For example, `[aeiou]` matches any single lowercase vowel.

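**Anchors:** **`^`** matches the start of the string, and **`$`** matches the end of the string (or just before a trailing newline). With Python's `re.MULTILINE` flag, they also match at the start and end of each line.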

**Quantifiers:**  **`{}`** define the number of occurrences. For example, **`\d{2,4}`** matches 2 to 4 digits.

**Escape Sequences:** Backslashes **`\`** are used to escape metacharacters when you want to match them as literals.
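
The rules above can be checked with a short script. This is a sketch that uses `re.fullmatch`, which only succeeds when the pattern covers the entire string. The test strings are chosen for illustration.

```python
import re

# Each entry: (pattern, candidate string, whether the whole string should match)
checks = [
    (r"a.b", "acb", True),         # . matches any single character
    (r"ab*", "a", True),           # * allows zero occurrences of "b"
    (r"ab+", "a", False),          # + needs at least one "b"
    (r"ab?", "abb", False),        # ? allows at most one "b"
    (r"[aeiou]", "e", True),       # character class: one lowercase vowel
    (r"\d{2,4}", "12345", False),  # {2,4}: five digits is one too many
    (r"a\.b", "acb", False),       # \. matches a literal dot only
    (r"a\.b", "a.b", True),
]

results = []
for pattern, text, expected in checks:
    matched = re.fullmatch(pattern, text) is not None
    results.append(matched)
    print(f"{pattern!r:12} {text!r:10} -> {matched}")
```

Patterns are written as raw strings (`r"..."`) so that Python passes backslashes such as `\d` and `\.` to the regex engine unchanged.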


![image.png](attachment:image.png)

### 4- Examples:

* **`^abc`**: Matches any string that starts with "abc".
* **`.\d{2,4}`**: Matches any character followed by 2 to 4 digits.
* **`[A-Za-z]+`**: Matches one or more uppercase or lowercase letters.
* **`\b\d{5}\b`**: Matches a standalone 5-digit number (like a ZIP code) delimited by word boundaries.
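
As a quick check, the four patterns above can be run against a short sample string. The sample is made up for this sketch.

```python
import re

# Hypothetical sample string, invented for this illustration.
sample = "abc x42 ZIP 90210"

starts_with_abc = re.search(r"^abc", sample) is not None
char_then_digits = re.findall(r".\d{2,4}", sample)
letter_runs = re.findall(r"[A-Za-z]+", sample)
five_digit_numbers = re.findall(r"\b\d{5}\b", sample)

print(starts_with_abc)      # True
print(char_then_digits)     # ['x42', ' 9021']
print(letter_runs)          # ['abc', 'x', 'ZIP']
print(five_digit_numbers)   # ['90210']
```

The second result shows two details that are easy to miss: `.` also matches a space, and `{2,4}` stops after four digits, so the last digit of `90210` is left out.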


### 5- Resources:

- Online regex testers: tools like RegExr, Regex101, and RegexPlanet let you experiment with regular expressions and see the matches in real time.
- Documentation: refer to the documentation of Python's `re` module for Python-specific regex syntax and functions.
- ChatGPT 😉


Regular expressions are a valuable skill for anyone working with text data or performing text processing tasks in programming. They can save you time and help you manipulate text effectively.

### Resource 1: https://www.programiz.com/python-programming/regex

![image.png](attachment:image.png)


In [None]:
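import re

# Raw string: "a", any three characters, then "s", anchored at both ends
pattern = r'^a...s$'
test_string = 'abyss'
# re.match tries the pattern at the beginning of the string
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")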
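
The cell above uses `re.match`, which only looks at the beginning of the string. The sketch below contrasts it with `re.search` and `re.fullmatch`. The test strings are chosen for illustration.

```python
import re

text = "the abyss"

# re.match only tries position 0, where the text has "t" rather than "a".
at_start = re.match(r"a...s", text)

# re.search scans the whole string and finds "abyss" further along.
anywhere = re.search(r"a...s", text)

# re.fullmatch requires the pattern to cover the entire string.
whole = re.fullmatch(r"a...s", "abyss!")

print(at_start)          # None
print(anywhere.group())  # abyss
print(whole)             # None
```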


![image.png](attachment:image.png)

### Resource 2: https://docs.python.org/3/library/re.html

### Resource 3: Regex101

"Regex101" is a popular online platform for testing and experimenting with regular expressions (regex). It provides a user-friendly interface where you can input your regular expression patterns and test them against sample text.
Resource: https://regex101.com/





In [None]:
import re
# Raw string so that the backslash in the metacharacter line stays a literal backslash
text_to_search = r'''
 ABCDEFGHIJKLMNOPQRSTUVWXYZ
 abcdefghijklmnopqrstuvwxyz
 metacharacters (need to be escaped):
 .^$-*/+!,? | \ { } [ ] ( )
 deeplearning.ai
 (216)94371111
 0021694376666
 #MyStudentsAreIncredible
 00000000011122
 Mr. Mohamed
 Mrs. Amina
abc@gmail.com
abC@gmail.com
ab_12@xyz.com

 '''

### Example 1: Extract the full phone numbers from the text above

In [None]:
# Attempt 1: numbers written as 13 consecutive digits

In [None]:
pattern = r'\d{13}'
matches = re.findall(pattern , text_to_search)
matches

['0021694376666', '0000000001112']

In [None]:
# Attempt 2: a 3-digit code in parentheses followed by 8 digits

In [None]:
pattern = r'\(\d{3}\)\d{8}'
matches = re.findall(pattern , text_to_search)
matches

['(216)94371111']

In [None]:
# Combine both formats with the alternation operator |

In [None]:
pattern = r'\d{13}|\(\d{3}\)\d{8}'
matches = re.findall(pattern , text_to_search)
matches

['(216)94371111', '0021694376666', '0000000001112']
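
The last item in the output, `'0000000001112'`, is only the first 13 digits of the 14-digit line `00000000011122`, so it is a partial match rather than a full phone number. One possible way to avoid this, sketched below, is to use lookarounds that reject a match when a digit sits directly before or after it. The shortened sample and the name `strict_pattern` are assumptions made for this illustration.

```python
import re

# Shortened sample reusing three lines from the text above.
sample = """
(216)94371111
0021694376666
00000000011122
"""

# (?<!\d) means "not preceded by a digit"; (?!\d) means "not followed by a digit".
strict_pattern = r'(?<!\d)\d{13}(?!\d)|\(\d{3}\)\d{8}(?!\d)'
strict_matches = re.findall(strict_pattern, sample)

print(strict_matches)  # ['(216)94371111', '0021694376666']
```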

### Example 2: Extract the e-mail addresses from the text above

In [None]:
email_pattern = None
matches= None
matches

['abc@gmail.com', 'abC@gmail.com', 'ab_12@xyz.com']
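
The cell above is left blank on purpose. For reference, here is one possible pattern that produces this output. It is a sketch, not the only valid answer, and it is deliberately simpler than what full e-mail validation would need. The shortened sample is an assumption made to keep the block self-contained.

```python
import re

# Shortened sample reusing a few lines from the text above.
sample = """
deeplearning.ai
Mr. Mohamed
abc@gmail.com
abC@gmail.com
ab_12@xyz.com
"""

# Local part, "@", domain label, then one or more ".suffix" parts made of letters.
email_pattern = r'[\w.+-]+@[\w-]+(?:\.[A-Za-z]{2,})+'
matches = re.findall(email_pattern, sample)

print(matches)  # ['abc@gmail.com', 'abC@gmail.com', 'ab_12@xyz.com']
```

The group is written as `(?:...)` (non-capturing) because `re.findall` returns only the captured groups when a pattern contains capturing ones.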

# Exercise 1:

You are given a list of email addresses, and your task is to use regex to find and extract valid email addresses from the list.

1. alice@example.com
2. bob@company.co.uk
3. carol@-invalid.com
4. david@123.45
5. eve@domain.org
6. frank@email.net
7. grace@website.io
8. hank@valid-email.com
9. ian@name@domain.com
10. jen@email123


In [None]:
None

alice@example.com
bob@company.co.uk
carol@-invalid.com
eve@domain.org
frank@email.net
grace@website.io
hank@valid-email.com
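
One possible approach that reproduces the expected output above is sketched here. The pattern is an assumption, not an official solution. It accepts `carol@-invalid.com` because it allows a hyphen anywhere in the domain label, which agrees with the expected output. A stricter pattern would reject that address.

```python
import re

emails = [
    "alice@example.com",
    "bob@company.co.uk",
    "carol@-invalid.com",
    "david@123.45",
    "eve@domain.org",
    "frank@email.net",
    "grace@website.io",
    "hank@valid-email.com",
    "ian@name@domain.com",
    "jen@email123",
]

# Local part, a single "@", a domain label, then one or more alphabetic suffixes.
pattern = r'[\w.+-]+@[\w-]+(?:\.[A-Za-z]{2,})+'

# fullmatch ensures the entire address fits the pattern, not just a part of it.
valid = [email for email in emails if re.fullmatch(pattern, email)]

for email in valid:
    print(email)
```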


# Exercise 2:

Write a regex pattern to match dates in the format "dd/mm/yyyy" in a given text.

For example, given the text: "Today's date is 24/09/2024, and tomorrow's date is 25/09/2024."

Your regex pattern should find and return the following matches:

"24/09/2024"
"25/09/2024"

In [None]:
None

24/09/2024
25/09/2024
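
A minimal sketch of one possible solution follows. The pattern only checks the shape `dd/mm/yyyy`, so it would also accept an impossible date such as `99/99/9999`.

```python
import re

text = "Today's date is 24/09/2024, and tomorrow's date is 25/09/2024."

# Two digits, slash, two digits, slash, four digits, delimited by word boundaries.
date_pattern = r'\b\d{2}/\d{2}/\d{4}\b'
dates = re.findall(date_pattern, text)

for date in dates:
    print(date)
```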
