## Regular Expressions

Regular expressions, or “regexes”, let us search for and examine patterns within text. For example, we might want to validate that an email address is formatted correctly. A naive first attempt, shown below, only checks whether the input contains an @ sign, so it accepts many strings that are not email addresses.



In [1]:
email = input("What's your email? ").strip()

if "@" in email:
    print("Valid")
else:
    print("Invalid")

Valid


### 1.0 Python's re module

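Python's built-in re module provides regular expression support. The re.search() function scans a string for the first location where a pattern matches and returns a match object, or None if the pattern is not found.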



In [2]:
import re

email = input("What's your email? ").strip()

if re.search("@", email):
    print("Valid")
else:
    print("Invalid")

Valid


### 2.0 Meta-characters

Meta-characters in regular expressions are special characters with predefined meanings that help define the search pattern. They are essential for constructing complex patterns to match specific strings. Below are some commonly used meta-characters in regular expressions:

. (Dot): Matches any single character except newline \n.

^ (Caret): Matches the start of the string.

$ (Dollar): Matches the end of the string.

\* (Asterisk): Matches zero or more occurrences of the preceding character or group.

\+ (Plus): Matches one or more occurrences of the preceding character or group.

? (Question Mark): Matches zero or one occurrence of the preceding character or group.

{} (Curly Braces): Specifies an exact number, or a range, of occurrences of the preceding character or group, as in {3} or {2,4}.

[] (Square Brackets): Matches any single character within the brackets. Can specify ranges and negate using ^.

| (Vertical Bar): Acts as an OR operator, allowing for multiple alternative patterns.

\ (Backslash): Escapes special characters, allowing them to be treated as literals.


For example:

. matches any character except newline.

^hello matches strings that start with "hello".

world$ matches strings that end with "world".

ab* matches "a" followed by zero or more "b"s.

a+ matches one or more "a"s.

[aeiou] matches any vowel character.

[^0-9] matches any character except digits.

(abc|def) matches "abc" or "def".
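
The examples above can be checked directly with re.search(). The following is a minimal sketch; the sample strings and the `examples` and `results` names are made up for illustration and are not part of the original notebook.

```python
import re

# Each entry: (pattern, text, whether re.search is expected to find a match)
examples = [
    (r"^hello", "hello world", True),    # text starts with "hello"
    (r"^hello", "say hello", False),     # "hello" is not at the start
    (r"world$", "hello world", True),    # text ends with "world"
    (r"ab*", "a", True),                 # "a" followed by zero "b"s
    (r"a+", "bbb", False),               # needs at least one "a"
    (r"[aeiou]", "rhythm", False),       # no vowel present
    (r"[^0-9]", "12345", False),         # every character is a digit
    (r"(abc|def)", "xxdefxx", True),     # the "def" alternative matches
    (r"\.", "no dots here", False),      # escaped dot means a literal "."
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in examples]

for (pattern, text, expected), found in zip(examples, results):
    print(f"{pattern!r} in {text!r}: {found}")
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting backslashes before the regex engine sees them.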

### 3.0 Pattern Matching

#### Basic Pattern Matching

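When the pattern is a plain string with no meta-characters, re.search() looks for that literal text. The returned match object reports where the match begins through its start() method.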

In [3]:
import re

pattern = 'hello'
text = 'hello world'

match = re.search(pattern, text)
if match:
    print('Pattern found at index:', match.start())
else:
    print('Pattern not found.')


Pattern found at index: 0
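
Since re.match() appears later in this notebook, it may help to see how it differs from re.search(). The sketch below also includes re.fullmatch() for comparison; the sample text and variable names are chosen here for illustration only.

```python
import re

text = "say hello world"

found_anywhere = re.search("hello", text)     # scans the whole string
found_at_start = re.match("hello", text)      # only tries the start of the string
found_whole = re.fullmatch("hello", "hello")  # pattern must cover the entire string

print(found_anywhere.span())  # (4, 9)
print(found_at_start)         # None
print(found_whole.group())    # hello
```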


#### Character Classes

Character classes match any one of a specified set of characters. Use square brackets [ ] to define a character class.

In [4]:
import re

pattern = '[aeiou]'
text = 'hello world'

matches = re.findall(pattern, text)
print('Vowels found:', matches)

Vowels found: ['e', 'o', 'o']


In [9]:


pattern = '[a-z]'
text = 'hello world'

matches = re.findall(pattern, text)
print('Lowercase letters found:', matches)

Lowercase letters found: ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']


In [10]:


pattern = '[A-Z]'
text = 'hello world'

matches = re.findall(pattern, text)
print('Uppercase letters found:', matches)

Uppercase letters found: []


In [11]:

pattern = '[A-Z0-9]'
text = 'hello world'

matches = re.findall(pattern, text)
print('Uppercase letters or digits found:', matches)

Uppercase letters or digits found: []


In [12]:

pattern = '[A-Z0-9]'
text = 'hello world'

# re.IGNORECASE makes [A-Z] match lowercase letters as well
matches = re.findall(pattern, text, flags=re.IGNORECASE)
print('Letters or digits found:', matches)

Letters or digits found: ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']


#### Quantifiers

Quantifiers specify how many times the preceding character or group may occur. Common quantifiers include *, +, ?, {n}, and {n,m}.

In [5]:
import re

pattern = 'lo+'
text = 'hello world'

matches = re.findall(pattern, text)
print('Matches found:', matches)


Matches found: ['lo']
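
To compare the quantifiers side by side, here is a small sketch that applies four of them to the same text. The sample string and variable names are invented for this illustration.

```python
import re

text = "helo hello helllo hellllo"

zero_or_more = re.findall(r"hel*o", text)    # any number of "l"s, including none
optional = re.findall(r"hell?o", text)       # the second "l" is optional
exactly_two = re.findall(r"hel{2}o", text)   # exactly two "l"s
two_to_three = re.findall(r"hel{2,3}o", text)  # between two and three "l"s

print(zero_or_more)   # ['helo', 'hello', 'helllo', 'hellllo']
print(optional)       # ['helo', 'hello']
print(exactly_two)    # ['hello']
print(two_to_three)   # ['hello', 'helllo']
```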


#### Anchors

Anchors match specific positions in the string. Common anchors include ^ for the start of a string and $ for the end of a string.

In [6]:
import re

pattern = '^hello'
text = 'hello world'

match = re.search(pattern, text)
if match:
    print('Pattern found at the beginning of the string.')
else:
    print('Pattern not found at the beginning of the string.')


Pattern found at the beginning of the string.
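
The $ anchor works the same way at the other end of the string, and combining both anchors requires the pattern to account for the whole string. A brief sketch, with variable names chosen here for illustration:

```python
import re

text = "hello world"

ends_with_world = bool(re.search(r"world$", text))  # "world" is at the end
ends_with_hello = bool(re.search(r"hello$", text))  # "hello" is not at the end
only_hello = bool(re.search(r"^hello$", text))      # extra text follows "hello"

print(ends_with_world)  # True
print(ends_with_hello)  # False
print(only_hello)       # False
```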


#### Grouping and Capturing

Parentheses () are used to group patterns together. They can also be used to capture parts of the matched text.

In [8]:
import re

pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = 'Phone numbers: 123-456-7890, 987-654-3210'

matches = re.findall(pattern, text)
for match in matches:
    # With capturing groups, findall() returns a tuple of the groups only,
    # so index 0 is the first group (the area code), not the full number.
    print('Area code:', match[0])
    print('Exchange code:', match[1])
    print('Subscriber number:', match[2])


Area code: 123
Exchange code: 456
Subscriber number: 7890
Area code: 987
Exchange code: 654
Subscriber number: 3210
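
When a pattern contains capturing groups, re.findall() returns only the captured groups as tuples, not the whole matched text. One way to get both the full number and its parts is re.finditer(), which yields match objects. This is a sketch using the same pattern and text; the `records` list is added here only to collect the results.

```python
import re

pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = 'Phone numbers: 123-456-7890, 987-654-3210'

records = []
for match in re.finditer(pattern, text):
    full_number = match.group(0)                 # the entire matched text
    area, exchange, subscriber = match.groups()  # the three captured groups
    records.append((full_number, area, exchange, subscriber))
    print(full_number, area, exchange, subscriber)
```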


#### Some Additional Patterns



\d    decimal digit

\D    not a decimal digit

\s    whitespace characters

\S    not a whitespace character

\w    word character (letters, digits, and the underscore)

\W    not a word character
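
A short sketch of these shorthand classes in use. The sample sentence and variable names are made up for illustration.

```python
import re

text = "Order 66: 3 items_total, due 2024"

digits = re.findall(r"\d+", text)  # runs of decimal digits
words = re.findall(r"\w+", text)   # runs of letters, digits, and underscores
pieces = re.split(r"\s+", text)    # split on runs of whitespace

print(digits)  # ['66', '3', '2024']
print(words)   # ['Order', '66', '3', 'items_total', 'due', '2024']
print(pieces)  # ['Order', '66:', '3', 'items_total,', 'due', '2024']
```

Note that \w keeps `items_total` together because the underscore counts as a word character, while splitting on \s leaves the punctuation attached.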



### 4.0 RegEx in Practice

Checking for an @ sign alone accepts many malformed addresses. A much stricter expression for validating an email address looks like this:

```python 
^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
```


In [14]:
import re

email = input("What's your email? ").strip()
pattern = r'^[a-zA-Z0-9.!#$%&\'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$'
if re.search(pattern, email, re.IGNORECASE):
    print("Valid")
else:
    print("Invalid")

Invalid


#### ChatGPT

prompt: write an RE expression to validate email

In [15]:
import re

def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# Test cases
emails = [
    "example@example.com",
    "user.name@example.com",
    "user+tag@example.com",
    "user-name@example.com",
    "user_name@example.com",
    "user@example.co.in",
    "user@example.co.kr",
    "user@example.co.uk",
    "user@example.com.au",
    "user@example.org"
]

for email in emails:
    print(email, ":", validate_email(email))


example@example.com : True
user.name@example.com : True
user+tag@example.com : True
user-name@example.com : True
user_name@example.com : True
user@example.co.in : True
user@example.co.kr : True
user@example.co.uk : True
user@example.com.au : True
user@example.org : True


Explanation of the regular expression ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$:

^ asserts the start of the string.

[a-zA-Z0-9._%+-]+ matches one or more of the allowed characters before the @ symbol.

@ matches the @ symbol.

[a-zA-Z0-9.-]+ matches one or more of the allowed characters after the @ symbol (the domain name, including any subdomains).

\. matches the literal dot that precedes the top-level domain.

[a-zA-Z]{2,} matches the top-level domain (e.g., com, org, net) consisting of at least two letters.

$ asserts the end of the string.
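
The test cases above are all valid addresses, so they do not show what the pattern rejects. The sketch below reuses the same validate_email() function with some additional inputs chosen here for illustration. It suggests the pattern is a reasonable filter, but it still accepts some questionable domains.

```python
import re

def validate_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# Hypothetical inputs the pattern rejects
rejected = [
    "userexample.com",        # no @ symbol
    "user@example",           # no dot and top-level domain
    "user@example.c",         # top-level domain shorter than two letters
    "user name@example.com",  # space is not an allowed character
    "@example.com",           # nothing before the @ symbol
]

# Hypothetical inputs the pattern accepts even though they look wrong
accepted_but_dubious = [
    "user@example..com",      # consecutive dots in the domain
    "user@-example.com",      # domain label starts with a hyphen
]

for email in rejected + accepted_but_dubious:
    print(email, ":", validate_email(email))
```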