- Without parentheses: You get the entire match as a single unit.

- With parentheses: You can extract and manipulate specific parts of the match using captured groups.
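
A minimal sketch of the difference, run on a made-up sample string (the string and variable names are illustrative, not from the notebook). The same pattern is passed to re.findall once without and once with a capturing group:

```python
import re

sample = "order 1001 shipped, order 1002 pending"  # hypothetical sample text

# No parentheses: findall returns each full match.
without_group = re.findall(r"order \d+", sample)

# With parentheses: findall returns only the captured part of each match.
with_group = re.findall(r"order (\d+)", sample)

print(without_group)  # ['order 1001', 'order 1002']
print(with_group)     # ['1001', '1002']
```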

# extract only numbers

In [56]:
import re
text = 'I have a problem with my order which has number 412889912. (short order #9912)'
# pattern = 'order[^0-9]*([0-9]*)'
pattern = r'order[^\d]*(\d*)'
output = re.findall(pattern, text)
print(output)

['412889912', '9912']
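
One caveat worth sketching: because the group uses \d*, it can capture an empty string when no digits follow "order". The sample string below is made up for illustration; switching to \d+ is one possible way to require at least one digit, not necessarily the notebook's intended fix.

```python
import re

no_number = "my order is late"  # hypothetical text with no digits after 'order'

# \d* allows zero digits, so the pattern still matches and captures ''.
star = re.findall(r"order[^\d]*(\d*)", no_number)

# \d+ requires at least one digit, so there is no match at all.
plus = re.findall(r"order[^\d]*(\d+)", no_number)

print(star)  # ['']
print(plus)  # []
```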


# iterate over result and use group

In [46]:
import re

pattern = r'order[^\d]*(\d*)'

test_string = "This is an order1234 and another order ABCD 5678 and orderXYZ000"

matches = re.finditer(pattern, test_string)

for match in matches:
    print(f"Full match: {match.group(0)} -> ", f"Captured digits: {match.group(1)}")

Full match: order1234 ->  Captured digits: 1234
Full match: order ABCD 5678 ->  Captured digits: 5678
Full match: orderXYZ000 ->  Captured digits: 000


# extract profile info (example)

In [59]:
text='''
Born	Elon Reeve Musk
June 28, 1971 (age 50)
Pretoria, Transvaal, South Africa
Citizenship	
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title	
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company and X.com (now part of PayPal)
Co-founder of Neuralink, OpenAI, and Zip2
Spouse(s)	
Justine Wilson
(m. 2000; div. 2008)
Talulah Riley
(m. 2010; div. 2012)
(m. 2013; div. 2016)
'''

find name

In [60]:
re.findall(r'Born\t*(.*)\n', text)

['Elon Reeve Musk']

find birthdate

In [61]:
re.findall(r'Born.*\n(.*) \(age',text)

['June 28, 1971']

find age

In [62]:
re.findall(r'age (\d+)',text)

['50']

find birthplace

In [64]:
re.findall(r'age.*\n(.+)',text)

['Pretoria, Transvaal, South Africa']
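
The four lookups above can also be combined into one pattern with named groups. This is only a sketch: the profile string is a shortened stand-in for the text above, and the group names (name, birthdate, age, birthplace) are chosen here for illustration.

```python
import re

# Shortened, hypothetical stand-in for the profile text (tab after "Born").
profile = (
    "Born\tElon Reeve Musk\n"
    "June 28, 1971 (age 50)\n"
    "Pretoria, Transvaal, South Africa\n"
)

pattern = re.compile(
    r"Born\t*(?P<name>.*)\n"                     # name on the 'Born' line
    r"(?P<birthdate>.*) \(age (?P<age>\d+)\)\n"  # date and age on the next line
    r"(?P<birthplace>.+)"                        # birthplace on the line after that
)

match = pattern.search(profile)
# groupdict() maps each group name to its captured text.
info = match.groupdict() if match else {}
print(info)
```

Named groups keep each field labeled, at the cost of a pattern that breaks if the line order in the text changes.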

# match

In [74]:
import re

pattern = r'a.+b'

# Matches 'a', then one or more of any character (except a newline), then 'b'
test_string1 = "abxxx"  # No match: there is no character between 'a' and 'b'
test_string2 = "acbxxx"
test_string3 = "axyzbxxx"

print(re.match(pattern, test_string1))  # No match
print(re.match(pattern, test_string2))  # Match
print(re.match(pattern, test_string3))  # Match
print(re.match(pattern, test_string3).group(0))  # Prints the matched text


None
<re.Match object; span=(0, 3), match='acb'>
<re.Match object; span=(0, 5), match='axyzb'>
axyzb
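
Note that re.match only tries the pattern at the start of the string. A small sketch with a made-up string whose match begins later, compared against re.search:

```python
import re

pattern = r"a.+b"
shifted = "xxacbxx"  # hypothetical string: the match does not start at index 0

at_start = re.match(pattern, shifted)   # only tries position 0
anywhere = re.search(pattern, shifted)  # scans forward for the first match

print(at_start)          # None
print(anywhere.group())  # acb
print(anywhere.span())   # (2, 5)
```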


In [89]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. 
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

pattern = r'\$([\d\.]+)'
matches = re.findall(pattern, text)
matches

['4.85', '3']

# findall vs search

- re.findall(pattern, text) returns a list of all non-overlapping occurrences of the pattern.

- re.search(pattern, text) returns a match object for the first occurrence only, or None if there is no match.
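
To make the contrast concrete, here is a minimal sketch on a made-up sentence (the sample text and variable names are illustrative only):

```python
import re

sample = "Q1 revenue was $10, Q2 revenue was $12"  # hypothetical sample text
pattern = r"\$(\d+)"

first = re.search(pattern, sample)   # match object for the first hit, or None
every = re.findall(pattern, sample)  # list of captured strings for all hits

print(first.group(1))  # 10
print(every)           # ['10', '12']
```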

In [92]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 ljh lsj a 123 was $4.85 billion. Same number for FY2020 Q4 was $8 billion
'''
pattern = r'FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)'

matches = re.search(pattern, text)
matches

<re.Match object; span=(51, 84), match='FY2021 Q1 ljh lsj a 123 was $4.85'>

A single pattern can extract more than one piece of information from the text.

The captured values are available through the match object's groups() method.

In [93]:
matches.groups()

('2021 Q1', '4.85')
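
For comparison, a small sketch on a shortened, made-up variant of the text above: group(n) returns one captured value at a time, and re.findall with a multi-group pattern returns one tuple per match.

```python
import re

# Shortened, hypothetical variant of the Tesla text.
sample = "FY2021 Q1 was $4.85 billion. FY2020 Q4 was $8 billion"
pattern = r"FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)"

m = re.search(pattern, sample)
period, amount = m.group(1), m.group(2)  # individual groups, numbered from 1
print(period, amount)  # 2021 Q1 4.85

pairs = re.findall(pattern, sample)  # one (period, amount) tuple per match
print(pairs)  # [('2021 Q1', '4.85'), ('2020 Q4', '8')]
```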

# Exercise: make a regular expression that will match an email

In [4]:
import re
def test_email(your_pattern):
    pattern = re.compile(your_pattern)
    emails = ["john@example.com", "python-list@python.org", "wha.t.`1an?ug{}ly@email.com"]
    for email in emails:
        if not re.match(pattern, email):
            print("You failed to match %s" % (email))
        elif not your_pattern:
            print("Forgot to enter a pattern!")
        else:
            print("Pass")
# Your pattern here!
pattern = r"\"?([-a-zA-Z0-9.`?{}]+@\w+\.\w+)\"?"
test_email(pattern)

Pass
Pass
Pass


Explanation of the Pattern

- \"?:
    - This matches an optional double quote character (").
    - The \ keeps the double quote from ending the Python string literal, which is itself delimited by double quotes. The regex engine treats \" as a plain double quote.
    - The ? means that the preceding character (the double quote in this case) is optional and can appear 0 or 1 time.
- ([-a-zA-Z0-9.`?{}]+:
    - The opening parenthesis starts a capturing group. The group closes after the top-level domain, so it captures the whole address without the optional quotes.
    - Inside the group, the first element is the character class [-a-zA-Z0-9.`?{}]:
        - -a-zA-Z0-9.: This matches any of the characters in the ranges a-z, A-Z, 0-9, as well as the characters . (dot) and - (hyphen).
        - `?{}: These characters (backtick, question mark, and curly braces) are included in the character class and can be matched as part of the email username.
    - The + after the character class means that one or more of these characters must be matched.
- @:
    - This matches the @ symbol, which is mandatory in an email address.
- \w+:
    - This matches one or more word characters (alphanumeric characters plus underscore). It represents the domain part of the email address right after the @ symbol.
- \.:
    - This matches a literal dot (.) character in the domain name.
    - The \ is an escape character to indicate that the dot should be treated as a literal character rather than a special regex character.
- \w+:
    - This matches one or more word characters (alphanumeric characters plus underscore) after the dot in the domain name, representing the top-level domain (e.g., com, org, net).
- \"?:
    - This matches an optional double quote character ("), similar to the initial part of the regex.
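
The breakdown above can be written directly into the pattern with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is a sketch of the same pattern, not a full email validator. The name EMAIL and the test strings are chosen here for illustration. It also shows that re.match accepts trailing text, while re.fullmatch requires the whole string to match.

```python
import re

EMAIL = re.compile(r"""
    "?                     # optional opening double quote
    (                      # capturing group: the whole address
      [-a-zA-Z0-9.`?{}]+   # username (before the @)
      @                    # literal @
      \w+                  # domain name
      \.                   # literal dot
      \w+                  # top-level domain
    )
    "?                     # optional closing double quote
""", re.VERBOSE)

quoted = EMAIL.match('"john@example.com"')
print(quoted.group(1))  # john@example.com

trailing = "john@example.com!!!"  # hypothetical input with junk at the end
loose = EMAIL.match(trailing) is not None       # match() ignores trailing text
strict = EMAIL.fullmatch(trailing) is not None  # fullmatch() must consume everything
print(loose, strict)  # True False
```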