## What are regular expressions?

- A regular expression (regex) is a way to define a pattern for searching or manipulating strings. We can use a regular expression to match, search, replace, and otherwise manipulate textual data.

- In simple words, the regex pattern Jessa matches the name Jessa.

- You can also write a regex pattern to validate a password against predefined constraints, for example that it must contain at least one special character, one digit, and one uppercase letter. If the pattern matches the password, we can say that the password is correctly constructed.

- Regular expressions are also instrumental in extracting information from text such as log files, spreadsheets, or other textual documents.

- For example, below are some cases where regular expressions can save you a lot of time:

- Searching and replacing text in files
- Validating text input, such as passwords and email addresses
- Renaming a hundred files at a time. For example, you can change the extension of all files using a regex pattern
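
- As a rough sketch of two of these use cases, the snippet below checks a password against the constraints mentioned earlier and changes the extension in a list of file names. The helper names (is_valid_password, change_extension), the reading of "special character" as any character that is not a letter or digit, and the sample inputs are assumptions made for illustration only.

```python
import re

def is_valid_password(password):
    # Assumption: one digit, one uppercase letter, and one character
    # that is neither a letter nor a digit are all required.
    checks = [r"\d", r"[A-Z]", r"[^A-Za-z0-9]"]
    return all(re.search(check, password) for check in checks)

def change_extension(filename, old, new):
    # Replace ".old" only when it appears at the very end of the name.
    return re.sub(re.escape("." + old) + r"$", "." + new, filename)

print(is_valid_password("Secret#1"))  # True
print(is_valid_password("secret1"))   # False

names = ["notes.txt", "todo.txt", "txt_notes.csv"]
print([change_extension(n, "txt", "md") for n in names])
# ['notes.md', 'todo.md', 'txt_notes.csv']
```

- The sketch only rewrites strings. Renaming real files would additionally need something like os.rename(), which is left out here.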

- Regular expressions are a concise and flexible way to match patterns within strings of text. They are often used in text processing, search and data extraction, and can be implemented in many programming languages.

- A regular expression is a sequence of characters that defines a search pattern. These characters can be letters, numbers, or symbols, and are used to describe the pattern to be matched. For example, the regular expression \d will match any single digit character (0-9), while the regular expression [a-z] will match any lowercase letter.

- Regular expressions are typically used with a search function or method, which takes the regular expression as an argument and returns the matches found in the input string. For example, in Python's re module, the search() function returns the first match of a regular expression in a string, and the findall() function returns all matches.
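
- A minimal sketch of the difference, using the \d and [a-z] patterns described above. The sample strings are made up for illustration.

```python
import re

text = "order 66 shipped to dock 7"

# search() stops at the first match and returns a Match object
first = re.search(r"\d", text)
print(first.group())  # 6

# findall() returns every match as a list of strings
all_digits = re.findall(r"\d", text)
print(all_digits)  # ['6', '6', '7']

# [a-z] matches lowercase letters only
lower = re.findall(r"[a-z]", "Ab1c")
print(lower)  # ['b', 'c']
```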

## The re module
- We will start this tutorial by using the re module, a built-in Python module that provides all the functionality needed for handling patterns and regular expressions.

- Type import re at the start of your Python file, and you are ready to use the re module's methods and special characters. To get to know the re module's functionality, methods, and attributes, use the help function.

- Just pass the module as an argument to the help function like this: help(re). It prints hundreds of lines because this module is vast and comprehensive.

- Now let's see how to use the re module to perform regex pattern matching in Python.

## Example 1: Write a regular expression to search digit inside a string
- Now, let's see how to use the Python re module to write the regular expression. Let's take a simple example of a regular expression to check if a string contains a number.

- For this example, we will use the \d metacharacter. We will discuss regex metacharacters in detail in a later section of this article.

- For now, keep in mind that \d is a special sequence that matches any digit from 0 to 9.

In [1]:
import re
target_str = 'my mobile is 9999879879777'
res = re.findall(r"\d", target_str)
print(res)

['9', '9', '9', '9', '8', '7', '9', '8', '7', '9', '7', '7', '7']


- To search for a digit within a string using a regular expression, you can use the \d character class. This character class represents any single digit character (0-9).

- For example, the following regular expression will match any single digit within a string: \d
- To search for multiple consecutive digits, you can use the + quantifier to match one or more occurrences of the preceding character. For example: \d+
- This regular expression will match any sequence of one or more digits within the string.

- Here is an example of how you can use this regular expression to search for digits in a string using the re module in Python:

  

In [2]:
import re

string = "The quick brown fox jumps over the lazy dog. 123"
pattern = r"\d+"

matches = re.search(pattern, string)
print(matches)

<re.Match object; span=(45, 48), match='123'>


In [3]:
import re

string = "The quick brown fox jumps over the lazy dog. 123"
pattern = r"\d+"

matches = re.findall(pattern, string)
print(matches)

['123']


In [4]:
s = 'raju 12345'
p = r"\d+"
m = re.findall(p, s)
m

['12345']

## Use raw string to define a regex
- Note: I have used a raw string to define a pattern like this r"\d". Always write your regex as a raw string.

- As you may already know, the backslash has a special meaning in some cases because it may indicate an escape character or escape sequence. To avoid that always use a raw string.

- For example, let's say that in Python we are defining a string that is actually a path to an exercise folder like this path = "c:\example\task\new".

- Now, let's assume you want to search for this path inside a target string using a regular expression. Let's write the code for that.

In [7]:
import re

print("without raw string:")
# path_to_search = "c:\example\task\new"
target_string = r"c:\example\task\new\exercises\session1"

# regex pattern
pattern = "^c:\\example\\task\\new"
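# Python turns each \\ into a single backslash, so the regex engine
# receives ^c:\example\task\new, and \e is not a valid regex escape
res = re.search(pattern, target_string)
print(res.group())

without raw string:


error: bad escape \e at position 3

- Notice that the pattern is a normal (non-raw) string. Python first turns each \\ into a single backslash, so the regex engine receives ^c:\example\task\new. There, \t and \n would be read as a tab and a newline rather than as path separators, and \e is not a valid regex escape at all, so executing the above code raises re.error: bad escape \e at position 3.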

- To avoid such issues, always write a regex pattern using a raw string. The character r denotes the raw string.

- Now replace the existing pattern with pattern = r"^c:\\example\\task\\new" and execute our code again. Now you can get the following output.



In [8]:
import re

print("with raw string:")
# path_to_search = "c:\example\task\new"
target_string = r"c:\example\task\new\exercises\session1"

# regex pattern
pattern = r"^c:\\example\\task\\new"
# In a raw string, Python leaves the backslashes untouched,
# so the regex engine sees \\ and matches a literal backslash
res = re.search(pattern, target_string)
print(res.group())

with raw string:
c:\example\task\new


- In Python, you define a raw string by prefixing the string with r. Raw strings are useful for regular expressions because Python leaves their backslashes (\) untouched instead of treating them as the start of an escape sequence.

- For example, consider the following pattern written as a normal string. The backslash has to be doubled so that the regex engine receives \d, which matches a digit:

- pattern = "\\d"

- With a raw string, the same pattern can be written with a single backslash:


- pattern = r"\d"
- The r prefix indicates that the string is a raw string, so the backslash is treated as a literal character, rather than an escape character. This makes it easier to read and write regular expressions that contain backslashes.
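
- The following sketch checks these claims directly. The sample text is an assumption chosen for illustration. It also shows the pattern needed to match a literal backslash followed by a digit.

```python
import re

# Both spellings produce the same two-character pattern: \ and d
print("\\d" == r"\d")  # True

# In a normal string "\t" is one tab character; in a raw string it is two characters
print(len("\t"), len(r"\t"))  # 1 2

# To match a literal backslash followed by a digit,
# the regex itself needs \\ (a backslash) and then \d (a digit)
m = re.search(r"\\\d", r"path\7")
print(m.group())  # \7
```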

- Here is an example of using a raw string to define a regular expression in Python:

In [9]:
import re

string = "The quick brown fox jumps over the lazy dog. 123"
pattern = r"\d+"

matches = re.search(pattern, string)
print(matches)

<re.Match object; span=(45, 48), match='123'>


- This code will find the first occurrence of a sequence of digits within the string and print the match object.

- You can also use raw strings when defining regular expressions with the compile() function:

In [10]:
import re

string = "The quick brown fox jumps over the lazy dog. 123"
pattern = r"\d+"
regex = re.compile(pattern)

matches = regex.search(string)
print(matches)

<re.Match object; span=(45, 48), match='123'>


## Python regex methods

## Python Compile Regex Pattern using re.compile()

- Python’s re.compile() method compiles a regular expression pattern provided as a string into a regex pattern object (re.Pattern). Later we can use this pattern object to search for a match inside different target strings using its methods, such as match() or search().

- In simple terms, we can compile a regular expression into a regex object to look for occurrences of the same pattern inside various target strings without rewriting it.

## How to use re.compile() method

- Syntax of re.compile()
- re.compile(pattern, flags=0)
- pattern: regex pattern in string format, which you are trying to match inside the target string.
- flags: An optional parameter. The expression’s behavior can be modified by specifying regex flag values.
- There are many flag values we can use. For example, re.I performs case-insensitive matching. We can also combine multiple flags using bitwise OR (the | operator).

- Return value

- The re.compile() method returns a pattern object ( i.e., re.Pattern).
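
- As a small sketch of the flags parameter, the snippet below combines re.I with re.M (multi-line mode, picked here only as a second flag for illustration) using the | operator. The sample log text is an assumption.

```python
import re

# re.I ignores case; re.M lets ^ match at the start of every line
regex = re.compile(r"^error", re.I | re.M)

log = "Error: disk full\ninfo: retrying\nERROR: gave up"

print(type(regex).__name__)  # Pattern
print(regex.findall(log))    # ['Error', 'ERROR']
```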

![image.png](attachment:image.png)

In [None]:
# Compile a regex pattern to find five-letter words in a string


import re

pattern = r"\b\w{5}\b"
regex = re.compile(pattern)

- This will create a regular expression object that can be used to search for five letter words within a string.

- The pattern \b\w{5}\b uses the following elements:

- \b: This is a word boundary. It matches the position between a word character (as defined by \w) and a non-word character, or the start or end of the string next to a word character.
- \w: This is a shorthand character class that matches any word character: letters, digits, and the underscore.
- {5}: This is a quantifier that specifies that the preceding element (in this case, \w) should be matched exactly five times.
- Together, these elements create a pattern that matches any sequence of exactly five word characters surrounded by word boundaries (i.e., a five-letter word).

- Here is an example of how to use the regular expression object to search for five letter words within a string:

In [148]:
import re

string = "The quick brown fox jumps over the lazy dog."
pattern = r"\b\w{5}\b"
regex = re.compile(pattern)

matches = regex.findall(string)
print(matches)

#You can also use the search() or finditer() functions to find the first or all occurrences of the pattern in the string, respectively.

['quick', 'brown', 'jumps']
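
- The comment in the cell above mentions search() and finditer(). A minimal sketch of both, reusing the same string and pattern, could look like this. Unlike findall(), finditer() yields Match objects, so the position of each match is available as well.

```python
import re

regex = re.compile(r"\b\w{5}\b")
string = "The quick brown fox jumps over the lazy dog."

# search() returns only the first five-letter word
first = regex.search(string)
print(first.group(), first.span())  # quick (4, 9)

# finditer() yields one Match object per occurrence
found = [(m.group(), m.span()) for m in regex.finditer(string)]
print(found)
# [('quick', (4, 9)), ('brown', (10, 15)), ('jumps', (20, 25))]
```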


## How to compile regex pattern

### Write regex pattern in string format
- Write regex pattern using a raw string. For example, a pattern to match any digit.
- str_pattern = r'\d'

### Pass a pattern to the compile() method
- pattern = re.compile(r'\d{3}')
- It compiles a regular expression pattern provided as a string into a regex pattern object.

### Use Pattern object to match a regex pattern
- Use Pattern object returned by the compile() method to match a regex pattern.
- res = pattern.findall(target_string)

### Example to compile a regular expression
- Now, let’s see how to use the re.compile() with the help of a simple example.

 - Pattern to compile: r'\d{3}'

### What does this pattern mean?

- First of all, I used a raw string to specify the regular expression pattern.
- Next, \d is a special sequence that matches any digit from 0 to 9 in a target string.
- Then the 3 inside curly braces means the digit has to occur exactly three times in a row inside the target string.
- In simple words, it matches any three consecutive digits inside the target string, such as 236, 452, or 782.

In [12]:
import re

# Target String one
str1 = "Emma's luck numbers are 251 761 231 451"

# pattern to find three consecutive digits
string_pattern = r"\d{3}"
# compile string pattern to re.Pattern object
regex_pattern = re.compile(string_pattern)

# print the type of compiled pattern
print(type(regex_pattern))
# Output <class 're.Pattern'>

# find all the matches in string one
result = regex_pattern.findall(str1)
print(result)
# Output ['251', '761', '231', '451']

# Target String two
str2 = "Kelly's luck numbers are 111 212 415"
# find all the matches in second string by reusing the same pattern
result = regex_pattern.findall(str2)
print(result)
# Output ['111', '212', '415']

<class 're.Pattern'>
['251', '761', '231', '451']
['111', '212', '415']


- As you can see, we found four matches of “three consecutive” digits inside the first string.

### Note:

- The re.compile() method changed the string pattern into a re.Pattern object that we can work with.
- Next, we called the findall() method of the re.Pattern object to obtain all matches of three consecutive digits inside the target string.
- The same regex_pattern object can be reused to search for three consecutive digits in other target strings as well.

## Why and when to use re.compile()
### Performance improvement

- Compiling regular expression objects is useful and efficient when the expression will be used several times in a single program.

- Keep in mind that compile() creates the regular expression object once. We can then use that object to look for occurrences of the same pattern inside various target strings without rewriting it, which saves time and can improve performance.

### Readability

- Another benefit is readability. Using re.compile() you can separate the definition of the regex from its use.

- For example:

In [13]:
pattern = re.compile("str_pattern")
result = pattern.match(string)
# is equivalent to

result = re.match("str_pattern", string)

- Avoid using the compile() method when you want to search for various patterns inside the single target string. You do not need to use the compile method beforehand because the compiling is done automatically with the execution of other regex methods.

## Is it worth using Python’s re.compile()?
- As you know, Python always internally compiles and caches regexes whenever you use them anyway (including calls to search() or match()), so using compile() method, you’re only changing when the regex gets compiled.

- But compiling regex is useful for the following situations.

- It signals that the compiled regular expression will be used a lot and is meant to be kept around.
- By compiling once and reusing the same regex object multiple times, we reduce the possibility of typos.
- When you are using lots of different regexes, keep compiled objects for those that are used multiple times, so that they do not depend on the regex cache, which drops entries when it is full.
- Also, the official documentation says: "The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions."

- So, in conclusion, yes, you should use the compile() method when you are going to perform a lot of matches using the same pattern, for example when you search for the same pattern over and over again in multiple target strings.

In [16]:
import re

string = "The quick brown fox jumps over the lazy dog. 123"
pattern = r"\d+"
regex = re.compile(pattern)

match = regex.search(string)
print(match)


<re.Match object; span=(45, 48), match='123'>


In [17]:
import re
s = 'raju 123'
p = r"\d+"
regex = re.compile(p)
match  = regex.search(s)
match


<re.Match object; span=(5, 8), match='123'>

## Python Regex Search using re.search()

- Python's re.search() method scans the entire target string for the regex pattern and returns a Match object for the first location where the pattern matches.

- re.search() returns only the first match of the pattern in the target string. Use re.search() to look for a pattern anywhere in the string.

### How to use re.search()

- Before moving further, let’s see the syntax of it.

- Syntax

- re.search(pattern, string, flags=0)
- The regular expression pattern and target string are the mandatory arguments, and flags are optional.

- pattern: The first argument is the regular expression pattern we want to search inside the target string.
- string: The second argument is the target string in which we want to look for occurrences of the pattern.
- flags: The third argument is optional and refers to regex flags. By default, no flags are applied.

- There are many flag values we can use. For example, re.I performs case-insensitive searching. We can also combine multiple flags using bitwise OR (the | operator).

### Return value

- The re.search() method returns a Match object (i.e., re.Match). This match object contains the following two items.

- First, the start and end index of the successful match, available as a tuple through the span() method.
- Second, the actual matching value, which we can retrieve using the group() method.
- If re.search() fails to find the pattern in the target string, it returns None.
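
- Because re.search() can return None, calling group() on the result without a check raises an AttributeError when nothing matches. The sketch below guards against that. The helper name first_number and the sample strings are assumptions for illustration.

```python
import re

def first_number(text):
    # Return (matched text, (start, end)) or None when no digits are found
    match = re.search(r"\d+", text)
    if match is None:
        return None
    return match.group(), match.span()

print(first_number("room 42, floor 3"))  # ('42', (5, 7))
print(first_number("no digits here"))    # None
```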

- Now, Let’s see how to use re.search().

![image.png](attachment:image.png)

### Regex search example – look for a word inside the target string
- Now, let’s see how to use re.search() to search for the following pattern inside a string.

- Pattern: \w{8}

### What does this pattern mean?

- The \w is a regex special sequence that represents any word character: uppercase or lowercase letters, digits, and the underscore character.
- The 8 inside curly braces means the character has to occur exactly 8 times in a row inside the target string.

- In simple words, it searches for eight consecutive word characters, such as an eight-letter word.

- "Emma is a baseball player who was born on June 17, 1993."
- As we can see, baseball is the first eight-letter word in the above string, so we should get baseball as the output.


In [20]:
import re

# Target String
target_string = "Emma is a baseball player who was born on June 17"

# search() for eight-letter word
result = re.search(r"\w{8}", target_string)

# Print match object
print("Match Object", result)
# output re.Match object; span=(10, 18), match='baseball'

# print the matching word using group() method
print("Matching word: ", result.group()) 
# Output 'baseball'

Match Object <re.Match object; span=(10, 18), match='baseball'>
Matching word:  baseball


- Let’s understand the above example.

- First of all, I used a raw string to specify the regular expression pattern. As you may already know, the backslash has a special meaning in some cases because it may indicate an escape character or escape sequence. To avoid that, we used a raw string.
- Also, we did not define and compile this pattern beforehand with the compile() method. Instead, we passed the pattern string directly to re.search().
- Next, we wrote a regex pattern to search for any eight-letter word inside the target string.
- Next, we passed this pattern to the re.search() method to look for occurrences of the pattern, and it returned a re.Match object.
- Next, we used the group() method of the re.Match object to retrieve the exact match value, i.e., baseball.

In [29]:
#Regex search example find exact substring or word
#In this example, we will find substring “ball” and “player” inside a target string.

import re

# Target String
target_string = "Emma is a baseball player who was born on June 17, 1993."

# find substring 'ball'
result = re.search(r"ball", target_string)

# Print matching substring
print(result.group())
# output 'ball'

# find exact word/substring surrounded by word boundary
result = re.search(r"\bball\b", target_string)
if result:
    print(result)
# prints nothing: 'ball' only occurs inside 'baseball', so result is None

# find word 'player'
result = re.search(r"\bplayer\b", target_string)
print(result.group())
# output 'player'

ball
player


## When to use re.search()
- The search() method always returns only the first occurrence of the pattern in the target string.


- Use it when you want to find the first match. The search method is useful for a quick match, i.e., as soon as it finds the first match, it stops, which saves work on long strings.
- Also, use it when you want to check whether a pattern occurs anywhere in a long target string.
- Avoid using the search() method in the following cases:

- To find all occurrences of a regular expression, use the findall() method instead.
- To match only at the start of the string, use the match() method instead.
- To perform a search-and-replace operation, use the re.sub() method.
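
- Since re.sub() is not demonstrated elsewhere in this section, here is a minimal sketch of it. The choice of what to replace is an assumption made for illustration.

```python
import re

target_string = "Emma is a baseball player who was born on June 17, 1993."

# Replace every digit with '#'
masked = re.sub(r"\d", "#", target_string)
print(masked)
# Emma is a baseball player who was born on June ##, ####.

# count=1 limits the replacement to the first match
first_only = re.sub(r"\d", "#", target_string, count=1)
print(first_only)
# Emma is a baseball player who was born on June #7, 1993.

# Replace a whole word only
swapped = re.sub(r"\bbaseball\b", "basketball", target_string)
print(swapped)
# Emma is a basketball player who was born on June 17, 1993.
```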

### Search vs. findall
- The search() and findall() methods serve different purposes when performing regex pattern matching in Python.

- As you know, the search method scans the string for a pattern and returns only the first match, i.e., as soon as it finds the first match, it stops.

- On the other hand, the findall() method returns all matches of the pattern.

- So use the findall() method to find all occurrences of a regular expression.

- One more difference: the search method returns a Match object, which holds the start and end index of the match and the actual matching value that we can retrieve using the group() method.

- On the other hand, the findall() method returns all the matches as a Python list (an empty list if nothing matches).
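
- The differences listed above can be seen side by side in the following sketch. The sample sentence is made up for illustration.

```python
import re

target_string = "Emma scored 12 points, Kelly 25 and Jessa 9"
pattern = r"\d+"

# search(): one Match object for the first occurrence
first = re.search(pattern, target_string)
print(first.group(), first.span())  # 12 (12, 14)

# findall(): a plain list with every occurrence
every = re.findall(pattern, target_string)
print(every)  # ['12', '25', '9']

# When nothing matches, search() gives None and findall() gives []
print(re.search(pattern, "no numbers"))   # None
print(re.findall(pattern, "no numbers"))  # []
```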


## Regex search groups or multiple patterns
- In this section, we will learn how to search for multiple distinct patterns inside the same target string. Let’s assume we want to search for the following two distinct patterns inside the target string at the same time.

- A ten-letter word
- Two consecutive digits
- To achieve this, let’s write two regular expression patterns.

### Regex Pattern 1: \w{10}

- It will search for any ten-letter word inside the target string.

### Regex Pattern 2: \d{2}

- It will search for two consecutive digits inside the target string.

- Now each pattern will represent one group. Let’s enclose each group in parentheses ( ). In our case: r"(\w{10}).+(\d{2})"

- On a successful search, we can use match.group(1) to get the match value of the first group and match.group(2) to get the match value of the second group.

- Now let’s see how to use these two patterns to search for a ten-letter word and two consecutive digits inside the target string.

In [31]:
import re

target_string = "Emma is a basketball player who was born on June 17."

# two group enclosed in separate ( and ) bracket
result = re.search(r"(\w{10}).+(\d{2})", target_string)

# Extract the matches using group()

# print ten-letter word
print(result.group(1))
# Output basketball

# print two digit number
print(result.group(2))
# Output 17
# the whole match, including the text between the two groups
result.group()

basketball
17


'basketball player who was born on June 17'

### Let’s understand this example

- We enclosed each pattern in its own pair of opening and closing parentheses.
- I added the .+ metacharacters before the second pattern. The dot represents any character except a newline, and the plus sign means that the preceding pattern repeats one or more times. So .+ means that between the first and the second group there is a bunch of characters that we can ignore.
- Next, we used the group() method to extract the two matching values.


- Note: group(1) and group(2) returned two matching values because the pattern contains two groups.
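
- As a short sketch of other ways to read the same groups: group(0) (or group() with no argument) returns the whole match, group(1, 2) returns several groups at once, and groups() returns all groups as a tuple.

```python
import re

target_string = "Emma is a basketball player who was born on June 17."
result = re.search(r"(\w{10}).+(\d{2})", target_string)

# group(0) is the entire matched text, including what .+ consumed
print(result.group(0))
# basketball player who was born on June 17

# Several groups can be requested in one call
print(result.group(1, 2))  # ('basketball', '17')

# groups() returns every captured group as a tuple
print(result.groups())     # ('basketball', '17')
```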


### Search multiple words using regex
- Let’s take another example and search for three specific words using regex. Let’s search for the words “Emma”, “player”, and “born” in the target string.

- Use the | (pipe) operator to specify alternative patterns.

In [32]:
import re

str1 = "Emma is a baseball player who was born on June 17, 1993."

# findall() for any of three whole words
# \b is used to specify word boundary
result = re.findall(r"\bEmma\b|\bplayer\b|\bborn\b", str1)
print(result)
# Output ['Emma', 'player', 'born']

['Emma', 'player', 'born']


In [34]:
res = re.findall(r"\bborn\b|\bjune\b", str1.lower())
res

['born', 'june']

In [36]:
res = re.findall(r"\bborn\b|\bjune\b", str1, re.IGNORECASE)
res

['born', 'June']

### Case insensitive regex search
- There is a possibility that the string contains lowercase and upper case words or words with a combination of lower case and uppercase letters.

- For example, you want to search a word using regex in a target string, but you don’t know whether that word is in uppercase or lowercase letter or a combination of both. Here you can use the re.IGNORECASE flag inside the search() method to perform case-insensitive searching of a regex pattern.

In [35]:
import re

# Target String
target_string = "Emma is a Baseball player who was born on June 17, 1993."

# case sensitive searching
result = re.search(r"emma", target_string)
print("Matching word:", result)
# Output None

print("case insensitive searching")
# using re.IGNORECASE
result = re.search(r"emma", target_string, re.IGNORECASE)
print("Matching word:", result.group())
# Output 'Emma'

Matching word: None
case insensitive searching
Matching word: Emma


## Python Regex Match using re.match()

- Python's re.match() method looks for the regex pattern only at the beginning of the target string and returns a match object if a match is found; otherwise, it returns None.

- In this section, you will learn how to match a regex pattern inside the target string using the match(), search(), and findall() methods of the re module.

- The re.match() method starts matching a regex pattern from the very first character of the text, and if a match is found, it returns a re.Match object. Later we can use the re.Match object to extract the matching string.


### How to use re.match()
- Before moving further, let’s see the syntax of re.match().

### Syntax of re.match()
- re.match(pattern, string, flags=0)
- The regular expression pattern and target string are mandatory arguments, and flags is optional.

- pattern: The regular expression pattern we want to match at the beginning of the target string. Since we are not compiling this pattern beforehand with the compile() method, the practice is to write the actual pattern as a raw string.
- string: The second argument is the target string in which we want to look for the pattern.
- flags: The third argument is optional and refers to regex flags. By default, no flags are applied.
- There are many flag values we can use. For example, re.I performs case-insensitive matching. We can also combine multiple flags using bitwise OR (the | operator).

### Return value
- If zero or more characters at the beginning of the string match the regular expression pattern, it returns a corresponding match object, i.e., a re.Match object. The match object contains the positions at which the match starts and ends and the actual match value.

- If the pattern does not match at the beginning of the target string, it returns None.
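
- The three possible outcomes described above can be sketched as follows. The sample strings are assumptions. The last case shows what "zero or more characters" means: a pattern that is allowed to match nothing, such as \d*, still returns a Match object with an empty match instead of None.

```python
import re

# The pattern matches at the very start: a Match object is returned
hit = re.match(r"\d+", "2024 was a leap year")
print(hit.group(), hit.span())  # 2024 (0, 4)

# The digits are not at the start: match() returns None
miss = re.match(r"\d+", "year 2024")
print(miss)  # None

# \d* may match zero digits, so this is an empty but successful match
empty = re.match(r"\d*", "year 2024")
print(repr(empty.group()), empty.span())  # '' (0, 0)
```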

![image.png](attachment:image.png)

### Match regex pattern at the beginning of the string
- Now, Let’s see the example to match any four-letter word at the beginning of the string. (Check if the string starts with a given pattern).

- Pattern to match: \w{4}

### What does this pattern mean?

- The \w is a regex special sequence that represents any alphanumeric character meaning letters (uppercase or lowercase), digits, and the underscore character.
- Then the 4 inside curly braces say that the character has to occur exactly four times in a row (four consecutive characters).
- In simple words, it means to match any four-letter word at the beginning of the following string.

- target_string = "Emma is a basketball player who was born on June 17, 1993"
- As we can see in the above string Emma is the four-letter word present at the beginning of the target string, so we should get Emma as an output.

In [37]:
import re

target_string = "Emma is a basketball player who was born on June 17"
result = re.match(r"\w{4}", target_string)

# printing the Match object
print("Match object: ", result)
# Output re.Match object; span=(0, 4), match='Emma'

# Extract match value
print("Match value: ", result.group())
# Output 'Emma'

Match object:  <re.Match object; span=(0, 4), match='Emma'>
Match value:  Emma


- As you can see, the match starts at index 0 and ends before index 4, because the re.match() method always performs pattern matching at the beginning of the target string.

- Let’s understand the above example.

- I used a raw string to specify the regular expression pattern. As you may already know, the backslash has a special meaning in some cases because it may indicate an escape character or escape sequence. To avoid that, I used a raw string.
- Next, we wrote a regex pattern to match any four-letter word.
- Next, we passed this pattern to the match() method to look for the pattern at the string’s start.
- Next, it found a match and returned a re.Match object.
- In the end, we used the group() method of the Match object to retrieve the exact match value, i.e., Emma.
### Match regex pattern anywhere in the string
- Let’s assume you want to match any six-letter word inside the following target string:

- target_string = "Jessa loves Python and pandas"
- If you use the match() method to match a six-letter word inside this string, you will get None, because match() returns a match only if the pattern is located at the beginning of the string. As we can see, the string starts with the five-letter word Jessa followed by a space, so there are no six consecutive word characters at the start.

- So, to match the regex pattern anywhere in the string, you need to use either the search() or findall() method of the re module.

In [38]:
import re

target_string = "Jessa loves Python and pandas"
# Match six-letter word
pattern = r"\w{6}"

# match() method
result = re.match(pattern, target_string)
print(result)
# Output None

# search() method
result = re.search(pattern, target_string)
print(result.group()) 
# Output 'Python'

# findall() method
result = re.findall(pattern, target_string)
print(result) 
# Output ['Python', 'pandas'] 

None
Python
['Python', 'pandas']


## Match regex at the end of the string
- Sometimes we want to match the pattern at the end of the string. For example, you may want to check whether a string ends with a specific word, number, or character.

- Using the dollar ($) metacharacter, we can match the regular expression pattern at the end of the string.

- Example to match a four-digit number at the end of the string:

In [39]:
import re

target_string = "Emma is a basketball player who was born on June 17, 1993"

# match at the end
result = re.search(r"\d{4}$", target_string)
print("Matching number: ", result.group())  
# Output 1993

Matching number:  1993


In [41]:
res = re.search(r"\d{3}$", target_string)
res.group()

'993'

### Match the exact word or string
- In this section, we will see how to write a regex pattern to match an exact word or a substring inside the target string. Let’s see the example to match the word “player” in the target string.

In [42]:
import re

target_string = "Emma is a basketball player who was born on June 17"
result = re.findall(r"player", target_string)
print("Matching string literal: ", result) 
# Output ['player']

Matching string literal:  ['player']


## Understand the Match object
- As you know, the match() and search() methods return a re.Match object if a match is found. Let’s see the structure of a re.Match object.

- re.Match object; span=(0, 4), match='Emma'
- This printed form of the re.Match object shows the following items.

- span shows the positions at which the match starts and ends, i.e., a tuple containing the start and end index of a successful match. It is available through the span() method.
- Save this tuple and use it whenever you want to retrieve the matching string from the target string.
- match shows the actual match value, which we can retrieve using the group() method.
- The Match object has several methods and attributes to get information about the matching string. Let’s see those.

| Method | Description |
| --- | --- |
| group() | Return the string matched by the regex |
| start() | Return the starting position of the match |
| end() | Return the ending position of the match |
| span() | Return a tuple containing the (start, end) positions of the match |



In [43]:
import re

target_string = "Jessa and Kelly"

# Match five-letter word
res = re.match(r"\b\w{5}\b", target_string)

# printing entire match object
print(res)
# Output <re.Match object; span=(0, 5), match='Jessa'>

# Extract Matching value
print(res.group())
# Output Jessa

# Start index of a match
print(res.start())
# Output  0

# End index of a match
print("End index: ", res.end())  # 5

# Start and end index of a match
pos = res.span()
print(pos)
# Output (0, 5)

# Use span to retrieve the matching string
print(target_string[pos[0]:pos[1]])
# Output 'Jessa'

<re.Match object; span=(0, 5), match='Jessa'>
Jessa
0
End index:  5
(0, 5)
Jessa
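
- Because match() and search() return None when nothing matches, calling group() directly on the result can raise an AttributeError. A minimal sketch of the usual guard, using a pattern that is deliberately absent from the string:

```python
import re

target_string = "Jessa and Kelly"

# There are no digits in the string, so search() returns None
res = re.search(r"\d+", target_string)

# Check the result before calling Match methods on it
if res is not None:
    found = res.group()
else:
    found = "no match"

print(found)
```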


## Match regex pattern that starts and ends with the given text
- Let’s assume you want to check whether a given string starts and ends with particular text. We can do this using the following two regex metacharacters with the re.match() method.

- Use the caret (^) metacharacter to match at the start
- Use the dollar ($) metacharacter to match at the end
- Now, let’s check if the given string starts with the letter ‘R’ and ends with the letter ‘s’

In [50]:
import re

# string starts with letter 'R' and ends with letter 's'
def starts_ends_with(str1):
    res = re.match(r'^(R).*(s)$', str1)
    if res:
        print(res.group())
    else:
        print('None')

str1 = "Raju is for Python developers"
starts_ends_with(str1)
# Output 'Raju is for Python developers'

str2 = "Raju is for Python"
starts_ends_with(str2)
# Output None


Raju is for Python developers
None
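
- The same check can be generalized to arbitrary start and end text. The helper below is a sketch, not part of the re module: the signature `starts_ends_with(text, prefix, suffix)` is an assumption, and it uses re.escape() so that characters such as the dot in the prefix or suffix are matched literally.

```python
import re

def starts_ends_with(text, prefix, suffix):
    """Return True if text starts with prefix and ends with suffix.

    Hypothetical helper: prefix and suffix are assumed not to overlap.
    """
    # re.escape() makes prefix and suffix match literally
    pattern = re.escape(prefix) + r".*" + re.escape(suffix)
    # fullmatch() anchors both ends; re.DOTALL lets `.` cross newlines
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

print(starts_ends_with("Raju is for Python developers", "R", "s"))
print(starts_ends_with("Raju is for Python", "R", "s"))
```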


## More matching operations
- In this section, let’s see some common regex matching operations such as

- Match any character
- Match digits
- Match numbers
- Match special characters

In [51]:
import re

str1 = "Emma 12 25"
# Match any character
print(re.match(r'.', str1))
# Output <re.Match object; span=(0, 1), match='E'>

# Match all digits
print(re.findall(r'\d', str1))
# Output ['1', '2', '2', '5']

# Match all numbers
# + indicates 1 or more occurrences of \d
print(re.findall(r'\d+', str1))
# output ['12', '25']

# Match all special characters and symbols
str2 = "Hello #Jessa!@#$%"
print(re.findall(r'\W', str2))
# Output [' ', '#', '!', '@', '#', '$', '%']

<re.Match object; span=(0, 1), match='E'>
['1', '2', '2', '5']
['12', '25']
[' ', '#', '!', '@', '#', '$', '%']


### Regex Search vs. match
- In this section, we will look at the difference between the search() and match() methods, and when to use each of them.

- Python’s re module offers two different methods to perform regex pattern matching.

- The match() checks for a match only at the beginning of the string.
- The search() checks for a match anywhere in the string.
- How re.match() works

- The match() method returns a match object if zero or more characters at the beginning of the string match the regular expression pattern.

- In simple words, re.match() returns a match object only if the pattern is located at the beginning of the string; otherwise, it returns None.

- How re.search() works

- On the other hand, the search() method scans the entire string for the pattern and returns only the first match, i.e., as soon as it finds the first match, it stops.

- Let’s see an example to understand the difference between search() and match().

- Now, let’s try to match any two-digit number inside the following target string using the search() and match() methods.

- Emma is a baseball player who was born on June 17, 1993
- As you can see, there is no two-digit number at the start of the string, so the match() method should return None, and the search() method should return the match.

- This is because match() tries to find a match only at the start of the string, while search() tries to find a match anywhere in the string.

In [52]:
import re

target_string = "Emma is a baseball player who was born on June 17, 1993"

# Match 2-digit number
# Using match()
result = re.match(r'\d{2}', target_string)
print(result)
# Output None

# Using search()
result = re.search(r'\d{2}', target_string)
print(result.group())
# Output 17

None
17


- Let’s see how the search() and match() methods behave when a string contains newlines.

- The re.M (re.MULTILINE) flag makes the caret (^) metacharacter match at the start of each line, not just at the start of the string. But note that even in MULTILINE mode, match() will only match at the beginning of the string and not at the beginning of each line.

- On the other hand, the search() method scans the entire multi-line string for the pattern and returns only the first match.

- Let’s see an example to understand the difference between search() and match() when searching inside a multi-line string.

In [53]:
import re

multi_line_string = """emma 
love Python"""

# Matches at the start
print(re.match('emma', multi_line_string).group())
# Output 'emma'

# re.match doesn't match at the start of each newline
# It only matches at the start of the string
# Won't match
print(re.match('love', multi_line_string, re.MULTILINE))
# Output None

# found "love" at start of newline
print(re.search('love', multi_line_string).group())
# Output 'love'

pattern = re.compile('Python$', re.MULTILINE)
# No Match
print(pattern.match(multi_line_string))
# Output None

# found 'Python' at the end
print(pattern.search(multi_line_string).group())
# Output 'Python'

emma
None
love
None
Python
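
- To see what the re.M flag does change, here is a small sketch that uses the caret with findall() and search() on a two-line string similar to the one above.

```python
import re

multi_line_string = """emma
love Python"""

# Without re.M, ^ matches only at the start of the whole string
string_start = re.findall(r"^\w+", multi_line_string)
print(string_start)

# With re.M, ^ also matches right after each newline
line_starts = re.findall(r"^\w+", multi_line_string, flags=re.M)
print(line_starts)

# search() combined with ^ and re.M finds a word at the start of line two
second = re.search(r"^love", multi_line_string, flags=re.M)
print(second.group())
```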


### re.fullmatch()
- Unlike the match() method, which performs the pattern matching only at the beginning of the string, the re.fullmatch method returns a match object if and only if the entire target string from the first to the last character matches the regular expression pattern.

- If the match succeeds, it returns the entire string as the match value, because fullmatch() always matches the whole string.

- For example, suppose you want the target string to be exactly 42 characters long. Let’s create a regular expression pattern that checks whether the target string is 42 characters long.

- Pattern to match: .{42}

### What does this pattern mean?

- This pattern says: match a string of 42 characters.

- Now let’s have a closer look at the pattern itself. First, you will see the dot in the regular expression syntax.

- The dot is a special character that matches any character, whether it’s a letter, digit, whitespace, or symbol, except the newline character, which in Python is written as \n.
- Next, 42 inside the curly braces says that the string must be 42 characters long.
- Now, let’s see the example.

In [54]:
import re

# string length of 42
str1 = "My name is maximums and my salary is 1000$"
print("str1 length: ", len(str1))

result = re.fullmatch(r".{42}", str1)

# print entire match object
print(result)

# print actual match value
print("Match: ", result.group())

str1 length:  42
<re.Match object; span=(0, 42), match='My name is maximums and my salary is 1000$'>
Match:  My name is maximums and my salary is 1000$


- As you can see from the output, we got a match object, meaning the match was performed successfully.

- Note: If the string contains one or more newline characters, the match will fail because the dot excludes the newline. Therefore, if our target string had multiple lines or paragraphs, the match would have failed. We can solve such problems using the flags argument, for example re.DOTALL, which makes the dot match newlines as well.
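
- A short sketch of the newline case, using a made-up two-line string. The pattern length is taken from len() so that it always equals the string length.

```python
import re

# Hypothetical two-line string; the newline counts as one character
text = "line one\nline two"
pattern = ".{" + str(len(text)) + "}"

# Without a flag, `.` does not match the newline, so fullmatch fails
no_flag = re.fullmatch(pattern, text)
print(no_flag)

# With re.DOTALL, `.` matches every character, newline included
with_flag = re.fullmatch(pattern, text, flags=re.DOTALL)
print(with_flag is not None)
```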

- Why and when to use re.match() and re.fullmatch()
- Use re.match() method when you want to find the pattern at the beginning of the string (starting with the string’s first character).
- If you want to match a full string against a pattern then use re.fullmatch(). The re.fullmatch method returns a match object if and only if the entire target string from the first to the last character matches the regular expression pattern.
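
- The three methods can be compared side by side on one input. The string "123abc" is an arbitrary example.

```python
import re

text = "123abc"

# match(): the digits are at the start, so this succeeds
at_start = re.match(r"\d+", text)
print(at_start)

# fullmatch(): "abc" is left over, so the whole string does not match
whole = re.fullmatch(r"\d+", text)
print(whole)

# search(): finds the letters even though they are not at the start
anywhere = re.search(r"[a-z]+", text)
print(anywhere)
```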

## Python Regex Find All Matches using findall() and finditer()

### How to use re.findall()
- Before moving further, let’s see the syntax of the re.findall() method.
### Syntax:

- re.findall(pattern, string, flags=0)
- pattern: regular expression pattern we want to find in the string or text
- string: It is the variable pointing to the target string (In which we want to look for occurrences of the pattern).
- flags: Optional regex flags. By default, no flags are applied. For example, the re.I flag performs case-insensitive matching.
- The regular expression pattern and target string are the mandatory arguments, and flags are optional.

- Return Value

- The re.findall() scans the target string from left to right as per the regular expression pattern and returns all matches in the order they were found.

- It returns an empty list (not None) if the pattern does not occur anywhere in the target string.
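
- A quick check of what findall() returns when nothing matches, and when the pattern contains capture groups. The sample strings are made up for illustration.

```python
import re

# No digits present: the result is an empty list
no_hits = re.findall(r"\d+", "no digits here")
print(no_hits)

# With two capture groups, each match becomes a tuple of the groups
pairs = re.findall(r"(\w+)=(\d+)", "width=10 height=20")
print(pairs)
```

- Since the result is always a list, it can be used directly in a for loop or an `if result:` test.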


### Example to find all matches to a regex pattern
- In this example, we will find all numbers present inside the target string. To achieve this, let’s write a regex pattern.

- Pattern: \d+

### What does this pattern mean?

- The \d is a special regex sequence that matches any digit from 0 to 9 in a target string.
- The + metacharacter means the number must contain at least one digit and may contain any number of further digits.

- In simple words, it means to match any number inside the following target string.

- target_string = "Emma is a basketball player who was born on June 17, 1993. She played 112 matches with scoring average 26.12 points per game. Her weight is 51 kg."
- As we can see, the numbers ‘17’, ‘1993’, ‘112’, ‘26’, ‘12’, and ‘51’ are present in the above string, so we should get all of them in the output. Note that 26.12 is reported as two separate numbers, because the dot is not a digit.

- Example

In [55]:
import re

target_string = "Emma is a basketball player who was born on June 17, 1993. She played 112 matches with scoring average 26.12 points per game. Her weight is 51 kg."
result = re.findall(r"\d+", target_string)

# print all matches
print("Found following matches")
print(result)

# Output ['17', '1993', '112', '26', '12', '51']

Found following matches
['17', '1993', '112', '26', '12', '51']


#### Note:

- I used a raw string to specify the regular expression pattern, i.e., r"\d+". In a normal string literal, the backslash introduces an escape sequence, so Python may interpret part of the pattern before the regex engine sees it. Using a raw string avoids this.
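
- The effect is easy to see with `\b`, which is a backspace character in a normal string but a word boundary in a regex. A small sketch with an arbitrary sample string:

```python
import re

# "\b" is one character (backspace); r"\b" is two (backslash and b)
normal_len = len("\b")
raw_len = len(r"\b")
print(normal_len, raw_len)

text = "cat concat cat"

# Normal string: the pattern looks for backspace characters around "cat"
plain = re.findall("\bcat\b", text)
print(plain)

# Raw string: the regex engine sees \b and treats it as a word boundary
raw = re.findall(r"\bcat\b", text)
print(raw)
```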

### Finditer method
- The re.finditer() method works like re.findall(), except that it returns an iterator yielding match objects instead of a list of strings.

- It scans the string from left to right, and matches are returned in the iterator form. Later, we can use this iterator object to extract all matches.

- In simple words, finditer() returns an iterator over re.Match objects.


### But why use finditer()?

- In some scenarios, the number of matches is high, and you could risk filling up your memory by loading them all at once using findall(). With finditer(), you get the matches in the form of an iterator object, which can reduce memory usage.

- This means finditer() returns an iterator that produces each match only when it is requested, instead of building the full list up front.
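
- A minimal sketch of consuming the iterator step by step. It uses itertools.islice from the standard library to take only the first two matches; the sample string is an assumption.

```python
import re
from itertools import islice

target_string = "17 1993 112 26 12 51"

matches = re.finditer(r"\d+", target_string)

# Take only the first two matches; the remaining ones are not produced yet
first_two = [m.group() for m in islice(matches, 2)]
print(first_two)

# The iterator remembers its position and continues with the third match
third = next(matches).group()
print(third)
```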

- finditer example
- Now, let’s see an example that finds every pair of consecutive digits inside the target string.

In [56]:
import re

target_string = "Emma is a basketball player who was born on June 17, 1993. She played 112 matches with a scoring average of 26.12 points per game. Her weight is 51 kg."

# finditer() with regex pattern and target string
# \d{2} to match two consecutive digits 
result = re.finditer(r"\d{2}", target_string)

# print all match object
for match_obj in result:
    # print each re.Match object
    print(match_obj)
    
    # extract each matching number
    print(match_obj.group())

<re.Match object; span=(49, 51), match='17'>
17
<re.Match object; span=(53, 55), match='19'>
19
<re.Match object; span=(55, 57), match='93'>
93
<re.Match object; span=(70, 72), match='11'>
11
<re.Match object; span=(108, 110), match='26'>
26
<re.Match object; span=(111, 113), match='12'>
12
<re.Match object; span=(145, 147), match='51'>
51
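
- Because each item yielded by finditer() is a match object, the start() and end() methods give the index of every match. A short sketch on a shortened sample string:

```python
import re

target_string = "June 17, 1993"

# Collect (text, start index, end index) for every number
positions = [
    (m.group(), m.start(), m.end())
    for m in re.finditer(r"\d+", target_string)
]
print(positions)
```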


### More use

- Use finditer to find the indexes of all regex matches
- Regex findall special symbols from a string
- Regex find all word starting with specific letters
- In this example, we will solve the following two scenarios

- Find all words that start with a specific letter/character
- Find all words that start with a specific substring
- Now, let’s assume you have the following string:
- target_string = "Jessa is a Python developer. She also gives Python programming training"
- Now let’s find all words that start with the letter p. Also, find all words that start with the substring ‘py’.

- Pattern: \b[p]\w+\b

- The \b is a word boundary, then p in square brackets [] means the word must start with the letter ‘p’. The code below passes the re.I flag, so an uppercase ‘P’ matches as well.
- Next, \w+ means one or more word characters (letters, digits, or underscore) after the letter ‘p’.
- In the end, we used \b to indicate a word boundary, i.e., the end of the word.

In [57]:
import re

target_string = "Jessa is a Python developer. She also gives Python programming training"
# all word starts with letter 'p'
print(re.findall(r'\b[p]\w+\b', target_string, re.I))
# output ['Python', 'Python', 'programming']

# all word starts with substring 'Py'
print(re.findall(r'\bpy\w+\b', target_string, re.I))
# output ['Python', 'Python']

['Python', 'Python', 'programming']
['Python', 'Python']


- Regex to find all words that start and end with a specific letter
- In this example, we will solve the following two scenarios

- Find all words that start and end with a specific letter
- Find all words that start and end with a specific substring

In [58]:
import re

target_string = "Jessa is a Python developer. She also gives Python programming training"
# all word starts with letter 'p' and ends with letter 'g'
print(re.findall(r'\b[p]\w+[g]\b', target_string, re.I))
# output ['programming']

# all word starts with letter 'p' or 't' and ends with letter 'g'
print(re.findall(r'\b[pt]\w+[g]\b', target_string, re.I))
# output ['programming', 'training']

target_string = "Jessa loves mango and orange"
# all word starts with substring 'ma' and ends with substring 'go'
print(re.findall(r'\bma\w+go\b', target_string, re.I))
# output ['mango']

target_string = "Kelly loves banana and apple"
# all word starts or ends with letter 'a'
print(re.findall(r'\b[a]\w+\b|\w+[a]\b', target_string, re.I))
# output ['banana', 'and', 'apple']

['programming']
['programming', 'training']
['mango']
['banana', 'and', 'apple']


### Regex to find all words containing a certain letter
- In this example, we will see how to find words that contain the letter ‘i’.

In [59]:
import re

target_string = "Jessa is a knows testing and machine learning"
# find all word that contain letter 'i'
print(re.findall(r'\b\w*[i]\w*\b', target_string, re.I))
# found ['is', 'testing', 'machine', 'learning']

# find all word which contain substring 'ing'
print(re.findall(r'\b\w*ing\w*\b', target_string, re.I))
# found ['testing', 'learning']

['is', 'testing', 'machine', 'learning']
['testing', 'learning']


## Regex findall repeated characters
- For example, you have a string: "Jessa Erriika"

- As a result, you want to have the following matches: (J, e, ss, a, E, rr, ii, k, a)

In [60]:
import re

target_string = "Jessa Erriika"
# (\w) captures any single word character
# and then its repetitions (\1*) if any.
matcher = re.compile(r"(\w)\1*")

for match in matcher.finditer(target_string):
    print(match.group(), end=", ")

J, e, ss, a, E, rr, ii, k, a, 

### Python Regex Split String Using re.split()

- In this article, we will learn how to split a string based on a regular expression pattern in Python. The re.split() method of Python’s re module splits the string at the occurrences of the regex pattern and returns a list containing the resulting substrings.

- After reading this article, you will be able to perform the following split operations using regex in Python.

| Operation | Description |
| --- | --- |
| re.split(pattern, str) | Split the string by each occurrence of the pattern. |
| re.split(pattern, str, maxsplit=2) | Split the string by the occurrences of the pattern. Limit the number of splits to 2. |
| re.split(p1\|p2, str) | Split the string by multiple delimiter patterns (p1 and p2). |


### How to use re.split() function
- Before moving further, let’s see the syntax of Python’s re.split() method.

- Syntax
- re.split(pattern, string, maxsplit=0, flags=0)


- The regular expression pattern and target string are the mandatory arguments. The maxsplit, and flags are optional.

- pattern: the regular expression pattern used for splitting the target string.
- string: The variable pointing to the target string (i.e., the string we want to split).
- maxsplit: The maximum number of splits to perform. If maxsplit is 2, at most two splits occur, and the remainder of the string is returned as the final element of the list.
- flags: By default, no flags are applied.
- There are many regex flags we can use. For example, the re.I is used for performing case-insensitive searching.
- Note: If capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list.

- Return value

- It splits the target string at each match of the regular expression pattern and returns the resulting substrings in the form of a list.

- If the specified pattern is not found inside the target string, then the string is not split in any way, but the split method still generates a list since this is the way it’s designed. However, the list contains just one element, the target string itself.
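
- Two edge cases of the return value, sketched with made-up strings: a pattern that never matches, and delimiters at the very start and end of the string.

```python
import re

# The delimiter never occurs: a one-element list with the original string
unchanged = re.split(r",", "no commas here")
print(unchanged)

# Delimiters at both edges produce empty strings at the ends of the list
edges = re.split(r",", ",a,b,")
print(edges)
```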

### Regex example to split a string into words
- Now, let’s see how to use re.split() with the help of a simple example. In this example, we will split the target string at each white-space character using the \s special sequence.


- Let’s add the + metacharacter at the end of \s. Now, The \s+ regex pattern will split the target string on the occurrence of one or more whitespace characters. Let’s see the demo.

In [67]:
import re

target_string = "My name is maximums and my luck numbers are 12 45 78"
# split on white-space 
word_list = re.split(r"\s+", target_string)
print(word_list)
#As you can see in the output, we got the list of words separated by whitespace.

['My', 'name', 'is', 'maximums', 'and', 'my', 'luck', 'numbers', 'are', '12', '45', '78']


In [68]:
t = "raju is great"
w = re.split(r"\s+", t)
print(w)

['raju', 'is', 'great']


#### Limit the number of splits

- The maxsplit parameter of re.split() is used to define how many splits you want to perform.

- In simple words, if the maxsplit is 2, then two splits will be done, and the remainder of the string is returned as the final element of the list.

- So let’s take a simple example to split a string on the occurrence of any non-digit. Here we will use the \D special sequence that matches any non-digit character.

In [69]:
import re

target_string = "12-45-78"

# Split only on the first occurrence
# maxsplit is 1
result = re.split(r"\D", target_string, maxsplit=1)
print(result)
# Output ['12', '45-78']

# Allow up to three splits (maxsplit is 3)
# only two delimiters exist, so the string is split twice
result = re.split(r"\D", target_string, maxsplit=3)
print(result)
# Output ['12', '45', '78']

['12', '45-78']
['12', '45', '78']


## Regex to Split string with multiple delimiters

- In this section, we’ll learn how to use regex to split a string on multiple delimiters in Python.

- For example, using the regular expression re.split() method, we can split the string either by the comma or by space.

- The regex split() method gives you more flexibility: you can specify a pattern that covers several delimiters, whereas the string’s split() method accepts only one fixed delimiter string.

- Let’s take a simple example to split the string either by the hyphen or by the comma.

- Example to split string by two delimiters

In [73]:
import re

target_string = "12,45,78,85-17-89"
# 2 delimiter - and ,
# use OR (|) operator to combine two pattern
result = re.split(r"-|,", target_string)
print(result)
# Output ['12', '45', '78', '85', '17', '89']

['12', '45', '78', '85', '17', '89']


## Regex to split string on five delimiters
- Here we will use regex to split a string on five delimiters: the dot, comma, semicolon, hyphen, and whitespace, each followed by any amount of extra whitespace.

In [78]:
import re

target_string = "PYnative   dot.com; is for, Python-developer"
# Pattern to split: [-;,.\s]\s*
result = re.split(r"[-;,.\s]\s*", target_string)
print(result)
# Output ['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer']

['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer']


- Note: We used the [] metacharacter to indicate a set of delimiter characters. The [] matches any single character listed in the brackets. For example, [-;,.\s] will match a hyphen, semicolon, comma, dot, or whitespace character.

- Regex to split a string into words with multiple non-word delimiters
- In this example, we will use the [\b\W\b]+ regex pattern to cater to any non-alphanumeric delimiters. Note that inside square brackets \b means the backspace character rather than a word boundary, so this pattern behaves the same as \W+. Using it, we can split a string on any run of non-word characters, which results in a list of alphanumeric/word tokens. The trailing empty string in the output appears because the target string ends with a delimiter.

- Note: The \W is a regex special sequence that matches any non-alphanumeric character, i.e., any character that is not a letter, digit, or underscore.


In [79]:
import re

target_string = "PYnative! dot.com; is for, Python-developer?"
result = re.split(r"[\b\W\b]+", target_string)
print(result)
# Output ['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer', '']

['PYnative', 'dot', 'com', 'is', 'for', 'Python', 'developer', '']
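
- Inside a character class, `\b` stands for the backspace character, so `[\b\W\b]+` and `\W+` should give the same split. The sketch below checks this and also shows re.findall(r"\w+", ...) as one way to get the words without the trailing empty string.

```python
import re

target_string = "PYnative! dot.com; is for, Python-developer?"

# Both patterns split on runs of non-word characters
with_class = re.split(r"[\b\W\b]+", target_string)
plain = re.split(r"\W+", target_string)
print(with_class == plain)

# Collecting the words directly avoids the empty string at the end
words = re.findall(r"\w+", target_string)
print(words)
```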


In [80]:
#Split strings by delimiters and specific word
import re

text = "12, and45,78and85-17and89-97"
# split by the word 'and', whitespace, comma, and hyphen
result = re.split(r"and|[\s,-]+", text)
print(result)
# Output ['12', '', '45', '78', '85', '17', '89', '97']

['12', '', '45', '78', '85', '17', '89', '97']


## Regex split a string and keep the separators
- As mentioned in the syntax section, if capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list.

- Note: You capture a group by writing the pattern inside parentheses ( ).

- In simple terms, be careful while using the re.split() method when the regular expression pattern is enclosed in parentheses to capture groups. If capture groups are used, then the matched text is also included in the resulting list.

- It is helpful when you want to keep the separators/delimiters in the resulting list.

In [81]:
import re

target_string = "12-45-78."

# Split on non-digit
result = re.split(r"\D+", target_string)
print(result)
# Output ['12', '45', '78', '']

# Split on non-digit and keep the separators
# pattern written in parentheses
result = re.split(r"(\D+)", target_string)
print(result)
# Output ['12', '-', '45', '-', '78', '.', '']

['12', '45', '78', '']
['12', '-', '45', '-', '78', '.', '']
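
- One use of keeping the separators is that the original string can be rebuilt from the parts. A small sketch, assuming a pattern with a single capture group so that values and separators alternate in the list:

```python
import re

target_string = "12-45-78."

parts = re.split(r"(\D+)", target_string)

# Joining all parts gives back the original string
rebuilt = "".join(parts)
print(rebuilt == target_string)

# Even positions hold the values, odd positions hold the separators
numbers = parts[0::2]
separators = parts[1::2]
print(numbers)
print(separators)
```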


## Regex split string by ignoring case
- There is a possibility that the string contains lowercase and upper case letters.

- For example, you want to split a string on the specific characters or range of characters, but you don’t know whether that character/word is an uppercase or lowercase letter or a combination of both. Here you can use the re.IGNORECASE or re.I flag inside the re.split() method to perform case-insensitive splits.


In [82]:
import re

# Without ignoring case
print(re.split('[a-z]+', "7J8e7Ss3a"))
# output ['7J8', '7S', '3', '']

# With ignoring case
print(re.split('[a-z]+', "7J8e7Ss3a", flags=re.IGNORECASE))
# output ['7', '8', '7', '3', '']

# Without ignoring case
print(re.split(r"emma", "Emma knows Python.EMMA loves Data Science"))
# output ['Emma knows Python.EMMA loves Data Science']

# With ignoring case
print(re.split(r"emma", "Emma knows Python.EMMA loves Data Science", flags=re.IGNORECASE))
# output ['', ' knows Python.', ' loves Data Science']

['7J8', '7S', '3', '']
['7', '8', '7', '3', '']
['Emma knows Python.EMMA loves Data Science']
['', ' knows Python.', ' loves Data Science']


## String’s split() method vs. regex split()
- Now let’s think of the built-in split() method in Python, which is specific to strings. As you most probably know, str.split() splits a string by a specific delimiter. However, please note that this delimiter is a fixed string that you pass as the method’s argument.


- The difference between the string’s split() method and the regular expression split() method is significant. The regular expression split is far more flexible, which can prove very useful in some scenarios and for specific tasks.

- With the re.split() method, you can specify a pattern for the delimiter, while the string’s split() method accepts only a fixed delimiter string.
- Also, using re.split() we can split a string by multiple delimiters.
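
- The difference shows up as soon as a string mixes delimiters. The sample string below is made up for illustration.

```python
import re

text = "apple, banana;cherry  date"

# str.split() only knows one fixed delimiter string
fixed = text.split(", ")
print(fixed)

# re.split() can treat commas, semicolons, and whitespace runs alike
flexible = re.split(r"[,;\s]+", text)
print(flexible)
```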

## Split string by upper case words
- For example, you have a string like “EMMA loves PYTHON and ML”, and you want to split it before each uppercase word to get a result like [‘EMMA loves’, ‘PYTHON and’, ‘ML’].

In [83]:
import re

print(re.split(r"\s(?=[A-Z])", "EMMA loves PYTHON and ML"))
# output ['EMMA loves', 'PYTHON and', 'ML']

['EMMA loves', 'PYTHON and', 'ML']


## Explanation

- We used lookahead regex \s(?=[A-Z]).
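- This regex splits at every whitespace character (\s) that is immediately followed by an uppercase letter ([A-Z]). Because the letter is inside a lookahead (?=...), it is only checked, not consumed, so it stays in the resulting substring.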
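
- To see why the lookahead matters, the sketch below compares it with a pattern that matches the uppercase letter directly, which removes that letter from the result.

```python
import re

text = "EMMA loves PYTHON and ML"

# Lookahead: the uppercase letter is checked but stays in the output
keep = re.split(r"\s(?=[A-Z])", text)
print(keep)

# No lookahead: the uppercase letter is part of the delimiter and is lost
lose = re.split(r"\s[A-Z]", text)
print(lose)
```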

## Python Regex Replace Pattern in a string using re.sub()
- In this article, we will learn how to use regular expressions to perform search and replace operations on strings in Python.

- Python’s re module offers the sub() and subn() methods to search and replace patterns in a string. Using these methods, we can replace one or more occurrences of a regex pattern in the target string with a substitute string.

- After reading this article, you will be able to perform the following regex replacement operations in Python.

| Operation | Description |
| --- | --- |
| re.sub(pattern, replacement, string) | Find and replace all occurrences of pattern with replacement |
| re.sub(pattern, replacement, string, count=1) | Find and replace only the first occurrence of pattern with replacement |
| re.sub(pattern, replacement, string, count=n) | Find and replace the first n occurrences of pattern with replacement |

### How to use re.sub() method
- To understand how to use the re.sub() for regex replacement, we first need to understand its syntax.

- Syntax of re.sub()

- re.sub(pattern, replacement, string[, count, flags])
- The regular expression pattern, replacement, and target string are the mandatory arguments. The count and flags are optional.

- pattern: The regular expression pattern to find inside the target string.
- replacement: The replacement that we are going to insert for each occurrence of a pattern. The replacement can be a string or function.
- string: The variable pointing to the target string (In which we want to perform the replacement).
- count: Maximum number of pattern occurrences to be replaced. If specified, the count must be a non-negative integer. By default, the count is set to zero, which means the re.sub() method will replace all pattern occurrences in the target string.
- flags: Finally, the last argument is optional and refers to regex flags. By default, no flags are applied.
- There are many flag values we can use. For example, the re.I is used for performing case-insensitive searching and replacing.
- Return value

- It returns the string obtained by replacing the pattern occurrences in the string with the replacement string. If the pattern isn’t found, the string is returned unchanged.
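
- The subn() method mentioned earlier takes the same arguments as sub() but returns a tuple of the new string and the number of replacements made. A short sketch:

```python
import re

target_str = "Jessa knows testing and machine learning"

# subn() returns (new_string, number_of_replacements)
new_str, replaced = re.subn(r"\s", "_", target_str)
print(new_str)
print(replaced)

# When nothing matches, the string comes back unchanged with a count of 0
same_str, none_replaced = re.subn(r"\d", "#", target_str)
print(same_str == target_str, none_replaced)
```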

- Now, let’s test this.

- Regex example to replace all whitespace with an underscore
- Now, let’s see how to use re.sub() with the help of a simple example. Here, we will perform two replacement operations



- Replace all the whitespace with an underscore
- Remove all whitespaces
- Let’s see the first scenario first.

- Pattern to replace: \s

- In this example, we will use the \s regex special sequence that matches any whitespace character, short for [ \t\n\x0b\r\f]

- Let’s assume you have the following string and you want to replace all the whitespace with an underscore.

- target_string = "Jessa knows testing and machine learning"

In [85]:
import re

target_str = "Jessa knows testing and machine learning"
res_str = re.sub(r"\s", "_", target_str)
# String after replacement
print(res_str)
# Output 'Jessa_knows_testing_and_machine_learning'


Jessa_knows_testing_and_machine_learning


- Regex to remove whitespaces from a string

- Now, let’s move to the second scenario, where you can remove all whitespace from a string using regex. This regex remove operation includes the following four cases.

- Remove all spaces, including single or multiple spaces ( pattern to remove \s+ )
- Remove leading spaces ( pattern to remove ^\s+ )
- Remove trailing spaces ( pattern to remove \s+$ )
- Remove both leading and trailing spaces. (pattern to remove  ^\s+|\s+$ )
- Example 1: Remove all spaces

In [86]:
import re

target_str = "   Jessa Knows Testing And Machine Learning \t  ."

# \s+ to remove all spaces
# + indicate 1 or more occurrence of a space
res_str = re.sub(r"\s+", "", target_str)
# String after replacement
print(res_str)
# Output 'JessaKnowsTestingAndMachineLearning.'

JessaKnowsTestingAndMachineLearning.


In [88]:
#Example 2: Remove leading spaces

import re

target_str = "   Jessa Knows Testing And Machine Learning \t  ."

# ^\s+ remove only leading spaces
# caret (^) matches only at the start of the string
res_str = re.sub(r"^\s+", "", target_str)

# String after replacement
print(res_str)
# Output 'Jessa Knows Testing And Machine Learning 	  .'

Jessa Knows Testing And Machine Learning 	  .


In [90]:
#Example 3: Remove trailing spaces

import re

target_str = "   Jessa Knows Testing And Machine Learning   \t\n"
# \s+$ removes only trailing spaces
# dollar ($) matches spaces only at the end of the string
res_str = re.sub(r"\s+$", "", target_str)

# String after replacement
print(res_str)
# Output '   Jessa Knows Testing And Machine Learning'

   Jessa Knows Testing And Machine Learning


In [92]:
#Example 4: Remove both leading and trailing spaces

import re

target_str = "   Jessa Knows Testing And Machine Learning   \t\n"
# ^\s+ removes leading spaces
# \s+$ removes trailing spaces
# | operator to combine both patterns
res_str = re.sub(r"^\s+|\s+$", "", target_str)

# String after replacement
print(res_str)
# Output 'Jessa Knows Testing And Machine Learning'

Jessa Knows Testing And Machine Learning


### Substitute multiple whitespaces with single whitespace using regex



In [93]:
import re

target_str = "Jessa Knows Testing    And Machine     Learning \t \n"

# \s+ to match all whitespaces
# replace them using single space " "
res_str = re.sub(r"\s+", " ", target_str)

# string after replacement
print(res_str)
# Output 'Jessa Knows Testing And Machine Learning ' (the trailing whitespace run becomes one space)

Jessa Knows Testing And Machine Learning 


- Limit the maximum number of pattern occurrences to be replaced
- As I told you, the count argument of the re.sub() method is optional. The count argument will set the maximum number of replacements that we want to make inside the string. By default, the count is set to zero, which means the re.sub() method will replace all pattern occurrences in the target string.

- Replace only the first occurrence of a pattern

- By setting count=1 inside re.sub(), we can replace only the first occurrence of a pattern in the target string with another string.

- Replace the first n occurrences of a pattern


- Set the count value to the number of replacements you want to perform.

- Now let’s see the example.

In [94]:
import re

# original string
target_str = "Jessa knows testing and machine learning"
# replace only first occurrence
res_str = re.sub(r"\s", "-", target_str, count=1)
# String after replacement
print(res_str)
# Output 'Jessa-knows testing and machine learning'

# replace the first three occurrences
res_str = re.sub(r"\s", "-", target_str, count=3)
print(res_str)
# Output 'Jessa-knows-testing-and machine learning'

Jessa-knows testing and machine learning
Jessa-knows-testing-and machine learning


### Regex replacement function
- In the earlier examples, we replaced the regex pattern with a fixed string. In this example, we replace a pattern with the output of a function.

- For example, suppose you want to replace every uppercase letter with its lowercase equivalent. To achieve this, we need the following two things:

- A regular expression pattern that matches uppercase letters
- A replacement function that converts each matched uppercase letter to lowercase
- Pattern to replace: [A-Z]

- This pattern matches any single uppercase ASCII letter inside the target string.

### replacement function

- You can pass a function to re.sub(). When re.sub() runs, your function receives a match object as its argument. It can build the replacement by extracting the matched value from that match object.

- If the replacement is a function, it is called for every non-overlapping occurrence of the pattern. The function takes a single match object argument and returns the replacement string.

- So in our case, we will do the following:


- First, we create a function that converts an uppercase letter to lowercase
- Next, we pass this function as the replacement argument to re.sub()
- Whenever re.sub() matches the pattern, it sends the corresponding match object to the replacement function
- Inside the replacement function, we use the group() method to extract the uppercase letter and convert it to lowercase

In [95]:
import re

# replacement function to convert an uppercase letter to lowercase
def convert_to_lower(match_obj):
    # group() returns the text matched by the whole pattern
    return match_obj.group().lower()

# Original string (named target_str to avoid shadowing the built-in str)
target_str = "Emma LOves PINEAPPLE DEssert and COCONUT Ice Cream"

# pass replacement function to re.sub()
res_str = re.sub(r"[A-Z]", convert_to_lower, target_str)
# String after replacement
print(res_str)
# Output 'emma loves pineapple dessert and coconut ice cream'

emma loves pineapple dessert and coconut ice cream
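
- A replacement function is most useful when the replacement has to be computed from the match. The following is a minimal sketch under that assumption: it doubles every number in a made-up string, using a lambda in place of a named function. The string and the variable names are hypothetical.

```python
import re

# Hypothetical input used only for illustration
order = "3 apples and 10 oranges"

# The lambda receives a match object; group() is the matched run of digits
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), order)

print(doubled)
```

- The function must return a string, which is why the computed number is converted back with str().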


### Regex replace group/multiple regex patterns
- In the earlier examples, we found and replaced a single regex pattern. In this section, we will learn how to search for and replace multiple patterns in the target string.

- To understand this, take the following string as an example:

- student_names = "Emma-Kelly Jessa Joy Scott-Joe Jerry"

- Here, we want to find and replace two distinct patterns at the same time.

- We want to replace each whitespace character and each hyphen (-) with a comma (,) inside the target string. To achieve this, we first write two regular expression patterns.

- Pattern 1: \s matches any whitespace character
- Pattern 2: - matches a hyphen (-)

In [96]:
import re

# Original string
student_names = "Emma-Kelly Jessa Joy Scott-Joe Jerry"

# replace two patterns at the same time
# use OR (|) to separate the two patterns
res = re.sub(r"(\s)|(-)", ",", student_names)
print(res)
# Output 'Emma,Kelly,Jessa,Joy,Scott,Joe,Jerry'

Emma,Kelly,Jessa,Joy,Scott,Joe,Jerry
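
- When every alternative is a single character, as it is here, a character class is a common alternative to alternation. The sketch below assumes the same student_names string and compares both forms; with_alternation and with_char_class are illustrative names.

```python
import re

student_names = "Emma-Kelly Jessa Joy Scott-Joe Jerry"

# Alternation with two groups
with_alternation = re.sub(r"(\s)|(-)", ",", student_names)

# Character class: one whitespace character or one hyphen
# (a hyphen placed last inside [] is treated as a literal hyphen)
with_char_class = re.sub(r"[\s-]", ",", student_names)

print(with_char_class)
print(with_alternation == with_char_class)
```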


### Replace multiple regex patterns with different replacement
- To understand this, take the following string as an example:

- target_string = "EMMA loves PINEAPPLE dessert and COCONUT ice CREAM"

- The above string contains a combination of uppercase and lowercase words.

- Here, we want to match and replace two distinct patterns with two different replacements.

- Replace each uppercase word with its lowercase form
- Replace each lowercase word with its uppercase form
- So we will first capture two groups and then replace each group using a replacement function. Replacement functions are covered in the "Regex replacement function" section above.

### Group 1: ([A-Z]+)

- Captures each uppercase word so that it can be replaced with lowercase.
- The [A-Z] character class matches any single uppercase letter from A to Z.
### Group 2: ([a-z]+)

- Captures each lowercase word so that it can be replaced with uppercase.
- The [a-z] character class matches any single lowercase letter from a to z.

- Note: Whenever you want to capture a group, enclose its pattern in parentheses ( and ).

In [97]:
import re

# replacement function to convert uppercase word to lowercase
# and lowercase word to uppercase
def convert_case(match_obj):
    if match_obj.group(1) is not None:
        return match_obj.group(1).lower()
    if match_obj.group(2) is not None:
        return match_obj.group(2).upper()

# Original string (named target_str to avoid shadowing the built-in str)
target_str = "EMMA loves PINEAPPLE dessert and COCONUT ice CREAM"

# group 1 [A-Z]+ matches uppercase words
# group 2 [a-z]+ matches lowercase words
# pass replacement function 'convert_case' to re.sub()
res_str = re.sub(r"([A-Z]+)|([a-z]+)", convert_case, target_str)

# String after replacement
print(res_str)
# Output 'emma LOVES pineapple DESSERT AND coconut ICE cream'

emma LOVES pineapple DESSERT AND coconut ICE cream
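
- Checking group(1) and group(2) by number works, but it becomes harder to read as the number of alternatives grows. One possible variation, sketched below, gives each alternative a name with (?P<name>...) and reads the match object's lastgroup attribute to see which one matched. The group names upper and lower are my own choice for this illustration.

```python
import re

def convert_case(match_obj):
    # lastgroup holds the name of the group that matched
    if match_obj.lastgroup == "upper":
        return match_obj.group().lower()
    return match_obj.group().upper()

target_str = "EMMA loves PINEAPPLE dessert and COCONUT ice CREAM"

# (?P<name>...) gives each alternative a name instead of a number
pattern = r"(?P<upper>[A-Z]+)|(?P<lower>[a-z]+)"
res_str = re.sub(pattern, convert_case, target_str)

print(res_str)
```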


### RE’s subn() method
- The re.subn() method performs the same task as re.sub(), but the result it returns is a bit different.

- The re.subn() method returns a tuple of two elements.

- The first element is the new version of the target string after all the replacements have been made.
- The second element is the number of replacements made.
- Let’s test this by calling subn() in place of sub().

In [98]:
import re

target_string = "Emma loves PINEAPPLE, COCONUT, BANANA ice cream"
result = re.subn(r"[A-Z]{2,}", "MANGO", target_string)
print(result)
# Output ('Emma loves MANGO, MANGO, MANGO ice cream', 3)

('Emma loves MANGO, MANGO, MANGO ice cream', 3)


- Note: The regular expression pattern works exactly as it would with re.sub(), and the resulting string is the same. The only difference is that the string is now the first element of a tuple, and the second element is the number of replacements made, which is three here.

- We can also use the count argument of the subn() method. So the value of the second element of the result tuple should change accordingly.

- So let’s test this.

In [99]:
import re

target_string = "Emma loves PINEAPPLE, COCONUT, BANANA ice cream"
result = re.subn(r"[A-Z]{2,}", "MANGO", target_string, count=2)
print(result)
# Output ('Emma loves MANGO, MANGO, BANANA ice cream', 2)

('Emma loves MANGO, MANGO, BANANA ice cream', 2)
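
- In practice the tuple is usually unpacked right away. The sketch below shows one way to do that and to use the count to tell whether anything was replaced; new_string and replaced are illustrative names.

```python
import re

target_string = "Emma loves PINEAPPLE, COCONUT, BANANA ice cream"

# Unpack the (new_string, number_of_replacements) tuple directly
new_string, replaced = re.subn(r"[A-Z]{2,}", "MANGO", target_string)

if replaced:
    print(f"{replaced} replacement(s): {new_string}")
else:
    print("pattern not found, string unchanged")
```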


### Python Regex Replace Pattern in a string using re.sub()


- In this article, you will learn how to use regular expressions to perform search and replace operations on strings in Python.

- Python’s re module offers the sub() and subn() methods to search and replace patterns in a string. Using these methods, we can replace one or more occurrences of a regex pattern in the target string with a substitute string.

- After reading this article, you will be able to perform the following regex replacement operations in Python.


- re.sub(pattern, replacement, string): finds and replaces all occurrences of pattern with replacement
- re.sub(pattern, replacement, string, count=1): finds and replaces only the first occurrence of pattern with replacement
- re.sub(pattern, replacement, string, count=n): finds and replaces the first n occurrences of pattern with replacement

## How to use re.sub() method
- To understand how to use re.sub() for regex replacement, we first need to understand its syntax.

- Syntax of re.sub()

- re.sub(pattern, replacement, string, count=0, flags=0)
- The regular expression pattern, replacement, and target string are mandatory arguments. The count and flags arguments are optional.



- pattern: The regular expression pattern to find inside the target string.
- replacement: The replacement to insert for each occurrence of the pattern. The replacement can be a string or a function.
- string: The target string in which we want to perform the replacement.
- count: The maximum number of pattern occurrences to replace. If specified, count must be a non-negative integer. By default, count is 0, which means re.sub() replaces all pattern occurrences in the target string.
- flags: Optional regex flags. By default, no flags are applied.
- There are many flag values we can use. For example, re.I performs case-insensitive searching and replacing.
- Return value

- It returns the string obtained by replacing the pattern occurrences in the string with the replacement. If the pattern isn’t found, the string is returned unchanged.
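
- To make the optional arguments concrete, here is a minimal sketch that uses flags and count together on a made-up string. Both are passed by keyword, which avoids the easy mistake of handing a flag such as re.I to the count position. The sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample text
text = "Emma met EMMA and emma"

# Without flags, only the exact-case spelling is replaced
case_sensitive = re.sub(r"emma", "Jessa", text)

# flags=re.I makes the match case-insensitive
case_insensitive = re.sub(r"emma", "Jessa", text, flags=re.I)

# count limits how many of those matches are replaced
first_two_only = re.sub(r"emma", "Jessa", text, count=2, flags=re.I)

print(case_sensitive)
print(case_insensitive)
print(first_two_only)
```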

- Now, let’s test this.

- Regex example to replace all whitespace with an underscore
- Now, let’s see how to use re.sub() with the help of a simple example. Here, we will perform two replacement operations:


- Replace every whitespace character with an underscore
- Remove all whitespace
- Let’s look at the first scenario.

- Pattern to replace: \s

- In this example, we will use the \s special sequence, which matches any whitespace character. For ASCII text it is equivalent to [ \t\n\x0b\r\f].

- Let’s assume you have the following string and you want to replace every whitespace character with an underscore.

- target_str = "Jessa knows testing and machine learning"

In [100]:
import re

target_str = "Jessa knows testing and machine learning"
res_str = re.sub(r"\s", "_", target_str)
# String after replacement
print(res_str)
# Output 'Jessa_knows_testing_and_machine_learning'

Jessa_knows_testing_and_machine_learning


## Regex to remove whitespaces from a string

- Now, let’s move to the second scenario, where we remove whitespace from a string using regex. This covers the following four cases.

- Remove all whitespace, including single or multiple spaces (pattern to remove: \s+)
- Remove leading whitespace (pattern to remove: ^\s+)
- Remove trailing whitespace (pattern to remove: \s+$)
- Remove both leading and trailing whitespace (pattern to remove: ^\s+|\s+$)
- Example 1: Remove all spaces

In [101]:
import re

target_str = "   Jessa Knows Testing And Machine Learning \t  ."

# \s+ matches every run of whitespace so it can be removed
# + indicates one or more occurrences of a whitespace character
res_str = re.sub(r"\s+", "", target_str)
# String after replacement
print(res_str)
# Output 'JessaKnowsTestingAndMachineLearning.'

JessaKnowsTestingAndMachineLearning.


In [102]:
#Example 2: Remove leading spaces

import re

target_str = "   Jessa Knows Testing And Machine Learning \t  ."

# ^\s+ remove only leading spaces
# caret (^) matches only at the start of the string
res_str = re.sub(r"^\s+", "", target_str)

# String after replacement
print(res_str)
# Output 'Jessa Knows Testing And Machine Learning 	  .'

Jessa Knows Testing And Machine Learning 	  .


### Python Regex Capturing Groups
- In this article, you will learn how to capture regex groups in Python. By capturing groups, we can match several distinct patterns inside the same target string.

### What is Group in Regex?
- A group is a part of a regex pattern enclosed in parentheses (). We create a group by placing the regex pattern inside a set of parentheses ( and ). For example, the regular expression (cat) creates a single group containing the letters ‘c’, ‘a’, and ‘t’.
- For example, in a real-world case, you may want to capture emails and phone numbers. You would write two groups: the first matches an email, and the second matches a phone number.

- Capturing groups are also a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses ( and ).

- For example, in the expression ((\w)(\s\d)), there are three such groups:

- ((\w)(\s\d))
- (\w)
- (\s\d)
- We can specify as many groups as we wish. Each sub-pattern inside a pair of parentheses is captured as a group. Capturing groups are numbered by counting their opening parentheses from left to right.

- Capturing groups are a handy feature of regular expression matching: they let us query the Match object to find out which part of the string matched a particular part of the regular expression.

- Anything inside plain parentheses () is a capture group. Using the group(group_number) method of the regex Match object, we can extract the matching value of each group.

- We will see how to capture single as well as multiple groups.
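
- The numbering rule for nested groups is easier to see when it runs. The sketch below applies the ((\w)(\s\d)) expression mentioned above to a made-up three-character input, "a 1", which is an assumption chosen only so that every group has something to match.

```python
import re

# Hypothetical input: one word character, a space, then a digit
match = re.search(r"((\w)(\s\d))", "a 1")

print(repr(match.group(1)))  # outer group, opened by the first parenthesis
print(repr(match.group(2)))  # (\w)
print(repr(match.group(3)))  # (\s\d)
print(match.groups())
```

- The outer group is number 1 because its opening parenthesis comes first, even though it closes last.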

- Example to Capture Multiple Groups


- Let’s assume you have the following string:

- target_string = "The price of PINEAPPLE ice cream is 20"
- And you want to match the following two regex groups inside the string:

- An UPPERCASE word
- A number
- To extract the uppercase word and the number from the target string, we first write two regular expression patterns.

- A pattern to match the uppercase word (PINEAPPLE)
- A pattern to match the number (20)
- The first group pattern, to search for an uppercase word: [A-Z]+

- [A-Z] is a character class. It matches any single uppercase letter from A to Z.
- The + metacharacter then indicates one or more occurrences of an uppercase letter.
- The second group pattern, to search for the price: \d+

- \d matches any digit from 0 to 9 in the target string.
- The + metacharacter then indicates that the number must contain at least one digit and may contain any number of digits.
- Extract matched group values

- In the end, we can use the groups() and group() methods of the match object to get the matched values.

In [113]:
import re

target_string = "The price of PINEAPPLE ice cream is 20"

# two groups enclosed in separate ( and ) bracket
result = re.search(r"(\b[A-Z]+\b).+(\b\d+)", target_string)

# Extract matching values of all groups
print(result.groups())
# Output ('PINEAPPLE', '20')

# Extract match value of group 1
print(result.group(1))
# Output 'PINEAPPLE'

# Extract match value of group 2
print(result.group(2))
# Output '20'

('PINEAPPLE', '20')
PINEAPPLE
20


### Let’s understand the above example

- First of all, I used a raw string to specify the regular expression pattern. In an ordinary string literal, the backslash starts an escape sequence, so sequences such as \b would be interpreted by Python before the regex engine sees them. A raw string avoids that.

- Now let’s take a closer look at the regular expression syntax used to define and isolate the two patterns we want to match. We need two things.


- First, we need to enclose each of the two patterns inside a pair of parentheses. So (\b[A-Z]+\b) is the first group, and (\b\d+) is the second group. Each pair of parentheses is a group.

- Note:

- The parentheses are not part of the text to match. They indicate a group.
- The \b indicates a word boundary.
- Secondly, we need to consider the larger context in which these groups reside. We care about where each group sits inside the target string, which is why we provide context or borders for each group.

- In the first group, [A-Z]+ is wrapped in \b on both sides, so it matches only a whole word made up of uppercase letters. The T of "The" is skipped because it is followed by a lowercase letter rather than a word boundary, so the group matches PINEAPPLE.

- Next, I added .+ between the two groups. The dot represents any character except a newline, and the plus sign means that the preceding pattern repeats one or more times. This means that between the two groups there is a run of characters we can ignore. The second group then takes only digits that start at a word boundary, so it matches 20.

- Finally, we passed the combined pattern to the re.search() method to find the match.

- The groups() method

- Using the groups() method of a Match object, we can extract all the group matches at once. It returns them as a tuple.


- Access Each Group Result Separately
- We can use the group() method to extract each group result separately by passing a group index. Capturing groups are numbered by counting their opening parentheses from left to right. In our case, we used two groups.

- Please note that unlike string indexing, which always starts at 0, group numbering starts at 1.

- Group number 0 always refers to the entire match, that is, the whole substring matched by the pattern. If you call the group() method with no arguments or with 0 as the argument, you get the entire match, which is not necessarily the entire target string.

- To access the text matched by each regex group, pass the group’s number to the group(group_number) method.

- So the first group is group 1, the second group is group 2, and so on.

In [114]:
# Extract first group
print(result.group(1))

# Extract second group
print(result.group(2))

# Entire match (group 0)
print(result.group(0))

PINEAPPLE
20
PINEAPPLE ice cream is 20


- So this is the simple way to access each of the groups as long as the patterns were matched.
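
- The phrase "as long as the patterns were matched" matters: re.search() returns None when nothing matches, and calling group() on None raises AttributeError. A minimal sketch of guarding against that follows; the helper name extract_word_and_number and the second test string are hypothetical.

```python
import re

pattern = r"(\b[A-Z]+\b).+(\b\d+)"

def extract_word_and_number(text):
    """Return (word, number) or None when the pattern does not match."""
    result = re.search(pattern, text)
    if result is None:
        # Calling result.group() here would raise AttributeError
        return None
    return result.group(1), result.group(2)

found = extract_word_and_number("The price of PINEAPPLE ice cream is 20")
missing = extract_word_and_number("no uppercase word or number here")

print(found)
print(missing)
```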


- Regex Capture Group Multiple Times
- In the earlier examples, we used the search() method. It returns only the first match of the pattern. But what if a string contains multiple occurrences of a regex group and you want to extract all of them?

- In this section, we will learn how to capture all matches of a regex group. To do that, we use the finditer() method.

- The finditer() method finds all matches and returns an iterator yielding Match objects for the regex pattern. We can then iterate over the Match objects and extract their values.

- Note: The findall() method returns a list of strings (or a list of tuples of strings when the pattern has several groups), not Match objects, so the group() and groups() methods cannot be applied to its result. If you try, you will get AttributeError: 'list' object has no attribute 'groups'.

- So use finditer() when you need a Match object for every match of the groups.

In [115]:
import re

target_string = "The price of ice-creams PINEAPPLE 20 MANGO 30 CHOCOLATE 40"

# two groups enclosed in separate parentheses
# group 1: matches an uppercase word
# group 2: matches a number
# you can compile a pattern or pass it directly to re.finditer()
pattern = re.compile(r"(\b[A-Z]+\b).(\b\d+\b)")

# find all matches to groups
for match in pattern.finditer(target_string):
    # extract words
    print(match.group(1))
    # extract numbers
    print(match.group(2))

PINEAPPLE
20
MANGO
30
CHOCOLATE
40
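
- For comparison, here is a sketch of what findall() gives for the same pattern. It is a perfectly good choice when only the captured strings are needed; finditer() is the one to reach for when you also want Match features such as span(). The variable name pairs is illustrative.

```python
import re

target_string = "The price of ice-creams PINEAPPLE 20 MANGO 30 CHOCOLATE 40"
pattern = re.compile(r"(\b[A-Z]+\b).(\b\d+\b)")

# findall() returns plain tuples of strings, one tuple per match
pairs = pattern.findall(target_string)
print(pairs)

# finditer() yields Match objects, so positions are available too
for match in pattern.finditer(target_string):
    print(match.group(1), match.group(2), match.span())
```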


## Extract Multiple Group Matches at Once
- One more thing you can do with the group() method is to get several matches back as a tuple by passing more than one group number. This is useful when we want to extract a selection of groups in a single call.

- For example, group(1, 2) returns the matches of groups 1 and 2 as a tuple. Note that the arguments are individual group numbers, not a range: group(1, 5) returns only groups 1 and 5.

- Let’s try this as well.

In [116]:
import re

target_string = "The price of PINEAPPLE ice cream is 20"
# two pattern enclosed in separate ( and ) bracket
result = re.search(r".+(\b[A-Z]+\b).+(\b\d+)", target_string)

print(result.group(1, 2))
# Output ('PINEAPPLE', '20')

('PINEAPPLE', '20')
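
- With only two groups it is hard to see that group() picks individual groups. The sketch below uses a made-up date string with three groups to show that group(1, 3) skips group 2, and that slicing the tuple from groups() is one way to get a contiguous range. The pattern and input are assumptions for illustration.

```python
import re

# Hypothetical date string with three groups: year, month, day
result = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2024-03-15")

# Groups 1 and 3 only; group 2 is skipped
selected = result.group(1, 3)
print(selected)

# To get a contiguous range, slice the tuple returned by groups()
first_two = result.groups()[0:2]
print(first_two)
```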


### Python Regex Metacharacters and Operators


- This article explains how to use metacharacters, or operators, in your Python regular expressions. We will walk through each metacharacter with short and clear examples of using it in your code.

- We can use both special and ordinary characters inside a regular expression. Most ordinary characters, like 'A' or 'p', are the simplest regular expressions; they match themselves. You can concatenate ordinary characters, so the pattern "PYnative" matches the string 'PYnative'.

- Apart from these, we also have special characters called metacharacters. Each metacharacter can turn out to be very helpful when solving programming tasks with regular expressions.


## What is Metacharacter in a Regular Expression?
- In Python, metacharacters are special characters that affect how the regular expressions around them are interpreted. Metacharacters don’t match themselves. Instead, they express rules about what should match. Characters such as |, +, or * are metacharacters. For example, the ^ (caret) metacharacter matches the regex pattern only at the start of the string.

- Metacharacters are also called operators, signs, or symbols.


- First, let’s see the list of regex metacharacters we can use in Python and their meaning.

- . (Dot): Matches any character except a newline.
- ^ (Caret): Matches the pattern only at the start of the string.
- $ (Dollar): Matches the pattern at the end of the string.
- * (Asterisk): Matches 0 or more repetitions of the preceding regex.
- + (Plus): Matches 1 or more repetitions of the preceding regex.
- ? (Question mark): Matches 0 or 1 repetition of the preceding regex.
- [] (Square brackets): Indicates a set of characters. Matches any single character in the brackets. For example, [abc] matches either a, b, or c.
- | (Pipe): Specifies alternative patterns. For example, P1|P2, where P1 and P2 are two different regexes.
- \ (Backslash): Escapes special characters or signals a special sequence. For example, if you are searching for one of the special characters, you can use a \ to escape it.
- [^...]: Matches any single character not in the brackets.
- (...): Matches whatever regular expression is inside the parentheses and captures it as a group. For example, (abc) matches the substring 'abc'.
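
- Since metacharacters don’t match themselves, searching for one as a literal character requires escaping it. The sketch below shows two ways on a made-up string: a manual backslash, and the re.escape() helper from the re module for escaping a whole literal phrase. The sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample text containing metacharacters as literal characters
text = "Total (approx.) is 5000$"

# Unescaped, $ is an end-of-string anchor; escaped, it is a literal dollar sign
literal_dollar = re.search(r"\d+\$", text)
print(literal_dollar.group())

# re.escape() escapes a literal string for safe use inside a pattern
literal_phrase = re.search(re.escape("(approx.)"), text)
print(literal_phrase.group())
```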




## Regex . dot metacharacter
- Inside a regular expression, the dot operator represents any character except the newline character, which is \n. Any character means uppercase or lowercase letters, digits 0 through 9, symbols such as the dollar ($) sign or the pound (#) sign, punctuation marks such as the exclamation mark (!), question mark (?), comma (,), or colon (:), as well as whitespace.

- Let’s write a basic pattern to verify that the DOT matches any character except the new line.

In [117]:
import re

target_string = "Emma loves \n Python"
# dot(.) metacharacter to match any character
result = re.search(r'.', target_string)
print(result.group())
# Output 'E'

# .+ to match any string except newline
result = re.search(r'.+', target_string)
print(result.group())
# Output 'Emma loves '

E
Emma loves 


## Explanation

- Here, I used the search() method to search for the pattern specified in the first argument. In the second call, I used the dot (.) followed by the plus (+) sign. The plus sign is a repetition operator in regular expressions; it means that the preceding character or pattern should repeat one or more times.


- This means that we are looking to match a sequence of at least one character, none of which is a newline.

- Next, we used the group() method to see the result. As you can see, the substring up to the newline (\n) is returned because the dot matches any character except a newline.

## DOT to match a newline character
- If you want the DOT to match the newline character as well, use the re.DOTALL or re.S flag as an argument inside the search() method. Let’s try this also.

In [118]:
import re

str1 = "Emma is a Python developer \n She also knows ML and AI"

# dot(.) characters to match newline
result = re.search(r".+", str1, re.S)
print(result.group())

Emma is a Python developer 
 She also knows ML and AI


## Regex ^ caret metacharacter


- target_string = "Emma is a Python developer and her salary is 5000$ \n Emma also knows ML and AI"

- In Python, the caret operator or sign matches a pattern only at the beginning of the string (unless the re.M flag is used, as shown later in this section). For example, considering our target string, we can note two things.

- We have a newline inside the string.
- Secondly, the string starts with the word Emma, which is a four-letter word.
- So assuming we wanted to match any four-letter word at the beginning of the string, we would use the caret (^) metacharacter. Let’s test this.


In [119]:
import re

target_string = "Emma is a Python developer \n Emma also knows ML and AI"

# caret (^) matches at the beginning of a string
result = re.search(r"^\w{4}", target_string)
print(result.group())
# Output 'Emma'

Emma


## Explanation

- So in this line of code, we are using the search() method, and inside the regular expression pattern, the caret comes first.

- To match a four-letter word at the beginning of the string, I used the \w special sequence, which matches any alphanumeric character: lowercase and uppercase letters, digits, and the underscore character.

- The 4 inside curly braces says that the word character must occur exactly four times in a row, i.e., Emma.

### Caret ( ^ ) to match a pattern at the beginning of each new line
- By default, the caret matches the pattern only at the beginning of the string, even when the string contains newlines.

- However, if you want to match the pattern at the beginning of each new line, then use the re.M flag. The re.M flag is used for multiline matching.

- As you know, our string contains a newline in the middle. Let’s test this.

In [120]:
import re

str1 = "Emma is a Python developer and her salary is 5000$ \nEmma also knows ML and AI"

# caret (^) matches at the beginning of each new line
# Using re.M flag
result = re.findall(r"^\w{4}", str1, re.M)
print(result)
# Output ['Emma', 'Emma']

['Emma', 'Emma']


## Regex $ dollar metacharacter

- This time we are going to have a look at the dollar sign metacharacter, which does the exact opposite of the caret (^) .

- In Python, the dollar ($) operator or sign matches the regular expression pattern at the end of the string (or just before a newline that ends the string). Let’s test this by matching the word AI, which is present at the end of the string, using the dollar ($) metacharacter.

In [121]:
import re

str1 = "Emma is a Python developer \nEmma also knows ML and AI"
# dollar sign($) to match at the end of the string
result = re.search(r"\w{2}$", str1)
print(result.group())
# Output 'AI'

AI


## Regex * asterisk/star metacharacter
- Another very useful and widely used metacharacter in regular expression patterns is the asterisk (*). In Python, the asterisk inside a pattern means that the preceding expression or character should repeat zero or more times.

- The asterisk is greedy: it takes as many repetitions of the preceding expression as possible while still allowing the overall pattern to match.


- Let’s see the example to match all the numbers from the following string using an asterisk (*) metacharacter.

- target_string = "Numbers are 8,23, 886, 4567, 78453"
- Pattern to match: \d\d*

- Let’s understand this pattern first.

- As you can see, the pattern is made of two consecutive \d. The \d special sequence represents any digit.

- The most important thing to keep in mind here is that the asterisk (*) at the end of the pattern means zero or more repetitions of the preceding expression. And in this case, the preceding expression is the last \d, not both of them.

- This means that we are basically searching for numbers with a minimum of one digit and no upper limit on the number of digits.

- We may get the following possible matches:

- A single digit, meaning 0 repetitions according to the asterisk, or
- a two-digit number, meaning 1 repetition according to the asterisk, or
- a three-digit number, meaning 2 repetitions of the last \d, or
- a four-digit number as well.
- There is no upper limit of repetitions enforced by the * (asterisk) metacharacter. However, the lower limit is zero.


- So \d\d* means that the re.findall() method should return all the numbers from the target string.

In [122]:
import re

str1 = "Numbers are 8,23, 886, 4567, 78453"
# asterisk sign(*) to match 0 or more repetitions

result = re.findall(r"\d\d*", str1)
print(result)
# Output ['8', '23', '886', '4567', '78453']

['8', '23', '886', '4567', '78453']
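
The greedy behaviour described in this section is easier to see next to its non-greedy counterpart. The following is a minimal sketch, assuming a made-up HTML-like string; appending ? to the asterisk (`*?`) asks for as few repetitions as possible.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as it can, up to the last '>'
greedy = re.findall(r"<.*>", html)
print(greedy)
# Output ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first '>' that completes a match
lazy = re.findall(r"<.*?>", html)
print(lazy)
# Output ['<b>', '</b>', '<i>', '</i>']
```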


## Regex + Plus metacharacter
- Another very useful and widely used metacharacter in regular expression patterns is the plus (+). In Python, the plus operator (+) inside a pattern means that the preceding expression or character should repeat one or more times.

- The plus is greedy as well: it takes as many repetitions as possible, but it needs at least one.

- Let’s use the same example string to match numbers with two or more digits using the plus (+) metacharacter.

- Pattern to match: \d\d+

- This means that we are basically searching for numbers with a minimum of two digits and no upper limit on the number of digits.

- We can get the following possible matches:

- A two-digit number, meaning 1 repetition according to the plus (+), or
- a three-digit number, meaning 2 repetitions of the last \d, or
- a four-digit number as well.
- There is no upper limit of repetitions enforced by the + (plus) metacharacter. However, the lower limit is 1.

- So \d\d+ means that the re.findall() method should return all the numbers with a minimum of two digits from the target string.

In [123]:
import re

str1 = "Numbers are 8,23, 886, 4567, 78453"
# Plus sign(+) to match 1 or more repetitions
result = re.findall(r"\d\d+", str1)
print(result)
# Output ['23', '886', '4567', '78453']

['23', '886', '4567', '78453']


## The ? question mark metacharacter
- In Python, the question mark operator or sign (?) inside a regex pattern means the preceding character or expression may occur either zero or one time. This means that the number of possible repetitions is strictly limited on both ends.

- Let’s see an example to compare the ? with the * and + metacharacters for handling repetitions.

- Pattern to match: \d\d\d\d\d?

- As you know, the question mark allows the preceding character to occur either zero or one time.

- We have five \d, which means that we want to match numbers having at least four digits, while the fifth \d may occur 0 or 1 times, meaning it is either absent or present once.


In [124]:
import re

target_string = "Numbers are 8,23, 886, 4567, 78453"
# Question mark sign(?) to match 0 or 1 repetitions
result = re.findall(r"\d\d\d\d\d?", target_string)
print(result)
# Output ['4567', '78453']

['4567', '78453']


- The pattern requires at least four digits and allows at most five. And indeed, the result contains only the four-digit and five-digit numbers.
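
A common use of the question mark is to make a single character optional. The sketch below assumes a made-up sentence containing both spellings of one word.

```python
import re

text = "The color of the flag and the colour of the sky"

# u? makes the letter u optional (zero or one occurrence)
result = re.findall(r"colou?r", text)
print(result)
# Output ['color', 'colour']
```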

## The \ backslash metacharacter

- In Python, the backslash metacharacter has two primary purposes inside regex patterns.

- It can signal a special sequence, for example, \d for matching any digit from 0 to 9.
- If your expression needs to search for one of the special characters, you can use a backslash ( \ ) to escape it.
- For example, say you want to search for the question mark (?) inside the string. You can use a backslash to escape it, because the question mark has a special meaning inside a regular expression pattern.
- Let’s understand each of these two scenarios, one by one.

### To indicate a special sequence
- \d for any digit
- \w for any alphanumeric character
- \s for any whitespace character

### Escape special character using a backslash (\)
- Let’s take the DOT metacharacter as you’ve seen it thus far. The DOT has a special meaning when used inside a regular expression. It matches any character except the new line.

- However, in the string, the dot is used to end a sentence. So the question is how to match an actual dot inside a string when the DOT already has a special meaning inside a pattern.


- Well, the solution is to use the backslash, which is called escaping. You can use the backslash to escape the dot inside the regular expression pattern. This removes its special meaning, so you can match the actual dot inside the target string.

- Let’s take an example of this.

In [125]:
import re

str1 = "Emma is a Python developer. Emma salary is 5000$. Emma also knows ML and AI."
# escape dot
res = re.findall(r"\.", str1)
print(res)
# Output ['.', '.', '.']

['.', '.', '.']
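
When the text to match contains several special characters, escaping each one by hand gets error-prone. The standard re.escape() function can build the escaped pattern instead. A small sketch, with a made-up target string:

```python
import re

str1 = "Emma salary is 5000$. Is it 5000$ or 6000$?"

# Escaping by hand: \$ matches a literal dollar sign
by_hand = re.findall(r"\d+\$", str1)
print(by_hand)
# Output ['5000$', '5000$', '6000$']

# re.escape() builds an escaped pattern from a literal string
literal = "6000$?"
pattern = re.escape(literal)
escaped = re.findall(pattern, str1)
print(escaped)
# Output ['6000$?']
```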


## The [] square brackets metacharacter
- The square brackets are beneficial when used in the regex pattern because they represent sets of characters and character classes.

- Let’s say we wanted to look for any occurrences of the letters e, d, or k inside our target string. Or, in simple terms, match any of these letters inside the string. We can use the square brackets to represent a set of characters like [edk].

In [126]:
import re

str1 = "Emma is a Python developer. Emma also knows ML and AI."
res = re.findall(r"[edk]", str1)
print(res)
# Output ['d', 'e', 'e', 'e', 'k', 'd']

['d', 'e', 'e', 'e', 'k', 'd']


- Note: The operation here is an OR, meaning this is equivalent to saying I am looking for any occurrences of e or d or k. The match is case-sensitive, so the capital E in Emma is not returned. The result is a list containing all the matches that were found inside the target string.

- This operation can be beneficial when you want to search for several characters at the same time inside a string without knowing in advance whether any or all of them are part of the string.

- We can also use the square brackets to specify an interval or a range of characters by putting a dash between the two ends of the range.

- For instance, let’s say that we want to match any letter from m to p inside our target string. To do this, we can write a regex like [m-p], which means all the occurrences of the letters m, n, o, p.
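
The [m-p] range described above can be tried on a shortened version of the target string:

```python
import re

str1 = "Emma is a Python developer"

# [m-p] matches any single lowercase letter m, n, o, or p
result = re.findall(r"[m-p]", str1)
print(result)
# Output ['m', 'm', 'o', 'n', 'o', 'p']
```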

## Python Regex Special Sequences and Character classes
- We will see how to use regex special sequences and character classes in Python. A Python regex special sequence represents some special characters and enhances the capability of a regular expression.

### Special sequence
- The special sequence represents the basic predefined character classes, which have a unique meaning. Each special sequence makes specific common patterns more comfortable to use.


- For example, you can use the \d sequence as a simplified definition for the character class [0-9], which means match any digit from 0 to 9.

- Let’s see the list of regex special sequences and their meaning. The special sequences consist of '\' (backslash) and a character from the table below.

| Special Sequence | Meaning |
|---|---|
| \A | Matches pattern only at the start of the string |
| \Z | Matches pattern only at the end of the string |
| \d | Matches any digit. Short for character class [0-9] |
| \D | Matches any non-digit. Short for [^0-9] |
| \s | Matches any whitespace character. Short for character class [ \t\n\x0b\r\f] |
| \S | Matches any non-whitespace character. Short for [^ \t\n\x0b\r\f] |
| \w | Matches any alphanumeric character or underscore. Short for character class [a-zA-Z_0-9] |
| \W | Matches any non-alphanumeric character. Short for [^a-zA-Z_0-9] |
| \b | Matches the empty string, but only at the beginning or end of a word. Matches a word boundary where a word character is [a-zA-Z0-9_]. For example, '\bJessa\b' matches ‘Jessa’, ‘Jessa.’, ‘(Jessa)’, ‘Jessa Emma Kelly’ but not ‘JessaKelly’ or ‘Jessa5’ |
| \B | Opposite of \b. Matches the empty string, but only when it is not at the beginning or end of a word |
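
The \bJessa\b example from the \b row can be checked directly. This is a small sketch; the list name `samples` is just a label chosen for this illustration.

```python
import re

samples = ["Jessa", "Jessa.", "(Jessa)", "Jessa Emma Kelly", "JessaKelly", "Jessa5"]
pattern = re.compile(r"\bJessa\b")

for s in samples:
    # bool(...) is True when the pattern is found anywhere in s
    print(f"{s!r}: {bool(pattern.search(s))}")

# Keep only the strings in which Jessa appears as a separate word
matched = [s for s in samples if pattern.search(s)]
print(matched)
# Output ['Jessa', 'Jessa.', '(Jessa)', 'Jessa Emma Kelly']
```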



## Character classes
- In Python, regex character classes are sets of characters or ranges of characters enclosed by square brackets [].

- For example, [a-z] means match any lowercase letter from a to z.

- Let’s see some of the most common character classes used inside regular expression patterns.

| Character Class | Description |
|---|---|
| [abc] | Match the letter a or b or c |
| [abc][pq] | Match the letter a or b or c followed by either p or q |
| [^abc] | Match any character except a, b, or c (negation) |
| [0-9] | Match any digit from 0 to 9, inclusive (range) |
| [a-z] | Match any lowercase letter from a to z, inclusive (range) |
| [A-Z] | Match any UPPERCASE letter from A to Z, inclusive (range) |
| [a-zA-Z] | Match any lowercase or UPPERCASE letter (range) |
| [m-p2-8] | Ranges: match one character that is either a letter from m to p or a digit from 2 to 8. It does not match the two-character string p2 as a whole |
| [a-zA-Z0-9_] | Match any alphanumeric character or the underscore |
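
A few of these classes are easy to misread, so here is a short sketch with made-up input strings. The last two calls show a common pitfall: a range written as A-z (lowercase z) also covers the characters that sit between Z and a in ASCII order.

```python
import re

# [abc][pq]: one character from the first set, then one from the second
pairs = re.findall(r"[abc][pq]", "ap bq cx aq")
print(pairs)
# Output ['ap', 'bq', 'aq']

# [m-p2-8]: a single character that is either m-p or 2-8
singles = re.findall(r"[m-p2-8]", "map 1289")
print(singles)
# Output ['m', 'p', '2', '8']

# Pitfall: the range A-z also covers [ \ ] ^ _ ` in ASCII order
too_wide = re.findall(r"[A-z]", "x_Y^")
print(too_wide)
# Output ['x', '_', 'Y', '^']

letters_only = re.findall(r"[a-zA-Z]", "x_Y^")
print(letters_only)
# Output ['x', 'Y']
```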

- Now let’s see how to use each special sequence and character class in Python regular expressions.

### Special Sequence \A and \Z
#### Backslash A ( \A )

- The \A sequence matches only at the beginning of the string. Without the re.M flag, it works the same as the caret (^) metacharacter.


- On the other hand, if we have a multi-line string and use the re.M flag, then \A will still match only at the beginning of the string, while the caret will match at the beginning of each new line of the string.

#### Backslash Z ( \Z )

- The \Z sequence matches only at the end of the string. It works much like the dollar ($) metacharacter, except that $ also matches just before a newline at the very end of the string and, with re.M, at the end of each line.

In [127]:
import re

target_str = "Jessa is a Python developer, and her salary is 8000"

# \A to match at the start of a string
# match word starts with capital letter
result = re.findall(r"\A([A-Z].*?)\s", target_str)
print("Matching value", result)
# Output ['Jessa']

# \Z to match at the end of a string
# match number at the end of the string
result = re.findall(r"\d.*?\Z", target_str)
print("Matching value", result)
# Output ['8000']

Matching value ['Jessa']
Matching value ['8000']
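
The difference between the anchors ^ and $ and the sequences \A and \Z only shows up with multi-line text or a trailing newline. The following is a minimal sketch using a made-up two-line string:

```python
import re

text = "first line\nsecond line\n"

# With re.M, ^ matches at every line start, \A only at the string start
line_starts = re.findall(r"^\w+", text, re.M)
print(line_starts)
# Output ['first', 'second']

string_start = re.findall(r"\A\w+", text, re.M)
print(string_start)
# Output ['first']

# $ also matches just before a trailing newline; \Z does not
dollar_match = re.search(r"line$", text)
print(dollar_match is not None)
# Output True

z_match = re.search(r"line\Z", text)
print(z_match is not None)
# Output False
```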


### Special sequence \d and \D
#### Backslash d ( \d )

- The \d matches any digits from 0 to 9 inside the target string.
- This special sequence is equivalent to character class [0-9] .
- Use either \d or [0-9].
#### Backslash capital D ( \D )

- This sequence is the exact opposite of \d, and it matches any non-digit character.
- Any character in the target string that is not a digit would be the equivalent of the \D.
- Also, you can write \D using character class [^0-9]  (caret ^ at the beginning of the character class denotes negation).



- Now let’s do the following:

- Use a special sequence \d inside a regex pattern to find a 4-digit number in our target string.
- Use a special sequence \D inside a regex pattern to find all the non-digit characters.

In [128]:
import re

target_str = "8000 dollar"

# \d to match all digits
result = re.findall(r"\d", target_str)
print(result)
# Output ['8', '0', '0', '0']

# \d+ to match whole numbers (one or more digits in a row)
result = re.findall(r"\d+", target_str)
print(result)
# Output ['8000']

# \D to match non-digits
result = re.findall(r"\D", target_str)
print(result)
# Output [' ', 'd', 'o', 'l', 'l', 'a', 'r']

['8', '0', '0', '0']
['8000']
[' ', 'd', 'o', 'l', 'l', 'a', 'r']
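
A practical use of \D is stripping everything that is not a digit. The sketch below assumes a made-up phone number and uses re.sub() to replace every non-digit with an empty string.

```python
import re

# Hypothetical phone number, used only for illustration
raw_phone = "+1 (555) 010-9999"

# \D matches every non-digit, so replacing it with "" keeps only digits
digits_only = re.sub(r"\D", "", raw_phone)
print(digits_only)
# Output 15550109999
```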


### Special Sequence \w and \W
#### Backslash w ( \w )

- The \w matches any alphanumeric character, also called a word character.
- This includes lowercase and uppercase letters, the digits 0 to 9, and the underscore character.
- For ASCII text, this is equivalent to the character class [a-zA-Z0-9_]. By default, \w also matches letters and digits from other scripts (see the ASCII flag section below).
- You can use either \w or [a-zA-Z0-9_].
#### Backslash capital W ( \W )

- This sequence is the exact opposite of \w, i.e., it matches any NON-alphanumeric character.
- Any character in the target string that is not alphanumeric (and not an underscore) is matched by \W.
- You can write \W using the character class [^a-zA-Z0-9_].

- Now let’s do the following:

- Use the special sequence \w inside a regex pattern to find all alphanumeric characters in the string.
- Use the special sequence \W inside a regex pattern to find all the non-alphanumeric characters.

In [129]:
import re

target_str = "Jessa and Kelly!!"

# \w to match all alphanumeric characters
result = re.findall(r"\w", target_str)
print(result)
# Output ['J', 'e', 's', 's', 'a', 'a', 'n', 'd', 'K', 'e', 'l', 'l', 'y']

# \w{5} to match 5-letter words
result = re.findall(r"\w{5}", target_str)
print(result)
# Output ['Jessa', 'Kelly']

# \W to match NON-alphanumeric
result = re.findall(r"\W", target_str)
print(result)
# Output [' ', ' ', '!', '!']

['J', 'e', 's', 's', 'a', 'a', 'n', 'd', 'K', 'e', 'l', 'l', 'y']
['Jessa', 'Kelly']
[' ', ' ', '!', '!']


### Special Sequence \s and \S
#### Backslash lowercase s ( \s )

- The \s matches any whitespace character inside the target string. Whitespace characters covered by this sequence are as follows:

- Common space generated by the space key on the keyboard (" ")
- Tab character (\t)
- Newline character (\n)
- Carriage return (\r)
- Form feed (\f)
- Vertical tab (\v)

- Also, for ASCII text this special sequence is equivalent to the character class [ \t\n\x0b\r\f], where \x0b is the vertical tab. So you can use either \s or [ \t\n\x0b\r\f].

#### Backslash capital S ( \S )

- This sequence is the exact opposite of \s, and it matches any NON-whitespace character. Any character in the target string that is not whitespace is matched by \S.

- Also, you can write \S using the character class [^ \t\n\x0b\r\f].

- Now let’s do the following:

- Use the special sequence \s inside a regex pattern to find all whitespace characters in our target string.
- Use the special sequence \S inside a regex pattern to find all the NON-whitespace characters.



In [131]:
import re

target_str = "Jessa \t \n  "

# \s to match any whitespace
result = re.findall(r"\s", target_str)
print(result)
# Output [' ', '\t', ' ', '\n', ' ', ' ']

# \S to match non-whitespace
result = re.findall(r"\S", target_str)
print(result)
# Output ['J', 'e', 's', 's', 'a']

# split on white-spaces
result = re.split(r"\s+", "Jessa and Kelly")
print(result)
# Output ['Jessa', 'and', 'Kelly']

# replace each run of whitespace with a single space
result = re.sub(r"\s+", " ", "Jessa   and   \t \t Kelly  ")
print(result)
# Output 'Jessa and Kelly '

[' ', '\t', ' ', '\n', ' ', ' ']
['J', 'e', 's', 's', 'a']
['Jessa', 'and', 'Kelly']
Jessa and Kelly 


### Special Sequence \b and \B
#### Backslash lowercase b ( \b )

- The \b special sequence matches the empty strings bordering the word.  The backslash \b is used in regular expression patterns to signal word boundaries, or in other words, the borders or edges of a word.

- Note: A word is a set of alphanumeric characters surrounded by non-alphanumeric characters (such as space).

- Let’s try to match all 6-letter words using the special sequences \w and \b.

In [132]:
import re

target_str = "  Jessa salary is 8000$ She is Python developer"

# \b to word boundary
# \w{6} to match six-letter word
result = re.findall(r"\b\w{6}\b", target_str)
print(result)
# Output ['salary', 'Python']

# \b need separate word not part of a word
result = re.findall(r"\bthon\b", target_str)
print(result)
# Output []

['salary', 'Python']
[]


### Note: 

- One essential thing to keep in mind here is that the match will be made only for the complete and separate word itself. No match will be returned if the word is contained inside another word.

- For instance, considering the same target string, we can search for the string “ssa” using the \b special sequence like this: "\bssa\b". But we will not get a match because non-alphanumeric characters do not border it on both sides: inside Jessa, it is preceded by the letter e.

- Moreover, the \b sequence always matches the empty string at the boundary between an alphanumeric character and a non-alphanumeric character (or the start or end of the string).

- Therefore keep in mind that the word you’re trying to match with the help of the \b special sequence should be separate, not part of a word.

#### Backslash capital B ( \B )

- This sequence is the exact opposite of \b.

- The special sequence \B matches the empty string between two alphanumeric characters or between two non-alphanumeric characters, that is, only when the position is not at the beginning or end of a word.

- So this sequence can be useful for locating a string inside a specific word.

- For example, let’s use \B to check whether the string ‘thon‘ is inside the target string but not at the beginning of a word. So ‘thon‘ should be part of a larger word in our string, but not at the beginning of the word.

In [133]:
import re

target_str = "Jessa salary is 8000$ She is Python developer"

# \B
result = re.findall(r"\Bthon", target_str)
print(result)
# Output ['thon']

['thon']


- And indeed, we have a match of "thon" inside the word “Python”, not at the beginning of the word. What if we want to check that "thon" is part of a word in the target string but not at the end of that word?

- Well, we have to move the \B sequence to the end of the pattern. Let’s try this also.

In [134]:
result = re.findall(r"thon\B", target_str)
print(result)  # Output [] because "thon" is at the end of the word Python

### Create custom character classes
- We can construct character classes in the following ways:

- Simple classes
- Negation
- Ranges

#### Simple character classes
- The most basic form of a character class is to place a set of characters side-by-side within square brackets.


- For example, the regular expression [phl]ot will match the words “pot”, “hot”, or “lot” because it defines a character class accepting either ‘p’, ‘h’, or ‘l’ as its first character followed by ‘ot’.

- Let’s see the Python example of how to use simple character classes in the regular expression pattern.

In [135]:
import re

target_string = "Jessa loves Python. and her salary is 8000$"

# simple character class [Jde]
# Match the letter J or d or e
result = re.findall(r"[Jde]", target_string)
print(result)
# Output ['J', 'e', 'e', 'd', 'e']

# simple character Class [0-9]
# Match any digit
result = re.findall(r"[0-9]", target_string)
print(result)
# Output ['8', '0', '0', '0']

# character class [Pyt][hs]
# Match P or y or t followed by either h or s.
result = re.findall(r"[Pyt][hs]", target_string)
print(result)
# Output ['th']

['J', 'e', 'e', 'd', 'e']
['8', '0', '0', '0']
['th']


#### Use negation to construct character classes

- To match all characters except those listed inside a square bracket, insert the "^" metacharacter at the character class’s beginning. This technique is known as negation.

- [^abc] matches any character except a, b, or c
- [^0-9] matches any character except digits

In [136]:
import re

target_string = "abcde25"

result = re.findall(r"[^abc]", target_string)
print(result)
# Output ['d', 'e', '2', '5']

# match any character except digits
result = re.findall(r"[^0-9]", target_string)
print(result)
# Output ['a', 'b', 'c', 'd', 'e']

['d', 'e', '2', '5']
['a', 'b', 'c', 'd', 'e']


#### Use ranges to construct character classes
- Sometimes you’ll want to define a character class that includes a range of values, such as the letters “m through p” or the numbers “2 through 6”. To specify a range, simply insert the "-" metacharacter between the first and last character to be matched, such as [m-p] or [2-6].

- Let’s see how to use ranges to construct regex character classes.

- [a-z] matches any lowercase letters from a to z
- [A-Z] matches any UPPERCASE letters from A to Z
- [2-6] matches any digit from 2 to 6


- You can also place different ranges beside each other within the class to further increase the match possibilities.

- For example, [a-zA-Z] will match any letter of the alphabet: a to z (lowercase) or A to Z (uppercase).

In [137]:
import re

target_string = "ABCDefg29"

print(re.findall(r"[a-z]", target_string))
# Output ['e', 'f', 'g']

print(re.findall(r"[A-Z]", target_string))
# Output ['A', 'B', 'C', 'D']

print(re.findall(r"[a-zA-Z]", target_string))
# Output ['A', 'B', 'C', 'D', 'e', 'f', 'g']

print(re.findall(r"[2-6]", target_string))
# Output ['2']

print(re.findall(r"[A-C2-8]", target_string))
# Output ['A', 'B', 'C', '2']

['e', 'f', 'g']
['A', 'B', 'C', 'D']
['A', 'B', 'C', 'D', 'e', 'f', 'g']
['2']
['A', 'B', 'C', '2']


## Python Regex Flags

- Python regex functions accept optional flags that change how a pattern is matched by match(), search(), and split(), among others.

- Most re module functions that take a pattern, such as compile(), search(), match(), findall(), finditer(), split(), and sub(), accept an optional flags argument that enables various features and syntax variations.

- For example, say you want to search for a word inside a string using regex. You can enhance this regex’s capability by adding the re.I flag as an argument to the search method to enable case-insensitive searching.

- You will learn how to use the most common regex flags available in Python with short and clear examples.




| Flag | Long name | Meaning |
|---|---|---|
| re.A | re.ASCII | Perform ASCII-only matching instead of full Unicode matching |
| re.I | re.IGNORECASE | Perform case-insensitive matching |
| re.M | re.MULTILINE | Used with the metacharacters ^ (caret) and $ (dollar). When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line (right after each \n), and $ matches at the end of the string and at the end of each line (right before each \n) |
| re.S | re.DOTALL | Make the DOT (.) special character match any character at all, including a newline. Without this flag, DOT (.) will match anything except a newline |
| re.X | re.VERBOSE | Allow comments and extra whitespace in the regex. This flag is useful to make a regex more readable |
| re.L | re.LOCALE | Make \w, \W, \b, \B, and case-insensitive matching dependent on the current locale. Use only with bytes patterns |
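
Flags can be combined with the bitwise OR operator (|). As a small sketch, assuming a made-up three-line string, the call below ignores case and lets ^ match at the start of every line at the same time:

```python
import re

text = "emma joined in May\nEMMA leads the team\nTom works with Emma"

# re.I | re.M: ignore case AND let ^ match at the start of every line
result = re.findall(r"^emma", text, re.I | re.M)
print(result)
# Output ['emma', 'EMMA']
```

The Emma on the last line is not returned because it is not at the start of a line.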


### IGNORECASE flag
- First of all, let’s see the role of the re.I flag, which stands for ignore case. Pass this flag to a regex method as an argument to perform case-insensitive matching. You can specify this flag in two ways:

- re.I
- re.IGNORECASE

In [138]:
import re

target_str = "KELLy is a Python developer at a PYnative. kelly loves ML and AI"

# Without using re.I
result = re.findall(r"kelly", target_str)
print(result)
# Output ['kelly']

# with re.I
result = re.findall(r"kelly", target_str, re.I)
print(result)
# Output ['KELLy', 'kelly']

# with re.IGNORECASE
result = re.findall(r"kelly", target_str, re.IGNORECASE)
print(result)
# Output ['KELLy', 'kelly']

['kelly']
['KELLy', 'kelly']
['KELLy', 'kelly']


- Notice that the word “kelly” occurs two times inside this string: first, mostly in uppercase (KELLy) at the beginning of the first sentence, and second, in all lowercase.

- In the first re.findall() method, we got only one occurrence because, by default, the matching is case sensitive.

- And in the second re.findall() method, we got 2 occurrences because we changed the case sensitive behavior of regex using re.I so that it can find all the occurrences of a word regardless of any of its letters being uppercase or lowercase.
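
One detail worth knowing when using flags with re.sub(): its fourth positional argument is count, not flags, so it is safer to pass flags by keyword. A minimal sketch with a shortened version of the target string:

```python
import re

target_str = "KELLy is a Python developer. kelly loves ML and AI"

# flags= is passed by keyword here: the fourth positional
# argument of re.sub() is count, not flags
result = re.sub(r"kelly", "Kelly", target_str, flags=re.I)
print(result)
# Output Kelly is a Python developer. Kelly loves ML and AI
```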




### DOTALL flag

- Now, let’s see the role of the re.S flag. You can specify this flag in two ways:

- re.S
- re.DOTALL
- As you know, by default, the dot (.) metacharacter inside the regular expression pattern represents any character, be it a letter, digit, symbol, or punctuation mark, except the newline character, which is \n.

- The re.S flag makes this exception disappear by enabling the DOT (.) metacharacter to match any possible character, including the newline character, hence its name DOTALL.

- This can prove to be pretty useful in some scenarios, especially when the target string spans multiple lines.

- Now let’s use the re.search() method with and without the re.S flag.

In [139]:
import re

# string with newline character
target_str = "ML\nand AI"

# Match any character
result = re.search(r".+", target_str)
print("Without using re.S flag:", result.group())
# Output 'ML'

# With re.S flag
result = re.search(r".+", target_str, re.S)
print("With re.S flag:", result.group())
# Output 'ML\nand AI'

# With re.DOTALL flag
result = re.search(r".+", target_str, re.DOTALL)
print("With re.DOTALL flag:", result.group())
# Output 'ML\nand AI'

Without using re.S flag: ML
With re.S flag: ML
and AI
With re.DOTALL flag: ML
and AI


- In the first call of the re.search() method, the DOT did not match the \n, so the match stopped there. After adding the re.S flag in the next call, the dot matched the entire string.

### VERBOSE flag
- The re.X flag stands for verbose. This flag allows more flexibility and better formatting when writing more complex regex patterns passed to match(), search(), or other regex methods.

- You can specify this flag in two ways:

- re.X
- re.VERBOSE
- The verbose flag allows the following inside the regex pattern:


- Better spacing, indentation, and a clean format for longer and more intricate patterns.
- Comments right inside the pattern for later reference, using the hash sign (#).
#### When to use

- Use it when a pattern starts to look complicated. You can make the pattern more readable by adding indentation and comments using re.X or re.VERBOSE.

In [140]:
import re

target_str = "Jessa is a Python developer, and her salary is 8000"

# re.X to add indentation and comments in the regex
result = re.search(r"""(^\w{2,}) # match a word of 2 or more characters at the start
                        .+(\d{4}$) # match 4-digit number at the end """, target_str, re.X)
# Word at the start of the string
print(result.group(1))
# Output 'Jessa'

# 4-digit number
print(result.group(2))
# Output 8000

Jessa
8000
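
In verbose mode, whitespace inside the pattern is ignored, so a space that should be matched has to be written explicitly (for example as \s). The following is a sketch of a longer commented pattern; the log line and its layout are hypothetical and serve only as sample input.

```python
import re

# Hypothetical log line used only for this illustration
log_line = "2024-05-17 ERROR disk full"

pattern = re.compile(r"""
    (\d{4})-(\d{2})-(\d{2})  # date: year, month, day
    \s+                      # whitespace must be written explicitly
    ([A-Z]+)                 # log level in capital letters
""", re.VERBOSE)

match = pattern.search(log_line)
print(match.groups())
# Output ('2024', '05', '17', 'ERROR')
```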


### MULTILINE flag
- You can specify this flag in two ways:

- re.M
- re.MULTILINE
- The re.M flag is passed as an argument to a regex method to match against a multiline block of text.

- Note: This flag affects the metacharacters ^ and $.

- By default, the caret (^) matches a pattern only at the beginning of the string.
- By default, the dollar ($) matches a pattern only at the end of the string.
- When this flag is specified, ^ also matches at the start of each line (right after each \n), and $ also matches at the end of each line (right before each \n).

In [141]:
import re

target_str = "Joy lucky number is 75\nTom lucky number is 25"

# find a 3-letter word at the start of each line
# Without re.M or re.MULTILINE flag
result = re.findall(r"^\w{3}", target_str)
print(result)
# Output ['Joy']

# find a 2-digit number at the end of each line
# Without re.M or re.MULTILINE flag
result = re.findall(r"\d{2}$", target_str)
print(result)
# Output ['25']

# With re.M or re.MULTILINE
# find a 3-letter word at the start of each line
result = re.findall(r"^\w{3}", target_str, re.MULTILINE)
print(result)
# Output ['Joy', 'Tom']

# With re.M
# find a 2-digit number at the end of each line
result = re.findall(r"\d{2}$", target_str, re.M)
print(result)
# Output ['75', '25']

['Joy']
['25']
['Joy', 'Tom']
['75', '25']
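
- As a complement to the example above, the following sketch contrasts ^ and $ with \A and \Z, which stay anchored to the whole string even when re.M is set. It also shows the inline form (?m). The variable names are chosen here for illustration.

```python
import re

target_str = "Joy lucky number is 75\nTom lucky number is 25"

# ^ and $ are line-aware under re.M
line_starts = re.findall(r"^\w{3}", target_str, re.M)
print(line_starts)
# Output ['Joy', 'Tom']

# \A and \Z ignore re.M and anchor to the whole string
string_start = re.findall(r"\A\w{3}", target_str, re.M)
string_end = re.findall(r"\d{2}\Z", target_str, re.M)
print(string_start)
# Output ['Joy']
print(string_end)
# Output ['25']

# The inline form (?m) at the start of the pattern is equivalent to passing re.M
inline = re.findall(r"(?m)\d{2}$", target_str)
print(inline)
# Output ['75', '25']
```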


### ASCII flag
- You can specify this flag in two ways

- re.A
- re.ASCII
- Makes \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode (str) patterns and is ignored for bytes patterns.

In [142]:
import re

# string with ASCII and Unicode characters
target_str = "虎太郎 and Jessa are friends"

# Without re.A or re.ASCII
# To match all 3-letter words
result = re.findall(r"\b\w{3}\b", target_str)
print(result)
# Output ['虎太郎', 'and', 'are']

# With re.A or re.ASCII
# regex to match only 3-letter ASCII words
result = re.findall(r"\b\w{3}\b", target_str, re.A)
print(result)
# Output ['and', 'are']

['虎太郎', 'and', 'are']
['and', 'are']
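
- The same flag also changes what \d accepts. The sketch below uses a made-up string containing Devanagari digits (written as Unicode escapes) to show that \d matches any Unicode decimal digit by default and only 0-9 under re.ASCII. The sample text is an assumption for illustration.

```python
import re

# Hypothetical sample: ASCII digits "42" and Devanagari digits U+096A U+0968
target_str = "Room 42 and room \u096a\u0968"

# Without re.A, \d matches any Unicode decimal digit
unicode_digits = re.findall(r"\d+", target_str)
print(unicode_digits)
# Output ['42', '४२']

# With re.A, \d matches only the ASCII digits 0-9
ascii_digits = re.findall(r"\d+", target_str, re.ASCII)
print(ascii_digits)
# Output ['42']
```

- This matters when a match is later passed to code that expects only the characters 0-9, such as validation of numeric IDs.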


### Find the position of a regex match using span(), start(), and end()


- In this section, we will see how to locate the position of a regex match in a string using the start(), end(), and span() methods of the Python re.Match object.

- We will solve the following three scenarios

- Get the start and end position of a regex match in a string
- Find the indexes of all regex matches
- Get the positions and values of each match
- Note: The Python re module offers the search(), match(), and finditer() methods to match a regex pattern. search() and match() return a Match object if a match is found (and None otherwise), while finditer() returns an iterator of Match objects. Use the Match object's start(), end(), and span() methods to extract information about the matching string.


- These Match object methods are used to access the index positions of the matching string.

- start() returns the starting position of the match
- end() returns the ending position of the match (the index just past the last matched character)
- span() returns a tuple containing the (start, end) positions of the match


### Example to get the position of a regex match
- In this example, we will search for any 4-digit number inside the string. To achieve this, we must first write the regular expression pattern.

- Pattern to match any 4 digit number: \d{4}

- Steps:

- Search the pattern using the search() method.
- Next, we can extract the match value using group()
- Now, we can use the start() and end() methods to get the starting and ending index of the match.
- Also, we can use the span() method to get both start and end indexes in a single tuple.


In [143]:
import re

target_string = "Abraham Lincoln was born on February 12, 1809,"
# \d to match digits
res = re.search(r'\d{4}', target_string)
# match value
print(res.group()) 
# Output 1809

# start and end position
print(res.span())
# Output (41, 45)

# start position
print(res.start())
# Output 41

# end position
print(res.end())
# Output 45

1809
(41, 45)
41
45
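
- start(), end(), and span() also accept an optional group number, which returns the position of that capture group instead of the whole match. The sketch below extends the same target string with a two-group pattern; the pattern itself is an assumption added for illustration.

```python
import re

target_string = "Abraham Lincoln was born on February 12, 1809,"
# Two capture groups: the day and the year
res = re.search(r"(\d{1,2}), (\d{4})", target_string)

# position of the whole match "12, 1809"
print(res.span())
# Output (37, 45)

# position of group 1 (the day)
print(res.span(1))
# Output (37, 39)

# position of group 2 (the year)
print(res.start(2), res.end(2))
# Output 41 45
```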


#### Access the matching string using start() and end()
- You can save these positions and use them whenever you want to retrieve the matching string from the target string. String slicing gives direct access to the matching string using the index positions obtained from the start() and end() methods.

In [144]:
import re

target_string = "Abraham Lincoln was born on February 12, 1809,"
res = re.search(r'\d{4}', target_string)
print(res.group())
# Output 1809

# save start and end positions
start = res.start()
end = res.end()
print(target_string[start:end])
# Output 1809

1809
1809
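
- The saved positions can also be used to rebuild the string around the match. The following sketch wraps the match in square brackets and guards against the case where search() returns None. The bracket markers and variable names are arbitrary choices for illustration.

```python
import re

target_string = "Abraham Lincoln was born on February 12, 1809,"
res = re.search(r"\d{4}", target_string)

# search() returns None when nothing matches, so check before calling span()
if res is not None:
    start, end = res.span()
    highlighted = target_string[:start] + "[" + target_string[start:end] + "]" + target_string[end:]
    print(highlighted)
    # Output Abraham Lincoln was born on February 12, [1809],

# a pattern with no match: calling .start() on this result would raise AttributeError
missing = re.search(r"\d{5}", target_string)
print(missing)
# Output None
```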


### Find the indexes of all regex matches
- Suppose you are finding all matches of a regular expression in Python and, apart from the match values, you also want the indexes of all matches. In such cases, use the finditer() method of the Python re module instead of findall().

- The findall() method returns only the matched text in a Python list, with no position information. finditer(), on the other hand, returns an iterator yielding Match objects, so we can iterate over them to extract each match along with its position.

- In this example, we will find all 5-letter words inside the following string and also print their start and end positions.

In [145]:
import re

target_string = "Jessa scored 56 and Kelly scored 65 marks"
count = 0
# \w matches any word character (letter, digit, or underscore)
# \b indicates a word boundary
# {5} indicates exactly five characters
for match in re.finditer(r'\b\w{5}\b', target_string):
    count += 1
    print("match", count, match.group(), "start index", match.start(), "End index", match.end())

match 1 Jessa start index 0 End index 5
match 2 Kelly start index 20 End index 25
match 3 marks start index 36 End index 41
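
- To keep the positions and values of each match for later use rather than printing them, they can be collected into a list. This is a minimal sketch on the same target string; the 2-digit pattern and the list name are assumptions for illustration.

```python
import re

target_string = "Jessa scored 56 and Kelly scored 65 marks"

# Collect (value, (start, end)) for every standalone 2-digit number
matches = [(m.group(), m.span()) for m in re.finditer(r"\b\d{2}\b", target_string)]
print(matches)
# Output [('56', (13, 15)), ('65', (33, 35))]

# enumerate() can replace a manual counter when numbering matches
for number, (value, (start, end)) in enumerate(matches, start=1):
    print("match", number, value, "start index", start, "End index", end)
```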


In [146]:
# find the indexes of all occurrences of a word in a string

import re

target_string = "Emma knows Python. Emma knows ML and AI"
# find all occurrences of the word emma (case-insensitive)
# and print the index of each occurrence
cnt = 0
for match in re.finditer(r'emma', target_string, re.IGNORECASE):
    cnt += 1
    print("match", cnt, "start index", match.start(), "End index", match.end())

match 1 start index 0 End index 4
match 2 start index 19 End index 23
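
- A plain pattern such as emma also matches inside longer words. The sketch below wraps the search in a small helper that adds word boundaries and escapes the word. The helper name find_word_spans and the sample text are hypothetical, introduced only for this illustration.

```python
import re

def find_word_spans(word, text):
    """Return (start, end) for each whole-word, case-insensitive match."""
    # re.escape() protects any regex metacharacters inside the word
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]

# Hypothetical sample text where "Emma" also appears inside "Emmanuel"
text = "Emma met Emmanuel. emma waved."

# Without word boundaries, the match inside "Emmanuel" is included
loose = [m.span() for m in re.finditer(r"emma", text, re.IGNORECASE)]
print(loose)
# Output [(0, 4), (9, 13), (19, 23)]

# With word boundaries, only the standalone words are returned
print(find_word_spans("emma", text))
# Output [(0, 4), (19, 23)]
```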


### Points to be remembered while using the start() method
- Since the re.match() method only checks if the regular expression matches at the start of a string, start() will always be zero.

- However, the re.search() method scans through the entire target string and looks for occurrences of the pattern that we want to find, so the match may not start at zero in that case.

- Now let’s match any ten consecutive alphanumeric characters in the target string using both the match() and search() methods.

In [147]:
import re

target_string = "Emma is a basketball player who was born on June 17, 1993"
# match method with pattern and target string using match()
result = re.match(r"\w{10}", target_string)
# printing  match
print("Match: ", result) # None

# using search()
result = re.search(r"\w{10}", target_string)
# printing match
print("Match value: ", result.group()) # basketball
print("Match starts at", result.start()) # index 10

Match:  None
Match value:  basketball
Match starts at 10
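
- One nuance worth noting: the zero start position applies to the module-level re.match() function. A compiled pattern's match() method accepts an optional pos argument that sets where matching begins, and start() then reports that index. The sketch below reuses the same target string; the pos value 10 is taken from the search() result above.

```python
import re

target_string = "Emma is a basketball player who was born on June 17, 1993"
pattern = re.compile(r"\w{10}")

# module-level style: the match is anchored at index 0, so this fails
print(pattern.match(target_string))
# Output None

# compiled pattern with pos=10: the match is anchored at index 10
result = pattern.match(target_string, 10)
print(result.group())
# Output basketball
print(result.start())
# Output 10
```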
