In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# <center>Python RegEx</center>
---

Regular expressions are a way of specifying patterns of characters that you want to match in a string. They are a powerful tool for text **processing** and **manipulation**. Python has a built-in module called **`re`** that provides various functions and methods to work with regular expressions.

A regular expression is composed of two types of characters: **normal characters** and **metacharacters**. Normal characters match themselves, while metacharacters have special meanings and functions. For example, the metacharacter **`.`** matches any character except a newline, while the metacharacter **`+`** means one or more repetitions of the preceding character or group.

To use regular expressions in Python, import the **`re`** module and call one of its functions, such as `re.search`, `re.findall`, or `re.sub`. Most of these functions take a regular expression pattern as the first argument and the string to search as the second; `re.sub` also takes the replacement text, placed between the pattern and the string. The return value depends on the function: `re.search` returns a match object (or `None` if nothing matches), `re.findall` returns a list, and `re.sub` returns a new string.

In [2]:
# Import the re module
import re

# Define a string
text = "The rain in Spain falls mainly on the plain"

# Define a regular expression pattern
pattern = "ain"

# Use re.search to find the first occurrence of the pattern in the string
match = re.search(pattern, text)

print(text)

# Print the match object
print(match) # Output: <re.Match object; span=(5, 8), match='ain'>

The rain in Spain falls mainly on the plain
<re.Match object; span=(5, 8), match='ain'>


In [3]:
# Use re.findall to find all occurrences of the pattern in the string
matches = re.findall(pattern, text)

print(text)

# Print the list of matches
print(matches) # Output: ['ain', 'ain', 'ain', 'ain']

The rain in Spain falls mainly on the plain
['ain', 'ain', 'ain', 'ain']


In [4]:
# Use re.sub to replace all occurrences of the pattern in the string with another string
new_text = re.sub(pattern, "ane", text)

print(text)

# Print the modified string
print(new_text) # Output: The rane in Spane falls manely on the plane

The rain in Spain falls mainly on the plain
The rane in Spane falls manely on the plane


## Raw Strings
---
Raw strings in Python are string literals in which a backslash (**`\`**) is kept as a literal character instead of starting an escape sequence. The prefix only changes how the literal is read; the result is an ordinary `str`. Raw strings are useful for text that contains many backslashes, such as regular expressions or file paths on Windows.

To create a raw string in Python, prefix the string literal with **`r`** or **`R`**, as in `r'...'` or `R'...'`. For example:

In [5]:
# A regular string with escape sequences
s = 'Hello\nWorld'
print(s)

# A raw string with literal backslashes
s = r'Hello\nWorld'
s = R'Hello\nWorld' # the uppercase prefix R behaves the same as r
print(s)

Hello
World
Hello\nWorld


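A raw string is only a way of writing a literal, so an existing string cannot be converted into one. To see the escape sequences in a string rather than their effect, you can use the built-in **`repr()`** function, which returns a printable representation of an object (for a string, this includes the surrounding quotes). For example: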

In [6]:
# A regular string
s = 'Hello\nWorld'

# Show its escape sequences instead of interpreting them, using repr()
print(repr(s))

'Hello\nWorld'
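
To make the distinction concrete, here is a minimal sketch comparing a raw literal with its escaped equivalent and checking what `repr()` returns. The variable names are illustrative only.

```python
# A raw literal and a regular literal with a doubled backslash are equal strings
raw = r'Hello\nWorld'
escaped = 'Hello\\nWorld'
print(raw == escaped)   # True
print(len(raw))         # 12: the backslash and the 'n' are two separate characters

# repr() does not change the string; it returns a new string that shows the escapes
s = 'Hello\nWorld'
print(len(s))           # 11: \n is a single newline character
shown = repr(s)
print(shown)            # 'Hello\nWorld' (quotes included)
print(len(shown))       # 14
```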


---

In [7]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha Haha

Metacharacters (need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

subrata.com

6294505807
123 321 53341

Subrata
Rahul
Bapai
Mrinal
Kailash

"Emma's luck numbers are 251 761 231 451"
"Blue Berries are better than black berries"

'''

sentence = "Start a sentence and then bring it to an end."

## re.compile
**`re.compile`** is a function that compiles a regular expression pattern into a regular expression object, whose methods such as **`match()`**, **`search()`**, and **`finditer()`** perform the matching. The advantage of compiling a regular expression is that you can define it once and reuse it multiple times without rewriting it. You can also specify flags to modify the behavior of the regular expression, such as case-insensitivity (`re.IGNORECASE`) or multiline matching (`re.MULTILINE`).

```python
re.compile(pattern, flags=0)
```

## re.finditer
**`re.finditer`** is a function that returns an iterator yielding match objects over all non-overlapping matches for the regular expression pattern in a string. The match objects contain information about the start and end positions, groups and named groups of the match. You can use **`re.finditer`** to loop over all the matches in a string and perform some action on each match. Compiled pattern objects offer the same operation as the method `pattern.finditer(string)`.

```python
re.finditer(pattern, string, flags=0)
```
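
As a rough sketch of what each match object carries, the snippet below uses a made-up string and a pattern with two named groups. The group names `name` and `value` are illustrative choices, not part of the `re` API.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration
sample = "x=10, y=25"
pattern = re.compile(r"(?P<name>[a-z])=(?P<value>\d+)")

for m in pattern.finditer(sample):
    # span() gives (start, end); group() accepts a group name or number
    print(m.span(), m.group("name"), m.group("value"), m.groupdict())

# The iterator is consumed by the loop, so call finditer again to reuse it
spans = [m.span() for m in pattern.finditer(sample)]
print(spans)  # [(0, 4), (6, 10)]
```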

**Example 1:** Compile a pattern to find three consecutive digits in a string

In [8]:
# Example 1: Compile a pattern to find three consecutive digits in a string
import re

pattern = re.compile(r"\d{3}") # \d means any digit, {3} means exactly three times
matches = pattern.finditer(text_to_search)
print(type(matches))

for match in matches:
    print(match.group(),end=", ") # match.group() returns the matched substring

<class 'callable_iterator'>
123, 456, 789, 629, 450, 580, 123, 321, 533, 251, 761, 231, 451, 

**Example 2:** Compile a pattern with flags to find words that start with b or B in a string

In [9]:
# Example 2: Compile a pattern with flags to find words that start with b or B in a string
import re

pattern = re.compile(r"\bb\w+", flags=re.IGNORECASE) # \b means word boundary, \w+ means one or more word characters; re.IGNORECASE lets b match both b and B
matches = pattern.finditer(text_to_search)
print(type(matches))

for match in matches:
    print(match.group(), end=", ")

<class 'callable_iterator'>
be, Bapai, Blue, Berries, better, black, berries, 

## Most used methods of re module
* **`re.match()`**: This method checks for a match only at the beginning of the string.

In [10]:
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Search successful.


* **`re.search()`**: This method checks for a match anywhere in the string.

In [11]:
pattern = 'world'
test_string = 'Hello world'
result = re.search(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Search successful.
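
The difference between `re.match()` and `re.search()` is easiest to see by running both on the same input. The sketch below reuses the string from the cell above and also shows `re.fullmatch()`, which requires the entire string to match; the variable names are illustrative.

```python
import re

test_string = 'Hello world'

# re.match only tries position 0, so 'world' is not found
at_start = re.match('world', test_string)
print(at_start)                  # None

# re.search scans forward and finds the same pattern at index 6
anywhere = re.search('world', test_string)
print(anywhere.span())           # (6, 11)

# re.fullmatch succeeds only if the pattern covers the whole string
partial = re.fullmatch('Hello', test_string)
whole = re.fullmatch(r'Hello \w+', test_string)
print(partial, whole is not None)  # None True
```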


* **`re.findall()`**: This method returns a list of all matches in the string.

In [12]:
pattern = r'\d+'
test_string = '12 drummers drumming, 11 pipers piping, 1001 lords a-leaping'
result = re.findall(pattern, test_string)
print(result)

['12', '11', '1001']


* **`re.sub()`**: This method replaces every match with a given string (or only the first `count` matches, if the optional `count` argument is given).

In [13]:
pattern = 'blue|white|red'
test_string = 'Roses are red, violets are blue'
result = re.sub(pattern, 'pink', test_string)
print(result)

Roses are pink, violets are pink
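
A few variations on the substitution above, sketched with the same string and pattern: limiting the number of replacements with `count`, counting them with `re.subn`, and passing a function as the replacement. The variable names are illustrative.

```python
import re

test_string = 'Roses are red, violets are blue'
pattern = 'blue|white|red'

# count=1 limits the substitution to the first match
first_only = re.sub(pattern, 'pink', test_string, count=1)
print(first_only)   # Roses are pink, violets are blue

# re.subn returns the new string together with the number of replacements
replaced, n = re.subn(pattern, 'pink', test_string)
print(replaced, n)  # Roses are pink, violets are pink 2

# The replacement can also be a function that receives each match object
shouted = re.sub(pattern, lambda m: m.group().upper(), test_string)
print(shouted)      # Roses are RED, violets are BLUE
```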


## Metacharacters
---
Metacharacters are characters with a special meaning in regular expressions. They are used to define the search criteria and any text manipulations.

1. **`.`** The metacharacter `.` matches any character except a newline in Python regular expressions. It is also called the dot or wildcard metacharacter, and it stands for exactly one character in a pattern.

For example, if you want to match a three-character sequence that starts with “`h`” and ends with “`t`”, you can use the pattern "`h.t`". This matches `“hat”, “hot”, “hit”`, etc., but also non-words such as `“h t”` or `“h9t”`, because the dot accepts any single character, not only letters.

In [14]:
import re

# create a pattern object from a metacharacter expression
pattern = re.compile(".")

# find all matches of that pattern in a string (every single character matches ".")
match = re.findall(pattern, "The hat is hot")

# print the list of matches: one entry per character, including spaces
print(match)

['T', 'h', 'e', ' ', 'h', 'a', 't', ' ', 'i', 's', ' ', 'h', 'o', 't']


In [15]:
# find all the matches of that pattern in a string
matches = re.findall("h.t", "The hat is hot")

# print the matches list
print(matches) # ['hat', 'hot']

['hat', 'hot']
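
Two details of the dot are worth checking by hand: it skips newlines unless the `re.DOTALL` flag is set, and a literal dot has to be escaped. The following is a small sketch with made-up input strings.

```python
import re

two_lines = "hat\nhot"

# By default '.' does not match '\n', so 't.h' cannot span the line break
no_flag = re.findall(r"t.h", two_lines)
print(no_flag)        # []

# With re.DOTALL the dot also matches the newline
with_dotall = re.findall(r"t.h", two_lines, flags=re.DOTALL)
print(with_dotall)    # ['t\nh']

# An unescaped dot matches any character; '\.' matches only a literal dot
unescaped = re.findall(r"9.99", "9.99 9x99")
literal = re.findall(r"9\.99", "9.99 9x99")
print(unescaped)      # ['9.99', '9x99']
print(literal)        # ['9.99']
```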


2. **`\d`** matches **`any digit`** (0-9)
3. **`\D`** matches any **`non-digit`**

The metacharacter **`\d`** matches **any digit** from `0` to `9`. For example, the pattern `\d\d` matches any two consecutive digits, such as `12` or `99`. The metacharacter **`\D`** matches any **non-digit character**. For example, the pattern `\D\D` matches any two consecutive non-digits, such as `ab` or `%&`.

In [16]:
import re

# using \d to match digits
text = "My phone number is 123-456-7890"
pattern = r"\d\d\d-\d\d\d-\d\d\d\d" # match a phone number format
match = re.search(pattern, text)
print(match,"\n")
if match:
    print(match.group()) # prints 123-456-7890

<re.Match object; span=(19, 31), match='123-456-7890'> 

123-456-7890


In [17]:
# using \D to match non-digits
text = "The price is $9.99"
pattern = r"\D+" # match one or more non-digits
match = re.findall(pattern, text)
if match:
    print(match) # prints ['The price is $', '.']

['The price is $', '.']


4. **`\w`** matches any **`alphanumeric character`** or underscore, i.e. `[a-zA-Z0-9_]`
5. **`\W`** matches any character that is **`not alphanumeric and not an underscore`**

The metacharacter **`\w`** matches **`any alphanumeric character`**, which means any letter from `a to z (lowercase or uppercase)`, any digit from `0 to 9`, or the `underscore character _`. For example, the pattern `\w\w\w` matches any three alphanumeric characters, such as `abc` or `1_2`. 

The metacharacter **`\W`** matches any **`non-alphanumeric character`**, which means any character that is `not a letter, a digit, or an underscore`. For example, the pattern `\W` matches any symbol, such as `%` or `&`.

In [18]:
import re

# using \w to match alphanumeric characters
text = "My username is pynative_01"
pattern = r"\w+" # match one or more alphanumeric characters
match = re.findall(pattern, text)
print(match, "\n")

['My', 'username', 'is', 'pynative_01'] 



In [19]:
# using \W to match non-alphanumeric characters
text = "The price is $9.99"
pattern = r"\W" # match one non-alphanumeric character
match = re.findall(pattern, text)
print(match) # prints the three spaces, $ and dot(.)

[' ', ' ', ' ', '$', '.']
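
One caveat to the `a-z`, `A-Z`, `0-9` description of `\w`: for `str` patterns in Python 3, `\w` also matches letters and digits from other scripts unless the `re.ASCII` flag is given. A minimal sketch, using a made-up string with accented letters:

```python
import re

text = "naïve café_1"

# Default (Unicode) matching: accented letters count as word characters
unicode_words = re.findall(r"\w+", text)
print(unicode_words)   # ['naïve', 'café_1']

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the words are split
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)     # ['na', 've', 'caf', '_1']
```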


6. **`\s`** matches **`any Whitespace Characters`**
7. **`\S`** matches **`any Non-Whitespace Characters`**

The metacharacter **`\s`** matches any **whitespace character**, which means any **space, tab, newline, carriage return, form feed, or vertical tab**. For example, the pattern `\s\s` matches `any two consecutive whitespace characters`, such as `"  "` (two spaces) or `"\n\t"`. 

The metacharacter **`\S`** matches any **non-whitespace character**, which means any character that is **not one of the whitespace characters above**. For example, the pattern `\S` matches any `symbol, letter, digit, or underscore`.

In [20]:
import re

# using \s to match whitespace characters
text = "Hello\tworld\n"
pattern = r"\s" # match one whitespace character
match = re.findall(pattern, text)
if match:
    print(match) # prints ['\t', '\n']

['\t', '\n']


In [21]:
# using \S to match non-whitespace characters
text = "The price is $9.99"
pattern = r"\S+" # match one or more non-whitespace characters
match = re.findall(pattern, text)
if match:
    print(match) # prints ['The', 'price', 'is', '$9.99']

['The', 'price', 'is', '$9.99']
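
To check which characters `\s` accepts, and to show a common use of `\s+`, here is a short sketch. The input strings are made up for illustration.

```python
import re

# Each of these six characters counts as whitespace for \s
whitespace = " \t\n\r\f\v"
found = re.findall(r"\s", whitespace)
print(found)       # [' ', '\t', '\n', '\r', '\x0c', '\x0b']

# \s+ collapses a run of mixed whitespace into a single space
messy = "Hello \t\n  world"
collapsed = re.sub(r"\s+", " ", messy)
print(collapsed)   # Hello world
```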


8. **`\b`** matches a **word boundary**
9. **`\B`** matches a **non-word boundary**

The metacharacter **`\b`** matches a **word boundary**, which is a `position between a word character and a non-word character`, or between a `word character and the beginning or end of a string`. A word character is any letter, digit, or underscore. For example, the pattern `\bcat\b` matches the word “cat” but not “catch” or “concatenate”.

The metacharacter **`\B`** matches a **non-word boundary**, which is a `position where the characters on both sides are word characters or both are non-word characters`. For example, the pattern `\Bcat\B` matches the “cat” inside “concatenate”, but not “cat”, “catch”, or “scat”, because in each of those the match would touch the start or the end of the word.

In [22]:
import re

# using \b to match word boundaries
text = "I like cats and dogs"
pattern = r"\bcats\b" # match cats as a whole word
match = re.search(pattern, text)
print(match, "\n")
if match:
    print(match.group()) # prints cats

<re.Match object; span=(7, 11), match='cats'> 

cats


In [23]:
# using \B to match non-word boundaries
text = "I like catch and concatenate"
pattern = r"\Bcat\B" # match cat with word characters on both sides ("catch" is skipped, "concatenate" matches)
match = re.search(pattern, text)
print(match, "\n")
if match:
    print(match.group()) # prints cat

<re.Match object; span=(20, 23), match='cat'> 

cat
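
The two boundary patterns can be compared side by side on a handful of words. This is a small sketch; the word list and the `results` dictionary are illustrative names, not part of the notebook's data.

```python
import re

words = ["cat", "catch", "scat", "concatenate"]
results = {}

for word in words:
    whole = bool(re.search(r"\bcat\b", word))   # 'cat' as a complete word
    inside = bool(re.search(r"\Bcat\B", word))  # 'cat' with word characters on both sides
    results[word] = (whole, inside)
    print(word, whole, inside)

# Only "cat" matches \bcat\b, and only "concatenate" matches \Bcat\B
```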


### Anchors
Anchors are metacharacters that **match positions in a string rather than characters**, such as the beginning or the end of a string. They are useful for specifying where a pattern should occur in a string. The two most common anchors in Python are:

* The **`caret anchor (^)`** matches at the **beginning** of a string. For example, the pattern `^Hello` matches the string “Hello world” but not “Say Hello”.
* The **`dollar anchor ($)`** matches at the **end** of a string (or just before a trailing newline). For example, the pattern `world$` matches the string “Hello world” but not “world peace”.

In [24]:
import re

# using ^ to match at the beginning of a string
text = "Hello world Hello"
pattern = r"^Hello" # match Hello at the beginning
match = re.search(pattern, text)
if match:
    print(match.group()) # prints Hello

Hello


In [25]:
# using $ to match at the end of a string
text = "World Hello world"
pattern = r"world$" # match world at the end
match = re.search(pattern, text)
if match:
    print(match.group()) # prints world

world
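
By default the anchors refer to the whole string. With the `re.MULTILINE` flag they also match at the start and end of each line. A minimal sketch with a made-up three-line string:

```python
import re

lines = "Hello world\nHello again\nsay Hello"

# Without flags, ^ only matches at the very start of the string
start_only = re.findall(r"^Hello", lines)
print(start_only)   # ['Hello']

# With re.MULTILINE, ^ also matches right after each newline
each_line = re.findall(r"^Hello", lines, flags=re.MULTILINE)
print(each_line)    # ['Hello', 'Hello']

# Likewise, $ matches before each newline under re.MULTILINE
line_ends = re.findall(r"\w+$", lines, flags=re.MULTILINE)
print(line_ends)    # ['world', 'again', 'Hello']
```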
