In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# <center>Python RegEx</center>
---

Regular expressions are a way of specifying patterns of characters that you want to match in a string. They are a powerful tool for text **processing** and **manipulation**. Python has a built-in module called **`re`** that provides various functions and methods to work with regular expressions.

A regular expression is composed of two types of characters: **normal characters** and **metacharacters**. Normal characters match themselves, while metacharacters have special meanings and functions. For example, the metacharacter **`.`** matches any character except a newline, while the metacharacter **`+`** means one or more repetitions of the preceding character or group.
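
As a small illustration of the two metacharacters just mentioned, the sketch below applies **`.`** and **`+`** to made-up sample strings. The strings and variable names are hypothetical and chosen only for this example.

```python
import re

sample = "cat cot coat ct"  # hypothetical sample text

# '.' matches exactly one character (anything except a newline)
dot_matches = re.findall(r"c.t", sample)
print(dot_matches)   # ['cat', 'cot']

# '+' repeats the preceding element one or more times
plus_matches = re.findall(r"co+t", "ct cot coot cooot")
print(plus_matches)  # ['cot', 'coot', 'cooot']
```

Note that `coat` and `ct` are not matched by `c.t`, because `.` stands for exactly one character, and `ct` is not matched by `co+t`, because `+` requires at least one `o`.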

To use regular expressions in Python, import the **`re`** module and call one of its functions, such as `re.search`, `re.findall`, or `re.sub`. Most of these functions take the pattern as the first argument and the string to search as the second; `re.sub` instead takes the pattern, the replacement, and then the string. The return type depends on the function: `re.search` returns a match object (or `None` if nothing matches), `re.findall` returns a list of matches, and `re.sub` returns a new string.

In [2]:
# Import the re module
import re

# Define a string
text = "The rain in Spain falls mainly on the plain"

# Define a regular expression pattern
pattern = "ain"

# Use re.search to find the first occurrence of the pattern in the string
match = re.search(pattern, text)

print(text)

# Print the match object
print(match) # Output: <re.Match object; span=(5, 8), match='ain'>

The rain in Spain falls mainly on the plain
<re.Match object; span=(5, 8), match='ain'>


In [3]:
# Use re.findall to find all occurrences of the pattern in the string
matches = re.findall(pattern, text)

print(text)

# Print the list of matches
print(matches) # Output: ['ain', 'ain', 'ain', 'ain']

The rain in Spain falls mainly on the plain
['ain', 'ain', 'ain', 'ain']


In [4]:
# Use re.sub to replace all occurrences of the pattern in the string with another string
new_text = re.sub(pattern, "ane", text)

print(text)

# Print the modified string
print(new_text) # Output: The rane in Spane falls manely on the plane

The rain in Spain falls mainly on the plain
The rane in Spane falls manely on the plane


## Raw Strings
---
Raw strings in Python are strings that treat backslashes (**`\`**) as literal characters, rather than as escape sequences. They are useful when you need to specify strings that contain backslashes, such as regular expressions or file paths on Windows.

To create a raw string in Python, prefix the string literal with **`r`** or **`R`**, as in `r'...'` or `R'...'`. For example:

In [5]:
# A regular string with escape sequences
s = 'Hello\nWorld'
print(s)

# A raw string with literal backslashes (the r and R prefixes are equivalent)
s = r'Hello\nWorld'
s = R'Hello\nWorld'
print(s)

Hello
World
Hello\nWorld


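A raw string is a way of writing a string literal, not a separate type, so an existing string cannot be converted into one. To see the escape sequences in a string instead of their effect, you can use the built-in **`repr()`** function, which returns a printable representation of an object (including the surrounding quotes). For example: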

In [6]:
# A regular string
s = 'Hello\nWorld'

# Get its printable representation using repr(); the newline is shown as \n
escaped = repr(s)
print(escaped)

'Hello\nWorld'
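
Raw strings matter for regular expressions because several regex tokens, such as `\b`, are also Python escape sequences. The sketch below compares the same pattern written with and without the `r` prefix; the sample text and variable names are hypothetical.

```python
import re

sample = "cat concat cat."  # hypothetical sample text

# Without the r prefix, "\b" is the backspace character (\x08), not a word boundary
plain_pattern = "\bcat\b"
print(len(plain_pattern))  # 5
plain_hits = re.findall(plain_pattern, sample)
print(plain_hits)          # []

# With the r prefix, the backslashes reach the re module intact
raw_pattern = r"\bcat\b"
print(len(raw_pattern))    # 7
raw_hits = re.findall(raw_pattern, sample)
print(raw_hits)            # ['cat', 'cat']
```

The `cat` inside `concat` is skipped by the raw pattern because there is no word boundary before it.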


---

In [7]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha Haha

Metacharacters (need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

subrata.com

6294505807
123 321 53341

Subrata
Rahul
Bapai
Mrinal
Kailash

"Emma's luck numbers are 251 761 231 451"
"Blue Berries are better than black berries"

'''

sentence = "Start a sentence and then bring it to an end."

## re.compile
**`re.compile`** is a function that compiles a regular expression pattern into a regular expression object, which can then be used for matching through its `match()`, `search()`, and other methods. Compiling is useful when the same pattern is applied many times: you define it once and reuse the resulting object instead of passing the pattern string to every call. You can also pass flags to modify the behavior of the regular expression, such as `re.IGNORECASE` for case-insensitive matching or `re.MULTILINE` for multiline matching.

```python
re.compile(pattern, flags=0)
```
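
As a minimal sketch of reuse and flags, the block below compiles one pattern with two flags combined and then applies it to different inputs. The sample text and the names `lines` and `line_start` are hypothetical.

```python
import re

lines = "apple pie\nBanana split\napple tart"  # hypothetical sample text

# Compile once. re.MULTILINE lets ^ match at the start of every line,
# and re.IGNORECASE makes [ab] match upper-case letters as well.
line_start = re.compile(r"^[ab]\w+", flags=re.MULTILINE | re.IGNORECASE)

# The same compiled object can be reused with different methods and inputs
first_words = line_start.findall(lines)
print(first_words)                                # ['apple', 'Banana', 'apple']
print(line_start.match("Avocado toast").group())  # Avocado
print(line_start.search("no match here"))         # None
```

Flags are combined with the bitwise OR operator (`|`), so one compiled object can carry several behaviors at once.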

**`re.finditer`** is a function that returns an iterator yielding match objects over all non-overlapping matches of the pattern in a string. Each match object carries the start and end positions of the match along with its groups and named groups. A compiled pattern offers the same functionality as a method, `pattern.finditer(string)`, which the examples below use. Use it to loop over all the matches in a string and act on each one.

```python
re.finditer(pattern, string, flags=0)
```
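
To make the information on each match object concrete, here is a short sketch that reads the span, the full match, and a named group from every match. The sample string and the group name `num` are hypothetical.

```python
import re

log = "id=42 id=7 id=1234"  # hypothetical sample text

found = []
# (?P<num>...) defines a named group called "num"
for m in re.finditer(r"id=(?P<num>\d+)", log):
    # m.span() is (start, end); m.group() is the whole match;
    # m.group("num") is the text captured by the named group
    found.append((m.span(), m.group(), m.group("num")))
    print(m.span(), m.group(), m.group("num"))

# Printed lines:
# (0, 5) id=42 42
# (6, 10) id=7 7
# (11, 18) id=1234 1234
```

Because `finditer` yields matches lazily, it can be preferable to `findall` when you need positions or groups rather than just the matched text.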

**Example 1:** Compile a pattern to find three consecutive digits in a string

In [8]:
# Example 1: Compile a pattern to find three consecutive digits in a string
import re

pattern = re.compile(r"\d{3}") # \d means any digit, {3} means exactly three times
matches = pattern.finditer(text_to_search)
print(type(matches))

for match in matches:
    print(match.group()) # match.group() returns the matched substring

<class 'callable_iterator'>
123
456
789
629
450
580
123
321
533
251
761
231
451


**Example 2:** Compile a pattern with flags to find words that start with b or B in a string

In [9]:
# Example 2: Compile a pattern with flags to find words that start with b or B in a string
import re

# \b is a word boundary and \w+ is one or more word characters;
# re.IGNORECASE makes "b" match both "b" and "B"
pattern = re.compile(r"\bb\w+", flags=re.IGNORECASE)
matches = pattern.finditer(text_to_search)
print(type(matches))

for match in matches:
    print(match.group())

<class 'callable_iterator'>
be
Bapai
Blue
Berries
better
black
berries


## Metacharacters
---