<a href="https://colab.research.google.com/github/subho2026/NLP-Beginners/blob/main/NLP_01_RegEx_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Heavily inspired by [this overview](https://pynative.com/python/regex/); credit goes to its authors.

# Import stuff

In [None]:
import re

So, what can `re` do?

Spoiler: a lot

In [None]:
# help() prints the documentation itself; wrapping it in print() would also print "None"
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.7/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last mat

# Let's do some tasks

Let's find a digit in a string

This can be useful as a preprocessing tool

`\d` matches a single digit, such as 0 to 9

`\` (backslash) escapes special characters or, as in `\d`, gives an ordinary character a special meaning

In [None]:
target_str = "My phone number is +1-234-567-89 dude privet"
res = re.findall(r"\d", target_str)

print(res)

['1', '2', '3', '4', '5', '6', '7', '8', '9']


In [None]:
res = re.findall(r"d", target_str)

print(res)

['d', 'd']


What if we want to keep consecutive digits together?

`\d` and `+` (Plus)

1 or more repetitions of the preceding regex

In [None]:
res = re.findall(r"\d+", target_str)

print(res)

['1', '234', '567', '89']


**Task**

Write code that outputs the list of numbers found above as integers

In [None]:
# write your code here

res_int = []

for x in res:
    res_int.append(int(x))
res_int

[1, 234, 567, 89]
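
As a possible alternative to the loop above, the same conversion can be sketched as a list comprehension. The string is the notebook's own `target_str`; nothing else is assumed.

```python
import re

target_str = "My phone number is +1-234-567-89 dude privet"

# Find every run of digits and convert each one to int in a single expression
res_int = [int(x) for x in re.findall(r"\d+", target_str)]

print(res_int)
```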

**RegEx Metacharacters**



`.` (DOT)

any character except a newline.

In [None]:
res = re.findall(r".", target_str)

print(res)

['M', 'y', ' ', 'p', 'h', 'o', 'n', 'e', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ', '+', '1', '-', '2', '3', '4', '-', '5', '6', '7', '-', '8', '9', ' ', 'd', 'u', 'd', 'e', ' ', 'p', 'r', 'i', 'v', 'e', 't']


`^` (Caret)

pattern only at the start of the string

In [None]:
res = re.findall(r"^.", target_str)

print(res)

['M']


`$` (Dollar)

pattern at the end of the string

In [None]:
res = re.findall(r".$", target_str)

print(res)

['t']
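
By default `^` and `$` anchor to the start and end of the whole string. A small sketch of how the `re.MULTILINE` flag changes that, using a made-up two-line string (`lines` is a hypothetical example, not part of the notebook data):

```python
import re

lines = "first line\nsecond line"  # hypothetical sample text

# Without the flag, ^ matches only at the very start of the string
start_default = re.findall(r"^\w+", lines)

# With re.MULTILINE, ^ also matches right after each newline
start_multiline = re.findall(r"^\w+", lines, re.MULTILINE)

# Likewise, $ then matches before each newline as well as at the end
end_multiline = re.findall(r"\w+$", lines, re.MULTILINE)

print(start_default)
print(start_multiline)
print(end_multiline)
```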


`*` (asterisk)

0 or more repetitions of the regex

In [None]:
res = re.findall(r"\d*", target_str)

print(res)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '234', '', '567', '', '89', '', '', '', '', '', '', '', '', '', '', '', '', '']
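
The empty strings appear because `\d*` is allowed to match zero digits, so it succeeds at every position that has no digit. A minimal sketch on a short made-up string (`sample` is hypothetical), assuming Python 3.7 or newer, where an empty match is also reported right after a non-empty one:

```python
import re

sample = "a1"  # hypothetical two-character string

# At "a" the pattern matches zero digits, at "1" it matches one digit,
# and at the end of the string it matches zero digits again
matches = re.findall(r"\d*", sample)

# Dropping the empty matches leaves only the real digit runs
non_empty = [m for m in matches if m]

print(matches)
print(non_empty)
```

In practice `\d+` avoids the problem, since it requires at least one digit.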


This is useful, for example, for finding a stem with an optional ending (here `e` followed by zero or more `r`)

In [None]:
res = re.findall(r"er*", target_str)

print(res)

['e', 'er', 'e', 'e']


A roundabout way to find all runs of digits (equivalent to `\d+`)

In [None]:
res = re.findall(r"\d\d*", target_str)

print(res)

['1', '234', '567', '89']


`?` (Question mark)

0 or 1 repetition of the regex

In [None]:
res = re.findall(r"\d?", target_str)

print(res)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '2', '3', '4', '', '5', '6', '7', '', '8', '9', '', '', '', '', '', '', '', '', '', '', '', '', '']


Let's find 2-3 digit numbers

In [None]:
res = re.findall(r"\d\d\d?", target_str)

print(res)

['234', '567', '89']


`[ ]` (Square brackets)

a set of characters. Matches any single character in brackets



In [None]:
res = re.findall(r"[pd]..", target_str)

print(res)

['pho', 'dud', 'pri']


Sets can be combined with other pattern elements

In [None]:
res = re.findall(r"[pd][rh]\w+", target_str)

print(res)

['phone', 'privet']


`[^ ]` any single character not in brackets

In [None]:
res = re.findall(r"[^0-9]", target_str)

print(res)

['M', 'y', ' ', 'p', 'h', 'o', 'n', 'e', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ', '+', '-', '-', '-', ' ', 'd', 'u', 'd', 'e', ' ', 'p', 'r', 'i', 'v', 'e', 't']


`( )` (Parentheses) group everything inside into a single unit

In [None]:
res = re.findall(r"(234)", target_str)

print(res)

['234']
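
Groups also change what `re.findall` returns: with one group it returns the group's text, and with several groups it returns tuples. The sketch below illustrates this on the notebook's `target_str` and on a made-up string `"ababab ab"`; `(?:...)` is the non-capturing form of a group.

```python
import re

target_str = "My phone number is +1-234-567-89 dude privet"

# Two groups: findall returns one tuple per match
pairs = re.findall(r"(\d+)-(\d+)", target_str)

# A repeated capturing group: findall returns only the last captured "ab"
repeated_capturing = re.findall(r"(ab)+", "ababab ab")

# A non-capturing group: findall returns the whole match
repeated_plain = re.findall(r"(?:ab)+", "ababab ab")

print(pairs)
print(repeated_capturing)
print(repeated_plain)
```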


`{ }` (Curly braces)

specify a particular number of repetitions

Exact number of repetitions

In [None]:
res = re.findall(r"\d{3}", target_str)

print(res)

['234', '567']


From a lower bound to an upper bound

In [None]:
res = re.findall(r"\d{1,3}", target_str)

print(res)

['1', '234', '567', '89']


At least a given number of times

In [None]:
res = re.findall(r"\d{2,}", target_str)

print(res)

['234', '567', '89']


# Some special Character Classes

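`\d` is equivalent to `[0-9]` any digit

`\D` is equivalent to `[^0-9]` any non-digit

`\s` is equivalent to `[ \t\n\x0b\r\f]` any whitespace character

`\S` is equivalent to `[^ \t\n\x0b\r\f]` any non-whitespace character

`\w` is equivalent to `[a-zA-Z_0-9]` any alphanumeric character or underscore

`\W` is equivalent to `[^a-zA-Z_0-9]` any character that is not alphanumeric or an underscore

Note: for `str` patterns in Python 3 these classes also match non-ASCII digits, whitespace and letters; the bracket forms above are exact only with the `re.ASCII` flag

`\b` matches the empty string at the beginning or end of a word

`\B` is the opposite: it matches the empty string anywhere that is not a word boundary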

In [None]:
res = re.findall(r"\B\w+\B", target_str)

print(res)

['hon', 'umbe', '3', '6', 'ud', 'rive']


In [None]:
res = re.findall(r"\b\w+\b", target_str)

print(res)

['My', 'phone', 'number', 'is', '1', '234', '567', '89', 'dude', 'privet']
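
To illustrate how `\w` behaves on non-ASCII text, here is a small sketch comparing the default (Unicode) behaviour with the `re.ASCII` flag. The sample string is made up for this illustration and is written with escapes so the accented letters are unambiguous.

```python
import re

text = "na\u00efve caf\u00e9 42"  # hypothetical sample: "naïve café 42"

# Default for str patterns: \w matches Unicode letters and digits
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split the words
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```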


**Task**

Write code to extract the phone number as a whole string (the plus sign, digits and dashes)

In [None]:
# write your code here

['+1-234-567-89']


**Task**

Find all 4-5 letter words

In [None]:
# write your code here


['phone', 'dude']


# Something about methods

In [None]:
target_string = "Toss a 300$ to a Witcher"

`re.compile('pattern')`

saves a pattern as a compiled object that can be reused later

so there is no need to copy-paste the pattern string, which also helps avoid mistakes


In [None]:
str_pattern = r"\w+"
pattern = re.compile(str_pattern)
pattern

re.compile(r'\w+', re.UNICODE)

Then we can apply different methods to the pattern

`re.match(pattern, str)`

returns a match only if the pattern matches at the start of the string, and `None` otherwise


In [None]:
res = pattern.match(target_string)

print(res.group())

Toss


In [None]:
print(res.start())

0


In [None]:
print(res.end())

4


In [None]:
print(res.span())

(0, 4)
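
A compiled pattern offers the same operations as the module-level functions, only as methods. The following is a rough sketch of reusing compiled patterns; the names `word_pattern` and `name_pattern` are chosen here for illustration only.

```python
import re

target_string = "Toss a 300$ to a Witcher"

# Compile once, then reuse the same object for different operations
word_pattern = re.compile(r"\w+")
words = word_pattern.findall(target_string)
masked = word_pattern.sub("X", target_string)

# Flags such as re.IGNORECASE are passed at compile time
name_pattern = re.compile(r"witcher", re.IGNORECASE)
found = name_pattern.search(target_string)

print(words)
print(masked)
print(found.group())
```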


`re.search(pattern, str)`

searches for the pattern anywhere in the string and returns the first match, or `None` if there is none


In [None]:
res = re.search(r"\d+", target_string)
print(res.group())

300
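
Both `re.match` and `re.search` return `None` when nothing is found, so calling `.group()` directly can raise an `AttributeError`. A minimal sketch of checking the result first, on the same `target_string`:

```python
import re

target_string = "Toss a 300$ to a Witcher"

# The string does not start with a digit, so match() finds nothing
m = re.match(r"\d+", target_string)

# search() looks at every position and finds the first run of digits
s = re.search(r"\d+", target_string)

for label, result in (("match", m), ("search", s)):
    if result is None:
        print(label, "-> no match")
    else:
        print(label, "->", result.group(), result.span())
```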


`re.findall(pattern, str)`

Scans the entire string and returns all non-overlapping matches of the pattern as a list

In [None]:
res = re.findall(r"\w+", target_string)
print(res)

['Toss', 'a', '300', 'to', 'a', 'Witcher']
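
`re.findall` returns only the matched text. When positions are needed as well, `re.finditer` yields match objects instead. A short sketch, again on `target_string`:

```python
import re

target_string = "Toss a 300$ to a Witcher"

# Each match object carries the text and its start/end offsets
positions = [
    (m.group(), m.start(), m.end())
    for m in re.finditer(r"\w+", target_string)
]

for word, start, end in positions:
    print(f"{word!r} at {start}-{end}")
```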


`re.split(pattern, str)`

Splits the string at every match of the pattern and returns the pieces as a list.

In [None]:
res = re.split(r"\s", target_string)
print("All tokens:", res)

All tokens: ['Toss', 'a', '300$', 'to', 'a', 'Witcher']


In [None]:
target_str = "My phone number is (wait for it) \n +1-234-567-89"

res = re.split("\n", target_str)

print(res)

['My phone number is (wait for it) ', ' +1-234-567-89']
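
Two optional behaviours of `re.split` are worth a quick sketch: `maxsplit` limits the number of splits, and a capturing group in the pattern keeps the separators in the result. The third call uses a made-up string to show why `\s+` is often preferable to `\s` when whitespace repeats.

```python
import re

target_string = "Toss a 300$ to a Witcher"

# Stop after two splits; the rest of the string stays in one piece
limited = re.split(r"\s", target_string, maxsplit=2)

# A capturing group keeps the separator itself in the output list
with_separator = re.split(r"(\$)", target_string)

# With repeated whitespace, \s yields empty strings while \s+ does not
single = re.split(r"\s", "a  b")
collapsed = re.split(r"\s+", "a  b")

print(limited)
print(with_separator)
print(single, collapsed)
```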


`re.sub(pattern, replacement, str)`

Replaces occurrences of the pattern in the string with `replacement` (all of them by default)


In [None]:
res = re.sub(r"\s", "_SPACE_", target_string)

print(res)

Toss_SPACE_a_SPACE_300$_SPACE_to_SPACE_a_SPACE_Witcher


In [None]:
res = re.sub(r"\d+", "threehundred", target_string)

print(res)

Toss a threehundred$ to a Witcher
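
The replacement does not have to be a fixed string. As a sketch of three further options of `re.sub`: a backreference such as `\1` reuses a captured group, a function computes the replacement from each match, and `count` limits how many replacements are made.

```python
import re

target_string = "Toss a 300$ to a Witcher"

# \1 refers to the first group, so the dollar sign moves in front of the number
moved = re.sub(r"(\d+)\$", r"$\1", target_string)

# A function receives each match object and returns the replacement text
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), target_string)

# count=1 replaces only the first occurrence
first_only = re.sub(r"\s", "_", target_string, count=1)

print(moved)
print(doubled)
print(first_only)
```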


# THE TASKS

* Write the code
* Prove that it works

Heavily inspired by this [Kaggle Post](https://www.kaggle.com/code/albeffe/regex-exercises-solutions)

Some tasks are taken directly, some are modified, but credit is due either way

1) Write a Python program to check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9).

In [None]:
# write your code here

import re

# Allowed characters: a-z, A-Z and 0-9 (the original pattern covered only a-z)
pattern = r'[a-zA-Z0-9]+'

# Test strings
test_string1 = "Hello123"
test_string2 = "Hello 123!"

# re.fullmatch succeeds only if the whole string matches the pattern
for test_string in (test_string1, test_string2):
    if re.fullmatch(pattern, test_string):
        print(f"'{test_string}' contains only allowed characters.")
    else:
        print(f"'{test_string}' contains other characters.")


**Results :** <br>
'Hello123' contains only allowed characters.<br>
'Hello 123!' contains other characters.


2) Write a Python program that matches a string that has a `1` followed by zero or more `0`'s





In [None]:
# write your code here

import re

pattern = r'10*'

test_strings = ["100001", "0000", "101010", "11", "0"]

for test_string in test_strings:
    if re.match(pattern, test_string):
        print(f"'{test_string}' is a valid match.")
    else:
        print(f"'{test_string}' is not a valid match.")


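**Results :** <br>
'100001' is a valid match.<br>
'0000' is not a valid match.<br>
'101010' is a valid match.<br>
'11' is a valid match.<br>
'0' is not a valid match.

Note that `re.match` only requires the pattern to match at the start of the string, so `'11'` is accepted because of its first `1`, while `'0000'` contains no `1` at all.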
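
Which strings count as "valid" here depends on the function used. A minimal sketch comparing `match`, `search` and `fullmatch` for the same pattern; the three sample strings are chosen for this illustration and are not part of the original task.

```python
import re

pattern = re.compile(r"10*")
samples = ["1000", "11", "0100"]  # hypothetical test strings

results = {}
for s in samples:
    results[s] = (
        bool(pattern.match(s)),      # pattern must match at the start
        bool(pattern.search(s)),     # pattern may match anywhere
        bool(pattern.fullmatch(s)),  # pattern must cover the whole string
    )
    print(s, results[s])
```

If the task is read as "the whole string is a `1` followed by zeros", `fullmatch` is probably the closest fit; if it is read as "contains such a sequence", `search` is.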


3) Write a Python program that matches a string that has a `:` followed by one or more `)`'s

In [None]:
# write your code here

4) Write a Python program that matches a string that has a `digit` (any number) followed by zero or one `non-alphabetic character`

In [None]:
# write your code here

5) Write a Python program that matches a string that has an `I` followed by three `E`

In [None]:
# write your code here

6) Write a Python program that matches a string that has an `l` followed by two to three `3`.

In [None]:
# write your code here

7) Write a Python program to find sequences of `lowercase letters` joined with an `underscore`

In [None]:
# write your code here

8) Write a Python program to find the sequences of `one upper case letter` followed by `lower case letters`

In [None]:
# write your code here

9) Write a Python program that matches a string that has an `l` followed by anything, ending in `e`

In [None]:
# write your code here

10) Write a Python program that matches a `word` at the `beginning of a string`

(if string starts with ` ` (space) it should not match)

In [None]:
# write your code here

11) Write a Python program that matches a `word` at the end of string, with `optional punctuation`

(if string ends with ` ` (space) it should not match)

In [None]:
# write your code here

12) Write a Python program that matches a word containing `q`

In [None]:
# write your code here

13) Write a Python program that matches a word containing `g` , not at the start or end of the word

In [None]:
# write your code here

14) Write a Python program to match a string that contains only `upper` and `lowercase letters`, `numbers`, and `underscores`

(no space allowed)

In [None]:
# write your code here

15) Write a Python program that checks whether a string starts with `+` followed by a specific number (let it be `7`)

In [None]:
# write your code here

16) Write a Python program to remove leading zeros from an IP address

e.g. `255.01.092.132` -> `255.1.92.132`

In [None]:
# write your code here

17) Write a Python program to check for a `number` at the end of a string

In [None]:
# write your code here

18) Write a Python program to search the numbers `0-9` of length between `1` to `3` in a given string.

In [None]:
# write your code here

19) Write a Python program to search for some literal strings in a string.
Sample text : `'The quick brown fox jumps over the lazy dog'`

Searched words : `'fox', 'dog', 'horse'`

In [None]:
# write your code here

20) Write a Python program to search for a literal string in a string and also find the location within the original string where the pattern occurs

Sample text : `'The quick brown fox jumps over the lazy dog.' `

Searched words : `'lazy'`

In [None]:
# write your code here

21) Write a Python program to find the substrings within a string.

Sample text :

`'Machine learning, Deep learning, life-long learning'`

Pattern :

`'learning'`

Note: There are three instances of `learning` in the input string.

In [None]:
# write your code here

22) Write a Python program to find the occurrence and position of the substrings within a string

(previous example)

In [None]:
# write your code here

23) Write a Python program to replace `whitespaces` with an `underscore` and vice versa

In [None]:
# write your code here

24) Write a Python program to extract year, month and day from a URL

Assume `yyyy/mm/dd` format

In [None]:
# write your code here

25) Write a Python program to convert a date of yyyy-mm-dd format to dd-mm-yyyy format

In [None]:
# write your code here

26) Write a Python program to remove quotation marks `"` from a string

In [None]:
# write your code here

27) Write a Python program to separate and print the `numbers` of a given string

In [None]:
# write your code here

28) Write a Python program to find all words starting with `'f'` or `'t'` in a given string.

In [None]:
# write your code here

29) Write a Python program to separate and print the numbers and their position of a given string.

In [None]:
# write your code here

30) Write a Python program to abbreviate `'Doctor'` as `'Dr.'` in a given string.

In [None]:
# write your code here

# The final Task

The example is taken from [here](https://towardsdatascience.com/mastering-regular-expressions-for-your-day-to-day-tasks-b01385aeea56), so you could always cheat :)

**The text**

Chinese small-leaf-type tea was introduced into India in 1836 by the British in an attempt to break the Chinese monopoly on tea.[57] In 1841, Archibald Campbell brought seeds of Chinese tea from the Kumaun region and experimented with planting tea in Darjeeling. The Alubari tea garden was opened in 1856 and Darjeeling tea began to be produced.[58] In 1848, Robert Fortune was sent by the East India Company on a mission to China to bring the tea plant back to Great Britain. He began his journey in high secrecy as his mission occurred in the lull between the Anglo-Chinese First Opium War (1839–1842) and Second Opium War (1856–1860).[59] .....[57] Tea was originally consumed only by anglicized Indians; however, it became widely popular in India in the 1950s because of a successful advertising campaign by the India Tea Board.[57]

In [None]:
text = 'Chinese small-leaf-type tea was introduced into India in 1836 by the British in an attempt to break the Chinese monopoly on tea.[57] In 1841, Archibald Campbell brought seeds of Chinese tea from the Kumaun region and experimented with planting tea in Darjeeling. The Alubari tea garden was opened in 1856 and Darjeeling tea began to be produced.[58] In 1848, Robert Fortune was sent by the East India Company on a mission to China to bring the tea plant back to Great Britain. He began his journey in high secrecy as his mission occurred in the lull between the Anglo-Chinese First Opium War (1839–1842) and Second Opium War (1856–1860).[59] .....[57] Tea was originally consumed only by anglicized Indians; however, it became widely popular in India in the 1950s because of a successful advertising campaign by the India Tea Board.[57]'

1) Write a Python program to extract all annotations (numbers in square brackets), and only them

In [None]:
# write your code here

2) Write a Python program to extract all years

For a span of years, also calculate its length (larger year minus smaller year)

In [None]:
# write your code here

3) Write a Python program to replace `First` and `Second` (only in names of the wars) to `1st` and `2nd` respectively

In [None]:
# write your code here

4) Write a Python program to count how many times tea is mentioned in the context of "Chinese tea"
Don't forget to consider upper and lower case, as well as possible words in between

In [None]:
# write your code here

**Extra**

Use regex to find the winning team in each row of the following table

And the margin by which it won

In [None]:
text = '''
Juventus F.C.     -    Napoli                 4:1 (2:0)
A.C. Milan        -    Internazionale F.C.    2:2 (1:0)
A.C. Fiorentina   -    Torino                 1:0 (0:0)
Lazio             -    Atalanta               0:3 (0:2)
Lazio             -    Juventus F.C.          0:4 (0:2)
'''