# Regular Expressions
Regular expressions (regex, regexp, RE) find substrings not by exact match but by a pattern that describes which strings qualify.

Special characters that we will use to define string search rules:

- `.` any single character
- `?` 0 or 1 occurrence of the preceding element (a character, a bracket set, or a group)
- `*` the preceding element is repeated ≥ 0 times (0, 1, 2, 3, etc.)
- `+` the preceding element is repeated ≥ 1 time (1, 2, 3, etc.)
- `^` start of string
- `$` end of string
- `[abc]` "or": any of the characters a, b, c
- `[а-я]` any lowercase Russian letter from "а" to "я" (the range does not include "ё", which has to be listed separately)
- Most special characters lose their meaning inside square brackets: `.` denotes a period, `?` a question mark. Outside square brackets, a special character must be escaped with `\` to be matched literally (`\.` denotes a period, `\+` a plus sign).
- `[^abc]` negation: any character except a, b, or c
- `\d` any digit, similar to `[0-9]`
- `\D` any character except digits (the negation of `\d`, i.e. `[^0-9]`)
- `\w` a letter, digit, or `_`. For ASCII text this is the same as `[a-zA-Z0-9_]`; with ordinary `str` patterns in Python 3 it also matches letters of other alphabets, such as Cyrillic. `\W` anything except letters, digits, and `_`
- `\s` any whitespace character (`[ \t\n\r\f\v]`); `\S` any non-whitespace character
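
A quick way to get a feel for these characters is to try each one on a short string. The sketch below uses a made-up sample string (it is not part of the exercises) together with `re.findall`, which is introduced further down and returns every match as a list.

```python
import re

sample = 'cat cot coat c.t'  # hypothetical sample string, for illustration only

any_char = re.findall(r'c.t', sample)       # "." stands for any single character
literal_dot = re.findall(r'c\.t', sample)   # "\." is a literal period
optional_a = re.findall(r'coa?t', sample)   # "a?" makes the "a" optional
at_start = re.findall(r'^cat', sample)      # "^" anchors the match to the start
at_end = re.findall(r'c\.t$', sample)       # "$" anchors the match to the end
either = re.findall(r'c[ao]t', sample)      # "[ao]" is one character: a or o
neither = re.findall(r'c[^ao]t', sample)    # "[^ao]" is one character: not a, not o

print(any_char)     # ['cat', 'cot', 'c.t']
print(literal_dot)  # ['c.t']
print(optional_a)   # ['cot', 'coat']
print(at_start)     # ['cat']
print(at_end)       # ['c.t']
print(either)       # ['cat', 'cot']
print(neither)      # ['c.t']
```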

All about the re module: https://docs.python.org/3/howto/regex.html

Regex training: https://regex101.com/


In [1]:
import re

**`re.search(pattern, string)`** - returns a match object for the first substring that matches the regular expression, or `None` if nothing matches. The argument order is `re.search(what_we_are_looking_for, where_we_are_looking)`.

In [2]:
re.search('Bon', '''Je dis "Bonjour" quand c'est le jour. Je dis "Bonsoir" quand c'est le soir.''')

<re.Match object; span=(8, 11), match='Bon'>

In [3]:
re.search('Bon....', '''Je dis "Bonjour" quand c'est le jour. Je dis "Bonsoir" quand c'est le soir.''')

<re.Match object; span=(8, 15), match='Bonjour'>

In [89]:
re.search('Bon....', '''Je dis "Bonjour" quand c'est le jour. Je dis "Bonsoir" quand c'est le soir.''').group()

'Bonjour'

If the pattern matches nothing, `re.search` returns `None`, so calling `.group()` on the result raises an error.

In [6]:
re.search('matin.', '''Je dis "Bonjour" quand c'est le jour. Je dis "Bonsoir" quand c'est le soir.''').group()

AttributeError: 'NoneType' object has no attribute 'group'

Therefore, to be on the safe side, it's worth checking if anything was found:

In [9]:
first_match = re.search('Bon.', '''Je dis "Bonjour" quand c'est le jour. Je dis "Bonsoir" quand c'est le soir.''')
if first_match:
    print(first_match.group())
else:
    print('Nothing found.')

Bonj
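
If this check is needed in many places, it can be wrapped in a small helper. The function below is only a sketch: `first_match` and its `default` parameter are names invented for this example, not part of the `re` module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the text of the first match, or `default` if nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else default

sentence = '''Je dis "Bonjour" quand c'est le jour.'''

found = first_match('Bon....', sentence)
missing = first_match('matin.', sentence, default='Nothing found.')

print(found)    # Bonjour
print(missing)  # Nothing found.
```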


**`re.findall(pattern, string)`** - returns a list of all non-overlapping substrings that match the pattern

In [10]:
all_results = re.findall('Bon....', '''Je dis "Bonjour" quand c'est le jour. Je dis "Bonsoir" quand c'est le soir.''')
all_results

['Bonjour', 'Bonsoir']

### Escaping (using backslashes) and raw strings

In [11]:
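digits = re.findall('\d', 'Today is November 14th. The year is 2025.')
digits

['1', '4', '2', '0', '2', '5']

The result is correct, but Python also prints a warning that quotes this line: in an ordinary string literal, `\d` is an invalid escape sequence. A raw string avoids the warning: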

In [12]:
digits = re.findall(r'\d', 'Today is November 14th. The year is 2025.')
digits

['1', '4', '2', '0', '2', '5']

Also:

In [13]:
print('part 1\npart 2')

part 1
part 2


In [16]:
print(r'part 1\npart 2')

part 1\npart 2


What does `r` mean?

An `r` before a string literal makes it a raw string: backslashes in it are not treated as the start of an escape sequence. `\` is just a backslash, and `n` is just n.

Another way to keep a backslash literal is to escape it with a second backslash.

In [18]:
digits = re.findall('\\d', 'Today is November 14th. The year is 2025.')
digits

['1', '4', '2', '0', '2', '5']

In [13]:
print('part 1\\npart 2')

part 1\npart 2
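
The two spellings produce the same string, which can be checked directly. This is a minimal sketch; the variable names are chosen for this example only.

```python
raw = r'\d'       # raw string: backslash followed by "d"
escaped = '\\d'   # ordinary string with an escaped backslash

same = raw == escaped
newline_len = len('\n')       # one character: a line break
raw_newline_len = len(r'\n')  # two characters: a backslash and "n"

print(same)             # True
print(len(raw))         # 2
print(newline_len)      # 1
print(raw_newline_len)  # 2
```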


### Training

**Find the year in 'Today is November 14th. The year is 2025.'** (in 3 ways)

In [20]:
years = re.findall(r'\d\d\d\d', 'Today is November 14th. The year is 2025.')
years


['2025']

In [21]:
years = re.findall(r'\d{4}', 'Today is November 14th. The year is 2025.')
years


['2025']

In [25]:
years = re.findall('[0-9]{4}', 'Today is November 14th. The year is 2025.')
years

['2025']

Curly braces specify how many times the preceding element must be repeated: `{4}` means exactly four times.

In [26]:
re.findall(r'.{4}', 'Today is November 14th. The year is 2025')

['Toda',
 'y is',
 ' Nov',
 'embe',
 'r 14',
 'th. ',
 'The ',
 'year',
 ' is ',
 '2025']
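
Curly braces also accept a range. The sketch below, which reuses the sentence from the exercise, compares an exact count `{2}` with the ranges `{2,4}` (from two to four) and `{2,}` (two or more).

```python
import re

text = 'Today is November 14th. The year is 2025.'

exactly_two = re.findall(r'\d{2}', text)    # exactly two digits per match
two_to_four = re.findall(r'\d{2,4}', text)  # from two to four digits, as many as possible
two_or_more = re.findall(r'\d{2,}', text)   # at least two digits, no upper limit

print(exactly_two)  # ['14', '20', '25']
print(two_to_four)  # ['14', '2025']
print(two_or_more)  # ['14', '2025']
```

With `{2}` the year is split into two matches, because each match stops after exactly two digits and the search then continues from that point.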

Square brackets list the characters that may appear in that position. You can list the characters one after another or specify ranges. Most special characters lose their special meaning within square brackets.

In [28]:
re.findall('[A-Za-z]{4}', 'Today is November 14th. The year is 2025')

['Toda', 'Nove', 'mber', 'year']

In [29]:
re.findall('[А-Яа-я]{4}', 'Сегодня 14 ноября 2025 года')

['Сего', 'нояб', 'года']

In [30]:
re.findall('[А-Яа-я]{4}', 'Сёдня 14 ноября 2025 года')

['нояб', 'года']

In [31]:
re.findall('[А-Яа-яЁё]{4}', 'Сёдня 14 ноября 2025 года')

['Сёдн', 'нояб', 'года']
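
The remark about special characters inside square brackets can be checked on a small example. The sample sentence below is made up for illustration. Inside brackets `.`, `?`, `+` and `*` are ordinary characters; outside brackets they must be escaped with a backslash. The characters that stay special inside brackets are `\`, `]`, a leading `^`, and `-` between two characters.

```python
import re

text = 'Is 2+2=4? Yes. Is 3*3=9? Yes.'  # hypothetical sample sentence

# Inside brackets these characters are matched literally, no escaping needed.
punctuation = re.findall(r'[.?+*]', text)

# Outside brackets the same characters must be escaped.
sums = re.findall(r'\d\+\d', text)
products = re.findall(r'\d\*\d', text)

print(punctuation)  # ['+', '?', '.', '*', '?', '.']
print(sums)         # ['2+2']
print(products)     # ['3*3']
```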

A `^` placed first inside square brackets negates the set: the pattern matches any character that is not listed.

In [32]:
re.findall('[^ёя]', 'Сёдня')

['С', 'д', 'н']

`+` indicates that the previous character is repeated ≥ 1 time.

In [34]:
re.findall('a+', 'aah. aaaah! ')

['aa', 'aaaa']

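`*` indicates that the previous character is repeated ≥ 0 times. Because zero repetitions also count as a match, the pattern produces an empty string at every position where there is no `a` (the capital `A` does not count, since matching is case-sensitive).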

In [35]:
re.findall('a*', 'Aaaah, I get it!')

['', 'aaa', '', '', '', '', '', '', '', '', '', '', '', '', '']
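
The empty strings are usually not wanted. A minimal sketch of three ways to deal with them on the same sentence: filter them out, switch to `+`, or list both letter cases in a bracket set so that the capital `A` is included.

```python
import re

text = 'Aaaah, I get it!'

star = re.findall('a*', text)          # contains many empty strings
non_empty = [m for m in star if m]     # keep only the non-empty matches
plus = re.findall('a+', text)          # "+" never produces an empty match
both_cases = re.findall('[Aa]+', text) # the set matches "A" as well as "a"

print(non_empty)   # ['aaa']
print(plus)        # ['aaa']
print(both_cases)  # ['Aaaa']
```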

**Find phone numbers in the contacts**

In [61]:
fgn = '''Справочная:
Тел.: +7 (495) 771-32-32
Факс: +7 (495) 628-79-31
Для соединения с внутренним номером подразделения/работника:
+7 (495) 531-00-00
Довузовская подготовка:
Сайт: http://fdp.hse.ru
E-mail: fdp@hse.ru
Адреса и телефоны: https://fdp.hse.ru/contacts
Приемная комиссия:
Тел.: 84957713242; +7(495)916-88-44, +7 (495) 9168844'''

In [90]:
# Your code

`?` - the previous character occurs 0 or 1 times (a way to mark it as optional)

In [28]:
# example
colours = re.findall('colou?r', 'In US English, “color” is the correct spelling. In UK English, “colour” is standard.')
colours

['color', 'colour']

In [13]:
# Your code

In [15]:
# alternatives: (e|o) matches either "e" or "o"; with groups in the pattern, findall returns one tuple per match
re.findall('(g(e|o)t)', 'I get it or I got it')

[('get', 'e'), ('got', 'o')]
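
When a pattern contains parentheses, `re.findall` returns the groups instead of the whole match, which is why the result above is a list of tuples. If only the whole match is needed, one option is a non-capturing group `(?:...)`; for single characters a bracket set does the same job. A short sketch:

```python
import re

text = 'I get it or I got it'

with_groups = re.findall(r'(g(e|o)t)', text)    # capturing groups: tuples
non_capturing = re.findall(r'g(?:e|o)t', text)  # (?:...) groups without capturing
bracket_set = re.findall(r'g[eo]t', text)       # a set is enough for single characters

print(with_groups)    # [('get', 'e'), ('got', 'o')]
print(non_capturing)  # ['get', 'got']
print(bracket_set)    # ['get', 'got']
```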

In [30]:
re.findall(r'((\+7|8) ?\(?\d{3}\)? ?\d{3}-?\d{2}-?\d{2})', fgn)

[('+7 (495) 771-32-32', '+7'),
 ('+7 (495) 628-79-31', '+7'),
 ('+7 (495) 531-00-00', '+7'),
 ('84957713242', '8'),
 ('+7(495)916-88-44', '+7')]

### Find all terrier breeds inside the text

In [76]:
terriers = 'In the 18th century in Britain, only two types of terriers were recognized, long-legged and short-legged.[8] Today, terriers are often informally categorized by size or by function.Hunting-types are still used to find, track, or trail quarry, especially underground, and sometimes to bolt the quarry. Modern examples include the Jack Russell Terrier, the Jagdterrier, the Rat Terrier, and the Patterdale Terrier. There are also the short-legged terriers such as the Cairn Terrier, the Scottish Terrier, and the West Highland White Terrier, which were also used to kill small vermin.'
re.findall('(([A-Z][^ ]* ?)+(T|t)errier)', terriers)

[('Today, terrier', 'Today, ', 't'),
 ('Jack Russell Terrier', 'Russell ', 'T'),
 ('Jagdterrier', 'Jagd', 't'),
 ('Rat Terrier', 'Rat ', 'T'),
 ('Patterdale Terrier', 'Patterdale ', 'T'),
 ('Cairn Terrier', 'Cairn ', 'T'),
 ('Scottish Terrier', 'Scottish ', 'T'),
 ('West Highland White Terrier', 'White ', 'T')]
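
The first result, `'Today, terrier'`, is a false positive: `[^ ]*` accepts the comma, and `(T|t)errier` accepts the lowercase common noun. One possible refinement, shown here as a sketch on a shortened, made-up sample rather than the full text, is to allow only letters in each capitalized word and to accept a lowercase `terrier` only as the tail of a capitalized word such as "Jagdterrier". Non-capturing groups make `findall` return plain strings.

```python
import re

# Hypothetical shortened sample, not the full text from the exercise.
sample = ('Today, terriers are popular. Examples include the Jack Russell Terrier, '
          'the Jagdterrier and the Rat Terrier.')

# (?:[A-Z][a-z]+ )*    zero or more capitalized words, each followed by a space
# (?:[A-Z][a-z]*t|T)   either a capitalized prefix ending in "t", or a capital "T"
# errier               the rest of the word "terrier"
pattern = r'(?:[A-Z][a-z]+ )*(?:[A-Z][a-z]*t|T)errier'

breeds = re.findall(pattern, sample)
print(breeds)  # ['Jack Russell Terrier', 'Jagdterrier', 'Rat Terrier']
```

This variant assumes breed names consist of capitalized ASCII words only; names with hyphens or apostrophes would need a wider character set.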

### `.group()`

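If you need to extract several parts of a match, enclose each part in parentheses. Each pair of parentheses forms a group, numbered from 1 in the order of the opening parentheses; group 0 is the whole match.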

In [81]:
s = 'корова молоко '
r = re.search('(.+?оро.+?) (.+?оло.+?) ', s)
print(r)
print(r.group())
print(r.group(0))  # the same as r.group()

<re.Match object; span=(0, 14), match='корова молоко '>
корова молоко 
корова молоко 


In [33]:
print(r.group(1))

корова


In [34]:
r.group(2)

'молоко'
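
All groups can also be retrieved at once. The sketch below uses the same string and pattern with two more methods of the match object: `.groups()`, which returns a tuple of all groups, and `.span(n)`, which returns the start and end positions of group `n`.

```python
import re

s = 'корова молоко '
m = re.search('(.+?оро.+?) (.+?оло.+?) ', s)

all_groups = m.groups()      # every group as a tuple, starting from group 1
first, second = all_groups   # the tuple can be unpacked into variables
first_span = m.span(1)       # (start, end) indices of group 1 in s

print(all_groups)  # ('корова', 'молоко')
print(first_span)  # (0, 6)
```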

### Greedy search

In [82]:
s = "Hello, I'm glad to see you"
r = re.search('.+ ', s)  # this search is greedy
                         # it finds the longest sequence possible
print(r.group())

Hello, I'm glad to see 


In [19]:
s = "Hello, I'm glad to see you"
r = re.search('.+? ', s)
print(r.group())

Hello, 


Without the `?`, the quantifier is greedy and matches the longest possible string. The pattern ends with a space, but `.` matches a space too, so `.+` runs on to the last space in the string.

Adding `?` after the quantifier makes it non-greedy (lazy): the match stops at the first place where the rest of the pattern can match, here the first space rather than the last.

In [37]:
s = 'корова молоко ворота'
r = re.search('.+оро.+ ', s)
print(r)
print(r.group())

<re.Match object; span=(0, 14), match='корова молоко '>
корова молоко 


In [38]:
s = 'корова молоко ворота'
r = re.search('.+оро.+? ', s)
print(r)
print(r.group())

<re.Match object; span=(0, 7), match='корова '>
корова 
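
A typical place where the difference matters is text between delimiters. The sketch below extracts the quoted words from the French sentence used earlier, first with a greedy quantifier, then with a lazy one, and finally with a negated set, which is a common alternative to the lazy form.

```python
import re

text = '''Je dis "Bonjour" quand c'est le jour. Je dis "Bonsoir" quand c'est le soir.'''

greedy = re.findall(r'".+"', text)        # runs from the first quote to the last one
lazy = re.findall(r'".+?"', text)         # stops at the nearest closing quote
negated = re.findall(r'"[^"]+"', text)    # "anything except a quote" cannot overrun

print(greedy)   # ['"Bonjour" quand c\'est le jour. Je dis "Bonsoir"']
print(lazy)     # ['"Bonjour"', '"Bonsoir"']
print(negated)  # ['"Bonjour"', '"Bonsoir"']
```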


**Find the login and domain in the email address**

In [85]:
pattern = r'(([a-zA-Z0-9_.]+)@(([a-zA-Z0-9_]+)\.([a-zA-Z]+)))'
re.findall(pattern, fgn)

[('fdp@hse.ru', 'fdp', 'hse.ru', 'hse', 'ru')]

To avoid confusion when counting parentheses, groups can be given names with `(?P<name>...)` and then retrieved by name.

In [87]:
pattern = r'(?P<login>[a-zA-Z0-9_.]+)@(?P<provider>(?P<name>[a-zA-Z0-9_]+)\.(?P<domain>[a-zA-Z]+))'
re.search(pattern, fgn).group('provider')

'hse.ru'
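
All named groups can be collected into a dictionary with `.groupdict()`. The sketch below applies the same pattern to a single line instead of the whole `fgn` string, and guards against a failed search.

```python
import re

pattern = r'(?P<login>[a-zA-Z0-9_.]+)@(?P<provider>(?P<name>[a-zA-Z0-9_]+)\.(?P<domain>[a-zA-Z]+))'

match = re.search(pattern, 'E-mail: fdp@hse.ru')
parts = match.groupdict() if match else {}  # empty dict if nothing was found

print(parts)
# {'login': 'fdp', 'provider': 'hse.ru', 'name': 'hse', 'domain': 'ru'}
```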

### You can make replacements using regular expressions

In [88]:
re.sub('.', 'a', 'some text')

'aaaaaaaaa'
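
In the call above, `.` matches every character, so each one is replaced by `a`. The argument order is `re.sub(pattern, replacement, string)`. A few more typical uses, sketched on made-up input strings: masking digits, collapsing runs of whitespace, and reordering parts of a match by referring to groups as `\1`, `\2`, `\3` in the replacement.

```python
import re

# Replace every digit with "#".
masked = re.sub(r'\d', '#', 'Tel.: +7 (495) 771-32-32')

# Replace any run of whitespace (spaces, line breaks) with a single space.
collapsed = re.sub(r'\s+', ' ', 'some   text\nhere')

# Reorder day.month.year into year-month-day using group references.
reordered = re.sub(r'(\d{2})\.(\d{2})\.(\d{4})', r'\3-\2-\1', 'Date: 14.11.2025')

print(masked)     # Tel.: +# (###) ###-##-##
print(collapsed)  # some text here
print(reordered)  # Date: 2025-11-14
```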