# An introduction to regular expressions
Decoding simple regex features to match complex text patterns.

From: https://www.oreilly.com/content/an-introduction-to-regular-expressions/

<i>Offtopic | HTML-codes from: https://www.startertutorials.com/ajwt/character-formatting-essentials.html</i>

In [1]:
import re
import pandas as pd

Match two capital letters like AA, AB, AC, DE etc.:

In [2]:
result = re.fullmatch(pattern="[A-Z]{2}", string = "TX")

if result:
    print("Match")
else:
    print("No match")

Match


In [3]:
abb = pd.Series(["TX", "AX", "SW", "asc"], name = 'abbreviations')
abb

0     TX
1     AX
2     SW
3    asc
Name: abbreviations, dtype: object
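
The Series above is only displayed, never tested against the pattern. A minimal sketch of checking every abbreviation against the two-capital-letter pattern could look like this. It uses a plain list in place of the Series so that only the standard library is needed; the names `abbreviations` and `is_valid` are made up for the illustration.

```python
import re

# Same values as the `abb` Series, held in a plain list for this sketch.
abbreviations = ["TX", "AX", "SW", "asc"]

# True where the whole string consists of exactly two capital letters.
is_valid = [bool(re.fullmatch(r"[A-Z]{2}", s)) for s in abbreviations]
print(is_valid)
```

With pandas, `abb.str.fullmatch("[A-Z]{2}")` should produce the same booleans as a Series, assuming a pandas version that provides `Series.str.fullmatch`.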

Match $10:

In [4]:
result = re.fullmatch(pattern=r"\$10", string = "$10")

if result:
    print("Match")
else:
    print("No match")

Match


<code>\s</code> matches any whitespace character (space, tab, newline):

In [5]:
result = re.fullmatch(pattern=r"Lorem\sIpsum", string = "Lorem Ipsum")

if result:
    print("Match")
else:
    print("No match")

Match


First character should be 0, 1 or 3 and second character should be F, X or B. Any combination is possible:

In [6]:
result = re.fullmatch(pattern="[013][FXB]", string = "1X")

if result:
    print("Match")
else:
    print("No match")

Match


First character should be either 0 or 1 or 3. Same for the second and third character. The fourth character should be F, X or B. Any combination is possible:

In [7]:
result = re.fullmatch(pattern="[013]{3}[FXB]", string = "130X")

if result:
    print("Match")
else:
    print("No match")

Match


The first three characters may only be the capital letters <kbd>S, T, E, F, A</kbd> and <kbd>N</kbd>. The fourth and last character may only be one of the lowercase letters v, a, n, l and i:

In [8]:
result = re.fullmatch(pattern="[STEFAN]{3}[vanli]{1}", string = "EFEl")

if result:
    print("Match")
else:
    print("No match")

Match


You can specify a range of characters to allow. Here only the letters B to H are allowed for the first character and only the digits 3 to 7 for the second.

In [9]:
result = re.fullmatch(pattern="[B-H][3-7]", string = "D5")

if result:
    print("Match")
else:
    print("No match")

Match


The allowed values for the first character are the ranges A to Z, a to z and 0 to 9. They are written back to back with no separator, A-Za-z0-9, because the regex engine reads them as three consecutive ranges: (A-Z)(a-z)(0-9). A hyphen placed between them would be read as an extra literal character, as in (A-Z)(-)(a-z)(-)(0-9). The order of the ranges does not matter: uppercase letters, lowercase letters and digits have no priority over one another.

In [10]:
result = re.fullmatch(pattern="[A-Za-z0-9][3-7]", string = "a5")

if result:
    print("Match")
else:
    print("No match")

Match


An alternative to <i>0-9</i> is <i>\d</i>. Example:

In [39]:
result = re.fullmatch(pattern=r"[A-Za-z\d]", string = "5")

if result:
    print("Match")
else:
    print("No match")

Match


<code>^</code> (a caret) at the start of a character class means <i>match any character except the ones listed after this symbol</i>:

In [11]:
result = re.fullmatch(pattern="[^1-3]", string = "4")

if result:
    print("Match")
else:
    print("No match")

Match
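
To see which inputs a negated class accepts, here is a small sketch that runs several one-character strings through the same pattern. The list `candidates` is made up for the illustration.

```python
import re

# [^1-3] accepts any single character except 1, 2 or 3.
candidates = ["0", "1", "2", "3", "4", "a", "-"]
accepted = [c for c in candidates if re.fullmatch(r"[^1-3]", c)]
print(accepted)
```

Note that the negation is not limited to digits: letters and punctuation pass as well, since they are not in the excluded range.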


To include a literal dash <kbd>-</kbd> in a character class, put it first or last:

In [12]:
result = re.fullmatch(pattern="[-1-3]", string = "-") # or [1-3-]

if result:
    print("Match")
else:
    print("No match")

Match
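
A short sketch comparing the two dash placements with a third option, escaping the dash. The dictionary `accepted` is a hypothetical helper that records which of the characters `-`, `1`, `2`, `3` each pattern lets through.

```python
import re

# Three ways to write a dash inside a character class.
patterns = [r"[-1-3]", r"[1-3-]", r"[1\-3]"]

accepted = {}
for p in patterns:
    # Test each of the characters "-", "1", "2", "3" on its own.
    accepted[p] = [c for c in "-123" if re.fullmatch(p, c)]
    print(p, accepted[p])
```

The escaped form behaves differently: in `[1\-3]` the dash no longer builds a range, so the class contains only `1`, `-` and `3`, and `2` is rejected.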


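<code>^</code> at the beginning says: <i>the text should begin with this pattern</i>
<br>
<code>$</code> at the end says: <i>the text should end with this pattern</i>
<br> <br>
Neither example below matches, although both work fine on regex101.com. The reason is that <code>re.fullmatch</code> requires the pattern to cover the whole string, while regex101.com looks for a match anywhere in the text, the way <code>re.search</code> does.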

In [13]:
result = re.fullmatch(pattern="^[0-9]", string = "84289ido3")

if result:
    print("Match")
else:
    print("No match")

No match


In [14]:
result = re.fullmatch(pattern="[0-9]$", string = "84289ido3")

if result:
    print("Match")
else:
    print("No match")

No match
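
The difference between `re.fullmatch` and `re.search` can be sketched on the same input. The variable names are chosen for the illustration only.

```python
import re

text = "84289ido3"

# fullmatch: the pattern must cover the entire string, so one digit is not enough.
full = re.fullmatch(r"^[0-9]", text)
print(full)

# search: the pattern may match just a part of the string.
starts_with_digit = re.search(r"^[0-9]", text)
ends_with_digit = re.search(r"[0-9]$", text)
print(starts_with_digit.group())  # first character of the text
print(ends_with_digit.group())    # last character of the text
```

With `re.search` the anchors do real work: `^` pins the match to the start of the text and `$` to its end.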


Anchoring both the start <code>^</code> and the end <code>$</code> of a string forces everything between them to be the only content allowed in the input, as in the example below, where exactly two digits are allowed:

In [15]:
result = re.fullmatch(pattern="^[0-9][0-9]$", string = "84")

if result:
    print("Match")
else:
    print("No match")

Match


Match a Dutch phone number:

In [16]:
result = re.fullmatch(pattern="[+][3][1] [6] [0-9]{2} [0-9]{2} [0-9]{2} [0-9]{2}", 
                      string = "+31 6 83 26 10 38")

if result:
    print("Match")
else:
    print("No match")

Match


Use <code>{}</code> to state the number of repetitions. E.g. <code>[0-9]{2}</code> is equal to <code>[0-9][0-9]</code> and means <i>match two digits in a row</i> (the two digits do not have to be the same):

In [17]:
result = re.fullmatch(pattern="[0-9]{2}", 
                      string = "38")

if result:
    print("Match")
else:
    print("No match")

Match


Match a 10-digit phone number with dashes:

In [18]:
result = re.fullmatch(pattern="[0-9]{3}-[0-9]{3}-[0-9]{4}", 
                      string = "470-127-7501")

if result:
    print("Match")
else:
    print("No match")

Match


Use <code>{n,m}</code> (n,m ← without spaces!) to state the <i>minimum</i> and <i>maximum</i> number of repetitions. E.g. <code>[0-9]{2,4}</code> means <i>at least 2 and at most 4 digits</i>. If you leave the second number empty, as in <code>{n,}</code>, you state a <i>minimum</i> without a <i>maximum</i>.

In [19]:
result = re.fullmatch(pattern="[0-9]{4,6}", 
                      string = "746311")

if result:
    print("Match")
else:
    print("No match")

Match
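
A quick sketch of how the bounds behave, testing digit strings of length 3 to 7 against a bounded and an open-ended quantifier. The dictionaries `bounded` and `open_ended` are hypothetical names that map each length to the result.

```python
import re

lengths = range(3, 8)

# {4,6} accepts 4, 5 or 6 digits; {4,} accepts 4 or more.
bounded = {n: bool(re.fullmatch(r"[0-9]{4,6}", "7" * n)) for n in lengths}
open_ended = {n: bool(re.fullmatch(r"[0-9]{4,}", "7" * n)) for n in lengths}

print(bounded)
print(open_ended)
```

Both bounds are inclusive, so a string of exactly 4 or exactly 6 digits passes the bounded pattern.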


To make a value optional, use <code>?</code> or <code>{0,1}</code>:

In [20]:
result = re.fullmatch(pattern="[0-9]?[A-Za-z]", 
                      string = "1A")

if result:
    print("Match")
else:
    print("No match")

Match


In [21]:
result = re.fullmatch(pattern="[0-9]{0,1}[A-Za-z]", 
                      string = "1A")

if result:
    print("Match")
else:
    print("No match")

Match


Making the dashes in the phone number optional:

In [22]:
result = re.fullmatch(pattern="[0-9]{3}-?[0-9]{3}-?[0-9]{4}", 
                      string = "4701277501")

if result:
    print("Match")
else:
    print("No match")

Match


Making the country code <i>+31</i> optional as well as spaces:

In [23]:
result = re.fullmatch(pattern="[+]?[3]?[1]? ?[6] ?[0-9]{2} ?[0-9]{2} ?[0-9]{2} ?[0-9]{2}", 
                      string = "6 83261038")

if result:
    print("Match")
else:
    print("No match")

Match


1 or more repetitions can be made using <code>+</code> or <code>{1,}</code>:

In [24]:
result = re.fullmatch(pattern="[5]+[A-Za-z]", 
                      string = "555A")

if result:
    print("Match")
else:
    print("No match")

Match


0 or more repetitions can be made using <code>*</code> or <code>{0,}</code>:

In [25]:
result = re.fullmatch(pattern="[5]*[A-Za-z]", 
                      string = "A")

if result:
    print("Match")
else:
    print("No match")

Match


<code>.</code> means <i>any character</i> (except a newline, by default). In the example below there is (1) exactly one digit 5, followed by (2) any three characters, followed by (3) one capital letter:
<br>
<br>
Note that <code>.{3}</code> = <code>...</code>

In [26]:
result = re.fullmatch(pattern="[5].{3}[A-Z]", 
                      string = "5&@)D")

if result:
    print("Match")
else:
    print("No match")

Match


The regex counterpart of a match-anything wildcard (such as <code>%</code> in SQL's <code>LIKE</code>) is <code>.*</code>. It means <i>match any character 0 or more times</i>.

In [27]:
result = re.fullmatch(pattern=".*", 
                      string = "5&@)D")

if result:
    print("Match")
else:
    print("No match")

Match


Use <code>()</code> for grouping. In the example below the group makes the space and the two letters of the postal code optional as one unit:

In [28]:
result = re.fullmatch(pattern="[0-9]{4}( [A-Za-z]{2})?", 
                      string = "3024 SG")

if result:
    print("Match")
else:
    print("No match")

Match


In [29]:
result = re.fullmatch(pattern="([A-Z]{4}-)?[0-9]{3}-?[0-9]{4}", 
                      string = "DEBS-127-7501")

if result:
    print("Match")
else:
    print("No match")

Match
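
Grouping also helps with the optional country code from the Dutch phone number cell. There each of `+`, `3` and `1` is optional on its own, so a partial prefix such as `+1` also passes. The sketch below compares that pattern with a grouped variant, which is my own suggestion and not part of the original article; `loose`, `grouped` and `numbers` are illustrative names.

```python
import re

# Pattern from the earlier cell: every part of "+31" is optional on its own.
loose = r"[+]?[3]?[1]? ?[6] ?[0-9]{2} ?[0-9]{2} ?[0-9]{2} ?[0-9]{2}"
# Sketch: the country code is optional only as a whole.
grouped = r"([+]31)? ?6 ?[0-9]{2} ?[0-9]{2} ?[0-9]{2} ?[0-9]{2}"

numbers = ["+31 6 83 26 10 38", "6 83261038", "+1 6 83261038"]
loose_ok = [bool(re.fullmatch(loose, n)) for n in numbers]
grouped_ok = [bool(re.fullmatch(grouped, n)) for n in numbers]

for n, a, b in zip(numbers, loose_ok, grouped_ok):
    print(n, a, b)
```

Only the grouped pattern rejects the third number, because `([+]31)?` accepts either the full `+31` or nothing.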


<code>OR</code> is expressed with <code>|</code>. It alternates two or more patterns, at least one of which must match at that position. The example below matches US zip codes that end with 35 or 75: three digits followed by either 35 or 75, five digits in total.

In [30]:
result = re.fullmatch(pattern="[0-9]{3}(35|75)", 
                      string = "93435")

if result:
    print("Match")
else:
    print("No match")

Match


Look for postal codes with a number from 3025 to 3030, an optional space, and the letters AN or NK:

In [31]:
result = re.fullmatch(pattern="[3][0](25|26|27|28|29|30) ?(AN|NK)", 
                      string = "3030NK")

if result:
    print("Match")
else:
    print("No match")

Match
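
The six alternatives can be written more compactly by combining alternation with a character range. This is a sketch of one possible rewrite, not the article's own pattern; it checks that both versions agree on the codes 3020 to 3035.

```python
import re

verbose = r"[3][0](25|26|27|28|29|30) ?(AN|NK)"
# 2[5-9] covers 25..29, and the second alternative adds 30.
compact = r"30(2[5-9]|30) ?(AN|NK)"

codes = [f"{n} AN" for n in range(3020, 3036)]
matched = [c for c in codes if re.fullmatch(compact, c)]
print(matched)

# Both patterns should accept and reject exactly the same codes.
same = all(
    bool(re.fullmatch(verbose, c)) == bool(re.fullmatch(compact, c))
    for c in codes
)
print(same)
```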


<code>|</code> can also be used simply to list a set of literal values. E.g. to match only ALPHA, BETA, and GAMMA, separate them with <code>|</code>:

In [32]:
result = re.fullmatch(pattern="ALPHA|BETA|GAMMA", 
                      string = "BETA")

if result:
    print("Match")
else:
    print("No match")

Match


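<code>(?<=text)</code> is a prefix (a lookbehind) and <code>(?=text)</code> is a suffix (a lookahead). Use a prefix when you want to e.g. extract numbers that are preceded by uppercase letters, like getting <i>12</i> from <i>ALPHA12</i>. The prefix or suffix is checked but is not part of the match, so <code>re.search</code> is used here instead of <code>re.fullmatch</code>. Python's <code>re</code> also requires a fixed-width prefix, hence <code>[A-Z]</code> rather than <code>[A-Z]+</code>.

In [None]:
# A prefix: digits directly preceded by an uppercase letter
result = re.search(pattern="(?<=[A-Z])[0-9]+",
                   string = "ALPHA12")

if result:
    print("Match")
else:
    print("No match")

In [None]:
# A suffix: digits directly followed by an uppercase letter
result = re.search(pattern="[0-9]+(?=[A-Z]+)",
                   string = "12ALPHA")

if result:
    print("Match")
else:
    print("No match")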
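
To see what a prefix or suffix condition actually extracts, here is a small sketch with `re.findall` on two made-up sample texts. Only the digits are returned; the letters that satisfy the condition are left out of the result.

```python
import re

# Digits directly preceded by an uppercase letter (fixed-width lookbehind).
after_upper = re.findall(r"(?<=[A-Z])[0-9]+", "ALPHA12 beta34 GAMMA56")
print(after_upper)

# Digits directly followed by an uppercase letter (lookahead).
before_upper = re.findall(r"[0-9]+(?=[A-Z])", "12ALPHA 34beta 56GAMMA")
print(before_upper)
```

The lowercase tokens `beta34` and `34beta` contribute nothing, because their digits are not next to an uppercase letter.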

From: https://towardsai.net/p/l/regular-expression-regex-in-python-the-basics

Syntax of the <code>findall</code> function:

re.findall(pattern = <replace with the regex pattern>,
    string = <replace with the text>)

In [43]:
re.findall(pattern = r"Player\d",
           string = "Player1 and Player2 form a team.")

['Player1', 'Player2']

In [48]:
print('Line1 \nLine2')

Line1 
Line2


In [46]:
print('Line1 \tLine2')

Line1 	Line2


1. Literal match of word <i>player</i>:

In [49]:
text = "My team has player1, player2, and playerN."

re.findall(r"player", text)

['player', 'player', 'player']

2. Match a digit using <code>\d</code>

In [50]:
re.findall(r"player\d", text)

['player1', 'player2']

3. Match a non-digit using <code>\D</code>

In [56]:
re.findall(r"player\D", text)

['playerN']

4. Match a word character using <code>\w</code>

In [60]:
re.findall(r"player\w", text)

['player1', 'player2', 'playerN']

5. Match a non-word character using <code>\W</code>

In [62]:
text2 = "#Learn_Regex_in_5_Minutes"

re.findall(r"\W", text2)

['#']
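
Only `#` is returned above because the underscore counts as a word character. A short sketch contrasting `\w` with an explicit class that leaves the underscore out; the variable names are illustrative.

```python
import re

text2 = "#Learn_Regex_in_5_Minutes"

# \w includes letters, digits and the underscore, so this is one long run.
word_runs = re.findall(r"\w+", text2)
print(word_runs)

# Without the underscore in the class, the text splits at every underscore.
alnum_runs = re.findall(r"[A-Za-z0-9]+", text2)
print(alnum_runs)
```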

6. Match whitespace with <code>\s</code>

In [63]:
text3 = "I am learning Python regex."

re.findall(r"Python\sregex", text3)

['Python regex']

7. Match a non-whitespace with <code>\S</code>

In [64]:
text4 = "I like sugar-free coffee."

re.findall(r"sugar\Sfree", text4)

['sugar-free']

Extract phone number from the text

In [68]:
text5 = "My phone numbers is 1234567890."

re.findall(r"\d{10}", text5)

['1234567890']

In [69]:
re.findall(r".*", text5)

['My phone numbers is 1234567890.', '']
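
The empty string at the end of the result above appears because `*` also allows zero characters: after consuming the whole text, `findall` finds one more match of length zero at the very end. A minimal sketch comparing `.*` with `.+`, which needs at least one character:

```python
import re

text5 = "My phone numbers is 1234567890."

# .* can match nothing, so an extra empty match is found at the end of the text.
with_star = re.findall(r".*", text5)
print(with_star)

# .+ needs at least one character, so only the text itself is returned.
with_plus = re.findall(r".+", text5)
print(with_plus)
```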

1. Match one or more times using <code>+</code>

In [72]:
text5 = "My phone numbers is 1234567890."

re.findall(r"\d+", text5)

['1234567890']

2. Match exactly <i>n</i> occurrences using <code>{n}</code>

In [74]:
text6 = "I studies comp1234. My friend studies data5678"

re.findall(r"\w{4}\d{4}", text6)

['comp1234', 'data5678']

3. Match the preceding character zero or one time using <code>?</code>. In this example it matches both <i>cat</i> and <i>cats</i>

In [75]:
text7 = "I have a cat. My friend has two cats."

re.findall(r"cats?", text7)

['cat', 'cats']

4. Special characters like <kbd>\\</kbd> <kbd>*</kbd> <kbd>+</kbd> <kbd>?</kbd> cannot be matched directly: code like <code>re.findall(r"+", text)</code> raises an error. Put the escape character <code>\\</code> before the special character.

In [76]:
text8 = "1 + 1 = 2"

re.findall(r"\+", text8)

['+']

In [97]:
text9 = r"Desktop \ New Folder"

re.findall(r"\\", text9)

['\\']
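
When the text to search for contains several special characters, escaping each one by hand gets tedious. The standard library's `re.escape` builds the escaped pattern for you. A minimal sketch, with `literal` as a made-up sample value:

```python
import re

# Text that should be matched literally, including the special character "+".
literal = "1 + 1"

# re.escape returns a pattern in which special characters are escaped.
pattern = re.escape(literal)
print(pattern)

found = re.findall(pattern, "1 + 1 = 2")
print(found)
```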