# Regex introduction

## What is a regex?
[**Regex**](https://en.wikipedia.org/wiki/Regular_expression) stands for _regular expression_, and regular expressions are a way of writing patterns that match strings. These patterns are typically used to search a string for specific pieces of text, or to search for them and then replace them with something else. Regular expressions are great for string manipulation!

## Why do regular expressions matter?
From the first paragraph of this guide you might have guessed it, but regular expressions can be very useful **whenever you have to deal with strings**, from the basic renaming of a set of similarly named variables in your source code to [data preprocessing](https://github.com/clone95/Virgilio/blob/master/Specializations/HardSkills/DataPreprocessing.md). Regular expressions usually offer a concise way of expressing the kind of thing you want to find. For example, if you wanted to parse a form and look for the year that someone might have been born in, you could use something like `(19|20)[0-9][0-9]`, which matches "19" or "20" followed by two digits. This is an example of a regular expression!

## Prerequisites
This guide does not assume any prior knowledge. Examples will be coded in Python, but mastery of the programming language is neither assumed nor needed. You are welcome to read the guide in your browser, or to download it, run the examples and toy around with them.

# Index
 - [Basic regex](#Basic-regex)
   - [Using Python re](#Using-Python-re)
   - [$\pi$ lookup](#$\pi$-lookup)
 - [Matching options](#Matching-options)
   - [Virgilio or Virgil?](#Virgilio-or-Virgil?)
 - [Matching repetitions](#Matching-repetitions)
   - [Greed](#Greed)
   - [Removing excessive spaces](#Removing-excessive-spaces)
 - [Special characters](#Special-characters)
 - [Toy project about regex](#Toy-project-about-regex)
 - [Further reading](#Further-reading)
 - [Suggested solutions](#Suggested-solutions)
 
Let's dive right in!

**Just a quick word:** I tried to include some small exercises whenever I show you something new, so that you can try and test your knowledge. Examples of solutions are provided at the [end of the notebook](#Suggested-solutions).

## Basic regex

A regex is just a string written in a certain format, that can then be used by specific tools/libraries/programs to perform pattern matching on strings. Throughout this guide we will use `this formatting` to refer to regular expressions!

The simplest regular expressions that one can create are just composed of regular characters. If you wanted to find all the occurrences of the word _"Virgilio"_ in a text, you could write the regex `Virgilio`. In this regular expression, no character is doing anything special or different. In fact, this regular expression is just a normal word. That is ok, regular expressions are strings, after all!

If you were given the text _"Project Virgilio is great"_, you could use your `Virgilio` regex to find the occurrence of the word _"Virgilio"_. However, if the text was _"Project virgilio is great"_, then your regex wouldn't work, because regular expressions are **case-sensitive** by default, so every character has to match exactly. We say that `Virgilio` matches the sequence of characters "Virgilio" literally.

### Using Python re

To check if our regular expressions are working well and to give you the opportunity to directly experiment with them, we will be using Python's `re` module to work with regular expressions. To use the `re` module we first import it, then define a regular expression and then use the `search()` function over a string! Pretty simple:

In [1]:
import re

regex = "Virgilio"
str1 = "Project Virgilio is great"
str2 = "Project virgilio is great"

if re.search(regex, str1):
    print("'{}' is in '{}'".format(regex, str1))
else:
    print("'{}' is not in '{}'".format(regex, str1))
    
if re.search(regex, str2):
    print("'{}' is in '{}'".format(regex, str2))
else:
    print("'{}' is not in '{}'".format(regex, str2))

'Virgilio' is in 'Project Virgilio is great'
'Virgilio' is not in 'Project virgilio is great'


The `re.search(regex, string)` function takes a regex as first argument and then searches for any matches over the string that was given as the second argument. However, the return value of the function is **not** a boolean, but a *match object*:

In [2]:
print(re.search(regex, str1))

<re.Match object; span=(8, 16), match='Virgilio'>


Match objects have relevant information about the match(es) encountered: the start and end positions, the string that was matched, and even some other things for more complex regular expressions.

We can see that in this case the match is exactly the same as the regular expression, so it may look like the `match` information inside the match object is irrelevant... but it becomes relevant as soon as we introduce options or repetitions into our regex.

If no matches are found, then the `.search()` function returns `None`:

In [3]:
print(re.search(regex, str2))

None
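
The search above fails only because matching is case-sensitive by default. As a small aside, and not something the rest of this guide relies on, Python's `re` module also accepts the `re.IGNORECASE` flag as an optional third argument to `search()`. The sketch below uses the same example sentence as before:

```python
import re

text = "Project virgilio is great"

# Default behaviour: case-sensitive, so the capital "V" does not match "v".
strict = re.search("Virgilio", text)

# With the re.IGNORECASE flag, letters match regardless of their case.
relaxed = re.search("Virgilio", text, re.IGNORECASE)

print(strict)           # None
print(relaxed.group())  # virgilio
```

The flag relaxes the case of *every* letter in the pattern, which is more permissive than only allowing the first letter to vary.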


Whenever the match is not `None`, we can save the returned match object and use it to extract all the needed information!

In [4]:
m = re.search(regex, str1)
if m is not None:
    print("The match started at pos {} and ended at pos {}".format(m.start(), m.end()))
    print("Or with tuple notation, the match is at {}".format(m.span()))
    print("And btw, the actual string matched was '{}'".format(m.group()))

The match started at pos 8 and ended at pos 16
Or with tuple notation, the match is at (8, 16)
And btw, the actual string matched was 'Virgilio'


Now you should try to get some more matches and some fails with your own literal regular expressions. I provide three examples of my own:

In [5]:
m1 = re.search("regex", "This guide is about regexes")
if m1 is not None:
    print("The match is at {}\n".format(m1.span()))

m2 = re.search("abc", "The alphabet goes 'abdefghij...'")
if m2 is None:
    print("Woops, did I just get the alphabet wrong..?\n")
    
s = "aaaaa aaaaaa a aaa"
m3 = re.search("a", s)
if m3 is not None:
    print("I just matched '{}' inside '{}'".format(m3.group(), s))

The match is at (20, 25)

Woops, did I just get the alphabet wrong..?

I just matched 'a' inside 'aaaaa aaaaaa a aaa'
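
Notice that `re.search()` reported a single "a" even though the string contains many: it stops at the first match. If you want *every* occurrence, the `re` module also offers `re.findall()` and `re.finditer()`. Here is a minimal sketch; the second test sentence is made up for illustration:

```python
import re

s = "aaaaa aaaaaa a aaa"

# re.search stops at the first match it finds...
first = re.search("a", s)

# ...while re.findall returns every non-overlapping match as a list of strings,
all_matches = re.findall("a", s)

# and re.finditer yields one match object per occurrence.
spans = [m.span() for m in re.finditer("Virgilio", "Virgilio, Virgilio!")]

print(first.span())      # (0, 1)
print(len(all_matches))  # one entry per letter "a" in s
print(spans)             # [(0, 8), (10, 18)]
```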


### $\pi$ lookup

$$\pi = 3.1415\cdots$$

right? Well, what comes after the dots? An infinite sequence of digits, right? Could it be that your date of birth appears in the first million digits of $\pi$? Well, we could use a regex to find that out! Change the `regex` variable below to look for your date of birth or for any number you want, in the first million digits of $\pi$!

In [6]:
pifile = "regex-bin/pi.txt"
regex = ""  # define your regex to look your favourite number up

with open(pifile, "r") as f:
    pistr = f.read()  # pistr is a string that contains 1M digits of pi
    
## search for your number here

To search for numbers in the first 100 million digits of $\pi$ (or 200 million, I didn't really get it) you can check [this](https://www.angio.net/pi/piquery) website.

## Matching options

We just saw a very simple regular expression that was trying to find the word _"Virgilio"_ in text, but we also saw that we had zero flexibility and we couldn't even handle the fact that someone may have forgotten to capitalize the name properly, spelling it like _"virgilio"_ instead.

To prevent problems like this, regular expressions can be written in a way to handle different possibilities. For our case, we want the first letter to be either _"V"_ or _"v"_, and that should be followed by _"irgilio"_.

In order to handle different possibilities, we use the character `|`. For instance, `V|v` matches the letter vee, regardless of its capitalization:

In [7]:
v = "v"
V = "V"
regex = "v|V"
if re.search(regex, v):
    print("small v found")
if re.search(regex, V):
    print("big V found")

small v found
big V found


Now we can try to concatenate the regex for the first letter and the `irgilio` regex (for the rest of the name), hoping to get a regex that matches the name of Virgilio regardless of the capitalization of its first letter:

In [8]:
virgilio = "virgilio"
Virgilio = "Virgilio"
regex = "V|v" + "irgilio"
if re.search(regex, virgilio):
    print("virgilio found!")
if re.search(regex, Virgilio):
    print("Virgilio found!")

virgilio found!
Virgilio found!


Notice that the assignment `regex = "V|v" + "irgilio"` is equivalent to `regex = "V|virgilio"`, and that reveals a problem: the `|` character splits the *whole* pattern in two, so this regex means "either `V` or `virgilio`", and not "`V` or `v`, followed by `irgilio`". The search on "Virgilio" above only succeeded because it matched the single letter "V". To make the `|` apply to the vee alone, write the regex with parentheses: `(V|v)irgilio`

In [9]:
regex = "(V|v)irgilio"
print(re.search(regex, "The name of the project is virgilio, but with a big V!"))

<re.Match object; span=(27, 35), match='virgilio'>


Maybe you didn't even notice, but there is something else going on! Notice that we used the characters `|`, `(` and `)`, and those are not present in the word _"virgilio"_, but nonetheless our regex `(V|v)irgilio` matched it... that is because these three characters have special meanings in the regex world, and hence are **not** interpreted literally, contrary to what happens to any letter in `irgilio`.
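
The parentheses are not just cosmetic. The quick check below compares the pattern with and without them by printing what was actually matched with `.group()`:

```python
import re

name = "Virgilio"

# Without parentheses, | splits the whole pattern: "V" OR "virgilio".
ungrouped = re.search("V|virgilio", name)

# With parentheses, | only chooses between "V" and "v".
grouped = re.search("(V|v)irgilio", name)

print(ungrouped.group())  # V
print(grouped.group())    # Virgilio
```

Both searches return a match object, so a plain `if re.search(...)` test cannot tell them apart; looking at the matched text can.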

### Virgilio or Virgil?

Here are a couple of paragraphs from Wikipedia's [article on Virgil](https://en.wikipedia.org/wiki/Virgil):

 > Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

 > Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide through Hell and Purgatory.
 
"Virgilio" is the Italian form of "Virgil", and in the `paragraphs` string below I edited the above paragraphs to use the Italian version instead of the English one. I want you to revert this!

You might want to take a look at [`while` loops in Python](https://realpython.com/python-while-loop/), [string indexing](https://www.digitalocean.com/community/tutorials/how-to-index-and-slice-strings-in-python-3) and [string concatenation](https://realpython.com/python-string-split-concatenate-join/). The point is that you find a match, you break the string into the part _before_ the match and the part _after_ the match, and you glue those two together with _Virgil_ in between.
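
To make the "break and glue" idea concrete, here is one such step on a different, made-up sentence (fixing a single lowercase name rather than solving the exercise). Repeating the step until no match is left is up to you:

```python
import re

s = "Project virgilio is great"
m = re.search("virgilio", s)

if m is not None:
    before = s[:m.start()]   # everything before the match: "Project "
    after = s[m.end():]      # everything after the match: " is great"
    fixed = before + "Virgilio" + after
    print(fixed)  # Project Virgilio is great
```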

Notice that [string replacement](https://www.tutorialspoint.com/python/string_replace.htm) would probably be faster and easier, but that would defeat the purpose of this exercise. After fixing everything, print the final results to be sure that you fixed every occurrence of the name.

In [10]:
paragraphs = \
"""Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory."""

## Matching repetitions

Sometimes we want to find patterns that have bits that will be repeated. For example, people make an _"awww"_ or _"owww"_ sound when they see something cute, like a baby. But the number of _"w"_ I used there was completely arbitrary! If the baby is really really cute, someone might write _"awwwwwwwwwww"_. So how can I write a regex that matches _"aww"_ and _"oww"_, but with an arbitrary number of characters _"w"_?

I will illustrate several ways of capturing repetitions, by testing regular expressions against the following strings:

 - "awww" (3 letters "w")
 - "awwww" (4 letters "w")
 - "awwwwwww" (7 letters "w")
 - "awwwwwwwwwwwwwwww" (16 letters "w")
 - "aw" (1 letter "w")
 - "a" (0 letters "w")

In [48]:
cute_strings = [
    "awww",
    "awwww",
    "awwwwwww",
    "awwwwwwwwwwwwwwww",
    "aw",
    "a"
]

def match_cute_strings(regex):
    """Takes a regex, prints matches and non-matches"""
    for s in cute_strings:
        m = re.search(regex, s)
        if m:
            print("match: {}".format(s))
        else:
            print("non match: {}".format(s))

#### At least once

If I want to match all strings that contain **at least** one "w", I can use the character `+`. A `+` means that we want to find **one or more repetitions** of whatever is to the left of it. For example, the regex `a+` will match any string that has at least one "a".

In [35]:
regex = "aw+"
match_cute_strings(regex)

match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
match: aw
non match: a


#### Any number of times

If I want to match all strings that contain an arbitrary number of letters "w", I can use the character `*`. The character `*` means **match any number of repetitions** of whatever comes on the left of it, _even 0 repetitions_! So the regex `a*` would match the empty string "", because the empty string "" has 0 repetitions of the letter "a".

In [36]:
regex = "aw*"
match_cute_strings(regex)

match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
match: aw
match: a
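
The claim about zero repetitions can be checked directly. In the sketch below, the test strings "bbb" and "aaab" are made up for illustration:

```python
import re

# "a*" allows zero repetitions, so it matches the empty string...
m_empty = re.search("a*", "")
print(repr(m_empty.group()), m_empty.span())  # '' (0, 0)

# ...and it also "matches" at the very start of a string with no "a" at all,
# because zero letters "a" can be found there.
m_none = re.search("a*", "bbb")
print(repr(m_none.group()), m_none.span())    # '' (0, 0)

# When there are letters "a" at that position, "a*" takes them.
m_some = re.search("a*", "aaab")
print(m_some.group())                         # aaa
```

This is why a pattern made only of starred pieces is rarely useful for searching: it succeeds on every string.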


#### A specific number of times

If I want to match a string that contains a certain particle a specific number of times, I can use the `{n}` notation, where `n` is replaced by the number of repetitions I want. For example, `a{3}` matches the string "aaa" but not the string "aa".

In [37]:
regex = "aw{3}"
match_cute_strings(regex)

match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
non match: aw
non match: a


**Wait a minute**, why did the pattern `aw{3}` match the longer expressions of cuteness, like "awwww" or "awwwwwww"? Because the regular expressions try to find _substrings_ that match the pattern. Our pattern is `awww` (if I write the `w{3}` explicitly) and the string **awww**w has that substring, just like the string **awww**wwww has it, or the longer version with 16 letters "w". If we wanted to exclude the strings "awwww", "awwwwwww" and "awwwwwwwwwwwwwwww" we would have to fix our regex. A better example that demonstrates how `{n}` works is by considering, instead of expressions of cuteness, expressions of amusement like "wow", "woow" and "wooooooooooooow". We define some expressions of amusement:

 - "wow"
 - "woow"
 - "wooow"
 - "woooow"
 - "wooooooooow"
 
and now we test our `{3}` pattern.

In [47]:
wow_strings = [
    "wow",
    "woow",
    "wooow",
    "woooow",
    "wooooooooow"
]

def match_wow_strings(regex):
    """Takes a regex, prints matches and non-matches"""
    for s in wow_strings:
        m = re.search(regex, s)
        if m:
            print("match: {}".format(s))
        else:
            print("non match: {}".format(s))

In [42]:
regex = "wo{3}w"
match_wow_strings(regex)

non match: wow
non match: woow
match: wooow
non match: woooow
non match: wooooooooow
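
Going back to the cute strings for a moment, the substring behaviour is easy to see by printing the matched text. The sketch below also shows `re.fullmatch()`, a different function from the `re.search()` used in this guide, as one possible way of demanding that the *whole* string fits the pattern:

```python
import re

s = "awwwww"  # 5 letters "w"

# re.search looks for the pattern anywhere in the string,
# so it is satisfied by the first "awww" it finds.
m = re.search("aw{3}", s)
print(m.group(), m.span())  # awww (0, 4)

# re.fullmatch only succeeds if the entire string matches the pattern.
print(re.fullmatch("aw{3}", s))                  # None
print(re.fullmatch("aw{3}", "awww") is not None) # True
```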


#### Between $n$ and $m$ times

Expressing amusement with only three "o" is ok, but people might also use two or four "o". How can we capture a variable number of letters, but within a range? Say I only want to capture versions of "wow" that have between 2 and 4 letters "o". I can do it with `{2,4}`.

In [43]:
regex = "wo{2,4}w"
match_wow_strings(regex)

non match: wow
match: woow
match: wooow
match: woooow
non match: wooooooooow


#### Up to $n$ times or at least $m$ times

Now we are just playing with the type of repetitions we might want, but of course we might say that we want **no more** than $n$ repetitions, which you would do with `{,n}`, or that we want **at least** $m$ repetitions, which you would do with `{m,}`.

In fact, take a look at these regular expressions:

In [44]:
regex = "wo{,4}w" # should not match strings with more than 4 o's
match_wow_strings(regex)

match: wow
match: woow
match: wooow
match: woooow
non match: wooooooooow


In [45]:
regex = "wo{3,}w" # should not match strings with fewer than 3 o's
match_wow_strings(regex)

non match: wow
non match: woow
match: wooow
match: woooow
match: wooooooooow


#### To be or not to be

Last but not least, sometimes we care about something that might or might not be present. For example, above we dealt with the English and Italian versions of the name Virgil. If we wanted to write a regular expression to capture both versions, we could write `((V|v)irgil)|((V|v)irgilio)`, or slightly more compact, `(V|v)((irgil)|(irgilio))`. But this does not look good at all, right? All we need to say is that the final "io" might or might not be present. We do this with the `?` character. So the regex `(V|v)irgil(io)?` matches the upper and lower case versions of "Virgil" and "Virgilio".

In [46]:
regex = "(V|v)irgil(io)?"
names = ["virgil", "Virgil", "virgilio", "Virgilio"]
for name in names:
    m = re.search(regex, name)
    if m:
        print("The name {} was matched!".format(name))

The name virgil was matched!
The name Virgil was matched!
The name virgilio was matched!
The name Virgilio was matched!
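
The characters `+`, `*` and `?` can be seen as shorthands for the curly-brace notation: `+` behaves like `{1,}`, `*` like `{0,}` and `?` like `{0,1}`. The sketch below checks this on a few sample strings picked for illustration:

```python
import re

samples = ["a", "aw", "awww", "virgil", "Virgilio"]

# Pairs of patterns that should behave identically.
equivalent = [
    ("aw+", "aw{1,}"),                           # one or more
    ("aw*", "aw{0,}"),                           # zero or more
    ("(V|v)irgil(io)?", "(V|v)irgil(io){0,1}"),  # zero or one
]

for shorthand, explicit in equivalent:
    for s in samples:
        m1 = re.search(shorthand, s)
        m2 = re.search(explicit, s)
        # Compare the matched text (or None when there is no match).
        r1 = m1.group() if m1 else None
        r2 = m2.group() if m2 else None
        assert r1 == r2
    print("{} behaves like {} on all samples".format(shorthand, explicit))
```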


### Greed

### Removing excessive spaces

## Special characters

To be added soon!

## Toy project about regex

To be added soon!

## Further reading
For regular expressions in Python, you can take a look at the [documentation](https://docs.python.org/3/library/re.html) of the `re` module, as well as this [regex HOWTO](https://docs.python.org/3/howto/regex.html).

[This](https://regexr.com/) interesting website (and [this one](https://regex101.com/) as well) provides an interface for you to type regular expressions and see what they match in a text. The tool also gives you an explanation of what your regular expression is doing.

---

I found some interesting websites with exercises on regular expressions. [This one](https://regexone.com/lesson/introduction_abcs) has more "basic" exercises, each one of them preceded by an explanation of whatever you will need to complete the exercise. I suggest you go through them. [Hackerrank](https://www.hackerrank.com/domains/regex) and [regexplay](http://play.inginf.units.it/#/) also have some interesting exercises, but those require you to log in in some way.

---

If you enjoyed this guide and/or it was useful, consider leaving a star in the [Virgilio repository](https://github.com/clone95/Virgilio) and sharing it with your friends!

This was brought to you by the editor of the [Mathspp Blog](https://mathspp.blogspot.com), [RojerGS](https://github.com/RojerGS).

## Suggested solutions

### $\pi$ lookup (solved)

In [11]:
pifile = "regex-bin/pi.txt"
regex = "9876"  # define your regex to look your favourite number up

with open(pifile, "r") as f:
    pistr = f.read()  # pistr is a string that contains 1M digits of pi
    
## search for your number here
m = re.search(regex, pistr)
if m:
    print("Found the number '{}' at positions {}".format(regex, m.span()))
else:
    print("Sorry, the first million digits of pi can't help you with that...")

Found the number '9876' at positions (4087, 4091)


### Virgilio or Virgil? (solved)

In [12]:
paragraphs = \
"""Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory."""

regex = "(V|v)irgilio"
parsed_str = paragraphs
m = re.search(regex, parsed_str)
while m is not None:
    parsed_str = parsed_str[:m.start()] + "Virgil" + parsed_str[m.end():]
    m = re.search(regex, parsed_str)

print(parsed_str)

Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide t