***

## <div align="center"> The Hogwarts Regex challenge

<div align="center"> Computational Social Sciences, University of Amsterdam, February 2023

***

![hogwarts](https://cdn.dribbble.com/users/59947/screenshots/12020903/media/b4aaca6fc95d40427b6bf9b3c5cc05be.jpg?compress=1&resize=1200x900&vertical=top)

Artwork by [StudioMuti](https://dribbble.com/studioMUTI)


***

### The challenge explained


Welcome challengee - it is time to perform magic with Python. To accomplish this challenge, you can use the instructions in this Jupyter Notebook as well as the following websites: 

* [Docs Python Regular Expressions](https://docs.python.org/3/library/re.html)
* [Rex Egg](https://www.rexegg.com/)
* [Google for Education Python Regular Expressions](https://developers.google.com/edu/python/regular-expressions)


Please do not use any other sources, unless it is specifically stated that you can. The [Rex Egg quickstart](https://www.rexegg.com/regex-quickstart.html) (a Regex cheat sheet) will be particularly useful once you understand the basics.


Regex (short for Regular Expressions) is used for 'pattern matching,' which is probably best explained using an example. To illustrate the power of Regex, we first need to import Python's ```re``` module, which is part of the standard library (and therefore also available in Anaconda), and we define the string ```s```. You can run the code cell below.

***

In [1]:
import re
s = 'The Philosopher\'s Stone is published in 1997, but the writing started in 1990.'
s

"The Philosopher's Stone is published in 1997, but the writing started in 1990."

***

Say that you would like to extract all 'years' from the above string. This can be achieved with the Regex pattern ```\d{4}```, which matches every run of four consecutive digits. Whilst ```\d``` matches any single digit, the quantifier ```{4}``` means 'exactly 4 instances of the preceding character.' In this particular example, you could also have used ```\d+```, meaning 'one or more digits.' Yet be careful, as such 'generous' patterns may yield more matches than you like.

For now, run the code cell below.

***

In [2]:
year = re.findall(r'\d{4}',s) 
year

['1997', '1990']
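
The difference between the two patterns only shows up once the text contains other numbers. A minimal sketch, assuming a made-up string ```text``` that is not part of the challenge data:

```python
import re

# Hypothetical example string (an assumption, not challenge data)
text = 'Published in 1997, 223 pages, 17 chapters.'

four_digits = re.findall(r'\d{4}', text)  # runs of exactly four digits
any_digits = re.findall(r'\d+', text)     # runs of one or more digits

print(four_digits)  # ['1997']
print(any_digits)   # ['1997', '223', '17']
```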

***

As a magic practitioner, you may be interested in any text or numbers in between two specific strings. The pattern ```.*?``` in the example below matches any sequence of characters (except a newline), but as few as possible. It is placed between parentheses '()' to form a capturing group, i.e. a subset of your full match.

First run the code.

***

In [3]:
s = 'a b c d e f' 
re.search(r'b(.*?)d', s).group(1).strip()

'c'

***

```.group(0)``` always gives the full match. In the example above ```.group(0)``` returns:

> b c d

But we are only interested in ```.group(1)```, the match between the two parentheses:

> c

If we added more pairs of parentheses, as we will see in the next example, then we would create more groups, which are numbered in order of appearance.

```.strip()``` is used here merely for aesthetic purposes. It removes whitespace at the beginning and end of a string.
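
What 'as few as possible' means becomes visible when the closing string occurs more than once. A minimal sketch, assuming a slightly longer made-up string in which 'd' appears twice:

```python
import re

# Hypothetical string with two candidate end points ('d' appears twice)
s = 'a b c d e d'

lazy = re.search(r'b(.*?)d', s)   # stops at the first 'd'
greedy = re.search(r'b(.*)d', s)  # runs on to the last 'd'

print(repr(lazy.group(1)))    # ' c '
print(repr(greedy.group(1)))  # ' c d e '

# Three pairs of parentheses give three numbered groups
m = re.search(r'(b)(.*?)(d)', s)
print(m.groups())  # ('b', ' c ', 'd')
```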


Also note that you can - and sometimes want to - match whitespace. This is done with the ```\s``` Regex pattern, as can be seen in the following example. Here we only match 2 word characters that are followed by a whitespace character (so 'aa ' rather than 'bb' is matched).


***

In [4]:
s = 'aa bb' 
re.search(r'\w{2}\s', s)

<re.Match object; span=(0, 3), match='aa '>

***

In the following example, we will take a look at another useful Regex trick. Imagine a situation where you are just interested in the year in which the writing of a certain book started. In such cases you can use a 'positive lookbehind' to start matching _after_ a particular pattern.

We can divide the pattern below into three parts. ```(?<=writing)``` is the lookbehind: it requires that the match starts directly after the word 'writing,' but it does not capture anything and does not count as a group. ```(.*?)``` is group 1 and matches anything that comes after that, as little as possible. And ```(\d{4})``` is group 2 and matches the first 4 digit number that follows. In other words, this [string searching algorithm](https://en.wikipedia.org/wiki/String-searching_algorithm) starts becoming interested as soon as it sees 'writing,' then it processes anything, until it bumps into a 4 digit number.

Now see it in action.

***

In [5]:
s = 'The Philosopher\'s Stone is published in 1997, but the writing started in 1990.'
start = re.search(r'(?<=writing)(.*?)(\d{4})', s)
start.group(2).strip()
# Note that there are 3 groups in total (group 0 to 2); the lookbehind is not a capturing group

'1990'
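
To check which parts of the pattern actually end up in a group, you can inspect the match object. A small sketch using the same string and pattern as above:

```python
import re

s = "The Philosopher's Stone is published in 1997, but the writing started in 1990."
start = re.search(r'(?<=writing)(.*?)(\d{4})', s)

# The full match begins right after 'writing'; the lookbehind text is not included
print(repr(start.group(0)))  # ' started in 1990'

# .groups() lists only the capturing groups (1 and 2)
print(start.groups())  # (' started in ', '1990')
```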

***

Finally, Regex is all about finding the right pattern for the right match(es). You don't want to make the pattern too "greedy" (so it gives you more than you want), and you also don't want to make it too "strict" (so it gives you less than you want).

For instance, imagine that you would like to find all three-letter words (The, but, the) in the example string above. ```\w{3}``` will not give you what you like, because it also matches three-character chunks inside longer words. Just try it out. You need to demarcate the pattern with so-called "word boundaries," using ```\b\w{3}\b```. This will make the pattern stricter.


***

In [6]:
short_words = re.findall(r'\b\w{3}\b',s) 
short_words

['The', 'but', 'the']
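
For comparison, the sketch below runs both patterns on the same string. The variable names ```loose``` and ```strict``` are only illustrative:

```python
import re

s = "The Philosopher's Stone is published in 1997, but the writing started in 1990."

loose = re.findall(r'\w{3}', s)       # also cuts longer words into chunks of three
strict = re.findall(r'\b\w{3}\b', s)  # only whole words of exactly three characters

print(loose[:4])  # ['The', 'Phi', 'los', 'oph']
print(strict)     # ['The', 'but', 'the']
```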

***

Now you should be set for your first Regex challenges. Use the above-mentioned [cheat sheet](https://www.rexegg.com/regex-quickstart.html) by Rex Egg for a nice overview of the different Regex characters and patterns.

***

***

### Your Regex challenges

***

***

**Challenge 1.** Is it likely? No. But imagine a wave of modernization at Hogwarts, in which professors added email to their stockpile of communication methods. As a data wizard, you will need to extract all email addresses from the existing documentation to make a clean email list. Find all email addresses in the string below.

To help you in the right direction, the basics of the code are already there. You just need to replace 'your_pattern' by one of your creations.

***

In [7]:
s = 'Please submit your assignments to the following email addresses. \nAstronomy: sinistra@hogwarts.edu \nDefence Against the Dark Arts: lupin@hogwarts.edu \nPotions: snape@hogwarts.edu'
mail = re.findall(r'\w+@hogwarts\.edu', s)  # escape the dot so it only matches a literal '.'
mail

['sinistra@hogwarts.edu', 'lupin@hogwarts.edu', 'snape@hogwarts.edu']
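
Keep in mind that an unescaped ```.``` in a pattern matches any character, not just a literal dot. A minimal sketch, assuming a made-up string that contains one malformed address:

```python
import re

# Hypothetical test string: the second address has an 'X' where the dot should be
text = 'snape@hogwarts.edu, snape@hogwartsXedu'

unescaped = re.findall(r'\w+@hogwarts.edu', text)  # '.' matches any character
escaped = re.findall(r'\w+@hogwarts\.edu', text)   # '\.' matches a literal dot only

print(unescaped)  # ['snape@hogwarts.edu', 'snape@hogwartsXedu']
print(escaped)    # ['snape@hogwarts.edu']
```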

*** 

**Challenge 2.** Email is not the only enemy of owl post: professors may turn to telephones, too! Identify all telephone numbers within the following string. Note that you also have to match the whitespace in between the numbers.


***

In [8]:
s = 'In case of emergency, please do call your professor. Reach out to Professor Sinistra at 010 4529 6017, Professor Lupin at 010 5529 9036, or Professor Snape at 010 8865 9046'
clean = re.findall(r'\d{3}\s\d{4}\s\d{4}', s)
clean

['010 4529 6017', '010 5529 9036', '010 8865 9046']

***

**Challenge 3.** ```re.findall``` returns a list, and, if your pattern contains multiple groups, a list of tuples. Write a Regex pattern that matches on 3 groups: (1) 'Professor Some_Family_Name', (2) ' at ', and (3) 'a telephone number'. Hence your output should look like this:

> [('Professor Sinistra', ' at ', '010 4529 6017'),
>
> ('Professor Lupin', ' at ', '010 5529 9036'),
>
> ('Professor Snape', ' at ', '010 8865 9046')]

Note that you also have to match the whitespace before and after 'at'.

***

In [9]:
tt = re.findall(r'(Professor\s\w+)(\sat\s)(\d{3}\s\d{4}\s\d{4})', s)
tt

[('Professor Sinistra', ' at ', '010 4529 6017'),
 ('Professor Lupin', ' at ', '010 5529 9036'),
 ('Professor Snape', ' at ', '010 8865 9046')]

**Challenge 4.**  As a data wizard, your goal is now to make the 'at' disappear, so that you can make a neat overview of names and telephone numbers in a pandas DataFrame, which looks as follows:

|    | name professor       | telephone number   |
|---:|:---------------------|:-------------------|
|  0 | Professor Sinistra   | 010 4529 6017      |
|  1 | Professor Lupin      | 010 5529 9036      |
|  2 | Professor Snape      | 010 8865 9046      |



There are various ways to do this, from crude to elegant. Teach yourself a method of choice. For this part you may use other websites than the ones listed above.

In [10]:
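import pandas as pd
import numpy as np
# Stack the tuples into a 2-D array (np.vstack replaces the older alias np.row_stack)
df = pd.DataFrame(np.vstack(tt))
# Drop the middle column, which only holds ' at '
df = df.drop(df.columns[1], axis=1)
df = df.rename(columns={df.columns[0]: 'name professor', df.columns[1]: 'telephone number'})
df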

|    | name professor     | telephone number   |
|---:|:-------------------|:-------------------|
|  0 | Professor Sinistra | 010 4529 6017      |
|  1 | Professor Lupin    | 010 5529 9036      |
|  2 | Professor Snape    | 010 8865 9046      |
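
One of the other ways is to never capture the 'at' in the first place. The sketch below is only one possible approach: it leaves ```\sat\s``` outside the parentheses, so ```re.findall``` returns tuples of two items. The shortened string and the name ```pairs``` are assumptions for illustration.

```python
import re

# Shortened version of the Challenge 2 string
s = ('Reach out to Professor Sinistra at 010 4529 6017, '
     'Professor Lupin at 010 5529 9036, or Professor Snape at 010 8865 9046')

# ' at ' must be present, but sits outside the parentheses, so it is not captured
pairs = re.findall(r'(Professor\s\w+)\sat\s(\d{3}\s\d{4}\s\d{4})', s)

for name, number in pairs:
    print(name, '|', number)
```

A list like ```pairs``` can be passed to ```pd.DataFrame``` together with a ```columns``` argument, which removes the need to drop a column afterwards.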


***

**Challenge 6.**  For the next part you need to learn two new tricks. First, you already looked into a positive lookbehind. Similarly, Regex also offers a 'positive lookahead,' which can yield matches _before_ a particular character. Use ```(?=some_character)```, where 'some_character' should be replaced by a character of your choice. In the example below, the Regex pattern will match on any single word character which is followed by a space and the letter 'c.'

***

In [11]:
s = 'a b c d'
m = re.search(r'\w{1}(?=\sc)', s) # note that you need to add a space character either before the c or after the \w{1}
m

<re.Match object; span=(2, 3), match='b'>
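
The lookahead checks what follows without making it part of the match. A small sketch comparing the pattern above with a version that has no lookahead (the variable names are only illustrative):

```python
import re

s = 'a b c d'

with_lookahead = re.search(r'\w(?=\sc)', s)  # ' c' is checked, not consumed
without_lookahead = re.search(r'\w\sc', s)   # ' c' becomes part of the match

print(repr(with_lookahead.group(0)))     # 'b'
print(repr(without_lookahead.group(0)))  # 'b c'
```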

***

Second, Regex also makes it possible to use optional patterns: you can match them, but they are not required. A ```?``` makes the preceding pattern optional. So, in the example below, you will see that exactly the same Regex pattern will match both on 'Harry' and 'Harry Potter.'

***

In [12]:
h  = 'Harry'
hp = 'Harry Potter'
m1 = re.search(r'Harry(\sPotter)?', h)
m2 = re.search(r'Harry(\sPotter)?', hp)
print(m1)
print(m2)

<re.Match object; span=(0, 5), match='Harry'>
<re.Match object; span=(0, 12), match='Harry Potter'>
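
When an optional group does not take part in the match, its value is ```None```, which is worth remembering before you call string methods on it. A minimal sketch with the same pattern:

```python
import re

m1 = re.search(r'Harry(\sPotter)?', 'Harry')
m2 = re.search(r'Harry(\sPotter)?', 'Harry Potter')

print(m1.group(1))        # None, because the optional group was skipped
print(repr(m2.group(1)))  # ' Potter'
```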


***

Now it is your turn. For this challenge, you need to turn Harry's grades into a neat DataFrame. You start again by creating a list of tuples, in which you capture both subjects and grades. For this you will need a positive lookahead and some optional items. Then you turn ```tt``` into a DataFrame.

***

In [13]:
grades = 'History of Magic: A; Muggle Studies: A; Potions O; Transfiguration: E; Arithmancy: A; Divination: O;'
tt = re.findall(r'(\w+\s?\w+\s?\w+:\s)(\w{1})(?=;)', grades)
tt

[('History of Magic: ', 'A'),
 ('Muggle Studies: ', 'A'),
 ('Transfiguration: ', 'E'),
 ('Arithmancy: ', 'A'),
 ('Divination: ', 'O')]
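
Note that 'Potions O;' is missing from this output, because that entry has no colon and the pattern requires one. A sketch of one possible variation, assuming you also want to catch such entries: the colon is made optional with ```:?``` and is left outside the first group. The name ```tt_all``` is only illustrative.

```python
import re

grades = 'History of Magic: A; Muggle Studies: A; Potions O; Transfiguration: E; Arithmancy: A; Divination: O;'

# (\w+(?:\s\w+)*)  one or more words, captured without the colon
# :?               an optional colon
# (\w)(?=;)        a single-letter grade that is followed by ';'
tt_all = re.findall(r'(\w+(?:\s\w+)*):?\s(\w)(?=;)', grades)

for subject, grade in tt_all:
    print(subject, '->', grade)
```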

In [14]:
import pandas as pd
import numpy as np
# Stack the tuples into a 2-D array (np.vstack replaces the older alias np.row_stack)
df = pd.DataFrame(np.vstack(tt))
df = df.rename(columns={df.columns[0]: 'subject', df.columns[1]: 'grade'})
df

|    | subject           | grade   |
|---:|:------------------|:--------|
|  0 | History of Magic: | A       |
|  1 | Muggle Studies:   | A       |
|  2 | Transfiguration:  | E       |
|  3 | Arithmancy:       | A       |
|  4 | Divination:       | O       |


***

**Challenge 7.** ```re.sub()``` is a frequently used operation to clean and organize (textual) data. It works with a pattern that matches something, which is then replaced by something else. For instance, here we replace all underscores with spaces.

***

In [15]:
s = 'Hogwarts_School_of_Witchcraft_and_Wizardry'
space = re.sub(r'_', ' ', s)
space

'Hogwarts School of Witchcraft and Wizardry'

***

Things can be made more interesting by adding a function that can deal with different scenarios. Build a function called ```grader``` which transforms the single letter grades into grades that are fully written out. If you do not know them by heart, then you can find the meaning of the different grades at Hogwarts [here](https://www.hp-lexicon.org/thing/grades-at-hogwarts/). For this part you may again use other websites than the ones listed above.

***

In [16]:
grades = 'History of Magic: A; Muggle Studies: A; Potions O; Transfiguration: E; Arithmancy: A; Divination: O;'
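def grader(match_obj):
    # Each alternative in the pattern below is its own group;
    # the groups that did not take part in the match are None
    if match_obj.group(1) == 'O;': return 'Outstanding;'
    if match_obj.group(2) == 'E;': return 'Exceeds expectations;'
    if match_obj.group(3) == 'A;': return 'Acceptable;'

# re.sub calls grader once for every match and inserts its return value
full_grades = re.sub(r"(O;)|(E;)|(A;)", grader, grades)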
full_grades

'History of Magic: Acceptable; Muggle Studies: Acceptable; Potions Outstanding; Transfiguration: Exceeds expectations; Arithmancy: Acceptable; Divination: Outstanding;'

***

### Further reading

Apart from the three websites listed at the top of this Notebook, you can also read Christopher Tao's [cool blogpost](https://towardsdatascience.com/7-useful-tricks-for-python-regex-you-should-know-ec20381e22f2) on '7 Useful Tricks for Python Regex.' For instance, trick 4 shows how you can not only customize ```re.sub``` with a function, but also with a lambda function.
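
As a rough illustration of the lambda approach (a sketch of the general idea, not code taken from the blogpost), the grades from Challenge 7 can be expanded with a dictionary lookup. The name ```full_names``` is an assumption for illustration.

```python
import re

grades = 'History of Magic: A; Muggle Studies: A; Potions O; Transfiguration: E; Arithmancy: A; Divination: O;'

# Hypothetical lookup table from letter grade to written-out grade
full_names = {'O': 'Outstanding', 'E': 'Exceeds expectations', 'A': 'Acceptable'}

# The lambda receives each match object and returns the replacement text
full_grades = re.sub(r'\b([OEA]);', lambda m: full_names[m.group(1)] + ';', grades)
print(full_grades)
```

The word boundary ```\b``` keeps the pattern from firing on a longer word that merely ends in one of these letters.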

***