*Part 4: Working with Text Data*
# String methods and regular expressions

In the previous part of the tutorial we learned how to gather data from the internet. The data we retrieved was mainly **text data**. In this part of the tutorial we will therefore take a look at how we can work with text data. This week, we will get to know different **string methods** and learn how to match more complex text patterns using **regular expressions**.

## Getting help

We are selective in this tutorial and only discuss elements that we believe are most important for the purpose of this class. If you want more details, you can consult, for example, the **Python Standard Library Reference** at https://docs.python.org/3/library/ or the **Language Reference** at https://docs.python.org/3/reference/. But be warned: the amount of detail in these sources can be overwhelming. For **quick and easy-to-understand overviews** of different topics see, for example, https://www.w3schools.com/python/. Here are some specific references for today's tutorial:

String methods:
* https://www.w3schools.com/python/python_ref_string.asp
* In pandas: https://pandas.pydata.org/docs/user_guide/text.html

Regular Expressions:
* https://en.wikipedia.org/wiki/Regular_expression
* https://www.w3schools.com/python/python_regex.asp
* https://www.programiz.com/python-programming/regex


If you get stuck or don't remember how to do something, it is usually a good idea to **Google** your problem. Python has a large (and fast-growing) community and you will probably find answers to most of your questions online (e.g. on **Stack Overflow** or in a **YouTube tutorial**).

## Introduction

Suppose you would like to analyze the contents of some Wikipedia articles you retrieved. For example, you would like to know which words are most commonly used in different articles or whether the articles are positively or negatively framed. How could you do this? You may remember that the text data we retrieved in the last tutorial still looked a bit messy. A paragraph may, for example, look like this:

```
The cat (Felis catus) is a domestic species of small carnivorous mammal.[1][2] It is the only domesticated species in the family Felidae and is often referred to as the domestic cat to distinguish it from the wild members of the family.[4]
```

If we want to do text analysis, we must first clean things up a little bit. For example, we may want to lower-case all the letters, remove the punctuation, get rid of the citations etc. We may also be interested in certain parts of the text and want to retrieve them. In this tutorial, you will get to know string methods and regular expressions, so that you can do these things conveniently.
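As a preview of where this tutorial is heading, here is a minimal sketch of such a clean-up. It uses a shortened version of the paragraph above and relies on tools (`re.sub`, `lower()`, `split()`) that are explained step by step in the following sections. The variable names are only illustrative, and the choice of what counts as "punctuation" is an assumption you may want to adapt.

```python
import re

# Shortened version of the example paragraph (an assumption for brevity)
paragraph = ("The cat (Felis catus) is a domestic species of small carnivorous "
             "mammal.[1][2] It is the only domesticated species in the family Felidae.[4]")

no_citations = re.sub(r"\[\d+\]", "", paragraph)   # drop citation markers such as [1]
lowered = no_citations.lower()                     # lower-case all the letters
no_punctuation = re.sub(r"[^\w\s]", "", lowered)   # keep only word characters and whitespace
words = no_punctuation.split()                     # split into a list of words

print(words[:6])  # ['the', 'cat', 'felis', 'catus', 'is', 'a']
```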



## String methods

### Built-in string methods

In previous tutorials, we already got to know some string methods:

In [1]:
myString = "this is a String!"

In [2]:
myString.lower() # lower-case all the letters

'this is a string!'

In [3]:
myString.upper() # upper-case all the letters

'THIS IS A STRING!'

In [4]:
myString.capitalize() # Make first letter upper-case (and everything else lower-case)

'This is a string!'

><font color = ffffff> SIDENOTE: The `title()` method is like `capitalize()`, but it capitalizes each word.
>
>```python
>myString.title() # Will return 'This Is A String!'
>```

In [5]:
myString.replace("i", "ii") # replace parts of a string

'thiis iis a Striing!'

In [6]:
myString.index("String") # Find position of a specified substring

10

In [7]:
myString.split() # Tokenize a string (split the string into a list of words)

['this', 'is', 'a', 'String!']

><font color = ffffff> SIDENOTE: By default, `split()` splits on any whitespace (spaces, tabs and newlines). You can also use a custom separator.
>
>```python
>"A,B,C,D".split(",") # Will create list ['A', 'B', 'C', 'D']
>```
><font color = 4e1585>You can use the ``join()`` method to do the reverse of `split()`, i.e. to join all elements of a list into a string.
>
>```python
>" ".join(['this', 'is', 'a', 'String!']) # Join with a space as separator
>```

Python has quite a few additional built-in string methods you can use (see here: https://www.w3schools.com/python/python_ref_string.asp). Let's look at some of them:

In [8]:
myString.startswith("Hello") # Check if string starts with a specified value

False

In [9]:
myString.endswith("!") # Check if string ends with a specified value

True

><font color = ffffff> SIDENOTE: There is no ``contains`` string method. If you want to check if a string contains a certain substring, use the ``in`` operator:
>
>```python
>"this" in myString
>```

In [10]:
myString2 = "     this is a badly formatted string   "
myString2.lstrip() # remove spaces on the left

'this is a badly formatted string   '

In [11]:
myString2.rstrip() # remove spaces on the right

'     this is a badly formatted string'

In [12]:
myString2.strip() # remove spaces on both sides

'this is a badly formatted string'
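The three stripping methods are not limited to spaces. Called without an argument they remove any leading or trailing whitespace (including tabs and newlines), and they also accept a string of characters to strip instead. A small sketch with made-up example strings:

```python
messy = "...!!this is a badly formatted string!!..."

# Strip any of the listed characters ('.' and '!') from both ends
cleaned = messy.strip(".!")
print(cleaned)                # this is a badly formatted string

# Without an argument, all kinds of whitespace are removed
print("\t hello \n".strip())  # hello
```

Note that the argument is treated as a set of individual characters, not as one substring.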

If the output of a string method is another string, you can also chain several string methods:

In [13]:
myString2.strip().capitalize().replace("badly", "nicely")

'This is a nicely formatted string'

Remember that string methods do not change your object (strings are immutable). They only return a new string as output. If you want to change your original string, you have to assign this output to your variable (i.e. overwrite it with the new value):

In [None]:
myString2

In [21]:
myString2 = myString2.strip().capitalize().replace("badly", "nicely")
myString2

'This is a nicely formatted string'

### String methods in pandas

In many cases you will not be working with a single string but with **many strings**. For example, you may have built a **pandas** dataframe with a string for each observation. Your data could look as follows:

In [16]:
import pandas as pd

df = pd.DataFrame(["First cat launched into space",
                   "US President George W. Bush's cat, 19 years old",
                   "the world's heaviest cat, died at 22",
                   "The world's oldest cat (38 years)"],
                  index = ["Felicette",  "India Willie Bush", "Meow", "Creme Puff"],
                  columns=["description"])
df

Unnamed: 0,description
Felicette,First cat launched into space
India Willie Bush,"US President George W. Bush's cat, 19 years old"
Meow,"the world's heaviest cat, died at 22"
Creme Puff,The world's oldest cat (38 years)


How can we apply a string method to every description? You may remember that we could use the **`apply()` method**. Let's use it to lower-case all the letters:

In [22]:
df["description"].apply(lambda x: x.lower())

Felicette                              first cat launched into space
India Willie Bush    us president george w. bush's cat, 19 years old
Meow                            the world's heaviest cat, died at 22
Creme Puff                         the world's oldest cat (38 years)
Name: description, dtype: object

However, there is an easier way to do this. Most **string methods** are **built into the pandas library** so you can apply them more conveniently:

In [23]:
df["description"].str.lower()

Felicette                              first cat launched into space
India Willie Bush    us president george w. bush's cat, 19 years old
Meow                            the world's heaviest cat, died at 22
Creme Puff                         the world's oldest cat (38 years)
Name: description, dtype: object

As you can see, you just write ``.str`` to indicate that you would like to apply a string method and then add the method you want to use! This works the same way for the other string methods:

In [24]:
df["description"].str.capitalize()

Felicette                              First cat launched into space
India Willie Bush    Us president george w. bush's cat, 19 years old
Meow                            The world's heaviest cat, died at 22
Creme Puff                         The world's oldest cat (38 years)
Name: description, dtype: object

In [25]:
df["description"].str.replace("'", "")

Felicette                             First cat launched into space
India Willie Bush    US President George W. Bushs cat, 19 years old
Meow                            the worlds heaviest cat, died at 22
Creme Puff                         The worlds oldest cat (38 years)
Name: description, dtype: object

Note that again your object is not modified. Re-assign it to modify the column in your DataFrame:

In [26]:
df["description"] = df["description"].str.replace("'", "").str.capitalize()
df

Unnamed: 0,description
Felicette,First cat launched into space
India Willie Bush,"Us president george w. bushs cat, 19 years old"
Meow,"The worlds heaviest cat, died at 22"
Creme Puff,The worlds oldest cat (38 years)
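The `.str` accessor also offers a few helpers that have no direct counterpart among the built-in string methods, such as `str.contains()` and `str.len()`. The following sketch uses a small, made-up Series (not the `df` from above) to show how they can be combined:

```python
import pandas as pd

# Small illustrative Series (an assumption, not the tutorial's df)
cats = pd.Series(["First cat launched into space",
                  "The worlds oldest cat (38 years)"],
                 index=["Felicette", "Creme Puff"])

has_space = cats.str.contains("space", regex=False)  # literal substring test per row
n_chars = cats.str.len()                             # number of characters per row
n_words = cats.str.split().str.len()                 # number of whitespace-separated words

print(has_space.tolist())  # [True, False]
print(n_words.tolist())    # [5, 6]
```

Because `str.contains()` returns a boolean Series, it can be used directly to filter rows, e.g. `cats[has_space]`.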


---

>  <font color='teal'> **In-class exercise**: Clean the following string so that it looks nice.

In [27]:
aString = "   tHis is a %s %s teRrible StrInG %s "

In [41]:
aString.replace(" %s", "").strip().capitalize()

'This is a terrible string'

>  <font color='teal'> Print the description about the cats in ``df`` in upper case!



In [39]:
df["description"].str.upper()

Felicette                             FIRST CAT LAUNCHED INTO SPACE
India Willie Bush    US PRESIDENT GEORGE W. BUSHS CAT, 19 YEARS OLD
Meow                            THE WORLDS HEAVIEST CAT, DIED AT 22
Creme Puff                         THE WORLDS OLDEST CAT (38 YEARS)
Name: description, dtype: object



---



## Regular expressions

String methods can be very useful, but sometimes you may have to match (and extract or replace) more complex patterns in text. For example, you may want to remove all the HTML tags from a text, extract all email addresses or do complicated replacements. To do such things, we can use so-called **regular expressions** (also called regexes). Regular expressions allow you to **match text based on certain patterns**. They are not specific to Python, but are implemented in many programming languages. Regular expressions are very powerful and can get quite complicated. In this tutorial we will only get to know the basics, but feel free to explore the more advanced possibilities they offer:
* https://realpython.com/regex-python/

In Python, we can use the **``re`` module** to work with regular expressions:

In [42]:
import re

One very useful function of the ``re`` module is **``findall``**. Let's see how it works:

In [43]:
myStr = "I have 7 cats. The name of my favorite cat is Sarah! She is 14 years old."

In [46]:
re.findall("cat", myStr)

['cat', 'cat']

The method takes two mandatory arguments: The first argument is the search pattern (i.e. the regular expression) and the second argument is the string where you want to search for it. As you can see, the findall method returns a list with all the matches.

In this example, we searched our string for the word "cat" and got two matches. Searching for exact matches may not be very useful and we do not need the ``re`` module to do it. The power of regexes lies in their **special sequences** and **metacharacters** that allow you to define very sophisticated search patterns.

#### Raw strings

Before we can get started with them, there is one more concept you need to know: **raw strings**. Consider the following example:

In [47]:
mystr = "\"\a\b\c\d\e\f\n\o\p\q\r\s\t\v\w"
print(mystr)

"\c\d\e
\s	\w


What happened? The backslash has a special meaning for the Python interpreter. It can be used to **escape** the next character, meaning that a special character (e.g. ``"``) will be treated like a normal character. Moreover, in combination with certain other characters (e.g. ``n`` and ``t``) it creates special whitespace characters such as new lines or tabs.

But what if we do not want this to happen? We can use **raw strings**. This is done by placing an ``r`` in front of the string.

In [48]:
mystr = r"\"\a\b\c\d\e\f\n\o\p\q\r\s\t\v\w"
print(mystr)

\"\a\b\c\d\e\f\n\o\p\q\r\s\t\v\w


In a raw string, backslashes are treated like normal characters. This will be important for the definition of regexes.
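A quick way to convince yourself of the difference is to compare the lengths of a normal and a raw string. This is only a small illustrative check:

```python
normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n

print(len(normal), len(raw))  # 1 2

# A raw string is just a more readable way of writing doubled backslashes
print(r"\d" == "\\d")         # True
```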

### Matching special sequences

Regular expressions are defined using special sequences and meta-characters. Here are some examples for **special sequences**:

|Character| Meaning| Example|
|-|-|-|
|\w|word character (a-z, A-Z, 0-9, _; for Python 3 strings also non-ASCII letters such as é)||
|\W|not a word character||
|\d|digit (0-9)||
|\D|not a digit||
|\s|whitespace||
|\S|not a whitespace||
|\b|word boundary||
|\B|not a word boundary||
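To make the table more concrete, here is a small sketch that applies each special sequence to a short, made-up string. The string `sample` is only an illustration:

```python
import re

sample = "Tom 42!"

print(re.findall(r"\w", sample))    # ['T', 'o', 'm', '4', '2']
print(re.findall(r"\W", sample))    # [' ', '!']
print(re.findall(r"\d", sample))    # ['4', '2']
print(re.findall(r"\D", sample))    # ['T', 'o', 'm', ' ', '!']
print(re.findall(r"\s", sample))    # [' ']
print(re.findall(r"\S", sample))    # ['T', 'o', 'm', '4', '2', '!']
print(re.findall(r"\b\w", sample))  # ['T', '4']      (first character of each word)
print(re.findall(r"\B\w", sample))  # ['o', 'm', '2'] (word characters not at a word start)
```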


How can we use them? Suppose you would like to retrieve all the numbers from ``myStr``:

In [49]:
myStr = "I have 7 cats. The name of my favorite cat is Sarah! She is 14 years old."

In [50]:
re.findall("\d", myStr) # match all digits

['7', '1', '4']

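To be sure that the backslashes don't mess up your regex, it is better to use raw strings. Recent Python versions also emit a warning when a normal (non-raw) string contains an unrecognized escape sequence such as ``\d``: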

In [51]:
re.findall(r"\d", myStr) # match all digits

['7', '1', '4']

Similarly, we can match all non-word characters:

In [55]:
re.findall(r"\W", myStr) # match all non-word characters

[' ',
 ' ',
 ' ',
 '.',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '!',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.']

You can also match patterns consisting of several characters.

In [56]:
re.findall(r"\d\d", myStr) # two-digit numbers

['14']

In [57]:
re.findall(r"\b\w\w\b", myStr) # All words consisting of two word characters (this includes numbers like 14)

['of', 'my', 'is', 'is', '14']

In [59]:
re.findall(r"\b\w\we\b", myStr) # All three-letter words ending with e

['The', 'She']

### Using meta-characters

Let's now have a look at the most important metacharacters:

|Character| Meaning| Example|
|-|-|-|
|.|any character|"h."|
|[]|Set of characters|"[a-z]"|
|^|starts with (within sets: NOT)|"^\w"|
|$|ends with|"!\$"|
|+|One or more repetitions|"[a-z]+"|
|*|Zero or more repetitions|"a\w*\b"|
|?|Zero or one repetition|"a\w?\b"|
|{}|Specified number of repetitions|"[aeiou]{2}"|
|()|Group|"\s(\d+)\s"|


#### Matching different types of characters

Some of the metacharacters can be used to match specific characters. For example, you can use the ``.`` to match any character (except a newline):

In [60]:
print(re.findall(".", myStr)) # any character

['I', ' ', 'h', 'a', 'v', 'e', ' ', '7', ' ', 'c', 'a', 't', 's', '.', ' ', 'T', 'h', 'e', ' ', 'n', 'a', 'm', 'e', ' ', 'o', 'f', ' ', 'm', 'y', ' ', 'f', 'a', 'v', 'o', 'r', 'i', 't', 'e', ' ', 'c', 'a', 't', ' ', 'i', 's', ' ', 'S', 'a', 'r', 'a', 'h', '!', ' ', 'S', 'h', 'e', ' ', 'i', 's', ' ', '1', '4', ' ', 'y', 'e', 'a', 'r', 's', ' ', 'o', 'l', 'd', '.']


In [61]:
print(re.findall("a.", myStr)) # an a followed by any character

['av', 'at', 'am', 'av', 'at', 'ar', 'ah', 'ar']


**Sets** are a very useful element of regular expressions. You can use them to specify an exact set of characters you would like to match. This is done using ``[]`` brackets:

In [65]:
re.findall("[abc]", myStr) # Match all letters in set: a, b or c

['a', 'c', 'a', 'a', 'a', 'c', 'a', 'a', 'a', 'a']

In [63]:
re.findall("[abc][abc]", myStr) # Match when a, b or c is followed by a, b or c

['ca', 'ca']

What if you wanted to match all lower-case letters from a to z? Instead of writing ``"[abcdefghijklmnopqrstuvwxyz]"`` you can use the ``-`` sign to specify a range of letters (or digits):

In [66]:
print(re.findall("[a-z]", myStr)) # Match all lower case letters

['h', 'a', 'v', 'e', 'c', 'a', 't', 's', 'h', 'e', 'n', 'a', 'm', 'e', 'o', 'f', 'm', 'y', 'f', 'a', 'v', 'o', 'r', 'i', 't', 'e', 'c', 'a', 't', 'i', 's', 'a', 'r', 'a', 'h', 'h', 'e', 'i', 's', 'y', 'e', 'a', 'r', 's', 'o', 'l', 'd']


In [67]:
print(re.findall("[A-Z]", myStr)) # Match all upper case letters

['I', 'T', 'S', 'S']


In [68]:
print(re.findall("[a-zA-Z]", myStr)) # Match all lower and uppercase letters

['I', 'h', 'a', 'v', 'e', 'c', 'a', 't', 's', 'T', 'h', 'e', 'n', 'a', 'm', 'e', 'o', 'f', 'm', 'y', 'f', 'a', 'v', 'o', 'r', 'i', 't', 'e', 'c', 'a', 't', 'i', 's', 'S', 'a', 'r', 'a', 'h', 'S', 'h', 'e', 'i', 's', 'y', 'e', 'a', 'r', 's', 'o', 'l', 'd']


In [69]:
print(re.findall("[^a-zA-Z]", myStr)) # Match everything except lower and uppercase letters

[' ', ' ', '7', ' ', '.', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!', ' ', ' ', ' ', '1', '4', ' ', ' ', '.']


If you want to anchor the search to the beginning or the end of a string, you can use the ``^`` and the ``$`` characters respectively:

In [70]:
print(re.findall("^[a-zA-Z]", myStr)) # Note that ^ has a different meaning within
                                      # a set (see above!)

['I']


In [73]:
print(re.findall(r"\W$", myStr))

['.']


If you want to match one of the characters that are used as meta-characters (e.g. a literal ``.``), you need to escape them:

In [74]:
print(re.findall(r"\.", myStr)) # Match all .

['.', '.']
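If the text you want to match literally comes from a variable, escaping every metacharacter by hand is error-prone. The `re.escape()` function from the standard library does this for you. A minimal sketch with a made-up string:

```python
import re

text = "Prices: 3.50 or 3x50"

# Unescaped: the . matches any character, so '3x50' is matched as well
print(re.findall("3.50", text))             # ['3.50', '3x50']

# Escaped: the . is treated as a literal period
print(re.findall(re.escape("3.50"), text))  # ['3.50']
```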


#### Matching repetitions

Sometimes you may want to match a certain character (or set of characters) several times. For example, you may want to match all words that start with the letter "i". Let's try to do this:

In [75]:
myStr="test: i in ink idea image ignore ireland"

In [76]:
print(re.findall(r"\bi\b", myStr)) # one-letter words
print(re.findall(r"\bi[a-z]\b", myStr)) # two-letter words
print(re.findall(r"\bi[a-z][a-z]\b", myStr)) # three-letter words
print(re.findall(r"\bi[a-z][a-z][a-z]\b", myStr)) # four-letter words

['i']
['in']
['ink']
['idea']


This works, but it's not very convenient. Fortunately, there are different meta-characters that allow you to specify different types of repetitions.

We can **use a ``+``** to match a character (or set of characters) **1 or more times**:

In [77]:
re.findall(r"\bi[a-z]+\b", myStr)

['in', 'ink', 'idea', 'image', 'ignore', 'ireland']

Similarly, you can **use a ``*``** to match something **0 or more times**:

In [78]:
re.findall(r"\bi[a-z]*\b", myStr)

['i', 'in', 'ink', 'idea', 'image', 'ignore', 'ireland']

Sometimes, you may also want to match a character **0 or 1 times**. This is done **using a ``?``**:

In [79]:
re.findall(r"\bi[a-z]?\b", myStr)

['i', 'in']

You can also **specify an exact number of repetitions using ``{}`` brackets**:

In [80]:
re.findall(r"\bi[a-z]{2}\b", myStr) # 2 repetitions

['ink']

In [81]:
re.findall(r"\bi[a-z]{2,4}\b", myStr) # 2-4 repetitions

['ink', 'idea', 'image']
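The range inside the ``{}`` brackets can also be left open on one side. The sketch below reuses the example string from above (redefined here under the illustrative name `words`) to show a lower bound only and an upper bound only:

```python
import re

words = "test: i in ink idea image ignore ireland"

# {4,} means "at least 4 repetitions"
print(re.findall(r"\bi[a-z]{4,}\b", words))  # ['image', 'ignore', 'ireland']

# {,2} means "at most 2 repetitions" (including zero)
print(re.findall(r"\bi[a-z]{,2}\b", words))  # ['i', 'in', 'ink']
```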

---

>  <font color='teal'> **In-class exercise**: Consider the following string.

In [92]:
myStr= """Félicette (French pronunciation: ​[felisɛt]) was a stray Parisian cat who is
the only cat to have been successfully launched into space. She was launched on 18. October 1963
as part of the French space program. Félicette was one of 14 female cats trained for spaceflight."""

>  <font color='teal'> Write a regular expression that extracts the year of the space launch.


In [84]:
re.findall(r"\b\d{4}\b", myStr)

['1963']

>  <font color='teal'> Can you extend the regex to match the full date? Make sure your regex would work for other dates too (e.g., 21 January 2021).


In [109]:
re.findall(r"\b\d{1,2}\.?\s[a-zA-Z]+\s\d{4}\b", myStr)

['18. October 1963']

#### Group capturing

If you only want to **capture a specific part** within your matching string, you can use groups. A group is defined **using ``()`` brackets**.

In [111]:
mystr = r"The <b>cat</b> is a domestic <b>species</b>."

In [112]:
re.findall(r"<b>\w+</b>", mystr) # Match tags with bold text (can also be done with BeautifulSoup)

['<b>cat</b>', '<b>species</b>']

In [113]:
re.findall(r"<b>(\w+)</b>", mystr) # Return only the text within the tag

['cat', 'species']

We can also capture several groups. We will get a list of tuples:

In [114]:
re.findall(r"(\w+)\s<b>(\w+)</b>", mystr)

[('The', 'cat'), ('domestic', 'species')]
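Groups are handy whenever a match consists of several pieces you want to use separately. As a sketch, the date pattern from the exercise above can be split into day, month and year. The sentence is shortened and the variable names are only illustrative:

```python
import re

text = "She was launched on 18. October 1963 as part of the French space program."

# Three groups: day, month name and year
matches = re.findall(r"\b(\d{1,2})\.?\s([a-zA-Z]+)\s(\d{4})\b", text)
print(matches)  # [('18', 'October', '1963')]

# Each tuple can be unpacked into separate variables
day, month, year = matches[0]
print(year)     # 1963
```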

### Methods of the ``re`` module

Apart from ``findall``, the ``re`` module has many other useful functions (see here: https://www.programiz.com/python-programming/regex). We will take a look at two more of them: **sub** and **split**.

The **``sub`` method allows you to replace parts of a string** (that are selected through a regular expression):

In [115]:
myStr = "I have 7 cats. The name of my favorite cat is Sarah! She is 14 years old."

In [116]:
re.sub(r"\d+", "[INSERT NUMBER]", myStr)

'I have [INSERT NUMBER] cats. The name of my favorite cat is Sarah! She is [INSERT NUMBER] years old.'

As you can see, the first argument is the regular expression (i.e. the parts you want to replace), the second argument is the replacement and the last argument is your text string.

The ``sub`` method can also be useful if you want to remove parts of a string:

In [117]:
re.sub("</?b>", "", "The <b>cat</b> is a domestic species.") # Note: This can also be done using
                                                             # get_text() from BeautifulSoup!

'The cat is a domestic species.'

Furthermore, when using `()` groups, the matched text can be reused in the replacement using `\1`, `\2`, etc.

In [118]:
re.sub(r"(\d+)\. (January)", r"\2 \1", "My cat was born on 12. January.")

'My cat was born on January 12.'
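The example above only works for January. A slightly more general sketch replaces the fixed month name with a pattern for any capitalized word. This is an assumption that fits month names but would also match other capitalized words following a number and a period:

```python
import re

text = "My cats were born on 12. January and 3. March."

# Group 1: the day, group 2: any capitalized word (assumed to be a month name)
swapped = re.sub(r"(\d+)\. ([A-Z][a-z]+)", r"\2 \1", text)
print(swapped)  # My cats were born on January 12 and March 3.
```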

The **``split`` method allows you to split a string based on a regular expression**:

In [None]:
re.split(r"[.!?]\s", myStr) # Split at periods, question marks or exclamation marks
                            # that are followed by a whitespace
                            # (Note that special characters such as . and ? are
                            # treated like normal characters within sets!)

['I have 7 cats',
 'The name of my favorite cat is Sarah',
 'She is 14 years old.']

There is much more you can do with the ``re`` module – feel free to investigate and find out! Moreover, regular expressions are very common and you can find many great examples online!
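Two functions you are likely to come across are `re.search()`, which returns only the first match (or `None` if there is none), and `re.compile()`, which lets you store a pattern for repeated use. A minimal sketch with the example sentence from above (variable names are illustrative):

```python
import re

text = "I have 7 cats. The name of my favorite cat is Sarah! She is 14 years old."

# re.search returns a match object for the first match only
first_number = re.search(r"\d+", text)
print(first_number.group())      # 7
print(first_number.start())      # 7 (position of the match in the string)

# If nothing matches, re.search returns None
print(re.search(r"dog", text))   # None

# A compiled pattern offers the same functions as methods
number_pattern = re.compile(r"\d+")
print(number_pattern.findall(text))  # ['7', '14']
```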

### Regular expressions in pandas

What if we want to apply a regex to an entire column in a pandas dataframe? Many of the **string methods in pandas** (e.g. ``str.replace``, ``str.extract`` and ``str.contains``) **accept regular expressions as an input**!



In [119]:
df

Unnamed: 0,description
Felicette,First cat launched into space
India Willie Bush,"Us president george w. bushs cat, 19 years old"
Meow,"The worlds heaviest cat, died at 22"
Creme Puff,The worlds oldest cat (38 years)


In [120]:
df["description"].str.replace(r"\d{1,2}", "[INSERT AGE]", regex=True)

Felicette                                First cat launched into space
India Willie Bush    Us president george w. bushs cat, [INSERT AGE]...
Meow                     The worlds heaviest cat, died at [INSERT AGE]
Creme Puff                  The worlds oldest cat ([INSERT AGE] years)
Name: description, dtype: object

In [None]:
df["age"] = df["description"].str.extract(r"(\d{1,2})")
df

Unnamed: 0,description,age
Felicette,First cat launched into space,
India Willie Bush,"Us president george w. bushs cat, 19 years old",19.0
Meow,"The worlds heaviest cat, died at 22",22.0
Creme Puff,The worlds oldest cat (38 years),38.0
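Keep in mind that `str.extract()` returns the matches as text. If you want to calculate with the extracted values, you need to convert them to numbers, for example with `pd.to_numeric()`. The following sketch uses a small, made-up dataframe `cats` rather than the `df` from above:

```python
import pandas as pd

# Small illustrative dataframe (an assumption, not the tutorial's df)
cats = pd.DataFrame({"description": ["First cat launched into space",
                                     "The worlds oldest cat (38 years)"]})

# expand=False returns a Series instead of a one-column DataFrame
age_text = cats["description"].str.extract(r"(\d{1,2})", expand=False)

# Convert the extracted strings to numbers; rows without a match stay missing (NaN)
cats["age"] = pd.to_numeric(age_text)
print(cats["age"].max())  # 38.0
```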


---

>  <font color='teal'> **In-class exercise**: Consider the following pandas df:

In [122]:
import pandas as pd
addresses = ['Kirchgasse 59, 4000 St. Gallen', 'Burgweg 7, 8000 Biel/Bienne', 'Dorfstrasse 71, 1000 Bern', 'Bahnhofstrasse 40, 1000 Basel', 'Alpenblickstrasse 95, 3000 Lausanne', 'Bergstrasse 71, 5000 Winterthur', 'Hauptstrasse 14, 8000 Lugano', 'Güterstrasse 61, 6000 Zürich', 'Marktplatz 41, 9000 Lucerne', 'Rue du Lac 53, 3000 Genève']
df = pd.DataFrame({'address': addresses})
df

Unnamed: 0,address
0,"Kirchgasse 59, 4000 St. Gallen"
1,"Burgweg 7, 8000 Biel/Bienne"
2,"Dorfstrasse 71, 1000 Bern"
3,"Bahnhofstrasse 40, 1000 Basel"
4,"Alpenblickstrasse 95, 3000 Lausanne"
5,"Bergstrasse 71, 5000 Winterthur"
6,"Hauptstrasse 14, 8000 Lugano"
7,"Güterstrasse 61, 6000 Zürich"
8,"Marktplatz 41, 9000 Lucerne"
9,"Rue du Lac 53, 3000 Genève"


>  <font color='teal'> Use regular expressions, capturing groups and ``str.extract()`` to create the following new columns:
1. house_nr (the house number)
2. postal_code
3. city (the city name)


In [124]:
df["house_nr"] = df["address"].str.extract(r"\b(\d{1,2})\b")
df

Unnamed: 0,address,house_nr
0,"Kirchgasse 59, 4000 St. Gallen",59
1,"Burgweg 7, 8000 Biel/Bienne",7
2,"Dorfstrasse 71, 1000 Bern",71
3,"Bahnhofstrasse 40, 1000 Basel",40
4,"Alpenblickstrasse 95, 3000 Lausanne",95
5,"Bergstrasse 71, 5000 Winterthur",71
6,"Hauptstrasse 14, 8000 Lugano",14
7,"Güterstrasse 61, 6000 Zürich",61
8,"Marktplatz 41, 9000 Lucerne",41
9,"Rue du Lac 53, 3000 Genève",53


In [125]:
df["postal_code"] = df["address"].str.extract(r"\b(\d{4})\b")
df

Unnamed: 0,address,house_nr,postal_code
0,"Kirchgasse 59, 4000 St. Gallen",59,4000
1,"Burgweg 7, 8000 Biel/Bienne",7,8000
2,"Dorfstrasse 71, 1000 Bern",71,1000
3,"Bahnhofstrasse 40, 1000 Basel",40,1000
4,"Alpenblickstrasse 95, 3000 Lausanne",95,3000
5,"Bergstrasse 71, 5000 Winterthur",71,5000
6,"Hauptstrasse 14, 8000 Lugano",14,8000
7,"Güterstrasse 61, 6000 Zürich",61,6000
8,"Marktplatz 41, 9000 Lucerne",41,9000
9,"Rue du Lac 53, 3000 Genève",53,3000


In [130]:
df["city"] = df["address"].str.extract(r"\b\d{4}\b\s(.+)")
df

Unnamed: 0,address,house_nr,postal_code,city
0,"Kirchgasse 59, 4000 St. Gallen",59,4000,St. Gallen
1,"Burgweg 7, 8000 Biel/Bienne",7,8000,Biel/Bienne
2,"Dorfstrasse 71, 1000 Bern",71,1000,Bern
3,"Bahnhofstrasse 40, 1000 Basel",40,1000,Basel
4,"Alpenblickstrasse 95, 3000 Lausanne",95,3000,Lausanne
5,"Bergstrasse 71, 5000 Winterthur",71,5000,Winterthur
6,"Hauptstrasse 14, 8000 Lugano",14,8000,Lugano
7,"Güterstrasse 61, 6000 Zürich",61,6000,Zürich
8,"Marktplatz 41, 9000 Lucerne",41,9000,Lucerne
9,"Rue du Lac 53, 3000 Genève",53,3000,Genève
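As an alternative to three separate calls, all parts can be extracted at once with a single regular expression. With named groups (`(?P<name>...)`), `str.extract()` uses the group names as column names. This is only a sketch: it assumes that every address follows the pattern "street number, postal code city", and it uses just two of the addresses from above.

```python
import pandas as pd

addresses = pd.Series(["Kirchgasse 59, 4000 St. Gallen",
                       "Burgweg 7, 8000 Biel/Bienne"])

# One named group per column we want to create
pattern = r"(?P<house_nr>\d{1,2}),\s(?P<postal_code>\d{4})\s(?P<city>.+)"
parts = addresses.str.extract(pattern)

print(parts.columns.tolist())  # ['house_nr', 'postal_code', 'city']
print(parts.loc[0].tolist())   # ['59', '4000', 'St. Gallen']
```

The resulting dataframe can be combined with the original one, e.g. using `pd.concat([df, parts], axis=1)`.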


**Let's do the course evaluation now!**

https://scanserveruls.unibe.ch/evasys/online.php?pswd=KEQHK