![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

This is material for the course: Introduction to Text Mining and Natural Language Processing

by Hannes Mueller


# Session 1: Strings and RegEx

This notebook provides a quick practical introduction to strings and regular expressions. It first covers material that is discussed in the lecture slides and then gives additional material for practice, so some of the material will repeat. Going through the notebook is compulsory, and we will assume you have done so.

General advice: There is really no substitute for trial and error in this topic. I strongly encourage you to try different things as you go through. Ask GenAI for advice on commands, but not for the solution! Always try a few things to learn.

# 1) Lecture Support Material

This part of the notebook contains examples with explanations.

In [1]:
#importing RegEx
import re

## 1.1) Strings

The following is lecture material. Play with it.

In [2]:
# Concatenation: Combining strings
concatenated_string = 'Hello' + ' ' + 'World'
print("Concatenated String:", concatenated_string)

# Length: Finding the number of characters
length_of_hello = len('Hello')
print("Length of 'Hello':", length_of_hello)

# Slicing: Extracting parts of the string
sliced_string = 'Hello'[1:4]
print("Sliced String (1:4):", sliced_string)

# More Useful String Operations

# Query for content: Check if a substring exists within a string
string = "Welcome to the World of Python"
query_result = "world" in string
print("'world' in string:", query_result)

# Lowercasing: Convert all characters in a string to lowercase
uppercase_string = "WHAT you want"
lowercase_string = uppercase_string.lower()
print("Lowercase String:", lowercase_string)

# Find position: Find the first position of an occurrence
position = string.find("world")
print("Position of 'world':", position)


Concatenated String: Hello World
Length of 'Hello': 5
Sliced String (1:4): ell
'world' in string: False
Lowercase String: what you want
Position of 'world': -1


In [3]:
####################
#Key method, split!#
####################

text = "Hello, welcome to the world of Python"
words = text.split()
print(words)

text = "apple,banana,cherry,dates"
fruits = text.split(',')
print(fruits)

['Hello,', 'welcome', 'to', 'the', 'world', 'of', 'Python']
['apple', 'banana', 'cherry', 'dates']


## Test exercise for split and lower

Make a list from the following string where every word is a list item and all words are lowercased. Do it in one line.

Don't worry about punctuation for now.

In [4]:
string="It is important that the China, Russia, Europe and the United States work together in the United Nations. Like really important."
#write your code here

In [5]:
#desired result
['it',
 'is',
 'important',
 'that',
 'the',
 'china,',
 'russia,',
 'europe',
 'and',
 'the',
 'united',
 'states',
 'work',
 'together',
 'in',
 'the',
 'united',
 'nations.',
 'like',
 'really',
 'important.']


## 1.2) RegEx
## Slide: Basic Syntax

In [6]:
# Basic Syntax of RegEx

# Literals: Match exact characters
pattern1 = re.compile(r'abc')
matches1 = pattern1.findall("abc 123 abc ABC")
print("Literals Example:")
print("Matches for 'abc':", matches1)

# Metacharacters: Symbols with special meanings
# '.' matches any character, '^' matches the start of a string
pattern2 = re.compile(r'.') #any character
matches2 = pattern2.findall("abc 123 ABC")
print("\nMetacharacters Example:")
print("Matches any character in 'abc 123 ABC':", matches2)

pattern3 = re.compile(r'^a') #strings that start with a
matches3 = pattern3.findall("abc 123 abc ABC")
print("Matches 'a' at the start of 'abc':", matches3)

# Quantifiers: Specify the number of occurrences
# '*' for 0 or more, '+' for 1 or more
# The following pattern is looking for 'a' followed by zero or more 'b's.
pattern4 = re.compile(r'ab*')
matches4 = pattern4.findall("aaaab bab")
print("\nQuantifiers Example (*):")
print("Matches for 'ab*' in 'aaaab bab':", matches4)

# The following pattern is looking for 'a' followed by one or more 'b's.
pattern5 = re.compile(r'ab+')
matches5 = pattern5.findall("aaaab bab")
print("Matches for 'ab+' in 'aaaab bab':", matches5)


Literals Example:
Matches for 'abc': ['abc', 'abc']

Metacharacters Example:
Matches any character in 'abc 123 ABC': ['a', 'b', 'c', ' ', '1', '2', '3', ' ', 'A', 'B', 'C']
Matches 'a' at the start of 'abc': ['a']

Quantifiers Example (*):
Matches for 'ab*' in 'aaaab bab': ['a', 'a', 'a', 'ab', 'ab']
Matches for 'ab+' in 'aaaab bab': ['ab', 'ab']



## Slide: RegEx Functions

In [7]:
# re.findall() searches for all matches of the given pattern in the string
# and returns a list of matches.
pattern = r'\d+'  # Matches one or more digits
string = "A1 B22 C333"
matches = re.findall(pattern, string)
print("Matches found using findall:", matches)

# re.sub() replaces all matches of the given pattern in the string
# with the specified replacement text.
pattern = r'\d+'  # Matches one or more digits
string = "ID: 123"
replacement = "X"
result = re.sub(pattern, replacement, string)
print("Result after substitution:", result)

Matches found using findall: ['1', '22', '333']
Result after substitution: ID: X



## Slide: Groups and Capturing in RegEx

In [8]:
# Groups and Capturing in RegEx

# Example 1: Capturing groups in a Social Security Number
pattern1 = re.compile(r'(\d{3})-(\d{2})-(\d{4})')
match1a = pattern1.search("123-45-6789")
match1b = pattern1.search("16767-123-42-6789-34")
print("Example 1a:")
print("Groups captured:", match1a.groups())
print("Example 1b:")
print("Groups captured:", match1b.groups())

# Example 2: Capturing date elements
pattern2 = re.compile(r'(\d{2})/(\d{2})/(\d{4})')
match2 = pattern2.search("12/02/2023")
print("\nExample 2:")
print("Date captured:", match2.groups())


Example 1a:
Groups captured: ('123', '45', '6789')
Example 1b:
Groups captured: ('123', '42', '6789')

Example 2:
Date captured: ('12', '02', '2023')


## Slide: Lookahead and Lookbehind in RegEx

In [9]:
# Lookahead and Lookbehind in RegEx

# Positive Lookahead Example
pattern3 = re.compile(r'q(?=u)')
matches3 = pattern3.findall("quick quiet q-anon quack")
print("Positive Lookahead Example:")
print("Matches for q followed by u:", matches3)

# Negative Lookbehind Example
pattern4 = re.compile(r'(?<!q)u')
matches4 = pattern4.findall("guru quick q-anon quack")
print("\nNegative Lookbehind Example:")
print("Matches for u not preceded by q:", matches4)


Positive Lookahead Example:
Matches for q followed by u: ['q', 'q', 'q']

Negative Lookbehind Example:
Matches for u not preceded by q: ['u', 'u']


## Slide: (Non-)Greedy Matching in RegEx

In [10]:
# (Non-)Greedy Matching in RegEx

# Greedy Pattern Example
pattern5 = re.compile(r'<.*>')
match5 = pattern5.search("<div>Hello <span>World")
print("Greedy Match Example:")
print("Greedy match:", match5.group())

# Non-Greedy Pattern Example
pattern6 = re.compile(r'<.*?>')
matches6 = pattern6.findall("<div>Hello <span>World")
print("\nNon-Greedy Match Example:")
print("Non-Greedy matches:", matches6)


Greedy Match Example:
Greedy match: <div>Hello <span>

Non-Greedy Match Example:
Non-Greedy matches: ['<div>', '<span>']


In [11]:
# Example 1: Extracting area code and phone number
text1 = "Call me at 415-555-1011 tomorrow."
pattern1 = re.compile(r'(\d{3})-(\d{3}-\d{4})')
match1 = pattern1.search(text1)
print(match1.groups())  # ('415', '555-1011')

# Example 2: Capturing different parts of a date
text2 = "The event is on 12/25/2023."
pattern2 = re.compile(r'(\d{2})/(\d{2})/(\d{4})')
match2 = pattern2.search(text2)
print(match2.groups())  # ('12', '25', '2023')

# Example 3: Capturing domain and top-level domain from email
text3 = "Please contact us at support@example.com."
pattern3 = re.compile(r'@(\w+)\.(\w+)')  # escape the dot so it matches a literal '.'
match3 = pattern3.search(text3)
print(match3.groups())  # ('example', 'com')

# Example 4: Capturing first and last name
text4 = "Full name: Jane Doe"
pattern4 = re.compile(r'Full name: (\w+) (\w+)')
match4 = pattern4.search(text4)
print(match4.groups())  # ('Jane', 'Doe')

# Example 5: Capturing words around 'and'
text5 = "Bread and butter"
pattern5 = re.compile(r'(\w+) and (\w+)')
match5 = pattern5.search(text5)
print(match5.groups())  # ('Bread', 'butter')


('415', '555-1011')
('12', '25', '2023')
('example', 'com')
('Jane', 'Doe')
('Bread', 'butter')


## Test Exercise
Use re.sub(), lower() and \W+ to get rid of punctuation when you store the following string as a list of lowercased words.

In [12]:
string="It is important that the China, Russia, Europe and the United States work together in the United Nations. Like really important."
#write your code here

In [13]:
#desired result
['it',
 'is',
 'important',
 'that',
 'the',
 'china',
 'russia',
 'europe',
 'and',
 'the',
 'united',
 'states',
 'work',
 'together',
 'in',
 'the',
 'united',
 'nations',
 'like',
 'really',
 'important']


# 2) Exercises

### Instructions:
- This has some overlap with the lecture slides but also introduces additional characters.
- I have tried to hide all results by adding a "." at the end of each cell, which raises a SyntaxError until you delete it.
- Use the notebook this way: think what will happen in a cell before implementing it and go cell by cell. Experiment as you go along.

## 2.1) Strings

Strings behave like sequences of characters: just like lists, they support indexing, slicing, len() and loops, and this means we can do a lot of cool things with them.

Remember, first guess, then reveal!

In [14]:
string1="What is the answer to the ultimate question of life, the universe, and everything?"
string2="42"

### Indexing and slicing strings

In [15]:
#What will this do? Erase the . to find out.
string1+string2
.


SyntaxError: invalid syntax (4089159482.py, line 3)

In [None]:
#note the absence of white space



#What will this do?
string2+4
.

In [None]:
#to get around this problem write
x=int(string2)+4
print(x)
.

In [None]:
#characters in strings can be accessed by their index
string1[0]
.



In [None]:
string1[:4]
.




In [None]:
string1[:-12]
.


In [None]:
x=12
string1[12:-x]
.

In [None]:
len(string1)
.

In [None]:
for x in string1[0:4]:
    print(x)
.

In [None]:
"w" in string1
.

In [None]:
"w" in string1[0:4]
.

### lower()
lower() is a string method we will use a lot in pre-processing. It converts all letters to lowercase.

In [None]:
string1[0:4].lower()
.

In [None]:
"w" in string1[0:4].lower()
.

### find()
The find() method returns the position of the first occurrence of the specified value.

find() returns -1 if the value is not found.

string.find(sub, x, y) looks for sub in string from position x up to, but not including, position y.
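
To make the optional start and end positions concrete, here is a small sketch. The sentence and the variable names are made up for illustration and are not part of the lecture material.

```python
# Hypothetical sample sentence, chosen only for illustration
sentence = "the cat and the hat"

first_the = sentence.find("the")        # search the whole string
second_the = sentence.find("the", 1)    # start searching at position 1
missing = sentence.find("the", 1, 10)   # only look at positions 1 to 9

print(first_the, second_the, missing)   # 0 12 -1
```

The last call returns -1 because the second "the" starts at position 12, which lies outside the search window.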

In [None]:
#find gives you the position in the string
string1.find("What")
.

In [None]:
string1.find("to")
.

In [None]:
string1.find("what")
.

In [None]:
string1.lower().find("what")
.

### Cutting text out between specific substrings
The following cuts out text between two specific substrings. This is very useful in many applications.

In [18]:
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
atpos = data.find('@')
print(atpos)
sppos = data.find(' ',atpos)
print(sppos)
host = data[atpos+1 : sppos]
print(host)

.


SyntaxError: invalid syntax (3920448093.py, line 9)

In [21]:
data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
atpos = data.find('@')
print(atpos)
sppos = data.find(' ',atpos+11)
print(sppos)
host = data[atpos+1 : sppos]
print(host)
.

SyntaxError: invalid syntax (2806002878.py, line 8)

## 2.2) Regular Expressions

For a full list of functionalities go to https://docs.python.org/3/library/re.html.
For an explanation of some functionalities go to https://docs.python.org/3/howto/regex.html#regex-howto

We will do some examples now so you get a feel.

In [None]:
#remember
string1="What is the answer to the ultimate question of life, the universe, and everything?"

In [22]:
import re

### re.search()
re.search() is a function in Python's re module that searches for a specified pattern in a string. It returns a Match object if the pattern is found, and None if the pattern is not found.
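
A minimal sketch of the two possible outcomes, using a made-up sample string (the names `sample`, `hit` and `miss` are only for illustration). Checking for None before using the result avoids an AttributeError when nothing matches.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "Order 66 was issued in 2005"

hit = re.search(r'\d+', sample)      # Match object for the first run of digits
miss = re.search(r'[xyz]', sample)   # None, because no x, y or z occurs

if hit is not None:
    # group() is the matched text, start() and end() are its positions
    print(hit.group(), hit.start(), hit.end())   # 66 6 8
print(miss)                                      # None
```

Note that re.search() stops at the first match, so 2005 is not reported here.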

In [23]:
#a character set [] matches any single character listed inside it
m = re.search('[oqr]', string1)

In [25]:
m
.

SyntaxError: invalid syntax (3428163214.py, line 2)

In [None]:
m[0]
.

SyntaxError: invalid syntax (<ipython-input-61-b8636fed89d0>, line 2)

### findall()
re.findall() is a function in Python's re module that returns a list of all non-overlapping matches of a pattern in a string. It searches the string for all matches of the regular expression and returns them as a list. If the pattern is not found, it returns an empty list.
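
The sketch below illustrates the three properties mentioned above on made-up strings: a list of matches, an empty list when nothing is found, and what "non-overlapping" means in practice.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "red green blue"

vowel_runs = re.findall(r'[aeiou]+', sample)   # every run of vowels
no_digits = re.findall(r'\d', sample)          # nothing to find
overlap = re.findall(r'aba', 'ababa')          # matches cannot share characters

print(vowel_runs)   # ['e', 'ee', 'ue']
print(no_digits)    # []
print(overlap)      # ['aba']
```

In the last call the second possible "aba" starts inside the first match, so it is skipped.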

In [None]:
m = re.findall('[wWo]', string1)

In [None]:
m
.

In [None]:
m = re.findall('[aeiou]', string1)
.

In [None]:
m
.

In [None]:
#ranges - [a-g] is shorthand for [abcdefg]
m = re.findall('[a-g]', string1)

In [None]:
m
.

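Typical ranges are:

[a-z]: all lowercase letters

[A-Z]: all capital letters

[0-9]: all digits


Inside square brackets, a leading ^ means "not": [^a-z] matches any character that is not a lowercase letter.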



In [None]:
m = re.findall('[^a-zA-Z ]', string1)

In [26]:
#note the space at the end of the character set above
#what will be in here?
m
.

SyntaxError: invalid syntax (3421970333.py, line 4)

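The short way of writing [a-zA-Z0-9_] is "\w" and the short way of writing [^a-zA-Z0-9_] is "\W". In Python 3, \w also matches letters and digits from other alphabets unless the re.ASCII flag is set.

\s is whitespace and \S stands for non-whitespace.

\d are digits and \D are non-digits.

\b marks a word boundary: the position between a word character ([a-zA-Z0-9_]) and a non-word character, or the start or end of the string next to a word character. It's important to note that \b is a zero-width assertion, so it doesn't consume any characters in the string; it just asserts whether a position is a word boundary or not.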
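
Here is a short sketch of these shorthand classes on a made-up string. It also shows that in Python \w covers digits and the underscore, which matters for where \b sees a boundary.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "Room_7 costs 120 EUR!"

word_chars = re.findall(r'\w+', sample)        # letters, digits and underscore
non_space = re.findall(r'\S+', sample)         # everything except whitespace
digits = re.findall(r'\d+', sample)            # every run of digits
whole_numbers = re.findall(r'\b\d+\b', sample) # digits with a boundary on both sides

print(word_chars)      # ['Room_7', 'costs', '120', 'EUR']
print(non_space)       # ['Room_7', 'costs', '120', 'EUR!']
print(digits)          # ['7', '120']
print(whole_numbers)   # ['120']
```

The 7 in Room_7 is dropped by the last pattern because the underscore before it is a word character, so there is no word boundary between them.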

In [None]:
m = re.findall(r'[^ \w]', string1)

In [27]:
m
.

SyntaxError: invalid syntax (3428163214.py, line 2)

In [None]:
m = re.findall('is|or|and|the', string1)

In [None]:
m
.


Now it gets crazy. Key operators are ?, *, + and .

- The ? makes the previous character optional (it matches 0 or 1 occurrences).
- The * matches 0 or more repetitions of the previous character.
- The + matches 1 or more repetitions of the previous character.
- The . matches any single character except a newline (beg.n matches begin, begun...)
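
A small sketch of the four operators side by side. The sample string is made up so that each operator keeps or drops different words.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "color colour colouur ct cat caat"

optional_u = re.findall(r'colou?r', sample)   # zero or one 'u'
star = re.findall(r'ca*t', sample)            # zero or more 'a'
plus = re.findall(r'ca+t', sample)            # one or more 'a'
any_char = re.findall(r'c.t', sample)         # exactly one character between c and t

print(optional_u)   # ['color', 'colour']
print(star)         # ['ct', 'cat', 'caat']
print(plus)         # ['cat', 'caat']
print(any_char)     # ['cat']
```

Try changing the quantifiers and guess the result before running the cell.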

(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence (for example, \1 refers to the first group).
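
As a rough illustration of groups and the \number sequence, the sketch below looks for accidentally doubled words. The sample sentence and variable names are made up for this example.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "This is is a test of the the parser"

# (\w+) captures a word, \1 requires the same word again after a space
doubled = re.search(r'\b(\w+) \1\b', sample)
print(doubled.group())    # is is   (the whole match)
print(doubled.group(1))   # is      (only the first group)

# With one group in the pattern, findall returns the group contents
all_doubled = re.findall(r'\b(\w+) \1\b', sample)
print(all_doubled)        # ['is', 'the']
```

The \b at the start stops the pattern from matching the "is" inside "This".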

(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in 'abcdef', since the lookbehind will back up 3 characters and check if the contained pattern matches.
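
A minimal sketch of a positive lookbehind on a made-up string: the dollar sign has to be there, but it is not part of what gets returned.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "Price: $45, discount: $5, items: 3"

# Digits that are directly preceded by a dollar sign
amounts = re.findall(r'(?<=\$)\d+', sample)
first_amount = re.search(r'(?<=\$)\d+', sample)

print(amounts)                # ['45', '5']
print(first_amount.group())   # 45
```

The 3 at the end is ignored because it is preceded by a space rather than a dollar sign.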

In regular expressions, \b matches a word boundary. A word boundary is a position in a string where a word character (\w) sits on one side and a non-word character (\W) or the start or end of the string sits on the other. For example, the positions just before and just after "cat" in "The cat sat" are word boundaries, because "cat" is surrounded by spaces.

In [None]:
string="The cat sat on the mat."
print(re.findall(r"\bcat\b", string))
.

In [None]:
string="The caterpillar crawled across the catwalk."
print(re.findall(r"\bcat\b", string))
.


SyntaxError: invalid syntax (<ipython-input-4-3b043eb7e99d>, line 3)

In [None]:
emailstring='From: "Mr. Benny Sings" <bigbrowneyes@spinfinder.com>'

In [None]:
print(re.findall(r'".*"', emailstring))
.

In [None]:
print(re.findall(r"\w\S*@.*\w", emailstring))
.

In [None]:
print(re.findall(r'\w+', 'spam-egg, decision-making, vice-vice-president'))
.

In [None]:
print(re.findall(r'(?<=-)\w+', 'spam-egg, decision-making, vice-vice-president'))
.

The following is a really cool application. Note how two pairs of parentheses are used to make two groups instead of one like up until now.

In [None]:
print(re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10'))
.

In [None]:
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'\d+'
L = re.findall(pattern, s)
print(L)
.

In [None]:
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?\d*\.?\d+'
L = re.findall(pattern, s)
print(L)
.

# 3) Exercises

These exercises are ordered by difficulty. The first exercise is a relatively small variation of the code provided above. The last two exercises require substantial coding effort.

### Exercise 1: Match all the words that begin with a vowel and end with a consonant in a given string.
Regex function to use is re.findall()

In [39]:
string = "The quick brown fox jumps schnoerkel over the lazy dog."

In [40]:
"solution"

'solution'

### Exercise 2: Match all the words that have three consonants in a row.
Regex function to use is re.findall()

In [41]:
string = "The quick brown fox jumps schnoerkel over the lazy dog."

In [42]:
"solution"

'solution'

### Exercise 3: Extract all the email addresses from a given text.
Regex function to use: findall()

In [43]:
string="""
Hello,

My email is john.doe@gmail.com and my friend's email is jane.doe@yahoo.com. We both like to communicate through email.

Best regards,
John
"""

In [44]:
"solution"

'solution'

### Exercise 4: Replace the emails by the string "[email]" in the same text.
Regex function to use is re.sub()

In [45]:
"solution"

'solution'

### Exercise 5: Extract Authors and Years from Text

Write a regular expression with two groups () that extracts authors and years from the following bibitems.

In [46]:
bibstring="""Croicu, Mihai and Ralph Sundberg, 2016, “UCDP GED Codebook version 5.0”, Department of Peace and Conflict Research, Uppsala University.
Depetris-Chauvin, Emilio, Ruben Durante, and Filipe Campante. 2020. "Building Nations through Shared Experiences: Evidence from African Football." American Economic Review, 110 (5): 1572-1602.
Esteban, Joan Maria, Laura Mayoral and Debraj Ray (2012) Ethnicity and Conflict: An Empirical Study. American Economic Review. 102(4): 1310–1342.
Fearon, J. D. (1995). Rationalist explanations for war. International organization, 49(3), 379-414.
Gates, S., Graham, B. A., Lupu, Y., Strand, H., & Strøm, K. W. (2016). Power sharing, protection, and peace. The Journal of Politics, 78(2), 512-526.
Mueller, Hannes (2016). Growth and Violence: Argument for a Per Capita Measure of Civil War. Economica, 83 (331), 473-497."""

print(bibstring)

Croicu, Mihai and Ralph Sundberg, 2016, “UCDP GED Codebook version 5.0”, Department of Peace and Conflict Research, Uppsala University.
Depetris-Chauvin, Emilio, Ruben Durante, and Filipe Campante. 2020. "Building Nations through Shared Experiences: Evidence from African Football." American Economic Review, 110 (5): 1572-1602.
Esteban, Joan Maria, Laura Mayoral and Debraj Ray (2012) Ethnicity and Conflict: An Empirical Study. American Economic Review. 102(4): 1310–1342.
Fearon, J. D. (1995). Rationalist explanations for war. International organization, 49(3), 379-414.
Gates, S., Graham, B. A., Lupu, Y., Strand, H., & Strøm, K. W. (2016). Power sharing, protection, and peace. The Journal of Politics, 78(2), 512-526.
Mueller, Hannes (2016). Growth and Violence: Argument for a Per Capita Measure of Civil War. Economica, 83 (331), 473-497.


In [47]:
"solution"

'solution'

### Exercise 6: Text extraction

Text downloaded from LexisNexis often comes in a string format with a header indicating the source of the article and the download date at the end of the article. Take the example below and get rid of the header. Then save the headline, the text body and the date as separate variables and print them out.



In [48]:
article = """
New York Times

Civil Rights Movement: A Look Back

Fifty years ago, the civil rights movement was in full swing. Led by iconic figures such as Martin Luther King Jr., Rosa Parks, and Malcolm X, the movement worked to secure equal rights for African Americans in the United States.

Despite significant progress, the fight for civil rights is far from over. Black Americans continue to face discrimination and inequality in numerous aspects of life, including education, employment, and the criminal justice system.

As we reflect on the achievements of the civil rights movement, it is important to remember the struggles that preceded them and the work that still needs to be done. Let us honor the legacy of those who fought for justice by continuing to fight for a more equal society.

Download date: February 22, 2010
"""

In [49]:
"solution"

'solution'