# Seminar Notebook 1.1: Strings in Python

**LSE MY459: Computational Text Analysis and Large Language Models** (WT 2026)

**Ryan Hübert**

This notebook covers working with strings in Python.

## String objects in Python

In Python, text is stored in `str` (string) objects. For example, we can create an object called `some_text` that contains a string with the full name of this course.

In [56]:
some_text = "Computational Text Analysis and Large Language Models"
some_text

'Computational Text Analysis and Large Language Models'

You can check that this object is now in Python's global scope.

In [57]:
print("some_text" in globals()) # check it's in global scope
globals()["some_text"]          # look at the object in scope

True


'Computational Text Analysis and Large Language Models'

In Python, `str` objects are stored as sequences of Unicode code points. You can read more about this here: <https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str>. Most characters you type correspond to a single code point, so the length of a Python `str` object usually matches the number of characters in the string. (Strictly speaking, `len()` counts code points, and some characters, such as certain accented letters and emoji, can be made up of more than one.) Keep in mind: every letter, number, space, line break, symbol and piece of punctuation you write is considered a "character" in a string.

In [58]:
len("LSE MY459.")

10
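As a small illustration of the link between code points and length, the sketch below uses the standard-library `unicodedata` module to build the same visible character, "ü", in two different ways. The variable names are made up for this example.

```python
import unicodedata

# "ü" as one precomposed code point (U+00FC)
composed = unicodedata.normalize("NFC", "ü")
# "ü" as "u" followed by a combining diaeresis (U+0308)
decomposed = unicodedata.normalize("NFD", "ü")

print(len(composed))           # 1
print(len(decomposed))         # 2
print(composed == decomposed)  # False, even though they look identical
```

If text from different sources refuses to match when it looks like it should, normalising both strings to the same form is one thing worth trying.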

You can extract substrings from a string using Python's standard indexing. For example, to get the first 5 characters from the `some_text` string, you can use:

In [59]:
some_text[0:5]

'Compu'
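A few more slicing patterns, as a quick sketch. Indexing starts at 0, the end position of a slice is excluded, and negative positions count back from the end of the string.

```python
some_text = "Computational Text Analysis and Large Language Models"

print(some_text[0])      # C (a single character)
print(some_text[:13])    # Computational (omit the start to begin at position 0)
print(some_text[14:18])  # Text (positions 14, 15, 16 and 17)
print(some_text[-6:])    # Models (the last six characters)
```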

Finally, the `in` operator gives a simple and fast way to search for a string inside another string, returning a boolean. This is similar to how `in` works for other Python object types, such as checking if an element is in a list, etc. Consider the following two examples, which demonstrate how _precise_ you need to be when working with strings (or any object!) in Python:

In [60]:
'and ' in some_text

True

In [61]:
'and  ' in some_text

False
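The same precision applies to upper and lower case. The sketch below shows one common workaround (converting to lower case before searching) and the built-in `str.find()` method, which reports _where_ a match starts rather than just whether it exists.

```python
some_text = "Computational Text Analysis and Large Language Models"

print("text" in some_text)          # False: `in` is case sensitive
print("text" in some_text.lower())  # True: search a lower-cased copy instead
print(some_text.find("Text"))       # 14: position where the match starts
print(some_text.find("text"))       # -1: signals that nothing was found
```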

## String manipulation with Python

Python `str` objects have many available methods. You can see a list of them at <https://docs.python.org/3/library/stdtypes.html#string-methods>. We will refer to these methods as "built in" `str` methods to distinguish them from other methods and functions available in specialised modules for working with text, such as `re` and `nltk`, which we'll cover later.

The next code chunks show you a variety of ways to join strings together.

In [62]:
# Strings can be joined together ("pasted") using the `+` operator:
my_string = "LSE MY459: " + " " + "Computational Text Analysis "
my_string

'LSE MY459:  Computational Text Analysis '

In [63]:
# You can extend a string with the += operator (this binds my_string to a new, longer string)
# Read more here: https://docs.python.org/3/library/operator.html#in-place-operators
my_string += "and Large Language Models "
my_string

'LSE MY459:  Computational Text Analysis and Large Language Models '

In [64]:
my_string = "\n".join([my_string, "Winter Term 2025  "])
my_string

'LSE MY459:  Computational Text Analysis and Large Language Models \nWinter Term 2025  '

In the chunks above, the strings were displayed after each assignment step by including `my_string` on its own line. We will refer to this as "echoing" the value of the object `my_string`. When you echo the value of any object, the value is displayed in the code output. In the examples above, we know that the objects are `str` objects because they are enclosed in single quotes. We can also verify this more formally by using the `type()` function:

In [65]:
type(my_string)

str

However, there are many situations where you may want to _print_ a Python object by using the `print()` function. For `str` objects, the `print()` function will generate output of _formatted_ text. For example, if we print the `my_string` object, it no longer displays the line break character `\n`, and instead, it actually creates a line break. It also does not show the enclosing quotes. 

In [66]:
print(my_string)

LSE MY459:  Computational Text Analysis and Large Language Models 
Winter Term 2025  
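If you want the "echoed" form of a string from inside a `print()` call (for example, within a loop or a function), the built-in `repr()` function gives you it. The short string below is a made-up stand-in for `my_string`.

```python
short_string = "LSE MY459\nWinter Term 2026"  # hypothetical shortened example

print(short_string)        # formatted text: the line break is rendered
print(repr(short_string))  # echoed form: quotes and \n are shown
```

This is handy for spotting stray spaces or line breaks that formatted output hides.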


Python also has a module called `pprint`, which stands for "pretty print" and allows you more options for formatting printed objects. You can read the documentation here: <https://docs.python.org/3/library/pprint.html>. You will sometimes find this useful when working with awkwardly formatted text, or very long text.

In [67]:
from pprint import pprint
pprint(my_string, width = 30) 

('LSE MY459:  Computational '
 'Text Analysis and Large '
 'Language Models \n'
 'Winter Term 2025  ')


Notice that there is a typo in the string above. We are currently in 2026, not 2025! We can use the `str.replace()` method to replace the incorrect text with corrected text.

In [68]:
my_string.replace("2025", "2026")

'LSE MY459:  Computational Text Analysis and Large Language Models \nWinter Term 2026  '

Unfortunately, when I applied the `replace()` method to `my_string`, I forgot to "record" the change in the object in the global scope. You can see this by printing the object:

In [69]:
print(my_string) # typo is still there!

LSE MY459:  Computational Text Analysis and Large Language Models 
Winter Term 2025  


Keep in mind that if you want to make a change to an object in Python's scope, you need to make sure to assign with `=`. For example:

In [70]:
my_string = my_string.replace("2025", "2026")
my_string # The change is now made in the object in scope!

'LSE MY459:  Computational Text Analysis and Large Language Models \nWinter Term 2026  '
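The reason is that Python strings are immutable: methods like `replace()` never modify the original string, they return a new one. A minimal sketch with made-up variable names:

```python
original = "Winter Term 2025"
corrected = original.replace("2025", "2026")  # returns a new string

print(original)   # Winter Term 2025 (unchanged)
print(corrected)  # Winter Term 2026
```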

Also notice that there is excess spacing at the end of the string. We can remove any white space at the beginning or end of the string by using the `str.strip()` method.

In [71]:
my_string = my_string.strip()
my_string

'LSE MY459:  Computational Text Analysis and Large Language Models \nWinter Term 2026'
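Notice that `strip()` only touches the two ends of the string: the space just before `\n` is still there. The sketch below shows the one-sided variants `lstrip()` and `rstrip()`, and one possible way to trim every line separately. The `messy` string is a made-up example.

```python
messy = "  LSE MY459:  Computational Text Analysis \nWinter Term 2026  "

print(repr(messy.strip()))   # whitespace removed from both ends
print(repr(messy.lstrip()))  # only leading whitespace removed
print(repr(messy.rstrip()))  # only trailing whitespace removed

# Trim each line on its own, so the space before "\n" is removed too
cleaned = "\n".join(line.strip() for line in messy.splitlines())
print(repr(cleaned))
```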

## The `re` module

While Python has a lot of built-in methods for strings, they are somewhat limited. Of particular note is that Python's built-in string methods do not allow for regular expressions (or "regex"). You can learn more about regular expressions [here](http://www.zytrax.com/tech/web/regex.htm). Regular expressions let us develop bespoke rules for both matching strings and extracting elements from them. We'll use Python's built-in `re` module for regex operations.

Recall that the `in` operator allows you to check whether a string is contained within another string. However, this is not very flexible: only exact matches are recognised. With regex, you can search a string for a more flexible pattern. For example, consider the sentence: "Dr Hübert teaches MY472 and MY459."

In [72]:
hubert_string = "Dr Hübert teaches MY472 and  MY459."
hubert_string

'Dr Hübert teaches MY472 and  MY459.'

What if I want to see if there is any mention of an MY course code in this string? I could iterate over every possible code and perform an `in` search (as above). First, I need to prepare the list of course codes by manually copying and pasting them one by one from <https://www.lse.ac.uk/resources/calendar2025-2026/courseGuides/graduate.htm>

In [73]:
my_courses = ["MY400", "MY401", "MY405", "MY421A", "MY421W", "MY423", 
              "MY425", "MY426", "MY428", "MY451A", "MY451W", "MY452A", 
              "MY452W", "MY455", "MY456", "MY457", "MY459", "MY461", 
              "MY464", "MY465", "MY470", "MY472", "MY474", "MY475", 
              "MY476", "MY498", "MY499", "MY4IR"]

Now, we can iterate over this list and test if any one of them appears in the text. The `any()` function evaluates to `True` if any match is found.

In [74]:
if any(x in hubert_string for x in my_courses):
    print("Match found!")
else:
    print("Match not found!")

Match found!
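If you also want to know _which_ codes were found, a list comprehension is one option. The sketch below uses a shortened version of the course list to keep it brief.

```python
hubert_string = "Dr Hübert teaches MY472 and  MY459."
# Shortened list, for illustration only
my_courses = ["MY457", "MY459", "MY461", "MY470", "MY472"]

found = [code for code in my_courses if code in hubert_string]
print(found)  # ['MY459', 'MY472']
```

Note that the results follow the order of the list, not the order in which the codes appear in the text.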


This is quite cumbersome! I had to _create_ a list of MY course codes in order to see if any of them appear in the string. However, I know all course codes have a similar format: "MY", then one or more digits (usually three), and then (possibly) some capital letters. We can construct a regex that looks for this kind of pattern. In Python, a regex should be written as a string with an `r` before the opening quote: `r''` or `r""`. This is known as [raw string notation](https://docs.python.org/3/library/re.html#raw-string-notation). (Technically you do not _have_ to use this, but you should; read the docs to see the problems it avoids.)

In [75]:
my_regex = r'(MY)(\d+)([A-Z]*)'
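Before using a pattern, it can help to try it on a few strings you already know the answer for. The sketch below does this with `re.fullmatch()`, which succeeds only if the _entire_ string fits the pattern. The test strings `"XX123"` and `"my459"` are made-up non-matches. The last three lines illustrate one way that leaving out the `r` prefix can go wrong.

```python
import re

my_regex = r'(MY)(\d+)([A-Z]*)'

# Try the pattern on known codes and on two made-up non-matches
for code in ["MY459", "MY421A", "MY4IR", "XX123", "my459"]:
    print(code, bool(re.fullmatch(my_regex, code)))

# Without the r prefix, Python reads "\b" as a backspace character,
# not as the regex symbol for a word boundary
print(len("\b"), len(r"\b"))         # 1 2
print(re.findall("\bMY", "MY459"))   # []
print(re.findall(r"\bMY", "MY459"))  # ['MY']
```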

The `re` module has a function `re.search()` that allows us to search for this pattern.

In [76]:
import re

if re.search(my_regex, hubert_string):
    print("Match found!")
else:
    print("Match not found!")

Match found!


The nice thing about `re.search()` is that it returns a match object, so you can also _extract_ the matches using its `group()` method. You can extract parts of the matched string based on the [capturing groups](https://docs.python.org/3/howto/regex.html#grouping) you defined in the regex.

In [77]:
print(re.search(my_regex, hubert_string).group(0)) # Get whole match
print(re.search(my_regex, hubert_string).group(1)) # Get match in first capturing group
print(re.search(my_regex, hubert_string).group(2)) # Get match in second capturing group

MY472
MY
472
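In practice, you would usually run the search once and store the result. This also lets you handle the case where nothing matches: `re.search()` then returns `None`, and calling `.group()` on `None` raises an error. A minimal sketch (the `"No codes here"` string is a made-up example):

```python
import re

my_regex = r'(MY)(\d+)([A-Z]*)'
hubert_string = "Dr Hübert teaches MY472 and  MY459."

match = re.search(my_regex, hubert_string)  # a match object, or None
if match is not None:
    print(match.group(0))  # MY472 (whole match)
    print(match.groups())  # ('MY', '472', '') (all capturing groups)
    # start() and end() give the position of the match in the string
    print(hubert_string[match.start():match.end()])  # MY472

no_match = re.search(my_regex, "No codes here")
print(no_match)  # None
```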


There are many other functions in the `re` module. The most commonly used ones are as follows:

In [78]:
# Create a list of all regex matches in a string (note how capturing groups affect this)
print(re.findall(my_regex, hubert_string))

[('MY', '472', ''), ('MY', '459', '')]
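If you want the whole matches rather than tuples of groups, two options are sketched below: drop the capturing groups from the pattern, or use `re.finditer()`, which yields match objects.

```python
import re

hubert_string = "Dr Hübert teaches MY472 and  MY459."

# No capturing groups: findall() returns the whole matches
print(re.findall(r'MY\d+[A-Z]*', hubert_string))  # ['MY472', 'MY459']

# finditer() yields match objects, so whole matches and groups are both available
for m in re.finditer(r'(MY)(\d+)([A-Z]*)', hubert_string):
    print(m.group(0), m.group(2))
```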


In [79]:
# Split a string into a list of strings based on a regex pattern
print(re.split(r" +", hubert_string))
# Notice difference with built-in string method:
print(hubert_string.split(" "))

['Dr', 'Hübert', 'teaches', 'MY472', 'and', 'MY459.']
['Dr', 'Hübert', 'teaches', 'MY472', 'and', '', 'MY459.']
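For this particular task, note that the built-in method behaves differently when called with no argument: `split()` then splits on any run of whitespace (spaces, tabs, line breaks), so no empty strings appear.

```python
hubert_string = "Dr Hübert teaches MY472 and  MY459."

print(hubert_string.split())
# ['Dr', 'Hübert', 'teaches', 'MY472', 'and', 'MY459.']
```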


In [80]:
# Replace matched text in a string with other text
print(re.sub(r" +", " ", hubert_string)) # replace multiple spaces with one
print(re.sub(r"ü", "y", hubert_string))  # replace accented characters

Dr Hübert teaches MY472 and MY459.
Dr Hybert teaches MY472 and  MY459.
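The replacement text in `re.sub()` can also refer back to capturing groups, or be computed by a function. A brief sketch of both; the hyphenated and lower-cased formats are arbitrary choices for illustration.

```python
import re

hubert_string = "Dr Hübert teaches MY472 and  MY459."

# \1 and \2 refer to the first and second capturing groups
print(re.sub(r"(MY)(\d+)", r"\1-\2", hubert_string))
# Dr Hübert teaches MY-472 and  MY-459.

# A function receives each match object and returns the replacement text
print(re.sub(r"MY\d+", lambda m: m.group(0).lower(), hubert_string))
# Dr Hübert teaches my472 and  my459.
```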


(Side note: why replace "ü" with "y" in the code above? Ryan's family is from Norway, and his last name is a German "loan word." In Norwegian, "ü" can be written---and pronounced---as "y", see <https://en.wikipedia.org/wiki/%C3%9C#Letter_%C3%9C>.)

Let's look at another real-world example from Ryan's research. Below is a text string from a U.S. federal district court case record.

In [81]:
court_text = "ORDER OF REASSIGNMENT to District Judge Aileen M Cannon for all further proceedings, Judge Cecilia M. Altonaga no longer assigned to case. Signed by Judge Cecilia M. Altonaga on 11/23/2020. See attached document for full details. (yar) (Entered: 11/23/2020)"

In [82]:
# Find all the mentions of "judge"
re.findall("judge", court_text)

[]

In [83]:
# Note that regex searches are case sensitive
re.findall("Judge", court_text)

['Judge', 'Judge', 'Judge']
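If you want a search to ignore case, you can pass the `re.IGNORECASE` flag. The sketch below uses a shortened, made-up variant of the court text.

```python
import re

# Shortened text, for illustration only
short_text = "ORDER OF REASSIGNMENT to District Judge Aileen M Cannon. Signed by Judge Cecilia M. Altonaga."

print(re.findall("judge", short_text))                       # []
print(re.findall("judge", short_text, flags=re.IGNORECASE))  # ['Judge', 'Judge']
```

Note that the returned matches keep the capitalisation found in the text, not the capitalisation of the pattern.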

In [84]:
# Find all parentheses
# Parentheses are special characters in regex: escape them with a backslash or put them in brackets
re.findall(r"[()]", court_text)

['(', ')', '(', ')']
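The other approach, escaping with a backslash, is sketched below on a shortened piece of the court text. The `re.escape()` function is useful when you want to search for a literal string without working out which characters need escaping.

```python
import re

snippet = "(yar) (Entered: 11/23/2020)"  # shortened text, for illustration

# \( matches a literal opening parenthesis
print(re.findall(r"\(", snippet))         # ['(', '(']

# Literal parentheses around a capturing group: get the text inside each pair
print(re.findall(r"\((.*?)\)", snippet))  # ['yar', 'Entered: 11/23/2020']

# re.escape() adds the backslashes for you
print(re.escape("(yar)"))                 # \(yar\)
```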

In [85]:
# Extract the "Entered" date
# \( and \) = literal parentheses (escaped)
# " *" = zero or more spaces
# [\d/]+ = one or more digits or slashes; the (...) around it captures the date
re.findall(r"\(Entered: *([\d/]+)\)", court_text)

['11/23/2020']
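If you need the parts of the date separately, one possibility is to use named groups, written `(?P<name>...)`. This is only a sketch: the pattern assumes dates are written as month/day/year with a four-digit year, as in this record, and the group names are arbitrary.

```python
import re

snippet = "(yar) (Entered: 11/23/2020)"  # shortened text, for illustration

# \d{1,2} = one or two digits; \d{4} = exactly four digits
date_regex = r"\(Entered: *(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})\)"

m = re.search(date_regex, snippet)
if m is not None:
    print(m.group("year"))  # 2020
    print(m.groupdict())    # {'month': '11', 'day': '23', 'year': '2020'}
```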

In [86]:
# Extract all judge names
# Pattern: "Judge" followed by first name, middle initial (with optional full stop), last name
re.findall(r"Judge [A-Za-z]+ [A-Z][.]? [A-Za-z]+", court_text)

['Judge Aileen M Cannon',
 'Judge Cecilia M. Altonaga',
 'Judge Cecilia M. Altonaga']
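The pattern above only finds names that include a middle initial. One way to make the initial itself optional is to wrap it in a non-capturing group followed by `?`. The sketch below compares the two patterns on a made-up sentence ("Jane Doe" is a placeholder name).

```python
import re

# Made-up sentence, for illustration only
text = "Heard by Judge Jane Doe, then Judge Cecilia M. Altonaga."

strict = r"Judge [A-Za-z]+ [A-Z][.]? [A-Za-z]+"
# (?: [A-Z][.]?)? = optionally, a space plus an initial with an optional full stop
flexible = r"Judge [A-Za-z]+(?: [A-Z][.]?)? [A-Za-z]+"

print(re.findall(strict, text))    # ['Judge Cecilia M. Altonaga']
print(re.findall(flexible, text))  # ['Judge Jane Doe', 'Judge Cecilia M. Altonaga']
```

Real names are messier than this (hyphens, apostrophes, suffixes), so treat any such pattern as a starting point to check against your data.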

### Keep in mind

We will expect you to be broadly familiar with how to use regular expressions in Python. At a minimum, you should read [Python's official HOWTO](https://docs.python.org/3/howto/regex.html), and look over the [`re` documentation](https://docs.python.org/3/library/re.html#module-contents). You should also practice using regex to do various string manipulations. You can also test out regex patterns at <https://regex101.com/> (switch to Python!), or get an LLM to help you. (You will need to know some regex for the exam.)