## Learning goals for this week:
- Explain how strings are related to lists
- Do common operations (e.g., substring, indexing) on strings
- Recognize potential application opportunities for string methods and functions (e.g., upper/lower, isnumeric)
- Appropriately apply string methods and functions
- Construct strings from variable values using string formatting




## What are strings and why should we care about them?




### Strings are everywhere

We need to learn to work with strings because a lot of the data we want to work with lives in the world as strings, often with text and numbers mixed together:
- Email addresses
- Webpage URLs
- Names
- Documents, words
- Sales records
- Etc.

Strings are the ultimate "lingua franca" between systems
- Data is often passed in "serialized" forms (e.g., JSON, JavaScript Object Notation, which travels between systems as a **string**)
- We assume strings are coming in, and we parse them appropriately. This can include data (numbers/records), as we see in one of the Projects for this module!
- This also includes the "human system" (i.e., the user)!

### Strings are lists of characters

But what *is* a string? It's fundamentally a sequence of letters/characters.

And that's exactly what a string is in Python too: it's a sequence of characters, much like (though not *exactly* like) a `list`.

In [1]:
s = "banana"
for index, char in enumerate(s):
    print(index, char)

0 b
1 a
2 n
3 a
4 n
5 a


In [2]:
for char in s:
    print(char)

b
a
n
a
n
a


In [None]:
a = "banana"
b = "snd4t7"
for index, char in enumerate(a):
    print(char)
    # and corresponding character position in b
    print(b[index])

b
s
a
n
n
d
a
4
n
t
a
7


In [None]:
sentence = "she sells seashells by the seashore, except when she doesn't want to sell seashells"
for index, char in enumerate(sentence):
    print(index, char)

0 s
1 h
2 e
3  
4 s
5 e
6 l
7 l
8 s
9  
10 s
11 e
12 a
13 s
14 h
15 e
16 l
17 l
18 s
19  
20 b
21 y
22  
23 t
24 h
25 e
26  
27 s
28 e
29 a
30 s
31 h
32 o
33 r
34 e
35 ,
36  
37 e
38 x
39 c
40 e
41 p
42 t
43  
44 w
45 h
46 e
47 n
48  
49 s
50 h
51 e
52  
53 d
54 o
55 e
56 s
57 n
58 '
59 t
60  
61 w
62 a
63 n
64 t
65  
66 t
67 o
68  
69 s
70 e
71 l
72 l
73  
74 s
75 e
76 a
77 s
78 h
79 e
80 l
81 l
82 s


In [5]:
a = ""
b = " \t\n"
print("a is " +a)
print("b is "+b)
a == b
print(len(a))
print(len(b))

a is 
b is  	

0
3


### Characters don't have to be visible/letters!





Notice that even the "blank space" is a character! A string that includes an empty space character is **NOT** the same as an empty string (i.e., a list of characters of length zero), even though they print out the same. This distinction is very important to remember as you work with real world data.

In [None]:
a = "" # a blank/empty string
b = " " # a string with one blank space *character*
print("Printing out the value of a")
print(a)
print("Printing out the value of b")
print(b)
print(len(a), len(b))
print(a == b)

Printing out the value of a

Printing out the value of b
 
0 1
False


In [None]:
a = "James"
b = " James"
c = "James "
print(a == b)
print(b == c)
print(a == c)

False
False
False


Other kinds of characters that don't look like "letters": tabs and newlines

In [None]:
# tab is \t
s = "a\ttab\nhello😂"
print(s)
for idx, char in enumerate(s):
  print(idx, char)

a	tab
hello😂
0 a
1 	
2 t
3 a
4 b
5 

6 h
7 e
8 l
9 l
10 o
11 😂


In [None]:
# new line is \n
s = "a\ntab"
print(s)
for idx, char in enumerate(s):
  print(idx, char)

a
tab
0 a
1 

2 t
3 a
4 b


Because strings are sequences, much like lists, most of the properties and functions that apply to lists also apply to strings (e.g., they can be sorted, have a length, and support checking if something is "in" them). There is one important exception: **strings are immutable**. You can never modify a string directly; you can only create a *new* string, which you must then assign to a variable (or reassign to the same variable) if you want to preserve that change. More on this when we talk about working with strings.
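To make the immutability point concrete, here is a minimal sketch comparing a list and a string. The variable names (`word`, `letters`, `new_word`) are just for illustration.

```python
word = "hello"
letters = list(word)  # a list of the same characters

# A list can be changed in place.
letters[0] = "j"

# The same kind of assignment on a string raises a TypeError.
try:
    word[0] = "j"
    assignment_worked = True
except TypeError:
    assignment_worked = False

# To "change" a string, build a new one and assign it to a variable.
new_word = "j" + word[1:]

print(letters)            # ['j', 'e', 'l', 'l', 'o']
print(assignment_worked)  # False
print(word)               # hello (unchanged)
print(new_word)           # jello
```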

In [7]:
a_string = "hello world Hi"
# has length
print(len(a_string))
# can be indexed
for idx in range(len(a_string)):
  char = a_string[idx]
  print(idx, char)
# get the first character
print(a_string[0])
# get the last character
print(a_string[-1])
# sort it!
print(sorted(a_string))

14
0 h
1 e
2 l
3 l
4 o
5  
6 w
7 o
8 r
9 l
10 d
11  
12 H
13 i
h
i
[' ', ' ', 'H', 'd', 'e', 'h', 'i', 'l', 'l', 'l', 'o', 'o', 'r', 'w']


In [9]:
# can be sliced
a_string[:3] #first three


'hel'

In [None]:
vowels = "aeiou"
s = "hello world"
for char in s:
    if char not in vowels:
        print(char)

h
l
l
 
w
r
l
d


In [11]:
course_code = "INST201"
"INST" in course_code

True

In [15]:
a_list = [ "h", "e", "l","l", "o"]
a_string = "hello"
a_list.sort()
print(a_list)
sorted(a_string)
print(a_string)

['e', 'h', 'l', 'l', 'o']
hello


This means that any time you modify a string, you must assign the result to a variable (or reassign the same variable) to preserve the change.

### Aside: string encoding


The previous observation about blank spaces illustrates a larger point: we deliberately say strings are sequences of *characters*, not letters. This is because strings can include numbers, as we've seen (think usernames like joelchan86, or your uids), but also all sorts of other characters, including various kinds of blank spaces --- like tabs, spaces, and newlines --- and even emoji! 

Check this resource for an overview and initial guide: https://realpython.com/python-encodings-guide/

This is something I want to show you to give you a better intuition for what strings *are*, but there is also an important practical implication: you need to be very careful to transform or normalize your strings when you want to sort or compare them. What's the same or different to your human eye will often *not* be the same or different to the computer's eye.

For example, "A" and "a" are encoded as different values. Thus, Python does *not* see them as the same "letter". Sometimes you'll even be reading in strings that are in a different encoding.
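One way to see this for yourself is the built-in `ord()` function, which returns the number Python stores for a single character. The sketch below uses a made-up list of names to show how this affects comparing and sorting.

```python
# ord() returns the integer code point for a single character.
upper_code = ord("A")
lower_code = ord("a")
print(upper_code, lower_code)  # 65 97

# Different code points mean the two strings are not equal.
print("A" == "a")  # False

# Sorting compares code points, so uppercase letters come before lowercase ones.
names = ["bob", "Alice", "carol", "Dave"]  # hypothetical example data
by_code_point = sorted(names)
print(by_code_point)  # ['Alice', 'Dave', 'bob', 'carol']

# Comparing lowercased versions gives the order a human would expect.
by_lowercase = sorted(names, key=str.lower)
print(by_lowercase)  # ['Alice', 'bob', 'carol', 'Dave']
```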

In [None]:
s1 = "James "
s2 = "James"
s1 == s2

False

## Working with strings: basics


Similar to lists, many basic operations with strings revolve around indexing and iteration.

### Getting parts of a string

Works similarly to lists.

In [None]:
s = "my name is inigo montoya, you killed my father, prepare to die!"
for i in range(len(s)):
    char = s[i]
    print(i, char)

0 m
1 y
2  
3 n
4 a
5 m
6 e
7  
8 i
9 s
10  
11 i
12 n
13 i
14 g
15 o
16  
17 m
18 o
19 n
20 t
21 o
22 y
23 a
24 ,
25  
26 y
27 o
28 u
29  
30 k
31 i
32 l
33 l
34 e
35 d
36  
37 m
38 y
39  
40 f
41 a
42 t
43 h
44 e
45 r
46 ,
47  
48 p
49 r
50 e
51 p
52 a
53 r
54 e
55  
56 t
57 o
58  
59 d
60 i
61 e
62 !


In [None]:
s = "my name is inigo montoya, you killed my father, prepare to die!"
for char in s:
    print(char)

m
y
 
n
a
m
e
 
i
s
 
i
n
i
g
o
 
m
o
n
t
o
y
a
,
 
y
o
u
 
k
i
l
l
e
d
 
m
y
 
f
a
t
h
e
r
,
 
p
r
e
p
a
r
e
 
t
o
 
d
i
e
!


In [None]:
s = "my name is inigo montoya, you killed my father, prepare to die!"
print(s[-1])
l = [1,4,5,6,7,]
print(l[-1])

!
7


Remember slicing? Here we can think about substrings. Super useful for truncation, or getting particular parts of strings when you know the pattern (e.g., the first four characters of a course code are always the department).

Remember: the index before the `:` indicates where you want to *start*, and the index after the `:` indicates where you want to stop *before*. So `[0:4]` will go from index `0` to index `3` (before index 4). Leaving out an index implicitly says "to the max": leaving out the start means "from index `0`", and leaving out the stop means "until the end".
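Here is a short sketch of those slicing rules applied to a course code. The variable names are only for illustration.

```python
code = "INST201"

explicit = code[0:4]    # indexes 0, 1, 2, 3
no_start = code[:4]     # start omitted: begin at index 0
no_stop = code[4:]      # stop omitted: go until the end
from_end = code[-3:]    # negative start: count back from the end
whole = code[:]         # both omitted: the whole string

print(explicit)  # INST
print(no_start)  # INST
print(no_stop)   # 201
print(from_end)  # 201
print(whole)     # INST201
```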

In [None]:
code = "INST201"
area = code[0:4]
print(area)

INST


Practice! How would you get the first number of the level (after the four-letter code)?

In [20]:
code = "INST201"
area = code[4:5]
print(area)

2


In [None]:
names = ["Joel", "Sarah", "John", "Michael", "Patrick", "Kacie"]
for name in names:
    # get the first initial
    initial = name[0]
    print(initial)

J
S
J
M
P
K


Practice! How would you get the first three letters of each name?

In [22]:
code[:-3]

'INST'

### Join strings

We've also already shown you concatenation.

In [None]:
l1 = [1, 2, 3]
l2 = [4, 5, 6]
print(l1 + l2)

[1, 2, 3, 4, 5, 6]


In [None]:
s1 = "Hello"
s2 = " World!"
print(s1 + s2)

Hello World!


Practice! How would you join `INST` and `201`?

### Check if character(s) is in string

You can also check whether some character (or sequence of characters) is in a string, using `in`.

In [None]:
message = "Hello, my name is Inigo Montoya"
# let's check if the message mentions my name!
print("inigo" in message)
# need to transform the message to lowercase to match
print("inigo" in message.lower())

False
True


In [None]:
l = ["INST201", "INST126", "INFM322", "CMSC126"]
for item in l:
    if "CMSC" in item:
        print(item)

CMSC126


In [None]:
l = ["INST201", "INST126", "INFM322", "CMSC126", "joelchan@umd.edu", "joelchan", ".edu", "sarah@umd.edu"]
for item in l:
    if ".edu" in item:
        print(item)

joelchan@umd.edu
.edu
sarah@umd.edu


## Working with strings: advanced

Similar to lists, there is a collection of built-in **string methods**, functions in Python that operate on strings: https://docs.python.org/3/library/stdtypes.html#string-methods

In [None]:
s = "hello"
dir(s)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

I'm not going to show you all of them, but I will talk through them and discuss some fairly common ones

No need to memorize them – just know:
- There are many methods that allow you to do things with strings – if you want to do something, first search for that method! It’s often way more efficient/bug-free than what you’ll write (even after you get good)
- Where to find the exact code for it, how to figure out how they work

More importantly, I want you to practice reading documentation and get a sense of how to use functions (code that other people have written that you can reuse): What are the parameters? What are the return values? What can you learn from examples? How do you learn to use a function appropriately in your own code?


In [23]:
def is_url(s):
    return "http" in s

In [24]:
s = "https://docs.google.com/document/d/1mzvN-qI0gvKF-oCSbRm_Ecgb_5pxqz66mTRENNTMwGI/edit"
is_url(s)

True

### Checking a string

In [25]:
# we want to do math
a = " 123"
b = "567"
# but first we want to make sure the strings are all numbers before we convert them
if a.isdigit() and b.isdigit():
    a = int(a)
    b = int(b)
    print(a*b)
else:
    print("One of the input strings contains non-digits!")

One of the input strings contains non-digits!
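The check above fails only because of the leading space in `" 123"`. A minimal sketch of one way to handle that, assuming stray whitespace is the only problem, is to call `.strip()` (a string method that removes leading and trailing whitespace) before checking. The names `a_clean`, `b_clean`, and `product` are just for illustration.

```python
a = " 123"
b = "567"

# Remove leading and trailing whitespace before checking.
a_clean = a.strip()
b_clean = b.strip()

if a_clean.isdigit() and b_clean.isdigit():
    product = int(a_clean) * int(b_clean)
    print(product)  # 69741
else:
    product = None
    print("One of the input strings contains non-digits!")
```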


In [None]:
l = ["INST201", "INST126", "INFM322", "CMSC126", "joelchan@umd.edu", "joelchan", ".edu", "sarah@umd.edu"]
# get all the strings that start with INST
for item in l:
    if item.startswith("INST"):
        print(item)

INST201
INST126


In [None]:
l = ["INST201", "INST126", "INFM322", "CMSC126", "joelchan@umd.edu", "joelchan", "CHDG101", "sarah@umd.edu"]
for item in l:
    if item.startswith("1", 4): # get the 100 level courses by checking if the part of the string that starts at position 4 starts with the character 1
        print(item)

INST126
CMSC126
CHDG101


### Changing a string

#### "Cleaning" / normalizing a string

Often we get data in string form, and we need to make sure it conforms to our expectations.

In [None]:
# need to turn into a number so I can do math with it
sales_record = "$1,000,000"

# with iteration
cleaned = "" # initialize clean string as a blank/empty string
# for each character in the sales record string
for char in sales_record:
    if char.isnumeric(): # if the character is numeric
        cleaned += char # grab it
print(cleaned)

1000000


In [None]:
# can use .replace() if you know in advance which characters you want to strip out
sales_record = "$1,000,000"
cleaned = sales_record.replace("$", "").replace(",", "")
print(cleaned)

1000000
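Note that `cleaned` is still a string at this point. A minimal sketch of the last step, converting it so we can actually do math with it, might look like this (the commission calculation is a made-up example):

```python
sales_record = "$1,000,000"
cleaned = sales_record.replace("$", "").replace(",", "")

# cleaned is still a string, so convert it before doing arithmetic.
amount = int(cleaned)

# Hypothetical follow-up calculation: a 10% commission on the sale.
commission = amount // 10

print(type(cleaned).__name__)  # str
print(amount + 1)              # 1000001
print(commission)              # 100000
```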


In [1]:
def normalize_string(s):
      return s.upper().strip() # convert the string to upper case and remove leading and trailing blank spaces

# need to make sure it's normalized and we remove all weird stuff
n = " Josh Lyman"
m = "JOSH LYMAN"
print(n)
print(m)
print(n == m)
n_normal = normalize_string(n)
m_normal = normalize_string(m)
print(n_normal)
print(m_normal)
print(n_normal == m_normal)


 Josh Lyman
JOSH LYMAN
False
JOSH LYMAN
JOSH LYMAN
True


#### "Parsing" a string (getting specific bits we want)

You can do this if you know there is some *separator* that you can rely on to divide the string into the "bits" you want.

Examples:
- Parse an email
- Parse a URL
- Parse a sentence into words!
- Parse a time stamp

These all use the `.split()` method.

In [None]:
email = "joelchan@umd.edu"
# we want only the domain and server
elements = email.split("@")
print(elements)
domainserver = elements[1] # domain server is the 2nd element in the split
print(domainserver)

['joelchan', 'umd.edu']
umd.edu


In [None]:
email = "joelchan@umd.edu"
# if we only want the domain (.edu), we can do a multiple split
split1 = email.split("@") # split the email by the @ separator
domainserver = split1[1] # grab the second item
split2 = domainserver.split(".") # split that second item by the . separator
domain = split2[1] # get the second item from that one
print(domain)

edu


In [None]:
email = "joelchan@umd.edu"
elements = email.split("joelchan")
print(elements)

['', '@umd.edu']


In [None]:
url = "www.ischool.umd.edu"
elements = url.split(".")
print(elements)
domain = elements[-1]
print(domain)

['www', 'ischool', 'umd', 'edu']
edu


In [None]:
timestamp = "13:30:31"
elements = timestamp.split(":")
print(elements)
# get the hour
hour = elements[0]
minute = elements[1]
seconds = elements[2]
print(hour)
print(minute)
print(seconds)

['13', '30', '31']
13
30
31


In [None]:
message = "She sells seashells by the sea shore, with sea in the wind, and sea in my shoes"
words = message.split(" ")
print(words)

['She', 'sells', 'seashells', 'by', 'the', 'sea', 'shore,', 'with', 'sea', 'in', 'the', 'wind,', 'and', 'sea', 'in', 'my', 'shoes']
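Splitting on a single space works here because the words are separated by exactly one space each. Real-world text is often messier. The sketch below uses a made-up string (`messy`) with a double space and a tab to show the difference between `.split(" ")` and calling `.split()` with no argument.

```python
messy = "She  sells seashells\tby the sea shore"  # note the double space and the tab

# Splitting on a single space leaves an empty string for the doubled space
# and does not split on the tab.
on_space = messy.split(" ")
print(on_space)
# ['She', '', 'sells', 'seashells\tby', 'the', 'sea', 'shore']

# Calling split() with no argument splits on any run of whitespace.
on_whitespace = messy.split()
print(on_whitespace)
# ['She', 'sells', 'seashells', 'by', 'the', 'sea', 'shore']
```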


In [None]:
# simpler way to count occurrences of a substring
message.count("sea")

4

In [None]:
email = "joelchan@umd.edu"
# we want all the elements
# first split into username and domain and server
elements = email.split("@")
print(elements)
username = elements[0]
print(username)
other_elements = elements[1].split(".")
print(other_elements)
servername = other_elements[0]
domainname = other_elements[1]
print(username, servername, domainname)

['joelchan', 'umd.edu']
joelchan
['umd', 'edu']
joelchan umd edu


## REMEMBER: STRINGS ARE IMMUTABLE

Note! Unlike list methods such as `.sort()`, which modify the list in place, string methods return a new object and do not modify the original string, since strings are immutable.

This means if you don't assign the return value of the string method to a new variable, the change will be **lost**. Remember this!

In [None]:
a = "hello"
b = "Hello"
a.lower()
b.lower()
a == b
print(a, b)

hello Hello


In [None]:
a = "hello"
b = "Hello"
a = a.lower()
b = b.lower()
a == b
print(a, b)

hello hello


In [None]:
message = "Hello, my name is Inigo Montoya"
print(message)
# each method returns a new string, so we reassign message to keep the change
message = message.lower() # change to lower case
message = message.replace("inigo", "MYSTERY")
print(message)

Hello, my name is Inigo Montoya
hello, my name is MYSTERY montoya


## String formatting

So far we've taken strings as given, and we often specify a string directly. But frequently it is useful to compose a string programmatically, from variables.

Often this is done for debugging (to read the state of your program at various steps), but it is just as often used to produce the outputs of your program, intermediate or final.

Here's an example

In [2]:
msg = "hello"
name = "sarah"
friend = "joel"
output = f"{msg.title()} {friend.title()}, my name is {name.title()}!"
print(output)

Hello Joel, my name is Sarah!


In [4]:
# print out a message that says:
# sorry that wasn't right, a person's name, you have some number of tries left
name = "sarah"
attempts = 5
output = f"sorry that wasn't right, {name}, you have {attempts} attempts left"
print(output)

sorry that wasn't right, sarah, you have 5 attempts left


In [None]:
sales = ["$100", "$250", "$500"]

for idx, sale in enumerate(sales):
    print(f"Processing the item at index {idx}: {sale}") # example of debugging/tracing statement
    print(sale)

Processing the item at index 0: $100
$100
Processing the item at index 1: $250
$250
Processing the item at index 2: $500
$500


In [None]:
tip = 0.18
check = 25.00
total_value = check + check*tip
print(f'Please charge my card for ${total_value:.2f}')

Please charge my card for $29.50


This is an f-string. The `:.2f` after the variable name rounds the value to 2 decimal places (`.2`) and displays it as a float (`f`).

### The basics
Let's look in more detail. The intuition here is that you're defining a series of "slots" for variables. Each slot is indicated with the `{}` curly braces. And you put data / variables in them.

You also indicate that you're doing this slot thing by prefixing the string with the letter `f`

Here's how it looks:

In [None]:
names = ["Joel", "Sarah", "Michael", "Kacie"]
for name in names:
    message = f"Welcome, {name}!"
    print(message)

Welcome, Joel!
Welcome, Sarah!
Welcome, Michael!
Welcome, Kacie!


In [None]:
birth_year = 1986
this_year = 2021
name = "Joel"
message = f"Happy birthday, {name}! You are {this_year - birth_year} this year!"
print(message)

Happy birthday, Joel! You are 35 this year!


### Controlling the way it looks
You can also control how the string looks! Various things like controlling how many decimal places are printed out (very useful when doing math), or how wide or indented the string is.

In [None]:
# most common
x = 2
y = 3
result = x/y
message = f"{x} divided by {y} is {result:.2f}" # only show two decimal places for the float value of result
print(message)

2 divided by 3 is 0.67


The general design pattern here is to put a colon after the variable (or expression) inside the curly braces and then specify some kind of formatting option. More details here: http://zetcode.com/python/fstring/
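As a rough sketch of that pattern, here are a few common formatting options, including width and alignment. The values and variable names are made up for illustration.

```python
name = "Joel"
total = 1234567.891
tip = 0.18

two_places = f"{total:.2f}"     # two decimal places
with_commas = f"{total:,.2f}"   # comma as thousands separator
as_percent = f"{tip:.0%}"       # shown as a percentage
right = f"[{name:>10}]"         # right-aligned in 10 characters
left = f"[{name:<10}]"          # left-aligned in 10 characters

print(two_places)   # 1234567.89
print(with_commas)  # 1,234,567.89
print(as_percent)   # 18%
print(right)        # [      Joel]
print(left)         # [Joel      ]
```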

For the curious: there was a time when string formatting was done differently (but Python's creators basically tell everyone *not* to use it anymore): just pointing it out as a historical novelty in case you see it in the wild in other people's code (*cough* Joel's code *cough*).

https://realpython.com/python-string-formatting/#1-old-style-string-formatting-operator

## Questions?

