# Notebook 5.1: String operations

A "string" is the name used in Python for a sequence of characters, such as a word, a sentence, or a paragraph of text. It is one of the most basic data types and one that Python is very good at working with. In fact, the ease with which Python can be used to manipulate text is one of the primary reasons it has become such a popular language for both scientific programming and web development.

### Create a string object
Below we assign a string object to a variable name using the `=` sign.

In [None]:
mystring = "a string of text"

### Indexing and slicing
Strings are an *indexed* datatype that is *immutable*. This means that we can select portions of the text using indexed numbering, but we cannot modify individual parts of it. When you take an index or slice of a string, a new string is *returned*. Unless you store that returned value in a variable it is not saved, and the original string stays unchanged unless you overwrite its variable name.

In [None]:
# indexing: select the first element of a string (Python index starts at 0)
mystring[0]

In [None]:
# indexing: select the last element (negative indexes start from the end)
mystring[-1]

In [None]:
# slicing: select a range of elements with a slice (start:stop, where stop is not included)
mystring[0:8]

In [None]:
# slicing: the above could also be written as [:8], meaning from the start up to (not including) index 8
mystring[:8]

In [None]:
# slicing: this means from index -3 until the end.
mystring[-3:]
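
The slicing rules above can be sketched in a single cell. This is a minimal illustration that assumes the same example string as before; the names `middle` and `every_other` are made up for this sketch. It also shows the optional third slice value (the step), which is not used elsewhere in this notebook.

```python
# assumes the same example string used earlier in this notebook
mystring = "a string of text"

# the stop index is not included: [2:8] returns the characters at indices 2 through 7
middle = mystring[2:8]
print(middle)

# so a slice within range is always (stop - start) characters long
print(len(middle))

# an optional third value sets the step; [::2] takes every second character
every_other = mystring[::2]
print(every_other)

# none of the slices above changed the original string
print(mystring)
```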

### Strings can be combined or reassigned


In [None]:
# create new string objects and assign to variables
string1 = "a new string "
string2 = "bigger than the last string"

In [None]:
# combine them and assign the result to another variable
newstring = string1 + string2
newstring

### You cannot mutate string objects
Attempts to assign a value to an index or slice of a string will raise an error because strings are *immutable*. We can take an index or slice of a string to return part of it, but cannot change it *in place*. Instead, we can replace the variable storing a string object with a new string. 

In [None]:
# error: you cannot assign to an index inside a string.
mystring[0] = "A"
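
If you want to see the error without stopping the notebook, one option is to catch it. The sketch below assumes the same example string and uses a `try`/`except` block, which has not been introduced in this notebook yet; the name `message` is only for illustration.

```python
mystring = "a string of text"
message = ""

try:
    # strings do not support item assignment, so this line raises a TypeError
    mystring[0] = "A"
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# the failed assignment left the string unchanged
print(mystring)
```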

### How do you modify strings then?
You assign a new string to the same variable name. The new string can be built using indexing or concatenation, or in several other ways that we'll see later. This may seem a little nuanced, but you'll see later how this varies among different objects, some of which are mutable and others of which are not. The example below creates a new string by adding "A" to the previous string indexed from 1 until the end.

In [None]:
## you could do this instead
newstring = "A" + mystring[1:]
newstring

## Built-in string functions
String variables are a type of data object in Python, and like all objects in Python they carry more information than just the value that we assigned to them. For example, string objects know the length of the text they store, and they have many functions (called *methods*) that can be called to manipulate that text. These attributes and functions can be accessed by using tab-completion on an object in Jupyter. After the period at the end of `mystring` below, press tab to see the options displayed.

In [None]:
# put your cursor after the period below and press <tab> to see available options
mystring.

### Example builtins
The examples below call `functions` associated with string objects.

In [None]:
# this capitalizes the first character of the string
mystring.capitalize()

In [None]:
# this splits the string into a 'list' wherever there is whitespace
mystring.split()

In [None]:
# this centers the text across a width of 40
mystring.center(40)

In [None]:
# this returns the index where the searched word first occurs, counting from the left
mystring.find("of")
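
A few details of these calls are easy to miss, so here is a small sketch assuming the same example string. The variable names are hypothetical. Note that `find()` returns `-1` instead of raising an error when the text is absent, and that the `in` keyword gives a simple True/False answer.

```python
mystring = "a string of text"

# find() returns the index of the first match
position = mystring.find("of")
print(position)

# find() returns -1 when the search text does not occur
missing = mystring.find("zebra")
print(missing)

# 'in' only asks whether the text occurs at all
has_text = "text" in mystring
print(has_text)

# split() returns a list of the whitespace-separated words
words = mystring.split()
print(words)
```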

### Modifying strings
In the examples above we applied a function to `mystring` which returned a new object that was printed into the output location below the cell. The new object was not stored, and so it is lost.

Because the returned value wasn't saved it was "garbage-collected" by Python, meaning that its space in memory was freed. If we want to store the result we need to assign it to a variable. This is done below, where you see `mystring` is changed after we run the command.

In [None]:
# here mystring is replaced by a new string where the first character is capitalized.
mystring = mystring.capitalize()
mystring

In [None]:
# let's create a new variable like the original that is not capitalized
lower_string = mystring.lower()
lower_string
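
To make the difference between calling a function and storing its result concrete, here is a short sketch. The names `text` and `upper_text` are made up for this example.

```python
text = "a string of text"

# the result of this call is not stored, so it is discarded
text.upper()
print(text)

# assigning the result to a variable keeps it
upper_text = text.upper()
print(upper_text)

# the original variable still holds the original string
print(text)
```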

### The `print()` function

The fact that a variable's value appears when you simply execute a cell that has the variable on the last line is a convenience of Jupyter. It is showing you what the object is. In the case of a string it shows the content with quotes around it to indicate it is a string.

A more formal way to show the contents of a string is by using the `print()` function. You can see that the first example below simply returns an object, showing quotes around the text to indicate it is a string object. The second way, using print, prints just the text. 

In [None]:
# you return a variable's value by entering just the variable
mystring

In [None]:
# or you can use the print() function, which prints it to stdout
print(mystring)

In [None]:
# in Python 3, print() accepts multiple arguments and separates them with a space
print(mystring, lower_string)
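
The separator that `print()` places between its arguments can be changed with the `sep` argument. The sketch below uses two hypothetical variables, `first` and `second`, and also shows `repr()`, which returns the quoted form that Jupyter displays for a string.

```python
first = "A string of text"
second = "a string of text"

# the default separator is a single space
print(first, second)

# sep replaces the space with any other string, including an empty one
print(first, second, sep=" | ")
print(first, second, sep="")

# repr() gives the quoted representation that Jupyter shows for a string
quoted = repr(first)
print(quoted)
```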

### Single, double, and triple quotes
What is the difference between single, double, and triple quotes when creating strings? There is little difference, but having several options makes it easier to write strings that include quotes inside of them. For example, to write a string that contains single quotes you can wrap it in double quotes.

In [None]:
## examples of printing strings
print("hello world")
print('printing in single quotes is the same as in double quotes')
print("'you can make strings that include single quotes by putting them inside doubles'")
print('"you can make strings that include double quotes by putting them inside singles"')
print("""
multi-line string with mixed single and double
quotes can all be captured inside triple-quotes.
This also interprets the starting line as a newline.
""")

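Another option, sketched here as an aside, is to *escape* a quote with a backslash so that it is treated as part of the text rather than as the end of the string. The variable names are hypothetical.

```python
# a backslash before a quote keeps it from ending the string
escaped_double = "a \"quoted\" word"
escaped_single = 'it\'s here'
print(escaped_double)
print(escaped_single)

# '\n' is the escape sequence for a newline
two_lines = "first line\nsecond line"
print(two_lines)
```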
### Formatted printing
There are several ways to insert variables into another string to automate the process of writing long strings in Python. We will focus on two methods: the `.format()` function and f-strings.

These allow you to substitute variables into a string when printing by using curly braces to indicate the position where elements should be inserted. Let's see some examples.

In [None]:
# substitute variables into the curly braces
dog = "dog"
cat = "cat"
print("the {} is meaner than the {}".format(dog, cat))

In [None]:
# use indexing to indicate the order of filling
dog = "dog"
cat = "cat"
print("the {1} is meaner than the {0}".format(dog, cat))

In [None]:
# the values do not need to be assigned to variables beforehand
print("""
This is a long string over
multiple lines. But I can still 
insert "{}" here or "{}" here, 
it's easy as {}.
""".format('dog', 'cat', 'pie'))

In [None]:
# You can substitute strings, ints, floats, etc.
print("the final answer is: {}".format(0.00123))
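
The curly braces can also carry a *format specification* after a colon, which controls how a value is displayed. The following is a small sketch of a few common cases and not an exhaustive list; the variable names are made up for illustration.

```python
value = 0.00123

# .2f rounds to two decimal places
rounded = "rounded: {:.2f}".format(value)
print(rounded)

# .2e uses scientific notation with two decimal places
scientific = "scientific: {:.2e}".format(value)
print(scientific)

# >8 right-aligns the value in a field that is 8 characters wide
padded = "{:>8}|".format("dog")
print(padded)

# fields can also be filled by name
named = "the {animal} is here".format(animal="cat")
print(named)
```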

### F-strings
In the f-string format you add an 'f' before the string, which allows you to substitute variables into the string simply by inserting them directly into the string inside of curly brackets.

In [None]:
# f-strings work great for a one-line string
dog = "dog"
cat = "cat"
print(f"the {dog} is meaner than the {cat}")

In [None]:
# f-strings also work with multi-line strings
print(f"""
This is a long string over
multiple lines. But I still want to 
insert "{dog}" here or "{cat}" here.
""")

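The curly brackets in an f-string can hold any expression, not just a variable name. A brief sketch, using hypothetical variables:

```python
dog = "dog"
n_legs = 4
ratio = 2 / 3

# function calls and arithmetic are evaluated inside the brackets
msg1 = f"the {dog.upper()} has {n_legs} legs"
msg2 = f"two dogs have {2 * n_legs} legs"
print(msg1)
print(msg2)

# a format specification after a colon works as it does in .format()
msg3 = f"ratio: {ratio:.3f}"
print(msg3)
```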
### Parsing a string document
In many cases an entire data set, or document page, will be stored as a string, and so it is really useful to know some common workflows for parsing strings of text into other usable forms. 



In [None]:
# a string that is like a full page document
page = """
This is a multi-line document.
This is the second line.
The last line is here.
"""

In [None]:
# the string variable looks like this. Newlines are represented as '\n'
page

In [None]:
# strip() removes the newline characters at the beginning and end
page.strip()

In [None]:
# split can take an argument to split on a specific character.
# Here we enter '\n' to split on new lines. 
# This parses the string into a 'list' object of lines. 
# We'll discuss lists more later. 
page.strip().split('\n')
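
Once the page is split into lines, each line can be split again into words. The sketch below assumes the same `page` string and uses `splitlines()`, a built-in alternative to `split('\n')`. It also uses a list and a for-loop, both of which are explained later; the names `lines` and `word_counts` are only for illustration.

```python
page = """
This is a multi-line document.
This is the second line.
The last line is here.
"""

# splitlines() splits on newline characters, much like split('\n')
lines = page.strip().splitlines()

# split each line on whitespace and record how many words it has
word_counts = []
for line in lines:
    words = line.split()
    word_counts.append(len(words))
    print(len(words), "words, starting with:", words[0])
```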

### Details on strings versus bytes
Although strings seem like a simple data type, being just characters of text, when you dive in deeper you find they are actually quite complex. This is particularly true in Python 3 which, unlike Python 2, keeps text and binary data strictly apart in two distinct types called `str` (strings) and `bytes`. Everything we have worked with so far has been a `str`. Technically, bytes are the raw values that can be written to your hard disk, whereas a string is a sequence of Unicode characters. An *encoding*, such as UTF-8, defines how characters are converted to bytes and back again. In this way, the same set of bytes can represent different strings depending on which encoding is used to decode them. This makes it possible to represent the full range of characters, including non-Latin scripts, emojis, and other things that exist in the wider world of text. Put simply, a string is for showing to humans while a bytestring is for storing on disk. Let's see some examples.

In [None]:
# define a unicode string.
"a string"

In [None]:
# explicitly define a unicode string
u"a unicode string"

In [None]:
# explicitly define a byte string
b"a byte string"

You're probably thinking, huh, these all look the same so far. That's true. It's because these examples contain only simple ASCII characters, which look the same as text and as UTF-8 encoded bytes. UTF-8 is the default encoding that Python 3 uses in `.encode()` and `.decode()`. What if we decode the bytestring with a different encoding?

In [None]:
bb = b"a unicode string"

print(f"""
this is bb decoded with utf-8:  '{bb.decode("utf-8")}'
this is bb decoded with utf-16: '{bb.decode("utf-16")}'
""")

In [None]:
# encode this string back into a bytestring
print('中文'.encode('utf-8'))

In [None]:
# decode this UTF-8 bytestring back into a string
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')

The bytes versus strings concept is a bit complex, and it is OK if you don't fully grasp all of it right now. For the most part, if you are only working with simple unicode characters, the only thing you need to know about bytes for now is that you can use the `.decode()` function call to convert them back to strings.
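
As a minimal sketch of the round trip, the cell below encodes a string to bytes and decodes it back again. The names `data`, `restored`, and `ascii_failed` are made up for this example. It also shows that decoding with an encoding that cannot represent the bytes raises an error.

```python
text = "中文"

# encode the string into bytes using UTF-8
data = text.encode("utf-8")

# the string has 2 characters, but its UTF-8 form takes 6 bytes
print(len(text), len(data))

# decoding with the same encoding recovers the original string
restored = data.decode("utf-8")
print(restored == text)

# decoding with an encoding that cannot represent these bytes raises an error
ascii_failed = False
try:
    data.decode("ascii")
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```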

### Comments (#)
Written code often contains comment strings or lines. These are not meant to be interpreted by the program, but instead to provide hints to the person reading the code. You've seen this already in the lecture notes and example code we have examined, including in most cells of this notebook. You can see that the Jupyter notebook interface displays comments in a lighter blue-grey color. In the examples below you can see that the comments are not interpreted when written in-line either.

In [None]:
newstring = "a new string"            ## creating a new string variable
newstring = newstring.capitalize()    ## return capitalized version
newstring.startswith("A")             ## ask whether it starts with an "A"

### Using print() for debugging

A common use for the `print()` function when writing code is to use it for *debugging*. Essentially, this is a way of asking "what is happening in the code right now?". It is a good way to ensure that the code is running how you want it to, or to find the bugs in your code if they exist. We'll use `print` in this way below while discussing `for-loops`. 

### Strings are iterable
Strings, like many other Python data objects, are *iterable*, which means that we can sample sequential elements from them. The elements in a string are its characters, each of which is itself a string of length one. We can write a for-loop like below to iterate over the elements in the string object.

A `for-loop` is a very common procedure in programming. It is used to repeat some procedure a specified number of times. The syntax for writing a for-loop in Python is `for x in y:` where `x` becomes a variable name that will be updated every loop. The `y` variable must be an iterable object, such as a string. This line must end with a colon, and the next line must be indented. By convention, indentation is usually 4 spaces in Python.

In [None]:
## define a string variable
stringvar = "apples orange grapes"

## a for-loop iterating over elements in stringvar
for x in stringvar:
    print(x)
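
To see the position of each element as well as its value, one option is the built-in `enumerate()` function, which has not been introduced in this notebook yet. The sketch below uses a short hypothetical string and collects the elements in a list named `seen`.

```python
word = "grapes"
seen = []

# enumerate() yields the index together with each element
for index, char in enumerate(word):
    # each element is itself a string of length 1
    print(index, char, len(char))
    seen.append(char)
```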

### More on for-loops
We've learned about operators for things like addition, subtraction, and for performing comparisons, such as `==`, `>`, and `<`. Below we combine these in a for-loop to perform a more complex operation. Following the format we used above to write a for-loop, it's important to recognize that the variable in the loop is being reassigned on every iteration. In this loop we name the variable `char`, and we test whether `char` is a vowel by comparing it to each character of a string that contains only vowels. If the `==` comparison returns True then the print function is called.

In [None]:
## find vowels in stringvar
for char in stringvar:
    for vowel in "aeiou":
        if char == vowel:
            print(char)

### A simpler way to do the same thing

In [None]:
## find vowels in stringvar
for char in stringvar:
    if char in "aeiou":
        print(char)
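
The same loop can count the vowels instead of printing them. This is a small sketch assuming the same `stringvar` as above; the counter name `n_vowels` is made up for this example.

```python
stringvar = "apples orange grapes"

# start a counter at zero and add one for every vowel found
n_vowels = 0
for char in stringvar:
    if char in "aeiou":
        n_vowels += 1

print(n_vowels)
```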

# Challenges

<div class="alert alert-success">
    Try to complete all of the challenges below.
</div>

A. Create a variable named 'varstring' and assign it the value "apples oranges grapes"


In [None]:
varstring = "apples oranges grapes"

B. Use indexing to create a new variable of only the first two fruits


In [None]:
firstTwo = varstring[0:14]

C. Use indexing to create a new variable of only the last two fruits


In [None]:
lastTwo = varstring[7:]

D. Split the string on whitespace to create a list


In [None]:
newList = varstring.split()

E. Iterate over varstring and print every element that is not a vowel


In [None]:
for char in varstring:
    if char not in "aeiou":
        print(char)

F. Create a variable that is assigned the following string and print it: `They asked, "what's your name?"`

In [None]:
string = "They asked, \"what's your name?\""
print(string)

G. Count the number of differences between these two DNA strings


In [None]:
dna1 = "ACAGAGTTGCCAGGAGATGACAGAAAGGTGTGGGTTACAACTCTCTCTAATTTAAGGGCCAATTAACATT"
dna2 = "ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTAAAATAAGGGCCAATTAACGTT"

In [None]:
index = 0     # position of the matching residue in dna2
count = 0     # number of positions where the two strings differ
for residue1 in dna1:
    if residue1 != dna2[index]:
        count += 1
    index += 1

print(count)
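
An alternative sketch for this kind of comparison uses the built-in `zip()` function, which pairs up the elements of two iterables so that no separate index is needed. The short sequences below are hypothetical and chosen only to keep the example easy to check by eye. Note that `zip()` stops at the end of the shorter sequence, so this approach assumes both strings have the same length.

```python
# two short, made-up sequences of equal length
seq_a = "ACGTACGT"
seq_b = "ACCTACGA"

# zip() yields one pair of characters per position
n_diff = 0
for base_a, base_b in zip(seq_a, seq_b):
    if base_a != base_b:
        n_diff += 1

print(n_diff)
```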