# Strings

# Objectives

By the end of this notebook, you should know:

* Strings are sequences of characters
* Each character is encoded with Unicode
* How to define a string with single, double, and triple quotes
* How to recognize the backslash escape character
* What a method is and how to call it
* How to call methods with parameters
* How to discover methods with tab completion
* How to get help using methods
* How to call methods in succession with method chaining
* How to get the length of a string
* What the plus and multiplication operators do with strings
* The basics of string interpolation with f-strings
* How to select a single character from a string
* How to select multiple characters with slice notation
* That strings are immutable

In the previous notebook we briefly looked at the basic numeric types, **`int`** and **`float`**. These types are quite simple and straightforward, and they behave how you would expect them to. We now turn to **strings**, which are a more complex type. Python uses the name **`str`** to represent the string type.

# Built-in Types
Before diving into strings, let's take a moment to discuss **built-in types**. The word **built-in** refers to features of the language that come standard with each installation of Python. They form the core part of the language from which all other features and libraries are built. The most common built-in types are:

* `bool`
* `int`
* `float`
* `str`
* `tuple`
* `list`
* `set`
* `dict`

There are several more built-in types such as **`complex`**, **`frozenset`**, **`bytes`**, and others, but these are used far less frequently than those in the list above. To see all the built-in types, see this (very long) [page from the official Python documentation][1].

## The term data type
You will often see the term **data type** used when discussing **types**. They refer to the same concept.

## Back to strings - what are they?
Technically, a Python string is a sequence of characters.

## So what are characters?
A character is the smallest possible component of text, usually a letter of an alphabet or a punctuation mark that you can output with one key from your keyboard. There is no separate type for characters. A single character is just a string with a length of one.

Internally, each character in Python is represented as an integer. For instance, the integer 65 corresponds to **`A`** and 947 corresponds to **`γ`**, the Greek letter gamma. This system of encoding is called **Unicode**. Unicode has become an industry standard for character representation. See the [documentation][2] for more on characters.
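As a small sketch of the character-to-integer mapping, the built-in functions `ord` and `chr` convert between a character and its integer. The variable names below are made up for illustration.

```python
# ord returns the integer (code point) behind a single character
code_a = ord('A')
code_gamma = ord('γ')

# chr goes the other way, from integer to character
char_65 = chr(65)
char_947 = chr(947)

print(code_a, code_gamma)
print(char_65, char_947)
```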

### More on Unicode
Before Unicode, one of the first computing standards for character representation was ASCII, a system that represented 128 unique characters using 7 bits. A bit is the smallest unit of information for a computer and is binary (0/1). So 7 bits can encode 2^7 (128) different pieces of information and for ASCII, this information corresponded to characters, numbers, and punctuation common to English language speakers.

Since there are thousands of languages and tens of thousands of different characters, Unicode was created to cover every writing system. Each number in Unicode is called a **code point** and represents a single unique character. Code points range from 0 to 1,114,111 (hexadecimal `10FFFF`), which leaves room for over one million characters. How many bytes a character occupies depends on the encoding: UTF-32 always uses 4 bytes (32 bits) per character, while UTF-8, the most common encoding, uses between 1 and 4 bytes.
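To see how characters turn into bytes, here is a minimal sketch that encodes a few characters with two common Unicode encodings, UTF-8 and UTF-32. The list of sample characters and the variable names are my own choices for illustration.

```python
# A few sample characters: ASCII, Greek, a currency symbol, and an emoji
characters = ['A', 'γ', '€', '😀']

# Number of bytes each character needs in UTF-8 (variable width)
utf8_lengths = [len(c.encode('utf-8')) for c in characters]

# Number of bytes each character needs in UTF-32 (fixed width)
utf32_lengths = [len(c.encode('utf-32-le')) for c in characters]

print(utf8_lengths)
print(utf32_lengths)
```

The number of bytes depends on the encoding, while the code point of each character stays the same.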

##  Creating our first string
Strings are created by writing a sequence of characters surrounded by either single or double quotes.

## Reminder to execute all code cells
Remember to execute (**shift/ctrl + enter**) each code cell below while reading through the notebook.

[1]: https://docs.python.org/3/library/stdtypes.html
[2]: https://docs.python.org/3/howto/unicode.html#definitions

In [None]:
# Define a string
my_string = 'my own personal string'

## Output the string to the notebook
There was no output after executing the above cell because the string was assigned to the variable **`my_string`**. Use the cell below to output the value of the variable to the notebook.

In [None]:
my_string

## Using double quotes
Double quotes work equally well for creating strings. See the example below:

In [None]:
my_string2 = "yet another string"
my_string2

## Quotes inside of quotes
Because Python allows for single and double quotes, it's easy to make a string with quotes inside of quotes. Below are examples of either a single or double quote inside of a string.

In [None]:
singleq = "There's a single quote in here"
singleq

In [None]:
doubleq = 'Yogi Berra once said, "We made too many wrong mistakes"'
doubleq

## Multiline Strings with triple quotes
When you have a string that spans multiple lines, or a string with both double and single quotes in it, you can use triple quotes. Enclose the string with either three single or three double quotes.

In [None]:
tripleq = """Use triple
quotes for some 
very long
multiline strings
"""

tripleq

## Single and double quotes in the same string
You can also use triple quotes as a way to put both single and double quotes in the same string.

In [None]:
my_string_w_2_quote_types = '''My friend said, "I'm only a mediocre pythonista". I got mad! '''

my_string_w_2_quote_types

## Strings with escape characters
You probably noticed that the output of the last two strings has backslashes in it. The Jupyter notebook displays strings in a standard manner, surrounding them with single quotes where possible and showing special characters as escape sequences. What you typed and what is displayed are the same string. It is still possible to have single quotes inside a string that uses single quotes by using the backslash **escape character**.

A backslash is a signal (technically an escape character) to Python that the character following it should be treated specially rather than literally. For instance, **`\n`** is interpreted as a new line and **`\'`** means a single quote that does not terminate the string.
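Here is a short sketch of a few common escape sequences. The variable names and sample text are made up for illustration.

```python
two_lines = 'first\nsecond'       # \n is a single character: a new line
with_quote = 'There\'s a quote'   # \' is a single quote that does not end the string
with_backslash = 'C:\\temp'       # \\ is one literal backslash

print(two_lines)
print(with_quote)
print(with_backslash)

# Each escape sequence counts as one character
print(len('\n'))
```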

## An aside on functions
We will formally discuss functions in their own notebooks. For those unfamiliar with functions, they are reusable pieces of stored code that perform a particular task. A function name is followed by a set of parentheses when it is called, and the parentheses can contain **arguments**, which are comma-separated values that the function will use.

## Using the print function
You can use the **`print`** function to output your strings. The output will have the escape characters replaced by the values they represent.

In [None]:
print(tripleq)

In [None]:
print(my_string_w_2_quote_types)

# Many more "abilities" with strings
Strings have far more "abilities" than their numeric counterparts. Some of these abilities include the following:
* Capitalizing the first letter of each word
* Counting the frequency of a particular letter
* Splitting the string into multiple other strings
* Finding the location of a particular substring
* And much more

The abilities I am referring to are called **methods**. The way methods are invoked in Python is by placing a dot at the end of the variable name followed by the method name and a set of parentheses.
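As a quick sketch of the abilities listed above, each one corresponds to a string method. The sample string and variable names are made up for illustration.

```python
phrase = 'the quick brown fox'

titled = phrase.title()           # capitalize the first letter of each word
n_o = phrase.count('o')           # count how often the letter 'o' appears
words = phrase.split()            # split into multiple strings
position = phrase.find('brown')   # find where a substring starts

print(titled)
print(n_o)
print(words)
print(position)
```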

## Capitalizing each character in a string
For instance, we can use the **`upper`** method to capitalize each character in a string.

In [None]:
capitalize_me = "this string will soon be all caps"

In [None]:
capitalize_me.upper()

# Counting character occurrences
The **`count`** method counts the number of non-overlapping occurrences of a substring. Place the substring you would like to count inside of the parentheses.

In [None]:
capitalize_me.count('i')

In [None]:
capitalize_me.count('will')

# Calling methods with parameters
The **`count`** method from above is an example of a method with **parameters**. A parameter is a variable whose value changes the functionality of the method. The strings **`i`** and **`will`** are both parameter values (also called **arguments**).

Some methods, such as **`upper`** do not have any parameters and therefore can be called with nothing inside their parentheses. Methods and their parameters will be discussed in greater detail in future notebooks.

# Split a string into multiple substrings
The **`split`** method splits a single string into multiple strings. By default, it will split the string wherever there is whitespace (spaces, tabs, or new lines). The **`split`** method returns a **`list`** of strings. Lists contain multiple objects and will be discussed in depth in a separate notebook.

In [None]:
my_string.split()

# Splitting a string by a given value
The first parameter of the **`split`** method allows us to choose how the string is split. Below, the string will be split at every occurrence of the substring **`so`**.

In [None]:
my_string.split('so')
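One detail worth seeing in action: calling `split` with no argument is not the same as passing a single space. A small sketch, using a made-up string with repeated spaces:

```python
spaced = 'one  two   three'   # two spaces, then three spaces

# No argument: runs of whitespace are treated as one separator
default_split = spaced.split()

# Explicit single space: every space is a separator, so empty strings appear
space_split = spaced.split(' ')

# Any other substring can be the separator
comma_split = 'a,b,c'.split(',')

print(default_split)
print(space_split)
print(comma_split)
```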

# How do I know what methods are available?
The above examples only covered a few of the available string methods. The official Python documentation lists all the [string methods][1]. While thorough, it's not that convenient.

The **`help`** function provides a fairly easy way to display all the methods from a certain type. Execute the following cell to get a list of all the string methods.

[1]: https://docs.python.org/3/library/stdtypes.html#string-methods

In [None]:
help(str)

# Use tab completion to get help while coding
An even better method for finding the available list of methods is with the tab completion functionality of IPython. To make use of tab completion, write the name of your variable followed by a dot and then press **tab**. A drop down menu will appear with a list of the methods. It will look like this: ![][1]

[1]: images/string_methods.png

Reproduce this drop down menu in the below cell:

In [None]:
# Place your cursor at the end of the next line and press tab
my_string.

# Use the `dir` function as yet another way
One further method for finding all the functionality of a particular variable is to pass it to the **`dir`** function.

In [None]:
dir(my_string)

# What are all those methods beginning and ending with underscores?
Methods that both begin and end with double underscores are "special" or "magic" methods that give Python objects a standard way of working with the operators and built-in functions. These methods are not usually called directly and are not important at this point of your Python career.

However, they are very important for developing new types on your own. They will be covered in the supplemental material on object-oriented programming.
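If the special methods get in the way, one possible approach is to filter them out of the `dir` output. This is only a sketch; the name `public_methods` is made up for illustration.

```python
# Keep only the names that do not start with an underscore
public_methods = [name for name in dir(str) if not name.startswith('_')]

print(len(public_methods))
print(public_methods[:5])
```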

# Getting help with these methods
Knowing the method name is sometimes enough to understand its intended purpose, but often you will need to read more on how the method actually works, especially if there are parameters. This is referred to as **reading the documentation** or **reading the docs** and is an extremely important skill to acquire.

# Using the `help` function (again)
You can use the **`help`** function to output the documentation. Place the variable followed by dot and method name inside the help function. Take note to not actually call this method and write it without parentheses. 

In the below cell, we get help on the **`replace`** string method.

In [None]:
replace_string = 'replace each letter a in this string with A'


help(replace_string.replace)

# Understanding the output of the `help` function
The above output is referred to as the **docstring**. Docstring comes from the fact that the documentation in Python can be written inside of a method as a string. More on this later.

The second line from the docstring tells us that **`replace`** is indeed a method from the built-in string type. The word **`instance`** is a technical term denoting that the variable **`replace_string`** is a single member of the string type. This distinction will become important during the object-oriented programming section.

The following line:

 ```
 S.replace(old, new[, count]) -> str
 ```

is called the **method signature**. It informs us that there are three parameters, **`old`**, **`new`**, and **`count`**. Both **`old`** and **`new`** are **required** parameters while **`count`** is an **optional** parameter. For built-in types, optional parameters are often denoted in brackets inside the docstrings; newer Python versions may instead show a default value such as `count=-1`. The word that follows the arrow, **`->`**, tells us what type is returned from the method (another string in this instance).

The remaining text gives us a short description on what the method actually does.

### Using the `replace` method
The **`replace`** method requires that we give it values for the two parameters **`old`** and **`new`**. If we don't supply it with the correct number of values, we get an error similar to the following (the exact wording depends on your Python version):

```
TypeError: replace() takes at least 2 arguments (0 given)
```

Run the cell below as it is to replicate the error.

In [None]:
# run this cell as it is
replace_string.replace()

### Using `replace` correctly
Let's replace each lowercase **a** with an uppercase **A**. From the docstring above, we know that the first parameter is **`old`** and the second is **`new`**. We pass the parameter values into the method by separating them with commas.

In [None]:
replace_string.replace('a', 'A')

### Using the optional `count` parameter
From the documentation, we know that we can limit the number of replacements by passing a value for the optional third parameter **`count`**. Let's limit the replacement to the first 2.

In [None]:
replace_string.replace('a', 'A', 2)
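To make the effect of **`count`** easy to see, here is a small sketch on a short made-up string.

```python
fruit = 'banana'

# Replace every 'a'
all_replaced = fruit.replace('a', 'o')

# Replace only the first two 'a' characters
first_two_replaced = fruit.replace('a', 'o', 2)

print(all_replaced)
print(first_two_replaced)
print(fruit)  # the original string is unchanged
```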

# Getting help while coding
One of my favorite tricks is to have the docstrings appear in a pop-up window as I am coding. It will look like this:

![][1]

To make the docstring pop up in place, type out your method, then **hold shift and press tab twice**.

[1]: images/string_docstring.png

In [None]:
# place the cursor at the end of the line then hold shift and press tab twice
replace_string.replace

### Problem 1
<span style="color:green">Replace each occurrence of 'in' with 'out' in the following string. </span>

In [None]:
replace_string = 'it is starting to rain on the inside'
# your code here

### Problem 2
<span style="color:green">Find and use a method that will strip away all the exclamation points from either end of the following string. </span>

In [None]:
s = '!!!!a string with a dull message!!!!'
# your code here

### Problem 3
<span style="color:green">Find and use a method that will find the position of the first occurrence of the letter `t` in the following string. </span>

In [None]:
s = 'a data scientist'
# your code here

### Chaining methods

Most of the string methods **return** another string, with the exceptions being the **`count`** method which returns an integer and the **`split`** method which returns a list. In fact, all methods in Python return an object.

As programmers, we can continue calling new methods on this returned object without first saving it to a new variable. This is called **method chaining**.  Let's do an example where we slowly build a chain of methods. We will start with a string and call string methods that return other strings.

We will start by calling a single method to strip away the punctuation from just the right side with **`rstrip`**.

In [None]:
s = '?!?!?!A HIDDEN TEST STRING??!!?!'
s.rstrip('?!')

### Our first method chain

We can chain an additional method after **`rstrip`** by simply adding a dot and method name as usual. Let's append the **`lower`** method to lowercase the letters.

In [None]:
s.rstrip('?!').lower()

### An infinite number of chains are possible
There is no limit to how many methods you can chain together. Let's add a couple more:

In [None]:
s.rstrip('?!').lower().replace('t', 'a').count('e')

### Chaining on multiple lines
In other languages (like JavaScript), chaining methods is very common and is usually written more clearly and cleanly with one method per line. **Whitespace** matters in Python and so we can't just write each method on a separate line. Python gives you a couple different ways to work around this.
* Wrapping the entire expression in parentheses
* Using a backslash character at the end of each line

Most Python programmers prefer the former. It's also customary to line up the methods directly by the dots.

In [None]:
# using parentheses to put methods on different lines
(s.rstrip('?!')
  .lower()
  .replace('t', 'a')
  .count('e'))

In [None]:
# This also works but most Python programmers dislike it
s.rstrip('?!') \
 .lower() \
 .replace('t', 'a') \
 .count('e')
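To see what each link in the chain produces, here is a sketch that saves every intermediate result to its own variable. The names `step1` through `step4` are made up for illustration.

```python
s = '?!?!?!A HIDDEN TEST STRING??!!?!'

step1 = s.rstrip('?!')            # strip punctuation from the right side only
step2 = step1.lower()             # lowercase every letter
step3 = step2.replace('t', 'a')   # replace each 't' with 'a'
step4 = step3.count('e')          # count the remaining 'e' characters

print(step1)
print(step2)
print(step3)
print(step4)

# The chained version gives the same final answer
chained = s.rstrip('?!').lower().replace('t', 'a').count('e')
print(chained == step4)
```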

### Problem 4
<span style="color:green">Strip each letter 'a' from the left side, switch the case of each letter (from lower to upper and from upper to lower), and find the position (aka index) of the first letter 'o'.</span>

In [None]:
test_string = 'aaaa TOO many aaaaaaaaa'
# your code here

### Getting the length of a string: Is there a method for this?
Obtaining the length of a string is one of the most straightforward pieces of information that you could get from it. By working through the above examples, you would think that there might be a method like...   

`>>> test_string.len()`  

But this doesn't exist.

### So how do you get the length of a string?
Use the built-in **function** **`len`**, which returns the length of a string.

In [None]:
# Getting the length of a string using the len function
test_string = 'yet another test string'
len(test_string)

### More on the `len` function
The **`len`** function is used to get the number of elements from a wide variety of objects, not just strings, such as with tuples, lists, sets, dictionaries, and many others.
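A brief sketch of `len` applied to several built-in types; the sample values are made up for illustration.

```python
length_of_string = len('abc')               # number of characters
length_of_tuple = len((1, 2))               # number of items
length_of_list = len([10, 20, 30, 40])      # number of items
length_of_set = len({1, 1, 2})              # duplicates are stored once
length_of_dict = len({'a': 1, 'b': 2})      # number of keys

print(length_of_string, length_of_tuple, length_of_list,
      length_of_set, length_of_dict)
```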

# Concatenating strings
Again, the most common way to concatenate strings together is neither a method nor a function. Instead, we use the plus operator. See some examples below:

In [None]:
'abcde' + 'fghijk' + 'lmnop'

In [None]:
string1 = 'mac'
string2 = 'hine'

string1 + string2

### What happens if you subtract strings?
The subtraction operator is unsupported for strings and will produce an error.

In [None]:
'asdfa' - 'a'

### Repeat a string with the multiplication operator
Interestingly, the multiplication operator is available to use with strings. Multiply any string by an integer and you will produce a new string concatenated to itself that many times. 

In [None]:
# The string is repeatedly concatenated to itself via multiplication
'some test words | ' * 5
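A few more cases of string multiplication, sketched with made-up values. Multiplying by zero gives an empty string.

```python
repeated = 'ab' * 3      # the string repeated three times
divider = '-' * 10       # a handy way to build a separator line
nothing = 'ab' * 0       # multiplying by 0 gives an empty string

print(repeated)
print(divider)
print(len(nothing))
```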

# String Interpolation
**String interpolation** refers to the substitution of variable values inside of strings. Python 3.6 made string interpolation easier and more intuitive with something called **f-strings**, which is short for formatted string literals.

## f-string basics
Substituting variable values into strings is quite easy with f-strings. Here are the two steps that you must follow:
* Prefix the string with the letter **`f`**
* Surround the variable with curly braces within the string

In the example below, we create three variables and replace them in the sentence with an f-string. Notice the **`f`** prefix.

In [None]:
name = 'Ted'
occupation = 'data scientist'
salary = 3

# f-string - substitute name, occupation, and salary
f'Employee {name} is a {occupation} and earns {salary} dollars per year'
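The curly braces can hold more than a bare variable name. The sketch below shows an expression, a method call, and two format specifications (`.2f` for two decimal places and `,` for a thousands separator). The variables `price` and `big_number` are made up for illustration.

```python
name = 'Ted'
salary = 3
price = 3.14159
big_number = 1234567

doubled = f'{name} would earn {salary * 2} dollars after a raise'
shouted = f'{name.upper()}!'
rounded = f'{price:.2f}'          # two digits after the decimal point
with_commas = f'{big_number:,}'   # comma as thousands separator

print(doubled)
print(shouted)
print(rounded)
print(with_commas)
```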

## Using the older `format` method
Prior to f-strings, you had to use the **`format`** string method. Not everyone has upgraded to Python 3.6 or later, so you will still see this syntax used frequently.

There are a few ways to use the **`format`** method. As with f-strings, you use curly braces within the string in the place where you want to make the substitution. The braces may be empty and by default they will take on the values in the order used as arguments in the format method.

In [None]:
name = 'Ted'
occupation = 'data scientist'
salary = 3

'Employee {} is a {} and earns {} dollars per year'.format(name, occupation, salary)

### `format` method without first declaring variables

You can use the literal values as arguments in the format method without first declaring them as variables.

In [None]:
'Employee {} is a {} and earns {} dollars per year'.format('Sally', 'Doctor', 4)

### `format` method with parameter names

Similar to f-strings, you can name your parameters in the `format` method and use those names to explicitly identify the variable during interpolation.

In [None]:
name = 'George'
occupation = 'Marketer'
salary = 2

'Employee {name} is a {occ} and earns {sal} dollars per year'.format(sal=salary, name=name, occ=occupation)

## Use f-strings for your projects

I encourage you to use f-strings as they are very intuitive and very popular amongst Python programmers. But, keep in mind that many projects that you work on with other developers will require the older format method.

## Much more to string formatting
This intro just scratches the surface of string interpolation, also known as string formatting. Read this [blog post][1] for a more detailed discussion.

[1]: https://realpython.com/python-f-strings/#option-1-formatting

# Selecting substrings
We first introduce the index operator (square brackets), **`[ ]`** which is fundamental to scientific computing in Python. The **`[ ]`** operator has the ability to select item(s) from a sequence in a wide variety of manners. Since strings are sequences of characters, the **`[ ]`** operator provides lots of functionality for strings. 

## The index of each character
The **index** of each character is defined as the integer location of each character. The integer location starts at 0 from the beginning of the string. 

## Selecting a single character
To select a single character from a string, append the index operator to the string and place the integer location of the desired character. Let's select the first character from the following string.

In [None]:
test_string = 'make sure you complete the entire precourse'
test_string[0]

## Selecting other characters
All characters of the string may be selected individually in this manner. Let's select index 4.

In [None]:
# An empty space
test_string[4]

## Using negative indices to select from the end
It is possible to use negative integers to select from the end of a string. Let's select the last element.

In [None]:
test_string[-1]

In [None]:
test_string[-9]
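A negative index counts backwards from the end, so `-1` refers to the same character as the length of the string minus one. A small sketch with a made-up six-letter string:

```python
word = 'python'

last = word[-1]                   # last character
same_last = word[len(word) - 1]   # the same character, using a positive index
third_from_end = word[-3]         # third character counting from the end

print(last, same_last, third_from_end)
```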

## Index error
Not all integers are valid indexes. You will be met with an **`IndexError`** if the integer is not within the correct range.

In [None]:
test_string[1000]

## Slice notation to select a substring
In addition to selecting single characters, we can easily select substrings (multiple characters) from our string with **slice notation**. Slice notation is composed of three integers, **start**, **stop**, and **step**. These three integers are separated by colons. The **step** integer is optional and defaults to 1. The generic form for slice notation looks like this:

> `start:stop:step`

Slice notation is placed within the square brackets following the string. Let's select a substring from index 5 to index 13.

In [None]:
test_string[5:13]

## The stop index is not included
The above slice notation, **`5:13`**, begins at index 5 and selects all characters up to but NOT including index 13. Let's verify this by selecting index 12 and index 13.

In [None]:
test_string[12]

In [None]:
test_string[13]

## The step integer is optional
We could have included the step integer 1 like so below, but it is not necessary.

In [None]:
test_string[5:13:1]

## Stepping by something other than 1
We can step by any integer we want. Let's select every other letter from that same substring.

In [None]:
test_string[5:13:2]

## Both start and stop integers are optional as well
Let's say we want to select the first 5 characters of the string. It is not necessary to use the integer 0 as the start position and instead it is much more common to see the following:

In [None]:
test_string[:5]

## Omit the stop value to slice to the end
If the stop value is not provided, then the selection will continue until the end of the string.
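As a minimal sketch of omitting the stop value, the slice below starts at index 27 and runs to the end of the string. The variable names are made up for illustration.

```python
test_string = 'make sure you complete the entire precourse'

# Start at index 27 and continue to the end
to_the_end = test_string[27:]

# Omitting both start and stop selects the whole string
everything = test_string[:]

print(to_the_end)
print(everything == test_string)
```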

### Problem 5
<span style="color:green">Select the last three letters of the following string.</span>

In [None]:
test_string = 'make sure you complete the entire precourse'
# your code here

## Slice notation is only understood within the brackets of sequence objects
The syntax for the slice notation is not correct Python outside of this specific location in square brackets following some kind of sequence object. See the **`SyntaxError`** below:

In [None]:
s = [5:13:2]

## Using the `slice` function
The **`slice`** builtin function exists if you would like to create a slice outside of its usual place. You can then place this slice object inside of the square brackets as you did with the above.

In [None]:
a_slice = slice(5, 13)

test_string[a_slice]

## Don't use the `slice` function for now
There is no need to use the **`slice`** function as a beginning Python student. It does have value, but it is usually only necessary in advanced use cases. Use **slice notation** instead.

## Reversing a string with a step of -1

It is possible to have a negative integer for the step value of a slice. This will have the effect of selecting the slice in reverse. For this to work properly, the start index must be further to the right than the stop index. Let's see this by starting at the index 13 and selecting down to but not including index 5.

In [None]:
test_string[13:5:-1]

## Reversing a string
From the above example, it would seem that to reverse a string entirely, the start index would need to be the very last index. You might come up with the following solution, which uses one less than the length of the string as the start index, 0 as the stop, and -1 as the step. This solution almost works: because the stop index is never included, the first character of the string (index 0) is left out of the result.

Take note that you can use variables in slice notation.

In [None]:
start = len(test_string) - 1
stop = 0
step = -1

test_string[start:stop:step]

## A trick to reverse a string
An easier, albeit unintuitive method exists to reverse a string. Use the slice notation **`::-1`**.

In [None]:
test_string[::-1]
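A quick check on a short made-up string compares an explicit stop of 0 with the `::-1` form. Because the stop index is excluded, a stop of 0 leaves out the first character, while an omitted stop goes all the way to the beginning.

```python
word = 'python'

# Stop of 0 is excluded, so the first character 'p' is missing
explicit_stop = word[len(word) - 1:0:-1]

# Omitting the stop reaches the very first character
omitted_stop = word[len(word) - 1::-1]

# The shortest form omits both start and stop
reversed_word = word[::-1]

print(explicit_stop)
print(omitted_stop)
print(reversed_word)
```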

## No index error with slicing
Surprisingly, no error happens when you use an index in slice notation that doesn't exist in the string. For instance, let's slice from index 100 to 140. An **empty** string is returned. There are no characters in an empty string and its length is 0.

In [None]:
test_string[100:140]

In [None]:
len(test_string[100:140])
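The same forgiving behavior applies when only part of the slice is out of range. In this sketch, with a made-up string, the slice simply stops at the end of the string.

```python
word = 'python'

# Stop is far past the end: the slice just runs to the last character
partly_out_of_range = word[3:100]

# Both start and stop are past the end: an empty string
fully_out_of_range = word[100:140]

print(partly_out_of_range)
print(len(fully_out_of_range))
```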

### Problem 6
<span style="color:green">Slice this string from index 5 to the end by every 4th element</span>

In [None]:
s = 'the astros will win the world series again in 2018'
# your code here

### Problem 7
<span style="color:green">Select every third element starting from the last character and ending with the first. </span>

In [None]:
s = 'the astros will win the world series again in 2018'
# your code here

### Problem 8
<span style="color:green">Use four chained methods on a string of your choice.</span>

In [None]:
# Enter in a string inside the quotes
your_string = ''
# your code here

# Changing the characters of a string
You might be surprised to find that once a string is created, nothing about it can change. Technically, strings are **immutable** and cannot be changed once created. Many objects in Python are **mutable** such as lists, which we will cover soon, but not strings.

For instance, if we try to change index 7 to **`z`** we will get the following error.

> `TypeError: 'str' object does not support item assignment`

In [None]:
test_string[7] = 'z'

In [None]:
test_string[7:20:-1] = 'z'
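Since a string cannot be changed in place, the usual workaround is to build a new string. The sketch below catches the error and then creates a modified copy with slicing and concatenation. The variable names are made up for illustration.

```python
word = 'hello'

# Attempting item assignment raises a TypeError
try:
    word[1] = 'a'
except TypeError as error:
    message = str(error)

# Build a new string from pieces of the old one instead
changed = word[:1] + 'a' + word[2:]

print(message)
print(changed)
print(word)  # the original string is untouched
```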

### Mutable and Immutable Objects
Python objects are either mutable or immutable. Mutable objects can have their values changed after creation. Immutable objects are those whose values cannot be changed after creation.

Strings, ints, floats, and booleans are types of objects that are immutable (unable to be changed after creation).

We will soon learn about mutable types like lists, dictionaries, and sets.

### Didn't we have some strings from above that were mutated?
In some of the above examples strings were concatenated together to form a new string but the original strings were never changed. Take a look at the following which concatenates two strings and prints out that value. The original strings are left unchanged.

In [None]:
# Concatenation does not mutate the underlying string
a = 'string 1 '
b = 'string 2'
print(a + b)
print(a)
print(b)

### Slicing does not mutate strings either
When you use slice notation on a string, you are technically creating an entirely new string. In the above examples, we never saved that new string to a variable. Below we will, and show how the original string remained unchanged.

In [None]:
substring = test_string[:5]
print(substring)
print(test_string)

### Multiline comments with strings
Instead of using the hash (**`#`**) at the start of every line, you can use a triple-quoted string that is not assigned to a variable to write a large block of text. Technically this is a string that Python evaluates and then discards rather than a true comment, but it is often used like one. When such a string is the first statement inside a function or method, it becomes that function's documentation, which is why it is called a **docstring**.

The example below uses triple quotes to write a long multi-line comment. This string's sole purpose is as a comment for the developers. Other Python commands follow the comment.

In [None]:
"""
This area can be
used as a multiline comment since
the normal comment character # does not allow for this.

This multiline comment is especially important when writing docstrings for functions
"""
foo = "executing a string assignment"
foo[5:10]
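As a sketch of a docstring in its usual place, the hypothetical function `shout` below has a triple-quoted string as its first statement. Python stores that string in the function's `__doc__` attribute, and it is the text that `help` displays.

```python
def shout(text):
    """Return text in upper case followed by an exclamation point."""
    return text.upper() + '!'

result = shout('hello')
documentation = shout.__doc__

print(result)
print(documentation)
```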

### Simple test whether a string contains a substring
There is a simple test to determine whether or not one string is a substring of another. Place the **`in`** operator between two strings as follows:

In [None]:
s = "executing a string assignment"

In [None]:
'cut' in s

In [None]:
'z' in s

### Reversing the condition with `not`
You can also test whether a string is not a substring with the **`not in`** operator.

In [None]:
'z' not in s

### Alternatively, use the `find` method
The find method returns the value -1 if the passed string is not a substring, otherwise it gives the index of the first character where the substring was found.

In [None]:
# the find method gives you the position of the substring if found
s.find('cut')

In [None]:
s.find('z')
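Strings also have an **`index`** method that behaves like **`find`** when the substring exists but raises a `ValueError` when it does not. A small sketch of the difference; the variable names are made up for illustration.

```python
s = "executing a string assignment"

found_position = s.find('cut')    # position where 'cut' starts
same_position = s.index('cut')    # index agrees when the substring exists
missing_position = s.find('z')    # find signals a miss with -1

# index raises an error instead of returning -1
try:
    s.index('z')
    index_raised = False
except ValueError:
    index_raised = True

print(found_position, same_position, missing_position, index_raised)
```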

## Congrats on finishing notebook 2!
Move on to notebook 3! The pre-course is mandatory so make sure you finish it all!