**Learning Python -- The Programming Language for Artificial Intelligence and Data Science**

**Lecture 3: Strings and Text Input/Output**

**By Allen Y. Yang, PhD**

**(c) Copyright Intelligent Racing Inc., 2021-2024. All rights reserved. Materials may NOT be distributed or used for any commercial purposes.**


# Keywords

* **str**: The built-in name of the string data type in Python.
* **Indexing**: Indexing refers to finding the address and subsequently the stored value within a sequential data type such as a string of characters. Python starts the first index in a sequential data type as zero.
* **Slicing**: Extract a subset of sequential data by their indices.
* **Type casting**: Converting the representation of stored data value from one type to another. Casting may change the data accuracy and utility as determined by the implementation of the casting algorithm.

# Defining Strings

In the previous lecture, we saw that Python can output information about computation results using print(). Before the invention of the graphical user interface (GUI), computer interfaces were predominantly text-based. Windows, macOS, and Linux systems still include a terminal application, which you can find by searching for the keyword "terminal". It allows users to manage the computer in this legacy mode, called the console or terminal interface.

As one of the most widely used programming languages in Data Science, Python must handle data encoded in text format well. In this lecture, we will discuss the basic operations for using text as a Python program's input or output, collectively known as I/O functions.

In [1]:
a_string = "Hello World!"
print(type(a_string))

b_string = 'Hello World!'
print(a_string)
print(a_string == b_string)

c_string = '''Hello
World!'''
print(c_string)
d_string = """Hello
World!"""
print(c_string==d_string)

<class 'str'>
Hello World!
True
Hello
World!
True


First, we see how text can be defined in Python. The type of data responsible for storing text is called **string**. In the first example above, we define a string variable *a_string* to be the text "Hello World!". Then we check its type: it is an object of the class 'str', short for string.

We can also print a string simply by using the same *print()* function.

For the a_string variable, its text data are quoted using a pair of double quotation marks. What is contained between the double quotation marks defines the string. For the next variable, b_string, the text data are quoted using a pair of single quotation marks. In Python, creating string variables using a pair of either single or double quotation marks is equivalent.

Next, we see the "==" operator, which was previously used to compare the equality of int or float values. It can also be used to test whether two strings are identical. In the above example, a_string and b_string contain the same text, so the comparison result is True.

Finally, Python has an additional text format that is fairly distinctive. If we use a pair of triple quotation marks, again regardless of whether they are single or double quotations, it defines a string that can contain multiple lines. Please note that a string written with a pair of single or double quotation marks must not span more than one line.

By now, you may have wondered about the special characters that Python takes over to denote strings, namely, the single and double quotation marks: they themselves can be part of a valid text. How can we include these special characters as data rather than as Python symbols? Let us see the following examples first:

In [None]:
e_string = "World's best coffee"
f_string = 'World"s best coffee'
print(e_string == f_string)

g_string = 'World\'s best coffee'
print(g_string)
print(e_string == g_string)

The above code block demonstrates that, if a string is created using a pair of double quotation marks, then Python will treat single quotation marks as regular text data in the string. Vice versa, if a string is created using single quotation marks, then double quotation marks in the string will be treated as regular text.

The second half of the code block demonstrates yet another way to declare that Python should treat special characters as regular text data, that is to use another special character specifically created for this purpose, namely, the backslash mark "\". In the definition of g_string, the sequence \\' is treated as just a single quotation mark in the text.

The special character \\ not only can declare regular quotation marks, it can also be used to define other special characters. Let's see a few more examples below:

In [3]:
h_string = "This character \\ is special"
print(h_string)

i_string = "Hello World!\b" 
print(i_string)

j_string = "Line 1 \nLine 2"
print(j_string)

This character \ is special
Hello World
Line 1 
Line 2


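In the above three examples, a double backslash sequence (two backslashes in a row) is treated in text as a single backslash character. The sequence *\b* is the backspace character; in the output shown above, it has the effect of removing the one character that immediately precedes it, although the exact effect depends on where the text is displayed. The sequence *\n* represents a newline character, which displays what comes after it on a new line.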
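As a small sketch of the escape sequences above (the variable names are illustrative and not part of the lecture code), checking the length of each string suggests that an escape sequence is stored as one character, even though it is typed as two:

```python
# Each escape sequence is written with two characters in the source code,
# but it is stored as a single character in the string.
backslash = "\\"
newline = "\n"
quote = '\''

print(len(backslash), len(newline), len(quote))  # prints: 1 1 1

# A triple-quoted string stores its line break as the same "\n" character.
multi_line = """Hello
World!"""
single_line = "Hello\nWorld!"
print(multi_line == single_line)  # prints: True
```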

# Addressing String Elements: Indexing and Slicing

String data are stored in computer memory as a character array, which means an ordered sequence of characters. Python can address and retrieve individual characters in a string using the notation of square brackets. This operation is also known as **indexing**. Some examples are shown below:

In [None]:
a = "Hello World!"
print(a[0])
print(a[1:2])
print(a[0:6:2])
print(a[-1])
print(a[-5:-2])

Let us dissect the above examples. A string variable such as *a* represents the entire string, but when addressing string elements we can also think of it as marking the memory address of the first element. Any additional offset from the beginning of the string is then given within a pair of square brackets.
1. *a[0]* retrieves from the first string character plus zero offset, which represents the first character "H".
2. *a[1:2]* uses a colon to retrieve a segment of the string. The left argument is called **begin**, and the right argument is called **end**. Readers familiar with other languages should note a rule that is somewhat special to Python: the segment retrieved by [begin: end] includes the character at a[begin] (if valid) but excludes the character at a[end]. The printout of the second example therefore starts from a[1] = "e" and excludes a[2] = "l".
3. If the brackets contain three numbers separated by two colons, then the third number is called the **step size**. Hence, the retrieved characters starting at a[0], stopping before a[6], and with step size 2 are a[0]="H", a[2]="l", and a[4]="o".
4. The string character addresses can also be counted backwards from the last character. In Python, the last character of *a* is denoted as *a[-1]*.
5. Similarly, the same rule to retrieve a segment of a string applies when the offsets are given in negative numbers. In this example, the segment starts from a[-5]="o" and stops before a[-2] = "d".

Finally, the operations that use the format *string[begin:end:step]* are called **slicing**.
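The slicing rules above can be sketched a little further. The block below is an illustration under two assumptions: it uses a *for* loop and *range()*, which are covered later in the course, and it only considers non-negative offsets that lie inside the string. It also shows that *begin* or *end* may be left out.

```python
a = "Hello World!"

# Omitting begin starts at the first character; omitting end runs to the last.
first_word = a[:5]      # "Hello"
rest = a[6:]            # "World!"
every_other = a[::2]    # "HloWrd"

# For offsets inside the string, [begin:end:step] visits the same
# positions as range(begin, end, step): begin is included, end is not.
collected = ""
for offset in range(0, 6, 2):
    collected = collected + a[offset]

print(first_word, rest, every_other)
print(collected == a[0:6:2])  # prints: True
```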

# String Functions

Next, we see some examples of Python operators and functions that take strings as input arguments:

In [None]:
first = "John"
last = "Smith"
space = " "
new_string = first + space + last
print(new_string)

print(len(first))
print(ord(space))
print(chr(ord(space)) == space)

In the above code block, three strings are first created. Then the "+" operator applies to string type variables to concatenate the three strings into one new string.

The function *len()* returns the length of the string, namely, the number of characters in the string.

The next two statements convert a string character into an integer number, and then convert the integer number back into a string character. In computer memory, string characters are also stored as numbers. The correspondence between characters and numbers is called a codebook. The function *ord()* returns the code of a single character based on a particular codebook called Unicode.

Then the *chr()* function goes the opposite way, namely, generating a character based on a Unicode number input. In the example, we see that the Unicode for the space character is 32.
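As a minimal sketch of the round trip between characters and codes (the variable names are illustrative, and the list syntax used for *codes* is an assumption that goes slightly beyond what this lecture has introduced):

```python
letter = "A"
code = ord(letter)             # Unicode number of "A"
next_letter = chr(code + 1)    # the character whose code is one larger
print(letter, code, next_letter)  # prints: A 65 B

# ord() expects exactly one character, so a longer string
# is converted one character at a time.
word = "John"
codes = [ord(character) for character in word]
print(codes)  # prints: [74, 111, 104, 110]
```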

There are more useful functions that are relevant to string type variables. We will introduce them later in other examples in the course.

# Algorithm: Reverse a String

Consider the following problem: given an input string, find an output string whose characters are in the exact reverse order of the input string. This can be done using the slicing operator.

In [1]:
input_string = "Python"

partial_string = input_string[-1:0:-1]
print(partial_string) # Not exactly correct

sliced_string = input_string[::-1]
print(sliced_string)

nohty
nohtyP


We can see from the above code block that the first slicing operation, which retrieves the string characters backwards from the last one toward the first one with step size -1, does not exactly achieve the goal. Looking at the slicing code, there are two key points to remember:
1. The last character is referenced using the negative offset value of -1. If we count the offset positively, this value is *len(input_string)-1*. Because string character offsets start from zero, the length of the string itself points to an offset that is one character beyond the range of the string.
2. However, since Python dictates that slicing stops before the ending offset, using the smallest non-negative value, zero, as the end cannot retrieve the first character "P" in reverse order. Note that -1 cannot serve as the end either because, as we pointed out above, -1 refers to the last character of the string.

So the proper way to reverse a string using the slicing operation is to keep the -1 step size but leave out the exact values for *begin* and *end*. In the second algorithm, neither the *begin* nor the *end* index is provided. In such a case, Python automatically retrieves the longest possible string result. Hence, the algorithm uses step size -1 to traverse the input string in reverse order, and the result shows the intended output.
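For comparison, here is a hedged sketch of two other ways to obtain the same reversed string. Neither is part of the lecture's algorithm: the first uses a *for* loop with *range()*, and the second uses the built-in *reversed()* together with the string method *join()*. Both are introduced here only to check the slicing result.

```python
input_string = "Python"

# Build the reversed string one character at a time,
# walking the offsets from len-1 down to 0.
reversed_by_loop = ""
for offset in range(len(input_string) - 1, -1, -1):
    reversed_by_loop = reversed_by_loop + input_string[offset]

# reversed() yields the characters backwards; join() glues them together.
reversed_by_join = "".join(reversed(input_string))

print(reversed_by_loop)  # prints: nohtyP
print(reversed_by_loop == input_string[::-1] == reversed_by_join)  # prints: True
```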

# Using String in Text I/O

We have used the function *print()* to output text strings. It is worth noting that *print()* is a rather special function in Python in that it can receive no argument, one argument, or many arguments. Let us first see an example below:

In [1]:
print('------')
print()
print('------')
first = "John"
last = "Smith"
print(first, last)
print('------')
print(first, last, 2021, True)

------

------
John Smith
------
John Smith 2021 True


In this code block, three scenarios are tested and printed out, separated by dashed lines. In the first case, *print()* without any argument outputs an empty line in the text console. In the second case, *print()* prints out multiple input arguments sequentially, each automatically separated by a space character. In the third case, we see that the multiple input arguments can even be of different types. Our example includes the string type, int type, and boolean type.
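The automatic spacing can be sketched as follows. The first part approximates what *print()* does with several arguments by converting each one with *str()* and joining the pieces with a space. The second part uses the optional keyword arguments *sep* and *end* of *print()*, which are not used elsewhere in this lecture and are shown here only as an illustration.

```python
first = "John"
last = "Smith"

# print(first, last, 2021, True) behaves roughly like converting every
# argument with str() and joining the pieces with a single space.
line = " ".join([str(first), str(last), str(2021), str(True)])
print(line)                    # prints: John Smith 2021 True

# sep replaces the default separator (a space);
# end replaces the default line ending (a newline).
print(first, last, sep="-")    # prints: John-Smith
print(first, end=" ")          # stays on the same line
print(last)                    # completes the line: John Smith
```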

Next, let us talk about text input. From the console mode, Python can receive a user's text input, conveniently using the statement:

*input_string = input()*

The return result is of the string type. In coding user and computer interaction, it is strongly recommended that the program provides sufficient text cues before the *input()* function to explain what the program expects the user to input.
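Because *input()* always returns a string, digits typed by the user behave as text until they are converted. The sketch below uses two hypothetical variables, *typed_a* and *typed_b*, to stand in for values that *input()* would return, so that it can run without waiting for a user.

```python
# Hypothetical values standing in for what input() would return;
# input() hands back text even when the user types digits.
typed_a = "3"
typed_b = "4"

as_text = typed_a + typed_b                 # "+" concatenates the two strings
as_numbers = int(typed_a) + int(typed_b)    # int() casts each string first

print(as_text)      # prints: 34
print(as_numbers)   # prints: 7
```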

# Example: Test Your Math 

Below is our last example today. The program was presented in the last lecture as a math challenge. In this lecture, we will discuss more coding details in Python.

In [4]:
# import two Python modules
import math     # includes additional math functions
import random  # includes functions for generating random numbers

# Define constants
OPERATOR_FLOOR = 1
OPERATOR_CEIL = 2

random_operator = random.randint(1,2)   # Select an operator: 1 (floor) or 2 (ceil)
random_A = random.randint(-10,10)       # Select first value
random_B = random.randint(1,10)         # Select second value. Note denominator cannot be zero
if random_operator == OPERATOR_FLOOR:   # If selected operator is floor()
    result = math.floor(random_A/random_B)
    operator_string = "floor"
else:                                   # If selected operator is ceil()
    result = math.ceil(random_A/random_B)
    operator_string = "ceil"

# Prepare question string
# question_string = ( "Question: " + operator_string + "(" + str(random_A)
#                    + "/" + str(random_B) + ") = ? ")
question_string = "Question: {0}({1}/{2}) = ?".format(operator_string, str(random_A), str(random_B))

user_result = input(question_string)    # Wait for user input
user_result = int(user_result)          # Convert string to int
if user_result == result:               # The answer is correct
    print("Correct!")
else:                                   # The answer is wrong
    print("Incorrect!")

Question: floor(9/7) = ? 1


Correct!


Let us go over this complete Python program from the beginning. First, we take a close look at the use of comments in Python code. There are two common ways to write them: 1. Using a pair of triple quotation marks directly in the source code (instead of within a print statement). Everything between the pair, possibly covering multiple lines, is commonly used as a comment; strictly speaking, it is a string that Python evaluates and then discards. 2. Using the hash symbol "#" (called the pound sign in the US), which designates everything after it until the end of the same line as a comment. When encountering "#" comments, Python simply ignores them.
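The difference between the two styles can be sketched as below. This is an illustration under assumptions: *circle_area* is a hypothetical function, function definitions have not been covered yet in this course, and the value 3.14159 is only an approximation of pi. A triple-quoted string placed first inside a function is kept by Python as the function's documentation, whereas a "#" comment is not stored anywhere.

```python
def circle_area(radius):
    """Return the area of a circle.

    This triple-quoted text is a string. Placed first in a function,
    Python keeps it as the function's documentation.
    """
    # A hash comment like this one is ignored and never stored.
    return 3.14159 * radius * radius

# The stored documentation can be read back; the hash comment cannot.
print(circle_area.__doc__.splitlines()[0])  # prints: Return the area of a circle.
print(circle_area(1))
```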

The purpose of including comments is to increase readability, for the benefit of both the author and, equally importantly, other readers. If the author of a program did not leave sufficient comments about the logic of the code and the purpose of individual variables, other readers would have to guess at the author's logic, which compromises the reusability of the code.

Therefore, in modern software engineering, practitioners should pay equal attention to the logic of their code and to explaining it using sufficient comments. It is quite normal in commercial-quality source code for professional developers to reserve 1/3 to 1/2 of the space in their code for commenting and documentation purposes. In this course, we strongly recommend that our students start cultivating this practice.

In lines 2 and 3, the code imports two modules: math and random. Importing math makes the *floor()* and *ceil()* functions available. Importing random serves another purpose: we will use a random integer generator called *randint()* to generate random arithmetic challenges, so that the challenge can be different every time the program is executed.

In lines 9 to 11, three random integer numbers are generated. *random_operator* assumes the value of 1 or 2 as the output from the function *random.randint(1, 2)*; *random_A* is the random numerator between the values -10 and 10; *random_B* is the random denominator between the values 1 and 10.

In lines 12 to 17, we see for the first time the use of flow control statements: *if -- else --*. We will discuss flow control statements later in this course. Here, it can be simply understood as follows: if random_operator is equal to the constant representing the floor operator, the correct result is obtained by applying *math.floor()* to the fraction random_A/random_B; otherwise, the correct result is obtained by applying *math.ceil()*.
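To make the two operators concrete, here is a brief sketch with fixed values in place of the random ones. The specific fractions are chosen only for illustration; the negative case is worth checking because the numerator in the program can be negative.

```python
import math

# floor() rounds toward negative infinity; ceil() rounds toward positive infinity.
print(math.floor(9 / 7), math.ceil(9 / 7))      # prints: 1 2
print(math.floor(-7 / 2), math.ceil(-7 / 2))    # prints: -4 -3

# When the division is exact, both functions return the same integer.
print(math.floor(10 / 5), math.ceil(10 / 5))    # prints: 2 2
```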

Then in lines 20 to 22, two different ways are demonstrated to format the string *question_string*. The first way, commented out in lines 20 and 21, uses "+" to concatenate the operator string and the division expression. The second way, in line 22, uses curly brackets and the *.format()* string method to substitute each numbered location in the string with the corresponding input argument of *format()*.
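The two formatting approaches can be compared side by side with fixed values. The sketch below also adds a third option, the f-string (available in Python 3.6 and later), which is not used in the lecture's program and is included only as an assumption about what readers may encounter elsewhere. Note that *format()* accepts the int values directly, without *str()*.

```python
operator_string = "floor"
random_A = 9
random_B = 7

# 1. Concatenation with "+": every non-string piece needs str().
by_concatenation = ("Question: " + operator_string + "(" + str(random_A)
                    + "/" + str(random_B) + ") = ? ")

# 2. The .format() method: {0}, {1}, {2} are replaced by the arguments in order.
by_format = "Question: {0}({1}/{2}) = ? ".format(operator_string, random_A, random_B)

# 3. An f-string: expressions inside the curly brackets are substituted directly.
by_f_string = f"Question: {operator_string}({random_A}/{random_B}) = ? "

print(by_f_string)
print(by_concatenation == by_format == by_f_string)  # prints: True
```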

In line 24, the question is printed out to cue the user to answer in text using the function *input()*. Since *input()* always returns a string, the code then uses *int()* to type cast the string into an int. The reader can run this code and test what happens if the input text is not a valid integer. In such cases, Python will raise a ValueError.
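One possible way to handle invalid input is sketched below. It is not part of the lecture's program: *parse_answer* is a hypothetical helper, and the *try -- except --* statement it relies on has not been introduced yet in this course. The helper returns None instead of stopping the program when the text is not a valid integer.

```python
def parse_answer(text):
    """Convert text to an int, or return None if the text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for text such as "abc" or "1.5".
        return None

print(parse_answer("42"))     # prints: 42
print(parse_answer(" -3 "))   # prints: -3 (surrounding spaces are accepted)
print(parse_answer("1.5"))    # prints: None
print(parse_answer("abc"))    # prints: None
```

In the program above, the text passed to the helper would be the value returned by *input(question_string)*.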

# Summary

* Single quotation marks and double quotation marks can be used to create a text string in one line, so long as the quotation marks form a pair of the same type.
* A pair of triple quotation marks can create a text string that contains multiple lines.
* Use of backslash \ with the quotation marks inside a text string denotes the marks as regular characters rather than special characters.
* Some other backslash-defined characters include: \\, \b, \n
* A substring can be defined by the **slicing** operation, written with a pair of square brackets: string[begin: end: step]. The substring will not include the character at the *end* position.
* A negative value in *begin* or *end* counts the position backward from the end of the string. A negative *step* traverses the string in reverse order.
* Functions that act on strings: len(), ord(), chr().
* input() function returns a user string input from the terminal.

# Exercises

1. Please create a string variable called *phrase*, and assign the value "Hello World". Then please use the slicing method to separate the first word "Hello" into a variable called *subphrase_1*, and the second word "World" into another variable called *subphrase_2*. Please do not include the space between the two words in either subphrase.

2. Continue with the above program. Please write the code to remove the space character " " from the *phrase* variable, and print out the resulting value in the variable.

3. Create three string variables that describe yourself: first_name, last_name, date_of_birth. Then concatenate the three strings into a new string called ID_string, using the "+" operator.

4. Continue with the above program. Please extract from the string ID_string the following characters: the first letter of first_name, the first letter of last_name, and the first letter of date_of_birth. Assign the result into a new string variable called short_ID_string.

5. Debug:

In [21]:
phrase = 'Hello World'
subphrase_1 = phrase[0:5]
subphrase_2 = phrase[-5:]    # "World": the last five characters
no_space = phrase.replace(" ",'')
print(no_space)

HelloWorld


In [22]:
first_name = 'Zach'
last_name = 'Kravitz'
dob = '5/18/09'
ID_string = first_name + last_name + dob
short_ID_string = first_name[0] + last_name[0] + dob[0]
print(short_ID_string)

ZK5


In [2]:
a = int(input("Please input the first addend: "))
b = int(input("Please input the second addend: "))
print("The sum is ", a + b)

The sum is  8


6. Debug:

In [None]:
wrong_string = "Mike's story"

# Challenges

1. Based on the float value in math.pi, use the string type cast function str() to convert the float value into a string variable, called pi_string. Then keep the string type, move the decimal point character "." to the right by one position, and assign the result again back to pi_string. Finally, print out the resulting string. Hint: The result should look like "31.41592653589793"

In [31]:
import math
pi = str(math.pi)
pi = list(pi)
pi.pop(1)
pi.insert(2, ".")
pi = "".join(pi)
print(pi)

31.41592653589793
