In this notebook, I will introduce common string manipulation practices.

## Basics

A string is text surrounded by a pair of quotes.

You can use either single quotes '' or double quotes ""

In [1]:
"hello, world"

'hello, world'

In [2]:
'hello, world'

'hello, world'

You can even use triple quotes ''' ''', which allow you to enter a string that spans multiple lines, including blank lines.

In [4]:
paragraph = '''This is a long string
with multiple lines

and even some blank lines. It's all ok here.
'''
print(paragraph)

This is a long string
with multiple lines

and even some blank lines. It's all ok here.



In [3]:
a = "hello, world" # save string to a variable

In [4]:
len(a) # get the length of the string, including whitespaces

12

In [5]:
a[0] # you can access characters using "index", which starts from 0 instead of 1

'h'

Note: since we have 12 characters, the largest index is 11, not 12, because indexing starts from 0.

In [6]:
a[12] # Error

IndexError: string index out of range

In [None]:
a[11] # Correct

Sometimes, you do not know the length of a string yourself, but you want to get the last character. Here is how you do it:

In [7]:
a[len(a) - 1] # last index = length - 1

'd'

In [8]:
a[-1] # or simply use a negative index, which counts from the end (for more, check out the "lists" notebook)

'd'
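Negative indexes count backwards from the end of the string, so `-1` is the last character and `-len(word)` is the first. A minimal sketch (the variable name `word` is only for illustration):

```python
word = "hello, world"

last = word[-1]             # same as word[len(word) - 1]
second_last = word[-2]      # one step further from the end
first = word[-len(word)]    # the most negative valid index is -len(word)

print(last, second_last, first)
```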

## Common Usage

You can concatenate strings using +

In [9]:
'hello ' + 'world' + '!'

'hello world!'

Note, here is a **common mistake** people make. When you concatenate two words, make sure there is a whitespace somewhere between them. Otherwise, they look like one word.

In [10]:
'hello' + 'world' + '!' # Don't forget the white space, otherwise it looks like this

'helloworld!'
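One way to avoid the missing space is to make the separator explicit. The sketch below shows two options; `first_word` and `second_word` are illustrative names, and `join` is covered in the Splitting section of this notebook.

```python
first_word = "hello"
second_word = "world"

# Put the space between the pieces yourself
with_plus = first_word + " " + second_word + "!"

# Or let join insert the space between the items of a list
with_join = " ".join([first_word, second_word]) + "!"

print(with_plus)
print(with_join)
```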

You can slice the string to get a **substring** of it the same way you do for a list.

If you don't know how slicing works, check "lists" notebook

In [11]:
'hello, world!'[7:12]

'world'

In [12]:
'hello, world!'[:5]

'hello'

You can also **reverse** a string like so

In [13]:
'hello, world!'[::-1]

'!dlrow ,olleh'

You can repeat a string by using the multiplication sign *

In [14]:
'hello, world!' * 3

'hello, world!hello, world!hello, world!'

As you can see, it is easy to forget the whitespace at the end. This is a common mistake.

In [15]:
'hello, world! ' * 3

'hello, world! hello, world! hello, world! '

You see. Now it looks good
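The repeated string above still ends with a trailing space. If that matters, one possible approach is to repeat a list and join it, or to strip the right end afterwards. This is only a sketch; `phrase` is an illustrative name.

```python
phrase = "hello, world!"

# Three copies in a list, joined by a single space (no trailing space)
joined = " ".join([phrase] * 3)

# Alternatively, repeat with a space and remove the trailing one
stripped = ((phrase + " ") * 3).rstrip()

print(joined)
print(joined == stripped)
```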

### Useful Tricks

#### Capitalization

In [16]:
a = "Hello, World!"

In [17]:
a.lower() # get all lower case

'hello, world!'

In [18]:
a.upper() # get all upper case

'HELLO, WORLD!'

In [19]:
'heLLo, WoRld!'.capitalize() # capitalize first character only, change others to lower case

'Hello, world!'

In [20]:
"hi".islower() # checks if all characters are lower case

True

In [21]:
"hi".isupper() # checks if all characters are upper case

False

In [22]:
"hI".isupper() # Note, ALL characters have to be upper case to return true

False

In [23]:
a.swapcase() # swaps lower and upper case

'hELLO, wORLD!'

#### Stripping

In [24]:
a = "    hello, world!!     "
a.strip() # removes whitespace (spaces, tabs, and newlines such as \n) from both ends of the string

'hello, world!!'
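If you only want to trim one side, `lstrip` and `rstrip` work the same way, and `strip` also accepts the characters to remove. A short sketch, with `padded` as an illustrative variable:

```python
padded = "    hello, world!!  \n"

both = padded.strip()     # trim both ends
left = padded.lstrip()    # trim the left end only
right = padded.rstrip()   # trim the right end only

print(repr(both))
print(repr(left))
print(repr(right))

# strip() can also remove specific characters from both ends
trimmed = "!!hello!!".strip("!")
print(trimmed)
```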

#### Splitting

In [25]:
a = "hello, 13 + 4 = 5, is right"
b = ["he", "l", "lo", "!"]

You can split a string into a list of strings using 'separators'.

This is useful when a string has a common pattern. 

For example, if you read a CSV file, values between different columns are often separated using ",". It will be something like:

```
1, 2
3, 4
```

More usage of this trick will be in a separate notebook that discusses file reading

In [26]:
a.split(",") # this splits a around ',' character as separator

['hello', ' 13 + 4 = 5', ' is right']

In [27]:
a.split() # if you put nothing, it splits on whitespace (spaces, tabs, newlines)

['hello,', '13', '+', '4', '=', '5,', 'is', 'right']
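To connect this with the CSV example above, here is a rough sketch that splits the two-line text into rows and cells. It assumes the text is already in a string called `text` (an illustrative name). For real CSV files, Python's built-in `csv` module is usually the safer choice.

```python
text = "1, 2\n3, 4"

rows = []
for line in text.splitlines():              # one string per line
    cells = []
    for cell in line.split(","):            # split each line around ','
        cells.append(cell.strip())          # trim the space after the comma
    rows.append(cells)

print(rows)
```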

You can also join all strings in a list into one string using the `join` method as follows:

`<separator string>.join(<list>)`

In [28]:
''.join(b) # this combines every string in b with nothing in between them

'hello!'

In [29]:
', '.join(b) # this combines every string in b with ', ' in between them

'he, l, lo, !'

In [30]:
'-'.join(b) # this combines every string in b with '-' in between them

'he-l-lo-!'

You get the idea.
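One thing to watch for: `join` only works when every item is a string. If the list holds numbers, convert each item with `str` first. A small sketch, with `numbers` as an illustrative list:

```python
numbers = [1, 2, 3]

# ", ".join(numbers) would raise a TypeError because the items are ints
joined = ", ".join(str(n) for n in numbers)

print(joined)
```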

#### Casting
Also, you cannot do calculations on strings, even if their content is digits. The computer doesn't know the characters are meant as a number, so + only does string concatenation rather than addition.

In [31]:
a = "hello"
b = "20"
c = "40.5"

In [32]:
b + b

'2020'

If you need to do arithmetic, cast the strings into numbers first

In [33]:
int(b) + int(b) # cast them into integers

40

In [34]:
int(a) # casting to int doesn't work, if it is non-int. variable a is a string with non-digits

ValueError: invalid literal for int() with base 10: 'hello'

In [35]:
int(c) # casting to int doesn't work if it is non-int. variable c is a floating point number written as a string

ValueError: invalid literal for int() with base 10: '40.5'

In [36]:
float(c) # use float to cast it to float

40.5
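When the string comes from a user or a file, you often do not know in advance whether the cast will succeed. One possible pattern is to catch the `ValueError`. The helper `to_number` below is hypothetical and only a sketch:

```python
def to_number(text):
    """Return an int if possible, else a float, else None (hypothetical helper)."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return None

print(to_number("20"), to_number("40.5"), to_number("hello"))

# To get an int out of "40.5", cast to float first; int() then drops the fraction
whole = int(float("40.5"))
print(whole)
```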

#### Finding characters

In [37]:
a = "Hello, World!" # reassign a to a longer string for the examples below
a.find('l') # return index of first occurrence of character 'l'

2

In [38]:
a.find('l', 4, 13)

10

This returns the index of the first occurrence of character 'l' between index 4 and 12 (note 13 is excluded)

In [39]:
a.find('l', 4, 10)

-1

Note, because 10 is excluded, it tries to find 'l' between index 4 and 9, so it cannot find it and returns -1

In [40]:
a.find('World') # Note, it can also find substrings

7

In [41]:
a.find('world') # But, finding substrings is case sensitive

-1

So, if you are unsure about the capitalization situation within a string (for example, it is a string that user inputs), and you want to find if it contains a substring regardless of the capitalization, you can do something like 

In [42]:
a = "HilajalkfjohIOHgoisjgla"
b = "ioH"
a.lower().find(b.lower())

12

What the above code does is to convert both string `a` and `b` to lower case, and then use 'find', so we don't have to worry about the case
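If you only need to know whether the substring is present, and not where, the `in` operator gives a plain True or False. A sketch reusing the same strings (the names `text` and `needle` are illustrative):

```python
text = "HilajalkfjohIOHgoisjgla"
needle = "ioH"

found = needle.lower() in text.lower()          # membership test, ignoring case
position = text.lower().find(needle.lower())    # index of the match, or -1

print(found, position)
```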

In [43]:
print("Here are other useful string methods you can use")
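", ".join(name for name in dir(str) if not name.startswith("_")) # public methods only; the exact list depends on your Python version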

Here are other useful string methods you can use


'capitalize, casefold, center, count, encode, endswith, expandtabs, find, format, format_map, index, isalnum, isalpha, isdecimal, isdigit, isidentifier, islower, isnumeric, isprintable, isspace, istitle, isupper, join, ljust, lower, lstrip, maketrans, partition, replace, rfind, rindex, rjust, rpartition, rsplit, rstrip, split, splitlines, startswith, strip, swapcase, title, translate, upper, zfill'

## Intermediate
### Escape Characters

Now, look at this one. Why is this a syntax error? 

In [44]:
"I said "hello""

SyntaxError: invalid syntax (<ipython-input-44-65c92240459c>, line 1)

The reason is that you cannot put double quotes directly within a pair of double quotes, or single quotes within a pair of single quotes. Python ends the string as soon as it sees the next matching quote, so here it reads "I said " as a complete string and doesn't know how to interpret the hello that follows. The above example may be small, but in a long sentence this is easy to miss.

**However**, there is a way around it. If you only need one more pair of quotes within quotes, use the other kind of quotes.

For example, you can use single quotes in double quotes like so

In [45]:
"I said 'Hello, World'!"

"I said 'Hello, World'!"

Or, you can use double quotes in single quotes like so

In [46]:
'I said "Hello, World"!'

'I said "Hello, World"!'

But, sometimes you want quotes within quotes within quotes, like:

" He said, "Michael said"Hi"" "

You do so by adding a backslash \ in front of every quote inside the string. The backslash "escapes" the character after it, so Python treats that quote as part of the text instead of the end of the string.

In [47]:
"I said \"Hello World! I can put more \"\" quotes now\""

'I said "Hello World! I can put more "" quotes now"'

Note, I added backslash \ in front of every " within the quotes.

There are more characters like "" that cannot be written directly in a string. Written with a backslash in front, they are called ["escape sequences"](https://docs.python.org/2.0/ref/strings.html). You usually don't need to know all of them, except single quotes '', double quotes "", and backslash itself \
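Here is a short sketch of a few escape sequences you are likely to meet. The file path is made up for illustration only.

```python
print("first line\nsecond line")    # \n starts a new line
print("name\tvalue")                # \t inserts a tab
print('It\'s fine')                 # \' is a single quote inside single quotes

escaped_path = "C:\\temp\\notes.txt"   # \\ is one literal backslash
raw_path = r"C:\temp\notes.txt"        # a raw string (prefix r) leaves backslashes alone

print(escaped_path)
print(escaped_path == raw_path)
```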

### Pointers

Essentially, a string is a sequence of characters, so it shares **most** qualities of lists (but not all). Variables refer to strings through what we call 'pointers' (Python's documentation calls them references).

Thus, to understand the pitfalls and usages you usually encounter with strings, you need to understand the concept of pointers first.

It is fine if you don't know what pointers are. Here is the high level idea:

![variables_pointers](./imgs/variables_pointers.jpg)

Note, when you assign a string (say `hi there`) to a variable (say `a`), the string itself is saved elsewhere in the computer's memory, and the variable (`a`) is essentially a pointer to that string. So, when you say `b = a`, you only create another pointer to the **same** string, not a pointer to the variable `a`. When you later assign `a` to a new string, `b` won't be affected.

It is fine if this is a lot to grasp right now. Here is a walkthrough

In [48]:
a = "hi there" # Created a string that variable a points to
print(a)

hi there


In [49]:
b = a # variable b points to same string as a now
print(b)

hi there


We can check if two variables have the same content using "=="

In [50]:
a == b  # check content

True

We can also check if two variables *point* to the same thing using "is"

In [51]:
a is b # check pointer

True

If you assign a new string to a variable, only that variable's pointer changes and points to the new string

In [52]:
a = 'not here' # we can change variable a to point to a new string
print(a)

not here


In [53]:
print(b) # but b, which points to the original string, won't get affected

hi there


In [54]:
a is b # not same pointer anymore

False

Note: after assigning a to a new string, the pointers no longer point to the same thing, so "is" pointer comparison returns false
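A word of caution about "is": two strings can have the same content and still be different objects, so use "==" when you care about the text. The sketch below builds an equal string at run time. Whether "is" returns True for equal strings depends on how the interpreter caches strings, so treat the last line as an observation rather than a rule.

```python
a = "hi there"
b = a                              # second pointer to the same string
c = "".join(["hi", " there"])      # same characters, built at run time

print(a == b, a is b)   # same content and same object
print(a == c)           # same content
print(a is c)           # identity check; typically False in CPython
```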

**Why is this important?** You may wonder why I wrote all this to explain the pointer difference.

Check out this example and you will know why

In [55]:
name = "Hello, World"
def add_punctuation(string):
    string = string + "!!!!"
    print("string with punctuation: ", string)

In [56]:
print("before function:", name) # before applying function

before function: Hello, World


In [57]:
add_punctuation(name)

string with punctuation:  Hello, World!!!!


In [58]:
print("after function:", name)

after function: Hello, World


Note: even though the function successfully added "!!!!" to the string, the `name` variable didn't get changed. This is because when we call `add_punctuation(name)`, `string` is a *new* pointer to the same string, and pointing it at a new string within the function does not affect the string `name` is pointing to. If you don't quite get it, that is fine. For a more detailed explanation, check out the "functions" notebook
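If you do want `name` to change, one common approach is to return the new string from the function and assign it back. The function `with_punctuation` below is a hypothetical variant of `add_punctuation`, shown only as a sketch:

```python
def with_punctuation(string):
    """Return a new string with '!!!!' appended (hypothetical variant)."""
    return string + "!!!!"

name = "Hello, World"
name = with_punctuation(name)   # point name at the returned string
print("after function:", name)
```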

**Moreover**, always remember that strings are **immutable** in Python, while lists are mutable. (Again, if you do not know lists yet, check the 'lists' notebook)

In [59]:
a = ['h', 'i']
print(a)
a[1] = 'e' # Lists are mutable
print(a)

['h', 'i']
['h', 'e']


In [60]:
a = 'hi'
print(a)
a[1] = 'e' # Error: Strings are immutable

hi


TypeError: 'str' object does not support item assignment

So, what can we do if we want to change some parts of the string? There are many alternatives

Recall, since strings are immutable, the operations described at the top of this notebook (like reversing a string, getting a substring, or concatenating strings) only return a new string, and the original string will not be affected

In [61]:
a = "hello, world"

b = a[::-1]
c = a[2:10]

print("String Manipulation returns new strings:", b, c)
print("But original string is not affected:",a)

String Manipulation returns new strings: dlrow ,olleh llo, wor
But original string is not affected: hello, world


So, if you want to do a string manipulation and keep the result, save the new string back to the original pointer

In [62]:
a = "hello world"
print(a)
a = a[::-1] # Manipulate and save it back to the pointer
print(a)

hello world
dlrow olleh
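The same idea applies when you want to change just one part of a string, as in the failed `a[1] = 'e'` attempt above. Here is a sketch of three possible ways to build the changed string; which one fits best depends on the situation.

```python
a = "hi"

# 1. Slice around the position and concatenate the new character
by_slicing = a[:1] + "e" + a[2:]

# 2. Replace a substring (replaces every occurrence unless you pass a count)
by_replace = a.replace("i", "e")

# 3. Convert to a list (mutable), edit it, and join it back into a string
chars = list(a)
chars[1] = "e"
by_list = "".join(chars)

print(by_slicing, by_replace, by_list)
print(a)   # the original string is unchanged
```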


### More

There is a lot more to strings, but these are the essentials you will find useful. If you want to know more about what Python can do with strings, Google them!