# Lecture 6
  - Strings
  - Operations
  - Formatting
  - String Methods

## Unicode
- In the early days, characters were represented using ASCII
- ASCII is a 7-bit code; each character was normally stored in an 8-bit byte
- The values 0 - 127 were enough to represent the English language:
  - Upper and lower case letters
  - Digits
  - Punctuation
  - Non-printable characters (control characters)
- A letter maps to some bits which you can store on disk or in memory:
  -  A -> 0100 0001
- **Other languages started using codes from 128-255**
- Asian languages have thousands of characters
  - They needed more than one byte per character

In [2]:
bin(ord('A'))  # 0100 0001 = hex 0x41 (0100 -> 0x4, 0001 -> 0x1); the '0b' prefix marks a binary number

'0b1000001'

-  The **Unicode consortium** came up with a concept called the **code point**, where every character is assigned a number written as U+ followed by hex digits:
   - For e.g: Hello --> U+0048 U+0065 U+006C U+006C U+006F
   - English text rarely uses code points above U+00FF
   - Then the issue becomes: how will they be stored in memory?
     - Endian issues when each code point is stored in two bytes:
       00 48 vs 48 00
-  UTF-8 is another system for storing a string of Unicode code points (the U+ numbers) in memory using 8-bit bytes
-  In **UTF-8**, every **code point from 0-127** is stored in a **single byte**
-  Only **code points 128 and above** are stored using **2, 3, or 4 bytes** (the original design allowed up to 6, but the current standard stops at 4)
-  **English text** looks exactly the same in **UTF-8** as it did in **ASCII**
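
The byte counts described above can be checked with `str.encode`. This is a small sketch rather than part of the original notebook; the sample characters and the variable names are chosen only for illustration.

```python
# Code point -> UTF-8 bytes for characters from different ranges.
samples = ["A", "é", "猫", "😀"]
utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")          # str -> bytes
    utf8_lengths[ch] = len(encoded)
    print("U+{:04X} -> {} byte(s): {}".format(ord(ch), len(encoded), encoded.hex()))

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")

# The endian issue: the same code point U+0048 in two 16-bit byte orders.
big_endian = "H".encode("utf-16-be")      # bytes 00 48
little_endian = "H".encode("utf-16-le")   # bytes 48 00
print(same_as_ascii, big_endian.hex(), little_endian.hex())
```

UTF-8 avoids the byte-order question because it is defined as a sequence of single bytes.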




In [3]:
for i in "Hello":    # iterating through a sequence (string)
    codePoint = ord(i)
    print("{} --> {:3d} {} {}".format(i,codePoint,hex(codePoint), bin(codePoint)))

H -->  72 0x48 0b1001000
e --> 101 0x65 0b1100101
l --> 108 0x6c 0b1101100
l --> 108 0x6c 0b1101100
o --> 111 0x6f 0b1101111


#### Python Strings are Unicode by Default

In [4]:
print(u"hi 猫")    # u prefixed before a string denotes unicode (cat)

hi 猫


In [5]:
print("hi 猫")    # unicode by default, u prefix is not needed

hi 猫


In [6]:
print("∑ Ω µ")   # Greek letters and Greek-derived symbols

∑ Ω µ


## Strings
- Immutable
  - **Read only**, values can't be changed
  - They cannot be modified in place
  - **New strings** are formed by string operations
- String variables can be **reassigned** with new values


### Single, Double or Triple Quotes
-  Triple quotes are used when strings have line breaks
   -  For e.g.: you have a paragraph of text you want to assign to variable

In [7]:
# you can use single or double or triple quotes to represent strings
string1 = 'Hello World'  # single quote
string2 = "Hello World"  # double quote
string3 = '''Hello World''' # triple single quote

print ("String 1, 2, 3: ", string1, string2, string3)
print ("Type of String 1, 2, 3: ", type(string1), type(string2), type(string3))

String 1, 2, 3:  Hello World Hello World Hello World
Type of String 1, 2, 3:  <class 'str'> <class 'str'> <class 'str'>


In [8]:
# multi line string (or a comment)
para = '''This is line 1.
this is line 2.
This is line3.'''
print (para)

This is line 1.
this is line 2.
This is line3.


### Escape Characters
-  Give a special meaning to certain characters
-  The character is preceded by a backslash \
   -  '\a'  alarm
   -  '\t'  tab
   -  '\n'  newline
   -  '\r'  carriage return

In [9]:
print ('\a')  # you are supposed to hear a beep




In [10]:
print ("foo", "\t", "bar")  # prints a tab between foo and bar

foo 	 bar


In [11]:
print ("foo\n.")  # prints a new line and then dot
print ("bar\n.")

foo
.
bar
.


In [12]:
print ("foo\r.")  # prints foo, then carriage return; a terminal would overwrite f with the dot
print ("bar\r.")  # carriage return; a terminal would overwrite b with the dot

foo.
bar.


In [14]:
stringu = u'Hello World'  # default is unicode in python3
stringr = r'Hello\tWorld\n'  # raw string where escape \ does not mean anything
print ("String u, r: ", stringu, stringr)
print ("Type of String u, r: ", type(stringu), type(stringr))

String u, r:  Hello World Hello\tWorld\n
Type of String u, r:  <class 'str'> <class 'str'>


In [15]:
print ("foo\tbar")   # tab character

foo	bar


In [16]:
print (r"foo\tbar")  # raw string

foo\tbar


In [17]:
print ("foo\\bar")  # escape the backslash

foo\bar


In [18]:
print ("foo\\\\bar")  # two escaped backslashes

foo\\bar


#### Raw Strings - Strings prefixed with r or R

In [19]:
r = r"\nh\ni\n"  # suppresses the special meaning of the backslash
print(r)

\nh\ni\n
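
One way to see what a raw string changes is to compare lengths. The sketch below is an added illustration; the Windows-style path is a made-up example, not a file from the lecture.

```python
cooked = "a\tb"     # escape processed: 'a', TAB, 'b'
raw = r"a\tb"       # raw: 'a', backslash, 't', 'b'
print(len(cooked), len(raw))

# Raw strings are handy for text that contains many backslashes.
windows_path = r"C:\new\table.txt"   # hypothetical path; \n and \t stay literal
print(windows_path)
```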


### String Variable Reinitialized
-  **New string object is being formed**
-  **String ids are different**
![Reinitialized](images/Lecture-6.002.png)

In [20]:
foo='Foo'
print("foo:", foo, "id:", id(foo))
foo ='Bar'
print("foo:", foo, "id:", id(foo))

foo: Foo id: 84872472
foo: Bar id: 85061904


## Operations

### Concatenation +
-  str1 + str2
-  str1 + str2 + str3
![Concatenation](images/Lecture-6.003.png)

In [21]:
fooStr = 'Foo'
barStr = 'Bar'
cat = fooStr + barStr
print(cat)

FooBar


In [22]:
# ids are all different because these are three different string objects
print('id fooStr {}, id barStr {}, id cat {}'.format(id(fooStr), id(barStr), id(cat)))

id fooStr 85062128, id barStr 85061904, id cat 85063360


In [24]:
cat = "Foo" "Bar"   # no plus, strings by themselves, still concatenates
cat

'FooBar'

In [25]:
cat = "Foo" "Bar" 'Baz'
cat

'FooBarBaz'

### Repetition *
-  str1*3
-  3*str1

![Repetition](images/Lecture-6.004.png)

In [26]:
print("fooStr:", fooStr)
rep = fooStr*3
rep2 = 3*fooStr
print (rep, rep2)

fooStr: Foo
FooFooFoo FooFooFoo


In [27]:
id(rep) == id(rep2)  # different string objects

False

In [32]:
rep3 = fooStr * -5  # what happens when multiplied by a negative number?
rep3 # a count of zero or less is valid and gives an empty string

''

### Index [ ]
![Index](images/Lecture-6.005.png)

In [33]:
helloWorld = 'Hello Wo'
print ("Length:", len(helloWorld))
print ("Index 0: ", helloWorld[0], "Index 3:", helloWorld[3])

Length: 8
Index 0:  H Index 3: l


In [34]:
for index, value in enumerate(helloWorld): # enumerate gives index and value of sequence
    print(" Index: {} Value: {}".format(index,value))

 Index: 0 Value: H
 Index: 1 Value: e
 Index: 2 Value: l
 Index: 3 Value: l
 Index: 4 Value: o
 Index: 5 Value:  
 Index: 6 Value: W
 Index: 7 Value: o


In [35]:
helloWorld[8]   # Expect IndexError, there is no element at index 8 (valid indexes are 0 to 7)

IndexError: string index out of range

![Index](images/Lecture-6.006.png)

In [36]:
print ("Index -1: ", helloWorld[-1], "Index -8:", helloWorld[-8])

Index -1:  o Index -8: H


### Slice [start:upto:skip] (the slicing operator)
![Slice](images/Lecture-6.007.png)

The picture above is wrong: helloWorld[1:4] starts at index [1] and stops just before index [4].

In [37]:
print ("helloWorld[1:4]", helloWorld[1:4])

helloWorld[1:4] ell


![Slice](images/Lecture-6.008.png)

In [38]:
print ("helloWorld[0:7:2]", helloWorld[0:7:2]) # stride or skip 2

helloWorld[0:7:2] HloW


![Slice](images/Lecture-6.009.png)

In [41]:
print ("helloWorld[::-1]", helloWorld[::-1]) # reverse slicing: empty start and stop take their defaults, step -1 walks from the right; also works on a literal such as "Hello Wo"

helloWorld[::-1] oW olleH
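
The defaults mentioned above can be tried one at a time. This is an added sketch using the same text as `helloWorld`; the variable names are illustrative.

```python
hello_world = "Hello Wo"

first_word = hello_world[:5]      # start defaults to 0
rest = hello_world[6:]            # stop defaults to len(hello_world)
every_other = hello_world[::2]    # step of 2 over the whole string
clipped = hello_world[2:100]      # an out-of-range stop is clipped, no IndexError
empty = hello_world[100:]         # a start past the end gives ''

print(first_word, rest, every_other, clipped, repr(empty))
```

Unlike indexing, slicing never raises IndexError for positions outside the string.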


### Immutable, Can't Change String

In [42]:
helloWorld[0]  # read character at 0th index

'H'

In [43]:
helloWorld[0] = 'h'  # Expect TypeError 

TypeError: 'str' object does not support item assignment
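
Since item assignment is not allowed, the usual approach is to build a new string. The sketch below shows two possible ways; the variable names are illustrative.

```python
hello_world = "Hello Wo"

# Build a new string from a new first character plus a slice of the old one.
lowered_first = "h" + hello_world[1:]

# Or let a string method produce the new string (replace only the first match).
replaced = hello_world.replace("H", "h", 1)

print(lowered_first, replaced, hello_world)   # the original is unchanged
```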

#### NOTE: Objects have a name, value, type, and functions (methods)

In [45]:
# " ".format() ... Python's string object has a format method

### String Methods
- dir(str)
- help(str.casefold)

In [47]:
dir(str)                      # returns the names of the attributes and methods of str as a list

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

In [49]:
for name in dir(str):
    if name.startswith("__"):   # skip names that start with __ (special methods)
        continue
    print(name)

capitalize
casefold
center
count
encode
endswith
expandtabs
find
format
format_map
index
isalnum
isalpha
isdecimal
isdigit
isidentifier
islower
isnumeric
isprintable
isspace
istitle
isupper
join
ljust
lower
lstrip
maketrans
partition
replace
rfind
rindex
rjust
rpartition
rsplit
rstrip
split
splitlines
startswith
strip
swapcase
title
translate
upper
zfill


In [50]:
help(str.casefold)

Help on method_descriptor:

casefold(...)
    S.casefold() -> str
    
    Return a version of S suitable for caseless comparisons.



In [52]:
s_upper = "HELLO"
s_lower = "hello"

s_upper == s_lower  # strings are case sensitive (upper and lower case are different)

False

#### Casefold

In [55]:
?s_upper.casefold()

In [58]:
s_upper.casefold() == s_lower.casefold() # make all one case

True
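
For plain English text `lower()` and `casefold()` give the same result. The difference shows up with characters such as the German ß; the sketch below is an added illustration with an example word chosen for that purpose.

```python
german = "Straße"

lowered = german.lower()       # ß has no separate lowercase form, so it stays
folded = german.casefold()     # casefold expands ß to 'ss'

matches_lower = lowered == "STRASSE".lower()
matches_fold = folded == "STRASSE".casefold()
print(matches_lower, matches_fold)
```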

#### Capitalize

In [56]:
"hello world".capitalize()  # capitalizes the first character and lowercases the rest (method called on a literal)

'Hello world'

In [57]:
s="hello world"
s.capitalize()  # same method called on the variable s

'Hello world'

In [59]:
str.capitalize("hello world") # str is the built-in string class; the method is called through the class with the string passed as its argument

'Hello world'

In [60]:
str.capitalize(s) # same call through the class, equivalent to s.capitalize()

'Hello world'

#### Title

In [61]:
"hello world, greetings".title()    # capitalizes first letter of every word

'Hello World, Greetings'

In [62]:
g="hello world, greetings"
g.title()    # capitalizes first letter of every word

'Hello World, Greetings'

In [65]:
str.title(g)

'Hello World, Greetings'

In [66]:
str.title("Hello World Greetings")

'Hello World Greetings'
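
`title()` treats every run of letters as a word, so apostrophes give a surprising result. The sketch below shows this and one possible alternative, `string.capwords` from the standard library; the sample sentence is only an illustration.

```python
import string

text = "they're bill's friends"

titled = text.title()                # capitalizes after each apostrophe too
cap_words = string.capwords(text)    # splits on whitespace only

print(titled)
print(cap_words)
```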

#### Upper

In [63]:
"hello world".upper()  # converts to upper case

'HELLO WORLD'

In [64]:
x = "hello world"
x.upper()  # converts to upper case

'HELLO WORLD'

#### Lower

In [67]:
'HELLO WORLD'.lower()  # converts to lower case

'hello world'

#### Count

In [68]:
"hello world".count('l')  # there are 3 l in hello world

3

In [69]:
"hello world".count('Z')  # there is no Z in hello world

0

#### Strip White Spaces

In [70]:
s = "    hello world   "  # string with whitespace at start and end

In [71]:
print("s: |{}| s.strip: |{}|".format(s, s.strip()))

s: |    hello world   | s.strip: |hello world|


In [72]:
s.strip()

'hello world'

#### Strip Leading and Trailing Characters

In [73]:
s = " ;,hello world!?   "  # whitespaces and punctuations
s.strip(';,!? ')           # remove characters at start and end of string

'hello world'
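
Two details are worth trying out: `lstrip` and `rstrip` work on one side only, and the argument to `strip` is a set of characters, not a prefix or suffix. A small added sketch with illustrative values:

```python
padded = "   hello world   "

left = padded.lstrip()     # removes leading whitespace only
right = padded.rstrip()    # removes trailing whitespace only

# Every character in the argument is stripped from both ends, in any order.
host = "www.example.com".strip("cmowz.")

print(repr(left), repr(right), host)
```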

#### Split White Spaces

In [74]:
s="The brown fox jumped quickly at the lazy dogs"

In [77]:
s.split()  # breaks apart sentence into list of words

['The', 'brown', 'fox', 'jumped', 'quickly', 'at', 'the', 'lazy', 'dogs']

In [79]:
type(s.split())

list
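
`split()` with no argument behaves differently from `split` with an explicit separator. The sketch below is an added illustration; the sample strings are made up.

```python
spaced = "a  b"                              # two spaces between a and b

no_arg = spaced.split()                      # runs of whitespace count as one separator
with_space = spaced.split(" ")               # every single space is a separator

fields = "a,b,,c".split(",")                 # empty fields are kept
key_value = "key=value=more".split("=", 1)   # maxsplit=1: split only once

print(no_arg, with_space, fields, key_value)
```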

#### Strip and Split lines

In [82]:
s = '''
line 1
line 2
line 3
'''
s    # note \n are new lines

'\nline 1\nline 2\nline 3\n'

In [83]:
print(s)


line 1
line 2
line 3



In [84]:
s.splitlines() # splits into a list of single lines

['', 'line 1', 'line 2', 'line 3']

In [86]:
s.strip().splitlines() # strip() first removes the leading and trailing newlines, so there is no empty entry

['line 1', 'line 2', 'line 3']

#### Replace

In [87]:
s = "Python Programmers is cool!"
s.replace("is","are")

'Python Programmers are cool!'

In [88]:
"Python Programmers is cool!".replace("is","are")

'Python Programmers are cool!'

In [89]:
s.replace("cool","COOL")

'Python Programmers is COOL!'
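
`replace` works on substrings, not words, and by default replaces every occurrence. The sketch below illustrates this with a made-up sentence and shows one possible word-based alternative.

```python
s = "this is it"

all_replaced = s.replace("is", "are")      # also hits the 'is' inside 'this'
first_only = s.replace("is", "are", 1)     # optional count limits the replacements

# One way to replace whole words only: split, swap, and join again.
whole_word = " ".join("are" if word == "is" else word for word in s.split())

print(all_replaced)
print(first_only)
print(whole_word)
```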

#### Join

In [91]:
a2zLetters = "The brown fox jumped quickly at the lazy dogs"
a2zWords = a2zLetters.split() # (Split is a method of string)
print(a2zWords) # print list of words

" ".join(a2zWords) # creates a string from list of words (Join is a method of string)

['The', 'brown', 'fox', 'jumped', 'quickly', 'at', 'the', 'lazy', 'dogs']


'The brown fox jumped quickly at the lazy dogs'

In [92]:
a2zLetters = "The brown fox jumped quickly at the lazy dogs"
a2zWords = a2zLetters.split() # (Split is a method of string)
print(a2zWords) # print list of words

"+".join(a2zWords) # creates a string from list of words (Join is a method of string)

['The', 'brown', 'fox', 'jumped', 'quickly', 'at', 'the', 'lazy', 'dogs']


'The+brown+fox+jumped+quickly+at+the+lazy+dogs'

In [95]:
a2zLetters = "The brown fox jumped quickly at the lazy dogs"
a2zWords = a2zLetters.split() # (Split is a method of string)
print(a2zWords) # print list of words

join(a2zWords) # expect NameError: join is a method of string, not a built-in function

['The', 'brown', 'fox', 'jumped', 'quickly', 'at', 'the', 'lazy', 'dogs']


NameError: name 'join' is not defined

In [96]:
s = ''
s.join(a2zWords) # creates a string from list of words

'Thebrownfoxjumpedquicklyatthelazydogs'

In [97]:
''.join(a2zWords) # creates a string from list of words

'Thebrownfoxjumpedquicklyatthelazydogs'

In [98]:
s = '-'
s.join(a2zWords) # creates a string from list of words

'The-brown-fox-jumped-quickly-at-the-lazy-dogs'

In [99]:
'_'.join(a2zWords) # creates a string from list of words

'The_brown_fox_jumped_quickly_at_the_lazy_dogs'
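
`join` expects every item to be a string. The sketch below, an added illustration with a made-up list of numbers, shows the error and a common way around it.

```python
numbers = [1, 2, 3]

join_error = None
try:
    ", ".join(numbers)                 # items are ints, not strings
except TypeError as error:
    join_error = type(error).__name__

# Convert each item to a string first.
joined = ", ".join(str(n) for n in numbers)

print(join_error, joined)
```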

#### Index vs Find
- index, rindex
  - Raise ValueError if the substring is not found
- find, rfind
  - Return -1 if the substring is not found

In [101]:
s = "One Two Three"
s.find("e")  # find e from left (result is the index, s[2])

2

In [104]:
s.rfind('e')  # find e from right

12

In [107]:
s.find("Foo")  # returns -1 when substring not found in string

-1

In [108]:
"Foo" in s

False

In [109]:
s = "One Two Three"
s.index("e")

2

In [110]:
s.rindex('e')

12

In [111]:
s.index("Foo")   # expect ValueError

ValueError: substring not found
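
Which of the two to use depends on how a missing substring should be handled. The sketch below shows both styles; `position_or_none` is a hypothetical helper written for this example, not a built-in.

```python
s = "One Two Three"

def position_or_none(text, sub):
    """Hypothetical helper: like index(), but returns None instead of raising."""
    try:
        return text.index(sub)
    except ValueError:
        return None

found = position_or_none(s, "Two")
missing = position_or_none(s, "Foo")

# find() accepts an optional start position, useful for the next occurrence.
second_e = s.find("e", 3)

print(found, missing, second_e)
```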

## String Format Specifier
- A string can be formatted using %
  - "format specifier"  % (arguments) 
     - Note: here % is the string formatting operator, not the modulus operator

| format | remarks |
| --- | --- |
| %d, %i | signed decimal integer |
| %s | string |
| %f | floating point |
| %e, %E | scientific notation |
| %x, %X | hex |

In [112]:
print ("percent d: %d" % (10))  # print a decimal number

percent d: 10


In [113]:
print ("percent i: %i" % (-10.23)) # %i and %d are the same; a float argument is truncated to an integer

percent i: -10


In [114]:
print ("percent f: %f" % (3.1415)) # print a floating point

percent f: 3.141500


In [115]:
print ("percent e/E: %e" % (10000)) # print in scientific notation

percent e/E: 1.000000e+04


In [116]:
print ("percent s: %s" % (10000))   # %s converts the value to a string with str()

percent s: 10000


In [117]:
print ("percent x: %x" % (65534)) # print integer in hexadecimal

percent x: fffe


In [118]:
print ("percent X: %X" % (65534)) # %x or %X prints in hexadecimal

percent X: FFFE


In [119]:
print ("percent X: %X %X" % (47710, 47633))

percent X: BA5E BA11
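
The % specifiers also accept a width, a precision, and flags. The sketch below is an added illustration of a few common combinations; the values are made up.

```python
pi_text = "%5.2f" % 3.14159          # width 5, 2 decimals
left_aligned = "%-6d|" % 42          # '-' left-aligns within width 6
zero_padded = "%05d" % 42            # '0' pads with zeros
several = "%s has %d items" % ("cart", 3)   # several values need a tuple
literal_percent = "%d%%" % 50        # %% prints a single percent sign

print(pi_text, left_aligned, zero_padded, several, literal_percent)
```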


### String Format method
- ''.format()

In [120]:
'{} {}'.format('one', 'two')

'one two'

In [121]:
'{} {}'.format(1, 2)

'1 2'

In [122]:
'{1} {0} {1}'.format("one","two")

'two one two'

In [123]:
'{1} {0} {1}'.format(1,2)

'2 1 2'

In [124]:
'| {0:<10} | {0:^10} | {0:>10} |'.format('Hello') # sufficient width for hello

'| Hello      |   Hello    |      Hello |'

In [125]:
'| {0:<2} | {0:^2} | {0:>2} |'.format('Hello') # string longer than width provided

'| Hello | Hello | Hello |'

In [126]:
'{:10.5}'.format('Hello World')  # precision 5 keeps only 5 characters, width 10 pads with spaces

'Hello     '

In [127]:
'{:06.2f}'.format(3.141592653589793)  # total width 6 (including the dot), zero padded, 2 decimals

'003.14'
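
The same format specifications work with named fields and with f-strings. The sketch below assumes Python 3.6 or later for the f-string; the names and values are illustrative.

```python
pi = 3.141592653589793

named = "{name} is {age}".format(name="Ada", age=36)   # named fields
f_string = f"{pi:06.2f}"                               # same spec inside an f-string
right_aligned = "{:>8.3f}".format(pi)                  # width 8, 3 decimals
thousands = "{:,}".format(1234567)                     # comma as thousands separator
binary_a = "{:08b}".format(ord("A"))                   # the bits of 'A'

print(named, f_string, right_aligned, thousands, binary_a)
```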

## Recap
- Unicode
- Strings
- Concatenation and Repetition operators
- String indexing, Slicing
- String Format Specifier and Format Methods
- String Methods

## Assignments
- String Operations Assignment
- String Operations Writing Assignment

## Quiz
- Quiz 6

## Reference

[Joel On Unicode](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

[Strings and Character Data in Python](https://realpython.com/python-strings)
  