## Strings

See the [Python 3.6 string documentation](https://docs.python.org/3.6/library/stdtypes.html#text-sequence-type-str).  Strings are also a good way to learn [the common sequence operations](https://docs.python.org/3.6/library/stdtypes.html#typesseq-common).

Apparently, "everyone" should read [Joel Spolsky's essay on Unicode and character sets](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/).


[Strings are different in Python 2 and 3](https://timothybramlett.com/Strings_Bytes_and_Unicode_in_Python_2_and_3.html).

## Several different ways to specify a string literal

**Extra credit** How would you declare a literal in a right-to-left language?  How would you quote a word from a right-to-left language inside a passage of left-to-right text?

In [1]:
s = ' a test string '

print(s)

s = ' a test string\'s for testing '

print(s)

s = "a test string's not an imaginative example"

print(s)

s = """\na test string's 
\talso not a good
\t\thaiku\n"""

print(s)

 a test string 
 a test string's for testing 
a test string's not an imaginative example

a test string's 
	also not a good
		haiku
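
A small sketch, separate from the cell above, of what the different quoting styles have in common. The variable names are made up for the illustration; the point is only that the quoting style affects how the literal is typed, not the value that results.

```python
# Three spellings of the same kind of object: they differ only in how
# quotes and newlines inside the literal have to be written.
single = 'a test string\'s for testing'
double = "a test string's for testing"
triple = """a test string's
for testing"""

print(single == double)   # the quoting style is not part of the value
print(repr(triple))       # repr() shows the embedded newline as \n
print(type(single), type(triple))
```

Triple quotes are mostly a convenience for text that spans lines; the escape sequences (\n, \t) work in all three styles.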



## Strings are immutable sequences

I.e., you can't change their characters one-by-one.

In [2]:
s = ' a test string '

s[3] = 'A'

TypeError: 'str' object does not support item assignment
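
If you want to see the failure without stopping the notebook, one option is to catch the exception. This is only a sketch of the same assignment wrapped in try/except:

```python
s = ' a test string '

try:
    s[3] = 'A'            # same assignment as in the cell above
except TypeError as err:
    print('caught:', err)

print(repr(s))            # s is unchanged
```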

## But an immutable sequence is still a sequence

Even though you can't write to the characters of a string, you can read them . . . 

In [3]:
print(s)
print()

print(len(s))
print(s[1])
print(s[4:7])
print(s[-1])
print(s[3:])

 a test string 

15
a
est
 
test string 
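
Beyond indexing and slicing, the common sequence operations linked at the top of this notebook also apply to strings. A rough sketch, using the same test string; the last two lines use a list to suggest that nothing here is specific to str:

```python
s = ' a test string '

print('test' in s)        # membership test
print(s.count('t'))       # how many times 't' occurs
print(s.index('t'))       # position of the first 't'
print(s[::2])             # every second character
print(s[::-1])            # a reversed copy

# The same operations work on other sequences, such as a list.
numbers = [1, 2, 3, 4, 5, 6]
print(3 in numbers, numbers[::2], numbers[::-1])
```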


## So, to change the value of a string . . . 

. . . we assign a new value to the same variable (see the last bit in this cell, which I comment).

In [4]:
new_s_1 = s.strip()[:2] + 'SUPER-DUPER' + s.strip()[6:]

print(new_s_1)

new_s_2 = s.strip()[:s.find('test') - 1] + 'SUPER-DUPER' + s.strip()[s.find('test') + 3:]

print(new_s_2)

s_parts = s.strip().split()

print(s_parts)

s_parts[1] = 'SUPER-DUPER'

#  NOTE THAT ' ' IS A STRING.  STRING HAS A .join METHOD, WHICH TAKES AS AN ARGUMENT A LIST OF STRINGS.

print(' '.join(s_parts))

# HERE, WE DO SOMETHING WHICH HAS THE EFFECT OF CHANGING THE VALUE OF s.

s = s.strip()[:2] + 'SUPER-DUPER' + s.strip()[6:]

print(s)

a SUPER-DUPER string
a SUPER-DUPER string
['a', 'test', 'string']
a SUPER-DUPER string
a SUPER-DUPER string
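
The slicing arithmetic above is one way to do it. A possibly simpler sketch uses the replace method (it appears in the dir() listing further down); the names replaced and before are mine, not part of the original cells.

```python
s = ' a test string '

replaced = s.strip().replace('test', 'SUPER-DUPER')

print(replaced)
print(repr(s))        # the original string is untouched; replace() returned a new one

# "Changing" s really means pointing the name s at a different string object.
before = s
s = s.strip()
print(before is s)
print(repr(before), repr(s))
```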


## How do you know what a Python object "does"?

In [5]:
print(dir(s))
print()
print(type(s))
print()
help(str)  # help() PRINTS ITS OWN OUTPUT AND RETURNS None, SO IT DOESN'T NEED A print().

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

<class 'str'>

Help on class str in module builtins:

class str(object)
 |  s
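
The full dir() and help() output is a lot to read. A sketch of two ways to narrow it down, assuming you mostly care about the methods without underscores in their names (public_names is a name I made up):

```python
s = 'a test string'

# Names without a leading underscore are the "public" methods.
public_names = [name for name in dir(s) if not name.startswith('_')]
print(public_names)
print()

# The documentation for a single method is much shorter than help(str).
print(str.split.__doc__)
```

Calling help(str.split) in a notebook cell shows the same text with a little more formatting.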

## Strings come with a bunch of useful methods

In [6]:
print(s.lower())
print(s.upper())
print(s.strip())
print(s.startswith(' '))
print(s.endswith(' '))
print(s.find('test'))
print(s.strip().split())
print(s.strip().split('s'))
print(s.strip().title())

a super-duper string
A SUPER-DUPER STRING
a SUPER-DUPER string
False
False
-1
['a', 'SUPER-DUPER', 'string']
['a SUPER-DUPER ', 'tring']
A Super-Duper String
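
The -1 in the output above is how find reports "not found" (remember that s no longer contains 'test'). A short sketch contrasting it with index, which raises an exception instead:

```python
s = 'a SUPER-DUPER string'

print(s.find('test'))           # find() signals "not found" with -1

try:
    s.index('test')             # index() raises ValueError instead
except ValueError as err:
    print('caught:', err)

print('test' in s)              # a plain membership test
print(s.lower().find('super'))  # methods return new strings, so they chain
```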


## Strictly speaking . . . 

Up to this point, we haven't been talking about string.  We've been talking about **str**.

In Python, **string** is actually a standard-library module which contains a handful of constants (such as string.punctuation) plus a few helpers.  <span style="font-weight: bold; color: red;">\[eyeroll\]</span>

In [7]:
import string

print(dir(string))
print()
print(string.punctuation)

['Formatter', 'Template', '_ChainMap', '_TemplateMetaclass', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_re', '_string', 'ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'capwords', 'digits', 'hexdigits', 'octdigits', 'printable', 'punctuation', 'whitespace']

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
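
One thing the constants are handy for is cleaning text. The sketch below strips punctuation from a sentence borrowed from the Jane Eyre sample, using str.maketrans and translate (both in the dir() listing earlier). It is just one way to do it, and the variable names are mine.

```python
import string

text = 'We had been wandering, indeed, in the leafless shrubbery an hour in the morning;'

# str.maketrans('', '', chars) builds a table that deletes every character in chars.
table = str.maketrans('', '', string.punctuation)
cleaned = text.translate(table)

print(cleaned)
print(cleaned.lower().split())
```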


## Bonus: shell commands in the notebook

What is on the file system?

In [8]:
!ls -1 data/
!echo ""
!file -ib data/en_Jane_Eyre.txt
!file -ib data/de_Jane_Eyre.txt
!file -ib data/special_encoding_de_Jane_Eyre.txt

de_Jane_Eyre.txt
en_Jane_Eyre.txt
pg97_Abbott_Edwin_Flatland.txt
special_encoding_de_Jane_Eyre.txt

text/plain; charset=us-ascii
text/plain; charset=utf-8
text/plain; charset=iso-8859-1


## Opening and reading files with various character encodings

In [9]:
en_text = open('data/en_Jane_Eyre.txt').read()

print(en_text)
print(type(en_text))

There was no possibility of taking a walk that day.  We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.

<class 'str'>


In [10]:
de_text = open('data/de_Jane_Eyre.txt').read()

print(de_text)
print(type(de_text))

Es war ganz unmöglich, an diesem Tage einen Spaziergang zu machen.
Am Morgen waren wir allerdings während einer ganzen Stunde in den blätterlosen, jungen Anpflanzungen umhergewandert; aber seit dem Mittagessen – Mrs. Reed speiste stets zu früher Stunde, wenn keine Gäste zugegen waren – hatte der kalte Winterwind so düstere, schwere Wolken und einen so durchdringenden Regen heraufgeweht, daß von weiterer Bewegung in frischer Luft nicht mehr die Rede sein konnte.

<class 'str'>


## Note that this one doesn't work.

The file is not in an ASCII or UTF-8 encoding.  When no encoding is given, open() falls back to the platform's preferred encoding (what locale.getpreferredencoding(False) returns), which is UTF-8 on my workstation, so the ISO-8859-1 bytes in this file fail to decode.

In [11]:
special_de_text = open('data/special_encoding_de_Jane_Eyre.txt').read()

print(special_de_text)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 15: invalid start byte
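
The same failure can be reproduced without the data file. This sketch encodes the first few words by hand as ISO-8859-1 and then tries to decode them as UTF-8; the first line prints whatever default encoding your own machine would use, which may differ from mine.

```python
import locale

# The encoding open() falls back to when none is given.
print(locale.getpreferredencoding(False))

# Encode a few words as ISO-8859-1; the o-umlaut becomes the single byte 0xf6.
raw = 'Es war ganz unmöglich'.encode('iso-8859-1')
print(raw)

try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    print('caught:', err)

print(raw.decode('iso-8859-1'))   # the right codec recovers the text
```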

## But it's okay . . .

. . . if I declare the encoding.

As a general rule, you should probably always declare the encoding (i.e., don't do what I did in the previous three cells).  Default encodings will differ from operating system to operating system, and/or from device to device.  So, generally speaking, it's best to be explicit about what you're expecting and what you're providing.

In [12]:
special_de_text = open('data/special_encoding_de_Jane_Eyre.txt', 'r', encoding='iso-8859-1').read()

print(special_de_text)

Es war ganz unmöglich, an diesem Tage einen Spaziergang zu machen.
Am Morgen waren wir allerdings während einer ganzen Stunde in den blätterlosen, jungen Anpflanzungen umhergewandert; aber seit dem Mittagessen -- Mrs. Reed speiste stets zu früher Stunde, wenn keine Gäste zugegen waren -- hatte der kalte Winterwind so düstere, schwere Wolken und einen so durchdringenden Regen heraufgeweht, daß von weiterer Bewegung in frischer Luft nicht mehr die Rede sein konnte.



## But I can change the encoding . . . 

. . . by writing the data out to a file with a different encoding (note that I could have overwritten "data/special_encoding_de_Jane_Eyre.txt" instead of creating a new file).

In [13]:
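# A with BLOCK CLOSES THE FILE FOR US, EVEN IF SOMETHING GOES WRONG PARTWAY THROUGH.
with open('data/fixed_special_encoding_de_Jane_Eyre.txt', 'w', encoding='utf-8') as f:
    f.write(special_de_text)

with open('data/fixed_special_encoding_de_Jane_Eyre.txt', encoding='utf-8') as f:
    fixed_de_text = f.read()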

print(fixed_de_text)

Es war ganz unmöglich, an diesem Tage einen Spaziergang zu machen.
Am Morgen waren wir allerdings während einer ganzen Stunde in den blätterlosen, jungen Anpflanzungen umhergewandert; aber seit dem Mittagessen -- Mrs. Reed speiste stets zu früher Stunde, wenn keine Gäste zugegen waren -- hatte der kalte Winterwind so düstere, schwere Wolken und einen so durchdringenden Regen heraufgeweht, daß von weiterer Bewegung in frischer Luft nicht mehr die Rede sein konnte.
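
Here is a self-contained sketch of the same round trip that doesn't depend on the data directory. It writes into a temporary directory, and the file names latin1.txt and utf8.txt are made up for the example.

```python
import os
import tempfile

text = 'Es war ganz unmöglich, an diesem Tage einen Spaziergang zu machen.'

with tempfile.TemporaryDirectory() as tmp:
    latin1_path = os.path.join(tmp, 'latin1.txt')   # hypothetical file names
    utf8_path = os.path.join(tmp, 'utf8.txt')

    with open(latin1_path, 'w', encoding='iso-8859-1') as f:
        f.write(text)

    # Read with the source encoding, write with the target encoding.
    with open(latin1_path, encoding='iso-8859-1') as src, \
         open(utf8_path, 'w', encoding='utf-8') as dst:
        dst.write(src.read())

    with open(utf8_path, encoding='utf-8') as f:
        round_trip = f.read()

    latin1_size = os.path.getsize(latin1_path)
    utf8_size = os.path.getsize(utf8_path)

print(round_trip == text)
print(latin1_size, utf8_size)   # the o-umlaut is one byte in ISO-8859-1, two in UTF-8
```

The text is the same str either way; only the bytes on disk change.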



## You can combine strings . . . 

 . . . by "adding" them together.  And there's a "multiply" operator which seems curiously useless.  But you can't subtract.
 
**Extra credit**:  There's a thing called [operator overloading](https://en.wikibooks.org/wiki/C%2B%2B_Programming/Operators/Operator_Overloading), which, even if you can't find a use for it as a programmer, still presents an interesting thought experiment about things which change meaning depending on context.

In [14]:
add_string = 'this' + 'this' + 'this'
print(add_string)

multiply_string = 'this' * 8
print(multiply_string)
print(multiply_string - 'this')

thisthisthis
thisthisthisthisthisthisthisthis


TypeError: unsupported operand type(s) for -: 'str' and 'str'
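
For the extra credit, here is a rough Python sketch of operator overloading. SubtractableStr is a hypothetical class of my own, and "subtract" is defined (arbitrarily) as "remove the first occurrence". The last line shows one everyday use of the multiply operator.

```python
class SubtractableStr(str):
    """A str subclass (hypothetical, for illustration) that supports '-'."""

    def __sub__(self, other):
        # Define "a - b" as: remove the first occurrence of b from a.
        return SubtractableStr(self.replace(other, '', 1))


multiply_string = SubtractableStr('this' * 8)
shorter = multiply_string - 'this'

print(shorter)
print(len(multiply_string), len(shorter))

# One everyday use for str * int: drawing a rule of a given width.
print('-' * 20)
```

The built-in str leaves subtraction undefined, presumably because there is no single obvious meaning for it; the definition above is only one of several you could pick.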

## Subtle differences when concatenating and/or printing strings and numbers

This bit

    str(pi)
    
is a useful bit to know, since it's the way we change an object from one type to another.

In [None]:
pi = 3.14159265359

print(type(pi))
print()

str_p = str(pi)

print(type(str_p))
print()

print(type(str(pi)))
print()

print('the value of pi is', pi)          # print takes any number of arguments, and handles concatenating them.
print('the value of pi is ' + str(pi))   # or, we cast the number to str, then we can concatenate.
print('the value of pi is ' + pi)
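
Besides str(), string formatting is another way to combine text and numbers. A sketch of two spellings (f-strings need Python 3.6 or later), plus casts going in the other direction:

```python
pi = 3.14159265359

print('the value of pi is {}'.format(pi))       # str.format converts for us
print(f'the value of pi is {pi}')               # f-string, Python 3.6+
print(f'the value of pi is roughly {pi:.2f}')   # format spec: two decimal places

# Casting works in the other direction as well.
print(float('3.14') + 1)
print(int('42') + 1)
```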

## You should be able to cast almost anything to str

Because almost everything will have a \_\_str\_\_ method.

In [None]:
l = [1, 2, 3, 4, 5, 6]
print(l)

print()
print(type(l))
print()
print(dir(l))
print()
print(l.__str__())  # IT IS CONSIDERED BAD FORM TO CALL A DOUBLE-UNDERSCORE METHOD DIRECTLY; str(l) IS THE USUAL SPELLING.
print()

print('this is really a string ' + str(l))
print('this is NOT a string ' + l)
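
To see where str() gets its text from, you can write the method yourself. Token is a hypothetical class, invented only for this sketch:

```python
class Token:
    """A hypothetical class, used only to show where str() gets its text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return 'Token(' + self.text + ')'


t = Token('haiku')

print(str(t))                 # str() calls t.__str__() behind the scenes
print('this is ' + str(t))    # so concatenation works after the cast
print(t)                      # print() applies str() to its arguments
```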

## How do these things combine?

This thing:
    
    open('data/en_Jane_Eyre.txt').read().lower().strip().split(' ')
    
is assigned to en_tokens.  But what does that thing do?

1.  open('data/en_Jane_Eyre.txt') results in something called a "file" or a "file handle" (a "TextIOWrapper", to be exact).
2.  The file handle has a method "read", which returns a string.
3.  String has a lower method, which returns a string; and string also has a strip method, which returns a string.
4.  String has a split method, which returns a list.

You'll see this sort of chaining one method after another a lot.  You'll also see things like this:

    print(len(str(en_tokens)))
    
Here, 

1.  print takes zero or more arguments of arbitrary types, and writes them to standard out.
2.  In this case, print is taking one argument, which is the result (an integer) from a len function.
3.  len takes one argument, which can be any Python object that has a length (a string, a list, a dictionary, a set, etc . . .); in this case, len is taking as its input the result of a str function, which is a string.
4.  str takes one argument, which can be any object which implements a \_\_str\_\_ method.  In this case, that argument is a list.
    

In [None]:
en_tokens = open('data/en_Jane_Eyre.txt').read().lower().strip().split(' ')

print(type(en_tokens))
print()

print(len(en_tokens))
print()

print(len(str(en_tokens)))
print()

print(en_tokens)
print()
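
The chain can also be unrolled one step at a time, which makes the type at each stage visible. In this sketch an io.StringIO object stands in for the file, so that it runs without the data directory; the intermediate variable names are mine.

```python
import io

# io.StringIO stands in for the file so that the sketch runs anywhere.
handle = io.StringIO(' There was no possibility of taking a walk that day. \n')

raw_text = handle.read()        # str
lowered = raw_text.lower()      # str
stripped = lowered.strip()      # str
tokens = stripped.split(' ')    # list of str

for step in (handle, raw_text, lowered, stripped, tokens):
    print(type(step).__name__)

print(tokens)
print(len(tokens), len(str(tokens)))
```

Note that len(str(tokens)) counts the characters in the printed form of the list (brackets, quotes, and commas included), not the number of tokens.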

## Oh, before I forget:

In [15]:
for cn, c in enumerate('start thinking in sequences'.title()):
    print((' '* cn), c)

 S
  t
   a
    r
     t
       
       T
        h
         i
          n
           k
            i
             n
              g
                
                I
                 n
                   
                   S
                    e
                     q
                      u
                       e
                        n
                         c
                          e
                           s
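
The same trick works on any sequence, not only on the characters of a string. A sketch using a list of words instead:

```python
phrase = 'start thinking in sequences'

# enumerate() pairs each item with its position, for any iterable.
print(list(enumerate(phrase.split())))

# A string is a sequence of characters; a list of words is a sequence of strings.
for position, word in enumerate(phrase.title().split()):
    print(' ' * position, word)
```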
