# Strings

**Prepared by Tommy Guy and Anthony Scopatz**

Lesson goals:

1.  Examine the string class in greater detail.
2.  Use open() to open, read, and write to files.


To start understanding the string type, let's use the built-in help system.

In [1]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(...)
 |      S.__format__(format_spec) -> str
 |      
 |      Return a formatted version of S as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getatt

The help page for string is very long, and it may be easier to keep it open
in a browser window by going to the [online Python
documentation](http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange)
while we talk about its properties.

At its heart, a string is just a sequence of characters. Basic strings are
defined using single or double quotes.

In [None]:
s = "This is a string."
s2 = 'This is another string that uses single quotes'

The reason for having two types of quotes to define a string is
emphasized in these examples:

In [None]:
s = "Bob's mom called to say hello."
s = 'Bob's mom called to say hello.'

The second one raises a SyntaxError: Python reads `s = 'Bob'` as a complete
string, and the rest of the line is not valid Python.
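As a small illustration of the point above, here are three ways to write a string that contains an apostrophe. The variable names `s1`, `s2`, and `s3` are only for this sketch.

```python
# Three ways to write a string that contains an apostrophe.
s1 = "Bob's mom called to say hello."         # double quotes around a single quote
s2 = 'Bob\'s mom called to say hello.'        # backslash-escaped single quote
s3 = '''Bob's mom called to say "hello".'''   # triple quotes allow both kinds

print(s1 == s2)  # True: the two spellings produce the same string
print(s3)
```

Choosing the quote style that avoids backslashes usually gives the most readable result.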

In Python 3, every string is a sequence of Unicode characters, so string
literals can contain accented letters and non-Latin scripts directly. The
older ASCII character set defines only 128 character codes, which leaves
no room for such characters. Python 2 marked Unicode strings with a
leading u; Python 3 still accepts the prefix, but it is optional:

In [None]:
u = u'abcdé'

For the rest of this lecture, we will deal with ASCII strings, because
most scientific data that is stored as text is stored with ASCII.
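The difference between characters and bytes can be sketched as follows, assuming Python 3 and the UTF-8 encoding (the names `text` and `data` are illustrative). An ASCII character occupies one byte, but an accented character such as é needs two bytes in UTF-8.

```python
text = 'abcd\u00e9'           # the same string as 'abcdé'; \u00e9 is é
data = text.encode('utf-8')   # convert the text to raw bytes

print(len(text))   # 5 characters
print(len(data))   # 6 bytes, because é takes two bytes in UTF-8
print(data.decode('utf-8') == text)  # True: decoding recovers the text
```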

## Working with Strings

Strings are sequences, which means many of the ideas from lists can also
be applied directly to string manipulation. For instance, characters can
be accessed individually or in slices:

In [2]:
s = 'abcdefghijklmnopqrstuvwxyz'
s[0]

'a'

In [3]:
s[-1]

'z'

In [4]:
s[1:4]

'bcd'

They can also be compared using the equality and ordering operators.

In [None]:
'str1' == 'str2'

In [None]:
'str1' == 'str1'

In [None]:
'str1' < 'str2'
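For reference, this sketch shows what the comparisons above evaluate to, plus two extra cases added here for illustration. Strings are compared character by character using their character codes, so uppercase letters sort before lowercase ones.

```python
print('str1' == 'str2')   # False
print('str1' == 'str1')   # True
print('str1' < 'str2')    # True: '1' sorts before '2'
print('Zebra' < 'apple')  # True: uppercase letters sort before lowercase

fruits = sorted(['pear', 'Apple', 'banana'])
print(fruits)             # ['Apple', 'banana', 'pear']
```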

In the help screen, which we looked at above, there are lots of
methods that look like this:

    |  __add__(...)
    |      x.__add__(y) <==> x+y

    |  __le__(...)
    |      x.__le__(y) <==> x<=y

These are special Python methods that implement operators like <= and +.
We'll talk more about these in the next lecture on Classes.
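A quick way to convince yourself that operators map onto these special methods is to call the methods directly. This is only a demonstration; in normal code you would use the operators.

```python
x, y = 'abc', 'abd'

print(x + y == x.__add__(y))              # True: + calls __add__
print((x <= y) == x.__le__(y))            # True: <= calls __le__
print(('b' in x) == x.__contains__('b'))  # True: in calls __contains__
```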

Strings also come with a number of handy text methods.

**Hands on example**

Try each of the following functions on a few strings. What does the
function do?

In [7]:
s = "This is a string"

In [8]:
s.startswith("This")

True

In [9]:
s.split(" ")

['This', 'is', 'a', 'string']

In [10]:
s.strip() # This won't change every string!

'This is a string'

In [11]:
s.capitalize()

'This is a string'

In [12]:
s.lower()

'this is a string'

In [13]:
s.upper()

'THIS IS A STRING'
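These methods are often chained together when cleaning up a line of text data. The sketch below assumes a made-up comma-separated record; the `record` string and the field names are hypothetical.

```python
record = '  Temperature, 21.5, C \n'   # hypothetical line of text data

# Split on commas, then strip the whitespace around each field.
fields = [field.strip() for field in record.split(',')]
name, value, unit = fields

print(fields)                            # ['Temperature', '21.5', 'C']
print(name.lower().startswith('temp'))   # True
print(float(value) + 1)                  # 22.5
```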

## File I/O

Python has a built-in function called "open()" that can be used to
manipulate files. The help information for open is below:

In [14]:
help(open)

Help on built-in function open in module io:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise IOError upon failure.
    
    file is either a text or byte string giving the name (and the path
    if the file isn't in the current working directory) of the file to
    be opened or an integer file descriptor of the file to be
    wrapped. (If a file descriptor is given, it is closed when the
    returned I/O object is closed, unless closefd is set to False.)
    
    mode is an optional string that specifies the mode in which the file
    is opened. It defaults to 'r' which means open for reading in text
    mode.  Other common values are 'w' for writing (truncating the file if
    it already exists), 'x' for creating and writing to a new file, and
    'a' for appending (which on some Unix systems, means that all writes
    append to the end of the file regardless of the current seek position

The two main parameters we'll need to worry about are the name of the
file and the mode, which determines whether we can read from or write to
the file. open returns a file object, which acts like a pointer into the
file. An example will make this clear. In the code below, I've opened a
file that contains two lines:

    $ cat testfile.txt
    abcde
    fghij

Now let's open this file in Python:

In [15]:
f = open('testfile.txt','r')

The second input, 'r', means I want to open the file for reading only. I
cannot write to this handle. The read() method will read a specified
number of characters:

In [17]:
s = f.read(3)
print(s)

abc


We read the first three characters, where each character is a byte long.
We can see that the file handle points to the 4th byte (index number 3)
in the file:

In [18]:
f.tell()

3

In [19]:
f.read(1)

'd'

In [20]:
f.close() # close the old handle

In [21]:
f.read()  # can't read anymore because the file is closed.

ValueError: I/O operation on closed file.
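Here is a minimal sketch of the same read, tell, and close sequence, assuming Python 3. Because testfile.txt may not exist on your machine, the sketch first creates a stand-in copy in a temporary directory (`tmpdir` and `path` are illustrative names). It also uses a `with` block, which this lesson does not otherwise use, as one common way to make sure the file is closed even if an error occurs.

```python
import os
import tempfile

# Create a stand-in for testfile.txt in a temporary directory.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'testfile.txt')
with open(path, 'w') as f:
    f.write('abcde\nfghij\n')

with open(path, 'r') as f:
    first = f.read(3)    # read three characters
    rest = f.read()      # read everything that is left

print(first)       # abc
print(repr(rest))  # 'de\nfghij\n'
print(f.closed)    # True: the with block closed the file
```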

The file we are using is one long series of characters, but two of the
characters are newline characters. If we looked at the file in
sequence, it would look like "abcde\nfghij\n". Separating a file into
lines is common enough that there are two ways to read whole lines in a
file. The first is to use the readlines() method:

In [22]:
f = open('testfile.txt','r')
lines = f.readlines()
print(lines)
f.close()  # Always close the file when you are done with it

['abcde\n', 'fghij\n']


A very important point about the readlines method is that it *keeps* the
newline character at the end of each line. You can use the strip()
method to get rid of the newline (along with any other leading or
trailing whitespace).
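The sketch below compares strip() with rstrip('\n') on a small hand-written list (the `raw_lines` name and its contents are made up for illustration). Use rstrip('\n') when spaces at the edges of a line are meaningful and only the newline should go.

```python
raw_lines = ['abcde\n', '  fghij  \n']

stripped = [line.strip() for line in raw_lines]          # removes all edge whitespace
no_newline = [line.rstrip('\n') for line in raw_lines]   # removes only the newline

print(stripped)     # ['abcde', 'fghij']
print(no_newline)   # ['abcde', '  fghij  ']
```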

File handles are also iterable, which means we can use them in for loops
or list comprehensions:

In [23]:
f = open('testfile.txt','r')
lines = [line.strip() for line in f]
f.close()
print(lines)

['abcde', 'fghij']


In [24]:
lines = []
f = open('testfile.txt','r')
for line in f:
    lines.append(line.strip())
f.close()
print(lines)

['abcde', 'fghij']


These are equivalent operations. It's often best to handle a file one
line at a time, particularly when the file is so large it might not fit
in memory.
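When the file is too large to hold in memory, you can keep running totals instead of building a list. This is a sketch under the assumption that a small stand-in file is created first so the example runs anywhere; the names `path`, `n_lines`, and `n_chars` are illustrative.

```python
import os
import tempfile

# Create a small stand-in file; a real data file could be far larger.
path = os.path.join(tempfile.mkdtemp(), 'testfile.txt')
with open(path, 'w') as f:
    f.write('abcde\nfghij\n')

n_lines = 0
n_chars = 0
with open(path, 'r') as f:
    for line in f:                     # only one line is in memory at a time
        n_lines += 1
        n_chars += len(line.strip())   # count characters, ignoring the newline

print(n_lines, n_chars)  # 2 10
```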

The other half of the story is writing output to files. We'll talk about
two techniques: writing to the shell and writing to files directly.

If your program only creates one stream of output, it's often a good
idea to write to the shell using the print function. There are several
advantages to this strategy, including the fact that it allows the user
to select where they want to store the output without worrying about any
command line flags. You can use ">" to redirect the output of your
program to a file or use "|" to pipe it to another program.
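A minimal sketch of this strategy: results go to standard output, where ">" and "|" can capture them, while a progress message goes to standard error so that it stays out of the redirected data. Sending diagnostics to `sys.stderr` is a common convention added here as a suggestion, and the `results` list is made-up example data.

```python
import sys

results = [x ** 2 for x in range(4)]   # made-up example data

for value in results:
    print(value)   # standard output: captured by > or |

# Standard error is not captured by >, so this still appears in the terminal.
print('processed', len(results), 'values', file=sys.stderr)
```

If this were saved as a script, running it with `> results.txt` after the command would store only the four numbers in the file.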

Sometimes, you need to direct your output directly to a file handle. For
instance, if your program produces two output streams, you may want to
assign two open file handles. Opening a file for writing simply requires
changing the second option from 'r' to 'w' or 'a'.

*Caution!* Opening a file with the 'w' option erases any existing
contents and starts writing *at the beginning*. If you want to append
to the file without losing what is already there, open it with 'a'.

Writing to a file uses the write() command, which accepts a string.

In [28]:
outfile = open('outfile.txt','w')
outfile.write('This is the first line!')
outfile.close()
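The difference between 'w' and 'a' can be checked with a short experiment. This sketch writes to a file in a temporary directory so that nothing of yours is overwritten; `path`, `after_append`, and `after_write` are illustrative names.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'outfile.txt')

with open(path, 'w') as f:     # 'w' creates the file
    f.write('first\n')
with open(path, 'a') as f:     # 'a' adds to the end
    f.write('second\n')
with open(path, 'r') as f:
    after_append = f.read()

with open(path, 'w') as f:     # 'w' again: the old contents are erased
    f.write('third\n')
with open(path, 'r') as f:
    after_write = f.read()

print(repr(after_append))  # 'first\nsecond\n'
print(repr(after_write))   # 'third\n'
```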

Another way to write to a file is to use writelines(), which accepts a
list of strings and writes them in order. *Caution!* writelines does not
append newlines. If you really want to write a newline at the end of
each string in the list, add it yourself.
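The following sketch shows writelines() with and without added newlines, again using a file in a temporary directory. The names `rows`, `joined`, and `separated` are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'outfile.txt')
rows = ['abcde', 'fghij']

with open(path, 'w') as f:
    f.writelines(rows)                           # no newlines are added
with open(path, 'r') as f:
    joined = f.read()

with open(path, 'w') as f:
    f.writelines(row + '\n' for row in rows)     # add the newlines yourself
with open(path, 'r') as f:
    separated = f.read()

print(repr(joined))      # 'abcdefghij'
print(repr(separated))   # 'abcde\nfghij\n'
```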

### Aside About File Editing

How can you edit a file in place? You can use f.seek() and f.tell() to
verify that, even if your file handle is pointing to the middle of a
file, write commands go to the end of the file in append mode. The best
way to change a file is to open a temporary file in
/tmp/, fill it, and then move it to overwrite the original. On large
clusters, /tmp/ is often local to each node, which means it reduces I/O
bottlenecks associated with writing large amounts of data.
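The temporary-file approach might look like the sketch below. It assumes the standard library's `tempfile` and `shutil` modules, which this lesson does not otherwise introduce. The file name data.txt and the uppercase "edit" are placeholders for your own file and transformation.

```python
import os
import shutil
import tempfile

# Set up a small original file to edit (a placeholder for your own data).
original = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(original, 'w') as f:
    f.write('abcde\nfghij\n')

# Write the edited copy to a temporary file in the system temp directory
# (usually /tmp on Linux). delete=False keeps the file after it is closed.
with open(original, 'r') as src:
    with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
        for line in src:
            tmp.write(line.upper())   # the "edit": convert to uppercase
        tmp_path = tmp.name

# Move the finished temporary file over the original.
shutil.move(tmp_path, original)

with open(original, 'r') as f:
    result = f.read()
print(repr(result))  # 'ABCDE\nFGHIJ\n'
```

Because the original is only replaced after the new copy is complete, a crash in the middle of the edit leaves the original file untouched.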