# Getting Texts

This notebook is focused on an essential component of digital text analysis: preparing a corpus of texts. It's part of the [Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis) and assumes that you've already worked through [Getting Setup](GettingSetupENGL345.ipynb) and [Getting Started](GettingStarted.ipynb). In this notebook we'll look at

* accessing plain texts online
* finding string length, counting occurrences of a term, extracting part of a string
* data types
* saving and accessing plain texts in a local directory
* listing text files in a local directory

Note that we're especially interested here in working with plain texts; in later notebooks we'll deal with other formats.

## Running your iPython notebook (recap)

In the previous [Getting Started](GettingStarted.ipynb) notebook we launched iPython Notebook from the user's home directory (the current working directory when you launch Terminal). We then navigated to our .ipynb files from within the browser window running iPython Notebook.

To get started again today, you'll need to find the 'wk2' files [from Github](https://github.com/ucdh/alta/tree/wk2), download them ('Download Zip') and save them in the folder you used last week. This week's notebooks are 'GettingTexts' and 'GettingNLTK'.

Once you have the files, open Terminal (Applications > Utilities > Terminal) and run the command:

> ```ipython notebook```

Notice that soon after launching iPython Notebook there is a message that tells us how to shut down the server: press [Control]-C (hold the control button on your keyboard and then hit ```c```), and then hit ```y``` to confirm.

![Stop Server](images/stop-server.png)

Once you have the notebook server running on your Mac, you can run code in the notebook and edit your own copy. Next, you'll look at the task of getting texts.

## Accessing Plain Texts Online

Let's create a new notebook and set its name (by clicking on the _Untitled0_ label above the toolbar) to _AccessTexts_ and set the first cell format to ```Heading 1``` with the title _Access Texts_.

![Access Texts](images/access-texts.png)

Hit Shift-Enter to evaluate/format the heading cell and create a new cell. The format of the new cell should be Markdown, and we can put something like this:

> We are first going to experiment with loading a plain text into memory from the \[Gutenberg Project](http://gutenberg.org), an online library with tens of thousands of free texts in different languages and formats. To do this we will import a Python library called [Requests](http://www.python-requests.org/en/latest/) that will help us request web pages, doing what our web browsers do via a Python script instead.

As we proceed through this notebook and beyond, we'll no longer make much distinction between the tutorial notebook and a separate notebook that you may want to write to experiment. It will be up to you to decide which parts you want to try in your own notebook.

Next, let's demonstrate how we can read the contents of a plain text version of the [Works of Edgar Allan Poe, Volume 1](http://www.gutenberg.org/ebooks/2147), available at http://www.gutenberg.org/cache/epub/2147/pg2147.txt. Run the cell below to see this working.

In [None]:
import requests
poeUrl = "http://www.gutenberg.org/cache/epub/2147/pg2147.txt"
poeString = requests.get(poeUrl).text
print(poeString[0:300])

Most of the principles involved have already been covered in the [Getting Started](GettingStarted.ipynb) notebook:

1. importing a module (in this case ```requests```)
1. assigning a string (the URL) to a variable name of our choice, ```poeUrl```
1. making a function call and assigning the result to the variable name ```poeString```

In this case [get()](http://www.python-requests.org/en/latest/user/quickstart/#make-a-request) is the function name, called with an argument that contains our Poe URL; it returns a Response object. We then ask for the [text](http://www.python-requests.org/en/latest/api/#requests.Response) property of this HTTP response and store it in the variable poeString. Finally, we print the first 300 characters (using the slice notation) to see what the result is.

Many things can go wrong during networking calls, but if all goes well, we should now have a variable (poeString) containing a string with the same contents as at our [URL](http://www.gutenberg.org/cache/epub/2147/pg2147.txt) (click to check that they match!).

Fetching the contents of a URL is a relatively heavy operation, so we want to isolate it in its own cell so that we don't have to run it more times than necessary. If we want to explore various aspects of the poeString string that we fetched, we should do that in a separate cell so that we're not re-fetching the string each time.

## Some Simple String Functions

Last time, we briefly tried using the [len()](https://docs.python.org/3/library/functions.html?highlight=len#len) function. Now, let's use it again to see the length of the poeString string (how many characters there are).

In [None]:
len(poeString)

This suggests that there are 550,293 characters (because we're in Python 3.x we should be dealing with Unicode, and so this should be a true count of the characters, not just the bytes). We'll learn more about Unicode in a later class.
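The difference between counting characters and counting bytes is easier to see on a tiny string. Here is a minimal sketch, assuming a made-up sample word (it is not taken from the Poe text):

```python
# A made-up sample word; \u00e9 is the single character "é".
sample = "caf\u00e9"

# len() on a string counts Unicode characters.
char_count = len(sample)

# Encoding to UTF-8 gives bytes; "é" needs two bytes in UTF-8.
byte_count = len(sample.encode("utf-8"))

print(char_count, "characters,", byte_count, "bytes")
```

The two numbers differ only because of the accented letter; for plain ASCII text they would be the same.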

We can look at the string documentation section to see examples of how we can extract parts of a string. For instance, if we want the first 75 characters, we could do this:

In [None]:
poeString[:75]

As we've seen, when the slice notation is used with a colon, we get a range of characters back: in this case from the first character (index 0) up to, but not including, index 75, which gives us the first 75 characters. Compare the result if you remove the colon.
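To experiment with slices without scrolling through the whole Poe volume, we can try them on a short string. This is only a sketch; the variable names are our own choices:

```python
# A short stand-in for the much longer poeString.
sample = "Project Gutenberg's The"

first_seven = sample[:7]   # characters at index 0 through 6
single_char = sample[7]    # no colon: only the character at index 7 (a space)
middle = sample[8:17]      # from index 8 up to, but not including, index 17
last_three = sample[-3:]   # negative indexes count from the end of the string

print(repr(first_seven), repr(single_char), repr(middle), repr(last_three))
```

Wrapping each value in ```repr()``` makes spaces and other invisible characters easier to spot in the output.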

With ```poeString[:75]``` some of the text returned is familiar, but some of the text is less so. We don't need to get into the details here, but we see examples of a [Unicode Byte Order Mark](http://en.wikipedia.org/wiki/Byte_order_mark) at the beginning and of MS-Windows-based [newline characters](http://en.wikipedia.org/wiki/Newline) at the end.
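If these extra characters get in the way, they can be removed with ordinary string methods. The following is a hedged sketch on a made-up string that imitates the start of a downloaded file; whether you need this step depends on the text you fetched:

```python
# A made-up string: a byte order mark (\ufeff) followed by two lines
# that end with Windows-style newlines (\r\n).
raw = "\ufeffFirst line\r\nSecond line\r\n"

# Remove a leading byte order mark, if present.
cleaned = raw.lstrip("\ufeff")

# Convert Windows newlines (\r\n) to plain newlines (\n).
cleaned = cleaned.replace("\r\n", "\n")

print(repr(raw))
print(repr(cleaned))
```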

Next, let's use another handy built-in function ```count()```, and let's count the occurrences of the term _corpse_:

In [None]:
poeString.count("corpse")

Clearly Poe likes talking about corpses. Note that we are searching for _corpse_ (lowercase). This isn't the same as Corpse (capitalized). How many instances of the latter are there?
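One way to count a term regardless of capitalization is to convert the text to lowercase first. Here is a small sketch on a made-up sentence (not from Poe), so the exercise above is still yours to try:

```python
# A made-up sentence for illustration.
sample = "The corpse lay still. Corpse after corpse was found."

lower_only = sample.count("corpse")          # matches lowercase only
capitalized = sample.count("Corpse")         # matches capitalized only
any_case = sample.lower().count("corpse")    # lowercases the text, then counts

print(lower_only, capitalized, any_case)
```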

Our [Poe text](http://www.gutenberg.org/cache/epub/2147/pg2147.txt) is actually a volume of multiple texts. What if we wanted to isolate only one of the texts, such as "The Gold Bug"?

We can start by thinking of the string as a sequence of characters, where each character has an index position. Python, like many languages, starts its indexing at 0, so if we ignore the Unicode characters at the start, we get something like this:

|P|r|o|j|e|c|t| |G|u|t|e|n|b|e|r|g|'|s| |T|h|e|…|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|…|

So, for instance, we can ask to find the index position of the letter "G" from "Project Gutenberg's The":

In [None]:
string = "Project Gutenberg's The"
string.find("G")

The [find()](https://docs.python.org/3/library/stdtypes.html?highlight=index#str.find) function returns the index of the start of the string being matched (the string argument "G" above). The found index position will be the same for the full word. 

In [None]:
string.find("Gutenberg")

Note that ```find()``` is case sensitive, so looking for "p" will give a value of -1, which indicates that no match was found.

In [None]:
string.find("p")
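
Because -1 is an ordinary number, it is worth checking for it before using the result. A minimal sketch, reusing the same short string (the variable names are our own):

```python
string = "Project Gutenberg's The"

# There is no lowercase "p" in the string, so find() returns -1.
position = string.find("p")

# Lowercasing the string first makes the search ignore case.
position_any_case = string.lower().find("p")

if position == -1:
    print("no lowercase p found")
print("ignoring case, p is at index", position_any_case)
```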

To isolate the "The Gold Bug" in our Poe text, we might do something like the following (sometimes planning a program in natural language, rather than in computer code, can be useful):

1. find the index position of the start of the story, i.e. "THE GOLD-BUG"
1. find the index position of the end of the story, or the start of the next story, i.e. "FOUR BEASTS IN ONE"
1. create a new string that runs from the position found in step 1 to just before the position found in step 2

We know how to do the first two steps, and we've already seen a variant of the third step when we asked for the first few characters of the full Poe text. Let's first try in a simplified form to isolate "Gutenberg's" from our string "Project Gutenberg's The":

In [None]:
start = string.find("Gutenberg")
end = string.find(" The")
string[start:end]

The same principle applies to isolating the "The Gold Bug" story:

In [None]:
start = poeString.find("THE GOLD-BUG")
end = poeString.find("FOUR BEASTS IN ONE")
goldBugString = poeString[start:end]
# show start and end of goldBugString
goldBugString[:75] + " [rest of the text…] " + goldBugString[-75:]

Great - we've now loaded a text from a webpage, stored it in a variable, and isolated a particular section we're interested in.
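The three planned steps can also be wrapped in a small reusable function. This is only a sketch: ```extract_between``` is a hypothetical helper of our own (not part of Python or any library), and the sample string is made up. It raises an error when a marker is missing, since a -1 from ```find()``` would otherwise be treated as a position counted from the end of the string.

```python
def extract_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Hypothetical helper for illustration; not a library function.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError("start marker not found: " + start_marker)
    # Look for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        raise ValueError("end marker not found: " + end_marker)
    return text[start:end]


# A made-up miniature "volume" standing in for poeString.
sample = "PREFACE intro text THE GOLD-BUG story text FOUR BEASTS IN ONE more text"
story = extract_between(sample, "THE GOLD-BUG", "FOUR BEASTS IN ONE")
print(repr(story))
```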

Note: did you notice the comment in the previous code cell? In Python, lines beginning with a '#' are comments. They aren't executed by Python, and are another great way for you to document and explain what code does as you write it.

## Data types

At this point it is useful to know a little more about the concept of data types. We've already described text - words, sentences, paragraphs - as "strings". In Python, a string is a type of data we may use. Other types of data we might encounter are integers, booleans (True/False), lists, and sets. Data types can share properties, and may have certain types of functions that can use them in common. For instance, a string like "The Gold Bug" can be sliced, or it can be looped over to examine each character one by one. Similarly, a list of words, ```['The', 'Gold', 'Bug']``` can be sliced or looped over.

If you're wondering why we write ```string.find("a sub string")``` above, but used ```len(a string)``` before that, it is because ```find()``` is a more specific kind of function - one that can only be used with strings of text. We call such specific functions "methods" - in this case it is a string method. ```string.find("a sub string")``` specifies a string type object, then uses the find method to search for a given sub string. ```len()``` could be used with a range of numbers or other types of objects, so we don't need to attach it after a string object to use it.
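The difference between a general function and a string method can be seen side by side in a short sketch (the variable names are our own):

```python
title = "The Gold Bug"
words = ["The", "Gold", "Bug"]

# len() is a general function: it works on strings, lists and more.
title_length = len(title)   # number of characters
word_count = len(words)     # number of items in the list

# find() is a string method, so it is attached to a string object.
gold_position = title.find("Gold")

# Lists have no find() method; hasattr() checks this without causing an error.
string_has_find = hasattr(title, "find")
list_has_find = hasattr(words, "find")

print(title_length, word_count, gold_position, string_has_find, list_has_find)
```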

You can use the very handy function ```type()``` to check the datatype of any object. Try out the following:

In [None]:
type("The Gold Bug")

In [None]:
type(14)

In [None]:
type(['The', 'Gold', 'Bug'])

If your code is not working, it's often really useful to start debugging by testing the type of your inputs and outputs, to make sure you're using the appropriate methods and functions for the objects you're dealing with.
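As a sketch of that kind of check, the example below looks at the types returned by two string methods. Note that ```split()``` has not been introduced in this notebook; it is a standard string method that breaks a string into a list of words.

```python
title = "The Gold Bug"

position = title.find("Gold")   # find() returns an integer
words = title.split()           # split() returns a list of strings

print(type(title), type(position), type(words))

# isinstance() asks whether an object is of a particular type.
if isinstance(words, list):
    first_word = words[0]
    print("first word:", first_word)
```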

## Accessing Local Plain Texts

Code that relies on accessing content from the web is convenient, though not nearly as robust as code that reads local files. Content can change or disappear from the web, and maybe you want to work on your Notebook in a remote location or in an airplane without internet connectivity. Moreover, accessing content from your local machine is typically much faster than interacting with web-based content.

What we'll do in the next section is the following:

1. create a local directory for data (if necessary)
1. open a new file and write our goldBugString to the file
1. (re)open the file and read from it

Let's begin by creating a new subdirectory (relative to the current notebook directory), using the [os](https://docs.python.org/3/library/os.html) module.

In [None]:
import os
directory = "data"
if not os.path.exists(directory):
    os.makedirs(directory)

This demonstrates a [conditional structure in Python](http://en.wikibooks.org/wiki/Python_Programming/Conditional_Statements) where we test for a boolean value (true or false) of whether or not our directory [exists](https://docs.python.org/3/library/os.path.html?highlight=exists#os.path.exists).

Python uses a colon and indentation (usually four spaces) to indicate the parts of the conditional block. If we want to execute a block when a condition evaluates to true (like ```1 < 5```, one _is_ smaller than five):

<blockquote><pre>if _condition_:
    _block_</pre></blockquote>

Or if a condition is not true (like ```1 > 5```, since one _is not_ greater than five):

<blockquote><pre>if *not* _condition_:
    _block_</pre></blockquote>
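
Both forms can be tried on a string. A minimal sketch, with made-up conditions that are different from the exercise further down:

```python
title = "The Gold Bug"

# A condition that is true: the block runs.
is_long = len(title) > 5
if is_long:
    print("the title has more than five characters")

# A condition that is false: "not" flips it, so the block still runs.
starts_with_a = title.startswith("A")
if not starts_with_a:
    print("the title does not start with A")
```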

If the _data_ directory doesn't exist, we create it using [makedirs()](https://docs.python.org/3/library/os.html?highlight=mkdirs#os.makedirs).
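As an aside, in Python 3 ```os.makedirs()``` also accepts an ```exist_ok``` argument that makes the existence test unnecessary. The sketch below works inside a temporary folder (an assumption made so that the example does not touch your notebook directory):

```python
import os
import tempfile

# Create a throwaway folder and build a "data" path inside it.
base = tempfile.mkdtemp()
directory = os.path.join(base, "data")

# With exist_ok=True no error is raised if the directory already exists.
os.makedirs(directory, exist_ok=True)
os.makedirs(directory, exist_ok=True)  # a second call is harmless

print(os.path.isdir(directory))
```

The explicit ```if not os.path.exists(directory)``` version remains useful here because it shows how a conditional works.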

In the cell below, try writing an ```if``` expression to test if ```1 < 5``` (i.e., if 1 is less than 5) and print "My maths can cope with that" if the expression is true. Use four spaces to indent the conditional block.

Now that we have a data directory, we need to open a new file in write ("w") mode and write out the string contents of goldBugString. Since we opened the file for writing, we also need to ```close()``` it.

In [None]:
f = open("data/goldBug.txt", "w")
f.write(goldBugString)
f.close()

The ```open()``` function returns a file object (that we've named ```f```) to which we can write contents. Assuming things worked out, we can now turn around and open the file in read mode ("r" instead of "w"), read the contents into a new variable that we'll call ```goldBugString2```, and then close the file.

In [None]:
f = open("data/goldBug.txt", "r")
goldBugString2 = f.read()
f.close()

Let's have a peek at the contents in our ```goldBugString2``` variable, the same way we did before.

In [None]:
goldBugString2[:75] + " [rest of the text…] " + goldBugString2[-75:]
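
Python also offers the ```with``` statement, which closes a file automatically once its block ends. The sketch below is one possible variant, not a replacement for the cells above; it writes a made-up sample text to a temporary folder and names the encoding explicitly, which is an assumption on our part about what suits most plain texts:

```python
import os
import tempfile

# A made-up sample text standing in for goldBugString.
sample_text = "A short sample text.\nSecond line.\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# The file is closed automatically when the indented block ends.
with open(path, "w", encoding="utf-8") as f:
    f.write(sample_text)

with open(path, "r", encoding="utf-8") as f:
    sample_text2 = f.read()

print(sample_text == sample_text2, f.closed)
```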

## Listing Files in a Local Directory

As with many things in programming languages like Python, there's more than one way of listing files in a directory. We're going to introduce a way here that also introduces a loop, a process that is repeated multiple times for each element in a list or for as long as a condition is true.

But first let's start with the [glob()](https://docs.python.org/3/library/glob.html?highlight=glob#glob.glob) function that allows us to list the files in a directory.

In [None]:
import glob
textFiles = glob.glob("data/*.txt")
textFiles

The results are shown as a list (delimited by the square brackets), with each element inside separated by a comma (here we only have one element because we only have one file so far).

We can ask what kind of object our ```textFiles``` variable contains.

In [None]:
type(textFiles)

Lists are a type of variable that lend themselves to loops or to iterating over each element. For instance, to show each filename with the number of characters, we could do something like this:

In [None]:
totalCharacters = 0
for textFile in textFiles:
    f = open(textFile, "r")
    textString = f.read()
    f.close()
    chars = len(textString)
    print(textFile, "has", chars, "characters")
    totalCharacters += chars
print("total characters: ", totalCharacters)

The code above is of the general form

<blockquote><pre> for _item_ in _list_:
    _block_</pre></blockquote>

In other words, for each item in our ```textFiles``` list, we execute the block where ```textFile``` is the local variable holding the item in the list. Just as with the conditionals, the colon and indentation indicate what the loop condition is (as long as more elements exist in the list) and what block to execute for each iteration.
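The same loop structure works on any list, not only on file names. A minimal sketch with a short list of words (the variable names are our own):

```python
words = ["The", "Gold", "Bug"]
total_letters = 0

for word in words:
    # This indented block runs once per item; word holds the current item.
    print(word, "has", len(word), "letters")
    total_letters = total_letters + len(word)

# This line is not indented, so it runs once, after the loop has finished.
print("total letters:", total_letters)
```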

In the code above we're also calculating the total number of characters (tracking them in a variable that we've called ```totalCharacters```). Each time we iterate over the list of files, we add the number of characters in the current file.

> ```python
> totalCharacters += chars
> ```

The += operator is a compact way to add a value to an existing variable. It's the equivalent of this:

> ```python
> totalCharacters = totalCharacters + chars
> ```

Finally, we're using the ```print()``` function here because it's a simple way of combining a string ("total characters: ") and a number (```totalCharacters```). We can't use the ```+``` operator directly because it can't join a string and a number.
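If we do want a single combined string, the number has to be converted first. The sketch below shows two common ways, using a made-up value for ```totalCharacters```:

```python
totalCharacters = 1234  # a made-up number for illustration

# str() converts the number to a string, so + can join the two strings.
message = "total characters: " + str(totalCharacters)

# An f-string inserts the value directly into the text (Python 3.6 and later).
message2 = f"total characters: {totalCharacters}"

print(message)
print(message2)

# Without the conversion, + raises a TypeError.
try:
    "total characters: " + totalCharacters
except TypeError as error:
    print("TypeError:", error)
```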

## Next Steps

Here are some tasks to try:

* how would you create a subdirectory called ```Austen``` under the working directory you've already created for these classes?
* for each of the plain text novels in English of [Jane Austen](http://www.gutenberg.org/ebooks/author/68) in Project Gutenberg
  * how would you isolate the text content (without the Project Gutenberg header and footer)?
  * how would you save the text-only content into the ```data/Austen``` directory?
* how would you loop over the files in the ```data/Austen``` directory and for each one print the file name and a count of "his" and "her"?
* what is the total number of characters in the Austen corpus?

In the next notebook ([Getting NLTK](GettingNltk.ipynb)) we're going to introduce the Natural Language Toolkit that provides a huge number of useful functions for text analysis.

---
From [The Art of Literary Text Analysis](https://github.com/sgsinclair/alta) by [Stéfan Sinclair](http://stefansinclair.name) &amp; [Geoffrey Rockwell](http://geoffreyrockwell.com), [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/)<br />
Created January 12, 2015