# Text Preparation

This notebook prepares a text file for use by other notebooks, most notably the notebook on LSTMs. Give it a URL for the file, or a path to a local copy, and it will apply a few clean-up steps that make the file easier to work with. [Project Gutenberg](http://www.gutenberg.org/) is a good place to look for free text files. This example uses Shakespeare, but the Bible, or any other long single work, would also be a good choice. You could also get more creative and use source code or a LaTeX file, for instance; see [this blog](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) for inspiration.

This is the flow of the notebook:

1. Remove very rare characters from the file. We will one-hot encode the file on a character basis, so each distinct character adds another dimension. If a symbol appears only 5 times in a corpus of several million words, it is not worth worrying about.
2. Optionally trim the beginning and the end of the file, as many of the Gutenberg files have a preamble that is not part of the book and may confuse our algorithms.
3. Create another version of the file that contains only lower-case ASCII letters, spaces, full stops (periods) and commas. This simplified version of the text may make things easier (but may also be less interesting to learn).

Note that this assumes your corpus is in one file and fits in memory. Attempting something more ambitious will require some modifications.
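
To make step 1 concrete, here is a minimal sketch of character-level one-hot encoding in plain Python. The function name `one_hot_encode` is an assumption for illustration only; the LSTM notebook may well encode the text differently. It shows how every distinct character, however rare, costs one extra dimension.

```python
def one_hot_encode(text):
    """Hypothetical helper: one vector per character, one dimension per distinct character."""
    # Sort the alphabet so the character-to-index mapping is reproducible.
    chars = sorted(set(text))
    index = {c: i for i, c in enumerate(chars)}
    vectors = []
    for c in text:
        row = [0] * len(chars)
        row[index[c]] = 1
        vectors.append(row)
    return vectors, chars

vectors, chars = one_hot_encode("abca")
print(chars)    # the alphabet, one entry per dimension
print(vectors)  # one row per character of the input

# A single stray symbol adds a whole dimension to every vector.
_, chars_with_stray = one_hot_encode("abca~")
print(len(chars), len(chars_with_stray))
```

Dropping the stray symbol before encoding keeps the vectors as short as possible, which is the motivation for removing rare characters first.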

In [4]:
import os, requests, re, string
from collections import Counter

In [5]:
string.ascii_lowercase
string.whitespace

' \t\n\r\x0b\x0c'

In [6]:
def fetch_file(url, name):
    """
    Downloads a text file and saves it to disk.
    """
    file_name = os.path.join(BASE_DIR, name + '.txt')
    r = requests.get(url)
    # Fail loudly on an HTTP error rather than saving an error page.
    r.raise_for_status()
    with open(file_name, 'wt') as f:
        f.write(r.text)


In [7]:
def load_file(name):
    """
    Loads a saved text file from disk and returns it as a string.
    """
    file_name = os.path.join(BASE_DIR, name + '.txt')
    with open(file_name, 'rt') as f:
        text = f.read()
    return text

In [8]:
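def strip(text, first_line, last_line=None):
    """
    Strips the beginning and the end of the text based
    on given first and last lines. They must be unique
    and exactly as they are in the text.
    """
    start = text.find(first_line)
    # str.find returns -1 on failure, which would silently slice wrongly.
    if start == -1:
        raise ValueError("first_line not found in text")
    if last_line is None:
        return text[start:]
    # Only look for the last line after the first one.
    end = text.find(last_line, start)
    if end == -1:
        raise ValueError("last_line not found in text")
    return text[start:end + len(last_line)]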
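
Since the first and last lines must be unique and must match the text exactly, it may be worth checking that before stripping. The sketch below uses a hypothetical helper, `marker_counts`, which is not part of the notebook; it simply counts how often each candidate line occurs, so anything other than 1 signals a problem.

```python
def marker_counts(text, *markers):
    """Hypothetical helper: how many times each marker occurs in text."""
    return {m: text.count(m) for m in markers}

sample = "PREAMBLE\nFirst line.\nMiddle.\nLast line.\nLICENCE"
counts = marker_counts(sample, "First line.", "Last line.", "Missing line.")

for marker, n in counts.items():
    status = "ok" if n == 1 else "not unique or not found"
    print(repr(marker), n, status)
```

A count of 0 usually means a typo or a whitespace difference; a count above 1 means a longer, more distinctive line is needed.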

In [9]:
def strip_rare_chars(text, cutoff=64):
    """
    Takes a string and removes characters that are deemed rare,
    i.e. those that occur fewer than cutoff times.
    """
    counts = Counter(text)
    to_remove = "".join([c for c in counts if counts[c] < cutoff])
    # Not very efficient, but readable, and speed doesn't seem to
    # matter too much for the toy examples I'm playing with.
    new_text = text
    for r in to_remove:
        new_text = new_text.replace(r, '')
    return new_text
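
If the repeated `replace` calls ever become a bottleneck on a larger corpus, one possible alternative is to delete all rare characters in a single pass with `str.translate`. This is only a sketch; the name `strip_rare_chars_translate` is made up here, and the behaviour is intended to match the function above.

```python
from collections import Counter

def strip_rare_chars_translate(text, cutoff=64):
    """Sketch: remove characters occurring fewer than cutoff times, in one pass."""
    counts = Counter(text)
    rare = "".join(c for c, n in counts.items() if n < cutoff)
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans("", "", rare))

sample = "aaaa bbbb aaaa z"
result = strip_rare_chars_translate(sample, cutoff=2)
print(repr(result))
```

Here only `z` occurs fewer than twice, so it is the only character removed.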

In [27]:
def simplify_text(text):
    """
    Returns a copy of the text with only lower case ascii, full stop,
    comma and space, and removes multiple white space. I.e. if there
    is a space more than once it changes that to be only one space.
    """

    new_text = text.lower()
    # Change all white space to spaces
    for w in string.whitespace:
        new_text = new_text.replace(w, ' ')
    # Drop every character outside the allowed set
    allowed_chars = set(string.ascii_lowercase + ' .,')
    new_text = "".join([c for c in new_text if c in allowed_chars])
    # And remove repeated whitespace
    new_text = " ".join(new_text.split())
    return new_text
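
The same simplification can also be sketched with the `re` module, which may read more compactly. The function name `simplify_text_regex` is an assumption for illustration. The `re.ASCII` flag restricts `\s` to the six characters in `string.whitespace`, so that other Unicode whitespace is dropped rather than turned into a space.

```python
import re

def simplify_text_regex(text):
    """Sketch: keep only a-z, full stop, comma and single spaces."""
    lowered = text.lower()
    # Remove everything that is not an allowed character or ASCII whitespace.
    kept = re.sub(r"[^a-z.,\s]", "", lowered, flags=re.ASCII)
    # Collapse each run of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", kept, flags=re.ASCII).strip()

simplified = simplify_text_regex("Hello,\n\tWorld!  It's 42.")
print(repr(simplified))
```

Note that digits and apostrophes disappear entirely, so "It's" becomes "its" and a number leaves only the punctuation around it.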

In [45]:
def save_text(text, name):
    """
    Saves the text to file.
    """
    file_name = os.path.join(BASE_DIR, name + '.txt')
    print(file_name)
    with open(file_name, 'wt') as f:
        f.write(text)

In [31]:
def summarise(text):
    """
    Prints a few basic facts about the text.
    """
    total_words = len(text.split())
    total_chars = len(text)
    unique_chars = len(set(text))
    print("Words:{} Chars:{} Unique Chars:{}".format(total_words, total_chars, unique_chars))
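
`summarise` reports only totals. When choosing a value for `cutoff` in `strip_rare_chars`, it can help to see which characters are actually rare. The following sketch uses a hypothetical helper, `rarest_chars`, that lists the least frequent characters with their counts.

```python
from collections import Counter

def rarest_chars(text, n=5):
    """Hypothetical helper: the n least frequent characters, rarest first."""
    counts = Counter(text)
    # Sort by count, then by character so that ties come out in a stable order.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))[:n]

rare = rarest_chars("hello world", n=3)
print(rare)
```

On a real corpus, characters with counts well below the chosen cutoff are the ones that `strip_rare_chars` will remove.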

Settings:
* BASE_DIR is where you keep all your data; it must already exist. The value given below is recommended.
* name is a base name for referring to this corpus, without extensions etc., such as 'shake' for the works of Shakespeare. Extensions and other appendages will be handled for you.
* url points to the text file on the web that you want to download. If you already have your own file, you can join the workflow after the download step.
* first_line and last_line optionally specify the first and last lines of the expected corpus; they are used to strip title matter etc. from an existing file.
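
Because the data directory must exist before anything is saved, one option is to create it up front. This is a sketch, not something the notebook does; `ensure_dir` is a hypothetical helper, and it is demonstrated on a temporary directory so that it does not touch your real data folder.

```python
import os
import tempfile

def ensure_dir(path):
    """Hypothetical helper: create path (and any parents) if it is missing."""
    os.makedirs(path, exist_ok=True)
    return path

# Demonstrated inside a throwaway directory; in the notebook you would
# pass BASE_DIR instead.
with tempfile.TemporaryDirectory() as tmp:
    base_dir = ensure_dir(os.path.join(tmp, "data", "text"))
    created = os.path.isdir(base_dir)
    # Calling it again on an existing directory is harmless.
    ensure_dir(base_dir)

print(created)
```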

In [46]:
BASE_DIR = "../data/text"
url = "http://www.gutenberg.org/files/100/100-0.txt"
name = "shake"

In [47]:
first_line = "From fairest creatures we desire increase"
last_line = "Means to immure herself and not be seen."

In [48]:
fetch_file(url, name)

In [49]:
text = load_file(name)
text = strip(text, first_line, last_line)
text = strip_rare_chars(text)
save_text(text, 'clean' + name)

In [50]:
summarise(text)

And make a simplified version of the text as well...

In [54]:
simple_text = simplify_text(text)
save_text(simple_text, 'simple' + name)

In [55]:
summarise(simple_text)

Words:955232 Chars:5147125 Unique Chars:29


Check that they are there.

In [58]:
! ls -lh ../data/text

total 32616
-rw-r--r--  1 simontudge  staff   5.4M  8 Mar 10:36 cleanshake.txt
-rw-r--r--  1 simontudge  staff   5.6M  8 Mar 10:36 shake.txt
-rw-r--r--  1 simontudge  staff   4.9M  8 Mar 10:36 simpleshake.txt
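
The `ls` command above is only available on Unix-like systems. A portable check can be sketched in Python; `file_sizes` is a hypothetical helper, shown here on a temporary directory with a stand-in file rather than on the real data folder.

```python
import os
import tempfile

def file_sizes(directory):
    """Hypothetical helper: map each regular file in directory to its size in bytes."""
    sizes = {}
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isfile(path):
            sizes[entry] = os.path.getsize(path)
    return sizes

# Stand-in for the data directory; in the notebook you would pass BASE_DIR.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "simpleshake.txt"), "wb") as f:
        f.write(b"to be, or not to be")
    sizes = file_sizes(tmp)

print(sizes)
```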
