## 2. Data Preparation

In [1]:
# import libraries
import re
import string

In [2]:
# define function to load an entire text file into memory and return it
def load_doc(filename):
    # open the file read-only; the with block closes it automatically
    with open(filename, 'r', encoding='utf-8') as file:
        text = file.read()           # read all the text
    return text

In [3]:
# load Metamorphosis text, meta_text.txt 
in_filename = '../data/meta_text.txt'
doc = load_doc(in_filename)

# preview first 1000 characters
print(doc[:1000])

I

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin. He lay on his
armour-like back, and if he lifted his head a little he could see his
brown belly, slightly domed and divided by arches into stiff sections.
The bedding was hardly able to cover it and seemed ready to slide off
any moment. His many legs, pitifully thin compared with the size of the
rest of him, waved about helplessly as he looked.

“What’s happened to me?” he thought. It wasn’t a dream. His room, a
proper human room although a little too small, lay peacefully between
its four familiar walls. A collection of textile samples lay spread out
on the table—Samsa was a travelling salesman—and above it there hung a
picture that he had recently cut out of an illustrated magazine and
housed in a nice, gilded frame. It showed a lady fitted out with a fur
hat and fur boa who sat upright, raising a heavy fur muff that covered
the whole of her lower arm towards 

### 2.3 Clean Text

To transform the raw text into a sequence of tokens for training the model, the following operations are applied in order:
- Replace '-' with a space so that hyphenated words split into separate tokens
- Split the text into tokens on white space
- Remove all punctuation from each token
- Remove all tokens that are not purely alphabetic
- Normalize all tokens to lowercase

In [4]:
# define function to convert document into tokens
def clean_doc(doc):
    # replace '-' with ' ' 
    doc = doc.replace('-', ' ')
    # split doc into tokens by space
    tokens = doc.split()           
    # build a regex that matches any single character in string.punctuation
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]  
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()] 
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

In [5]:
# convert document into tokens
tokens = clean_doc(doc)
# print first 200 tokens
print(tokens[:200])

['one', 'morning', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'he', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'his', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'happened', 'to', 'he', 'thought', 'it', 'a', 'dream', 'his', 'room', 'a', 'proper', 'human', 'room', 'although', 'a', 'little', 'too', 'small', 'lay', 'peacefully', 'between', 'its', 'four', 'familiar', 'walls', 'a', 'collection', 'of', 'textile', 'samples', 'lay', 'spread', 'out'
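In the token list above, the line “What’s happened to me?” contributes only 'happened' and 'to', and 'wasn’t' is missing entirely. `string.punctuation` contains ASCII punctuation only, so tokens carrying typographic quotes or apostrophes keep those characters and are then discarded by the `isalpha()` filter. The sketch below reproduces that behaviour on one sentence and compares it with a possible alternative that strips punctuation by Unicode category. It is an illustration, not the cleaning used in this notebook: `clean_doc_ascii`, `clean_doc_unicode` and `strip_unicode_punct` are hypothetical names introduced here.

```python
import re
import string
import unicodedata

def clean_doc_ascii(text):
    """Same steps as clean_doc in this notebook (ASCII punctuation only)."""
    text = text.replace('-', ' ')
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in text.split()]
    return [w.lower() for w in tokens if w.isalpha()]

def strip_unicode_punct(word):
    """Hypothetical helper: drop characters whose Unicode category starts with 'P'."""
    return ''.join(ch for ch in word
                   if not unicodedata.category(ch).startswith('P'))

def clean_doc_unicode(text):
    """Assumed alternative: treat every dash (category 'Pd') as a separator
    and strip all Unicode punctuation before the alphabetic filter."""
    text = ''.join(' ' if unicodedata.category(ch) == 'Pd' else ch
                   for ch in text)
    tokens = [strip_unicode_punct(w) for w in text.split()]
    return [w.lower() for w in tokens if w.isalpha()]

sample = '“What’s happened to me?” he thought. It wasn’t a dream.'
ascii_tokens = clean_doc_ascii(sample)
unicode_tokens = clean_doc_unicode(sample)
print(ascii_tokens)
print(unicode_tokens)
```

With the Unicode variant, 'whats', 'me' and 'wasnt' survive as tokens. Note that it does not remove symbol characters such as '$' or '+', which `string.punctuation` does cover, and switching to it would change the token counts reported in this section.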

In [6]:
# print total number of tokens 
print('Total number of tokens: ', len(tokens))
# convert tokens into a set and print the number of unique tokens
print('Number of unique tokens: ', len(set(tokens)))

Total number of tokens:  21347
Number of unique tokens:  2533


### 2.4 Save Clean Text as Sequences

#### Sequence of 50 + 1 Words
For the reference model, we train on sequences of 50 words. The tokens are organized into sequences of 50 input words plus 1 output word, giving 51 words per sequence. A 51-token window slides across the token list one token at a time: for each end index i starting at 51, the preceding 51 tokens (tokens[i-51:i]) form one sequence. Each sequence is saved as a space-separated string.

In [7]:
# organize token list into sequences
length = 50 + 1
sequences = []

# slide a window of `length` tokens across the list, one token at a time
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store each line in sequences
    sequences.append(line)
    
# print total number of sequences
print('Total Sequences: %d' % len(sequences))

Total Sequences: 21296
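The count above follows from the loop bounds: with 21347 tokens and a window of 51, the end index runs from 51 to 21346, which gives 21347 - 51 = 21296 sequences. The toy sketch below shows the same sliding window on six tokens. `make_sequences` and its `include_last` flag are hypothetical names used only for this illustration.

```python
def make_sequences(tokens, length, include_last=False):
    """Slide a window of `length` tokens across `tokens`, one step at a time.

    With include_last=False the bounds match the loop in this notebook,
    so the window that ends on the final token is not generated.
    """
    stop = len(tokens) + 1 if include_last else len(tokens)
    return [' '.join(tokens[i - length:i]) for i in range(length, stop)]

toy_tokens = ['a', 'b', 'c', 'd', 'e', 'f']

# 2 input words + 1 output word per sequence
toy_sequences = make_sequences(toy_tokens, 3)
print(toy_sequences)        # 6 - 3 = 3 sequences

# extending the upper bound by one also yields the window 'd e f'
toy_sequences_all = make_sequences(toy_tokens, 3, include_last=True)
print(toy_sequences_all)
```

Including the final window would add one sequence to each of the totals reported in this section, so the notebook's bounds are kept as they are here.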


In [8]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    # write as UTF-8 to match load_doc; the with block closes the file
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(data)

In [9]:
# save sequences to file
out_filename = '../data/Text_Sequences_50_meta.txt'
save_doc(sequences, out_filename)
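Before training on the saved file, it can be worth confirming that it reads back unchanged and that every line has the expected number of words. The sketch below does this round trip on a temporary file with toy data rather than the project paths. The helper definitions are repeated so the block runs on its own, and the file name `toy_sequences.txt` is an assumption made for the example.

```python
import os
import tempfile

def save_doc(lines, filename):
    # write one sequence per line, UTF-8 encoded
    with open(filename, 'w', encoding='utf-8') as file:
        file.write('\n'.join(lines))

def load_doc(filename):
    # read the whole file back as a single string
    with open(filename, 'r', encoding='utf-8') as file:
        return file.read()

toy_sequences = ['one morning when', 'morning when gregor']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_sequences.txt')
    save_doc(toy_sequences, path)
    lines = load_doc(path).split('\n')

# every line should contain the same number of words
lengths = {len(line.split()) for line in lines}
print(lines == toy_sequences, lengths)
```

For the real file, the set of lengths should contain the single value 51.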

#### Sequence of 100 + 1 Words
We would like to explore whether a longer sequence length improves model performance. Although the average paragraph length was found to be approximately 200 words, training on 200-word sequences would take much longer. We therefore use a sequence length of 100 words to observe the effect of a longer sequence.

In [10]:
# organize token list into sequences
length = 100 + 1
sequences = []

for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store each line in sequences
    sequences.append(line)
    
# print total number of sequences
print('Total Sequences: %d' % len(sequences))

Total Sequences: 21246


In [11]:
# save sequences to file
out_filename = '../data/Text_Sequences_100_meta.txt'
save_doc(sequences, out_filename)