# Project Albion

## Important notice
- this notebook's model already contains trained weights *(consult bottom of readme file for more info)*
    - *upsides*: No equipment / time required to train the model
    - *downsides*: You can't play around with the model / number of epochs to try to make the generations better
- notice that in section 2, code that has already been shown will be shown again. This is due to the fact that the loops are very long and splitting them up into different cells would give an incomplete output
    - a complete file that you'll be able to run e.g. in Atom will be provided as Project_Albion_gpu_complete



Let's get started!

## 1. Develop the network 

### Import packages
- Links to the packages can be found in the readme file
- We're using the Keras Sequential Model and pimp it up with LSTMs, Dropout layers, 
- `ModelCheckpoint` is a Keras method to save the model's weights after each epoch

In [312]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import CuDNNLSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
import numpy as np
import math

### Get dataset ready

Make dataset ready to be processed and give names to important quantities.

In [313]:
# load and lower file
filename = "dataset.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()

In [314]:
# see what chars come up in raw_text
chars = sorted(list(set(raw_text)))
print("\nCharacters occurring in dataset: ")
print(chars)


Characters occurring in dataset: 
['\n', ' ', "'", '-', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [315]:
# create mapping from characters to integers
char_to_int = dict((c,i) for i, c in enumerate(chars))
print("\nCharacter to integer mapping: ")
print(char_to_int)


Character to integer mapping: 
{'\n': 0, ' ': 1, "'": 2, '-': 3, 'a': 4, 'b': 5, 'c': 6, 'd': 7, 'e': 8, 'f': 9, 'g': 10, 'h': 11, 'i': 12, 'j': 13, 'k': 14, 'l': 15, 'm': 16, 'n': 17, 'o': 18, 'p': 19, 'q': 20, 'r': 21, 's': 22, 't': 23, 'u': 24, 'v': 25, 'w': 26, 'x': 27, 'y': 28, 'z': 29}


In [316]:
# give names to important quantities
n_chars = len(raw_text)
n_vocab = len(chars)
print("\nTotal number of characters in dataset: ")
print(n_chars)
print("Number of different characters in dataset: ")
print(n_vocab)



Total number of characters in dataset: 
147516
Number of different characters in dataset: 
30


In [317]:
# determine size of time window based upon which the next character is predicted
seq_length = 28

- 1 place name will be padded as follows: 111111111111111111111111111place_name11111111111111\n
- 27 ones (=spaces) at beginning, then the name, then more spaces such that including the \n, the place_name has 55 chars
- a window of size 28 chars will be slid across the padded place name (dataX)
- the label of the window is the char following the window (dataY)


In [318]:
# Create training examples (time windows and labels (dataX & dataY))
dataX = []
dataY = []
max_length = 28 # longest place name has 28 chars
seq_length = 28 # each window contains 28 chars
word_length = 55 # every place_name's length after padding, \n included

In [319]:
# pad words and save all windows & labels to a list
with open(filename, "r") as place_names:
    for place_name in place_names:
        place_name = place_name.lower()
        place_name = place_name[:len(place_name)-1] # cut off \n
        place_name = place_name.ljust(27, " ") # pad place name with spaces to reach length max_length -1
        place_name = place_name + "\n" #reattach newline symbol, now place name has 28 chars
        for i in range(0, 27):
            place_name = " " + place_name # put max_length -1 (27) spaces at front
        for i in range(0, word_length - seq_length, 1):
            seq_in = place_name[i:i + seq_length] # seq_in on place_name level: first two letters
            seq_out = place_name[i + seq_length] #
            dataX.append([char_to_int[char] for char in seq_in])
            dataY.append(char_to_int[seq_out])

In [320]:
# number of training examples
n_patterns = len(dataX)
np_dataX = np.zeros(n_patterns) # network wants input as np array, so initialise it
print("\nTotal number of patterns: ")
print(n_patterns)


Total number of patterns: 
396333


In [321]:
# create reverse mapping (from integers to characters)
int_to_char = dict((i, c) for i, c in enumerate(chars))

### Prepare input

The LSTM network requires that we transform the list of input sequences to an array of shape [samples, time steps, features].

In [322]:
# convert list into array
np_dataX = np.asarray(dataX, dtype=object)

In [323]:
# must transform list of input sequences to [samples, time steps, features] expected by an LSTM network
X = np.reshape(np_dataX, (n_patterns, 28, 1))
print(X)

[[[1]
  [1]
  [1]
  ...
  [1]
  [1]
  [4]]

 [[1]
  [1]
  [1]
  ...
  [1]
  [4]
  [5]]

 [[1]
  [1]
  [1]
  ...
  [4]
  [5]
  [5]]

 ...

 [[1]
  [1]
  [1]
  ...
  [1]
  [1]
  [1]]

 [[1]
  [1]
  [29]
  ...
  [1]
  [1]
  [1]]

 [[1]
  [29]
  [8]
  ...
  [1]
  [1]
  [1]]]


In [324]:
#rescale integers in input array to range [0,1] to make the patterns learnable for the sigmoid (default) in LSTM
X = X / float(n_vocab)

In [325]:
# convert output labels (dataY) into one-hots (vectors of zero, except in 
# position i, if i is the integer mapped to the output character)
y = to_categorical(dataY)

### Define the model

In [326]:
model = Sequential()
model.add(CuDNNLSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.01))
model.add(CuDNNLSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.01))
model.add(CuDNNLSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.01))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')


#### Define the checkpoint 

The ModelCheckpoint method automatically saves the best weights to a file, that can be plugged into the model later. 

In other files, this allows you to just generate words if you comment out the `model.fit` command and let everything else run. 

However, this notebook is already designed to be run with pre-trained weights so you don't need to change anything other than the weights file.

In [327]:
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, mode='min')

### Fit model to data (not in this file though)

Normally, we use the following command to finally train the previously defined model on the data. However, we already have the weights from a pre-trained file, so there is no need to train a model and generate weights. 

You should therefore leave `model.fit(...)` commented out.

In [328]:
###model.fit(X, y, epochs=150, batch_size=128, callbacks=[checkpoint])

## 2. Generating place names

### Load weights into model

In [329]:
# load the network weights from fitting, network already trained
filename = "weights-improvement-148-0.3121.hdf5" ### if you want, replace this file with whatever weights you want to use
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

### Prepare for sampling characters

We need to
- decide how many place names we want to generate in one go
- initialise an empty list that will be filled with the indices of the generated characters later
- give the model a "first letter" based upon which it can generate more letters (that is called *seed*)
    - Since we've padded the place names with many spaces (i.e. 1s), a window that just contains a first letter would be one that has 27 ones at the beginning and then an integer at the end (the first letter). It would look like this: 
    `1111111111111111111111111114`
    - The above sequence would generate an "a" as input.
    - based on this seed sequence, a well-trained model would now generate something like `1111111111111111111111111147`, so now we have "ad". This then gets fed back into the model and for example `1111111111111111111111111477` is generated.
    - note that those sequences are the above-mentioned *windows*, all of the same length (28).

In [330]:
# How many place names would you like to generate? It's 30 here.
for samples in range(30):
    m = samples + 1
    print("SAMPLE " + str(m) + ":")
    # stop generating new characters once the end of line char has been generated
    # this inner while loop generates exactly one word
    index = 5
    int_sample = [] #list that will carry indices of sampled characters in one generated word
    # pick random seed (i.e. first letter) from input array
    random_number = np.random.randint(0, len(dataX))
    pattern = np_dataX[random_number]
    for i in range(0, len(pattern)-1):
        while not ( pattern[i] == 1 and pattern[len(pattern)-1] != 1 ):
            random_number = np.random.randint(0, len(dataX))
            pattern = np_dataX[random_number]

    print("\nSeed:")
    one_line_pattern = ""
    print(pattern)
    print("Seeded character:")
    print(str(pattern[len(pattern)-1]) + " --> " + int_to_char[pattern[len(pattern)-1]])
    int_sample.append(pattern[len(pattern)-1]) # append seed to integer sample list
    print("\n")
    print("\n")

SAMPLE 1:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10]
Seeded character:
10 --> g




SAMPLE 2:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 12]
Seeded character:
12 --> i




SAMPLE 3:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 16]
Seeded character:
16 --> m




SAMPLE 4:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 26]
Seeded character:
26 --> w




SAMPLE 5:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8]
Seeded character:
8 --> e




SAMPLE 6:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 16]
Seeded character:
16 --> m




SAMPLE 7:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4]
Seeded character:
4 --> a




SAMPLE 8:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 21]
Seeded character:
21 --> r




SAMPLE 9:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 22 23 1 16]
Seeded character:
16 --> m




SAMPLE 10:

Seed:
[1 1 1 1 1 1 1 1 1 1 

### Sample characters

`while index != 0 and c < 55` ensures that we don't generate words that are infinitely long. The number of letters is capped at 55. Also, the algorithm stops generating new characters once the newline symbol `\n` , which corresponds to index `0`, appears. 


With `np.argmax(prediction)` , we check which index has been deemed most likely to appear given the preceding letters.

In the 2nd part, we generate a new window, attaching the generated letter at the end, via its corresponding integer. E.g. `1111111111111111111111111114` turns into `11111111111111111111111111145` 

In [331]:
# How many place names would you like to generate? It's 30 here.
for samples in range(30):
    m = samples + 1
    print("SAMPLE " + str(m) + ":")
    # stop generating new characters once the end of line char has been generated
    # this inner while loop generates exactly one word
    index = 5
    int_sample = [] #list that will carry indices of sampled characters in one generated word
    # pick random seed (i.e. first letter) from input array
    random_number = np.random.randint(0, len(dataX))
    pattern = np_dataX[random_number]
    for i in range(0, len(pattern)-1):
        while not ( pattern[i] == 1 and pattern[len(pattern)-1] != 1 ):
            random_number = np.random.randint(0, len(dataX))
            pattern = np_dataX[random_number]

    print("\nSeed:")
    one_line_pattern = ""
    print(pattern)
    print("Seeded character:")
    print(str(pattern[len(pattern)-1]) + " --> " + int_to_char[pattern[len(pattern)-1]])
    int_sample.append(pattern[len(pattern)-1]) # append seed to integer sample list    
    
    
    
    
    c = 0 # will count number of generated characters, max 50
    while index != 0 and c < 55:
        c = c+1
        # the normalised seed, in integers ("pattern")
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(n_vocab) # normalise values to range [0,1]

        # making a prediction
        prediction = model.predict(x, verbose=0) # can set verbose to 1 if you want more details

        index = np.argmax(prediction)

        # update seed to: cut off first space (on the left) and add the prediction on the right
        pattern = np_dataX[random_number + c]
        # replace rightmost number by generated one (but use that array to have the shape)
        pattern[len(pattern)-1] = index
        # attach generated character to integer sample list
        int_sample.append(index)
        
        
    # print generated place name in integers
    print("sample in integers:")
    print(int_sample)
    print("\n")
    print("\n")



SAMPLE 1:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5]
Seeded character:
5 --> b
sample in integers:
[5, 4, 15, 8, 26, 18, 17, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]




SAMPLE 2:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6]
Seeded character:
6 --> c
sample in integers:
[6, 18, 21, 21, 8, 17, 10, 11, 15, 8, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]




SAMPLE 3:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6]
Seeded character:
6 --> c
sample in integers:
[6, 18, 21, 21, 18, 21, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]




SAMPLE 4:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4]
Seeded character:
4 --> a
sample in integers:
[4, 15, 7, 8, 22, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]




SAMPLE 5:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6]
Seeded character:
6 --> c
sample in integers:
[6, 18, 21, 12, 22, 1

Once we've finished generating characters, we can print out the `int_sample` list, which stores all the generated indices. This corresponds to the word in integer form. 

In [332]:
# How many place names would you like to generate? It's 30 here.
for samples in range(30):
    m = samples + 1
    print("SAMPLE " + str(m) + ":")
    # stop generating new characters once the end of line char has been generated
    # this inner while loop generates exactly one word
    index = 5
    int_sample = [] #list that will carry indices of sampled characters in one generated word
    # pick random seed (i.e. first letter) from input array
    random_number = np.random.randint(0, len(dataX))
    pattern = np_dataX[random_number]
    for i in range(0, len(pattern)-1):
        while not ( pattern[i] == 1 and pattern[len(pattern)-1] != 1 ):
            random_number = np.random.randint(0, len(dataX))
            pattern = np_dataX[random_number]

    print("\nSeed:")
    one_line_pattern = ""
    print(pattern)
    print("Seeded character:")
    print(str(pattern[len(pattern)-1]) + " --> " + int_to_char[pattern[len(pattern)-1]])
    int_sample.append(pattern[len(pattern)-1]) # append seed to integer sample list    
    
    
    
    
    c = 0 # will count number of generated characters, max 50
    while index != 0 and c < 55:
        c = c+1
        # the normalised seed, in integers ("pattern")
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(n_vocab) # normalise values to range [0,1]

        # making a prediction
        prediction = model.predict(x, verbose=0) # can set verbose to 1 if you want more details

        index = np.argmax(prediction)

        # update seed to: cut off first space (on the left) and add the prediction on the right
        pattern = np_dataX[random_number + c]
        # replace rightmost number by generated one (but use that array to have the shape)
        pattern[len(pattern)-1] = index
        # attach generated character to integer sample list
        int_sample.append(index)
    
    
    
    # print generated place name in integers
    print("sample in integers:")
    print(int_sample)
    print("\n")


SAMPLE 1:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4]
Seeded character:
4 --> a
sample in integers:
[4, 15, 7, 8, 21, 23, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]


SAMPLE 2:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4]
Seeded character:
4 --> a
sample in integers:
[4, 15, 7, 4, 21, 15, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]


SAMPLE 3:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 15]
Seeded character:
15 --> l
sample in integers:
[15, 18, 24, 10, 11, 1, 21, 18, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]


SAMPLE 4:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5]
Seeded character:
5 --> b
sample in integers:
[5, 4, 15, 15, 5, 28, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]


SAMPLE 5:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 8]
Seeded character:
8 --> e
sample in integers:
[8, 4, 22, 23, 18, 11, 4, 1, 

Now that we've got the sample in integer form, we can convert it to letters using our `int_to_char` dictionary.

In [333]:
# How many place names would you like to generate? It's 30 here.
for samples in range(30):
    m = samples + 1
    print("SAMPLE " + str(m) + ":")
    # stop generating new characters once the end of line char has been generated
    # this inner while loop generates exactly one word
    index = 5
    int_sample = [] #list that will carry indices of sampled characters in one generated word
    # pick random seed (i.e. first letter) from input array
    random_number = np.random.randint(0, len(dataX))
    pattern = np_dataX[random_number]
    for i in range(0, len(pattern)-1):
        while not ( pattern[i] == 1 and pattern[len(pattern)-1] != 1 ):
            random_number = np.random.randint(0, len(dataX))
            pattern = np_dataX[random_number]

    print("\nSeed:")
    one_line_pattern = ""
    print(pattern)
    print("Seeded character:")
    print(str(pattern[len(pattern)-1]) + " --> " + int_to_char[pattern[len(pattern)-1]])
    int_sample.append(pattern[len(pattern)-1]) # append seed to integer sample list    
    
    
    
    
    c = 0 # will count number of generated characters, max 50
    while index != 0 and c < 55:
        c = c+1
        # the normalised seed, in integers ("pattern")
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(n_vocab) # normalise values to range [0,1]

        # making a prediction
        prediction = model.predict(x, verbose=0) # can set verbose to 1 if you want more details

        index = np.argmax(prediction)

        # update seed to: cut off first space (on the left) and add the prediction on the right
        pattern = np_dataX[random_number + c]
        # replace rightmost number by generated one (but use that array to have the shape)
        pattern[len(pattern)-1] = index
        # attach generated character to integer sample list
        int_sample.append(index)
        
        
    # print generated place name in integers
    print("sample in integers:")
    print(int_sample)
    


    
    # print generated place name in characters
    print("sample: ")
    sample = []
    place_name = "" # initialise place_name string
    for int in int_sample: 
        sample.append(int_to_char[int]) # attach generated characters to a list
    for entry in sample: # make a list out of the string 
        place_name += entry
    print(place_name) # here is our place name!!
    
    print("\n")
    print("\n")
print("Finished")

SAMPLE 1:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 22]
Seeded character:
22 --> s
sample in integers:
[22, 23, 4, 17, 11, 4, 17, 10, 15, 8, 28, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sample: 
stanhangley                





SAMPLE 2:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 23]
Seeded character:
23 --> t
sample in integers:
[23, 11, 18, 21, 9, 28, 15, 11, 4, 16, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sample: 
thorfylham                 





SAMPLE 3:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 15]
Seeded character:
15 --> l
sample in integers:
[15, 18, 24, 10, 12, 17, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sample: 
lougin                     





SAMPLE 4:

Seed:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 23]
Seeded character:
23 --> t
sample in integers:
[23, 11, 18, 16, 4, 21, 23, 11, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
sample: 
thom