Keras RNN text generation word level model #161

Closed
swraithel opened this issue Oct 26, 2017 · 9 comments

@swraithel
Copy link

I've been working through the example for character-level text generation: https://keras.rstudio.com/articles/examples/lstm_text_generation.html
I'm having trouble extending this example to a word-level model. See the reprex below.

library(keras)
library(readr)
library(stringr)
library(purrr)
library(tokenizers)

# Parameters

maxlen <- 40

# Data Preparation

# Retrieve text
path <- get_file(
  'nietzsche.txt', 
  origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt'
  )

# Load, collapse, and tokenize text
text <- read_lines(path) %>%
  str_to_lower() %>%
  str_c(collapse = "\n") %>%
  tokenize_words( simplify = TRUE)

print(sprintf("corpus length: %d", length(text)))

words <- text %>%
  unique() %>%
  sort()

print(sprintf("total words: %d", length(words))) 

Which gives:

[1] "corpus length: 101345"
[1] "total words: 10283"

When I move on to the next step I run into issues:

# Cut the text into semi-redundant sequences of maxlen words
dataset <- map(
  seq(1, length(text) - maxlen - 1, by = 3), 
  ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
  )

dataset <- transpose(dataset)

# Vectorization
X <- array(0, dim = c(length(dataset$sentece), maxlen, length(words)))
y <- array(0, dim = c(length(dataset$sentece), length(words)))

for(i in 1:length(dataset$sentece)){

  X[i,,] <- sapply(words, function(x){
    as.integer(x == dataset$sentece[[i]])
  })

  y[i,] <- as.integer(words == dataset$next_char[[i]])

}

Error: cannot allocate vector of size 103.5 Gb

Compared to the character example, I now have a lot more words than I had characters in the vocabulary, which is probably why I'm running into vector-size issues. But how would I go about pre-processing word-level text data to fit it into an RNN? Is this done somehow through the embedding layer? Do I need to remove stop words or do stemming to get the word vocabulary down?

I tried taking this to Stack Overflow first but haven't had much luck: https://stackoverflow.com/questions/46856862/keras-rnn-r-text-generation-word-level-model

Thanks!

@OliverHofkens

I'm working on a very similar issue, so I hope I can help.
I think that removing stopwords and stemming aren't great ideas if your end goal is to generate coherent sentences.
You have a couple of options here:

  • Restrict your vocabulary to the top n most frequent words, and replace other words with an <unk> marker. This is what is done in the Penn Treebank dataset, for example. As there are only about 10k unique words in the Nietzsche dataset, you already have a pretty restricted vocabulary, so this might not help much without killing the model's performance.
  • Using one-hot vectors as input (like in the code you posted) makes your input vectors scale with your vocabulary size (every input is vocab_size x maxlen). Embedding vectors remedy this problem because every word is represented by a vector of a size of your choosing (embedding_size). You input a simple list of maxlen integers into your embedding layer, which transforms it into an embedding_size x maxlen tensor for the rest of your model (see the sketch after this list). Note that the embedding layer is trainable as well, so expect your training time to increase significantly compared to non-embedding approaches.
  • Probably the easiest way to solve your problem is to use fit_generator instead of fit. This lets you generate your input data in a function when the model needs it, instead of doing it all up front. So instead of generating your huge 100 GB vector before you start training, you generate the inputs in smaller batches and feed them to your model.
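As a rough sketch of the embedding approach mentioned above (the layer sizes are placeholders, not tuned values, and vocab_size is assumed to be the number of distinct words plus one for the reserved index 0):

library(keras)

vocab_size     <- 10284   # e.g. 10283 words + 1 for the reserved index 0
maxlen         <- 40
embedding_size <- 64      # placeholder value

model <- keras_model_sequential()

model %>%
  # input: (batch, maxlen) integers -> output: (batch, maxlen, embedding_size)
  layer_embedding(input_dim = vocab_size, output_dim = embedding_size,
                  input_length = maxlen) %>%
  layer_lstm(128) %>%
  layer_dense(vocab_size) %>%
  layer_activation("softmax")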

@swraithel
Author

Thanks for the reply @OliverHofkens. I think options 2 and 3 seem like the better choices, although I'm not quite sure how to execute them. I've given a basic outline below. Let me know if you have any good resources that detail how to do this; otherwise I'm just bouncing ideas around below.

Embedding Option:

Pre-processing steps?

This is the part I'm having trouble wrapping my head around, but here is a first attempt.

library(keras)
library(readr)
library(stringr)
library(purrr)
library(tokenizers)

# Parameters --------------------------------------------------------------

maxlen <- 40

# Data Preparation --------------------------------------------------------

# Retrieve text
path <- get_file(
  'nietzsche.txt', 
  origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt'
  )

# Load, collapse, and tokenize text
text <- read_lines(path) %>%
  str_to_lower() %>%
  str_c(collapse = "\n")




# Prep data for embedding steps --------------------------------------------

# set up tokenizer
tok <- keras::text_tokenizer()

# fit tokenizer to text
keras::fit_text_tokenizer(tok, text)

# transform text to integer sequences
word.deq <- texts_to_sequences(tok, text)

# note: tok$num_words is only the optional cap; the vocabulary size is the word index length
print(sprintf("total words: %d", length(tok$word_index)))

word.deq is a list in the environment:

> List (600893 items)

Assuming this is on the right track, what else needs to be done with word.deq in order to get the appropriately shaped X and y for the model?
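For concreteness, here is a rough, untested sketch of how I imagine the windowing might work for an embedding layer (it assumes the tokenizer is fit on the per-line texts so the indices are word-level, and keeps y as plain word indices rather than one-hot vectors):

lines <- read_lines(path) %>% str_to_lower()

tok <- text_tokenizer() %>% fit_text_tokenizer(lines)

# one integer per word, for the whole corpus
seq_int <- texts_to_sequences(tok, lines) %>% unlist()

# semi-redundant windows of maxlen word indices, plus the word that follows each window
starts <- seq(1, length(seq_int) - maxlen, by = 3)

X <- t(sapply(starts, function(s) seq_int[s:(s + maxlen - 1)]))  # (samples, maxlen)
y <- seq_int[starts + maxlen]                                    # (samples) next-word indices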

Changes to model

### original model
model <- keras_model_sequential()

model %>%
  layer_lstm(128, input_shape = c(maxlen, length(words))) %>%
  layer_dense(length(words)) %>%
  layer_activation("softmax")

### model with embedding?
model <- keras_model_sequential()

model %>%
  layer_embedding(input_dim = length(words), output_dim = 8, input_length = maxlen) %>%
  layer_lstm(128, input_shape = c(maxlen, length(words))) %>% #input shape needed still?
  layer_dense(length(words)) %>%
  layer_activation("softmax")


Generator Option:

I'm assuming you could take the steps outlined in the original example to get X and y and wrap them in a function that takes a subset of dataset and returns the corresponding subset of X and y?

Changes to model

### generator function
gen_function <- function(dataset) {
  # some generator function steps:
  #   take dataset as input
  #   return subset of X, y
  #   iterate through all of dataset
}


iter <- py_iterator( gen_function )

### original model
 model %>% fit( X, y, batch_size = 128, epochs = 2 )

### attempt at generator model
 model %>% fit_generator( X, y, batch_size = 128, epochs = 2, generator = iter  )
# do we still pass X and y, or does the generator take care of that?

@OliverHofkens

Hey @swraithel, sorry for the late reply.

I'm not very familiar with Keras' built-in text preprocessing functions, but from reading the docs your suggestion looks good. Indeed, you don't need the input_shape argument in the LSTM layer any more; the embedding layer will handle that for you.

As for the generator, Keras will want to call your generator function and expects the output to be a list(X, y) containing one batch. So the generator itself should have state to keep track of where batches start and stop. You can find some examples in this issue I opened: #166. You then call fit_generator(gen_function, batches_per_epoch), where gen_function is your generator function and batches_per_epoch is how many times the generator should be called before one epoch is considered finished.
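A rough sketch of one way to write such a stateful generator as a closure (X_all and y_all are placeholders for data you have already prepared, assumed here to be matrices with one sample per row, e.g. integer inputs for an embedding layer):

make_generator <- function(X_all, y_all, batch_size) {

  i <- 0
  n_batches <- floor(nrow(X_all) / batch_size)

  function() {
    # rows belonging to the current batch
    rows <- (i * batch_size + 1):((i + 1) * batch_size)
    # advance the counter, wrapping around after the last batch
    i <<- (i + 1) %% n_batches
    list(X_all[rows, , drop = FALSE], y_all[rows, , drop = FALSE])
  }
}

model %>% fit_generator(
  generator = make_generator(X_all, y_all, 128),
  steps_per_epoch = floor(nrow(X_all) / 128),
  epochs = 2
)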

However, there is one more thing I just discovered which helps to reduce memory usage a lot: sparse_categorical_crossentropy. In the lstm_text_generation example, the loss function is categorical_crossentropy, which expects one-hot vectors to represent the correct classes. With sparse_categorical_crossentropy you can just pass the indexes of the correct classes. Your example had a vocabulary of 10283 classes/words, so replacing your outputs of size (batch_size x vocab_size) with (batch_size x 1) should take roughly 10,000x less memory!
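In code, the change is only the loss and the shape of y (assuming y_sparse holds 0-based class indexes; the Keras tokenizer indexes words from 1, so you would subtract 1):

# categorical_crossentropy:        y is one-hot, shape (batch_size, vocab_size)
# sparse_categorical_crossentropy: y is a class index, shape (batch_size, 1)
model %>% compile(
  loss = "sparse_categorical_crossentropy",
  optimizer = optimizer_rmsprop(lr = 0.01)
)

model %>% fit(X, y_sparse, batch_size = 128, epochs = 2)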

Hope this helps!

@swraithel
Author

@OliverHofkens Thanks for the reply. Referring to the issue you linked really helped me work out the generator. I've been able to convert the character example to use one.

for (i in 1:10) {
  end.index <- min(as.integer(length(dataset$sentece) * 0.1 * i), length(dataset$sentece))
  start.index <- max(as.integer(length(dataset$sentece) * 0.1 * (i - 1)) + 1, 1)
  index_range <- start.index:end.index
  dataset2 <- list("sentece" = dataset$sentece[index_range],
                   "next_char" = dataset$next_char[index_range])

  # Vectorization
  X <- array(0, dim = c(length(dataset2$sentece), maxlen, length(chars)))
  y <- array(0, dim = c(length(dataset2$sentece), length(chars)))

  for (j in 1:length(dataset2$sentece)) {

    X[j,,] <- sapply(chars, function(x){
      as.integer(x == dataset2$sentece[[j]])
    })

    y[j,] <- as.integer(chars == dataset2$next_char[[j]])

  }

  write_rds(X, paste0("X-Train-", i, ".rds"))
  write_rds(y, paste0("y-Train-", i, ".rds"))
}

This creates 10 X training files and 10 y (next-character) training files that can be used by this basic generator:

batch_size <- 1000
epochs <- 2


# Fit model to data with generator
steps_per_epoch <- 10
create_generator <- function() {
  
  i <- 0
  
  function() {
    
    # file index for this step (files are numbered 1..steps_per_epoch, while i counts from 0)
    index_range <- i + 1
    
    # adjust i for next iteration
    if (i == (steps_per_epoch-1))
      i <<- 0
    else
      i <<- i + 1
    X <- read_rds(paste0("X-Train-", index_range,".rds"))
    y <- read_rds(paste0("y-Train-", index_range,".rds"))
    list(X, y)
  }
}

However, when I tried to create a generator for the word-level model I ran into a different problem: using the same function as above but splitting into 100 rds files instead of 10, each file is approximately 1 GB. This makes me think that, with the current approach, a generator is not the best option. Have you had any luck with the word-level model?

I'm going to see if I can get a basic generator example working, and I'll be sure to post back here.

@Nonserial

Nonserial commented Dec 20, 2017

I have found a working solution for a word-based text generation model. A big disadvantage of using fit_generator instead of fit is that it uses only one CPU core, which makes training significantly slower. Is there any possibility of running fit_generator in parallel as well?

Here is my modified code (sources: https://keras.rstudio.com/articles/examples/lstm_text_generation.html, https://keras.rstudio.com/articles/faq.html, https://github.com/vlraik/word-level-rnn-keras/blob/master/wordlevelrnn/__init__.py)

library(keras)
library(readr)
library(stringr)
library(purrr)
library(tokenizers)
library(tfdatasets)

# Parameters --------------------------------------------------------------

maxlen <- 5
steps <- 2

# Data Preparation --------------------------------------------------------

# Retrieve text
path <- get_file(
  'nietzsche.txt', 
  origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt'
)

clean_text <- function(text){
  text <- gsub("--", " - ", text)
  text <- gsub("\\.", " \\. ", text)
  text <- gsub("\\,", " \\, ", text)
  text <- gsub("\\!", " \\! ", text)
  text <- gsub("\\?", " \\? ", text)
  text <- gsub("\\;", " \\; ", text)
  text <- gsub("\\:", " \\: ", text)
  text <- gsub("_", " ", text)
  text <- gsub('-\"', ' ', text)
  text <- gsub('\"', ' ', text)
  text <- gsub('\\)|\\(|\\]|\\[', ' ', text)
  text <- gsub("'", " ", text)
  text <- gsub("=", " ", text)
  text <- gsub("[[:digit:]]", " ", text)
  text <- gsub("\\s+", " ", text)
  text
}

# Load, collapse, and tokenize text
text <- read_lines(path) %>%
  str_to_lower() %>%
  str_c(collapse = "\n") 

text <- clean_text(text) %>% 
  tokenize_regex(simplify = T)

#text <- text %>%
#  tokenize_sentences(simplify = T) %>%
#  tokenize_words(simplify = F)

head(text, n = 30)

print(sprintf("corpus length: %d", length(text)))

vocab <- gsub("\\s", "", unlist(text)) %>%
  unique() %>%
  sort()

print(sprintf("total words: %d", length(vocab)))  

sentence <- list()
next_word <- list()
list_words <- data.frame(word = unlist(text), stringsAsFactors = F)
j <- 1

for (i in seq(1, length(list_words$word) - maxlen - 1, by = steps)){
  sentence[[j]] <- as.character(list_words$word[i:(i+maxlen-1)])
  next_word[[j]] <- as.character(list_words$word[i+maxlen])
  j <- j + 1
}


# Model Definition --------------------------------------------------------

model <- keras_model_sequential()

model %>%
  layer_lstm(128, input_shape = c(maxlen, length(vocab))) %>%
  layer_dense(length(vocab)) %>%
  layer_activation("softmax")

optimizer <- optimizer_rmsprop(lr = 0.01)

model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = optimizer
)

# Training & Results ----------------------------------------------------

sample_mod <- function(preds, temperature = 1){
  preds <- log(preds)/temperature
  exp_preds <- exp(preds)
  preds <- exp_preds / sum(exp_preds)
  
  rmultinom(1, 1, preds) %>% 
    as.integer() %>%
    which.max()
}


batch_size <- 128
all_samples <- 1:length(sentence)
num_steps <- trunc(length(sentence)/batch_size)

sampling_generator <- function(){
  
  function(){
  
    # note: without `<<-` the next line only modifies a local copy,
    # so every call samples its batch from the full index set
    batch <- sample(all_samples, batch_size)
    all_samples <- all_samples[-batch]
    
    sentences <- sentence[batch]
    next_words <- next_word[batch]
    
    # vectorization
    X <- array(0, dim = c(batch_size, maxlen, length(vocab)))
    y <- array(0, dim = c(batch_size, length(vocab)))
    
    
    for(i in 1:batch_size){
      
      X[i,,] <- sapply(vocab, function(x){
        as.integer(x == sentences[[i]])
      })
      
      y[i,] <- as.integer(vocab == next_words[[i]])
      
    }
    
    # return data
    list(X, y)
  }
}


model %>% fit_generator(generator = sampling_generator(),
                        steps_per_epoch = num_steps,
                        epochs = 3)


for(diversity in c(0.2, 0.5, 1, 1.2)){
  
  cat(sprintf("diversity: %f ---------------\n\n", diversity))
  
  start_index <- sample(1:(length(text) - maxlen), size = 1)
  sentence <- text[start_index:(start_index + maxlen - 1)]
  generated <- ""
  
  for(i in 1:200){
    
    x <- sapply(vocab, function(x){
      as.integer(x == sentence)
    })
    x <- array_reshape(x, c(1, dim(x)))
    
    preds <- predict(model, x)
    next_index <- sample_mod(preds, diversity)
    nextword <- vocab[next_index]
    
    generated <- str_c(generated, nextword, sep = " ")
    sentence <- c(sentence[-1], nextword)
    
  }
  
  cat(generated)
  cat("\n\n")
  
}

@jjallaire
Member

It unfortunately can't run in parallel because we can't run R code on a background thread. We are working on support for TensorFlow datasets (https://github.com/rstudio/tfdatasets), which will provide parallel preprocessing (all on the TF graph); however, this isn't ready for use with Keras yet.

@Nonserial

Nonserial commented Dec 20, 2017

Thanks for your quick reply.
When I use fit (instead of fit_generator), it seems as if the model is trained on three cores; the CPU load is approximately 70% on all four cores.
So is it only fit_generator that can't be run in parallel?
Is there any other way to work around the memory issues (by training on batches) and still train the model in parallel?

@swraithel
Author

@Nonserial Thanks for posting the above. I'll have to play around with it over more epochs, but that should do the trick for the word-level model for me!

I also posted this on Stack Overflow. If you're interested in the rep, feel free to add an answer and I'll accept it. Otherwise I may just add a link to this thread in the comments. Thanks again!

@jjallaire
Member

The fit function probably ends up in native code that's parallelized; the fit_generator function, on the other hand, continually calls back into R. In general you are better off if you can start with all of your data in memory. Failing that, another approach to streaming is the train_on_batch() function.
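For example, something along these lines (get_batch() is just a stand-in here for whatever batching logic you already have; it should return a list(x, y)):

epochs    <- 2
num_steps <- 100   # batches per epoch; placeholder value

for (epoch in 1:epochs) {
  for (step in 1:num_steps) {
    batch <- get_batch(step)   # hypothetical helper returning list(x, y)
    model %>% train_on_batch(batch[[1]], batch[[2]])
  }
}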

I think the entire TF community is going to move to using TensorFlow datasets (where all preprocessing occurs on the graph and in parallel) so that's the vector we are investing in within R (more developments coming here in the next few months).

@skeydan skeydan closed this as completed Sep 11, 2018