Keras RNN text generation word level model #161
Comments
I'm working on a very similar issue, so I hope I can help.
Thanks for the reply @OliverHofkens. I think options 2 and 3 both seem like the better choices, although I'm not quite sure how to execute these ideas. I've given a basic outline below. Let me know if you have any good resources that detail how to implement them; otherwise, I'm just bouncing ideas around below.

Embedding Options:

Pre-processing steps? This is the part that I'm having trouble wrapping my head around, but here is a first attempt.

library(keras)
library(readr)
library(stringr)
library(purrr)
library(tokenizers)
# Parameters --------------------------------------------------------------
maxlen <- 40
# Data Preparation --------------------------------------------------------
# Retrieve text
path <- get_file(
'nietzsche.txt',
origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt'
)
# Load, collapse, and tokenize text
text <- read_lines(path) %>%
str_to_lower() %>%
str_c(collapse = "\n")
# prep data for embedding steps####
#set up tokenizer
tok <- keras::text_tokenizer()
#fit tokenizer to text
keras::fit_text_tokenizer(tok, text)
#convert the text to integer word sequences
word.deq <- texts_to_sequences(tok, text)
# num_words is NULL unless it was set when creating the tokenizer, so report the fitted vocabulary size instead
print(sprintf("total words: %d", length(tok$word_index)))
word.deq is now a list in the environment.
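As a rough sketch of a possible next step (completely untested), the integer sequence could be cut into maxlen-word windows with a next-word target, mirroring what the character example does with sentences. It assumes the whole collapsed text was tokenized as one document, so word.deq[[1]] is the full integer sequence; the names word_ids, starts, X_ids, y_ids and the stride step are just made up here.

# integer-encoded text: one long vector of word indices
word_ids <- word.deq[[1]]

# overlapping windows of maxlen words, plus the word that follows each window
step   <- 3  # hypothetical stride, mirroring the character example
starts <- seq(1, length(word_ids) - maxlen, by = step)

X_ids <- t(sapply(starts, function(s) word_ids[s:(s + maxlen - 1)]))
y_ids <- sapply(starts, function(s) word_ids[s + maxlen])

dim(X_ids)  # n_windows x maxlen, ready for an embedding layer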
Assuming this is on the right track, what else needs to be done with word.deq?

Changes to model:

### original model
model <- keras_model_sequential()
model %>%
layer_lstm(128, input_shape = c(maxlen, length(words))) %>%
layer_dense(length(words)) %>%
layer_activation("softmax")
### model with embedding?
model <- keras_model_sequential()
model %>%
layer_embedding(input_dim = length(words), output_dim = 8, input_length = maxlen) %>%
layer_lstm(128, input_shape = c(maxlen, length(words))) %>% #input shape needed still?
layer_dense(length(words)) %>%
layer_activation("softmax")
Generator Option:

I'm assuming you could take the steps outlined in the original example to get X and y, and wrap that in a function that returns a subset of the data each time it's called.

Changes to model:

### generator function
gen_function <- function (dataset) {
#some generator function steps:
# take dataset as input
#return subset of X,y
# iterate through all of dataset
}
iter <- py_iterator( gen_function )
### original model
model %>% fit( X, y, batch_size = 128, epochs = 2 )
### attempt at generator model
model %>% fit_generator( X, y, batch_size = 128, epochs = 2, generator = iter )
#do we input X and y or does generator take care of that?
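Reading the fit_generator docs, I think the generator itself is supposed to supply the data, so X and y wouldn't be passed in directly. Something like this, maybe (just a sketch; steps_per_epoch is a made-up number, and it assumes iter yields list(X, y) batches):

model %>% fit_generator(
  generator       = iter,
  steps_per_epoch = 500,  # hypothetical: number of batches that make up one epoch
  epochs          = 2
)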
Hey @swraithel, sorry for the late reply. I'm not very familiar with Keras' built-in text preprocessing functions, but from reading the docs your suggestion looks good. Indeed, you don't need the input_shape on the LSTM once an embedding layer comes first.

As for the generator, Keras will want to call your generator function and expects the output to be a list containing the inputs and targets for one batch. However, there is one more thing I just discovered which helps to reduce memory usage a lot. Hope this helps!
@OliverHofkens Thanks for the reply. Referring to your issue really helped me work out the generator problem, and I've been able to convert the character-level example to use a generator.

for (i in 1:10) {
  end.index   <- min(as.integer(length(dataset$sentece) * 0.1 * i), length(dataset$sentece))
  start.index <- max(as.integer(length(dataset$sentece) * 0.1 * (i - 1)) + 1, 1)
  index_range <- start.index:end.index
  dataset2 <- list("sentece"   = dataset$sentece[index_range],
                   "next_char" = dataset$next_char[index_range])
  # Vectorization
  X <- array(0, dim = c(length(dataset2$sentece), maxlen, length(chars)))
  y <- array(0, dim = c(length(dataset2$sentece), length(chars)))
  for (j in 1:length(dataset2$sentece)) {
    X[j, , ] <- sapply(chars, function(x) {
      as.integer(x == dataset2$sentece[[j]])
    })
    y[j, ] <- as.integer(chars == dataset2$next_char[[j]])
  }
  write_rds(X, paste0("X-Train-", i, ".rds"))
  write_rds(y, paste0("y-Train-", i, ".rds"))
}

This creates 10 X training datasets and 10 y next-character training sets that can be used by this basic generator:

batch_size <- 1000
epochs <- 2
# Fit model to data with generator
steps_per_epoch <- 10
create_generator <- function() {
  i <- 0
  function() {
    # index of the file to read this step (files are numbered 1..steps_per_epoch)
    index_range <- i + 1
    # adjust i for the next iteration, wrapping around at the end of an epoch
    if (i == (steps_per_epoch - 1))
      i <<- 0
    else
      i <<- i + 1
    X <- read_rds(paste0("X-Train-", index_range, ".rds"))
    y <- read_rds(paste0("y-Train-", index_range, ".rds"))
    list(X, y)
  }
}
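And this is roughly how I'm wiring the generator into training (assuming the model from the character example is already defined and compiled):

model %>% fit_generator(
  generator       = create_generator(),
  steps_per_epoch = steps_per_epoch,
  epochs          = epochs
)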
However, when I tried to create a generator for the word-level model I ran into a different problem. If I use the same function as above but split into 100 rds files instead of 10, each one is approximately 1 GB. This makes me think that with the current approach a generator is not the best option. Have you had any luck with the word-level model? I'm going to see if I can get a basic generator example working and I will be sure to post back here.
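In case it's useful, here is the direction I'm thinking of trying next for the word-level case (just a sketch, untested): keep only the integer-encoded windows in memory, and let the generator build each batch on the fly, so the huge one-hot arrays never exist all at once. It assumes X_ids is an n_windows x maxlen integer matrix, y_ids holds the integer next-word targets, and the model starts with an embedding layer so the inputs can stay integer-encoded; all these names are hypothetical.

vocab_size <- length(tok$word_index) + 1

word_batch_generator <- function(X_ids, y_ids, batch_size = 128) {
  i <- 1
  n <- nrow(X_ids)
  function() {
    rows <- i:min(i + batch_size - 1, n)
    # wrap around at the end of the data
    i <<- if (max(rows) >= n) 1 else i + batch_size
    # integer inputs go straight to the embedding layer;
    # only the targets are one-hot encoded, one batch at a time
    list(
      X_ids[rows, , drop = FALSE],
      to_categorical(y_ids[rows], num_classes = vocab_size)
    )
  }
}

model %>% fit_generator(
  generator       = word_batch_generator(X_ids, y_ids),
  steps_per_epoch = ceiling(nrow(X_ids) / 128),
  epochs          = 2
)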
I have found a working solution for a word-based text generation model. A big disadvantage of using "fit_generator" instead of "fit" is that it only uses one CPU core, which makes training significantly slower. Is there any possibility to run "fit_generator" in parallel as well? Here is my modified code (sources: https://keras.rstudio.com/articles/examples/lstm_text_generation.html, https://keras.rstudio.com/articles/faq.html, https://github.com/vlraik/word-level-rnn-keras/blob/master/wordlevelrnn/__init__.py)
It unfortunately can't run in parallel because we can't run R code on a background thread. We are working on support for TensorFlow datasets (https://github.com/rstudio/tfdatasets), which will provide parallel preprocessing (all on the TF graph); however, this isn't ready for use with Keras yet.
Thanks for your quick reply.
@Nonserial Thanks for posting the above. I'll have to play around with it over more epochs, but that should do the trick for the word-level model for me! I also posted this on Stack Overflow. If you're interested in the rep, feel free to add an answer and I'll accept it. Otherwise I may just add a link to this thread in the comments. Thanks again!
I think the entire TF community is going to move to using TensorFlow datasets (where all preprocessing occurs on the graph and in parallel), so that's the vector we are investing in within R (more developments coming here in the next few months).
I've been working through the character-level text generation example at https://keras.rstudio.com/articles/examples/lstm_text_generation.html and I'm having trouble extending it to a word-level model. See the reprex below.
Which gives:
When I move on to the next step I run into issues:
Error: cannot allocate vector of size 103.5 Gb
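To put that allocation in perspective, the one-hot array the character-level recipe builds has dimensions n_sentences x maxlen x vocab_size, stored as doubles (8 bytes each), so at the word level it grows with the vocabulary. A back-of-the-envelope check, with purely illustrative numbers rather than the actual counts from my run:

n_sentences <- 200000  # hypothetical number of maxlen-word windows
vocab_size  <- 15000   # hypothetical word vocabulary size
maxlen      <- 40
# approximate memory needed for the one-hot X array, in GB
n_sentences * maxlen * vocab_size * 8 / 1e9
#> about 960 GB with these numbers, so the array can't fit in RAM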
Now, compared to the character example, I have a lot more words than I did characters in the vocabulary, which is probably why I'm running into vector-size issues. But how would I go about pre-processing the word-level text data to fit into an RNN? Is this done somehow through the embedding layer? Do I need to do some removal of stop words or stemming to get the word vocabulary down? I tried taking this to Stack Overflow first but haven't had much luck: https://stackoverflow.com/questions/46856862/keras-rnn-r-text-generation-word-level-model
Thanks!