textsum: How to train against my own data? #373

Within the data folder (https://github.com/tensorflow/models/tree/master/textsum/data) there are two files: data and vocab. Is the following correct: data contains the article text to be summarised, and vocab is a word count based on the Gigaword dataset? Therefore, to summarise my own data, do I just need to replace the content of the data file in /data/data? Or do I need to use the licensed Gigaword dataset in order to train against my own news articles?

Comments
I have the same question. It would be nice to have some information on how to create the data/data file given some article texts and summaries.
See https://github.com/tensorflow/models/pull/379/files for examples of making training data for the model.
@panyx0718 After checking data_convert_example.py in #379, I still don't know what the input text looks like. Is there a required format for the data sent to the text_to_binary function?
Hi @chenwangliangguo, the output file text_data shows the input text format.
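For reference, a line of text_data produced by the binary_to_text conversion is a set of tab-separated key=value fields; the sentence/paragraph tags and field names below follow the toy dataset, but the actual words are made up:

```
article=<d> <p> <s> the quick brown fox jumps over the lazy dog . </s> </p> </d>	abstract=<d> <p> <s> the fox jumps . </s> </p> </d>	publisher=AFP
```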
@neufang Thanks. Have you run the model on your own data successfully? How did you generate your vocab?
Hi @aronayne, I used a text data file and got an error. The text data is:
Hi @aronayne, if you have example text data, can you share the file? Also, why must textsum read a binary data file?
I ran it again:
python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data
head data/text_data
...
python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data
Thanks
@panyx0718 I added the <s> tag to all sentences and used the default parameters to train the model, but the decoder only generates similar words for all articles. How do I test the model? Can you give more information?
I have the same problem as you, @UB1010: all results generated by decode are the same.
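For testing, the released textsum code runs the same seq2seq_attention.py script in decode mode. A sketch of the invocation, with data and log paths assumed (check the README in your checkout for the exact flags):

```
python seq2seq_attention.py --mode=decode --article_key=article \
  --abstract_key=abstract --data_path=data/validation-* --vocab_path=data/vocab \
  --log_root=textsum/log_root --decode_dir=textsum/log_root/decode --beam_size=8
```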
@doumoxiao
@UB1010
@hphp I think the RNN is just very inefficient; you can expect it to take a long time.
How many steps have you trained? We used 10 machines, each with 4 GPUs, and trained for a week.
@panyx0718 How many steps did you train? Do you think I need to update some parameters for Chinese Weibo data?
We trained a few million steps; 43k is too small. Thanks
@panyx0718 Thank you. Given enough data and computing power, say 50 million Chinese articles and one week of training on 10 machines with 4 GPUs each, can we expect decent results from the textsum model?
It depends on the quality of your data. Also, I haven't tried such a large dataset. Thanks
@SawyerW The vocab file is simply every word in the dataset you are using, with a count next to it for the number of times it has shown up across all the data files. I have seen some datasets then keep only the top 200K words or so, just to reduce the number of words in the file. This is completely up to you, though.
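In other words, the vocab is a plain-text file with one "word count" pair per line; the words and counts below are made up. Note that the toy vocab shipped with textsum also lists the special tokens the code expects (e.g. <s>, </s>, <UNK>, <PAD>), so a self-built vocab should include them too:

```
the 823241
of 441023
<s> 302000
</s> 302000
...
```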
@xtr33me Yes, you are right, but when I used the vocab file I created myself, some errors happened that were hard to track down. When I used the vocab file in /textsum/data to train on my own data, it worked, even though it could not produce the right answers. So I wonder, do you have your own code to create the vocab file? Maybe you also used your own code to transform the training data into binary data.
@SawyerW The code I used to create my vocab file is pretty simple. You can get the gist of it below. Add it to a function or inline it in some other processor you have. It's of course important that you run this against your decoded data and not the dataset that has been converted to binary. I'm sure there is a better way of doing this, but it worked for me. The other important key is that your input data is as clean as possible; that can usually be handled in your web scraper.
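The gist itself did not survive in this thread; below is a minimal sketch of a vocab builder along the lines described above, assuming whitespace-tokenized text files (the file paths are hypothetical):

```python
import collections

def build_vocab(text_paths, out_path, max_words=200000):
    """Count every whitespace-separated token and write 'word count' lines,
    keeping only the most frequent words as suggested above."""
    counter = collections.Counter()
    for path in text_paths:
        with open(path) as f:
            for line in f:
                counter.update(line.split())
    with open(out_path, 'w') as out:
        for word, count in counter.most_common(max_words):
            out.write('%s %d\n' % (word, count))

build_vocab(['data/text_data'], 'data/my_vocab')
```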
@xtr33me Thanks, I found the problem: the dataset was the issue, already fixed.
@hphp @UB1010 @panyx0718 After reading all of the above, I still have some questions. First, do I have to transform my raw data into the text_data format with the …
I am working on a search engine. My inventory is movie names, actor names, etc.
Hi all... When I run
python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data
I encounter this error:
Traceback (most recent call last): ...
How can I fix it? What does it mean exactly?
@xtr33me @panyx0718 I ran data_convert_example.py on the toy dataset and got this result: abstract= Can you please help me by providing a script or some hints to convert my training data to the abstract=..., article=..., publisher=... format?
@vdevmcitylp Feel free to check out my GitHub link below; I added some formatting scripts some time back. I haven't touched this code in a while, but I know it was working. Essentially I scraped articles, then had to format them for parsing in the referenced data_convert_example.py. Hope it helps some. https://github.com/tensorflow/models/tree/ef9c156ca7802a5e60018fb0cc7d950ea54569de/textsum
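A minimal sketch of such a formatting step, assuming your raw data is already a list of (article, abstract) string pairs; the naive period-based sentence splitting and the publisher value are placeholders:

```python
def to_text_data(pairs, out_path, publisher='MyCorpus'):
    """Write one example per line in the tab-separated key=value text format
    that data_convert_example.py's text_to_binary mode parses."""

    def tag(text):
        # Wrap each (naively split) sentence in <s>...</s> and the whole
        # document in <d> <p> ... </p> </d>, as in the toy text_data.
        sents = [s.strip() for s in text.split('.') if s.strip()]
        return '<d> <p> ' + ' '.join('<s> %s . </s>' % s for s in sents) + ' </p> </d>'

    with open(out_path, 'w') as out:
        for article, abstract in pairs:
            out.write('article=%s\tabstract=%s\tpublisher=%s\n' %
                      (tag(article), tag(abstract), publisher))

to_text_data([('First sentence. Second sentence.', 'A short summary.')],
             'data/text_data')
```

From there, data_convert_example.py --command text_to_binary turns the file into the binary form the trainer reads.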
@xtr33me
@Ali-Zareie I have the same issue; did you solve it?
@Ali-Zareie @minaSamizade This is because there is an "=" inside your text. Try (k, v) = feature.split('=', 1), i.e. add the maxsplit argument 1.
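A quick illustration of the failure and the fix (the example string is made up):

```python
feature = 'article=the rate rose to x=2.3 this year'

# feature.split('=') returns three pieces, so unpacking into (k, v)
# raises "ValueError: too many values to unpack".
(k, v) = feature.split('=', 1)  # split only at the first '='
assert k == 'article'
assert v == 'the rate rose to x=2.3 this year'
```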
Looks like the original issue was resolved. |
@panyx0718 What was the configuration of the GPUs used in the experiment with 10 machines with 4 GPUs each?
@wengenihaoshuai I saw similar results when I didn't have enough source articles. Could that be the issue, by chance? In the end I scraped around 1.3 million articles, and after cleaning and filtering I was left with about 900k articles that I was able to train on. When using this many, it was the first time I was happy with the results. Early on I had tried 40k and 200k articles and just wasn't happy with the results at all. Unsure if that is your problem, but it's something to look at.
When I try to train on my own dataset with the official vocab, there is no problem. However, once I train with my own vocab, "ValueError: Duplicated word: four." happens. Does anybody know why?
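That error appears to come from the vocab loader when the same token occurs on more than one line of the vocab file (e.g. "four" listed twice, perhaps once with different surrounding whitespace). A quick check, assuming the one-"word count"-per-line format discussed above:

```python
import collections

counts = collections.Counter()
with open('data/my_vocab') as f:
    for line in f:
        if line.strip():
            counts[line.split()[0]] += 1

print('duplicated words:', [w for w, n in counts.items() if n > 1])
```

A Counter-based builder like the sketch further up writes each word exactly once, which avoids this.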