
textsum : How to train against my own data ? #373

Closed · aronayne opened this issue Aug 30, 2016 · 36 comments
Labels: stat:awaiting model gardener (Waiting on input from TensorFlow model gardener)

@aronayne commented Aug 30, 2016

textsum model

Within the data folder (https://github.com/tensorflow/models/tree/master/textsum/data) there are two files: data and vocab. Is the following correct: data contains the article text to be summarised, and vocab is a word count based on the Gigaword dataset? If so, to summarise my own data do I just need to replace the content of the data file in /data/data? Or do I need to use the licensed Gigaword dataset in order to train against my own news articles?

@poxvoculi poxvoculi added the stat:awaiting model gardener Waiting on input from TensorFlow model gardener label Aug 31, 2016
@neufang commented Sep 6, 2016

I have the same question. It would be nice to have some information on how to create the data/data file given some article texts and summaries.

@panyx0718 (Contributor)

See https://github.com/tensorflow/models/pull/379/files for examples of making training data for the model.

@wangliangguo

@panyx0718 After checking data_convert_example.py in #379, I still don't know what the input text looks like. Is there a required format for the data sent to the text_to_binary function?

@neufang commented Sep 8, 2016

Hi @wangliangguo
python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data

The output file text_data shows the expected input text format.

@wangliangguo

@neufang Thanks. Have you run the model on your own data successfully? How did you generate your vocab?

@UB1010 commented Sep 12, 2016

Hi @aronayne,
Can you give an example of the text data for: python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data
Thank you!

I used a text data file and got this error:

Traceback (most recent call last):
File "data_convert_example.py", line 67, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "data_convert_example.py", line 63, in main
_text_to_binary()
File "data_convert_example.py", line 49, in _text_to_binary
(k, v) = feature.split('=')
ValueError: too many values to unpack

The text data is:

reopens (ROME)
3rd Rd
Michael Stich (Germany x2) bt Karim Alami (Morocco) 7-6 (7/5), 6-4
jpb/rw94

reopens (BERLIN)
3rd Rd
Anke Huber (Germany x7) bt Katarina Maleeva (Bulgaria) 5-7, 6-4, 6-4
Elena Makarova (Russia) bt Barbara Rittner (Germany x15) 6-2, 6-1
Ann Grossman (USA) bt Gabriela Sabatini (Argentina x4) 6-3, 6-4
Brenda Schultz (Holland) bt Silke Meier (Germany) 6-2, 6-4
vog/rw94

Tributes pour in for late British Labour Party leader

UNDATED, May 12 (AFP)

........

@UB1010 commented Sep 12, 2016

Hi @aronayne,
The HTML tags can't be shown here. My data is from LDC, but it is plain text.
If you can give an example of the text data, could you attach a file? Thank you!
Also, why must textsum read a binary data file?

@panyx0718 (Contributor)

I ran it again.

python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data

head data/text_data

publisher=AFP
abstract=<d> <p> <s> sri lanka closes schools as war escalates . </s> </p> </d>
article=<d> <p> <s> the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country . </s> <s> the cabinet wednesday decided to advance the december holidays by one month because of a threat from the liberation tigers of tamil eelam -lrb- ltte -rrb- against school children , a government official said . </s> <s> there are intelligence reports that the tigers may try to kill a lot of children to provoke a backlash against tamils in colombo . </s> <s> if that happens , troops will have to be withdrawn from the north to maintain law and order here , '' a police official said . </s> <s> he said education minister richard pathirana visited several government schools wednesday before the closure decision was taken . </s> <s> the government will make alternate arrangements to hold end of term examinations , officials said . </s> <s> earlier wednesday , president chandrika kumaratunga said the ltte may step up their attacks in the capital to seek revenge for the ongoing military offensive which she described as the biggest ever drive to take the tiger town of jaffna . . </s> </p> </d>

...

python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data
diff data/binary_data data/data
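
On the question above of why textsum must read a binary file: the text file is just an interchange format; the training reader streams length-prefixed serialized tf.Example records. Below is a minimal sketch of the record writer, modeled on data_convert_example.py's text_to_binary path; the helper name write_binary_record is my own, so treat it as an illustration rather than the repo's code.

import struct
from tensorflow.core.example import example_pb2

# Each record: an 8-byte length prefix, then one serialized tf.Example
# holding the abstract/article/publisher features as byte strings.
def write_binary_record(writer, features):
    tf_example = example_pb2.Example()
    for k, v in features.items():
        tf_example.features.feature[k].bytes_list.value.extend([v])
    example_str = tf_example.SerializeToString()
    writer.write(struct.pack('q', len(example_str)))
    writer.write(struct.pack('%ds' % len(example_str), example_str))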


@UB1010 commented Sep 20, 2016

@panyx0718
Thank you for your response.
I got the text data format from data/data and made binary data from my own data.
I used more than 2.4 million Chinese weibo posts for training, but I can't get a good result.
I have 2 questions, can you help?
1. The training data needs the "d", "p" and "s" tags, with "s" marking sentences. Must every sentence in the training data be wrapped in "s" tags? If my own data has no sentence boundaries, can I put a whole article inside one pair of "s" tags, i.e. treat each article as a single sentence? Is that OK? Also, I made the vocab file from my own training data by counting word frequencies, but how should I set the "UNK" frequency? Can you explain how to build the vocab file?

2. I added "s" tags around every sentence and trained the model with the default parameters, but the decoder produces nearly the same words for every article. How should I test the model? Can you give more information?
Thank you.

@doumoxiao

I have the same problem as you @UB1010: every result generated by decode is the same.
I used more than one million Chinese news articles for training, but the result is very bad, like:
decode output=不 就 的 就
Have you solved the problem?

@UB1010 commented Oct 18, 2016

@doumoxiao
I haven't solved the problem. Someone said it needs a long time to train, but I trained the model on a CPU for more than one week and the decode results are still bad.

@hphp commented Oct 26, 2016

@UB1010
I trained with a vocab of 100K words and 0.3 million articles with corresponding abstracts for 5 days on a single Tesla K40, and increased the beam size from 4 to 20. The results became more reasonable than in the first rounds, but they still don't match the meaning of the test data, for example:
INFO:tensorflow:article: 全球 九大 警用 步枪
INFO:tensorflow:abstract: 世界 各国 步枪
INFO:tensorflow:decoded: 世界 客
INFO:tensorflow:article: 进化 计算 的 理论 和 方法 【 图片 价格 品牌 报价 】 - 京东 商城
INFO:tensorflow:abstract: 自动 化 书籍 图片
INFO:tensorflow:decoded: 吊顶 装修 亮化

@UB1010 commented Oct 26, 2016

@hphp
Yes, my results are similar.
My data is more than 2.4 million weibo posts with abstracts, beam_size = 4.
I trained the model for 14 days on a CPU server.
That is better, but it still can't be used. Example results:

origin_articles[i]: 本期 节目 的 嘉宾 , 我们 邀请 到 的 是 百度 的 创始人 , 董事长 兼 首席 执行官 李彦宏 先生 , 以及 通讯 和 软件 领域 世界级 的 科学家 , 也 是 新近 加盟 百度 的 总裁 博士 。 作为 站 在 中国 互联网 前沿 的 IT 人士 , 他们 会 如何 看待 迅猛发展 大 数据 科技 呢 ?
origin_abstracts[i]: 杨澜 访谈录 李彦宏 、 张亚勤 : 云端 的 大 数据
decoded_output words: 网络营销 要 要 要 什么

origin_articles[i]: 中秋节 当天 , 有 网友 爆出 刘翔 现身 上海 普陀 婚姻登记 中心 的 照片 。 新民晚报 记者 独家 拨通 刘翔 父亲 的 手机 , 翔 爸爸 告诉 记者 : “ 具体 等 明天 ( 9 月 9 日 ) 领证 了 再说 。 ” 据悉 , 两 人 是 在 伦敦 奥运会 后 认识 的 , 感情 很 好 。
origin_abstracts[i]: 新民晚报 独家 消息 : 刘翔 今天 领证 结婚 啦
decoded_output words: 湖南 一 “ 幼儿园 ” !

origin_articles[i]: 此次 公务员 工资 试点 改革 , 实行 不 升职 但 可 升级 , 只要 工作 的 时间 够 长 , 工资 就 能 逐级 上涨 。 但 对于 公务员 工资改革 , 没有 社会 共识 才 是 最 根本 的 问题 , “ 政府 不 缺钱 , 但 老百姓 不 同意 。 ” 下
origin_abstracts[i]: 公务员 不 升职 也 能涨 薪 工资 可比 局长 高
decoded_output words: 玛客 玛客 的 “ 热门 ” ”

I think the RNN trains very inefficiently; you should expect to train it for a long time.

@panyx0718 (Contributor)

How many steps have you trained? We used 10 machines, each with 4 GPUs, and trained for a week.
I also saw that there are many words in your abstracts that are not in the original articles.

@UB1010 commented Oct 27, 2016

@panyx0718
Hi panyx, my training progress so far is:

summary write, step: 43300
running_avg_loss: 3.681036
running_avg_loss: 4.397063
running_avg_loss: 5.468384
.........

How many steps did you train for?

Do you think I need to adjust some parameters for Chinese weibo data?

@panyx0718 (Contributor)

We trained a few million steps. 43k is too small.


@kinhunt commented Nov 1, 2016

@panyx0718 Thank you. Given enough data and computing power, say 50 million Chinese articles and one week of training on 10 machines with 4 GPUs each, can we expect decent results from the textsum model?

@panyx0718 (Contributor)

It depends on the quality of your data. Also, I haven't tried such a large dataset (50M) before.


@SawyerW commented Jan 12, 2017

@hphp @UB1010 Hi, I know you probably used the data_convert_example.py file to make the training data, but how did you get the vocab? Could you share your code for generating the vocab? Thanks

@xtr33me commented Jan 13, 2017

@SawyerW The vocab file is simply every word in the dataset you are using, with a count next to it of the number of times it appears across all the data files. I have seen some datasets keep only the top 200K words or something along those lines, just to reduce the number of words in the file. This is completely up to you though.

@SawyerW commented Jan 13, 2017

@xtr33me Yes, you are right, but when I used the vocab file I created myself, some errors happened that were hard to track down. When I used the vocab file in /textsum/data to train on my own data it worked, even though it could not produce the right answers. So I wonder if you have your own code to create the vocab file? Maybe you also used your own code to transform the training data into binary data.

@xtr33me commented Jan 14, 2017

@SawyerW The code I used to create my vocab file is pretty simple. You can get the gist of it below. Add it to a function or inline it in some other processor you have. It's of course important that you run this against your decoded text data, not the dataset that has been converted to binary. I'm sure there is a better way of doing this, but it worked for me. The other important key is that your input data be as clean as possible; that cleaning can usually be done in your web scraper.


from collections import Counter

# Count every whitespace-separated token in the decoded text data and
# write one "word count" pair per line to the vocab file (Python 2).
with open("datadecoded") as datainput, open("vocab", 'w') as v:
    wordcount = Counter(datainput.read().split())
    for item in wordcount.items():
        print >>v, ("{} {}".format(*item))
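
One caveat to add (my reading of textsum's data.py, so verify against your checkout): the model also looks up special symbols in the vocab, so the file should contain entries for <s>, </s>, <UNK> and <PAD>, each on its own line with a count. The count appears to be unused for the lookup itself, so the "UNK frequency" asked about earlier can be an arbitrary positive number, e.g.:

<s> 1
</s> 1
<UNK> 1
<PAD> 1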

@SawyerW commented Jan 15, 2017

@xtr33me Thanks, I found the problem: the data set was the issue, and it's fixed now.

@zcc973784075 commented Apr 5, 2017

@hphp @UB1010 @panyx0718 After reading all of the above, I still have some questions. First, do I have to convert my raw_data into the text_data form with the <d><p><s> tags? Second, what about using gensim to generate the vocab? Thanks a lot!

@yindia commented May 24, 2017

I am working on a search engine. My inventory is movie names, actors' names, etc.
My data set is: search_term and target_word (click, from the inventory).
My question is that searches are only one, or maybe two or three, words. I want predictions from 2-3 characters, so is textsum useful for me?

@Ali-Zareie

Hi all,
I created my own data and now I want to convert it to a binary file. When I run this:

python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data

I get this error:

Traceback (most recent call last):
File "data_convert_example.py", line 65, in <module>
tf.app.run()
File "/home/az/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "data_convert_example.py", line 61, in main
_text_to_binary()
File "data_convert_example.py", line 47, in _text_to_binary
(k, v) = feature.split('=')
ValueError: need more than 1 value to unpack

How can I fix this? What does it mean exactly?

@vdevmcitylp commented Jun 8, 2017

@xtr33me @panyx0718
I want to train the textsum model on the Insight (mlg.ucd.ie/datasets/bbc.html) dataset.

I ran data_convert_example.py on the toy dataset and got this result:

abstract=<d> <p> <s> sri lanka closes schools as war escalates . </s> </p> </d>
article=<d> <p> <s> the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country . </s> <s> the cabinet wednesday decided to advance the december holidays by one month because of a threat from the liberation tigers of tamil eelam -lrb- ltte -rrb- against school children , a government official said . </s> <s> there are intelligence reports that the tigers may try to kill a lot of children to provoke a backlash against tamils in colombo . </s> <s> if that happens , troops will have to be withdrawn from the north to maintain law and order here , '' a police official said . </s> <s> he said education minister richard pathirana visited several government schools wednesday before the closure decision was taken . </s> <s> the government will make alternate arrangements to hold end of term examinations , officials said . </s> <s> earlier wednesday , president chandrika kumaratunga said the ltte may step up their attacks in the capital to seek revenge for the ongoing military offensive which she described as the biggest ever drive to take the tiger town of jaffna . . </s> </p> </d>
publisher=AFP

Can you please help me with a script, or some hints, for converting my training data into this abstract=..., article=..., publisher=... format?

@xtr33me commented Jun 9, 2017

@vdevmcitylp Feel free to check out my GitHub link below. I added some formatting scripts some time back. I haven't touched this code in a while, but I know it was working. Essentially I scraped articles, then had to format them for parsing by the referenced data_convert_example.py. Hope it helps some.

https://github.com/tensorflow/models/tree/ef9c156ca7802a5e60018fb0cc7d950ea54569de/textsum
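
For anyone who wants just the shape of the transformation rather than the full scraper pipeline, here is a hypothetical minimal formatter; the function name, the pre-split sentence lists, and the publisher value are placeholder assumptions, not part of xtr33me's scripts:

# Build the one-line text format that data_convert_example.py's
# text_to_binary mode parses (tab-separated key=value fields).
def to_textsum_line(abstract_sentences, article_sentences, publisher='AFP'):
    def wrap(sentences):
        body = ' '.join('<s> %s </s>' % s for s in sentences)
        return '<d> <p> %s </p> </d>' % body
    return 'abstract=%s\tarticle=%s\tpublisher=%s' % (
        wrap(abstract_sentences), wrap(article_sentences), publisher)

print(to_textsum_line(
    ['sri lanka closes schools as war escalates .'],
    ['the sri lankan government announced the closure of schools .']))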

@vdevmcitylp

@xtr33me
It works!
I can't thank you enough for this.

@minaSamizade

@Ali-Zareie I have the same issue, did you solve it?

@gycg commented Jan 25, 2018

@Ali-Zareie @minaSamizade This is because there is an '=' in your text. Try (k, v) = feature.split('=', 1), which adds the maxsplit parameter 1.
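
To make the two error variants in this thread concrete (a made-up demo line, not code from the repo): "too many values to unpack" means a field contained an extra '=', which the maxsplit=1 fix handles; "need more than 1 value to unpack", the error @Ali-Zareie saw, is the opposite case of a field with no '=' at all (e.g. a blank or malformed line), which maxsplit cannot fix and which needs the offending line cleaned or skipped.

# One tab-separated record whose article text itself contains '='.
line = "article=<d> <p> <s> a = b holds . </s> </p> </d>\tpublisher=AFP"

for feature in line.strip().split('\t'):
    # Plain feature.split('=') yields 3 parts for the first field and
    # raises ValueError; maxsplit=1 keeps key and full value intact.
    (k, v) = feature.split('=', 1)
    print(k)  # 'article', then 'publisher'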

@gunan (Contributor) commented Feb 8, 2018

Looks like the original issue was resolved.

@cassioalmeidas

@panyx0718 What was the configuration of the GPUs used in the experiment with 10 machines with 4 GPUs each?

@wengenihaoshuai

@hphp @UB1010

I have been using this code recently and also get a bad result: the decode output does not match the abstract.
Have you solved this problem, and if so, how? Thank you.

Abstract = "78 亿 主力 资金 近 三日 大量 撤出 中小 创"
Decode = "储备面 早盘 银行"

@xtr33me commented Oct 31, 2018

@wengenihaoshuai I saw similar results when I didn't have enough source articles. Could that be the issue by chance? In the end I scraped around 1.3 million articles, and after cleaning and filtering I was left with about 900k articles to train on. That was the first time I was happy with the results. Early on I had tried 40k and 200k articles and just wasn't happy with the results at all. Unsure if that is your problem, but it's something to look at.

@01Root commented Sep 18, 2019

When I tried to train on my own dataset with the official vocab, there was no problem. However, once I trained with my own vocab, I got "ValueError: Duplicated word: four." Does anybody know why?
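
A guess at the cause, assuming the stock data.py vocab loader, which raises exactly that ValueError when the same token occurs on two lines of the vocab file: a vocab built by concatenating per-file counts, or by mixing cleaning passes, can easily repeat a word such as "four". Below is a sketch that merges duplicate entries before training; the file names are placeholders.

from collections import Counter

# Merge counts for any token that appears on more than one line, so each
# word occupies exactly one line of the final vocab file.
counts = Counter()
with open('vocab_raw') as f:
    for line in f:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        word, n = parts
        counts[word] += int(n)

with open('vocab', 'w') as out:
    for word, n in counts.items():
        out.write('%s %d\n' % (word, n))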
