
textsum : How to train against my own data ? #373

Closed · aronayne opened this issue Aug 30, 2016 · 36 comments
Labels: stat:awaiting model gardener (Waiting on input from TensorFlow model gardener)

@aronayne commented Aug 30, 2016

textsum model

Within the data folder (https://github.com/tensorflow/models/tree/master/textsum/data) there are two files: data and vocab. Is the following correct: data contains the article text to be summarised, and vocab is a word count based on the Gigaword dataset? If so, to summarise my own data do I just need to replace the content of the data file in /data/data? Or do I need to use the licensed Gigaword dataset in order to train against my own news articles?

@poxvoculi poxvoculi added the stat:awaiting model gardener Waiting on input from TensorFlow model gardener label Aug 31, 2016
@neufang commented Sep 6, 2016

I have the same question. It would be nice to have some information on how to create the data/data file given some article texts and summaries.

@panyx0718 (Contributor)

See https://github.com/tensorflow/models/pull/379/files for examples of making training data for the model.

@wangliangguo

@panyx0718 After checking data_convert_example.py in #379, I still don't know what the input text looks like. Is there a required format for the data sent to the text_to_binary function?

@neufang commented Sep 8, 2016

Hi @wangliangguo
python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data

The output file text_data shows the expected input text format.

@wangliangguo

@neufang Thanks. Have you run the model on your own data successfully? How did you generate your vocab?

@UB1010 commented Sep 12, 2016

Hi @aronayne,
Can you give an example of the text data for: python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data
Thank you!

I used a text data file and got this error:

Traceback (most recent call last):
File "data_convert_example.py", line 67, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "data_convert_example.py", line 63, in main
_text_to_binary()
File "data_convert_example.py", line 49, in _text_to_binary
(k, v) = feature.split('=')
ValueError: too many values to unpack

The text data is:

reopens (ROME)
3rd Rd
Michael Stich (Germany x2) bt Karim Alami (Morocco) 7-6 (7/5), 6-4
jpb/rw94

reopens (BERLIN)
3rd Rd
Anke Huber (Germany x7) bt Katarina Maleeva (Bulgaria) 5-7, 6-4, 6-4
Elena Makarova (Russia) bt Barbara Rittner (Germany x15) 6-2, 6-1
Ann Grossman (USA) bt Gabriela Sabatini (Argentina x4) 6-3, 6-4
Brenda Schultz (Holland) bt Silke Meier (Germany) 6-2, 6-4
vog/rw94

Tributes pour in for late British Labour Party leader

UNDATED, May 12 (AFP)

........

@UB1010 commented Sep 12, 2016

Hi @aronayne,
The HTML tags can't be shown here. My data is from LDC, but it is plain text.
If you can give an example of the text data, could you attach a file? Thank you!
Also, why must textsum read a binary data file?

@panyx0718 (Contributor)

I ran it again.

python data_convert_example.py --command binary_to_text --in_file data/data --out_file data/text_data

head data/text_data

publisher=AFP
abstract=<d> <p> <s> sri lanka closes schools as war escalates . </s> </p> </d>
article=<d> <p> <s> the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country . </s> <s> the cabinet wednesday decided to advance the december holidays by one month because of a threat from the liberation tigers of tamil eelam -lrb- ltte -rrb- against school children , a government official said . </s> <s> there are intelligence reports that the tigers may try to kill a lot of children to provoke a backlash against tamils in colombo . </s> <s> if that happens , troops will have to be withdrawn from the north to maintain law and order here , '' a police official said . </s> <s> he said education minister richard pathirana visited several government schools wednesday before the closure decision was taken . </s> <s> the government will make alternate arrangements to hold end of term examinations , officials said . </s> <s> earlier wednesday , president chandrika kumaratunga said the ltte may step up their attacks in the capital to seek revenge for the ongoing military offensive which she described as the biggest ever drive to take the tiger town of jaffna . . </s> </p> </d>

...

python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data
diff data/binary_data data/data
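
On the question above of why textsum must read a binary file: the text file is just an interchange format; the training reader streams length-prefixed serialized tf.Example records. Below is a minimal sketch of the record writer, modeled on data_convert_example.py's text_to_binary path; the helper name write_binary_record is my own, so treat it as an illustration rather than the repo's code.

import struct
from tensorflow.core.example import example_pb2

# Each record: an 8-byte length prefix, then one serialized tf.Example
# holding the abstract/article/publisher features as byte strings.
def write_binary_record(writer, features):
    tf_example = example_pb2.Example()
    for k, v in features.items():
        tf_example.features.feature[k].bytes_list.value.extend([v])
    example_str = tf_example.SerializeToString()
    writer.write(struct.pack('q', len(example_str)))
    writer.write(struct.pack('%ds' % len(example_str), example_str))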


@UB1010 commented Sep 20, 2016

@panyx0718
Thank you for your response.
I got the text data format from data/data and made binary data from my own data.
I used more than 2.4 million Chinese weibo posts for training, but I can't get a good result.
I have 2 questions, can you help?
1. The training data needs the "d", "p" and "s" tags, with "s" marking sentences. Must every sentence in the training data be wrapped in "s" tags? If my own data has no sentence boundaries, can I put a whole article inside one pair of "s" tags, i.e. treat each article as a single sentence? Is that OK? Also, I made the vocab file from my own training data by counting word frequencies, but how should I set the "UNK" frequency? Can you explain how to build the vocab file?

2. I added "s" tags around every sentence and trained the model with the default parameters, but the decoder produces nearly the same words for every article. How should I test the model? Can you give more information?
Thank you.

@doumoxiao

I have the same problem as you @UB1010: every result generated by decode is the same.
I used more than one million Chinese news articles for training, but the result is very bad, like:
decode output=不 就 的 就
Have you solved the problem?

@UB1010 commented Oct 18, 2016

@doumoxiao
I haven't solved the problem. Someone said it needs a long time to train, but I trained the model on a CPU for more than one week and the decode results are still bad.

@hphp commented Oct 26, 2016

@UB1010
I trained with a vocab of 100K words and 0.3 million articles with corresponding abstracts for 5 days on a single Tesla K40, and increased the beam size from 4 to 20. The results became more reasonable than in the first rounds, but they still don't match the meaning of the test data, for example:
INFO:tensorflow:article: 全球 九大 警用 步枪
INFO:tensorflow:abstract: 世界 各国 步枪
INFO:tensorflow:decoded: 世界 客
INFO:tensorflow:article: 进化 计算 的 理论 和 方法 【 图片 价格 品牌 报价 】 - 京东 商城
INFO:tensorflow:abstract: 自动 化 书籍 图片
INFO:tensorflow:decoded: 吊顶 装修 亮化

@UB1010 commented Oct 26, 2016

@hphp
Yes, my results are similar.
My data is more than 2.4 million weibo posts with abstracts, beam_size = 4.
I trained the model for 14 days on a CPU server.
That is better, but it still can't be used. Example results:

origin_articles[i]: 本期 节目 的 嘉宾 , 我们 邀请 到 的 是 百度 的 创始人 , 董事长 兼 首席 执行官 李彦宏 先生 , 以及 通讯 和 软件 领域 世界级 的 科学家 , 也 是 新近 加盟 百度 的 总裁 博士 。 作为 站 在 中国 互联网 前沿 的 IT 人士 , 他们 会 如何 看待 迅猛发展 大 数据 科技 呢 ?
origin_abstracts[i]: 杨澜 访谈录 李彦宏 、 张亚勤 : 云端 的 大 数据
decoded_output words: 网络营销 要 要 要 什么

origin_articles[i]: 中秋节 当天 , 有 网友 爆出 刘翔 现身 上海 普陀 婚姻登记 中心 的 照片 。 新民晚报 记者 独家 拨通 刘翔 父亲 的 手机 , 翔 爸爸 告诉 记者 : “ 具体 等 明天 ( 9 月 9 日 ) 领证 了 再说 。 ” 据悉 , 两 人 是 在 伦敦 奥运会 后 认识 的 , 感情 很 好 。
origin_abstracts[i]: 新民晚报 独家 消息 : 刘翔 今天 领证 结婚 啦
decoded_output words: 湖南 一 “ 幼儿园 ” !

origin_articles[i]: 此次 公务员 工资 试点 改革 , 实行 不 升职 但 可 升级 , 只要 工作 的 时间 够 长 , 工资 就 能 逐级 上涨 。 但 对于 公务员 工资改革 , 没有 社会 共识 才 是 最 根本 的 问题 , “ 政府 不 缺钱 , 但 老百姓 不 同意 。 ” 下
origin_abstracts[i]: 公务员 不 升职 也 能涨 薪 工资 可比 局长 高
decoded_output words: 玛客 玛客 的 “ 热门 ” ”

I think the RNN trains very inefficiently; you should expect to train it for a long time.

@panyx0718 (Contributor)

How many steps have you trained? We used 10 machines, each with 4 GPUs, and trained for a week.
I also saw that there are many words in your abstracts that are not in the original articles.

@UB1010 commented Oct 27, 2016

@panyx0718
Hi panyx, my training progress so far is:

summary write, step: 43300
running_avg_loss: 3.681036
running_avg_loss: 4.397063
running_avg_loss: 5.468384
.........

How many steps did you train for?

Do you think I need to adjust some parameters for Chinese weibo data?

@panyx0718 (Contributor)

We trained a few million steps. 43k is too small.


@kinhunt commented Nov 1, 2016

@panyx0718 Thank you. Given enough data and computing power, say 50 million Chinese articles and one week of training on 10 machines with 4 GPUs each, can we expect decent results from the textsum model?

@panyx0718 (Contributor)

It depends on the quality of your data. Also, I haven't tried such a large dataset (50M) before.


@SawyerW commented Jan 12, 2017

@hphp @UB1010 Hi, I know you probably used the data_convert_example.py file to make the training data, but how did you get the vocab? Could you share your code for generating the vocab? Thanks

@xtr33me commented Jan 13, 2017

@SawyerW The vocab file is simply every word in the dataset you are using, with a count next to it of the number of times it appears across all the data files. I have seen some datasets keep only the top 200K words or something along those lines, just to reduce the number of words in the file. This is completely up to you though.

@SawyerW commented Jan 13, 2017

@xtr33me Yes, you are right, but when I used the vocab file I created myself, some errors happened that were hard to track down. When I used the vocab file in /textsum/data to train on my own data it worked, even though it could not produce the right answers. So I wonder if you have your own code to create the vocab file? Maybe you also used your own code to transform the training data into binary data.

@xtr33me commented Jan 14, 2017

@SawyerW The code I used to create my vocab file is pretty simple. You can get the gist of it below. Add it to a function or inline it in some other processor you have. It's of course important that you run this against your decoded text data, not the dataset that has been converted to binary. I'm sure there is a better way of doing this, but it worked for me. The other important key is that your input data be as clean as possible; that cleaning can usually be done in your web scraper.


from collections import Counter

# Count every whitespace-separated token in the decoded text data and
# write one "word count" pair per line to the vocab file (Python 2).
with open("datadecoded") as datainput, open("vocab", 'w') as v:
    wordcount = Counter(datainput.read().split())
    for item in wordcount.items():
        print >>v, ("{} {}".format(*item))
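
One caveat to add (my reading of textsum's data.py, so verify against your checkout): the model also looks up special symbols in the vocab, so the file should contain entries for <s>, </s>, <UNK> and <PAD>, each on its own line with a count. The count appears to be unused for the lookup itself, so the "UNK frequency" asked about earlier can be an arbitrary positive number, e.g.:

<s> 1
</s> 1
<UNK> 1
<PAD> 1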

@SawyerW commented Jan 15, 2017

@xtr33me Thanks, I found the problem: the data set was the issue, and it's fixed now.

@zcc973784075 commented Apr 5, 2017

@hphp @UB1010 @panyx0718 After reading all of the above, I still have some questions. First, do I have to convert my raw_data into the text_data form with the <d><p><s> tags? Second, what about using gensim to generate the vocab? Thanks a lot!

@yindia commented May 24, 2017

I am working on a search engine. My inventory is movie names, actors' names, etc.
My data set is: search_term and target_word (click, from the inventory).
My question is that searches are only one, or maybe two or three, words. I want predictions from 2-3 characters, so is textsum useful for me?

@Ali-Zareie

Hi all,
I created my own data and now I want to convert it to a binary file. When I run this:

python data_convert_example.py --command text_to_binary --in_file data/text_data --out_file data/binary_data

I get this error:

Traceback (most recent call last):
File "data_convert_example.py", line 65, in <module>
tf.app.run()
File "/home/az/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "data_convert_example.py", line 61, in main
_text_to_binary()
File "data_convert_example.py", line 47, in _text_to_binary
(k, v) = feature.split('=')
ValueError: need more than 1 value to unpack

How can I fix this? What does it mean exactly?

@vdevmcitylp commented Jun 8, 2017

@xtr33me @panyx0718
I want to train the textsum model on the Insight (mlg.ucd.ie/datasets/bbc.html) dataset.

I ran data_convert_example.py on the toy dataset and got this result:

abstract=<d> <p> <s> sri lanka closes schools as war escalates . </s> </p> </d>
article=<d> <p> <s> the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country . </s> <s> the cabinet wednesday decided to advance the december holidays by one month because of a threat from the liberation tigers of tamil eelam -lrb- ltte -rrb- against school children , a government official said . </s> <s> there are intelligence reports that the tigers may try to kill a lot of children to provoke a backlash against tamils in colombo . </s> <s> if that happens , troops will have to be withdrawn from the north to maintain law and order here , '' a police official said . </s> <s> he said education minister richard pathirana visited several government schools wednesday before the closure decision was taken . </s> <s> the government will make alternate arrangements to hold end of term examinations , officials said . </s> <s> earlier wednesday , president chandrika kumaratunga said the ltte may step up their attacks in the capital to seek revenge for the ongoing military offensive which she described as the biggest ever drive to take the tiger town of jaffna . . </s> </p> </d>
publisher=AFP

Can you please help me with a script, or some hints, for converting my training data into this abstract=..., article=..., publisher=... format?

@xtr33me commented Jun 9, 2017

@vdevmcitylp Feel free to check out my GitHub link below. I added some formatting scripts some time back. I haven't touched this code in a while, but I know it was working. Essentially I scraped articles, then had to format them for parsing by the referenced data_convert_example.py. Hope it helps some.

https://github.com/tensorflow/models/tree/ef9c156ca7802a5e60018fb0cc7d950ea54569de/textsum
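
For anyone who wants just the shape of the transformation rather than the full scraper pipeline, here is a hypothetical minimal formatter; the function name, the pre-split sentence lists, and the publisher value are placeholder assumptions, not part of xtr33me's scripts:

# Build the one-line text format that data_convert_example.py's
# text_to_binary mode parses (tab-separated key=value fields).
def to_textsum_line(abstract_sentences, article_sentences, publisher='AFP'):
    def wrap(sentences):
        body = ' '.join('<s> %s </s>' % s for s in sentences)
        return '<d> <p> %s </p> </d>' % body
    return 'abstract=%s\tarticle=%s\tpublisher=%s' % (
        wrap(abstract_sentences), wrap(article_sentences), publisher)

print(to_textsum_line(
    ['sri lanka closes schools as war escalates .'],
    ['the sri lankan government announced the closure of schools .']))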

@vdevmcitylp

@xtr33me
It works!
I can't thank you enough for this.

@minaSamizade

@Ali-Zareie I have the same issue, did you solve it?

@gycg commented Jan 25, 2018

@Ali-Zareie @minaSamizade This is because there is an '=' in your text. Try (k, v) = feature.split('=', 1), which adds the maxsplit parameter 1.
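
To make the two error variants in this thread concrete (a made-up demo line, not code from the repo): "too many values to unpack" means a field contained an extra '=', which the maxsplit=1 fix handles; "need more than 1 value to unpack", the error @Ali-Zareie saw, is the opposite case of a field with no '=' at all (e.g. a blank or malformed line), which maxsplit cannot fix and which needs the offending line cleaned or skipped.

# One tab-separated record whose article text itself contains '='.
line = "article=<d> <p> <s> a = b holds . </s> </p> </d>\tpublisher=AFP"

for feature in line.strip().split('\t'):
    # Plain feature.split('=') yields 3 parts for the first field and
    # raises ValueError; maxsplit=1 keeps key and full value intact.
    (k, v) = feature.split('=', 1)
    print(k)  # 'article', then 'publisher'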

@gunan (Contributor) commented Feb 8, 2018

Looks like the original issue was resolved.

@cassioalmeidas

@panyx0718 What was the configuration of the GPUs used in the experiment with 10 machines with 4 GPUs each?

@wengenihaoshuai

@hphp @UB1010

I have been using this code recently and also get a bad result: the decode output does not match the abstract.
Have you solved this problem, and if so, how? Thank you.

Abstract = "78 亿 主力 资金 近 三日 大量 撤出 中小 创"
Decode = "储备面 早盘 银行"

@xtr33me commented Oct 31, 2018

@wengenihaoshuai I saw similar results when I didn't have enough source articles. Could that be the issue by chance? In the end I scraped around 1.3 million articles, and after cleaning and filtering I was left with about 900k articles to train on. That was the first time I was happy with the results. Early on I had tried 40k and 200k articles and just wasn't happy with the results at all. Unsure if that is your problem, but it's something to look at.

@01Root commented Sep 18, 2019

When I tried to train on my own dataset with the official vocab, there was no problem. However, once I trained with my own vocab, I got "ValueError: Duplicated word: four." Does anybody know why?
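
A guess at the cause, assuming the stock data.py vocab loader, which raises exactly that ValueError when the same token occurs on two lines of the vocab file: a vocab built by concatenating per-file counts, or by mixing cleaning passes, can easily repeat a word such as "four". Below is a sketch that merges duplicate entries before training; the file names are placeholders.

from collections import Counter

# Merge counts for any token that appears on more than one line, so each
# word occupies exactly one line of the final vocab file.
counts = Counter()
with open('vocab_raw') as f:
    for line in f:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        word, n = parts
        counts[word] += int(n)

with open('vocab', 'w') as out:
    for word, n in counts.items():
        out.write('%s %d\n' % (word, n))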
