This code is a demo implementation of the paper "Tag-Weighted Topic Model for Mining Semi-Structured Documents".

The paper is at http://dl.acm.org/citation.cfm?id=2540540
Authors: Shuangyin Li, Jiefei Li, Rong Pan
Sun Yat-sen University

For any questions about the code, please contact us by email:
shuangyinli AT cse.ust.hk
lijiefei AT mail2.sysu.edu.cn
panr AT sysu.edu.cn

License

Copyright 2013 Shuangyin Li, Jiefei Li, Rong Pan
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Easy Way:

./example.sh

Install

cd src/ && make

Usage

### Input file format:
DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...
DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...
DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...

Each row represents one document together with its labels. DocNumLabels is the number of labels of the document, and DocNumWords is the number of words of the document. Each label is an integer id that represents one label, and each word is an integer id that represents one word.
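
As a minimal sketch of the row format described above, the following Python helper splits one row into its label ids and word ids. The function name `parse_twtm_line` and the sample row are hypothetical and not part of this repository. It also assumes that each count field equals the number of ids that follow it.

```python
from typing import List, Tuple


def parse_twtm_line(line: str) -> Tuple[List[int], List[int]]:
    """Parse 'DocNumLabels l1 l2 ... @ DocNumWords w1 w2 ...' into (labels, words).

    Hypothetical helper for illustration; it is not part of the twtm code.
    """
    label_part, sep, word_part = line.partition("@")
    if not sep:
        raise ValueError("row has no '@' separator")

    label_fields = [int(tok) for tok in label_part.split()]
    word_fields = [int(tok) for tok in word_part.split()]
    if not label_fields or not word_fields:
        raise ValueError("row is missing a count field")

    # The first field on each side is the count; the rest are the ids.
    num_labels, labels = label_fields[0], label_fields[1:]
    num_words, words = word_fields[0], word_fields[1:]

    # Assumption: each count matches the number of ids that follow it.
    if num_labels != len(labels):
        raise ValueError("DocNumLabels does not match the number of labels")
    if num_words != len(words):
        raise ValueError("DocNumWords does not match the number of words")
    return labels, words


# Made-up row: 2 labels (0 and 5), then 4 words (17, 3, 3, 42).
labels, words = parse_twtm_line("2 0 5 @ 4 17 3 3 42")
print(labels, words)  # [0, 5] [17, 3, 3, 42]
```

A check like this can be run over an input file before training to catch rows whose counts and ids disagree.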

demo/twtm.demo.input is a simple demo input file.
demo/label.txt is the label dictionary file. The word in row 1 corresponds to label 0.
demo/words.dic is the word dictionary file.
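
Since row 1 of the label dictionary corresponds to label 0, a label id can be used as a 0-based index into the rows of the file. The sketch below illustrates that lookup. The helper names and the sample label names are hypothetical. Whether demo/words.dic follows the same row-to-id convention is an assumption and is not stated here.

```python
from typing import Iterable, List


def load_dictionary(lines: Iterable[str]) -> List[str]:
    """Return dictionary entries so that entry i is the name for id i.

    Hypothetical helper: row 1 of the file becomes index 0, row 2 index 1, ...
    """
    return [line.rstrip("\r\n") for line in lines]


def ids_to_names(ids: Iterable[int], dictionary: List[str]) -> List[str]:
    """Map integer ids to their dictionary entries."""
    return [dictionary[i] for i in ids]


# Made-up dictionary rows, standing in for the contents of a label file.
label_rows = ["sports\n", "politics\n", "music\n"]
label_names = load_dictionary(label_rows)
print(ids_to_names([0, 2], label_names))  # ['sports', 'music']
```

To read a real file, pass an open file object, for example `load_dictionary(open("demo/label.txt", encoding="utf-8"))`.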


### Training:

./twtm est <input data file> <setting.txt> <num_topics> <model save dir>

Example:

./src/twtm est demo/twtm.demo.input src/setting.txt 10 demo/model

Some model training parameters are set in the file "setting.txt".

### Inference:
There are two ways to infer the topic distribution of a new document.
The first uses the labels of the new document during inference.

./twtm inf <input data file> <setting.txt> <model dir> <prefix> <output dir>

Example:

./src/twtm inf demo/twtm.demo.input src/setting.txt demo/model/ final demo/output/

The doc-topics-dis.txt file is written to the output dir. It contains the topic distributions for the input data file. The values in the file are logarithms, so apply exp(.) to them to obtain the actual probabilities.
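
The exp(.) step can be sketched in Python as follows. The layout of doc-topics-dis.txt is not described here, so this assumes each row holds whitespace-separated natural-log values for one document. The function name `log_row_to_probs` and the sample row are hypothetical.

```python
import math
from typing import List


def log_row_to_probs(row: str) -> List[float]:
    """Apply exp(.) to every value in one row of log values.

    Assumption: the row is whitespace-separated text, one value per topic.
    """
    return [math.exp(float(tok)) for tok in row.split()]


# Made-up row holding log(0.25) and log(0.75) for a two-topic model.
sample_row = f"{math.log(0.25)} {math.log(0.75)}"
probs = log_row_to_probs(sample_row)
print(probs, sum(probs))
```

If the assumption about the layout holds, the values in each converted row should sum to approximately 1.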

The second uses only the words of the new document. With the TWTM model, we can therefore infer the topic distribution of a new document that has no labels, just as with an LDA model.

./twtm lda-inf <input data file> <setting.txt> <model dir> <prefix> <output dir>

Example:

./src/twtm lda-inf demo/twtm.demo.input src/setting.txt demo/model/ final demo/output/