rwang16/IETM

IETM (Instructed-Expansion-Based Topic Model)

IETM builds on an open-source Java package to implement the algorithm proposed in the paper "Transferring Knowledge from Large Language Models for Short Text Topic Modeling".

1. Requirements

  • Java (version 1.8)

2. Datasets

All corpus files (Tweet, SearchSnippets, and StackOverflow) and their corresponding label files are provided under ./datasets.

Taking Tweet as an example, the dataset directory is laid out as follows.

datasets
    Tweet
        Tweet.txt
        Tweet_label.txt
        Tweet_GPT.txt
        Tweet_DREx.txt
        Tweet_LLaMa.txt
        Tweet_LLaMa2.txt

Here, 'Tweet.txt' contains the original short texts, and 'Tweet_label.txt' is the label file for the Tweet dataset. The other four files contain pseudo-long documents generated by different methods. For example, each document in 'Tweet_GPT.txt' is generated by GPT from the corresponding original short text.
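The relationship between the files can be illustrated with a minimal Java sketch that pairs each short text with its pseudo-long document and label. It is not part of the IETM code. It assumes that every file stores one document per line, that line i of 'Tweet_GPT.txt' and 'Tweet_label.txt' corresponds to line i of 'Tweet.txt', and that the files are UTF-8 encoded; none of this is stated above, so check the files before relying on it. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the IETM repository.
public class DatasetPairingSketch {

    /** One short text with its pseudo-long expansion and its label. */
    public static final class Example {
        public final String shortText;
        public final String expandedText;
        public final String label;

        public Example(String shortText, String expandedText, String label) {
            this.shortText = shortText;
            this.expandedText = expandedText;
            this.label = label;
        }
    }

    /** Pairs the three lists by position (ASSUMPTION: files are line-aligned). */
    public static List<Example> pair(List<String> shortTexts,
                                     List<String> expandedTexts,
                                     List<String> labels) {
        if (shortTexts.size() != expandedTexts.size()
                || shortTexts.size() != labels.size()) {
            throw new IllegalArgumentException(
                "Line counts differ: " + shortTexts.size() + " short texts, "
                + expandedTexts.size() + " pseudo-long documents, "
                + labels.size() + " labels");
        }
        List<Example> examples = new ArrayList<>();
        for (int i = 0; i < shortTexts.size(); i++) {
            examples.add(new Example(
                shortTexts.get(i), expandedTexts.get(i), labels.get(i)));
        }
        return examples;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("datasets", "Tweet");
        // ASSUMPTION: UTF-8 text, one document per line.
        List<String> shortTexts =
            Files.readAllLines(dir.resolve("Tweet.txt"), StandardCharsets.UTF_8);
        List<String> expanded =
            Files.readAllLines(dir.resolve("Tweet_GPT.txt"), StandardCharsets.UTF_8);
        List<String> labels =
            Files.readAllLines(dir.resolve("Tweet_label.txt"), StandardCharsets.UTF_8);

        List<Example> examples = pair(shortTexts, expanded, labels);
        System.out.println("Loaded " + examples.size() + " aligned examples");
    }
}
```

Swapping 'Tweet_GPT.txt' for 'Tweet_DREx.txt', 'Tweet_LLaMa.txt', or 'Tweet_LLaMa2.txt' pairs the same short texts with a different set of pseudo-long documents. If the line counts differ, the exception signals that the alignment assumption does not hold for that file.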

3. Run and Evaluate IETM

bash run.sh

-algorithm: Specify the algorithm to run (IETM).

-dataname: Specify the name of the dataset (Tweet, SearchSnippets, or StackOverflow).

-alpha: Specify the value of the Dirichlet prior alpha. The default value is 1.0.

-beta: Specify the value of the Dirichlet prior beta. The default value is 0.01.

-ntopics: Specify the number of topics. The default value is 50.

-corpus: Specify the input short-text corpus file.

-generateCorpus: Specify the input pseudo-long corpus file.

-output: Specify the path to the output directory.

-name: Specify the name of the output file.

-niters: Specify the number of iterations. The default value is 1000.
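To show how the parameters and their defaults fit together, here is a minimal Java sketch that collects "-flag value" pairs on top of the defaults listed above. It does not reproduce the repository's argument handling or the contents of run.sh. The class name, the "-flag value" layout, and the sample output directory and output name are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not the parser used by the IETM package.
public class IetmOptionsSketch {

    /** Returns the documented defaults overridden by any "-flag value" pairs. */
    public static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        // Defaults as listed in the parameter descriptions above.
        options.put("-alpha", "1.0");
        options.put("-beta", "0.01");
        options.put("-ntopics", "50");
        options.put("-niters", "1000");

        // ASSUMPTION: every flag is followed by exactly one value.
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-") || i + 1 >= args.length) {
                throw new IllegalArgumentException(
                    "Expected a '-flag value' pair at position " + i);
            }
            options.put(args[i], args[i + 1]);
        }
        return options;
    }

    public static void main(String[] args) {
        // Sample values; the output directory and output name are made up.
        String[] sample = {
            "-algorithm", "IETM",
            "-dataname", "Tweet",
            "-corpus", "datasets/Tweet/Tweet.txt",
            "-generateCorpus", "datasets/Tweet/Tweet_GPT.txt",
            "-output", "results/",
            "-name", "TweetIETM",
            "-ntopics", "100"
        };
        for (Map.Entry<String, String> entry : parse(sample).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

In this sketch, any parameter that is not passed explicitly keeps its documented default, so only -ntopics changes from 50 to 100 in the sample run while -alpha, -beta, and -niters stay at 1.0, 0.01, and 1000.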
