IETM is an open-source Java package that implements the algorithm proposed in the paper "Transferring Knowledge from Large Language Models for Short Text Topic Modeling".
- Java (Version=1.8)
All corpus files (Tweet, SearchSnippets, and StackOverflow) and the corresponding label files are provided under ./datasets.
Taking Tweet as an example, the dataset files are laid out as follows.
datasets
    Tweet
        Tweet.txt
        Tweet_label.txt
        Tweet_GPT.txt
        Tweet_DREx.txt
        Tweet_LLaMa.txt
        Tweet_LLaMa2.txt
where 'Tweet.txt' contains the original short texts and 'Tweet_label.txt' is the label file corresponding to the Tweet dataset. The other four files contain pseudo long documents generated by different methods; for example, each document in 'Tweet_GPT.txt' is generated by GPT from the corresponding original short text.
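Since each pseudo long document is generated from one original short text, a corpus file and its pseudo-long counterpart should contain the same number of documents. A minimal sanity-check sketch (the toy texts and temp-directory paths here are illustrative stand-ins, not the real dataset contents):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CheckAlignment {
    public static void main(String[] args) throws IOException {
        // Toy stand-ins for Tweet.txt and Tweet_GPT.txt, written to a temp
        // directory (the real files live under ./datasets/Tweet).
        Path dir = Files.createTempDirectory("ietm");
        Path shortFile = dir.resolve("Tweet.txt");
        Path longFile = dir.resolve("Tweet_GPT.txt");
        Files.write(shortFile, Arrays.asList(
                "flu shots available at the clinic",
                "phone battery drains so fast"));
        Files.write(longFile, Arrays.asList(
                "Influenza vaccination clinics typically open in autumn ...",
                "Smartphone battery drain is often caused by background apps ..."));

        List<String> shortTexts = Files.readAllLines(shortFile);
        List<String> longDocs = Files.readAllLines(longFile);

        // Each pseudo long document expands one short text, so the two
        // corpus files must contain the same number of lines.
        if (shortTexts.size() != longDocs.size()) {
            throw new IllegalStateException("corpus files are not line-aligned");
        }
        System.out.println("aligned documents: " + shortTexts.size());
    }
}
```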
To train the model, run:

bash run.sh
-algorithm: Specify the algorithm to run; set it to IETM.
-dataname: Specify the name of the dataset (Tweet, SearchSnippets, or StackOverflow).
-alpha: Specify the value of the document-topic Dirichlet prior. The default value is 1.0.
-beta: Specify the value of the topic-word Dirichlet prior. The default value is 0.01.
-ntopics: Specify the number of topics. The default value is 50.
-corpus: Specify the path to the input short-text corpus file.
-generateCorpus: Specify the path to the input pseudo-long-document corpus file.
-output: Specify the path to the output directory.
-name: Specify the name of the output file.
-niters: Specify the number of iterations. The default value is 1000.
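run.sh wraps a Java invocation carrying the flags above. The jar path, entry point, and output names below are assumptions rather than the repository's actual script, so this sketch only assembles and echoes the command instead of executing it:

```shell
#!/bin/bash
# Hypothetical sketch of run.sh. The jar location (jar/IETM.jar) and the
# output name are assumptions; adjust them to match the actual repository.
ALGORITHM=IETM
DATANAME=Tweet        # or SearchSnippets, StackOverflow
ALPHA=1.0
BETA=0.01
NTOPICS=50
NITERS=1000

CMD="java -jar jar/IETM.jar \
  -algorithm $ALGORITHM \
  -dataname $DATANAME \
  -corpus datasets/$DATANAME/$DATANAME.txt \
  -generateCorpus datasets/$DATANAME/${DATANAME}_GPT.txt \
  -alpha $ALPHA -beta $BETA \
  -ntopics $NTOPICS -niters $NITERS \
  -output results -name ${DATANAME}_IETM"

# Echo the assembled command (dry run) rather than executing it here.
echo "$CMD"
```

Swapping `${DATANAME}_GPT.txt` for `${DATANAME}_DREx.txt`, `${DATANAME}_LLaMa.txt`, or `${DATANAME}_LLaMa2.txt` selects a different source of pseudo long documents.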