shreyaskarnik/PubMedLDA
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
This is code for performing topic modeling on PubMed abstracts.
Depends on package gensim and nltk.
The package contains two parts:
1. Retrieve PubMed abstracts using the script getpmAbstracts.py
Usage: usage:python getpmAbstracts.py -q [query] -o [output] -s [flag for steming]
Options:
-h, --help show this help message and exit
-q QUERY, --query=QUERY
Enter the PubMed query (PubMed style queries
supported)
-o OFILE, --output=OFILE
Enter the output file name to store result
-s, --stem To stem result
2. Once you have the PubMed abstracts in a file run LDA over them:
Usage: python gensim_lda_pubmed.py -i [inputfile] -k [number of topics to extract] -v [verbose output FALSE by default] -t [TRUE/ FALSE for TFIDF weights] -r [return topics per document TRUE/FALSE (default FALSE)]
Options:
-h, --help show this help message and exit
-i IFILE, --inputfile=IFILE
Enter the file containing PubMed abstracts
-k NTOP, --numtopics=NTOP
Number of topics
-t TFIDF, --tfidf=TFIDF
TFIDF weignting (default TRUE)
-v VERBOSE Verbose Output TRUE/FALSE (default FALSE)
-r FIT Return topics per document TRUE/FALSE (default FALSE)
All the code in this project is under Creative Commons License.