GitHub - shreyaskarnik/PubMedLDA: Making it easy to perform LDA on PubMed abstracts.

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
sample_results		sample_results
README		README
gensim_lda_pubmed.py		gensim_lda_pubmed.py
getpmAbstracts.py		getpmAbstracts.py

Repository files navigation

This is code for performing topic modeling on PubMed abstracts. 
Depends on package gensim and nltk.
The package contains two parts:
1. Retrieve PubMed abstracts using the script getpmAbstracts.py
Usage: usage:python getpmAbstracts.py -q [query] -o [output] -s [flag for steming]

Options:
  -h, --help            show this help message and exit
  -q QUERY, --query=QUERY
                        Enter the PubMed query (PubMed style queries
                        supported)
  -o OFILE, --output=OFILE
                        Enter the output file name to store result
  -s, --stem            To stem result
2. Once you have the PubMed abstracts in a file run LDA over them:

Usage: python gensim_lda_pubmed.py -i [inputfile] -k [number of topics to extract] -v [verbose output FALSE by default] -t [TRUE/ FALSE for TFIDF weights] -r [return topics per document TRUE/FALSE (default FALSE)]

Options:
  -h, --help            show this help message and exit
  -i IFILE, --inputfile=IFILE
                        Enter the file containing PubMed abstracts
  -k NTOP, --numtopics=NTOP
                        Number of topics
  -t TFIDF, --tfidf=TFIDF
                        TFIDF weignting (default TRUE)
  -v VERBOSE            Verbose Output TRUE/FALSE (default FALSE)
  -r FIT                Return topics per document TRUE/FALSE (default FALSE)

All the code in this project is under Creative Commons License.