I scrape tweets tagged #HRTechConf and build a Latent Dirichlet Allocation (LDA) model to automatically detect and interpret the topics in those tweets. Here is my pipeline:
- Data gathering – Twitter scrape
- Data pre-processing
- Generating word cloud
- Training the LDA model
- Visualizing topics
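As a rough illustration of the pre-processing step, here is a minimal sketch that uses only the standard library. It is not the code in HRTech2019_LDA.py: the `preprocess` function, the tiny `STOP_WORDS` set, and the sample tweets are all hypothetical stand-ins, and the project itself relies on NLTK for stop words, tokenization, and lemmatization.

```python
import re
import string
from collections import Counter

# Assumption: a tiny illustrative stop-word list, standing in for NLTK's.
STOP_WORDS = {"a", "an", "and", "are", "at", "for", "in", "is",
              "of", "on", "the", "to", "with", "rt"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Map every punctuation character (including '#') to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(tweet):
    """Lowercase a tweet and drop URLs, mentions, punctuation, and stop words."""
    text = URL_RE.sub(" ", tweet.lower())
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


# Hypothetical sample tweets, made up for this example.
tweets = [
    "Great keynote at #HRTechConf on AI in recruiting! https://example.com/a",
    "RT @someone: AI and analytics are the future of HR #HRTechConf",
]

docs = [preprocess(t) for t in tweets]

# Token frequencies are the kind of input a word cloud is built from.
word_counts = Counter(tok for doc in docs for tok in doc)
print(docs)
print(word_counts.most_common(3))
```

The token lists in `docs` are the shape of input that a bag-of-words topic model expects, and `word_counts` shows which terms would dominate a word cloud.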
This project requires Python 3.6+ and the following Python libraries installed:
- TwitterScraper, a Python script for scraping tweets
- NLTK (Natural Language Toolkit), an NLP package for text processing, e.g. stop words, punctuation, tokenization, lemmatization, etc.
- Gensim, “generate similar”, a popular NLP package for topic modeling
- Latent Dirichlet Allocation (LDA), a generative, probabilistic model for topic clustering/modeling (an implementation is available in Gensim)
- pyLDAvis, an interactive LDA visualization package, designed to help interpret topics in a topic model that is trained on a corpus of text data
- NumPy
- Pandas
- matplotlib
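To confirm that the libraries above are available before running the script, a small check along these lines may help. This is only a sketch: the `missing_modules` helper is hypothetical, and the import names in `REQUIRED_MODULES` are assumed to match the packages listed (for example, `twitterscraper` for TwitterScraper).

```python
from importlib.util import find_spec

# Assumption: these are the import names of the libraries listed above.
REQUIRED_MODULES = [
    "twitterscraper",
    "nltk",
    "gensim",
    "pyLDAvis",
    "numpy",
    "pandas",
    "matplotlib",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

The check only looks each module up and does not import it, so it runs quickly even when the heavier packages are installed.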
Code is provided in HRTech2019_LDA.py.