NLP on Hacker News data (work in progress)
This project aims at large-scale analysis of the text and graph data formed by posts and comments on Hacker News.
Current status:
- All posts and comments from the start of HN through July 2021 are loaded into a PostgreSQL database
- Some basic data exploration code is included in a Jupyter notebook
To do:
- Spectral clustering of users based on their post/comment content
- HN-style natural language generation
- Automatic post tagging based on Latent Dirichlet Allocation
- Entity extraction
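
Illustrative sketch (not part of the repository): one way the graph side of the data could be derived from comment rows. It assumes each row carries an item id, the id of its parent item, and the author name, modelled on the `id`, `parent`, and `by` fields of the public HN API. The actual database schema is not described here, so the tuple layout, the sample rows, and the `reply_edges` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical sample rows: (id, parent, by). Real rows would come from the
# PostgreSQL database; the column names and values here are assumptions.
ITEMS = [
    (1, None, "alice"),  # a top-level post
    (2, 1, "bob"),       # bob replies to alice's post
    (3, 2, "alice"),     # alice replies to bob
    (4, 1, "carol"),     # carol replies to alice's post
    (5, 2, "carol"),     # carol replies to bob
    (6, 99, "dave"),     # parent is not in the sample, so it is skipped
]


def reply_edges(items):
    """Count directed (replier, parent_author) edges between users."""
    # Map each item id to its author so parents can be resolved.
    author_of = {item_id: by for item_id, _parent, by in items}
    edges = Counter()
    for _item_id, parent, by in items:
        parent_author = author_of.get(parent)
        # Skip top-level posts, unknown parents, and self-replies.
        if parent_author is None or parent_author == by:
            continue
        edges[(by, parent_author)] += 1
    return edges


edges = reply_edges(ITEMS)
for (replier, target), count in sorted(edges.items()):
    print(f"{replier} -> {target}: {count}")
```

A weighted edge list like this could serve as input to graph-based analysis, for instance as one ingredient of the affinity matrix for the user clustering item above, alongside text features.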