Incremental News Clustering

The goal of this thesis was to research model-based clustering methods, notably the Distance Dependent Chinese Restaurant Process (ddCRP), and to propose an incremental clustering system capable of maintaining a growing number of topic clusters as news articles arrive online from a crawler. LDA, LSA, and doc2vec were used to represent each document as a fixed-length numeric vector. Cluster assignments produced by a proof-of-concept implementation of the system were evaluated using several metrics, notably purity, F-measure, and V-measure. A modification of V-measure, called NV-measure, was introduced to penalize an excessive or insufficient number of clusters. The best results were achieved with doc2vec and ddCRP.
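
To make two of the evaluation metrics mentioned above concrete, here is a minimal Python sketch using only the standard library. It is not taken from the project code: the function names `purity` and `v_measure` are hypothetical, and the formulas follow the commonly used definitions (purity as the fraction of documents in the majority class of their cluster, V-measure as the weighted harmonic mean of homogeneity and completeness). NV-measure is left out because its definition is given only in the thesis text.

```python
from collections import Counter
from math import log


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    majority = sum(max(c.values()) for c in clusters.values())
    return majority / len(labels_true)


def _entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its group sizes."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def v_measure(labels_true, labels_pred, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))  # n_ck: class c in cluster k
    classes = Counter(labels_true)
    clusters = Counter(labels_pred)

    h_c = _entropy(classes.values(), n)
    h_k = _entropy(clusters.values(), n)
    # Conditional entropies H(C|K) and H(K|C)
    h_c_given_k = -sum(n_ck / n * log(n_ck / clusters[k])
                       for (c, k), n_ck in joint.items())
    h_k_given_c = -sum(n_ck / n * log(n_ck / classes[c])
                       for (c, k), n_ck in joint.items())

    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return ((1 + beta) * homogeneity * completeness
            / (beta * homogeneity + completeness))


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    truth = ["sport", "sport", "politics", "politics"]
    print(purity(truth, [0, 0, 1, 1]), v_measure(truth, [0, 0, 1, 1]))
    print(purity(truth, [0, 0, 0, 0]), v_measure(truth, [0, 0, 0, 0]))
```

Putting every document into a single cluster keeps completeness at its maximum but drives homogeneity, and with it V-measure, to zero, while purity drops to the share of the largest class.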

Due to copyright, the news articles used in the experiments are available only at the university library.

Full thesis text: thesis.pdf
Poster: Vana_Martin_2018.pdf

BibTeX citation:

@MASTERSTHESIS {martinvana2018,
    author  = "Martin Váňa",
    title   = "Incremental News Clustering",
    school  = "University of West Bohemia",
    year    = "2018",
    address = "Pilsen",
    month   = "may"
}

Installation

Requirements

  • Python 3.5
  • Pip
  • Pipenv

Ubuntu

$ sudo apt-get install python3 python3-tk python3-pip
$ pip3 install pipenv

Project dependencies

$ pipenv install --dev

If this fails, retry with: pipenv install --dev --skip-lock

Add to ~/.bashrc (so that Python also searches the current directory for modules):

export PYTHONPATH='.'

Development

Configure PyCharm

Activate project's virtualenv

$ pipenv shell

Run script

$ pipenv run python <script_name>.py

Run tests

$ pipenv run pytest tests