Incremental News Clustering


The goal of the thesis was to research model-based clustering methods, notably the Distance Dependent Chinese Restaurant Process (ddCRP), and to propose an incremental clustering system capable of maintaining a growing number of topic clusters as news articles arrive online from a crawler. Each document was represented as a fixed-length numeric vector using LDA, LSA, or doc2vec. Cluster assignments produced by a proof-of-concept implementation of such a system were evaluated with several metrics, notably purity, F-measure, and V-measure. A modification of V-measure, the NV-measure, was introduced to penalize an excessive or insufficient number of clusters. The best results were achieved with doc2vec and ddCRP.
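To make two of the evaluation metrics named above concrete, here is a minimal sketch of purity and V-measure in plain Python. It is not taken from this repository: the function names (`purity`, `v_measure`) and the toy labels are assumptions made for illustration, and V-measure is computed in its standard form as the harmonic mean of homogeneity and completeness. The NV-measure is not sketched here because its definition is given only in the thesis text.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Share of documents that belong to the majority class of their cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def _entropy(labels):
    """Shannon entropy H(labels), natural logarithm."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given)."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items()
    )


def v_measure(labels_true, labels_pred):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_classes = _entropy(labels_true)
    h_clusters = _entropy(labels_pred)
    # By convention a score is 1.0 when the corresponding entropy is zero.
    homogeneity = 1.0
    if h_classes != 0:
        homogeneity = 1.0 - _conditional_entropy(labels_true, labels_pred) / h_classes
    completeness = 1.0
    if h_clusters != 0:
        completeness = 1.0 - _conditional_entropy(labels_pred, labels_true) / h_clusters
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical toy data: four articles, two topics, three candidate clusterings.
topics = ["sport", "sport", "politics", "politics"]
perfect = [0, 0, 1, 1]
single_cluster = [0, 0, 0, 0]
mixed = [0, 0, 0, 1]
```

Putting every article into one cluster still gives a purity of 0.5 on this toy data, while its V-measure drops to 0 because homogeneity is 0. Purity alone does not account for the number of clusters, which is one reason to report it alongside entropy-based metrics.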

Due to copyright, news articles used for experiments are only available at the university library.

Full thesis text: thesis.pdf
Poster: Vana_Martin_2018.pdf

BibTeX citation:

@MASTERSTHESIS {martinvana2018,
    author  = "Martin Váňa",
    title   = "Incremental News Clustering",
    school  = "University of West Bohemia",
    year    = "2018",
    address = "Pilsen",
    month   = "may"
}

Installation

Requirements

  • Python 3.5
  • Pip
  • Pipenv

Ubuntu

$ sudo apt-get install python3 python3-tk python3-pip
$ pip3 install pipenv

Project dependencies

$ pipenv install --dev

If this fails, try skipping the lock file:

$ pipenv install --dev --skip-lock

~/.bashrc

export PYTHONPATH='.'

Development

Configure PyCharm

Activate project's virtualenv

$ pipenv shell

Run script

$ pipenv run python <script_name>.py

Run tests

$ pipenv run pytest tests
