wiki-talk-parser

This little program can:

  • Parse Wikipedia dump files (XML) into wiki-talk networks; original Wikipedia UIDs are preserved.
  • "Shrink" the resulting network into an unweighted directed network without self-loops, like the SNAP wiki-Talk dataset (see the excerpt after this list).
  • Group users according to their roles.
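
For reference, the SNAP wiki-Talk format that the shrinker targets is one directed edge per line, meaning user A edited user B's talk page; since the parser preserves original Wikipedia UIDs, the node IDs are real user IDs. A hypothetical excerpt (the IDs below are made up):

# FromUserID  ToUserID
562301  9023
562301  17442
9023    562301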

Usage with stu

Use stu to make your life easier. The only file you need is main.stu. Simply run stu, or:

$ nohup stu -k -j 3 &

Stu will automatically download this program and the datasets, then start parsing.
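
(As with make, -k tells stu to keep going after an error and -j 3 runs three jobs in parallel.) A stu rule names a target, its dependencies, and the shell command that builds it; a > prefix on the target redirects the command's standard output into that file. A minimal sketch in stu syntax, with hypothetical filenames rather than the actual contents of main.stu:

>wiki-talk-en.tsv: enwiki-dump.xml parser.jar {
    java -jar parser.jar enwiki-dump.xml en
}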

Usage without stu

Installation

Download the latest JAR files manually.

Parse

$ java -jar parser.jar <input-file> <lang> > <output-file>
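
For example, to parse a hypothetical English Wikipedia history dump (<lang> is the wiki's language code):

$ java -jar parser.jar enwiki-latest-pages-meta-history.xml en > wiki-talk-en.tsv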

Shrink

$ java -jar shrinker.jar <input-file> > <output-file>
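
To illustrate shrinking on hypothetical data, assuming a whitespace-separated edge list: repeated edges collapse to a single edge and self-loops are dropped, leaving a simple directed graph.

# input: parsed network with duplicate edges and a self-loop
12 34
12 34
34 12
56 56

# shrunk output: unweighted, loop-free
12 34
34 12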

Group users

$ java -jar grouper.jar <input-file> > <output-file>
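
The three stages chain naturally. A hypothetical end-to-end pipeline (filenames are placeholders, and feeding the grouper the shrunk network is an assumption; adjust to your workflow):

$ java -jar parser.jar enwiki-latest-pages-meta-history.xml en > wiki-talk-en.tsv
$ java -jar shrinker.jar wiki-talk-en.tsv > wiki-talk-en-shrunk.tsv
$ java -jar grouper.jar wiki-talk-en-shrunk.tsv > wiki-talk-en-groups.tsv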

Compilation

$ lein with-profile parser:shrinker:grouper uberjar
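
Running uberjar under the three colon-separated profiles builds each JAR in turn, which implies project.clj defines one Leiningen profile per artifact, each with its own entry point and output name. A minimal sketch of how such profiles are commonly declared (the namespaces and version string are assumptions, not the repo's actual configuration):

(defproject wiki-talk-parser "0.1.0"
  :profiles
  {:parser   {:main wiki-talk-parser.parser      ; hypothetical namespaces
              :aot :all
              :uberjar-name "parser.jar"}
   :shrinker {:main wiki-talk-parser.shrinker
              :aot :all
              :uberjar-name "shrinker.jar"}
   :grouper  {:main wiki-talk-parser.grouper
              :aot :all
              :uberjar-name "grouper.jar"}})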

License

Copyright © 2015 Yfiua

Distributed under the Eclipse Public License either version 1.0 or any later version.