Wikipedia Reverts Research Project
- Data dependencies from wikipedia-map-reduce:
- Requires Wikipedia diffs generated by wmr.wmf.WmfDiffCreator (see shilad)
- Requires Wikipedia reverts generated by wmr.reverts.RevertGraphGenerator
- s3cmd get s3://macalester/wikipedia/reverts/part-r*
- cat these together into dat/reverts.txt
- Run data_prep/filter_reverts.py to create statistics and a filtered reverts file (see the sketch after this step).
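A minimal sketch of these three steps, assuming a configured s3cmd; the exact flags and the filter_reverts.py interface are assumptions, so check the script's usage before running:

    # Fetch the revert part files (some s3cmd versions need --recursive
    # rather than a remote wildcard).
    mkdir -p dat dat/reverts-parts
    s3cmd get --recursive s3://macalester/wikipedia/reverts/ dat/reverts-parts/

    # Concatenate the parts into a single reverts file.
    cat dat/reverts-parts/part-r* > dat/reverts.txt

    # Hypothetical invocation; read data_prep/filter_reverts.py for the
    # real arguments and output locations.
    python2.6 data_prep/filter_reverts.py dat/reverts.txt > dat/reverts.filtered.txt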
- Requires macademia joined data files (located in s3://macademia/nbrvz/joined, created by the tfidf package in wikipedia-map-reduce)
- Download the data files (see the command at the bottom of http://code.google.com/p/macademia/wiki/WikipediaMinner_Hadoop); a hedged sketch follows.
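A guess at the download, assuming the same s3cmd setup as above; the wiki page linked above has the canonical command:

    # Local destination path is assumed.
    mkdir -p dat/joined
    s3cmd get --recursive s3://macademia/nbrvz/joined/ dat/joined/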
- Translate the data files into svmlight format (a format example follows the command):
- seq 0 400 | xargs printf "%05d\n" | parallel -P 10 --progress --eta zcat ~/usr/macademia/grails2/dat/sims/part-r-{}.gz '|' python2.6 ./data_prep/make_kmeans_input.py '|' gzip '>'articles/translated/{}.txt.gz
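For reference, each svmlight-format line produced above is a label followed by sparse featureId:value pairs in ascending feature order (the label is typically ignored for clustering). The values below are made up:

    1 103:0.48 2051:0.12 30417:0.07
    1 7:0.91 103:0.05 5002:0.33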
- Check out and compile sofia-kmeans (sketch below).
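A sketch assuming the historical Google Code svn layout for sofia-ml, which ships sofia-kmeans; the directory name and flags are assumptions, so verify against the project docs (./sofia-kmeans --help):

    svn checkout http://sofia-ml.googlecode.com/svn/trunk/ sofia-ml
    # Directory name assumed; the main sofia-ml binary builds under src/.
    cd sofia-ml/cluster-src && make

    # Hypothetical run over a concatenated, uncompressed svmlight file:
    ./sofia-kmeans --k 1000 --opt_type mini_batch_kmeans \
        --training_file dat/articles.svmlight --model_out dat/cluster_centers.txt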