shilad/wp-reverts

wp-reverts

Wikipedia Reverts Research Project

Preliminary data analysis:

  • Data dependencies from wikipedia-map-reduce
  • Requires Wikipedia diffs generated by wmr.wmf.WmfDiffCreator (see shilad)

Data transformation

  • Wikipedia reverts generated by wmr.reverts.RevertGraphGenerator
  • s3cmd get s3://macalester/wikipedia/reverts/part-r*
  • Concatenate the downloaded part files into dat/reverts.txt
  • Run data_prep/filter_reverts.py to produce summary statistics and a filtered reverts file.
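
The concatenation step above can be sketched in Python as an alternative to a shell `cat`. This is a minimal sketch, not part of the repository: the helper name `concat_parts` is hypothetical, and it assumes the `part-r*` files have already been downloaded into the current directory and that sorting their names gives the intended order.

```python
import glob
import shutil


def concat_parts(pattern, out_path):
    """Concatenate every file matching `pattern` into `out_path`.

    Files are appended in sorted filename order. The output path must not
    itself match `pattern`. Returns the list of input files used.
    """
    parts = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the copy so large part files are never fully in memory.
                shutil.copyfileobj(src, out)
    return parts


# Assumed usage, mirroring the paths in the list above (the dat/ directory
# must already exist):
#   concat_parts("part-r*", "dat/reverts.txt")
```

Copying in binary mode leaves the bytes of each part untouched, so the result should match what `cat part-r* > dat/reverts.txt` produces in a shell whose glob expansion sorts the names the same way.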

Article clustering

  • Requires the macademia joined data files (located in s3://macademia/nbrvz/joined, created by the tfidf package in wikipedia-map-reduce)
  • Download the data files (see the command at the bottom of http://code.google.com/p/macademia/wiki/WikipediaMinner_Hadoop)
  • Translate data files into svmlight format:
  • seq 0 400 | xargs printf "%05d\n" | parallel -P 10 --progress --eta zcat ~/usr/macademia/grails2/dat/sims/part-r-{}.gz '|' python2.6 ./data_prep/make_kmeans_input.py '|' gzip '>'articles/translated/{}.txt.gz
  • Check out and compile sofia-kmeans
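
For orientation, the sketch below illustrates two details of the translation step: the zero-padded shard names that `seq 0 400 | xargs printf "%05d\n"` generates, and the general shape of an svmlight line (`label index:value ...` with feature indices in ascending order). It is an assumption-laden illustration, not the logic of data_prep/make_kmeans_input.py: the helper names `shard_names` and `to_svmlight` are hypothetical, and the input is assumed to be a label plus a dict of integer feature ids to weights.

```python
def shard_names(first, last):
    """Return zero-padded five-digit shard ids for first..last, inclusive."""
    return ["%05d" % i for i in range(first, last + 1)]


def to_svmlight(label, features):
    """Format one example as an svmlight line.

    `features` maps integer feature ids to numeric weights (an assumed
    input shape). Pairs are written in ascending id order; "%g" keeps
    about six significant digits of each weight.
    """
    pairs = " ".join("%d:%g" % (idx, val) for idx, val in sorted(features.items()))
    return ("%s %s" % (label, pairs)).rstrip()


# Assumed usage:
#   shard_names(0, 400)               # "00000" through "00400", 401 names
#   to_svmlight(1, {3: 0.5, 1: 2.0})  # "1 1:2 3:0.5"
```

Sorting the feature ids matters because svmlight readers generally expect indices in increasing order within each line.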
