
WikiSpa

This is a thin wrapper around the dbpedia-extraction framework, mainly to make sure each execution is independent. The project focuses on running Wikipedia queries either locally or on a Spark cluster.

An example that prints every Wikipedia page ID and its categories, separated by a tab, is shown below.

object CategoryPerPage extends ElectricJob[WikiFileAndSerialization] with WikiAccess with FileAccess {

  override def execute(argument: WikiFileAndSerialization)(implicit ec: ElectricContext) = {
    val categoriesPerPage = wikiPages(argument.wikiFile, argument.serializationType)
      // Extract (pageId, categories); pages that yield nothing fall back to (0L, Nil).
      .map(page => Categories.extractByPage(page).getOrElse((0L, List.empty[String])))
      // Drop the fallback entries and any page without categories.
      .filter { case (pageId, categories) => pageId != 0L && categories.nonEmpty }
      // One line per page: the page ID, a TAB, then the categories joined with \u0001.
      .map { case (pageId, categories) => pageId + "\t" + categories.mkString("\u0001") }

    writeFile(categoriesPerPage, argument.output)
  }
}

Example output is shown below.

290     ISO basic Latin letters,Vowel letters
334     Time scales
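
Downstream consumers can split these lines back apart. Below is a minimal sketch, assuming only the pageId-TAB-categories layout produced by the job above; the \u0001 join character is a non-printing control code, which the sample output above appears to render as a comma, and the input line here is hypothetical.

object ParseCategoryLine {
  def main(args: Array[String]): Unit = {
    // Hypothetical input line, following the layout produced by CategoryPerPage.
    val line = "290\tISO basic Latin letters\u0001Vowel letters"
    val Array(pageId, joined) = line.split("\t", 2)
    val categories = joined.split("\u0001").toList
    println(s"$pageId -> $categories") // 290 -> List(ISO basic Latin letters, Vowel letters)
  }
}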

The code runs on a 16 GB OS X laptop over the latest Wikipedia dump (enwiki-20151002-pages-articles-multistream.xml) in less than three hours. For the rich and the impatient, the same code can be deployed and executed on a Hadoop cluster.
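
How a job reaches the cluster goes through the ElectricJob wrapper, which is not documented here. As a generic illustration only (not this project's actual entry point), switching plain Spark code between a laptop and a cluster usually comes down to the master setting:

import org.apache.spark.{SparkConf, SparkContext}

// Generic Spark setup sketch; the ElectricJob wrapper presumably does the
// equivalent internally. Nothing here is specific to WikiSpa.
object SparkSetupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CategoryPerPage")
      // "local[*]" uses every core of one machine; on a Hadoop cluster the
      // master is normally injected by spark-submit (e.g. --master yarn).
      .setMaster(sys.props.getOrElse("spark.master", "local[*]"))
    val sc = new SparkContext(conf)
    try {
      // ... build and run the RDD pipeline here ...
    } finally sc.stop()
  }
}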

Repository available at OSS releases
