A project to migrate Project Gutenberg to a version control system
Python
docs
index
CONTRIBUTING.md
GITenberg.py
LICENSE
README.rst
README_template.rst
TODO
catalog.pickle
catalog.rdf.bz2
filetypes.py
gen_reamde.rst
rdfparse.py
requirements.txt

README.rst

Project Gutenberg Stats

Estimated 1.6 million files; reported 650 GB total; ~40,000+ books

link to issues

How are we getting the files?

rsync -rvhz --progress --partial ftp...

Each repo should...

  • metadata.yml
      + author
      + title
      + publishing info
      + provenance
  • book_name.{rst|tei|txt}
      + book text in a master source format
  • license.txt
      + PG license information
      + transcriber, converter credits
  • README.rst
      + generic GITenberg info
      + generic PG info
      + book-specific info
      + description of and links to toolchains
      + description of and links to generated versions for ebook readers
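
The layout above could be scaffolded with something like the following. This is only a sketch: the function names (`render_metadata`, `scaffold_repo`) and the YAML key names are hypothetical and not part of GITenberg.py, and the YAML is written by hand to avoid assuming a YAML library is installed. Values are assumed to be single-line strings.

```python
from pathlib import Path


def render_metadata(author, title, publishing_info, provenance):
    """Render the four metadata fields as a flat YAML mapping (key names are assumed)."""
    fields = {
        "author": author,
        "title": title,
        "publishing_info": publishing_info,
        "provenance": provenance,
    }
    lines = []
    for key, value in fields.items():
        # Double-quote every value so colons or hashes in a title stay valid YAML.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key}: "{escaped}"')
    return "\n".join(lines) + "\n"


def scaffold_repo(root, book_name, metadata, master_ext="rst"):
    """Create the four expected files under root and return their names, sorted."""
    if master_ext not in ("rst", "tei", "txt"):
        raise ValueError(f"unsupported master format: {master_ext}")
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    files = {
        "metadata.yml": render_metadata(**metadata),
        f"{book_name}.{master_ext}": "",  # book text goes here
        "license.txt": "",                # PG license and credits go here
        "README.rst": "",                 # generated README goes here
    }
    for name, content in files.items():
        path = root / name
        if not path.exists():  # never overwrite existing work
            path.write_text(content, encoding="utf-8")
    return sorted(files)
```

Existing files are left untouched, so the function can be re-run on a partially populated repo.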

Smart comments:

Convert all files to UTF-8 https://groups.google.com/forum/?fromgroups#!topic/prj-alexandria/VhKbMyH9kcA
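
One way the conversion might look is sketched below. The candidate encodings and their order are an assumption (legacy text is guessed to be Windows-1252 or Latin-1 when it is not already valid UTF-8), and the function names are hypothetical. Because `latin-1` accepts any byte sequence, it acts as a last-resort fallback, so the result should be spot-checked rather than trusted blindly.

```python
from pathlib import Path

# Assumed candidate order: strictest first, catch-all last.
DEFAULT_CANDIDATES = ("utf-8", "cp1252", "latin-1")


def decode_best_effort(raw, candidates=DEFAULT_CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


def convert_file_to_utf8(path, candidates=DEFAULT_CANDIDATES):
    """Rewrite path as UTF-8 in place and return the encoding it was read as."""
    path = Path(path)
    text, source_encoding = decode_best_effort(path.read_bytes(), candidates)
    if source_encoding != "utf-8":
        # Only touch files that actually need converting.
        path.write_bytes(text.encode("utf-8"))
    return source_encoding
```

Recording the returned source encoding per file makes it easier to audit conversions that fell through to the fallback.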

File formats:

A list of file formats and their frequency is in the docs folder, generated via:

find -type f|rev|cut -d. -f1|grep -v "/" |rev|sort -f|uniq -c|sort -nr
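
For anyone who prefers to run the count from Python, the pipeline above can be approximated as follows. This is a sketch, not the script used to produce the list in docs: `extension_counts` is a hypothetical name, and like the shell version it counts extensions case-sensitively and skips files whose names contain no dot.

```python
import os
from collections import Counter


def extension_counts(root):
    """Return (extension, count) pairs for files under root, most frequent first."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.islink(os.path.join(dirpath, name)):
                continue  # find -type f does not report symlinks
            if "." in name:
                # Text after the last dot, matching rev | cut -d. -f1 | rev.
                counts[name.rsplit(".", 1)[1]] += 1
    return counts.most_common()
```

Calling `extension_counts(".")` from the top of the mirrored tree gives a list that can be printed or written to the docs folder.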

.tei

a master format

http://www.tei-c.org/Tools/Stylesheets/

http://code.google.com/p/hrit/source/browse/rst2xml-tei.py?repo=tei-rest

.rst

a master format

Research toolchain for rst >> whatever

dp rst manual http://pgrst.pglaf.org/publish/181/181-h.html

Future
