
Merged the downloading into the python script, documented the --force flag
1 parent 69e32dc, commit 649f17a7073a1ac353b231cb7ff62ada0a0e1dee. konklone committed Apr 8, 2013
Showing with 30 additions and 19 deletions.
  1. +4 −9 README.md
  2. +0 −7 download/xhtml.sh
  3. +26 −3 tasks/structure.py
README.md
@@ -14,26 +14,21 @@ Create a virtual environment:
### Getting the structure of the Code
-Get the hierarchy of the US Code (not its content), in JSON.
-
-First, download the XHTML to disk:
-
-```bash
-./download/xhtml.sh uscprelim
-```
-
-Then, run the script to output a JSON version of the US Code's structure to STDOUT:
+To output the hierarchy of the US Code to STDOUT, in JSON:
```bash
./run structure
```
+The script will first download the XHTML version of the US Code to disk, unless a copy for that year is already cached.
+
Options:
* `--year`: "uscprelim" (the default), or a specific year version of the Code (e.g. "2011")
* `--title`: Do only a specific title (e.g. "5", "5a", "25")
* `--sections`: Return a flat hierarchy of only titles and sections (no intervening layers)
* `--debug`: Output debug messages only, and no JSON output (dry run)
+* `--force`: Force a re-download of the US Code. Use this flag if you're automatically running the script at an interval.
Example:
download/xhtml.sh
@@ -1,7 +0,0 @@
-# Download all XHTML uscode files for a year.
-# Run it from the uscode folder. Specify a year on
-# the command line eg: download/xhtml.sh 2011
-DIR=`pwd`"/data"
-DEST="$DIR/uscode.house.gov/xhtml/$1/"
-mkdir -p $DIR
-wget -m -l1 -P $DIR http://uscode.house.gov/xhtml/$1
tasks/structure.py
@@ -1,4 +1,4 @@
-# Uses the XHTML files to extract a table of contents for the US Code.
+# Downloads and uses the XHTML version of the US Code to extract a table of contents.
# Defaults to USCprelim.
#
# Outputs JSON to STDOUT. Run and save with:
@@ -9,8 +9,11 @@
# title: Do only a specific title (e.g. "5", "5a", "25")
# sections: Return a flat hierarchy of only titles and sections (no intervening layers)
# debug: Output debug messages only, and no JSON output (dry run)
+# force: Force a re-download of the US Code for the given year (script defaults to caching if the directory for a year is present)
-import glob, re, lxml.html, json, sys
+import glob, re, lxml.html, json, sys, os
+
+import utils
import HTMLParser
pars = HTMLParser.HTMLParser()
@@ -36,6 +39,10 @@ def run(options):
else:
title = "*usc*"
+ # sync XHTML to disk as needed (cache by default)
+ download_usc(year, options)
+
+
filenames = glob.glob("data/uscode.house.gov/xhtml/" + year + "/%s.htm" % title)
filenames.sort()
@@ -210,4 +217,20 @@ def citation_for(title, number):
t = title[1:]
else:
t = title
- return "usc/%s/%s" % (t, number)
+ return "usc/%s/%s" % (t, number)
+
+
+
+def download_usc(year, options):
+  debug = options.get("debug", False)
+
+  dest_dir = "data/uscode.house.gov/xhtml/%s" % year
+
+  if os.path.isdir(dest_dir) and not options.get("force", False):
+    if debug: print "Cached, not downloading again"
+    return # assume it's downloaded
+
+  if debug: print "Downloading US Code XHTML for %s" % year
+  utils.mkdir_p(dest_dir)
+  os.system("rm %s/*" % dest_dir)
+  os.system("wget -q -m -l1 -P %s http://uscode.house.gov/xhtml/%s" % ("data", year))
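
For readers who want to experiment with the cache-or-download logic in `download_usc`, here is a minimal sketch of the same idea. It is not the code in this commit: it is written for Python 3, passes `wget` an argument list through `subprocess.call` instead of interpolating a shell string, and adds two hypothetical parameters (`data_dir` and `runner`) so the behavior can be checked without touching the network. The helper names `needs_download` and `wget_command` are also assumptions made for illustration.

```python
import os
import subprocess


def needs_download(dest_dir, force=False):
    """Return True when the XHTML should be fetched: forced, or no cached directory."""
    return force or not os.path.isdir(dest_dir)


def wget_command(year, data_dir="data"):
    """Build the wget invocation as an argument list, so no shell is involved."""
    url = "http://uscode.house.gov/xhtml/%s" % year
    return ["wget", "-q", "-m", "-l1", "-P", data_dir, url]


def download_usc(year, options, data_dir="data", runner=subprocess.call):
    """Mirror the XHTML for a year unless a cached directory exists.

    Returns True if a download was attempted, False if the cache was used.
    `data_dir` and `runner` are hypothetical hooks added for this sketch.
    """
    dest_dir = os.path.join(data_dir, "uscode.house.gov", "xhtml", year)

    if not needs_download(dest_dir, options.get("force", False)):
        return False  # assume it's downloaded

    os.makedirs(dest_dir, exist_ok=True)

    # Clear stale files first, the equivalent of `rm dest_dir/*`.
    for name in os.listdir(dest_dir):
        path = os.path.join(dest_dir, name)
        if os.path.isfile(path):
            os.remove(path)

    runner(wget_command(year, data_dir))
    return True
```

Calling `download_usc("2011", {"force": True})` would re-mirror the 2011 files even when the directory already exists, which matches the documented purpose of `--force` for scheduled runs. As in the commit, the cache check only looks at whether the directory exists, so a partially completed download would still count as cached.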
