Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date
api
cwod_site
grammars
parser
scraper
solr
tests
.gitignore
LICENSE
__init__.py
capitolwords.py
daily_then_weekly_update.sh
daily_update.sh
monthly_update.sh
parse_and_ingest.py
prod-contrab
readme
settings.example.py

readme

useful info goes here

Requirements
* json or simplejson
* beautifulsoup version 3.0 series (it MUST be 3.0 series, not 3.1)
  http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.0.8.1.tar.gz
* solr
* sunlightlabs API key
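
The "json or simplejson" requirement is commonly met with an import fallback. The following is a minimal sketch, not necessarily how the project's own modules do it. The helper is_supported_bs_version is hypothetical and only encodes the 3.0-series rule stated above.

```python
# Prefer the standard-library json module; fall back to simplejson on
# older Pythons that do not ship it.
try:
    import json
except ImportError:
    import simplejson as json


def is_supported_bs_version(version):
    """Return True for BeautifulSoup 3.0.x version strings (hypothetical helper)."""
    parts = version.split(".")
    return parts[:2] == ["3", "0"]
```

If the installed BeautifulSoup module exposes a __version__ string, that is the value one would pass to the helper.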

Setup
* cp settings.example.py settings.py
* create symlinks to settings.py from each of solr/, scraper/ and parser/
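
The symlink step above can be sketched in Python as follows. This is an illustration only: the function name link_settings and its arguments are assumptions, not part of the repository, and it expects settings.py to already exist in the checkout root.

```python
import os


def link_settings(root, subdirs=("solr", "scraper", "parser")):
    """Symlink root/settings.py into each subdirectory (illustrative helper)."""
    # Use an absolute target so the links resolve from inside each subdirectory.
    target = os.path.join(os.path.abspath(root), "settings.py")
    created = []
    for name in subdirs:
        link = os.path.join(root, name, "settings.py")
        if os.path.lexists(link):
            continue  # leave any existing file or link alone
        os.symlink(target, link)
        created.append(link)
    return created
```

Calling link_settings(".") from the checkout root would create the three links and return their paths. Running it a second time is a no-op.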

* tell solr where to find the schema file. e.g., if running the dev
  environment in apache-solr-1.4.1/example/, it will use the schema.xml in
  the directory /apache-solr-1.4.1/example/solr/conf. the same is true for
  the stopwords file. so set up symlinks to the real files, optionally
  backing up the originals as schema.example.xml and stopwords.example.txt:

cd apache-solr-1.4.1/example/solr/conf
mv schema.{,example.}xml
mv stopwords.{,example.}txt
ln -s /home/cwod/capitolwords/src/solr/schema.xml schema.xml
ln -s /home/cwod/capitolwords/src/solr/stopwords.txt stopwords.txt

Startup
* start up solr. in a dev environment this looks like:
  cd $SOLR_DIR/example
  java -jar start.jar   # uses jetty
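
Once jetty is running, a quick check that solr is answering can be sketched as below. It assumes the stock example configuration, which is not stated in this readme: port 8983 and the /solr/admin/ping handler. Adjust the URL if your setup differs. The function name solr_is_up is hypothetical, and the code is written for Python 3.

```python
import http.client
import urllib.request


def solr_is_up(ping_url="http://localhost:8983/solr/admin/ping", timeout=5):
    """Return True if the (assumed) Solr ping URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False
```

A False result only means the ping URL did not return 200. It does not say whether the schema or stopwords symlinks are correct.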