rkiddy edited this page Sep 27, 2011 · 2 revisions

The California legislative information can be found at and includes data back to the 1993-1994 session. The information is available both from the web site, as HTML and PDF files, and from a database dump that was only recently made available.

The current scraper uses the information from the database, so before you can run the scraper, you must retrieve the database (see ). Importing the database is not trivial: the import script uses hard-coded path elements and assumes that it is running on Windows. Your mileage may vary. You must also have the "sqlalchemy" Python module installed.
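Since the scraper depends on "sqlalchemy", it can help to check for it before doing anything else. The following is a minimal sketch, not part of the scraper itself; the helper name `module_available` is made up for illustration.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "sqlalchemy" is the only dependency named on this page.
    if module_available("sqlalchemy"):
        print("sqlalchemy is installed")
    else:
        print("sqlalchemy is missing; install it before running the scraper")
```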

(RRK) I wonder whether using the database is the right approach. There is a lot of information in the database, but it can be fairly complicated to access, and I am not sure that it is better than getting the information out of the web pages, which may be easier to parse.

The database contains information in an XML format, but that may not be so helpful. For example, I can look at a bill on the web page and I see "AUTHOR(S) : Alejo (Coauthor: Allen)." It would be wonderful to see XML that says: "<authors><author>Alejo</author><author type="Coauthor">Allen</author></authors>". What you actually see is: "<authors>Alejo (Coauthor: Allen).</authors>". Isn't XML supposed to help one be clear about the structure of the data? Does this help?

Also, it is not as easy as it should be to import the database. Everyone can get to the web pages. If the scraper uses the database, that imposes an extra cost on anyone who wants to run the scraper themselves. If there were a clear benefit to doing this, it might make sense. But is there a benefit? Can someone import the database and make it available (read-only) on the net? The state of California cannot, Sunlight Labs cannot, and I do not have a colo I can use right now. Anyone?
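Whichever source is used, the author string quoted above has to be split by hand. Here is a rough sketch of how that might look in Python, producing the structured XML wished for above. It is based only on the single "Alejo (Coauthor: Allen)." example; the function names and the assumption that names are separated by commas or "and" are mine, and real data may well have other forms.

```python
import re
import xml.etree.ElementTree as ET


def split_names(text):
    """Split a list of names on commas and the word 'and' (an assumption)."""
    parts = re.split(r",|\band\b", text)
    return [p.strip() for p in parts if p.strip()]


def parse_authors(text):
    """Turn 'Alejo (Coauthor: Allen).' into a list of (name, role) pairs.

    Primary authors get a role of None; names inside '(Role: ...)' get
    that role.
    """
    text = text.strip().rstrip(".").strip()
    # Collect every "(Role: names)" group.
    groups = re.findall(r"\(([^:()]+):\s*([^()]*)\)", text)
    # Whatever is left outside the parentheses is the primary author list.
    primary = re.sub(r"\([^()]*\)", "", text)
    authors = [(name, None) for name in split_names(primary)]
    for role, names in groups:
        authors.extend((name, role.strip()) for name in split_names(names))
    return authors


def authors_to_xml(authors):
    """Build an <authors> element with one <author> child per name."""
    root = ET.Element("authors")
    for name, role in authors:
        el = ET.SubElement(root, "author")
        if role:
            el.set("type", role)
        el.text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(authors_to_xml(parse_authors("Alejo (Coauthor: Allen).")))
```

The same parsing would be needed for the web page text and for the database XML, so on this point the database does not seem to save any work.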

To show the kind of information that is available from the CA website, here is a list of the files for one bill (see ):

  • ab_501_bill_20110215_introduced.html
  • ab_501_bill_20110215_introduced.pdf
  • ab_501_bill_20110406_amended_asm_v98.html
  • ab_501_bill_20110406_amended_asm_v98.pdf
  • ab_501_bill_20110628_history.html
  • ab_501_bill_20110706_status.html
  • ab_501_cfa_20110329_131447_asm_comm.html
  • ab_501_cfa_20110412_152341_asm_comm.html
  • ab_501_cfa_20110528_173314_asm_floor.html
  • ab_501_cfa_20110624_162916_sen_comm.html
  • ab_501_vote_20110330_000001_asm_comm.html
  • ab_501_vote_20110527_000003_asm_comm.html
  • ab_501_vote_20110531_0349PM_asm_floor.html
  • ab_501_vote_20110627_000001_sen_comm.html

The "cfa" files are committee analyses, which also include lists of groups that have registered support for or opposition to a bill. There are also "vt" files for veto messages, and "chaptered" and "enrolled" files for the versions of bills that have passed and become law. There are quirks in the naming scheme for files, but there is quite a bit of information here.
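
The file names in the list above carry a fair amount of structure. As a rough sketch, here is one way to pull them apart in Python. The pattern is inferred only from the ab_501 names listed on this page, and the function name and dictionary keys are made up for illustration. Given the quirks mentioned above, expect some real file names not to match, in which case the function returns None.

```python
import re
from datetime import datetime

# measure type, bill number, kind of file, date, then free-form details.
FILENAME_RE = re.compile(
    r"^(?P<measure>[a-z]+)_(?P<number>\d+)_(?P<kind>[a-z]+)_"
    r"(?P<date>\d{8})_(?P<details>.+)\.(?P<ext>html|pdf)$"
)


def parse_bill_filename(filename):
    """Split a name like 'ab_501_bill_20110215_introduced.html' into parts.

    Returns a dict, or None if the name does not fit the assumed pattern.
    """
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return {
        "measure": match.group("measure"),    # e.g. "ab"
        "number": int(match.group("number")),  # e.g. 501
        "kind": match.group("kind"),          # "bill", "cfa", "vote", ...
        "date": datetime.strptime(match.group("date"), "%Y%m%d").date(),
        # Everything after the date is left uninterpreted, since it varies
        # by kind ("introduced", "amended_asm_v98", "131447_asm_comm", ...).
        "details": match.group("details").split("_"),
        "ext": match.group("ext"),
    }


if __name__ == "__main__":
    for name in ("ab_501_bill_20110215_introduced.html",
                 "ab_501_cfa_20110329_131447_asm_comm.html"):
        print(parse_bill_filename(name))
```

Leaving the trailing part as a plain list keeps the sketch honest about how little is known of the naming quirks; a real scraper would probably interpret it separately for each kind of file.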