performance improvements for importing

1 parent e11d7b1 commit 3539928c9d8dce2f16deeb422890b8dd99b67967 @yourcelf committed Sep 30, 2010
Showing with 108 additions and 23 deletions.
  1. +82 −13 README.rst
  2. +11 −5 afg/management/commands/
  3. +2 −1 afg/
  4. +2 −0
  5. +1 −0 media/js/script.js
  6. +5 −1 requirements.txt
  7. +5 −3
95 README.rst
@@ -11,36 +11,105 @@ Installation
1. Dependencies
+The latest version of afgexplorer uses `Solr <>`_
+as its search backend. Previous versions of afgexplorer only used the
+database. The latest version should work with any Django-compatible database
+(the previous version depended on postgresql); however, the management command
+to import data assumes postgresql for efficiency's sake.
+python and Django
It is recommended that you install using pip and virtualenv. To install
pip install -r requirements.txt -E /path/to/your/virtualenv
-Style sheets are compiled using `Compass <>`_.
+If you use postgresql (recommended), you will need to install
+``egenix-mx-base``, which `cannot be installed using pip
+To install it, first activate your virtualenv, and then:
+ easy_install -i egenix-mx-base
+`Install Solr <>`_. For the purposes
+of testing and development, the `example server
+<>`_ should be
+adequate, though you will need to add the schema.xml file as described
+Style sheets are compiled using `Compass <>`_. If you
+wish to modify the style sheets, you will need to install that as well. After
+compass is installed, stylesheets can be compiled as you modify the ``.sass``
+files as follows:
+ cd media/css/sass/
+ compass watch
-2. Database settings
+2. Settings
Copy the file `` to ``, and add your database
-settings. PostgreSQL is required due to raw SQL queries for full-text search.
-Other databases might work with modification, but would likely not be as fast.
3. Data
-Next, you will need data. This project contains only the code to run the site,
-and not the leaked documents. The documents themselves must be separately
-obtained at:
+For convenience, the script file ```` runs all of the
+commands described below. Modify the script file to reflect the username and
+password for your database and the location for the data file. The import
+management commands assume that you are using postgresql, and will need
+modification for other databases.
+The whole import process takes around 20 minutes on a reasonably modern
+Importing data
+This project contains only the code to run the site, and not the documents
+themselves. The documents themselves must be separately obtained at:,_2004-2010
-To import the documents, download the CSV format file, and run the following
-management command:
+To import the documents, download the CSV format file. Then, start the process
+as follows:
+ python import_wikileaks path/to/file.csv "2010 July 25"
+The first argument is the path to the data file, and the second argument is the
+release label for that file (used as an additional facet to allow viewers to
+search within particular document releases). If there are multiple document
+releases to import at once, add additional filename and label pairs as
+subsequent arguments.
+The script will first collate the entries and extract phrases that are in
+common between the documents. Then, it will construct a new csv file which
+contains the cleaned database fields for efficient bulk importing with
+postgres. Following this collation, you will need to enter the database
+password to execute the bulk import.
+Indexing with Solr
+To generate the Solr schema, run the following management command:
+ python build_solr_schema > schema.xml
- python import_wikileaks path/to/file.csv
+Copy or link this file to the Solr conf directory (if you're using the example
+Solr server, this will be ``apache-solr-1.4.1/example/solr/conf``), replacing
+any ``schema.xml`` file that is already there, and then restart Solr. After
+restarting Solr, the following management command will rebuild the index:
-The import process will take some time.
+ python rebuild_index
-Granted to the public domain.
+Granted to the public domain. If you need other licensing, please file an
16 afg/management/commands/
@@ -2,7 +2,6 @@
import os
import csv
import json
-import datetime
import tempfile
import itertools
import subprocess
@@ -22,9 +21,10 @@ def clean_summary(text):
class Command(BaseCommand):
args = '<csv_file> <release_name>'
- help = """Import the wikileaks Afghan War Diary CSV file."""
+ help = """Import the wikileaks Afghan War Diary CSV file(s)."""
def handle(self, *args, **kwargs):
+ print args
if len(args) < 2:
print """Requires two arguments: the path to the wikileaks Afghan War Diary CSV file, and a string identifying this release (e.g. "2010 July 25"). The CSV file can be downloaded here:
@@ -76,19 +76,25 @@ def handle(self, *args, **kwargs):
key_list = list(report_keys)
for report_key in report_keys:
phrase_links[report_key][phrase] = key_list
+ phrases = None
print "Writing CSV"
# Write to CSV and bulk import.
fields = rows[0].keys()
temp = tempfile.NamedTemporaryFile(delete=False)
- name =
writer = csv.writer(temp)
+ name =
n = len(rows)
- for c, row in enumerate(rows):
- print "CSV", c, n
+ c = 0
+ # Pop rows to preserve memory (adding the json in phrase_links grows
+ # too fast).
+ while len(rows) > 0:
+ row = rows.pop(0)
+ print "CSV", c, len(rows), n
row['phrase_links'] = json.dumps(phrase_links[row['report_key']])
writer.writerow([row[f] for f in fields])
+ c += 1
print "Loading into postgres"
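Note on the row-draining loop in the hunk above: ``rows.pop(0)`` releases each row once it has been written, but removing the head of a Python list shifts every remaining element, so each pop costs time proportional to the list length. A minimal sketch of the same idea using ``collections.deque`` follows. It is written for Python 3 (the commit itself is Python 2), and the function name ``drain_rows_to_csv`` and its signature are hypothetical, not part of the commit.

```python
import csv
import io
import json
from collections import deque


def drain_rows_to_csv(rows, phrase_links, fields, out):
    """Write each row to `out` as CSV, releasing rows as they are written.

    Hypothetical helper, not from the commit. `rows` is a list of dicts
    and is emptied as a side effect. Returns the number of rows written.
    """
    queue = deque(rows)
    del rows[:]  # drop the list's own references so written rows can be freed
    writer = csv.writer(out)
    count = 0
    while queue:
        row = queue.popleft()  # constant time, unlike list.pop(0)
        # Serialize this row's phrase links only when the row is written.
        row["phrase_links"] = json.dumps(phrase_links[row["report_key"]])
        writer.writerow([row[f] for f in fields])
        count += 1
    return count
```

Whether this matters in practice depends on the number of rows; for a few thousand rows the difference is unlikely to be noticeable.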
3 afg/
@@ -166,7 +166,8 @@ def search(request, about=False, api=False):
# at the first two parts.
field_name, lookup = (key + "__exact").rsplit(r'__')[0:2]
# "type" is a reserved name for Solr, so munge it to "type_"
- field_name = "type_" if field_name == "type" else field_name
+ if field_name == "type":
+ field_name = "type_"
field = DiaryEntryIndex.fields.get(field_name, None)
if field and field.faceted:
# Dates are handled specially below
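Note on the lookup parsing in the hunk above: appending ``"__exact"`` before splitting guarantees at least two parts, so a bare field name falls back to an ``exact`` lookup. A minimal standalone sketch of that behaviour follows, in Python 3; the function name ``parse_filter_key`` is hypothetical and the surrounding view code is not reproduced.

```python
def parse_filter_key(key):
    """Split a query key such as 'date__gte' into (field_name, lookup).

    Hypothetical helper illustrating the expression used in the view.
    """
    # Appending "__exact" supplies a default lookup when none is given;
    # only the first two parts are kept, so extra segments are ignored.
    field_name, lookup = (key + "__exact").split("__")[0:2]
    # "type" is a reserved name for Solr, so it is munged to "type_".
    if field_name == "type":
        field_name = "type_"
    return field_name, lookup
```

For example, ``parse_filter_key("type")`` gives ``("type_", "exact")`` and ``parse_filter_key("date__gte")`` gives ``("date", "gte")``. Without a ``maxsplit`` argument, ``rsplit`` and ``split`` return the same list, so the sketch uses ``split``.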
@@ -12,3 +12,5 @@
'PORT': '', # Set to empty string for default. Not used with sqlite3.
+SECRET_KEY = "Change this to an arbitrary random long character string"
1 media/js/script.js
@@ -518,6 +518,7 @@ var ACRONYMS = [
[/\b(UNAMA)\b/g, "United Nations Assistance Mission in Afghanistan"],
[/\b(UNK)\b/g, "Unknown"],
[/\b(usfor-a|USFOR-A)\b/g, "United States Forces Afghanistan"],
+ [/\b(USSF)\b/g, "United States Special Forces [Green Berets]"],
[/\b(UXO)\b/g, "unexploded ordnance [or unfired]"],
[/\b(VBIED)\b/g, "Vehicle-Borne Improvised Explosive Device"],
[/\b(VCP)\b/g, "Vehicle Check Point"],
6 requirements.txt
@@ -1,6 +1,10 @@
-e git://
+## pip can't install egenix-mx-base, but psycopg requires it. With virtualenv
+## activated, install the following:
+# easy_install -i egenix-mx-base
@@ -1,13 +1,15 @@
+# List pairs of (file, release label) here, e.g.:
+# FILES='file1 "label one" file2 "label two"'
+FILES='data/afg.csv "2010 July 25"'
sudo su postgres -c "dropdb $DBNAME"
sudo su postgres -c "createdb -O $DBUSER $DBNAME"
python syncdb --noinput
-python import_wikileaks data/afg.csv "2010 July 25"
+echo "$FILES" | xargs python import_wikileaks
python build_solr_schema > schema.xml
-echo "Please reset Solr now to reflect the new schema, then press [Enter]"
+echo "Please copy or symlink 'schema.xml' into your Solr conf directory, then reset Solr to reflect the new schema, then press [Enter]"
read foo
python rebuild_index
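Note on the ``FILES`` variable in the script above: piping it through ``xargs`` lets the quoted label survive as a single argument, so the command receives alternating file and label arguments. A minimal sketch of how such a string breaks into pairs follows, in Python 3. ``shlex.split`` is used here as an approximation of the quoting that ``xargs`` applies (the two are not identical in every edge case), and the function name ``parse_file_label_pairs`` is hypothetical; how the management command itself pairs its arguments is not shown in this diff.

```python
import shlex


def parse_file_label_pairs(files):
    """Split a FILES-style string into (path, release label) pairs.

    Hypothetical helper; quoting rules approximate those of xargs.
    """
    tokens = shlex.split(files)
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating file and label arguments")
    # Even positions are paths, odd positions are their release labels.
    return list(zip(tokens[0::2], tokens[1::2]))
```

For the default value in the script, ``parse_file_label_pairs('data/afg.csv "2010 July 25"')`` yields one pair: the path ``data/afg.csv`` and the label ``2010 July 25``.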
