Organize docs
thatandromeda committed Feb 5, 2018
1 parent 21b9cac commit 96de50b
Showing 3 changed files with 110 additions and 86 deletions.
60 changes: 60 additions & 0 deletions docs/developer.md
@@ -0,0 +1,60 @@
# Documentation
This document is for people who are trying to stand up an instance of Hamlet on localhost in order to write code. It assumes you are generally familiar with setting up development environments (for instance, that you can install Python dependencies and stand up local Postgres).

## Tests
Run tests with `python manage.py test --settings=hamlet.settings.test`.

This ensures that they use the test neural net. The tests refer to objects by
primary key and assume that those objects are present in both the test net and
the fixtures.

You can generate additional fixtures with statements like `python manage.py dumpdata theses.Person --pks=63970,29903 > hamlet/theses/fixtures/authors.json`. Make sure to include the pks of all objects already in the fixtures, or write the output to a separate file and then merge it with the existing fixture. You can't just append one file to the other, because the resulting JSON would be invalid.
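A minimal sketch of the merge step, assuming both files are in the default `dumpdata` JSON format (a list of objects with `model`, `pk`, and `fields` keys). The function names are hypothetical and not part of Hamlet.

```python
import json
from pathlib import Path


def merge_fixtures(existing_path, new_path):
    """Merge two Django fixture files, keeping one entry per (model, pk)."""
    merged = {}
    for path in (existing_path, new_path):
        for obj in json.loads(Path(path).read_text()):
            # If the same object appears in both files, the later file wins.
            merged[(obj["model"], obj["pk"])] = obj
    return list(merged.values())


def write_fixture(objects, out_path):
    """Write the merged objects back out as a single valid JSON list."""
    Path(out_path).write_text(json.dumps(objects, indent=2))
```

For example, `write_fixture(merge_fixtures("authors.json", "new_authors.json"), "authors.json")` would fold a separately dumped file into the existing one; the second filename here is illustrative.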

## System configuration

### Development dependencies: pipenv
For the most part, dependencies are installed via pipenv. There's a `.env` file (kept out of version control) for use by `pipenv shell`. It specifies:
* `DJANGO_SETTINGS_MODULE`, set to one of:
  * `DJANGO_SETTINGS_MODULE='hamlet.settings.local'` (for using `heroku local`)
  * `DJANGO_SETTINGS_MODULE='hamlet.settings.base'` (for `python manage.py runserver`)
* `DJANGO_DB_PASSWORD='(your password)'`
* `DJANGO_DEBUG_IS_TRUE='True'` (if you want)
* `DSPACE_OAI_IDENTIFIER`
* `DSPACE_OAI_URI`

The latter two are only relevant if you plan to be downloading files or metadata from DSpace. They can be omitted or given dummy values otherwise.

### Additional non-pipenv dependencies
Some dependencies require extra help:
* tika requires Java
* nltk may require installing corpora through the python shell
* gensim wants a C compiler (it can run without one but will be 70x slower; a single neural net training run can take literally days in this case)
* python-magic needs libmagic (`brew install libmagic` on OSX).
* captcha says it needs `apt-get -y install libz-dev libjpeg-dev libfreetype6-dev python-dev` or similar. You can't yum install them on AWS, but the captcha works anyway, so maybe it's lying.

You only need the first three of these if you plan to be doing neural net training. If you're developing the Django parts you can skip them; just get a prebuilt neural net file (see below, "Neural net files").

### Other config
Postgres needs a database and user. The defaults are `hamlet` for both the database name and the username, with no password; override these in `.env` with `DJANGO_DB`, `DJANGO_DB_USER`, and `DJANGO_DB_PASSWORD` if desired.
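As an illustration of how these variables and defaults fit together, here is a sketch of a settings helper. It is not Hamlet's actual settings code: the function name is hypothetical, and the `HOST` and `PORT` values are assumptions for a local Postgres.

```python
import os


def database_settings(environ=os.environ):
    """Build a Django DATABASES dict from the environment variables above."""
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            # Defaults match the documented ones: hamlet / hamlet / no password.
            "NAME": environ.get("DJANGO_DB", "hamlet"),
            "USER": environ.get("DJANGO_DB_USER", "hamlet"),
            "PASSWORD": environ.get("DJANGO_DB_PASSWORD", ""),
            # Assumed values for a local Postgres; adjust as needed.
            "HOST": "localhost",
            "PORT": "5432",
        }
    }
```

Passing the environment in as an argument keeps the helper easy to check without touching the real process environment.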

## Static assets
If you need to edit styles, edit files in `hamlet/static/sass/apps/`. Don't edit CSS directly; any such changes will be blown away during asset precompilation.

### for `python manage.py runserver`
* run `python manage.py collectstatic`
* use `hamlet.settings.base`

### for `heroku local`
* run `python manage.py compress`
* then run `python manage.py collectstatic`
* use `hamlet.settings.local`
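The two local workflows above can be captured in a small helper. This is a sketch under stated assumptions: the names `STATIC_STEPS`, `static_commands`, and `prepare_static` are hypothetical, and the settings module is passed with Django's standard `--settings` option rather than through `.env`.

```python
import subprocess

# Which management commands to run, and with which settings, per workflow.
STATIC_STEPS = {
    "runserver": {
        "settings": "hamlet.settings.base",
        "commands": ["collectstatic"],
    },
    "heroku local": {
        "settings": "hamlet.settings.local",
        "commands": ["compress", "collectstatic"],
    },
}


def static_commands(workflow):
    """Return the argv lists to run, in order, for the given workflow."""
    steps = STATIC_STEPS[workflow]
    return [
        ["python", "manage.py", command, "--settings=" + steps["settings"]]
        for command in steps["commands"]
    ]


def prepare_static(workflow):
    """Run each command from the project root, stopping at the first failure."""
    for argv in static_commands(workflow):
        subprocess.run(argv, check=True)
```

Note that `collectstatic` may prompt for confirmation when run this way, just as it does on the command line.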

### for AWS
The static asset pipeline runs automatically; see `.ebextensions/02_python.config`.

## Neural net files
`hamlet.model` is a copy of `all_theses_no_split_w4_s52.model`, a model trained with a window size of 4 and a step of 52. It is kept out of version control because it is too big.

`hamlet/testmodels/` contains some smaller models that are not suitable for production but are usable for testing. They are small enough to be pushed to GitHub (although it will complain), so they are the ones used on Travis. You can configure your local settings to point at these files, and that will suffice for development.

These models don't represent the entire MIT thesis collection (that's what lets them be smaller), so don't be surprised if documents of interest are not present.
86 changes: 0 additions & 86 deletions docs/docs.md

This file was deleted.

50 changes: 50 additions & 0 deletions docs/sysadmin.md
@@ -0,0 +1,50 @@
# Sysadmin documentation
This document is for people who are trying to stand up a Hamlet deployment, or understand our existing AWS solution.

## Heroku

We tried to deploy on Heroku but the model file needs ~2GB of memory and that gets spendy. In theory the `hamlet.settings.heroku` file should be deployable with a large enough instance; the app has successfully deployed with small model files (which are too limited to support the app's features). You should be able to use `heroku local` with this file if that is a thing that makes you happy.

## AWS

### Deployment

Hamlet automatically builds to AWS (at https://mitlibraries-hamlet.mit.edu/) via Travis on updates to `master`.
* The build process: see `.ebextensions` files
* Config: `.elasticbeanstalk/config.yml` and `.travis.yml`

### Config

Environment variables defined in AWS for security reasons:
* All database variables (these are standard and can be put directly in your code)
* `SECRET_KEY` - will be created by TS3 or provided by developer securely

All other variables are defined in the config files in the `.ebextensions` folder and can be changed or added to as needed.

Because the neural net files are too large to live on GitHub, they need to be supplied separately:
* Go to aws.amazon.com
* Make sure you're on US East
* Search for S3
* Find the hamlet-models bucket
* Put model files there

You should be logged in via MIT Touchstone. If you're not in the relevant moira group, ask TS3. Re-uploading files will not trigger a server restart; you'll need to do that manually, or do something to update master and kick off a build.
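If you would rather script the upload than use the console, a sketch using boto3 follows. This is an alternative to the console steps above, not the documented process. It assumes that you have AWS API credentials with write access to the bucket, that "US East" means `us-east-1`, and that model files sit at the top level of the bucket; the function name is hypothetical.

```python
from pathlib import Path


def upload_model(model_path, client=None, bucket="hamlet-models"):
    """Upload one model file to the S3 bucket and return the key used."""
    if client is None:
        # Imported lazily so the helper can be exercised without boto3 or AWS.
        import boto3
        client = boto3.client("s3", region_name="us-east-1")  # assumed region
    # Assumption: the key is just the file name, at the top level of the bucket.
    key = Path(model_path).name
    client.upload_file(str(model_path), bucket, key)
    return key
```

As with a console upload, this does not restart the server.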

### Build process
See the scripts in `.ebextensions` for details.

A few rough edges to know about:

Files in the hamlet-models bucket are synced to the build server by the `.ebextensions` scripts. (These are necessary for the app to run but can't be provided via GitHub.) `hamlet.settings.aws` is configured to look for `MODEL_FILE` in the directory created by the build scripts.

AWS doesn't speak Pipfile yet, so we generate `requirements.txt` as part of the deploy process.

### Architecture

The model files live in a bucket on S3. They are expected to change infrequently, so we haven't automated this process; talk to Andy if you need to push changes. The model files are *not* synced through GitHub because they're too large. The S3 bucket is synced to a directory created on the hamlet instance through a deploy script; `hamlet.settings.aws` creates this directory and tells `MODEL_FILE` to look in it.

Static assets are deployed using whitenoise within the hamlet instance. The site is not big enough for us to have bothered with a real CDN.

Client connections run over https to the load balancer. Connections between the load balancer and the instance(s) are http, but on a private network only accessible by the load balancer and allowed instances. Config is in `.ebextensions/05_elb.config`.

Application logging doesn't actually work right now because the filesystem isn't persistent and we haven't thought through where AWS might want a logstream to go. #yolo
