Update 05/22/20: I wouldn't say I'm abandoning this project, but I am discontinuing work on it for the foreseeable future. I touch on the reasons for doing so - as well as what I learned from getting this far! - in a post here.
If you're looking for code to extract and apply to your own exploration of the WWC data, honestly most of the interesting stuff is in the `db/*` dir: especially the three sets of `wwc_*` ETL scripts and the (pretty gnarly, if I say so myself!) PostgreSQL of the `20200507003934_add_searchable_fields_to_studies.rb` migration.
Between the two of them, they should get you pretty close to having a normalized, SQL-friendly version of the WWC dataset. NB that there are several extant discrepancies in the original schema; I've yet to submit them to WWC for correction, but they can be located at `notes_and_docs/initial_data_problems.md` if you'd like to do so!
I used my previous Rails toy app to get more familiar with `foreman`, `ActionMailer`, Rails 5.1+ system testing, `webpack`, hand-rolled session-based auth(z/n), and basic full-text search in PostgreSQL.
I'm using this one to learn about JWT, and PostgreSQL's more in-depth full-text search options, before setting it up to feed JSON to (at least one) SPA... so I can learn exactly how much I dislike this decoupled approach to app development ^_^
(In addition, I like what WWC does quite a bit, but I find their current browser UI opaque and unergonomic to navigate.)
- DB Creation
  - Option 01: run `rails db:reset studies=db/WWC-export-archive-2020-Apr-25-142355/Studies.csv findings=db/WWC-export-archive-2020-Apr-25-142355/Findings.csv reports=db/WWC-export-archive-2020-Apr-25-142355/InterventionReports.csv`
    - For newer data, simply substitute the CSV filepaths: barring any newly introduced corruption in the data, the scrubbers/loaders should function identically.
    - This option is sloooooooooow: like, ~8-9 minutes slow. It's doing tons of sequential table scans and instantiating tons of ActiveRecord objects (neither of which is necessary, but removing them is an optimization I haven't yet had time for).
  - Option 02: run `rails db:create && rails db:migrate && psql -d wwc_api_development -f ../2020_04_25_data.sql`
    - This requires you to download a public Gist containing the data.
    - You're stuck with the data from April 25th, 2020 (unless you want to update and PR!) 😸
    - On the other hand, this method takes under a second.
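The ActiveRecord-instantiation cost mentioned under Option 01 could probably be avoided by loading plain hashes in batches. Here's a minimal sketch of that idea, not what the loaders in this repo currently do: `csv_batches` is a hypothetical helper, and the commented-out usage assumes Rails 6+ (for `insert_all`) and CSV headers that already match the table's column names.

```ruby
require "csv"

# Hypothetical helper (not part of this repo): parse CSV text and return its
# rows as plain hashes, grouped into batches of at most `batch_size` rows.
# Plain hashes sidestep building one ActiveRecord object per row.
def csv_batches(csv_text, batch_size: 1000)
  CSV.parse(csv_text, headers: true) # CSV::Table of CSV::Row objects
     .map(&:to_h)                    # each row becomes { "header" => "value" }
     .each_slice(batch_size)
     .to_a
end

# Assumed usage inside a Rails 6+ task, where ENV["studies"] holds the path
# passed as `studies=...` on the command line:
#
#   csv_batches(File.read(ENV["studies"])).each do |batch|
#     Study.insert_all(batch) # skips validations and callbacks
#   end
```

Note that this reads the whole file into memory, which is fine for a sketch but might want streaming for much larger exports.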
- JWT Testing
  - The only requirement to make use of the scripts in `notes_and_docs/wwc_api.postman_collection.json` is to first create an email:password record in the `users` table
    - The simplest way (i.e. no changes needed to those Postman queries) is to `rails c` in, then `User.create!(email: 'foobar@example.com', password: 'password')`
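For anyone who hasn't poked at JWT before, here's roughly what an HS256 token amounts to once a user's email and password have checked out. This is a stdlib-only sketch for illustration, not the code this app uses (a real app would more likely lean on a gem); `SECRET`, `encode_token`, and `decode_token` are all hypothetical names, and the sketch skips expiry (`exp`) and header checks entirely.

```ruby
require "openssl"
require "base64"
require "json"

# Hypothetical secret; a real app would read this from credentials or ENV.
SECRET = "change-me"

# JWTs use URL-safe Base64 without padding.
def b64url(data)
  Base64.urlsafe_encode64(data, padding: false)
end

def hmac_sha256(signing_input, secret)
  OpenSSL::HMAC.digest("SHA256", secret, signing_input)
end

# Build "<header>.<payload>.<signature>" for the given payload hash.
def encode_token(payload, secret = SECRET)
  header = { alg: "HS256", typ: "JWT" }
  signing_input = [header, payload].map { |part| b64url(JSON.generate(part)) }.join(".")
  "#{signing_input}.#{b64url(hmac_sha256(signing_input, secret))}"
end

# Return the payload hash if the signature verifies, otherwise nil.
def decode_token(token, secret = SECRET)
  header_b64, payload_b64, signature_b64 = token.split(".")
  return nil unless header_b64 && payload_b64 && signature_b64

  expected = b64url(hmac_sha256("#{header_b64}.#{payload_b64}", secret))
  return nil unless OpenSSL.secure_compare(expected, signature_b64)

  JSON.parse(Base64.urlsafe_decode64(payload_b64))
end

token = encode_token({ "email" => "foobar@example.com" })
decode_token(token) # the original payload, since the signature matches
```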
- Querying
  - There are currently two search endpoints:
    - `StudySearchesController#autocomplete`, uh, serializes and returns your query params. (As close to a noop placeholder as I could get!)
    - `StudySearchesController#create` performs a full-text search against any of the three tokenized `author`, `title`, and `publication` fields.
  - The eventual (and, see above, currently indefinitely-paused) goal is to debounce-hit `autocomplete` to gather a list of viable query-terms as the user types their entry.
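As a rough illustration of what a full-text search across those three fields involves, here's a sketch that turns free-form input into a `tsquery` string and pairs it with a SQL fragment. The `*_fts` column names come from the TODO notes in this README; `build_tsquery`, `SEARCH_SQL`, and the `'simple'` text-search configuration are assumptions for the example rather than what the controller actually does.

```ruby
# Hypothetical helper: reduce user input to alphanumeric terms and AND them
# together in tsquery syntax. With prefix: true each term gets ":*", which
# makes PostgreSQL match any lexeme starting with that term.
def build_tsquery(input, prefix: false)
  terms = input.downcase.scan(/[[:alnum:]]+/)
  terms.map { |term| prefix ? "#{term}:*" : term }.join(" & ")
end

# Assumed SQL fragment; :q is a named bind for the string built above.
SEARCH_SQL = <<~SQL
  author_fts @@ to_tsquery('simple', :q)
  OR title_fts @@ to_tsquery('simple', :q)
  OR publication_fts @@ to_tsquery('simple', :q)
SQL

# Assumed usage from a controller or model:
#
#   query = build_tsquery(params[:q])
#   Study.where(SEARCH_SQL, q: query) unless query.empty?
```

Stripping everything but alphanumerics keeps user-typed `&`, `|`, `!`, and parentheses from being read as tsquery operators, at the cost of dropping hyphens and apostrophes.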
- Finish studies-search page
  - Add logic for `prefilter` using sidebar/`request.body`-params
- Add studies autocomplete
  - add trigrams columns, per docs
  - use the same regexp you did to extract `author_fts`, `title_fts`, and `publication_fts`
  - add method on `Study` model (or elsewhere?) to select ten (20?) most-similar words from that column
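To make the "most-similar words" idea concrete, here's a pure-Ruby approximation of trigram similarity as PostgreSQL's `pg_trgm` extension describes it: each word is padded with two leading spaces and one trailing space, split into three-character chunks, and two strings are scored by shared trigrams over total distinct trigrams. In the app this would presumably happen in SQL; `trigrams`, `similarity`, and `most_similar` are hypothetical names for illustration only.

```ruby
require "set"

# Set of trigrams for every alphanumeric word in the text, using the
# two-spaces-before, one-space-after padding that pg_trgm documents.
def trigrams(text)
  grams = text.downcase.scan(/[[:alnum:]]+/).flat_map do |word|
    padded = "  #{word} "
    (0..padded.length - 3).map { |i| padded[i, 3] }
  end
  grams.to_set
end

# Shared trigrams divided by total distinct trigrams (0.0 to 1.0).
def similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  union_size = (ta | tb).size
  union_size.zero? ? 0.0 : (ta & tb).size.to_f / union_size
end

# The `limit` candidates most similar to `term`, best match first.
def most_similar(term, candidates, limit: 10)
  candidates.max_by(limit) { |candidate| similarity(term, candidate) }
end

most_similar("readin", %w[reading math reader], limit: 2)
# => ["reading", "reader"]
```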
- Add interventions-search page
  - Add scraper script for FTS `descriptions` field on `interventions` table
    - Use `Intervention_Page_URL`?
  - Extract `outcome_domain` to separate Model (...eventually)
  - How does `products` relate to `interventions` in the `reviews` table?
- Add [`Review`, `Finding`] search (by `Protocol`/`Protocol Version`...and `Standards`?)
- Add Histogram chart (with selector for what to plot on x/y axes? Or static RQ's, like...)
  - Which topics most commonly collocate with each other?
  - Which topics most commonly collocate across years?
  - Which fields return the most/strongest findings?
- Build `Controller` classes only as needed
- No CSS framework: use FEM notes/O'Reilly books (can possibly reuse across apps)
- One API, two SPA's
  - Vue app
    - New framework
    - Still have component classes/lifecycle events
  - React app
    - Familiar framework
    - Only use Hooks and Context API's for state-management
- Consider building a third, HTML-first, version: perhaps using this fetch() demo for faster reloads