
GPL-579 Expose LH plate map metadata to PAM #118

Closed
2 tasks
rl15 opened this issue Jul 17, 2020 · 7 comments · Fixed by #121

@rl15

rl15 commented Jul 17, 2020

User story
GPL-579 | As the PAM informatics leads (Rob K & Ian W), we would like to query the LH plate map data to support 'end-to-end' reporting.

Who are the primary contacts for this story
Rob K (PAM)
Ian W (PAM)
Rich L
Neil S

Acceptance criteria
To be considered successful the solution must allow:

  • A new table in the MLWH: lighthouse_samples
  • Draft design columns: lab_id, Root_sample_id, COGUK_id (can be null), result, date_tested, RNA_plate_barcode

NB: A discussion with PAM is still to be held on whether we can expose only +ve rows in this table; for now, assume we will send all LH samples (+ve, -ve and void results) to the MLWH.
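
To make the draft concrete, here is a rough sketch of that table as MySQL DDL, wrapped in the kind of Python the crawler could run. This is a sketch only: the connection details, column types and key choice are all assumptions, not an agreed schema.

import mysql.connector

# Draft columns from the acceptance criteria above; every type and the
# primary key are assumptions at this stage.
DDL = """
CREATE TABLE IF NOT EXISTS lighthouse_samples (
    id INT AUTO_INCREMENT PRIMARY KEY,
    lab_id VARCHAR(255),
    root_sample_id VARCHAR(255) NOT NULL,
    coguk_id VARCHAR(255) NULL,        -- can be null, per the draft design
    result VARCHAR(255),
    date_tested DATETIME,
    rna_plate_barcode VARCHAR(255)
)
"""

connection = mysql.connector.connect(host="localhost", database="mlwh")  # details assumed
cursor = connection.cursor()
cursor.execute(DDL)
connection.commit()
cursor.close()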

Additional context
Discussed in the design meeting on 17th July, and documented in a mail from Rich L to the dev team (Friday, 17 July 2020 at 13:59), subject: 'Proposed changes to Lighthouse crawler + MLWH contents'.

@rl15 rl15 added the Reporting label Jul 17, 2020
@andrewsparkes

This story may be better served by converting the crawler to write to a MySQL MLWH database instead of Mongo, rather than duplicating the information in both.
What were the advantages (if any) of using Mongo, and are they still relevant given the recent crawler changes to require specific columns and to process by file?

@KatyTaylor

Further to Andrew's comment above -
We also want to check with PAM how familiar they are with MongoDB and the Lighthouse API, and look into whether we can help them get the information they need from those without a move to SQL, or even just as a stop-gap in the meantime.

@emrojo

emrojo commented Aug 20, 2020

In theory, Mongo was designed for non-relational schemas where you want to keep a huge number of tuples without performance degrading as you scale. In practice, I suppose a well-indexed and well-tuned SQL database could provide the same level of performance; however, I don't think that is easy to achieve.

In our case, I think the main reason to use Mongo was that it is the database for Eve, the web framework we are using in Lighthouse. Although the crawler can now be changed very easily to write to other database types, choosing another database would require replacing the Lighthouse project entirely, as it is built on Eve.

For me, the main advantage of Mongo has been that we have grown the number of samples without any drop in performance, which doesn't often happen. For instance, we now keep more samples in the Mongo database than in the Sequencescape database.

And the main disadvantages for me:

  • There is no clear way to query the data and join it with other SQL databases.
  • Our users do have some experience with SQL, but not with Mongo.

@stevieing

Based on the comments, it sounds like further discussion is needed on the best approach for this.

@KatyTaylor

From the email "(Friday, 17 July 2020 at 13:59) Subject: Proposed changes to Lighthouse crawler + MLWH contents", the list of queries PAM wanted to be able to run was:

  1. a count of all positive plates from date to date for labs (plate contains one or more +ves)
  2. a count of all negative plates to date for a list of labs (all samples -ve)
  3. a count of all positive samples to date for a list of labs: Milton Keynes, Alderley Park, Glasgow, Cambridge AstraZeneca
  4. a count of all positive samples on site to date (needs a join with labware_emporium)
  5. a list of the root sample ids & COG-UK ids (where assigned) of all positive samples on site to date

I've been looking at how much of this we can currently get easily from MongoDB. Results:

  1. a count of all positive plates from date to date for labs (plate contains one or more +ves)

db.samples.distinct( 'plate_barcode', { created_at: { $gte: ISODate('2020-07-01T00:00:00Z'), $lte: ISODate('2020-08-01T00:00:00Z') }, Result: 'Positive', source: 'Alderley' } ).length

  2. a count of all negative plates to date for a list of labs (all samples -ve)

db.samples.distinct( 'plate_barcode', { source: 'Alderley' } ).length
minus
db.samples.distinct( 'plate_barcode', { Result: 'Positive', source: 'Alderley' } ).length

(Could probably be done in one query, if you can get an aggregation to work - see the pymongo sketch below)
(This ignores ‘void’ results - what do you want to do about them?)
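
For reference, a sketch of that single-query version via pymongo (connection and database name are assumptions; field names are taken from the shell queries above, and 'void' results still count as non-positive here):

from pymongo import MongoClient

samples = MongoClient()["lighthouse"]["samples"]  # db/collection names assumed

# Group samples by plate, count positives per plate, then keep only the
# plates with zero positives and count those plates.
pipeline = [
    {"$match": {"source": "Alderley"}},
    {"$group": {
        "_id": "$plate_barcode",
        "positives": {"$sum": {"$cond": [{"$eq": ["$Result", "Positive"]}, 1, 0]}},
    }},
    {"$match": {"positives": 0}},
    {"$count": "negative_plates"},
]
print(list(samples.aggregate(pipeline)))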

  3. a count of all positive samples to date for a list of labs: Milton Keynes, Alderley Park, Glasgow, Cambridge AstraZeneca

db.samples.count( { Result: 'Positive', source: { $in: ['Alderley', 'Cambridge-az', 'Queen Elizabeth University Hospital', 'UK Biocentre'] } } )

(Modify the list to get a different selection of centres)

  4. A count of all positive samples on site to date (needs a join with labware_emporium)

Use ‘Lighthouse positive samples report’ in Lighthouse UI

  5. A list of the root sample ids & COG-UK ids (where assigned) of all positive samples on site to date

Use ‘Lighthouse positive samples report’ in Lighthouse UI for Root Sample IDs
COG-UK ids would need some extra development

@KatyTaylor KatyTaylor self-assigned this Sep 3, 2020
@KatyTaylor

KatyTaylor commented Sep 3, 2020

Tasks:

Database schema:

Business as usual:

Historical data migrations:

Other:

Research tasks:

  • Think about how to log errors from the MLWH insert - Lighthouse UI imports, Kibana, a file of errored rows?
  • Research the best SQL syntax to use: a) decide whether to 'upsert' or insert (ON DUPLICATE KEY UPDATE), b) bulk insert (using VALUES), c) do we need to separate into batches? (See the sketch after this list.)
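
A minimal sketch of (a)-(c) combined - a batched, multi-row INSERT ... VALUES ... ON DUPLICATE KEY UPDATE via mysql-connector-python. The table and column names come from the draft design in the issue description and are not a confirmed schema:

import mysql.connector

BATCH_SIZE = 1000  # assumption: keeps each statement well under max_allowed_packet

def upsert_lighthouse_samples(connection, rows):
    # rows: list of (root_sample_id, coguk_id, result, date_tested) tuples;
    # assumes a unique key on root_sample_id so the upsert can trigger
    cursor = connection.cursor()
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        # Build one multi-row VALUES clause per batch
        placeholders = ", ".join(["(%s, %s, %s, %s)"] * len(batch))
        sql = (
            "INSERT INTO lighthouse_sample"
            " (root_sample_id, coguk_id, result, date_tested)"
            f" VALUES {placeholders}"
            " ON DUPLICATE KEY UPDATE"
            " coguk_id = VALUES(coguk_id), result = VALUES(result)"
        )
        # Flatten the batch to match the flat list of placeholders
        flat = [value for row in batch for value in row]
        cursor.execute(sql, flat)
    connection.commit()
    cursor.close()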

@KatyTaylor

KatyTaylor commented Sep 21, 2020

Testing notes for the UAT environment

Before deployment
Get some samples from MongoDB into Sequencescape and MLWH sample table, but not yet in lighthouse_sample table:

  • Find plates with positives that are in MongoDB:

    Plate            Lighthouse Created  +ves count
    AP-rna-00125681  ALDP                6
    AP-rna-00122445  ALDP                5
    AP-rna-00122412  ALDP                7
    AP-rna-00122372  ALDP                3
    AP-rna-00122366  ALDP                3
  • Scan above into LabWhere
  • Import into SS using Lighthouse Sentinel - this will also get them into the MLWH

Deploy

  • checkout deployment project, with our changes and eta / theta changes
  • checkout new ssh keys for theta hosts
  • deploy unified warehouse
  • deploy crawler
  • deploy lighthouse

After deployment

  • Delete samples where file_name is 'AP_sanger_report_200921_0130.csv'
      • (db.samples.count({ file_name: 'AP_sanger_report_200921_0130.csv' }) gives 28,538 samples in UAT)
      • created_at: 2020-09-21 11:36
      • To delete, run db.samples.remove({ file_name: 'AP_sanger_report_200921_0130.csv' })
  • Delete the file '200921_1136_AP_sanger_report_200921_0130.csv_cfaddc294ec04ef2f232132ef25497ca' from the ALDP/errors directory

Migrate historical data:

  • Run crawler migration, check all historical data in MongoDB gets inserted into MLWH lighthouse_sample
  • Ran first with subset - python run_migration.py update_mlwh_with_legacy_samples 200921_0000 200922_2359
    (178,527)
  • Then with all - python run_migration.py update_mlwh_with_legacy_samples 200201_0000 200923_2359
    (2,494,443 - probably lower than the full count because not all documents in the dev db have a 'created_at' field; see the check below) - took ~6 mins
  • Run unified warehouse migration, check all samples from data setup now have COG UK ids. (bundle exec rake lighthouse_sample_table:update_lighthouse_sample_cog_uk_ids)
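
As a quick sanity check on the 'created_at' theory in the subset note above, a pymongo count like this would show how many documents a created_at date-range filter would skip (connection and db name assumed):

from pymongo import MongoClient

samples = MongoClient()["lighthouse"]["samples"]  # db/collection names assumed
missing = samples.count_documents({"created_at": {"$exists": False}})
print(f"{missing} samples have no created_at field")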

Test 'business as usual':

  • Run crawler job, check data is imported into MLWH lighthouse_sample table as well as MongoDB, and is correct.
  • Find some plates in the above file that have positives, scan into LabWhere into a Location
  • Enter above location barcode into Lighthouse Sentinel Sample Creation - sends samples into SS
  • Check the COG UK id appears in the MLWH lighthouse_sample table (e.g. with the query sketched below).
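
For that last check, a query along these lines should do (a sketch; connection details are assumed and the column names are from the draft design):

import mysql.connector

connection = mysql.connector.connect(host="localhost", database="mlwh")  # details assumed
cursor = connection.cursor()
cursor.execute(
    "SELECT root_sample_id, coguk_id FROM lighthouse_sample"
    " WHERE coguk_id IS NOT NULL LIMIT 10"
)
for root_sample_id, coguk_id in cursor:
    print(root_sample_id, coguk_id)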
