Feature/improve speed and limit memory (#11) #100

sambenfredj · 2023-04-12T15:37:44Z

Improve speed and limit memory consumption

stream input files for inference
add feature: skip deduplication
add feature: ensemble model
add feature: rescale input before inference with pre-trained models
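
As a rough illustration of the streaming idea, the sketch below reads a tab-delimited file in fixed-size chunks with pandas and keeps only a running summary, so memory use depends on the chunk size rather than the file size. The column names (`SpecId`, `Label`, `score`), the in-memory sample, and the helper `count_targets_streaming` are hypothetical and not taken from this PR.

```python
import io

import pandas as pd

# Hypothetical PIN-like sample; a real input would be a file on disk.
SAMPLE = "SpecId\tLabel\tscore\n" + "".join(
    f"psm_{i}\t{1 if i % 2 == 0 else -1}\t{i / 10}\n" for i in range(10)
)


def count_targets_streaming(handle, chunk_size=4):
    """Count target rows (Label == 1) without loading the whole file."""
    n_targets = 0
    n_rows = 0
    # Passing chunksize makes read_csv return an iterator of DataFrames.
    for chunk in pd.read_csv(handle, sep="\t", chunksize=chunk_size):
        n_targets += int((chunk["Label"] == 1).sum())
        n_rows += len(chunk)
    return n_targets, n_rows


n_targets, n_rows = count_targets_streaming(io.StringIO(SAMPLE))
print(n_targets, n_rows)
```

Only one chunk is held in memory at a time; anything that must survive across chunks has to be accumulated explicitly, as the two counters are here.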


💄 fix linting

- Fix bug: member variables not assigned when the model is not trained
- Allow an exception to be thrown when the input file is malformed: remove the skip on bad lines from the pandas read function
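
To illustrate the malformed-input change: pandas' `read_csv` accepts an `on_bad_lines` argument (pandas 1.3 and later). With `"skip"`, rows with too many fields are dropped silently; with the default, `"error"`, parsing raises. The sketch assumes this is the option the commit refers to, and the sample data is made up.

```python
import io

import pandas as pd

# Made-up sample: the third line has one field too many.
MALFORMED = "SpecId\tLabel\npsm_1\t1\npsm_2\t-1\textra\npsm_3\t1\n"

# "skip" silently drops the malformed row.
skipped = pd.read_csv(io.StringIO(MALFORMED), sep="\t", on_bad_lines="skip")
print(len(skipped))

# The default, "error", raises so the problem is visible to the caller.
try:
    pd.read_csv(io.StringIO(MALFORMED), sep="\t")
    raised = False
except pd.errors.ParserError:
    raised = True
print(raised)
```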

- Create new object of OnDiskPsmDataset to use for brew tests
- Update brew function outputs and assert statements

- Remove assign confidence tests because datasets don't have assign confidence methods anymore
- Add eval_fdr value to the _update_labels function

* Fix test confidence:
  - fix bugs for grouped confidence
  - fix test_one_group: create file using psm_df_1000 to create OnDiskPsmDataset
  - remove test_pickle because confidence does not return dataframe results anymore
  - add test_multi_groups to test that different group results are saved correctly
* Fix bugs:
  - overwrite default fdr for update_labels function
  - return dataframe for psm_df_1000 to use with LinearPsmDataset

- Remove test_cli_pepxml because xml files don't work with streaming
- Replace old output file names
- Add random generator 'rng' variable to confidence since it is required for proteins
- Remove subset_max_train from PluginModel
- Fix bug: convert targets column after reading in chunks
- Fix peptide column name for confidence
- Fix test cli plugins: replace DecisionTreeClassifier with LinearSVC because DecisionTreeClassifier returns scores as 0 or 1
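
The DecisionTreeClassifier point can be illustrated with scikit-learn: an unpruned `DecisionTreeClassifier` grows pure leaves, so its class probabilities on the training data are 0 or 1, whereas `LinearSVC.decision_function` yields continuous scores that can be ranked. The synthetic data and variable names below are assumptions, and which scoring method the plugin test actually calls is not shown in this PR.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (not from the mokapot test suite).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Default settings grow the tree until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scores = tree.predict_proba(X)[:, 1]

# A linear SVM scores each sample by its signed distance to the hyperplane.
svc = LinearSVC().fit(X, y)
svc_scores = svc.decision_function(X)

print(np.unique(tree_scores))      # only the two extreme values
print(len(np.unique(svc_scores)))  # many distinct values
```

With only two distinct scores, every PSM in a class is tied, which leaves nothing to rank when estimating confidence.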

- Refactor test structure: separate brew and confidence functions, read results from output files
- Fix bugs: fix output columns for proteins, sort proteins data by score

- Add label value to initial direction because it has to have a numeric value
- read_pin does not return a dataframe anymore
- Compare output of read_pin function to example dataframe

- Add skip_deduplication flag test
- Add ensemble flag test
- Add rescale flag test
- Fix bug: remove target_column variable from read file for read_data_for_rescale

- Remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence

codecov · 2023-05-22T12:34:53Z

Codecov Report

Attention: Patch coverage is 89.75469%, with 71 lines in your changes missing coverage. Please review.

Project coverage is 83.21%. Comparing base (da2d545) to head (6726dea).
Report is 3 commits behind head on main.

Files	Patch %	Lines
mokapot/confidence.py	88.10%	22 Missing ⚠️
mokapot/dataset.py	87.15%	14 Missing ⚠️
mokapot/mokapot.py	71.87%	9 Missing ⚠️
mokapot/model.py	70.37%	8 Missing ⚠️
mokapot/parsers/pin.py	90.69%	8 Missing ⚠️
mokapot/aggregatePsmsToPeptides.py	89.28%	6 Missing ⚠️
mokapot/brew.py	97.32%	3 Missing ⚠️
mokapot/utils.py	98.63%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
- Coverage   85.36%   83.21%   -2.15%     
==========================================
  Files          19       21       +2     
  Lines        1640     2032     +392     
==========================================
+ Hits         1400     1691     +291     
- Misses        240      341     +101


gessulat · 2023-05-23T08:10:09Z

Thanks @sambenfredj for taking care of the tests!
Samia replaced some functions that work on in-memory data with streaming-based counterparts. Some of the "old" functions are still in the code base but are no longer used, and their tests were removed, while the new functions should be covered. This still leads to the coverage regressions reported by codecov.

We didn't want to remove the "old" functions just yet, until you have had a look and are happy with the changes @wfondrie
😄

tests/system_tests/sample_plugin/mokapot_ctree/__init__.py

…alue then raise error that model performed worse (#33)

* Create new executable to aggregate psms to peptides.
* Fix bugs:
  - fix error no psms found during training: if no psms passed the fdr value then raise error that model performed worse
  - raise error when pep values are all equal to 1
  - prefix paths with dest_dir to not pollute the workdir
  - catch all errors so that error traces do not break structured logging
  - fix parallelism in parse_in_chunks to max_workers
  - fix indeterminism
  - fix small column chunk bug
  - fix bug when using multiple input files
* Fix and add tests:
  - remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence
  - add test for the new function "get_unique_peptides_from_psms"
  - add cli test for aggregatePsmsToPeptides

dev to main

See merge request msaid/inferys/mokapot!36

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api