Feature/improve speed and limit memory (#11) #100
Improve speed and limit memory consumption
- stream input files for inference
- add feature: skip deduplication
- add feature: ensemble model
- add feature: rescale input before inference with pre-trained models

💄 fix linting

- fix bug: member variables not assigned when model is not trained
- allow throw when input file is malformed: remove skip on bad lines from pandas read function

- Create new OnDiskPsmDataset object to use for brew tests
- Update brew function outputs and assert statements

- remove assign_confidence tests because datasets don't have assign_confidence methods anymore
- add eval_fdr value to the _update_labels function

* Fix test confidence:
  - fix bugs for grouped confidence
  - fix test_one_group: create file using psm_df_1000 to create OnDiskPsmDataset
  - remove test_pickle because confidence does not return dataframe results anymore
  - add test_multi_groups to test that different group results are saved correctly
* Fix bugs:
  - overwrite default fdr for update_labels function
  - return dataframe for psm_df_1000 to use with LinearPsmDataset

- Remove test_cli_pepxml because xml files don't work with streaming
- Replace old output file names
- Add random generator 'rng' variable to confidence since it is required for proteins
- Remove subset_max_train from PluginModel
- Fix bug: convert targets column after reading in chunks
- Fix peptide column name for confidence
- Fix test cli plugins: replace DecisionTreeClassifier with LinearSVC because DecisionTreeClassifier returns scores as 0 or 1

- Refactor test structure: separate brew and confidence functions, read results from output files
- Fix bugs: fix output columns for proteins, sort proteins data by score

- Add label value to initial direction because it has to have a numerical value
- read_pin does not return a dataframe anymore
- Compare output of read_pin function to example dataframe

- Add skip_deduplication flag test
- Add ensemble flag test
- Add rescale flag test
- Fix bug: remove target_column variable from read file for read_data_for_rescale

- Remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence
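For readers unfamiliar with the "stream input files for inference" idea in the first commit, here is a minimal sketch of chunked reading using only the standard library. It is not the code from this PR: the function names `iter_chunks` and `count_targets`, the `Label` column name, and the chunk size are all assumptions made for illustration.

```python
import csv
from itertools import islice
from typing import Dict, Iterator, List


def iter_chunks(path: str, chunk_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `chunk_size` rows from a tab-delimited file.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_targets(path: str, label_column: str = "Label", chunk_size: int = 1000) -> int:
    """Count rows whose label equals 1 without loading the whole file.

    The column name "Label" is an assumption for this example.
    """
    total = 0
    for chunk in iter_chunks(path, chunk_size):
        # Values are read as text, so convert before comparing.
        total += sum(1 for row in chunk if int(row[label_column]) == 1)
    return total
```

Memory use is bounded by the chunk size rather than the file size, which is the general trade-off the commit title describes.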
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
- Coverage   85.36%   83.21%   -2.15%
==========================================
  Files          19       21       +2
  Lines        1640     2032     +392
==========================================
+ Hits         1400     1691     +291
- Misses        240      341     +101

☔ View full report in Codecov by Sentry.
Thanks @sambenfredj for taking care of the tests! @wfondrie, we didn't want to remove the "old" functions just yet, not before you have had a look and are happy with the changes.
…alue then raise error that model performed worse (#33)
* Create new executable to aggregate psms to peptides.
* Fix bugs:
  - fix error "no psms found" during training: if no psms pass the fdr value, raise an error that the model performed worse
  - raise error when pep values are all equal to 1
  - prefix paths with dest_dir to not pollute the workdir
  - catch all errors so that error traces do not break structured logging
  - fix parallelism in parse_in_chunks to max_workers
  - fix nondeterminism
  - fix small column chunk bug
  - fix bug when using multiple input files
* Fix and add tests:
  - remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence
  - add test for the new function "get_unique_peptides_from_psms"
  - add cli test for aggregatePsmsToPeptides

dev to main
See merge request msaid/inferys/mokapot!36

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api
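The bug-fix items "fix parallelism in parse_in_chunks to max_workers" and "fix nondeterminism" do not say how the fix was done. As a rough illustration of the general pattern only, the sketch below caps concurrency with `max_workers` and returns results in input order. `map_chunks` is a hypothetical helper, not the PR's `parse_in_chunks`, and the use of a thread pool is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_chunks(func: Callable[[T], R], chunks: Iterable[T], max_workers: int = 4) -> List[R]:
    """Apply `func` to every chunk using at most `max_workers` threads.

    Executor.map yields results in the order of the inputs, regardless of
    which worker finishes first, so the output order is deterministic.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, chunks))
```

For example, `map_chunks(sum, [[1, 2], [3], [4, 5, 6]], max_workers=2)` processes three chunks with two workers and keeps the results aligned with the inputs.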
# Conflicts:
#	tests/conftest.py
#	tests/system_tests/test_system.py
#	tests/unit_tests/test_brew.py
#	tests/unit_tests/test_writer_flashlfq.py
#	tests/unit_tests/test_writer_txt.py

rebase main
See merge request msaid/inferys/mokapot!37

Fix problem with type conversions in merge_sort
Closes wfondrie#100
See merge request msaid/inferys/mokapot!59
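
The last commit message does not describe what the type-conversion problem in merge_sort was. One common pitfall when merging sorted chunks read from text files is comparing scores as strings instead of numbers, and the sketch below illustrates that case only. The function name `merge_sorted_by_score` and the `(psm_id, score)` row layout are assumptions for this example, not the PR's implementation.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, str]  # (psm_id, score as text), a hypothetical layout


def merge_sorted_by_score(sources: Iterable[Iterable[Row]]) -> Iterator[Tuple[str, float]]:
    """Merge sources that are each sorted by descending score.

    Scores are converted to float before merging. Compared as text,
    "9.0" would sort ahead of "10.5" because "9" > "1".
    """
    typed = [
        ((psm_id, float(score)) for psm_id, score in source)
        for source in sources
    ]
    # heapq.merge is lazy, so only one row per source is held at a time.
    return heapq.merge(*typed, key=lambda row: row[1], reverse=True)
```

Because `heapq.merge` assumes each input is already sorted in the requested direction, every source must be sorted by descending numeric score before it is passed in.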
Improve speed and limit memory consumption