Feature/improve speed and limit memory (#11) #100

Open
wants to merge 45 commits into
base: main


sambenfredj
Contributor

Improve speed and limit memory consumption

  • stream input files for inference
  • add feature: skip deduplication
  • add feature: ensemble model
  • add feature: rescale input before inference with pre-trained models
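
As a rough illustration of the first bullet: streaming here means reading and scoring the input in fixed-size chunks and writing each chunk out before the next one is read, so memory use is bounded by the chunk size rather than by the file size. The sketch below uses only the Python standard library. `iter_chunks`, `stream_scores`, the column names, and the linear weights are all hypothetical; this is not mokapot's API or the implementation in this PR.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from a row iterator."""
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def stream_scores(source, dest, weights, chunk_size=1000):
    """Score a tab-delimited table chunk by chunk, writing results as we go.

    `weights` maps feature column names to coefficients of a made-up
    linear model; only one chunk is held in memory at a time.
    """
    reader = csv.DictReader(source, delimiter="\t")
    writer = None
    n_rows = 0
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            row["score"] = sum(float(row[name]) * w for name, w in weights.items())
        if writer is None:
            # Create the writer lazily so the header includes the new column.
            writer = csv.DictWriter(
                dest,
                fieldnames=list(chunk[0].keys()),
                delimiter="\t",
                lineterminator="\n",
            )
            writer.writeheader()
        writer.writerows(chunk)
        n_rows += len(chunk)
    return n_rows


if __name__ == "__main__":
    src = io.StringIO("SpecId\tfeat_a\tfeat_b\nA\t1.0\t2.0\nB\t0.5\t0.0\nC\t2.0\t1.0\n")
    out = io.StringIO()
    stream_scores(src, out, {"feat_a": 2.0, "feat_b": -1.0}, chunk_size=2)
    print(out.getvalue())
```

The same loop works on real files by passing open file handles instead of `io.StringIO` objects.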

sambenfredj and others added 2 commits April 12, 2023 17:31
Improve speed and limit memory consumption

- stream input files for inference
- add feature: skip deduplication
- add feature: ensemble model
- add feature: rescale input before inference with pre-trained models
💄 fix linting
gessulat and others added 10 commits April 20, 2023 16:29
- fix bug: member variables not assigned when model is not trained
- allow throw when input file is malformed: remove skip on bad lines from pandas read function
- Create new object of OnDiskPsmDataset to use for brew tests
- Update brew function outputs and assert statements
- remove assign confidence tests because datasets don't have assign confidence methods anymore
- add eval_fdr value to the _update_labels function
* Fix test confidence:
- fix bugs for grouped confidence
- fix test_one_group : create file using psm_df_1000 to create OnDiskPsmDataset.
- remove test_pickle because confidence does not return dataframe results anymore.
- add test_multi_groups to test that different group results are saved correctly.

* fix bugs:
- overwrite default fdr for update_labels function
- return dataframe for psm_df_1000 to use with LinearPsmDataset
- Remove test_cli_pepxml because xml files don't work with streaming
- Replace old output file names
- Add random generator 'rng' variable to confidence since it is required for proteins
- Remove subset_max_train from PluginModel
- Fix bug: convert targets column after reading in chunks
- Fix peptide column name for confidence
- Fix test cli plugins: replace DecisionTreeClassifier with LinearSVC because DecisionTreeClassifier returns scores as 0 or 1
- Refactor test structure : Separate brew and confidence functions, read results from output files.
- Fix bugs: fix output columns for proteins, sort proteins data by score
- Add label value to initial direction because it has to have a numerical number
- Read pin does not return dataframe anymore
- Compare output of read_pin function to example dataframe
- Add skip_deduplication flag test
- Add ensemble flag test
- Add rescale flag test
- Fix bug: remove target_column variable from read file for read_data_for_rescale
- Remove writer tests with confidence object because LinearPsmDataset does not have assign_confidence method anymore and results are streamed to output files while computing confidence
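
One plausible reading of the DecisionTreeClassifier item above (the commit message gives no further detail) is that scores restricted to 0 or 1 leave only a single usable threshold, which makes target-decoy FDR filtering very coarse. The sketch below is a simplified stand-in for that idea, not mokapot's q-value code: `targets_accepted` and the toy scores are hypothetical, and the FDR estimate is the plain decoy/target ratio.

```python
from itertools import groupby


def targets_accepted(scores, is_target, fdr):
    """Largest number of targets whose decoy/target ratio stays at or below `fdr`.

    Equal scores are handled as one block, because no threshold can split a tie.
    """
    ranked = sorted(zip(scores, is_target), key=lambda pair: -pair[0])
    targets = decoys = best = 0
    for _, block in groupby(ranked, key=lambda pair: pair[0]):
        for _, label in block:
            if label:
                targets += 1
            else:
                decoys += 1
        # Only evaluate the ratio at the end of a block of tied scores.
        if targets and decoys / targets <= fdr:
            best = targets
    return best


# Four targets followed by four decoys (toy data).
labels = [True] * 4 + [False] * 4
continuous = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
binary = [1, 1, 1, 0, 1, 0, 0, 0]

print(targets_accepted(continuous, labels, fdr=0.3))
print(targets_accepted(binary, labels, fdr=0.3))
```

With the continuous scores several targets pass at this threshold, while the binary scores put one decoy in the same block as the three best targets, so nothing passes.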
@codecov

codecov bot commented May 22, 2023

Codecov Report

Attention: Patch coverage is 89.75469%, with 71 lines in your changes missing coverage. Please review.

Project coverage is 83.21%. Comparing base (da2d545) to head (6726dea).
Report is 3 commits behind head on main.

Files                                 Patch %   Lines
mokapot/confidence.py                 88.10%    22 Missing ⚠️
mokapot/dataset.py                    87.15%    14 Missing ⚠️
mokapot/mokapot.py                    71.87%    9 Missing ⚠️
mokapot/model.py                      70.37%    8 Missing ⚠️
mokapot/parsers/pin.py                90.69%    8 Missing ⚠️
mokapot/aggregatePsmsToPeptides.py    89.28%    6 Missing ⚠️
mokapot/brew.py                       97.32%    3 Missing ⚠️
mokapot/utils.py                      98.63%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
- Coverage   85.36%   83.21%   -2.15%     
==========================================
  Files          19       21       +2     
  Lines        1640     2032     +392     
==========================================
+ Hits         1400     1691     +291     
- Misses        240      341     +101     


@gessulat
Contributor

Thanks @sambenfredj for taking care of the tests!
Samia replaced some functions that worked on in-memory data with streaming-based counterparts. Some of the "old" functions are still in the code base but are no longer used, and their tests were removed; the new functions should be covered. That is why codecov still reports a coverage regression.

We didn't want to remove the "old" functions until you have had a look and are happy with the changes, @wfondrie
😄

sambenfredj and others added 14 commits August 4, 2023 11:08
…alue then raise error that model performed worse (#33)
* Create new executable to aggregate psms to peptides.
* Fix bugs:
- fix error when no PSMs are found during training: if no PSMs pass the FDR value, raise an error that the model performed worse
- raise error when PEP values are all equal to 1
- prefix paths with dest_dir to not pollute the workdir
- catch errors to prevent traces from being logged: catch all errors so that error traces do not break structured logging
- fixes parallelism in parse_in_chunks to max_workers
- fix indeterminism
- fixed small column chunk bug
- fix bug when using multiple input files
* Fix and add tests:
- remove writer tests with confidence object because LinearPsmDataset does not have assign_confidence method anymore and results are streamed to output files while computing confidence
- add test for the new function "get_unique_peptides_from_psms"
- add cli test for aggregatePsmsToPeptides
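
For readers unfamiliar with the PSM-to-peptide aggregation mentioned in this commit, the usual idea is to keep only the best-scoring PSM for each peptide. The sketch below shows that idea under stated assumptions: `best_psm_per_peptide` and the `peptide`/`score` keys are hypothetical, and it is not the code of `get_unique_peptides_from_psms` or the new executable.

```python
def best_psm_per_peptide(psms):
    """Yield the highest-scoring PSM for each peptide, best peptides first.

    `psms` is any iterable of dicts with hypothetical "peptide" and "score" keys.
    """
    seen = set()
    # Visit PSMs from best to worst so the first hit per peptide is its best PSM.
    for psm in sorted(psms, key=lambda p: p["score"], reverse=True):
        if psm["peptide"] not in seen:
            seen.add(psm["peptide"])
            yield psm


example_psms = [
    {"peptide": "AAK", "score": 1.2},
    {"peptide": "AAK", "score": 3.4},
    {"peptide": "CCR", "score": 0.5},
]
for psm in best_psm_per_peptide(example_psms):
    print(psm["peptide"], psm["score"])
```

This version sorts everything in memory for brevity; a streaming variant would instead read PSMs that are already sorted by score and apply the same `seen` check.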
dev to main

See merge request msaid/inferys/mokapot!36
- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api
sambenfredj and others added 18 commits February 27, 2024 11:44
# Conflicts:
#   tests/conftest.py
#   tests/system_tests/test_system.py
#   tests/unit_tests/test_brew.py
#   tests/unit_tests/test_writer_flashlfq.py
#   tests/unit_tests/test_writer_txt.py
rebase main

See merge request msaid/inferys/mokapot!37
gessulat pushed a commit to msaid-de/mokapot that referenced this pull request Jun 19, 2024
Fix problem with type conversions in merge_sort

Closes wfondrie#100

See merge request msaid/inferys/mokapot!59
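
The commit message does not say what the type-conversion problem in merge_sort was. As a generic illustration of why types matter when merging pre-sorted chunks, the sketch below merges two score-sorted chunks with the standard library's `heapq.merge`: if scores read from text files are left as strings, they compare lexicographically ("9.5" sorts above "10.0") and the merged order is wrong. `merge_sorted_chunks` and the row layout are hypothetical, not mokapot's code.

```python
import heapq


def merge_sorted_chunks(chunks, key):
    """Merge chunks that are each sorted by descending `key` into one list."""
    return list(heapq.merge(*chunks, key=key, reverse=True))


# Rows as they might come out of a text file: (id, score-as-string).
chunk_1 = [("p1", "10.0"), ("p2", "2.5")]
chunk_2 = [("p3", "9.5"), ("p4", "1.0")]

# Converting to float in the key gives the intended numeric order.
numeric = merge_sorted_chunks([chunk_1, chunk_2], key=lambda row: float(row[1]))

# Comparing the raw strings silently produces a different order.
lexicographic = merge_sorted_chunks([chunk_1, chunk_2], key=lambda row: row[1])

print([row[0] for row in numeric])
print([row[0] for row in lexicographic])
```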
3 participants