Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: avoid duplicate mappings #131

Merged
merged 24 commits into from
Sep 14, 2023
Merged

refactor: avoid duplicate mappings #131

merged 24 commits into from
Sep 14, 2023

Conversation

BorisYourich
Copy link
Collaborator

Description

Part of the code from the get_read_orientation.py module was transferred into a new module mapping.py which takes care of the STAR mapping process. The instance of the Mapping class is created in the HtsInfer class and supplemented as an argument to the get_lib_type and get_orientation modules, so as to avoid a repeated mapping if the sequence IDs are not the same and the mates are actually not mates.

Fixes #130

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Checklist

Please carefully read these items and tick them off if the statements are true
or do not apply.

  • I have performed a self-review of my own code
  • My code follows the existing coding style, lints and generates no new
    warnings
  • I have added type annotations to all function/method signatures, and I
    have added type annotations for any local variables that are non-trivial,
    potentially ambiguous or might otherwise benefit from explicit typing.
  • I have commented my code in hard-to-understand areas
  • I have added ["Google-style docstrings"] to all new modules, classes,
    methods/functions or updated previously existing ones
  • I have added tests that prove my fix is effective or that my feature
    works
  • New and existing unit tests pass locally with my changes and I have not
    reduced the code coverage relative to the previous state
  • I have updated any sections of the app's documentation that are affected
    by the proposed changes

One Issue which I was not able to replicate is the unit test failures in the CI workflow, I will not be able to work on this currently, so I leave it as is for now.

Copy link
Member

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, thanks a lot. Please check the unit tests and resolve issues. Otherwise only minor comments from my side. Once the tests pass, can you please tag @balajtimate for an additional pair of eyes? 🙏

htsinfer/cli.py Outdated
"upper limit on the difference in the reference "
"sequence coordinates between the two mates to be "
"considered as coming from a single fragment "
"(Used only when sequence identifiers do not match) "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency I would use all lowercase "used". Also in the next option.

@@ -116,6 +124,111 @@ def _evaluate_mate_relationship(
self.results.relationship = (
StatesTypeRelationship.split_mates
)
self.mapping.library_type.relationship \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use parentheses for wrapping lines. See PEP 8 style guide.

@@ -416,6 +101,7 @@ def process_alignments(
if star_dirs[0].name == "file_1":
results.file_1 = self.process_single(sam=paths[0])
elif len(star_dirs) == 2:
print("Getting there")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a leftover from testing. If not, please change into a logging message

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opsie 😆

@uniqueg
Copy link
Member

uniqueg commented Aug 10, 2023

@BorisYourich: I've been away for a while. Should I look at this again or was there nothing added/changed since the last review?

@BorisYourich
Copy link
Collaborator Author

@BorisYourich: I've been away for a while. Should I look at this again or was there nothing added/changed since the last review?

Hey, sorry for such a late response @uniqueg I was busy with other stuff, no update on this yet.

@uniqueg
Copy link
Member

uniqueg commented Aug 22, 2023

No problem. Thanks for the update on no update :) I had just asked because I was away for quite a while myself and lost a bit track of progress on different ends

@BorisYourich
Copy link
Collaborator Author

I made the required changes plus something extra I found, but the unit tests still fail and I have no idea why, and it seems random, take a look at the replayed 3.7 unit test, many more tests failed compared to previous run. I cannot replicate in my local env, even if I downgrade python to 3.7, the tests still pass.

@uniqueg
Copy link
Member

uniqueg commented Sep 11, 2023

Hi @BorisYourich, I've fixed some of the errors by stabilizing the build in various ways (remote now installs the exact same package as local for me): https://github.com/zavolanlab/htsinfer/pull/131/files/0a232822922a006fd97ee9d6e94d44db733eca5d..b5bd469df2694f20ba46178425752b036cda2f94

Among other things, I have included Py3.10 in the supported versions, but dropped support for Py3.7 - support for which is phasing out anyway.

As you can see, there is still one issue remaining that affects two tests. These two tests also pass on my local machine (I suppose the same issue you faced), which -together with the error message- makes me think that it's a permission problem. Maybe a file is attempted to be written in a location that we do have write access on our local machine, but wouldn't have write access on the remote machine. Or a file is created in a temporary directory that is garbage collected on the remote between tests, but not locally.

From the error stack, I would have a look here first:

def subset_transcripts_by_organism(self) -> Path:
"""Filter FASTA file of transcripts by current sources.
The filtered file contains records from the indicated sources.
Typically, this is one source. However, for if two input files
were supplied that are originating from different sources (i.e.,
not from a valid paired-ended library), it may be from two
different sources. If no source is supplied (because it could
not be inferred), no filtering is done.
Returns:
Path to filtered FASTA file.
Raises:
FileProblem: Could not open input/output FASTA file for
reading/writing.
"""
LOGGER.debug(f"Subsetting transcripts for: {self.library_source}")
outfile = self.tmp_dir / "transcripts_subset.fasta"
def yield_filtered_seqs():
"""Generator yielding sequence records for specified sources.
Yields:
Next FASTA sequence record of the specified sources.
Raises: Could not process input FASTA file.
"""
sources = []
if self.library_source.file_1.short_name is not None:
sources.append(self.library_source.file_1.short_name)
if self.library_source.file_2.short_name is not None:
sources.append(self.library_source.file_2.short_name)
try:
for record in SeqIO.parse(
handle=self.transcripts_file,
format='fasta',
):
try:
org_name = record.description.split("|")[3]
except (ValueError, IndexError):
continue
if org_name in sources or len(sources) == 0:
yield record
except OSError as exc:
raise FileProblem(
f"Could not process file '{self.transcripts_file}'"
) from exc
try:
SeqIO.write(
sequences=yield_filtered_seqs(),
handle=outfile,
format='fasta',
)
except OSError as exc:
raise FileProblem(
f"Failed to write to FASTA file '{outfile}'"
) from exc
LOGGER.debug(f"Filtered transcripts file: {outfile}")
return outfile

This method also had a line that created a file name based off an object (which I first thought was the problem). I have now given the transcripts subset file a hardcoded name (not sure if that could cause problems in some case?). And still this stringified library source object instance is used in the log message, so this isn't handled well. Seems to me that this function needs guarding against edge cases, e.g., when there is no library source determined/available).

As I think you know that particular code better from your changes, I would like to pass it back to you if possible. With some debug messages and maybe additional try-except blocks around the code reading and writing that subset transcripts file, I'm sure you will find out what the problem is on the remote.

@codecov
Copy link

codecov bot commented Sep 13, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.19% 🎉

Comparison is base (d243b44) 99.80% compared to head (a5bee43) 100.00%.

Additional details and impacted files
@@             Coverage Diff             @@
##              dev      #131      +/-   ##
===========================================
+ Coverage   99.80%   100.00%   +0.19%     
===========================================
  Files          12        13       +1     
  Lines        1047      1078      +31     
===========================================
+ Hits         1045      1078      +33     
+ Misses          2         0       -2     
Files Changed Coverage Δ
htsinfer/cli.py 100.00% <100.00%> (ø)
htsinfer/get_library_type.py 100.00% <100.00%> (ø)
htsinfer/get_read_orientation.py 100.00% <100.00%> (+0.86%) ⬆️
htsinfer/htsinfer.py 100.00% <100.00%> (ø)
htsinfer/mapping.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@BorisYourich
Copy link
Collaborator Author

Playing around with the tests I realized where the problem is, the tests didn't have the tmpdir argument. Everything works now :)

@uniqueg
Copy link
Member

uniqueg commented Sep 13, 2023

Awesome, thanks a lot, @BorisYourich!

@balajtimate: Can you please test the feature branch to see if the proposed changes break anything on real world data? Maybe you can run it on a few samples and see if you get the same results as you did previously.

If all is good, I will merge.

@balajtimate
Copy link
Collaborator

Okay, so I tested on a subset of 20 samples from our sample data, random organisms, SE/PE and there was no error, code ran perfectly, and the results match the ones from the run with the dev branch. 😄

I think you can merge now, if the 3.7 integration and unit tests run.

@uniqueg
Copy link
Member

uniqueg commented Sep 14, 2023

Thanks! Merging this now, given that it doesn't break anything. We can still do in-depth testing later to see if it actually works as expected. Thanks a lot @BorisYourich

@uniqueg uniqueg changed the title Issue 130 refactor: avoid duplicate mappings Sep 14, 2023
@uniqueg uniqueg merged commit 67bb506 into dev Sep 14, 2023
18 checks passed
@uniqueg uniqueg deleted the issue_130 branch September 14, 2023 08:23
balajtimate pushed a commit that referenced this pull request Nov 9, 2023
Co-authored-by: Boris Jurič <499542@mail.muni.cz>
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>
balajtimate added a commit that referenced this pull request Nov 15, 2023
* feat: add org param

* refactor: avoid duplicate mappings (#131)

Co-authored-by: Boris Jurič <499542@mail.muni.cz>
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>

* fix typo, update pylint config

* feat: add org_id param #108

* refactor: get_library_source.py #108

* test: add org param tests #108

* fix: update Pydantic version (#146)

* fix pydantic issues

* fix: update pydantic version in envs

* fix: pin sphinx-rtd-theme into env

* fix: update readthedocs config

* update readme, gitignore

* feat: infer org source if id not in dict #108

* replace json with model_dump

* feat: add org_id param #108

* feat: add org_id param #108

* refactor: replace org with tax-id

* refactor get_library_source

* refactor get_library_source tests

* refactor: update models.py

* refactor: fix typos

---------

Co-authored-by: Boris Jurič <74237898+BorisYourich@users.noreply.github.com>
Co-authored-by: Boris Jurič <499542@mail.muni.cz>
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create Mapping class for managing STAR cmd and outputs
3 participants