refactor: avoid duplicate mappings #131

BorisYourich · 2023-07-18T09:51:35Z

Description

Part of the code from the get_read_orientation.py module was transferred into a new module mapping.py which takes care of the STAR mapping process. The instance of the Mapping class is created in the HtsInfer class and supplemented as an argument to the get_lib_type and get_orientation modules, so as to avoid a repeated mapping if the sequence IDs are not the same and the mates are actually not mates.

Fixes #130

Type of change

Please delete options that are not relevant.

New feature (non-breaking change which adds functionality)

Checklist

Please carefully read these items and tick them off if the statements are true
or do not apply.

I have performed a self-review of my own code
My code follows the existing coding style, lints and generates no new
warnings
I have added type annotations to all function/method signatures, and I
have added type annotations for any local variables that are non-trivial,
potentially ambiguous or might otherwise benefit from explicit typing.
I have commented my code in hard-to-understand areas
I have added ["Google-style docstrings"] to all new modules, classes,
methods/functions or updated previously existing ones
I have added tests that prove my fix is effective or that my feature
works
New and existing unit tests pass locally with my changes and I have not
reduced the code coverage relative to the previous state
I have updated any sections of the app's documentation that are affected
by the proposed changes

One Issue which I was not able to replicate is the unit test failures in the CI workflow, I will not be able to work on this currently, so I leave it as is for now.

remote had merge changes from dev into issue_85

…M files

uniqueg

Good stuff, thanks a lot. Please check the unit tests and resolve issues. Otherwise only minor comments from my side. Once the tests pass, can you please tag @balajtimate for an additional pair of eyes? 🙏

uniqueg · 2023-07-18T22:39:36Z

htsinfer/cli.py

+            "upper limit on the difference in the reference "
+            "sequence coordinates between the two mates to be "
+            "considered as coming from a single fragment "
+            "(Used only when sequence identifiers do not match) "


For consistency I would use all lowercase "used". Also in the next option.

uniqueg · 2023-07-18T22:41:12Z

htsinfer/get_library_type.py

@@ -116,6 +124,111 @@ def _evaluate_mate_relationship(
                self.results.relationship = (
                    StatesTypeRelationship.split_mates
                )
+                self.mapping.library_type.relationship \


Better to use parentheses for wrapping lines. See PEP 8 style guide.

uniqueg · 2023-07-18T22:42:47Z

htsinfer/get_read_orientation.py

@@ -416,6 +101,7 @@ def process_alignments(
            if star_dirs[0].name == "file_1":
                results.file_1 = self.process_single(sam=paths[0])
        elif len(star_dirs) == 2:
+            print("Getting there")


I think this is a leftover from testing. If not, please change into a logging message

uniqueg · 2023-08-10T13:09:51Z

@BorisYourich: I've been away for a while. Should I look at this again or was there nothing added/changed since the last review?

BorisYourich · 2023-08-22T07:20:32Z

@BorisYourich: I've been away for a while. Should I look at this again or was there nothing added/changed since the last review?

Hey, sorry for such a late response @uniqueg I was busy with other stuff, no update on this yet.

uniqueg · 2023-08-22T09:37:42Z

No problem. Thanks for the update on no update :) I had just asked because I was away for quite a while myself and lost a bit track of progress on different ends

BorisYourich · 2023-09-08T11:33:29Z

I made the required changes plus something extra I found, but the unit tests still fail and I have no idea why, and it seems random, take a look at the replayed 3.7 unit test, many more tests failed compared to previous run. I cannot replicate in my local env, even if I downgrade python to 3.7, the tests still pass.

uniqueg · 2023-09-11T12:37:07Z

Hi @BorisYourich, I've fixed some of the errors by stabilizing the build in various ways (remote now installs the exact same package as local for me): https://github.com/zavolanlab/htsinfer/pull/131/files/0a232822922a006fd97ee9d6e94d44db733eca5d..b5bd469df2694f20ba46178425752b036cda2f94

Among other things, I have included Py3.10 in the supported versions, but dropped support for Py3.7 - support for which is phasing out anyway.

As you can see, there is still one issue remaining that affects two tests. These two tests also pass on my local machine (I suppose the same issue you faced), which -together with the error message- makes me think that it's a permission problem. Maybe a file is attempted to be written in a location that we do have write access on our local machine, but wouldn't have write access on the remote machine. Or a file is created in a temporary directory that is garbage collected on the remote between tests, but not locally.

From the error stack, I would have a look here first:

htsinfer/htsinfer/mapping.py

Lines 79 to 142 in b5bd469

    
               def subset_transcripts_by_organism(self) -> Path: 
        
                   """Filter FASTA file of transcripts by current sources. 
        
                   The filtered file contains records from the indicated sources. 
        
                       Typically, this is one source. However, for if two input files 
        
                       were supplied that are originating from different sources (i.e., 
        
                       not from a valid paired-ended library), it may be from two 
        
                       different sources. If no source is supplied (because it could 
        
                       not be inferred), no filtering is done. 
        
                   Returns: 
        
                       Path to filtered FASTA file. 
        
                   Raises: 
        
                       FileProblem: Could not open input/output FASTA file for 
        
                           reading/writing. 
        
                   """ 
        
                   LOGGER.debug(f"Subsetting transcripts for: {self.library_source}") 
        
                   outfile = self.tmp_dir / "transcripts_subset.fasta" 
        
                   def yield_filtered_seqs(): 
        
                       """Generator yielding sequence records for specified sources. 
        
                       Yields: 
        
                           Next FASTA sequence record of the specified sources. 
        
                       Raises: Could not process input FASTA file. 
        
                       """ 
        
                       sources = [] 
        
                       if self.library_source.file_1.short_name is not None: 
        
                           sources.append(self.library_source.file_1.short_name) 
        
                       if self.library_source.file_2.short_name is not None: 
        
                           sources.append(self.library_source.file_2.short_name) 
        
                       try: 
        
                           for record in SeqIO.parse( 
        
                                   handle=self.transcripts_file, 
        
                                   format='fasta', 
        
                           ): 
        
                               try: 
        
                                   org_name = record.description.split("|")[3] 
        
                               except (ValueError, IndexError): 
        
                                   continue 
        
                               if org_name in sources or len(sources) == 0: 
        
                                   yield record 
        
                       except OSError as exc: 
        
                           raise FileProblem( 
        
                               f"Could not process file '{self.transcripts_file}'" 
        
                           ) from exc 
        
                   try: 
        
                       SeqIO.write( 
        
                           sequences=yield_filtered_seqs(), 
        
                           handle=outfile, 
        
                           format='fasta', 
        
                       ) 
        
                   except OSError as exc: 
        
                       raise FileProblem( 
        
                           f"Failed to write to FASTA file '{outfile}'" 
        
                       ) from exc 
        
                   LOGGER.debug(f"Filtered transcripts file: {outfile}") 
        
                   return outfile

This method also had a line that created a file name based off an object (which I first thought was the problem). I have now given the transcripts subset file a hardcoded name (not sure if that could cause problems in some case?). And still this stringified library source object instance is used in the log message, so this isn't handled well. Seems to me that this function needs guarding against edge cases, e.g., when there is no library source determined/available).

As I think you know that particular code better from your changes, I would like to pass it back to you if possible. With some debug messages and maybe additional try-except blocks around the code reading and writing that subset transcripts file, I'm sure you will find out what the problem is on the remote.

codecov · 2023-09-13T11:30:30Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.19% 🎉

Comparison is base (d243b44) 99.80% compared to head (a5bee43) 100.00%.

Additional details and impacted files

@@             Coverage Diff             @@
##              dev      #131      +/-   ##
===========================================
+ Coverage   99.80%   100.00%   +0.19%     
===========================================
  Files          12        13       +1     
  Lines        1047      1078      +31     
===========================================
+ Hits         1045      1078      +33     
+ Misses          2         0       -2

Files Changed	Coverage Δ
htsinfer/cli.py	`100.00% <100.00%> (ø)`
htsinfer/get_library_type.py	`100.00% <100.00%> (ø)`
htsinfer/get_read_orientation.py	`100.00% <100.00%> (+0.86%)`	⬆️
htsinfer/htsinfer.py	`100.00% <100.00%> (ø)`
htsinfer/mapping.py	`100.00% <100.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

BorisYourich · 2023-09-13T11:58:58Z

Playing around with the tests I realized where the problem is, the tests didn't have the tmpdir argument. Everything works now :)

uniqueg · 2023-09-13T14:38:06Z

Awesome, thanks a lot, @BorisYourich!

@balajtimate: Can you please test the feature branch to see if the proposed changes break anything on real world data? Maybe you can run it on a few samples and see if you get the same results as you did previously.

If all is good, I will merge.

balajtimate · 2023-09-14T08:09:54Z

Okay, so I tested on a subset of 20 samples from our sample data, random organisms, SE/PE and there was no error, code ran perfectly, and the results match the ones from the run with the dev branch. 😄

I think you can merge now, if the 3.7 integration and unit tests run.

uniqueg · 2023-09-14T08:14:28Z

Thanks! Merging this now, given that it doesn't break anything. We can still do in-depth testing later to see if it actually works as expected. Thanks a lot @BorisYourich

Co-authored-by: Boris Jurič <499542@mail.muni.cz> Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>

* feat: add org param * refactor: avoid duplicate mappings (#131) Co-authored-by: Boris Jurič <499542@mail.muni.cz> Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch> * fix typo, update pylint config * feat: add org_id param #108 * refactor: get_library_source.py #108 * test: add org param tests #108 * fix: update Pydantic version (#146) * fix pydantic issues * fix: update pydantic version in envs * fix: pin sphinx-rtd-theme into env * fix: update readthedocs config * update readme, gitignore * feat: infer org source if id not in dict #108 * replace json with model_dump * feat: add org_id param #108 * feat: add org_id param #108 * refactor: replace org with tax-id * refactor get_library_source * refactor get_library_source tests * refactor: update models.py * refactor: fix typos --------- Co-authored-by: Boris Jurič <74237898+BorisYourich@users.noreply.github.com> Co-authored-by: Boris Jurič <499542@mail.muni.cz> Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>

Boris Jurič and others added 10 commits May 24, 2023 12:12

feat: infer lib type if seq_ids don't match

2d04c2e

feat: infer lib type if seq_ids don't match

6326f97

Added descriptions, new parameters, updated README

f3e8c13

Merge branch 'dev' into issue_85

20ef424

Fixed TypeError when unmapped reads are present

1bdf91a

Merge branch 'issue_85' of github.com:zavolanlab/htsinfer into issue_85

bcccbfe

remote had merge changes from dev into issue_85

Removed Parameters from README.md

2a4b122

Added mapping class, get read orientation does not work for merged SA…

d975f8a

…M files

Working solution

00b0c2a

Removed unused imports

e691fa1

BorisYourich requested a review from uniqueg July 18, 2023 09:52

uniqueg requested changes Jul 18, 2023

View reviewed changes

Fixed minor issues and tests

0a23282

uniqueg added 8 commits September 11, 2023 12:34

fix merge conflicts

899640e

update deps

b1fe756

fix channel order

3c5cc6f

update CI workflow

cfa8f0a

drop support for python 3.7

58fafee

channel order

8ddb254

rename transcripts subset file

f0f8fad

remove file

b5bd469

Boris Jurič added 4 commits September 13, 2023 13:03

Test with source added

77acd75

Test with tmpdir

73ed136

Test with tmpdir

bd9d67e

Test with tmpdir in MAPPING

74c1387

Working solution with tests and fixed tmp problem.

a5bee43

BorisYourich requested a review from uniqueg September 13, 2023 11:59

uniqueg changed the title ~~Issue 130~~ refactor: avoid duplicate mappings Sep 14, 2023

uniqueg merged commit 67bb506 into dev Sep 14, 2023
18 checks passed

uniqueg deleted the issue_130 branch September 14, 2023 08:23

balajtimate pushed a commit that referenced this pull request Nov 9, 2023

refactor: avoid duplicate mappings (#131)

dc8af23

Co-authored-by: Boris Jurič <499542@mail.muni.cz> Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>

balajtimate mentioned this pull request Nov 15, 2023

bug: STAR using uncompressed sample files when mate relationship is not #149

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: avoid duplicate mappings #131

refactor: avoid duplicate mappings #131

BorisYourich commented Jul 18, 2023

uniqueg left a comment •

edited

Loading

uniqueg Jul 18, 2023

uniqueg Jul 18, 2023

uniqueg Jul 18, 2023

BorisYourich Jul 25, 2023

uniqueg commented Aug 10, 2023

BorisYourich commented Aug 22, 2023

uniqueg commented Aug 22, 2023 •

edited

Loading

BorisYourich commented Sep 8, 2023

uniqueg commented Sep 11, 2023

codecov bot commented Sep 13, 2023 •

edited

Loading

BorisYourich commented Sep 13, 2023

uniqueg commented Sep 13, 2023

balajtimate commented Sep 14, 2023

uniqueg commented Sep 14, 2023

refactor: avoid duplicate mappings #131

refactor: avoid duplicate mappings #131

Conversation

BorisYourich commented Jul 18, 2023

Description

Type of change

Checklist

uniqueg left a comment • edited Loading

Choose a reason for hiding this comment

uniqueg Jul 18, 2023

Choose a reason for hiding this comment

uniqueg Jul 18, 2023

Choose a reason for hiding this comment

uniqueg Jul 18, 2023

Choose a reason for hiding this comment

BorisYourich Jul 25, 2023

Choose a reason for hiding this comment

uniqueg commented Aug 10, 2023

BorisYourich commented Aug 22, 2023

uniqueg commented Aug 22, 2023 • edited Loading

BorisYourich commented Sep 8, 2023

uniqueg commented Sep 11, 2023

codecov bot commented Sep 13, 2023 • edited Loading

Codecov Report

BorisYourich commented Sep 13, 2023

uniqueg commented Sep 13, 2023

balajtimate commented Sep 14, 2023

uniqueg commented Sep 14, 2023

uniqueg left a comment •

edited

Loading

uniqueg commented Aug 22, 2023 •

edited

Loading

codecov bot commented Sep 13, 2023 •

edited

Loading