Support reading BED files with less than 12 column. #144

Closed
ghuls opened this issue Jun 14, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@ghuls
Contributor

ghuls commented Jun 14, 2024

It would be nice if BED files with fewer than 12 columns could be read.

For example, BEDReadOptions could let you specify how many of the BED columns follow the spec.
Additional columns could be read as String columns.

Similarly to UCSC bigBed: BED3 or -type=bedN[+[P]], where N is an integer between 3 and 12 and the optional +[P] parameter specifies the number of extra fields, not required, but preferred
http://genome.ucsc.edu/goldenPath/help/bigBed.html
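To make the bedN[+[P]] notation concrete, here is a rough Python sketch that splits such a type string into its parts. The names BedType and parse_bed_type are hypothetical and are not part of biobear or the Kent tools; the only rules encoded are the ones quoted above (N between 3 and 12, optional + with an optional count P).

```python
import re
from typing import NamedTuple, Optional


class BedType(NamedTuple):
    """Hypothetical container for a parsed bedN[+[P]] type string."""
    n_standard: int         # N: number of columns that follow the BED spec
    has_extra: bool         # True when a '+' suffix is present
    n_extra: Optional[int]  # P: number of extra fields, None when not given


# "bed", then N, then an optional "+" with an optional P.
_BED_TYPE = re.compile(r"^bed(\d+)(?:(\+)(\d+)?)?$", re.IGNORECASE)


def parse_bed_type(spec: str) -> BedType:
    """Parse strings such as 'bed3', 'bed6+' or 'bed6+4' (illustrative only)."""
    match = _BED_TYPE.match(spec.strip())
    if match is None:
        raise ValueError(f"not a bedN[+[P]] type string: {spec!r}")
    n_standard = int(match.group(1))
    if not 3 <= n_standard <= 12:
        raise ValueError(f"N must be between 3 and 12, got {n_standard}")
    has_extra = match.group(2) is not None
    n_extra = int(match.group(3)) if match.group(3) else None
    return BedType(n_standard, has_extra, n_extra)


print(parse_bed_type("bed6+4"))
```

In a reader, n_standard would pick how many columns get the typed BED schema, and any extra fields would fall back to String columns as suggested above.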

@tshauck
Member

tshauck commented Jun 14, 2024

Thank you for raising this, let me have a look today and I'll follow up. At first blush it makes sense to me.

Also, FWIW, I was thinking about adding specific bigbed support along with indexed search (I see your link is bigbed related not just bed). Is that something you would/could use?

tshauck added the enhancement label Jun 14, 2024
@tshauck
Member

tshauck commented Jun 15, 2024

@ghuls I think this should be doable now if you update biobear. BEDReadOptions now takes an n_fields param, e.g.:

In [5]: session.read_bed_file('./test-three.bed', options=bb.BEDReadOptions(n_fields=3)).to_polars()
Out[5]: 
shape: (10, 3)
┌─────────────────────────┬───────┬───────┐
│ reference_sequence_name ┆ start ┆ end   │
│ ---                     ┆ ---   ┆ ---   │
│ str                     ┆ i64   ┆ i64   │
╞═════════════════════════╪═══════╪═══════╡
│ chr1                    ┆ 11874 ┆ 12227 │
│ chr1                    ┆ 12613 ┆ 12721 │
│ chr1                    ┆ 13221 ┆ 14409 │
│ chr1                    ┆ 14362 ┆ 14829 │
│ chr1                    ┆ 14970 ┆ 15038 │
│ chr1                    ┆ 15796 ┆ 15947 │
│ chr1                    ┆ 16607 ┆ 16765 │
│ chr1                    ┆ 16858 ┆ 17055 │
│ chr1                    ┆ 17233 ┆ 17368 │
│ chr1                    ┆ 17606 ┆ 17742 │
└─────────────────────────┴───────┴───────┘

Technically things shouldn't fail anymore if you don't specify the number of fields and the BED file has fewer than the full complement of fields; it just fills the additional cols with null.

In [7]: session.read_bed_file('./test-three.bed').to_polars()
Out[7]: 
shape: (10, 12)
┌─────────────────────────┬───────┬───────┬──────┬───┬───────┬─────────────┬─────────────┬──────────────┐
│ reference_sequence_name ┆ start ┆ end   ┆ name ┆ … ┆ color ┆ block_count ┆ block_sizes ┆ block_starts │
│ ---                     ┆ ---   ┆ ---   ┆ ---  ┆   ┆ ---   ┆ ---         ┆ ---         ┆ ---          │
│ str                     ┆ i64   ┆ i64   ┆ str  ┆   ┆ str   ┆ i64         ┆ str         ┆ str          │
╞═════════════════════════╪═══════╪═══════╪══════╪═══╪═══════╪═════════════╪═════════════╪══════════════╡
│ chr1                    ┆ 11874 ┆ 12227 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 12613 ┆ 12721 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 13221 ┆ 14409 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 14362 ┆ 14829 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 14970 ┆ 15038 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 15796 ┆ 15947 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 16607 ┆ 16765 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 16858 ┆ 17055 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 17233 ┆ 17368 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 17606 ┆ 17742 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
└─────────────────────────┴───────┴───────┴──────┴───┴───────┴─────────────┴─────────────┴──────────────┘
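If the null-filled columns are unwanted, one option is to count the fields on the first data line and pass that number as n_fields. The following is only a sketch using the standard library: detect_bed_field_count is a hypothetical helper, not a biobear function, and it assumes every data line has the same number of fields as the first one.

```python
import io
from typing import Iterable


def detect_bed_field_count(lines: Iterable[str]) -> int:
    """Return the number of fields on the first data line (hypothetical helper)."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped.split(None, 1)[0] in ("browser", "track"):
            continue  # skip UCSC browser/track header lines
        # Prefer tabs when present; otherwise fall back to any whitespace.
        fields = stripped.split("\t") if "\t" in stripped else stripped.split()
        return len(fields)
    raise ValueError("no data lines found")


# In-memory stand-in for a three-column file such as ./test-three.bed.
example = io.StringIO(
    "track name=example\n"
    "chr1\t11874\t12227\n"
    "chr1\t12613\t12721\n"
)
n_fields = detect_bed_field_count(example)
print(n_fields)
```

The result could then be handed to the reader as in the earlier call, e.g. bb.BEDReadOptions(n_fields=n_fields), so only the columns actually present are returned.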

I'm gonna close this task, but please reopen if it remains an issue. Thanks!

@ghuls
Contributor Author

ghuls commented Jun 17, 2024

> Thank you for raising this, let me have a look today and I'll follow up. At first blush it makes sense to me.
>
> Also, FWIW, I was thinking about adding specific bigbed support along with indexed search (I see your link is bigbed related not just bed). Is that something you would/could use?

I didn't find the bedN[+[P]] documentation for just the BED format on UCSC. So far we only use bigBed when we create UCSC sessions. The main problem at the moment with bigBed is that you can only create/manipulate them with Kent tools and not many other tools support it (pyBigWig and pybigtools). Support for it in biobear could change this.

@tshauck
Member

tshauck commented Jun 17, 2024

Cool, thanks for the context.
