Support reading BED files with less than 12 column. #144

ghuls · 2024-06-14T13:19:43Z

It would be nice if BED files with fewer than 12 columns could be read.

For example, BEDReadOptions could let you specify how many of the BED columns follow the spec.
Additional columns could be read as String columns.

Similarly to UCSC bigBed: BED3 or -type=bedN[+[P]], where N is an integer between 3 and 12 and the optional +[P] parameter specifies the number of extra fields (not required, but preferred):
http://genome.ucsc.edu/goldenPath/help/bigBed.html
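
To make the `bedN[+[P]]` notation concrete, here is a minimal sketch of a parser for such a type string. The names `BedType` and `parse_bed_type` are hypothetical and exist only for illustration; they are not part of biobear or the Kent tools.

```python
import re
from typing import NamedTuple, Optional


class BedType(NamedTuple):
    """Parsed form of a UCSC-style "bedN[+[P]]" type string (hypothetical helper)."""
    n_standard: int          # N: leading columns that follow the BED spec
    has_extra: bool          # True when a "+" suffix is present
    n_extra: Optional[int]   # P: number of extra columns, if stated


# "bed", then N, then an optional "+" that may be followed by P.
_BED_TYPE = re.compile(r"^bed(\d+)(\+(\d+)?)?$")


def parse_bed_type(spec: str) -> BedType:
    match = _BED_TYPE.match(spec.strip().lower())
    if match is None:
        raise ValueError(f"not a bedN[+[P]] type string: {spec!r}")
    n_standard = int(match.group(1))
    if not 3 <= n_standard <= 12:
        raise ValueError(f"N must be between 3 and 12, got {n_standard}")
    has_extra = match.group(2) is not None
    n_extra = int(match.group(3)) if match.group(3) is not None else None
    return BedType(n_standard, has_extra, n_extra)
```

With this sketch, `parse_bed_type("bed6+3")` describes six standard columns followed by three extra ones, while `parse_bed_type("bed6+")` records that extra columns exist without stating how many.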


tshauck · 2024-06-14T16:40:52Z

Thank you for raising this, let me have a look today and I'll follow up. At first blush it makes sense to me.

Also, FWIW, I was thinking about adding specific bigbed support along with indexed search (I see your link is bigbed related not just bed). Is that something you would/could use?

tshauck · 2024-06-15T04:58:09Z

@ghuls I think this should be doable now if you update biobear. BEDReadOptions now takes an n_fields param... e.g...

In [5]: session.read_bed_file('./test-three.bed', options=bb.BEDReadOptions(n_fields=3)).to_polars()
Out[5]: 
shape: (10, 3)
┌─────────────────────────┬───────┬───────┐
│ reference_sequence_name ┆ start ┆ end   │
│ ---                     ┆ ---   ┆ ---   │
│ str                     ┆ i64   ┆ i64   │
╞═════════════════════════╪═══════╪═══════╡
│ chr1                    ┆ 11874 ┆ 12227 │
│ chr1                    ┆ 12613 ┆ 12721 │
│ chr1                    ┆ 13221 ┆ 14409 │
│ chr1                    ┆ 14362 ┆ 14829 │
│ chr1                    ┆ 14970 ┆ 15038 │
│ chr1                    ┆ 15796 ┆ 15947 │
│ chr1                    ┆ 16607 ┆ 16765 │
│ chr1                    ┆ 16858 ┆ 17055 │
│ chr1                    ┆ 17233 ┆ 17368 │
│ chr1                    ┆ 17606 ┆ 17742 │
└─────────────────────────┴───────┴───────┘
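
For readers who want to see the idea behind an `n_fields`-style option without installing anything, the following is a rough standard-library sketch, not biobear's implementation. It types the first `n_fields` columns according to the UCSC BED field list and keeps any remaining columns as strings, as the original request suggested. The function name `parse_bed_line` and the `extra_<i>` keys are assumptions made for this example, and the field names are the UCSC ones rather than biobear's column names.

```python
from typing import Dict, Union

# The 12 standard BED fields (UCSC names) with the Python type used for each.
BED_FIELDS = [
    ("chrom", str), ("chromStart", int), ("chromEnd", int), ("name", str),
    ("score", int), ("strand", str), ("thickStart", int), ("thickEnd", int),
    ("itemRgb", str), ("blockCount", int), ("blockSizes", str),
    ("blockStarts", str),
]


def parse_bed_line(line: str, n_fields: int) -> Dict[str, Union[str, int]]:
    """Parse one tab-separated BED line (hypothetical helper).

    The first n_fields columns are cast per the BED spec; anything after
    them is stored untouched as a string under extra_1, extra_2, ...
    """
    if not 3 <= n_fields <= 12:
        raise ValueError(f"n_fields must be between 3 and 12, got {n_fields}")
    cols = line.rstrip("\n").split("\t")
    if len(cols) < n_fields:
        raise ValueError(f"expected at least {n_fields} columns, got {len(cols)}")
    record: Dict[str, Union[str, int]] = {}
    for (name, cast), raw in zip(BED_FIELDS[:n_fields], cols):
        record[name] = cast(raw)
    for i, raw in enumerate(cols[n_fields:], start=1):
        record[f"extra_{i}"] = raw
    return record
```

How biobear itself treats columns beyond `n_fields` is not stated in this thread, so the string fallback here is only one possible design.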

Technically, things shouldn't fail anymore if you don't specify the number of fields and the BED file has fewer than the full complement of fields; it just fills the additional columns with null.

In [7]: session.read_bed_file('./test-three.bed').to_polars()
Out[7]: 
shape: (10, 12)
┌─────────────────────────┬───────┬───────┬──────┬───┬───────┬─────────────┬─────────────┬──────────────┐
│ reference_sequence_name ┆ start ┆ end   ┆ name ┆ … ┆ color ┆ block_count ┆ block_sizes ┆ block_starts │
│ ---                     ┆ ---   ┆ ---   ┆ ---  ┆   ┆ ---   ┆ ---         ┆ ---         ┆ ---          │
│ str                     ┆ i64   ┆ i64   ┆ str  ┆   ┆ str   ┆ i64         ┆ str         ┆ str          │
╞═════════════════════════╪═══════╪═══════╪══════╪═══╪═══════╪═════════════╪═════════════╪══════════════╡
│ chr1                    ┆ 11874 ┆ 12227 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 12613 ┆ 12721 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 13221 ┆ 14409 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 14362 ┆ 14829 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 14970 ┆ 15038 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 15796 ┆ 15947 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 16607 ┆ 16765 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 16858 ┆ 17055 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 17233 ┆ 17368 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
│ chr1                    ┆ 17606 ┆ 17742 ┆ null ┆ … ┆ null  ┆ null        ┆ null        ┆ null         │
└─────────────────────────┴───────┴───────┴──────┴───┴───────┴─────────────┴─────────────┴──────────────┘
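
The null-filling behavior shown above can be pictured with a small standard-library sketch. This is an illustration of the idea only, under the assumption that a short row is simply right-padded to 12 columns; `pad_bed_columns` is a hypothetical name and not a biobear function.

```python
from typing import List, Optional

N_BED_COLUMNS = 12  # full complement of standard BED fields


def pad_bed_columns(line: str, width: int = N_BED_COLUMNS) -> List[Optional[str]]:
    """Split a tab-separated BED line and right-pad it with None (hypothetical helper)."""
    cols: List[Optional[str]] = list(line.rstrip("\n").split("\t"))
    if len(cols) > width:
        raise ValueError(f"expected at most {width} columns, got {len(cols)}")
    # Missing trailing columns become None, mirroring the null cells above.
    return cols + [None] * (width - len(cols))
```

Applied to a three-column line such as the first row of `test-three.bed`, the result keeps the three values and holds `None` in the remaining nine positions.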

I'm gonna close this task, but please reopen if it remains an issue. Thanks!

ghuls · 2024-06-17T07:44:46Z

> Thank you for raising this, let me have a look today and I'll follow up. At first blush it makes sense to me.
>
> Also, FWIW, I was thinking about adding specific bigbed support along with indexed search (I see your link is bigbed related not just bed). Is that something you would/could use?

I didn't find the bedN[+[P]] documentation for just the BED format on UCSC. So far we only use bigBed when we create UCSC sessions. The main problem at the moment with bigBed is that you can only create/manipulate them with Kent tools and not many other tools support it (pyBigWig and pybigtools). Support for it in biobear could change this.

tshauck · 2024-06-17T14:41:42Z

Cool, thanks for the context.

tshauck added the enhancement New feature or request label Jun 14, 2024

tshauck closed this as completed Jun 15, 2024

tshauck mentioned this issue Jun 15, 2024

Update user docs for new BED options #147

Closed

tshauck mentioned this issue Jun 24, 2024

Support BigBED file wheretrue/exon#549

Open
