Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regions has no effect when streaming BCF #248

Open
cmdoret opened this issue May 29, 2024 · 2 comments
Open

regions has no effect when streaming BCF #248

cmdoret opened this issue May 29, 2024 · 2 comments

Comments

@cmdoret
Copy link

cmdoret commented May 29, 2024

Hello and thanks for providing this excellent implementation of htsget!
We are trying to use it together with a minio S3 storage and so far it worked well, however we noticed that when requesting a specific region, the /variants endpoint returned all records regardless. I believe this is a bug in htsget-rs based on the server logs, but I may also have mis-interpreted them or mis-used htsget-rs. Do you have any advice or suggestion on where the issue might be?

Environment:

Steps to reproduce:

  1. Download example file https://github.com/vcflib/vcflib/blob/master/samples/sample.vcf
  2. Process file into bcf:
bgzip sample.vcf
bcftools index sample.vcf.gz 
bcftools convert -Ob -o abc.bcf sample.vcf.gz
bcftools index abc.bcf
bcftools index -s abc.bcf.csi
19      .       2
20      .       6
X       .       1
  1. Upload bcf + csi index into s3 bucket
  2. Send a GET request to htsget for contig "19"
  3. Send a GET request to htsget for contig "20"

Observed behaviour: Both queries returned all variant records from all contigs.
Expected behaviour: Only variants of the requested chromosome are returned.

Observations:

Log from the contig 19 request shows that the query was parsed properly, and that segments (10,16) were requested.

2024-05-29T11:42:02.765895Z  INFO HTTP request{http.method=GET http.route=/variants/{id:.+} http.flavor=1.0 http.scheme=http http.host=htsget:8080 http.client_ip=172.23.0.5
http.user_agent=python-requests/2.31.0 http.target=/variants/ex/abc?referenceName=19&start=1000&end=5000&format=BCF otel.name=HTTP GET /variants/{id:.+} otel.kind="server" request_id=a9d7330
0-be43-4965-b972-ebfbef74041c}:variants{request=Query({"end": "5000", "referenceName": "19", "format": "BCF", "start": "1000"}) path=Path("ex/abc") http_request=
HttpRequest HTTP/1.0 GET:/variants/ex/abc
  query: ?"referenceName=19&start=1000&end=5000&format=BCF"
  params: Path { path: Url { uri: /variants/ex/abc?referenceName=19&start=1000&end=5000&format=BCF, path: None }, skip: 16, segments: [("id", Segment(10, 16))] }
  headers:
    "host": "htsget:8080"
    "connection": "close"
    "accept": "*/*"
    "accept-encoding": "gzip, deflate, br"
    "user-agent": "python-requests/2.31.0"
}: htsget_actix::handlers::get: variants endpoint GET request request=Request { path: "ex/abc", query: {"end": "5000", "referenceName": "19", "format": "BCF", "start": "1000
"}, headers: {"host": "htsget:8080", "connection": "close", "accept": "*/*", "accept-encoding": "gzip, deflate, br", "user-agent": "python-requests/2.31.0"} }

Logs from the contig 20 query show that the same segments (10,16) were requested, although the query is different. I am not sure whether I interpret this properly.

2024-05-29T11:56:32.734937Z  INFO HTTP request{http.method=GET http.route=/variants/{id:.+} http.flavor=1.0 http.scheme=http http.host=htsget:8080 http.client_ip=172.23.0.5
http.user_agent=python-requests/2.31.0 http.target=/variants/ex/abc?referenceName=20&start=1000&end=5000&format=BCF otel.name=HTTP GET /variants/{id:.+} otel.kind="server" request_id=baa2d58
8-3098-4ff5-acc7-decb9490403b}:variants{request=Query({"referenceName": "20", "start": "1000", "format": "BCF", "end": "5000"}) path=Path("ex/abc") http_request=
HttpRequest HTTP/1.0 GET:/variants/ex/abc
  query: ?"referenceName=20&start=1000&end=5000&format=BCF"
   params: Path { path: Url { uri: /variants/ex/abc?referenceName=20&start=1000&end=5000&format=BCF, path: None }, skip: 16, segments: [("id", Segment(10, 16))] }
  headers:
    "host": "htsget:8080"
    "user-agent": "python-requests/2.31.0"
    "accept": "*/*"
    "accept-encoding": "gzip, deflate, br"
    "connection": "close"
}: htsget_actix::handlers::get: variants endpoint GET request request=Request { path: "ex/abc", query: {"referenceName": "20", "start": "1000", "format": "BCF", "end": "5000
"}, headers: {"host": "htsget:8080", "user-agent": "python-requests/2.31.0", "accept": "*/*", "accept-encoding": "gzip, deflate, br", "connection": "close"} }
@mmalenic
Copy link
Member

mmalenic commented May 31, 2024

Hi @cmdoret, thanks for the detailed issue.

I had a look into this, and I believe that it is expected behaviour. This is because the sample file is small, and when compressing it using BGZF, only one block is produced. Since htsget must return a valid file (i.e. byte data cannot be cut across BGZF blocks), any request with that file will always result in the full file being returned, as there is only one BGZF block to return.

This can be confirmed by creating a GZI index for the file, and reading its contents:

bgzip -r abc.bcf
hexdump -C abc.bcf.gzi

Which shows 0 entries, indicating there is only one block:

00000000  00 00 00 00 00 00 00 00                           |........|
00000008

Unfortunately, I don't think there's a nicer way to read GZI files using command line tools, although you can use noodles:

fn main() {
  let index = noodles::bgzf::gzi::read("abc.bcf.gzi").unwrap();
  println!("{}", index.len());
}

Which outputs:

1

If you try a different file with more BGZF blocks (e.g. the htsnexus_test_NA12878.bam.gzi file), it should show more blocks:

fn main() {
  let index = noodles::bgzf::gzi::read("data/bam/htsnexus_test_NA12878.bam.gzi").unwrap();
  println!("{}", index.len());
}

Which outputs:

164

In effect, this number determines the maximum amount of "slices" that can be returned by htsget for a given file.

Note, that the segments in the logs refer to the HTTP request itself, i.e. the number of segments in the path, and don't mean anything in terms of the underlying BCF file. Apologies, the logs are currently a bit noisy, which I'm aiming to improve (see #250).

I think that this issue is related to #238. This feature is currently not supported, but there are plans to implement a mode that allows the htsget server to edit out data that was not requested.

Is this a feature that you would be interested in?

@cmdoret
Copy link
Author

cmdoret commented Jun 3, 2024

Thanks for your explanation, @mmalenic. Indeed, this does not seem to happen on larger files (feel free to close the issue).

I understand and that makes sense, in that case we will filter on client side. As we are not yet dealing with crypt4gh, this is not a priority for us right now. But we would definitely be interested in the future, and I think this would indeed make htsget-rs very interesting for health-related projects.

I'd love to contribute, but will need to step up my rust knowledge by then 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants