Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement and refactoring #6

Open
1 of 5 tasks
mboissel opened this issue Jul 21, 2022 · 3 comments · May be fixed by #8
Open
1 of 5 tasks

Enhancement and refactoring #6

mboissel opened this issue Jul 21, 2022 · 3 comments · May be fixed by #8
Assignees
Labels
enhancement New feature or request

Comments

@mboissel
Copy link
Member

mboissel commented Jul 21, 2022

After using the function R/estimate_ethnicity.R (on exom sequencing data)
We have discussed about several enhancement to developed :

  • Remove vcftools dependency(only keep bcftools) #9
    (only keep bcftools)
  • Add the option like "--missing-to-ref" for the step that merge the input VCFs. So this will make the possibility to assume genotypes at missing sites are 0/0. The merged vcf will hold much more variants filled as ref, instead of just variants that are all non ref in input population.
  • Check whether it will be interesting to fill all the WT variants met in the reference panel (let's say 1kg) by completing the input vcf with all common WT variants ever seen in our sequencing data (by reading all cover files or the capture may be).
  • Check LD clumping
  • Add option for point size in ethnicity graph #7 (check alternative methods to draw a lot of points with ggpointdensity or scattermore)
@mboissel mboissel added the enhancement New feature or request label Jul 21, 2022
@mboissel mboissel self-assigned this Jul 21, 2022
@mcanouil
Copy link
Contributor

Note: PLINK is used to convert from VCF to BED/BIM/FAM for flashpcaR to use directly.

@mcanouil
Copy link
Contributor

Notes:

  • VCFtools (and awk) to be replaced by BCFtools call:
    • rain/R/format_vcf.R

      Lines 41 to 48 in dbfa02c

      bin_path[["vcftools"]],
      "--gzvcf", ref1kg_vcfs,
      "--maf", ref1kg_maf,
      "--recode",
      "--stdout",
      "|", bin_path[["bgzip"]], "-c >", output_ref,
      "&&",
      bin_path[["tabix"]], "-p vcf", output_ref
    • rain/R/format_vcf.R

      Lines 72 to 79 in dbfa02c

      paste(bin_path[["vcftools"]],
      "--gzvcf", input_vcfs,
      "--get-INFO", quality_tag,
      "--out", gsub("filtered_", "excluded_", output_study_temp),
      "&&",
      'awk \'{if($5<', quality_threshold, ') print $1"\t"$2}\'',
      paste0(gsub("filtered_", "excluded_", output_study_temp), ".INFO"),
      ">", paste0(gsub("filtered_", "excluded_", output_study_temp), ".exclude"),
    • rain/R/format_vcf.R

      Lines 83 to 95 in dbfa02c

      bin_path[["vcftools"]],
      "--gzvcf", input_vcfs,
      if (!is.null(quality_tag)) {
      paste0(
      "--exclude-positions ", gsub("filtered_", "excluded_", output_study_temp), ".exclude"
      )
      },
      "--remove-indels",
      "--remove-filtered-all",
      "--max-missing-count 1",
      "--recode",
      "--stdout",
      "|", bin_path[["bgzip"]], "-c >", output_study_temp,
  • I think we should embedded PLINK within the package to avoid having to download a binary. This would make the dependency contained.

@mcanouil mcanouil changed the title feature: enhancement to dev in estimate_ethnicity function Enhancement and refactoring Jul 21, 2022
@mboissel mboissel linked a pull request Jul 22, 2022 that will close this issue
@mcanouil
Copy link
Contributor

mcanouil commented Jul 22, 2022

Note: PLINK is used to convert from VCF to BED/BIM/FAM for flashpcaR to use directly.

In fact, we might want and be able to remove PLINK completely with something like the following:

geno_mat <- unique(data.table::setnames(
  x = data.table::fread(
    cmd = paste(
      bcftools, "query",
      "--format", "'%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'",
      "--print-header",
      file.path(output_directory, "all-samples-1kg.vcf.gz")
    ),
    sep = "\t"
  ),
  old = function(x) sub("^#", "", sub(":GT", "", sub(" *\\[[[:digit:]]+\\]", "", x)))
))
sample_columns <- setdiff(names(geno_mat), c("CHROM", "POS", "REF", "ALT"))
geno_mat <- geno_mat[
  j = (sample_columns) := lapply(
    X = .SD,
    FUN = function(x) c("00" = 0L, "10" = 1L, "01" = 1L, "11" = 2L)[sub("/|\\|", "", x)]
  ),
  .SDcols = sample_columns
][order(CHROM, POS)]

pca_res <- flashpcaR::flashpca(
  X = geno_mat[j = .SD, .SDcols = !c("CHROM", "POS", "REF", "ALT")],
  ndim = n_comp
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
2 participants