Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gene_Centric_Coding function fails due to character-formatted annotation data #36

Closed
kwdoyle opened this issue Oct 23, 2023 · 6 comments

Comments

@kwdoyle
Copy link
Collaborator

kwdoyle commented Oct 23, 2023

Hello,

I've been working my way through each step in the pipeline and have encountered an error when running STAARpipeline_Gene_Centric_Coding.r

The error occurs in the internal coding function within the Gene_Centric_Coding function. When the current annotation name is "aPC.LocalDiversity", the script attempts to perform a transformation of the data:

if (Annotation_name[k] == "aPC.LocalDiversity") {
                    Annotation.PHRED.2 <- -10 * log10(1 - 10^(-Annotation.PHRED/10))
                    Annotation.PHRED <- cbind(Annotation.PHRED,
                      Annotation.PHRED.2)
                    Anno.Int.PHRED.sub.name <- c(Anno.Int.PHRED.sub.name,
                      paste0(Annotation_name[k], "(-)"))
                  }

However, the data in Annotation.PHRED is of the character class, causing this error:

Error in -Annotation.PHRED : invalid argument to unary operator
Calls: Gene_Centric_Coding -> coding

aPC.LocalDiversity is not the only character annotation. Of those picked from the name catalog, the only other character values are 3 other annotation PCs:

/cadd_phred  # numeric
/linsight  # numeric
/fathmm_xf  # numeric
/apc_epigenetics_active  # character
/apc_epigenetics_repressed  # character
/apc_epigenetics_transcription  # numeric
/apc_conservation  # numeric
/apc_local_nucleotide_diversity  # character (causing this error)
/apc_mappability  # character
/apc_transcription_factor  # numeric
/apc_protein_function  # numeric

Would anyone know why these values would be written to (or read from) the GDS file as characters?

@xihaoli
Copy link
Owner

xihaoli commented Oct 23, 2023

Hi @kwdoyle,

Thank you for letting me know. Could I ask what FAVOR database files did you use to annotate your genotype data?

Best,
Xihao

@kwdoyle
Copy link
Collaborator Author

kwdoyle commented Oct 23, 2023

I'm using the full database of 160 annotations hosted on Harvard Dataverse. More specifically, just the database file for chromosome 5 for now.

Interestingly enough, the Annotate.R script seems to work correctly. If I read in the saved, merged annotation file "Anno_chr5_STAARpipeline.csv" with data.table::fread, all of the data indicated in my previous post as being character-formatted are read in as numeric.

So presumably the issue occurs somewhere in the gds2agds.R script?

It also might be of interest that I chose to annotate the gds using all 160 annotations (rather, I set my 'anno_colnum' variable as c(1, 8:190). I chose the same initial annotations as the default (1 and 8) and then selected all variables from 8 to the end. I'm not sure if this is relevant, however, as the file from Annotate.R seems fine

@kwdoyle
Copy link
Collaborator Author

kwdoyle commented Oct 23, 2023

Oh, while looking through the scripts, I think this part in gds2agds.R might be the problem:

FunctionalAnnotation <- read_csv(paste0(dir_anno,"chr",chr,"/",anno_file_name_1,chr,anno_file_name_2),
col_types=list(col_character(),col_double(),col_double(),col_double(),col_double(),
col_double(),col_double(),col_double(),col_double(),col_double(),
col_character(),col_character(),col_character(),col_double(),col_character(),
col_character(),col_character(),col_character(),col_character(),col_double(),
col_double(),col_character()))

The character values I'm seeing might be in the location of these hard-coded character columns..

@xihaoli
Copy link
Owner

xihaoli commented Oct 23, 2023

Hi @kwdoyle,

Thank you so much for taking a closer look at the issue. Yes indeed, the current FAVORannotator script in the STAARpipeline-Tutorial works well for the FAVOR Essential Database hosted on Harvard Dataverse. For the full database of 160 annotations, it is recommended to adapt the col_types in gds2agds.R to reflect the correct column type for each annotation.

Best,
Xihao

@kwdoyle
Copy link
Collaborator Author

kwdoyle commented Oct 24, 2023

Yes, this was indeed the issue. If I removed the col_types specification and let read_csv auto-assign the column classes, everything seems to work fine.

@xihaoli
Copy link
Owner

xihaoli commented Oct 24, 2023

Thank you @kwdoyle! Feel free to let me know if you have any other questions, or you may close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants