missing phenotypes in icdinfo #4

Closed
yk-tanigawa opened this issue Feb 20, 2019 · 10 comments

@yk-tanigawa
Contributor

diff icdinfos

v1:

/oak/stanford/groups/mrivas/users/$USER/repos/rivas-lab/wiki/ukbb/icdinfo/icdinfo.txt

v2:

https://github.com/rivas-lab/ukbb-tools/blob/master/02_phenotyping/icdinfo.txt
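A plain `diff` of the two files will also flag rows whose counts changed. To see only which phenotypes disappeared between versions, comparing the ID columns may be enough. A minimal sketch, assuming both files are tab-separated with the GBE ID in column 1 (the function name `ids_only_in_first` is made up for illustration):

```shell
# Sketch: list GBE IDs that appear in the first icdinfo file but not the second.
# Assumes tab-separated input with the GBE ID in column 1.
ids_only_in_first() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    # comm needs both inputs sorted in the same collation order
    cut -f1 "$1" | LC_ALL=C sort -u > "$tmp_a"
    cut -f1 "$2" | LC_ALL=C sort -u > "$tmp_b"
    # -23 suppresses lines unique to the second file and lines common to both
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Usage (paths are placeholders):
#   ids_only_in_first v1/icdinfo.txt v2/icdinfo.txt
```

Swapping the two arguments lists IDs that are new in the second file.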

@yk-tanigawa
Contributor Author

  • duplicate removals
  • missing entries
    • phenotype (GBE IDs) without GBE name
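Both checks in this list can be sketched with standard tools. This assumes the six-column, tab-separated layout seen in the icdinfo rows quoted in this thread (GBE ID in column 1, name in column 3); the function names are hypothetical.

```shell
# duplicated_ids FILE -> GBE IDs that appear on more than one row
duplicated_ids() {
    cut -f1 "$1" | LC_ALL=C sort | uniq -d
}

# unnamed_ids FILE -> GBE IDs whose name column (assumed to be column 3) is empty
unnamed_ids() {
    awk -F '\t' '$3 == "" { print $1 }' "$1"
}

# Usage (path is a placeholder):
#   duplicated_ids icdinfo.txt
#   unnamed_ids icdinfo.txt
```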

@maguirre1
Contributor

maguirre1 commented Feb 21, 2019

Here is a list of the sources of missing phenotypes:

  1. Million Veterans Program (lipids across populations, INI1004xxx/HC4xxx)
  2. Ashley Lab Collaboration (physical activity traits, INI1003xxx)
  3. Broad Collaboration (BROADBIN/BROADQT)
  4. Anna's project (shift work, HC3xxx)
  5. Adam's project (medications, MED100000xxx)

The above comprise the vast majority of the missing phenotypes. There are a few (~80) traits that need to be updated in the pipeline. These are listed below in order of priority, but most of them cannot yet be incorporated into the pipeline proper.

  1. Family History (FH) traits: for now we can only port over the old version; see issue #6, "re-define 16698 phenotypes using 24983 data (the most recent one)".
  2. INI3063/3064 (FEV/FVC ratios) are derived ratio phenotypes.
  3. BIN1020483 (suicide attempt) is an expanded control set phenotype.
  4. INI1002395 (male pattern baldness) needs to be recoded.
  5. INI1001239 (smoking status) needs to be recoded.

There are also a couple of errors in the existing pipeline that affect many more traits, but which have to be resolved with the current versioning, namely:

  1. BIN_FC/QT_FC prefixing. This doesn't work with the loader on GBE, so we've had to map the traits to the HC/INI namespace (numbers remain identical) for loading. This gives the appearance of missing traits if you compare the Sherlock version of icdinfo.txt with the version on production. We've decided to keep the versioning separate like this to avoid confusion when using reference data (like the phenotype info table) designed for the Sherlock filesystem.
  2. Missing MED phenotypes. There seem to be several hundred medication phenotypes that were missed in the most recent rerun. A new issue (Missing medication (MED) phenotypes #7) for this has been opened.
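To compare the Sherlock and production versions without the prefix difference showing up as missing traits, the Sherlock IDs could be normalized first. A rough sketch, assuming the mapping is BIN_FC to HC and QT_FC to INI with the numeric part unchanged (that pairing is my reading of the comment above, not something confirmed by the loader); `QT_FC123` is a made-up ID.

```shell
# to_gbe_namespace: read IDs on stdin and rewrite the Sherlock-only prefixes.
# Assumed mapping: BIN_FC -> HC, QT_FC -> INI (numbers stay identical).
to_gbe_namespace() {
    sed -e 's/^BIN_FC/HC/' -e 's/^QT_FC/INI/'
}

# Demo: the first ID appears in this thread, the second is invented.
printf 'BIN_FC10003082\nQT_FC123\nINI22330\n' | to_gbe_namespace
# prints:
#   HC10003082
#   INI123
#   INI22330
```

Piping `cut -f1 icdinfo.txt` through this before comparing ID lists should make the two versions line up, if the assumed mapping is right.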

@maguirre1
Contributor

Family History phenotypes have been added to the resource, and duplicate entries have been fixed in icdinfo.txt.

@maguirre1
Contributor

maguirre1 commented Mar 5, 2019

The cardiac phenotypes below are still wrong in icdinfo (all counts are 0):

INI22330    0    PQ_interval    0    0    Y
INI22331    0    QT_interval    0    0    Y
INI22332    0    QTC_interval    0    0    Y
INI22333    0    RR_interval    0    0    Y
INI22334    0    PP_interval    0    0    Y
INI22335    0    P_axis    0    0    Y
INI22336    0    R_axis    0    0    Y
INI22337    0    T_axis    0    0    Y
INI22338    0    QRS_num    0    0    Y

It appears that this is due to an error in how the .info files were generated (unclear if this is a problem with the pipeline or how it was applied):

/oak/stanford/groups/mrivas/dev-ukbb-tools/phenotypes/2001702/logs/INI22330.log
/oak/stanford/groups/mrivas/dev-ukbb-tools/phenotypes/2001702/logs/INI22330.info
/oak/stanford/groups/mrivas/dev-ukbb-tools/phenotypes/2001702/INI22330.phe
/oak/stanford/groups/mrivas/dev-ukbb-tools/phenotypes/25279/logs/INI22330.log
/oak/stanford/groups/mrivas/dev-ukbb-tools/phenotypes/25279/INI22330.phe

It seems like the phenotype definition that has the .info file (under 2001702, above) has no nonmissing data, so the pipeline is working properly:

$ cut -f3 /oak/stanford/groups/mrivas/dev-ukbb-tools/phenotypes/2001702/INI22330.phe | grep -vc "\-9" 
0

But the one we actually want to have logged (under 25279) has no .info file (see the listing above):

$ cut -f3 /oak/stanford/groups/mrivas/dev-ukbb-tools/phenotypes/25279/INI22330.phe | grep -vc "\-9" 
23679
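One caveat on the counts above: `grep -v "\-9"` drops any value that merely contains `-9`, so a real measurement such as `-90` would be treated as missing too. A stricter sketch, assuming the value is in column 3 and missing is coded as the literal string `-9` (the function name is hypothetical):

```shell
# count_nonmissing FILE -> number of rows whose column 3 is not exactly "-9".
# awk's default field splitting handles both tab- and space-separated .phe files.
count_nonmissing() {
    awk '$3 != "-9" { n++ } END { print n + 0 }' "$1"
}

# Usage (path is a placeholder):
#   count_nonmissing INI22330.phe
```

The comparison is against the string `-9`, so a value written as `-9.0` would count as nonmissing here.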

@maguirre1
Contributor

https://github.com/rivas-lab/ukbb-tools/blob/master/02_phenotyping/tables/ukb_20181109.tsv

Phenotypes in this file are missing *.info files in their directory:
$OAK/dev-ukbb-tools/phenotypes/25729
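A quick way to enumerate the affected phenotypes might look like the sketch below. It assumes the GBE ID is in column 4 of the table and that .info files live in a `logs/` subdirectory of the phenotype directory, as in the paths listed earlier in this thread; `missing_info` is a made-up name.

```shell
# missing_info TABLE PHENO_DIR -> IDs from TABLE with no PHENO_DIR/logs/<ID>.info
# Assumes tab-separated TABLE with the GBE ID in column 4.
missing_info() {
    cut -f4 "$1" | while IFS= read -r id; do
        [ -n "$id" ] || continue                      # skip rows without an ID
        [ -e "$2/logs/$id.info" ] || printf '%s\n' "$id"
    done
}

# Usage (paths are placeholders):
#   missing_info tables/ukb_20181109.tsv /path/to/phenotype_dir
```

If the table has a header row, its column-4 label will be reported as missing as well.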

@guhanrv
Collaborator

guhanrv commented Mar 30, 2019

I found the issue re: the past two comments.

As Matt and I found yesterday, there were some spacing issues in https://github.com/rivas-lab/ukbb-tools/blob/master/02_phenotyping/tables/ukb_20181109.tsv. We have since re-saved a corrected version in its place.

This is probably why *.info files were not found in the corresponding directory.

Also:

BDS-C02X3440JGH5:tables guhan$ grep "0\t0\tY" ../icdinfo.txt | cut -f1 | grep -v "RH" | grep -v "MED" | sort
BIN21064
BIN_FC10003082
BIN_FC10010844
BIN_FC20003082
HC65
INI22330
INI22331
INI22332
INI22333
INI22334
INI22335
INI22336
INI22337
INI22338
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep BIN21064
BIN21064
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep BIN_FC10003082
BIN_FC10003082
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep BIN_FC20003082
BIN_FC20003082
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep BIN_FC10010844
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep HC65
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22330
INI22330
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22331
INI22331
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22332
INI22332
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22333
INI22333
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22334
INI22334
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22335
INI22335
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22336
INI22336
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22337
INI22337
BDS-C02X3440JGH5:tables guhan$ cut -f4 ukb_20181109.tsv | sort | grep INI22338
INI22338

There are a large number of RH and MED phenotypes with 0 counts in the icdinfo file. Ignoring those, though, all phenotypes that have the weird bug of having 0s in our icdinfo.txt (except BIN_FC10010844 and HC65) are in ukb_20181109.tsv.
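The per-ID greps above could be folded into a single pass. This is only a sketch under assumptions: icdinfo.txt is tab-separated with counts in columns 4 and 5 and the `Y` flag in column 6, the table has the GBE ID in column 4, and RH/MED are matched as ID prefixes rather than anywhere in the line. Function names are hypothetical.

```shell
# zero_count_ids ICDINFO -> sorted IDs whose columns 4-6 are 0, 0, Y (RH/MED skipped)
zero_count_ids() {
    awk -F '\t' '$4 == 0 && $5 == 0 && $6 == "Y" && $1 !~ /^(RH|MED)/ { print $1 }' "$1" |
        LC_ALL=C sort
}

# check_in_table TABLE -> for each ID on stdin, print "ID<TAB>present" or "ID<TAB>absent"
check_in_table() {
    while IFS= read -r id; do
        # -F fixed string, -x whole-line match, so HC65 cannot match a longer ID
        if cut -f4 "$1" | grep -Fxq -- "$id"; then
            printf '%s\tpresent\n' "$id"
        else
            printf '%s\tabsent\n' "$id"
        fi
    done
}

# Usage (paths are placeholders):
#   zero_count_ids ../icdinfo.txt | check_in_table ukb_20181109.tsv
```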

[guhan@sh-ln05 login /oak/stanford/groups/mrivas/dev-ukbb-tools/phenotypes/9797]$ cut -f3 BIN_FC10010844.phe | sort | uniq -c
     55 1
      8 2
 502559 -9

BIN_FC10010844's phe file isn't empty; there are people here who are not in GBE. Perhaps these people don't make it to GBE because they are not in the white British unrelated set (can someone check this?). Not sure where to find the HC phes, to be honest.

Interestingly, none of these phes that have 0 in them are in master.phe or master.phe.info. I suppose that makes sense, but I think that the more complete master.phe is, the better off we are in the long run.

Long story short, a good number of problems will be resolved by rerunning gbe.sh on ukb_20181109.tsv.

I strongly suggest we do a find for all of the phenotypes across our entire $OAK space, track them down, possibly rerun them through the pipeline, and put them in the master.phe/master.phe.info file. The more of them we leave unused, the more of our past work goes to waste.

@guhanrv
Collaborator

guhanrv commented Mar 30, 2019

all_phes.txt
This is all of the phenotypes on our OAK space; I don't know how many more exist.

I highly encourage: 1) removing duplicates, 2) moving all non-duplicates to ukbb24983/phenotypes or ukbb16698/phenotypes. We need to make our jobs easier on ourselves :(
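For step 1, a first pass at spotting candidate duplicates could go by file name. A minimal sketch, assuming phenotype files end in `.phe` and that a repeated basename is worth a closer look (`duplicate_phe_names` is a made-up name):

```shell
# duplicate_phe_names ROOT -> basenames of *.phe files found more than once under ROOT
duplicate_phe_names() {
    find "$1" -type f -name '*.phe' |
        awk -F/ '{ print $NF }' |      # keep only the file name
        LC_ALL=C sort | uniq -d
}

# Usage (assumes $OAK points at the group space):
#   duplicate_phe_names "$OAK"
```

The same basename does not guarantee the same contents (INI22330.phe above differs between directories), so comparing checksums, for example with `cksum`, before removing anything seems prudent.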

@guhanrv
Collaborator

guhanrv commented Mar 31, 2019

[guhan@sh-ln05 login /oak/stanford/groups/mrivas]$ awk -F ' ' '{print $3}' ./private_data/ukbb/16698/phe/highconfidence/HC65.phe | sort | uniq -c
 502610 1
     18 2

Looks like HC65.phe also has a very low number of cases, which may be excluded from analyses when only white British unrelated individuals are analyzed. It is also not in master.phe, probably for this very reason. (@maguirre1?)

@yk-tanigawa
Contributor Author

Also, biomarker phenotypes are not in icdinfo (just not yet?)

It might be nice to assign new GBE_ID for adjusted traits and include both unadjusted and adjusted in the master phe & icdinfo.

@guhanrv
Collaborator

guhanrv commented Apr 8, 2020

Long overdue, but a new icdinfo file has been generated. This fixes the issues with the QRS traits.

INI12336	26310	Ventricular_rate	26310	26310	Y
INI12338	24866	P_duration	24866	24866	Y
INI12340	26306	QRS_duration	26306	26306	Y
INI22330	16695	PQ_interval	16695	16695	Y
INI22331	17575	QT_interval	17575	17575	Y
INI22332	17575	QTC_interval	17575	17575	Y
INI22333	17575	RR_interval	17575	17575	Y
INI22334	17575	PP_interval	17575	17575	Y
INI22335	16700	P_axis	16700	16700	Y
INI22336	17471	R_axis	17471	17471	Y
INI22337	17531	T_axis	17531	17531	Y
INI22338	17575	QRS_num	17575	17575	Y

@guhanrv guhanrv closed this as completed Apr 8, 2020