## Filter hearing impairment phenotypes using Ran code

# Phenotype file creation for:
1. Hearing aids (f.3393)
2. Hearing difficulty/problems (f.2247)
3. Hearing difficulty/background noise (f.2257)
4. Combined phenotype (f.2247 & f.2257)

## Aim

Create a dataset of individuals filtered by the inclusion and exclusion criteria for several hearing-related phenotypes, to be used for association analyses with the LMM.ipynb notebook.

## Location of files

The original UKBB data are in the shared folder:
```
/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/ukb42495_updatedJune2020
```

The filtered dataset is in my personal folder:

```
/home/dc2325/project/HI_UKBB
```

Important phenotypic files are in the phenotypes folder:

```
/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/hearing_impairment
```

## Important phenotypic files

1. `200804_UKBB_HI_genotypeqc.csv` File containing all individuals that passed QC and hearing impairment variables
2. `200804_UKBB_HI_genotypeqc_excr.csv` File with applied exclusion criteria as indicated [here](https://docs.google.com/document/d/1cpxTzElpsEkwmBDjnMBHg2wW7CL1AcG_b0_0wE_k5rQ/edit). **Note**: this file excludes individuals with otosclerosis, Meniere's and other diseases, if you need to filter those particular phenotypes use file 1 instead.
3. `200811_UKBB_Tinnitus_plan1_2_3_f4803` File with filtered phenotypes for tinnitus plans 1, 2 and 3 and imputed noise variables
4. `200814_UKBB_HI_genotypeqc_excr_impvars` Dataset with QC'ed individuals, exclusion criteria applied, imputed noise variables and tinnitus phenotypes

## Analysis plan

The phenotypes to be analyzed are the following:

1. Hearing aid user (f.3393)
"Do you use a hearing aid most of the time?"

2. Hearing difficulty/problems (f.2247)
"Do you have any difficulty with your hearing?"

3. Hearing difficulty/background noise (f.2257)
"Do you find it difficult to follow a conversation if there is background noise (such as TV, radio, children playing)?"

**Sex corresponds to f.22001 (genetic sex):**

- Male = 0
- Female = 1

**Noisy workplace and loud music exposure frequency: same as for Tinnitus**
                
1. Remove inconsistent individuals 
    - reported 1, 2 or 3 and then reported 0 at a following visit
    - reported a higher exposure (e.g. 3) and then a lower one (e.g. 1 or 2) at a following visit
2. Retain consistent individuals and use highest reported exposure
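
The two rules above can be sketched on toy data as follows. This is a minimal illustration, assuming exposure is coded 0 to 3 with `NA` for a missing visit; the helper name `summarise_exposure` and the toy values are hypothetical and not part of the original pipeline.

```r
# Hypothetical helper: returns the highest reported exposure for a consistent
# individual, or NA when the reports are inconsistent across visits.
# Assumes exposure is coded 0 (none) to 3 (highest) and NA marks a missing visit.
summarise_exposure <- function(visits) {
  seen <- visits[!is.na(visits)]
  if (length(seen) == 0) return(NA_real_)
  # Any decrease between consecutive reported visits counts as inconsistent
  # (covers both "1, 2 or 3 then 0" and "3 then 1 or 2").
  if (any(diff(seen) < 0)) return(NA_real_)
  max(seen)
}

# Made-up visit histories (instances 0 to 3)
toy <- list(
  consistent   = c(1, NA, 2, 2),
  dropped_to_0 = c(2, 0, NA, NA),
  went_down    = c(3, NA, 1, NA)
)
result <- sapply(toy, summarise_exposure)
print(result)
```

Individuals for whom the helper returns `NA` would be the ones removed under rule 1.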

**The SRT trait needs to be inverse normalized**
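
A minimal sketch of a rank-based inverse normal transformation is shown here. The notebook does not state which variant is used, so the Blom offset (`c = 3/8`), the function name `inverse_normalize` and the SRT values are all assumptions for illustration.

```r
# Rank-based inverse normal transformation.
# The Blom offset c = 3/8 is an assumption; other offsets are also in common use.
inverse_normalize <- function(x, c = 3 / 8) {
  n <- sum(!is.na(x))                                   # number of observed values
  r <- rank(x, na.last = "keep", ties.method = "average")  # NAs stay NA
  qnorm((r - c) / (n - 2 * c + 1))
}

srt <- c(-7.5, -6.0, NA, -8.0, -3.5, -5.5)  # made-up SRT values
srt_int <- inverse_normalize(srt)
print(round(srt_int, 3))
```

Missing values are kept as `NA`, and the ordering of the observed values is preserved.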

**Covariates to be included in the analysis include:**

1. Age at time of test (calculated from f.21003.0.0,f.21003.1.0,f.21003.2.0,f.21003.3.0)
2. Sex f.22001
3. Volume left ear f.4270 and right ear f.4277 (the volume set by the participant for the measurement used in the analysis, in our case the last time they took the test). For the analysis we use the average of the right and left ear, since the volume distributions overlap
4. Noisy workplace f.4825
5. Loud music exposure f.4836
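
As a small illustration of the volume covariate in item 3, the left and right ear values can be averaged per individual. The column names `vol_left` and `vol_right` and the values are hypothetical stand-ins for the f.4270 and f.4277 instances actually used, and returning `NA` when one ear is missing is an assumption.

```r
# Toy data: hypothetical column names standing in for f.4270 (left) and f.4277 (right)
toy <- data.frame(
  IID       = 1:3,
  vol_left  = c(40, 70, NA),
  vol_right = c(50, 70, 100)
)

# Per-individual mean of the two ears; with the default na.rm = FALSE the mean
# is NA whenever either ear is missing (an assumption, not the documented rule).
toy$vol_mean <- rowMeans(toy[, c("vol_left", "vol_right")])
print(toy)
```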

In [None]:
#Load libraries
library(plyr)
library(dplyr)
library(tidyverse)
library(pander)
library(ggpubr)
library(rapportools)
library(ggplot2)
#Get working directory
getwd()

## Set working directory

In [2]:
#Set working directory Yale
setwd('~/project/HI_UKBB/')
#Set working directory Columbia
##setwd('/mnt/mfs/statgen/UKBiobank/data/phenotype_files/hearing_impairment')

## Load data

### Using all whites

In [3]:
df.final.imp = read.csv('010521_UKBB_HI_genotypeqc_expandedwhite_396974indiv_excr.csv')

In [4]:
head(df.final.imp)
dim(df.final.imp)

Unnamed: 0_level_0,IID,FID,ignore1,ignore2,ignore3,ignore4,f.31.0.0,f.34.0.0,f.53.0.0,f.53.1.0,⋯,f.131229.0.0,f.131230.0.0,f.131231.0.0,f.131232.0.0,f.131233.0.0,f.131250.0.0,f.131251.0.0,f.131252.0.0,f.131253.0.0,exclude
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<fct>,<int>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<lgl>
1,1000019,1000019,0,0,2,-9,Female,1960,2008-01-24,,⋯,,,,,,,,,,False
2,1000022,1000022,0,0,1,-9,Male,1954,2008-01-22,,⋯,,,,,,,,,,False
3,1000035,1000035,0,0,1,-9,Male,1944,2007-11-08,,⋯,,,,,,,,,,False
4,1000046,1000046,0,0,2,-9,Female,1946,2008-12-01,,⋯,,,,,,,,,,False
5,1000054,1000054,0,0,2,-9,Female,1942,2007-11-23,,⋯,,,,,,,,,,False
6,1000063,1000063,0,0,1,-9,Male,1967,2010-06-26,,⋯,,,,,,,,,,False


# Hearing aids (f.3393)

## Step 1: classify cases and controls

In [5]:
hearing_all = df.final.imp %>% 
  select(IID,FID,f.31.0.0,f.34.0.0, f.21003.0.0,f.21003.1.0,f.21003.2.0,f.21003.3.0,f.3393.0.0,f.3393.1.0,f.3393.2.0,f.3393.3.0,f.2247.0.0,f.2247.1.0,f.2247.2.0,f.2247.3.0,f.2257.0.0, f.2257.1.0, f.2257.2.0, f.2257.3.0,starts_with("f.41270"),starts_with("f.41280")) 

### Classify cases and controls based on ICD10 code Z97.4

In [6]:
hearing_all = hearing_all %>% 
  mutate(cases_Z974 = apply(select(hearing_all,starts_with("f.41270")), 1, function(x) any(x %in% c("Z974")))) 

### Classify the cases and controls for hearing aid, based on 3393

In [7]:
options(warn=-1)

In [8]:
hearing_all = hearing_all %>% 
  mutate(cases_3393 = apply(select(hearing_all,starts_with("f.3393")), 1, function(x) length(which(x == "Yes")) > 0 && max(which(x == "No")) < min(which(x == "Yes"))))
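
The case rule in the cell above can be read as: at least one "Yes", and no "No" at or after the first "Yes". A minimal sketch on made-up visit histories, using a hypothetical helper `is_case` and `suppressWarnings()` in place of the global `options(warn=-1)`:

```r
# Hypothetical helper mirroring the case rule on a single individual's answers.
is_case <- function(x) {
  yes <- which(x == "Yes")   # which() drops NA comparisons
  no  <- which(x == "No")
  # max(integer(0)) is -Inf (with a warning), so an individual with no "No"
  # answers at all passes the ordering test.
  length(yes) > 0 && suppressWarnings(max(no)) < min(yes)
}

# Made-up answers across the four visits
visits <- rbind(
  always_yes  = c("Yes", NA, "Yes", NA),
  no_then_yes = c("No", "No", "Yes", NA),
  yes_then_no = c("Yes", "No", NA, NA),
  never_yes   = c("No", NA, NA, NA)
)
flags <- apply(visits, 1, is_case)
print(flags)
```

Answers such as "Prefer not to answer" are neither "Yes" nor "No", so they do not affect the rule.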

In [9]:
hearing_all$control_3393 = with(hearing_all, ifelse(f.3393.0.0 %in% c("No",NA,"Prefer not to answer") & f.3393.1.0 %in% c("No", NA,"Prefer not to answer") & f.3393.2.0 %in% c("No",NA,"Prefer not to answer") & f.3393.3.0 %in% c("No",NA,"Prefer not to answer") 
                                                 & !(f.3393.0.0 %in% c(NA,"Prefer not to answer") & f.3393.1.0 %in% c(NA,"Prefer not to answer") & f.3393.2.0 %in% c(NA,"Prefer not to answer") & f.3393.3.0 %in% c(NA,"Prefer not to answer")),"FALSE", NA))

In [49]:
#148 individuals are true for Z974 but controls for 3393
hearing_aid_trial <- hearing_all %>% 
  filter(cases_Z974 == "TRUE" & control_3393 == "FALSE")
head(hearing_aid_trial)
nrow(hearing_aid_trial)

Unnamed: 0_level_0,IID,FID,f.31.0.0,f.34.0.0,f.21003.0.0,f.21003.1.0,f.21003.2.0,f.21003.3.0,f.3393.0.0,f.3393.1.0,⋯,f.41280.0.208,f.41280.0.209,f.41280.0.210,f.41280.0.211,f.41280.0.212,cases_Z974,cases_3393,control_3393,reclass_aid,hearing_aid_cat
Unnamed: 0_level_1,<int>,<int>,<fct>,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<lgl>,<lgl>,<chr>,<chr>,<chr>
1,1002806,1002806,Female,1944,63,,,,No,,⋯,,,,,,True,False,False,True,case
2,1011139,1011139,Female,1945,63,,,,No,,⋯,,,,,,True,False,False,True,case
3,1019013,1019013,Male,1938,69,,,,No,,⋯,,,,,,True,False,False,True,case
4,1069074,1069074,Male,1946,62,,,,No,,⋯,,,,,,True,False,False,True,case
5,1089526,1089526,Male,1949,59,,,,No,,⋯,,,,,,True,False,False,True,case
6,1148586,1148586,Male,1939,69,,,,No,,⋯,,,,,,True,False,False,True,case


In [84]:
hearing_all <- hearing_all %>% 
  mutate(reclass_aid= case_when(
    cases_Z974 == "TRUE" & control_3393 == "FALSE" ~ "TRUE",
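    # Note: `control_3393 == NA` always evaluates to NA, so this branch never matches;
    # Z974 cases with cases_3393 == FALSE are caught by the next branch instead.
    cases_Z974 == "TRUE" & control_3393 == NA  ~ "TRUE",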
    cases_Z974 == "TRUE" & cases_3393 == "FALSE" ~ "TRUE",
    TRUE ~"FALSE") ) 
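
When writing conditions on columns that contain missing values, such as `control_3393`, it helps to recall how R compares against `NA`. A short base R illustration on a made-up vector:

```r
# Made-up values in the same style as control_3393 ("FALSE" or NA)
control <- c("FALSE", NA, "FALSE")

eq_na      <- control == NA    # comparison with NA is always NA, never TRUE
is_missing <- is.na(control)   # is.na() is the test that detects missing values

print(eq_na)
print(is_missing)
```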

In [85]:
#13811 individuals are cases for either Z974 or 3393
#cases_3393 == TRUE N=13,633
#reclass_aid == TRUE N=148
#reclass_aid ==TRUE & cases_3393 TRUE N=0
#reclass_aid ==TRUE & cases_3393 FALSE N=30
hearing_aid_cases <- hearing_all %>% 
 filter(reclass_aid == "TRUE" | cases_3393 == "TRUE" )
nrow(hearing_aid_cases)

In [90]:
#245157 individuals are controls for Z974 and 3393
hearing_aid_control <- hearing_all %>% 
 filter(reclass_aid == "FALSE" & control_3393 == "FALSE")
nrow(hearing_aid_control)

In [91]:
#merge cases and controls
hearing_all <- hearing_all %>% 
  mutate(hearing_aid_cat= case_when(
    reclass_aid == "TRUE" | cases_3393 == "TRUE" ~ "case",
    reclass_aid == "FALSE" & control_3393 == "FALSE" ~ "control",
    TRUE ~"NA")
    ) 

In [92]:
#258965 are either cases or controls after reclassification
hear_aid <- hearing_all %>% 
  filter( hearing_aid_cat == "case" | hearing_aid_cat == "control")
nrow(hear_aid)

In [93]:
nrow(hear_aid %>% filter(hearing_aid_cat =="case"))

## Step 2: get the ages for hearing aids (3393)

### Extract age for Control (3393)

In [94]:
aid_age_control <- hearing_all %>% 
  filter(hearing_aid_cat == "control") 

In [95]:
#find out the age at the last visit (control)

offset = which(colnames(aid_age_control) == 'f.21003.0.0') - which(colnames(aid_age_control) == 'f.3393.0.0')

aid_age_control$age_aid = apply(aid_age_control, 1, function(x) {
  hear_aid = which(x[grep("f.3393", names(x))] == "No")
  first_index_offset = grep("f.3393", names(x))[1] - 1
  unlist(x[hear_aid[length(hear_aid)] + first_index_offset + offset])
})
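
The idea behind the cell above (take the age at the last visit where the answer was "No") can also be sketched with matrix indexing instead of column offsets. This is an illustrative alternative on made-up data, not the notebook's method:

```r
# Made-up answers (f.3393-style) and ages (f.21003-style) for three controls,
# one column per visit.
answers <- rbind(
  c("No", NA, "No", NA),
  c("No", NA, NA, NA),
  c(NA, "No", NA, NA)
)
ages <- rbind(
  c(47, NA, 58, NA),
  c(53, NA, NA, NA),
  c(60, 64, NA, NA)
)

# Index of the last visit with a "No" answer for each individual
last_no <- apply(answers, 1, function(x) max(which(x == "No")))

# Pick, for each row, the age in the matching visit column
age_aid <- ages[cbind(seq_len(nrow(ages)), last_no)]
print(age_aid)
```

One design difference: `apply()` over a whole data frame converts every value to character, which is why `age_aid` appears as `<chr>` in the outputs, whereas indexing a numeric matrix keeps the ages numeric.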

res<-head(aid_age_control)

In [96]:
head(aid_age_control)

Unnamed: 0_level_0,IID,FID,f.31.0.0,f.34.0.0,f.21003.0.0,f.21003.1.0,f.21003.2.0,f.21003.3.0,f.3393.0.0,f.3393.1.0,⋯,f.41280.0.209,f.41280.0.210,f.41280.0.211,f.41280.0.212,cases_Z974,cases_3393,control_3393,reclass_aid,hearing_aid_cat,age_aid
Unnamed: 0_level_1,<int>,<int>,<fct>,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>
1,1000019,1000019,Female,1960,47,,,,No,,⋯,,,,,False,False,False,False,control,47
2,1000022,1000022,Male,1954,53,,,,No,,⋯,,,,,False,False,False,False,control,53
3,1000035,1000035,Male,1944,63,,,,No,,⋯,,,,,False,False,False,False,control,63
4,1000046,1000046,Female,1946,62,,73.0,,,,⋯,,,,,False,False,False,False,control,73
5,1000054,1000054,Female,1942,65,,,,No,,⋯,,,,,False,False,False,False,control,65
6,1000063,1000063,Male,1967,43,,,,No,,⋯,,,,,False,False,False,False,control,43


### Extract age for Cases (3393)

In [97]:
#first category of cases (3393 true, Z974 False)
aid_age_case1 <- hearing_all %>% 
  filter(hearing_aid_cat == "case" & reclass_aid == "FALSE") 

#find out the age at the first visit (case)

offset = which(colnames(aid_age_case1) == 'f.21003.0.0') - which(colnames(aid_age_case1) == 'f.3393.0.0')

aid_age_case1$age_aid = apply(aid_age_case1, 1, function(x) {
  hear_aid =  which(x[grep("f.3393", names(x))] == "Yes")
  first_index_offset = grep("f.3393", names(x))[1] - 1
  unlist(x[min(hear_aid) + first_index_offset + offset])
})

res<-head(aid_age_case1)

In [98]:
head(aid_age_case1)

Unnamed: 0_level_0,IID,FID,f.31.0.0,f.34.0.0,f.21003.0.0,f.21003.1.0,f.21003.2.0,f.21003.3.0,f.3393.0.0,f.3393.1.0,⋯,f.41280.0.209,f.41280.0.210,f.41280.0.211,f.41280.0.212,cases_Z974,cases_3393,control_3393,reclass_aid,hearing_aid_cat,age_aid
Unnamed: 0_level_1,<int>,<int>,<fct>,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>
1,1000112,1000112,Male,1949,58,,68.0,,,,⋯,,,,,False,True,,False,case,68
2,1001067,1001067,Male,1959,50,,,,Yes,,⋯,,,,,False,True,,False,case,50
3,1001384,1001384,Female,1948,61,,,,Yes,,⋯,,,,,False,True,,False,case,61
4,1001459,1001459,Male,1944,64,,,,Yes,,⋯,,,,,True,True,,False,case,64
5,1002548,1002548,Male,1948,62,,,,Yes,,⋯,,,,,False,True,,False,case,62
6,1002888,1002888,Male,1940,68,,,,Yes,,⋯,,,,,False,True,,False,case,68


In [99]:
#second category of cases (3393 false, Z974 true)
aid_age_case2 <- hearing_all %>% 
  filter(hearing_aid_cat == "case" & reclass_aid == "TRUE") 

#age for hearing aid based on Z974
offset = which(colnames(aid_age_case2) == 'f.41280.0.0') - which(colnames(aid_age_case2) == 'f.41270.0.0')

aid_age_case2$age_aid = apply(aid_age_case2, 1, function(x) {
  hear_aid = which(x[grep("f.41270", names(x))] == "Z974")
  first_index_offset = grep("f.41270", names(x))[1] - 1
  unlist(x[hear_aid[length(hear_aid)] + first_index_offset + offset])
})

In [100]:
hearing_aid_new <- aid_age_case2%>% 
  separate(age_aid, into = c("year", "month", "day"), sep = "-") %>% 
  mutate(num_year=as.numeric(year))

In [101]:
hearing_aid_new2 <- hearing_aid_new %>%   
  mutate(age_aid= num_year - f.34.0.0) %>% 
  select(-year, -month, -day, -num_year)
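
The two cells above derive age as the calendar year of the Z974 record minus the year of birth (f.34.0.0). A compact sketch of the same arithmetic using `as.Date()`, on made-up dates and birth years:

```r
# Made-up values: year of birth and date of the Z974 record
birth_year <- c(1944L, 1938L)
diag_date  <- c("2015-03-02", "2012-11-20")

# Extract the calendar year from the date and subtract the birth year
diag_year   <- as.integer(format(as.Date(diag_date), "%Y"))
age_at_code <- diag_year - birth_year
print(age_at_code)
```

Because only the year of birth is used, the result can differ from the exact age at the record date by up to one year.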

In [102]:
#merge age for cases and controls
hearing_clean <- rbind(aid_age_case1,hearing_aid_new2, aid_age_control) 
dim(hearing_clean)
head(hearing_clean)

Unnamed: 0_level_0,IID,FID,f.31.0.0,f.34.0.0,f.21003.0.0,f.21003.1.0,f.21003.2.0,f.21003.3.0,f.3393.0.0,f.3393.1.0,⋯,f.41280.0.209,f.41280.0.210,f.41280.0.211,f.41280.0.212,cases_Z974,cases_3393,control_3393,reclass_aid,hearing_aid_cat,age_aid
Unnamed: 0_level_1,<int>,<int>,<fct>,<int>,<int>,<int>,<int>,<int>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<lgl>,<lgl>,<chr>,<chr>,<chr>,<chr>
1,1000112,1000112,Male,1949,58,,68.0,,,,⋯,,,,,False,True,,False,case,68
2,1001067,1001067,Male,1959,50,,,,Yes,,⋯,,,,,False,True,,False,case,50
3,1001384,1001384,Female,1948,61,,,,Yes,,⋯,,,,,False,True,,False,case,61
4,1001459,1001459,Male,1944,64,,,,Yes,,⋯,,,,,True,True,,False,case,64
5,1002548,1002548,Male,1948,62,,,,Yes,,⋯,,,,,False,True,,False,case,62
6,1002888,1002888,Male,1940,68,,,,Yes,,⋯,,,,,False,True,,False,case,68


In [78]:
#merge back to the original data set
hearing_clean <- merge(x = hearing_all, y = hearing_clean, by = c("IID","FID") ,all.x = TRUE) %>% 
  select(-ends_with(".y")) %>% 
  dplyr::rename_all(
    #escape the dot and anchor at the end so only the ".x" merge suffix is removed
    ~stringr::str_replace_all(., "\\.x$", "")
    )
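
When stripping merge suffixes with a regular expression, an unescaped dot matches any character, so the pattern needs care. A base R sketch contrasting a loose and a strict pattern; the column name `max_exposure.x` is hypothetical and chosen only to show the difference:

```r
# Column names as they might look after merge(); "max_exposure.x" is hypothetical
cols <- c("f.3393.0.0.x", "f.34.0.0.x", "max_exposure.x")

# Loose: "." matches any character, so every "<any char>x" pair is removed
loose <- gsub(".x", "", cols)

# Strict: a literal dot followed by x, only at the end of the name
strict <- sub("\\.x$", "", cols)

print(loose)
print(strict)
```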

In [65]:
haoyue_f3393 <- read.table('041421_UKBB_Hearing_aid_f3393_expandedwhite_z974included', header=TRUE)
dim(haoyue_f3393)

In [103]:
cases_not_in_ran <- haoyue_f3393 %>%
    filter(!(IID %in% hearing_clean$IID))
cases_not_in_ran

FID,IID,age_final_aid,sex,hearing_aid_cat_new
<int>,<int>,<int>,<int>,<int>


In [104]:
nrow(hearing_clean)

## First problem

In [144]:
hearing_all %>% 
    filter (IID %in% c(1010377, 1067717, 1117169, 1294639)) %>%
    select (IID, starts_with('f.3393'),cases_Z974, cases_3393, control_3393, reclass_aid, hearing_aid_cat)

IID,f.3393.0.0,f.3393.1.0,f.3393.2.0,f.3393.3.0,cases_Z974,cases_3393,control_3393,reclass_aid,hearing_aid_cat
<int>,<fct>,<fct>,<fct>,<fct>,<lgl>,<lgl>,<chr>,<chr>,<chr>
1010377,No,,Prefer not to answer,,False,False,,False,
1067717,No,,Prefer not to answer,,False,False,,False,
1117169,No,,Prefer not to answer,,False,False,,False,
1294639,No,No,Prefer not to answer,,False,False,,False,


## Second problem

In [105]:
hearing_all %>% 
    filter(IID %in% c(1421064, 1497578, 1637509))%>%
    select (IID, starts_with('f.3393'),cases_Z974, cases_3393, control_3393, reclass_aid, hearing_aid_cat)

IID,f.3393.0.0,f.3393.1.0,f.3393.2.0,f.3393.3.0,cases_Z974,cases_3393,control_3393,reclass_aid,hearing_aid_cat
<int>,<fct>,<fct>,<fct>,<fct>,<lgl>,<lgl>,<chr>,<chr>,<chr>
1421064,,,,,True,False,,True,case
1497578,,,,,True,False,,True,case
1637509,,,,,True,False,,True,case


# Hearing difficulty/problems (2247)

## Step 1: classify cases and controls

In [131]:
#classify cases of hearing difficulty/problems based on 2247
hearing_diff <- hearing_clean %>% 
  mutate(cases_2247 = apply(select(.,starts_with("f.2247")), 1, function(x) length(which(x == "Yes")) > 0 & max(which(x != "Yes")) < min(which(x == "Yes")))
  )

In [None]:
#classify controls of hearing difficulty/problems based on 2247
hearing_diff$control_2247 = with(hearing_diff, ifelse(f.2247.0.0 %in% c("No",NA) & f.2247.1.0 %in% c("No", NA) & f.2247.2.0 %in% c("No",NA) & f.2247.3.0 %in% c("No",NA) 
                                                 & !(f.2247.0.0 %in% c(NA) & f.2247.1.0 %in% c(NA) & f.2247.2.0 %in% c(NA) & f.2247.3.0 %in% c(NA)),"FALSE", "NA")) 
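
The control rule above amounts to: every visit is "No" or missing, and at least one visit is not missing. A minimal sketch with a hypothetical helper `is_control`, applied to made-up answers:

```r
# Hypothetical helper: TRUE when all answers are "No" or NA, and not all are NA
is_control <- function(x) all(x %in% c("No", NA)) && !all(is.na(x))

# Made-up answers across the four visits
visits <- rbind(
  no_only     = c("No", NA, "No", NA),
  all_missing = c(NA, NA, NA, NA),
  has_yes     = c("No", "Yes", NA, NA),
  other_reply = c("No", "Do not know", NA, NA)
)
flags <- apply(visits, 1, is_control)
print(flags)
```

Writing the rule as a function over the visit columns avoids repeating the condition for each of the four instances.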