# Hearing-related phenotypes

## Aim

Create a dataset of filtered individuals, using the inclusion and exclusion criteria for diverse hearing-related phenotypes, to perform association analyses with LMM.ipynb.

## Location of files

The original UKBB data is in the shared folder:
```
/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/ukb42495_updatedJune2020
```

The filtered dataset is in my personal folder:

```
/home/dc2325/project/tinnitus
```

## Subset the data using variables of interest

Using the `ukbconv` software and a list of pre-specified variables:

```
./ukbconv ukb42495.enc_ukb r -i/home/dc2325/project/tinnitus/selectvars_062520.txt -o/home/dc2325/project/tinnitus/ukb42495_subset062520
```

In [None]:
[global]
# The working dir
parameter:cwd = path
# The fam file
parameter: famfile = path

## Subsetting individuals with genotypic data

Set the working directory and import the phenotype data

In [2]:
getwd()

In [3]:
setwd('~/project/HI_UKBB')

In [4]:
# Clean workspace
rm(list=ls())

In [5]:
# Run script to import data to R
source("ukb42495_subset062520.r")
nrow(bd)

In [6]:
# List of individuals with qc'ed genotypic files
df.geno <- read.table("/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.fam", header= FALSE, stringsAsFactors = FALSE)
names(df.geno) <-c("FID","IID","ignore1", "ignore2", "ignore3", "ignore4")
nrow(df.geno)

In [10]:
head(bd[,1, drop=FALSE])

,f.eid
,<int>
1,6025442
2,1000019
3,1000022
4,1000035
5,1000046
6,1000054


In [7]:
# Assign individual ID column to bd f.eid
names(bd)[1] <- "IID"
head(bd[,1, drop=FALSE])

,IID
,<int>
1,6025442
2,1000019
3,1000022
4,1000035
5,1000046
6,1000054


In [9]:
# Merge the two data frames
df.gen.phen <-merge(df.geno, bd, by="IID", all=FALSE)
nrow(df.gen.phen)
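
To make the effect of `all=FALSE` concrete, here is a minimal sketch on two toy data frames. The IDs and the `pheno` column are invented for illustration and are not taken from the UKBB files; the `setdiff` calls are one possible way to see which individuals are dropped by the inner merge.

```r
# Toy stand-ins for df.geno and bd (all values invented for illustration)
geno_toy  <- data.frame(FID = c(1, 2, 3), IID = c(1, 2, 3))
pheno_toy <- data.frame(IID = c(2, 3, 4), pheno = c("a", "b", "c"))

# all = FALSE keeps only the IIDs present in both data frames
merged_toy <- merge(geno_toy, pheno_toy, by = "IID", all = FALSE)

# IIDs that are dropped because they have no counterpart on the other side
only_geno  <- setdiff(geno_toy$IID, pheno_toy$IID)
only_pheno <- setdiff(pheno_toy$IID, geno_toy$IID)
```

Comparing `nrow()` before and after the merge, together with the two `setdiff` results, gives a quick check that the number of retained individuals is the expected one.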

In [11]:
# Save the merged data frame as a csv file
write.csv(df.gen.phen,'UKBB_071020_HI_genotypeqc.csv', row.names = FALSE)


## 1. Tinnitus phenotype (binary)

### a. Exclusion criteria based on ICD10, ICD9 codes and self-report
Apply the exclusion criteria defined by the group to remove unwanted individuals. The criteria take into account ICD10 codes, ICD9 codes and f.20002 (self-report). The list of excluded codes is available [here](https://docs.google.com/spreadsheets/d/12L7Cx4Ov8FppGVmG0DxL9uG-lVRHM5QJSea0nORyirQ/edit#gid=0). A total of 12,397 individuals were excluded in this step.

In [None]:
# Get the list of excluded individuals. Make sure each line of the pattern file has the form \bstring\b so that it is matched as a whole word by -w
grep -w -f 200713_ICDcodes_exclusion.txt UKBB_071020_HI_genotypeqc.csv > 200713_UKBB_excluded_individuals.csv
cat 200713_UKBB_excluded_individuals.csv | wc -l #12397 excluded
# Get the clean dataset with the retained individuals
grep -wv -f 200713_ICDcodes_exclusion.txt UKBB_071020_HI_genotypeqc.csv > 200713_UKBB_genotypeqc_tinnitus_excr.csv
cat 200713_UKBB_genotypeqc_tinnitus_excr.csv | wc -l #354347 retained
# List any lines that appear in both files (there should be none)
comm -12 <(sort 200713_UKBB_genotypeqc_tinnitus_excr.csv) <(sort 200713_UKBB_excluded_individuals.csv)
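
The whole-word matching used above can be illustrated in R with a minimal sketch. The codes and the CSV lines below are invented for illustration, and `\\b` word boundaries are used here only as an approximation of what `grep -w` does.

```r
# Hypothetical exclusion codes and CSV lines (invented for illustration)
codes <- c("H931", "H833")
lines <- c("1001,H931,foo", "1002,H9310,bar", "1003,none,baz", "1004,x,H833")

# Word boundaries approximate grep -w: "H931" must not match inside "H9310"
pattern <- paste0("\\b(", paste(codes, collapse = "|"), ")\\b")
is_excluded <- grepl(pattern, lines, perl = TRUE)

excluded <- lines[is_excluded]
retained <- lines[!is_excluded]

# The two subsets partition the input, so their sizes must add up
stopifnot(length(excluded) + length(retained) == length(lines))
```

The same partition check applies to the real files: the excluded and retained line counts should sum to the number of lines in the input file.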

In [None]:
# Load libraries
library(plyr)
library(dplyr)

# Import clean data (named data_clean because the following cells refer to it by that name)
data_clean = read.csv(file = "200713_UKBB_genotypeqc_tinnitus_excr.csv", header=TRUE)

dim(data_clean)

### b. Variable recoding

In [None]:
## Variable recoding
# Recode f.4803 for every instance: the four "Yes, ..." categories become "Yes" and "No, never" becomes "No". "Do not know" is kept as it is and "Prefer not to answer" is set to NA

data_clean$f.4803.0.0_recode <- revalue(data_clean$f.4803.0.0, c("Yes, now most or all of the time"="Yes", "Yes, now a lot of the time"="Yes", "Yes, now some of the time"="Yes", "Yes, but not now, but have in the past"="Yes","No, never"="No","Prefer not to answer"=NA,"Do not know"="Do not know"))
summary(data_clean$f.4803.0.0_recode)
data_clean$f.4803.1.0_recode <- revalue(data_clean$f.4803.1.0, c("Yes, now most or all of the time"="Yes", "Yes, now a lot of the time"="Yes", "Yes, now some of the time"="Yes", "Yes, but not now, but have in the past"="Yes","No, never"="No","Prefer not to answer"=NA,"Do not know"="Do not know"))
summary(data_clean$f.4803.1.0_recode)
data_clean$f.4803.2.0_recode <- revalue(data_clean$f.4803.2.0, c("Yes, now most or all of the time"="Yes", "Yes, now a lot of the time"="Yes", "Yes, now some of the time"="Yes", "Yes, but not now, but have in the past"="Yes","No, never"="No","Prefer not to answer"=NA,"Do not know"="Do not know"))
summary(data_clean$f.4803.2.0_recode)
data_clean$f.4803.3.0_recode <- revalue(data_clean$f.4803.3.0, c("Yes, now most or all of the time"="Yes", "Yes, now a lot of the time"="Yes", "Yes, now some of the time"="Yes", "Yes, but not now, but have in the past"="Yes","No, never"="No","Prefer not to answer"=NA,"Do not know"="Do not know"))
summary(data_clean$f.4803.3.0_recode) 

# Recode variable noisy workplace f.4825

data_clean$f.4825.0.0_recode <- revalue(data_clean$f.4825.0.0, c("No"="0", "Yes, for less than a year"="1", "Yes, for around 1-5 years"="2",
                                                                         "Yes, for more than 5 years"="3","Prefer not to answer"=NA,"Do not know"=NA ))
data_clean$f.4825.0.0_recode <- ordered(data_clean$f.4825.0.0_recode, levels = c("0", "1", "2", "3"))
table(data_clean$f.4825.0.0_recode)

data_clean$f.4825.1.0_recode <- revalue(data_clean$f.4825.1.0, c("No"="0", "Yes, for less than a year"="1", "Yes, for around 1-5 years"="2",
                                                                         "Yes, for more than 5 years"="3","Prefer not to answer"=NA,"Do not know"=NA ))
data_clean$f.4825.1.0_recode <- ordered(data_clean$f.4825.1.0_recode, levels = c("0", "1", "2", "3"))
table(data_clean$f.4825.1.0_recode)

data_clean$f.4825.2.0_recode <- revalue(data_clean$f.4825.2.0, c("No"="0", "Yes, for less than a year"="1", "Yes, for around 1-5 years"="2",
                                                                         "Yes, for more than 5 years"="3","Prefer not to answer"=NA,"Do not know"=NA ))
data_clean$f.4825.2.0_recode <- ordered(data_clean$f.4825.2.0_recode, levels = c("0", "1", "2", "3"))
table(data_clean$f.4825.2.0_recode)

data_clean$f.4825.3.0_recode <- revalue(data_clean$f.4825.3.0, c("No"="0", "Yes, for less than a year"="1", "Yes, for around 1-5 years"="2",
                                                                         "Yes, for more than 5 years"="3","Prefer not to answer"=NA,"Do not know"=NA ))
data_clean$f.4825.3.0_recode <- ordered(data_clean$f.4825.3.0_recode, levels = c("0", "1", "2", "3"))
table(data_clean$f.4825.3.0_recode)

# Recode variable loud music exposure frequency f.4836

data_clean$f.4836.0.0_recode <- revalue(data_clean$f.4836.0.0, c("No"="0", "Yes, for less than a year"="1", "Yes, for around 1-5 years"="2",
                                                                 "Yes, for more than 5 years"="3","Prefer not to answer"=NA,"Do not know"=NA ))
data_clean$f.4836.0.0_recode <- ordered(data_clean$f.4836.0.0_recode, levels = c("0", "1", "2", "3"))
table(data_clean$f.4836.0.0_recode)

data_clean$f.4836.1.0_recode <- revalue(data_clean$f.4836.1.0, c("No"="0", "Yes, for less than a year"="1", "Yes, for around 1-5 years"="2",
                                                                 "Yes, for more than 5 years"="3","Prefer not to answer"=NA,"Do not know"=NA ))
data_clean$f.4836.1.0_recode <- ordered(data_clean$f.4836.1.0_recode, levels = c("0", "1", "2", "3"))
table(data_clean$f.4836.1.0_recode)

data_clean$f.4836.2.0_recode <- revalue(data_clean$f.4836.2.0, c("No"="0", "Yes, for less than a year"="1", "Yes, for around 1-5 years"="2",
                                                                 "Yes, for more than 5 years"="3","Prefer not to answer"=NA,"Do not know"=NA ))
data_clean$f.4836.2.0_recode <- ordered(data_clean$f.4836.2.0_recode, levels = c("0", "1", "2", "3"))
table(data_clean$f.4836.2.0_recode)

data_clean$f.4836.3.0_recode <- revalue(data_clean$f.4836.3.0, c("No"="0", "Yes, for less than a year"="1", "Yes, for around 1-5 years"="2",
                                                                 "Yes, for more than 5 years"="3","Prefer not to answer"=NA,"Do not know"=NA ))
data_clean$f.4836.3.0_recode <- ordered(data_clean$f.4836.3.0_recode, levels = c("0", "1", "2", "3"))
table(data_clean$f.4836.3.0_recode)

# Genetic Sex variable f.22001

table(data_clean$f.22001.0.0)
data_clean$sex <- revalue(data_clean$f.22001.0.0, c("Male" = '0', 'Female'='1' ))
table(data_clean$sex)
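
The recoding above repeats the same mapping for each of the four instances. A more compact variant is sketched below with a named lookup vector and a loop. It uses base R only, and the toy data frame with two instances is invented for illustration. Unlike `revalue`, `factor()` orders the resulting levels alphabetically, so this is an alternative sketch rather than a drop-in replacement.

```r
# Toy data with two of the four instances (answers invented for illustration)
toy <- data.frame(
  f.4803.0.0 = factor(c("No, never", "Yes, now some of the time", "Prefer not to answer")),
  f.4803.1.0 = factor(c("Yes, now a lot of the time", "Do not know", "No, never"))
)

# Named lookup: original answer -> recoded value (same mapping as above)
tinnitus_map <- c(
  "Yes, now most or all of the time" = "Yes",
  "Yes, now a lot of the time" = "Yes",
  "Yes, now some of the time" = "Yes",
  "Yes, but not now, but have in the past" = "Yes",
  "No, never" = "No",
  "Prefer not to answer" = NA,
  "Do not know" = "Do not know"
)

# Apply the same lookup to every instance column
for (col in c("f.4803.0.0", "f.4803.1.0")) {
  recoded <- unname(tinnitus_map[as.character(toy[[col]])])
  toy[[paste0(col, "_recode")]] <- factor(recoded)
}
```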

### c. Filtering of the tinnitus phenotype

In [None]:
data_clean$cases <- with(data_clean, ifelse(f.4803.0.0_recode == "No" & (f.4803.1.0_recode == "Yes" | f.4803.2.0_recode == "Yes" | f.4803.3.0_recode == "Yes")
                                                      & !(f.4803.0.0_recode == "No" & f.4803.1.0_recode == "Yes" & f.4803.2.0_recode  %in% c("No", "Do not know") & f.4803.3.0_recode %in% c("No", "Do not know",NA)) 
                                                      & !(f.4803.0.0_recode == "No" & f.4803.1.0_recode %in% c("No", "Do not know") & f.4803.2.0_recode == "Yes" & f.4803.3.0_recode %in% c("No", "Do not know"))
                                                      & !(f.4803.0.0_recode == "No" & f.4803.1.0_recode == "Yes" & f.4803.2.0_recode == "Yes" & f.4803.3.0_recode %in% c("No", "Do not know"))
                                                      | (f.4803.0.0_recode %in% c("Yes",NA) & (f.4803.1.0_recode %in% c("Yes",NA) | f.4803.2.0_recode %in% c("Yes",NA) | f.4803.3.0_recode %in% c("Yes",NA))
                                                         & !(f.4803.0.0_recode %in% c("Yes",NA) & (f.4803.1.0_recode %in% c("No", "Do not know") | f.4803.2.0_recode %in% c("No", "Do not know") | f.4803.3.0_recode %in% c("No", "Do not know")))
                                                         & !(f.4803.0.0_recode %in% c(NA) & f.4803.1.0_recode %in% c(NA) & f.4803.2.0_recode %in% c(NA) & f.4803.3.0_recode %in% c(NA))),
                                                      "Yes", NA))
# Number of cases with tinnitus
table(data_clean$cases)

data_clean$controls <- with(data_clean, ifelse(f.4803.0.0_recode %in% c("No",NA) & f.4803.1.0_recode %in% c("No", NA) & f.4803.2.0_recode %in% c("No",NA) & f.4803.3.0_recode %in% c("No",NA)
                                                         & !(f.4803.0.0_recode %in% c(NA) & f.4803.1.0_recode %in% c(NA) & f.4803.2.0_recode %in% c(NA) & f.4803.3.0_recode %in% c(NA)),"No", NA))

# Number of controls without tinnitus
table(data_clean$controls)

# Creates a column with the binary status for tinnitus of the individuals

data_clean$tinnitus <- coalesce(data_clean$cases, data_clean$controls)

table(data_clean$tinnitus)

# Get the number of NAs
length(which(is.na(data_clean$tinnitus)))
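
Because `coalesce` silently gives priority to the first non-missing value, it may be worth checking that nobody is flagged both as a case and as a control before combining the two columns. The sketch below shows such a check on toy vectors (invented for illustration) and uses a base R equivalent of `coalesce` for two vectors.

```r
# Toy case/control columns in the same format as above (values invented)
cases    <- c("Yes", NA, NA, "Yes")
controls <- c(NA, "No", NA, NA)

# Nobody should be flagged as both a case and a control
both <- !is.na(cases) & !is.na(controls)
stopifnot(!any(both))

# Base R equivalent of coalesce(cases, controls) for two vectors
tinnitus <- ifelse(is.na(cases), controls, cases)
table(tinnitus, useNA = "ifany")
```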

### d. Obtaining the age for tinnitus cases and controls

In [None]:
# Get the "age at onset" of tinnitus using f.21003 Age when attended assessment centre for each of the instances
# For cases first time they replied yes to f.4803
# Get the subset of data to extract age

age_all = data_clean %>% 
  filter(!is.na(tinnitus)) %>%
  select(IID,tinnitus, f.4803.0.0_recode, f.4803.1.0_recode, f.4803.2.0_recode, f.4803.3.0_recode, f.21003.0.0, f.21003.1.0, f.21003.2.0, f.21003.3.0)  # data field 21003: Age when attended assessment centre
head(age_all)

library(pander)
res<-head(age_all)
pandoc.table(res)

# Get the subset data of cases
age_cases = age_all %>% 
  filter(tinnitus=="Yes")  %>%
  select(IID,f.4803.0.0_recode,f.4803.1.0_recode,f.4803.2.0_recode,f.4803.3.0_recode,f.21003.0.0,f.21003.1.0,f.21003.2.0,f.21003.3.0)
res<-head(age_cases,12)
pandoc.table(res)

# Get the indices of the columns where the answer was Yes (the first one is used below):
age_cases$visit_idx = apply(age_cases, 1, function(x) unlist(which(x == 'Yes')))

# Define offset:
# offset: refers to the # of columns between the first age column (i.e.f.21003.0.0) and the first recode column (i.e.f.4803.0.0_recode)
offset = which(colnames(age_cases) == 'f.21003.0.0') - which(colnames(age_cases) == 'f.4803.0.0_recode')

# Define the function to extract the first time they said yes for cases 
f=get_age_func <- function(x) {
  visit_index=x[which(colnames(age_cases)=="visit_idx")]
  index=min(unlist(visit_index))+offset
  age=x[index]
  final_age=unlist(age)
  if(is.null(final_age))
  {final_age<-NA}
  return(final_age)
}

# Get the final age for cases
age_cases$age_final = apply(age_cases, 1, f)

# Show first 6 rows
res<-head(age_cases)
pandoc.table(res)
summary(age_cases$age_final)

# Get the subset data of controls
age_control = age_all %>% 
  filter(tinnitus=="No")  %>%
  select(IID,f.4803.0.0_recode,f.4803.1.0_recode,f.4803.2.0_recode,f.4803.3.0_recode,f.21003.0.0,f.21003.1.0,f.21003.2.0,f.21003.3.0)
res<-head(age_control,12)
pandoc.table(res)

# Get the indices of the columns where the answer was No (the last one is used below):
age_control$visit_idx = apply(age_control, 1, function(x) unlist(which(x == 'No')))

# Define offset:
# offset: refers to the # of columns between the first age column (i.e.f.21003.0.0) and the first recode column (i.e.f.4803.0.0_recode)
offset = which(colnames(age_control) == 'f.21003.0.0') - which(colnames(age_control) == 'f.4803.0.0_recode')

# Define the function to extract the last time they said no for control

f=get_age_func <- function(x) {
  visit_index=x[which(colnames(age_control)=="visit_idx")]
  index=max(unlist(visit_index))+offset
  age=x[index]
  age=unlist(age)
  return(age)
}

# Get the final age for controls
age_control$age_final = apply(age_control, 1, f)

# Show first 6 rows
res<-head(age_control)
pandoc.table(res)
summary(age_control$age_final)

# Combine age_cases and age_control by rows
age_tinnitus <- rbind(age_cases, age_control) 
dim(age_tinnitus)

# Merge with the complete dataset, keeping all the rows from the original
data_clean_age = merge(data_clean,age_tinnitus,by="IID", all.x=TRUE)
dim(data_clean_age)
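
The rule used above (age at the first "Yes" for cases, age at the last "No" for controls) can be sketched on a small example. The two-visit toy matrices and the names `ans`, `age`, `status` and `pick_age` are invented for illustration; this is not the code used to produce the final dataset.

```r
# Toy data: answers and ages at two visits (values invented for illustration)
ans <- cbind(v0 = c("No", "Yes", "No"), v1 = c("Yes", "Yes", "No"))
age <- cbind(v0 = c(50, 61, 45), v1 = c(58, 66, 52))
status <- c("Yes", "Yes", "No")  # final case/control status per person

# Cases: age at first "Yes"; controls: age at last "No"
pick_age <- function(i) {
  if (status[i] == "Yes") {
    idx <- which(ans[i, ] == "Yes")
    visit <- if (length(idx) > 0) min(idx) else NA
  } else {
    idx <- which(ans[i, ] == "No")
    visit <- if (length(idx) > 0) max(idx) else NA
  }
  if (is.na(visit)) NA_real_ else age[i, visit]
}

age_final <- vapply(seq_along(status), pick_age, numeric(1))
```

Keeping the answers and the ages in two aligned matrices avoids the column offset bookkeeping, since the visit index selects the same column in both.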

### e. Checking consistency of the f.4825 noisy workplace and filtering

In [None]:
# Extract subset of data only with the recode columns of noisy workplace variable
data_noise <- data_clean_age %>%
  select(IID, f.4825.0.0_recode,f.4825.1.0_recode,f.4825.2.0_recode,f.4825.3.0_recode) 
dim(data_noise)

# Function to extract all the available answers for 4 visits
# and put them in one list
f<-function(x){
  visit<-c()
  for (i in 2:5){
    if (!is.na(x[i]))
    {visit<-c(visit,x[i])}
  }
  if(is.null(visit)){visit=NA}
  else{visit=as.numeric(visit)}
  return (visit)
}

# Apply the above function and remove NAs
data_noise$visit<-apply(data_noise, 1, f)
data_noise<-data_noise %>%
  filter(!is.na(visit)) 
head(data_noise)
dim(data_noise)
                              
# Function to get the final code for noise_wp
f<-function(x){
  l=length(x$visit)
  if (l==1){ # only one answer available
    result=x$visit
  }
  else{ # more than one answer available
    result=x$visit[1]
    for (i in 2:l){
      if (x$visit[i] >= x$visit[i-1]){result=x$visit[i]} # consistent ones
      else {result=NA; break} # inconsistent ones
    }
  }
  return(result)
}

# Apply the above function and remove NAs
data_noise$noise_wp<-apply(data_noise, 1, f)
data_noise<-data_noise %>%
  filter(!is.na(noise_wp)) 
head(data_noise, 12) # note: noise_wp code generated here is numeric, not factor

# Append the noise variable to the data
data_clean_noise = merge(data_clean_age,data_noise,by="IID", all.x=TRUE)
dim(data_clean_noise)
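
The consistency rule applied above (answers must not decrease across visits, and the last available answer is kept) can be written more compactly. The sketch below is one possible formulation on a toy matrix; the function name `final_exposure` and the values are invented for illustration.

```r
# Final exposure code from one person's answers across visits.
# NA answers are ignored; a decrease between visits counts as inconsistent.
final_exposure <- function(answers) {
  seen <- answers[!is.na(answers)]
  if (length(seen) == 0) return(NA_real_)    # never answered
  if (any(diff(seen) < 0)) return(NA_real_)  # exposure went down: inconsistent
  seen[length(seen)]                         # non-decreasing, so last = highest
}

# Toy answers for four people over four visits (values invented)
toy <- rbind(
  c(0, 1, NA, 3),    # consistent, final code 3
  c(2, NA, NA, NA),  # single answer, final code 2
  c(3, 1, NA, NA),   # inconsistent, final code NA
  c(NA, NA, NA, NA)  # no answers, final code NA
)
noise_wp <- apply(toy, 1, final_exposure)
```

Since the function only takes a numeric vector, the same sketch would apply unchanged to the loud music exposure variable.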

### f. Checking consistency of the f.4836 loud music exposure frequency and filtering

In [None]:
# Extract subset of data only with the recode columns of loud music exposure variable f.4836

data_music <-  data_clean_age %>%
  select(IID,f.4836.0.0_recode,f.4836.1.0_recode,f.4836.2.0_recode,f.4836.3.0_recode) 
head(data_music)
dim(data_music)

# Function to extract all the available answers for 4 visits
# and put them in one list

f<-function(x){
  visit<-c()
  for (i in 2:5){
    if (!is.na(x[i]))
    {visit<-c(visit,x[i])}
  }
  if(is.null(visit)){visit=NA}
  else{visit=as.numeric(visit)}
  return (visit)
}

# Apply the above function and remove NAs
                              
data_music$visit<-apply(data_music, 1, f)
data_music<-data_music %>%
  filter(!is.na(visit)) 
head(data_music)
dim(data_music)
                              
# Function to get the final code for "loud_music"
f<-function(x){
  l=length(x$visit)
  if (l==1){ # only one answer available
    result=x$visit
  }
  else{ # more than one answer available
    result=x$visit[1]
    for (i in 2:l){
      if (x$visit[i] >= x$visit[i-1]){result=x$visit[i]} # consistent ones
      else {result=NA; break} # inconsistent ones
    }
  }
  return(result)
}

# Apply the above function and remove NAs
data_music$loud_music<-apply(data_music, 1, f)
data_music<-data_music %>%
  filter(!is.na(loud_music)) 
head(data_music, 12) # note: loud_music code generated here is numeric, not factor
dim(data_music)

# Merge all of the variables in the final dataset
data_clean_final = merge(data_clean_noise,data_music,by="IID", all.x=TRUE)
dim(data_clean_final)                            

### g. Exporting the final phenotype file (Tinnitus) for association analyses

In [None]:
# Last renaming and recoding
data_clean_final$tinnitus <- revalue(data_clean_final$tinnitus, c("No" = '0', 'Yes'='1' ))
names(data_clean_final)[names(data_clean_final) == "age_final"] <- "age"

# Creating the file for subsequent association analyses

tinnitus_df <- data_clean_final %>%
  filter(!is.na(tinnitus)) %>%
  select(FID, IID, age, sex, tinnitus, noise_wp, loud_music)
head(tinnitus_df)
dim(tinnitus_df)

# Export to file in correct format

write.table(tinnitus_df, 'Tinnitus_UKBB_f4803_071620', quote = FALSE, row.names = FALSE)

## 2. SRT phenotype (quantitative)

The phenotypes to be used are as follows:
1. Left ear f.20019
2. Right ear f.20021
3. Best ear (create a new variable extracting the min SRT value among f.20019 and f.20021)
4. Worst ear (create a new variable extracting the max SRT value among f.20019 and f.20021)

Age is calculated as follows:

- For people with repeated measures take age at last visit and measurement at last visit
- For people with only one measure take age at that visit

Sex corresponds to f.22001 (genetic sex):

- Male = 0
- Female = 1

Noise variable and loud music exposure frequency are recoded as:

- "No" = 0
- "Yes, for less than a year" = 1
- "Yes, for around 1-5 years" = 2
- "Yes, for more than 5 years" = 3
- "Prefer not to answer" = NA
- "Do not know" = NA
                
1. Remove inconsistent individuals 
    - reported 1, 2 or 3 and then 0 in a following visit
    - reported a higher exposure (e.g. 3) and then a lower one (e.g. 1 or 2) in a following visit
2. Retain consistent individuals and use highest reported exposure
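
The best ear, worst ear and age definitions above can be sketched as follows. The column names (`left.0`, `right.0`, `age.0`, ...), the restriction to two visits and all values are assumptions made for illustration. The sketch also assumes that a visit counts as measured when at least one ear has a value, and that a single available ear is used as both best and worst.

```r
# Toy SRT values for left/right ear at two visits (values invented)
srt <- data.frame(
  IID = c(1, 2, 3),
  left.0 = c(-7.5, -6.0, NA), right.0 = c(-8.0, -5.5, -4.0),
  left.1 = c(NA, -6.5, NA),   right.1 = c(NA, -7.0, NA),
  age.0 = c(55, 60, 48),      age.1 = c(NA, 65, NA)
)

# Use the last visit at which either ear was measured
has_v1 <- !is.na(srt$left.1) | !is.na(srt$right.1)
left  <- ifelse(has_v1, srt$left.1,  srt$left.0)
right <- ifelse(has_v1, srt$right.1, srt$right.0)
srt$age <- ifelse(has_v1, srt$age.1, srt$age.0)

# Best ear = min SRT, worst ear = max SRT, ignoring a missing ear
srt$srt_best  <- pmin(left, right, na.rm = TRUE)
srt$srt_worst <- pmax(left, right, na.rm = TRUE)
```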