# Selection of other white participants for association analysis

This notebook shows how new participants are selected using the June 2020 database.

First, the database needs to be filtered so that it is easier to work with in Jupyter notebooks.

In [None]:
# I could not work with the oct2020 database because it lacks many of the variables I need
# The merged version (including all previous requests) also has problems and contains only 70K individuals
cd /gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/ukb43978_OCT2020
./ukbconv ukb43978.enc_ukb r -i/home/dc2325/project/HI_UKBB/selectvars_101120.txt -o/home/dc2325/project/HI_UKBB/ukb42495_subset101120

In [8]:
#Load libraries
library(plyr)
library(tidyverse)
library(pander)
library(ggpubr)
library(rapportools)
library(ggplot2)
#Get working directory
getwd()
#Set working directory
setwd('/home/dc2325/project/HI_UKBB/ukb42495_updatedJune2020')

In [14]:
# Clean workspace
rm(list=ls())

In [15]:
# The filtering is therefore done using the June2020 version
# Run the script that imports the data into R
source("ukb42495_subset080420.r")
nrow(bd)

In [16]:
# List of individuals with qc'ed genotypic files
df.geno <- read.table("/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.fam", header= FALSE, stringsAsFactors = FALSE)
names(df.geno) <-c("FID","IID","ignore1", "ignore2", "ignore3", "ignore4")
nrow(df.geno)

In [17]:
head(bd[,1, drop=FALSE])

,f.eid
,<int>
1,6025442
2,1000019
3,1000022
4,1000035
5,1000046
6,1000054


In [18]:
# Rename the f.eid column of bd to IID so that it matches the ID column of df.geno
names(bd)[1] <- "IID"
head(bd[,1, drop=FALSE])

,IID
,<int>
1,6025442
2,1000019
3,1000022
4,1000035
5,1000046
6,1000054


In [19]:
# Merge the two data frames
df.gen.phen <-merge(df.geno, bd, by="IID", all=FALSE)
nrow(df.gen.phen)
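
The merge above is an inner join on `IID`: only individuals present in both the genotype list and the phenotype table are kept. The sketch below shows the same idea in pandas on toy data, and also counts how many individuals each side loses. The IDs, the column name `f.31.0.0` as used here, and the variable names are illustrative assumptions, not values from the real files.

```python
import pandas as pd

# Toy stand-ins for df.geno (.fam columns) and bd (phenotypes); all IDs are made up
geno = pd.DataFrame({"FID": [1, 2, 3], "IID": [1, 2, 3]})
pheno = pd.DataFrame({"IID": [2, 3, 4], "f.31.0.0": [0, 1, 1]})

# how="inner" keeps only IIDs found in both tables, like all=FALSE in R's merge();
# validate="one_to_one" raises an error if an IID is duplicated on either side
merged = geno.merge(pheno, on="IID", how="inner", validate="one_to_one")

# Count the individuals that each table loses in the join
n_geno_only = int((~geno["IID"].isin(pheno["IID"])).sum())
n_pheno_only = int((~pheno["IID"].isin(geno["IID"])).sum())

print(len(merged), n_geno_only, n_pheno_only)
```

Comparing the merged row count with the two "only" counts is a quick way to check that no individuals were dropped unexpectedly.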

In [20]:
# Save the merged data frame as a csv file
write.csv(df.gen.phen,'120120_UKBB_HI_expandedwhite_genotypeqc.csv', row.names = FALSE)

## Exclusion criteria based on ICD10, ICD9 codes and self-report
Apply the exclusion criteria defined by the group to remove individuals who should not be part of the analysis. The criteria use ICD10 codes, ICD9 codes (not present in the June 2020 database) and self-reported conditions (f.20002). The list of excluded codes is available [here](https://docs.google.com/spreadsheets/d/12L7Cx4Ov8FppGVmG0DxL9uG-lVRHM5QJSea0nORyirQ/edit#gid=0).

In [21]:
# Get the list of removed individuals. Each line of the pattern file should have the form \bstring\b so that it is matched as a whole word by -w
cd /home/dc2325/project/HI_UKBB/ukb42495_updatedJune2020




In [22]:
# Build the clean database containing only the included individuals
grep -wv -f 200713_ICDcodes_exclusion.txt 120120_UKBB_HI_expandedwhite_genotypeqc.csv > 120120_UKBB_HI_expandedwhite_genotypeqc_excr.csv
# Count the remaining lines (the count also includes the csv header line)
wc -l < 120120_UKBB_HI_expandedwhite_genotypeqc_excr.csv

396977
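
The `grep -wv -f` step drops every row in which any exclusion code appears as a whole word. A minimal Python sketch of the same logic is given below, on toy rows. The codes, IDs and column names are made up for illustration, and `make_excluder` and `split_rows` are hypothetical helper names, not part of the original pipeline.

```python
import re

def make_excluder(codes):
    """Compile one regex that matches any code as a whole word (similar to grep -w)."""
    alternatives = "|".join(re.escape(code) for code in codes)
    return re.compile(r"\b(?:" + alternatives + r")\b")

def split_rows(rows, codes):
    """Split data rows into (kept, removed) depending on whether a code matches."""
    excluder = make_excluder(codes)
    kept, removed = [], []
    for row in rows:
        (removed if excluder.search(row) else kept).append(row)
    return kept, removed

# Hypothetical exclusion codes and data rows (header handled separately)
codes = ["H905", "1244"]
rows = [
    "1000019,H905,1065",   # removed: ICD10 code matches
    "1000022,J459,1244",   # removed: self-report code matches
    "1000035,K219,1111",   # kept: no code present
    "1000046,H9051,1065",  # kept: H9051 is a different word than H905
]

kept, removed = split_rows(rows, codes)
print(len(kept), len(removed))
```

Because the match runs over the whole line, a numeric code can also hit an unrelated field that happens to hold the same number. Restricting the search to the diagnosis and self-report columns would avoid this, at the cost of parsing the csv first.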



## Using dask to filter huge UKBB database

In [46]:
import dask.dataframe as dd
df = dd.read_csv("/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/UKB_combinedreqs_101920/combinedUKB_120320_43978.csv", sep=',', sample=1000000, dtype={'eid':int})

In [None]:
#df.iloc[:, [1, 0]]
df = df.set_index('eid')

In [None]:
import pandas as pd

# Create a list of individual IDs to be selected
colnames = ['FID', 'IID']
ind_to_select = pd.read_csv("/home/dc2325/scratch60/pca/exome_IID_188448_outliers_removed.txt", sep=' ', names=colnames, header=None, dtype='object')
ind_to_select.head()
# Convert the IDs to int so that they match the integer 'eid' index of df
IID = ind_to_select['IID'].astype(int).tolist()
IID[0]

In [25]:
# Create a list of variables to select
columns_to_select = pd.read_csv("/home/dc2325/scratch60/pca/selectvars_101120.txt", header=None)
columns_to_select 
columns_to_select = columns_to_select[0].tolist()
columns_to_select[0]

31

In [None]:
#Generate a new dask dataframe containing only the selected indices
df_id = df.loc[IID]

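# Not run: select the columns whose name contains one of the field IDs (substring match, so field 31 would also match 131)
#df_id_vars = df_id[df_id.columns[df_id.columns.str.contains('|'.join(map(str, columns_to_select)))].tolist()]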
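
Selecting rows by ID and columns by field number has two easy pitfalls: the IDs must have the same type as the index, and a substring match on field numbers can pick up unrelated fields. The sketch below illustrates one way to handle both, using plain pandas on toy data so it runs without the real file. The column naming pattern `<field>-<instance>.<array>` (for example `31-0.0`) is an assumption about the csv, and all IDs and values are made up.

```python
import pandas as pd

# Toy table indexed by integer eid; column names assume the "<field>-<instance>.<array>" pattern
df = pd.DataFrame(
    {
        "eid": [1000019, 1000022, 1000035],
        "31-0.0": [0, 1, 1],
        "131-0.0": [5, 6, 7],
        "34-0.0": [1950, 1960, 1970],
    }
).set_index("eid")

iid_strings = ["1000022", "1000035", "9999999"]  # IDs read as strings; the last one is absent
fields = [31, 34]                                # field IDs read as integers

# Cast the IDs to int so they can match the integer index
iid = [int(x) for x in iid_strings]

# Match the field ID exactly (the part before the dash), so 31 does not select 131
wanted = {str(f) for f in fields}
cols = [c for c in df.columns if c.split("-")[0] in wanted]

# isin() tolerates IDs missing from the table, whereas .loc with a list may raise KeyError
subset = df.loc[df.index.isin(iid), cols]
print(subset.shape)
```

The same `isin` and column-list operations are also available on dask dataframes, although behaviour and performance on the full database would need to be checked separately.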