### Purpose

For the analysis we need to construct the phenotype and covariate data. This notebook loads:

* PCs from the MRC IEU UK Biobank file
* Withdrawal IDs, so that we can exclude participants who have withdrawn consent
* A linker between our application (41382) and the IEU common IDs
* Data extracted from application 41382 on BMI, gender and year of birth (YoB)

It then attaches the IEU PCs to the records from our application and converts our IIDs to the IID format used in the genetic file.

In [26]:
from miscSupports import terminal_time, flip_list
from csvObject import CsvObject, write_csv
import statsmodels.formula.api as smf
from pathlib import Path
import pandas as pd

common_path = r"Z:\UKB\GeographicID\Paper Data Extraction\Construction Files"
project_path = r"Z:\UKB\GeographicID\Paper Data Extraction\SB_Papers\SW_GWAS"

print(f"Set Environment {terminal_time()}")

Set Environment 10:38


In [2]:
# Load the principal components, set them as IID: [PC1, PC2, ... PC(N)]
pca_file = CsvObject(Path(common_path, "IEU_PCs_1_40.txt"), file_headers=False)
pc_dict = {row[0]: [r for r in row[2:] if r != ""] for row in pca_file.row_data}

# Load removal file for exclusion on IID
removal_ids = CsvObject(Path(common_path, "UKB_Withdrawal_IDs.csv"), set_columns=True)

# Set a linker that maps application 41382 IDs to IEU common IDs
linker = {app: ieu for ieu, app in CsvObject(Path(common_path, "Linker.csv")).row_data}

# Load phenotype (BMI), gender, and year of birth from the data extraction from 41382
variables = CsvObject(Path(project_path, "Variables.csv"))

print(f"Loaded required files {terminal_time()}")

Loaded required files 9:59


In [32]:
analysis_rows = []
for iid, gender, yob, phenotype in variables.row_data:
    
    # Keep the iid only if it is not withdrawn, is in the linker file between our applications,
    # and has no missing gender, year of birth or phenotype
    if (iid not in removal_ids[0]) and (iid in linker.keys()) and (gender != "") and (yob != "") and (phenotype != ""):
        # Extract the PC via the linker and then append to the output container
        pcs = pc_dict[linker[iid]]
        analysis_rows.append([linker[iid], linker[iid], phenotype, gender, yob] + pcs)
        
print(f"Set IID file {terminal_time()}")

Set IID file 10:45
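
The filtering and linking logic in the cell above can be illustrated on toy data. This is a minimal sketch that uses plain dictionaries and sets rather than `CsvObject`; the IDs, the values and the helper name `build_analysis_rows` are hypothetical and only mirror the structure of the loop.

```python
def build_analysis_rows(variable_rows, withdrawn, linker, pc_dict):
    """Keep complete, consented, linkable rows and re-key them on the IEU ID."""
    rows = []
    for iid, gender, yob, phenotype in variable_rows:
        # Skip withdrawn participants and IDs without a link to the IEU common IDs
        if iid in withdrawn or iid not in linker:
            continue
        # Skip rows with any missing value
        if "" in (gender, yob, phenotype):
            continue
        ieu_id = linker[iid]
        # FID and IID are both set to the IEU ID, followed by phenotype, covariates and PCs
        rows.append([ieu_id, ieu_id, phenotype, gender, yob] + pc_dict[ieu_id])
    return rows


# Hypothetical toy inputs: application ID -> IEU ID, and IEU ID -> PCs
toy_linker = {"a1": "i1", "a2": "i2", "a3": "i3"}
toy_pcs = {"i1": ["0.1", "0.2"], "i2": ["0.3", "0.4"], "i3": ["0.5", "0.6"]}
toy_withdrawn = {"a2"}
toy_variables = [
    ["a1", "1", "1960", "24.5"],  # kept
    ["a2", "0", "1955", "27.1"],  # dropped: withdrawn
    ["a3", "1", "", "30.2"],      # dropped: missing year of birth
    ["a4", "0", "1970", "22.0"],  # dropped: not in the linker
]

toy_rows = build_analysis_rows(toy_variables, toy_withdrawn, toy_linker, toy_pcs)
print(toy_rows)
```

Only the first toy record survives, and it is keyed on the IEU ID rather than the application ID.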


In [34]:
# Validate header length == row length
headers = ["FID", "IID", "BMI", "Gender", "YoB"] + [f"PC{i}" for i in range(1, 41)]
print(f"Header length is {len(headers)}: Row length is {len(analysis_rows[0])}")

Header length is 45: Row length is 45
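
The check above compares the header against the first row only. Because empty PC values are filtered out when `pc_dict` is built, a row could in principle be shorter than the header. A minimal sketch of a stricter check that inspects every row follows; `find_bad_rows` is a hypothetical helper, not part of `csvObject`, and the toy data is illustrative.

```python
def find_bad_rows(headers, rows):
    """Return the indexes of rows whose length differs from the header length."""
    expected = len(headers)
    return [i for i, row in enumerate(rows) if len(row) != expected]


toy_headers = ["FID", "IID", "BMI", "Gender", "YoB", "PC1", "PC2"]
toy_rows = [
    ["i1", "i1", "24.5", "1", "1960", "0.1", "0.2"],
    ["i2", "i2", "27.1", "0", "1955", "0.3"],  # one PC short
]
bad = find_bad_rows(toy_headers, toy_rows)
print(f"Rows with unexpected length: {bad}")
```

In the notebook the same idea would be applied as `find_bad_rows(headers, analysis_rows)`, with an empty list indicating that every row matches the header.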


In [35]:
# Write the file to disk
write_csv(project_path, "Analysis", headers, analysis_rows)
print(f"Output written at {terminal_time()}")

Output written at 10:55


### Create phenotypic residuals

For several of our analysis runs we need the residualised phenotype (BMI after regressing out gender, year of birth and the 40 PCs), so we generate it here.

In [36]:
# Load analysis sample
analysis = pd.read_csv(Path(project_path, "Analysis.csv"))
print(f"Set analysis data-frame {terminal_time()}")

Set analysis data-frame 10:58


In [37]:
# Set the formula as phenotype ~ all other explanatory variables and set up the OLS model (it is fitted in the next cell)
formula = "BMI~" + "+".join([h for h in analysis.columns[3:]])
result = smf.ols(formula=formula, data=analysis)
print(f"Set residualised phenotype {terminal_time()}")

Set residualised phenotype 11:6


In [40]:
fid = analysis["FID"].tolist()
iid = analysis["IID"].tolist()
res = result.fit().resid.tolist()

write_csv(project_path, "PhenoResiduals", ["FID", "IID", "RES"], flip_list([fid, iid, res]))
print(f"Written residualised phenotypes {terminal_time()}")

Written residualised phenotypes 11:39
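
To make explicit what the residualised phenotype is: each residual is the observed value minus the value predicted from the covariates. The following minimal sketch works this through in plain Python for a single covariate. It is an illustration only, not the `statsmodels` fit used above, and the function name and toy values are hypothetical.

```python
def simple_ols_residuals(y, x):
    """Residuals from regressing y on an intercept and one covariate x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the fitted line passes through the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical toy values: year of birth as the covariate, BMI as the phenotype
toy_yob = [1950.0, 1955.0, 1960.0, 1965.0, 1970.0]
toy_bmi = [28.0, 27.5, 26.0, 26.5, 24.0]
toy_res = simple_ols_residuals(toy_bmi, toy_yob)

# OLS residuals sum to zero and are uncorrelated with the covariate (up to rounding)
print(abs(sum(toy_res)) < 1e-9)
print(abs(sum(r * x for r, x in zip(toy_res, toy_yob))) < 1e-6)
```

The notebook's model does the same thing with gender, year of birth and all 40 PCs as covariates at once, which is why the residuals can be used as a covariate-adjusted phenotype.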
