# PredLMM Manual

*Version 1.0*

PredLMM Team, August 8, 2020

### Introduction
PredLMM, which stands for Predictive Process approximated Linear Mixed Model, is a program for rapid SNP-based heritability estimation with a large number of genetically related individuals.

See PredLMM's README.md for installation instructions, documentation, code, and a bibliography.

### Contacts

Email one of the developers at slsouvik@gmail.com.
Open an issue on GitHub.


### Citing PredLMM

If you use PredLMM in any published work, please cite the main manuscript.

### Data Description

We use the phenotype file "example_pheno.csv" and the covariate file "example_covar.csv". Each file has 5000 rows, one per individual, and 3 columns, of which the first two hold the family and individual IDs. The third column of the phenotype file contains the phenotype vector. The covariate file has a single covariate (third column) of all 1's, which serves as the intercept term. The GRM files are computed from the binary files using GCTA and are saved under the name "example_grm".

### Notebook preparation and general use

We start by loading the PredLMM module.

In [1]:
#-----------------------Loading  the required Module------------------------------------------
from PredLMM.PredLMM_final import *

Next we load the GCTA-GRM files and construct the $N \times N$ Genetic Relationship Matrix under the name: GRM_array. 

In [2]:
#-----------------------Loading the GRM obtained using GCTA---------------------------------------------------
prefix = "Data/example_grm"
def sum_n_vec(n):
    out = [int(0)] * n
    for i in range(n):
        out[i] = int(((i + 1) * (i + 2) / 2) - 1)
    return(out)


def ReadGRMBin(prefix, AllN = False):
    BinFileName  = prefix + ".grm.bin"
    NFileName = prefix + ".grm.N.bin"
    IDFileName = prefix + ".grm.id"
    dt = np.dtype('f4') # Relatedness is stored as a float of size 4 in the binary file
    entry_format = 'f' # N is stored as a float in the binary file
    entry_size = calcsize(entry_format)
    ## Read IDs
    ids = pd.read_csv(IDFileName, sep = '\t', header = None)
    ids_vec = ids.iloc[:,1]
    n = len(ids.index)
    ids_diag = ['NA' for x in range(n)]
    n_off = int(n * (n - 1) / 2)
    ## Read relatedness values
    grm = np.fromfile(BinFileName, dtype = dt)
    i = sum_n_vec(n)
    out = {'diag': grm[i], 'off': np.delete(grm, i),'id': ids}
    return(out)


G = ReadGRMBin(prefix)
N = len(G['diag'])
# Assemble the symmetric N x N matrix from the packed diagonal and off-diagonal entries.
GRM_array = np.zeros((N, N))
idx = np.tril_indices(N, -1, N)
idy = np.triu_indices(N, 1, N)
id_diag = np.diag_indices(N)
GRM_array[idx] = G['off']
GRM_array[id_diag] = G['diag']
GRM_array[idy] = GRM_array.T[idy]
GRM_array = np.float32(GRM_array)
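
The cell above relies on the GCTA binary storing the lower triangle of the GRM row by row, diagonal included, so that the diagonal entry of row $i$ (counting from 0) sits at position $(i+1)(i+2)/2 - 1$ of the packed vector. A minimal sketch of that layout on a toy $3 \times 3$ matrix follows; the helper `unpack_lower_triangle` and the toy values are hypothetical and not part of PredLMM or GCTA.

```python
import numpy as np


def unpack_lower_triangle(packed, n):
    """Rebuild a symmetric n x n matrix from its row-wise packed lower triangle.

    Hypothetical helper for illustration; not part of PredLMM or GCTA.
    """
    packed = np.asarray(packed, dtype=np.float32)
    if packed.size != n * (n + 1) // 2:
        raise ValueError("packed length does not match n")
    full = np.zeros((n, n), dtype=np.float32)
    # np.tril_indices walks the lower triangle row by row, diagonal included,
    # which matches the packed order (0,0), (1,0), (1,1), (2,0), ...
    full[np.tril_indices(n)] = packed
    # Mirror the strictly lower triangle into the upper triangle.
    upper = np.triu_indices(n, 1)
    full[upper] = full.T[upper]
    return full


# Toy packed triangle for n = 3: rows are [1.0], [0.2, 1.1], [0.3, 0.4, 0.9].
toy = unpack_lower_triangle([1.0, 0.2, 1.1, 0.3, 0.4, 0.9], 3)
# Positions of the diagonal entries inside the packed vector.
diag_positions = [(i + 1) * (i + 2) // 2 - 1 for i in range(3)]
```

Here `diag_positions` plays the role of `sum_n_vec(3)`: removing those positions from the packed vector leaves the off-diagonal entries in the order expected by `np.tril_indices(N, -1)`.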

Loading the GRM in the above way for every trait is very time-consuming, especially when $N$ is large ($N > 40{,}000$). The following few lines of code can be used to save the loaded GRM in HDF5 format using h5py.

Then, to analyze each trait, one only needs to load this HDF5 file to reconstruct the GRM (the code for both steps is provided, commented out).

In [3]:
#-----------------------convert the GRM to h5py format for faster loading------------------- 
#hf = h5py.File('Data/example_grm.h5', 'w')
#hf.create_dataset('GRM', data=GRM_array)  # dataset name must match the one read back below
#hf.close()

#-----------------------loading GRM in h5py format------------------------------------------- 
#hf = h5py.File('Data/example_grm.h5', 'r')
#GRM_array= np.array(hf.get('GRM'),dtype="float32")
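
As a quick check of the save and load steps, the following sketch writes a small stand-in matrix to a temporary HDF5 file and reads it back. The file name `demo_grm.h5` and the array `grm_demo` are hypothetical; the point is only that the dataset name used when writing must be the one used when reading.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for GRM_array; the real matrix is N x N.
grm_demo = np.eye(4, dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_grm.h5")  # hypothetical file name
    # Write: the dataset name chosen here must be reused when reading.
    with h5py.File(path, "w") as hf:
        hf.create_dataset("GRM", data=grm_demo)
    # Read the dataset back into memory as a float32 array.
    with h5py.File(path, "r") as hf:
        grm_loaded = np.array(hf["GRM"], dtype="float32")
```

Using `with` blocks closes the file even if an error occurs, which avoids leaving a locked or partially written file behind.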

Next, load and create the phenotype (y) and covariate vectors (X).

In [4]:
#----------------------loading the phenotype and covariate data----------------------------
phenotypes = np.loadtxt("Data/example_pheno.csv",skiprows=1)
covariates = np.loadtxt("Data/example_covar.csv",delimiter=",",skiprows=1)
y = phenotypes[:,2]
X = np.delete(covariates,[0,1],axis=1)
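
To make the file layout concrete, here is a small sketch that parses an in-memory covariate file with the same three-column structure. The column names and ID values are invented for illustration, and the sketch assumes numeric IDs, since `np.loadtxt` cannot parse text IDs into a float array.

```python
import io

import numpy as np

# Hypothetical three-row covariate file: family ID, individual ID, intercept.
toy_covar = io.StringIO("FID,IID,intercept\n1,101,1\n1,102,1\n2,201,1\n")
covar_demo = np.loadtxt(toy_covar, delimiter=",", skiprows=1)
# Drop the two ID columns; what remains is the n x p design matrix.
X_demo = np.delete(covar_demo, [0, 1], axis=1)
```

`X_demo` keeps its two-dimensional shape even with a single covariate, which is what the later matrix products expect.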

We select a random subsample of individuals (to be used as the set of knots), reorder y, X and GRM_array so that these individuals come first, and extract the corresponding entries.


In [5]:
#----------------------Knot selection and selecting corresponding vectors----------------------------
subsample_size = 500;
sub_sample = sorted(np.random.choice(range(0,N),subsample_size,replace=False))
non_subsample = np.setdiff1d(range(0,N),sub_sample)
indices = np.hstack((sub_sample,non_subsample))
GRM_array = np.float32(GRM_array[np.ix_(indices,indices)].T)
y = y[indices]; X=X[indices]; X_T = X.T;

G_selected = GRM_array[range(0,subsample_size),:][:,range(0,subsample_size)]
y_sub = y[range(0,subsample_size)]; X_sub=X[range(0,subsample_size)]; X_subT=X_sub.T
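
The reordering step can be checked on a toy example. The sketch below uses hypothetical sizes and a random symmetric matrix in place of the GRM, and confirms that after permuting rows and columns together the knot-by-knot block sits in the top-left corner.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo, k_demo = 6, 2  # toy sizes, far smaller than N = 5000 and 500 knots

# Toy symmetric matrix standing in for the GRM.
B = rng.normal(size=(n_demo, n_demo))
G_demo = B @ B.T

knots = np.sort(rng.choice(n_demo, size=k_demo, replace=False))
rest = np.setdiff1d(np.arange(n_demo), knots)
order = np.hstack((knots, rest))

# Apply the same permutation to rows and columns so that the knots
# occupy the top-left block.
G_perm = G_demo[np.ix_(order, order)]
G_knots = G_perm[:k_demo, :k_demo]
```

Any vector or matrix indexed by individual (such as y and X) must be permuted with the same `order`, otherwise its rows no longer line up with the GRM.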

Next, we fit an LMM to the selected subsample to estimate heritability ($h^2$) and variance ($\sigma^2$). The first two elements of the "result_subsample" vector store the subsample-based heritability and variance estimates, respectively. The third element is the time taken for convergence.

In [6]:
#------------------Fitting LMM using only the selected subsample (set of knots)-------------------------
A_selc = np.copy(G_selected)-Identity(subsample_size)
result_subsample = derivative_minim_sub(y_sub, X_sub, X_subT, G_selected, A_selc, subsample_size)
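
The line `A_selc = G_selected - Identity(subsample_size)` suggests that the covariance is parameterized as $\sigma^2 (I + h^2 A)$, that is, $\sigma^2 (h^2 G + (1 - h^2) I)$. This is a reading of the code rather than a statement from the PredLMM documentation. Under that assumption, the sketch below estimates $h^2$ on simulated data with a plain grid search over the profile likelihood. It is not how `derivative_minim_sub` works internally; the function `profile_loglik` and all data here are hypothetical.

```python
import numpy as np


def profile_loglik(h2, y, X, G):
    """Profile log-likelihood of y ~ N(X beta, sigma2 * (h2*G + (1-h2)*I)).

    beta and sigma2 are replaced by their maximum-likelihood values
    for the given h2. Hypothetical helper, not part of PredLMM.
    """
    n = y.shape[0]
    V = h2 * G + (1.0 - h2) * np.eye(n)
    V_inv_X = np.linalg.solve(V, X)
    V_inv_y = np.linalg.solve(V, y)
    beta = np.linalg.solve(X.T @ V_inv_X, X.T @ V_inv_y)
    resid = y - X @ beta
    sigma2 = resid @ np.linalg.solve(V, resid) / n
    _, logdet = np.linalg.slogdet(V)
    loglik = -0.5 * (n * np.log(2.0 * np.pi * sigma2) + logdet + n)
    return loglik, sigma2


# Simulated toy data (hypothetical; not the example files).
rng = np.random.default_rng(1)
n = 200
Z = rng.normal(size=(n, 400))
G = Z @ Z.T / 400                    # toy relationship matrix, diagonal close to 1
X = np.ones((n, 1))                  # intercept only
L = np.linalg.cholesky(0.5 * G + 0.5 * np.eye(n))
y = 2.0 + L @ rng.normal(size=n)     # simulated with h2 = 0.5 and sigma2 = 1

grid = np.linspace(0.01, 0.99, 99)
fits = [profile_loglik(h, y, X, G) for h in grid]
best = int(np.argmax([ll for ll, _ in fits]))
h2_hat, sigma2_hat = grid[best], fits[best][1]
```

With so few, nearly unrelated simulated individuals the estimate `h2_hat` is noisy; the sketch is meant to show the structure of the likelihood, not to reproduce the numbers in this manual.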

We then fit the PredLMM likelihood to estimate heritability ($h^2$) and variance ($\sigma^2$). The first two elements of the "result_full" vector store the PredLMM heritability and variance estimates, respectively. The third element is the time taken for convergence.

In [7]:
#------------------Running PredLMM----------------------------------------------------------------------
Ct =  np.copy(GRM_array[range(0,subsample_size),:],order='F')
C12 = Ct[:,range(subsample_size,N)]
id_diag = np.diag_indices(N)
diag_G_sub = GRM_array[id_diag]
G_inv = inv(G_selected).T
GRM_array[np.ix_(range(subsample_size,N),range(subsample_size,N))] = sgemm(alpha=1,a=C12.T,b=sgemm(alpha=1,a=G_inv,b=C12))
del G_inv, C12
add = copy(-GRM_array[id_diag] + diag_G_sub) ## diagonal adjustment
np.fill_diagonal(GRM_array, - 1 + diag_G_sub)

In [8]:
result_full = derivative_minim_full(y, X, X_T, Ct, id_diag, add, G_selected, GRM_array, N)
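
One way to read the preparation cell above: the block of the GRM among non-knot individuals is replaced by the low-rank product $C_{12}^T G_{kk}^{-1} C_{12}$, and the diagonal is then adjusted so that it still matches the original GRM. The sketch below reproduces that construction on a toy matrix with plain NumPy. All names and sizes are hypothetical, and it builds the approximated matrix explicitly, which the PredLMM routines may avoid for efficiency.

```python
import numpy as np

rng = np.random.default_rng(2)
n_demo, k_demo = 8, 3  # toy sizes; knots are assumed to come first

# Toy positive definite matrix standing in for the reordered GRM.
B = rng.normal(size=(n_demo, 20))
G_demo = B @ B.T / 20

C = G_demo[:k_demo, :]            # knot rows: k x n cross-relatedness block
G_kk = G_demo[:k_demo, :k_demo]   # knot-by-knot block

# Low-rank part: C' G_kk^{-1} C reproduces every block that involves a knot.
low_rank = C.T @ np.linalg.solve(G_kk, C)

# Diagonal adjustment: put back the variance the low-rank part misses.
adjust = np.diag(G_demo) - np.diag(low_rank)
G_approx = low_rank + np.diag(adjust)
```

In this sketch `adjust` corresponds to the vector `add` in the cell above: it is zero for the knots and non-negative for everyone else.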

Finally, we stack both sets of estimates into the final result.

In [9]:
final_result = np.hstack((result_subsample,result_full))
print(final_result)

[0.83936828 0.97270912 0.13463879 0.79496932 1.02671802 2.99898219]
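
Since the stacked vector carries no labels, it can help to unpack it into named entries. A minimal sketch, using the values printed above and hypothetical label names:

```python
import numpy as np

# Values copied from the printed output above.
final_demo = np.array([0.83936828, 0.97270912, 0.13463879,
                       0.79496932, 1.02671802, 2.99898219])
labels = ["h2", "sigma2", "time"]
# First three entries come from the subsample fit, last three from PredLMM.
summary = {
    "subsample": dict(zip(labels, final_demo[:3])),
    "PredLMM": dict(zip(labels, final_demo[3:])),
}
```

For example, `summary["PredLMM"]["h2"]` returns the PredLMM heritability estimate.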

