## Introduction

While [BioFindr][1] is developed primarily for causal inference from genomics and transcriptomics data, it also supports association analysis between the two. In association analysis, genetic effects on the transcriptome are measured by testing whether genes are differentially expressed between groups of samples defined by the genotype of a genetic variant of interest. In [BioFindr][1], the significance of an association is computed using a [categorical model](https://tmichoel.github.io/BioFindr.jl/dev/realLLR/#prim_test_realLLR) and a variant-specific background distribution. As in the [coexpression analysis tutorial](coexpression.qmd), this is achieved by modelling the distribution of association values between a given variant $A$ and all genes $B$ as a [mixture distribution](https://en.wikipedia.org/wiki/Mixture_distribution) of real and null (random) associations. The relative weight of each component then reflects the prior probability of finding a non-null $B$ gene for a given variant $A$, and is fitted separately for every $A$.
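
The mixture model described above can be written out explicitly. The notation here is chosen for illustration and is not taken from the BioFindr documentation: $x$ denotes the association value between variant $A$ and a gene $B$, $\pi_A$ the weight of the real component for variant $A$, and $p_{\mathrm{real}}$ and $p_{\mathrm{null}}$ the two component densities.

$$
p_A(x) = \pi_A \, p_{\mathrm{real}}(x \mid A) + (1 - \pi_A) \, p_{\mathrm{null}}(x \mid A)
$$

Under these assumptions, Bayes' rule gives the probability that an observed association value comes from the real component:

$$
P(\mathrm{real} \mid x, A) = \frac{\pi_A \, p_{\mathrm{real}}(x \mid A)}{p_A(x)} = 1 - \frac{(1 - \pi_A) \, p_{\mathrm{null}}(x \mid A)}{p_A(x)}
$$

Because $\pi_A$ is fitted per variant, the same association value can yield different probabilities for different variants.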

We will illustrate how to run association analysis with [BioFindr][1] using [preprocessed data][2] from the [GEUVADIS study][3]. See the [installation instructions](installation.qmd) for the steps you need to take to reproduce this tutorial.

## Set up the environment

We begin by setting up the environment and loading some necessary packages.


```julia
using DrWatson
quickactivate(@__DIR__)

using DataFrames
using Arrow

using BioFindr
```

## Load data

### Expression data

[BioFindr][1] expects expression data to be stored as [floating-point numbers](https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/) in a [DataFrame][4] where columns correspond to variables (genes) and rows to samples; see the [coexpression analysis tutorial](coexpression.qmd) for more details.
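
As a minimal sketch of this layout, the snippet below builds a small expression table and converts any non-floating-point columns to `Float64`. The table `dt_toy`, its gene names, and its values are made up for illustration and are not part of the GEUVADIS data.

```julia
using DataFrames

# Toy expression table (made-up values): 4 samples (rows) x 3 genes (columns).
dt_toy = DataFrame(
    geneA = [5.1, 4.8, 6.0, 5.5],
    geneB = [0.3, 0.0, 1.2, 0.7],
    geneC = [10, 12, 9, 11],   # integer column, e.g. raw counts
)

# Names of columns whose element type is not a floating-point type.
non_float = [n for n in names(dt_toy) if !(eltype(dt_toy[!, n]) <: AbstractFloat)]

# Convert every column to Float64; values are unchanged, only the type differs.
dt_float = mapcols(col -> Float64.(col), dt_toy)
```

Checking `non_float` before running an analysis is a cheap way to spot columns that were read in with an unexpected type.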

This tutorial uses two tables of expression data from the same set of samples, one for mRNA expression data called `dt`, and one for microRNA (miRNA) expression data called `dm`:


```julia
dt = DataFrame(Arrow.Table(datadir("exp_pro", "findr-data-geuvadis", "dt.arrow")));
dm = DataFrame(Arrow.Table(datadir("exp_pro", "findr-data-geuvadis", "dm.arrow")));
```

### Genotype data

[BioFindr][1] expects that genotype data are stored as [integer numbers](https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/) in a [DataFrame][4] where columns correspond to variables (genetic variants) and rows to samples. Since [BioFindr][1] uses a [categorical association model](https://tmichoel.github.io/BioFindr.jl/dev/realLLR/#prim_test_realLLR), it does not matter how different genotypes (e.g. heterozygous vs. homozygous) are encoded as integers. Future versions will support [scientific types](https://juliaai.github.io/ScientificTypes.jl/dev/) for representing genotype data.
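
To see why the integer encoding does not matter for a categorical model, note that such a model only uses the genotype to split samples into groups. The sketch below uses a made-up genotype vector and a hypothetical helper `sample_groups` (not a BioFindr function) to show that relabelling the genotypes leaves the sample groups unchanged.

```julia
# Hypothetical helper: group sample indices by genotype value.
# Sorting the groups removes any dependence on the labels themselves.
sample_groups(g) = sort([findall(==(v), g) for v in unique(g)])

# Made-up genotypes for 6 samples, encoded as 0, 1, 2.
g = [0, 1, 2, 1, 0, 2]

# The same genotypes under an arbitrary relabelling of the three categories.
relabel = Dict(0 => 2, 1 => 0, 2 => 1)
g_recoded = [relabel[v] for v in g]

# Both encodings partition the samples in the same way.
same_partition = sample_groups(g) == sample_groups(g_recoded)
```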

This tutorial uses two tables of genotype data from the same set of samples as the expression data, one with genotypes for mRNA eQTLs called `dgt`, and one for microRNA (miRNA) eQTLs called `dgm`:


```julia
dgt = DataFrame(Arrow.Table(datadir("exp_pro", "findr-data-geuvadis", "dgt.arrow")));
dgm = DataFrame(Arrow.Table(datadir("exp_pro", "findr-data-geuvadis", "dgm.arrow")));
```
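
Since the expression and genotype tables describe the same set of samples, a quick sanity check before running an analysis is to confirm that they have the same number of rows. The sketch below uses toy tables and a hypothetical helper `check_same_samples`, which is not part of BioFindr. It assumes that rows are matched by position, and it only catches a mismatch in sample count, not a difference in sample order.

```julia
using DataFrames

# Hypothetical helper (not a BioFindr function): verify that two tables
# have the same number of samples (rows) and return that number.
function check_same_samples(dexp::AbstractDataFrame, dgeno::AbstractDataFrame)
    nrow(dexp) == nrow(dgeno) ||
        throw(DimensionMismatch(
            "expression table has $(nrow(dexp)) samples, genotype table has $(nrow(dgeno))"))
    return nrow(dexp)
end

# Toy tables with made-up values: 4 samples each.
dexp_toy = DataFrame(geneA = [5.1, 4.8, 6.0, 5.5], geneB = [0.3, 0.0, 1.2, 0.7])
dgeno_toy = DataFrame(snp1 = [0, 1, 2, 1], snp2 = [1, 1, 0, 2])

n_samples = check_same_samples(dexp_toy, dgeno_toy)
```

With the tutorial data, the analogous call would be `check_same_samples(dt, dgm)`.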

## Run BioFindr

Assume we are interested in identifying mRNA genes whose expression levels are associated with microRNA eQTLs. We run:


```julia
dP = findr(dt, dgm, FDR=0.05)
```

BioFindr computes a [posterior probability](https://tmichoel.github.io/BioFindr.jl/dev/posteriorprobs/) of non-zero association for every **Source** variant (columns of `dgm`) and **Target** gene (columns of `dt`). By default, the output is sorted by decreasing **Probability**. The optional parameter **FDR** limits the output to the set of pairs whose [global false discovery rate (FDR)](https://en.wikipedia.org/wiki/False_discovery_rate#Storey-Tibshirani_procedure) is below a desired value (here set to 5%). The **qvalue** column in the output can be used for further filtering; see the [coexpression analysis tutorial](coexpression.qmd) for further details.

Note the order of the arguments: the first argument, `dt`, is the **Target** DataFrame, and the second argument, `dgm`, is the **Source** DataFrame.
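
The further filtering on **qvalue** mentioned above can be done with standard DataFrames operations. The sketch below works on a toy table `dP_toy` that only mimics the column names described in this tutorial (**Source**, **Target**, **Probability**, **qvalue**); its variant names, gene names, and values are made up and are not BioFindr output.

```julia
using DataFrames

# Toy stand-in for the output table (made-up values), sorted by decreasing Probability.
dP_toy = DataFrame(
    Source = ["snp1", "snp1", "snp2", "snp3"],
    Target = ["geneA", "geneB", "geneA", "geneC"],
    Probability = [0.99, 0.95, 0.80, 0.60],
    qvalue = [0.01, 0.03, 0.09, 0.17],
)

# Keep only the pairs below a stricter q-value threshold.
strict = filter(:qvalue => <(0.05), dP_toy)

# Count the retained target genes per source variant.
counts = combine(groupby(strict, :Source), nrow => :n_targets)
```

Applying the same `filter` call to `dP` would tighten the 5% threshold used in the `findr` call without rerunning the analysis.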

[1]: https://github.com/tmichoel/BioFindr.jl
[2]: https://github.com/lingfeiwang/findr-data-geuvadis
[3]: https://doi.org/10.1038/nature12531
[4]: https://dataframes.juliadata.org/stable/
[5]: https://doi.org/10.1371/journal.pcbi.1005703