# Fine-Mapping

## 1. Univariate Fine-Mapping and TWAS with SuSiE

### 1.1 Input
The univariate SuSiE analysis requires the following inputs:
- Genotype matrix (X): Individual-level genotype data, with rows representing samples and columns representing variant loci
- Phenotype vector (Y): Continuous phenotype measurements
- Minor allele frequency (MAF): The MAF value for each variant
- Covariates: Optional confounding factors for adjustment
- LD reference panel: Used for variant filtering and quality control

In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
head output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt

#chr	start	end	ID	path
chr1	89550	91104	ENSG00000239945	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz
chr1	944202	959308	ENSG00000188976	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz
chr1	1203507	1206591	ENSG00000186891	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz
chr1	2555638	2565381	ENSG00000157873	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz
chr1	7784319	7845176	ENSG00000049246	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz
chr1	7998186	7999933	ENSG00000284716	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz
chr1	9950571	9960786	ENSG00000283611	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz
chr1	10298965	10299071	ENSG00000199562	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz
chr1	10306464	10306756	ENSG00000264501	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr1.bed.gz


### 1.2 Command

In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
# -p creates parent directories as needed and does not fail if they already exist
mkdir -p output/mnm_regression/susie_twas/data
sos run pipeline/mnm_regression.ipynb susie_twas \
    --name test_susie_twas \
    --genoFile output/genotype_by_chrom/wgs.merged.plink_qc.1.bed \
    --phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \
    --covFile output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --phenotype-names test_pheno \
    --max-cv-variants 5000 \
    --ld_reference_meta_file reference_data/ADSP_R4_EUR/ld_meta_file.tsv \
    --region-name ENSG00000049246 ENSG00000054116 ENSG00000116678 \
    --save-data \
    --cwd output/mnm_regression/susie_twas


INFO: Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO: NumExpr defaulting to 16 threads.
INFO: Running get_analysis_regions:
Loading customized association analysis window from reference_data/TAD/TADB_enhanced_cis.bed
INFO: get_analysis_regions is completed.
INFO: get_analysis_regions output:   regional_data
INFO: Running susie_twas:
INFO: susie_twas (index=2) is completed.
INFO: susie_twas (index=0) is completed.
INFO: susie_twas (index=1) is completed.
INFO: susie_twas output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/mnm_regression/susie_twas/fine_mapping/test_susie_twas.chr1_ENSG00000049246.univariate_bvsr.rds /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/mnm_regression/susie_twas/twas_weights/test_susie_twas.chr1_ENSG00000049246.univariate_twas_weights.rds... (6


### 1.3 Output File Structure

#### Overview

The `univariate_analysis_pipeline` produces output files of the form:

```
{name}.{region}.univariate_bvsr.rds
```

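Each file contains the full fine-mapping result for one gene. The object is a nested list, keyed first by gene ID and then by phenotype entry (named `{phenotype_name}_{gene_id}` in the example output later in this section, e.g. `test_pheno_ENSG00000116678`):

```
$GENE_ID
  $PHENOTYPE_NAME_GENE_ID
```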
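
As a minimal sketch of how this nesting can be traversed in R, the example below builds a small mock object with the same two-level layout and walks it gene by gene. The object is hand-built for illustration and holds only two of the real components; with an actual result file, `res` would come from `readRDS()`.

```r
# Mock object mimicking the nested layout: gene ID -> phenotype entry -> components.
res <- list(
  ENSG00000116678 = list(
    test_pheno_ENSG00000116678 = list(
      variant_names = c("chr1:64580401:A:T", "chr1:64594253:TAA:TA"),
      sumstats = list(betahat = c(0.0023, -0.0471))
    )
  )
)

# Walk both levels and record how many variants each entry holds.
summary_rows <- list()
for (gene in names(res)) {
  for (pheno in names(res[[gene]])) {
    fit <- res[[gene]][[pheno]]
    summary_rows[[length(summary_rows) + 1]] <- data.frame(
      gene = gene,
      phenotype = pheno,
      n_variants = length(fit$variant_names)
    )
  }
}
overview <- do.call(rbind, summary_rows)
print(overview)
```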

---

#### Main Components

##### 1). Basic Metadata

- **`variant_names`**: Character vector of variant IDs (format: `chr:pos:ref:alt`)
- **`analysis_script`**: R script used for the analysis
- **`sample_names`**: List of sample identifiers
- **`other_quantities`**: Additional metadata (e.g. dropped samples)

##### 2). Summary Statistics (`sumstats`)

- **`betahat`**: Effect size estimates (one per variant)
- **`sebetahat`**: Standard errors
- **`z_scores`**: Z-statistics
- **`p_values`**: P-values
- **`q_values`**: FDR-adjusted q-values
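
These quantities are related in the usual way. The sketch below uses the first five `betahat` and `sebetahat` values from the example `str()` output shown later in this section. It assumes a normal approximation for the p-values, and the Benjamini-Hochberg adjustment is only a stand-in for whatever q-value method the pipeline actually applies.

```r
betahat   <- c(0.0023, -0.0471, 0.3789, -0.2284, 0.1857)
sebetahat <- c(0.1067,  0.0886, 0.1768,  0.2955, 0.1282)

# Z-statistic: effect estimate divided by its standard error.
z_scores <- betahat / sebetahat

# Two-sided p-value under a normal approximation (assumption).
p_values <- 2 * pnorm(-abs(z_scores))

# FDR adjustment; BH is used here as a stand-in (assumption).
q_values <- p.adjust(p_values, method = "BH")

print(data.frame(betahat, sebetahat, z_scores, p_values, q_values))
```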

##### 3). Candidate Variant Table (`top_loci`)

A data frame summarizing per-variant results:

| Column              | Description                                |
|---------------------|--------------------------------------------|
| `variant_id`        | Variant identifier                         |
| `betahat`           | Effect size estimate                       |
| `sebetahat`         | Standard error                             |
| `z`                 | Z-score                                    |
| `maf`               | Minor allele frequency                     |
| `pip`               | Posterior inclusion probability (0–1)      |
| `cs_coverage_0.95`  | Membership in 95% credible set             |
| `cs_coverage_0.7`   | Membership in 70% credible set             |
| `cs_coverage_0.5`   | Membership in 50% credible set             |
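
A typical first step with this table is to pull out the variants that belong to a credible set and pick the lead variant. The sketch below uses a toy data frame with invented variants and PIPs, and it assumes that the `cs_coverage_*` columns hold a credible set index with `0` meaning "not in any set"; check the coding in your own output before relying on it.

```r
# Toy stand-in for top_loci (values are illustrative, not pipeline output).
top_loci <- data.frame(
  variant_id       = c("chr1:100:A:G", "chr1:200:C:T", "chr1:300:G:A", "chr1:400:T:C"),
  pip              = c(0.62, 0.31, 0.04, 0.01),
  cs_coverage_0.95 = c(1L, 1L, 0L, 0L)   # assumed coding: 0 = not in a credible set
)

# Variants assigned to any 95% credible set.
in_cs <- top_loci[top_loci[["cs_coverage_0.95"]] > 0, ]

# Lead variant: highest PIP among credible set members.
lead <- in_cs$variant_id[which.max(in_cs$pip)]

print(in_cs)
print(lead)
```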

##### 4). SuSiE Core Output (`susie_result_trimmed`)

- **Fine-Mapping Output**:
  - `pip`: Vector of posterior inclusion probabilities
  - `sets`: List of credible sets (`cs`, `coverage`, `requested_coverage`)
  - `sets_secondary`: Credible sets at lower coverage thresholds (0.7, 0.5)

- **Model Parameters**:
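  - `alpha`: Posterior inclusion probabilities for each single effect (L × P matrix, for L single effects and P variants)
  - `lbf_variable`: Log Bayes factors for each variant within each single effect
  - `mu`, `mu2`: Posterior first and second moments of the effect sizes, conditional on inclusion
  - `V`: Estimated prior variance of each single effect (SuSiE stores the residual variance separately, as `sigma2`)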
  - `X_column_scale_factors`: Scaling factors used for genotype matrix

- **Convergence Info**:
  - `niter`: Number of iterations
  - `max_L`: Maximum number of single-effect components
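
The link between `alpha` and `pip` can be made concrete: a variant's PIP is the probability that at least one single effect selects it. The sketch below applies that formula to a toy `alpha` matrix. It is a simplification, since susieR's `susie_get_pip()` additionally ignores single effects whose estimated prior variance is effectively zero.

```r
# Toy alpha matrix: 2 single effects (rows) x 3 variants (columns).
alpha <- rbind(
  c(0.90, 0.05, 0.05),
  c(0.10, 0.80, 0.10)
)

# pip_j = 1 - prod over effects l of (1 - alpha[l, j])
pip <- 1 - apply(1 - alpha, 2, prod)
print(round(pip, 3))
```

For this toy input the PIPs are 0.91, 0.81, and 0.145: the third variant has a small weight in both effects and so ends up with a low but non-zero PIP.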

##### 5). Region Metadata (`region_info`)

- `region_coord`: Coordinates of the gene
- `grange`: Analysis window used for LD matrix
- `region_name`: Gene symbol or name

##### 6). Preset Variant Analysis (`preset_variants_result`)

Optional: Separate analysis of manually selected high-quality variants, same structure as main result.

##### 7). TWAS Weights Result (`twas_weights_result`, optional)

Included if `twas_weights = TRUE`. Contains:

- Trained TWAS weights
- Cross-validation statistics
- Prediction metrics
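
To illustrate how trained weights are typically used, the sketch below predicts expression as a weighted sum of genotype dosages and scores the prediction with a squared correlation. All numbers are toy values, and the object names (`X`, `w`, `y`) are placeholders, not fields of `twas_weights_result`.

```r
# Toy genotype dosages: 4 samples (rows) x 3 variants (columns).
X <- matrix(c(0, 1, 2, 1,
              1, 0, 1, 2,
              2, 2, 0, 1), nrow = 4)

# Toy TWAS weights, one per variant (a zero weight drops the variant).
w <- c(0.5, 0, -0.2)

# Predicted expression is the weighted sum of dosages.
pred <- drop(X %*% w)

# Toy observed expression, used to compute a simple prediction metric.
y   <- c(-0.5, 0.2, 0.9, 0.4)
rsq <- cor(pred, y)^2

print(pred)
print(rsq)
```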

---

#### Focus

The most important components for downstream analysis are:

- `top_loci`: summarized per-SNP statistics with PIPs and CS membership
- `susie_result_trimmed`: fine-mapping posterior output
- `preset_variants_result`: for filtered high-confidence variant sets



In [None]:
# R code (run this cell with an R kernel)
setwd('/home/ubuntu/xqtl_protocol_exercise')
uv_susie = readRDS('output/mnm_regression/susie_twas/fine_mapping/test_susie_twas.chr1_ENSG00000116678.univariate_bvsr.rds')
names(uv_susie)
str(uv_susie)

List of 1
 $ ENSG00000116678:List of 1
  ..$ test_pheno_ENSG00000116678:List of 10
  .. ..$ variant_names         : chr [1:22] "chr1:64580401:A:T" "chr1:64594253:TAA:TA" "chr1:64864010:G:A" "chr1:64928793:AAAAC:A" ...
  .. ..$ analysis_script       : chr "options(warn=1)\nlibrary(pecotmr)\nphenotype_files = c(\"/mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/p"| __truncated__
  .. ..$ other_quantities      :List of 1
  .. .. ..$ dropped_samples:List of 3
  .. .. .. ..$ X    : NULL
  .. .. .. ..$ y    : NULL
  .. .. .. ..$ covar: NULL
  .. ..$ sumstats              :List of 5
  .. .. ..$ betahat  : Named num [1:22] 0.0023 -0.0471 0.3789 -0.2284 0.1857 ...
  .. .. .. ..- attr(*, "names")= chr [1:22] "chr1:64580401:A:T" "chr1:64594253:TAA:TA" "chr1:64864010:G:A" "chr1:64928793:AAAAC:A" ...
  .. .. ..$ sebetahat: Named num [1:22] 0.1067 0.0886 0.1768 0.2955 0.1282 ...
  .. .. .. ..- attr(*, "names")= chr [1:22] "chr1:64580401:A:T" "chr1:64594253:TAA:TA" "chr1:64864010:G:A" "chr1:64928

## 2. Regression with Summary Statistics (RSS) Fine-Mapping

### 2.1 Input
RSS fine-mapping analysis requires the following inputs:
- Summary statistics file (`sumstat_path`): A tab-delimited file containing the following columns:
    - `chrom`: Chromosome
    - `pos`: Genomic position
    - `A1` / `A2`: Alleles
    - `beta` / `se` or `z`: Effect size and standard error, or Z-score
    - `n_sample` (optional): Sample size

- LD data (`LD_data`): The linkage disequilibrium matrix and variant information, typically loaded via the `load_LD_matrix` function.
- Column mapping file (`column_file_path`): A file that defines the mapping between the expected column names and those present in the summary statistics file.
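
The idea behind the column mapping can be sketched as follows. The raw column names (`CHR`, `BP`, `EA`, `NEA`, `BETA`, `SE`) and all values are hypothetical, and the mapping is written as a named R vector for illustration; the actual mapping file format used by the pipeline is not shown here.

```r
# Hypothetical raw summary statistics with study-specific column names (toy values).
raw <- data.frame(
  CHR  = c(22, 22),
  BP   = c(49356357, 49356408),
  EA   = c("A", "A"),
  NEA  = c("G", "G"),
  BETA = c(0.010, 0.040),
  SE   = c(0.033, 0.032)
)

# Mapping: expected name = name in the raw file (illustrative only).
col_map <- c(chrom = "CHR", pos = "BP", A1 = "EA", A2 = "NEA", beta = "BETA", se = "SE")

# Fail early if a mapped column is missing from the file.
stopifnot(all(col_map %in% names(raw)))

# Select the mapped columns and rename them to the expected names.
std <- raw[, col_map]
names(std) <- names(col_map)

# When only beta and se are given, the Z-score is their ratio.
std$z <- std$beta / std$se
print(std)
```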

In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
head data/mnm_regression/gwas_meta_data.txt

study_id	chrom	file_path	column_mapping_file	n_sample	n_case	n_control
AD_Bellenguez_2022	0	data/twas/AD_Bellenguez_2022_RSS_QC_RAISS_imputed.tsv.gz	data/twas/Bellenguez.yml	0	111326	677663


### 2.2 Command
#### Parameter Breakdown

- **`--ld-meta-data reference_data/ADSP_R4_EUR/ld_meta_file.tsv`**
Specifies the LD reference panel metadata file. RSS fine-mapping requires LD information: the **`rss_analysis_pipeline`** function takes **`LD_data`** as a required parameter.
- **`--gwas-meta-data data/mnm_regression/gwas_meta_data.txt`**
Points to the GWAS metadata file describing the summary statistics files to analyze. The pipeline uses it to load summary statistics via the **`load_rss_data`** function.
- **`--qc_method "rss_qc"`**
Sets the quality control method to "rss_qc". The pipeline supports several QC methods, including "rss_qc", "dentist", and "slalom". QC is performed by the **`summary_stats_qc`** function.
- **`--impute`**
Enables statistical imputation of missing variants with the RAISS method, using the **`raiss`** function with default parameters.
- **`--finemapping_method "susie_rss"`**
Selects SuSiE RSS for fine-mapping. The method is implemented by the **`susie_rss_pipeline`** function, which supports the "susie_rss", "single_effect", and "bayesian_conditional_regression" methods.
- **`--skip_analysis_pip_cutoff 0`**
Sets the PIP (posterior inclusion probability) threshold for skipping analysis to 0, so no region is skipped based on the initial PIP screening.
- **`--skip_regions 6:25000000-35000000`**
Excludes the specified genomic region (chromosome 6, 25-35 Mb) from analysis. Region filtering is handled by the **`rss_basic_qc`** function. Because the target region here is on chromosome 22, this exclusion does not affect the present run.
- **`--region_name 22:49355984-50799822`**
Specifies the target region for analysis (chromosome 22, roughly 49.4-50.8 Mb). The region is used to subset both the summary statistics and the LD data.

The RSS analysis pipeline will process the specified region using the ADSP_R4_EUR LD reference panel, apply RSS quality control, perform RAISS imputation, and conduct SuSiE RSS fine-mapping to identify potential causal variants in the target region.
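
The region strings passed to `--region_name` and `--skip_regions` follow a `chrom:start-end` pattern. As an illustration only (the helper names `parse_region` and `overlaps` are hypothetical and not part of the pipeline), the sketch below parses such strings and checks whether the target region overlaps the skipped one.

```r
# Parse "chrom:start-end" (an optional "chr" prefix is dropped) into its parts.
parse_region <- function(region) {
  parts <- strsplit(sub("^chr", "", region), "[:-]")[[1]]
  stopifnot(length(parts) == 3)
  list(chrom = parts[1], start = as.numeric(parts[2]), end = as.numeric(parts[3]))
}

# Two regions overlap if they share a chromosome and their intervals intersect.
overlaps <- function(a, b) {
  a$chrom == b$chrom && a$start <= b$end && b$start <= a$end
}

target <- parse_region("22:49355984-50799822")
skip   <- parse_region("6:25000000-35000000")

print(overlaps(target, skip))
```

Here the two regions lie on different chromosomes, so the check returns `FALSE`.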

In [None]:
sos run pipeline/rss_analysis.ipynb univariate_rss \
    --ld-meta-data reference_data/ADSP_R4_EUR/ld_meta_file.tsv \
    --gwas-meta-data data/mnm_regression/gwas_meta_data.txt \
    --qc_method "rss_qc" --impute \
    --finemapping_method "susie_rss" \
    --cwd output/rss_analysis \
    --skip_analysis_pip_cutoff 0 \
    --skip_regions 6:25000000-35000000 \
    --region_name 22:49355984-50799822

INFO: Running get_analysis_regions:
INFO: Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO: NumExpr defaulting to 16 threads.
INFO: get_analysis_regions is completed.
INFO: get_analysis_regions output:   regional_data
INFO: Running univariate_rss:
INFO: univariate_rss is completed.
INFO: univariate_rss output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/rss_analysis/univariate_rss/RSS_QC_RAISS_imputed.chr22_49355984_50799822.univariate_susie_rss.rds
INFO: Workflow univariate_rss (ID=w60aca78e2357ae7c) is executed successfully with 2 completed steps.


### 2.3 Output Structure

The RSS fine-mapping analysis produces result files named **`RSS_QC_RAISS_imputed.{region}.univariate_susie_rss.rds`**. Each file holds the results for one genomic region, nested by GWAS study.

#### **Main Result Components**

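Results are nested first by region (e.g. `chr22_49355984_50799822`) and then by study ID (e.g. `AD_Bellenguez_2022`). Within each study, the key components are:

**`susie_rss_RSS_QC_RAISS_imputed`** (the main analysis result):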

- **variant_names**: Character vector of variant identifiers in "chrom:pos:ref:alt" format
- **analysis_script**: The R script used for the analysis
- **sumstats**: List containing the Z-scores used for SuSiE RSS analysis
- **susie_result_trimmed**: Core SuSiE fine-mapping results including PIPs, credible sets, and model parameters
- **outlier_number**: Integer count of outliers detected during quality control

#### **Detailed Data Frame Columns**

The **`rss_data_analyzed`** component is a data frame of 26 variables holding the processed summary statistics. The main columns are:

**Basic Variant Information:**

- **chrom**: Chromosome number
- **pos**: Genomic position
- **variant_id**: Variant identifier in standardized format
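- **A1**: Effect allele (under the `chrom:pos:ref:alt` variant ID convention used here, normally the alternative allele)
- **A2**: Other, non-effect allele (normally the reference allele)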

**GWAS Summary Statistics:**

- **pvalue**: P-value from association test
- **effect_allele_frequency**: Frequency of the effect allele
- **odds_ratio**: Odds ratio for binary traits
- **ci_lower/ci_upper**: Confidence interval bounds
- **beta**: Effect size estimate (may be NA if using odds ratios)
- **se**: Standard error of effect estimate
- **z**: Z-score (beta/se or derived from p-value)

**Study Design Information:**

- **n_case**: Number of cases (for case-control studies)
- **n_control**: Number of controls
- **het_isq**: I² heterogeneity statistic for meta-analysis
- **het_pvalue**: P-value for heterogeneity test

**RAISS Imputation Results:**

- **var**: Variance estimate (-1 indicates original data, not imputed)
- **raiss_ld_score**: LD score from RAISS imputation
- **raiss_R2**: R² quality metric for imputed variants
- **variant_alternate_id**: Alternative variant identifier
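
Since `var = -1` marks variants that were observed, not imputed, the two groups can be separated and a quality filter applied to the imputed ones. The sketch below uses a toy data frame; the `var` and `raiss_R2` values are invented, `NA` is used here as a placeholder R² for observed variants, and the threshold `r2_min` is hypothetical.

```r
# Toy stand-in for rss_data_analyzed (var and raiss_R2 values are illustrative).
rss_data <- data.frame(
  variant_id = c("22:49356357:G:A", "22:49356408:G:A", "22:49356690:C:T", "22:49357190:G:A"),
  z          = c(0.3034, 1.2604, 0.0707, 1.2353),
  var        = c(-1, 0.12, -1, 0.55),
  raiss_R2   = c(NA, 0.88, NA, 0.45)
)

r2_min <- 0.6  # hypothetical quality threshold for imputed variants

# var == -1 flags observed (non-imputed) variants.
is_imputed <- rss_data$var != -1

# Keep every observed variant, plus imputed variants that pass the threshold.
keep     <- !is_imputed | rss_data$raiss_R2 >= r2_min
filtered <- rss_data[keep, ]

print(table(imputed = is_imputed))
print(filtered)
```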

In [None]:
# R code (run this cell with an R kernel)
setwd('/home/ubuntu/xqtl_protocol_exercise')
rss = readRDS('output/rss_analysis/univariate_rss/RSS_QC_RAISS_imputed.chr22_49355984_50799822.univariate_susie_rss.rds')
names(rss)
str(rss)

List of 1
 $ chr22_49355984_50799822:List of 1
  ..$ AD_Bellenguez_2022:List of 2
  .. ..$ susie_rss_RSS_QC_RAISS_imputed:List of 5
  .. .. ..$ variant_names       : chr [1:7610] "22:49356357:G:A" "22:49356408:G:A" "22:49356690:C:T" "22:49357190:G:A" ...
  .. .. ..$ analysis_script     : chr "library(pecotmr)\nlibrary(dplyr)\nlibrary(data.table)\nskip_region = c(\"6:25000000-35000000\")\nstudies = c(\""| __truncated__
  .. .. ..$ sumstats            :List of 1
  .. .. .. ..$ z: num [1:7610] 0.3034 1.2604 0.0707 1.2353 -1.2175 ...
  .. .. ..$ susie_result_trimmed:List of 9
  .. .. .. ..$ pip           : num [1:7610] 0.000471 0.000674 0.000461 0.000654 0.000659 ...
  .. .. .. ..$ sets          :List of 3
  .. .. .. .. ..$ cs                : NULL
  .. .. .. .. ..$ coverage          : NULL
  .. .. .. .. ..$ requested_coverage: num 0.95
  .. .. .. ..$ cs_corr       : logi NA
  .. .. .. ..$ sets_secondary:List of 2
  .. .. .. .. ..$ coverage_0.7:List of 2
  .. .. .. .. .. ..$ sets   :List o
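
In the output above, `sets$cs` is `NULL`, meaning that no credible set was reported at the requested 95% coverage for this region. Downstream code should handle that case explicitly. A minimal sketch, using a hand-built list that copies the first few PIPs and the `sets` fields printed above:

```r
# Mock of the relevant fields, mirroring the structure printed above.
susie_result_trimmed <- list(
  pip  = c(0.000471, 0.000674, 0.000461, 0.000654, 0.000659),
  sets = list(cs = NULL, coverage = NULL, requested_coverage = 0.95)
)

n_cs    <- length(susie_result_trimmed$sets$cs)   # length(NULL) is 0
max_pip <- max(susie_result_trimmed$pip)

if (n_cs == 0) {
  message(sprintf("No %.0f%% credible set; largest PIP is %.6f",
                  100 * susie_result_trimmed$sets$requested_coverage, max_pip))
} else {
  message(sprintf("%d credible set(s) found", n_cs))
}
```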

## 3. Other Fine-Mapping Methods (Optional)
### Univariate Fine-Mapping of Functional (Epigenomic) Data with fSuSiE

In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
sos run pipeline/mnm_regression.ipynb fsusie \
    --cwd output/fsusie/ \
    --name test_fsusie \
    --genoFile output/genotype_by_chrom/wgs.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \
    --covFile output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --numThreads 8 \
    --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --save-data \
    --region-name ENSG00000049246 ENSG00000054116 ENSG00000116678 

INFO: Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO: NumExpr defaulting to 16 threads.
INFO: Running get_analysis_regions:
Loading customized association analysis window from reference_data/TAD/TADB_enhanced_cis.bed
INFO: get_analysis_regions is completed.
INFO: get_analysis_regions output:   regional_data
INFO: Running fsusie:
INFO: fsusie (index=1) is completed.
INFO: fsusie (index=0) is completed.
INFO: fsusie (index=2) is completed.
INFO: fsusie output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/fsusie/fsus/test_fsusie.chr1_7784319_7845176.fsusie_mixture_normal_TI__top_pc_weights.rds /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/fsusie/fsus/test_fsusie.chr1_36136569_36156052.fsusie_mixture_normal_TI__top_pc_weights.rds... (3 items in 3 groups)
INFO: Workflow 