# Introduction
These are the instructions for running freemuxlet on the ATAC data.

# Setup

In [3]:
import os
import gzip

In [4]:
mountpoint = '/data/clue/'
prefix = mountpoint + 'amo/atac/'

# Creating the VCF

1. Download the 1000 genomes VCF.

```
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz.tbi
```

2. Use a text editor to change the version from 4.3 to 4.2 in the `##fileformat` header line so that `bcftools` can run on it. This should not affect functionality.

3. Run `bcftools` to filter to a minor allele frequency of 0.05.

4. Run `bedtools intersect` to filter the SNPs to only those found in the peak sets of every well.

5. Rename the contigs to add the `chr` prefix (for example, `1` becomes `chr1`) and keep only the autosomes.

```
<contig renaming code here>
samtools view -b input.bam chr{1..22} > output.bam
```
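
Steps 2 and 5 operate on plain VCF text, so they can be sketched in Python. This is a minimal illustration, not the renaming code used for this dataset (which is not shown above). `normalize_vcf_lines` and `AUTOSOMES` are hypothetical names, and the sketch assumes the input contigs are unprefixed (`1`, `2`, ..., `X`), as step 5 implies. It rewrites the `##fileformat` line to 4.2, adds the `chr` prefix, and drops everything outside chr1 to chr22.

```python
import re

# Hypothetical helper set: the 22 autosomes in chr-prefixed form.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def _with_chr(name):
    """Add the 'chr' prefix unless the contig name already has it."""
    return name if name.startswith("chr") else "chr" + name


def normalize_vcf_lines(lines):
    """Yield VCF lines with a 4.2 header, chr-prefixed contigs, autosomes only."""
    for line in lines:
        if line.startswith("##fileformat="):
            # Step 2: declare VCFv4.2 instead of VCFv4.3.
            yield "##fileformat=VCFv4.2\n"
        elif line.startswith("##contig=<ID="):
            # Rename contig declarations and drop non-autosomal ones.
            name = re.match(r"##contig=<ID=([^,>]+)", line).group(1)
            if _with_chr(name) in AUTOSOMES:
                yield line.replace("ID=" + name, "ID=" + _with_chr(name), 1)
        elif line.startswith("#"):
            # All other header lines pass through unchanged.
            yield line
        else:
            # Data record: the first tab-separated column is the contig.
            chrom, rest = line.split("\t", 1)
            if _with_chr(chrom) in AUTOSOMES:
                yield _with_chr(chrom) + "\t" + rest
```

A file written this way would still need to be bgzip-compressed and re-indexed before tools that expect a tabix index can use it.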

This pipeline should result in a VCF with `184243` sites located at:

In [11]:
prefix + 'vcfs/filtered.2.with_chr.autosomes.vcf.gz'

'/data/clue/amo/atac/vcfs/filtered.2.with_chr.autosomes.vcf.gz'
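
One way to check the site count quoted above is to count the non-header records in the compressed VCF. The helper below is a small sketch; `count_vcf_sites` is a hypothetical name, and it relies on Python's `gzip` module being able to read bgzip-compressed files.

```python
import gzip


def count_vcf_sites(path):
    """Count the non-header, non-empty records in a gzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

Calling `count_vcf_sites(prefix + 'vcfs/filtered.2.with_chr.autosomes.vcf.gz')` should return the number of sites stated above if the filtering steps were reproduced faithfully.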

# Running Freemuxlet


The freemuxlet pipeline consists of two steps:

1. Running a pileup using `popscle dsc-pileup`.
2. Running the clustering and demultiplexing using `popscle freemuxlet`.

The `dsc-pileup` command takes in the aligned reads (BAM), the VCF created above, and the filtered droplet barcodes:

```
cd /data/clue/amo/atac/demux/plp/
popscle dsc-pileup \
    --sam /data/clue/amo/atac/cr/well1/outs/possorted_genome_bam.bam \
    --vcf /data/clue/amo/atac/vcfs/filtered.2.with_chr.autosomes.vcf.gz \
    --out plp1 \
    --group-list /data/clue/amo/atac/bcs/well1_bcs.tsv \
    --skip-umi
```

You repeat this command for each well of the 10x run. This creates several pileup files per well:

```
plp1.cel.gz
plp1.plp.gz
plp1.var.gz
```
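
Since the pileup command only differs by well number, the per-well invocations can be generated rather than typed out. The sketch below assumes that wells 2 to 5 follow the same directory and file naming as well 1 and that the VCF lives at the `prefix + 'vcfs/...'` location shown earlier; `pileup_command` and `BASE` are hypothetical names. It only prints the commands, so they can be reviewed before running.

```python
import shlex

# Assumed base directory, taken from the paths used in this notebook.
BASE = "/data/clue/amo/atac"


def pileup_command(well, base=BASE, vcf=None):
    """Build the dsc-pileup argument list for one well."""
    if vcf is None:
        vcf = f"{base}/vcfs/filtered.2.with_chr.autosomes.vcf.gz"
    return [
        "popscle", "dsc-pileup",
        "--sam", f"{base}/cr/well{well}/outs/possorted_genome_bam.bam",
        "--vcf", vcf,
        "--out", f"plp{well}",
        "--group-list", f"{base}/bcs/well{well}_bcs.tsv",
        "--skip-umi",
    ]


# Print one shell-quoted command per well (five wells assumed).
for well in range(1, 6):
    print(shlex.join(pileup_command(well)))
```

Each argument list can also be passed directly to `subprocess.run` from within the `plp/` output directory.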

You can then run `merge_pileups.py` to merge the pileup files across all wells, so that genotype learning can draw on the droplets from every well. The script automatically increments each cell barcode's GEM group so that barcodes remain unique across wells.

```
python merge_pileups.py /data/clue/amo/atac/demux/plp/plp1 /data/clue/amo/atac/demux/plp/plp2 /data/clue/amo/atac/demux/plp/plp3 /data/clue/amo/atac/demux/plp/plp4 /data/clue/amo/atac/demux/plp/plp5
```
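
To illustrate why the GEM group matters: the same barcode sequence can occur in more than one well, and only the numeric suffix tells them apart after merging. The function below is a sketch of that idea, not the code in `merge_pileups.py` (which is not shown here); `retag_barcode` is a hypothetical name, and it assumes barcodes carry a 10x-style `-N` suffix.

```python
def retag_barcode(barcode, well):
    """Replace (or add) the GEM-group suffix so it equals the well number."""
    sequence, _, _ = barcode.rpartition("-")
    # rpartition returns an empty sequence when there is no '-' in the barcode.
    return f"{sequence or barcode}-{well}"


# The same sequence seen in wells 1 and 3 stays distinguishable after merging.
merged = {retag_barcode("AAACGAAAGTT-1", well) for well in (1, 3)}
```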

Finally, you can run `freemuxlet`:

```
cd /data/clue/amo/atac/demux/freemux/
popscle freemuxlet --plp /data/clue/amo/atac/demux/plp/merged --out freemux --nsample 5
```

_Note_: To generate the genotype distance matrix (the `ldist.gz` file), run `freemuxlet-old` with the `--aux-files` flag.

At the end, you should have several freemuxlet outputs:

```
freemux.clust1.samples.gz
freemux.clust1.vcf.gz
freemux.lmix
```

In [5]:
freemux_path = prefix + 'demux/freemux/freemux.clust1.samples.gz'
freemux_path

'/data/clue/amo/atac/demux/freemux/freemux.clust1.samples.gz'

# Split by Well

Although the wells were merged for demultiplexing, the freemuxlet outputs are loaded separately per well downstream, so we split them by well here.

In [6]:
freemux_path

'/data/clue/amo/atac/demux/freemux/freemux.clust1.samples.gz'

In [51]:
# Create the output directory if it does not already exist.
os.makedirs(prefix + 'demux/freemux/by_well/', exist_ok=True)

In [58]:
freemux_file = gzip.open(freemux_path, 'rt')

In [59]:
header = freemux_file.readline()

In [60]:
# Open one output file per well and copy the header into each.
wells = dict()
for well in range(1, 6):
    wells[well] = dict()
    wells[well]['path'] = prefix + 'demux/freemux/by_well/freemux_well%d.clust1.samples' % well
    wells[well]['file'] = open(wells[well]['path'], 'w')
    wells[well]['file'].write(header)

In [61]:
# Route each row to its well using the GEM-group suffix of the barcode (second column).
for line in freemux_file:
    well = int(line.split('\t')[1].split('-')[-1])
    wells[well]['file'].write(line)

In [62]:
for well in range(1, 6):
    wells[well]['file'].close()

In [63]:
freemux_file.close()
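
The cells above can also be collected into a single function that closes every file even if an error occurs partway through. This is a sketch under the same assumptions as the cells (barcode in the second tab-separated column, GEM-group suffix equal to the well number); `split_samples_by_well` is a hypothetical name.

```python
import gzip
import os
from contextlib import ExitStack


def split_samples_by_well(samples_path, out_dir, n_wells):
    """Split a merged freemuxlet samples file into one plain-text file per well."""
    os.makedirs(out_dir, exist_ok=True)
    with ExitStack() as stack:
        source = stack.enter_context(gzip.open(samples_path, "rt"))
        header = source.readline()
        outputs = {}
        for well in range(1, n_wells + 1):
            path = os.path.join(out_dir, "freemux_well%d.clust1.samples" % well)
            outputs[well] = stack.enter_context(open(path, "w"))
            outputs[well].write(header)
        for line in source:
            # The barcode's "-N" suffix is the GEM group, i.e. the well number.
            well = int(line.split("\t")[1].split("-")[-1])
            outputs[well].write(line)
    # Return the path written for each well.
    return {well: handle.name for well, handle in outputs.items()}
```

For the data in this notebook, the call would be `split_samples_by_well(freemux_path, prefix + 'demux/freemux/by_well/', 5)`.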