# Example Pyllelic Use-Case Notebook

## Background

This notebook illustrates how to import and use `pyllelic` in a Jupyter environment.

See https://github.com/Paradoxdruid/pyllelic for further details.

## Pre-setup

### Obtaining fastq data

We can download RRBS (reduced representation bisulfite sequencing) data from the ENCODE project:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/

These files are in unaligned fastq format, so the reads must first be aligned to a human reference genome.

### Aligning reads (using process.py)

To align reads, we'll use bowtie2 and samtools (through its pysam wrapper).

First, we need to download a reference genome index for hg19: http://hgdownload.soe.ucsc.edu/goldenPath/hg19

In [None]:
# Processing imports
# from pathlib import Path

In [None]:
# Set up file paths
# index = Path(
#     "/home/andrew/allellic/hg19.p13.plusMT.no_alt_analysis_set//hg19.p13.plusMT.no_alt_analysis_set"
# )
# fastq = Path("/home/andrew/allellic/wgEncodeHaibMethylRrbsU87HaibRawDataRep1.fastq.gz")

**WARNING:** The next command is processor-, RAM-, and time-intensive, and only needs to be run once!

In [None]:
# Convert fastq to bam
# `index` is the bowtie2 index filename without its suffix; `fastq` is the fastq file
# pyllelic.process.bowtie2_fastq_to_bam(index=index,
#                                       fastq=fastq,
#                                       cores=6)

Notes:
* `cores` is the number of processor cores to use; adjust it for your system.
* Instead of `out.bam`, use a filename that encodes the cell line and tissue. Our convention is `fh_CELLLINE_TISSUE.TERT.bam`.
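The naming convention in the notes can be applied with a small helper. This is only a sketch: `convention_bam_name` is a hypothetical function, not part of pyllelic, and upper-casing the values and stripping punctuation are assumptions made so the name fits the `fh_CELLLINE_TISSUE.TERT.bam` pattern.

```python
def convention_bam_name(cell_line: str, tissue: str, gene: str = "TERT") -> str:
    """Build a name of the form fh_CELLLINE_TISSUE.GENE.bam (hypothetical helper)."""

    def clean(text: str) -> str:
        # Keep letters and digits only, upper-cased, so underscores stay unambiguous
        return "".join(ch for ch in text.upper() if ch.isalnum())

    return f"fh_{clean(cell_line)}_{clean(tissue)}.{gene}.bam"


# Illustrative values only
print(convention_bam_name("U87", "brain"))
```

The returned string can be passed to `Path.rename` to rename an existing bam file.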

Next, we need to sort and index the bam file using samtools functions.

In [None]:
# Sort the bamfile
# bamfile = Path("/home/andrew/allellic/wgEncodeHaibMethylRrbsU87HaibRawDataRep1.bam")
# pyllelic.process_pysam_sort(bamfile)

In [None]:
# Create an index of the sorted bamfile
# sorted_bam = Path("")  # path to the sorted bam file
# pyllelic.process_pysam_index(sorted_bam)

The sorted file (again, renamed to capture cell-line and tissue information) is now ready to be placed in the `test` folder for analysis by pyllelic!
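For readers who want to see what sorting and indexing amount to, the sketch below calls `pysam` directly through its samtools wrappers, `pysam.sort` and `pysam.index`. It is not pyllelic's implementation: `sorted_bam_path` and `sort_and_index` are hypothetical names, and the `.sorted.bam` suffix is an assumption.

```python
from pathlib import Path


def sorted_bam_path(bamfile: Path) -> Path:
    """Return the path for the sorted copy, e.g. x.bam -> x.sorted.bam."""
    return bamfile.with_suffix(".sorted.bam")


def sort_and_index(bamfile: Path) -> Path:
    """Coordinate-sort a bam file and index the result (hypothetical helper)."""
    import pysam  # imported here so the path helper works without pysam installed

    out = sorted_bam_path(bamfile)
    pysam.sort("-o", str(out), str(bamfile))  # same arguments as `samtools sort`
    pysam.index(str(out))  # writes a .bai index next to the sorted file
    return out


# Only the path logic is exercised here; sort_and_index needs a real bam file.
example = sorted_bam_path(Path("wgEncodeHaibMethylRrbsU87HaibRawDataRep1.bam"))
print(example.name)
```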

## Set-up

In [None]:
import pyllelic

In [None]:
# set up your disk location:
# base_path should be the directory we'll do our work in
# make a sub-directory under base_path with a folder named "test"
# and put the .bam and .bai files in "test"

# OSX setup
pyllelic.set_up_env_variables(
    base_path="/Users/abonham/documents/test_allelic/",
    prom_file="TERT-promoter-genomic-sequence.txt",
    prom_start="1293000",
    prom_end="1296000",
    chrom="5",
    offset=1298163,
)

# Windows set-up
# pyllelic.set_up_env_variables(
#     base_path="/home/andrew/allellic/",
#     prom_file="TERT-promoter-genomic-sequence.txt",
#     prom_start="1293000",
#     prom_end="1296000",
#     chrom="chr5",
#     offset=1298163,
# )
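The directory layout described in the set-up comments (a `test` folder under `base_path` holding the `.bam` and `.bai` files) can be prepared and checked with `pathlib`. A minimal sketch: `prepare_workdir` and `bams_missing_index` are hypothetical helpers, the file names are made up, and a temporary directory stands in for your real `base_path`.

```python
import tempfile
from pathlib import Path
from typing import List


def prepare_workdir(base_path: Path) -> Path:
    """Create base_path/test if it does not exist and return it."""
    bam_dir = base_path / "test"
    bam_dir.mkdir(parents=True, exist_ok=True)
    return bam_dir


def bams_missing_index(bam_dir: Path) -> List[str]:
    """List .bam files that have neither an x.bam.bai nor an x.bai beside them."""
    missing = []
    for bam in sorted(bam_dir.glob("*.bam")):
        candidates = (bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai"))
        if not any(c.exists() for c in candidates):
            missing.append(bam.name)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    bam_dir = prepare_workdir(Path(tmp))
    (bam_dir / "fh_A_TISSUE.TERT.bam").touch()
    (bam_dir / "fh_A_TISSUE.TERT.bam.bai").touch()
    (bam_dir / "fh_B_TISSUE.TERT.bam").touch()  # deliberately left without an index
    demo_missing = bams_missing_index(bam_dir)

print(demo_missing)
```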

## Main Parsing Functions

In [None]:
files_set = pyllelic.make_list_of_bam_files()  # finds bam files

In [None]:
# Display for debugging:
files_set

In [None]:
# Indexes the bam files and creates the bam_output folders/files
# Set process=False to skip writing output files if they already exist
positions = pyllelic.index_and_fetch(files_set, process=False)

In [None]:
# Turn off pretty printing to see position list better
%pprint

In [None]:
# Display for debugging:
positions

In [None]:
# Turn back on pretty printing
%pprint

In [None]:
# Only needs to be run once, generates static files
# pyllelic.genome_parsing()

# Can also take sub-list of directories to process
# pyllelic.genome_parsing([pyllelic.config.bam_directory / "fh_BONHAM_TISSUE.TERT.bam"])

In [None]:
cell_types = pyllelic.extract_cell_types(files_set)

In [None]:
# Display for debugging
cell_types

In [None]:
# Set the filename to whatever you want
# Pass run_quma=False to skip the quma step
df_list = pyllelic.run_quma_and_compile_list_of_df(
    cell_types,
    "test1.xlsx",
    run_quma=True,
)

In [None]:
# Display for debugging
df_list.keys()

In [None]:
df_list["NCIH196"]["1295937"]

In [None]:
means = pyllelic.process_means(df_list, positions, files_set)

In [None]:
# Display for debugging
means

In [None]:
modes = pyllelic.process_modes(df_list, positions, files_set)

In [None]:
# Display for debugging
modes

In [None]:
diff = pyllelic.find_diffs(means, modes)

In [None]:
# Display for debugging
diff
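To make the idea of a mean/mode difference table concrete, here is a sketch in plain pandas. It assumes that `means` and `modes` are DataFrames sharing a cell-line index and position columns, which this notebook does not confirm, and it is not necessarily what `find_diffs` does internally. The cell-line labels and all numbers are invented.

```python
import pandas as pd

cells = ["CELL_A", "CELL_B"]  # made-up cell-line labels
toy_means = pd.DataFrame(
    {"1295680": [0.50, 0.90], "1295937": [0.10, 0.45]}, index=cells
)
toy_modes = pd.DataFrame(
    {"1295680": [1.00, 1.00], "1295937": [0.00, 1.00]}, index=cells
)

# Element-wise difference; rows and columns are aligned by label
toy_diff = toy_means.subtract(toy_modes)

# For each cell line, the position where mean and mode disagree the most
largest_gap = toy_diff.abs().idxmax(axis=1)

print(toy_diff)
print(largest_gap)
```

A large gap between mean and mode at a position suggests that the most common read pattern is not representative of the average, which is the kind of site worth inspecting with a histogram.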

## Write Output to Excel Files

In [None]:
# Set the filename to whatever you want
pyllelic.write_means_modes_diffs(means, modes, diff, "Test1")

## Visualizing Data

In [None]:
final_data = pyllelic.pd.read_excel(
    pyllelic.config.base_directory.joinpath("Test1_diff.xlsx"),
    dtype=str,
    index_col=0,
    engine="openpyxl",
)

In [None]:
final_data
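Because the sheet is read with `dtype=str`, every cell arrives as text. If numeric work is needed, the columns can be converted with `pd.to_numeric`. The sketch below uses a small invented table in place of `final_data`; the column names and values are illustrative.

```python
import pandas as pd

# Stand-in for a table read with dtype=str (values invented)
toy_text = pd.DataFrame(
    {"1295680": ["0.5", "n/a"], "1295937": ["-0.25", "0.1"]},
    index=["CELL_A", "CELL_B"],
)

# errors="coerce" turns anything unparseable into NaN instead of raising
toy_numeric = toy_text.apply(pd.to_numeric, errors="coerce")

print(toy_numeric.dtypes)
print(toy_numeric)
```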

In [None]:
individual_data = pyllelic.return_individual_data(df_list, positions, files_set)

In [None]:
# Display for debugging
individual_data

In [None]:
individual_data.loc["NCIH196", "1295937"]

In [None]:
individual_data.loc["SORTED"]["1295680"]

In [None]:
individual_data.loc["CALU1"]

In [None]:
pyllelic.histogram(individual_data, "SORTED", "1295680")

In [None]:
pyllelic.histogram(individual_data, "SORTED", "1295903")

In [None]:
pyllelic.histogram(individual_data, "SW1710", "1295089")

In [None]:
pyllelic.histogram(individual_data, "CALU1", "1295937")

In [None]:
pyllelic.histogram(individual_data, "NCIH196", "1295937")

In [None]:
pyllelic.histogram(individual_data, "NCIH196", "1294945")

In [None]:
final_data.loc["SW1710"]

## Statistical Tests for Normality

In [None]:
pyllelic.summarize_allelic_data(individual_data, diff)
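As a rough illustration of a normality check on per-read methylation values, the sketch below applies the Shapiro-Wilk test from `scipy.stats`. Choosing Shapiro-Wilk and a 0.05 threshold are assumptions made for this example, not a description of what `summarize_allelic_data` does, and the data are synthetic.

```python
from scipy import stats

# Synthetic sample: half the reads unmethylated (0.0), half fully methylated (1.0)
reads = [0.0] * 30 + [1.0] * 30

statistic, p_value = stats.shapiro(reads)

# A small p-value means the sample is unlikely to come from a normal distribution
looks_normal = bool(p_value > 0.05)

print(f"W={statistic:.3f}, p={p_value:.3g}, looks normal: {looks_normal}")
```

A two-cluster sample like this one is strongly non-normal, so the test rejects normality.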