# Import fragments with ```scPrinter```

- Function to use: [scprinter.pp.import_fragments](https://ruochiz.com/scprinter_doc/reference/_autosummary/scprinter.pp.import_fragments.html#scprinter.pp.import_fragments)
- Tutorial to follow: [scPrinter PBMC scATAC-seq tutorial](https://ruochiz.com/scprinter_doc/tutorials/PBMC_scATAC_tutorial.html#Now-let's-use-scPrinter-for-some-basic-exploratory-analysis-to-get-a-better-idea-of-the-dataset)

## 0. Imports

In [1]:
%load_ext autoreload
%autoreload 2
import scprinter as scp
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import time
import numpy as np
import os
import pickle
import torch
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
from scanpy.plotting.palettes import zeileis_28
from tqdm.contrib.concurrent import *
from tqdm.auto import *
import anndata
import scanpy as sc
import statistics as stat
import json
import csv
import re
import copy
from sklearn.preprocessing import OneHotEncoder

In [2]:
import snapatac2 as snap

### 0.1 Setup

In [3]:
# Specify the reference genome. This must match that of your ATAC fragments file
genome = scp.genome.mm10

genome

<scprinter.genome.Genome at 0x7f6dffd34450>

## 1. Paths

### 1.1 Data directories

In [4]:
master_data_dir = '/bap/bap/collab_asthma_multiome/'

In [5]:
# Helpers that build input paths; the directory uses the barcode-style name ('OVA_C'), the file the on-disk name ('OVAC')
def get_condition_fragments_path(sample_name_bc, sample_name_frag):
    return os.path.join(master_data_dir, 'ATAC', 'ATACFragmentFiles_Asthma', sample_name_bc, f'{sample_name_frag}_atac_fragments.tsv.gz')
def get_condition_valid_barcodes_path(sample_name):
    return os.path.join(master_data_dir, 'outputs', 'ATAC', '2_Analysis_Outputs', '1a_ChromVAR_Inputs', f'{sample_name}_valid_barcodes.txt')

In [6]:
# outputs
printer_h5ad_output_dir = os.path.join(master_data_dir, 'ATAC', '2_Analysis_Outputs', '1b_ChromVAR_scPrinter_object')
printer_h5ad_output_path = os.path.join(printer_h5ad_output_dir, 'Asthma_Multiome_Collab_scPrinter.h5ad')

# create the output directory if it does not exist yet
os.makedirs(printer_h5ad_output_dir, exist_ok=True)

### 1.2 Prep paths

In [7]:
# Sample names
sample_names_bc = ['NT',
                'OVA_C',
                'OVA',
                'PBS_C',
                'PBS'
                ]

# on-disk fragments files are named slightly differently
sample_names_load_fragments = ['NT',
                                'OVAC',
                                'OVA',
                                'PBSC',
                                'PBS'
                                ]

In [8]:
# Build the paths to the per-condition fragments files and valid-barcode files

fragment_paths_l = []
valid_barcodes_l = []   # order-matched to fragment_paths_l
for sample_name_fragments_i, sample_name_bc_i in zip(sample_names_load_fragments, sample_names_bc):
    fragment_paths_l.append(get_condition_fragments_path(sample_name_bc_i, sample_name_fragments_i))
    valid_barcodes_l.append(get_condition_valid_barcodes_path(sample_name_bc_i))

In [9]:
fragment_paths_l

['/bap/bap/collab_asthma_multiome/ATAC/ATACFragmentFiles_Asthma/NT/NT_atac_fragments.tsv.gz',
 '/bap/bap/collab_asthma_multiome/ATAC/ATACFragmentFiles_Asthma/OVA_C/OVAC_atac_fragments.tsv.gz',
 '/bap/bap/collab_asthma_multiome/ATAC/ATACFragmentFiles_Asthma/OVA/OVA_atac_fragments.tsv.gz',
 '/bap/bap/collab_asthma_multiome/ATAC/ATACFragmentFiles_Asthma/PBS_C/PBSC_atac_fragments.tsv.gz',
 '/bap/bap/collab_asthma_multiome/ATAC/ATACFragmentFiles_Asthma/PBS/PBS_atac_fragments.tsv.gz']

In [10]:
valid_barcodes_l

['/bap/bap/collab_asthma_multiome/outputs/ATAC/2_Analysis_Outputs/1a_ChromVAR_Inputs/NT_valid_barcodes.txt',
 '/bap/bap/collab_asthma_multiome/outputs/ATAC/2_Analysis_Outputs/1a_ChromVAR_Inputs/OVA_C_valid_barcodes.txt',
 '/bap/bap/collab_asthma_multiome/outputs/ATAC/2_Analysis_Outputs/1a_ChromVAR_Inputs/OVA_valid_barcodes.txt',
 '/bap/bap/collab_asthma_multiome/outputs/ATAC/2_Analysis_Outputs/1a_ChromVAR_Inputs/PBS_C_valid_barcodes.txt',
 '/bap/bap/collab_asthma_multiome/outputs/ATAC/2_Analysis_Outputs/1a_ChromVAR_Inputs/PBS_valid_barcodes.txt']
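
Since the import takes several minutes, it can be worth confirming up front that every input file exists and that the two lists have the same length. The following is a minimal sketch using only the standard library; `check_inputs` is a hypothetical helper, not part of scPrinter, and the demo runs on temporary files rather than the real data.

```python
import os
import tempfile

def check_inputs(fragment_paths, barcode_paths):
    """Return the input paths that do not exist on disk.

    Hypothetical helper (not part of scPrinter). Raises ValueError if the
    two lists differ in length, since they are meant to be order-matched.
    """
    if len(fragment_paths) != len(barcode_paths):
        raise ValueError(
            f"{len(fragment_paths)} fragments files but {len(barcode_paths)} barcode files"
        )
    all_paths = list(fragment_paths) + list(barcode_paths)
    return [p for p in all_paths if not os.path.isfile(p)]

# Demo on temporary files standing in for the real inputs
with tempfile.TemporaryDirectory() as tmp_dir:
    frag = os.path.join(tmp_dir, "NT_atac_fragments.tsv.gz")
    bc = os.path.join(tmp_dir, "NT_valid_barcodes.txt")
    open(frag, "w").close()  # create the fragments file only
    demo_missing = check_inputs([frag], [bc])
    demo_missing_names = [os.path.basename(p) for p in demo_missing]

print(demo_missing_names)
```

In this notebook the call would be `check_inputs(fragment_paths_l, valid_barcodes_l)`, which should return an empty list before the import is started.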

In [11]:
# TODO: you'll likely need txt files of barcodes:subtype pairings per condition too,
# when you do the manual t-test later and need to group barcodes by subtype
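
One possible layout for those pairing files is a two-column, tab-separated table of barcode and subtype per condition. The sketch below assumes that layout; the file format, the helper name `group_barcodes_by_subtype`, and the example barcodes and subtype labels are all illustrative and not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

def group_barcodes_by_subtype(handle):
    """Read 'barcode<TAB>subtype' rows and return {subtype: [barcodes]}."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        barcode, subtype = row[0], row[1]
        groups[subtype].append(barcode)
    return dict(groups)

# Made-up content standing in for one condition's pairing file
example = io.StringIO("AAAC-1\tTh2\nAAAG-1\tTreg\nAAAT-1\tTh2\n")
subtype_to_barcodes = group_barcodes_by_subtype(example)
print(subtype_to_barcodes)
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, for example `open(path, newline="")`.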

## 2. ```scPrinter``` analysis

### 2.1 Initialize the scPrinter object

When you finish using the object, run ```printer.close()```; otherwise you won't be able to load it properly next time.

**Note Feb 24, 2025:** the QC filters of ```import_fragments()```

```min_num_fragments=1000, min_tsse=7```

may have lowered the number of cells passing QC from 7797 to 7747.

According to the source code, ```min_tsse``` is no longer used:

```python
# these are historical_kwargs that snapatac2 takes, but not anymore
for historical_kwarg in ["min_tsse", "low_memory"]:
    if historical_kwarg in kwargs:
        del kwargs[historical_kwarg]
```

For QC consistency, we do not re-filter on the number of fragments, because the cells were already QC'd in the R notebook. This should produce a ```printer``` object with the same number of cells (7797) as the barcode preparation notebook.
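
The expected cell count can be derived directly from the barcode files and compared with the `n_obs` that `printer` reports after import. This is a minimal sketch, assuming each file holds one barcode per line; `count_valid_barcodes` is a hypothetical helper and the demo uses a temporary file rather than the real inputs.

```python
import os
import tempfile

def count_valid_barcodes(barcode_paths):
    """Return {path: number of unique non-empty lines} for each barcode file."""
    counts = {}
    for path in barcode_paths:
        with open(path) as handle:
            counts[path] = len({line.strip() for line in handle if line.strip()})
    return counts

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_valid_barcodes.txt")
    with open(demo_path, "w") as handle:
        handle.write("AAAC-1\nAAAG-1\nAAAG-1\n\n")  # one duplicate, one blank line
    demo_counts = count_valid_barcodes([demo_path])
    demo_total = sum(demo_counts.values())

print(demo_total)
```

Here, `sum(count_valid_barcodes(valid_barcodes_l).values())` would be expected to equal 7797 if the barcode files match the barcode preparation notebook.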

In [12]:
import time
start = time.time()

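# One fragments file, one pass-QC barcode txt file, and one sample name per condition (order-matched)
printer = scp.pp.import_fragments(
                            path_to_frags=fragment_paths_l,
                            barcodes=valid_barcodes_l,
                            savename=printer_h5ad_output_path,
                            sample_names=sample_names_bc,
                            genome=genome,
                            min_num_fragments=0,  # no re-filtering; cells were already QC'd upstream
                            min_tsse=7,           # historical kwarg, dropped by import_fragments (see note above)
                            sorted_by_barcode=False,
                            low_memory=False,     # historical kwarg, also dropped
                            )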

end = time.time()

print(f"Time taken to import fragments: {end - start} seconds")

You are now using the beta auto_detect_shift function, this overwrites the plus_shift and minus_shift you provided
If you believe the auto_detect_shift is wrong, please set auto_detect_shift=False


Importing fragments:   0%|          | 0/5 [00:00<?, ?it/s]

Detecting the shift in the paired end fragments file
If you think the above message is wrong, please check the input file format
Minimum MSE is 0.00039826999809825463, shift detected
Minimum MSE is 0.0003485204024557461, shift detected
detected plus_shift and minus_shift are 4 -5 for /bap/bap/collab_asthma_multiome/ATAC/ATACFragmentFiles_Asthma/NT/NT_atac_fragments.tsv.gz
Detecting the shift in the paired end fragments file
If you think the above message is wrong, please check the input file format
Minimum MSE is 0.0004886510066651824, shift detected
Minimum MSE is 0.000523652283791898, shift detected
detected plus_shift and minus_shift are 4 -5 for /bap/bap/collab_asthma_multiome/ATAC/ATACFragmentFiles_Asthma/OVA_C/OVAC_atac_fragments.tsv.gz
Detecting the shift in the paired end fragments file
If you think the above message is wrong, please check the input file format
Minimum MSE is 0.0003478629785362107, shift detected
Minimum MSE is 0.00030655893975362373, shift detected
detected pl

  0%|          | 0/5 [00:00<?, ?it/s]

start transferring insertions
Time taken to import fragments: 792.0684185028076 seconds


In [13]:
printer

head project
AnnData object with n_obs x n_vars = 7418 x 0 backed at '/bap/bap/collab_asthma_multiome/ATAC/2_Analysis_Outputs/1b_ChromVAR_scPrinter_object/Asthma_Multiome_Collab_scPrinter.h5ad'
    obs: 'sample', 'n_fragment', 'frac_dup', 'frac_mito', 'frag_path', 'frag_sample_name', 'tsse'
    uns: 'unique_string', 'reference_sequences', 'bias_bw', 'footprints', 'genome', 'insertion', 'gff_db', 'bias_path', 'binding score'
    obsm: 'insertion_chr7', 'insertion_chr3', 'insertion_chr4', 'insertion_chr9', 'insertion_chr2', 'insertion_chr11', 'insertion_chr8', 'insertion_chr13', 'insertion_chrX', 'insertion_chr1', 'insertion_chr14', 'insertion_chr17', 'insertion_chrY', 'insertion_chr10', 'insertion_chr16', 'insertion_chr15', 'insertion_chr6', 'insertion_chr18', 'insertion_chr5', 'insertion_chr19', 'insertion_chr12'

Note that the object reports 7418 cells, fewer than the 7797 expected above; the cause of the difference is worth checking before downstream analysis.




**Always, always remember to close the object!**

In [14]:
printer.close()
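
When the same steps run in a script rather than across notebook cells, the close call can be guaranteed with the standard library's `contextlib.closing`, which calls `.close()` on exit even if an exception is raised. The sketch below uses a stand-in class (`DummyPrinter`, hypothetical) instead of a real scPrinter object, on the assumption that calling `close()` is all the object needs.

```python
from contextlib import closing

class DummyPrinter:
    """Stand-in for an object exposing close(); not a real scPrinter object."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

dummy = DummyPrinter()
try:
    with closing(dummy) as printer_like:
        # Any analysis code would go here; simulate a failure mid-way
        raise RuntimeError("simulated failure during analysis")
except RuntimeError:
    pass

print(dummy.closed)
```

The same pattern would wrap the object returned by `scp.pp.import_fragments(...)`, so that an error during analysis does not leave the backing file open.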

In [15]:
printer_h5ad_output_path

'/bap/bap/collab_asthma_multiome/ATAC/2_Analysis_Outputs/1b_ChromVAR_scPrinter_object/Asthma_Multiome_Collab_scPrinter.h5ad'

# END