# Analyzing the cardiomyocyte data with the asari pipeline

- Goal: find the sunitinib metabolites from Bowen et al. 2023, an analysis published in the asari pipeline (pcpfm) paper
- Mitchell, J.M., Chi, Y., Thapa, M., Pang, Z., Xia, J. and Li, S., 2024. Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline. PLOS Computational Biology, 20(6), p.e1011912. (https://doi.org/10.1371/journal.pcbi.1011912)
- Original repo: https://github.com/shuzhao-li-lab/PythonCentricPipelineForMetabolomics


In [None]:
# let's download the example dataset
# you can download the zip from here : 
# let's assume you downloaded it to ~/Downloads/ and unzipped it

In [7]:
# now we need to make the sequence file as input to the pipeline

import pandas as pd
import os

bowen_sequence = []
downloads_path = os.path.abspath(os.path.expanduser("~/Downloads/Bowen_CellData/"))
for f in os.listdir(downloads_path):
    file_path = os.path.join(downloads_path, f)
    # strip the extension; str.rstrip(".mzML") removes trailing characters, not a suffix
    filename = os.path.splitext(os.path.basename(file_path))[0]
    if "blank" in filename:
        sample_type = "Blank"
    else:
        sample_type = "Unknown"
    bowen_sequence.append({
        "Filepath": file_path,
        "File Name": filename,
        "Method": "Unknown",
        "Sample Type": sample_type
    })
pd.DataFrame(bowen_sequence).to_csv("bowen_cell_sequence.csv", index=False)
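
Before handing the sequence file to the pipeline, it can help to confirm that the blanks were labeled as intended. The following is a minimal sketch using only the standard library; the helper name `count_sample_types` is hypothetical and not part of pcpfm. For reference, the asari log later in this notebook lists 20 extracted files, two of which are blanks.

```python
import csv
import os
from collections import Counter

def count_sample_types(csv_path, column="Sample Type"):
    """Count how many rows of the sequence file fall under each sample type."""
    with open(csv_path, newline="") as fh:
        return Counter(row[column] for row in csv.DictReader(fh))

# Only runs if the sequence file written above is present in the working directory.
sequence_csv = "bowen_cell_sequence.csv"
if os.path.exists(sequence_csv):
    print(count_sample_types(sequence_csv))
```

If the "Blank" count is zero, the `"blank" in filename` test above probably did not match the file names in your download.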


In [8]:
# The block below runs the pcpfm analysis. It relies on python3 being on your $PATH.
# If it is not, open a terminal, cd to the directory printed by os.getcwd(),
# add python3 to your $PATH, and run the commands from the block below manually.
# Otherwise, you can use the block below to do it 'inline'.

os.getcwd()

'/Users/mitchjo/pcpfm_tutorials/notebooks'
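
One way to check the `$PATH` requirement from inside the notebook is `shutil.which`, which returns `None` when an executable cannot be found. This is a small sketch, not part of pcpfm; the helper name `find_missing_tools` is hypothetical, and the tool list simply mirrors the commands used in this notebook.

```python
import shutil

def find_missing_tools(tools=("python3", "pcpfm", "asari")):
    """Return the subset of `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools()
if missing:
    print("Not found on $PATH:", ", ".join(missing))
else:
    print("All required executables were found on $PATH.")
```

If anything is reported missing, run the commands from a terminal where the environment is set up instead.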

In [9]:
%%bash

pcpfm assemble -s bowen_cell_sequence.csv -j bowen_cell -o .
pcpfm asari -i bowen_cell





~~~~~~~ Hello from Asari (1.12.8) ~~~~~~~~~

Working on ~~ /Users/mitchjo/pcpfm_tutorials/notebooks/bowen_cell/converted_acquisitions/ ~~ 






Extracted AZ1.mzML to 1303 mass tracks.
Extracted AZ15.mzML to 1427 mass tracks.
Extracted AZ11.mzML to 1738 mass tracks.
Extracted AZ13.mzML to 1613 mass tracks.
Extracted AZ10.mzML to 1795 mass tracks.
Extracted AZ16.mzML to 1560 mass tracks.
Extracted AZ12.mzML to 1736 mass tracks.
Extracted AZ14.mzML to 1571 mass tracks.
Extracted AZ2.mzML to 1433 mass tracks.
Extracted AZ4.mzML to 1516 mass tracks.
Extracted AZ6.mzML to 1647 mass tracks.
Extracted AZ17.mzML to 3104 mass tracks.
Extracted AZ3.mzML to 1483 mass tracks.
Extracted AZ7.mzML to 1583 mass tracks.
Extracted AZ5.mzML to 2104 mass tracks.
Extracted AZ18.mzML to 1824 mass tracks.
Extracted AZ8.mzML to 1518 mass tracks.
Extracted blank_cell_end.mzML to 1021 mass tracks.
Extracted AZ9.mzML to 1466 mass tracks.
Extracted blank_cell_start.mzML to 1064 mass tracks.

    The reference sample is:
    ||* AZ17 *||

Max reference retention time is 840.44 at scan number 2245.

Constructing MassGrid, ...

Building composite mass tracks



Mass accuracy was estimated on 212 matched values as -0.3 ppm.


Multiple charges considered: [1, 2, 3]


Khipu search grid: 
               M+H+       Na/H        HCl        K/H        ACN
M0         1.007276  22.989276  36.983976  38.963158  42.033825
13C/12C    2.010631  23.992631  37.987331  39.966513  43.037180
13C/12C*2  3.013986  24.995986  38.990686  40.969868  44.040535
Constructed 837 khipus in this round.


Khipu search grid: 
                       M+H+, 2x charged  ...  ACN, 2x charged
M0                             0.503638  ...        21.520551
13C/12C, 2x charged            1.005316  ...        22.022228
13C/12C*2, 2x charged          1.506993  ...        22.523906

[3 rows x 5 columns]
Constructed 44 khipus in this round.


Khipu search grid: 
                       M+H+, 3x charged  ...  ACN, 3x charged
M0                             0.335759  ...        14.682793
13C/12C, 3x charged            0.670210  ...        15.017244
13C/12C*2, 3x charged          1.004662  ..

In [11]:
%%bash 

# now let's blank-mask the features and drop every sample that is not of type Unknown (here, the blanks)

pcpfm blank_masking --table_moniker preferred --new_moniker pref_blank_masked --blank_value Blank --sample_value Unknown --query_field "Sample Type" --blank_intensity_ratio 3 -i bowen_cell
pcpfm drop_samples --table_moniker pref_blank_masked --new_moniker masked_pref_unknowns --drop_value Unknown --drop_field "Sample Type" --drop_others true -i bowen_cell

pcpfm blank_masking --table_moniker full --new_moniker full_blank_masked --blank_value Blank --sample_value Unknown --query_field "Sample Type" --blank_intensity_ratio 3 -i bowen_cell
pcpfm drop_samples --table_moniker full_blank_masked --new_moniker masked_full_unknowns --drop_value Unknown --drop_field "Sample Type" --drop_others true -i bowen_cell
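
To make the idea behind `--blank_intensity_ratio 3` concrete, here is a toy sketch of blank masking in plain Python. It is an illustration only and not pcpfm's implementation: the use of the mean as the summary statistic, the `>=` comparison, the function name `blank_mask`, and the sample and feature names are all assumptions made for the example.

```python
from statistics import mean

def blank_mask(features, blank_samples, study_samples, ratio=3.0):
    """Keep features whose mean study intensity is at least `ratio` times the mean blank intensity.

    `features` is a list of dicts mapping sample names to intensities.
    Assumption: the mean is used as the summary statistic; pcpfm may differ.
    """
    kept = []
    for feature in features:
        blank_mean = mean(feature[s] for s in blank_samples)
        study_mean = mean(feature[s] for s in study_samples)
        if study_mean >= ratio * blank_mean:
            kept.append(feature)
    return kept

# Made-up intensities for two hypothetical features.
toy_features = [
    {"id": "T1", "blank_a": 100, "blank_b": 100, "s1": 1000, "s2": 800},
    {"id": "T2", "blank_a": 500, "blank_b": 300, "s1": 600, "s2": 400},
]
kept = blank_mask(toy_features, ["blank_a", "blank_b"], ["s1", "s2"], ratio=3.0)
print([feature["id"] for feature in kept])
```

In this toy case only `T1` survives, because `T2` is barely more intense in the study samples than in the blanks.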




In [23]:
# let's look for the features reported in the paper
expected_features_cell = [
    (399.1823, 33.4,  'M0_1'),
    (399.2184, 224.8, 'M0_2'),
    (371.1874, 257.2, 'M1'),
    (415.2134, 264.8, 'M2_1'),
    (415.2128, 224.5, 'M2_2'),
    (343.1562, 249.4, 'M3'),
    (387.1823, 35.4,  'M4_1'),
    (387.1824, 242.5, 'M4_2'),
    (413.1978, 34.6,  'M12'),
    (385.1667, 38.9,  'M14_1'),
    (385.1666, 65.8,  'M14_2'),
    (159.1490, 273.2, 'M20')
]

import json

# load the preferred and full feature table
pft_cell = pd.read_csv("./bowen_cell/filtered_feature_tables/masked_pref_unknowns_Feature_table.tsv", sep="\t")
fft_cell = pd.read_csv("./bowen_cell/filtered_feature_tables/masked_full_unknowns_Feature_table.tsv", sep="\t")


In [24]:
# this builds an efficient structure to search the feature table by

from intervaltree import IntervalTree

ppm_mz_tol = 5
rt_tol = 10

def build_trees(table, ppm_tol, rt_tol):
    """Build m/z and retention-time interval trees keyed by feature id."""
    mz_tree, rt_tree = IntervalTree(), IntervalTree()
    for mz, rt, feature_id in zip(table['mz'], table['rtime'], table['id_number']):
        mz_err = mz / 1e6 * ppm_tol
        mz_tree.addi(mz - mz_err, mz + mz_err, feature_id)
        rt_tree.addi(rt - rt_tol, rt + rt_tol, feature_id)
    return mz_tree, rt_tree


def report_matches(label, mz_tree, rt_tree, expected_features):
    """Print, for each expected feature, the ids that match in both m/z and RT."""
    print(label)
    for exp_mz, exp_rt, name in expected_features:
        matches_mz = {x.data for x in mz_tree.at(exp_mz)}
        matches_rt = {x.data for x in rt_tree.at(exp_rt)}
        print(name, matches_mz & matches_rt)


pft_mz_tree, pft_rt_tree = build_trees(pft_cell, ppm_mz_tol, rt_tol)
fft_mz_tree, fft_rt_tree = build_trees(fft_cell, ppm_mz_tol, rt_tol)

# now look for the features

report_matches("Cell - preferred", pft_mz_tree, pft_rt_tree, expected_features_cell)
print()
report_matches("Cell - full", fft_mz_tree, fft_rt_tree, expected_features_cell)

Cell - preferred
M0_1 {'F3900', 'F3899'}
M0_2 set()
M1 set()
M2_1 set()
M2_2 set()
M3 {'F3729'}
M4_1 {'F3483'}
M4_2 {'F3488'}
M12 {'F4048'}
M14_1 set()
M14_2 set()
M20 {'F395'}

Cell - full
M0_1 {'F3900', 'F3899'}
M0_2 {'F3910'}
M1 {'F3741'}
M2_1 {'F4229'}
M2_2 set()
M3 {'F3729'}
M4_1 {'F3483', 'F3484'}
M4_2 {'F3488'}
M12 {'F4047', 'F4048', 'F4049'}
M14_1 {'F3364', 'F3366', 'F3365'}
M14_2 {'F3368'}
M20 {'F395'}
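
As a cross-check on the interval-tree lookup, the same matching rule can be written as a brute-force scan. This is a sketch under the same tolerances (5 ppm on m/z, 10 on retention time); the function name `match_features` and the toy rows are hypothetical. One small difference: `IntervalTree` intervals exclude their upper bound, whereas this version includes both ends of the window.

```python
def match_features(features, exp_mz, exp_rt, ppm_tol=5, rt_tol=10):
    """Return ids of features within ppm_tol of exp_mz and rt_tol of exp_rt.

    `features` is an iterable of (feature_id, mz, rtime) tuples.
    """
    matches = set()
    for feature_id, mz, rt in features:
        mz_err = mz / 1e6 * ppm_tol  # window is relative to the feature's own m/z
        if abs(exp_mz - mz) <= mz_err and abs(exp_rt - rt) <= rt_tol:
            matches.add(feature_id)
    return matches

# Toy rows with made-up ids and values, for illustration only.
toy_features = [
    ("T1", 399.1825, 35.0),   # close in both m/z and RT
    ("T2", 399.1825, 120.0),  # close in m/z, far in RT
    ("T3", 399.3000, 35.0),   # far in m/z, close in RT
]
print(match_features(toy_features, 399.1823, 33.4))
```

With the real tables, the input could be built as `zip(fft_cell['id_number'], fft_cell['mz'], fft_cell['rtime'])`. The scan is slower than the trees but easier to reason about when a match is unexpectedly missing.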


In [28]:
# As you can see, one feature (M2_2) is missing even from the full table, but why is it missing?

# let's look using asari viz

with open("./bowen_cell/experiment.json") as fh:
    experiment = json.load(fh)
asari_path = experiment["feature_tables"]['full'].split("export")[0]
os.system("asari viz --input " + asari_path)





~~~~~~~ Hello from Asari (1.12.8) ~~~~~~~~~

//*Asari dashboard*//   Press Control-C to exit.
Launching server at http://localhost:58687


In [None]:
# now find the mass track for m/z=415.2128 using asari viz
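
A quick calculation helps frame what to look for on that mass track. The two expected M2 features (415.2134 and 415.2128) differ by only about 1.4 ppm in m/z, well inside the 5 ppm tolerance used above, so they are told apart by retention time alone. The sketch below works this out from the values in `expected_features_cell`; the helper name `ppm_difference` is hypothetical.

```python
def ppm_difference(mz_a, mz_b):
    """Relative m/z difference between two values, in parts per million."""
    return abs(mz_a - mz_b) / mz_b * 1e6

# (m/z, retention time) pairs copied from expected_features_cell
m2_1 = (415.2134, 264.8)
m2_2 = (415.2128, 224.5)

ppm = ppm_difference(m2_1[0], m2_2[0])
rt_gap = abs(m2_1[1] - m2_2[1])
print(f"m/z difference between M2_1 and M2_2: {ppm:.2f} ppm")
print(f"RT difference between M2_1 and M2_2: {rt_gap:.1f} (same units as rt_tol)")
```

Since M2_1 was matched (F4229) and M2_2 was not, a reasonable thing to check in the dashboard is whether the mass track shows a second, separate peak near retention time 224.5, in addition to the one near 264.8.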