# Analyzing the HZV029 data with the asari pipeline (pcpfm)

- Goal: process and annotate the HZV029 dataset, which was published with the asari pipeline (pcpfm) paper:
- Mitchell, J.M., Chi, Y., Thapa, M., Pang, Z., Xia, J. and Li, S., 2024. Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline. PLOS Computational Biology, 20(6), p.e1011912. (https://doi.org/10.1371/journal.pcbi.1011912)
- Original repo: https://github.com/shuzhao-li-lab/PythonCentricPipelineForMetabolomics


In [3]:
# Let's download the example dataset.
# You can download the zip from here: https://drive.google.com/file/d/1PikUcw3fyF3AgMjCqp42hyVhEvl4Y5mw/view
# We assume you downloaded it to ~/Downloads/ and unzipped it.



In [35]:
# Now we need to build the sequence file that serves as input to the pipeline.

import pandas as pd
import os

HZV029_subset_sequence = []
downloads_path = os.path.abspath(os.path.expanduser("~/Downloads/HZV029_subset/"))
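for f in os.listdir(downloads_path):
    if not f.endswith(".mzML"):
        continue  # skip anything that is not an mzML file (e.g. hidden system files)
    file_path = os.path.join(downloads_path, f)
    # splitext removes the extension; rstrip(".mzML") would instead strip any
    # trailing '.', 'm', 'z', 'M' or 'L' characters and can eat part of the name
    filename = os.path.splitext(f)[0]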
    HZV029_subset_sequence.append({
        "Filepath": file_path,
        "File Name": filename,
        "Method": "Unknown",
        "Sample Type": "Unknown"
    })
pd.DataFrame(HZV029_subset_sequence).to_csv("HZV029_subset_sequence.csv", index=False)
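
Before handing the sequence file to the pipeline, it can help to sanity-check it. The sketch below is not part of pcpfm: `check_sequence` is a hypothetical helper that assumes the four column names written by the cell above and reports missing columns, paths that do not exist, and duplicated file names.

```python
import os
import pandas as pd

# Columns written by the cell above; treated here as the expected layout (an assumption).
EXPECTED_COLUMNS = ["Filepath", "File Name", "Method", "Sample Type"]

def check_sequence(csv_path):
    """Return a list of human-readable problems found in a sequence CSV (hypothetical helper)."""
    seq = pd.read_csv(csv_path)
    problems = []
    missing_cols = [c for c in EXPECTED_COLUMNS if c not in seq.columns]
    if missing_cols:
        # Without the expected columns the remaining checks cannot run.
        problems.append(f"missing columns: {missing_cols}")
        return problems
    for path in seq["Filepath"]:
        if not os.path.isfile(path):
            problems.append(f"file not found: {path}")
    for name in seq["File Name"][seq["File Name"].duplicated()].unique():
        problems.append(f"duplicated file name: {name}")
    return problems
```

Calling `check_sequence("HZV029_subset_sequence.csv")` should return an empty list when every row points at an existing file and every file name is unique.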


In [42]:
# The next cell runs the pcpfm analysis inline. This relies upon python3 being on your $PATH.
# If it is not, add it to your $PATH, open a terminal, cd to the directory printed below,
# and run the commands from the next cell manually.

os.getcwd()

'/Users/mitchjo/pcpfm_tutorials/notebooks'
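
Whether the required executables are visible from the notebook can be checked before running anything. This is a minimal sketch using the standard library's `shutil.which`; looking for `pcpfm` in addition to `python3` is an assumption based on the commands used in the next cell, and `find_missing_tools` is a hypothetical helper.

```python
import shutil

def find_missing_tools(tools=("python3", "pcpfm")):
    """Return the names in `tools` that cannot be found on the current $PATH."""
    return [t for t in tools if shutil.which(t) is None]

missing = find_missing_tools()
if missing:
    print("Not on $PATH:", ", ".join(missing))
else:
    print("python3 and pcpfm were both found on $PATH")
```

If anything is reported as missing, running the commands manually from a terminal, as described above, is the fallback.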

In [40]:
%%bash

pcpfm assemble -s HZV029_subset_sequence.csv -j HZV029_subset -o .
pcpfm asari -i HZV029_subset
pcpfm build_empCpds -i HZV029_subset -tm full -em full
pcpfm l4_annotate -i HZV029_subset -em full -nm full_w_l4





~~~~~~~ Hello from Asari (1.12.8) ~~~~~~~~~

Working on ~~ /Users/mitchjo/pcpfm_tutorials/notebooks/HZV029_subset/converted_acquisitions/ ~~ 






Extracted batch13_MT_20210807_139.mzML to 3305 mass tracks.
Extracted batch10_MT_20210804_007.mzML to 3553 mass tracks.
Extracted batch10_MT_20210804_173.mzML to 3836 mass tracks.
Extracted batch11_MT_20210805_003.mzML to 3849 mass tracks.
Extracted batch17_MT_20210811_187.mzML to 3371 mass tracks.
Extracted batch5_MT_20210730_007.mzML to 3434 mass tracks.
Extracted batch7_MT_20210801_003C.mzML to 3910 mass tracks.
Extracted batch6_MT_20210731_003K.mzML to 3957 mass tracks.
Extracted batch8_MT_20210802_189.mzML to 3987 mass tracks.
Extracted batch9_MT_20210803_007.mzML to 3474 mass tracks.

    The reference sample is:
    ||* batch6_MT_20210731_003K *||

Max reference retention time is 299.75 at scan number 779.

Constructing MassGrid, ...
Adding sample to MassGrid, batch10_MT_20210804_007
    mapped pairs = 1067 / 3553 
Adding sample to MassGrid, batch10_MT_20210804_173
    mapped pairs = 3075 / 3836 
Adding sample to MassGrid, batch11_MT_20210805_003
    mapped pairs = 3569 / 3849 




Mass accuracy was estimated on 319 matched values as 0.4 ppm.


Multiple charges considered: [1, 2, 3]


Khipu search grid: 
               M+H+       Na/H        HCl        K/H        ACN
M0         1.007276  22.989276  36.983976  38.963158  42.033825
13C/12C    2.010631  23.992631  37.987331  39.966513  43.037180
13C/12C*2  3.013986  24.995986  38.990686  40.969868  44.040535
Downsized input network with 16 features, highest peak at F9151 
Downsized input network with 18 features, highest peak at F9694 
Downsized input network with 17 features, highest peak at F10266 
Downsized input network with 19 features, highest peak at F11978 
Downsized input network with 25 features, highest peak at F7169 
Constructed 2061 khipus in this round.


Khipu search grid: 
                       M+H+, 2x charged  ...  ACN, 2x charged
M0                             0.503638  ...        21.520551
13C/12C, 2x charged            1.005316  ...        22.022228
13C/12C*2, 2x charged          1.506993  ... 





Multiple charges considered: [1, 2, 3]


Khipu search grid: 
               M+H+       Na/H        HCl        K/H        ACN
M0         1.007276  22.989276  36.983976  38.963158  42.033825
13C/12C    2.010631  23.992631  37.987331  39.966513  43.037180
13C/12C*2  3.013986  24.995986  38.990686  40.969868  44.040535
13C/12C*3  4.017341  25.999341  39.994041  41.973223  45.043890
Downsized input network with 23 features, highest peak at F7737 
Downsized input network with 27 features, highest peak at F7759 
Constructed 2097 khipus in this round.


Khipu search grid: 
                       M+H+, 2x charged  ...  ACN, 2x charged
M0                             0.503638  ...        21.520551
13C/12C, 2x charged            1.005316  ...        22.022228
13C/12C*2, 2x charged          1.506993  ...        22.523906
13C/12C*3, 2x charged          2.008671  ...        23.025583

[4 rows x 5 columns]
Constructed 117 khipus in this round.


Khipu search grid: 
                       M+H+, 3x ch
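
The same four steps can also be driven from Python instead of a `%%bash` cell, which stops the run at the first failing step. This is only a sketch: it assumes `pcpfm` is on `$PATH`, the arguments are copied from the bash cell above, and `build_commands` and `run_pipeline` are hypothetical helpers, not part of pcpfm.

```python
import subprocess

def build_commands(project="HZV029_subset", sequence="HZV029_subset_sequence.csv"):
    """Return the four pcpfm invocations from the bash cell as argument lists."""
    return [
        ["pcpfm", "assemble", "-s", sequence, "-j", project, "-o", "."],
        ["pcpfm", "asari", "-i", project],
        ["pcpfm", "build_empCpds", "-i", project, "-tm", "full", "-em", "full"],
        ["pcpfm", "l4_annotate", "-i", project, "-em", "full", "-nm", "full_w_l4"],
    ]

def run_pipeline(commands):
    """Run each command in order; check=True raises CalledProcessError on a non-zero exit."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_commands())` from the notebook directory would then reproduce the cell above.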



In [34]:
%%bash

pcpfm report -i HZV029_subset




[]
Failure Executing Method: pca
Found array with 0 sample(s) (shape=(0, 8463)) while a minimum of 1 is required by StandardScaler.
Unable to processes section: 
 {'section': 'figure', 'table': 'preferred', 'name': 'pca'}
Failure Executing Method: pearson
negative dimensions are not allowed
Unable to processes section: 
 {'section': 'figure', 'table': 'preferred', 'name': 'pearson_correlation'}
[]
Failure Executing Method: pca
Found array with 0 sample(s) (shape=(0, 12716)) while a minimum of 1 is required by StandardScaler.
Unable to processes section: 
 {'section': 'figure', 'table': 'full', 'name': 'pca'}
Failure Executing Method: pearson
negative dimensions are not allowed
Unable to processes section: 
 {'section': 'figure', 'table': 'full', 'name': 'pearson_correlation'}
No such table:  for_analysis
Unable to processes section: 
 {'section': 'figure', 'table': 'for_analysis', 'name': 'pca'}
No such table:  for_analysis
Unable to processes section: 
 {'section': 'figure', 'table': 

In [43]:
# now you can open the report at:

print(os.path.abspath("./HZV029_subset/output/report.pdf"))

/Users/mitchjo/pcpfm_tutorials/notebooks/HZV029_subset/output/report.pdf


In [52]:
# Now let's summarize the feature table:

import json
with open("./HZV029_subset/experiment.json") as fh:
    experiment = json.load(fh)
preferred_ft = pd.read_csv(experiment["feature_tables"]["preferred"], sep="\t")
print("Total Features: ", preferred_ft.shape[0])
# shape[1] counts every column in the table, not only the per-sample intensity columns
print("Total Columns: ", preferred_ft.shape[1])



Total Features:  8463
Total Columns:  21
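
Note that `shape[1]` counts every column in the feature table, and asari feature tables carry per-feature metadata columns (m/z, retention time, and so on) alongside the per-sample intensity columns, so 21 is larger than the 10 files processed. One way to count only samples, sketched below, is to intersect the table's columns with the `File Name` values from the sequence file. This assumes the sample columns are named after those values, which is not confirmed here; `sample_columns` is a hypothetical helper and the toy table only stands in for real data.

```python
import pandas as pd

def sample_columns(feature_table, sample_names):
    """Return the feature-table columns whose names appear in `sample_names`, in table order."""
    wanted = set(sample_names)
    return [c for c in feature_table.columns if c in wanted]

# Toy table standing in for a real feature table: two metadata columns and two samples.
toy_ft = pd.DataFrame({
    "mz": [100.0, 200.0],
    "rtime": [10.0, 20.0],
    "s1": [1.0, 2.0],
    "s2": [3.0, 4.0],
})
print("Sample columns:", sample_columns(toy_ft, ["s1", "s2", "s3"]))
```

With the real data, the call would look like `sample_columns(preferred_ft, pd.read_csv("HZV029_subset_sequence.csv")["File Name"])`.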


In [63]:
# Now let's move on to empCpds and annotations:

with open(experiment["empCpds"]["full_w_l4"]) as fh:
    empcpd = json.load(fh)
print("Num empcpds:", len(empcpd))
# count the empCpds that carry at least one Level 4 annotation
l4_annotated = 0
for x in empcpd.values():
    if "Level_4" in x and x["Level_4"]:
        l4_annotated += 1
print("Num empcpds w/ l4 annots: ", l4_annotated)



Num empcpds: 2254
Num empcpds w/ l4 annots:  1092
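
These two counts mean that roughly 48% of the empirical compounds carry a Level 4 annotation (1092 of 2254). The same tally can be wrapped in a small function, sketched below. `annotation_rate` is a hypothetical helper and assumes each empCpd is a dict in which a non-empty `Level_4` entry marks an annotation, as in the cell above; the toy dictionary is made up for illustration.

```python
def annotation_rate(empcpds, key="Level_4"):
    """Return (annotated, total, fraction) for empCpds with a non-empty `key` entry."""
    total = len(empcpds)
    annotated = sum(1 for e in empcpds.values() if e.get(key))
    fraction = annotated / total if total else 0.0
    return annotated, total, fraction

# Toy input: one annotated empCpd, one with an empty list, one without the key.
toy = {
    "a": {"Level_4": [{"name": "x"}]},
    "b": {"Level_4": []},
    "c": {},
}
annotated, total, fraction = annotation_rate(toy)
print(f"{annotated} of {total} annotated ({fraction:.1%})")
```

On the real data this would be called as `annotation_rate(empcpd)`.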
