# Basic Usage of the PCPFM Pipeline

A typical metabolomics pipeline performs data processing, quality control (QC), annotation, and data standardization.

This notebook illustrates basic use of one such pipeline, PCPFM (Python-Centric Pipeline for Metabolomics). Reference: Mitchell et al., 2024. Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline. PLOS Computational Biology, 20(6), e1011912. https://doi.org/10.1371/journal.pcbi.1011912

<a href="https://colab.research.google.com/github/shuzhao-li-lab/MANA2024/blob/main/Module%201%20-%20Processing%20and%20Visualizing%20MS%20Data/1.3.Basic%20PCPFM%20Usage.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# let's start by installing the pipeline and downloading the example dataset

!pip install pcpfm isocor

import requests, zipfile, io, os

os.makedirs("./Datasets", exist_ok=True)

datasets = [
    "https://github.com/shuzhao-li-lab/data/raw/main/data/MT02Dataset.zip",
]

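for dataset in datasets:
    r = requests.get(dataset)
    r.raise_for_status()  # stop early if the download failed
    if dataset.endswith(".zip"):
        # unpack zip archives directly from memory into ./Datasets/
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall("./Datasets/")
    else:
        # save any other file under its original name
        with open(os.path.join("./Datasets", os.path.basename(dataset)), "wb") as out_fh:
            out_fh.write(r.content)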
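
The zip-versus-plain-file branch in the download cell can also be written as a small helper that is easy to check without network access. This is a minimal sketch and not part of PCPFM: `store_payload` is a hypothetical name, and it receives the already-downloaded bytes so that the file handling is separate from the HTTP request.

```python
import io
import os
import zipfile


def store_payload(url, content, dest_dir):
    """Write downloaded bytes to dest_dir, unpacking them if the URL names a zip.

    Hypothetical helper for illustration; returns the list of paths created.
    """
    os.makedirs(dest_dir, exist_ok=True)
    if url.endswith(".zip"):
        # Unpack the archive straight from memory.
        with zipfile.ZipFile(io.BytesIO(content)) as archive:
            archive.extractall(dest_dir)
            names = archive.namelist()
        return [os.path.join(dest_dir, name) for name in names]
    # Any other file is saved under the last component of its URL.
    out_path = os.path.join(dest_dir, os.path.basename(url))
    with open(out_path, "wb") as out_fh:
        out_fh.write(content)
    return [out_path]
```

With `requests`, a call might look like `store_payload(url, requests.get(url).content, "./Datasets")`.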

## Metadata CSV Generation

The current pipeline implementation works best with a metadata CSV that lists, for every sample you wish to analyze, the file name, its location, and the "type" of sample. Let's create it.

In [None]:
# now we need to generate a very basic CSV file to tell the pipeline where the mzML files are located.
# future versions of the pipeline will not require this step.

import pandas as pd
import os

metadata_dicts = []
for x in os.listdir("./Datasets/MT02Dataset/"):
  if x.endswith(".mzML"):
    metadata_dicts.append({
        "File Name": os.path.splitext(x)[0], # this tells pcpfm what to call the sample: the file name without its extension
        "Sample Type": "Unknown", # this field specifies what sort of acquisition this is, needed for advanced usage.
        "Filepath": os.path.join(os.path.abspath("./Datasets/MT02Dataset/"), x) # this is to tell pcpfm where the mzML is located
    })
    print(x)
metadata_df = pd.DataFrame(metadata_dicts)
metadata_df.to_csv("metadata.csv")
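
Before handing the metadata to the pipeline, it can help to confirm that every row is complete and points at a real file. The sketch below is not a PCPFM function: `check_metadata` is a hypothetical helper, and the three required keys are simply the columns used in this notebook, not a statement about everything pcpfm accepts. It works on a list of dictionaries such as `metadata_dicts`.

```python
import os

# Columns used in this notebook; assumed here to be the minimum needed.
REQUIRED_KEYS = ("File Name", "Sample Type", "Filepath")


def check_metadata(rows):
    """Return a list of problems found in the metadata rows (empty if none)."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        missing = [key for key in REQUIRED_KEYS if key not in row]
        if missing:
            problems.append(f"row {i}: missing keys {missing}")
            continue
        # Sample names should be unique so that samples stay distinguishable.
        if row["File Name"] in seen:
            problems.append(f"row {i}: duplicate sample name {row['File Name']!r}")
        seen.add(row["File Name"])
        if not os.path.isfile(row["Filepath"]):
            problems.append(f"row {i}: file not found {row['Filepath']!r}")
    return problems
```

Calling `check_metadata(metadata_dicts)` should return an empty list when every mzML file exists and no sample name repeats.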

## Actual PCPFM Processing

PCPFM is a Python package, but it can also be run like any other program from the command line.

In [None]:
# for example:

!pcpfm

# you should see an error message and some instructions.

In [None]:
# Now let's start working with the pipeline. The assemble command creates the experiment object
# that stores all intermediates.

!pcpfm assemble -o . -j pcpfm_tutorial_basic -s ./metadata.csv


In [None]:
# here is a list of the output directories in the pipeline directory
os.listdir("./pcpfm_tutorial_basic")

In [None]:
# now we can run asari within the pipeline; this will automatically infer the ionization mode of the samples.
# all acquisitions must be collected in the same ionization mode!

!pcpfm asari -i pcpfm_tutorial_basic

In [None]:
# now let's examine the feature table as we did previously
# here we can load the JSON file within the experiment directory to get the feature table path
import json

# use a context manager so the file handle is closed after reading
with open("./pcpfm_tutorial_basic/experiment.json") as exp_fh:
    exp = json.load(exp_fh)
exp["feature_tables"]

# note that we have two feature tables: 'preferred' and 'full'
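
The lookup above can be wrapped so that a misspelled table name produces a readable error. This is a minimal sketch under the assumption, suggested by the cell above, that `experiment.json` holds a `feature_tables` mapping from table names to file paths; `feature_table_path` is a hypothetical helper, not part of the pcpfm API.

```python
import json
import os


def feature_table_path(experiment_dir, moniker="preferred"):
    """Look up a feature table path recorded in <experiment_dir>/experiment.json."""
    with open(os.path.join(experiment_dir, "experiment.json")) as fh:
        experiment = json.load(fh)
    tables = experiment.get("feature_tables", {})
    if moniker not in tables:
        # List the available names to make typos easy to spot.
        raise KeyError(f"no feature table named {moniker!r}; available: {sorted(tables)}")
    return tables[moniker]
```

For this tutorial, `feature_table_path("./pcpfm_tutorial_basic")` would return the same path as `exp["feature_tables"]["preferred"]`.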

In [None]:
# we should have the same feature table as in the asari standalone example

ft = pd.read_csv(exp["feature_tables"]["preferred"], sep="\t")
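# the first 11 columns describe each feature (m/z, retention time, etc.); the remaining columns are samples
print("Num Samples = ", ft.shape[1]-11)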
print("Num Features = ", ft.shape[0])
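
The sample count above relies on a fixed number of leading descriptor columns. As a cross-check, the sample columns can be identified by name instead. The sketch below assumes that the sample columns in the feature table carry the same names as the `File Name` entries in the metadata, which should be verified against your own table; `sample_columns` is a hypothetical helper.

```python
def sample_columns(columns, sample_names):
    """Return the columns whose names match a known sample name, in table order."""
    known = set(sample_names)
    return [column for column in columns if column in known]
```

If the naming assumption holds, `len(sample_columns(ft.columns, metadata_df["File Name"]))` should agree with `ft.shape[1] - 11`.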

In [None]:
ft.head()

## Notebook Summary

Now you can process data using the pipeline. While you can run Asari in either the pipeline or as a standalone tool, running it in the pipeline makes downstream processing more convenient.