# Data Processing with Asari

Multiple tools exist for processing mass spectra into feature tables. 
Asari is a recent open-source tool, described in:

Li, et al, 2023. Trackable and scalable LC-MS metabolomics data processing using asari. Nature Communications, 14(1), p.4113. (https://www.nature.com/articles/s41467-023-39889-1)

In this notebook we will use Asari from the CLI (via the notebook) to analyze a representative LC-MS dataset and visualize some of the data. 

This notebook (and Colab) is used for teaching. In a production environment, Asari can be run from the command line, from scripts, or as a step in a pipeline.

<a href="https://colab.research.google.com/drive/1fjoqvFizQL4orI_Jlb9lTfpCIv_Hd8DT?authuser=1" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# let's make sure all needed packages are installed

!pip3 install asari-metabolomics

import requests, zipfile, io, os

os.makedirs("./Datasets", exist_ok=True)

datasets = [
    "https://github.com/shuzhao-li-lab/data/raw/main/data/MT02Dataset.zip",
]

for dataset in datasets:
    r = requests.get(dataset)
    r.raise_for_status()  # stop early if the download failed
    if dataset.endswith(".zip"):
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall("./Datasets/")
    else:
        with open("./Datasets/" + os.path.basename(dataset), 'wb') as out_fh:
            out_fh.write(r.content)

In [None]:
# now let's run asari on our test dataset.
# the arguments to asari do the following:
#   -i specifies the input directory
#   -o specifies where to save the output (optional, not set in the command below)
#   -m indicates the ionization mode, either pos or neg
#   -j names the project; the results directory name includes it, so we can find it later
#   --pickle true and --database_mode ondisk are optional parameters that keep intermediate data structures for later visualization.

!asari process -i ./Datasets/MT02Dataset/ -m pos -j MT02_results --pickle true --database_mode ondisk

# After you execute this block, you should see the output from asari.

# Exploring Asari Outputs

After running the previous block, you have successfully pre-processed the LC-MS data into a feature table and other outputs. Let's look at these results more closely.
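
If the asari cell has been run more than once, several directories may match the project name. The following is a minimal sketch, assuming results directories start with the `Results_MT02_results` prefix used in the next cell. The helper names `latest_results_dir` and `list_outputs` are hypothetical and not part of asari. The first picks the most recently modified match; the second lists every file below it.

```python
from pathlib import Path

def latest_results_dir(parent, prefix):
    """Return the most recently modified directory under `parent`
    whose name starts with `prefix`, or None if nothing matches."""
    candidates = [p for p in Path(parent).iterdir()
                  if p.is_dir() and p.name.startswith(prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)

def list_outputs(results_dir):
    """Return the sorted relative paths of all files below `results_dir`."""
    root = Path(results_dir)
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_file())

# Assumed prefix: the project name passed to asari with -j.
results = latest_results_dir(".", "Results_MT02_results")
if results is None:
    print("no results directory found")
else:
    for name in list_outputs(results):
        print(name)
```

Returning `None` instead of raising lets the caller decide how to handle a missing run.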

In [None]:
# find the outputs from the asari run

import os
asari_subdir = None
for x in os.listdir("./"):
    if x.startswith("Results_MT02_results"): # here we are using the name specified earlier
        asari_subdir = os.path.join(os.path.abspath("./"), x)
if asari_subdir is None:
    # fail with a clear message instead of a TypeError in os.path.join below
    raise FileNotFoundError("No Results_MT02_results* directory found; run the asari cell above first.")
preferred_table = os.path.join(asari_subdir, "preferred_Feature_table.tsv") # this is the high-quality feature table
annotation_table = os.path.join(asari_subdir, "Feature_annotation.tsv") # these are annotations generated by asari based on the serum subset of HMDB v4


# Examining the Feature Table

A feature is an aligned peak across all the samples in the experiment. How the feature intensity is calculated is shown in the next notebook.

More features are not always better, because many features are low quality (poor quantification, poor peak shape, etc.). The recommended feature table to use in Asari is therefore the preferred table, which contains only the high-quality features.
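
To make "high quality" concrete, the sketch below filters a toy table on the three quality columns plotted later in this notebook (`snr`, `goodness_fitting`, `cSelectivity`). The threshold values and the helper name `filter_by_quality` are illustrative assumptions, not asari's actual selection rule, and the toy numbers are made up.

```python
import pandas as pd

def filter_by_quality(table, min_snr=2.0, min_goodness=0.5, min_cselectivity=0.5):
    """Keep rows that meet all three (hypothetical) quality thresholds."""
    keep = (
        (table["snr"] >= min_snr)
        & (table["goodness_fitting"] >= min_goodness)
        & (table["cSelectivity"] >= min_cselectivity)
    )
    return table[keep]

# Made-up example values; only F1 passes every threshold.
toy = pd.DataFrame({
    "id_number": ["F1", "F2", "F3", "F4"],
    "snr": [50.0, 1.5, 12.0, 30.0],
    "goodness_fitting": [0.95, 0.90, 0.30, 0.80],
    "cSelectivity": [1.0, 0.9, 0.8, 0.2],
})

filtered = filter_by_quality(toy)
print(filtered["id_number"].tolist())  # ['F1']
```

The same function can be applied to a real feature table to see how different thresholds change the number of features retained.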

In [None]:
# Let's count how many samples and how many features were returned.

import pandas as pd

# use pandas to read the feature table, note that it is .tsv, so we need to specify the delimiter character.
ft = pd.read_csv(preferred_table, sep="\t")

# the first 11 columns are properties of the features, the remaining columns are per-sample intensities
print("Num_Samples = ", ft.shape[1]-11)

# pandas already consumed the header row, so every row of the DataFrame is a feature.
print("Num_Features = ", ft.shape[0])

In [None]:
# this shows the top rows of the feature table
ft.head()

In [None]:
# the preferred feature table contains only high-quality features. we can see this by looking at the distributions of cSelectivity, goodness of fit, and SNR.

import matplotlib.pyplot as plt
import numpy as np
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
ax1.hist(ft['cSelectivity'])
ax1.set_title("cSelectivity")
ax2.hist(ft['goodness_fitting'])
ax2.set_title("goodness_fitting")
ax3.hist(np.log2(ft['snr']))
ax3.set_title("log2 snr")
fig.tight_layout()
plt.show()

# we will examine more plotting in Module 3.

In [None]:
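# let's look at the per-sample intensities of one feature from the table

sample_columns = ft.columns[11:]
f1_intensities = ft.loc[ft['id_number'] == "F1", sample_columns].iloc[0]  # one row -> 1-D Series
plt.bar(sample_columns, f1_intensities)
plt.xticks(rotation=90)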

In [None]:
# let's look at the default annotation returned by Asari

annots = pd.read_csv(annotation_table, sep="\t")
annots.head()

## Notebook Summary

Now you can process data using Asari and explore its outputs. With some modification, you can reuse this notebook on your own data.
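
One way to adapt the run to another dataset is to build the command from parameters. This is a minimal sketch that uses only the flags shown earlier in this notebook. The helper name `build_asari_command`, the input path `./Datasets/MyDataset/`, and the project name `my_results` are hypothetical placeholders.

```python
import shlex

def build_asari_command(input_dir, mode, project, keep_intermediates=True):
    """Assemble the `asari process` command used in this notebook as an argument list."""
    if mode not in ("pos", "neg"):
        raise ValueError("mode must be 'pos' or 'neg'")
    cmd = ["asari", "process", "-i", input_dir, "-m", mode, "-j", project]
    if keep_intermediates:
        # Optional flags from this notebook that keep intermediate data for visualization.
        cmd += ["--pickle", "true", "--database_mode", "ondisk"]
    return cmd

# Placeholder dataset path and project name; replace with your own.
cmd = build_asari_command("./Datasets/MyDataset/", "neg", "my_results")
print(shlex.join(cmd))

# To actually run it (requires asari to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell quoting problems when paths contain spaces.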

Next we will run Asari as part of a pipeline rather than by itself. 