# Get some example data

As a worked example of running this software, we use a small dataset, the [PlaSMA](http://plasma.riken.jp/) dataset, which can be freely downloaded from the web.

The dataset is provided in NIST MS format (`.msp`), so we first need to write a function that converts it to the *tab-separated-values* (`.tsv`) format used to compute the molecular descriptors. During the conversion, with the default arguments, only records in positive ionization mode that have a retention time, an InChIKey, a name and a SMILES are kept.
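For orientation, an MSP file is a sequence of records made of `KEY: value` lines, with records separated by blank lines. The sketch below illustrates this layout on a made-up excerpt: the record contents (in particular the retention times) are illustrative values and are not taken from PlaSMA, and `parse_msp` is a hypothetical helper used only here.

```python
# Made-up MSP excerpt: two records separated by a blank line.
# The values are illustrative and are not taken from the PlaSMA dataset.
MSP_TEXT = """\
NAME: Caffeine
RETENTIONTIME: 3.21
IONMODE: Positive
INCHIKEY: RYYVLZVUVIJVGH-UHFFFAOYSA-N
SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C

NAME: Unknown compound
RETENTIONTIME: 5.00
IONMODE: Negative
"""

def parse_msp(lines):
  """Collect `KEY: value` lines into one dict per blank-line-separated record."""
  records, record = [], {}
  for line in lines:
    line = line.strip()
    if not line:
      # A blank line closes the current record (if any)
      if record: records.append(record)
      record = {}
      continue
    key, sep, value = line.partition(': ')
    if sep: record[key] = value
  # Keep a trailing record that is not followed by a blank line
  if record: records.append(record)
  return records

records = parse_msp(MSP_TEXT.splitlines())
# Same kind of filter described above: positive mode, with InChIKey and SMILES
kept = [r for r in records if r.get('IONMODE') == 'Positive' and 'INCHIKEY' in r and 'SMILES' in r]
print(len(records), len(kept))
```

Only the first record passes the filter, since the second one is in negative mode and lacks both the InChIKey and the SMILES.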

We can preserve as much information as we want from the original file, provided that:

* the file is in *tab-separated-values*,
* the first field is the *retention time*,
* the last field is the *SMILES*.

Since the molecular descriptors are computed in parallel, the output file is not guaranteed to list the records in the same order as the input file. For this reason, it is a good idea to preserve at least the InChIKey, so that the original data can be matched with the rows carrying the computed molecular descriptors.
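A minimal sketch of such a match is shown below. It assumes that the InChIKey sits in the second column of both files and that the descriptor values are appended after the original fields; both are assumptions to check against the actual output, and the rows and the helper `index_by_key` are hypothetical.

```python
def index_by_key(rows, key_index):
  """Map the value found at position key_index to its whole row."""
  return {row[key_index]: row for row in rows}

# Hypothetical input rows: retention time, InChIKey, name, SMILES
original = [
  ['3.21', 'KEY-A', 'compound A', 'CCO'],
  ['5.00', 'KEY-B', 'compound B', 'CCN'],
]
# Hypothetical output rows, in a different order, with one made-up descriptor appended
computed = [
  ['5.00', 'KEY-B', 'compound B', 'CCN', '0.12'],
  ['3.21', 'KEY-A', 'compound A', 'CCO', '0.34'],
]

by_key = index_by_key(computed, key_index = 1)
# Pair every original row with the computed row sharing its InChIKey
matched = [(row, by_key[row[1]]) for row in original if row[1] in by_key]
for row, with_descriptors in matched:
  print(row[1], with_descriptors[4:])
```

If the same InChIKey occurs in more than one record, a plain dictionary keeps only the last row, so a second field (for instance the name) may be needed to tell such records apart.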

In [1]:
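# A function converting an MSP file to a TSV file, keeping only positive ion mode records that have all the requested fields

import csv

def msp2tsv(src, dst, extra_keys = ('INCHIKEY', 'NAME')):
  # Output columns: retention time first, SMILES last, the extra keys in between
  keys = ('RETENTIONTIME', ) + tuple(extra_keys) + ('SMILES', )
  keys_set = frozenset(keys)
  csv_writer = csv.writer(dst, delimiter = '\t', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
  record = dict()
  for line in src:
    line = line.strip()
    if not line:
      # A blank line ends the current record: write it only if it has every key and is in positive mode
      # (a final record that is not followed by a blank line is never written)
      if keys_set <= set(record.keys()) and record.get('IONMODE') == 'Positive':
        csv_writer.writerow(record[k] for k in keys)
      record = dict()
    else:
      # Non-blank lines have the form "KEY: value"; lines without the ': ' separator (such as peak lists) are ignored
      key, *value = line.split(': ', 1)
      if value: record[key] = value[0]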

We can now use this function to write a TSV file while downloading the original MSP dataset directly from the PlaSMA website (without saving it to the local disk).

In [2]:
from urllib.request import urlopen

# The URL of the dataset; it can be found at http://plasma.riken.jp/menta.cgi/plasma/plant_chemical_diversity_download

PLASMA__DATASET_URL = 'http://plasma.riken.jp/menta.cgi/plasma/get_msp_all'

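# newline = '' lets the csv module control the line endings, as its documentation recommends
with urlopen(PLASMA__DATASET_URL) as msp_src, open('plasma.tsv', mode = 'w', newline = '') as tsv_dst: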
  src = msp_src.read().decode('utf-8').splitlines()
  msp2tsv(src, tsv_dst)

Let's count the number of lines to check that it worked.

In [3]:
with open('plasma.tsv', 'r') as inf: print(len(inf.readlines()))

799
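Beyond counting lines, a quick structural check can confirm that every row satisfies the requirements listed earlier. The sketch below is run on a made-up in-memory sample; it assumes that the retention time is written as a decimal number, and `check_rows` is a hypothetical helper, not part of `jp2rt`.

```python
import csv
import io

def check_rows(rows, n_fields):
  """Return the 1-based numbers of the rows with a wrong field count or a non-numeric first field."""
  bad = []
  for n, row in enumerate(rows, start = 1):
    try:
      float(row[0])  # assumption: the retention time is a decimal number
      ok = len(row) == n_fields
    except (IndexError, ValueError):
      ok = False
    if not ok: bad.append(n)
  return bad

# Made-up sample: the second row has a non-numeric retention time
sample = io.StringIO('3.21\tKEY-A\tcompound A\tCCO\nn/a\tKEY-B\tcompound B\tCCN\n')
bad = check_rows(csv.reader(sample, delimiter = '\t'), n_fields = 4)
print(bad)
```

To apply the same check to the real file, the `io.StringIO` sample can be replaced with `open('plasma.tsv', newline = '')`; with the default `extra_keys` each row is expected to have four fields.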


# Compute the molecular descriptors

In [4]:
from jp2rt import add_descriptors_via_tsv

add_descriptors_via_tsv('plasma.tsv', 'plasma+descriptors.tsv')

Computing   5% │█▉                               │  46/799 (0:00:05 / 0:01:21)

107409 not found


Computing   8% │██▊                              │  67/799 (0:00:06 / 0:01:05)

7205 not found


Computing  45% │██████████████▉                  │ 363/799 (0:00:15 / 0:00:18)

208203 not found


Computing 100% │█████████████████████████████████│ 799/799 (0:00:51 / 0:00:00)
