# General Workflow

Using the package is a three-step process:

1. Preprocess the data to match the format expected by the library.
2. Split the replicates into groups and compute a summary statistic for each group.
3. Calculate the similarity of the summary statistics.

One complication is that the package uses two novel algorithms for measuring similarity.
Each algorithm operates on a different type of summary statistic.
The two types are 1D summary statistics (also called bin summary statistics) and 2D summary statistics (also called peak statistics).
When the context is clear, we refer to both types simply as summaries.
A brief description of each algorithm and summary type follows in a subsection.

(1) The data format expected by the C library is not very demanding.
Examples are provided in the "Data format examples" section below.
Each replicate measurement must be in its own file.
Each mass/charge ratio and intensity pair must be on its own line, with the two values separated by some combination of spaces, tabs, and commas.
Values may be written in ordinary floating-point format or in scientific notation.
An optional header is permitted, with a leading `#` on each header line (like a comment).
More examples are in the [examples/](https://github.com/jasoneveleth/hdcms-python/tree/main/examples) directory of the Python repository.
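
As a rough illustration of the rules above, the following sketch parses the contents of one replicate file in pure Python. It is not the C library's parser: the function name `parse_ms_text`, the choice to raise `ValueError`, and the handling of blank lines are all assumptions made for this example.

```python
import re

# Hypothetical helper for illustration only; not part of hdcms.
# Fields may be separated by any mix of spaces, tabs, and commas.
_SEPARATORS = re.compile(r"[ \t,]+")


def parse_ms_text(text):
    """Return a list of (mass/charge, intensity) pairs from file contents."""
    pairs = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Simplification: skip blank lines and '#' lines wherever they occur.
        if not stripped or stripped.startswith("#"):
            continue
        fields = [f for f in _SEPARATORS.split(stripped) if f]
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 values, got {len(fields)}"
            )
        # float() accepts both plain and scientific notation.
        pairs.append((float(fields[0]), float(fields[1])))
    return pairs


sample = "# optional header\n40\t353.94\n41.1, 1.14e4\n42 30021.46\n"
pairs = parse_ms_text(sample)
print(pairs)
```

A line with three values, such as `1, 40.1, 353.9`, makes this sketch raise `ValueError`, which mirrors the kind of rejection the library performs on malformed input.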

(2) The package provides several data loading routines, which gather replicate measurements and compute a summary statistic (of either type).
Each of them takes a collection of filenames (as a list, a regular expression, a comma-separated string, etc.) and returns a summary.
We also provide a visualization routine `visualize` for summaries.

(3) The package also provides a similarity function `compare`.
The function will automatically detect whether the summary statistics are 1D or 2D, and compute the correct comparison.
It returns a float if two summaries are supplied, or a (symmetric) matrix if more than two are supplied.
Similarities range from 0 (not similar at all: the Gaussians in the summaries share no area) to 1 (every Gaussian in the summaries is identical).
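
To make the return shape concrete, here is a minimal sketch of a `compare`-like dispatcher. It is not the package's algorithm. As a stand-in similarity it uses the Bhattacharyya coefficient between two 1D Gaussians, which also runs from 0 (no overlap) to 1 (identical), and the names `gaussian_similarity` and `compare_sketch` are hypothetical.

```python
import math


def gaussian_similarity(a, b):
    """Bhattacharyya coefficient of two 1D Gaussians given as (mean, std)."""
    (mu1, s1), (mu2, s2) = a, b
    var_sum = s1 * s1 + s2 * s2
    scale = math.sqrt(2.0 * s1 * s2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


def compare_sketch(*summaries):
    """Return a float for two inputs, or a symmetric matrix for more."""
    if len(summaries) < 2:
        raise ValueError("need at least two summaries")
    if len(summaries) == 2:
        return gaussian_similarity(*summaries)
    n = len(summaries)
    # The diagonal stays at 1.0: every summary is identical to itself.
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = gaussian_similarity(summaries[i], summaries[j])
            matrix[i][j] = matrix[j][i] = s
    return matrix


# Stand-in "summaries": a single (mean, std) pair each.
a, b, c = (100.0, 1.0), (100.0, 1.0), (500.0, 1.0)
print(compare_sketch(a, b))      # two inputs: a single float
print(compare_sketch(a, b, c))   # three inputs: a 3x3 symmetric matrix
```

Only the upper triangle is computed and then mirrored, which is what makes the matrix symmetric by construction.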


## Summary of helper functions


`regex2stats1d`, `regex2stats2d` - take a regex matching filenames and convert the matching files to a summary statistic

`array2stats1d`, `array2stats2d` - take a varargs list of numpy arrays and convert them to a summary statistic

`file2stats1d`, `file2stats2d` - take a filename and convert it to a summary statistic; the file is expected to contain a list of filenames, one per line

`filenames2stats1d`, `filenames2stats2d` - take a list of filenames and convert it to a summary statistic

`compare` - compares two or more summary statistics

`write_image` - visualizes a summary statistic (returns an image)

`generate_examples` - generates synthetic data as an example

`is_valid_ms_data_format` - checks whether a file is in a valid data format, and raises an exception if not
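
The input to `file2stats1d` and `file2stats2d` is the least obvious of these, so here is a small sketch of building such a file of filenames with the standard library. The temporary directory, the `replicate_*.txt` naming pattern, and the name `manifest.txt` are made up for illustration, and the hdcms calls appear only in comments because their exact signatures are assumed from the descriptions above rather than verified.

```python
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Create three tiny replicate files (placeholder data, two values per line).
for i in range(3):
    (workdir / f"replicate_{i}.txt").write_text("40 353.9\n41 11444.2\n")

# Collect the replicate paths and write them to a file, one per line.
replicates = sorted(workdir.glob("replicate_*.txt"))
manifest = workdir / "manifest.txt"
manifest.write_text("\n".join(str(p) for p in replicates) + "\n")

# Reading the file back gives the same list of filenames.
listed = manifest.read_text().splitlines()
print(listed)

# Assumed usage (not verified here):
#   summary = hdcms.file2stats1d(str(manifest))
#   summary = hdcms.filenames2stats1d(listed)
```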

# Data format examples

```python
# Valid: '#' header lines, then whitespace-separated pairs.
data = """#"+EI Scan (rt: 2.610-2.680 min, 13 scans) CM1-1-1.D  Subtract "
#Point,X(Thompsons),Y(Counts)
#jkdsljfa
#aljfdkljfads
#jkladsjfklads
40    353.939331054688
41    11444.1796875
42    30021.458984375
43    353.939331054688
44    11444.1796875
45    30021.458984375
46    353.939331054688
47    11444.1796875
48    30021.458984375
49    353.939331054688
50    11444.1796875
51    30021.458984375
"""

# Valid: '#' header lines, then comma-separated pairs.
data2 = """#"+EI Scan (rt: 2.610-2.680 min, 13 scans) CM1-1-1.D  Subtract "
#Point,X(Thompsons),Y(Counts)
40.1,353.939331054688
41.1,11444.1796875
42.1,30021.458984375
43.1,353.939331054688
44.1,11444.1796875
45.1,30021.458984375
46.1,353.939331054688
47.1,11444.1796875
48.1,30021.458984375
49.1,353.939331054688
50.1,11444.1796875
51.1,30021.458984375
"""

# Valid: no header, pairs separated by a comma followed by spaces.
data3 = """40.1,    353.939331054688
41.1,    11444.1796875
42.1,    30021.458984375
43.1,    353.939331054688
44.1,    11444.1796875
45.1,    30021.458984375
46.1,    353.939331054688
47.1,    11444.1796875
48.1,    30021.458984375
49.1,    353.939331054688
50.1,    11444.1796875
51.1,    30021.458984375
"""

# Invalid: three values per line instead of two.
invalid = """1, 40.1, 353.939331054688
2, 41.1, 11444.1796875
3, 42.1, 30021.458984375
4, 43.1, 353.939331054688
"""
```

```python
with open("/tmp/data.txt", "w") as fo:
    fo.write(data)

with open("/tmp/data2.txt", "w") as fo:
    fo.write(data2)

with open("/tmp/data3.txt", "w") as fo:
    fo.write(data3)

with open("/tmp/invalid.txt", "w") as fo:
    fo.write(invalid)
```

```python
# from hdcms import ms_valid_data_format as is_valid_ms_data_format
is_valid_ms_data_format("/tmp/data.txt")
is_valid_ms_data_format("/tmp/data2.txt")
is_valid_ms_data_format("/tmp/data3.txt")
is_valid_ms_data_format("/tmp/invalid.txt")
```

The call on `/tmp/invalid.txt` raises an exception, because each line of that file carries three values instead of two:

```
RuntimeError: got too many values per line
```