A scientific paper may try to define a computation in words: "We downloaded all abstracts from Wikipedia, and counted the vowels, giving a result of YYY".

With Seamless, you can define a computation precisely: "all abstracts from Wikipedia = c953f648215413c5c7a3ae179a57d74e5ca495290a8e5a06a474baa158178d15, counted the vowels = computation XXXX". Both the input and the computation are identified by a checksum, so the computation can then be executed, shared, and re-run.

## 1. Definition

To define a computation, first download its input and calculate its checksum.

In a terminal, run:
```bash
$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz 
$ gunzip enwiki-latest-abstract.xml.gz
$ conda activate seamless
$ seamless-checksum enwiki-latest-abstract.xml

c953f648215413c5c7a3ae179a57d74e5ca495290a8e5a06a474baa158178d15
```

*The checksum above is from September 2024. To reproduce this in the future, download `https://dumps.wikimedia.org/enwiki/20240901/enwiki-20240901-abstract.xml.gz` instead.*
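The idea of identifying a file by its content can be sketched in plain Python. This is not the `seamless-checksum` implementation: the helper name `file_checksum` is hypothetical, and the choice of SHA3-256 is an assumption made for illustration (the output above only shows that the checksum is a 64-character hex string).

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha3_256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to keep memory use low.

    `algorithm` is an assumption for illustration; any hashlib name works.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Small demonstration on two temporary files instead of the full dump.
content = b"<feed><doc><abstract>hello</abstract></doc></feed>"
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.xml")
    second = os.path.join(tmp, "b.xml")
    for path in (first, second):
        with open(path, "wb") as handle:
            handle.write(content)
    checksum_a = file_checksum(first)
    checksum_b = file_checksum(second)

print(checksum_a)
# Same content gives the same checksum, regardless of the file name
print(checksum_a == checksum_b)
```

The checksum depends only on the bytes of the file, which is why the file name (`latest` versus a dated dump) plays no role in the definition of the computation.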

We can then upload it into a Seamless buffer directory:

```bash
$ buffer_dir=./buffers
$ seamless-upload --dest $buffer_dir enwiki-latest-abstract.xml
```
After that, the buffer directory contains a file named after the checksum: `./buffers/c953f648215413c5c7a3ae179a57d74e5ca495290a8e5a06a474baa158178d15`.
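A directory layout with one file per buffer, named after its checksum, can be sketched as follows. This is a simplified illustration, not a description of what `seamless-upload` does internally; `store_buffer` is a hypothetical helper and SHA3-256 is an assumed hash function.

```python
import hashlib
import os
import tempfile


def store_buffer(data, buffer_dir):
    """Write `data` to buffer_dir/<checksum> and return the checksum.

    Hypothetical helper; the hash function (SHA3-256) is an assumption.
    """
    checksum = hashlib.sha3_256(data).hexdigest()
    os.makedirs(buffer_dir, exist_ok=True)
    target = os.path.join(buffer_dir, checksum)
    if not os.path.exists(target):  # identical content is stored only once
        with open(target, "wb") as handle:
            handle.write(data)
    return checksum


with tempfile.TemporaryDirectory() as tmp:
    buffer_dir = os.path.join(tmp, "buffers")
    checksum = store_buffer(b"<feed></feed>", buffer_dir)
    store_buffer(b"<feed></feed>", buffer_dir)  # storing the same bytes again changes nothing
    stored_files = os.listdir(buffer_dir)

print(stored_files == [checksum])
```

With such a layout, looking up a buffer by checksum is a plain file read, and the original file name is no longer needed.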

The rest of the computation is defined inside the notebook:

In [2]:
import seamless
seamless.delegate(level=2)
seamless.config.add_buffer_folder("buffers")

In [3]:
%%time
from seamless import Checksum, transformer

abstracts = Checksum.load("enwiki-latest-abstract.xml")
#abstracts = Checksum("c953f648215413c5c7a3ae179a57d74e5ca495290a8e5a06a474baa158178d15")
#abstracts = Checksum("da0774f46efed72c7c20ba0133716bc0d7f7e3ae7c7531f0da7fc60deefbb07a")

@transformer(return_transformation=True)
def count_vowels(abstracts):    
    import re
    import xml.etree.ElementTree as ET
    from io import BytesIO
    
    abstracts_reader = BytesIO(abstracts)
    
    vowels = re.compile("[aeiou]")
    count = 0
    for event, elem in ET.iterparse(abstracts_reader, events=["end"]):
        try:
            if elem.tag != "abstract":
                continue
            text = elem.text
            if text is None:
                continue
            if text.startswith("|"):
                continue
            count += len(re.findall(vowels, text))
        finally:
            elem.clear()
    
    return count
   
count_vowels.celltypes.abstracts = "bytes"

transformation = count_vowels(abstracts)
print(transformation.as_checksum())
transformation.compute()
print(transformation.exception)
print(transformation.logs)
print(transformation.value)


OSError: could not get source code

In [3]:

transformation.compute()

<seamless.direct.Transformation.Transformation at 0x76fb50115270>

In [4]:
#print(transformation.exception)
print(transformation.logs)
print(transformation.value)


*************************************************
* Result
*************************************************
<checksum 5cc3eb990f522e2a161da11590dc44168a0a3104d73c7746ce2fe32c7c3b8f2c>
*************************************************
Execution time: 126.5 seconds
*************************************************
149134431
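Because the full run takes about two minutes, it can help to check the counting logic on a tiny input first. The sketch below repeats the body of `count_vowels` as a plain Python function (no Seamless involved) and applies it to a hand-written XML fragment. The sample document is made up for illustration; its element names follow the `doc`/`abstract` structure used in this notebook.

```python
import re
import xml.etree.ElementTree as ET
from io import BytesIO

# Made-up sample: one normal abstract, one starting with "|", one empty, one without vowels
SAMPLE = b"""<feed>
<doc><title>One</title><abstract>banana</abstract></doc>
<doc><title>Two</title><abstract>| infobox = skipped</abstract></doc>
<doc><title>Three</title><abstract></abstract></doc>
<doc><title>Four</title><abstract>sky</abstract></doc>
</feed>"""


def count_vowels_plain(abstracts):
    """Count lowercase vowels in all <abstract> elements of an XML byte string."""
    vowels = re.compile("[aeiou]")
    count = 0
    for _event, elem in ET.iterparse(BytesIO(abstracts), events=("end",)):
        try:
            if elem.tag != "abstract":
                continue
            text = elem.text
            if text is None or text.startswith("|"):
                continue
            count += len(vowels.findall(text))
        finally:
            elem.clear()  # free the element once it has been processed
    return count


sample_count = count_vowels_plain(SAMPLE)
print(sample_count)  # 3: only "banana" contributes vowels
```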


This computation is called a "transformation" in Seamless. Its checksum is...

A transformation can be run on its own (run-transformation) or handed off to an engineer or a compute cluster. The code doesn't have to be Python; it can be C/C++, ...

In [26]:
transformation.as_checksum().resolve()

b'{\n  "__language__": "python",\n  "__output__": [\n    "result",\n    "mixed",\n    null\n  ],\n  "abstracts": [\n    "bytes",\n    null,\n    "c953f648215413c5c7a3ae179a57d74e5ca495290a8e5a06a474baa158178d15"\n  ],\n  "code": [\n    "python",\n    "transformer",\n    "f1bc5c1c26b2b077b905c5edcf71f5b1335ec058a43577f63fe613605e5bd72c"\n  ]\n}\n'
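The buffer printed above is plain JSON, so it can be inspected with the standard library. The sketch below parses it and lists, for each entry, the cell type and the checksum it refers to. Reading the three-element lists as (cell type, subcell type, checksum) is inferred from the printed values and is an assumption, not documented behaviour.

```python
import json

# The bytes printed above, split over several lines for readability
raw = (
    b'{\n'
    b'  "__language__": "python",\n'
    b'  "__output__": [\n'
    b'    "result",\n'
    b'    "mixed",\n'
    b'    null\n'
    b'  ],\n'
    b'  "abstracts": [\n'
    b'    "bytes",\n'
    b'    null,\n'
    b'    "c953f648215413c5c7a3ae179a57d74e5ca495290a8e5a06a474baa158178d15"\n'
    b'  ],\n'
    b'  "code": [\n'
    b'    "python",\n'
    b'    "transformer",\n'
    b'    "f1bc5c1c26b2b077b905c5edcf71f5b1335ec058a43577f63fe613605e5bd72c"\n'
    b'  ]\n'
    b'}\n'
)

transformation_dict = json.loads(raw)

# Keys wrapped in double underscores look like metadata; the rest refer to buffers.
entries = {
    name: value
    for name, value in transformation_dict.items()
    if not name.startswith("__")
}
for name, (celltype, _subcelltype, checksum) in entries.items():
    print(f"{name}: celltype={celltype}, checksum={checksum}")

# Re-serialising with sorted keys and a two-space indent gives back the same bytes
reserialised = (json.dumps(transformation_dict, sort_keys=True, indent=2) + "\n").encode()
print(reserialised == raw)
```

Since the dictionary contains only checksums, it stays small no matter how large the input data is.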

In bash:
```bash
$ conda activate seamless
$ export SEAMLESS_HASHSERVER_DIRECTORY=buffers
$ seamless-delegate none
```

In [27]:
import seamless
seamless.delegate(level=2)
seamless.config.add_buffer_folder("buffers")

In [28]:
transformation = count_vowels(abstracts)
transformation.as_checksum()

'004c832d9a6c7e65302eb87b0d9b4f73fc0cb75fd22795fcb3ed43a7bf0c54b2'

In [29]:
transformation.compute()

<seamless.direct.Transformation.Transformation at 0x7abd60c37520>

In [30]:
print(transformation.exception)

seamless.workflow.core.transformation.SeamlessTransformationError: Traceback (most recent call last):
  File "transformer", line 19, in <module>
    result = count_vowels(abstracts=abstracts)
  File "transformer", line 6, in count_vowels
    for event, elem in xml.etree.ElementTree.iterparse(abstracts):
NameError: name 'xml' is not defined
*************************************************
Execution time: 0.0 seconds
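The traceback above shows a version of `count_vowels` that refers to `xml.etree.ElementTree` without the name `xml` being defined where the function runs. This is presumably why the function defined earlier does its imports inside the function body. The sketch below mimics the situation in plain Python by executing function source in an empty namespace; it is an analogy, not Seamless's actual execution mechanism, and `run_isolated` is a hypothetical helper.

```python
SOURCE_WITHOUT_IMPORT = '''
def parse(data):
    return xml.etree.ElementTree.fromstring(data).tag
'''

SOURCE_WITH_IMPORT = '''
def parse(data):
    import xml.etree.ElementTree
    return xml.etree.ElementTree.fromstring(data).tag
'''


def run_isolated(source, data):
    """Define `parse` from source in a fresh namespace and call it."""
    namespace = {}
    exec(source, namespace)
    return namespace["parse"](data)


try:
    run_isolated(SOURCE_WITHOUT_IMPORT, b"<doc/>")
    error = None
except NameError as exc:
    error = str(exc)

result = run_isolated(SOURCE_WITH_IMPORT, b"<doc/>")
print(error)   # name 'xml' is not defined
print(result)  # doc
```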



In [None]:
buf = abstracts.resolve()

In [None]:
print(buf[:1000])

In [None]:
transformation = count_vowels(abstracts)

In [None]:
from seamless import Checksum, Buffer
cs_wikipedia_abstract_part11 = Checksum("664e3ed93d65bc048f0aaef954a1d5145c67faa763a271aca37258fc144f9f20")

In [None]:
from seamless.util.fair import add_direct_urls
add_direct_urls({
    cs_wikipedia_abstract_part11: [
        {
            "url": "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract11.xml.gz",
            "compression": "gz"
        },
        {
            "url": "https://dumps.wikimedia.org/enwiki/20240901/enwiki-20240901-abstract11.xml.gz",
            "compression": "gz"
        },
    ]                              
})
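The mapping above associates a checksum with URLs where a gzip-compressed copy of the buffer can be found. Conceptually, resolving a checksum this way means downloading, decompressing, and verifying the result against the checksum. The sketch below shows only the last two steps on in-memory data, so it runs without network access. `verify_gzipped_download` is a hypothetical helper and SHA3-256 is an assumed hash function; this is not a description of what `add_direct_urls` or `resolve()` do internally.

```python
import gzip
import hashlib


def verify_gzipped_download(compressed, expected_checksum):
    """Decompress a gzip payload and check it against the expected hex checksum."""
    data = gzip.decompress(compressed)
    actual = hashlib.sha3_256(data).hexdigest()
    if actual != expected_checksum:
        raise ValueError(
            f"checksum mismatch: expected {expected_checksum}, got {actual}"
        )
    return data


# Stand-in for a downloaded file
original = b"<feed><doc><abstract>example</abstract></doc></feed>"
payload = gzip.compress(original)
expected = hashlib.sha3_256(original).hexdigest()

recovered = verify_gzipped_download(payload, expected)
print(recovered == original)

# A payload with different content is rejected
try:
    verify_gzipped_download(gzip.compress(b"something else"), expected)
except ValueError:
    mismatch_detected = True
else:
    mismatch_detected = False
print(mismatch_detected)
```

A check of this kind is what makes it safe to list several URLs for one checksum: a URL whose content has changed (such as one under `latest/` after a new dump) simply fails verification.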

In [None]:
%time buf = cs_wikipedia_abstract_part11.resolve()

In [None]:
cs_wikipedia_abstract = Checksum("c953f648215413c5c7a3ae179a57d74e5ca495290a8e5a06a474baa158178d15")

In [None]:
add_direct_urls({
    cs_wikipedia_abstract: [
        {
            "url": "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz",
            "compression": "gz"
        },
        {
            "url": "https://dumps.wikimedia.org/enwiki/20240901/enwiki-20240901-abstract.xml.gz",
            "compression": "gz"
        },
    ]                              
})

In [None]:
%time buf = cs_wikipedia_abstract.resolve()

In [None]:
xml_text = buf.decode()  # avoid shadowing the standard-library package name `xml`

In [None]:
import xml.etree.ElementTree as ET

In [None]:
root = ET.fromstring(xml_text)

In [None]:
cs_wikipedia_abstract

In [None]:
abstracts = [tag.text for tag in root.findall(".//doc/abstract") if tag.text is not None]

In [None]:
abstracts = [abstract for abstract in abstracts if not abstract.startswith("|")]

In [None]:
len(abstracts)

In [None]:
abstracts[:100]

In [None]:
import re
vowels = re.compile("[aeiou]")
# Quick check of the pattern: "bltkli" contains a single vowel ("i")
print(len(vowels.findall("bltkli")))
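Note that the pattern `[aeiou]` matches lowercase vowels only, so capitalised vowels (for example at the start of a sentence) are not counted. Whether that is intended is not stated here; the sketch below simply shows the difference, using `re.IGNORECASE` as one possible alternative.

```python
import re

lowercase_vowels = re.compile("[aeiou]")
any_case_vowels = re.compile("[aeiou]", re.IGNORECASE)

text = "Apple Inc"
lowercase_count = len(lowercase_vowels.findall(text))  # counts only "e"
any_case_count = len(any_case_vowels.findall(text))    # counts "A", "e", "I"
print(lowercase_count, any_case_count)
```

Either choice is fine as long as it is part of the transformation code, since the code checksum then records which definition of "vowel" was used.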