# ONEcode Python Interface Tutorial

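This notebook walks through the main features of the ONEcode Python bindings: defining schemas, reading and writing files, accessing DNA and list data, random access, and schema validation.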

## Installation and Setup

First, make sure the ONEcode Python bindings are compiled:

```bash
make
make python
```

## Import the ONEcode Module

In [1]:
import ONEcode
print("ONEcode module imported successfully!")

ONEcode module imported successfully!


## 1. Defining Schemas

In [4]:
# Define a schema for sequence files
schema_text = (
    "P 3 seq                 SEQUENCE\n"
    "S 6 segseq              segment sequences - objects are 1:1 with those in seg file\n"
    "S 7 readseq             read sequences\n"
    "O S 1 3 DNA             sequence: the DNA string\n"
    "D I 1 6 STRING          id - sequence identifier; unnecessary for segments\n"
)

schema = ONEcode.ONEschema(schema_text)
print("Schema created successfully!")
print("\nSchema definition:")
print(schema_text)

Schema created successfully!

Schema definition:
P 3 seq                 SEQUENCE
S 6 segseq              segment sequences - objects are 1:1 with those in seg file
S 7 readseq             read sequences
O S 1 3 DNA             sequence: the DNA string
D I 1 6 STRING          id - sequence identifier; unnecessary for segments
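
Judging only from the schema text above, each name appears to be preceded by its character count (`3 seq`, `6 STRING`), and the number after the line-type letter appears to be the number of fields. The sketch below builds definition lines under that assumption so the counts need not be typed by hand. `one_string` and `line_def` are hypothetical helpers, not part of the ONEcode module, and the layout is inferred from the examples rather than taken from the format specification.

```python
def one_string(text):
    """Return text prefixed with its length, e.g. 'DNA' -> '3 DNA'."""
    return f"{len(text)} {text}"


def line_def(kind, symbol, field_types, comment=""):
    """Build one definition line (assumed layout: kind, symbol, field count, fields)."""
    parts = [kind, symbol, str(len(field_types))]
    parts += [one_string(t) for t in field_types]
    line = " ".join(parts)
    return f"{line}  {comment}" if comment else line


schema_lines = [
    f"P {one_string('seq')}",
    line_def("O", "S", ["DNA"], "sequence: the DNA string"),
    line_def("D", "I", ["STRING"], "id - sequence identifier"),
]
generated_schema_text = "\n".join(schema_lines) + "\n"
print(generated_schema_text)
```

The resulting string has the same shape as `schema_text` and could be passed to `ONEcode.ONEschema` in the same way.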



## 2. Opening Files for Reading

There are several ways to open a ONEcode file:

In [17]:
# Simple read (1 thread)
onefile = ONEcode.ONEfile("./TEST/small.seq")
print(f"File opened: {onefile.fileName()}")
print(f"Number of sequences: {onefile.givenCount('S')}")

File opened: ./TEST/small.seq
Number of sequences: 10


In [6]:
# Read with schema validation
onefile = ONEcode.ONEfile("./TEST/small.seq", "r", schema, "", 1)
print("File opened with schema validation")
print(f"Total sequences in file: {onefile.givenCount('S')}")

File opened with schema validation
Total sequences in file: 10


In [8]:
# Read with multiple threads (for large binary files)
onefile = ONEcode.ONEfile("./TEST/small.seq", 4)  # 4 threads
print("File opened with 4 threads for parallel decompression")

File opened with 4 threads for parallel decompression


## 3. Reading Data Line by Line

The main pattern for reading a ONEcode file is to call `readLine()` in a loop and branch on `lineType()`:

In [9]:
# Open the file
onefile = ONEcode.ONEfile("./TEST/small.seq", "r", schema, "", 1)

print(f"Reading sequences from {onefile.fileName()}")
print(f"Expected {onefile.givenCount('S')} sequences\n")

seq_count = 0

# Read through the file line by line
while onefile.readLine():
    line_type = onefile.lineType()
    line_num = onefile.lineNumber()
    
    if line_type == 'S':
        seq_count += 1
        sequence = onefile.getString()
        length = onefile.length()
        # Truncate long sequences to their first 50 bases for display
        shown = f"{sequence[:50]}..." if length > 50 else sequence
        print(f"Sequence {seq_count}: length={length}, DNA={shown}")
        
    elif line_type == 'I':
        seq_id = onefile.getString()
        print(f"  ID: {seq_id}")

print(f"\nTotal sequences read: {seq_count}")

Reading sequences from ./TEST/small.seq
Expected 10 sequences

Sequence 1: length=51, DNA=cttagtagcgatattagttaataaaggtaaattcaaatgcgagtggtaga...
  ID: seq1
Sequence 2: length=72, DNA=ctttaccctccgaggctcttatccaccagaaacttccgccggggtccagg...
  ID: seq2
Sequence 3: length=58, DNA=catattctgtcgtaaatgtagaagaaagtagtagacaactcagaacgatc...
  ID: seq3
Sequence 4: length=42, DNA=ttttgagcgagagagaatgataagacctcgagggagcttgaa
  ID: seq4
Sequence 5: length=55, DNA=tttaaatcaaaggccgaagtttttttaagcgacaaagcactttaatatca...
  ID: seq5
Sequence 6: length=66, DNA=agagtgaatatcattaaactagacattcacgatagaaaattagttaatta...
  ID: seq6
Sequence 7: length=47, DNA=gctctgtataatgtttctgttttactgtgtttgggattatgctaagc
  ID: seq7
Sequence 8: length=62, DNA=ccgagatctataacagtatcaaaaataaaaaacttttaataaaatattaa...
  ID: seq8
Sequence 9: length=53, DNA=tagaagttgtttaataagttttattcacaatcgtttaatatttacacata...
  ID: seq9
Sequence 10: length=71, DNA=acatttacatattgatgtaacactcctatagcctttgatgaccgaaaact...
  ID: seq10

Total sequences read: 10
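
The loop above handles `S` and `I` lines separately. When downstream code wants each sequence together with its ID, the same loop can be wrapped in a generator. The sketch below relies only on the `readLine()`, `lineType()` and `getString()` calls used above. `iter_records` is a hypothetical helper, and `FakeONEfile` is a hypothetical in-memory stand-in that lets the example run without a real file.

```python
def iter_records(onefile):
    """Yield (sequence, seq_id) pairs; seq_id is None when no I line follows."""
    sequence = None
    seq_id = None
    while onefile.readLine():
        line_type = onefile.lineType()
        if line_type == 'S':
            # A new S line closes the previous record, if there was one
            if sequence is not None:
                yield sequence, seq_id
            sequence, seq_id = onefile.getString(), None
        elif line_type == 'I' and sequence is not None:
            seq_id = onefile.getString()
    if sequence is not None:
        yield sequence, seq_id


class FakeONEfile:
    """Hypothetical stand-in exposing only the three methods used above."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._current = None

    def readLine(self):
        self._current = next(self._lines, None)
        return self._current is not None

    def lineType(self):
        return self._current[0]

    def getString(self):
        return self._current[1]


demo = FakeONEfile([('S', "acgt"), ('I', "seq1"), ('S', "ttga"), ('S', "cc"), ('I', "seq3")])
records = list(iter_records(demo))
print(records)
```

With a real file, the same generator would be called as `iter_records(onefile)` on an object opened as shown in section 2.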


## 4. Different Ways to Access DNA Data

ONEcode provides multiple ways to access DNA sequences:

In [10]:
onefile = ONEcode.ONEfile("./TEST/small.seq", "r", schema, "", 1)

# Read first sequence
while onefile.readLine():
    if onefile.lineType() == 'S':
        print("Different ways to access DNA data:\n")
        
        # 1. As a string
        dna_string = onefile.getString()
        print(f"1. getString(): {dna_string[:50]}...")
        
        # 2. As character array
        dna_chars = onefile.getDNAchar()
        print(f"2. getDNAchar(): {dna_chars[:50]}...")
        
        # 3. As 2-bit compressed format (4 bases per byte)
        dna_2bit = onefile.getDNA2bit()
        print(f"3. getDNA2bit(): {len(dna_2bit)} bytes for {onefile.length()} bases")
        print(f"   Compression ratio: {onefile.length() / len(dna_2bit):.2f}x")
        
        break  # Just show first sequence

Different ways to access DNA data:

1. getString(): cttagtagcgatattagttaataaaggtaaattcaaatgcgagtggtaga...
2. getDNAchar(): cttagtagcgatattagttaataaaggtaaattcaaatgcgagtggtaga...
3. getDNA2bit(): 13 bytes for 51 bases
   Compression ratio: 3.92x
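
The 13 bytes reported for 51 bases follow from packing four bases per byte: 51 / 4 rounds up to 13, which gives the 51 / 13 ≈ 3.92 ratio. The sketch below illustrates the idea in plain Python. The code assignment (a=0, c=1, g=2, t=3) and the bit order are assumptions made for illustration; the layout ONEcode actually uses may differ.

```python
BASE_TO_BITS = {'a': 0, 'c': 1, 'g': 2, 't': 3}  # assumed code assignment
BITS_TO_BASE = "acgt"


def pack_2bit(seq):
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    packed = bytearray((len(seq) + 3) // 4)  # round up to whole bytes
    for i, base in enumerate(seq.lower()):
        shift = 6 - 2 * (i % 4)
        packed[i // 4] |= BASE_TO_BITS[base] << shift
    return bytes(packed)


def unpack_2bit(packed, length):
    """Recover the first `length` bases (always lowercase) from packed bytes."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3]
        for i in range(length)
    )


example = "acgt" * 12 + "acg"  # 51 bases, made up for this example
packed = pack_2bit(example)
print(f"{len(packed)} bytes for {len(example)} bases")
print(unpack_2bit(packed, len(example)) == example)
```

Two bits per base leave no room for letter case or ambiguity codes such as `N`, so a scheme like this can only round-trip the four plain bases.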


## 5. File Metadata and Statistics

Get information about the file contents:

In [11]:
onefile = ONEcode.ONEfile("./TEST/small.seq", "r", schema, "", 1)

print("File Statistics:")
print("=" * 50)
print(f"File name: {onefile.fileName()}")
print(f"\nSequence statistics:")
print(f"  Count (S lines): {onefile.givenCount('S')}")
print(f"  Max length: {onefile.givenMax('S')}")
print(f"  Total bases: {onefile.givenTotal('S')}")

if onefile.givenCount('S') > 0:
    avg_length = onefile.givenTotal('S') / onefile.givenCount('S')
    print(f"  Average length: {avg_length:.1f}")

File Statistics:
==================================================
File name: ./TEST/small.seq

Sequence statistics:
  Count (S lines): 10
  Max length: 72
  Total bases: 577
  Average length: 57.7
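
As a sanity check, the same figures can be recomputed from the per-sequence lengths printed in section 3. The sketch below uses only those lengths, copied by hand, and does not call the ONEcode API.

```python
# Lengths copied from the "Sequence N: length=..." output in section 3
lengths = [51, 72, 58, 42, 55, 66, 47, 62, 53, 71]

count = len(lengths)
max_length = max(lengths)
total_bases = sum(lengths)
average = total_bases / count

print(f"Count: {count}")
print(f"Max length: {max_length}")
print(f"Total bases: {total_bases}")
print(f"Average length: {average:.1f}")
```

The recomputed count, maximum, total and average agree with the values reported by `givenCount('S')`, `givenMax('S')` and `givenTotal('S')` above.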


## 6. Writing ONEcode Files

Create new ONEcode files in ASCII or binary format:

In [12]:
# Create a new ASCII file
outfile_ascii = ONEcode.ONEfile("./TEST/example_output.seq", "w", schema, "seq", 1)

# Add provenance information
outfile_ascii.addProvenance("python_tutorial", "1.0.0", "Creating example sequences from Jupyter notebook")

# Write some example sequences
sequences = [
    ("ACGTACGTACGTACGT", "seq1"),
    ("GGGGCCCCAAAAATTTT", "seq2"),
    ("ATCGATCGATCGATCG", "seq3"),
]

for seq, seq_id in sequences:
    # Write the sequence (S line)
    outfile_ascii.writeLine('S', seq)
    # Write the ID (I line)
    outfile_ascii.writeLine('I', seq_id)

# Close the file by deleting the object (important!)
del outfile_ascii

print("ASCII file created: ./TEST/example_output.seq")
print(f"Wrote {len(sequences)} sequences")

ASCII file created: ./TEST/example_output.seq
Wrote 3 sequences


In [13]:
# Create a binary file
outfile_binary = ONEcode.ONEfile("./TEST/example_output.1seq", "wb", schema, "seq", 1)

outfile_binary.addProvenance("python_tutorial", "1.0.0", "Creating binary example")

for seq, seq_id in sequences:
    outfile_binary.writeLine('S', seq)
    outfile_binary.writeLine('I', seq_id)

del outfile_binary

print("Binary file created: ./TEST/example_output.1seq")
print("Binary files are compressed and faster to read!")

Binary file created: ./TEST/example_output.1seq
Binary files are compressed and faster to read!


## 7. Verify the Written Files

Read back the files we just created:

In [14]:
# Read the ASCII file
print("Reading ASCII file:")
infile = ONEcode.ONEfile("./TEST/example_output.seq", "r", schema, "", 1)
while infile.readLine():
    if infile.lineType() == 'S':
        print(f"  Sequence: {infile.getString()}")
    elif infile.lineType() == 'I':
        print(f"    ID: {infile.getString()}")

print("\nReading binary file:")
infile = ONEcode.ONEfile("./TEST/example_output.1seq", "r", schema, "", 1)
while infile.readLine():
    if infile.lineType() == 'S':
        print(f"  Sequence: {infile.getString()}")
    elif infile.lineType() == 'I':
        print(f"    ID: {infile.getString()}")

Reading ASCII file:
  Sequence: ACGTACGTACGTACGT
    ID: seq1
  Sequence: GGGGCCCCAAAAATTTT
    ID: seq2
  Sequence: ATCGATCGATCGATCG
    ID: seq3

Reading binary file:
  Sequence: acgtacgtacgtacgt
    ID: seq1
  Sequence: ggggccccaaaaatttt
    ID: seq2
  Sequence: atcgatcgatcgatcgt
    ID: seq3
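
The binary read-back above returns lowercase bases, so a round-trip check has to compare case-insensitively. The sketch below is one way to do that in plain Python. `compare_round_trip` is a hypothetical helper, and the two lists are copied by hand from the code and output above.

```python
def compare_round_trip(written, read_back):
    """Return (record number, written, read) for each pair that differs, ignoring case."""
    mismatches = []
    for number, (w, r) in enumerate(zip(written, read_back), start=1):
        if w.lower() != r.lower():
            mismatches.append((number, w, r))
    return mismatches


# Sequences passed to writeLine('S', ...) in section 6
written = ["ACGTACGTACGTACGT", "GGGGCCCCAAAAATTTT", "ATCGATCGATCGATCG"]
# Sequences copied from the "Reading binary file" output above
read_binary = ["acgtacgtacgtacgt", "ggggccccaaaaatttt", "atcgatcgatcgatcgt"]

for number, w, r in compare_round_trip(written, read_binary):
    print(f"record {number}: wrote {w!r} ({len(w)} bases), read {r!r} ({len(r)} bases)")
```

With the values shown above, the check flags record 3: 16 bases were written, but the binary read-back shows 17 characters. A difference like this is worth investigating before relying on the round trip.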


## 8. Working with Integer and Real Lists

ONEcode supports lists of integers and real numbers:

In [15]:
# Define a schema with integer and real lists
list_schema_text = (
    "P 4 demo                 DEMO FILE\n"
    "O D 0                    data object\n"
    "D Q 1 8 INT_LIST         quality scores\n"
    "D F 1 9 REAL_LIST        feature values\n"
    "D T 1 11 STRING_LIST     tags\n"
)

list_schema = ONEcode.ONEschema(list_schema_text)

# Write a file with lists
outfile = ONEcode.ONEfile("./TEST/list_demo.demo", "w", list_schema, "demo", 1)
outfile.addProvenance("python_tutorial", "1.0.0", "Demo of list types")

# Write object marker
outfile.writeLine('D')

# Write integer list (quality scores)
quality_scores = [10, 20, 30, 40, 35, 30, 25, 20]
outfile.writeLineIntList('Q', quality_scores)

# Write real list (feature values)
features = [1.5, 2.7, 3.1, 4.9, 5.2]
outfile.writeLineRealList('F', features)

# Write string list (tags)
tags = ["tag1", "tag2", "tag3"]
outfile.writeLine('T', tags)

del outfile

print("List demo file created!")

# Read it back
print("\nReading back:")
infile = ONEcode.ONEfile("./TEST/list_demo.demo", "r", list_schema, "", 1)
while infile.readLine():
    lt = infile.lineType()
    if lt == 'D':
        print("Data object")
    elif lt == 'Q':
        quals = infile.getIntList()
        print(f"  Quality scores: {quals}")
    elif lt == 'F':
        feats = infile.getRealList()
        print(f"  Features: {feats}")
    elif lt == 'T':
        tag_list = infile.getStringList()
        print(f"  Tags: {tag_list}")

List demo file created!

Reading back:
Data object
  Quality scores: 10
  Features: 1.5
  Tags: ['tag1', 'tag2', 'tag3']


## 9. Random Access with gotoObject

For binary files, you can jump directly to specific objects:

In [None]:
# This only works with binary files that have an index
onefile = ONEcode.ONEfile("./TEST/example_output.1seq", "r", schema, "", 1)

print(f"Total sequences: {onefile.givenCount('S')}\n")

# Jump to the 2nd sequence
if onefile.gotoObject('S', 2):
    print("Jumped to sequence #2:")
    if onefile.readLine() and onefile.lineType() == 'S':
        print(f"  Sequence: {onefile.getString()}")
    if onefile.readLine() and onefile.lineType() == 'I':
        print(f"  ID: {onefile.getString()}")
else:
    print("Could not jump to object (may need a larger file or different format)")

## 10. Schema Validation

Check if a file matches an expected schema:

In [None]:
onefile = ONEcode.ONEfile("./TEST/small.seq", "r", schema, "", 1)

# Check against schema text
schema_check = (
    "P 3 seq\n"
    "O S 1 3 DNA\n"
    "D I 1 6 STRING\n"
)

if onefile.checkSchemaText(schema_check):
    print("✓ File matches the expected schema!")
else:
    print("✗ Schema mismatch")

# Try with a wrong schema
wrong_schema = (
    "P 3 seq\n"
    "O S 1 3 INT\n"  # Wrong: S should be DNA, not INT
)

if onefile.checkSchemaText(wrong_schema):
    print("✓ File matches wrong schema (unexpected!)")
else:
    print("✗ File correctly rejected wrong schema")

## 11. Practical Example: Reverse Complement

Read sequences and write their reverse complements:

In [16]:
def reverse_complement(seq):
    """Simple reverse complement for DNA sequences"""
    complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G',
                  'a': 't', 't': 'a', 'g': 'c', 'c': 'g'}
    return ''.join(complement.get(base, base) for base in reversed(seq))

# Open input file
infile = ONEcode.ONEfile("./TEST/small.seq", "r", schema, "", 1)

# Create output file for reverse complements
outfile = ONEcode.ONEfile("./TEST/small_rc.1seq", "wb", schema, "seq", 1)
outfile.addProvenance("python_tutorial", "1.0.0", "Reverse complement from Python")

seq_count = 0

# Process sequences
print("Processing sequences...\n")
while infile.readLine():
    lt = infile.lineType()
    
    if lt == 'S':
        seq = infile.getString()
        rc_seq = reverse_complement(seq)
        outfile.writeLine('S', rc_seq)
        seq_count += 1
        
        # Show first few
        if seq_count <= 3:
            print(f"Sequence {seq_count}:")
            print(f"  Original: {seq[:40]}..." if len(seq) > 40 else f"  Original: {seq}")
            print(f"  Rev-comp: {rc_seq[:40]}..." if len(rc_seq) > 40 else f"  Rev-comp: {rc_seq}")
    
    elif lt == 'I':
        seq_id = infile.getString()
        outfile.writeLine('I', seq_id + "_RC")

del outfile

print(f"\n✓ Processed {seq_count} sequences")
print("✓ Output written to: ./TEST/small_rc.1seq")

Processing sequences...

Sequence 1:
  Original: cttagtagcgatattagttaataaaggtaaattcaaatgc...
  Rev-comp: atctaccactcgcatttgaatttacctttattaactaata...
Sequence 2:
  Original: ctttaccctccgaggctcttatccaccagaaacttccgcc...
  Rev-comp: attacagaagaacgttaagagtcctggaccccggcggaag...
Sequence 3:
  Original: catattctgtcgtaaatgtagaagaaagtagtagacaact...
  Rev-comp: accgttctgatcgttctgagttgtctactactttcttcta...

✓ Processed 10 sequences
✓ Output written to: ./TEST/small_rc.1seq
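
For long sequences, the per-character dictionary lookup in `reverse_complement` can be replaced by a translation table. The sketch below is an alternative built on the standard `str.maketrans` and `str.translate`; `reverse_complement_translate` is a name chosen here for illustration. Like the original, it leaves any character outside `ACGTacgt` unchanged.

```python
# Translation table covering both cases; other characters pass through unchanged
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_translate(seq):
    """Complement every base with one translate() call, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement_translate("cttagtagcg"))
```

It can be dropped into the loop above in place of `reverse_complement(seq)` without other changes.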
