Skip to content

yusuke1997/esaxx-py

Repository files navigation

esaxx-py

License: MIT Python versions Python package

Python wrapper of the Enhanced Suffix Array (ESA).

This is a version adapted for Python, originating from the C++ implementation found here.

Installation

pip install esaxx-py

Usage

from esa import esaxx

def print_snippet(T, beg, length):
    for i in range(length):
        c = T[beg + i]
        print("_" if c.isspace() else c, end="")

T = 'abracadabra'
SA = [0]*len(T)
L = [0]*len(T)
R = [0]*len(T)
D = [0]*len(T)
    
k = 0x100
node_num = 0
node_num = esaxx(T, SA, L, R, D, len(T), k, node_num)
if node_num == -1:
    exit()

print(f"node:{node_num}")

for i in range(node_num):
    print(f"{i}\t{R[i] - L[i]}\t{D[i]}\t", end="")
    print_snippet(T, SA[L[i]], D[i])
    print()
node:5
0	2	4	abra
1	5	1	a
2	2	3	bra
3	2	2	ra
4	11	0

Alternatively, you can use enumSubstring.py:

ehco abracadabra | python enumSubstring.py

For the original implementation:

g++ enumSubstring.cpp
echo abracadabra | ./a.out

Note

In the original implementation, the return value of esaxx was an error code, not node_num. However, due to the constraints of Python and the difficulty in passing by reference, I've chosen to return node_num.

Maximal Substrings

To obtain Maximal Substrings:

from esa import esaxx

def print_snippet(T, beg, length):
    for i in range(length):
        c = T[beg + i]
        print("_" if c.isspace() else c, end="")

T = 'abracadabra'
SA = [0]*len(T)
L = [0]*len(T)
R = [0]*len(T)
D = [0]*len(T)
    
k = 0x100
node_num = 0
node_num = esaxx(T, SA, L, R, D, len(T), k, node_num)
if node_num == -1:
    exit()

size = len(T)

# Record changes in BWT
rank = [0] * size
r = 0
for i in range(size):
    if i == 0 or T[(SA[i] + size - 1) % size] != T[(SA[i - 1] + size - 1) % size]:
        r += 1
    rank[i] = r
    
print("count\tlength\tstring")
# Enumerate maximal partial strings
for i in range(node_num):
    if D[i] == 0 or (rank[R[i] - 1] - rank[L[i]] == 0):
        continue
    print(f"{R[i] - L[i]}\t{D[i]}\t", end="")
    print_snippet(T, SA[L[i]], D[i])
    print()

The first column represents the frequency of occurrence, and the second column represents the length of the string.

count	length	string
2	4	abra
5	1	a

Here, even strings that appear more than once are listed, even if they are just one character. If you want to skip those, you can use if len < 2: continue.

enumMaxSubstring.py:

ehco abracadabra | python enumMaxSubstring.py

C++ (enumMaxSubstring.cpp):

g++ enumMaxSubstring.cpp
echo abracadabra | ./a.out

UPDATE in 0.2.0

Introduce a new function: get_maximal_substrings(str).

This function allows for easier extraction of maximal substrings from a given string.

Usage Example:

from esa import get_maximal_substrings

T = 'abracadabra'
substrings = get_maximal_substrings(T)

print("count\tlength\tstring")
for substring in substrings:
    print(f'{substring.count}\t{substring.length}\t{substring.string})
count   length  string
2       4       abra
5       1       a

UPDATE in 2.7.0

Available unicode character

Usage Example:

from esa import get_maximal_substrings_unicode

T = '松島やああ松島や松島や'
substrings = get_maximal_substrings_unicode(T)
print(T)
print("count\tlength\tstring")
for substring in substrings:
    print(f'{substring.count}\t{substring.length}\t{substring.string})
松島やああ松島や松島や
count	length	string
3	3	松島や
3	2	島や
3	1	や
2	1	あ

Additional Information

C++ Implementation:

Rust Version:

Software using esaxx:

List of papers using esaxx:

Articles about esaxx: