GST is a module that calculates Greedy String Tiling algorithm as described in "String Similarity via Greedy String Tiling and Running Karp−Rabin Matching" (Wise, 1993) - https://www.researchgate.net/publication/262763983_String_Similarity_via_Greedy_String_Tiling_and_Running_Karp-Rabin_Matching
pip install gst-calculation
pip install git+https://github.com/tomytw/gst-calculation.git@0.1.2
>>> from gst_calculation import gst
You can calculate a gst of a collection of numbers or strings (or any collection of object that can be compared using equal function, must have __eq__ method inside it's class)
The result will be array that contains two elements (two index):
- Tile information (position and length of matched tiles)
- Total score
>>> tokens_sequence_1 = [1,2,3,4,5]
>>> tokens_sequence_2 = [3,4,5,6,7]
>>> gst.calculate(tokens_sequence_1, tokens_sequence_2, minimal_match=3)
[[{'token_1_position': 2, 'token_2_position': 0, 'length': 3, 'score': 3}], 3]
>>> tokens_sequence_1 = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> tokens_sequence_2 = ['the', 'lazy', 'dog', 'jumps', 'over', 'the', 'quick', 'brown', 'fox']
>>> gst.calculate(tokens_sequence_1, tokens_sequence_2, minimal_match=3)
[[{'token_1_position': 0, 'token_2_position': 5, 'length': 4, 'score': 4},
{'token_1_position': 6, 'token_2_position': 0, 'length': 3, 'score': 3}],
7]