This package provides methods for computing the Normalized Compression Distance (NCD), a measure of how similar two strings are, based on a real-world compression algorithm such as bzip2.
InformationDistances.jl is registered in the General registry, so it can be installed from the Julia REPL in package mode (entered by pressing ]):
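For reference, the NCD is commonly defined in the literature as shown below, where C(x) is the compressed length of x and xy is the concatenation of x and y. Whether this package computes exactly this form (for example, how it joins the two inputs) is an assumption here and is not stated in this README.

```latex
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```

Values near 0 indicate that the two strings share most of their structure, and values near 1 indicate that they share little. With a real compressor the result is only an approximation and can slightly exceed 1.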
] add InformationDistances
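If you prefer to install from a script or a non-interactive session, the same step can be done through Julia's standard Pkg API. This is a minimal sketch that assumes network access to the General registry.

```julia
# Install the package through the Pkg standard library instead of the REPL's package mode.
using Pkg
Pkg.add("InformationDistances")
```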
julia> using InformationDistances
# Create three strings to compare; we expect s1 and s2 to be more similar to each other than either is to s3
# s3 is generated randomly, so its value and the distances involving it will differ between runs
julia> s1 = repeat("ab", 100)
"abababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababab"
julia> s2 = repeat("ba", 100)
"babababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababa"
julia> s3 = String(rand(('a', 'b'), 200))
"aabaaabaaababaabababbaaaaabaaaaaabbabbaaabbbabbbbaaaaababaabbbbaababbbbaaaaaaaaabababaaabbbbbbbabbbaabbabababbaababbbbabbbababaaaababaaababbababaaaaababbabbbbaabbaabbbaabaababbbaaaaaababbbabbbabbabbaa"
# Create a normalized compression distance with the default parameters
julia> d = NormalizedCompressionDistance();
julia> d(s1, s2)
0.125
julia> d(s1, s3)
0.4482758620689655
julia> d(s2, s3)
0.4482758620689655
# Create another distance that uses Bzip2 for compression
julia> using CodecBzip2: Bzip2Compressor
julia> d_bzip2 = NormalizedCompressionDistance(CodecCompressor{Bzip2Compressor}(workfactor=250));
julia> d_bzip2(s1, s2)
0.1
julia> d_bzip2(s1, s3)
0.5903614457831325
julia> d_bzip2(s2, s3)
0.5783132530120482
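To make the idea behind these numbers concrete, here is a minimal sketch that computes a compression distance by hand using only CodecBzip2. The helper names compressed_length and ncd_sketch are hypothetical and are not part of InformationDistances.jl. The sketch assumes the commonly used NCD formula and plain string concatenation, so its results need not match those of NormalizedCompressionDistance.

```julia
using CodecBzip2: Bzip2Compressor

# Hypothetical helper: number of bytes in the bzip2-compressed form of a string.
compressed_length(s::AbstractString) =
    length(transcode(Bzip2Compressor, Vector{UInt8}(String(s))))

# Hypothetical helper: the commonly used NCD formula,
# (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
# This is an illustration, not the package's implementation.
function ncd_sketch(x::AbstractString, y::AbstractString)
    cx = compressed_length(x)
    cy = compressed_length(y)
    cxy = compressed_length(x * y)  # compress the concatenation of both strings
    return (cxy - min(cx, cy)) / max(cx, cy)
end

s1 = repeat("ab", 100)
s2 = repeat("ba", 100)
println(ncd_sketch(s1, s2))
```

Because s1 and s2 are built from the same repeating pattern, compressing their concatenation costs little more than compressing either one alone, which is what keeps the distance small.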
The examples folder contains an interactive notebook that can be run with Pluto.jl. A static, non-interactive version is also available for quick viewing online, although it does not currently let you choose different options.