Skip to content
Allow differences in the comparison of elements
Perl
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
lib/LCS
t
.gitignore
.travis.yml
Build.PL
Changes
LICENSE
META.json
README.md
cpanfile
dist.ini

README.md

NAME

LCS::Similar - allow differences in the compared elements of Longest Common Subsequence (LCS) Algorithm

LCS-Similar Coverage Status Kwalitee Score CPAN version

SYNOPSIS

use LCS::Similar;

$alg = LCS::Similar->new;
@lcs = $alg->LCS(\@a,\@b);

ABSTRACT

LCS::Similar allows differences in the compared elements.

DESCRIPTION

Pure LCS algorithms can be used for global alignment of two sequencies.

Usually they provide one result, but there can be more than one solution fulfilling the criterion of being longest.

For example this two sequencies have two possible LCS:

# solution1
@sequence1 = qw/a b   d   /;
@sequence2 = qw/  b a d c /;
@lcs1      = qw/  b   d   /;

# solution2
@sequence1 = qw/  a b d   /;
@sequence2 = qw/b a   d c /;
@lcs2      = qw/  a   d   /;

Or an example of mistyping:

eo___nnnnnicaio
 |   |    ||| |
commun____icato

This is an actual result of a pure LCS algorithm. Here the letter n of the second line could align in 5 possible ways with one of the letters n of the first line, and the LCS will always have a length of 6.

But this solution better maps the mismatches:

eonnnnnicaio
~|~~~ ||||~|
commu_nicato

Here the tilde ~ represents similarity of the aligned characters.

With a function returning e.g. a similarity of 0.7 for a comparison of similar characters this module takes similarity into account for alignment.

See examples.

CONSTRUCTOR

  • new()

    Creates a new object which maintains internal storage areas for the LCS computation. Use one of these per concurrent LCS() call.

METHODS

  • LCS(\@a,\@b,\&similarity,$threshold)

    Finds a Longest Common Subsequence, taking two arrayrefs as method arguments. It returns an array reference of corresponding indices, which are represented by 2-element array refs.

    The third argument is the reference of a subroutine comparing two elements and returning a number between 0 and 1. Where 0 means unequal and 1 means equal.

    Without a subroutine the module falls back to string comparison.

    The fourth argument is a threshold passed to the subroutine.

  • max($number1, $number2)

    Returns the maximum of two numbers.

  • max3($number1, $number2, $number3)

    Returns the maximum of three numbers.

EXPORT

None by design.

EXAMPLES

Aline two textfiles

use LCS::Similar;
use LCS;

binmode(STDOUT,":encoding(UTF-8)");

open(my $in1,"<:encoding(UTF-8)",'file1')
  or die "cannot open file1: $!";
open(my $in2,"<:encoding(UTF-8)",'file2')
  or die "cannot open file2: $!";

my $lines1 = [<$in1>];
my $lines2 = [<$in2>];

sub similarity {
  my ($a, $b, $threshold) = @_;

  $a //= '';
  $b //= '';
  $threshold //= 0.7;

  return 1 if ($a eq $b);
  return 1 if (!$a eq !$b); # avoid division by zero

  # length of LCS
  my $llcs = LCS->LLCS(
    [split(//,$a)],
    [split(//,$b)],
  );

  # the standard formula
  my $similarity = (2 * $llcs) / (length($a) + length($b));
  return $similarity if ($similarity >= $threshold);
  return 0;
}

# aligned indices of elements similar more than $threshold 0.5
my $lcs = LCS::Similar->LCS( $lines1, $lines2, \&similarity, 0.5 );

# map indices and not so similar elements into AoA of lines
my $aligned = LCS->lcs2align( $lines1, $lines2, $lcs );

# print them
for my $chunk (@$aligned) {
  print 'a: ',$chunk->[0];
  print 'b: ',$chunk->[1];
  print "\n";
}

Aline two words

use LCS::Similar;
use LCS;

my $word1 = [ split(//, 'eonnnnnicaio') ];
my $word1 = [ split(//, 'communicato' ) ];

sub confusable {
  my ($a, $b, $threshold) = @_;

  $a //= '';
  $b //= '';
  $threshold //= 0.7;

  return 1 if ($a eq $b);
  return 1 if (!$a && !$b);

  my $map = {
    'e' => 'c',
    'c' => 'e',
    'm' => 'n',
    'n' => 'm',
    'i' => 't',
    't' => 'i',
  };

  return $threshold if (exists $map->{$a} && $map->{$a} eq $b);
  return 0;
}

my $aligned = [
  LCS->align2strings(
    LCS->lcs2align(
      $word1,
      $word2,
      LCS::Similar->LCS( $word1, $word2, \&confusable, 0.7 )
    )
  )
];
print 'a: ',$aligned->[0],"\n"; # eonnnnnicaio
print 'b: ',$aligned->[1],"\n"; # commu_nicato
print "\n";

SEE ALSO

Algorithm::Diff

AUTHOR

Helmut Wollmersdorfer helmut.wollmersdorfer@gmail.com

Kwalitee Score

COPYRIGHT AND LICENSE

Copyright 2015 by Helmut Wollmersdorfer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

You can’t perform that action at this time.