## Task 3 - Deduplication
(10 points)

In this task, the implementation of the `Deduplicater` class should be finished. Note that you may want to copy your solution from the previous tasks where appropriate. The class comprises the following methods:
* The constructor takes all parameters necessary for the deduplication process as it has been described in the lecture slides. (See the javadoc comment for a description of the parameters)
* The `accept` method will be called for each document exactly once. It should
  * preprocess the document
    * lower case characters
    * remove all characters except lowercased characters, digits and whitespaces (i.e. `' '`)
  * assign an id to the document
    * the first document should get the id `0`, the next should get `1`, etc.
  * generate and store the documents set of shingles (not the document itself)
* The `determineDuplicates` method should return the set of (near) duplicates.
  * A duplicate should have a Jaccard similarity higher or equal to the given threshold $\theta=0.9$. (The Jaccard similarity of two documents is based on the set of shingles of the document)
  * The Min Hashing and Local Sensitive Hashing algorithms should be used to generate a set of pair candidates. For these pairs, the similarity calculation should be carried out (using `jaccard.jaccardSim`) to make sure that the similarity is larger or equal to $\theta$.
  * The result should have the type `Set<Duplicate>` where `Duplicate` is a given, simple class used to store a document pair.
  * Since there is a small chance that not all (near) duplicates are found, it is sufficient to find 99%.
  * The result is not allowed to contain document pairs with a lower similarity than the given threshold.
* The class should make use of the `jaccard` attribute. This is an instance of the `JaccardSimilarity` class that
  * should implement the Jaccard similarity as in task 1.
  * should be used in the `determineDuplicates` method
  * count the number of times it has been called to show that this whole deduplication approach needs much less comparisons than $n*(n-1) / 2$.

#### Notes

- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.

In [1]:
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/**
 * A simple class representing a pair of documents
 */
public class Duplicate {
    public int id1, id2;
    public Duplicate(int id1, int id2) {
        if (id1 < id2) {
            this.id1 = id1;
            this.id2 = id2;
        } else {
            this.id1 = id2;
            this.id2 = id1;
        }
    }
    public void setId1(int id1) { this.id1 = id1; }
    public void setId2(int id2) { this.id2 = id2; }
    @Override
    public int hashCode() { return 31 * (31 + id1) + id2; }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (!(obj instanceof Duplicate))
            return false;
        Duplicate other = (Duplicate) obj;
        if (id1 != other.id1)
            return false;
        if (id2 != other.id2)
            return false;
        return true;
    }
    @Override
    public String toString() {
        return (new StringBuilder()).append('(').append(id1).append(',').append(id2).append(')').toString();
    }
}
/**
 * A simple class implementing the Jaccard similarity and counting the number of times it is called.
 */
public class JaccardSimilarity {
    private AtomicInteger calls = new AtomicInteger(0);
    public int getCalls() { return calls.get(); }

    public double jaccardSim(Set<Integer> set1, Set<Integer> set2) {
        calls.incrementAndGet();
        double similarity = 0;
        // You may want to copy your solution from task 1
        // YOUR CODE HERE
        Set<Integer> union = new HashSet<Integer>(set1);
        union.addAll(set2);
        
        if(set1.size()>set2.size()){
            
            Set<Integer> intersection = new HashSet<Integer>(set1);
            intersection.retainAll(set2);
           // System.out.println("Size of intersection: "+intersection.size());
           // System.out.println("Size of union: "+union.size());
            similarity = (double)intersection.size()/union.size();
            
        }
        else{
            
            Set<Integer> intersection = new HashSet<Integer>(set2);
            intersection.retainAll(set1);
           // System.out.println("Size of intersection: "+intersection.size());
           // System.out.println("Size of union: "+union.size());
            similarity = (double)intersection.size()/union.size();
            
        }
        
        return similarity;
    }
}

// YOUR CODE HERE

/**
 * Class for finding duplicates in a given corpus
 */
public class Deduplicater implements Consumer<String> {
    
    public final JaccardSimilarity jaccard = new JaccardSimilarity();
    // YOUR CODE HERE
    
    int k, count=0, totalDoc=0;
    
    Map<String, Integer> wordShingles = new HashMap<String, Integer>();
    Map<Integer, Set<Integer>> docMap = new HashMap<Integer, Set<Integer>>();
    
    int shingleLength, numberOfHashes, b, r;
    double threshold;
    long seed;
    
    int[][] permutations = null;
    
    /**
     * Constructor.
     * 
     * @param threshold
     *            the similarity threshold theta.
     * @param shingleLength
     *            the length of the shingles
     * @param numberOfHashes
     *            the number of hash functions (i.e., permutations) that should be
     *            used
     * @param seed
     *            the seed that can be used for pseudo random processes
     * @param b
     *            the number of bands for the LSH
     * @param r
     *            the number of rows of a single band for LSH
     */
    public Deduplicater(double threshold, int shingleLength, int numberOfHashes, long seed, int b, int r) {
        // YOUR CODE HERE
        
        this.threshold = threshold;
        this.shingleLength = shingleLength;
        this.numberOfHashes = numberOfHashes;
        this.seed = seed;
        this.b = b;
        this.r = r;
    }

    /**
     * This method is called with a single document that should be added to the internal, 
     * shingled representation of documents.
     *
     * @param line
     *            a single document that should be processed by the Deduplicator.
     */
    public void accept(String line) {
        // YOUR CODE HERE
        
         int docId = totalDoc;
        
        line = line.toLowerCase().replaceAll("[^a-z0-9\\s]","");
        
        Set<Integer> shingles = null;        
        shingles = new HashSet<Integer>();    
        
        int k = shingleLength;
        
        for(int i=0; i<=line.length()-k; i++)
        {
            String key = line.substring(i, i + k);
            
                        
            if(wordShingles.containsKey(key)){
                
                shingles.add(wordShingles.get(key));
            
            }
            else{
                wordShingles.put(key, count);
                shingles.add(count);
                count++;
            }
            
        }
        docMap.put(docId, shingles);
        totalDoc++;
        
    }

    public Set<Duplicate> determineDuplicates() {
        Set<Duplicate> duplicates = new HashSet<>();
        // YOUR CODE HERE
        
        int row = docMap.size();
        int[][] SigMat = new int[row][numberOfHashes];
        
        PermutationProcessor(numberOfHashes, wordShingles.size());
        
        for(int i =0;i <row; i++)
        {
            SigMat[i] = minHash(docMap.get(i));
        }
        
        
        for(int doc = 0; doc<totalDoc-1 ; doc++)
        {
            for(int nextDoc = doc +1 ; nextDoc<totalDoc; nextDoc++)
            {
                for(int band = 0; band<b; band++)
                {
                    boolean check = true; 
                    
                    for(int br = band*r; br<r*(band+1); br++)
                    {
                        if(SigMat[doc][br] == SigMat[nextDoc][br])
                        {
                            continue;
                            
                        }
                        else
                        {
                            check = false;
                            break;
                        }

                    }
                    
                    if(check)
                    {
                        //toDo for similarity check
                            if(jaccard.jaccardSim(docMap.get(doc), docMap.get(nextDoc))>threshold)
                            {
                                duplicates.add(new Duplicate(doc,nextDoc));
                                break;
                                
                            }
                        
                    }
                }
                
            }
        }
        
    
        
        return duplicates;
    }
    
    
    //computes set of permutations
     public void PermutationProcessor(int numberOfHashes, int numberOfShingles) {
        // YOUR CODE HERE
       
        permutations = new int[numberOfHashes][numberOfShingles];
        int[] arr= new int[numberOfShingles];
        for (int j = 0; j < numberOfShingles; j++) {
            arr[j]=j;
        }
      
        int n = numberOfShingles; 
        
       for(int k=0; k<numberOfHashes; k++){
           int first = arr[0];
          for (int i = 0; i <n; i++) { 
              
              if(i == n-1)
              {
                 // int temp = arr[0];
                  arr[n-1] = first;
                  
              }
              else
              {
                  //int temp = arr[i+1];
                  arr[i] = arr[i+1];
              }
            
        } 
           
           permutations[k] = arr;
           
       }
    
        
    }
    
    //create signature 
    public int[] minHash(Set<Integer> s) {
        int hash[] = new int[permutations.length];
        // YOUR CODE HERE
        
        for(int k=0; k<permutations.length; k++){
           int[] p = permutations[k];
            
            for (int i=0; i<p.length; i++){
                
                if(s.contains(p[i])){
//                     System.out.println(p[i]);
//                     System.out.println(i);
                    hash[k] = i;
                    break;
                }
            }
        }
        return hash;
    }

}
// This line should make sure that compile errors are directly identified when executing this cell
// (the line itself does not produce any meaningful result)
new Deduplicater(0.9, 10, 50, 123, 10, 5);
System.out.println("compiled");

compiled


# Evaluation

- Run the following cells to test your implementation.
- This time, you have two different test cells. The first test uses a smaller file while the other test uses a longer file. We separated the two cells since the second one may take much more time.
- The files used for testing are:
  - [example-corpus.txt](https://hobbitdata.informatik.uni-leipzig.de/teaching/SNLP/deduplication/example-corpus.txt)
  - [example-corpus-expected.txt](https://hobbitdata.informatik.uni-leipzig.de/teaching/SNLP/deduplication/example-corpus-expected.txt)
  - [shakespeare-dedup-small.txt](https://hobbitdata.informatik.uni-leipzig.de/teaching/SNLP/deduplication/shakespeare-dedup-small.txt)
  - [shakespeare-dedup-small-expected.txt](https://hobbitdata.informatik.uni-leipzig.de/teaching/SNLP/deduplication/shakespeare-dedup-small-expected.txt)
  - The files with the expected results contain similarities which do not have to be part of the output of your solution. They are just provied as additional information.
- You can ignore the cells afterwards.

In [2]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
%maven commons-io:commons-io:2.6
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
import java.util.stream.Collectors;
import java.io.File;

public static final double MIN_AMOUNT_OF_TP = 0.99;

/**
 * Simple method for reading the expected document pairs from a file.
 */
public Set<Duplicate> readDuplicatesFromFile(String filename) throws IOException {
    return FileUtils.readLines(new File(filename), "utf-8").parallelStream().map(s -> s.split(","))
            .filter(s -> s.length > 1).map(s -> new Duplicate(Integer.parseInt(s[0]), Integer.parseInt(s[1])))
            .collect(Collectors.toSet());
}
/**
 * Simple method that prints a given set of duplicates (max. the first 10).
 */
public void printDuplicates(Iterator<Duplicate> iter, StringBuilder builder) {
    int count = 0;
    while (iter.hasNext() && (count < 10)) {
        builder.append(count == 0 ? "[" : ", ");
        builder.append(iter.next().toString());
        ++count;
    }
    if (iter.hasNext()) {
        builder.append(", ...");
    }
    builder.append("]\n");
}

public void checkDuplicator(Iterator<String> lineIterator, Set<Duplicate> expectedDuplicates,
        double threshold, int shingleLength, int numberOfHashes, long seed, int b, int r, long maxRuntime) {
    try {
        // Determine the expected number of TPs to pass the test
        int minTP = (int) Math.floor(expectedDuplicates.size() * MIN_AMOUNT_OF_TP);

        // Process the given documents
        long time1 = System.currentTimeMillis();
        Deduplicater deduplicater = new Deduplicater(threshold, shingleLength, numberOfHashes, seed, b, r);
        lineIterator.forEachRemaining(deduplicater);
        time1 = System.currentTimeMillis() - time1;
        System.out.println("processing all documents took: " + time1 + "ms");

        // Search for duplicates
        long time2 = System.currentTimeMillis();
        Set<Duplicate> duplicates = deduplicater.determineDuplicates();
        time2 = System.currentTimeMillis() - time2;
        System.out.println("determineDuplicates took: " + time2 + "ms");

        // Check runtime
        if ((maxRuntime > 0) && ((time1 + time2) > maxRuntime)) {
            System.out.println("Warning! Your solution may take too much time (" + (time1 + time2)
                    + "ms while not more than " + maxRuntime + " is suggested)");
        }
            
        // Print Jaccard similarities
        System.out.println(deduplicater.jaccard.getCalls() + " Jaccard similarities were calculated.");
        if(deduplicater.jaccard.getCalls() == 0) {
            System.out.println("It looks like you are not using the jaccard attribute of the class. Please fix that.");
        }

        // Check duplicates
        // Get overlap (= true positives)
        Set<Duplicate> s1, s2, overlap;
        overlap = new HashSet<>();
        if (expectedDuplicates.size() > duplicates.size()) {
            s1 = duplicates;
            s2 = expectedDuplicates;
        } else {
            s1 = expectedDuplicates;
            s2 = duplicates;
        }
        overlap = s1.parallelStream().filter(d -> s2.contains(d)).collect(Collectors.toSet());
        // Get false positives
        expectedDuplicates.removeAll(overlap);
        // Get false positives
        duplicates.removeAll(overlap);
        System.out.print("TP=");
        System.out.print(overlap.size());
        System.out.print("\tFP=");
        System.out.print(duplicates.size());
        System.out.print("\tFN=");
        System.out.println(expectedDuplicates.size());

        // make sure that enough TPs have been found
        if (overlap.size() < minTP) {
            // The students solution has an issue... generate a detailed message
            StringBuilder builder = new StringBuilder();
            builder.append("Your solution found only ");
            builder.append(overlap.size());
            builder.append(" duplicates while at least ");
            builder.append(minTP);
            builder.append(" duplicates should have been found.\nYour solution missed:");
            printDuplicates(expectedDuplicates.iterator(), builder);
            Assertions.fail(builder.toString());
        }
        // make sure that no FPs are found
        if (duplicates.size() > 0) {
            // The students solution has an issue... generate a detailed message
            StringBuilder builder = new StringBuilder();
            builder.append("Your solution generated ");
            builder.append(duplicates.size());
            builder.append(
                    " FPs while none where expected.\nYour solution returned the following, wrong duplicates:");
            printDuplicates(duplicates.iterator(), builder);
            Assertions.fail(builder.toString());
        }
        System.out.println("Test successfully completed.");
    } catch (AssertionFailedError e) {
        throw e;
    } catch (Throwable e) {
        System.err.println("Your solution caused an unexpected error:");
        throw e;
    }
}

System.out.println("--- example corpus ---");
LineIterator iterator = FileUtils.lineIterator(new File("/srv/distribution/example-corpus.txt"), "UTF-8");
checkDuplicator(iterator, 
                readDuplicatesFromFile("/srv/distribution/example-corpus-expected.txt"), 
                0.9, 10, 50, 123, 10, 5, 0L);
iterator.close();

--- example corpus ---
processing all documents took: 636ms
determineDuplicates took: 28314ms
85960 Jaccard similarities were calculated.
TP=10	FP=0	FN=0
Test successfully completed.


In [3]:
/*
 * This test uses a longer file as input. Note that it may take up to 2 minutes.
 */
System.out.println("--- Shakespeare small ---");
iterator = FileUtils.lineIterator(new File("/srv/distribution/shakespeare-dedup-small.txt"), "UTF-8");
checkDuplicator(iterator, 
                readDuplicatesFromFile("/srv/distribution/shakespeare-dedup-small-expected.txt"), 
                0.9, 5, 50, -99, 5, 10, 120000L);
iterator.close();

--- Shakespeare small ---
processing all documents took: 409ms
determineDuplicates took: 495303ms
45113638 Jaccard similarities were calculated.
TP=308	FP=0	FN=0
Test successfully completed.


In [None]:
// Ignore this cell