# Task 1 - Shingling
(5 points)

Finalize the implementation of the `ShinglingProcessor` class.
* Its `applyShingling` method implements the shingling from the lecture slides based on set semantics. It returns the ids of the shingles that have been found in the document.
* Its constructor takes the length of the shingles.
* The `jaccardSim` method should return the Jaccard similarity of the two given shingle sets.

#### Example

The document
```
google is good
```
has the following shingles of length 3:
```
"goo", "oog", "ogl", "gle", "le ", "e i", " is", "is ", "s g", " go", "ood"
```
Since set semantics is used, the second occurrence of `"goo"` is not added a second time to the set of shingles. If the shingles are simply assigned ids in the order in which they are first seen, the document is represented by the following shingle ids (starting with 0):
```
    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,   10
```
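The id assignment described above can be sketched as follows. This is only an illustration under assumptions: the class `ShingleIdDemo` and its method `assignIds` are hypothetical names that are not part of the task's interface, and ids are handed out in first-seen order as in the example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class ShingleIdDemo {
    // Hypothetical dictionary: maps each shingle to the id it received when first seen.
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // Returns the ids of all distinct shingles of length k in text, in first-seen order.
    public Set<Integer> assignIds(String text, int k) {
        Set<Integer> result = new LinkedHashSet<>();
        for (int i = 0; i + k <= text.length(); i++) {
            String shingle = text.substring(i, i + k);
            Integer id = ids.get(shingle);
            if (id == null) {
                // New shingle: the next free id equals the current dictionary size.
                id = ids.size();
                ids.put(shingle, id);
            }
            // A repeated shingle such as the second "goo" does not add a new entry.
            result.add(id);
        }
        return result;
    }

    public static void main(String[] args) {
        ShingleIdDemo demo = new ShingleIdDemo();
        // prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        System.out.println(demo.assignIds("google is good", 3));
    }
}
```

The document has 12 windows of length 3 but only 11 distinct shingles, which is why the ids run from 0 to 10. Because the dictionary lives in the object, a later document processed by the same instance reuses the ids of shingles it shares with earlier documents.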
A second document
```
gooses google
```
would lead to the shingles
```
"goo", "oos", "ose", "ses", "es ", "s g", " go", "oog", "ogl", "gle"
```
and the ids
```
    0,    11,    12,    13,    14,     8,     9,     1,     2,     3
```

Their intersection is $\{0,1,2,3,8,9\}$, while their union is
$\{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14\}$. Therefore, their Jaccard similarity
is $6/15 = 0.4$.
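The arithmetic can be checked with a few lines of Java. This is a minimal sketch, not the expected solution: the class `JaccardDemo` and its method `jaccard` are hypothetical names, the two id sets are copied from the example, and returning 0 for two empty sets is a convention assumed here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDemo {
    // Jaccard similarity: size of the intersection divided by size of the union.
    public static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0; // assumed convention for two empty sets
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Shingle ids of "google is good" and "gooses google" from the example.
        Set<Integer> doc1 = new HashSet<>(Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
        Set<Integer> doc2 = new HashSet<>(Arrays.asList(0, 11, 12, 13, 14, 8, 9, 1, 2, 3));
        System.out.println(jaccard(doc1, doc2)); // prints 0.4
    }
}
```

The two sets share 6 ids and together contain 15 distinct ids, which gives 6/15.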

#### Hints

- As can be seen in the example, the first three characters of a document form its first shingle and the last three characters form its last shingle.
- For the tests in **this notebook**, you can assume that the input documents have already been preprocessed and contain only the following three character classes:
  - lowercase alphabetic characters
  - digits
  - whitespace characters
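
For illustration only, a preprocessing step that produces such input might look like the sketch below. It is not required for this task, and the notebook does not specify how the documents were actually preprocessed; the class `PreprocessDemo` and the method `normalize` are hypothetical.

```java
import java.util.Locale;

public class PreprocessDemo {
    // Hypothetical normalization: lowercase the text, then drop every character
    // that is not a lowercase ASCII letter, a digit, or whitespace.
    public static String normalize(String raw) {
        return raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Google IS good!")); // prints google is good
    }
}
```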

#### Notes

- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.

In [1]:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Class implementing the shingling of documents.
 */
public class ShinglingProcessor {
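    // YOUR CODE HERE
    // Shingle length; stays 0 if the constructor receives a non-positive value.
    private int k;
    // Next unused shingle id.
    private int count = 0;
    // Maps every shingle seen so far to its id, so that ids are comparable across documents.
    private final Map<String, Integer> wordShingles = new HashMap<String, Integer>();

    public ShinglingProcessor(int shingleLength) {
        // YOUR CODE HERE
        // Non-positive lengths are ignored rather than rejected, because the
        // compile check below constructs a processor with length 0.
        if (shingleLength > 0) {
            this.k = shingleLength;
        }
    }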

    public Set<Integer> applyShingling(String text) {
        // YOUR CODE HERE
        Set<Integer> shingles = new HashSet<Integer>();
        // Without a valid shingle length there is nothing to extract.
        if (k <= 0) {
            return shingles;
        }

        // Slide a window of length k over the text; texts shorter than k yield no shingles.
        for (int i = 0; i <= text.length() - k; i++) {
            String key = text.substring(i, i + k);
            Integer id = wordShingles.get(key);
            if (id == null) {
                // First occurrence of this shingle: assign the next free id.
                id = count++;
                wordShingles.put(key, id);
            }
            // Set semantics: adding an id that is already present has no effect.
            shingles.add(id);
        }

        return shingles;
    }
    
    public static double jaccardSim(Set<Integer> set1, Set<Integer> set2) {
        // YOUR CODE HERE
        Set<Integer> union = new HashSet<Integer>(set1);
        union.addAll(set2);
        // Two empty sets would give 0/0 (NaN); returning 0 here is a convention.
        if (union.isEmpty()) {
            return 0;
        }

        // The intersection is the same whichever set is copied first,
        // so no case distinction on the set sizes is needed.
        Set<Integer> intersection = new HashSet<Integer>(set1);
        intersection.retainAll(set2);

        return (double) intersection.size() / union.size();
    }
}

// This line should make sure that compile errors are directly identified when executing this cell
// (the line itself does not produce any meaningful result)
new ShinglingProcessor(0);
System.out.println("compiled");

compiled


# Evaluation

- Run the following cell to test your implementation.
- You can ignore the cells afterwards.

In [2]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;

public static final double DELTA = 0.000001;

public static void checkShingleSimilarity(ShinglingProcessor shingling, String text1, String text2,
        double expectedSim) throws Exception {
    try {
        double similarity = ShinglingProcessor.jaccardSim(shingling.applyShingling(text1),
                shingling.applyShingling(text2));
        double diff = Math.abs(similarity - expectedSim);
        Assertions.assertTrue(diff < DELTA, "Your Jaccard similarity of the shingles of \"" + text1 + "\" and \""
                + text2 + "\" is " + similarity + " but the expected similarity was " + expectedSim);
        System.out.println("Test successfully completed.");
    } catch (AssertionFailedError e) {
        throw e;
    } catch (Throwable e) {
        System.err.println("Your solution caused an unexpected error:");
        throw e;
    }
}

// example text
ShinglingProcessor processor = new ShinglingProcessor(3);
checkShingleSimilarity(processor, "google is good", "gooses google", 0.4);
checkShingleSimilarity(processor, "abc", "cba", 0.0);

processor = new ShinglingProcessor(1);
checkShingleSimilarity(processor, "example", "elpmaxe", 1.0);
processor = new ShinglingProcessor(2);
checkShingleSimilarity(processor, "example", "elpmaxe", 0.0);


Test successfully completed.
Test successfully completed.
Test successfully completed.
Test successfully completed.


In [None]:
// Ignore this cell

In [None]:
// Ignore this cell