# MMD 2024, Problem Sheet 6

Group: Daniela Fichiu, Aaron Maekel, Manuel Senger

# Exercise 1

**Task:**

The Jaccard similarity can be applied to sets of elements. Sometimes, documents (or
other objects) may be represented as multi-sets/bags rather than sets. In a multi-set,
an element can be a member more than once, whereas a set can only hold each element
at most once. Try to define a similarity metric for multi-sets. This metric should take
exactly the same values as Jaccard similarity in the special case where both multi-sets
are in fact sets.

**Solution:**

Recall the Jaccard similarity of two sets C1 and C2:

I12 := intersection of C1 and C2

U12 := union of C1 and C2

sim(C1,C2) = #(I12)/#(U12)


Assume we now have multisets with duplicate elements, for example MC1 = [a,a,b,b,c] and MC2 = [a,b,b,b,c,c,d].

We can turn each multiset into an ordinary set by indexing the repeated occurrences of an element, so that every occurrence becomes a unique element:

MC1 = [a1,a2,b1,b2,c1]

MC2 = [a1,b1,b2,b3,c1,c2,d1]

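We can now use intersection and union as before:

I12 = [a1,b1,b2,c1], so #I12 = 4

U12 = [a1,a2,b1,b2,b3,c1,c2,d1], so #U12 = 8

sim(MC1,MC2) = 4/8 = 0.5

This metric coincides with the Jaccard similarity if both inputs are ordinary sets: every element then occurs exactly once, so the transformation only renames each element x to x1 and leaves the sizes of the intersection and the union unchanged.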
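The indexing construction amounts to counting each element with the minimum of its two multiplicities in the intersection and with the maximum in the union. A minimal sketch of that computation using Python's `collections.Counter`; the function name `multiset_jaccard` and the convention for two empty inputs are our own assumptions for illustration:

```python
from collections import Counter

def multiset_jaccard(mc1, mc2):
    """Jaccard similarity for multisets given as iterables of hashable items."""
    c1, c2 = Counter(mc1), Counter(mc2)
    intersection = sum((c1 & c2).values())  # per-element minimum of the counts
    union = sum((c1 | c2).values())         # per-element maximum of the counts
    # Assumed convention: two empty multisets count as identical
    return intersection / union if union else 1.0

print(multiset_jaccard("aabbc", "abbbccd"))      # 0.5, the example from the text
print(multiset_jaccard({"a", "b"}, {"b", "c"}))  # 1/3, the ordinary Jaccard value
```

For ordinary sets all counts are 1, so minimum and maximum reduce to set intersection and set union.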



# Exercise 2
 

In [2]:
# Step 1: Set up the Spark session
import findspark
findspark.init()

from pyspark.sql import SparkSession
import os
from pyspark.sql.functions import split, explode
from pyspark.sql.types import StructType, StructField, StringType, IntegerType



In [22]:


spark = SparkSession.builder \
    .appName("TExercise2") \
    .getOrCreate()


def get_shingles(filename, k=9):
    """Return a DataFrame with the distinct k-shingles of a document and their count."""
    with open(os.path.join("Task6Ex2_documents", filename), 'r') as f:
        # Rejoin words hyphenated at line ends, drop line breaks, normalise case
        txt = f.read().replace("-\n", "").replace("\n", "").upper()

    # dict.fromkeys removes duplicates while keeping first-occurrence order;
    # the + 1 makes sure the final shingle txt[len(txt)-k:] is included
    shingles = dict.fromkeys(txt[i:i+k] for i in range(len(txt) - k + 1))

    # A plain StringType schema yields a single column named "value"
    df = spark.createDataFrame(list(shingles), schema=StringType())
    return df, df.count()


df, n = get_shingles("example_txt.txt")
df.show()
print(n)

+---------+
|    value|
+---------+
|WHEN A DI|
|HEN A DIS|
|EN A DIST|
|N A DISTI|
| A DISTIN|
|A DISTING|
| DISTINGU|
|DISTINGUI|
|ISTINGUIS|
|STINGUISH|
|TINGUISHE|
|INGUISHED|
|NGUISHED |
|GUISHED B|
|UISHED BU|
|ISHED BUT|
|SHED BUT |
|HED BUT E|
|ED BUT EL|
|D BUT ELD|
+---------+
only showing top 20 rows

151


In [23]:
df,n5 = get_shingles("grundgesetz.txt",5) 
df,n9 = get_shingles("grundgesetz.txt",9) 
#print(df.rdd.takeSample(False, 30))
print("Amount of 5-shingles",n5)
print("Amount of 9-shingles",n9)

Amount of 5-shingles 23830
Amount of 9-shingles 81060


To test on 10 different documents, we simply split the Grundgesetz into 10 parts.
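The splitting step itself is not shown in this notebook. A minimal sketch of one way to produce such parts, assuming consecutive character chunks of nearly equal length and the file names `grundgesetz_1.txt` to `grundgesetz_10.txt` read in the next cell; the helpers `split_text` and `write_parts` are hypothetical:

```python
import os

def split_text(text, parts=10):
    """Cut text into `parts` consecutive chunks of (almost) equal length."""
    size, rest = divmod(len(text), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < rest else 0)  # spread the remainder
        chunks.append(text[start:end])
        start = end
    return chunks

def write_parts(folder="Task6Ex2_documents", name="grundgesetz", parts=10):
    """Write name_1.txt ... name_<parts>.txt next to the original file."""
    with open(os.path.join(folder, name + ".txt"), "r") as f:
        text = f.read()
    for i, chunk in enumerate(split_text(text, parts), start=1):
        with open(os.path.join(folder, f"{name}_{i}.txt"), "w") as f:
            f.write(chunk)
```

Cutting at arbitrary character positions can split a word, which only affects the few shingles at the chunk borders.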

In [26]:

for i in range(10):
    for k in [5,9]:
        _,n = get_shingles("grundgesetz_"+str(i+1)+".txt",k) 
        print("Grundgesetz Part", i+1 , " Amount of ",k,"-shingles:",n)

Grundgesetz Part 1  Amount of  5 -shingles: 6827
Grundgesetz Part 1  Amount of  9 -shingles: 12873
Grundgesetz Part 2  Amount of  5 -shingles: 6337
Grundgesetz Part 2  Amount of  9 -shingles: 12203
Grundgesetz Part 3  Amount of  5 -shingles: 6309
Grundgesetz Part 3  Amount of  9 -shingles: 12140
Grundgesetz Part 4  Amount of  5 -shingles: 6650
Grundgesetz Part 4  Amount of  9 -shingles: 12297
Grundgesetz Part 5  Amount of  5 -shingles: 5632
Grundgesetz Part 5  Amount of  9 -shingles: 11023
Grundgesetz Part 6  Amount of  5 -shingles: 5975
Grundgesetz Part 6  Amount of  9 -shingles: 11708
Grundgesetz Part 7  Amount of  5 -shingles: 5628
Grundgesetz Part 7  Amount of  9 -shingles: 10812
Grundgesetz Part 8  Amount of  5 -shingles: 6080
Grundgesetz Part 8  Amount of  9 -shingles: 11539
Grundgesetz Part 9  Amount of  5 -shingles: 6020
Grundgesetz Part 9  Amount of  9 -shingles: 11443
Grundgesetz Part 10  Amount of  5 -shingles: 6223
Grundgesetz Part 10  Amount of  9 -shingles: 11503


# Exercise 3

**Task:** 

Figure 1 shows a table (or matrix) representing four sets S1, S2, S3 and S4 (subsets of
{0, 1, 2, 3, 4, 5}).

a) **Task:**

Compute the MinHash signature for each set using the following three hash functions:

h1(x) = 2x + 1 mod 6

h2(x) = 3x + 2 mod 6

h3(x) = 5x + 2 mod 6

In [3]:
import numpy as np

matrix = np.array([[0,0,1,0,0,1],[1,1,0,0,0,0],[0,0,0,1,1,0],[1,0,1,0,1,0]])

def h1(x):
    return (2*x + 1 )% 6
def h2(x):
    return (3*x + 2 )% 6
def h3(x):
    return (5*x + 2 )% 6

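def give_set(row):
    # Indices of the 1-entries, i.e. the elements of the set encoded by this row
    return np.nonzero(row)[0]

def minhash(matrix):
    # Each row of matrix encodes one set. The signature has one component per
    # hash function: the minimum hash value over the elements of the set.
    for i in range(4):
        elements = give_set(matrix[i])
        signature = [int(min(h(elements))) for h in (h1, h2, h3)]
        print("minhash( Set", i, "):", signature)

minhash(matrix)

minhash( Set 0 ): [5, 2, 0]
minhash( Set 1 ): [1, 2, 1]
minhash( Set 2 ): [1, 2, 4]
minhash( Set 3 ): [1, 2, 0]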


b) **Task:**

Which of these hash functions are true permutations? What collisions do occur in
the other hash functions? Name the corresponding inputs and outputs.

In [45]:
test =np.array([0,1,2,3,4,5])
print("input:",test)
print("output of hash function 1",h1(test))
print("output of hash function 2",h2(test))
print("output of hash function 3",h3(test))

input: [0 1 2 3 4 5]
output of hash function 1 [1 3 5 1 3 5]
output of hash function 2 [2 5 2 5 2 5]
output of hash function 3 [2 1 0 5 4 3]


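As we can see, only the third hash function is a true permutation; the other two are not bijective. h1 maps the inputs 0 and 3 to 1, the inputs 1 and 4 to 3, and the inputs 2 and 5 to 5. h2 maps the inputs 0, 2 and 4 to 2 and the inputs 1, 3 and 5 to 5.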
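The same check can be automated by grouping the inputs by their hash value. A small sketch, with the helper name `collisions` chosen for illustration; a function of the form ax + b mod 6 is a permutation exactly when a and 6 are coprime, which holds for a = 5 but not for a = 2 or a = 3:

```python
from collections import defaultdict

def collisions(h, domain=range(6)):
    """Map each output value that is hit more than once to its inputs."""
    preimages = defaultdict(list)
    for x in domain:
        preimages[h(x)].append(x)
    return {y: xs for y, xs in preimages.items() if len(xs) > 1}

hash_functions = {
    "h1": lambda x: (2 * x + 1) % 6,
    "h2": lambda x: (3 * x + 2) % 6,
    "h3": lambda x: (5 * x + 2) % 6,
}

for name, h in hash_functions.items():
    found = collisions(h)
    print(name, "is a permutation" if not found else f"collides: {found}")
```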

c) **Task:** 

Compare the similarity of the MinHash signatures against the corresponding Jaccard
similarities, for each of the 6 pairs of columns.



In [86]:
from itertools import permutations
PERMS = set(permutations( [0,1,2,3,4,5]))


def minhash2(vec,perm):
     
    return min(np.array(perm)[give_set(vec)])
    
def jaccard_sim(c1,c2):
    
    return np.sum(np.logical_and(c1,c2))/np.sum(np.logical_or(c1,c2))

def minhash_sim(c1,c2):
    prob = 0
    for i in PERMS:
        prob += minhash2(c1,i)==minhash2(c2,i)
    prob /= len(PERMS)
    return prob



for i in range(3):
    for j in range(1,4-i):
        
        print("column",i," and column",i+j," have minhash sim:", minhash_sim(matrix[i],matrix[i+j]))
        print("column",i," and column",i+j," have jaccard sim:", jaccard_sim(matrix[i],matrix[i+j]))

column 0  and column 1  have minhash sim: 0.0
column 0  and column 1  have jaccard sim: 0.0
column 0  and column 2  have minhash sim: 0.0
column 0  and column 2  have jaccard sim: 0.0
column 0  and column 3  have minhash sim: 0.25
column 0  and column 3  have jaccard sim: 0.25
column 1  and column 2  have minhash sim: 0.0
column 1  and column 2  have jaccard sim: 0.0
column 1  and column 3  have minhash sim: 0.25
column 1  and column 3  have jaccard sim: 0.25
column 2  and column 3  have minhash sim: 0.25
column 2  and column 3  have jaccard sim: 0.25


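Both similarity measures agree, as expected: taken over all 720 permutations of the six elements, the fraction of permutations under which two sets have the same MinHash value equals their Jaccard similarity.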
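The comparison above averages over all permutations. With only the three hash functions from part a), the signature similarity is the fraction of signature components on which two sets agree. A sketch of that comparison, assuming as before that each row of `matrix` encodes one set; the names `SETS`, `HASHES`, `signature` and `signature_sim` are ours:

```python
from itertools import combinations

# The four rows of `matrix` from part a), written out as sets of elements
SETS = [{2, 5}, {0, 1}, {3, 4}, {0, 2, 4}]
HASHES = [
    lambda x: (2 * x + 1) % 6,
    lambda x: (3 * x + 2) % 6,
    lambda x: (5 * x + 2) % 6,
]

def signature(s):
    """One minimum per hash function."""
    return [min(h(x) for x in s) for h in HASHES]

def signature_sim(sig1, sig2):
    """Fraction of components on which two signatures agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

for i, j in combinations(range(len(SETS)), 2):
    sim = signature_sim(signature(SETS[i]), signature(SETS[j]))
    print(f"sets {i} and {j}: signature sim = {sim:.2f}, Jaccard = {jaccard(SETS[i], SETS[j]):.2f}")
```

With three hash functions, two of which are not permutations, the estimate is coarse: h2 yields the minimum 2 for every one of the four sets, so every pair agrees on at least one component even when its Jaccard similarity is 0.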

# Exercise 4

**Task:** Recall the concepts of Shingling and MinHash signatures to perform the following tasks.
Submit as your solution your source code, results, and logs of runs. You do not need to
take care of the scalability of your code, e.g., assume that all input/output data fit into
RAM.

a) **Task:**

Implement a routine in Python to compute a representation of a string of decimal
digits (0...9) as a set of k-shingles. The input of your routine is a string of digits
and k. The output is an ordered list of positions of 1’s in a (virtual) Boolean
representation of a set of k-shingles as outlined in Lecture 7 (see slide “From Sets
to Boolean Matrices”). The position of a k-shingle x (of digits) in the Boolean
vector is x interpreted as an integer. For example, shingles “0...00” and “0...2024”
would map to (decimal) positions 0 and 2024, respectively. Moreover, for a string
“1234567” and k = 4 your routine should output the list [1234, 2345, 3456, 4567].
Hint: You can use Python’s data structure set() (or as alternative dict()) to
need just one pass through the input string plus outputting the positions in an
ordered fashion.

b) **Task:** 

Run your implementation from a) on the first 10000 digits of π after comma using
k = 12. Save the output list as a text file with one position (list element) per line,
and submit it as a part of your solution.

In [8]:
def get_shingles_dec(txt, k):
    # Collect each k-shingle as an integer position in one pass. int() drops
    # leading zeros, so a shingle such as 000000002024 maps to position 2024.
    # The + 1 makes sure the final shingle is included.
    positions = {int(txt[i:i+k]) for i in range(len(txt) - k + 1)}
    return np.sort(list(positions))

# Example from the task statement
assert list(get_shingles_dec("1234567", 4)) == [1234, 2345, 3456, 4567]

with open("pi.txt", 'r') as file:
    # Strip all whitespace (including a trailing newline) and the decimal point
    txt = "".join(file.read().split()).replace(".", "")

shingles = get_shingles_dec(txt, 12)

print(shingles)

with open('pi_shingles.txt', 'w') as fp:
    for item in shingles:
        # write each position on its own line
        fp.write("%s\n" % item)
    print('Done')


[   313783875    407854733    422966171 ... 999983729780 999998372978
 999999837297]
Done


Note: our input file pi.txt contains only 9999 digits of π, so the result covers 9999 instead of 10000 digits.


c) **Task:** 

Implement (in Python) the algorithm for MinHash signatures as described in the
slides “Implementation /*” of Lecture 7. We simplify here and assume only one
column C representing one document/string. Thus, your algorithm shall use as
input a single list of positions of 1s in a (virtual) Boolean vector described in a).
Run your implementation on the list of positions obtained in b) using 5 hash
functions, specified as follows:

In [105]:
import numpy as np
import random

N = 10**12  # modulus: 12-digit shingles map to positions 0 .. 10**12 - 1

# Coefficients of the 5 hash functions h_j(x) = ((a_j*x + b_j) mod p_j) mod N + 1.
# The first function is fixed; the other four draw random 40-bit values for a and b.
A = np.array([37] + [random.getrandbits(40) for _ in range(4)], dtype=np.longdouble)
B = np.array([126] + [random.getrandbits(40) for _ in range(4)], dtype=np.longdouble)
P = 10.0**15 + np.array([223, 37, 91, 159, 187])

def h(x, a=37, b=126, p=(10**15 + 223)):
    return np.mod(np.mod(a*x + b, p), N) + 1

# Running minimum per hash function; the signature is the vector of minima
arr = np.full(5, np.inf)
for x in shingles:
    for j in range(5):
        res = h(x, A[j], B[j], P[j])
        if res < arr[j]:
            arr[j] = res
arr = arr.astype(np.int64)  # astype returns a new array, so keep the result

print("Signature:", arr)



Signature: [ 24916664 104714586  18147725  31807212  25875893]


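To avoid int64 overflow in a*x + b (up to roughly 10^24 for a 40-bit a and a 12-digit x), we used np.longdouble. On common x86 platforms np.longdouble has a 64-bit significand, so integers above 2^64 ≈ 1.8·10^19 are rounded; the values for the four random hash functions may therefore be affected by rounding. Python's built-in integers have arbitrary precision and would give exact results.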
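Since Python's built-in integers have arbitrary precision, the hash values can also be computed exactly without floating point. A minimal sketch of that alternative, reusing the constants from the cell above and drawing the random coefficients with the same `random.getrandbits(40)` call; the helper names `make_hash` and `minhash_signature` are our own:

```python
import random

N = 10**12
MODULI = [10**15 + c for c in (223, 37, 91, 159, 187)]  # same values as P above

def make_hash(a, b, p, n=N):
    """Return h(x) = ((a*x + b) mod p) mod n + 1, evaluated with exact ints."""
    def h(x):
        return ((a * int(x) + b) % p) % n + 1  # int() also accepts NumPy integers
    return h

def minhash_signature(positions, hash_funcs):
    """One pass over the positions, keeping a running minimum per hash function."""
    signature = [None] * len(hash_funcs)
    for x in positions:
        for j, h in enumerate(hash_funcs):
            value = h(x)
            if signature[j] is None or value < signature[j]:
                signature[j] = value
    return signature

# First function fixed (a=37, b=126); the other four use random 40-bit coefficients
coefficients = [(37, 126)] + [
    (random.getrandbits(40), random.getrandbits(40)) for _ in range(4)
]
hash_funcs = [make_hash(a, b, p) for (a, b), p in zip(coefficients, MODULI)]

print(minhash_signature([1234, 2345, 3456, 4567], hash_funcs))
```

Calling `minhash_signature(shingles, hash_funcs)` on the positions from b) would give the signature of the π digits. Only the first component is reproducible between runs, because the other coefficients are random.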