In [6]:
import numpy as np

In [24]:
def isqrt(n):
    """Integer square root floor(sqrt(n)) for n >= 0, via Newton's method on integers."""
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x

# Prerequisites

- Probability theory (for the main idea)
- Hashes (an application)

# Theory

## Birthday paradox

- https://www.youtube.com/watch?v=ofTb57aZHZs

### Birthday version

**Question 1**
- What is the probability that at least 1 of $n$ other people has the same birthday as you?

*Solution*
- Let $A$ be the event that at least one of the $n$ people has the same birthday as you and $\bar{A}$ be the complementary event (nobody does)
    - Complementary events => $Pr(A) = 1 - Pr(\bar{A})$

- Let $E_i$ be the event that person $i$ does not have your birthday (the $E_i$ are independent, each with probability $\dfrac {364} {365}$)

Then 
- $Pr(A) = 1 - Pr(\bar{A}) = 1 - \prod_{i=1}^n Pr(E_i) = 1 - \left( \dfrac {364} {365}\right)^n$
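The formula can be inverted to ask how many people are needed before someone matches your birthday with a given probability. The sketch below assumes the same uniform, independent model; the function name `people_needed_to_match_you` is made up for illustration.

```python
import math

def people_needed_to_match_you(p, d=365):
    """Smallest n with 1 - ((d-1)/d)**n >= p, for 0 < p < 1."""
    # Solve ((d-1)/d)**n <= 1 - p for n by taking logs (both logs are negative).
    return math.ceil(math.log(1 - p) / math.log((d - 1) / d))

print(people_needed_to_match_you(0.5))  # 253
```

So roughly 253 people are needed for a 50% chance that someone shares *your* birthday, far more than in the "any two people" version of the question.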

**Question 2**
- What is the probability that at least 2 out of $n$ people in a room share the same birthday?
- **Suppose the birthdays are distributed independently and uniformly**

*Solution*
- Let $A$ be the event that at least 2 people have the same birthday, let $\bar{A}$ be the complementary event (no 2 people have the same birthday)
    
    
- Event 1 = Person 1 is born => $Pr(E_1) = \dfrac {365} {365}$
- Event 2 = Person 2 is born on a different day than Person 1 => $Pr(E_2) = \dfrac {364} {365}$  
$\vdots$
- Event n = Person n is born on a different day than Person $1,...,n-1$ => $Pr(E_n) = \dfrac {365-(n-1)} {365}$  

$Pr(\bar{A}) = Pr(E_1) \cdot Pr(E_2) \cdot \dots \cdot Pr(E_n) = \dfrac {365} {365} \cdot \dfrac {364} {365} \cdot \dots \cdot \dfrac {365-(n-1)} {365} = \left( \dfrac {1} {365} \right) ^{n} \cdot \dfrac {365!} {(365-n)!} = \prod_{i=1}^{n-1} \left(1 - \dfrac i {365}\right)$

Hence $Pr(A) = 1 - Pr(\bar{A}) = 1 - \prod_{i=1}^{n-1} \left(1 - \dfrac i {365}\right)$
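As a sanity check, the product formula can be compared against a small Monte Carlo simulation. This is only a sketch under the uniform, independent assumption; the helper names, the number of trials and the seed are arbitrary choices, not part of the original notebook.

```python
import random

def simulate_shared_birthday(n, d=365, trials=20_000, seed=0):
    """Estimate Pr(at least two of n people share a birthday) by sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        days = [rng.randrange(d) for _ in range(n)]
        if len(set(days)) < n:  # a repeated day means a shared birthday
            hits += 1
    return hits / trials

def exact_shared_birthday(n, d=365):
    """1 - prod_{i=1}^{n-1} (1 - i/d), the formula derived above."""
    p_distinct = 1.0
    for i in range(1, n):
        p_distinct *= 1 - i / d
    return 1 - p_distinct

estimate = simulate_shared_birthday(23)
print(estimate, exact_shared_birthday(23))
```

The estimate should land within a percentage point or so of the exact value, with the gap shrinking as `trials` grows.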


    

### General case

**Question 1**
- Instead of $365$ days we have $d$ => $\boxed{1 - \left( \dfrac {d-1} {d}\right)^n}$

**Question 2**
- Instead of $365$ days we have $d$ => $\boxed{1 - \prod_{i=1}^{n-1} \left(1 - \dfrac i {d}\right)}$
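For small $d$ and $n$ both boxed formulas can be verified exactly by enumerating all $d^n$ equally likely birthday assignments. The sketch below uses exact fractions; "your" birthday is arbitrarily taken to be day `0`, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def brute_force(n, d):
    """Return (Pr[someone is born on day 0], Pr[some two people share a day])."""
    total = d ** n
    match_you = 0
    shared = 0
    for days in product(range(d), repeat=n):
        if 0 in days:
            match_you += 1
        if len(set(days)) < n:
            shared += 1
    return Fraction(match_you, total), Fraction(shared, total)

def formula_q1(n, d):
    return 1 - Fraction(d - 1, d) ** n

def formula_q2(n, d):
    p_distinct = Fraction(1)
    for i in range(1, n):
        p_distinct *= 1 - Fraction(i, d)
    return 1 - p_distinct

print(brute_force(3, 5))                 # (Fraction(61, 125), Fraction(13, 25))
print(formula_q1(3, 5), formula_q2(3, 5))  # 61/125 13/25
```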

#### Code

In [1]:
def my_birthday(n, d):
    return 1 - pow((d-1)/d , n)

def same_birthday(n, d):
    p = 1
    for i in range(1, n): #1 -> n-1
        p*=(1-i/d)
    return 1 - p

In [2]:
same_birthday(23, 365), same_birthday(32, 365), same_birthday(100, 365)

(0.5072972343239854, 0.7533475278503207, 0.9999996927510721)

In [3]:
my_birthday(23, 365), my_birthday(32, 365), my_birthday(100, 365)

(0.06115058190745448, 0.08404821326682732, 0.23993292618409912)

### Approximations

From the Taylor expansion we know $e^x = 1 + x + \dfrac {x^2} {2!} + \dots => e^x\approx 1 + x$ for $|x| \ll 1$

Apply this to each factor with $x = -i/d => e^{-i/d} \approx 1- \dfrac i d => Pr(A) \approx 1 - \prod_{i=1}^{n-1}e^{-i/d} = 1-e^{-\frac {n(n-1)} {2d}} \approx \boxed{1-e^{-\frac {n^2} {2d}}}$ (using $\sum_{i=1}^{n-1} i = \dfrac {n(n-1)} 2$ and $n(n-1) \approx n^2$; the approximation is good when $n \ll d$)

If we want to solve for $n$ knowing $Pr(A)$ we take the log => $\boxed{n \approx \sqrt{2d \ln \left(\dfrac 1 {1-Pr(A)}\right)}}$
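To see how good the square-root estimate is, one can compare it with the smallest $n$ for which the exact product formula reaches the target probability. This is a minimal sketch; the helper names are made up for the example.

```python
import math

def exact_same_birthday(n, d):
    """Exact collision probability 1 - prod_{i=1}^{n-1} (1 - i/d)."""
    p_distinct = 1.0
    for i in range(1, n):
        p_distinct *= 1 - i / d
    return 1 - p_distinct

def smallest_n_exact(p, d):
    """Smallest n whose exact collision probability is at least p (0 < p <= 1)."""
    n = 1
    while exact_same_birthday(n, d) < p:
        n += 1
    return n

def n_estimate(p, d):
    """The boxed approximation n ~ sqrt(2 d ln(1 / (1 - p)))."""
    return math.sqrt(2 * d * math.log(1 / (1 - p)))

print(smallest_n_exact(0.5, 365), n_estimate(0.5, 365))  # 23 and about 22.49
```

The estimate slightly undershoots the exact answer here, which is consistent with having replaced $n(n-1)$ by $n^2$ in the exponent.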


#### Code

In [7]:
def approx_same_birthday(n, d):
    return 1 - pow(np.e, -pow(n, 2) / (2*d))

def n_given_prob(p, d):
    return np.sqrt(2 * d * np.log(1 / (1-p)))

In [8]:
approx_same_birthday(23, 365), approx_same_birthday(32, 365), approx_same_birthday(100, 365)

(0.5155095380615168, 0.7540777195328239, 0.9999988760149834)

In [9]:
n_given_prob(.5, 365), n_given_prob(.75, 365), n_given_prob(.999999, 365)

(22.49438689559598, 31.811867025019456, 100.42570740250191)

### Balls version

**Collision Theorem**:
An urn contains $N$ balls, of which $n$ are red and $N-n$ are blue.  
Bob samples with replacement until he has $m$ balls

* The probability that Bob selects at least one red ball:  $Pr(\text{at least one red})=1−\left(1−\cfrac n N\right)^m$

* A lower bound for the probability $Pr(\text{at least one red})≥1−e^{−mn/N}$

If $N$ is large and if $m$ and $n$ are not too much larger than $\sqrt N$ (e.g. $m, n <10\sqrt N$), the lower bound is almost an equality

**Proof**

For the first point
- Let $A$ be the event that Bob selects at least 1 red ball in $m$ attempts and $\bar{A}$ the complementary event (all $m$ choices are blue)
- Let $E_i$ be the event that the $i$-th ball is blue; the draws are independent because Bob samples with replacement
- $Pr(A) = 1-Pr(\bar{A}) = 1-\prod_{i=1}^m Pr(E_i) = 1-\prod_{i=1}^m \left(\dfrac {N-n} N\right) = 1-\left(\dfrac {N-n} N\right)^m$

For the second point
- $e^{-x} \geq 1-x \ \ \forall x \in \mathbb{R}$
- Set $x = n/N => 1-\left(\dfrac {N-n} N\right)^m \geq 1 - (e^{-n/N})^m = 1 - e^{-mn/N}$
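The inequality $e^{-x} \geq 1-x$ used in the second point can be justified with a short calculus argument (a sketch added for completeness):

$$f(x) = e^{-x} - (1 - x), \qquad f'(x) = 1 - e^{-x}, \qquad f(0) = 0$$

Since $f'(x) < 0$ for $x < 0$ and $f'(x) > 0$ for $x > 0$, $f$ attains its minimum at $x = 0$, so $f(x) \geq 0$ for all $x \in \mathbb{R}$. With $x = n/N \in [0, 1]$ both sides of $1 - n/N \leq e^{-n/N}$ are non-negative, so raising them to the $m$-th power preserves the inequality.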

#### Code

In [10]:
def pr_at_least_one_red(n, m, N):
    return 1 - pow((1 - n/N), m)


def approx_pr_at_least_one_red(n, m, N):
    return 1 - pow(np.e, (-m*n/N))

A deck of cards is shuffled and eight cards are dealt face up. Bob then takes a second deck of cards and chooses eight cards at random, replacing each chosen card before making the next choice. What is Bob’s probability of matching at least one of the cards from the first deck?

In [11]:
N = 52
n = 8
m = 8
print(pr_at_least_one_red(n,m,N))
print(approx_pr_at_least_one_red(n, m, N))

0.7372185753440565
0.7079321763085858


In [12]:
n = 10; m = 5
print(pr_at_least_one_red(n,m,N))
print(approx_pr_at_least_one_red(n, m, N))

0.6562602681709593
0.6176957271079193


In [13]:
N = 100000
n = 1000
m = 1000
print(pr_at_least_one_red(n,m,N))
print(approx_pr_at_least_one_red(n, m, N))

0.9999568287525893
0.9999546000702375


## Application to hashes

### Collision 

- Let $H:\mathcal{M} \longrightarrow \mathcal{T}$ be a hash function with $|\mathcal{M}| \gg |\mathcal{T}|$
- Let's denote $N = |\mathcal{T}|$

**Algorithm**
1. Choose $s \approx \sqrt{N}$ random distinct messages $m_1, \dots, m_s$ in $\mathcal{M}$
2. Compute $t_i = H(m_i)$  for $1\leq i \leq s$
3. Look for a pair $i \neq j$ with $t_i = t_j$ -> If not found go to step 1



**How well would this work**

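With $s = \sqrt N$ samples, the approximation $1 - e^{-s^2/(2N)}$ from the birthday paradox gives a collision probability of $1 - e^{-1/2} \approx 0.39$; with $s \approx 1.18\sqrt N$ it is about $1/2$ => on average we would need to run the algorithm about twice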
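The trade-off between the number of samples and the number of rounds can be tabulated with the approximation from the birthday paradox section. In this sketch `c` is the multiplier in $s = c\sqrt N$, and the rounds are assumed independent, so the expected number of rounds is $1/p$ (geometric distribution); the function names are illustrative.

```python
import math

def collision_probability(c):
    """Approximate success probability of one round with s = c * sqrt(N) samples."""
    return 1 - math.exp(-c * c / 2)

def expected_rounds(c):
    """Expected number of independent rounds until the first collision."""
    return 1 / collision_probability(c)

for c in (1.0, math.sqrt(2 * math.log(2)), 2.0):
    print(f"c = {c:.3f}  p = {collision_probability(c):.3f}  rounds = {expected_rounds(c):.2f}")
```

For `c = sqrt(2 ln 2)`, roughly 1.18, the success probability is exactly one half under this approximation, which is where the "about two rounds" figure comes from.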

Running time
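- $\mathcal{O}(\sqrt N)$ hash evaluations per round; finding the matching pair is also $\mathcal{O}(\sqrt N)$ in expectation if the tags are stored in a hash table (a naive pairwise comparison costs $\mathcal{O}(N)$)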

Space
- $\mathcal{O}(\sqrt N)$
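The $\mathcal{O}(\sqrt N)$ bounds rely on detecting a repeated tag in constant expected time, for example with a dictionary. The following is a minimal sketch of that idea, not the notebook's implementation: `toy_hash` is a hypothetical stand-in for $H$ (the low 20 bits of SHA-256), and the message space, sample count and seed are arbitrary.

```python
import hashlib
import random

def toy_hash(m, hash_bits=20):
    """Hypothetical toy hash: low hash_bits bits of SHA-256 of m as 8 big-endian bytes."""
    digest = hashlib.sha256(m.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % (1 << hash_bits)

def one_round(hash_fn, message_space, num_samples, rng):
    """One pass of the algorithm; returns (m1, m2, tag) or None."""
    seen = {}  # tag -> first message that produced it
    for _ in range(num_samples):
        m = rng.randrange(message_space)
        t = hash_fn(m)
        if t in seen and seen[t] != m:  # same tag, different message
            return seen[t], m, t
        seen[t] = m
    return None

def birthday_attack(hash_fn, message_space, num_samples, rng, max_rounds=100):
    """Repeat rounds until a collision shows up (or max_rounds is exhausted)."""
    for _ in range(max_rounds):
        found = one_round(hash_fn, message_space, num_samples, rng)
        if found is not None:
            return found
    return None

rng = random.Random(1)  # fixed seed only to make the sketch repeatable
result = birthday_attack(toy_hash, 2**40, 1230, rng)
print(result)
```

Unlike the three-step description, this variant checks for a collision while sampling and stops at the first hit, which does not change the asymptotic cost.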


#### Code
- We search for a collision on SHA-256 truncated to a few bits ($11$ bits in the exploratory cells, $20$ bits in the final run)

In [16]:
import hashlib
from Crypto.Random.random import getrandbits
import random
from Crypto.Util.number import long_to_bytes, bytes_to_long

In [17]:
getrandbits(11)

1127

In [18]:
m = getrandbits(11)
t = hashlib.sha256(long_to_bytes(m))

In [19]:
t.hexdigest()

'fd15a0791cc2203277a715e087fdca5190f5f13795444166990c5685011fc59c'

In [20]:
bin(bytes_to_long(t.digest()))[2:2 + 11]

'11111101000'

In [21]:
int(bin(int(t.hexdigest(), 16))[2:2+11], 2)

2024

In [26]:
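def small_hash(m, hash_bits):
    """Toy hash: SHA-256 of m truncated to hash_bits bits."""
    t = hashlib.sha256(long_to_bytes(m)).hexdigest()  # the hash as a hex string
    # Caveat: bin() drops leading zeros, so the slice always starts with a 1 bit.
    # Only about half of the 2**hash_bits values can occur, so the collision
    # probability printed by small_hash_colision is an underestimate.
    t = bin(int(t, 16))[2:2 + hash_bits]
    t = int(t, 2)
    return t

def small_hash_colision(M_dim, hash_bits):
    """Sample about 1.2 * sqrt(N) distinct messages and look for two equal hashes."""
    N = 1 << hash_bits
    print('Hash size: ', N)
    num_samples = 1 * isqrt(N)
    num_samples += num_samples // 5 + 1  # num_samples ~ 1.2 * isqrt(N) + 1
    print(f'Making a list of {num_samples} hashes')
    print(f'Probability of finding a collision is {same_birthday(num_samples, N)}')
    m_list = []
    t_list = []
    for i in range(num_samples):
        m = random.randint(0, M_dim - 1)
        if m not in m_list:  # keep the messages distinct
            m_list.append(m)
            t_list.append(small_hash(m, hash_bits))

    # Compare every pair of tags
    for i in range(len(t_list)):
        for j in range(i + 1, len(t_list)):
            if t_list[i] == t_list[j]:
                print('Collision found!')
                return m_list[i], m_list[j], t_list[i]

    print('Collision not found :(')
    return -1, -1, -1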

In [34]:
bit_range = 20
M_dim = 10000 * pow(2, bit_range)
m1, m2, t = small_hash_colision(M_dim, bit_range)
print(m1, m2, t)
print(small_hash(m1, bit_range) == small_hash(m2, bit_range))

Hash size:  1048576
Making a list of 1229 hashes
Probability of finding a collision is 0.5132134608547976
Collision found!
2751479053 6226435068 850509
True
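One subtlety of truncating through `bin(...)` as `small_hash` does: `bin` omits leading zeros, so the first `hash_bits` characters always start with a `1` and only the upper half of the range is reachable. The sketch below contrasts that slicing with a shift-based truncation that keeps the true top bits. Both helper names are made up for the comparison, and the digests come from hashing small counters rather than the notebook's messages.

```python
import hashlib

def truncate_via_bin(digest, hash_bits):
    """Mirror of the string slicing used in small_hash."""
    return int(bin(int.from_bytes(digest, "big"))[2:2 + hash_bits], 2)

def truncate_via_shift(digest, hash_bits):
    """Top hash_bits bits of the digest, leading zeros included."""
    return int.from_bytes(digest, "big") >> (8 * len(digest) - hash_bits)

hash_bits = 11
digests = [hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2000)]
via_bin = [truncate_via_bin(d, hash_bits) for d in digests]
via_shift = [truncate_via_shift(d, hash_bits) for d in digests]

half = 1 << (hash_bits - 1)
print(min(via_bin) >= half)   # the bin-based values never fall below 2**(hash_bits-1)
print(min(via_shift) < half)  # the shift-based values use the lower half as well
```

With the shift-based truncation the hash really has $N = 2^{\text{hash\_bits}}$ possible values, so the probability reported by `same_birthday(num_samples, N)` would match the experiment more closely.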


# Resources

- https://en.wikipedia.org/wiki/Birthday_problem
- https://en.wikipedia.org/wiki/Birthday_attack