# Reservoir Sampling

#### Umang Bhatt ECE '19

## Introduction
This tutorial will give you a taste of a family of randomized algorithms called reservoir sampling, which is commonly used to pick a fixed number of items from a dataset of unknown size. Before we jump into the algorithm itself, let's look at a simple scenario.

### Simple Scenario
Imagine you have a pool of 100 people (N = 100). You want to randomly select 20 of them (k = 20) to come with you on a weekend trip to Frank Lloyd Wright's Fallingwater. The simple and intuitive way to do this is to generate k random numbers between 0 and N-1, get the name at each corresponding index, and then invite those people and enjoy your trip. Note: each index must be unique - you can't clone a person and take them on the trip. Let's quickly take a look at how this would work, assuming N is small (10000 will do for this quick example).

In [91]:
import string
import pandas as pd
import time
import random
import csv
import math

In [92]:
#Let's generate N faux names: A, B, ..., Z, AA, BB, ..., ZZ, AAA, and so on.
def simpleNameGenerator(N):
    result = []
    letters = string.ascii_uppercase
    length = len(letters)
    for i in range(N):
        index = i % length
        multiplier  = i // length + 1
        toAdd = str(letters[index])*multiplier
        result += [toAdd]
    assert (len(result) == N)
    return result

ourNames = simpleNameGenerator(10000)
# print(ourNames)

In [93]:
#From my N friends, pick k of them
def pickMyFriends(myList, N, k):
    result = []
    seen = set([])
    count = 0
    while(count < k):
        index = random.randint(0, N-1)
        if index not in seen:
            seen.add(index)
            result += [myList[index]]
            count += 1
    return result

start_time = time.time()
ourFriends = pickMyFriends(ourNames, 10000, 20)
print("--- %s seconds ---" % (time.time() - start_time))
print(ourFriends)

--- 0.000159978866577 seconds ---
['XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', 'DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD', 'LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL', 'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF', 'UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
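
For reference, Python's standard library already covers this known-N case: `random.sample` draws k distinct elements without replacement. A minimal sketch, using a small stand-in list (`names` is a hypothetical placeholder, not the tutorial's `ourNames`):

```python
import random

# Hypothetical stand-in for the list of names used in this tutorial.
names = ["person_%d" % i for i in range(100)]

# random.sample picks k distinct positions, so nobody is picked twice.
friends = random.sample(names, 20)
print(friends)
```

Like `pickMyFriends`, this only works when the whole list, and therefore N, is available up front.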

## Tweaked Scenario
Again, imagine you have a pool of people. You want to randomly select k of them (k = 20) to come with you on a weekend trip to Frank Lloyd Wright's Fallingwater. However, now you have absolutely no idea how large your pool of friends is (N). N could equal 4 or it could equal 999,000,000,000. The naive solution would be to iterate through the given list (of the pool of people) to calculate N and then do as we did above (i.e. generate k random numbers between 0 and N-1, get the name at every corresponding, unique index, and so on). Let's see how this would pan out on the list we used above. Let's assume for the sake of this problem that Python's len() function cannot be employed.

In [94]:
#Find N and then pick k from N
def pickMyFriendsNoN(myList, k):
    result = []
    seen = set([])
    count = 0
    array = myList
    N = 0
    while(array != []):
        N += 1
        array = array[1:]
    while(count < k):
        index = random.randint(0, N-1)
        if index not in seen:
            seen.add(index)
            result += [myList[index]]
            count += 1
    return result

start_time = time.time()
ourFriends = pickMyFriendsNoN(ourNames, 20)
print("--- %s seconds ---" % (time.time() - start_time))
print(ourFriends)

--- 0.362530946732 seconds ---
['VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV', 'UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU', 'EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE', 'VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV', 'SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS', 'QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ', 'XXXXXXXXXXXXXXXXXXXXXXXXXXXX', 'UUUUUUUUUUUUUUU', 'RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR', 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB', 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
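
Much of the slowdown above comes from `array = array[1:]`, which copies the rest of the list on every step, so counting alone takes time proportional to N^2. A sketch of counting in a single linear pass instead (the helper name `count_items` is made up for illustration):

```python
def count_items(iterable):
    """Count elements by walking the iterable once, without len() or slicing."""
    n = 0
    for _ in iterable:
        n += 1
    return n

print(count_items(["A", "B", "C", "D"]))  # 4
```

Even with a linear count, this approach still needs two passes over the data, and a true stream can only be read once.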

## The Algorithm
Observe that the latter code, where we calculated the length (N) of our list ourselves, is orders of magnitude slower than the former code, where N was provided to us. We should not need to run through the list twice in the tweaked problem. There has to be a faster way of generating k random samples from a pool, or reservoir, of unknown size. Cue the algorithm!

### The Concept

Reservoir sampling allows us to solve this tweaked problem in a single O(n) pass over the data. By contrast, the length-counting code above takes O(n^2) time, because every `array[1:]` slice copies the rest of the list, and even a well-written count would still need a second pass. It seems intuitive to want to iterate through the pool (also known as the stream) just once to create our reservoir of size k. More often than not, this algorithm is used when the stream we are sampling from is too large to store in memory, so iterating more than once through a list that could have millions and millions of data points would be grossly inefficient - we will discuss practical uses in a bit. But first, how do we implement this algorithm in O(n) time?

### The Specifications

1. To start, we fill a reservoir with the first k samples from our pool (also known as the stream).
2. Next, we iterate through the rest of the list once, from index k to N-1 (that is, from the (k+1)th element to the last).
3. For each element at index j, we generate a random integer between 0 and j inclusive - if that integer is less than k, we replace the reservoir element at that index with the current element of our iteration.


In [95]:
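#Fill the reservoir with the first k elements, then randomly replace them as we scan the rest
def reservoirSampling(myList, k):
    reservoir = []
    for index in range(k):
        # append keeps each name whole; += would split a string into characters
        reservoir.append(myList[index])
    # start at index k so the (k+1)th element is not skipped
    for j in range(k, len(myList)):
        i = random.randint(0, j)  # inclusive bounds: i < k with probability k/(j+1)
        if i < k:
            reservoir[i] = myList[j]
    return reservoir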
        
start_time = time.time()
ourFriends = reservoirSampling(ourNames, 20)
print("--- %s seconds ---" % (time.time() - start_time))
print(ourFriends)    

--- 0.0368258953094 seconds ---
['DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD', 'ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ', 'RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR', 'LLLLLLL', 'JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
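
The version above still calls `len(myList)`, which a real stream would not offer. As a sketch of the same idea over any iterable of unknown length (the function name `reservoir_sample_stream` and the generator of squares are illustrative assumptions):

```python
import itertools
import random

def reservoir_sample_stream(stream, k):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    iterator = iter(stream)
    # Fill the reservoir with the first k items (fewer if the stream is short).
    reservoir = list(itertools.islice(iterator, k))
    # Keep numbering the remaining items from position k (0-based).
    for j, item in enumerate(iterator, start=k):
        i = random.randint(0, j)  # inclusive on both ends
        if i < k:
            reservoir[i] = item
    return reservoir

# A generator has no len(), so it mimics a stream of unknown size.
sample = reservoir_sample_stream((n * n for n in range(10000)), 20)
print(sample)
```

Only the reservoir itself is held in memory, so the cost is O(k) space regardless of how long the stream turns out to be.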

### The Why
Suppose that after seeing n elements (n >= k), each of them is in the reservoir with probability k/n. This holds trivially at n = k, when the reservoir simply holds the first k elements. When element n+1 arrives, it enters the reservoir with probability k/(n+1), and if it does, it evicts one of the k current occupants chosen uniformly. Each occupant is therefore evicted with probability (k/(n+1))*(1/k) = 1/(n+1), which means it survives with probability n/(n+1). For an element to be in the reservoir after n+1 elements, it either has to be the new element and be chosen (probability k/(n+1)), or it has to have been in the reservoir already and not be replaced ((k/n)*(n/(n+1)) = k/(n+1)). Either way the probability is k/(n+1), so by induction every element ends up in the final sample with the same probability, k/N!
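
The argument can also be checked empirically. The sketch below is an illustration rather than a proof: it runs a basic list-based reservoir sampler many times and prints how often each item is selected. The values n = 10, k = 3 and 20000 trials are arbitrary choices.

```python
import random
from collections import Counter

def reservoir_sample(items, k):
    """Basic reservoir sampling over a list."""
    reservoir = list(items[:k])
    for j in range(k, len(items)):
        i = random.randint(0, j)
        if i < k:
            reservoir[i] = items[j]
    return reservoir

n, k, trials = 10, 3, 20000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(list(range(n)), k))

# Each item should appear in roughly k/n = 30% of the trials.
for item in range(n):
    print(item, round(counts[item] / trials, 3))
```

If the early items or the late items were favored, their frequencies would drift away from 0.3.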



## Weighted Reservoir Sampling
Let's imagine a similar scenario except we have a corresponding list of weights associated with each item in the list. We prefer certain items to be in our reservoir over others. We adjust the algorithm to be as follows. 

### The Specifications

1. To start, we again fill a reservoir with the first k samples from our pool (also known as the stream). This time, though, we sum their weights as we go, dividing each weight by k before adding it to ensure proportionality.
2. Next, we iterate through the rest of the list once, from index k to N-1.
3. We add the weight of the current element (again divided by k) to the sum. The probability of replacement is the current item's weight over this sum.
4. Then we replace the element at a randomly picked index between 0 and k-1 with the current element of our iteration if and only if a randomly generated number between 0 and 1 is less than the probability we just calculated!

In [96]:
#Let's generate N random weights
def weightGenerator(N):
    return [random.random() for i in range(N)]

# list() keeps the pairs indexable on Python 3, where zip returns an iterator
weightedNames = list(zip(ourNames, weightGenerator(10000)))
# print(weightedNames)

In [97]:
#As you fill the reservoir to start, add the weights proportionally. Then, based on the current weight, decide
#whether to add the current element to the reservoir
def weightedReservoirSampling(myList, k):
    reservoir = []
    totalWeight = 0
    for index in range(k):
        # append keeps each name whole; += would split a string into characters
        reservoir.append(myList[index][0])
        totalWeight += (myList[index][1]/k)
    # start at index k so the (k+1)th element is not skipped
    for j in range(k, len(myList)):
        totalWeight += (myList[j][1]/k)
        prob = myList[j][1] / totalWeight
        check = random.random()
        if check < prob:
            index = random.randint(0,k-1)
            reservoir[index] = myList[j][0]
    return reservoir

start_time = time.time()
ourFriends = weightedReservoirSampling(weightedNames, 20)
print("--- %s seconds ---" % (time.time() - start_time))
print(ourFriends)    

--- 0.0135481357574 seconds ---
['AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 'RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR', 'SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS', 'KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK', 'YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

### An Aside
<b>Distributed Random Sampling</b> is a popular technique used to perform reservoir sampling across several machines. Sometimes our stream of data is too large to be stored in memory and would take too long to process on one machine. At that point, we can decide to distribute the data amongst m machines, each of which does its own weighted sampling of k elements as described above. We amalgamate the k samples from each, and then regenerate keys on one machine to pick the final k elements.
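
The "keys" mentioned here come from a key-based style of weighted sampling that this tutorial does not implement. As a hedged sketch, one well-known scheme of that kind (A-Res, by Efraimidis and Spirakis) gives each item the key u^(1/w), where u is uniform in [0, 1) and w is the item's weight, and keeps the k items with the largest keys. The version below keeps each shard's keys and merges them instead of regenerating them, which is a simplification of the description above. The names `keyed_sample`, `merge_samples` and `shards` are hypothetical.

```python
import heapq
import random

def keyed_sample(weighted_items, k):
    """Return the k (key, item) pairs with the largest keys from one shard."""
    heap = []  # min-heap holding the k largest keys seen so far
    for order, (item, weight) in enumerate(weighted_items):
        if weight <= 0:
            continue  # items with non-positive weight are never selected
        key = random.random() ** (1.0 / weight)
        entry = (key, order, item)  # order breaks ties so items are never compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [(key, item) for key, order, item in heap]

def merge_samples(shard_samples, k):
    """Combine per-shard samples and keep the k largest keys overall."""
    combined = [pair for sample in shard_samples for pair in sample]
    best = heapq.nlargest(k, combined, key=lambda pair: pair[0])
    return [item for key, item in best]

# Three hypothetical "machines", each holding part of the stream as (item, weight) pairs.
shards = [
    [("item_%d_%d" % (m, i), random.random() + 0.01) for i in range(1000)]
    for m in range(3)
]
final = merge_samples([keyed_sample(shard, 20) for shard in shards], 20)
print(final)
```

Because the k largest keys overall must each be among the k largest keys of their own shard, merging the per-shard results gives the same sample as running the key-based scheme over the whole stream on one machine.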


## Practical Uses

As stated earlier, reservoir sampling comes most in handy when an entire data stream does not fit in memory. Imagine if our stream was every Google search performed in the month of June every year since Google's inception or every transaction made on Amazon in the past three holiday seasons - these streams are extremely large. In order to randomly select k of them, we would probably need to employ distributed random sampling, as these streams a) would be too large to efficiently iterate over just to get their size and b) may not fit into memory - or would significantly slow down our machine if we tried.


## Twitter Example
In our past homeworks, we have manipulated Twitter data. Let's do a real life example where we load and select k tweets from an unknown stream of tweets to presumably do data analysis on. 

In [98]:
# From HW2, load our tweets.csv
def load_twitter_data_pandas(tweets_filepath):
    tweets_df = (pd.read_csv(tweets_filepath)).fillna("")
    return tweets_df

In [99]:
(tweets_df) = load_twitter_data_pandas('tweets.csv')
# make sure to change the path to csv files appropriately
print (tweets_df.head())

       screen_name                      created_at  retweet_count  \
0  realDonaldTrump  Fri Sep 09 02:00:32 +0000 2016           2859   
1  realDonaldTrump  Fri Sep 09 00:39:36 +0000 2016           6463   
2  realDonaldTrump  Thu Sep 08 23:56:22 +0000 2016           5405   
3  realDonaldTrump  Thu Sep 08 19:52:32 +0000 2016          11633   
4  realDonaldTrump  Thu Sep 08 18:17:01 +0000 2016           3824   

   favorite_count                                               text  
0            7030  Final poll results from NBC on last nights Com...  
1           17951  It wasn't Matt Lauer that hurt Hillary last ni...  
2           13223  More poll results from last nights Commander-i...  
3           27028  Last nights results - in poll taken by NBC. #A...  
4           12567  With Luis, Mexico and the United States would ...  


In [100]:
#Fill the reservoir with the first k tweets, then randomly replace them as we scan the rest
def tweetReservoirSampling(df, k):
    reservoir = []
    listDF = df.values.tolist()
    for index in range(k):
        # append keeps each tweet as one row; += would flatten its fields into the reservoir
        reservoir.append(listDF[index])
    # start at index k so the (k+1)th tweet is not skipped
    for j in range(k, len(listDF)):
        i = random.randint(0, j)
        if i < k:
            reservoir[i] = listDF[j]
    return reservoir
        
start_time = time.time()
ourTweets = tweetReservoirSampling(tweets_df, 20)
print("--- %s seconds ---" % (time.time() - start_time))
print(ourTweets) 

--- 0.0279831886292 seconds ---
[['RapidCityPD', 'Sat Sep 17 02:05:47 +0000 2016', 0, 1, '@tweeks210 Pretty clam and quiet for the most part, but we see our fair share of action.'], ['AnnCoulter', 'Sat Sep 17 07:40:37 +0000 2016', 954, 0, 'RT @I_AmAmerica: If Hillary Clinton have the chance to appoint 2 to 4 Supreme Court Justices, it could change our Republic forever.Please C\xe2\x80\xa6'], ['LGlick1', 'Wed Sep 14 00:03:06 +0000 2016', 4, 9, "Let's let the people decide! @realDonaldTrump #VoteTrump #MAGA  https://t.co/ePfDb2Jd7Q"], ['andersoncooper', 'Fri Jul 29 04:02:43 +0000 2016', 0, 34, '@megfarrisWWL cant wait for wrinkle free friday!'], ['evangelistmatt', 'Sat Sep 17 02:04:34 +0000 2016', 37, 0, 'RT @EvanHeadrick: #HillsongMovie was unbelievable. Incredible film. I have such a deeper respect and love for who they are, and their music.'], ['joshspragins', 'Sat Aug 20 16:42:25 +0000 2016', 326, 0, 'RT @CoachKWisdom: All dreams come true if we have the courage to pursue them.'], ['
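
Converting the whole DataFrame to a list means the file has already been loaded into memory. When that is not an option, the rows can be sampled as they are read. A sketch using the standard `csv` module follows. The file name `example_tweets.csv` and its two columns are made up so the example can run on its own, and `sample_csv_rows` is a hypothetical helper.

```python
import csv
import random

def sample_csv_rows(path, k):
    """Reservoir-sample k data rows from a CSV file, reading one row at a time."""
    reservoir = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumes the first row is a header
        for j, row in enumerate(reader):
            if j < k:
                reservoir.append(row)
            else:
                i = random.randint(0, j)
                if i < k:
                    reservoir[i] = row
    return header, reservoir

# Write a small made-up file so the sketch is self-contained.
with open("example_tweets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["screen_name", "text"])
    for n in range(500):
        writer.writerow(["user_%d" % n, "tweet number %d" % n])

header, rows = sample_csv_rows("example_tweets.csv", 20)
print(header)
print(rows[:3])
```

Only k rows are ever held at once, so the same function would work on a file far larger than memory.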

## Tweaked Scenario
Let's now imagine that we want to associate a specific weight with each screen name. First, let's randomly generate weights for the given screen names. After that, let's pick 20 tweets to analyze, taking those weights into account.

In [101]:
#First lets isolate the screen names and make a dictionary where the key value pairs are the unique screen name and a
#random weight
def makeSetOfScreenNames(df):
    return set(df["screen_name"].tolist())

myScreenNames = makeSetOfScreenNames(tweets_df)
#print(makeSetOfScreenNames(tweets_df))

def screenNameWeights(screenNames):
    weights = {}
    for name in screenNames:
        weights[name] = random.random()
    return weights

myWeights = screenNameWeights(myScreenNames)
#print the weights we will actually use, rather than generating a second random set
print(myWeights)

{'LaraLeaTrump': 0.6248763216837869, 'parscale': 0.062435954772907754, 'AnnCLauer': 0.84804203423153, 'BucksFargo': 0.17829362339700194, 'DonaldJTrumpJr': 0.04944992132156867, 'TrumpGolfHV': 0.49261696325332294, 'TrumpCharlotte': 0.7350803190490024, 'ketosisquickly': 0.9783133134684304, 'NBCNews': 0.27834049591074217, 'MaxAdlerGD': 0.40148367926662265, 'Aly_Raisman': 0.7733722111307808, 'TrumpHotels': 0.84735480130601, 'blueandcream': 0.9862129520156625, 'TedDiBiase': 0.2787697242029894, 'mjfoxy12': 0.1552932513384898, 'bhvacations': 0.4106551206045287, 'MinnesotaDFL': 0.6135559210540275, 'DANGEROUSDG1': 0.1913625598122457, 'TrumpLasVegas': 0.4629583035680024, 'CusterChronicle': 0.5747203160756378, 'iherb__promo': 0.376433410740763, 'karchkiraly': 0.9402854488021014, 'MyTownBlackHill': 0.9299476245994821, 'NCAA': 0.2057443000667939, 'Goldust': 0.8568065339166543, 'CureCancerNow': 0.6044865900992201, 'WojVerticalNBA': 0.6889654234208427, 'Ginamzz': 0.7880569466365592, 'iHerb': 0.0348199

In [102]:
#Fill the reservoir with k tweets, then replace them at random with probability based on each tweet's weight
def tweetWeightedReservoirSampling(df, weight, k):
    reservoir = []
    listDF = df.values.tolist()
    totalWeight = 0
    for index in range(k):
        # append keeps each tweet as one row; += would flatten its fields into the reservoir
        reservoir.append(listDF[index])
        totalWeight += (weight[listDF[index][0]]/k)
    # start at index k so the (k+1)th tweet is not skipped
    for j in range(k, len(listDF)):
        # look up the weight of the current tweet (row j), not a stale index
        currentWeight = weight[listDF[j][0]]
        totalWeight += (currentWeight/k)
        prob = currentWeight / totalWeight
        check = random.random()
        if check < prob:
            index = random.randint(0,k-1)
            reservoir[index] = listDF[j]
    return reservoir
        
start_time = time.time()
ourWTweets = tweetWeightedReservoirSampling(tweets_df, myWeights,20)
print("--- %s seconds ---" % (time.time() - start_time))
print(ourWTweets) 

--- 0.0220670700073 seconds ---
[['SIGolfPlus', 'Wed Nov 05 21:33:28 +0000 2014', 6, 11, 'The Bump &amp; Run, a shot you need in your bag: http://t.co/GEV2HN2ypl'], ['EricTrumpFdn', 'Fri Jun 24 21:31:06 +0000 2016', 28, 67, "Here's a sneak peek at what to expect at our 10th Annual #ETF Golf Invitational &amp; Auction Dinner!  Details soon! \xe2\x9b\xb3\xef\xb8\x8f https://t.co/vMJQb2bw6T"], ['TrumpChicago', 'Tue Sep 13 15:51:55 +0000 2016', 0, 2, 'Meet the Sixteen Team:\nExecutive Pastry Chef Evan Sheridan \n*From: Iowa\n*Favorite day off food: Vietnamese... https://t.co/KfGjosIpi7'], ['CusterChronicle', 'Thu Sep 15 19:30:39 +0000 2016', 0, 0, 'Volleyball takes three wins in busy week https://t.co/D9dxq3JtQq'], ['Kampkoa', 'Fri May 20 17:45:15 +0000 2016', 0, 0, 'Think people - Do not approach...  they are wild animals...\nhttps://t.co/58cAyepKsB'], ['ErinAndrews', 'Fri Sep 16 00:45:20 +0000 2016', 0, 0, '@Reveretoo ugh what?'], ['BlackHillsWine', 'Wed Sep 07 21:16:17 +0000 2016', 1, 0

## Further Resources
This tutorial highlighted a little of the potential of reservoir sampling. It started with the naive approach to a simple problem. Then, it introduced the algorithm, which was finally applied to a real-life example of randomly selecting tweets to survey.

For increased complexity and further information, visit one of the following:
1. https://gregable.com/2007/10/reservoir-sampling.html
2. https://www.cs.umd.edu/~samir/498/vitter.pdf
3. http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf
4. http://erikerlandson.github.io/blog/2015/11/20/very-fast-reservoir-sampling/

Reading about the Fisher-Yates shuffle will be beneficial in understanding a variation of reservoir sampling in a practical example of shuffling and picking cards from a deck of cards. See here: https://bost.ocks.org/mike/shuffle/