Experimental part

1) Step 1: Simulate Rankings of Relevance for E and P (5 points)

In this section we first enumerate every possible relevance ranking of length 5. itertools.product with repeat=5 yields the Cartesian product of the grades N, R and HR, i.e. all 3^5 = 243 sequences. We then use itertools.permutations with length 2 to form every ordered pair of two different rankings. The first element of each pair plays the role of the experimental system (E) and the second that of the production system (P).

In [26]:
import itertools
values = ['N','R','HR'] #possible values of a prediction

relevances = [] #relevances contains all combinations of N/R/HR with length 5
for r in itertools.product(values, repeat=5):
    relevances.append(list(r))

In [27]:
combinations = [] #combinations contains all pairs of relevances
for p in itertools.permutations(relevances, 2):
    combinations.append(list(p)) #we use this to get rid of the permutations object

In [25]:
print(combinations[:10]) #show the first 10 combinations

[[['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'N', 'N', 'R']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'N', 'N', 'HR']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'N', 'R', 'N']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'N', 'R', 'R']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'N', 'R', 'HR']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'N', 'HR', 'N']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'N', 'HR', 'R']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'N', 'HR', 'HR']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'R', 'N', 'N']], [['N', 'N', 'N', 'N', 'N'], ['N', 'N', 'R', 'N', 'R']]]
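
As a quick sanity check on the sizes involved, the sketch below rebuilds both lists and counts them. With three grades and five positions there should be 3^5 = 243 rankings, and 243 * 242 = 58806 ordered pairs of distinct rankings. The names n_rankings and n_pairs are introduced here for illustration only.

```python
import itertools

values = ['N', 'R', 'HR']

# Cartesian product: every length-5 sequence over the three grades
relevances = [list(r) for r in itertools.product(values, repeat=5)]

# Ordered pairs of two different rankings; (a, b) and (b, a) both occur
combinations = [list(p) for p in itertools.permutations(relevances, 2)]

n_rankings = len(relevances)
n_pairs = len(combinations)
print(n_rankings, n_pairs)
```

Because permutations never pairs an element with itself, no pair compares a ranking against an identical copy.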


2) Step 2: Implement Evaluation Measures (10 points)

In the next section we make two assumptions:

1) The relevance grades map to the numeric values N=0, R=1, HR=2.

2) The total number of relevant documents for a query is taken to be the number of relevant (R or HR) documents in the two rankings combined. In other words, we assume the two rankings share no documents.

In [101]:
#the binary evaluation measure: average precision
numeric_map = {'N':0, 'R':1, 'HR':2} #we use this numeric map to map N/R/HR to a numeric value.
prediction = ['R','HR','N','R','N'] #this is a sample prediction to test functions

def count_rel(prediction1,prediction2):
    return sum(1 for i in prediction1 if i != 'N') + sum(1 for i in prediction2 if i != 'N')

def average_precision(prediction, r):
    ap = 0
    relevant_preds = 0
    for i in range(0,len(prediction)):
        if prediction[i] != 'N':
            relevant_preds += 1
            ap += relevant_preds/(i+1)
    return ap/r

ap = average_precision(prediction, count_rel(prediction, prediction))
print(ap)

0.4583333333333333
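
To see where this value comes from, the sketch below lists the precision at every rank that holds a relevant document and then normalizes by r. The helper name precision_at_relevant_ranks is hypothetical and not part of the notebook. It assumes, as the test call above does, that the same ranking is passed twice to count_rel, so r counts the three relevant documents twice.

```python
def precision_at_relevant_ranks(prediction):
    """Return precision@i for every rank i that holds a relevant (R or HR) document."""
    precisions = []
    relevant_so_far = 0
    for rank, label in enumerate(prediction, start=1):
        if label != 'N':
            relevant_so_far += 1
            precisions.append(relevant_so_far / rank)
    return precisions

prediction = ['R', 'HR', 'N', 'R', 'N']
steps = precision_at_relevant_ranks(prediction)

# count_rel(prediction, prediction) counts the 3 relevant documents twice, so r = 6
r = 2 * len(steps)
ap = sum(steps) / r
print(steps, ap)
```

The three precisions are 1/1, 2/2 and 3/4. Their sum, 2.75, divided by r = 6 gives the value printed above.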


Now we implement both multi-graded evaluation measures.

The first is nDCG@k, which requires an optimal ranking to normalize against. We build this optimal ranking from the total number of HR and R documents in both rankings. Again we assume the two rankings do not overlap.

The second is ERR, which does not need any such assumption.

In [105]:
#nDCG@K
import numpy as np #Numpy is amazing right?

def generate_opt(prediction1, prediction2): #generate optimal sequences from two predictions
    opt_pred = []
    num_hr = sum(1 for i in prediction1 if i == 'HR') + sum(1 for i in prediction2 if i == 'HR')
    num_r  = sum(1 for i in prediction1 if i == 'R') + sum(1 for i in prediction2 if i == 'R')
    for i in range(min(num_hr,5)): #check if num_hr exceeds 5, fill with HR's
        opt_pred.append('HR')
    for i in range(min(5-num_hr,num_r)): #check if num_r exceeds the space left, fill with R's
        opt_pred.append('R')
    for i in range(5-len(opt_pred)): #fill the rest with N
        opt_pred.append('N')
    return opt_pred

def dcg_k(numeric_map, prediction, opt_pred, k):
    dcg_opt = 0
    dcg = 0
    for i in range(0,k): #for the range until K, we sum both the optimum and prediction dcg
        dcg_opt += (2**numeric_map[opt_pred[i]]-1)/np.log2(1+i+1)
        dcg +=(2**numeric_map[prediction[i]]-1)/np.log2(1+i+1)
    return dcg/dcg_opt #dcg is normalized compared to the optimum
ndcg = dcg_k(numeric_map, prediction, generate_opt(prediction,prediction), 3)
print(ndcg)

0.53641800576
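
The optimal ranking can also be obtained by pooling both rankings and sorting by grade. The sketch below does this under the same no-overlap assumption. The function name ideal_ranking and its length parameter are hypothetical, introduced only to cross-check generate_opt.

```python
numeric_map = {'N': 0, 'R': 1, 'HR': 2}

def ideal_ranking(prediction1, prediction2, length=5):
    """Pool both rankings, sort by grade (highest first), keep the top `length`."""
    pooled = list(prediction1) + list(prediction2)
    pooled.sort(key=lambda label: numeric_map[label], reverse=True)
    return pooled[:length]

prediction = ['R', 'HR', 'N', 'R', 'N']
opt_same = ideal_ranking(prediction, prediction)
opt_pair = ideal_ranking(['R', 'N', 'N', 'N', 'R'], ['HR', 'N', 'N', 'N', 'N'])
print(opt_same)
print(opt_pair)
```

For the sample ranking paired with itself, the pool holds two HR and four R documents, so the top five positions are two HR followed by three R.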


In [106]:
#ERR
def ERR(numeric_map, prediction):
    err = 0
    max_val = 2**max(list(numeric_map.values()))
    #thetas[i] is the probability that the user is satisfied by the document at rank i+1
    thetas = [(2**numeric_map[p]-1)/max_val for p in prediction]
    for i in range(0,len(prediction)):
        prod_val = thetas[i] #the user stops at rank i+1 ...
        for j in range(0,i):
            prod_val *= (1-thetas[j]) #... after not being satisfied by any earlier rank
        err += prod_val/(i+1)
    return err
err = ERR(numeric_map, prediction)
print(err)

0.54296875


3) Step 3: Calculate the 𝛥measure (0 points)

In [107]:
k = 5

def check_performance(s):
    prediction_e = s[0]
    prediction_p = s[1]
    print(prediction_e, prediction_p)
    r = count_rel(prediction_e, prediction_p)
    ap_e, ap_p = average_precision(prediction_e, r), average_precision(prediction_p, r)
    print('The average prec. scores are ',ap_e,ap_p,' for experiment and production respectively!')
    ERR_e, ERR_p = ERR(numeric_map, prediction_e), ERR(numeric_map, prediction_p)
    print('The ERR scores are ',ERR_e,ERR_p,' for experiment and production respectively!')
    opt_prediction = generate_opt(prediction_e,prediction_p)
    ndcg_e = dcg_k(numeric_map, prediction_e, opt_prediction, k)
    ndcg_p = dcg_k(numeric_map, prediction_p, opt_prediction, k)
    print('The NDCG scores @ k=',k,' are: ',ndcg_e, ndcg_p,' for experiment and production respectively!!!')

check_performance(combinations[20005])

['R', 'N', 'N', 'N', 'R'] ['HR', 'N', 'N', 'N', 'N']
The average prec. scores are  0.4666666666666666 0.3333333333333333  for experiment and production respectively!
The ERR scores are  0.2875 0.75  for experiment and production respectively!
The NDCG scores @ k= 5  are:  0.33572413233 0.726228761795  for experiment and production respectively!!!
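
Step 3 asks for the 𝛥 of each measure, while the cell above only prints the two scores side by side. The following is a minimal sketch, assuming 𝛥 is defined as the score of E minus the score of P. The helper name delta_measure is hypothetical and not part of the notebook. It is shown for average precision only, on the same pair as above.

```python
def average_precision(prediction, r):
    """Average precision with r as the assumed total number of relevant documents."""
    ap = 0
    relevant_preds = 0
    for i, label in enumerate(prediction):
        if label != 'N':
            relevant_preds += 1
            ap += relevant_preds / (i + 1)
    return ap / r

def delta_measure(measure, prediction_e, prediction_p, **kwargs):
    """Assumed definition: score of the experimental ranking minus the production score."""
    return measure(prediction_e, **kwargs) - measure(prediction_p, **kwargs)

prediction_e = ['R', 'N', 'N', 'N', 'R']
prediction_p = ['HR', 'N', 'N', 'N', 'N']

# relevant documents in both rankings combined (no-overlap assumption)
r = sum(1 for label in prediction_e + prediction_p if label != 'N')

delta_ap = delta_measure(average_precision, prediction_e, prediction_p, r=r)
print(delta_ap)
```

Under this sign convention, a positive value means E outscores P on the measure. For this pair 𝛥AP is positive, whereas the nDCG scores printed above favour P.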


4) Step 4: Implement Interleaving (15 points)

In [None]:
def get_A_first():
    #fair coin flip: decides which ranking contributes first whenever both pointers are tied
    return np.random.uniform() > 0.5

def balanced_interleaving(s):
    ranking_A = s[0]
    ranking_B = s[1]

    print("ranking A is", ranking_A)
    print("ranking B is", ranking_B)

    I = [] #the interleaved list
    k_a, k_b = 0, 0 #pointers into ranking A and ranking B

    A_first = get_A_first()

    print("A_first is", A_first)

    #stop as soon as one of the two rankings is exhausted
    while k_a < len(ranking_A) and k_b < len(ranking_B):
        if (k_a < k_b) or ((k_a == k_b) and A_first):
            if ranking_A[k_a] not in I: #skip documents that were already added
                I.append(ranking_A[k_a])
            k_a += 1
        else:
            if ranking_B[k_b] not in I:
                I.append(ranking_B[k_b])
            k_b += 1

    return I

test_set = [[1,2,3,4,5],[6,7,8,9,10]]

balanced_interleaving(test_set)
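
Because the coin flip makes the output random, the routine above is hard to check by hand. The sketch below is a variant that takes the coin flip as an explicit argument so that both outcomes can be traced. The function name balanced_interleave and the parameter a_first are hypothetical, and the loop is intended to mirror the one above rather than replace it.

```python
def balanced_interleave(ranking_a, ranking_b, a_first):
    """Balanced interleaving with the tie-breaking coin flip passed in explicitly."""
    interleaved = []
    k_a, k_b = 0, 0
    # stop as soon as one of the two rankings is exhausted
    while k_a < len(ranking_a) and k_b < len(ranking_b):
        if k_a < k_b or (k_a == k_b and a_first):
            if ranking_a[k_a] not in interleaved:
                interleaved.append(ranking_a[k_a])
            k_a += 1
        else:
            if ranking_b[k_b] not in interleaved:
                interleaved.append(ranking_b[k_b])
            k_b += 1
    return interleaved

a_wins_toss = balanced_interleave([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], a_first=True)
b_wins_toss = balanced_interleave([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], a_first=False)
overlapping = balanced_interleave([1, 2, 3], [2, 1, 4], a_first=True)
print(a_wins_toss)
print(b_wins_toss)
print(overlapping)
```

With disjoint rankings the two lists alternate, and the last document of the ranking that lost the toss is left out, because the loop stops once the other ranking is exhausted. With overlapping rankings, duplicates are skipped but the pointer still advances.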