## Kullback-Leibler Divergence of Empirical and Theoretical Probabilities of Rankings 

The Kullback-Leibler (KL) divergence measures how much one probability distribution differs from another. For probability distributions P and Q, the KL divergence of Q from P is

$$ D_{KL}(Q \,\|\, P) = \sum_{i} Q(i)\log\frac{Q(i)}{P(i)} $$

where i ranges over the outcomes (here, rankings) on which the distributions are defined. In this notebook, Q is the theoretical (model) distribution and P is the empirical distribution. To find the KL divergence between the empirical and theoretical probability distributions of the Debian 2002 election data (the `ED-debian-2002.soi` file loaded below), we first load the data along with the best-fitting parameters we found for the Mallows and Plackett-Luce models:

In [1]:
import pickle

import numpy as np
import readPreflib

# Load the vote data; `votes` holds (count, ranking) pairs
_, lengths, votes = readPreflib.soiInputwithWeights('data_input/ED-debian-2002.soi')
num_votes = 1.0 * sum(lengths.values())  # total number of votes, as a float

# Load the previously fitted model parameters
with open('pickle/mallows2002_1mil_2.p', 'rb') as f:
    mallows_params = pickle.load(f)
sigma, phi = mallows_params

with open('pickle/plackett2002_3mil_2.p', 'rb') as f:
    plackett_params = pickle.load(f)
pl_weights = plackett_params

mallows_params, plackett_params, num_votes

([array([3, 1, 5, 4, 2]), 0.006131716274196619],
 array([0.31115367, 0.24348515, 0.33875853, 0.10660266]),
 475.0)

We also need the probability functions for the Mallows and Plackett-Luce models, which are imported from their own notebooks:

In [2]:
import import_ipynb
from Mallows_Notebook import *
from PL_Notebook import *
import metropolis
import math
from tqdm import tqdm_notebook

importing Jupyter notebook from Mallows_Notebook.ipynb
importing Jupyter notebook from PL_Notebook.ipynb
0.125
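
The notebooks that define `mallowsProb` and `probPlackett` are not reproduced here, so the following is only a minimal sketch of what such functions commonly compute, under stated assumptions: a Mallows model over full rankings with Kendall tau distance and a dispersion `phi` in (0, 1], and a Plackett-Luce model with one positive weight per item. The names `kendall_tau_distance`, `mallows_prob_sketch`, and `plackett_luce_prob_sketch` are hypothetical and may not match the imported implementations or their parameterization.

```python
from itertools import permutations
from math import prod


def kendall_tau_distance(ranking, reference):
    """Count the pairs of items that the two rankings order differently."""
    position = {item: idx for idx, item in enumerate(reference)}
    mapped = [position[item] for item in ranking]
    return sum(
        1
        for a in range(len(mapped))
        for b in range(a + 1, len(mapped))
        if mapped[a] > mapped[b]
    )


def mallows_prob_sketch(ranking, reference, phi):
    """Assumed Mallows form: phi**distance / Z, for a full ranking."""
    n = len(reference)
    # Normalizing constant: product over j of (1 + phi + ... + phi**(j - 1)).
    z = prod(sum(phi ** k for k in range(j)) for j in range(1, n + 1))
    return phi ** kendall_tau_distance(ranking, reference) / z


def plackett_luce_prob_sketch(ranking, weights):
    """Assumed Plackett-Luce form for a full or top-k ranking.

    `ranking` lists 1-based item labels, best first; `weights[i - 1]` is the
    positive weight of item i.
    """
    remaining = sum(weights)
    prob = 1.0
    for item in ranking:
        w = weights[item - 1]
        prob *= w / remaining  # chance of picking `item` among those left
        remaining -= w
    return prob


# Toy check (not the election data): probabilities over all full rankings
# of three items should sum to 1 up to rounding error.
rankings = list(permutations([1, 2, 3]))
total_mallows = sum(mallows_prob_sketch(r, (1, 2, 3), 0.5) for r in rankings)
total_pl = sum(plackett_luce_prob_sketch(r, [0.5, 0.3, 0.2]) for r in rankings)
print(total_mallows, total_pl)
```

Checking that the probabilities sum to 1 over all rankings is a cheap way to confirm that a probability function is normalized before using it in a divergence.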


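With the data, parameters, and probability functions in place, we can evaluate the KL divergence formula directly, summing over the rankings that appear in the votes: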

In [3]:
divergence_mallows = 0
divergence_plackett = 0

def insideSum(Qi, Pi):
    # One term of the KL sum: Q(i) * log(Q(i) / P(i))
    return Qi * math.log(Qi / Pi)

for entry in votes:
    num_occurrences, vote = entry
    empirical = num_occurrences / num_votes      # empirical probability P(i)
    mallows = mallowsProb(vote, sigma, phi)      # Mallows probability Q(i)
    plackett = probPlackett(vote, pl_weights)    # Plackett-Luce probability Q(i)
    divergence_mallows += insideSum(mallows, empirical)
    divergence_plackett += insideSum(plackett, empirical)

results = [divergence_mallows, divergence_plackett]
results

[372.9318955776771, 46.746431392831106]
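
As a sanity check on the formula, here is a minimal self-contained sketch on toy distributions (not the election data). It follows the same convention as the cell above, with Q as the model probability and P as the empirical frequency. The function name `kl_divergence_sketch` and the toy values are hypothetical. When both distributions are normalized and the sum covers every outcome, the result is non-negative and is zero only when the distributions agree; a sum over only part of the outcomes carries no such guarantee.

```python
import math


def kl_divergence_sketch(q, p):
    """Sum of q[i] * log(q[i] / p[i]) over the outcomes listed in q.

    Both arguments are dicts mapping outcomes to probabilities. Terms with
    q[i] == 0 contribute 0 by the usual convention.
    """
    total = 0.0
    for outcome, q_i in q.items():
        if q_i == 0:
            continue
        p_i = p.get(outcome, 0.0)
        if p_i == 0:
            return math.inf  # q puts mass where p has none
        total += q_i * math.log(q_i / p_i)
    return total


q = {"abc": 0.5, "acb": 0.3, "bac": 0.2}  # toy "theoretical" distribution
p = {"abc": 0.4, "acb": 0.4, "bac": 0.2}  # toy "empirical" distribution

full = kl_divergence_sketch(q, p)                # sum over every outcome
same = kl_divergence_sketch(q, q)                # identical distributions
partial = kl_divergence_sketch({"acb": 0.3}, p)  # sum over one outcome only

print(full, same, partial)
```

Here `full` is positive and `same` is zero, while `partial` is negative because the single retained term has Q(i) < P(i).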

Save the results:

In [4]:
import pickle

# Write the two divergences to disk, closing the file when done
with open('pickle/divergence_.p', 'wb') as f:
    pickle.dump(results, f)

The same computation can be repeated for a list of files in a folder:

In [5]:
files = ['ED-00002-00000001.soi',
         'ED-00002-00000002.soi',
         'ED-00002-00000003.soi',
         'ED-00002-00000004.soi',
         'ED-00002-00000005.soi',
         'ED-00002-00000006.soi',
         'ED-00002-00000007.soi']

list_of_votes = []
mallows_params = []
pl_params = []

nruns = 100  # passed to runMallows and runPL below

for file in tqdm_notebook(files, desc='All Files'):
    _, lengths, votes = readPreflib.soiInputwithWeights('preflib_soi/' + file)
    num_votes = 1.0 * sum(lengths.values())
    list_of_votes.append((num_votes, votes))
    # Fit both models to this file's votes
    p_mal = runMallows(votes, nruns, lengths)
    mallows_params.append(p_mal)
    p_pl = runPL(votes, nruns, lengths)
    pl_params.append(p_pl)

In [9]:
pl_params

[array([0.11405058, 0.11543402, 0.33712478, 0.43339062]),
 array([0.07744458, 0.13448718, 0.15879523, 0.16511992, 0.46415309]),
 array([0.04860412, 0.0938401 , 0.10664794, 0.15898738, 0.08018003,
        0.10099901, 0.41074142]),
 array([0.03062975, 0.04536536, 0.06975143, 0.04565361, 0.02429043,
        0.02372079, 0.06410887, 0.69647976]),
 array([0.04486576, 0.1058907 , 0.08613949, 0.09569888, 0.0795164 ,
        0.0342645 , 0.024514  , 0.0565099 , 0.47260036]),
 array([0.13428294, 0.12178278, 0.13597867, 0.13974907, 0.46820654]),
 array([0.12055543, 0.08788404, 0.17890132, 0.61265921])]

In [11]:
table = [('File#', 'N_Votes', "Mallow's Divergence", 'Plackett-Luce Divergence')]

for i in range(len(list_of_votes)):
    num_votes, votes = list_of_votes[i]
    sigma, phi = mallows_params[i]
    pl_weights = pl_params[i]
    
    divergence_mallows = 0
    divergence_plackett = 0
    
    for entry in votes:
        num_occurrences, vote = entry
        empirical = num_occurrences / num_votes
        mallows = mallowsProb(vote, sigma, phi)
        plackett = probPlackett(vote, pl_weights)
        divergence_mallows += insideSum(mallows, empirical)
        divergence_plackett += insideSum(plackett, empirical)
    
    # Record the file index alongside the vote count and both divergences
    table.append((i, num_votes, divergence_mallows, divergence_plackett))

print(table)

[('File#', 'N_Votes', "Mallow's Divergence", 'Plackett-Luce Divergence'), (0, 475.0, 16.05505744793081, 38.64222198435496), (1, 488.0, 379.3396771774338, 54.790067328853034), (2, 504.0, -0.14123800021863592, 87.62519816120155), (3, 421.0, 0.0950929431104374, 61.464238267862), (4, 482.0, 74.65255232995827, 65.88763850291036), (5, 436.0, 306.38254671304367, 50.068986284565725), (6, 403.0, 308.48662360400925, 35.72022832992879)]
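
The printed list of tuples is hard to scan. One way to render it as aligned columns using only string formatting is sketched below; the helper name `format_table`, the shortened column labels, and the column widths are hypothetical choices, and only the first two data rows from the output above are repeated here.

```python
header = ("File#", "N_Votes", "Mallows", "Plackett-Luce")
rows = [
    (0, 475.0, 16.05505744793081, 38.64222198435496),
    (1, 488.0, 379.3396771774338, 54.790067328853034),
]


def format_table(header, rows):
    """Return the rows as right-aligned text columns, one line per row."""
    lines = ["{:>5}  {:>7}  {:>10}  {:>13}".format(*header)]
    for idx, n_votes, d_mallows, d_pl in rows:
        lines.append(
            f"{idx:>5d}  {n_votes:>7.0f}  {d_mallows:>10.4f}  {d_pl:>13.4f}"
        )
    return "\n".join(lines)


print(format_table(header, rows))
```

Rounding to four decimals is only for display; the underlying values in `table` are unchanged.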


In [12]:
import pandas as pd

for tup in table:
    