This notebook analyses how non-random agents would perform under the ELO leaderboard scoring.

The last section shows how the scores would evolve. Non-random agents do not seem to come out on top consistently. Possible reasons are:

* in this simulation random agents are much more numerous (1000 random agents vs. 100 non-random agents per strength level) to simulate a mass submission of plain random bots; players differ in how many agents they submit
* weak non-random agents are pushed down in score, so strong non-random agents no longer play against them once pairing is restricted to agents with similar scores; instead they start facing more random agents
* interestingly, the latter effect may become even stronger as the performance gap between agents grows, so the results become more random when agents differ more in strength

Does this mean that swamping the leaderboard with 5 submissions a day, of either any non-dumb agent or plain random agents, is a good strategy?

In [None]:
import numpy as np
from operator import itemgetter
import random
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import itertools

# Agents

The Gauss agent has a stochastic performance that follows a normal distribution with a defined mean. In each game both agents draw their current performance from their own distribution, and the agent with the larger draw wins. This matches the assumptions of the common ELO system very well.
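
The win probability implied by this model has a closed form, because the difference of two independent normal draws is itself normal. The following is a minimal sketch of that calculation; the function name `gauss_win_probability` is hypothetical and not part of the notebook.

```python
import math

def gauss_win_probability(loc_a, loc_b, scale_a=1.0, scale_b=1.0):
    """P(draw of agent A > draw of agent B) for two independent normal agents.

    Hypothetical helper for illustration only.
    """
    # The difference of the two draws is normal with mean loc_a - loc_b
    # and variance scale_a**2 + scale_b**2.
    z = (loc_a - loc_b) / math.sqrt(scale_a ** 2 + scale_b ** 2)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Strength levels used in the setup below (locs = [0, 1, 2], scale = 1).
for loc_a, loc_b in [(1, 0), (2, 1), (2, 0)]:
    print(f"P(Gauss({loc_a}) beats Gauss({loc_b})) = {gauss_win_probability(loc_a, loc_b):.3f}")
```

With unit scale, one strength level of difference corresponds to a win probability of roughly 0.76 and two levels to roughly 0.92, so the strength levels in this simulation are clearly separated but far from deterministic.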

In [None]:
class Gauss:
    def __init__(self, loc, scale=1):
        self.loc = loc
        self.scale = scale
    
    def round_performance(self):
        return np.random.normal(loc=self.loc, scale=self.scale)
    
    def __repr__(self):
        return f"Gauss({self.loc}, {self.scale})"

# Setup

In [None]:
num_random=1000           # number of random agents
num_loc_non_random=100    # number of non-random agents per strength level
locs = [0,1,2]            # strengths of non-random agents (means for Gauss distribution)

agents = []

for loc in locs:
    for _ in range(num_loc_non_random):
        agents.append(Gauss(loc))
        
for _ in range(num_random):
    agents.append(None)

# Rounds

In [None]:
def new_score(score1, score2, winner, k=30):
    q1 = 10 ** (score1 / 400)
    q2 = 10 ** (score2 / 400)
    
    e1 = q1 / (q1+q2)
    e2 = q2 / (q1+q2)
    
    new_score1 = score1 + k * ((1-winner) - e1)
    new_score2 = score2 + k * (winner - e2)
    
    return new_score1, new_score2
    

def make_round_scores(agents, num_rounds_per_agent = 100, diff_rank = 10, init_score=600):
    # this is a bit slow at the moment; feel free to optimize and let us know!
    round_scores = []
    scores = [init_score] * len(agents)
    
    for _ in range(num_rounds_per_agent):
        for i in range(len(agents)):
            argsort = np.argsort(scores)
            ranks = np.empty_like(argsort)
            ranks[argsort] = np.arange(len(ranks))

            my_rank = ranks[i]
            other_rank = random.choice(list(set(range(max(my_rank - diff_rank, 0), min(my_rank + diff_rank, len(scores)))) - {my_rank}))
            j=argsort[other_rank]

            if agents[i] is None or agents[j] is None:
                winner = random.choice([0, 1])
            else:
                perf_i = agents[i].round_performance()
                perf_j = agents[j].round_performance()
                winner = int(perf_j > perf_i)

            scores[i], scores[j] = new_score(scores[i], scores[j], winner)

        round_scores.append(scores.copy())
        
    return round_scores
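
As a quick sanity check of the update rule used in `new_score`, the sketch below re-implements the same formula under the hypothetical name `elo_update` so that it runs on its own. As in the notebook, `winner=0` means the first player won and `winner=1` means the second player won.

```python
def elo_update(score1, score2, winner, k=30):
    """Standalone copy of the notebook's update rule (hypothetical name)."""
    q1 = 10 ** (score1 / 400)
    q2 = 10 ** (score2 / 400)
    e1 = q1 / (q1 + q2)  # expected result of player 1
    e2 = q2 / (q1 + q2)  # expected result of player 2
    return score1 + k * ((1 - winner) - e1), score2 + k * (winner - e2)

# Equal scores: the winner gains k/2 points and the loser gives up k/2.
print(elo_update(600, 600, winner=0))  # (615.0, 585.0)

# An upset: the lower-rated player 2 wins and gains more than k/2.
upset = elo_update(700, 600, winner=1)
print(upset)

# The update is zero-sum: the total score of both players is unchanged.
print(sum(upset))
```

Because every game against a random bot is a coin flip, such games move the score of a strong agent up or down by the same rule without carrying any information about its strength.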

# Plot

Under ideal conditions one of the strongest Gauss agents should win every time. In fact, we would want almost all agents of the strongest level to end up ahead of the gray (random) agents.

In the plots the colors correspond to Gauss agents of a particular performance level: red, green and blue lines are agents of strength 0, 1 and 2, respectively. Gray lines are random bots.

In [None]:
def plot_scores(round_scores):
    agent_scores = list(zip(*round_scores))
    for agent, cur_agent_scores in random.sample(list(zip(agents, agent_scores)), k=len(agents)):
        plt.plot(cur_agent_scores, c={loc: col for loc, col in zip(locs, "rgb")}[agent.loc] if agent is not None else "gray", zorder=1 if agent is not None else 0)
    plt.show()
    
print("Number of agents")
for loc, cnt in Counter(agent.loc for agent in agents if agent is not None).items():
    print(f"{cnt:5} Strength {loc}")
print(f"{sum(agent is None for agent in agents):5} Random")

winner_count = Counter()
        
for _ in range(10):
    round_scores = make_round_scores(agents)
    plot_scores(round_scores)
    winner_idx = np.argmax(round_scores[-1])
    winner_agent = agents[winner_idx]
    print(f"Winner: {winner_agent if winner_agent is not None else 'Random'}")
    winner_count[str(winner_agent) if winner_agent is not None else 'Random']+=1
    print(f"Current winner count: {', '.join(f'{agent}:{cnt}' for agent, cnt in winner_count.most_common())}")

What we seem to get with the ELO scoring is a nice split of the non-random agents by strength (apart from some outliers), but also a background of random bots that spreads over the whole score range.


# Alternative scoring system with all-vs-all win rate (no ELO)

Let's see what the effect of a classical all-vs-all scoring based on win rates would be.

In [None]:
num_rounds_per_agent = 100  # same number of games for a fair comparison

####
agent_scores=[0] * len(agents)

for i in range(len(agents)):
    # random.sample needs a sequence (sampling from a set fails on Python 3.11+)
    for j in random.sample([k for k in range(len(agents)) if k != i], num_rounds_per_agent):
        agent1 = agents[i]
        agent2 = agents[j]

        if agent1 is None or agent2 is None:
            if random.random() < 0.5:
                agent_scores[i] += 1
            else:
                agent_scores[j] += 1
        else:
            perf1 = agent1.round_performance()
            perf2 = agent2.round_performance()
            if perf1 > perf2:
                agent_scores[i] += 1
            else:
                agent_scores[j] += 1

In [None]:
agent_idxs, sorted_agent_scores = zip(*sorted(random.sample(list(enumerate(agent_scores)), len(agent_scores)), key=itemgetter(1), reverse=True))

loc_color = {loc: color for loc, color in zip(locs, "rgb")}

plt.figure(figsize=(30, 5))
plt.bar(range(len(agent_idxs)), sorted_agent_scores, width=1, color=[loc_color.get(agents[idx].loc) if agents[idx] is not None else "lightgray"  for idx in agent_idxs]);

The leaderboard seems to be dominated by the strongest agents much more clearly than with ELO.

And this is even though we have not yet included any non-transitivity issues.

This suggests that random bots are not an issue with the right scoring system. You can compare it to the previous ELO simulation, where a background of random bots overwhelmed the others.

Some reasons why ELO fails to capture the interesting performance differences are outlined in [Elo, Glicko etc rating system not suitable for RPS](https://www.kaggle.com/c/rock-paper-scissors/discussion/196174).

The reasons are:

* RPS bots are not transitive. There are some strong bots where A beats B, B beats C and C beats A convincingly (one of my buggy bots wins almost every time vs. the past winner Greenberg). This breaks the assumptions of ELO.
* This means individual comparisons (as done with ELO based on similar scores) are not meaningful. It is much more interesting to see which of A, B, C can beat *most* of D, E, F.
* The ranking starts to stabilize when you count how *many* bots one can beat, not whether you can beat high-scoring bots specifically.
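
To illustrate the last two points, here is a small sketch with a made-up win-probability matrix. All numbers and bot names are hypothetical: A, B and C form a cycle and are additionally compared by how well they do against a pool D, E, F.

```python
import numpy as np

# Hypothetical bots; p[i, j] is the assumed probability that bot i beats bot j.
names = ["A", "B", "C", "D", "E", "F"]
n = len(names)
p = np.full((n, n), 0.5)

def set_matchup(i, j, prob):
    """Set P(i beats j) and the complementary P(j beats i)."""
    p[i, j] = prob
    p[j, i] = 1.0 - prob

# Non-transitive cycle: A beats B, B beats C, C beats A.
set_matchup(0, 1, 0.9)
set_matchup(1, 2, 0.9)
set_matchup(2, 0, 0.9)

# Against the rest of the pool the three bots differ (made-up values).
for pool_bot in (3, 4, 5):
    set_matchup(0, pool_bot, 0.8)
    set_matchup(1, pool_bot, 0.6)
    set_matchup(2, pool_bot, 0.4)

# Expected all-vs-all win rate: row mean without the diagonal entry (0.5).
win_rate = (p.sum(axis=1) - 0.5) / (n - 1)

for name, rate in sorted(zip(names, win_rate), key=lambda t: -t[1]):
    print(f"{name}: {rate:.2f}")
```

Head-to-head results within the cycle give no consistent order, whereas the average win rate against the whole pool ranks A ahead of B and B ahead of C in this made-up example.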

Caveats:

Playing all-vs-all requires a lot of matches. If, due to a limited number of matches, each bot only plays a random (distinct!) subset of all bots, some bots may be luckier than others and get weaker opponents assigned. This can be investigated with the above simulation.

If you believe the bot environment changes over time and earlier bots had a harder or easier time, you could introduce an exponentially decayed win-rate counter `new_score = score * decay + I(win=1) * (1-decay)`, where `I(win=1)` is 1 for a win and 0 otherwise. This would converge to the recent average win rate.
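
A minimal sketch of this decayed counter is shown below. The function name `update_win_rate`, the starting value of 0.5, the decay of 0.95 and the fixed win/loss pattern are all assumptions chosen for illustration.

```python
def update_win_rate(score, won, decay=0.95):
    """One step of new_score = score * decay + I(win=1) * (1 - decay)."""
    return score * decay + (1.0 if won else 0.0) * (1.0 - decay)

# Hypothetical bot that wins three out of every four games.
results = [True, True, False, True] * 50

score = 0.5  # assumed neutral starting value
for won in results:
    score = update_win_rate(score, won)

print(f"decayed win rate after {len(results)} games: {score:.3f}")
```

The counter settles close to the bot's recent win rate of 0.75 and keeps fluctuating around it. A `decay` closer to 1 averages over more games and reacts more slowly to changes in the environment.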

Technically, even using the given past matches, the whole LB could be rescored with this approach. I believe past RPS competitions used similar scoring with success.

