# Simulations Episode Scraper Match Downloader

This notebook downloads episodes using Kaggle's GetEpisodeReplay API and the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset.

**To run this notebook you WILL need to re-add the Meta Kaggle dataset. After opening your copy of the notebook, click "+ Add data" at the top right of the notebook editor.**

Meta Kaggle is normally refreshed daily, but it occasionally goes a few days without a refresh.

Why download replays?
- Train your ML/RL model
- Inspect the performance of your own and others' agents
- Add to your ever-growing JSON collection

Only one scraping strategy is implemented: for each top-scoring submission, download all missing matches, then move on to the next submission.

Other scraping strategies are possible but are not implemented here. For example: download at most X matches per submission or per team per day, ignore certain teams, ignore matches where some scores are below X, or download only selected teams.
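As an illustration of one such alternative, the sketch below caps the number of matches kept per submission. It is not part of this notebook: the cap `MAX_MATCHES_PER_SUBMISSION` and the toy data are hypothetical, and it assumes that a higher `EpisodeId` means a more recent match.

```python
import pandas as pd

# Hypothetical cap; the notebook itself does not define this value.
MAX_MATCHES_PER_SUBMISSION = 2

# Toy stand-in for the EpisodeAgents rows (column names follow Meta Kaggle).
toy = pd.DataFrame({
    "SubmissionId": [1, 1, 1, 2, 2, 3],
    "EpisodeId": [10, 11, 12, 10, 13, 14],
})

# Assumption: larger EpisodeId means a more recent match.
# Sort newest first, then keep at most the cap per submission.
capped = (
    toy.sort_values("EpisodeId", ascending=False)
       .groupby("SubmissionId")
       .head(MAX_MATCHES_PER_SUBMISSION)
)

# An episode shared by two selected submissions only needs one download.
episode_ids = sorted(capped["EpisodeId"].unique().tolist())
```

Episode 10 stays in the download list because submission 2 still selects it, even though it was dropped for submission 1.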

Todo:
- Add team IDs once Meta Kaggle adds them. Edit: a long time has passed, and it doesn't look like Kaggle is going to add them.

## Imports

In [None]:
import pandas as pd
import numpy as np
import os
import requests
import json
import datetime
import time
import glob
import collections
import matplotlib.pyplot as plt

## Load data

In [None]:
# Configure this to your needs. Choose one of:
# 'lux-ai-2021', 'hungry-geese', 'rock-paper-scissors', 'santa-2020', 'halite', 'google-football'
COMP = 'lux-ai-2021'

In [None]:
ROOT = "../working/"
META = "../input/meta-kaggle/"
MATCH_DIR = '../working/'
base_url = "https://www.kaggle.com/requests/EpisodeService/"
get_url = base_url + "GetEpisodeReplay"
BUFFER = 1
COMPETITIONS = {
    'lux-ai-2021': 30067,
    'hungry-geese': 25401,
    'rock-paper-scissors': 22838,
    'santa-2020': 24539,
    'halite': 18011,
    'google-football': 21723
}
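The cell above defines `get_url`, but this notebook never calls it. A minimal sketch of how a single replay might be requested and saved is shown below. The request body `{"EpisodeId": ...}`, the helper name `save_replay` and its `post` parameter are assumptions made for illustration, not something this notebook confirms, and the endpoint may have changed since it was written.

```python
import os

GET_URL = "https://www.kaggle.com/requests/EpisodeService/GetEpisodeReplay"


def save_replay(episode_id, out_dir, post=None):
    """Request one replay and write the raw response body to <out_dir>/<id>.json."""
    if post is None:
        import requests  # imported lazily so the sketch can be exercised offline
        post = requests.post
    # Assumption: the endpoint accepts a JSON body of this shape.
    response = post(GET_URL, json={"EpisodeId": int(episode_id)})
    response.raise_for_status()  # fail loudly on HTTP errors
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{int(episode_id)}.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return path
```

The `post` argument exists only so the function can be tried with a stand-in transport. In the notebook it would be called as `save_replay(episode_id, MATCH_DIR)`, with a pause between calls to avoid sending too many requests.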

In [None]:
# Load Episodes
episodes_df = pd.read_csv(META + "Episodes.csv")

# Load EpisodeAgents
epagents_df = pd.read_csv(META + "EpisodeAgents.csv")

print(f'Episodes.csv: {len(episodes_df)} rows before filtering.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows before filtering.')

episodes_df = episodes_df[episodes_df.CompetitionId == COMPETITIONS[COMP]] 
epagents_df = epagents_df[epagents_df.EpisodeId.isin(episodes_df.Id)]

print(f'Episodes.csv: {len(episodes_df)} rows after filtering for {COMP}.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows after filtering for {COMP}.')

Let's add creation date to the agents.

In [None]:
episodes_df['CreateTime'] = pd.to_datetime(episodes_df['CreateTime'])
episode_id_to_create_time = episodes_df.set_index('Id')[['CreateTime']]
episode_id_to_create_time = {key: value for key, value in zip(episode_id_to_create_time.index, episode_id_to_create_time.CreateTime)}

In [None]:
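# head(1) keeps the first listed episode of each submission, used as a proxy for its creation date
submission_id_to_episode_id = epagents_df.groupby('SubmissionId').head(1)[['SubmissionId', 'EpisodeId']]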
submission_id_to_create_time = {}
for submission_id, episode_id in zip(submission_id_to_episode_id.SubmissionId, submission_id_to_episode_id.EpisodeId):
    submission_id_to_create_time[submission_id] = episode_id_to_create_time[episode_id]

In [None]:
epagents_df['CreateTime'] = epagents_df.SubmissionId.apply(lambda x: submission_id_to_create_time[x])
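The same idea, approximating a submission's creation date by the time of its earliest episode, can also be written without intermediate dictionaries. The sketch below uses toy data with the Meta Kaggle column names. Taking the minimum `CreateTime` per submission is an assumption about what "creation date" should mean here, and it does not depend on the row order of the CSV.

```python
import pandas as pd

# Toy stand-ins for Episodes.csv and EpisodeAgents.csv.
episodes = pd.DataFrame({
    "Id": [100, 101, 102],
    "CreateTime": pd.to_datetime(["2021-09-01", "2021-09-03", "2021-09-02"]),
})
agents = pd.DataFrame({
    "EpisodeId": [100, 101, 101, 102],
    "SubmissionId": [7, 7, 8, 8],
})

# Attach each episode's time to its agent rows.
merged = agents.merge(episodes, left_on="EpisodeId", right_on="Id")

# Earliest episode time per submission, used as the creation-date proxy.
first_seen = merged.groupby("SubmissionId")["CreateTime"].min()

# Broadcast the per-submission value back to every agent row.
agents["CreateTime"] = agents["SubmissionId"].map(first_seen)
```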

I tried to get the team names, but the Lux AI competition is not in the Meta Kaggle tables yet; maybe they are only updated once the challenge has ended. The attempts below did not work, so I had to scrape the Kaggle website.

In [None]:
# teams = pd.read_csv(META + "Teams.csv")
# teams = teams[teams.CompetitionId == COMPETITIONS[COMP]]

I also tried the submissions dataframe, but it contains no episode information.

In [None]:
# submissions = pd.read_csv(META + "Submissions.csv")
# submissions.head()

So it seems I cannot get that information from the Meta Kaggle dataset. Maybe I can use the Kaggle API instead, or scrape the website.

## Data inspection

This shows that `episodes_df` has information about the competition, whereas `epagents_df` does not. That explains why we needed to use both to be able to filter by the competition of interest.

After that filtering we probably don't need `episodes_df` anymore.

In [None]:
episodes_df.head()

In [None]:
epagents_df.head()

## Leaderboard replication

Let's see how many unique agents there are, and try to recreate the latest version of the leaderboard.

In [None]:
len(epagents_df.SubmissionId.unique())

This is very interesting: we only have 10k unique agents, while the number of matches is 1.4M, so each agent plays on the order of 100 matches. Let's verify that.

In [None]:
plt.figure(figsize=(20, 5))
plt.hist(epagents_df.SubmissionId.value_counts(), bins=50);

There are a lot of agents with very few matches. If we exclude those, the distribution looks roughly uniform, which is probably related to the submission date of each agent.

In [None]:
leaderboard = epagents_df.groupby('SubmissionId').tail(1)
leaderboard = leaderboard[~leaderboard.UpdatedScore.isna()]
leaderboard = leaderboard.sort_values('UpdatedScore', ascending=False)
print(len(leaderboard))
leaderboard

So we have 8k agents with a numeric score. Let's see the distribution of scores.

In [None]:
plt.figure(figsize=(20, 5))
plt.hist(leaderboard.UpdatedScore, bins=1000, log=True, cumulative=-1, histtype='stepfilled');
plt.grid()

We can see that there are around 250 agents with a score above 1500.
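The leaderboard cell takes the last row of each submission, so it relies on the rows of `EpisodeAgents.csv` being in chronological order. The sketch below, on toy data, sorts explicitly before taking the last row and then counts the agents above a threshold. It assumes that a higher `EpisodeId` means a later match, which this notebook does not verify.

```python
import pandas as pd

# Toy rows, deliberately not in chronological order.
toy = pd.DataFrame({
    "SubmissionId": [1, 2, 1, 2, 3],
    "EpisodeId": [30, 10, 20, 40, 50],
    "UpdatedScore": [1600.0, 1400.0, 1550.0, 1450.0, None],
})

latest = (
    toy.sort_values("EpisodeId")            # assumption: EpisodeId grows over time
       .groupby("SubmissionId")
       .tail(1)                             # last episode of each submission
       .dropna(subset=["UpdatedScore"])     # drop agents without a numeric score
       .sort_values("UpdatedScore", ascending=False)
)

# Number of agents whose latest score is above the threshold.
n_above = int((latest["UpdatedScore"] > 1500).sum())
```

Without the explicit sort, `tail(1)` would pick episode 20 (score 1550) for submission 1 in this toy frame, instead of its latest episode 30.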

## Getting the name of the team that made the submission

The only way I have found to do that is to adapt a notebook that scrapes the Kaggle website:

https://www.kaggle.com/yalikesifulei/bot-statistics-with-selenium-beautiful-soup

In [None]:
%%capture
!pip install selenium
!apt-get update 
!apt install chromium-chromedriver -y
!pip install BeautifulSoup4

In [None]:
from functools import lru_cache
from selenium import webdriver
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
# time, numpy and matplotlib.pyplot are already imported in the Imports section.
# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'.
plt.style.use('seaborn-whitegrid')

In [None]:
def getSoup(sub_id, verbose=False):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    browser = webdriver.Chrome(options=options)

    # Build the URL from COMP so the scraper follows the competition configured above
    URL = f'https://www.kaggle.com/c/{COMP}/submissions?dialog=episodes-submission-'

    if verbose: print('Loading submission page...')
    browser.get(URL + str(sub_id))
    time.sleep(2)

    if verbose: print('Scrolling results...')
    scrolling_element = browser.find_element(
        webdriver.common.by.By.XPATH,
        "//div[@class='mdc-dialog__surface']")
    if verbose:
        generator = tqdm(range(100))
    else:
        generator = range(100)
    for k in generator:
        browser.execute_script('arguments[0].scrollTop = arguments[0].scrollHeight', scrolling_element)
    time.sleep(1)

    if verbose: print('Parsing page...')
    html_source = browser.page_source
    browser.quit()  # release the headless Chrome process once the page source is captured
    soup = BeautifulSoup(html_source, 'html.parser')
    if verbose: print('Done!')
    
    return soup

def get_team_name_from_soup(soup):
    team_names = []
    for span in soup.select('span[class*="sc-"]'):
        text = span.get_text()
        if 'vs' in text and '[' in text and 'ago' not in text:
            for part in text.split(' vs '):
                part_split = part.split(' ')
                team_name = ' '.join(part_split[1:-2])
                team_names.append(team_name)
                
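    # The scraped submission appears in every match, so its team is the most frequent name.
    # max() raises ValueError when no names were parsed, e.g. when the page failed to load.
    team_name = max(set(team_names), key=team_names.count)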
    return team_name

@lru_cache(maxsize=1000)
def get_team_name(submission_id):
    soup = getSoup(submission_id)
    return get_team_name_from_soup(soup)
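To make the parsing in `get_team_name_from_soup` easier to follow, the sketch below applies the same token slicing to a few made-up match lines. The span texts are hypothetical and the real page layout may differ. What the code relies on is only that each side of `' vs '` starts with one token and ends with two tokens that are not part of the team name.

```python
def team_names_from_text(text):
    """Apply the same token slicing as get_team_name_from_soup to one line."""
    names = []
    for part in text.split(' vs '):
        tokens = part.split(' ')
        # Drop the first token and the last two; what remains is the name.
        names.append(' '.join(tokens[1:-2]))
    return names


# Hypothetical span texts; the real page layout may differ.
spans = [
    "[W] Team Alpha 1500 (+3) vs [L] Beta 1490 (-3)",
    "[L] Gamma 1480 (-2) vs [W] Team Alpha 1503 (+2)",
    "[W] Team Alpha 1505 (+1) vs [L] Delta 1470 (-1)",
]

all_names = [name for text in spans for name in team_names_from_text(text)]

# The scraped submission takes part in every match, so its team is the mode.
most_common = max(set(all_names), key=all_names.count)
```

Opponents change from match to match, while the submission's own team appears on every line, which is why picking the most frequent name works.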

## Creating a csv for later downloading the files

To download the files I need the episode id. I only want to download matches from the agents with the highest ranking.

Thus I'm going to update `epagents_df` to include the final ranking and later remove agents with low score.

In [None]:
submission_id_to_final_scores = {key: value for key, value in zip(leaderboard.SubmissionId, leaderboard.UpdatedScore)}

epagents_df['FinalScore'] = epagents_df['SubmissionId'].apply(lambda x: submission_id_to_final_scores.get(x, -100))

In [None]:
SCORE_THRESHOLD = 1750
selection = epagents_df[epagents_df['FinalScore'] > SCORE_THRESHOLD]
# sort_values returns a new frame, which avoids modifying a slice of epagents_df in place
selection = selection.sort_values('FinalScore', ascending=False)
len(selection),  len(selection.SubmissionId.unique())

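This is far from elegant, but it works. Kaggle does not seem to like receiving too many requests: when no team names can be parsed from the page, `get_team_name` raises a `ValueError`, so the loop waits 5 minutes and starts again. Names that were already fetched are served from the `lru_cache`, so they are not requested again.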

In [None]:
while True:
    try:
        for submission_id in tqdm(selection['SubmissionId'].unique()):
            get_team_name(submission_id)
        break
    except ValueError:
        time.sleep(300)
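The loop above retries forever. A bounded variant could look like the sketch below. The helper name `retry` and its parameters are hypothetical and not part of this notebook. It waits between attempts and re-raises the error once the attempts are used up.

```python
import time


def retry(func, *args, wait_seconds=300, max_attempts=5,
          exceptions=(ValueError,), sleep=time.sleep):
    """Call func(*args), waiting between failures, for at most max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except exceptions:
            if attempt == max_attempts:
                raise  # give up instead of looping forever
            sleep(wait_seconds)
```

In the notebook this would be used as `retry(get_team_name, submission_id)` inside the loop over submissions. The `sleep` argument is only there so the waiting can be replaced when trying the helper out.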

In [None]:
submission_id_to_team = {submission_id: get_team_name(submission_id) for submission_id in tqdm(selection['SubmissionId'].unique())}

In [None]:
tqdm.pandas()
selection['Team'] = selection['SubmissionId'].progress_apply(lambda x: submission_id_to_team[x])

In [None]:
selection.to_csv('agent_selection_%s.csv' % time.strftime("%Y%m%d"), index=False)

In [None]:
selection.head()

In [None]:
selection.tail()

In [None]:
selection.groupby('SubmissionId').tail(1)[['FinalScore', 'UpdatedConfidence', 'Team', 'CreateTime', 'SubmissionId']].head(50)

In [None]:
selection.groupby('SubmissionId').tail(1)[['FinalScore', 'UpdatedConfidence', 'Team', 'CreateTime', 'SubmissionId']].to_csv(
    'leaderboard_%s.csv' % time.strftime("%Y%m%d"), index=False)

In [None]:
with open('submission_id_to_team.json', 'w') as f:
    json.dump({int(key): value for key, value in submission_id_to_team.items()}, f)
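When this file is loaded again, the submission ids come back as strings, because JSON object keys are always strings. A small sketch of the round trip follows; the ids and team names are made up.

```python
import json

# Hypothetical ids and team names.
mapping = {123: "Team Alpha", 456: "Beta"}

encoded = json.dumps(mapping)

# After decoding, the keys are the strings "123" and "456".
decoded = json.loads(encoded)

# Convert the keys back to int before looking up SubmissionId values.
restored = {int(key): value for key, value in decoded.items()}
```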