## Part I: Collecting the data 
---
#### Data:
History of teams for each player within the 2018 Dota 2 pro circuit
#### Source: 
[I.](#Request-DPC-rankings-and-setup-players) Liquipedia [Dota Pro Circuit Player Rankings](https://liquipedia.net/dota2/Dota_Pro_Circuit/Rankings/Players) <br>
[II.](#Request-individual-player-infoboxes) Respective Liquipedia [player infoboxes](https://liquipedia.net/dota2/Template:Infobox_player)
#### Method:
[MediaWiki API](https://liquipedia.net/commons/Liquipedia:API_Usage_Guidelines) for Liquipedia ( [usage compliance](#Liquipedia-API-usage-compliance) is *strongly* recommended 👍)

---

In [1]:
import re
import requests
import requests_cache
from urllib.parse import quote

import time
from datetime import datetime, timedelta

from pprint import pprint

### Liquipedia API usage compliance 

<details><summary>To avoid being IP blocked then shamefully pleading to be unblocked on <a href="https://discord.gg/x8kRmqu">discord</a> - <i>sadly from experience</i></summary><img src="img/pepehands.png" alt="pepehands.png"></details>


Complete guidelines can be found [here](https://liquipedia.net/commons/Liquipedia:API_Usage_Guidelines).

#### 1. Cache and re-use data

Changes to DPC points and team shuffles are sufficiently infrequent for 24-hour caches.

In [2]:
requests_cache.install_cache(
    cache_name='dpc_cache',
    backend='sqlite',
    expire_after=60*60*24
)

#### 2. Limit requests to every 2 seconds

A decorator enforces the interval by recording the time of the last request and sleeping before the next one if it arrives too soon.

In [3]:
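def throttle(f):
    """Space out calls to f so they run at least 2 seconds apart."""
    def wrap(*args, **kwds):
        # total_seconds() covers the whole interval; .seconds drops the days part
        elapsed = (datetime.now() - wrap.last).total_seconds()
        if elapsed < 2:
            time.sleep(2 - elapsed)  # wait only for the remainder
        wrap.last = datetime.now()   # record the time after any sleep
        return f(*args, **kwds)
    wrap.last = datetime.now() - timedelta(seconds=2)  # do not delay the first call
    return wrap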
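
A quick way to sanity-check a throttle is to time a few calls to a dummy function. The sketch below is self-contained and uses a hypothetical `throttle_every(interval)` helper (not part of this notebook) with a short 0.2-second interval so it finishes quickly; the gaps between calls should come out at roughly the interval or more.

```python
import time
from functools import wraps

def throttle_every(interval):
    """Return a decorator that spaces calls at least `interval` seconds apart."""
    def decorator(f):
        last = None  # monotonic timestamp of the previous call, None before the first

        @wraps(f)
        def wrapper(*args, **kwargs):
            nonlocal last
            if last is not None:
                remaining = interval - (time.monotonic() - last)
                if remaining > 0:
                    time.sleep(remaining)  # wait only for what is left of the interval
            last = time.monotonic()        # recorded after any sleep
            return f(*args, **kwargs)
        return wrapper
    return decorator

@throttle_every(0.2)  # short interval so the demo finishes quickly
def stamp():
    return time.monotonic()

stamps = [stamp() for _ in range(3)]
gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
print([round(g, 2) for g in gaps])
```

`time.monotonic()` is used here instead of `datetime.now()` because it cannot jump backwards when the system clock is adjusted.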

#### 3. Include user-agent header with usage info and accept gzip encoding

Don't forget to credit Liquipedia as a data source in the project as well.

In [4]:
headers = {
    'User-Agent': 'DPC Connectivity 1.0/youmikoh@github',
    'Accept-Encoding': 'gzip'
}

#### 4. Stage requests and extract data 

Note that automated requests of generated HTML pages are prohibited; request only proper API endpoints as per MediaWiki. Further information on the request schema can be found [here](https://www.mediawiki.org/wiki/API:Main_page#A_simple_example).

In [5]:
@throttle
def liquidpedia_content(key, qualifier):
    api = f'https://liquipedia.net/dota2/api.php?{key}={quote(qualifier)}&action=query&format=json&prop=revisions&rvprop=content&rvsection=0'
    response = requests.get(api, headers=headers)
    response.raise_for_status()  # fail loudly on HTTP errors (e.g. rate limiting)
    data = response.json()
    pages = data.get('query', {}).get('pages', {}).values()
    # missing or invalid titles come back without 'revisions'; skip them
    return [p['revisions'][0]['*'] for p in pages if p.get('revisions')]
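
To make the extraction step easier to follow without hitting the API, here is a small sketch run against a hand-written dictionary that mimics the JSON layout of a `prop=revisions&rvprop=content` query. The page ids, titles and wikitext in `sample` are made up, and `extract_wikitext` is a hypothetical helper name.

```python
# Hand-written sample mimicking the JSON layout of
# action=query&prop=revisions&rvprop=content (not real Liquipedia data).
sample = {
    "query": {
        "pages": {
            "101": {
                "pageid": 101,
                "title": "ExamplePlayer",
                "revisions": [
                    {"contentformat": "text/x-wiki",
                     "*": "{{Infobox player|id=ExamplePlayer }}"}
                ],
            },
            # A title that does not exist comes back without 'revisions'.
            "-1": {"title": "NoSuchPage", "missing": ""},
        }
    }
}

def extract_wikitext(data):
    """Return the wikitext of every page that carries a revision."""
    pages = data.get("query", {}).get("pages", {}).values()
    return [p["revisions"][0]["*"] for p in pages if p.get("revisions")]

print(extract_wikitext(sample))
```

The wikitext sits under the `'*'` key of the first (and, for this query, only) revision of each page.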

### Request DPC rankings and setup players

To collect data from a specific page, query using the page id found under '[Tools](https://liquipedia.net/dota2/Dota_Pro_Circuit/Rankings/Players)' > '[Page information](https://liquipedia.net/dota2/index.php?title=Dota_Pro_Circuit/Rankings/Players&action=info)' in the upper-right corner. To the far left of 'Tools' on the same menu bar, the '[Edit](https://liquipedia.net/dota2/index.php?title=Dota_Pro_Circuit/Rankings/Players&action=edit)' tab shows the source of the page. 🕵️ Examining it helps a great deal when parsing and extracting the appropriate data.

Only players with DPC points are considered.

<details><summary>Keep in mind that the data may require <strike>some</strike> scrubbing. Also beware of finicky white-spaces, case sensitivity and <i>oddballs</i> when working with regular expressions.</summary><img src="img/puppey.gif" alt="puppey.gif"></details>

In [6]:
dpc_content = liquidpedia_content('pageids', '70879').pop() 

dpc_content = re.sub(r'\s+\|', '|', dpc_content)
dpc_content = re.sub(r'transfer.+?\|', '', dpc_content, flags=re.IGNORECASE)
dpc_content = re.findall( #oddballs: "Noone|No[o]ne", "Maybe|Somnus丶M"
    r'TableRow/DPC1718Players\|.*?\|(.*?)\|\[+(.*?)\]+\|.*?\|(.*?)\|', 
    dpc_content,
    flags=re.IGNORECASE
)

dpc = {}

for p in dpc_content:
    flag, name, points = p
    points = int(points)
    if points > 0:
        # "key|alias" links (e.g. "Noone|No[o]ne") carry a separate display name
        key, _, alias = name.partition('|')
        alias = alias or key
        dpc[key] = {'Alias':alias, 'Points':points, 'Flag':flag}

In [7]:
print(f'\nAfter collecting dpc content, {len(dpc)} players of the form:\n')
pprint(dpc['fy'])


After collecting dpc content, 126 players of the form:

{'Alias': 'fy', 'Flag': 'cn', 'Points': 2444}
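
To see what the row pattern captures, the sketch below applies the same two regular expressions to a couple of hand-written rows. The field order in `sample` is an assumption about the template (inferred from the pattern itself), and the second row's team and points are made up.

```python
import re

# Hypothetical rows: the template's real field order is an assumption here.
sample = """
{{TableRow/DPC1718Players |1 |cn |[[fy]] |LGD Gaming |2444 |}}
{{TableRow/DPC1718Players |2 |ru |[[Noone|No[o]ne]] |Some Team |100 |}}
"""

# Drop whitespace in front of pipes so the fields line up predictably.
normalized = re.sub(r'\s+\|', '|', sample)

# Capture flag, bracketed player link and points; skip the other fields.
rows = re.findall(
    r'TableRow/DPC1718Players\|.*?\|(.*?)\|\[+(.*?)\]+\|.*?\|(.*?)\|',
    normalized,
    flags=re.IGNORECASE,
)

players = {}
for flag, name, points in rows:
    # A "key|alias" link keeps the page title before the pipe.
    key, _, alias = name.partition('|')
    players[key] = {'Alias': alias or key, 'Points': int(points), 'Flag': flag}

print(rows)
print(players)
```

The oddball link `Noone|No[o]ne` survives because the lazy group only stops at closing brackets that are immediately followed by a pipe.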


### Request individual player infoboxes

Given a compilation of all DPC players, instead of using page ids to request player data, query using player names as the title of pages. To minimize the number of requests, multiple titles are queried with each request. Titles are sliced into chunks of 50, the [maximum number of values](https://www.mediawiki.org/wiki/API:Query) a single request can handle.

In [8]:
def collect_history_chunks():
    players = list(dpc.keys())

    while players: 
        last = min(len(players), 50)
        chunk = '|'.join(players[:last])
        pages = liquidpedia_content('titles', chunk)

        # the infobox "|id=" field holds the player's name as listed in dpc
        find_id = lambda page: re.findall(r'\|id=(.*?)\s', page, flags=re.IGNORECASE)
        ids_pages = [(find_id(page).pop(), page) for page in pages if find_id(page)]

        for player_id, page in ids_pages:
            if player_id in dpc: compile_history(player_id, page)

        players = players[last:]
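
The chunking on its own can be sketched as a small generator. `chunked_titles` is a hypothetical helper and the player names are placeholders; with 126 names and a chunk size of 50, three requests would be needed.

```python
def chunked_titles(titles, size=50):
    """Yield '|'-joined groups of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield '|'.join(titles[start:start + size])

names = [f'Player{i}' for i in range(126)]  # placeholder names
batches = list(chunked_titles(names))

# Number of titles carried by each request.
print([batch.count('|') + 1 for batch in batches])  # → [50, 50, 26]
```

Counting pipes only works here because the placeholder names contain none; real titles would be counted before joining.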

#### Titles versus aliases

Above, it was assumed that player names would accurately reflect the page titles. Unfortunately, in cases where players have changed names on a whim, the title no longer matches their new alias and a redirect is invoked. Since the majority of player histories have been captured in the chunks above, the pages of the finicky remaining players are requested individually.

In [9]:
def collect_history_individually():
    players_without_history = [p for p in dpc.keys() if not dpc[p].get('History')]
    
    for player in players_without_history:
        content = liquidpedia_content('titles', player).pop()
        
        if '#REDIRECT' in content:
            redirect = re.findall(r'\[+(.*?)\]+', content, flags=re.IGNORECASE).pop()  # raw string avoids invalid-escape warnings
            dpc[redirect] = dpc[player]
            dpc.pop(player)
            player = redirect
            content = liquidpedia_content('titles', player).pop()
            
        compile_history(player, content)
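
Redirect detection can be tried out on plain strings. This is a minimal sketch with a hypothetical `redirect_target` helper and made-up page names; it assumes a redirect page starts with `#REDIRECT [[Target]]` in any letter case. The MediaWiki query API also accepts a `redirects` parameter that resolves redirects on the server side, which might make the second request unnecessary, though that route is untested here.

```python
import re

# A redirect page starts with "#REDIRECT [[Target]]" (letter case may vary).
REDIRECT = re.compile(r'^\s*#REDIRECT\s*\[\[(.*?)\]\]', flags=re.IGNORECASE)

def redirect_target(wikitext):
    """Return the redirect target of a page, or None for a regular page."""
    match = REDIRECT.match(wikitext)
    if not match:
        return None
    # Drop any '#section' anchor or '|label' suffix from the link.
    return re.split(r'[#|]', match.group(1))[0].strip()

print(redirect_target('#REDIRECT [[NewName]]'))
print(redirect_target('{{Infobox player|id=fy }}'))
```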

#### Templates for parsing

In addition to viewing the page source, there are associated [templates](https://liquipedia.net/dota2/Template:Infobox_player) to facilitate the parsing process. Nonetheless, parsing dates still sucks.

<details><summary><a href="https://liquipedia.net/dota2/Artstyle">Artstyle</a> couldn't have joined DTS Gaming on <i>June 31, 2010</i></summary><img src="img/artstyle.png" alt="artstyle.png"></details>

<details><summary><a href="https://liquipedia.net/dota2/CemaTheSlayeR">CemaTheSlayer</a> & <a href="https://liquipedia.net/dota2/Iceberg">Iceberg</a>: men of <i>mystery</i>❓</summary><img src="img/cema.png" alt="cema.png"><img src="img/iceberg.png" alt="iceberg.png"></details>

In [10]:
today = datetime.strftime(datetime.today(), '%Y-%m-%d')

def compile_history(key, content):
    content = re.sub(r'["\']', '', content)  # strip quotes, including wiki bold/italic markup
    content = re.sub('present', today, content, flags=re.IGNORECASE)

    history = re.findall(r'{{TH\|(.*?)}}', content, flags=re.IGNORECASE)
    history = [h.split('|') for h in history]
    dpc[key]['History'] = timeline(history)   

def timeline(content):
    timeframes = {}
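    # keep entries that name a team and hold a full date range
    # (two 10-character dates plus a separator, so 23+ characters)
    content = [c for c in content if len(c) > 1 and len(c[0]) > 22]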
    for c in content:
        dates, team = c[0].strip(), c[1]
        y0, m0, d0 = dates[:10].split('-')
        y1, m1, d1 = dates[-10:].split('-')

        known = lambda k: '?' not in k
        if known(y0 + m0 + y1 + m1): #unknown dates: Iceberg "201?-??-??"
            if not known(d0): d0 = '01'
            if not known(d1): d1 = d0

            timeframe = {'start': date(y0, m0, d0), 'end': date(y1, m1, d1)}
            if team in timeframes: timeframes[team].append(timeframe)
            else: timeframes[team] = [timeframe]
            
    return timeframes

import calendar

def date(y, m, d):
    y, m, d = int(y), int(m), int(d)
    last_day = calendar.monthrange(y, m)[1]  # number of days in that month
    return datetime(y, m, min(d, last_day)) #invalid dates: Artstyle "2010-06-31"
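
The date handling can be exercised on a hand-written history string. In this sketch `safe_date` and `parse_th` are hypothetical names, the `' - '` separator is an assumption (only the first and last 10 characters of the date field are read, so the exact separator does not matter), and the second entry's end date is made up to show the unknown-day fallback.

```python
import calendar
import re
from datetime import datetime

# Hand-written sample; separator and the second end date are assumptions.
sample = ("{{TH|2016-12-26 - 2017-09-04|Team VGJ}}"
          "{{TH|2010-06-31 - 2010-07-??|DTS Gaming}}")

def safe_date(y, m, d):
    """Build a datetime, clamping the day to the last valid day of the month."""
    y, m, d = int(y), int(m), int(d)
    return datetime(y, m, min(d, calendar.monthrange(y, m)[1]))

def parse_th(wikitext):
    """Return (team, start, end) tuples for every usable {{TH|...}} entry."""
    spans = []
    pattern = r'\{\{TH\|(.*?)\|(.*?)\}\}'
    for dates, team in re.findall(pattern, wikitext, flags=re.IGNORECASE):
        dates = dates.strip()
        y0, m0, d0 = dates[:10].split('-')
        y1, m1, d1 = dates[-10:].split('-')
        if '?' in y0 + m0 + y1 + m1:
            continue  # year or month unknown: skip the entry
        d0 = '01' if '?' in d0 else d0
        d1 = d0 if '?' in d1 else d1  # same fallback as the notebook's timeline()
        spans.append((team, safe_date(y0, m0, d0), safe_date(y1, m1, d1)))
    return spans

for team, start, end in parse_th(sample):
    print(team, start.date(), end.date())
```

Clamping with `calendar.monthrange` also covers cases such as February 30, where simply subtracting one day would still produce an invalid date.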

In [11]:
collect_history_chunks()
collect_history_individually()

In [12]:
print(f'\nAfter collecting team histories, {len(dpc)} players of the form:\n')
pprint(dpc['fy'])


After collecting team histories, 126 players of the form:

{'Alias': 'fy',
 'Flag': 'cn',
 'History': {'LGD Gaming': [{'end': datetime.datetime(2018, 6, 10, 0, 0),
                             'start': datetime.datetime(2017, 9, 4, 0, 0)}],
             'Team VGJ': [{'end': datetime.datetime(2017, 9, 4, 0, 0),
                           'start': datetime.datetime(2016, 12, 26, 0, 0)}],
             'Vici Gaming': [{'end': datetime.datetime(2016, 3, 19, 0, 0),
                              'start': datetime.datetime(2012, 10, 21, 0, 0)},
                             {'end': datetime.datetime(2016, 12, 26, 0, 0),
                              'start': datetime.datetime(2016, 9, 16, 0, 0)}],
             'Vici Gaming Reborn': [{'end': datetime.datetime(2016, 8, 30, 0, 0),
                                     'start': datetime.datetime(2016, 3, 19, 0, 0)}]},
 'Points': 2444}


---
## Continue to Part II: [Stratify the data](part2_stratify.ipynb)
---
<br><br>

In [1]:
from IPython.display import HTML
with open("css/dpc_ipynb.css", "r") as css:
    style = css.read()
HTML(style) #IPYNB STYLING