split data #1

Open
jabadia opened this issue Sep 5, 2018 · 2 comments
jabadia commented Sep 5, 2018

hi,

great job here!
have you tried getting the 5km split times? any luck with that?

thanks!

stappit commented Sep 5, 2018 via email

jabadia commented Sep 6, 2018

so cool!

Well, I poked around the web page a bit and discovered that it shows the split times when you click on a runner. So there is an API for the splits, but unfortunately it only returns them for one runner at a time.
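For reference, here is a minimal sketch of how the per-runner URL is put together and how the runner id can be read back from it. The endpoint and the `t`, `m` and `pid` parameters come from the example URL in the function further down; the helper names `split_url` and `participant_id_from_url` are just illustrative, and I am only assuming that `t` selects the event and year.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Endpoint taken from the example URL in this thread.
BASE_URL = "https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/ajax.results.php"


def split_url(participant_id, event="BM_2017"):
    """Build the detail URL for one runner (hypothetical helper)."""
    query = urlencode({"t": event, "m": "d", "pid": participant_id})
    return f"{BASE_URL}?{query}"


def participant_id_from_url(url):
    """Read the pid back from a URL, regardless of parameter order."""
    return int(parse_qs(urlparse(url).query)["pid"][0])


if __name__ == "__main__":
    url = split_url(11995)
    print(url)
    print(participant_id_from_url(url))
```

Parsing the query string rather than splitting on `&pid=` keeps working even if the parameters end up in a different order.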

I wrote this function to scrape them. The problem is that there are tens of thousands of runners, so I restricted the scrape to the subset I'm interested in (age_class=40) for a single year (I modified your scrape.py to fetch only the 2017 results).

I also used grequests, a wrapper around requests that sends many requests in parallel. I tried different concurrency values and found that you can go up to around 100 simultaneous requests, which speeds things up a lot. In my development environment it took about 12 minutes to scrape 6.5K runners.
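If you would rather not depend on grequests, the same idea (a capped number of requests in flight) can be sketched with the standard library's thread pool. This is not what I actually ran, only an alternative under assumptions: `fetch_all` and `fetch_one` are hypothetical names, and `fetch_one` stands for whatever function downloads one runner's page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(participant_ids, fetch_one, max_workers=100):
    """Call fetch_one(pid) for every id, with at most max_workers running at once.

    Returns (results, failures): two dicts keyed by participant id.
    """
    results = {}
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the id it was submitted for.
        futures = {pool.submit(fetch_one, pid): pid for pid in participant_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:  # keep going if a single runner fails
                failures[pid] = exc
    return results, failures


if __name__ == "__main__":
    # Stand-in for a real download, so the sketch runs without network access.
    def fake_fetch(pid):
        return f"<html>splits for {pid}</html>"

    pages, errors = fetch_all(range(5), fake_fetch, max_workers=3)
    print(len(pages), len(errors))
```

Collecting failures separately makes it easy to retry only the runners that did not come back.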

```python
import time
from urllib.parse import parse_qs, urlparse

import grequests
import pandas as pd
from lxml import etree


def download_split_times(dirty_filename, splits_file):
    df = pd.read_csv(dirty_filename).sort_values(['year', 'id']).reset_index(drop=True)

    # convert times in HH:MM:SS to floats representing minutes
    df['net_minutes'] = pd.to_timedelta(df['net_time']) / pd.Timedelta(minutes=1)
    df['clock_minutes'] = pd.to_timedelta(df['clock_time']) / pd.Timedelta(minutes=1)

    participants_by_age_class = df.age_class.value_counts().sort_index()
    print(participants_by_age_class)

    # today, I'm interested only in participants in my same age class
    same_age_participants = df[df['age_class'] == '40'].reset_index()
    print(same_age_participants.shape)

    # we need to send one request per participant :-(
    # like this one https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/ajax.results.php?t=BM_2017&m=d&pid=11995
    url = "https://www.bmw-berlin-marathon.com/files/addons/scc_events_data/ajax.results.php"
    params = {
        't': 'BM_2017',     # common params
        'm': 'd'
    }
    rs = [
        grequests.get(url, params=dict(params, pid=participant_id))  # combine with participant id
        for participant_id in same_age_participants.id
    ]

    t0 = time.time()
    # size = max parallel requests, don't put too many;
    # imap yields responses as they complete, not in request order
    results = grequests.imap(rs, size=100)
    count = 0
    for result in results:
        # unfortunately the response is an html fragment we need to parse
        doc = etree.HTML(result.text)
        split_time_headers = doc.xpath("//div[@class='gridResultsDetailHead']")
        split_headers = [element.text for element in split_time_headers]  # ['5 km', '10 km', '15 km', '20 km', '21,1 km' ... '40 km']
        split_time_divs = doc.xpath("//div[@class='gridResultsDetailBody']")
        split_times = [element.text for element in split_time_divs]  # ['00:14:29', '00:29:04', ...]

        # read the pid from the query string so parameter order does not matter
        participant_id = int(parse_qs(urlparse(result.request.url).query)['pid'][0])
        count += 1
        print(count, participant_id, ' '.join(split_times))
        for distance, split_time in zip(split_headers, split_times):
            same_age_participants.loc[same_age_participants.id == participant_id, distance] = \
                pd.Timedelta(split_time) / pd.Timedelta(minutes=1)  # store a float with minutes

    t1 = time.time()
    print("time taken: %.1f sec" % (t1 - t0,))
    same_age_participants.to_csv(splits_file, index=False)


if __name__ == '__main__':
    download_split_times('data/berlin_marathon_times_dirty.csv', 'data/with_split.csv')
```
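Once the splits are in, the column labels and times still need a bit of massaging before they are useful. Here is a rough, standard-library-only sketch, assuming the labels look like the ones in the comments above ('5 km', '21,1 km' with a decimal comma), the times are HH:MM:SS, and the splits are cumulative from the start. The function names are illustrative.

```python
def label_to_km(label):
    """Turn a header such as '21,1 km' into a float number of kilometres."""
    return float(label.replace("km", "").strip().replace(",", "."))


def clock_to_minutes(clock):
    """Turn 'HH:MM:SS' into a float number of minutes."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 60 + minutes + seconds / 60


def segment_paces(headers, times):
    """Pace in min/km for each segment, assuming the splits are cumulative."""
    paces = []
    previous_km, previous_minutes = 0.0, 0.0
    for label, clock in zip(headers, times):
        km = label_to_km(label)
        minutes = clock_to_minutes(clock)
        paces.append((label, (minutes - previous_minutes) / (km - previous_km)))
        previous_km, previous_minutes = km, minutes
    return paces


if __name__ == "__main__":
    # Sample values copied from the comments in the function above.
    for label, pace in segment_paces(["5 km", "10 km"], ["00:14:29", "00:29:04"]):
        print(f"{label}: {pace:.2f} min/km")
```

Per-segment pace is just one thing you could derive; the same two converters also work for plotting the cumulative times directly.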
