# Introduction

Our goal is to find short "undervalued" trail races in Belgium. We define an undervalued race as one that meets the following criteria:
- a distance between 10 and 20 km (25 km at most)
- fewer runners than average
- few strong performers among the participants
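
As a rough illustration of how these three criteria could be combined once the data is available, here is a minimal sketch. The race records, the field names (`distance_km`, `n_runners`, `mean_bt_score`) and the idea of comparing a race's mean score with the overall mean are all assumptions made for this example; none of them come from Betrail.

```python
from statistics import mean

# Made-up sample data; the field names are hypothetical.
races = [
    {"name": "A", "distance_km": 15, "n_runners": 80, "mean_bt_score": 4200},
    {"name": "B", "distance_km": 18, "n_runners": 400, "mean_bt_score": 5600},
    {"name": "C", "distance_km": 42, "n_runners": 60, "mean_bt_score": 4000},
    {"name": "D", "distance_km": 12, "n_runners": 120, "mean_bt_score": 5900},
]

avg_runners = mean(r["n_runners"] for r in races)
avg_score = mean(r["mean_bt_score"] for r in races)

# A race qualifies only if it passes all three criteria.
undervalued = [
    r["name"]
    for r in races
    if 10 <= r["distance_km"] <= 25          # short race (25 km at most)
    and r["n_runners"] < avg_runners         # fewer runners than average
    and r["mean_bt_score"] < avg_score       # assumed proxy for "few strong performers"
]
print(undervalued)
```

The proxy used for the third criterion is only one possible choice; the share of participants above a given score would be another.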

To achieve that goal, we are going to scrape data from [Betrail](https://www.betrail.run/home). Betrail initially listed the trail races in Belgium and has since expanded to Luxembourg, France and the Netherlands.

Betrail holds the following information:
- a ranking of all runners based on their performance
- a listing of all trails

By combining the two, we should be able to spot "undervalued" trails.

# Extract list of all runners

As a first step, we are going to extract the list of runners.

Because the website loads its content dynamically (25 runners at a time), we have to repeat our GET request several times. The example below fetches the first 25 runners of the global ranking (Belgium, France, Netherlands and Luxembourg).

In [1]:
import requests
import json

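# The last URL segment is the offset: 0 returns runners 1 to 25 of the ranking
response_1_to_25 = requests.get("https://www.betrail.run/api/score/type/bt_score/ALL/-1/0")
status_code_1_to_25 = response_1_to_25.status_code  # 200 if the request succeeded
content_1_to_25 = response_1_to_25.content
json_content_1_to_25 = json.loads(content_1_to_25)
json_content_1_to_25["body"][0]  # first runner of the ranking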

{'id': '1268136ALLCURRENT',
 'title': None,
 'bt_score': 9540,
 'bt_min_perf': 8966,
 'btu_score': 9841,
 'bts_score': 8499,
 'btu_min_perf': 9834,
 'bts_min_perf': 8379,
 'tc_points': 3228,
 'tc_pts': 2000,
 'uc_points': 2484,
 'uc_pts': 1410,
 'oldest_result_date': 1571263200,
 'nb_races': 12,
 'nb_races_ultra': 5,
 'gender': 0,
 'year': 'CURRENT',
 'bt_score_result_reid_1': 3615145,
 'bt_score_result_reid_2': 3383215,
 'bt_score_result_reid_3': 3199588,
 'btu_score_result_reid_1': 3615145,
 'btu_score_result_reid_2': 3383215,
 'bts_score_result_reid_1': 2843249,
 'bts_score_result_reid_2': 2568626,
 'bts_score_result_reid_3': 2785811,
 'country': 'ALL',
 'ruid': 1268136,
 'runner': {'lastname': 'POMMERET',
  'firstname': 'LUDOVIC',
  'nickname': None,
  'title': 'POMMERET LUDOVIC',
  'display_title': 'POMMERET LUDOVIC',
  'uid': 1,
  'has_account': None,
  'account_created': None,
  'gender': 0,
  'birthdate': 157762800,
  'postal_code': None,
  'place': None,
  'country': 'FR',
  '
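
Since each request returns 25 runners and the last URL segment is the offset, the full list of page URLs can be generated up front. The sketch below only builds the URLs and performs no request. The helper name `page_urls` and the constant names are hypothetical; the URL pattern is the one used in the cell above.

```python
BASE_URL = "https://www.betrail.run/api/score/type/bt_score/ALL/-1/{}"
PAGE_SIZE = 25  # runners returned per request


def page_urls(n_runners):
    """Return one URL per page needed to cover n_runners (hypothetical helper)."""
    return [BASE_URL.format(offset) for offset in range(0, n_runners, PAGE_SIZE)]


urls = page_urls(60)
for url in urls:
    print(url)
```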

Based on the above result, we are going to create an empty DataFrame to store the information.
As we don't yet know exactly which fields we will need later, we store almost all of them.

In [2]:
import pandas as pd

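# Note: 'country' appears twice: once from the ranking entry (e.g. 'ALL')
# and once from the nested "runner" object (e.g. 'FR')
runner_df = pd.DataFrame(columns = ['id', 'bt_score', 'bt_min_perf', 'btu_score', 'bts_score', 'btu_min_perf', 'bts_min_perf', 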
                                    'tc_points', 'tc_pts', 'uc_points', 'uc_pts', 'oldest_result_date', 'nb_races', 
                                    'nb_races_ultra', 'year', 'bt_score_result_reid_1', 'bt_score_result_reid_2', 
                                    'bt_score_result_reid_3', 'btu_score_result_reid_1', 'btu_score_result_reid_2', 
                                    'bts_score_result_reid_1', 'bts_score_result_reid_2', 'bts_score_result_reid_3', 
                                    'country', 'ruid', 'lastname', 'firstname', 'nickname', 'title', 'display_title', 'uid', 
                                    'has_account', 'account_created', 'gender', 'birthdate', 'postal_code', 'place',
                                    'country', 'nationality', 'team', 'geo_lat', 'geo_lon', 'display_options', 'alias', 
                                    'avatar', 'cover'])

We then extract the content of each JSON object to fill our DataFrame.

In [3]:
import numpy as np

body_columns = ['id', 'bt_score', 'bt_min_perf', 'btu_score', 'bts_score', 'btu_min_perf', 'bts_min_perf', 'tc_points', 
                'tc_pts', 'uc_points', 'uc_pts', 'oldest_result_date', 'nb_races', 'nb_races_ultra', 'year', 
                'bt_score_result_reid_1', 'bt_score_result_reid_2', 'bt_score_result_reid_3', 'btu_score_result_reid_1',
                'btu_score_result_reid_2', 'bts_score_result_reid_1', 'bts_score_result_reid_2', 
                'bts_score_result_reid_3', 'country', 'ruid']
    
body_runner_columns = ['lastname', 'firstname', 'nickname', 'title', 'display_title', 'uid', 'has_account', 'account_created', 
                       'gender', 'birthdate', 'postal_code', 'place', 'country', 'nationality', 'team', 'geo_lat', 'geo_lon', 
                       'display_options', 'alias', 'avatar', 'cover']

index = 0

for i in range(0, 51000, 25): # There seem to be slightly fewer than 51,000 runners ranked on Betrail
    response = requests.get("https://www.betrail.run/api/score/type/bt_score/ALL/-1/{}".format(i))
    content = response.content
    json_content = json.loads(content)
    
    for item in json_content["body"]:
        runner_info_var_list = []
        
        # Top-level fields of the ranking entry
        for body_col in body_columns:
            try:
                info_var = item[body_col]
            except (KeyError, TypeError): # missing key, or entry is not a dict
                info_var = np.nan
            runner_info_var_list.append(info_var)
        
        # Fields nested under "runner"
        for body_runner_col in body_runner_columns:
            try:
                info_var = item["runner"][body_runner_col]
            except (KeyError, TypeError): # missing key, or "runner" is None
                info_var = np.nan
            runner_info_var_list.append(info_var)

        runner_df.loc[index] = runner_info_var_list
        
        index += 1
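
Adding rows one at a time with `.loc` becomes slow as the DataFrame grows. A possible alternative, sketched below, is to flatten each entry into a plain dict, collect the dicts in a list, and build the DataFrame once at the end with `pd.DataFrame(rows)`. The helper `flatten_runner` and the `runner_` prefix are assumptions made for this sketch, not part of the notebook above. The prefix keeps the nested `country` from colliding with the top-level one.

```python
def flatten_runner(item, body_columns, runner_columns):
    """Flatten one ranking entry into a single dict (hypothetical helper)."""
    runner = item.get("runner") or {}  # "runner" may be missing or None
    # Missing fields become None.
    row = {col: item.get(col) for col in body_columns}
    # Prefix nested fields so the runner's country does not overwrite
    # the country of the ranking entry.
    row.update({"runner_" + col: runner.get(col) for col in runner_columns})
    return row


# Small sample shaped like the API output shown earlier.
sample = {
    "id": "1268136ALLCURRENT",
    "country": "ALL",
    "runner": {"lastname": "POMMERET", "country": "FR"},
}
row = flatten_runner(sample, ["id", "country", "bt_score"], ["lastname", "country", "team"])
print(row)

# In the scraping loop, each row would be appended to a list:
#     rows.append(flatten_runner(item, body_columns, body_runner_columns))
# and the table built once afterwards:
#     runner_df = pd.DataFrame(rows)
```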

In [11]:
runner_df.shape

(51025, 46)

As the number of runners seems to have increased, we are going to append the latest rows to our DataFrame.

In [12]:
index = 51025

for i in range(51025, 51125, 25): # Four more pages, starting right after the rows already collected
    response = requests.get("https://www.betrail.run/api/score/type/bt_score/ALL/-1/{}".format(i))
    content = response.content
    json_content = json.loads(content)
    
    for item in json_content["body"]:
        runner_info_var_list = []
        
        # Top-level fields of the ranking entry
        for body_col in body_columns:
            try:
                info_var = item[body_col]
            except (KeyError, TypeError): # missing key, or entry is not a dict
                info_var = np.nan
            runner_info_var_list.append(info_var)
        
        # Fields nested under "runner"
        for body_runner_col in body_runner_columns:
            try:
                info_var = item["runner"][body_runner_col]
            except (KeyError, TypeError): # missing key, or "runner" is None
                info_var = np.nan
            runner_info_var_list.append(info_var)

        runner_df.loc[index] = runner_info_var_list
        
        index += 1
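
Hard-coding the upper bound means the range has to be adjusted by hand whenever the ranking grows. One way around this, sketched below, is to keep requesting pages until one comes back with fewer than 25 entries. This assumes that the API returns a short or empty `body` at the end of the ranking, which has not been verified here. The function `fetch_all` and its `fetch_page` argument are hypothetical; a fake in-memory ranking stands in for the real requests so the logic can be run offline.

```python
def fetch_all(fetch_page, start=0, page_size=25, max_pages=10_000):
    """Collect entries page by page until a short page is returned.

    fetch_page(offset) must return the list of entries for that offset.
    max_pages is a safety cap against an endless loop.
    """
    items = []
    offset = start
    for _ in range(max_pages):
        body = fetch_page(offset) or []
        items.extend(body)
        if len(body) < page_size:  # short or empty page: assumed end of ranking
            break
        offset += page_size
    return items


# Fake ranking of 60 entries standing in for the real API.
fake_ranking = list(range(60))


def fake_fetch(offset):
    return fake_ranking[offset:offset + 25]


all_items = fetch_all(fake_fetch)
print(len(all_items))
```

With the real API, `fetch_page` could wrap the `requests.get` call used above and return `json_content["body"]`.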

In [14]:
runner_df.tail()

Unnamed: 0,id,bt_score,bt_min_perf,btu_score,bts_score,btu_min_perf,bts_min_perf,tc_points,tc_pts,uc_points,...,place,country,nationality,team,geo_lat,geo_lon,display_options,alias,avatar,cover
51105,135781ALLCURRENT,2636,2408,0,2636,0,2408,4,30,0,...,ANS,BE,BE,,43.0138,-88.0472,,de.beer.michele,14659.0,
51106,1046278ALLCURRENT,2601,2438,0,2601,0,2438,5,34,0,...,,,,,,,,jean.solange,,
51107,1126661ALLCURRENT,2522,2505,0,2522,0,2505,6,47,0,...,,,,SPIRIDON FTP,,,,fiore.irene,,
51108,870203ALLCURRENT,2498,1702,0,2498,0,1702,3,22,0,...,,,,,,,,de.lena.philippe,,
51109,729758ALLCURRENT,2379,2272,0,2379,0,2272,3,24,0,...,,,FR,RED STAR CLUB CHAMPIGN,,,,de.hannuna.marjorie,,


In [15]:
runner_df.to_csv("betrail_runner.csv", sep=";", index=False)

# Next steps

- Extract all trails
- Extract runners' performances for each trail