# data collection
This notebook collects CrossFit 2018 Open Leaderboard data and athlete profile data *as represented at the time of data collection*.

## imports
The imports below are required to run the code in this notebook. A comment above each import describes its purpose.

In [1]:
#sql connector
import pymysql as pms
#time recording and sleeping
from time import time, sleep
#browser automation
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## database credentials
In order to connect to a local MySQL database, the block below runs to read the username, password, database name, and host required to establish a connection.

In [2]:
db_user = ""
db_pass = ""
db_name = ""
db_host = ""
with open("database_credentials.txt") as f:
    db_user = f.readline().strip()
    db_pass = f.readline().strip()
    db_name = f.readline().strip()
    db_host = f.readline().strip()
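
The block above expects `database_credentials.txt` to hold the username, password, database name, and host on its first four lines, in that order. As a minimal sketch of the same idea, the helper below (the name `read_credentials` and the placeholder values are my own, not part of the notebook) reads that layout into a dictionary and fails loudly if the file is too short.

```python
import os
import tempfile


def read_credentials(path):
    """Read user, password, database name, and host from the first four lines."""
    keys = ["user", "password", "name", "host"]
    with open(path) as f:
        lines = [line.strip() for line in f]
    if len(lines) < len(keys):
        raise ValueError(
            "expected {} lines, found {}".format(len(keys), len(lines))
        )
    # zip stops after the four expected keys, so extra lines are ignored
    return dict(zip(keys, lines))


# Placeholder values for illustration only (not real credentials).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("example_user\nexample_pass\nexample_db\nlocalhost\n")
example_path = tmp.name

creds = read_credentials(example_path)
os.remove(example_path)
print(creds["host"])
```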

## test database connection
This short snippet attempts to connect to the database and exits without doing anything. It only verifies that the credentials and PyMySQL are working properly.

In [3]:
def get_connect():
    """
    Returns a database connection object using the default params
    specified in the database_credentials file.
    """
    return pms.connect(host=db_host, user=db_user, passwd=db_pass, db=db_name, charset="utf8")

In [4]:
#initialize so the finally block is safe if the connection attempt fails
con = None
try:
    con = get_connect()
    print("Connected.")
    with con.cursor() as cur:
        print("Got database cursor. Can make queries within here.")
finally:
    if con:
        con.close()
        print("Connection closed.")

Connected.
Got database cursor. Can make queries within here.
Connection closed.
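
The connect/try/finally/close pattern above is repeated in most cells of this notebook. One possible way to factor it out is a small context manager. This is only a sketch: `managed_connection` and `FakeConnection` are hypothetical names, and the fake class stands in for a real PyMySQL connection so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def managed_connection(connect):
    """Open a connection with connect() and always close it afterwards."""
    # If connect() raises, there is nothing to close and the error propagates.
    con = connect()
    try:
        yield con
    finally:
        con.close()


class FakeConnection:
    """Stand-in for a database connection (hypothetical, for illustration)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


with managed_connection(FakeConnection) as fake_con:
    closed_inside = fake_con.closed
closed_after = fake_con.closed
print(closed_inside, closed_after)
```

In the notebook this would be used as `with managed_connection(get_connect) as con:`, assuming `get_connect` is defined as above.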


## urls
These urls/variables can be used to jump to pages where leaderboard data is available.

In [5]:
default_url = "https://games.crossfit.com/leaderboard/open/2018?division=1&region=0&scaled=0&sort=0&occupation=0&page=1"
custom_url = "https://games.crossfit.com/leaderboard/open/2018?division={}&region={}&scaled={}&sort={}&occupation={}&page={}"
athlete_url = "https://games.crossfit.com/athlete/{}"

#this map would be used to substitute values into the
#custom url string at position "division={}" in place
#of the "{}"
map_division = {
    "men": 1,
    "women": 2,
    "team": 11,
    #men aged (35-39) inclusive
    "m35-39": 18,
    #women aged (35-39) inclusive
    "w35-39": 19,
    "m40-44": 12,
    "w40-44": 13,
    "m45-49": 3,
    "w45-49": 4,
    "m50-54": 5,
    "w50-54": 6,
    "m55-59": 7,
    "w55-59": 8,
    "m60+": 9,
    "w60+": 10,
    #boys aged (16-17) inclusive
    "b16-17": 16,
    #girls aged (16-17) inclusive
    "g16-17": 17,
    "b14-15": 14,
    "g14-15": 15
}
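
As a quick illustration of how the template and the division map fit together, here is a minimal sketch. `URL_TEMPLATE` and `DIVISIONS` mirror `custom_url` and a subset of `map_division`; the helper name `leaderboard_url` is my own.

```python
URL_TEMPLATE = (
    "https://games.crossfit.com/leaderboard/open/2018"
    "?division={}&region={}&scaled={}&sort={}&occupation={}&page={}"
)

# Subset of the division map above, copied here so the sketch is self-contained.
DIVISIONS = {"men": 1, "women": 2, "m35-39": 18}


def leaderboard_url(division, page=1, region=0, scaled=0, sort=0, occupation=0):
    """Fill the URL template for a named division and a page number."""
    return URL_TEMPLATE.format(
        DIVISIONS[division], region, scaled, sort, occupation, page
    )


print(leaderboard_url("men"))
print(leaderboard_url("women", page=3))
```

With the defaults, `leaderboard_url("men")` reproduces `default_url`.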

## getting custom url values
Although the custom URL attributes can be hard-coded, the more robust solution is to add a browser automation step, prior to the main data collection, that acquires the custom URL attributes corresponding to each filter. This is done below.

The goal is to obtain all of the maps like the one above in a more robust manner.

Note that the year and competition can also be used to filter results. However, at the time of writing, regional data is not available for 2018 (the Open just finished). Furthermore, previous Open leaderboards have a different HTML structure (which would require additional scraping code), and I really only care about 2018. Rx'd/scaled, per-workout, occupation, and region are also available as filtering criteria.

**Rx'd/scaled and occupation**
I don't care about these for the time being. This repo will only consider non-specific occupation and Rx'd athletes.

**per-workout and region**
I'll be able to do this filtering on my own (hypothetically). In order to do so, the region and per-workout scores will be scraped from the leaderboard and filtered on later in other notebooks.

In [6]:
filters = ["division", "region"]
print("HTML filters which will be used for filtering:\n{}".format(filters))

HTML filters which will be used for filtering:
['division', 'region']


Below we go to the leaderboard (automatically) and scrape the IDs [CrossFit](https://games.crossfit.com/leaderboard/open/2018?division=1&region=0&scaled=0&sort=0&occupation=0&page=1) uses for divisions and regions. Although I don't need to use the same IDs, it can only help to use the same mappings. The IDs here are the numeric values that are substituted for the `"{}"` occurrences in the `custom_url` string above.

Here's the documentation I use for [Selenium](http://selenium-python.readthedocs.io/locating-elements.html).

In [7]:
#attempt database connect
#initialize so the finally block is safe if the connection attempt fails
con = None
try:
    con = get_connect()
    with con.cursor() as cur:
        #create division and region tables if they don't exist
        sql = """
        CREATE TABLE IF NOT EXISTS region (
            id INT PRIMARY KEY,
            region VARCHAR(24) NOT NULL
        );
        CREATE TABLE IF NOT EXISTS division (
            id INT PRIMARY KEY,
            division VARCHAR(16) NOT NULL
        );
        """
        cur.execute(sql)
        
        #attempt to get regions and divisions
        result_counts = [-1, -1]
        sql = """
        SELECT * FROM {};
        """
        for i in range(len(filters)):
            cur.execute(sql.format(filters[i]))
            result = cur.fetchall()
            #store number of results
            result_counts[i] = len(result)
            #output results
            #print("==== {} results({}) ====".format(filters[i], result_counts[i]))
            #print(", ".join([col[0] for col in cur.description]))
            for j in range(result_counts[i]):
                break
                #print("{}: {}".format(j, result[j]))
        
        #if results for both are not empty, the values have
        #already been scraped, so skip this
        if result_counts[0] < 1 or result_counts[1] < 1:
            #Store id, region/div pairs as tuples (id_0, region_0/div_0)
            #the below entries will have 2 lists of such tuples, 1 for
            #each filter_id
            entries = []
            #spin up browser
            driver = webdriver.Chrome()
            driver.get(default_url)
            
            #iterate over useful filters
            for i in range(len(filters)):
                #ids are formatted with control- as a prefix
                dropdown = driver.find_element_by_id("control-" + filters[i])
                options = dropdown.find_elements_by_tag_name("option")
                #aggregate entries
                #also, this is making 2 calls to o.get_attribute("value"), and this
                #could be done "more efficiently" without the list comprehension
                entries.append([(int(o.get_attribute("value")), o.get_attribute("innerText"))
                            for o in options if o.get_attribute("value") != ""])
                #print(entries[i])
            
            #close driver
            driver.close()
        
            #write entries to file
            sql = """
            INSERT INTO {}(id, {}) VALUES
                {}
            """
            for i in range(len(filters)):
                cur.execute(sql.format(filters[i], filters[i],
                                       ",\n".join(["({}, '{}')".format(e[0], e[1]) for e in entries[i]])))
            #commit inserts
            con.commit()
finally:
    if con:
        con.close()



## checkout filter values
Now that we've ensured the filters are in the database, let's grab them and create mappings. These will be necessary for the athlete table.

In [8]:
#this map will contain maps for each filter
filter_maps = {}
#initialize so the finally block is safe if the connection attempt fails
con = None
try:
    con = get_connect()
    with con.cursor() as cur:
        sql = """
        SELECT * FROM {};
        """
        for f in filters:
            #add filter map
            filter_maps[f] = {}
            #get results
            cur.execute(sql.format(f))
            result = cur.fetchall()
            #store results
            for r in result:
                #create mapping from div/region -> id or vice-versa (depending on scraping needs)
                if f == "division":
                    filter_maps[f][r[0]] = r[1]
                elif f == "region":
                    filter_maps[f][r[1]] = r[0]
finally:
    if con:
        con.close()
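
The two maps point in opposite directions on purpose: divisions are looked up by ID (to print a readable name), while regions are looked up by name (to turn scraped text into an ID). A minimal sketch of both directions, using a hypothetical helper `build_maps` and two sample rows:

```python
def build_maps(rows):
    """Build id -> name and name -> id lookups from (id, name) rows."""
    id_to_name = {row_id: name for row_id, name in rows}
    name_to_id = {name: row_id for row_id, name in rows}
    return id_to_name, name_to_id


# Sample rows shaped like the result of SELECT * FROM division.
sample_rows = [(2, "Women"), (12, "Men (40-44)")]
id_to_name, name_to_id = build_maps(sample_rows)
print(id_to_name[12])
print(name_to_id["Women"])
```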

In [9]:
#filter_maps

## hard-coded values
At the time of data collection, the 2018 CrossFit Open has ended. With this, we're making the assumption that the total number of pages for each leaderboard will not change (this is not guaranteed, but I'm assuming it's very nearly true). The leaderboard is structured in a way such that if a page is requested beyond the total number of leaderboard pages available for a specific filter, it redirects back to the first leaderboard page.

Although there are different ways to handle this, in order to know when to stop scraping for a specific filter, we'll just scrape the index of the last available page, which is shown at the bottom of each leaderboard. Additionally, the current page index for each division will also be stored, defaulting to the first leaderboard page: 1.

In [10]:
#initialize so the finally block is safe if the connection attempt fails
con = None
try:
    con = get_connect()
    with con.cursor() as cur:
        #create table to store these worldwide-division indices
        #(could easily be stored in a flat file instead)
        sql = """
        CREATE TABLE IF NOT EXISTS worldwide_division_pages (
            division_id INT PRIMARY KEY,
            FOREIGN KEY (division_id)
                REFERENCES division(id)
                ON DELETE CASCADE,
            curr_page INT NOT NULL DEFAULT 1,
            last_page INT NOT NULL DEFAULT -1
        )
        """
        cur.execute(sql)
        
        #check if table has been populated already in a previous run
        sql = """
        SELECT * FROM worldwide_division_pages;
        """
        cur.execute(sql)
        result = cur.fetchall()
        if len(result) == 0:
            #cross-populate division id's from division table
            sql = """
            INSERT INTO worldwide_division_pages (division_id)
                SELECT id FROM division;
            """
            cur.execute(sql)
            con.commit()
            
            #grab option IDs from database
            #this could be done by just collecting them from the browser, the values
            #are supposed to be identical
            sql = """
            SELECT division_id FROM worldwide_division_pages;
            """
            cur.execute(sql)
            result = cur.fetchall()
            ids = [r[0] for r in result]
            print(ids)
                     
            #scrape last_page values for each division
            driver = webdriver.Chrome()
            driver.get(default_url)
            
            #grab division selectable and store in browser
            inject_store_select = """
            window.division_select = document.getElementById("control-division");
            """
            driver.execute_script(inject_store_select)
            
            last_pages = []
            #iterate over ids
            for i in ids:
                #force select element to change to new dropdown option
                inject_change_select = """
                window.division_select.value = {};
                window.division_select.dispatchEvent(new Event("change"));
                """.format(i)
                driver.execute_script(inject_change_select)
                #wait for page to update
                sleep(2)
                #grab last page text from the bottom
                last_pages.append(int(
                    driver.find_element_by_class_name("nums")
                        .find_elements_by_tag_name("a")[-1]
                        .get_attribute("innerText")
                ))
            
            #write updates to file in bulk
            sql = """
            INSERT INTO worldwide_division_pages(division_id, last_page) VALUES
                {}
                ON DUPLICATE KEY UPDATE last_page = VALUES(last_page);
            """
            cur.execute(
                sql.format(
                    ",\n".join(["({}, '{}')".format(ids[i], last_pages[i]) for i in range(len(ids))])
                )
            )
            con.commit()
            
            #close driver
            driver.close()
finally:
    if con:
        con.close()



### additional fixed filters
Below are the hard-coded values for region, scaled, sort, and occupation.

In [11]:
region = 0
scaled = 0
sort = 0
occupation = 0
custom_url = (
    "https://games.crossfit.com/leaderboard/open/2018?division={}&region={}&scaled={}&sort={}&occupation={}&page={}"
        .format("{}", region, scaled, sort, occupation, "{}")
)
print("custom url:\n{}".format(custom_url))

custom url:
https://games.crossfit.com/leaderboard/open/2018?division={}&region=0&scaled=0&sort=0&occupation=0&page={}


## leaderboard/athlete scraping overview
The Open leaderboard provides only half of the data required per athlete. The remaining data will be collected from their athlete profile, containing statistics for their Back Squat, Fran, and other CrossFit staples. Profiles are scraped **after** the leaderboard data for a page has been collected. Therefore, the scraping pipeline from this point forward is as follows:
* for each division:
    * for each leaderboard page:
        * scrape all athlete leaderboard information
        * for each athlete:
            * scrape profile data
        * write all athlete data to file
        * update current page for this division

### reading in the pages to be scraped from the Open leaderboard
Below, the pages left to be scraped are read from the database and sorted from most athletes to fewest **per division** (descending `last_page`). Divisions whose Open leaderboard data has already been completely scraped will have a current page value exceeding the last page value by 1 (for example, `curr_page = 3`, `last_page = 2`).

In [12]:
#initialize so the finally block is safe if the connection attempt fails
con = None
try:
    con = get_connect()
    with con.cursor() as cur:
        sql = """
        SELECT * FROM worldwide_division_pages;
        """
        cur.execute(sql)
        #sort results based on last_page value
        division_pages = sorted(
            list(
                map(
                    lambda tup: list(tup),
                    cur.fetchall()
                )
            ), 
            key=lambda tup: tup[2],
            reverse=True
        )
finally:
    if con:
        con.close()

### filtering division pages
Divisions that have already been completely scraped (`curr_page == last_page + 1`) should be excluded from the scraping process.

In [13]:
#output
print("Original division pages:\n{}".format("\n".join(["\t{}".format(d) for d in division_pages])))
#filter
filtered_division_pages = [d for d in division_pages if d[1] <= d[2]]
print("Filtered division pages:\n{}".format("\n".join(["\t{}".format(d) for d in filtered_division_pages])))

Original division pages:
	[1, 687, 4552]
	[2, 44, 3441]
	[18, 47, 894]
	[19, 59, 619]
	[12, 65, 568]
	[13, 47, 398]
	[3, 56, 337]
	[4, 45, 239]
	[5, 41, 167]
	[6, 54, 124]
	[7, 48, 91]
	[8, 61, 76]
	[16, 55, 63]
	[9, 51, 55]
	[17, 48, 47]
	[10, 40, 45]
	[14, 41, 40]
	[15, 33, 32]
Filtered division pages:
	[1, 687, 4552]
	[2, 44, 3441]
	[18, 47, 894]
	[19, 59, 619]
	[12, 65, 568]
	[13, 47, 398]
	[3, 56, 337]
	[4, 45, 239]
	[5, 41, 167]
	[6, 54, 124]
	[7, 48, 91]
	[8, 61, 76]
	[16, 55, 63]
	[9, 51, 55]
	[10, 40, 45]
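
The completion rule can be checked in isolation. The sketch below uses a hypothetical helper `is_complete` and three rows copied from the output above, each in the form `[division_id, curr_page, last_page]`.

```python
def is_complete(page_row):
    """A division is finished once curr_page has moved past last_page."""
    division_id, curr_page, last_page = page_row
    return curr_page > last_page


# Rows copied from the output above: [division_id, curr_page, last_page]
sample_pages = [[17, 48, 47], [10, 40, 45], [14, 41, 40]]
remaining = [row for row in sample_pages if not is_complete(row)]
print(remaining)
```

Only division 10 still has pages left, which matches the filtered list printed by the notebook.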


## setting up the athlete tables
The athlete table will contain data for athletes from 2 sources: leaderboards and profiles. The data from each in its native form is shown below.

### leaderboards
<img src="images/leaderboard.png" />

### profiles
<img src="images/basic_stats.png" />
<img src="images/benchmark_stats.png" />

Although this data could be separated into 2 or 3 different tables, for the scope of the planned analysis, it can all go into the same table. The collected data per athlete will contain the following:
* leaderboard
    * name
    * 18.1, 18.2, 18.2a, 18.3, 18.4, 18.5 scores (not rank, a.k.a. time or reps or weight)
    * height, weight, age
    * region, division
* profile
    * affiliate
    * back squat, clean and jerk, snatch, deadlift
    * fight gone bad, fran, grace, helen, filthy 50
    * max pull-ups
    * sprint 400m, run 5k
    
The table containing all this data is created below. Time data will be recorded in seconds, weight in pounds, height in inches, and age in years.
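
Since athletes may report metric or imperial units, values have to be normalized before they fit these integer columns. A minimal sketch of the two conversions, using the same factors that appear in the leaderboard scraping code further down (the helper names are my own):

```python
CM_PER_INCH = 2.54
LB_PER_KG = 2.20462


def cm_to_inches(cm):
    """Convert centimeters to whole inches."""
    return int(round(cm / CM_PER_INCH))


def kg_to_pounds(kg):
    """Convert kilograms to whole pounds."""
    return int(round(kg * LB_PER_KG))


print(cm_to_inches(180))
print(kg_to_pounds(100))
```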

In [14]:
#initialize so the finally block is safe if the connection attempt fails
con = None
try:
    con = get_connect()
    with con.cursor() as cur:
        sql = """
        CREATE TABLE IF NOT EXISTS athlete (
            id INT PRIMARY KEY,
            name VARCHAR(128) NOT NULL,
            leaderboard_18_1_reps INT,
            leaderboard_18_2_time_secs INT,
            leaderboard_18_2a_weight_lbs INT,
            leaderboard_18_3_reps INT,
            leaderboard_18_4_time_secs INT,
            leaderboard_18_5_reps INT,
            
            height_in INT,
            weight_lbs INT,
            age_years INT,
            
            region_id INT NOT NULL,
            FOREIGN KEY (region_id)
                REFERENCES region(id)
                ON DELETE CASCADE,
            division_id INT NOT NULL,
            FOREIGN KEY (division_id)
                REFERENCES division(id)
                ON DELETE CASCADE,
            
            affiliate_id INT,
            
            back_squat_lbs INT,
            clean_and_jerk_lbs INT,
            snatch_lbs INT,
            deadlift_lbs INT,
            
            fight_gone_bad_time_secs INT,
            fran_time_secs INT,
            grace_time_secs INT,
            helen_time_secs INT,
            filthy_50_time_secs INT,
            
            max_pull_ups INT,
            
            sprint_400_m_time_secs INT,
            run_5_km_time_secs INT
        );
        """
        cur.execute(sql)
finally:
    if con:
        con.close()



## scrape athlete data
Here we go. glhf

### Constants used during scraping
The constants below are used during the scraping process. Their uses are commented; they include output formatting, SQL column names, time multipliers (h:m:s), and CSS selectors.

In [15]:
#used for output styling
center_space = 60
center_sep = " "

#for leaderboard workouts
workout_keys = [
    "18_1_reps", "18_2_time_secs", "18_2a_weight_lbs", "18_3_reps", "18_4_time_secs", "18_5_reps"]
workout_keys = list(map(lambda k: "leaderboard_" + k, workout_keys))

#for stats metrics on the athlete profile page
stats_keys = [
    "back_squat_lbs",
    "clean_and_jerk_lbs",
    "snatch_lbs",
    
    "deadlift_lbs",
    "fight_gone_bad_time_secs",
    "max_pull_ups",
    
    "fran_time_secs",
    "grace_time_secs",
    "helen_time_secs",
    
    "filthy_50_time_secs",
    "sprint_400_m_time_secs",
    "run_5_km_time_secs"
]

#for other values (height, weight, etc.)
height_key = "height_in"
weight_key = "weight_lbs"
age_key = "age_years"
id_key = "id"
region_key = "region_id"
division_key = "division_id"
name_key = "name"
affiliate_key = "affiliate_id"

#all keys (used for insertion)
all_keys = sorted(workout_keys + stats_keys + [
    height_key, weight_key, age_key, id_key, region_key, division_key, name_key, affiliate_key
])
all_keys_str = ", ".join(all_keys)

#time map values for converting to seconds
time_mults = [3600, 60, 1]

#used for getting statistics on athlete profile page
custom_stat_selects = [
    "li:nth-child({})",
    "> .stats-section:nth-child({}) ",
    "> table > tbody > tr:nth-child({}) > td"
]
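
To make the selector fragments less cryptic, here is a sketch of how they combine and how a position on the profile page maps to a column name. `STAT_SELECTS` and `STAT_KEYS` are copies of `custom_stat_selects` and `stats_keys`; the helpers `stat_selector` and `stat_key` are hypothetical, and the index formula is the one used in the scraping loop below.

```python
STAT_SELECTS = [
    "li:nth-child({})",
    "> .stats-section:nth-child({}) ",
    "> table > tbody > tr:nth-child({}) > td",
]
STAT_KEYS = [
    "back_squat_lbs", "clean_and_jerk_lbs", "snatch_lbs",
    "deadlift_lbs", "fight_gone_bad_time_secs", "max_pull_ups",
    "fran_time_secs", "grace_time_secs", "helen_time_secs",
    "filthy_50_time_secs", "sprint_400_m_time_secs", "run_5_km_time_secs",
]


def stat_selector(i, j, k):
    """CSS selector for list item i, stats section j, table row k (1-based)."""
    return (
        STAT_SELECTS[0].format(i)
        + STAT_SELECTS[1].format(j)
        + STAT_SELECTS[2].format(k)
    )


def stat_key(i, j, k):
    """Column name for the same position: 2 items x 2 sections x 3 rows."""
    return STAT_KEYS[6 * (i - 1) + 3 * (j - 1) + (k - 1)]


print(stat_selector(1, 1, 1))
print(stat_key(2, 1, 1))
```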

### Reformatting data
The function below reformats a lb/kg, reps, or hours:minutes:seconds value into its database-equivalent format.

In [16]:
def handle_crossfit_score(html):
    """
    Handles the parsing of several different CrossFit scores such as
    weight (lb/kg), time (x:y:z), reps (reps)/(), or none (--). Returns the
    value parsed to its database-required equivalent.
    """
    #reps
    if html[-4:] == "reps":
        return int(html[:-5])
    #time
    elif ":" in html:
        hrs_mins_secs = list(map(lambda x: int(x), html.split(":")))
        len_diff = len(time_mults) - len(hrs_mins_secs)
        #collect all seconds
        return sum(
            [hrs_mins_secs[i] * time_mults[i + len_diff]
                 for i in range(len(hrs_mins_secs))]
        )
    #no score
    elif html == "--":
        return -1
    #weight
    elif html[-2:] == "lb" or html[-2:] == "kg":
        weight = int(html[:-3])
        #convert kg to lb with the same factor used for athlete body weight
        return weight if html[-2:] == "lb" else int(round(weight * 2.20462))
    #reps with no reps on the end (used for max pull-ups in athlete profile)
    else:
        #handle special values with -2 entry (200 reps for pullups is not happening in 2018)
        parsed_val = int(html)
        return parsed_val if parsed_val < 150 else -2
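
The time branch is the least obvious part of the function, so here is a standalone sketch of it. Instead of the `time_mults` list it uses a running multiply-by-60, which gives the same result for `m:s` and `h:m:s` strings. The name `to_seconds` is hypothetical.

```python
def to_seconds(text):
    """Convert 's', 'm:s', or 'h:m:s' into a total number of seconds."""
    parts = [int(p) for p in text.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("unexpected time format: {!r}".format(text))
    seconds = 0
    for part in parts:
        # Each step promotes the running total by one unit (h -> m -> s).
        seconds = seconds * 60 + part
    return seconds


print(to_seconds("12:34"))
print(to_seconds("1:02:03"))
```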

### Random
The code below has been refactored several times. The current implementation picks a division at random on every iteration. Once a division's leaderboard has been completely collected, the division is removed as a *required* scraping target (deleted from the local copy of division pages). This means that on iteration 17, the next 50 athletes from the Men's division could be scraped, and on iteration 18 Girls (14-15) could be the target.

In [None]:
from random import randint

The second code block here is the legacy version of the first code block. The legacy version scrapes a single division until it's completely collected, and then moves on to the next. The newer version randomly samples from the remaining incompletely scraped divisions on each iteration.
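
The random sampling can be tried out without a browser. The sketch below simulates only the scheduling, not the scraping: `simulate_random_schedule` is a hypothetical helper, it uses `random.Random(seed).choice` rather than `randint` so the run is repeatable, and the sample rows are `[division_id, curr_page, last_page]` values taken from the output earlier in this notebook.

```python
import random


def simulate_random_schedule(pages, seed=0):
    """Return the (division_id, page) visit order under random sampling."""
    rng = random.Random(seed)
    # Copy the rows so the caller's lists are not modified.
    remaining = [list(p) for p in pages]
    order = []
    while remaining:
        div = rng.choice(remaining)
        order.append((div[0], div[1]))
        # Advance the current page and drop the division once it is finished.
        div[1] += 1
        if div[1] > div[2]:
            remaining.remove(div)
    return order


order = simulate_random_schedule([[10, 40, 45], [9, 51, 55]])
print(len(order))
print([page for division, page in order if division == 10])
```

Whatever the seed, every remaining page is visited exactly once, and pages within a single division are still visited in ascending order.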

In [None]:
#spinup driver
driver = webdriver.Chrome()

#for each division
#for div in division_pages:
while 0 < len(filtered_division_pages):
    #select division randomly
    div = filtered_division_pages[randint(0, len(filtered_division_pages) - 1)]
    
    #output division
    print("Division {} {}/{}".format(filter_maps["division"][div[0]], div[1], div[2])
             .rjust(center_space, center_sep))
    print("Page {}".format(div[1]))

    #go to page
    driver.get(custom_url.format(div[0], div[1]))

    #get leaderboard (might have to wait)
    lb = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located(
            (
                By.CSS_SELECTOR,
                "body > #containerOverlay > #leaderboard > .lb-main > .inner > table > tbody"
            )
        )
    )
    #get rows
    rows = lb.find_elements_by_xpath("*")
    #collect athlete data
    athletes = []
    for r in rows:
        #get columns and create empty dictionary
        cols = r.find_elements_by_xpath("*")#only td elements
        dic = {}

        #column 1
        #name
        names = cols[1].find_elements_by_css_selector("div > div > div:nth-child(2) > div")
        dic[name_key] = (
            "'" + (
                names[0].get_attribute("innerText") + " " + names[1].get_attribute("innerText")
            ).replace("'", "\\'") + "'"
        )
        #info = cols[1].find_element_by_class_name("info")
        info = cols[1].find_element_by_css_selector("div > .bottom > ul")
        info_lis = info.find_elements_by_css_selector("li")
        #id (profile identifier for url)
        dic[id_key] = int(info.find_element_by_tag_name("a").get_attribute("href").split("/")[-1])
        #region/division ids
        dic[region_key] = filter_maps["region"][info_lis[0].get_attribute("innerText")]
        dic[division_key] = div[0]
        #age
        dic[age_key] = int(info_lis[1].get_attribute("innerText").split(" ")[1])
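        #height/weight (not mandatory fields, so default to -1 when absent;
        #otherwise the insert below fails on a missing key)
        dic[height_key] = -1
        dic[weight_key] = -1
        if 2 < len(info_lis):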
            height_weight = info_lis[2].get_attribute("innerText").split(" ")
            #height
            if "in" in height_weight:
                dic[height_key] = float(height_weight[height_weight.index("in") - 1])
            elif "cm" in height_weight:
                dic[height_key] = float(height_weight[height_weight.index("cm") - 1]) / 2.54
            else:
                dic[height_key] = -1
            dic[height_key] = int(round(dic[height_key]))
            #weight
            if "lb" in height_weight:
                dic[weight_key] = float(height_weight[height_weight.index("lb") - 1])
            elif "kg" in height_weight:
                dic[weight_key] = float(height_weight[height_weight.index("kg") - 1]) * 2.20462
            else:
                dic[weight_key] = -1
            dic[weight_key] = int(round(dic[weight_key]))

        #columns 3-8 (inclusive, 0-indexed)
        #leaderboard workout reps, time, or weight
        scaled = False
        first_col = 3
        for i in range(first_col, 9):
            #get inner value minus the surrounding parentheses
            html_raw = cols[i].find_element_by_css_selector(
                "div > div > span > span:nth-child(2)"
            ).get_attribute("innerText")
            html = html_raw[html_raw.index("(") + 1:-1]
            #print(html_raw, html)
            #if athlete scaled a workout, don't collect any data
            if html.endswith("- s"):
                scaled = True
                break
            #convert values
            key = workout_keys[i - first_col]
            #handle html
            dic[key] = handle_crossfit_score(html)

        #append athlete if not scaled
        if not scaled:
            athletes.append(dic)
    #iterate over athletes
    for a in athletes:
        driver.get(athlete_url.format(a["id"]))
        #wait for the stats section at the bottom of the profile to load
        try:
            #get stats
            stats = WebDriverWait(driver, 1).until(
                EC.presence_of_element_located(
                    (
                        By.CSS_SELECTOR,
                        "body #athleteProfile > div:nth-last-child(3) > .container > .stats-container"
                    )
                )
            )
            #iterate and collect numbers
            for i in range(1, 3):
                sel0 = custom_stat_selects[0].format(i)
                for j in range(1, 3):
                    sel1 = custom_stat_selects[1].format(j)
                    for k in range(1, 4):
                        sel2 = custom_stat_selects[2].format(k)
                        #select
                        html = stats.find_element_by_css_selector(sel0 + sel1 + sel2).get_attribute("innerText")
                        a[stats_keys[6 * (i - 1) + 3 * (j - 1) + (k - 1)]] = handle_crossfit_score(html)
        except Exception:
            #stats section missing or not loaded in time, so mark all stats as absent
            for k in stats_keys:
                a[k] = -1

        #get affiliate id
        try:
            a[affiliate_key] = int(
                (
                    driver.find_element_by_css_selector(
                        "body #athleteProfile > .page-cover .bg-games-black-overlay .infobar > li:nth-child(6) a"
                    ).get_attribute("href")
                ).split("/")[-1]
            )
        except Exception:
            a[affiliate_key] = -1

    #"stats-container" class may not be present on profile pages (bs, dl, cj, ...)
    #for a in athletes:
    #    print("\n".join(["{}: {}".format(k, a[k]) for k in sorted(list(a.keys()))]) + "\n")

    #update database
    #reset so the finally block is safe if the connection attempt fails
    con = None
    try:
        con = get_connect()
        with con.cursor() as cur:
            #insert athletes (if any from this page did not scale)
            if len(athletes) != 0:
                sql = """
                INSERT INTO athlete ({}) VALUES
                    {}
                    ON DUPLICATE KEY UPDATE id={};
                """.format(
                    all_keys_str,
                    ",\n".join(["(" + ",".join([str(a[k]) for k in all_keys]) + ")" for a in athletes]),
                    id_key
                )
                cur.execute(sql)
                con.commit()

            #increase value of current page
            sql = """
            UPDATE worldwide_division_pages SET curr_page = {} WHERE division_id = {};
            """.format(
                div[1] + 1,
                div[0]
            )
            cur.execute(sql)
            con.commit()
    finally:
        if con:
            con.close()

    #increase local value of current page
    div[1] += 1
    #remove division if current page exceeds last page (after increment)
    if div[1] == div[2] + 1:
        filtered_division_pages.remove(div)

#close browser
driver.close()

                                      Division Women 44/3441
Page 44
                               Division Women (40-44) 47/398
Page 47
                                 Division Men (40-44) 65/568
Page 65
                               Division Women (40-44) 48/398
Page 48
                                  Division Women (60+) 40/45
Page 40
                               Division Women (35-39) 59/619
Page 59
                                 Division Men (45-49) 56/337
Page 56
                               Division Women (40-44) 49/398
Page 49
                                      Division Women 45/3441
Page 45
                                  Division Men (55-59) 48/91
Page 48
                                 Division Men (40-44) 66/568
Page 66


#spinup driver
driver = webdriver.Chrome()

#for each division
for div in division_pages:
    #output division
    print("Division {}".format(filter_maps["division"][div[0]])
             .rjust(center_space, center_sep))
    #continue until current page exceeds last page
    while div[1] <= div[2]:
        print("Page {}".format(div[1]))
        
        #go to page
        driver.get(custom_url.format(div[0], div[1]))
        
        #get leaderboard (might have to wait)
        lb = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located(
                (
                    By.CSS_SELECTOR,
                    "body > #containerOverlay > #leaderboard > .lb-main > .inner > table > tbody"
                )
            )
        )
        #get rows
        rows = lb.find_elements_by_xpath("*")
        #collect athlete data
        athletes = []
        for r in rows:
            #get columns and create empty dictionary
            cols = r.find_elements_by_xpath("*")#only td elements
            dic = {}
            
            #column 1
            #name
            names = cols[1].find_elements_by_css_selector("div > div > div:nth-child(2) > div")
            dic[name_key] = (
                "'" + (
                    names[0].get_attribute("innerText") + " " + names[1].get_attribute("innerText")
                ).replace("'", "\\'") + "'"
            )
            #info = cols[1].find_element_by_class_name("info")
            info = cols[1].find_element_by_css_selector("div > .bottom > ul")
            info_lis = info.find_elements_by_css_selector("li")
            #id (profile identifier for url)
            dic[id_key] = int(info.find_element_by_tag_name("a").get_attribute("href").split("/")[-1])
            #region/division ids
            dic[region_key] = filter_maps["region"][info_lis[0].get_attribute("innerText")]
            dic[division_key] = div[0]
            #age
            dic[age_key] = int(info_lis[1].get_attribute("innerText").split(" ")[1])
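            #height/weight (not mandatory fields, so default to -1 when absent;
            #otherwise the insert fails on a missing key)
            dic[height_key] = -1
            dic[weight_key] = -1
            if 2 < len(info_lis):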
                height_weight = info_lis[2].get_attribute("innerText").split(" ")
                #height
                if "in" in height_weight:
                    dic[height_key] = float(height_weight[height_weight.index("in") - 1])
                elif "cm" in height_weight:
                    dic[height_key] = float(height_weight[height_weight.index("cm") - 1]) / 2.54
                else:
                    dic[height_key] = -1
                dic[height_key] = int(round(dic[height_key]))
                #weight
                if "lb" in height_weight:
                    dic[weight_key] = float(height_weight[height_weight.index("lb") - 1])
                elif "kg" in height_weight:
                    dic[weight_key] = float(height_weight[height_weight.index("kg") - 1]) * 2.20462
                else:
                    dic[weight_key] = -1
                dic[weight_key] = int(round(dic[weight_key]))
            else:
                #no height/weight listed: use the same -1 sentinel as above
                dic[height_key] = -1
                dic[weight_key] = -1
            
            #columns 3-8 (inclusive, 0-indexed)
            #leaderboard workout reps, time, or weight
            scaled = False
            first_col = 3
            for i in range(first_col, 9):
                #get inner value minus the surrounding parentheses
                html_raw = cols[i].find_element_by_css_selector(
                    "div > div > span > span:nth-child(2)"
                ).get_attribute("innerText")
                html = html_raw[html_raw.index("(") + 1:-1]
                #print(html_raw, html)
                #if athlete scaled a workout, don't collect any data
                if html.endswith("- s"):
                    scaled = True
                    break
                #convert values
                key = workout_keys[i - first_col]
                #handle html
                dic[key] = handle_crossfit_score(html)

            #append athlete if not scaled
            if not scaled:
                athletes.append(dic)
        #iterate over athletes
        for a in athletes:
            driver.get(athlete_url.format(a[id_key]))
            #wait for bottom stats to load
            #get stats container (might have to wait)
            try:
                #get stats
                stats = WebDriverWait(driver, 1).until(
                    EC.presence_of_element_located(
                        (
                            By.CSS_SELECTOR,
                            "body #athleteProfile > div:nth-last-child(3) > .container > .stats-container"
                        )
                    )
                )
                #iterate and collect numbers
                for i in range(1, 3):
                    sel0 = custom_stat_selects[0].format(i)
                    for j in range(1, 3):
                        sel1 = custom_stat_selects[1].format(j)
                        for k in range(1, 4):
                            sel2 = custom_stat_selects[2].format(k)
                            #select
                            html = stats.find_element_by_css_selector(sel0 + sel1 + sel2).get_attribute("innerText")
                            a[stats_keys[6 * (i - 1) + 3 * (j - 1) + (k - 1)]] = handle_crossfit_score(html)
            except Exception:
                #stats container missing or unparsable: mark every stat as -1
                for k in stats_keys:
                    a[k] = -1
            
            #get affiliate id
            try:
                a[affiliate_key] = int(
                    (
                        driver.find_element_by_css_selector(
                            "body #athleteProfile > .page-cover .bg-games-black-overlay .infobar > li:nth-child(6) a"
                        ).get_attribute("href")
                    ).split("/")[-1]
                )
            except Exception:
                a[affiliate_key] = -1
            
        #"stats-container" class may not be present on profile pages (bs, dl, cj, ...)
        #for a in athletes:
        #    print("\n".join(["{}: {}".format(k, a[k]) for k in sorted(list(a.keys()))]) + "\n")
        
        #update database
        #reset so the finally block never sees a stale or undefined connection
        con = None
        try:
            con = get_connect()
            with con.cursor() as cur:
                #insert athletes (if any from this page did not scale)
                if len(athletes) != 0:
                    sql = """
                    INSERT INTO athlete ({}) VALUES {};
                    """.format(
                        all_keys_str,
                        ",\n".join(["(" + ",".join([str(a[k]) for k in all_keys]) + ")" for a in athletes])
                    )
                    cur.execute(sql)
                    con.commit()
                    
                #increase value of current page
                sql = """
                UPDATE worldwide_division_pages SET curr_page = {} WHERE division_id = {};
                """.format(
                    div[1] + 1,
                    div[0]
                )
                cur.execute(sql)
                con.commit()
        finally:
            if con:
                con.close()
        
        #increase local value of current page
        div[1] += 1

#close browser
driver.quit()
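
The INSERT statement in the cell above is assembled with `str.format`, which is why the athlete name has to be quoted and escaped by hand. A minimal sketch of an alternative, assuming PyMySQL's `%s` placeholders and `cursor.executemany`: the helper `build_insert` is hypothetical (it is not part of this notebook) and only builds the statement and the parameter rows. It also assumes the values are stored unquoted, i.e. without the manual `'...'` wrapping applied to the name. The keys and values in the example are made up for illustration.

```python
def build_insert(table, keys, athletes):
    """
    Hypothetical helper: returns (sql, rows) for a parameterized
    multi-row insert. Table and column names are still formatted in,
    so they must come from trusted code, never from scraped text.
    """
    #one %s placeholder per column
    placeholders = ", ".join(["%s"] * len(keys))
    sql = "INSERT INTO {} ({}) VALUES ({});".format(table, ", ".join(keys), placeholders)
    #one tuple per athlete, in the same column order as keys
    rows = [tuple(a[k] for k in keys) for a in athletes]
    return sql, rows

#example with made-up keys and values (assumptions, not real leaderboard data)
example_keys = ["id", "name", "age"]
example_athletes = [{"id": 1, "name": "Pat O'Neil", "age": 30}]
sql, rows = build_insert("athlete", example_keys, example_athletes)
print(sql)
print(rows)
```

With an open connection this would be run as `cur.executemany(sql, rows)` followed by `con.commit()`, leaving the escaping of each value (including the apostrophe in the name) to the driver.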