# data collection
This notebook collects CrossFit 2018 Open leaderboard data and athlete profile data *as represented at the time of data collection*.

## imports
The import statements below are required to run the code in this notebook. The purpose of each import is described in the comment above it.

In [3]:
#sql connector
import pymysql as pms
#time recording and sleeping
from time import time, sleep
#browser automation
from selenium import webdriver
#from selenium.webdriver.common.by import By
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC

## database credentials
To connect to a local MySQL database, the block below reads the username, password, database name, and host from a credentials file, one value per line in that order.

In [4]:
db_user = ""
db_pass = ""
db_name = ""
db_host = ""
with open("database_credentials.txt") as f:
    db_user = f.readline().strip()
    db_pass = f.readline().strip()
    db_name = f.readline().strip()
    db_host = f.readline().strip()
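The block above relies on the credentials file having four lines in a fixed order. As a minimal sketch of the same idea, the helper below reads those lines and fails loudly if any are missing. The function name `read_credentials`, the dictionary keys, and the placeholder values in the demo file are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile


def read_credentials(path):
    """Read user, password, database name, and host from the first four lines of a file."""
    with open(path) as f:
        lines = [line.strip() for line in f]
    if len(lines) < 4:
        raise ValueError(
            "expected 4 lines (user, password, database, host), got {}".format(len(lines))
        )
    user, password, name, host = lines[:4]
    return {"user": user, "password": password, "name": name, "host": host}


#demo with a throwaway file holding placeholder values (not real credentials)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "database_credentials.txt")
    with open(demo_path, "w") as f:
        f.write("example_user\nexample_pass\nexample_db\nlocalhost\n")
    creds = read_credentials(demo_path)

print(creds["host"])
```

Raising on a short file is a design choice: the original block would silently leave some values as empty strings, and the problem would only surface later at connection time.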

## test database connection
This short snippet attempts to connect to the database and then closes the connection without running any queries. It only checks that the credentials and PyMySQL are working properly.

In [6]:
def get_connect():
    """
    Returns a database connection object using the default params
    specified in the database_credentials file.
    """
    #password/database are the preferred names for the passwd/db aliases
    return pms.connect(host=db_host, user=db_user, password=db_pass, database=db_name)

In [7]:
#make sure con is defined even if get_connect() raises
con = None
try:
    con = get_connect()
    print("Connected.")
    with con.cursor() as cur:
        print("Got database cursor. Can make queries within here.")
finally:
    if con:
        con.close()
        print("Connection closed.")

Connected.
Got database cursor. Can make queries within here.
Connection closed.


## urls
These urls/variables can be used to jump to pages where leaderboard data is available.

In [11]:
default_url = "https://games.crossfit.com/leaderboard/open/2018?division=1&region=0&scaled=0&sort=0&occupation=0&page=1"
custom_url = "https://games.crossfit.com/leaderboard/open/2018?division={}&region={}&scaled={}&sort={}&occupation={}&page={}"

#this map would be used to substitute values into the
#custom url string at position "division={}" in place
#of the "{}"
map_division = {
    "men": 1,
    "women": 2,
    "team": 11,
    #men aged (35-39) inclusive
    "m35-39": 18,
    #women aged (35-39) inclusive
    "w35-39": 19,
    "m40-44": 12,
    "w40-44": 13,
    "m45-49": 3,
    "w45-49": 4,
    "m50-54": 5,
    "w50-54": 6,
    "m55-59": 7,
    "w55-59": 8,
    "m60+": 9,
    "w60+": 10,
    #boys aged (16-17) inclusive
    "b16-17": 16,
    #girls aged (16-17) inclusive
    "g16-17": 17,
    "b14-15": 14,
    "g14-15": 15
}
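To illustrate how the map and the template fit together, here is a minimal sketch that fills `custom_url` for a given division. The helper name `build_url` is hypothetical, the map is abbreviated to three entries to keep the example short, and the default argument values are taken from `default_url` above.

```python
custom_url = "https://games.crossfit.com/leaderboard/open/2018?division={}&region={}&scaled={}&sort={}&occupation={}&page={}"

#abbreviated copy of the division map, for illustration only
map_division = {
    "men": 1,
    "women": 2,
    "team": 11,
}


def build_url(division, region=0, scaled=0, sort=0, occupation=0, page=1):
    """Fill the URL template, looking up the division code by its readable key."""
    return custom_url.format(map_division[division], region, scaled, sort, occupation, page)


url = build_url("women", page=3)
print(url)
```

Calling `build_url("men")` with no other arguments reproduces `default_url`.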

## getting custom url values
Although the custom URL attributes can be hardcoded, the more robust solution is a browser automation step, run before the main data collection, that acquires the custom URL attribute values corresponding to each filter. This is done below.

The goal is to obtain all of the maps like the one above in a more robust manner.

Also, note that the year and competition can be used to filter results as well. However, at the time of writing, regional data is not available for 2018 (the Open just finished). Furthermore, previous Open leaderboards have a different HTML structure (which would require additional scraping code), and I really only care about 2018.

In [17]:
filter_names = ["division", "region", "scaled", "sort", "occupation"]
#each filter control on the page has an HTML id of the form "control-<name>"
filter_ids = ["control-" + name for name in filter_names]
print("HTML ids which will be used for filtering:\n{}".format(filter_ids))

HTML ids which will be used for filtering:
['control-division', 'control-region', 'control-scaled', 'control-sort', 'control-occupation']
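As a quick sanity check, the sketch below confirms that each filter id lines up with a query parameter in the default leaderboard URL. It assumes that the element with id `control-<name>` drives the URL parameter `<name>`; that follows from the naming but is not verified against the live page here. The names `query_params`, `param_for_id`, and `missing` are introduced only for this example.

```python
from urllib.parse import urlsplit, parse_qs

default_url = "https://games.crossfit.com/leaderboard/open/2018?division=1&region=0&scaled=0&sort=0&occupation=0&page=1"
filter_ids = ["control-division", "control-region", "control-scaled", "control-sort", "control-occupation"]

#query parameters present in the default leaderboard URL
query_params = parse_qs(urlsplit(default_url).query)

#strip the "control-" prefix to get the parameter each id is assumed to drive
param_for_id = {html_id: html_id[len("control-"):] for html_id in filter_ids}

#any filter without a matching URL parameter would need a closer look
missing = [param for param in param_for_id.values() if param not in query_params]
print("Filters without a matching URL parameter: {}".format(missing))
```

Note that `page` appears in the URL but has no filter control in this list, since it is advanced by pagination rather than chosen from a filter.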
