# data collection
This notebook collects CrossFit 2018 Open leaderboard data and athlete profile data *as represented at the time of data collection*.

## imports
The import statements below are required to run the code in this notebook. Each import is preceded by a comment describing its purpose.

In [3]:
#sql connector
import pymysql as pms
#time recording and sleeping
from time import time, sleep
#browser automation
from selenium import webdriver
#from selenium.webdriver.common.by import By
#from selenium.webdriver.support.ui import WebDriverWait
#from selenium.webdriver.support import expected_conditions as EC

## database credentials
To connect to a local MySQL database, the block below reads the username, password, database name, and host from a credentials file, one value per line in that order.

In [4]:
db_user = ""
db_pass = ""
db_name = ""
db_host = ""
with open("database_credentials.txt") as f:
    db_user = f.readline().strip()
    db_pass = f.readline().strip()
    db_name = f.readline().strip()
    db_host = f.readline().strip()
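
As a slightly more defensive variant of the block above, the sketch below wraps the read in a function and fails early if the file has fewer than four lines. The names `read_credentials` and `Credentials` are hypothetical and introduced only for illustration; the file layout (user, password, database name, host, one per line) follows the block above.

```python
from collections import namedtuple

# Hypothetical container; field names mirror the notebook's variables.
Credentials = namedtuple("Credentials", ["db_user", "db_pass", "db_name", "db_host"])

def read_credentials(path="database_credentials.txt"):
    """Read user, password, database name, and host from the first four lines of path."""
    with open(path) as f:
        lines = [line.strip() for line in f.readlines()[:4]]
    # Fail loudly instead of silently connecting with empty strings.
    if len(lines) < 4:
        raise ValueError("expected 4 lines in {}, found {}".format(path, len(lines)))
    return Credentials(*lines)
```

The returned tuple can be unpacked into the four variables used in the rest of the notebook, for example `db_user, db_pass, db_name, db_host = read_credentials()`.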

## test database connection
This short snippet attempts to connect to the database and then disconnects without doing anything. It only verifies that the credentials and PyMySQL are working properly.

In [6]:
def get_connect():
    """
    Returns a database connection object using the default params
    specified in the database_credentials file.
    """
    return pms.connect(host=db_host, user=db_user, passwd=db_pass, db=db_name)

In [7]:
#initialize so the finally block is safe if connecting fails
con = None
try:
    con = get_connect()
    print("Connected.")
    with con.cursor() as cur:
        print("Got database cursor. Can make queries within here.")
finally:
    if con:
        con.close()
        print("Connection closed.")

Connected.
Got database cursor. Can make queries within here.
Connection closed.


## urls
These URLs and variables can be used to jump to pages where leaderboard data is available.

In [11]:
default_url = "https://games.crossfit.com/leaderboard/open/2018?division=1&region=0&scaled=0&sort=0&occupation=0&page=1"
custom_url = "https://games.crossfit.com/leaderboard/open/2018?division={}&region={}&scaled={}&sort={}&occupation={}&page={}"

#this map would be used to substitute values into the
#custom url string at position "division={}" in place
#of the "{}"
map_division = {
    "men": 1,
    "women": 2,
    "team": 11,
    #men aged (35-39) inclusive
    "m35-39": 18,
    #women aged (35-39) inclusive
    "w35-39": 19,
    "m40-44": 12,
    "w40-44": 13,
    "m45-49": 3,
    "w45-49": 4,
    "m50-54": 5,
    "w50-54": 6,
    "m55-59": 7,
    "w55-59": 8,
    "m60+": 9,
    "w60+": 10,
    #boys aged (16-17) inclusive
    "b16-17": 16,
    #girls aged (16-17) inclusive
    "g16-17": 17,
    "b14-15": 14,
    "g14-15": 15
}
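
To make the substitution concrete, here is a minimal sketch of filling in `custom_url` from a division name. The helper `build_url` is hypothetical (it is not defined elsewhere in this notebook), and its default arguments simply mirror the values in `default_url`; only a subset of `map_division` is repeated so the snippet runs on its own.

```python
# Copied from the cell above so the sketch is self-contained.
custom_url = "https://games.crossfit.com/leaderboard/open/2018?division={}&region={}&scaled={}&sort={}&occupation={}&page={}"
# Subset of map_division from the cell above.
map_division = {"men": 1, "women": 2, "team": 11}

def build_url(division, region=0, scaled=0, sort=0, occupation=0, page=1):
    """Hypothetical helper: fill custom_url for a division name and page number."""
    return custom_url.format(map_division[division], region, scaled, sort, occupation, page)

# Third page of the women's leaderboard, all other filters left at their defaults.
print(build_url("women", page=3))
```

Calling `build_url("men")` with no other arguments reproduces `default_url`.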

## getting custom url values
Although the custom URL attributes can be hardcoded, a more robust solution is to add a browser automation step, run before the main data collection, that acquires the URL attribute values corresponding to each filter. This is done below.

The goal is to obtain all of the maps like the one above in a more robust manner.

Note that the year and competition can also be used to filter results. However, at the time of writing, regional data is not available for 2018 (the Open just finished). Furthermore, previous Open leaderboards have a different HTML structure (which would require additional scraping code), and I really only care about 2018. Rx'd/scaled, per-workout, occupation, and region are also available as filtering criteria.

**Rx'd/scaled and occupation**
I don't care about these for the time being. This repo only considers Rx'd athletes, without filtering by occupation.

**per-workout and region**
I'll be able to do this filtering on my own (hypothetically). To do so, the region and per-workout scores will be scraped from the leaderboard and filtered in later notebooks.

In [18]:
#prefix each filter name with "control-" to get the HTML id
filter_ids = ["control-" + name for name in ["division", "region"]]
print("HTML ids which will be used for filtering:\n{}".format(filter_ids))

HTML ids which will be used for filtering:
['control-division', 'control-region']


Below we're going to visit the leaderboard (via browser automation) and scrape the IDs that [CrossFit](https://games.crossfit.com/leaderboard/open/2018?division=1&region=0&scaled=0&sort=0&occupation=0&page=1) uses for divisions and regions. Although I don't need to use the same IDs, it can only help to use the same mappings. By IDs, I mean the numeric values that are substituted for the `"{}"` occurrences in the `custom_url` string above.
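
As a rough illustration of what that scrape could produce, the sketch below extracts a name-to-ID map from filter markup using only the standard library. It rests on an assumption the notebook does not confirm: that each filter is a `<select>` element with the id `control-division` or `control-region`, whose `<option>` tags carry the numeric ID in their `value` attribute. The class name `SelectOptionParser` and the sample HTML (including the "Worldwide" label) are made up for this example.

```python
from html.parser import HTMLParser

class SelectOptionParser(HTMLParser):
    """Collect {option text: option value} for one <select>, found by its id."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.options = {}
        self._in_select = False
        self._value = None  # value attribute of the <option> currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == self.select_id:
            self._in_select = True
        elif tag == "option" and self._in_select:
            self._value = attrs.get("value")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an <option> of the target <select>.
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._value is not None:
            self.options["".join(self._text).strip()] = int(self._value)
            self._value = None
        elif tag == "select":
            self._in_select = False

# Made-up markup for illustration only; the real page may differ.
sample_html = """
<select id="control-division">
  <option value="1">Men</option>
  <option value="2">Women</option>
</select>
<select id="control-region">
  <option value="0">Worldwide</option>
</select>
"""

parser = SelectOptionParser("control-division")
parser.feed(sample_html)
print(parser.options)
```

In the notebook itself, the HTML would presumably come from the Selenium driver (for instance its `page_source` attribute) rather than from a hardcoded string, and the markup assumption would need to be checked against the live page first.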

In [23]:
#attempt database connect
con = None
try:
    con = get_connect()
    with con.cursor() as cur:
        #create division and region tables if they don't exist
        #(one statement per execute call; whether multi-statement
        #queries work depends on the connection's client flags)
        cur.execute("""
        CREATE TABLE IF NOT EXISTS region (
            id INT PRIMARY KEY,
            region VARCHAR(24) NOT NULL
        );
        """)
        cur.execute("""
        CREATE TABLE IF NOT EXISTS division (
            id INT PRIMARY KEY,
            division VARCHAR(16) NOT NULL
        );
        """)

        #attempt to get regions and divisions
        result_counts = [-1, -1]
        tables = ["region", "division"]
        sql = """
        SELECT * FROM {};
        """
        for i in range(len(tables)):
            cur.execute(sql.format(tables[i]))
            result = cur.fetchall()
            #store number of results
            result_counts[i] = len(result)
            #output results
            print("==== {} results({}) ====".format(tables[i], result_counts[i]))
            print("\t" + ", ".join([col[0] for col in cur.description]))
            for j in range(result_counts[i]):
                print("\t{}: {}".format(j, result[j]))

        #if results for both are not empty, the values have
        #already been scraped, so skip this;
        #if either result is empty, rescrape both
        if result_counts[0] == 0 or result_counts[1] == 0:
            #placeholder until the scraping step is implemented
            print("scrape")
finally:
    if con:
        con.close()

#spin up browser
#driver = webdriver.Chrome()
#driver.get(default_url)

==== region results(0) ====
	id, region
==== division results(0) ====
	id, division
scrape


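
Once the scrape yields name-to-ID maps, they would need to be written to the `region` and `division` tables created above. The following is a minimal sketch of that step, not the notebook's actual implementation: `insert_mapping` and `ALLOWED_TABLES` are hypothetical names, and the function only assumes it is handed a DB-API cursor such as the one returned by `con.cursor()`.

```python
# Tables created by the cell above. Table names cannot be passed as query
# parameters, so they are checked against this whitelist instead.
ALLOWED_TABLES = {"region", "division"}

def insert_mapping(cur, table, mapping):
    """Hypothetical helper: insert {name: id} pairs into the region or division table."""
    if table not in ALLOWED_TABLES:
        raise ValueError("unexpected table: {}".format(table))
    # In both schemas the name column shares the table's name
    # (region.region, division.division).
    sql = "INSERT INTO {0} (id, {0}) VALUES (%s, %s)".format(table)
    rows = [(id_, name) for name, id_ in mapping.items()]
    # Values go through %s placeholders so the driver handles escaping.
    cur.executemany(sql, rows)
    return len(rows)
```

With a real connection this would be called as `insert_mapping(cur, "division", map_division)` inside the `with con.cursor() as cur:` block, followed by `con.commit()`, since PyMySQL does not autocommit by default.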
