<h1>Does Height Matter for Distance Running?</h1>
<h3> By Samuel Kellum </h3>

___

<h4> Table of contents: </h4>
<ol>
<li> Introduction </li>
<li> Extraction, Transform and Load </li>
<li> Exploratory Data Analysis and Data Visualization </li>
<li> Hypothesis Testing </li>
<li> Conclusion and Further Study </li>
</ol>

---

<h2 align='center'> 1. Introduction </h2>
<p>Height is very important in basketball and most positions in football, where athletes are rarely shorter than 6 feet tall at the professional and NCAA Division I level. On the other hand, shorter athletes have an advantage in sports like gymnastics or equestrian.</p>
<p>Distance running has always been viewed as a sport where height does not matter. The heights of different world class runnners in the same event varies a lot. For example, Kenenisa Bekele, former world record holder in the 10,000m, stands at 5'3. On the other hand, Chris Solinsky, fromer American record holder in the 10,000m, is 6'1. These two athletes are among some of the best of all-time, competing the same event, but their height differs by almost an entire foot!</p>
<p>As a Division I runner, I became interested in exploring heights of athletes on Division I cross-country teams after a couple of my teammates noticed that the people on other teams seemed to be significantly taller than us. Many of these teams were also better than my team. This sequence of observations gave me two questions:</p><br>
    <li>Is there an association between the average height of a team's runners and team success?</li>
    <li>How does my team (Tulane) compare to other D1 cross-country in terms of average height?</li>
    <p>In this analysis I will attempt to answer the above questions.</p>
    
___


<h2 align="center"> 2. Data Extraction, Transform and Load </h2>

For this analysis, I will will be using Python 3, Pandas, MatPlotLib, and more. The first code cell will be importing the necessary libraries.

In [1]:
## Importing and loading everythng we will need to use.
# Load requests
import requests
# Load BeautiulSoup
from bs4 import BeautifulSoup
# Load Regular Expression Library
import re
# Load Headers
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36"}
# Load Google Search
from googlesearch import search

# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use("fivethirtyeight")

# Load Pandas
import pandas as pd
# Load JSON
import json

The first thing I needed was a list and rank of every Division 1 cross country team. I decided to use data from <a href='https://www.lacctic.com/leagues/4'>LACCTiC.com</a>, a website that ranks every Division 1 cross-country team that has at least five runners (the minimum required to score as a team). The data in this website is extracted from <a href='https://www.tfrrs.org/'>TFRRS</a>, which is a website that compiles all NCAA cross county meet results.



In [2]:
#Collect Team Rankings
teams1to100 = requests.get("https://c03mmwsf5i.execute-api.us-east-2.amazonaws.com/production/api_ranking/teams/?leagueids=4&nonull=true&page=1", headers=headers).json()
teams101to200 = requests.get("https://c03mmwsf5i.execute-api.us-east-2.amazonaws.com/production/api_ranking/teams/?leagueids=4&nonull=true&page=2", headers=headers).json()
teams201to300 = requests.get("https://c03mmwsf5i.execute-api.us-east-2.amazonaws.com/production/api_ranking/teams/?leagueids=4&nonull=true&page=3", headers=headers).json()
teams301to319 = requests.get("https://c03mmwsf5i.execute-api.us-east-2.amazonaws.com/production/api_ranking/teams/?leagueids=4&nonull=true&page=4", headers=headers).json()
data = []
for i in range(len(teams1to100["results"])):
    data.append({"name" : teams1to100["results"][i]["name"], "rank": i + 1, "top_5_ability_average": teams1to100["results"][i]["top_5_ability_average"]})
    
for i in range(len(teams101to200["results"])):
    data.append({"name" : teams101to200["results"][i]["name"], "rank": i + 101, "top_5_ability_average": teams101to200["results"][i]["top_5_ability_average"]})

for i in range(len(teams201to300["results"])):
    data.append({"name" : teams201to300["results"][i]["name"], "rank": i + 201, "top_5_ability_average": teams201to300["results"][i]["top_5_ability_average"]})

for i in range(len(teams301to319["results"])):
    data.append({"name" : teams301to319["results"][i]["name"], "rank": i + 301, "top_5_ability_average": teams301to319["results"][i]["top_5_ability_average"]})

In [3]:
#This takes like 20 minutes to run
for i in range(len(data)):
    query = data[i]["name"] + " men's cross country roster 2021-22"
    for url in search(query, tld="co.in", num=1, stop=1, pause=2):
        data[i]["team_url"] = url

'#This takes like 20 minutes to run\nfor i in range(len(data)):\n    pass\n    #Google stuff'

In [None]:
#Fix: Columbia, Liberty

In [4]:
def get_website_type(url):
    r = requests.get(url, headers=headers)
    if(r.status_code == 200):
        if re.search("sidearm", r.text):
            return "sidearm"
        elif re.search("wmt", r.text):
            return "wmt"
        else:
            return "other"
    else:
        return str(r.status_code)

In [5]:
#Inputs a raw height (str) and converts into height in inches (int)
#Ex: "5'9" -> 69
def height_to_inches(heights):
    new_heights = []
    for height in heights:
        inches = 0
        #Remove any non-digits
        num = re.sub("\D", "", height)
        #First character of remaining string is height in feet, rest of characters are height in inches
        inches = int(num[0]) * 12 + int(num[1:])
        new_heights.append(inches)
    return new_heights

In [6]:
def sidearm_get_heights_separated_by_gender(url):
    men_heights = []
    women_heights = []
    r = requests.get(url, headers=headers)
    if(r.status_code == 200):
        soup = BeautifulSoup(r.text, "html.parser")
        men = soup.find("ul", {"id": "sidearm-m-roster"})
        women = soup.find("ul", {"id": "sidearm-f-roster"})
        if men:
            male_heights = men.find_all(
                "span", {"class": "sidearm-roster-player-height"})
            for height in male_heights:
                men_heights.append(height.text)

        if women:
            female_heights = women.find_all(
                "span", {"class": "sidearm-roster-player-height"})
            for height in female_heights:
                women_heights.append(height.text)
                
        return {"men": height_to_inches(men_heights), "women": height_to_inches(women_heights)}   
    
    else:
        return {"men": [], "women": []}

In [7]:
def wmt_get_heights_separated_by_gender(url):
    r = requests.get(url, headers=headers)
    men_heights = []
    women_heights = []
    if(r.status_code == 200):
        soup = BeautifulSoup(r.text, "html.parser")
        if soup.find("table"):
            dfs = pd.read_html(r.text)
            men = dfs[0]
            if "Ht" in men.columns:
                men_heights = height_to_inches(list(men["Ht"].dropna()))
                if len(dfs) > 1 and "Ht" in dfs[1].columns:
                    women = dfs[1]
                    women_heights = height_to_inches(list(women["Ht"].dropna())) 
                                        
            elif "HT." in men.columns:
                men_heights = height_to_inches(list(men["HT."].dropna()))
                if len(dfs) > 1 and "HT." in dfs[1].columns:
                    women = dfs[1]
                    women_heights = height_to_inches(list(women["HT."].dropna()))    
            
            elif "Height" in men.columns:
                men_heights = height_to_inches(list(men["Height"].dropna()))
                if len(dfs) > 1 and "Height" in dfs[1].columns:
                    women = dfs[1]
                    women_heights = height_to_inches(list(women["Height"].dropna()))

    return {'men': men_heights, 'women': women_heights}

In [8]:
def other_get_heights_separated_by_gender(url):
    r = requests.get(url, headers=headers)
    men_heights = []
    if(r.status_code == 200):
        soup = BeautifulSoup(r.text, "html.parser")
        if soup.find("table"):
            dfs = pd.read_html(r.text)
            men = dfs[0]
            if "Ht." in men.columns:
                men_heights = height_to_inches(list(men["Ht."].dropna()))
            elif "Height" in men.columns:
                men_heights = height_to_inches(list(men["Height"].dropna()))
                
    return {'men': men_heights, 'women': []}

In [9]:
for i in range(len(data)):
    data[i]["website_type"] = get_website_type(data[i]["team_url"])
    
    website_type = data[i]["website_type"]
    if website_type == "sidearm":
        heights = sidearm_get_heights_separated_by_gender(url)
        data[i]["heights"] = heights
    elif website_type == "wmt":
        heights = wmt_get_heights_separated_by_gender(url)
        data[i]["heights"] = heights
    elif website_type == "other":
        heights = other_get_heights_separated_by_gender(url)
        data[i]["heights"] = heights
    else:
        data[i]["heights"] = {"men": [], "women": []}

In [10]:
#Fixing errors that my code makes
#Slicing out non-distance runners

#Furman
data[24]["heights"] = {"men": [75, 67, 70, 74, 68, 69, 67, 70, 72, 71, 73, 71, 70, 72, 72, 70, 76], "women": [66, 64, 65, 64, 70, 63, 68, 61, 65]}

#Georgetown
data[27]["heights"] = {"men": [68, 69, 69], "women": []}

#Iona
data[33]["heights"] = {"men": [74, 69, 68, 72, 70, 72, 68, 70, 71, 70, 72, 71, 70, 74, 72, 72, 71, 69, 72, 70, 71, 67, 73, 73, 70, 70, 69], "women": []} 

#Alabama
data[34]["heights"] = {"men": [], "women": [63]}

#UMass Lowell
data[60]["heights"] = {"men": [70, 70, 69, 73, 69, 70, 71, 70, 75, 70, 69, 71, 70, 69, 68, 70, 70, 70, 67, 69, 73, 65, 73, 66, 69, 66, 70, 67, 68, 70, 70, 70, 72, 70, 74, 69, 70, 69, 68, 69, 71, 70], 'women': []}

#Fix Columbia
#Fix Liberty

#Idaho
data[109]["heights"] = {"men": [], "women": []}

#Coastal Carolina
data[273]["heights"] = {"men": [73, 65], "women": [60, 67, 63, 64]}

In [11]:
data

[{'name': 'Northern Arizona',
  'rank': 1,
  'top_5_ability_average': 812,
  'team_url': 'https://nauathletics.com/sports/cross-country/roster',
  'website_type': 'sidearm',
  'heights': {'men': [], 'women': []}},
 {'name': 'BYU',
  'rank': 2,
  'top_5_ability_average': 814,
  'team_url': 'https://byucougars.com/roster/m-cross-country/2020-2021',
  'website_type': 'other',
  'heights': {'men': [71,
    70,
    73,
    74,
    72,
    71,
    70,
    66,
    74,
    69,
    77,
    71,
    72,
    68,
    70,
    72,
    70,
    69,
    71,
    70,
    69],
   'women': []}},
 {'name': 'Oklahoma State',
  'rank': 3,
  'top_5_ability_average': 815,
  'team_url': 'https://okstate.com/sports/mxct/roster',
  'website_type': 'sidearm',
  'heights': {'men': [], 'women': []}},
 {'name': 'Notre Dame',
  'rank': 4,
  'top_5_ability_average': 817,
  'team_url': 'https://und.com/sports/cross/roster/',
  'website_type': 'wmt',
  'heights': {'men': [], 'women': []}},
 {'name': 'Iowa State',
  'rank':

In [12]:
df = pd.DataFrame({'heights': heights_column, 'teams': team_association, 'team_rank': team_rank})

NameError: name 'heights_column' is not defined

In [None]:
atleast5 = df[df.groupby(['teams','team_rank'])['inches'].transform('size') >= 5]

In [None]:
average_df = pd.DataFrame(atleast5.groupby(['teams','team_rank'])['inches'].mean().sort_values())

In [None]:
atleast5

In [None]:
pd.set_option('display.max_rows', 100)
average_df = average_df.reset_index()
average_df

In [None]:
average_df['inches'].describe()

In [None]:
any_size = df.groupby(['teams','team_rank'])['inches'].mean().sort_values()
any_size

In [None]:
average_df.corr()

In [None]:
average_df.plot.scatter('inches','team_rank')