# Machine Learning on PGA Tour - Data Collection

In this notebook we collect the data, handle missing values, and label the data. The data is scraped from the official PGA Tour website: https://www.pgatour.com.

In [73]:
# IMPORTS
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

First, we want to find the names and rankings of the players for the last five seasons, 2017-2021. This data comes from the PGA Tour's official site, and we create a function to retrieve it.

In [74]:
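def get_name_ranking(url, season):
    """Return the names and rankings of the first 125 listed players for one season."""
    response = requests.get(url)
    response.raise_for_status()  # stop early if the page could not be fetched
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find(id="statsTable")
    rows = table.find_all("tr")
    records = []
    # rows[0] is the header row; keep at most the next 125 rows
    for row in rows[1:126]:
        name = row.find("a").text.strip()
        # the first cell of the row holds the ranking
        ranking = int(row.find("td").text.strip())
        records.append((name, season, ranking))
    # build the frame directly from tuples so that Ranking stays an integer
    # (routing the values through np.array would turn them into strings)
    return pd.DataFrame(records, columns=["Name", "Season", "Ranking"])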
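
The row-parsing logic can be tried offline before the live site is queried. The sketch below runs the same `find` calls against a small hand-written HTML string. `SAMPLE_HTML` and `parse_rankings` are hypothetical names introduced for illustration, and the table layout is an assumption about how the stats page is structured, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the assumed layout of the stats table.
SAMPLE_HTML = """
<table id="statsTable">
  <tr><th>Rank</th><th>Player</th></tr>
  <tr><td>1</td><td><a href="#">Justin Thomas</a></td></tr>
  <tr><td>2</td><td><a href="#">Jordan Spieth</a></td></tr>
</table>
"""


def parse_rankings(html):
    """Return (ranking, name) tuples from a statsTable-like HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="statsTable")
    parsed = []
    # Skip the header row, which contains no <td> or <a> elements.
    for row in table.find_all("tr")[1:]:
        ranking = int(row.find("td").text.strip())  # first cell of the row
        name = row.find("a").text.strip()           # first link of the row
        parsed.append((ranking, name))
    return parsed


print(parse_rankings(SAMPLE_HTML))
```

If the real page used a different structure (for example, ties written as `T5`), the `int(...)` conversion would raise a `ValueError`, so a check like this is a cheap way to spot layout changes.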

Next we collect the feature values. We first create a function that scrapes the values from a given URL and merges them into the dataframe.

In [75]:
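def scrape_data(url, header, df, season):
    """Scrape one statistic for a season and merge it into df as column `header`."""
    response = requests.get(url)
    response.raise_for_status()  # stop early if the page could not be fetched
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find(id="statsTable")
    rows = table.find_all("tr")
    values = []
    # rows[0] is the header row
    for row in rows[1:]:
        name = row.find("a").text.strip()
        # the statistic is read from the fifth cell; drop thousands separators
        value = float(row.find_all("td")[4].text.strip().replace(",", ""))
        values.append((name, season, value))
    df2 = pd.DataFrame(values, columns=["Name", "Season", header])
    # left merge keeps every player in df; players without this statistic get NaN
    result = pd.merge(left=df, right=df2, on=["Name", "Season"], how="left")
    return result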
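
The left merge is what later produces missing values: a player who appears in the rankings but not in a statistic table keeps their row and receives `NaN` for that statistic. A minimal sketch with two toy frames follows. The frame names are illustrative, and the values are taken from the outputs shown in this notebook.

```python
import pandas as pd

# Toy ranking table.
rankings = pd.DataFrame({
    "Name": ["Justin Thomas", "Patrick Cantlay"],
    "Season": ["2017", "2017"],
    "Ranking": [1, 29],
})

# Toy statistic table that has no row for Patrick Cantlay.
distance = pd.DataFrame({
    "Name": ["Justin Thomas"],
    "Season": ["2017"],
    "Driving Distance": [309.1],
})

# how="left" keeps both ranking rows; the unmatched one gets NaN.
merged = pd.merge(left=rankings, right=distance, on=["Name", "Season"], how="left")
print(merged)
```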

Now we gather the names and rankings of the players along with all of our features. Explanations of the features can be found in the project report.

In [76]:
# stat IDs used in the PGA Tour stats URLs, keyed by feature name
stat_ids = {
    "Driving Distance": "101",
    "Driving Accuracy": "102",
    "Club Head Speed": "02401",
    "Ball Speed": "02402",
    "Spin Rate": "02405",
}
seasons = ["2017", "2018", "2019", "2020", "2021"]
frames = []
for season in seasons:
    # names and rankings for the season
    url = f"https://www.pgatour.com/stats/stat.02671.y{season}.html"
    df2 = get_name_ranking(url, season)

    # add one column per feature
    for header, stat_id in stat_ids.items():
        url_stat = f"https://www.pgatour.com/stats/stat.{stat_id}.y{season}.eoff.t013.html"
        df2 = scrape_data(url_stat, header, df2, season)

    frames.append(df2)

# combine all seasons into a single dataframe
df = pd.concat(frames, ignore_index=True)
df

Unnamed: 0,Name,Season,Ranking,Driving Distance,Driving Accuracy,Club Head Speed,Ball Speed,Spin Rate
0,Justin Thomas,2017,1,309.1,54.64,116.52,174.84,2320.1
1,Jordan Spieth,2017,2,294.6,58.67,112.66,168.55,2439.6
2,Xander Schauffele,2017,3,306.5,58.80,118.33,174.24,2518.8
3,Dustin Johnson,2017,4,314.8,54.02,121.45,180.66,2499.9
4,Jon Rahm,2017,5,305.3,58.27,116.42,174.53,2193.0
...,...,...,...,...,...,...,...,...
620,C.T. Pan,2021,121,296.3,61.03,111.20,167.34,2129.2
621,Matt Kuchar,2021,122,288.0,65.81,108.60,162.18,2419.4
622,Brice Garnett,2021,123,288.1,70.86,109.53,164.71,2539.5
623,Scott Stallings,2021,124,298.2,58.83,115.96,173.80,2516.0


Now that we have gathered all of the data, let's check whether it contains any null values:

In [77]:
df[df["Ball Speed"].isnull()]

Unnamed: 0,Name,Season,Ranking,Driving Distance,Driving Accuracy,Club Head Speed,Ball Speed,Spin Rate
28,Patrick Cantlay,2017,29,,,,,
31,Henrik Stenson,2017,32,,,,,
57,Rory McIlroy,2017,58,,,,,
282,Shane Lowry,2019,33,,,,,
291,Tiger Woods,2019,42,,,,,
308,Collin Morikawa,2019,59,,,,,
323,Matthew Wolff,2019,74,,,,,
437,Tiger Woods,2020,63,,,,,
606,Garrick Higgo,2021,107,,,,,


Since only nine rows contain null values, the easiest solution is to drop these rows.

In [78]:
df = df.dropna()
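
The check above looks only at the `Ball Speed` column. As a sketch of a broader check, the toy example below counts missing values per column, selects every row with at least one missing value, and shows that `dropna()` keeps the original index labels. The frame `toy` is illustrative, with values copied from the outputs above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Patrick Cantlay", "Jon Rahm", "Henrik Stenson"],
    "Driving Distance": [np.nan, 305.3, np.nan],
    "Ball Speed": [np.nan, 174.53, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Rows with at least one missing value in any column.
incomplete_rows = toy[toy.isnull().any(axis=1)]

# dropna() removes those rows but does not renumber the index.
cleaned = toy.dropna()

print(missing_per_column)
print(incomplete_rows)
print(cleaned.index.tolist())
```

Because the index is not renumbered, the cleaned dataframe has gaps in its index; `reset_index(drop=True)` would close them if a contiguous index were needed.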

Our data is now collected and ready for use. Next, we label it. Because only the top 30 players qualify for the Tour Championship, the final stage of the FedEx Cup playoffs, we label a player with 1 if the player's ranking is at most 30 and with -1 if it is greater than 30.

In [80]:
df.loc[df["Ranking"].astype(int) <= 30, "Eligible"] = 1
df.loc[df["Ranking"].astype(int) > 30, "Eligible"] = -1
df

Unnamed: 0,Name,Season,Ranking,Driving Distance,Driving Accuracy,Club Head Speed,Ball Speed,Spin Rate,Eligible
0,Justin Thomas,2017,1,309.1,54.64,116.52,174.84,2320.1,1.0
1,Jordan Spieth,2017,2,294.6,58.67,112.66,168.55,2439.6,1.0
2,Xander Schauffele,2017,3,306.5,58.80,118.33,174.24,2518.8,1.0
3,Dustin Johnson,2017,4,314.8,54.02,121.45,180.66,2499.9,1.0
4,Jon Rahm,2017,5,305.3,58.27,116.42,174.53,2193.0,1.0
...,...,...,...,...,...,...,...,...,...
620,C.T. Pan,2021,121,296.3,61.03,111.20,167.34,2129.2,-1.0
621,Matt Kuchar,2021,122,288.0,65.81,108.60,162.18,2419.4,-1.0
622,Brice Garnett,2021,123,288.1,70.86,109.53,164.71,2539.5,-1.0
623,Scott Stallings,2021,124,298.2,58.83,115.96,173.80,2516.0,-1.0
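
As an alternative sketch, the same labeling rule can be written as a single vectorized expression with `np.where`. This is not the approach used above: it yields integer labels (1 and -1) rather than the floats shown in the output. The frame `toy` is illustrative, with rankings taken from the table above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Justin Thomas", "Jon Rahm", "Matt Kuchar"],
    "Ranking": ["1", "5", "122"],  # stored as strings on purpose, hence astype(int)
})

# 1 for a ranking of at most 30, -1 otherwise.
toy["Eligible"] = np.where(toy["Ranking"].astype(int) <= 30, 1, -1)
print(toy)
```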


Now our dataset is ready. Let's save the data to a CSV file and continue with the machine learning problem in another notebook.

In [72]:
df.to_csv("../data/pga_data.csv", index=False)
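
The `index=False` argument matters when the file is read back. The sketch below writes a toy frame to an in-memory CSV string both with and without the index and compares the columns that `pd.read_csv` returns. The frame `toy` is illustrative, and no file is written.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Name": ["Justin Thomas"], "Season": ["2017"], "Ranking": [1]})

# With the default index=True, the index is written as an unnamed first column.
csv_with_index = toy.to_csv()
cols_with_index = pd.read_csv(io.StringIO(csv_with_index)).columns.tolist()

# With index=False, only the data columns are written.
csv_without_index = toy.to_csv(index=False)
cols_without_index = pd.read_csv(io.StringIO(csv_without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Note that `Season` is stored as a string here, but `pd.read_csv` would by default parse it back as an integer column.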