# Machine Learning on PGA Tour - Data Collection

In this notebook we collect the data, handle missing values, and label the data. The data is scraped from the official PGA Tour website: https://www.pgatour.com.

In [1]:
# IMPORTS
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

After importing the needed libraries, we create the dataframe that will store the data. It starts with two columns: the name of the player and his final ranking in the 2021 PGA Tour season.

In [2]:
# CREATE DATAFRAME FOR THE DATA
df = pd.DataFrame(columns=["Name", "Ranking"])

Next, we want the names of the players and their rankings for the 2021 season. We get this data from the official PGA Tour site.

In [3]:
# GET DATA OF NAMES
URL = "https://www.pgatour.com/content/pgatour/stats/stat.02671.y2021.html"
response = requests.get(URL)
content = response.content
soup = BeautifulSoup(content, "html.parser")
table = soup.find(id="statsTable")
rows = table.find_all("tr")
names = []
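for row in rows[1:126]:  # skip the header row, keep the top 125 players
    name = row.find("a").text.strip()
    ranking = int(row.find("td").text.strip())
    names.append((ranking, name))
# Assign the columns from plain lists so that Ranking stays an integer;
# np.array(names) would coerce every value, rankings included, to a string.
df["Name"] = [name for _, name in names]
df["Ranking"] = [ranking for ranking, _ in names]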

Then we will collect the values for the features. To do this, we first create a function that scrapes the values from a given URL and adds them to the dataframe.

In [4]:
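def scrape_data(url, header, df):
    """Scrape one stat page and add its values to df as a column named header."""
    response = requests.get(url)
    response.raise_for_status()  # fail early instead of parsing an error page
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find(id="statsTable")
    rows = table.find_all("tr")
    values = []
    for row in rows[1:]:  # skip the header row
        name = row.find("a").text.strip()
        # The stat value is read from the fifth cell of each row
        value = float(row.find_all("td")[4].text.strip())
        values.append((name, value))
    df2 = pd.DataFrame(values, columns=["Name", header])
    # A left join keeps exactly the players already in df, in their original
    # order; players missing from this stat page get NaN.
    result = pd.merge(left=df, right=df2, on="Name", how="left")
    return result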
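
The parsing step inside `scrape_data` can be tried without a network connection. The following is a minimal sketch, not the code used in this notebook: `SAMPLE_HTML` is a made-up table whose column layout is only an assumption about the real page, and `parse_stats_table` is a hypothetical helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the table that scrape_data expects:
# a header row, then one row per player with the value in the fifth cell.
SAMPLE_HTML = """
<table id="statsTable">
  <tr><th>Rank</th><th>Last</th><th>Player</th><th>Rounds</th><th>Avg</th></tr>
  <tr><td>1</td><td>1</td><td><a href="#">Player A</a></td><td>80</td><td>310.5</td></tr>
  <tr><td>2</td><td>3</td><td><a href="#">Player B</a></td><td>75</td><td>305.1</td></tr>
</table>
"""


def parse_stats_table(html, header, value_col=4):
    """Return a dataframe with the player name and one stat column."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="statsTable")
    if table is None:
        raise ValueError("no element with id='statsTable' found")
    records = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        link = row.find("a")
        cells = row.find_all("td")
        if link is None or len(cells) <= value_col:
            continue  # skip rows that do not look like a player row
        records.append((link.text.strip(), float(cells[value_col].text.strip())))
    return pd.DataFrame(records, columns=["Name", header])


sample = parse_stats_table(SAMPLE_HTML, "Driving Distance")
print(sample)
```

Checking for a missing table or a malformed row gives a clearer error than the `AttributeError` that `None.find_all` would otherwise raise.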

Then we will find out the average driving distances of the players (in yards).

In [5]:
# GET DATA OF DRIVING DISTANCE
URL = "https://www.pgatour.com/stats/stat.101.y2021.html"
header = "Driving Distance"
df = scrape_data(URL, header, df)
df

Unnamed: 0,Name,Ranking,Driving Distance
0,Patrick Cantlay,1,302.8
1,Jon Rahm,2,309.0
2,Kevin Na,3,288.5
3,Justin Thomas,4,303.9
4,Viktor Hovland,5,302.2
...,...,...,...
120,C.T. Pan,121,296.3
121,Matt Kuchar,122,287.8
122,Brice Garnett,123,288.1
123,Scott Stallings,124,297.4


Now our dataframe has the name of the player, the ranking, and the player's average driving distance during the 2021 PGA Tour season. Let's see whether we have any null values at this point:

In [6]:
df[df["Driving Distance"].isnull()]

Unnamed: 0,Name,Ranking,Driving Distance
106,Garrick Higgo,107,
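
A null like this can be traced back to the merge. As a rough sketch with made-up data (the frames `players` and `stat` and their values are hypothetical), the `indicator` option of `pd.merge` marks which players had no matching row in the scraped table:

```python
import pandas as pd

# Hypothetical stand-ins for the ranking dataframe and one scraped stat table.
players = pd.DataFrame(
    {"Name": ["Player A", "Player B", "Player C"], "Ranking": [1, 2, 3]}
)
stat = pd.DataFrame(
    {"Name": ["Player A", "Player C", "Player D"],
     "Driving Distance": [302.8, 288.5, 295.0]}
)

# indicator=True adds a "_merge" column; with a left join its values are
# "both" (name found in the stat table) or "left_only" (name not found).
merged = pd.merge(players, stat, on="Name", how="left", indicator=True)

# Players that are ranked but have no row in the stat table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```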


One player does not have a driving distance in the data source. There are multiple ways to fix this problem. As it is only one player, we choose to fill in the driving distance manually. According to [PGA Tour statistics](https://www.pgatour.com/players/player.54421.garrick-higgo.html), Garrick Higgo's average driving distance during the 2021 PGA Tour season was 308.2 yards.

In [7]:
df.at[106, "Driving Distance"] = 308.2
df[df["Name"] == "Garrick Higgo"]

Unnamed: 0,Name,Ranking,Driving Distance
106,Garrick Higgo,107,308.2
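
The cell above addresses the row by its index label, 106. A possible alternative, sketched here on made-up data, selects the row by player name so the fill does not depend on row order. `fill_by_name` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd


def fill_by_name(df, name, column, value):
    """Set df[column] to value for the single row whose Name equals name."""
    mask = df["Name"] == name
    if mask.sum() != 1:
        raise ValueError(f"expected exactly one row for {name!r}, found {mask.sum()}")
    df.loc[mask, column] = value
    return df


# Hypothetical two-player frame with one missing value.
demo = pd.DataFrame(
    {"Name": ["Player A", "Player B"], "Driving Distance": [302.8, np.nan]}
)
demo = fill_by_name(demo, "Player B", "Driving Distance", 308.2)
print(demo)
```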


In [8]:
df[df["Driving Distance"].isnull()]

Unnamed: 0,Name,Ranking,Driving Distance


Now all of our players have a driving distance. Let's continue to add other features to our dataset. The next one is Driving Accuracy, which tells the percentage of time a tee shot comes to rest in the fairway. This feature is quite important as it doesn't matter how long a tee shot is if it ends up in the forest.

In [9]:
# GET DATA OF DRIVING ACCURACY
URL = "https://www.pgatour.com/stats/stat.102.y2021.html"
header = "Driving Accuracy"
df = scrape_data(URL, header, df)
df

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy
0,Patrick Cantlay,1,302.8,60.71
1,Jon Rahm,2,309.0,63.73
2,Kevin Na,3,288.5,66.56
3,Justin Thomas,4,303.9,55.72
4,Viktor Hovland,5,302.2,63.86
...,...,...,...,...
120,C.T. Pan,121,296.3,61.11
121,Matt Kuchar,122,287.8,66.09
122,Brice Garnett,123,288.1,70.72
123,Scott Stallings,124,297.4,59.26


Now let's see if we have some null values.

In [10]:
df[df["Driving Accuracy"].isnull()]

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy
106,Garrick Higgo,107,308.2,


Same player again. Something seems to be off with Higgo's data. Let's fill this value manually again.

In [11]:
df.at[106, "Driving Accuracy"] = 55.71
df[df["Name"] == "Garrick Higgo"]

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy
106,Garrick Higgo,107,308.2,55.71


Let's proceed to the next feature, which is the Club Head Speed. This value tells the speed at which the club impacts the ball on Par 4 and Par 5 tee shots. The value is in miles per hour (mph).

In [12]:
# GET DATA OF CLUB HEAD SPEED
URL = "https://www.pgatour.com/stats/stat.02401.y2021.html"
header = "Club Head Speed"
df = scrape_data(URL, header, df)
df

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy,Club Head Speed
0,Patrick Cantlay,1,302.8,60.71,116.22
1,Jon Rahm,2,309.0,63.73,118.72
2,Kevin Na,3,288.5,66.56,112.40
3,Justin Thomas,4,303.9,55.72,117.01
4,Viktor Hovland,5,302.2,63.86,116.64
...,...,...,...,...,...
120,C.T. Pan,121,296.3,61.11,111.25
121,Matt Kuchar,122,287.8,66.09,108.59
122,Brice Garnett,123,288.1,70.72,109.55
123,Scott Stallings,124,297.4,59.26,115.95


I have a feeling we have one missing value. Let's see.

In [13]:
df[df["Club Head Speed"].isnull()]

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy,Club Head Speed
106,Garrick Higgo,107,308.2,55.71,


"How did you do that?" OK, let's fill this value manually again.

In [14]:
df.at[106, "Club Head Speed"] = 118.77
df[df["Name"] == "Garrick Higgo"]

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy,Club Head Speed
106,Garrick Higgo,107,308.2,55.71,118.77


Now for our last feature, which is the Ball Speed. This value tells the peak speed of the golf ball at launch on Par 4 and Par 5 tee shots. The unit is again miles per hour (mph).

In [15]:
URL = "https://www.pgatour.com/stats/stat.02402.y2021.html"
header = "Ball Speed"
df = scrape_data(URL, header, df)
df

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy,Club Head Speed,Ball Speed
0,Patrick Cantlay,1,302.8,60.71,116.22,174.71
1,Jon Rahm,2,309.0,63.73,118.72,178.46
2,Kevin Na,3,288.5,66.56,112.40,165.02
3,Justin Thomas,4,303.9,55.72,117.01,176.18
4,Viktor Hovland,5,302.2,63.86,116.64,173.98
...,...,...,...,...,...,...
120,C.T. Pan,121,296.3,61.11,111.25,167.40
121,Matt Kuchar,122,287.8,66.09,108.59,162.19
122,Brice Garnett,123,288.1,70.72,109.55,164.76
123,Scott Stallings,124,297.4,59.26,115.95,173.80


Again, let's see if Higgo's value is missing.

In [16]:
df[df["Ball Speed"].isnull()]

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy,Club Head Speed,Ball Speed
106,Garrick Higgo,107,308.2,55.71,118.77,


This time, the official PGA Tour website doesn't have any statistics for Higgo's ball speed, so we have to obtain the value some other way. One option is to fill it with the mean ball speed of the players ranked just above and below him (rows 96 to 115 of the dataframe; pandas skips the missing value itself when computing the mean). This way we don't lose the player's other data, and we get an approximate value that should not appear as an outlier.

In [17]:
df.at[106, "Ball Speed"] = round(df.iloc[96:116, 5].mean(), 2)
df[df["Name"] == "Garrick Higgo"]

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy,Club Head Speed,Ball Speed
106,Garrick Higgo,107,308.2,55.71,118.77,169.88
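
The same idea can be written as a small function. This is only a sketch on made-up numbers: `neighbor_mean` and its `window` parameter are hypothetical names, and with `position=106` and `window=10` it would cover the same slice, `96:116`, as the cell above.

```python
import numpy as np
import pandas as pd


def neighbor_mean(df, position, column, window=10):
    """Mean of column over the rows near position, ignoring missing values."""
    start = max(position - window, 0)
    stop = min(position + window, len(df))  # exclusive end, clipped to the frame
    # Series.mean skips NaN by default, so the missing value does not count.
    return round(df[column].iloc[start:stop].mean(), 2)


# Hypothetical ball speeds with one missing value at position 2.
demo = pd.DataFrame({"Ball Speed": [170.0, 172.0, np.nan, 168.0, 166.0]})
estimate = neighbor_mean(demo, 2, "Ball Speed", window=2)
demo.loc[2, "Ball Speed"] = estimate
print(estimate)
```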


Now we have all of our features collected. Next we have to label the data. As only the top 30 players get to participate in the Tour Championship, the final stage of the FedEx Cup playoffs, we can label our data with 1 if the player's ranking is 30 or smaller and with -1 if the player's ranking is over 30.

In [18]:
df.loc[df["Ranking"].astype(int) <= 30, "Eligible"] = 1
df.loc[df["Ranking"].astype(int) > 30, "Eligible"] = -1
df

Unnamed: 0,Name,Ranking,Driving Distance,Driving Accuracy,Club Head Speed,Ball Speed,Eligible
0,Patrick Cantlay,1,302.8,60.71,116.22,174.71,1.0
1,Jon Rahm,2,309.0,63.73,118.72,178.46,1.0
2,Kevin Na,3,288.5,66.56,112.40,165.02,1.0
3,Justin Thomas,4,303.9,55.72,117.01,176.18,1.0
4,Viktor Hovland,5,302.2,63.86,116.64,173.98,1.0
...,...,...,...,...,...,...,...
120,C.T. Pan,121,296.3,61.11,111.25,167.40,-1.0
121,Matt Kuchar,122,287.8,66.09,108.59,162.19,-1.0
122,Brice Garnett,123,288.1,70.72,109.55,164.76,-1.0
123,Scott Stallings,124,297.4,59.26,115.95,173.80,-1.0
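
In the output above the labels appear as 1.0 and -1.0, because the new column starts out empty and is filled in two steps. If integer labels are preferred, one possible alternative is `np.where`, shown here on a made-up three-player frame:

```python
import numpy as np
import pandas as pd

# Hypothetical rankings around the cut-off of 30.
demo = pd.DataFrame(
    {"Name": ["Player A", "Player B", "Player C"], "Ranking": [1, 30, 31]}
)

# 1 for the top 30, -1 for everyone else, assigned in a single step.
demo["Eligible"] = np.where(demo["Ranking"] <= 30, 1, -1)
print(demo)
```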


Now our dataset is ready. Let's save the data into a CSV file and continue solving this machine learning problem in another notebook.

In [32]:
df.to_csv("../data/pga_data.csv", index=False)
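
As a quick sanity check of the saved format, the sketch below writes a made-up frame to an in-memory buffer instead of a file and reads it back. With `index=False` the row index is not written, so no extra `Unnamed: 0` column appears when the data is loaded again.

```python
import io

import pandas as pd

# Hypothetical two-row frame with the same kind of columns as the dataset.
demo = pd.DataFrame(
    {"Name": ["Player A", "Player B"], "Ranking": [1, 2], "Eligible": [1, 1]}
)

buffer = io.StringIO()           # stands in for the CSV file on disk
demo.to_csv(buffer, index=False)
buffer.seek(0)                   # rewind before reading
restored = pd.read_csv(buffer)
print(restored)
```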