# PGATour.com Web Scraper
### Updated: 1/24/2024 to handle the new website from late 2022

#### About the Site:
The PGA Tour website had recently been rebuilt when I took on this project. As far as I was aware, common scraping libraries like BeautifulSoup and Selenium wouldn't work for the scraping I had in mind. Because of this, this notebook follows my journey in learning how tables are built on websites through JavaScript and in scraping the data by taking advantage of that fact.

The site serves its data through a GraphQL API, so watching the Network tab in Google Chrome's developer tools revealed a couple of key observations about how it works. First, each stat has a respective "stat_id", an identifier indicating which stat is being reported in the table. Second, the year and tournament chosen in the drop-down menus are sent along with the request. In this way, the user can indicate which year, and through which tournament, they want to see the leaderboard for any given statistic.

In my application of this data, I wanted the yearly leaderboard at the end of each competitive season, i.e. after the TOUR Championship. This notebook aims to build an SQL database in which each PGA pro has one row per year holding the values of each of their statistics for that year.

#### Procedure:
1. Import relevant libraries
2. Define a method to obtain all "Stat IDs" from https://www.pgatour.com/stats
3. Define a method to obtain a dataframe of all players on tour and their specific stat given a Stat ID and Year
4. Merge dataframes into one dataframe
5. Convert merged dataframe to SQL database for use in subsequent notebooks

## 1. Import Relevant Libraries

In [170]:
# in this notebook, requests is used to query the website's GraphQL endpoint and pandas is used as a
# data manipulation tool
import requests
import pandas as pd

import os
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

# Access the API key from the environment variable
X_API_KEY = os.getenv("X_API_KEY")
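
If `X_API_KEY` is missing from `.env`, `os.getenv` returns `None`, and the problem tends to surface only later when the API response cannot be parsed. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper, not part of the original notebook.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file"
        )
    return value
```

With this helper, the assignment above could be written as `X_API_KEY = require_env("X_API_KEY")` after the call to `load_dotenv()`.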

## 2. Define get_stat_ids() Method to Obtain All Stat IDs

In [171]:
# get_stat_ids() is a method designed to create and return a dictionary of all stats on the PGATour website 
# and their corresponding IDs
def get_stat_ids():
    # this graphQL payload retrieves statistical information based on the given query
    # can be found in the Network tab of Chrome after utilizing the Inspect tool
    payload = {
        "operationName": "StatOverview",
        "variables": {
            "tourCode": "R",
            "year": 2024
        },
        "query": "query StatOverview($tourCode: TourCode!, $year: Int) {\n  statOverview(tourCode: $tourCode, year: $year) {\n    tourCode\n    year\n    categories {\n      category\n      displayName\n      subCategories {\n        displayName\n        stats {\n          statId\n          statTitle\n        }\n      }\n    }\n    stats {\n      statName\n      tourAvg\n      statId\n      players {\n        statId\n        playerId\n        statTitle\n        statValue\n        playerName\n        rank\n        country\n        countryFlag\n      }\n    }\n  }\n}"
    }

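    page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})
    # stop here on an HTTP error status rather than failing on a confusing KeyError when parsing below
    page.raise_for_status()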

    data = page.json()["data"]["statOverview"]["categories"]

    # named stat_ids rather than dict so that Python's built-in dict type is not shadowed
    stat_ids = {}
    
    # nested for loops parse the JSON data to find the title and corresponding ID of each statistic
    # and insert it into a Python dictionary
    for category in data:
        for subcategory in category['subCategories']:
            for stat in subcategory['stats']:
                stat_ids[stat['statTitle']] = stat['statId']
    return stat_ids
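
To make the nested loops concrete, here is a minimal sketch that runs the same traversal on a hand-written sample shaped like the `categories` field requested in the query. The category names, stat titles, and stat IDs below are placeholders made up for illustration, not values returned by the API, and `flatten_stat_ids` is a hypothetical helper name.

```python
def flatten_stat_ids(categories):
    """Map each statTitle to its statId across all categories and subcategories."""
    return {
        stat["statTitle"]: stat["statId"]
        for category in categories
        for subcategory in category["subCategories"]
        for stat in subcategory["stats"]
    }


# Placeholder data: titles and IDs are illustrative only
sample_categories = [
    {
        "category": "EXAMPLE_CATEGORY",
        "displayName": "Example Category",
        "subCategories": [
            {
                "displayName": "Example Subcategory",
                "stats": [
                    {"statId": "id-001", "statTitle": "Example Stat A"},
                    {"statId": "id-002", "statTitle": "Example Stat B"},
                ],
            }
        ],
    }
]

print(flatten_stat_ids(sample_categories))
```

Because the result is keyed by title, two stats that share a title would collapse into one entry, with the later one winning.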

## 3. Define get_df() Method to Obtain a Data Frame for a Given Statistic

In [172]:
# get_df() is a method that takes in a given stat id and year as parameters and creates and returns a data frame
# storing each pro on tour that year and their performance in that given stat
def get_df(stat_id, year):

    # this tournament ID corresponds to the TOUR Championship
    TOURNAMENT_ID = "R2023060"

    # this graphQL payload requests the leaderboard table for the given stat and year
    payload = {
        "operationName": "StatDetails",
        "variables": {
            "tourCode": "R",
            "statId": stat_id,
            "year": year,
        },
        "query": "query StatDetails($tourCode: TourCode!, $statId: String!, $year: Int, $eventQuery: StatDetailEventQuery) {\n  statDetails(\n    tourCode: $tourCode\n    statId: $statId\n    year: $year\n    eventQuery: $eventQuery\n  ) {\n    tourCode\n    year\n    displaySeason\n    statId\n    statType\n    tournamentPills {\n      tournamentId\n      displayName\n    }\n    yearPills {\n      year\n      displaySeason\n    }\n    statTitle\n    statDescription\n    tourAvg\n    lastProcessed\n    statHeaders\n    statCategories {\n      category\n      displayName\n      subCategories {\n        displayName\n        stats {\n          statId\n          statTitle\n        }\n      }\n    }\n    rows {\n      ... on StatDetailsPlayer {\n        __typename\n        playerId\n        playerName\n        country\n        countryFlag\n        rank\n        rankDiff\n        rankChangeTendency\n        stats {\n          statName\n          statValue\n          color\n        }\n      }\n      ... on StatDetailTourAvg {\n        __typename\n        displayName\n        value\n      }\n    }\n    sponsorLogo\n  }\n}"
    }

    # for the 2022-23 season, the drop down menu contains tournaments after the end of the tour season
    # this if statement covers that case by requesting values only through the TOUR Championship,
    # ignoring later events for consistency
    if year == 2023:
        payload["variables"]["eventQuery"] = {
            "queryType": "THROUGH_EVENT",
            "tournamentId": TOURNAMENT_ID,
        }

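    page = requests.post("https://orchestrator.pgatour.com/graphql", json=payload, headers={"x-api-key": X_API_KEY})
    # stop here on an HTTP error status rather than failing on a confusing KeyError when parsing below
    page.raise_for_status()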

    data = page.json()["data"]["statDetails"]["rows"]

    # builds a table by parsing the JSON rows, which is converted to a dataframe below
    table = []
    for item in data:
        # the query can also return a tour-average row with no playerName or stats, so skip those
        if "playerName" not in item:
            continue
        row = {
            "player": item["playerName"],
        }
        # NOTE: in the following line, the assumption is made that the only valuable stat is the 1st element
        # which is the case for my application but this could be adapted to retrieve all stats in each table
        if item["stats"]:
            row[item["stats"][0]["statName"]] = item["stats"][0]["statValue"]
        else:
            row["NoStats"] = None
        table.append(row)
    
    df = pd.DataFrame(table)
    return df
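
The NOTE in `get_df()` mentions that the parsing could be adapted to keep every stat in a row rather than only the first. A minimal sketch of that adaptation follows, assuming each player row carries the `playerName` and `stats` fields requested in the query. `row_to_record` is a hypothetical helper, and the sample row (including the second stat name) is made up for illustration.

```python
def row_to_record(item):
    """Flatten one player row into a dict with one key per statName."""
    record = {"player": item["playerName"]}
    # `or []` covers rows whose stats list is missing, None, or empty
    for stat in item.get("stats") or []:
        record[stat["statName"]] = stat["statValue"]
    return record


# Made-up row for illustration; real stat names and values come from the API
sample_row = {
    "__typename": "StatDetailsPlayer",
    "playerName": "Example Player",
    "stats": [
        {"statName": "Avg", "statValue": "1.234"},
        {"statName": "Example Total", "statValue": "98.765"},
    ],
}

print(row_to_record(sample_row))
```

Inside `get_df()`, appending `row_to_record(item)` to `table` in place of the hand-built `row` would give one column per stat. The merge step would then need to pick the column it wants by name instead of by position.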

## 4. Establish Desired Statistics and Merge Yearly Data Frames Into One

In [173]:
# desired_stats list to input which stats are important to include in the data frame
desired_stats = ['SG: Total','SG: Tee-to-Green','SG: Off-the-Tee','SG: Approach the Green','SG: Around-the-Green', 'SG: Putting']
stat_ids = get_stat_ids()
df = pd.DataFrame()

# for each year between 2013 and 2023, create a data frame for every statistic that year and merge them
# in the end, concatenate each year's data frame into one large data frame to be converted into an SQL database
for year in range(2013, 2024):
    # start from the list of players in the 'SG: Total' table; every other stat is merged onto it
    df_year = get_df(stat_ids['SG: Total'], year)[['player']]
    
    for stat_name in desired_stats:
        df_stat = get_df(stat_ids[stat_name], year)

        df_stat.rename(columns={df_stat.columns[1]: stat_name}, inplace=True)

        df_year = pd.merge(df_year, df_stat, on='player')
    df_year.insert(0, 'Year', year)
    
    df = pd.concat([df, df_year], ignore_index=True)
    
df

| | Year | player | SG: Total | SG: Tee-to-Green | SG: Off-the-Tee | SG: Approach the Green | SG: Around-the-Green | SG: Putting |
|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | Steve Stricker | 2.193 | 1.474 | .311 | .567 | .596 | .720 |
| 1 | 2013 | Tiger Woods | 2.064 | 1.637 | -.142 | 1.533 | .247 | .426 |
| 2 | 2013 | Justin Rose | 1.723 | 1.912 | .459 | .961 | .491 | -.188 |
| 3 | 2013 | Henrik Stenson | 1.618 | 1.614 | .710 | .776 | .128 | .004 |
| 4 | 2013 | Sergio Garcia | 1.519 | .928 | .131 | .538 | .259 | .591 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2055 | 2023 | Hank Lebioda | -1.482 | -1.141 | -0.491 | -0.237 | -0.413 | -0.341 |
| 2056 | 2023 | Andrew Landry | -1.535 | -1.558 | -0.170 | -1.163 | -0.226 | 0.023 |
| 2057 | 2023 | Max McGreevy | -1.695 | -1.207 | -0.358 | -0.905 | 0.055 | -0.488 |
| 2058 | 2023 | Nick Watney | -1.720 | -1.667 | -1.355 | -0.139 | -0.173 | -0.053 |
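
Step 5 of the procedure, converting the merged data frame to an SQL database, is not shown in this notebook. Here is a minimal sketch of one way to do it with pandas and the standard-library `sqlite3` module. The helper `save_stats`, the table name `player_stats`, the in-memory database, and the tiny sample frame are all assumptions for illustration. Printed values such as `.311` suggest the stats are stored as strings, so the sketch converts them to numbers first.

```python
import sqlite3

import pandas as pd


def save_stats(df, conn, table_name="player_stats"):
    """Convert stat columns to numbers and write the frame to an SQL table."""
    out = df.copy()
    stat_columns = [col for col in out.columns if col not in ("Year", "player")]
    for col in stat_columns:
        # errors="coerce" turns unparseable values into NaN instead of raising
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out.to_sql(table_name, conn, if_exists="replace", index=False)
    return out


# Tiny stand-in for the merged frame, using two rows from the output above
sample = pd.DataFrame(
    {
        "Year": [2013, 2013],
        "player": ["Steve Stricker", "Tiger Woods"],
        "SG: Total": ["2.193", "2.064"],
        "SG: Putting": [".720", ".426"],
    }
)

conn = sqlite3.connect(":memory:")  # use a file path instead to persist the database
save_stats(sample, conn)
rows = conn.execute(
    'SELECT player, "SG: Total" FROM player_stats ORDER BY "SG: Total" DESC'
).fetchall()
print(rows)
conn.close()
```

Column names such as `SG: Total` contain spaces and colons, so they need double quotes in SQL queries. Renaming them before writing would be a reasonable alternative.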


## Conclusion

This notebook tracks my progress, which I would consider to be leaps and bounds, in understanding how websites use JavaScript and GraphQL to generate tables. Not only did this project push me out of my comfort zone to learn a new web scraping technique, but it also gave me great practice merging and concatenating data frames with the pandas library.

Overall, there is much more to be done with this notebook. For starters, one limitation I have noticed is that from 2004 to 2012 the aforementioned drop-down menu doesn't list the TOUR Championship as the last tournament, so another condition might need to be checked. Luckily, these years are outside the scope of the data project in which I'll be using this data.
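
One way to handle that extra condition would be a per-year lookup of the tournament to report through. Here is a minimal sketch, assuming a hypothetical `THROUGH_EVENT_IDS` mapping and `build_variables` helper. Only the 2023 entry comes from this notebook; IDs for other seasons would have to be read from the Network tab.

```python
# Season-ending tournament to report "through", keyed by year.
# Only the 2023 ID appears in this notebook; other years would need to be added by hand.
THROUGH_EVENT_IDS = {
    2023: "R2023060",
}


def build_variables(stat_id, year, through_event_ids=THROUGH_EVENT_IDS):
    """Build the `variables` part of a StatDetails payload for one stat and year."""
    variables = {"tourCode": "R", "statId": stat_id, "year": year}
    tournament_id = through_event_ids.get(year)
    if tournament_id is not None:
        # Limit the table to results through the chosen tournament
        variables["eventQuery"] = {
            "queryType": "THROUGH_EVENT",
            "tournamentId": tournament_id,
        }
    return variables


# "example-stat-id" is a placeholder, not a real stat ID
print(build_variables("example-stat-id", 2023))
print(build_variables("example-stat-id", 2015))
```

Years without an entry fall back to the default request with no `eventQuery`, which matches how `get_df()` treats every year other than 2023.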

With these limitations in mind, this notebook sets up the further notebooks in this repository, where I analyze this data to find interesting patterns and trends in professional golf that I hope will improve my own amateur game.