## PGA Data Scraper 

[Taken from an effort by Patrick Young on Github](https://github.com/Patrick-Young/PGA-Data)

This scraper uses BeautifulSoup (bs4) to pull 2010-2017 season stats from the pgatour.com website and combine them into a single pandas dataframe.

In this notebook, we set the range of years (seasons) and save the dataframe to CSV so it can be reused in multiple projects.

Patrick goes pretty far to build a very usable dataframe for analysis by merging a bunch of discrete dataframes into a master dataframe laid out to suit machine learning exercises.

PGA Tour statistical data is spread across separate pages on the pgatour.com/stats website.
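
Each statistic lives at its own URL, identified by a stat ID and a season. As a rough sketch of that layout, the snippet below builds page URLs from the URL pattern and stat IDs used in the scraping loop of this notebook. The `STAT_IDS` dictionary and the `stat_url` helper are hypothetical names introduced only for illustration, and whether these URLs still resolve on the live site is not verified here.

```python
# Hypothetical helper: the URL pattern and stat IDs are copied from the
# scraping loop in this notebook; the names below are illustrative only.
URL_TEMPLATE = "https://www.pgatour.com/stats/stat.{stat_id}.{year}.html"

STAT_IDS = {
    "fedex_cup_points": "02671",
    "top_10s": "138",
    "scoring_average": "120",
    "driving_distance": "101",
}

def stat_url(stat_name, year):
    """Return the stats page URL for one statistic and one season."""
    return URL_TEMPLATE.format(stat_id=STAT_IDS[stat_name], year=year)

# One URL per season for a single statistic.
urls = [stat_url("scoring_average", year) for year in range(2010, 2013)]
for url in urls:
    print(url)
```

Keeping the IDs in one dictionary makes it easier to see which pages are scraped and to add a new statistic later.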

In [1]:
import requests # Request module
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup #Web scraping module
import matplotlib.pyplot as plt 
import seaborn as sns

### Here is Patrick's approach to scraping this data from these separate web pages:

- Create a dataframe for each statistic page. Each dataframe includes the players and their stats.
- Keep only the columns needed from that page.
- Repeat the steps above for years 2010-2017.
- Write the result to CSV so we can use it again in other projects.
- 10/7/2019 - Added two stats for SG_OTT and SG_ATG

### Implementation strategy:

* Pull column headers
* Pull players from a particular stats page
* Pull statistics from the page
* Create a dictionary to store player data for that stats page
* Use the four functions above to create a pandas dataframe for that particular statistic
* Loop through years 2010-2017 to build one combined dataframe
* Write to CSV for future use

In [2]:
# Pull column headers from page
def get_headers(soup):
    '''This function gets the column names to use for the data frame.'''
    headers = []
    
    #Get rounds header
    rounds = soup.find_all(class_="rounds hidden-small hidden-medium")[0].get_text()
    headers.append(rounds)
    
    #Get other headers
    stat_headers = soup.find_all(class_="col-stat hidden-small hidden-medium")
    for header in stat_headers:
        headers.append(header.get_text())   
    return headers

In [3]:
# Pull players from page
# Get Players

def get_players(soup):
    '''This function takes the beautiful soup created and uses it to gather player names from the specified stats page.'''
    
    player_list = []
    
    #Get player as html tags
    players = soup.select('td a')[1:] #Skip the first link because the first line of every table is not useful.
    #Loop through list
    for player in players:
        player_list.append(player.get_text())
    
    return player_list

In [4]:
# Pull statistics from page
# Get Stats

def get_stats(soup, categories):
    '''This function takes the soup created before and the number of categories (stat columns per row)
    and returns the stats as a list of rows.'''
    
    #Finds all tags with class specified and puts into a list
    stats = soup.find_all(class_="hidden-small hidden-medium")
    
    #Initialize stats list
    stat_list = []
    
    #Loop through 
    for i in range(0, len(stats)-categories+1, categories):
        temp_list = []
        for j in range(categories):
            temp_list.append(stats[i + j].get_text())
        stat_list.append(temp_list)
            
    return stat_list
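
The grouping step in `get_stats` can be tried without any HTML. The sketch below applies the same fixed-width chunking to a plain list of strings; `chunk_rows` and the sample values are made up for illustration.

```python
def chunk_rows(cells, categories):
    """Split a flat list of cell values into rows of `categories` items.

    Mirrors the index arithmetic in get_stats: any trailing cells that do
    not fill a complete row are dropped.
    """
    rows = []
    for start in range(0, len(cells) - categories + 1, categories):
        rows.append(cells[start:start + categories])
    return rows

# Two complete rows of three cells, plus one leftover cell (invented values).
cells = ["97", "69.606", "6,752", "73", "69.660", "5,085", "stray"]
rows = chunk_rows(cells, 3)
print(rows)
```

If `categories` does not match the number of stat cells per player on the page, the rows shift out of alignment without raising an error, which is why each call to `make_dataframe` passes a page-specific count.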

In [5]:
# Create data dictionary for page

def stats_dict(players, stats):
        '''This function takes two lists, players and stats, 
        and creates a dictionary with the player being the key 
        and the stats as the values (as a list)'''
    
        #initialize player dictionary
        player_dict = {}
    
        #Loop through player list
        for i, player in enumerate(players):
            player_dict[player] = stats[i]
    
        return player_dict
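
Pairing players with stat rows is a standard zip-into-dict pattern. As a sketch (not the author's code), `stats_dict_checked` below is a hypothetical variant that also verifies the two lists have the same length, since a mismatch usually means the page was parsed incorrectly. The names and values are placeholders.

```python
def stats_dict_checked(players, stats):
    """Map each player name to its row of stats, verifying the lengths match."""
    if len(players) != len(stats):
        raise ValueError(
            f"{len(players)} players but {len(stats)} stat rows"
        )
    return dict(zip(players, stats))

players = ["Player A", "Player B"]             # placeholder names
stats = [["97", "69.606"], ["73", "69.660"]]   # placeholder values
player_dict = stats_dict_checked(players, stats)
print(player_dict["Player B"])
```

Because the player name is the dictionary key, two players who share a name collapse into a single entry, both here and in the original function.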

### Function to make the dataframe 

In [6]:
# Use functions 1-4 to create dataframe for statistic. "make_dataframe"
# Mega function

def make_dataframe(url, categories):
        
    ##Create soup object from url.
    response = requests.get(url)
    text = response.text
    soup = BeautifulSoup(text, 'lxml')
    
    #1. Get Headers
    headers = get_headers(soup)
    
    #2. Get Players
    players = get_players(soup)
    
    #3. Get Stats
    stats = get_stats(soup, categories)
    #print(categories)
    
    #4. Make stats dictionary.
    stats_dictionary = stats_dict(players, stats)
    
    #Make dataframe
    frame = pd.DataFrame(stats_dictionary, index = headers).T
    
    #Reset index
    frame = frame.reset_index()
    
    #For each Dataframe, change index column to 'NAME'
    frame = frame.rename(index = str, columns = {'index': 'NAME'})
    return frame
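
The dictionary-to-dataframe step is the least obvious part of `make_dataframe`. The sketch below reproduces it on placeholder data, then shows `pd.DataFrame.from_dict` with `orient='index'` as an alternative that avoids the transpose. The player names, headers, and values are invented.

```python
import pandas as pd

headers = ["ROUNDS", "AVG"]   # placeholder column headers
stats_dictionary = {
    "Player A": ["97", "69.606"],
    "Player B": ["73", "69.660"],
}

# Approach used in make_dataframe: players become columns, then transpose.
frame = pd.DataFrame(stats_dictionary, index=headers).T
frame = frame.reset_index().rename(columns={"index": "NAME"})

# Alternative: build one row per player directly.
alt = pd.DataFrame.from_dict(stats_dictionary, orient="index", columns=headers)
alt = alt.reset_index().rename(columns={"index": "NAME"})

print(frame)
print(alt)
```

Both versions require each player's list to have exactly as many items as there are headers, so a wrong `categories` value surfaces here as an error.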

### Create the 'df_total' dataframe with everything merged together

In [15]:
# Loop through the selected seasons and build one merged dataframe per season.
# All of the data cleaning and preprocessing happens in the next couple of code blocks.
# Note: range(2010, 2011) covers only the 2010 season; use range(2010, 2018) for 2010-2017.

years = [str(i) for i in range(2010, 2011)]

for year in years:
    print(year)
    #Fedex cup points
    fcp = make_dataframe("https://www.pgatour.com/stats/stat.02671.{}.html".format(year), 6)[['NAME', 'POINTS']]
    #Top 10's and wins
    top10 = make_dataframe("https://www.pgatour.com/stats/stat.138.{}.html".format(year), 5)[['NAME', 'TOP 10', '1ST']]

    #Scoring statistics, keep rounds from this page as it most accurately reflects total rounds player completed in season.
    scoring = make_dataframe("https://www.pgatour.com/stats/stat.120.{}.html".format(year), 5)[['NAME', 'ROUNDS', 'AVG']]
    scoring = scoring.rename(columns={'AVG':'SCORING'})

    #Driving Distance
    drivedistance = make_dataframe("https://www.pgatour.com/stats/stat.101.{}.html".format(year), 4)[['NAME', 'AVG.']]
    #Rename Columns
    drivedistance = drivedistance.rename(columns = {'AVG.':'DRIVE_DISTANCE'})

    # SG around the green (scraped here but not added to data_frames below, so it is not merged)
    sg_around_green = make_dataframe("https://www.pgatour.com/stats/stat.02569.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    sg_around_green = sg_around_green.rename(columns = {'AVERAGE':'SG_AROUND_GREEN'})
    
    #Driving Accuracy
    driveacc = make_dataframe("https://www.pgatour.com/stats/stat.102.{}.html".format(year), 4)[['NAME', '%']]
    #Change column name from % to FWY %
    driveacc = driveacc.rename(columns = {'%': "FWY_%"})

    #Greens in Regulation.
    gir = make_dataframe("https://www.pgatour.com/stats/stat.103.{}.html".format(year), 5)[['NAME', '%']]
    #Change column name from % to GIR %
    gir = gir.rename(columns = {'%': "GIR_%"})

    #Strokes gained putting
    sg_putting = make_dataframe("https://www.pgatour.com/stats/stat.02564.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    #Change name of average column
    sg_putting = sg_putting.rename(columns = {'AVERAGE': 'SG_P'})

    #Strokes gained tee to green
    sg_teetogreen = make_dataframe("https://www.pgatour.com/stats/stat.02674.{}.html".format(year), 6)[['NAME', 'AVERAGE']]
    #Change name of average column
    sg_teetogreen = sg_teetogreen.rename(columns = {'AVERAGE' : 'SG_TTG'})

    #sg total
    sg_total = make_dataframe("https://www.pgatour.com/stats/stat.02675.{}.html".format(year), 6)[['NAME', 'AVERAGE']]
    sg_total = sg_total.rename(columns = {'AVERAGE':'SG_T'})
   
    #sg Approach the green
    sg_approach = make_dataframe("https://www.pgatour.com/stats/stat.02568.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    sg_approach = sg_approach.rename(columns = {'AVERAGE':'SG_ATG'})
    
    #sg Off the Tee
    sg_ott = make_dataframe("https://www.pgatour.com/stats/stat.02567.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    sg_ott = sg_ott.rename(columns = {'AVERAGE':'SG_OTT'})
    
    


    #Get Dataframes into list.
    data_frames = [drivedistance, driveacc, gir, sg_putting, sg_teetogreen, sg_total,sg_approach, sg_ott]
    
    #Merge all Dataframes together, starting from the scoring dataframe
    df_one = scoring
    for df in data_frames:
        df_one = pd.merge(df_one, df, on='NAME')

    #Merge FedEx Cup points
    df_one = pd.merge(df_one, fcp, how='outer', on='NAME')
    #Merge top 10's
    df_one = pd.merge(df_one, top10, how='outer', on='NAME')
    
    #Only keep players whose scoring average isn't null.
    #The .copy() avoids a SettingWithCopyWarning when the Year column is added below.
    df_one = df_one.loc[df_one['SCORING'].notnull()].copy()
    
    #Add year column
    df_one['Year'] = year
    
    #Concat dataframe to overall dataframe
    if year == years[0]:
        #First season in the range: start the overall dataframe
        df_total = df_one
    else:
        df_total = pd.concat([df_total, df_one], axis=0)



# Now save the file in a sqlite3 database

#Load sqlite package
#import sqlite3 as db
#Create connect object with example db. A new file will be created.
#conn = db.connect('pgatour_raw.db')

#Create cursor to perform actions on db.
#c = conn.cursor()

#df_total.to_sql("pgatour_stats_raw", conn, if_exists='replace')

#conn.close()

2010
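
The merge-and-concatenate logic in the loop is easier to follow on small frames. The sketch below uses invented players and values. It shows why the inner merges keep only players present on every page, while the outer merges leave gaps (NaN) for players missing from the joined page. It also collects per-season frames in a list before a single `pd.concat`, which is a common alternative and not what the cell above does.

```python
from functools import reduce
import pandas as pd

def build_season(year):
    """Build a tiny merged frame for one season from placeholder data."""
    scoring = pd.DataFrame({"NAME": ["A", "B", "C"],
                            "SCORING": ["69.6", "69.7", "70.1"]})
    gir = pd.DataFrame({"NAME": ["A", "B"], "GIR_%": ["69.36", "68.29"]})
    top10 = pd.DataFrame({"NAME": ["A"], "TOP 10": ["11"]})

    # Inner merges (the default): player C is dropped, not on the GIR page.
    merged = reduce(lambda left, right: pd.merge(left, right, on="NAME"),
                    [scoring, gir])
    # Outer merge: player B is kept, with NaN for TOP 10.
    merged = pd.merge(merged, top10, how="outer", on="NAME")
    merged["Year"] = str(year)
    return merged

# Collect one frame per season, then concatenate once.
frames = [build_season(year) for year in (2010, 2011)]
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
```

Passing `ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each season's row numbers.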


In [17]:
df_total.head(10)

| | NAME | ROUNDS | SCORING | DRIVE_DISTANCE | FWY_% | GIR_% | SG_P | SG_TTG | SG_T | SG_ATG | SG_OTT | POINTS | TOP 10 | 1ST | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Matt Kuchar | 97 | 69.606 | 286.9 | 67.89 | 69.36 | 0.648 | 0.827 | 1.461 | 0.336 | 0.158 | 2728 | 11 | 1.0 | 2010 |
| 1 | Steve Stricker | 73 | 69.66 | 282.9 | 68.5 | 68.29 | 0.437 | 1.383 | 1.818 | 0.773 | 0.191 | 2028 | 9 | 2.0 | 2010 |
| 2 | Retief Goosen | 75 | 69.718 | 291.4 | 64.79 | 65.96 | 0.679 | 0.917 | 1.598 | 0.185 | 0.337 | 1360 | 10 | | 2010 |
| 3 | Paul Casey | 64 | 69.72 | 294.2 | 61.31 | 68.68 | 0.812 | 0.587 | 1.411 | 0.483 | 0.215 | 2250 | 7 | | 2010 |
| 4 | Jim Furyk | 76 | 69.828 | 276.0 | 71.01 | 67.12 | 0.402 | 1.159 | 1.564 | 0.641 | 0.15 | 2980 | 7 | 3.0 | 2010 |
| 5 | Ernie Els | 72 | 69.843 | 288.4 | 60.16 | 67.86 | 0.33 | 0.992 | 1.322 | 0.735 | 0.215 | 1438 | 7 | 2.0 | 2010 |
| 6 | Luke Donald | 71 | 69.85 | 277.0 | 62.36 | 65.28 | 0.87 | 0.619 | 1.493 | 0.661 | -0.506 | 2700 | 7 | | 2010 |
| 7 | Justin Rose | 78 | 69.885 | 287.8 | 65.17 | 66.31 | 0.243 | 0.952 | 1.195 | 0.168 | 0.338 | 718 | 4 | 2.0 | 2010 |
| 8 | Bo Van Pelt | 104 | 69.955 | 292.0 | 65.23 | 69.23 | 0.098 | 1.091 | 1.192 | 0.26 | 0.724 | 445 | 8 | | 2010 |
| 9 | Phil Mickelson | 76 | 69.966 | 299.1 | 52.66 | 65.13 | -0.147 | 1.151 | 1.001 | 0.738 | 0.185 | 843 | 6 | 1.0 | 2010 |


In [16]:
df_total.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 192 entries, 0 to 191
Data columns (total 15 columns):
NAME              192 non-null object
ROUNDS            192 non-null object
SCORING           192 non-null object
DRIVE_DISTANCE    192 non-null object
FWY_%             192 non-null object
GIR_%             192 non-null object
SG_P              192 non-null object
SG_TTG            192 non-null object
SG_T              192 non-null object
SG_ATG            192 non-null object
SG_OTT            192 non-null object
POINTS            192 non-null object
TOP 10            165 non-null object
1ST               165 non-null object
Year              192 non-null object
dtypes: object(15)
memory usage: 24.0+ KB
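
Every column above is reported as `object`, meaning the scraped values are still strings. Below is a minimal sketch of converting them to numbers, assuming some values may carry thousands separators or be missing. The column names match the dataframe above but the values are invented, and this is not the cleaning code from the original project.

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["Player A", "Player B"],
    "SCORING": ["69.606", "69.660"],
    "POINTS": ["2,728", None],   # assumed format: comma thousands separator
})

numeric_cols = ["SCORING", "POINTS"]
for col in numeric_cols:
    # Strip commas, then coerce anything unparseable to NaN.
    cleaned = df[col].str.replace(",", "", regex=False)
    df[col] = pd.to_numeric(cleaned, errors="coerce")

print(df.dtypes)
```

Using `errors="coerce"` turns unexpected strings into NaN rather than raising, so it is worth checking the null counts again after the conversion.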


In [9]:
df_total.to_csv('2010_2020_PGA_Stats.csv')
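
One detail worth knowing when reusing the CSV: `to_csv` writes the row index by default, and `read_csv` then loads it back as a column named `Unnamed: 0`. The sketch below demonstrates this with an in-memory buffer and placeholder data, and shows `index=False` as one way to avoid it.

```python
import io
import pandas as pd

df = pd.DataFrame({"NAME": ["Player A", "Player B"], "ROUNDS": [97, 73]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
print(pd.read_csv(with_index).columns.tolist())

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
print(round_trip.columns.tolist())
```

Reading the default file with `pd.read_csv(path, index_col=0)` is another option if the saved index should be restored rather than dropped.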