# CSE 482 Project: An Analysis of PGA TOUR Statistics
### Tyler Rozwadowski

## Data Collection

To begin the analysis, we first need a data set. I initially looked for a premade dataset but could not find one with all the attributes I wanted, so I wrote a web scraper using the requests and BeautifulSoup libraries.

In [608]:
# Import all the modules needed for scraping
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

I will use the following functions to scrape statistics off of the PGA TOUR website.

In [609]:
def get_headers(soup):
    '''
    Select header classes for dataframe
    '''
    headers = []
    
    rounds = soup.find_all(class_="rounds hidden-small hidden-medium")[0].get_text()
    headers.append(rounds)
    
    stat_headers = soup.find_all(class_="col-stat hidden-small hidden-medium")
    for header in stat_headers:
        headers.append(header.get_text())
    
    return headers

def get_players(soup):
    '''
    Gather player names from the specified stats pages
    '''
    player_list = []
    
    players = soup.select("td a")[1:]  # skip the first match; index 0 didn't work (presumably not a player link)
    for player in players:
        player_list.append(player.get_text())
        
    return player_list

def get_stats(soup, categories):
    '''
    Get the stat categories specified
    '''
    stat_list = []
    
    stats = soup.find_all(class_="hidden-small hidden-medium")
    for i in range(0, len(stats)-categories+1, categories):
        tmp = []
        for j in range(categories):
            tmp.append(stats[i+j].get_text())
        stat_list.append(tmp)
        
    return stat_list

def make_dict(players, stats):
    '''Take a list of players and a list of stats, 
        and create a dictionary with the player name as the key,
        and his stats as the values'''
    player_dict = {}
    
    for i, player in enumerate(players):
        player_dict[player] = stats[i]
        
    return player_dict
    
def make_dataframe(url, categories):
    '''Make dataframe to store stats for specific statistics'''
    
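    # Fetch the page and create the soup object
    response = requests.get(url)
    response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    text = response.text
    soup = BeautifulSoup(text, 'lxml')  # arguments: document to parse, parser to use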
    
    headers = get_headers(soup)
    players = get_players(soup)
    stats = get_stats(soup, categories)
    
    stats_dict = make_dict(players, stats)
    frame = pd.DataFrame(stats_dict, index = headers).T #flip the dataframe around
    frame = frame.reset_index()
    frame = frame.rename(index = str, columns = {'index': 'NAME'})
    
    return frame
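
As a small, self-contained illustration of the grouping step in `get_stats`, the sketch below splits a flat list of cell values into fixed-size rows and drops any incomplete trailing group. The function name `chunk_rows` and the sample values are assumptions made for this example; they are not part of the scraper.

```python
def chunk_rows(cells, categories):
    """Group a flat list of cell values into rows of `categories` items.

    Mirrors the range() logic in get_stats: a trailing group with fewer
    than `categories` items is dropped.
    """
    rows = []
    for i in range(0, len(cells) - categories + 1, categories):
        rows.append(cells[i:i + categories])
    return rows


# Hypothetical cell texts: two complete rows of three stats, plus one stray cell
cells = ["64", "68.052", "4355", "81", "69.286", "5612", "stray"]
rows = chunk_rows(cells, 3)
print(rows)
```

This also shows why the `categories` argument must match the number of stat columns on each page: a wrong value would shift every row.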

With the scraping functions defined, we can use them to collect the data.
I selected the statistics that I expect to be most interesting for the analysis.

In [610]:
years = [str(i) for i in range(2009, 2019)]

for year in years:
    print("Collecting data for: " + year)
    
    # FedEx Cup points
    fcp = make_dataframe("https://www.pgatour.com/stats/stat.02671.{}.html".format(year), 6)[['NAME', 'POINTS']]
    
    # Top 10's and wins
    top10 = make_dataframe("https://www.pgatour.com/stats/stat.138.{}.html".format(year), 5)[['NAME', 'TOP 10', '1ST']]

    # Scoring statistics; keep ROUNDS from this page, as it most accurately reflects the total rounds a player completed in the season
    scoring = make_dataframe("https://www.pgatour.com/stats/stat.120.{}.html".format(year), 5)[['NAME', 'ROUNDS', 'AVG']]
    scoring = scoring.rename(columns={'AVG':'SCORING'})
    
    # Total Money
    totalmoney = make_dataframe("https://www.pgatour.com/stats/stat.109.{}.html".format(year), 3)[['NAME', 'MONEY']]
    
    # Driving Distance
    drivedistance = make_dataframe("https://www.pgatour.com/stats/stat.101.{}.html".format(year), 4)[['NAME', 'AVG.']]
    drivedistance = drivedistance.rename(columns = {'AVG.':'DRIVE_DISTANCE'})

    # Driving Accuracy
    driveacc = make_dataframe("https://www.pgatour.com/stats/stat.102.{}.html".format(year), 4)[['NAME', '%']]
    driveacc = driveacc.rename(columns = {'%': "FWY_%"})

    # Greens in Regulation
    gir = make_dataframe("https://www.pgatour.com/stats/stat.103.{}.html".format(year), 5)[['NAME', '%']]
    gir = gir.rename(columns = {'%': "GIR_%"})
    
    # Scrambling Percentage
    scrambling = make_dataframe("https://www.pgatour.com/stats/stat.130.{}.html".format(year), 4)[['NAME', '%']]
    scrambling = scrambling.rename(columns = {'%': 'SCRAMBLING_%'})
    
    # Strokes gained off the tee
    sg_tee = make_dataframe("https://www.pgatour.com/stats/stat.02567.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    sg_tee = sg_tee.rename(columns = {'AVERAGE': 'SG_TEE'})
    
    # Strokes gained approach shots
    sg_approach = make_dataframe("https://www.pgatour.com/stats/stat.02568.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    sg_approach = sg_approach.rename(columns = {'AVERAGE': 'SG_APPROACH'})
    
    # Strokes gained scrambling
    sg_scrambling = make_dataframe("https://www.pgatour.com/stats/stat.02569.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    sg_scrambling = sg_scrambling.rename(columns = {'AVERAGE': 'SG_SCRAMBLE'})

    # Strokes gained putting
    sg_putting = make_dataframe("https://www.pgatour.com/stats/stat.02564.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    sg_putting = sg_putting.rename(columns = {'AVERAGE': 'SG_PUTTING'})

    # Strokes gained total
    sg_total = make_dataframe("https://www.pgatour.com/stats/stat.02675.{}.html".format(year), 6)[['NAME', 'AVERAGE']]
    sg_total = sg_total.rename(columns = {'AVERAGE':'SG_TOTAL'})
    
    # Collect the per-stat dataframes into a list
    data_frames = [drivedistance, driveacc, scrambling, gir, sg_tee, sg_approach, sg_scrambling, sg_putting, sg_total]
    
    # Inner-merge every stat dataframe onto the scoring dataframe,
    # keeping only players who appear on all of these pages
    df_one = scoring
    for df in data_frames:
        df_one = pd.merge(df_one, df, on='NAME')
        
    # Outer-merge FedEx Cup points
    df_one = pd.merge(df_one, fcp, how='outer', on='NAME')
    # Outer-merge top 10s and wins
    df_one = pd.merge(df_one, top10, how='outer', on='NAME')
    # Outer-merge total money
    df_one = pd.merge(df_one, totalmoney, how='outer', on='NAME')
    
    # Keep only players whose scoring average isn't null
    # (.copy() avoids assigning the Year column to a slice of another frame)
    df_one = df_one.loc[df_one['SCORING'].notnull()].copy()
    
    # Add year column
    df_one['Year'] = year
    
    # Append this year's dataframe to the overall dataframe
    if year == '2009':
        df_raw = df_one
    else:
        df_raw = pd.concat([df_raw, df_one], axis=0)

Collecting data for: 2009
Collecting data for: 2010
Collecting data for: 2011
Collecting data for: 2012
Collecting data for: 2013
Collecting data for: 2014
Collecting data for: 2015
Collecting data for: 2016
Collecting data for: 2017
Collecting data for: 2018
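
The loop above combines pages with two kinds of merge: the default inner merge, which keeps only players present on both pages, and `how='outer'`, which keeps every player and fills the gaps with NaN. A minimal sketch with made-up players and values (the frames `scoring_demo` and `top10_demo` are hypothetical):

```python
import pandas as pd

# Hypothetical mini versions of two stats pages
scoring_demo = pd.DataFrame({"NAME": ["Player A", "Player B"],
                             "SCORING": ["69.1", "70.4"]})
top10_demo = pd.DataFrame({"NAME": ["Player A", "Player C"],
                           "TOP 10": ["5", "2"]})

# Inner merge (the default): only names found in both frames survive
inner = pd.merge(scoring_demo, top10_demo, on="NAME")

# Outer merge: every name survives; missing cells become NaN
outer = pd.merge(scoring_demo, top10_demo, how="outer", on="NAME")

# Same filter as the notebook: drop rows without a scoring average
kept = outer.loc[outer["SCORING"].notnull()]

print(inner)
print(outer)
print(kept)
```

In this sketch, Player B survives the filter with a missing TOP 10 value. This is presumably how the nulls in the TOP 10 and 1ST columns of the collected data arise.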


In [611]:
df_raw.head()

Unnamed: 0,NAME,ROUNDS,SCORING,DRIVE_DISTANCE,FWY_%,SCRAMBLING_%,GIR_%,SG_TEE,SG_APPROACH,SG_SCRAMBLE,SG_PUTTING,SG_TOTAL,POINTS,TOP 10,1ST,MONEY,Year
0,Tiger Woods,64,68.052,298.4,64.29,68.18,68.46,0.335,1.398,0.579,0.877,3.189,4000,14,6.0,"$10,508,163",2009
1,Steve Stricker,81,69.286,286.1,66.82,66.46,66.67,0.275,1.018,0.327,0.207,1.828,2750,11,3.0,"$6,332,636",2009
2,Jim Furyk,86,69.477,279.9,69.66,64.08,65.53,-0.021,0.557,0.439,0.715,1.69,2438,11,,"$3,946,515",2009
3,Zach Johnson,94,69.601,281.2,71.47,62.1,67.81,0.253,0.844,0.068,0.38,1.545,2073,9,2.0,"$4,714,813",2009
4,Tim Clark,81,69.658,280.1,74.06,62.93,66.95,0.125,0.773,0.087,0.276,1.261,1395,5,,"$2,235,105",2009


In [612]:
print(df_raw.shape)

(1862, 17)


In [613]:
# output the raw data to a CSV file
df_raw.to_csv(r'raw_data.csv')

## Data Cleaning

Now we can go through a sequence of steps to preprocess or "clean" the collected data to prepare it for analysis.

In [614]:
df_total = df_raw.copy(deep=True)
df_total.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1862 entries, 0 to 192
Data columns (total 17 columns):
NAME              1862 non-null object
ROUNDS            1862 non-null object
SCORING           1862 non-null object
DRIVE_DISTANCE    1862 non-null object
FWY_%             1862 non-null object
SCRAMBLING_%      1862 non-null object
GIR_%             1862 non-null object
SG_TEE            1862 non-null object
SG_APPROACH       1862 non-null object
SG_SCRAMBLE       1862 non-null object
SG_PUTTING        1862 non-null object
SG_TOTAL          1862 non-null object
POINTS            1858 non-null object
TOP 10            1534 non-null object
1ST               1534 non-null object
MONEY             1858 non-null object
Year              1862 non-null object
dtypes: object(17)
memory usage: 261.8+ KB


In [615]:
df_total.describe(include=['O'])

Unnamed: 0,NAME,ROUNDS,SCORING,DRIVE_DISTANCE,FWY_%,SCRAMBLING_%,GIR_%,SG_TEE,SG_APPROACH,SG_SCRAMBLE,SG_PUTTING,SG_TOTAL,POINTS,TOP 10,1ST,MONEY,Year
count,1862,1862,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1858,1534,1534.0,1858,1862
unique,461,72,1315.0,408.0,1198.0,978.0,837.0,1013.0,1031.0,771.0,1003.0,1272.0,1029,14,7.0,1857,10
top,Sean O'Hair,81,70.92,288.1,59.93,60.0,66.67,0.316,0.202,0.047,0.0,0.298,565,1,,"$1,168,073",2018
freq,10,60,5.0,18.0,5.0,9.0,25.0,6.0,7.0,10.0,7.0,6.0,7,397,1197.0,2,193


In [616]:
# Remove dollar signs and commas from the MONEY column
# (regex=False treats '$' as a literal character rather than a regex anchor)
df_total['MONEY'] = df_total['MONEY'].str.replace(',', '', regex=False)
df_total['MONEY'] = df_total['MONEY'].str.replace('$', '', regex=False)

# Remove commas from the POINTS column
df_total['POINTS'] = df_total['POINTS'].str.replace(',', '', regex=False)

df_total.head()

Unnamed: 0,NAME,ROUNDS,SCORING,DRIVE_DISTANCE,FWY_%,SCRAMBLING_%,GIR_%,SG_TEE,SG_APPROACH,SG_SCRAMBLE,SG_PUTTING,SG_TOTAL,POINTS,TOP 10,1ST,MONEY,Year
0,Tiger Woods,64,68.052,298.4,64.29,68.18,68.46,0.335,1.398,0.579,0.877,3.189,4000,14,6.0,10508163,2009
1,Steve Stricker,81,69.286,286.1,66.82,66.46,66.67,0.275,1.018,0.327,0.207,1.828,2750,11,3.0,6332636,2009
2,Jim Furyk,86,69.477,279.9,69.66,64.08,65.53,-0.021,0.557,0.439,0.715,1.69,2438,11,,3946515,2009
3,Zach Johnson,94,69.601,281.2,71.47,62.1,67.81,0.253,0.844,0.068,0.38,1.545,2073,9,2.0,4714813,2009
4,Tim Clark,81,69.658,280.1,74.06,62.93,66.95,0.125,0.773,0.087,0.276,1.261,1395,5,,2235105,2009
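
One detail worth a sketch: `$` is a regular-expression metacharacter (end of string), and whether `Series.str.replace` treats its pattern as a regex by default has differed between pandas versions. The plain-Python example below, using only the standard `re` module, shows the difference between the unescaped and the escaped pattern. `clean_money` is a hypothetical helper, not a function from the notebook.

```python
import re

raw = "$10,508,163"

# Unescaped '$' is an end-of-string anchor, so nothing visible is removed
unescaped = re.sub("$", "", raw)

# Escaped '\$' matches the literal dollar sign
escaped = re.sub(r"\$", "", raw)


def clean_money(text):
    """Strip '$' and ',' with literal (non-regex) replacement and return an int."""
    return int(text.replace("$", "").replace(",", ""))


print(unescaped)
print(escaped)
print(clean_money(raw))
```

Requesting literal matching explicitly avoids depending on the library's default.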


In [617]:
# Check data for missing values
missing_vals = df_total.isnull().sum()
print("The data has {} null values".format(missing_vals.sum()))

The data has 664 null values
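
The total of 664 can be checked by hand from the `info()` output above: four columns have fewer than 1862 non-null entries. A quick arithmetic check, with the counts copied from that output:

```python
total_rows = 1862

# Non-null counts reported by df_total.info() for the incomplete columns
non_null = {"POINTS": 1858, "TOP 10": 1534, "1ST": 1534, "MONEY": 1858}

# Missing values per column, then the grand total
missing = {col: total_rows - count for col, count in non_null.items()}
total_missing = sum(missing.values())

print(missing)
print(total_missing)
```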


In [618]:
df_total.head()

Unnamed: 0,NAME,ROUNDS,SCORING,DRIVE_DISTANCE,FWY_%,SCRAMBLING_%,GIR_%,SG_TEE,SG_APPROACH,SG_SCRAMBLE,SG_PUTTING,SG_TOTAL,POINTS,TOP 10,1ST,MONEY,Year
0,Tiger Woods,64,68.052,298.4,64.29,68.18,68.46,0.335,1.398,0.579,0.877,3.189,4000,14,6.0,10508163,2009
1,Steve Stricker,81,69.286,286.1,66.82,66.46,66.67,0.275,1.018,0.327,0.207,1.828,2750,11,3.0,6332636,2009
2,Jim Furyk,86,69.477,279.9,69.66,64.08,65.53,-0.021,0.557,0.439,0.715,1.69,2438,11,,3946515,2009
3,Zach Johnson,94,69.601,281.2,71.47,62.1,67.81,0.253,0.844,0.068,0.38,1.545,2073,9,2.0,4714813,2009
4,Tim Clark,81,69.658,280.1,74.06,62.93,66.95,0.125,0.773,0.087,0.276,1.261,1395,5,,2235105,2009


In [619]:
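# Impute missing values with 0: fillna covers the NaNs (in POINTS, TOP 10, 1ST and MONEY),
# and replace covers the empty strings in the 1ST column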
df_total.fillna(0, inplace = True)
df_total.replace('', '0', inplace = True)

missing_vals = df_total.isnull().sum().sum()
print("The data has {} null values".format(missing_vals))

The data has 0 null values


In [620]:
df_total.head()

Unnamed: 0,NAME,ROUNDS,SCORING,DRIVE_DISTANCE,FWY_%,SCRAMBLING_%,GIR_%,SG_TEE,SG_APPROACH,SG_SCRAMBLE,SG_PUTTING,SG_TOTAL,POINTS,TOP 10,1ST,MONEY,Year
0,Tiger Woods,64,68.052,298.4,64.29,68.18,68.46,0.335,1.398,0.579,0.877,3.189,4000,14,6,10508163,2009
1,Steve Stricker,81,69.286,286.1,66.82,66.46,66.67,0.275,1.018,0.327,0.207,1.828,2750,11,3,6332636,2009
2,Jim Furyk,86,69.477,279.9,69.66,64.08,65.53,-0.021,0.557,0.439,0.715,1.69,2438,11,0,3946515,2009
3,Zach Johnson,94,69.601,281.2,71.47,62.1,67.81,0.253,0.844,0.068,0.38,1.545,2073,9,2,4714813,2009
4,Tim Clark,81,69.658,280.1,74.06,62.93,66.95,0.125,0.773,0.087,0.276,1.261,1395,5,0,2235105,2009


In [621]:
# Make columns numeric
df_total[['ROUNDS', 'POINTS', 'MONEY', 'TOP 10', '1ST']] = df_total[['ROUNDS', 'POINTS', 'MONEY', 'TOP 10', '1ST']].apply(pd.to_numeric, downcast='integer')
df_total[['SCORING', 'DRIVE_DISTANCE', 'FWY_%', 'SCRAMBLING_%', 'GIR_%', 'SG_TEE', 'SG_APPROACH', 'SG_SCRAMBLE', 'SG_PUTTING', 'SG_TOTAL']] = df_total[['SCORING', 'DRIVE_DISTANCE', 'FWY_%', 'SCRAMBLING_%', 'GIR_%', 'SG_TEE', 'SG_APPROACH', 'SG_SCRAMBLE', 'SG_PUTTING', 'SG_TOTAL']].apply(pd.to_numeric)

df_total.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1862 entries, 0 to 192
Data columns (total 17 columns):
NAME              1862 non-null object
ROUNDS            1862 non-null int8
SCORING           1862 non-null float64
DRIVE_DISTANCE    1862 non-null float64
FWY_%             1862 non-null float64
SCRAMBLING_%      1862 non-null float64
GIR_%             1862 non-null float64
SG_TEE            1862 non-null float64
SG_APPROACH       1862 non-null float64
SG_SCRAMBLE       1862 non-null float64
SG_PUTTING        1862 non-null float64
SG_TOTAL          1862 non-null float64
POINTS            1862 non-null int16
TOP 10            1862 non-null int8
1ST               1862 non-null int8
MONEY             1862 non-null int32
Year              1862 non-null object
dtypes: float64(10), int16(1), int32(1), int8(3), object(2)
memory usage: 205.5+ KB
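
The mix of `int8`, `int16` and `int32` comes from `downcast='integer'`, which picks the smallest integer type that can hold a column's values. The sketch below reproduces that choice for signed types in plain Python, using the column minima and maxima reported by `describe()` in the next section. `smallest_signed_int` is a hypothetical helper, and pandas' actual implementation may differ in its details.

```python
def smallest_signed_int(lo, hi):
    """Return the name of the narrowest signed integer type holding [lo, hi]."""
    for bits in (8, 16, 32, 64):
        if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return "int{}".format(bits)
    raise OverflowError("range does not fit in 64 bits")


# (min, max) per column, taken from the describe() output (MONEY max is rounded there)
ranges = {
    "ROUNDS": (45, 120),
    "POINTS": (0, 4750),
    "TOP 10": (0, 15),
    "1ST": (0, 6),
    "MONEY": (0, 12030460),
}
dtypes = {col: smallest_signed_int(lo, hi) for col, (lo, hi) in ranges.items()}
print(dtypes)
```

The result agrees with the dtypes listed above: only POINTS and MONEY exceed the `int8` limit of 127.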


In [622]:
df_total.head()

Unnamed: 0,NAME,ROUNDS,SCORING,DRIVE_DISTANCE,FWY_%,SCRAMBLING_%,GIR_%,SG_TEE,SG_APPROACH,SG_SCRAMBLE,SG_PUTTING,SG_TOTAL,POINTS,TOP 10,1ST,MONEY,Year
0,Tiger Woods,64,68.052,298.4,64.29,68.18,68.46,0.335,1.398,0.579,0.877,3.189,4000,14,6,10508163,2009
1,Steve Stricker,81,69.286,286.1,66.82,66.46,66.67,0.275,1.018,0.327,0.207,1.828,2750,11,3,6332636,2009
2,Jim Furyk,86,69.477,279.9,69.66,64.08,65.53,-0.021,0.557,0.439,0.715,1.69,2438,11,0,3946515,2009
3,Zach Johnson,94,69.601,281.2,71.47,62.1,67.81,0.253,0.844,0.068,0.38,1.545,2073,9,2,4714813,2009
4,Tim Clark,81,69.658,280.1,74.06,62.93,66.95,0.125,0.773,0.087,0.276,1.261,1395,5,0,2235105,2009


In [623]:
# Output the cleaned data to a CSV file
df_total.to_csv(r'cleaned_data.csv')

## Preprocessing for Association Mining

With the data collected and cleaned, we can begin the analysis.

My goal is to find associations between different statistics, which will show which statistical categories are most strongly related to one another. To do this, I discretize each statistic into quartile bins to "rank" the values. After creating the item sets, I run the Apriori algorithm to find frequent item sets.

For each row in the dataframe, I create one item set. Each data point is transformed from a ratio value to an ordinal value based on its quartile, with the quartiles computed separately for each year.

In [624]:
df_total.describe()

Unnamed: 0,ROUNDS,SCORING,DRIVE_DISTANCE,FWY_%,SCRAMBLING_%,GIR_%,SG_TEE,SG_APPROACH,SG_SCRAMBLE,SG_PUTTING,SG_TOTAL,POINTS,TOP 10,1ST,MONEY
count,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0,1862.0
mean,78.708378,70.918187,290.526155,61.642368,58.14833,65.63638,0.036876,0.064421,0.020159,0.025243,0.146905,673.451128,2.667562,0.235231,1468095.0
std,14.233424,0.699753,8.924088,5.126772,3.405698,2.727639,0.379928,0.381287,0.222911,0.350127,0.69919,520.246497,2.437529,0.588703,1397798.0
min,45.0,68.052,259.0,43.02,44.01,52.35,-1.717,-1.68,-0.93,-1.475,-3.209,0.0,0.0,0.0,0.0
25%,69.0,70.489,284.6,58.06,55.92,63.8325,-0.191,-0.17975,-0.125,-0.193,-0.2615,314.0,1.0,0.0,558328.8
50%,80.0,70.8955,290.2,61.625,58.275,65.77,0.0565,0.081,0.0225,0.04,0.1555,543.0,2.0,0.0,1033102.0
75%,89.0,71.34175,296.2,65.15,60.47,67.5,0.2935,0.31275,0.17575,0.26425,0.5685,919.0,4.0,0.0,1877568.0
max,120.0,74.4,319.7,76.88,69.33,73.52,1.485,1.533,0.728,1.13,3.189,4750.0,15.0,6.0,12030460.0


In [625]:
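# Quartiles computed over the whole dataset (all years combined);
# numeric_only=True skips the NAME and Year text columns
df_total.quantile([.25, .5, .75], axis=0, numeric_only=True)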

Unnamed: 0,ROUNDS,SCORING,DRIVE_DISTANCE,FWY_%,SCRAMBLING_%,GIR_%,SG_TEE,SG_APPROACH,SG_SCRAMBLE,SG_PUTTING,SG_TOTAL,POINTS,TOP 10,1ST,MONEY
0.25,69.0,70.489,284.6,58.06,55.92,63.8325,-0.191,-0.17975,-0.125,-0.193,-0.2615,314.0,1.0,0.0,558328.75
0.5,80.0,70.8955,290.2,61.625,58.275,65.77,0.0565,0.081,0.0225,0.04,0.1555,543.0,2.0,0.0,1033102.5
0.75,89.0,71.34175,296.2,65.15,60.47,67.5,0.2935,0.31275,0.17575,0.26425,0.5685,919.0,4.0,0.0,1877567.5
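
To make the discretization concrete, here is a minimal sketch in plain Python. It computes quartile cut points with `statistics.quantiles(..., method='inclusive')`, which interpolates linearly between data points and, to my understanding, matches the default behaviour of pandas' `quantile`. It then labels values in the same way as the item-set functions below. The sample scoring averages and the name `quartile_label` are assumptions for illustration.

```python
from statistics import quantiles


def quartile_label(value, q1, q2, q3):
    """Map a value to one of four quartile labels (upper bounds inclusive)."""
    if value <= q1:
        return "below_q1"
    if value <= q2:
        return "between_q1_q2"
    if value <= q3:
        return "between_q2_q3"
    return "above_q3"


# Hypothetical scoring averages for one season
scores = [68.0, 69.0, 70.0, 71.0, 72.0]

# Three cut points: 25th, 50th and 75th percentiles
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

labels = [quartile_label(s, q1, q2, q3) + "_scoring_average" for s in scores]
print((q1, q2, q3))
print(labels)
```

Because the comparison uses `<=`, a value equal to a cut point falls into the lower bin. Note also that a lower scoring average is better in golf, so `below_q1` marks the best scorers.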


The following functions create the item sets used for association mining.

In [626]:
def get_itemtext(row, year_quantiles, stat, text):
    '''
    Returns a string indicating which quartile the stat is in
    '''
    result = ""
    
    if row[stat] <= year_quantiles[stat][0.25]:
        result = "below_q1_"
    elif row[stat] <= year_quantiles[stat][0.50]:
        result = "between_q1_q2_"
    elif row[stat] <= year_quantiles[stat][0.75]:
        result = "between_q2_q3_"
    else:
        result = "above_q3_"
        
    return result + text
    

def generate_itemset(row, year_quantiles):
    '''
    This function generates an item set that can be fed to the Apriori algorithm.
    It creates a list of 'items' based on which quartile each stat falls in.
    '''
    row_itemset = []
    
    # rounds
    stat = 'ROUNDS'
    text = 'rounds_played'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # scoring
    stat = 'SCORING'
    text = 'scoring_average'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # drive distance
    stat = 'DRIVE_DISTANCE'
    text = 'driving_distance'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # fwy %
    stat = 'FWY_%'
    text = 'fairway_percentage'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # scrambling %
    stat = 'SCRAMBLING_%'
    text = 'scrambling_percentage'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # gir %
    stat = 'GIR_%'
    text = 'greens_percentage'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # strokes gained off the tee
    stat = 'SG_TEE'
    text = 'strokesgained_tee'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # strokes gained approach
    stat = 'SG_APPROACH'
    text = 'strokesgained_approach'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # strokes gained scrambling
    stat = 'SG_SCRAMBLE'
    text = 'strokesgained_scrambling'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # strokes gained putting
    stat = 'SG_PUTTING'
    text = 'strokesgained_putting'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # strokes gained total
    stat = 'SG_TOTAL'
    text = 'strokesgained_total'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # points
    stat = 'POINTS'
    text = 'points_earned'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # top 10
    stat = 'TOP 10'
    text = 'top10_finished'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # first place finishes
    stat = '1ST'
    text = '1st_place_finishes'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # money earned
    stat = 'MONEY'
    text = 'money_earned'
    row_itemset.append( get_itemtext(row, year_quantiles, stat, text) )
    
    # return the resulting item set
    return row_itemset
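
Since every block in `generate_itemset` follows the same pattern, the function could also be written as a loop over (column, label) pairs. The following is a sketch of that alternative, not the code used to produce the results. It works on any mapping-like row and is shown here with plain dictionaries, using the overall quartiles from the table above for two columns (the notebook itself uses per-year quartiles). The names `STAT_LABELS`, `item_text` and `generate_itemset_compact` are hypothetical.

```python
# (column name, item suffix) pairs; extend with the remaining columns as needed
STAT_LABELS = [
    ("ROUNDS", "rounds_played"),
    ("SCORING", "scoring_average"),
]


def item_text(row, quantiles, stat, text):
    """Return the quartile prefix for row[stat] followed by the item suffix."""
    value = row[stat]
    if value <= quantiles[stat][0.25]:
        prefix = "below_q1_"
    elif value <= quantiles[stat][0.50]:
        prefix = "between_q1_q2_"
    elif value <= quantiles[stat][0.75]:
        prefix = "between_q2_q3_"
    else:
        prefix = "above_q3_"
    return prefix + text


def generate_itemset_compact(row, quantiles, stat_labels=STAT_LABELS):
    """Build one item per (column, suffix) pair for a single player-season."""
    return [item_text(row, quantiles, stat, text) for stat, text in stat_labels]


# Quartiles and one row as plain dicts (illustrative values)
demo_quantiles = {
    "ROUNDS": {0.25: 69.0, 0.50: 80.0, 0.75: 89.0},
    "SCORING": {0.25: 70.489, 0.50: 70.8955, 0.75: 71.34175},
}
demo_row = {"ROUNDS": 64, "SCORING": 68.052}
itemset = generate_itemset_compact(demo_row, demo_quantiles)
print(itemset)
```

Keeping the column-to-label mapping in one list means adding or renaming a statistic touches a single line.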

Create all the item sets that will be fed to the Apriori algorithm:

In [627]:
# list to hold all of the item sets
all_itemsets = [] 

# for each year, compute that year's quartiles
for year in df_total['Year'].unique():
    
    # select that year's rows
    year_data = df_total.loc[df_total['Year'] == year]
    # numeric_only=True skips the NAME and Year text columns
    year_quantiles = year_data.quantile([.25, .5, .75], axis=0, numeric_only=True)
    
    for index, row in year_data.iterrows():
        # create an itemset for each row
        row_itemset = generate_itemset(row, year_quantiles)
        all_itemsets.append(row_itemset)
        
print(all_itemsets)

[['below_q1_rounds_played', 'below_q1_scoring_average', 'above_q3_driving_distance', 'between_q2_q3_fairway_percentage', 'above_q3_scrambling_percentage', 'above_q3_greens_percentage', 'above_q3_strokesgained_tee', 'above_q3_strokesgained_approach', 'above_q3_strokesgained_scrambling', 'above_q3_strokesgained_putting', 'above_q3_strokesgained_total', 'above_q3_points_earned', 'above_q3_top10_finished', 'above_q3_1st_place_finishes', 'above_q3_money_earned'], ['between_q1_q2_rounds_played', 'below_q1_scoring_average', 'between_q1_q2_driving_distance', 'between_q2_q3_fairway_percentage', 'above_q3_scrambling_percentage', 'between_q2_q3_greens_percentage', 'between_q2_q3_strokesgained_tee', 'above_q3_strokesgained_approach', 'above_q3_strokesgained_scrambling', 'between_q2_q3_strokesgained_putting', 'above_q3_strokesgained_total', 'above_q3_points_earned', 'above_q3_top10_finished', 'above_q3_1st_place_finishes', 'above_q3_money_earned'], ['between_q2_q3_rounds_played', 'below_q1_scoring_

In [628]:
# Now output that list to a text file
import csv

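# newline="" lets the csv module control line endings, as its documentation recommends
with open("stats.data", "w", newline="") as f: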
    wr = csv.writer(f, delimiter=' ')
    wr.writerows(all_itemsets)
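
The resulting file has one transaction per line with items separated by spaces. A short sketch that writes two made-up item sets to an in-memory buffer instead of a file, to show what the lines look like:

```python
import csv
import io

# Two hypothetical item sets (shortened to two items each)
demo_itemsets = [
    ["below_q1_scoring_average", "above_q3_money_earned"],
    ["above_q3_scoring_average", "below_q1_money_earned"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=" ", lineterminator="\n")
writer.writerows(demo_itemsets)

text = buffer.getvalue()
print(text)
```

Because no item name contains a space or a quote character, the csv writer adds no quoting and each line is a plain space-separated list.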

## Running the Apriori Algorithm

I now have a file that is suitable input for the Apriori algorithm. This will allow us to do association rule mining to determine relationships between different statistics on the PGA TOUR.

To get the most insightful results from the Apriori algorithm, I needed to tune the arguments passed to the executable.

The executable was obtained from http://www.borgelt.net/apriori.html, and its documentation was accessible at http://www.borgelt.net/doc/apriori/apriori.html

Note that the following cells assume a Linux environment, with the Apriori executable in the same directory as the data.

The cell below runs the Apriori algorithm with the following options:
- -s20: minimum support of 20%
- -c35: minimum confidence of 35%
- -m2 -n2: minimum and maximum number of items per rule, both set to 2, so each rule has exactly one antecedent item and one consequent item
- -tr: set the target type to association rules

Since we are looking for statistics that lead to success on the PGA TOUR, we will use the -R option to restrict the mining to rules whose consequents are outcomes we are interested in.

For this purpose, I treat the following stat categories as indicators of success in professional golf:

In [629]:
!cat restrictions2.txt

ante
below_q1_scoring_average cons
above_q3_top10_finished cons
above_q3_points_earned cons
above_q3_1st_place_finishes cons
above_q3_money_earned head


In [630]:
!./apriori -s20 -c35 -m2 -n2 -tr stats.data rules.txt -R restrictions2.txt

./apriori - find frequent item sets with the apriori algorithm
version 6.27 (2017.08.01)        (c) 1996-2017   Christian Borgelt
reading restrictions2.txt ... [5 item(s)] done [0.00s].
reading stats.data ... [58 item(s), 1862 transaction(s)] done [0.00s].
filtering, sorting and recoding items ... [58 item(s)] done [0.00s].
sorting and reducing transactions ... [1847/1862 transaction(s)] done [0.00s].
building transaction tree ... [2810 node(s)] done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing rules.txt ... [30 rule(s)] done [0.00s].


With these parameters, the Apriori algorithm gives us 30 rules of interest:

In [631]:
!cat rules.txt

above_q3_1st_place_finishes <- above_q3_strokesgained_total (24.9731, 40.4301)
above_q3_top10_finished <- above_q3_strokesgained_tee (24.9731, 37.6344)
above_q3_top10_finished <- above_q3_strokesgained_total (24.9731, 63.2258)
above_q3_top10_finished <- above_q3_strokesgained_approach (25.0269, 44.4206)
above_q3_top10_finished <- above_q3_scrambling_percentage (25.0269, 37.9828)
above_q3_points_earned <- between_q2_q3_top10_finished (23.5768, 40.3189)
above_q3_money_earned <- above_q3_rounds_played (23.9527, 39.9103)
above_q3_points_earned <- above_q3_rounds_played (23.9527, 40.3587)
below_q1_scoring_average <- above_q3_rounds_played (23.9527, 37.2197)
above_q3_points_earned <- between_q2_q3_money_earned (24.8657, 40.8207)
above_q3_money_earned <- above_q3_strokesgained_tee (24.9731, 46.0215)
above_q3_points_earned <- above_q3_strokesgained_tee (24.9731, 41.0753)
below_q1_scoring_average <- above_q3_strokesgained_tee (24.9731, 47.7419)
above_q3_money_earned <- above_q3_gre
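
For follow-up analysis in Python, each rule line can be parsed into its parts. The sketch below assumes the format visible in the output above, `consequent <- antecedent (support, confidence)`, with both numbers read as percentages (consistent with the `-s` and `-c` thresholds). `parse_rule` is a hypothetical helper; the three lines are copied from the output.

```python
import re

RULE_PATTERN = re.compile(r"^(\S+) <- (.+) \(([\d.]+), ([\d.]+)\)$")


def parse_rule(line):
    """Split one rule line into consequent, antecedent items, support and confidence."""
    match = RULE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognized rule line: {!r}".format(line))
    consequent, antecedent, support, confidence = match.groups()
    return {
        "consequent": consequent,
        "antecedent": antecedent.split(),
        "support": float(support),
        "confidence": float(confidence),
    }


lines = [
    "above_q3_1st_place_finishes <- above_q3_strokesgained_total (24.9731, 40.4301)",
    "above_q3_top10_finished <- above_q3_strokesgained_tee (24.9731, 37.6344)",
    "above_q3_top10_finished <- above_q3_strokesgained_total (24.9731, 63.2258)",
]
rules = [parse_rule(line) for line in lines]

# Rank rules by confidence, strongest first
rules.sort(key=lambda rule: rule["confidence"], reverse=True)
best = rules[0]
print(best)

# A support of 24.9731% of the 1862 transactions corresponds to about 465 rows
support_rows = round(best["support"] / 100 * 1862)
print(support_rows)
```

A support near 25% is what one would expect for an antecedent of the form `above_q3_...`, since each quartile bin holds roughly a quarter of the player-seasons.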

### The analysis of these rules will be done in the project report.