# Analysing statistics for PGA Tour players in the 2022 season

In this notebook, I will be using data from __[Advanced Sports Analytics](https://www.advancedsportsanalytics.com/pga-raw-data)__ to look at different statistics for PGA Tour players in the 2022 season. 

0. The first step is to import the necessary libraries for our analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import datetime



1. The next step is to import the data and inspect it: look at the columns and their data types, and check for null values.

In [2]:
# Import the data into a pandas dataframe
pga_data = pd.read_csv('pgatour_18to22.csv')
#Print the first 5 rows of the data to see how it is arranged into rows and columns
print(pga_data.head())
#Print the data types of our columns
print(pga_data.dtypes)
#Print the number of null values in each column
print('-'*100)
print('Null values in each column:')
print(pga_data.isna().sum())

  Player_initial_last  tournament id  player id  hole_par  strokes  hole_DKP  \
0            A. Ancer      401353224       9261       288      289      60.0   
1           A. Hadwin      401353224       5548       288      286      72.5   
2           A. Lahiri      401353224       4989       144      147      21.5   
3             A. Long      401353224       6015       144      151      20.5   
4            A. Noren      401353224       3832       144      148      23.5   

   hole_FDP  hole_SDP  streak_DKP  streak_FDP  ...  purse  season  no_cut  \
0      51.1        56           3         7.6  ...   12.0    2022       0   
1      61.5        61           8        13.0  ...   12.0    2022       0   
2      17.4        27           0         0.0  ...   12.0    2022       0   
3      13.6        17           0         0.4  ...   12.0    2022       0   
4      18.1        23           0         1.2  ...   12.0    2022       0   

   Finish  sg_putt  sg_arg  sg_app  sg_ott  sg_t2g  sg_t

In [3]:
#Check the player column to see the data it gives us, as we also have a Player_initial_last column
print(pga_data.player.head())
print(pga_data.Player_initial_last.head())
print('-'*100)
#Get the number of tournaments we have information about
print('Number of unique tournaments played:',pga_data['tournament name'].nunique())
print('Number of unique players:',pga_data.player.nunique())

0      Abraham Ancer
1        Adam Hadwin
2     Anirban Lahiri
3          Adam Long
4    Alexander Noren
Name: player, dtype: object
0     A. Ancer
1    A. Hadwin
2    A. Lahiri
3      A. Long
4     A. Noren
Name: Player_initial_last, dtype: object
----------------------------------------------------------------------------------------------------
Number of unique tournaments played: 67
Number of unique players: 499


## Data Info

1. It looks like we have 37 columns with information about 67 tournaments and 499 players from the 2018-2022 seasons.
 
    1. `hole_par` tells us the total par for the holes the player completed in the tournament (par is the number of strokes a player needs on a hole to stay even with the course). In the rows printed above it is either 288 or 144, which is consistent with four and two rounds on a par-72 course.
    2. `strokes` tells us the total number of strokes the player took over those holes. For example, 289 strokes against a par of 288 is a score of +1, or 1 stroke over par!
    3. `n_rounds` tells us the number of rounds the player played in a certain tournament. It usually has values of either 2 or 4 (explanation below).
    4. `made_cut` tells us whether a player made the *cut* (1) or not (0). PGA Tour events last 4 days and the playing field is **cut** roughly in half after the end of the 2nd day, meaning players that *made the cut* play 4 rounds, but players that do not make the cut play 2 rounds. A player may also be disqualified or have to withdraw partway through a tournament, so values of 1 or 3 in `n_rounds` are possible but rare (good to be aware of them though!).
    5. `pos` tells us the position the player finished at the end of the tournament in `float` form, so if a player did not make the cut, it gives a `NaN` value.
    6. `date` gives us the date (in the usual `yyyy-mm-dd` form) that the row refers to.
    7. `purse` is the prize money for the tournament.
    8. `no_cut` tells us whether a tournament was played without a cut. This information is somewhat important: most tournaments **do** have cuts, so we might want to filter the no-cut events out.
    9. `Finish` gives the same information as `pos`, but displays the information in a way familiar to golfers (CUT meaning the player didn't make the cut, Tnum meaning the player tied for num-th place.)
    10. The `sg_` prefix means **strokes gained**, a comparison of the player's performance with the rest of the field (please see https://www.pgatour.com/news/2016/05/31/strokes-gained-defined.html).
        1. `sg_putt` means strokes gained putting.
        2. `sg_arg` means strokes gained around the green (within 30 yards of green) (the *green* is the surface where the hole is located).
        3. `sg_app` means strokes gained approaching the green (any shot hit towards the green that is not a tee shot on a par 4 or par 5, but includes tee shots on par 3s) (a tee shot is the first shot on any hole).
        4. `sg_ott` means strokes gained off the tee.
        5. `sg_t2g` means strokes gained tee to green (`sg_t2g = sg_arg + sg_app + sg_ott`)
        6. `sg_total = sg_t2g + sg_putt`
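
The two strokes-gained identities in the list above, and the score relative to par, can be checked on a small example. This is only a sketch: the numbers are made up for illustration rather than taken from the dataset, and it assumes the approach column is named `sg_app` (as in the printed header) and that the total is stored as `sg_total`.

```python
import pandas as pd

# Illustrative rows only; these values are made up, not taken from the dataset.
toy = pd.DataFrame({
    'hole_par': [288, 144],
    'strokes':  [286, 147],
    'sg_putt':  [0.50, -0.25],
    'sg_arg':   [0.25, -0.50],
    'sg_app':   [1.00, 0.25],
    'sg_ott':   [0.25, -0.75],
})

# Tee-to-green is the sum of the three non-putting categories.
toy['sg_t2g'] = toy[['sg_arg', 'sg_app', 'sg_ott']].sum(axis=1)
# Total strokes gained adds putting on top of tee-to-green.
toy['sg_total'] = toy['sg_t2g'] + toy['sg_putt']
# Score relative to par: negative is under par, positive is over par.
toy['score_to_par'] = toy['strokes'] - toy['hole_par']

print(toy[['sg_t2g', 'sg_total', 'score_to_par']])
```

On the real data, the same sums could be compared against the stored `sg_t2g` column with `np.isclose` to spot rows where the identity does not hold.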

## Data Cleaning strategy

1. The naming of the columns is very inconsistent as some column names have title style while some do not, some contain underscores while some do not etc. The naming will have to be fixed. Some of the names of the columns also do not do a great job of explaining what the columns values are representing, especially for non-golfing audiences.

2. The columns that contain the three-letter acronyms *FDP, DKP, SDP* have to do with DraftKings and other fantasy leagues, i.e. leagues where followers of the tour can select players to be in their 'fantasy team' and see who gets the most points. This data is of little use to my analysis here as we are MOSTLY INTERESTED IN THE STATS OF THE PLAYERS THEMSELVES.
**These columns (12) will be removed**

3. We have columns in the form `Unnamed: number` which only contain null values.
**These columns (3) will also be removed**

4. There are 2 columns containing information about the player's name: `Player_initial_last` and `player`.
**This gives us redundant information and can be reduced to a single column, or to two columns containing a forename and a surname**

5. `pos` and `Finish` give us the same information as different data types, but it is easier to work with numeric data types. Whether a player ties for 2nd place or comes solo 2nd is not of much importance to us. If `made_cut` is False, then `pos` is `NaN`.
**`Finish` can therefore be removed, a column containing information about whether the position is tied or solo can be added**
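
As a rough illustration of points 1 and 3, the sketch below normalises a handful of column names to one lower-case, underscore-separated style and drops the placeholder columns. The helper `to_snake_case` and the column `Unnamed: 2` are hypothetical names chosen for the example; the other names come from the raw header. It only fixes the style of the names, so more descriptive names (for example `pos` to `position`) still need an explicit `rename`.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r'[^0-9a-zA-Z]+', '_', name.strip())
    return cleaned.strip('_').lower()

# An empty frame with a few raw-style column names.
# 'Unnamed: 2' is a hypothetical example of the `Unnamed: number` columns.
raw = pd.DataFrame(columns=['Player_initial_last', 'tournament id',
                            'player id', 'Finish', 'Unnamed: 2'])

# Drop the placeholder columns, then normalise the remaining names.
kept = raw.loc[:, ~raw.columns.str.match(r'Unnamed: \d+')]
kept = kept.rename(columns=to_snake_case)

print(list(kept.columns))
```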

In [4]:
# Remove columns with the fantasy leagues information
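# Raw strings avoid invalid escape sequences such as '\_' and '\d' in the patterns
pga_data = pga_data.drop(columns=pga_data.filter(regex=r'(hole|streak|finish|total)_[DFS][KD]P').columns)
# Remove the empty 'Unnamed: N' columns
pga_data = pga_data.drop(columns=pga_data.filter(regex=r'Unnamed: \d').columns)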

#Manipulate the player name columns to remove redundant information
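# Split on the first space only, so a multi-word surname is kept whole instead of being cut off
forename_lastname = pga_data.player.str.split(' ', n=1)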
pga_data['player_forename'] = forename_lastname.str.get(0).astype('string')
pga_data['player_surname'] = forename_lastname.str.get(1).astype('string')

In [5]:
pga_data['tied'] = pga_data.Finish.apply(lambda x: True if str(x)[0] == 'T' else \
                                         (np.nan if str(x) == 'CUT' else False))
pga_data['made_cut'] = pga_data.made_cut.astype(bool)
pga_data['tournament name'] = pga_data['tournament name'].astype('string')
pga_data['course'] = pga_data.course.astype('string')
pga_data['date'] = pd.to_datetime(pga_data.date)
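
The `tied` flag can also be derived without a row-wise `lambda`. The sketch below is one possible vectorised version, not the code used in this notebook. It runs on a few sample values in the style of the `Finish` column; `'WD'` is a hypothetical non-finishing status added for illustration. Unlike the `lambda` above, it marks every value without a finishing place as missing, not only `'CUT'`.

```python
import pandas as pd

# Sample values in the style of the `Finish` column; 'WD' is a hypothetical
# non-finishing status included only to show how such values are handled.
finish = pd.Series(['1', 'T2', 'T2', 'CUT', 'WD'])

# A row has a finishing place when the value is digits with an optional leading 'T'.
has_place = finish.str.fullmatch(r'T?\d+')

# Tied when the value starts with 'T'; missing when there is no place at all.
tied = finish.str.startswith('T').where(has_place)

# Numeric place: strip the tie marker, then coerce non-numeric statuses to NaN.
position = pd.to_numeric(finish.str.lstrip('T'), errors='coerce')

print(pd.DataFrame({'finish': finish, 'tied': tied, 'position': position}))
```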

In [6]:
pga_data = pga_data.rename(columns = {'tournament id': 'tournament_id', 'player id': 'player_id',\
                                     'n_rounds': 'rounds_played','pos': 'position',\
                                      'tournament name': 'tournament'})
pga_data = pga_data.drop(columns = ['player','Finish','Player_initial_last'])

In [7]:
print(pga_data.info())
print(pga_data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23780 entries, 0 to 23779
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   tournament_id    23780 non-null  int64         
 1   player_id        23780 non-null  int64         
 2   hole_par         23780 non-null  int64         
 3   strokes          23780 non-null  int64         
 4   rounds_played    23780 non-null  int64         
 5   made_cut         23780 non-null  bool          
 6   position         13293 non-null  float64       
 7   tournament       23780 non-null  string        
 8   course           23780 non-null  string        
 9   date             23780 non-null  datetime64[ns]
 10  purse            23780 non-null  float64       
 11  season           23780 non-null  int64         
 12  no_cut           23780 non-null  int64         
 13  sg_putt          19919 non-null  float64       
 14  sg_arg           19919 non-null  float

In [8]:
course_data = pga_data.course.str.split('-')
print(course_data.head())
pga_data.course = course_data.str.get(0).str.strip().astype('string')
city_state = course_data.str.get(1)
city_state = city_state.str.split(',')
pga_data['city'] = city_state.str.get(0).str.strip().astype('string')
pga_data['state'] = city_state.str.get(1).str.strip().astype('string')
#print(pga_data.course.head())
#print(pga_data.city.head())
#print(pga_data.state.head())

0    [Muirfield Village Golf Club ,  Dublin, OH]
1    [Muirfield Village Golf Club ,  Dublin, OH]
2    [Muirfield Village Golf Club ,  Dublin, OH]
3    [Muirfield Village Golf Club ,  Dublin, OH]
4    [Muirfield Village Golf Club ,  Dublin, OH]
Name: course, dtype: object
0         Dublin, OH
1         Dublin, OH
2         Dublin, OH
3         Dublin, OH
4         Dublin, OH
            ...     
23775       Napa, CA
23776       Napa, CA
23777       Napa, CA
23778       Napa, CA
23779       Napa, CA
Name: course, Length: 23780, dtype: object
0        [ Dublin,  OH]
1        [ Dublin,  OH]
2        [ Dublin,  OH]
3        [ Dublin,  OH]
4        [ Dublin,  OH]
              ...      
23775      [ Napa,  CA]
23776      [ Napa,  CA]
23777      [ Napa,  CA]
23778      [ Napa,  CA]
23779      [ Napa,  CA]
Name: course, Length: 23780, dtype: object
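
Splitting on a bare `-` would also split any course name that itself contains a hyphen. A slightly more defensive sketch is shown below, assuming every value follows the `course - city, state` pattern seen in the output above. The second sample row is hypothetical and is included only to show a hyphenated name.

```python
import pandas as pd

courses = pd.Series([
    'Muirfield Village Golf Club - Dublin, OH',
    'Example Hyphen-Named Club - Napa, CA',   # hypothetical course name
])

# Named groups become the column names of the resulting frame.
# The separator must be a hyphen surrounded by whitespace, so the hyphen
# inside 'Hyphen-Named' is left alone.
pattern = r'^(?P<course>.+?)\s+-\s+(?P<city>[^,]+),\s*(?P<state>.+)$'
parts = courses.str.extract(pattern)

print(parts)
```

Rows that do not match the pattern come back as missing values in all three columns, which makes them easy to find with `parts.isna()`.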
