Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is a reasonable metric. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
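
The target-distribution and metric questions above can be checked in a few lines. This is a minimal sketch on made-up data: `toy`, `label`, and `price` are placeholder names, not columns from the project dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; `label` and `price` are placeholder column names.
toy = pd.DataFrame({
    'label': ['a', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],
    'price': [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 8.0, 20.0, 100.0],
})

# Classification: share of each class. The largest share is the
# majority class frequency referred to in the checklist.
class_shares = toy['label'].value_counts(normalize=True)
majority_share = class_shares.max()
accuracy_is_reasonable = 0.5 <= majority_share < 0.7

# Regression: a clearly positive skew suggests a right-skewed target.
# log1p is one common transform; compare the skew before and after.
skew_raw = toy['price'].skew()
skew_log = np.log1p(toy['price']).skew()
```

If the target is log-transformed for training, predictions need `np.expm1` applied before they are reported in the original units.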

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/Users/robertbuckley/repos/DS-Unit-2-Applied-Modeling/PGA_Data_Historical.csv')

In [3]:
df.head()

,Player Name,Season,Statistic,Variable,Value
0,Robert Garrigus,2010,Driving Distance,Driving Distance - (ROUNDS),71
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77
2,Dustin Johnson,2010,Driving Distance,Driving Distance - (ROUNDS),83
3,Brett Wetterich,2010,Driving Distance,Driving Distance - (ROUNDS),54
4,J.B. Holmes,2010,Driving Distance,Driving Distance - (ROUNDS),100


In [4]:
df['Variable'].value_counts()

Official World Golf Ranking - (POINTS GAINED)                            8937
Official World Golf Ranking - (TOTAL POINTS)                             8937
Official World Golf Ranking - (COUNTRY)                                  8937
Official World Golf Ranking - (POINTS LOST)                              8937
Official World Golf Ranking - (AVG POINTS)                               8937
                                                                         ... 
Distance Analysis 200-220 yards - 8 Iron - (ROUNDS)                         1
Distance Analysis 260-280 yards - 8 Iron - (%)                              1
Distance Analysis 240-260 yards - 8 Iron - (%)                              1
Distance Analysis 260-280 yards - 8 Iron - (TOTAL ATTEMPTS WITH CLUB)       1
Distance Analysis 160-170 yards - 3 Iron - (%)                              1
Name: Variable, Length: 2081, dtype: int64

In [5]:
df['Player Name'].nunique()

3053

In [6]:
3052 * 8937

27275724

In [7]:
df.shape

(2740403, 5)

In [8]:
#Just pulling out one player "Bubba Watson"
bw = df[df['Player Name'] == 'Bubba Watson']

In [9]:
bw

,Player Name,Season,Statistic,Variable,Value
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77
193,Bubba Watson,2010,Driving Distance,Driving Distance - (AVG.),309.8
385,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DISTANCE),47703
577,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DRIVES),154
943,Bubba Watson,2010,Driving Accuracy Percentage,Driving Accuracy Percentage - (ROUNDS),77
...,...,...,...,...,...
2739586,Bubba Watson,2018,Fairway Bunker Tendency,Fairway Bunker Tendency - (ROUNDS),90
2739779,Bubba Watson,2018,Fairway Bunker Tendency,Fairway Bunker Tendency - (%),6.7
2739972,Bubba Watson,2018,Fairway Bunker Tendency,Fairway Bunker Tendency - (TOTAL FAIRWAY BUNKERS),63
2740165,Bubba Watson,2018,Fairway Bunker Tendency,Fairway Bunker Tendency - (POSSIBLE FWYS),940


In [10]:
# 1,493 variables across all seasons for Bubba Watson
bw['Variable'].nunique()

1493

In [11]:
# Just the 2010 Season
bw_2010 = bw[bw['Season'] == 2010]

In [12]:
#1,458 variables for the 2010 season for Bubba Watson
bw_2010['Variable'].nunique()

1458

In [13]:
pd.options.display.max_rows = 1500
bw_2010.head()

,Player Name,Season,Statistic,Variable,Value
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77.0
193,Bubba Watson,2010,Driving Distance,Driving Distance - (AVG.),309.8
385,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DISTANCE),47703.0
577,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DRIVES),154.0
943,Bubba Watson,2010,Driving Accuracy Percentage,Driving Accuracy Percentage - (ROUNDS),77.0


In [14]:
df['Statistic'].nunique()

528

In [15]:
#Pulling 'Driving Distance' from the "Statistic" column for all players and all years
driving = df[df['Statistic'] == 'Driving Distance']
driving.head()

,Player Name,Season,Statistic,Variable,Value
0,Robert Garrigus,2010,Driving Distance,Driving Distance - (ROUNDS),71
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77
2,Dustin Johnson,2010,Driving Distance,Driving Distance - (ROUNDS),83
3,Brett Wetterich,2010,Driving Distance,Driving Distance - (ROUNDS),54
4,J.B. Holmes,2010,Driving Distance,Driving Distance - (ROUNDS),100


In [16]:
#Now pulling just 'Bubba Watson' from 'Driving Distance'-Statistic dataset
bw_driving = driving[driving['Player Name'] == 'Bubba Watson']
bw_driving['Variable'].nunique()

4

In [17]:
bw_driving.drop(columns=['Statistic'])

,Player Name,Season,Variable,Value
1,Bubba Watson,2010,Driving Distance - (ROUNDS),77.0
193,Bubba Watson,2010,Driving Distance - (AVG.),309.8
385,Bubba Watson,2010,Driving Distance - (TOTAL DISTANCE),47703.0
577,Bubba Watson,2010,Driving Distance - (TOTAL DRIVES),154.0
323084,Bubba Watson,2011,Driving Distance - (ROUNDS),85.0
323270,Bubba Watson,2011,Driving Distance - (AVG.),314.9
323456,Bubba Watson,2011,Driving Distance - (TOTAL DISTANCE),49747.0
323642,Bubba Watson,2011,Driving Distance - (TOTAL DRIVES),158.0
628195,Bubba Watson,2012,Driving Distance - (ROUNDS),68.0
628386,Bubba Watson,2012,Driving Distance - (AVG.),315.5


In [18]:
# Make a combined name-and-year column. Copy first so the assignment
# does not target a slice of `driving` (avoids SettingWithCopyWarning).
bw_driving = bw_driving.copy()
bw_driving['Name_Year'] = bw_driving['Player Name'] + bw_driving['Season'].astype('str')


In [19]:
bw_driving

,Player Name,Season,Statistic,Variable,Value,Name_Year
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77.0,Bubba Watson2010
193,Bubba Watson,2010,Driving Distance,Driving Distance - (AVG.),309.8,Bubba Watson2010
385,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DISTANCE),47703.0,Bubba Watson2010
577,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DRIVES),154.0,Bubba Watson2010
323084,Bubba Watson,2011,Driving Distance,Driving Distance - (ROUNDS),85.0,Bubba Watson2011
323270,Bubba Watson,2011,Driving Distance,Driving Distance - (AVG.),314.9,Bubba Watson2011
323456,Bubba Watson,2011,Driving Distance,Driving Distance - (TOTAL DISTANCE),49747.0,Bubba Watson2011
323642,Bubba Watson,2011,Driving Distance,Driving Distance - (TOTAL DRIVES),158.0,Bubba Watson2011
628195,Bubba Watson,2012,Driving Distance,Driving Distance - (ROUNDS),68.0,Bubba Watson2012
628386,Bubba Watson,2012,Driving Distance,Driving Distance - (AVG.),315.5,Bubba Watson2012


In [20]:
bw_driving = bw_driving.pivot(index='Name_Year', columns='Variable', values='Value')
bw_driving

Name_Year,Driving Distance - (AVG.),Driving Distance - (ROUNDS),Driving Distance - (TOTAL DISTANCE),Driving Distance - (TOTAL DRIVES)
Bubba Watson2010,309.8,77,47703,154
Bubba Watson2011,314.9,85,49747,158
Bubba Watson2012,315.5,68,41652,132
Bubba Watson2013,303.7,76,44030,145
Bubba Watson2014,314.3,76,40863,130
Bubba Watson2015,315.2,71,40351,128
Bubba Watson2016,310.6,72,40382,130
Bubba Watson2017,305.6,71,34840,114
Bubba Watson2018,313.1,90,49470,158
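
Concatenating the name and season into one string key works, but it buries the season inside text. Here is a sketch of an alternative, assuming the same long layout (`Player Name`, `Season`, `Variable`, `Value`): pivot on a two-level index so `Season` stays a separate numeric column. The toy rows are copied from the output above; `long_df` and `wide` are illustrative names.

```python
import pandas as pd

# A few rows in the same long layout as the PGA data.
long_df = pd.DataFrame({
    'Player Name': ['Bubba Watson'] * 4,
    'Season': [2010, 2010, 2011, 2011],
    'Variable': ['Driving Distance - (AVG.)', 'Driving Distance - (ROUNDS)'] * 2,
    'Value': [309.8, 77, 314.9, 85],
})

# pivot_table accepts a list of index columns. aggfunc='first' passes
# values through unchanged when each (player, season, variable) is unique;
# with duplicates it silently keeps the first one.
wide = long_df.pivot_table(
    index=['Player Name', 'Season'],
    columns='Variable',
    values='Value',
    aggfunc='first',
).reset_index()
```

Keeping `Season` as its own column makes a later time-based split a simple comparison rather than string parsing.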


In [21]:
bw_driving.columns

Index(['Driving Distance - (AVG.)', 'Driving Distance - (ROUNDS)',
       'Driving Distance - (TOTAL DISTANCE)',
       'Driving Distance - (TOTAL DRIVES)'],
      dtype='object', name='Variable')

In [48]:
stat_list = df['Statistic'].unique()


In [51]:
stat_list

array(['Driving Distance', 'Driving Accuracy Percentage',
       'Greens in Regulation Percentage', 'Putting Average',
       'Par Breakers', 'Total Eagles', 'Total Birdies',
       'Scoring Average (Actual)', 'Money Leaders',
       'Sand Save Percentage', 'Par 3 Birdie or Better Leaders',
       'Par 4 Birdie or Better Leaders', 'Par 5 Birdie or Better Leaders',
       'Birdie or Better Conversion Percentage', 'Putts Per Round',
       'Scoring Average', 'All-Around Ranking', 'Total Driving',
       'Scrambling', 'Ryder Cup Points', 'PGA Championship Points',
       'Putts made Distance', 'Top 10 Finishes', 'Non-member Earnings',
       'Par 3 Scoring Average', 'Par 4 Scoring Average',
       'Par 5 Scoring Average', "3-Putt Avoidance - 15-20'",
       "3-Putt Avoidance - 20-25'", "3-Putt Avoidance > 25'",
       'Current Par or Better Streak', "Rounds in the 60's",
       'Money per Event Leaders', 'Eagles (Holes per)', 'Birdie Average',
       'World Money List', 'Ball Striking', '

In [None]:
#Driving Distance, Driving Accuracy Percentage, Total Driving,'Driving Pct. 300+ (All Drives)',
#        'Driving Pct. 280-300 (All Drives)',
#        'Driving Pct. 260-280 (All Drives)',
#        'Driving Pct. 240-260 (All Drives)',
#        'Driving Pct. <= 240 (All Drives)',
#'Driving Distance - All Drives', 'Left Rough Tendency',
#        'Right Rough Tendency', 'Missed Fairway Percent - Other',
#    'Distance Analysis 200-210 yards - Driver',
#        'Distance Analysis 190-200 yards - Driver',
#        'Distance Analysis 160-170 yards - Driver',
#        'Distance Analysis 150-160 yards - Driver',
#        'Distance Analysis < 80 yards - Driver',
#        'Distance Analysis 240-260 yards - Driver',

In [70]:
# Driving-related statistics to keep from the 'Statistic' column
drivinglist = (
    'Driving Distance', 'Driving Accuracy Percentage', 'Total Driving',
    'Driving Pct. 300+ (All Drives)',
    'Driving Pct. 280-300 (All Drives)',
    'Driving Pct. 260-280 (All Drives)',
    'Driving Pct. 240-260 (All Drives)',
    'Driving Pct. <= 240 (All Drives)',
    'Driving Distance - All Drives', 'Left Rough Tendency',
    'Right Rough Tendency', 'Missed Fairway Percent - Other',
)

testdriving = df[df['Statistic'].isin(drivinglist)]

In [71]:
testdriving.drop(columns='Statistic')

,Player Name,Season,Variable,Value
0,Robert Garrigus,2010,Driving Distance - (ROUNDS),71
1,Bubba Watson,2010,Driving Distance - (ROUNDS),77
2,Dustin Johnson,2010,Driving Distance - (ROUNDS),83
3,Brett Wetterich,2010,Driving Distance - (ROUNDS),54
4,J.B. Holmes,2010,Driving Distance - (ROUNDS),100
...,...,...,...,...
2617901,Greg Chalmers,2018,Missed Fairway Percent - Other - (RELATIVE TO ...,+0.52
2617902,Wesley Bryan,2018,Missed Fairway Percent - Other - (RELATIVE TO ...,+0.51
2617903,Matt Every,2018,Missed Fairway Percent - Other - (RELATIVE TO ...,+0.63
2617904,Retief Goosen,2018,Missed Fairway Percent - Other - (RELATIVE TO ...,+0.50


In [72]:
# Make a combined name-and-year column. Copy first so the assignment
# does not target a slice of `df` (avoids SettingWithCopyWarning).
testdriving = testdriving.copy()
testdriving['Name_Year'] = testdriving['Player Name'] + testdriving['Season'].astype('str')


In [73]:
# Pivot: spread the Variable column into multiple separate columns
testdriving = testdriving.pivot(index='Name_Year', columns='Variable', values='Value')
testdriving

Name_Year,Driving Accuracy Percentage - (%),Driving Accuracy Percentage - (FAIRWAYS HIT),Driving Accuracy Percentage - (POSSIBLE FAIRWAYS),Driving Accuracy Percentage - (ROUNDS),Driving Distance - (AVG.),Driving Distance - (ROUNDS),Driving Distance - (TOTAL DISTANCE),Driving Distance - (TOTAL DRIVES),Driving Distance - All Drives - (# OF DRIVES),Driving Distance - All Drives - (ALL DRIVES),...,Missed Fairway Percent - Other - (TOTAL MISSED),Right Rough Tendency - (%),Right Rough Tendency - (POSSIBLE FWYS),Right Rough Tendency - (RELATIVE TO PAR),Right Rough Tendency - (ROUNDS),Right Rough Tendency - (TOTAL RIGHT ROUGH),Total Driving - (ACCURACY RANK),Total Driving - (DISTANCE RANK),Total Driving - (EVENTS),Total Driving - (TOTAL)
Aaron Baddeley2010,56.65,741,1308,94,298.9,94,56202,188,1225,289.1,...,60,19.31,1196,+0.05,94,231,170,14,26,184
Aaron Baddeley2011,55.67,584,1049,77,296.2,77,44430,150,868,288.4,...,21,19.36,847,+0.07,77,164,166,51,22,217
Aaron Baddeley2012,54.30,549,1011,73,292.0,73,42048,144,774,288.7,...,28,18.05,759,+0.06,73,137,175,68,22,243
Aaron Baddeley2013,50.71,465,917,68,288.5,68,38086,132,820,281.0,...,62,16.73,801,+0.07,68,134,179,86,24,265
Aaron Baddeley2014,52.29,525,1004,74,293.8,74,39950,136,823,280.3,...,50,19.82,797,+0.09,74,158,175,60,24,235
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zach Johnson2016,65.23,788,1208,90,280.1,90,48185,172,1026,276.0,...,38,11.15,1022,+0.02,90,114,32,167,24,199
Zach Johnson2017,66.54,680,1022,81,286.8,81,41872,146,896,281.5,...,27,10.97,893,+0.07,81,98,25,140,23,165
Zach Johnson2018,63.88,778,1218,94,289.8,94,50427,174,1004,282.9,...,31,11.23,997,+0.13,94,112,63,154,25,217
Zack Miller2011,57.06,651,1141,86,299.5,86,49116,164,1044,285.9,...,39,19.86,1027,+0.22,86,204,148,25,30,173


In [74]:
driving_stats = testdriving

In [75]:
driving_stats.to_csv('driving_stats_2010-2018.csv')
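
For the train/validate/test decision in the assignment, a time-based split by season is one option with this data. This is a sketch, assuming a table indexed by `Name_Year` whose keys end in a four-digit season. The values are Bubba Watson's driving averages from the output earlier, and the cutoff years are arbitrary illustrations.

```python
import pandas as pd

# Toy table indexed the same way as the exported driving stats.
wide = pd.DataFrame(
    {'Driving Distance - (AVG.)': [309.8, 314.9, 315.5, 303.7, 314.3,
                                   315.2, 310.6, 305.6, 313.1]},
    index=pd.Index([f'Bubba Watson{year}' for year in range(2010, 2019)],
                   name='Name_Year'),
)

# Recover the season from the last four characters of each key.
season = wide.index.str[-4:].astype(int)

# Train on earlier seasons, validate on 2017, test on 2018.
train = wide[season <= 2016]
val = wide[season == 2017]
test = wide[season == 2018]
```

Splitting by season keeps later seasons out of training, which matters if the goal is to predict a future season from past ones.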