Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is an acceptable metric. Outside that range, accuracy can be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
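
The target-distribution and metric questions in the checklist can be checked with a few lines of pandas. The following is a minimal sketch on synthetic data, not a prescribed method: the `labels` and `price` Series are hypothetical stand-ins for your own target column, and the 50% to 70% range is the rule of thumb stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical classification target: 75 "no" and 25 "yes".
labels = pd.Series(["no"] * 75 + ["yes"] * 25, name="label")

# Relative class frequencies, sorted from most to least common.
class_freq = labels.value_counts(normalize=True)
majority_freq = class_freq.iloc[0]

# Rule of thumb from the checklist: accuracy alone is reasonable
# only when the majority class frequency is in [0.5, 0.7).
accuracy_ok = 0.5 <= majority_freq < 0.7

# Hypothetical regression target drawn from a right-skewed (lognormal) distribution.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=500), name="price")

# Clearly positive skew suggests a right-skewed target.
skew_raw = price.skew()
# log1p computes log(1 + x); predictions can be mapped back with np.expm1.
skew_log = np.log1p(price).skew()

print(class_freq)
print(f"majority class frequency: {majority_freq:.2f}, accuracy alone OK: {accuracy_ok}")
print(f"skew before log: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

If you do log transform a regression target, remember to invert the transform before reporting errors in the original units.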

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/Users/robertbuckley/repos/DS-Unit-2-Applied-Modeling/PGA_Data_Historical.csv')

In [3]:
df.head()

Unnamed: 0,Player Name,Season,Statistic,Variable,Value
0,Robert Garrigus,2010,Driving Distance,Driving Distance - (ROUNDS),71
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77
2,Dustin Johnson,2010,Driving Distance,Driving Distance - (ROUNDS),83
3,Brett Wetterich,2010,Driving Distance,Driving Distance - (ROUNDS),54
4,J.B. Holmes,2010,Driving Distance,Driving Distance - (ROUNDS),100


In [4]:
df['Variable'].value_counts()

Official World Golf Ranking - (TOTAL POINTS)                              8937
Official World Golf Ranking - (AVG POINTS)                                8937
Official World Golf Ranking - (POINTS LOST)                               8937
Official World Golf Ranking - (POINTS GAINED)                             8937
Official World Golf Ranking - (COUNTRY)                                   8937
                                                                          ... 
Distance Analysis 220-240 yards - 8 Iron - (%)                               1
Distance Analysis 120-130 yards - 6 Iron - (ROUNDS)                          1
Distance Analysis 260-280 yards - 8 Iron - (TOTAL ATTEMPTS WITH CLUB)        1
Distance Analysis 120-130 yards - 7 Iron - (TOTAL ATTEMPTS DIST RANGE)       1
Distance Analysis 200-220 yards - Wedges - (%)                               1
Name: Variable, Length: 2081, dtype: int64
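
The counts above show that some variables have thousands of rows while others appear only once. One possible way to set aside the sparsest variables is sketched below on toy data. The `toy` frame and the `min_rows` threshold are assumptions for illustration, not values taken from this dataset.

```python
import pandas as pd

# Hypothetical long-format frame: variable "a" has 5 rows, "b" has 3, "c" has 1.
toy = pd.DataFrame({"Variable": ["a"] * 5 + ["b"] * 3 + ["c"]})

counts = toy["Variable"].value_counts()

# Assumed threshold: keep variables observed at least this many times.
min_rows = 3
keep = counts[counts >= min_rows].index

# Retain only rows whose variable passed the threshold.
filtered = toy[toy["Variable"].isin(keep)]

print(sorted(keep))
print(len(filtered))
```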

In [5]:
df['Player Name'].nunique()

3053

In [6]:
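# Rough size check: approximate player count times the row count of the
# most common variable (note that nunique() above returned 3053, not 3052)
3052 * 8937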

27275724

In [7]:
df.shape

(2740403, 5)

In [8]:
# Pull out the rows for a single player, Bubba Watson
bw = df[df['Player Name'] == 'Bubba Watson']

In [9]:
bw

Unnamed: 0,Player Name,Season,Statistic,Variable,Value
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77
193,Bubba Watson,2010,Driving Distance,Driving Distance - (AVG.),309.8
385,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DISTANCE),47703
577,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DRIVES),154
943,Bubba Watson,2010,Driving Accuracy Percentage,Driving Accuracy Percentage - (ROUNDS),77
...,...,...,...,...,...
2739586,Bubba Watson,2018,Fairway Bunker Tendency,Fairway Bunker Tendency - (ROUNDS),90
2739779,Bubba Watson,2018,Fairway Bunker Tendency,Fairway Bunker Tendency - (%),6.7
2739972,Bubba Watson,2018,Fairway Bunker Tendency,Fairway Bunker Tendency - (TOTAL FAIRWAY BUNKERS),63
2740165,Bubba Watson,2018,Fairway Bunker Tendency,Fairway Bunker Tendency - (POSSIBLE FWYS),940


In [10]:
# 1,493 variables
bw['Variable'].nunique()

1493

In [12]:
# Just the 2010 Season
bw_2010 = bw[bw['Season'] == 2010]

In [13]:
# 1,458 variables for Bubba Watson in the 2010 season
bw_2010['Variable'].nunique()

1458
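
Filtering on `Season`, as in the cell above, is also the basis of a time-based split. The sketch below uses a small hypothetical frame with the same `Season` column. The cutoff seasons (train through 2016, validate on 2017, test on 2018) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy frame; only the 'Season' column mirrors the real data.
toy = pd.DataFrame({
    "Player Name": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "Season": [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018],
    "Value": [70.1, 71.3, 69.8, 70.9, 70.4, 71.0, 69.5, 70.2],
})

# Assumed cutoffs: earlier seasons train the model, later seasons evaluate it.
train = toy[toy["Season"] <= 2016]
val = toy[toy["Season"] == 2017]
test = toy[toy["Season"] == 2018]

# Row counts per split.
print(len(train), len(val), len(test))
```

Splitting by season keeps later seasons out of training, which is the point of a time-based split when the goal is to predict future performance.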

In [14]:
pd.options.display.max_rows = 1500
bw_2010.head(1458)

Unnamed: 0,Player Name,Season,Statistic,Variable,Value
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77
193,Bubba Watson,2010,Driving Distance,Driving Distance - (AVG.),309.8
385,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DISTANCE),47703
577,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DRIVES),154
943,Bubba Watson,2010,Driving Accuracy Percentage,Driving Accuracy Percentage - (ROUNDS),77
1135,Bubba Watson,2010,Driving Accuracy Percentage,Driving Accuracy Percentage - (%),55.67
1327,Bubba Watson,2010,Driving Accuracy Percentage,Driving Accuracy Percentage - (FAIRWAYS HIT),599
1519,Bubba Watson,2010,Driving Accuracy Percentage,Driving Accuracy Percentage - (POSSIBLE FAIRWAYS),1076
1590,Bubba Watson,2010,Greens in Regulation Percentage,Greens in Regulation Percentage - (ROUNDS),77
1782,Bubba Watson,2010,Greens in Regulation Percentage,Greens in Regulation Percentage - (%),68.54


In [22]:
df['Statistic'].nunique()

528

In [23]:
# Pull 'Driving Distance' from the 'Statistic' column for all players and all seasons
driving = df[df['Statistic'] == 'Driving Distance']
driving.head(500)

Unnamed: 0,Player Name,Season,Statistic,Variable,Value
0,Robert Garrigus,2010,Driving Distance,Driving Distance - (ROUNDS),71.0
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77.0
2,Dustin Johnson,2010,Driving Distance,Driving Distance - (ROUNDS),83.0
3,Brett Wetterich,2010,Driving Distance,Driving Distance - (ROUNDS),54.0
4,J.B. Holmes,2010,Driving Distance,Driving Distance - (ROUNDS),100.0
5,John Daly,2010,Driving Distance,Driving Distance - (ROUNDS),63.0
6,Graham DeLaet,2010,Driving Distance,Driving Distance - (ROUNDS),88.0
7,Angel Cabrera,2010,Driving Distance,Driving Distance - (ROUNDS),64.0
8,Charles Warren,2010,Driving Distance,Driving Distance - (ROUNDS),64.0
9,D.J. Trahan,2010,Driving Distance,Driving Distance - (ROUNDS),92.0


In [24]:
# Now pull just 'Bubba Watson' from the 'Driving Distance' subset
bw_driving = driving[driving['Player Name'] == 'Bubba Watson']
bw_driving['Variable'].nunique()

4

In [25]:
bw_driving

Unnamed: 0,Player Name,Season,Statistic,Variable,Value
1,Bubba Watson,2010,Driving Distance,Driving Distance - (ROUNDS),77.0
193,Bubba Watson,2010,Driving Distance,Driving Distance - (AVG.),309.8
385,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DISTANCE),47703.0
577,Bubba Watson,2010,Driving Distance,Driving Distance - (TOTAL DRIVES),154.0
323084,Bubba Watson,2011,Driving Distance,Driving Distance - (ROUNDS),85.0
323270,Bubba Watson,2011,Driving Distance,Driving Distance - (AVG.),314.9
323456,Bubba Watson,2011,Driving Distance,Driving Distance - (TOTAL DISTANCE),49747.0
323642,Bubba Watson,2011,Driving Distance,Driving Distance - (TOTAL DRIVES),158.0
628195,Bubba Watson,2012,Driving Distance,Driving Distance - (ROUNDS),68.0
628386,Bubba Watson,2012,Driving Distance,Driving Distance - (AVG.),315.5
