# Evaluating different projections of MLB pitching statistics for 2018 season
##### The purpose of this project was to evaluate how well different models predicted a player's statistics for the 2018 MLB season. There are 6 different models evaluated in this project- ATC, Depth Charts, Fangraphs, Steamer, The Bat, ZiPS, and an average of those 5 projections. Raw data of each projection was downloaded via fangraphs.com and compressed into one standardized spreadsheet. Exact number of projections may vary from player to player as some projections excluded certain players from their projections.
##### This project is not fully complete as the 2018 MLB season has not yet completed. Currently, this workbook does some data cleansing, then data manipulation to calculate average projections before adding the average projection for each player to the table of projections. Upon completion of the 2018 MLB season, this workbook will be updated with each player's actual statistics from the season, along with some work and visualization to see how well each projection model did at predicting players' stats.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 25)

In [4]:
projections = pd.read_csv("2018-MLB-Projections-Pitchers.csv")

In [6]:
projections.head(10)

Unnamed: 0,Name,Team,Type,W,L,SV,HLD,ERA,GS,G,IP,H,ER,HR,SO,BB,WHIP,K/9,BB/9,FIP,WAR,ADP,playerid
0,Williams Jerez,Angels,ATC,0,0,0.0,,4.64,0,5,6.0,6,3,1,5,3,1.52,7.21,4.52,4.53,,999.0,14338
1,Jason Gurka,Angels,ATC,1,0,0.0,,4.33,0,18,11.0,15,5,1,10,5,1.77,8.02,3.89,3.47,,999.0,8007
2,Ty Buttrey,Angels,ATC,1,1,0.0,,4.92,0,11,12.0,12,6,2,9,6,1.57,7.23,4.96,4.96,,999.0,14719
3,John Lamb,Angels,ATC,1,1,0.0,,5.72,1,3,14.0,17,9,2,10,5,1.58,6.2,3.29,5.17,,999.0,8493
4,Akeel Morris,Angels,ATC,0,1,0.0,,4.25,0,16,16.0,15,8,3,18,9,1.47,10.01,5.06,5.22,,999.0,12185
5,Dayan Diaz,Angels,ATC,1,1,0.0,,4.72,0,14,16.0,16,9,3,15,7,1.42,8.52,3.75,4.71,,999.0,11543
6,Miguel Almonte,Angels,ATC,0,0,0.0,,4.84,1,11,18.0,19,10,2,16,8,1.52,8.22,4.15,4.31,,999.0,14139
7,Deck McGuire,Angels,ATC,1,1,0.0,,4.65,1,8,20.0,20,10,2,16,8,1.41,7.29,3.54,4.51,,999.0,11596
8,Taylor Cole,Angels,ATC,0,1,0.0,,7.36,1,18,21.0,30,17,1,16,8,1.83,6.88,3.27,3.84,,999.0,11964
9,Noe Ramirez,Angels,ATC,1,1,0.0,,3.95,0,25,27.0,23,12,5,28,11,1.27,9.37,3.67,4.72,,999.0,12800


In [7]:
projections.columns

Index(['Name', 'Team', 'Type', 'W', 'L', 'SV', 'HLD', 'ERA', 'GS', 'G', 'IP',
       'H', 'ER', 'HR', 'SO', 'BB', 'WHIP', 'K/9', 'BB/9', 'FIP', 'WAR', 'ADP',
       'playerid'],
      dtype='object')

### Find columns with NaNs and remove

In [10]:
nas = projections.columns[projections.isna().any()].tolist()
nas

['SV', 'HLD', 'WAR']

In [11]:
projections = projections.drop(nas, axis=1)

In [12]:
projections.columns[projections.isna().any()].tolist()

[]

In [13]:
projections.head()

Unnamed: 0,Name,Team,Type,W,L,ERA,GS,G,IP,H,ER,HR,SO,BB,WHIP,K/9,BB/9,FIP,ADP,playerid
0,Williams Jerez,Angels,ATC,0,0,4.64,0,5,6.0,6,3,1,5,3,1.52,7.21,4.52,4.53,999.0,14338
1,Jason Gurka,Angels,ATC,1,0,4.33,0,18,11.0,15,5,1,10,5,1.77,8.02,3.89,3.47,999.0,8007
2,Ty Buttrey,Angels,ATC,1,1,4.92,0,11,12.0,12,6,2,9,6,1.57,7.23,4.96,4.96,999.0,14719
3,John Lamb,Angels,ATC,1,1,5.72,1,3,14.0,17,9,2,10,5,1.58,6.2,3.29,5.17,999.0,8493
4,Akeel Morris,Angels,ATC,0,1,4.25,0,16,16.0,15,8,3,18,9,1.47,10.01,5.06,5.22,999.0,12185


In [16]:
projections.columns

Index(['Name', 'Team', 'Type', 'W', 'L', 'ERA', 'GS', 'G', 'IP', 'H', 'ER',
       'HR', 'SO', 'BB', 'WHIP', 'K/9', 'BB/9', 'FIP', 'ADP', 'playerid'],
      dtype='object')

### Create list of all player names that have projections, plus dictionary storing the number of different projections for each name
##### To make the average projections more worthwhile, I will only use players that had at least 3 projections

In [17]:
names = []
nameCounter = {}
for index, row in projections.iterrows():
    name = row["Name"]
    if name not in names:
        names.append(name)
        nameCounter[name] = 1
    else:
        nameCounter[name] += 1
print(len(names))

996


In [19]:
nameCounter

{'Williams Jerez': 4,
 'Jason Gurka': 3,
 'Ty Buttrey': 5,
 'John Lamb': 3,
 'Akeel Morris': 5,
 'Dayan Diaz': 5,
 'Miguel Almonte': 5,
 'Deck McGuire': 3,
 'Taylor Cole': 3,
 'Noe Ramirez': 5,
 'Eduardo Paredes': 4,
 'Felix Pena': 5,
 'Luke Farrell': 5,
 'Jose Alvarez': 5,
 'Odrisamer Despaigne': 5,
 'Junichi Tazawa': 5,
 'Cam Bedrosian': 5,
 'Hansel Robles': 5,
 'Keynan Middleton': 4,
 'Blake Wood': 5,
 'Jim Johnson': 5,
 'Blake Parker': 5,
 'Nick Tropeano': 5,
 'Andrew Heaney': 5,
 'Parker Bridwell': 5,
 'JC Ramirez': 5,
 'Tyler Skaggs': 5,
 'Matt Shoemaker': 5,
 'Garrett Richards': 6,
 'Shohei Ohtani': 6,
 'Jandel Gustave': 4,
 'Brady Rodgers': 5,
 'Reymin Guduan': 5,
 'Tony Sipp': 5,
 'Francis Martes': 5,
 'Joe Smith': 5,
 'Will Harris': 5,
 'Ryan Pressly': 5,
 'Hector Rondon': 5,
 'Roberto Osuna': 6,
 'Chris Devenski': 6,
 'Brad Peacock': 6,
 'Collin McHugh': 5,
 'Lance McCullers Jr.': 6,
 'Charlie Morton': 6,
 'Dallas Keuchel': 6,
 'Gerrit Cole': 6,
 'Justin Verlander': 6,
 'Dea

### Select names with 3+ projections

In [22]:
# create a list to store the names of players without 3+ projections. Add those players to that list, and remove those players from the original list of names
notEnoughEntries = []
for name in nameCounter:
    if nameCounter[name] < 3:
        notEnoughEntries.append(name)
        names.remove(name)
notEnoughEntries

['Jose Valdez',
 'Jon Niese',
 'Brett Oberholtzer',
 'Pedro Araujo',
 'Jose Ruiz',
 'Osmer Morales',
 'Greg Mahle',
 'Justin Anderson',
 'Alex Meyer',
 'Framber Valdez',
 'Joshua James',
 'C.J. Riefenhauser',
 'Edwar Cabrera',
 'Darin Downs',
 'Jorge De Leon',
 'Edgar Gonzalez',
 'J.B. Wendelken',
 'Aaron Brooks',
 'Jeremy Bleich',
 'Sean Reid-Foley',
 'Justin Shafer',
 'Jose Fernandez',
 'Mike Hauschild',
 'Matthew Tracy',
 'Brandon Cumpton',
 'Murphy Smith',
 'Cesar Valdez',
 'Zach Stewart',
 'Preston Guilmet',
 'Jason Berken',
 'Arnold Leon',
 'Franklin Morales',
 'Gavin Floyd',
 'Matt Buschmann',
 'Ricky Romero',
 'Rafael Soriano',
 'Joel Pineiro',
 'Brad Penny',
 'Bryse Wilson',
 'Kolby Allard',
 'Michael Soroka',
 'Chad Sobotka',
 'Touki Toussaint',
 'Wes Parsons',
 'Jed Bradley',
 'Grant Dayton',
 'Alex White',
 'Michael Kirkman',
 'Nick Masset',
 'Miguel Socolovich',
 'Kameron Loe',
 'Greg Smith',
 'Jordan Walden',
 'Freddy Garcia',
 'Kanekoa Texeira',
 'Corbin Burnes',
 'Andre

### Remove projections of players that have less than 3 different projections

In [23]:
for name in notEnoughEntries:
    projections = projections[projections.Name != name]

In [24]:
projections

Unnamed: 0,Name,Team,Type,W,L,ERA,GS,G,IP,H,ER,HR,SO,BB,WHIP,K/9,BB/9,FIP,ADP,playerid
0,Williams Jerez,Angels,ATC,0,0,4.64,0,5,6.0,6,3,1,5,3,1.52,7.21,4.52,4.53,999.0,14338
1,Jason Gurka,Angels,ATC,1,0,4.33,0,18,11.0,15,5,1,10,5,1.77,8.02,3.89,3.47,999.0,8007
2,Ty Buttrey,Angels,ATC,1,1,4.92,0,11,12.0,12,6,2,9,6,1.57,7.23,4.96,4.96,999.0,14719
3,John Lamb,Angels,ATC,1,1,5.72,1,3,14.0,17,9,2,10,5,1.58,6.20,3.29,5.17,999.0,8493
4,Akeel Morris,Angels,ATC,0,1,4.25,0,16,16.0,15,8,3,18,9,1.47,10.01,5.06,5.22,999.0,12185
5,Dayan Diaz,Angels,ATC,1,1,4.72,0,14,16.0,16,9,3,15,7,1.42,8.52,3.75,4.71,999.0,11543
6,Miguel Almonte,Angels,ATC,0,0,4.84,1,11,18.0,19,10,2,16,8,1.52,8.22,4.15,4.31,999.0,14139
7,Deck McGuire,Angels,ATC,1,1,4.65,1,8,20.0,20,10,2,16,8,1.41,7.29,3.54,4.51,999.0,11596
8,Taylor Cole,Angels,ATC,0,1,7.36,1,18,21.0,30,17,1,16,8,1.83,6.88,3.27,3.84,999.0,11964
9,Noe Ramirez,Angels,ATC,1,1,3.95,0,25,27.0,23,12,5,28,11,1.27,9.37,3.67,4.72,999.0,12800


### Create an average of all projections for each player and add to table

In [25]:
cols = projections.columns.values.tolist()

for name in names:
    currentPlayer = projections.loc[projections['Name'] == name]
    # set name, team, projection type
    playerName = currentPlayer["Name"].values.tolist()[0]
    team = currentPlayer["Team"].values.tolist()[0]
    projectionType = "A-Average" # the name 'A-Average' is solely for sorting purposes- want the average projections to appear first later on
    # calculate average wins
    w = currentPlayer["W"].values.tolist()
    total = 0
    for i in w:
        total = total + i
    average = total / len(w)
    w = int(round(average))
    # calculate average losses
    l = currentPlayer["L"].values.tolist()
    total = 0
    for i in l:
        total = total + i
    average = total / len(l)
    l = int(round(average))
    # calculate average earned run average
    era = currentPlayer["ERA"].values.tolist()
    total = 0
    for i in era:
        total = total + i
    average = total / len(era)
    era = int(round(average, 2))
    # calculate average games started
    gs = currentPlayer["GS"].values.tolist()
    total = 0
    for i in gs:
        total = total + i
    average = total / len(gs)
    gs = int(round(average))
    # calculate average games
    g = currentPlayer["G"].values.tolist()
    total = 0
    for i in g:
        total = total + i
    average = total / len(g)
    g = int(round(average))
    # calculate average innings pitched
    ip = currentPlayer["IP"].values.tolist()
    total = 0
    for i in ip:
        total = total + i
    average = total / len(ip)
    ip = int(round(average))
    # calculate average hits
    h = currentPlayer["H"].values.tolist()
    total = 0
    for i in h:
        total = total + i
    average = total / len(h)
    h = int(round(average))
    # calculate average earned runs
    er = currentPlayer["ER"].values.tolist()
    total = 0
    for i in er:
        total = total + i
    average = total / len(er)
    er = int(round(average))
    # calculate average home runs
    hr = currentPlayer["HR"].values.tolist()
    total = 0
    for i in hr:
        total = total + i
    average = total / len(hr)
    hr = int(round(average))
    # calculate average strikeouts
    so = currentPlayer["SO"].values.tolist()
    total = 0
    for i in so:
        total = total + i
    average = total / len(so)
    so = int(round(average))
    # calculate average walks
    bb = currentPlayer["BB"].values.tolist()
    total = 0
    for i in bb:
        total = total + i
    average = total / len(bb)
    bb = int(round(average))
    # calculate average walks and hits per inning
    whip = currentPlayer["WHIP"].values.tolist()
    total = 0
    for i in whip:
        total = total + i
    average = total / len(whip)
    whip = int(round(average))
    # calculate average strikeouts per 9 innings
    k9 = currentPlayer["K/9"].values.tolist()
    total = 0
    for i in k9:
        total = total + i
    average = total / len(k9)
    k9 = int(round(average))
    # calculate average walks per 9 innings
    bb9 = currentPlayer["BB/9"].values.tolist()
    total = 0
    for i in bb9:
        total = total + i
    average = total / len(bb9)
    bb9 = int(round(average))
    # calculate average fielding independent hitting
    fip = currentPlayer["FIP"].values.tolist()
    total = 0
    for i in fip:
        total = total + i
    average = total / len(fip)
    fip = int(round(average))
    # calculate average hits
    h = currentPlayer["H"].values.tolist()
    total = 0
    for i in h:
        total = total + i
    average = total / len(h)
    h = int(round(average))
    #set average draft position, player ID
    adp = currentPlayer["ADP"].values.tolist()[0]
    playerID = currentPlayer["playerid"].values.tolist()[0]
    
    #create dataframe of new entry with average projections
    newEntry = [playerName, team, projectionType, w, l, era, gs, g, ip, h, er, hr, so, bb, whip, k9, bb9, fip, adp, playerID]
    newEntry = pd.DataFrame([newEntry], columns=cols)
    
    #append new entry to projections
    projections = projections.append(newEntry)

In [26]:
 projections = projections.sort_values(by=["Name", "Team", "Type"])

In [27]:
projections

Unnamed: 0,Name,Team,Type,W,L,ERA,GS,G,IP,H,ER,HR,SO,BB,WHIP,K/9,BB/9,FIP,ADP,playerid
0,A.J. Cole,Yankees,A-Average,5,5,4.00,15,16,86.0,92,47,15,71,34,1.00,7.00,4.00,5.00,527.4,11467
673,A.J. Cole,Yankees,ATC,6,7,4.78,19,19,100.0,106,53,16,88,41,1.47,7.86,3.72,4.91,527.4,11467
1829,A.J. Cole,Yankees,Depth Charts,4,4,4.88,11,11,65.0,68,35,11,53,25,1.44,7.42,3.47,4.97,527.4,11467
2946,A.J. Cole,Yankees,Steamer,4,4,4.79,11,11,63.0,65,34,10,54,25,1.42,7.60,3.61,4.85,527.4,11467
1197,A.J. Cole,Yankees,The Bat,3,4,5.43,11,11,65.0,72,39,12,51,26,1.51,7.08,3.63,5.30,527.4,11467
3806,A.J. Cole,Yankees,ZiPS,8,8,4.97,25,27,137.7,149,76,24,111,51,1.45,7.26,3.33,5.08,527.4,11467
0,A.J. Minter,Braves,A-Average,3,2,3.00,0,53,52.0,42,19,5,69,21,1.00,12.00,4.00,3.00,365.7,18655
110,A.J. Minter,Braves,ATC,2,3,3.36,0,61,58.0,45,22,5,82,22,1.15,12.79,3.42,2.62,365.7,18655
1300,A.J. Minter,Braves,Depth Charts,3,2,3.30,0,55,55.0,44,20,6,73,22,1.20,11.94,3.67,3.20,365.7,18655
2109,A.J. Minter,Braves,Steamer,3,2,3.29,0,55,55.0,44,20,6,71,23,1.22,11.56,3.79,3.35,365.7,18655


### Once the 2018 MLB season has completed, this workbook will be updated to include actual player stats from the season, along with additional computations and visualizations of how accurate each projection model was

##### Notes to self-
##### Compare each projection to the player's actual stat line, see which projections tended to be most accurate
##### Consider reducing the scope of players to those that were going to play a full season- Results could be skewed from someone that was reasonably expected to play a full season but ended up not having much time in the MLB. Limit comparisons to players that pitched at least 50(?) innings this season and reasonably were expected to play that many?
##### Average how far each projection model was off for all players- i.e. if Scherzer's ZiPS projected 250 strikeouts and he actually had 300, and Sale's ZiPS projected 275 strikeouts and he actually had 250, overall ZiPS projection was off by 75 strikeouts- Think more if this would be worthwhile to look at league-wide, or for players with 50+ IP
##### Consider repeating this project for 2017, 2016, 2015, etc data to get an idea of historical accuracy for each predictive model