# World Series Baseball Predictor
See the bookmarked and forked git repos for some amazing sources.

The key is to make features, with a key feature being the Elo score. The Random Forest is a natural sampler. Also try Logistic Regression. Train it on each World Series bracket, then predict some. There are many more data points if you use the whole bracket rather than just World Series winner vs. not: higher-dimensional data, much more info.

Otherwise, one of the posts had the bright idea to Monte Carlo the outcomes and take the mode of the WS winner root node to see who's most likely to win. Or something like that. The Monte Carlo could be very powerful: simulate the models over and over again, calculating probabilities at each step of the way for each realization.

Need to see precisely how Elo works to determine if Monte Carlo is the best route: are win probabilities on an absolute scale, or do they always depend on the opponent? The latter: an Elo win probability depends only on the rating difference between the two teams. Two teams that would each beat an average opponent 95% of the time will each win only 50% of the time when they play each other.
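A minimal sketch of the standard Elo expected-score formula, the same expression used for `odds` and `P_1` in the code cells further down. The function name `elo_win_prob` and the example ratings are made up for illustration.

```python
def elo_win_prob(elo_a, elo_b):
    """Probability that team A beats team B under the standard Elo model."""
    # Only the rating difference matters, not the absolute ratings.
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

# Two equally rated teams are a coin flip, however strong they both are.
print(elo_win_prob(1700, 1700))   # 0.5
# A 100-point edge is worth roughly a 64% chance per game.
print(elo_win_prob(1600, 1500))
```

Because the two probabilities for a matchup always sum to 1, a single call per game is enough for the simulation.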

### Elo
Probably best to initialize all teams to the same value; 162 games should be plenty to distinguish the teams. You should make a **separate** script for calculating the Elo scores for all teams for all years (since we have the data, the ratings are fixed), and then all you have to do is load the Elo scores.

### Retrosheet  
Data for elo = http://www.retrosheet.org/gamelogs/index.html  
Fields for data = http://www.retrosheet.org/gamelogs/glfields.txt  

## To do
Fit the right K_elo factors to real data? Do some machine learning for those. 
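One possible way to approach this, sketched under assumptions: score each candidate K by the log loss of the pre-game Elo predictions and keep the best one. The names `expected_score`, `elo_log_loss`, `toy_games` and `candidates` are hypothetical, the game list is synthetic, and unlike `calculate_elo_rank` below this sketch does not round ratings or use a K that changes through the season.

```python
import math

def expected_score(r_a, r_b):
    """Standard Elo win probability for the team rated r_a."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_log_loss(games, k, start=1500.0):
    """Mean log loss of pre-game Elo predictions over (winner, loser) pairs."""
    ratings = {}
    total = 0.0
    for winner, loser in games:
        r_w = ratings.get(winner, start)
        r_l = ratings.get(loser, start)
        p_w = expected_score(r_w, r_l)   # prediction made before the result
        total -= math.log(p_w)
        delta = k * (1.0 - p_w)          # zero-sum rating update
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return total / len(games)

# Synthetic data (not real results): "A" beats "B" three times out of four.
toy_games = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "B")] * 10
candidates = [1, 5, 10, 20, 40]
best_k = min(candidates, key=lambda k: elo_log_loss(toy_games, k))
print(best_k)
```

With real Retrosheet game logs, the K chosen on one set of seasons should be checked on held-out seasons, otherwise it is simply tuned to the data it was fit on.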

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import scipy.stats as ss
%matplotlib inline

In [2]:
def calculate_elo_rank(winner_rank, loser_rank, k, penalize_loser=True):
    rank_diff = winner_rank - loser_rank
    exp = (rank_diff * -1)/400.
    odds = 1./(1 + pow(10, exp))
    new_winner_rank = round(winner_rank + (k * (1 - odds)))
    if penalize_loser:
        new_rank_diff = new_winner_rank - winner_rank
        new_loser_rank = loser_rank - new_rank_diff
    else:
        new_loser_rank = loser_rank
    if new_loser_rank < 1:
        new_loser_rank = 1
    return (new_winner_rank, new_loser_rank)

#from retrosheet
def get_gamelog(year):
    cols = ["Date","Team Away","Away Gm. No.","Team Home","Home Gm. No.","Team Away Score","Team Home Score"]
    GL = pd.read_csv("retrosheet_GL/GL"+str(year)+".TXT",usecols=[0,3,5,6,8,9,10],names=cols)
    awaywin = GL["Team Away Score"] > GL["Team Home Score"]
    homewin = ~awaywin  #logical NOT; unary minus on a boolean Series is an error in recent pandas/numpy
    GL.loc[awaywin,"Winning Team"] = GL["Team Away"]
    GL.loc[awaywin,"Losing Team"] = GL["Team Home"]
    GL.loc[homewin,"Winning Team"] = GL["Team Home"]
    GL.loc[homewin,"Losing Team"] = GL["Team Away"]
    return GL

#Super fast with numpy arrays but dreadfully slow with Pandas DataFrames! 
def calc_season_elo(year,K_i,K_f):
    GL = get_gamelog(year)
    N = len(GL["Team Home"].unique())
    Team, Elo, wins = GL["Team Home"].unique().tolist(), 1500*np.ones(N), np.zeros(N)
    K_step = (K_f - K_i)/162.  #impact factor - games at end of season mean more for momentum and such
    
    for row in GL.itertuples():
        index_w = Team.index(row[8])
        index_l = Team.index(row[9])
        Elo[index_w], Elo[index_l] = calculate_elo_rank(Elo[index_w], Elo[index_l], K_i+row[3]*K_step)
        wins[index_w] += 1
    Data = zip(Team,Elo,wins)
    np.savetxt("retrosheet_GL/ELO"+str(year)+".TXT",Data,delimiter=",",fmt="%s")
    return Data

def get_season_elo(year,K_elo_i,K_elo_f):
    try:
        Data = np.genfromtxt("retrosheet_GL/ELO"+str(year)+".TXT",delimiter=",",dtype=None)
        return Data
    except IOError:  #no cached Elo file for this year yet
        print "couldn't find Elo data for %d, calculating now."%year
        return calc_season_elo(year,K_elo_i,K_elo_f)

## Feature 1 - Monte Carlo the playoffs!
Now let's create the bracket and Monte Carlo the possible winners. There are options to keep updating Elo ratings during the playoffs and to use a linear gradient in the K factor, which gives more weight to later games in the season. The "probability of winning the WS" for each team is the number of simulations they win divided by the number of simulations.
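As a sanity check on the simulation: `get_winner` below keeps the per-game probability `P_1` fixed for the whole series, so the chance of winning a single series also has a closed form (a negative binomial sum). The sketch below assumes independent games; `n_choose_k` and `series_win_prob` are hypothetical helper names, not part of the notebook.

```python
from math import factorial

def n_choose_k(n, k):
    """Binomial coefficient, written with factorials for portability."""
    return factorial(n) // (factorial(k) * factorial(n - k))

def series_win_prob(p, w_req):
    """Probability of reaching w_req wins first, given per-game win probability p."""
    q = 1.0 - p
    # The winner takes the final game; the loser has j wins spread over the games before it.
    return sum(n_choose_k(w_req - 1 + j, j) * p ** w_req * q ** j
               for j in range(w_req))

print(series_win_prob(0.6, 1))   # single game: 0.6
print(series_win_prob(0.6, 4))   # best of 7: about 0.71
```

A longer series amplifies the favorite's edge, which is why the single-game wild card round adds the most randomness to the bracket.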

## To do - 
* need to set up the 1995-1997 brackets correctly. Gawd damnit.

In [3]:
#current
def get_winner(team1, elo_1, team2, elo_2, series, K):
    P_1 = 1./(1 + pow(10,(elo_2 - elo_1) / 400.))
    if series in ("ALWC", "NLWC"):                          #wild card game: single elimination
        w_req = 1
    elif series in ("ALDS1", "ALDS2", "NLDS1", "NLDS2"):    #division series: best of 5
        w_req = 3
    else:                                                   #LCS and WS: modeled as best of 7
        w_req = 4
    w_1 = 0
    w_2 = 0
    while w_1 < w_req and w_2 < w_req:
        ran = np.random.uniform()
        if ran <= P_1:
            w_1 += 1
        else:
            w_2 += 1
    if w_1 == w_req:
        elo_1, elo_2 = calculate_elo_rank(elo_1, elo_2, K)
        return team1, elo_1
    else:
        elo_2, elo_1 = calculate_elo_rank(elo_2, elo_1, K)
        return team2, elo_2

def get_bracket(data,year,K_elo):  #1994 no WS, 1995-1997 is really weird seeding. Gonna be super annoying...
    Team, Elo, wins = zip(*data)
    teams = pd.read_csv("csv/team.csv")
    teams = teams.loc[(teams["year"]==year)]
    teams = teams.loc[(teams["div_win"]=="Y")|(teams["wc_win"]=="Y")]  #playoff teams only
    
    AL = teams.loc[(teams["league_id"]=="AL")&(teams["rank"]==1)].sort_values("w")
    team_name = AL.loc[:,"team_id_retro"].values.tolist()
    team_div = AL.loc[:,"div_id"].values.tolist()
    team_elo = []
    for t in team_name:
        t_id = Team.index(t)
        team_elo.append(Elo[t_id])
    if year >= 1994:
        WCAL = teams.loc[(teams["rank"]>1)&(teams["league_id"]=="AL")].sort_values("w",ascending=False)
        if year >= 2012:
            WCALteams = WCAL.loc[:,"team_id_retro"].values[0:2]
            t1 = Team.index(WCALteams[0])
            t2 = Team.index(WCALteams[1])
            team_name.insert(0,Team[t1])
            team_elo.insert(0,Elo[t1])
            team_name.insert(0,Team[t2])
            team_elo.insert(0,Elo[t2])
        else:
            if WCAL.loc[:,"div_id"].values[0] == team_div[-1]: #stupid 1998-2011 WC rules
                team_name[-2], team_name[-1] = team_name[-1], team_name[-2]
                team_elo[-2], team_elo[-1] = team_elo[-1], team_elo[-2]
            t1 = Team.index(WCAL.loc[:,"team_id_retro"].values[0])
            team_name.append(Team[t1])
            team_elo.append(Elo[t1])

    NL = teams.loc[(teams["league_id"]=="NL")&(teams["rank"]==1)].sort_values("w")
    dummy = NL.loc[:,"team_id_retro"].values.tolist()
    team_div = NL.loc[:,"div_id"].values.tolist()
    for t in dummy:
        t_id = Team.index(t)
        team_name.append(t)
        team_elo.append(Elo[t_id])
    if year >= 1994:
        WCNL = teams.loc[(teams["rank"]>1)&(teams["league_id"]=="NL")].sort_values("w",ascending=False)
        if year >= 2012:
            WCNLteams = WCNL.loc[:,"team_id_retro"].values[0:2]
            t1 = Team.index(WCNLteams[0])
            t2 = Team.index(WCNLteams[1])
            team_name.insert(2,Team[t1])
            team_elo.insert(2,Elo[t1])
            team_name.insert(2,Team[t2])
            team_elo.insert(2,Elo[t2])
        else:
            if WCNL.loc[:,"div_id"].values[0] == team_div[-1]: #stupid 1998-2011 WC rules
                team_name[-2], team_name[-1] = team_name[-1], team_name[-2]
                team_elo[-2], team_elo[-1] = team_elo[-1], team_elo[-2]
            t1 = Team.index(WCNL.loc[:,"team_id_retro"].values[0])
            team_name.append(Team[t1])
            team_elo.append(Elo[t1])
    return team_name, team_elo

def simulate_playoffs(team_name, team_elo, K_elo, year):
    series = ["ALCS","NLCS","WS"]
    if year > 1994:
        series = ["ALDS2","ALDS1","NLDS2","NLDS1"] + series
    if year >= 2012:
        series = ["ALWC","NLWC"] + series 
    for i,s in enumerate(series):
        winner_team, winner_elo = get_winner(team_name[2*i],team_elo[2*i],team_name[2*i+1],team_elo[2*i+1],s,K_elo)
        if s == "ALWC":
            team_name.insert(7,winner_team)
            team_elo.insert(7,winner_elo)           
        else:
            team_name.append(winner_team)
            team_elo.append(winner_elo)
    return team_name[-1]

def MC_playoffs(year, n_throws, K_elo_season_i, K_elo_season_f, K_elo_playoffs):
    ws_winners = []
    data = get_season_elo(year, K_elo_season_i, K_elo_season_f) 
    teams, elo = get_bracket(data, year, K_elo_playoffs)
    for i in xrange(0,n_throws):
        ws_winners.append(simulate_playoffs(teams[:], elo[:], K_elo_playoffs, year))
    return ws_winners

## Feature 2 - various W/L statistics about the regular season
Maybe a team makes the playoffs, but how are the wins distributed in their season:
* How did they do in the last X games of the season (half of the playoffs is about momentum).
* Maybe they are very streaky, e.g. 15 game winning streak followed by a 15 game losing streak.
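A compact way to get the signed streak lengths (positive for winning runs, negative for losing runs) is `itertools.groupby`. This is an alternative sketch, not the loop used in the cell below; `signed_streaks` is a hypothetical name and the W/L string is made up.

```python
from itertools import groupby

def signed_streaks(results):
    """Turn a W/L sequence into signed run lengths, e.g. 'WWL' -> [2, -1]."""
    # groupby yields one group per run of consecutive identical outcomes.
    return [len(list(run)) * (1 if outcome == "W" else -1)
            for outcome, run in groupby(results)]

print(signed_streaks("WWLWLLL"))   # [2, -1, 1, -3]
```

The standard deviation and skew of this list are the kind of streakiness features computed later as `wl_std` and `wl_skew`.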

In [4]:
def calc_win_stats(year,n_final_games):
    GL = get_gamelog(year)
    Team = GL["Team Home"].unique().tolist()
    EOS_Wins = np.zeros(len(Team))  #end of season wins
    game_no = 162 - n_final_games + 1  #first game number of the final stretch (153 for the last 10 games)
    
    WL_dict = {}
    for t in Team:
        WL_dict[t] = []
        
    #Get number of wins in last X games of season, prep W/L streaks
    for row in GL.itertuples():
        WL_dict[row[8]].append("W")
        WL_dict[row[9]].append("L")
        
        if (row[5] >= game_no) | (row[3] >= game_no):  #row[3]/row[5] = away/home team game number
            if (row[8] == row[2]) & (row[3] >=game_no):
                EOS_Wins[Team.index(row[8])] += 1
            elif (row[8] == row[4]) & (row[5] >=game_no):
                EOS_Wins[Team.index(row[8])] += 1
                
    #W/L streak arrays post-processing
    WL = ["W","L"]
    for t in Team:
        while len(WL_dict[t]) < 162:  #I guess there are a few missing values?
            WL_dict[t].insert(0,WL[np.random.randint(0,2)])
        
    #get W/L streaks
    Win_range = 15
    streak_dist = {}
    for team, WLrow in WL_dict.iteritems():
        dist = np.empty(0)     #Each is a bin from -win_range to +win_range, excl. 0 
        WL_prev = WLrow[0]
        streak = 0
        for val in WLrow:
            if val == WL_prev:
                streak += 1
            else:
                index = -1 if WL_prev == "L" else 1
                dist = np.append(dist, streak*index)
                WL_prev = val
                streak = 1
        index = -1 if WL_prev == "L" else 1
        dist = np.append(dist, streak*index)
        streak_dist[team] = dist
    
    fc = ["team","eos_wins"]
    return pd.DataFrame(zip(Team,EOS_Wins), columns=fc, index=Team), streak_dist

## Finally, some Machine Learning!
Now that we've spent quite some time creating the features we want to use, it's time to input these features and see how our algorithm performs!

### To do:
* The above code needs to be generalized to all years.
* 1994 - 1997 needs to have the bracket fixed (1994 no playoffs). For now, just exclude that data maybe.
* Also, you need to decide whether you're asking "Who is going to win the WS?", or "Who is likely to go deep in the playoffs?". I think the latter will yield better results. Basically have either separate columns (for whether or not the team made it to each round) or a single column with increasing number for each round they made it to. 
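The single-column option could look something like the sketch below, which maps the `Y`/`N` flags from the team table to an ordinal depth. `playoff_depth` is a hypothetical helper; it assumes only the `div_win`, `wc_win`, `lg_win` and `ws_win` flags are available, so it cannot tell a division series winner from a division series loser.

```python
def playoff_depth(div_win, wc_win, lg_win, ws_win):
    """Ordinal target: 0 = missed playoffs, 1 = made playoffs,
    2 = won the pennant, 3 = won the World Series."""
    if ws_win == "Y":
        return 3
    if lg_win == "Y":
        return 2
    if div_win == "Y" or wc_win == "Y":
        return 1
    return 0   # also covers missing flags (None / NaN)

print(playoff_depth("Y", "N", "Y", "N"))   # pennant winner that lost the WS: 2
```

An ordinal target like this can be fed to a regressor or split back into one binary column per round.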

### Step 1 - Calculate and prepare features

In [5]:
#parameters
K_elo_season_i = 15
K_elo_season_f = 25
K_elo_playoffs = 1
n_draws = 1000
year_cutoff = 1969

n_final_games = 10  #10 seems like a good value

In [6]:
#load data and pre-process
team = pd.read_csv("csv/team.csv")
team = team[(team["div_win"]=="Y")|(team["wc_win"]=="Y")] #only want playoff teams
team = team[team["year"]>=year_cutoff]                    #1969 = first year of divisional play (LCS introduced)
team = team[team["year"]!=1981]
del_columns = ["div_id","name","team_id_lahman45","franchise_id","team_id","team_id_br","ppf","bpf","park",
               "attendance","ghome","g","ab","double","triple","hr","bb","so","sb","cs","hbp","sf","ra","er","era",
              "cg","sho","sv","ipouts","ha","hra","bba","soa","e","dp","fp"]
team.drop(del_columns, axis=1, inplace=True)

#create feature arrays
team["ws_prob"] = 0
team["eos_wins"] = 0
team["wl_std"] = 0
team["wl_skew"] = 0

#skipped years
y_skip = [1981,1994,1995,1996,1997]

Get all the stats. This might take a minute.

In [7]:
#get stats
for y in range(year_cutoff, 2016):
    if y not in y_skip:
        winners = MC_playoffs(y, n_draws, K_elo_season_i, K_elo_season_f, K_elo_playoffs)
        for name, value in Counter(winners).iteritems():
            team.loc[(team["team_id_retro"]==name)&(team["year"]==y),"ws_prob"] = value/float(n_draws)

        final_wins, streaks = calc_win_stats(y,n_final_games)
        for row in final_wins.itertuples():
            index = (team["team_id_retro"]==row[1])&(team["year"]==y)
            team.loc[index,"eos_wins"] = row[2]
        for keys, values in streaks.iteritems():
            index = (team["team_id_retro"]==keys)&(team["year"]==y)
            team.loc[index,"wl_std"] = np.std(values)
            team.loc[index,"wl_skew"] = ss.skew(values)

In [8]:
#multi-label classification: encode the Y/N outcome columns as 1/0 (missing = 0)
for col in ["ws_win", "wc_win", "lg_win", "div_win"]:
    team.loc[team[col] == "Y", col] = 1
    team.loc[team[col] == "N", col] = 0
    team.loc[pd.isnull(team[col]), col] = 0

### Step 2 - perform RF

In [9]:
Xcolumns = ["ws_prob","eos_wins","wl_std","wl_skew"]
ycolumns = ["lg_win","ws_win"]
y = team[ycolumns].astype(int).values
X = team[Xcolumns]

In [10]:
from sklearn.cross_validation import train_test_split,cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.grid_search import GridSearchCV
cv_s = StratifiedShuffleSplit(y_train,  n_iter=10 , test_size=0.1, random_state=42)
rfc = RandomForestClassifier(max_features= 'auto' ,n_estimators=50, oob_score=1, class_weight='balanced') 
param_grid = { 
        'n_estimators': [600],
        'max_features': ['sqrt'],
        'min_samples_leaf': [20]}
CV_rfc = GridSearchCV(n_jobs=-1, estimator=rfc, scoring="roc_auc", param_grid=param_grid, cv=cv_s)
CV_rfc.fit(X_train, y_train);
print("Best Parameters from gridsearch: {%s} with a score of %0.4f" % (CV_rfc.best_params_, CV_rfc.best_score_))

Best Parameters from gridsearch: {{'max_features': 'sqrt', 'n_estimators': 600, 'min_samples_leaf': 20}} with a score of 0.6988


In [12]:
model = CV_rfc.best_estimator_

In [13]:
from sklearn import metrics

y_pred = model.predict_proba(X_test) #list with one (n_samples, 2) array per label; column 1 is P(label = 1)

test_score = metrics.roc_auc_score(y_test[:,0], y_pred[0][:,1], average="weighted") #area under curve from prediction scores
print("AUC score is {0}".format(test_score))
print("OOB score is {0}".format(model.oob_score_)) #you may not need test/train split with OOB score!

AUC score is 0.540293040293
OOB score is 0.389705882353


In [14]:
print("Feature\t\tImportance\n")
for i in reversed(np.argsort(model.feature_importances_)):
    print("%s\t\t%f" % (X.columns[i], model.feature_importances_[i]))

Feature		Importance

wl_skew		0.374171
ws_prob		0.338568
wl_std		0.225095
eos_wins		0.062166


### Step 3 - KNN

In [22]:
#First scale data
from sklearn.preprocessing import StandardScaler
Xs = StandardScaler().fit_transform(X)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs, y, test_size=0.25, random_state=43)

In [26]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1, weights="distance")
cv_s = StratifiedShuffleSplit(ys_train,  n_iter=10 , test_size=0.1, random_state=42)
param_grid = { 
        'n_neighbors': [5,10,15,20,25],
        'weights':['distance','uniform']
}
CV_knn = GridSearchCV(n_jobs=-1, estimator=knn, scoring="roc_auc", param_grid=param_grid, cv=cv_s)
CV_knn.fit(Xs_train, ys_train);
print("Best Parameters from gridsearch: {%s} with a score of %0.4f" % (CV_knn.best_params_, CV_knn.best_score_))

Best Parameters from gridsearch: {{'n_neighbors': 15, 'weights': 'distance'}} with a score of 0.9838


# Notes

### A note about y_pred
So for multilabel classification (like in this case, where I fit for lg_win and ws_win), the predict_proba output is arranged like this:  
http://stackoverflow.com/questions/17017882/scikit-learn-predict-proba-gives-wrong-answers

The `model.classes_` attribute returns:  
[array([0, 1]), array([0, 1])]  
which tells me how the data is arranged (i.e. two output labels, each with binary output 0/1). y_pred is a list with one array per label, so its overall layout is:  
(n_labels, n_samples, n_classes_per_label)
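A quick way to confirm the layout is to fit a tiny forest on synthetic two-label data and inspect what `predict_proba` returns. Everything here (`X_demo`, `y_demo`, `clf`) is made up for the check and unrelated to the baseball features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_demo = rng.rand(40, 4)                               # 40 samples, 4 synthetic features
y_demo = np.column_stack([np.arange(40) % 2,           # label 0, both classes present
                          (np.arange(40) // 2) % 2])   # label 1, both classes present

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo[:5])

shapes = [p.shape for p in proba]      # one (n_samples, n_classes) array per label
p_label0_pos = proba[0][:, 1]          # P(label 0 == 1) for each of the 5 samples
print(len(proba), shapes)
```

This is why the AUC cell indexes `y_pred[0][:,1]`: first pick the label, then the column for class 1.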

### OOB score
It is a method similar to cross-validation, the advantage being that it doesn't require a train/test split of the data. After decision tree X has been trained, a classification error is calculated using the "out of bag" samples, i.e. samples from the original dataset that were left out of the bootstrap sample used to train tree X. A lower OOB *error* is better; note that scikit-learn's `oob_score_` printed above is an accuracy, so for it higher is better.  
Links:  
* http://stackoverflow.com/questions/18541923/what-is-out-of-bag-error-in-random-forests
* Breiman (1996)
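To see why there are enough out-of-bag samples to make this work: a bootstrap sample of size n drawn with replacement misses any given row with probability (1 - 1/n)^n, which tends to 1/e, or about 37%. A small illustration, with `oob_fraction` as a hypothetical helper name:

```python
import math

def oob_fraction(n):
    """Probability that a given row is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n

for n in (10, 100, 1000):
    print(n, oob_fraction(n))
print("limit 1/e =", math.exp(-1))
```

So each tree is evaluated on roughly a third of the data it never saw, which is what makes the OOB estimate behave like a built-in validation set.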