# Bradley Terry example notebook
What you will find in this notebook examples of using skpref:
- for setting up the modelling task based framework
- to fit a classifier that's being read in from scikit-learn on the same problem which in the background uses reduction and aggregation methods.
- to fit a Bradley-Terry model with and without covariates on the pairwise comparison data of basketball matches.
- with GridSearch

In [1]:
# Optionally change the theme of the notebook to dark
# from jupyterthemes.stylefx import set_nb_theme
# set_nb_theme('chesterish')

In [2]:
# Import skpref modules
import sys
sys.path.insert(0, "../..")
from skpref.random_utility import BradleyTerry
from skpref.task import PairwiseComparisonTask
from skpref.base import ClassificationReducer
from skpref.model_selection import GridSearchCV

# Import scikit-learn packages to be used in tandem with skpref architecture
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Import other useful packages
import pandas as pd
import numpy as np

# Reading in the data
The example data set will be matches played by NBA teams, we will use the 2016 season's matches to predict the results of the 2017 matches. The dataset contains:
- a column for team1 and team2 indicating the two teams that have played each other
- season_start, which indicates which season the match belongs to
- team1_wins takes the value of 1 if the team in column team1 win the match, 0 if they lost (there are no ties in basketball)
- team_1_home takes the value of 1 if team1 was playing in their home court 0 if they were paying away (no neutral courts in the NBA)

In [3]:
NBA_results = pd.read_csv('data/NBA_matches.csv')
NBA_results.head()

Unnamed: 0,team1,team2,season_start,team1_wins,team_1_home
0,Atlanta Hawks,Toronto Raptors,2014,0,0
1,Atlanta Hawks,Indiana Pacers,2014,1,1
2,Atlanta Hawks,San Antonio Spurs,2014,0,0
3,Atlanta Hawks,Charlotte Hornets,2014,0,0
4,Atlanta Hawks,New York Knicks,2014,1,1


In [4]:
NBA_results.tail()

Unnamed: 0,team1,team2,season_start,team1_wins,team_1_home
9835,Washington Wizards,Houston Rockets,2017,0,0
9836,Washington Wizards,Cleveland Cavaliers,2017,0,0
9837,Washington Wizards,Atlanta Hawks,2017,0,1
9838,Washington Wizards,Boston Celtics,2017,1,1
9839,Washington Wizards,Orlando Magic,2017,0,0


In [5]:
season_split = 2016
train_data = NBA_results[NBA_results.season_start == season_split].copy()
test_data = NBA_results[NBA_results.season_start == season_split+1].copy()

We will also use team salary data as covariates in the model later with the idea being that a team that has more money to pay to their athletes has an advantage over other teams, by having a better chance to attract the top talent in the league.

In [6]:
NBA_team_salary_budget = pd.read_csv('data/team_salary_budgets.csv')
NBA_team_salary_budget.head()

Unnamed: 0,team,season_start,salary
0,Atlanta Hawks,2014,58337671
1,Atlanta Hawks,2015,71378126
2,Atlanta Hawks,2016,95957250
3,Atlanta Hawks,2017,99375302
4,Boston Celtics,2014,59418142


# Setting up the tasks

We set up the preference learning task by using the PairwiseComparisonTassk object in skpref. This is the only extra step which might be a completely new concept to seasponed scikit-learn users. Once the task is specified, say in this case a pairwise comparison task, for any models applied in skpref, whether that is a reduction via scikit-learn or a model that is not a pairwise comparison model such as a Bradley-Terry model the package will know that the problem itself is a pairwise comparison problem and can perform reduction and aggregation adequately in the background when needed.

In this example the PairwiseComparisonTask has the following components:
- primary_table: the table that contains the observed preferences
- primary_table_alternatives_names: the column or columns that containn the alternatives, in this case both columns team1 and team2 contain alternatives
- primary_table_target_name: the column that indicates the result of the pairwise comparison
- target_column_correspondence: in the case of pairwise comparisons, when the alternatives are split across two columns, the column indicating the result usually takes the form 1/0 to show whether one of the columns, in our case team1 or team2 has been preferred. So in this column the user indicates that when the team1_wins column takes the value 1 that means that the alternative in the column team1 has won.
- features_to_use: indicates which columns to use as covariates

In [7]:
NBA_results_task_train_LR = PairwiseComparisonTask(
    primary_table=train_data,
    primary_table_alternatives_names=['team1', 'team2'],
    primary_table_target_name ='team1_wins',
    target_column_correspondence='team1',
    features_to_use=['team_1_home']
)

# For the test task, it's possible to make a copy of the training task and
# update the primary table
NBA_results_task_predict_LR = PairwiseComparisonTask(
    primary_table=test_data,
    primary_table_alternatives_names=['team1', 'team2'],
    primary_table_target_name ='team1_wins',
    target_column_correspondence='team1',
    features_to_use=['team_1_home']
)

# Let's fit a reduction to logistic regression first
The only covariate we will use in this for now will be the team_1_home column, which should return a method that only learns what the home team advantage was on average

$P(team1\_wins=1) = logistic(\alpha + \beta_1 team\_1\_home)$

In [8]:
my_log_red = ClassificationReducer(LogisticRegression(solver='lbfgs'))
my_log_red.fit_task(NBA_results_task_train_LR)
preds = my_log_red.predict_task(NBA_results_task_predict_LR)

In [9]:
# predict_task returns a SubsetPosetVector which has the attributes
# top_input_data and boot_input_data corresponding to chosen and not chosen 
# alternatives.
preds.top_input_data, preds.boot_input_data

(array(['Dallas Mavericks', 'Charlotte Hornets', 'Brooklyn Nets', ...,
        'Washington Wizards', 'Washington Wizards', 'Orlando Magic'],
       dtype=object),
 array(['Atlanta Hawks', 'Atlanta Hawks', 'Atlanta Hawks', ...,
        'Atlanta Hawks', 'Boston Celtics', 'Washington Wizards'],
       dtype=object))

In [10]:
NBA_results_task_predict_LR.primary_table.head()

Unnamed: 0,team1,team2,season_start,team1_wins,team_1_home
7380,Atlanta Hawks,Dallas Mavericks,2017,1,0
7381,Atlanta Hawks,Charlotte Hornets,2017,0,0
7382,Atlanta Hawks,Brooklyn Nets,2017,0,0
7383,Atlanta Hawks,Miami Heat,2017,0,0
7384,Atlanta Hawks,Chicago Bulls,2017,0,0


In [11]:
NBA_results_task_predict_LR.primary_table.tail()

Unnamed: 0,team1,team2,season_start,team1_wins,team_1_home
9835,Washington Wizards,Houston Rockets,2017,0,0
9836,Washington Wizards,Cleveland Cavaliers,2017,0,0
9837,Washington Wizards,Atlanta Hawks,2017,0,1
9838,Washington Wizards,Boston Celtics,2017,1,1
9839,Washington Wizards,Orlando Magic,2017,0,0


In [12]:
# All this learns so far is the home team advantage, since its the only 
# covariate in the test_data table
my_log_red.predict_proba_task(NBA_results_task_predict_LR,
                              outcome=['Dallas Mavericks', 'Atlanta Hawks'])

{'Dallas Mavericks': array([0.58319366, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]),
 'Atlanta Hawks': array([0.41680634, 0.41680634, 0.41680634, ..., 0.41680634, 0.        ,
        0.        ])}

In [13]:
my_log_red.predict_proba_task(NBA_results_task_predict_LR,
                              column=['team1', 'team2'])

{'team1 is preferred': array([0.41680634, 0.41680634, 0.41680634, ..., 0.58319366, 0.58319366,
        0.41680634]),
 'team2 is preferred': array([0.58319366, 0.58319366, 0.58319366, ..., 0.41680634, 0.41680634,
        0.58319366])}

## Now let's fit a Bradley Terry model
This will be defined in a slightly different task, becasue in this case we don't want to use any covariates, whereas scikit-learn logistic regression doesn't work when running with no covariates.

In [14]:
NBA_results_task_train_BT = PairwiseComparisonTask(
    primary_table=train_data,
    primary_table_alternatives_names=['team1', 'team2'],
    primary_table_target_name ='team1_wins',
    target_column_correspondence='team1',
    features_to_use=None
)

NBA_results_task_predict_BT = PairwiseComparisonTask(
    primary_table=test_data,
    primary_table_alternatives_names=['team1', 'team2'],
    primary_table_target_name ='team1_wins',
    target_column_correspondence='team1',
    features_to_use=None
)

In [15]:
# Fitting Bradley Terry model
mybt = BradleyTerry(method='BFGS', alpha=1e-5)
mybt.fit_task(NBA_results_task_train_BT)

In [16]:
mybt.params_

Unnamed: 0,entity,learned_strength
0,Atlanta Hawks,0.047522
1,Boston Celtics,0.580896
2,Brooklyn Nets,-1.178393
3,Charlotte Hornets,-0.278154
4,Chicago Bulls,-0.037967
5,Cleveland Cavaliers,0.489737
6,Dallas Mavericks,-0.386261
7,Denver Nuggets,-0.040408
8,Detroit Pistons,-0.225709
9,Golden State Warriors,1.538386


We can use the latent alternative strength parameters that Bradley-Terry models learn to rank the teams

In [17]:
mybt.rank_entities(ascending=False)

['Golden State Warriors',
 'San Antonio Spurs',
 'Houston Rockets',
 'Boston Celtics',
 'Los Angeles Clippers',
 'Utah Jazz',
 'Cleveland Cavaliers',
 'Toronto Raptors',
 'Washington Wizards',
 'Oklahoma City Thunder',
 'Memphis Grizzlies',
 'Atlanta Hawks',
 'Portland Trail Blazers',
 'Milwaukee Bucks',
 'Indiana Pacers',
 'Miami Heat',
 'Chicago Bulls',
 'Denver Nuggets',
 'Detroit Pistons',
 'Charlotte Hornets',
 'New Orleans Pelicans',
 'Dallas Mavericks',
 'Sacramento Kings',
 'Minnesota Timberwolves',
 'New York Knicks',
 'Orlando Magic',
 'Philadelphia 76ers',
 'Los Angeles Lakers',
 'Phoenix Suns',
 'Brooklyn Nets']

In [18]:
# we can create the probability for each team winning in a specific observaion,

mybt.predict_proba_task(NBA_results_task_predict_BT,
                        outcome=['Atlanta Hawks', 'Washington Wizards'])

{'Atlanta Hawks': array([0.60677663, 0.58070679, 0.77310272, ..., 0.42217622, 0.        ,
        0.        ]),
 'Washington Wizards': array([0.        , 0.        , 0.        , ..., 0.57782378, 0.44533727,
        0.7343335 ])}

In [19]:
mybt.predict_proba_task(NBA_results_task_predict_BT, column=['team1', 'team2'])

{'team1 is preferred': array([0.60677663, 0.58070679, 0.77310272, ..., 0.57782378, 0.44533727,
        0.7343335 ]),
 'team2 is preferred': array([0.39322337, 0.41929321, 0.22689728, ..., 0.42217622, 0.55466273,
        0.2656665 ])}

In [20]:
mybt.predict_choice_task(NBA_results_task_predict_BT)

array(['Atlanta Hawks', 'Atlanta Hawks', 'Atlanta Hawks', ...,
       'Washington Wizards', 'Boston Celtics', 'Washington Wizards'],
      dtype=object)

In [21]:
preds = mybt.predict_task(NBA_results_task_predict_BT)

In [22]:
preds.top_input_data, preds.boot_input_data

(array(['Atlanta Hawks', 'Atlanta Hawks', 'Atlanta Hawks', ...,
        'Washington Wizards', 'Boston Celtics', 'Washington Wizards'],
       dtype=object),
 array(['Dallas Mavericks', 'Charlotte Hornets', 'Brooklyn Nets', ...,
        'Atlanta Hawks', 'Washington Wizards', 'Orlando Magic'],
       dtype=object))

## Run model with team salary budget

In [23]:
NBA_results_task_train = PairwiseComparisonTask(
    primary_table=train_data,
    primary_table_alternatives_names=['team1', 'team2'],
    primary_table_target_name ='team1_wins',
    target_column_correspondence='team1',
    features_to_use=['salary', 'team1_home'],
    secondary_table=NBA_team_salary_budget,
    secondary_to_primary_link={
        'team': ['team1', 'team2'],
        'season_start': 'season_start'
    })

NBA_results_task_predict = PairwiseComparisonTask(
    primary_table=test_data,
    primary_table_alternatives_names=['team1', 'team2'],
    primary_table_target_name ='team1_wins',
    target_column_correspondence='team1',
    features_to_use=['salary', 'team1_home'],
    secondary_table=NBA_team_salary_budget,
    secondary_to_primary_link={
        'team': ['team1', 'team2'],
        'season_start': 'season_start'
    })

# Let's fit a reduction to logistic regression with the salary covariate

In [24]:
my_log_red = ClassificationReducer(LogisticRegression(solver='lbfgs'))
my_log_red.fit_task(NBA_results_task_train)
preds = my_log_red.predict_task(NBA_results_task_predict)

In [25]:
preds.top_input_data, preds.boot_input_data

(array(['Atlanta Hawks', 'Charlotte Hornets', 'Atlanta Hawks', ...,
        'Washington Wizards', 'Washington Wizards', 'Washington Wizards'],
       dtype=object),
 array(['Dallas Mavericks', 'Atlanta Hawks', 'Brooklyn Nets', ...,
        'Atlanta Hawks', 'Boston Celtics', 'Orlando Magic'], dtype=object))

In [26]:
# All this learns so far is the home team advantage, since its the only 
# covariate in the test_data table
my_log_red.predict_proba_task(NBA_results_task_predict, column='team1')

{'team1 is preferred': array([0.55251265, 0.43135651, 0.51319443, ..., 0.59426105, 0.53495344,
        0.60849633])}

# Bradley Terry model with salary covariate

In [27]:
mybt = BradleyTerry(method='BFGS', alpha=1e-5)
mybt.fit_task(NBA_results_task_train)
mybt.rank_entities(ascending=False)

array(['Golden State Warriors', 'San Antonio Spurs', 'Houston Rockets',
       'Utah Jazz', 'Boston Celtics', 'Oklahoma City Thunder',
       'Washington Wizards', 'Toronto Raptors', 'Los Angeles Clippers',
       'Denver Nuggets', 'Atlanta Hawks', 'Indiana Pacers',
       'Chicago Bulls', 'Cleveland Cavaliers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Charlotte Hornets',
       'Minnesota Timberwolves', 'Portland Trail Blazers',
       'New Orleans Pelicans', 'Sacramento Kings', 'Detroit Pistons',
       'Dallas Mavericks', 'Philadelphia 76ers', 'New York Knicks',
       'Phoenix Suns', 'Los Angeles Lakers', 'Orlando Magic',
       'Brooklyn Nets'], dtype=object)

In [28]:
mybt.predict_proba_task(NBA_results_task_predict, column=['team1', 'team2'])

{'team1 is preferred': array([0.69081236, 0.47898395, 0.74603095, ..., 0.65408203, 0.42937596,
        0.82114576]),
 'team2 is preferred': array([0.30918764, 0.52101605, 0.25396905, ..., 0.34591797, 0.57062404,
        0.17885424])}

In [29]:
mybt.predict_choice_task(NBA_results_task_predict)

array(['Atlanta Hawks', 'Charlotte Hornets', 'Atlanta Hawks', ...,
       'Washington Wizards', 'Boston Celtics', 'Washington Wizards'],
      dtype=object)

In [30]:
mybt.predict_task(NBA_results_task_predict).top_input_data

array(['Atlanta Hawks', 'Charlotte Hornets', 'Atlanta Hawks', ...,
       'Washington Wizards', 'Boston Celtics', 'Washington Wizards'],
      dtype=object)

# Example using GridSearchCV()

In [31]:
to_tune = {'alpha': [1, 2, 4], 'method': ['BFGS']}
gs_bt = GridSearchCV(BradleyTerry(), to_tune,  cv=3, scoring='neg_log_loss')
gs_bt.fit_task(NBA_results_task_train)
gs_bt.inspect_results()

The model with the best parameters was:
BradleyTerry(alpha=2, initial_params=None, max_iter=None, method='BFGS',
       tol=1e-05)
With a score of -0.6265008194657992
All the trials results summarised in descending score
   alpha method  mean_test_score
1      2   BFGS        -0.626501
0      1   BFGS        -0.626742
2      4   BFGS        -0.628853


In [32]:
# Showing that sklearn.metrics works also
to_tune = {'alpha': [1, 2, 4], 'method': ['BFGS']}
gs_bt = GridSearchCV(BradleyTerry(), to_tune,  cv=3, scoring=f1_score)
gs_bt.fit_task(NBA_results_task_train)
gs_bt.inspect_results()

The model with the best parameters was:
BradleyTerry(alpha=4, initial_params=None, max_iter=None, method='BFGS',
       tol=1e-05)
With a score of 0.6337744652191033
All the trials results summarised in descending score
   alpha method  mean_test_score
2      4   BFGS         0.633774
1      2   BFGS         0.631136
0      1   BFGS         0.630085


In [33]:
to_tune = {'C': [0.5, 1, 2, 4, 8], 'solver': ['saga'], 'penalty': ['l1','l2'],
           'fit_intercept': [True, False]}
gs_lr = GridSearchCV(ClassificationReducer(LogisticRegression()), to_tune,
                     cv=3, scoring='neg_log_loss')
gs_lr.fit_task(NBA_results_task_train)
gs_lr.inspect_results()

The model with the best parameters was:
ClassificationReducer(model=LogisticRegression(C=4, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False))
With a score of -0.6865141345460581
All the trials results summarised in descending score
    model__C  model__fit_intercept model__penalty model__solver  \
12       4.0                  True             l1          saga   
15       4.0                 False             l2          saga   
4        1.0                  True             l1          saga   
18       8.0                 False             l1          saga   
3        0.5                 False             l2          saga   
11       2.0                 False             l2          saga   
6        1.0                 False             l1          saga   
13       4.0                  True         

In [34]:
gs_lr.predict_task(NBA_results_task_predict).top_input_data

array(['Atlanta Hawks', 'Charlotte Hornets', 'Atlanta Hawks', ...,
       'Washington Wizards', 'Washington Wizards', 'Washington Wizards'],
      dtype=object)

In [35]:
gs_lr.predict_proba_task(NBA_results_task_predict, column='team1')

{'team1 is preferred': array([0.55251667, 0.43135767, 0.51319756, ..., 0.59426662, 0.53495777,
        0.60850217])}

In [36]:
gs_bt.predict_proba_task(NBA_results_task_predict, column='team1')

{'team1 is preferred': array([0.66884072, 0.47279872, 0.70427284, ..., 0.64116639, 0.44976495,
        0.78764947])}

In [37]:
gs_bt.rank_entities(ascending=False)

array(['Golden State Warriors', 'San Antonio Spurs', 'Houston Rockets',
       'Utah Jazz', 'Boston Celtics', 'Oklahoma City Thunder',
       'Washington Wizards', 'Toronto Raptors', 'Los Angeles Clippers',
       'Denver Nuggets', 'Atlanta Hawks', 'Indiana Pacers',
       'Chicago Bulls', 'Cleveland Cavaliers', 'Memphis Grizzlies',
       'Miami Heat', 'Milwaukee Bucks', 'Charlotte Hornets',
       'Minnesota Timberwolves', 'Portland Trail Blazers',
       'Detroit Pistons', 'New Orleans Pelicans', 'Sacramento Kings',
       'Philadelphia 76ers', 'Dallas Mavericks', 'New York Knicks',
       'Phoenix Suns', 'Los Angeles Lakers', 'Orlando Magic',
       'Brooklyn Nets'], dtype=object)