# Odds Joining (WIP)

The purpose of this notebook is to determine how to join the odds data, from [here](http://www.tennis-data.co.uk/alldata.php), with the parsed data.  This is difficult because players' names are represented differently, the parsed data doesn't have match dates, tournaments are named differently, and so on.  We will need to make various manual corrections to create a basis for joining the data.

Here, we attempt to join by player last name / first initial, year, and tournament location (the locations are more stable over time than the names of the tournaments, and it is cumbersome to come up with consistent mappings for many different name representations).
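As a rough illustration of the join key described above, the sketch below reduces both name formats to a (last name, first initial) pair. It assumes the odds data writes names as `Last F.` and the parsed data as `First Last`; `odds_key` and `atp_key` are hypothetical helpers, not functions from this notebook.

```python
def odds_key(name):
    """'Federer R.' -> ('FEDERER', 'R'): last name first, initials last."""
    parts = name.upper().strip().split(' ')
    return (' '.join(parts[:-1]), parts[-1][0])

def atp_key(name):
    """'Roger Federer' -> ('FEDERER', 'R'): first name first."""
    parts = name.upper().strip().split(' ')
    return (' '.join(parts[1:]), parts[0][0])

print(odds_key('Federer R.'), atp_key('Roger Federer'))
# Multi-part names do not line up without manual corrections:
print(odds_key('Del Potro J.M.'), atp_key('Juan Martin Del Potro'))
```

The second pair disagrees on the last name (`DEL POTRO` versus `MARTIN DEL POTRO`), which is the kind of case the manual corrections are meant to handle.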

#### Read in Data

Before anything else, we have to read in the data...

In [1]:
from pathlib import Path
from tennis_new.infra.defs import REPO_DIR

ODDS_PATH = Path.joinpath(
    REPO_DIR,
    'fetch/odds_data/'
)

In [128]:
all_odds.iloc[-1]

Unnamed: 0                      2404
ATP                               59
Location                    Shanghai
Tournament          Shanghai Masters
Date                        10/13/19
Series                  Masters 1000
Court                        Outdoor
Surface                         Hard
Round                      The Final
Best of                            3
Winner                   MEDVEDEV D.
Loser                      ZVEREV A.
WRank                              4
LRank                              6
W1                                 6
L1                                 4
W2                                 6
L2                                 1
W3                               NaN
L3                               NaN
W4                               NaN
L4                               NaN
W5                               NaN
L5                               NaN
Wsets                              2
Lsets                              0
Comment                    Completed
B

In [2]:
import pandas as pd
from tennis_new.fetch.get_joined import read_joined

YEAR = 2006
odds_df = pd.read_csv(Path.joinpath(ODDS_PATH, "%d.csv" % YEAR))
all_odds = pd.read_csv("./all_odds.csv", na_values = ' ')  # There is a weird thing in the odds data with how nulls are sometimes represented
jd = read_joined()



In [129]:
odds_df.iloc[-1]

ATP                    67
Location         Shanghai
Tournament    Masters Cup
Date             11/19/06
Series        Masters Cup
Court              Indoor
Surface              Hard
Round           The Final
Best of                 5
Winner         Federer R.
Loser            Blake J.
WRank                   1
LRank                   8
WPts                 7620
LPts                 2130
W1                      6
L1                      0
W2                      6
L2                      3
W3                      6
L3                      4
W4                    NaN
L4                    NaN
W5                    NaN
L5                    NaN
Wsets                   3
Lsets                   0
Comment         Completed
B365W                1.07
B365L                 8.5
CBW                  1.06
CBL                     9
EXW                   1.1
EXL                   5.8
PSW                 1.099
PSL                  9.15
UBW                  1.08
UBL                     8
Name: 2908, 

In [4]:
# Look into a particular weird case
ws = jd[jd['score'].notnull()]
ws['score'][ws['score'].map(lambda x: 'UNP' in x)].value_counts()

(UNP)    2
Name: score, dtype: int64

#### Name Processing

Let's process the players' names so we can join on them.  Certain players' names are inconsistently represented, especially in the odds data.  Here, we map inconsistent representations of the same player to a consistent one, and we correct players' names so that last names appear the same in the odds data and in the ATP data.
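A small sketch of why the corrections have to run before the last name is parsed. It uses one entry taken from `ODDS_NAME_CORRECTIONS`; `odds_last_name` is a hypothetical helper that assumes the last space-separated token of an odds name holds the initials.

```python
CORRECTIONS = {'DEL POTRO J. M.': 'DEL POTRO J.M.'}  # one entry from ODDS_NAME_CORRECTIONS

def odds_last_name(name, corrections=None):
    """Drop the final token (the initials) after optionally correcting the name."""
    name = name.upper().strip()
    if corrections:
        name = corrections.get(name, name)  # unknown names pass through unchanged
    return ' '.join(name.split(' ')[:-1])

print(odds_last_name('Del Potro J. M.'))               # initials split in two tokens
print(odds_last_name('Del Potro J. M.', CORRECTIONS))  # corrected first
```

Without the correction, the stray space leaves `J.` attached to the last name (`DEL POTRO J.`); with it, the result is `DEL POTRO`.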

In [65]:
# TODO: Confirm names for a given player don't change over time?
ODDS_NAME_CORRECTIONS = {  # Corrections for the odds data
    'DEL POTRO J. M.': 'DEL POTRO J.M.',
    'GAMBILL J. M.': 'GAMBILL J.M.',
    'QUERRY S.': 'QUERREY S.',
    'BAUTISTA AGUT R.': 'BAUTISTA R.',
    'BOGOMOLOV JR. A.': 'BOGOMOLOV A.',
    'RAMIREZ HIDALGO R.': 'RAMIREZ-HIDALGO R.',
    'CARRENO BUSTA P.': 'CARRENO-BUSTA P.',
    'MUNOZ-DE LA NAVA D.': 'MUNOZ-DE-LA-NAVA D.',
    'MUNOZ DE LA NAVA D.': 'MUNOZ-DE-LA-NAVA D.',
    'DEL BONIS F.': 'DELBONIS F.',
    'HANTSCHEK M.': 'HANTSCHK M.',
    'HAIDER-MAUER A.': 'HAIDER-MAURER A.',
    'DE HEART R.': 'DEHEART R.',
    'MATSUKEVITCH D.': 'MATSUKEVICH D.',
    'NADAL-PARERA R.': 'NADAL R.',
    'AL GHAREEB M.': 'GHAREEB M.',
    "AL-GHAREEB M.": "GHAREEB M.",
    'WANG Y. JR': 'WANG-JR. Y.',
    "SANCHEZ DE LUNA J.": "SANCHEZ-DE-LUNA J.A",
    "SANCHEZ DE LUNA J.A.": "SANCHEZ-DE-LUNA J.A.",
    "DEV VARMAN S.": "DEVVARMAN S.",
    "GRANOLLERS-PUJOL M.": "GRANOLLERS M.",
    "GRANOLLERS PUJOL G.": "GRANOLLERS G.",
    "GRANOLLERS-PUJOL G.": "GRANOLLERS G.",
    "BAHROUZYAN O.": "AWADHY O.",
    "ALI MUTAWA J.M.": "AL-MUTAWA J.M.",
    "AL MUTAWA J.": "AL-MUTAWA J.",
    "ZAYID M. S.": "ZAYID M.S.",
    "GALLARDO VALLES M.": "GALLARDO-VALLES M.",
    "RIBA-MADRID P.": "RIBA P.",
    "CHEKOV P.": "CHEKHOV P.",
    "SAAVEDRA CORVALAN C.": "SAAVEDRA-CORVALAN C.",
    "HAJI A.": "HAJJI A.",
    "ZAYED M. S.": "ZAYED M.S.",
    "KUNITCIN I.": "KUNITSYN I.",
    "DEEN HESHAAM A.": "DEEN-HESHAAM A.",
    "ESTRELLA V.": "ESTRELLA-BURGOS V.",
    "SCHUTTLER P.": "SCHUETTLER R.",
    "TYURNEV E.": "TIURNEV E.",
    "SULTAN-KHALFAN A.": "KHALFAN S.",
    "VAN D. MERWE I.": "VAN DER MERWE I.",
    "ALAWADHI O.": "AWADHY O.",
    "RASCON T.": "RASCON-LOPE J.T.",
    "RUEVSKI P.": "RUSEVSKI P.",
    "ESTRELLA BURGOS V.": "ESTRELLA-BURGOS V.",
    "VAN DER DIUM A.": "VAN DER DUIM A.",
    "AL KHULAIFI N.G.": "AL-KHULAIFI N.G."
}

JD_WHOLE_NAME_CORRECTIONS = {  # Corrections before parsing last name
    'IVAN NAVARRO': 'IVAN NAVARRO-PASTOR',
    'DANIEL MUNOZ DE LA NAVA': 'DANIEL MUNOZ-DE-LA-NAVA',
    'MIGUEL ANGEL LOPEZ JAEN': 'MIGUEL-ANGEL LOPEZ-JAEN',
    'YU JR. WANG': 'YU WANG-JR.',
    "OMAR ALAWADHI": "OMAR AWADHY",
    "MIGUEL ANGEL REYES-VARELA": "MIGUEL-ANGEL REYES-VARELA",
    "ISRAEL MATOS GIL": "ISRAEL MATOS-GIL",
    "ARIEZ ELYAAS DEEN HESHAAM": "ARIEZ-ELYAAS DEEN-HESHAAM",
    "ENRIQUE LOPEZ PEREZ": "ENRIQUE LOPEZ-PEREZ",
    "VICTOR ESTRELLA BURGOS": "VICTOR ESTRELLA-BURGOS",
    "MAHMOUD-NADER AL BALOUSHI": "MAHMOUD NADER"
}


JD_NAME_CORRECTIONS = {  # Corrections after parsing last name
    'MARTIN DEL POTRO': 'DEL POTRO',
    'IGNACIO LONDERO': 'LONDERO',
    'FERREIRA SILVA': "SILVA",
    "ELAHI GALAN": "GALAN",
    "CARLOS FERRERO": "FERRERO",
    "IGNACIO CHELA": "CHELA",
    "ALBERT VILOCA-PUIG": "VILOCA",
    "BURRIEZA-LOPEZ": "BURRIEZA",
    "BOGOMOLOV JR.": "BOGOMOLOV",
    # "KHALFAN": "AL-ALAWI",
    # "ALAWADHI": "BAHROUZYAN",
    "BAUTISTA AGUT": "BAUTISTA",
    "RAMIREZ HIDALGO": "RAMIREZ-HIDALGO",
    "VASSALLO ARGUELLO": "VASSALLO-ARGUELLO",
    "CARRENO BUSTA": "CARRENO-BUSTA",
    "PABLO BRZEZICKI": "BRZEZICKI",
    "TRUJILLO-SOLER": "TRUJILLO",
    "MARCO MORONI": "MORONI",
    "SALVA-VIDAL": "SALVA",
    "ANTONIO SANCHEZ-DE LUNA": "SANCHEZ-DE-LUNA",
    "SEBASTIAN CABAL": "CABAL",
    "SHANNAN ZAYID": "ZAYID",
    "SHANAN ZAYED": "ZAYED",
    "VIJAY SUNDAR PRASHANTH": 'PRASHANTH',
    "DON GRUBER": "GRUBER",
    "PAUL FRUTTERO": "FRUTTERO",
    "LUQUE-VELASCO": "LUQUE",
    "ANTONIO MARIN": "MARIN"
}

In [6]:
sorted(odds_df.columns)

['ATP',
 'B365L',
 'B365W',
 'Best of',
 'CBL',
 'CBW',
 'Comment',
 'Court',
 'Date',
 'EXL',
 'EXW',
 'L1',
 'L2',
 'L3',
 'L4',
 'L5',
 'LPts',
 'LRank',
 'Location',
 'Loser',
 'Lsets',
 'PSL',
 'PSW',
 'Round',
 'Series',
 'Surface',
 'Tournament',
 'UBL',
 'UBW',
 'W1',
 'W2',
 'W3',
 'W4',
 'W5',
 'WPts',
 'WRank',
 'Winner',
 'Wsets']

In [7]:
# Make all names upper case for consistent capitalization, and put through whole name corrections 
jd['winner_name'] = jd['winner_name'].map(lambda x: x.upper())
jd['loser_name'] = jd['loser_name'].map(lambda x: x.upper())
jd['winner_name'] = jd['winner_name'].map(lambda x: JD_WHOLE_NAME_CORRECTIONS.get(x, x))
jd['loser_name'] = jd['loser_name'].map(lambda x: JD_WHOLE_NAME_CORRECTIONS.get(x, x))
all_odds['Winner'] = all_odds['Winner'].map(lambda x: x.upper())
all_odds['Loser'] = all_odds['Loser'].map(lambda x: x.upper())
all_odds['Winner'] = all_odds['Winner'].map(lambda x: ODDS_NAME_CORRECTIONS.get(x, x))
all_odds['Loser'] = all_odds['Loser'].map(lambda x: ODDS_NAME_CORRECTIONS.get(x, x))

In [8]:
# Parse last names
def last_name_jd(n):
    return ' '.join(n.upper().strip().split(' ')[1: ])

def last_name_odds(n):
    return ' '.join(n.upper().strip().split(' ')[: -1])

jd['winner_last_name'] = jd['winner_name'].map(last_name_jd)
jd['loser_last_name'] = jd['loser_name'].map(last_name_jd)
jd['winner_last_name'] = jd['winner_last_name'].map(lambda x: JD_NAME_CORRECTIONS.get(x, x))
jd['loser_last_name'] = jd['loser_last_name'].map(lambda x: JD_NAME_CORRECTIONS.get(x, x))
all_odds['winner_last_name'] = all_odds['Winner'].map(last_name_odds)
all_odds['loser_last_name'] = all_odds['Loser'].map(last_name_odds)

In [10]:
# Which (winner) names from the odds aren't represented in the ATP data?
top_missing_winners = all_odds['Winner'][~all_odds['winner_last_name'].isin(jd['winner_last_name'])].value_counts()
top_missing_winners                              

Series([], Name: Winner, dtype: int64)

In [11]:
# Which (loser) names from the odds aren't represented in the ATP data?
top_missing_losers = all_odds['Loser'][~all_odds['loser_last_name'].isin(jd['loser_last_name'])].value_counts()
top_missing_losers                              

Series([], Name: Loser, dtype: int64)

In [12]:
TO_INVESTIGATE = 'KHULAIFI'

In [13]:
all_odds['Loser'][all_odds['loser_last_name'].map(lambda x: TO_INVESTIGATE in x)].value_counts()

AL-KHULAIFI N.G.    1
Name: Loser, dtype: int64

In [14]:
all_odds[[
    'Winner',
    'winner_last_name',
    'Loser',
    'loser_last_name',
    'Tournament',
    'Location',
    'Date',
    'W1',
    'L1',
    'W2',
    'L2'
]][all_odds['loser_last_name'].map(lambda x: TO_INVESTIGATE in x)].head()

Unnamed: 0,Winner,winner_last_name,Loser,loser_last_name,Tournament,Location,Date,W1,L1,W2,L2
72,OGORODOV O.,OGORODOV,AL-KHULAIFI N.G.,AL-KHULAIFI,Qatar Open,Doha,12/31/01,6.0,1.0,6.0,2.0


In [15]:
jd[[
    'loser_name',
    'loser_last_name',
]][jd['loser_last_name'].map(lambda x: TO_INVESTIGATE in x)].drop_duplicates(
    ['loser_name', 'loser_last_name']
)

Unnamed: 0,loser_name,loser_last_name
144402,NASSER-GHANIM AL-KHULAIFI,AL-KHULAIFI


#### Score Parsing

Now let's parse the scores from the jd data set so that we can also join on score between the odds and ATP data.

In [16]:
from copy import copy
import numpy as np

WALKOVER_DEFS = [
    'W/O',
    'DEF'
]

def parse_numeric_set(s, origs):
    # `origs` (the unmodified set string) is currently unused
    if len(s) == 2:
        return (int(s[0]), int(s[1]), True)
    else:
        # Longer strings (e.g. '108'): take the first split whose sides differ by at most 2 games
        for b in range(1, len(s)):
            s1, s2 = int(s[:b]), int(s[b:])
            if abs(s1 - s2) <= 2:
                return (s1, s2, True)
        return (np.nan, np.nan, False)

def parse_set_score(s):
    # TODO: Parse scores when there are retirements
    # TODO: parse the odds comment in here.
    origs = copy(s)
    comment = None
    if 'RET' in s:
        comment = 'Retired'
        # str.strip removes a set of characters (not a suffix); safe here because the games are digits
        s = s.strip(' (RET)')
    elif any(x in s for x in WALKOVER_DEFS):
        comment = 'Walkover'
        return (np.nan, np.nan, comment)
    elif 'UNP' in s:
        comment = 'Match Not Played'
        return (np.nan, np.nan, comment)
    s = s.strip(' (NA)')
    if len(s) == 0:
        return (np.nan, np.nan, comment)
    else:
        s1, s2, completion_flag = parse_numeric_set(s, origs)
        if not completion_flag:
            comment = "ERROR"
        return (s1, s2, comment)

def parse_match_score(s):
    if pd.isnull(s):
        return {}
    set_scores = s.split(';')
    out = {}
    for idx, ss in enumerate(set_scores):
        w, l, comment = parse_set_score(ss)
        # 'join_comment' is overwritten on every set, so only the last set's comment is kept
        out.update({
            'W%d' % (idx + 1): w,
            'L%d' % (idx + 1): l,
            'join_comment': comment
        })
    return out

In [17]:
parsed_scores = jd['score'].map(parse_match_score)
score_df = pd.DataFrame(parsed_scores.tolist())
score_df.head()

Unnamed: 0,L1,L2,L3,L4,L5,W1,W2,W3,W4,W5,join_comment
0,3.0,2.0,6.0,1.0,,6.0,6.0,5.0,6.0,,
1,3.0,3.0,5.0,,,6.0,6.0,6.0,,,
2,5.0,6.0,4.0,1.0,,6.0,5.0,6.0,6.0,,
3,2.0,5.0,2.0,,,6.0,6.0,6.0,,,
4,1.0,2.0,4.0,,,6.0,6.0,6.0,,,


In [18]:
new_jd = pd.concat([jd, score_df], axis=1)

In [19]:
score_cols = score_df.columns.tolist()

In [20]:
comment_cols = [x for x in score_cols if 'comment' in x]

In [21]:
# Look at malformed scores...
new_jd[
    ['score'] + comment_cols + 
    [
        'tourney_dates',
        'tourney_title',
        'winner_name',
        'loser_name',
        'tour_type',
        'tourney_url_suffix'
    ]
][(new_jd[comment_cols] == 'ERROR').any(axis=1)]

Unnamed: 0,score,join_comment,tourney_dates,tourney_title,winner_name,loser_name,tour_type,tourney_url_suffix
89376,22;300 (RET),ERROR,1987.05.04,Forest Hills,BORIS BECKER,FRANCISCO MACIEL,atp,/en/scores/archive/forest-hills/415/1987/results
100790,24;030 (RET),ERROR,1989.04.17,Tokyo Outdoor,BILL SCANLON,PAT CASH,atp,/en/scores/archive/tokyo/329/1989/results
106054,64;640,ERROR,1990.02.12,Croydon,NUNO MARQUES,RICHARD VOGEL,challenger,/en/scores/archive/croydon/491/1990/results
106081,63;600,ERROR,1990.02.12,Nairobi-1,PAOLO PAMBIANCO,STEFAN LOCHBIHLER,challenger,/en/scores/archive/nairobi/252/1990/results
111355,62;620,ERROR,1991.01.21,Vina Del Mar,GERARDO VACAREZZA,ALVARO JORDAN,challenger,/en/scores/archive/vina-del-mar/204/1991/results
230099,60;70;0500,ERROR,2008.04.14,Athens,CHARALAMPOS KAPOGIANNIS,PIER BISBIKOS,challenger,/en/scores/archive/athens/3801/2008/results


None of the malformed scores should affect the join: each one is either a challenger match or too old to appear in the odds data.
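To see why a set string such as `640` is flagged, it helps to enumerate the splits that `parse_numeric_set` tries. The sketch below mirrors its rule of accepting the first split whose two sides differ by at most two games; `candidate_splits` and `first_valid` are hypothetical helpers, not part of the notebook.

```python
def candidate_splits(s):
    """Every way to cut a digit string into (winner_games, loser_games)."""
    return [(int(s[:b]), int(s[b:])) for b in range(1, len(s))]

def first_valid(s):
    """First split whose sides differ by at most 2 games, else None."""
    for w, l in candidate_splits(s):
        if abs(w - l) <= 2:
            return (w, l)
    return None

for raw in ['108', '640', '0500']:
    print(raw, candidate_splits(raw), first_valid(raw))
```

Here `108` resolves to 10-8, while no split of `640` or `0500` passes the check, so those sets end up marked `ERROR`.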

In [22]:
jd = pd.concat([jd, score_df], axis=1)
jd.shape

(373236, 41)

In [23]:
# Visually check that this is processing retirements appropriately...
score_cols = score_df.columns.tolist()
jd[['score'] + score_cols][jd['join_comment'] == 'Retired'].head()

Unnamed: 0,score,L1,L2,L3,L4,L5,W1,W2,W3,W4,W5,join_comment
148,63;75;20 (RET),3.0,5.0,0.0,,,6.0,7.0,2.0,,,Retired
188,46;46;86;20 (RET),6.0,6.0,6.0,0.0,,4.0,4.0,8.0,2.0,,Retired
205,75;57;86;57 (RET),5.0,7.0,6.0,7.0,,7.0,5.0,8.0,5.0,,Retired
257,62;53 (RET),2.0,3.0,,,,6.0,5.0,,,,Retired
313,64;64;21 (RET),4.0,4.0,1.0,,,6.0,6.0,2.0,,,Retired


In [24]:
all_odds['join_comment'] = all_odds['Comment']
all_odds.loc[
    ~all_odds['join_comment'].isin(['Retired', 'Walkover']),
    'join_comment'
] = None

#### Location Mapping

Earlier, we tried mapping tournaments; now we'll just try mapping locations. This will hopefully make things less cumbersome, even if a little less accurate.

First, are there any instances in the data where the location is missing?

In [25]:
missing_location = jd[
    jd['tourney_location'].isnull() &
    (jd['year'] >= 2002)
]
missing_location.shape

(79, 41)

Yes, there are 79 such rows. Which tournaments do they belong to?

In [26]:
missing_location['tourney_title'].value_counts()

Tennis Masters Cup    49
ATP Finals            30
Name: tourney_title, dtype: int64

Are there missing cases in the odds data?

In [27]:
all_odds['Location'].isnull().value_counts()

False    48777
Name: Location, dtype: int64

Good, no!

In [28]:
# Create additional column with modified locations for joining 
all_odds['join_location'] = all_odds['Location']
all_odds.loc[
    all_odds['Tournament'] == 'Masters Cup',
    'join_location'
] = 'MASTERS_CUP'

In [29]:
# Fill in missing locations in the jd data to match what exists in the odds data
jd['join_location'] = jd['tourney_location']
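# Sanity check: every row with a missing location (2002 onward) is a year-end championship match
assert jd['tourney_title'][
    jd['tourney_location'].isnull() &
    (jd['year'] >= 2002)
].isin(['Tennis Masters Cup', 'ATP Finals']).all()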
jd.loc[
    jd['tourney_location'].isnull() &
    (jd['year'] >= 2002),
    'join_location'
] = 'MASTERS_CUP'
jd.loc[
    jd['tourney_title'] == "London / Queen's Club",
    'join_location'
] = "QUEEN'S CLUB"

In [30]:
ODDS_LOCATION_CORRECTIONS = {
    "'S-HERTOGENBOSCH": "S-HERTOGENBOSCH",
    "HO CHI MIN CITY": "HO CHI MINH CITY",
    "ST. POLTEN": "ST. POELTEN",
    "QUEENS CLUB": "QUEEN'S CLUB"
}

JD_LOCATION_CORRECTIONS = {
    "HO CHI MINH": "HO CHI MINH CITY"
}

In [31]:
jd['join_location'] = jd['join_location'].fillna('')  # assign back rather than relying on inplace on a column

In [32]:
jd['join_location'] = jd['join_location'].map(lambda x: x.split(',')[0].strip().upper())
jd['join_location'] = jd['join_location'].map(lambda x: JD_LOCATION_CORRECTIONS.get(x, x))

In [33]:
all_odds['join_location'] = all_odds['join_location'].map(lambda x: x.strip().upper())
all_odds['join_location'] = all_odds['join_location'].map(lambda x: ODDS_LOCATION_CORRECTIONS.get(x, x))
missing_locs = all_odds['join_location'][~all_odds['join_location'].isin(jd['join_location'])]
missing_locs.value_counts()

Series([], Name: join_location, dtype: int64)
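The location normalization used in this section can be tried on a few literal values. This is a minimal sketch under the assumption that the city is always the part before the first comma; `normalize_location` and `LOCATION_FIXES` are hypothetical names, with the one fix copied from the corrections above.

```python
LOCATION_FIXES = {"QUEENS CLUB": "QUEEN'S CLUB"}  # one entry from the corrections above

def normalize_location(raw, fixes=LOCATION_FIXES):
    """Keep the city part, upper-case it, then apply manual fixes."""
    city = raw.split(',')[0].strip().upper()
    return fixes.get(city, city)

print(normalize_location('Paris, France'))
print(normalize_location('Queens Club'))
```

Both data sets can then be compared on the normalized value, at the cost of merging distinct tournaments held in the same city.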

#### Joining

In [79]:
all_odds['odds_match_id'] = range(all_odds.shape[0])

In [80]:
join_cols = sorted([
    'winner_last_name',
    'loser_last_name',
    'join_location',
    'year',
    'join_comment'
])
join_cols

['join_comment',
 'join_location',
 'loser_last_name',
 'winner_last_name',
 'year']

In [81]:
merged = pd.merge(
    all_odds,
    jd,
    on=join_cols,
    how='left'
)

In [82]:
merged['match_id'].isnull().value_counts()

False    46558
True      2300
Name: match_id, dtype: int64

2300 odds rows found no match in the ATP data!

In [99]:
merged['match_id'].value_counts().value_counts()

1    46456
2       51
Name: match_id, dtype: int64

51 ATP matches were each joined to two odds rows!

In [110]:
# Look into double-joined matches
vcs = merged['match_id'].value_counts()
double_idx = vcs[vcs == 2].index.tolist()
doubles = merged[merged['match_id'].isin(double_idx)].copy()
doubles.sort_values('match_id', ascending=True, inplace=True)

In [118]:
doubles[[
    'Winner',
    'winner_name',
    'winner_last_name',
    'Loser',
    'loser_name',
    'loser_last_name',
    'Tournament',
    'tourney_title',
    'Location',
    'tourney_location',
    'year',
    'match_id'
]].head(40)

Unnamed: 0,Winner,winner_name,winner_last_name,Loser,loser_name,loser_last_name,Tournament,tourney_title,Location,tourney_location,year,match_id
6615,MONTANES A.,ALBERT MONTANES,MONTANES,LOPEZ F.,FELICIANO LOPEZ,LOPEZ,CAM Open Comunidad Valenciana,Valencia,Valencia,Valencia,2004,Albert Montanes*Feliciano Lopez*2004_573*Round...
6621,MONTANES A.,ALBERT MONTANES,MONTANES,LOPEZ M.,FELICIANO LOPEZ,LOPEZ,CAM Open Comunidad Valenciana,Valencia,Valencia,Valencia,2004,Albert Montanes*Feliciano Lopez*2004_573*Round...
6622,MONTANES A.,ALBERT MONTANES,MONTANES,LOPEZ M.,MARC LOPEZ,LOPEZ,CAM Open Comunidad Valenciana,Valencia,Valencia,Valencia,2004,Albert Montanes*Marc Lopez*2004_573*Quarter-Fi...
6616,MONTANES A.,ALBERT MONTANES,MONTANES,LOPEZ F.,MARC LOPEZ,LOPEZ,CAM Open Comunidad Valenciana,Valencia,Valencia,Valencia,2004,Albert Montanes*Marc Lopez*2004_573*Quarter-Fi...
37251,MURRAY A.,ANDY MURRAY,MURRAY,FERRER D.,DAVID FERRER,FERRER,French Open,ATP Masters 1000 Paris,Paris,"Paris, France",2015,Andy Murray*David Ferrer*2015_352*Semi-Finals
38518,MURRAY A.,ANDY MURRAY,MURRAY,FERRER D.,DAVID FERRER,FERRER,BNP Paribas Masters,ATP Masters 1000 Paris,Paris,"Paris, France",2015,Andy Murray*David Ferrer*2015_352*Semi-Finals
37250,MURRAY A.,ANDY MURRAY,MURRAY,FERRER D.,DAVID FERRER,FERRER,French Open,Roland Garros,Paris,"Paris, France",2015,Andy Murray*David Ferrer*2015_520*Quarter-Finals
38517,MURRAY A.,ANDY MURRAY,MURRAY,FERRER D.,DAVID FERRER,FERRER,BNP Paribas Masters,Roland Garros,Paris,"Paris, France",2015,Andy Murray*David Ferrer*2015_520*Quarter-Finals
32836,MURRAY A.,ANDY MURRAY,MURRAY,MAYER F.,FLORIAN MAYER,MAYER,US Open,US Open,New York,"New York, United States",2013,Andy Murray*Florian Mayer*2013_560*Round of 32
32821,MURRAY A.,ANDY MURRAY,MURRAY,MAYER L.,FLORIAN MAYER,MAYER,US Open,US Open,New York,"New York, United States",2013,Andy Murray*Florian Mayer*2013_560*Round of 32
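The Lopez rows above can be reproduced on a toy example. The sketch below uses hand-built frames rather than the real data, and the `loser_initial` column is an assumption introduced for illustration; it shows how a key made of last names alone double-joins, and how adding a first initial together with the `validate` argument of `pd.merge` guards against it.

```python
import pandas as pd

odds = pd.DataFrame({
    'Loser': ['LOPEZ F.', 'LOPEZ M.'],
    'winner_last_name': ['MONTANES', 'MONTANES'],
    'loser_last_name': ['LOPEZ', 'LOPEZ'],
})
atp = pd.DataFrame({
    'winner_last_name': ['MONTANES', 'MONTANES'],
    'loser_last_name': ['LOPEZ', 'LOPEZ'],
    'loser_name': ['FELICIANO LOPEZ', 'MARC LOPEZ'],
})

# Last names alone: each odds row pairs with both ATP rows (2 x 2 = 4 rows).
loose = pd.merge(odds, atp, on=['winner_last_name', 'loser_last_name'], how='left')

# Adding the loser's first initial to the key makes each pairing unique.
odds['loser_initial'] = odds['Loser'].str.split(' ').str[-1].str[0]
atp['loser_initial'] = atp['loser_name'].str[0]
tight = pd.merge(
    odds, atp,
    on=['winner_last_name', 'loser_last_name', 'loser_initial'],
    how='left',
    validate='one_to_one',  # raises MergeError if a key repeats on either side
)
print(len(loose), len(tight))
```

The first merge yields four rows and the second two. An initial would not separate the two Paris tournaments, which share players and location, so that case needs a different key.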


We need to distinguish the Paris Masters from the French Open, since both are played in Paris, and we should probably also join on round, because players who share a last name (for example, Feliciano and Marc Lopez) collide. Now let's look into the odds rows that could not be matched.

In [98]:
# Look into the missing data by location and year -- are certain tournaments missing as a whole?
missing = merged[merged['match_id'].isnull()]
missing.groupby(['join_location', 'year']).size()

join_location   year
ACAPULCO        2003     4
                2004     1
                2005     2
                2006     2
                2012     2
                2016     1
ADELAIDE        2005     1
                2007    24
AMERSFOORT      2005     2
                2006     2
AUCKLAND        2006    31
                2012    27
                2013     1
BARCELONA       2002     1
                2003     1
                2004     1
                2006     1
                2007     1
                2011     2
                2012     3
                2013     4
                2014     2
BASTAD          2012    27
                2013     3
                2014     2
BEIJING         2004     1
                2006     1
                2011     1
                2016    31
BOGOTA          2015    27
                        ..
ST. PETERSBURG  2013     1
STOCKHOLM       2003     1
                2004     1
                2012     1
STUTTGART       2004     3
SYDNEY 

In [127]:
missing[[
    'Winner',
    'winner_last_name',
    'Loser',
    'loser_last_name',
    'Tournament',
    'year',
    'join_location',
    'join_comment'
]].head(20)

Unnamed: 0,Winner,winner_last_name,Loser,loser_last_name,Tournament,year,join_location,join_comment
40,PAVEL A.,PAVEL,MARIN J.A.,MARIN,TATA Open,2002,CHENNAI,
387,SARETTA F.,SARETTA,MARIN J.A.,MARIN,Bellsouth Open,2002,VINA DEL MAR,
398,SARETTA F.,SARETTA,NALBANDIAN D.,NALBANDIAN,Bellsouth Open,2002,VINA DEL MAR,Retired
437,BJORKMAN J.,BJORKMAN,PHILIPPOUSSIS M.,PHILIPPOUSSIS,Kroger St. Jude,2002,MEMPHIS,
438,BLAKE J.,BLAKE,CHANG M.,CHANG,Kroger St. Jude,2002,MEMPHIS,
439,DAMM M.,DAMM,PLESS K.,PLESS,Kroger St. Jude,2002,MEMPHIS,
440,DENT T.,DENT,GINEPRI R.,GINEPRI,Kroger St. Jude,2002,MEMPHIS,
441,GAMBILL J.M.,GAMBILL,DAVYDENKO N.,DAVYDENKO,Kroger St. Jude,2002,MEMPHIS,
442,GIMELSTOB J.,GIMELSTOB,FISH M.,FISH,Kroger St. Jude,2002,MEMPHIS,
443,GOLDSTEIN P.,GOLDSTEIN,ARTHURS W.,ARTHURS,Kroger St. Jude,2002,MEMPHIS,


In [126]:
def find_jd_case(
    case,
    winner_last_name=True,
    loser_last_name=True,
    location=True,
    year=True
):
    conditions = pd.Series(True, index=jd.index)  # boolean mask, so jd[conditions] works even if every filter is off
    if winner_last_name:
        conditions = conditions & (jd['winner_last_name'] == case['winner_last_name'])
    if loser_last_name:
        conditions = conditions & (jd['loser_last_name'] == case['loser_last_name'])
    if location:
        conditions = conditions & (jd['join_location'] == case['join_location'])
    if year:
        conditions = conditions & (jd['year'] == case['year'])
    return jd[conditions][join_cols + ['round', 'score']]

find_jd_case(missing.iloc[3], location=True, year=True, winner_last_name=True, loser_last_name=False)

Unnamed: 0,join_comment,join_location,loser_last_name,winner_last_name,year,round,score


The scores are often very different!  Maybe we'll just join on names / tourney / year / round...

In [None]:
jd['match_id']

In [None]:
def get_tourney(x, idx):
    mapped = TOURNAMENT_MAPPING[x]
    if isinstance(mapped, str):
        return mapped if idx == 0 else None
    if idx >= len(mapped):
        return None
    else:
        return mapped[idx]

idx = 0
merged_dfs = []
while True:
    print(idx)
    cur_odds = odds_df.copy()
    cur_odds['tourney_title'] = odds_df['Tournament'].map(lambda x: get_tourney(x, idx))
    if cur_odds['tourney_title'].isnull().all():
        break
    merged_dfs.append(
        pd.merge(
            jd_2018,
            cur_odds,
            on=join_cols
        )
    )
    idx += 1

In [None]:
all_merged = pd.concat(merged_dfs)

In [None]:
all_merged.to_csv("./merged_2018.csv", index=False)

In [None]:
assert all_merged['match_id'].value_counts().max() == 1

In [None]:
missing = odds_df[~odds_df['odds_match_id'].isin(all_merged['odds_match_id'].tolist())]

In [None]:
missing.shape

In [None]:
missing['Comment'].value_counts()

In [None]:
missing[missing['Comment'] == 'Completed'][[
    'Winner', 'Loser', 'winner_last_name', 'loser_last_name'
]]

In [None]:
# What tournaments are ALWAYS missing!?
missing[~missing['Tournament'].isin(all_merged['Tournament'])]['Tournament'].value_counts()

In [None]:
missing[missing['Tournament'] == "CHENGDU OPEN"].iloc[0]

In [None]:
TOURNAMENT_MAPPING['FRENCH OPEN']

In [None]:
get_tourney('FRENCH OPEN', 0)

In [None]:
def inspect_match(w=None, l=None):
    if w is not None:
        rel = jd_2018[
            (jd_2018['winner_last_name'] == w)
        ]
    else:
        rel = jd_2018
    if l is not None:
        rel = rel[
            rel['loser_last_name'] == l  # filter on the loser argument, not the winner
        ]
    return rel[[
        'winner_last_name',
        'loser_last_name',
        'tourney_title'
    ] + score_cols]

inspect_match('POLANSKY')

In [None]:
ranked = all_merged[
    all_merged['WRank'].notnull() &
    all_merged['LRank'].notnull()
]

In [None]:
(ranked['WRank'] < ranked['LRank']).mean()

In [None]:
(ranked['B365W'] <= ranked['B365L']).mean()

In [None]:
(ranked['B365W'] < ranked['B365L']).mean()

In [None]:
import numpy as np

with_scores = jd[jd['score'].notnull()]
winner_sets = np.zeros(with_scores.shape[0])
for set_index in range(1, 6):
    winner_sets += (with_scores['W%d' % set_index] > with_scores['L%d' % set_index]).astype(int)

In [None]:
pd.Series(winner_sets).value_counts()

In [None]:
with_scores[winner_sets == 1][['score', 'tourney_url_suffix']]

In [None]:
with_scores[with_scores['score'].map(lambda x: 'W/O' in x)][['winner_name', 'loser_name']]

In [None]:
with_scores