# Overview

Once we have tabular odds (Betfair) and goal timings (footystats), we can identify goals in the odds data set.  
For the goal to be considered identified, the following criteria must be met:  
- A sharp jump in odds (prices), whose direction is consistent with the scoring team;  
- The timing must be compatible.  

Objective: identify goals in the odds data set.

# Setup

## Imports

In [267]:
# data wrangling
import numpy as np
import pandas as pd
import json
# math
import math
# processing time
from tqdm import tqdm
# graphs
import matplotlib.pyplot as plt

# Get goal timings (game-wise)

The `footystats` data set contains goal timings. Let us fetch the data set and parse the goal timings. 

In [268]:
FOOTY_EVENT_IDS_PATH = '../data/interim/footy_event_ids.csv'
footy = pd.read_csv(FOOTY_EVENT_IDS_PATH)
footy

Unnamed: 0,timestamp,date_GMT,status,home_team_name,away_team_name,home_team_goal_count,away_team_goal_count,home_team_goal_timings,away_team_goal_timings,country,country_code,event_id
0,1430589600,2015-05-02 18:00:00,complete,Olimpo,Estudiantes,0,0,,,argentina,AR,27433105.0
1,1430597700,2015-05-02 20:15:00,complete,Gimnasia La Plata,Newell's Old Boys,0,0,,,argentina,AR,27433103.0
2,1430601000,2015-05-02 21:10:00,complete,San Lorenzo,Vélez Sarsfield,1,0,83,,argentina,AR,27433106.0
3,1430605800,2015-05-02 22:30:00,complete,Argentinos Juniors,Aldosivi,0,1,,45'1,argentina,AR,
4,1430608500,2015-05-02 23:15:00,complete,Racing Club,Lanús,2,0,"72,90",,argentina,AR,27433114.0
...,...,...,...,...,...,...,...,...,...,...,...,...
36081,1583632800,2020-03-08 02:00:00,complete,Colorado Rapids,Orlando City,2,1,"64,90",82,usa,US,29700007.0
36082,1583636400,2020-03-08 03:00:00,complete,LA Galaxy,Vancouver Whitecaps,0,1,,74,usa,US,29700009.0
36083,1583636400,2020-03-08 03:00:00,complete,Seattle Sounders,Columbus Crew,1,1,79,33,usa,US,29700008.0
36084,1583708400,2020-03-08 23:00:00,complete,Portland Timbers,Nashville SC,1,0,12,,usa,US,29713521.0


Goal timings are stored as comma-separated strings, e.g. '29,48,69'.  
For goals scored during stoppage time, an apostrophe joins regular and stoppage time.  
So "45'1" refers to first-half stoppage time, and '46' refers to the beginning of the second half.  
Recall that Betfair timestamps are absolute timestamps.  
Once the kickoff time is identified, they give us a measure of 'minutes from kickoff'.  
In contrast, footystats goal timings are expressed in game time.  
Since the goals must ultimately be identified in the Betfair data set, we parse the footystats timings and add 15 min (halftime) to second-half goals.  
Further ahead, we account for stoppage time and possible delays in match restarts, but 15 min is a first baseline.
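To make the two time scales concrete, here is a minimal sketch of how absolute timestamps could be turned into minutes from kickoff. The notebook that builds the `min_from_ko` column is not shown here, so the function name `minutes_from_kickoff`, the flooring to whole minutes, and the toy timestamps are all assumptions for illustration.

```python
import pandas as pd

def minutes_from_kickoff(timestamps, kickoff):
    """Return whole minutes elapsed since kickoff (negative before kickoff)."""
    elapsed = pd.to_datetime(pd.Series(timestamps)) - pd.Timestamp(kickoff)
    # Floor to whole minutes (an assumption; the real pipeline may round differently).
    return (elapsed.dt.total_seconds() // 60).astype(int)

# Toy timestamps, not taken from the real data set.
kickoff = '2015-05-02 21:10:00'
timestamps = [
    '2015-05-02 21:09:30',  # before kickoff
    '2015-05-02 21:10:00',  # kickoff
    '2015-05-02 21:38:45',  # first half
    '2015-05-02 22:48:10',  # second half
]
min_from_ko = minutes_from_kickoff(timestamps, kickoff)
print(min_from_ko.tolist())  # [-1, 0, 28, 98]
```

With the 15 min baseline, a goal at game minute 83 is expected around 83 + 15 = 98 minutes from kickoff, which is the scale of the last toy value.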

In [269]:
def parse_goal_timings(goal_timings_str):
    """
    Parse a goal timings string from footystats data set.
    Add 15 minutes (halftime) to second half timings.
    Return an empty list if argument is nan.
    
    Args:
        goal_timings_str(str): goal timings string, such as "13,26,32,45'3,71,90'2".
        
    Return:
        goal_timings(list): list of integers.
    """
    if isinstance(goal_timings_str, float):
        if math.isnan(goal_timings_str):
            return []
        else:
            raise TypeError("argument must be a string or nan.")
    goal_timings = []
    for s in goal_timings_str.split(','):
        if "'" in s:
            t = int(s.split("'")[0]) + 15 * (int(s.split("'")[0]) > 45) + int(s.split("'")[1])
        else:
            t = int(s) + 15 * (int(s) > 45)
        goal_timings.append(t)
    return goal_timings

Example.

In [270]:
parse_goal_timings("29,45'2,48,69")

[29, 47, 63, 84]

Parse relevant columns.

In [271]:
footy['home_team_goal_timings_parsed'] = footy['home_team_goal_timings'].apply(parse_goal_timings)
footy['away_team_goal_timings_parsed'] = footy['away_team_goal_timings'].apply(parse_goal_timings)

Check for missing goal timing info: missing goal timings are recorded as `-1`, and we drop the affected matches.

In [272]:
mask = footy['home_team_goal_timings_parsed'].apply(lambda x: -1 not in x) & footy['away_team_goal_timings_parsed'].apply(lambda x: -1 not in x)

In [273]:
footy = footy[mask].copy()
footy.shape

(36071, 14)

Each goal must be uniquely identified by its minute within a match.  
On very rare occasions, a second goal is scored within the same 'minute'.  
We exclude those matches.

In [274]:
no_goals_in_same_minute = []
for home_team_goal_timings_parsed_curr, away_team_goal_timings_parsed_curr in zip(footy['home_team_goal_timings_parsed'], footy['away_team_goal_timings_parsed']):
    no_goals_in_same_minute.append(len(set(home_team_goal_timings_parsed_curr + away_team_goal_timings_parsed_curr)) == 
                                   len(home_team_goal_timings_parsed_curr + away_team_goal_timings_parsed_curr))

In [275]:
footy = footy[no_goals_in_same_minute].copy()

In [276]:
footy.shape

(36058, 14)
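The same-minute filter above can also be written as a small helper, which makes the rule easier to test in isolation. This is only a sketch: the name `has_unique_goal_minutes` is hypothetical and the inputs are toy lists, not rows from the data set.

```python
def has_unique_goal_minutes(home_timings, away_timings):
    """True if no two goals of the match share the same parsed minute."""
    all_timings = list(home_timings) + list(away_timings)
    # A set drops duplicates, so equal lengths mean every minute is unique.
    return len(set(all_timings)) == len(all_timings)

print(has_unique_goal_minutes([10, 47], [63]))  # True
print(has_unique_goal_minutes([10], [10]))      # False: both teams score in minute 10
print(has_unique_goal_minutes([], []))          # True: goalless match
```

Applied row-wise to the two parsed columns, it would yield the same boolean list as the loop above.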

In addition to parsing, let us get useful information for each goal:  
- scoring team;
- match score before the goal.

In [277]:
time_scoring_team_previous_score_tuples = []

for home_team_goal_timings_parsed_cur, away_team_goal_timings_parsed_cur in zip(footy['home_team_goal_timings_parsed'], footy['away_team_goal_timings_parsed']):
    home_team_goal_timings_tuples_cur = [(t, 'home') for t in home_team_goal_timings_parsed_cur]
    away_team_goal_timings_tuples_cur = [(t, 'away') for t in away_team_goal_timings_parsed_cur]
    time_scoring_team_tuples_cur = sorted(home_team_goal_timings_tuples_cur + away_team_goal_timings_tuples_cur)

    time_scoring_team_previous_score_tuples_cur = []
    score_home = 0
    score_away = 0    
    
    for t, scoring_team in time_scoring_team_tuples_cur:
        new_tup = (t, scoring_team, (score_home, score_away))
        if scoring_team == 'home':
            score_home += 1
        else:
            score_away += 1
        time_scoring_team_previous_score_tuples_cur.append(new_tup)
    time_scoring_team_previous_score_tuples.append(time_scoring_team_previous_score_tuples_cur)

In [278]:
time_scoring_team_previous_score_tuples[:10]

[[],
 [],
 [(98, 'home', (0, 0))],
 [(46, 'away', (0, 0))],
 [(87, 'home', (0, 0)), (105, 'home', (1, 0))],
 [(46, 'away', (0, 0)), (67, 'home', (0, 1))],
 [(16, 'home', (0, 0)), (26, 'home', (1, 0)), (66, 'away', (2, 0))],
 [(15, 'away', (0, 0)), (83, 'home', (0, 1))],
 [(33, 'home', (0, 0)), (77, 'home', (1, 0))],
 [(99, 'home', (0, 0)), (102, 'home', (1, 0))]]

In [279]:
footy['time_scoring_team_previous_score_tuples'] = time_scoring_team_previous_score_tuples

Filter records with goals.

In [280]:
footy_with_goals = footy[footy['home_team_goal_count'] + footy['away_team_goal_count'] > 0]

In [281]:
footy_with_goals

Unnamed: 0,timestamp,date_GMT,status,home_team_name,away_team_name,home_team_goal_count,away_team_goal_count,home_team_goal_timings,away_team_goal_timings,country,country_code,event_id,home_team_goal_timings_parsed,away_team_goal_timings_parsed,time_scoring_team_previous_score_tuples
2,1430601000,2015-05-02 21:10:00,complete,San Lorenzo,Vélez Sarsfield,1,0,83,,argentina,AR,27433106.0,[98],[],"[(98, home, (0, 0))]"
3,1430605800,2015-05-02 22:30:00,complete,Argentinos Juniors,Aldosivi,0,1,,45'1,argentina,AR,,[],[46],"[(46, away, (0, 0))]"
4,1430608500,2015-05-02 23:15:00,complete,Racing Club,Lanús,2,0,"72,90",,argentina,AR,27433114.0,"[87, 105]",[],"[(87, home, (0, 0)), (105, home, (1, 0))]"
5,1430609400,2015-05-02 23:30:00,complete,Rosario Central,Huracán,1,1,52,45'1,argentina,AR,27433116.0,[67],[46],"[(46, away, (0, 0)), (67, home, (0, 1))]"
6,1430676000,2015-05-03 18:00:00,complete,Atlético Rafaela,Defensa y Justicia,2,1,"16,26",51,argentina,AR,27433110.0,"[16, 26]",[66],"[(16, home, (0, 0)), (26, home, (1, 0)), (66, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36081,1583632800,2020-03-08 02:00:00,complete,Colorado Rapids,Orlando City,2,1,"64,90",82,usa,US,29700007.0,"[79, 105]",[97],"[(79, home, (0, 0)), (97, away, (1, 0)), (105,..."
36082,1583636400,2020-03-08 03:00:00,complete,LA Galaxy,Vancouver Whitecaps,0,1,,74,usa,US,29700009.0,[],[89],"[(89, away, (0, 0))]"
36083,1583636400,2020-03-08 03:00:00,complete,Seattle Sounders,Columbus Crew,1,1,79,33,usa,US,29700008.0,[94],[33],"[(33, away, (0, 0)), (94, home, (0, 1))]"
36084,1583708400,2020-03-08 23:00:00,complete,Portland Timbers,Nashville SC,1,0,12,,usa,US,29713521.0,[12],[],"[(12, home, (0, 0))]"


# Find goals

## Method

**candidates**  
Candidate records for goal identification are those showing a sharp price jump.  
Based on an analysis of price behavior over a reasonable sample of matches, the change in log-odds is the variable of interest.  
That analysis, which is outside the scope of this notebook, also yields a threshold (`ABS_LOGODDS_DIFF_THRESHOLD`).  
Records whose absolute change in log-odds exceeds the threshold are candidates.  
We analyze prices for both the `home team win` and the `away team win` bets.
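As an illustration of the candidate rule, the sketch below flags sharp jumps in a toy price series. The log-odds definition used here (implied probability `1 / odds`, then `log(p / (1 - p))`) is an assumption, since the processing notebook is not shown; it does agree with the sign convention of the matching code, where a goal for the team behind the market gives a positive change. The function name `flag_candidates` and the toy prices are hypothetical.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.5  # mirrors the value of ABS_LOGODDS_DIFF_THRESHOLD in this notebook

def flag_candidates(odds, threshold=THRESHOLD):
    """Flag records whose absolute log-odds change exceeds the threshold."""
    odds = pd.Series(odds, dtype=float)
    prob = 1.0 / odds                        # implied probability (assumed definition)
    logodds = np.log(prob / (1.0 - prob))
    logodds_diff = logodds.diff()            # change versus the previous record
    return pd.DataFrame({
        'odds': odds,
        'logodds_diff': logodds_diff,
        'is_candidate': logodds_diff.abs() > threshold,
    })

# Toy decimal odds for a 'home team win' market: the price drops sharply
# at the third record, as it would after a home goal.
candidates = flag_candidates([2.0, 2.05, 1.4, 1.38])
print(candidates)
```

Only the third record is flagged, and its log-odds change is positive, which is the direction expected for a goal by the team the market refers to.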

**matching criteria**  
Candidates are checked progressively for each goal.  
Goals must be identified in order, i.e., if a goal is not found after exhausting all candidates, later goals are not searched for.  
For first-half goals, market minutes from kickoff and game timing must match, allowing for a difference of one minute due to rounding.  
For the first second-half goal, we allow a gap of up to 15 min to account for stoppage time and restart delays.  
For subsequent goals, the gap is then known, so the time difference must equal that gap, again allowing for a difference of one minute due to rounding.  
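The timing rule can be restated as a standalone predicate. This is a simplified sketch of the criteria above, not the implementation used below: the names `gap_is_compatible` and `second_half_gap` are hypothetical, and the direction check on the price jump is left out.

```python
SECOND_HALF_START = 60  # parsed timings above 60 are second-half goals (46 + 15 = 61)

def gap_is_compatible(t_candidate, t_footy, second_half_gap=None):
    """Check whether a candidate minute is compatible with a goal timing.

    second_half_gap is the gap measured for the first second-half goal,
    or None if no second-half goal has been matched yet.
    """
    gap = t_candidate - t_footy
    if t_footy <= SECOND_HALF_START:
        return -1 <= gap <= 1                 # first half: one minute tolerance
    if second_half_gap is None:
        return -1 <= gap <= 15                # first second-half goal: wide window
    return -1 <= gap - second_half_gap <= 1   # later goals: known gap, one minute tolerance

print(gap_is_compatible(9, 10))                       # True
print(gap_is_compatible(74, 71))                      # True, gap of 3 is within 15
print(gap_is_compatible(89, 86, second_half_gap=3))   # True, same gap of 3
print(gap_is_compatible(95, 86, second_half_gap=3))   # False, gap of 9 differs from 3
```

Keeping the gap of the first second-half goal as explicit state is what lets later goals be matched with the same one-minute tolerance as first-half goals.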

In [282]:
# constants
ABS_LOGODDS_DIFF_THRESHOLD = 0.5

def find_goals(bf_df, footy, win_market, event_id):
    """
    Find goals in market data set.
    
    Args:
        bf_df(pandas DataFrame): relevant processed odds DataFrame.
        footy(pandas DataFrame): relevant goal timings DataFrame.
        win_market(str): the bet market whose prices will be used. 'home' or 'away'.
        event_id(int): Betfair event id of the match.
        
    Returns:
        found_goals_bool_list(list): list of booleans indicating if goals were found.
        t_bf_list(list): minutes from kickoff of searched goals.
        t_footy_list(list): game timings of searched goals.
        scoring_team_list(list): scoring team for each goal.
        previous_score_list(list): score before the goal.
    """
    
    # goal timings, scoring_team, previous_score
    try:
        time_scoring_team_previous_score_tuples  = footy[footy['event_id'] == event_id]['time_scoring_team_previous_score_tuples'].iloc[0]
    except IndexError as e:
        print(f'event id: {event_id}, goal data not found. {e}')
        # return empty lists so that callers can still unpack five values
        return [], [], [], [], []

    # candidates
    df = bf_df[bf_df['event_id'] == event_id]
    df_goal_candidates = df[(df[f'abs_logodds_{win_market}_diff'] > ABS_LOGODDS_DIFF_THRESHOLD) & (df['min_from_ko'] > 0)]
        
    n_candidates = len(df_goal_candidates)
    # control variables
    next_unchecked_candidate_index = 0
    any_goal_found_in_second_half = 0
    gap_first_goal_in_second_half = 0

    found_goals_bool_list = []
    t_bf_list = []
    t_footy_list = []
    scoring_team_list = []
    previous_score_list = []

    for t_footy, scoring_team, previous_score in time_scoring_team_previous_score_tuples:
        found_goal = 0
        t_bf = np.nan
        if (t_footy <= 60) | (any_goal_found_in_second_half == 1):
            time_range = range(-1, 1+1)
        else:
            time_range = range(-1, 15+1)
        while (not found_goal) & (next_unchecked_candidate_index < n_candidates):
            t_candidate = df_goal_candidates['min_from_ko'].iloc[next_unchecked_candidate_index]
            logodds_diff_candidate = df_goal_candidates[f'logodds_{win_market}_diff'].iloc[next_unchecked_candidate_index]
            gap = t_candidate - t_footy
            goal_direction_indicator = 2 * (scoring_team == win_market) - 1
            if ((gap - gap_first_goal_in_second_half) in time_range) & (goal_direction_indicator * logodds_diff_candidate > 0):
                found_goal = 1
                t_bf = t_candidate
                if (gap_first_goal_in_second_half == 0) & (t_footy > 60):
                    any_goal_found_in_second_half = 1
                    gap_first_goal_in_second_half = gap
            next_unchecked_candidate_index += 1
        found_goals_bool_list.append(found_goal)
        t_bf_list.append(t_bf)
        t_footy_list.append(t_footy)
        scoring_team_list.append(scoring_team)
        previous_score_list.append(previous_score)
        
    return found_goals_bool_list, t_bf_list, t_footy_list, scoring_team_list, previous_score_list

Let us see an example.

In [283]:
PROCESSED_TABULAR_ODDS_2015_PATH = '../data/interim/processed_tabular_odds_2015.csv'
bf_df = pd.read_csv(PROCESSED_TABULAR_ODDS_2015_PATH)

In [284]:
find_goals(bf_df, footy_with_goals, 'home', 27448838)

([1, 1, 1, 1, 1],
 [9, 10, 28, 74, 89],
 [10, 11, 29, 71, 86],
 ['home', 'home', 'away', 'home', 'away'],
 [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1)])

In [285]:
find_goals(bf_df, footy, 'away', 27448838)

([1, 1, 1, 1, 1],
 [9, 10, 29, 75, 89],
 [10, 11, 29, 71, 86],
 ['home', 'home', 'away', 'home', 'away'],
 [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1)])

# Run for all years

In [212]:
for year in range(2015, 2020+1):
    print(f'Finding goals for year {year}...')    
    processed_tabular_odds_path = f'../data/interim/processed_tabular_odds_{year}.csv'
    bf_df = pd.read_csv(processed_tabular_odds_path)
    bf_df_event_ids = bf_df['event_id'].values
    footy_with_goals_event_ids = footy_with_goals['event_id'].dropna().astype(int).values
    event_ids_with_goals = set(bf_df_event_ids).intersection(set(footy_with_goals_event_ids))
    found_goals_bool_list, t_bf_list, t_footy_list, win_market_list, event_id_list, scoring_team_list, previous_score_list = [], [], [], [], [], [], []
    for event_id in tqdm(event_ids_with_goals):
        for win_market in ['home', 'away']:
            found_goals_bool_list_cur, t_bf_list_cur, t_footy_list_cur, scoring_team_list_cur, previous_score_list_cur = find_goals(
                bf_df, footy_with_goals, win_market, event_id)
            assert len(found_goals_bool_list_cur) == len(t_bf_list_cur) == len(t_footy_list_cur)
            n_goals = len(t_footy_list_cur)
            found_goals_bool_list.extend(found_goals_bool_list_cur)
            t_bf_list.extend(t_bf_list_cur)
            t_footy_list.extend(t_footy_list_cur)
            win_market_list.extend([win_market] * n_goals)
            event_id_list.extend([event_id] * n_goals)
            scoring_team_list.extend(scoring_team_list_cur)
            previous_score_list.extend(previous_score_list_cur)
    goal_search_results = pd.DataFrame({'year': year,
                                       'event_id': event_id_list,
                                       't_footy': t_footy_list,
                                       'win_market': win_market_list,
                                       'found_goals_bool': found_goals_bool_list,
                                       't_bf': t_bf_list,
                                       'scoring_team': scoring_team_list,
                                       'previous_score': previous_score_list})
    export_filepath = f'../data/interim/goal_search_results_{year}.csv'
    goal_search_results.to_csv(export_filepath)
    print(f'Results for year {year} exported.')

Finding goals for year 2015...


100%|█████████████████████████████████████████████████████████████████████████████| 3488/3488 [00:32<00:00, 106.99it/s]


Results for year 2015 exported.
Finding goals for year 2016...


100%|██████████████████████████████████████████████████████████████████████████████| 5917/5917 [01:15<00:00, 78.09it/s]


Results for year 2016 exported.
Finding goals for year 2017...


100%|██████████████████████████████████████████████████████████████████████████████| 5931/5931 [01:06<00:00, 89.65it/s]


Results for year 2017 exported.
Finding goals for year 2018...


100%|██████████████████████████████████████████████████████████████████████████████| 5346/5346 [00:58<00:00, 91.24it/s]


Results for year 2018 exported.
Finding goals for year 2019...


100%|██████████████████████████████████████████████████████████████████████████████| 6397/6397 [01:34<00:00, 68.02it/s]


Results for year 2019 exported.
Finding goals for year 2020...


100%|█████████████████████████████████████████████████████████████████████████████| 1717/1717 [00:14<00:00, 119.21it/s]


Results for year 2020 exported.
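Once the results are exported, a natural first check is the share of searched goals that were found in each market. The sketch below uses a toy DataFrame with the same column names as the export; the helper name `found_rate_by_market` and the toy values are hypothetical and say nothing about the real results.

```python
import pandas as pd

def found_rate_by_market(results):
    """Share of searched goals that were found, per win market."""
    return results.groupby('win_market')['found_goals_bool'].mean()

# Toy rows with the same column names as the exported files.
toy_results = pd.DataFrame({
    'event_id': [1, 1, 1, 1, 2, 2],
    'win_market': ['home', 'home', 'away', 'away', 'home', 'away'],
    'found_goals_bool': [1, 0, 1, 1, 1, 1],
})
rates = found_rate_by_market(toy_results)
print(rates)
```

On real data, the input would come from something like `pd.read_csv('../data/interim/goal_search_results_2015.csv', index_col=0)`. Note that `previous_score` is written to CSV as text, so it comes back as a string such as `"(1, 0)"` rather than a tuple.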
