# Overview

TODO: explain better. Use text in merge_betfair_footystats...

Context:
- there is no predefined `key` linking the game and the market data sets
- we need to build a key using team names and the datetime of the event
- our key will be "`home team-country`, `away team-country`, `datetime`"
- `country` is needed because there are teams with the same name in different countries (e.g., Arsenal-EN, Arsenal-AR)
- so, we need to match team names from the game and the market data sets
- for example, we need to identify that `Manchester United` in the game data set corresponds to `Man U` in the market data set
- in this notebook we prepare a table with the most similar team names from the market data set for each team name in the game data set
- the table will then be used for a final, manual team name matching step
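
The key described in the list above can be sketched as follows. This is a minimal illustration, not the notebook's final implementation: the helper names (`parse_unix_timestamp`, `parse_betfair_datetime`, `make_match_key`) are hypothetical, the team names are made up, and it assumes the kickoff time is available as a Unix timestamp on the game side and as an ISO 8601 UTC string on the market side, as in the data samples shown later in this notebook.

```python
from datetime import datetime, timezone

def parse_unix_timestamp(ts):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

def parse_betfair_datetime(text):
    """Parse a string such as '2019-10-20T15:30:00.000Z' as a UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

def make_match_key(home_team, away_team, country_code, kickoff):
    """Build the (home team-country, away team-country, datetime) key."""
    return (f"{home_team}-{country_code}", f"{away_team}-{country_code}", kickoff)

# Illustrative values only: the same match as seen from each data set,
# after team names have been reconciled.
game_key = make_match_key("Arsenal", "Chelsea", "GB", parse_unix_timestamp(1475830200))
market_key = make_match_key("Arsenal", "Chelsea", "GB",
                            parse_betfair_datetime("2016-10-07T08:50:00.000Z"))
print(game_key == market_key)
```

Normalizing both sides to timezone-aware UTC datetimes keeps the comparison independent of how each source formats its timestamps.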

Objectives:
- get team names from the game and the market data sets
- for each team name in the game data set, get the most similar names in the market data set
- prepare a user-friendly table to be manually checked for final team name matching

Notes:
- the same team may appear with different names in the market data set

# Setup

## Imports

In [1]:
import pandas as pd
import os
import bz2
import json
from tqdm import tqdm
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pickle

## Team names in the game data set (FootyStats)

The game data set is our reference for names to be matched, as it defines the relevant set of matches.  
The market data set is much larger and was downloaded in bulk, so we will extract data from there based on the game data set matches.  
So here we build a list of the reference team names that will be linked to one or more names/aliases found in the market data set.

In [2]:
FOOTYSTATS_FILEPATH = '../data/raw/footystats/footystats.csv'

In [3]:
footy = pd.read_csv(FOOTYSTATS_FILEPATH)

In [4]:
footy

Unnamed: 0,timestamp,date_GMT,status,home_team_name,away_team_name,home_team_goal_count,away_team_goal_count,home_team_goal_timings,away_team_goal_timings,country
0,1475830200,2016-10-07 08:50:00,complete,Brisbane Roar,Melbourne Victory FC,1,1,90'6,83,australia
1,1475908500,2016-10-08 06:35:00,complete,Wellington Phoenix,Melbourne City FC,0,1,,31,australia
2,1475916600,2016-10-08 08:50:00,complete,Western Sydney Wanderers,Sydney FC,0,4,,51558589,australia
3,1475924400,2016-10-08 11:00:00,complete,Perth Glory FC,Central Coast Mariners,3,3,32735,568486,australia
4,1475992800,2016-10-09 06:00:00,complete,Newcastle Jets FC,Adelaide United,1,1,17,29,australia
...,...,...,...,...,...,...,...,...,...,...
36081,1495373400,2017-05-21 13:30:00,complete,Kaiserslautern,Nürnberg,1,0,20,,germany
36082,1495373400,2017-05-21 13:30:00,complete,Greuther Fürth,Union Berlin,1,2,66,3878,germany
36083,1495373400,2017-05-21 13:30:00,complete,Fortuna Düsseldorf,Erzgebirge Aue,1,0,39,,germany
36084,1495373400,2017-05-21 13:30:00,complete,Sandhausen,Hannover 96,1,1,57,60,germany


TODO: Explain better. Why map to GB?

Get country codes that match the market data set country identification codes.  
Note that both England and Scotland map to `GB`.

In [5]:
footy_country_codes = pd.read_csv('config/footy_country_codes.csv')
footy_country_codes

Unnamed: 0,country_name,country_code
0,argentina,AR
1,australia,AU
2,brazil,BR
3,china,CN
4,england,GB
5,france,FR
6,germany,DE
7,israel,IL
8,italy,IT
9,japan,JP


In [6]:
footy = pd.merge(footy, footy_country_codes, how='inner', left_on='country', right_on='country_name').copy()
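
Because the merge above uses `how='inner'`, games from any country missing in the code table are dropped without warning. A quick check before merging could look like the sketch below. It uses plain Python sets on made-up inputs; `countries_without_code` is a hypothetical helper, not part of the notebook.

```python
def countries_without_code(game_countries, code_table):
    """Return the countries in the game data that have no entry in the code table."""
    known = {row["country_name"] for row in code_table}
    return sorted(set(game_countries) - known)

# Made-up inputs that mimic the two tables.
code_table = [
    {"country_name": "argentina", "country_code": "AR"},
    {"country_name": "england", "country_code": "GB"},
]
game_countries = ["argentina", "england", "canada", "england"]

missing = countries_without_code(game_countries, code_table)
print(missing)
```

With pandas, a similar check is possible by merging with `how='left', indicator=True` and inspecting the rows whose `_merge` column equals `left_only`.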

In [7]:
footy_team_names = pd.concat([
    footy[['home_team_name', 'country', 'country_code']].rename(columns={'home_team_name': 'team_name'}),
    footy[['away_team_name', 'country', 'country_code']].rename(columns={'away_team_name': 'team_name'}),
]).drop_duplicates().reset_index(drop=True)
footy_team_names

Unnamed: 0,team_name,country,country_code
0,Brisbane Roar,australia,AU
1,Wellington Phoenix,australia,AU
2,Western Sydney Wanderers,australia,AU
3,Perth Glory FC,australia,AU
4,Newcastle Jets FC,australia,AU
...,...,...,...
534,CSA,brazil,BR
535,Fortaleza,brazil,BR
536,Santa Cruz,brazil,BR
537,Yokohama,japan,JP


## Team names in the market data set (Betfair)

The raw market data set is large and includes many more matches than the relevant matches found in the game data set.  
We therefore need to screen and get metadata on the market data set.  
The metadata will be useful in this notebook for team name matching, and will also be useful in a downstream notebook when we extract market data per se for selected relevant matches. 


TODO: deal with Canada.

### get metadata on the market data set

In [8]:
BF_RAW_DATA_FOLDERPATH = '../data/raw/betfair/xds_nfs'

Get all filepaths for market data files (.bz2) under the Betfair raw data folder.

In [9]:
def get_bz2_filepaths(directory_path):
    """Get all filepaths of .bz2 files under a given folder.
    
        Args
            directory_path(str): The path to the folder to be screened.
        Returns
            bz2_filepaths(list): A list with all filepaths of .bz2 files under directory_path.
    """
    bz2_filepaths = []
    for dirname, dirs, files in os.walk(directory_path):
        for filename in files:
            if filename.endswith('.bz2'):
                bz2_filepaths.append(os.path.join(dirname, filename))
    return bz2_filepaths

In [10]:
bz2_filepaths = get_bz2_filepaths(BF_RAW_DATA_FOLDERPATH)

Let's check how many filepaths we got. The number should be close to 300k.  
This number is much larger than the number of relevant matches in the game data set (~36k) because (1) we downloaded market data in bulk, including matches of youth teams, women's teams, etc., and (2) the market data for a single match may be split into more than one file.  
In a downstream notebook, we extract market data for the relevant matches.  
Here we extract team names to build the key to JOIN the game and the market data sets.

In [11]:
len(bz2_filepaths)

303894
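
To see how many files belong to each match, the filepaths can be grouped by event. The sketch below assumes that the folder directly above each file is the Betfair event id, which is what the sample path and sample record shown further down suggest (folder `27511102`, `eventId` `'27511102'`). `group_filepaths_by_event` is a hypothetical helper, and the second path in the example is made up.

```python
from collections import defaultdict

def group_filepaths_by_event(filepaths):
    """Group market filepaths by the name of their parent folder (assumed event id)."""
    groups = defaultdict(list)
    for filepath in filepaths:
        event_id = filepath.split('/')[-2]  # parent folder name
        groups[event_id].append(filepath)
    return dict(groups)

sample_filepaths = [
    '../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27511102/1.119998414.bz2',
    # made-up second market file for the same event
    '../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27511102/1.119998415.bz2',
    '../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27510736/1.119986613.bz2',
]

files_per_event = {event_id: len(paths)
                   for event_id, paths in group_filepaths_by_event(sample_filepaths).items()}
print(files_per_event)
```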

Let us check a concrete example and build functions to extract the metadata according to the JSON structure of the files.

In [12]:
bz2_filepath = bz2_filepaths[0]
bz2_filepath

'../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27511102/1.119998414.bz2'

In [13]:
str_from_bz2 = bz2.BZ2File(bz2_filepath).read().decode("utf-8")

In [14]:
records_from_bz2 = [json.loads(line) for line in str_from_bz2.splitlines()]

In [15]:
len(records_from_bz2)

198
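
The cells above read the whole file into memory before splitting it into lines. As an alternative sketch, `bz2.open` in text mode lets us decode one line at a time, which may be convenient when only the first few records are needed. `iter_records` is a hypothetical helper, and the demo file below is written on the fly with made-up records.

```python
import bz2
import json
import os
import tempfile

def iter_records(bz2_filepath):
    """Yield one decoded JSON record per non-empty line of a .bz2 file."""
    with bz2.open(bz2_filepath, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Round trip on a small temporary file with made-up records.
demo_records = [{"op": "mcm", "pt": 1}, {"op": "mcm", "pt": 2}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bz2")
    with bz2.open(demo_path, mode="wt", encoding="utf-8") as fh:
        for rec in demo_records:
            fh.write(json.dumps(rec) + "\n")
    read_back = list(iter_records(demo_path))

print(read_back == demo_records)
```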

In [16]:
record = records_from_bz2[0]
record

{'op': 'mcm',
 'clk': '1118582166',
 'pt': 1439390937696,
 'mc': [{'id': '1.119998414',
   'marketDefinition': {'bspMarket': False,
    'turnInPlayEnabled': True,
    'persistenceEnabled': True,
    'marketBaseRate': 5.0,
    'eventId': '27511102',
    'eventTypeId': '1',
    'numberOfWinners': 1,
    'bettingType': 'ODDS',
    'marketType': 'MATCH_ODDS',
    'marketTime': '2015-08-15T18:30:00.000Z',
    'suspendTime': '2015-08-15T18:30:00.000Z',
    'bspReconciled': False,
    'complete': True,
    'inPlay': False,
    'crossMatching': True,
    'runnersVoidable': False,
    'numberOfActiveRunners': 3,
    'betDelay': 0,
    'status': 'OPEN',
    'runners': [{'status': 'ACTIVE',
      'sortPriority': 1,
      'id': 201262,
      'name': 'Catania'},
     {'status': 'ACTIVE', 'sortPriority': 2, 'id': 501219, 'name': 'Cesena'},
     {'status': 'ACTIVE', 'sortPriority': 3, 'id': 58805, 'name': 'The Draw'}],
    'regulators': ['MR_INT'],
    'countryCode': 'IT',
    'discountAllowed': Fals

By observing the JSON structure and verifying the Betfair documentation, we write a function to extract metadata for a given bz2 file.

In [17]:
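def get_match_metadata(records_from_bz2):
    """Extract match metadata from the records of a single market file.

        Uses the first record that carries a marketDefinition with runners.

        Args
            records_from_bz2(list): The decoded JSON records of one .bz2 file.
        Returns
            match_metadata(dict): eventId, openDate, home_team, away_team and countryCode.
                Empty if no record has a marketDefinition with runners.
    """
    match_metadata = {}

    for record in records_from_bz2:
        mc = record['mc'][0]
        market_definition = mc.get('marketDefinition', {})
        if 'runners' in market_definition:
            # runners are assumed to be listed as home team, away team, 'The Draw'
            match_metadata['eventId'] = market_definition['eventId']
            match_metadata['openDate'] = market_definition['openDate']
            match_metadata['home_team'] = market_definition['runners'][0]['name']
            match_metadata['away_team'] = market_definition['runners'][1]['name']
            match_metadata['countryCode'] = market_definition['countryCode']
            break
    return match_metadata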
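
The function above takes the home and away teams from the first two positions of the `runners` list. A slightly more defensive variant, sketched below, orders the runners by `sortPriority` first. This rests on the assumption that `sortPriority` 1 is the home team and 2 is the away team, which is consistent with the sample record (Catania 1, Cesena 2, The Draw 3) but is not confirmed here. `home_and_away_from_runners` is a hypothetical helper.

```python
def home_and_away_from_runners(runners):
    """Return (home, away) names using sortPriority instead of list position."""
    ordered = sorted(runners, key=lambda runner: runner["sortPriority"])
    return ordered[0]["name"], ordered[1]["name"]

# Runners copied from the sample record, deliberately shuffled.
runners = [
    {"status": "ACTIVE", "sortPriority": 3, "id": 58805, "name": "The Draw"},
    {"status": "ACTIVE", "sortPriority": 2, "id": 501219, "name": "Cesena"},
    {"status": "ACTIVE", "sortPriority": 1, "id": 201262, "name": "Catania"},
]

home_team, away_team = home_and_away_from_runners(runners)
print(home_team, away_team)
```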

We then write code to loop over all files, get metadata and extract team names, excluding matches that are certainly not relevant, such as those involving youth and women's teams. 

In [18]:
def get_metadata_and_team_names(bz2_filepaths, suffixes_for_exclusion):
    """Get metadata and team names for a given list of bz2 filepaths.
    
        Excludes matches in which either team name ends with a blacklisted suffix.
        Example: 'Barcelona U23' is a youth team, not relevant for our project.
        
        Args:
            bz2_filepaths(list): list of bz2 filepaths
            suffixes_for_exclusion(list): blacklisted suffixes
            
        Returns:
            all_match_metadata(dict): a dict with a list of dictionaries per key (year, month).
                Example inner dict: {'eventId': '29529315',
                                      'openDate': '2019-10-20T15:30:00.000Z',
                                      'home_team': 'Catanzaro',
                                      'away_team': 'Potenza',
                                      'countryCode': 'IT'}
            bf_team_names(dict): a dict with a list of team names per key (country code).
                These are all the teams that played in a match that occurred in the country at hand.
                As there are international matches and we cannot filter out international away teams,
                these teams will also show up. Domestic teams that appear only as visitors are kept too.
                So we conservatively keep all names and check them manually.
            error_count (int): number of possibly corrupted files that were not properly read and decoded.
    """
    
    
    all_match_metadata = {}
    error_bz2_filepaths = {}
    error_count = 0
    bf_team_names = {}

    for bz2_filepath in tqdm(bz2_filepaths):
        year = bz2_filepath.split('/')[-5]  # known from filepath structure
        month = bz2_filepath.split('/')[-4]
        try:
            str_from_bz2 = bz2.BZ2File(bz2_filepath).read().decode("utf-8")
            records_from_bz2 = [json.loads(line) for line in str_from_bz2.splitlines()]
            match_metadata = get_match_metadata(records_from_bz2)
            is_suffix_excludable = False
            for suffix in suffixes_for_exclusion:
                if match_metadata['home_team'].endswith(suffix) or match_metadata['away_team'].endswith(suffix):
                    is_suffix_excludable = True
                    break  # one matching suffix is enough
            if is_suffix_excludable:
                continue
            for team_name in [match_metadata['home_team'], match_metadata['away_team']]:
                if match_metadata['countryCode'] not in bf_team_names:
                    bf_team_names[match_metadata['countryCode']] = []
                if team_name not in bf_team_names[match_metadata['countryCode']]:
                    bf_team_names[match_metadata['countryCode']].append(team_name)
            if (year, month) not in all_match_metadata.keys():
                all_match_metadata[(year, month)] = []
            all_match_metadata[(year, month)].append(match_metadata)
        except Exception:  # unreadable, undecodable or incomplete files are counted as errors
            if (year, month) not in error_bz2_filepaths.keys():
                error_bz2_filepaths[(year, month)] = []
            error_bz2_filepaths[(year, month)].append(bz2_filepath)
            error_count += 1
    return all_match_metadata, bf_team_names, error_count

TODO: explain exclusion

In [19]:
SUFFIXES_FOR_EXCLUSION = ['(W)', 'U21', 'U20', 'U19', 'U23', '(Y)', '(Res)']
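
The suffix test used inside `get_metadata_and_team_names` can be isolated in a small helper, as sketched below. The helper names are hypothetical. The sketch relies on `str.endswith` accepting a tuple of suffixes, which removes the need for an explicit loop. The suffixes presumably mark women's teams (`(W)`), youth age groups (`U19` to `U23`, `(Y)`) and reserve teams (`(Res)`).

```python
SUFFIXES_FOR_EXCLUSION = ['(W)', 'U21', 'U20', 'U19', 'U23', '(Y)', '(Res)']

def has_excluded_suffix(team_name, suffixes):
    """True if the team name ends with any of the given suffixes."""
    return team_name.endswith(tuple(suffixes))

def is_match_excluded(home_team, away_team, suffixes):
    """True if either team of the match carries an excluded suffix."""
    return has_excluded_suffix(home_team, suffixes) or has_excluded_suffix(away_team, suffixes)

print(is_match_excluded('Barcelona U23', 'Girona', SUFFIXES_FOR_EXCLUSION))
print(is_match_excluded('Catanzaro', 'Potenza', SUFFIXES_FOR_EXCLUSION))
```

Suffix matching is exact, so a truncated name such as `Corinthians Paulista U2`, which appears in the fuzzy-matching output later in this notebook, is not caught by this filter and has to be rejected in the manual check.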

In [20]:
all_match_metadata, bf_team_names, error_count = get_metadata_and_team_names(bz2_filepaths, SUFFIXES_FOR_EXCLUSION)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 303894/303894 [03:17<00:00, 1535.59it/s]


Example metadata

In [21]:
all_match_metadata[('2019','Oct')][:3]

[{'eventId': '29529315',
  'openDate': '2019-10-20T15:30:00.000Z',
  'home_team': 'Catanzaro',
  'away_team': 'Potenza',
  'countryCode': 'IT'},
 {'eventId': '29503826',
  'openDate': '2019-10-20T14:45:00.000Z',
  'home_team': 'Feyenoord',
  'away_team': 'Heracles',
  'countryCode': 'NL'},
 {'eventId': '29536309',
  'openDate': '2019-10-20T13:00:00.000Z',
  'home_team': 'VfL Oldenburg',
  'away_team': 'FC Eintracht Northeim',
  'countryCode': 'DE'}]

Example team names

In [22]:
bf_team_names['BR'][:15]

['Fluminense',
 'Paysandu',
 'Palmeiras',
 'Cruzeiro',
 'Santos',
 'Corinthians',
 'Internacional',
 'Ituano',
 'Catanduvense',
 'Rio Preto',
 'Coritiba',
 'Gremio',
 'Novo Hamburgo',
 'Lajeadense',
 'Atletico MG']

Proportion of possibly corrupted files

In [23]:
error_count/ len(bz2_filepaths)

0.0004903025397013433

This seems an acceptable ratio.

# fuzzywuzzy: string similarity for team name matching

Having extracted the market data set (Betfair) team names per host country, and knowing that they do not necessarily match the names in the game data set (FootyStats) exactly, we get the most similar Betfair names for each FootyStats name. We can then manually select the Betfair names and aliases that do correspond to the given FootyStats team name.

For example, we may have `Manchester United` in FootyStats, and both `Man U` and `Man Utd` in Betfair. We may get as the most similar names: `Man U`, `Man Utd`, `Man City`, so we need to manually select the first two names and reject the third. 

Here we prepare a table for the manual checks.

We use the WRatio as implemented in the [fuzzywuzzy](https://pypi.org/project/fuzzywuzzy/) library as the score that measures string similarity.  
It uses [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to calculate the differences between sequences.  
We then select the `n` Betfair team names that are most similar to the reference FootyStats name.
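
To make the underlying idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance, the minimum number of single-character insertions, deletions or substitutions needed to turn one string into another. It is for illustration only and is not fuzzywuzzy's code: the WRatio score is a composite built on top of edit-based ratios, so it will not equal this raw distance.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(levenshtein('Man U', 'Man Utd'))
print(levenshtein('Man Utd', 'Man City'))
```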

Example:

In [24]:
# extract method: get the `limit` strings that are most similar to the reference string.
process.extract('Corinthians', bf_team_names['BR'])  # default limit=5

[('Corinthians', 100),
 ('Corinthians SP', 95),
 ('Corinthians Paulista U2', 90),
 ('Coritiba', 74),
 ('Cordino', 64)]

We apply the extract method to all FootyStats team names, using `limit=13`.  
We set this parameter high because we found empirically that, in some extreme cases, the relevant team name appears as far down as the 10th position.

In [25]:
fuzzy_series = footy_team_names.apply(lambda x: process.extract(x['team_name'], 
                                                                bf_team_names[x['country_code']], 
                                                                limit=13), 
                                      axis=1)
fuzzy_series

0      [(Brisbane Roar, 100), (Brisbane, 90), (WDSC W...
1      [(Wellington Phoenix, 100), (Wellington Poenix...
2      [(Western Sydney Wanderers, 100), (Sydney, 90)...
3      [(Perth Glory, 95), (Perth, 90), (Perth SC, 86...
4      [(Newcastle Jets FC, 100), (Newcastle Jets, 95...
                             ...                        
534    [(CSA, 100), (CSA AI, 90), (CSA AL, 90), (Icas...
535    [(Fortaleza, 100), (Fortaleza B, 95), (Fortale...
536    [(Santa Cruz, 100), (Santa Cruz PE, 95), (Sant...
537    [(Yokohama FM, 95), (Yokohama FC, 95), (Yokoha...
538    [(Inter Miami CF, 95), (Inter Miami II, 95), (...
Length: 539, dtype: object

## prepare data for manual check

Having extracted the most similar team names, we prepare a pandas DataFrame to be exported for the manual check step.

In [26]:
# explode results into columns
fuzzy_df = fuzzy_series.apply(lambda x: [item for sublist in x for item in sublist]).apply(pd.Series)
fuzzy_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,Brisbane Roar,100,Brisbane,90,WDSC Wolves FC Brisbane,86,Brisbane Strikers,73,UAE,60,...,Blake,54,Port Melbourne Sharks,49,Ceres Negros,48,Melbourne Victory,47,Sydney,45
1,Wellington Phoenix,100,Wellington Poenix,97,Wellington,90,Ben,57,Lions FC,56,...,Hellenic,46,Campbelltown City,46,Western United,44,Bulleen Lions,43,Peninsula Power,42
2,Western Sydney Wanderers,100,Sydney,90,Western Sydney,90,Sydney United 58,86,Western Knights,86,...,Sydney Olympic,86,Test,64,A,60,Ben,60,West Adelaide,59
3,Perth Glory,95,Perth,90,Perth SC,86,FC Seoul,86,Cairns FC,86,...,St George FC,62,Edgeworth FC,59,Macarthur FC,54,Sorrento FC,53,Melbourne Victory FC,53
4,Newcastle Jets FC,100,Newcastle Jets,95,Newcastle,90,FNQ FC Heat,86,Olympia FC,86,...,Cairns FC,86,Avondale FC,86,Canberra FC,86,Lions FC,86,Olympic FC,86
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
534,CSA,100,CSA AI,90,CSA AL,90,Icasa,75,asa,67,...,Catanduvense,60,Chapecoense,60,Portuguesa SP,60,Ceara,60,Sao Paulo,60
535,Fortaleza,100,Fortaleza B,95,Fortaleza EC,95,Ceara SC Fortaleza B,90,Fortaleza EC B (Brazil),90,...,Floresta EC,60,Cruzeiro de Porto Alegr,60,Barra FC Porto Alegre,60,Galvez,60,EC Sao Jose Porto Alegre,60
536,Santa Cruz,100,Santa Cruz PE,95,Santa Cruz RS,95,Santa Cruz-SE,95,Santa Cruz RN,95,...,Santa Cruz FC RN,90,Santa Cruz FC (RJ),90,Osvaldo Cruz FC,86,Santa Quiteria FC,86,Int Santa Maria,86
537,Yokohama FM,95,Yokohama FC,95,Yokohama SCC,90,Yokohama F Marinos,90,Yokogawa Musashino,68,...,Okayama II,56,Tokoha University SC,56,Matsumoto Yamaga FC,56,Tokuyama University,56,R. Velho Takamatsu SC,56


In [27]:
# give meaningful names to columns
fuzzy_df.columns = [item for sublist in [(f'bf_team_{i}', f'score_{i}') for i, _ in enumerate(fuzzy_series.iloc[0])] for item in sublist]
fuzzy_df

Unnamed: 0,bf_team_0,score_0,bf_team_1,score_1,bf_team_2,score_2,bf_team_3,score_3,bf_team_4,score_4,...,bf_team_8,score_8,bf_team_9,score_9,bf_team_10,score_10,bf_team_11,score_11,bf_team_12,score_12
0,Brisbane Roar,100,Brisbane,90,WDSC Wolves FC Brisbane,86,Brisbane Strikers,73,UAE,60,...,Blake,54,Port Melbourne Sharks,49,Ceres Negros,48,Melbourne Victory,47,Sydney,45
1,Wellington Phoenix,100,Wellington Poenix,97,Wellington,90,Ben,57,Lions FC,56,...,Hellenic,46,Campbelltown City,46,Western United,44,Bulleen Lions,43,Peninsula Power,42
2,Western Sydney Wanderers,100,Sydney,90,Western Sydney,90,Sydney United 58,86,Western Knights,86,...,Sydney Olympic,86,Test,64,A,60,Ben,60,West Adelaide,59
3,Perth Glory,95,Perth,90,Perth SC,86,FC Seoul,86,Cairns FC,86,...,St George FC,62,Edgeworth FC,59,Macarthur FC,54,Sorrento FC,53,Melbourne Victory FC,53
4,Newcastle Jets FC,100,Newcastle Jets,95,Newcastle,90,FNQ FC Heat,86,Olympia FC,86,...,Cairns FC,86,Avondale FC,86,Canberra FC,86,Lions FC,86,Olympic FC,86
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
534,CSA,100,CSA AI,90,CSA AL,90,Icasa,75,asa,67,...,Catanduvense,60,Chapecoense,60,Portuguesa SP,60,Ceara,60,Sao Paulo,60
535,Fortaleza,100,Fortaleza B,95,Fortaleza EC,95,Ceara SC Fortaleza B,90,Fortaleza EC B (Brazil),90,...,Floresta EC,60,Cruzeiro de Porto Alegr,60,Barra FC Porto Alegre,60,Galvez,60,EC Sao Jose Porto Alegre,60
536,Santa Cruz,100,Santa Cruz PE,95,Santa Cruz RS,95,Santa Cruz-SE,95,Santa Cruz RN,95,...,Santa Cruz FC RN,90,Santa Cruz FC (RJ),90,Osvaldo Cruz FC,86,Santa Quiteria FC,86,Int Santa Maria,86
537,Yokohama FM,95,Yokohama FC,95,Yokohama SCC,90,Yokohama F Marinos,90,Yokogawa Musashino,68,...,Okayama II,56,Tokoha University SC,56,Matsumoto Yamaga FC,56,Tokuyama University,56,R. Velho Takamatsu SC,56
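
The two reshaping steps above (flattening each list of `(name, score)` pairs and naming the resulting columns) can be illustrated on a single hand-made row. This sketch uses `itertools.chain` instead of nested comprehensions. The three pairs are taken from the first row of the output above.

```python
from itertools import chain

# One row of fuzzy results: a list of (name, score) pairs.
matches = [("Brisbane Roar", 100), ("Brisbane", 90), ("WDSC Wolves FC Brisbane", 86)]

# Flatten the pairs into a single list: name, score, name, score, ...
flat_row = list(chain.from_iterable(matches))

# Build the matching column names: bf_team_0, score_0, bf_team_1, score_1, ...
columns = list(chain.from_iterable(
    (f"bf_team_{i}", f"score_{i}") for i in range(len(matches))
))

row_as_dict = dict(zip(columns, flat_row))
print(row_as_dict)
```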


In [28]:
# concatenate original DF and fuzzy results
footy_team_names_fuzzy = pd.concat([footy_team_names, fuzzy_df], axis=1)
footy_team_names_fuzzy

Unnamed: 0,team_name,country,country_code,bf_team_0,score_0,bf_team_1,score_1,bf_team_2,score_2,bf_team_3,...,bf_team_8,score_8,bf_team_9,score_9,bf_team_10,score_10,bf_team_11,score_11,bf_team_12,score_12
0,Brisbane Roar,australia,AU,Brisbane Roar,100,Brisbane,90,WDSC Wolves FC Brisbane,86,Brisbane Strikers,...,Blake,54,Port Melbourne Sharks,49,Ceres Negros,48,Melbourne Victory,47,Sydney,45
1,Wellington Phoenix,australia,AU,Wellington Phoenix,100,Wellington Poenix,97,Wellington,90,Ben,...,Hellenic,46,Campbelltown City,46,Western United,44,Bulleen Lions,43,Peninsula Power,42
2,Western Sydney Wanderers,australia,AU,Western Sydney Wanderers,100,Sydney,90,Western Sydney,90,Sydney United 58,...,Sydney Olympic,86,Test,64,A,60,Ben,60,West Adelaide,59
3,Perth Glory FC,australia,AU,Perth Glory,95,Perth,90,Perth SC,86,FC Seoul,...,St George FC,62,Edgeworth FC,59,Macarthur FC,54,Sorrento FC,53,Melbourne Victory FC,53
4,Newcastle Jets FC,australia,AU,Newcastle Jets FC,100,Newcastle Jets,95,Newcastle,90,FNQ FC Heat,...,Cairns FC,86,Avondale FC,86,Canberra FC,86,Lions FC,86,Olympic FC,86
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
534,CSA,brazil,BR,CSA,100,CSA AI,90,CSA AL,90,Icasa,...,Catanduvense,60,Chapecoense,60,Portuguesa SP,60,Ceara,60,Sao Paulo,60
535,Fortaleza,brazil,BR,Fortaleza,100,Fortaleza B,95,Fortaleza EC,95,Ceara SC Fortaleza B,...,Floresta EC,60,Cruzeiro de Porto Alegr,60,Barra FC Porto Alegre,60,Galvez,60,EC Sao Jose Porto Alegre,60
536,Santa Cruz,brazil,BR,Santa Cruz,100,Santa Cruz PE,95,Santa Cruz RS,95,Santa Cruz-SE,...,Santa Cruz FC RN,90,Santa Cruz FC (RJ),90,Osvaldo Cruz FC,86,Santa Quiteria FC,86,Int Santa Maria,86
537,Yokohama,japan,JP,Yokohama FM,95,Yokohama FC,95,Yokohama SCC,90,Yokohama F Marinos,...,Okayama II,56,Tokoha University SC,56,Matsumoto Yamaga FC,56,Tokuyama University,56,R. Velho Takamatsu SC,56


We build an ancillary DF with zeros in place of the team names.  
In the manual check, these zeros will be replaced with ones for the names/aliases that are confirmed as correct.

In [29]:
footy_team_names_fuzzy_aux = footy_team_names_fuzzy.copy()

In [30]:
for col in [f'bf_team_{i}' for i in range(0, 13)]:
    footy_team_names_fuzzy_aux[col] = 0

In [31]:
footy_team_names_fuzzy_aux

Unnamed: 0,team_name,country,country_code,bf_team_0,score_0,bf_team_1,score_1,bf_team_2,score_2,bf_team_3,...,bf_team_8,score_8,bf_team_9,score_9,bf_team_10,score_10,bf_team_11,score_11,bf_team_12,score_12
0,Brisbane Roar,australia,AU,0,100,0,90,0,86,0,...,0,54,0,49,0,48,0,47,0,45
1,Wellington Phoenix,australia,AU,0,100,0,97,0,90,0,...,0,46,0,46,0,44,0,43,0,42
2,Western Sydney Wanderers,australia,AU,0,100,0,90,0,90,0,...,0,86,0,64,0,60,0,60,0,59
3,Perth Glory FC,australia,AU,0,95,0,90,0,86,0,...,0,62,0,59,0,54,0,53,0,53
4,Newcastle Jets FC,australia,AU,0,100,0,95,0,90,0,...,0,86,0,86,0,86,0,86,0,86
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
534,CSA,brazil,BR,0,100,0,90,0,90,0,...,0,60,0,60,0,60,0,60,0,60
535,Fortaleza,brazil,BR,0,100,0,95,0,95,0,...,0,60,0,60,0,60,0,60,0,60
536,Santa Cruz,brazil,BR,0,100,0,95,0,95,0,...,0,90,0,90,0,86,0,86,0,86
537,Yokohama,japan,JP,0,95,0,95,0,90,0,...,0,56,0,56,0,56,0,56,0,56


Interleave the rows with team names and the rows with zeros for convenience.

In [32]:
footy_team_names_fuzzy.index = range(0, 2*len(footy_team_names_fuzzy), 2)

In [33]:
footy_team_names_fuzzy_aux.index = range(1, 2*len(footy_team_names_fuzzy)+1, 2)

In [34]:
# write row type before concatenating DFs
footy_team_names_fuzzy['row_type'] = 'bf_team_names'
footy_team_names_fuzzy_aux['row_type'] = 'name_matching'

In [35]:
footy_team_names_fuzzy_for_manual_check = pd.concat([footy_team_names_fuzzy, footy_team_names_fuzzy_aux]).sort_index()
footy_team_names_fuzzy_for_manual_check

Unnamed: 0,team_name,country,country_code,bf_team_0,score_0,bf_team_1,score_1,bf_team_2,score_2,bf_team_3,...,score_8,bf_team_9,score_9,bf_team_10,score_10,bf_team_11,score_11,bf_team_12,score_12,row_type
0,Brisbane Roar,australia,AU,Brisbane Roar,100,Brisbane,90,WDSC Wolves FC Brisbane,86,Brisbane Strikers,...,54,Port Melbourne Sharks,49,Ceres Negros,48,Melbourne Victory,47,Sydney,45,bf_team_names
1,Brisbane Roar,australia,AU,0,100,0,90,0,86,0,...,54,0,49,0,48,0,47,0,45,name_matching
2,Wellington Phoenix,australia,AU,Wellington Phoenix,100,Wellington Poenix,97,Wellington,90,Ben,...,46,Campbelltown City,46,Western United,44,Bulleen Lions,43,Peninsula Power,42,bf_team_names
3,Wellington Phoenix,australia,AU,0,100,0,97,0,90,0,...,46,0,46,0,44,0,43,0,42,name_matching
4,Western Sydney Wanderers,australia,AU,Western Sydney Wanderers,100,Sydney,90,Western Sydney,90,Sydney United 58,...,86,Test,64,A,60,Ben,60,West Adelaide,59,bf_team_names
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1073,Santa Cruz,brazil,BR,0,100,0,95,0,95,0,...,90,0,90,0,86,0,86,0,86,name_matching
1074,Yokohama,japan,JP,Yokohama FM,95,Yokohama FC,95,Yokohama SCC,90,Yokohama F Marinos,...,56,Tokoha University SC,56,Matsumoto Yamaga FC,56,Tokuyama University,56,R. Velho Takamatsu SC,56,bf_team_names
1075,Yokohama,japan,JP,0,95,0,95,0,90,0,...,56,0,56,0,56,0,56,0,56,name_matching
1076,Inter Miami,usa,US,Inter Miami CF,95,Inter Miami II,95,Inter Miami 2,95,Inter,...,67,Miami United SC,66,FC Miami City,64,Serbia,60,Miami Beach CF,59,bf_team_names
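
The even/odd index trick used above can be shown in plain Python on a toy example: rows of the first table get even positions, rows of the second table get odd positions, and sorting by position interleaves them. The row contents below are made up.

```python
name_rows = [("bf_team_names", "Brisbane Roar"), ("bf_team_names", "Wellington Phoenix")]
flag_rows = [("name_matching", 0), ("name_matching", 0)]

n = len(name_rows)
indexed = (list(zip(range(0, 2 * n, 2), name_rows))         # even positions: 0, 2, ...
           + list(zip(range(1, 2 * n + 1, 2), flag_rows)))  # odd positions: 1, 3, ...

# Sorting by position puts each row of zeros right below its team-name row.
interleaved = [row for _, row in sorted(indexed, key=lambda pair: pair[0])]
print(interleaved)
```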


# Export

In [36]:
INTERIM_FOLDERPATH = '../data/interim'

footy_team_names_fuzzy_for_manual_check

In [37]:
footy_team_names_fuzzy_for_manual_check.to_csv(os.path.join(INTERIM_FOLDERPATH, 
                                                            'footy_team_names_fuzzy_for_manual_check.csv'), 
                                               index=False)

We use `pickle` to dump Python objects to disk.
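
One practical point in favor of `pickle` for `all_match_metadata` (an observation, not necessarily the original motivation) is that its keys are `(year, month)` tuples, which `json.dumps` rejects as dictionary keys. The sketch below shows this on a one-entry dict shaped like the real object.

```python
import json
import pickle

metadata = {("2019", "Oct"): [{"eventId": "29529315", "countryCode": "IT"}]}

# JSON only accepts str, int, float, bool or None as dictionary keys.
try:
    json.dumps(metadata)
    json_ok = True
except TypeError:
    json_ok = False

# pickle round-trips the tuple keys unchanged.
restored = pickle.loads(pickle.dumps(metadata))
print(json_ok, restored == metadata)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.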

all_match_metadata

In [38]:
with open(os.path.join(INTERIM_FOLDERPATH, 'all_match_metadata'), 'wb') as fp:
    pickle.dump(all_match_metadata, fp)

In [39]:
# load and test if serialization/deserialization worked
with open(os.path.join(INTERIM_FOLDERPATH, 'all_match_metadata'), 'rb') as fp:
    all_match_metadata_from_pickle = pickle.load(fp)
list(all_match_metadata_from_pickle.keys())[:5]

[('2015', 'Aug'),
 ('2015', 'Sep'),
 ('2015', 'Jul'),
 ('2015', 'Oct'),
 ('2015', 'May')]

bz2_filepaths

In [40]:
with open(os.path.join(INTERIM_FOLDERPATH, 'bz2_filepaths'), 'wb') as fp:
    pickle.dump(bz2_filepaths, fp)

bz2_filepaths

In [41]:
# load and test if serialization/deserialization worked
with open(os.path.join(INTERIM_FOLDERPATH, 'bz2_filepaths'), 'rb') as fp:
    bz2_filepaths_from_pickle = pickle.load(fp)
bz2_filepaths_from_pickle[:5]

['../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27511102/1.119998414.bz2',
 '../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27510736/1.119986613.bz2',
 '../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27515085/1.120096574.bz2',
 '../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27510738/1.119986753.bz2',
 '../data/raw/betfair/xds_nfs/hdfs_supreme/BASIC/2015/Aug/20/27505689/1.119868849.bz2']

In [42]:
len(bz2_filepaths_from_pickle)

303894