## 1. Database Preprocessing
Pandas and SQLite (via the `sqlite3` module) are used to extract the crawled data from the database file and to preprocess it, including the handling of missing values.

In [1]:
import os
import pandas as pd
import sqlite3

def select_return_table(table_name):
    # Select all records from a table and convert them to a pandas dataframe (uses the global cursor `curs`)
    data = curs.execute('SELECT * FROM %s' % table_name).fetchall()
    column = [element[1] for element in curs.execute('PRAGMA table_info(%s)' % table_name).fetchall()]
    return pd.DataFrame(data, columns=column)

def get_missing_value_perc(df, cond=lambda x: x == 'null'):
    # Flag missing values with `cond`, then report count and percentage per column
    mask = df.applymap(cond)
    df_sum = mask.sum()
    # Divide by the number of rows so the percentage always matches `cond`
    df_percentage = df_sum / len(df)
    df_percentage = df_percentage.apply(lambda x: '{0:.2f}%'.format(x * 100))
    return pd.concat([df_sum, df_percentage], axis=1, keys=['Missing Value', 'Missing Value (%)'])
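
The counting logic of `get_missing_value_perc` can be tried on a small made-up frame. The sketch below is only an illustration, assuming that the string `'null'` is the sentinel for a missing value; `toy_df` and its contents are hypothetical and not part of the crawled data.

```python
import pandas as pd

# Toy frame; 'null' is the sentinel string assumed to mark a missing value
toy_df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'place_of_birth': ['null', 'x', 'null', 'y'],
})

mask = toy_df.eq('null')      # boolean frame, True where the sentinel appears
missing_count = mask.sum()    # number of sentinel values per column
missing_share = mask.mean()   # fraction of rows flagged per column

summary = pd.DataFrame({
    'Missing Value': missing_count,
    'Missing Value (%)': missing_share.map('{:.2%}'.format),
})
print(summary)
```

Computing the boolean mask once and reusing it avoids scanning the whole table several times, which matters for a table the size of the race records.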

## 1.1 Database Connection
Five tables were created during the crawling stage: race records, horse records, past results of individuals (trainers, jockeys, breeders and owners), trainer profiles and jockey profiles. Each table can easily be loaded into a Pandas dataframe for further processing.

In [2]:
# Establish database connection and check table name
conn = sqlite3.connect(os.path.abspath(os.path.join(os.path.dirname(os.getcwd()), 'crawler', 'data', 'race.db')))
curs = conn.cursor()
table_name = curs.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
print(table_name)

[('race_record',), ('horse_record',), ('individual_record',), ('trainer_profile',), ('jockey_profile',)]


In [3]:
# Read from record data
record_dict = {name[0]: select_return_table(name[0]) for name in table_name}
race_df = record_dict['race_record']
horse_df = record_dict['horse_record']
individual_df = record_dict['individual_record']
trainer_df = record_dict['trainer_profile']
jockey_df = record_dict['jockey_profile']
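
As an alternative to fetching rows and column names separately, pandas can read a query result straight into a dataframe with `pd.read_sql_query`. The following is a minimal sketch against an in-memory database; `demo_conn`, the table layout and the two rows are made up for illustration and only loosely mirror `race_record`.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for race.db; the table and rows are made up
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE race_record (run_date TEXT, place TEXT, race TEXT)')
demo_conn.executemany(
    'INSERT INTO race_record VALUES (?, ?, ?)',
    [('2009-03-21', '東京', '1R'), ('2009-03-21', '東京', '2R')],
)

# read_sql_query returns a dataframe whose column names come from the query result
demo_df = pd.read_sql_query('SELECT * FROM race_record', demo_conn)
demo_conn.close()
print(demo_df)
```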

## 1.2 Integrity Check
This step checks the integrity of the crawled data, confirming that it is largely consistent with the data presented online and with some basic rules of national horse racing events in Japan (e.g. there should be 12 races at one place on a given day). More details are available on the <a href='http://www.jra.go.jp/'>JRA (Japan Racing Association) webpage</a>.

In [4]:
# Check the availability of data for each type of individual
unique_list = {
    'horse': race_df['horse'].unique(), 
    'trainer': race_df['trainer'].unique(), 'jockey': race_df['jockey'].unique(),
    'owner': race_df['owner'].unique(), 'breeder': horse_df['breeder'].unique()   
}
print('Horse: ' + '{:1.2f}'.format(len(horse_df['horse_name'].unique()) / len(unique_list['horse'])))
print('Trainer: ' + '{:1.2f}'.format(len(trainer_df['trainer_name'].unique()) / len(unique_list['trainer'])))
print('Jockey: ' + '{:1.2f}'.format(len(jockey_df['jockey_name'].unique()) / len(unique_list['jockey'])))

Horse: 0.66
Trainer: 0.35
Jockey: 0.33
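
The ratios above compare the number of unique names in two tables. A stricter variant, sketched below on made-up data, intersects the name sets so that only names appearing in both tables count, and it also lists which names lack a profile. It assumes that names are written identically in both tables, which may not hold for the crawled data; `toy_race` and `toy_horse` are hypothetical.

```python
import pandas as pd

# Toy stand-ins; column names follow the notebook, the values are made up
toy_race = pd.DataFrame({'horse': ['A', 'B', 'C', 'A']})
toy_horse = pd.DataFrame({'horse_name': ['A', 'B']})

raced = set(toy_race['horse'])
profiled = set(toy_horse['horse_name'])

# Share of raced horses that also have a profile record
coverage = len(raced & profiled) / len(raced)
# Names seen in races but absent from the profile table
missing_profiles = sorted(raced - profiled)
print('{:1.2f}'.format(coverage), missing_profiles)
```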


### 1.2.1 Race Record
The following shows the attributes in the race record table as well as some basic statistics. Grouping race records by run date, place and race number yields the total number of races happening at one place on a certain date.

In [5]:
# Snapshot of the race_record dataframe
print(race_df.shape)
race_df.describe().T

(863877, 27)


attribute,count,unique,top,freq
run_date,863877,1844,2009-03-21,575
place,863877,10,東京,126617
race,863877,12,12R,74786
title,863877,2264,3歳未勝利,215445
type,863877,3,ダ,418060
track,863877,4,右,561623
distance,863877,80,1200m,183649
weather,863877,6,晴,538975
condition,863877,4,良,644183
time,863877,123,10:30,19683


The following supports the expectation that 12 races are generally held at a single place on a given day for national racing events in Japan. A sample of 10 race days is shown below.

In [6]:
# Ensure that (almost) all races on the same day at the same place have a count of 12
race_count = curs.execute('SELECT DISTINCT run_date, place, race from race_record').fetchall()
race_count_df = pd.DataFrame(race_count, columns=['run_date', 'place', 'race'])
race_count_df.groupby(['run_date', 'place']).count().sample(n=10)

run_date,place,race
2000-02-05,東京,12
2012-10-21,東京,12
2006-05-13,新潟,12
2003-06-29,函館,12
2003-03-02,中山,12
2010-09-25,札幌,12
2011-11-06,京都,12
2006-11-18,東京,12
2017-04-29,新潟,12
2006-07-09,福島,12
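
A random sample can miss the exceptions, so it may also be useful to list every day and place whose race count differs from 12. The sketch below shows one way to do this on made-up data; `toy_races` is hypothetical and only reuses the column names of `race_count_df`.

```python
import pandas as pd

# Two made-up race days at the same place: one with 12 races, one with 11
toy_races = pd.DataFrame({
    'run_date': ['2000-02-05'] * 12 + ['2000-02-06'] * 11,
    'place': ['東京'] * 23,
    'race': ['{}R'.format(i) for i in range(1, 13)]
            + ['{}R'.format(i) for i in range(1, 12)],
})

# Count distinct race numbers per day and place
races_per_day = toy_races.groupby(['run_date', 'place'])['race'].nunique()

# Keep only the day/place pairs that deviate from the expected 12 races
irregular = races_per_day[races_per_day != 12]
print(irregular)
```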


### 1.2.2 Horse Record
A similar check is done for the horse record table.

In [7]:
# Check the data columns
horse_df.sample(n=3)

index,horse_name,date_of_birth,trainer,owner,breeder,place_of_birth,transaction_price,prize_obtained,race_record,highlight_race,relatives,parents,status,gender,breed,offer_info
29736,タイセイグレース,2006年3月2日,佐々木亜 (美浦),高木競走馬育成牧場,里深牧場,日高町,-,0万円,3戦0勝 [ 0-0-0-3 ],,ハイパーレスキュー 、 パームジュメイラ,アグネスフライト マックスタムタム,抹消,牝,栗毛,
14690,スプレッドスマイル,2001年4月4日,清水出美 (栗東),寺田寿男,Phillip F. McCarthy West Coast Stables P&J Far...,米,-,"1,150万円 (中央) /9万円 (地方)",29戦1勝 [ 1-0-4-24 ],04'3歳未勝利,,Dance Brightly Bright Image,抹消,牡,栗毛,
22869,エアリアーナ,2004年2月25日,杉浦宏昭 (美浦),西山茂行,西山牧場,鵡川町,-,"1,598万円 (中央) /50万円 (地方)",37戦1勝 [ 1-1-4-31 ],07'3歳未勝利,ニシノガルーダ 、 ニシノフェニックス,マリエンバード ブランドイメージ,抹消,牝,青毛,


In [8]:
# Snapshot of the horse_record dataframe
print(horse_df.shape)
horse_df.describe().T

(56226, 16)


attribute,count,unique,top,freq
horse_name,56226,56226,グラスルージュ,1
date_of_birth,56226,3628,2014年4月30日,52
trainer,56226,1740,(地方),345
owner,56226,5575,サンデーレーシング,826
breeder,56226,3400,ノーザンファーム,3195
place_of_birth,56226,99,浦河町,10510
transaction_price,56226,6130,-,43830
prize_obtained,56226,22408,0万円,8734
race_record,56226,16748,2戦0勝 [ 0-0-0-2 ],2109
highlight_race,56226,11528,,18813


### 1.2.3 Individual Record
A similar check is done for the individual record table. It provides consolidated yearly results for each individual involved in horse racing events.

In [9]:
# Check the data columns
individual_df.sample(n=3)

index,individual_type,name,year,rank,first,second,third,out,races_major,wins_major,...,wins_flat,races_grass,wins_grass,races_dirt,wins_dirt,wins_percent,wins_percent_2nd,wins_percent_3rd,prize_obtained,representative_horse
40137,生産者,前川牧場,1997,646.0,1,2,3,31,1,0,...,1,24,1,13,0,0.027,0.081,0.162,2198.1,テツマスター
73987,生産者,高田良一,累計,,0,0,1,6,0,0,...,0,2,0,5,0,0.0,0.0,0.143,130.0,
50635,馬主,加藤久枝,2012,644.0,1,1,2,20,0,0,...,1,8,0,16,1,0.042,0.083,0.167,1131.0,トーブプリンセス


In [10]:
# Snapshot of the individual_record dataframe
print(individual_df.shape)
individual_df.describe().T

(77703, 23)


attribute,count,unique,top,freq
individual_type,77703,4,生産者,47286
name,77703,6664,バンブー牧場,34
year,77703,34,累計,6663
rank,77703,1487,,6663
first,77703,428,0,31257
second,77703,425,0,33025
third,77703,418,0,31914
out,77703,1371,1,5682
races_major,77703,306,0,51219
wins_major,77703,72,0,71358


### 1.2.4 Trainer Profile
A similar check is done for the trainer profile table. It lists personal information for each trainer.

In [11]:
# Check the data columns
trainer_df.sample(n=3)

index,trainer_name,date_of_birth,place_of_birth,first_run_date,first_run_horse,first_win_date,first_win_horse
19,[東]高松邦男,1950/12/29,東京都,1998/03/08,ブランドアケミ,1998/08/08,マルタカダイジン
274,[外]シアカ,1952/03/21,,,,,
199,[地]山田勇,1949/03/21,,,,,


In [12]:
# Snapshot of the trainer_profile dataframe
print(trainer_df.shape)
trainer_df.describe().T

(368, 7)


attribute,count,unique,top,freq
trainer_name,368,368,[東]前田禎,1
date_of_birth,368,362,1930/06/05,2
place_of_birth,368,29,,239
first_run_date,368,99,,239
first_run_horse,368,129,,240
first_win_date,368,123,,240
first_win_horse,368,129,,240


### 1.2.5 Jockey Profile
A similar check is done for the jockey profile table. It lists personal information for each jockey.

In [13]:
# Check the data columns
jockey_df.sample(n=3)

index,jockey_name,date_of_birth,place_of_birth,blood_type,height,weight,first_flat_run_date,first_flat_run_horse,first_flat_win_date,first_flat_win_horse,first_obs_run_date,first_obs_run_horse,first_obs_win_date,first_obs_win_horse
85,西谷誠,1964/06/03,埼玉県,B型,167cm,58kg,1984/03/03,シンナリティ,1984/03/11,シンセイカン,1984/03/18,シンテイアス,1985/01/06,ホクテンプリンス
226,大野拓弥,1990/02/26,福島県,A型,160cm,48kg,2008/03/01,トウショウブリーズ,2008/10/26,サザンスターディ,,,,
99,横山義行,1962/07/28,,,,,1984/03/03,ダービーコーラス,1984/11/10,ミトモスイセイ,1985/01/19,キミノイッセイ,1986/04/27,ダイナエメラルド


In [14]:
# Snapshot of the jockey_profile dataframe
print(jockey_df.shape)
jockey_df.describe().T

(269, 14)


attribute,count,unique,top,freq
jockey_name,269,269,エスピノ,1
date_of_birth,269,266,1984/07/29,2
place_of_birth,269,30,,158
blood_type,269,5,,160
height,269,20,,158
weight,269,15,,158
first_flat_run_date,269,164,,19
first_flat_run_horse,269,251,,19
first_flat_win_date,269,181,,79
first_flat_win_horse,269,190,,79


## 1.3 Preprocessing
The following shows further preprocessing of the dataset, which predominantly revolves around handling missing values in each column. Most columns contain no missing values, while the remaining ones have either over 90% or below 1% of missing values, so simple dropping is performed in these cases.

### 1.3.1 Race Record

As the percentage of missing values in the latter columns is trivial in this case, the affected rows can be safely dropped without affecting the dataset as a whole.

In [15]:
# Check missing value
get_missing_value_perc(race_df)

attribute,Missing Value,Missing Value (%)
run_date,0,0.00%
place,0,0.00%
race,0,0.00%
title,0,0.00%
type,0,0.00%
track,0,0.00%
distance,0,0.00%
weather,0,0.00%
condition,0,0.00%
time,0,0.00%


In [16]:
# Keep only the rows that contain no 'null' entry in any column
race_df = race_df.loc[race_df.applymap(lambda x: x == 'null').sum(axis=1) == 0, :]
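
The same row filter can be written with vectorised comparisons instead of an element-wise `applymap`. The sketch below demonstrates this on a made-up frame, again assuming `'null'` is the missing-value sentinel; `toy_df` is hypothetical.

```python
import pandas as pd

# Made-up rows; the second one carries the 'null' sentinel
toy_df = pd.DataFrame({
    'place': ['東京', '中山', '京都'],
    'weather': ['晴', 'null', '曇'],
})

# True for every row that holds the sentinel in at least one column
has_null = toy_df.eq('null').any(axis=1)

# Keep the complete rows only
cleaned = toy_df.loc[~has_null, :]
print(cleaned)
```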

### 1.3.2 Horse Record

The last column, 'offer_info', can simply be dropped from the dataset as 98.39% of its values are missing.

In [17]:
# Check missing value
get_missing_value_perc(horse_df)

attribute,Missing Value,Missing Value (%)
horse_name,0,0.00%
date_of_birth,0,0.00%
trainer,0,0.00%
owner,0,0.00%
breeder,0,0.00%
place_of_birth,0,0.00%
transaction_price,0,0.00%
prize_obtained,0,0.00%
race_record,0,0.00%
highlight_race,0,0.00%


In [18]:
horse_df = horse_df.drop('offer_info', axis=1)
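
Instead of naming the column by hand, the columns to drop can be selected by their share of missing values. This is a minimal sketch on made-up data; the 0.5 cut-off in `threshold` is a hypothetical value chosen for the example, not one used in this notebook.

```python
import pandas as pd

# Made-up frame where 'offer_info' is mostly the 'null' sentinel
toy_df = pd.DataFrame({
    'horse_name': ['a', 'b', 'c', 'd'],
    'offer_info': ['null', 'null', 'null', 'x'],
})

threshold = 0.5  # hypothetical cut-off for the share of missing values

# Fraction of sentinel values per column
missing_share = toy_df.eq('null').mean()

# Columns whose missing share exceeds the cut-off
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy_df.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```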

### 1.3.3 Individual Record

As the individual records with missing values are all tied to a single person, they can simply be dropped from the table.

In [19]:
# Check missing value
get_missing_value_perc(individual_df)

attribute,Missing Value,Missing Value (%)
individual_type,0,0.00%
name,0,0.00%
year,0,0.00%
rank,0,0.00%
first,0,0.00%
second,0,0.00%
third,0,0.00%
out,0,0.00%
races_major,0,0.00%
wins_major,0,0.00%


In [20]:
# List the names tied to records that contain missing values
individual_df.loc[individual_df.applymap(lambda x: x == 'null').sum(axis=1) > 0]['name'].value_counts(ascending=False)[:5]

Series([], Name: name, dtype: int64)

In [21]:
# Keep only the records that contain no 'null' entry
individual_df = individual_df.loc[individual_df.applymap(lambda x: x == 'null').sum(axis=1) == 0, :]

### 1.3.4 Trainer & Jockey Profile

Regarding place of birth, a trainer or jockey without such a record is assumed to be from outside Tokyo. For the other attributes, further feature engineering is believed to be feasible, so for now they are left as they are rather than handled as missing values. For example, attributes such as the first run date could be derived from the race record table, although this is not shown here.

In [22]:
# Check missing value
get_missing_value_perc(trainer_df)

attribute,Missing Value,Missing Value (%)
trainer_name,0,0.00%
date_of_birth,0,0.00%
place_of_birth,239,64.95%
first_run_date,239,64.95%
first_run_horse,240,65.22%
first_win_date,240,65.22%
first_win_horse,240,65.22%


In [23]:
# Check missing value
get_missing_value_perc(jockey_df)

attribute,Missing Value,Missing Value (%)
jockey_name,0,0.00%
date_of_birth,0,0.00%
place_of_birth,158,58.74%
blood_type,160,59.48%
height,158,58.74%
weight,158,58.74%
first_flat_run_date,19,7.06%
first_flat_run_horse,19,7.06%
first_flat_win_date,79,29.37%
first_flat_win_horse,79,29.37%


In [24]:
# Replace a missing place of birth with the placeholder '地方' (regional)
trainer_df['place_of_birth'] = trainer_df['place_of_birth'].apply(lambda x: x if x != 'null' else u'地方')
jockey_df['place_of_birth'] = jockey_df['place_of_birth'].apply(lambda x: x if x != 'null' else u'地方')
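
The idea of deriving a first run date from the race records could look roughly like the sketch below. It is not part of the original processing: the frame `toy_race` is made up, the column name `first_run_date_derived` is hypothetical, and the result is only the earliest race within the crawled period, which may be later than a trainer's true first run.

```python
import pandas as pd

# Made-up race records; 'trainer' and 'run_date' follow the race record columns
toy_race = pd.DataFrame({
    'trainer': ['T1', 'T1', 'T2'],
    'run_date': ['2001-05-06', '2000-02-05', '2003-06-29'],
})
toy_race['run_date'] = pd.to_datetime(toy_race['run_date'])

# Earliest observed race per trainer
first_run = (
    toy_race.groupby('trainer')['run_date']
    .min()
    .rename('first_run_date_derived')
    .reset_index()
)
print(first_run)
```

Joining this back onto the trainer profiles would additionally require the trainer names to match between the two tables, which is an assumption that needs checking first.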

Finally, we can output the dataframes as csv files for further analysis.

In [25]:
race_df.to_csv('data/race.csv', encoding='utf-8')
horse_df.to_csv('data/horse.csv', encoding='utf-8')
individual_df.to_csv('data/individual.csv', encoding='utf-8')
trainer_df.to_csv('data/trainer.csv', encoding='utf-8')
jockey_df.to_csv('data/jockey.csv', encoding='utf-8')
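
When the files are read back later, note that `to_csv` writes the dataframe index as an unnamed first column by default, which `read_csv` then loads as a column called `Unnamed: 0`. The sketch below shows a round trip with `index=False` on made-up data, writing to a temporary directory rather than to `data/`; whether the index should be kept depends on the later analysis.

```python
import os
import tempfile
import pandas as pd

# Made-up frame with non-ASCII values, similar in spirit to the race records
toy_df = pd.DataFrame({'place': ['東京', '新潟'], 'race': ['1R', '12R']})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'race.csv')
    # index=False leaves out the row index so no extra column appears on reload
    toy_df.to_csv(out_path, encoding='utf-8', index=False)
    restored = pd.read_csv(out_path, encoding='utf-8')

print(list(restored.columns))
```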