## 1. Database Preprocessing
Pandas and sqlite3 are used to load the crawled data from the SQLite database and to preprocess it, mainly by handling missing values.

In [1]:
import os
import pandas as pd
import sqlite3

from preprocessing import select_return_table, get_missing_value_perc

## 1.1 Database Connection
Five tables were created during the crawling stage. They hold race records, horse records, yearly past results for individuals (trainers, jockeys, breeders and owners), trainer profiles and jockey profiles. Each table can easily be loaded into a Pandas dataframe for further processing.

In [2]:
# Establish database connection and check table name
conn = sqlite3.connect(os.path.abspath(os.path.join(os.path.dirname(os.getcwd()), 'crawler', 'data', 'race.db')))
curs = conn.cursor()
table_name = curs.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
print(table_name)

[('race_record',), ('horse_record',), ('individual_record',), ('trainer_profile',), ('jockey_profile',)]


In [3]:
# Read from record data
record_dict = {name[0]: select_return_table(curs, name[0]) for name in table_name}
race_df = record_dict['race_record']
horse_df = record_dict['horse_record']
individual_df = record_dict['individual_record']
trainer_df = record_dict['trainer_profile']
jockey_df = record_dict['jockey_profile']
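
`select_return_table` comes from the project's own `preprocessing` module, whose source is not shown here. A minimal sketch of what such a helper might look like, assuming it simply runs `SELECT *` on the named table and takes the column names from `cursor.description`. The function name `select_return_table_sketch` and the in-memory demo table are illustrative only, not the project's code.

```python
import sqlite3

import pandas as pd


def select_return_table_sketch(cursor, table_name):
    """Hypothetical stand-in for preprocessing.select_return_table."""
    # Table names cannot be bound as SQL parameters, so quote the identifier instead.
    quoted = '"{}"'.format(table_name.replace('"', '""'))
    rows = cursor.execute('SELECT * FROM {}'.format(quoted)).fetchall()
    # cursor.description holds one tuple per result column; the name comes first.
    columns = [col[0] for col in cursor.description]
    return pd.DataFrame(rows, columns=columns)


# Demo on a throwaway in-memory database with a single toy row
demo_conn = sqlite3.connect(':memory:')
demo_curs = demo_conn.cursor()
demo_curs.execute('CREATE TABLE race_record (run_date TEXT, place TEXT, race TEXT)')
demo_curs.execute("INSERT INTO race_record VALUES ('2009-03-21', '東京', '12R')")
demo_df = select_return_table_sketch(demo_curs, 'race_record')
print(demo_df)
```

pandas also provides `pd.read_sql_query(sql, conn)`, which builds a dataframe directly from a connection and could replace a hand-written helper.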

## 1.2 Integrity Check
This step checks the integrity of the crawled data: it should be consistent with the data published online and with some basic rules of national horse racing events in Japan (e.g. a place normally holds 12 races on a race day). More details are available on the <a href='http://www.jra.go.jp/'>JRA (Japan Racing Association) webpage</a>.

In [4]:
# Check the availability of data for each type of individual
unique_list = {
    'horse': race_df['horse'].unique(), 
    'trainer': race_df['trainer'].unique(), 'jockey': race_df['jockey'].unique(),
    'owner': race_df['owner'].unique(), 'breeder': horse_df['breeder'].unique()   
}
print('Horse: ' + '{:1.2f}'.format(len(horse_df['horse_name'].unique()) / len(unique_list['horse'])))
print('Trainer: ' + '{:1.2f}'.format(len(trainer_df['trainer_name'].unique()) / len(unique_list['trainer'])))
print('Jockey: ' + '{:1.2f}'.format(len(jockey_df['jockey_name'].unique()) / len(unique_list['jockey'])))

Horse: 1.00
Trainer: 1.00
Jockey: 0.99
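
The ratios above compare counts of unique names, so a profile table could in principle reach 1.00 while still lacking some of the names that appear in the race records. A stricter check, sketched here on toy names, compares the two sets directly and lists the names without a profile. The helper `name_coverage` is hypothetical and not part of the project.

```python
def name_coverage(reference_names, profile_names):
    """Return the share of reference names that have a profile, and the names that do not."""
    reference = set(reference_names)
    missing = sorted(reference - set(profile_names))
    covered = len(reference) - len(missing)
    return covered / len(reference), missing


# Toy data: four jockeys appear in races, but one profile belongs to someone else
race_jockeys = ['jockey_a', 'jockey_b', 'jockey_c', 'jockey_d']
profile_jockeys = ['jockey_a', 'jockey_b', 'jockey_c', 'jockey_x']

# Count-based ratio (as in the cell above) versus set-based coverage
count_ratio = len(set(profile_jockeys)) / len(set(race_jockeys))
share, missing = name_coverage(race_jockeys, profile_jockeys)
print('{:1.2f} {:1.2f} {}'.format(count_ratio, share, missing))
```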


### 1.2.1 Race Record
The following shows the attributes of the race record table along with some basic statistics. Grouping race records by run date, place and race number yields the total number of races held at one place on a given date.

In [5]:
# Snapshot of the race_record dataframe
print(race_df.shape)
race_df.describe().T

(865738, 31)


column,count,unique,top,freq
run_date,865738,1848,2009-03-21,575
place,865738,10,東京,127180
race,865738,12,12R,75021
title,865738,2264,3歳未勝利,216019
type,865738,3,ダ,419070
track,865738,4,右,562894
distance,865738,80,1200m,183950
weather,865738,6,晴,540224
condition,865738,4,良,645264
time,865738,123,10:30,19712


The following check supports the rule that 12 races are typically held at a single place on a race day for national racing events in Japan. A random sample of 10 place-and-date combinations is shown below.

In [6]:
# Ensure that (almost) all races on the same day at the same place have a count of 12
race_count = curs.execute('SELECT DISTINCT run_date, place, race from race_record').fetchall()
race_count_df = pd.DataFrame(race_count, columns=['run_date', 'place', 'race'])
race_count_df.groupby(['run_date', 'place']).count().sample(n=10)

run_date,place,race
2013-02-16,京都,12
2010-05-23,京都,12
2011-10-09,京都,12
2013-04-20,東京,12
2011-07-23,函館,12
2006-10-21,東京,12
2012-06-30,中京,12
2006-05-13,新潟,12
2001-08-25,新潟,12
2008-07-06,福島,12
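
A random sample of ten rows cannot show how often the count deviates from 12. One way to look at every place-and-date combination at once is sketched below on toy rows (invented for illustration); the same two steps could be applied to `race_count_df`.

```python
import pandas as pd

# Toy data: one meeting with 12 races and one with only 11
rows = [('2013-02-16', '京都', '{}R'.format(i)) for i in range(1, 13)]
rows += [('2011-07-23', '函館', '{}R'.format(i)) for i in range(1, 12)]
toy_races = pd.DataFrame(rows, columns=['run_date', 'place', 'race'])

# Number of distinct races per (run_date, place)
races_per_meeting = toy_races.groupby(['run_date', 'place'])['race'].nunique()

# How many meetings have each race count, and which meetings deviate from 12
count_distribution = races_per_meeting.value_counts()
exceptions = races_per_meeting[races_per_meeting != 12]
print(count_distribution)
print(exceptions)
```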


### 1.2.2 Horse Record
A similar check is done for the horse record table.

In [7]:
# Check the data columns
horse_df.sample(n=3)

index,horse_id,horse_name,date_of_birth,trainer,owner,breeder,place_of_birth,transaction_price,prize_obtained,race_record,highlight_race,relatives,parents,status,gender,breed,offer_info,breeder_id
18960,2000104336,ポートバブルガム,2000年5月16日,高橋成忠 (栗東),水戸眞知子,本巣勝,浦河町,-,75万円 (中央),7戦0勝 [ 0-0-0-7 ],,ファンドリポポ 、 ポートブライアンズ,1993109219 1980102237,抹消,牡,鹿毛,,230040
48086,2007100998,トーセンクレイジー,2007年2月17日,小笠倫弘 (美浦),島川隆哉,矢野牧場,新ひだか町,850万円 (2008年 北海道セレクションセール),58万円 (地方),14戦2勝 [ 2-2-2-8 ],サラ系3歳10組,アルスマルカート 、 トミケンジェスト,000a0003a1 000a006886,抹消,牝,黒鹿毛,,801307
58336,2009102605,バレンタインパパ,2009年2月14日,菊沢隆徳 (美浦),長谷川清英,田端牧場,日高町,"1,627万円 (2010年 セレクトセール)",400万円 (中央) /10万円 (地方),10戦0勝 [ 0-1-0-9 ],,キャッチミーアップ 、 アサケハーツ,2000101426 000a010f07,抹消,牡,鹿毛,,901358


In [8]:
# Snapshot of the horse_record dataframe
print(horse_df.shape)
horse_df.describe().T

(86898, 18)


column,count,unique,top,freq
horse_id,86898,86898,2011102358,1
horse_name,86898,85820,トリガー,3
date_of_birth,86898,3905,2002年4月3日,73
trainer,86898,2065,(地方),542
owner,86898,6511,サンデーレーシング,1304
breeder,86898,4265,ノーザンファーム,4750
place_of_birth,86898,107,浦河町,16137
transaction_price,86898,7779,-,68212
prize_obtained,86898,30912,0万円,14794
race_record,86898,22588,2戦0勝 [ 0-0-0-2 ],3642
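
The table shows 85820 unique values of `horse_name` across 86898 rows, so a name does not identify a horse uniquely, whereas `horse_id` does. A short sketch, on toy rows invented for illustration, of how horses sharing a name could be listed with `DataFrame.duplicated`:

```python
import pandas as pd

# Toy horse table: two different horses share the name 'name_a'
toy_horse = pd.DataFrame({
    'horse_id': ['id_1', 'id_2', 'id_3'],
    'horse_name': ['name_a', 'name_a', 'name_b'],
})

# keep=False marks every row of a duplicated name, not only the later ones
shared_names = toy_horse[toy_horse.duplicated('horse_name', keep=False)]
print(shared_names)
```

Joining tables on `horse_id` rather than on the name avoids mixing up such horses.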


### 1.2.3 Individual Record
A similar check is done for the individual record table, which provides yearly consolidated results for each individual involved in horse racing events.

In [9]:
# Check the data columns
individual_df.sample(n=3)

index,individual_id,individual_type,name,year,rank,first,second,third,out,races_major,...,wins_flat,races_grass,wins_grass,races_dirt,wins_dirt,wins_percent,wins_percent_2nd,wins_percent_3rd,prize_obtained,representative_horse
95012,900008,馬主,小池宗人,2013,824,1,0,0,2,0,...,1,0,0,3,1,0.333,0.333,0.333,500.0,アイムオンファイア
28820,52006,馬主,三木久史,1991,927,1,0,0,1,0,...,1,0,0,2,1,0.5,0.5,0.5,970.0,マルサンホマレ
59121,80828,生産者,John L. Frost,1997,1103,1,1,0,1,0,...,1,1,0,2,1,0.333,0.667,0.667,700.0,ボストンファックス


In [10]:
# Snapshot of the individual_record dataframe
print(individual_df.shape)
individual_df.describe().T

(115271, 24)


column,count,unique,top,freq
individual_id,115271,11785,148800,34
individual_type,115271,4,生産者,51931
name,115271,10410,シンボリ牧場,97
year,115271,34,累計,11785
rank,115271,1501,,11785
first,115271,517,0,49125
second,115271,515,0,51562
third,115271,510,0,49672
out,115271,1652,1,9689
races_major,115271,358,0,77351


### 1.2.4 Trainer Profile
A similar check is done for the trainer profile table, which lists personal information for each trainer.

In [11]:
# Check the data columns
trainer_df.sample(n=3)

index,trainer_id,trainer_name,date_of_birth,place_of_birth,first_run_date,first_run_horse,first_win_date,first_win_horse
1007,5707,[地]諏訪貴正,1972/09/19,,,,,
456,5338,[地]日野啓二,1955/01/21,,,,,
963,5682,[外]マリンズ,1956/09/15,,,,,


In [12]:
# Snapshot of the trainer_profile dataframe
print(trainer_df.shape)
trainer_df.describe().T

(1021, 8)


column,count,unique,top,freq
trainer_id,1021,1021,05380,1
trainer_name,1021,1016,[外]フーラハ,2
date_of_birth,1021,987,1965/09/29,2
place_of_birth,1021,36,,799
first_run_date,1021,157,,798
first_run_horse,1021,221,,801
first_win_date,1021,210,,800
first_win_horse,1021,221,,801


### 1.2.5 Jockey Profile
A similar check is done for the jockey profile table, which lists personal information for each jockey.

In [13]:
# Check the data columns
jockey_df.sample(n=3)

index,jockey_id,jockey_name,date_of_birth,place_of_birth,blood_type,height,weight,first_flat_run_date,first_flat_run_horse,first_flat_win_date,first_flat_win_horse,first_obs_run_date,first_obs_run_horse,first_obs_win_date,first_obs_win_horse
732,5499,ウォルシ,1979/05/14,,,,,,,,,2013/03/23,ブラックステアマウンテン,,
360,1062,川島信二,1982/11/24,東京都,B型,158cm,45kg,2001/03/03,イスズペルル,2001/03/04,オースミダイモン,,,,
788,5538,シュミノ,1993/11/18,,,,,2016/12/03,グランパルファン,2016/12/04,サンティール,,,,


In [14]:
# Snapshot of the jockey_profile dataframe
print(jockey_df.shape)
jockey_df.describe().T

(807, 15)


column,count,unique,top,freq
jockey_id,807,807,00867,1
jockey_name,807,801,ウィリア,2
date_of_birth,807,780,1980/08/27,2
place_of_birth,807,41,,573
blood_type,807,5,,578
height,807,25,,571
weight,807,17,,571
first_flat_run_date,807,425,,72
first_flat_run_horse,807,728,,72
first_flat_win_date,807,406,,354


## 1.3 Preprocessing
The following shows further preprocessing of the dataset. It predominantly revolves around handling missing values in each column. Most columns contain no missing values, and several of the others are either over 90% missing or under 1% missing, so the affected columns or rows are simply dropped.

### 1.3.1 Race Record

As the percentage of missing values in the later columns is trivial in this case, the affected rows can be safely dropped without materially affecting the dataset.

In [15]:
# Check missing value
get_missing_value_perc(race_df)

column,Missing Value,Missing Value (%)
run_date,0,0.00%
place,0,0.00%
race,0,0.00%
title,0,0.00%
type,0,0.00%
track,0,0.00%
distance,0,0.00%
weather,0,0.00%
condition,0,0.00%
time,0,0.00%
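
`get_missing_value_perc` is imported from the project's `preprocessing` module and its source is not shown. A minimal sketch of a comparable helper, assuming that a missing cell is stored as the string `'null'` (an assumption suggested by the `'null'` filters used in the cells of this section) and that the percentage is formatted with two decimals. The name `get_missing_value_perc_sketch` and the toy rows are illustrative.

```python
import pandas as pd


def get_missing_value_perc_sketch(df, marker='null'):
    """Hypothetical stand-in: count cells equal to `marker` in each column."""
    missing = (df == marker).sum()
    percent = (missing / len(df) * 100).map('{:.2f}%'.format)
    return pd.DataFrame({'Missing Value': missing, 'Missing Value (%)': percent})


# Toy profile table: three of the four rows lack a place of birth
toy_profile = pd.DataFrame({
    'trainer_id': ['id_1', 'id_2', 'id_3', 'id_4'],
    'place_of_birth': ['東京都', 'null', 'null', 'null'],
})
report = get_missing_value_perc_sketch(toy_profile)
print(report)
```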


In [16]:
# Keep only rows with no 'null' marker in any column and a non-empty run_time
race_df = race_df.loc[~(race_df == 'null').any(axis=1) & (race_df['run_time'] != ''), :]

### 1.3.2 Horse Record

The 'offer_info' column can simply be dropped from the dataset as 98.39% of its values are missing.

In [17]:
# Check missing value
get_missing_value_perc(horse_df)

column,Missing Value,Missing Value (%)
horse_id,0,0.00%
horse_name,0,0.00%
date_of_birth,0,0.00%
trainer,0,0.00%
owner,0,0.00%
breeder,0,0.00%
place_of_birth,0,0.00%
transaction_price,0,0.00%
prize_obtained,0,0.00%
race_record,0,0.00%


In [18]:
horse_df = horse_df.drop('offer_info', axis=1)
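
The same decision could be made by rule rather than by inspection. A minimal sketch, assuming missing cells are marked with the string `'null'`; the helper `drop_sparse_columns` and its 90% threshold are hypothetical choices for illustration.

```python
import pandas as pd


def drop_sparse_columns(df, marker='null', threshold=0.9):
    """Drop every column whose share of `marker` cells exceeds `threshold`."""
    missing_share = (df == marker).mean()
    sparse = missing_share[missing_share > threshold].index
    return df.drop(columns=sparse)


# Toy horse table: offer_info is entirely missing, transaction_price only partly
toy_horse = pd.DataFrame({
    'horse_id': ['id_1', 'id_2', 'id_3', 'id_4'],
    'transaction_price': ['-', 'null', '-', '-'],
    'offer_info': ['null', 'null', 'null', 'null'],
})
cleaned = drop_sparse_columns(toy_horse)
print(list(cleaned.columns))
```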

### 1.3.3 Individual Record

As the individual records with missing values all belong to a single person, they can simply be dropped from the table.

In [19]:
# Check missing value
get_missing_value_perc(individual_df)

column,Missing Value,Missing Value (%)
individual_id,0,0.00%
individual_type,0,0.00%
name,0,0.00%
year,0,0.00%
rank,0,0.00%
first,0,0.00%
second,0,0.00%
third,0,0.00%
out,0,0.00%
races_major,0,0.00%


In [20]:
# List the names (up to five) tied to records that contain a 'null' marker
individual_df.loc[individual_df.applymap(lambda x: x == 'null').sum(axis=1) > 0]['name'].value_counts(ascending=False)[:5]

Series([], Name: name, dtype: int64)

In [21]:
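# Drop records that contain a 'null' marker in any column
individual_df = individual_df.loc[individual_df.applymap(lambda x: x == 'null').sum(axis=1) == 0, :]
# Drop the cumulative-total rows (year == u'累計') so that only yearly records remain
individual_df = individual_df.loc[individual_df['year'] != u'累計']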
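
Once the cumulative rows are gone, the `year` column holds only year values and can be converted to integers for sorting or time-based features. A small sketch on toy rows, assuming `year` is stored as text (which the `unique`/`top` statistics reported for it in section 1.2.3 suggest):

```python
import pandas as pd

# Toy individual records: two yearly rows and one cumulative-total row
toy_individual = pd.DataFrame({
    'name': ['owner_a', 'owner_a', 'owner_a'],
    'year': ['2013', '1991', '累計'],
    'first': [1, 1, 2],
})

# .copy() gives an independent frame, so assigning a column raises no warning
yearly = toy_individual.loc[toy_individual['year'] != '累計'].copy()
yearly['year'] = yearly['year'].astype(int)
print(yearly.sort_values('year'))
```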

### 1.3.4 Trainer & Jockey Profile

For place of birth, a trainer or jockey without such a record is assumed to be from outside Tokyo and is given the placeholder value u'地方'. For the other attributes, further feature engineering is likely possible, so they are tentatively not treated as missing values. For example, attributes such as the first run date could be derived from the race record table, although this is not shown here.

In [22]:
# Check missing value
get_missing_value_perc(trainer_df)

column,Missing Value,Missing Value (%)
trainer_id,0,0.00%
trainer_name,0,0.00%
date_of_birth,0,0.00%
place_of_birth,799,78.26%
first_run_date,798,78.16%
first_run_horse,801,78.45%
first_win_date,800,78.35%
first_win_horse,801,78.45%


In [23]:
# Check missing value
get_missing_value_perc(jockey_df)

column,Missing Value,Missing Value (%)
jockey_id,0,0.00%
jockey_name,0,0.00%
date_of_birth,0,0.00%
place_of_birth,573,71.00%
blood_type,578,71.62%
height,571,70.76%
weight,571,70.76%
first_flat_run_date,72,8.92%
first_flat_run_horse,72,8.92%
first_flat_win_date,354,43.87%


In [24]:
# Replace the 'null' marker in place_of_birth with the placeholder u'地方'
trainer_df['place_of_birth'] = trainer_df['place_of_birth'].apply(lambda x: x if x != 'null' else u'地方')
jockey_df['place_of_birth'] = jockey_df['place_of_birth'].apply(lambda x: x if x != 'null' else u'地方')
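
The derivation of a first run date from the race records, mentioned at the start of this subsection, might look roughly like the sketch below. It uses invented toy rows; the column names `run_date`, `jockey` and `horse` follow `race_df`, while the output names `first_run_date` and `first_run_horse` are illustrative.

```python
import pandas as pd

# Toy race records (invented rows), deliberately not in date order
toy_race = pd.DataFrame({
    'run_date': ['2001-03-04', '2001-03-03', '2016-12-04', '2016-12-03'],
    'jockey': ['jockey_a', 'jockey_a', 'jockey_b', 'jockey_b'],
    'horse': ['horse_2', 'horse_1', 'horse_4', 'horse_3'],
})
toy_race['run_date'] = pd.to_datetime(toy_race['run_date'])

# Earliest race per jockey: sort by date, then take the first row of each group
first_runs = (
    toy_race.sort_values('run_date')
    .groupby('jockey')[['run_date', 'horse']]
    .first()
    .rename(columns={'run_date': 'first_run_date', 'horse': 'first_run_horse'})
)
print(first_runs)
```

A value derived this way is only the earliest race within the crawled period, so it may be later than the true debut, and it does not separate flat from obstacle races as the jockey profile does.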

Finally, we can output the dataframes as CSV files for further analysis.

In [25]:
race_df.to_csv('data/race.csv', encoding='utf-8')
horse_df.to_csv('data/horse.csv', encoding='utf-8')
individual_df.to_csv('data/individual.csv', encoding='utf-8')
trainer_df.to_csv('data/trainer.csv', encoding='utf-8')
jockey_df.to_csv('data/jockey.csv', encoding='utf-8')
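
When these files are read back, two details are worth keeping in mind: `to_csv` writes the row index as an unnamed first column by default, and identifiers with leading zeros (such as the trainer ID `05380` seen in section 1.2.4) are parsed as integers unless a string dtype is requested. A small round-trip sketch on a toy row, using in-memory buffers instead of files:

```python
import io

import pandas as pd

toy_trainer = pd.DataFrame({'trainer_id': ['05380'], 'trainer_name': ['toy_trainer']})

# Default to_csv writes the row index as an unnamed first column
with_index = io.StringIO()
toy_trainer.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False omits the index; dtype=str keeps identifiers such as '05380' intact
without_index = io.StringIO()
toy_trainer.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index, dtype=str)

print(list(reloaded_default.columns), reloaded_default.loc[0, 'trainer_id'])
print(list(reloaded_clean.columns), reloaded_clean.loc[0, 'trainer_id'])
```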