## 1. Database Preprocessing
For data preprocessing, Pandas and sqlite3 are used to extract the crawled data from the SQLite database and to perform preprocessing steps such as handling missing values.

In [1]:
import pandas as pd
import sqlite3

def select_return_table(table_name):
    # Select from all records and convert to pandas dataframe
    data = curs.execute('SELECT * FROM %s' % table_name).fetchall()
    column = [element[1] for element in curs.execute('PRAGMA table_info(%s)' % table_name).fetchall()]
    return pd.DataFrame(data, columns=column)

def get_missing_value_perc(df, cond=lambda x: x == 'null'):
    # Flag missing cells once with `cond`, then report count and percentage per column
    mask = df.applymap(cond)
    df_sum = mask.sum()
    df_percentage = (df_sum / len(df)).apply(lambda x: '{0:.2f}%'.format(x * 100))
    return pd.concat([df_sum, df_percentage], axis=1, keys=['Missing Value', 'Missing Value (%)'])
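
To illustrate what the missing-value helper reports, here is a minimal sketch on a made-up toy dataframe. The name `missing_value_summary`, the `marker` parameter and the toy data are assumptions for illustration only; the sketch uses a vectorised comparison against the `'null'` marker rather than a per-cell function.

```python
import pandas as pd

def missing_value_summary(df, marker='null'):
    # Boolean mask of the cells that equal the marker string
    mask = df.eq(marker)
    counts = mask.sum()
    # Share of flagged cells per column, formatted as a percentage string
    percent = (counts / len(df)).map(lambda x: '{0:.2f}%'.format(x * 100))
    return pd.concat([counts, percent], axis=1,
                     keys=['Missing Value', 'Missing Value (%)'])

# Hypothetical toy data: two of the four heights are marked as 'null'
toy_df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'height': ['160', 'null', 'null', '158'],
})
summary = missing_value_summary(toy_df)
print(summary)
```

Each row of the result corresponds to one column of the input, which is the layout of the missing-value tables in section 1.3.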

## 1.1 Database Connection
Five tables were created during the crawling stage. They contain information about races, horses, individual past results (of trainers, jockeys, breeders and owners), and trainer and jockey profiles. These tables, stored in an SQLite database, can easily be loaded into Pandas dataframes for further processing.

In [2]:
# Establish database connection and check table name
conn = sqlite3.connect('temp/race.db')
curs = conn.cursor()
table_name = curs.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
print(table_name)

[('race_record',), ('horse_record',), ('individual_record',), ('trainer_profile',), ('jockey_profile',)]


In [3]:
# Read from record data
record_dict = {name[0]: select_return_table(name[0]) for name in table_name}
race_df = record_dict['race_record']
horse_df = record_dict['horse_record']
individual_df = record_dict['individual_record']
trainer_df = record_dict['trainer_profile']
jockey_df = record_dict['jockey_profile']
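
As an alternative sketch, Pandas can load a table together with its column names in a single call via `pd.read_sql_query`, which avoids the separate `PRAGMA table_info` lookup. The in-memory database and its rows below are made up for illustration and only stand in for `temp/race.db`.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the real database (hypothetical rows)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE race_record (run_date TEXT, place TEXT, race TEXT)')
demo_conn.executemany(
    'INSERT INTO race_record VALUES (?, ?, ?)',
    [('2005-05-21', '中山', '1R'), ('2005-05-21', '中山', '2R')],
)

# List the table names, then load each table with its column names attached
names = [row[0] for row in demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
tables = {name: pd.read_sql_query('SELECT * FROM "{}"'.format(name), demo_conn)
          for name in names}
print(tables['race_record'])
demo_conn.close()
```

Table names cannot be bound as query parameters in SQLite, so the name is still formatted into the statement; this is only reasonable because the names come from `sqlite_master` rather than from user input.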

## 1.2 Integrity Check
This step checks the integrity of the crawled data: it confirms that the data are largely consistent with what is presented online and with some basic rules of national horse racing events in Japan (e.g. each place should hold 12 races on a race day). More details are available on the <a href='http://www.jra.go.jp/'>JRA (Japan Racing Association) webpage</a>.

### 1.2.1 Race Record
The following shows the attributes of the race record table along with some basic statistics. Grouping race records by run date, place and race number yields the total number of races held at one place on a given date.

In [4]:
# Snapshot of the race_record dataframe
print(race_df.shape)
race_df.describe().T

(287681, 27)


column,count,unique,top,freq
run_date,287681,681,2005-05-21,547
place,287681,10,中山,41958
race,287681,12,3R,26656
title,287681,1642,3歳未勝利,67621
type,287681,3,ダ,140215
track,287681,4,右,189136
distance,287681,67,1200m,68470
weather,287681,6,晴,170843
condition,287681,4,良,225379
time,287681,85,12:50,8336


The following confirms that, for national racing events in Japan, 12 races are held at a single place on almost every race day, with occasional exceptions (e.g. 9 races at 中山 on 2000-12-24 in the output below). A random sample of 10 race days is shown.

In [5]:
# Ensure that (almost) all races on the same day at the same place have a count of 12
race_count = curs.execute('SELECT DISTINCT run_date, place, race from race_record').fetchall()
race_count_df = pd.DataFrame(race_count, columns=['run_date', 'place', 'race'])
race_count_df.groupby(['run_date', 'place']).count().sample(n=10)

run_date,place,race
2004-02-08,東京,12
2003-11-15,福島,12
2000-12-24,中山,9
2000-07-09,阪神,12
2003-09-21,中山,12
2001-07-29,函館,12
2004-04-10,福島,12
2003-12-13,中山,12
2001-09-30,阪神,12
2005-03-06,中山,12
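
Because a random sample only spot-checks the rule, the race days that deviate from 12 races can also be listed explicitly. The following is a minimal sketch on made-up toy data that mirrors the `run_date`, `place` and `race` columns used above; the variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy data: one full race day and one with only 9 races
rows = [('2004-02-08', '東京', '{}R'.format(i)) for i in range(1, 13)]
rows += [('2000-12-24', '中山', '{}R'.format(i)) for i in range(1, 10)]
toy_races = pd.DataFrame(rows, columns=['run_date', 'place', 'race'])

# Number of distinct races per (run_date, place)
per_day = toy_races.groupby(['run_date', 'place'])['race'].nunique()

# Keep only the race days that deviate from the expected 12 races
irregular = per_day[per_day != 12]
print(irregular)
```

On the real data, the length of `irregular` relative to `per_day` would quantify how often the 12-race rule is not met.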


### 1.2.2 Horse Record
A similar check is done for the horse record table.

In [6]:
# Check the data columns
horse_df.sample(n=3)

index,horse_name,date_of_birth,trainer,owner,breeder,place_of_birth,transaction_price,prize_obtained,race_record,highlight_race,relatives,parents,status,gender,breed,offer_info
24841,マイネルグロッソ,2002年5月5日,清水英克 (美浦),サラブレッドクラブ・ラフィアン,福山牧場,門別町,-,"4,074万円 (中央)",41戦2勝 [ 2-3-5-31 ],08'喜多方特別(５００万下),レザーノート 、 ブレンニューライフ,スターオブコジーン ラベンダーノート,抹消,牡,鹿毛,
32484,トーセンクロス,2004年5月15日,小笠倫弘 (美浦),島川隆哉,Robert N. Clay & Fair Way Equine LLC,米,-,"6,350万円 (中央) /887万円 (地方)",29戦5勝 [ 5-1-5-18 ],12'川崎スパーキングスプリント(OP),,Broad Brush Ballerina Princess,抹消,牡,鹿毛,
7168,ヘイアンエルドラド,1998年1月27日,松山康久 (美浦),荻原昭二,Land of Believe Farm Inc.,米,-,"6,840万円 (中央) /100万円 (地方)",22戦5勝 [ 5-2-2-13 ],02'筑波山特別(1000万下),,フレンチデピュティ Canadian Halo,抹消,牡,鹿毛,


In [7]:
# Snapshot of the horse_record dataframe
print(horse_df.shape)
horse_df.describe().T

(32648, 16)


column,count,unique,top,freq
horse_name,32648,32648,アナザーステージ,1
date_of_birth,32648,1905,2002年4月3日,72
trainer,32648,1447,藤沢和雄 (美浦),147
owner,32648,4677,サンデーレーシング,493
breeder,32648,3017,ノーザンファーム,1175
place_of_birth,32648,92,浦河町,6205
transaction_price,32648,2771,-,27523
prize_obtained,32648,15805,0万円,5213
race_record,32648,12211,2戦0勝 [ 0-0-0-2 ],1167
highlight_race,32648,6651,,10355
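
One further consistency check that could be applied to the horse record table concerns the `race_record` strings. The sketch below assumes that a value such as `41戦2勝 [ 2-3-5-31 ]` encodes total runs, wins, and the counts of first, second, third and other finishes; this reading of the format, the regular expression and the function name are assumptions, not something stated by the data source.

```python
import re
import pandas as pd

# Assumed layout: '<runs>戦<wins>勝 [ <1st>-<2nd>-<3rd>-<other> ]'
PATTERN = re.compile(r'(\d+)戦(\d+)勝\s*\[\s*(\d+)-(\d+)-(\d+)-(\d+)\s*\]')

def race_record_is_consistent(text):
    match = PATTERN.search(text)
    if match is None:
        return False
    runs, wins, first, second, third, other = (int(g) for g in match.groups())
    # Wins should equal first-place finishes, and the finishes should sum to the runs
    return wins == first and runs == first + second + third + other

# Values taken from the sample and summary output above
sample = pd.Series(['41戦2勝 [ 2-3-5-31 ]', '29戦5勝 [ 5-1-5-18 ]', '2戦0勝 [ 0-0-0-2 ]'])
consistent = sample.map(race_record_is_consistent)
print(consistent)
```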


### 1.2.3 Individual Record
A similar check is done for the individual record table. It provides consolidated yearly results for each individual involved in horse racing events.

In [8]:
# Check the data columns
individual_df.sample(n=3)

index,individual_type,name,year,rank,first,second,third,out,races_major,wins_major,...,wins_flat,races_grass,wins_grass,races_dirt,wins_dirt,wins_percent,wins_percent_2nd,wins_percent_3rd,prize_obtained,representative_horse
76173,馬主,水野恵吉,2002,1216,0,0,0,1,0,0,...,0,0,0,1,0,0.0,0.0,0.0,0.0,エフワンライデン
32282,馬主,菊池昭雄,2010,556,0,4,4,11,0,0,...,0,15,0,4,0,0.0,0.211,0.421,1710.0,ケイビイテルマ
29307,馬主,今秀幸,2004,1170,0,0,0,1,0,0,...,0,0,0,1,0,0.0,0.0,0.0,0.0,リボンフォールズ


In [9]:
# Snapshot of the individual_record dataframe
print(individual_df.shape)
individual_df.describe().T

(84936, 23)


column,count,unique,top,freq
individual_type,84936,4,馬主,35940
name,84936,7384,山元哲二,33
year,84936,33,2000,4159
rank,84936,1499,1313,736
first,84936,190,0,34471
second,84936,164,0,36226
third,84936,152,0,34665
out,84936,744,1,5995
races_major,84936,124,0,56020
wins_major,84936,29,0,78458


### 1.2.4 Trainer Profile
A similar check is done for the trainer profile table. It lists personal information for each trainer.

In [10]:
# Check the data columns
trainer_df.sample(n=3)

index,trainer_name,date_of_birth,place_of_birth,first_run_date,first_run_horse,first_win_date,first_win_horse
478,[地]野村正直,1951/02/23,,,,,
635,[地]町野良隆,1958/02/07,,,,,
198,[東]山崎彰義,1931/08/18,,,,,


In [11]:
# Snapshot of the trainer_profile dataframe
print(trainer_df.shape)
trainer_df.describe().T

(784, 7)


column,count,unique,top,freq
trainer_name,784,784,[西]河内洋,1
date_of_birth,784,756,1944/02/15,2
place_of_birth,784,36,,563
first_run_date,784,154,,565
first_run_horse,784,220,,565
first_win_date,784,208,,565
first_win_horse,784,220,,565


### 1.2.5 Jockey Profile
A similar check is done for the jockey profile table. It lists personal information for each jockey.

In [12]:
# Check the data columns
jockey_df.sample(n=3)

index,jockey_name,date_of_birth,place_of_birth,blood_type,height,weight,first_flat_run_date,first_flat_run_horse,first_flat_win_date,first_flat_win_horse,first_obs_run_date,first_obs_run_horse,first_obs_win_date,first_obs_win_horse
444,マンビー,1975/05/20,,,,,2002/04/06,スプリングドキッチ,2002/04/28,シンボリスキャン,,,,
212,向山牧,1965/07/05,,,,,1992/07/26,ヒロインセイコー,1994/09/04,ダンツダンサー,,,,
453,五十嵐恭,1983/12/06,,,,,2002/07/13,トウゲンキョウ,,,,,,


In [13]:
# Snapshot of the jockey_profile dataframe
print(jockey_df.shape)
jockey_df.describe().T

(593, 14)


column,count,unique,top,freq
jockey_name,593,593,上松瀬竜,1
date_of_birth,593,580,1961/11/25,2
place_of_birth,593,36,,408
blood_type,593,5,,410
height,593,23,,408
weight,593,16,,408
first_flat_run_date,593,316,,29
first_flat_run_horse,593,557,,29
first_flat_win_date,593,318,,244
first_flat_win_horse,593,348,,244


## 1.3 Preprocessing
The following shows further preprocessing of the dataset. It predominantly revolves around dealing with missing values within each column. In the record tables, most columns contain no missing values, while the remaining ones have either over 90% or below 1% of their values missing, so simple dropping is used: a column that is almost entirely missing is dropped, as are the few rows that contain missing values.

### 1.3.1 Race Record

Because the percentage of missing values in the later columns is trivial in this case, the affected rows can be safely dropped without materially affecting the dataset.

In [14]:
# Check missing value
get_missing_value_perc(race_df)

column,Missing Value,Missing Value (%)
run_date,0,0.00%
place,0,0.00%
race,0,0.00%
title,0,0.00%
type,0,0.00%
track,0,0.00%
distance,0,0.00%
weather,0,0.00%
condition,0,0.00%
time,0,0.00%


In [15]:
race_df = race_df.loc[race_df.applymap(lambda x: x == 'null').sum(axis=1) == 0, :]
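
The row filter can also be expressed by converting the `'null'` marker to `NaN` and letting `dropna` remove the rows. This is a sketch on made-up toy data, shown next to the boolean-mask form for comparison; it is an alternative formulation, not the method used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one 'null' cell in the second row
toy_df = pd.DataFrame({
    'place': ['中山', '東京', '阪神'],
    'time': ['12:50', 'null', '10:05'],
})

# Boolean-mask form: keep rows that contain no 'null' cell
kept_mask = toy_df.loc[toy_df.eq('null').sum(axis=1) == 0, :]

# Alternative form: turn the marker into NaN, then drop rows with any NaN
kept_dropna = toy_df.replace('null', np.nan).dropna(axis=0, how='any')

print(kept_dropna)
```

Both forms keep the original row labels, so the index of the result is no longer contiguous unless it is reset.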

### 1.3.2 Horse Record

The last column, 'offer_info', can simply be dropped from the dataset because 98.39% of its values are missing.

In [16]:
# Check missing value
get_missing_value_perc(horse_df)

column,Missing Value,Missing Value (%)
horse_name,0,0.00%
date_of_birth,0,0.00%
trainer,0,0.00%
owner,0,0.00%
breeder,0,0.00%
place_of_birth,0,0.00%
transaction_price,0,0.00%
prize_obtained,0,0.00%
race_record,0,0.00%
highlight_race,0,0.00%


In [17]:
horse_df = horse_df.drop('offer_info', axis=1)
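
Rather than naming the column by hand, columns can be selected for dropping by their share of missing values. The sketch below uses made-up toy data, and the 0.5 cut-off is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'offer_info' is 'null' in three of four rows
toy_df = pd.DataFrame({
    'horse_name': ['a', 'b', 'c', 'd'],
    'offer_info': ['null', 'null', 'null', 'x'],
})

# Share of 'null' cells per column
null_share = toy_df.eq('null').mean()

# Hypothetical cut-off: drop columns where more than half of the values are missing
threshold = 0.5
to_drop = null_share[null_share > threshold].index.tolist()
reduced_df = toy_df.drop(columns=to_drop)
print(reduced_df.columns.tolist())
```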

### 1.3.3 Individual Record

Each individual record with missing values belongs to a different person (no name appears more than once among these records, as the count below shows), so these records can simply be dropped from the table.

In [18]:
# Check missing value
get_missing_value_perc(individual_df)

column,Missing Value,Missing Value (%)
individual_type,0,0.00%
name,0,0.00%
year,0,0.00%
rank,0,0.00%
first,63,0.07%
second,63,0.07%
third,63,0.07%
out,63,0.07%
races_major,63,0.07%
wins_major,63,0.07%


In [19]:
individual_df.loc[individual_df.applymap(lambda x: x == 'null').sum(axis=1) > 0]['name'].value_counts(ascending=False)[:5]

久保牧場                 1
安達篤                  1
岡冨俊一                 1
ロケット                 1
Emory A. Hamilton    1
Name: name, dtype: int64

In [20]:
individual_df = individual_df.loc[individual_df.applymap(lambda x: x == 'null').sum(axis=1) == 0, :]

### 1.3.4 Trainer & Jockey Profile

Regarding place of birth, a trainer or jockey without such a record is assumed to come from outside Tokyo and is assigned the value '地方'. The other attributes are tentatively left as they are instead of being treated as missing values, since further feature engineering is believed to be feasible for them. For instance, attributes such as the first run date could be derived from the race record table, although this is not shown here.

In [21]:
# Check missing value
get_missing_value_perc(trainer_df)

column,Missing Value,Missing Value (%)
trainer_name,0,0.00%
date_of_birth,0,0.00%
place_of_birth,563,71.81%
first_run_date,565,72.07%
first_run_horse,565,72.07%
first_win_date,565,72.07%
first_win_horse,565,72.07%


In [22]:
# Check missing value
get_missing_value_perc(jockey_df)

column,Missing Value,Missing Value (%)
jockey_name,0,0.00%
date_of_birth,0,0.00%
place_of_birth,408,68.80%
blood_type,410,69.14%
height,408,68.80%
weight,408,68.80%
first_flat_run_date,29,4.89%
first_flat_run_horse,29,4.89%
first_flat_win_date,244,41.15%
first_flat_win_horse,244,41.15%


In [23]:
trainer_df['place_of_birth'] = trainer_df['place_of_birth'].apply(lambda x: x if x != 'null' else u'地方')
jockey_df['place_of_birth'] = jockey_df['place_of_birth'].apply(lambda x: x if x != 'null' else u'地方')
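
The idea of deriving a first run date from the race record table can be sketched as follows. The toy data are made up, and the `jockey` column name in the race-level frame is an assumption, since that part of the race record schema is not shown above. The result is only an approximation when the crawled period does not cover a jockey's whole career, and the date format differs from the slash-separated dates in the profile tables.

```python
import pandas as pd

# Hypothetical race-level data; the 'jockey' column name is an assumption
toy_race = pd.DataFrame({
    'run_date': ['2002-07-13', '2002-04-06', '2002-04-28', '2002-08-03'],
    'jockey': ['五十嵐恭', 'マンビー', 'マンビー', '五十嵐恭'],
})
toy_profile = pd.DataFrame({
    'jockey_name': ['マンビー', '五十嵐恭'],
    'first_flat_run_date': ['null', 'null'],
})

# Earliest recorded run date per jockey (ISO date strings sort chronologically)
first_run = toy_race.groupby('jockey')['run_date'].min()

# Fill only the entries that are marked as missing
missing = toy_profile['first_flat_run_date'] == 'null'
toy_profile.loc[missing, 'first_flat_run_date'] = (
    toy_profile.loc[missing, 'jockey_name'].map(first_run)
)
print(toy_profile)
```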

Finally, the dataframes are written out as CSV files for further analysis.

In [24]:
race_df.to_csv('data/race.csv')
horse_df.to_csv('data/horse.csv')
individual_df.to_csv('data/individual.csv')
trainer_df.to_csv('data/trainer.csv')
jockey_df.to_csv('data/jockey.csv')
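
One detail worth knowing when these files are read back: by default `to_csv` writes the dataframe index as an unnamed first column, which `read_csv` then labels `Unnamed: 0`. The sketch below demonstrates this on made-up toy data, using in-memory buffers instead of files under `data/`.

```python
import io
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({'place': ['中山', '東京'], 'race': ['1R', '2R']})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False omits it, so the columns round-trip unchanged
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = list(pd.read_csv(without_index).columns)

print(cols_default, cols_clean)
```

Whether to keep the index depends on the later analysis: after rows have been dropped, the original row labels may still be useful for tracing records back to the database.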