## 1. Database Preprocessing
For data preprocessing, pandas and sqlite3 are used to load the crawled data from the SQLite database and clean it, including handling missing values.

In [1]:
import os
import pandas as pd
import sqlite3

def select_return_table(table_name):
    # Select all records from the table (via the global cursor `curs`) and convert them to a pandas DataFrame
    data = curs.execute('SELECT * FROM %s' % table_name).fetchall()
    column = [element[1] for element in curs.execute('PRAGMA table_info(%s)' % table_name).fetchall()]
    return pd.DataFrame(data, columns=column)

def get_missing_value_perc(df, cond=lambda x: x == 'null'):
    # Count the cells matching `cond` in each column and report the count and its percentage
    mask = df.applymap(cond)  # boolean frame; newer pandas versions offer DataFrame.map for this
    df_sum = mask.sum()
    df_percentage = df_sum / mask.count()
    df_percentage = df_percentage.apply(lambda x: '{0:.2f}%'.format(x * 100))
    return pd.concat([df_sum, df_percentage], axis=1, keys=['Missing Value', 'Missing Value (%)'])
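
The helper above treats only the string `'null'` as missing. As a minimal sketch, assuming that empty strings may also stand for missing values in some tables, a vectorised variant could accept several markers at once. The name `missing_value_summary` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_value_summary(df, markers=('null', '')):
    """Count cells equal to any marker, per column (hypothetical helper)."""
    mask = df.isin(list(markers))  # True wherever a cell equals one of the markers
    counts = mask.sum()
    percent = (counts / len(df) * 100).map('{:.2f}%'.format)
    return pd.concat([counts, percent], axis=1,
                     keys=['Missing Value', 'Missing Value (%)'])

# Toy data for illustration only
demo = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'place_of_birth': ['null', '', 'Chiba', 'null'],
})
summary = missing_value_summary(demo)
print(summary)
```

Whether the empty string should count as missing depends on how the crawler stored absent fields, so the marker list is best confirmed against the raw tables first.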

## 1.1 Database Connection
Five tables were created during the crawling stage. They contain race records, horse records, individual past results (of trainers, jockeys, breeders and owners), trainer profiles and jockey profiles. These SQLite tables can easily be loaded into pandas DataFrames for further processing.

In [2]:
# Establish database connection and check table name
conn = sqlite3.connect(os.path.abspath(os.path.join(os.path.dirname(os.getcwd()), 'crawler', 'data', 'race.db')))
curs = conn.cursor()
table_name = curs.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
print(table_name)

[('race_record',), ('horse_record',), ('individual_record',), ('trainer_profile',), ('jockey_profile',)]


In [3]:
# Read from record data
record_dict = {name[0]: select_return_table(name[0]) for name in table_name}
race_df = record_dict['race_record']
horse_df = record_dict['horse_record']
individual_df = record_dict['individual_record']
trainer_df = record_dict['trainer_profile']
jockey_df = record_dict['jockey_profile']
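
As an alternative to fetching rows and column names separately, pandas can read a query result directly. The following is a sketch only, using an in-memory database with made-up rows. The function name `load_table` is hypothetical, and the check against `sqlite_master` is an added precaution because table names cannot be bound as SQL parameters.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for race.db, with made-up rows
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE race_record (run_date TEXT, place TEXT, race TEXT)')
conn.executemany('INSERT INTO race_record VALUES (?, ?, ?)',
                 [('2009-03-21', 'Tokyo', '1R'), ('2009-03-21', 'Tokyo', '2R')])

def load_table(conn, table_name):
    """Load a whole table into a DataFrame (hypothetical helper)."""
    # Only accept names that actually exist in the database
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    if table_name not in known:
        raise ValueError('unknown table: %s' % table_name)
    return pd.read_sql_query('SELECT * FROM "%s"' % table_name, conn)

demo_df = load_table(conn, 'race_record')
print(demo_df)
```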

## 1.2 Integrity Check
This step checks the integrity of the crawled data: it should be largely consistent with the data published online and with some basic rules of national horse racing events in Japan (e.g. there should be 12 races per place on a race day). More details are on the <a href='http://www.jra.go.jp/'>JRA (Japan Racing Association) webpage</a>.

In [4]:
# Check the availability of data for each type of individual
unique_list = {
    'horse': race_df['horse'].unique(), 
    'trainer': race_df['trainer'].unique(), 'jockey': race_df['jockey'].unique(),
    'owner': race_df['owner'].unique(), 'breeder': horse_df['breeder'].unique()   
}
print('Horse: ' + '{:1.2f}'.format(len(horse_df['horse_name'].unique()) / len(unique_list['horse'])))
print('Trainer: ' + '{:1.2f}'.format(len(trainer_df['trainer_name'].unique()) / len(unique_list['trainer'])))
print('Jockey: ' + '{:1.2f}'.format(len(jockey_df['jockey_name'].unique()) / len(unique_list['jockey'])))

Horse: 1.00
Trainer: 0.97
Jockey: 0.99
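
A ratio of unique counts can look complete even when the two name sets differ, so it may help to list the names that lack a profile. The sketch below uses toy names. `coverage_report` is a hypothetical helper, and it assumes names match exactly between tables, which may not hold for the real data (the trainer profiles shown later carry prefixes such as `[東]`).

```python
def coverage_report(referenced, profiled):
    """Return the share of referenced names that have a profile, plus the missing names."""
    referenced, profiled = set(referenced), set(profiled)
    missing = sorted(referenced - profiled)  # names seen in races but absent from profiles
    ratio = len(referenced & profiled) / len(referenced)
    return ratio, missing

# Toy data for illustration only
ratio, missing = coverage_report(
    referenced=['Trainer A', 'Trainer B', 'Trainer C', 'Trainer D'],
    profiled=['Trainer A', 'Trainer B', 'Trainer C', 'Trainer X'],
)
print('{:1.2f}'.format(ratio), missing)
```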


### 1.2.1 Race Record
The following shows the attributes of the race record table along with some basic statistics. Grouping race records by run date, place and race number yields the total number of races held at one place on a given date.

In [5]:
# Snapshot of the race_record dataframe
print(race_df.shape)
race_df.describe().T

(864670, 31)


Unnamed: 0,count,unique,top,freq
run_date,864670,1846,2009-03-21,575
place,864670,10,東京,126818
race,864670,12,12R,74921
title,864670,2264,3歳未勝利,215679
type,864670,3,ダ,418502
track,864670,4,右,562201
distance,864670,80,1200m,183765
weather,864670,6,晴,539263
condition,864670,4,良,644475
time,864670,123,10:30,19697


The following supports the expectation that around 12 races are held at a single place on a race day for national racing events in Japan. A sample of 10 racing events is shown below.

In [6]:
# Ensure that (almost) all races on the same day at the same place have a count of 12
race_count = curs.execute('SELECT DISTINCT run_date, place, race from race_record').fetchall()
race_count_df = pd.DataFrame(race_count, columns=['run_date', 'place', 'race'])
race_count_df.groupby(['run_date', 'place']).count().sample(n=10)

run_date,place,race
2009-05-10,新潟,12
2012-04-14,福島,12
2010-05-22,京都,12
2008-11-01,京都,12
2003-02-22,京都,12
2009-11-01,東京,12
2005-05-14,新潟,12
2012-08-25,札幌,12
2001-06-23,福島,12
2009-10-17,東京,12
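
A random sample of 10 rows cannot show the exceptions. As a sketch on toy data (the venue names and counts below are made up), the full distribution of races per date and place can be tabulated, and the groups that deviate from 12 listed explicitly.

```python
import pandas as pd

# Toy data: one full race day with 12 races and one with only 11
demo = pd.DataFrame({
    'run_date': ['2009-05-10'] * 12 + ['2012-04-14'] * 11,
    'place': ['Niigata'] * 12 + ['Fukushima'] * 11,
    'race': ['%dR' % i for i in range(1, 13)] + ['%dR' % i for i in range(1, 12)],
})

# Number of distinct races per (run_date, place)
races_per_meeting = demo.groupby(['run_date', 'place'])['race'].nunique()

distribution = races_per_meeting.value_counts()          # how many race days have each count
exceptions = races_per_meeting[races_per_meeting != 12]  # race days that deviate from 12
print(distribution)
print(exceptions)
```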


### 1.2.2 Horse Record
A similar check is done for the horse record.

In [7]:
# Check the data columns
horse_df.sample(n=3)

Unnamed: 0,horse_id,horse_name,date_of_birth,trainer,owner,breeder,place_of_birth,transaction_price,prize_obtained,race_record,highlight_race,relatives,parents,status,gender,breed,offer_info
10600,1998110197,エイシンシャイアン,1998年4月8日,伊藤強一 (笠松),平井豊光,Shannon A. Wolfram & Fred Sietz,米,-,"9,668万円 (中央)",29戦6勝 [ 6-5-2-16 ],04'斑鳩S(1600万下),エイシンワシントン 、 エイシンリンカーン,000a000348 000a00968f,,牡,鹿毛,
34912,2004100242,テイエムフルパワー,2004年3月17日,柴田光陽 (栗東),竹園正繼,競優牧場,新冠町,-,"6,285万円 (中央)",58戦3勝 [ 3-3-6-46 ],08'3歳上500万下,テイエムシバスキー 、 テイエムライジン,1996100292 1999109041,抹消,牡,栗毛,
86195,2015105941,サトノグリッター,2015年5月16日,吉村圭司 (栗東),サトミホースカンパニー,千代田牧場,新ひだか町,-,350万円 (中央),2戦0勝 [ 0-1-1-0 ],,プレフェリート 、 レサンシエル,000a011a7f 000a011a6a,現役,牡3歳,鹿毛,


In [8]:
# Snapshot of the horse_record dataframe
print(horse_df.shape)
horse_df.describe().T

(86803, 17)


Unnamed: 0,count,unique,top,freq
horse_id,86803,86803,1999102024,1
horse_name,86803,85726,ビッグボス,3
date_of_birth,86803,3905,2002年4月3日,73
trainer,86803,2073,(地方),541
owner,86803,6512,サンデーレーシング,1304
breeder,86803,4265,ノーザンファーム,4743
place_of_birth,86803,107,浦河町,16125
transaction_price,86803,7774,-,68142
prize_obtained,86803,30894,0万円,14749
race_record,86803,22569,2戦0勝 [ 0-0-0-2 ],3643


### 1.2.3 Individual Record
A similar check is done for the individual record, which provides consolidated yearly results for each individual involved in horse racing events.

In [9]:
# Check the data columns
individual_df.sample(n=3)

Unnamed: 0,individual_id,individual_type,name,year,rank,first,second,third,out,races_major,...,wins_flat,races_grass,wins_grass,races_dirt,wins_dirt,wins_percent,wins_percent_2nd,wins_percent_3rd,prize_obtained,representative_horse
114615,779033,馬主,太田好則,累計,,0,0,0,3,1,...,0,2,0,1,0,0.0,0.0,0.0,0.0,
49322,72800,馬主,王蔵牧場,2004,739.0,1,0,1,3,0,...,1,2,0,3,1,0.2,0.2,0.4,815.0,ミズホユウセイ
99442,744800,馬主,北星村田牧場,2009,486.0,1,0,1,15,0,...,0,9,1,8,0,0.059,0.059,0.118,2115.0,スワン


In [10]:
# Snapshot of the individual_record dataframe
print(individual_df.shape)
individual_df.describe().T

(115196, 24)


Unnamed: 0,count,unique,top,freq
individual_id,115196,11782,148800,34
individual_type,115196,4,生産者,51903
name,115196,10403,シンボリ牧場,97
year,115196,34,累計,11782
rank,115196,1501,,11782
first,115196,510,0,49106
second,115196,507,0,51546
third,115196,510,0,49638
out,115196,1660,1,9704
races_major,115196,366,0,77310


### 1.2.4 Trainer Profile
A similar check is done for the trainer profiles, which list personal information for each trainer.

In [11]:
# Check the data columns
trainer_df.sample(n=3)

Unnamed: 0,trainer_id,trainer_name,date_of_birth,place_of_birth,first_run_date,first_run_horse,first_win_date,first_win_horse
884,5631,[外]モートン,1971/11/10,,,,,
9,405,[東]小西一男,1955/09/30,千葉県,1991/03/02,ミニヨン,1991/06/23,サロマブルー
109,338,[西]野元昭,1940/09/30,宮崎県,1980/10/04,ジョーバブーン,1980/10/18,ジョーソレムニス


In [12]:
# Snapshot of the trainer_profile dataframe
print(trainer_df.shape)
trainer_df.describe().T

(1021, 8)


Unnamed: 0,count,unique,top,freq
trainer_id,1021,1021,05635,1
trainer_name,1021,1016,[外]ダンロッ,2
date_of_birth,1021,987,1965/09/29,2
place_of_birth,1021,36,,799
first_run_date,1021,157,,798
first_run_horse,1021,221,,801
first_win_date,1021,210,,800
first_win_horse,1021,221,,801


### 1.2.5 Jockey Profile
A similar check is done for the jockey profiles, which list personal information for each jockey.

In [13]:
# Check the data columns
jockey_df.sample(n=3)

Unnamed: 0,jockey_id,jockey_name,date_of_birth,place_of_birth,blood_type,height,weight,first_flat_run_date,first_flat_run_horse,first_flat_win_date,first_flat_win_horse,first_obs_run_date,first_obs_run_horse,first_obs_win_date,first_obs_win_horse
507,888,ビードマ,1965/11/17,,,,,1992/11/28,ダンディアンバー,2005/12/03,ロードマジェスティ,,,,
95,1040,穂苅寿彦,1979/10/22,埼玉県,A型,164cm,47kg,1998/03/01,オーガストキング,1998/03/01,オーガストキング,1999/01/23,ライズノメ,2000/05/06,ダイワデュール
236,5100,斉藤誠,1962/03/07,,,,,1997/08/16,タマルファイター,1997/08/16,タマルファイター,,,,


In [14]:
# Snapshot of the jockey_profile dataframe
print(jockey_df.shape)
jockey_df.describe().T

(807, 15)


Unnamed: 0,count,unique,top,freq
jockey_id,807,807,05228,1
jockey_name,807,801,オドノヒ,2
date_of_birth,807,780,1990/02/26,2
place_of_birth,807,41,,573
blood_type,807,5,,578
height,807,25,,571
weight,807,17,,571
first_flat_run_date,807,425,,72
first_flat_run_horse,807,728,,72
first_flat_win_date,807,405,,355


## 1.3 Preprocessing
The following shows further preprocessing of the dataset, which revolves mainly around handling missing values in each column. Most columns contain no missing values, and some of the rest have either over 90% or under 1% missing; in those cases the column or the affected rows are simply dropped.
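
The dropping rule described above can be sketched as a single helper. This is an illustration under stated assumptions, not the notebook's actual procedure: the function name `drop_by_missing_share`, the `'null'` marker and both thresholds are parameters chosen here to mirror the 90% and 1% figures in the text.

```python
import pandas as pd

def drop_by_missing_share(df, marker='null', col_threshold=0.9, row_threshold=0.01):
    """Drop mostly-missing columns, then rows hit by rarely-missing columns (hypothetical helper)."""
    mask = df.eq(marker)
    share = mask.mean()  # fraction of marker cells per column

    # Columns that are almost entirely missing are removed outright
    sparse_cols = share[share > col_threshold].index
    df = df.drop(columns=sparse_cols)

    # For columns with a negligible share of missing cells, drop the affected rows
    minor_cols = share[(share > 0) & (share < row_threshold)].index
    keep = ~mask[minor_cols].any(axis=1)
    return df.loc[keep]

# Toy data: 'b' has 0.5% missing, 'c' has 95% missing
demo = pd.DataFrame({
    'a': ['x'] * 200,
    'b': ['null'] + ['y'] * 199,
    'c': ['null'] * 190 + ['z'] * 10,
})
cleaned = drop_by_missing_share(demo)
print(cleaned.shape)
```

Columns whose missing share falls between the two thresholds are left untouched, which matches the separate handling of the trainer and jockey profiles.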

### 1.3.1 Race Record

Only some of the later columns contain missing values, and their share is negligible, so the affected rows can be dropped without materially affecting the dataset.

In [15]:
# Check missing value
get_missing_value_perc(race_df)

Unnamed: 0,Missing Value,Missing Value (%)
run_date,0,0.00%
place,0,0.00%
race,0,0.00%
title,0,0.00%
type,0,0.00%
track,0,0.00%
distance,0,0.00%
weather,0,0.00%
condition,0,0.00%
time,0,0.00%


In [16]:
# Keep only rows that have no 'null' cell and a non-empty run time
race_df = race_df.loc[(race_df.applymap(lambda x: x == 'null').sum(axis=1) == 0) & (race_df['run_time'] != ''), :]

### 1.3.2 Horse Record

The last column, 'offer_info', can simply be dropped from the dataset because 98.39% of its values are missing.

In [17]:
# Check missing value
get_missing_value_perc(horse_df)

Unnamed: 0,Missing Value,Missing Value (%)
horse_id,0,0.00%
horse_name,0,0.00%
date_of_birth,0,0.00%
trainer,0,0.00%
owner,0,0.00%
breeder,0,0.00%
place_of_birth,0,0.00%
transaction_price,0,0.00%
prize_obtained,0,0.00%
race_record,0,0.00%


In [18]:
horse_df = horse_df.drop('offer_info', axis=1)

### 1.3.3 Individual Record

The individual records with missing values all belong to a single person, so those rows can simply be dropped from the table.

In [19]:
# Check missing value
get_missing_value_perc(individual_df)

Unnamed: 0,Missing Value,Missing Value (%)
individual_id,0,0.00%
individual_type,0,0.00%
name,0,0.00%
year,0,0.00%
rank,0,0.00%
first,0,0.00%
second,0,0.00%
third,0,0.00%
out,0,0.00%
races_major,0,0.00%


In [20]:
individual_df.loc[individual_df.applymap(lambda x: x == 'null').sum(axis=1) > 0]['name'].value_counts(ascending=False)[:5]

Series([], Name: name, dtype: int64)

In [21]:
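# Drop rows that contain any 'null' cell
individual_df = individual_df.loc[individual_df.applymap(lambda x: x == 'null').sum(axis=1) == 0, :]
# Drop the cumulative-total rows ('累計'), keeping only the per-year records
individual_df = individual_df.loc[individual_df['year'] != u'累計']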

### 1.3.4 Trainer & Jockey Profile

For place of birth, a trainer or jockey without a record is assumed to come from outside Tokyo, and the missing value is replaced with the placeholder '地方' ("regional"). For the other attributes, further feature engineering is likely to be more useful than treating them as missing for now. Attributes such as the first run date could, for example, be derived from the race record table, although that step is not part of this notebook's pipeline.

In [22]:
# Check missing value
get_missing_value_perc(trainer_df)

Unnamed: 0,Missing Value,Missing Value (%)
trainer_id,0,0.00%
trainer_name,0,0.00%
date_of_birth,0,0.00%
place_of_birth,799,78.26%
first_run_date,798,78.16%
first_run_horse,801,78.45%
first_win_date,800,78.35%
first_win_horse,801,78.45%


In [23]:
# Check missing value
get_missing_value_perc(jockey_df)

Unnamed: 0,Missing Value,Missing Value (%)
jockey_id,0,0.00%
jockey_name,0,0.00%
date_of_birth,0,0.00%
place_of_birth,573,71.00%
blood_type,578,71.62%
height,571,70.76%
weight,571,70.76%
first_flat_run_date,72,8.92%
first_flat_run_horse,72,8.92%
first_flat_win_date,355,43.99%


In [24]:
# Replace missing places of birth with the placeholder '地方' ("regional")
trainer_df['place_of_birth'] = trainer_df['place_of_birth'].apply(lambda x: x if x != 'null' else u'地方')
jockey_df['place_of_birth'] = jockey_df['place_of_birth'].apply(lambda x: x if x != 'null' else u'地方')
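
The idea of deriving a first run date from the race record can be sketched as a group-wise minimum over `run_date`. The data below is made up, and the column name `first_run_date_derived` is hypothetical. The result is the earliest appearance within the crawled period, which is not necessarily the true career debut.

```python
import pandas as pd

# Toy race records for illustration only
race_demo = pd.DataFrame({
    'run_date': ['2001-06-23', '2003-02-22', '2005-05-14', '2002-01-05'],
    'jockey': ['Rider A', 'Rider A', 'Rider B', 'Rider B'],
})

# Earliest recorded run date per jockey
first_run = (race_demo.assign(run_date=pd.to_datetime(race_demo['run_date']))
                      .groupby('jockey')['run_date'].min()
                      .rename('first_run_date_derived'))
print(first_run)
```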

Finally, the dataframes are written out as CSV files for further analysis.

In [25]:
race_df.to_csv('data/race.csv', encoding='utf-8')
horse_df.to_csv('data/horse.csv', encoding='utf-8')
individual_df.to_csv('data/individual.csv', encoding='utf-8')
trainer_df.to_csv('data/trainer.csv', encoding='utf-8')
jockey_df.to_csv('data/jockey.csv', encoding='utf-8')
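
One caveat when reading these files back: identifiers such as `trainer_id` contain leading zeros (e.g. `05635`), which `pd.read_csv` would parse as integers by default. The sketch below uses an in-memory buffer and made-up rows to show the effect and one way around it, namely passing `dtype` for the identifier columns.

```python
import io
import pandas as pd

# Toy data with zero-padded identifiers
trainer_demo = pd.DataFrame({'trainer_id': ['05635', '00405'],
                             'trainer_name': ['A', 'B']})

buffer = io.StringIO()
trainer_demo.to_csv(buffer, index=False)  # index=False avoids an extra unnamed index column

# Default parsing turns the identifiers into integers and loses the padding
default_read = pd.read_csv(io.StringIO(buffer.getvalue()))

# Declaring the column as str keeps the identifiers intact
string_read = pd.read_csv(io.StringIO(buffer.getvalue()), dtype={'trainer_id': str})

print(default_read['trainer_id'].tolist())
print(string_read['trainer_id'].tolist())
```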