# Data Scraping

In this exercise, we will scrape data from Hong Kong Jockey Club's race result page: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST

Libraries needed:
- Requests: http://docs.python-requests.org/en/master/ 
- Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Regular expression: https://docs.python.org/3.6/library/re.html

The ```Requests``` library fetches a page, which we then pass to ```Beautiful Soup``` to parse into a searchable structure. Regular expressions allow us to find specific parts of that structure by keyword match.
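
As a rough illustration of the parse-then-search step, the sketch below runs on a small made-up HTML snippet rather than a downloaded page, so it works offline. The tag contents and `href` values are invented for the example and are not taken from the HKJC site.

```python
import re
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for page.content
html = """
<ul>
  <li><a href="horse.asp?horseno=A001">ALPHA</a></li>
  <li><a href="jockeyprofile.asp?id=XY">J Doe</a></li>
  <li><a href="horse.asp?horseno=B002">BRAVO</a></li>
</ul>
"""

# Parse the text into a searchable tree
soup = BeautifulSoup(html, "html.parser")

# Keep only the tags whose href contains the keyword "horseno"
links = soup.find_all(href=re.compile("horseno"))
names = [link.text for link in links]

print(names)  # ['ALPHA', 'BRAVO']
```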

### A. Getting a Single Column of Data

Let's begin by fetching the names of the horses. We note that each horse's name is enclosed in an HTML ```<a>``` tag, with the term *horseno* contained in its hypertext reference.

<img src="http://www.ticoneva.com/econ/econ4130/images/chrome_horseno.png" style="border: 1px solid grey; width: 750px;">

In [2]:
import requests
import re
from bs4 import BeautifulSoup

#URL of data
url = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST"

#Access the page
page = requests.get(url)

#Load the page into BeautifulSoup
soup = BeautifulSoup(page.content,'html.parser')

#Find all tags with href containing "horseno"
#horses is a list of matched tags
horses = soup.find_all(href=re.compile("horseno"))

#Print the result
for horse in horses:
    print(horse.text)

JOLLY JOLLY
PEOPLE'S KNIGHT
RUN FORREST
MODERN TSAR
MAGNETISM
ENORMOUS HONOUR
HAPPY JOURNEY
WINGOLD
OVETT
PAKISTAN BABY
SUPER FLUKE
JUN GONG
LAUGH OUT LOUD
TEN SPEED
JOLLY JOLLY


Notice that JOLLY JOLLY appears twice in the output above. Let's check the source of the webpage to understand why.
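
One way to investigate a duplicate without leaving Python is to print the tag that encloses each match. The sketch below uses invented HTML in which a name appears once in the table and once in a separate summary line. Whether the real page is laid out this way is an assumption that should be checked in the browser.

```python
import re
from bs4 import BeautifulSoup

# Made-up page: ALPHA is linked in the table and again in a summary paragraph
html = """
<table>
  <tr><td><a href="horse.asp?horseno=A001">ALPHA</a></td><td>J Doe</td></tr>
  <tr><td><a href="horse.asp?horseno=B002">BRAVO</a></td><td>A Smith</td></tr>
</table>
<p class="summary">Winner: <a href="horse.asp?horseno=A001">ALPHA</a></p>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(href=re.compile("horseno"))

# Record the name of the tag one layer above each match
contexts = [(m.text, m.parent.name) for m in matches]

for name, parent in contexts:
    print(name, "is inside <" + parent + ">")
```

Here the third match sits inside a ```<p>``` rather than a ```<td>```, which marks it as the odd one out.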

We can do the same for jockeys, noting that each jockey name is enclosed in a ```<a>``` tag with hypertext reference containing the term *jockeyprofile.asp*.

In [6]:
#Find all tags with href containing "jockeyprofile.asp" 
jockeys = soup.find_all(href=re.compile("jockeyprofile.asp"))

#Print the result
for jockey in jockeys:
    print(jockey.text)

K Teetan
T Berry
J Moreira
B Prebble
G Lerena
N Rawiller
H W Lai
M L Yeung
H N Wong
D Whyte
M Demuro
C Y Ho
G Mosse
Y T Cheng


Finally, we can also match by the class of the ```<td>``` tag one layer up. This would return horse names, jockey names and trainer names.

<img src="http://www.ticoneva.com/econ/econ4130/images/chrome_td_class.png" style="border: 1px solid grey; width: 750px;">

In [8]:
data = soup.find_all("td",class_="tdAlignL font13 fontStyle")

for d in data:
    print(d.text)

JOLLY JOLLY(T087)
K Teetan
P O'Sullivan
PEOPLE'S KNIGHT(T305)
T Berry
J Moore
RUN FORREST(T176)
J Moreira
C S Shum
MODERN TSAR(S167)
B Prebble
W Y So
MAGNETISM(V114)
G Lerena
D E Ferraris
ENORMOUS HONOUR(T236)
N Rawiller
Y S Tsui
HAPPY JOURNEY(S299)
H W Lai
S Woods
WINGOLD(T202)
M L Yeung
A Lee
OVETT(P351)
H N Wong
A T Millard
PAKISTAN BABY(S442)
D Whyte
A S Cruz
SUPER FLUKE(T382)
M Demuro
D Cruz
JUN GONG(N325)
C Y Ho
C H Yip
LAUGH OUT LOUD(P297)
G Mosse
K L Man
TEN SPEED(T239)
Y T Cheng
C W Chang
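
The horse cells in this output carry a code in parentheses after the name. If the two parts are needed separately, a regular expression can split them. This is a minimal sketch: the helper name `split_horse` is hypothetical, and the pattern assumes the code consists only of capital letters and digits, as in the output above.

```python
import re

def split_horse(label):
    """Split 'NAME(CODE)' into (name, code); return (label, None) if no code is found."""
    match = re.fullmatch(r"\s*(.+?)\s*\(([A-Z0-9]+)\)\s*", label)
    if match is None:
        return label.strip(), None
    return match.group(1), match.group(2)

print(split_horse("JOLLY JOLLY(T087)"))      # ('JOLLY JOLLY', 'T087')
print(split_horse("PEOPLE'S KNIGHT(T305)"))  # ("PEOPLE'S KNIGHT", 'T305')
print(split_horse("K Teetan"))               # ('K Teetan', None)
```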


### B. Fetching Adjacent Fields
Let's now try fetching the jockeys' and trainers' names, having first located the horse names.

<img src="http://www.ticoneva.com/econ/econ4130/images/chrome_horseno_sibling.png" style="border: 1px solid grey; width: 750px;">

In [9]:
#Find all tags with href containing "horseno"
#This is the horses variable

#Loop through each horse and find the jockey and trainer along the way
for horse in horses:
    jockey = horse.parent.next_sibling
    trainer = jockey.next_sibling
    
    print(horse.text, jockey.text, trainer.text)

JOLLY JOLLY K Teetan P O'Sullivan
PEOPLE'S KNIGHT T Berry J Moore
RUN FORREST J Moreira C S Shum
MODERN TSAR B Prebble W Y So
MAGNETISM G Lerena D E Ferraris
ENORMOUS HONOUR N Rawiller Y S Tsui
HAPPY JOURNEY H W Lai S Woods
WINGOLD M L Yeung A Lee
OVETT H N Wong A T Millard
PAKISTAN BABY D Whyte A S Cruz
SUPER FLUKE M Demuro D Cruz
JUN GONG C Y Ho C H Yip
LAUGH OUT LOUD G Mosse K L Man
TEN SPEED Y T Cheng C W Chang


AttributeError: 'NavigableString' object has no attribute 'text'

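Note the error at the end. It is caused by the separate ```JOLLY JOLLY``` entry, which has no jockey and trainer information next to it: one of the sibling lookups returns a plain piece of text (a ```NavigableString```) instead of a tag, and that object has no ```text``` attribute.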

Is there a way to get around this problem? We could first locate the jockey's name, before fetching the horse name and the trainer's name relative to it.

<img src="http://www.ticoneva.com/econ/econ4130/images/chrome_jockey_sibling.png" style="border: 1px solid grey; width: 750px;">

In [12]:
#Use "jockeyprofile" instead
#We already have the jockeys list from above

for jockey in jockeys:
    horse = jockey.parent.previous_sibling
    trainer = jockey.parent.next_sibling
    actual_weight = trainer.next_sibling
    declare_weight = actual_weight.next_sibling
    
    print(horse.text,
          jockey.text,
          trainer.text,
          actual_weight.text,
          declare_weight.text)


JOLLY JOLLY(T087) K Teetan P O'Sullivan 114 1214
PEOPLE'S KNIGHT(T305) T Berry J Moore 119 1163
RUN FORREST(T176) J Moreira C S Shum 115 1135
MODERN TSAR(S167) B Prebble W Y So 123 1101
MAGNETISM(V114) G Lerena D E Ferraris 125 1130
ENORMOUS HONOUR(T236) N Rawiller Y S Tsui 131 1127
HAPPY JOURNEY(S299) H W Lai S Woods 114 1040
WINGOLD(T202) M L Yeung A Lee 111 1154
OVETT(P351) H N Wong A T Millard 105 1153
PAKISTAN BABY(S442) D Whyte A S Cruz 121 1023
SUPER FLUKE(T382) M Demuro D Cruz 120 1109
JUN GONG(N325) C Y Ho C H Yip 115 1147
LAUGH OUT LOUD(P297) G Mosse K L Man 126 1127
TEN SPEED(T239) Y T Cheng C W Chang 116 1033
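
The sibling lookups above rely on adjacent cells being direct neighbours in the parse tree. When the HTML has line breaks or spaces between tags, ```next_sibling``` can return that whitespace instead. A possible alternative, sketched here on invented HTML, is Beautiful Soup's ```find_next_sibling``` and ```find_previous_sibling```, which skip ahead to the next tag of a given name.

```python
from bs4 import BeautifulSoup

# Made-up row with whitespace between the cells
html = """
<table>
  <tr>
    <td>ALPHA(A001)</td>
    <td><a href="jockeyprofile.asp?id=XY">J Doe</a></td>
    <td>T Trainer</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
jockey = soup.find("a")
cell = jockey.parent

# The immediate neighbour is the whitespace between the tags, not a <td>
raw_next = cell.next_sibling
print(type(raw_next).__name__)  # NavigableString

# These calls skip the whitespace and return the nearest <td> tags
trainer = cell.find_next_sibling("td")
horse = cell.find_previous_sibling("td")

print(horse.text, jockey.text, trainer.text)  # ALPHA(A001) J Doe T Trainer
```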


### C. Multiple Pages

Most of the time we need more than one page. We can iterate over the pages with one or more for loops.

Before we go there, let's write a helper function that returns the content we want from each page in a list:

In [15]:
import requests
import re
from bs4 import BeautifulSoup

def scrape_horses(url):
    #Function to access a page and save all horses into a list
    #If login is required, try requests.get(url,auth=('user', 'pass'))
    page = requests.get(url)

    #Load the page into BeautifulSoup
    soup = BeautifulSoup(page.content, 'html.parser')

    #Find all tags with href containing "jockeyprofile"
    jockeys = soup.find_all(href=re.compile("jockeyprofile"))

    #output_list is the whole table
    #output is a single row
    output_list = []
    
    #We go through jockey names as before
    for jockey in jockeys:

        #Get the horse name
        horse = jockey.parent.previous_sibling
        output = [horse.text]
        output.append(jockey.text)

        #This while loop fetches all remaining fields in a row
        a = jockey.parent.next_sibling
        while a is not None:
            try:
                output.append(a.text)
            except AttributeError:
                #Skip siblings that have no .text attribute
                pass
            a = a.next_sibling
            
        output_list.append(output)

    return output_list
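
A variation worth considering is to separate the parsing from the downloading, so the parsing logic can be tried on a string without touching the network. The sketch below is one way to do that, not a replacement for the function above. The name `parse_rows` and the sample HTML are invented for illustration. It also swaps the try/except for an explicit check that each sibling is a ```Tag```.

```python
import re
from bs4 import BeautifulSoup, Tag

def parse_rows(html):
    """Return one list of cell texts for each row that contains a jockey link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for jockey in soup.find_all(href=re.compile("jockeyprofile")):
        horse = jockey.parent.previous_sibling
        if not isinstance(horse, Tag):
            # No horse cell directly before the jockey cell: skip this match
            continue
        row = [horse.text, jockey.text]

        # Walk through the remaining siblings, keeping tags only
        a = jockey.parent.next_sibling
        while a is not None:
            if isinstance(a, Tag):
                row.append(a.text)
            a = a.next_sibling
        rows.append(row)
    return rows

# Made-up single-row table, written without whitespace between tags
sample = (
    "<table><tr>"
    "<td>ALPHA(A001)</td>"
    '<td><a href="jockeyprofile.asp?id=XY">J Doe</a></td>'
    "<td>T Trainer</td><td>120</td>"
    "</tr></table>"
)

rows = parse_rows(sample)
print(rows)  # [['ALPHA(A001)', 'J Doe', 'T Trainer', '120']]
```

With this split, a function like ```scrape_horses(url)``` would only need to download the page and hand ```page.content``` to the parser.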

Here are the loops. Note that the month and day in the URL are always written with two digits.

String formatting: https://docs.python.org/3.4/library/string.html#format-string-syntax
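
As a quick illustration of the zero-padding, the format specification ```02d``` pads an integer with leading zeros to a width of two. The sketch below shows the ```str.format``` form alongside the equivalent f-string, which is available from Python 3.6. The variable names are only for the example.

```python
month, day = 3, 7

# str.format with a width-2, zero-padded integer field
with_format = "{:02d}".format(month) + "{:02d}".format(day)

# The same specification inside an f-string (Python 3.6+)
with_fstring = f"{month:02d}{day:02d}"

# Several fields can share one template
date_part = "{}{:02d}{:02d}".format(2016, month, day)

print(with_format)   # 0307
print(with_fstring)  # 0307
print(date_part)     # 20160307
```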


In [16]:
#URL of data
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

#Write a loop to go through year, month and day
#Note that month and day are always 2 digits
#Call scrape_horses() in each iteration
for year in range(2016,2018):
    for month in range(1,13):
        for day in range(1,32):
            
            #Convert month and day to 2-digit representation
            month_2d = '{:02d}'.format(month)
            day_2d = '{:02d}'.format(day)
            
            url = url_front + str(year) + month_2d + day_2d
            
            print(url)
            print(scrape_horses(url))
            

http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160101
[['DOUBLE POINT(S246)', 'N Callan', 'C Fownes', '123', '990', '3', '-', '4331', '1.23.30', '6.4'], ['BE THERE AHEAD(S193)', 'K C Leung', 'L Ho', '121', '1045', '10', 'HD', '1010102', '1.23.33', '8.1'], ['AMBITIOUS SPEEDY(T063)', 'H W Lai', 'A Lee', '132', '1172', '14', 'N', '8793', '1.23.35', '10'], ['JIMSON THE FAMOUS(T253)', 'Z Purton', 'C H Yip', '130', '1031', '5', '1/2', '1114', '1.23.38', '17'], ['HOLY STAR(T068)', 'J Moreira', 'D J Hall', '116', '1018', '7', '1-3/4', '5875', '1.23.60', '2.2'], ['GOLDWEAVER(P072)', 'C Y Ho', 'Y S Tsui', '129', '1091', '6', '2', '2226', '1.23.64', '8'], ['GROOVY(L401)', 'N Rawiller', 'T K Ng', '131', '1165', '2', '2-3/4', '7557', '1.23.73', '51'], ['TRIBAL GLORY(S395)', 'G Mosse', 'P F Yiu', '125', '1131', '8', '3', '11988', '1.23.78', '99'], ['CIRCUIT STAR(N220)', 'H N Wong', 'K L Man', '122', '1073', '4', '3-1/2', '66610', '1.23.87', '87'], ['EVER SHINY(S214)', 'C Schofiel
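
The nested loops above also generate dates that do not exist, such as 20160231, and each one still costs a request. One possible alternative, shown here as a sketch, is to let the standard ```datetime``` module step through real calendar days. The helper name `date_strings` is hypothetical.

```python
from datetime import date, timedelta

def date_strings(start, end):
    """Yield YYYYMMDD strings for every calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.strftime("%Y%m%d")
        current += timedelta(days=1)

# 2016 is a leap year, so 29 February is included
days = list(date_strings(date(2016, 2, 27), date(2016, 3, 1)))
print(days)  # ['20160227', '20160228', '20160229', '20160301']
```

Each string could then be appended to ```url_front``` in place of ```str(year) + month_2d + day_2d```.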

### D. Saving data to file

Most of the time we want to save the data for future use. The most common method is to save the data in a CSV file, a format that is supported by virtually all data analysis software.

Package needed:
- CSV file reading and writing: https://docs.python.org/3.6/library/csv.html

The basic syntax of saving into a CSV file is:

In [None]:
filepath = "data/temp.csv"
content = [[1,"ha","abc"]]

import csv
with open(filepath, 'w', newline='') as csvfile:
    mywriter = csv.writer(csvfile)
    mywriter.writerows(content)
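
To check what was written, the file can be read back with ```csv.reader```. The sketch below also writes a header row first. The column names and sample rows are made up, and a temporary folder is used so the example does not depend on a ```data``` directory existing.

```python
import csv
import os
import tempfile

# Made-up rows and column names for illustration
header = ["horse", "jockey", "actual_weight"]
rows = [["ALPHA(A001)", "J Doe", 120], ["BRAVO(B002)", "A Smith", 118]]

# Write into a fresh temporary folder
folder = tempfile.mkdtemp()
filepath = os.path.join(folder, "temp.csv")

with open(filepath, "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)   # one row
    writer.writerows(rows)    # many rows

# Read the file back as a list of rows
with open(filepath, newline="") as csvfile:
    loaded = list(csv.reader(csvfile))

print(loaded[0])  # ['horse', 'jockey', 'actual_weight']
print(loaded[1])  # ['ALPHA(A001)', 'J Doe', '120']
```

Note that numbers come back as strings, so they need converting before any arithmetic.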

Now we will incorporate file-saving into our loop:

In [20]:
#The first part of the URL of data source
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

#Copy the loop from above and incorporate the csv-saving code
for year in range(2016,2018):
    for month in range(1,13):
        for day in range(1,32):
            
            #Convert month and day to 2-digit representation
            month_2d = '{:02d}'.format(month)
            day_2d = '{:02d}'.format(day)
            
            #Full URL of data source
            url = url_front + str(year) + month_2d + day_2d
            
            #Print the URL so we know the progress so far
            print("Trying:",url)
            
            #Call our function to fetch and process data given the URL
            content = scrape_horses(url)
            
            #Only save if there is something in content
            if len(content) > 0:
                filepath = str(year)+month_2d+day_2d+".csv"
                
                #This part is just standard CSV-writing code
                import csv
                with open(filepath, 'w', newline='') as csvfile:
                    mywriter = csv.writer(csvfile)
                    mywriter.writerows(content)   
                    print(filepath,"saved.")

Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160101
20160101.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160102
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160103
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160104
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160105
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160106
20160106.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160107
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160108
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160109
20160109.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160110
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160111
Trying: http://racing.hkjc.com/racing/

### E. Exercise
How can we get the data for different races? In particular, how should we handle the racecourse code in the URL (for example, the ```ST``` at the end of the first URL in this notebook)?
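
As a starting point only, the URL could be assembled from its parts by a small helper. In the sketch below, the function name `build_url` is hypothetical. The date-then-racecourse pattern is taken from the first URL in this notebook, but appending a race number as a further path segment is an assumption that should be verified against the site before relying on it.

```python
URL_FRONT = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

def build_url(date_string, venue=None, race_no=None):
    """Join the date, an optional racecourse code and an optional race number."""
    parts = [date_string]
    if venue is not None:
        parts.append(venue)
        if race_no is not None:
            # Assumption: the race number follows the racecourse code
            parts.append(str(race_no))
    return URL_FRONT + "/".join(parts)

print(build_url("20151213", "ST"))
print(build_url("20160101"))
```

A loop over candidate racecourse codes and race numbers could then call ```scrape_horses()``` on each URL and keep only the non-empty results, as in section D.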