# Data Scraping
Version: 2025-10-3

In this exercise, we will scrape data from Hong Kong Jockey Club's race result page: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST

Libraries needed:
- For downloading files, `urllib.request`: https://docs.python.org/3.8/library/urllib.request.html
- For static webpages, `requests`: http://docs.python-requests.org/en/master/
- For dynamic webpages, `selenium`: https://selenium-python.readthedocs.io/
- For parsing the webpage, `BeautifulSoup`: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- For regular expressions, `re`: https://docs.python.org/3.6/library/re.html


### A. Downloading Files

The syntax for `urlretrieve` is:
```python
urllib.request.urlretrieve(url, filename)
```
This saves the file fetched from `url` as `filename`. 

In [1]:
# Download a file using urlretrieve
import urllib.request
urllib.request.urlretrieve("https://scrp.econ.cuhk.edu.hk/workshops/stata-workshop/stata-workshop-handout.pdf", "handout.pdf")

('handout.pdf', <http.client.HTTPMessage at 0x7fb120103290>)

### B. Fetching the Webpage

To scrape a website, we first use ```requests``` or ```selenium``` to access a page, which we then pass to ```BeautifulSoup``` to parse into a searchable structure. Regular expressions allow us to find specific parts of the structure by keyword match.

We will begin by fetching the webpage. Because HKJC has switched to a dynamic page built with JavaScript and AJAX, we will use `selenium`, which loads the webpage through an actual browser.

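Selenium needs an interface, called *WebDriver*, to control the browser. Recent Selenium releases (4.6 and later) locate a suitable driver automatically, which is what the code below relies on. With older releases, the `webdriver_manager` library can be used to locate the correct driver for our choice of browser.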

After loading the webpage, we can fetch the webpage's source with `driver.page_source`.

We should close the browser with `driver.quit()` once we have the page source to free resources taken by the browser. 

In [2]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Set up selenium to use Firefox
options = Options()
options.add_argument('-headless') #No need to open a browser window
driver = webdriver.Firefox(options=options)

# Fetch the page
driver.get('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')

# Make a copy of the page source
page_source = driver.page_source

# Once we have a copy of the page source,
# we can close the browser and clear out Selenium
driver.quit()

You can also use Chrome instead:

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set up selenium to use Chrome
options = Options()
options.add_argument('--headless=new') #No need to open a browser window
driver = webdriver.Chrome(options=options)

# Fetch the page
driver.get('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')

# Make a copy of the page source
page_source = driver.page_source

# Once we have a copy of the page source,
# we can close the browser and clear out Selenium
driver.quit()

If the page is static, we can use `requests`, which does not require a browser to work. We can fetch the webpage's source with `page.text`.


In [4]:
# Only works with static content
import requests

# URL of data
url = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST"

# Access the page
page = requests.get(url)

# Make a copy of the page source
page_source = page.text

Once we fetch the website, we will pass it to `BeautifulSoup` for parsing:

In [5]:
from bs4 import BeautifulSoup

# Load the page into BeautifulSoup
# 'page_source' is the copy made above, from either Selenium or requests
soup = BeautifulSoup(page_source,'html.parser')

### C. Getting a Table

If the data you need is in a table, the easiest way to scrape it is to:
1. Find the table based on certain characteristics, either by a name, an ID or class.
2. Loop through the rows in the table.
3. Loop through each cell in a row and extract its content.
   

In [6]:
data = [] 

# Find the table
table = soup.find("table", class_="f_tac")

# Loop through all rows
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    # Loop through all cells in a row
    row = [cell.get_text(strip=True) for cell in cells]
    data.append(row)

# Show content
data[0:2]

[['Pla.',
  'Horse No.',
  'Horse',
  'Jockey',
  'Trainer',
  'Act. Wt.',
  'Declar. Horse Wt.',
  'Dr.',
  'LBW',
  'RunningPosition',
  'Finish Time',
  'Win Odds'],
 ['1',
  '10',
  'JOLLY JOLLY(T087)',
  'K Teetan',
  "P O'Sullivan",
  '114',
  '1214',
  '13',
  '-',
  '1111',
  '1:22.05',
  '2.6']]
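
The first row returned is the header row, so the remaining rows can be turned into dictionaries keyed by column name. A minimal sketch, using the two rows shown above as sample data (the name `records` is ours, not part of the original code):

```python
# Sample rows copied from the output above; the first row is the header
data = [
    ['Pla.', 'Horse No.', 'Horse', 'Jockey', 'Trainer', 'Act. Wt.',
     'Declar. Horse Wt.', 'Dr.', 'LBW', 'RunningPosition',
     'Finish Time', 'Win Odds'],
    ['1', '10', 'JOLLY JOLLY(T087)', 'K Teetan', "P O'Sullivan", '114',
     '1214', '13', '-', '1111', '1:22.05', '2.6'],
]

# Split the header from the data rows
header, *rows = data

# Pair each cell with its column name
records = [dict(zip(header, row)) for row in rows]

print(records[0]['Horse'], records[0]['Win Odds'])
```

Looking up fields by name rather than by position makes later code less sensitive to changes in column order.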

### D. Getting a Single Column of Data

If the data is not nicely formatted as an HTML table, we can locate it directly.

Let's begin by fetching the names of the horses. Note that each horse's name is enclosed in an HTML ```<a>``` tag, with the term *HorseId* contained in its hypertext reference.

<img src="../Images/webscraping-2020/HorseId.png" style="border: 1px solid grey; width: 750px;">

In [7]:
from bs4 import BeautifulSoup
import re

# Find all tags with href containing "HorseId"
# horses is a list of matched tags
horses = soup.find_all(href=re.compile("HorseId"))

# Print the result
for horse in horses:
    print(horse.text)

JOLLY JOLLY
PEOPLE'S KNIGHT
RUN FORREST
MODERN TSAR
MAGNETISM
ENORMOUS HONOUR
HAPPY JOURNEY
WINGOLD
OVETT
PAKISTAN BABY
SUPER FLUKE
JUN GONG
LAUGH OUT LOUD
TEN SPEED


We can do the same for jockeys, noting that each jockey name is enclosed in a ```<a>``` tag with hypertext reference containing the term *JockeyProfile.aspx*.

In [8]:
# Find all tags with href containing "JockeyProfile.aspx" 
# The dot is escaped so that it matches a literal "."
jockeys = soup.find_all(href=re.compile(r"JockeyProfile\.aspx"))

# Print the result
for jockey in jockeys:
    print(jockey.text)

K Teetan
J Moreira
M L Yeung
C Y Ho
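
When `find_all` is given a compiled pattern, a tag is kept if the pattern is found anywhere in the attribute value, so a short keyword is enough. The sketch below illustrates that behaviour with `re` alone. The three `href` strings are made up for illustration and are not taken from the HKJC page. It also shows that an unescaped `.` in a pattern matches any character, so escaping it gives a stricter match.

```python
import re

# Hypothetical href values, for illustration only
hrefs = [
    "/racing/JockeyProfile.aspx?JockeyCode=ABC",
    "/racing/JockeyProfileXaspx",
    "/racing/TrainerProfile.aspx?TrainerCode=XYZ",
]

loose = re.compile("JockeyProfile.aspx")      # '.' matches any character
strict = re.compile(r"JockeyProfile\.aspx")   # '.' matches a literal dot

# search() looks for the pattern anywhere in the string
loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(len(loose_hits), len(strict_hits))
```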


Finally, we can also match by the class of the ```<td>``` tag one layer up. This would return horse names, jockey names and trainer names.

<img src="../Images/webscraping-2020/HorseClass.png" style="border: 1px solid grey; width: 750px;">

In [9]:
data = soup.find_all("td",class_="f_fs13 f_tal")

for d in data:
    print(d.text.strip())

JOLLY JOLLY (T087)
K Teetan
P O'Sullivan
PEOPLE'S KNIGHT (T305)
T Berry
J Moore
RUN FORREST (T176)
J Moreira
C S Shum
MODERN TSAR (S167)
B Prebble
W Y So
MAGNETISM (V114)
G Lerena
D E Ferraris
ENORMOUS HONOUR (T236)
N Rawiller
Y S Tsui
HAPPY JOURNEY (S299)
H W Lai
S Woods
WINGOLD (T202)
M L Yeung
A Lee
OVETT (P351)
H N Wong
A T Millard
PAKISTAN BABY (S442)
D Whyte
A S Cruz
SUPER FLUKE (T382)
M Demuro
D Cruz
JUN GONG (N325)
C Y Ho
C H Yip
LAUGH OUT LOUD (P297)
G Mosse
K L Man
TEN SPEED (T239)
Y T Cheng
C W Chang
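
Because this match returns one flat list, the horse, jockey and trainer of each runner have to be regrouped. A minimal sketch, assuming every runner contributes exactly three cells in that order, as in the output above. The sample list and the names `cells` and `runners` are ours:

```python
import re

# First six entries of the output above
cells = [
    "JOLLY JOLLY (T087)", "K Teetan", "P O'Sullivan",
    "PEOPLE'S KNIGHT (T305)", "T Berry", "J Moore",
]

runners = []
# Step through the flat list three cells at a time
for i in range(0, len(cells), 3):
    horse, jockey, trainer = cells[i:i + 3]
    # Separate the horse name from the code in parentheses
    m = re.fullmatch(r"(.+?)\s*\((\w+)\)", horse)
    name, code = m.groups() if m else (horse, None)
    runners.append((name, code, jockey, trainer))

for runner in runners:
    print(runner)
```

If a page ever yields a cell count that is not a multiple of three, the unpacking fails loudly, which is preferable to silently misaligning names.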


### E. Fetching Adjacent Fields
Let's now try fetching the jockeys' and trainers' names, having first located the horse names.

<img src="../Images/webscraping-2020/HorseId_siblings.png" style="border: 1px solid grey; width: 750px;">

In [10]:
# Loop through each horse and find the jockey and trainer along the way
from bs4 import NavigableString

for horse in horses:
    
    # jockey is supposed to be horse.parent.next_sibling
    jockey = horse.parent.next_sibling
       
    # But there is whitespace between tags, which BeautifulSoup picks 
    # up as 'NavigableString'. We use a while loop to keep moving when 
    # we encounter such cases
    while isinstance(jockey, NavigableString):
        jockey = jockey.next_sibling
            
    # Now do the same to find trainer            
    trainer = jockey.next_sibling
    while isinstance(trainer, NavigableString):
        trainer = trainer.next_sibling
    
    # Print what we find
    print(horse.text.strip().ljust(20),
          jockey.text.strip().ljust(15),
          trainer.text.strip())

JOLLY JOLLY          K Teetan        P O'Sullivan
PEOPLE'S KNIGHT      T Berry         J Moore
RUN FORREST          J Moreira       C S Shum
MODERN TSAR          B Prebble       W Y So
MAGNETISM            G Lerena        D E Ferraris
ENORMOUS HONOUR      N Rawiller      Y S Tsui
HAPPY JOURNEY        H W Lai         S Woods
WINGOLD              M L Yeung       A Lee
OVETT                H N Wong        A T Millard
PAKISTAN BABY        D Whyte         A S Cruz
SUPER FLUKE          M Demuro        D Cruz
JUN GONG             C Y Ho          C H Yip
LAUGH OUT LOUD       G Mosse         K L Man
TEN SPEED            Y T Cheng       C W Chang


Because we are going to need to deal with whitespace very often, let us first write a function that runs the while loop for us:

In [11]:
from bs4 import NavigableString

def get_sibling(tag,previous=False):
    if previous:
        sibling = tag.previous_sibling
        while isinstance(sibling, NavigableString):
            sibling = sibling.previous_sibling
    else:
        sibling = tag.next_sibling
        while isinstance(sibling, NavigableString):
            sibling = sibling.next_sibling        
    return sibling
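
To see why the helper is needed, the sketch below runs it on a small hand-written table row (not the real HKJC markup). The line breaks between the `<td>` tags show up as `NavigableString` siblings, which `get_sibling` skips. The function is repeated in a condensed form so that the block runs on its own.

```python
from bs4 import BeautifulSoup, NavigableString

def get_sibling(tag, previous=False):
    # Same behaviour as the helper above, repeated for self-containment
    sibling = tag.previous_sibling if previous else tag.next_sibling
    while isinstance(sibling, NavigableString):
        sibling = sibling.previous_sibling if previous else sibling.next_sibling
    return sibling

# Hand-written sample markup, for illustration only
html = """<table><tr>
<td><a href="Horse.aspx?HorseId=X1">HORSE A</a></td>
<td>Jockey A</td>
<td>Trainer A</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("a").parent

# The raw next sibling is the whitespace between the tags
print(repr(cell.next_sibling))

jockey = get_sibling(cell)
trainer = get_sibling(jockey)
print(jockey.text, "|", trainer.text)

# Nothing is left after the last cell, so the helper returns None
print(get_sibling(trainer))
```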

Now we can loop through all horses and fetch other fields relative to them:

In [12]:
# Fetch the fields that follow each horse's cell
for horse in horses:
    jockey = get_sibling(horse.parent)
    trainer = get_sibling(jockey)
    actual_weight = get_sibling(trainer)
    declare_weight = get_sibling(actual_weight)

    print(horse.text.strip().ljust(20),
          jockey.text.strip().ljust(15),
          trainer.text.strip().ljust(15),
          actual_weight.text.strip().ljust(15),
          declare_weight.text.strip())


JOLLY JOLLY          K Teetan        P O'Sullivan    114             1214
PEOPLE'S KNIGHT      T Berry         J Moore         119             1163
RUN FORREST          J Moreira       C S Shum        115             1135
MODERN TSAR          B Prebble       W Y So          123             1101
MAGNETISM            G Lerena        D E Ferraris    125             1130
ENORMOUS HONOUR      N Rawiller      Y S Tsui        131             1127
HAPPY JOURNEY        H W Lai         S Woods         114             1040
WINGOLD              M L Yeung       A Lee           111             1154
OVETT                H N Wong        A T Millard     105             1153
PAKISTAN BABY        D Whyte         A S Cruz        121             1023
SUPER FLUKE          M Demuro        D Cruz          120             1109
JUN GONG             C Y Ho          C H Yip         115             1147
LAUGH OUT LOUD       G Mosse         K L Man         126             1127
TEN SPEED            Y T Cheng       C

More generally, we can use a while loop to fetch adjacent fields until 
there is nothing left to fetch:

In [13]:
# 'data' is the whole table, 'row' is a single row
data = []

# Loop through horses
for horse in horses:

    # Get the horse name
    row = [horse.text.strip()]
    
    # This while loop fetches all remaining fields in a row
    a = get_sibling(horse.parent)
    while a is not None:
        row.append(a.text
                      .strip()
                      # The next two lines are for running positions
                      .replace('\n','') 
                      .replace(' '*20,' ') 
                     )
        a = get_sibling(a)
    
    # Append each row to the output list
    data.append(row)

data[0:2]

[['JOLLY JOLLY',
  'K Teetan',
  "P O'Sullivan",
  '114',
  '1214',
  '13',
  '-',
  '1                  \r1                  \r1                  \r1',
  '1:22.05',
  '2.6'],
 ["PEOPLE'S KNIGHT",
  'T Berry',
  'J Moore',
  '119',
  '1163',
  '8',
  '2',
  '2                  \r4                  \r2                  \r2',
  '1:22.39',
  '5.7']]
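
The running-position field in the output above still carries carriage returns and runs of spaces. One way to normalise it, offered as a sketch rather than a change to the loop above, is to split on any whitespace and re-join. The helper name `squash_whitespace` is ours:

```python
def squash_whitespace(text):
    # str.split() with no argument splits on any run of whitespace,
    # including spaces, '\n' and '\r'
    return " ".join(text.split())

# Value copied from the output above
raw = '2                  \r4                  \r2                  \r2'

print(squash_whitespace(raw))
```

This keeps the individual positions separated by a single space, so they can later be split into a list if needed.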

### F. Multiple Pages

Most of the time we need more than one page. We can go through pages with for loop(s).

Before we go there, let's write a helper function that returns the content we want from each page in a list. Because we are going to load multiple pages consecutively, we need a way to ensure each page is loaded before we move on to the next. For this we need `WebDriverWait`, which allows Selenium to wait for certain conditions to be true before moving on.

In [14]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import re

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


def scrape_horses(url):
    # Function to access a page and save all horses into a list.
    # Relies on a global 'driver' created before the function is called.

    # Fetch the page
    driver.get(url)
    
    # Is there anything?
    if "No information." in driver.page_source:
        return []
    
    # Wait up to 30 secs so that the dynamic content has time to load.
    # Return an empty list if the content does not appear in time.
    try:
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CLASS_NAME, "f_fs13")))
    except TimeoutException:
        return []
    
    # Load the page into BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Here we can use whatever method that fetches what we need
    data = [] 
    table = soup.find("table", class_="f_tac")
    # Return an empty list if the results table is missing
    if table is None:
        return []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        # Loop through all cells in a row
        row = [cell.get_text(strip=True) for cell in cells]
        data.append(row)

    return data

Let us first try the function on one single page:

In [15]:
options = Options()
options.add_argument('-headless') #No need to open a browser window
driver = webdriver.Firefox(options=options)
output = scrape_horses('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')
driver.quit()
output[0:2]

[['Pla.',
  'Horse No.',
  'Horse',
  'Jockey',
  'Trainer',
  'Act. Wt.',
  'Declar. Horse Wt.',
  'Dr.',
  'LBW',
  'RunningPosition',
  'Finish Time',
  'Win Odds'],
 ['1',
  '10',
  'JOLLY JOLLY(T087)',
  'K Teetan',
  "P O'Sullivan",
  '114',
  '1214',
  '13',
  '-',
  '1111',
  '1:22.05',
  '2.6']]

Here are the loops. Note that the month and day in the URL are always two digits, so we pad them with a leading zero where needed.

String formatting: https://docs.python.org/3.4/library/string.html#format-string-syntax


In [16]:
#URL of data
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

driver = webdriver.Firefox(options=options)

#Write a loop to go through year, month and day
#Note that month and day are always 2 digits
#Call scrape_horses() in each iteration
for year in range(2017,2018):
    for month in range(1,2):
        for day in range(1,15):
            
            #Convert month and day to 2-digit representation
            month_2d = '{:02d}'.format(month)
            day_2d = '{:02d}'.format(day)
            
            url = url_front + str(year) + month_2d + day_2d
            
            print(url)
            print(scrape_horses(url))
            
driver.quit()

http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170101
[['Pla.', 'Horse No.', 'Horse', 'Jockey', 'Trainer', 'Act. Wt.', 'Declar. Horse Wt.', 'Dr.', 'LBW', 'RunningPosition', 'Finish Time', 'Win Odds'], ['1', '2', 'ROCK THE TREE(P272)', 'B Prebble', 'D E Ferraris', '133', '1056', '11', '-', '121211121', '2:03.16', '9.7'], ['2', '9', 'HIGH SPEED METRO(P293)', 'K C Leung', 'L Ho', '119', '1169', '12', '3/4', '11101042', '2:03.31', '10'], ['3', '13', 'WIN CHANCE(P415)', 'M L Yeung', 'A Lee', '112', '1026', '2', '4', '77723', '2:03.82', '17'], ['4', '11', 'LOYAL CRAFTSMAN(S354)', 'S Clipperton', 'D E Ferraris', '120', '1080', '13', '4-1/2', '13131394', '2:03.88', '8.3'], ['5', '1', 'CHOICE EXCHEQUER(P088)', 'A Badel', 'C H Yip', '133', '1209', '3', '5-1/2', '11115', '2:04.03', '15'], ['6', '4', 'SWEET BEAN(S205)', 'N Callan', 'C Fownes', '128', '1031', '7', '5-3/4', '888106', '2:04.10', '22'], ['7', '3', 'TELEPHATIA(P405)', 'Z Purton', 'A Lee', '130', '1077', '8', '6', '1
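
The nested loops work for the first half of January, but widening the ranges would also produce dates that do not exist, such as 30 February. A sketch of an alternative that steps through real calendar dates with `datetime`. The generator name `date_strings` is ours:

```python
from datetime import date, timedelta

def date_strings(start, end):
    # Yield every date from start to end (inclusive) as YYYYMMDD
    current = start
    while current <= end:
        yield current.strftime("%Y%m%d")
        current += timedelta(days=1)

url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

# Same date range as the loop above: 1 to 14 January 2017
urls = [url_front + d for d in date_strings(date(2017, 1, 1), date(2017, 1, 14))]

print(urls[0])
print(len(urls))
```

`strftime("%Y%m%d")` also takes care of the two-digit padding for month and day.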

### G. Saving Data to File

Most of the time we want to save the data for future use. The most common method is to save the data in a CSV file, a format that is supported by virtually all data analysis software.

Package needed:
- CSV file reading and writing: https://docs.python.org/3.6/library/csv.html

The basic syntax of saving into a CSV file is:

In [17]:
filepath = "temp.csv"
content = [[1,"ha","abc"]]

import csv
with open(filepath, 'w', newline='') as csvfile:
    mywriter = csv.writer(csvfile)
    mywriter.writerows(content)
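
To check what was written, the file can be read back with `csv.reader`. Note that every value comes back as a string, so numbers need converting after loading. A sketch that writes to a temporary directory so that it does not touch `temp.csv`:

```python
import csv
import os
import tempfile

content = [[1, "ha", "abc"]]

with tempfile.TemporaryDirectory() as tmpdir:
    filepath = os.path.join(tmpdir, "temp.csv")

    # Write the rows, as in the cell above
    with open(filepath, 'w', newline='') as csvfile:
        csv.writer(csvfile).writerows(content)

    # Read them back; each row is a list of strings
    with open(filepath, newline='') as csvfile:
        loaded = list(csv.reader(csvfile))

print(loaded)
```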

Now we will incorporate file-saving into our loop:

In [18]:
#The first part of the URL of data source
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

driver = webdriver.Firefox(options=options)

#Copy the loop from above and incorporate the csv-saving code
for year in range(2017,2018):
    for month in range(1,2):
        for day in range(1,15):
            
            #Convert month and day to 2-digit representation
            month_2d = '{:02d}'.format(month)
            day_2d = '{:02d}'.format(day)
            
            #Full URL of data source
            url = url_front + str(year) + month_2d + day_2d
            
            #Print the URL so we know the progress so far
            print("Trying:",url)
            
            #Call our function to fetch and process data given the URL
            content = scrape_horses(url)
            
            #Only save if there is something in content
            if len(content) > 0:
                filepath = str(year)+month_2d+day_2d+".csv"
                
                #This part is just standard CSV-writing code
                import csv
                with open(filepath, 'w', newline='') as csvfile:
                    mywriter = csv.writer(csvfile)
                    mywriter.writerows(content)   
                    print(filepath,"saved.")
                    
driver.quit()

Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170101
20170101.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170102
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170103
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170104
20170104.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170105
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170106
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170107
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170108
20170108.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170109
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170110
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170111
20170111.csv saved.
Trying: http://rac

### H. Exercise
How can we get the data for different races? In particular, how should we handle the race track code in the URL?
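
One possible starting point, offered as a sketch rather than a full answer: the example URL at the top ends in `/ST`, so a venue code can be appended after the date. The code `HV` and the second date below are assumptions on our part, as is the idea of simply trying each code and keeping whichever returns data.

```python
from itertools import product

url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

# Example dates; only the first appears in this notebook
dates = ["20151213", "20151216"]
# 'ST' is taken from the URL at the top; 'HV' is an assumption
venues = ["ST", "HV"]

# Build one candidate URL per (date, venue) pair
candidates = [f"{url_front}{d}/{v}" for d, v in product(dates, venues)]

for url in candidates:
    print(url)
```

Each candidate could then be passed to `scrape_horses()`, which already returns an empty list when a page has nothing to show. How individual races within a meeting are addressed is left open here.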