# Data Scraping
Version: 2020-3-19

In this exercise, we will scrape data from Hong Kong Jockey Club's race result page: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST

Libraries needed:
- For static webpage, `requests`: http://docs.python-requests.org/en/master/ 
- For dynamic webpage, `selenium`: https://selenium-python.readthedocs.io/
- For parsing the webpage, `BeautifulSoup`: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Regular expression: https://docs.python.org/3.6/library/re.html

We first use ```requests``` or ```selenium``` to access a page, which we then pass to ```BeautifulSoup``` to parse into a searchable structure. Regular expressions allow us to find specific parts of that structure by keyword matching.

### A. Fetching the Webpage

We will begin by fetching the webpage. Because HKJC has switched to a dynamic page built with JavaScript and AJAX, we will use `selenium`, which loads the webpage through an actual browser.

Selenium needs an interface to control the browser. The interface is called *WebDriver* and is browser dependent. Here are the links to the WebDrivers of common browsers:
- Firefox: https://github.com/mozilla/geckodriver/releases
- Chrome: https://chromedriver.chromium.org/

What to do after the download:
- On Linux, decompress the package, then move its contents to `/usr/local/bin/`.
- On Windows, decompress the package and move its contents to a sensible location (e.g. `Program Files`). You will then need to manually add that location to the system variable `Path`.

If the above procedure confuses you, it might be easier to put the WebDriver somewhere convenient and pass its location directly to Selenium through the `executable_path` option.


In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options

# Set up selenium to use Firefox
options = Options()
options.headless = True #No need to open a browser window
driver = webdriver.Firefox(options=options)

# Example of manually specifying the WebDriver's location: 
# driver = webdriver.Firefox(executable_path="../Others/geckodriver.exe",options=options) #Windows
# driver = webdriver.Firefox(executable_path="../Others/geckodriver",options=options) #Linux

# Fetch the page
driver.get('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')


If the page is static, we can use `requests`, which does not require a browser to work:

In [1]:
# Only works with static content
import requests

# URL of data
url = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST"

# Access the page
page = requests.get(url)

We can fetch the webpage's source with:

- `driver.page_source` for `selenium`.
- `page.text` for `requests`.

For Selenium, we should close the browser with `driver.quit()` once we have the page source to free resources taken by the browser. **Closing the notebook alone does not close the browser!**

In [None]:
from bs4 import BeautifulSoup

# Make a copy of the page source
page_source = driver.page_source

# Load the page into BeautifulSoup
# Use 'page.text' instead of 'page_source' when using requests
soup = BeautifulSoup(page_source,'html.parser')

# Once the data is passed to Beautiful Soup 
# we can close the browser and clear out Selenium
driver.quit()
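
If an error is raised between `driver.get()` and `driver.quit()`, the browser is left running. One way to guard against that is a small context manager. The sketch below is an illustration rather than part of Selenium, and the name `quitting` is hypothetical. It works with any object that has a `quit()` method, so a stand-in class is used here in place of a real browser.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver, then call driver.quit() even if an error occurs."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self):
        self.page_source = "<html><body>Hello</body></html>"
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
with quitting(fake) as d:
    html = d.page_source  # grab what we need while the browser is open

print(fake.closed)
```

With Selenium, `FakeDriver()` would be replaced by something like `webdriver.Firefox(options=options)`.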

### A. Getting a Single Column of Data

Let's begin by fetching the names of the horses. We note that each horse's name is enclosed in an HTML ```<a>``` tag, with the term *HorseId* contained in its hypertext reference.

<img src="../Images/webscraping-2020/HorseId.png" style="border: 1px solid grey; width: 750px;">

In [3]:
from bs4 import BeautifulSoup
import re

# Load the page into BeautifulSoup, using the copy of the source saved
# before the browser was closed (use 'page.text' when using requests)
soup = BeautifulSoup(page_source,'html.parser')

# Find all tags with href containing "HorseId"
# horses is a list of matched tags
horses = soup.find_all(href=re.compile("HorseId"))

# Print the result
for horse in horses:
    print(horse.text)

JOLLY JOLLY
PEOPLE'S KNIGHT
RUN FORREST
MODERN TSAR
MAGNETISM
ENORMOUS HONOUR
HAPPY JOURNEY
WINGOLD
OVETT
PAKISTAN BABY
SUPER FLUKE
JUN GONG
LAUGH OUT LOUD
TEN SPEED
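
To see what `find_all(href=re.compile(...))` does without loading the live page, the sketch below runs the same kind of search on a small hand-written HTML fragment. The fragment and its URLs are made up for illustration and are not the actual HKJC markup.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only (not the real page)
html = """
<table>
  <tr>
    <td><a href="/horse.aspx?HorseId=A001">FIRST HORSE</a></td>
    <td><a href="/jockey.aspx?JockeyId=J01">A Rider</a></td>
  </tr>
  <tr>
    <td><a href="/horse.aspx?HorseId=A002">SECOND HORSE</a></td>
    <td><a href="/jockey.aspx?JockeyId=J02">B Rider</a></td>
  </tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# Keep only the tags whose href attribute contains "HorseId"
horse_tags = soup_demo.find_all(href=re.compile("HorseId"))
horse_names = [tag.text for tag in horse_tags]
print(horse_names)
```

The match is case-sensitive. If the capitalisation used in the page is uncertain, `re.IGNORECASE` can be passed as the second argument to `re.compile`.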


We can do the same for jockeys, noting that each jockey name is enclosed in a ```<a>``` tag with hypertext reference containing the term *JockeyProfile.aspx*.

In [4]:
# Find all tags with href containing "JockeyProfile.aspx" 
jockeys = soup.find_all(href=re.compile("JockeyProfile.aspx"))

# Print the result
for jockey in jockeys:
    print(jockey.text)

K Teetan
T Berry
J Moreira
H W Lai
M L Yeung
H N Wong
D Whyte
M Demuro
C Y Ho
G Mosse


Finally, we can also match by the class of the ```<td>``` tag one layer up. This would return horse names, jockey names and trainer names.

<img src="../Images/webscraping-2020/HorseClass.png" style="border: 1px solid grey; width: 750px;">

In [6]:
data = soup.find_all("td",class_="f_fs13 f_tal")

for d in data:
    print(d.text.strip())

JOLLY JOLLY(T087)
K Teetan
P O'Sullivan
PEOPLE'S KNIGHT(T305)
T Berry
J Moore
RUN FORREST(T176)
J Moreira
C S Shum
MODERN TSAR(S167)
B Prebble
W Y So
MAGNETISM(V114)
G Lerena
D E Ferraris
ENORMOUS HONOUR(T236)
N Rawiller
Y S Tsui
HAPPY JOURNEY(S299)
H W Lai
S Woods
WINGOLD(T202)
M L Yeung
A Lee
OVETT(P351)
H N Wong
A T Millard
PAKISTAN BABY(S442)
D Whyte
A S Cruz
SUPER FLUKE(T382)
M Demuro
D Cruz
JUN GONG(N325)
C Y Ho
C H Yip
LAUGH OUT LOUD(P297)
G Mosse
K L Man
TEN SPEED(T239)
Y T Cheng
C W Chang
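
One caveat with `class_="f_fs13 f_tal"`: when the string contains more than one class, BeautifulSoup compares it with the exact value of the `class` attribute, so the same classes written in a different order do not match. A CSS selector passed to `select()` matches regardless of order. The sketch below shows the difference on a made-up fragment (not the real page).

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only
html = """
<table><tr>
  <td class="f_fs13 f_tal">HORSE ONE</td>
  <td class="f_tal f_fs13">HORSE TWO</td>
  <td class="f_fs13">114</td>
</tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute
exact = [td.text for td in soup_demo.find_all("td", class_="f_fs13 f_tal")]

# CSS selector: any <td> carrying both classes, in any order
either_order = [td.text for td in soup_demo.select("td.f_fs13.f_tal")]

print(exact)
print(either_order)
```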


### B. Fetching Adjacent Fields
Let's now try fetching the jockeys' and trainers' names, having first located the horse names.

<img src="../Images/webscraping-2020/HorseId_siblings.png" style="border: 1px solid grey; width: 750px;">

In [40]:
# Loop through each horse and find the jockey and trainer along the way
from bs4 import NavigableString

for horse in horses:
    
    # jockey is supposed to be horse.parent.next_sibling
    jockey = horse.parent.next_sibling
       
    # But there is whitespace between tags, which BeautifulSoup picks 
    # up as 'NavigableString'. We use a while loop to keep moving when 
    # we encounter such cases
    while isinstance(jockey, NavigableString):
        jockey = jockey.next_sibling
            
    # Now do the same to find trainer            
    trainer = jockey.next_sibling
    while isinstance(trainer, NavigableString):
        trainer = trainer.next_sibling
    
    # Print what we find
    print(horse.text.strip().ljust(20),
          jockey.text.strip().ljust(15),
          trainer.text.strip())

JOLLY JOLLY          K Teetan        P O'Sullivan
PEOPLE'S KNIGHT      T Berry         J Moore
RUN FORREST          J Moreira       C S Shum
MODERN TSAR          B Prebble       W Y So
MAGNETISM            G Lerena        D E Ferraris
ENORMOUS HONOUR      N Rawiller      Y S Tsui
HAPPY JOURNEY        H W Lai         S Woods
WINGOLD              M L Yeung       A Lee
OVETT                H N Wong        A T Millard
PAKISTAN BABY        D Whyte         A S Cruz
SUPER FLUKE          M Demuro        D Cruz
JUN GONG             C Y Ho          C H Yip
LAUGH OUT LOUD       G Mosse         K L Man
TEN SPEED            Y T Cheng       C W Chang


We could also first locate the jockey's name, before fetching the horse name and the trainer's name relative to it.

<img src="../Images/webscraping-2020/JockeyProfile_siblings.png" style="border: 1px solid grey; width: 750px;">

Because we are going to need to deal with whitespace very often, let us first write a function that runs the while loop for us:

In [34]:
def get_sibling(tag,previous=False):
    if previous:
        sibling = tag.previous_sibling
        while isinstance(sibling, NavigableString):
            sibling = sibling.previous_sibling
    else:
        sibling = tag.next_sibling
        while isinstance(sibling, NavigableString):
            sibling = sibling.next_sibling        
    return sibling
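
BeautifulSoup also provides `find_next_sibling()` and `find_previous_sibling()`, which search among sibling tags only and therefore step over whitespace strings. The sketch below, on a made-up table row, shows them returning the same cells that a helper like `get_sibling()` would.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only
html = """
<table><tr>
  <td>HORSE ONE</td>
  <td><a href="JockeyProfile.aspx?id=1">A Rider</a></td>
  <td>B Trainer</td>
</tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

jockey_cell = soup_demo.find("a").parent

# .next_sibling is the whitespace between the two <td> tags
print(repr(jockey_cell.next_sibling))

# find_*_sibling() only considers tags, so the whitespace is skipped
horse_cell = jockey_cell.find_previous_sibling("td")
trainer_cell = jockey_cell.find_next_sibling("td")
print(horse_cell.text, "|", trainer_cell.text)
```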

Now we can loop through all jockeys and fetch other fields relative to them:

In [41]:
# Use jockey instead
for jockey in jockeys:
    horse = get_sibling(jockey.parent,previous=True)
    trainer = get_sibling(jockey.parent)
    actual_weight = get_sibling(trainer)
    declare_weight = get_sibling(actual_weight)

    print(horse.text.strip().ljust(20),
          jockey.text.strip().ljust(15),
          trainer.text.strip().ljust(15),
          actual_weight.text.strip().ljust(15),
          declare_weight.text.strip())


JOLLY JOLLY(T087)    K Teetan        P O'Sullivan    114             1214
PEOPLE'S KNIGHT(T305) T Berry         J Moore         119             1163
RUN FORREST(T176)    J Moreira       C S Shum        115             1135
HAPPY JOURNEY(S299)  H W Lai         S Woods         114             1040
WINGOLD(T202)        M L Yeung       A Lee           111             1154
OVETT(P351)          H N Wong        A T Millard     105             1153
PAKISTAN BABY(S442)  D Whyte         A S Cruz        121             1023
SUPER FLUKE(T382)    M Demuro        D Cruz          120             1109
JUN GONG(N325)       C Y Ho          C H Yip         115             1147
LAUGH OUT LOUD(P297) G Mosse         K L Man         126             1127


### C. Multiple Pages

Most of the time we need more than one page. We can go through pages with for loop(s).

Before we go there, let's write a helper function that returns the content we want from each page in a list:

In [130]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import re

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


def scrape_horses(url):
    # Function to access a page and save all horses into a list.
    # Relies on a live Selenium 'driver' (start a new one if the earlier
    # one was closed with driver.quit()) and on get_sibling() from above.

    # Fetch the page
    driver.get(url)
    
    # Is there anything?
    if "No information." in driver.page_source:
        return []
    
    # Wait up to 30 secs so that the dynamic content has time to load.
    # Proceed to next date if page doesn't load.
    try:
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CLASS_NAME, "f_fs13")))
    except TimeoutException:
        return []
    
    # Load the page into BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Find all tags with href containing "HorseId"
    horses = soup.find_all(href=re.compile("HorseId"))

    # 'output_list' is the whole table
    # 'output' is a single row
    output_list = []
    
    # Loop through horses
    for horse in horses:

        # Get the horse name
        output = [horse.text.strip()]
        
        # This while loop fetches all remaining fields in a row
        a = get_sibling(horse.parent)
        while a is not None:
            output.append(a.text
                          .strip()
                          # The next two replacements tidy up the running positions
                          .replace('\n','') 
                          .replace(' '*20,' ') 
                         )
            a = get_sibling(a)
        
        # Append each row to the output list
        output_list.append(output)

    return output_list
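
The two `replace()` calls above assume that the page pads running positions with runs of exactly 20 spaces. A sketch of an alternative that does not rely on the exact amount of padding is to split on any whitespace and re-join. The helper name `squash_whitespace` is hypothetical, and it separates values with a single space rather than the three spaces produced by the code above.

```python
def squash_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into one space."""
    return " ".join(text.split())


# A made-up cell value with line breaks and uneven padding
raw = "\n   1\n              1 \n      2\n  1   \n"
print(squash_whitespace(raw))
```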

Let us first try the function on one single page:

In [116]:
scrape_horses('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')

[['JOLLY JOLLY',
  'K Teetan',
  "P O'Sullivan",
  '114',
  '1214',
  '13',
  '-',
  '1   1   1   1',
  '1:22.05',
  '2.6'],
 ["PEOPLE'S KNIGHT",
  'T Berry',
  'J Moore',
  '119',
  '1163',
  '8',
  '2',
  '2   4   2   2',
  '1:22.39',
  '5.7'],
 ['RUN FORREST',
  'J Moreira',
  'C S Shum',
  '115',
  '1135',
  '10',
  '3',
  '14   10   10   3',
  '1:22.54',
  '3.9'],
 ['MODERN TSAR',
  'B Prebble',
  'W Y So',
  '123',
  '1101',
  '11',
  '4-3/4',
  '9   13   13   4',
  '1:22.80',
  '13'],
 ['MAGNETISM',
  'G Lerena',
  'D E Ferraris',
  '125',
  '1130',
  '3',
  '4-3/4',
  '7   6   6   5',
  '1:22.81',
  '52'],
 ['ENORMOUS HONOUR',
  'N Rawiller',
  'Y S Tsui',
  '131',
  '1127',
  '9',
  '5-1/4',
  '11   12   12   6',
  '1:22.87',
  '10'],
 ['HAPPY JOURNEY',
  'H W Lai',
  'S Woods',
  '114',
  '1040',
  '5',
  '6-1/2',
  '5   7   7   7',
  '1:23.08',
  '121'],
 ['WINGOLD',
  'M L Yeung',
  'A Lee',
  '111',
  '1154',
  '12',
  '6-1/2',
  '13   14   14   8',
  '1:23.11',
  '331'],


Next, we loop over years, months and days to build the URL for each date. Note that month and day are always written with two digits. 

String formatting: https://docs.python.org/3.4/library/string.html#format-string-syntax


In [131]:
#URL of data
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

#Write a loop to go through year, month and day
#Note that month and day are always written with 2 digits
#Call scrape_horses() in each iteration
for year in range(2017,2018):
    for month in range(1,2):
        for day in range(1,32):
            
            #Convert month and day to 2-digit representation
            month_2d = '{:02d}'.format(month)
            day_2d = '{:02d}'.format(day)
            
            url = url_front + str(year) + month_2d + day_2d
            
            print(url)
            print(scrape_horses(url))
            

http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170101
[['ROCK THE TREE', 'B Prebble', 'D E Ferraris', '133', '1056', '11', '-', '12   12   11   12   1', '2:03.16', '9.7'], ['HIGH SPEED METRO', 'K C Leung', 'L Ho', '119', '1169', '12', '3/4', '11   10   10   4   2', '2:03.31', '10'], ['WIN CHANCE', 'M L Yeung', 'A Lee', '112', '1026', '2', '4', '7   7   7   2   3', '2:03.82', '17'], ['LOYAL CRAFTSMAN', 'S Clipperton', 'D E Ferraris', '120', '1080', '13', '4-1/2', '13   13   13   9   4', '2:03.88', '8.3'], ['CHOICE EXCHEQUER', 'A Badel', 'C H Yip', '133', '1209', '3', '5-1/2', '1   1   1   1   5', '2:04.03', '15'], ['SWEET BEAN', 'N Callan', 'C Fownes', '128', '1031', '7', '5-3/4', '8   8   8   10   6', '2:04.10', '22'], ['TELEPHATIA', 'Z Purton', 'A Lee', '130', '1077', '8', '6', '10   9   9   7   7', '2:04.13', '7.9'], ['KERKENI', 'O Doleuze', 'R Gibson', '127', '1053', '5', '6-1/4', '9   11   12   11   8', '2:04.16', '12'], ['GLAMOROUS RYDER', 'S de Sousa', 'D E Fe

[]
http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170120
[]
http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170121
[]
http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170122
[['KEEP MOVING', 'N Callan', 'P F Yiu', '126', '1104', '12', '-', '2   2   1', '0:58.15', '5.3'], ['CITY LEGEND', 'K Teetan', 'T P Yung', '126', '1032', '7', 'N', '11   10   2', '0:58.22', '74'], ['MERRYGOWIN', 'Z Purton', "P O'Sullivan", '126', '1124', '5', '1/2', '7   8   3', '0:58.25', '2.9'], ['HEALTHY LUCK', 'D Whyte', 'K L Man', '133', '1026', '3', '1', '4   5   4', '0:58.31', '4.7'], ['VITAL SPRING', 'J Moreira', 'J Size', '126', '1082', '1', '3-1/4', '8   6   5', '0:58.67', '9.8'], ['GRACE HEART', 'C Y Ho', 'C Fownes', '128', '979', '11', '3-1/4', '5   7   6', '0:58.67', '20'], ['VICTORY MUSIC', 'T Berry', 'J Moore', '126', '1153', '8', '4-1/4', '9   9   7', '0:58.83', '26'], ['BELOVED', 'C Schofield', 'P F Yiu', '131', '1076', '14', '4-3/4', '10   11
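
The nested loops also try dates that do not exist once the month range is widened (for example 31 February), which costs a wasted request each time. A sketch of an alternative is to step through real calendar dates with `datetime` and let `strftime` handle the zero-padding. The generator name `meeting_urls` is hypothetical; the URL prefix is the one used above.

```python
from datetime import date, timedelta

URL_FRONT = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"


def meeting_urls(start, end):
    """Yield one result-page URL per calendar day from start to end inclusive."""
    current = start
    while current <= end:
        # %Y%m%d gives a 4-digit year and 2-digit month and day
        yield URL_FRONT + current.strftime("%Y%m%d")
        current += timedelta(days=1)


urls = list(meeting_urls(date(2017, 1, 30), date(2017, 2, 2)))
for u in urls:
    print(u)
```

Each URL could then be passed to `scrape_horses()` in place of the string built by the three nested loops.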

### D. Saving data to file

Most of the time we want to save the data for future use. The most common method is to save the data in a CSV file, a format that is supported by virtually all data analysis software.

Package needed:
- CSV file reading and writing: https://docs.python.org/3.6/library/csv.html

The basic syntax of saving into a CSV file is:

In [133]:
filepath = "../data/temp.csv"
content = [[1,"ha","abc"]]

import csv
with open(filepath, 'w', newline='') as csvfile:
    mywriter = csv.writer(csvfile)
    mywriter.writerows(content)
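
To check what was written, the file can be read back with `csv.reader`. The sketch below writes to a temporary directory rather than `../data/` so that it runs anywhere, and adds a second made-up row containing a comma. Note that every value comes back as a string.

```python
import csv
import os
import tempfile

content = [[1, "ha", "abc"], [2, "P O'Sullivan", "a,b"]]

with tempfile.TemporaryDirectory() as tmpdir:
    filepath = os.path.join(tmpdir, "temp.csv")

    # newline='' lets the csv module control line endings itself
    with open(filepath, "w", newline="") as csvfile:
        csv.writer(csvfile).writerows(content)

    # Read the rows back into a list of lists
    with open(filepath, newline="") as csvfile:
        rows = list(csv.reader(csvfile))

print(rows)
```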

Now we will incorporate file-saving into our loop:

In [None]:
import csv

#The first part of the URL of data source
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

#Copy the loop from above and incorporate the csv-saving code
for year in range(2017,2018):
    for month in range(1,2):
        for day in range(1,32):
            
            #Convert month and day to 2-digit representation
            month_2d = '{:02d}'.format(month)
            day_2d = '{:02d}'.format(day)
            
            #Full URL of data source
            url = url_front + str(year) + month_2d + day_2d
            
            #Print the URL so we know the progress so far
            print("Trying:",url)
            
            #Call our function to fetch and process data given the URL
            content = scrape_horses(url)
            
            #Only save if there is something in content
            if len(content) > 0:
                filepath = str(year)+month_2d+day_2d+".csv"
                
                #This part is just standard CSV-writing code
                with open(filepath, 'w', newline='') as csvfile:
                    mywriter = csv.writer(csvfile)
                    mywriter.writerows(content)   
                    print(filepath,"saved.")

Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160101
20160101.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160102
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160103
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160104
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160105
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160106
20160106.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160107
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160108
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160109
20160109.csv saved.
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160110
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160111
Trying: http://racing.hkjc.com/racing/

Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20160402
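
Saving one file per day means the date is recorded only in the file name. If a single combined file is more convenient for analysis, one option is to prepend the date to every row before writing. The helper `add_date_column` below is hypothetical, and the sample rows are shortened versions of the output shown earlier.

```python
import csv
import io


def add_date_column(rows, date_string):
    """Return a copy of rows with date_string inserted as the first field."""
    return [[date_string] + list(row) for row in rows]


# Shortened sample rows; in the loop this would be the result of scrape_horses(url)
content = [["JOLLY JOLLY", "K Teetan", "P O'Sullivan"],
           ["PEOPLE'S KNIGHT", "T Berry", "J Moore"]]

dated = add_date_column(content, "20151213")

# Write to an in-memory buffer here; with a real file, open it once in
# append mode ('a', newline='') and call writerows() for each date
buffer = io.StringIO()
csv.writer(buffer).writerows(dated)
print(buffer.getvalue())
```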


### E. Exercise
How can we get the data for different races? In particular, how should we handle the race track code in the URL?