# Data Scraping
Version: 2023-9-11

In this exercise, we will scrape data from the Hong Kong Jockey Club's race result page: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST

Libraries needed:
- For downloading files, `urllib.request`: https://docs.python.org/3.8/library/urllib.request.html
- For static webpage, `requests`: http://docs.python-requests.org/en/master/ 
- For dynamic webpage, `selenium`: https://selenium-python.readthedocs.io/
- To locate drivers for browsers, `webdriver-manager`: https://pypi.org/project/webdriver-manager/
- For parsing the webpage, `BeautifulSoup`: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- For matching text with regular expressions, `re`: https://docs.python.org/3.6/library/re.html


### A. Downloading Files

The syntax for `urlretrieve` is:
```python
urllib.request.urlretrieve(url, filename)
```
This saves the file fetched from `url` as `filename`. 

In [None]:
# Download a file using urlretrieve
import urllib.request
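
A minimal sketch of `urlretrieve` in use. So that it runs without network access, it copies a temporary local file through a `file://` URL; the file names and contents are placeholders, and for a real download `url` would be an `http://` or `https://` address.

```python
import tempfile
import urllib.request
from pathlib import Path

# Create a small local file to stand in for a remote resource (placeholder data)
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.txt"
source.write_text("race results placeholder")

# urlretrieve accepts any URL scheme urllib understands, including file://
url = source.as_uri()
filename = str(workdir / "downloaded.txt")

# Returns the local file name and the response headers
saved_path, headers = urllib.request.urlretrieve(url, filename)

print(Path(saved_path).read_text())
```

`urlretrieve` overwrites `filename` if it already exists, so it may be worth checking the path first when downloading many files.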


### B. Fetching the Webpage

To scrape a website, we first use ```requests``` or ```selenium``` to access a page, which we then pass to ```BeautifulSoup``` to parse into a searchable structure. Regular expressions then let us find specific parts of that structure by matching keywords.

We will begin by fetching the webpage. Because HKJC has switched to a dynamic page built with JavaScript and AJAX, we will use `selenium`, which loads the webpage through an actual browser.

Selenium needs an interface, called *WebDriver*, to control the browser. We will use the `webdriver_manager` library to locate the correct driver for our choice of browser.

In [None]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# WebDriver. See https://pypi.org/project/webdriver-manager/
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager

# Set up selenium to use Firefox
options = Options()
options.add_argument("-headless")  # Run Firefox without opening a browser window (the `headless` attribute is deprecated in Selenium 4)
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()), options=options)

# Fetch the page
# http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST


Once we have the appropriate WebDriver, we can fetch the page we want:

In [None]:
# Fetch the page
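
A minimal sketch of the fetch step, assuming `driver` is the WebDriver created above. The helper name `fetch_page_source` is hypothetical; `get()` and `page_source` are the Selenium calls doing the work.

```python
def fetch_page_source(driver, url):
    """Load `url` in the browser controlled by `driver` and return its HTML."""
    driver.get(url)            # Returns once the browser reports the page as loaded
    return driver.page_source  # HTML as currently rendered by the browser


# Usage with the driver created above (requires Firefox and Selenium):
# url = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST"
# html = fetch_page_source(driver, url)
```

Content that the page loads through AJAX after the initial load may not be present yet when `get()` returns, so an explicit wait can be needed before reading `page_source`.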


If the page is static, we can use `requests` instead, which does not require a browser to work:

In [None]:
# Only works with static content
import requests

# URL of data
url = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST"

# Access the page
page = requests.get(url)

# What's inside?
page.content

We can fetch the webpage's source with:

- `driver.page_source` for `selenium`.
- `page.text` for `requests`.

For Selenium, we should close the browser with `driver.quit()` once we have the page source to free resources taken by the browser. **Closing the notebook alone does not close the browser!**

In [None]:
driver.page_source
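
One way to make sure the browser is always closed is to wrap the driver in a context manager, so that `quit()` runs even when an error interrupts the cell. This is a sketch only; the name `quitting` is hypothetical and not part of Selenium.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield `driver`, then call driver.quit() even if an error occurred."""
    try:
        yield driver
    finally:
        driver.quit()  # Frees the browser process started by Selenium


# Usage (requires a real WebDriver):
# with quitting(webdriver.Firefox(options=options)) as driver:
#     driver.get(url)
#     html = driver.page_source
```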

In [None]:
from bs4 import BeautifulSoup
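
A minimal sketch of the parsing step. The HTML string is a made-up stand-in for `driver.page_source`, and `html.parser` is Python's built-in parser, chosen here so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for driver.page_source
html = """
<html><head><title>Race Results</title></head>
<body><table><tr>
  <td><a href="Horse.aspx?HorseId=X001">EXAMPLE HORSE</a></td>
</tr></table></body></html>
"""

# Parse the text into a tree that can be searched by tag and attribute
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)       # Text inside the <title> tag
print(soup.find("a")["href"])  # href attribute of the first <a> tag
```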



### C. Getting a Single Column of Data

Let's begin by fetching the names of the horses. Each horse's name is enclosed in an HTML ```<a>``` tag whose `href` attribute (its hypertext reference) contains the term *HorseId*.

<img src="../Images/webscraping-2020/HorseId.png" style="border: 1px solid grey; width: 750px;">

In [None]:
from bs4 import BeautifulSoup
import re
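
A sketch of how the horse names could be collected, run against a made-up HTML fragment rather than the live page. The link targets are invented for illustration; only the *HorseId* keyword comes from the page described above.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment mimicking the structure described above
html = """
<table>
  <tr>
    <td><a href="/racing/horse.asp?HorseId=A001">HORSE ONE</a></td>
    <td><a href="/racing/JockeyProfile.aspx?JockeyId=J01">Jockey One</a></td>
  </tr>
  <tr>
    <td><a href="/racing/horse.asp?HorseId=A002">HORSE TWO</a></td>
    <td><a href="/racing/JockeyProfile.aspx?JockeyId=J02">Jockey Two</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern passed as href= keeps the tags whose href contains a match
horse_links = soup.find_all("a", href=re.compile("HorseId"))
horse_names = [link.get_text(strip=True) for link in horse_links]
print(horse_names)
```

The same call with a different pattern would select a different set of links.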



We can do the same for jockeys, noting that each jockey name is enclosed in a ```<a>``` tag with hypertext reference containing the term *JockeyProfile.aspx*.

Finally, we can also match by the class of the ```<td>``` tag one layer up. This would return horse names, jockey names and trainer names.
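
Matching by class could look like the sketch below. The class name `name-cell` and the link targets are placeholders invented for this example; the real class name has to be read from the page source.

```python
from bs4 import BeautifulSoup

# Made-up fragment; "name-cell" is a placeholder for the real class name
html = """
<table><tr>
  <td class="number">1</td>
  <td class="name-cell"><a href="?HorseId=A001">HORSE ONE</a></td>
  <td class="name-cell"><a href="JockeyProfile.aspx?JockeyId=J01">Jockey One</a></td>
  <td class="name-cell"><a href="TrainerProfile.aspx?TrainerId=T01">Trainer One</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because `class` is a reserved word in Python
cells = soup.find_all("td", class_="name-cell")
names = [cell.get_text(strip=True) for cell in cells]
print(names)
```

Because the three kinds of names share one class in this sketch, they come back interleaved in document order and would still need to be separated afterwards.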

<img src="../Images/webscraping-2020/HorseClass.png" style="border: 1px solid grey; width: 750px;">

### D. Fetching Adjacent Fields
Let's now try fetching the jockeys' and trainers' names, having first located the horse names.
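
A sketch of the idea on a made-up fragment, assuming the jockey and trainer cells directly follow the horse cell in the same row. The whitespace between tags is itself a node (a `NavigableString`), so a `while` loop steps over it until a real tag is reached.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Made-up row; the link targets are invented for illustration
html = """
<table><tr>
  <td><a href="?HorseId=A001">HORSE ONE</a></td>
  <td><a href="JockeyProfile.aspx?JockeyId=J01">Jockey One</a></td>
  <td><a href="TrainerProfile.aspx?TrainerId=T01">Trainer One</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

horse_link = soup.find("a", href=re.compile("HorseId"))
horse_cell = horse_link.parent  # The <td> enclosing the link

# next_sibling may be the whitespace between tags, so skip text nodes
jockey_cell = horse_cell.next_sibling
while isinstance(jockey_cell, NavigableString):
    jockey_cell = jockey_cell.next_sibling

trainer_cell = jockey_cell.next_sibling
while isinstance(trainer_cell, NavigableString):
    trainer_cell = trainer_cell.next_sibling

horse = horse_link.get_text(strip=True)
jockey = jockey_cell.get_text(strip=True)
trainer = trainer_cell.get_text(strip=True)
print(horse, jockey, trainer)
```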

<img src="../Images/webscraping-2020/HorseId_siblings.png" style="border: 1px solid grey; width: 750px;">

We could also first locate the jockey's name, before fetching the horse name and the trainer's name relative to it.

<img src="../Images/webscraping-2020/JockeyProfile_siblings.png" style="border: 1px solid grey; width: 750px;">

Because we are going to need to deal with whitespace very often, let us first write a function that runs the while loop for us:

In [None]:
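from bs4 import NavigableString  # Needed for the isinstance checks below

# Return the nearest sibling that is a tag, skipping whitespace text nodes
def get_sibling(tag, previous=False):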
    if previous:
        sibling = tag.previous_sibling
        while isinstance(sibling, NavigableString):
            sibling = sibling.previous_sibling
    else:
        sibling = tag.next_sibling
        while isinstance(sibling, NavigableString):
            sibling = sibling.next_sibling        
    return sibling

Now we can loop through all jockeys and fetch other fields relative to them:
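
A sketch of such a loop on a made-up fragment. It assumes that the horse cell sits immediately before the jockey cell and the trainer cell immediately after it; `get_sibling` is restated in compact form so that the block runs on its own.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def get_sibling(tag, previous=False):
    """Return the nearest sibling that is a tag, skipping text nodes."""
    sibling = tag.previous_sibling if previous else tag.next_sibling
    while isinstance(sibling, NavigableString):
        sibling = sibling.previous_sibling if previous else sibling.next_sibling
    return sibling


# Made-up rows; the link targets are invented for illustration
html = """
<table>
  <tr>
    <td><a href="?HorseId=A001">HORSE ONE</a></td>
    <td><a href="JockeyProfile.aspx?JockeyId=J01">Jockey One</a></td>
    <td><a href="TrainerProfile.aspx?TrainerId=T01">Trainer One</a></td>
  </tr>
  <tr>
    <td><a href="?HorseId=A002">HORSE TWO</a></td>
    <td><a href="JockeyProfile.aspx?JockeyId=J02">Jockey Two</a></td>
    <td><a href="TrainerProfile.aspx?TrainerId=T02">Trainer Two</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for jockey_link in soup.find_all("a", href=re.compile(r"JockeyProfile\.aspx")):
    jockey_cell = jockey_link.parent                       # The enclosing <td>
    horse_cell = get_sibling(jockey_cell, previous=True)   # Cell to the left
    trainer_cell = get_sibling(jockey_cell)                # Cell to the right
    rows.append([
        horse_cell.get_text(strip=True),
        jockey_link.get_text(strip=True),
        trainer_cell.get_text(strip=True),
    ])

print(rows)
```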

### E. Multiple Pages

Most of the time we need more than one page. We can go through the pages with one or more `for` loops.

Before we go there, let's write a helper function that returns the content we want from each page in a list:

In [None]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup
import re

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
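
One possible shape for the helper is sketched below; it is not the only way to write it. The parsing is split into a hypothetical `parse_horses()` so that it can be tried on plain HTML, while `scrape_horses()` relies on a module-level `driver`, matching how the function is called in this notebook. The row layout (jockey and trainer cells directly after the horse cell), the `timeout` parameter and the CSS selector are assumptions.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def next_tag(tag):
    """Return the next sibling that is a tag, skipping whitespace text nodes."""
    sibling = tag.next_sibling
    while isinstance(sibling, NavigableString):
        sibling = sibling.next_sibling
    return sibling


def parse_horses(html):
    """Return a list of [horse, jockey, trainer] rows found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for horse_link in soup.find_all("a", href=re.compile("HorseId")):
        jockey_cell = next_tag(horse_link.parent)
        trainer_cell = next_tag(jockey_cell) if jockey_cell is not None else None
        if jockey_cell is None or trainer_cell is None:
            continue  # Layout differs from the assumption; skip this row
        rows.append([
            horse_link.get_text(strip=True),
            jockey_cell.get_text(strip=True),
            trainer_cell.get_text(strip=True),
        ])
    return rows


def scrape_horses(url, timeout=10):
    """Fetch `url` with the global `driver` and return the parsed rows."""
    # Imported here so that parse_horses() stays usable without Selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # The table is filled in by JavaScript, so wait until a horse link exists
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='HorseId']"))
    )
    return parse_horses(driver.page_source)
```

`WebDriverWait.until()` raises `selenium.common.exceptions.TimeoutException` when the element never appears, for example on a date without a race meeting, so a loop over many dates would probably want to catch that exception.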



Let us first try the function on one single page:

In [None]:
driver = webdriver.Firefox(options=options)
horses = scrape_horses('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')
driver.quit()
horses  # Only the last expression in a cell is displayed

Here we have the loops. Note that month and day are always in two digits. 

String formatting: https://docs.python.org/3.4/library/string.html#format-string-syntax
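
A small sketch of building the dated URLs with `str.format`. The format specification `{:02d}` pads a number with zeros to two digits. The helper name `build_url` is hypothetical, and the track code `ST` is taken from the example URL used throughout.

```python
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"


def build_url(year, month, day, track="ST"):
    """Return the results URL for one date, e.g. .../20151213/ST."""
    # {:04d} and {:02d} zero-pad the numbers to four and two digits
    return "{}{:04d}{:02d}{:02d}/{}".format(url_front, year, month, day, track)


# Nested loops over month and day; the ranges are arbitrary examples
urls = []
for month in range(11, 13):
    for day in range(1, 4):
        urls.append(build_url(2015, month, day))

print(urls[0])
```

Simple numeric ranges also produce dates that do not exist or on which no races were held, so the scraping step has to cope with pages that contain no results.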


In [None]:
#URL of data
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

driver = webdriver.Firefox(options=options)

# Write a loop to go through year, month and day
# Note that month and day are always two digits
# Call scrape_horses() in each iteration


driver.quit()

### F. Saving Data to File

Most of the time we want to save the data for future use. The most common method is to save the data in a CSV file, a format that is supported by virtually all data analysis software.

Package needed:
- CSV file reading and writing: https://docs.python.org/3.6/library/csv.html

The basic syntax of saving into a CSV file is:

In [None]:
import csv

filepath = "temp.csv"
content = [[1, "ha", "abc"]]  # One inner list per row

with open(filepath, 'w', newline='') as csvfile:
    mywriter = csv.writer(csvfile)
    mywriter.writerows(content)

Now we will incorporate file-saving into our loop:

In [None]:
#The first part of the URL of data source
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"

driver = webdriver.Firefox(options=options)

#Copy the loop from above and incorporate the csv-saving code


driver.quit()
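
One way to combine the loop with CSV writing is sketched below. The function name `save_pages` and its parameters are hypothetical; the scraping function is passed in as an argument, so the sketch does not depend on how `scrape_horses()` is written.

```python
import csv


def save_pages(urls, scrape, filepath):
    """Scrape each URL with `scrape` and write all rows to one CSV file."""
    with open(filepath, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for url in urls:
            rows = scrape(url)  # Expected to return a list of rows (lists)
            # Prefix each row with its URL so the source page stays identifiable
            writer.writerows([url] + list(row) for row in rows)


# Usage with the notebook's own pieces (requires a running driver):
# save_pages(list_of_urls, scrape_horses, "horses.csv")
```

Opening the file once, outside the loop, keeps all pages in a single file; opening it inside the loop with mode `'w'` would overwrite the earlier pages on every iteration.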

### G. Exercise
How can we get the data for different races? In particular, how should we handle the race track code in the URL?