# Accessing a Generative AI Model through the OpenAI API

Generative AI models are machine learning models that are capable of generating content.
This notebook focuses on the most commonly used generative AI models: 
those that generate text. These models are called *large language models*, or 'LLMs' for short.
We will not go into the technical details of how an LLM works at this point, but rather focus
on how to use an LLM as an end user.

### Accessing an LLM through Python
Although it is very convenient to access an LLM through a browser interface for one-off tasks, 
for repetitive tasks you will usually want to access the model from a program. Model providers 
generally provide such access through an *application programming interface*, or 'API' for short.
An API defines what sorts of interactions are possible with the model, and what the input and output data 
should look like. 

The API used by OpenAI, the provider of ChatGPT, is the de facto standard and is supported by most model providers.
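
To make this concrete, the sketch below shows roughly what a chat request and its response look like under the OpenAI API, written as plain Python dictionaries. The values (the model name `"default"` and the message text) are placeholders for illustration, and real responses carry additional fields that are omitted here.

```python
# Simplified shape of a chat completion request (illustrative values only)
request = {
    "model": "default",
    "messages": [
        {"role": "user", "content": "Where is Hong Kong?"},
    ],
}

# Simplified shape of the corresponding response; real responses
# include more fields (IDs, token usage, etc.) that are left out here
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hong Kong is ..."}},
    ],
}

# The generated text sits inside the first choice
reply_text = response["choices"][0]["message"]["content"]
print(reply_text)
```

The OpenAI Python client used in the rest of this notebook builds the request and parses the response for you, so you work with objects and attributes rather than raw dictionaries.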

The first step is to set up an OpenAI client. The client requires two pieces of information:
1. `base_url`: this is the web address of the model provider. If you use OpenAI's models, this can be omitted.
2. `api_key`: this is a string of text unique to your user account with the model provider. 

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url = 'https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

In a production environment, it is best to store the API key in an environment variable so that it is not saved in your notebook. The easiest way to do so when working in a Jupyter notebook is to put the key in a file named `.env`, then use the `dotenv` library (installed as `python-dotenv`) to set the environment variable.

In [1]:
from openai import OpenAI
from dotenv import load_dotenv
import os

# Load the .env file containing environment variables.
# Without an argument, load_dotenv() searches starting from the current
# working directory; here we load the one in the home directory explicitly.
home_dir = os.path.expanduser("~")
load_dotenv(os.path.join(home_dir, ".env"))

# The API key should be saved as an environment variable OPENAI_API_KEY
# Optionally set the base_url as OPENAI_BASE_URL in the same file
client = OpenAI(
    base_url = 'https://scrp-chat.econ.cuhk.edu.hk/api',
)
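
If the client later fails to authenticate, it can help to confirm that the key was actually loaded, without printing the key itself. The sketch below assumes the `.env` file contains a line of the form `OPENAI_API_KEY=...`; the helper name `describe_env_var` is made up for this illustration.

```python
import os


def describe_env_var(name="OPENAI_API_KEY"):
    # Report whether an environment variable is set without revealing its value
    value = os.environ.get(name)
    if not value:
        return f"{name} is not set"
    return f"{name} is set ({len(value)} characters)"


print(describe_env_var())
```

Reporting only the length keeps the key out of the notebook's saved output.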

### Model List

To see which models the provider supports:

In [2]:
models = client.models.list()
model_ids = sorted([model.id for model in models.data])
print("Available Models:")
for model_id in model_ids:
    print(model_id)

Available Models:
Qwen/Qwen2.5-VL-72B-Instruct-AWQ
Qwen/Qwen3-32B-FP8
abc-test
default
help
image-editor-function
openai/gpt-oss-120b
openai/gpt-oss-120b
reasoning
text
text-large
text-medium
text-small
vision
vision-large
vision-medium
vision-small
zai-org/GLM-4.5-FP8


### Calling the Model

To call a model, use 
```python
response = client.chat.completions.create(...)
```
You will need to provide a few pieces of information:
- `model`: the name of the model.
- `messages`: a list containing the conversation history. Each message should be a dictionary with a `role` and a `content`. `role` can be one of:
    - `system`: system prompt. Use this to give the model general guidelines that it should follow strictly.
    - `user`: prompt entered by the user. 
    - `assistant`: response from the model. 
    - `tool`: response from tools the model can use.
- Additional settings such as temperature and number of samples:
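    - `temperature`: controls the randomness of the output. Lower values make the output more deterministic; higher values make it more varied.
    - `n`: the number of responses to generate for the same input. Each one appears as a separate entry in `response.choices`.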

The text of the model's response is stored in
```python
response.choices[0].message.content
```
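
As a sketch of how the additional settings fit in, the hypothetical helper below requests several samples in one call and collects all of them rather than only the first. It takes the `client` as an argument; the model name `"default"` follows the examples in this notebook, and whether a given provider honors `n` greater than 1 is an assumption worth checking.

```python
def sample_replies(client, question, n=3, temperature=1.0, model="default"):
    # Ask the question once, but request n independent samples
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=temperature,  # higher values give more varied samples
        n=n,                      # number of samples to generate
    )
    # Each sample is a separate entry in response.choices
    return [choice.message.content for choice in response.choices]


# Example usage (requires a configured client):
# replies = sample_replies(client, "Suggest a name for a racehorse.")
```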

In [3]:
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()

client = OpenAI()

response = client.chat.completions.create(
  model="default",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Where is Hong Kong?"},
  ],
)
print(response.choices[0].message.content)

Hong Kong is a Special Administrative Region (SAR) of the People’s Republic of China. Geographically, it is located on the southeastern coast of China, at the mouth of the Pearl River Delta. 

- **Coordinates:** approximately 22.3° N latitude, 114.2° E longitude.  
- **Borders:**  
  - **South:** South China Sea (including Victoria Harbour).  
  - **North:** The mainland Chinese city of Shenzhen in Guangdong Province.  
- **Proximity to major cities:**  
  - About 30 km (≈ 19 mi) north of Macau.  
  - Roughly 120 km (≈ 75 mi) west of Guangzhou, the capital of Guangdong Province.  

On a world map, you’ll find Hong Kong on the eastern side of the Asian continent, just east of mainland China’s southern coast and opposite the island of Taiwan across the South China Sea.
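
Because `messages` holds the whole conversation history, a follow-up question only works if the earlier turns are sent again. A minimal sketch, assuming a `client` set up as above; the helper name `ask` is hypothetical:

```python
def ask(client, history, question, model="default"):
    # Add the new user turn to the running conversation
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    # Record the model's reply so the next call can see it
    history.append({"role": "assistant", "content": reply})
    return reply


# Example usage (requires a configured client):
# history = [{"role": "system", "content": "You are a helpful assistant."}]
# ask(client, history, "Where is Hong Kong?")
# ask(client, history, "What is its population?")  # "its" relies on the history
```

The history grows with every turn, so long conversations consume more of the model's context length.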


### Example: Scraping a Webpage and Extracting Information with an LLM

We first scrape the webpage with Selenium + Beautiful Soup,
but unlike before, we do not manually find the elements we need.
Instead, we export a stripped version of the webpage's source code,
then use an LLM to extract the information we need. 
Using a stripped version of the source code reduces the number of 
tokens fed to the LLM, which helps keep the input within the LLM's context length
and also tends to improve accuracy.

In [1]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


def get_stripped_page_source(url):
    # Fetch a page with the global Selenium `driver` and return its text
    # content with the HTML tags stripped. Returns "" if there is no data.

    # Fetch the page
    driver.get(url)
    
    # Return early if the page reports that there is no information
    if "No information." in driver.page_source:
        return ""
    
    # Wait up to 30 seconds so that the dynamic content has time to load.
    # Return an empty string if the page doesn't load in time.
    try:
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CLASS_NAME, "f_fs13")))
    except TimeoutException:
        return ""
    
    # Load the page into BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Keep only the text, dropping tags and surrounding whitespace
    return soup.get_text(strip=True)

Scrape the webpage:

In [2]:
# Set up selenium to use Firefox
options = Options()
options.add_argument('-headless') #No need to open a browser window

driver = webdriver.Firefox(options=options)
page_source = get_stripped_page_source('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')
driver.quit()


Let us take a look at the stripped source code:

In [None]:
page_source
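
Since the point of stripping is to stay within the model's context length, a quick size check before calling the LLM can be useful. The sketch below uses character count as a crude proxy for tokens; the helper name `fit_to_budget` and the 100,000-character default are arbitrary assumptions, not limits of any particular model.

```python
def fit_to_budget(text, max_chars=100_000):
    # Character count is only a rough stand-in for token count
    if len(text) <= max_chars:
        return text, False
    # Keep the beginning of the text and report that it was cut
    return text[:max_chars], True


sample = "x" * 120
trimmed, was_cut = fit_to_budget(sample, max_chars=100)
print(len(trimmed), was_cut)  # prints: 100 True
```

Cutting the text may drop part of the table, so a reported cut is better treated as a warning than silently ignored.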

Now we feed the source code to the LLM:

In [3]:
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()

client = OpenAI()

response = client.chat.completions.create(
  model="default",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"""The following webpage contains horse racing results. 
                                Please extract the result table and return it in comma-delimited
                                CSV format. No explanation is needed.
                                {page_source}
                                """},
  ],
)
output = response.choices[0].message.content
print(output)

Place,Horse No,Horse,Jockey,Trainer,Details,Finish Time,Win Odds
1,110,"JOLLY JOLLY (T087)","K Teetan","P O'Sullivan","114121413-11111","1:22.05","2.628"
2,105,"PEOPLE'S KNIGHT (T305)","T Berry","J Moore","1191163822422","1:22.39","5.7312"
3,176,"RUN FORREST (T176)","J Moreira","C S Shum","11511351031410103","1:22.54","3.944"
4,167,"MODERN TSAR (S167)","B Prebble","W Y So","1231101114-3/4913134","1:22.80","1.353"
5,114,"MAGNETISM (V114)","G Lerena","D E Ferraris","125113034-3/47665","1:22.81","5.261"
6,236,"ENORMOUS HONOUR (T236)","N Rawiller","Y S Tsui","131112795-1/4111212","1:22.87","10.79"
7,299,"HAPPY JOURNEY (S299)","H W Lai","S Woods","114104056-1/25777","1:23.08","12.1814"
8,202,"WINGOLD (T202)","M L Yeung","A Lee","1111154126-1/2131414","1:23.11","33.1911"
9,351,"OVETT (P351)","H N Wong","A T Millard","105115347-1/43239","1:23.20","4.1105"
10,442,"PAKISTAN BABY (S442)","D Whyte","A S Cruz","121102377-1/46111110","1:23.22","12.116"
11,382,"SUPER FLUKE (T382)","M Demuro","D Cruz

If the output is unreliable, you can use Beautiful Soup to locate the table, and provide only the source code of the table to the LLM:

In [None]:
def get_stripped_page_source(url):
    # Fetch the page
    driver.get(url)
    # Load the page into Beautiful Soup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # Get the result table
    result_table = soup.find("table", class_="f_tac")
    if result_table:
        return result_table.prettify()
    else:
        return ""

Let us try again:

In [None]:
# Get source code of the table
driver = webdriver.Firefox(options=options)
page_source = get_stripped_page_source('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')
driver.quit()

# Feed LLM the table
response = client.chat.completions.create(
  model="default",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"""The following webpage contains horse racing results. 
                                Please extract the result table and return it in comma-delimited
                                CSV format. No explanation is needed.
                                {page_source}
                                """},
  ],
)
output = response.choices[0].message.content
print(output)
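
Before saving, it may be worth checking that the reply really is well-formed CSV, since an LLM can drop or merge fields. The sketch below uses Python's built-in `csv` module to flag rows whose column count differs from the header; `find_bad_rows` is a hypothetical helper, and the sample text is made up for illustration.

```python
import csv
import io


def find_bad_rows(csv_text):
    # Parse the text as CSV, skipping completely empty lines
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return []
    expected = len(rows[0])  # number of columns in the header
    # Return (row number, row) pairs for rows with the wrong number of
    # fields; the header counts as row 1 among the non-empty rows
    return [(i, row) for i, row in enumerate(rows[1:], start=2)
            if len(row) != expected]


sample = 'Place,Horse,Jockey\n1,"JOLLY JOLLY","K Teetan"\n2,"RUN FORREST"\n'
print(find_bad_rows(sample))  # prints: [(3, ['2', 'RUN FORREST'])]
```

An empty result only means the shape is consistent; it does not confirm that the values themselves were extracted correctly.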

Finally, we write the output to a file:

In [11]:
with open("result.csv", "a") as f:
  f.write(output)

We can put everything together in one function
and use it in a loop to save the information we need:

In [5]:
# Function to scrape results

def scrape_webpage(url, prompt):
    # Fetch the page, then ask the LLM to extract what `prompt` describes.
    # Returns the LLM's reply, or None if the page has no usable content.
    page_source = get_stripped_page_source(url)
    if page_source == "":
        return None

    response = client.chat.completions.create(
      model="default",
      messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt + page_source},
      ],
    )
    return response.choices[0].message.content
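
When the function is called many times in a loop, a single failed API call should not necessarily end the whole run. One option, sketched here with a hypothetical `with_retries` helper, is to retry a call a few times with a growing pause in between. The attempt count and delays are arbitrary assumptions.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0):
    # Call func(); on failure wait and try again, up to `attempts` times
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Pause for base_delay, then 2x, then 4x, and so on
            time.sleep(base_delay * 2 ** attempt)


# Example usage with the function defined above:
# content = with_retries(lambda: scrape_webpage(url, prompt))
```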
        
        

In [None]:
# The first part of the URL of data source
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"
prompt = """The following webpage contains horse racing results. 
            Please extract the result table and return it in comma-delimited
            CSV format. No explanation is needed.
            """

driver = webdriver.Firefox(options=options)

try:
    # Copy the loop from above and incorporate the csv-saving code
    for year in range(2017,2018):
        for month in range(1,2):
            for day in range(1,32):
                
                # Convert month and day to 2-digit representation
                month_2d = '{:02d}'.format(month)
                day_2d = '{:02d}'.format(day)
                
                # Full URL of data source
                url = url_front + str(year) + month_2d + day_2d
                
                # Print the URL so we know the progress so far
                print("Trying:",url)
                
                # Call our function to fetch and process data given the URL
                content = scrape_webpage(url, prompt)
                
                # Only save if there is something in content
                if content is not None:
                    filepath = str(year)+month_2d+day_2d+".csv"
                    
                    # Save to file
                    with open(filepath, "a") as f:
                        f.write(content)

except Exception as e:
    print(e)
finally:
    # Close the browser even if an error occurred
    driver.quit()

Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170101
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170102
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170103
Trying: http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20170104
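
If the nested loops are extended to other months, they will also try dates that do not exist, such as the 31st of a 30-day month. A possible alternative, sketched here with the standard `datetime` module, is to step through real calendar dates. The generator name `date_strings` is hypothetical.

```python
from datetime import date, timedelta


def date_strings(start, end):
    # Yield every calendar date from start to end (inclusive) as YYYYMMDD
    current = start
    while current <= end:
        yield current.strftime("%Y%m%d")
        current += timedelta(days=1)


days = list(date_strings(date(2017, 2, 27), date(2017, 3, 1)))
print(days)  # prints: ['20170227', '20170228', '20170301']
```

Each string can be appended to `url_front` in place of `str(year) + month_2d + day_2d`.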
