# Accessing a Generative AI Model through the OpenAI API

Version: 2025-10-13

Generative AI models are machine learning models that are capable of generating content.
This notebook focuses on the most commonly used type of generative AI model,
the kind that generates text. These models are called *large language models*, or 'LLMs' for short.
We will not go into the technical details of how an LLM works at this point, but rather focus
on how to use an LLM as an end user.

## A. Accessing an LLM through Python
Although it is very convenient to access an LLM through a browser interface for one-off tasks,
for repetitive tasks you will generally want to access the model from a program. Model providers
generally offer such access through an *application programming interface*, or 'API' for short.
An API defines what sorts of interactions are possible with the model, and what the input and output data
should look like.

The API used by OpenAI, the provider of ChatGPT, is the de facto standard and is supported by most model providers.

The first step is to set up an OpenAI client. The client requires two pieces of information:
1. `base_url`: the web address of the model provider. If you use OpenAI's own models, this can be omitted.
2. `api_key`: a string of text unique to your user account with the model provider.

In [None]:
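from openai import OpenAI

# Placeholder values for illustration: substitute your provider's address and your own key.
client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_API_KEY")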

Note that whoever has your API key has access to your AI service account. In a production environment, it is therefore best to save the API key as an environment variable, so that it will not be saved in your notebook. 

The easiest way to do so when working in a Jupyter notebook is to use the `dotenv` library to load environment variables from a file. The library defaults to loading the variables from a file named `.env` in your current working directory, but you can load from any file by providing a suitable path. 
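
For reference, such a file is plain text with one `NAME=value` pair per line. The sketch below is a simplified stand-in for what `load_dotenv()` does, written with the standard library only. The helper name `load_env_file` is hypothetical, and the real library handles more cases (for example `export` prefixes and variable interpolation), so treat this as an illustration of the idea rather than a replacement.

```python
import os


def load_env_file(path=".env"):
    """Simplified illustration of loading NAME=value pairs into os.environ.

    Hypothetical helper, not part of python-dotenv.
    """
    loaded = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, value = line.split("=", 1)
            name = name.strip()
            # Drop surrounding quotes, if any.
            value = value.strip().strip("'\"")
            # Like load_dotenv()'s default, keep variables that are already set.
            os.environ.setdefault(name, value)
            loaded[name] = value
    return loaded
```

A file for this notebook would then contain two lines, one starting with `OPENAI_API_KEY=` and one starting with `OPENAI_BASE_URL=`.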

In [None]:
from openai import OpenAI
from dotenv import load_dotenv
import os

# Load the scrp-chat.env file containing environment variables.
# Default is to look for a file named .env in the current working directory.
load_dotenv("scrp-chat.env")

# The API key should be saved as an environment variable OPENAI_API_KEY
# The base_url should be saved as OPENAI_BASE_URL in the same file
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    base_url=os.environ.get("OPENAI_BASE_URL"),
)

## B. Model List

To see which models the provider supports:
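
A minimal sketch, assuming `client` is an OpenAI client set up as described in Section A. `client.models.list()` asks the provider for its model catalogue; the helper name `list_model_ids` is ours, and the IDs you get back depend entirely on your provider.

```python
def list_model_ids(client):
    """Return the IDs of the models the provider exposes, sorted alphabetically."""
    # client.models.list() yields model objects; each has an `id` attribute,
    # which is the string to pass as `model` when calling the model.
    return sorted(model.id for model in client.models.list())


# Usage, once a client exists:
# for model_id in list_model_ids(client):
#     print(model_id)
```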

## C. Calling the Model

To call a model, use 
```python
response = client.chat.completions.create(...)
```
You will need to provide a few pieces of information:
- `model`: the name of the model.
- `messages`: a list containing the conversation history. Each message should be a dictionary with a `role` and a `content`. `role` can be one of:
    - `system`: system prompt. Use this to give the model general guidelines that it should follow strictly.
    - `user`: prompt entered by the user. 
    - `assistant`: response from the model. 
    - `tool`: response from tools the model can use.
- Additional settings such as temperature and number of samples:
    - `temperature`: adjusts the randomness of the response. 0 is the lowest setting.
      The default depends on the model and provider; values around 0.6-0.8 are common, and OpenAI's own API defaults to 1.
    - `n`: number of responses you want the model to generate. Default is 1.

The text of the first response is stored in
```python
response.choices[0].message.content
```
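
The pieces above can be combined into two small helpers. This is a sketch under assumptions: the names `build_messages` and `ask` are hypothetical, `"default"` is simply the model name used later in this notebook, and not every provider honours `n` greater than 1.

```python
def build_messages(user_prompt, system_prompt=None, history=None):
    """Assemble `messages`: optional system prompt, earlier turns, then the new prompt."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # `history` holds earlier turns as (role, content) pairs, oldest first.
    for role, content in history or []:
        messages.append({"role": role, "content": content})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def ask(client, messages, model="default", temperature=0.0, n=1):
    """Send the conversation and return the text of every generated response."""
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=temperature, n=n
    )
    # `choices` holds one entry per requested response.
    return [choice.message.content for choice in response.choices]
```

To continue a conversation, pass the earlier prompts and replies as `history`, so that the model sees the whole exchange each time; the API itself does not remember previous calls.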

In [None]:
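from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

# Illustrative prompt; "default" is the model name used later in this notebook.
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain what an API is in one sentence."}],
)
print(response.choices[0].message.content)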

## D. Example: Scraping a Webpage and Extracting Information with an LLM

We first scrape the webpage with Selenium + Beautiful Soup,
but unlike before, we do not manually find the elements we need.
Instead, we export a stripped version of the webpage's source code,
then use an LLM to extract the information we need.

Because LLMs can only process a limited amount of text input, called the
*context length*, and slow down as the input grows,
it is usually wise to use Beautiful Soup to locate
what you need as precisely as possible before passing the source code
to the LLM.
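
To get a feel for the budget, you can estimate the size of a page before sending it. The sketch below uses the rough rule of thumb of about four characters per token for English text. That ratio is an assumption (the true count depends on the model's tokenizer), and both function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; assumes about `chars_per_token` characters per token."""
    # Ceiling division, so that any non-empty text counts as at least one token.
    return -(-len(text) // chars_per_token)


def truncate_to_budget(text, max_tokens, chars_per_token=4):
    """Cut `text` so that its estimated token count fits within `max_tokens`."""
    max_chars = max_tokens * chars_per_token
    return text if len(text) <= max_chars else text[:max_chars]
```

Blind truncation can cut a table in half, so it is best kept as a safety net; locating the relevant element first remains the better way to shrink the input.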

### D1. Fetch the Page Source in Full

Scrape the webpage in full:

In [None]:
# Only works with static content.
# See A6-Data-Scrping-Completed for a version that works with dynamic content.
import requests
def get_page_source(url):
    page = requests.get(url)
    return page.text

Now we feed the source code to the LLM:

In [None]:
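from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()
page_source = get_page_source('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')
prompt = "The following webpage contains horse racing results. Please extract the result table and return it in comma-delimited CSV format. No explanation is needed.\n" + page_source
response = client.chat.completions.create(model="default", messages=[{"role": "user", "content": prompt}])
print(response.choices[0].message.content)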

### D2. Strip the Page Source

You can strip white space by calling `get_text(strip=True)` on the parsed page, as
`get_stripped_page_source()` below does. This also removes the HTML tags.
However, it appears to drastically lower the quality of the LLM's output.
The likely reason is that during training, the LLM has learnt to
use the leading and trailing white space in understanding the content.
Removing the white space therefore makes the content harder to understand,
just as it would for a human reader.
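
If the full source is too long but aggressive stripping hurts quality, one possible middle ground is to collapse only the redundant white space while keeping line breaks. This is a sketch of that idea, not something tested in this notebook: `compact_whitespace` is a hypothetical helper, and whether it helps will depend on the page and the model.

```python
import re


def compact_whitespace(text):
    """Collapse runs of spaces and tabs and drop blank lines, keeping one item per line."""
    lines = []
    for line in text.splitlines():
        # Replace each run of spaces/tabs with a single space, then trim the ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)
```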

In [None]:
# Page source with white space stripped
import requests
from bs4 import BeautifulSoup

def get_stripped_page_source(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup.get_text(strip=True)   

Let us take a look at the stripped source code:

In [None]:
page_source = get_stripped_page_source('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')
page_source[0:500]

Now let us provide the page source to the LLM:

In [None]:
from openai import OpenAI
from dotenv import load_dotenv
import time

page_source = get_stripped_page_source('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')

load_dotenv()
client = OpenAI()

start = time.time()
response = client.chat.completions.create(
  model="default",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"""The following webpage contains horse racing results. 
                                Please extract the result table and return it in comma-delimited
                                CSV format. No explanation is needed.
                                {page_source}
                                """},
  ],
)
output = response.choices[0].message.content
print("Processing took: {:.2f}".format(time.time() - start))
print(output)

### D3. Target Specific Elements

To speed up the processing and make the model's output more accurate, you can use Beautiful Soup to locate the table, and provide only the source code of the table to the LLM:

In [None]:
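import requests
from bs4 import BeautifulSoup

def get_table_source(url, strip=False):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # Assumption: the result table is the first <table> on the page; adjust the selector if not.
    table = soup.find('table')
    if table is None:
        return ""
    return table.get_text(strip=True) if strip else str(table)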

Let us try again:

In [None]:
page_source = get_table_source('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST',
                              strip=True)
start = time.time()
response = client.chat.completions.create(
  model="default",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"""The following webpage contains horse racing results. 
                                Please extract the result table and return it in comma-delimited
                                CSV format. No explanation is needed.
                                {page_source}
                                """},
  ],
)
output = response.choices[0].message.content
print("Processing took: {:.2f}".format(time.time() - start))
print(output)

## E. Outputting to File

Because the LLM has already done the formatting, we can write
the output directly to a file:

In [None]:
with open("result.csv", "w") as f:
  f.write(output)
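
Models sometimes wrap their answer in a Markdown code fence or return rows of uneven length, even when asked for plain CSV, so it can be worth tidying and checking the reply when saving it. The following sketch uses only the standard library; the function names are hypothetical and the checks are deliberately minimal.

```python
import csv
import io


def clean_csv_output(output):
    """Remove a surrounding Markdown code fence, if any, from the model's reply."""
    lines = output.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)


def looks_like_table(csv_text):
    """True if the text parses as CSV with at least two rows of equal width."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    return len(rows) >= 2 and len({len(row) for row in rows}) == 1
```

A reply that fails `looks_like_table()` is a good candidate for a retry or a manual look.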

## F. Scraping Multiple Pages

We can put everything together in one function
and use it in a loop to save the information we need:

In [None]:
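# Function to scrape results (uses get_table_source() from D3 and the existing `client`)
def scrape_webpage(url, prompt):
    page_source = get_table_source(url)
    response = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"{prompt}\n{page_source}"},
        ],
    )
    # Treat a missing reply as an empty string so the caller's len() check works.
    return (response.choices[0].message.content or "").strip()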

In [None]:
# The first part of the URL of data source
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"
prompt = """The following webpage may contain horse racing results in a table. 
            If it does, please extract the result table and return it in 
            comma-delimited CSV format. If not, just return an empty string.
            No explanation is needed.
            """

# Copy the loop from above and incorporate the csv-saving code
for year in range(2017,2018):
    for month in range(1,2):
        for day in range(1,15):
            
            # Convert month and day to 2-digit representation
            month_2d = '{:02d}'.format(month)
            day_2d = '{:02d}'.format(day)
            
            # Full URL of data source
            url = url_front + str(year) + month_2d + day_2d
            
            # Print the URL so we know the progress so far
            print("Trying:",url)
            
            # Call our function to fetch and process data given the URL
            output = scrape_webpage(url, prompt)
            
            # Only save if there is something in content
            if len(output) > 0:
                filepath = str(year)+month_2d+day_2d+".csv"
                
                # Save to file
                with open(filepath, "w") as f:
                    f.write(output)
                    print(f"{filepath} saved.")
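
The nested `range()` loops above work for a fortnight in January, but over longer periods they would also try dates that do not exist, such as 31 February. One alternative, sketched here with the standard library's `datetime` module, is to step through real calendar days. The generator name `meeting_urls` is hypothetical.

```python
from datetime import date, timedelta


def meeting_urls(url_front, start, end):
    """Yield (yyyymmdd, url) for every calendar day from `start` to `end`, inclusive."""
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        yield stamp, url_front + stamp
        day += timedelta(days=1)


# Usage with the names from the loop above:
# for stamp, url in meeting_urls(url_front, date(2017, 1, 1), date(2017, 1, 14)):
#     output = scrape_webpage(url, prompt)
```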