# Accessing a Generative AI Model through the OpenAI API

Generative AI models are machine learning models that can generate content.
This notebook focuses on the most commonly used type of generative AI model: the
ones that generate text. These models are called *large language models*, or 'LLMs' for short.
We will not go into the technical details of how an LLM works at this point; instead, we focus
on how to use an LLM as an end user.

### Accessing an LLM through Python
Although it is convenient to access an LLM through a browser interface for one-off tasks,
for repetitive tasks you will generally want to access the model from a program. Model providers
generally offer such access through an *application programming interface*, or 'API' for short.
An API defines what interactions are possible with the model, and what the input and output data
should look like.

The API used by OpenAI, the provider of ChatGPT, is a de facto standard that many other model providers also support.

The first step is to set up an OpenAI client. The client requires two pieces of information:
1. `base_url`: the web address of the model provider. If you use OpenAI's models, this can be omitted.
2. `api_key`: a string of text unique to your user account with the model provider.

In [None]:
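from openai import OpenAI

# Placeholder values: substitute your provider's address and your own key.
client = OpenAI(base_url="https://api.openai.com/v1", api_key="YOUR_API_KEY")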

In a production environment, it is best to store the API key as an environment variable, so that it is not saved in your notebook. The easiest way to do so when working in a Jupyter notebook is to put the key in a file named `.env`, then use the `python-dotenv` library (imported as `dotenv`) to set the environment variable.

In [None]:
from openai import OpenAI
from dotenv import load_dotenv
import os

# Load the .env file containing environment variables.
# Default is to look for it in the current working directory.
load_dotenv()

# The API key should be saved as an environment variable OPENAI_API_KEY
# Optionally set the base_url as OPENAI_BASE_URL in the same file
# With no arguments, the client reads both settings from the environment.
client = OpenAI()

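A missing or misnamed variable in `.env` usually shows up only later, as an authentication error. The sketch below is one way to check the settings up front. `read_api_settings` is a hypothetical helper (not part of the `openai` library), and the key `sk-example` is a made-up stand-in so the demonstration runs without a real account.

```python
import os


def read_api_settings(env=None):
    """Collect the client settings from environment variables.

    Hypothetical helper for illustration; not part of the openai library.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
    # OPENAI_BASE_URL is optional; None means "use the provider default".
    return {"api_key": api_key, "base_url": env.get("OPENAI_BASE_URL")}


# Demonstration with a stand-in mapping instead of the real environment.
settings = read_api_settings({"OPENAI_API_KEY": "sk-example"})
print(settings)
```

After `load_dotenv()` has run, the result could be passed on as `OpenAI(**read_api_settings())`.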

### Model List

To see which models the provider supports, call `client.models.list()`.

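As a minimal sketch, the model identifiers can be collected into a sorted list. `list_model_ids` is a hypothetical helper, and `fake_client` is a stand-in with made-up model names so the example runs without network access; with a real client you would pass the `client` created earlier.

```python
from types import SimpleNamespace


def list_model_ids(client):
    """Return the sorted model identifiers reported by the provider."""
    # client.models.list() yields model objects, each with an `id` attribute.
    return sorted(model.id for model in client.models.list())


# Stand-in client so the sketch runs without network access or an API key.
fake_client = SimpleNamespace(
    models=SimpleNamespace(
        list=lambda: [SimpleNamespace(id="model-b"), SimpleNamespace(id="model-a")]
    )
)
print(list_model_ids(fake_client))
```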
### Calling the Model

To call a model, use 
```python
response = client.chat.completions.create(...)
```
You will need to provide a few pieces of information:
- `model`: the name of the model.
- `messages`: a list containing the conversation history. Each message should be a dictionary with a `role` and a `content`. `role` can be one of:
    - `system`: system prompt. Use this to give the model general guidelines that it should follow strictly.
    - `user`: prompt entered by the user. 
    - `assistant`: response from the model. 
    - `tool`: response from tools the model can use.
- Additional settings such as `temperature` and `n` (the number of samples).

The text of the response is stored in
```python
response.choices[0].message.content
```
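The pieces above can be combined into a small helper, sketched here under some assumptions: `ask` is a hypothetical function name, the model is assumed to accept a `temperature` setting, and `fake_client` is a stand-in that simply echoes the prompt so the example runs offline. With a real client, pass the `client` object and a model name your provider supports.

```python
from types import SimpleNamespace


def ask(client, model, user_prompt,
        system_prompt="You are a helpful assistant.", temperature=0):
    """Send one system + user exchange and return the text of the first choice."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=temperature,
    )
    return response.choices[0].message.content


# Stand-in for the real API call: echoes the last user message.
def fake_create(model, messages, temperature):
    reply = SimpleNamespace(content="echo: " + messages[-1]["content"])
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=fake_create))
)
print(ask(fake_client, "default", "Hello"))
```

To continue a conversation, append the model's reply as an `assistant` message and the next prompt as a `user` message before calling the API again.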

In [None]:
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()

client = OpenAI()



### Example: Scraping a Webpage and Extracting Information with an LLM

We first scrape the webpage with Selenium + Beautiful Soup,
but unlike before, we do not manually find the elements we need.
Instead, we export a stripped version of the webpage's source code,
then use an LLM to extract the information we need.
Using a stripped version of the source code reduces the number of
tokens fed to the LLM, which helps keep the input within the LLM's context length
and can also improve accuracy.

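To get a feel for whether a page is likely to fit, a rough size check can help. The sketch below assumes about four characters per token, a common rule of thumb for English text that is only approximate for HTML; the exact count depends on the model's tokenizer. The function names, the 1,000-token reserve for the reply, and the 8,192-token context length are all illustrative assumptions.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real counts depend on the model's tokenizer."""
    return len(text) // chars_per_token + 1


def fits_context(text, context_length, reserve=1000):
    """Check whether text plausibly fits, leaving `reserve` tokens for the reply."""
    return estimate_tokens(text) + reserve <= context_length


sample = "<td>1</td>" * 100  # 1,000 characters of markup
print(estimate_tokens(sample))
print(fits_context(sample, context_length=8192))
```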
In [None]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


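def get_stripped_page_source(url):
    # Sketch of one possible implementation. Assumes a global `driver` has been
    # created (see the next cell) and that the content we want sits in a <table>.
    driver.get(url)
    # Wait up to 10 seconds for a table to appear; raises TimeoutException otherwise
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "table")))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # Drop elements that carry no visible content (the choice of tags is an assumption)
    for tag in soup(["script", "style", "noscript", "svg"]):
        tag.decompose()
    return str(soup)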

Scrape the webpage:

In [None]:
# Set up selenium to use Firefox
options = Options()
options.add_argument('-headless') #No need to open a browser window

driver = webdriver.Firefox(options=options)
page_source = get_stripped_page_source('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')
driver.quit()


Let us take a look at the stripped source code:

In [None]:
print(page_source)

Now we feed the source code to the LLM:

In [None]:
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()



If the output is unreliable, you can use Beautiful Soup to locate the table, and provide only the source code of the table to the LLM.

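One way to narrow the input is sketched below, under assumptions. `extract_largest_table` is a hypothetical helper, and choosing the table with the most rows is a heuristic rather than something the page guarantees; a more specific selector may be needed for a real page. The HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_largest_table(page_source):
    """Return the HTML of the table with the most rows, or None if there is none."""
    soup = BeautifulSoup(page_source, "html.parser")
    tables = soup.find_all("table")
    if not tables:
        return None
    # Heuristic (an assumption): the results table is the one with the most rows.
    largest = max(tables, key=lambda table: len(table.find_all("tr")))
    return str(largest)


html = """
<html><body>
<table><tr><td>menu</td></tr></table>
<table id="results">
  <tr><th>Place</th><th>Horse</th></tr>
  <tr><td>1</td><td>Example Horse</td></tr>
</table>
</body></html>
"""
table_html = extract_largest_table(html)
print(table_html)
```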
Let us try again:

In [None]:
# Get source code of the table
driver = webdriver.Firefox(options=options)
page_source = get_stripped_page_source('http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/20151213/ST')
driver.quit()

# Feed LLM the table
response = client.chat.completions.create(
  model="default",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"""The following webpage contains horse racing results. 
                                Please extract the result table and return it in comma-delimited
                                CSV format. No explanation is needed.
                                {page_source}
                                """},
  ],
)
output = response.choices[0].message.content
print(output)

Finally, we write the output to a file:

In [None]:
with open("result.csv", "a") as f:
  f.write(output)

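Even when asked for no explanation, a model may wrap its answer in a Markdown code fence. The sketch below shows one way to strip such a fence and confirm that the result parses as CSV before saving it. `clean_csv_output` is a hypothetical helper and the sample reply is made up; replies with extra commentary would need more handling than this.

```python
import csv
import io

FENCE = "`" * 3  # the three-backtick Markdown fence marker


def clean_csv_output(text):
    """Strip a surrounding Markdown code fence, if any, and parse the CSV rows."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].startswith(FENCE):
        lines = lines[:-1]
    cleaned = "\n".join(lines)
    rows = list(csv.reader(io.StringIO(cleaned)))
    return cleaned, rows


# Made-up model reply wrapped in a fence.
sample = FENCE + "csv\nPlace,Horse\n1,Example Horse\n" + FENCE
cleaned, rows = clean_csv_output(sample)
print(rows)
```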
We can put everything together in one function
and use it in a loop to save the information we need:

In [None]:
# Function to scrape results

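def scrape_webpage(url, prompt):
    # Sketch: assumes `client`, `driver` and get_stripped_page_source() from the cells above.
    try:
        page_source = get_stripped_page_source(url)
        messages = [{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt + "\n" + page_source}]
        response = client.chat.completions.create(model="default", messages=messages)
        return response.choices[0].message.content
    except Exception as e:
        print("Skipping", url, "-", e)  # Return None so the caller can skip this date
        return None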

In [None]:
# The first part of the URL of data source
url_front = "http://racing.hkjc.com/racing/Info/Meeting/Results/english/Local/"
prompt = """The following webpage contains horse racing results. 
            Please extract the result table and return it in comma-delimited
            CSV format. No explanation is needed.
            """

driver = webdriver.Firefox(options=options)

try:
    # Copy the loop from above and incorporate the csv-saving code
    for year in range(2017,2018):
        for month in range(1,2):
            for day in range(1,32):
                
                # Convert month and day to 2-digit representation
                month_2d = '{:02d}'.format(month)
                day_2d = '{:02d}'.format(day)
                
                # Full URL of data source
                url = url_front + str(year) + month_2d + day_2d
                
                # Print the URL so we know the progress so far
                print("Trying:",url)
                
                # Call our function to fetch and process data given the URL
                content = scrape_webpage(url, prompt)
                
                # Only save if there is something in content
                if content is not None:
                    filepath = str(year)+month_2d+day_2d+".csv"
                    
                    # Save to file
                    with open(filepath, "a") as f:
                        f.write(content)

except Exception as e:
    print(e)    
                    
driver.quit()
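
The nested loops above also produce dates that do not exist (for example the 31st of a 30-day month) once the month range is widened. A possible alternative, sketched here with the standard `datetime` module, is to step through real calendar dates. `date_strings` is a hypothetical helper name.

```python
from datetime import date, timedelta


def date_strings(start, end):
    """Yield every calendar date from start to end (inclusive) as YYYYMMDD."""
    current = start
    while current <= end:
        yield current.strftime("%Y%m%d")
        current += timedelta(days=1)


days = list(date_strings(date(2017, 1, 1), date(2017, 1, 31)))
print(days[0], days[-1], len(days))
```

Each string can then be appended to `url_front` in place of `str(year) + month_2d + day_2d`.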