#### Saturday, December 14, 2024

This notebook was not part of the original authors' repository; it was created from the sample code in Chapter 3.

I'm really kinda surprised the authors' repository didn't include a notebook!

In [1]:
from openai import OpenAI
import os

# Set your OpenAI key as an environment variable
# https://platform.openai.com/api-keys
# client = OpenAI(
#   api_key=os.environ['OPENAI_API_KEY'],  # Default
# )

# Point to the local LM Studio server; the last assignment to `lmstudio` wins.
# lmstudio = "http://localhost:1234/v1"
lmstudio = "http://192.168.2.16:1234/v1"

client = OpenAI(base_url=lmstudio, api_key="lm-studio")

model = "qwen2.5-14b-instruct" # lmstudio-community/Qwen2.5-14B-Instruct-GGUF : Qwen2.5-14B-Instruct-Q8_0.gguf

# model = "hermes-3-llama-3.2-3b"


In [2]:
def get_response(prompt):
    response = client.chat.completions.create(
        # model="gpt-3.5-turbo",
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
    )
    return response.choices[0].message.content

### Generating Lists

In [3]:
prompt = """Generate a list of Disney characters."""

In [4]:
# Get a response from the model
response = get_response(prompt)
print(response)

# 18.5s
# 17.8s

Here is a diverse list of popular Disney characters from various movies and shows:

1. Mickey Mouse (Steamboat Willie)
2. Minnie Mouse (Steamboat Willie)
3. Donald Duck (The Wise Little Hen)
4. Daisy Duck (Don Donald)
5. Goofy (Mickey's Revue)
6. Pluto (Pluto's Judgement Day)
7. Scrooge McDuck (Christmas on Bear Mountain, Christmas at Willow Bay)
8. Jiminy Cricket (Pinocchio)
9. Dopey (Snow White and the Seven Dwarfs)
10. Snow White (Snow White and the Seven Dwarfs)
11. Cinderella (Cinderella)
12. Aurora / Sleeping Beauty (Sleeping Beauty)
13. Belle (Beauty and the Beast)
14. Jasmine (Aladdin)
15. Pocahontas (Pocahontas)
16. Mulan (Mulan)
17. Tiana (The Princess and the Frog)
18. Rapunzel (Tangled)
19. Merida (Brave)
20. Moana (Moana)

Animated Movie Characters:
1. Simba, Nala, Pumbaa, Timon, Rafiki (The Lion King)
2. Stitch (Lilo & Stitch)
3. Buzz Lightyear and Woody (Toy Story series)
4. Sully and Mike Wazowski (Monsters, Inc.)
5. Baloo, Mowgli (The Jungle Book)
6. Nemo and Marlin (F

The response is unpredictable because the prompt did not say what kind of list we wanted: how many items, in what format, or whether to include the film. Let's try again, this time being specific about what we want.

In [5]:
prompt = """Generate a bullet point list of 5 male Disney characters. 
Only include the name of the character for each line. 
Never include the film for each Disney character.

Below is an example list:
* Aladdin
* Simba
* Beast
* Hercules
* Tarzan"""

In [None]:
# Get a response from the model
response = get_response(prompt)
print(response)

# 0.9s
# 0.7s

* Peter Pan
* Genie
* Woody
* Mowgli
* Stitch


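It's interesting to observe just how much faster this prompt runs: 0.7 to 0.9 s, versus roughly 18 s for the open-ended prompt, most likely because the response is so much shorter.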
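Because the response now follows a predictable one-item-per-line format, it can be turned into a Python list with plain string handling. The sketch below is one way to do that; `parse_bullet_list` is a hypothetical helper name (not from the chapter), and the sample text is copied from the output above rather than fetched from the model.

```python
def parse_bullet_list(text):
    """Return the items of a '* item' or '- item' bullet list as a list of strings."""
    items = []
    for line in text.splitlines():
        stripped = line.strip()
        # Keep only lines that start with a bullet marker followed by a space
        if stripped.startswith(("* ", "- ")):
            items.append(stripped[2:].strip())
    return items


# Sample text copied from the model output above
sample_response = """* Peter Pan
* Genie
* Woody
* Mowgli
* Stitch"""

characters = parse_bullet_list(sample_response)
print(characters)
```

Lines that do not start with a bullet marker are skipped, so a stray introductory sentence from the model would not end up in the list.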

### Hierarchical List Generation

Hierarchical lists are useful when your desired output is nested. A detailed article outline is a good example.

In [7]:
prompt = """Generate a hierarchical and incredibly detailed article outline on "what are the benefits of data engineering".

See an example of the hierarchical structure below:

Article Title: What are the benefits of digital marketing?

* Introduction
    a. Explanation of digital marketing
    b. Importance of digital marketing in today’s business world

* Increased Brand Awareness
    a. Definition of brand awareness
    b. How digital marketing helps in increasing brand awareness"""

In [None]:
# Get a response from the model
response = get_response(prompt)
print(response)

# 24.2s
# 34.9s

### Article Title: What Are the Benefits of Data Engineering?

#### I. Introduction
   - A. Explanation of data engineering
       1. Definition and scope of data engineering
       2. Evolution of data engineering
   - B. Importance of data engineering in today’s business world
       1. Role in modern business operations
       2. Impact on decision-making processes

#### II. Enhanced Data Management Capabilities
   - A. Efficient Data Collection and Integration
       1. Automation of data ingestion from various sources
       2. Standardization and normalization of diverse datasets
   - B. Scalability and Flexibility
       1. Adapting to growing data volumes and changing business needs
       2. Elastic computing resources for optimal performance

#### III. Improved Data Quality and Reliability
   - A. Data Cleaning and Validation Techniques
       1. Identifying and removing inconsistencies or inaccuracies
       2. Implementing robust validation rules
   - B. Ensuring Consistenc

So you’ve successfully produced a hierarchical article outline, but how could you parse the string into structured data?

Example 3.1 Parsing a hierarchical list:

In [9]:
import re

# openai_result = generate_article_outline(prompt)

openai_result = """
* Introduction
    a. Explanation of data engineering
    b. Importance of data engineering in today’s data-driven world
* Efficient Data Management
    a. Definition of data management
    b. How data engineering helps in efficient data management.
* Conclusion
    a. Importance of Data Engineering in the modern business world
    b. Future of Data Engineering and its impact on the data ecosystem
"""

# Regular expression patterns
heading_pattern = r"\* (.+)"
subheading_pattern = r"\s+[a-z]\. (.+)"

# Extract headings and subheadings
headings = re.findall(heading_pattern, openai_result)
subheadings = re.findall(subheading_pattern, openai_result)

# Print results
print("Headings:\n")
for heading in headings:
    print(f"* {heading}")

print("\nSubheadings:\n")
for subheading in subheadings:
    print(f"* {subheading}")

Headings:

* Introduction
* Efficient Data Management
* Conclusion

Subheadings:

* Explanation of data engineering
* Importance of data engineering in today’s data-driven world
* Definition of data management
* How data engineering helps in efficient data management.
* Importance of Data Engineering in the modern business world
* Future of Data Engineering and its impact on the data ecosystem


Example 3.2 Parsing a hierarchical list into a Python dictionary:

In [12]:
import re

openai_result = """
* Introduction
  a. Explanation of data engineering
  b. Importance of data engineering in today’s data-driven world
* Efficient Data Management
    a. Definition of data management
    b. How data engineering helps in efficient data management.
    c. Why data engineering is important for data management
* Conclusion
    a. Importance of Data Engineering in the modern business world
    b. Future of Data Engineering and its impact on the data ecosystem
"""

section_regex = re.compile(r"\* (.+)")
subsection_regex = re.compile(r"\s*([a-z]\..+)")

result_dict = {}
current_section = None

for line in openai_result.split("\n"):
    section_match = section_regex.match(line)
    subsection_match = subsection_regex.match(line)

    if section_match:
        current_section = section_match.group(1)
        result_dict[current_section] = []
    elif subsection_match and current_section is not None:
        result_dict[current_section].append(subsection_match.group(1))

# print(result_dict)
result_dict

{'Introduction': ['a. Explanation of data engineering',
  'b. Importance of data engineering in today’s data-driven world'],
 'Efficient Data Management': ['a. Definition of data management',
  'b. How data engineering helps in efficient data management.',
  'c. Why data engineering is important for data management'],
 'Conclusion': ['a. Importance of Data Engineering in the modern business world',
  'b. Future of Data Engineering and its impact on the data ecosystem']}
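The patterns in Example 3.2 only work when the model follows the `* ` / `a.` layout exactly. As a rough illustration of how fragile that is, the sketch below wraps the same logic in a hypothetical `parse_outline` function and runs it on the first few lines of the outline the local model actually returned earlier in this notebook, which uses `#### I.` headings and `- A.` subheadings instead.

```python
import re

# Same patterns as Example 3.2
section_regex = re.compile(r"\* (.+)")
subsection_regex = re.compile(r"\s*([a-z]\..+)")


def parse_outline(text):
    """Parse an outline into {section: [subsections]} using the Example 3.2 logic."""
    result = {}
    current_section = None
    for line in text.split("\n"):
        section_match = section_regex.match(line)
        subsection_match = subsection_regex.match(line)
        if section_match:
            current_section = section_match.group(1)
            result[current_section] = []
        elif subsection_match and current_section is not None:
            result[current_section].append(subsection_match.group(1))
    return result


# First lines of the outline the local model returned earlier in this notebook
model_style_outline = """#### I. Introduction
   - A. Explanation of data engineering
       1. Definition and scope of data engineering
   - B. Importance of data engineering in today’s business world
"""

parsed = parse_outline(model_style_outline)
print(parsed)
if not parsed:
    print("No sections matched; the output format differs from what the regex expects.")
```

No line starts with `* `, so nothing is captured and the result is an empty dictionary, with no error to signal that anything went wrong.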

### When to Avoid Using Regular Expressions

#### Generating JSON

In [13]:
prompt = """Compose a very detailed article outline on "The benefits of learning code" with a JSON payload structure that highlights key points.

Only return valid JSON.

Here is an example of the JSON structure:
```{
    "Introduction": [
        "a. Explanation of data engineering",
        "b. Importance of data engineering in today’s data-driven world"],
    "Efficient Data Management": [
        "a. Definition of data management",
        "b. How data engineering helps in efficient data management."],
    "Conclusion": [
        "a. Importance of Data Engineering in the modern business world",
        "b. Future of Data Engineering and its impact on the data ecosystem"]
} """

In [14]:
# Get a response from the model
response = get_response(prompt)
print(response)

# 17.1s

```json
{
    "Introduction": [
        "a. Explanation of what coding is and why it's important",
        "b. Overview of the increasing demand for coders in various industries"
    ],
    "Enhanced Problem-Solving Skills": [
        "a. Development of logical thinking through programming challenges",
        "b. Application of problem-solving skills to real-world scenarios beyond coding",
        "c. Improved analytical capabilities and decision-making processes"
    ],
    "Career Opportunities and Advancement": [
        "a. Broad range of career options in software development, data science, web design, etc.",
        "b. High demand for skilled coders leading to better job prospects and higher salaries",
        "c. Continuous learning and adaptation necessary in the rapidly evolving tech industry"
    ],
    "Personal Creativity and Innovation": [
        "a. Freedom to create unique software applications or improve existing ones",
        "b. Collaboration with other developers

Now let's examine how we can parse JSON output with Python:

In [16]:
import json

# openai_json_result = generate_article_outline(prompt)

openai_json_result = """
{
    "Introduction": [
        "a. Overview of coding and programming languages",
        "b. Importance of coding in today's technology-driven world"],
    "Conclusion": [
        "a. Recap of the benefits of learning code",
        "b. The ongoing importance of coding skills in the modern world"]
}
"""

parsed_json_payload = json.loads(openai_json_result)
# print(parsed_json_payload)
parsed_json_payload

{'Introduction': ['a. Overview of coding and programming languages',
  "b. Importance of coding in today's technology-driven world"],
 'Conclusion': ['a. Recap of the benefits of learning code',
  'b. The ongoing importance of coding skills in the modern world']}
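The hard-coded string above is clean JSON, but the local model's actual response was wrapped in a Markdown code fence and cut off partway through, and `json.loads` would raise on either. The following is a minimal sketch of one way to cope, assuming the fence is the only wrapper around the JSON; `extract_json` is a hypothetical helper name, and the sample strings are shortened stand-ins for the real response.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker


def extract_json(text):
    """Strip an optional Markdown code fence, then parse the remainder as JSON.

    Returns the parsed object, or None if the text is not valid JSON
    (for example, because the response was cut off).
    """
    cleaned = text.strip()
    # Remove an opening fence (with optional language tag) and a closing fence
    cleaned = re.sub(r"^`{3}[a-zA-Z]*\s*", "", cleaned)
    cleaned = re.sub(r"\s*`{3}$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


# Shortened stand-ins for a fenced response and a truncated one
fenced_response = FENCE + 'json\n{"Introduction": ["a. Explanation of what coding is"]}\n' + FENCE
truncated_response = FENCE + 'json\n{"Introduction": ["a. Explanation of what'

print(extract_json(fenced_response))
print(extract_json(truncated_response))
```

Returning `None` for unparseable text is just one design choice; retrying the request or re-raising the error would be equally reasonable depending on the caller.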

#### YAML

In [17]:
prompt = """Return valid `.yml` for 5 apple slices, and 2 dozen eggs.

- Below you'll find the current yaml schema.
- Please only use the schema below and if you are unsure, then return `Unsure`.

```
- item: Apple Slices
  quantity: 5
  unit: pieces
- item: Milk
  quantity: 1
  unit: gallon
- item: Bread
  quantity: 2
  unit: loaves
- item: Eggs
  quantity: 1
  unit: dozen
```"""

In [18]:
# Get a response from the model
response = get_response(prompt)
print(response)

# 2.4s

Based on the provided schema, here is the valid `.yml` for 5 apple slices and 2 dozen eggs:

```
- item: Apple Slices
  quantity: 5
  unit: pieces
- item: Eggs
  quantity: 2
  unit: dozen
```
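The YAML in this response is surrounded by an introductory sentence and a code fence, so it cannot be handed to a parser as-is. The sketch below shows one way to pull out the fenced block first. Both function names are hypothetical, and `parse_flat_items` is only a standard-library stand-in that handles the flat schema used here; in practice a real YAML library (for example PyYAML's `yaml.safe_load`) would be the better choice.

```python
import re

FENCE = "`" * 3  # a Markdown code fence marker

# Response text copied from the output above
sample_response = (
    "Based on the provided schema, here is the valid `.yml` "
    "for 5 apple slices and 2 dozen eggs:\n"
    "\n"
    f"{FENCE}\n"
    "- item: Apple Slices\n"
    "  quantity: 5\n"
    "  unit: pieces\n"
    "- item: Eggs\n"
    "  quantity: 2\n"
    "  unit: dozen\n"
    f"{FENCE}\n"
)


def extract_fenced_block(text):
    """Return the contents of the first fenced block, or the stripped text if none."""
    match = re.search(r"`{3}[a-zA-Z]*\n(.*?)`{3}", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


def parse_flat_items(yaml_text):
    """Parse a flat list of 'key: value' mappings (this notebook's schema only)."""
    items = []
    for line in yaml_text.splitlines():
        if line.startswith("- "):
            items.append({})  # a leading "- " starts a new item
            line = line[2:]
        if not items or ":" not in line:
            continue
        key, _, value = line.strip().partition(":")
        value = value.strip()
        # Convert whole-number values such as quantities to int
        items[-1][key.strip()] = int(value) if value.isdigit() else value
    return items


yaml_text = extract_fenced_block(sample_response)
items = parse_flat_items(yaml_text)
print(items)
```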
