<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/assignments/assignment_yourname_t81_559_class3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative AI
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/index.html)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 3 Assignment: LLM Text Classification**

**Student Name: Your Name**

# Assignment Instructions

A [file](https://data.heatonresearch.com/data/t81-559/assignments/jobs.csv) is provided that contains 25 short biographies. Sample lines from this file include:

|id	|bio|
|---|---|
|1	|Dr. Emily Carter is a dedicated healthcare professional ...|
|2	|Born in a small town in Texas, she developed a fascination ...|
|3	|Alex is a passionate technology enthusiast with a knack ...|
|4	|Born and raised in a small town, she developed a fascination ...|
|5	|Dr. Emily Carter is a dedicated healthcare professional with over... |
|...|...|

For each of these, classify into the categories of:

* doctor
* lawyer
* teacher
* software engineer
* astronaut

Your output should look like this:

|id|job|
|---|---|
|1	|doctor ...|
|2	|lawyer ...|
|3	|lawyer ...|
|4	|doctor ...|
|5	|lawyer ... |
|...|...|

Use a large language model (LLM) to classify each biography into exactly one of these job categories.



# Google CoLab Instructions

If you are using Google CoLab, it will be necessary to mount your GDrive so that you can send your notebook during the submit process. Running the following code will map your GDrive to ```/content/drive```.

In [9]:
try:
  # These imports only succeed inside Google CoLab; userdata is needed below for the API key
  from google.colab import drive, userdata
  drive.mount('/content/drive', force_remount=True)
  COLAB = True
  print("Note: using Google CoLab")
except ImportError:
  print("Note: not using Google CoLab")
  COLAB = False

# OpenAI Secrets
import os
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai

Note: using Google CoLab
Collecting langchain_openai
  Downloading langchain_openai-0.3.7-py3-none-any.whl.metadata (2.3 kB)
Collecting langchain-core<1.0.0,>=0.3.35 (from langchain)
  Downloading langchain_core-0.3.40-py3-none-any.whl.metadata (5.9 kB)
Collecting tiktoken<1,>=0.7 (from langchain_openai)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading langchain_openai-0.3.7-py3-none-any.whl (55 kB)
Downloading langchain_core-0.3.40-py3-none-any.whl (414 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

# Assignment Submit Function

You will submit the 10 programming assignments electronically.  The following submit function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any basic problems.

**It is unlikely that you will need to modify this function.**

In [None]:
import base64
import os
import numpy as np
import pandas as pd
import requests
import PIL
import PIL.Image
import io
from typing import List, Union

# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - List of pandas dataframes or images.
# key - Your student key that was emailed to you.
# course - The course that you are in, currently t81-558 or t81-559.
# no - The assignment class number, should be 1 through 10.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.

def submit(
    data: List[Union[pd.DataFrame, PIL.Image.Image]],
    key: str,
    course: str,
    no: int,
    source_file: str = None
) -> None:
    if source_file is None and '__file__' not in globals():
        raise Exception("Must specify a filename when in a Jupyter notebook.")
    if source_file is None:
        source_file = __file__

    suffix = f'_class{no}'
    if suffix not in source_file:
        raise Exception(f"{suffix} must be part of the filename.")

    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb', '.py']:
        raise Exception(f"Source file is {ext}; must be .py or .ipynb")

    with open(source_file, "rb") as file:
        encoded_python = base64.b64encode(file.read()).decode('ascii')

    payload = []
    for item in data:
        if isinstance(item, PIL.Image.Image):
            buffered = io.BytesIO()
            item.save(buffered, format="PNG")
            payload.append({'PNG': base64.b64encode(buffered.getvalue()).decode('ascii')})
        elif isinstance(item, pd.DataFrame):
            payload.append({'CSV': base64.b64encode(item.to_csv(index=False).encode('ascii')).decode("ascii")})
        else:
            raise ValueError(f"Unsupported data type: {type(item)}")

    response = requests.post(
        "https://api.heatonresearch.com/wu/submit",
        headers={'x-api-key': key},
        json={
            'payload': payload,
            'assignment': no,
            'course': course,
            'ext': ext,
            'py': encoded_python
        }
    )

    if response.status_code == 200:
        print(f"Success: {response.text}")
    else:
        print(f"Failure: {response.text}")
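
Before making a network call, it can help to confirm that your notebook path will pass the filename checks that `submit` performs. The following is a minimal sketch that mirrors those two checks locally; the helper name `precheck_source_file` and the example path are hypothetical and are not part of the course code.

```python
import os

def precheck_source_file(source_file: str, no: int) -> str:
    """Mirror the filename checks in submit() and return the file extension.

    Hypothetical helper for local testing only; it does not contact the server.
    """
    suffix = f"_class{no}"
    if suffix not in source_file:
        raise ValueError(f"{suffix} must be part of the filename.")

    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in (".ipynb", ".py"):
        raise ValueError(f"Source file is {ext}; must be .py or .ipynb")
    return ext

# Example path (an assumption; substitute the real location of your notebook)
example_path = "/content/drive/My Drive/Colab Notebooks/assignment_yourname_t81_559_class3.ipynb"
print(precheck_source_file(example_path, 3))
```

If the assignment number and the `_classN` part of the filename disagree, the helper raises the same kind of error that `submit` would report.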

# Assignment #3 Sample Code

The following code provides a starting point for this assignment.

In [19]:
import os
import pandas as pd
from scipy.stats import zscore
import string
from langchain.prompts import ChatPromptTemplate

# Begin assignment

bio_df = pd.read_csv("https://data.heatonresearch.com/data/t81-559/assignments/jobs.csv")
bio_df.head()

| |id|bio|
|---|---|---|
|0|1|Dr. Emily Carter is a dedicated healthcare pro...|
|1|2|Born in a small town in Texas, she developed a...|
|2|3|Alex is a passionate technology enthusiast wit...|
|3|4|Born and raised in a small town, she developed...|
|4|5|Dr. Emily Carter is a dedicated healthcare pro...|


In [11]:
from langchain_openai import ChatOpenAI

MODEL = 'gpt-4o'
TEMPERATURE = 0.0

# Initialize the OpenAI LLM with your API key
llm = ChatOpenAI(
    model=MODEL,
    temperature=TEMPERATURE,
    n=1
)

In [53]:
# Only PromptTemplate is needed for the chain built below
from langchain_core.prompts import PromptTemplate

career_prompt = PromptTemplate( input_variables = ['biography'], template = """
Classify the person in the following short biography as one of the following:
- doctor
- lawyer
- teacher
- software engineer
- astronaut
Below is their biography. Return only the category name that best describes their career and nothing else; do not explain your choice.
Here is the biography:

{biography}""")

chain_career = career_prompt | llm
career_list = []

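for bio_id, bio in zip(bio_df['id'], bio_df['bio']):
    # Pass the template variable explicitly; bio_id avoids shadowing the built-in id()
    classification = chain_career.invoke({'biography': bio}).content.strip()
    career_list.append({'id': bio_id, 'job': classification})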

career_df = pd.DataFrame(career_list)
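
The loop above stores whatever text the model returns, so a reply such as "Doctor." or "Job: doctor" would end up in the `job` column unchanged. One possible safeguard is to map each reply onto the five allowed categories and fail loudly on anything else. The sketch below assumes that approach; `ALLOWED_JOBS` and `normalize_job` are hypothetical names introduced for illustration.

```python
# The five categories listed in the assignment instructions
ALLOWED_JOBS = ("doctor", "lawyer", "teacher", "software engineer", "astronaut")

def normalize_job(raw: str) -> str:
    """Map a raw model reply onto one of ALLOWED_JOBS, or raise ValueError."""
    # Trim whitespace, then surrounding periods and quotes, then lowercase
    cleaned = raw.strip().strip(".\"'").lower()
    if cleaned in ALLOWED_JOBS:
        return cleaned

    # Fall back to a substring search for replies such as "Job: Doctor."
    matches = [job for job in ALLOWED_JOBS if job in cleaned]
    if len(matches) == 1:
        return matches[0]

    raise ValueError(f"Unexpected classification: {raw!r}")

print(normalize_job(" Software Engineer.\n"))
print(normalize_job("Job: Doctor."))
```

In the loop, the raw reply could be passed through `normalize_job` before it is appended to `career_list`, so that an unexpected answer stops the run instead of silently reaching the submission.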


In [55]:
career_df.set_index('id', inplace=True)

In [56]:
career_df

|id|job|
|---|---|
|1|doctor|
|2|astronaut|
|3|software engineer|
|4|astronaut|
|5|doctor|
|6|teacher|
|7|lawyer|
|8|lawyer|
|9|doctor|
|10|doctor|
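
One detail to watch: the `submit` function serializes each dataframe with `to_csv(index=False)`, and `career_df` now carries `id` as its index rather than as a column. The sketch below uses two made-up rows to show the effect and one way to restore the column with `reset_index()`. Whether the grader requires the `id` column is an assumption based on the expected output table in the instructions.

```python
import pandas as pd

# Two made-up rows standing in for the real classification results
demo_df = pd.DataFrame(
    [{"id": 1, "job": "doctor"}, {"id": 2, "job": "astronaut"}]
).set_index("id")

# With id as the index, index=False drops it from the CSV entirely
csv_without_id = demo_df.to_csv(index=False)

# reset_index() turns id back into an ordinary column before serializing
csv_with_id = demo_df.reset_index().to_csv(index=False)

print(csv_without_id.splitlines()[0])  # header row without id
print(csv_with_id.splitlines()[0])     # header row with id
```

Under that assumption, passing `career_df.reset_index()` rather than `career_df` in the `data` list would keep both columns in the submitted CSV.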
