# Lecture 27 – Data 100, Spring 2025

[Acknowledgments Page](https://ds100.org/sp25/acks/)

## Getting Set Up

You can run this notebook on the JupyterHub machines, but you will need to set up an OpenAI account. Alternatively, if you are running on your own computer, you can try running a model locally.

### Step 1. Create an OpenAI account

You can create a free account, which comes with some initial free credits, here:

https://platform.openai.com

You will then need to get an API key. Save that API key to a local file called `openai.key`:

In [None]:
# with open("openai.key", "w") as f:
#     f.write("YOUR KEY")
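Keeping the key in a file works, but it is easy to commit that file by accident. The sketch below shows one possible way to read the key with a fallback to an environment variable. The helper name `load_api_key` is hypothetical and not part of the lecture code; `OPENAI_API_KEY` is the environment variable the OpenAI client libraries conventionally read.

```python
import os
from pathlib import Path

def load_api_key(path="openai.key", env_var="OPENAI_API_KEY"):
    """Hypothetical helper: return the key stored in `path`, else the one in `env_var`."""
    key_file = Path(path)
    if key_file.exists():
        # strip() removes the trailing newline most editors add to the file
        return key_file.read_text().strip()
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"No API key found in {path!r} or ${env_var}")
    return key
```

If you use a key file, consider adding it to `.gitignore` so it never reaches a shared repository.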

### Step 2. Install Python Tools

Run the following line to install the required packages.

In [None]:
!pip install -U openai langchain langchain-openai

## Using OpenAI with LangChain

In [None]:
from langchain_openai import OpenAI
import pandas as pd

In [None]:
# Strip whitespace so a trailing newline in the file does not end up in the key.
with open("openai.key", "r") as f:
    openai_key = f.read().strip()
llm = OpenAI(openai_api_key=openai_key,
             model_name="gpt-3.5-turbo-instruct")

In [None]:
llm.invoke("What is the capital of California? Provide a short answer.")

In [None]:
for chunk in llm.stream("Write a short song about data science and large language models."):
    print(chunk, end="", flush=True)

<br>

## Data Analytics

We can use LLMs to help analyze data.

In [None]:
tweets = pd.read_json("AOC_recent_tweets.txt")
list(tweets['full_text'][0:10])

<br><br><br>
Suppose we wanted to evaluate whether a tweet is making a statement about minimum wage.

In [None]:
prompt = """
Is the following text making a statement about minimum wage? You should answer either Yes or No.

{text}

Answer:
"""
questions = [prompt.format_map(dict(text=t)) for t in tweets['full_text'].head(20)]
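To see exactly what the model receives, it can help to print one filled-in prompt. The sketch below repeats the template and fills it with a made-up tweet (not from the dataset) that itself contains braces, to illustrate that `format_map` only substitutes placeholders in the template, not in the inserted text.

```python
prompt = """
Is the following text making a statement about minimum wage? You should answer either Yes or No.

{text}

Answer:
"""

# Made-up example text; the braces inside it are left untouched by format_map.
sample_text = "Workers deserve a $15 minimum wage. {sic}"
filled = prompt.format_map({"text": sample_text})
print(filled)
```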

Ask the LLM to answer each of the questions:

In [None]:
open_ai_answers = llm.batch(questions)
open_ai_answers

In [None]:
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame({"OpenAI": open_ai_answers, 
                   "Text": tweets['full_text'].head(20)})
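# Treat an answer as True only if it starts with "yes" (ignoring case and surrounding whitespace).
df["OpenAI"] = df["OpenAI"].str.strip().str.lower().str.startswith("yes")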
df
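Model answers are free text, so converting them to booleans is more reliable with a strict parse than with a substring check. The sketch below is one possible approach; `parse_yes_no` is a hypothetical helper, and the sample answers are made up. It returns `None` when the answer is neither Yes nor No, so unclear responses can be inspected rather than silently counted.

```python
import re

def parse_yes_no(answer):
    """Hypothetical helper: map a free-text answer to True, False, or None if unclear."""
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"

# Made-up answers; the third contains a capital "Y" but is still a "No".
samples = [" Yes", "No.", "no, it is about Yemen", "Maybe"]
parsed = [parse_yes_no(a) for a in samples]
print(parsed)  # [True, False, False, None]
```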

## Working with Google Gemini Models

You will need to install the Gemini API client to use the code below. You can install it by running the following command:

In [None]:
!pip install -q -U google-generativeai

You will also need an API key. Unfortunately, UC Berkeley has not yet enabled access to the Gemini API for Berkeley accounts, but you can use any free Google account to obtain one by following the instructions [here](https://makersuite.google.com/app/apikey).

Once you have an API key, you can put it here:

In [None]:
# with open("gemini_key.txt", "w") as f:
#     f.write("YOUR KEY")

In [None]:
GEMINI_API_KEY = None
if not GEMINI_API_KEY:
    with open("gemini_key.txt", "r") as f:
        GEMINI_API_KEY = f.read().strip()

We can then connect to the Gemini API using the following code:


In [None]:
import google.generativeai as genai
genai.configure(api_key=GEMINI_API_KEY)

models_df = pd.DataFrame(genai.list_models())
models_df

We can obtain a model and use it to make a prediction. Here we will use a preview release of Gemini 2.5 Flash, `"gemini-2.5-flash-preview-04-17"`, which is generally good for a wide range of tasks.


In [None]:
from IPython.display import Markdown
display(Markdown(models_df[models_df["name"] == "models/gemini-2.5-flash-preview-04-17"]['description'].values[0]))

In [None]:
model = genai.GenerativeModel("gemini-2.5-flash-preview-04-17")

Use the model to generate text:

In [None]:
response = model.generate_content("Why is Data 100 great?")
Markdown(response.text)

## Working with images

In [None]:
from IPython.display import Image
img = Image("data100_logo.png", width=200, height=200)
img

In [None]:
response = model.generate_content([
    """What is going on in this picture I downloaded from 
    the Berkeley Data100 Course Website? 
    How does it relate to Data Science?""", img])
Markdown(response.text)

You can also stream content back, which is useful for interactive use because the output appears as it is generated.

In [None]:
from IPython.display import clear_output

response = model.generate_content("Write a poem about Data Science.", stream=True)

output = ""
for chunk in response:
    output += chunk.text
    clear_output(wait=True)
    display(Markdown(output))


## Using Gen AI for EDA

We could use the model to help analyze our data.

In [None]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_California")[1]
df

In [None]:
fast_model = genai.GenerativeModel("gemini-1.5-flash-8b")

In [None]:
prompt = "What is the mascot of {school}? Answer by only providing the mascot."
df['mascot'] = df['Name'].apply(
    lambda x: fast_model.generate_content(prompt.format(school=x)).text)
df
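Calling the model once per row can be slow and can repeat work when the same value appears more than once. One possible mitigation is to cache answers by prompt. The sketch below uses `functools.lru_cache` with a stand-in function, `fake_generate`, in place of a real model call; both `fake_generate` and `cached_generate` are hypothetical names used only for illustration.

```python
from functools import lru_cache

calls = []  # records which prompts actually reached the stand-in model

def fake_generate(prompt):
    """Hypothetical stand-in for a real call such as fast_model.generate_content(prompt).text."""
    calls.append(prompt)
    return f"answer to: {prompt}\n"

@lru_cache(maxsize=None)
def cached_generate(prompt):
    # Identical prompts hit the cache instead of calling the model again.
    return fake_generate(prompt).strip()

prompt = "What is the mascot of {school}? Answer by only providing the mascot."
schools = ["UC Berkeley", "Stanford University", "UC Berkeley"]
answers = [cached_generate(prompt.format(school=s)) for s in schools]
print(len(calls))  # 2: the repeated school was served from the cache
```

With a DataFrame, the same cached function could be passed to `df['Name'].apply(...)` in place of the direct model call.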

## More EDA with OpenAI

In [None]:
from langchain_openai import OpenAI
# Strip whitespace so a trailing newline in the file does not end up in the key.
with open("openai.key", "r") as f:
    openai_key = f.read().strip()
client = OpenAI(openai_api_key=openai_key,
                model_name="gpt-3.5-turbo-instruct")

In [None]:
# Simulating student feedback data
feedback_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Feedback': [
        'Great class, learned a lot! But I really did not like PCA.',
        'The course was very informative and well-structured. Would prefer if lectures went faster. ',
        'I found the assignments challenging but rewarding. But the midterm was brutal.',
        'The lectures were engaging and the instructor was very knowledgeable.',
        'I struggled with the linear algebra. I would recommend this class to anyone interested in data science.'
    ],
    'Rating': [5, 4, 4, 5, 5]
}
feedback_df = pd.DataFrame(feedback_data)
feedback_df

In [None]:

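# JSON schema describing the desired output fields.
# Note: process_feedback does not pass this schema to the model;
# it describes the same fields directly in the prompt.
output_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "issue_schema",
        "schema": {
            "type": "object",
            "properties": {
                "Issue": {
                    "description": "Any issues or concerns the user raised about the class.",
                    "type": "string"
                },
                "Liked": {
                    "description": "Any things the user liked about the class.",
                    "type": "string"
                }
            },
            # additionalProperties belongs next to "properties", not inside it.
            "additionalProperties": False
        }
    }
}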

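import json
import re

def process_feedback(feedback):
    """Ask the model to extract issues and likes from one feedback string; return a dict."""
    prompt = f"""Extract the following information in JSON format:
    {{
  "Issue": "Any issues or concerns the user raised about the class.",
  "Liked": "Any things the user liked about the class."
  }}

  Feedback: "{feedback}"
"""
    response = client.invoke(prompt)
    empty = {"Issue": "", "Liked": ""}
    # The model may wrap the JSON in extra text, so grab the outermost {...} span.
    json_match = re.search(r"\{.*\}", response, re.DOTALL)
    if json_match is None:
        return empty
    try:
        return json.loads(json_match.group(0))
    except json.JSONDecodeError:
        # Fall back to empty fields if the span is not valid JSON.
        return empty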
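The extraction step can be tried without calling any model. The sketch below isolates it in a hypothetical helper, `extract_json`, and runs it on made-up response strings to show how surrounding chatter is ignored and how a response with no JSON falls back to a default.

```python
import json
import re

def extract_json(text, default=None):
    """Hypothetical helper: parse the outermost {...} span in `text`, or return `default`."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return default
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return default

# Made-up responses for illustration.
sample = 'Sure! Here is the JSON:\n{"Issue": "PCA", "Liked": "Great class"}\nHope that helps.'
good = extract_json(sample)
bad = extract_json("no json here", default={})
print(good)  # {'Issue': 'PCA', 'Liked': 'Great class'}
print(bad)   # {}
```

Because the pattern is greedy, it spans from the first `{` to the last `}`; a response containing two separate JSON objects would fail to parse and return the default.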

In [None]:
responses = feedback_df["Feedback"].apply(process_feedback)
responses

In [None]:
pd.set_option('display.max_colwidth', None)
feedback_df.join(pd.DataFrame(responses.to_list()))