## API Key Setup

Follow the steps in the Quickstart Guide to set up your API key.

Alternatively, set the key up manually from the terminal:

+ Navigate to your project folder: `cd ~/Desktop/source/my-project`

+ Create a `.env` file: `touch .env`

+ Edit the file: `nano .env`

+ Add your API key as `export OPENAI_API_KEY='your-api-key-here'`

+ Press Ctrl + O to save, then press Enter. Exit with Ctrl + X.

+ Open `.gitignore` so that `.env` is excluded from version control: `nano .gitignore`

+ Add the line `.env`

+ Save and exit.

+ To load the variable into your current shell session, run `source .env`
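
To make the `.env` line format concrete, here is a minimal sketch of how a line like the one above can be split into a variable name and a value. It is a simplified illustration only: `parse_env_line` is a hypothetical helper, not part of any library, and the notebook itself relies on `python-dotenv` for the real parsing.

```python
def parse_env_line(line):
    """Return (name, value) for a KEY=VALUE line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # An optional leading `export ` keeps the file usable with `source .env`
    if line.startswith("export "):
        line = line[len("export "):].lstrip()
    name, sep, value = line.partition("=")
    if not sep:
        return None
    value = value.strip()
    # Drop one pair of matching surrounding quotes, if present
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    return name.strip(), value


print(parse_env_line("export OPENAI_API_KEY='your-api-key-here'"))
```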

## Virtual Environment Setup

In Terminal, type:

`python -m venv openai-env`

## Activation

- For Windows, type: `openai-env\Scripts\activate`
- For macOS or Unix, type: `source openai-env/bin/activate`

If done correctly, you should see `(openai-env)` at the start of your terminal prompt, and you can select it as a kernel moving forward.
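
If you want to confirm the activation from inside Python (for example, from a notebook cell), the following sketch may help. It assumes an environment created with `python -m venv` as above; `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

When the environment is active, the interpreter path should point inside the `openai-env` folder.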

## Documentation

- OpenAI Documentation: https://platform.openai.com/docs/introduction
- OpenAI API Reference: https://platform.openai.com/docs/api-reference

## Libraries
- `!pip install --upgrade openai`
- `!pip install pyreadstat`
- `!pip install pandas`
- `!pip install urllib3`
- `!pip install mysql-connector-python`
- `!pip install python-dotenv`
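
After installing, a quick check that the modules can actually be imported can save a confusing error later. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and the list of import names is an assumption based on the imports used in this notebook. Note that an import name can differ from the pip package name (for example, `python-dotenv` is imported as `dotenv`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names used in this notebook (assumed list; adjust to your own setup)
required = ["openai", "pyreadstat", "pandas", "dotenv", "typing_extensions"]
print("Missing modules:", missing_modules(required) or "none")
```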

In [1]:
import pandas as pd 
import pyreadstat
import urllib.request
from openai import OpenAI
from typing_extensions import override

# Correct raw URL from GitHub
url = "https://github.com/zachhollow/GSS-ChatGPT-Demo/raw/main/replication_data/GSS2018.sav"
local_file = "GSS2018.sav"

# Download the file locally
urllib.request.urlretrieve(url, local_file)

# Use pyreadstat to read the SPSS file
df, meta = pyreadstat.read_sav(local_file)

# Display the first few rows of the dataframe
print("File imported successfully as a DataFrame.\n\nHere is a preview:\n")
print(df.head())

# Preview total number of variables 
count = len(meta.column_names)
print(f"\nNumber of variables: {count}")

File imported successfully as a DataFrame.

Here is a preview:

   ABANY  ABDEFECT  ABFELEGL  ABHELP1  ABHELP2  ABHELP3  ABHELP4  ABHLTH  \
0    2.0       1.0       NaN      1.0      1.0      1.0      1.0     1.0   
1    1.0       1.0       3.0      2.0      2.0      2.0      2.0     1.0   
2    NaN       NaN       NaN      1.0      2.0      1.0      1.0     NaN   
3    NaN       NaN       1.0      1.0      1.0      1.0      1.0     NaN   
4    2.0       1.0       NaN      2.0      2.0      2.0      1.0     1.0   

   ABINSPAY  ABMEDGOV1  ...  XMARSEX  XMARSEX1  XMOVIE  XNORCSIZ    YEAR  \
0       1.0        2.0  ...      1.0       1.0     NaN       6.0  2018.0   
1       2.0        NaN  ...      1.0       NaN     2.0       6.0  2018.0   
2       2.0        1.0  ...      NaN       1.0     2.0       6.0  2018.0   
3       1.0        NaN  ...      NaN       NaN     2.0       6.0  2018.0   
4       2.0        NaN  ...      1.0       NaN     2.0       6.0  2018.0   

   YEARSJOB  YEARSUSA 

In [2]:
%%capture capture1
# Print all column names and labels stored in meta.
# %%capture stores this cell's output in `capture1` instead of displaying it;
# run capture1.show() in a later cell to view the list.
print("Scroll to see the full list of column names and labels:\n")
for var_name, var_label in zip(meta.column_names, meta.column_labels):
    print(f"{var_name}: {var_label}")
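
Because the full variable list is long, a keyword search over names and labels may be more practical than scrolling. The following is a minimal sketch; `search_labels` is a hypothetical helper, and the demo lists are placeholders rather than actual GSS labels. In the notebook you would pass `meta.column_names` and `meta.column_labels` instead.

```python
def search_labels(names, labels, keyword):
    """Return (name, label) pairs whose name or label contains keyword, ignoring case."""
    keyword = keyword.lower()
    return [
        (name, label)
        for name, label in zip(names, labels)
        # `label or ""` guards against variables that have no label
        if keyword in name.lower() or keyword in (label or "").lower()
    ]


# Placeholder data for illustration only; not the actual GSS labels
demo_names = ["VAR_A", "VAR_B", "VAR_C"]
demo_labels = ["Example label about income", None, "Example label about religion"]
print(search_labels(demo_names, demo_labels, "income"))
```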

In [3]:
# Convert our SPSS file to a CSV as it's more compatible with Code Interpreter.
csv_file_path = "GSS2018.csv"
df.to_csv(csv_file_path, index=False)
print(f"File has been converted to CSV and saved as {csv_file_path}")

File has been converted to CSV and saved as GSS2018.csv


In [4]:
%load_ext autoreload
%autoreload 2

import os
from dotenv import load_dotenv

load_dotenv()  # Load environment variable from .env file

api_key = os.getenv('OPENAI_API_KEY') # Access your API key using the environment variable

client = OpenAI(api_key=api_key)  # Create the OpenAI client instance
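
`os.getenv` returns `None` when the variable is missing, so a forgotten `.env` file can be easy to overlook at this point. A small guard like the one below makes the failure explicit; `require_env` is a hypothetical helper, not part of `python-dotenv` or the OpenAI SDK.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Check your .env file or run `source .env` first."
        )
    return value
```

It could then be used as `client = OpenAI(api_key=require_env("OPENAI_API_KEY"))`.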

In [5]:
# Upload our CSV as a file
with open(csv_file_path, "rb") as csv_file:  # Close the file handle once the upload finishes
    file = client.files.create(
        file=csv_file,
        purpose='assistants',
    )
print(f"{csv_file_path} successfully uploaded to OpenAI")

GSS2018.csv successfully uploaded to OpenAI


In [6]:
# Access elements of the FileObject
print("Elements of the FileObject - \n")
print(f"File ID: {file.id}\n")
print(f"File Bytes: {file.bytes}\n")
print(f"File Created At: {file.created_at}\n")
print(f"File Name: {file.filename}\n")
print(f"File Purpose: {file.purpose}")

Elements of the FileObject - 

File ID: file-Aimi1qoCDhnr9YISzRo53Oq7

File Bytes: 6744448

File Created At: 1716595786

File Name: GSS2018.csv

File Purpose: assistants
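
The `bytes` and `created_at` fields are raw numbers: a byte count and a Unix timestamp in seconds. The sketch below converts them into more readable forms using only the standard library; `describe_file` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def describe_file(size_bytes, created_at):
    """Format a byte count as MiB and a Unix timestamp as an ISO 8601 UTC string."""
    size_mib = size_bytes / (1024 ** 2)
    created = datetime.fromtimestamp(created_at, tz=timezone.utc)
    return f"{size_mib:.2f} MiB", created.isoformat()


# Values copied from the output above
print(describe_file(6744448, 1716595786))
# → ('6.43 MiB', '2024-05-25T00:09:46+00:00')
```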


In [7]:
from openai import AssistantEventHandler

file = client.files.retrieve(file.id) #Retrieve file

assistant = client.beta.assistants.create(
    instructions="You are a research analyst. Summarize data and provide data visualizations.",
    name="Research Analyst",
    tools=[{"type": "code_interpreter"}], 
    model="gpt-4",
    tool_resources={
    "code_interpreter": {
      "file_ids": [file.id]
    }
  }
) # Create assistant

thread = client.beta.threads.create() # Create a thread
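
The assistant, thread, and uploaded file stay in your OpenAI account until they are deleted. Once you have finished the session, a cleanup along these lines may be useful. This is a sketch that assumes the `delete` methods of the `openai` Python SDK v1 used in this notebook; `cleanup` is a hypothetical helper, so check the calls against your installed SDK version.

```python
def cleanup(client, assistant_id, thread_id, file_id):
    """Delete the thread, assistant, and uploaded file created for this session."""
    client.beta.threads.delete(thread_id)
    client.beta.assistants.delete(assistant_id)
    client.files.delete(file_id)
```

At the end of the notebook this could be called as `cleanup(client, assistant.id, thread.id, file.id)`.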

In [8]:
message = client.beta.threads.messages.create(
  thread_id=thread.id,
  role="user",
  content=input("\nYou: ") 
) # Add input to thread 
  # NOTE: this code only handles one input at a time

# Create an EventHandler class to define how we want to handle the events in the response stream
# Use the `stream` SDK helper with the `EventHandler` class to create the Run and stream the response
class EventHandler(AssistantEventHandler):    
  @override
  def on_text_created(self, text) -> None:
    print(f"\nAssistant > ", end="", flush=True)
      
  @override
  def on_text_delta(self, delta, snapshot):
    print(delta.value, end="", flush=True)
      
  def on_tool_call_created(self, tool_call):
    print(f"\nAssistant > {tool_call.type}\n", flush=True)
  
  def on_tool_call_delta(self, delta, snapshot):
    if delta.type == 'code_interpreter':
      if delta.code_interpreter.input:
        print(delta.code_interpreter.input, end="", flush=True)
      if delta.code_interpreter.outputs:
        print(f"\n\nOutput >", flush=True)
        for output in delta.code_interpreter.outputs:
          if output.type == "logs":
            print(f"\n{output.logs}", flush=True)
 
with client.beta.threads.runs.stream(
  thread_id=thread.id,
  assistant_id=assistant.id,
  instructions="Please address the user as Zach. The user has a background in data analytics and machine learning using Python.",
  event_handler=EventHandler(),
) as stream:
  stream.until_done()
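
As noted in the cell above, that code handles a single input. One way to support several turns on the same thread is to wrap the message and stream calls in a loop, as in the sketch below. `chat_loop` is a hypothetical helper; it reuses the same SDK calls as the cell above and creates a fresh handler for every run, on the assumption that a handler instance should not be shared between streams.

```python
def chat_loop(client, thread_id, assistant_id, make_handler, read_input=input):
    """Send user messages to one thread until the user types 'exit' or 'quit'.

    Returns the number of completed turns.
    """
    turns = 0
    while True:
        text = read_input("\nYou: ").strip()
        if text.lower() in {"exit", "quit"}:
            break
        if not text:
            continue  # Ignore empty input and prompt again
        client.beta.threads.messages.create(
            thread_id=thread_id,
            role="user",
            content=text,
        )
        # Build a new event handler for each run
        with client.beta.threads.runs.stream(
            thread_id=thread_id,
            assistant_id=assistant_id,
            event_handler=make_handler(),
        ) as stream:
            stream.until_done()
        turns += 1
    return turns
```

With the objects defined earlier, this could be started with `chat_loop(client, thread.id, assistant.id, EventHandler)`. The transcript that follows was produced by the single-input cell above, not by this loop.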


You:  Can you run regression to see whether INTECON (y) is causally driven by ABSINGLE (x). Maybe include CHURHPOW and HELPNOT as control variables. Please drop all missing values and provide data visualizations, tables or other files as you see necessary.



Assistant > Absolutely, Zach, I can help with that. Let me begin by loading your data and investigating its structure. Once I have a good sense of the dataset, I'll proceed to implement the linear regression and control for 'CHURHPOW' and 'HELPNOT'.
Assistant > code_interpreter

import pandas as pd

# Load the dataset
data = pd.read_csv('/mnt/data/file-Aimi1qoCDhnr9YISzRo53Oq7')

# Display the first few rows of the dataset
data.head()

Output >

   ABANY  ABDEFECT  ABFELEGL  ABHELP1  ABHELP2  ABHELP3  ABHELP4  ABHLTH  \
0    2.0       1.0       NaN      1.0      1.0      1.0      1.0     1.0   
1    1.0       1.0       3.0      2.0      2.0      2.0      2.0     1.0   
2    NaN       NaN       NaN      1.0      2.0      1.0      1.0     NaN   
3    NaN       NaN       1.0      1.0      1.0      1.0      1.0     NaN   
4    2.0       1.0       NaN      2.0      2.0      2.0      1.0     1.0   

   ABINSPAY  ABMEDGOV1  ...  XMARSEX  XMARSEX1  XMOVIE  XNORCSIZ    YEAR  \
0       1.0     