This notebook was originally written by Dr. Ringel; we are repurposing it for our Major Class Project. Before running it, download the "training_sampleX.csv" files and the "missingdata.csv" file, and load them into the runtime.

Bernie Chen

# Creating Synthetic Experts with Generative AI
> ## Label Text with ChatGPT4

Version 1.2   
Date: February, 2024    
Author: Daniel M. Ringel   
Contact: dmr@unc.edu   

*Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (December 11, 2023).  
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949*

#### This notebook uses the OpenAI API to communicate with GPT4.
- You need an account with OpenAI for API access
- Visit https://platform.openai.com/signup?launch to sign-up
- Beware that using the API comes at a cost: https://openai.com/pricing

# 1. Imports

In [None]:
# Make sure to have the current openai package installed
%pip install --upgrade openai

Note: you may need to restart the kernel to use updated packages.


In [None]:
import pandas as pd
import numpy as np
import openai
from openai import OpenAI
import re, os, signal, datetime, warnings
from bs4 import BeautifulSoup

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
pd.set_option('display.max_colwidth', 300)

print(f"Using OpenAI API version: {openai.__version__}")

Using OpenAI API version: 1.23.6


# 2. Configure

##### By using this notebook, you agree that the author is not liable for any cost or damages that you incur.

> I ***strongly recommend*** that you set a ***soft limit*** and a ***hard limit*** on your ***OpenAI account*** before running this notebook to prevent excessive cost due to glitches in the API interaction (e.g., unexpected answers from the API lead to ongoing queries that incur cost)

In [None]:
# Put your OpenAI API Key here. DO NOT SHARE YOUR KEY!
# ----> Always delete your OpenAI API key before sharing the notebook! <-------
api_key = ""

# Instantiate an OpenAI client
client = OpenAI(api_key = api_key)

if api_key:  # warn only when a non-empty key is actually present
    print("!!! Your API Key may be included in this notebook !!!\n\n >>> Do not forget to delete it before you share the notebook <<<")

run_index = 6

!!! Your API Key may be included in this notebook !!!

 >>> Do not forget to delete it before you share the notebook <<<


Loading Training Data Sample to Label (7 Total Samples)
- training_sample1.csv (1000 samples)
- training_sample2.csv (1000 samples)
- training_sample3.csv (1000 samples)
- training_sample4.csv (1000 samples)
- training_sample5.csv (1000 samples)
- training_sample6.csv (1000 samples)
- training_sample7.csv (604 samples)
- missingdata.csv (58 samples, reviews that couldn't be labeled from samples above)

In [None]:
# Set Paths
IN_Path = "Data/training"
# Set the filepath to each respective training sample for ChatGPT labeling
IN_File = f"training_sample{run_index}.csv"
TEMP_Path = "tmp"
TEMP_File = "SyntheticExperts_tmp"
OUT_Path = "Data/out/training"
OUT_File = "Labeled-TrainingMissing"

if not os.path.exists(TEMP_Path): os.makedirs(TEMP_Path)
if not os.path.exists(OUT_Path): os.makedirs(OUT_Path)

In [None]:
data = pd.read_csv(f"{IN_Path}/{IN_File}")

data.head()
# data.drop(columns="Unnamed: 0")

Unnamed: 0.1,Unnamed: 0,original_index,text,reviewId,city,lan,5D,Reliability,Tangibility,Empathy,Responsiveness,Assurance
0,0,143,Thats a great,ChZDSUhNMG9nS0VJQ0FnSURkOTd6WGJnEAE,Atlanta,en,No specific dimension mentioned,False,False,False,False,False
1,1,3183,"Everything we ordered was delicious. Their sauce is outstanding. We loved their eggplant rollatini and their lobster ravioli. Our waiter was friendly, professional and generous. Their bread is soft and yummy too. Highly recommend if you’re in the area.",ChdDSUhNMG9nS0VJQ0FnSUREMGJLQ2tRRRAB,Dallas,en,,False,False,False,False,False
2,2,1064,good,ChdDSUhNMG9nS0VJQ0FnSURELWJyVDRRRRAB,Atlanta,en,Not enough information to determine relevant dimension,False,False,False,False,False
3,3,1713,Listen..... Take The Time... Just Go and Try this INSANE FOOD.... YOU WON'T BE DISAPPOINTED,ChZDSUhNMG9nS0VJQ0FnSUQ5MGZ2NUxBEAE,Boston,en,Not enough information to determine relevant dimension,False,False,False,False,False
4,4,1793,Good,ChdDSUhNMG9nS0VJQ0FnSURwcmRlaS1BRRAB,Cary,en,-,False,False,False,False,False


In [None]:
# Formulate AI Prompt in RTF (Role, Task, Format) convention
# Adjacent string literals inside the parentheses are concatenated into one prompt string
AI_Prompt = (
    "You are an experienced customer service manager and an expert on the 5 service quality dimensions: Reliability, Tangibility, Empathy, Responsiveness, and Assurance. "
    "When given a numbered list of customer reviews, you examine each review individually. "
    "For each review, determine which of the 5 dimensions is most relevant to the review. "
    "Respond with the most relevant dimensions for each review, with every review having at least one labeled dimension. "
    "Only respond with the terms Reliability, Tangibility, Empathy, Responsiveness, and Assurance. "
    "The following are the definitions of the dimensions. "
    "Reliability: In a restaurant setting, reliability involves consistently delivering quality food and service. This means that meals are prepared correctly and served on time, reservations are honored promptly, and any issues are resolved efficiently to meet customer expectations. "
    "Tangibility: Tangibility in restaurants relates to the physical elements that represent the quality of service. This can include the cleanliness and decor of the dining area, the appearance of staff uniforms, the layout of the restaurant, and the design of menus and promotional materials. "
    "Empathy: In a restaurant setting, empathy is demonstrated through personalized service and attentive interactions with customers. It means understanding and accommodating dietary preferences, special requests, and creating a welcoming environment that shows genuine care for guests' needs. "
    "Responsiveness: Responsiveness in restaurants is the promptness with which staff attend to customers. This includes quickly seating guests, taking orders, serving food, and addressing customer inquiries or problems. Restaurants with high responsiveness ensure guests feel valued and well-attended throughout their visit. "
    "Assurance: Assurance involves the trust and confidence customers have in the restaurant's ability to deliver a safe and enjoyable dining experience. "
    "This can be achieved through staff knowledge, food safety practices, and professional handling of customer concerns. Customers need to feel that they are in good hands and that the restaurant prioritizes their well-being."
)

# AI Controls
tokens = 3000
temp = 0 # According to OpenAI, responses become more deterministic as this value approaches 0
model = "gpt-3.5-turbo" # Model that is actually queried; change this string to use a different OpenAI chat model

# Batch Controls (number of texts per query; consider token limits, cost of retries, and size of failed batches)
batch_size = 25

# Set random state (change this for each run when you take majority labels across multiple runs; see Ringel (2023))
seed = 33
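
The comment on `seed` mentions taking majority labels across multiple runs. The notebook itself does not include that step, so the following is only a minimal sketch of one way to do it. It assumes each run produced a label string in the `5D` format (for example `"Tangibility, Empathy"`) and that a dimension is kept when more than half of the runs assigned it. The helper `majority_labels` and the strict-majority rule are illustrative assumptions; see Ringel (2023) for the procedure actually used there.

```python
from collections import Counter

DIMENSIONS = ["Reliability", "Tangibility", "Empathy", "Responsiveness", "Assurance"]

def majority_labels(runs):
    """Return the dimensions assigned by more than half of the runs (hypothetical helper).

    `runs` holds one label string per run for the same text,
    e.g. "Tangibility, Empathy" as stored in the '5D' column.
    """
    votes = Counter()
    for labels in runs:
        # Count each dimension at most once per run
        found = {d for d in DIMENSIONS if d.lower() in str(labels).lower()}
        votes.update(found)
    threshold = len(runs) / 2  # assumption: strict majority
    return [d for d in DIMENSIONS if votes[d] > threshold]

# Three runs (e.g., three different seeds) for one review
runs_for_one_text = ["Tangibility, Empathy", "Tangibility", "Empathy, Tangibility, Assurance"]
print(majority_labels(runs_for_one_text))
```

With an odd number of runs, ties cannot occur under this rule, which is one reason to prefer three or five runs over two or four.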

**Notes on batch size:**

- At the time of developing this notebook, the performance of OpenAI's API varied dramatically by
> *weekday* **x** *time of day* **x** *internet connection* **x** *model used* **x** *number of tokens*
- In general, I found:
    - smaller batches were less prone to API communication errors than larger batches.
    - longer texts work better in smaller batches
    - runtime dramatically increases during business hours
    - format of AI response deviates more during business hours and early evening, which can lead to errors in response processing
    
***My take-away:*** Create Synthetic Experts overnight on weekends and keep the batch size at a moderate level, especially for longer texts (i.e., more tokens)
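
To make the effect of `batch_size` concrete, here is a small sketch of how the row positions for each query can be derived. It mirrors the `divmod` logic in `classify_by_ai` from the Helper Functions section; the standalone function `batch_ranges` is an illustrative name and not part of the notebook.

```python
def batch_ranges(n_rows, batch_size):
    """Return inclusive (start, end) row positions for each batch (hypothetical helper)."""
    num_full, remainder = divmod(n_rows, batch_size)
    ranges = [(i * batch_size, (i + 1) * batch_size - 1) for i in range(num_full)]
    # Mirrors classify_by_ai: a final partial batch is only queried if it has at least 2 rows
    if remainder >= 2:
        start = num_full * batch_size
        ranges.append((start, start + remainder - 1))
    return ranges

# 58 texts (the size of missingdata.csv) with the batch size configured above
print(batch_ranges(58, 25))
```

Note that with this rule a single leftover row (for example, 51 texts with a batch size of 25) is never sent to the API, so it is worth checking the last row of the output in that case.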

# 3. Helper Functions

**Note from author:** These functions are coded for functionality, not for speed, elegance, or readability (i.e., they are not fully pythonic). Refactor them as needed.

The code in the function *classify_by_ai* is rather extensive to catch errors, retry queries, and collect failed batches. While shorter solutions are possible, I found that the current state of OpenAI's API and models calls for extensive error catching and handling.
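
The retry idea can be shown in isolation. The sketch below is a simplified stand-in, not the notebook's implementation: `call_with_retries` and `flaky_query` are hypothetical names, and the fake query replaces a real API call so the example runs without an OpenAI key. Like the helper functions below, it treats a `ValueError` as an unusable response and any other exception as a communication error.

```python
def call_with_retries(query_fn, max_tries=5):
    """Call query_fn until it succeeds or max_tries is reached (hypothetical helper).

    Returns (result, tries_used) on success and (None, max_tries) on failure.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return query_fn(), attempt
        except ValueError as err:
            # A response arrived but could not be processed
            print(f"Unexpected response on try {attempt}: {err}")
        except Exception as err:
            # Anything else, e.g. a network or API error
            print(f"Error on try {attempt}: {err}")
    return None, max_tries

# Hypothetical stand-in for an API call that fails twice, then succeeds
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("Response missing [index]")
    return "0: Tangibility"

result, tries = call_with_retries(flaky_query)
print(result, tries)
```

Capping the number of tries matters here because every retry is a billed API call.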

In [None]:
def build_query(dataframe, start=0, end=0):
    """Function that builds the AI_query"""
    AI_Query = "".join([f"{i}: {dataframe.iloc[i]['text']}\n" for i in range(start, end+1)])
    return AI_Query

def handle_interrupt(signal, frame):
    """Function to handle interrupts"""
    print("Interrupt signal received. Exiting...")
    exit(0)

def ask_gpt(System_Prompt, User_Query, tokens=2000, temp=1, top_p=1, frequency_penalty=0, presence_penalty=0, model="gpt-3.5-turbo"):
    """Function that Queries OpenAI API"""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": System_Prompt},
            {"role": "user", "content": User_Query}],
        max_tokens=tokens,
        temperature=temp,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty
        )
    return response

def process_response(answer, start, end, retry_count):
    """Function that processes the AI response. Note: the original project labeled marketing mix variables ('FourP'); in this notebook the returned labels are the five service quality dimensions ('FiveDimensions')."""
    # Extract content from answer
    if 'message' in vars(answer.choices[0]):
        answer_content = answer.choices[0].message.content
    elif 'text' in vars(answer.choices[0]):
        answer_content = answer.choices[0].text
    else:
        raise ValueError("Processing Error: Cannot find model response text")
    print(answer_content)
    # Get the token usage
    used_tokens = answer.usage.total_tokens
    # Pre-process message content
    # answer_content = answer_content.replace("###", "") #optional if you are passing additional information behind a separator (e.g., ###)
    lines = [line.strip() for line in answer_content.split('\n') if line.strip()]
    results = []
    for line in lines:
        try:
            index = int(re.findall(r'^(\d+)', line.strip())[0])
            text = re.findall(r'^\d+[:.\s](.*)$', line.strip())[0].strip()
        except IndexError:
            continue
        if retry_count < 3:
            if index is None:
                raise ValueError("Response missing [index]")
            if text is None or len(text) == 0:
                raise ValueError("Response missing [text]")
        results.append((index, text))
    if len(results) == 0:
        raise ValueError("No index returned with content")
    # Create DataFrame
    FiveDimensions = pd.DataFrame(results, columns=["Index", "Content"]).set_index('Index', drop=True).rename_axis(None)
    FiveDimensions = FiveDimensions[~FiveDimensions.index.duplicated(keep='first')]
    # Check if the indices are within the range [start, end]
    indices = FiveDimensions.index.tolist()
    if not all(start <= index <= end for index in indices):
        raise ValueError("Returned indices do not correspond to input indices")
    return FiveDimensions, used_tokens

def classify_by_ai(AI_Prompt, batch_size, model, tokens, temp, data, interims_file):
    """Function that queries the AI in batches and writes the returned labels to the '5D' column of data"""
    counter, sum_tokens, consecutive_fails = 1, 0, 0
    data_len = len(data)
    num_full_batches, remainder = divmod(data_len, batch_size)
    failed_batches = pd.DataFrame(columns=['start', 'end'])  # DataFrame to store the failed batches
    signal.signal(signal.SIGINT, handle_interrupt)
    def process_batch(start, end, max_tries=5):
        nonlocal consecutive_fails, sum_tokens, counter
        print(f"\nstart: {start}, end: {end}")
        AI_Query = build_query(data, start, end)
        #signal.signal(signal.SIGINT, handle_interrupt) #optional for local handling (set to global 5 lines earlier)
        tries_query = 0
        while tries_query < max_tries:
            try:
                print(datetime.datetime.now(), f"Querying OpenAI: Try {tries_query+1}")
                if "gpt" in model:
                    response = ask_gpt(AI_Prompt, AI_Query, tokens=tokens, temp=temp, model=model)
                else:
                    print("Unknown Model Specification")
                try:
                    FiveDimensions, used_tokens = process_response(response, start, end, tries_query)
                    if 'Content' not in FiveDimensions.columns:
                        raise ValueError("Expected 'Content' column in FiveDimensions DataFrame")
                    sum_tokens += used_tokens
                    data.loc[FiveDimensions.index, '5D'] = FiveDimensions['Content'].values
                    consecutive_fails = 0
                    return True
                except ValueError as ve:
                    print(f"Unexpected AI response. Try {tries_query+1}, Error: {ve}")
                    tries_query += 1
            except Exception as e:
                print(f"Error: {e}")
                tries_query += 1
        print(f"Failed querying OpenAI {max_tries} times at batch {counter}.")
        consecutive_fails += 1
        new_row = {'start': start, 'end': end}
        failed_batches.loc[counter] = new_row
        return False
    for batch_num in range(num_full_batches):
        if consecutive_fails == 5:
            print("5 consecutive fails encountered. Stopping the process.")
            return data, sum_tokens, failed_batches
        start, end = batch_num * batch_size, (batch_num + 1) * batch_size - 1
        process_batch(start, end)
        if counter % 10 == 0:
            data.to_pickle(f"{interims_file}.pkl")
            failed_batches.to_pickle(f"{interims_file}_Failed_batches.pkl")
            print(f"Interim Results Saved: Batch {counter}")
        counter += 1
        print(f"Total Tokens used so far: {sum_tokens}")
    if remainder >= 2:
        start, end = num_full_batches * batch_size, num_full_batches * batch_size + remainder - 1
        process_batch(start, end, max_tries=3)
        if counter % 10 == 0:
            data.to_pickle(f"{interims_file}.pkl")
            failed_batches.to_pickle(f"{interims_file}_Failed_batches.pkl")
            print(f"Interim Results Saved: Batch {counter}")
        print(f"Total Tokens used so far: {sum_tokens}")
    return data, sum_tokens, failed_batches

def clean_and_parse_text(text):
    """Function that cleans-up texts by (1) parsing HTML, (2) removing URLS (replace with URL), (3) removing line breaks and leading periods and colons, and (4) removing leading, trailing, and duplicate spaces."""
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)
    parsed = BeautifulSoup(text, "html.parser").get_text() if "filename" not in str(BeautifulSoup(text, "html.parser")) else None
    return re.sub(r" +", " ", re.sub(r'^[.:]+', '', re.sub(r"\\n+|\n+", " ", parsed or text)).strip()) if parsed else None

def boolean_ps(frame):
    """Function that checks which of the five dimensions the AI identified and creates a Boolean column for each dimension"""
    for p in ['Reliability', 'Tangibility', 'Empathy', 'Responsiveness', 'Assurance']:
        frame[p] = frame['5D'].apply(
            lambda x: False if pd.isnull(x) else any([
                p.lower() in item.lower()
                for item in (x.split(", ") if isinstance(x, str) else [])]))
    return frame

# 4. Load, Parse, and Clean Demo Text
These 250 demo texts are ***Synthetic Twins*** of real Tweets. I do not publish real (i.e., original) Tweets with this notebook.  
> ***Synthetic Twins*** correspond semantically in idea and meaning to original texts. However, wording, people, places, firms, brands, and products were changed by an AI. As such, ***Synthetic Twins*** mitigate, to some extent, possible privacy and copyright concerns. If you'd like to learn more about ***Synthetic Twins***, another generative AI project by Daniel Ringel, then please get in touch! dmr@unc.edu  


You can ***create your own Synthetic Twins of texts*** with this Python notebook:   `SyntheticExperts_Create_Synthetic_Twins_of_Texts.ipynb`,   
available as BETA version (still being tested) on the **Synthetic Experts [GitHub](https://github.com/dringel/Synthetic-Experts)** repository.

In [None]:
# Load Texts
df = pd.read_csv(f"{IN_Path}/{IN_File}")
df['text'] = df['text'].apply(clean_and_parse_text)
# df.rename(columns={'reliability': 'man_r', 'man_a': 'Y', 'responsiveness': 'man_rp', 'tangibility': 'man_t', 'empathy': 'man_e', 'assurance': 'man_a'}, inplace=True)

# df = df.drop(columns="Unnamed: 0")
df


Unnamed: 0,original_index,text,reviewId,city,lan,5D,Reliability,Tangibility,Empathy,Responsiveness,Assurance
0,143,Thats a great,ChZDSUhNMG9nS0VJQ0FnSURkOTd6WGJnEAE,Atlanta,en,No specific dimension mentioned,False,False,False,False,False
1,3183,"Everything we ordered was delicious. Their sauce is outstanding. We loved their eggplant rollatini and their lobster ravioli. Our waiter was friendly, professional and generous. Their bread is soft and yummy too. Highly recommend if you’re in the area.",ChdDSUhNMG9nS0VJQ0FnSUREMGJLQ2tRRRAB,Dallas,en,,False,False,False,False,False
2,1064,good,ChdDSUhNMG9nS0VJQ0FnSURELWJyVDRRRRAB,Atlanta,en,Not enough information to determine relevant dimension,False,False,False,False,False
3,1713,Listen..... Take The Time... Just Go and Try this INSANE FOOD.... YOU WON'T BE DISAPPOINTED,ChZDSUhNMG9nS0VJQ0FnSUQ5MGZ2NUxBEAE,Boston,en,Not enough information to determine relevant dimension,False,False,False,False,False
4,1793,Good,ChdDSUhNMG9nS0VJQ0FnSURwcmRlaS1BRRAB,Cary,en,-,False,False,False,False,False
5,2140,Love this place,ChZDSUhNMG9nS0VJQ0FnSUQ5LXU2NVdnEAE,Cary,en,,False,False,False,False,False
6,3164,Good,ChZDSUhNMG9nS0VJQ0FnSUQ5MmJPWWRnEAE,Dallas,en,Not enough information to determine a relevant dimension,False,False,False,False,False
7,3215,Very good,ChZDSUhNMG9nS0VJQ0FnSUQ2Nk82T1BnEAE,Dallas,en,Not enough information to determine relevant dimension,False,False,False,False,False
8,3719,Monica was free flu,ChZDSUhNMG9nS0VJQ0FnSUNwc2ZmeFFnEAE,Dallas,en,-,False,False,False,False,False
9,4146,I'm God,ChdDSUhNMG9nS0VJQ0FnSUNwNDRXd19RRRAB,New York,en,,False,False,False,False,False


You may see a warning from Beautiful Soup when it finds a pattern in text that is similar to a filename. This warning is not a problem for this notebook and for what we are doing here.

# 5. Label Text with OpenAI's GPT4

> From my experience, the speed at which the AI labels texts, and the occurrence of possible errors in communicating with the API is related to the day and time of day you query the API. Workday afternoons and evenings tend to see more traffic (i.e., queries) to GPT4, which can slow down its responses, lead to time-outs, and create various other errors.

In [None]:
# Shuffle order while preserving Index
df["original_Index"] = df.index
df = df.sample(frac=1, random_state=seed)
df.reset_index(inplace=True, drop=True)

df

Unnamed: 0,text,reviewId,city,lan,original_Index
0,I Love the Salad and The Sub sandwiches I love the great customers service the Staff be friendly and helpful to me I Love that $10 Salad U can put all that fresh veggies on it Thanks,ChdDSUhNMG9nS0VJQ0FnSUNqaUx6eDZRRRAB,Atlanta,en,286
1,"Don’t doubt this place, their food is amazing. The owner and cook are friendly and give excellent service.",ChZDSUhNMG9nS0VJQ0FnSURsaVBHOU9nEAE,New York,en,402
2,Great neighborhood place to meet and eat. Loud at lunchtime,ChdDSUhNMG9nS0VJQ0FnSURGeW9EOHJRRRAB,Atlanta,en,253
3,My friend and I came from Shreveport for a convert and me and my friend stopped for lunch here and the food was amazing service by Bella was amazing and she was super attentive. Loved it here🤗🤗🤗,ChZDSUhNMG9nS0VJQ0FnSUNqLWY2VmRnEAE,Dallas,en,4
4,"Cafe flora has been my go to restaurant since for their creative and tasty vegan options! Every time i have a family or friend, this is my go to place to bring them. My complaint is that, we were a given a table right next to the bathroom when i could see enough other empty tables. I hope the se...",ChZDSUhNMG9nS0VJQ0FnSURwcmRyWEVBEAE,Seattle,en,372
...,...,...,...,...,...
995,Sam the bartender and the entire staff is amazing. We have. Even there several times during all seasons and it always is 5 star service and food. I highly recommend it,ChdDSUhNMG9nS0VJQ0FnSUNENnIzazVBRRAB,Washington,en,658
996,"I ordered the Beef with Bitter Melons for dinner. The dish was delicious. The restaurant is simply decorated and clean. The waiters were courteous. After the meal, I ordered two more dishes to take home. It was overall an excellent experience.",ChdDSUhNMG9nS0VJQ0FnSUMxZ0tUam1RRRAB,New York,en,578
997,Great place. Took our daughter for her birthday. Great welcoming staff and service. I had the reddish courtbouillon and it's the best I've had anywhere!,ChdDSUhNMG9nS0VJQ0FnSUM5bEotSW9BRRAB,New Orleans,en,728
998,Drive thru very slow,ChdDSUhNMG9nS0VJQ0FnSUQ5cmRqUWxBRRAB,Atlanta,en,391


In [None]:
# Label with the model set in the AI Controls above
interims_file = f"{TEMP_Path}/{TEMP_File}_seed{seed}"
out, total_tokens, failed_batches = classify_by_ai(AI_Prompt, batch_size, model, tokens, temp, df, interims_file)
print(f"\nComplete. Total tokens used: {total_tokens}")
if not failed_batches.empty:
    print(f"WARNING: AI failed to label {len(failed_batches)} batches.\nConsider querying the AI again for just the rows (texts) in these batches.")


start: 0, end: 24
2024-04-27 02:03:03.449092 Querying OpenAI: Try 1
1: Tangibility, Empathy, Assurance
4: Tangibility
8: Empathy
14: Tangibility, Empathy
19: Reliability, Responsiveness
20: Tangibility, Empathy, Reliability
Total Tokens used so far: 779

start: 25, end: 49
2024-04-27 02:03:04.923821 Querying OpenAI: Try 1
25: Reliability
26: Empathy
27: Tangibility
28: Assurance
29: Reliability
30: Tangibility
31: Empathy
32: Tangibility
33: Empathy, Reliability
34: Tangibility, Empathy
35: Tangibility
36: Tangibility
37: Assurance
38: Tangibility
39: Reliability
40: N/A
41: N/A
42: N/A
43: Responsiveness
44: Tangibility, Reliability
45: N/A
46: Tangibility
47: N/A
48: Assurance
49: N/A
Total Tokens used so far: 1726

start: 50, end: 57
2024-04-27 02:03:06.960731 Querying OpenAI: Try 1
50: Reliability, Empathy
51: Tangibility
52: Empathy
53: Tangibility
54: Reliability
55: Reliability
56: Reliability
57: Assurance
Total Tokens used so far: 2235

Complete. Total tokens used: 2235
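
The lines printed above have the form `index: labels`. As a rough illustration of how such a line becomes Boolean flags, the sketch below combines the index/text extraction from `process_response` with the substring check from `boolean_ps`. The function `parse_line` is a hypothetical, simplified helper: it uses one combined regular expression instead of the two used in the notebook and does no range or duplicate checking.

```python
import re

DIMENSIONS = ["Reliability", "Tangibility", "Empathy", "Responsiveness", "Assurance"]

def parse_line(line):
    """Split one response line such as '33: Empathy, Reliability' into (index, flags)."""
    match = re.match(r"^(\d+)[:.\s](.*)$", line.strip())
    if match is None:
        return None  # line has no leading index, so it is skipped
    index = int(match.group(1))
    text = match.group(2).strip()
    # Case-insensitive check for each dimension name in the label text
    flags = {d: d.lower() in text.lower() for d in DIMENSIONS}
    return index, flags

print(parse_line("33: Empathy, Reliability"))
print(parse_line("40: N/A"))
```

A response such as `N/A` yields `False` for all five dimensions, which is why those rows end up in the `missing` DataFrame in the filtering cell further down.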


In [None]:
# Code the five dimensions to Boolean columns
out = boolean_ps(out)
out

Unnamed: 0,original_index,text,reviewId,city,lan,5D,Reliability,Tangibility,Empathy,Responsiveness,Assurance
0,143,Thats a great,ChZDSUhNMG9nS0VJQ0FnSURkOTd6WGJnEAE,Atlanta,en,No specific dimension mentioned,False,False,False,False,False
1,3183,"Everything we ordered was delicious. Their sauce is outstanding. We loved their eggplant rollatini and their lobster ravioli. Our waiter was friendly, professional and generous. Their bread is soft and yummy too. Highly recommend if you’re in the area.",ChdDSUhNMG9nS0VJQ0FnSUREMGJLQ2tRRRAB,Dallas,en,"Tangibility, Empathy, Assurance",False,True,True,False,True
2,1064,good,ChdDSUhNMG9nS0VJQ0FnSURELWJyVDRRRRAB,Atlanta,en,Not enough information to determine relevant dimension,False,False,False,False,False
3,1713,Listen..... Take The Time... Just Go and Try this INSANE FOOD.... YOU WON'T BE DISAPPOINTED,ChZDSUhNMG9nS0VJQ0FnSUQ5MGZ2NUxBEAE,Boston,en,Tangibility,False,True,False,False,False
4,1793,Good,ChdDSUhNMG9nS0VJQ0FnSURwcmRlaS1BRRAB,Cary,en,Tangibility,False,True,False,False,False
5,2140,Love this place,ChZDSUhNMG9nS0VJQ0FnSUQ5LXU2NVdnEAE,Cary,en,,False,False,False,False,False
6,3164,Good,ChZDSUhNMG9nS0VJQ0FnSUQ5MmJPWWRnEAE,Dallas,en,Not enough information to determine a relevant dimension,False,False,False,False,False
7,3215,Very good,ChZDSUhNMG9nS0VJQ0FnSUQ2Nk82T1BnEAE,Dallas,en,Not enough information to determine relevant dimension,False,False,False,False,False
8,3719,Monica was free flu,ChZDSUhNMG9nS0VJQ0FnSUNwc2ZmeFFnEAE,Dallas,en,Empathy,False,False,True,False,False
9,4146,I'm God,ChdDSUhNMG9nS0VJQ0FnSUNwNDRXd19RRRAB,New York,en,,False,False,False,False,False


In [None]:
run_index = 8
missing = pd.DataFrame(columns=out.columns)

for index, row in out.iterrows():
    # Check whether none of the five dimension flags is set for this row
    if not (row['Reliability'] or row['Tangibility'] or row['Empathy'] or row['Responsiveness'] or row['Assurance']):
        # Add a copy of the row to the missing DataFrame
        missing.loc[len(missing)] = row.copy()
        # Remove the row from out by index
        out = out.drop(index)

out.count()

original_index    39
text              39
reviewId          39
city              39
lan               39
5D                39
Reliability       39
Tangibility       39
Empathy           39
Responsiveness    39
Assurance         39
dtype: int64

In [None]:
OUT_FILEPATH = f"{OUT_Path}"

# Save
if not os.path.exists(OUT_FILEPATH): os.makedirs(OUT_FILEPATH)
out.to_csv(f"{OUT_FILEPATH}/{OUT_File}{run_index}_{model}.csv")
failed_batches.to_csv(f"{OUT_FILEPATH}/{OUT_File}{run_index}_{model}_failed_run.csv")

# out.to_csv(f"{OUT_FILEPATH}/{OUT_File}_seed{seed}_{model}_run{run_index}.csv")
# failed_batches.to_csv(f"{OUT_FILEPATH}/{OUT_File}_seed{seed}_{model}_failed_run{run_index}.csv")
run_index += 1

In [None]:
print("If you use this notebook's code, please give credit to the author by citing the paper:\n\nDaniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (December 11, 2023).\nAvailable at SSRN: https://papers.ssrn.com/abstract_id=4542949")

If you use this notebook's code, please give credit to the author by citing the paper:

Daniel M. Ringel, Creating Synthetic Experts with Generative Artificial Intelligence (December 11, 2023).
Available at SSRN: https://papers.ssrn.com/abstract_id=4542949
