# Data Collection - Control Group

The control group is gathered to **establish a baseline of response length variability without the influence of length-defining keywords**.
<br><br>
It consists of 500 responses (5 sets of 100, one set per prompt template) from ChatGPT (`gpt-3.5-turbo` model), collected using `generate_responses.py`.
<br><br>
Each prompt is built from a template in `prompt_templates.py` with the length-defining keyword left empty.
<br><br>
These responses serve as a reference point for measuring the effect of different keywords on response length.
<br><br>
Data from the control group will **help determine template-specific keywords by analyzing the response length distribution** in `2_EDA_Control_Group.ipynb`.
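
To make "response length variability" concrete, here is a minimal sketch of how a baseline could be summarized. It assumes length is measured in whitespace-separated words, which is an assumption and may differ from the measure used in the EDA notebook. The function name `summarize_lengths` and the sample strings are hypothetical placeholders, not real model output.

```python
import statistics
from typing import Dict, List


def summarize_lengths(responses: List[str]) -> Dict[str, float]:
    """Summarize response lengths, measured here in whitespace-separated words."""
    lengths = [len(response.split()) for response in responses]
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        # Sample standard deviation needs at least two data points
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "min": min(lengths),
        "max": max(lengths),
    }


# Hypothetical placeholder responses, not real model output
sample = ["Dear client, please join us.", "Hello, we sell paper.", "Hi."]
print(summarize_lengths(sample))
```

The function raises `statistics.StatisticsError` on an empty list, so it should only be called once at least one response has been collected.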

# Setup

In [1]:
# LIBRARIES
import pandas as pd
import os
from typing import Dict  # type hinting

# FILES
from generate_responses import (
    generate_responses,
)  # function to generate responses from OpenAI
import prompt_templates

# CONSTANTS
N_RESPONSES = 100  # set number of responses to generate per prompt
PROMPT_TITLE_TEMPLATE = (
    prompt_templates.prompt_title_template
)  # Dict[str, str] -> prompt titles (str) and templates (str) with placeholders for length defining keywords

# COLORS FOR PRINTING
GRAY = "\033[90m"
RED = "\033[91m"
GREEN = "\033[92m"
YELLOW = "\033[93m"
PURPLE = "\033[95m"
CYAN = "\033[96m"
RESET = "\033[0m"

# Response Collection: Control Group without specifying length-defining keywords

In [2]:
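def control_group_collection(prompt_title_template: Dict[str, str]) -> None:
    """Generate N_RESPONSES control responses for every prompt template.

    Each template is filled with an empty keyword, and the responses are
    saved to `<prompt_title>_control.csv` by `generate_responses`.
    """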
    for prompt_title, template in prompt_title_template.items():
        print(
            f"{YELLOW}GENERATING ALL RESPONSES FOR {CYAN}{prompt_title}{YELLOW} PROMPT TEMPLATE...{RESET}"
        )
        print(f"File being created: {GREEN}{prompt_title}_control.csv{RESET}")

        prompt = template.format("")  # no length defining keyword for control group
        prompt = " ".join(prompt.split())  # collapse the extra whitespace left by the empty keyword
        print(f"Prompt used: {GRAY}{prompt}{RESET}")

        generate_responses(
            n=N_RESPONSES,
            prompt=prompt,
            filename_prefix=prompt_title,
            filename_suffix="control",  # manually set filename suffix, because no length defining keyword was used
            folder_type="control",  # default is "experimental"
        )
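
The two prompt-preparation lines in the function above can be tried in isolation. The sketch below assumes the templates use a single positional `{}` placeholder, which is what `template.format("")` implies. The helper name `build_control_prompt` and the example template are hypothetical and are not taken from `prompt_templates.py`.

```python
def build_control_prompt(template: str) -> str:
    """Fill the keyword placeholder with an empty string and collapse whitespace."""
    prompt = template.format("")  # leaves a double space where the keyword was
    # split() with no arguments splits on any run of whitespace,
    # so joining with single spaces removes the leftover gap
    return " ".join(prompt.split())


# Hypothetical template, not one of the entries in prompt_templates.py
example_template = "Write a {} summary of the meeting."
print(repr(example_template.format("")))             # 'Write a  summary of the meeting.'
print(repr(build_control_prompt(example_template)))  # 'Write a summary of the meeting.'
```

Because `split()` treats newlines and tabs as whitespace too, this normalization also flattens any multi-line template into a single line.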

In [3]:
control_group_collection(PROMPT_TITLE_TEMPLATE)

GENERATING ALL RESPONSES FOR email PROMPT TEMPLATE...
File being created: email_control.csv
Prompt used: Write a business email that is professional, clear, and concise. The email should be addressed to a potential business client. Please introduce my paper company and our offer. Invite the recipient to a meeting in my office at 2PM on Wednesday.
Responses gathered: 1  |  Working on: email_control.csv  |  Errors occured: 0
Responses gathered: 2  |  Working on: email_control.csv  |  Errors occured: 0
Responses gathered: 3  |  Working on: email_control.csv  |  Errors occured: 0
Responses gathered: 4  |  Working on: email_control.csv  |  Errors occured: 0
Responses gathered: 5  |  Working on: email_control.csv  |  Errors occured: 0
Responses gathered: 6  |  Working on: email_control.csv  |  Errors occured: 0
Responses gathered: 7  |  Working on: email_control.csv  |  Errors occured: 0

# Assert shape of all CSVs

If the cell below runs without output, every CSV has the expected shape: `N_RESPONSES` rows and 4 columns.

In [4]:
desired_number_of_responses = N_RESPONSES
desired_number_of_columns = 4
folder_type = "control"

directory = os.path.join(os.getcwd(), "data", f"raw_{folder_type}_group")

for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)

    if filename.endswith(".csv"):
        df = pd.read_csv(file_path)
        # making sure that all files have same shape
        assert df.shape == (
            desired_number_of_responses,
            desired_number_of_columns,
        ), f"{filename} has shape {df.shape} instead of ({desired_number_of_responses}, {desired_number_of_columns})"
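
The assertion above stops at the first file with an unexpected shape. As an alternative, the sketch below reports every mismatching file in one pass. It uses only the standard library `csv` module instead of pandas, so row counts could differ from `pd.read_csv` in edge cases. The function names are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, Tuple


def csv_shape(path: Path) -> Tuple[int, int]:
    """Return (data rows, columns), treating the first row as the header."""
    with path.open(newline="", encoding="utf-8") as handle:
        # Skip fully blank lines, similar to the pandas default
        rows = [row for row in csv.reader(handle) if row]
    if not rows:
        return (0, 0)
    return (len(rows) - 1, len(rows[0]))


def find_shape_mismatches(
    directory: Path, expected: Tuple[int, int]
) -> Dict[str, Tuple[int, int]]:
    """Map each CSV filename whose shape differs from `expected` to its actual shape."""
    mismatches = {}
    for path in sorted(directory.glob("*.csv")):
        shape = csv_shape(path)
        if shape != expected:
            mismatches[path.name] = shape
    return mismatches
```

For the control group, a call such as `find_shape_mismatches(Path("data/raw_control_group"), (100, 4))` would return an empty dictionary when every file is complete.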

# Check that all files were collected

In [5]:
csv_count = 0

for file in os.listdir(directory):
    if file.endswith(".csv"):
        csv_count += 1

# Print the number of CSV files
print("Number of CSV files:", csv_count)

Number of CSV files: 5
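
Counting files confirms how many CSVs exist but not which ones. The sketch below checks for the expected names instead, using the `<prompt_title>_control.csv` pattern printed during collection. The helper name `missing_control_files` is hypothetical.

```python
import os
from typing import Iterable, List


def missing_control_files(prompt_titles: Iterable[str], directory: str) -> List[str]:
    """Return expected `<title>_control.csv` names that are absent from `directory`."""
    present = set(os.listdir(directory))
    expected = [f"{title}_control.csv" for title in prompt_titles]
    return sorted(name for name in expected if name not in present)
```

In this notebook it could be called as `missing_control_files(PROMPT_TITLE_TEMPLATE.keys(), directory)`, where an empty list means that every template has its control file.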
