In [78]:
#| default_exp gptanalysis

In [94]:
import pandas as pd
from IPython.display import display, Markdown

In [80]:
df = pd.read_excel('../out/attestations_clustered.xlsx')

# set up API

In [81]:
from dotenv import load_dotenv
import os
from simpleaichat import AIChat

# Load the environment variables from the .env file
load_dotenv()

# Get the value of the OPENAI_API_KEY environment variable
api_key = os.getenv("OPENAI_API_KEY")
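
`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file would otherwise only surface later, at the first API call. A minimal sketch of an early check follows; `require_env` is a hypothetical helper and not part of the notebook or of `python-dotenv`.

```python
import os

def require_env(name):
    # Hypothetical helper: fail early with a clear message if the variable is unset or empty
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f'{name} is not set; add it to .env or export it in the shell')
    return value
```

It could be used as `api_key = require_env("OPENAI_API_KEY")` right after `load_dotenv()`.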

In [102]:
with open('../data/gpt_prompt_system.txt', 'r') as f:
    prompt_system = f.read().replace('\n', ' ')

display(Markdown(prompt_system))

You work as a corpuslinguistic annotator for a research project in linguistics. The data are from the BNC 2014 spoken corpus and contain utterances that feature the phrase 'in here'. These utterances were classified into several clusters based on several features of the attested utterances: word forms, word classes, and several semantic and pragmatic features. The assumption is that these clusters represent different types of uses of the phrase 'in here' and that they show differences with regard to linguistic regularities underlying the use of 'in here'. Your job is to summarise the regularities of each cluster, and to summarise how these clusters cover the space of usages of the phrase 'in here'.

In [83]:
model='gpt-4'
# model='gpt-3.5-turbo'

In [84]:
ai = AIChat(
    api_key=api_key,  # key loaded from .env above
    console=False,
    save_messages=False,  # do not keep the message history between calls
    model=model,
    params={"temperature": 0.0},
    system=prompt_system
)

In [85]:
n = 5

In [86]:
# get distinct values of the cluster column
clusters = df['cluster'].unique()
# sort them
clusters.sort()

In [87]:
#| export
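def get_examples(df, cluster, n):
	"""Return the n utterances with the smallest `dist_cluster_<cluster>` value as a one-column frame."""
	return (df
		.nsmallest(n, f'dist_cluster_{cluster}')
		# `assign` passes the whole DataFrame to the lambda, not a single row
		.assign(text = lambda d: d['left_context'] + ' ' + d['node'] + ' ' + d['right_context'])
		.loc[:, ['text']]
	)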

In [88]:
examples = get_examples(df, 0, n)
examples

Unnamed: 0,text
71,in here it wo n't allow me to store anything i...
104,? yeah just get a pair just to eh keep in here...
38,s like four or five drawers yeah there 's four...
181,--ANONnameM not have one ? I 've got bigger bo...
175,did it go ? ah users --ANONnameM my pictures p...
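
To see what `get_examples` selects without the corpus file, the same logic can be run on a small made-up frame. The sketch below assumes the column layout used above (`left_context`, `node`, `right_context`, and one `dist_cluster_<k>` column per cluster); the rows and distances are invented for illustration, and the function is repeated so that the block runs on its own.

```python
import pandas as pd

# Toy data: invented rows and distances, not taken from the corpus
toy = pd.DataFrame({
    'left_context': ['put it', 'come', 'it is cold', 'look'],
    'node': ['in here'] * 4,
    'right_context': ['please', 'now', 'today', 'quickly'],
    'dist_cluster_0': [0.1, 0.9, 0.5, 0.3],
    'dist_cluster_1': [0.8, 0.2, 0.6, 0.7],
})

def get_examples(df, cluster, n):
    # Keep the n rows with the smallest distance to the cluster, then build the full utterance
    return (df
        .nsmallest(n, f'dist_cluster_{cluster}')
        .assign(text=lambda d: d['left_context'] + ' ' + d['node'] + ' ' + d['right_context'])
        .loc[:, ['text']]
    )

toy_examples = get_examples(toy, 0, 2)
print(toy_examples)
```

The original row index is preserved, which is why the output above shows labels such as 71 and 104 rather than 0 to 4.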


In [89]:
#| export
def format_examples(examples):
	"""Format the `text` column as a bullet list, one utterance per line."""
	# convert to list of strings
	texts = examples['text'].tolist()
	# prefix every example with '- ' and separate them with linebreaks
	return '- ' + '\n- '.join(texts)
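
The join in `format_examples` places the bullet marker before the first item and between items, so the result carries no trailing newline. A minimal sketch on a plain list of invented strings shows the effect; `to_bullets` is a hypothetical stand-in that skips the DataFrame step.

```python
def to_bullets(items):
    # Hypothetical helper mirroring the join used in format_examples
    return '- ' + '\n- '.join(items)

bullets = to_bullets(['first utterance', 'second utterance'])
print(bullets)
```

One consequence of this pattern is that an empty list still yields a lone `'- '`, so an empty cluster would appear as an empty bullet in the prompt.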

In [90]:
examples_formatted = format_examples(examples)
examples_formatted

"- in here it wo n't allow me to store anything in here right well that 's annoying so I can only put\n- ? yeah just get a pair just to eh keep in here well I 've got two pairs anyway so I might\n- s like four or five drawers yeah there 's four in here and they 're nice it 's a nice size one\n- --ANONnameM not have one ? I 've got bigger bowls in here hang on a minute there might be something in here\n- did it go ? ah users --ANONnameM my pictures probably in here what is it you 're looking for ? er that"

In [91]:
prompt_examples = ''

for c in clusters:
	prompt_examples += f'cluster {c}' + '\n'
	prompt_examples += '\n' + format_examples(get_examples(df, c, n)) + '\n\n'

print(prompt_examples)

cluster 0

- in here it wo n't allow me to store anything in here right well that 's annoying so I can only put
- ? yeah just get a pair just to eh keep in here well I 've got two pairs anyway so I might
- s like four or five drawers yeah there 's four in here and they 're nice it 's a nice size one
- --ANONnameM not have one ? I 've got bigger bowls in here hang on a minute there might be something in here
- did it go ? ah users --ANONnameM my pictures probably in here what is it you 're looking for ? er that

cluster 1

- I 'm the alpaca Pandora of of the --ANONplace came in here totally see you as like a horror movie lead actually
- Helen live together and er they when I first moved in here and I was n't sure about it and I was
- see it ? oh yeah yeah --ANONnameM --ANONnameM --ANONnameM come in here where I can see you hello hello hello hello juice
- when I lived with --ANONnameM and --ANONnameF before I moved in here mm hm and um she came and everything and it
- How long ago ? last

In [92]:
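# Send the examples to the model. This makes a billable API call, so
# comment the line out again once `gpt_analysis` is available.
gpt_analysis = ai(prompt_examples)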

In [95]:
display(Markdown(gpt_analysis))

Cluster 0: This cluster seems to represent uses of 'in here' where the phrase refers to a specific location or container where items can be stored or found. The phrase is often used in the context of searching for something or discussing the placement of objects.

Cluster 1: This cluster represents uses of 'in here' in the context of moving or coming into a place. The phrase is often used to describe someone's arrival or relocation to a new place, such as moving into a new house or entering a room.

Cluster 2: This cluster represents uses of 'in here' where the phrase is used to describe the conditions or atmosphere of a place, such as its temperature or noise level. The phrase is often used in the context of commenting on these conditions.

Cluster 3: This cluster represents uses of 'in here' where the phrase is used in questions or exclamations about what is inside a particular place or container. The phrase is often used in the context of discovering or identifying what is inside.

Cluster 4: This cluster represents uses of 'in here' where the phrase is used to refer to a specific location where activities are taking place or are planned to take place. The phrase is often used in the context of discussing these activities.

These clusters cover a range of usages of 'in here', from referring to a specific location or container, to describing the conditions of a place, to discussing activities taking place in a location. They show that 'in here' can be used in a variety of contexts and with different meanings depending on the situation.

In [103]:
# with open(f'../out/gpt_cluster_analysis_{model}_{n}.txt', 'w') as f:
# 	f.write(gpt_analysis)
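
Since both the API call and the file write are toggled by commenting lines in and out, one alternative is to cache the response on disk and only query the model when no cached file exists. The following is a sketch under that assumption; `load_or_compute` is a hypothetical helper, not part of the notebook or of `simpleaichat`.

```python
from pathlib import Path

def load_or_compute(path, compute):
    """Return the cached text at `path` if it exists; otherwise call `compute()` and cache the result."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding='utf-8')
    text = compute()
    # Create the output directory if needed before writing the cache file
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding='utf-8')
    return text

# Possible use in this notebook (would trigger the API call only on the first run):
# gpt_analysis = load_or_compute(
#     f'../out/gpt_cluster_analysis_{model}_{n}.txt',
#     lambda: ai(prompt_examples),
# )
```

Because the file name already encodes `model` and `n`, changing either value would lead to a fresh query rather than a stale cached answer.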