# Anthropic Interviewer dataset

A quickstart notebook that loads the interview transcripts from the Hugging Face dataset and runs a light inspection: row counts, length statistics, TF-IDF keywords, and LLM-extracted themes.
            


**Dataset:** `Anthropic/AnthropicInterviewer` on Hugging Face.
- Interview transcripts from 1,250 professionals (workforce=1,000, creatives=125, scientists=125).
- The data is licensed CC-BY and the code MIT. The dataset is public, so reading it requires no auth token.

Run the install cell once per environment, then execute the rest.
            


In [11]:
%pip install -q pandas huggingface_hub openai scikit-learn



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [12]:
import pandas as pd
from pathlib import Path

SPLITS = {
    "workforce": "interview_transcripts/workforce_transcripts.csv",
    "creatives": "interview_transcripts/creatives_transcripts.csv",
    "scientists": "interview_transcripts/scientists_transcripts.csv",
}
BASE_PATH = "hf://datasets/Anthropic/AnthropicInterviewer/"

def load_split(name: str) -> pd.DataFrame:
    path = BASE_PATH + SPLITS[name]
    df = pd.read_csv(path)
    df["split"] = name
    return df

dfs = {name: load_split(name) for name in SPLITS}
for name, df in dfs.items():
    cols = ", ".join(df.columns)
    print(f"{name:10} {df.shape[0]:4} rows | columns: {cols}")
            


workforce  1000 rows | columns: transcript_id, text, split
creatives   125 rows | columns: transcript_id, text, split
scientists  125 rows | columns: transcript_id, text, split


In [13]:
# Quick look at the workforce split
dfs["workforce"].head()
            


,transcript_id,text,split
0,work_0000,Assistant: Hi there! I'm Claude from Anthropic...,workforce
1,work_0001,Assistant: Hi there! I'm Claude from Anthropic...,workforce
2,work_0002,Assistant: Hi there! I'm Claude from Anthropic...,workforce
3,work_0003,Assistant: Hi there! I'm Claude from Anthropic...,workforce
4,work_0004,Assistant: Hi there! I'm Claude from Anthropic...,workforce
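
The previews above show each transcript opening with an `Assistant:` speaker label. For turn-level analysis, a minimal sketch of splitting a transcript into (speaker, utterance) pairs follows. It assumes every turn starts on its own line with one of the labels in `SPEAKER_LABELS`; only `Assistant:` is visible in the preview, so the other labels and the helper name `split_turns` are assumptions to check against the real text.

```python
import re

# Assumed speaker labels; only "Assistant:" is confirmed by the preview.
SPEAKER_LABELS = ("Assistant", "AI", "User")
_TURN_RE = re.compile(rf"^({'|'.join(SPEAKER_LABELS)}):\s*", flags=re.MULTILINE)

def split_turns(text: str) -> list[tuple[str, str]]:
    """Split a transcript into (speaker, utterance) pairs."""
    # With a capturing group, re.split returns
    # [preamble, label1, body1, label2, body2, ...].
    parts = _TURN_RE.split(text)
    labels, bodies = parts[1::2], parts[2::2]
    return [(label, body.strip()) for label, body in zip(labels, bodies)]

example = "Assistant: Hi there!\n\nUser: Hello.\n\nAssistant: Thanks for joining."
turns = split_turns(example)
```

In the notebook this could be applied with `dfs["workforce"]["text"].map(split_turns)`; a transcript that yields an empty list indicates the label assumption does not hold for it.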


In [14]:
# Sample rows across all splits
all_df = pd.concat(dfs.values(), ignore_index=True)
all_df.sample(5, random_state=42)[["transcript_id", "split", "text"]]
            


,transcript_id,split,text
680,work_0680,workforce,Assistant: Hi there! I'm Claude from Anthropic...
1102,creativity_0102,creatives,Assistant: Hi there! Thank you so much for tak...
394,work_0394,workforce,Assistant: Hi there! I'm Claude from Anthropic...
930,work_0930,workforce,Assistant: Hi there! I'm Claude from Anthropic...
497,work_0497,workforce,Assistant: Hi there! I'm Claude from Anthropic...


In [15]:
# Rough length stats by split (character count of transcript text)
all_df = all_df.copy()
all_df["text_length"] = all_df["text"].str.len()
all_df.groupby("split")["text_length"].describe()[["count", "mean", "min", "max"]]
            


split,count,mean,min,max
creatives,125.0,8905.288,4867.0,14399.0
scientists,125.0,8511.928,5550.0,16061.0
workforce,1000.0,9230.874,4837.0,26826.0


### Per-split descriptive stats
Add word-level and character-level summaries to see distribution differences per group.
    


In [16]:
# Word- and character-level descriptive stats by split
all_df = pd.concat(dfs.values(), ignore_index=True)
all_df = all_df.assign(
    word_count=all_df["text"].str.split().str.len(),
    char_count=all_df["text"].str.len(),
)
summary = (
    all_df.groupby("split")[["word_count", "char_count"]]
    .agg(["count", "mean", "median", "min", "max"])
    .round(2)
)
summary
    


,word_count,word_count,word_count,word_count,word_count,char_count,char_count,char_count,char_count,char_count
split,count,mean,median,min,max,count,mean,median,min,max
creatives,125,1545.13,1481.0,856,2535,125,8905.29,8582.0,4867,14399
scientists,125,1438.47,1348.0,920,2784,125,8511.93,8025.0,5550,16061
workforce,1000,1593.74,1519.5,825,4789,1000,9230.87,8809.5,4837,26826
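
The workforce maximum (26,826 characters) is about three times its median, which suggests a long right tail that mean and max alone do not describe well. A percentile view is one way to see it. The sketch below is illustrative: `length_quantiles` is a hypothetical helper, and the toy frame stands in for `all_df` so the block runs on its own.

```python
import pandas as pd

def length_quantiles(df: pd.DataFrame, qs=(0.5, 0.9, 0.99)) -> pd.DataFrame:
    """Per-split character-count quantiles; expects 'split' and 'text' columns."""
    lengths = df.assign(char_count=df["text"].str.len())
    # quantile() with a list returns a (split, quantile) MultiIndex; unstack
    # turns the quantile level into columns.
    out = lengths.groupby("split")["char_count"].quantile(list(qs)).unstack()
    out.columns = [f"p{round(q * 100)}" for q in out.columns]
    return out

# Toy data standing in for all_df.
toy = pd.DataFrame({
    "split": ["a", "a", "a", "b"],
    "text": ["x" * 10, "x" * 20, "x" * 30, "x" * 5],
})
quantiles = length_quantiles(toy)
```

In the notebook, `length_quantiles(all_df)` would give one row per split.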


### Top keywords per group (TF-IDF)
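A rough look at the highest-weighted vocabulary in each group. Each split gets its own vectorizer, so IDF is computed within the split and the scores rank terms that are prominent inside a group rather than terms that separate one group from another. Adjust `top_n` or the stop-word list as needed.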
    


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

top_n = 15
results = {}
for split, df in dfs.items():
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    matrix = vec.fit_transform(df["text"])
    scores = matrix.sum(axis=0).A1
    terms = vec.get_feature_names_out()
    order = scores.argsort()[::-1][:top_n]
    results[split] = [(terms[i], float(scores[i])) for i in order]

for split, items in results.items():
    print(f"\n{split.title()} top {top_n} tf-idf terms:")
    for term, score in items:
        print(f"  {term:20s} {score:.2f}")



Workforce top 15 tf-idf terms:
  ai                   465.14
  work                 136.01
  user                 112.22
  use                  101.86
  like                 91.00
  really               69.51
  time                 60.90
  tasks                56.05
  using                51.92
  think                49.41
  help                 43.22
  questions            40.53
  sounds               40.53
  experiences          38.04
  insights             37.81

Creatives top 15 tf-idf terms:
  ai                   54.93
  creative             24.38
  like                 15.17
  work                 14.42
  user                 13.30
  really               11.30
  using                7.86
  use                  6.76
  sounds               6.12
  think                6.09
  time                 6.01
  writing              5.90
  process              5.21
  ve                   5.01
  research             4.89

Scientists top 15 tf-idf terms:
  ai                   46.78
  researc

### LLM themes per group
Set `OPENAI_API_KEY` in your environment. The cell below samples 8 transcripts per split with a fixed seed and asks an OpenAI chat model (default `gpt-4o`) for 5 themes, each with 1-2 examples grounded in the text.
            


In [None]:
from openai import OpenAI
import os

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Set OPENAI_API_KEY in your environment before running this cell.")

client = OpenAI(api_key=api_key)
separator = "\n\n---\n\n"

def summarize_split(split: str, sample_size: int = 8, model: str = "gpt-4o") -> str:
    subset = dfs[split].sample(sample_size, random_state=42)["text"].tolist()
    prompt = f"""
You are analyzing qualitative interview transcripts from the {split} group.
Extract 5 themes. For each theme, provide a short label and 1-2 bullet examples grounded in the text.
Return concise markdown.

Transcripts (each separated by ---):
{separator.join(subset)}
"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content

for split in ["workforce", "creatives", "scientists"]:
    print(f"\n### {split.title()} themes\n")
    print(summarize_split(split))



### Workforce themes

# Themes from Interview Transcripts

## 1. AI as a Time-Saving Tool
- **Example:** A funeral director uses AI to help with obituary writing, saving time and ensuring character limits are met for newspaper publications.
- **Example:** A high school teacher uses AI to brainstorm lesson ideas efficiently, allowing more time for student engagement.

## 2. Collaboration and Iteration with AI
- **Example:** A software developer collaborates with AI to debug and refine code, stepping in manually when the AI gets stuck.
- **Example:** An instructional designer uses AI to generate outlines and then iterates on them to fit specific learning objectives.

## 3. Maintaining Human Expertise and Judgment
- **Example:** A pastor emphasizes the importance of embodying sermon messages personally, using AI only as a supplementary tool.
- **Example:** A university lecturer uses AI for initial insights but relies on personal expertise to verify and expand on those ideas.

## 4. Chall
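
With 8 sampled transcripts at roughly 9,000 characters each (per the length stats earlier), the prompt built in `summarize_split` runs to about 70,000 characters, and a single long workforce transcript (up to 26,826 characters) can take a large share of it. A minimal sketch of capping each transcript before joining follows. `build_transcript_block` is a hypothetical helper and the default `max_chars` is an arbitrary assumption; truncation discards the end of long interviews, which may bias the themes.

```python
def build_transcript_block(
    transcripts: list[str],
    max_chars: int = 6000,  # assumed per-transcript cap; tune for your model
    separator: str = "\n\n---\n\n",
) -> str:
    """Join transcripts, clipping each to max_chars characters plus a marker."""
    clipped = []
    for text in transcripts:
        if len(text) > max_chars:
            # Mark the cut so the model knows the transcript is incomplete.
            text = text[:max_chars].rstrip() + " [truncated]"
        clipped.append(text)
    return separator.join(clipped)

block = build_transcript_block(["a" * 10, "b" * 3], max_chars=5)
```

Inside `summarize_split`, `build_transcript_block(subset)` could replace `separator.join(subset)`.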

### Save per-group LLM themes to markdown
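Writes the generated themes to `analysis/llm_group_analysis.md` for easy reference. The cell calls `summarize_split` again for each split, so it makes three more API requests and the saved text may differ from the output printed above.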


In [None]:
from pathlib import Path

output_dir = Path("analysis")
output_dir.mkdir(exist_ok=True)
out_path = output_dir / "llm_group_analysis.md"

sections = []
for split in ["workforce", "creatives", "scientists"]:
    sections.append(f"## {split.title()} themes\n")
    sections.append(summarize_split(split))

content = "\n\n".join(sections)
out_path.write_text(content, encoding="utf-8")
print(f"Wrote {out_path} ({len(content)} chars)")
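
The save cell above calls `summarize_split` again for every split, which repeats the API requests made when the themes were printed. One way to avoid that, sketched here, is to wrap the summarizer in a small per-session cache so printing and saving share the same text. `make_cached` is a hypothetical helper, and `fake_summarize` is a stand-in so the block runs without an API key.

```python
from typing import Callable

def make_cached(summarize: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a summarizer so each split is summarized at most once per session."""
    cache: dict[str, str] = {}

    def cached(split: str) -> str:
        if split not in cache:
            cache[split] = summarize(split)
        return cache[split]

    return cached

# Stand-in for summarize_split; records how often it is actually called.
calls = []

def fake_summarize(split: str) -> str:
    calls.append(split)
    return f"themes for {split}"

cached_summarize = make_cached(fake_summarize)
first = cached_summarize("workforce")
second = cached_summarize("workforce")  # served from the cache
```

In the notebook, `cached_summarize = make_cached(summarize_split)` could be used in both the printing loop and the save loop.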


In [None]:
# Optional: persist the three splits locally in data/
output_dir = Path("data")
output_dir.mkdir(exist_ok=True)
for name, df in dfs.items():
    dest = output_dir / f"{name}_transcripts.csv"
    df.to_csv(dest, index=False)
    print(f"Wrote {dest}")
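
Once the CSVs exist in `data/`, later sessions can read them locally instead of going back to the Hub. A minimal sketch, assuming the file naming used in the cell above; `load_local_first` is a hypothetical helper, and the demo writes to a temporary directory so it runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd

def load_local_first(
    name: str,
    remote_loader: Callable[[str], pd.DataFrame],
    data_dir: Path = Path("data"),
) -> pd.DataFrame:
    """Read data_dir/<name>_transcripts.csv if present, else call remote_loader."""
    local = data_dir / f"{name}_transcripts.csv"
    if local.exists():
        return pd.read_csv(local)
    return remote_loader(name)

# Demo with a temporary directory and stand-in remote loaders.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pd.DataFrame(
        {"transcript_id": ["demo_0000"], "text": ["Assistant: Hi"], "split": ["demo"]}
    ).to_csv(tmp_dir / "demo_transcripts.csv", index=False)
    found = load_local_first("demo", lambda n: pd.DataFrame(), data_dir=tmp_dir)
    missing = load_local_first(
        "other", lambda n: pd.DataFrame({"split": [n]}), data_dir=tmp_dir
    )
```

In the notebook, `load_local_first("workforce", load_split)` would fall back to the Hub only when the local file is absent.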
            
