Title: The Distress Analysis Interview Corpus of human and computer interviews

Link: https://dcapswoz.ict.usc.edu/

Dataset links (access granted):

https://dcapswoz.ict.usc.edu/wwwdaicwoz/  [Downloaded this dataset]

https://dcapswoz.ict.usc.edu/wwwedaic/ [Not using extended dataset]

# Data description:

The DAIC-WOZ Depression dataset is a subset of the Distress Analysis Interview Corpus containing Wizard-of-Oz clinical interviews, where participants interact with a virtual agent (Ellie) controlled by a human. Each session provides audio, video, and transcripts of the conversations, along with questionnaire responses such as the PHQ-8 for depression assessment.

# How to get the conversation and its label (depression)

Conversation text comes from the XXX_TRANSCRIPT.csv file in each participant’s folder, where XXX is the participant ID. Each file contains time-aligned utterances from both Ellie and the participant. Depression labels are derived from the train_split_Depression_AVEC2017.csv and dev_split_Depression_AVEC2017.csv files in the util/ folder: a PHQ-8 score of 10 or higher indicates depression, and a score below 10 indicates non-depression.
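
As a minimal sketch of the labeling rule above, the helpers below read a split file and a transcript using only the standard library. The column names (`Participant_ID`, `PHQ8_Score`, `speaker`, `value`) and the tab delimiter for transcripts are assumptions about the file layout and should be checked against the downloaded files; the function names are hypothetical.

```python
import csv

# Threshold stated in the text: PHQ-8 >= 10 indicates depression.
PHQ8_THRESHOLD = 10


def phq8_to_label(score):
    """Return 1 (depressed) if the PHQ-8 score is >= 10, else 0."""
    return 1 if int(float(score)) >= PHQ8_THRESHOLD else 0


def load_labels(split_csv_path, id_col="Participant_ID", score_col="PHQ8_Score"):
    """Map participant ID -> binary depression label.

    The column names are assumptions; adjust them to the actual header.
    """
    labels = {}
    with open(split_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            labels[int(row[id_col])] = phq8_to_label(row[score_col])
    return labels


def load_transcript(transcript_path, delimiter="\t"):
    """Return a list of (speaker, utterance) pairs from one transcript.

    Assumes a tab-separated file with 'speaker' and 'value' columns.
    Rows with an empty utterance are skipped.
    """
    with open(transcript_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        return [(row["speaker"], row["value"]) for row in reader if row.get("value")]
```

Calling `load_labels` on both the train and dev split files and merging the two dictionaries would give one label per participant, which can then be matched to transcripts by the ID in the file name.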

In [None]:
import os
import urllib.request
import zipfile

# Base URL for DAIC-WOZ data
BASE_URL = "https://dcapswoz.ict.usc.edu/wwwdaicwoz"

# Output directory where you want transcript CSVs
OUTPUT_DIR = r"D:\Sajjad-Workspace\Datasets\Dataset_6_Distress_Analysis_Interview\DAIC-WOZ"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Temporary directory for downloading zips
TEMP_DIR = os.path.join(OUTPUT_DIR, "tmp")
os.makedirs(TEMP_DIR, exist_ok=True)

# Loop through participant IDs
for pid in range(305, 493):  # IDs 305 through 492 (range end is exclusive)
    zip_name = f"{pid}_P.zip"
    url = f"{BASE_URL}/{zip_name}"
    temp_zip_path = os.path.join(TEMP_DIR, zip_name)
    transcript_name = f"{pid}_TRANSCRIPT.csv"
    output_csv_path = os.path.join(OUTPUT_DIR, transcript_name)

    if os.path.exists(output_csv_path):
        print(f"Already have: {output_csv_path}")
        continue

    try:
        print(f"Downloading {zip_name} ...")
        urllib.request.urlretrieve(url, temp_zip_path)

        # Open zip and look for transcript
        with zipfile.ZipFile(temp_zip_path, "r") as zf:
            found = False
            for member in zf.namelist():
                if member.endswith("_TRANSCRIPT.csv"):
                    with zf.open(member) as src, open(output_csv_path, "wb") as dst:
                        dst.write(src.read())
                    print(f"✔ Extracted {transcript_name}")
                    found = True
                    break
            if not found:
                print(f"✘ Transcript not found in {zip_name}")

        # Remove temp zip after extraction
        os.remove(temp_zip_path)

    except Exception as e:
        print(f"⚠ Error with {zip_name}: {e}")
        if os.path.exists(temp_zip_path):
            os.remove(temp_zip_path)

print("Finished downloading and extracting transcripts.")


Downloading 305_P.zip ...
✔ Extracted 305_TRANSCRIPT.csv
Downloading 306_P.zip ...
✔ Extracted 306_TRANSCRIPT.csv
Downloading 307_P.zip ...
✔ Extracted 307_TRANSCRIPT.csv
Downloading 308_P.zip ...
✔ Extracted 308_TRANSCRIPT.csv
Downloading 309_P.zip ...
✔ Extracted 309_TRANSCRIPT.csv
Downloading 310_P.zip ...
✔ Extracted 310_TRANSCRIPT.csv
Downloading 311_P.zip ...
✔ Extracted 311_TRANSCRIPT.csv
Downloading 312_P.zip ...
✔ Extracted 312_TRANSCRIPT.csv
Downloading 313_P.zip ...
✔ Extracted 313_TRANSCRIPT.csv
Downloading 314_P.zip ...
✔ Extracted 314_TRANSCRIPT.csv
Downloading 315_P.zip ...
✔ Extracted 315_TRANSCRIPT.csv
Downloading 316_P.zip ...
✔ Extracted 316_TRANSCRIPT.csv
Downloading 317_P.zip ...
✔ Extracted 317_TRANSCRIPT.csv
Downloading 318_P.zip ...
✔ Extracted 318_TRANSCRIPT.csv
Downloading 319_P.zip ...
✔ Extracted 319_TRANSCRIPT.csv
Downloading 320_P.zip ...
✔ Extracted 320_TRANSCRIPT.csv
Downloading 321_P.zip ...
✔ Extracted 321_TRANSCRIPT.csv
Downloading 322_P.zip ...
✔ Ext