# Slang Data Processing

This notebook processes *Australian Slang v2_September 5, 2021.csv*, the original data file from the Australian Slang Survey Data collection.

The notebook creates two outputs:
1. A `CSV` folder containing 14 CSV files, one for each of the slang prompts given to the participants. Non-responses are removed, participant IDs are added, and minor character-encoding and formatting fixes are applied.
2. A de-identified version of the original survey data, with seven identifying columns removed.

## Import libraries

In [None]:
import pandas as pd
import re
import os

## Load Australian Slang v2_September 5, 2021.csv as a dataframe

In [None]:
# Load Australian Slang v2_September 5, 2021.csv
df = pd.read_csv('Australian Slang v2_September 5, 2021.csv')

# Show the first few rows
df.head()

## Find the index of each column

In [None]:
for i, col in enumerate(df.columns):
    print(f"{i}: {col}")

## Separate data into prompts

For each prompt, select only the following columns:
- Response ID
- Do you know any typically Australian word or expression for [prompt]?
- What is the word or expression?
- What does this word or expression mean? (Only for prompts where the survey included this question)
- Can you provide an example of the word or expression in use?
- When and where have you heard this word or expression being used?
- Is there anything else you would like to tell us about this word or expression?

In [None]:
verygood = df.iloc[:, [8, 26, 27, 28, 29, 30]]
verybad = df.iloc[:, [8, 31, 32, 33, 34, 35]]
stupid = df.iloc[:, [8, 36, 37, 38, 39, 40]]
attractiveperson = df.iloc[:, [8, 41, 42, 43, 44, 45]]
attractivefemale = df.iloc[:, [8, 46, 47, 48, 49, 50]]
attractivemale = df.iloc[:, [8, 51, 52, 53, 54, 55]]
unattractiveperson = df.iloc[:, [8, 56, 57, 58, 59, 60]]
arrogant = df.iloc[:, [8, 61, 62, 63, 64, 65]]
nonsense = df.iloc[:, [8, 66, 67, 68, 69, 70]]
alcohol = df.iloc[:, [8, 71, 72, 73, 74, 75]]
intoxicated = df.iloc[:, [8, 76, 77, 78, 79, 80]]
doesntdofairshare = df.iloc[:, [8, 81, 82, 83, 84, 85]]
bodypart = df.iloc[:, [8, 86, 87, 88, 89, 90, 91]]          # From here onwards there are 7 columns instead of 6
freechoice_a = df.iloc[:, [8, 92, 93, 94, 95, 96, 97]]
freechoice_b = df.iloc[:, [8, 98, 99, 100, 101, 102, 103]]
freechoice_c = df.iloc[:, [8, 104, 105, 106, 107, 108, 109]]
freechoice_d = df.iloc[:, [8, 110, 111, 112, 113, 114, 115]]
freechoice_e = df.iloc[:, [8, 116, 117, 118, 119, 120, 121]]
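
Because every prompt block is a run of consecutive columns next to the Response ID column, the slices above could also be generated from a start index and a width. The following is a minimal sketch of that idea, not the notebook's method: `select_block` and the toy frame are hypothetical, and the real indices should still be checked against the column listing printed earlier.

```python
import pandas as pd

def select_block(frame, id_col, start, width):
    """Return a copy of the ID column plus `width` consecutive columns from `start`."""
    positions = [id_col] + list(range(start, start + width))
    return frame.iloc[:, positions].copy()

# Toy frame standing in for the survey export (hypothetical column names).
toy = pd.DataFrame([[0, 1, 2, 3, 4, 5, 6]], columns=list("abcdefg"))

block = select_block(toy, id_col=0, start=2, width=3)
print(block.columns.tolist())
```

With the indices used above, `select_block(df, 8, 26, 5)` would correspond to `verygood` and `select_block(df, 8, 86, 6)` to `bodypart`.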

## Insert 'prompt' columns

In [None]:
verygood.insert(loc=1, column='prompt', value='Very Good')
verybad.insert(loc=1, column='prompt', value='Very Bad')
stupid.insert(loc=1, column='prompt', value='Stupid')
attractiveperson.insert(loc=1, column='prompt', value='Attractive Person')
attractivefemale.insert(loc=1, column='prompt', value='Attractive Female')
attractivemale.insert(loc=1, column='prompt', value='Attractive Male')
unattractiveperson.insert(loc=1, column='prompt', value='Unattractive Person')
arrogant.insert(loc=1, column='prompt', value='Arrogant')
nonsense.insert(loc=1, column='prompt', value='Nonsense')
alcohol.insert(loc=1, column='prompt', value='Alcohol')
intoxicated.insert(loc=1, column='prompt', value='Intoxicated')
doesntdofairshare.insert(loc=1, column='prompt', value="Doesn't Do Fair Share")
bodypart.insert(loc=1, column='prompt', value='Body Part')
freechoice_a.insert(loc=1, column='prompt', value='(Free Choice)')
freechoice_b.insert(loc=1, column='prompt', value='(Free Choice)')
freechoice_c.insert(loc=1, column='prompt', value='(Free Choice)')
freechoice_d.insert(loc=1, column='prompt', value='(Free Choice)')
freechoice_e.insert(loc=1, column='prompt', value='(Free Choice)')

## Insert empty 'meaning' columns for prompts that lack one

In [None]:
verygood.insert(loc=4, column='meaning', value='')
verybad.insert(loc=4, column='meaning', value='')
stupid.insert(loc=4, column='meaning', value='')
attractiveperson.insert(loc=4, column='meaning', value='')
attractivefemale.insert(loc=4, column='meaning', value='')
attractivemale.insert(loc=4, column='meaning', value='')
unattractiveperson.insert(loc=4, column='meaning', value='')
arrogant.insert(loc=4, column='meaning', value='')
nonsense.insert(loc=4, column='meaning', value='')
alcohol.insert(loc=4, column='meaning', value='')
intoxicated.insert(loc=4, column='meaning', value='')
doesntdofairshare.insert(loc=4, column='meaning', value='')

## Check each prompt has the same number of columns

In [None]:
print("verygood:", len(verygood.columns))
print("verybad:", len(verybad.columns))
print("stupid:", len(stupid.columns))
print("attractiveperson:", len(attractiveperson.columns))
print("attractivefemale:", len(attractivefemale.columns))
print("attractivemale:", len(attractivemale.columns))
print("unattractiveperson:", len(unattractiveperson.columns))
print("arrogant:", len(arrogant.columns))
print("nonsense:", len(nonsense.columns))
print("alcohol:", len(alcohol.columns))
print("intoxicated:", len(intoxicated.columns))
print("doesntdofairshare:", len(doesntdofairshare.columns))
print("bodypart:", len(bodypart.columns))
print("freechoice_a:", len(freechoice_a.columns))
print("freechoice_b:", len(freechoice_b.columns))
print("freechoice_c:", len(freechoice_c.columns))
print("freechoice_d:", len(freechoice_d.columns))
print("freechoice_e:", len(freechoice_e.columns))
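
The printed counts have to be compared by eye. As a hedged alternative, the same check can be expressed as a single boolean so a mismatch is harder to miss. `same_width` is a hypothetical helper and the small frames are toy data.

```python
import pandas as pd

def same_width(frames):
    """Return True when every DataFrame in `frames` has the same number of columns."""
    return len({len(f.columns) for f in frames}) <= 1

a = pd.DataFrame({"x": [1], "y": [2]})
b = pd.DataFrame({"x": [3], "y": [4]})
c = pd.DataFrame({"x": [5]})

print(same_width([a, b]))
print(same_width([a, b, c]))
```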

## Rename columns

In [None]:
verygood.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
verybad.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
stupid.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
attractiveperson.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
attractivefemale.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
attractivemale.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
unattractiveperson.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
arrogant.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
nonsense.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
alcohol.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
intoxicated.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
doesntdofairshare.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
bodypart.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
freechoice_a.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
freechoice_b.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
freechoice_c.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
freechoice_d.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']
freechoice_e.columns = ['responseID', 'prompt', 'known', 'slang', 'meaning', 'exampleUsage', 'whenWhereUsed', 'additionalNotes']

## Check each prompt has the same column names

In [None]:
# List of all prompt DataFrames
dfs = [
    verygood,
    verybad,
    stupid,
    attractiveperson,
    attractivefemale,
    attractivemale,
    unattractiveperson,
    arrogant,
    nonsense,
    alcohol,
    intoxicated,
    doesntdofairshare,
    bodypart,
    freechoice_a,
    freechoice_b,
    freechoice_c,
    freechoice_d,
    freechoice_e
]

# Use 'frame' rather than 'df' so the loaded survey DataFrame is not overwritten
for i, frame in enumerate(dfs):
    print(f"DataFrame {i} columns: {frame.columns.tolist()}")

## Remove the first two rows of each prompt

In [None]:
verygood = verygood.iloc[2:].reset_index(drop=True)
verybad = verybad.iloc[2:].reset_index(drop=True)
stupid = stupid.iloc[2:].reset_index(drop=True)
attractiveperson = attractiveperson.iloc[2:].reset_index(drop=True)
attractivefemale = attractivefemale.iloc[2:].reset_index(drop=True)
attractivemale = attractivemale.iloc[2:].reset_index(drop=True)
unattractiveperson = unattractiveperson.iloc[2:].reset_index(drop=True)
arrogant = arrogant.iloc[2:].reset_index(drop=True)
nonsense = nonsense.iloc[2:].reset_index(drop=True)
alcohol = alcohol.iloc[2:].reset_index(drop=True)
intoxicated = intoxicated.iloc[2:].reset_index(drop=True)
doesntdofairshare = doesntdofairshare.iloc[2:].reset_index(drop=True)
bodypart = bodypart.iloc[2:].reset_index(drop=True)
freechoice_a = freechoice_a.iloc[2:].reset_index(drop=True)
freechoice_b = freechoice_b.iloc[2:].reset_index(drop=True)
freechoice_c = freechoice_c.iloc[2:].reset_index(drop=True)
freechoice_d = freechoice_d.iloc[2:].reset_index(drop=True)
freechoice_e = freechoice_e.iloc[2:].reset_index(drop=True)

## Recombine prompts to single dataframe

In [None]:
dfs = [
    verygood,
    verybad,
    stupid,
    attractiveperson,
    attractivefemale,
    attractivemale,
    unattractiveperson,
    arrogant,
    nonsense,
    alcohol,
    intoxicated,
    doesntdofairshare,
    bodypart,
    freechoice_a,
    freechoice_b,
    freechoice_c,
    freechoice_d,
    freechoice_e
]

combined_df = pd.concat(dfs, axis=0, ignore_index=True)

print(f"Contents of 'prompt' column: {combined_df['prompt'].unique()}")

## Remove rows where 'slang' is empty

In [None]:
combined_df = combined_df[combined_df['slang'].notna() & (combined_df['slang'].str.strip() != '')].reset_index(drop=True)
combined_df = combined_df.drop(columns='known')
combined_df.head(15)
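
To illustrate what this filter keeps, here is a small sketch on made-up values (not survey data). Both missing values and whitespace-only strings are treated as non-responses.

```python
import pandas as pd

toy = pd.DataFrame({
    "slang": ["ripper", None, "   ", "bonzer"],
    "known": ["Yes", "No", "Yes", "Yes"],
})

# Keep rows whose 'slang' is neither missing nor blank after stripping.
mask = toy["slang"].notna() & (toy["slang"].str.strip() != "")
kept = toy[mask].reset_index(drop=True)
print(kept["slang"].tolist())
```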

## Add participantID to dataframe using responseID as a key

In [None]:
# Load the key
key_df = pd.read_csv('response_participant_key.txt', sep='\t', header=None, names=['responseID', 'participantID'])

# Merge with combined_df
combined_df = key_df.merge(combined_df, on='responseID', how='right')

# Move participantID to the first column
cols = combined_df.columns.tolist()
cols.insert(0, cols.pop(cols.index('participantID')))
combined_df = combined_df[cols]

# Check rows where participantID is missing
missing_participantID = combined_df[combined_df['participantID'].isna()]
print(f"Number of rows with missing participantID: {len(missing_participantID)}")

combined_df.head()
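
A right merge leaves `participantID` empty for responses that are absent from the key, and it would duplicate rows if the key listed a `responseID` more than once. The sketch below shows how the `validate` and `indicator` options of `DataFrame.merge` can surface both cases. The IDs and slang values are toy data.

```python
import pandas as pd

key = pd.DataFrame({"responseID": ["R_1", "R_2"], "participantID": ["P01", "P02"]})
responses = pd.DataFrame({
    "responseID": ["R_1", "R_1", "R_3"],
    "slang": ["ripper", "bonzer", "arvo"],
})

# validate="one_to_many" raises MergeError if a responseID repeats in the key.
merged = key.merge(responses, on="responseID", how="right",
                   validate="one_to_many", indicator=True)

# Rows marked 'right_only' had no match in the key.
unmatched = merged[merged["_merge"] == "right_only"]
print(unmatched["responseID"].tolist())
```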

## Get counts for responses per prompt, total responses and total unique participants

In [None]:
print(f"Very Good: {len(combined_df[combined_df['prompt'] == 'Very Good'])} rows")
print(f"Very Bad: {len(combined_df[combined_df['prompt'] == 'Very Bad'])} rows")
print(f"Stupid: {len(combined_df[combined_df['prompt'] == 'Stupid'])} rows")
print(f"Attractive Person: {len(combined_df[combined_df['prompt'] == 'Attractive Person'])} rows")
print(f"Attractive Female: {len(combined_df[combined_df['prompt'] == 'Attractive Female'])} rows")
print(f"Attractive Male: {len(combined_df[combined_df['prompt'] == 'Attractive Male'])} rows")
print(f"Unattractive Person: {len(combined_df[combined_df['prompt'] == 'Unattractive Person'])} rows")
print(f"Arrogant: {len(combined_df[combined_df['prompt'] == 'Arrogant'])} rows")
print(f"Nonsense: {len(combined_df[combined_df['prompt'] == 'Nonsense'])} rows")
print(f"Alcohol: {len(combined_df[combined_df['prompt'] == 'Alcohol'])} rows")
print(f"Intoxicated: {len(combined_df[combined_df['prompt'] == 'Intoxicated'])} rows")
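# Reusing double quotes inside a double-quoted f-string is a SyntaxError before Python 3.12
fair_share = "Doesn't Do Fair Share"
print(f"{fair_share}: {len(combined_df[combined_df['prompt'] == fair_share])} rows")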
print(f"Body Part: {len(combined_df[combined_df['prompt'] == 'Body Part'])} rows")
print(f"(Free Choice): {len(combined_df[combined_df['prompt'] == '(Free Choice)'])} rows")

total_count = len(combined_df['participantID'])
unique_count = combined_df['participantID'].nunique()

print(f"Total responses: {total_count}")
print(f"Unique participants: {unique_count}")
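
The per-prompt counts can also be obtained in one call with `value_counts`. A minimal sketch on toy data, assuming the same `prompt` and `participantID` column names:

```python
import pandas as pd

toy = pd.DataFrame({
    "participantID": ["P01", "P01", "P02", "P03"],
    "prompt": ["Very Good", "Alcohol", "Very Good", "Very Good"],
})

counts = toy["prompt"].value_counts()  # rows per prompt, most frequent first
print(counts)
print("Total responses:", len(toy))
print("Unique participants:", toy["participantID"].nunique())
```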

## Clean characters in dataframe

In [None]:
replacements = {
    '“': '"',
    '”': '"',
    '‘': "'",
    '’': "'",
    '`': "'",
    'Ý': 'Y',
    'ç': 'c',
    'ï': 'i',
    'ó': 'o',
    'ķ': 'k',
    '—': '-',
    '  ': ' ',
    ' ,': ',',
    ' ?': '?',
    ' !': '!',
    ' .': '.',
    '[ ': '[',
    '( ': '(',
    '{ ': '{',
    ' ]': ']',
    ' )': ')',
    ' }': '}',
}

def clean_text(val):
    if isinstance(val, str):
        val = val.strip()
        for old, new in replacements.items():
            val = val.replace(old, new)
        # Add space after comma if missing
        val = re.sub(r',([^\s])', r', \1', val)
    return val

combined_df_clean = combined_df.map(clean_text)
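
`DataFrame.map` was introduced in pandas 2.1. Earlier releases provide the same element-wise behaviour under the name `DataFrame.applymap`. The sketch below shows one possible fallback for older environments. `elementwise` is a hypothetical helper name, and the cleaning function is a simplified stand-in for `clean_text`.

```python
import pandas as pd

def elementwise(frame, func):
    """Apply `func` to every cell, using DataFrame.map when available."""
    if hasattr(frame, "map"):
        return frame.map(func)
    return frame.applymap(func)

toy = pd.DataFrame({"slang": [" arvo ", "servo"], "n": [1, 2]})
cleaned = elementwise(toy, lambda v: v.strip() if isinstance(v, str) else v)
print(cleaned["slang"].tolist())
```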

## Create CSV folder with cleaned files per prompt

In [None]:
# Create a CSV folder if it doesn't exist
os.makedirs("CSV", exist_ok=True)

# Filter rows for each prompt
# Exact matches are used (as in the counts above) so that no label is treated as a regex or substring
verygood_clean = combined_df_clean[combined_df_clean['prompt'] == "Very Good"]
verybad_clean = combined_df_clean[combined_df_clean['prompt'] == "Very Bad"]
stupid_clean = combined_df_clean[combined_df_clean['prompt'] == "Stupid"]
attractiveperson_clean = combined_df_clean[combined_df_clean['prompt'] == "Attractive Person"]
attractivefemale_clean = combined_df_clean[combined_df_clean['prompt'] == "Attractive Female"]
attractivemale_clean = combined_df_clean[combined_df_clean['prompt'] == "Attractive Male"]
unattractiveperson_clean = combined_df_clean[combined_df_clean['prompt'] == "Unattractive Person"]
arrogant_clean = combined_df_clean[combined_df_clean['prompt'] == "Arrogant"]
nonsense_clean = combined_df_clean[combined_df_clean['prompt'] == "Nonsense"]
alcohol_clean = combined_df_clean[combined_df_clean['prompt'] == "Alcohol"]
intoxicated_clean = combined_df_clean[combined_df_clean['prompt'] == "Intoxicated"]
doesntdofairshare_clean = combined_df_clean[combined_df_clean['prompt'] == "Doesn't Do Fair Share"]
bodypart_clean = combined_df_clean[combined_df_clean['prompt'] == "Body Part"]
freechoice_clean = combined_df_clean[combined_df_clean['prompt'] == "(Free Choice)"]

# Export files to CSV
verygood_clean.to_csv("CSV/Item01-VeryGood.csv", index=False)
verybad_clean.to_csv("CSV/Item02-VeryBad.csv", index=False)
stupid_clean.to_csv("CSV/Item03-Stupid.csv", index=False)
attractiveperson_clean.to_csv("CSV/Item04-AttractivePerson.csv", index=False)
attractivefemale_clean.to_csv("CSV/Item05-AttractiveFemale.csv", index=False)
attractivemale_clean.to_csv("CSV/Item06-AttractiveMale.csv", index=False)
unattractiveperson_clean.to_csv("CSV/Item07-UnattractivePerson.csv", index=False)
arrogant_clean.to_csv("CSV/Item08-Arrogant.csv", index=False)
nonsense_clean.to_csv("CSV/Item09-Nonsense.csv", index=False)
alcohol_clean.to_csv("CSV/Item10-Alcohol.csv", index=False)
intoxicated_clean.to_csv("CSV/Item11-Intoxicated.csv", index=False)
doesntdofairshare_clean.to_csv("CSV/Item12-DoesntDoFairShare.csv", index=False)
bodypart_clean.to_csv("CSV/Item13-BodyPart.csv", index=False)
freechoice_clean.to_csv("CSV/Item14-FreeChoice.csv", index=False)
print('Files created in CSV folder.')

print('File summary:')
print(f"Very Good: {len(verygood_clean)} rows")
print(f"Very Bad: {len(verybad_clean)} rows")
print(f"Stupid: {len(stupid_clean)} rows")
print(f"Attractive Person: {len(attractiveperson_clean)} rows")
print(f"Attractive Female: {len(attractivefemale_clean)} rows")
print(f"Attractive Male: {len(attractivemale_clean)} rows")
print(f"Unattractive Person: {len(unattractiveperson_clean)} rows")
print(f"Arrogant: {len(arrogant_clean)} rows")
print(f"Nonsense: {len(nonsense_clean)} rows")
print(f"Alcohol: {len(alcohol_clean)} rows")
print(f"Intoxicated: {len(intoxicated_clean)} rows")
print(f"Doesn't Do Fair Share: {len(doesntdofairshare_clean)} rows")
print(f"Body Part: {len(bodypart_clean)} rows")
print(f"(Free Choice): {len(freechoice_clean)} rows")
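
The filter, export, and summary steps repeat the same pattern for every prompt, so they could also be driven by a mapping from prompt label to file name. This is a sketch under stated assumptions rather than the notebook's approach: the data is a toy frame, only two of the fourteen files are listed, and a temporary directory is used so the real `CSV` folder is left untouched.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "prompt": ["Very Good", "Very Bad", "Very Good"],
    "slang": ["ripper", "crook", "bonzer"],
})

# Prompt label -> output file name (only two of the fourteen shown here).
files = {"Very Good": "Item01-VeryGood.csv", "Very Bad": "Item02-VeryBad.csv"}

out_dir = tempfile.mkdtemp()  # temporary folder for the sketch
row_counts = {}
for prompt, name in files.items():
    subset = toy[toy["prompt"] == prompt]
    subset.to_csv(os.path.join(out_dir, name), index=False)
    row_counts[prompt] = len(subset)

print(row_counts)
```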

## Create de-identified CSV version of original data

In [None]:
# Load Australian Slang v2_September 5, 2021.csv
orig = pd.read_csv('Australian Slang v2_September 5, 2021.csv')

# Show the first few rows
orig.head()

## Remove identifying columns

Columns removed from *Australian Slang v2_September 5, 2021.csv*:
- 3: IPAddress
- 9: RecipientLastName
- 10: RecipientFirstName
- 11: RecipientEmail
- 13: LocationLatitude
- 14: LocationLongitude
- 25: If you would be willing to answer further questions, please provide an emailaddress.

In [None]:
data = orig.drop(orig.columns[[3, 9, 10, 11, 13, 14, 25]], axis=1)
data.head()
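
Dropping by position removes whatever happens to sit at those indices, so a changed export layout could leave an identifying column in place. The sketch below checks each position against its expected name before dropping. `drop_checked` is a hypothetical helper, and `StartDate` and `Q1` are made-up column names in a toy frame.

```python
import pandas as pd

def drop_checked(frame, expected):
    """Drop columns by position, but only if each position holds the expected name."""
    for pos, name in expected.items():
        actual = frame.columns[pos]
        if actual != name:
            raise ValueError(f"Column {pos} is {actual!r}, expected {name!r}")
    return frame.drop(frame.columns[list(expected)], axis=1)

toy = pd.DataFrame([[1, "1.2.3.4", "x"]], columns=["StartDate", "IPAddress", "Q1"])
safe = drop_checked(toy, {1: "IPAddress"})
print(safe.columns.tolist())
```

For the survey file, the mapping would start `{3: "IPAddress", 9: "RecipientLastName", ...}`, following the list above.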

## List columns after selected ones were removed

In [None]:
for i, col in enumerate(data.columns):
    print(f"{i}: {col}")

## Export to de-identified CSV

In [None]:
data.to_csv("Australian Slang v2_September 5, 2021_deidentified.csv", index=False)