# Generate training datasets divided by race/ethnicity

In [1]:
import pandas as pd
from datasets import load_dataset, Dataset, Value, ClassLabel, Features, DatasetDict

In [2]:
DATA = '../data/'
seed = 42

## Convert the raw data so there is one essay per row

In [3]:
raw_df = pd.read_csv(DATA + 'persuade_corpus_raw.csv', low_memory=False, index_col=0)

In [4]:
raw_df[raw_df['task'] == 'Text dependent'].prompt_name.value_counts()

Does the electoral college work?    22306
Car-free cities                     20834
Driverless cars                     20130
Facial action coding system         19255
Exploring Venus                     15628
The Face on Mars                    15027
"A Cowboy Who Rode the Waves"       12963
Name: prompt_name, dtype: int64

In [5]:
raw_df.task.value_counts()

Independent       159240
Text dependent    126143
Name: task, dtype: int64

In [6]:
# Keep only the essay-level columns; dropping duplicate rows then leaves one row per essay
essays_df = raw_df[['essay_id', 'holistic_score_1', 'holistic_score_2', 'holistic_score_adjudicated',
                    'full_text', 'source', 'source_text', 'prompt_name', 'task', 'assignment', 'gender',
                    'grade', 'ell', 'race_ethnicity', 'economically_disadvantaged',
                    'student_disability_status']].drop_duplicates()
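
The effect of selecting essay-level columns and then calling drop_duplicates can be illustrated on a toy frame. This is only a sketch: it assumes the raw file repeats the essay-level fields across several rows per essay, and `toy_raw` and `segment_text` are hypothetical names that do not come from the corpus.

```python
import pandas as pd

# Toy stand-in for the raw corpus: several rows per essay, essay-level fields repeated.
toy_raw = pd.DataFrame({
    'essay_id': ['A', 'A', 'A', 'B', 'B'],
    'segment_text': ['a1', 'a2', 'a3', 'b1', 'b2'],  # hypothetical row-level column
    'full_text': ['essay A', 'essay A', 'essay A', 'essay B', 'essay B'],
    'holistic_score_adjudicated': [3, 3, 3, 4, 4],
})

# Project onto the essay-level columns, then drop the now-identical rows.
toy_essays = toy_raw[['essay_id', 'full_text', 'holistic_score_adjudicated']].drop_duplicates()

# Each essay_id should appear exactly once after deduplication.
assert toy_essays['essay_id'].is_unique
print(len(toy_raw), '->', len(toy_essays))  # 5 -> 2
```

If any selected column varied within an essay, drop_duplicates would keep more than one row for that essay, so a uniqueness check on essay_id is a cheap safeguard.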

In [7]:
essays_df['race_ethnicity'].value_counts()

White                             11571
Hispanic/Latino                    6560
Black/African American             4959
Asian/Pacific Islander             1743
Two or more races/Other            1022
American Indian/Alaskan Native      141
Name: race_ethnicity, dtype: int64

In [8]:
essays_df.to_csv(DATA + 'persuade_corpus.csv')

In [9]:
essays_df['full_text'] = essays_df['full_text'].apply(lambda x: x.strip())

## Sample based on race/ethnicity

In [10]:
# Draw 4,000 essays per group for the three group-specific training sets
white_sample = essays_df[essays_df['race_ethnicity'] == 'White'].sample(n=4000, random_state=seed)
baa_sample = essays_df[essays_df['race_ethnicity'] == 'Black/African American'].sample(n=4000, random_state=seed)
hl_sample = essays_df[essays_df['race_ethnicity'] == 'Hispanic/Latino'].sample(n=4000, random_state=seed)
# Control set: 1,333 essays drawn from each group sample (3,999 in total), so it is a subset of the samples above
control_sample = pd.concat([white_sample.sample(n=1333, random_state=seed),
                            baa_sample.sample(n=1333, random_state=seed),
                            hl_sample.sample(n=1333, random_state=seed)], axis=0)
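
A small sanity check of this sampling scheme, run on toy data rather than the real corpus, might look like the following. The toy frame and the sizes (6 per group, 2 per group for the control) are assumptions chosen only to keep the example short.

```python
import pandas as pd

seed = 42
groups = ['White', 'Black/African American', 'Hispanic/Latino']

# Toy corpus: 10 essays per group.
toy = pd.DataFrame({
    'essay_id': range(30),
    'race_ethnicity': [g for g in groups for _ in range(10)],
})

# One sample per group, mirroring the per-group training samples.
group_samples = {
    g: toy[toy['race_ethnicity'] == g].sample(n=6, random_state=seed)
    for g in groups
}

# Control set: an equal-sized subsample of each group sample.
toy_control = pd.concat(
    [s.sample(n=2, random_state=seed) for s in group_samples.values()], axis=0
)

# The control set is balanced and drawn entirely from the group samples.
all_sampled = pd.concat(list(group_samples.values()), axis=0).index
assert toy_control.index.isin(all_sampled).all()
print(toy_control['race_ethnicity'].value_counts().to_dict())
```

Because the control rows come from the group samples, they also appear in the group-specific sets; that overlap matters if the four sets are later concatenated.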

### Remove the training data from the test set

In [11]:
indexes_to_remove = white_sample.index.to_list() + baa_sample.index.to_list() + hl_sample.index.to_list()
len(indexes_to_remove)

12000

In [12]:
essays_without_training = essays_df.drop(index=indexes_to_remove)
print(len(essays_df)-len(essays_without_training))

12000
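
The length difference confirms that 12,000 rows were dropped. A stricter check is that the remaining index shares no labels with the training index. The sketch below shows that check on a toy frame; it assumes the index labels are unique, since drop(index=...) removes every row carrying a given label.

```python
import pandas as pd

# Toy frame with a non-default index, standing in for essays_df.
toy = pd.DataFrame({'essay_id': list('abcdefgh')}, index=range(100, 108))
toy_train = toy.sample(n=3, random_state=42)

# Remove the training rows by index label.
toy_remaining = toy.drop(index=toy_train.index)

# With a unique index, the number of dropped rows equals the training size ...
assert toy.index.is_unique
assert len(toy) - len(toy_remaining) == len(toy_train)
# ... and no training label survives in the remaining rows.
assert toy_remaining.index.intersection(toy_train.index).empty
```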


### Downsample to get a test set

In [13]:
race_ethnicities = ['White', 'Hispanic/Latino', 'Black/African American',
                    'Asian/Pacific Islander', 'Two or more races/Other']
df_list = [essays_without_training[essays_without_training['race_ethnicity'] == race].sample(n=959, random_state=seed)
           for race in race_ethnicities]
test_df = pd.concat(df_list, axis=0)
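
An equivalent way to express per-group downsampling is DataFrameGroupBy.sample (available since pandas 1.1). The sketch below uses toy data and is not a drop-in replacement: it samples every group present, so groups that should be excluded would have to be filtered out first, and it will not necessarily select the same rows as the list comprehension above.

```python
import pandas as pd

# Toy frame with unequal group sizes.
toy = pd.DataFrame({
    'essay_id': range(12),
    'race_ethnicity': ['White'] * 5 + ['Hispanic/Latino'] * 4 + ['Asian/Pacific Islander'] * 3,
})

# Without replacement, n cannot exceed the smallest group size.
n_per_group = int(toy['race_ethnicity'].value_counts().min())

# Draw the same number of rows from every group in one call.
toy_test = toy.groupby('race_ethnicity').sample(n=n_per_group, random_state=42)

print(toy_test['race_ethnicity'].value_counts().to_dict())
```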

## Build the four datasets

Note: the "test" split returned by train_test_split below serves as a development/validation set, so it is stored under the 'valid' key.

In [16]:
def buildDataset(df):
    # Copy the two needed columns so the cast below does not write to a slice of the caller's DataFrame
    df = df[['full_text', 'holistic_score_adjudicated']].copy()
    df['holistic_score_adjudicated'] = df['holistic_score_adjudicated'].astype(float)
    df.columns = ['text', 'label']
    full_dataset = Dataset.from_pandas(df, preserve_index=False)
    # 80% train, 20% validation
    train_valid = full_dataset.train_test_split(test_size=0.2, seed=seed)
    # Collect both splits in a single DatasetDict
    final_dataset = DatasetDict({
        'train': train_valid['train'],
        'valid': train_valid['test']})
    return final_dataset

In [17]:
w_ds = buildDataset(white_sample)
baa_ds = buildDataset(baa_sample)
hl_ds = buildDataset(hl_sample)
control_ds = buildDataset(control_sample)
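
With test_size=0.2, the expected split sizes can be worked out by hand. The helper below is a hypothetical sketch, not part of the datasets library; it assumes the held-out count is rounded up and the remainder goes to training.

```python
import math

def split_sizes(n_rows: int, test_size: float = 0.2) -> tuple[int, int]:
    """Hypothetical helper: (train, valid) sizes, assuming the held-out count is rounded up."""
    n_valid = math.ceil(n_rows * test_size)
    return n_rows - n_valid, n_valid

print(split_sizes(4000))  # (3200, 800)  group-specific samples
print(split_sizes(3999))  # (3199, 800)  control sample (3 x 1333 rows)
```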

## Save everything to file

In [19]:
white_sample['condition'] = 'test_w'
baa_sample['condition'] = 'test_baa'
hl_sample['condition'] = 'test_hl'
control_sample['condition'] = 'control'


train_df = pd.concat([white_sample, baa_sample, hl_sample, control_sample], axis=0)
train_df

Unnamed: 0,essay_id,holistic_score_1,holistic_score_2,holistic_score_adjudicated,full_text,source,source_text,prompt_name,task,assignment,gender,grade,ell,race_ethnicity,economically_disadvantaged,student_disability_status,condition
136345,AAAOPP13416000019618,3,3,3,Have you ever wondered if the face on Mars was...,Indiana,"""Unmasking the Face on Mars""",The Face on Mars,Text dependent,You have read the article 'Unmasking the Face ...,M,8,No,White,Not economically disadvantaged,Not identified as having disability,test_w
251900,AAAXMP138200001590842125_OR,4,5,5,In the 21st century humanity has entered what ...,Virginia,,Distance learning,Independent,Some schools offer distance learning as an opt...,M,11,No,White,Not economically disadvantaged,Not identified as having disability,test_w
97053,AAAVUP14319000075672,2,3,2,The author supports the suggestion that studyi...,Indiana,"""The Challenge of Exploring Venus""",Exploring Venus,Text dependent,"In ""The Challenge of Exploring Venus,"" the aut...",M,10,No,White,Economically disadvantaged,Not identified as having disability,test_w
169024,2171006195,4,3,3,"School sports have always been very important,...",NCES,,Grades for extracurricular activities,Independent,Your principal is considering changing school ...,M,8,No,White,Not economically disadvantaged,Not identified as having disability,test_w
120358,AAATRP14318000643203,2,2,2,i think that it could be used in class rooms s...,Indiana,"""Making Mona Lisa Smile"", by Nick D'Alto",Facial action coding system,Text dependent,"In the article ""Making Mona Lisa Smile,"" the a...",M,10,No,White,Economically disadvantaged,Identified as having disability,test_w
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130384,AAAOPP13416000079387,4,4,4,Unmaskin the Face on Mars\n\nHave you seen the...,Indiana,"""Unmasking the Face on Mars""",The Face on Mars,Text dependent,You have read the article 'Unmasking the Face ...,F,8,No,Hispanic/Latino,Economically disadvantaged,Not identified as having disability,control
100907,AAAVUP14319000036701,3,3,3,Studying Venus is worthy pursuit despiting the...,Indiana,"""The Challenge of Exploring Venus""",Exploring Venus,Text dependent,"In ""The Challenge of Exploring Venus,"" the aut...",M,10,Yes,Hispanic/Latino,Economically disadvantaged,Not identified as having disability,control
267695,AAAXMP138200002033152810_OR,3,3,3,"When you need advice who do you go to, do you ...",Virginia,,Seeking multiple opinions,Independent,"When people ask for advice, they sometimes tal...",M,8,No,Hispanic/Latino,Economically disadvantaged,Not identified as having disability,control
17286,5075042,1,1,1,Driver good is very important for all people. ...,Florida,"Source 1: ""In German Suburb, Life Goes On With...",Car-free cities,Text dependent,Write an explanatory essay to inform fellow ci...,M,10,Yes,Hispanic/Latino,,Not identified as having disability,control


In [20]:
w_ds.save_to_disk(DATA + "w_ds.hf")
baa_ds.save_to_disk(DATA + "baa_ds.hf")
hl_ds.save_to_disk(DATA + "hl_ds.hf")
control_ds.save_to_disk(DATA + "control_ds.hf")
test_df.to_csv(DATA + "test_df.csv")
train_df.to_csv(DATA + "train_df.csv")

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 3200
    })
    valid: Dataset({
        features: ['text', 'label'],
        num_rows: 800
    })
})
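
Because test_df.csv and train_df.csv are written with the DataFrame index as their first column, reading them back without index_col=0 yields an extra 'Unnamed: 0' column. The round trip below is a sketch on a toy frame written to a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a meaningful, unnamed index (like the original row labels).
toy_test = pd.DataFrame({'essay_id': ['x', 'y'], 'label': [3.0, 4.0]},
                        index=[136345, 251900])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_test_df.csv')  # hypothetical file name
    toy_test.to_csv(path)                            # the index is written as the first column
    with_index = pd.read_csv(path, index_col=0)      # restores the index
    without_index = pd.read_csv(path)                # index becomes a regular column

print(list(with_index.columns))     # ['essay_id', 'label']
print(list(without_index.columns))  # ['Unnamed: 0', 'essay_id', 'label']
```

The saved Hugging Face datasets are unaffected by this, since buildDataset passes preserve_index=False when converting from pandas.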