# Merging Checkpoints

As you can see from the scripts included in this project, we ended up batching the comparisons between our keyword utterances ($k \in K$) and our context utterances ($c \in C$). In part, this was to reduce the noise in the office where the tower is kept while our tests run.

The following scripts are designed to stitch those pieces back together again, largely using the CEDA object/framework to do so.
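
As a rough illustration of what batching the $K \times C$ comparisons means, here is a minimal sketch. The function name `batched_pairs` and the batch size are hypothetical; the project's actual batching lives in its own scripts and may be organized differently.

```python
from itertools import islice, product


def batched_pairs(keywords, contexts, batch_size):
    """Yield lists of (k, c) pairs, with at most batch_size pairs per list."""
    pairs = product(keywords, contexts)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch


# Toy stand-ins for the keyword and context utterances.
K = ["k1", "k2"]
C = ["c1", "c2", "c3"]

batches = list(batched_pairs(K, C, batch_size=4))

# Stitching the batches back together recovers every (k, c) pair once.
merged = [pair for batch in batches for pair in batch]
```

Each batch would correspond to one checkpoint, and merging is just concatenation in batch order.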

In [1]:
from shared.CEDA import ceda_model
from tqdm import tqdm
import pandas as pd
import numpy as np
import json
import os

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/zacharyrosen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
CKPT_PATH = 'data/ckpts'
RAW_PATH = 'data/raw'
OUT_PATH = 'data/results'
OUT_NAME = 'ceda-results.csv'

In [3]:
df = []

In [4]:
mod = ceda_model()

files = [os.path.join(CKPT_PATH, f) for f in os.listdir(CKPT_PATH)]
for f in tqdm(files):
    mod.load_from_checkpoint(f)
    df += [mod.graph_df(residualize=False)]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 199/199 [01:36<00:00,  2.05it/s]


In [5]:
df = pd.concat(df, ignore_index=True)
df.head()

Unnamed: 0,x_submission_id,x_submission_created_at,x_comment_created_at,x_comment_id,x_user,x_tag,x_line_no,y_submission_id,y_submission_created_at,y_comment_created_at,y_comment_id,y_user,y_tag,y_line_no,nx,ny,Hxy,Hyx
0,1b11mg2,1709004000.0,1709073000.0,ksg240g,Crea8talife,pro_life,9,49048d,1457137000.0,1457137000.0,49048d,RebeccaHeinicke,,5320,93.0,11.0,9.474005,0.926241
1,1b11mg2,1709004000.0,1709073000.0,ksg240g,Crea8talife,pro_life,9,49048d,1457137000.0,1457137000.0,49048d,RebeccaHeinicke,,6682,93.0,11.0,9.474005,0.926241
2,1fi412k,1726492000.0,1726493000.0,lneku7u,StonkSalty,pro_life,16,49048d,1457137000.0,1457137000.0,49048d,RebeccaHeinicke,,5320,10.0,11.0,0.791789,0.93118
3,1fi412k,1726492000.0,1726493000.0,lneku7u,StonkSalty,pro_life,16,49048d,1457137000.0,1457137000.0,49048d,RebeccaHeinicke,,6682,10.0,11.0,0.791789,0.93118
4,1fi412k,1726492000.0,1726496000.0,lneslb8,crochet-fae,forced_birth|pro_life,17,49048d,1457137000.0,1457137000.0,49048d,RebeccaHeinicke,,5320,22.0,11.0,2.097525,0.899163
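
The merge itself is a plain concatenation of the per-checkpoint frames. A minimal sketch on made-up frames (the column values are invented) shows why `ignore_index=True` is used: without it, every batch keeps its own row labels and the merged index contains repeats.

```python
import pandas as pd

# Two toy "checkpoint" frames, each with its own 0-based index.
batch_a = pd.DataFrame({"x_line_no": [9, 16], "y_line_no": [5320, 5320], "Hxy": [9.47, 0.79]})
batch_b = pd.DataFrame({"x_line_no": [17], "y_line_no": [5320], "Hxy": [2.10]})

# Keeping the original labels produces a repeated index (0, 1, 0).
kept_labels = pd.concat([batch_a, batch_b])

# Renumbering gives one clean running index (0, 1, 2).
merged = pd.concat([batch_a, batch_b], ignore_index=True)
```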


Stupidly, I left out some crucial information for ascertaining whether $x$ and $y$ (i.e. $k$ and $c$) are in the same context: the parent comments for $x$ and $y$. To get those, I'm adding in the following script.

In [6]:
dfc = pd.read_csv(os.path.join(RAW_PATH, 'corpus-localcontext.csv'))
dfc['parent_id_'] = [pid.split('_')[-1] for pid in tqdm(dfc['parent_id'].values)]

# conversion to get parent ids from the line number
conversion = {line_no: dfc['parent_id'].loc[line_no] for line_no in dfc.index}

# conversion to get when the comment was created from parent comment ids
#  used to get created at time for parent comments
parent_created_at_conversion = {cid: dfc['comment_created_at'].loc[dfc['comment_id'].isin([cid])].values[0] for cid in dfc['comment_id'].unique()}

# conversion to get comment ups from comment id
comment_ups_conversion = {cid: comment_ups for cid, comment_ups in dfc[['comment_id', 'comment_ups']].values}

# conversion to get all tags associated with a parent_id.
parent_tags = {
    pid: '|'.join(dfc['tag'].loc[dfc['parent_id_'].isin([pid]) & ~dfc['tag'].isna()])
    for pid in dfc['parent_id_'].loc[~dfc['tag'].isna()].unique()
}

100%|██████████| 14007/14007 [00:00<00:00, 2317082.08it/s]


In [7]:
df['x_parent_id'] = [conversion[line_no] for line_no in tqdm(df['x_line_no'].values)]

100%|██████████| 12174954/12174954 [00:03<00:00, 3081964.14it/s]


In [8]:
df['y_parent_id'] = [conversion[line_no] for line_no in tqdm(df['y_line_no'].values)]

100%|██████████| 12174954/12174954 [00:03<00:00, 3350534.42it/s]


In [9]:
df['x_comment_ups'] = [comment_ups_conversion[cid] for cid in tqdm(df['x_comment_id'].values)]

100%|██████████| 12174954/12174954 [00:02<00:00, 4466861.74it/s]


In [10]:
df['y_comment_ups'] = [comment_ups_conversion[cid] for cid in tqdm(df['y_comment_id'].values)]

100%|██████████| 12174954/12174954 [00:02<00:00, 4570929.35it/s]
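
The four lookups above could also be written with `Series.map`, which avoids building Python lists row by row. This is only a sketch on a made-up corpus, not a drop-in replacement: unlike the dictionary lookups, `map` yields `NaN` for a missing key instead of raising a `KeyError`.

```python
import pandas as pd

# Toy corpus; the ids, parents and vote counts are invented for illustration.
corpus = pd.DataFrame({
    "comment_id": ["c1", "c2", "c3"],
    "parent_id": ["ROOT", "t1_c1", "t1_c1"],
    "comment_ups": [5, 2, 7],
})
pairs = pd.DataFrame({"x_line_no": [1, 2], "x_comment_id": ["c2", "c3"]})

# Line number -> parent id: mapping with a Series looks values up by its index.
pairs["x_parent_id"] = pairs["x_line_no"].map(corpus["parent_id"])

# Comment id -> upvotes; keep="last" mirrors how a dict comprehension
# lets the last occurrence of a repeated key win.
ups_by_comment = (
    corpus.drop_duplicates("comment_id", keep="last")
    .set_index("comment_id")["comment_ups"]
)
pairs["x_comment_ups"] = pairs["x_comment_id"].map(ups_by_comment)
```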


I also want to create a context label and select a timestamp for when each context begins.

In [11]:
df['x_parent_id_'] = [pid.split('_')[-1] for pid in tqdm(df['x_parent_id'].values)]
df['y_parent_id_'] = [pid.split('_')[-1] for pid in tqdm(df['y_parent_id'].values)]

# top-level comments (parent is ROOT) act as their own parent;
# assign through a single .loc call so the write lands on df itself
x_root = df['x_parent_id_'].isin(['ROOT'])
y_root = df['y_parent_id_'].isin(['ROOT'])
df.loc[x_root, 'x_parent_id_'] = df.loc[x_root, 'x_comment_id']
df.loc[y_root, 'y_parent_id_'] = df.loc[y_root, 'y_comment_id']

df['x_context_id'] = df['x_parent_id_'].values
df['y_context_id'] = None
df['same_context'] = False

100%|██████████| 12174954/12174954 [00:04<00:00, 2924297.05it/s]
100%|██████████| 12174954/12174954 [00:04<00:00, 2944435.93it/s]
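
To make the intent of this step concrete, here is a small sketch on invented ids: a top-level comment (parent `ROOT`) becomes its own context, while a reply inherits its parent's id. The frame `df_toy` is hypothetical. The assignment goes through one `df.loc[mask, column]` call, which is the form the pandas indexing guide recommends for writing into the original frame.

```python
import pandas as pd

# Toy frame: c1 is a top-level comment, c2 replies to c1.
df_toy = pd.DataFrame({
    "x_comment_id": ["c1", "c2"],
    "x_parent_id": ["ROOT", "t1_c1"],
})

# Strip any prefix before the last underscore, as in the cell above.
df_toy["x_parent_id_"] = df_toy["x_parent_id"].str.split("_").str[-1]

# Top-level comments take their own id as the parent/context id.
is_root = df_toy["x_parent_id_"] == "ROOT"
df_toy.loc[is_root, "x_parent_id_"] = df_toy.loc[is_root, "x_comment_id"]
```

Both rows end up pointing at `c1`, so the comment and its reply share one context id.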


In [12]:
# get children and label context
sel = df['x_comment_id'] == df['y_parent_id_']
df['cc_is_child'] = sel
df.loc[sel, 'y_context_id'] = df.loc[sel, 'x_context_id']
# df.loc[sel, 'y_tag'] = df.loc[sel, 'x_tag']
df.loc[sel, 'same_context'] = True


In [13]:
# get siblings and label context
sel = df['x_parent_id'] == df['y_parent_id']
df['cc_is_sibling'] = sel
df.loc[sel, 'y_context_id'] = df.loc[sel, 'x_context_id']
# df.loc[sel, 'y_tag'] = df.loc[sel, 'x_tag']
df.loc[sel, 'same_context'] = True


In [14]:
# get parents and label context
sel = df['y_comment_id'] == df['x_parent_id_']
df['cc_is_parent'] = sel
df.loc[sel, 'y_context_id'] = df.loc[sel, 'x_context_id']
# df.loc[sel, 'y_tag'] = df.loc[sel, 'x_tag']
df.loc[sel, 'same_context'] = True
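
The three relation tests (child, sibling, parent) are easiest to see on a tiny example. The sketch below uses invented ids and a hypothetical frame `toy`; it applies the same comparisons as the cells above and then combines them into `same_context`.

```python
import pandas as pd

# Toy thread: p is the parent of a and c; a is the parent of b; z sits elsewhere.
toy = pd.DataFrame({
    "x_comment_id": ["a", "a", "b", "b"],
    "x_parent_id": ["t1_p", "t1_p", "t1_a", "t1_a"],
    "y_comment_id": ["b", "c", "a", "z"],
    "y_parent_id": ["t1_a", "t1_p", "t1_p", "t1_q"],
})
toy["x_parent_id_"] = toy["x_parent_id"].str.split("_").str[-1]
toy["y_parent_id_"] = toy["y_parent_id"].str.split("_").str[-1]

# y replies to x, y shares a parent with x, or y is the comment x replies to.
toy["cc_is_child"] = toy["x_comment_id"] == toy["y_parent_id_"]
toy["cc_is_sibling"] = toy["x_parent_id"] == toy["y_parent_id"]
toy["cc_is_parent"] = toy["y_comment_id"] == toy["x_parent_id_"]

# A pair is in the same context if any of the three relations holds.
toy["same_context"] = toy[["cc_is_child", "cc_is_sibling", "cc_is_parent"]].any(axis=1)
```

The first three rows are a child, a sibling and a parent pair respectively; the last row has no relation and stays `False`.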


In [None]:
# df['x_context_time'] = [parent_created_at_conversion[cid] 
#                         if cid in parent_created_at_conversion.keys() else None 
#                         for cid in tqdm(df['x_context_id'].values)
#                         ] #df['x_context_id'].replace(parent_created_at_conversion)
# 
# df['y_context_time'] = [parent_created_at_conversion[cid] 
#                         if cid in parent_created_at_conversion.keys() else None 
#                         for cid in tqdm(df['y_context_id'].values)
#                         ] #df['x_context_id'].replace(parent_created_at_conversion)

In [None]:
# df['y_context_id'].loc[(~df['y_context_id'].isna() & df['y_context_time'].isna())].value_counts()

In [15]:
s1 = df['y_comment_id'].loc[df['cc_is_child']].unique()
s2 = df['y_comment_id'].loc[df['cc_is_parent'] | df['cc_is_sibling']].unique()

only_as_child_comments = list(set(s1).difference(set(s2)))
only_as_child_context_ids = {comment: df['x_parent_id_'].loc[df['y_comment_id'].isin([comment]) & df['same_context']].values[0] for comment in only_as_child_comments}

In [17]:
df['x_context_time'] = [parent_created_at_conversion[cid] 
                        if cid in parent_created_at_conversion.keys() else None 
                        for cid in tqdm(df['x_context_id'].values)
                        ] #df['x_context_id'].replace(parent_created_at_conversion)

df['y_context_time'] = [parent_created_at_conversion[cid] 
                        if cid in parent_created_at_conversion.keys() else None 
                        for cid in tqdm(df['y_context_id'].values)
                        ] #df['x_context_id'].replace(parent_created_at_conversion)

100%|██████████| 12174954/12174954 [00:04<00:00, 2920884.65it/s]
100%|██████████| 12174954/12174954 [00:03<00:00, 3494660.08it/s]


In [18]:
sel = df['y_comment_id'].isin(only_as_child_comments)

all_other_y_contexts = dict()
for comment in df['y_comment_id'].loc[df['same_context'] & ~sel].unique():
    responses = df[['y_context_id', 'y_context_time']].loc[df['y_comment_id'].isin([comment]) & df['same_context']].values
    all_other_y_contexts[comment] = responses[:,0][responses[:,1].argmin()]

# assign through a single .loc call so the write lands on df itself
other = ~sel & ~df['same_context']
df.loc[other, 'y_context_id'] = [all_other_y_contexts[comment] for comment in tqdm(df.loc[other, 'y_comment_id'].values)]

df.loc[sel, 'y_context_id'] = [only_as_child_context_ids[comment] for comment in tqdm(df.loc[sel, 'y_comment_id'].values)]

100%|██████████| 8106183/8106183 [00:01<00:00, 4417004.03it/s]
100%|██████████| 3971550/3971550 [00:00<00:00, 4501415.33it/s]
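
The rule applied here is: when a $y$ comment has been labelled with several contexts, rows without a label inherit the context with the earliest timestamp. A minimal sketch of that rule using `groupby` and `idxmin` follows; the frame `toy` and its values are invented, and this is an alternative formulation rather than the code used for the results.

```python
import pandas as pd

# Toy frame: c1 has two labelled contexts and one unlabelled row.
toy = pd.DataFrame({
    "y_comment_id": ["c1", "c1", "c1", "c2"],
    "y_context_id": ["ctxB", "ctxA", None, "ctxC"],
    "y_context_time": [20.0, 10.0, None, 30.0],
    "same_context": [True, True, False, True],
})

# For each comment, find the labelled row with the smallest context time.
labelled = toy[toy["same_context"]]
earliest_rows = labelled.groupby("y_comment_id")["y_context_time"].idxmin()
earliest_context = labelled.loc[earliest_rows].set_index("y_comment_id")["y_context_id"]

# Fill the unlabelled rows with that earliest context.
unlabelled = ~toy["same_context"]
toy.loc[unlabelled, "y_context_id"] = toy.loc[unlabelled, "y_comment_id"].map(earliest_context)
```

The unlabelled `c1` row receives `ctxA`, the earlier of its two candidate contexts.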


In [None]:
# sel = df.loc[df['same_context']]
# context_conversion = {yid: sel[['y_context_id', 'y_context_time', 'x_tag']].loc[sel['y_parent_id_'].isin([yid])].values for yid in sel['y_parent_id_'].unique()}

In [None]:
# # comparisons across contexts
# sel = ~df['y_context_id'].isna()
# for cid in tqdm(df['y_comment_id'].loc[sel].unique()):
#     sub = df.loc[sel & df['y_comment_id'].isin([cid])]
#     min_ = sub['y_context_time'].min()
#     earliest_head = sub['y_context_id'].loc[sub['y_context_time']==min_].values
#     df['y_context_id'].loc[~sel & df['y_comment_id'].isin([cid])] = earliest_head[0]

In [19]:
df['y_context_id'].isin(df['x_context_id'].unique()).mean()

1.0

In [20]:
# everything else:
sel = df['y_context_id'].isna()
print(sel.sum())
# df['y_context_id'].loc[sel] = df['y_parent_id_'].loc[sel]

0


In [21]:
df['x_context_time'] = [parent_created_at_conversion[cid] 
                        if cid in parent_created_at_conversion.keys() else None 
                        for cid in tqdm(df['x_context_id'].values)
                        ] #df['x_context_id'].replace(parent_created_at_conversion)

df['y_context_time'] = [parent_created_at_conversion[cid] 
                        if cid in parent_created_at_conversion.keys() else None 
                        for cid in tqdm(df['y_context_id'].values)
                        ] #df['x_context_id'].replace(parent_created_at_conversion)

100%|██████████| 12174954/12174954 [00:03<00:00, 3095963.41it/s]
100%|██████████| 12174954/12174954 [00:03<00:00, 3082343.08it/s]


Adding the context time for all the `y_context_id` values, one last time...

In [None]:
# sel = df['y_context_time'].isna()
# df['y_context_time'].loc[sel] = [parent_created_at_conversion[cid] 
#                         if cid in parent_created_at_conversion.keys() else None 
#                         for cid in tqdm(df['y_context_id'].loc[sel].values)
#                         ] #df['x_context_id'].replace(parent_created_at_conversion)

In [22]:
possible_y_tags = {
    xcid: '|'.join(df['x_tag'].loc[df['x_context_id'].isin([xcid])].unique())
    for xcid in df['x_context_id'].unique()
}

df['y_tag'] = [
    possible_y_tags[ycid] if ycid in possible_y_tags.keys() 
    else None 
    for ycid in tqdm(df['y_context_id'].values)
]

100%|██████████| 12174954/12174954 [00:03<00:00, 3407743.68it/s]


And some last checks.

In [None]:
df.isna().sum()

In [None]:
df['same_context'].loc[df['y_tag'].isna()].value_counts()

Just in case, I also want to note when the $x$ and $y$ authors are the same.

In [23]:
del df['x_parent_id_']
del df['y_parent_id_']

In [24]:
df['same_author'] = df['x_user'] == df['y_user']

In [25]:
df['same_author'].value_counts()

False    12170042
True         4912
Name: same_author, dtype: int64

Let's also take a moment now to anonymize some of the data (and save our anonymization key locally).

In [26]:
anonymize_columns = [['x_user', 'y_user'], ['x_comment_id', 'y_comment_id'], ['x_submission_id', 'y_submission_id']]
for cols in anonymize_columns:
    values = np.unique(df[cols].values)
    values = np.random.choice(values, size=(len(values),), replace=False)
    
    conversion = {val:i+1 for i,val in enumerate(values)}
    
    # save conversion dictionary
    key_path = os.path.join(OUT_PATH, cols[0].replace('x_', '').replace('y_', '') + '.json')
    with open(key_path, 'w') as f:
        json.dump(conversion, f, indent=4)
    
    # anonymize the column
    for col in cols:
        print(col)
        df[col] = [conversion[val] for val in tqdm(df[col].values)]

x_user


100%|██████████| 12174954/12174954 [00:02<00:00, 4517161.90it/s]


y_user


100%|██████████| 12174954/12174954 [00:02<00:00, 4760389.80it/s]


x_comment_id


100%|██████████| 12174954/12174954 [00:02<00:00, 4697857.58it/s]


y_comment_id


100%|██████████| 12174954/12174954 [00:03<00:00, 4012023.33it/s]


x_submission_id


100%|██████████| 12174954/12174954 [00:02<00:00, 4862790.66it/s]


y_submission_id


100%|██████████| 12174954/12174954 [00:02<00:00, 5067351.38it/s]
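
The anonymization idea is that each pair of columns shares one shuffled key, so the same user, comment or submission gets the same integer on the $x$ and $y$ sides. Here is a small sketch on made-up user names. The seeded generator is an assumption added for repeatability; the cell above draws from NumPy's global random state without a seed.

```python
import numpy as np
import pandas as pd

# Toy frame with invented user names.
toy = pd.DataFrame({"x_user": ["ann", "bob", "cy"], "y_user": ["bob", "dee", "ann"]})

rng = np.random.default_rng(seed=0)  # illustrative seed, not from the notebook

cols = ["x_user", "y_user"]
# Collect the distinct values across both columns, then shuffle them.
values = pd.unique(toy[cols].to_numpy().ravel())
shuffled = rng.permutation(values)

# One shared key: original value -> integer id starting at 1.
key = {val: i + 1 for i, val in enumerate(shuffled)}
for col in cols:
    toy[col] = toy[col].map(key)
```

Because the key is shared, comparisons such as `same_author` give the same answer before and after anonymization.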


Finally, let's save the data.

In [27]:
df.to_csv(os.path.join(OUT_PATH, OUT_NAME), index=False, encoding='utf-8')

In [28]:
df.shape

(12174954, 31)

In [30]:
df['y_tag'].value_counts()

pro_life                                       9185165
forced_birth                                   1741309
forced_birth|pro_life                           615898
pro_life|forced_birth                           430779
pro_life|forced_birth|pro_life                  110285
forced_birth|pro_life|forced_birth|pro_life      55148
forced_birth|pro_life|pro_life                   36370
Name: y_tag, dtype: int64
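
The counts above list the `forced_birth` plus `pro_life` combination under five different spellings, because the joined strings keep repeats and depend on the order in which tags were encountered. If those are meant to be a single category (an assumption on my part), a small normalizer like the hypothetical `normalize_tags` below would collapse them before counting.

```python
def normalize_tags(joined):
    """Split a '|'-joined tag string, drop repeats, and rejoin in sorted order."""
    tags = {tag for tag in joined.split("|") if tag}
    return "|".join(sorted(tags))


# Spellings taken from the value counts above.
examples = [
    "pro_life|forced_birth|pro_life",
    "forced_birth|pro_life|forced_birth|pro_life",
    "pro_life|forced_birth",
    "pro_life",
]
normalized = [normalize_tags(t) for t in examples]
```

On the real frame this could be applied with `df['y_tag'].map(normalize_tags)` before calling `value_counts()`, provided the column has no missing values.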