# Combine and Evaluate Datasets
- Inspect and evaluate the input data 
- Explore output results of multiple models' sentiment evaluations

### Prepare annotated datasets
- combine 3 human-annotated datasets
- normalize categories
- extract conversation IDs for removal from model predictions
- save as new datasets (one with dupes, one without dupes)


### Prepare model-predicted datasets
- combine 4 datasets
- standardize scales
- remove annotated tweets
- save as 2 sets (annotated and not)



In [100]:
import os
import pandas as pd
from datetime import datetime

### Twitter dataset
- Load and validate dataset
- check for duplicate tweets
- ensure keyword field exists

In [101]:
twitter_data = '../../data/deduplicated_final_.2023-04-02_11.38.26.083972.tsv'
tweets_df = pd.read_csv(twitter_data, sep='\t')
print(tweets_df.shape)
print(tweets_df.columns)
assert tweets_df['conversation_id'].nunique() == tweets_df.shape[0], 'mismatch: row count vs unique conversation IDs'
assert 'keyword' in list(tweets_df.columns), 'Missing keyword field in twitter data'
tweets_df.head(3)

(30611, 22)
Index(['Unnamed: 0', 'conversation_id', 'lang', 'reply_settings', 'created_at',
       'keyword', 'clean_text', 'text', 'author_id', 'referenced_tweets', 'id',
       'edit_history_tweet_ids', 'public_metrics.retweet_count',
       'public_metrics.reply_count', 'public_metrics.like_count',
       'public_metrics.impression_count', 'in_reply_to_user_id',
       'geo.place_id', 'withheld.copyright', 'withheld.country_codes',
       'geo.coordinates.type', 'geo.coordinates.coordinates'],
      dtype='object')


Unnamed: 0.1,Unnamed: 0,conversation_id,lang,reply_settings,created_at,keyword,clean_text,text,author_id,referenced_tweets,...,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.impression_count,in_reply_to_user_id,geo.place_id,withheld.copyright,withheld.country_codes,geo.coordinates.type,geo.coordinates.coordinates
0,0,1633954063934009344,en,everyone,2023-03-09 22:13:00+00:00,#chrisrocklive,rt infantry bucky lucky chrisrocklive worse th...,RT @Infantry_bucky: He’s lucky a #ChrisRockLiv...,1519164980582653952,"[{'type': 'retweeted', 'id': '1633938373529292...",...,8,0,0,0,,,,,,
1,1,1633954058212876288,en,everyone,2023-03-09 22:12:59+00:00,#chrisrocklive,rt ofakindnocap chris rock cheated never perso...,RT @1_ofakindnocap: Chris Rock: “we all been c...,21575184,"[{'type': 'retweeted', 'id': '1632283297588948...",...,616,0,0,0,,,,,,
2,2,1633951267423768576,en,everyone,2023-03-09 22:01:54+00:00,#chrisrocklive,rt ofakindnocap chris rock cheated never perso...,RT @1_ofakindnocap: Chris Rock: “we all been c...,360633018,"[{'type': 'retweeted', 'id': '1632283297588948...",...,616,0,0,0,,,,,,
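
If the uniqueness assertion above ever fails, it helps to see which IDs repeat rather than only learning that some do. A minimal sketch on a toy frame follows; the IDs and keywords are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for tweets_df; values are illustrative only.
toy_df = pd.DataFrame({
    'conversation_id': [101, 102, 102, 103],
    'keyword': ['obama', 'maga', 'maga', 'russia'],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dupe_mask = toy_df.duplicated('conversation_id', keep=False)
dupe_rows = toy_df.loc[dupe_mask]

# Number of rows beyond one per conversation ID.
n_extra = len(toy_df) - toy_df['conversation_id'].nunique()

print(f'{n_extra} surplus row(s)')
print(dupe_rows)
```

Printing `dupe_rows` before asserting would show the offending tweets side by side, which can make it easier to decide whether they are true duplicates.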


### Distribution of keywords in our data

In [102]:
my_keywords = list(tweets_df['keyword'].unique())
print(f'My keywords: {my_keywords}\n')

col_width = 15  # width of each left-justified column

print(f'{"keyword".ljust(col_width)} {"% unique".ljust(col_width)} "count"')
for kw in my_keywords:
    # Select this keyword's rows once, then compare unique IDs to row count
    kw_rows = tweets_df.loc[tweets_df['keyword'] == kw]
    unique_for_kw = kw_rows['conversation_id'].nunique()
    rows_for_kw = kw_rows.shape[0]
    my_pct = round(100.0*unique_for_kw/rows_for_kw, 2)
    print(f'{kw.ljust(col_width)} {str(my_pct).ljust(col_width)} {rows_for_kw}')


My keywords: ['#chrisrocklive', '#tyrenichols', 'chatgpt', 'maga', 'obama', 'russia', 'statehood', 'ukraine']

keyword         % unique        "count"
#chrisrocklive  100.0           5057
#tyrenichols    100.0           744
chatgpt         100.0           4913
maga            100.0           1057
obama           100.0           4714
russia          100.0           4681
statehood       100.0           4713
ukraine         100.0           4732
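
The same per-keyword table can probably be produced without an explicit loop. The sketch below uses `groupby` with named aggregation on a toy frame; the column names `unique_ids`, `rows`, and `pct_unique` are my own choices, and the data is invented.

```python
import pandas as pd

# Toy stand-in for tweets_df; values are illustrative only.
toy_df = pd.DataFrame({
    'conversation_id': [1, 2, 3, 4, 4],
    'keyword': ['obama', 'obama', 'maga', 'russia', 'russia'],
})

# One pass over the data: unique IDs and row count per keyword.
summary = (
    toy_df.groupby('keyword')['conversation_id']
    .agg(unique_ids='nunique', rows='size')
)
summary['pct_unique'] = (100.0 * summary['unique_ids'] / summary['rows']).round(2)

print(summary)
```

A keyword whose `pct_unique` falls below 100 would point at repeated conversation IDs within that keyword.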


# Model Annotations Datasets
- Load and inspect each model's sentiment annotations dataset
- Normalize scores
- Check for any duplicate tweets
- Combine into a single polarity predictions dataset

In [103]:


model_files = [
    "textblob_polarity.2023-03-26_14.35.07.210250.tsv",
    "vader_polarity.2023-03-26_14.38.52.858509.tsv",
    "AFINN.csv",
    "SentiWordNet.csv"
]

for filename in model_files:
    filepath = os.path.join('..', '..', 'data', filename)
    print(f'Checking {filepath}')
    if '.tsv' in filename:
        temp_df = pd.read_csv(filepath, sep='\t')
    else:
        temp_df = pd.read_csv(filepath)
    print(temp_df.shape)
    display(temp_df.head(1))


Checking ../../data/textblob_polarity.2023-03-26_14.35.07.210250.tsv
(32291, 9)


Unnamed: 0.1,Unnamed: 0,clean_text,Polarity,Subjectivity,referenced_tweets,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.impression_count
0,0,rt infantry bucky lucky chrisrocklive worse th...,-0.038889,0.527778,"[{'type': 'retweeted', 'id': '1633938373529292...",8,0,0,0


Checking ../../data/vader_polarity.2023-03-26_14.38.52.858509.tsv
(32291, 9)


Unnamed: 0.1,Unnamed: 0,clean_text,Polarity,Subjectivity,referenced_tweets,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.impression_count
0,0,rt infantry bucky lucky chrisrocklive worse th...,-0.6369,0.574,"[{'type': 'retweeted', 'id': '1633938373529292...",8,0,0,0


Checking ../../data/AFINN.csv
(32291, 4)


Unnamed: 0.1,Unnamed: 0,text,scores,sentiments
0,0,rt infantry bucky lucky chrisrocklive worse th...,-2.0,negative


Checking ../../data/SentiWordNet.csv
(32291, 3)


Unnamed: 0.1,Unnamed: 0,text,scores
0,0,rt infantry bucky lucky chrisrocklive worse th...,0.25


## Sentiment Evaluation Models Data preparation
1. The AFINN and SentiWordNet datasets do not contain the fields that are part of the Twitter dataset. I will join them to the tweet data and compare every row to confirm alignment across all models, although a preliminary inspection suggests that every model ran on the same input file and kept the same row order.
2. Restore the other fields from the Twitter dataset (the conversation ID, keywords, and so on) so that duplicates can be removed.
3. Each of the four sentiment datasets contains duplicated rows. Remove them by identifying the conversation IDs in the modeled sentiment data that were dropped from the deduplicated set.
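
The three steps above can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's exact procedure: the frames, scores, and the `dedup_ids` set are invented, and only one model is shown.

```python
import pandas as pd

# Toy stand-ins; the real frames come from the TSV/CSV files above.
base_df = pd.DataFrame({
    'conversation_id': [11, 12, 12, 13],
    'clean_text': ['a', 'b', 'b', 'c'],
})
afinn_df = pd.DataFrame({
    'text': ['a', 'b', 'b', 'c'],
    'scores': [-2.0, 1.0, 1.0, 0.0],
})
dedup_ids = {11, 12, 13}  # hypothetical: IDs present in the deduplicated set

# 1. Prefix model columns so they cannot collide with the tweet fields.
afinn_df = afinn_df.add_prefix('AFINN_')

# 2. Attach side by side; only valid if both frames share the same row order.
assert len(base_df) == len(afinn_df)
merged = pd.concat(
    [base_df.reset_index(drop=True), afinn_df.reset_index(drop=True)],
    axis='columns',
)
assert (merged['clean_text'] == merged['AFINN_text']).all()

# 3. Keep one row per conversation ID that survived deduplication.
merged = merged[merged['conversation_id'].isin(dedup_ids)]
merged = merged.drop_duplicates('conversation_id')

print(merged)
```

Resetting both indexes before `pd.concat` matters because column-wise concatenation aligns on index labels, not on position.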

### Merge model datasets

In [105]:
# Start with original dataset on which models ran, recover missing fields
df = pd.read_csv('../../data/deduplicated_tweets.2023-03-26_14.08.48.636120.tsv', sep='\t')
assert df.shape[0] == 32291, 'Not the dataset you seek'

for filename in model_files:
    filepath = os.path.join('..', '..', 'data', filename)
    shortname = filename.split('.')[0].split('_polarity')[0]
    print(f'Checking {filepath}  -> {shortname}_ prefix')
    if '.tsv' in filename:
        temp_df = pd.read_csv(filepath, sep='\t')
    else:
        temp_df = pd.read_csv(filepath)
    print(temp_df.shape)
    
    # drop the "unnamed" columns
    if "Unnamed: 0" in temp_df.columns:
        temp_df.drop(["Unnamed: 0"], axis=1, inplace=True)
    
    # label the incoming columns before concatenation
    temp_df.columns = [f'{shortname}_{x}' for x in temp_df.columns]
    
    # verify we are aligned
    assert df.shape[0] == temp_df.shape[0], f'Rows mismatch on {shortname}'
    
    # concatenate and validate our frames
    df = pd.concat([df, temp_df], axis='columns')
    assert df.shape[0] == temp_df.shape[0], 'attaching the wrong direction'
    assert df.shape[1] > temp_df.shape[1], 'not getting wider'

df.head(3)



Checking ../../data/textblob_polarity.2023-03-26_14.35.07.210250.tsv  -> textblob_ prefix
(32291, 9)
Checking ../../data/vader_polarity.2023-03-26_14.38.52.858509.tsv  -> vader_ prefix
(32291, 9)
Checking ../../data/AFINN.csv  -> AFINN_ prefix
(32291, 4)
Checking ../../data/SentiWordNet.csv  -> SentiWordNet_ prefix
(32291, 3)


Unnamed: 0.1,Unnamed: 0,conversation_id,lang,reply_settings,created_at,clean_text,text,author_id,referenced_tweets,id,...,vader_referenced_tweets,vader_public_metrics.retweet_count,vader_public_metrics.reply_count,vader_public_metrics.like_count,vader_public_metrics.impression_count,AFINN_text,AFINN_scores,AFINN_sentiments,SentiWordNet_text,SentiWordNet_scores
0,0,1633954063934009344,en,everyone,2023-03-09 22:13:00+00:00,rt infantry bucky lucky chrisrocklive worse th...,RT @Infantry_bucky: He’s lucky a #ChrisRockLiv...,1519164980582653952,"[{'type': 'retweeted', 'id': '1633938373529292...",1633954063934009344,...,"[{'type': 'retweeted', 'id': '1633938373529292...",8,0,0,0,rt infantry bucky lucky chrisrocklive worse th...,-2.0,negative,rt infantry bucky lucky chrisrocklive worse th...,0.25
1,1,1633954058212876288,en,everyone,2023-03-09 22:12:59+00:00,rt ofakindnocap chris rock cheated never perso...,RT @1_ofakindnocap: Chris Rock: “we all been c...,21575184,"[{'type': 'retweeted', 'id': '1632283297588948...",1633954058212876288,...,"[{'type': 'retweeted', 'id': '1632283297588948...",616,0,0,0,rt ofakindnocap chris rock cheated never perso...,-6.0,negative,rt ofakindnocap chris rock cheated never perso...,-0.625
2,2,1633951267423768576,en,everyone,2023-03-09 22:01:54+00:00,rt ofakindnocap chris rock cheated never perso...,RT @1_ofakindnocap: Chris Rock: “we all been c...,360633018,"[{'type': 'retweeted', 'id': '1632283297588948...",1633951267423768576,...,"[{'type': 'retweeted', 'id': '1632283297588948...",616,0,0,0,rt ofakindnocap chris rock cheated never perso...,-6.0,negative,rt ofakindnocap chris rock cheated never perso...,-0.625


#### Verify alignment

In [106]:
# Are all the text fields the same?
assert df.loc[(df['clean_text'] == df['textblob_clean_text'])].shape == df.shape, 'alignment issue in textblob'
assert df.loc[(df['clean_text'] == df['vader_clean_text'])].shape == df.shape, 'alignment issue in vader'
assert df.loc[(df['clean_text'] == df['AFINN_text'])].shape == df.shape, 'alignment issue in AFINN'
assert df.loc[(df['clean_text'] == df['SentiWordNet_text'])].shape == df.shape, 'alignment issue in SentiWordNet'

# Check for common texts across each set (ensure alignment)
df['textblob_aligned'] = (df['clean_text'] == df['textblob_clean_text'])
df['vader_aligned'] = (df['clean_text'] == df['vader_clean_text'])
df['AFINN_aligned'] = (df['clean_text'] == df['AFINN_text'])
df['SentiWordNet_aligned'] = (df['clean_text'] == df['SentiWordNet_text'])

alignment_fields = [x for x in df.columns if '_aligned' in x.lower()]
df[alignment_fields].describe()


Unnamed: 0,textblob_aligned,vader_aligned,AFINN_aligned,SentiWordNet_aligned
count,32291,32291,32291,32291
unique,1,1,1,1
top,True,True,True,True
freq,32291,32291,32291,32291
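
One caveat with element-wise `==` is that two missing values never compare equal, so a row whose text is missing in both frames would be reported as misaligned. All rows matched here, so this did not come up, but a more defensive check could look like the sketch below. The helper name `aligned_mask` is hypothetical.

```python
import numpy as np
import pandas as pd

def aligned_mask(left: pd.Series, right: pd.Series) -> pd.Series:
    """Row-wise equality that also treats a pair of missing values as equal."""
    return (left == right) | (left.isna() & right.isna())

# Toy columns: row 1 is missing on both sides, row 2 genuinely differs.
left = pd.Series(['a', np.nan, 'c'])
right = pd.Series(['a', np.nan, 'x'])

mask = aligned_mask(left, right)
mismatches = mask.index[~mask].tolist()

print(mismatches)
```

Listing the mismatched index labels, rather than only asserting on the shape, shows where an alignment problem starts.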


#### Trim unneeded fields and drop duplicate rows

In [194]:
# Keep only meaningful columns
keep_cols = ['conversation_id', 'lang', 'reply_settings', 'created_at',
       'clean_text', 'text', 'author_id', 'referenced_tweets', 'id',
       'edit_history_tweet_ids', 'public_metrics.retweet_count',
       'public_metrics.reply_count', 'public_metrics.like_count',
       'public_metrics.impression_count', 'in_reply_to_user_id',
       'geo.place_id', 'withheld.copyright', 'withheld.country_codes',
       'geo.coordinates.type', 'geo.coordinates.coordinates',
       'textblob_Polarity', 'textblob_Subjectivity',
       'vader_Polarity', 'vader_Subjectivity', 'AFINN_scores',
       'AFINN_sentiments', 'SentiWordNet_scores']
drop_cols = [x for x in df.columns if x not in keep_cols]
print(f'Dropping unneeded columns {drop_cols}')
df.drop(drop_cols, axis=1, inplace=True)

df.drop_duplicates('conversation_id', inplace=True)

Dropping unneeded columns []


# Inspect Our Model Data

First Observations:
- We have 30,611 distinct tweets, each evaluated by four different sentiment analysis models.
- 3,517 rows are direct replies to other tweets (rows with a non-null in_reply_to_user_id), which is about 11.5% of all tweets.


In [195]:
df.describe()

Unnamed: 0,conversation_id,author_id,id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.impression_count,in_reply_to_user_id,withheld.copyright,textblob_Polarity,textblob_Subjectivity,vader_Polarity,vader_Subjectivity,AFINN_scores,SentiWordNet_scores
count,30611.0,30611.0,30611.0,30611.0,30611.0,30611.0,30611.0,3517.0,46.0,30611.0,30611.0,30611.0,30611.0,30611.0,30611.0
mean,1.633309e+18,7.48907e+17,1.63364e+18,1365.227337,0.051419,0.401457,41.540721,6.032123e+17,0.0,0.061751,0.267651,0.015488,0.213545,-0.209304,0.075877
std,1.150231e+16,6.871952e+17,314554400000000.0,2937.853465,0.450962,6.321221,670.628861,6.673151e+17,0.0,0.243358,0.304795,0.444652,0.173778,2.963791,0.486241
min,4.973844e+17,2360.0,1.631455e+18,0.0,0.0,0.0,0.0,1605.0,0.0,-1.0,0.0,-0.9856,0.0,-24.0,-3.25
25%,1.633536e+18,551530800.0,1.63363e+18,1.0,0.0,0.0,0.0,171346000.0,0.0,0.0,0.0,-0.2732,0.074,-2.0,-0.125
50%,1.63367e+18,8.849535e+17,1.633671e+18,52.0,0.0,0.0,0.0,3221217000.0,0.0,0.0,0.155556,0.0,0.191,0.0,0.0
75%,1.633875e+18,1.4686e+18,1.633876e+18,616.0,0.0,0.0,0.0,1.335989e+18,0.0,0.136364,0.5,0.296,0.336,1.0,0.25
max,1.633984e+18,1.633981e+18,1.633984e+18,74592.0,30.0,534.0,67372.0,1.633867e+18,0.0,1.0,1.0,0.9889,0.896,25.0,4.625
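
The summary above shows that the four models report on different ranges: TextBlob and VADER stay within [-1, 1], while AFINN spans -24 to 25 and SentiWordNet -3.25 to 4.625. The notebook does not say which method it intends for standardizing scales, so the following is only one possible sketch. It assumes max-abs scaling is acceptable, and the helper `max_abs_scale` and the toy scores are mine.

```python
import pandas as pd

# Toy scores; a few values echo the ranges in the summary above.
scores = pd.DataFrame({
    'vader_Polarity': [-0.6369, 0.0, 0.296],
    'AFINN_scores': [-2.0, 0.0, 25.0],
    'SentiWordNet_scores': [0.25, -3.25, 4.625],
})

def max_abs_scale(col: pd.Series) -> pd.Series:
    """Divide by the largest absolute value so results fall in [-1, 1]."""
    peak = col.abs().max()
    return col / peak if peak else col

# Only rescale the columns that are not already bounded by [-1, 1].
unbounded_cols = ['AFINN_scores', 'SentiWordNet_scores']
scaled = scores.copy()
scaled[unbounded_cols] = scores[unbounded_cols].apply(max_abs_scale)

print(scaled)
```

Max-abs scaling keeps zero at zero and preserves the sign of each score, which matters when the sign is later read as positive or negative sentiment. It is sensitive to outliers, so a different method may suit the real data better.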


# TODO: save model data after pulling annotations

In [39]:
# Save modeled twitter sentiment data to file after removing our annotated set
from datetime import datetime
timestamp = str(datetime.now()).replace(' ', '_').replace(':', '.')
output_filepath = f'../../data/combined_model_data_{timestamp}.tsv'

#df.to_csv(output_filepath, sep='\t')
#print(f'wrote {output_filepath}')

wrote ../../data/combined_model_data_2023-04-02_16.43.07.173739.tsv


# Create Annotations dataset


### Fix our conversation_ids

- Each of us annotated in Excel, which corrupted our conversation IDs, author IDs, and other long-integer fields (they came back as rounded floating-point values such as 1.63367e+18).
- I regenerated our assigned tweets as "\_REDO" files, and the numbers are intact there.
- The annotation labels are pulled from the damaged files and added to the regenerated REDO files.
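
To illustrate why rounded values cannot be repaired after the fact, the sketch below takes two conversation IDs that appear in the regenerated file and shows that they collapse to the same six-significant-digit value. It also shows one way to guard against this when loading: read the ID column as text. The in-memory TSV is a stand-in for a real file.

```python
import io
import pandas as pd

# Two distinct conversation IDs from the regenerated annotations.
ids = [1633665111406542848, 1633666519245750272]

# Rendered with six significant digits, both become the same string.
shown = [f'{float(i):.5e}' for i in ids]
print(shown)

# Reading the ID column as text keeps every digit.
tsv = 'conversation_id\tour_label\n' + '\n'.join(f'{i}\tneutral' for i in ids)
safe_df = pd.read_csv(io.StringIO(tsv), sep='\t', dtype={'conversation_id': str})
print(safe_df['conversation_id'].tolist())
```

Once two IDs share one rounded value, the original digits are gone, which is why the labels have to be moved onto regenerated files instead of fixing the IDs in place.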

#### Fixed Annotation Files:
* '../../data/mel_annotations_REDO_04022023.tsv'
* '../../data/sarik_annotations_REDO_04022023.tsv'
* '../../data/Annotated_Anu.tsv'  (read as comma-separated despite the .tsv extension)

In [174]:
mel_messed_up_cids_df = pd.read_csv('../../data/mel_annotations.csv')
mel_correct_cids_df = pd.read_csv('../../data/mel_annotations_REDO.tsv')

# mel_correct_cids_df.columns
# '''Index([',Unnamed: 0,conversation_id,lang,reply_settings,created_at,clean_text,text,author_id,referenced_tweets,id,edit_history_tweet_ids,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.impression_count,in_reply_to_user_id,geo.place_id,withheld.copyright,withheld.country_codes,geo.coordinates.type,geo.coordinates.coordinates,our_label'], dtype='object')'''
# mel_messed_up_cids_df.columns
# '''Index(['Unnamed: 0', 'Unnamed: 0.1', 'conversation_id', 'lang',
#        'reply_settings', 'created_at', 'clean_text', 'text', 'author_id',
#        'referenced_tweets', 'id', 'edit_history_tweet_ids',
#        'public_metrics.retweet_count', 'public_metrics.reply_count',
#        'public_metrics.like_count', 'public_metrics.impression_count',
#        'in_reply_to_user_id', 'geo.place_id', 'withheld.copyright',
#        'withheld.country_codes', 'geo.coordinates.type',
#        'geo.coordinates.coordinates', 'our_label'],
#       dtype='object')'''
display(mel_messed_up_cids_df[['Unnamed: 0', 'conversation_id', 'author_id', 'text', 'referenced_tweets']].head(3))
display(mel_correct_cids_df[['Unnamed: 0', 'conversation_id', 'author_id', 'text', 'referenced_tweets']].head(3))
#display(mel_correct_cids_df[['Unnamed: 0']].head(3))


Unnamed: 0.1,Unnamed: 0,conversation_id,author_id,text,referenced_tweets
0,13581,1.63367e+18,3024401000.0,RT @TomFitton: Obama knew. Clinton knew. Biden...,"[{'type': 'retweeted', 'id': '1633625411845345..."
1,13249,1.63367e+18,1.57922e+18,cant even threaten my friends during a game wi...,"[{'type': 'quoted', 'id': '1633098341578940417'}]"
2,28682,1.63388e+18,200698200.0,Trump was impeached over a ☎️ call! Why has th...,


Unnamed: 0.1,Unnamed: 0,conversation_id,author_id,text,referenced_tweets
0,13581,1633665111406542848,3024401194,RT @TomFitton: Obama knew. Clinton knew. Biden...,"[{'type': 'retweeted', 'id': '1633625411845345..."
1,13249,1633666519245750272,1579224008033275904,cant even threaten my friends during a game wi...,"[{'type': 'quoted', 'id': '1633098341578940417'}]"
2,28682,1633876284940877824,200698213,Trump was impeached over a ☎️ call! Why has th...,


In [187]:
# Now get our scores back in place in the "REDO" files
# mel_correct_cids_df['our_label'] = mel_messed_up_cids_df['our_label']
# mel_correct_cids_df.head()
# mel_correct_cids_df.to_csv('../../data/mel_annotations_REDO_04022023.tsv', sep='\t')

# fix sarik
# sarik_df = pd.read_csv('../../data/annotated_sarik.txt', sep='\t', encoding='iso-8859-1')
# sarik_correct_df = pd.read_csv('../../data/sarik_annotations_REDO.tsv')
# sarik_correct_df['our_label'] = sarik_df['our_label']
# sarik_correct_df.head()
# sarik_correct_df.to_csv('../../data/sarik_annotations_REDO_04022023.tsv', sep='\t')

# fix anu
# anu_df = pd.read_csv('../../data/Annotated_Anu.tsv')  # ANU's is OK!  no update needed.
# anu_df.head()
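
Copying `our_label` from one frame to another, as in the commented lines above, pairs rows purely by position. A cautious variant first confirms that both files list the same tweets in the same order, using a field that was not damaged. The sketch below assumes the `Unnamed: 0` row number is such a field, which matches the two previews shown earlier. The frames are three-row stand-ins.

```python
import pandas as pd

# Stand-in for the Excel-damaged file: IDs rounded, labels intact.
damaged = pd.DataFrame({
    'Unnamed: 0': [13581, 13249, 28682],
    'conversation_id': [1.63367e18, 1.63367e18, 1.63388e18],
    'our_label': ['neutral', 'negative', 'negative'],
})
# Stand-in for the regenerated file: IDs intact, labels missing.
regenerated = pd.DataFrame({
    'Unnamed: 0': [13581, 13249, 28682],
    'conversation_id': [1633665111406542848, 1633666519245750272,
                        1633876284940877824],
})

# Guard: same tweets in the same order before copying by position.
assert damaged['Unnamed: 0'].tolist() == regenerated['Unnamed: 0'].tolist()

# to_numpy() makes the copy explicitly positional, ignoring index labels.
regenerated['our_label'] = damaged['our_label'].to_numpy()

print(regenerated)
```

If the guard fails, a merge on the undamaged key would be safer than a positional copy.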

### Combine Annotations

In [188]:
#annot_df = pd.read_csv('../../data/mel_annotations.csv')
annot_df = pd.read_csv('../../data/mel_annotations_REDO_04022023.tsv', sep='\t')
annot_df['annotator'] = 'melissa'
annot_df.drop('Unnamed: 0', axis=1, inplace=True)
annot_df.drop('Unnamed: 0.1', axis=1, inplace=True)

expected_cols = annot_df.columns
print('mel annotations shape:', annot_df.shape)

# Sarik
#sarik_df = pd.read_csv('../../data/annotated_sarik.txt', sep='\t', encoding='iso-8859-1')
sarik_df = pd.read_csv('../../data/sarik_annotations_REDO_04022023.tsv', sep='\t')
sarik_df['annotator'] = 'sarik'

# he has some extra columns and fields are out of order...
sarik_df = sarik_df[expected_cols]

assert (sarik_df.columns == expected_cols).all(), 'columns not as expected'
print('sarik shape:', sarik_df.shape)

# Anu
#anu_df = pd.read_csv('../../data/Annotated_Anu.tsv', sep='\t')
#anu_df = pd.read_csv('../../data/anu_annotations.tsv', sep='\t')
#anu_df = pd.read_csv('../../data/anu_annotations.tsv')
anu_df = pd.read_csv('../../data/Annotated_Anu.tsv')
anu_df['annotator'] = 'anu'

# fix labels -> our_label
anu_df.drop('our_label', inplace=True, axis=1)
anu_df['our_label'] = anu_df['labels']

# fix columns
anu_df = anu_df[expected_cols]

assert (anu_df.columns == expected_cols).all(), 'anu columns may be off'  # dang, columns are different?
print('anu shape:', anu_df.shape)

# Combine these annotations (DataFrame.append was removed in pandas 2.0)
annot_df = pd.concat([annot_df, sarik_df, anu_df], ignore_index=True)

assert annot_df.shape[0] == 300, f'row count unexpected: {annot_df.shape[0]}'

# standardize our label categories
label_map = {
    'Neutral': 'neutral', 'neu': 'neutral',
    'Negative': 'negative', 'neg': 'negative',
    'Positive': 'positive', 'pos': 'positive',
}
annot_df['our_label'] = annot_df['our_label'].replace(label_map)

print('Unique annotated conversation IDs:', annot_df['conversation_id'].nunique())

display(annot_df.sample(n=25))
annot_df['our_label'].value_counts()


mel annotations shape: (100, 23)
sarik shape: (100, 23)
anu shape: (100, 23)


Unnamed: 0,Unnamed: 0.1.1,conversation_id,lang,reply_settings,created_at,clean_text,text,author_id,referenced_tweets,id,...,public_metrics.like_count,public_metrics.impression_count,in_reply_to_user_id,geo.place_id,withheld.copyright,withheld.country_codes,geo.coordinates.type,geo.coordinates.coordinates,our_label,annotator
70,24696,1633671406650023936,en,everyone,2023-03-09 03:29:50+00:00,rt ybarrap fox news edits trump saying might l...,RT @ybarrap: Fox News Edits Out Trump Saying H...,1171599962649612288,"[{'type': 'retweeted', 'id': '1633664990841282...",1633671406650023936,...,0,0,,,,,,,negative,melissa
283,1622,1633229109009477632,en,everyone,2023-03-07 22:12:18+00:00,rt sagesurge marlon wayans roasting chris rock...,RT @sagesurge: Marlon Wayans roasting Chris Ro...,1130471128072413184,"[{'type': 'retweeted', 'id': '1632525319973289...",1633229109009477632,...,0,0,,,,,,,neutral,anu
84,35680,1633874158764294144,en,everyone,2023-03-09 16:55:30+00:00,rt ricwe u nato orchestration maidan coup ukra...,RT @ricwe123: The US/NATO orchestration of the...,1483594860,"[{'type': 'retweeted', 'id': '1633785840844386...",1633874158764294144,...,0,0,,,,,,,negative,melissa
74,28575,1633664819093213184,en,everyone,2023-03-09 03:03:39+00:00,rt lwv stand resident washington dc quest stat...,RT @LWV: We stand with the residents of Washin...,71977357,"[{'type': 'retweeted', 'id': '1633627246412824...",1633664819093213184,...,0,0,,,,,,,positive,melissa
61,31964,1633877099110109184,en,everyone,2023-03-09 17:07:11+00:00,http co ywo cor let clear said blank check ok,"https://t.co/yWo3coR637 \n""Let’s be very clear...",2860325151,,1633877099110109184,...,0,2,,,,,,,neutral,melissa
65,31371,1632802881962221568,en,everyone,2023-03-07 13:06:46+00:00,crufreeman senmdbrown yes real http co nagco z,"@crufreeman @SenMDBrown Yes, it's real\n\nhttp...",49826627,"[{'type': 'replied_to', 'id': '163299118783649...",1633091822598717440,...,2,32,384963900.0,,,,,,neutral,melissa
280,34539,1633875071487057920,en,everyone,2023-03-09 16:59:07+00:00,rt frontlinekit special thanks felicityspector...,RT @frontlinekit: Special thanks to @FelicityS...,1569539386286698496,"[{'type': 'retweeted', 'id': '1633820635594993...",1633875071487057920,...,0,0,,,,,,,positive,anu
183,23265,1633640638959714304,en,everyone,2023-03-09 03:39:33+00:00,gerardroth adamkinzinger jeffrey sachs advice ...,"@GerardRoth1 @AdamKinzinger Jeffrey Sachs ""adv...",1541225713499394048,"[{'type': 'replied_to', 'id': '163364250573959...",1633673853233758208,...,0,6,302100900.0,,,,,,positive,sarik
200,30947,1633209504631513088,en,everyone,2023-03-07 20:54:24+00:00,rt hegelwcrmcheese fuss senate unequal represe...,RT @HegelwCrmCheese: All the fuss about the Se...,14794913,"[{'type': 'retweeted', 'id': '1633174877141184...",1633209504631513088,...,0,0,,,,,,,neutral,anu
130,32246,1633876882436546560,en,everyone,2023-03-09 17:06:19+00:00,rt gerashchenko en russian propagandist say wa...,RT @Gerashchenko_en: Russian propagandists say...,1496834414768267264,"[{'type': 'retweeted', 'id': '1633759277641785...",1633876882436546560,...,0,0,,,,,,,neutral,sarik


neutral     173
negative     93
positive     34
Name: our_label, dtype: int64
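
Label variants are easy to miss when they are listed by hand. The sketch below is a slightly more defensive normalization that lowercases and trims first, then fails loudly on anything outside the three expected categories. The whitespace trimming and the `allowed` set go beyond what the cell above does and are assumptions on my part.

```python
import pandas as pd

# Toy labels covering the variants seen across the three annotators.
raw = pd.Series(['Neutral', 'neu', 'neg', 'Positive', 'negative', 'pos'])

label_map = {'neu': 'neutral', 'neg': 'negative', 'pos': 'positive'}
allowed = {'neutral', 'negative', 'positive'}

# Lowercase and trim first so only the abbreviations need mapping.
clean = raw.str.strip().str.lower().replace(label_map)

# Fail loudly if an unmapped variant slipped through.
unexpected = set(clean.dropna()) - allowed
assert not unexpected, f'unexpected labels: {unexpected}'

print(clean.value_counts())
```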

In [204]:
# Inspect our combined annotations
annot_df.head()

Unnamed: 0,Unnamed: 0.1.1,conversation_id,lang,reply_settings,created_at,clean_text,text,author_id,referenced_tweets,id,...,public_metrics.like_count,public_metrics.impression_count,in_reply_to_user_id,geo.place_id,withheld.copyright,withheld.country_codes,geo.coordinates.type,geo.coordinates.coordinates,our_label,annotator
0,17871,1633665111406542848,en,everyone,2023-03-09 03:04:49+00:00,rt tomfitton obama knew clinton knew biden kne...,RT @TomFitton: Obama knew. Clinton knew. Biden...,3024401194,"[{'type': 'retweeted', 'id': '1633625411845345...",1633665111406542848,...,0,0,,,,,,,neutral,melissa
1,17539,1633666519245750272,en,everyone,2023-03-09 03:10:24+00:00,cant even threaten friend game without watchli...,cant even threaten my friends during a game wi...,1579224008033275904,"[{'type': 'quoted', 'id': '1633098341578940417'}]",1633666519245750272,...,0,2,,,,,,,negative,melissa
2,32972,1633876284940877824,en,everyone,2023-03-09 17:03:56+00:00,trump impeached call gop house brought article...,Trump was impeached over a ☎️ call! Why has th...,200698213,,1633876284940877824,...,0,15,,,,,,,negative,melissa
3,1203,1633277949234364416,en,everyone,2023-03-08 01:26:22+00:00,rt sagesurge marlon wayans roasting chris rock...,RT @sagesurge: Marlon Wayans roasting Chris Ro...,1093689222823849984,"[{'type': 'retweeted', 'id': '1632525319973289...",1633277949234364416,...,0,0,,,,,,,neutral,melissa
4,26453,1633668541122592768,en,everyone,2023-03-09 03:18:26+00:00,rt visegrad time west send patriot missile def...,RT @visegrad24: It’s time for the West to send...,4060962981,"[{'type': 'retweeted', 'id': '1633639497408696...",1633668541122592768,...,0,0,,,,,,,neutral,melissa


### Collect identifiers of our annotated set
- Identify the annotated entries so we can separate the human-labeled tweets from the others.
- Split the combined model-predicted Twitter dataset into human-labeled and non-human-labeled subsets (removing our 300).
- Save all 3 datasets.

In [197]:
# Capture our annotation conversation_ids
human_annotation_ids = annot_df['conversation_id'].to_list()
assert len(human_annotation_ids) == 300, 'wrong # human-annotated IDs'

annotated_tweets_df = df.loc[df['conversation_id'].isin(human_annotation_ids)]
assert annotated_tweets_df.shape[0] == 300, f'issue getting our annotated tweets separated: {annotated_tweets_df.shape[0]}'
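
The split can be checked as a true partition: every row lands in exactly one subset, and every annotated ID is actually found in the model data. A minimal sketch on toy data, with invented frame names and values:

```python
import pandas as pd

# Toy stand-in for the combined model data.
model_df = pd.DataFrame({
    'conversation_id': [1, 2, 3, 4, 5],
    'vader_Polarity': [0.1, -0.4, 0.0, 0.7, -0.2],
})
annotated_ids = [2, 5]  # stand-in for human_annotation_ids

# One boolean mask drives both halves of the split.
is_annotated = model_df['conversation_id'].isin(annotated_ids)
annotated_part = model_df[is_annotated]
unannotated_part = model_df[~is_annotated]

# The two parts must cover the frame exactly once.
assert len(annotated_part) + len(unannotated_part) == len(model_df)

# Every annotated ID should exist in the model data.
missing = set(annotated_ids) - set(model_df['conversation_id'])
assert not missing, f'annotated IDs absent from model data: {missing}'
```

Computing the mask once and reusing it with `~` keeps the two subsets consistent with each other.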


# Finally - save model datasets


In [202]:
df.shape  # (30611, 27)
df.loc[~df['conversation_id'].isin(human_annotation_ids)].shape  # 30311
df.loc[df['conversation_id'].isin(human_annotation_ids)].shape  # 300

timestamp = str(datetime.now()).replace(' ', '_').replace(':', '.')

# Save un-annotated model data:
output_filepath = f'../../data/combined_model_data_without_annotations{timestamp}.tsv'
df.loc[~df['conversation_id'].isin(human_annotation_ids)].to_csv(output_filepath, sep='\t')
print(f'wrote {output_filepath}')

# Save all model data:
output_filepath = f'../../data/combined_model_data_all_{timestamp}.tsv'
df.to_csv(output_filepath, sep='\t')
print(f'wrote {output_filepath}')

# Save annotated model data:
output_filepath = f'../../data/combined_model_data_with_annotations_{timestamp}.tsv'
df.loc[df['conversation_id'].isin(human_annotation_ids)].to_csv(output_filepath, sep='\t')
print(f'wrote {output_filepath}')


wrote ../../data/combined_model_data_without_annotations2023-04-02_20.13.51.914670.tsv
wrote ../../data/combined_model_data_all_2023-04-02_20.13.51.914670.tsv
wrote ../../data/combined_model_data_with_annotations_2023-04-02_20.13.51.914670.tsv
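
Two small details of the save step may be worth a sketch. First, `strftime` can produce the same timestamp format as the chained `replace` calls, shown here for a fixed moment with non-zero microseconds. Second, `to_csv` writes the index by default, which is the likely source of the `Unnamed: 0` columns that appear when these files are read back; passing `index=False` avoids that. The frame and buffers below are stand-ins.

```python
import io
from datetime import datetime

import pandas as pd

# Fixed moment so both formatting routes can be compared.
moment = datetime(2023, 4, 2, 20, 13, 51, 914670)
via_replace = str(moment).replace(' ', '_').replace(':', '.')
via_strftime = moment.strftime('%Y-%m-%d_%H.%M.%S.%f')

small_df = pd.DataFrame({'conversation_id': [1, 2],
                         'our_label': ['neutral', 'negative']})

# Default to_csv writes the index, which comes back as 'Unnamed: 0'.
with_index = io.StringIO()
small_df.to_csv(with_index, sep='\t')
with_index.seek(0)
cols_default = pd.read_csv(with_index, sep='\t').columns.tolist()

# index=False keeps the round trip free of the extra column.
no_index = io.StringIO()
small_df.to_csv(no_index, sep='\t', index=False)
no_index.seek(0)
cols_no_index = pd.read_csv(no_index, sep='\t').columns.tolist()

print(via_strftime, cols_default, cols_no_index)
```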


# Rubbish

### Do NOT Collect all annotations in a merge (fixed duplication false alarm) (DELETEME)



In [158]:
# assert False
# annot_df = pd.read_csv('../../data/mel_annotations.csv')
# annot_df['annotator'] = 'melissa'

# annot_df.drop('Unnamed: 0', axis=1, inplace=True)
# annot_df.drop('Unnamed: 0.1', axis=1, inplace=True)

# expected_cols = annot_df.columns
# #annot_df['mel_label'] = annot_df['our_label']
# annot_df['our_label_mel'] = annot_df['our_label']  # adding so I match the others
# print('mel annotations shape:', annot_df.shape)

# # Sarik
# sarik_df = pd.read_csv('../../data/annotated_sarik.txt', sep='\t', encoding='iso-8859-1')
# sarik_df['annotator'] = 'sarik'

# # he has some extra columns and fields are out of order...
# sarik_df = sarik_df[expected_cols]

# assert (sarik_df.columns == expected_cols).all(), 'columns not as expected'
# #sarik_df['sarik_label'] = sarik_df['our_label']  # add sarik label col for merging
# print('sarik shape:', sarik_df.shape)

# # Anu
# anu_df = pd.read_csv('../../data/Annotated_Anu.tsv')
# anu_df['annotator'] = 'anu'

# # fix labels -> our_label
# anu_df.drop('our_label', inplace=True, axis=1)
# anu_df['our_label'] = anu_df['labels']

# # fix columns
# anu_df = anu_df[expected_cols]

# assert (anu_df.columns == expected_cols).all(), 'anu columns may be off'  # dang, columns are different?
# print('anu shape:', anu_df.shape)

# # Merge:
# annot_df = annot_df.merge(sarik_df[['conversation_id', 'our_label']],
#                           how='outer', on='conversation_id', suffixes=('', '_sarik'))

# annot_df = annot_df.merge(anu_df[['conversation_id', 'our_label']],
#                           how='outer', on='conversation_id', suffixes=('', '_anu'))

# annot_df.drop('our_label', inplace=True, axis=1)

# #assert annot_df.shape[0] == 300, f'row count unexpected: {annot_df.shape[0]}'

# # standardize over all variations of our labels:
# label_cols = [x for x in annot_df.columns if 'our_label' in x.lower()]
# for col in label_cols:
#     annot_df.loc[annot_df[col] == 'Neutral', col] = 'neutral'
#     annot_df.loc[annot_df[col] == 'neu', col] = 'neutral'
#     annot_df.loc[annot_df[col] == 'Negative', col] = 'negative'
#     annot_df.loc[annot_df[col] == 'neg', col] = 'negative'
#     annot_df.loc[annot_df[col] == 'Positive', col] = 'positive'
#     annot_df.loc[annot_df[col] == 'pos', col] = 'positive'


# # Inspect a sample
# annot_df.sample(n=30)
# #annot_df[label_cols].sample(n=30)



mel annotations shape: (100, 23)
sarik shape: (100, 22)
anu shape: (100, 22)


Unnamed: 0,conversation_id,lang,reply_settings,created_at,clean_text,text,author_id,referenced_tweets,id,edit_history_tweet_ids,...,in_reply_to_user_id,geo.place_id,withheld.copyright,withheld.country_codes,geo.coordinates.type,geo.coordinates.coordinates,annotator,our_label_mel,our_label_sarik,our_label_anu
378,1.63395e+18,en,everyone,2023-03-09 22:02:38+00:00,chatgpt api good cheap make text generating ai...,#ChatGPT's #API is so good and cheap it makes ...,19628770.0,,1.63395e+18,['1633951454544248839'],...,,,,,,,melissa,positive,negative,
528,1.633147e+18,,,,,,,,,,...,,,,,,,,,,positive
465,1.63368e+18,en,everyone,2023-03-09 03:44:57+00:00,rt robin hoodsband russia road victory evil na...,RT @Robin_Hoodsband: Russia is on the road to ...,2287968000.0,"[{'type': 'retweeted', 'id': '1633500528234594...",1.63368e+18,['1633675210929971201'],...,,,,,,,melissa,positive,negative,
479,1.63359e+18,en,everyone,2023-03-08 21:49:04+00:00,protester arrested demonstrator confront polic...,Protester arrested as demonstrators confront p...,6.99023e+17,,1.63359e+18,['1633585651714236416'],...,,,,,,,melissa,negative,negative,
333,1.63387e+18,en,everyone,2023-03-09 16:53:11+00:00,russia take east bakhmut ukraine build force r...,Russia takes east Bakhmut as Ukraine builds up...,1.48485e+18,,1.63387e+18,['1633873578553933824'],...,,,,,,,melissa,neutral,negative,
253,1.63303e+18,en,everyone,2023-03-07 08:56:35+00:00,rt tasterreblanche sagesurge also said chrisro...,RT @Tasterreblanche: @sagesurge He also said🤷🏻...,1.21436e+18,"[{'type': 'retweeted', 'id': '1632830373305409...",1.63303e+18,['1633028860735369216'],...,,,,,,,melissa,neutral,,
231,1.63388e+18,en,everyone,2023-03-09 17:07:11+00:00,http co ywo cor let clear said blank check ok,"https://t.co/yWo3coR637 \n""Let’s be very clear...",2860325000.0,,1.63388e+18,['1633877099110109184'],...,,,,,,,melissa,neutral,negative,
448,1.63394e+18,en,everyone,2023-03-09 23:45:56+00:00,pauljessup chatgpt us artificial neural net co...,@pauljessup ChatGPT uses an Artificial Neural ...,430965800.0,"[{'type': 'replied_to', 'id': '163393867815319...",1.63398e+18,['1633977450542641152'],...,14079203.0,,,,,,melissa,neutral,neutral,
502,1.63375e+18,,,,,,,,,,...,,,,,,,,,positive,
405,1.63398e+18,en,everyone,2023-03-09 23:38:38+00:00,rt mtaibbi chatgpt testified would called dire...,RT @mtaibbi: If ChatGPT testified to this it w...,1.59924e+18,"[{'type': 'retweeted', 'id': '1633934237815431...",1.63398e+18,['1633975612305182720'],...,,,,,,,melissa,neutral,neutral,


In [153]:
# mel_df = pd.read_csv('../../data/mel_annotations.tsv', sep='\t')
# mel_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,conversation_id,lang,reply_settings,created_at,clean_text,text,author_id,...,public_metrics.reply_count,public_metrics.like_count,public_metrics.impression_count,in_reply_to_user_id,geo.place_id,withheld.copyright,withheld.country_codes,geo.coordinates.type,geo.coordinates.coordinates,our_label
0,0,13581,17871,1.63367e+18,en,everyone,2023-03-09 03:04:49+00:00,rt tomfitton obama knew clinton knew biden kne...,RT @TomFitton: Obama knew. Clinton knew. Biden...,3024401000.0,...,0,0,0,,,,,,,neutral
1,1,13249,17539,1.63367e+18,en,everyone,2023-03-09 03:10:24+00:00,cant even threaten friend game without watchli...,cant even threaten my friends during a game wi...,1.57922e+18,...,0,0,2,,,,,,,negative
2,2,28682,32972,1.63388e+18,en,everyone,2023-03-09 17:03:56+00:00,trump impeached call gop house brought article...,Trump was impeached over a ☎️ call! Why has th...,200698200.0,...,0,0,15,,,,,,,negative
3,3,1203,1203,1.63328e+18,en,everyone,2023-03-08 01:26:22+00:00,rt sagesurge marlon wayans roasting chris rock...,RT @sagesurge: Marlon Wayans roasting Chris Ro...,1.09369e+18,...,0,0,0,,,,,,,neutral
4,4,22163,26453,1.63367e+18,en,everyone,2023-03-09 03:18:26+00:00,rt visegrad time west send patriot missile def...,RT @visegrad24: It’s time for the West to send...,4060963000.0,...,0,0,0,,,,,,,neutral
