This notebook uses BERT to extract sentence embeddings from the memory descriptions.

Following this tutorial: https://www.geeksforgeeks.org/how-to-generate-word-embedding-using-bert/

In [3]:
import numpy as np
import pandas as pd
import random
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import scipy.stats
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

Set random seed

In [4]:
random_seed = 9
random.seed(random_seed)

# set for PyTorch
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

Load BERT

In [23]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Loading previously triggered a JavaScript widget error in Jupyter, but it went away on its own.
# If it happens again, hard-refresh the page (ctrl + refresh button). See:
# https://stackoverflow.com/questions/73715821/jupyter-lab-issue-displaying-widgets-javascript-error

# Sentence embeddings for all memories

First, we generate sentence embeddings for every reported memory. The content analysis itself only compares pairs of memories, but embeddings for the full set could be useful in future exploratory analysis.

**Note:** we don't filter out "private" memories!

### Load the data

In [10]:
all_memories = pd.read_csv('../../processed/dense/memory/all_memories.csv', header = 0)

In [14]:
all_memories

Unnamed: 0,internal_id,session,song_id,orig_or_cover,age_at_event,association,control,description,emot_content,emot_intensity,energetic,experience,familiarity,important,preference,social,unique,vivid
0,1,1,2013_1,cover,20.0,I do not associate it,completely spontaneous,I was walking alone to one of my first few job...,neutral,not at all,not at all,I have never heard it,not at all,not at all,not at all,not at all,very,not at all
1,1,1,2016_5,orig,9.0,I do not associate it,completely spontaneous,I remembered when my mother threw out my Yugio...,somewhat negative,not at all,not at all,I have never heard it,not at all,a little,not at all,extremely,extremely,not at all
2,1,1,2017_2,orig,21.0,I do not associate with anything in particular...,completely spontaneous,I was sitting at home alone watching a youtube...,somewhat positive,not at all,a little,I have heard it a few times over the internet ...,a little,not at all,not at all,a little,very,not at all
3,1,1,2017_3,orig,10.0,I do not associate it,completely spontaneous,I remembered when I was at a pool party with m...,neutral,not at all,not at all,I have never heard it,not at all,not at all,not at all,extremely,extremely,not at all
4,1,1,2018_5,cover,9.0,I do not associate it with anything,completely spontaneous,I was at a fireworks festival when I smelled s...,very negative,not at all,not at all,I have never heard it before,not at all,not at all,not at all,a little,extremely,not at all
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2083,90,2,2011_1,cover,13.0,middle school,somewhat spontaneous,I watched a girl sing this flawlessly while we...,somewhat positive,a little,a little,First Adele song I heard. I like this one ok b...,very,a little,a little,somewhat,a little,somewhat
2084,90,2,2013_2,orig,16.0,high school,completely spontaneous,"This was part of my ""feminist awakening"" on Tu...",somewhat negative,somewhat,somewhat,I always secretly liked the sound of this song...,very,a little,a little,a little,a little,very
2085,90,2,2014_1,orig,17.0,,somewhat spontaneous,I was alone in my childhood bedroom watching t...,neutral,not at all,a little,I don't like it!,very,not at all,not at all,not at all,a little,somewhat
2086,90,2,2014_5,cover,17.0,latter half of high school,completely spontaneous,I remember sitting in my current events class ...,neutral,not at all,not at all,I heard it quite a bit especially back then bu...,somewhat,not at all,not at all,somewhat,somewhat,somewhat


Only keep the identifying columns and the memory description.

In [17]:
all_memories = all_memories[['internal_id', 'song_id', 'orig_or_cover', 'description']]

In [19]:
all_memories

Unnamed: 0,internal_id,song_id,orig_or_cover,description
0,1,2013_1,cover,I was walking alone to one of my first few job...
1,1,2016_5,orig,I remembered when my mother threw out my Yugio...
2,1,2017_2,orig,I was sitting at home alone watching a youtube...
3,1,2017_3,orig,I remembered when I was at a pool party with m...
4,1,2018_5,cover,I was at a fireworks festival when I smelled s...
...,...,...,...,...
2083,90,2011_1,cover,I watched a girl sing this flawlessly while we...
2084,90,2013_2,orig,"This was part of my ""feminist awakening"" on Tu..."
2085,90,2014_1,orig,I was alone in my childhood bedroom watching t...
2086,90,2014_5,cover,I remember sitting in my current events class ...


### Generate sentence embeddings

In [88]:
sentence_embeddings = []

for row in range(len(all_memories)):
    # print every fifty rows to keep track of where we are
    if row % 50 == 0: print(row)
    
    # grab this row - keep names consistent with dataframe
    internal_id = all_memories.iloc[row]["internal_id"]
    song_id = all_memories.iloc[row]["song_id"]
    orig_or_cover = all_memories.iloc[row]["orig_or_cover"]
    description = all_memories.iloc[row]["description"]

    # tokenize the input text for the memory description
    encoding = tokenizer.batch_encode_plus([description], padding=True, truncation=True, return_tensors='pt', add_special_tokens=True)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']
    
    # generate embeddings
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        sentence_embedding = outputs.pooler_output
        #average_word_embedding = outputs.last_hidden_state.mean(dim=1) # if we want to compare average word embeddings to pooled output

    # append the embeddings to our list
    sentence_embeddings.append([internal_id, song_id, orig_or_cover] + sentence_embedding.tolist()[0])

# turn list of embeddings into a dataframe
embeddings_df = pd.DataFrame(sentence_embeddings, columns = ['internal_id', 'song_id', 'orig_or_cover'] + np.arange(1,769).tolist())
embeddings_df

0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
1050
1100
1150
1200
1250
1300
1350
1400
1450
1500
1550
1600
1650
1700
1750
1800
1850
1900
1950
2000
2050


Unnamed: 0,internal_id,song_id,orig_or_cover,1,2,3,4,5,6,7,...,759,760,761,762,763,764,765,766,767,768
0,1,2013_1,cover,-0.925039,-0.476511,-0.820146,0.806121,0.750252,-0.263882,0.900542,...,0.415158,0.133134,0.985191,0.896381,-0.238206,0.392944,0.534719,-0.758175,-0.565566,0.959932
1,1,2016_5,orig,-0.725542,-0.170654,-0.016003,0.391178,0.064678,-0.125905,0.626576,...,0.360012,0.354670,0.153214,0.760723,0.278341,0.566524,0.412910,-0.148993,-0.570092,0.839836
2,1,2017_2,orig,-0.822758,-0.352374,-0.358245,0.545977,0.640670,-0.177868,0.559740,...,0.360978,0.118866,0.901249,0.838365,0.052868,0.700132,0.486593,-0.466681,-0.606861,0.880014
3,1,2017_3,orig,-0.689323,-0.106435,0.154920,0.106485,0.154481,-0.173088,0.394321,...,0.224393,0.588233,0.244819,0.667610,0.404708,0.709202,0.253536,-0.193580,-0.557565,0.820684
4,1,2018_5,cover,-0.834857,-0.417065,-0.782101,0.466850,0.813551,-0.326551,0.459257,...,0.394750,0.041945,0.974859,0.705473,-0.337239,0.462324,0.496768,-0.774906,-0.668279,0.907586
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2083,90,2011_1,cover,-0.806903,-0.383219,-0.701549,0.506742,0.800984,-0.239791,0.634900,...,0.222808,-0.051539,0.961630,0.716346,-0.202372,0.509185,0.361068,-0.814546,-0.614946,0.842037
2084,90,2013_2,orig,-0.749622,-0.499494,-0.897951,0.491014,0.779210,-0.302061,0.551081,...,0.448018,0.133541,0.963856,0.735331,-0.022848,-0.119280,0.569422,-0.767897,-0.654401,0.815262
2085,90,2014_1,orig,-0.946511,-0.576533,-0.970251,0.879920,0.885311,-0.392012,0.895518,...,0.508017,-0.781010,0.998417,0.898583,0.064658,0.218098,0.635531,-0.880555,-0.729259,0.932932
2086,90,2014_5,cover,-0.471628,-0.115952,-0.225138,-0.160180,0.543597,-0.240385,-0.468006,...,-0.035180,0.419475,0.791787,0.360677,0.095291,0.731498,0.116594,-0.552660,-0.409857,0.712509


(Fifty rows take about 2 seconds.)
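
The loop above stores `outputs.pooler_output` and leaves the average word embedding (`outputs.last_hidden_state.mean(dim=1)`) commented out. Because each description is encoded on its own, there is no padding and a plain mean is fine. If descriptions were ever encoded in padded batches, the mean would need to skip padding tokens. Below is a minimal sketch of that masked mean, using NumPy arrays as toy stand-ins for `last_hidden_state` and `attention_mask`. The function name `masked_mean_pool` and the toy values are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) tokens only.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(float)    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # (batch, 1); avoids division by zero
    return summed / counts

# toy stand-in for last_hidden_state: 2 sentences, 3 token slots, 4 dimensions
toy_hidden = np.array([
    [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0], [9.0, 9.0, 9.0, 9.0]],
    [[2.0, 0.0, 2.0, 0.0], [4.0, 0.0, 4.0, 0.0], [6.0, 0.0, 6.0, 0.0]],
])
toy_mask = np.array([[1, 1, 0],   # third slot of the first sentence is padding
                     [1, 1, 1]])

pooled = masked_mean_pool(toy_hidden, toy_mask)
print(pooled)
```

In the first toy sentence the padded slot (all 9s) is ignored, so the pooled vector is the mean of the first two token vectors only. Note that a mean over BERT tokens, masked or not, still includes the special `[CLS]` and `[SEP]` tokens unless they are masked out as well.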

Save the output.

In [92]:
embeddings_df.to_csv('../../processed/dense/memory/all_memory_embeddings.csv', index = False)

# Semantic distance for memory pairs

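Now that we have all of the sentence embeddings, we compute the semantic distance between the two memories in each pair as the Euclidean distance between their embeddings.

A pair is a song that appears twice for the same participant, with one memory for the original and one for the cover. The cell below identifies the pairs and computes their distances.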

In [171]:
embedding_distances = []

for sub in pd.unique(embeddings_df['internal_id']):
#for sub in [1,2]:
    # get just this participant's memories
    this_sub = embeddings_df[embeddings_df['internal_id'] == sub]

    # get the songs and their counts 
    these_songs, these_song_counts = np.unique(this_sub['song_id'], return_counts=True)

    # if the song shows up twice, then that's a memory pair
    for i in range(len(these_songs)):
        if these_song_counts[i] == 2: 
            song_id = these_songs[i]
            this_song = this_sub[this_sub['song_id'] == song_id]
            # drop the three identifying columns and cast to float
            # (to_numpy() on mixed columns returns an object array)
            orig_embedding = this_song[this_song['orig_or_cover'] == 'orig'].to_numpy()[0,3:].astype(float)
            cover_embedding = this_song[this_song['orig_or_cover'] == 'cover'].to_numpy()[0,3:].astype(float)
            # compute semantic distance using Euclidean distance
            semantic_distance = np.linalg.norm(orig_embedding - cover_embedding)

            embedding_distances.append([sub, song_id, semantic_distance])
            
embedding_dist_df = pd.DataFrame(embedding_distances, columns = ['internal_id', 'song_id', 'semantic_distance'])
embedding_dist_df

Unnamed: 0,internal_id,song_id,semantic_distance
0,1,2017_2,4.476163
1,2,2008_2,3.320117
2,2,2010_1,2.279668
3,2,2011_3,2.294284
4,2,2012_4,4.059248
...,...,...,...
625,89,2011_2,5.434355
626,89,2012_3,4.745692
627,89,2015_4,3.691397
628,90,2011_1,3.891313
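
The nested loop above can also be expressed as an index alignment in pandas. The sketch below is one possible alternative, not the method used in this notebook. It runs on a small toy frame (hypothetical values, two embedding dimensions instead of 768) and assumes each participant has at most one `orig` and one `cover` row per song.

```python
import numpy as np
import pandas as pd

# toy stand-in for embeddings_df with 2-dimensional "embeddings" (hypothetical values)
toy = pd.DataFrame({
    'internal_id':   [1, 1, 1, 2, 2],
    'song_id':       ['2017_2', '2017_2', '2013_1', '2008_2', '2008_2'],
    'orig_or_cover': ['orig', 'cover', 'cover', 'cover', 'orig'],
    1: [0.0, 3.0, 1.0, 1.0, 1.0],
    2: [0.0, 4.0, 1.0, 2.0, 2.0],
})
keys = ['internal_id', 'song_id']
dims = [1, 2]  # embedding columns; 1..768 in the real dataframe

# split by version and index by participant and song
orig = toy[toy['orig_or_cover'] == 'orig'].set_index(keys)[dims]
cover = toy[toy['orig_or_cover'] == 'cover'].set_index(keys)[dims]

# an inner alignment keeps only participant/song combinations present in both versions
orig_aligned, cover_aligned = orig.align(cover, join='inner', axis=0)

# row-wise Euclidean distance between the aligned embeddings
diff = orig_aligned.to_numpy(dtype=float) - cover_aligned.to_numpy(dtype=float)
pair_dist = orig_aligned.index.to_frame(index=False)
pair_dist['semantic_distance'] = np.linalg.norm(diff, axis=1)
print(pair_dist)
```

Here the song `2013_1` has only a cover memory, so it drops out, leaving two pairs. Unlike the count check in the loop, this version would not catch a participant with duplicate rows for the same song and version, so that assumption should be verified before using it on the real data.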


In [173]:
print(np.max(embedding_dist_df['semantic_distance']))

18.58325032095421
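
For scale: BERT's pooled output passes through a tanh, so every component lies in [-1, 1] and the Euclidean distance between two 768-dimensional embeddings cannot exceed 2·√768 ≈ 55.4. The notebook imports `cosine_similarity` but computes Euclidean distance. If a bounded, magnitude-independent measure were wanted instead, cosine distance is one option. The sketch below contrasts the two on toy vectors. The helper names are assumptions for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 for the same direction, 2 for opposite directions."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])   # same direction as a, larger magnitude
c = np.array([0.0, 2.0])   # orthogonal to a

print(euclidean_distance(a, b), cosine_distance(a, b))
print(euclidean_distance(a, c), cosine_distance(a, c))
```

Vectors `a` and `b` point the same way, so their cosine distance is zero even though their Euclidean distance is not. The two measures can therefore rank pairs differently.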


In [175]:
embedding_dist_df.to_csv('../../processed/dense/memory/memory_pairs_semantic_distances.csv', index = False)