# Semantic Embedding Analysis

This notebook contains the reference methods used for the semantic embedding analysis in the submitted paper:

### CONTENT ANALYSIS, AGING AND AUTOBIOGRAPHICAL MEMORY Differences in the Content and Coherence of Autobiographical Memories Between Younger and Older Adults: Insights from Text Analysis 


-----------------

## Input Transcript Text

`transcripts_df` is a pandas DataFrame with the following columns:
- subject_id
- transcript_id
- raw_text: the text of the transcript
- stage: one of 'teenage', 'childhood', 'adult', 'earlyadult', 'middleadult', 'lateadult'
- agegroup: one of 'younger', 'older'
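
For orientation, a frame with this layout could be built as in the sketch below. The rows are hypothetical placeholders, not study data, and the integer ID types are an assumption.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real transcripts are not reproduced here.
transcripts_df = pd.DataFrame(
    {
        "subject_id": [1, 1, 2],  # ID types are an assumption
        "transcript_id": [101, 102, 201],
        "raw_text": [
            "I remember the summer we moved to a new house near the lake.",
            "My first job was at a bakery, and I started every morning at five.",
            "We used to walk to school together, even in the winter.",
        ],
        "stage": ["childhood", "earlyadult", "childhood"],
        "agegroup": ["younger", "younger", "older"],
    }
)

# Sanity-check the categorical columns against the values listed above
allowed_stages = {"teenage", "childhood", "adult", "earlyadult", "middleadult", "lateadult"}
assert set(transcripts_df["stage"]).issubset(allowed_stages)
assert set(transcripts_df["agegroup"]).issubset({"younger", "older"})
```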



### Analytic Methods

#### Load Universal Sentence Encoder
Load the encoder and define a function that computes the pairwise similarities (inner products, called correlations in the code) among the embeddings of a set of text inputs.

In [None]:
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns


# Load Google's Universal Sentence Encoder (version 4) from TensorFlow Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)


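def embed(texts):
  # Returns one 512-dimensional embedding per input string
  return model(texts)


def get_embedding_correlations(text_list):
  text_embeddings = embed(text_list)
  # Pairwise inner products between all text embeddings (n x n matrix)
  text_embeddings_corr = np.inner(text_embeddings, text_embeddings)
  n = text_embeddings_corr.shape[1]
  # Row means with the self-similarity on the diagonal removed (assumed to be 1,
  # i.e. unit-length embeddings), rescaled to an average over the other n-1 texts
  average_other_correlations = (text_embeddings_corr.mean(1) - 1./n) * (n / (n - 1))
  return dict(
      corr_avg=average_other_correlations.mean(),
      corr_var=average_other_correlations.var(),
      # Despite its name, this is the fraction (not the count) of texts above 0.4
      num_gt_0p4=(average_other_correlations > 0.4).mean(),
  )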
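
The rescaling step can be checked without loading the encoder. The sketch below uses random unit-length vectors as a stand-in for real embeddings (an assumption made only for illustration) and confirms that subtracting `1/n` from each row mean and multiplying by `n/(n-1)` equals the plain mean of the off-diagonal entries in that row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder output: 5 random vectors scaled to unit length
# (an assumption for illustration; real embeddings come from the encoder).
vectors = rng.normal(size=(5, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

sim = np.inner(vectors, vectors)  # n x n similarity matrix, diagonal equals 1
n = sim.shape[1]

# Formula from the notebook: remove the diagonal's contribution, rescale to n-1 terms
avg_other = (sim.mean(1) - 1.0 / n) * (n / (n - 1))

# Direct computation: mean of the off-diagonal entries in each row
off_diagonal = sim[~np.eye(n, dtype=bool)].reshape(n, n - 1)
avg_other_direct = off_diagonal.mean(1)

assert np.allclose(avg_other, avg_other_direct)
```

The identity relies on the diagonal being exactly 1. If the embeddings were not unit length, the `1/n` correction would be off by the difference between each squared norm and 1.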


#### Prepare Transcript Sliding Window


In [None]:
import re
import pandas as pd
import numpy as np

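def get_split_indexes(x):
  # Character offsets of token boundaries: start of text, every '.', ',' or ' ', and end of text
  index_series = pd.Series([0] + [m.start() for m in re.finditer(r'[., ]', x)] + [len(x)])
  # Keep the first index and any index more than one character after the previous one,
  # so a run of adjacent delimiters (e.g. ". ") counts as a single boundary
  return list(index_series.loc[index_series.diff().map(lambda d: pd.isna(d) or d > 1)].values)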

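def window_text(x, window_length=15, stride_length=15):
  # Split x into windows spanning window_length token boundaries each,
  # advancing stride_length boundaries between consecutive windows
  token_split_indexes = get_split_indexes(x)
  num_tokens = len(token_split_indexes)  # number of boundaries, including start and end
  window_texts = []
  t1 = 0
  # A text shorter than one window yields a single window covering the whole text
  t2 = min(num_tokens-1, window_length)
  while t2 < num_tokens:
    i1 = token_split_indexes[t1]
    i2 = token_split_indexes[t2]

    window_texts += [x[i1:i2]]

    t1 = t1 + stride_length
    t2 = t2 + stride_length
  # Any trailing text that does not fill a complete window is not included
  return window_texts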
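
To see how the window and stride lengths interact, the sketch below uses a simplified stand-in, `window_words` (a hypothetical helper, not part of the notebook), that windows over whitespace-separated words instead of the delimiter offsets used above. It follows the same pattern: only complete windows are emitted, except that a text shorter than one window is returned as a single window.

```python
def window_words(text, window_length=15, stride_length=15):
    """Simplified stand-in for window_text: windows over whitespace-separated words."""
    words = text.split()
    windows = []
    start = 0
    # A text shorter than one window becomes a single window
    end = min(len(words), window_length)
    while end <= len(words) and start < end:
        windows.append(" ".join(words[start:end]))
        start += stride_length
        end += stride_length
    return windows


text = " ".join(f"w{i}" for i in range(10))

# Stride equal to the window length: non-overlapping windows, trailing words dropped
print(window_words(text, window_length=4, stride_length=4))
# ['w0 w1 w2 w3', 'w4 w5 w6 w7']

# Stride shorter than the window length: overlapping windows
print(window_words(text, window_length=4, stride_length=2))
# ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```

With the notebook defaults (window and stride both 15), windows do not overlap, so each stretch of text contributes to exactly one embedding.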


#### Calculate Internal Similarity For Transcript Sliding Windows

In [None]:
transcripts_df["window_texts"] = transcripts_df.raw_text.map(window_text)
transcripts_df["num_window_texts"] = transcripts_df.window_texts.map(len)

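# Keep transcripts with more than 3 windows; copy so that the column assignment
# that follows does not trigger pandas' SettingWithCopyWarning
transcripts_df = transcripts_df.query("num_window_texts > 3").copy()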

transcripts_df["internal_similarity"] = transcripts_df["window_texts"].map(get_embedding_correlations)

#### Determine Average Internal Similarity Score Per Transcript

In [None]:
transcripts_df["internal_similarity_mean"] = transcripts_df["internal_similarity"].map(lambda x: None if not x else x.get("corr_avg"))
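
If the other two statistics in each dictionary are also of interest, all three keys can be expanded into columns at once. The sketch below shows one possible way to do this on a toy frame; the numeric values are hypothetical placeholders, not results from the paper.

```python
import pandas as pd

# Hypothetical values standing in for the output of get_embedding_correlations
toy_df = pd.DataFrame(
    {
        "transcript_id": [101, 102],
        "internal_similarity": [
            {"corr_avg": 0.31, "corr_var": 0.004, "num_gt_0p4": 0.10},
            {"corr_avg": 0.42, "corr_var": 0.002, "num_gt_0p4": 0.60},
        ],
    }
)

# One column per dictionary key, aligned on the original index
metrics = pd.DataFrame(toy_df["internal_similarity"].tolist(), index=toy_df.index)
toy_df = toy_df.join(metrics)

print(toy_df[["transcript_id", "corr_avg", "corr_var", "num_gt_0p4"]])
```

Passing the original index to the `pd.DataFrame` constructor keeps the rows aligned even after filtering has left gaps in the index.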