# Beyond n-grams, tf-idf, and word indicators for text: Leveraging the Python API for vector embeddings

## Billy Buchanan, Sr. Research Scientist, SAG Corporation

## Abstract
This talk will share strategies that Stata users can use to get more informative word, sentence, and document vector embeddings of text in their data. While indicator and bag-of-words strategies can be useful for some types of text analytics, they lack the richness of the semantic relationships between words that provide meaning and structure to language. Vector space embeddings attempt to preserve these relationships and in doing so can provide more robust numerical representations of text data that can be used for subsequent analysis. I will share strategies for using existing tools from the Python ecosystem with Stata to leverage the advances in NLP in your Stata workflow.

## Information about this Notebook
I wanted to find some examples that included actual language that might be useful for others.  I came across a dataset used for a paper at the 2020 Association of Computational Linguistics conference that seemed like it could be interesting.  The cells with code below assume that some tools have already been installed in your environment.  I'll include a snippet below in this cell that will show you what you will need to do to install these other packages in your Python environment.  

The first section of this Notebook is purely data acquisition, preparation, and management.  The second section will show more about obtaining word vectors, what the word vectors look like, obtaining sentence/paragraph/document vectors, and will potentially show an example of using the vectors in a model.

```
pip install -U pip setuptools wheel

# You could also use conda install -c conda-forge spacy at this step as well
pip install -U spacy 

# This file is about 0.5GB in size but is used in examples below
# You could swap the lg at the end with md and should still be able to follow 
# along, but may have embeddings of a different dimension
python -m spacy download en_core_web_lg 

# You can use conda install -c huggingface transformers at this step as well
pip install transformers[torch]
```

Since it can take a bit to initialize and download some of the NLP models (transformers will download/install models as needed), we'll do that towards the beginning to so you can get that stuff done before I start discussing the examples/code below.

In [1]:
# The majority of the examples will use spaCy for speed, but 
# I will also try to include an example of applying the same concepts using transformers as well
import spacy

# Torch is Facebook's Deep Learning Library and it runs reasonably on a CPU only machine
import torch

torch.manual_seed(0)

# This will load the tokenizers and models using the BERT architecture
from transformers import BertTokenizer, BertModel

# This will initialize the tokenizer and download the pretrained model parameters
# You can also use 'bert-large-cased' if you are using Stata SE or Stata MP.
# 'bert-large-cased' will produce 1,024 dimensional vectors, while 
# 'bert-base-cased' will return only 768 dimensional vectors.  
# If you really need something more expressive, there are other pre-trained models available that will return 
# > 2,000 dimension vectors (e.g., GPTNeo, xlm-mlm-en-2048, alberta-xxlarge-v1/2)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case = False)

# We'll also load up the model for spaCy at this time
nlp = spacy.load('en_core_web_lg')

## Data Acquisition, Preparation, and Loading
This section of the jupyter notebook illustrates the process of getting the data used for the subsequent examples and getting it loaded into Stata.  I'll make a copy of the dataset available in Stata format from the GitHub repo for anyone interested in skipping over this section.

The cell below is used to acquire all of the data used in the examples.  There is a single bit of the preparation pipeline (e.g., removing elements of the JSON object that only have a single value and break the pandas parser).

In [2]:
import json
import requests
import pandas as pd

# List of the URLs containing the data set
files = [ "https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/test.jsonl",
"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/train.jsonl",
"https://raw.githubusercontent.com/DenisPeskov/2020_acl_diplomacy/master/data/validation.jsonl" ]

# Function to handle dropping "variables" that prevent pandas from
# reading the JSON object
def normalizer(obs: dict, drop: list) -> pd.DataFrame:
    # Loop over the "variables" to drop
    for i in drop:
        # Remove it from the dictionary object
        del obs[i]
    # Returns the Pandas dataframe
    return pd.DataFrame.from_dict(obs)

# Object to store each of the data frames
data = []

# Loop over each of the files from the URLs above
for i in files:
    # Get the raw content from the GitHub location
    content = requests.get(i).content
    # Split the JSON objects by new lines, pass each individual line to json.loads,
    # pass the json.loads value to the normalizer function, and
    # append the result to the data object defined outside of the loop
    [ data.append(normalizer(json.loads(i), [ "players", "game_id" ])) for i in content.decode('utf-8').splitlines() ]


In the next cell, we're going to append all of the `DataFrame` objects that were just acquired.  We'll also define some mappings that we will use to convert the values of the columns/variables to numeric values and will recast other values to numeric formats to ensure they get loaded into Stata without any issues.  We'll also create one variable that identifies if the receiver of the message correctly identified whether it was truthful or not that will have 0/1 values.  We'll treat cases where there was no information about the truthfulness of the message as incorrect in cases where there was a response from one party and not the other.

In [3]:
# Define a couple data mappings for later use
labmap = { True: 1, False: 0, 'NOANNOTATION': -1 }
cntrys = { 'austria': 0, 'england': 1, 'france': 2, 'germany': 3, 'italy': 4, 'russia': 5, 'turkey': 6 }
seasons = { 'Fall': 0, 'Winter': 1, 'Spring': 2 }

# Combine each of the data frames for each game into one large dataset
dataset = pd.concat(data, axis = 0, join = 'inner', ignore_index = True, sort = False)

# Change data types of a couple columns
dataset['game_score'] = dataset['game_score'].astype('int')
dataset['sender_labels'] = dataset['sender_labels'].astype('int')
dataset['absolute_message_index'] = dataset['absolute_message_index'].astype('int')
dataset['relative_message_index'] = dataset['relative_message_index'].astype('int')
dataset['game_score_delta'] = dataset['game_score_delta'].astype('int')
dataset['years'] = dataset['years'].astype('int')

# Recodes text labels to numeric values
dataset.replace({'receiver_labels': labmap, 'speakers': cntrys, 'receivers': cntrys, 'seasons': seasons}, inplace = True)

# Creates an indicator for when the receiver correctly identifies the truthfulness of the message
dataset['correct'] = (dataset['sender_labels'] == dataset['receiver_labels']).astype('int')

# Parse the message into NLP tokens and store the spaCy object
dataset['token'] = dataset['messages'].apply(lambda x: nlp(x))

# Count the number of tokens in each token observation
dataset['tokens'] = dataset['token'].apply(lambda x: len(x))

# Show a preview of the data to this point
dataset.head()


Unnamed: 0,messages,sender_labels,receiver_labels,speakers,receivers,absolute_message_index,relative_message_index,seasons,years,game_score,game_score_delta,correct,token,tokens
0,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,0,1,"(Hi, Italy, !, Just, opening, up, communicatio...",31
1,"Well....that's a great question, and a lot of ...",1,1,4,3,132,1,2,1901,3,0,1,"(Well, ...., that, 's, a, great, question, ,, ...",30
2,"Well, if you want to attack France in the Medi...",1,0,3,4,138,2,2,1901,3,0,0,"(Well, ,, if, you, want, to, attack, France, i...",47
3,"Hello, I'm just asking about your move to Tyro...",1,1,3,4,207,3,1,1901,5,1,1,"(Hello, ,, I, 'm, just, asking, about, your, m...",21
4,Totally understandable - but did you notice th...,1,0,4,3,221,4,1,1901,4,-1,0,"(Totally, understandable, -, but, did, you, no...",23


In this cell Stata gets initialized.  Although I used `pip` to install the `stata_setup` module on my machine, it doesn't seem to persist between sessions, so this seemed like the most prudent route to take to illustrate things for others.  I am using a 2013 MacBook Pro running OSX.  If you are also running OSX and have Stata installed in the default location, this should run without any issues on your machine.  If not, you'll need to replace `'/Applications/Stata/utilities'` below with the comparable path on your system.

In [4]:
# Imports the sys module to point Python to the location of the pystata module
import sys

# It seems configuration doesn't always persist, so you should do this for the sake of 
# defensive programming practices. This is the location on my MacBook Pro.
# sys.path.append('/Applications/Stata/utilities')

# Given the memory requirements of the notebook, I'm going to run this on my Linux server instead:
sys.path.append('/usr/local/stata17/utilities/')

# Imports the pystata module
import pystata

In [5]:
%%capture --no-stdout --no-display
# Initialize Stata environment
pystata.config.init('mp')


  ___  ____  ____  ____  ____ ©
 /__    /   ____/   /   ____/      17.0
___/   /   /___/   /   /___/       MP—Parallel Edition

 Statistics and Data Science       Copyright 1985-2021 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-STATA-PC        https://www.stata.com
                                   979-696-4600        stata@stata.com

Stata license: Single-user 12-core  perpetual
Serial number: 
  Licensed to: Billy Buchanan
               SAG Corporation

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. More than 2 billion observations are allowed; see help obs_advice.
      3. Maximum number of variables is set to 5,000; see help set_maxvar.


Initially, I had written some of this using the Stata Function Interface API (e.g., the API that you use when calling Python from inside of Stata) and then realized that I needed to adjust it for the new environment.  One of the nice things about this was the `stata.pdataframe_to_data()` function that handled loading our big `DataFrame` object into a dataset in Stata.  I'll also print out the first few rows of data so we can check to make sure things loaded without issue.

In [6]:
# Push the data frame into Stata
pystata.stata.pdataframe_to_data(dataset, force = True)

# List the first few records (similar to calling dataset.head())
pystata.stata.run("li in 1/5")


     +------------------------------------------------------------------------+
  1. | messages                                                               |
     | Hi Italy! Just opening up communication, and I want to know what som.. |
     |------------------------------------------------------------------------|
     | sender~s  | recei~ls  | speakers  | recei~rs  | absolu~x  |  relati~x  |
     |        1  |        1  |        3  |        4  |       87  |         0  |
     |------------------------------------------------------------------------|
     |   seasons   |   years   |   game_s~e    |   game_s~a    |   correct    |
     |         2   |    1901   |          3    |          0    |         1    |
     |------------------------------------------------------------------------|
     | token                                                                  |
     | Hi Italy! Just opening up communication, and I want to know what som.. |
     |---------------------------------

We can see that nothing in the output above is labeled. So, we can use some pseudo-metaprogramming to generate the strings that will construct the Stata commands we want to create to define value labels and apply the value labels to the data.  

_Pro-tip: If you create value labels with the same name as the variable they will be assigned to you can easily assign them all with a single loop and fewer keystrokes._

In [7]:
# Create mapping of value labels to variables
vallabmap = {   'sender_labels' : labmap, 'receiver_labels': labmap, 'seasons': seasons,
                'speakers': cntrys, 'receivers': cntrys }

# Loop over the dictionary containing the value label mappings
for varnm, vallabs in vallabmap.items():

    # Start the string that defines the value labels
    valueLabel = 'la def ' + varnm + ' '
    
    # Defines string used to apply the value labels to the variables
    vallabApply = 'la val ' + varnm + ' ' + varnm

    # Now iterate over the value label mappings
    # Remember that the keys are all of the strings in these dictionaries and the
    # values are the numeric encodings
    for label, value in vallabs.items():

        # Add the value label mappings for the value labels
        valueLabel = valueLabel + str(value) + ' "' + str(label) + '" '
    
    # Adds modify to the end of the label definition just for safety
    valueLabel = valueLabel + ", modify"
    
    # Now this string can be used to define the value labels in Stata
    pystata.stata.run(valueLabel, echo = True)
    
    # And then apply the value labels to the variables
    pystata.stata.run(vallabApply, echo = True)

# Now we'll check on the first few records again to make sure the value labels are showing up
pystata.stata.run("li in 1/5")

. la def sender_labels 1 "True" 0 "False" -1 "NOANNOTATION" , modify
. la val sender_labels sender_labels
. la def receiver_labels 1 "True" 0 "False" -1 "NOANNOTATION" , modify
. la val receiver_labels receiver_labels
. la def seasons 0 "Fall" 1 "Winter" 2 "Spring" , modify
. la val seasons seasons
. la def speakers 0 "austria" 1 "england" 2 "france" 3 "germany" 4 "italy" 5 "r
> ussia" 6 "turkey" , modify
. la val speakers speakers
. la def receivers 0 "austria" 1 "england" 2 "france" 3 "germany" 4 "italy" 5 "
> russia" 6 "turkey" , modify
. la val receivers receivers

     +------------------------------------------------------------------------+
  1. | messages                                                               |
     | Hi Italy! Just opening up communication, and I want to know what som.. |
     |------------------------------------------------------------------------|
     | sender~s  | recei~ls  | speakers  | recei~rs  | absolu~x  |  relati~x  |
     |     True  |     T

Now that the values are labeled and the data are available in Stata, I'll save this version of the data set (remember I mentioned that I would make the dataset available for others who didn't want to deal with this part of things).  Once the data are saved, we'll start working with it to get word/vector embeddings.

In [8]:
# Now save the data set where the local repository is located.
pystata.stata.run("save ~/Desktop/Programs/JavaScript/stataConference2021/nlpdata.dta, replace")

# And to reduce memory overhead we'll clear these data from memory
pystata.stata.run("clear")

file ~/Desktop/Programs/JavaScript/stataConference2021/nlpdata.dta saved


## Getting Vector Embeddings
There are myriad libraries and models from which you can get vector embeddings.  `spaCy` is a fairly high performing library, so I'll use it for the majority of the examples here.  Then I'll provide an example of getting vector embeddings from `Transformer` based models using an example shared by Rostyslav Neskorozhenyi (https://colab.research.google.com/drive/1N7HELWImK9xCYheyozVP3C_McbiRo1nb#scrollTo=XxnDjaFYMolw).

## A Shift in the Examples
The `pystata` module doesn't currently include accessor methods (e.g., methods used to access data that exists in Stata, but not in Python), so the examples below will basically show the Python side of things (where possible).  Loading a portion of the data into Stata's memory isn't possible, but since we already have the data set defined in memory that Python can access, we can always join the vectors to the Pandas dataframe and load the entire data frame into memory.  In the presentation slides, I'll try to show how to do this all from within Stata, so you'll be able to see both approaches at work.

In [9]:
# spaCy returns vectors with 300 dimensions.  
# Since we will be passing the full message into spaCy's primary function, the default value that will be returned 
# is the vector for the "document" or the full message
example = nlp(dataset['messages'].values[0])

# This will display the vector below
example.vector

# More importantly, the returned object is actually an array/list of the individual tokens
for i in example:
    # Prints out the individual words and their associated vectors
    print('\n\nThe token is : {} and has the vector: {}'.format(i.text, i.vector))




The token is : Hi and has the vector: [ 2.8796e-02  4.1306e-01 -4.6690e-01 -7.8175e-02  3.7058e-01  1.2867e-01
  4.7714e-01 -9.2372e-01 -6.7789e-02  6.2381e-01 -2.9670e-01 -4.4328e-01
 -8.4224e-02 -3.1270e-01 -1.8197e-01  3.2360e-01 -7.7793e-02  1.3314e+00
 -1.5676e-01  1.2857e-01  4.3474e-02  7.9883e-02  1.1311e-02  1.4428e-01
  1.7653e-01 -2.2321e-01 -4.2480e-02  2.1707e-03 -4.7640e-02  3.8532e-01
 -5.9911e-02  1.8338e-01 -1.9145e-01 -1.3184e-01 -2.2440e-01 -3.4313e-01
 -1.9527e-01  2.0129e-01 -2.8915e-01 -2.0750e-01  1.9230e-01 -4.3318e-01
 -3.5914e-02 -1.7492e-01  5.1793e-03  4.1998e-01  1.0637e-01  1.6559e-01
  2.8926e-01  2.1868e-01 -7.7643e-02  6.1037e-01 -1.7432e-02 -2.9676e-03
 -3.0160e-01 -1.1983e-02 -9.4832e-02  9.5424e-02 -3.7713e-01 -1.1239e-01
 -7.8399e-01 -1.7278e-01  4.9498e-02 -2.0969e-01  3.1968e-01 -3.0732e-01
  1.0192e-01  2.0580e-01  3.2505e-01 -2.5291e-01 -9.3692e-02  5.2662e-03
  4.5696e-01 -1.1763e-01  2.6193e-01  3.2966e-02 -4.7883e-03  4.7738e-01
 -3.3887e-0

In [10]:
# Knowing the dimension of the returned vector in advance can make it a bit easier to define variable names
# but we can still accomplish the same goal if we can get the dimension of the returned vector
for i in range(0, len(example.vector)):
    print('docvec' + str(i + 1) + " = " + str(example.vector[i]))


docvec1 = -0.0133300405
docvec2 = 0.24801837
docvec3 = -0.24180952
docvec4 = -0.06054988
docvec5 = 0.10447392
docvec6 = 0.027609779
docvec7 = -0.018385166
docvec8 = -0.07656802
docvec9 = 0.056135613
docvec10 = 2.201388
docvec11 = -0.27096534
docvec12 = -0.027924638
docvec13 = 0.07406946
docvec14 = -0.06148023
docvec15 = -0.067046374
docvec16 = -0.051080007
docvec17 = -0.16768312
docvec18 = 1.2792045
docvec19 = -0.2881262
docvec20 = -0.060520634
docvec21 = 0.11248713
docvec22 = -0.012621745
docvec23 = -0.050633833
docvec24 = -0.09533046
docvec25 = 0.027415907
docvec26 = 0.022421826
docvec27 = -0.03796288
docvec28 = -0.052168544
docvec29 = 0.031498346
docvec30 = -0.075751744
docvec31 = -0.016572738
docvec32 = 0.11701402
docvec33 = -0.079945914
docvec34 = 0.077758156
docvec35 = 0.06744284
docvec36 = -0.032226004
docvec37 = -0.028253572
docvec38 = 0.056850847
docvec39 = -0.07275157
docvec40 = -0.063669845
docvec41 = -0.06467036
docvec42 = 0.041804124
docvec43 = -0.057499718
docvec44 = -0.0

In [11]:
# We can do the same for each word as well.  The challenge at that point is to ensure that we include an ID 
# in the data structure used in order to merge it later:
retobject = []

# Gets all of the messages
for example in dataset['token'].tolist():

    # Loop through the words in the message
    for word in example:

        # Define a dictionary that will store the record
        record = dict()

        # Populate a couple identifiers
        record['messages'] = str(example.text)
        record['token'] = str(word.text)

        # If we want to include the "document" vector (the vector representing the collection 
        # of words in this observation) we can add that to the returned object as well. Since 
        # the document vector will have the same dimensionality as the word embeddings we can 
        # do all of this in a single loop
        for dim in range(0, len(word.vector)):

            # Construct the word embedding variable name
            varnm = "wembed" + str(dim + 1)

            # Construct the document embedding variable name
            docvarnm = 'docembed' + str(dim + 1)

            # And assign each dimension of the vector to it's own variable
            record[varnm] = word.vector[dim]
            record[docvarnm] = example.vector[dim]

        # Now insert this record into the storage object
        retobject.append(record)
    
# Now we'll store this in a second DataFrame object
embeddings = pd.DataFrame(retobject)

# We can see how the first couple of records look now as a sanity check 
embeddings.head(3)


Unnamed: 0,messages,token,wembed1,docembed1,wembed2,docembed2,wembed3,docembed3,wembed4,docembed4,...,wembed296,docembed296,wembed297,docembed297,wembed298,docembed298,wembed299,docembed299,wembed300,docembed300
0,"Hi Italy! Just opening up communication, and I...",Hi,0.028796,-0.01333,0.41306,0.248018,-0.4669,-0.24181,-0.078175,-0.06055,...,-0.11668,-0.044448,0.20476,-0.028502,-0.053029,-0.043654,-0.33494,0.032481,0.36282,0.160504
1,"Hi Italy! Just opening up communication, and I...",Italy,-0.21052,-0.01333,0.18476,0.248018,-0.005624,-0.24181,-0.15168,-0.06055,...,-0.15561,-0.044448,0.13197,-0.028502,0.16191,-0.043654,0.078475,0.032481,0.72517,0.160504
2,"Hi Italy! Just opening up communication, and I...",!,-0.26554,-0.01333,0.33531,0.248018,0.2186,-0.24181,-0.301,-0.06055,...,-0.255,-0.044448,0.15195,-0.028502,-0.17859,-0.043654,-0.062878,0.032481,0.16232,0.160504


In [12]:
# To expand each record based on the number of tokens in the message, we will use the object returned by spaCy.
dataset = dataset.explode('token')

# This should prevent issues with the merge in the cell below.
dataset['token'] = dataset['token'].astype('str')

# Then add ID's for each token (these values should also use zero-based indexing)
dataset['tokenid'] = dataset.groupby('messages').cumcount()

# And take a peek at the data to make sure things look correct
dataset.head()

Unnamed: 0,messages,sender_labels,receiver_labels,speakers,receivers,absolute_message_index,relative_message_index,seasons,years,game_score,game_score_delta,correct,token,tokens,tokenid
0,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,0,1,Hi,31,0
0,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,0,1,Italy,31,1
0,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,0,1,!,31,2
0,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,0,1,Just,31,3
0,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,0,1,opening,31,4


In [13]:
# Now we can join the embeddings to the rest of the dataset 
withembeddings = dataset.merge(embeddings, how = 'inner', on = ['messages', 'token'], copy = False)

# This will show the first few records from the merged result
withembeddings.head()

Unnamed: 0,messages,sender_labels,receiver_labels,speakers,receivers,absolute_message_index,relative_message_index,seasons,years,game_score,...,wembed296,docembed296,wembed297,docembed297,wembed298,docembed298,wembed299,docembed299,wembed300,docembed300
0,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,...,-0.11668,-0.044448,0.20476,-0.028502,-0.053029,-0.043654,-0.33494,0.032481,0.36282,0.160504
1,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,...,-0.15561,-0.044448,0.13197,-0.028502,0.16191,-0.043654,0.078475,0.032481,0.72517,0.160504
2,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,...,-0.255,-0.044448,0.15195,-0.028502,-0.17859,-0.043654,-0.062878,0.032481,0.16232,0.160504
3,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,...,-0.015913,-0.044448,-0.24901,-0.028502,-0.029137,-0.043654,0.062257,0.032481,0.090782,0.160504
4,"Hi Italy! Just opening up communication, and I...",1,1,3,4,87,0,2,1901,3,...,0.15636,-0.044448,-0.25689,-0.028502,0.52901,-0.043654,-0.023808,0.032481,0.20809,0.160504


In [14]:
# Now we can create a new and improved dataset for Stata
pystata.stata.run("clear")

# Push the new DataFrame into Stata:
pystata.stata.pdataframe_to_data(withembeddings, force = True)

In [15]:
# Now apply the value labels to the same variables in this new dataset

# Loop over the dictionary containing the value label mappings
for varnm, vallabs in vallabmap.items():

    # Start the string that defines the value labels
    valueLabel = 'la def ' + varnm + ' '
    
    # Defines string used to apply the value labels to the variables
    vallabApply = 'la val ' + varnm + ' ' + varnm

    # Now iterate over the value label mappings
    # Remember that the keys are all of the strings in these dictionaries and the
    # values are the numeric encodings
    for label, value in vallabs.items():

        # Add the value label mappings for the value labels
        valueLabel = valueLabel + str(value) + ' "' + str(label) + '" '
    
    # Adds modify to the end of the label definition just for safety
    valueLabel = valueLabel + ", modify"
    
    # Now this string can be used to define the value labels in Stata
    pystata.stata.run(valueLabel, echo = True)
    
    # And then apply the value labels to the variables
    pystata.stata.run(vallabApply, echo = True)

# Now we'll check on the first few records again to make sure the value labels are showing up
pystata.stata.run("li in 1/5")

. la def sender_labels 1 "True" 0 "False" -1 "NOANNOTATION" , modify
. la val sender_labels sender_labels
. la def receiver_labels 1 "True" 0 "False" -1 "NOANNOTATION" , modify
. la val receiver_labels receiver_labels
. la def seasons 0 "Fall" 1 "Winter" 2 "Spring" , modify
. la val seasons seasons
. la def speakers 0 "austria" 1 "england" 2 "france" 3 "germany" 4 "italy" 5 "r
> ussia" 6 "turkey" , modify
. la val speakers speakers
. la def receivers 0 "austria" 1 "england" 2 "france" 3 "germany" 4 "italy" 5 "
> russia" 6 "turkey" , modify
. la val receivers receivers

     +------------------------------------------------------------------------+
  1. | messages                                                               |
     | Hi Italy! Just opening up communication, and I want to know what som.. |
     |------------------------------------------------------------------------|
     | sender~s  | recei~ls  | speakers  | recei~rs  | absolu~x  |  relati~x  |
     |     True  |     T

     |   .42693999  |  .10121305  |     -1.2049  |  -.6367169  |     -.1437   |
     |---------------------------------------------------------+--------------|
     |  docem~107  |  wemb~108  |  docemb~108  |   wembed109   |  docem~109   |
     |  .16658485  |   -.11751  |  -.00003162  |  -.41370001   |  .00481618   |
     |------------------------------------------------------------------------|
     |   wembed110  |  docem~110  |  wembed111  |  docem~111  |   wembed112   |
     |  -.38361999  |  .02938706  |     .20676  |  .05801778  |  -.55361003   |
     |--------------------------------------------------------+---------------|
     |  docem~112  |  wembed113  |  docem~113  |   wembed114  |  docemb~114   |
     |  -.3026925  |    .084365  |  .16927974  |  -.70433003  |  -.08528137   |
     |------------------------------------------------------------------------|
     |  wembed115 | doce~115 | wembed116 | docemb~116 | wembed117 | doce~117  |
     |    -.23252 | .0609485 |   -.21264

In [16]:
# Now create an ID variable for each message
pystata.stata.run('encode messages, gen(msgid)')

# And Save the new dataset
pystata.stata.run("save ~/Desktop/Programs/JavaScript/stataConference2021/dataWithEmbeddings.dta, replace")

# I'll largely stick to the process that Neskorozhenyi used, and will try to be clear when identifying deviations
# First let's get the vector of messages that we would want to generate embeddings for.  
# Since this will result in a much larger set of vectors we'll subset the data
messages = dataset['messages'].unique()[:500]

# And then clean up some of the variables to avoid tieing up memory
del withembeddings
del embeddings
del dataset

file ~/Desktop/Programs/JavaScript/stataConference2021/dataWithEmbeddings.dta
    saved


In [17]:
# Now you can fit a model to the data if you'd like.  Just as an example, we can fit a mixed effects model 
# with random intercepts on each word within the message, using the word and document (i.e., message) embeddings
# to identify whether the receiver of the message correctly identified whether the message was true or false on a 
# subset of roughly 19,000 records in the dataset.
pystata.stata.run('melogit correct i.season i.year i.speakers game_score *embed* if msgid <= 250 || msgid: || tokenid:, iterate(1)')

note: docembed243 omitted because of collinearity.
note: docembed244 omitted because of collinearity.
note: docembed245 omitted because of collinearity.
note: docembed246 omitted because of collinearity.
note: docembed247 omitted because of collinearity.
note: docembed248 omitted because of collinearity.
note: docembed249 omitted because of collinearity.
note: docembed250 omitted because of collinearity.
note: docembed251 omitted because of collinearity.
note: docembed252 omitted because of collinearity.
note: docembed253 omitted because of collinearity.
note: docembed254 omitted because of collinearity.
note: docembed255 omitted because of collinearity.
note: docembed256 omitted because of collinearity.
note: docembed257 omitted because of collinearity.
note: docembed258 omitted because of collinearity.
note: docembed259 omitted because of collinearity.
note: docembed260 omitted because of collinearity.
note: docembed261 omitted because of collinearity.
note: docembed262 omitted becau

Iteration 90:  log likelihood = -774.49766  (not concave)
Iteration 91:  log likelihood = -774.49745  (not concave)
Iteration 92:  log likelihood = -774.49725  (not concave)
Iteration 93:  log likelihood = -774.49704  (not concave)
Iteration 94:  log likelihood = -774.49683  (not concave)
Iteration 95:  log likelihood = -774.49663  (not concave)
Iteration 96:  log likelihood = -774.49643  (not concave)
Iteration 97:  log likelihood = -774.49623  (not concave)
Iteration 98:  log likelihood = -774.49603  (not concave)
Iteration 99:  log likelihood = -774.49583  (not concave)
Iteration 100: log likelihood = -774.49564  (not concave)
Iteration 101: log likelihood = -774.49544  (not concave)
Iteration 102: log likelihood = -774.49525  (not concave)
Iteration 103: log likelihood = -774.49506  (not concave)
Iteration 104: log likelihood = -774.49487  (not concave)
Iteration 105: log likelihood = -774.49468  (not concave)
Iteration 106: log likelihood = -774.49449  (not concave)
Iteration 107:

Iteration 232: log likelihood = -774.47706  (not concave)
Iteration 233: log likelihood = -774.47695  (not concave)
Iteration 234: log likelihood = -774.47684  (not concave)
Iteration 235: log likelihood = -774.47673  (not concave)
Iteration 236: log likelihood = -774.47663  (not concave)
Iteration 237: log likelihood = -774.47652  (not concave)
Iteration 238: log likelihood = -774.47641  (not concave)
Iteration 239: log likelihood = -774.47631  (not concave)
Iteration 240: log likelihood =  -774.4762  (not concave)
Iteration 241: log likelihood = -774.47609  (not concave)
Iteration 242: log likelihood = -774.47599  (not concave)
Iteration 243: log likelihood = -774.47588  (not concave)
Iteration 244: log likelihood = -774.47578  (not concave)
Iteration 245: log likelihood = -774.47567  (not concave)
Iteration 246: log likelihood = -774.47557  (not concave)
Iteration 247: log likelihood = -774.47546  (not concave)
Iteration 248: log likelihood = -774.47536  (not concave)
Iteration 249:

  docembed32 |   1868.396   8719.756     0.21   0.830    -15222.01     18958.8
    wembed33 |  -.0162748   6.034292    -0.00   0.998    -11.84327    11.81072
  docembed33 |  -2084.139   11777.89    -0.18   0.860    -25168.37     21000.1
    wembed34 |   .0392899   6.244596     0.01   0.995    -12.19989    12.27847
  docembed34 |   1262.302   3982.757     0.32   0.751    -6543.759    9068.364
    wembed35 |   -.008911   5.536078    -0.00   0.999    -10.85942     10.8416
  docembed35 |  -493.1898   1266.576    -0.39   0.697    -2975.633    1989.254
    wembed36 |  -.0068675   5.794947    -0.00   0.999    -11.36475    11.35102
  docembed36 |   1609.578    12302.9     0.13   0.896    -22503.66    25722.82
    wembed37 |  -.0041773   5.476479    -0.00   0.999    -10.73788    10.72952
  docembed37 |   942.2059   2118.783     0.44   0.657    -3210.533    5094.945
    wembed38 |  -.0038589   6.530249    -0.00   1.000    -12.80291    12.79519
  docembed38 |  -185.7093   829.3491    -0.22   0.82

    wembed97 |   -.004944   6.042564    -0.00   0.999    -11.84815    11.83826
  docembed97 |  -1615.246   5911.003    -0.27   0.785     -13200.6    9970.107
    wembed98 |   .0125464   5.601449     0.00   0.998    -10.96609    10.99119
  docembed98 |   1265.877   3665.792     0.35   0.730    -5918.943    8450.697
    wembed99 |   .0098495   5.982656     0.00   0.999    -11.71594    11.73564
  docembed99 |  -180.8633    3287.19    -0.06   0.956    -6623.637     6261.91
   wembed100 |   -.098468   5.721345    -0.02   0.986     -11.3121    11.11516
 docembed100 |   1467.253   10137.67     0.14   0.885    -18402.21    21336.71
   wembed101 |   .0046174   6.165023     0.00   0.999    -12.07861    12.08784
 docembed101 |  -391.2664   1187.783    -0.33   0.742    -2719.277    1936.745
   wembed102 |  -.0250235   5.634783    -0.00   0.996      -11.069    11.01895
 docembed102 |  -207.6704   1881.347    -0.11   0.912    -3895.042    3479.701
   wembed103 |  -.0094615   6.253007    -0.00   0.99

   wembed190 |  -.0053496   5.568603    -0.00   0.999    -10.91961    10.90891
 docembed190 |  -46.60279   7163.824    -0.01   0.995    -14087.44    13994.23
   wembed191 |   .0293493   5.588716     0.01   0.996    -10.92433    10.98303
 docembed191 |   388.6919   2292.857     0.17   0.865    -4105.224    4882.608
   wembed192 |   .0039036   6.343429     0.00   1.000    -12.42899     12.4368
 docembed192 |    609.115   6500.958     0.09   0.925    -12132.53    13350.76
   wembed193 |   .0010979   5.739761     0.00   1.000    -11.24863    11.25082
 docembed193 |  -71.73923   3865.349    -0.02   0.985    -7647.684    7504.205
   wembed194 |   .0318378   6.150613     0.01   0.996    -12.02314    12.08682
 docembed194 |   1306.816   10159.65     0.13   0.898    -18605.73    21219.36
   wembed195 |   -.008458   5.482816    -0.00   0.999    -10.75458    10.73766
 docembed195 |   268.1958   1498.856     0.18   0.858    -2669.508    3205.899
   wembed196 |  -.0006016   6.214572    -0.00   1.00

   wembed264 |  -.0138947   6.092843    -0.00   0.998    -11.95565    11.92786
 docembed264 |          0  (omitted)
   wembed265 |   .0189366   6.029263     0.00   0.997     -11.7982    11.83607
 docembed265 |          0  (omitted)
   wembed266 |  -.0075513   6.365059    -0.00   0.999    -12.48284    12.46773
 docembed266 |          0  (omitted)
   wembed267 |   .0354106   5.999754     0.01   0.995    -11.72389    11.79471
 docembed267 |          0  (omitted)
   wembed268 |  -.0311382   5.994358    -0.01   0.996    -11.77986    11.71759
 docembed268 |          0  (omitted)
   wembed269 |    .001974    5.52543     0.00   1.000    -10.82767    10.83162
 docembed269 |          0  (omitted)
   wembed270 |   .0071565   6.230536     0.00   0.999    -12.20447    12.21878
 docembed270 |          0  (omitted)
   wembed271 |  -.0178636   5.510917    -0.00   0.997    -10.81906    10.78334
 docembed271 |          0  (omitted)
   wembed272 |   .0168233   6.326642     0.00   0.998    -12.38317    12

In [18]:
# Or we can use lasso to select an optimized subset of variables for us
# This command seemed to hang 
# pystata.stata.run('lasso logit correct i.season i.year i.speakers game_score *embed*, sel(cv, folds(2)) grid(1)')

## Using Transformers
The example below is largely derived from a Google Colab notebook put together by [Rostyslav Neskorozhenyi](https://linkedin.com/in/slanj) that you can find [here](https://colab.research.google.com/drive/1N7HELWImK9xCYheyozVP3C_McbiRo1nb#scrollTo=XxnDjaFYMolw).  The very first cell related to transformers isn't necessary to execute since we handled that at the very beginning of this notebook.  
That said, there are some differences in the steps that need to be taken to get the embeddings out of these types of models.  The first major difference is that unlike spaCy which will return a vector with a preset number of dimensions, the results from transformers-based models will be returned as a tensor and the dimensionality of it will largely depend on the specific pre-trained model that you are using.  Another major difference is that many of these models will return so many dimensions in the embedding that you will need Stata SE or Stata MP in many cases to leverage them within Stata (this is due to variable limits in Stata IC).  Lastly, you'll need to get a bit of familiarity with tensors and some of the methods used in torch (or tensorflow) to perform data manipulation tasks in order to extract the vector embedding.

In [19]:
# Next we'll define a few containers used to store information that will be useful/helpful for us
input_ids = []
attention_masks = []
tokenized_texts = []

# Loop over the individual messages
for sent in messages:
    
    # Returns an object that is parsed using the BertTokenizer that was initialized at the start of this notebook
    encoded_dict = tokenizer.encode_plus(
        sent, 
        add_special_tokens = True,
        truncation = True,
        padding = 'max_length',
        return_tensors = 'pt'
    )
    
    # The [CLS] and [SEP] tokens are used to delineate the boundaries of the message
    marked_msg = '[CLS] ' + sent + ' [SEP]'
    
    # Adds the text to the list storing the tokens
    tokenized_texts.append(tokenizer.tokenize(marked_msg))
    
    # And this stores the ID values assigned by the BertTokenizer to a separate list
    input_ids.append(encoded_dict['input_ids'])
    
# This collapses the tensor into a vector (1D list) for each message (i.e., a list of lists)
input_ids = torch.cat(input_ids, dim = 0)

# This will show the original message along with the IDs for each word/token
print('Original: ', messages[0])
print('Token IDs: ', input_ids[0])

# You'll notice a lot of 0s below.  This is because the messages need to be normalized in someway.  
# Setting the padding parameter to 'max_length' means up to 512 tokens can be passed per message.

Original:  Hi Italy! Just opening up communication, and I want to know what some of your initial thoughts on the game are and if/how we can work together
Token IDs:  tensor([ 101, 8790, 2413,  106, 2066, 2280, 1146, 4909,  117, 1105,  146, 1328,
        1106, 1221, 1184, 1199, 1104, 1240, 3288, 3578, 1113, 1103, 1342, 1132,
        1105, 1191,  120, 1293, 1195, 1169, 1250, 1487,  102,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,   

In [20]:
# This generates a tensor with the same dimensions as the input_ids list of lists
segment_ids = torch.ones_like(input_ids)

# This will show the shape of the tensor 
segment_ids.shape

torch.Size([500, 512])

In [21]:
# This defines the model object and the specific version of pre-trained model we want to use.
# If all the text was lower cased we could use bert-based-uncased which is a bit smaller
model = BertModel.from_pretrained('bert-base-cased', output_hidden_states = True, )

# This will show what the model architecture looks like if you run this in a Python console.
model.eval();

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [22]:
# Tells the interpreter not to compute gradients
with torch.no_grad():
    
    # Gets the output of the model given the ID values used to identify each of the tokens in each message
    outputs = model(input_ids, segment_ids)
    
    # The specific hidden state that we are interested in is returned in the third element of outputs (remember 0 based indexing!!!)
    hidden_states = outputs[2]
    
# Then we can display a bit of metadata about the model, tokens, batches, and number of parameters in the hidden state    
print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")
print ("Number of batches:", len(hidden_states[0]))
print ("Number of tokens:", len(hidden_states[0][0]))
print ("Number of hidden units:", len(hidden_states[0][0][0]))

Number of layers: 13   (initial embeddings + 12 BERT layers)
Number of batches: 500
Number of tokens: 512
Number of hidden units: 768


In [23]:
# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)

print(token_embeddings.size())

# Swap dimensions, so we get tensors in format: [sentence, tokens, hidden layes, features]
token_embeddings = token_embeddings.permute(1,2,0,3)

print(token_embeddings.size())

# we will use last four hidden layers to create each word embedding
processed_embeddings = token_embeddings[:, :, 9:, :]
print(processed_embeddings.shape)

# Concatenate four layers for each token to create embeddings
embeddings = torch.reshape(processed_embeddings, (4, 48, -1))
print(embeddings.shape)

torch.Size([13, 500, 512, 768])
torch.Size([500, 512, 13, 768])
torch.Size([500, 512, 4, 768])
torch.Size([4, 48, 4096000])


In [24]:
# Now show what the embeddings look like:
print(embeddings)

tensor([[[-3.1412e-01, -1.2492e+00, -9.6004e-01,  ...,  1.1056e+00,
           3.9428e-01, -2.2991e-01],
         [-2.9698e-01, -9.7511e-01, -1.1061e+00,  ...,  8.2798e-01,
          -2.6714e-01,  3.0807e-01],
         [ 1.9186e+00, -4.0136e-02, -1.1576e+00,  ..., -8.1462e-01,
           5.1112e-02,  6.1184e-02],
         ...,
         [ 1.5882e-02, -1.6390e+00,  3.4153e-01,  ...,  1.0444e+00,
           1.2842e+00,  7.6466e-02],
         [-6.5699e-01, -2.5545e-01, -3.2760e-01,  ...,  1.0590e+00,
           3.0184e-02,  3.7472e-01],
         [ 2.0510e+00,  4.0924e-02, -1.3794e+00,  ..., -6.8523e-01,
           1.4295e-01,  4.2663e-02]],

        [[-3.7808e-01, -1.2680e+00, -1.1480e+00,  ...,  1.1544e+00,
           5.4157e-01, -1.3919e-01],
         [-3.8106e-02, -1.1864e+00, -1.0594e+00,  ...,  9.0312e-01,
          -3.1017e-01,  2.7449e-01],
         [ 1.9398e+00, -7.4951e-02, -1.1183e+00,  ..., -8.1127e-01,
           4.2041e-02,  1.2617e-02],
         ...,
         [ 6.0765e-02, -1