In [2]:
# Libraries and data import (needed for later code)
import pandas as pd
import numpy as np
import seaborn as sns 
from matplotlib import pyplot as plt
import matplotlib as mpl
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

data_dir = "/Users/jonathanzhu/Documents/data/"

text_file_name = "osdg-community-data-v2023-01-01.csv"
text_df = pd.read_csv(data_dir + text_file_name,sep = "\t",  quotechar='"')
col_names = text_df.columns.values[0].split('\t')
text_df[col_names] = text_df[text_df.columns.values[0]].apply(lambda x: pd.Series(str(x).split("\t")))
text_df = text_df.astype({'sdg':int, 'labels_negative': int, 'labels_positive':int, 'agreement': float}, copy=True)
text_df = text_df.query("sdg == 8 or sdg == 16 or sdg == 4").copy()



<h1>3. Document Embedding</h1>

<b>NOTE: The tensorflow package is very finicky: to our knowledge, it does not work well with Python 3.11.3 or on less powerful computers. However, it is the only package we know of for document embedding, and document embedding is nonetheless an important part of natural language processing. We plan to work around this problem in later editions of the book; for now, any of the code in this section can be skipped, since later sections do not depend on it.</b>

In the previous section, we represented variable-length texts as fixed-length numeric vectors. The approach we have used so far is the traditional Bag of Words (BoW), which tokenizes a text into words (tokens), ignoring the order of the tokens but optionally preserving their counts. The resulting vectors are high-dimensional and very sparse, which may lead to overfitting and high time complexity.
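To make the Bag of Words idea concrete, here is a minimal sketch using only the Python standard library. The helper name `bag_of_words` and the two example sentences are illustrative assumptions, not part of the dataset, and tokenization is a plain lowercase whitespace split rather than the vectorizers used in the previous section.

```python
from collections import Counter


def bag_of_words(text):
    """Count the tokens in a text, discarding word order (hypothetical helper)."""
    return Counter(text.lower().split())


# Two made-up sentences that use the same words in a different order
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# A shared, sorted vocabulary gives both texts vectors of the same length
vocabulary = sorted(set(bow_a) | set(bow_b))
vector_a = [bow_a[word] for word in vocabulary]
vector_b = [bow_b[word] for word in vocabulary]

print(vocabulary)
print(vector_a)
print(vector_b)
```

Both sentences receive the same vector even though they mean different things, which is the loss of word order described above. With a realistic vocabulary of many thousands of words, most entries of each vector would be zero.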

A more modern text vectorization approach is word embedding (also called simply embedding), which relies on neural representations. This approach takes distributional semantics into account; that is, a word’s meaning is given by the words that frequently appear close to it. Hence, we can construct a word’s context from the set of words that appear nearby within a fixed-size window.
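The fixed-size context window can be sketched in a few lines. The function name `context_windows`, the window size, and the example sentence are assumptions chosen for illustration; real embedding models build on the same idea at a much larger scale.

```python
def context_windows(tokens, window_size=2):
    """Pair each token with the tokens at most `window_size` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        # Neighbors on the left and right, excluding the center word itself
        context = tokens[start:i] + tokens[i + 1:end]
        pairs.append((center, context))
    return pairs


# Made-up example sentence, split on whitespace
tokens = "quality education supports decent work".split()
for center, context in context_windows(tokens, window_size=1):
    print(center, "->", context)
```

Words that repeatedly share the same neighbors across a large corpus end up with similar contexts, and therefore with similar vectors.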

Semantically similar texts then appear closer to each other in the vector space. We may also be able to capture semantic operations through operations in the vector space; for example, the similarity between two texts can be measured by the dot product of their vectors. We can also perform algebraic operations; for example,

$\text{vector("King")} - \text{vector("Man")} + \text{vector("Woman")} \approx \text{vector("Queen")}$.
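The dot-product similarity and the vector arithmetic above can be tried on a toy example. The two-dimensional vectors below are hand-made assumptions (one axis loosely standing for royalty, the other for gender); real embeddings are learned from data and have hundreds of dimensions. The sketch uses cosine similarity, which is the dot product of length-normalized vectors.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hand-made toy vectors (axes: royalty, gender); NOT learned embeddings
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}

# vector("king") - vector("man") + vector("woman")
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# The word whose vector points in the direction most similar to the target
best = max(toy_vectors, key=lambda word: cosine_similarity(toy_vectors[word], target))
print(best)
```

With learned embeddings the match is only approximate, which is why the relation above is written as an approximation rather than an equality.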

Modern-day representations are typically learned from a vast body of text, often with deep neural networks, and they are typically distributed as pre-trained models.

To get embeddings, we use the $\texttt{tensorflow}$ library, installed and imported as follows:


In [3]:
#the below two lines should be run in the terminal to install
#pip install tensorflow
#pip install tensorflow_hub

import tensorflow as tf
import tensorflow_hub as hub

# Alternative (larger) model: "https://tfhub.dev/google/universal-sentence-encoder-large/5"
embed_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(embed_url)

2023-07-17 05:17:22.112447: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


ValueError: Trying to load a model of incompatible/unknown type. '/var/folders/px/b7vc3nh913zb_m0x36ncftj00000gn/T/tfhub_modules/063d866c06683311b44b4992fd46003be952409c' contains neither 'saved_model.pb' nor 'saved_model.pbtxt'.

We're also going to use the Punkt sentence tokenizer from the $\texttt{nltk}$ library:

In [6]:
import nltk.data
nltk.download('punkt')
from nltk import word_tokenize, sent_tokenize
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jonathanzhu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


To start, we first need to break our document corpus into sentences:

In [7]:
text_df_sentence = []
text_df_sdg = []
for text, sdg in zip(text_df.text, text_df.sdg):
    sentences = sent_tokenize(text)  # split one document into its sentences
    text_df_sentence.extend(sentences)
    # every sentence inherits the SDG label of its parent document
    text_df_sdg.extend([sdg] * len(sentences))
sentence_df = pd.DataFrame({"text": text_df_sentence, "sdg": text_df_sdg})

<b>Exercise 3.1</b>: What are the dimensions of the $\texttt{text\_df}$ and $\texttt{sentence\_df}$ dataframes? What does each of the dimension numbers represent?

<b>Exercise 3.2</b>: Verify that the dimensions of the sentence dataframe are correct by breaking the original text dataframe into sentences and then determining the length.

An important question to ask is how many sentences each document has; in other words, what is the distribution of the number of sentences in each text? We can determine that as follows:

In [8]:
text_df["num_sent"] = text_df.text.apply(lambda x: len(sent_tokenize(x)))
text_df["num_sent"].value_counts()

3     19649
4     12018
5      3723
6      2601
2       959
7       622
8       191
1       154
9        45
10       22
12       18
13       10
15       10
11        8
14        8
16        5
19        4
17        3
21        3
20        2
25        2
18        1
22        1
24        1
31        1
40        1
Name: num_sent, dtype: int64

Notice that the vast majority of the documents have fewer than 10 sentences, with most having only 3 or 4. Segmenting documents into sentences like this prepares them for sentence embedding, which can help us with further NLP tasks down the line.

<h2>3.1 Universal Sentence Encoder</h2>

The Universal Sentence Encoder (USE) was published by Google in 2018. It maps a sentence, word, or short paragraph to a fixed-length (typically 512) numeric vector. As a result, semantically similar sentences are placed closer to each other in the embedding space.

Embeddings are typically computed from raw text, so no pre-processing is required. The resulting sentence embeddings can then be used for downstream applications, e.g., classification, clustering, and language prediction.

USE is a pre-trained model trained on a variety of data, e.g., Wikipedia and books. It was trained with a deep averaging network (DAN) encoder; more information and explanation on the process behind USE can be found at https://arxiv.org/pdf/1803.11175.pdf.

To utilize USE, we can take one of three approaches:
<ol>
<li>We could take our desired document, turn it into a collection of sentences, and then map each sentence to its respective vector;</li>
<li>We could treat each document as a short paragraph and match each document to its respective vector, or;</li>
<li>We could take a similar approach to #1, except then aggregate the vectors for each document to form a single vector per document.</li>
</ol>
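The third approach needs an aggregation rule, which the list above leaves open. A minimal sketch, assuming simple averaging (mean pooling) and using made-up 3-dimensional placeholder vectors in place of real 512-dimensional USE embeddings; the function name `mean_pool` and the document ids are hypothetical.

```python
from collections import defaultdict


def mean_pool(sentence_vectors, doc_ids):
    """Average the sentence vectors that belong to each document."""
    grouped = defaultdict(list)
    for doc_id, vector in zip(doc_ids, sentence_vectors):
        grouped[doc_id].append(vector)
    # zip(*vectors) iterates over the vectors one dimension at a time
    return {
        doc_id: [sum(column) / len(vectors) for column in zip(*vectors)]
        for doc_id, vectors in grouped.items()
    }


# Placeholder vectors standing in for sentence embeddings (assumed values)
sentence_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [0.5, 0.5, 0.5]]
doc_ids = ["doc_a", "doc_a", "doc_b"]  # which document each sentence came from

doc_vectors = mean_pool(sentence_vectors, doc_ids)
print(doc_vectors)
```

Averaging keeps the document vector the same length as a sentence vector regardless of how many sentences a document has. Other aggregations, such as a maximum per dimension, are possible; the mean is only one reasonable default.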

To load USE, run the following code:


In [9]:
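module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)  # downloads and caches the model on first use
print(f"module {module_url} loaded")

def embed(texts):
    """Return the USE embedding for each string in a list of texts."""
    return model(texts)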


module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


Note that the first time you run this, it may take some time (5+ minutes) to complete the process.

<h2>3.2 t-SNE</h2>

t-distributed Stochastic Neighbor Embedding (t-SNE) is best used to visualize high-dimensional data, such as text embeddings, in two or three dimensions. In short, it is a method of dimension reduction (like PCA). t-SNE converts the similarities between points into probabilities, using a Student's t-distribution in the low-dimensional space; it then uses some randomization (hence Stochastic) to embed the points, paying particular attention to the neighbors of each point. While we will not discuss the specific mathematics of t-SNE further, it can nonetheless be applied to document embeddings. More explanation of t-SNE can be found on the $\texttt{scikit-learn}$ website, https://scikit-learn.org/stable/modules/manifold.html#t-sne.

To demonstrate t-SNE, we first run a classification algorithm called the Multilayer Perceptron on our UN SDG data.

When using t-SNE in Python, we can start with the following code. Note that t-SNE can take a long time to run. Before running it, we rescale the features with $\texttt{MinMaxScaler}$ so that they all share the same scale.
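As a rough illustration of what min-max scaling does, the sketch below rescales each column to the range [0, 1] using only the standard library. The helper names `fit_min_max` and `apply_min_max` and the small data values are assumptions for illustration; in the notebook itself, scikit-learn's `MinMaxScaler` does this work. Note that the minimum and maximum are learned from the training rows only and then reused for the test rows.

```python
def fit_min_max(rows):
    """Record the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Rescale each column relative to the training range.

    Constant columns are mapped to 0.0 (a simplification for this sketch).
    """
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Made-up data: two features on very different scales
train = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
test = [[2.0, 40.0]]

mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)

print(train_scaled)
print(test_scaled)
```

Test values that lie outside the training range are mapped outside [0, 1], which is expected when the scaler is fitted on the training split alone.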

In [10]:
docs = sentence_df.text
categories = sentence_df.sdg
X_train, X_test, y_train, y_test = \
    train_test_split(docs, categories, test_size=0.33, random_state=7)

X_train_use_vector = embed(X_train.tolist())
X_test_use_vector = embed(X_test.tolist())



In [15]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train_use_vector)
X_train_use_vector_scaled = scaler.transform(X_train_use_vector)
X_test_use_vector_scaled = scaler.transform(X_test_use_vector)

use_mlp_clf = MLPClassifier(random_state=1, max_iter=500, hidden_layer_sizes=(300,)).fit(X_train_use_vector_scaled, y_train)
y_pred = use_mlp_clf.predict(X_test_use_vector_scaled)
print(metrics.classification_report(y_test,y_pred, digits = 4))

              precision    recall  f1-score   support

           1     0.3882    0.3780    0.3830      3249
           2     0.4008    0.3763    0.3882      2899
           3     0.5543    0.5545    0.5544      3118
           4     0.5398    0.5539    0.5468      4432
           5     0.5175    0.5159    0.5167      5119
           6     0.4194    0.4277    0.4235      3283
           7     0.4496    0.4652    0.4573      3710
           8     0.1851    0.1860    0.1856      1747
           9     0.3030    0.2583    0.2789      1827
          10     0.2375    0.2569    0.2468      1857
          11     0.3870    0.3606    0.3733      2718
          12     0.2664    0.2869    0.2763      1328
          13     0.3829    0.3981    0.3903      2517
          14     0.4420    0.4012    0.4206      1386
          15     0.3332    0.3847    0.3571      1861
          16     0.7680    0.7593    0.7636      8783

    accuracy                         0.4786     49834
   macro avg     0.4109   

In [1]:
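from sklearn import preprocessing
from sklearn.manifold import TSNE

# Requires X_train_use_vector and X_test_use_vector from the embedding cell above;
# running this cell in a fresh kernel raises a NameError
scaler = preprocessing.MinMaxScaler().fit(X_train_use_vector)
X_train_use_vector_scaled = scaler.transform(X_train_use_vector)
X_test_use_vector_scaled = scaler.transform(X_test_use_vector)

# Project the scaled test embeddings down to 2 dimensions
tsne = TSNE(n_components=2, verbose=0, perplexity=50)
tsne_proj = tsne.fit_transform(X_test_use_vector_scaled)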

NameError: name 'X_train_use_vector' is not defined

<h2>3.3 More Exercises</h2>

<b>Exercise 3.3</b>: Take two documents, one labeled as SDG 1 and the other as SDG 8. Segment these into sentences, compute the embedding of each sentence, and find the dot products between the embeddings.