# NYC Capital Projects

## Notebook 02: Generate Project Description Text BERT Embeddings

This notebook embeds each capital project's textual description into a 1-dimensional vector of 512 values. A pre-trained [Bidirectional Encoder Representations from Transformers (BERT) network model](https://arxiv.org/abs/1810.04805) is used to generate these project description embeddings.

The BERT implementation used here is provided by the Python library [keras-bert](https://pypi.org/project/keras-bert/). As a baseline, the smallest [available pre-trained BERT model](https://github.com/google-research/bert), ``uncased_L-2_H-128_A-2``, is used. This creates a 1D vector of size 512 for every sentence of text provided.

This notebook will output the embeddings for each project into a CSV file.

**NOTE:** Depending on your hardware, creating the embeddings can take up to 30 minutes, even with an embedding size this small.

### Project authors

- [An Hoang](https://github.com/hoangthienan95)
- [Mark McDonald](https://github.com/mcdomx)
- [Mike Sedelmeyer](https://github.com/sedelmeyer)

### Citation:

For additional information on the pre-trained BERT model used in this notebook, please see the original project's source repository:

- https://github.com/google-research/bert

Additionally, this pre-trained model is referenced by its authors in the following article:

- Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. (2019). "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models". [arXiv preprint arXiv:1908.08962v2](https://arxiv.org/abs/1908.08962)

And, more information regarding BERT itself can be found in the original paper:

- Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". [arXiv preprint arXiv:1810.04805](https://arxiv.org/abs/1810.04805)

### Inputs:

The following files are required to successfully run this notebook.

- ``../data/interim/NYC_capital_projects_all.csv``

    A dataframe that provides a snapshot of outcomes, regardless of available time-interval, for all projects under analysis.


- ``../models/pretrained_bert/uncased_L-2_H-128_A-2/``

    A directory containing the pre-trained BERT model, which is accessible on [the Google Research BERT repository](https://github.com/google-research/bert), and downloadable via [the link labeled 2/128 (BERT-Tiny)](https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip).
    

### Outputs:

The following files are generated by executing the code in this notebook.

- ``../data/interim/embeddings_uncased_L-2_H-128_A-2.csv``

    The resulting BERT embeddings for each capital project's textual description.

# Notebook contents

1. [Imports and set base path](#Imports-and-set-base-path)

2. [Select pretrained BERT encoder](#Select-pretrained-BERT-encoder)

3. [Define input and output filepaths](#Define-input-and-output-filepaths)

4. [Read the project descriptions](#Read-the-project-descriptions)

5. [Create embedding .csv file](#Create-embedding-.csv-file)

6. [Read embeddings to validate the file](#Read-embeddings-to-validate-the-file)

# Imports and set base path

[Return to top](#Notebook-contents)

In [2]:
import csv
import os

from keras_bert import extract_embeddings, POOL_NSP, POOL_MAX
import pandas as pd
from tqdm.notebook import tqdm

In [3]:
BERT_BASE_DIR = '../models/pretrained_bert'
os.path.isdir(BERT_BASE_DIR)

True

# Select pretrained BERT encoder

[Return to top](#Notebook-contents)

Many pretrained BERT encoders are available. The more dimensions an encoder has, the longer it takes to embed a sentence and the more storage each embedding requires.

For purposes of predicting project success, we simply want an encoded space to represent the project description. We will not be using the embeddings to do any translations or predictions based solely on the embedding.

In [4]:
# https://github.com/google-research/bert
# layers (L): 2, 4, 6, 8, 10, 12
# hidden sizes (H): 128, 256, 512, 768
# increasing either one increases embedding dimensionality and required processing time
# uncased_L-2_H-128_A-2     1.77s  512 elements (bert tiny) 64.2 *
# uncased_L-12_H-128_A-2    8.92s  1024 elements
# uncased_L-4_H-256_A-4     3.4s   2048 elements (bert mini) 65.8
# uncased_L-4_H-512_A-8     4.06s  4096 elements  (bert small) 71.2
# uncased_L-8_H-512_A-8     7.61s  4096 elements (bert medium) 73.5
# uncased_L-12_H-768_A-12   12.9s  6144 elements (bert base)
bert_model = 'uncased_L-2_H-128_A-2' 
model_path = os.path.join(BERT_BASE_DIR, bert_model)
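The element counts listed in the comments above are consistent with a simple rule: hidden size, times the number of layers concatenated (at most 4, and no more than the model has), times the number of pooling strategies (2). The sketch below checks that rule against the listed values. The rule is inferred from those values and from the `output_layer_num=4` and two-element `poolings` arguments used later in this notebook, not taken from the keras-bert documentation, and `embedding_size` is a hypothetical helper.

```python
def embedding_size(num_layers, hidden_size, output_layer_num=4, num_poolings=2):
    """Estimate the length of the pooled embedding vector.

    Assumption: the number of layers concatenated is capped at the number
    of transformer layers the model actually has.
    """
    layers_used = min(num_layers, output_layer_num)
    return layers_used * hidden_size * num_poolings


# (layers, hidden size) -> element count listed in the comments above
listed_sizes = {
    (2, 128): 512,
    (12, 128): 1024,
    (4, 256): 2048,
    (4, 512): 4096,
    (8, 512): 4096,
    (12, 768): 6144,
}

for (layers, hidden), listed in listed_sizes.items():
    print(layers, hidden, embedding_size(layers, hidden), listed)
```

If the rule holds, moving to a larger model changes the embedding length only through the hidden size once the model has at least 4 layers.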

# Define input and output filepaths

[Return to top](#Notebook-contents)

The calculated embeddings will be output to a CSV file that can be read by another process. Since embedding can take 30 minutes or more on some machines, saving the result to a file is the most practical way to share it.

In [5]:
file_path = '../data/interim/NYC_capital_projects_all.csv'
if os.path.isfile(file_path):
    print("OK - path points to file.")
else:
    print("ERROR - check the 'file_path' and ensure it points to the source file.")

OK - path points to file.


In [6]:
output_file = '../data/interim/embeddings_' + bert_model + '.csv'
print("Output filepath: {}".format(output_file))

Output filepath: ../data/interim/embeddings_uncased_L-2_H-128_A-2.csv


# Read the project descriptions

[Return to top](#Notebook-contents)

In [7]:
data = pd.read_csv(file_path)
all_descriptions = data[['PID', 'Description']].drop_duplicates()

In [8]:
# get the indexes of just the first line per project
pid_only_index = all_descriptions['PID'].drop_duplicates().index

projects = all_descriptions.loc[pid_only_index]
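The selection above keeps one row per project: the first description encountered for each `PID`. A plain-Python sketch of the same "first occurrence wins" behavior is shown below. The rows are hypothetical sample data, not values from the project file.

```python
# Hypothetical (PID, Description) rows, in file order
rows = [
    (3, "Reconstruct pier."),
    (3, "Reconstruct pier."),              # exact duplicate row
    (3, "Reconstruct pier and seawall."),  # same PID, different text
    (7, "Install new boiler."),
]

first_description = {}
for pid, description in rows:
    # setdefault stores a value only the first time a PID is seen
    first_description.setdefault(pid, description)

print(first_description)
```

Any later description for a project that has already been seen is ignored, so a project whose text changes over time is represented by its earliest description.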

# Create embedding .csv file

[Return to top](#Notebook-contents)

Create a .csv file containing each PID and its embedded description. To ensure that every embedding has the same length, each description is embedded as a single sentence rather than word by word. Each embedding is stored as a comma-separated string so that it is easy to parse when reading the saved .csv file.
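The text preparation step can be viewed in isolation. The sketch below collapses a multi-sentence description into one string and substitutes an underscore for a missing value, mirroring the logic of the cell that follows. `normalize_description` is a hypothetical helper name, and the sample strings are made up for illustration.

```python
import math


def normalize_description(text):
    """Collapse a description into a single sentence-like string."""
    # Missing values read from a CSV typically arrive as float NaN
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return "_"
    # Split on periods, drop empty fragments, and rejoin with single spaces
    sentences = [part.strip() for part in text.split(".") if part.strip()]
    return " ".join(sentences)


print(normalize_description("Replace roof. Repair walls."))
print(normalize_description(float("nan")))
```

Because every description becomes exactly one string, the encoder returns one fixed-length vector per project regardless of how many sentences the original text contained.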

In [9]:
%%time

# NOTE - This can take up to 30 minutes to execute, depending on your hardware
# If the file exists, you don't need to run this unless you are changing the model

with open(output_file, 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=",")
    csv_writer.writerow(['PID', 'embedding'])

    for row in tqdm(projects.itertuples(), total=len(projects), desc="Creating embeddings"):

        # if the project description is missing (NaN), use an underscore placeholder
        if pd.isna(row.Description):
            desc = ['_']
        else:
            # Join all sentences into a list of 1 element.
            # This ensures that the output is the same length for each description.
            desc = [x.strip() for x in row.Description.split('.') if x.strip()]
            desc = [' '.join(desc)]

        # calculate the embedding and format it as a comma-separated string for the csv file
        emb = extract_embeddings(model_path, desc, output_layer_num=4, poolings=[POOL_NSP, POOL_MAX])[0]
        emb = ', '.join(str(v) for v in emb)

        csv_writer.writerow([row.PID, emb])

            


CPU times: user 8min 22s, sys: 10.4 s, total: 8min 33s
Wall time: 8min 8s


### Done Creating Embeddings!

# Read embeddings to validate the file

[Return to top](#Notebook-contents)

To read the embeddings, use Pandas to import the file and format the stored embedded values into a list of float values.

In [10]:
if os.path.isfile(output_file):
    print("OK - path points to file.")
else:
    print("ERROR - check the 'output_file' and ensure it points to the embeddings file.")
    print(output_file)

OK - path points to file.


In [11]:
embedding = pd.read_csv(output_file)

def convert(s):
    # parse a comma-separated string of numbers into a list of floats
    return [float(x) for x in s.split(',')]

embedding['embedding'] = embedding['embedding'].apply(convert)
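The storage format can be checked without running the encoder. The sketch below writes one made-up vector in the same layout (a `PID` column plus a single comma-separated `embedding` field) to an in-memory buffer and parses it back. The values and the use of `io.StringIO` in place of a file on disk are assumptions for illustration only.

```python
import csv
import io

# Made-up vector standing in for a real embedding
values = [-0.5, 1.25, 3.0]

# Write in the same layout as the embeddings file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["PID", "embedding"])
writer.writerow([3, ", ".join(str(v) for v in values)])

# Read it back and convert the embedding field to a list of floats
buffer.seek(0)
parsed = {
    int(row["PID"]): [float(x) for x in row["embedding"].split(",")]
    for row in csv.DictReader(buffer)
}

print(parsed)
```

The `csv` module quotes the embedding field because it contains commas, which is why the whole vector survives as a single column on the way back in.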

In [12]:
len(embedding)

355

In [13]:
embedding.head()

|   | PID | embedding |
|---|-----|-----------|
| 0 | 3 | [-0.13854653, 1.4585932, -6.7886453, 0.0610936... |
| 1 | 7 | [-0.13127574, 1.1954153, -6.7207437, 0.0612295... |
| 2 | 18 | [0.09863796, 1.6704285, -6.5727553, 0.06882739... |
| 3 | 25 | [-0.26632923, 1.1822444, -6.7360897, 0.0684237... |
| 4 | 34 | [-0.35451388, 1.6325428, -6.692406, 0.10146355... |


In [14]:
# test cosine distance between two similarly described projects
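The comment above refers to cosine distance between two embeddings. As a reference for that metric only, here is a minimal sketch using the standard definition (one minus the cosine similarity). `cosine_distance` is a hypothetical helper and the vectors are toy values, not project embeddings, so this is not necessarily how the original cell performs the comparison.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Same direction gives a distance near 0; orthogonal vectors give 1
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

With embeddings parsed into lists of floats, two similarly described projects would be expected to produce a smaller distance than two unrelated ones.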