This project is an experiment to see how one can incorporate world knowledge into classical ML models like decision trees.

Deep learning (DL) models have no clear edge on tabular datasets. Models like XGBoost are on par with the best that DL approaches have to offer, yet require considerably less training time and are easier to tune. On the other hand, DL models can easily leverage large-scale language models like GPT-2 to incorporate real-world knowledge into their decision making.

A typical way for DL models to use a pre-trained language model is to include it as a submodule and train the combined model on the whole dataset. Such approaches cannot be easily extended to classical ML models that are not optimized with gradient descent.

The "knowledge" in language models is usually contained in the embeddings they generate. Since these embeddings are just vectors, we can use them as inputs to classical ML models. So how do we get these embeddings?

We could simply take a row from the table, write it down as a sentence, and feed it to the language model. But the way we construct the sentence has a big impact on the embeddings the language model generates. Language models are trained to predict words given a sentence. If we naively turn a table row into a sentence, the model might not pick up on the task we expect it to perform, e.g. generating an embedding that helps a linear regressor better predict a certain value. Ideally, this embedding will contain some external information that can be used to infer our target variable.
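To make the naive approach concrete, here is a minimal sketch of turning one table row into a sentence. The function name `row_to_sentence` and the "column is value" phrasing are illustrative assumptions, not part of this project's code; the row values are taken from the dataset preview shown further down.

```python
def row_to_sentence(row: dict) -> str:
    """Naively serialize one table row as comma-separated 'column is value' clauses."""
    return ", ".join(f"{col} is {val}" for col, val in row.items()) + "."


# One row from the flights table (target column 'counts' left out).
row = {
    "Year": 2001,
    "Month": 3,
    "DayofMonth": 29,
    "DayOfWeek": 4,
    "Origin": "MIA",
    "airport": "Miami International",
}

sentence = row_to_sentence(row)
print(sentence)
```

Nothing in a sentence like this tells the language model that we care about how busy the airport is, which is the gap that prompting tries to close.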

The problem then becomes how to get the language model to give us the "right" knowledge. This is where prompting comes in. Prompting involves crafting the input in a way that helps the language model grasp the larger context of what we expect it to do. Prompting has been successful in making large language models like GPT-3 perform, zero-shot, a variety of tasks they were never explicitly trained for.

But before we see how prompting can help in enhancing XGBoost on tabular data, let us first get a dataset into which we can easily incorporate some real-world knowledge, and build a baseline XGBoost model.

In [1]:
from copy import deepcopy
import itertools
import pickle
from functools import partial
from typing import List, cast

import polars as pl
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

import torch
from torch.optim import Adam
from torch.nn import CosineEmbeddingLoss
from torch.utils.data import DataLoader

from transformers import AutoTokenizer, AutoModel
from sentence_transformers import InputExample, SentenceTransformer, util
import xgboost

from data_utils import *
from model import *
from prompt_utils import *

from tqdm.notebook import tqdm



In [2]:
df = load_data(2000, 2002)
df.head()

Year,Month,DayofMonth,DayOfWeek,Origin,counts,airport
i64,i64,i64,i64,str,u32,str
2001,1,9,2,"""SBN""",6,"""South Bend Regional"""
2001,3,27,2,"""GSP""",21,"""Greenville-Spartanburg"""
2001,3,29,4,"""MIA""",229,"""Miami International"""
2001,4,18,3,"""SJU""",77,"""Luis Munoz Marin International"""
2001,4,8,7,"""GEG""",31,"""Spokane Intl"""


The dataset above contains the number of flights from a specific airport in the United States on a particular date. The number of flights is given by the counts column. Our task is to predict this number, given the date and the airport.

Origin is the unique IATA identifier of the airport in question. The actual name of the airport is given in the airport column.

Right away we can see how external knowledge can help here. Take an airport like Chicago O'Hare International. It is known to be pretty busy, with an average of around 1000 flights arriving at and departing from the airport per day. Compare that with Dawson Community Airport, one of the quietest airports in the United States. Knowing just this, we can already make a reasonable guess at the average number of flights per day for each.

Let's take an XGBoost regressor as the baseline and see how well it fares on this dataset. As with any ML task, we first need to split the dataset and prepare it for the XGBoost model. The following code does this.

Split the dataset

In [3]:
splits = train_test_split(df, test_size=0.2)
df_train, df_test = splits

# Casting for pylance
df_train = cast(pl.DataFrame, df_train)
df_test = cast(pl.DataFrame, df_test)



Featurize the train and test sets

In [4]:
# Need to convert each airport id from a string to an integer. Using the Label encoder from scikit-learn for this purpose
# The label encoder is fit for the whole dataset to prevent OOV when trying to transform the test set
all_origins = df.select('Origin').distinct().Origin.to_numpy()
origin_encoder = LabelEncoder()
origin_encoder.fit(all_origins)

# Apply the fitted label encoder to featurize the train set
X_train = df_train.with_column(
    pl.Series('origin_encoded', origin_encoder.transform(df_train.Origin.to_numpy()))
).select([
    pl.all().exclude(['Year', 'Origin', 'counts', 'airport', 'Month_name'])
]).to_numpy()

y_train = df_train.counts.to_numpy()


# Apply the fitted label encoder to featurize the test set
X_test = df_test.with_column(
    pl.Series('origin_encoded', origin_encoder.transform(df_test.Origin.to_numpy()))
).select([
    pl.all().exclude(['Year', 'Origin', 'counts', 'airport', 'Month_name'])
]).to_numpy()

y_test = df_test.counts.to_numpy()
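
As a rough illustration of what the label-encoding step does, the sketch below mimics scikit-learn's `LabelEncoder` in plain Python: the distinct labels are sorted and each one is mapped to its position. The helper name `fit_label_encoder` is a hypothetical stand-in, and the airport codes come from the dataset preview.

```python
def fit_label_encoder(values):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(values))
    return {label: idx for idx, label in enumerate(classes)}


origins = ["SBN", "GSP", "MIA", "SJU", "GEG", "MIA"]

# Fit on every label we will ever see, then transform.
mapping = fit_label_encoder(origins)
encoded = [mapping[origin] for origin in origins]

print(mapping)
print(encoded)
```

The integer codes only reflect alphabetical order, so they carry no information about how busy an airport is. A label missing from `mapping` would raise a `KeyError` here, which is the out-of-vocabulary problem the notebook avoids by fitting the encoder on the whole dataset.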

Train and evaluate the performance of the model

In [5]:
xgb = xgboost.XGBRegressor(
    objective='reg:squarederror',
    n_jobs = -1,
)

model = xgb.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

mae, mse, r2

(3.2043302603855772, 34.39642720417406, 0.9983257424215061)
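
The tuple above is (MAE, MSE, R²). For readers who want the definitions spelled out, here is a small plain-Python sketch of the three metrics. The function name `regression_metrics` is an assumption for illustration; the true values are the counts from the dataset preview, while the predicted values are invented toy numbers.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return mae, mse, r2


y_true = [6, 21, 229, 77, 31]   # counts from the preview rows
y_pred = [8, 20, 225, 80, 31]   # made-up predictions
mae, mse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, r2)
```

Because R² divides by the variance of the target, it can sit very close to 1 when counts differ widely between airports, even if the absolute errors are not negligible. MAE and MSE are therefore the more informative numbers to compare across the experiments.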

---

In [6]:
def prompt_embeddings(df, prompt_model):
    origins = df.select(['Origin', 'airport']).distinct()

    embeddings = []
    for airport in origins.airport:
        query = prompt_model.encode(f'How crowded is the {airport} ?')
        responses = prompt_model.encode([
            f'{airport} is very crowded. There are more than 800 flights every day',
            f'{airport} is moderately crowded. There around 400 flights every day',
            f'{airport} is slightly crowded. There are around 100 flights every day',
            f'{airport} is not crowded. There are less than 50 flights every day'
        ])

        sims = util.cos_sim(query, responses)
        most_sim_idx = np.argmax(sims)
        embeddings.append(responses[most_sim_idx])

    embeddings = np.vstack(embeddings)

    origins = origins.with_column(pl.Series('prompt_embeddings', embeddings))
    origin_embedding_map = {k:v for k,v in origins.select(['Origin', 'prompt_embeddings']).rows()}

    embeddings = df.select(
        pl.col('Origin').apply(lambda x: origin_embedding_map[x]).alias('embeddings')
    )
    embeddings = embeddings['embeddings'].to_list()
    embeddings = np.vstack(embeddings)
    return embeddings
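
The function above encodes one question and four candidate answers per airport, then keeps the embedding of the candidate that is most similar to the question. The selection step can be sketched in plain Python as follows. The three-dimensional vectors are toy stand-ins for real sentence embeddings, and the helper names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_most_similar(query, responses):
    """Return the index and vector of the response closest to the query."""
    sims = [cosine_similarity(query, r) for r in responses]
    best_idx = max(range(len(sims)), key=sims.__getitem__)
    return best_idx, responses[best_idx]


# Toy embeddings: one query and four candidate responses.
query = [1.0, 0.0, 1.0]
responses = [
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [1.0, 0.1, 0.9],    # nearly parallel to the query
    [-1.0, 0.0, -1.0],  # opposite direction
    [0.5, 0.5, 0.0],    # partially aligned
]

best_idx, best_embedding = pick_most_similar(query, responses)
print(best_idx, best_embedding)
```

In the notebook the chosen response embedding, not the query embedding, becomes the feature vector for every row of that airport.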

In [7]:
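# Fall back to the CPU when no GPU is available
st_device = 'cuda' if torch.cuda.is_available() else 'cpu'
prompt_model = SentenceTransformer('all-MiniLM-L6-v2', device=st_device, cache_folder='model_cache')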
X_embs_train = prompt_embeddings(df_train, prompt_model)
X_embs_test = prompt_embeddings(df_test, prompt_model)



In [8]:
X_train = df_train.with_column(
    pl.Series('origin_encoded', origin_encoder.transform(df_train.Origin.to_numpy()))
).select([
    pl.all().exclude(['Year', 'Origin', 'counts', 'airport', 'Month_name'])
]).to_numpy()

X_train = np.hstack([X_train, X_embs_train])
y_train = df_train.counts.to_numpy()


X_test = df_test.with_column(
    pl.Series('origin_encoded', origin_encoder.transform(df_test.Origin.to_numpy()))
).select([
    pl.all().exclude(['Year', 'Origin', 'counts', 'airport', 'Month_name'])
]).to_numpy()

X_test = np.hstack([X_test, X_embs_test])
y_test = df_test.counts.to_numpy()

In [9]:
xgb = xgboost.XGBRegressor(
    objective='reg:squarederror',
    n_jobs=-1)

model = xgb.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

mae, mse, r2

(1.6059000998000235, 15.93969158379574, 0.9992241301901905)

---

In [10]:
num_prompts = 8

query_prompt_format = pl.format(
    'Airport: {}, Number of flights: {}', 
    pl.col('airport'), pl.col('airport_tokens')
)

value_prompt_format = pl.format(
    'Airport: {}, Number of flights: {}', 
    pl.col('airport'), pl.col('counts')
)

query_value_prompts_train = generate_query_value_prompts(df_train, query_prompt_format, value_prompt_format, num_prompts)
query_value_prompts_test = generate_query_value_prompts(df_test, query_prompt_format, value_prompt_format, num_prompts)

In [11]:
airport_tokens = df.select('Origin').distinct()
airport_tokens = airport_tokens['Origin']
airport_tokens = airport_tokens.apply(partial(airport_token_sequencer, num_prompts=num_prompts)).to_list()
airport_tokens = list(itertools.chain.from_iterable(airport_tokens))

In [12]:
tokenizer = AutoTokenizer.from_pretrained('nreimers/MiniLM-L6-H384-uncased')
prompt_model = AutoModel.from_pretrained('nreimers/MiniLM-L6-H384-uncased')

num_added_tokens = tokenizer.add_tokens(airport_tokens, special_tokens=True)
assert num_added_tokens == len(airport_tokens)

pretrained_word_embeddings = deepcopy(prompt_model.embeddings.word_embeddings)
prompt_model.resize_token_embeddings(len(tokenizer))

# freeze all layers
for param in prompt_model.parameters():
    param.requires_grad = False

prefrozen_word_embeddings = PrefrozenEmbeddings(pretrained_word_embeddings, num_added_tokens)
prompt_model.embeddings.word_embeddings = prefrozen_word_embeddings

In [13]:
def convert_to_input_example_dataset(df) -> List[InputExample]:
    return [
        InputExample(texts=[q, v], label=1)
        for q, v in zip(df.to_dict()['query'], df.to_dict()['value'])
    ]


prompt_train_set = convert_to_input_example_dataset(query_value_prompts_train.distinct())
prompt_test_set = convert_to_input_example_dataset(query_value_prompts_test.distinct())

In [14]:
train_dataloader = DataLoader(prompt_train_set, shuffle=True, batch_size=64, collate_fn=lambda x: x)      # type: ignore
test_dataloader = DataLoader(prompt_test_set, shuffle=True, batch_size=64, collate_fn=lambda x: x)        # type: ignore

In [16]:
def train_step(model, optimizer, criterion, tokenizer, device):
    epoch_train_loss = 0.0
    model.train()
    for batch in train_dataloader:
        query_embeddings = embed_sentences([i.texts[0] for i in batch], model, tokenizer, device)
        value_embeddings = embed_sentences([i.texts[1] for i in batch], model, tokenizer, device)
        labels = torch.tensor([i.label for i in batch]).to(device)
        loss = criterion(query_embeddings, value_embeddings, labels)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        epoch_train_loss += loss.item()
    
    return epoch_train_loss


def valid_step(model, criterion, tokenizer, device):
    epoch_valid_loss = 0.0
    model.eval()
    with torch.no_grad():
        for batch in test_dataloader:
            query_embeddings = embed_sentences([i.texts[0] for i in batch], model, tokenizer, device)
            value_embeddings = embed_sentences([i.texts[1] for i in batch], model, tokenizer, device)
            labels = torch.tensor([i.label for i in batch]).to(device)
            loss = criterion(query_embeddings, value_embeddings, labels)

            epoch_valid_loss += loss.item()

    return epoch_valid_loss

In [None]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
optimizer = Adam(prompt_model.parameters(), lr=5e-1)
criterion = CosineEmbeddingLoss()
prompt_model = prompt_model.to(device)
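
To show what the criterion computes, here is a plain-Python sketch of the cosine embedding loss as documented for PyTorch's `CosineEmbeddingLoss` (default margin 0, mean reduction): a pair labelled 1 contributes `1 - cos`, and a pair labelled -1 contributes `max(0, cos - margin)`. The two-dimensional vectors are toy values and the function names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_embedding_loss(pairs, labels, margin=0.0):
    """Mean cosine embedding loss over (vector, vector) pairs with labels +1 or -1."""
    losses = []
    for (a, b), label in zip(pairs, labels):
        cos = cosine_similarity(a, b)
        if label == 1:
            losses.append(1.0 - cos)               # pull the pair together
        else:
            losses.append(max(0.0, cos - margin))  # push the pair apart
    return sum(losses) / len(losses)


pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction, cos = 1
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cos = 0
]

loss_all_positive = cosine_embedding_loss(pairs, labels=[1, 1])
loss_mixed = cosine_embedding_loss(pairs, labels=[1, -1])
print(loss_all_positive, loss_mixed)
```

Every `InputExample` in this notebook is created with `label=1`, so during training the loss reduces to the mean of `1 - cos(query, value)`: each query embedding is pulled towards its value embedding, and there are no negative pairs.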

In [17]:
def train(model_gpu, optimizer, criterion, device):
    train_losses = []
    valid_losses = []

    num_epochs = 150
    for e in tqdm(range(num_epochs)):

        epoch_train_loss = train_step(model_gpu, optimizer, criterion, tokenizer, device)
        epoch_valid_loss = valid_step(model_gpu, criterion, tokenizer, device)

        # Store the mean loss per batch for this epoch
        train_losses.append(epoch_train_loss / len(train_dataloader))
        valid_losses.append(epoch_valid_loss / len(test_dataloader))

        # Log every 10th epoch and the final epoch
        if e % 10 == 0 or e == (num_epochs - 1):
            print(e, train_losses[-1], valid_losses[-1])

train(prompt_model, optimizer, criterion, device)

0 0.0631777877608935 0.028248991817235947
10 0.03607655167579651 0.011745750474242063
20 0.03319415425260862 0.009674091895039264
30 0.03185046911239624 0.008430726109788967
40 0.030811092754205068 0.007791874178040486
50 0.030440221726894378 0.007266210571217995
60 0.030016236876447996 0.007052317082595367
70 0.030018943175673485 0.006639283831016376
80 0.029090561096866925 0.006533061010906329
90 0.02915951795876026 0.006316139183651943
100 0.029200083017349242 0.006138484423550276
110 0.02873743958771229 0.006010062204530606
120 0.028976715728640558 0.005955656023266224
130 0.028235080341498058 0.00588332790021713
140 0.028053473805387814 0.005761411041021347
149 0.02845854461193085 0.005648391034740668


In [18]:
# torch.save(model, 'chk_1.pt')

# with open('tok_1.pk', 'wb') as outfile:
#     pickle.dump(tokenizer, outfile)

In [19]:
# model = torch.load('chk_1.pt')
# with open('tok_1.pk', 'rb') as infile:
#     tokenizer = pickle.load(infile)

In [20]:
def generate_prompt_embeddings(df, model, tokenizer, device):
    query_prompts_train = pl.concat([
        df.select('Origin'),
        generate_query_value_prompts(
            df, query_prompt_format, value_prompt_format, num_prompts
        ).select('query')
    ], how='horizontal').distinct()

    with torch.no_grad():
        prompt_embs_train = embed_sentences(
            query_prompts_train.query.to_list(), 
            model, tokenizer, device
        ).cpu()

    origin_prompt_emb_map = dict()
    for i in range(query_prompts_train.shape[0]):
        origin = query_prompts_train['Origin'][i]
        emb = prompt_embs_train[i, :].numpy()

        origin_prompt_emb_map[origin] = emb

    return np.vstack(
        df.select([
            pl.col('Origin').apply(lambda x: origin_prompt_emb_map[x]).alias('emb')
        ]).emb.to_list()
    )

In [21]:
X_embs_train = generate_prompt_embeddings(df_train, prompt_model, tokenizer, device)
X_embs_test = generate_prompt_embeddings(df_test, prompt_model, tokenizer, device)



In [22]:
X_train = df_train.with_column(
    pl.Series('origin_encoded', origin_encoder.transform(df_train.Origin.to_numpy()))
).select([
    pl.all().exclude(['Year', 'Origin', 'counts', 'airport', 'Month_name'])
]).to_numpy()

X_train = np.hstack([X_train, X_embs_train])
y_train = df_train.counts.to_numpy()


X_test = df_test.with_column(
    pl.Series('origin_encoded', origin_encoder.transform(df_test.Origin.to_numpy()))
).select([
    pl.all().exclude(['Year', 'Origin', 'counts', 'airport', 'Month_name'])
]).to_numpy()

X_test = np.hstack([X_test, X_embs_test])
y_test = df_test.counts.to_numpy()

In [23]:
xgb = xgboost.XGBRegressor(
    objective='reg:squarederror', 
    n_jobs=-1)

model = xgb.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

mae, mse, r2

(1.4682715518835787, 13.406071752563278, 0.9993474549814046)
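
To put the three result tuples side by side, the sketch below computes the relative error reduction of each embedding variant against the baseline. The metric values are copied from the outputs in this notebook; the run names are labels chosen here for illustration.

```python
# (MAE, MSE, R^2) as printed by the three XGBoost runs in this notebook.
results = {
    "baseline": (3.2043302603855772, 34.39642720417406, 0.9983257424215061),
    "fixed prompt embeddings": (1.6059000998000235, 15.93969158379574, 0.9992241301901905),
    "tuned prompt embeddings": (1.4682715518835787, 13.406071752563278, 0.9993474549814046),
}


def relative_reduction(baseline_value, new_value):
    """Fraction by which an error metric dropped relative to the baseline."""
    return 1.0 - new_value / baseline_value


base_mae, base_mse, _ = results["baseline"]

reductions = {}
for name, (mae, mse, _) in results.items():
    if name == "baseline":
        continue
    reductions[name] = (
        relative_reduction(base_mae, mae),
        relative_reduction(base_mse, mse),
    )
    mae_drop, mse_drop = reductions[name]
    print(f"{name}: MAE -{mae_drop:.1%}, MSE -{mse_drop:.1%}")
```

Both embedding variants roughly halve the baseline MAE on this split. Since `train_test_split` was called without a fixed `random_state`, these numbers come from a single random split and should be read as indicative rather than exact.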

---