This notebook creates additional features from the text columns. Embeddings from pre-trained models such as BERT are high-dimensional (768 dimensions for BERT-base). Concatenating these vectors to our dataset would be costly, because training on a large amount of data with that many extra columns requires a lot of resources. Using AutoGluon's multimodal model is also a problem for me: my current laptop does not have a GPU, and I believe AutoGluon also tokenizes the text data and then trains a neural network on it.

So a practical solution for me is dimensionality reduction with PCA. Transformer embeddings capture similarities among texts well (through self-attention), and high-level semantic similarities are more likely to be preserved after PCA because they tend to contribute more to the variance across embeddings. For example, if two texts are about very similar topics, this major semantic feature is likely to be captured even in a reduced-dimensionality space.

The initial principal components (which explain the most variance) are likely to capture more general, high-level features of the text data. If the primary use case involves understanding or clustering texts based on broad topics or sentiments, PCA-reduced embeddings may still be effective.
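To make the variance argument concrete, here is a minimal sketch on synthetic data. The array `fake_embeddings` is a hypothetical stand-in for BERT vectors, built so that a handful of latent "topic" directions dominate the variance. It only illustrates how to inspect `explained_variance_ratio_`; how much variance 10 components retain on real embeddings has to be checked on the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for BERT embeddings: 200 "texts", 768 dimensions.
# Five latent "topic" directions carry most of the variance, plus small noise.
n_texts, dim, n_topics = 200, 768, 5
topics = rng.normal(size=(n_texts, n_topics)) * 5.0
mixing = rng.normal(size=(n_topics, dim))
fake_embeddings = topics @ mixing + rng.normal(scale=0.5, size=(n_texts, dim))

# Keep the first 10 principal components
pca = PCA(n_components=10)
reduced = pca.fit_transform(fake_embeddings)

# Cumulative share of total variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(reduced.shape)
print(cumulative.round(3))
```

If the cumulative ratio on the real embeddings is low, increasing `n_components` is the obvious knob to try.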

Because of limited resources, I will perform the method on only 0.1% of the data (`frac=0.001` in the cell below).

In [1]:
import pandas as pd
import numpy as np

df_train = pd.read_csv('../data/train_data.csv')
df_val = pd.read_csv('../data/val_data.csv')
df_test = pd.read_csv('../data/test_data.csv')

# Create samples with 0.1% of each DataFrame (frac=0.001)
df_train_sample = df_train.sample(frac=0.001, random_state=1)
df_val_sample = df_val.sample(frac=0.001, random_state=1)
df_test_sample = df_test.sample(frac=0.001, random_state=1)

# Optionally, you can check the sizes of your samples
print(f"Train sample size: {df_train_sample.shape}")
print(f"Validation sample size: {df_val_sample.shape}")
print(f"Test sample size: {df_test_sample.shape}")


Train sample size: (161, 25)
Validation sample size: (46, 25)
Test sample size: (46, 25)
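As a quick sanity check of what `frac` means, the sketch below samples from a hypothetical 10,000-row DataFrame (not the real data). `frac` is a fraction of the rows, so `frac=0.001` corresponds to 0.1%, and a fixed `random_state` makes the draw reproducible.

```python
import pandas as pd

# Hypothetical DataFrame with 10,000 rows, standing in for the real data
df = pd.DataFrame({'x': range(10_000)})

# frac is a fraction of the rows: 0.001 means 0.1%
sample = df.sample(frac=0.001, random_state=1)
print(len(sample), len(sample) / len(df))

# The same random_state selects the same rows again
sample_again = df.sample(frac=0.001, random_state=1)
print(sample.index.equals(sample_again.index))
```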


In [2]:
df_train_sample.to_csv('../data/train_data_sample.csv', index=False)
df_val_sample.to_csv('../data/val_data_sample.csv', index=False)
df_test_sample.to_csv('../data/test_data_sample.csv', index=False)

In [3]:
# pip install transformers

In [4]:
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')

# Load pre-trained model (weights)
bert_model = TFBertModel.from_pretrained('bert-base-german-cased')

  from .autonotebook import tqdm as notebook_tqdm
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on,

In [5]:
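def text2vect(input_text):
    """Return the 768-dimensional [CLS] embedding of input_text as a 1-D NumPy array."""
    # BERT accepts at most 512 tokens, so longer input text is truncated
    encoded_input = tokenizer(input_text, return_tensors='tf', max_length=512, truncation=True)
    model_output = bert_model(encoded_input)
    # Position 0 of the last hidden state is the [CLS] token
    embeddings = model_output.last_hidden_state[:, 0, :]
    return embeddings.numpy()[0]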

In [6]:
from sklearn.decomposition import PCA

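def concat_vectors_with_pca(df, text_column, n_components=10):
    """Embed df[text_column] with BERT, reduce with PCA, and append the result as new columns.

    Note: PCA is fitted on the rows of df passed in, so components obtained
    from separate calls (e.g. train vs. validation) do not share the same axes.
    """
    # Generate embeddings (BERT-base hidden size is 768)
    vectors = np.zeros((len(df), 768))
    for i, text in enumerate(df[text_column]):
        vectors[i, :] = text2vect(text)

    # Perform PCA
    pca = PCA(n_components=n_components)
    reduced_vectors = pca.fit_transform(vectors)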
    
    # Concatenate reduced embeddings with the original DataFrame
    df_copy = df.copy()
    vector_columns = [f'{text_column}_vec_{i}' for i in range(reduced_vectors.shape[1])]
    df_copy_with_vectors = pd.concat([
        df_copy.reset_index(drop=True),
        pd.DataFrame(reduced_vectors, columns=vector_columns),
    ], axis=1)
    
    return df_copy_with_vectors
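One caveat with the function above: each call fits its own PCA, so the reduced columns of the train, validation, and test samples live in different coordinate systems. A possible alternative, sketched here under assumptions, is to fit PCA once on the training embeddings and reuse it for the other splits. The helpers `fit_pca` and `add_pca_columns` are hypothetical names, and random arrays stand in for the BERT embeddings so the sketch runs without downloading a model.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def fit_pca(train_vectors, n_components=10):
    """Fit PCA on the training embeddings only (hypothetical helper)."""
    return PCA(n_components=n_components).fit(train_vectors)

def add_pca_columns(df, vectors, pca, prefix):
    """Project vectors with an already fitted PCA and append them as columns (hypothetical helper)."""
    reduced = pca.transform(vectors)
    columns = [f'{prefix}_vec_{i}' for i in range(reduced.shape[1])]
    reduced_df = pd.DataFrame(reduced, columns=columns)
    return pd.concat([df.reset_index(drop=True), reduced_df], axis=1)

# Random stand-ins for the 768-dimensional BERT embeddings (assumption)
rng = np.random.default_rng(1)
train_vectors = rng.normal(size=(161, 768))
val_vectors = rng.normal(size=(46, 768))
train_df = pd.DataFrame({'id': range(161)})
val_df = pd.DataFrame({'id': range(46)})

# Fit once on train, then apply the same projection to every split
pca = fit_pca(train_vectors)
train_out = add_pca_columns(train_df, train_vectors, pca, 'description')
val_out = add_pca_columns(val_df, val_vectors, pca, 'description')
print(train_out.shape, val_out.shape)
```

With this layout, a model trained on the training columns sees validation and test features expressed along the same principal axes.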


In [7]:
text_columns = ['description', 'facilities']

In [8]:
df_train_sample = concat_vectors_with_pca(df_train_sample,'description')
df_train_sample = concat_vectors_with_pca(df_train_sample, 'facilities')

In [9]:
df_train_sample.shape

(161, 45)

In [10]:
df_train_sample.drop(columns=text_columns, inplace=True)
print(df_train_sample.shape)
df_train_sample.to_csv('../data/vect_train_data_sample.csv', index=False)

(161, 43)


In [11]:
df_val_sample = concat_vectors_with_pca(df_val_sample,'description')
df_val_sample = concat_vectors_with_pca(df_val_sample, 'facilities')
df_val_sample.drop(columns=text_columns, inplace=True)
print(df_val_sample.shape)
df_val_sample.to_csv('../data/vect_val_data_sample.csv', index=False)

(46, 43)


In [12]:
df_test_sample = concat_vectors_with_pca(df_test_sample,'description')
df_test_sample = concat_vectors_with_pca(df_test_sample, 'facilities')
df_test_sample.drop(columns=text_columns, inplace=True)
print(df_test_sample.shape)
df_test_sample.to_csv('../data/vect_test_data_sample.csv', index=False)

(46, 43)
