#Obtaining embeddings for Korean everyday conversations

I used the following dataset from AI Hub: [한국어 대화 데이터](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=116) (Korean Conversation Data)

I chose this dataset for the following reasons:  
1) It contains written-style Korean rather than colloquial Korean  
2) It is an adequately sized dataset (around 90,000 sentences), and each sentence is of adequate length  
3) Its content covers everyday needs  
4) Previous studies on AAC with deep learning used this dataset

##0. Setup

In [3]:
import zipfile
import csv
import torch
import pandas as pd
import numpy as np
from transformers import BertModel, BertTokenizer
from torch.utils.data import DataLoader, Dataset

##1. About data

At first, I obtained embeddings for [주제별 텍스트 일상 대화 데이터](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=543) (Topic-based Text Everyday Conversation Data).  
However, that dataset has too many sentences (1,445,376 sentences, 23 GB with embeddings) and consists of colloquial Korean.  
I therefore switched to 한국어 대화 데이터.  
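
As a rough sanity check on that size, the arithmetic below estimates how much space one sentence-level vector per sentence would take. It assumes 768-dimensional vectors (the hidden size of a BERT-base model such as `klue/bert-base`) and, for the text estimate, about 20 bytes per value when floats are written out as text in a CSV. The per-value byte count and the helper name `estimate_size_gb` are illustrative assumptions, not measurements from the original run.

```python
def estimate_size_gb(num_sentences, hidden_size=768, bytes_per_value=4):
  """Approximate storage in GB (10**9 bytes) for one vector per sentence."""
  return num_sentences * hidden_size * bytes_per_value / 1e9

large = 1_445_376  # sentence count quoted for the larger dataset
small = 90_413     # sentence count of the smaller dataset

# Binary float32 storage: 4 bytes per value
print(f"large, binary: {estimate_size_gb(large):.2f} GB")
# Text storage: ~20 bytes per value is an assumed average, not a measurement
print(f"large, text:   {estimate_size_gb(large, bytes_per_value=20):.2f} GB")
print(f"small, text:   {estimate_size_gb(small, bytes_per_value=20):.2f} GB")
```

Under these assumptions the text estimate for the larger dataset lands in the same range as the 23 GB quoted above, while the smaller dataset comes out more than an order of magnitude lighter.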
  
I preprocessed the dataset and saved all sentences in a single file, `conversations.csv`.

In [4]:
conversations_df = pd.read_csv('conversations.csv')
sentences = conversations_df['SENTENCE'].tolist()
print(len(sentences))
print(sentences[:10])

90413
['1시간에 얼마인가요?', '처음 1시간은 1000원이고 이후 1시간은 500원씩 추가됩니다', '무인발급기 있나요?', '무인발급기는 카운터 바로 옆쪽에 이용 가능합니다', '1달 정액권 끊을 수 있나요?', '네 1달에 5만 원입니다', '정액권 끊다가 정지해도 되나요?', '네 가능합니다', '음식 주문 가능한가요?', '햄버거랑 핫도그 종류 가능합니다']


##2. Obtain embeddings for each sentence

In [5]:
class SentenceDataset(Dataset):
  def __init__(self, sentences):
    self.sentences = sentences

  def __len__(self):
    return len(self.sentences)

  def __getitem__(self, idx):
    return self.sentences[idx]

In [6]:
def compute_embeddings(dataset, batch_size=32, max_length=128, model_name='klue/bert-base'):
  tokenizer = BertTokenizer.from_pretrained(model_name)
  model = BertModel.from_pretrained(model_name)
  model.eval()

  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model.to(device)

  dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
  sentence_embeddings_list = []  # Store sentence-level embeddings
  all_tokens_list = []  # Store tokens for each sentence
  all_token_embeddings_list = []  # Store token embeddings for each sentence

  with torch.no_grad():
    for batch in dataloader:
      inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=max_length, return_attention_mask=True)
      input_ids = inputs['input_ids'].to(device)
      attention_mask = inputs['attention_mask'].to(device)

      outputs = model(input_ids=input_ids, attention_mask=attention_mask)

      # Sentence-level embeddings: mean of the token states, with padding masked out
      sentence_embeddings = (outputs.last_hidden_state * attention_mask.unsqueeze(-1)).sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)
      sentence_embeddings_list.append(sentence_embeddings.cpu())

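      # Token-level embeddings
      for i in range(input_ids.size(0)):
        # Position 0 is [CLS] and the last attended position is [SEP]; skip both
        # so each token lines up with its own hidden state. Tokens are read back
        # from input_ids so that truncated sentences stay aligned as well.
        seq_len = int(attention_mask[i].sum())
        tokens = tokenizer.convert_ids_to_tokens(input_ids[i, 1:seq_len - 1].tolist())
        token_embeddings = outputs.last_hidden_state[i, 1:seq_len - 1, :].cpu()

        all_tokens_list.append(tokens)
        all_token_embeddings_list.append([embedding.numpy() for embedding in token_embeddings])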

  sentence_embeddings = torch.cat(sentence_embeddings_list, dim=0)

  return sentence_embeddings, all_tokens_list, all_token_embeddings_list
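
The sentence-level line in `compute_embeddings` is a masked mean: hidden states at padded positions are zeroed out, and the sum is divided by the number of real tokens. The pure-Python sketch below reproduces that arithmetic on a toy example so the effect of the mask is easy to see. `masked_mean` and the toy values are illustrative and not part of the notebook.

```python
def masked_mean(hidden_states, attention_mask):
  """Average the rows of hidden_states whose mask value is 1.

  hidden_states: list of token vectors (one list of floats per position)
  attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding
  """
  dim = len(hidden_states[0])
  total = [0.0] * dim
  for vector, keep in zip(hidden_states, attention_mask):
    for j in range(dim):
      total[j] += vector[j] * keep  # padded rows contribute nothing
  count = sum(attention_mask)       # number of real tokens
  return [value / count for value in total]

# Toy sequence: three real tokens followed by one padding position.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean(hidden, mask)
naive = [sum(col) / len(hidden) for col in zip(*hidden)]  # ignores the mask
print(pooled)  # [3.0, 4.0]
print(naive)   # [27.25, 28.0]
```

Without the mask, the padding position drags the average far away from the real tokens. Note that in the notebook the attention mask also covers the `[CLS]` and `[SEP]` positions, so those two states are included in the average.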

In [7]:
dataset = SentenceDataset(sentences)
sentence_embeddings, all_tokens, all_token_embeddings = compute_embeddings(dataset)

In [8]:
# Save sentence-level embeddings to a CSV file
sentence_output_filename = 'sentence_with_embeddings.csv'
with open(sentence_output_filename, mode='w', newline='', encoding='utf-8') as sentence_file:
  writer = csv.writer(sentence_file)
  writer.writerow(['Sentence', 'Sentence_Embedding'])

  for sentence, sent_embedding in zip(sentences, sentence_embeddings):
    if sentence.strip():  # Skip empty sentences
      writer.writerow([sentence, sent_embedding.tolist()])
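
Because each embedding is written with `tolist()`, the `Sentence_Embedding` column holds the text form of a Python list. The following is a minimal sketch of reading such a file back, assuming every value is a finite float so that `json.loads` can parse the column. The helper `load_sentence_embeddings` and the small demo file are hypothetical and only mirror the column layout used above.

```python
import csv
import json
import os
import tempfile

def load_sentence_embeddings(path):
  """Read (sentence, embedding) pairs from a CSV written like the cell above."""
  sentences, embeddings = [], []
  with open(path, newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
      sentences.append(row['Sentence'])
      # The column stores str(list_of_floats), e.g. "[0.1, -0.2]"
      embeddings.append(json.loads(row['Sentence_Embedding']))
  return sentences, embeddings

# Demo on a tiny file with the same two columns (the values are made up).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_embeddings.csv')
with open(demo_path, mode='w', newline='', encoding='utf-8') as f:
  writer = csv.writer(f)
  writer.writerow(['Sentence', 'Sentence_Embedding'])
  writer.writerow(['네 가능합니다', [0.25, -0.5, 1.0]])

loaded_sentences, loaded_embeddings = load_sentence_embeddings(demo_path)
print(loaded_sentences[0], loaded_embeddings[0])
```

Parsing text back into floats is slow for large files, which is one reason a binary format may be preferable if the embeddings are reloaded often.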

In [9]:
# Save token-level embeddings to a separate CSV file
token_output_filename = 'token_with_embeddings.csv'
with open(token_output_filename, mode='w', newline='', encoding='utf-8') as token_file:
  writer = csv.writer(token_file)
  writer.writerow(['Sentence', 'Token', 'Token_Embedding'])

  for sentence, tokens, token_embeddings in zip(sentences, all_tokens, all_token_embeddings):
    for token, token_embedding in zip(tokens, token_embeddings):
      writer.writerow([sentence, token, token_embedding.tolist()])