Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Fast bare-bones BPE for modern tokenizer training
Subword Encoding in Lattice LSTM for Chinese Word Segmentation
Simple-to-use scoring function for arbitrarily tokenized texts.
Learning BPE embeddings by first learning a segmentation model and then training word2vec
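A minimal sketch of that two-step recipe, assuming the corpus has already been segmented into BPE subword units (the repository's own segmentation model is not shown) and using gensim's Word2Vec with gensim 4.x parameter names:

```python
# Sketch: train word2vec over BPE-segmented sentences (hypothetical toy corpus).
from gensim.models import Word2Vec

# Pre-segmented corpus: each sentence is a list of BPE symbols,
# with "@@" marking subwords that continue a word.
bpe_sentences = [
    ["the", "lo@@", "west", "price"],
    ["the", "high@@", "est", "price"],
]

model = Word2Vec(
    sentences=bpe_sentences,
    vector_size=100,   # dimensionality of the subword embeddings
    window=5,
    min_count=1,
    sg=1,              # skip-gram
)

# Each BPE symbol now has its own embedding vector.
print(model.wv["lo@@"].shape)  # (100,)
```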
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementations from scratch in Python
Byte-Pair Encoding tokenizer for training large language models on huge datasets
Natural Language EnCoder-Decoder: word, char, BPE, etc.
Byte Pair Encoding (BPE)
This repository provides a clear, educational implementation of Byte Pair Encoding (BPE) tokenization in plain Python. The focus is on algorithmic understanding, not raw performance.
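For orientation, a minimal sketch of the core BPE training loop in plain Python (not the repository's own code): count adjacent symbol pairs over a toy vocabulary and merge the most frequent pair, repeating for a fixed number of merges.

```python
# Sketch of the classic BPE merge loop over a toy vocabulary.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    old = " ".join(pair)
    new = "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Words are stored as space-separated character sequences.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

num_merges = 10
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```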
A Python package to build a corpus vocabulary using the byte-pair methodology, plus a tokenizer that tokenizes input texts based on the built vocabulary.
An extremely simple and restricted tool/lib converting binary data into text that can be processed with unsupervised character-level natural language processing tools/libs
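One way such a conversion can work (the repository's actual scheme is not specified here) is to map every byte value to a distinct Unicode character, keeping the inverse table so the original bytes can be recovered after character-level processing. A small sketch:

```python
# Sketch: reversible byte-to-character mapping for character-level NLP tools.
def build_byte_to_char():
    # Offset all 256 byte values into a single Unicode range.
    return {b: chr(0x2400 + b) for b in range(256)}

BYTE_TO_CHAR = build_byte_to_char()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def encode(data: bytes) -> str:
    return "".join(BYTE_TO_CHAR[b] for b in data)

def decode(text: str) -> bytes:
    return bytes(CHAR_TO_BYTE[c] for c in text)

blob = bytes([0, 72, 101, 108, 108, 111, 255])
assert decode(encode(blob)) == blob
```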
A modified, secure version of the BPE algorithm
An educational project dedicated to text-to-image generation with neural networks. A VQ-VAE and BPE are used to learn representations of the images and text, respectively. A transformer-based model is then trained to predict the next token in the concatenated sequence of text and image tokens, and is used for generation.
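A minimal sketch of that token-level interface (all names and sizes are hypothetical, not the project's actual code): BPE text ids and VQ-VAE codebook indices are concatenated into one sequence, with the image ids shifted past the text vocabulary, and the transformer is trained on next-token prediction over that sequence.

```python
# Sketch: build the combined text+image token sequence for next-token training.
import torch

TEXT_VOCAB_SIZE = 16_384    # size of the BPE vocabulary (assumed)
IMAGE_VOCAB_SIZE = 1_024    # size of the VQ-VAE codebook (assumed)

def build_sequence(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    # Shift image codes past the text vocabulary so both streams share
    # a single embedding table in the transformer.
    return torch.cat([text_tokens, image_tokens + TEXT_VOCAB_SIZE])

text = torch.tensor([12, 873, 45])         # BPE ids for the caption
image = torch.tensor([7, 512, 1023, 0])    # VQ-VAE indices for the image grid
seq = build_sequence(text, image)

# Training target: predict each next token of the combined sequence;
# at generation time the model is fed the text prefix and samples image tokens.
inputs, targets = seq[:-1], seq[1:]
print(inputs.tolist(), targets.tolist())
```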