This notebook demonstrates how to use GCGC to create a Hugging Face dataset object from a UniProt dataset.

First, install gcgc with support for the Hugging Face `datasets` package.

In [12]:
!pip install 'gcgc[hf]'



## Creating the Dataset

This notebook uses Swiss-Prot (`sprot`) because it's the smallest and easiest to manage, but the other common UniProt datasets are available too.

In [13]:
from gcgc.third_party import hf_datasets

", ".join(hf_datasets.UniprotDatasetNames)

'uniref50, uniref90, uniref100, uniparc, trembl, sprot'

The next step is to create the dataset reference. It is responsible for downloading the file and preparing it for the `datasets` package, which for gcgc means making the request to UniProt and then parsing the resulting FASTA.
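To make the "parsing the resulting FASTA" step concrete, here is a minimal pure-Python sketch of turning FASTA text into records with `id`, `description`, and `sequence` fields. It is not gcgc's actual parser: the `parse_fasta` and `_make_record` names are hypothetical, the second entry is made up, and the first entry only carries a short prefix of a real sequence.

```python
from typing import Dict, Iterable, Iterator, List


def _make_record(description: str, chunks: List[str]) -> Dict[str, str]:
    # Assumption: the id is the first whitespace-delimited token of the header.
    record_id = description.split(maxsplit=1)[0] if description else ""
    return {"id": record_id, "description": description, "sequence": "".join(chunks)}


def parse_fasta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per FASTA entry."""
    description = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if description is not None:
                yield _make_record(description, chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if description is not None:
        yield _make_record(description, chunks)


example = """>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R
MAFSAEDVLKEYDRRRRMEALLLSLYYPND
RKLLDYKEWSPPRVQVECPKAPVEWNNPPS
>toy|0001|MADE_UP A made-up second entry for illustration
MSIIGATRLQ
"""

records = list(parse_fasta(example.splitlines()))
for record in records:
    print(record["id"], len(record["sequence"]))
```

Wrapped sequence lines are joined back into a single string, which is why each record ends up with one `sequence` value regardless of how the file was line-wrapped.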

In [14]:
ref = hf_datasets.UniprotDataset(name="sprot")

# This is a no-op if the dataset is already cached.
ref.download_and_prepare()

Reusing dataset uniprot_dataset (/Users/thauck/.cache/huggingface/datasets/uniprot_dataset/sprot/1.0.0)


Then it's possible to extract the specific split.

In [15]:
ds = ref.as_dataset("sprot")

In [17]:
type(ds)

datasets.arrow_dataset.Dataset

Now that the sequences are in a nice Arrow-based dataset, let's have a peek.

In [26]:
five_records = ds[:5]
print("\n".join(five_records['sequence']))

MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL
MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCARIKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSLAERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADCKCINRASDPVYQKVKTLHAYPDQCWYVPCAADVGELKMGTQRDTPTNCPTQVCQIVFNMLDDGSVTMDDVKNTINCDFSKYVPPPPPPKPTPPTPPTPPTPPTPPTPPTPPTPRPVHNRKVMFFVAGAVLVAILISTVRW
MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNREQALAERYPELQTSEPSEDYSGPVESLELLPLEIKLDIMQYLSWEQISWCKHPWLWTRWYKDNVVRVSAITFEDFQREYAFPEKIQEIHFTDTRAEEIKAILETTPNVTRLVIRRIDDMNYNTHGDLGLDDLEFLTHLMVEDACGFTDFWAPSLTHLTIKNLDMHPRWFGPVMDGIKSMQSTLKYLYIFETYGVNKPFVQWCTDNIETFYCTNSYRYENVPRPIYVWVLFQEDEWHGYRVEDNKFHRRYMYSTILHKRDTDWVENNPLKTPAQVEMYKFLLRISQLNRDGTGYESDSDPENEHFDDESFSSGEEDSSDEDDPTWAPDSDDSDWETETEEEPSVAARILEKGKLTITNLMKSLGFKPKPKKIQS
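Slicing a `datasets` dataset returns a dict that maps each column name to a list of values, which is why `five_records['sequence']` above is a list of strings. The sketch below uses a plain dict as a stand-in for such a slice (the ids are placeholders and the sequences are short toy strings) and shows one way to transpose it into per-record dicts.

```python
# A plain dict standing in for the result of a slice such as `ds[:2]`.
# The ids are placeholders, not real UniProt identifiers.
batch = {
    "id": ["example_1", "example_2"],
    "sequence": ["MAFSAEDVLK", "MSIIGATRLQ"],
}

# Transpose the column-oriented batch into a list of row-oriented dicts.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

for row in rows:
    print(row["id"], len(row["sequence"]))
```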

Looks like some proteins... curious what's up with the proline-rich area in the second sequence?

## Tokenization

Those are amino acid strings. For modeling, the sequences need to be tokenized, which for gcgc means using its `KmerTokenizer`.

In [72]:
from gcgc import KmerTokenizer

This tokenizer uses an extended protein alphabet and conforms each encoded sequence to a length of 200, trimming or padding as necessary.

In [74]:
tokenizer = KmerTokenizer(alphabet="extended_protein", conform_length=200)

In [75]:
tokenizer

KmerTokenizer(vocab=Vocab(stoi={'|': 0, '>': 1, '<': 2, '#': 3, '?': 4, 'A': 5, 'C': 6, 'D': 7, 'E': 8, 'F': 9, 'G': 10, 'H': 11, 'I': 12, 'K': 13, 'L': 14, 'M': 15, 'N': 16, 'P': 17, 'Q': 18, 'R': 19, 'S': 20, 'T': 21, 'V': 22, 'W': 23, 'Y': 24, 'B': 25, 'X': 26, 'Z': 27, 'J': 28, 'U': 29, 'O': 30}), pad_token='|', bos_token='>', eos_token='<', mask_token='#', unk_token='?', pad_token_id=0, bos_token_id=1, eos_token_id=2, mask_token_id=3, unk_token_id=4, pad_at_end=True, max_length=None, min_length=None, conform_length=200, alphabet='ACDEFGHIKLMNPQRSTVWYBXZJUO', kmer_length=1, kmer_stride=1)
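The repr shows `kmer_length=1` and `kmer_stride=1`, so here every token is a single amino acid. To illustrate what those two parameters generally mean, here is a small sketch of splitting a sequence into k-mers. The `kmerize` function is hypothetical and is not gcgc's implementation; in particular, dropping a trailing window that is shorter than `kmer_length` is an assumption.

```python
from typing import List


def kmerize(sequence: str, kmer_length: int = 1, kmer_stride: int = 1) -> List[str]:
    """Split a sequence into windows of kmer_length, advancing kmer_stride each step."""
    if kmer_length < 1 or kmer_stride < 1:
        raise ValueError("kmer_length and kmer_stride must be positive")
    # Assumption: incomplete windows at the end of the sequence are dropped.
    last_start = len(sequence) - kmer_length
    return [
        sequence[start:start + kmer_length]
        for start in range(0, last_start + 1, kmer_stride)
    ]


print(kmerize("MAFSAE"))                                 # single residues
print(kmerize("MAFSAE", kmer_length=3))                  # overlapping 3-mers
print(kmerize("MAFSAE", kmer_length=3, kmer_stride=3))   # non-overlapping 3-mers
```

With a stride equal to the k-mer length the windows do not overlap; with a stride of 1 each window shares all but one character with its neighbor.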

Then we can map this over the dataset to create a new dataset with an `input_ids` column.

In [76]:
encoded_dataset = ds.map(lambda x: {"input_ids": tokenizer.encode(x["sequence"])})

Loading cached processed dataset at /Users/thauck/.cache/huggingface/datasets/uniprot_dataset/sprot/1.0.0/cache-c9c4b3bc9d2ef833.arrow


In [77]:
# Note the new input_ids feature
encoded_dataset

Dataset(features: {'description': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'sequence': Value(dtype='string', id=None)}, num_rows: 563082)

Looking at a row, things look good.

In [78]:
for x in encoded_dataset:
    print(x)
    break

{'description': 'sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1', 'id': 'sp|Q6GZX4|001R_FRG3G', 'input_ids': [1, 15, 5, 9, 20, 5, 8, 7, 22, 14, 13, 8, 24, 7, 19, 19, 19, 19, 15, 8, 5, 14, 14, 14, 20, 14, 24, 24, 17, 16, 7, 19, 13, 14, 14, 7, 24, 13, 8, 23, 20, 17, 17, 19, 22, 18, 22, 8, 6, 17, 13, 5, 17, 22, 8, 23, 16, 16, 17, 17, 20, 8, 13, 10, 14, 12, 22, 10, 11, 9, 20, 10, 12, 13, 24, 13, 10, 8, 13, 5, 18, 5, 20, 8, 22, 7, 22, 16, 13, 15, 6, 6, 23, 22, 20, 13, 9, 13, 7, 5, 15, 19, 19, 24, 18, 10, 12, 18, 21, 6, 13, 12, 17, 10, 13, 22, 14, 20, 7, 14, 7, 5, 13, 12, 13, 5, 24, 16, 14, 21, 22, 8, 10, 22, 8, 10, 9, 22, 19, 24, 20, 19, 22, 21, 13, 18, 11, 22, 5, 5, 9, 14, 13, 8, 14, 19, 11, 20, 13, 18, 24, 8, 16, 22, 16, 14, 12, 11, 24, 12, 14, 21, 7, 13, 19, 22, 7, 12, 18, 11, 14, 8, 13, 7, 14, 22, 13, 7, 9, 13, 5, 14, 22, 8, 20, 5, 11, 19, 15, 19], 'sequence': 'MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPV
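The start of `input_ids` can be checked by hand against the vocab in the tokenizer's repr: `1` is the `>` (bos) token, followed by `15, 5, 9` for `M`, `A`, `F`. The sketch below rebuilds that vocab and encodes a short string. It is not gcgc's code: the `encode` function is hypothetical, and the order of operations (wrap in bos/eos first, then trim or right-pad to the conform length) is an assumption.

```python
from typing import List

# Special tokens and alphabet copied from the tokenizer repr, in index order.
SPECIAL_TOKENS = ["|", ">", "<", "#", "?"]  # pad, bos, eos, mask, unk
ALPHABET = "ACDEFGHIKLMNPQRSTVWYBXZJUO"
STOI = {token: index for index, token in enumerate(SPECIAL_TOKENS + list(ALPHABET))}


def encode(sequence: str, conform_length: int) -> List[int]:
    """Assumed behavior: add bos/eos, then trim or right-pad to conform_length."""
    unk_id = STOI["?"]
    ids = [STOI[">"]] + [STOI.get(char, unk_id) for char in sequence] + [STOI["<"]]
    ids = ids[:conform_length]  # trim if too long
    return ids + [STOI["|"]] * (conform_length - len(ids))  # pad if too short


print(encode("MAFSAEDV", conform_length=12))
# → [1, 15, 5, 9, 20, 5, 8, 7, 22, 2, 0, 0]
```

Under this assumption, a sequence longer than the conform length loses its eos token to trimming, while a shorter one keeps it and is filled out with the pad id `0`.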

Unfortunately, this is not a particularly fast process, but Hugging Face's caching helps a lot.

In [79]:
encoded_dataset = ds.map(lambda x: {"input_ids": tokenizer.encode(x["sequence"])})

Loading cached processed dataset at /Users/thauck/.cache/huggingface/datasets/uniprot_dataset/sprot/1.0.0/cache-c9c4b3bc9d2ef833.arrow


There's also a batch mode, which gives some speedup.

In [80]:
second_ds = ds.map(lambda x: {"input_ids": tokenizer.encode_batch(x["sequence"])}, batched=True)

Loading cached processed dataset at /Users/thauck/.cache/huggingface/datasets/uniprot_dataset/sprot/1.0.0/cache-289f655b6eeb11a7.arrow
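The difference between the two `map` calls is the shape of what the function receives: without `batched=True` it gets one row as a dict of single values, and with `batched=True` it gets a dict of lists covering many rows and must return lists of the same length. The sketch below mimics that contract in plain Python. `toy_map` and `toy_encode` are hypothetical stand-ins, not the `datasets` or gcgc implementations.

```python
from typing import Callable, Dict, List

Columns = Dict[str, list]


def toy_encode(sequence: str) -> List[int]:
    # Hypothetical stand-in for tokenizer.encode: one integer per character.
    return [ord(char) for char in sequence]


def per_row_fn(row: dict) -> dict:
    # Receives a single row: row["sequence"] is one string.
    return {"input_ids": toy_encode(row["sequence"])}


def batched_fn(batch: Columns) -> Columns:
    # Receives many rows at once: batch["sequence"] is a list of strings.
    return {"input_ids": [toy_encode(seq) for seq in batch["sequence"]]}


def toy_map(columns: Columns, fn: Callable, batched: bool = False, batch_size: int = 2) -> Columns:
    """Apply fn row by row or batch by batch and merge the new columns in."""
    num_rows = len(next(iter(columns.values()), []))
    new_columns: Columns = {}
    if batched:
        for start in range(0, num_rows, batch_size):
            batch = {name: values[start:start + batch_size] for name, values in columns.items()}
            for name, values in fn(batch).items():
                new_columns.setdefault(name, []).extend(values)
    else:
        for index in range(num_rows):
            row = {name: values[index] for name, values in columns.items()}
            for name, value in fn(row).items():
                new_columns.setdefault(name, []).append(value)
    return {**columns, **new_columns}


data = {"sequence": ["MAF", "SA", "E"]}
row_result = toy_map(data, per_row_fn)
batch_result = toy_map(data, batched_fn, batched=True)
print(row_result == batch_result)
```

Both paths produce the same columns; the batched path simply calls the function fewer times, which is where the speedup tends to come from when the function can process a list efficiently.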


Anyway, we now have our tokenized dataset, and we're ready to do some (data) science.