In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import json
import shutil

First, we need to download the data from Kaggle into our own Drive. To do this, we've downloaded a JSON file with our credentials, which lets us call the Kaggle API and copy the data.

The next bit of code points to the credentials file and reads it in with `json`. I'll probably obscure it a bit, since it's my PII. Once this is done, I can make the API calls and move the data.

In [None]:
base_dir = './drive/MyDrive/PFAM_database/'
credentials_dir = base_dir + 'kaggle.json'

In [None]:
# Use a context manager so the credentials file is closed after reading
with open(credentials_dir) as f:
    # dictionary with keys 'username' and 'key' (my PII for accessing the Kaggle API)
    data = json.load(f)

#Now we set up our environment
os.environ['KAGGLE_USERNAME'] = data['username']
os.environ['KAGGLE_KEY'] = data['key']

In [None]:
!kaggle datasets download -d googleai/pfam-seed-random-split

Downloading pfam-seed-random-split.zip to /content
 99% 489M/493M [00:04<00:00, 142MB/s]
100% 493M/493M [00:04<00:00, 108MB/s]


In [None]:
## Oops, the zip was downloaded to the wrong location (/content)... let's move it into base_dir
os.listdir('/content')
shutil.move('/content/pfam-seed-random-split.zip', base_dir)
os.listdir(base_dir)

['kaggle.json', 'pfam-seed-random-split.zip']

Now that we have our zipped dataset, let's unzip it and save it to a new directory (`data`).

In [None]:
import zipfile
with zipfile.ZipFile(base_dir+'pfam-seed-random-split.zip', 'r') as zip_ref:
    zip_ref.extractall(base_dir+'data')  # base_dir already ends with '/'

In [None]:
print(os.listdir(base_dir))
print(os.listdir(base_dir+'data/random_split'))

['kaggle.json', 'pfam-seed-random-split.zip', 'data']
['dev', 'random_split', 'test', 'train']
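
To get a feel for what was extracted, a small helper can count the files in each fold. This is only a sketch: `summarize_folds` is a hypothetical helper, the path is assumed from the cells above, and the fold names are taken from the directory listing. Nothing is assumed about how many files each fold holds or what they are called.

```python
from pathlib import Path

def summarize_folds(split_dir, folds=("train", "dev", "test")):
    """Return {fold: (n_files, total_bytes)} for each fold directory that exists."""
    summary = {}
    for fold in folds:
        fold_dir = Path(split_dir) / fold
        if not fold_dir.is_dir():
            continue  # skip folds that are missing rather than failing
        files = [p for p in fold_dir.iterdir() if p.is_file()]
        summary[fold] = (len(files), sum(p.stat().st_size for p in files))
    return summary

# Path assumed from the cells above; adjust if your Drive layout differs.
split_dir = Path("./drive/MyDrive/PFAM_database/data/random_split")
for fold, (n_files, n_bytes) in summarize_folds(split_dir).items():
    print(f"{fold}: {n_files} files, {n_bytes / 1e6:.1f} MB")
```

If the directory does not exist, the helper returns an empty dictionary and nothing is printed.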


## About Dataset
### Problem description
This directory contains data to train a model to predict the function of protein domains, based
on the Pfam dataset.

Domains are functional sub-parts of proteins; much like images in ImageNet are presegmented to
contain exactly one object class, this data is presegmented to contain exactly and only one
domain.

The purpose of the dataset is to recast the Pfam seed dataset as a multiclass classification
machine learning task.

The task is: given the amino acid sequence of the protein domain, predict which class it belongs
to. There are about 1 million training examples, and 18,000 output classes.

### Data structure
This data is more completely described by the publication "Can Deep Learning
Classify the Protein Universe", Bileschi et al.

### Data split and layout
The approach used to partition the data into training/dev/testing folds is a random split.

- Training data should be used to train your models.
- Dev (development) data should be used in a close validation loop (for example,
  for hyperparameter tuning or model validation).
- Test data should be reserved for much less frequent evaluations. This
  helps avoid overfitting on your test data, as it should only be used
  infrequently.
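
A minimal sketch of reading one fold into memory, assuming every file in a fold directory is a CSV with a header row (the header row and the path are assumptions here). `load_fold` is a hypothetical helper that uses only the standard library; pandas would work equally well.

```python
import csv
from pathlib import Path

def load_fold(fold_dir):
    """Read every file in one fold directory into a list of row dicts."""
    rows = []
    for path in sorted(Path(fold_dir).iterdir()):
        if not path.is_file():
            continue
        with open(path, newline="") as fh:
            rows.extend(csv.DictReader(fh))  # one dict per CSV record
    return rows

# Path assumed from the extraction step; only the dev fold is loaded here
# because it is the one used in the frequent validation loop.
dev_dir = Path("./drive/MyDrive/PFAM_database/data/random_split/dev")
if dev_dir.is_dir():
    dev = load_fold(dev_dir)
    print(len(dev), "dev records")
```

Holding the full training fold as a list of dicts may use a lot of memory, so for training it may be preferable to stream the files instead.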

## File content
Each fold (train, dev, test) has a number of files in it. Each of those files
is a CSV with one record per line, and each record has the following fields:

sequence: HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE  <br>
family_accession: PF02953.15 <br>
sequence_name: C5K6N5_PERM5/28-87 <br>
aligned_sequence: ....HWLQMRDSMNTYNNMVNRCFATCI...........RS.F....QEKKVNAEE.....MDCT....KRCVTKFVGYSQRVALRFAE  <br>
family_id: zf-Tim10_DDP <br>

#### Description of fields:

- sequence: These are usually the input features to your model. Amino acid sequence for this domain.
  There are 20 very common amino acids (frequency > 1,000,000), and 5 amino acids that are quite uncommon: X, U, B, O, Z.
- family_accession: These are usually the labels for your model. Accession number in the form PFxxxxx.y
  (Pfam), where xxxxx is the family accession, and y is the version number.
  Some values of y are greater than ten, and so 'y' has two digits.
- family_id: One word name for family.
- sequence_name: Sequence name, in the form "$uniprot_accession_id/$start_index-$end_index".
- aligned_sequence: Contains a single sequence from the multiple sequence alignment (with the rest of the members of
  the family in seed), with gaps retained.

Generally, the family_accession field is the label, and the sequence
(or aligned sequence) is the training feature.
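
As a sketch of how these two fields could become model inputs and targets, the snippet below maps each amino acid letter to an integer and each family_accession to a class index. The alphabet ordering, the reserved padding index 0, and the helper names are all assumptions for illustration, not part of the dataset.

```python
# 20 common amino acids followed by the uncommon codes listed above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY" + "XUBOZ"
# Index 0 is reserved for padding (an assumption), so letters start at 1.
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}

def encode_sequence(sequence):
    """Map an amino acid string to a list of integer indices."""
    return [AA_TO_INDEX[aa] for aa in sequence]

def build_label_index(accessions):
    """Assign a stable class index to each distinct family_accession."""
    return {acc: i for i, acc in enumerate(sorted(set(accessions)))}

# Example record taken from the file-content section.
example = {
    "sequence": "HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE",
    "family_accession": "PF02953.15",
}
features = encode_sequence(example["sequence"])
label_index = build_label_index([example["family_accession"]])
print(len(features), features[:5], label_index[example["family_accession"]])
```

In practice the label index would be built once from the training fold and reused for dev and test, so that the same family always maps to the same class.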

This sequence corresponds to a domain, not a full protein.
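
Because each record is a domain cut out of a longer protein, the fields can be cross-checked against each other. The sketch below uses the example record from the file-content section. The helper names are hypothetical, and two things are assumptions that hold for this one example: that start and end are inclusive positions, and that `.` is the only gap character.

```python
def parse_sequence_name(sequence_name):
    """Split 'ACCESSION/start-end' into (accession, start, end)."""
    accession, _, span = sequence_name.rpartition("/")
    start, end = span.split("-")
    return accession, int(start), int(end)

def parse_family_accession(family_accession):
    """Split 'PFxxxxx.y' into (family, version)."""
    family, _, version = family_accession.partition(".")
    return family, int(version)

def strip_gaps(aligned_sequence, gap="."):
    """Remove alignment gap characters (assumed to be '.')."""
    return aligned_sequence.replace(gap, "")

record = {
    "sequence": "HWLQMRDSMNTYNNMVNRCFATCIRSFQEKKVNAEEMDCTKRCVTKFVGYSQRVALRFAE",
    "family_accession": "PF02953.15",
    "sequence_name": "C5K6N5_PERM5/28-87",
    "aligned_sequence": "....HWLQMRDSMNTYNNMVNRCFATCI...........RS.F....QEKKVNAEE.....MDCT....KRCVTKFVGYSQRVALRFAE",
}

accession, start, end = parse_sequence_name(record["sequence_name"])
family, version = parse_family_accession(record["family_accession"])
print(accession, start, end)
print(family, version)
# The domain span should match the sequence length.
print(end - start + 1 == len(record["sequence"]))
# Removing gaps from the aligned sequence should recover the raw sequence.
print(strip_gaps(record["aligned_sequence"]) == record["sequence"])
```

Checks like these are cheap to run over a whole fold and can catch parsing mistakes before any model is trained.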