In [1]:
import os

import pandas as pd

In [2]:
from magneton.interpro_parsing import parse_from_pkl

The goal here was to understand why our AlphaFoldDB download is missing structures for some SwissProt proteins.

## Summary
- The AlphaFoldDB FAQ lists the reasons a protein can be missing under "Which proteins are included?" ([here](https://alphafold.ebi.ac.uk/faq))
- Only 23,870 of the 555,692 SwissProt proteins parsed out of InterPro (about 4.3%) are missing AlphaFoldDB structures
- Of those, 5,816 were added to SwissProt after the 2021_04 release; the remaining 18,054 appear, based on spot checks, to be excluded for the other reasons outlined in the FAQ

## Context
While adding secondary structure annotations to the `Protein` objects parsed out of InterPro, I ran into proteins with no AlphaFold structure. This is a problem because the secondary structure annotations are parsed from the AlphaFoldDB mmCIF files.

## Methods
Downloaded the old SwissProt release (2021_04) that was used for AlphaFoldDB (located at `/home/rcalef/storage/om_storage/data/uniprot/swissprot_2021_04`). Checked how many of the UniProt IDs without a structure are also absent from that release, i.e. were added to SwissProt after the cutoff. Then manually inspected the UniProt pages for a few of the remainder, which seemed to match the other AlphaFoldDB exclusion criteria (e.g. containing non-standard `X` amino acids).
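
The manual inspection step could be partly automated. As a minimal sketch, assuming the sequences for the remaining IDs are available as plain strings, the hypothetical helper below (not part of `magneton`) counts residues outside the 20 standard amino acids, such as `X`. This is likely broader than the rule AlphaFoldDB actually applies, so hits are only candidates to check against the FAQ.

```python
# Hypothetical helper for illustration; not part of magneton or AlphaFoldDB tooling.
STANDARD_AAS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def nonstandard_residues(sequence: str) -> dict[str, int]:
    """Count each residue in `sequence` that is not one of the 20 standard amino acids."""
    counts: dict[str, int] = {}
    for residue in sequence.upper():
        if residue not in STANDARD_AAS:
            counts[residue] = counts.get(residue, 0) + 1
    return counts


# Toy sequences, not real SwissProt entries.
print(nonstandard_residues("MKTAYIAKQR"))   # {}
print(nonstandard_residues("MKXXAYUAKQR"))  # {'X': 2, 'U': 1}
```

An empty result means the sequence passes this particular check; it says nothing about the other exclusion criteria.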

In [3]:
swissprot_pkl_path = "/weka/scratch/weka/kellislab/rcalef/data/interpro/102.0/filter_swissprot/swissprot_prots.pkl.bz2"
cif_tmpl = "/weka/scratch/weka/kellislab/rcalef/data/cif_alphafolddb/AF-%s-F1-model_v4.cif.gz"

num = 0
num_found = 0
missing = []
# Count how many parsed SwissProt proteins have an AlphaFoldDB mmCIF file on disk.
for prot in parse_from_pkl(swissprot_pkl_path, compression="bz2"):
    num += 1
    if os.path.exists(cif_tmpl % prot.uniprot_id):
        num_found += 1
    else:
        # Keep the IDs without a structure for the follow-up checks.
        missing.append(prot.uniprot_id)
print(f"found {num_found} / {num} ({len(missing)} missing)")


found 531822 / 555692 (23870 missing)
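
The loop above calls `os.path.exists` once per protein, which can be slow on a network filesystem. A possible alternative, sketched here under the assumption that every structure file follows the `AF-<id>-F1-model_v4.cif.gz` naming used in `cif_tmpl`, is to list the directory once and test membership in a set. The function names are hypothetical.

```python
import os
import re

# Assumed filename pattern, mirroring the `cif_tmpl` template.
CIF_NAME_RE = re.compile(r"^AF-(?P<uniprot_id>[A-Z0-9]+)-F1-model_v4\.cif\.gz$")


def ids_with_structure(cif_dir: str) -> set[str]:
    """Return the UniProt IDs that have an F1 model file in `cif_dir`."""
    found = set()
    with os.scandir(cif_dir) as entries:
        for entry in entries:
            match = CIF_NAME_RE.match(entry.name)
            if match:
                found.add(match.group("uniprot_id"))
    return found


def split_by_structure(uniprot_ids, cif_dir: str):
    """Partition IDs into (found, missing) lists, preserving input order."""
    available = ids_with_structure(cif_dir)
    found, missing = [], []
    for uid in uniprot_ids:
        (found if uid in available else missing).append(uid)
    return found, missing
```

With this, `split_by_structure((p.uniprot_id for p in prots), cif_dir)` would give the same two groups as the loop, where `prots` and `cif_dir` stand in for the parsed proteins and the structure directory.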


In [4]:
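# Load the FASTA index (.fai) of the 2021_04 release. Its five tab-separated
# columns are: sequence name, length, byte offset, residues per line, and
# bytes per line. Only the name is needed here.
check = (
    pd.read_table(
        "/weka/scratch/weka/kellislab/rcalef/data/uniprot/swissprot_2021_04/uniprot_sprot.fasta.gz.fai",
        names=["uniprot_id", "x1", "x2", "x3", "x4"],
    )
    .assign(
        # Names look like "sp|<accession>|<entry name>"; keep the accession.
        uniprot_id=lambda x: x.uniprot_id.str.split("|").str[1],
    )
)
check.head()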

  uniprot_id   x1    x2  x3  x4
0     Q6GZX4  256   122  60  61
1     Q6GZX3  320   499  60  61
2     Q197F8  458   943  60  61
3     Q197F7  156  1527  60  61
4     Q6GZX2  438  1800  60  61
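
For reference, the index columns can also be given descriptive names. The sketch below parses a single `.fai` line with the standard library, using the standard faidx column order (name, length, offset, linebases, linewidth). The `FaiRecord` type and both functions are hypothetical, and `ENTRY_NAME` is a placeholder for the entry-name part of the header.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str       # FASTA header up to the first whitespace
    length: int     # sequence length in residues
    offset: int     # byte offset of the first residue in the file
    linebases: int  # residues per line
    linewidth: int  # bytes per line, including the newline


def parse_fai_line(line: str) -> FaiRecord:
    """Parse one tab-separated line of a .fai index."""
    name, *numbers = line.rstrip("\n").split("\t")
    return FaiRecord(name, *map(int, numbers))


def accession_from_header(name: str) -> str:
    """Extract the accession from a header like 'sp|Q6GZX4|ENTRY_NAME'."""
    return name.split("|")[1]


rec = parse_fai_line("sp|Q6GZX4|ENTRY_NAME\t256\t122\t60\t61\n")
print(accession_from_header(rec.name), rec.length)  # Q6GZX4 256
```

Under this reading, the `x1` column in the table above is the sequence length.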


In [5]:
missing = pd.Series(missing)

In [6]:
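# Note: despite its name, `not_in_old` holds the missing IDs that ARE present
# in the 2021_04 release, i.e. those excluded for reasons other than release date.
not_in_old = missing[lambda x: x.isin(check.uniprot_id)]
len(not_in_old)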

18054

In [8]:
# Missing IDs absent from the 2021_04 release, i.e. added to SwissProt afterwards.
len(missing) - len(not_in_old)

5816

In [7]:
not_in_old.head()

12    A0A023GS29
15    A0A024B7W1
16    A0A024F910
26    A0A061FLA2
48    A0A075TJ05
dtype: object
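
The split can be easier to follow with explicit names. In the cells above, `not_in_old` holds the missing IDs that are present in the 2021_04 release, and the difference of lengths gives those added afterwards. A minimal stdlib sketch of the same partition, with a hypothetical function name and toy IDs (`NEWID1` and `NEWID2` are made up):

```python
def split_by_release(missing_ids, old_release_ids):
    """Partition missing IDs into (in_old_release, added_later), preserving order."""
    old = set(old_release_ids)
    in_old, added_later = [], []
    for uid in missing_ids:
        (in_old if uid in old else added_later).append(uid)
    return in_old, added_later


# Toy inputs for illustration only.
toy_missing = ["A0A023GS29", "NEWID1", "A0A024B7W1", "NEWID2"]
toy_old_release = ["Q6GZX4", "A0A023GS29", "A0A024B7W1"]

in_old, added_later = split_by_release(toy_missing, toy_old_release)
print(in_old)       # ['A0A023GS29', 'A0A024B7W1']
print(added_later)  # ['NEWID1', 'NEWID2']

# The two groups always account for every missing ID.
assert len(in_old) + len(added_later) == len(toy_missing)
```

On the real data the same identity holds: 18,054 + 5,816 = 23,870.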