<a href="https://colab.research.google.com/github/vprobon/iLIR-ML-data/blob/main/utilities/LIR_in_AFhuman.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LIR_in_AFhuman.ipynb

Reads locally stored AlphaFoldDB entries based on a list of UniProt identifiers. For every segment of the polypeptide chain matching the canonical LIR motif [WFY]xx[VLI], it extracts the pLDDT values for the motif (core) and for the flanking upstream and downstream peptides, each up to 10 residues long.
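
As an illustration of the motif definition above, here is a minimal sketch that scans a one-letter amino acid sequence for [WFY]xx[VLI] cores and slices out the flanking peptides. It is not the code used in this notebook (which works on parsed structures further down); the function name `find_lir_cores`, the `flank` parameter and the toy sequence are assumptions made for this example.

```python
import re

# Lookahead pattern, so that overlapping candidate motifs are all reported
LIR_CORE = re.compile(r"(?=([WFY]..[VLI]))")

def find_lir_cores(sequence, flank=10):
    """Yield (start, end, upstream, core, downstream) for each [WFY]xx[VLI] match.

    start and end are 1-based, inclusive positions of the 4-residue core.
    """
    for match in LIR_CORE.finditer(sequence):
        i = match.start()                            # 0-based index of the aromatic residue
        core = match.group(1)
        upstream = sequence[max(0, i - flank):i]     # truncated at the N-terminus
        downstream = sequence[i + 4:i + 4 + flank]   # truncated at the C-terminus
        yield i + 1, i + 4, upstream, core, downstream

# Toy sequence (made up) with two candidate cores: FEIV and WTTL
hits = list(find_lir_cores("MKDDFEIVAAWTTL", flank=3))
for hit in hits:
    print(hit)
```

Flanks shorter than `flank` residues are simply truncated, which mirrors how motifs near the chain termini are treated in the code below.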

**Tip**: Since the pLDDT values are stored in the column that traditionally holds B-factors in PDB-formatted files, this program can also be used to extract B-factors from PDB files reporting experimental structures.
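
To make the tip concrete, the sketch below reads the temperature-factor field (columns 61-66 of a fixed-width ATOM/HETATM record) directly from a line of text. The notebook itself relies on Biopython for parsing; the helper `bfactor_from_atom_line` and the hand-built record with made-up values are assumptions used only for illustration.

```python
def bfactor_from_atom_line(line):
    """Return the temperature-factor field (columns 61-66) of an ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return float(line[60:66])

# A hand-built record laid out in the fixed-width PDB columns (all values are made up)
example = (
    f"{'ATOM':<6}{1:>5} {' CA ':<4}{' '}{'TRP':>3} {'A'}{42:>4}{' '}   "
    f"{11.104:8.3f}{6.134:8.3f}{-6.504:8.3f}{1.00:6.2f}{87.25:6.2f}"
)

score = bfactor_from_atom_line(example)
print(score)  # pLDDT in an AlphaFold model, B-factor in an experimental structure
```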

**WARNING**: The PDB will gradually phase out PDB-formatted files in favor of mmCIF ones.
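
One possible way to prepare for that transition is to choose the Biopython parser from the file extension. The sketch below is an assumption about how this could be done, not part of the notebook's code: the helper names `parser_name_for` and `load_structure` are hypothetical, and the code further down still reads `.pdb` files only.

```python
import os

def parser_name_for(path):
    """Pick a Bio.PDB parser class name from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in (".cif", ".mmcif"):
        return "MMCIFParser"
    if ext in (".pdb", ".ent"):
        return "PDBParser"
    raise ValueError(f"unsupported structure file: {path}")

def load_structure(path):
    """Parse a PDB or mmCIF file into a Bio.PDB structure (requires biopython)."""
    # Imported lazily so that the extension check above works without biopython
    if parser_name_for(path) == "MMCIFParser":
        from Bio.PDB.MMCIFParser import MMCIFParser as Parser
    else:
        from Bio.PDB.PDBParser import PDBParser as Parser
    parser = Parser(QUIET=True)
    return parser.get_structure(os.path.basename(path), path)

print(parser_name_for("model.cif"))  # hypothetical file name
```

Structures returned by either parser expose per-atom values through the same `atom.bfactor` attribute, so downstream averaging code should not need to change.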

## Enter the path of the folder holding the data

In [None]:
# Change the path to the directory holding the (AF2) structures of interest

AF_dir = "/home/vprobon/AlphaFoldLIRs/AlphaFold_Human_06042022" # Directory in HPC cluster


# Then execute all cells below

## Install and import libraries

In [None]:
!pip install biopython

In [None]:
import pandas as pd


## Now run the actual code

In [None]:


def residueb(residue):
  """Return the mean B-factor (pLDDT in AlphaFold models) over the atoms of a residue."""
  bfactors = [atom.bfactor for atom in residue]
  return sum(bfactors)/len(bfactors)

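def process_LIRs(structure):
  """Print the mean pLDDT upstream of, within, and downstream of each [WFY]xx[VLI] motif."""
  # Keep standard residues only (hetero residues and waters are skipped)
  residues = [r for r in structure.get_residues() if r.get_id()[0] == " "]
  # Index by list position rather than residue number, so that files whose
  # numbering does not start at 1 (or has gaps) are handled correctly.
  # The last three residues cannot start a 4-residue core motif.
  for aroma_id, res in enumerate(residues[:-3]):
    if res.resname in ['TRP', 'TYR', 'PHE']: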

      if residues[aroma_id+3].resname in ['VAL','LEU','ILE']:
        ## Then we have a LIR-motif here
        print(f"{structure.id}\t{aroma_id+1:6d}\t{aroma_id+4:6d}", end="\t")
        allb=[]
        upstreamb=[]
        for i in range (-10,0):
          #print(residues[aroma_id+i].resname, residueb(residues[aroma_id+i]))
          if aroma_id+i < 0:
            continue
          upstreamb.append(residueb(residues[aroma_id+i]))
          allb.append(residueb(residues[aroma_id+i]))
        #print()
        lirb=[]
        for i in range (0,4):
          #print(residues[aroma_id+i].resname, residueb(residues[aroma_id+i]))
          lirb.append(residueb(residues[aroma_id+i]))
          allb.append(residueb(residues[aroma_id+i]))
        #print()
        downstreamb=[]
        for i in range (4,14):
          if aroma_id+i == len(residues):
            break
          #print(residues[aroma_id+i].resname, residueb(residues[aroma_id+i]))
          downstreamb.append(residueb(residues[aroma_id+i]))
          allb.append(residueb(residues[aroma_id+i]))

        if upstreamb: # there is an upstream sequence
          print(f"Up: {sum(upstreamb)/len(upstreamb):.6}", end="\t")
        else:
          print("Up: N/A", end="\t")

        #print("LIR:", sum(lirb)/len(lirb), end = "\t")
        print(f"LIR: {sum(lirb)/len(lirb):.6}", end = "\t")

        if downstreamb: # there is a downstream sequence
          print(f"Down: {sum(downstreamb)/len(downstreamb):.6}", end="\t")
        else:
          print("Down: N/A", end="\t")

        # allb holds the upstream, core and downstream values collected above
        print(f"Overall: {sum(allb)/len(allb):.6}")

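def getAFstructures(AFdir):
  """Parse every .pdb file in AFdir and return the list of Bio.PDB structures."""
  import os
  from Bio.PDB.PDBParser import PDBParser
  parser = PDBParser()
  structs = []
  # Sort the file names so that the output order is reproducible
  for AFfile in sorted(os.listdir(AFdir)):
    if not AFfile.endswith('.pdb'):
      continue
    structure = parser.get_structure(AFfile, os.path.join(AFdir, AFfile))
    structs.append(structure)
  return structs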


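if __name__ == '__main__':

  debug = True # Set to False to process the structures in AF_dir
  if debug:
    structures = getAFstructures('./dummyAF') # For testing and debugging
  else:
    structures = getAFstructures(AF_dir)

  # Report the pLDDT averages for every candidate LIR motif in each structure
  for structure in structures:
    process_LIRs(structure)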
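
The windowing logic above can also be illustrated on a plain list of per-residue scores, without any structure parsing. This is a minimal sketch under stated assumptions: the names `mean` and `window_means` and the toy scores are made up for the example, and the core is identified by a 0-based start index.

```python
def mean(values):
    """Arithmetic mean, or None for an empty window."""
    return sum(values) / len(values) if values else None

def window_means(scores, start, core_len=4, flank=10):
    """Average per-residue scores upstream of, within, and downstream of a core.

    scores: per-residue values (e.g. pLDDT); start: 0-based index of the
    first core residue. Flanks are truncated at the chain termini.
    """
    upstream = scores[max(0, start - flank):start]
    core = scores[start:start + core_len]
    downstream = scores[start + core_len:start + core_len + flank]
    return {
        "up": mean(upstream),
        "core": mean(core),
        "down": mean(downstream),
        "overall": mean(upstream + core + downstream),
    }

# Toy scores (made up): a motif starting at index 2 of an 8-residue chain
toy = [50.0, 60.0, 90.0, 90.0, 80.0, 80.0, 40.0, 20.0]
result = window_means(toy, start=2, flank=3)
print(result)
```

Returning `None` for an empty flank plays the role of the `N/A` printed above when a motif sits at the very start or end of the chain.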


# Acknowledgements

This work has been made possible through a grant awarded to the [Bioinformatics Research Laboratory](https://vprobon.github.io/BRL-UCY) at the [University of Cyprus](https://www.ucy.ac.cy) for the [LIRcentral project](https://lircentral.eu/).

LIRcentral is co-funded by the European Union (European Regional Development Fund, ERDF) and the Republic of Cyprus through the project EXCELLENCE/0421/0576 under the EXCELLENCE HUBS programme of the [Cyprus Research and Innovation Foundation](https://research.org.cy).

![picture](https://lircentral.eu/images/LIRcentral-FundedBy.png)


For the development of iLIR-ML-v0.9, a number of publicly available resources were/are used.

- Machine learning modules are based on the excellent [scikit-learn](https://scikit-learn.org/) Python toolkit.

- For the creation of features representing candidate LIR motifs for predictions, the following tools/resources are instrumental:

> - The [MobiDB database](https://mobidb.bio.unipd.it/) (Piovesan et al., 2020) provides precomputed intrinsic disorder predictions based on the AlphaFold-disorder method (Piovesan et al., 2022) for select UniProt entries.
> - The pLIRm software (freely available online at [GitHub](https://github.com/BioCUCKOO/pLIRm-pLAM)), which we have tailored to our pipeline for computing the pLIRm score as an additional predictive feature for LIR motifs. We are indebted to the authors for sharing their work.
> - The 'legacy' PSI-BLAST-derived PSSMs from previous work in our lab (Kalvari et al., 2014), ported to Python by undergraduate student Dimitris Kalanides.
> - Newly derived PSSMs (LIRcentral-PSSMs) are based on the more recently updated version of the LIRcentral database (Chatzichristofi et al., 2023).


Last, but not least, a huge amount of work has been carried out by official and unofficial members of the LIRcentral team, who developed tools for assisting LIRcentral biocuration, curated LIRcentral entries from the published literature, and explored properties of the LIRcentral data. In addition, we are grateful to several experts in autophagy who have provided feedback on existing LIRcentral entries and suggestions for adding new instances of LIR motifs to the database. We intend to keep LIRcentral, its data, and the software tools derived from analysing these data freely available to the research community. We hope this work inspires and helps others to work on this or similar problems.


