# Alphafold models analysis main program

## Description of the materials and program

### Introduction

<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;">
# DataFrame Column Descriptions

- **Ele**: Electrostatic energy of the complex. Measures the interaction between electric charges within the complex.
- **Desolv**: Desolvation energy. Represents the energetic cost associated with desolvating individual molecules to form the complex.
- **VDW**: Van der Waals energy. Measures the attractive and repulsive interactions between atoms that are not chemically bonded.
- **Total**: Total energy of the complex. Weighted sum of the energetic contributions (Electrostatic + Desolvation + 0.1 × Van der Waals).
- **Name**: Name of the object or element. Used to identify and merge data from different datasets.
- **PATH**: File path associated with the object. Stores the locations of the files corresponding to each object for additional input/output operations.
- **Complex**: Name or identifier of the studied complex.
- **State**: State of the complex (e.g., native, mutated, etc.).
- **Model**: Specific model used in the analysis.
- **Rank**: Ranking of the model or complex based on a specific criterion.
- **Version**: Version of the model or software used in the analysis.
- **Recycle**: Number of times the model has been recycled or reused in iterations.
- **Seed**: Seed value used by AlphaFold2.
- **Unstructured_count**: Number of unstructured regions in the complex.
- **Max_unstructured_region**: Size of the largest unstructured region.
- **Total_clashes**: Total number of atomic clashes within the complex.
- **Clashes_chain_A**: Number of clashes in chain A.
- **Clashes_chain_B**: Number of clashes in chain B. _There may be more chains._
- **Low_B_factors_chain_A**: Percentage of residues with pLDDT below 50 in chain A.
- **Low_B_factors_chain_B**: Percentage of residues with pLDDT below 50 in chain B. _There may be more chains._
- **Knots**: Number of knots present in the structure.
- **pLDDT**: Predicted Local Distance Difference Test. Measures the quality of the local structural prediction.
- **pTM**: Predicted TM-score (template modeling score). Estimates the quality of the global structural prediction.
- **ipTM**: Interface predicted TM-score. Estimates the quality of the structural prediction at the interfaces.
- **tol**: Tolerance of the model or simulation.
- **Model_confidence**: Confidence in the predictive model. Calculated as ipTM\*0.8 + pTM\*0.2.
- **Total2**: Unweighted total energy from pyDock (Electrostatic + Desolvation + Van der Waals).
- **MCZ-Score**: Model Confidence Z-score.
- **PLDDTZ-Score**: pLDDT Z-score.
- **TEZ-Score**: Z-score calculated from Total.
- **TE2Z-Score**: Z-score calculated from Total2.
- **Sum_Z**: Sum of the Z-scores for Model Confidence and Total.
- **Sum2_Z**: Sum of the Z-scores for Model Confidence and Total2.
- **Z-PLT**: Sum of the Z-scores for pLDDT and Total.
- **Z-PLT2**: Sum of the Z-scores for pLDDT and Total2.
- **Ranking_Z**: Ranking based on Sum_Z.
- **Ranking2_Z**: Ranking based on Sum2_Z.
- **Ranking_PLT**: Ranking based on the Z-PLT criterion.
- **Ranking_PLT2**: Ranking based on the Z-PLT2 criterion.
- **Diferencia_R2_Z**: Difference between the current ranking and the next in the Ranking2_Z column. Indicates the cluster size.
</div>
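
The derived columns above can be illustrated with a minimal sketch. It assumes the weights listed in the column descriptions (0.1 for Van der Waals in Total, 1.0 in Total2) and the 0.8 and 0.2 weights of the model confidence formula. The function names, the example values and the use of the population standard deviation for the Z-scores are assumptions made for illustration; the sign convention used to combine Z-scores into Sum_Z is not specified here, so it is left out.

```python
from statistics import mean, pstdev

def total_energy(ele, desolv, vdw, vdw_weight=0.1):
    """Weighted total energy: Ele + Desolv + vdw_weight * VDW."""
    return ele + desolv + vdw_weight * vdw

def model_confidence(iptm, ptm):
    """Model confidence as 0.8 * ipTM + 0.2 * pTM."""
    return 0.8 * iptm + 0.2 * ptm

def z_scores(values):
    """Z-score of each value (population standard deviation is an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical models, for illustration only
models = [
    {"Name": "model_a", "Ele": -20.0, "Desolv": -5.0, "VDW": -50.0, "ipTM": 0.80, "pTM": 0.85},
    {"Name": "model_b", "Ele": -10.0, "Desolv": 2.0, "VDW": -30.0, "ipTM": 0.60, "pTM": 0.70},
    {"Name": "model_c", "Ele": -15.0, "Desolv": -1.0, "VDW": -40.0, "ipTM": 0.70, "pTM": 0.75},
]

for m in models:
    m["Total"] = total_energy(m["Ele"], m["Desolv"], m["VDW"])
    m["Total2"] = total_energy(m["Ele"], m["Desolv"], m["VDW"], vdw_weight=1.0)
    m["Model_confidence"] = model_confidence(m["ipTM"], m["pTM"])

mc_z = z_scores([m["Model_confidence"] for m in models])
te_z = z_scores([m["Total"] for m in models])
for m, mz, tz in zip(models, mc_z, te_z):
    m["MCZ-Score"] = mz
    m["TEZ-Score"] = tz
    print(m["Name"], round(m["Total"], 2), round(m["Model_confidence"], 3), round(mz, 2), round(tz, 2))
```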



<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;">
   
This Jupyter notebook performs an analysis of complexes generated by different versions of AlphaFold. The versions of AlphaFold considered are:
- AlphaFold2: the standard version developed by Google DeepMind
- AlphaFold multimer v1 (v1): a variation of AlphaFold2 designed to model complexes more accurately
- Alphafold multimer v2 (v2):
- Alphafold multimer v3 (v3):

- RMSD: stands for Root Mean Square Deviation, a measure used in structural biology to assess the similarity or deviation between two or more protein or molecular structures. It quantifies the average distance between the corresponding atoms of two superimposed structures, so it works as an indication of similarity: the lower the RMSD, the more similar the two structures. This is essential to see whether AM (AlphaFold-Multimer) is able to reproduce the structure of the crystal deposited in the PDB. There are different types of RMSD...

- Total energy: Pydock4

- Model confidence: indicates the reliability of the generated model. When AlphaFold generates a model, it also assigns a predicted local distance difference test score (pLDDT, corresponding to local structural accuracy), a predicted TM-score (pTM, corresponding to overall topological accuracy), and an interface pTM score (ipTM), which is used in conjunction with pTM to compute model scores. The model confidence is calculated as $0.8 \cdot iptm + 0.2 \cdot ptm$. AlphaFold also reports tol, which is used to decide whether the program should perform another recycle depending on the previous iteration: if the model does not improve much, it stops and relaxes the structure using the AMBER force field.

</div>
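
As a worked illustration of the RMSD definition above, $\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert a_i - b_i \rVert^2}$, the following minimal sketch computes it for two lists of matched atom coordinates. It assumes the structures are already superimposed and that atoms are paired by position in the lists; the function name and the coordinates are hypothetical, and this is not the CAPRI-style calculation that pydock4 performs.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the two structures are already superimposed and that
    atom i in coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("Both coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Hypothetical coordinates: the second atom is displaced by 2 units along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(reference, model))
```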


### Description of the files and folders

#### Complex folders


<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;">
The folders in which the rest of the files are stored are named after the complex: the name of the crystal in the PDB followed by the chains used to build the complex.

#### PDBS files

<div style="font-family: Arial, sans-serif; line-height: 1; text-align: justify;">

PDB files store the protein structure. The names of the PDBs generated by AlphaFold are composed of: complex, state, rank, version of AlphaFold, model and recycle (except for crystals, "ranked" PDBs and Seed_0).<br><br>

- Complex: the name of the complex registered in the PDB, composed of letters and numbers.<br><br>
- States:
  - unrelaxed: crude structures produced by AlphaFold during its iterative process.
  
  - relaxed: the last recycled structure, relaxed using the AMBER force field. <br><br>

- Version: the versions described in the introduction (v1, v2, v3); more may be added in the future.<br><br>


- Model:

  - Models in AlphaFold2: five predictions are generated from the same seed; they are named "model_" followed by a number.
  

    - "ranked_" followed by a number: indicates the position of the relaxed model in the ranking according to the scores that AlphaFold assigns. The name consists only of "ranked" and the number; it carries no further information.


    - "pred_" followed by a number: identifies a model generated by the same seed, but with minor differences.<br><br>
    
  
  - Models in the AM versions (v1, v2, v3, v3_short): five models are generated from different seeds, and the structure is then refined iteratively until the tol variable reaches the threshold at which AlphaFold stops modeling or recycle 20 is reached.<br><br>

- Recycle: only for non-AlphaFold2 predictions (at the moment).
  
  - "r_" followed by number : indicates the recycle of the model.


  - "Seed_0": the same structure as the one from recycle 20, which will be relaxed.<br><br>
  
- Rank followed by a number: indicates which of the five generated models is best according to the highest score obtained in the last recycle; only in AlphaFold2.<br><br>


- Examples of names:


  - unrelaxed_rank_001_alphafold2_multimer_v2_model_4_seed_000_r9.pdb (standard name in AM versions).


  - relaxed_model_4_multimer_v2_pred_1.pdb (standard name in AlphaFold2 versions).


  - 3BT1.pdb (crystal).


  - ranked_0.pdb (relaxed and ranked in Alphafold2).

  
  - unrelaxed_rank_001_alphafold2_multimer_v3_model_2_seed_000_r0 (Seed_0 example).

   
</div>
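
The naming scheme described above can be sketched as a small parser. The regular expressions, the function name `parse_model_name` and the returned keys are assumptions made for illustration; they are not necessarily the patterns the notebook uses, and they are only checked against the example names listed above.

```python
import re

# Illustrative patterns (assumptions), one per field of the file name
STATE_RE = re.compile(r"^(unrelaxed|relaxed)_")
RANK_RE = re.compile(r"rank_\d+|pred_\d+|ranked_\d+")
VERSION_RE = re.compile(r"multimer_(v\d+)")
MODEL_RE = re.compile(r"model_(\d+)")
SEED_RE = re.compile(r"seed_(\d+)")
RECYCLE_RE = re.compile(r"[_.]r(\d+)$")

def _find(pattern, text, group=1):
    """Return the requested group of the first match, or None."""
    match = pattern.search(text)
    return match.group(group) if match else None

def parse_model_name(filename):
    """Split a PDB file name into the fields encoded in it."""
    stem = filename[:-4] if filename.endswith(".pdb") else filename
    return {
        "state": _find(STATE_RE, stem),
        "rank": _find(RANK_RE, stem, group=0),
        "version": _find(VERSION_RE, stem),
        "model": _find(MODEL_RE, stem),
        "seed": _find(SEED_RE, stem),
        "recycle": _find(RECYCLE_RE, stem),
    }

for name in [
    "unrelaxed_rank_001_alphafold2_multimer_v2_model_4_seed_000_r9.pdb",
    "relaxed_model_4_multimer_v2_pred_1.pdb",
    "3BT1.pdb",
    "ranked_0.pdb",
]:
    print(name, parse_model_name(name))
```

Fields that are not encoded in a name, as happens for crystals and "ranked" files, come back as `None`, so the caller can decide how to label those exceptions.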


#### Json files

<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;">
   

They contain information regarding the model confidence and the time used to create the model.


</div>

#### Log.txt files

<div style="font-family: Arial, sans-serif; line-height: 1; text-align: justify;">
   
It gathers information about the execution of AlphaFold. The most relevant information is:

- Timestamps: The file starts with timestamps indicating when each event occurred. These timestamps are in the format "YYYY-MM-DD HH:MM:SS,sss" (Year, Month, Day, Hour, Minute, Second, Milliseconds).

- Information about the software: The first few entries provide information about the software version (ColabFold 1.5.2).

- Recycle iterations: The log then proceeds to provide information about the iterative process of protein structure prediction. It mentions recycling and various metrics such as "pLDDT," "pTM," "ipTM," and "tol" for each recycle step.

- Model ranking: The final section ranks the models based on the "multimer" metric, and it mentions the relaxation times for each model.
</div>
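
Extracting the per-recycle metrics from a log line can be sketched as follows. The exact line layout (timestamp, model name, then `key=value` pairs) and the sample line are assumptions based on the description above, not a verbatim excerpt of a real log.txt; the function name is hypothetical.

```python
import re

# Timestamp in the "YYYY-MM-DD HH:MM:SS,sss" format, followed by the message
LINE_RE = re.compile(r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?P<message>.*)$")
# key=value pairs for the metrics reported at each recycle
METRIC_RE = re.compile(r"\b(recycle|pLDDT|pTM|ipTM|tol)=([-+]?\d+(?:\.\d+)?)")

def parse_recycle_line(line):
    """Return the metrics of a recycle line as a dict, or None for other lines."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    message = match.group("message")
    metrics = {key: float(value) for key, value in METRIC_RE.findall(message)}
    if "recycle" not in metrics:
        return None
    metrics["recycle"] = int(metrics["recycle"])
    metrics["model"] = message.split()[0]
    metrics["timestamp"] = match.group("timestamp")
    return metrics

# Hypothetical log line, for illustration only
sample = ("2023-05-10 09:15:42,123 alphafold2_multimer_v3_model_1_seed_000 "
          "recycle=3 pLDDT=88.4 pTM=0.81 ipTM=0.77 tol=0.42")
print(parse_recycle_line(sample))
```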

### Description of the program

<div class="alert alert-block alert-warning">
<b>Note:</b> The notebook is built to gather information from the standardized names and the exceptions mentioned (see examples). If the names of the folders, the PDBs, log.txt or the outputs of pydock4 are severely changed, this notebook may not work as intended.
</div>

<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;">
   
The analysis in the main program is divided into the following sections:

1. Libraries and initial values: loads the required libraries and gathers the folder names to iterate later folder by folder. 
2. Running pydock4: calculates the RMSD according to CAPRI and the binding energy.
3. Construction of the dataframes: the outputs of pydock4 have to be cleaned to transform them into a dataframe. The information in log.txt is also retrieved into a dataframe.  
4. Final fusion and adjustments: due to some peculiarities of the versions, some data have to be renamed, modified and/or removed to harmonize the data.

There are two classes of folders: some contain the PDBs from AlphaFold2 and the others are obtained from AlphaFold-Multimer. The difference between them is how the model confidence is stored: the folders from AlphaFold2 store it in JSON files and the ones from AM store it in log.txt. This implies a different approach to gathering these data.
</div>
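
The split between the two classes of folders can be sketched as below. The folder scan later in the notebook treats any folder whose name contains "am" as an AM (ColabFold) run, so the same marker is used here; the function name and the folder names are hypothetical.

```python
def split_folders_by_source(folder_names, marker="am"):
    """Group folder names by the file type that holds their model confidence.

    Folders whose name contains `marker` are treated as AM runs (log.txt);
    the rest are treated as AlphaFold2 runs (JSON files).
    """
    sources = {"log.txt": [], "json": []}
    for name in folder_names:
        key = "log.txt" if marker in name else "json"
        sources[key].append(name)
    return sources

# Hypothetical folder names, for illustration only
folders = ["3BT1_AB_am_v3", "3BT1_AB_af2"]
print(split_folders_by_source(folders))
```

A plain substring test also matches folders that contain "am" by accident, so a stricter marker may be preferable if folder names are not fully controlled.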


## Main program

### 0. Paths and selected molecules

In [None]:
# Directories
Target_name="T309"
directorio="/home/luis/CAPRI_R57/"+Target_name+"/Predictors/AF_MODELS/COMPLEX/"
directorio_csv= "/home/luis/CAPRI_R57/"+Target_name+"/Predictors/AF_MODELS/COMPLEX/" # Directory of the folder that will gather the outputs
to_send_dir="/home/luis/CAPRI_R57/"+Target_name+"/Predictors/To_send/"
to_send_csv ="/home/luis/CAPRI_R57/"+Target_name+"/Predictors/To_send/"+Target_name+"_predictor_to_send.ene"

# Target_name="T254"
# directorio="/home/luis/CAPRI_R57/T254/Predictors/Superposition_models_T255_new/"
# directorio_csv= "//home/luis/CAPRI_R57/T254/Predictors/Superposition_models_T255_new/"# This is the the directory of the folder that will gather the outputs

# Target_name="T272"
# directorio="/home/luis/CAPRI_R57/T272/Predictors/SUPERPOSITION_MODELS/"
# directorio_csv= "/home/luis/CAPRI_R57/T272/Predictors/SUPERPOSITION_MODELS/"# This is the the directory of the folder that will gather the outputs

#Clustering
#receptor_mol,ligand_mol =["A","B"] #T236
#receptor_mol,ligand_mol =["A,B","C"] #T238
#receptor_mol,ligand_mol =["A,B,C,D"],["E"] #240
#receptor_mol,ligand_mol =["A"],["B"] #T242
#receptor_mol,ligand_mol =["A"],["B"] #T244
#receptor_mol,ligand_mol =["A"],["B"] #T248
#receptor_mol,ligand_mol =["A"],["B"] #T248
#receptor_mol,ligand_mol =["A","B"],["C"] #T250/T252
#receptor_mol,ligand_mol =["A"],["B"] #T254/T255 due to symmetry only two chains are taken
#receptor_mol,ligand_mol =["A"],["B"] #T262 due to symmetry only two chains are taken
# receptor_mol,ligand_mol =["B,C"],["A"] #T266 Antibody
# receptor_mol,ligand_mol =["B"],["I","K"] #T264-T265 Protein_DNA
# receptor_mol,ligand_mol =["A","C,E"],["B","K,F"] #T280
#receptor_mol,ligand_mol =["B,C"],["A"] #T284
#receptor_mol,ligand_mol =["B,C"],["A"], ["A,C"],["B"], ["B,A"],["C"] # T288
#receptor_mol,ligand_mol =# T290

receptor_mol,ligand_mol =["A,B,C"],["D,E,F"] #T290
#receptor_mol,ligand_mol =["A,B,C"],["D,E,F,G,H,I"] #T292

#receptor_mol,ligand_mol =["B,C"],["A"] #T266 Antibody
#receptor_mol,ligand_mol =["B","C","D","E","F","G"],["A"] #T272 Antibody


print(directorio)

### 0.1. Model Relaxation OpenMM

In [None]:
#%%bash -s "$directorio/../" "$Target_name"
#cd $1
#for i in `find  -name ''${2}'_*.r*.pdb'`;do echo "python relax_v2.py -model_name $i -output_model_name $i";done  > relax_greasy.txt
#export GREASY_NWORKERS=1
#greasy relax_greasy.txt 


### 1. Libraries and initial values

<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;"> The following libraries are used for data handling and plotting:</div>


In [None]:
# File management
import os, zipfile 
import re 
import shutil

# Data management
import pandas as pd # used to manage dataframes
import numpy as np
from itertools import product
from Bio import PDB
from Bio.PDB import MMCIFParser, PDBIO, DSSP, NeighborSearch,Superimposer,PDBParser
from Bio.Align import PairwiseAligner
from scipy.spatial.transform import Rotation as R
from concurrent.futures import ProcessPoolExecutor, as_completed
import warnings
# Subprocess for calling bash
import subprocess # used to call bash and run external programs like pydock4

In [None]:
assert False, "Stopping execution here."

### 1.1 Preprocess AlphaFold3

In [None]:
# Uncompress the AlphaFold3 job.
def descomprimir_archivo(zip_path, directorio_destino):
    """
    Extract a ZIP archive into the given directory.

    Parameters:
    zip_path (str): Path to the ZIP archive.
    directorio_destino (str): Directory where the extracted files will be placed.
    """
    # Make sure the destination directory exists; create it if it does not
    os.makedirs(directorio_destino, exist_ok=True)

    # Open the ZIP archive in read mode
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # Extract all the files into the destination directory
        zip_ref.extractall(directorio_destino)

patron = r'(.*(\d+)\.zip$)'
patron_CIF = r'(.*(\d+)\.cif$)'

zipfiles = [os.path.abspath(os.path.join(directorio, archivo)) for archivo in os.listdir(directorio) if re.match(patron, archivo)]
print(zipfiles)
CIF_files = []
for zipfille in zipfiles:
    print(zipfille)
    # Drop the ".zip" suffix explicitly: str.rstrip('.zip') strips a set of characters, not a suffix
    carpeta_zip = os.path.splitext(zipfille)[0]
    descomprimir_archivo(zipfille, carpeta_zip)
    # Collect the CIF files of every extracted job, with paths inside its own folder
    CIF_files.extend(os.path.join(carpeta_zip, archivo) for archivo in os.listdir(carpeta_zip) if re.match(patron_CIF, archivo))


In [None]:
#Convert the CIF files of the AlphaFold3 job to PDB files.
def convert_cif_to_pdb(cif_file, pdb_file):
    """
    Convert a CIF file to a PDB file using Biopython.

    Parameters:
    cif_file (str): Path to the input CIF file.
    pdb_file (str): Path to the output PDB file.
    """
    parser = MMCIFParser()
    structure = parser.get_structure('ID', cif_file)
    io = PDBIO()
    io.set_structure(structure)
    io.save(pdb_file)

patron_CIF = r'(.*(\d+)\.cif$)'
for zipfille in zipfiles:
    CIF_files = [os.path.join(zipfille.rstrip('.zip'),archivo) for archivo in os.listdir(zipfille.rstrip('.zip')) if re.match(patron_CIF, archivo)]
    #print(CIF_files)
    for CIF_file in CIF_files:
        #print(CIF_file)
        convert_cif_to_pdb(CIF_file, CIF_file.replace('.cif','.pdb'))


In [None]:
#Read the JSON files
import os
import json

def leer_json_extract_vars(directorio, claves):
    """
    Read the JSON files in a directory and extract the requested variables.
    
    Parameters:
    directorio (str): Path to the directory containing the JSON files.
    claves (list): Keys to extract from the JSON files.
    
    Returns:
    dict: File names mapped to their extracted variables.
    """
    resultados = {}  # Dictionary that stores the results

    # Go through all the files in the directory
    pattern_json = re.compile(r"summary_confidences_\w\.json$")
    for archivo in os.listdir(directorio):
        if pattern_json.search(archivo):  # Keep only the summary-confidence JSON files
            ruta_completa = os.path.join(directorio, archivo)
            with open(ruta_completa, 'r') as f:
                data = json.load(f)  # Load the JSON content
            # Extract the requested variables
            valores_extraidos = {clave: data.get(clave, None) for clave in claves}

            # Store the results
            resultados[archivo] = valores_extraidos

    return resultados

# Use the function
claves_a_extraer = ['ptm', 'iptm']  # Add here any other key you need
for zipfille in zipfiles:
    # Folder where the job was extracted (the archive path without ".zip")
    log_folder = os.path.splitext(zipfille)[0]
    resultados = leer_json_extract_vars(log_folder, claves_a_extraer)
    with open(os.path.join(log_folder,'log.txt'), 'w') as file:
        for archivo, valores in resultados.items():
            # Format the file name and drop the unwanted parts
            nombre_archivo_formateado = os.path.splitext(archivo.replace('summary_confidences', 'model'))[0]
            # Build a string of key=value pairs
            vars_text = ' '.join([f"{key.replace('tm','TM')}={value}" for key, value in valores.items()])
            file.write(f"{nombre_archivo_formateado} {vars_text}\n")

In [None]:
1050/3

In [None]:
assert False, "Stopping execution here."

<div class="alert alert-block alert-warning">
<b>Note:</b>   Calculate the ByEnergy before the following step.
   Now we gather the working directories where the files are located, then generate a list with the paths of the files to iterate over and automate the analysis.
</div>

In [None]:
if not os.path.exists(directorio_csv):
    os.makedirs(directorio_csv)

# Folders of all models
carpetas = [nombre for nombre in os.listdir(directorio) if os.path.isdir(os.path.join(directorio, nombre))]

#PDB files of the folders and the way we will 
archivos_pdb=[]
patron = r'(.*(\d+)\.pdb$)'
#patron = r'fold_t\d+_b[a-z]+_a_\d+_model_\d+_supeimp\.pdb' # T272
datos_carpeta={}
for carpeta in carpetas:  
        patron ="("+carpeta[0:4]+ ".pdb)|" +patron
        direccion = directorio + "/" + carpeta + "/"
        pdbs=[os.path.abspath(os.path.join(direccion, archivo)) for archivo in os.listdir(direccion) if re.match(patron, archivo)]
        datos_carpeta={**datos_carpeta,**{carpeta:len(pdbs)}}
        archivos_pdb.extend(pdbs)
      
# Folders run by colab_alphafold
carpetas_colab=[] 
for carpeta in carpetas:
    logic_fold= carpeta.count("am")
    if(logic_fold>0):
        carpetas_colab.append(carpeta)

# PDB files of the folders run with colabfold
archivos_colab=[]
for carpeta in carpetas_colab:  
        # Point to the current folder; otherwise the last folder of the previous loop would be listed every time
        direccion = directorio + "/" + carpeta + "/"
        pdbs=[os.path.abspath(os.path.join(direccion, archivo)) for archivo in os.listdir(direccion) if re.match(patron, archivo)]
        narchivos=len(pdbs)
        archivos_colab.extend(pdbs)

# Names of the crystals
cristales= [nombre[0:4] for nombre in os.listdir(directorio) if os.path.isdir(os.path.join(directorio, nombre))]
cristales= list(set(cristales))

patron_cristal= ".pdb)|(".join(cristales)
patron_cristal= "("+patron_cristal+".pdb)"
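
When a pattern is assembled from file names, as done above for the crystals, the dot in ".pdb" is a regex wildcard unless it is escaped. A minimal sketch of a stricter variant, assuming four-character crystal identifiers as in the examples (the function name is hypothetical):

```python
import re

def build_crystal_pattern(crystal_ids):
    """Compile a pattern that matches exactly '<id>.pdb' for the given ids."""
    alternatives = "|".join(re.escape(f"{cid}.pdb") for cid in sorted(set(crystal_ids)))
    # Anchors prevent partial matches such as '3BT1.pdb.bak'
    return re.compile(rf"^(?:{alternatives})$")

pattern = build_crystal_pattern(["3BT1", "1ABC", "3BT1"])
print(bool(pattern.match("3BT1.pdb")))   # crystal file
print(bool(pattern.match("3BT1xpdb")))   # rejected because the dot is escaped
```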


<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;">
   The following block shows how many pdb models are in the folder
   
   </div>


In [None]:
def numero_pdbs_by_dir(directorio):  
    x=1
    n_archivos=0
    for carpeta in carpetas:
        # Access each folder and list its .pdb files
        direccion = directorio + "/" + carpeta + "/"
        archivos_pdb = [archivo for archivo in os.listdir(direccion) if re.match(patron, archivo)]
        print(x,carpeta,len(archivos_pdb))
        n_archivos=n_archivos+len(archivos_pdb) 
        x=x+1
    return (n_archivos)
print(numero_pdbs_by_dir(directorio))

### 3. Data Frame creation

<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;">
In this section the information from the ene and CAPRI files is retrieved into their respective dataframes.	
	<br>


1. The CAPRI dataframe consists of:
    - Name: name of the pdb
    - Complex: name of the complex
    - State: relaxed or unrelaxed
    - Version: version of alphafold used.
    - Model: name of the model
    - Rank: rank according to model confidence
    - Recycle: recycle of the model
    - Conf
    - l_RMSD
    - i_RMSD
    - FNC
    - CAPRI	
    <br>	<br>		
2. The Bind Energy dataframe consists of:
    - Name: name of the pdb
    - Complex: name of the complex
    - State: relaxed or unrelaxed
    - Version: version of alphafold used.
    - Model: name of the model
    - Rank: rank according to model confidence
    - Recycle: recycle of the model
    - Conf	
    - Ele
    - Desolv
    - VDW
    - Total
    - RMSD
    - RANK
3. The log.txt dataframe consists of:
    - Complex: name of the complex
    - State: relaxed or unrelaxed
    - Version: version of alphafold used.
    - Model: name of the model
    - Rank: rank according to model confidence
    - Recycle: recycle of the model
    - pLDDT:
    - pTM:
    - ipTM:
    - tol:
    - Model_confidence:

</div>

#### 3.2 Bind energy dataframe

<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;"> First, calculate the energy, gathering the information from the .ene files.</div>

<div style="border: 1px solid red; padding: 10px; background-color: #ffdddd;">
  <strong>CODE UPDATE:</strong> If <code>./sum_ene_multy_bindEy_new.sh</code> does not return the header, the table is now built taking the first row. The issue was detected in target <code>T284</code>, where the sum only produced an ene file without headers.
</div>


Update 2: The code that Luis originally wrote has been commented out; it was too verbose.
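
The header fallback described in the first update can be sketched with the standard library alone. The column names come from the commented-out code below; the file layout (an optional header line, an optional dashed separator line, then whitespace-separated rows), the function name and the sample rows are assumptions made for illustration.

```python
import io

# Column names used when the file has no header
ENE_COLUMNS = ["Conf", "Ele", "Desolv", "VDW", "Total", "RANK"]

def read_ene(handle, columns=ENE_COLUMNS):
    """Read an .ene table into a list of dicts, with or without a header line."""
    rows = []
    for line in handle:
        stripped = line.strip()
        # Skip blank lines and dashed separator lines
        if not stripped or set(stripped) <= {"-", " "}:
            continue
        rows.append(stripped.split())
    if not rows:
        return []
    if rows[0][0] == columns[0]:
        # The first row is the header
        header, data = rows[0], rows[1:]
    else:
        # No header: fall back to the default column names and keep the first row as data
        header, data = list(columns), rows
    return [dict(zip(header, row)) for row in data]

# Hypothetical file contents, for illustration only
with_header = "Conf Ele Desolv VDW Total RANK\n------\n1 -10.0 -2.0 -30.0 -15.0 1\n"
without_header = "1 -10.0 -2.0 -30.0 -15.0 1\n"
print(read_ene(io.StringIO(with_header)))
print(read_ene(io.StringIO(without_header)))
```

Values are kept as strings in this sketch; converting them to numbers is left to the caller.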

In [None]:
# # Initialize an empty DataFrame to store the final results
# total_df=pd.DataFrame()
# resultado_df = pd.DataFrame()
# extension_final = len(".ene")
# patron = r".*\d\.ene$"
# #patron  = r'fold_t\d+_b[a-z]+_a_\d+_model_\d+_supeimp\.ene'
# #print(carpetas)
# for carpeta in carpetas:
#     direccion = directorio + "/" + carpeta + "/"
#     archivos_ene = [archivo for archivo in os.listdir(direccion) if re.match(patron, archivo)]
#     resultado_df = pd.DataFrame()
    
#     for archivo in archivos_ene:
#         print(os.path.join(direccion, archivo))
#         # Initialize the table as an empty list
#         tabla = []
        
#         # Read the file and process each line
#         with open(os.path.join(direccion, archivo), 'r') as file:
#             # Read the first line as the column names
#             column_names = file.readline().strip().split()
                        
#             lineas = file.readlines()
#             num_lineas = len(lineas)
#             # Make sure there is an additional column in the header
#             if len(column_names) < 5:
#                 column_names.append("ColumnaVacia")

#             # Skip the second line
#             file.readline()

#             # Add the column names to the table
#             tabla.append(column_names)

#             # Read and add the values from the third line onwards
#             for linea in file:
#                 valores = linea.strip().split()
#                 # Make sure there is an additional column in the data
#                 if len(valores) < 5:
#                     valores.append(np.nan)
#                 tabla.append(valores)
#         print(num_lineas)
#         if num_lineas==0:
#             df = pd.DataFrame([tabla[0]], columns=["Conf","Ele","Desolv", "VDW","Total","RANK"])
#         else:
#             df = pd.DataFrame(tabla[1:], columns=tabla[0])
#         df["Name"] = archivo[:-extension_final]
#         df["PATH"] = os.path.join(direccion, archivo).rstrip(".ene")+".pdb"
#         print(df)
#         # Concatenate the current DataFrame with resultado_df
      
#         resultado_df = pd.concat([resultado_df, df], ignore_index=True)        
#     if carpeta.startswith('fold'):
#         resultado_df["Complex"]=carpeta.split('_')[1].upper()
#     else:
#         resultado_df["Complex"]=carpeta[0:4]
#     total_df=pd.concat([total_df,resultado_df], ignore_index=True)
#     # print(carpeta)

# total_df.to_csv(directorio_csv + "pydock4_raw.csv", index=False)

In [None]:
# Initialize empty DataFrames to store the final results
total_df=pd.DataFrame()
resultado_df = pd.DataFrame()
extension_final = len(".ene")
patron = r".*\d\.ene$"
#patron  = r'fold_t\d+_b[a-z]+_a_\d+_model_\d+_supeimp\.ene'
#print(carpetas)
for carpeta in carpetas:
    direccion = os.path.join(directorio, carpeta )
    archivos_ene = [archivo for archivo in os.listdir(direccion) if re.match(patron, archivo)]
    resultado_df = pd.DataFrame()
    
    for archivo in archivos_ene:
        ruta_ene = os.path.join(direccion, archivo)
        print(ruta_ene)
        # Read the whitespace-separated table, skipping the second line of the file
        df = pd.read_csv(ruta_ene, sep=r'\s+', skiprows=[1])
        df["Name"] = archivo[:-extension_final]
        # Replace the ".ene" suffix with ".pdb" (str.rstrip strips a set of characters, not a suffix)
        df["PATH"] = ruta_ene[:-extension_final] + ".pdb"
        print(df)
        # Concatenate the current DataFrame with resultado_df
        resultado_df = pd.concat([resultado_df, df], ignore_index=True)        
    if carpeta.startswith('fold'):
        resultado_df["Complex"]=carpeta.split('_')[1].upper()
    else:
        resultado_df["Complex"]=carpeta[0:4]
    total_df=pd.concat([total_df,resultado_df], ignore_index=True)
    # print(carpeta)

total_df.to_csv(directorio_csv + "pydock4_raw.csv", index=False)

In [None]:
total_df

<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;"> Assignment of the data encoded in the name of the model: state, model, rank, version and recycle  </div>

In [None]:
print (directorio_csv)
df = pd.read_csv(directorio_csv + "pydock4_raw.csv", sep=r'\t|,', engine='python')  # a regex separator needs the python engine
print (df)

In [None]:
# Loading the dataframe (a regex separator needs the python engine)
df = pd.read_csv(directorio_csv + "pydock4_raw.csv", sep=r'\t|,', engine='python')

# Information to retrieve from the file names
state_pattern = re.compile(r'unrelaxed')
version_pattern = re.compile(r"((deepfold|alphafold2_multimer)_v\d+)_model")
model_pattern = re.compile(r'model_(\d+)')
rank_pattern = re.compile(r'(rank_(\d+))|(pred_\d+)|(ranked_.*)')
# The recycle is written as "_r<number>" or ".r<number>"; the separator is matched literally
recycle_pattern = re.compile(r'[_.]r(\d+)')
#seed_pattern = re.compile(r'seed_([0-9]+)\.')
#seed_pattern = re.compile(r'seed_([\d]+)\.')
seed_pattern = re.compile(r'seed_([0-9]+)(?:\.|$)')

# Defining the empty lists where the data from the file name will be gathered
# Defining empty list where the data from the file name will be gather
state=[]
model=[]
version = []
recycle = []
rank=[]
seed=[]
# Loop to gather the information entry by entry
for linea in (df["Name"].tolist()):
    #State relaxed, unrelaxed
    match = state_pattern.search(linea)
    if match:
        state.append(match.group(0))
    else:
        state.append("relaxed")
    
    # Model
    match = model_pattern.search(linea)
    if match:   
        model.append(match.group(1)) 
    else:
        model.append("cristal")   
    
    #Rank
    match = rank_pattern.search(linea)
    if match:   
        rank.append(match.group(0)) 
    else:
        rank.append("unrank")
    
    #Version 
    match = version_pattern.search(linea)
    if match: 
        version.append(match.group(1))
    else:
        version.append("cristal")
    
    # Recycle
    match = recycle_pattern.search(linea)
    if match:
        recycle.append(match.group(0)[2:])
    else:
        recycle.append("Seed_0")          
    #Seed
    match = seed_pattern.search(linea)
    if match:
        seed.append(match.group(1))
    else:
        seed.append("-")

# Adding the entries to the dataframe
df["State"]=state
df["Model"]=model
df["Rank"]=rank
df["Version"]=version
df["Recycle"]=recycle
df["Seed"]=seed

# Add the information of the crystals
df.loc[df["Rank"] == "unrank", "Version"] = "alphafold3"
df.loc[(df["Model"] == "cristal") & (df["Rank"] == "unrank"), ["Rank", "Recycle", "State", "Version"]] = "cristal"
lista_valores = ["pred_0", "pred_1", "pred_2", "pred_3", "pred_4", "pred_5"]
df.loc[df["Rank"].isin(lista_valores), "Version"] = "Alphafold2"

# Add the state information
#df.loc[df["State"]=="relaxed","Recycle"]="relaxed"
df['Name']= df['Name']+".pdb"

# Add the information of the ranked models
df.loc[(df["Model"] == "cristal") & (df["Rank"] != "cristal"),  "Version"] = "Alphafold2"
df.loc[(df["Model"] == "cristal") & (df["Rank"] != "cristal"), ["Model",  "Recycle"]] = "ranked"

# Remove duplicates
#df=df.drop_duplicates(subset=["Name","State","Complex"],keep="first")
df_pydock=df
df_pydock

In [None]:
#The models were all relaxed
df_pydock["State"]="relaxed"


In [None]:
df_pydock

In [None]:
df_pydock.to_csv(directorio_csv+'/pydock4_all.csv', index=False)

#### 3.2.1 Calculation of additional parameters

It calculates:
- Structured/unstructured regions and the total number of amino acids in unstructured regions
- Clashes between chains (the total is taken, but the per-region breakdown can be inspected)
- Symmetry of identical chains (rise, sexagesimal degrees)
- Knot detection
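
The first item of the list can be illustrated with a minimal sketch that works on a per-residue secondary-structure string such as the one DSSP assigns. Which DSSP codes count as unstructured is an assumption here (coil "-", turn "T" and bend "S"), as are the function name and the example string.

```python
from itertools import groupby

# DSSP codes treated as unstructured (assumption): coil, turn and bend
UNSTRUCTURED_CODES = {"-", "T", "S"}

def unstructured_stats(secondary_structure):
    """Return (total unstructured residues, length of the longest unstructured run)."""
    flags = [code in UNSTRUCTURED_CODES for code in secondary_structure]
    # Length of every consecutive run of unstructured residues
    runs = [sum(1 for _ in group) for is_unstructured, group in groupby(flags) if is_unstructured]
    return sum(runs), max(runs, default=0)

# Hypothetical secondary-structure string for one chain
print(unstructured_stats("--HHHH-TS--EEE-"))
```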

In [None]:
from multiprocessing import Pool
def add_cryst1_record(pdb_file):
    cryst1_line = "CRYST1   90.000   90.000   90.000  90.00  90.00  90.00 P 1           1\n"
    with open(pdb_file, 'r') as file:
        lines = file.readlines()
    
    if not any(line.startswith('CRYST1') for line in lines):
        with open(pdb_file, 'w') as file:
            file.write(cryst1_line)
            file.writelines(lines)

def preprocess_pdb_files(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.pdb'):
            pdb_file = os.path.join(directory, filename)
            add_cryst1_record(pdb_file)

def calculate_chain_clashes(model, chain):
    atoms = [atom for atom in model.get_atoms() if atom.get_parent().get_parent() != chain and atom.element != 'H']
    ns = NeighborSearch(atoms)
    chain_atoms = [atom for atom in chain.get_atoms() if atom.element != 'H']
    clashes = 0

    for atom in chain_atoms:
        neighbors = ns.search(atom.coord, 3.0)
        clashes += len([neighbor for neighbor in neighbors if neighbor != atom])
    return clashes

def count_low_b_factors(chain, threshold):
    low_b_factor_count = 0
    total_residues = 0
    
    for residue in chain:
        for atom in residue:
            if atom.bfactor < threshold:
                low_b_factor_count += 1
                break
        total_residues += 1
    
    if total_residues == 0:
        return 0
    else:
        return (low_b_factor_count / total_residues) * 100

def calc_rise_and_symmetry(atoms1, atoms2):
    coords1 = np.array([atom.get_coord() for atom in atoms1])
    coords2 = np.array([atom.get_coord() for atom in atoms2])
    
    # Superimpose the coordinates
    super_imposer = Superimposer()
    super_imposer.set_atoms(atoms1, atoms2)
    super_imposer.apply(atoms1)  # Apply the rotation to atoms1

    # Compute the centroids once aligned
    aligned_coords1 = np.array([atom.get_coord() for atom in atoms1])
    aligned_coords2 = np.array([atom.get_coord() for atom in atoms2])
    centroid1 = np.mean(aligned_coords1, axis=0)
    centroid2 = np.mean(aligned_coords2, axis=0)
    
    # Compute the rise as the distance between the aligned centroids
    rise = np.linalg.norm(centroid1 - centroid2)
    
    # Get the rotation matrix and compute the symmetry angle
    rot_matrix = super_imposer.rotran[0]
    rotation = R.from_matrix(rot_matrix)
    symmetry_degrees = rotation.magnitude() * (180 / np.pi)  # Convert from radians to degrees
    
    return rise, symmetry_degrees

def are_sequences_similar(chain1, chain2, threshold=0.8):
    seq1 = ''.join([residue.resname for residue in chain1.get_residues()])
    seq2 = ''.join([residue.resname for residue in chain2.get_residues()])

    aligner = PairwiseAligner()
    alignments = aligner.align(seq1, seq2)
    best_alignment = alignments[0]
    identity = best_alignment.score / max(len(seq1), len(seq2))
    
    return identity >= threshold

def calculate_symmetry(model):
    chain_ids = list(model.child_dict.keys())
    rises = []
    symmetries = []
    symmetry_pairs = []
    
    for i in range(len(chain_ids)):
        for j in range(i + 1, len(chain_ids)):
            chain1 = model[chain_ids[i]]
            chain2 = model[chain_ids[j]]
            
            if are_sequences_similar(chain1, chain2):
                atoms1 = [atom for atom in chain1.get_atoms() if atom.element != 'H']
                atoms2 = [atom for atom in chain2.get_atoms() if atom.element != 'H']
                
                if len(atoms1) > 0 and len(atoms2) > 0:
                    rise, symmetry = calc_rise_and_symmetry(atoms1, atoms2)
                    rises.append(rise)
                    symmetries.append(symmetry)
                    symmetry_pairs.append((chain_ids[i], chain_ids[j], rise, symmetry))
    
    if len(rises) == 0 or len(symmetries) == 0:
        return 0, 0, symmetry_pairs
    else:
        avg_rise = np.mean(rises)
        avg_symmetry = np.mean(symmetries)
        return avg_rise, avg_symmetry, symmetry_pairs

def count_unstructured_amino_acids_and_clashes(pdb_file, bfactor_threshold):
    parser = PDB.PDBParser(QUIET=True)
    try:
        structure = parser.get_structure('X', pdb_file)
        model = structure[0]

        dssp = DSSP(model, pdb_file)

        unstructured_count = 0
        max_unstructured_region = 0
        total_clashes = 0
        clashes_per_chain = {}
        low_b_factors_per_chain = {}

        for chain in model:
            chain_dssp = [dssp[key] for key in dssp.keys() if key[0] == chain.id]

            if not chain_dssp:
                continue

            ss = [aa[2] for aa in chain_dssp]
            ss_string = ''.join(ss)
            #print(f"Chain {chain.id} DSSP data: {ss_string}")

            first_structured = next((i for i, s in enumerate(ss) if s != '-'), None)
            # Offset (counted from the end) of the last structured residue; None if the chain has none
            last_from_end = next((i for i, s in enumerate(reversed(ss)) if s != '-'), None)

            if first_structured is None or last_from_end is None:
                continue
            last_structured = len(ss) - 1 - last_from_end

            current_unstructured_count = 0
            for s in ss[first_structured:last_structured+1]:
                if s == '-':
                    current_unstructured_count += 1
                else:
                    if current_unstructured_count > max_unstructured_region:
                        max_unstructured_region = current_unstructured_count
                    current_unstructured_count = 0
            if current_unstructured_count > max_unstructured_region:
                max_unstructured_region = current_unstructured_count

            unstructured_count += sum(1 for s in ss[first_structured:last_structured+1] if s == '-')

            clashes = calculate_chain_clashes(model, chain)
            total_clashes += clashes
            clashes_per_chain[chain.id] = clashes

            low_b_factors = count_low_b_factors(chain, bfactor_threshold)
            low_b_factors_per_chain[chain.id] = low_b_factors

        avg_rise, avg_symmetry, symmetry_pairs = calculate_symmetry(model)

        return unstructured_count, max_unstructured_region, total_clashes, clashes_per_chain, low_b_factors_per_chain, avg_rise, avg_symmetry, symmetry_pairs
    except Exception as e:
        print(f"Error processing {pdb_file}: {e}")
        return None, None, None, None, None, None, None, None

def process_pdb_file(args):
    pdb_file, bfactor_threshold = args
    results = count_unstructured_amino_acids_and_clashes(pdb_file, bfactor_threshold)
    if results[0] is not None:
        unstructured_count, max_unstructured_region, total_clashes, clashes_per_chain, low_b_factors_per_chain, avg_rise, avg_symmetry, symmetry_pairs = results
        row = {
            'Name': os.path.basename(pdb_file),
            'Unstructured_count': unstructured_count,
            'Max_unstructured_region': max_unstructured_region,
            'Total_clashes': total_clashes,
            'Average_rise': avg_rise,
            'Average_symmetry': avg_symmetry
        }
        for chain_id, clashes in clashes_per_chain.items():
            row[f'Clashes_chain_{chain_id}'] = clashes
        for chain_id, low_b_factors in low_b_factors_per_chain.items():
            row[f'Low_B_factors_chain_{chain_id}'] = low_b_factors
        
        for chain1, chain2, rise, symmetry in symmetry_pairs:
            row[f'Rise_{chain1}_{chain2}'] = rise
            row[f'Symmetry_{chain1}_{chain2}'] = symmetry
        
        return row
    return None

def main(path, directories, bfactor_threshold, num_threads):
    #patron  = r'fold_t\d+_b[a-z]+_a_\d+_model_\d+_supeimp\.pdb'
    patron = r'(.*(\d+)\.pdb$)'
    for directory in directories:
        directory = os.path.join(path, directory)
        preprocess_pdb_files(directory)

    pdb_files = []
    for directory in directories:
        directory = os.path.join(path, directory)
        for filename in os.listdir(directory):
            if re.match(patron, filename):
                pdb_file = os.path.join(directory, filename)
                pdb_files.append((pdb_file, bfactor_threshold))

    with Pool(num_threads) as pool:
        data = pool.map(process_pdb_file, pdb_files)

    data = [row for row in data if row is not None]
    df = pd.DataFrame(data)
    return df

# def main(path, directories, bfactor_threshold):
#     patron = r'(.*(\d+)\.pdb$)'
#     for directory in directories:
#         directory = os.path.join(path, directory)
#         preprocess_pdb_files(directory)
#     data = []

#     for directory in directories:
#         directory = os.path.join(path, directory)
#         for filename in os.listdir(directory):
#             if re.match(patron, filename):
#                 pdb_file = os.path.join(directory, filename)
#                 results = count_unstructured_amino_acids_and_clashes(pdb_file, bfactor_threshold)
#                 if results[0] is not None:
#                     unstructured_count, max_unstructured_region, total_clashes, clashes_per_chain, low_b_factors_per_chain, avg_rise, avg_symmetry, symmetry_pairs = results
#                     row = {
#                         'Name': filename,
#                         'Unstructured_count': unstructured_count,
#                         'Max_unstructured_region': max_unstructured_region,
#                         'Total_clashes': total_clashes,
#                         #'Average_rise': avg_rise,
#                         #'Average_symmetry': avg_symmetry
#                     }
#                     for chain_id, clashes in clashes_per_chain.items():
#                         row[f'Clashes_chain_{chain_id}'] = clashes
#                     for chain_id, low_b_factors in low_b_factors_per_chain.items():
#                         row[f'Low_B_factors_chain_{chain_id}'] = low_b_factors
                    
#                     # Add the per-chain-pair symmetries to the DataFrame
#                     for chain1, chain2, rise, symmetry in symmetry_pairs:
#                         row[f'Rise_{chain1}_{chain2}'] = rise
#                         row[f'Symmetry_{chain1}_{chain2}'] = symmetry
                    
#                     data.append(row)
    
#     df = pd.DataFrame(data)
#     return df

# Example usage

bfactor_threshold = 50  # B-factor (pLDDT) threshold
num_threads = 20
df_loop_clashes = main(directorio, carpetas, bfactor_threshold, num_threads)

# Print the resulting DataFrame
print(df_loop_clashes)


In [None]:
df_pydock

In [None]:
#Join df_loop_clashes and df_pydock by Name
df_pydock = df_pydock.merge(df_loop_clashes, on= 'Name', how='left')
df_pydock
#T254 No Knots

*Knot detection*

If problems arise, disable the knot check.
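
A minimal sketch of a knot check with a time limit, for cases where the external tool runs too long. It assumes, as the commented-out code in this notebook does, that a `#` in the tool's output marks a knot. The helper name `run_knot_check` and the `timeout_s` parameter are hypothetical, and the two stand-in commands only exist so the sketch runs without `knot_pull_check` installed.

```python
import subprocess
import sys


def run_knot_check(command, timeout_s):
    """Run an external knot checker and return 'yes' or 'no'.

    Any failure (non-zero exit, timeout, missing executable) is reported
    as 'no', mirroring the fallback used in this notebook.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True, timeout=timeout_s
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
        return "no"
    # Assumption taken from the notebook: '#' in the output marks a knot.
    return "yes" if "#" in result.stdout else "no"


# Stand-in commands (hypothetical) that imitate a knotted result and a slow run.
fake_knotted = [sys.executable, "-c", "print('# knot found')"]
fake_slow = [sys.executable, "-c", "import time; time.sleep(5)"]

knotted_result = run_knot_check(fake_knotted, timeout_s=30)
slow_result = run_knot_check(fake_slow, timeout_s=0.5)
print(knotted_result, slow_result)
```

With the real tool, a call such as `run_knot_check(["knot_pull_check", "-kq", pdb_file], timeout_s=2400)` would cap each file at 40 minutes. Because failures and timeouts are both reported as `'no'`, a separate status value may be preferable if the two cases need to be told apart.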

In [None]:

## If this function raises an error, or the check runs for more than 40 minutes,
## keep it commented out and use the stub defined below instead.
# def check_knots_and_get_info(pdb_file):
#     command = ["knot_pull_check", "-kq", pdb_file]
#     try:
#         result = subprocess.run(command, capture_output=True, text=True, check=True)
#         output = result.stdout.strip()
#
#         # Check whether the output contains '#'
#         if '#' in output:
#             return (pdb_file, 'yes')
#         else:
#             return (pdb_file, 'no')
#     except subprocess.CalledProcessError as e:
#         print(f"Error executing command: {e}")
#     return (pdb_file, 'no')  # Assume 'no' if the command fails


## Stub that skips the knot check; comment it out when the real function above is enabled.
def check_knots_and_get_info(pdb_file):
    return (pdb_file, 'no')  # Knot check disabled: always report 'no'

def process_pdb_files(path, directories, num_workers):
    patron = r'(.*(\d+)\.pdb$)'
   # patron = r'fold_t\d+_b[a-z]+_a_\d+_model_\d+_supeimp\.pdb'
    data = []

    pdb_files = []
    for directory in directories:
        directory_path = os.path.join(path, directory)
        for filename in os.listdir(directory_path):
            if re.match(patron, filename):
                pdb_file = os.path.join(directory_path, filename)
                pdb_files.append(pdb_file)
    
    # Use ProcessPoolExecutor to run the checks in parallel with a fixed number of workers
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        future_to_pdb = {executor.submit(check_knots_and_get_info, pdb_file): pdb_file for pdb_file in pdb_files}
        for future in as_completed(future_to_pdb):
            pdb_file = future_to_pdb[future]
            try:
                filename, knot_info = future.result()
                row = {
                    'Name': os.path.basename(filename),
                    'Knots': knot_info
                }
                data.append(row)
            except Exception as e:
                print(f"Error processing file {pdb_file}: {e}")
    
    df = pd.DataFrame(data)
    return df

# Example usage
num_workers = 20  # Number of worker processes to use
df_knots = process_pdb_files(directorio,carpetas,num_workers)

df_knots

In [None]:
#Join df_knots and df_pydock by Name
df_pydock = df_pydock.merge(df_knots, on= 'Name', how='left')
df_pydock.to_csv(directorio_csv+'/pydock4_all.csv', index=False)

In [None]:
df_pydock

#### 3.3. Log.txt information retrieving
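
A minimal sketch of how the regular expressions used in this step pull values out of a single log line. The sample line is hypothetical (the real `log.txt` format may differ); the patterns and the 0.8/0.2 weighting are the ones used in this notebook.

```python
import re

# Hypothetical log line; the real log.txt format may differ.
line = ("alphafold2_multimer_v3_model_2_seed_000 "
        "recycle=3 pLDDT=85.2 pTM=0.81 ipTM=0.78 tol=0.45")

patterns = {
    "Model": re.compile(r"model_(\d+)"),
    "Version": re.compile(r"((deepfold|alphafold2_multimer)_v\d+)_model"),
    "Recycle": re.compile(r"recycle=(\d+)"),
    "pLDDT": re.compile(r"pLDDT=([\d.]+)"),
    "pTM": re.compile(r"pTM=([\d.]+)"),
    "ipTM": re.compile(r"ipTM=([\d.]+)"),
}

# Keep the first captured group of each pattern, or None when it is absent.
record = {}
for field, pattern in patterns.items():
    match = pattern.search(line)
    record[field] = match.group(1) if match else None

# Model confidence as defined in this notebook: 0.8 * ipTM + 0.2 * pTM.
model_confidence = 0.8 * float(record["ipTM"]) + 0.2 * float(record["pTM"])
print(record, model_confidence)
```

Note that `pTM=` also occurs inside `ipTM=`, so the `pTM` pattern returns the intended value only when `pTM` appears before `ipTM` on the line, as it does in this sample.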

In [None]:
df_pydock
# Folders of all models
carpetas_log = [nombre for nombre in os.listdir(directorio) if os.path.isdir(os.path.join(directorio, nombre))]
#carpetas_log.remove('Version1')
carpetas_log

In [None]:
# Dataframe columns
columns = ["Complex","Model","State",'Version', 'Recycle', 'pLDDT', 'pTM', 'ipTM', 'tol','Seed']

# Patterns used to extract the information from the log text
#complex_pattern = re.compile(r'((T|t)\/.*_A)') 
rank_pattern = re.compile(r'(rank_(\d+))|(pred_\d+)|(ranked_.*)')
model_pattern = re.compile(r'model_(\d+)')
state_pattern = re.compile(r'rank')
version_pattern = re.compile(r"((deepfold|alphafold2_multimer)_v\d+)_model")
recycle_pattern = re.compile(r'recycle=(\d+)')
plddt_pattern = re.compile(r'pLDDT=([\d.]+)')
ptm_pattern = re.compile(r'pTM=([\d.]+)')
iptm_pattern = re.compile(r'ipTM=([\d.]+)')
tol_pattern = re.compile(r'tol=([\d.]+)')
seed_pattern = re.compile(r'seed_([\d.]+)')
name_pattern = re.compile(r"(fold_t\d+_\d+_model_\d+)")
df_log=pd.DataFrame()

for carpeta in carpetas_log:
    directorio_log=f"{directorio}/{carpeta}/log.txt"
    # Load the log file
    with open(directorio_log, 'r') as file:
        lines = file.readlines()
    
    # Value extraction
    #name = None
    complex=None
    model=None
    version = None
    state=None
    recycle = None
    plddt = None
    ptm = None
    iptm = None
    tol = None
    seed = None
    data=[]
    for line in lines:
        
        # # Complex
        # match = complex_pattern.search(line)
        # if match:
        #     print()
        #     complex = match.group(0)
        #     complex=complex[2:-2]

        #Name
        # match = name_pattern.search(line)
        # if match:
        #     name = match.group(1)+'.pdb'
        # else:
        #     name =directorio_log 
        #State
        match = state_pattern.search(line)
        if match:
            state="relaxed"
        else:
            state="unrelaxed"

        # Model
        match = model_pattern.search(line)
        if match:
            model= match.group(1)
        
        # Version
        match = version_pattern.search(line)
        if match:
            version = match.group(1)
        else:
            version = 'alphafold3'
            
        # Recycle
        match = recycle_pattern.search(line)
        if match:
            recycle = match.group(1)
        else:
            recycle = 'Seed_0'
        
        #  pLDDT
        match = plddt_pattern.search(line)
        if match:
            plddt = match.group(1)
        else:
            plddt = None
        
        #  pTM
        match = ptm_pattern.search(line)
        if match:
            ptm = match.group(1)
        else:
            ptm=None
        
        #  ipTM
        match = iptm_pattern.search(line)
        if match:
            iptm = match.group(1)
        else:
            iptm=None
        
        #  tol
        match = tol_pattern.search(line)
        if match:
            tol = match.group(1)
        else:
            tol="-"
        
        #seed
        match = seed_pattern.search(line)
        if match:
            seed = match.group(1)
        else:
            seed="-"
        
        # rank
        match = rank_pattern.search(line)
        if match:   
            recycle = 'Seed_0'
        # Store the values extracted from this line
        data.append([complex,model,state,version, recycle, plddt, ptm, iptm, tol,seed])

    # Build the DataFrame
    df = pd.DataFrame(data, columns=columns)

    # Convert the 'ipTM', 'pTM' and 'pLDDT' columns to numeric (float) types
    df['ipTM'] = pd.to_numeric(df['ipTM'], errors='coerce')
    df['pTM'] = pd.to_numeric(df['pTM'], errors='coerce')
    df['pLDDT'] = pd.to_numeric(df['pLDDT'], errors='coerce')

    # Add the model confidence, following the formula from the article
    df['Model_confidence'] = 0.8 * df['ipTM'] + 0.2 * df['pTM']
    if carpeta.startswith('fold'):
        df["Complex"]=carpeta.split('_')[1].upper()
    else:
        df["Complex"]=carpeta[0:4]
    # Keep only the lines that carried both ipTM and pTM
    df = df.dropna(subset=['Model_confidence'])
    df_log=pd.concat([df_log,df])


In [None]:
#The models were all relaxed
df_log["State"]="relaxed"
df_log.to_csv(directorio_csv+'/log_all.csv', index=False)
df_log

### 4.  Final fusion

<div style="font-family: Arial, sans-serif; line-height: 1.5; text-align: justify;">We now assemble a new DataFrame that collects all the data obtained in the previous steps for later statistical analysis.

</div>

#### 4.1 Checking for possible issues
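
A minimal sketch of the per-complex row-count comparison done in this step, written with `value_counts` on toy data. The frame names `df_a` and `df_b` and the complex labels are hypothetical stand-ins for `df_pydock` and `df_log`.

```python
import pandas as pd

# Toy stand-ins for df_pydock and df_log (hypothetical contents).
df_a = pd.DataFrame({"Complex": ["T309", "T309", "T309", "T310"]})
df_b = pd.DataFrame({"Complex": ["T309", "T309", "T310", "T310"]})

# Count rows per complex in each frame; a complex missing from one frame counts as 0.
counts = pd.DataFrame({
    "Pydock": df_a["Complex"].value_counts(),
    "Log": df_b["Complex"].value_counts(),
}).fillna(0).astype(int)
counts["Difference"] = counts["Pydock"] - counts["Log"]
print(counts)
```

A non-zero `Difference` flags a complex whose models are not all present in both tables, which is worth resolving before merging.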

In [None]:
#Loading the dataframes
df_pydock =pd.read_csv(directorio_csv+'/pydock4_all.csv')
df_log=pd.read_csv(directorio_csv+'/log_all.csv')

In [None]:
df_log

In [None]:
df_pydock

In [None]:
# Check the number of rows per complex in each DataFrame; be aware that entries from crystal structures may be present!
unicos=set(df_pydock["Complex"])
for complejo in unicos:
    print(complejo)
    mu=len(df_pydock[df_pydock["Complex"]==complejo])
    #nu=len(df_rmsd[df_rmsd["Complex"]==complejo])
    clus=len(df_log[df_log["Complex"]==complejo])
    #print( datos_carpeta[complejo])
    print("Pydock:",mu,"Log:",clus,"Difference:",mu-clus)

#### 4.2 Merging dataframes

Determine which columns the two DataFrames share and merge on those common columns.
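
A minimal sketch of this merge on toy data: find the shared columns, cast them to strings so that the keys compare equal, and do a left merge. The `indicator=True` option is an addition not used in the notebook; it marks rows that found no partner. All values are hypothetical.

```python
import pandas as pd

# Toy frames: "Model" is an int on the left and a string on the right.
left = pd.DataFrame({"Complex": ["T309", "T309"], "Model": [1, 2], "Total": [-35.2, -28.9]})
right = pd.DataFrame({"Complex": ["T309"], "Model": ["1"], "ipTM": [0.78]})

# Columns present in both frames become the merge keys.
shared = sorted(set(left.columns) & set(right.columns))
left[shared] = left[shared].astype(str)
right[shared] = right[shared].astype(str)

# Left merge keeps every row of `left`; `_merge` records whether a match was found.
merged = left.merge(right, on=shared, how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(shared, len(unmatched))
```

Casting to `str` makes `1` and `"1"` match, but it does not reconcile `1` with `1.0`, and it turns missing values into the literal string `"nan"`, so checking `unmatched` after the merge is a cheap safeguard.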

In [None]:
columna4=(df_pydock.columns).tolist()
columna3=(df_log.columns).tolist()
compartidos2=list(set(columna4).intersection(columna3))
#compartidos2=['Name']
df_pydock[compartidos2]=df_pydock[compartidos2].astype(str)
df_log[compartidos2]=df_log[compartidos2].astype(str)
merged_df2 = df_pydock.merge(df_log, on= compartidos2, how='left')
print (compartidos2)
merged_df2.to_csv(directorio_csv+'/merged_df2.csv')
#merged_df2["Total_Name"]=merged_df2["Complex"]+"_"+merged_df2["Name"]

# The following sets the Seed_0 rows equal to those of recycle 20; enable and adjust if necessary (normally the Seed_0 rows are removed)
#mask = merged_df2["Recycle"] == "Seed_0"
#merged_df2.loc[merged_df2["Recycle"] == "20", ["pLDDT", "pTM", "ipTM", "tol"]]
#merged_df2.loc[mask, ["pLDDT", "pTM", "ipTM", "tol"]] = merged_df2.values[:len(merged_df2[mask])]

#### 4.3 Filter the dataset

In [None]:
def filter_pydock_advanced(
    df, Knots_value, Max_unstructured_region, Total_clashes,
    symmetry_conditions, res_conditions, invert=False
):
    # Initial filter based on Knots_value, Max_unstructured_region and Total_clashes
    main_filter = (
        (df['Knots'] == Knots_value) &
        (df['Max_unstructured_region'] <= Max_unstructured_region) &
        (df['Total_clashes'] <= Total_clashes)
    )

    # Build the symmetry conditions dynamically (alternatives are combined with OR)
    symmetry_filter = None
    for condition in symmetry_conditions:
        Symmetry_col, up, low = condition
        if Symmetry_col not in df.columns:
            warnings.warn(f"Column '{Symmetry_col}' does not exist in the DataFrame.")
            continue
        current_filter = (
            (df[Symmetry_col].between(-up, -low)) | 
            (df[Symmetry_col].between(low, up))
        )
        if symmetry_filter is None:
            symmetry_filter = current_filter
        else:
            symmetry_filter |= current_filter
    
    if symmetry_filter is not None:
        main_filter &= symmetry_filter

    # Build the low-pLDDT residue conditions dynamically (all must hold, combined with AND)
    res_filter = None
    for condition in res_conditions:
        Res_col, threshold = condition
        if Res_col not in df.columns:
            warnings.warn(f"Column '{Res_col}' does not exist in the DataFrame.")
            continue
        # If every value in df[Res_col] is 0, skip this filter
        if df[Res_col].eq(0).all():
            continue
        current_filter = (df[Res_col] <= threshold)
        if res_filter is None:
            res_filter = current_filter
        else:
            res_filter &= current_filter
    
    if res_filter is not None:
        main_filter &= res_filter

    # Apply the inverted filter if invert is True
    if invert:
        filtered_df = df[~main_filter]
    else:
        filtered_df = df[main_filter]
    
    return filtered_df

Applying the function

- Max_unstructured_region: maximum allowed length of the longest unstructured region.
- Total_clashes: maximum allowed number of clashes.
- res_conditions: e.g. [('Low_B_factors_chain_A', 20)]; change the 20 to the desired threshold (maximum % of residues in the chain with pLDDT < 50).
- Knots_value: 'yes' or 'no'.
- (optional) symmetry_conditions: e.g. [('Symmetry_A_F', 185, 175)], the desired symmetry-angle interval given as (column, upper, lower). Intended for repeated chains that form symmetric structures; the assembly need not be symmetric, depending on the structure (inspect the models visually).
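
A minimal sketch of how one symmetry condition and one residue condition select rows, using a toy DataFrame with hypothetical values. It mirrors the masks built inside `filter_pydock_advanced` but is not a replacement for it.

```python
import pandas as pd

# Toy data (hypothetical): symmetry angle in degrees, low-pLDDT residues in percent.
df = pd.DataFrame({
    "Name": ["m1.pdb", "m2.pdb", "m3.pdb"],
    "Symmetry_A_B": [179.2, 120.4, 176.0],
    "Low_B_factors_chain_A": [4.0, 3.0, 35.0],
})

# A symmetry condition is given as (column, upper, lower).
column, upper, lower = ("Symmetry_A_B", 185, 175)
symmetry_mask = df[column].between(lower, upper) | df[column].between(-upper, -lower)

# A residue condition is given as (column, threshold) and keeps values <= threshold.
res_mask = df["Low_B_factors_chain_A"] <= 20

kept = df[symmetry_mask & res_mask]
rejected = df[~(symmetry_mask & res_mask)]
print(kept["Name"].tolist(), rejected["Name"].tolist())
```

In the function itself, several symmetry conditions are combined with OR, several residue conditions with AND, and a condition naming a column that does not exist is skipped with a warning.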

In [None]:
# Max_unstructured_region, Total_clashes, res_conditions, Knots_value, symmetry_conditions = 4, 55, \
#     [('Low_B_factors_chain_A', 20), ('Low_B_factors_chain_B', 20)], \
#     'no', [('Symmetry_A_B', 185, 175)] # T242: symmetry filter included, 180 ± 5, E_F
#Max_unstructured_region, Total_clashes, res_conditions, Knots_value, symmetry_conditions = 8, 55, \
#     [('Low_B_factors_chain_A', 7), ('Low_B_factors_chain_B', 7)], \
#     'no', [('Symmetry_A_B', 185, 175)]# T244: included, symmetry is not calculated
# Max_unstructured_region, Total_clashes, res_conditions, Knots_value, symmetry_conditions = 6, 55, \
    # [('Low_B_factors_chain_A', 9), ('Low_B_factors_chain_B', 12)], \
    # 'no', [('Symmetry_A_B', 185, 175)]# T248: included, symmetry is not calculated
# Max_unstructured_region, Total_clashes, res_conditions, Knots_value, symmetry_conditions = 6, 500, \
#     [('Low_B_factors_chain_A', 70), ('Low_B_factors_chain_B', 70), ('Low_B_factors_chain_C', 70)], \
#     'no', [('Symmetry_B_C', 125, 115),('Symmetry_A_B', 125, 115)]# T250, T252: included, symmetry is calculated
#Max_unstructured_region, Total_clashes, res_conditions, Knots_value, symmetry_conditions = 6, 300, [('Low_B_factors_chain_A', 10), ('Low_B_factors_chain_B', 10), ('Low_B_factors_chain_C', 10),('Low_B_factors_chain_D', 10),('Low_B_factors_chain_E', 10)], 'no', [('Symmetry_B_C', 125, 115),('Symmetry_A_B', 125, 115)]# T256 

# Max_unstructured_region, Total_clashes, res_conditions, Knots_value, symmetry_conditions = 6, 500, \
#     [('Low_B_factors_chain_A', 10), ('Low_B_factors_chain_B', 10), ('Low_B_factors_chain_C', 10),('Low_B_factors_chain_D', 10)], \
#     'no', [('Symmetry_A_B', 185, 175)]# T254, T255: included, symmetry is calculated

# Max_unstructured_region, Total_clashes, res_conditions, Knots_value, symmetry_conditions = 20, 300, \
#     [('Low_B_factors_chain_A', 66), ('Low_B_factors_chain_B', 66), ('Low_B_factors_chain_C', 66),('Low_B_factors_chain_D', 66),('Low_B_factors_chain_E', 66)], \
#     'no', [('Symmetry_A_F', 185, 175)]# T262: Symmetry_A_F does not exist, so no symmetry filter is applied


### T264: Symmetry_A_Z does not exist, so no symmetry filter is applied ##
# Max_unstructured_region, Total_clashes, res_conditions, Knots_value, symmetry_conditions = 24, 1000, \
#     [('Low_B_factors_chain_A', 24),('Low_B_factors_chain_B', 24),('Low_B_factors_chain_C', 24),('Low_B_factors_chain_D', 20)], \
#     'no', [('Symmetry_A_Z', 185, 175)]


### T266: Symmetry_A_F does not exist, so no symmetry filter is applied ##
# Max_unstructured_region, Total_clashes,\
# res_conditions,\
# Knots_value, symmetry_conditions =\
#     20, 60, \
#     [('Low_B_factors_chain_A', 20)], \
#     'no', [('Symmetry_A_F', 185, 175)]



## Customizable
# Max_unstructured_region, Total_clashes=20, 60
# res_conditions=[('Low_B_factors_chain_A', 20)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_F', 185, 175)]

#T168
# Max_unstructured_region, Total_clashes=7, 60
# res_conditions=[('Low_B_factors_chain_A', 80)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_F', 185, 175)]

# filtered_df = filter_pydock_advanced(
#     merged_df2, Knots_value, Max_unstructured_region, Total_clashes,
#     symmetry_conditions=symmetry_conditions,
#     res_conditions=res_conditions
#)
#T170
# Max_unstructured_region, Total_clashes=7, 60
# res_conditions=[('Low_B_factors_chain_A', 80)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_F', 185, 175)]

# #T172
# merged_df2=pd.read_csv(directorio_csv+'/pydock4_all.csv')
# Max_unstructured_region, Total_clashes=12, 5000
# res_conditions=[('Clashes_chain_A', 600)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_B_D', 185, 175),()]

# #T280
# merged_df2=pd.read_csv(directorio_csv+'/merged_df2.csv')
# Max_unstructured_region, Total_clashes=8, 150
# res_conditions=[('Low_B_factors_chain_A', 8),('Low_B_factors_chain_B', 8),('Low_B_factors_chain_C', 8),('Low_B_factors_chain_D', 8)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_B', 185, 175)]

#T288
# Max_unstructured_region, Total_clashes=6, 110,
# res_conditions=[('Low_B_factors_chain_A', 21),('Low_B_factors_chain_B', 21),("Low_B_factors_chain_C",21)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_B', 140, 100),("Symmetry_B_C",140, 100)]

#T290
# Max_unstructured_region, Total_clashes=2, 97,
# res_conditions=[('Low_B_factors_chain_A', 5),('Low_B_factors_chain_B', 5),("Low_B_factors_chain_C",5),("Low_B_factors_chain_D",5),("Low_B_factors_chain_E",5),("Low_B_factors_chain_F",5)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_B', 65, 55),("Symmetry_B_C",65, 55)]

# #T290
# Max_unstructured_region, Total_clashes=4, 50,
# res_conditions=[('Low_B_factors_chain_A', 5),('Low_B_factors_chain_B', 50)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_B', 65, 55)]

# #T292
# Max_unstructured_region, Total_clashes=7, 400,
# res_conditions=[('Low_B_factors_chain_A', 14),('Low_B_factors_chain_B', 14),("Low_B_factors_chain_C",14),("Low_B_factors_chain_D",14),("Low_B_factors_chain_E",14),("Low_B_factors_chain_F",14),("Low_B_factors_chain_G",14),("Low_B_factors_chain_H",14),("Low_B_factors_chain_I",14)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_B', 135, 105),("Symmetry_B_C",135, 105)]

# # T282
# Max_unstructured_region, Total_clashes=6, 190,
# bfactor_1=3
# res_conditions=[('Low_B_factors_chain_A', bfactor_1),('Low_B_factors_chain_B', bfactor_1),("Low_B_factors_chain_C",bfactor_1),("Low_B_factors_chain_D",bfactor_1)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_A_C', 190, 170),("Symmetry_B_D",190, 170)]

# T286
# Max_unstructured_region, Total_clashes=8, 400,
# bfactor_1=7
# res_conditions=[('Low_B_factors_chain_A', bfactor_1),('Low_B_factors_chain_B', bfactor_1),("Low_B_factors_chain_C",bfactor_1),("Low_B_factors_chain_D",bfactor_1)]
# Knots_value, symmetry_conditions ='no', [("Symmetry_B_C",140, 100)]

# T296
# Max_unstructured_region, Total_clashes=5, 32,
# #bfactor_1=7
# res_conditions=[('Low_B_factors_chain_A', 5),('Low_B_factors_chain_B', 100)]
# Knots_value, symmetry_conditions ='no', [("Symmetry_B_C",140, 100)]

# T294
# Max_unstructured_region, Total_clashes=5, 130,
# #bfactor_1=7
# res_conditions=[('Low_B_factors_chain_A', 2),('Low_B_factors_chain_B', 2),("Low_B_factors_chain_C",7),("Low_B_factors_chain_D",7), ("Low_B_factors_chain_E",17),("Low_B_factors_chain_F",17)]
# Knots_value, symmetry_conditions ='no', [("Symmetry_A_B",190, 160), ("Symmetry_C_D",190, 160), ("Symmetry_E_F",50, 0)]
# T298 
# Max_unstructured_region, Total_clashes=6, 90,
# bfactor_1=2.5
# res_conditions=[('Low_B_factors_chain_A', bfactor_1),('Low_B_factors_chain_B', bfactor_1),("Low_B_factors_chain_C",bfactor_1)]
# Knots_value, symmetry_conditions ='no', [("Symmetry_A_B",125, 115), ("Symmetry_B_C",125, 115)]
# T300
# Max_unstructured_region, Total_clashes=6, 120,
# bfactor_1=5
# res_conditions=[('Low_B_factors_chain_A', bfactor_1),('Low_B_factors_chain_B', bfactor_1),("Low_B_factors_chain_C",bfactor_1)]
# Knots_value, symmetry_conditions ='no', [("Symmetry_A_B",185, 175), ("Symmetry_C_D",185, 175)]
# T304 
# Max_unstructured_region, Total_clashes=6, 16,
# bfactor_1=5
# res_conditions=[('Low_B_factors_chain_A', bfactor_1)]
# Knots_value, symmetry_conditions ='no', [("Symmetry_H_B",185, 175)] 
# T302 
# Max_unstructured_region, Total_clashes=4, 280,
# bfactor_1=5
# res_conditions=[('Low_B_factors_chain_A', bfactor_1)]
# Knots_value, symmetry_conditions ='no', [("Symmetry_A_B",125, 115)] 
# T306
# Max_unstructured_region, Total_clashes=3, 10,
# bfactor_1=1.6
# res_conditions=[('Low_B_factors_chain_A', bfactor_1)]
# Knots_value, symmetry_conditions ='no', [("Symmetry_A_B",125, 115)] 
# T308
# Max_unstructured_region, Total_clashes=4, 200,
# bfactor_1=9.4
# res_conditions=[('Low_B_factors_chain_A', bfactor_1)]
# Knots_value, symmetry_conditions ='no', [('Symmetry_C_D', 185, 175)] 



In [None]:
assert False, "Stopping execution here."

In [None]:
merged_df2=pd.read_csv(directorio_csv+'/merged_df2.csv')
merged_df2.drop_duplicates(inplace=True,subset="Name")
merged_df2

In [None]:
# T309
Max_unstructured_region, Total_clashes = 5, 30
bfactor_1=1
res_conditions=[('Low_B_factors_chain_A', bfactor_1)]
Knots_value, symmetry_conditions ='no', [('Symmetry_A_B', 185, 175)] 


filtered_df = filter_pydock_advanced(
    merged_df2, Knots_value, Max_unstructured_region, Total_clashes,
    symmetry_conditions=symmetry_conditions,
    res_conditions=res_conditions
)

result_df_inverted = filter_pydock_advanced(
    merged_df2, Knots_value, Max_unstructured_region, Total_clashes,
    symmetry_conditions=symmetry_conditions,
    res_conditions=res_conditions,
    invert=True
)
print(len(filtered_df))
filtered_df

In [None]:
filtered_df.to_csv(directorio_csv+'/pydock4_all_filtered.csv', index=False)
result_df_inverted.to_csv(directorio_csv+'/pydock4_all_filtered_inv.csv', index=False)

In [None]:

# #Max_unstructured_region,Total_clashes,Res_with_low_pLDDT_A,Res_with_low_pLDDT_B,Knots_value = [10,90,20,20,'no']# T236
# #Max_unstructured_region,Total_clashes,Res_with_low_pLDDT_A,Res_with_low_pLDDT_B,Knots_value = [5,70,15,20,'no']# T238
# Max_unstructured_region,Total_clashes,Res_with_low_pLDDT_E,Res_with_low_pLDDT_F,Knots_value = [5,800,20,20,'no']# T240

# filtered_df = merged_df2[
#     (merged_df2['Knots'] == 'no') &
#     (merged_df2['Max_unstructured_region'] <= Max_unstructured_region) &
#     (merged_df2['Total_clashes'] <= Total_clashes) &
#     #(merged_df2['Res_with_low_pLDDT_A'] <= Res_with_low_pLDDT_A) & #T236 T238
#     #(merged_df2['Res_with_low_pLDDT_B'] <= Res_with_low_pLDDT_B)   #T236 T238
#     (merged_df2['Res_with_low_pLDDT_E'] <= Res_with_low_pLDDT_E) &  #T240
#     (merged_df2['Res_with_low_pLDDT_F'] <= Res_with_low_pLDDT_F)    #T240
# ]

# result_df_inverted = merged_df2[
#     ~(
#         (merged_df2['Knots'] == Knots_value) &
#         (merged_df2['Max_unstructured_region'] <= Max_unstructured_region) &
#         (merged_df2['Total_clashes'] <= Total_clashes) &
#         #(merged_df2['Res_with_low_pLDDT_A'] <= Res_with_low_pLDDT_A) & #T236 T238
#         #(merged_df2['Res_with_low_pLDDT_B'] <= Res_with_low_pLDDT_B)   #T236 T238
#         (merged_df2['Res_with_low_pLDDT_E'] <= Res_with_low_pLDDT_E) &  #T240
#         (merged_df2['Res_with_low_pLDDT_F'] <= Res_with_low_pLDDT_F)    #T240
#     )
# ]
# # Final result
# filtered_df


In [None]:
#result_df_inverted

### Normalization

### Z-score
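
A minimal sketch of the per-complex Z-score and ranking used in this section, written with `groupby(...).transform` on toy data. The values are hypothetical; the column names and the combination `Sum_Z = MCZ-Score - TEZ-Score` follow this notebook.

```python
import pandas as pd

# Toy data (hypothetical): two complexes with three models each.
df = pd.DataFrame({
    "Complex": ["A", "A", "A", "B", "B", "B"],
    "Model_confidence": [0.5, 0.6, 0.7, 0.2, 0.4, 0.6],
    "Total": [-10.0, -20.0, -30.0, -5.0, -15.0, -25.0],
})


def zscore(series):
    """Standardize a series with its own mean and sample standard deviation."""
    return (series - series.mean()) / series.std()


# Z-scores are computed within each complex, not over the whole table.
groups = df.groupby("Complex")
df["MCZ-Score"] = groups["Model_confidence"].transform(zscore)
df["TEZ-Score"] = groups["Total"].transform(zscore)

# High confidence and low (more negative) energy both raise the combined score.
df["Sum_Z"] = df["MCZ-Score"] - df["TEZ-Score"]
df["Ranking_Z"] = df.groupby("Complex")["Sum_Z"].rank(ascending=False)
print(df)
```

A complex with a single model, or with identical values for every model, has a standard deviation of NaN or 0, so its Z-scores are undefined and should be checked before ranking.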

In [None]:
df_norm=pd.read_csv(directorio_csv+'/pydock4_all_filtered.csv')
df_norm

In [None]:
df_norm=pd.read_csv(directorio_csv+'/pydock4_all_filtered.csv')

# Removing unnecesary columns
columnas=['Conf','RANK']
df_norm.drop(columnas, axis=1, inplace=True)
df_norm.dropna(subset=["Complex"],inplace=True)

# Removing duplicates
df_norm=df_norm.drop_duplicates(subset=["Name"],keep="first")
duplicados = df_norm[df_norm.duplicated(subset=["Name","Version","Complex","Recycle","State"])]

# Adding Total2 column
df_norm["Total2"]=df_norm["VDW"]+df_norm["Ele"]+df_norm["Desolv"] 

# Z-Score individuales, inicializacion
df_norm["MCZ-Score"] = 0 # Z-score de model_conficence
df_norm["PLDDTZ-Score"] = 0 # Z-score de pLDDT
df_norm["TEZ-Score"] = 0 # Z-score de Total
df_norm["TE2Z-Score"] = 0 # Z-score de Total2

# Suma de Z-Score, inicialicion
df_norm["Sum_Z"] = 0 # Z-score Model confidence + Total
df_norm["Sum2_Z"] = 0 # Z-score Model confidence + Total2
df_norm["Z-PLT"] = 0 # Z-score de pLDDT + Total
df_norm["Z-PLT2"]= 0 # Z-score de pLDDT + Total2

# Ranking Z-Score, inicializacion
df_norm["Ranking_Z"] = 0 # Ranking de Sum_Z
df_norm["Ranking2_Z"] = 0 # Ranking de Sum2_Z
df_norm["Ranking_PLT"] = 0 # Ranking de Z-PLT
df_norm["Ranking_PLT2"] = 0 # Ranking de Z-PLT2

# Per-complex means and standard deviations of the scored columns
# (restricted to numeric columns so text columns such as Name or PATH are not aggregated)
columnas_z = ["Model_confidence", "Total", "Total2", "pLDDT"]
grouped = df_norm.groupby("Complex")  # string key, so each group name is a plain label
medias=grouped[columnas_z].mean()
sdesv=grouped[columnas_z].std()

# Individual Z-scores
for name, group in grouped:
    # Z-scores of Model_confidence, Total, Total2 and pLDDT within each complex
    df_norm.loc[group.index,"MCZ-Score"] = (group["Model_confidence"]-medias.loc[name,"Model_confidence"])/sdesv.loc[name,"Model_confidence"]
    df_norm.loc[group.index,"TEZ-Score"] = (group["Total"]-medias.loc[name,"Total"])/sdesv.loc[name,"Total"]
    df_norm.loc[group.index,"TE2Z-Score"] = (group["Total2"]-medias.loc[name,"Total2"])/sdesv.loc[name,"Total2"]
    df_norm.loc[group.index,"PLDDTZ-Score"] = (group["pLDDT"]-medias.loc[name,"pLDDT"])/sdesv.loc[name,"pLDDT"]

# Combined Z-scores (confidence Z-score minus energy Z-score)
df_norm.loc[:,"Sum_Z"]=df_norm.loc[:,"MCZ-Score"]-df_norm.loc[:,"TEZ-Score"]
df_norm.loc[:,"Sum2_Z"]=df_norm.loc[:,"MCZ-Score"]-df_norm.loc[:,"TE2Z-Score"]
df_norm.loc[:,"Z-PLT"]=df_norm.loc[:,"PLDDTZ-Score"]-df_norm.loc[:,"TEZ-Score"]
df_norm.loc[:,"Z-PLT2"]=df_norm.loc[:,"PLDDTZ-Score"]-df_norm.loc[:,"TE2Z-Score"]

# Ranking of the combined Z-scores within each complex (rank 1 = highest score)
for name, group in grouped:
    df_norm.loc[group.index,"Ranking_Z"]=df_norm.loc[group.index,"Sum_Z"].rank(ascending=False)
    df_norm.loc[group.index,"Ranking2_Z"]=df_norm.loc[group.index,"Sum2_Z"].rank(ascending=False)
    df_norm.loc[group.index,"Ranking_PLT"]=df_norm.loc[group.index,"Z-PLT"].rank(ascending=False)
    df_norm.loc[group.index,"Ranking_PLT2"]=df_norm.loc[group.index,"Z-PLT2"].rank(ascending=False)



In [None]:
df_norm

In [None]:
df_norm.to_csv(directorio_csv + "/df_norm_"+Target_name+".csv",index=False)

## Repeat for the inverse dataframe

If too few models pass the filters, the filtered-out models are used as filler. They are not important in themselves and only serve to complete the submission.
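
A minimal sketch of how such filler could be appended, assuming a hypothetical helper `pad_with_filler` and invented toy names. Skipping filler rows whose `Name` is already selected is an extra safeguard assumed here, not something the notebook does:

```python
import pandas as pd

def pad_with_filler(primary, filler, size=100, key="Name"):
    """Keep `primary` first and append rows of `filler` until `size` rows are reached."""
    if len(primary) >= size:
        return primary.head(size).reset_index(drop=True)
    # Assumption: skip filler rows whose key is already in the primary selection.
    extra = filler[~filler[key].isin(primary[key])]
    return pd.concat([primary, extra], ignore_index=True).head(size)

# Toy tables standing in for the filtered and the inverse (filtered-out) models.
primary = pd.DataFrame({"Name": ["m1", "m2"], "Ranking2_Z": [1.0, 2.0]})
filler = pd.DataFrame({"Name": ["m2", "m3", "m4"], "Ranking2_Z": [1.0, 2.0, 3.0]})

padded = pad_with_filler(primary, filler, size=3)
print(padded["Name"].tolist())
```

The filtered models always come first, so filler models can only occupy the positions left empty.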

In [None]:
df_norm_inv=pd.read_csv(directorio_csv+'/pydock4_all_filtered_inv.csv')

# Removing unnecessary columns (disabled for the inverse dataframe)
columnas=['Conf','RANK']
#df_norm_inv.drop(columnas, axis=1, inplace=True)
#df_norm_inv.dropna(subset=["Complex"],inplace=True)

# Removing duplicates
df_norm_inv=df_norm_inv.drop_duplicates(subset=["Name"],keep="first")
# Sanity check: Name is already unique at this point, so this should be empty
duplicados = df_norm_inv[df_norm_inv.duplicated(subset=["Name","Version","Complex","Recycle","State"])]

# Adding Total2 column: unweighted sum of the three energy terms
df_norm_inv["Total2"]=df_norm_inv["VDW"]+df_norm_inv["Ele"]+df_norm_inv["Desolv"]

# Individual Z-scores, initialization (floats, since Z-scores are not integers)
df_norm_inv["MCZ-Score"] = 0.0 # Z-score of Model_confidence
df_norm_inv["PLDDTZ-Score"] = 0.0 # Z-score of pLDDT
df_norm_inv["TEZ-Score"] = 0.0 # Z-score of Total
df_norm_inv["TE2Z-Score"] = 0.0 # Z-score of Total2

# Combined Z-scores, initialization (the energy Z-score is subtracted, so lower energy raises the score)
df_norm_inv["Sum_Z"] = 0.0 # Z(Model_confidence) - Z(Total)
df_norm_inv["Sum2_Z"] = 0.0 # Z(Model_confidence) - Z(Total2)
df_norm_inv["Z-PLT"] = 0.0 # Z(pLDDT) - Z(Total)
df_norm_inv["Z-PLT2"]= 0.0 # Z(pLDDT) - Z(Total2)

# Rankings of the combined Z-scores, initialization
df_norm_inv["Ranking_Z"] = 0.0 # Ranking by Sum_Z
df_norm_inv["Ranking2_Z"] = 0.0 # Ranking by Sum2_Z
df_norm_inv["Ranking_PLT"] = 0.0 # Ranking by Z-PLT
df_norm_inv["Ranking_PLT2"] = 0.0 # Ranking by Z-PLT2

# Per-complex means and standard deviations of the scored columns
# (restricted to numeric columns so text columns such as Name or PATH are not aggregated)
columnas_z = ["Model_confidence", "Total", "Total2", "pLDDT"]
grouped = df_norm_inv.groupby("Complex")  # string key, so each group name is a plain label
medias=grouped[columnas_z].mean()
sdesv=grouped[columnas_z].std()

# Individual Z-scores
for name, group in grouped:
    # Z-scores of Model_confidence, Total, Total2 and pLDDT within each complex
    df_norm_inv.loc[group.index,"MCZ-Score"] = (group["Model_confidence"]-medias.loc[name,"Model_confidence"])/sdesv.loc[name,"Model_confidence"]
    df_norm_inv.loc[group.index,"TEZ-Score"] = (group["Total"]-medias.loc[name,"Total"])/sdesv.loc[name,"Total"]
    df_norm_inv.loc[group.index,"TE2Z-Score"] = (group["Total2"]-medias.loc[name,"Total2"])/sdesv.loc[name,"Total2"]
    df_norm_inv.loc[group.index,"PLDDTZ-Score"] = (group["pLDDT"]-medias.loc[name,"pLDDT"])/sdesv.loc[name,"pLDDT"]

# Combined Z-scores (confidence Z-score minus energy Z-score)
df_norm_inv.loc[:,"Sum_Z"]=df_norm_inv.loc[:,"MCZ-Score"]-df_norm_inv.loc[:,"TEZ-Score"]
df_norm_inv.loc[:,"Sum2_Z"]=df_norm_inv.loc[:,"MCZ-Score"]-df_norm_inv.loc[:,"TE2Z-Score"]
df_norm_inv.loc[:,"Z-PLT"]=df_norm_inv.loc[:,"PLDDTZ-Score"]-df_norm_inv.loc[:,"TEZ-Score"]
df_norm_inv.loc[:,"Z-PLT2"]=df_norm_inv.loc[:,"PLDDTZ-Score"]-df_norm_inv.loc[:,"TE2Z-Score"]

# Ranking of the combined Z-scores within each complex (rank 1 = highest score)
for name, group in grouped:
    df_norm_inv.loc[group.index,"Ranking_Z"]=df_norm_inv.loc[group.index,"Sum_Z"].rank(ascending=False)
    df_norm_inv.loc[group.index,"Ranking2_Z"]=df_norm_inv.loc[group.index,"Sum2_Z"].rank(ascending=False)
    df_norm_inv.loc[group.index,"Ranking_PLT"]=df_norm_inv.loc[group.index,"Z-PLT"].rank(ascending=False)
    df_norm_inv.loc[group.index,"Ranking_PLT2"]=df_norm_inv.loc[group.index,"Z-PLT2"].rank(ascending=False)



In [None]:
df_norm_inv

In [None]:
df_norm_inv.to_csv(directorio_csv + "/df_norm_inv_"+Target_name+".csv", index=False)

### TOP100 
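
This section keeps, for every complex, the best-ranked models according to a ranking column. As a small sketch of that selection on invented toy data (the helper `top_n_per_complex` is hypothetical), `nsmallest` already returns the rows sorted from best to worst rank:

```python
import pandas as pd

# Toy ranking table; names and ranks are invented for illustration only.
ranked = pd.DataFrame({
    "Complex": ["A", "A", "A", "B", "B"],
    "Name": ["a1", "a2", "a3", "b1", "b2"],
    "Ranking2_Z": [2.0, 1.0, 3.0, 2.0, 1.0],
})

def top_n_per_complex(df, rank_column, n):
    """Map each complex to its n best-ranked rows (rank 1 is best)."""
    return {
        complejo: sub.nsmallest(n, rank_column)
        for complejo, sub in df.groupby("Complex")
    }

top = top_n_per_complex(ranked, "Ranking2_Z", n=2)
for complejo, sub in top.items():
    print(complejo, sub["Name"].tolist())
```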

In [None]:
df_norm=pd.read_csv(directorio_csv + "/df_norm_"+Target_name+".csv")
df_norm_inv=pd.read_csv(directorio_csv + "/df_norm_inv_"+Target_name+".csv")

In [None]:
import os

def obtener_archivos_csv(directorio_csv):
    # List the normalized tables (files named df_norm*.csv) in the directory
    archivos_csv = [archivo for archivo in os.listdir(directorio_csv)
                    if archivo.startswith('df_norm') and archivo.endswith('.csv')]
    return archivos_csv

# directorio_csv is the directory where the normalized CSV files were saved
archivos_csv = obtener_archivos_csv(directorio_csv)

if archivos_csv:
    for archivo_csv in archivos_csv:
        print(archivo_csv)
else:
    print("No CSV files were found in the directory.")


In [None]:
import shutil

Ranking = ["Ranking2_Z"]

for archivo in archivos_csv:
    # The files were listed from directorio_csv, so read them from there
    a = pd.read_csv(os.path.join(directorio_csv, archivo))
    # "df_norm_inv_<target>.csv" splits into more than 3 parts, so the suffix becomes "_inv"
    # (this assumes the target name itself contains no underscores)
    if len(archivo.split("_")) > 3:
        inv='_'+archivo.split("_")[2]
        print(inv)
    else:
        inv=''

    for complejo in a["Complex"].unique():
        df_complejo = a[a["Complex"] == complejo].copy()

        for Rank in Ranking:
            # PLT rankings are restricted to the deepfold_v1 models
            if "PLT" in Rank:
                df_filtrado = df_complejo[df_complejo["Version"] == "deepfold_v1"].copy()
            else:
                df_filtrado = df_complejo

            # Keep the 100 best-ranked models (lowest rank value) according to Rank
            top100_preorden = df_filtrado.nsmallest(100, Rank)

            # Sort by Rank so that the best model comes first
            top100_ordenado = top100_preorden.sort_values(by=Rank, ascending=True)

            # Create the output folder if it does not exist
            print (directorio , complejo , "_" ,Rank ,inv)
            nueva_carpeta = directorio + complejo + "_" + Rank + inv
            os.makedirs(nueva_carpeta, exist_ok=True)

            # Copy the files listed in the 'PATH' column
            for idx, fila in top100_ordenado.iterrows():
                ruta_original = fila['PATH']
                shutil.copy(ruta_original, nueva_carpeta)

            # Save the model names (.txt) and the full table (.csv)
            print(nueva_carpeta + "/"+ complejo + "_" + Rank + "_top100.txt")
            top100_ordenado['Name'].to_csv(nueva_carpeta + "/"+ complejo + "_" + Rank + "_top100.txt", index=False, header=False)
            top100_ordenado.to_csv(nueva_carpeta + "/"+ complejo + "_" + Rank + "_top100.csv", index=False)
            
        #print(complejo)


### Clustering

Useful for inspecting the different conformations among the selected models.
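
The clustering step is driven by a small INI file with `clustering`, `receptor` and `ligand` sections. The sketch below builds such a file in memory with the standard `configparser` module; the function name `build_cluster_config`, the file name and the chain identifiers `A` and `B` are placeholders chosen for illustration. One detail worth knowing is that `ConfigParser` lowercases option names by default (`RMSD_cutoff` becomes `rmsd_cutoff`); whether pyCluster is sensitive to this is not established here, so the sketch preserves the case explicitly:

```python
import configparser
import io

def build_cluster_config(modelslist, rmsd_cutoff, receptor_mol, ligand_mol):
    """Return the text of an INI file with clustering, receptor and ligand sections."""
    config = configparser.ConfigParser()
    # Keep option names exactly as written; by default ConfigParser
    # lowercases them (RMSD_cutoff -> rmsd_cutoff).
    config.optionxform = str
    config["clustering"] = {"modelslist": modelslist, "RMSD_cutoff": str(rmsd_cutoff)}
    config["receptor"] = {"mol": receptor_mol}
    config["ligand"] = {"mol": ligand_mol}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()

# Placeholder arguments, for illustration only.
text = build_cluster_config("T000_Ranking2_Z_top100.txt", 2, "A", "B")
print(text)
```

Writing to a `StringIO` buffer makes the result easy to inspect before saving it next to the models list.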

In [None]:
def crear_archivo_ini(modelslist, RMSD_cutoff, receptor_mol, ligand_mol, filename='pyCluster_config.ini'):
    import configparser
    directorio = os.path.dirname(modelslist)
    nombre_config = os.path.join(directorio, filename)
    modelslist=os.path.basename(modelslist)
    # Create the ConfigParser object
    config = configparser.ConfigParser()
    # Add the 'clustering' section
    config['clustering'] = {
        'modelslist': modelslist,
        'RMSD_cutoff': RMSD_cutoff
    }
    # Add the 'receptor' section
    config['receptor'] = {
        'mol': receptor_mol
    }
    # Add the 'ligand' section
    config['ligand'] = {
        'mol': ligand_mol
    }
    # Write the configuration file next to the models list
    with open(nombre_config, 'w') as configfile:
        config.write(configfile)
    return os.path.basename(nombre_config)
#cluster_list_files = [directorio +"/"+ complejo+"_"+nombre+"/"+ complejo+"_"+nombre+ "_top100.txt" for nombre in Ranking]
cluster_list_files = [directorio +"/"+ Target_name+"_"+nombre+"/"+ Target_name+"_"+nombre+ "_top100.txt" for nombre in Ranking]

# for cluster_list_file in cluster_list_files:
#      # Run pydock4 pyCluster
#      INI_FILE = crear_archivo_ini(cluster_list_file, 2, receptor_mol,ligand_mol)
#      #INI_FILE = crear_archivo_ini(cluster_list_file, 4, receptor_mol,ligand_mol) #T266
#      DIR_NAME = os.path.dirname(cluster_list_file)
#      print (DIR_NAME)
#      #subprocess.call("pydock4 "+INI_FILE.strip(".ini")+" pyCluster", cwd=DIR_NAME, shell=True)
#      # Generate the CSV files for the clustered rankings
     



In [None]:
top100_ordenado =pd.read_csv(directorio + complejo + "_" + Rank + "/"+ complejo + "_" + Rank + "_top100.csv")
# clustered_list_file= pd.read_csv(DIR_NAME +"/cluster_pyCluster_config.list", header=None)

# directorio + complejo + "_" + Rank 
# clustered_list_file.columns=['Name']
# cols=top100_ordenado.columns
# clustered_all_pydock = clustered_list_file.merge(top100_ordenado, on= 'Name')
# # Reorder the columns to match those of top100_ordenado
# clustered_all_pydock = clustered_all_pydock.reindex(columns=top100_ordenado.columns)

# # Save the result to a CSV file
# clustered_all_pydock.to_csv(DIR_NAME + "/" + Target_name + "_cluster_pyCluster_config.csv", index=False)

# # Compute the difference and add it as a new column
# clustered_all_pydock['Diferencia_R2_Z'] = clustered_all_pydock['Ranking2_Z'].diff(periods=-1) * -1

# # Convert the selected columns to numeric type and handle errors
# cols_to_convert = ['Ranking_Z', 'Ranking2_Z', 'Ranking_PLT', 'Ranking_PLT2', 'Diferencia_R2_Z']
# clustered_all_pydock[cols_to_convert] = clustered_all_pydock[cols_to_convert].apply(pd.to_numeric, errors='coerce').astype('Int64')

# # Logic to choose the dataset
# elegir_top100 = False

# # Condition 1: clustered_all_pydock has more than one row
# if len(clustered_all_pydock) > 1:
#     # Condition 2: any of the first 5 values of Diferencia_R2_Z is greater than 10
#     if (clustered_all_pydock['Diferencia_R2_Z'].head(5) > 10).any():
#         elegir_top100 = True
# else:
#     elegir_top100 = True

# # Select the DataFrame based on the conditions
# if elegir_top100:
#     df_to_send = top100_ordenado
# else:
#      # Select the rows of top100_ordenado that are not in clustered_all_pydock
#     inverse_selection = top100_ordenado[~top100_ordenado['Name'].isin(clustered_all_pydock['Name'])]

#     # Concatenate clustered_all_pydock with the inverse selection
#     df_to_send = pd.concat([clustered_all_pydock, inverse_selection], ignore_index=True)
#     cols_to_convert = ['Ranking_Z', 'Ranking2_Z', 'Ranking_PLT', 'Ranking_PLT2', 'Diferencia_R2_Z']
#     df_to_send[cols_to_convert] = df_to_send[cols_to_convert].apply(pd.to_numeric, errors='coerce').astype('Int64')




In [None]:
# Skip the clustering step and send the normalized table directly
saltar_clus=True
if saltar_clus:
    df_to_send=df_norm

# Top 100 of the inverse (filtered-out) models, used only as filler
nueva_carpeta = directorio + complejo + "_" + Rank + '_inv'
result_df_inverted2= pd.read_csv(nueva_carpeta + "/"+ complejo + "_" + Rank + "_top100.csv")
result_df_inverted2=result_df_inverted2.sort_values('Ranking_PLT2')
# Pad with filler models when fewer than 100 models are available
if len(df_to_send)< 100:
    df_to_send = pd.concat([df_to_send, result_df_inverted2], ignore_index=True)
    df_to_send = df_to_send[:100]
df_to_send

### Copy to the To_send directory

- Scorers: "to send" in lowercase
- Predictors: "to send" in uppercase

In [None]:
os.makedirs(to_send_dir,exist_ok=True)
# Copy every selected model into the submission directory
for file in df_to_send['PATH']:
    shutil.copy(file,to_send_dir)
df_to_send.to_csv(to_send_csv, index=False)
# Note: str.replace swaps every 'ene' in the path, not only the extension
df_to_send.to_csv(to_send_csv.replace('ene','csv'), index=False)
df_to_send['Name'].to_csv(to_send_csv.replace('ene','txt'), index=False, header=False)

### Extra target T264 T265


We read a file with the precomputed RMSD values and select the models for each target.
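
As a rough sketch of this kind of selection, the example below keeps the models whose RMSD to at least one reference is below a cutoff. The 8 cutoff, the `Model_name` column and the `.pdb` suffix follow the commented cell in this section; the RMSD column names, the helper `select_by_rmsd` and the toy values are assumptions made for illustration:

```python
import pandas as pd

# Toy RMSD table; column names other than Model_name are hypothetical.
rmsd = pd.DataFrame({
    "Model_name": ["m1", "m2", "m3"],
    "RMSD_ref1": [3.2, 12.0, 9.5],
    "RMSD_ref2": [15.0, 11.0, 7.9],
})

def select_by_rmsd(df, rmsd_columns, cutoff=8.0):
    """Return the file names of models within `cutoff` of at least one reference."""
    close_to_any = (df[rmsd_columns] < cutoff).any(axis=1)
    return (df.loc[close_to_any, "Model_name"] + ".pdb").tolist()

selected = select_by_rmsd(rmsd, ["RMSD_ref1", "RMSD_ref2"])
print(selected)
```

Selecting columns by name rather than by position makes it harder to mix up the references of the two targets.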

In [None]:
3500*7/112/60

In [None]:
8000*7*12/60/60/112

In [None]:
# RMSD_clusters_selection=pd.read_csv(to_send_dir + "/"+ "RMSD_selection",sep=" ")
# RMSD_clusters_selection = RMSD_clusters_selection.drop(index=0)
# RMSD_clusters_selection =RMSD_clusters_selection.drop(columns="Unnamed: 5")
# RMSD_clusters_selection


In [None]:
# os.makedirs(to_send_dir+'/T265',exist_ok=True)
# filtered_model_names_T265 = RMSD_clusters_selection[(RMSD_clusters_selection.iloc[:, 1] < 8) | (RMSD_clusters_selection.iloc[:, 3] < 8)]["Model_name"]+'.pdb'
# filtered_model_names_T265=filtered_model_names_T265.to_frame()
# filtered_model_names_T265.to_csv(to_send_dir+'/T265'+'/T265_predictor_to_send.txt',index=False,header=None)
# for file in filtered_model_names_T265['Model_name']:
#     shutil.copy(to_send_dir+file,to_send_dir+'/T265')

# os.makedirs(to_send_dir+'/T264',exist_ok=True)
# filtered_model_names_T264 = RMSD_clusters_selection[(RMSD_clusters_selection.iloc[:, 2] < 8) | (RMSD_clusters_selection.iloc[:, 4] < 8)]["Model_name"]+'.pdb'
# filtered_model_names_T264=filtered_model_names_T264.to_frame()
# filtered_model_names_T264.to_csv(to_send_dir+'/T264'+'/T264_predictor_to_send.txt',index=False,header=None)
# for file in filtered_model_names_T264['Model_name']:
#     shutil.copy(to_send_dir+file,to_send_dir+'/T264')


T288 models:

- fold_t288_model_3.pdb
- fold_t288_model_0.pdb
- fold_t288_model_1.pdb
- T288_unrelaxed_rank_003_alphafold2_multimer_v2_model_1_seed_000.r19.pdb
- T288_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_001.r20.pdb
- T288_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_001.r19.pdb
- T288_unrelaxed_rank_009_alphafold2_multimer_v3_model_3_seed_000.r19.pdb
- T288_unrelaxed_rank_009_alphafold2_multimer_v3_model_3_seed_000.r18.pdb
- T288_unrelaxed_rank_009_alphafold2_multimer_v3_model_3_seed_000.r17.pdb
- T288_relaxed_rank_001_alphafold2_multimer_v3_model_1_seed_001.pdb
- T288_unrelaxed_rank_008_alphafold2_multimer_v3_model_2_seed_001.r6.pdb

T292 models:

- T292_unrelaxed_rank_002_alphafold2_multimer_v3_model_1_seed_001.r16.pdb
- T292_unrelaxed_rank_002_alphafold2_multimer_v3_model_1_seed_001.r12.pdb
- T292_unrelaxed_rank_002_alphafold2_multimer_v3_model_1_seed_001.r14.pdb
- T292_unrelaxed_rank_002_alphafold2_multimer_v3_model_1_seed_001.r13.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r11.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r19.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r15.pdb
- T292_unrelaxed_rank_004_alphafold2_multimer_v2_model_4_seed_000.r17.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r14.pdb
- T292_unrelaxed_rank_004_alphafold2_multimer_v2_model_4_seed_000.r20.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r10.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r16.pdb
- T292_unrelaxed_rank_004_alphafold2_multimer_v2_model_4_seed_000.pdb
- T292_unrelaxed_rank_004_alphafold2_multimer_v2_model_4_seed_000.r18.pdb
- T292_relaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.pdb
- T292_unrelaxed_rank_004_alphafold2_multimer_v2_model_4_seed_000.r16.pdb
- T292_relaxed_rank_004_alphafold2_multimer_v2_model_4_seed_000.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r20.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r13.pdb
- T292_unrelaxed_rank_001_alphafold2_multimer_v2_model_3_seed_000.r12.pdb

In [None]:
print("1")

In [None]:
print("2")