# "AI-Assisted Designing of Novel Anticancer Compounds" project 
## using PKM2 (Pyruvate Kinase M2) as the target protein

Pyruvate kinase isozymes M1/M2 (PKM1/M2), also known as pyruvate kinase muscle isozyme (PKM), pyruvate kinase type K, cytosolic thyroid hormone-binding protein (CTHBP), thyroid hormone-binding protein 1 (THBP1), or opa-interacting protein 3 (OIP3), is an enzyme that in humans is encoded by the PKM gene (previously designated PKM2).

PKM2 is an isoenzyme of the glycolytic enzyme pyruvate kinase. Different isoenzymes of pyruvate kinase are expressed depending on the metabolic functions of each tissue. PKM2 is expressed in some differentiated tissues, such as lung, fat tissue, retina, and pancreatic islets, as well as in all cells with a high rate of nucleic acid synthesis, such as normal proliferating cells, embryonic cells, and especially tumor cells.

#### Structure
Two isozymes are encoded by the PKM gene: PKM1 and PKM2. The M-gene consists of 12 exons and 11 introns. PKM1 and PKM2 are different splicing products of the M-gene (exon 9 for PKM1 and exon 10 for PKM2) and differ only in 23 amino acids within a 56-amino acid stretch (aa 378–434) at their carboxy terminus.

#### Clinical significance
##### Bi-functional role within tumors
PKM2 is expressed in most human tumors. Initially, a switch from PKM1 to PKM2 expression during tumorigenesis was discussed. These conclusions, however, resulted from a misinterpretation of western blots that had used PKM1-expressing mouse muscle as the sole non-cancer tissue. In clinical cancer samples, only an up-regulation of PKM2, but no cancer specificity, could be confirmed.

In contrast to the closely homologous PKM1, which always occurs in a highly active tetrameric form and is not allosterically regulated, PKM2 may occur in either a tetrameric or a dimeric form. The tetrameric form of PKM2 has a high affinity for its substrate phosphoenolpyruvate (PEP) and is highly active at physiological PEP concentrations. When PKM2 is mainly in the highly active tetrameric form, which is the case in differentiated tissues and most normal proliferating cells, glucose is converted to pyruvate with the production of energy. The dimeric form of PKM2, by contrast, has a low affinity for PEP and is nearly inactive at physiological PEP concentrations. Dimeric PKM2 produces little to no ATP in the conversion of PEP to pyruvate, making the net ATP yield of glycolysis zero. When PKM2 is mainly in the less active dimeric form, which is the case in tumor cells, all glycolytic intermediates above pyruvate kinase accumulate and are channelled into synthetic processes that branch off from glycolysis, such as nucleic acid, phospholipid, and amino acid synthesis. Nucleic acids, phospholipids, and amino acids are important cell building blocks that are greatly needed by highly proliferating cells such as tumor cells.

Due to the key position of pyruvate kinase within glycolysis, the tetramer:dimer ratio of PKM2 determines whether glucose carbons are converted to pyruvate and lactate with the production of energy (tetrameric form) or channelled into synthetic processes (dimeric form). However, even if PKM2 activity is low, leading to the diversion of upstream intermediates to synthetic processes, pyruvate and lactate will still be made using carbon atoms from glucose and other metabolites through 86 pathways bypassing pyruvate kinase. These pyruvate kinase bypassing pathways are different from those participating in gluconeogenesis. Interestingly, many of them use metabolites that transit through mitochondria, highlighting the importance of mitochondria in cancer metabolism irrespective of oxidative phosphorylation.

In tumor cells, PKM2 is mainly in the dimeric form and has therefore been termed Tumor M2-PK. The quantification of Tumor M2-PK in plasma and stool is a tool for early detection of tumors and for follow-up studies during therapy. The dimerization of PKM2 in tumor cells is induced by direct interaction of PKM2 with different oncoproteins (pp60v-src, HPV-16 E7, and A-Raf). The physiological function of the interaction between PKM2 and HERC1, as well as between PKM2 and PKCdelta, is unknown. PKM2 plays an essential role in aerobic glycolysis (the Warburg effect), a dominant metabolic pathway used by cancer cells. Counteracting PKM2 in this pathway in macrophages may lead to a better outcome in experimental sepsis. PKM2 is also a regulator of LPS- and tumor-induced PD-L1 expression on macrophages and dendritic cells as well as tumor cells.

### Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import requests
from chembl_webresource_client.new_client import new_client
import nglview as nv
from Bio.PDB import PDBList
import os



### Retrieving the Protein

In [2]:
# Download PKM2 structure (PDB ID: 4G1N)
pdbl = PDBList()
pdbl.retrieve_pdb_file('4G1N', pdir='.', file_format='pdb')

Downloading PDB structure '4g1n'...


'.\\pdb4g1n.ent'

In [None]:
# Rename the downloaded file for clarity
os.rename('pdb4g1n.ent', 'PKM2_4G1N.pdb')
print("PKM2 structure downloaded as PKM2_4G1N.pdb")

In [4]:
# Load and visualize the PDB file
structure_file = 'PKM2_4G1N.pdb'
view = nv.show_file(structure_file)
view.add_representation('cartoon', selection='protein')  # Show protein backbone
view.add_representation('ball+stick', selection='ligand')  # Show ligand (if present)
view.center()
view  # Display the interactive viewer

NGLWidget()

### Download Ligands from ChEMBL and DrugBank

In [12]:
from chembl_webresource_client.new_client import new_client

# Initialize ChEMBL client
molecule = new_client.molecule
activity = new_client.activity

# fetch ChEMBL data and save partial results
def fetch_chembl_data(target_chembl_id=None, batch_size=1000, total_limit=10000):
    all_data = []
    offset = 0
    
    while len(all_data) < total_limit:
        if target_chembl_id:
            res = activity.filter(target_chembl_id=target_chembl_id, 
                                 standard_type__in=['IC50', 'Ki']).order_by('molecule_chembl_id')[offset:offset+batch_size]
        else:
            res = activity.filter(target_organism='Homo sapiens', 
                                 standard_type__in=['IC50', 'Ki']).order_by('molecule_chembl_id')[offset:offset+batch_size]
        
        if not res:
            break
        
        for entry in res:
            standard_value = entry['standard_value']
            if standard_value is None:
                continue
                
            mol_data = molecule.get(entry['molecule_chembl_id'])
            if mol_data and 'molecule_structures' in mol_data and mol_data['molecule_structures']:
                try:
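                    # Label as active (1) below 10,000; this assumes standard_value is in nM,
                    # because standard_units is not checked here
                    activity_label = 1 if float(standard_value) < 10000 else 0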
                    all_data.append({
                        'Compound_ID': entry['molecule_chembl_id'],
                        'SMILES': mol_data['molecule_structures']['canonical_smiles'],
                        'Activity_Label': activity_label
                    })
                except (ValueError, TypeError):
                    continue
        
        offset += batch_size
        print(f"Fetched {len(all_data)} ChEMBL compounds so far...")
        
        # Once 4,700 or more compounds are collected, save partial data after each batch and offer to stop
        if len(all_data) >= 4700:
            partial_df = pd.DataFrame(all_data)
            partial_df.to_csv('chembl_anticancer_compounds_partial.csv', index=False)
            print(f"Saved {len(all_data)} compounds to chembl_anticancer_compounds_partial.csv")
            stop_now = input("Want to stop now? (yes/no): ")
            if stop_now.lower() == 'yes':
                return partial_df
    
    final_df = pd.DataFrame(all_data[:total_limit])
    final_df.to_csv('chembl_anticancer_compounds.csv', index=False)
    return final_df

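# Batch 2: broad background set. No target filter is passed, so this fetches
# IC50/Ki records for any Homo sapiens target rather than a curated anticancer list.
anticancer_chembl_data = fetch_chembl_data(total_limit=9500)
# If the run is stopped at the prompt, the data is in chembl_anticancer_compounds_partial.csv
# and chembl_anticancer_compounds.csv is not written.
print(f"Final save: {len(anticancer_chembl_data)} compounds to chembl_anticancer_compounds.csv")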

Fetched 817 ChEMBL compounds so far...
Fetched 836 ChEMBL compounds so far...
Fetched 836 ChEMBL compounds so far...
Fetched 856 ChEMBL compounds so far...
Fetched 857 ChEMBL compounds so far...
Fetched 877 ChEMBL compounds so far...
Fetched 892 ChEMBL compounds so far...
Fetched 911 ChEMBL compounds so far...
Fetched 931 ChEMBL compounds so far...
Fetched 948 ChEMBL compounds so far...
Fetched 968 ChEMBL compounds so far...
Fetched 988 ChEMBL compounds so far...
Fetched 1006 ChEMBL compounds so far...
Fetched 1026 ChEMBL compounds so far...
Fetched 1038 ChEMBL compounds so far...
Fetched 1042 ChEMBL compounds so far...
Fetched 1043 ChEMBL compounds so far...
Fetched 1062 ChEMBL compounds so far...
Fetched 1062 ChEMBL compounds so far...
Fetched 1080 ChEMBL compounds so far...
Fetched 1099 ChEMBL compounds so far...
Fetched 1115 ChEMBL compounds so far...
Fetched 1134 ChEMBL compounds so far...
Fetched 1154 ChEMBL compounds so far...
Fetched 1174 ChEMBL compounds so far...
Fetched 1194

Want to stop now? (yes/no):  yes


Final save: 4718 compounds to chembl_anticancer_compounds.csv
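
The labelling step in `fetch_chembl_data` compares `standard_value` against 10000 without looking at its units. A minimal sketch of a unit-aware variant is shown below. The helper name `label_activity`, the `threshold_nm` parameter, and the unit table are assumptions made for illustration; ChEMBL activity records do carry a `standard_units` field, but which unit strings appear in a given query should be checked against the data.

```python
def label_activity(standard_value, standard_units, threshold_nm=10_000.0):
    """Return 1 (active), 0 (inactive), or None if the value cannot be compared.

    Hypothetical helper: converts a few molar units to nM before thresholding.
    """
    # Conversion factors to nM (assumed unit spellings; extend as needed)
    to_nm = {'pM': 1e-3, 'nM': 1.0, 'uM': 1e3, 'mM': 1e6, 'M': 1e9}
    if standard_value is None or standard_units not in to_nm:
        return None  # skip records with missing values or non-molar units
    try:
        value_nm = float(standard_value) * to_nm[standard_units]
    except (TypeError, ValueError):
        return None
    return 1 if value_nm < threshold_nm else 0


print(label_activity('500', 'nM'))
print(label_activity('20', 'uM'))
print(label_activity('10', 'ug.mL-1'))
```

Returning `None` for unknown units lets the caller skip the record rather than assign a label that may be wrong by orders of magnitude.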


In [13]:
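# Load CSV files
# chembl_pkm2_inhibitors.csv is expected from an earlier PKM2-specific fetch (Batch 1), which is not shown here
chembl_pkm2 = pd.read_csv('chembl_pkm2_inhibitors.csv')
chembl_anticancer = pd.read_csv('chembl_anticancer_compounds_partial.csv')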

# Standardize columns
def standardize_df(df, source):
    df = df[['Compound_ID', 'SMILES', 'Activity_Label']].copy()
    df['Source'] = source
    return df

chembl_pkm2_std = standardize_df(chembl_pkm2, 'ChEMBL_PKM2')
chembl_anticancer_std = standardize_df(chembl_anticancer, 'ChEMBL_Anticancer')

# Combine datasets
combined_df = pd.concat([chembl_pkm2_std, chembl_anticancer_std], ignore_index=True)

# Remove duplicates based on SMILES
combined_df.drop_duplicates(subset=['SMILES'], keep='first', inplace=True)

# Report current total
print(f"Total unique compounds so far: {len(combined_df)}")
combined_df.to_csv('combined_ligands_partial.csv', index=False)
print("Combined partial dataset saved to combined_ligands_partial.csv")

Total unique compounds so far: 2187
Combined partial dataset saved to combined_ligands_partial.csv
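
Because the broad fetch collects one row per activity record, the same compound can appear several times, possibly with different labels, and `drop_duplicates(keep='first')` keeps whichever row happens to come first. The sketch below shows one way to list such conflicts before deduplicating. The helper `find_label_conflicts` and the toy compounds are hypothetical and are not taken from the notebook's CSV files.

```python
import pandas as pd


def find_label_conflicts(df, key='SMILES', label='Activity_Label'):
    """Return the rows whose key value occurs with more than one distinct label."""
    n_labels = df.groupby(key)[label].transform('nunique')
    return df[n_labels > 1]


# Toy data for illustration only
toy = pd.DataFrame({
    'Compound_ID': ['A', 'B', 'C', 'D'],
    'SMILES': ['CCO', 'CCO', 'c1ccccc1', 'CCN'],
    'Activity_Label': [1, 0, 1, 0],
})

conflicts = find_label_conflicts(toy)
print(conflicts)
```

How to resolve a conflict (majority vote, most potent measurement, or dropping the compound) is a modelling decision that the notebook does not specify.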


#### Setup and Load Data

In [7]:
# Import necessary libraries
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

# Load your combined dataset
df = pd.read_csv('combined_ligands_partial.csv')

In [9]:
# Function to preprocess a single SMILES string
def preprocess_smiles(smiles):
    try:
        # Convert SMILES to an RDKit molecule object
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # Check if SMILES is invalid
            return None
        
        # Add hydrogens to the molecule
        mol = Chem.AddHs(mol)
        
        # Generate basic descriptors
        mw = Descriptors.MolWt(mol)      # Molecular weight
        logp = Descriptors.MolLogP(mol)  # LogP (hydrophobicity)
        hba = Descriptors.NumHAcceptors(mol)  # Hydrogen bond acceptors
        hbd = Descriptors.NumHDonors(mol)     # Hydrogen bond donors
        
        # Return a dictionary with results
        return {
            'SMILES': Chem.MolToSmiles(mol),  # Canonical SMILES (contains explicit [H] because hydrogens were added above)
            'MolWt': mw,
            'LogP': logp,
            'HBA': hba,
            'HBD': hbd
        }
    except Exception:
        return None  # Return None if anything fails

In [10]:
# Process all SMILES in the dataset
processed_data = df['SMILES'].apply(preprocess_smiles)

# Filter out any invalid SMILES (where result is None)
valid_data = [d for d in processed_data if d is not None]

print(f"Processed {len(valid_data)} valid compounds (out of {len(df)})")

Processed 2187 valid compounds (out of 2187)


##### Build and Merge the Processed DataFrame

In [11]:
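# Create a DataFrame from the processed data, keeping each compound's original
# row index so the merge below stays aligned even if some SMILES were dropped
valid_index = processed_data[processed_data.notna()].index
processed_df = pd.DataFrame(valid_data, index=valid_index)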

# Merge with original columns (Compound_ID, Activity_Label, Source)
final_df = pd.merge(df[['Compound_ID', 'Activity_Label', 'Source']], 
                    processed_df, 
                    left_index=True, 
                    right_index=True, 
                    how='right')

# Preview the first few rows
print("Processed data preview:")
print(final_df.head())

Processed data preview:
     Compound_ID  Activity_Label       Source  \
0   CHEMBL106202               1  ChEMBL_PKM2   
1  CHEMBL1083499               0  ChEMBL_PKM2   
2  CHEMBL1084908               0  ChEMBL_PKM2   
3  CHEMBL1088332               0  ChEMBL_PKM2   
4   CHEMBL109044               0  ChEMBL_PKM2   

                                              SMILES    MolWt     LogP  HBA  \
0  [H]OC(=O)C([H])([H])[C@@]([H])(C(=O)N1C([H])([...  617.704 -0.18820    9   
1  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  423.424  0.28649    7   
2  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  434.407 -0.15025    8   
3  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  443.842  0.63147    7   
4  [H]C(=O)[C@@]([H])(N([H])C(=O)[C@]1([H])C([H])...  454.528 -1.02710    7   

   HBD  
0    6  
1    5  
2    5  
3    5  
4    4  


In [13]:
# Save the processed dataset
final_df.to_csv('processed_ligands_2187.csv', index=False)
print("Processed dataset saved to processed_ligands_2187.csv")

Processed dataset saved to processed_ligands_2187.csv


#### Add 3D Structures
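
The next cell calls `preprocess_smiles_3d`, whose definition does not appear in this notebook. The following is a minimal sketch of what such a function might look like, not the original implementation. It assumes ETKDG embedding followed by UFF relaxation (the `UFFTYPER` warnings in the output suggest UFF was involved) and returns the same keys that the preview below shows. The `seed` and `max_iters` parameters are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors


def preprocess_smiles_3d(smiles, seed=42, max_iters=200):
    """Sketch: return descriptors only for molecules that can be embedded in 3D."""
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # invalid SMILES
            return None
        mol = Chem.AddHs(mol)  # explicit hydrogens are needed for a sensible geometry

        # Generate one conformer with ETKDG; EmbedMolecule returns -1 on failure
        params = AllChem.ETKDGv3()
        params.randomSeed = seed
        if AllChem.EmbedMolecule(mol, params) == -1:
            return None

        # Relax the geometry with UFF only when every atom has UFF parameters
        if AllChem.UFFHasAllMoleculeParams(mol):
            AllChem.UFFOptimizeMolecule(mol, maxIters=max_iters)

        return {
            'SMILES': Chem.MolToSmiles(mol),
            'MolWt': Descriptors.MolWt(mol),
            'LogP': Descriptors.MolLogP(mol),
            'HBA': Descriptors.NumHAcceptors(mol),
            'HBD': Descriptors.NumHDonors(mol),
        }
    except Exception:
        return None  # treat any RDKit failure as an unusable compound


result = preprocess_smiles_3d('CCO')
print(result)
```

In this sketch the conformer acts only as a validity filter. Keeping the coordinates, for example via `Chem.MolToMolBlock(mol)`, would require an extra column.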

In [15]:
# Process all SMILES in 3D 
processed_data_3d = df['SMILES'].apply(preprocess_smiles_3d)

# Filter out invalid SMILES
valid_data_3d = [d for d in processed_data_3d if d is not None]

print(f"Processed {len(valid_data_3d)} valid compounds in 3D (out of {len(df)})")

[05:55:14] UFFTYPER: Unrecognized charge state for atom: 1
[05:55:15] UFFTYPER: Unrecognized charge state for atom: 1
[05:58:18] UFFTYPER: Unrecognized charge state for atom: 1
[05:58:18] UFFTYPER: Unrecognized charge state for atom: 1
[06:01:17] UFFTYPER: Unrecognized atom type: S_6+6 (1)
[06:01:17] UFFTYPER: Unrecognized atom type: S_6+6 (1)
[06:01:37] UFFTYPER: Unrecognized atom type: Se2+2 (20)
[06:01:37] UFFTYPER: Unrecognized atom type: Se2+2 (20)
[06:01:37] UFFTYPER: Unrecognized atom type: Se2+2 (5)
[06:01:37] UFFTYPER: Unrecognized atom type: Se2+2 (5)
[06:01:44] UFFTYPER: Unrecognized charge state for atom: 1
[06:01:44] UFFTYPER: Unrecognized charge state for atom: 1
[06:02:04] UFFTYPER: Unrecognized charge state for atom: 29
[06:02:05] UFFTYPER: Unrecognized charge state for atom: 29
[06:05:09] UFFTYPER: Unrecognized charge state for atom: 6
[06:05:09] UFFTYPER: Unrecognized charge state for atom: 6
[06:05:09] UFFTYPER: Unrecognized charge state for atom: 7
[06:05:09] UFFTYP

Processed 2179 valid compounds in 3D (out of 2187)
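
Eight compounds were dropped here, so it is worth checking that any later index-based merge still pairs each structure with its own `Compound_ID` and label. The toy example below illustrates the pitfall under simplified assumptions: a DataFrame rebuilt from a plain list is renumbered from 0, whereas passing the original index keeps the rows aligned. All names and data in it are hypothetical.

```python
import pandas as pd

df_toy = pd.DataFrame({'Compound_ID': ['A', 'B', 'C', 'D'],
                       'SMILES': ['CCO', 'bad', 'CCN', 'CCC']})

# Stand-in for a per-row processing step that fails on some inputs
results = df_toy['SMILES'].apply(
    lambda s: None if s == 'bad' else {'Processed_SMILES': s})
valid = results.dropna()  # keeps the original index labels 0, 2, 3

# Misaligned: a plain list loses the index, so rows are renumbered 0, 1, 2
renumbered = pd.DataFrame(list(valid))
wrong = pd.merge(df_toy[['Compound_ID']], renumbered,
                 left_index=True, right_index=True, how='right')

# Aligned: carry the original index into the new DataFrame
kept = pd.DataFrame(valid.tolist(), index=valid.index)
right = pd.merge(df_toy[['Compound_ID']], kept,
                 left_index=True, right_index=True, how='right')

print(wrong)   # compound B is paired with the structure of C
print(right)   # A, C, D keep their own structures
```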


In [16]:
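# Create a DataFrame from 3D processed data, keeping each compound's original
# row index; 8 compounds were dropped, so a fresh 0..n-1 index would misalign the merge
valid_index_3d = processed_data_3d[processed_data_3d.notna()].index
processed_df_3d = pd.DataFrame(valid_data_3d, index=valid_index_3d)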

# Merge with original columns
final_df_3d = pd.merge(df[['Compound_ID', 'Activity_Label', 'Source']], 
                      processed_df_3d, 
                      left_index=True, 
                      right_index=True, 
                      how='right')

# Check the first few rows
print("3D processed data preview:")
print(final_df_3d.head())

3D processed data preview:
     Compound_ID  Activity_Label       Source  \
0   CHEMBL106202               1  ChEMBL_PKM2   
1  CHEMBL1083499               0  ChEMBL_PKM2   
2  CHEMBL1084908               0  ChEMBL_PKM2   
3  CHEMBL1088332               0  ChEMBL_PKM2   
4   CHEMBL109044               0  ChEMBL_PKM2   

                                              SMILES    MolWt     LogP  HBA  \
0  [H]OC(=O)C([H])([H])[C@@]([H])(C(=O)N1C([H])([...  617.704 -0.18820    9   
1  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  423.424  0.28649    7   
2  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  434.407 -0.15025    8   
3  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  443.842  0.63147    7   
4  [H]C(=O)[C@@]([H])(N([H])C(=O)[C@]1([H])C([H])...  454.528 -1.02710    7   

   HBD  
0    6  
1    5  
2    5  
3    5  
4    4  


In [17]:
# Save the 3D processed dataset
final_df_3d.to_csv('processed_ligands_2187_3d.csv', index=False)
print("3D processed dataset saved to processed_ligands_2187_3d.csv")
print(f"Final 3D compounds: {len(final_df_3d)}")

3D processed dataset saved to processed_ligands_2187_3d.csv
Final 3D compounds: 2179


#### Expand Feature Extraction

In [19]:
# Load your 3D processed data
df = pd.read_csv('processed_ligands_2187_3d.csv')

# Function to calculate additional descriptors
def calculate_additional_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    tpsa = Descriptors.TPSA(mol)
    ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)  # Legacy API; newer RDKit releases offer rdFingerprintGenerator
    return {'TPSA': tpsa, 'ECFP': ecfp}

# Apply to all SMILES
additional_data = df['SMILES'].apply(calculate_additional_descriptors)
valid_additional = [d for d in additional_data if d is not None]

# Convert to DataFrame
additional_df = pd.DataFrame(valid_additional)
ecfp_bits = [list(d['ECFP']) for d in valid_additional]
ecfp_df = pd.DataFrame(ecfp_bits, columns=[f'ECFP_{i}' for i in range(2048)])

# Merge with existing data
feature_df = pd.concat([df.iloc[additional_df.index], additional_df[['TPSA']], ecfp_df], axis=1)
print(f"Feature matrix created with {len(feature_df)} compounds")
print(feature_df.head())



Feature matrix created with 2179 compounds
     Compound_ID  Activity_Label       Source  \
0   CHEMBL106202               1  ChEMBL_PKM2   
1  CHEMBL1083499               0  ChEMBL_PKM2   
2  CHEMBL1084908               0  ChEMBL_PKM2   
3  CHEMBL1088332               0  ChEMBL_PKM2   
4   CHEMBL109044               0  ChEMBL_PKM2   

                                              SMILES    MolWt     LogP  HBA  \
0  [H]OC(=O)C([H])([H])[C@@]([H])(C(=O)N1C([H])([...  617.704 -0.18820    9   
1  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  423.424  0.28649    7   
2  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  434.407 -0.15025    8   
3  [H]N=C(N([H])[H])N([H])OC([H])([H])C([H])([H])...  443.842  0.63147    7   
4  [H]C(=O)[C@@]([H])(N([H])C(=O)[C@]1([H])C([H])...  454.528 -1.02710    7   

   HBD    TPSA  ECFP_0  ...  ECFP_2038  ECFP_2039  ECFP_2040  ECFP_2041  \
0    6  235.61       0  ...          0          0          0          0   
1    5  147.15       0  ...          
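
In recent RDKit releases, `GetMorganFingerprintAsBitVect` emits a deprecation warning in favour of the fingerprint generator API. The sketch below shows the same radius-2, 2048-bit Morgan fingerprint computed with `rdFingerprintGenerator`, returned directly as a NumPy array. The helper name `ecfp_array` is hypothetical, and the availability of this API depends on the installed RDKit version.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# One generator can be reused for every molecule
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def ecfp_array(smiles):
    """Return the Morgan fingerprint as a 2048-element array, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return morgan_gen.GetFingerprintAsNumPy(mol)


# Stack per-molecule arrays into a feature matrix (one row per compound)
features = np.vstack([ecfp_array(s) for s in ['CCO', 'CCN']])
print(features.shape)
```

Building the matrix from arrays also avoids converting each bit vector to a Python list before constructing the DataFrame.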

In [20]:
from Bio.PDB import PDBParser, PDBIO, Select
import subprocess

# Define a class to select only protein atoms (remove water, ligands)
class ProteinSelect(Select):
    def accept_residue(self, residue):
        # Standard polymer residues have a blank hetero flag; waters ('W') and
        # other hetero groups such as ligands ('H_...') are dropped.
        # Note: modified residues stored as HETATM are removed as well.
        return residue.id[0] == ' '

# Load and clean the PDB file
parser = PDBParser(QUIET=True)
structure = parser.get_structure('PKM2', 'PKM2_4G1N.pdb')
io = PDBIO()
io.set_structure(structure)
io.save('PKM2_4G1N_cleaned.pdb', ProteinSelect())

# Add hydrogens (requires Open Babel installed: pip install openbabel-wheel)
# check=True raises an error if obabel fails, so the message below is only printed on success
subprocess.run(['obabel', 'PKM2_4G1N_cleaned.pdb', '-O', 'PKM2_4G1N_prepped.pdb', '-h'], check=True)
print("Protein preprocessed and saved as PKM2_4G1N_prepped.pdb")

Protein preprocessed and saved as PKM2_4G1N_prepped.pdb
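
Before stripping hetero groups from a structure, it can help to see which ones the file actually contains. The sketch below counts distinct `HETATM` residues by residue name using the fixed-column PDB layout (residue name in columns 18 to 20). The helper `count_hetero_residues` and the sample lines, including the residue name `XYZ`, are hypothetical and are not taken from 4G1N.

```python
from collections import Counter


def count_hetero_residues(pdb_lines):
    """Count distinct HETATM residues per residue name."""
    seen = set()
    counts = Counter()
    for line in pdb_lines:
        if not line.startswith('HETATM'):
            continue
        resname = line[17:20].strip()
        chain = line[21]
        resseq = line[22:27].strip()  # residue number plus insertion code
        key = (resname, chain, resseq)
        if key not in seen:  # count each residue once, not each atom
            seen.add(key)
            counts[resname] += 1
    return counts


# Hypothetical, truncated sample lines for illustration
sample = [
    "ATOM      1  N   MET A   1 ",
    "HETATM 4001  O   HOH A 601 ",
    "HETATM 4002  O   HOH A 602 ",
    "HETATM 4003  C1  XYZ A 701 ",
    "HETATM 4004  C2  XYZ A 701 ",
]
counts = count_hetero_residues(sample)
print(counts)
```

On a real file the same function can be applied to `open('PKM2_4G1N.pdb')`, and the result compared with what the cleaning step is meant to remove.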


In [22]:
# Save the final feature matrix
feature_df.to_csv('feature_matrix_2179.csv', index=False)
print("Feature matrix saved to feature_matrix_2179.csv")

Feature matrix saved to feature_matrix_2179.csv
