# Feature

> A collection of tools to extract features from SMILES, proteins, etc.

In [None]:
#| default_exp feature

## Overview

This module provides tools to extract features from SMILES (chemical compounds) and protein sequences for machine learning applications.

---

**Utility Functions**

`remove_hi_corr(df, thr)` - Removes highly correlated features from a DataFrame based on Pearson correlation threshold. Useful for reducing multicollinearity before modeling.

```python
df_cleaned = remove_hi_corr(
    df=my_features,  # DataFrame with features as columns
    thr=0.98,        # correlation threshold above which to drop columns
)
```

`preprocess(df, thr)` - Combines zero-variance removal with correlation filtering. Drops columns with no variance (e.g., constant values) and highly correlated features.

```python
df_processed = preprocess(
    df=my_features,  # DataFrame with features
    thr=0.98,        # correlation threshold
)
```

`standardize(df)` - Standardizes features to zero mean and unit variance using sklearn's StandardScaler.

```python
df_scaled = standardize(
    df=my_features,  # DataFrame to standardize
)
```

---

**Compound Features (SMILES)**

`get_rdkit(SMILES)` - Extracts ~200 RDKit molecular descriptors from a SMILES string.

```python
features = get_rdkit(
    SMILES="CC(=O)O",  # SMILES representation of molecule
)
```

`get_rdkit_3d(SMILES)` - Extracts 3D molecular descriptors after generating a conformer using ETKDG embedding.

```python
features_3d = get_rdkit_3d(
    SMILES="CC(=O)O",  # SMILES representation of molecule
)
```

`get_rdkit_df(df, col, postprocess)` - Batch extracts RDKit features (2D + 3D) from a DataFrame column containing SMILES. Optionally removes redundant features and standardizes.

```python
rdkit_features = get_rdkit_df(
    df=compounds_df,   # DataFrame containing SMILES
    col='SMILES',      # column name with SMILES strings
    postprocess=True,  # remove redundant columns & standardize
)
```

`get_morgan(df, col, radius)` - Generates 2048-bit Morgan fingerprints (circular fingerprints) from SMILES.

```python
morgan_fps = get_morgan(
    df=compounds_df,  # DataFrame containing SMILES
    col='SMILES',     # column name with SMILES strings
    radius=3,         # radius for Morgan fingerprint
)
```

---

**Protein Sequence Features - One-Hot Encoding**

`onehot_encode(sequences, transform_colname, n)` - Converts amino acid sequences to one-hot encoded matrix.

```python
encoded = onehot_encode(
    sequences=df['site_seq'],  # iterable of AA sequences
    transform_colname=True,    # convert column names to position format
    n=20,                      # offset subtracted from each 0-based position index to label columns
)
```

`onehot_encode_df(df, seq_col)` - Convenience wrapper for one-hot encoding from a DataFrame.

```python
encoded = onehot_encode_df(
    df=my_df,            # DataFrame with sequences
    seq_col='site_seq',  # column name containing sequences
)
```

`filter_range_columns(df, low, high)` - Filters one-hot encoded columns to specific sequence positions (e.g., -10 to +10 around a site).

```python
filtered = filter_range_columns(
    df=onehot_df,  # one-hot encoded DataFrame with position+AA column names
    low=-10,       # minimum position to include
    high=10,       # maximum position to include
)
```

---

**Clustering**

`run_kmeans(onehot, n, seed)` - Performs K-means clustering on encoded data and returns cluster assignments.

```python
clusters = run_kmeans(
    onehot=encoded_df,  # one-hot or other feature matrix
    n=10,               # number of clusters
    seed=42,            # random seed for reproducibility
)
```

`get_clusters_elbow(encoded_data, max_cluster, interval)` - Plots the elbow curve (WCSS vs. # clusters) to help choose optimal k.

```python
get_clusters_elbow(
    encoded_data=onehot_df,  # feature matrix for clustering
    max_cluster=400,         # maximum clusters to test
    interval=50,             # step size between cluster counts
)
```

---

**Protein Language Model Embeddings**

`get_esm(df, col, model_name)` - Extracts ESM2 embeddings (mean-pooled) from protein sequences. Runs on MPS or CUDA when available and falls back to CPU; a GPU is recommended for the larger models.

```python
esm_features = get_esm(
    df=kinase_df,                      # DataFrame with protein sequences
    col='sequence',                    # column name with AA sequences
    model_name='esm2_t33_650M_UR50D',  # ESM2 model variant
)
```

`get_t5(df, col)` - Extracts ProtT5-XL-UniRef50 embeddings from protein sequences.

```python
t5_features = get_t5(
    df=kinase_df,       # DataFrame with protein sequences
    col='sequence',     # column name with AA sequences
)
```

`get_t5_bfd(df, col)` - Extracts ProtT5-XL-BFD embeddings (model trained on the Big Fantastic Database).

```python
t5bfd_features = get_t5_bfd(
    df=kinase_df,       # DataFrame with protein sequences
    col='sequence',     # column name with AA sequences
)
```


## Setup

In [None]:
#| export
import pandas as pd, numpy as np
import torch, re, gc
from tqdm.notebook import tqdm; tqdm.pandas()
from katlas.data import Data
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Rdkit
from rdkit import Chem
from rdkit.Chem import Descriptors, Descriptors3D, AllChem, rdFingerprintGenerator

# Clustering
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

from sklearn import set_config
set_config(transform_output="pandas")

## Utils

In [None]:
#| export
def_device = 'mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
#| export
def remove_hi_corr(df: pd.DataFrame, 
                   thr: float=0.98 # threshold
                   ):
    "Remove highly correlated features in a dataframe given a Pearson correlation threshold"
    
    # Create correlation matrix
    corr_matrix = df.corr(numeric_only=True).abs()
    
    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    # Find index of feature columns with correlation greater than threshold
    to_drop = [column for column in upper.columns if any(upper[column] > thr)]
    
    # Drop features 
    df = df.drop(to_drop, axis=1)
    
    return df

`remove_hi_corr` drops one feature from every pair whose absolute Pearson correlation exceeds `thr`. For each such pair, the column that appears later in the DataFrame is the one removed.
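
As a minimal sketch of the upper-triangle filter on toy data (the DataFrame `toy` and its columns are made up for illustration), column `b` is an exact multiple of `a` and so should be the only one dropped:

```python
import numpy as np
import pandas as pd

# Toy data: "b" is an exact multiple of "a"; "c" is uncorrelated with both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, -1.0, 1.0],
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
# A column is flagged if it correlates above the threshold with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.98).any()]
kept = toy.drop(columns=to_drop)

print(to_drop, list(kept.columns))
```

Because only the upper triangle is inspected, the first column of a correlated pair always survives, which makes the result depend on column order.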

In [None]:
# Load data
df = Data.get_aa_rdkit()
df.shape

In [None]:
remove_hi_corr(df,thr=0.9).shape

In [None]:
#| export
def preprocess(df: pd.DataFrame,
               thr: float=0.98):
    
    "Remove features with no variance, and highly correlated features based on threshold"
    
    col_ori = df.columns
    df = df.loc[:,df.std() != 0].copy()
    df = remove_hi_corr(df, thr)
    dropping_col = set(col_ori) - set(df.columns)
    print(f'removing columns: {dropping_col}')
    return df

`preprocess` extends `remove_hi_corr`: it first drops zero-variance features (e.g., a column that equals 1 for every sample), then applies the correlation filter, and prints the names of the removed columns.
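
The zero-variance step matters because a constant column has an undefined (NaN) Pearson correlation, so the correlation filter alone never flags it. A small sketch on made-up data (the names `toy`, `const`, `x`, `y` are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0],  # zero variance
    "x": [0.1, 0.4, 0.2, 0.9],
    "y": [5.0, 3.0, 8.0, 1.0],
})

# Columns whose standard deviation is exactly 0 carry no information.
non_constant = toy.loc[:, toy.std() != 0]
dropped = set(toy.columns) - set(non_constant.columns)

print(dropped)
```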

In [None]:
preprocess(df,thr=0.9).shape

In [None]:
#| export
def standardize(df): 
    "Standardize features from a df"
    return StandardScaler().fit_transform(df.copy())

## Compound features

### RDKit descriptors

In [None]:
#| export
def get_rdkit(SMILES):
    """
    Extract chemical features from SMILES
    Reference: https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html
    """
    mol = Chem.MolFromSmiles(SMILES)
    return Descriptors.CalcMolDescriptors(mol)

In [None]:
#| export
def get_rdkit_3d(SMILES):
    """
    Extract 3d features from SMILES
    """
    mol = Chem.MolFromSmiles(SMILES)
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, AllChem.ETKDG())
    AllChem.UFFOptimizeMolecule(mol)
    return Descriptors3D.CalcMolDescriptors3D(mol)

In [None]:
#| export
def get_rdkit_all(SMILES):
    "Extract chemical features and 3d features from SMILES"
    feat = get_rdkit(SMILES)
    feat_3d = get_rdkit_3d(SMILES)
    return feat|feat_3d

In [None]:
#| export
def get_rdkit_df(df,
                 col, # column of SMILES
                 postprocess=True, # remove redundant columns and standardize features for dimension reduction
                 ):
    "Extract rdkit features (including 3d) from SMILES in a df"
    out = df[col].apply(get_rdkit_all).apply(pd.Series)

    if postprocess:
        out = preprocess(out) # remove redundant
        out = standardize(out)
    return out

In [None]:
aa = Data.get_aa_info()
aa.head()

In [None]:
aa_rdkit = get_rdkit_df(aa, 'SMILES')
aa_rdkit.head()

### Morgan fingerprint

In [None]:
#| export
def get_morgan(df: pd.DataFrame, # a dataframe that contains SMILES
               col: str = "SMILES", # column name of the SMILES strings
               radius=3
              ):
    "Get 2048-bit Morgan fingerprints (binary features) from SMILES in a dataframe"
    mols = [Chem.MolFromSmiles(smi) for smi in df[col]]

    mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius,fpSize=2048)
    morgan_fps = [mfpgen.GetFingerprint(mol) for mol in mols]
    
    fp_df = pd.DataFrame(np.array(morgan_fps), index=df.index)
    fp_df.columns = "morgan_" + fp_df.columns.astype(str)
    return fp_df

In [None]:
aa_morgan = get_morgan(aa, 'SMILES')
aa_morgan.head()

## Protein sequence

### Onehot

In [None]:
#| export
def onehot_encode(sequences, transform_colname=True, n=20):
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    encoded_array = encoder.fit_transform([list(seq) for seq in sequences])
    colnames = [x[1:] for x in encoder.get_feature_names_out()]
    if transform_colname:
        colnames = [f"{int(item.split('_', 1)[0]) - n}{item.split('_', 1)[1]}" for item in colnames]
    encoded_df = pd.DataFrame(encoded_array)
    encoded_df.columns=colnames
    return encoded_df

In [None]:
#| export
def onehot_encode_df(df,seq_col='site_seq', **kwargs):
    return onehot_encode(df[seq_col],**kwargs)

In [None]:
df=Data.get_combine_site_psp_ochoa()

In [None]:
df_k = df.head(1000)

In [None]:
onehot = onehot_encode_df(df_k, seq_col='site_seq')
onehot

### K-means of onehot

In [None]:
#| export
def run_kmeans(onehot,n=2,seed=42):
    "Take one-hot encoded features and return the cluster label of each row."
    kmeans = KMeans(n_clusters=n, random_state=seed,n_init='auto')
    return kmeans.fit_predict(onehot)

In [None]:
run_kmeans(onehot.head(100),n=10)

In [None]:
onehot

In [None]:
#| export
def filter_range_columns(df, # df need to have column names of position + aa
                         low=-10,high=10):
    positions = df.columns.str[:-1].astype(int)
    mask = (positions >= low) & (positions <= high)
    return df.loc[:,mask]

In [None]:
onehot_10 = filter_range_columns(onehot,low=-10,high=10)
onehot_10
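
The filter relies on every column name being a signed position followed by a single-letter residue code. A minimal sketch of the same masking logic on a toy one-row DataFrame (column names invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame([[1, 0, 1, 0, 1]], columns=["-2A", "-1C", "0S", "1P", "2K"])

# Everything except the last character is the signed position.
positions = toy.columns.str[:-1].astype(int)
mask = (positions >= -1) & (positions <= 1)
window = toy.loc[:, mask]

print(window.columns.tolist())
```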

Pipeline:

```python
onehot = onehot_encode(df_k.site_seq)
onehot_10 = filter_range_columns(onehot)
df_k['Cluster'] = run_kmeans(onehot_10,n=n,seed=42)
```

Then plot `onehot` or `onehot_10`, colored by cluster (`hue='Cluster'`).

### Elbow method

In [None]:
#| export
def get_clusters_elbow(encoded_data,max_cluster=400, interval=50):
    "Plot WCSS against the number of clusters to help choose k with the elbow method."
    wcss = []
    for i in range(1, max_cluster,interval):
        kmeans = KMeans(n_clusters=i, random_state=42, n_init='auto')
        kmeans.fit(encoded_data)
        wcss.append(kmeans.inertia_)

    # Plot the Elbow graph
    plt.figure(figsize=(5, 3))
    plt.plot(range(1, max_cluster,interval), wcss)
    plt.title(f'Elbow Method (n={len(encoded_data)})')
    plt.xlabel('# Clusters')
    plt.ylabel('WCSS')

In [None]:
get_clusters_elbow(onehot,5,2)

### ESM2

In [None]:
#| export
def get_esm(
    df: pd.DataFrame, # DataFrame containing protein sequences
    col: str, # column with amino acid sequences
    model_name: str = "esm2_t33_650M_UR50D",
    batch_size: int = 1, # Number of sequences per batch
):
    "Extract ESM2 embeddings (mean pooled per sequence)."

    # model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
    model, alphabet = torch.hub.load("facebookresearch/esm:main", model_name)
    model = model.to(def_device)
    model.eval()

    batch_converter = alphabet.get_batch_converter()

    # Infer repr layer
    match = re.search(r"_t(\d+)_", model_name)
    if not match:
        raise ValueError(f"Cannot infer repr layer from {model_name}")
    layer = int(match.group(1))

    print(f"Using ESM layer {layer}")
    print("Available models:\n"
          "esm2_t48_15B_UR50D\n"
          "esm2_t36_3B_UR50D\n"
          "esm2_t33_650M_UR50D\n"
          "esm2_t30_150M_UR50D\n"
          "esm2_t12_35M_UR50D\n"
          "esm2_t6_8M_UR50D\n")

    sequences = df[col].tolist()
    all_embeddings = []

    for i in tqdm(range(0, len(sequences), batch_size)):
        batch_seqs = sequences[i : i + batch_size]
        data = [(f"seq_{j}", s) for j, s in enumerate(batch_seqs)]

        batch_labels, batch_strs, batch_tokens = batch_converter(data)
        batch_tokens = batch_tokens.to(def_device)

        with torch.no_grad():
            results = model(batch_tokens, repr_layers=[layer], return_contacts=False)

        token_reps = results["representations"][layer]
        batch_lens = (batch_tokens != alphabet.padding_idx).sum(1)

        for j, seq_len in enumerate(batch_lens):
            # skip BOS (0), stop before EOS
            emb = token_reps[j, 1 : seq_len - 1].mean(0)
            all_embeddings.append(emb.cpu().numpy())

        del results, token_reps, batch_tokens
        torch.cuda.empty_cache()
        gc.collect()

    df_emb = pd.DataFrame(
        all_embeddings,
        index=df.index,
        columns=[f"esm_{i}" for i in range(len(all_embeddings[0]))],
    )

    return df_emb

[ESM2](https://github.com/facebookresearch/esm) models are protein language models trained on UniRef sequences. The default model in this function, `esm2_t33_650M_UR50D`, is trained on UniRef50.
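
Two steps in `get_esm` are easy to miss: the representation layer is parsed from the model name, and the mean pooling skips the BOS, EOS, and padding tokens. The sketch below reproduces both with NumPy only; the token ids and the stand-in `token_reps` array are assumptions made for illustration, not output from a real model:

```python
import re
import numpy as np

# The number after "_t" in the model name is used as the representation layer.
layer = int(re.search(r"_t(\d+)_", "esm2_t33_650M_UR50D").group(1))

# Illustrative token ids (assumed): 0 = BOS, 2 = EOS, 1 = padding, others = residues.
PAD = 1
tokens = np.array([[0, 5, 6, 7, 2],      # 3 residues
                   [0, 8, 9, 2, PAD]])   # 2 residues + 1 padding token

# Stand-in representations of shape (batch, tokens, dim); each vector holds its token position.
token_reps = np.tile(np.arange(5.0)[None, :, None], (2, 1, 3))

# Count non-padding tokens, then average positions 1 .. len-2 (drops BOS and EOS).
seq_lens = (tokens != PAD).sum(axis=1)
pooled = np.stack([token_reps[j, 1:n - 1].mean(axis=0) for j, n in enumerate(seq_lens)])

print(layer, pooled[:, 0])
```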

Uncomment below to use:

In [None]:
# # Examples
# df = Data.get_kinase_info().set_index('kinase')
# sample = df[:5]
# esmfeature = get_esm(sample,'sequence')
# esmfeature.head()

### ProtT5

In [None]:
#| export
def get_t5(df: pd.DataFrame, 
           col: str = 'sequence'
           ):
    "Extract ProtT5-XL-uniref50 embeddings from protein sequence in a dataframe"
    from transformers import T5Tokenizer, T5EncoderModel
    
    # Reference: https://github.com/agemagician/ProtTrans/tree/master/Embedding/PyTorch/Advanced
    # Load the tokenizer
    tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

    # Load the model
    model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(def_device)

    # Cast the model to half precision (fp16)
    model.half()
    
    def T5_embeddings(sequence):
        seq_len = len(sequence)
        # Prepare the protein sequences as a list
        sequence = [" ".join(list(re.sub(r"[UZOB]", "X", sequence)))]

        # Tokenize sequences and pad up to the longest sequence in the batch
        ids = tokenizer.batch_encode_plus(sequence, add_special_tokens=True, padding="longest")
        input_ids = torch.tensor(ids['input_ids']).to(def_device)
        attention_mask = torch.tensor(ids['attention_mask']).to(def_device)
        # Generate embeddings
        with torch.no_grad():
            embedding_rpr = model(input_ids=input_ids, attention_mask=attention_mask)

        emb_mean = embedding_rpr.last_hidden_state[0][:seq_len].detach().to(torch.float32).cpu().numpy().mean(axis=0)

        return emb_mean

    series = df[col].progress_apply(T5_embeddings)
        

    T5_feature = pd.DataFrame(series.tolist(),index=df.index)
    T5_feature.columns = 'T5_' + T5_feature.columns.astype(str)
    
    return T5_feature

[ProtT5-XL-UniRef50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) is a T5-3B model trained on the UniRef50 dataset.
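
Before tokenization, both ProtT5 functions map the rare residues U, Z, O, and B to X and separate residues with spaces. A minimal standalone sketch of that step (the helper name `prepare_for_prott5` is hypothetical and not part of this module):

```python
import re

def prepare_for_prott5(sequence: str) -> str:
    """Map rare residues (U, Z, O, B) to X and put a space between residues."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence))

print(prepare_for_prott5("MKUZ"))  # prints "M K X X"
```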

Uncomment below to use:

In [None]:
# t5feature = get_t5(sample,'sequence')
# t5feature.head()

In [None]:
#| export
def get_t5_bfd(df:pd.DataFrame, 
               col: str = 'sequence'
               ):
    
    "Extract ProtT5-XL-BFD embeddings from protein sequence in a dataframe"
    # Reference: https://github.com/agemagician/ProtTrans/tree/master/Embedding/PyTorch/Advanced
    from transformers import T5Tokenizer, T5Model
    # Load the tokenizer
    tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)

    model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd").to(def_device)

    model.eval()
    
    def T5_embeddings_bfd(sequence):
        seq_len = len(sequence)

        # Prepare the protein sequences as a list
        sequence = [" ".join(list(re.sub(r"[UZOB]", "X", sequence)))]

        # Tokenize sequences and pad up to the longest sequence in the batch
        ids = tokenizer.batch_encode_plus(sequence, add_special_tokens=True, padding="longest")
        input_ids = torch.tensor(ids['input_ids']).to(def_device)
        attention_mask = torch.tensor(ids['attention_mask']).to(def_device)

        # Generate embeddings
        with torch.no_grad():
            embedding_rpr = model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids = input_ids)

        emb_mean = embedding_rpr.last_hidden_state[0][:seq_len].detach().to(torch.float32).cpu().numpy().mean(axis=0)

        return emb_mean

    series = df[col].progress_apply(T5_embeddings_bfd)
        

    T5_feature = pd.DataFrame(series.tolist(),index=df.index)
    T5_feature.columns = 'T5bfd_' + T5_feature.columns.astype(str)
    
    return T5_feature

[ProtT5-XL-BFD](https://huggingface.co/Rostlab/prot_t5_xl_bfd) is a T5-3B model trained on the Big Fantastic Database (BFD).

Uncomment below to use:

In [None]:
# t5bfd = get_t5_bfd(sample,'sequence')
# t5bfd.head()

## Export -

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()