# Regression Transformer Example

This notebook demonstrates unconditional generation, embedding extraction, and QED prediction with the Regression Transformer [1].

[1] Born, J. and Manica, M. Regression transformer enables concurrent sequence regression and generation for molecular language modelling.
Nature Machine Intelligence, 5(4):432–444, 2023.

In [1]:
# Reload edited modules automatically (IPython magics, notebook only)
%load_ext autoreload
%autoreload 2

from jointformer.configs.model import ModelConfig
from jointformer.models.auto import AutoModel

In [2]:
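# Pretrained checkpoint identifier and model config directory
path_to_model_ckpt = 'qed'
path_to_model_config = '../../configs/models/regression_transformer'

# Local data files; adjust these paths to your machine
chembl_training_file = '/Users/pankhilgawade/Downloads/guacamol_v1_train.smiles'
seed_dataset_file = '/Users/pankhilgawade/jointformer/chembl_smiles_eval.txt'

# Sampling settings
temperature = 1.0
top_k = None
device = 'cpu'
fraction_to_mask = 0.4  # fraction of tokens masked in each seed molecule
embedding_dim = 256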

In [3]:
model_config = ModelConfig.from_config_file(path_to_model_config)

In [4]:
model = AutoModel.from_config(model_config)
generator = model.to_guacamole_generator(
    temperature=temperature,
    fraction_to_mask=fraction_to_mask,
    seed_dataset_file=seed_dataset_file,
    device=device,
)


INFO:gt4sd.configuration:using as local cache path for algorithms: algorithms
INFO:gt4sd.configuration:using as local cache path for properties: properties


In [5]:
generator.load_pretrained(path_to_model_ckpt)

In [6]:
generator.generate(number_samples=2)

Generating samples:   0%|          | 0/2 [00:00<?, ?it/s]

INFO:gt4sd.algorithms.core:runnning RegressionTransformer with configuration=RegressionTransformerMolecules(algorithm_version='qed', search='sample', temperature=1.0, batch_size=8, tolerance=100.0, sampling_wrapper={'property_goal': {'<qed>': 0.345}, 'fraction_to_mask': 0.4})
INFO:gt4sd.algorithms.conditional_generation.regression_transformer.core:ensure artifacts for the application are present.
INFO:gt4sd.s3:starting syncing
INFO:gt4sd.s3:syncing complete
INFO:gt4sd.s3:starting syncing
INFO:gt4sd.s3:syncing complete


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'XLNetTokenizer'. 
The class this function is called from is 'InferenceBertTokenizer'.


INFO:gt4sd.algorithms.conditional_generation.regression_transformer.implementation:Model restored from /Users/pankhilgawade/.gt4sd/algorithms/conditional_generation/RegressionTransformer/RegressionTransformerMolecules/qed


  return torch.load(checkpoint_file, map_location="cpu")


INFO:gt4sd.algorithms.conditional_generation.regression_transformer.implementation:Will start sampling molecules similar to COC12C3NC3CN1C1=C(C2COC(N)=O)C(=O)C(NCCN2CCCC2)=C(C)C1=O with goal: {'<qed>': 0.345} and masking 0.4 of the tokens.
INFO:gt4sd.algorithms.conditional_generation.regression_transformer.implementation:Starting prediction for sequence <qed>0.345|[C][MASK][C][C][MASK][C][MASK][Ring1][C][N][Ring1][MASK][C][=C][MASK][O][C][MASK][MASK][C][MASK][MASK][Branch1_1][C][N][MASK][C][Branch1_2][MASK][MASK][MASK][MASK][MASK][N][MASK][C][N][C][C][MASK][MASK][Ring1][MASK][MASK][Branch1_1][C][C][C][MASK][Ring1][Branch1_2][=O][PAD]
INFO:gt4sd.algorithms.conditional_generation.regression_transformer.implementation:Successes: (('C1OC2CCC1N2C=C(C=C(C(N3)=N))C(=CC(NCCNC4=C3))N(C)CC4=O', '<qed>0.287'), ('C1OCCNC1NC4=C(C2=CCC3OC(N)=O)C(=CC2NCCNC=C3)N(C)C4=O', '<qed>0.487'), ('C13C2CCCC1N2C=C(C(COC(N)=O))C(=O)N(C)C3=O', '<qed>0.557'), ('C1C23CCCC1N2C=C(C(C#N)=C(N)N)C(=NNN=O)N(C)C3=O', '<qed

Generating samples:  50%|█████     | 1/2 [00:02<00:02,  2.19s/it]

INFO:gt4sd.algorithms.core:runnning RegressionTransformer with configuration=RegressionTransformerMolecules(algorithm_version='qed', search='sample', temperature=1.0, batch_size=8, tolerance=100.0, sampling_wrapper={'property_goal': {'<qed>': 0.477}, 'fraction_to_mask': 0.4})
INFO:gt4sd.algorithms.conditional_generation.regression_transformer.core:ensure artifacts for the application are present.
INFO:gt4sd.s3:starting syncing
INFO:gt4sd.s3:syncing complete
INFO:gt4sd.s3:starting syncing
INFO:gt4sd.s3:syncing complete




INFO:gt4sd.algorithms.conditional_generation.regression_transformer.implementation:Model restored from /Users/pankhilgawade/.gt4sd/algorithms/conditional_generation/RegressionTransformer/RegressionTransformerMolecules/qed
INFO:gt4sd.algorithms.conditional_generation.regression_transformer.implementation:Will start sampling molecules similar to COc1cc(CC2N(C(C)=O)C(=O)C(NC2=O)=Cc2cc(OC)c(OC)c(C)c2OC)c(OC)c(C)c1OC with goal: {'<qed>': 0.477} and masking 0.4 of the tokens.
INFO:gt4sd.algorithms.conditional_generation.regression_transformer.implementation:Starting prediction for sequence <qed>0.477|[C][O][C][=C][C][Branch2_1][Ring2][MASK][C][MASK][MASK][Branch1_1][Branch1_2][C][Branch1_1][MASK][C][=O][MASK][Branch1_2][MASK][MASK][MASK][MASK][Branch1_2][MASK][MASK][MASK][MASK][=O][=C][MASK][MASK][C][Branch1_1][Ring1][O][C][=C][MASK][Ring1][O][C][C][MASK][C][C][=C][Ring1][MASK][O][MASK][MASK][MASK][MASK][MASK][C][MASK][MASK][C][C][MASK][Ring2][Ring2][MASK][O][C]
INFO:gt4sd.algorithms.conditi

Generating samples: 100%|██████████| 2/2 [00:04<00:00,  2.34s/it]


['C1OC2CCC1N2C=C(C=C(C(N3)=N))C(=CC(NCCNC4=C3))N(C)CC4=O',
 'COC=C2C(CCN(C(C)=O)C(=O)C(N(N=O))=CC1=CC(OC)=C(OC)C(C)=C1OC=C)=C=CCC(CC)=C2OC']
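
The log above reports each candidate as a `(SMILES, '<qed>value')` pair alongside the sampled property goal. The following is a minimal sketch, assuming one wants to rank such pairs by how close the predicted QED lies to the goal. It is an illustration only and not necessarily how `generator.generate` selects the molecule it returns. The helper names `parse_property` and `rank_by_goal` are hypothetical, and only the three pairs that appear complete in the truncated log line are used.

```python
def parse_property(tagged: str, tag: str = "<qed>") -> float:
    """Convert a string such as '<qed>0.287' into the float 0.287."""
    if not tagged.startswith(tag):
        raise ValueError(f"expected {tag!r} prefix, got {tagged!r}")
    return float(tagged[len(tag):])


def rank_by_goal(candidates, goal):
    """Sort (smiles, tagged_value) pairs by |predicted - goal|, closest first."""
    return sorted(candidates, key=lambda pair: abs(parse_property(pair[1]) - goal))


# The three complete pairs visible in the first "Successes" log line (goal: 0.345).
successes = [
    ("C1OC2CCC1N2C=C(C=C(C(N3)=N))C(=CC(NCCNC4=C3))N(C)CC4=O", "<qed>0.287"),
    ("C1OCCNC1NC4=C(C2=CCC3OC(N)=O)C(=CC2NCCNC=C3)N(C)C4=O", "<qed>0.487"),
    ("C13C2CCCC1N2C=C(C(COC(N)=O))C(=O)N(C)C3=O", "<qed>0.557"),
]

ranked = rank_by_goal(successes, goal=0.345)
best_smiles, best_value = ranked[0]
print(best_smiles, best_value)
```

Ranking by absolute distance is the simplest choice; a tolerance threshold or a validity check on the SMILES could be layered on top of it.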

In [7]:
smiles = [
    'CC12CC(CC(=O)C1CCC3(C)C2C4OC(=O)C3(O)COC4C)C=CC=COC#C',
    'CC1=CC=C([N-][N+]#N)C=C1',
    'BrC1=CC=C(C=C1)C(=O)C=CC2=CC=CC=C2',
    'BrC1=CC=C(C=C1)C(=O)C=CC2=CC=CC=C2',
]
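
The third and fourth entries of this list are the same string, so any per-molecule computation runs on it twice. The following is a minimal sketch, assuming input order should be preserved, of removing exact duplicates and keeping a map back to the original positions. The names `unique_smiles` and `index_map` are hypothetical. It only catches identical strings, not different SMILES spellings of the same molecule.

```python
smiles = [
    "CC12CC(CC(=O)C1CCC3(C)C2C4OC(=O)C3(O)COC4C)C=CC=COC#C",
    "CC1=CC=C([N-][N+]#N)C=C1",
    "BrC1=CC=C(C=C1)C(=O)C=CC2=CC=CC=C2",
    "BrC1=CC=C(C=C1)C(=O)C=CC2=CC=CC=C2",
]

# dict preserves insertion order, so this keeps the first occurrence of each string.
unique_smiles = list(dict.fromkeys(smiles))

# For every original entry, the position of its representative in unique_smiles.
index_map = [unique_smiles.index(s) for s in smiles]

print(len(smiles), len(unique_smiles))
print(index_map)
```

Results computed on `unique_smiles` can then be expanded back to the original order with `[results[i] for i in index_map]`.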

In [8]:
path_to_encoder_ckpt = '/Users/pankhilgawade/jointformer/qed'
# path_to_encoder_ckpt = '/Users/pankhilgawade/.gt4sd/algorithms/conditional_generation/RegressionTransformer/RegressionTransformerMolecules/qed'

In [9]:
model_config = ModelConfig.from_config_file(path_to_model_config)

In [10]:
model = AutoModel.from_config(model_config)

In [27]:
model.to_smiles_encoder(tokenizer=None, batch_size=None, device='cpu', search='sample')

RegressionTransformer(
  (search): SamplingSearch()
  (_search): SamplingSearch()
)

In [28]:

model.load_pretrained(path_to_encoder_ckpt)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'XLNetTokenizer'. 
The class this function is called from is 'InferenceBertTokenizer'.


INFO:gt4sd.algorithms.conditional_generation.regression_transformer.implementation:Model restored from /Users/pankhilgawade/jointformer/qed


  return torch.load(checkpoint_file, map_location="cpu")




In [29]:
model.encode(smiles)


Encoding samples: 100%|██████████| 4/4 [00:00<00:00, 24.48it/s]


array([[ 0.9671758 , -0.41499734,  1.296763  , ...,  0.6502534 ,
         0.10490204, -0.85646594],
       [ 1.303101  , -0.57494247,  1.1045016 , ...,  0.557437  ,
        -0.05404019, -0.5745893 ],
       [ 0.36216116, -0.4644822 ,  0.96407056, ...,  0.6472383 ,
        -0.08904713, -0.94837767],
       [ 0.36216116, -0.4644822 ,  0.96407056, ...,  0.6472383 ,
        -0.08904713, -0.94837767]], dtype=float32)
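
The last two rows of this output are identical because the last two input SMILES are identical. One common way to compare embeddings is cosine similarity; the sketch below is a plain-Python illustration. It uses only the six values printed per row, since the remaining values are elided in the output, so the numbers it produces are not the similarities of the full embeddings. The function name `cosine_similarity` is a hypothetical helper, not part of the model's API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Only the values visible in the printed array (first three and last three per row).
rows = [
    [0.9671758, -0.41499734, 1.296763, 0.6502534, 0.10490204, -0.85646594],
    [1.303101, -0.57494247, 1.1045016, 0.557437, -0.05404019, -0.5745893],
    [0.36216116, -0.4644822, 0.96407056, 0.6472383, -0.08904713, -0.94837767],
    [0.36216116, -0.4644822, 0.96407056, 0.6472383, -0.08904713, -0.94837767],
]

# Print the similarity for every pair of rows.
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        print(i, j, round(cosine_similarity(rows[i], rows[j]), 3))
```

With the full array returned by `model.encode`, the same function can be applied row by row after converting each row to a list.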

In [32]:
model.predict(smiles)

Predicting qed: 100%|██████████| 4/4 [00:00<00:00, 24.57it/s]


array([['<qed>0.331'],
       ['<qed>0.429'],
       ['<qed>0.606'],
       ['<qed>0.606']], dtype='<U10')
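
The predictions come back as strings that combine a property tag with a value, so they need parsing before any numeric use. The following is a minimal sketch, assuming every entry has the form `<name>number` as in the output above. The pattern `TAGGED_VALUE` and the helper `parse_prediction` are hypothetical names, and the nested list stands in for the returned array.

```python
import re

# Matches strings such as '<qed>0.331': a tag name in angle brackets, then a number.
TAGGED_VALUE = re.compile(r"<(\w+)>(-?\d+(?:\.\d+)?)")


def parse_prediction(text):
    """Split '<qed>0.331' into the pair ('qed', 0.331)."""
    match = TAGGED_VALUE.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected prediction format: {text!r}")
    return match.group(1), float(match.group(2))


# Plain nested list mirroring the shape of the array shown above.
predictions = [["<qed>0.331"], ["<qed>0.429"], ["<qed>0.606"], ["<qed>0.606"]]

# Each row holds a single tagged string; keep only the numeric value.
values = [parse_prediction(row[0])[1] for row in predictions]
print(values)
```

If `model.predict` returns a NumPy array, calling `.tolist()` on it yields the nested-list shape used here.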