# Notebook Information
----------------------
**Created By:**   Steven Bennett, Friedrich Hastedt

This is an example notebook for Task 1. The goal of Task 1 is to develop a model that performs single-step retrosynthesis prediction: given a target molecule, the model should predict a single reaction step that produces the target from one or more reactants.
The model will be evaluated using the following metrics:

- **Top-10 score:** The percentage of test-set targets for which the reference reactants appear among the model's top-10 predictions.
- **Duplicates:** The percentage of duplicated reactants among the model's top-10 predictions. The reported score is 1 minus this fraction, so that a higher score is better.
- **Invalidity:** The percentage of invalid predictions among the model's top-10 predictions. The reported score is 1 minus this fraction, so that a higher score is better.
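
The three metrics can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the competition's scoring code (the notebook imports `TopK`, `Duplicates` and `InvalidSMILES` from `eval` for that). The function names, the exact-string comparison, the pooling over all targets, and the toy `balanced_parentheses` validity check are all hypothetical; a real check would more likely canonicalise and parse each SMILES with RDKit, for example by testing whether `Chem.MolFromSmiles(s)` returns `None`.

```python
from typing import Callable, Dict, List

Predictions = Dict[str, List[str]]  # target SMILES -> ranked reactant SMILES


def top_k_score(predictions: Predictions, truth: Dict[str, str], k: int = 10) -> float:
    """Fraction of targets whose reference reactants are in the first k predictions."""
    hits = sum(truth[target] in predictions.get(target, [])[:k] for target in truth)
    return hits / len(truth)


def duplicate_score(predictions: Predictions, k: int = 10) -> float:
    """1 - (duplicated predictions / all predictions), pooled over targets."""
    total = sum(len(p[:k]) for p in predictions.values())
    unique = sum(len(set(p[:k])) for p in predictions.values())
    return 1 - (total - unique) / total


def validity_score(predictions: Predictions, is_valid: Callable[[str], bool], k: int = 10) -> float:
    """1 - (invalid predictions / all predictions), pooled over targets."""
    top = [s for p in predictions.values() for s in p[:k]]
    return 1 - sum(not is_valid(s) for s in top) / len(top)


def balanced_parentheses(smiles: str) -> bool:
    """Toy validity check (assumption); a real one would parse the SMILES with RDKit."""
    depth = 0
    for ch in smiles:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0


# Made-up toy data, not taken from the competition files.
toy_predictions = {
    "CCO": ["CC=O", "CC=O", "C(C"],
    "CC(=O)O": ["CCO", "CC=O", "CC#N"],
}
toy_truth = {"CCO": "CC=O", "CC(=O)O": "CC(=O)OC"}

print(top_k_score(toy_predictions, toy_truth))
print(duplicate_score(toy_predictions))
print(validity_score(toy_predictions, balanced_parentheses))
```

Passing the validity check in as a callable keeps the sketch free of dependencies, so the same functions work with either the toy check or an RDKit-based one.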

From these evaluation metrics, we will provide a final score for each model, using a weighted average of the different metrics. The final score will be used to rank each team on the GitHub leaderboard.
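
As a rough sketch of how such a weighted average could be computed: the weights and metric values below are placeholders chosen for illustration only, since the weights used for the leaderboard are not stated here, and `final_score` is a hypothetical helper name.

```python
from typing import Dict


def final_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-metric scores, normalised by the sum of the weights."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Placeholder values (assumptions), not the competition's actual weights or results.
example_scores = {"top_k": 0.5, "duplicates": 0.8, "invalidity": 0.9}
example_weights = {"top_k": 0.5, "duplicates": 0.25, "invalidity": 0.25}

print(final_score(example_scores, example_weights))
```

Dividing by the sum of the weights means the weights do not need to add up to 1.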


## Notebook Contents
----------------------

In this notebook, we will show an example of using a pre-trained model as a starting point to generate the output file for submission to the competition. The notebook will cover the following steps:

1. Loading the pre-trained model, and performing data pre-processing steps
2. Generating predictions on the held-out test set and saving the output file
3. Performing a single fine-tuning step on the pre-trained model
4. Generating predictions with the GPT series of models through the OpenAI API

This notebook only serves as an example of how to get started. You are free to experiment with as many different models as you like, including using ChatGPT to make predictions, and to use any other data that you like.
The only requirement is that the test set data is used to generate the output file for the submission.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from example_model import Model
from eval import TopK, Duplicates, SCScore, InvalidSMILES, Diversity, tokenize_smiles
from rdkit import Chem
from rdkit.Chem import Draw


In [3]:
model = Model(
    model_path="Models/USPTO50_model_step_500000.pt", 
)

In [10]:
results = model.predict(
    source_path='Data/test_input.txt',
    num_predictions=10,
    batch_size=100,
    beam_size=10
)


PRED AVG SCORE: -0.0009, PRED PPL: 1.0009


In [18]:
results

{'CC(=O)c1ccc2c(ccn2C(=O)OC(C)(C)C)c1': ['CC(=O)c1ccc(NC(=O)OC(C)(C)C)c(CC(C)(C)O)c1CC(C)(C)O',
  'CC(=O)c1ccc(NC(=O)OC(C)(C)C)c(C(C)(C)O)c1CC(C)(C)O',
  'CC(=O)c1ccc(N)c(CC(C)(C)OC(=O)OC(=O)OC(C)(C)C)c1OC(C',
  'CC(=O)c1ccc(NC(=O)OC(C)(C)C)c(C(C)(C)O)c1CC(C)(C)O',
  'CC(=O)c1ccc(NC(=O)OC(C)(C)C)c(C#CC(C)(C)OC(=O)O)c1',
  'CC(=O)c1ccc(NC(=O)OC(C)(C)C)c(C#CC(C)(C)OC(=O)O)c1',
  'CC(=O)c1ccc(NC(=O)OC(C)(C)C)c(C#CC(C)(C)OC(=O)O)c1',
  'CC(=O)c1ccc(NC(=O)OC(C)(C)C)c(C#CC(C)(C)O)c1CC(C)(C)',
  'CC(=O)c1ccc(N)c(CC(C)(C)OC(=O)OC(=O)OC(C)(C)C)c1C(C)',
  'CC(=O)c1ccc2c(c1)ccn2C(=O)OC(C)(C)C.CC(C)(C)OC(=O)OC'],
 'Cc1ccc(S(=O)(=O)O[C@@H]2CN(C(=O)OC(C)(C)C)[C@@H]3[C@@H](O)CO[C@@H]32)cc1': ['CC(C)(C)OC(=O)OC(=O)OC(C)(C)C1Cc1ccc(S(=O)(=O)O[C@@H]2CN[C@@H]3C(=O)CO[C@@H]32)cc1',
  'CC(C)(C)OC(=O)OC(=O)OC(C)(C)C.O=S(=O)(O[C@@H]1CN[C@@H]2[C@@H](O)CO[C@@H]21)c1ccc(CBr)cc1',
  'CC(C)(C)OC(=O)OC(=O)OC(C)(C)C1Cc1ccc(S(=O)(=O)O[C@@H]2CN[C@@H]3[C@H](OCc4ccccc4)CO',
  'CC(C)(C)OC(=O)OC(=O)OC(C)(C)C.CC(C)(C)OC(=