# Lecture 16 Data Wrangling: Evaluating a Retrosynthesis Model

## The Challenge of Retrosynthesis

One of the fundamental problems in synthetic chemistry is **retrosynthesis**: given a target molecule (product), can we identify the starting materials (reactants) needed to synthesize it? This is essentially working backwards from the desired outcome to determine the recipe.

In recent years, machine learning models have been trained to predict retrosynthetic routes. These models take a target product molecule as input and suggest possible sets of reactants that could be used to make it. However, **how do we know if these predictions are actually good?**

## What is Round-Trip Accuracy?

Round-trip accuracy is an evaluation metric for retrosynthesis models that leverages the relationship between forward and backward reaction prediction:

![round trip accuracy](roundtrip.png)

### The Process:

1. **Start with a target molecule** (e.g., aspirin)
2. **Use a retrosynthesis model** to predict what reactants could make it
3. **Use a forward reaction model** to predict what product those reactants would actually make
4. **Compare the predicted product with the original target**
   - If they match → the retrosynthesis prediction was likely good! ✓
   - If they don't match → the prediction was probably incorrect ✗

### Why This Works:

Round-trip accuracy is based on a simple insight: if a retrosynthesis model predicts reactants R for a target product P, then a forward model should predict that R produces P. If the forward model predicts a different product, the retrosynthetic prediction is questionable.
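As a minimal sketch of the metric, the toy example below replaces both models with hypothetical lookup tables (`toy_retro` and `toy_forward` are stand-ins, not the models used in this exercise) and counts how often the forward prediction recovers the original target. It compares strings exactly, so it assumes all SMILES are already canonical.

```python
from typing import Callable, Iterable, Optional


def round_trip_accuracy(
    targets: Iterable[str],
    retro: Callable[[str], Optional[str]],
    forward: Callable[[Optional[str]], Optional[str]],
) -> float:
    """Fraction of targets P for which forward(retro(P)) == P."""
    targets = list(targets)
    if not targets:
        return 0.0
    hits = sum(forward(retro(target)) == target for target in targets)
    return hits / len(targets)


# Hypothetical retrosynthesis predictions: target product -> reactants
toy_retro = {
    "CC(=O)Oc1ccccc1C(=O)O": "CC(=O)OC(C)=O.O=C(O)c1ccccc1O",  # aspirin
    "CCO": "C=C.O",                                            # ethanol
    "CCOC(C)=O": "CC(=O)O.CO",  # ethyl acetate, but methanol was predicted
}

# Hypothetical forward predictions: reactants -> product
toy_forward = {
    "CC(=O)OC(C)=O.O=C(O)c1ccccc1O": "CC(=O)Oc1ccccc1C(=O)O",
    "C=C.O": "CCO",
    "CC(=O)O.CO": "COC(C)=O",  # methyl acetate, not the original target
}

acc = round_trip_accuracy(toy_retro, toy_retro.get, toy_forward.get)
print(f"{acc:.2%}")  # 66.67%
```

Two of the three toy round trips return to the original target, so the accuracy is 2/3. The third fails because the predicted reactants lead to a different ester.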

## Your Task

In this exercise, you will implement a complete pipeline to calculate round-trip accuracy for a set of retrosynthetic predictions. In our scenario, the forward model is accessible via an API. Because calling the forward model is expensive, the outcomes of a set of reactions have already been cached.

Your tasks are to:
- load the dataset
- clean it by removing invalid predictions
- check which reactions already have a known outcome in the cache
- call a mock API that simulates the forward model for the remaining ones
- calculate the round-trip accuracy

### The Dataset

You'll work with retrosynthesis predictions from a real model that include several challenges:
- Some predictions have invalid SMILES strings
- Some reactions are trivial (product = reactants)
- Some reactions have already been evaluated (cached)
- Data may contain whitespace and formatting issues

Your job is to clean the data, validate the chemistry, make API calls to a forward model, and ultimately calculate how often the forward model agrees with the retrosynthesis predictions.

## SMILES

In this exercise we will be working with SMILES, a text format for molecules. For example, the SMILES representation of aspirin is `CC(=O)Oc1ccccc1C(=O)O`. You can visualise SMILES using this website: https://www.simolecule.com/cdkdepict/depict.html.

Note: Our SMILES sometimes contain the character ~. To visualise molecules containing ~, remove the ~ before passing it to the website.

## Setup:

Make sure the following packages are installed by running the next cell.

In [None]:
!pip install pandas rdkit flask requests click tqdm

## Step 0: Starting the Forward Model API

In this exercise we will simulate the forward model with a mock API. To do this, start the server by running the following command in a terminal:

```bash
python scripts/rxn_fwd_mock.py --rxn-file <path to rxn file>
```

The `rxn-file` is located in `resources/rxn_fwd_api_data.json`. After starting the server you should see the following:

```text
Available endpoints:
  POST /predict_forward - Predict forward reaction
  GET  /stats - Get API usage statistics
  POST /reset - Reset statistics
```
These are the endpoints on the server that we can query. For us, `/predict_forward` is the relevant one.

## Step 1: Loading the data

The data is saved as a csv file in `exercises/lecture_16/inference_retro_checkpoint_6260.csv`. Load the data with pandas and familiarise yourself with the data. Have a look at some of the molecules and retrosynthesis predictions on the website mentioned above. Can you already spot some rows with problems?

In [None]:
import pandas as pd

retro_data = 

## Step 2: Cleaning the data

You will likely have seen some rows with problematic predictions. Next, we will clean the data by removing all invalid molecules and standardising the SMILES strings. Follow the steps below:

1. The output of our model contains the formatting tokens `<smiles>` and `<smiles/>`. Remove these from the prediction.
2. Canonicalise the SMILES strings
3. Create a column called `need_to_run_fwd` that tracks whether we need to call the forward API for that row

For canonicalisation, use the function below, which converts a SMILES string into a standardised (canonical) form.

HINT: The pandas function `.map` is very useful here.
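A minimal sketch of step 1 in plain Python. The helper name `strip_format_tokens` and the example string are hypothetical; the token spellings are taken from the description above.

```python
# Formatting tokens emitted by the model, as listed in step 1
FORMAT_TOKENS = ("<smiles>", "<smiles/>")


def strip_format_tokens(prediction: str) -> str:
    """Remove the formatting tokens and any surrounding whitespace."""
    for token in FORMAT_TOKENS:
        prediction = prediction.replace(token, "")
    return prediction.strip()


raw = " <smiles>CC(=O)O.OCC<smiles/> "  # hypothetical raw model output
cleaned = strip_format_tokens(raw)
print(cleaned)  # CC(=O)O.OCC
```

A function like this can be passed to `.map` on whichever column holds the predictions, and the result can then be passed through the canonicalisation function in the same way.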

In [247]:
from rdkit import Chem

def canonicalise(smiles: str) -> str:
    """Canonicalises SMILES strings. If the SMILES string is invalid an empty string will be returned."""

    mol = Chem.MolFromSmiles(smiles)

    if mol is None:
        return ""
    else:
        return Chem.MolToSmiles(mol)

Check how many invalid predictions the model made. If everything went correctly, you should see that the model predicted invalid SMILES in 46 out of 9707 cases. In the column `need_to_run_fwd` mark these as `False`.

In [None]:
retro_data['need_to_run_fwd'] = 

## Step 3: Caching

As we are evaluating different checkpoints of this model, we have already run forward prediction for a large set of reactants. Because calling the API is expensive, we store the predictions of the forward model so we don't have to rerun them. In the next step you will:
- Load the cache
- Check for which reactants we have already run the forward prediction
- In the column `need_to_run_fwd`, mark these rows as `False`

The cache is stored in `exercises/lecture_16/data/rxn_cache.json`.
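The layout of the cache file is not described here, so the sketch below assumes, purely for illustration, a JSON object that maps a canonical reactant SMILES string to its predicted product. Under that assumption, deciding whether a row still needs the forward model is a dictionary membership test. Adapt the lookup once you have inspected the real file.

```python
import json

# Hypothetical cache content; the real file may be structured differently
cache_text = '{"C=C.O": "CCO", "CC(=O)O.CO": "COC(C)=O"}'
rxn_cache_example = json.loads(cache_text)


def needs_forward_call(reactants: str, cache: dict) -> bool:
    """True if the reactants are valid (non-empty) and not yet in the cache."""
    return reactants != "" and reactants not in cache


# One cached row, one new row, one invalid (empty) row
rows = ["C=C.O", "CC(=O)Cl.OCC", ""]
flags = [needs_forward_call(r, rxn_cache_example) for r in rows]
print(flags)  # [False, True, False]
```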

In [None]:
rxn_cache = 

If everything went correctly you should now see that you still need to run forward for 2164 rows.

## Step 4: Calling the Forward API

In the next step we will call the API to get the forward predictions for the remaining rows, i.e. those still marked `True` in `need_to_run_fwd`. First, double-check in the terminal at which address the server can be reached. In my case this is: `http://127.0.0.1:5000`.

Then follow these steps:
- Play around with the API a bit
- Run forward prediction for all reactions which are valid and not present in the cache
- Create a column called `predicted_product` in the dataframe containing the forward predictions from the cache and the ones you just obtained from the API

Hint 1: You can use tqdm to create a progress bar, e.g. `for i in tqdm(range(100)):` functions as a normal for loop just with an added progress bar.

Hint 2: The column can be created by concatenating the `rxn_cache` with the predicted products from the API, followed by a merge.
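Hint 2 can be sketched on toy data as follows. All column names other than `need_to_run_fwd` and `predicted_product`, the tabular shape of the cache, and the `fake_forward` stand-in (used so the example runs without the server) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical cleaned data: two cached rows, a duplicated new row, one invalid row
data = pd.DataFrame({
    "reactants": ["C=C.O", "CC(=O)O.CO", "CC(=O)Cl.OCC", "CC(=O)Cl.OCC", ""],
    "need_to_run_fwd": [False, False, True, True, False],
})

# Hypothetical cache, already converted to a two-column table
cache_df = pd.DataFrame({
    "reactants": ["C=C.O", "CC(=O)O.CO"],
    "predicted_product": ["CCO", "COC(C)=O"],
})


def fake_forward(reactants: str) -> str:
    """Offline stand-in for the forward API call."""
    return {"CC(=O)Cl.OCC": "CCOC(C)=O"}[reactants]


# Query each distinct reactant set only once
to_run = data.loc[data["need_to_run_fwd"], "reactants"].drop_duplicates()
api_df = pd.DataFrame({
    "reactants": to_run.to_list(),
    "predicted_product": [fake_forward(r) for r in to_run],
})

# Stack cached and fresh predictions, then attach them to every row
lookup = pd.concat([cache_df, api_df], ignore_index=True)
merged = data.merge(lookup, on="reactants", how="left")
print(merged)
```

The left merge keeps every original row. Rows without any prediction, such as the invalid one, end up with a missing value in `predicted_product`. Because `lookup` holds each reactant set only once, the merge does not duplicate rows.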

In [259]:
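import requests
from typing import Optional, Dict
from tqdm.auto import tqdm

API_URL = "http://127.0.0.1:5000"

def call_forward_model(reactants: str) -> Optional[Dict]:
    """
    Call the forward model API to predict products
    
    Args:
        reactants: SMILES string of reactants
        
    Returns:
        Dictionary with prediction results or None if error
    """
    
    inp = {"reactants": reactants}
    try:
        # The 30 s timeout is an arbitrary choice so a stalled server cannot hang the loop
        response = requests.post(API_URL + "/predict_forward", json=inp, timeout=30)
        response.raise_for_status()  # treat HTTP error codes as failures
        return response.json()
    except (requests.RequestException, ValueError):
        # Network problem, HTTP error, or a response body that is not valid JSON
        return None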

In [None]:
subset_to_run = 

## Step 5: Calculating the Round-Trip Accuracy

With this, you can now calculate the round-trip accuracy for our model by simply comparing the desired product with the predicted product.

Hint: We expect a round trip accuracy of 70.33%

In [None]:
rt_acc = 
print(f"Round Trip accuracy: {rt_acc*100:.2f}%")