## CATALOOP Hackathon 2025

### Introduction
The following code shows you how to load the data for the Hackathon. This data was taken from the following publication, preprocessed, and simplified by removing some data columns: https://doi.org/10.1021/acscentsci.5c00900

The goal is to develop two data-driven models: one that predicts the product yield (in %) and another that predicts the $\Delta\Delta G$ (in kcal/mol) associated with the enantioselectivity of Sharpless asymmetric dihydroxylation reactions. Both models should be based on the information provided in the data, together with any additional descriptors and representations you choose to use. Computing features from the reactant and product SMILES provided is both allowed and encouraged.
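For orientation, $\Delta\Delta G$ is commonly related to the enantiomeric excess (ee) through the enantiomeric ratio, $\Delta\Delta G = RT \ln(\mathrm{er})$. The sketch below only illustrates this textbook relation. How the target values in this dataset were derived (temperature, sign convention) is not stated here, so the function name `ddg_from_ee`, the default temperature of 273.15 K, and the use of ee in % are assumptions made for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg_from_ee(ee_percent, temperature_k=273.15):
    """Convert an enantiomeric excess (in %) into a free-energy difference in kcal/mol.

    Hypothetical helper: the default temperature is an assumption,
    not a value taken from the dataset.
    """
    if not -100.0 < ee_percent < 100.0:
        raise ValueError("ee must lie strictly between -100 and 100 %")
    # Enantiomeric ratio: major / minor enantiomer
    er = (100.0 + ee_percent) / (100.0 - ee_percent)
    return R_KCAL * temperature_k * math.log(er)
```

Under these assumptions, a racemic outcome (0 % ee) corresponds to a $\Delta\Delta G$ of zero, and the value grows steeply as ee approaches 100 %.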

### Preprocessing
For the sake of simplicity, some reaction conditions, such as solvent, temperature, and concentrations, were removed from the data, although they can significantly influence reaction outcomes. This simplification is defensible because the goal of this Hackathon is not to build a perfect model, for which there is not enough time. Nevertheless, it is important to be aware that it might introduce errors into the model.

The following few code cells show you how you can load the data into your notebook and also how to convert the SMILES strings of reactants and products into RDKit mol objects.

### Submission
Your submission must consist of two files. The first must be a Python script or a Jupyter notebook containing (1) your model, such that its predictions can be reproduced by another person, and (2) a function that reads a single row of a `pandas` DataFrame in exactly the same format as the DataFrame generated in the sample code below and predicts both yield (in %) and enantioselectivity (as $\Delta\Delta G$, in kcal/mol) for that row. The second must be a PDF document that describes the methodology implemented in your code.
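One possible shape for the required prediction function is sketched below. It is a placeholder, not a recommended model: `MeanBaseline` and `predict_row` are hypothetical names, the "model" simply returns the mean of the training targets, and only the column names `Reactant SMILES` and `Product SMILES` are taken from the sample code.

```python
import statistics

import pandas as pd


class MeanBaseline:
    """Hypothetical stand-in for a trained model: always predicts the training means."""

    def fit(self, yields, ddgs):
        # Store the mean yield (%) and mean ddG (kcal/mol) of the training targets
        self.mean_yield = statistics.fmean(yields)
        self.mean_ddg = statistics.fmean(ddgs)
        return self

    def predict_row(self, row):
        """Take a one-row DataFrame and return (yield in %, ddG in kcal/mol)."""
        if len(row) != 1:
            raise ValueError("expected a DataFrame with exactly one row")
        for column in ("Reactant SMILES", "Product SMILES"):
            if column not in row.columns:
                raise KeyError(f"missing column: {column}")
        # A real model would compute features from the row here
        return self.mean_yield, self.mean_ddg


model = MeanBaseline().fit([10.0, 30.0], [0.5, 1.5])  # toy targets, not real data
row = pd.DataFrame({"Reactant SMILES": ["C=C"], "Product SMILES": ["OCCO"]})
prediction = model.predict_row(row)
```

Validating the input shape and columns up front makes it easier for another person to run the function on unseen data.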

The prediction accuracy of your submitted model will be evaluated on data that is not provided to you, which allows us to compare all submitted models in a fair and unbiased manner. For each prediction task (yield and enantioselectivity), the submitted models will be ranked by both the mean absolute error (MAE) and the coefficient of determination (R²), and each rank counts as the corresponding number of points. The submission with the lowest sum of these two ranks will win the corresponding prediction task.
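The two metrics and the rank-sum idea can be sketched in plain Python as follows. This is an illustration of the scheme described above, not the organizers' evaluation code: the function names are hypothetical, and the handling of ties (tied submissions share the better rank) is an assumption.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation; lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot; higher is better."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


def ranks(scores, higher_is_better):
    """Rank 1 is best; ties share the better rank (assumption)."""
    result = {}
    for name, value in scores.items():
        if higher_is_better:
            better = sum(1 for other in scores.values() if other > value)
        else:
            better = sum(1 for other in scores.values() if other < value)
        result[name] = 1 + better
    return result


def rank_sums(mae_by_team, r2_by_team):
    """Sum of the MAE rank and the R^2 rank per team; the lowest sum wins."""
    mae_rank = ranks(mae_by_team, higher_is_better=False)
    r2_rank = ranks(r2_by_team, higher_is_better=True)
    return {name: mae_rank[name] + r2_rank[name] for name in mae_by_team}
```

Computing both metrics on a held-out split of the provided data gives a rough preview of how a model might fare in the evaluation.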

In [None]:
# Install RDKit
! pip install rdkit

In [None]:
# Import essential packages
import pandas as pd
import rdkit.Chem as rdc

In [None]:
# Load data
data = pd.read_csv("https://raw.githubusercontent.com/robpollice/cataloop-hackathon25/refs/heads/main/data_preprocessed.csv")

In [None]:
# Create mol objects from SMILES
data['Reactant'] = [rdc.MolFromSmiles(str(si)) for si in data.loc[:, 'Reactant SMILES'].values]
data['Product'] = [rdc.MolFromSmiles(str(si)) for si in data.loc[:, 'Product SMILES'].values]
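As a possible next step, the sketch below turns SMILES strings into a few simple RDKit descriptors and handles strings that RDKit cannot parse (`MolFromSmiles` returns `None` in that case). The helper name `featurize`, the choice of descriptors, and the small `demo` DataFrame are assumptions for illustration; only the column name `Reactant SMILES` comes from the data loaded above.

```python
import pandas as pd
import rdkit.Chem as rdc
from rdkit.Chem import Descriptors


def featurize(smiles):
    """Return a dict of simple descriptors, or None if the SMILES cannot be parsed."""
    mol = rdc.MolFromSmiles(str(smiles))
    if mol is None:
        return None
    return {
        "MolWt": Descriptors.MolWt(mol),       # molecular weight
        "LogP": Descriptors.MolLogP(mol),      # Crippen logP estimate
        "TPSA": Descriptors.TPSA(mol),         # topological polar surface area
        "HeavyAtoms": mol.GetNumHeavyAtoms(),  # number of non-hydrogen atoms
    }


# Toy input: styrene plus a deliberately invalid string
demo = pd.DataFrame({"Reactant SMILES": ["C=Cc1ccccc1", "not_a_smiles"]})
features = [featurize(s) for s in demo["Reactant SMILES"]]
```

Checking for `None` before computing descriptors avoids errors further down the pipeline if any entry fails to parse.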