In [1]:
import os
from dotenv import load_dotenv
from chemcrow.agents import ChemTools, ChemCrow
from chemcrow.evaluation import NoToolsAgent, Evaluator

load_dotenv()
chemtools = ChemTools()

In [2]:
task = (
    "Your first task is to preprocess a dataset of molecules and train a RF regressor. \n"
    "The file data/all.csv contains molecules with some properties. "
    "This task has 5 steps:\n"
    "1. Print the first rows to learn the format. "
    "Then filter the solvent by acetonitrile (filter using the SMILES). "
    "From this dataset take only the columns for the molecule and the absorbance max wavelength. "
    "2. With that dataset, calculate MorganFingerprints using rdkit with 1024 bits, and split "
    "the dataset into train and test with a 80:20 ratio. \n"
    "3. Train a Random Forest model from sklearn (use seed=42 and 100 estimators), "
    "train it on the train data and test it on the test data. "
    "Report the root mean squared error of the model.\n"
    "4. Load the molecules in data/pool_clean.smi, "
    "use the model to predict the absorption wavelength of all these molecules, "
    "and give as a final result the molecule that is closest to 369nm.\n"
    "5. As a final step, plan a synthetic procedure for this molecule.\n"
    "Use the PythonREPL for the first steps. Run every step and revise your code if you get no response in return."
)

# First, answer with tools
mrkl = ChemCrow(
    chemtools.all_tools,
    model="gpt-4",
    temp=0.1,
)

result_tools = mrkl.run(task)

> Entering new RetryAgentExecutor chain...
I need to preprocess the dataset and train a RF regressor. First, I need to print the first rows of the dataset to understand the format.
Action: Python REPL
Action Input: print(open("data/all.csv").readlines()[:5])
Observation: ['\ufeffTag,Chromophore,Solvent,Absorption max (nm),Emission max (nm),Lifetime (ns),Quantum yield,log(e/mol-1 dm3 cm-1),abs FWHM (cm-1),emi FWHM (cm-1),abs FWHM (nm),emi FWHM (nm),Molecular weight (g mol-1),Reference\n', '1,N#Cc1cc2ccc(O)cc2oc1=O,O,355,410,2.804262479,NaN,NaN,NaN,NaN,NaN,NaN,187.1537,DOI: 10.1021/acs.jpcb.5b09905\n', '2,N#Cc1cc2ccc([O-])cc2oc1=O,O,408,450,3.961965135,NaN,NaN,NaN,NaN,NaN,43,186.14576,DOI: 10.1021/acs.jpcb.5b09905\n', '3,CCCCCCCCCCCC#CC#CCCCCCCCCCN1C(=O)c2ccc3c4ccc5c6c(ccc(c7ccc(c2c37)C1=O)c64)C(=O)N(CCCCCCCCCC#CC#CCCCCCCCCCCCC)C5=O,ClC(Cl)Cl,526,535,3.602954423,NaN,NaN,NaN,NaN,NaN,NaN,1061.54348,https://doi.org/10.1002/smll.201901342\n', '4,[O-]c1
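
The header in the observation starts with `\ufeff`, which suggests data/all.csv carries a UTF-8 byte-order mark. The sketch below is one possible way to do the filtering part of step 1 with only the standard library; it is not the code the agent ran. It assumes that acetonitrile appears in the `Solvent` column exactly as the SMILES string `CC#N` (an equivalent but differently written SMILES would be missed by this exact match), and the helper name `load_acetonitrile_rows` and the third sample row are made up for illustration.

```python
import csv
import os
import tempfile

ACETONITRILE = "CC#N"  # SMILES for acetonitrile; assumes the dataset uses this exact form


def load_acetonitrile_rows(path):
    """Return (chromophore SMILES, absorption max in nm) pairs measured in acetonitrile."""
    rows = []
    # "utf-8-sig" drops a leading BOM, so the first column is "Tag", not "\ufeffTag".
    with open(path, newline="", encoding="utf-8-sig") as handle:
        for record in csv.DictReader(handle):
            if record["Solvent"] != ACETONITRILE:
                continue
            value = record["Absorption max (nm)"]
            if value in ("", "NaN"):
                continue  # skip rows without a measured absorption maximum
            rows.append((record["Chromophore"], float(value)))
    return rows


# Tiny stand-in file: the first two data rows are abbreviated from the output above,
# the third is a made-up acetonitrile entry for illustration only.
sample = (
    "\ufeffTag,Chromophore,Solvent,Absorption max (nm)\n"
    "1,N#Cc1cc2ccc(O)cc2oc1=O,O,355\n"
    "2,N#Cc1cc2ccc([O-])cc2oc1=O,O,408\n"
    "3,c1ccccc1,CC#N,255\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

filtered = load_acetonitrile_rows(tmp.name)
os.remove(tmp.name)
print(filtered)
```

Only the molecule and absorption columns are kept, matching the task description; the remaining columns are read but discarded.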

[32;1m[1;3mNow that I have the trained model and its RMSE, I need to load the molecules in data/pool_clean.smi, use the model to predict the absorption wavelength of all these molecules, and find the molecule that is closer to 369nm.
Action: Python REPL
Action Input: pool_clean = [line.strip() for line in open("data/pool_clean.smi").readlines()]; pool_clean_fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=1024) for smiles in pool_clean]; pool_clean_preds = rf.predict(pool_clean_fps); closest_molecule_index = np.argmin(np.abs(np.array(pool_clean_preds) - 369)); closest_molecule_smiles = pool_clean[closest_molecule_index]; print(closest_molecule_smiles)
Observation: COC(=O)c1ccc(/C=C/c2ccc(-c3cccc(NS(C)(=O)=O)c3)cc2)c(C)c1
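
The selection in the action above takes the index with the smallest absolute difference between prediction and target (`np.argmin(np.abs(preds - 369))`). A dependency-free sketch of the same idea follows; the function name `closest_to_target` and the candidate values are hypothetical and are not taken from the run.

```python
def closest_to_target(smiles_list, predictions, target_nm=369.0):
    """Return the (SMILES, prediction) pair whose prediction is nearest target_nm."""
    if len(smiles_list) != len(predictions):
        raise ValueError("smiles_list and predictions must have the same length")
    if not smiles_list:
        raise ValueError("no candidates to choose from")
    # Index of the smallest absolute deviation; on ties the first index wins,
    # which is also how np.argmin behaves.
    best = min(range(len(predictions)), key=lambda i: abs(predictions[i] - target_nm))
    return smiles_list[best], predictions[best]


# Hypothetical candidates and predicted wavelengths, for illustration only.
pool = ["CCO", "c1ccccc1", "CC(=O)C"]
preds = [350.2, 371.5, 402.0]
print(closest_to_target(pool, preds))
```

Returning the predicted value alongside the SMILES makes it easy to check how far the chosen molecule actually is from the target.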
Thought: Now that I have the SMILES of the molecule closest to 369nm, I need to plan a synthetic procedure for this molecule.
Action: RXNPlanner
Action Input: COC(=O)c1ccc(/C=C/c2ccc(-c3cccc(NS(C)(=O)