# Dataset reference -- https://pubs.acs.org/doi/10.1021/acscatal.9b04293

## Labels: Selectivity, Yield, Conversion, Combo

### Synthesis methods:

#### Adapted (key terms: support1, M1, m1, M2, m2, M3, m3, CT, Total_flow, meth/oxy, arg_P, Temp, Name):

"Name was prepared based on a co-impregnation method. A support support1 (1.0 g) was impregnated with 4.5 mL of an aqueous solution consisting of M1 (m1), M2 (m2), and M3 (m3) at 50 °C for 6 h. After vacuum drying at 110 °C, the product was calcined at 1000 °C under air for 3 h. Once Name is activated, the reaction is run at Temp °C. The total flow volume was Total_flow mL/min with a CH4/O2 flow ratio of meth/oxy mol/mol and an Ar concentration of arg_P atm. The height of the catalyst bed was fixed at 10 mm, leading to a contact time of CT s at Total_flow mL/min."

#### Original:
"The catalysts were prepared based on a co-impregnation method. A support substrate (1.0 g) was impregnated with 4.5 mL of an aqueous solution of specified metal precursors at 50 °C for 6 h. After vacuum drying at 110 °C, the product was calcined at 1000 °C under air for 3 h to yield a catalyst. When a water-sensitive metal alkoxide was employed, the impregnation was sequentially performed in the order of an aqueous solution of tungstate and an ethanol solution of a metal alkoxide. The obtained catalysts were thoroughly ground before any usage. The catalyst preparation was appropriately parallelized with the aids of a parallel hot stirrer (Reacti-Therm, Thermo Scientific) and a centrifugal evaporator (CVE-3100, Eyela). Twenty catalysts were produced in one batch. The samples were characterized by X-ray diffraction and scanning electron microscopy."

### Reaction conditions:

"Once catalysts are activated at 1000 °C for 160 min under O2, the temperature is stepwise declined from 900 to 850, 800, 775, 750, and 700 °C. At each temperature, the total flow volume (10, 15, and 20 mL/min/channel), the CH4/O2 ratio (2, 3, 4, and 6 mol/mol), and the Ar concentration (PAr = 0.15, 0.40, 0.70 atm) are stepwise varied. One reaction condition is held for 6–7 min, which allows 2–3 rounds of sampling in the same condition for acquiring the error range of observations. The ascending temperature protocol was not employed as it causes excessive CO and CO2 production due to the combustion of carbon deposits. The height of the catalyst bed was fixed at 10 mm, leading to a contact time of 0.75, 0.50, or 0.38 s at the given total flow volumes. Combined variations in the temperature, the total flow volume, the CH4/O2 ratio, and the Ar concentration lead to 216 conditions per catalyst and 4320 observations for 20 catalysts in a single automated operation."
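
The counts quoted above can be checked by enumerating the condition grid. The sketch below uses only the values listed in the paragraph. The variable names are illustrative, and the constant 7.5 is an assumption inferred from the stated 0.75 s contact time at 10 mL/min, assuming contact time scales inversely with total flow.

```python
from itertools import product

# Values taken from the reaction-conditions paragraph above.
temperatures_c = [900, 850, 800, 775, 750, 700]
total_flows_ml_min = [10, 15, 20]
ch4_o2_ratios = [2, 3, 4, 6]
ar_pressures_atm = [0.15, 0.40, 0.70]

# Every combination of temperature, flow, CH4/O2 ratio, and Ar pressure.
conditions = list(
    product(temperatures_c, total_flows_ml_min, ch4_o2_ratios, ar_pressures_atm)
)
n_conditions = len(conditions)        # 6 * 3 * 4 * 3 combinations per catalyst
n_observations = n_conditions * 20    # 20 catalysts per automated run

# Assumption: contact time is inversely proportional to total flow for the
# fixed 10 mm bed; 7.5 is inferred from 0.75 s at 10 mL/min.
contact_times_s = {flow: 7.5 / flow for flow in total_flows_ml_min}

print(n_conditions, n_observations, contact_times_s)
```

Under that assumption, the 20 mL/min value comes out as 0.375 s, which matches the 0.38 s quoted in the text after rounding.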

In [None]:
import pandas as pd
import bolift
from collections import OrderedDict

from dotenv import load_dotenv

load_dotenv("../.env")

In [None]:
name_dict = {
    'Name': 'name',
    'Support ': 'sup',
    'M1': 'm1',
    'M1_mol': 'm1_mol',
    'M2': 'm2',
    'M2_mol': 'm2_mol',
    'M3': 'm3',
    'M3_mol': 'm3_mol',
    'Temp': 'react_temp',
    'Total_flow': 'flow_vol',
    'Ar_flow': 'ar_vol',
    'CH4_flow': 'ch4_vol',
    'O2_flow': 'o2_vol',
    'CT': 'contact'
}

prompt_template = (
    "To synthesize {name}, {sup} (1.0 g) was impregnated with 4.5 mL of an aqueous solution consisting of {m1} ({m1_mol} mol), {m2} ({m2_mol} mol), {m3} ({m3_mol} mol), "
    "at 50 °C for 6 h. Once activated, the reaction is run at {react_temp} °C. "
    "The total flow rate was {flow_vol} mL/min (Ar: {ar_vol} mL/min, CH4: {ch4_vol} mL/min, O2: {o2_vol} mL/min), "
    "leading to a contact time of {contact} s."
)
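
To illustrate how a row becomes a prompt, here is a minimal sketch with a shortened template and a made-up row. `example_template`, `example_row`, and all values are hypothetical and are not taken from the dataset. Only the mechanism (filter the row to the placeholder names, then call `str.format`) mirrors this notebook.

```python
# Shortened, hypothetical template using the same placeholder style.
example_template = (
    "To synthesize {name}, {sup} (1.0 g) was impregnated with an aqueous solution "
    "consisting of {m1} ({m1_mol} mol). The reaction is run at {react_temp} °C."
)

# Made-up row; "C2y" mimics a column that is a label, not part of the prompt.
example_row = {
    "name": "Mn-Na2WO4/SiO2",
    "sup": "SiO2",
    "m1": "Mn",
    "m1_mol": 0.37,
    "react_temp": 800,
    "C2y": 12.3,
}

# Keep only the keys that the template actually uses.
allowed = {"name", "sup", "m1", "m1_mol", "react_temp"}
props = {k: v for k, v in example_row.items() if k in allowed}

example_prompt = example_template.format(**props)
print(example_prompt)
```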


In [None]:
# Load the OCM dataset.
df = pd.read_csv('oxidative_methane_coupling.csv')

def calculate_M1_mol(row):
    """Back-calculate the M1 amount from its mole percentage and the M2 and M3 amounts."""
    m1_fraction = row["M1_mol%"] / 100
    return round(m1_fraction * (row["M2_mol"] + row["M3_mol"]) / (1 - m1_fraction), 3)

df["M1_mol"] = df.apply(calculate_M1_mol, axis=1)
df.rename(columns=name_dict, inplace=True)
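
The M1 back-calculation can be sanity-checked on a single made-up composition. The sketch below assumes that `M1_mol%` is M1's share of the total M1 + M2 + M3 amount. That reading is implied by the formula, since solving p = x / (x + M2 + M3) for x gives x = p (M2 + M3) / (1 - p), but the source does not state it. The function name and the numbers are illustrative.

```python
def m1_mol_from_percent(m1_mol_percent, m2_mol, m3_mol):
    """Solve p = x / (x + m2 + m3) for x, with p given in percent."""
    fraction = m1_mol_percent / 100
    # Undefined for 100 %, where the denominator becomes zero.
    return fraction * (m2_mol + m3_mol) / (1 - fraction)

# Hypothetical composition: 20 mol% M1 alongside 0.3 and 0.5 units of M2 and M3.
m1 = m1_mol_from_percent(20, 0.3, 0.5)

# Round trip: the recovered percentage should match the input.
recovered_percent = 100 * m1 / (m1 + 0.3 + 0.5)
print(m1, recovered_percent)
```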

In [None]:
# df.groupby(['name', 'm1', 'M1_atom_number', 'm2', 'M2_atom_number', 'm3', 'M3_atom_number', 'sup', 'Support_ID', 'M2_mol', 'M3_mol', 'm1_mol', 'm2_mol', 'm3_mol']).count()

In [None]:
print(f"We have {len(df['name'].unique())} unique catalysts.")
samples_per_catalyst = 216  # datasets were created with 50, 100, 150, and 216 samples per catalyst
unique_catalysts = df['name'].unique()
# Keep at most `samples_per_catalyst` rows for each catalyst, in order of first appearance.
filtered_df = pd.concat(
    [df[df['name'] == k].iloc[:samples_per_catalyst] for k in unique_catalysts]
)
print(f"We created a pool of {len(filtered_df)} samples by selecting {samples_per_catalyst} samples from each catalyst.")

# The last catalyst has only 180 available samples. That's why we have 12708 samples instead of 59*216 = 12744.
# (59*216) - (216-180) = 12708
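
As a possible alternative to looping over catalysts, pandas can cap the number of rows per group directly with `groupby(...).head(n)`. The sketch below runs on a toy frame with made-up values and also rechecks the pool-size arithmetic from the comment above. Note that `head` keeps rows in their original order rather than regrouping them by catalyst, so the two approaches only produce the same row order when each catalyst's rows are already contiguous.

```python
import pandas as pd

# Toy data: catalyst "C" has fewer rows than the cap, like the last catalyst here.
toy = pd.DataFrame({
    "name": ["A"] * 5 + ["B"] * 5 + ["C"] * 2,
    "C2y": range(12),
})

cap = 3
# Keep the first `cap` rows of each catalyst; shorter groups are kept whole.
capped = toy.groupby("name", sort=False).head(cap)
counts = capped["name"].value_counts().to_dict()

# Pool size when 59 catalysts are capped at 216 rows and one has only 180.
expected_pool = 59 * 216 - (216 - 180)
print(counts, expected_pool)
```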

In [None]:
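# Write one "prompt;completion" line per row; the completion is the value of the C2y column.
with open(f'./data/{len(filtered_df)}_ocm_dataset.csv', 'w') as f:
    f.write("prompt;completion\n")
    for i, r in filtered_df.iterrows():
        # Keep only the columns that appear as placeholders in prompt_template.
        props = OrderedDict({
            k:v for k,v in r.items() if k in name_dict.values()
        })

        f.write(f'{prompt_template.format(**props)};{r["C2y"]}\n')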
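
Joining fields with `;` by hand works as long as no prompt contains a semicolon. A hedged alternative is the standard-library `csv` module, which quotes such fields automatically. The sketch below writes to an in-memory buffer with made-up rows. It is not how the dataset in this notebook was produced.

```python
import csv
import io

# Made-up rows; the first prompt deliberately contains the delimiter.
rows = [
    ("Prompt with; a semicolon inside.", 12.3),
    ("Plain prompt.", 4.5),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(["prompt", "completion"])
writer.writerows(rows)  # fields containing ";" are quoted automatically

# Read the text back to confirm that the columns survive the round trip.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter=";"))
print(parsed)
```

A file written this way can still be read with `pd.read_csv(..., sep=";")`, since pandas honors the default double-quote quoting.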


In [None]:
ocm_ds = pd.read_csv(f'./data/{len(filtered_df)}_ocm_dataset.csv', sep=";")

pool = bolift.Pool(ocm_ds['prompt'].tolist(), formatter=lambda x: f"experimental procedure: {x}")
pool