# Generating a polymer pool

This notebook builds a diverse pool of candidate polymers to support the discovery of functional polymers. Polymer structures are enumerated with `SMiPoly`, a rule-based virtual library generator. `RadonPy`, a toolkit for molecular dynamics and property prediction, is then used to remove duplicate structures and to compute descriptors for the generated polymers.

## 1. Import Libraries
This cell loads the Python libraries for numerical work and data manipulation (`numpy`, `pandas`), the `SMiPoly` modules for monomer classification and polymer generation (`smipoly.smip`), and the `RadonPy` modules used for polymer handling and force field descriptors.

In [None]:
import numpy as np
import pandas as pd
from smipoly.smip import monc, polg
from radonpy.core import poly
from radonpy.ff.gaff2_mod import GAFF2_mod
from radonpy.ff.descriptor import FF_descriptor

## 2. Generating pool
This function uses `SMiPoly` to turn a dataframe of monomers, given as `SMILES` strings, into a pool of unique polymers. It first classifies the monomers from their `SMILES` representations, which identifies those that can take part in polymerization and sets aside those that cannot.

After classification, it generates polymers by applying `SMiPoly`'s polymerization rules to pairs of monomers, so each polymer is built from two monomer units. The function then keeps only repeat units with exactly two connection points (`*`), removes duplicates, shuffles the rows, and assembles a structured dataframe. The resulting dataframe carries unique identifiers and placeholder columns for molecular properties, ready for downstream analyses such as Bayesian optimization.

In [None]:
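def generate_pool(df):
    # NOTE: `props` is read from the notebook's global scope; it must be
    # defined before this function is called.
    # Classify the monomers, then generate polymers with SMiPoly.
    df_mon = monc.moncls(df=df, smiColn='smiles', dsp_rsl=False)
    df_pol = polg.biplym(df=df_mon, Pmode='a', dsp_rsl=False)
    # Keep repeat units with exactly two '*' connection points.
    df_pol = df_pol[df_pol['polym'].apply(lambda x: x.count('*') == 2)]
    df_ = df_pol.drop_duplicates(subset=['polym']).reset_index(drop=True)
    # Find pairs of entries that RadonPy reports as matching structures.
    matches = poly.full_match_smiles_listself(pd.Series(df_["polym"]), mp=2)
    matches = pd.DataFrame(matches, columns=["idx1", "idx2"])

    # Drop the first member of each matched pair, then shuffle the rows.
    del_idx = matches["idx1"].values.tolist()
    df_ = df_.drop(del_idx)
    df_ = df_.sample(frac=1).reset_index(drop=True)

    df_["monomer_ID"] = ["SMiPoly_VL" + str(i) for i in range(len(df_))]
    df_["smiles"] = df_["polym"]
    df_["cycle"] = ""

    # Placeholder property columns, initialized to 0.
    for prop in props:
        df_[prop] = 0

    return df_[["monomer_ID", "mon1", "mon2", "smiles"] + props + ["cycle"]]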
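
The filtering and bookkeeping steps in `generate_pool` can be tried without `SMiPoly` or `RadonPy` installed. The sketch below uses a hypothetical toy table (`toy`, with made-up monomer labels) in place of the real `SMiPoly` output, and covers only the `*` count filter, string-level deduplication, ID assignment, and placeholder columns. It does not reproduce the structure-level matching done by `RadonPy`.

```python
import pandas as pd

# Hypothetical toy table standing in for the SMiPoly result (not real output).
toy = pd.DataFrame({
    "mon1": ["A", "A", "B", "C"],
    "mon2": ["B", "B", "C", "D"],
    "polym": ["*CCO*", "*CCO*", "*CC(*)C*", "*c1ccc(*)cc1"],
})

# Keep only repeat units with exactly two '*' connection points.
two_ends = toy[toy["polym"].apply(lambda s: s.count("*") == 2)]

# Remove exact string duplicates and renumber the rows.
unique = two_ends.drop_duplicates(subset=["polym"]).reset_index(drop=True)

# Assign sequential identifiers and zero-filled property placeholders.
props = ["refractive_index", "Cp"]
unique["monomer_ID"] = ["SMiPoly_VL" + str(i) for i in range(len(unique))]
for prop in props:
    unique[prop] = 0

print(unique[["monomer_ID", "polym"] + props])
```

String-level deduplication only removes identical `SMILES` strings. Two different strings that encode the same structure would both survive this step, which is presumably why the notebook adds a structure-matching pass afterwards.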

## 3. Calculate Force Field (FF) descriptors
This function calculates force field descriptors for each polymer. It uses kernel mean embedding to convert the GAFF2 (General Amber Force Field 2) parameters of a molecule, whose number varies from molecule to molecule, into a fixed-length vector. GAFF2 parameters cover interactions ranging from covalent bonds to non-covalent van der Waals and Coulomb forces, and they are mapped into a high-dimensional feature space with a Gaussian kernel function. Because every molecule ends up with a vector of the same length, molecules can be compared directly.

In practice, the range of each parameter is discretized into intervals, each represented by a Gaussian function, which gives an approximation of how the parameter values are distributed. The result is a compact, quantitative summary of each molecule's force field parameters. Detailed explanations are given [here](https://github.com/RadonPy/RadonPy/blob/develop/docs/FF-Descriptor_man.pdf).

In [None]:
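def calc_ff_descriptors(df, mp=2, n=10, nk=20):
    # Gaussian width derived from nk: sigma = 1 / (2 * nk), i.e. 0.025 for nk=20.
    sigma = 1/nk/2
    mu = None

    # Descriptor calculator built on the modified GAFF2 force field.
    ff_desc = FF_descriptor(GAFF2_mod(), polar=False)
    # n is forwarded to RadonPy as the `cyclic` argument.
    desc = ff_desc.ffkm_mp(
        df["smiles"], mp=mp, nk=nk, s=sigma, mu=mu, cyclic=n
    )
    desc_names = ff_desc.ffkm_desc_names(nk=nk)
    return pd.DataFrame(desc, columns=desc_names)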
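
To make the idea of kernel mean embedding concrete, here is a minimal NumPy sketch. It is not RadonPy's implementation: the function name `kernel_mean_embedding`, the assumption that values are already scaled to [0, 1], and the evenly spaced centers are all illustrative choices. It only shows how a variable-length set of parameter values becomes a fixed-length vector, using the same width rule as above (`sigma = 1/nk/2`).

```python
import numpy as np

def kernel_mean_embedding(values, nk=20, sigma=None):
    """Embed a variable-length set of values in [0, 1] as a length-nk vector.

    Illustrative sketch only; the centers and scaling are assumptions.
    """
    values = np.asarray(values, dtype=float)
    if sigma is None:
        sigma = 1.0 / nk / 2.0  # same width rule as in calc_ff_descriptors
    if values.size == 0:
        return np.zeros(nk)
    # nk evenly spaced Gaussian centers covering [0, 1].
    centers = np.linspace(0.0, 1.0, nk)
    # Pairwise differences: one row per value, one column per center.
    diff = values[:, None] - centers[None, :]
    kernel = np.exp(-0.5 * (diff / sigma) ** 2)
    # Averaging over the values gives one number per center.
    return kernel.mean(axis=0)

# Inputs of different lengths map to vectors of the same length.
a = kernel_mean_embedding([0.0, 0.5, 1.0])
b = kernel_mean_embedding([0.25])
print(a.shape, b.shape)
```

Averaging rather than summing keeps the vector on the same scale regardless of how many values a molecule contributes. This is one possible design choice, and the actual descriptor may weight or normalize differently.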

## 4. Main execution
This cell runs the whole workflow. It loads the monomer dataset, calls the functions defined above to generate the polymer pool and calculate its force field descriptors, and writes both tables to CSV files for later use.

In [None]:
DATA_DIR = "../spacier/data/"
monomer_path = DATA_DIR + "monomer.csv"
df = pd.read_csv(monomer_path)
# Property columns added as placeholders; generate_pool reads this global.
props = ["refractive_index", "Cp"]

df_pool = generate_pool(df)
df_pool_X = calc_ff_descriptors(df_pool)

df_pool.to_csv("df_pool.csv", index=False)
df_pool_X.to_csv("df_pool_X.csv", index=False)
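
The two CSV files share no key column, so they can only be recombined by row position. The sketch below shows one way to reload and join them, using hypothetical stand-in tables (`pool`, `desc`, with made-up column names `d0` and `d1`) and a temporary directory. It assumes the descriptor table has exactly one row per pool entry, in the same order.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for df_pool and df_pool_X.
pool = pd.DataFrame({
    "monomer_ID": ["SMiPoly_VL0", "SMiPoly_VL1"],
    "smiles": ["*CCO*", "*c1ccc(*)cc1"],
})
desc = pd.DataFrame({"d0": [0.1, 0.2], "d1": [0.3, 0.4]})

with tempfile.TemporaryDirectory() as tmp:
    pool_path = os.path.join(tmp, "df_pool.csv")
    desc_path = os.path.join(tmp, "df_pool_X.csv")
    # Write without the index, as in the main cell.
    pool.to_csv(pool_path, index=False)
    desc.to_csv(desc_path, index=False)
    pool_back = pd.read_csv(pool_path)
    desc_back = pd.read_csv(desc_path)

# Rows are matched by position only, so check the lengths before joining.
assert len(pool_back) == len(desc_back)
combined = pd.concat([pool_back, desc_back], axis=1)
print(combined.shape)
```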