# USPPPM Common Chemical Compound Lookup Table Utils
Utility script to look up synonyms of common chemical compounds, as extracted from the following Wikipedia entries:
- https://en.wikipedia.org/wiki/Glossary_of_chemical_formulae
- https://en.wikipedia.org/wiki/List_of_CAS_numbers_by_chemical_compound


Prediction analysis showed that most of the models we trained had difficulty predicting the score for phrases containing chemical formulae. We therefore introduce a lookup table to find synonyms for common chemical compounds based on their formulae.

Since phrases in the USPPPM training data are lower-cased, chemical formulae can be ambiguous and are hard to identify directly. Furthermore, chemical formulae might spell actual words. For example, the string `base` can be interpreted as a word (as in *military base*) or as the chemical compound *barium selenide* (BaSe).
Since replacing an actual word with the name of a chemical compound would likely produce wrong samples, we need a strategy for deciding which strings to replace when they match a formula. Here we use three heuristics, applied in this order:
1. A manually curated blacklist of strings that are never replaced
2. A pattern-based whitelist: any string that contains a digit is always replaced
3. A final blacklist based on a lookup in the WordNet database. If we find a synset for the string, we assume that it is a commonly used word that most large language models can handle properly, and we do not replace it.
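
The decision order of the three heuristics can be sketched without any dependencies. This is only an illustration under stated assumptions: the WordNet lookup is replaced by a small hypothetical word set (`COMMON_WORDS`), and the function name `should_replace` is made up for this example.

```python
# Illustrative stand-ins: the notebook queries WordNet instead of COMMON_WORDS.
MANUAL_BLACKLIST = {"fi"}
COMMON_WORDS = {"base", "no", "he", "in"}  # hypothetical sample of dictionary words


def should_replace(token: str) -> bool:
    """Return True if `token` should be treated as a chemical formula."""
    # 1. Manual blacklist: never replace these strings.
    if token in MANUAL_BLACKLIST:
        return False
    # 2. Pattern whitelist: anything containing a digit counts as a formula.
    if any(ch.isdigit() for ch in token):
        return True
    # 3. Dictionary blacklist: known words are left untouched.
    return token not in COMMON_WORDS


for token in ["base", "h2o", "agbr", "fi", "no"]:
    print(f"{token!r}: replace={should_replace(token)}")
```

With this word set, `base` and `no` are kept as words, while `h2o` and `agbr` are treated as formulae. The trade-off is that genuine formulae that happen to spell a word, such as `no` (nitric oxide), are never expanded.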

### Libraries and Setup

In [1]:
import os
from typing import List, Optional
import nltk
from tqdm import tqdm
import itertools
import pandas as pd

CHEM_COMP_PATH = "../input/chemical-compounds-lookup/compounds.csv"

def setup_nltk():
    nltk.data.path.append('../input/wordnet')
    # nltk.download('wordnet', download_dir='./nltk_data')

setup_nltk()
from nltk.corpus import wordnet

### Phrase replacement

In [2]:
def formula_viable_for_augmentation(formula: str) -> bool:
    '''
      Checks whether a given string is viable for being replaced by a chemical synonym
      @param formula A string representing the formula of a chemical compound
      @return `True` if `formula` is viable for replacement with the name(s) of the chemical compound
    '''
    # Manually curated blacklist of words not to replace
    blacklist = ['fi']
    if formula in blacklist:
        return False

    # Whitelist pattern: if there's any digit in the string, we assume it to be a chemical formula
    if any(i.isdigit() for i in formula):
        return True

    # If we find a synset for the string, we assume it is a regular word and will not replace it
    if len(wordnet.synsets(formula)) > 0:
        return False

    return True

def chem_formula_synonyms(chem_df: pd.DataFrame, formula: str) -> Optional[List[str]]:
    '''
      Finds the synonyms of a chemical formula
      @param chem_df The lookup table of synonyms for chemical formulae
      @param formula A string potentially representing a replaceable chemical formula
      @return A list of synonyms for the chemical compound referenced by the formula in
        string `formula`, or `None` if the string is not a formula referencing a chemical
        compound viable for replacement.
    '''
    if formula_viable_for_augmentation(formula):
        res = chem_df[chem_df['Formula'] == formula.lower()]
        if len(res) > 0:
            return res['Name'].to_list()
    
    return None

def phrase_chem_formula_synonym(chem_df: pd.DataFrame, phrase: str, replace: bool=True):
    '''
      Checks a given phrase for common chemical formulae and
      returns a list of phrases where common chemical formulae were substituted
      by their synonyms. If no chemical formula is found in the phrase, the original
      phrase is returned
      @param chem_df the lookup dataframe for common chemical formulae
      @param phrase the phrase to check for chemical formulae
      @param replace if `True`, the original phrase will not be part of the returned list
        in case synonyms have been found
      @return a list of phrases containing synonyms of chemical formulae, or the original
        phrase (as a plain string) if no formula was found.
    '''
    # Trigger warning: sloppy implementation - but it works
    
    # we build a map from each word in the phrase to its list of synonyms (or None)
    syn_map = {w: chem_formula_synonyms(chem_df, w) for w in phrase.split()}
       
    # we start with the original phrase and for each word replace it with 
    # the corresponding synonym. If there are multiple synonyms, we introduce 
    # a new phrase to the list
    phrases = [phrase]
    found_sym = False
    for w, syns in syn_map.items():
        if not syns: continue
        found_sym = True
            
        new_ps = []
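        for syn in syns:
            # replace whole tokens only, so that `w` is not rewritten inside longer words
            new_ps += [' '.join(syn if t == w else t for t in p.split()) for p in phrases]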
            
        phrases = new_ps
    
    # if we don't find any replaceable chemical formulae in the phrase we return the original phrase
    if not found_sym:
        return phrase
    else:
        return phrases if replace else (phrases + [phrase])
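
The expansion step can also be expressed as a Cartesian product over per-token alternatives, which avoids substring replacement altogether. The following is a minimal sketch under stated assumptions: `TOY_LOOKUP` is a hypothetical two-entry stand-in for the compounds table (its entries mirror the notebook's test output), and `expand_phrase` is an illustrative helper, not a function from this notebook.

```python
from itertools import product
from typing import Dict, List

# Hypothetical stand-in for the compounds table.
TOY_LOOKUP: Dict[str, List[str]] = {
    "agbr": ["silverbromide"],
    "d2o": ["deuteriumoxide", "heavywater"],
}


def expand_phrase(phrase: str, lookup: Dict[str, List[str]]) -> List[str]:
    """Return every variant of `phrase` with formula tokens swapped for synonyms."""
    # Each token contributes either its synonyms or just itself.
    alternatives = [lookup.get(token, [token]) for token in phrase.split()]
    # The Cartesian product enumerates one choice per token.
    return [" ".join(choice) for choice in product(*alternatives)]


print(expand_phrase("agbr test d2o", TOY_LOOKUP))
```

Unlike the function above, this sketch always returns a list (a phrase without formulae yields a one-element list), and it skips the viability check, which would have to be applied when building the lookup.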

**Tests**

In [3]:
chem_lookup = pd.read_csv(CHEM_COMP_PATH)
display(chem_lookup)
print(phrase_chem_formula_synonym(chem_lookup, 'agbr test'))
print(phrase_chem_formula_synonym(chem_lookup, 'agbr dna test agonc ag2cl2'))
print(phrase_chem_formula_synonym(chem_lookup, 'dna test d2o'))
chem_formula_synonyms(chem_lookup, 'c3h6')

Unnamed: 0,Name,Formula
0,actiniumiiioxide,ac2o3
1,silvertetrachloroaluminate,agalcl4
2,silverbromide,agbr
3,silverbromate,agbro3
4,silvercyanide,agcn
...,...,...
4067,zirconateion,zro32
4068,zirconiumphosphide,zrp2
4069,zirconiumsulfide,zrs2
4070,zirconiumsilicide,zrsi2


['silverbromide test']
['silverbromide dna test silverfulminate disilverdichloride', 'silverbromide dna test silvercyanate disilverdichloride', 'silverbromide dna test silverfulminate silveriidichloride', 'silverbromide dna test silvercyanate silveriidichloride']
['dna test deuteriumoxide', 'dna test heavywater']


['cyclopropane', 'propylene']

### USPPPM Dataset Augmentation
Utility functions to augment a USPPPM dataset with samples generated by replacing common chemical formulae in *anchor* and *target* phrases with their corresponding synonyms.

This can be used at training time, by augmenting the dataset with additional samples to train on, and/or at inference time, by averaging the predictions for the additional augmented samples.
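
For the inference-time use, predictions for all augmented rows that share an `id` are averaged back into a single score. A minimal sketch, assuming a hypothetical `score` column holding model predictions (the values are made up for illustration):

```python
import pandas as pd

# Made-up predictions for augmented rows; `score` is an assumed column name.
preds = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "score": [0.50, 0.70, 0.25],
})

# One averaged prediction per original sample id.
averaged = preds.groupby("id", as_index=False)["score"].mean()
print(averaged)
```

Averaging gives every original sample equal weight in the final output, regardless of how many augmented variants it produced.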

In [4]:
def augment_chem_formulae(df: pd.DataFrame, replace:bool=True):
    '''
      Augments the data with synonyms of chemical formulae. 
      If param `replace` is `False` or more than one synonym was found 
      for a formula, additional rows with the same id will be introduced.
      @param df Dataframe with columns `id`, `anchor` and `target`
      @param replace If `True`, augmented phrases replace the original phrase
      @return Dataframe with columns `id`, `anchor` and `target`, one row per
        combination of anchor and target variants
    '''
    
    # get synonym lists for anchor and target
    syms = pd.DataFrame({'id':df['id']})
    syms['anchor'] = df['anchor'].apply(lambda p: phrase_chem_formula_synonym(chem_lookup, p, replace))
    syms['target'] = df['target'].apply(lambda p: phrase_chem_formula_synonym(chem_lookup, p, replace))
    
    # create all combinations of anchor and target synonyms
    syms = syms.explode('anchor').explode('target')
    
    return syms


def average_scores_by_id(scores: pd.DataFrame):
    # pandas spells this `groupby`; only numeric columns (the scores) are averaged
    return scores.groupby('id').mean(numeric_only=True).reset_index()

**Tests**

In [5]:
chem_test = pd.DataFrame({'id': pd.Series(['t1', 't2', 't3', 't4']), 
                         'anchor': pd.Series(['agbr dna test agonc ag2cl2', 'agbr test', 'agbr', 'last']),
                         'target': pd.Series(['agonc ag2cl2', 'test thingy', 'c4h7no4', 'last']),
                         'context': pd.Series(['G02', 'G02', 'C12', 'C12'])})
print("Before")
display(chem_test)
print("After")
augment_chem_formulae(chem_test, True)

Before


Unnamed: 0,id,anchor,target,context
0,t1,agbr dna test agonc ag2cl2,agonc ag2cl2,G02
1,t2,agbr test,test thingy,G02
2,t3,agbr,c4h7no4,C12
3,t4,last,last,C12


After


Unnamed: 0,id,anchor,target
0,t1,silverbromide dna test silverfulminate disilve...,silverfulminate disilverdichloride
0,t1,silverbromide dna test silverfulminate disilve...,silvercyanate disilverdichloride
0,t1,silverbromide dna test silverfulminate disilve...,silverfulminate silveriidichloride
0,t1,silverbromide dna test silverfulminate disilve...,silvercyanate silveriidichloride
0,t1,silverbromide dna test silvercyanate disilverd...,silverfulminate disilverdichloride
0,t1,silverbromide dna test silvercyanate disilverd...,silvercyanate disilverdichloride
0,t1,silverbromide dna test silvercyanate disilverd...,silverfulminate silveriidichloride
0,t1,silverbromide dna test silvercyanate disilverd...,silvercyanate silveriidichloride
0,t1,silverbromide dna test silverfulminate silveri...,silverfulminate disilverdichloride
0,t1,silverbromide dna test silverfulminate silveri...,silvercyanate disilverdichloride
