<center><h1> Determination of the best chromatography type </h1></center>
    
## **Introduction**

<p style="text-align: justify;">Chromatography is a crucial analytical method in chemistry for separating, identifying, and quantifying the components of a mixture. Selecting the appropriate type of chromatography (e.g., gas chromatography, liquid chromatography, thin-layer chromatography) is essential for good results. This is why we decided to create a programming tool that easily determines the chromatography type as well as the best eluent to carry it out. 
This report details the development and execution of a programming project aimed at creating a tool to determine the most suitable chromatography technique for analyzing a given mixture of molecules. The project was written in the Python programming language. The functions that make up the tool are presented in separate notebooks, each of which is commented. Moreover, the full project is available in the Python source file. 
The project proceeded in three steps: a database was used to extract all the information needed to establish the best chromatography type, then a function that takes all this information into account was written, and finally a user interface was created.

## **Objectives**

The primary objective of this project was to develop a program that:

- Accepts input data on the molecular composition of a mixture.
- Analyzes the properties of the molecules.
- Recommends the most suitable chromatography technique based on predefined criteria and standards.

## **Used Packages**

- Tkinter: used to create a Graphical User Interface 
- Pandas: used to create a data frame in which all the properties are added
- Pubchemprops: used to retrieve most of the properties of the molecules from PubChem (the pKa is not included)
- pka_lookup.py: a script written by Khoi Van which requests the needed dictionary of strings from PubChem.
 

## **Methodology**

### 1. Graphical User Interface ###

<p style="text-align: justify;">A simple user interface was developed using Tkinter, allowing users to input the names of several molecules and obtain chromatography recommendations. Each molecule name entered by the user is added to a listbox, which keeps track of the molecules in the mixture. The user then clicks the "Determine Chromatography" button, and the best chromatography type is displayed on the interface. If an error occurs, an error message is displayed in a message box.

<div style="display: flex; justify-content: center; align-items: center; padding: 20px;">
    <div style="margin: 10px;">
        <img src="../assets/edited-hydrocarbure.gif" alt="hydrocarbures" style="width: 400px; height: auto;">
    </div>
    <div style="margin: 10px;">
        <img src="../assets/edited-naphtalene-.gif" alt="HPLCexample" style="width: 400px; height: auto;">
    </div>
</div>


These two examples show how the results appear on the user interface when molecules are added. Two different mixtures are given above, hence two different results appear on the screen.
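The interface described above can be sketched as follows. This is a minimal illustration, not the project's actual GUI code: the names `add_molecule`, `check_mixture` and `build_app` are hypothetical, and `determine` stands for any function that turns the list of molecule names into a result string. The validation logic is kept in plain functions so that it can be tested without opening a window.

```python
def add_molecule(name, molecules):
    """Strip the typed name and append it to the mixture; reject empty input."""
    cleaned = name.strip()
    if not cleaned:
        raise ValueError("Please enter a molecule name.")
    molecules.append(cleaned)
    return cleaned


def check_mixture(molecules):
    """Refuse to run the determination on an empty mixture."""
    if not molecules:
        raise ValueError("No molecule has been added to the mixture.")


def build_app(determine):
    """Build the window; `determine` maps a list of names to a result string."""
    # Imported here so the helpers above stay usable without a display.
    import tkinter as tk
    from tkinter import messagebox

    root = tk.Tk()
    root.title("Chromatography finder")
    molecules = []

    entry = tk.Entry(root, width=30)
    entry.pack(padx=10, pady=5)
    listbox = tk.Listbox(root, width=30)
    listbox.pack(padx=10, pady=5)
    result = tk.Label(root, text="")
    result.pack(padx=10, pady=5)

    def on_add():
        try:
            listbox.insert(tk.END, add_molecule(entry.get(), molecules))
            entry.delete(0, tk.END)
        except ValueError as error:
            messagebox.showerror("Error", str(error))

    def on_determine():
        try:
            check_mixture(molecules)
            result.config(text=determine(molecules))
        except Exception as error:  # show any failure instead of crashing
            messagebox.showerror("Error", str(error))

    tk.Button(root, text="Add molecule", command=on_add).pack(pady=2)
    tk.Button(root, text="Determine Chromatography",
              command=on_determine).pack(pady=2)
    return root
```

Calling `build_app(my_function).mainloop()` would open the window, where `my_function` is whatever routine produces the recommendation text.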


### 2. Data Collection and Preparation ###
<p style="text-align: justify;">
A dataset containing various molecular properties (e.g., molecular weight, polarity, solubility) and corresponding successful chromatography types was compiled.
Data preprocessing included handling missing values such as the pKa values. Relevant features such as molecular weight, polarity index, boiling point, pKa and solubility were selected and added to a data frame. 
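One possible way to handle missing values before building the data frame is sketched below. It assumes each compound arrives as a dictionary of raw properties in which some keys may be absent or non-numeric; the helper names `to_float_or_none` and `clean_row` are hypothetical and are not taken from Chrfinder.py.

```python
# Columns that must be numeric for the later decision step (assumed list).
NUMERIC_COLUMNS = ("MolecularWeight", "XLogP", "pKa", "Boiling Point")


def to_float_or_none(value):
    """Convert a raw value to float, returning None when it is missing or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean_row(raw):
    """Return a copy of `raw` in which every numeric column is a float or None."""
    row = dict(raw)
    for column in NUMERIC_COLUMNS:
        row[column] = to_float_or_none(raw.get(column))
    return row
```

A list of rows cleaned this way can be passed to `pd.DataFrame(rows)`, where the `None` entries of numeric columns appear as `NaN`.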

First we import the functions from the Chrfinder.py file.

In [3]:
from Chrfinder import find_pka, find_boiling_point, get_df_properties

Then we can create a list of compounds, which is normally provided by the Tkinter interface.

In [4]:
mixture = ["caffeine", "Aspirin", "Maleic acid"]

Then we run and print the dataframe:

In [5]:
print(get_df_properties(mixture))

      CID MolecularFormula  MolecularWeight                     InChIKey  \
0    2519        C8H10N4O2           194.19  RYYVLZVUVIJVGH-UHFFFAOYSA-N   
1    2244           C9H8O4           180.16  BSYNRYMUTXBXSQ-UHFFFAOYSA-N   
2  444266           C4H4O4           116.07  VZCYOOQTPOCHFL-UPHRSURJSA-N   

                         IUPACName  XLogP    pKa  Boiling Point  
0  1,3,7-trimethylpurine-2,6-dione   -0.1  14.00         177.93  
1          2-acetyloxybenzoic acid    1.2   3.47         140.00  
2          (Z)-but-2-enedioic acid   -0.3   1.83         135.00  

### 3. Determination of the chromatography type ###
<p style="text-align: justify;">
A function has been created to determine the most suitable chromatography type based on the physicochemical properties of the molecules stored in a data frame. To do this, the following dependency has to be installed:

In [7]:
import pandas as pd

Pandas will create a data frame from a dictionary as follows:

In [8]:
data = {
    'Mixture': ['Caf', 'Ace', 'Asp'],
    'Boiling Point': [178, 332.7, 246],
    'XlogP': [-0.07, -1.33, 0.07],
    'pKa': [[14], [3.02], [1.08, 9.13]],
    'MolecularWeight': [194, 204, 133],
}

df = pd.DataFrame(data)
print(df)

  Mixture  Boiling Point  XlogP           pKa  MolecularWeight
0     Caf          178.0  -0.07          [14]              194
1     Ace          332.7  -1.33        [3.02]              204
2     Asp          246.0   0.07  [1.08, 9.13]              133


The Mixture column lists the names of all the molecules in the mixture. The properties used by the function are the boiling point in °C, the XlogP, the pKa and the molecular weight in g/mol.

The following example shows how the function is used:

In [None]:
# Assumption: det_chromato is defined in Chrfinder.py like the other functions
from Chrfinder.Chrfinder import det_chromato
Mixture_chromato_type, eluent_nature, proposed_pH = det_chromato(df)

The function goes through the data frame column by column and finds the maximum or the minimum of each physical or chemical property. Based on these values, it predicts the best chromatography to perform:

GC is recommended if the maximum **boiling temperature** in the mixture is below 250 °C, which ensures that the whole mixture is sufficiently volatile.

The **molecular weight** indicates whether steric exclusion chromatography is an option. It is suitable for heavy molecules, that is, for compounds in the mixture with a molar mass above 2000 g/mol.

The choice between high pressure liquid chromatography (HPLC) and ion chromatography (IC) is determined by the **XlogP**, which is an indicator of the hydrophilicity of the molecules. Negative **XlogP** values indicate that the molecule is likely to dissolve in water (it is hydrophilic), while positive values indicate that it is more likely to dissolve in organic solvents (it is hydrophobic). The more hydrophilic a molecule is, the more likely it is to ionize, in which case the suitable chromatography is IC. 

In liquid chromatography such as HPLC or IC, the pH of the eluent is an important parameter for obtaining a good separation. All the molecules have to be stable at this pH, otherwise they can degrade and the separation will be poor. To ensure the stability of the molecules, the function proposes a pH that is 2 units above the minimum **pKa** of the compounds in the mixture.

In [None]:
print(f"The advisable chromatography type is: {Mixture_chromato_type}")
print(f"Eluent nature: {eluent_nature}")
if proposed_pH is not None:
    print(f"Proposed pH for the eluent: {proposed_pH}")
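The rules described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the code of `det_chromato`: the function name `sketch_chromatography` is hypothetical, the order in which the rules are checked follows the order of the text, GC is only chosen when every boiling point is known, IC is chosen only when every known XlogP is negative, and the eluent nature is left out because the report does not state how it is derived. Rows are plain dictionaries using the same keys as the data frame above.

```python
def sketch_chromatography(rows):
    """Return (technique, proposed_pH) for a list of property dictionaries."""
    def known(key):
        # Collect the values of one property, skipping missing entries.
        return [row[key] for row in rows if row.get(key) is not None]

    boiling_points = known("Boiling Point")
    weights = known("MolecularWeight")
    xlogps = known("XlogP")
    pkas = [value for values in known("pKa") for value in values]

    # Rule 1: every compound boils below 250 °C -> gas chromatography.
    all_known = boiling_points and len(boiling_points) == len(rows)
    if all_known and max(boiling_points) < 250:
        return "GC", None
    # Rule 2: at least one compound above 2000 g/mol -> size exclusion.
    if weights and max(weights) > 2000:
        return "SEC", None
    # Rule 3: liquid chromatography, IC only if all XlogP are negative (assumed).
    technique = "IC" if xlogps and max(xlogps) < 0 else "HPLC"
    # Proposed eluent pH: 2 units above the lowest pKa of the mixture.
    proposed_ph = round(min(pkas) + 2, 2) if pkas else None
    return technique, proposed_ph


rows = [
    {"Mixture": "Caf", "Boiling Point": 178, "XlogP": -0.07, "pKa": [14], "MolecularWeight": 194},
    {"Mixture": "Ace", "Boiling Point": 332.7, "XlogP": -1.33, "pKa": [3.02], "MolecularWeight": 204},
    {"Mixture": "Asp", "Boiling Point": 246, "XlogP": 0.07, "pKa": [1.08, 9.13], "MolecularWeight": 133},
]
print(sketch_chromatography(rows))
```

With the example mixture, the highest boiling point (332.7 °C) rules out GC, no compound exceeds 2000 g/mol, and one XlogP is positive, so this sketch falls back to HPLC with a proposed pH of 1.08 + 2 = 3.08.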

### 4. Explanation of each part of get_df_properties(mixture) (c.f. Methodology.2.) ###
#### Getting 'CID', 'MolecularFormula', 'MolecularWeight', 'InChIKey', 'IUPACName', 'XLogP' using PubChem requests 
(only missing Boiling Point and pka)

This part of the code uses pubchempy and the code of *Maxim Shevelev* (pubchemprops.py) to easily find most of the properties. It takes a list of compound names as its argument and returns, for each compound, a dictionary with the following properties: 'CID', 'MolecularFormula', 'MolecularWeight', 'InChIKey', 'IUPACName', 'XLogP'.

The Tkinter interface returns the compounds as a list of strings. The first part of the code iterates through the whole list (compound_list) and URL-encodes every compound name (string). It then searches for the properties on PubChem using pubchempy and adds them to a dictionary.

In [None]:
from Chrfinder.Chrfinder import get_df_properties

compound_list = ["caffeine", "Aspartame", "Acesulfame K"]
# Uncomment the next line to try a list containing a wrong name
#compound_list = ["Water", "Acetone", "Wrong name"]

get_df_properties(compound_list, verbose = True)
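The URL-encoding step mentioned above can be illustrated with the standard library alone. The sketch below builds a request URL for PubChem's PUG REST property endpoint; it is not the project's code, which relies on pubchempy, and the helper name `property_url` is hypothetical. No request is sent.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PROPERTIES = ["MolecularFormula", "MolecularWeight", "InChIKey", "IUPACName", "XLogP"]


def property_url(compound_name):
    """Return the PUG REST URL that requests PROPERTIES for one compound name."""
    # safe="" also encodes "/" so that the name cannot break the URL path.
    encoded = quote(compound_name.strip(), safe="")
    return f"{PUG_REST}/compound/name/{encoded}/property/{','.join(PROPERTIES)}/JSON"


for name in ["caffeine", "Aspartame", "Acesulfame K"]:
    print(property_url(name))
```

The space in "Acesulfame K" becomes `%20`, which is why names are encoded before being placed in a URL.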

#### <ins> Finding pKa using PubChem from the InChIKey string</ins>

Using the InChIKey string found by the previous function, the following function **returns the first pKa found on PubChem as a string**. Like the boiling temperature, this value is much harder to find than the other properties. The script uses a file by *Khoi Van* named **pka_lookup.py**, which requests the needed dictionary of strings from PubChem. This means it takes **quite a while to find the string**; building a database to speed this up is a planned improvement. 

From the string found, this code extracts the pKa value from the dictionary and returns it as a string, which is then converted to a float in the function of Chrfinder.py.

In [None]:
from Chrfinder.Chrfinder import find_pka

#inchikey of caffeine
inchikey_string = 'RYYVLZVUVIJVGH-UHFFFAOYSA-N'

find_pka(inchikey_string, verbose=True)

#### <ins> Finding Boiling Temperature using PubChem from the name</ins>

Using the names in compound_list, the following function **returns the mean of the Celsius and Fahrenheit boiling temperatures found on PubChem as a float**. Like the pKa, this value is much harder to find than the other properties. The script uses a file by *Maxim Shevelev* named **pubchemprops.py**, which requests the needed dictionary of strings from PubChem. This means it takes **quite a while to find the string**; building a database to speed this up is a planned improvement. 

From the string_value in text_dict (extracted from PubChem), this code extracts all the boiling points from the dictionary and returns their mean after converting Fahrenheit values to Celsius. The output is a float rounded to 2 decimals.

In [None]:
from Chrfinder.Chrfinder import find_boiling_point

mixture = "Caffeine"
find_boiling_point(mixture, verbose= True)
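The extraction and averaging step can be sketched as follows. This is an illustration, not the code of `find_boiling_point`: the helper name `mean_boiling_point`, the regular expression and the example strings are assumptions about what PubChem text entries may look like.

```python
import re
from statistics import mean

# A number followed by °C or °F; the look-behind keeps the hyphen of a
# range such as "100-120 °C" from being read as a minus sign.
_TEMPERATURE = re.compile(r"(?<!\d)(-?\d+(?:\.\d+)?)\s*°\s*([CF])")


def mean_boiling_point(strings):
    """Return the mean boiling point in °C (2 decimals), or None if none is found."""
    celsius = []
    for text in strings:
        for value, unit in _TEMPERATURE.findall(text):
            temperature = float(value)
            if unit == "F":
                # Convert Fahrenheit to Celsius before averaging.
                temperature = (temperature - 32) * 5 / 9
            celsius.append(temperature)
    return round(mean(celsius), 2) if celsius else None


print(mean_boiling_point(["178 °C", "352.4 °F (sublimes)"]))
```

Here 352.4 °F converts to 178 °C, so both entries agree and the mean is 178.0. Returning `None` when nothing matches lets the caller decide how to treat a compound without boiling-point data.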

## **Results** ##
<p style="text-align: justify;">
The final model demonstrated high accuracy in recommending the appropriate chromatography technique based on the molecular properties of the mixture. Key results include:

**Feature Importance:** 
<p style="text-align: justify;">
Polarity index, molecular weight, and solubility were identified as the most influential features in determining the suitable chromatography type.

**User Feedback:**
<p style="text-align: justify;">
The interactive nature of the user interface allowed for seamless user input and clear presentation of results, enhancing the user experience.
    
When an error occurs, a message appears to inform the user of the problem encountered. An example of this feature is shown below:

<div style="text-align: center; padding: 20px;">
    <img src="../assets/edited-error-molecule-not-added.gif" alt="errornomolecule" style="width: 400px; height: auto;">
</div>

  

## **Limits** ##

The example below shows a limit of the program. The chromatography suggested by the interface is not correct, because the boiling point is not available on PubChem, which leads to an error in the determination of the chromatography type. Hence, a lack of data limits our project.

<div style="text-align: center; padding: 20px;">
    <img src="../assets/edited-limit.gif" alt="limit" style="width: 400px; height: auto;">
</div>

##### Taken into account: 
- **Wrong names** and **compounds with no page on PubChem** are handled: the program returns '"The_compound_name" not found on PubChem'.
- The code works even when the boiling point or the pKa is missing from the data frame.

##### To improve: 
- **Build a database**: mostly with thermostability data, for a better choice of chromatography;
- Take into account **multiple pKa** values, for polyacids for example;
- Optimize the search: look up the **same name** only once, and do not run a search when only **one name** is given;
- **Easier usage and more options** for the functionalities that search physicochemical properties and store them in a data frame.
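The "same name only once" improvement could be approached with a cache. The sketch below is one possible design, not part of the project: `slow_lookup` is a hypothetical stand-in for a PubChem request, and names are normalised to lower case on the assumption that PubChem name searches are case-insensitive.

```python
from functools import lru_cache

calls = []  # records which names actually triggered a lookup


def slow_lookup(name):
    """Hypothetical stand-in for a slow PubChem request."""
    calls.append(name)
    return ("properties of", name)


@lru_cache(maxsize=None)
def _cached_lookup(normalised_name):
    # Runs once per distinct normalised name; later calls reuse the result.
    return slow_lookup(normalised_name)


def lookup(name):
    """Look up a compound, reusing the cached result for repeated names."""
    return _cached_lookup(name.strip().lower())


lookup("Caffeine")
lookup("caffeine ")
print(calls)
```

Only the first call reaches `slow_lookup`; the second is served from the cache. The cached value is a tuple so that callers cannot modify a shared result by accident.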



## **Encountered difficulties** ##

Many challenges were faced when making requests to PubChem:
- Handling of **errors concerning PubChem**: no page found, wrong compound name, etc.
  
- How to **extract strings using regex**, used for the pKa and the boiling temperature (°C and °F).
  
- Handling of **pKa entries wrongly written in PubChem** (the number is sometimes stored in the value and sometimes in the key of the dictionary) <br>
    {"pka = 20.0", } instead of {'pKa': '14'}
  
- Testing the PubChem requests.
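The two pKa layouts mentioned in the list above can both be handled with a small regex-based parser. This is a sketch under assumptions, not the project's implementation: `extract_pka` is a hypothetical name, and only the two layouts shown in the example are covered.

```python
import re

# "pKa" label followed by a number, e.g. "pka = 20.0" or "pKa: 3.5".
_PKA_WITH_LABEL = re.compile(r"pka\s*[=:]?\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)
_BARE_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_pka(entry):
    """Return the first pKa found in `entry` as a float, or None."""
    if isinstance(entry, str):
        entry = [entry]
    if isinstance(entry, dict):
        pairs = list(entry.items())
    else:
        # Sets or lists of strings: treat each string as a key with no value.
        pairs = [(text, "") for text in entry]
    for key, value in pairs:
        key, value = str(key), str(value)
        # Malformed layout: label and number written together in one string.
        labelled = _PKA_WITH_LABEL.search(key) or _PKA_WITH_LABEL.search(value)
        if labelled:
            return float(labelled.group(1))
        # Expected layout: the key names the property, the value holds the number.
        if "pka" in key.lower():
            bare = _BARE_NUMBER.search(value)
            if bare:
                return float(bare.group())
    return None


print(extract_pka({"pKa": "14"}))
print(extract_pka({"pka = 20.0"}))
```

Trying the labelled pattern first means a string such as "pka = 20.0" is parsed the same way whether it appears as a key, a value, or a set element.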
   
## **Conclusion** ##
<p style="text-align: justify;">
The project successfully developed a robust program for recommending chromatography techniques based on molecular composition. The use of a user interface facilitated an efficient and interactive development process, enabling easy experimentation and visualization. This tool has significant potential applications in chemical analysis, improving efficiency and accuracy in selecting the appropriate analytical methods.

## **Future Work** ##

Future enhancements could include:

- Expanding the dataset to cover more diverse molecular structures and additional chromatography techniques.
- Improving the code so that the information on the molecules in the mixture is retrieved more efficiently and the results are obtained faster.
- Integrating the program with real-time data acquisition systems for automated analysis.
- Enhancing the user interface with more sophisticated visualization tools and user-friendly features.

With continued refinement and expansion, this tool can become an invaluable resource for chemists and researchers in analytical laboratories.

This report provides an overview of the programming project conducted using Python, detailing the development process, results, and future directions for the chromatography recommendation program.
