<a href="https://colab.research.google.com/github/valsson-group/UNT-ChemicalApplicationsOfMachineLearning-Spring2026/blob/main/Lecture-3_January-20-2026/Lecture-3_January-20-2026_pandas_and_RDKit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lecture 3 - January 20, 2026






In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.mixture import GaussianMixture

plt.rcParams['figure.dpi'] = 100

## pandas and RDKit

### pandas
pandas is a popular Python library for data analysis that is especially useful for tabular data. The data is stored in a pandas DataFrame object, which is similar to a NumPy array but more flexible in the data it can hold; for example, it can store strings and other Python types.

- [pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html)
- [pandas getting started tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
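
As a minimal sketch of this flexibility, the following builds a small DataFrame by hand that mixes text and numbers. The three compounds, the approximate melting points, and the column names are chosen only for illustration and are not taken from the course dataset.

```python
import pandas as pd

# a small hand-made table; column names are illustrative
df = pd.DataFrame(
    {
        "name": ["water", "benzene", "ethanol"],
        "smiles": ["O", "c1ccccc1", "CCO"],
        "mpC": [0.0, 5.5, -114.1],
    }
)

# each column has its own data type: two text columns and one float column
print(df.dtypes)

# numeric columns support the usual statistics
print(df["mpC"].min(), df["mpC"].max())
```

A NumPy array would have to fall back to a generic object type to hold the same mixed content, which is why a DataFrame is the more natural container for this kind of table.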

### RDKit
RDKit is a powerful cheminformatics library that we will use extensively throughout the course.

- [Getting Started with the RDKit in Python](https://www.rdkit.org/docs/GettingStartedInPython.html)

### Useful tutorials
[Pat Walters](https://github.com/PatWalters/practical_cheminformatics_tutorials?tab=readme-ov-file) has two excellent tutorial Jupyter notebooks on pandas and RDKit:
- [A Quick Overview of Pandas for Cheminformatics](https://colab.research.google.com/github/PatWalters/practical_cheminformatics_tutorials/blob/main/fundamentals/pandas_intro.ipynb)
- [A Whirlwind Introduction to the RDKit for Cheminformatics](https://colab.research.google.com/github/PatWalters/practical_cheminformatics_tutorials/blob/main/fundamentals/A_Whirlwind_Introduction_To_The_RDKit.ipynb)

- [MolSSI - Introduction to RDKit](https://education.molssi.org/python-data-science-chemistry/rdkit_descriptors/rdkit.html)

#### Dataset

Here, we will consider the Bradley Melting Point Dataset, which is a curated chemical dataset with melting points of around 3,000 chemical compounds; see [here](https://www.kaggle.com/datasets/aliffaagnur/melting-point-chemical-dataset/data).

This dataset is stored in a comma-separated values (CSV) file, which is a common format for storing tabular data in text files. We load it into a pandas DataFrame using the `read_csv` function.



In [None]:
%%capture
# download the dataset (the %%capture cell magic must be the first line of the cell)
!wget https://raw.githubusercontent.com/valsson-group/UNT-ChemicalApplicationsOfMachineLearning-Spring2026/refs/heads/main/Lecture-3_January-20-2026/BradleyDoublePlusGoodMeltingPointDataset.csv


In [None]:
import pandas as pd

In [None]:
data_mp = pd.read_csv("BradleyDoublePlusGoodMeltingPointDataset.csv")

In [None]:
data_mp

pandas reads the header of the CSV file and uses it to define a key for each column, which we can then use to reference that column.

You can obtain the keys in the DataFrame by using `data_mp.keys()`.

In [None]:
print(data_mp.keys())
# or
print(" ")
print(list(data_mp.keys()))

In [None]:
data_mp['mpC']

You can plot the data using the `.plot()` method and make a histogram using the `.hist()` method.

In [None]:
data_mp['mpC'].plot()
plt.show()
# this is identical to
plt.plot(data_mp['mpC'])
plt.show()



In [None]:
data_mp['mpC'].hist(bins=40)
plt.xlabel("Melting Point in Celsius")
plt.ylabel("Count")  # the histogram shows counts per bin, not a normalized density
plt.show()
# this is equivalent to (apart from the grid that pandas draws by default)
plt.hist(data_mp['mpC'], bins=40)
plt.xlabel("Melting Point in Celsius")
plt.ylabel("Count")
plt.show()
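
By default, a histogram shows the number of samples in each bin. The sketch below uses `np.histogram` on synthetic numbers (a stand-in for the melting point column, not the real data) to contrast raw counts with the normalized form obtained with `density=True`. The same keyword can be passed to `plt.hist` and to the pandas `.hist()` method.

```python
import numpy as np

# synthetic stand-in for a column of melting points (not the real dataset)
rng = np.random.default_rng(0)
values = rng.normal(loc=80.0, scale=60.0, size=3000)

# default: number of samples in each bin
counts, edges = np.histogram(values, bins=40)

# density=True: normalized so that the area under the histogram is 1
density, _ = np.histogram(values, bins=40, density=True)
bin_widths = np.diff(edges)

print(counts.sum())                  # total number of samples
print((density * bin_widths).sum())  # total area, 1 up to rounding
```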

One of the columns contains the compounds' [SMILES strings](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System). SMILES is a line notation that represents the molecular structure of a chemical compound in a text format.

- [An Introduction to the Simplified Molecular Input Line Entry System (SMILES)](https://colab.research.google.com/github/PatWalters/practical_cheminformatics_tutorials/blob/main/fundamentals/SMILES_tutorial.ipynb#scrollTo=infectious-smell)

In [None]:
# note the double brackets, here we are passing a list of keys to show two columns
data_mp[ ['name','smiles'] ]
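
The difference between single and double brackets can be sketched with a small hand-made DataFrame (the rows below are illustrative, not taken from the dataset): a single key returns a pandas Series, while a list of keys returns a DataFrame, even if the list holds only one key.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["water", "benzene", "ethanol"],
        "smiles": ["O", "c1ccccc1", "CCO"],
        "mpC": [0.0, 5.5, -114.1],
    }
)

one_column = df["smiles"]             # single key -> pandas Series
two_columns = df[["name", "smiles"]]  # list of keys -> pandas DataFrame
also_a_frame = df[["smiles"]]         # a list with one key still gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(also_a_frame).__name__, also_a_frame.shape)
```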

We can use RDKit to work with SMILES strings.

First we need to install RDKit into our Google Colab instance, as it is not installed by default. You will need to do this every time you start a new session, but it takes only a short time.

In [None]:
%%capture
# the %%capture cell magic suppresses output to the screen (it must be the first line of the cell)
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !pip install rdkit

In [None]:
from rdkit import Chem

In [None]:
idx=10
name = data_mp['name'][idx]
smi = data_mp['smiles'][idx]
print(f"{name:s}: {smi:s}")
mol = Chem.MolFromSmiles(smi)
mol

You can convert a column into a NumPy array using the `.to_numpy()` method and into a list using the `.to_list()` method.

In [None]:
melting_point_c = data_mp['mpC'].to_numpy()
names = data_mp['name'].to_list()
smiles = data_mp['smiles'].to_list()

Let's calculate some descriptors/features for the molecules and see how they correlate with the melting points.

We start with some very naive, simplistic features.

In [None]:
# here we use list comprehension that allows us
# to write very compact code when working with lists.
names_length = [len(s) for s in names]
smiles_length = [len(s) for s in smiles]
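
For comparison, here is a short sketch showing that a list comprehension is a compact form of an explicit loop. The three names are made up for illustration.

```python
names = ["water", "benzene", "ethanol"]

# explicit loop
lengths_loop = []
for s in names:
    lengths_loop.append(len(s))

# equivalent list comprehension
lengths_comprehension = [len(s) for s in names]

# a comprehension can also filter with an if clause
long_names = [s for s in names if len(s) > 5]

print(lengths_loop)
print(lengths_comprehension)
print(long_names)
```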

In [None]:
plt.plot(names_length,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Length of compound name")
plt.show()

plt.plot(names_length,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Length of compound name")
plt.xlim([0,50])
plt.show()


In [None]:
plt.plot(smiles_length,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Length of smiles string")
plt.show()

plt.plot(smiles_length,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Length of smiles string")
plt.xlim([0,60])
plt.show()

In [None]:
# .lower() is a string method that converts a string to lower case
# .count() is a string method that counts the occurrences of a character/substring in a string
# here we chain the two methods
# note: this naive count also picks up other element symbols, e.g., the 'C' in 'Cl'
number_of_carbons = [s.lower().count('c') for s in smiles]
number_of_oxygens = [s.lower().count('o') for s in smiles]
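
Counting characters is only a rough proxy for counting atoms. As a small illustration, using chloroform as a hand-picked example, the naive count treats every chlorine as a carbon. The corrected variant below is still only a string heuristic, shown under the assumption that chlorine is the only problematic symbol; counting atoms on a parsed RDKit molecule is the more reliable route.

```python
# chloroform has one carbon and three chlorine atoms
smi = "ClC(Cl)Cl"

naive_count = smi.lower().count('c')

# slightly less naive: remove 'Cl' first, then count aliphatic 'C' and aromatic 'c'
# (still a heuristic; other symbols containing c/C, such as Ca or Sc, are not handled)
stripped = smi.replace("Cl", "")
better_count = stripped.count('C') + stripped.count('c')

print(naive_count)   # 4
print(better_count)  # 1
```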

In [None]:
plt.plot(number_of_carbons,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Number of Carbon atoms")
plt.show()

plt.plot(number_of_carbons,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Number of Carbon atoms")
plt.xlim([0,25])
plt.show()

In [None]:
plt.plot(number_of_oxygens,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Number of Oxygen atoms")
plt.show()


We can use RDKit to calculate various molecular descriptors/features.

- [Descriptor calculation tutorial](https://greglandrum.github.io/rdkit-blog/posts/2022-12-23-descriptor-tutorial.html)
- [rdkit.Chem.Descriptors module](https://www.rdkit.org/docs/source/rdkit.Chem.Descriptors.html)
- [rdkit.Chem.rdMolDescriptors module](https://www.rdkit.org/docs/source/rdkit.Chem.rdMolDescriptors.html)

In [None]:
from rdkit.Chem import Descriptors, rdMolDescriptors


There is a wide range of descriptors/features available.

In [None]:
print("Descriptors.__")
for des in Descriptors._descList:
  print("-", des[0])

In [None]:
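# this will not work correctly if any SMILES string cannot be parsed:
# Chem.MolFromSmiles then returns None, which Descriptors.MolWt cannot handle
molecular_weight = [Descriptors.MolWt(Chem.MolFromSmiles(smi)) for smi in smiles]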
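
The underlying behavior can be checked in isolation. The sketch below assumes RDKit is installed and uses a made-up invalid string: `Chem.MolFromSmiles` does not raise an exception for input it cannot parse, but logs an error message and returns `None`.

```python
from rdkit import Chem

good = Chem.MolFromSmiles("CCO")                # ethanol, a valid SMILES string
bad = Chem.MolFromSmiles("this_is_not_smiles")  # made-up invalid input

print(good is None)  # False
print(bad is None)   # True
```

Any code that processes a whole list of SMILES strings therefore needs to check for `None` before computing descriptors.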

In [None]:
# We start with molecular weight

# create a list of placeholders, one entry per molecule
molecular_weight = [None]*len(smiles)
for i, smi in enumerate(smiles):
  mol = Chem.MolFromSmiles(smi)
  if mol is not None:
    molecular_weight[i] = Descriptors.MolWt(mol)
  else:
    molecular_weight[i] = np.nan
# convert list to numpy array
molecular_weight = np.array(molecular_weight)
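
Marking failed molecules with `np.nan` keeps the array numeric, but NaN values propagate through ordinary statistics. A minimal sketch with three illustrative values, one of them missing:

```python
import numpy as np

# illustrative values with one missing entry
weights = np.array([18.015, np.nan, 46.069])

print(np.mean(weights))     # nan: the missing value propagates
print(np.nanmean(weights))  # mean over the valid entries only

# boolean mask that selects the valid entries
valid = ~np.isnan(weights)
print(valid.sum(), "of", len(weights), "entries are valid")
```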



In [None]:
plt.plot(molecular_weight,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Molecular weight [Dalton]")
plt.show()

plt.plot(molecular_weight,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Molecular weight [Dalton]")
plt.xlim([0,400])
plt.show()
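
A scatter plot gives a visual impression of a correlation; one way to put a number on it is the Pearson correlation coefficient. The sketch below uses synthetic data with an arbitrary slope and noise level (not the real dataset) and masks out NaN entries first. In the notebook, the arrays `x` and `y` would correspond to `molecular_weight` and `melting_point_c`.

```python
import numpy as np

# synthetic stand-in data (not the real dataset): a noisy linear trend
rng = np.random.default_rng(0)
x = rng.uniform(50.0, 400.0, size=500)
y = 0.5 * x + rng.normal(scale=40.0, size=500)
x[::50] = np.nan  # pretend some molecules could not be parsed

# keep only the pairs where both values are available
mask = ~np.isnan(x) & ~np.isnan(y)
r = np.corrcoef(x[mask], y[mask])[0, 1]
print(f"Pearson r = {r:.2f} from {mask.sum()} valid pairs")
```

Without the mask, a single NaN would make the result NaN as well.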

In [None]:
# we can also write functions to simplify this:
def calc_number_of_rings(smi):
  mol = Chem.MolFromSmiles(smi)
  if mol is not None:
    return rdMolDescriptors.CalcNumRings(mol)
  else:
    return np.nan

def calc_number_of_rotatable_bonds(smi):
  mol = Chem.MolFromSmiles(smi)
  if mol is not None:
    return rdMolDescriptors.CalcNumRotatableBonds(mol)
  else:
    return np.nan



In [None]:
# the lists get different names than the functions, so that the
# functions are not overwritten and this cell can be rerun
number_of_rings = [calc_number_of_rings(smi) for smi in smiles]

number_of_rotatable_bonds = [calc_number_of_rotatable_bonds(smi) for smi in smiles]

In [None]:
plt.plot(number_of_rings,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Number of rings")
plt.show()

In [None]:
plt.plot(number_of_rotatable_bonds,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Number of rotatable bonds")
plt.show()

plt.plot(number_of_rotatable_bonds,melting_point_c,'.')
plt.ylabel("Melting Point in Celsius")
plt.xlabel("Number of rotatable bonds")
plt.xlim([0,20])
plt.show()

We can also create a new pandas DataFrame and save it as a CSV file.

In [None]:
data = list(zip(names, smiles, melting_point_c, molecular_weight))
data_new = pd.DataFrame.from_records(data, columns=["name", "smiles", "melting_point_c","molecular_weight"])

In [None]:
data_new
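
New columns can also be added directly to an existing DataFrame. The following sketch uses a small hand-made table with illustrative rows and shows two common routes: a vectorized string method and `.apply()` with an ordinary Python function.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["water", "benzene", "ethanol"],
        "smiles": ["O", "c1ccccc1", "CCO"],
    }
)

# vectorized string method: length of each SMILES string
df["smiles_length"] = df["smiles"].str.len()

# .apply() calls an arbitrary Python function on every entry
df["name_length"] = df["name"].apply(len)

print(df)
```

The same `.apply()` pattern works with any function that takes a single SMILES string and returns a number.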

In [None]:
data_new.to_csv("test.csv")
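
By default, `to_csv` also writes the row index as an extra, unnamed first column. The sketch below uses a tiny illustrative table, with `io.StringIO` standing in for a file on disk, to show the effect of `index=False` when the data is read back.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["water", "ethanol"], "mpC": [0.0, -114.1]})

# default: the row index is written as an extra, unnamed first column
with_index = df.to_csv()
# index=False: only the data columns are written
without_index = df.to_csv(index=False)

print(with_index)
print(without_index)

# reading the text back (io.StringIO stands in for a file on disk)
columns_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
columns_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)
print(columns_with_index)
print(columns_without_index)
```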

## Example of using inf

In [None]:
def find_min_max_value(array):
  min_value = +np.inf
  max_value = -np.inf
  min_idx = None
  max_idx = None
  for idx, value in enumerate(array):
    if value < min_value:
      min_value = value
      min_idx = idx
    if value > max_value:
      max_value = value
      max_idx = idx
  return min_value, max_value, min_idx, max_idx


In [None]:
rng = np.random.default_rng()

test = rng.normal(loc=0,scale=10,size=1000)

print(find_min_max_value(test))
print(test.min())
print(test.argmin())
print(test.max())
print(test.argmax())
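
Two edge cases are worth a quick look, sketched here with small made-up inputs and a copy of the function from above. Comparisons with NaN are always False, so the loop skips NaN entries, whereas NumPy's `.min()` propagates them. For an empty input, the initial `inf` values and `None` indices are returned unchanged.

```python
import numpy as np

def find_min_max_value(array):
  min_value = +np.inf
  max_value = -np.inf
  min_idx = None
  max_idx = None
  for idx, value in enumerate(array):
    if value < min_value:
      min_value = value
      min_idx = idx
    if value > max_value:
      max_value = value
      max_idx = idx
  return min_value, max_value, min_idx, max_idx

data = np.array([3.0, np.nan, -2.0, 7.5])

# the loop skips the NaN, since comparisons with NaN are always False
print(find_min_max_value(data))

# NumPy's min propagates the NaN; nanmin and nanmax ignore it
print(data.min(), np.nanmin(data), np.nanmax(data))

# for an empty input the initial values survive
print(find_min_max_value([]))
```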
