# Capturing greenhouse gases with data

## Data Wrangling

### by Zachary Brown

The original goal of this project was to merge two MOF databases to determine which chemical properties increase the CO2 capacity of a metal-organic framework (MOF). Those two databases shared only 30 entries with matching MOF identifiers, so instead I'll use the [ARC MOF database](https://zenodo.org/record/7600474#.Y_ofvXbMKM8), which contains over 200,000 theoretical MOFs and includes both chemical properties and gas adsorption predictions.

Some key terms used throughout this dataset and project include RDF, radial distribution functions (calculated for electronegativity, atomic hardness, van der Waals volume, dipole polarizability, atomic mass, and none), and RAC, revised autocorrelations (calculated for electronegativity, nuclear charge, atom identity, connectivity, and covalent radii).

First we'll install the necessary libraries and import them.

In [1]:
!pip install numpy==1.24.2
!pip install pandas==1.5.3
!pip install requests==2.28.2
!pip install matplotlib==3.7.0



In [2]:
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt

Now I'll start by downloading the topology dataset, which describes the geometric topology of the MOFs.

In [3]:
url = 'https://zenodo.org/record/7600474/files/all_topology_lists.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/topology.csv', 'wb').write(r.content)

11683716
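The download cell above keeps the whole response in memory (`r.content`) before writing it, which matters for the larger files fetched later in this notebook. Here is a minimal sketch of a streamed download that also skips files already on disk. `download_once` is a hypothetical helper name of my own, and the chunk size and timeout are arbitrary assumptions.

```python
from pathlib import Path


def download_once(url, dest, chunk_size=1 << 20, timeout=60):
    """Stream url to dest unless dest already exists; return the file size in bytes."""
    dest = Path(dest)
    if dest.exists():
        # Already downloaded: skip the network entirely.
        return dest.stat().st_size

    import requests  # imported lazily so cached runs never touch the network stack

    dest.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, allow_redirects=True, timeout=timeout) as r:
        r.raise_for_status()  # fail loudly instead of saving an error page as CSV
        with open(dest, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return dest.stat().st_size
```

A call such as `download_once(url, '../data/raw/topology.csv')` could stand in for the `requests.get` and `open(...).write(...)` pair in each download cell.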

In [4]:
top = pd.read_csv('../data/raw/topology.csv')
top.head()



Unnamed: 0,Name,filename,Crystalnet,likely topology
0,DB0-m12_o10_bcu.cif,bcu,bcu,bcu
1,DB0-m12_o12_bcu.cif,bcu,bcu,bcu
2,DB0-m12_o13_bcu.cif,bcu,bcu,bcu
3,DB0-m12_o14_bcu.cif,bcu,bcu,bcu
4,DB0-m12_o14_o22_f0_bcu.cif,bcu,bcu,bcu


In [5]:
top.shape

(264225, 4)

To join this dataframe with future ones I'll need to set the 'Name' column as the index, so I'll do that and then download the geometry dataset which has geometric properties of the MOFs.

In [6]:
top.set_index('Name', inplace=True)

In [7]:
url = 'https://zenodo.org/record/7600474/files/geometric_properties.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/geom.csv', 'wb').write(r.content)

110395714

In [8]:
geo = pd.read_csv('../data/raw/geom.csv')
geo.head()

Unnamed: 0.1,Unnamed: 0,filename,UC_volume,Density,ASA,vASA,gASA,NASA,gNASA,vNASA,...,NPOAVA,NPOAVAf,NPOAVAg,Di,Df,Dif,ARC-MOF,DB_num,order_geo,bool_geo
0,0,DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.23322,87.4832,970.108,786.644,0.0,0.0,0.0,...,0.0,0.0,0.0,5.41813,4.36524,5.39798,True,DB0,0,True
1,1,DB0-m28_o161_o113_f0_pts.cif,8183.19,0.389995,1749.55,2137.97,5482.05,0.0,0.0,0.0,...,0.0,0.0,0.0,16.83322,15.07954,16.80076,False,DB0,1,True
2,2,DB1-Zn2O8N2-irmof20_A-irmof8_A_No13.cif,3853.14,0.652434,824.502,2139.82,3279.75,0.0,0.0,0.0,...,0.0,0.0,0.0,11.24255,9.36124,11.24255,False,DB1,2,True
3,3,DB1-Zn4O13-BDC_A-irmof6_A_No267.cif,16975.8,0.815191,3234.86,1905.57,2337.57,0.0,0.0,0.0,...,0.0,0.0,0.0,14.9643,6.83319,14.95745,False,DB1,3,True
4,4,DB0-m15_o27_aww.cif,236848.0,0.12761,17612.1,743.601,5827.13,0.0,0.0,0.0,...,0.0,0.0,0.0,48.43682,38.41622,48.43682,False,DB0,4,True


In [9]:
geo.shape

(521316, 29)

In [10]:
geo.columns

Index(['Unnamed: 0', 'filename', 'UC_volume', 'Density', 'ASA', 'vASA', 'gASA',
       'NASA', 'gNASA', 'vNASA', 'AVA', 'AVAf', 'AVAg', 'NAVA', 'NAVAf',
       'NAVAg', 'POAVA', 'POAVAf', 'POAVAg', 'NPOAVA', 'NPOAVAf', 'NPOAVAg',
       'Di', 'Df', 'Dif', 'ARC-MOF', 'DB_num', 'order_geo', 'bool_geo'],
      dtype='object')
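The 'Unnamed: 0' column is what pandas reports when the first header cell of a CSV is empty, which typically happens when a file was saved together with its row index. As a small sketch on a toy CSV (the text below is invented, not taken from the ARC-MOF files), passing `index_col=0` reads that column as the index instead of leaving it to be dropped later.

```python
import io

import pandas as pd

# Toy CSV standing in for geometric_properties.csv; the first header cell is empty.
csv_text = ",filename,Density\n0,a.cif,1.2\n1,b.cif,0.4\n"

default = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

default_cols = default.columns.tolist()   # includes 'Unnamed: 0'
indexed_cols = indexed.columns.tolist()   # only the real data columns
print(default_cols)
print(indexed_cols)
```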

These column headers aren't particularly insightful, so I'm going to reference the journal article to rename these to something more useful.

In [11]:
geo.rename(columns={'UC_volume':'unit_cell_volume', 'ASA':'accessible_surface_area', 'vASA':'volumetric_surface_area',\
 'gASA':'gravimetric_surface_area', 'NASA':'inaccessible_surface_area', 'gNASA':'inac_grav_surf_area',\
 'vNASA':'inac_vol_surf_area', 'AVA':'accessible_volume_per_uc', 'AVAf':'volume_fraction', 'AVAg':'grav_volume',\
 'NAVA':'inac_vol', 'NAVAf':'inac_vol_frac', 'NAVAg':'inac_grav_vol', 'POAVA':'probe_occupiable_vol',\
 'POAVAf':'probe_occ_vol_frac', 'POAVAg':'grav_probe_occ_vol', 'NPOAVA':'inac_probe_occ_vol',\
 'NPOAVAf':'inac_probe_occ_vol_frac', 'NPOAVAg':'inac_probe_occ_grav_vol', 'Di':'largest_cav_diameter',\
 'Df':'pore_limiting_diameter', 'Dif':'largest_free_sphere_path_diam'},
           inplace=True)

In [12]:
geo.columns

Index(['Unnamed: 0', 'filename', 'unit_cell_volume', 'Density',
       'accessible_surface_area', 'volumetric_surface_area',
       'gravimetric_surface_area', 'inaccessible_surface_area',
       'inac_grav_surf_area', 'inac_vol_surf_area', 'accessible_volume_per_uc',
       'volume_fraction', 'grav_volume', 'inac_vol', 'inac_vol_frac',
       'inac_grav_vol', 'probe_occupiable_vol', 'probe_occ_vol_frac',
       'grav_probe_occ_vol', 'inac_probe_occ_vol', 'inac_probe_occ_vol_frac',
       'inac_probe_occ_grav_vol', 'largest_cav_diameter',
       'pore_limiting_diameter', 'largest_free_sphere_path_diam', 'ARC-MOF',
       'DB_num', 'order_geo', 'bool_geo'],
      dtype='object')
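With a mapping this long, a mistyped key is easy to miss, because `rename` ignores unknown keys by default. A small sketch on a toy frame (the 'DI' typo is deliberate and purely illustrative) shows how `errors='raise'` surfaces the problem.

```python
import pandas as pd

df = pd.DataFrame({'UC_volume': [901.788], 'Di': [5.41813]})
# 'DI' is a deliberate typo for 'Di'.
mapping = {'UC_volume': 'unit_cell_volume', 'DI': 'largest_cav_diameter'}

# Default behaviour: the unknown key is silently ignored, so 'Di' keeps its old name.
silent = df.rename(columns=mapping)

# errors='raise' turns the same typo into a KeyError.
try:
    df.rename(columns=mapping, errors='raise')
    raised = False
except KeyError:
    raised = True

print(silent.columns.tolist(), raised)
```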

I don't care which database ARC-MOF drew these from, so I'm going to drop 'ARC-MOF' and 'DB_num'.

In [13]:
geo.drop(columns=['ARC-MOF', 'DB_num'], inplace=True)
geo.shape

(521316, 27)

Now I'll join the geometry and topology dataframes.

In [14]:
geo_top = geo.join(other = top, on = 'filename', how = 'inner', lsuffix='_geo', rsuffix='_top')

In [15]:
geo_top.shape

(263744, 31)
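The inner join returned 263,744 rows from 521,316 geometry rows and 264,225 topology rows, so some keys on each side found no partner. One way to see where the unmatched rows come from is an outer merge with `indicator=True`. The frames below are toy stand-ins, not the real tables.

```python
import pandas as pd

geo_toy = pd.DataFrame({'filename': ['a.cif', 'b.cif', 'c.cif'],
                        'Density': [1.2, 0.4, 0.7]})
top_toy = pd.DataFrame({'Name': ['a.cif', 'd.cif'],
                        'Crystalnet': ['pcu', 'pts']})

# how='outer' keeps every key; the '_merge' column records which side each row came from.
audit = geo_toy.merge(top_toy, left_on='filename', right_on='Name',
                      how='outer', indicator=True)
match_counts = audit['_merge'].astype(str).value_counts().to_dict()
print(match_counts)
```

Rows marked `both` are the ones an inner join keeps; `left_only` and `right_only` are the ones it discards.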

In [16]:
geo_top.head()

Unnamed: 0.1,filename,Unnamed: 0,filename_geo,unit_cell_volume,Density,accessible_surface_area,volumetric_surface_area,gravimetric_surface_area,inaccessible_surface_area,inac_grav_surf_area,...,inac_probe_occ_vol_frac,inac_probe_occ_grav_vol,largest_cav_diameter,pore_limiting_diameter,largest_free_sphere_path_diam,order_geo,bool_geo,filename_top,Crystalnet,likely topology
0,DB0-m2_o1_o10_f0_pcu.sym.66.cif,0,DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.23322,87.4832,970.108,786.644,0.0,0.0,...,0.0,0.0,5.41813,4.36524,5.39798,0,True,pcu,pcu,pcu
6,DB0-m3_o23_o23_f0_pcu.sym.74.cif,6,DB0-m3_o23_o23_f0_pcu.sym.74.cif,7545.84,0.537679,1566.33,2075.75,3860.57,0.0,0.0,...,0.0,0.0,10.43731,9.91429,10.43731,6,True,pcu,pcu,pcu
7,DB0-m2_o8_o25_f0_pcu.sym.91.cif,7,DB0-m2_o8_o25_f0_pcu.sym.91.cif,4172.23,0.371648,771.93,1850.16,4978.27,0.0,0.0,...,0.0,0.0,12.93441,11.01397,12.93441,7,True,pcu,pcu,pcu
8,DB0-m29_o82_o46_f0_pts.sym.1.cif,8,DB0-m29_o82_o46_f0_pts.sym.1.cif,1715.11,0.786327,378.905,2209.23,2809.55,0.0,0.0,...,0.0,0.0,8.35282,5.44658,7.30192,8,True,pts,pts,pts
10,DB0-m29_o99_o470_f0_pts.sym.128.cif,10,DB0-m29_o99_o470_f0_pts.sym.128.cif,2552.97,0.754924,419.589,1643.53,2177.08,0.164038,0.642539,...,0.0,0.0,7.57868,4.51994,7.57868,10,True,pts,pts,pts


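Let's clean it up a little and preview the table without the 'Unnamed: 0' and 'filename_geo' columns. Because the result of `drop` isn't assigned back (and `inplace=True` isn't set), `geo_top` itself still contains both columns; they are removed for good after the next join.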

In [17]:
geo_top.drop(columns=['Unnamed: 0', 'filename_geo'])

Unnamed: 0,filename,unit_cell_volume,Density,accessible_surface_area,volumetric_surface_area,gravimetric_surface_area,inaccessible_surface_area,inac_grav_surf_area,inac_vol_surf_area,accessible_volume_per_uc,...,inac_probe_occ_vol_frac,inac_probe_occ_grav_vol,largest_cav_diameter,pore_limiting_diameter,largest_free_sphere_path_diam,order_geo,bool_geo,filename_top,Crystalnet,likely topology
0,DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.233220,87.4832,970.108,786.644,0.000000,0.000000,0.000000,26.0256,...,0.0,0.0,5.41813,4.36524,5.39798,0,True,pcu,pcu,pcu
6,DB0-m3_o23_o23_f0_pcu.sym.74.cif,7545.840,0.537679,1566.3300,2075.750,3860.570,0.000000,0.000000,0.000000,2364.4100,...,0.0,0.0,10.43731,9.91429,10.43731,6,True,pcu,pcu,pcu
7,DB0-m2_o8_o25_f0_pcu.sym.91.cif,4172.230,0.371648,771.9300,1850.160,4978.270,0.000000,0.000000,0.000000,2102.0500,...,0.0,0.0,12.93441,11.01397,12.93441,7,True,pcu,pcu,pcu
8,DB0-m29_o82_o46_f0_pts.sym.1.cif,1715.110,0.786327,378.9050,2209.230,2809.550,0.000000,0.000000,0.000000,281.5860,...,0.0,0.0,8.35282,5.44658,7.30192,8,True,pts,pts,pts
10,DB0-m29_o99_o470_f0_pts.sym.128.cif,2552.970,0.754924,419.5890,1643.530,2177.080,0.164038,0.642539,0.851131,268.4700,...,0.0,0.0,7.57868,4.51994,7.57868,10,True,pts,pts,pts
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
521302,DB0-m2_o12_o16_f0_pcu.sym.10.cif,1358.680,0.754709,290.7150,2139.690,2835.120,0.000000,0.000000,0.000000,189.8350,...,0.0,0.0,7.52873,5.74171,7.44496,100001,False,pcu,pcu,pcu
521304,DB0-m3_o160_o480_f0_fsc.sym.50.cif,1243.540,0.972493,216.3000,1739.390,1788.590,0.000000,0.000000,0.000000,154.4720,...,0.0,0.0,7.86425,5.25972,7.85850,100001,False,fsc,fsc,fsc
521310,DB0-m3_o7_o15_f0_pcu.sym.26.cif,3245.820,0.460190,607.9870,1873.140,4070.370,0.000000,0.000000,0.000000,1383.1700,...,0.0,0.0,14.76229,10.91728,14.76229,100001,False,pcu,pcu,pcu
521311,DB0-m2_o9_o11_f0_nbo.sym.43.cif,5025.910,0.784130,842.4600,1676.240,2137.700,0.000000,0.000000,0.000000,799.8230,...,0.0,0.0,9.80754,4.61436,9.34008,100001,False,nbo,nbo,nbo
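As a quick illustration of why the call above leaves `geo_top` unchanged, here is the same pattern on a toy frame with made-up values: `drop` returns a new dataframe, and the change only sticks if the result is assigned or `inplace=True` is passed.

```python
import pandas as pd

df = pd.DataFrame({'Unnamed: 0': [0, 1],
                   'filename': ['a.cif', 'b.cif'],
                   'Density': [1.2, 0.4]})

preview = df.drop(columns=['Unnamed: 0'])   # new object; df is untouched
cols_before = df.columns.tolist()

df = df.drop(columns=['Unnamed: 0'])        # reassigning makes the change stick
cols_after = df.columns.tolist()

print(cols_before)
print(cols_after)
```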


Now that the two are merged I'll download the RDF dataset, which describes a wide range of chemical properties. 

In [18]:
url = 'https://zenodo.org/record/7600474/files/RDFs.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/rdf.csv', 'wb').write(r.content)

2129693042

In [19]:
rdf = pd.read_csv('../data/raw/rdf.csv')
rdf.head(10)

Unnamed: 0.1,Unnamed: 0,Structure_Name,RDF_electronegativity_2.000,RDF_electronegativity_2.004,RDF_electronegativity_2.013,RDF_electronegativity_2.027,RDF_electronegativity_2.044,RDF_electronegativity_2.066,RDF_electronegativity_2.093,RDF_electronegativity_2.124,...,RDF_none_25.700,RDF_none_26.161,RDF_none_26.625,RDF_none_27.094,RDF_none_27.568,RDF_none_28.046,RDF_none_28.528,RDF_none_29.015,RDF_none_29.506,RDF_none_30.001
0,0,DB0-m29_o97_o420_f0_pts.sym.57_repeat.cif,0.000818,0.000833,0.000862,0.000907,0.000969,0.001049,0.001149,0.001269,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,DB0-m3_o440_o13_f0_fsc.sym.76_repeat.cif,0.000856,0.000869,0.000895,0.000934,0.000989,0.00106,0.001148,0.001253,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,DB1-Cu2O8N2-irmof14_A-irmof7_A_No101_repeat.cif,0.000788,0.000797,0.000814,0.000843,0.000886,0.000946,0.001029,0.001141,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,DB0-m3_o96_o13_f0_fsc.sym.51_repeat.cif,0.001121,0.001136,0.001166,0.001211,0.001272,0.001347,0.001437,0.001539,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,DB0-m2_o11_o11_f0_nbo.sym.9_repeat.cif,0.000867,0.000876,0.000894,0.000921,0.000957,0.001004,0.00106,0.001126,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5,DB0-m3_o52_o6_f0_fsc.sym.57_repeat.cif,0.000762,0.000771,0.000789,0.000817,0.000856,0.000908,0.000974,0.001055,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,DB0-m9_o6_o25_f0_sra.sym.66_repeat.cif,0.000868,0.000878,0.000899,0.000931,0.000975,0.00103,0.001098,0.001178,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,7,DB0-m29_o82_o86_f0_pts.sym.38_repeat.cif,0.000809,0.000824,0.000853,0.000899,0.000962,0.001046,0.001151,0.001279,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,8,DB0-m2_o11_o23_f0_nbo.sym.138_repeat.cif,0.000978,0.000989,0.001011,0.001045,0.00109,0.001148,0.001218,0.0013,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,9,DB0-m2_o6_o11_f0_pcu.sym.15_repeat.cif,0.000728,0.000737,0.000754,0.000781,0.000818,0.000866,0.000925,0.000997,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
rdf.shape

(279609, 680)

In [21]:
rdf.columns

Index(['Unnamed: 0', 'Structure_Name', 'RDF_electronegativity_2.000',
       'RDF_electronegativity_2.004', 'RDF_electronegativity_2.013',
       'RDF_electronegativity_2.027', 'RDF_electronegativity_2.044',
       'RDF_electronegativity_2.066', 'RDF_electronegativity_2.093',
       'RDF_electronegativity_2.124',
       ...
       'RDF_none_25.700', 'RDF_none_26.161', 'RDF_none_26.625',
       'RDF_none_27.094', 'RDF_none_27.568', 'RDF_none_28.046',
       'RDF_none_28.528', 'RDF_none_29.015', 'RDF_none_29.506',
       'RDF_none_30.001'],
      dtype='object', length=680)

In [22]:
rdf.set_index('Structure_Name', inplace=True)

I noticed that the structure names in this dataframe include 'repeat' in the names, which wasn't included in the other tables. I'm going to confirm that it's in all of the names, then remove it so I can join this dataframe to the geo_top.

In [23]:
rdf[~rdf.index.str.contains('_repeat', regex=False)]

Unnamed: 0_level_0,Unnamed: 0,RDF_electronegativity_2.000,RDF_electronegativity_2.004,RDF_electronegativity_2.013,RDF_electronegativity_2.027,RDF_electronegativity_2.044,RDF_electronegativity_2.066,RDF_electronegativity_2.093,RDF_electronegativity_2.124,RDF_electronegativity_2.159,...,RDF_none_25.700,RDF_none_26.161,RDF_none_26.625,RDF_none_27.094,RDF_none_27.568,RDF_none_28.046,RDF_none_28.528,RDF_none_29.015,RDF_none_29.506,RDF_none_30.001
Structure_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1


In [24]:
rdf.index = rdf.index.str.replace('_repeat', '')

In [25]:
rdf.head()

Unnamed: 0_level_0,Unnamed: 0,RDF_electronegativity_2.000,RDF_electronegativity_2.004,RDF_electronegativity_2.013,RDF_electronegativity_2.027,RDF_electronegativity_2.044,RDF_electronegativity_2.066,RDF_electronegativity_2.093,RDF_electronegativity_2.124,RDF_electronegativity_2.159,...,RDF_none_25.700,RDF_none_26.161,RDF_none_26.625,RDF_none_27.094,RDF_none_27.568,RDF_none_28.046,RDF_none_28.528,RDF_none_29.015,RDF_none_29.506,RDF_none_30.001
Structure_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
DB0-m29_o97_o420_f0_pts.sym.57.cif,0,0.000818,0.000833,0.000862,0.000907,0.000969,0.001049,0.001149,0.001269,0.001406,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB0-m3_o440_o13_f0_fsc.sym.76.cif,1,0.000856,0.000869,0.000895,0.000934,0.000989,0.00106,0.001148,0.001253,0.001375,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB1-Cu2O8N2-irmof14_A-irmof7_A_No101.cif,2,0.000788,0.000797,0.000814,0.000843,0.000886,0.000946,0.001029,0.001141,0.001286,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB0-m3_o96_o13_f0_fsc.sym.51.cif,3,0.001121,0.001136,0.001166,0.001211,0.001272,0.001347,0.001437,0.001539,0.001649,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB0-m2_o11_o11_f0_nbo.sym.9.cif,4,0.000867,0.000876,0.000894,0.000921,0.000957,0.001004,0.00106,0.001126,0.001202,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
rdf.drop(columns='Unnamed: 0', inplace=True)

Time to join!

In [27]:
geo_top_rdf = geo_top.join(other = rdf, on = 'filename', how='inner', rsuffix='rdf')
geo_top_rdf.shape

(263743, 709)
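Row counts after a join are only easy to interpret if the join key is unique on both sides, since a repeated key can multiply rows. A small sketch of a uniqueness check, run here on a toy index rather than the real `rdf` table:

```python
import pandas as pd

rdf_toy = pd.DataFrame(
    {'RDF_none_2.000': [0.1, 0.2, 0.3]},
    index=pd.Index(['a.cif', 'b.cif', 'b.cif'], name='Structure_Name'),
)

is_unique = rdf_toy.index.is_unique
# keep=False marks every occurrence of a repeated key, not just the later ones.
duplicated_keys = rdf_toy.index[rdf_toy.index.duplicated(keep=False)].unique().tolist()

print(is_unique, duplicated_keys)
```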

In [28]:
geo_top_rdf.head()

Unnamed: 0.1,filename,Unnamed: 0,filename_geo,unit_cell_volume,Density,accessible_surface_area,volumetric_surface_area,gravimetric_surface_area,inaccessible_surface_area,inac_grav_surf_area,...,RDF_none_25.700,RDF_none_26.161,RDF_none_26.625,RDF_none_27.094,RDF_none_27.568,RDF_none_28.046,RDF_none_28.528,RDF_none_29.015,RDF_none_29.506,RDF_none_30.001
0,DB0-m2_o1_o10_f0_pcu.sym.66.cif,0,DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.23322,87.4832,970.108,786.644,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,DB0-m3_o23_o23_f0_pcu.sym.74.cif,6,DB0-m3_o23_o23_f0_pcu.sym.74.cif,7545.84,0.537679,1566.33,2075.75,3860.57,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,DB0-m2_o8_o25_f0_pcu.sym.91.cif,7,DB0-m2_o8_o25_f0_pcu.sym.91.cif,4172.23,0.371648,771.93,1850.16,4978.27,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,DB0-m29_o82_o46_f0_pts.sym.1.cif,8,DB0-m29_o82_o46_f0_pts.sym.1.cif,1715.11,0.786327,378.905,2209.23,2809.55,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,DB0-m29_o99_o470_f0_pts.sym.128.cif,10,DB0-m29_o99_o470_f0_pts.sym.128.cif,2552.97,0.754924,419.589,1643.53,2177.08,0.164038,0.642539,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
geo_top_rdf.set_index('filename', inplace=True)
geo_top_rdf.drop(columns=['Unnamed: 0', 'filename_geo'], inplace = True)

Next I'll download the process datafile, which has adsorption data for five different gas separation processes: natural gas purification (90% CH4/10% CO2), post-combustion VSA (17% CO2/83% N2), pre-combustion PSA (40% CO2/60% H2), landfill gas VPSA (42-96% CO2/58-4% CH4) and methane storage PSA (100% CH4). 

In [30]:
url = 'https://zenodo.org/record/7600474/files/overall_process.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/process.csv', 'wb').write(r.content)

228656569

In [31]:
process = pd.read_csv('../data/raw/process.csv')
process.head()

Unnamed: 0.1,Unnamed: 0,filename,process,mmol/g_uptake,mmol/g_working_capacity,v/v_uptake,v/v_working_capacity,wt%_uptake,wt%_working_capacity,selectivity,purity,ssp,afm
0,0,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,post-combustion-vsa,0.42247,0.370563,9.976014,8.750299,1.859267,1.630826,13.582315,0.840454,71.548843,27.906003
1,1,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,pre-combustion-40-40,9.998517,9.25787,236.100667,218.611342,44.002972,40.743421,106.889928,0.98889,9514.302202,3623.126284
2,2,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,natural-gas-purification,2.213165,1.923483,52.260721,45.420287,9.740028,8.46515,5.521347,0.390499,3.537445,14.054911
3,3,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,landfill-gas-vpsa,2.927485,2.650803,69.128371,62.594927,12.883715,11.666052,3.301257,0.686128,7.216585,72.01408
4,4,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,methane-storage-psa,8.560576,5.384048,202.145755,127.136581,13.696922,8.614477,,,,


In [32]:
process.shape

(1395025, 13)

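Again I'm noticing the 'repeat' in the filename, and here the names also lack the '.cif' extension. I'll replace the '_repeat' suffix with '.cif' so the names match the other dataframes and I can join them.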

In [33]:
process['filename'] = process['filename'].str.replace('_repeat', '.cif')
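The two tables use slightly different suffixes ('_repeat.cif' in the RDF file, '_repeat' with no extension here), and a plain `replace` would also alter a name that happened to contain '_repeat' in the middle. Below is a sketch of a single anchored pattern that handles both forms. `normalize_names` is a hypothetical helper, and the example names are shortened for readability.

```python
import pandas as pd


def normalize_names(names):
    """Map 'x_repeat.cif' and 'x_repeat' to 'x.cif'; leave other names alone."""
    # '$' anchors the match to the end of the name; '(\.cif)?' makes the extension optional.
    return names.str.replace(r'_repeat(\.cif)?$', '.cif', regex=True)


raw = pd.Series(['DB0-m2_o1_pcu.sym.66_repeat.cif',
                 'DB0-m3_o12_pcu.sym.90_repeat',
                 'DB0-m15_o27_aww.cif'])
clean = normalize_names(raw).tolist()
print(clean)
```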

In [34]:
process.set_index('filename', inplace=True)

Fortunately, this database has data for exactly the process I'm interested in: post-combustion VSA, where CO2 is removed from power plant exhaust. I'll filter down to those rows now to save memory and make the data easier to work with.

In [35]:
pcv = process[process['process'] == 'post-combustion-vsa']

In [36]:
pcv.head()

Unnamed: 0_level_0,Unnamed: 0,process,mmol/g_uptake,mmol/g_working_capacity,v/v_uptake,v/v_working_capacity,wt%_uptake,wt%_working_capacity,selectivity,purity,ssp,afm
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
DB0-m3_o12_o22_f0_pcu.sym.90.cif,0,post-combustion-vsa,0.42247,0.370563,9.976014,8.750299,1.859267,1.630826,13.582315,0.840454,71.548843,27.906003
DB0-m2_o2_o7_f0_nbo.sym.27.cif,5,post-combustion-vsa,0.247034,0.212637,1.497506,1.288993,1.087185,0.935805,1.12193,0.527647,1.253264,0.596904
DB0-m2_o2_o8_f0_pcu.sym.17.cif,10,post-combustion-vsa,0.557584,0.490135,6.232654,5.478715,2.453899,2.157061,6.85865,0.794227,26.472508,23.720292
DB1-Zn4O13-irmof14_A-irmof6_A_No127.cif,15,post-combustion-vsa,0.253349,0.208746,3.443824,2.837526,1.114976,0.91868,3.563765,0.674693,7.391321,1.672552
DB0-m3_o16_o460_f0_fsc.sym.19.cif,20,post-combustion-vsa,3.931845,2.774926,110.762621,78.171464,17.303852,12.21231,196.043735,0.97647,8135.655217,615.449466
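A boolean filter like the one above returns a subset that pandas may treat as derived from `process`, and in-place edits on such a subset can trigger a `SettingWithCopyWarning`. A minimal sketch of the usual remedy, an explicit `.copy()`, on toy data (the values and the `ssp_renamed` label are placeholders):

```python
import pandas as pd

process_toy = pd.DataFrame({
    'process': ['post-combustion-vsa', 'landfill-gas-vpsa', 'post-combustion-vsa'],
    'ssp': [71.5, 7.2, 1.25],
})

# .copy() makes the subset an independent dataframe, so later in-place edits are unambiguous.
pcv_toy = process_toy[process_toy['process'] == 'post-combustion-vsa'].copy()
pcv_toy.rename(columns={'ssp': 'ssp_renamed'}, inplace=True)

print(pcv_toy.columns.tolist(), process_toy.columns.tolist())
```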


Two column names in this dataframe are unclear: ssp and afm. I'm going to rename them so they're easier to understand.

In [37]:
pcv = pcv.rename(columns={'ssp':'specific_surf_area_particle', 'afm':'adsorption_fig_of_merit'})


Now I'll merge the post-combustion VSA data into the rest.

In [38]:
merged = geo_top_rdf.join(other = pcv, on = 'filename', how='inner', rsuffix = 'process')

In [39]:
merged.shape

(263295, 718)

My last addition will be the RACs, which describe properties around the ligands and metal centers.

In [40]:
url = 'https://zenodo.org/record/7600474/files/RACs.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/racs.csv', 'wb').write(r.content) 

838376934

In [41]:
racs = pd.read_csv('../data/raw/racs.csv')
racs.head()

Unnamed: 0.1,Unnamed: 0,f-chi-0-all,f-chi-1-all,f-chi-2-all,f-chi-3-all,f-Z-0-all,f-Z-1-all,f-Z-2-all,f-Z-3-all,f-I-0-all,...,ARC_MOF,DB_num,order_f-lig,bool_f-lig,order_mc,bool_mc,order_func,bool_func,order_lc,bool_lc
0,0,127.8988,244.928,456.1952,525.632,2338.0,4480.0,4832.0,6016.0,14.0,...,False,DB1,0,True,0,True,0,True,0,True
1,1,172.392,330.048,688.2868,930.368,2580.0,5628.0,7264.0,8224.0,20.0,...,True,DB0,100001,False,25085,True,59576,True,11104,True
2,2,126.1238,231.168,445.9952,511.872,2456.0,4608.0,4928.0,6144.0,14.0,...,True,DB1,100001,False,8681,True,78,True,24242,True
3,3,172.392,330.048,688.2868,930.368,2580.0,5628.0,7264.0,8224.0,20.0,...,False,DB1,100001,False,8049,True,73075,True,12228,True
4,4,126.1238,231.168,445.9952,511.872,2456.0,4608.0,4928.0,6144.0,14.0,...,True,DB0,100001,False,4252,True,41032,True,7990,True


In [42]:
racs.shape

(472571, 188)

In [43]:
print(racs.columns.tolist())

['Unnamed: 0', 'f-chi-0-all', 'f-chi-1-all', 'f-chi-2-all', 'f-chi-3-all', 'f-Z-0-all', 'f-Z-1-all', 'f-Z-2-all', 'f-Z-3-all', 'f-I-0-all', 'f-I-1-all', 'f-I-2-all', 'f-I-3-all', 'f-T-0-all', 'f-T-1-all', 'f-T-2-all', 'f-T-3-all', 'f-S-0-all', 'f-S-1-all', 'f-S-2-all', 'f-S-3-all', 'mc-chi-0-all', 'mc-chi-1-all', 'mc-chi-2-all', 'mc-chi-3-all', 'mc-Z-0-all', 'mc-Z-1-all', 'mc-Z-2-all', 'mc-Z-3-all', 'mc-I-0-all', 'mc-I-1-all', 'mc-I-2-all', 'mc-I-3-all', 'mc-T-0-all', 'mc-T-1-all', 'mc-T-2-all', 'mc-T-3-all', 'mc-S-0-all', 'mc-S-1-all', 'mc-S-2-all', 'mc-S-3-all', 'D_mc-chi-0-all', 'D_mc-chi-1-all', 'D_mc-chi-2-all', 'D_mc-chi-3-all', 'D_mc-Z-0-all', 'D_mc-Z-1-all', 'D_mc-Z-2-all', 'D_mc-Z-3-all', 'D_mc-I-0-all', 'D_mc-I-1-all', 'D_mc-I-2-all', 'D_mc-I-3-all', 'D_mc-T-0-all', 'D_mc-T-1-all', 'D_mc-T-2-all', 'D_mc-T-3-all', 'D_mc-S-0-all', 'D_mc-S-1-all', 'D_mc-S-2-all', 'D_mc-S-3-all', 'f-lig-chi-0', 'f-lig-chi-1', 'f-lig-chi-2', 'f-lig-chi-3', 'f-lig-Z-0', 'f-lig-Z-1', 'f-lig-Z-2', 'f

I'll drop a few unnecessary columns.

In [44]:
racs.drop(columns=['ARC_MOF', 'DB_num', 'Unnamed: 0'], inplace=True)
racs.set_index('filename', inplace=True)

In [45]:
racs.head()

Unnamed: 0_level_0,f-chi-0-all,f-chi-1-all,f-chi-2-all,f-chi-3-all,f-Z-0-all,f-Z-1-all,f-Z-2-all,f-Z-3-all,f-I-0-all,f-I-1-all,...,D_func-alpha-2-all,D_func-alpha-3-all,order_f-lig,bool_f-lig,order_mc,bool_mc,order_func,bool_func,order_lc,bool_lc
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
DB1-Cu2O8-irmof14_A-irmof6_A_No157.cif,127.8988,244.928,456.1952,525.632,2338.0,4480.0,4832.0,6016.0,14.0,32.0,...,-5.1,-8.477666,0,True,0,True,0,True,0,True
DB0-m2_o8_o16_f0_pcu.sym.12.cif,172.392,330.048,688.2868,930.368,2580.0,5628.0,7264.0,8224.0,20.0,44.0,...,-6.517821,-8.092118,100001,False,25085,True,59576,True,11104,True
DB1-Zn2O8-ADC_A-irmof8_A_No1.cif,126.1238,231.168,445.9952,511.872,2456.0,4608.0,4928.0,6144.0,14.0,32.0,...,0.0,0.0,100001,False,8681,True,78,True,24242,True
DB1-Cu2O8N2-DPAC_A-irmof8_A_No233.cif,172.392,330.048,688.2868,930.368,2580.0,5628.0,7264.0,8224.0,20.0,44.0,...,-3.0,-5.150889,100001,False,8049,True,73075,True,12228,True
DB0-m3_o6_o23_f0_nbo.sym.163.cif,126.1238,231.168,445.9952,511.872,2456.0,4608.0,4928.0,6144.0,14.0,32.0,...,-2.867852,-1.735703,100001,False,4252,True,41032,True,7990,True


In [46]:
merged.head()

Unnamed: 0_level_0,unit_cell_volume,Density,accessible_surface_area,volumetric_surface_area,gravimetric_surface_area,inaccessible_surface_area,inac_grav_surf_area,inac_vol_surf_area,accessible_volume_per_uc,volume_fraction,...,mmol/g_uptake,mmol/g_working_capacity,v/v_uptake,v/v_working_capacity,wt%_uptake,wt%_working_capacity,selectivity,purity,specific_surf_area_particle,adsorption_fig_of_merit
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.23322,87.4832,970.108,786.644,0.0,0.0,0.0,26.0256,0.02886,...,1.518477,1.269431,41.966421,35.083493,6.682741,5.586702,76.822165,0.944155,1298.817827,369.913624
DB0-m3_o23_o23_f0_pcu.sym.74.cif,7545.84,0.537679,1566.33,2075.75,3860.57,0.0,0.0,0.0,2364.41,0.31334,...,0.85092,0.761627,10.253114,9.177189,3.744858,3.351886,28.049575,0.867411,183.503669,234.69423
DB0-m2_o8_o25_f0_pcu.sym.91.cif,4172.23,0.371648,771.93,1850.16,4978.27,0.0,0.0,0.0,2102.05,0.50382,...,0.297468,0.250113,2.47756,2.083153,1.309143,1.100738,1.770005,0.608544,2.751594,1.106303
DB0-m29_o82_o46_f0_pts.sym.1.cif,1715.11,0.786327,378.905,2209.23,2809.55,0.0,0.0,0.0,281.586,0.16418,...,1.428373,1.252157,25.170934,22.065633,6.286197,5.510678,40.84989,0.900005,367.668977,445.19779
DB0-m29_o99_o470_f0_pts.sym.128.cif,2552.97,0.754924,419.589,1643.53,2177.08,0.164038,0.642539,0.851131,268.47,0.10516,...,0.980361,0.825367,16.586056,13.963825,4.31452,3.632401,31.178985,0.88375,237.027773,102.918679


Now I'll join RACs to the merged dataframe.

In [47]:
total = merged.join(other = racs, on = 'filename', how = 'inner', rsuffix = 'racs')
total.shape

(255938, 902)

In [48]:
total.head()

Unnamed: 0_level_0,unit_cell_volume,Density,accessible_surface_area,volumetric_surface_area,gravimetric_surface_area,inaccessible_surface_area,inac_grav_surf_area,inac_vol_surf_area,accessible_volume_per_uc,volume_fraction,...,D_func-alpha-2-all,D_func-alpha-3-all,order_f-lig,bool_f-lig,order_mc,bool_mc,order_func,bool_func,order_lc,bool_lc
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.23322,87.4832,970.108,786.644,0.0,0.0,0.0,26.0256,0.02886,...,41.88578,23.764297,43831,True,18963,True,21096,True,4072,True
DB0-m3_o23_o23_f0_pcu.sym.74.cif,7545.84,0.537679,1566.33,2075.75,3860.57,0.0,0.0,0.0,2364.41,0.31334,...,4.4,17.857187,100001,False,44145,True,100001,False,42138,True
DB0-m2_o8_o25_f0_pcu.sym.91.cif,4172.23,0.371648,771.93,1850.16,4978.27,0.0,0.0,0.0,2102.05,0.50382,...,-7.433333,-5.745183,100001,False,100001,False,100001,False,100001,False
DB0-m29_o82_o46_f0_pts.sym.1.cif,1715.11,0.786327,378.905,2209.23,2809.55,0.0,0.0,0.0,281.586,0.16418,...,-6.0,-12.0,100001,False,100001,False,100001,False,100001,False
DB0-m29_o99_o470_f0_pts.sym.128.cif,2552.97,0.754924,419.589,1643.53,2177.08,0.164038,0.642539,0.851131,268.47,0.10516,...,-10.339258,-17.664703,100001,False,100001,False,84717,True,100001,False


Ok, time to start cleaning. I'll check the datatypes of the columns to start.

In [49]:
print(total.columns.to_series().groupby(total.dtypes).groups)

{bool: ['bool_geo', 'bool_f-lig', 'bool_mc', 'bool_func', 'bool_lc'], int64: ['order_geo', 'Unnamed: 0', 'order_f-lig', 'order_mc', 'order_func', 'order_lc'], float64: ['unit_cell_volume', 'Density', 'accessible_surface_area', 'volumetric_surface_area', 'gravimetric_surface_area', 'inaccessible_surface_area', 'inac_grav_surf_area', 'inac_vol_surf_area', 'accessible_volume_per_uc', 'volume_fraction', 'grav_volume', 'inac_vol', 'inac_vol_frac', 'inac_grav_vol', 'probe_occupiable_vol', 'probe_occ_vol_frac', 'grav_probe_occ_vol', 'inac_probe_occ_vol', 'inac_probe_occ_vol_frac', 'inac_probe_occ_grav_vol', 'largest_cav_diameter', 'pore_limiting_diameter', 'largest_free_sphere_path_diam', 'RDF_electronegativity_2.000', 'RDF_electronegativity_2.004', 'RDF_electronegativity_2.013', 'RDF_electronegativity_2.027', 'RDF_electronegativity_2.044', 'RDF_electronegativity_2.066', 'RDF_electronegativity_2.093', 'RDF_electronegativity_2.124', 'RDF_electronegativity_2.159', 'RDF_electronegativity_2.199',
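With this many columns the grouped listing is hard to scan. As a rough sketch on a toy frame (the column names are borrowed from the tables above, the values are made up), counting columns per dtype and listing only the non-numeric ones gives a shorter summary.

```python
import pandas as pd

df = pd.DataFrame({
    'Density': pd.Series([1.2, 0.4], dtype='float64'),
    'order_geo': pd.Series([0, 6], dtype='int64'),
    'bool_geo': pd.Series([True, False], dtype='bool'),
    'Crystalnet': ['pcu', 'pts'],
})

# Number of columns per dtype, keyed by the dtype's name.
dtype_counts = df.dtypes.astype(str).value_counts().to_dict()
# Columns that are neither numeric nor boolean usually deserve a closer look.
non_numeric = df.select_dtypes(exclude=['number', 'bool']).columns.tolist()

print(dtype_counts)
print(non_numeric)
```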

First of all, I'm seeing some more columns I need to remove.

In [50]:
trimmed = total.drop(columns=['filename_top', 'process'])

Now I'm not seeing any obviously incorrect datatypes, so I'll check to see if I have any columns that are missing too much data.

In [51]:
trimmed.head()

Unnamed: 0_level_0,unit_cell_volume,Density,accessible_surface_area,volumetric_surface_area,gravimetric_surface_area,inaccessible_surface_area,inac_grav_surf_area,inac_vol_surf_area,accessible_volume_per_uc,volume_fraction,...,D_func-alpha-2-all,D_func-alpha-3-all,order_f-lig,bool_f-lig,order_mc,bool_mc,order_func,bool_func,order_lc,bool_lc
filename,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.23322,87.4832,970.108,786.644,0.0,0.0,0.0,26.0256,0.02886,...,41.88578,23.764297,43831,True,18963,True,21096,True,4072,True
DB0-m3_o23_o23_f0_pcu.sym.74.cif,7545.84,0.537679,1566.33,2075.75,3860.57,0.0,0.0,0.0,2364.41,0.31334,...,4.4,17.857187,100001,False,44145,True,100001,False,42138,True
DB0-m2_o8_o25_f0_pcu.sym.91.cif,4172.23,0.371648,771.93,1850.16,4978.27,0.0,0.0,0.0,2102.05,0.50382,...,-7.433333,-5.745183,100001,False,100001,False,100001,False,100001,False
DB0-m29_o82_o46_f0_pts.sym.1.cif,1715.11,0.786327,378.905,2209.23,2809.55,0.0,0.0,0.0,281.586,0.16418,...,-6.0,-12.0,100001,False,100001,False,100001,False,100001,False
DB0-m29_o99_o470_f0_pts.sym.128.cif,2552.97,0.754924,419.589,1643.53,2177.08,0.164038,0.642539,0.851131,268.47,0.10516,...,-10.339258,-17.664703,100001,False,100001,False,84717,True,100001,False


In [52]:
print((trimmed.isna().sum()/len(trimmed)).sort_values(ascending=False))

RDF_hardness_3.022          0.001621
RDF_hardness_9.315          0.001621
RDF_hardness_4.195          0.001621
RDF_hardness_4.336          0.001621
RDF_hardness_4.482          0.001621
                              ...   
RDF_polarizability_3.221    0.000000
RDF_polarizability_3.327    0.000000
RDF_polarizability_3.438    0.000000
RDF_polarizability_3.553    0.000000
bool_lc                     0.000000
Length: 900, dtype: float64
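The sorted listing shows only the ends of a 900-row series. One way to condense it is to group columns by their prefix and report the worst missing fraction per group. The sketch below uses a toy frame, and the regular expression assumes column names end in `_<distance>` as the RDF columns do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'RDF_hardness_3.022': [0.1, np.nan, 0.3, 0.2],
    'RDF_hardness_4.195': [0.2, np.nan, 0.1, 0.4],
    'RDF_mass_3.022': [1.0, 2.0, 3.0, 4.0],
    'Density': [1.2, 0.4, 0.7, 0.9],
})

missing = df.isna().mean()  # fraction of missing values per column
# Strip a trailing '_<number>.<number>' so all bins of one RDF share a group label.
group = missing.index.to_series().str.replace(r'_\d+\.\d+$', '', regex=True)
by_group = missing.groupby(group).max()

print(by_group)
```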


OK, missing data isn't a problem in this dataset. Next, let me look at what fraction of each column's values are 0. I've got ~256k rows and 900 columns, so I will need to reduce the number of features in this dataset. I suspect there are some mostly-zero columns I can drop.

In [53]:
zeroes = {}

for col in trimmed.columns:
    if trimmed[col].dtypes == 'int64':
        if 0 in trimmed[col].unique():
            zeroes[col] = trimmed[col].value_counts(normalize=True)[0]
    
sorted_zeroes = sorted(zeroes.items(), key = lambda x:x[1])
print(sorted_zeroes)

[('order_geo', 3.907196274097633e-06), ('Unnamed: 0', 3.907196274097633e-06)]
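The loop above only looks at `int64` columns, so the float columns are never counted. Here is a sketch of a vectorized version that covers every numeric column, shown on a toy frame; the values are copied from the previews above purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'order_geo': [0, 6, 7, 8],
    'inaccessible_surface_area': [0.0, 0.0, 0.0, 0.164038],
    'Density': [1.23322, 0.537679, 0.371648, 0.786327],
    'bool_geo': [True, True, False, True],
})

numeric = df.select_dtypes(include='number')   # ints and floats; bool columns are left out
# Comparing to 0 gives a boolean frame; its column means are the zero fractions.
zero_frac = (numeric == 0).mean().sort_values(ascending=False)

print(zero_frac)
```

A threshold such as `zero_frac[zero_frac > 0.9]` could then list candidate columns to drop; the 0.9 cutoff is only an example value.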


So none of the integer columns have many zero values. Note that the loop above only inspects `int64` columns, so the float columns, which make up most of the dataset and include several that show only 0.0 in the previews above, have not been checked for zeros. With 900 columns, I'll check for unexpected values by going through each group of columns (for example, RDFs of electronegativity) and writing a loop that flags anything outside an expected range. RDFs cover electronegativity, atomic hardness, van der Waals volume, dipole polarizability, atomic mass, and none, so I'll break each of them out and see if any invalid values are present.

In [54]:
# Electronegativity should range between 0 and 4
for col in trimmed.columns:
    if 'RDF_electronegativity' in col:
        if min(trimmed[col]) < 0 or max(trimmed[col]) > 4:
            print(col)

In [55]:
# Atomic hardness ranges between 0 and 13
# Series.min()/max() skip NaN, unlike the built-ins, and these columns have missing values
for col in trimmed.columns:
    if 'RDF_hardness' in col:
        if trimmed[col].min() < 0 or trimmed[col].max() > 13:
            print(col)

In [56]:
# van der Waals volume should never be negative
for col in trimmed.columns:
    if 'RDF_v' in col:
        if min(trimmed[col]) < 0:
            print(col)

In [57]:
# Dipole polarizability should be non-negative
for col in trimmed.columns:
    if 'RDF_polarizability' in col:
        if min(trimmed[col]) < 0:
            print(col)

In [58]:
# Atomic mass should never be negative
for col in trimmed.columns:
    if 'RDF_mass' in col:
        if min(trimmed[col]) < 0:
            print(col)
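The five loops above share one pattern, so they can be folded into a single table-driven check. The sketch below is my own consolidation under a few assumptions: `BOUNDS` repeats the limits used in the cells above, `out_of_range` is a hypothetical helper, and the toy frame exists only to exercise it. It uses `Series.min()` and `Series.max()`, which skip NaN. Python's built-in `min` and `max` compare element by element, and every comparison with NaN is false, so their result depends on where the NaN sits.

```python
import numpy as np
import pandas as pd

# (lower, upper) bounds per column prefix; None means "no bound on that side".
BOUNDS = {
    'RDF_electronegativity': (0, 4),
    'RDF_hardness': (0, 13),
    'RDF_polarizability': (0, None),
    'RDF_mass': (0, None),
}


def out_of_range(df, bounds):
    """Return the columns whose min or max falls outside the bounds for their prefix."""
    flagged = []
    for prefix, (lo, hi) in bounds.items():
        cols = [c for c in df.columns if c.startswith(prefix)]
        if not cols:
            continue
        mins = df[cols].min()   # NaN values are skipped by default
        maxs = df[cols].max()
        bad = pd.Series(False, index=cols)
        if lo is not None:
            bad |= mins < lo
        if hi is not None:
            bad |= maxs > hi
        flagged.extend(bad[bad].index.tolist())
    return flagged


toy = pd.DataFrame({
    'RDF_hardness_3.022': [np.nan, 0.5, 14.0],   # 14.0 exceeds the upper bound of 13
    'RDF_mass_3.022': [1.0, 2.0, 3.0],
})
flagged = out_of_range(toy, BOUNDS)
print(flagged)
```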

OK, all of the RDF values I checked meet expectations. Since the values come from theoretical calculations and the authors already cleaned them for their own analysis, it makes sense that they fall within sensible bounds. I'll drop any rows that are missing the CO2 working capacity and then save the data.

In [59]:
trimmed.dropna(subset='v/v_working_capacity', inplace=True)

In [60]:
trimmed.shape

(255938, 900)

In [61]:
trimmed.to_csv('../data/interim/wrangled.csv')