# Capturing greenhouse gases with data

## Data Wrangling

### by Zachary Brown

The original goal of this project was to merge two metal-organic framework (MOF) databases to determine which chemical properties increase a MOF's CO2 capacity. Those two databases shared only 30 entries with matching MOF identifiers, so instead I will use the ARC-MOF database, which contains over 200,000 theoretical MOFs and includes both chemical properties and gas adsorption predictions.

Some key terms used throughout this dataset and project:

- RDF: radial distribution functions (calculated for electronegativity, atomic hardness, van der Waals volume, dipole polarizability, atomic mass, and none)
- RAC: revised autocorrelations (calculated for electronegativity, nuclear charge, atom identity, connectivity, and covalent radii)

First I'll install and import the necessary libraries.

In [None]:
!pip install numpy==1.24.2
!pip install pandas==1.5.3
!pip install requests==2.28.2

In [2]:
import numpy as np
import pandas as pd
import requests

Now I'll start by downloading the topology dataset, which describes the geometric topology of the MOFs.

In [3]:
url = 'https://zenodo.org/record/7600474/files/all_topology_lists.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/topology.csv', 'wb').write(r.content)

11683716

In [4]:
top = pd.read_csv('../data/raw/topology.csv')
top.head()


,Name,filename,Crystalnet,likely topology
0,DB0-m12_o10_bcu.cif,bcu,bcu,bcu
1,DB0-m12_o12_bcu.cif,bcu,bcu,bcu
2,DB0-m12_o13_bcu.cif,bcu,bcu,bcu
3,DB0-m12_o14_bcu.cif,bcu,bcu,bcu
4,DB0-m12_o14_o22_f0_bcu.cif,bcu,bcu,bcu


In [5]:
top.shape

(264225, 4)

To join this dataframe with later ones, I need the 'Name' column as the index, so I'll set it and then download the geometry dataset, which holds geometric properties of the MOFs.

In [9]:
top.set_index('Name', inplace=True)

In [10]:
url = 'https://zenodo.org/record/7600474/files/geometric_properties.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/geom.csv', 'wb').write(r.content)

110395714

In [11]:
geo = pd.read_csv('../data/raw/geom.csv')
geo.head()

,Unnamed: 0,filename,UC_volume,Density,ASA,vASA,gASA,NASA,gNASA,vNASA,...,NPOAVA,NPOAVAf,NPOAVAg,Di,Df,Dif,ARC-MOF,DB_num,order_geo,bool_geo
0,0,DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.23322,87.4832,970.108,786.644,0.0,0.0,0.0,...,0.0,0.0,0.0,5.41813,4.36524,5.39798,True,DB0,0,True
1,1,DB0-m28_o161_o113_f0_pts.cif,8183.19,0.389995,1749.55,2137.97,5482.05,0.0,0.0,0.0,...,0.0,0.0,0.0,16.83322,15.07954,16.80076,False,DB0,1,True
2,2,DB1-Zn2O8N2-irmof20_A-irmof8_A_No13.cif,3853.14,0.652434,824.502,2139.82,3279.75,0.0,0.0,0.0,...,0.0,0.0,0.0,11.24255,9.36124,11.24255,False,DB1,2,True
3,3,DB1-Zn4O13-BDC_A-irmof6_A_No267.cif,16975.8,0.815191,3234.86,1905.57,2337.57,0.0,0.0,0.0,...,0.0,0.0,0.0,14.9643,6.83319,14.95745,False,DB1,3,True
4,4,DB0-m15_o27_aww.cif,236848.0,0.12761,17612.1,743.601,5827.13,0.0,0.0,0.0,...,0.0,0.0,0.0,48.43682,38.41622,48.43682,False,DB0,4,True


In [12]:
geo.shape

(521316, 29)

In [13]:
geo.columns

Index(['Unnamed: 0', 'filename', 'UC_volume', 'Density', 'ASA', 'vASA', 'gASA',
       'NASA', 'gNASA', 'vNASA', 'AVA', 'AVAf', 'AVAg', 'NAVA', 'NAVAf',
       'NAVAg', 'POAVA', 'POAVAf', 'POAVAg', 'NPOAVA', 'NPOAVAf', 'NPOAVAg',
       'Di', 'Df', 'Dif', 'ARC-MOF', 'DB_num', 'order_geo', 'bool_geo'],
      dtype='object')

These column headers aren't particularly insightful, so I'm going to reference the journal article to rename these to something more useful.

In [None]:
{'UC_volume', 'Density', 'ASA', 'vASA', 'gASA',
       'NASA', 'gNASA', 'vNASA', 'AVA', 'AVAf', 'AVAg', 'NAVA', 'NAVAf',
       'NAVAg', 'POAVA', 'POAVAf', 'POAVAg', 'NPOAVA', 'NPOAVAf', 'NPOAVAg',
       'Di', 'Df', 'Dif', 'ARC-MOF', 'DB_num', 'order_geo', 'bool_geo'}
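
As a minimal sketch of how the rename could look, the snippet below maps a few of the abbreviated headers to longer names with `DataFrame.rename`. The descriptive names are my own assumptions based on common pore-geometry abbreviations (for example, `ASA` as accessible surface area and `Di` as largest included sphere diameter) and should be checked against the journal article before use. The toy dataframe `geo_demo` stands in for `geo`.

```python
import pandas as pd

# Toy stand-in for `geo`, using values from the first row shown above.
geo_demo = pd.DataFrame({
    'UC_volume': [901.788], 'Density': [1.23322], 'ASA': [87.4832],
    'Di': [5.41813], 'Df': [4.36524], 'Dif': [5.39798],
})

# Hypothetical descriptive names (assumptions, not taken from the article).
rename_map = {
    'UC_volume': 'unit_cell_volume',
    'Density': 'density',
    'ASA': 'accessible_surface_area',
    'Di': 'largest_included_sphere_diameter',
    'Df': 'largest_free_sphere_diameter',
    'Dif': 'largest_included_sphere_along_free_path',
}

# rename returns a new dataframe; columns missing from the map are left unchanged.
geo_renamed = geo_demo.rename(columns=rename_map)
print(list(geo_renamed.columns))
```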

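Next I'll join the geometry and topology dataframes on the file name, keeping only the MOFs that appear in both.

In [21]: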
geo_top = geo.join(other = top, on = 'filename', how = 'inner', lsuffix='_geo', rsuffix='_top')

In [22]:
geo_top.shape

(263744, 33)
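
The inner join keeps only rows whose `filename` appears in both tables, which is why 521,316 geometry rows and 264,225 topology rows reduce to 263,744. A minimal sketch of how the unmatched rows could be counted, using `merge` with `indicator=True` on two toy dataframes (the toy names and values are made up for illustration):

```python
import pandas as pd

geo_demo = pd.DataFrame({
    'filename': ['a.cif', 'b.cif', 'c.cif'],
    'Density': [1.2, 0.4, 0.7],
})
top_demo = pd.DataFrame({
    'Name': ['a.cif', 'c.cif', 'd.cif'],
    'Crystalnet': ['pcu', 'pts', 'bcu'],
})

# An outer merge with indicator=True adds a `_merge` column that records
# whether each key was found in the left table, the right table, or both.
check = geo_demo.merge(top_demo, left_on='filename', right_on='Name',
                       how='outer', indicator=True)
match_counts = check['_merge'].value_counts()
print(match_counts)

# Rows labelled 'both' are exactly the ones an inner join would keep.
n_inner = int(match_counts['both'])
```

Looking at the `left_only` and `right_only` rows is one way to see which MOFs are dropped before committing to the inner join.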

In [25]:
geo_top.set_index('filename', inplace=True)

In [28]:
geo_top.head()

filename,Unnamed: 0,filename_geo,UC_volume,Density,ASA,vASA,gASA,NASA,gNASA,vNASA,...,Di,Df,Dif,ARC-MOF,DB_num,order_geo,bool_geo,filename_top,Crystalnet,likely topology
DB0-m2_o1_o10_f0_pcu.sym.66.cif,0,DB0-m2_o1_o10_f0_pcu.sym.66.cif,901.788,1.23322,87.4832,970.108,786.644,0.0,0.0,0.0,...,5.41813,4.36524,5.39798,True,DB0,0,True,pcu,pcu,pcu
DB0-m3_o23_o23_f0_pcu.sym.74.cif,6,DB0-m3_o23_o23_f0_pcu.sym.74.cif,7545.84,0.537679,1566.33,2075.75,3860.57,0.0,0.0,0.0,...,10.43731,9.91429,10.43731,True,DB0,6,True,pcu,pcu,pcu
DB0-m2_o8_o25_f0_pcu.sym.91.cif,7,DB0-m2_o8_o25_f0_pcu.sym.91.cif,4172.23,0.371648,771.93,1850.16,4978.27,0.0,0.0,0.0,...,12.93441,11.01397,12.93441,True,DB0,7,True,pcu,pcu,pcu
DB0-m29_o82_o46_f0_pts.sym.1.cif,8,DB0-m29_o82_o46_f0_pts.sym.1.cif,1715.11,0.786327,378.905,2209.23,2809.55,0.0,0.0,0.0,...,8.35282,5.44658,7.30192,True,DB0,8,True,pts,pts,pts
DB0-m29_o99_o470_f0_pts.sym.128.cif,10,DB0-m29_o99_o470_f0_pts.sym.128.cif,2552.97,0.754924,419.589,1643.53,2177.08,0.164038,0.642539,0.851131,...,7.57868,4.51994,7.57868,True,DB0,10,True,pts,pts,pts


Now that the two are merged, I'll download the RDF dataset, which holds radial distribution function descriptors weighted by atomic properties such as electronegativity.

In [14]:
url = 'https://zenodo.org/record/7600474/files/RDFs.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/rdf.csv', 'wb').write(r.content)

2129693042
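
At roughly 2 GB, this file is large enough that holding the whole response in memory through `r.content` may be wasteful. A possible alternative, sketched below under the assumption that streaming is acceptable here, writes the response to disk in chunks with `iter_content`. The helper name `save_stream` and the 1 MiB chunk size are my own choices, not part of the original notebook.

```python
from pathlib import Path


def save_stream(response, dest, chunk_size=1 << 20):
    """Write a streamed HTTP response to `dest` in chunks; return bytes written."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    response.raise_for_status()  # stop early if the server returned an error
    written = 0
    with open(dest, 'wb') as f:  # the context manager closes the file
        for chunk in response.iter_content(chunk_size=chunk_size):
            written += f.write(chunk)
    return written


# Usage (needs network access, so it is left commented out here):
# import requests
# url = 'https://zenodo.org/record/7600474/files/RDFs.csv?download=1'
# with requests.get(url, stream=True, allow_redirects=True) as r:
#     save_stream(r, '../data/raw/rdf.csv')
```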

In [15]:
rdf = pd.read_csv('../data/raw/rdf.csv')
rdf.head(10)

,Unnamed: 0,Structure_Name,RDF_electronegativity_2.000,RDF_electronegativity_2.004,RDF_electronegativity_2.013,RDF_electronegativity_2.027,RDF_electronegativity_2.044,RDF_electronegativity_2.066,RDF_electronegativity_2.093,RDF_electronegativity_2.124,...,RDF_none_25.700,RDF_none_26.161,RDF_none_26.625,RDF_none_27.094,RDF_none_27.568,RDF_none_28.046,RDF_none_28.528,RDF_none_29.015,RDF_none_29.506,RDF_none_30.001
0,0,DB0-m29_o97_o420_f0_pts.sym.57_repeat.cif,0.000818,0.000833,0.000862,0.000907,0.000969,0.001049,0.001149,0.001269,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,DB0-m3_o440_o13_f0_fsc.sym.76_repeat.cif,0.000856,0.000869,0.000895,0.000934,0.000989,0.00106,0.001148,0.001253,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,DB1-Cu2O8N2-irmof14_A-irmof7_A_No101_repeat.cif,0.000788,0.000797,0.000814,0.000843,0.000886,0.000946,0.001029,0.001141,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,DB0-m3_o96_o13_f0_fsc.sym.51_repeat.cif,0.001121,0.001136,0.001166,0.001211,0.001272,0.001347,0.001437,0.001539,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,DB0-m2_o11_o11_f0_nbo.sym.9_repeat.cif,0.000867,0.000876,0.000894,0.000921,0.000957,0.001004,0.00106,0.001126,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5,DB0-m3_o52_o6_f0_fsc.sym.57_repeat.cif,0.000762,0.000771,0.000789,0.000817,0.000856,0.000908,0.000974,0.001055,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,DB0-m9_o6_o25_f0_sra.sym.66_repeat.cif,0.000868,0.000878,0.000899,0.000931,0.000975,0.00103,0.001098,0.001178,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,7,DB0-m29_o82_o86_f0_pts.sym.38_repeat.cif,0.000809,0.000824,0.000853,0.000899,0.000962,0.001046,0.001151,0.001279,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,8,DB0-m2_o11_o23_f0_nbo.sym.138_repeat.cif,0.000978,0.000989,0.001011,0.001045,0.00109,0.001148,0.001218,0.0013,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,9,DB0-m2_o6_o11_f0_pcu.sym.15_repeat.cif,0.000728,0.000737,0.000754,0.000781,0.000818,0.000866,0.000925,0.000997,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
rdf.shape

(279609, 680)

In [18]:
rdf.columns

Index(['Unnamed: 0', 'Structure_Name', 'RDF_electronegativity_2.000',
       'RDF_electronegativity_2.004', 'RDF_electronegativity_2.013',
       'RDF_electronegativity_2.027', 'RDF_electronegativity_2.044',
       'RDF_electronegativity_2.066', 'RDF_electronegativity_2.093',
       'RDF_electronegativity_2.124',
       ...
       'RDF_none_25.700', 'RDF_none_26.161', 'RDF_none_26.625',
       'RDF_none_27.094', 'RDF_none_27.568', 'RDF_none_28.046',
       'RDF_none_28.528', 'RDF_none_29.015', 'RDF_none_29.506',
       'RDF_none_30.001'],
      dtype='object', length=680)

In [None]:
rdf.set_index('Structure_Name', inplace=True)

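The structure names in the RDF dataset carry a '_repeat' suffix that the geometry file names lack, so I'll first check that every name has it and then strip it so the keys match.

In [44]: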
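# Select rows whose name lacks the literal '_repeat' marker (regex=False avoids '.' acting as a wildcard)
rdf[~rdf.index.str.contains('_repeat', regex=False)]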

Structure_Name,Unnamed: 0,RDF_electronegativity_2.000,RDF_electronegativity_2.004,RDF_electronegativity_2.013,RDF_electronegativity_2.027,RDF_electronegativity_2.044,RDF_electronegativity_2.066,RDF_electronegativity_2.093,RDF_electronegativity_2.124,RDF_electronegativity_2.159,...,RDF_none_25.700,RDF_none_26.161,RDF_none_26.625,RDF_none_27.094,RDF_none_27.568,RDF_none_28.046,RDF_none_28.528,RDF_none_29.015,RDF_none_29.506,RDF_none_30.001
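
One caveat with `str.contains` is that it treats its pattern as a regular expression by default, so a `.` in the pattern matches any character rather than a literal dot. A small sketch of literal matching on a toy index (two names copied from the RDF output above, plus one made-up name without the suffix), using `regex=False` for both the check and the replacement:

```python
import pandas as pd

names = pd.Index([
    'DB0-m29_o97_o420_f0_pts.sym.57_repeat.cif',
    'DB0-m3_o440_o13_f0_fsc.sym.76_repeat.cif',
    'hypothetical_no_suffix.cif',  # made-up name for illustration
])

# regex=False makes the pattern a plain substring rather than a regex.
has_suffix = names.str.contains('_repeat.cif', regex=False)
missing = names[~has_suffix]
print(list(missing))

# Strip the literal '_repeat' marker so names match the other tables.
cleaned = names.str.replace('_repeat', '', regex=False)
print(list(cleaned))
```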


In [45]:
rdf.index = rdf.index.str.replace('_repeat', '')

In [46]:
rdf.head()

Structure_Name,Unnamed: 0,RDF_electronegativity_2.000,RDF_electronegativity_2.004,RDF_electronegativity_2.013,RDF_electronegativity_2.027,RDF_electronegativity_2.044,RDF_electronegativity_2.066,RDF_electronegativity_2.093,RDF_electronegativity_2.124,RDF_electronegativity_2.159,...,RDF_none_25.700,RDF_none_26.161,RDF_none_26.625,RDF_none_27.094,RDF_none_27.568,RDF_none_28.046,RDF_none_28.528,RDF_none_29.015,RDF_none_29.506,RDF_none_30.001
DB0-m29_o97_o420_f0_pts.sym.57.cif,0,0.000818,0.000833,0.000862,0.000907,0.000969,0.001049,0.001149,0.001269,0.001406,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB0-m3_o440_o13_f0_fsc.sym.76.cif,1,0.000856,0.000869,0.000895,0.000934,0.000989,0.00106,0.001148,0.001253,0.001375,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB1-Cu2O8N2-irmof14_A-irmof7_A_No101.cif,2,0.000788,0.000797,0.000814,0.000843,0.000886,0.000946,0.001029,0.001141,0.001286,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB0-m3_o96_o13_f0_fsc.sym.51.cif,3,0.001121,0.001136,0.001166,0.001211,0.001272,0.001347,0.001437,0.001539,0.001649,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB0-m2_o11_o11_f0_nbo.sym.9.cif,4,0.000867,0.000876,0.000894,0.000921,0.000957,0.001004,0.00106,0.001126,0.001202,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
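# Use an underscore in the suffix so overlapping columns stay readable, matching '_geo' and '_top'
geo_top_rdf = geo_top.join(other = rdf, on = 'filename', how='inner', rsuffix='_rdf')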
geo_top_rdf.shape

(263743, 711)

Next I'll download the process dataset, which holds the predicted gas uptake, working capacity, and selectivity of each MOF under several separation processes.

In [12]:
url = 'https://zenodo.org/record/7600474/files/overall_process.csv?download=1'
r = requests.get(url, allow_redirects=True)
open('../data/raw/process.csv', 'wb').write(r.content)

228656569

In [13]:
process = pd.read_csv('../data/raw/process.csv')
process.head()

,Unnamed: 0,filename,process,mmol/g_uptake,mmol/g_working_capacity,v/v_uptake,v/v_working_capacity,wt%_uptake,wt%_working_capacity,selectivity,purity,ssp,afm
0,0,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,post-combustion-vsa,0.42247,0.370563,9.976014,8.750299,1.859267,1.630826,13.582315,0.840454,71.548843,27.906003
1,1,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,pre-combustion-40-40,9.998517,9.25787,236.100667,218.611342,44.002972,40.743421,106.889928,0.98889,9514.302202,3623.126284
2,2,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,natural-gas-purification,2.213165,1.923483,52.260721,45.420287,9.740028,8.46515,5.521347,0.390499,3.537445,14.054911
3,3,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,landfill-gas-vpsa,2.927485,2.650803,69.128371,62.594927,12.883715,11.666052,3.301257,0.686128,7.216585,72.01408
4,4,DB0-m3_o12_o22_f0_pcu.sym.90_repeat,methane-storage-psa,8.560576,5.384048,202.145755,127.136581,13.696922,8.614477,,,,


In [53]:
process.shape

(1395025, 12)

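The process file names end in '_repeat' and lack the '.cif' extension used in the other tables, so I'll swap the suffix, set the file name as the index, and join.

In [48]: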
process['filename'] = process['filename'].str.replace('_repeat', '.cif')

In [49]:
process.set_index('filename', inplace=True)

In [54]:
merged = geo_top_rdf.join(other = process, how='inner', rsuffix = 'process')

MemoryError: Unable to allocate 6.65 GiB for an array with shape (678, 1316475) and data type float64
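
The join runs out of memory because the process table has several rows per MOF (one per process), so each of the 678 float columns is repeated for every process. One possible way around this, sketched below on toy data, is to keep a single process before joining and to downcast the float columns to `float32`. The choice of `post-combustion-vsa`, the toy values, and the names `features_demo` and `process_demo` are assumptions for illustration, not decisions made in the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: `features_demo` plays the role of geo_top_rdf,
# `process_demo` the role of the long-format process table.
features_demo = pd.DataFrame(
    {'Density': [1.23, 0.39], 'Di': [5.42, 16.83]},
    index=pd.Index(['a.cif', 'b.cif'], name='filename'),
)
process_demo = pd.DataFrame({
    'filename': ['a.cif', 'a.cif', 'b.cif', 'b.cif'],
    'process': ['post-combustion-vsa', 'landfill-gas-vpsa'] * 2,
    'mmol/g_working_capacity': [0.37, 2.65, 0.51, 3.10],
})

# 1. Keep one row per MOF by selecting a single process (an assumed choice).
one_process = (
    process_demo[process_demo['process'] == 'post-combustion-vsa']
    .set_index('filename')
    .drop(columns='process')
)

# 2. Halve the memory of the float columns by downcasting to float32.
float_cols = features_demo.select_dtypes(include='float64').columns
features_small = features_demo.astype({col: np.float32 for col in float_cols})

# 3. Join on the shared index; the result has one row per MOF.
merged_demo = features_small.join(one_process, how='inner')
print(merged_demo.shape)
```

Whether `float32` precision is adequate depends on the later modelling step, so that part in particular is worth checking before applying it to the full data.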

In [None]:
merged.shape