# Data preprocessing

The dataset is composed of three parts:
1. descriptors of **MOFs** (under `data/ML_data`)
2. descriptors of **adsorbates** (manually added below)
3. **adsorption uptakes** of **(MOF, adsorbate) pairs** (under `data/flexibility_data/y_data/adsorption_data`), containing two values:
    1. values from rigid model
    2. mean values from flexible model

## read MOF data

In [1]:
import pandas as pd
import numpy as np
import os

# read ML data
dfMLOrigin = pd.read_excel('data/ML_data/descriptor_4717MOF.xlsx')

In [2]:
print(dfMLOrigin.shape)

(4717, 1024)


The dataset contains 4717 MOFs with 1024 features.

## read adsorption uptake data

In [3]:
# "dfMLOrigin" and the adsorption data cover different sets of MOFs, so keep only the MOFs present in both
def datasetMatch(MOFName):
    # rows of dfML keep the order of dfMLOrigin; the boolean mask follows the order of MOFName
    dfML = dfMLOrigin[dfMLOrigin['MOF'].isin(MOFName)]
    matchedMOFIndex = np.isin(MOFName, dfML['MOF'].values)
    return matchedMOFIndex, dfML

# read flexibility data
flexibilityList = os.listdir('data/flexibility_data/y_data/adsorption_data')  # list of csv files, one per adsorbate (9 in total)
flexibilityData = []
adsorbantNameList = []

for i, name in enumerate(flexibilityList):
    # read csv files for certain adsorption uptakes
    df = pd.read_csv('data/flexibility_data/y_data/adsorption_data/' + name)
    
    # obtain the rigid value
    rigidValue = np.array(df[df.columns[1]], dtype = float)
    
    # obtain the flexible mean value
    flexValue = np.mean(np.array(df[df.columns[2:]],dtype=float),axis=1)
    
    # obtain the adsorbate label
    label = np.array([name.split("_")[1] for x in range(0,len(flexValue))],dtype=str)
    adsorbantNameList.append(name.split("_")[1])
    
    # stack the rigid value, flexible mean value and the adsorbate label
    singleSet = np.column_stack([rigidValue,flexValue,label])

    if i == 0:
        # obtain the name list of MOFs
        MOFNameTemp = np.array(df[df.columns[0]], dtype = str)
        MOFName = [x.split("_")[0] for x in MOFNameTemp]
        
        # search the MOF name in "dfMLOrigin", generating dfML
        matchedMOFIndex, dfML = datasetMatch(MOFName)
        print("The number of MOFs shared by the two datasets is: {:d}.\n".format(dfML.shape[0]))
        
        # generating flexibilityData as "y"
        flexibilityData = singleSet[matchedMOFIndex,:].copy()
    else:
        # concatenate "y"
        flexibilityData = np.concatenate([flexibilityData, singleSet[matchedMOFIndex,:]])

flexibilityData

The number of MOFs shared by the two datasets is: 98.



array([['2.842392813', '2.7255026209065', 'butane'],
       ['2.086272931', '3.1063526610226666', 'butane'],
       ['1.472946691', '2.0552884249058887', 'butane'],
       ...,
       ['14.15219967', '13.821848680833336', 'xenon'],
       ['6.662372437', '6.378310010356667', 'xenon'],
       ['6.65909719', '6.326572230466223', 'xenon']], dtype='<U32')

In [4]:
print(flexibilityData.shape)

(882, 3)


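`flexibilityData` contains the adsorption uptake data for the (MOF, adsorbate) pairs. With 98 MOFs and 9 adsorbates, there are $98 \times 9 = 882$ data points in total. Every entry is stored as a string (`dtype='<U32'`) because the numeric values are stacked together with the string label. The columns are: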
- 1st column: rigid data
- 2nd column: flexible mean data
- 3rd column: adsorbate label
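
The quoted numbers in the output above come from NumPy's type promotion: stacking floats with a string label forces one common string dtype. The following is a minimal sketch of that effect and of one way to keep the numeric columns numeric. It uses two values copied from the output above; the column names `rigid`, `flex_mean` and `adsorbate` are illustrative assumptions, not names used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Two rows copied from the printed array above.
rigid = np.array([2.842392813, 2.086272931])
flex_mean = np.array([2.7255026209065, 3.1063526610226666])
label = np.array(["butane", "butane"])

# Mixing floats and strings in one NumPy array promotes every entry to a string.
stacked = np.column_stack([rigid, flex_mean, label])
print(stacked.dtype.kind)  # 'U' means unicode string

# A DataFrame stores each column with its own dtype (illustrative column names).
df_y = pd.DataFrame({"rigid": rigid, "flex_mean": flex_mean, "adsorbate": label})
print(df_y.dtypes)

# The numeric columns of the stacked array can be converted back explicitly.
recovered = stacked[:, :2].astype(float)
```

With the array layout used in this notebook, the explicit `astype(float)` conversion is needed before the values can be used as features or targets.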

The rows of `flexibilityData` are ordered adsorbate by adsorbate, with the same MOF order inside each block:

| MOF | adsorbate |
|------|------------|
| MOF1 | adsorbate1 |
| MOF2 | adsorbate1 |
| MOF3 | adsorbate1 |
| ...  | ...        |
| MOF98 | adsorbate1 |
| MOF1 | adsorbate2 |
| MOF2 | adsorbate2 |
| MOF3 | adsorbate2 |
| ...  | ...        |
| MOF98 | adsorbate2 |
| MOF1 | adsorbate3 |
| MOF2 | adsorbate3 |
| MOF3 | adsorbate3 |
| ...  | ...        |
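
This layout is later paired with the MOF descriptors by row position, so it only works if the matched MOFs appear in the same order in the adsorption files and in `dfML`. In `datasetMatch`, the boolean mask follows the order of the CSV file while `dfML` follows the order of `dfMLOrigin`, and nothing checks that the two agree. The sketch below shows one possible check and a reordering step. All names and values in it are hypothetical stand-ins, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: MOF order in an adsorption CSV and a descriptor table.
csv_names = ["MOF_B", "MOF_X", "MOF_A", "MOF_C"]
df_desc = pd.DataFrame({"MOF": ["MOF_A", "MOF_B", "MOF_C"], "vf": [0.1, 0.2, 0.3]})

# Row order of the uptake data after masking (CSV order).
mask = np.isin(csv_names, df_desc["MOF"].values)
y_order = np.array(csv_names)[mask]

# Row order of the descriptor data after filtering (descriptor-table order).
x_order = df_desc.loc[df_desc["MOF"].isin(csv_names), "MOF"].to_numpy()

# True only if both datasets list the shared MOFs in the same order.
orders_agree = bool((y_order == x_order).all())
print(orders_agree)

# If they differ, reorder the descriptor rows to follow the CSV order.
df_aligned = df_desc.set_index("MOF").loc[y_order].reset_index()
print(df_aligned)
```

If the check passes on the real data, no reordering is needed; if it fails, positional concatenation would pair descriptors with the wrong uptakes.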

## manually add adsorbate data

In [6]:
# manually add adsorbate descriptors

# Mw/gr.mol-1, Tc/K, Pc/bar, ω, Tb/K, Tf/K

adsorbateData=np.array([
    ['xenon',131.293,289.7,58.4,0.008,164.87,161.2], 
    ['butane',58.1,449.8,39.5,0.3,280.1,146.7], 
    ['propene',42.1,436.9,51.7,0.2,254.8,150.6], 
    ['ethane',30.1,381.8,50.3,0.2,184.0,126.2], 
    ['propane',44.1,416.5,44.6,0.2,230.1,136.5], 
    ['CO2',44.0,295.9,71.8,0.2,317.4,204.9], 
    ['ethene',28.054,282.5,51.2,0.089,169.3,228], 
    ['methane',16.04,190.4,46.0,0.011,111.5,91],
    ['krypton',83.798,209.4,55.0,0.005,119.6,115.6]])

adsorbateData.shape
adDf = pd.DataFrame(data=adsorbateData, columns=["adsorbant", "Mw/gr.mol-1", "Tc/K", "Pc/bar", "ω", "Tb/K", "Tf/K"])

# sort the dataframe based on adsorbantNameList
sorterIndex = dict(zip(adsorbantNameList,range(len(adsorbantNameList))))
adDf['an_Rank'] = adDf['adsorbant'].map(sorterIndex)
adDf.sort_values(['an_Rank'],ascending = [True], inplace = True)
adDf.drop(columns='an_Rank', inplace = True)  # the positional axis argument was removed in pandas 2.0
adDf

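|   | adsorbant | Mw/gr.mol-1 | Tc/K | Pc/bar | ω | Tb/K | Tf/K |
|---|-----------|-------------|------|--------|---|------|------|
| 1 | butane | 58.1 | 449.8 | 39.5 | 0.3 | 280.1 | 146.7 |
| 5 | CO2 | 44.0 | 295.9 | 71.8 | 0.2 | 317.4 | 204.9 |
| 3 | ethane | 30.1 | 381.8 | 50.3 | 0.2 | 184.0 | 126.2 |
| 6 | ethene | 28.054 | 282.5 | 51.2 | 0.089 | 169.3 | 228.0 |
| 8 | krypton | 83.798 | 209.4 | 55.0 | 0.005 | 119.6 | 115.6 |
| 7 | methane | 16.04 | 190.4 | 46.0 | 0.011 | 111.5 | 91.0 |
| 4 | propane | 44.1 | 416.5 | 44.6 | 0.2 | 230.1 | 136.5 |
| 2 | propene | 42.1 | 436.9 | 51.7 | 0.2 | 254.8 | 150.6 |
| 0 | xenon | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 |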


In [7]:
print(adDf.shape)

(9, 7)


Each adsorbate has 7 columns: the name label and 6 physical descriptors.
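
Building the table from a single NumPy array that mixes names and numbers stores every value as a string. As an alternative sketch, the table can be built from a list of tuples so that the numeric columns stay numeric, and the rows can be reordered by label lookup instead of a temporary rank column. The three rows are copied from the cell above; `order` is a hypothetical stand-in for `adsorbantNameList`.

```python
import pandas as pd

columns = ["adsorbant", "Mw/gr.mol-1", "Tc/K", "Pc/bar", "ω", "Tb/K", "Tf/K"]

# Three rows copied from the cell above; a list of tuples keeps numbers numeric.
rows = [
    ("xenon", 131.293, 289.7, 58.4, 0.008, 164.87, 161.2),
    ("butane", 58.1, 449.8, 39.5, 0.3, 280.1, 146.7),
    ("methane", 16.04, 190.4, 46.0, 0.011, 111.5, 91),
]
ad_df = pd.DataFrame(rows, columns=columns)

# Hypothetical stand-in for adsorbantNameList (the order of the uptake files).
order = ["butane", "methane", "xenon"]

# Reorder the rows by label lookup; this also resets the row index.
ad_df = ad_df.set_index("adsorbant").loc[order].reset_index()
print(ad_df.dtypes)
```

Unlike the rank-column approach, `.loc[order]` raises a `KeyError` when a name in `order` has no row in the table, which surfaces spelling mismatches early.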

## combine MOF and adsorbate descriptors
The combined dataset should have $1024+7=1031$ columns:

In [8]:
# replicate dfML once per adsorbate (9 times)
dfMLReplicate = pd.concat([dfML]*len(adDf))

# repeat each row of adDf once per MOF (98 times)
adDfReplicate = pd.DataFrame(np.repeat(adDf.values,len(dfML),axis=0))
adDfReplicate.columns = adDf.columns

# concatenate two datasets
dfMLReplicate.reset_index(drop=True, inplace=True)
adDfReplicate.reset_index(drop=True, inplace=True)
XAllDescriptor = pd.concat([dfMLReplicate, adDfReplicate],axis=1)
XAllDescriptor

|   | ID | MOF | Periodic Chemical Formula | ρ(g.cm-3) | PLD (Å) | LCD (Å) | VSA (m2/cm3) | GSA (m2/g) | vf | vp (cm3/g) | ... | F01[Ne-Te] | F01[Ne-I] | F01[Ne-Xe] | adsorbant | Mw/gr.mol-1 | Tc/K | Pc/bar | ω | Tb/K | Tf/K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | ABUWOJ | H7C12O7Zn2 | 1.158330 | 4.03039 | 5.07969 | 1007.55 | 869.832 | 0.545974 | 0.532253 | ... | 0 | 0 | 0 | butane | 58.1 | 449.8 | 39.5 | 0.3 | 280.1 | 146.7 |
| 1 | 25 | ACOLIP | H19C22N5O4Zn | 1.049490 | 3.57647 | 4.91034 | 0.00 | 0.000 | 0.454051 | 0.521040 | ... | 0 | 0 | 0 | butane | 58.1 | 449.8 | 39.5 | 0.3 | 280.1 | 146.7 |
| 2 | 78 | AGARUW | H4C11N4O11La2 | 1.771950 | 6.25183 | 6.77693 | 1091.16 | 615.795 | 0.450504 | 0.291150 | ... | 0 | 0 | 0 | butane | 58.1 | 449.8 | 39.5 | 0.3 | 280.1 | 146.7 |
| 3 | 84 | AHOKIR01 | H4C2O3PCu | 1.927010 | 3.46842 | 4.30724 | 0.00 | 0.000 | 0.460404 | 0.251926 | ... | 0 | 0 | 0 | butane | 58.1 | 449.8 | 39.5 | 0.3 | 280.1 | 146.7 |
| 4 | 124 | AMILUE | H12C15N2O4Zn | 0.982365 | 11.07263 | 11.39418 | 1097.04 | 1116.740 | 0.566397 | 0.622654 | ... | 0 | 0 | 0 | butane | 58.1 | 449.8 | 39.5 | 0.3 | 280.1 | 146.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 877 | 1515 | HEGJUZ | H32C52N2O12Zn3 | 1.090320 | 4.73915 | 6.22529 | 1064.09 | 975.945 | 0.485434 | 0.508728 | ... | 0 | 0 | 0 | xenon | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 |
| 878 | 1536 | HICVUM | H9C19N7O6Cd2 | 1.387310 | 4.18424 | 5.70734 | 1255.07 | 904.679 | 0.545750 | 0.404764 | ... | 0 | 0 | 0 | xenon | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 |
| 879 | 1541 | HIFTOG | H12C24O13Zn4 | 1.165820 | 7.94194 | 15.08334 | 2197.02 | 3760.470 | 0.567548 | 0.536198 | ... | 0 | 0 | 0 | xenon | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 |
| 880 | 1542 | HIFTOG01 | H12C24O13Zn4 | 0.584242 | 4.15104 | 7.88297 | 1315.57 | 1136.140 | 0.826706 | 1.384750 | ... | 0 | 0 | 0 | xenon | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 |
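
The replicate-and-concatenate step above is a Cartesian product of the adsorbate table and the MOF table. As a sketch of an alternative, assuming pandas 1.2 or newer, `DataFrame.merge` with `how="cross"` builds the same kind of product without hard-coded repeat counts. The two small frames below are illustrative; only a couple of values are copied from the tables above.

```python
import pandas as pd

# Small illustrative frames with values taken from the tables above.
mof_df = pd.DataFrame({"MOF": ["ABUWOJ", "ACOLIP"], "vf": [0.545974, 0.454051]})
ad_df = pd.DataFrame({"adsorbant": ["butane", "xenon"], "Tc/K": [449.8, 289.7]})

# Cross join with the adsorbate table on the left: all MOFs for the first
# adsorbate, then all MOFs for the second, matching the row layout used here.
combined = ad_df.merge(mof_df, how="cross")

# Put the MOF columns first, as in XAllDescriptor.
combined = combined[list(mof_df.columns) + list(ad_df.columns)]
print(combined)
```

A cross join also keeps the original column dtypes, whereas `np.repeat(adDf.values, ...)` goes through a plain NumPy array.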


Different subsets of the MOF and adsorbate descriptors can be selected:

In [9]:
X_temp = XAllDescriptor[['ρ(g.cm-3)', 'PLD (Å)', 'LCD (Å)', 'VSA  (m2/cm3)', 'GSA (m2/g)', 'vf', 'vp (cm3/g)', "Mw/gr.mol-1", "Tc/K", "Pc/bar", "ω", "Tb/K", "Tf/K"]].values
X_temp

array([[1.15833, 4.03039, 5.07969, ..., '0.3', '280.1', '146.7'],
       [1.04949, 3.57647, 4.91034, ..., '0.3', '280.1', '146.7'],
       [1.77195, 6.25183, 6.77693, ..., '0.3', '280.1', '146.7'],
       ...,
       [1.16582, 7.94194, 15.08334, ..., '0.008', '164.87', '161.2'],
       [0.584242, 4.15104, 7.88297, ..., '0.008', '164.87', '161.2'],
       [1.15793, 4.14066, 7.95891, ..., '0.008', '164.87', '161.2']],
      dtype=object)

In [10]:
print(X_temp.shape)

(882, 13)


## generate X and y
The rigid uptake can be appended to X as an additional feature:

In [11]:
# np.float was removed in NumPy 1.24; the builtin float is equivalent
X = np.concatenate((X_temp, flexibilityData[:, 0].reshape(-1, 1)),axis=1).astype(float)
print(X.shape)

(882, 14)


The flexible mean data is chosen as y:

In [12]:
# flexibilityData stores strings, so convert the target to float
y = flexibilityData[:, 1].astype(float)
print(y.shape)

(882,)


# Classification/regression

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
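
Each MOF appears 9 times in `X`, once per adsorbate, so a plain random split can place the same MOF in both the training and the test set. Whether that matters depends on the question being asked; if the model should be evaluated on unseen MOFs, a group-aware split is one option. The sketch below uses scikit-learn's `GroupShuffleSplit` on placeholder random data with the same shape as `X` and `y`. The `groups` array assumes the row layout described earlier (all MOFs for adsorbate 1, then all MOFs for adsorbate 2, and so on), and `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_mofs, n_adsorbates = 98, 9

# Placeholder data with the same shape as X and y in this notebook.
rng = np.random.default_rng(0)
X = rng.normal(size=(n_mofs * n_adsorbates, 14))
y = rng.normal(size=n_mofs * n_adsorbates)

# One group id per row: MOF 0..97 repeated for each adsorbate block.
groups = np.tile(np.arange(n_mofs), n_adsorbates)

# Hold out about 25% of the MOFs (not 25% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# MOFs present on both sides of the split; expected to be empty.
shared = set(groups[train_idx]) & set(groups[test_idx])
print(len(shared))
```

Passing a fixed `random_state` also makes the split reproducible between runs, which the plain `train_test_split` call above does not guarantee.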