# Data preprocessing

The dataset is composed of three parts:
1. descriptors of **MOFs** (under `data/ML_data`)
2. descriptors of **adsorbates** (manually added below)
3. **adsorption uptakes** of **(MOF, adsorbate) pairs** (under `data/flexibility_data/y_data/adsorption_data`), containing two values:
    1. values from rigid model
    2. mean values from flexible model

## read MOF data

In [1]:
import pandas as pd
import numpy as np
import os

# read the 36-descriptor data
df36Descriptor = pd.read_excel('data/ML_data/descriptor_used.xlsx',header=4,index_col=1)

In [2]:
# obtain the descriptor list
columns = [df36Descriptor.columns[1]] + df36Descriptor.columns[3: -11].tolist()
print(columns)

['MOF', 'ρ(g.cm-3)', 'PLD (Å)', 'vf', 'vp (cm3.g-1)', 'V (A3)', 'nAT-H', 'nNM', 'nM', 'nTB', 'nSB', 'nMB', 'nRB', 'nR6', 'nTrM', 'nDB', 'nAcyclB', 'nR8', 'nAlkylC', 'nVinylC', 'nEnamineAN', 'nOHEPh', 'nR5', 'nR4', 'MType', 'MaxMVal', 'n-O-', 'F01[H-C]', 'F01[C-N]', 'F01[C-O]', 'F02[H-C]', 'F02[C-N]', 'F02[C-O]']


In [3]:
# clean up columns
newColumns = {}
for ci in columns:
    if ' ' in ci:
        newColumns[ci] = ci.split(' ',1)[0]
    elif '(' in ci:
        newColumns[ci] = ci.split('(',1)[0]
    else:
        newColumns[ci] = ci
        
print(newColumns)

{'MOF': 'MOF', 'ρ(g.cm-3)': 'ρ', 'PLD (Å)': 'PLD', 'vf': 'vf', 'vp (cm3.g-1)': 'vp', 'V (A3)': 'V', 'nAT-H': 'nAT-H', 'nNM': 'nNM', 'nM': 'nM', 'nTB': 'nTB', 'nSB': 'nSB', 'nMB': 'nMB', 'nRB': 'nRB', 'nR6': 'nR6', 'nTrM': 'nTrM', 'nDB': 'nDB', 'nAcyclB': 'nAcyclB', 'nR8': 'nR8', 'nAlkylC': 'nAlkylC', 'nVinylC': 'nVinylC', 'nEnamineAN': 'nEnamineAN', 'nOHEPh': 'nOHEPh', 'nR5': 'nR5', 'nR4': 'nR4', 'MType': 'MType', 'MaxMVal': 'MaxMVal', 'n-O-': 'n-O-', 'F01[H-C]': 'F01[H-C]', 'F01[C-N]': 'F01[C-N]', 'F01[C-O]': 'F01[C-O]', 'F02[H-C]': 'F02[H-C]', 'F02[C-N]': 'F02[C-N]', 'F02[C-O]': 'F02[C-O]'}
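
The cleanup rule above keeps the part of each name before the first space or opening parenthesis. As a minimal sketch, the same rule can be written with a regular expression. `short_name` is a hypothetical helper that is not part of this notebook, and it is only checked here against a few of the names printed above:

```python
import re

def short_name(column):
    """Keep the text before the first space or '(' (hypothetical helper)."""
    return re.split(r"[ (]", column, maxsplit=1)[0]

# A few of the descriptor names printed above
sample = ['MOF', 'ρ(g.cm-3)', 'PLD (Å)', 'vp (cm3.g-1)', 'F01[H-C]']
renamed = {name: short_name(name) for name in sample}
print(renamed)
```

Names without a unit suffix, such as `F01[H-C]`, pass through unchanged.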


In [4]:
# read ML data
#dfMLOrigin = pd.read_excel('data/ML_data/descriptor_4717MOF.xlsx')

The dataset contains 4717 MOFs with 1024 features.

In [5]:
dfShortNames = df36Descriptor[columns].rename(columns=newColumns)

The reduced dataset contains 4717 MOFs with 32 features (excluding the first column, `MOF`).

## read adsorption uptake data

In [6]:
# the MOFs in "dfShortNames" and in the adsorption data sets differ, so the two datasets have to be matched by MOF name
def datasetMatch(MOFName):
    dfML= dfShortNames[dfShortNames['MOF'].isin(MOFName)].drop_duplicates()
    matchedMOFIndex=np.isin(MOFName, dfML['MOF'].values)
    return matchedMOFIndex, dfML

# read flexibility data
flexibilityList=os.listdir('data/flexibility_data/y_data/adsorption_data') # obtain list of csv files for 9 adsorption uptakes
flexibilityData=[]
adsorbantNameList = []

for i, name in enumerate(flexibilityList):
    # read csv files for certain adsorption uptakes
    df = pd.read_csv('data/flexibility_data/y_data/adsorption_data/' + name)
    
    # obtain the rigid value
    rigidValue = np.array(df[df.columns[1]], dtype = float)
    
    # obtain the flexible mean value
    flexValue = np.mean(np.array(df[df.columns[2:]],dtype=float),axis=1)
    
    # obtain the adsorbate label
    label = np.array([name.split("_")[1] for x in range(0,len(flexValue))],dtype=str)
    adsorbantNameList.append(name.split("_")[1])
    
    # stack the rigid value, flexible mean value and the adsorbate label
    singleSet = np.column_stack([rigidValue,flexValue,label])

    if i == 0:
        # obtain the name list of MOFs
        MOFNameTemp = np.array(df[df.columns[0]], dtype = str)
        MOFName = [x.split("_")[0] for x in MOFNameTemp]
        
        # search the MOF names in "dfShortNames", generating dfML
        matchedMOFIndex, dfML = datasetMatch(MOFName)
        print("The number of MOFs shared by two datasets are: {:d}.\n".format(dfML.shape[0]))
        
        # generating flexibilityData as "y"
        flexibilityData = singleSet[matchedMOFIndex,:].copy()
    else:
        # concatenate "y"
        flexibilityData = np.concatenate([flexibilityData.copy(),singleSet[matchedMOFIndex,:].copy()])

flexibilityData

The number of MOFs shared by two datasets are: 89.



array([['5.395625473', '5.016580520775', 'xenon'],
       ['5.788896266', '4.817801024044944', 'xenon'],
       ['2.323461698', '3.3505466442417773', 'xenon'],
       ...,
       ['6.718467229', '6.4977646543946666', 'krypton'],
       ['5.026615122', '4.6271537747455', 'krypton'],
       ['5.028696345', '4.590551175359306', 'krypton']], dtype='<U32')

In [7]:
print(flexibilityData.shape)

(801, 3)
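
The matching step can be illustrated on toy data. This is a minimal sketch with made-up MOF names and values, not the notebook's data. It shows how `isin` selects descriptor rows and how `np.isin` builds the boolean mask applied to the uptake rows:

```python
import numpy as np
import pandas as pd

# Toy descriptor table (hypothetical MOF names and values)
descriptors = pd.DataFrame({"MOF": ["A", "B", "C", "D"], "vf": [0.1, 0.2, 0.3, 0.4]})

# Toy uptake data for one adsorbate; "E" has no descriptors
uptake_names = ["B", "E", "D"]
uptake_values = np.array([1.0, 2.0, 3.0])

# Descriptor rows whose MOF also appears in the uptake data
matched_descriptors = descriptors[descriptors["MOF"].isin(uptake_names)]

# Boolean mask over the uptake rows that have descriptors
mask = np.isin(uptake_names, matched_descriptors["MOF"].values)
print(mask.tolist())
print(uptake_values[mask].tolist())

# Each selection keeps the row order of its own source, so confirm they line up
kept_names = [n for n, keep in zip(uptake_names, mask) if keep]
print(kept_names == matched_descriptors["MOF"].tolist())
```

The last check matters because the descriptor rows and the uptake rows are later combined by position rather than by MOF name.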


flexibilityData contains the adsorption uptake data for (MOF, adsorbate) pairs. There are 89 MOFs and 9 adsorbates, so there are 801 data points in total.
- 1st column: rigid data
- 2nd column: flexible mean data
- 3rd column: adsorbate label

The order of the flexibilityData is:

| MOF | adsorbate |
|------|------------|
| MOF1 | adsorbate1 |
| MOF2 | adsorbate1 |
| MOF3 | adsorbate1 |
| ...  | ...        |
| MOF89 | adsorbate1 |
| MOF1 | adsorbate2 |
| MOF2 | adsorbate2 |
| MOF3 | adsorbate2 |
| ...  | ...        |
| MOF89 | adsorbate2 |
| MOF1 | adsorbate3 |
| MOF2 | adsorbate3 |
| MOF3 | adsorbate3 |
| ...  | ...        |
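
The row order in the table (all MOFs for the first adsorbate, then all MOFs for the second, and so on) can be reproduced with `np.tile` and `np.repeat`. The sketch below uses three placeholder MOF names and two adsorbates. `row_index` is a hypothetical helper that maps a (MOF, adsorbate) position to its row:

```python
import numpy as np

mofs = ["MOF1", "MOF2", "MOF3"]      # placeholder names
adsorbates = ["xenon", "butane"]     # subset of the adsorbates, for illustration

# The MOF list cycles fastest; each adsorbate label is held for one full cycle
mof_column = np.tile(mofs, len(adsorbates)).tolist()
adsorbate_column = np.repeat(adsorbates, len(mofs)).tolist()
pairs = list(zip(mof_column, adsorbate_column))

def row_index(mof_position, adsorbate_position, n_mofs):
    """Row of a (MOF, adsorbate) pair in this ordering (hypothetical helper)."""
    return adsorbate_position * n_mofs + mof_position

print(pairs)
print(pairs[row_index(1, 1, len(mofs))])   # second MOF, second adsorbate
```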

## manually add adsorbate data

In [8]:
# manually add adsorbate descriptors

# Mw/gr.mol-1, Tc/K, Pc/bar, ω, Tb/K, Tf/K

adsorbateData=np.array([
    ['xenon',131.293,289.7,58.4,0.008,164.87,161.2], 
    ['butane',58.1,449.8,39.5,0.3,280.1,146.7], 
    ['propene',42.1,436.9,51.7,0.2,254.8,150.6], 
    ['ethane',30.1,381.8,50.3,0.2,184.0,126.2], 
    ['propane',44.1,416.5,44.6,0.2,230.1,136.5], 
    ['CO2',44.0,295.9,71.8,0.2,317.4,204.9], 
    ['ethene',28.054,282.5,51.2,0.089,169.3,228], 
    ['methane',16.04,190.4,46.0,0.011,111.5,91],
    ['krypton',83.798,209.4,55.0,0.005,119.6,115.6]])

adsorbateData.shape
adDf = pd.DataFrame(data=adsorbateData, columns=["adsorbant", "Mw/gr.mol-1", "Tc/K", "Pc/bar", "ω", "Tb/K", "Tf/K"])

# sort the dataframe based on adsorbantNameList
sorterIndex = dict(zip(adsorbantNameList,range(len(adsorbantNameList))))
adDf['an_Rank'] = adDf['adsorbant'].map(sorterIndex)
adDf.sort_values(['an_Rank'],ascending = [True], inplace = True)
adDf.drop(columns='an_Rank', inplace = True)  # positional axis arguments are rejected by pandas 2.x
adDfFloat = adDf.iloc[:, 1:].astype(float)  # np.float was removed in NumPy 1.24
adDfFloat["adsorbant"] = adDf["adsorbant"]
adDfFloat

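|   | Mw/gr.mol-1 | Tc/K | Pc/bar | ω | Tb/K | Tf/K | adsorbant |
|---|-------------|------|--------|---|------|------|-----------|
| 0 | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 | xenon |
| 1 | 58.1 | 449.8 | 39.5 | 0.3 | 280.1 | 146.7 | butane |
| 2 | 42.1 | 436.9 | 51.7 | 0.2 | 254.8 | 150.6 | propene |
| 3 | 30.1 | 381.8 | 50.3 | 0.2 | 184.0 | 126.2 | ethane |
| 4 | 44.1 | 416.5 | 44.6 | 0.2 | 230.1 | 136.5 | propane |
| 5 | 44.0 | 295.9 | 71.8 | 0.2 | 317.4 | 204.9 | CO2 |
| 6 | 28.054 | 282.5 | 51.2 | 0.089 | 169.3 | 228.0 | ethene |
| 7 | 16.04 | 190.4 | 46.0 | 0.011 | 111.5 | 91.0 | methane |
| 8 | 83.798 | 209.4 | 55.0 | 0.005 | 119.6 | 115.6 | krypton |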


In [9]:
print(adDfFloat.shape)

(9, 7)


There are 6 descriptors (excluding the name label) for each adsorbate.

## combine MOF and adsorbate descriptors
The combined dataset should have $32+6=38$ descriptors, plus the `MOF` and `adsorbant` label columns:

In [10]:
# replicate dfML once per adsorbate (9 adsorbates)
dfMLReplicate = pd.concat([dfML]*len(adDfFloat))

# repeat each adDfFloat row once per matched MOF (89 MOFs)
adDfReplicate = pd.DataFrame(np.repeat(adDfFloat.values,len(dfML),axis=0))
adDfReplicate.columns = adDfFloat.columns

# concatenate two datasets
dfMLReplicate.reset_index(drop=True, inplace=True)
adDfReplicate.reset_index(drop=True, inplace=True)
XAllDescriptor = pd.concat([dfMLReplicate, adDfReplicate],axis=1)
print(XAllDescriptor.shape)
XAllDescriptor.head()

(801, 40)


|   | MOF | ρ | PLD | vf | vp | V | nAT-H | nNM | nM | nTB | ... | F02[H-C] | F02[C-N] | F02[C-O] | Mw/gr.mol-1 | Tc/K | Pc/bar | ω | Tb/K | Tf/K | adsorbant |
|---|-----|---|-----|----|----|---|-------|-----|----|-----|-----|----------|----------|----------|-------------|------|--------|---|------|------|-----------|
| 0 | ABUWOJ | 1.15833 | 4.03039 | 0.545974 | 0.532253 | 4354.2656 | 168 | 1.238095 | 0.095238 | 1.571428 | ... | 0.571429 | 0.0 | 0.285714 | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 | xenon |
| 1 | ACOLIP | 1.04949 | 3.57647 | 0.454051 | 0.52104 | 6253.029 | 256 | 1.5625 | 0.03125 | 1.75 | ... | 0.75 | 0.25 | 0.125 | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 | xenon |
| 2 | AGARUW | 1.77195 | 6.25183 | 0.450504 | 0.29115 | 4625.2617 | 224 | 1.071428 | 0.071429 | 1.678572 | ... | 0.214286 | 0.428571 | 0.285714 | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 | xenon |
| 3 | AHOKIR01 | 1.92701 | 3.46842 | 0.460404 | 0.251926 | 2360.6523 | 112 | 1.428572 | 0.142857 | 1.928572 | ... | 0.857143 | 0.0 | 0.428571 | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 | xenon |
| 4 | AMILUE | 0.982365 | 11.07263 | 0.566397 | 0.622654 | 4871.6387 | 176 | 1.5 | 0.045455 | 1.795454 | ... | 0.545455 | 0.0 | 0.181818 | 131.293 | 289.7 | 58.4 | 0.008 | 164.87 | 161.2 | xenon |
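
The replication pattern can be checked on a small example. The sketch below uses toy MOF names and `vf` values (assumptions for illustration) together with the `Tc/K` values of xenon and methane from the adsorbate table. It takes both repeat counts from the table lengths instead of hard-coding them:

```python
import numpy as np
import pandas as pd

# Toy MOF descriptors (hypothetical values) and two adsorbate rows
mof_df = pd.DataFrame({"MOF": ["A", "B", "C"], "vf": [0.1, 0.2, 0.3]})
ads_df = pd.DataFrame({"Tc/K": [289.7, 190.4], "adsorbant": ["xenon", "methane"]})

n_mofs, n_ads = len(mof_df), len(ads_df)

# Stack the whole MOF table once per adsorbate: A, B, C, A, B, C
mof_rep = pd.concat([mof_df] * n_ads, ignore_index=True)

# Repeat each adsorbate row once per MOF: xenon x3, methane x3
ads_rep = pd.DataFrame(np.repeat(ads_df.values, n_mofs, axis=0), columns=ads_df.columns)

combined = pd.concat([mof_rep, ads_rep], axis=1)
print(combined.shape)
print(combined.isna().any().any())
```

`pd.concat` with `axis=1` aligns on the index and fills missing rows with `NaN` rather than raising, so a mismatched repeat count does not change the row count of the result. Checking for `NaN` after the concatenation is a cheap way to catch that case.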


## generate X and y
The rigid uptake data can be added into X:

In [11]:
X = np.concatenate((XAllDescriptor.iloc[:, 1:-1], flexibilityData[:, 0].reshape(-1, 1)),axis=1).astype(float)
print(X.shape)

(801, 39)


The flexible mean data is chosen as y:

In [12]:
y = flexibilityData[:, 1].astype(float)  # flexibilityData is a string array, so convert to float
print(y.shape)

(801,)
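
Because `flexibilityData` was built by stacking numeric columns with a string label column, all of its entries are strings (note the `dtype='<U32'` in the output above). The sketch below reproduces this with the first two xenon rows shown earlier and converts the flexible column back to floats:

```python
import numpy as np

# First two xenon rows from the flexibilityData output above
rigid = np.array([5.395625473, 5.788896266])
flexible = np.array([5.016580520775, 4.817801024044944])
label = np.array(["xenon", "xenon"])

# Mixing float and string columns promotes the whole array to strings
stacked = np.column_stack([rigid, flexible, label])
print(stacked.dtype.kind)

# Convert the numeric column back before using it as a regression target
y_float = stacked[:, 1].astype(float)
print(y_float.dtype)
```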


# Validation set split

In [13]:
# ----------------------------------------------------------------------------------------
# ---------------------------- don't touch the validation set ----------------------------
np.random.seed(0)
from sklearn.model_selection import train_test_split
X_train_test, X_validation, y_train_test, y_validation = train_test_split(X, y, test_size=0.25)
# ---------------------------- don't touch the validation set ----------------------------
# ----------------------------------------------------------------------------------------
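
As a minimal sketch of what this split produces, the example below runs `train_test_split` on placeholder arrays with the same number of rows as `X`. It passes `random_state=0` directly, which is an alternative to the `np.random.seed(0)` call above and not what this notebook does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same row count as X and y (801 rows)
X_demo = np.arange(801 * 2).reshape(801, 2)
y_demo = np.arange(801)

# Two calls with the same random_state give the same partition
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

print(split_a[0].shape, split_a[1].shape)        # (600, 2) (201, 2)
print(np.array_equal(split_a[3], split_b[3]))    # True
```

With 801 rows and `test_size=0.25`, the held-out part is rounded up to 201 rows, leaving 600 for training and testing.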

# Regression