# Machine Learning for CO2 adsorption on porous carbons

# Part 1

## Problem description

Porous carbons are known to adsorb CO2 on their surface. The main features of a porous carbon are its surface area (S BET), mesopore volume (V meso) and micropore volume (V micro). Which of these three factors is responsible for CO2 adsorption is unknown. The CO2 uptake also depends on the temperature (T) at which the measurements were made and on the partial pressure (P).

I will use machine learning to predict the CO2 uptake of porous carbons from their known surface properties (S BET, V micro, V meso) together with the adsorption temperature (T) and pressure (P). For this purpose, I will extract experimental CO2 adsorption data for different porous carbons from the literature.
The surface characteristics and CO2 uptake of slightly more than 1000 porous carbons were reported in the article linked below.

The first step is to extract the data from the PDF file to obtain a training set for the machine learning algorithms.

## Data mining. Table clean-up

The table was downloaded from the supporting information of a published article on CO2 adsorption:

https://onlinelibrary.wiley.com/doi/full/10.1002/anie.201812363

To extract the table data from the PDF file, the tabula-py package was used:

https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302

Tabula is a great tool that makes it easy to mine tables from PDF files. The tables from the Supporting Information file were extracted with "Tabula" in .tsv format and then loaded into a DataFrame using the following code:

In [11]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from pandas import DataFrame

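# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0
# likewise uses the first column ("Entry") as the index.
df = pd.read_csv("tabula-CO2_DNN_SI.tsv", sep="\t", index_col=0)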
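
As a minimal sketch of the loading step, the snippet below parses tab-separated text with `pd.read_csv`. The two-row sample is made up for illustration and only mimics the layout of the Tabula export; it is not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the layout of the Tabula export.
sample_tsv = "Entry\tS BET (m /g)\t2\n1\t\t798.0\n2\t\t241.0\n"

# sep="\t" parses tab-separated values; index_col=0 uses "Entry" as the index.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col=0)

print(sample.columns.tolist())
print(sample.isnull().sum().to_dict())
```

Empty fields are read as NaN, which is why the misplaced columns in the real table show up as long runs of missing values.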

  


Unfortunately, "Tabula" extracted the table from the PDF file in a very reader-unfriendly way: most of the columns are misplaced and the values are shifted. We therefore have to spend some time cleaning the table to make it usable.
Let us look at the column names:

In [12]:
columns_old = df.columns
columns_old

Index(['S BET (m /g)', '2', 'Unnamed: 3', 'V total (cm /g)', '3', 'Unnamed: 6',
       'Vmicro (cm /g) 3', 'Unnamed: 8', 'T(oC)', 'P(bar)',
       'CO2 uptake (mmol/g)'],
      dtype='object')

Since the table's layout is completely off, we collect the names of the columns we are interested in:

In [13]:
column_names = [columns_old[0],columns_old[3],columns_old[6], columns_old[8], columns_old[9], columns_old[10]]
column_names

['S BET (m /g)',
 'V total (cm /g)',
 'Vmicro (cm /g) 3',
 'T(oC)',
 'P(bar)',
 'CO2 uptake (mmol/g)']

In [14]:
df.head(14)

| Entry | S BET (m /g) | 2 | Unnamed: 3 | V total (cm /g) | 3 | Unnamed: 6 | Vmicro (cm /g) 3 | Unnamed: 8 | T(oC) | P(bar) | CO2 uptake (mmol/g) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | 798.0 | NaN | NaN | 0.18 | NaN | 0.42 | NaN | 25.0 | 5.0 | 2.0 |
| 2 | NaN | 241.0 | NaN | NaN | 1.01 | NaN | 0.09 | NaN | 25.0 | 5.0 | 0.7 |
| 3 | NaN | 448.0 | NaN | NaN | 1.09 | NaN | 0.21 | NaN | 25.0 | 5.0 | 0.9 |
| 4 | NaN | 826.0 | NaN | NaN | 0.91 | NaN | 0.39 | NaN | 25.0 | 5.0 | 1.8 |
| 5 | NaN | 895.0 | NaN | NaN | 0.9 | NaN | 0.4 | NaN | 25.0 | 5.0 | 2.8 |
| 6 | NaN | 862.0 | NaN | NaN | 0.91 | NaN | 0.39 | NaN | 25.0 | 5.0 | 2.1 |
| 7 | NaN | 678.0 | NaN | NaN | 1.1 | NaN | 0.3 | NaN | 25.0 | 5.0 | 2.3 |
| 8 | NaN | 304.0 | NaN | NaN | 1.06 | NaN | 0.14 | NaN | 25.0 | 5.0 | 0.9 |
| 9 | NaN | 500.0 | NaN | NaN | 0.77 | NaN | 0.23 | NaN | 25.0 | 5.0 | 2.2 |
| 10 | 931.0 | NaN | NaN | 1.01 | NaN | 0.39 | NaN | 25.0 | 5.0 | 3.0 | NaN |


For the second column, V total, which is labelled "2", the data are shifted in rows 1 to 39.
The data for this column are found in the column denoted "V total".
We have to copy the data for rows 1-39 from column "V total" to column "2".

In [15]:
df['S BET (m /g)'] = df['S BET (m /g)'].fillna(df['2'])
df['V total (cm /g)'] = df['V total (cm /g)'].fillna(df['3'])

df['2'][0:37] = df['V total (cm /g)'][0:37]  
df['2'] = df['2'].fillna(0.138)

# Columns 1 and 2 are now fixed. Moving on to the third column.
# The beginning of column 3 is located in the column "V micro" and the rest is in the
# column "Unnamed: 3". Let us check how many data points are missing in "Unnamed: 3".
number_withNaNs = len(df['Unnamed: 3'])  # all data points, including NaNs
number_values_noNaN = df['Unnamed: 3'].count()  # count() skips NaNs
diff = number_withNaNs - number_values_noNaN
diff

38

OK, so there are 38 missing values, and they are at the beginning of the table.
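
The missing-value count can also be obtained directly. The sketch below uses a made-up four-value column (not data from the article) to show how `len`, `count()` and `isnull().sum()` relate.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries at the start.
col = pd.Series([np.nan, np.nan, 0.39, 0.41])

total_rows = len(col)          # all rows, NaN included
non_missing = col.count()      # count() skips NaN
missing = col.isnull().sum()   # each True in the mask counts as 1

print(total_rows, non_missing, missing)
```

Note that `col.isnull().count()` also returns the total number of rows, because the boolean mask itself contains no NaN.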

In [16]:
df[0:50].head()

| Entry | S BET (m /g) | 2 | Unnamed: 3 | V total (cm /g) | 3 | Unnamed: 6 | Vmicro (cm /g) 3 | Unnamed: 8 | T(oC) | P(bar) | CO2 uptake (mmol/g) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 798.0 | 0.18 | NaN | 0.18 | 0.18 | NaN | 0.42 | NaN | 25.0 | 5.0 | 2.0 |
| 2 | 241.0 | 1.01 | NaN | 1.01 | 1.01 | NaN | 0.09 | NaN | 25.0 | 5.0 | 0.7 |
| 3 | 448.0 | 1.09 | NaN | 1.09 | 1.09 | NaN | 0.21 | NaN | 25.0 | 5.0 | 0.9 |
| 4 | 826.0 | 0.91 | NaN | 0.91 | 0.91 | NaN | 0.39 | NaN | 25.0 | 5.0 | 1.8 |
| 5 | 895.0 | 0.9 | NaN | 0.9 | 0.9 | NaN | 0.4 | NaN | 25.0 | 5.0 | 2.8 |
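
The `fillna` calls used earlier fill the gaps in one column with the values found in the same rows of another column. A minimal sketch on a made-up frame (the column names `x` and `x_shifted` are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame: "x" is missing wherever the value was shifted into "x_shifted".
toy = pd.DataFrame({"x": [np.nan, np.nan, 931.0],
                    "x_shifted": [798.0, 241.0, np.nan]})

# fillna with a Series fills each gap from the same row of the other column
# and leaves existing values untouched.
toy["x"] = toy["x"].fillna(toy["x_shifted"])
print(toy["x"].tolist())
```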


In [17]:
df['Unnamed: 3'][0:9] = df['Vmicro (cm /g) 3'][0:9]
df[:50]

| Entry | S BET (m /g) | 2 | Unnamed: 3 | V total (cm /g) | 3 | Unnamed: 6 | Vmicro (cm /g) 3 | Unnamed: 8 | T(oC) | P(bar) | CO2 uptake (mmol/g) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 798.0 | 0.18 | 0.42 | 0.18 | 0.18 | NaN | 0.42 | NaN | 25.0 | 5.0 | 2.0 |
| 2 | 241.0 | 1.01 | 0.09 | 1.01 | 1.01 | NaN | 0.09 | NaN | 25.0 | 5.0 | 0.7 |
| 3 | 448.0 | 1.09 | 0.21 | 1.09 | 1.09 | NaN | 0.21 | NaN | 25.0 | 5.0 | 0.9 |
| 4 | 826.0 | 0.91 | 0.39 | 0.91 | 0.91 | NaN | 0.39 | NaN | 25.0 | 5.0 | 1.8 |
| 5 | 895.0 | 0.9 | 0.4 | 0.9 | 0.9 | NaN | 0.4 | NaN | 25.0 | 5.0 | 2.8 |
| 6 | 862.0 | 0.91 | 0.39 | 0.91 | 0.91 | NaN | 0.39 | NaN | 25.0 | 5.0 | 2.1 |
| 7 | 678.0 | 1.1 | 0.3 | 1.1 | 1.1 | NaN | 0.3 | NaN | 25.0 | 5.0 | 2.3 |
| 8 | 304.0 | 1.06 | 0.14 | 1.06 | 1.06 | NaN | 0.14 | NaN | 25.0 | 5.0 | 0.9 |
| 9 | 500.0 | 0.77 | 0.23 | 0.77 | 0.77 | NaN | 0.23 | NaN | 25.0 | 5.0 | 2.2 |
| 10 | 931.0 | 1.01 | NaN | 1.01 | NaN | 0.39 | NaN | 25.0 | 5.0 | 3.0 | NaN |


In [19]:
# some values are moved from one column to another
df['Unnamed: 3'][9:37] = df['Unnamed: 6'][9:37]
df['Unnamed: 3'][37:40] = df['V total (cm /g)'][37:40]

df['V total (cm /g)'][0:37] = df['Unnamed: 8'][0:37]
df['V total (cm /g)'][0:9] = df['T(oC)'][0:9]
df['V total (cm /g)'][37:40] = df['Vmicro (cm /g) 3'][37:40]

df['P(bar)'][10:21] = df['T(oC)'][10:21] 
df['P(bar)'][9] = 5
df['P(bar)'][21:37] = df['T(oC)'][21:37]
df['P(bar)'][37:39] = df['Unnamed: 8'][37:39]
df['P(bar)'][39] = 1
df['P(bar)'][41:] = df['3'][41:]
df['P(bar)'][40] = 10
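
The assignments above use chained indexing (`df[col][a:b] = ...`), which pandas may apply to a temporary copy and which newer pandas versions warn about. A position-based alternative can be sketched with `.iloc`; the helper `move_block` and the toy frame are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd

def move_block(frame, src, dst, start, stop):
    """Copy rows start..stop-1 (by position) from column src into column dst."""
    # to_numpy() drops the index, so values are placed purely by position.
    values = frame[src].iloc[start:stop].to_numpy()
    frame.iloc[start:stop, frame.columns.get_loc(dst)] = values
    return frame

# Made-up frame: the first two values of "b" landed in column "a".
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [np.nan, np.nan, 3.0]},
                   index=["1", "2", "3"])
toy = move_block(toy, src="a", dst="b", start=0, stop=2)
print(toy["b"].tolist())
```

Working by position mirrors the slices used in this notebook and does not depend on the index labels being clean or unique.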

In [20]:
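# The last column to fix is CO2 uptake. Reload the raw table so that the
# original, unmodified values can be read from it.
# DataFrame.from_csv was removed in pandas 1.0; read_csv is the replacement.
df_new = pd.read_csv("tabula-CO2_DNN_SI.tsv", sep="\t", index_col=0)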

  


In [21]:
first_part = df_new['P(bar)'][9:37]
second_part = df_new['T(oC)'][37:39]
third_part = 2.47
fourth_part = df_new['Unnamed: 6'][40:]

df['CO2 uptake (mmol/g)'][9:37] = first_part
df['CO2 uptake (mmol/g)'][37:39] = second_part
df['CO2 uptake (mmol/g)'][40:] = fourth_part
df['CO2 uptake (mmol/g)'][39] = 2.47

Let us check for the missing values. 

In [22]:
df.isnull().sum().sort_values(ascending=False)

Vmicro (cm /g) 3       1043
Unnamed: 8             1024
T(oC)                  1016
3                        29
Unnamed: 6               12
CO2 uptake (mmol/g)       1
P(bar)                    0
V total (cm /g)           0
Unnamed: 3                0
2                         0
S BET (m /g)              0
dtype: int64

In [23]:
to_drop = ["Vmicro (cm /g) 3","Unnamed: 8",'T(oC)', '3', 'Unnamed: 6']
new_labels = ['S BET(m2/g)', 'Vtotal (cm3/g)','Vmicro (cm3/g)','T(oC)', 'P(bar)', 'CO2uptake (mmol/g)']
df = df.drop(to_drop, axis=1)
df.columns = new_labels
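
Assigning `df.columns` relies on the remaining columns being in exactly the expected order. As a hedged alternative sketch, `rename` with an explicit mapping ties each new label to one old label. The one-row frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame with one helper column to discard.
toy = pd.DataFrame({"S BET (m /g)": [798.0], "Unnamed: 8": [None], "P(bar)": [5.0]})

# drop(columns=...) removes helper columns by name.
toy = toy.drop(columns=["Unnamed: 8"])
# rename with a mapping changes only the listed labels,
# so a different column order cannot silently mislabel the data.
toy = toy.rename(columns={"S BET (m /g)": "S BET(m2/g)"})
print(toy.columns.tolist())
```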

In [24]:
df.index

Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       ...
       '1011', '1012', '1013', '1014', '1015  346', '1016', '1017', '1018',
       '1019', '1020'],
      dtype='object', name='Entry', length=1055)

In [25]:
df.index[1049]

'1015  346'

In [26]:
df['CO2uptake (mmol/g)'][1049] = 1.360
df['P(bar)'][1049] = 5
df['T(oC)'][1049] = 25
df['Vmicro (cm3/g)'][1049] = 0.05
df['Vtotal (cm3/g)'][1049] = 0.46
df['S BET(m2/g)'][1049] = 346
#df.index[1049] = 1015
df.dtypes

S BET(m2/g)           float64
Vtotal (cm3/g)        float64
Vmicro (cm3/g)        float64
T(oC)                 float64
P(bar)                float64
CO2uptake (mmol/g)    float64
dtype: object

In [27]:
df.index[1049]

'1015  346'
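
The label is unchanged because pandas `Index` objects are immutable, so item assignment on `df.index` raises a `TypeError`. If the label itself needed repairing, `rename` could be used instead, as in this sketch on a made-up three-row frame (the neighbouring values are invented):

```python
import pandas as pd

# Made-up frame with one fused index label, as in the extracted table.
toy = pd.DataFrame({"S BET(m2/g)": [500.0, 346.0, 610.0]},
                   index=pd.Index(["1014", "1015  346", "1016"], name="Entry"))

# rename() returns a new frame with the label replaced.
toy = toy.rename(index={"1015  346": "1015"})
print(toy.index.tolist())
```

In this notebook the index is discarded in the next step, so the repair is optional.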

In [28]:
df = df.reset_index(drop=True)
df.head()

| | S BET(m2/g) | Vtotal (cm3/g) | Vmicro (cm3/g) | T(oC) | P(bar) | CO2uptake (mmol/g) |
|---|---|---|---|---|---|---|
| 0 | 798.0 | 0.18 | 0.42 | 25.0 | 5.0 | 2.0 |
| 1 | 241.0 | 1.01 | 0.09 | 25.0 | 5.0 | 0.7 |
| 2 | 448.0 | 1.09 | 0.21 | 25.0 | 5.0 | 0.9 |
| 3 | 826.0 | 0.91 | 0.39 | 25.0 | 5.0 | 1.8 |
| 4 | 895.0 | 0.9 | 0.4 | 25.0 | 5.0 | 2.8 |


The table is now cleaned up, and we export it to its own CSV file:

In [29]:
# to_csv returns None when a path is given, so the result is not assigned.
df.to_csv('clean_datatable.csv', index=True, header=True)
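
A quick way to check an export like this is to read the file back. The sketch below does so with a made-up two-row frame and a temporary directory; the file name `toy_table.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up two-row frame standing in for the cleaned table.
toy = pd.DataFrame({"T(oC)": [25.0, 0.0], "P(bar)": [5.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_table.csv")  # hypothetical file name
    toy.to_csv(path, index=True, header=True)
    # index_col=0 restores the saved index instead of adding an "Unnamed: 0" column.
    restored = pd.read_csv(path, index_col=0)

print(restored)
```

Because the export uses `index=True`, reading the file without `index_col=0` would produce an extra unnamed column holding the old row numbers.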