# Capturing greenhouse gases with data

## Model Development

### by Zachary Brown

Now that my dataset has been cleaned, explored, and preprocessed, it's time to test a range of models to determine which best predicts the volumetric CO2 working capacity, and then to explore which features boost that capacity the most.

I'll start by installing the necessary libraries and then importing everything I'll need.

In [2]:
!pip install xgboost==1.7.4
!pip install lightgbm==3.3.5


In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn import linear_model
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import xgboost
from sklearn import tree
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
import lightgbm as lgb
from matplotlib import rcParams

In [6]:
sns.set_theme('notebook')
rcParams['mathtext.default'] = 'regular'

Next, I'll load the preprocessed training data.

In [12]:
X_train = pd.read_csv('../data/processed/X_train.csv', index_col = 'filename')
y_train = pd.read_csv('../data/processed/y_train.csv', index_col = 0)

In [13]:
X_train.head()

filename,unit_cell_volume,Density,accessible_surface_area,volumetric_surface_area,gravimetric_surface_area,accessible_volume_per_uc,volume_fraction,grav_volume,probe_occupiable_vol,probe_occ_vol_frac,...,lc-S-0-all,lc-S-1-all,lc-S-2-all,lc-S-3-all,lc-alpha-0-all,lc-alpha-1-all,lc-alpha-2-all,lc-alpha-3-all,D_lc-chi-2-all,D_lc-S-2-all
DB0-m2_o8_o14_f0_pcu.sym.96.cif,2835.48,0.594287,639.155,2254.13,3793.01,710.004,0.2504,0.421345,1739.57,0.6135,...,0.542767,0.759733,1.3,1.176967,36.98,95.673333,136.635076,138.33964,1.48,0.213333
DB0-m2_o6_o25_f0_pcu.sym.25.cif,3731.46,0.378038,798.181,2139.06,5658.31,1873.34,0.50204,1.32801,3057.93,0.8195,...,0.542767,0.567233,0.9225,0.759733,36.98,67.8,86.526667,95.673333,0.756667,-0.033333
DB0-m29_o89_o148_f0_pts.sym.12.cif,1977.56,0.774433,393.895,1991.83,2571.98,333.139,0.16846,0.217527,1054.04,0.533,...,0.5329,0.5621,1.095,1.1242,28.09,59.89,87.98,119.78,0.89,-0.04
DB0-m9_o3_o7_f0_sra.sym.2.cif,4622.09,0.533492,697.715,1509.52,2829.52,1820.64,0.3939,0.738343,3064.45,0.663,...,0.5329,0.5621,1.095,1.1242,28.09,59.89,87.98,119.78,0.89,-0.04
DB0-m2_o21_o22_f0_pcu.sym.120.cif,1861.23,0.576,416.902,2239.93,3888.77,493.449,0.26512,0.460278,1209.8,0.65,...,0.542767,0.567233,0.9225,1.034467,36.98,67.8,86.526667,118.844205,0.756667,-0.033333


In [14]:
y_train.head()

,0
0,5.853326
1,1.588771
2,16.626334
3,4.305267
4,2.952315
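
Note that the features are indexed by `filename` while the target carries a plain integer index, so the two frames pair up by row position rather than by label. The sketch below shows one way to check that pairing and, optionally, give the target the same index. It runs on tiny made-up frames (`X_demo`, `y_demo`, and `y_aligned` are hypothetical names, not part of the notebook), and it assumes the rows of both files were written in the same order.

```python
import pandas as pd

# Hypothetical stand-ins for X_train and y_train (not the real data).
X_demo = pd.DataFrame(
    {"Density": [0.59, 0.38, 0.77]},
    index=pd.Index(["a.cif", "b.cif", "c.cif"], name="filename"),
)
y_demo = pd.DataFrame({"0": [5.85, 1.59, 16.63]})  # default integer index

# The frames are paired by row position, so their lengths must match.
assert len(X_demo) == len(y_demo)

# Optional: copy the feature index onto the target so that any later
# label-based operation (join, loc) lines the rows up correctly.
# This is only valid if both files share the same row order (assumption).
y_aligned = y_demo.copy()
y_aligned.index = X_demo.index
print(y_aligned)
```

`train_test_split` shuffles both inputs with the same permutation, so positional pairing is enough for the split that follows; the re-indexing step only matters if the frames are later combined by label.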


Next, I'll perform an initial train/test split that keeps just 10% of the data for training (`test_size=0.9`), so the first round of models fits quickly and gives a rough sense of how each one performs.

In [15]:
X_tr, X_te, y_tr, y_te = train_test_split(X_train, y_train, test_size=0.9, random_state=15)

In [16]:
X_tr.shape, X_te.shape, y_tr.shape, y_te.shape

((20475, 450), (184275, 450), (20475, 1), (184275, 1))
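
To make the effect of `test_size=0.9` concrete, here is a minimal sketch on synthetic data. The arrays and the names `X_demo`, `y_demo`, `X_small`, and `X_rest` are illustrative assumptions; only the `train_test_split` call mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 5 features (not the project data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.normal(size=1000)

# test_size=0.9 holds out 90% of the rows, leaving 10% for quick fitting.
X_small, X_rest, y_small, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.9, random_state=15
)

print(X_small.shape, X_rest.shape)
```

Fixing `random_state` keeps the subsample reproducible, so every candidate model is screened on the same 10% of rows.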

In [17]:
# Flatten the targets from (n, 1) column vectors to 1-D arrays for model fitting
y_tr = y_tr.values.ravel()
y_te = y_te.values.ravel()

In [18]:
y_tr.shape, y_te.shape

((20475,), (184275,))
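
The flattening step matters because many scikit-learn regressors expect a 1-D target for single-output problems and may emit a warning when handed an `(n, 1)` column instead. A small sketch of what `.values.ravel()` does, using a made-up three-row frame (`y_frame` and `y_flat` are hypothetical names):

```python
import pandas as pd

# Hypothetical single-column target, shaped like y_tr before flattening.
y_frame = pd.DataFrame({"0": [5.85, 1.59, 16.63]})
print(y_frame.shape)  # (3, 1)

# .values gives the underlying NumPy array; .ravel() flattens it to 1-D.
y_flat = y_frame.values.ravel()
print(y_flat.shape)  # (3,)
```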