# Objective: Here the models are trained using the dataset without creating new features and comparing the boosting techniques like LightGBM, Random Forest, XGBoost, CatBoost

# 1. Loading Data and Packages 

In [1]:
import pandas as pd
import numpy as np
import scipy
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import warnings
from scipy.stats import skew
from scipy import stats
from scipy.stats.stats import pearsonr
from scipy.stats import norm
%matplotlib inline
!pip install category_encoders eli5 shap

from category_encoders import OrdinalEncoder, OneHotEncoder
import eli5
from eli5.sklearn import PermutationImportance

from scipy.stats import randint, uniform

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder as OHE
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

from hyperopt import fmin, hp, tpe, Trials, space_eval, STATUS_OK, STATUS_RUNNING
from functools import partial
from tqdm import tqdm_notebook as tqdm
from sklearn.metrics import precision_score, roc_auc_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

import xgboost as xgb
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import lightgbm as lgb



Using TensorFlow backend.


ModuleNotFoundError: No module named 'hyperopt'

https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc
use this link for theory and explanation

In [2]:
trainval = pd.read_csv(r'C:\Users\praty\Documents\backup work data science\work\capstone\datasets\train\train.csv')
test = pd.read_csv(r'C:\Users\praty\Documents\backup work data science\work\capstone\datasets\test\test.csv')
# structures = pd.read_csv(r'C:\Users\praty\Documents\backup work data science\work\capstone\datasets\train\structures.csv')
# dipole = pd.read_csv(r'C:\Users\praty\Documents\backup work data science\work\capstone\datasets\train\dipole_moments.csv')
# contrib = pd.read_csv(r'C:\Users\praty\Documents\backup work data science\work\capstone\datasets\train\scalar_coupling_contributions.csv')
# magnetic = pd.read_csv(r'C:\Users\praty\Documents\backup work data science\work\capstone\datasets\train\magnetic_shielding_tensors.csv')
# mulliken = pd.read_csv(r'C:\Users\praty\Documents\backup work data science\work\capstone\datasets\train\mulliken_charges.csv')
# potential_energy = pd.read_csv(r'C:\Users\praty\Documents\backup work data science\work\capstone\datasets\train\potential_energy.csv')


# 2. Defining to reduce memory usage

In [3]:
def reduce_mem_usage(df, verbose=True):
    """
    This function reduces the numeric to the least possible numeric type that fits the data so 
    memory usage during transforming and training will be reduced.
    Taken from: https://www.kaggle.com/todnewman/keras-neural-net-for-champs
    
    Han
    Parameters:
    ===========
    dataframe: input dataframe 
    verbose: verbose mode, default True.
    Output:
    ===========
    dataframe: dataframe with numeric columns types changed to the least possible size
    """

    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

# 3. Inference of EDA

a. Looking at the data, we can see that the train and test sets had relatively even distributions of scalar coupling type and of the number of atoms present in each dataset. This tells us that the train data is a good enough representation of the test data in order to create a model that predicts the scalar coupling constants.

b. The distribution of the scalar coupling constant values isolated by type also reveals that there are clear differences in the ranges that these values appear in. This gives us the insight that different molecular properties affect each type of J coupling differently and unique models should be used for all 8 coupling types found in the dataset.

c. The test set structure is the same as the training except that we don't have the scalar_coupling_constant column. Among the other datasets, the most promising seems to be structures.csv, as it is the only one that is available for both training and testing sets. All the other dataframes are available only for the former.

d. It is mentioned in the problem statement that the molecules listed for train and test are totally different.

e. there are no null values for both train and test


# 4. Grouping the data according to 'coupling type'

In [4]:
trainval_1JHC = trainval[trainval['type'] == '1JHC']
trainval_1JHC.head(2)

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant
0,0,dsgdb9nsd_000001,1,0,1JHC,84.8076
4,4,dsgdb9nsd_000001,2,0,1JHC,84.8074


In [5]:
trainval_1JHC.tail(2)

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant
4658136,4658136,dsgdb9nsd_133884,16,7,1JHC,99.6572
4658146,4658146,dsgdb9nsd_133884,17,8,1JHC,117.934


# 5. Changing the categorical coulmns (molecule name and type) to numerical values using encoding

In [6]:
from category_encoders import OrdinalEncoder, OneHotEncoder
def encode(df):
    df = OrdinalEncoder.fit_transform(OrdinalEncoder(df), df)
    return df

In [7]:
trainval_1JHC_encoded = encode(trainval_1JHC)

In [8]:
trainval_1JHC_encoded.tail(2)

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant
4658136,4658136,84747,16,7,1,99.6572
4658146,4658146,84747,17,8,1,117.934


In [9]:
trainval_1JHC_encoded.shape

(709416, 6)

# 6. Standard scalar , normalizing the data

In [10]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
train_val_1JHC = scaler.fit_transform(trainval_1JHC_encoded)
train_val_1JHC = pd.DataFrame(train_val_1JHC, columns=['id', 'molecule_name', 'atom_index_0', 'atom_index_1', 'type', 'scalar_coupling_constant' ])


  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


# 7. Split the data into train test split

In [11]:
train_1JHC, val_1JHC = train_test_split(train_val_1JHC, shuffle=False, random_state=47)

train_1JHC_molecules = train_1JHC['molecule_name'].unique()
val_1JHC_molecules = np.delete(val_1JHC['molecule_name'].unique(), 0)

train_1JHC = train_1JHC[train_1JHC['molecule_name'].isin(train_1JHC_molecules)]
val_1JHC = val_1JHC[val_1JHC['molecule_name'].isin(val_1JHC_molecules)]

# 8. Define features and target

In [12]:
features = ['id', 'molecule_name', 'atom_index_0', 'atom_index_1', 'type']
target = 'scalar_coupling_constant'

In [13]:
#full data
X_trainval_1JHC = train_val_1JHC[features]
y_trainval_1JHC = train_val_1JHC[target]

# split data
X_train_1JHC = train_1JHC[features]
y_train_1JHC = train_1JHC[target]

X_val_1JHC = val_1JHC[features]
y_val_1JHC = val_1JHC[target]


# 9. Defining evaluation metric as Mean Absolute Error

In [14]:
def custom_eval_metric(y_true, y_pred):
    return 'custom_eval_metric', ((y_true - y_pred).abs().mean()), False

def custom_score_metric(y_true, y_pred, sample_weight):
    return ((y_true - y_pred).abs().mean())

# 10. Model using LightGBM

# 1JHC LightGBM

In [15]:
import lightgbm as lgb
from lightgbm import LGBMRegressor

fit_params={"early_stopping_rounds":30, 
            "eval_metric" : custom_eval_metric, 
            "eval_set" : [(X_val_1JHC, y_val_1JHC)],
            'eval_names': ['valid'],
            'verbose': 1,
            'categorical_feature': 'auto'}

param_test ={'num_leaves': [x for x in range(0, 100, 10)], 
             'min_child_samples': [x for x in range(100, 500, 10)], 
             'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
             'subsample': uniform(loc=0.2, scale=0.8), 
             'colsample_bytree': uniform(loc=0.4, scale=0.6),
             'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
             'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}


clf_1JHC = lgb.LGBMRegressor(max_depth=-1, random_state=47, n_jobs=-1, n_estimators=50)
                          

gs_1JHC = RandomizedSearchCV(
    estimator=clf_1JHC,
    param_distributions=param_test, 
    n_iter=3,
    cv=3,
    refit=True,
    random_state=47)

gs_1JHC.fit(X_train_1JHC, y_train_1JHC, **fit_params)

final_params_1JHC = gs_1JHC.best_params_
final_params_1JHC['n_estimators'] = 2000
final_params_1JHC['max_depth'] = -1
final_params_1JHC['random_state'] = 47
final_params_1JHC['n_jobs'] = -1


clf_1JHC = lgb.LGBMRegressor()

clf_1JHC.set_params(**final_params_1JHC)

clf_1JHC.fit(X_train_1JHC, y_train_1JHC)

y_pred_1JHC = clf_1JHC.predict(X_val_1JHC)

pred_vs_actual_1JHC = pd.DataFrame(data={
    'predictions': y_pred_1JHC,
    'actual': y_val_1JHC
})

[1]	valid's l2: 0.794975	valid's custom_eval_metric: 0.622325
Training until validation scores don't improve for 30 rounds.
[2]	valid's l2: 0.790933	valid's custom_eval_metric: 0.619363
[3]	valid's l2: 0.788222	valid's custom_eval_metric: 0.606708
[4]	valid's l2: 0.787007	valid's custom_eval_metric: 0.595515
[5]	valid's l2: 0.775378	valid's custom_eval_metric: 0.590123
[6]	valid's l2: 0.772971	valid's custom_eval_metric: 0.588634
[7]	valid's l2: 0.773052	valid's custom_eval_metric: 0.580243
[8]	valid's l2: 0.773914	valid's custom_eval_metric: 0.572725
[9]	valid's l2: 0.765648	valid's custom_eval_metric: 0.56905
[10]	valid's l2: 0.764245	valid's custom_eval_metric: 0.568121
[11]	valid's l2: 0.765674	valid's custom_eval_metric: 0.562036
[12]	valid's l2: 0.767656	valid's custom_eval_metric: 0.556256
[13]	valid's l2: 0.761724	valid's custom_eval_metric: 0.553707
[14]	valid's l2: 0.760931	valid's custom_eval_metric: 0.553164
[15]	valid's l2: 0.762997	valid's custom_eval_metric: 0.548819
[16

[28]	valid's l2: 0.746267	valid's custom_eval_metric: 0.61279
[29]	valid's l2: 0.742959	valid's custom_eval_metric: 0.610762
[30]	valid's l2: 0.742302	valid's custom_eval_metric: 0.610093
[31]	valid's l2: 0.742608	valid's custom_eval_metric: 0.610792
[32]	valid's l2: 0.7417	valid's custom_eval_metric: 0.608691
[33]	valid's l2: 0.738946	valid's custom_eval_metric: 0.60692
[34]	valid's l2: 0.738275	valid's custom_eval_metric: 0.606249
[35]	valid's l2: 0.738612	valid's custom_eval_metric: 0.60705
[36]	valid's l2: 0.737818	valid's custom_eval_metric: 0.605139
[37]	valid's l2: 0.735501	valid's custom_eval_metric: 0.60358
[38]	valid's l2: 0.734803	valid's custom_eval_metric: 0.602896
[39]	valid's l2: 0.735157	valid's custom_eval_metric: 0.603772
[40]	valid's l2: 0.734457	valid's custom_eval_metric: 0.602024
[41]	valid's l2: 0.732472	valid's custom_eval_metric: 0.600629
[42]	valid's l2: 0.731743	valid's custom_eval_metric: 0.599929
[43]	valid's l2: 0.732102	valid's custom_eval_metric: 0.60085

[4]	valid's l2: 0.75512	valid's custom_eval_metric: 0.634615
[5]	valid's l2: 0.734046	valid's custom_eval_metric: 0.617714
[6]	valid's l2: 0.72843	valid's custom_eval_metric: 0.606221
[7]	valid's l2: 0.724666	valid's custom_eval_metric: 0.5977
[8]	valid's l2: 0.717791	valid's custom_eval_metric: 0.589736
[9]	valid's l2: 0.703696	valid's custom_eval_metric: 0.580958
[10]	valid's l2: 0.701696	valid's custom_eval_metric: 0.573377
[11]	valid's l2: 0.697177	valid's custom_eval_metric: 0.567505
[12]	valid's l2: 0.686799	valid's custom_eval_metric: 0.558709
[13]	valid's l2: 0.684099	valid's custom_eval_metric: 0.5525
[14]	valid's l2: 0.67627	valid's custom_eval_metric: 0.546028
[15]	valid's l2: 0.669882	valid's custom_eval_metric: 0.540541
[16]	valid's l2: 0.664477	valid's custom_eval_metric: 0.535742
[17]	valid's l2: 0.663877	valid's custom_eval_metric: 0.534141
[18]	valid's l2: 0.659012	valid's custom_eval_metric: 0.529511
[19]	valid's l2: 0.655461	valid's custom_eval_metric: 0.525625
[20]	

[34]	valid's l2: 0.724665	valid's custom_eval_metric: 0.508519
[35]	valid's l2: 0.726389	valid's custom_eval_metric: 0.509283
[36]	valid's l2: 0.726419	valid's custom_eval_metric: 0.5095
[37]	valid's l2: 0.728402	valid's custom_eval_metric: 0.510216
[38]	valid's l2: 0.731151	valid's custom_eval_metric: 0.510989
[39]	valid's l2: 0.731307	valid's custom_eval_metric: 0.510868
[40]	valid's l2: 0.731597	valid's custom_eval_metric: 0.511178
[41]	valid's l2: 0.7342	valid's custom_eval_metric: 0.512173
[42]	valid's l2: 0.735456	valid's custom_eval_metric: 0.512684
[43]	valid's l2: 0.736053	valid's custom_eval_metric: 0.512881
[44]	valid's l2: 0.739188	valid's custom_eval_metric: 0.51309
[45]	valid's l2: 0.739068	valid's custom_eval_metric: 0.513155
[46]	valid's l2: 0.73881	valid's custom_eval_metric: 0.513136
[47]	valid's l2: 0.741082	valid's custom_eval_metric: 0.514118
[48]	valid's l2: 0.742108	valid's custom_eval_metric: 0.513923
[49]	valid's l2: 0.742499	valid's custom_eval_metric: 0.51368

# 11. Random forest model 

In [16]:
model = RandomForestRegressor()
model.fit(X_train_1JHC,y_train_1JHC)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [17]:
#*Get the mean absolute error on the validation data *

# In[ ]:


y_pred_1JHC = model.predict(X_val_1JHC)
MAE = mean_absolute_error(y_val_1JHC , y_pred_1JHC)
print('Random forest validation MAE = ', MAE)

Random forest validation MAE =  0.6046684372539352


# 12. XGBoost model

In [18]:
import xgboost as xgb
from xgboost import XGBRegressor
XGBModel = XGBRegressor()
XGBModel.fit(X_train_1JHC,y_train_1JHC , verbose=False)


# Get the mean absolute error on the validation data

# In[ ]:


y_pred_1JHC = XGBModel.predict(X_val_1JHC)
MAE = mean_absolute_error(y_val_1JHC , y_pred_1JHC)
print('XGBoost validation MAE = ',MAE)


  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \


XGBoost validation MAE =  0.5054160797357077


https://github.com/lilly-chen/Bite-sized-Machine-Learning/blob/master/Tree-BasedEsemble/Basic%20Ensemble%20Learning%20-%20Sample%20Code.ipynb

https://setscholars.net/2019/02/19/how-to-find-optimal-parameters-for-catboost-using-gridsearchcv-for-regression-in-python/

In [19]:
!pip install catboost



# 13. Catboost

In [20]:
from catboost import CatBoostRegressor
model = CatBoostRegressor()
parameters = {'depth' : [6,8,10],
                  'learning_rate' : [0.01, 0.05, 0.1],
                  'iterations'    : [30, 50, 100]
                 }
grid = GridSearchCV(estimator=model, param_grid = parameters, cv = 2, n_jobs=-1)
grid.fit(X_train_1JHC,y_train_1JHC)    

y_pred_1JHC = XGBModel.predict(X_val_1JHC)
MAE = mean_absolute_error(y_val_1JHC , y_pred_1JHC)
print('CatBoost validation MAE = ',MAE)
    

0:	learn: 1.0209365	total: 256ms	remaining: 12.5s
1:	learn: 1.0139672	total: 402ms	remaining: 9.65s
2:	learn: 1.0076518	total: 563ms	remaining: 8.82s
3:	learn: 1.0005164	total: 715ms	remaining: 8.22s
4:	learn: 0.9938081	total: 872ms	remaining: 7.84s
5:	learn: 0.9891390	total: 1.02s	remaining: 7.48s
6:	learn: 0.9843445	total: 1.17s	remaining: 7.2s
7:	learn: 0.9799889	total: 1.31s	remaining: 6.89s
8:	learn: 0.9760341	total: 1.47s	remaining: 6.7s
9:	learn: 0.9723841	total: 1.62s	remaining: 6.47s
10:	learn: 0.9688812	total: 1.77s	remaining: 6.29s
11:	learn: 0.9657645	total: 1.92s	remaining: 6.08s
12:	learn: 0.9626065	total: 2.07s	remaining: 5.89s
13:	learn: 0.9599713	total: 2.21s	remaining: 5.69s
14:	learn: 0.9573222	total: 2.36s	remaining: 5.52s
15:	learn: 0.9546346	total: 2.51s	remaining: 5.33s
16:	learn: 0.9523246	total: 2.66s	remaining: 5.16s
17:	learn: 0.9499638	total: 2.8s	remaining: 4.97s
18:	learn: 0.9453626	total: 2.95s	remaining: 4.82s
19:	learn: 0.9433835	total: 3.1s	remaining: 

# Inference:
    When this model is compared with the model complete with new features, it clearly indicates that this model being dimensionally small is getting better MAE for the same 1JHC type.