

# Reach Top-54 using only LGBM

![](https://i.ibb.co/M77Pchy/Screenshot-from-2019-08-29-09-34-01.png)

![](https://i.ibb.co/st3vrXX/Screenshot-from-2019-08-29-10-18-28.png)

<br>

We had problems with GPUs, since Kaggle only lets you use one. MPNN was running on GCP and we used kernels for SchNet + NN, but the performance wasn't that good, so we tried to improve our **lgbm baseline** ```-1.956 public LB``` with the aim of being able to run 8 lgbm models at the same time.

In my post [42nd Solution, explanations and apologies](https://www.kaggle.com/c/champs-scalar-coupling/discussion/106263#latest-611323) you can find more information.
I also realized that the **43rd place team** has a similar approach and a similar LGBM; check their post [43 place solution of 2 Experts and the farmer](https://www.kaggle.com/c/champs-scalar-coupling/discussion/106280#latest-611320).

**References**

- LGBM based on [Distance - is all you need. LB -1.481](https://www.kaggle.com/criskiev/distance-is-all-you-need-lb-1-481) by @criskiev 

Other important kernels:
- [Using RDKit for Atomic Feature and Visualization](https://www.kaggle.com/sunhwan/using-rdkit-for-atomic-feature-and-visualization) by @sunhwan 
- [Molecule with OpenBabel](https://www.kaggle.com/jmtest/molecule-with-openbabel) by @jmtest 


# Pipeline

0. Data: we decided to split **1JHC** into 2 (you will see *1JHC_0* and *1JHC_1*), following the discussions.
1. Features
2. Manual fine-tuning
3. Bayesian optimization (1 morning)
4. Run 8 models (1 per type) using 8 kernels and, at the end, stack the solutions
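
A minimal sketch of the last step, assuming "stack" here means concatenating the per-type prediction tables into one submission. The frame names, ids and values below are made up for illustration; only the `id` and `scalar_coupling_constant` column names come from the competition's submission format.

```python
import pandas as pd

# Hypothetical per-type outputs: each kernel would save the test ids and
# predictions for the coupling type it trained on (values are invented).
preds_1jhn = pd.DataFrame({"id": [3, 7], "scalar_coupling_constant": [47.1, 32.5]})
preds_2jhh = pd.DataFrame({"id": [4, 5, 6], "scalar_coupling_constant": [-11.2, -10.9, -11.4]})

# Submission template with one row per test id.
template = pd.DataFrame({"id": [3, 4, 5, 6, 7], "scalar_coupling_constant": 0.0})


def combine_predictions(template, per_type_frames):
    """Concatenate per-type predictions and align them to the template ids."""
    stacked = pd.concat(per_type_frames, ignore_index=True)
    if stacked["id"].duplicated().any():
        raise ValueError("an id was predicted by more than one model")
    merged = template[["id"]].merge(stacked, on="id", how="left")
    if merged["scalar_coupling_constant"].isna().any():
        raise ValueError("some test ids have no prediction")
    return merged


submission = combine_predictions(template, [preds_1jhn, preds_2jhh])
print(submission)
```

Merging on `id` instead of relying on row order makes it harder to misalign predictions when the pieces come from different kernels.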


# Cross Validation

| type | CV |
|---|---|
| 1JHC | -0.91 |
| 1JHN | -1.52 |
| 2JHH | -2.49 |
| 2JHC | -1.772 |
| 2JHN | -2.319 |
| 3JHH | -2.46 |
| 3JHC | -1.701 |
| 3JHN | -2.64 |
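
For context, the competition metric is the mean over coupling types of log(MAE) within each type, so the per-type values in the table can be averaged directly. The sketch below is an illustration, not the notebook's own code: `group_log_mae` is a hypothetical helper name, and the unweighted mean treats 1JHC as a single type, as the table does.

```python
import numpy as np
import pandas as pd


def group_log_mae(y_true, y_pred, types):
    """Mean over coupling types of log(MAE within that type)."""
    abs_err = pd.Series(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    mae_per_type = abs_err.groupby(np.asarray(types)).mean()
    return float(np.log(mae_per_type).mean())


# Per-type CV scores copied from the table (each is already a log(MAE)).
cv_scores = {
    "1JHC": -0.91, "1JHN": -1.52, "2JHH": -2.49, "2JHC": -1.772,
    "2JHN": -2.319, "3JHH": -2.46, "3JHC": -1.701, "3JHN": -2.64,
}
overall_cv = float(np.mean(list(cv_scores.values())))
print(round(overall_cv, 4))
```

The unweighted mean of the eight table values is about -1.98.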



# Features

<img src="https://i.ibb.co/7kdn8s7/inbox-2779868-c7a27bb40ebfeb48f2f68ddd76e57f97-Screenshot-from-2019-08-25-09-42-58.png" alt="inbox-2779868-c7a27bb40ebfeb48f2f68ddd76e57f97-Screenshot-from-2019-08-25-09-42-58" border="0">

I started with **distance features** and then added more based on my knowledge, papers, discussions, etc.

### 1. Distance Features. 

Distances such as the C-C bond length are important. My baseline was obviously the kernel [Distance - is all you need. LB -1.481](https://www.kaggle.com/criskiev/distance-is-all-you-need-lb-1-481) by @criskiev. Distances between atoms tell you more about the geometry, bond strength, bond type, electronegativity... Remember that atoms carry charge and attract and repel each other.
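
As a rough illustration of what a distance feature looks like, here is a minimal sketch that computes the Euclidean distance between the two atoms of each coupling pair, plus the full pairwise distance matrix of one molecule. The column names (`x_0` ... `z_1`) and the toy coordinates are assumptions for the example, not the actual feature set used here.

```python
import numpy as np
import pandas as pd

# Toy pair table: xyz coordinates of the two atoms of each coupling pair.
# Column names and values are assumptions for illustration only.
pairs = pd.DataFrame({
    "x_0": [0.0, 1.0], "y_0": [0.0, 0.0], "z_0": [0.0, 0.0],
    "x_1": [3.0, 1.0], "y_1": [4.0, 0.0], "z_1": [0.0, 2.0],
})

p0 = pairs[["x_0", "y_0", "z_0"]].to_numpy()
p1 = pairs[["x_1", "y_1", "z_1"]].to_numpy()

# Euclidean distance between the two atoms of each pair.
pairs["dist"] = np.linalg.norm(p0 - p1, axis=1)

# Full pairwise distance matrix for one small molecule (rows = atoms).
coords = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 2.0]])
dist_matrix = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

print(pairs["dist"].tolist())
print(dist_matrix)
```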

### 2. Angles: Bond Angles (2J) and Dihedral Angles (3J)

We used this kernel to get these features: [Molecule with OpenBabel](https://www.kaggle.com/jmtest/molecule-with-openbabel) by @jmtest

**Karplus Equation**

Bothner-By equation:

$$J_{HH} = 7 - \cos\Theta + 5\cos 2\Theta$$

where Θ is the torsion angle. The problem wasn't obtaining those angles; the problem was that the values I got from OpenBabel and RDKit were not calculated correctly! For example, for **CH4** (the 1st molecule) the angles should be 60, but I obtained random numbers like 83, 74, 57, etc. ([check here](http://www.ochempal.org/index.php/alphabetical/c-d/dihedral-angle/)).
Even for bond angles: I tried water (**H2O**) and instead of 104.5 I obtained random results like 122, 97...
I read about it [here](https://www.rdkit.org/docs/GettingStartedInPython.html): in the case of RDKit, it uses an algorithm based on distance geometry to generate the 3D (xyz) conformations of molecules, so that's probably the reason :(

> Note that the conformations that result from this procedure tend to be fairly ugly. They should be cleaned up using a force field. This can be done within the RDKit using its implementation of the Universal Force Field (UFF).
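
One way to sanity-check angles like these is to compute them directly from xyz coordinates with NumPy and compare. The sketch below is not what this notebook used; the function names and the toy coordinates are assumptions, and `bothner_by` simply evaluates the equation quoted above.

```python
import numpy as np


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b as the central atom."""
    v1, v2 = a - b, c - b
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))


def dihedral_angle(p0, p1, p2, p3):
    """Signed torsion angle in degrees around the p1-p2 axis."""
    b0 = p0 - p1
    b1 = (p2 - p1) / np.linalg.norm(p2 - p1)
    b2 = p3 - p2
    # Project the outer bonds onto the plane perpendicular to the axis.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))


def bothner_by(theta_deg):
    """Bothner-By estimate: J_HH = 7 - cos(theta) + 5 cos(2 theta)."""
    t = np.radians(theta_deg)
    return float(7.0 - np.cos(t) + 5.0 * np.cos(2.0 * t))


# Toy water geometry built with a 104.5 degree H-O-H angle (assumed values).
r, ang = 0.9572, np.radians(104.5)
o = np.array([0.0, 0.0, 0.0])
h1 = np.array([r, 0.0, 0.0])
h2 = np.array([r * np.cos(ang), r * np.sin(ang), 0.0])
print(bond_angle(h1, o, h2))

# Four points in an anti (trans) arrangement around the z axis.
p0, p1 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0])
p2, p3 = np.array([0.0, 0.0, 1.0]), np.array([-1.0, 0.0, 1.0])
theta = dihedral_angle(p0, p1, p2, p3)
print(abs(theta), bothner_by(theta))
```

If angles computed this way from the competition's xyz files disagree with a toolkit's output, the toolkit is probably working from a different conformation than the one in the file.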


### 3. Bond type

Whether the bond is single, double, triple... We didn't count the number of each type, but it could be a molecule feature. As I said, in theory all these features are connected: based on the distances and atom types you can guess the bond **type**. However, I think it is much better to add these features explicitly when training/feeding the model.

### 4. Atom features


- Atom type
- Hybridization
- Aromaticity
- Electronegativity
- Valences 
- Charges

We used RDKit and I learned how to get them in the kernel - [Using RDKit for Atomic Feature and Visualization](https://www.kaggle.com/sunhwan/using-rdkit-for-atomic-feature-and-visualization) by @sunhwan.
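
Some of these features need a chemistry toolkit, but others are plain lookups. As a small sketch (not the notebook's actual code), electronegativity can be added by mapping the element symbol to its Pauling value, and atom type can be one-hot encoded. The table name `atoms` and its columns are assumptions for the example.

```python
import pandas as pd

# Pauling electronegativities for the elements used in this example.
PAULING_EN = {"H": 2.20, "C": 2.55, "N": 3.04, "O": 3.44, "F": 3.98}

# Toy per-atom table; names and values are assumptions for illustration.
atoms = pd.DataFrame({"atom_index": [0, 1, 2], "atom": ["C", "H", "O"]})

# Lookup feature: electronegativity of each atom.
atoms["electronegativity"] = atoms["atom"].map(PAULING_EN)

# Categorical feature: one-hot encoding of the atom type.
atom_type = pd.get_dummies(atoms["atom"], prefix="atom")
features = pd.concat([atoms, atom_type], axis=1)
print(features)
```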

### 5. Molecule Features

Here you could use molecular properties like molecular polarity, potential energy, etc. We added the number of components and substituents, like the number of benzenes, aromatic nitrogens, hydroxyl groups, etc.
We used RDKit, see [rdkit.Chem.Fragments](http://rdkit.org/docs_temp/source/rdkit.Chem.Fragments.html) (number of aromatic nitrogens, number of carboxylic acids, etc.).

Important post: **[Is 1JHC really one class](https://www.kaggle.com/c/champs-scalar-coupling/discussion/104241#latest-606224)**

> Giba: The two groups are easily splited setting a threshold in 1J coupling distance to 1.065.

The reason is this:

<img src="https://i.ibb.co/2j9DG4r/inbox-2779868-4cef58f29e1062c9271e1b17e9bc80f9-Screenshot-from-2019-08-25-09-43-22.png" alt="inbox-2779868-4cef58f29e1062c9271e1b17e9bc80f9-Screenshot-from-2019-08-25-09-43-22" border="0">
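
A minimal sketch of such a split, using the 1.065 threshold from the quote. The `dist` column name, the toy values, and the choice of which side becomes *1JHC_0* versus *1JHC_1* are assumptions for illustration.

```python
import pandas as pd

THRESHOLD = 1.065  # distance threshold quoted above

# Toy training rows; `dist` is an assumed name for the H-C distance column.
train = pd.DataFrame({
    "type": ["1JHC", "1JHC", "1JHC", "2JHH"],
    "dist": [1.06, 1.09, 1.10, 1.78],
})


def split_1jhc(df, dist_col="dist", threshold=THRESHOLD):
    """Relabel 1JHC rows as 1JHC_0 / 1JHC_1 depending on the pair distance."""
    out = df.copy()
    is_1jhc = out["type"] == "1JHC"
    short = out[dist_col] < threshold
    # Which side gets which suffix is an arbitrary choice in this sketch.
    out.loc[is_1jhc & short, "type"] = "1JHC_0"
    out.loc[is_1jhc & ~short, "type"] = "1JHC_1"
    return out


train_split = split_1jhc(train)
print(train_split["type"].tolist())
```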

**References**

- https://www.ucl.ac.uk/nmr/NMR_lecture_notes/L3_3_97_web.pdf
- https://www.chem.wisc.edu/areas/reich/nmr/notes-5-hmr-5-vic-coupling.pdf
- https://www.chem.wisc.edu/areas/reich/nmr/05-hmr-05-3j.htm



In [None]:
import warnings

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, StratifiedKFold

warnings.filterwarnings("ignore")

### Basic config

In [None]:
PATH = '../input/mol-features'

In [None]:
type_name = {
             0: '1_1JHC_0',
             1: '1_1JHC_1',
             2: '2_2JHH',
             3: '3_1JHN',
             4: '4_2JHN',
             5: '5_2JHC',
             6: '6_3JHH',
             7: '7_3JHC',
             8: '8_3JHN', 
            }

In [None]:
!ls ../input/mol-features

In [None]:
!ls ../input/mol-features/1_1jhc-20190826t180727z-001 #compressed file

In [None]:
folder = {
            '1_1JHC_0': '1_1jhc_0-20190825t153133z-001',
            '1_1JHC_1': '1_1jhc_1-20190826t180741z-001',
            '2_2JHH': '2_2jhh-20190825t170952z-001',
            '3_1JHN': '3_1jhn-20190825t171137z-001',
            '4_2JHN': '4_2jhn-20190825t170517z-001',
            '5_2JHC':'5_2jhc-20190825t175318z-001',
            '6_3JHH':'6_3jhh-20190825t170153z-001',
            '7_3JHC':'7_3jhc-20190825t153045z-001',
            '8_3JHN':'8_3jhn-20190825t170413z-001'
    
}

# Parameters

We did manual fine-tuning + Bayesian optimization.

In [None]:
type_params = {
    
    0: {
    'boosting_type': "gbdt",
    'objective': "huber",
    'learning_rate': 0.1,
    'num_leaves': 511,
    'sub_feature': 0.50,
    'sub_row': 0.5,
    'bagging_freq': 1,
    'metric': 'mae'},

    1: {
    'boosting_type': "gbdt",
    'objective': "huber",
    'learning_rate': 0.1,
    'num_leaves': 100,
    'sub_feature': 0.50,
    'sub_row': 0.5,
    'bagging_freq': 1,
    'metric': 'mae'},
  
    2: {    
    'boosting_type': "gbdt",
    'objective': "huber",
    'learning_rate': 0.01,
    'bagging_freq': 1,
    'metric': 'mae',
    'min_data_in_leaf': 130, 
    'num_leaves': 150, 
    'reg_alpha': 0.5, 
    'reg_lambda': 0.6000000000000001, 
    'sub_feature': 0.30000000000000004, 
    'sub_row': 0.4},
    
    
    3: {
    'boosting_type': "gbdt",
    'objective': "huber",
    'learning_rate': 0.01,
    'bagging_freq': 1,
    'metric': 'mae',
    'min_data_in_leaf': 96, 
    'num_leaves': 30, 
    'reg_alpha': 0.2, 
    'reg_lambda': 0.4, 
    'sub_feature': 0.4, 
    'sub_row': 0.5},
    
    
    4: {
    'boosting_type': "gbdt",
    'objective': "huber",
    'learning_rate': 0.1,
    'bagging_freq': 1,
    'metric': 'mae',
    'min_data_in_leaf': 21, 
    'num_leaves': 200, 
    'reg_alpha': 0.30000000000000004, 
    'reg_lambda': 0.2, 
    'sub_feature': 0.4, 
    'sub_row': 1.0},
    
    5: {
    'boosting_type': "gbdt",
    'objective': "huber",
    'learning_rate': 0.1,
    'bagging_freq': 1,
    'metric': 'mae',
    'num_leaves': 1023, 
    'sub_feature': 0.5, 
    'sub_row': 0.5},
    
    6:{
   'boosting_type': "gbdt",
   'objective': "huber",
   'learning_rate': 0.1,
   'min_data_in_leaf': 50, 
   'num_leaves': 700, 
   'reg_alpha': 0.30000000000000004, 
   'reg_lambda': 0.8, 
   'sub_feature': 0.5, 
   'sub_row': 0.5},
    
    7: {
    'boosting_type': "gbdt",
    'objective': "huber",
    'learning_rate': 0.1,
    'bagging_freq': 1,
    'metric': 'mae',
    'num_leaves': 1023, 
    'sub_feature': 0.5, 
    'sub_row': 0.5},
               

    8: {
    'boosting_type': "gbdt",
    'objective': "huber",
    'learning_rate': 0.01,
    'min_data_in_leaf': 38,
    'num_leaves': 350,
    'reg_alpha': 0.30000000000000004,
    'reg_lambda': 0.6000000000000001,
    'sub_feature': 0.6000000000000001,
    'sub_row': 0.5,
    'metric': 'mae'}


}
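
A note on the parameter names: in LightGBM, `sub_feature` and `sub_row` are aliases of `feature_fraction` and `bagging_fraction`, and, as far as I understand the LightGBM documentation, row subsampling only takes effect when `bagging_freq` is non-zero. The sketch below is a hypothetical helper (not part of LightGBM) that checks a parameter dict for this; note that entries 6 and 8 above set `sub_row` without `bagging_freq`.

```python
# Aliases used in type_params, mapped to LightGBM's canonical names.
ALIASES = {"sub_feature": "feature_fraction", "sub_row": "bagging_fraction"}


def canonical_params(params):
    """Return a copy of params with the aliases above renamed."""
    return {ALIASES.get(key, key): value for key, value in params.items()}


def bagging_is_active(params):
    """True if row subsampling would actually be applied."""
    p = canonical_params(params)
    return p.get("bagging_fraction", 1.0) < 1.0 and p.get("bagging_freq", 0) > 0


# Two reduced examples in the style of the entries above.
with_freq = {"sub_row": 0.5, "bagging_freq": 1}
without_freq = {"sub_row": 0.5}

print(bagging_is_active(with_freq), bagging_is_active(without_freq))
```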

In [None]:
sub = pd.read_csv('../input/champs-scalar-coupling/sample_submission.csv', low_memory=False)
sub['typei'] = pd.read_csv('../input/champs-scalar-coupling/test.csv', low_memory=False)['type']
sub.head()

# Run

**Important:** we used 8 Kaggle kernels, which means running 8 different models at the same time. In this part you should select a specific *coupling type* by modifying ```type_name``` (a dictionary defined at the beginning).

In [None]:

score = []

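# Fit one model per coupling type
for idx in type_name:

    ntype = type_name[idx]
    print(ntype)
    direct = folder[ntype]
    x_train_val = pd.read_csv(PATH+f'/{direct}/{ntype}/x_train.csv', index_col=0, low_memory=False)
    print('x_train:', x_train_val.shape)
    x_test_val = pd.read_csv(PATH+f'/{direct}/{ntype}/x_test.csv', index_col=0, low_memory=False)
    print('x_test:', x_test_val.shape)
    # ID.csv and y_train.csv are loaded with read_pickle, so despite the
    # extension they are expected to be pickled pandas objects.
    ID = pd.read_pickle(PATH+f'/{direct}/{ntype}/ID.csv')
    y_train_val = pd.read_pickle(PATH+f'/{direct}/{ntype}/y_train.csv')

    # This `break` stops the loop after loading the first type, so the
    # training code below does not run in this kernel; remove it to train.
    break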
    
    print(f'------------type{ntype}------------')

    maes = []
    predictions = np.zeros(len(x_test_val))
    preds_train = np.zeros(len(x_train_val))

    n_fold = 5
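    # random_state has no effect when shuffle=False (and recent scikit-learn
    # versions raise an error if it is set), so it is left out here.
    folds = StratifiedKFold(n_splits=n_fold, shuffle=False)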

    for fold_, (trn_idx, val_idx) in enumerate(folds.split(x_train_val, ID)):
        strLog = "fold {}".format(fold_)
        print(strLog)

        x_tr, x_val = x_train_val.iloc[trn_idx], x_train_val.iloc[val_idx]
        y_tr, y_val = y_train_val.iloc[trn_idx], y_train_val.iloc[val_idx]

        model = lgb.LGBMRegressor(**type_params[idx], n_estimators=30000,random_state=1)
        model.fit(x_tr,
                  y_tr,
                  eval_set=[(x_tr, y_tr), (x_val, y_val)],
                  eval_metric='mae',
                  verbose=1000,
                  early_stopping_rounds=200
                  )

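        # test predictions, averaged over folds
        preds = model.predict(x_test_val)  # , num_iteration=model.best_iteration_)
        predictions += preds / folds.n_splits
        # predictions on the full training set, averaged over folds (not out-of-fold)
        preds = model.predict(x_train_val)  # , num_iteration=model.best_iteration_)
        preds_train += preds / folds.n_splits

        # validation predictions for this fold
        preds = model.predict(x_val)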

        # mean absolute error
        mae = mean_absolute_error(y_val, preds)
        print('MAE: %.6f' % mae)
        print('Score: %.6f' % np.log(mae))
        maes.append(mae)
        print('')

    sub.loc[sub['typei'] == idx, 'scalar_coupling_constant'] = predictions
    score.append(np.mean(maes))
    print(f'{ntype} MAE:', np.mean(maes))
    print(f'{ntype} Score:', np.log(np.mean(maes)))

    print('')

print('')
print('----------------------')
# print('train score:', sum(np.log(score)) / 8)

In [None]:
#sub.to_csv(f'{config.DATA_DIR}/ano_20000.csv', index=False)

In [None]:
final = sub.drop(['typei'], axis=1)
final.to_csv('example.csv', index=False)
final.head()

Now I'm going to load the submission we actually obtained:

In [None]:
submission = pd.read_csv('../input/molsubs/lgbm_final.csv')
submission.to_csv('lgbm_final.csv', index=False)
submission.head()

# Conclusion

In the post [1JHC best CV score?](https://www.kaggle.com/c/champs-scalar-coupling/discussion/98444#latest-602104), CPMP posted this:

![](https://i.ibb.co/4J33JPs/Screenshot-from-2019-08-29-10-11-29.png)

> I don't think lgb is competitive here, will move to graph NNs now.

We used that as a reference and he was definitely right (not a surprise), but LGBM performs really well with the right **features**, and our LGBM single model (well, 1 model per type) could actually reach top-54.

> **NOTE:** we didn't submit LGBM alone (during the last week), so we only realized this yesterday :)
