# A big thanks to all Kagglers out there

## Core Idea

Despite all the physics and chemistry background introduced in the description, this competition is more about geometry and pattern matching.

The hypothesis of this kernel is as follows:
1. If two sets of atoms have the same atom types and the same distances between them, their scalar coupling constants should be very close.
2. Atoms closer to the pair under prediction have more influence on the scalar coupling constant than atoms farther away.

So, basically, this problem could be tackled with some kind of K-Nearest Neighbors algorithm or any tree-based model (e.g. LightGBM), provided we can find a representation that describes similar configurations with similar feature sets.

Each atom is described by 3 Cartesian coordinates. This representation is not stable: each coupling pair sits at a different point in space, so two similar coupling sets can have very different X, Y, Z values.

So, instead of using coordinates, let's consider the following system:
1. Take the pair of coupled atoms as the first two core atoms
2. Calculate the center (midpoint) between the pair
3. Find the n nearest atoms to the center (excluding the first two atoms)
4. Take the two closest atoms from step 3; they become the 3rd and 4th core atoms
5. Calculate the distances from the 4 core atoms to the rest of the atoms, and between the core atoms themselves

Using this representation, each atom's position can be described by 4 distances from the core atoms. This representation is stable under rotation and translation, and it's suitable for pattern matching. So, we can take a sequence of atoms, describe each one by 4 distances plus its atom type (H, O, etc.), and by looking up the same pattern we can find similar configurations and estimate the scalar coupling constant.
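
To make the stability claim concrete, here is a minimal sketch (not part of the original pipeline) on made-up coordinates: the raw coordinates change after a random orthogonal transform and a shift, while the pairwise distances stay the same. The helper name `pairwise_distances` and the toy data are assumptions for illustration only.

```python
import numpy as np

def pairwise_distances(points):
    """Return the (n, n) matrix of Euclidean distances between rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)

# Hypothetical 5-atom configuration (made-up coordinates)
coords = rng.normal(size=(5, 3))

# Random orthogonal matrix from a QR decomposition
# (a rotation, possibly combined with a reflection; both preserve distances)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
shift = np.array([10.0, -3.0, 7.5])
moved = coords @ q.T + shift

d_before = pairwise_distances(coords)
d_after = pairwise_distances(moved)

print(np.allclose(coords, moved))      # compares raw coordinates
print(np.allclose(d_before, d_after))  # compares distance matrices
```

The same reasoning applies to the 4 core-atom distances used in this kernel, since they are a subset of the full distance matrix.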

Here I used a gradient-boosted tree model (XGBoost in the code below), because sklearn's KNN can't cope with this amount of data. My blind guess is that a hand-crafted KNN could outperform it.

Let's code the solution!

## Load Everything

In [None]:
DATA_PATH = '../input'
SUBMISSIONS_PATH = './'
# use atomic numbers to recode atomic names
ATOMIC_NUMBERS = {
    'H': 1,
    'C': 6,
    'N': 7,
    'O': 8,
    'F': 9
}

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np

import math
import gc
import copy

from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV

import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBRegressor

In [None]:
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in newer pandas
pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', 120)

In [None]:
import os
print(os.listdir("../working"))

## Load Dataset

By default all data is read as `float64` and `int64`. We can trade this unneeded precision for lower memory use and faster prediction. So, let's read all the data with Pandas in a minimal representation:

In [None]:
train_dtypes = {
    'molecule_name': 'category',
    'atom_index_0': 'int8',
    'atom_index_1': 'int8',
    'type': 'category',
    'scalar_coupling_constant': 'float32'
}
train_csv = pd.read_csv(f'{DATA_PATH}/train.csv', index_col='id', dtype=train_dtypes)
train_csv['molecule_index'] = train_csv.molecule_name.str.replace('dsgdb9nsd_', '').astype('int32')
train_csv = train_csv[['molecule_index', 'atom_index_0', 'atom_index_1', 'type', 'scalar_coupling_constant']]
train_csv.head(10)

In [None]:
print('Shape: ', train_csv.shape)
print('Total: ', train_csv.memory_usage().sum())
train_csv.memory_usage()
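
As a rough illustration of the saving, the sketch below builds a synthetic frame (made-up data, not the competition files) with 64-bit dtypes and downcasts it to the same compact dtypes as in `train_dtypes`. The frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

n_rows = 1000

# Synthetic stand-in for the coupling table, using 64-bit dtypes
wide = pd.DataFrame({
    'atom_index_0': np.arange(n_rows, dtype='int64') % 29,
    'atom_index_1': np.arange(n_rows, dtype='int64') % 29,
    'scalar_coupling_constant': np.linspace(-10.0, 200.0, n_rows),
})

# Downcast to the compact dtypes used in this kernel
compact = wide.astype({
    'atom_index_0': 'int8',
    'atom_index_1': 'int8',
    'scalar_coupling_constant': 'float32',
})

# Bytes used by the column data only (index excluded)
bytes_wide = wide.memory_usage(index=False).sum()
bytes_compact = compact.memory_usage(index=False).sum()
print(bytes_wide, bytes_compact)
```

Note that `int8` only holds values up to 127, so this relies on atom indices staying small, as the `int8` choice in `train_dtypes` already assumes.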

In [None]:
submission_csv = pd.read_csv(f'{DATA_PATH}/sample_submission.csv', index_col='id')

In [None]:
test_csv = pd.read_csv(f'{DATA_PATH}/test.csv', index_col='id', dtype=train_dtypes)
test_csv['molecule_index'] = test_csv['molecule_name'].str.replace('dsgdb9nsd_', '').astype('int32')
test_csv = test_csv[['molecule_index', 'atom_index_0', 'atom_index_1', 'type']]
test_csv.head(10)

In [None]:
structures_dtypes = {
    'molecule_name': 'category',
    'atom_index': 'int8',
    'atom': 'category',
    'x': 'float32',
    'y': 'float32',
    'z': 'float32'
}
structures_csv = pd.read_csv(f'{DATA_PATH}/structures.csv', dtype=structures_dtypes)
structures_csv['molecule_index'] = structures_csv.molecule_name.str.replace('dsgdb9nsd_', '').astype('int32')
structures_csv = structures_csv[['molecule_index', 'atom_index', 'atom', 'x', 'y', 'z']]
structures_csv['atom'] = structures_csv['atom'].replace(ATOMIC_NUMBERS).astype('int8')
structures_csv.head(10)

In [None]:
print('Shape: ', structures_csv.shape)
print('Total: ', structures_csv.memory_usage().sum())
structures_csv.memory_usage()

## Build Distance Dataset

In [None]:
def build_type_dataframes(base, structures, coupling_type):
    base = base[base['type'] == coupling_type].drop('type', axis=1).copy()
    base = base.reset_index()
    base['id'] = base['id'].astype('int32')
    structures = structures[structures['molecule_index'].isin(base['molecule_index'])]
    return base, structures

In [None]:
def add_coordinates(base, structures, index):
    df = pd.merge(base, structures, how='inner',
                  left_on=['molecule_index', f'atom_index_{index}'],
                  right_on=['molecule_index', 'atom_index']).drop(['atom_index'], axis=1)
    df = df.rename(columns={
        'atom': f'atom_{index}',
        'x': f'x_{index}',
        'y': f'y_{index}',
        'z': f'z_{index}'
    })
    return df

In [None]:
def add_atoms(base, atoms):
    df = pd.merge(base, atoms, how='inner',
                  on=['molecule_index', 'atom_index_0', 'atom_index_1'])
    return df

In [None]:
def merge_all_atoms(base, structures):
    df = pd.merge(base, structures, how='left',
                  left_on=['molecule_index'],
                  right_on=['molecule_index'])
    df = df[(df.atom_index_0 != df.atom_index) & (df.atom_index_1 != df.atom_index)]
    return df
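
To see what `merge_all_atoms` produces, here is a toy sketch with one hypothetical molecule of four atoms and a single coupling pair. It repeats the same merge-and-filter logic inline; the data frames `toy_structures` and `toy_pairs` are made up for illustration.

```python
import pandas as pd

# Hypothetical molecule 1 with four atoms (atomic numbers, made-up x coordinates)
toy_structures = pd.DataFrame({
    'molecule_index': [1, 1, 1, 1],
    'atom_index': [0, 1, 2, 3],
    'atom': [6, 1, 1, 8],
    'x': [0.0, 1.1, -0.5, 2.0],
})

# One coupling pair: atoms 1 and 0 of molecule 1
toy_pairs = pd.DataFrame({
    'molecule_index': [1],
    'atom_index_0': [1],
    'atom_index_1': [0],
})

# Same logic as merge_all_atoms: attach every atom of the molecule,
# then drop the rows that belong to the coupled pair itself
merged = pd.merge(toy_pairs, toy_structures, how='left', on='molecule_index')
neighbours = merged[(merged.atom_index_0 != merged.atom_index) &
                    (merged.atom_index_1 != merged.atom_index)]
print(neighbours)
```

The result has one row per remaining atom (atoms 2 and 3 here), which is the long format that later gets sorted by distance to the center.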

In [None]:
def add_center(df):
    df['x_c'] = ((df['x_1'] + df['x_0']) * np.float32(0.5))
    df['y_c'] = ((df['y_1'] + df['y_0']) * np.float32(0.5))
    df['z_c'] = ((df['z_1'] + df['z_0']) * np.float32(0.5))

# TODO: Euclidean distance. Could try Manhattan distance, since the interaction strength decays nonlinearly with distance
def add_distance_to_center(df):
    df['d_c'] = ((
        (df['x_c'] - df['x'])**np.float32(2) +
        (df['y_c'] - df['y'])**np.float32(2) + 
        (df['z_c'] - df['z'])**np.float32(2)
    )**np.float32(0.5))

def add_distance_between(df, suffix1, suffix2):
    df[f'd_{suffix1}_{suffix2}'] = ((
        (df[f'x_{suffix1}'] - df[f'x_{suffix2}'])**np.float32(2) +
        (df[f'y_{suffix1}'] - df[f'y_{suffix2}'])**np.float32(2) + 
        (df[f'z_{suffix1}'] - df[f'z_{suffix2}'])**np.float32(2)
    )**np.float32(0.5))

In [None]:
def add_distances(df):
    n_atoms = 1 + max([int(c.split('_')[1]) for c in df.columns if c.startswith('x_')])
    for i in range(1, n_atoms):
        for vi in range(min(4, i)):
            add_distance_between(df, i, vi)

In [None]:
def add_n_atoms(base, structures):
    dfs = structures['molecule_index'].value_counts().rename('n_atoms').to_frame()
    return pd.merge(base, dfs, left_on='molecule_index', right_index=True)

In [None]:
def take_n_atoms(df, n_atoms, four_start=4):
    labels = []
    for i in range(2, n_atoms):
        label = f'atom_{i}'
        labels.append(label)

    for i in range(n_atoms):
        num = min(i, 4) if i < four_start else 4
        for j in range(num):
            labels.append(f'd_{i}_{j}')
    if 'scalar_coupling_constant' in df:
        labels.append('scalar_coupling_constant')
    return df[labels]

In [None]:
def build_couple_dataframe(some_csv, structures_csv, coupling_type, n_atoms=10):
    base, structures = build_type_dataframes(some_csv, structures_csv, coupling_type) # select the rows and structures for the given coupling type
    base = add_coordinates(base, structures, 0) # add coordinates of the coupled atoms
    base = add_coordinates(base, structures, 1)
    
    base = base.drop(['atom_0', 'atom_1'], axis=1)
    atoms = base.drop('id', axis=1).copy()
    if 'scalar_coupling_constant' in some_csv:
        atoms = atoms.drop(['scalar_coupling_constant'], axis=1)
        
    add_center(atoms) # coordinates of the geometric center of the atom pair
    # Remove coordinates of coupling atoms?
    atoms = atoms.drop(['x_0', 'y_0', 'z_0', 'x_1', 'y_1', 'z_1'], axis=1) # drop the coordinates of the coupled atoms

    atoms = merge_all_atoms(atoms, structures) # one row per remaining atom of the molecule
    
    add_distance_to_center(atoms) # add each atom's distance to the center
    
    atoms = atoms.drop(['x_c', 'y_c', 'z_c', 'atom_index'], axis=1) # drop the center coordinates and the atom index
    atoms.sort_values(['molecule_index', 'atom_index_0', 'atom_index_1', 'd_c'], inplace=True) # sort before grouping
    atom_groups = atoms.groupby(['molecule_index', 'atom_index_0', 'atom_index_1']) # group by molecule and coupling pair
    atoms['num'] = atom_groups.cumcount() + 2 # rank within the group; 0 and 1 are the coupled pair
    atoms = atoms.drop(['d_c'], axis=1)
    atoms = atoms[atoms['num'] < n_atoms] # drop the "extra" atoms (numbered first, then filtered)

    atoms = atoms.set_index(['molecule_index', 'atom_index_0', 'atom_index_1', 'num']).unstack()
    atoms.columns = [f'{col[0]}_{col[1]}' for col in atoms.columns]
    atoms = atoms.reset_index()
    
    # downcast back to int8
    for col in atoms.columns:
        if col.startswith('atom_'):
            atoms[col] = atoms[col].fillna(0).astype('int8')
            
    atoms['molecule_index'] = atoms['molecule_index'].astype('int32')
    
    full = add_atoms(base, atoms)
    add_distances(full)
    
    full.sort_values('id', inplace=True)
    
    return full
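
The core trick in `build_couple_dataframe` is the sort, `cumcount`, `unstack` sequence that turns the long table (one row per neighbouring atom) into one row per coupling pair. Here is a toy sketch of just that step on made-up data; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Long table: one row per (pair, neighbouring atom); d_c is the distance to the pair center
toy = pd.DataFrame({
    'molecule_index': [1, 1, 1, 1, 1],
    'atom_index_0':   [1, 1, 1, 2, 2],
    'atom_index_1':   [0, 0, 0, 0, 0],
    'atom':           [8, 1, 6, 1, 8],
    'd_c':            [2.5, 0.9, 1.7, 1.2, 0.8],
})

keys = ['molecule_index', 'atom_index_0', 'atom_index_1']

# Sort so that the closest neighbour comes first within each pair
toy = toy.sort_values(keys + ['d_c'])

# Number neighbours 2, 3, 4, ... (0 and 1 are reserved for the pair itself)
toy['num'] = toy.groupby(keys).cumcount() + 2

# Pivot to one row per pair with columns atom_2, atom_3, ...
wide = toy.drop(columns='d_c').set_index(keys + ['num']).unstack()
wide.columns = [f'{name}_{num}' for name, num in wide.columns]
wide = wide.reset_index()
print(wide)
```

The second pair has only two neighbours, so its `atom_4` slot comes out as NaN. That is why the kernel fills missing atom columns with 0 before downcasting back to `int8`.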

## Check XGBoost with the smallest type

In [None]:
%%time

def type_select(types = '1JHN', n_atoms=10):
    full = build_couple_dataframe(train_csv, structures_csv, types, n_atoms=n_atoms)
    print(full.shape)
    df = take_n_atoms(full, n_atoms)
    df = df.fillna(0)

    X_data = df.drop(['scalar_coupling_constant'], axis=1).values.astype('float32')
    y_data = df['scalar_coupling_constant'].values.astype('float32')

    X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.1, random_state=128)
    print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
    
    return X_train, X_val, y_train, y_val



We don't calculate the distances `d_0_x`, `d_1_1`, `d_2_2`, `d_2_3`, `d_3_3`, because they either duplicate distances computed for later atoms (`d_0_1` == `d_1_0`) or are equal to zero (e.g. `d_1_1`, `d_2_2`).
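
To check which distance columns actually exist, here is a small sketch that mirrors the loop in `add_distances` and only lists the column names. The helper `distance_labels` is hypothetical and not used elsewhere in the kernel.

```python
def distance_labels(n_atoms):
    """List the d_i_j column names produced for `n_atoms` atoms.

    Mirrors the loop in add_distances: atom i is measured against
    core atoms 0 .. min(4, i) - 1, so there is no d_i_i and no mirrored duplicate.
    """
    labels = []
    for i in range(1, n_atoms):
        for j in range(min(4, i)):
            labels.append(f'd_{i}_{j}')
    return labels

labels = distance_labels(6)
print(labels)
print(len(labels))
```

For `n_atoms >= 4` this gives 6 distances among the core atoms plus 4 per additional atom, i.e. `4 * n_atoms - 10` distance features in total.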

For experiments, the full dataset can be built with a higher number of atoms and then trimmed when building the training/validation sets:

In [None]:
X_train, X_val, y_train, y_val = type_select(types = '3JHN')
#X_data, y_data = type_select(types = '3JHN')

In [None]:
model_parameters = {'n_estimators': [50, 100, 150, 200, 250, 300], 
                    'max_depth':[5,7,9,11], 'learning_rate': [0.07, 0.1, 0.15],
                    'gamma': [0, 0.0001, 0.001, 0.01],
                    'subsample': [0.5, 0.8, 1], 'colsample_bytree': [0.5, 0.66, 1]
                    }
fit_params = {'eval_metric': 'mae',
              'early_stopping_rounds': 5,
              'eval_set': [(X_val, y_val)]}

In [None]:
%%time
#model = XGBRegressor(objective='reg:squarederror')
#choose_model = RandomizedSearchCV(model, model_parameters, scoring='neg_mean_absolute_error', 
#                                  n_iter=100, cv=3, verbose=3, n_jobs=6)
#choose_model.fit(X_train, y_train) #, **fit_params)

In [None]:
#print("Best score: %0.3f" % choose_model.best_score_)
#print("Best parameters set:")
#best_parameters=choose_model.best_estimator_.get_params()
#for param_name in sorted(best_parameters.keys()):
#    print("\t%s: %r" % (param_name, best_parameters[param_name]))

In [None]:
best_parameters = {
    'base_score': 0.5,
    'booster': 'gbtree',
    'colsample_bylevel': 1,
    'colsample_bynode': 1,
    'colsample_bytree': 1,
    'gamma': 0,
    'importance_type': 'gain',
    'learning_rate': 0.15,
    'max_delta_step': 0,
    'max_depth': 11,
    'min_child_weight': 1,
    'missing': None,
    'n_estimators': 300,
    'n_jobs': 4,
    'nthread': None,
    'objective': 'reg:squarederror',
    'random_state': 0,
    'reg_alpha': 0,
    'reg_lambda': 1,
    'scale_pos_weight': 1,
    'seed': None,
    'silent': None,
    'subsample': 0.8,
    'verbosity': 1
}

In [None]:
#y_pred = choose_model.best_estimator_.predict(X_val)
#print(np.log(mean_absolute_error(y_val, y_pred)))

In [None]:
#categorical_feature=[0,1,2,3,4]

In [None]:
%%time
model_params = {        
    '1JHN': 10,
    '1JHC': 16,
    '2JHH': 11,
    '2JHN': 10,
    '2JHC': 13,
    '3JHH': 12,
    '3JHC': 15,
    '3JHN': 11,
}

#for coupling_type in model_params.keys():
#    for n in range(7, 20):
#        X_train, X_val, y_train, y_val = type_select(types=coupling_type, n_atoms=n)
#        model = XGBRegressor(**best_parameters)
#        model.fit(X_train, y_train)
#        y_pred = model.predict(X_val)
#        err = np.log(mean_absolute_error(y_val, y_pred))
#        print(coupling_type, n, err)
model_params.values()

In [None]:
#model2.best_iteration
# https://www.kaggle.com/nikitinale/using-xgboost-with-scikit-learn/edit 
# Look at feature importance

It's funny, but it looks like atom types aren't used much in the final decision. Quite the contrary to what a human would do.

## Submission Model

In [None]:
def build_x_y_data(some_csv, coupling_type, n_atoms):
    full = build_couple_dataframe(some_csv, structures_csv, coupling_type, n_atoms=n_atoms)
    
    df = take_n_atoms(full, n_atoms)
    df = df.fillna(0)
    print(df.columns)
    
    if 'scalar_coupling_constant' in df:
        X_data = df.drop(['scalar_coupling_constant'], axis=1).values.astype('float32')
        y_data = df['scalar_coupling_constant'].values.astype('float32')
    else:
        X_data = df.values.astype('float32')
        y_data = None
    
    return X_data, y_data

In [None]:
def train_and_predict_for_one_coupling_type(coupling_type, submission, n_atoms, random_state=128):
    print(f'*** Training Model for {coupling_type} ***')
    
    X_data, y_data = build_x_y_data(train_csv, coupling_type, n_atoms)
    X_test, _ = build_x_y_data(test_csv, coupling_type, n_atoms)
    y_pred = np.zeros(X_test.shape[0], dtype='float32')
    
    model_ = XGBRegressor(**best_parameters)
    model_.fit(X_data, y_data)

    y_pred += model_.predict(X_test)
    submission.loc[test_csv['type'] == coupling_type, 'scalar_coupling_constant'] = y_pred

Let's build a separate model for each type of coupling. Note that the code below does not split the data into folds: it fits a single model per type on the full training set for that type.

The main tuning parameter is the number of atoms. The values in `model_params` work well, but accuracy can probably be improved a bit by tuning them further for each type.

In [None]:
submission = submission_csv.copy()

#cv_scores = {}
for coupling_type in model_params.keys():
    # the function fills `submission` in place and returns nothing
    train_and_predict_for_one_coupling_type(
        coupling_type, submission, n_atoms=model_params[coupling_type])

Checking cross-validation scores for each type (commented out, since `cv_scores` is not collected in this version):

In [None]:
# pd.DataFrame({'type': list(cv_scores.keys()), 'cv_score': list(cv_scores.values())})

And the mean CV score (also commented out):

In [None]:
#np.mean(list(cv_scores.values()))
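
If you want to bring the score bookkeeping back, here is a minimal sketch of how per-type scores and their mean could be computed from held-out predictions. It uses the same `np.log(mean_absolute_error(...))` quantity as the commented-out cells earlier, written in plain NumPy. The helper `log_mae_by_type` and the numbers are made up for illustration.

```python
import numpy as np

def log_mae_by_type(y_true, y_pred, types):
    """Return ({type: log(MAE)}, mean of those values)."""
    y_true = np.asarray(y_true, dtype='float64')
    y_pred = np.asarray(y_pred, dtype='float64')
    types = np.asarray(types)
    scores = {}
    for t in np.unique(types):
        mask = types == t
        mae = np.abs(y_true[mask] - y_pred[mask]).mean()
        scores[str(t)] = float(np.log(mae))
    return scores, float(np.mean(list(scores.values())))

# Made-up validation targets and predictions for two coupling types
y_true = [80.0, 90.0, 3.0, 5.0]
y_pred = [81.0, 89.0, 3.5, 4.5]
types = ['1JHC', '1JHC', '3JHH', '3JHH']

cv_scores, cv_mean = log_mae_by_type(y_true, y_pred, types)
print(cv_scores, cv_mean)
```

Because of the logarithm, lower (more negative) is better, and each type contributes equally to the mean regardless of how many rows it has.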

Sanity check that every row has been filled with a prediction (the shape below should show zero rows):

In [None]:
submission[submission['scalar_coupling_constant'] == 0].shape

In [None]:
submission.head(10)

In [None]:
submission.to_csv(f'{SUBMISSIONS_PATH}/submission.csv')

## Room for improvement

There are many ways to improve the score of this kernel:
* Tune the model hyperparameters for each coupling type (here one `best_parameters` set is shared by all types)
* Tune the number of atoms for each type
* Try adding other features
* Play with categorical features for atom types (one-hot encoding, CatBoost?)
* Try other tree libraries

Also, this representation fails badly on the `*JHC` coupling types. The main reason is that the 3rd and 4th atoms are usually located at the same distance, so the representation starts "jittering", randomly picking one of them. As a result, two similar configurations get different representations, depending on whether the distances come out in 3/4 or 4/3 order.
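
The jitter is easy to reproduce on made-up numbers: when two neighbours are almost equidistant from the pair center, a tiny perturbation flips which one becomes atom 3 and which becomes atom 4. One possible mitigation (an assumption on my side, not something this kernel implements or has tested) is to make the two slots order-independent, for example by sorting their features. The helper `core_order` and all values below are hypothetical.

```python
import numpy as np

def core_order(dist_to_center):
    """Indices of the two neighbours closest to the pair center, nearest first."""
    return np.argsort(dist_to_center, kind='stable')[:2]

# Two nearly equidistant neighbours (made-up distances), plus a distant one
config_a = np.array([1.0900, 1.0901, 2.5])
config_b = np.array([1.0901, 1.0900, 2.5])  # same geometry up to tiny noise

order_a = core_order(config_a)
order_b = core_order(config_b)
print(order_a, order_b)  # which neighbour fills the 3rd and 4th slot

# Hypothetical per-neighbour feature, e.g. distance to core atom 0
feat = np.array([1.50, 2.10, 3.0])

# Features as they land in the 3rd/4th slots for each configuration
slots_a = feat[order_a]
slots_b = feat[order_b]

# Sorting the two slots removes the dependence on the pick order,
# at the cost of no longer knowing which value belongs to which atom
sorted_a = np.sort(slots_a)
sorted_b = np.sort(slots_b)
print(slots_a, slots_b, sorted_a, sorted_b)
```

Whether this kind of symmetrization actually helps on `*JHC` would need to be checked on validation data.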

The biggest challenge would be to implement a hand-crafted KNN in some compiled language (Rust, C++, C).
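
As a starting point, here is a minimal brute-force KNN regressor in NumPy. It only shows the logic (chunked distance computation, then the mean of the k nearest targets) and is far too slow and memory-hungry for the full dataset; the function `knn_predict`, its parameters and the toy data are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3, chunk_size=1024):
    """Brute-force KNN regression: mean target of the k nearest training rows.

    Queries are processed in chunks so the distance matrix stays bounded.
    """
    X_train = np.asarray(X_train, dtype='float32')
    y_train = np.asarray(y_train, dtype='float32')
    X_query = np.asarray(X_query, dtype='float32')
    preds = np.empty(len(X_query), dtype='float32')
    for start in range(0, len(X_query), chunk_size):
        chunk = X_query[start:start + chunk_size]
        # Squared Euclidean distances, shape (chunk rows, training rows)
        d2 = ((chunk[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
        # Indices of the k smallest distances per query row (in no particular order)
        nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
        preds[start:start + chunk_size] = y_train[nearest].mean(axis=1)
    return preds

# Made-up 1-D feature where the target equals the feature value
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
X_query = np.array([[1.0], [11.0]])

pred = knn_predict(X_train, y_train, X_query, k=3)
print(pred)
```

A compiled implementation would mostly replace the inner distance computation and neighbour selection; the overall structure can stay the same.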

It would be cool to see this kernel forked, with some of these issues addressed and a higher LB score.